A Modified Energy Statistic for Unsupervised Anomaly Detection

Rupam Mukherjee

GE Research, Bangalore, Karnataka, 560066, India [email protected], [email protected]

ABSTRACT value. Anomaly is flagged when the exceeds a threshold and an alert is generated, thereby preventing sud- For prognostics in industrial applications, the degree of an- den unplanned failures. Although as is captured in (Goldstein omaly of a test point from a baseline cluster is estimated using & Uchida, 2016), anomaly may be both supervised and un- a metric. Among different statistical dis- supervised in nature depending on the availability of labelled tance metrics, energy distance is an interesting concept based ground-truth data, in this paper anomaly detection is consid- on Newton’s Law of Gravitation, promising simpler comput- ered to be the unsupervised version which is more convenient ation than classical distance metrics. In this paper, we re- and more practical in many industrial systems. view the state of the art formulations of energy distance and point out several reasons why they are not directly applica- Choice of the statistical distance formulation has an important ble to the anomaly-detection problem. Thereby, we propose impact on the effectiveness of RUL estimation. In literature, a new energy-based metric called the -statistic which ad- statistical are described to be of two types as fol- P dresses these issues, is applicable to anomaly detection and lows. retains the computational simplicity of the energy distance. We also demonstrate its effectiveness on a real-life data-set. 1. Divergence measures: These estimate the distance (or, similarity) between probability distributions. Some com- 1. INTRODUCTION mon divergence measures are Kullback-Leibler divergence, Jensen-Shannon divergence and Hellinger distance. Prognostics is a critical requirement in many industrial fields today owing to the potential of cost savings and operational 2. Distance measures: These measures estimate the dis- efficiency entitlement through elimination of unscheduled fail- tance between a single point and a distribution by com- ures and shut-downs. One of the many important modules paring it with a sample drawn from the distribution. The used in a typical Prognostics and Health Management (PHM) most well-known measure in this category is Mahalanobis application is a health indicator module for the system un- distance (Mahalanobis, 1936). A few other distance mea- der consideration which not only estimates a health metric sures are Bhattacharya distance (Bhattacharyya, 1943) but also tracks the evolution of this metric with time. The and the energy distance. health indicator is estimated from a collection of sensor read- ings at different timestamps. Generally, one of many statis- For the industrial anomaly detection problem, it is mostly the tical distances of a test-point from a baseline cluster may be second category which is more significant because most of used as this health indicator. The data-points can be multi- the times, we end up comparing a single point with a baseline dimensional resulting from multiple sensor outputs that can cluster. be tapped from the system under monitoring. This statisti- In this paper, we are interested in the class of distances called cal distance or some function of it may be used as the health energy distance. These are interesting because they are based metric or the health indicator. on the notion of Newton’s law of gravitational energy and Time trending of the health metric as a function of histori- considers statistical observations as celestial objects having cal and future forecast usage patterns give us a visibility into gravitational pull between each other. A distance metric called the Remaining Useful Life (RUL) which is the ultimate tar- -statistic may be written to represent the energy distance be- E get of PHM. However, even before hitting the ultimate target tween distributions.The -statistic can be used to test the sta- E of RUL, doing anomaly detection as an intermediate step has tistical hypothesis of equality of two distributions. This con- cept was proposed and developed in (Szekely,´ Rizzo, et al., Rupam Mukherjee et al. This is an open-access article distributed under the 2004; Szekely´ & Rizzo, 2013; Szekely & Rizzo, 2017) where terms of the Creative Commons Attribution 3.0 United States License, which it was shown the -statistic is more general and powerful than permits unrestricted use, distribution, and reproduction in any medium, pro- E vided the original author and source are credited. many classical statistics. https://doi.org/10.36001/IJPHM.2021.v12i1.1323

International Journal of Prognostics and Health Management, ISSN2153-2648, 2021 000 1 INTERNATIONAL JOURNAL OF PROGNOSTICS AND HEALTH MANAGEMENT

Application of energy distances to single and multi sample tively. Let X = X ,X ,...,X be a random sample of S { 1 2 n1 } goodness of fit have been explored in (Rizzo, 2002b, 2002a; size n1 drawn from the density function fX . Similarly, YS = Szekely´ et al., 2004; Baringhaus & Franz, 2004; Szekely´ & Y ,Y ,...,Y is a random sample of size n drawn from { 1 2 n2 } 2 Rizzo, 2005; Rizzo, 2009; Yang, 2012). Several other ap- the density function f . The energy distance (X, Y ) be- Y E plications have been shown in (Szekely, Rizzo, et al., 2005; tween the distributions fX and fY has been defined in (Szekely´ Szekely´ & Rizzo, 2009; Feuerverger, 1993; Matteson & James, et al., 2004; Szekely,´ 1989, 2002). It is a population statistic 2014; Kim, Marzban, Percival, & Stuetzle, 2009). for the pair of random variables X and Y . In this paper, we are interested in exploring it further be- Now, a sample statistic may be used as an estimate for En1n2 cause of its ability to work with Euclidean distances between the population statistic (X, Y ). It may be calculated from E data-points rather than with the data-points themselves. This the pair of samples (XS,YS) as defined in (Szekely´ et al., ability leads to reduced computational complexity compared 2004; Szekely,´ 1989, 2002) and has the following form. to the Mahalanobis distance and makes the -statistic poten- E tially advantageous for memory and computation challenged n n systems. n n 2 1 2 (X ,Y ,Y ,...Y )= 1 2 En1n2 S 1 2 n2 n + n n n However, despite the aforementioned advantages, we found 1 2 " 1 2 i=1 m=1 X X some problems with the -statistic when applied to anom- n1 n1 E 1 1 aly detection. These problems arise from the fact that the - X Y X X E || i m||2 n2 || i j||2 n2 ⇥ statistic has been developed primarily for comparing equality 1 i=1 j=1 2 X X between two distributions whereas for anomaly detection we n2 n2 need to compare a single test point with a distribution (base- Y Y . (1) || l m||2 line). In this paper, we analyse these problems and propose l=1 m=1 # X X a new metric called the modified energy distance ( -statistic) P For a d-dimensional space, each observation Xi and Yj is a based on the notion of Newton’s law which addresses these d 1 array. Mathematically, the entire sample may be rep- problems and is more suitable to be used in an anomaly de- ⇥ resented as a d n matrix for X and d n matrix for tection problem. ⇥ 1 S ⇥ 2 YS. Here, we would like to mention that detailed comparison of In industrial systems, usually after a baseline data-set is ac- the proposed metric against all other classical distance met- cumulated, future test-points are compared against it and the rics is a task we would attempt in future work. In this work, baseline itself is not frequently replaced except when there is we focus on establishing the improvements over the -statistic. E a need to re-establish the baseline. Thus, although the base- This paper is laid out as follows. In Section 2, we briefly line cluster varies depending on which time instant it was ac- review the -statistic and mention the reasons why it is com- quired from sensor readings and hence has a random nature, E putationally simpler than other techniques. In Section 3, we it is considered a constant in the anomaly analysis once it is explain the issues which make the -statistic less suitable captured and saved. Hence, in subsequent analysis, it is con- E for anomaly detection and in Section 4, we propose the - sidered invariant. Going forward, in this paper, we will be P statistic and through the use of synthetic data, show that it referring to the sample XS as the baseline cluster (or base- addresses these issues. In Sections 5.1 and 5.2, we derive a line, for simplicity). method for estimating probability from the -statistic using a P For the anomaly detection problem, we want to compare a chosen parametric distribution. In Section 5.4, we show how single point against a baseline cluster having many members. the performance of this new metric compares in terms of dis- Hence, in (1), we assume that Y has sample-size 1 and con- crimination performance against that of the Mahalanobis Dis- S tains only one member Y . We discard the notations n and tance, a classical multi-dimensional distance metric. In Sec- 1 1 n as n =1. We represent n by n going forward as the tion 6, we demonstrate a simple application of the -statistic 2 2 1 P subscript in n is no longer needed. For sufficiently large sam- on real-life data. In Section 7, we show how the training time ple size for X , n n +1. Also, we use y in place of Y of the proposed metric compares against that of Mahalanobis S 1 ⇡ 1 1 in future analysis since the subscript is not crucial for clarity Distance for an incremental baseline update approach. Fi- in representing a single member set Y . nally, in Section 8, we summarize the observations. S

2. REVIEW OF ENERGY DISTANCE AND ITS COMPUTA- TIONAL SIMPLICITY Let X and Y be two independent real-valued random vari- ables with probability density functions fX and fY respec-

2 INTERNATIONAL JOURNAL OF PROGNOSTICS AND HEALTH MANAGEMENT

From (1), we write a simplified form for as cluster based on when that data set was captured in time, En En1n2 but once they are collected for any machinery, they are usu- (X ,Y )= (X ,y) En S 1 En S ally not changed during the comparison or anomaly detection 2 n phase. Thus, for all practical purposes, they are same as a set X y ⇡ n || i ||2 of multi-dimensional real-valued points. i=1 X 1 n n From the way anomaly detection is usually implemented in Xi Xj . (2) industrial systems, we intuitively desire the following condi- n2 || ||2 i=1 j=1 tions from any acceptable anomaly measure (say ⇤) which X X En may be used in a practical anomaly detection system. Reference (Szekely´ et al., 2004) mentions that for (X ,y) En S The desired characteristics with respect to cluster size and to be a valid distance metric, there must exist a c for every ↵ distance from centroid along with their mathematical expres- ↵ (0, 1) such that 2 sions are as follows.

P (Z>c↵)=↵. (3) Here, Z is a random variable. A random sample z drawn from 1. Relationship with cluster size the distribution of Z is a function of the baseline XS and a random sample x drawn from the distribution fX defined earlier. It takes a form (a) For the test-point y situated at a given distance from the centroid of the baseline cluster and outside the z = (X ,x). (4) En S baseline, it should appear less anomalous if it is closer to the outer boundary of the baseline cluster. As mentioned in Section 1 and in (Szekely´ et al., 2004; Szekely´ Given the saved baseline X , we define a random & Rizzo, 2013; Szekely & Rizzo, 2017), the -statistic is sim- S E variable R such that pler and more general than classical distance metrics. While writing this paper, on comparing (1) with the classical dis- R = X XS 2 and (5) tance metrics described in (Statistical distance, n.d.), the ad- || || vantages which specifically interested us are R has a probability density function fR whose model parameters may be estimated from the sample X . 1. The Euclidean distances in (1) can be computed in paral- S Here, X is the centroid of the baseline cluster which lel and hence the computation time of -statistic does not S E is nothing but the sample mean. scale with data size if implemented in a parallel manner. For the test-point y, let r = y X . Since, f || S||2 R 2. Covariance matrices are not required in (1). For high- is a function of the baseline cluster XS and from (5) dimensional data, this can take up a significant amount the domain of R is the set of Euclidean distances of of computation and memory all possible samples of X from XS, the probability density of any point with distance r from X may 3. Matrix inversion is not needed. In many classical meth- S be expressed as f (r, X ). ods, inversion of covariance matrices is required. Again, R S for high dimensional data, this can involve significantly This problem may now be cast in the form of a hy- heavy computation. pothesis test as follows.

Null hypothesis(H0): Random samples drawn from 3. SOME GAPS IDENTIFIED IN -STATISTIC E fR will be more extreme than r. y is flagged as an In this section, we analyse the general requirements from any anomaly with respect to XS if the null hypothesis is anomaly detection metric and point out some weaknesses in rejected. the -statistic which prevent it from satisfying some of these E Rejection criterion: The null hypothesis is rejected conditions. Going forward, these weaknesses form the basis if of the proposed innovation in this paper.

pR(XS)=P ( X XS 2 >r)=P (R>r) 3.1. Requirements from an Acceptable Metric for Anom- 1 aly Detection = fR(x, XS)dx < pth (6) Zr In anomaly detection, we usually obtain a single test sam- which is a pre-determined threshold. ple y and compare it against the baseline (say XS) which 0 is a cluster of samples drawn from the distribution fX . Al- Let there be a second cluster XS() which is formed though there is an element of randomness in the baseline by scaling the baseline with respect to its centroid

3 INTERNATIONAL JOURNAL OF PROGNOSTICS AND HEALTH MANAGEMENT

0 such that its kth element XS()k may be written as for data of any dimension.

X 0 () = (X X )+X where (7) S k k S S 3.2. Synthetic Dataset >0. If >1, the number of points, relative ori- In order to examine how n performs with respect to the de- E entation of points and cluster shape are maintained sired characteristics stated in Section 3.1, we consider a d- 0 unchanged between XS and XS() with the second dimensional random variable X which is distributed as a uni- cluster occupying a larger spatial volume. Hence, form ball. We sample from this distribution to create a syn- the test-point will appear closer to the outer bound- thetic data-set following the method described in (Harman & 0 ary of XS() than to that of XS. In this scenario, Lacko, 2010). if ⇤ is a faithful indicator of anomaly, it should ap- En X can be written as pear to be less in the former case. This behaviour may be expressed mathematically as X = r (Z / Z ) Z1/d where (14) b 1 || 1||2 2 0 ⇤(X (),y) < ⇤(XS,y)if>1. (8) Z is sampled from a multi-variate uncorrelated standard nor- En S En 1 mal distribution, Z U(0, 1) and r is the desired radius of 2 ⇠ b Now, we saw that a larger radius of baseline is sup- the ball. posed to make y appear less anomalous. Also, larger For the purpose of this study, we restrict the dimension of X the anomaly, less should be the value of the inte- in (14) to 2 to keep the analysis and visualization simple. If gral in (6) since larger anomaly would mean less net the two dimensions of X are written as X(1) and X(2), they probability of having points more extreme. Hence, may be expressed in a simplified form as a function of two 0 random variables R and ✓ in the following way. pr(XS()) >pr(XS)if>1. (9) X(1) = R cos ✓ and X(2) = R sin ✓ where (b) For an infinitesimally small baseline cluster, any test- R (0,r )1/2 and ✓ U(0, 2⇡) and (15) point would appear to have a very large anomaly, ⇠ b ⇠

irrespective of the value of the Euclidean distance rb is the chosen radius of the example cluster. It may be from the centroid. This is because the distance al- shown from (15) that (✓,R), the probability density func- ways appears large relative to the cluster size. 2 8 tion f✓,R =1/(⇡rb ). Thus, the distribution is uniform in Hence, nature. We also consider a fixed test-observation with loca- 0 tion ✓ =0and R =1. lim n⇤(XS(),y)= (10) 0 E 1 ! With the anomaly metric increasing asymptotically, 3.3. Shortcomings of -statistic E we will have an asymptotic reduction of the integral We sweep r in (14) from 0 to 5 thereby progressively get- value in (6). b ting a different-sized baseline and thereby changing in (7). This implies different probabilities of the fixed test-point be- 2. Relationship with distance from cluster centroid longing to this cluster. We then evaluate the simplified in En (2). (a) If the test-point is further away from the centroid of the baseline cluster, it should appear more anoma- Figures 1 and 2 show the baselines and test-points for two lous and hence, the anomaly metric should increase. different values of r along with computed values of for b En Thus, if there are two test-points y1 and y2, both the cases. Figure 3 shows a continuous plot of how En varies with rb. n⇤(XS,y1) > n⇤(XS,y2)if (11) E E From these plots, the -statistic can be shown to violate the y1 XS > y2 XS . (12) E 2 2 desired characteristics mentioned in Section 3.1 in the follow- ing manner. (b) If the test-point is at the center, the anomaly metric should be close to zero. Hence, 1. Limiting condition behaviour: The -statistic violates E ⇤(X ,y) 0if y X 0. (13) the limiting conditions in the following manners. En S ! S 2 ! It should be noted that (7) and (11) indicate that the ideal (a) It has been shown in Appendix A (59) that the met- anomaly metric should follow a monotonic behaviour with ric n proposed in (2) has the following property. E respect to the baseline size and also distance of the test-point 0 lim n(XS(),y)=2 XS y = . (16) from the baseline cluster. These ideal conditions hold good 0 E 2 6 1 ! 4 INTERNATIONAL JOURNAL OF PROGNOSTICS AND HEALTH MANAGEMENT

flection point beyond which it increases instead of reducing and reaches significantly high values in- stead of very small ones. For the particular case in Figure 2, the value of 2.363 is not appropriate. It En should be very close to zero. Also, the should be En higher in Figure 3 compared to Figure 2. However, the opposite is actually observed.

(c) As stated above, in Section 3.1, ⇤ =0if and Figure 1. Baseline (blue) and test-point (red) with rb =0.5 En only if y = X for a viable distance-metric for and n =1.6107 c E anomaly detection. But this condition is violated by the -statistic in its present form because in (1), E =0if and only if the distributions X and En1n2 Y are identical i.e. X = Y . However the single test-point y considered here forms a single member distribution which cannot be identical to X under any circumstance.

2. Monotonic behaviour: From Figure 3, it is seen that the value of goes through an inflection point as r varies En instead of varying monotonically as required in Section Figure 2. Baseline (blue) and test-point (red) with rb =5.0 3.1 and =2.363 En The issues with -statistic raised in this section have been fur- E This violates the condition required for the ideal ther addressed and a modified algorithm suggested in Section metric ⇤ stated in (10). Thus, in its present 4. Since the target in this paper is anomaly detection, we have En En form cannot function as the ideal metric. also extended the analysis to reach a closed form analytical expression for the probability density function (PDF) based From Figure 3, it is clear that when the baseline on which the p-value of the test-point may be estimated. cluster becomes progressively smaller and close to 0, the variation of n is not asymptotic to infinity E ODIFIED NERGY ISTANCE but hits a finite value equal to the distance of the 4. M E D test-point from the baseline centroid. The reason why energy distance is computationally simple is that it works with distributions of Euclidean distances whereas other distance metrics first fit a multivariate distribution and (b) Appendix A (62) also proves the following. then work our probabilities from that. The -statistic is one of E 0 the first energy-based metric proposed for anomaly detection. lim n(XS(),y)= . (17) E 1 !1 However, we saw in Section 3.3 that it has a few weaknesses. In this section, we borrow the concept of working with Eu- From Figure 3, it is clear that when baseline cluster clidean distances from a study of the -statistic and propose increases in radius, the value of does not always E En an alternative and slightly modified metric based on them. reduce tending towards 0. It goes through an in- The potential energy of a configuration consisting of two par- ticles having masses m1 and m2 respectively and having po- sitions X1 and X2 respectively is the work that needs to be done to move one particle from infinity to its current position against the gravitational force exerted by the other particle already in place. It can be written as Gm m PE(X ,X )= 1 2 . (18) 1 2 X X || 1 2||2 We define the incremental potential energy of y to be the Figure 3. Variation of n with baseline cluster radius (rb), work that needs to be done to move the test-point y from infin- E given a fixed test-point. ity to its present position against the collective force exerted

5 INTERNATIONAL JOURNAL OF PROGNOSTICS AND HEALTH MANAGEMENT

by the baseline sample cluster XS. The incremental potential Here, ✏ is an infinitesimal positive number on the real line. energy may be written as The closer ✏ is to 0, the better is the match between (23) and

n (24). A guideline may be to use P (X ,y)= PE(X ,y) n S k 1 k=1 0 <✏<1e 04 X X . (25) X n k S 2 n k=1 C X = , (19) Xk y The reader may note that this is a guideline and one may use k=1 || ||2 X smaller values of ✏ if they want. However, larger values of where C is a positive constant if we assume that each data- ✏ may not be advisable. We would like the reader to know point in the baseline sample cluster XS as well as the test- that a rigorous evaluation of the impact of the choice of this point y represent particles of equal mass. value on the results needs to be conducted as future work but The net potential energy for assembling the baseline cluster at present, we do not see this as a big risk to this method. XS may be written as We now examine the limiting conditions or 0 and ! ! n which have been discussed with respect to in (16) and 1 En P (XS)= PE(Xk,Xj) (20) (17) respectively. Using the definition of as given in (7), j=2 k

En(XS,y) n(XS,y)= Figure 4. Semi-log plot for variation of n with baseline clus- P En(XS) ter size, given a fixed test-point. P n 1 n 1 = (23) X X X y j=2 k j 2 k 2 X Xk1. 8 We now want to study the impact of r on n for any given D. P

6 INTERNATIONAL JOURNAL OF PROGNOSTICS AND HEALTH MANAGEMENT

Assuming X to be a fixed baseline against which test-points per (29), after a small initial value of r, increases mono- S Pn are compared, we can consider D and XS to be constants. tonically with increase of distance of the test-point from the Hence, we have a simplified functional relationship as fol- baseline centroid. This is consistent with the fact that farther lows. the test point, the more anomalous it is. n(r)= n(XS,r,D). (28) P P We also observe from Figure 5 that the rate of increase of n P Thus, if we consider the space as a high dimensional ball, we is less when the baseline cluster size is larger. This is because are studying the variation of n for points at different radial for any fixed test-point, it will appear less anomalous when P distances along a given radial vector direction. the baseline cluster has a larger radius. In Appendix B, we prove the following regarding the geomet- All of these observations are consistent with being a valid Pn ric properties for the function n() . anomaly statistic. All the discrepancies observed in Figure P 3 are addressed and n satisfies the requirements stated in d n P a finite r⇤ such that r>r⇤, P > 0, (29) Section 3.1. The observations are further validated in Figures dr 9 8 6 and 7 where the -statistic is used to evaluate the cases d n P lim P = constant from (71), (30) earlier analysed in Figures 3 and 2. They show a much lower r dr !1 -statistic value for the case in Figure 7 where the test-point (r) is finite r< , from (72) and (31) P Pn 8 1 is very close to the centroid X. (r) 0 r from (73). (32) Pn 8 Now, we examine the impact of increasing the spatial volume of the baseline cluster. Following the proposition in (7), we examine the impact of increasing . From (23), using the definition of in (7), y 1 (X ,y)= X , + X 1 Pn S1 Pn S S ✓ ✓ ◆◆ lim n(XS1,y)= n(XS, XS) < (33) ) P P 1 Figure 6. Baseline (blue) and test-point (red) with rb =0.5 !1 and n = 1636.5248 if all the dimensions of y are finite. P From (33), for large spatial volume of the baseline i.e., for large values of , (r) will be finite unlike in (17) and in Pn Figure 3 where the -statistic is seen to increase with increas- E ing after a certain value and tends to infinity. Also, in Figure 4, we see that as the baseline cluster becomes progressively larger thereby indicating a reduction in anomaly level, we see a reduction of to a value of 0 instead of the inflection seen Pn in Figure 3. The variation of with distance of test-point from base- Pn line centroid, for a few values of the baseline radius is shown Figure 7. Baseline (blue) and test-point (red) with rb =5.0 and = 455.1863 in Figure 5. It serves as a demonstration of the fact that as Pn The observations in this section establish that the variation of -statistic with baseline size and test-point distance is in P accordance with what is expected intuitively from an anomaly metric. However, to fully establish its usefulness as a valid anomaly metric, we should be able to compute a probability density function (PDF) from this metric.

5. COMPUTATION OF PROBABILITY

5.1. Existence of a Valid Probability Measure using n P Figure 5. Semi-log plot for variation of n with distance from Let r represent the distance of the test-point from the centroid P centroid, for different baseline sizes. of the baseline cluster. Also, in the space of real numbers, let

7 INTERNATIONAL JOURNAL OF PROGNOSTICS AND HEALTH MANAGEMENT

R represent a single-dimensional random variable having a X and Y are equipped with -algebras ⌃ and T respec- density function f from which the r-values are assumed to tively. Also, let be a function from X to Y . Since R Pn be sampled. R has a range defined by (r) is strictly increasing for r>r⇤⇤, max(X ) de- Pn ↵ fined in (41) cannot be greater than r⇤⇤ with c c . ↵  max Range(R)=[0, ). (34) Also, by definition r 0. 1 Since R is a random variable, Z = n(R) is also a ran- Hence, if we consider any open set of consisting of val- P C 1 dom variable as the function of a random variable is itself a ues from [0,c ], from (41), its inverse values ( ) max Pn C random variable. Let fZ be the density function for Z and will map to the region [0,r⇤⇤]. None of it would lie out- z Range(Z). side of this. Mathematically, this may be written as 2 In Appendix B, it has been shown that n() is a continuous 1 P n ( )= x n(x)= ⌃, for T function. Also, from (29), (30) and (32), P C { |P C}2 C2 : X Y is a measurable function. (42) )Pn ! Range(Z) =[0, ) (35) 1 Thus, n(R) is a valid choice for a random variable. We P Thus, the domain and the co-domain of () are both [0, ). can show that c↵, a unique values of ↵1 Pn 1 8 9 As mentioned in (Szekely´ et al., 2004) and (3), for (r) to P ( n(R) [c↵,cmax]=↵1 r⇤⇤ P (Z>c↵)=↵, c↵ Range(Z) and (36) Pn 8 2 and hence, ↵ (0, 1), (37) 2 1 P ( (R) >c )=P (R> (c )) = P (R>r)(say). ↵ being unique for every choice of c . (36) and (37) are char- Pn r Pn r ↵ (44) acteristics necessary for the existence of the cumulative den- From (43) and (44), for r [0,r⇤⇤], sity function (CDF) corresponding to the probability density 2 function f . Z P ( (R) >c )=P ( (R) [c ,c ]) Pn ↵ Pn 2 ↵ max From (29) and (31), r⇤⇤ r⇤ such that 9 +P ( n(R) >cmax) P 1 (r) > (r⇤⇤), r>r⇤⇤ where (38) = ↵1 + P R> n (cmax) Pn Pn 8 P = ↵1 + P (R>r⇤⇤) (r⇤⇤) = max ( (r) r [0,r⇤⇤] ) , P ( n(R) >c↵)=↵ (0, 1) (say). (45) Pn {Pn | 2 } ) P 2 = cmax (say), a finite quantity. (39) In (45), ↵ is uniquely defined for any chosen value of c↵, since ↵ is unique as shown in (43). These relations are true because r and r⇤⇤ are all positive 1 numbers. Also, since probability is a non-negative quantity, it may be shown that

min ( (r) r [0,r⇤⇤] ) 0. (40) 2. Case 2: c↵ >cmax. {Pn | 2 } Since n(r) is monotonically increasing for r>r⇤⇤, 1 P We now, analyse the behaviour of (r) in two zones, i.e., () is well-defined and single-valued. Hence, for any Pn Pn c c and c >c . These are analysed as follows. c↵, ↵  max ↵ max 1 P ( n(R) >c↵)=P (R> n (c↵)) 1. Case 1: c↵ cmax. P P  = P (R>r) (say). (46) For any c↵ [0,cmax], 2 P ( n(R) >c↵)=↵ (0, 1), (say). (47) 1 ) P 2 (c )=X = r (r)=c . (41) Pn ↵ ↵ { |Pn ↵} It must be noted that (46) could be written only because 1 From (29), (r) need not be strictly increasing or strictly n () is well-defined in the chosen regime which is r> Pn P decreasing in nature if r r⇤⇤. Hence the set X may r⇤⇤. In (47), ↵ is unique for every chosen value of c↵  ↵ have more than one member as defined in (41). Since since the CDF of R is well-defined. c↵

8 INTERNATIONAL JOURNAL OF PROGNOSTICS AND HEALTH MANAGEMENT

5.2. Calculating p-value thus less chance of rejecting the null hypothesis. This is be- cause the larger cluster would make the test-point appear less In order to calculate the p-value, we need to first choose a anomalous. suitable form for the PDF f(x) that best represents the data being analysed. Different parametric distributions may be chosen for the PDF f(x). Since (x) > 0 x from (32), we Pn 8 assume that it is sampled from a standard half-normal distri- bution with mean µ =0and standard deviation . We can write a standard half-normal variant of as Pn p = / and p = Z where Z N(0, 1). (48) n Pn n | | ⇠ Here, can be estimated as

1 n = ( (X X ,X ))2. (49) Figure 8. Semi-log plot for variation of p(y) calculated in vn Pn S { i} i u i=1 (50) with distance from the baseline centroid, for different u X t baseline sizes. Here, it must be noted that since XS is defined as a set of ran- dom samples, X X represents the set subtraction op- S { i} eration resulting in the complementary set obtained by elimi- nating Xi from XS. 5.3. Algorithm Steps and Computation Requirement The p-value for the test-point y is We can now summarize the steps involved in the computation

pn of pd(y) in Algorithm 1. t2/2 p(y)=1 2 e dt Z0 pn Algorithm 1: Computation of p-value from n =1 erf . (50) P p2 ✓ ◆ Input: Baseline cluster X Output: p(y) An efficient computation of p(y) would need a compact ap- Data: Test-point y Initialization: flag =0 proximation for the error function erf(x). For this purpose, /* If baseline has not been characterized, we use the Burmann¨ series expansion which converges quickly enter baselining phase */ for real values of x (Schopf¨ & Supancic, 2014). As explained 1 if flag == 0 then n in (Dixon, 1901; Schopf¨ & Supancic, 2014), the error func- 2 Xc =1/n k=1 Xk tion may be approximated as 3 A =0for j =2:n do 4 for k

9 INTERNATIONAL JOURNAL OF PROGNOSTICS AND HEALTH MANAGEMENT

5.4. Comparison of -statistic with Mahalanobis Distance P Since in this paper we are trying to formulate and establish a new statistical metric for anomaly detection, it would be inter- esting to see how it performs with respect to the Mahalanobis Distance (MD) which is one of the established classical tech- niques in this domain. The questions we are trying to answer at this stage are the following. 1. Is there a one-to-one correspondence between MD and pn ? Or, does every value of pn corresponds to one and Figure 10. Relationship between pn and M(Y,X) for dif- only one value of MD ? ferent values of dimension d and a given baseline radius of 0.5. 2. Is it possible to estimate MD from pn for different di- mensions of the data-points and different physical sizes of the baseline clusters ? By answering this question, we old in M(y, X) can be translated to a corresponding thresh- are trying to figure out if tribal knowledge about what old in pn by utilizing the linearity of the curves in Figure constitutes anomaly in terms of MD for a particular ap- 10. This enhances the confidence that an existing threshold- plication can be translated to the domain of the -statistic. P based alerting strategy that depends on values of Mahalanobis Let us represent the Mahalanobis Distance (MD) (Mahalanobis, Distance may be translated to an equivalent -statistic-based P 1936) of test-point y with respect to baseline cluster sampled one. Hence, the pn-value in (48) and the corresponding p- from the random variable X as M(y, X). Thus, value calculated in (50) can be used effectively for anomaly detection. M(y, X)= (y X)T S 1(y X)where (54) q 6. PERFORMANCE ON REAL-LIFE DATA the d dimensions of y and X are represented along the rows For verifying the performance of the -statistic, we take the and S is the covariance matrix of the cluster sampled from X. P Breast Cancer Data-set available publicly in the UCI Ma- We now examine the relationship between pn and M(y, X) in an empirical fashion by varying different parameters of the chine Learning Repository (Dua & Graff, 2017). We use synthetic data-set designed in (14). the version embedded inside Python’s scikit-learn package and load it using the native python command. For details on Similar to Figure 8, we sweep the distance of test-point from how to load and use the data-set, please check scikit-learn baseline centroid and calculate M(y, X) and pn for each po- documentation (Scikit-learn breast cancer data-set, n.d.). sition. Finally, the MD values corresponding to each value of pn are plotted for comparison in Figure 9 for two different 6.1. Data Description values of baseline radius. Also, we assume different values of dimension d in (14) and for each value of d, we define a This data-set consists of features computed from a set of fine baseline cluster X and test-point y as in (14). For each d needle aspirate (FNA) images of breast masses from a set of and a given baseline radius, Figure 10 shows the relationship patients along with the ground-truth of the diagnosis i.e. ma- lignant or benign. The characteristics of cell nuclei in the between pn and M(y, X). image are described in this data-set. There are a total of 30 features per image with more detailed descriptions available in (Dua & Graff, 2017). The feature matrix is shaped 357 30 ⇥ and in the label vector, 0 represents malignant and 1 repre- sents benign. In order to visualize this high dimensional data, we use t- distributed Stochastic Neighbourhood (tSNE) (Maaten & Hin- ton, 2008) to calculate a 2D embedding of the data so that it can be plotted as a 2D scatter plot and examined manually for proximity between data-points and clusters. Before ap- Figure 9. Relationship between pn and M(Y,X) for different plying tSNE, each of the feature columns are normalized to values of baseline radius. extend from 0 to 1. Figure 11 represents the two clusters of data available in this data-set. The subsequent analysis has From Figures 9 and 10, pn is linearly related to the Ma- now been performed on this 2-D embedded version of this halanobis Distance and the relationship holds good for dif- data-set so that results can be visually related to the cluster ferent dimensions of the problem-space. Thus, any thresh- neighbourhood behaviour.

10 INTERNATIONAL JOURNAL OF PROGNOSTICS AND HEALTH MANAGEMENT

lapping indicating that the -statistic is as effective as the P Mahalanobis Distance in classification tasks as visible from this data.

Figure 11. tSNE embedding of the malignant and benign clusters in the breast cancer data-set from UCI ML repository

6.2. Analysis with -statistic P 40 % Figure 14. ROC curve for -statistic and Mahalanobis Dis- We now consider of the benign cluster as the training tance P baseline and the remaining points in the benign and malignant groups as test set. We now use the steps summarized in Algo- rithm 1 to estimate the p-values for all the data-points in both benign and malignant clusters in the test set with respect to 7. SOME OBSERVATIONS ON COMPUTATION TIME the training baseline. Figures 12 and 13 show the -statistic 7.1. Test Setup and Limitations in Field P and Mahalanobis Distance respectively, for all these points. From these, we see that the capability for discrimination be- Since one of the premises on which the choice of energy- tween the baseline and anomalous data are similar for both based distance was made was that of computation simplicity, training in- the approaches. we analyse the computation requirement for and ference separately in this section for both -statistic and MD. P We notice from (24) and (54) that the bulk of the computation happens during training time when model parameters are be- ing computed from the baseline cluster XS. However, the inference phase has less computation to perform and is likely to have similar computation time for both -statistic and MD. P We henceforth focus our attention on the training time. Since many field deployed Data Acquisition (DAQ) systems like edge devices and protection relays are likely to be de- ployed without a live internet connection, having the training Figure 12. -statistic for benign and malignant data-points. P done in a cloud-based infrastructure and the trained model transferred to these devices may not be feasible. Hence, any algorithm that would reduce training computation require- ment would be advantageous as these devices are usually chal- lenged in terms of computation power. It must be noted that the analysis in this section is conducted on a CPU-based windows machine without parallel process- ing. It has 16 GB RAM and an i5 processor from Intel. No GPU is available. This is a valid test scenario because field- deployed industrial DAQ systems are not likely to have GPUs making massive parallel processing difficult to implement. Figure 13. Mahalanobis Distance for benign and malignant data-points. 7.2. Computation Time Analysis A practical industrial system cannot wait till the entire base- In order to compare the discrimination capability between line of a desired size is accumulated before producing results these two metrics analytically, we plot the true positive rate as it might take a long time for an industrial machine to cap- vs. false positive rate with varying detection thresholds (ROC ture baseline samples from all possible operating conditions. curve) in Figure 14. When calculating this ROC curve the be- After each new data-point is received, the baseline cluster pa- nign and malignant labels are considered negative and posi- rameters are expected to be computed incrementally based on tive respectively. We can see that the curves are nearly over- the calculations of the previous stage and comparison run on

11 INTERNATIONAL JOURNAL OF PROGNOSTICS AND HEALTH MANAGEMENT the new baseline thus obtained. For any baseline size along the x axis in Figure 15, the vari- ability in CPU loading is accounted for by computing the The block of computation that needs to happen during base- times as an average of those calculated for 20 random choices line computation phase is all that which can be done on the of the baseline cluster from the data-set defined above. The baseline X alone and without the test-point y being avail- S CPU time has been calculated by using time ns() function able. from the time package in Python before and after each code From (24), the training computation for -statistic calcula- block during execution and taking the difference to be the P tion is limited to calculating the numerator N (say) as computation time for that segment.

n 1 For a given existing baseline size, the ratio of the two timings N = . (55) X X is now plotted against the dimension in Figure 16. j=2 k j 2 X Xk

1 From Figures 15 and 16, we have the following observations 2. S : Inverse of the covariance matrix for the input dataset. 1. Time taken for incremental baseline computation for MD From the above steps, only the centroid can be computed in- is more than that of the -statistic after each update step. crementally. Since it is not very straightforward to write the P covariance matrix in an incremental fashion, we have to re- 2. MD calculation time scales up much more with data di- compute the covariance matrix after each step. The inversion mension than the -statistic computation. Thus, the ad- P needs to be repeated any way. Let the computation time be vantage offered by -statistic is more pronounced for P TMD for each of these update steps. data of high dimensions. We consider a baseline distribution as one defined in Figure This demonstrates the computation advantage which is claimed 6. Only now, the dimension d is considered more than 2 and by energy-based distances over Mahalanobis Distance. a variable. We assume that the data-points arrive serially and compare computation times T and TMD for a single update 8. CONCLUSION P step with a new data-point given an already existing baseline In this paper, we have reviewed the energy distance, a dis- size. Figure 15 shows how these two compare against each tance metric proposed in literature and shown to have simpler other for varying baseline sizes and values of the dimension computation than most classical distance metric. We have d . shown that this metric possesses several shortcomings which prevent it from being applied in a practical anomaly detection application. We have proposed a modified metric called - P statistic which works on distributions of Euclidean distances just like the energy distance. Using a synthetic data-set, we show that the above shortcomings are overcome by this modi- fied metric. We have taken an example real-life data, the UCI breast-cancer data and shown that the -statistic and the Ma- P halanobis Distance have similar performance as a discrimina- tion metric, as shown by an almost overlapping ROC curve. We have also demonstrated that computation times for Ma- Figure 15. Time taken (ms) for a single incremental update halanobis Distance calculation are higher than those for - P step when using -statistic and MD, plotted for different al- statistic when computed in an incremental fashion with each P ready existing baseline sizes new data arrival. Specially, we have seen that -statistic of- P

12 INTERNATIONAL JOURNAL OF PROGNOSTICS AND HEALTH MANAGEMENT fers pronounced advantage with data having high dimension. Schopf,¨ H., & Supancic, P. (2014). On Burmann‘s¨ theorem Thus, we conclude that it is feasible to use this metric for an- and its application to problems of linear and nonlinear omaly detection without losing discrimination performance, heat transfer and diffusion. The Mathematica Journal, at the same time utilizing the simpler computation that en- 16(11). ergy distance based methods offer. Scikit-learn breast cancer data-set. (n.d.). Retrieved from https://scikit-learn.org/stable/ modules/generated/sklearn.datasets REFERENCES .load breast cancer.html Baringhaus, L., & Franz, C. (2004). On a new multivari- Statistical distance. (n.d.). Retrieved from ate two-sample test. Journal of multivariate analysis, https://en.wikipedia.org/wiki/ 88(1), 190–206. Statistical\ distance Bhattacharyya, A. (1943). On a measure of divergence be- Szekely,´ G. J. (1989). Potential and kinetic energy in statis- tween two statistical populations defined by their prob- tics. LBudapest Institue of Technology. ability distributions. Bulletin of the Calcutta Mathe- matical Society(35), 99-109. Szekely,´ G. J. (2002). E-statistics: energy of statistical sam- Dixon, A. (1901). On burmann’s theorem. Proceedings of ples (Tech. Rep. No. 02-16). Bowling Green State Uni- the London Mathematical Society, 1(1), 151–153. versity, Dep. Math. stat. Dua, D., & Graff, C. (2017). UCI Szekely,´ G. J., & Rizzo, M. L. (2005). A new test for mul- repository. Retrieved from http://archive.ics tivariate normality. Journal of Multivariate Analysis, .uci.edu/ml/datasets/Breast+Cancer+ 93(1), 58–80. Wisconsin+\%28Diagnostic\%29 Szekely,´ G. J., & Rizzo, M. L. (2009). Brownian distance co- Feuerverger, A. (1993). A consistent test for bivariate depen- variance. The annals of applied statistics, 1236–1265. dence. International Statistical Review/Revue Interna- Szekely,´ G. J., & Rizzo, M. L. (2013). Energy statistics: A tionale de Statistique, 419–433. class of statistics based on distances. Journal of statis- Goldstein, M., & Uchida, S. (2016). A comparative evalua- tical planning and inference, 143(8), 1249–1272. tion of unsupervised anomaly detection algorithms for multivariate data. PloS one, 11(4), e0152173. Szekely, G. J., & Rizzo, M. L. (2017). The energy of data. Harman, R., & Lacko, V. (2010). On decompositional al- Annual Review of Statistics and Its Application, 4, 447– gorithms for uniform sampling from n-spheres and n- 479. balls. Journal of Multivariate Analysis, 101(10), 2297– Szekely,´ G. J., Rizzo, M. L., et al. (2004). Testing for equal 2304. distributions in high dimension. InterStat, 5(16.10), Kim, A. Y., Marzban, C., Percival, D. B., & Stuetzle, W. 1249–1272. (2009). Using labeled data to evaluate change detec- Szekely, G. J., Rizzo, M. L., et al. (2005). Hierarchical clus- tors in a multivariate streaming environment. Signal tering via joint between-within distances: Extending Processing, 89(12), 2529–2536. ward’s minimum variance method. Journal of classi- Maaten, L. v. d., & Hinton, G. (2008). Visualizing data using fication, 22(2), 151–184. t-sne. Journal of machine learning research, 9(Nov), Yang, G. (2012). The energy goodness-of-fit test for univari- 2579–2605. ate stable distributions (Unpublished doctoral disserta- Mahalanobis, P. C. (1936). On the generalized distance in tion). Bowling Green State University. statistics.. Matteson, D. S., & James, N. A. (2014). A nonparametric approach for multiple change point analysis of multi- A. LIMITING CONDITIONS OF -STATISTIC variate data. Journal of the American Statistical Asso- E ciation, 109(505), 334–345. From (2) and (7), Rizzo, M. L. (2002a). A new rotation invariant goodness- n of-fit test (Unpublished doctoral dissertation). Bowling 2 (X 0 (),y)= (X X )+X y Green University. En S n || i S S ||2 i=1 Rizzo, M. L. (2002b). A test of homogeneity for two mul- X tivariate populations. Proceedings of the American where (57) n2 Statistical Association, Physical and Engineering Sci- n n ences Section. = X X < . (58) || i j||2 1 Rizzo, M. L. (2009). New goodness-of-fit tests for pareto i=1 j=1 distributions. ASTIN Bulletin: The Journal of the IAA, X X 0 lim n(XS(),y)=2XS y 2 < . (59) 39(2), 691–715. ) 0 E || || 1 !

13 INTERNATIONAL JOURNAL OF PROGNOSTICS AND HEALTH MANAGEMENT

Also, from (57), From (24), (r) may be written as Pn n n 1 lim n =2 lim Xi XS 2 (60) n(r)=↵ where (64) E n || || 2n P sk !1 !1 i=1 ! k=1 ✓ ◆ X X n 1 Now, ↵ = and Xk Xj 2 j=2 k X X n || i j||2 q j=1 Here, Xk(d) is the value of the dth coordinate of the point Xk X n which is D-dimensional in nature. 1 > X X . 2n|| i j||2 From (64) and (65), j=1 X 2 n d n 1 d 1 Xi XS 2 > from (58). P = ↵ ) || || 2n dr s dr s i=1 k ! k X k k ✓ ◆ n X 2 X 1 1 dsk ) Xi XS 2 > 0 (61) ) || || 2n = ↵ 2 i=1 ! sk s dr X k ! k k X 2 X From (60) and (61), 1 r Xk(1) ✏ = ↵ 3 . (66) sk s 0 k ! k k lim n(XS(),y)=+ . (62) X X E 1 !1 From (66), there exists r⇤ >X + ✏ k such that Here, the reader should note that it was necessary to prove the k(1) 8 relationship in (61) to establish whether the above-mentioned d n limit was + or . If the term on the LHS in (61) was P > 0 r>r⇤ since ↵>0 and sk > 0 k. (67) 1 1 dr 8 8 negative, the limit in (62) would be and not + . 1 1 Also, from (66), B. SOME GEOMETRICAL PROPERTIES OF -STATISTIC P r X ✏ k(1) Let y(d) be the value of the dth coordinate of the test-point 3 d n s y in the D-dimensional space. We assume that y is situated lim P =lim↵ k r dr r 2 at a distance r from the origin. Since we are studying the !1 !1 1 1 k + impact of r alone on the values of n for any given radial X0sk sj 1 P j=k direction, we may assume, without any loss of generality, that X6 the coordinate system is oriented in such a manner that X @+ ✏ A s 1 lim k(1) lim k r r !1 r !1 r r, for d =1, = ↵ ✓ ◆⇣ 2 ⌘ (68) y(d) = (63) 0, for d =[2, 3,...,D]. s ⇢ k 1+ lim k X 0 r sj 1 j=k !1 X6 @ A Now, from (65),

2 s 2Xk(1) X lim k =lim 2 + || k||2 r r r s r r !1 !1 ✓ ◆ =lim, where =1 ✏/r. r !1 =1. (69)

14 INTERNATIONAL JOURNAL OF PROGNOSTICS AND HEALTH MANAGEMENT

Also, from (65), since ↵ is a constant. Here N is assumed to be the number of elements in the cluster XS. 2 2 sk Xk 2 +(r ✏) 2(r ✏)Xk(1) lim =lim || || A few other points about the nature of n(r) that need to be r s r X 2 +(r ✏)2 2(r ✏)X P !1 j !1 s || j||2 j(1) mentioned here are =1. (70) r< , (r) < since ↵< and s < . 8 1 Pn 1 1 k 1 From (68), (69) and (70), (72) and 2 n 0 r since ↵>0 and sk > 0 k. (73) d P 8 8 lim Pn = ↵ 1 1+ 1 r dr 0 1 !1 k j=k X X6 = ↵/N, a constant,@ A (71)

15