Isolation Forest
Fei Tony Liu, Kai Ming Ting
Gippsland School of Information Technology
Monash University, Victoria, Australia
{tony.liu},{kaiming.ting}@infotech.monash.edu.au

Zhi-Hua Zhou
National Key Laboratory for Novel Software Technology
Nanjing University, Nanjing 210093, China
[email protected]

Abstract

Most existing model-based approaches to anomaly detection construct a profile of normal instances, then identify instances that do not conform to the normal profile as anomalies. This paper proposes a fundamentally different model-based method that explicitly isolates anomalies instead of profiling normal points. To our best knowledge, the concept of isolation has not been explored in the current literature. The use of isolation enables the proposed method, iForest, to exploit sub-sampling to an extent that is not feasible in existing methods, creating an algorithm which has a linear time complexity with a low constant and a low memory requirement. Our empirical evaluation shows that iForest performs favourably to ORCA (a near-linear time complexity distance-based method), LOF and Random Forests in terms of AUC and processing time, especially on large data sets. iForest also works well in high dimensional problems which have a large number of irrelevant attributes, and in situations where the training set does not contain any anomalies.

1 Introduction

Anomalies are data patterns that have different data characteristics from normal instances. The detection of anomalies has significant relevance and often provides critical actionable information in various application domains. For example, anomalies in credit card transactions could signify fraudulent use of credit cards. An anomalous spot in an astronomy image could indicate the discovery of a new star.
An unusual computer network traffic pattern could stand for unauthorised access. These applications demand anomaly detection algorithms with high detection performance and fast execution.

Most existing model-based approaches to anomaly detection construct a profile of normal instances, then identify instances that do not conform to the normal profile as anomalies. Notable examples such as statistical methods [11], classification-based methods [1], and clustering-based methods [5] all use this general approach. Two major drawbacks of this approach are: (i) the anomaly detector is optimized to profile normal instances, but not optimized to detect anomalies; as a consequence, the results of anomaly detection might not be as good as expected, causing too many false alarms (normal instances identified as anomalies) or too few anomalies being detected; (ii) many existing methods are constrained to low dimensional data and small data sizes because of their high computational complexity.

This paper proposes a different type of model-based method that explicitly isolates anomalies rather than profiling normal instances. To achieve this, our proposed method takes advantage of two quantitative properties of anomalies: (i) they are the minority, consisting of fewer instances, and (ii) they have attribute-values that are very different from those of normal instances. In other words, anomalies are ‘few and different’, which makes them more susceptible to isolation than normal points. We show in this paper that a tree structure can be constructed effectively to isolate every single instance. Because of their susceptibility to isolation, anomalies are isolated closer to the root of the tree, whereas normal points are isolated at the deeper end of the tree. This isolation characteristic of trees forms the basis of our method to detect anomalies, and we call this tree an Isolation Tree or iTree.

The proposed method, called Isolation Forest or iForest, builds an ensemble of iTrees for a given data set; anomalies are those instances which have short average path lengths on the iTrees. There are only two variables in this method: the number of trees to build and the sub-sampling size. We show that iForest's detection performance converges quickly with a very small number of trees, and that it only requires a small sub-sampling size to achieve high detection performance with high efficiency.

Apart from the key difference of isolation versus profiling, iForest is distinguished from existing model-based [11, 1, 5], distance-based [6] and density-based [4] methods in the following ways:

• The isolation characteristic of iTrees enables them to build partial models and to exploit sub-sampling to an extent that is not feasible in existing methods. Since a large part of an iTree that isolates normal points is not needed for anomaly detection, it does not need to be constructed. A small sample size produces better iTrees because the swamping and masking effects are reduced.

• iForest utilizes no distance or density measures to detect anomalies. This eliminates the major computational cost of distance calculation in all distance-based and density-based methods.

• iForest has a linear time complexity with a low constant and a low memory requirement. To our best knowledge, the best-performing existing method achieves only approximately linear time complexity with high memory usage [13].

• iForest has the capacity to scale up to handle extremely large data sizes and high-dimensional problems with a large number of irrelevant attributes.

This paper is organised as follows: In Section 2, we demonstrate isolation at work using an iTree that recursively partitions data, and propose a new anomaly score based on iTrees. In Section 3, we describe the characteristic of this method that helps to tackle the problems of swamping and masking. In Section 4, we provide the algorithms to construct iTrees and iForest. Section 5 empirically compares this method with three state-of-the-art anomaly detectors; we also analyse the efficiency of the proposed method, and report the experimental results in terms of AUC and processing time. Section 6 provides a discussion on efficiency, and Section 7 concludes this paper.

[Figure 1. Anomalies are more susceptible to isolation and hence have short path lengths. Given a Gaussian distribution (135 points): (a) a normal point x_i requires twelve random partitions to be isolated; (b) an anomaly x_o requires only four partitions to be isolated; (c) the averaged path lengths of x_i and x_o converge as the number of trees increases.]

2 Isolation and Isolation Trees

In this paper, the term isolation means ‘separating an instance from the rest of the instances’. Since anomalies are ‘few and different’, they are more susceptible to isolation. In a data-induced random tree, partitioning of instances is repeated recursively until all instances are isolated.
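As a concrete illustration of this recursive partitioning (our own toy sketch, not the authors' implementation), the following Python snippet counts how many random axis-aligned partitions are needed before a chosen point stands alone. The function name, the 135-point Gaussian sample mirroring Figure 1, and the anomaly location are illustrative assumptions.

```python
import random

def partitions_to_isolate(data, target):
    """Count random axis-aligned partitions needed to isolate `target`."""
    subset = list(data)
    count = 0
    while len(subset) > 1:
        q = random.randrange(len(target))   # randomly selected attribute
        lo = min(x[q] for x in subset)
        hi = max(x[q] for x in subset)
        if lo == hi:
            continue                        # attribute is constant here; pick again
        p = random.uniform(lo, hi)          # split value between min and max
        # keep only the side of the split that contains the target point
        subset = [x for x in subset if (x[q] < p) == (target[q] < p)]
        count += 1
    return count

random.seed(0)
# 135-point Gaussian cluster (as in Figure 1) plus one distant, hypothetical anomaly.
normal = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(135)]
anomaly = (8.0, 8.0)
data = normal + [anomaly]

trials = 1000  # average over many random partitionings, as in Figure 1(c)
def avg_path(point):
    return sum(partitions_to_isolate(data, point) for _ in range(trials)) / trials

print("normal point :", avg_path(normal[0]))  # typically a longer average path
print("anomaly      :", avg_path(anomaly))    # typically a much shorter average path
```

Averaged over many runs, the distant point is isolated in far fewer partitions than a point inside the cluster, which is exactly the behaviour the next paragraph explains.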
This random partitioning produces noticeably shorter paths for anomalies since (a) the fewer instances of anomalies result in a smaller number of partitions – shorter paths in a tree structure, and (b) instances with distinguishable attribute-values are more likely to be separated in early partitioning. Hence, when a forest of random trees collectively produces shorter path lengths for some particular points, those points are highly likely to be anomalies.

To demonstrate the idea that anomalies are more susceptible to isolation under random partitioning, Figures 1(a) and 1(b) visualise the random partitioning of a normal point versus an anomaly. We observe that a normal point, x_i, generally requires more partitions to be isolated, while the anomalous point, x_o, generally requires fewer. In this example, partitions are generated by randomly selecting an attribute and then randomly selecting a split value between the maximum and minimum values of the selected attribute. Since recursive partitioning can be represented by a tree structure, the number of partitions required to isolate a point is equivalent to the path length from the root node to a terminating node. In this example, the path length of x_i is greater than the path length of x_o.

Since each partition is randomly generated, individual trees are generated with different sets of partitions. We average path lengths over a number of trees to find the expected path length. Figure 1(c) shows that the average path lengths of x_o and x_i converge as the number of trees increases. Using 1000 trees, the average path lengths of x_o and x_i converge to 4.02 and 12.82 respectively, showing that anomalies have shorter path lengths than normal instances.

Definition: Isolation Tree. Let T be a node of an isolation tree. T is either an external node with no child, or an internal node with one test and exactly two daughter nodes (T_l, T_r). A test consists of an attribute q and a split value p such that the test q < p divides data points into T_l and T_r.

Given a sample of data X = {x_1, ..., x_n} of n instances from a d-variate distribution, to build an isolation tree (iTree) we recursively divide X by randomly selecting an attribute q and a split value p, until either: (i) the tree reaches a height limit, (ii) |X| = 1, or (iii) all data in X have the same values.

Definition: Path Length. The path length h(x) of a point x is measured by the number of edges x traverses in an iTree from the root node until the traversal terminates at an external node.

    iTree                       BST
    -------------------------   -------------------
    Proper binary trees         Proper binary trees
    External node termination   Unsuccessful search
    Not applicable              Successful search

    Table 1. List of equivalent structure and operations in iTree and Binary Search Tree (BST)

iTrees have an equivalent structure to the Binary Search Tree or BST (see Table 1): the estimation of the average h(x) for external node terminations is the same as that of an unsuccessful search in BST. We therefore borrow the analysis from BST to estimate the average path length of an iTree. Given a data set of n instances, Section 10.3.3 of [9] gives the average path length of an unsuccessful search in BST as:

    c(n) = 2H(n − 1) − (2(n − 1)/n),    (1)

where H(i) is the harmonic number, which can be estimated by ln(i) + 0.5772156649 (Euler's constant). As c(n) is the average of h(x) given n, we use it to normalise h(x). The anomaly score s of an instance x is defined as:

    s(x, n) = 2^{−E(h(x))/c(n)},    (2)

where E(h(x)) is the average of h(x) from a collection of isolation trees.
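To make Eqs. (1) and (2) concrete, here is a small Python sketch (ours, not code from the paper) that evaluates the normalising factor c(n) and the anomaly score s(x, n); the function names are our own, and the inputs reuse the converged path lengths from Figure 1(c).

```python
import math

EULER_GAMMA = 0.5772156649  # Euler's constant

def harmonic(i):
    """Estimate the harmonic number H(i) ~ ln(i) + Euler's constant."""
    return math.log(i) + EULER_GAMMA

def c(n):
    """Eq. (1): average path length of an unsuccessful BST search (assumes n > 1)."""
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """Eq. (2): s(x, n) = 2^(-E(h(x)) / c(n))."""
    return 2.0 ** (-mean_path_length / c(n))

# Converged averages from Figure 1(c), with n = 135:
print(anomaly_score(4.02, 135))   # x_o (anomaly) -> ~0.73
print(anomaly_score(12.82, 135))  # x_i (normal)  -> ~0.37
```

With n = 135, these averages give s ≈ 0.73 for x_o and s ≈ 0.37 for x_i; as the paper goes on to discuss, scores close to 1 indicate likely anomalies, while scores well below 0.5 can safely be regarded as normal.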