Automated Exploration for Interactive Outlier Detection
KDD 2017 Research Paper
KDD’17, August 13–17, 2017, Halifax, NS, Canada

REMIX: Automated Exploration for Interactive Outlier Detection

Yanjie Fu, Missouri U. of Science & Technology, Missouri, USA, [email protected]
Charu Aggarwal, IBM T. J. Watson Research Center, NY, USA, [email protected]
Srinivasan Parthasarathy, IBM T. J. Watson Research Center, NY, USA, [email protected]
Deepak S. Turaga, IBM T. J. Watson Research Center, NY, USA, [email protected]
Hui Xiong, Rutgers University, NJ, USA, [email protected]

ABSTRACT
Outlier detection is the identification of points in a dataset that do not conform to the norm. Outlier detection is highly sensitive to the choice of the detection algorithm and the feature subspace used by the algorithm. Extracting domain-relevant insights from outliers needs systematic exploration of these choices since diverse outlier sets could lead to complementary insights. This challenge is especially acute in an interactive setting, where the choices must be explored in a time-constrained manner.

In this work, we present REMIX, the first system to address the problem of outlier detection in an interactive setting. REMIX uses a novel mixed integer programming (MIP) formulation for automatically selecting and executing a diverse set of outlier detectors within a time limit. This formulation incorporates multiple aspects such as (i) an upper limit on the total execution time of detectors, (ii) diversity in the space of algorithms and features, and (iii) meta-learning for evaluating the cost and utility of detectors. REMIX provides two distinct ways for the analyst to consume its results: (i) a partitioning of the detectors explored by REMIX into perspectives through low-rank non-negative matrix factorization; each perspective can be easily visualized as an intuitive heatmap of experiments versus outliers, and (ii) an ensembled set of outliers which combines outlier scores from all detectors. We demonstrate the benefits of REMIX through extensive empirical validation on real-world data.

ACM Reference format:
Yanjie Fu, Charu Aggarwal, Srinivasan Parthasarathy, Deepak S. Turaga, and Hui Xiong. 2017. REMIX: Automated Exploration for Interactive Outlier Detection. In Proceedings of KDD’17, August 13–17, 2017, Halifax, NS, Canada., 9 pages.
DOI: http://dx.doi.org/10.1145/3097983.3098154

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
KDD’17, August 13–17, 2017, Halifax, NS, Canada. © 2017 ACM. ISBN 978-1-4503-4887-4/17/08...$15.00.

1 INTRODUCTION
Outlier detection is the identification of points in a dataset that do not conform to the norm. This is a critical task in data analysis and is widely used in many applications such as financial fraud detection, Internet traffic monitoring, and cyber security [4]. Outlier detection is highly sensitive to the choice of the detection algorithm and the feature subspace used by the algorithm [4, 5, 35]. Further, outlier detection is often performed on high dimensional data in an unsupervised manner without data labels; distinct sets of outliers discovered through different algorithmic choices could reveal complementary insights about the application domain. Thus, unsupervised outlier detection is a data analysis task which inherently requires a principled exploration of the diverse algorithmic choices that are available.

Recent advances in interactive data exploration [10, 17, 33, 34] show much promise towards automated discovery of advanced statistical insights from complex datasets while minimizing the burden of exploration for the analyst. We study unsupervised outlier detection in such an interactive setting and consider three practical design requirements:
(1) Automated exploration: the system should automatically enumerate, assess, select and execute a diverse set of outlier detectors; the exploration strategy should guarantee coverage in the space of features and algorithm parameters.
(2) Predictable response time: the system should conduct its exploration within a specified time limit. This implies an exploration strategy that is sensitive to the execution time (cost) of the detectors.
(3) Visual interpretability: the system should enable the user to easily navigate the results of automated exploration by grouping together detectors that are similar in terms of the data points they identify as outliers.

1.1 Key Contributions
We present REMIX, a modular framework for automated outlier exploration and the first to address the outlier detection problem in an interactive setting. To the best of our knowledge, none of the existing outlier detection systems support interactivity in the manner outlined above; in particular, we are not aware of any system which automatically explores the diverse algorithmic choices in outlier detection within a given time limit. The following are the key contributions of our work.

1.1.1 MIP based Automated Exploration. REMIX systematically enumerates candidate outlier detectors and formulates the exploration problem as a mixed integer program (MIP). The solution to this MIP yields a subset of candidates which are executed by REMIX.
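As an informal illustration of the selection problem formulated in Section 1.1.1, the following Python sketch picks a subset of candidate detectors under a time budget while giving every algorithm a minimum share of that budget. It is not the REMIX formulation: the candidate names, costs, utilities, the weight `lam`, the cut-off `k`, the `min_share` value, and the specific form of the objective are all assumptions made for illustration. Exhaustive enumeration stands in for a MIP solver, which is only practical for a handful of candidates.

```python
from itertools import combinations

# Hypothetical candidates: (name, algorithm, estimated cost in seconds, estimated utility).
CANDIDATES = [
    ("lof_f12", "lof", 4.0, 0.9),
    ("lof_f34", "lof", 3.0, 0.6),
    ("iso_f12", "iforest", 2.0, 0.5),
    ("iso_all", "iforest", 5.0, 0.8),
    ("knn_f13", "knn", 3.0, 0.7),
]


def objective(subset, k, lam):
    """Assumed aggregate utility: lam * (top-k utility) + (1 - lam) * (total utility)."""
    utils = sorted((c[3] for c in subset), reverse=True)
    return lam * sum(utils[:k]) + (1.0 - lam) * sum(utils)


def feasible(subset, budget, min_share, algorithms):
    """Check the budget cap and the per-algorithm minimum share of the budget."""
    if sum(c[2] for c in subset) > budget:
        return False
    for algo in algorithms:
        spent = sum(c[2] for c in subset if c[1] == algo)
        if spent < min_share * budget:
            return False
    return True


def select(candidates, budget, min_share=0.15, k=2, lam=0.5):
    """Search all subsets exhaustively; a MIP solver would replace this loop."""
    algorithms = sorted({c[1] for c in candidates})
    best, best_value = (), float("-inf")
    for r in range(len(candidates) + 1):
        for subset in combinations(candidates, r):
            if feasible(subset, budget, min_share, algorithms):
                value = objective(subset, k, lam)
                if value > best_value:
                    best, best_value = subset, value
    return [c[0] for c in best], best_value


chosen, value = select(CANDIDATES, budget=10.0)
print(chosen, round(value, 3))
```

With these made-up numbers, the only kNN candidate is forced into the selection by the diversity constraint, which leaves room for exactly one LOF and one isolation-forest candidate within the budget.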
The MIP maximizes a novel aggregate utility measure which trades off between the total utility of the top-k vs all selected candidates; the MIP also enforces (i) an upper limit on the total cost (budget) of the selected candidates, and (ii) a set of diversity constraints which ensure that each detection algorithm and certain prioritized feature subspaces get at least a certain minimum share of the exploration budget.

1.1.2 Meta-Learning for Cost and Utility. The REMIX MIP requires an estimate of cost and utility for each candidate detector. In order to estimate cost, REMIX trains a meta-learning model for each algorithm which uses the number of data points, size of the feature subspace, and various product terms derived from them as meta-features. It is significantly harder to estimate or even define utility.

[Figure 1: two heatmap panels, (a) Perspective 1 and (b) Perspective 2; x-axis: Data Point (0 to 200), y-axis: Exploratory Action (0 to 100).]
Figure 1: Factorization of an outlier matrix into two perspectives. Each perspective is a heatmapped matrix whose columns are data points and rows are detectors. The intensity of a cell (s, p) in a perspective corresponds to the extent to which detector s identifies point p as an outlier. Each perspective clearly identifies a distinct set of outliers.

detectors marking a distinct set of data points as outliers. REMIX provides a succinct way for the data analyst to visualize these results as heatmaps. Consider the outlier matrix Δ where Δ_{s,p} is the normalized ([0, 1]-ranging) outlier score assigned by detector s to data point p. REMIX uses a low rank non-negative matrix factorization scheme to bi-cluster Δ into a small user-specified number of perspectives such that there is consensus among the detectors within a perspective. The idea of outlier perspectives is a generalization of the idea of outlier ensembles. A notable special case occurs when the number of perspectives equals one: here, the results from all the detectors are ensembled into a single set of outliers.

REMIX can be used in two modes. (i) Simple: An analyst can gain rapid visual understanding of the outlier space by providing two simple inputs: an exploration budget and the number of perspectives. This yields bi-clustered heatmaps of outliers that are easily interpretable as in Figure 1. (ii) Advanced: As discussed in Section 4.3.1, an advanced user can also re-configure specific modules within REMIX such as utility estimation or the enumeration of prioritized feature subspaces. Such re-configuration would steer the REMIX MIP towards alternate optimization goals while still guaranteeing exploration within a given budget and providing factorized heatmap visualizations of the diverse outlier sets.

The rest of the paper is organized as follows. We survey related work in Section 2, and provide background definitions and an overview in Section 3. In Sections 4.1–4.5, we present the details of feature subspace and candidate detector enumeration, cost and utility estimation, MIP for automated exploration, and perspective factorization. We present an evaluation of REMIX on real-world datasets in Section 5 and conclude in Section 6.

2 RELATED WORK
Outlier ensembles [4, 5, 23, 31] combine multiple outlier results to obtain a more robust set of outliers. Model-centered ensembles combine results from different base detectors while data-centered ensembles explore different horizontal or vertical samples of the dataset and combine their results [4]. Research in the area of subspace mining [8, 18, 19, 23, 30] focuses on exploring a diverse family of feature subspaces which are interesting in terms of their ability to reveal outliers. The work in [28] introduces a new ensemble model that uses detector explanations to choose base detectors selectively and remain robust to errors in those detectors.

REMIX is related to ensemble outlier detection since setting the number of perspectives to one in REMIX leads to ensembling of results from the base detectors. However, REMIX has some notable distinctions: the idea of perspectives in REMIX generalizes the notion of ensembles; setting the number of perspectives to a number greater than one is possible in REMIX and results in complementary views