Machine Learning and Extremes for Anomaly Detection (Apprentissage Automatique et Extrêmes pour la Détection d’Anomalies)

Doctoral School ED130 “Informatique, télécommunications et électronique de Paris”

Machine Learning and Extremes for Anomaly Detection

Thesis submitted for the degree of Doctor awarded by TELECOM PARISTECH
Specialty: “Signal and Images”
Presented and publicly defended by Nicolas GOIX on 28 November 2016
LTCI, CNRS, Télécom ParisTech, Université Paris-Saclay, 75013, Paris, France

Jury:
Gérard Biau, Professor, Université Pierre et Marie Curie (Examiner)
Stéphane Boucheron, Professor, Université Paris Diderot (Reviewer)
Stéphan Clémençon, Professor, Télécom ParisTech (Advisor)
Stéphane Girard, Research Director, Inria Grenoble Rhône-Alpes (Reviewer)
Alexandre Gramfort, Associate Professor (Maître de Conférences), Télécom ParisTech (Examiner)
Anne Sabourin, Associate Professor (Maître de Conférences), Télécom ParisTech (Co-advisor)
Jean-Philippe Vert, Research Director, Mines ParisTech (Examiner)

List of Contributions

Journal
• Sparse Representation of Multivariate Extremes with Applications to Anomaly Detection. (Under review for Journal of Multivariate Analysis). Authors: Goix, Sabourin, and Clémençon.

Conferences
• On Anomaly Ranking and Excess-Mass Curves. (AISTATS 2015). Authors: Goix, Sabourin, and Clémençon.
• Learning the dependence structure of rare events: a non-asymptotic study. (COLT 2015). Authors: Goix, Sabourin, and Clémençon.
• Sparse Representation of Multivariate Extremes with Applications to Anomaly Ranking. (AISTATS 2016). Authors: Goix, Sabourin, and Clémençon.
• How to Evaluate the Quality of Unsupervised Anomaly Detection Algorithms? (to be submitted). Authors: Goix and Thomas.
• One-Class Splitting Criteria for Random Forests with Application to Anomaly Detection. (to be submitted). Authors: Goix, Brault, Drougard and Chiapino.

Workshops
• Sparse Representation of Multivariate Extremes with Applications to Anomaly Ranking. (NIPS 2015 Workshop on Nonparametric Methods for Large Scale Representation Learning). Authors: Goix, Sabourin, and Clémençon.
• How to Evaluate the Quality of Unsupervised Anomaly Detection Algorithms? (ICML 2016, Workshop on Anomaly Detection). Co-winner of the Best Paper Award, sponsored by Google. Author: Goix.

Scikit-Learn implementations
• Isolation Forest: https://github.com/scikit-learn/scikit-learn/pull/4163 Authors: Goix and Gramfort
• Local Outlier Factor: https://github.com/scikit-learn/scikit-learn/pull/5279 Authors: Goix and Gramfort

Contents

List of Contributions
List of Figures
List of Tables

1 Summary
  1.1 Introduction
  1.2 Anomaly Detection, Anomaly Ranking and Scoring Functions
  1.3 M-estimation and criteria for scoring functions
    1.3.1 Minimum Volume sets
    1.3.2 Mass-Volume curve
    1.3.3 The Excess-Mass criterion
  1.4 Accuracy on extreme regions
    1.4.1 Extreme Values Analysis through STDF estimation
    1.4.2 Sparse Representation of Multivariate Extremes
  1.5 Heuristic approaches
    1.5.1 Evaluation of anomaly detection algorithms
    1.5.2 One-Class Random Forests
  1.6 Scikit-learn contributions
  1.7 Conclusion and Scientific Output

I Preliminaries
2 Concentration Inequalities from the Method of bounded differences
  2.1 Two fundamental results
    2.1.1 Preliminary definitions
    2.1.2 Inequality for Bounded Random Variables
    2.1.3 Bernstein-type Inequality (with variance term)
  2.2 Popular Inequalities
  2.3 Connections with Statistical Learning and VC theory
  2.4 Sharper VC-bounds through a Bernstein-type inequality
3 Extreme Value Theory
  3.1 Univariate Extreme Value Theory
  3.2 Extension to the Multivariate framework
4 Background on classical Anomaly Detection algorithms
  4.1 What is Anomaly Detection?
  4.2 Three efficient Anomaly Detection Algorithms
    4.2.1 One-class SVM
    4.2.2 Local Outlier Factor algorithm
    4.2.3 Isolation Forest
  4.3 Examples through scikit-learn
    4.3.1 What is scikit-learn?
    4.3.2 LOF examples
    4.3.3 Isolation Forest examples
    4.3.4 Comparison examples

II An Excess-Mass based Performance Criterion
5 On Anomaly Ranking and Excess-Mass Curves
  5.1 Introduction
  5.2 Background and related work
  5.3 The Excess-Mass curve
  5.4 A general approach to learn a scoring function
  5.5 Extensions - Further results
    5.5.1 Distributions with non compact support
    5.5.2 Bias analysis
  5.6 Simulation examples
  5.7 Detailed Proofs

III Accuracy on Extreme Regions
6 Learning the dependence structure of rare events: a non-asymptotic study
  6.1 Introduction
  6.2 Background on the stable tail dependence function
  6.3 A VC-type inequality adapted to the study of low probability regions
  6.4 A bound on the STDF
  6.5 Discussion
7 Sparse Representation of Multivariate Extremes
  7.1 Introduction
    7.1.1 Context: multivariate extreme values in large dimension
    7.1.2 Application to Anomaly Detection
  7.2 Multivariate EVT Framework and Problem Statement
    7.2.1 Statement of the Statistical Problem
    7.2.2 Regularity Assumptions
  7.3 A non-parametric estimator of the subcones’ mass: definition and preliminary results
    7.3.1 A natural empirical version of the exponent measure mu
    7.3.2 Accounting for the non asymptotic nature of data: epsilon-thickening
    7.3.3 Preliminaries: uniform approximation over a VC-class of rectangles
    7.3.4 Bounding empirical deviations over thickened rectangles
    7.3.5 Bounding the bias induced by thickened rectangles
    7.3.6 Main result
  7.4 Application to Anomaly Detection
    7.4.1 Extremes and Anomaly Detection
    7.4.2 DAMEX Algorithm: Detecting Anomalies among Multivariate Extremes
  7.5 Experimental results
    7.5.1 Recovering the support of the dependence structure of generated data
    7.5.2 Sparse structure of extremes (wave data)
    7.5.3 Application to Anomaly Detection on real-world data sets
  7.6 Conclusion
  7.7 Technical proofs
    7.7.1 Proof of Lemma 7.5
    7.7.2 Proof of Lemma 7.6
    7.7.3 Proof of Proposition 7.8
    7.7.4 Proof of Lemma 7.10
    7.7.5 Proof of Remark 7.14

IV Efficient heuristic approaches
8 How to Evaluate the Quality of Anomaly Detection Algorithms?
  8.1 Introduction
  8.2 Mass-Volume and Excess-Mass based criteria
    8.2.1 Preliminaries
    8.2.2 Numerical unsupervised criteria
  8.3 Scaling with dimension
  8.4 Benchmarks
    8.4.1 Datasets description
    8.4.2 Results
  8.5 Conclusion
  8.6 Further material on the experiments
9 One Class Splitting Criteria for Random Forests
  9.1 Introduction
  9.2 Background on decision trees
  9.3 Adaptation to the one-class setting
    9.3.1 One-class splitting criterion
    9.3.2 Prediction: a majority vote with one single candidate?
    9.3.3 OneClassRF: a Generic One-Class Random Forest algorithm
  9.4 Benchmarks
    9.4.1 Default parameters of OneClassRF
    9.4.2 Hyper-Parameters of tested algorithms
    9.4.3 Description of the datasets
    9.4.4 Results
  9.5 Theoretical justification for the one-class splitting criterion
    9.5.1 Underlying model
    9.5.2 Adaptive approach
  9.6 Conclusion
  9.7 Further details on benchmarks and unsupervised results
10 Conclusion, limitations & perspectives
11 Summary of contributions in French
  11.1 Introduction
  11.2 Anomaly detection, anomaly ranking and scoring functions
  11.3 M-estimation and performance criteria for scoring functions
    11.3.1 Minimum volume sets
    11.3.2 The Mass-Volume curve
    11.3.3 The Excess-Mass criterion
  11.4 Accuracy on extreme regions
    11.4.1 Analysis from the extreme value theory viewpoint through STDF estimation
    11.4.2 Sparse representation of multivariate extremes
  11.5 Heuristic approaches
    11.5.1 Evaluating an algorithm
