Meaningful Objective Frequency-Based Interesting Pattern Mining Thomas Delacroix
To cite this version: Thomas Delacroix. Meaningful objective frequency-based interesting pattern mining. Artificial Intelligence [cs.AI]. Ecole nationale supérieure Mines-Télécom Atlantique, 2021. English. NNT : 2021IMTA0251. tel-03286641

HAL Id: tel-03286641
https://tel.archives-ouvertes.fr/tel-03286641
Submitted on 15 Jul 2021

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Doctoral thesis of École nationale supérieure Mines-Télécom Atlantique Bretagne Pays de la Loire - IMT Atlantique
Doctoral school No. 601: Mathématiques et Sciences et Technologies de l'Information et de la Communication
Specialty: Computer Science
By Thomas Delacroix

Meaningful objective frequency-based interesting pattern mining
Extraction objective et signifiante de motifs intéressants sur la base de leur fréquence

Thesis presented and defended at IMT Atlantique on 21 May 2021
Research unit: UMR CNRS 6285 Lab-STICC
Thesis No.: 2021IMTA0251

Reviewers before the defense:
Jean-Paul Haton, Professor Emeritus, Université de Lorraine
Gilbert Saporta, Professor Emeritus, Conservatoire National des Arts et Métiers

Composition of the jury:
President: Jérôme Azé, Professor, Université de Montpellier
Examiners:
Jean-Paul Haton, Professor Emeritus, Université de Lorraine
Gilbert Saporta, Professor Emeritus, Conservatoire National des Arts et Métiers
Pascale Kuntz, Professor, Université de Nantes
Franck Vermet, Associate Professor (Maître de conférences), Université de Bretagne Occidentale
Thesis supervisor: Philippe Lenca, Professor, IMT Atlantique

Contents

1 Foreword
1.1 The two cultures of statistical modeling
1.2 The end of theory
1.3 The right to an explanation
1.4 A culture shock
1.5 Objective frequency-based interesting pattern mining
1.6 Acknowledgments
2 Mining objectively interesting itemsets and rules - History and state-of-the-art
2.1 Early developments
2.2 Frequent itemsets and association rules
2.2.1 The models
2.2.1.1 Frequent itemsets
2.2.1.2 Association rules
2.2.2 Algorithms
2.2.2.1 The Apriori algorithm
2.2.2.2 Other mining methods
2.3 Interestingness
2.3.1 Interestingness measures
2.3.1.1 A prolongation of the association rule model
2.3.1.2 Finding the right interesting measure
  Subjective and objective interestingness measures
  Objective interestingness measures: a definition
  Objective interestingness measures for rules between itemsets?
  Properties of objective interestingness measures
  Algorithmic properties
  Good modeling properties
  Meaningfulness
  Measuring the interestingness of interestingness measures
2.3.2 Exact summarizations of itemsets
2.3.2.1 Maximal frequent itemsets
2.3.2.2 Closed itemsets and minimal generators
  Closed itemsets
  Minimal generators
2.3.2.3 Non-derivable itemsets
2.3.3 Local and global models for mining informative and significant patterns
2.3.3.1 Local data models for identifying local redundancy within individual patterns
  Local redundancy in rules
  Statistical models
  Information theory models
  Local redundancy in rules between itemsets
  Local redundancy in itemsets
  The independence model
  MaxEnt models
  Other models
2.3.3.2 Testing redundancy against global background knowledge models
2.3.3.3 Iterative learning
2.3.3.4 Global models defined by interesting and non-redundant sets of patterns
  Model evaluation
  Pattern mining through compression
  A necessary resort to heuristics
2.4 Conclusion
3 Meaningful mathematical modeling of the objective interestingness of patterns
3.1 Introduction
3.1.1 Modeling and mathematical modeling
3.1.1.1 Modeling as translation
3.1.1.2 Mathematical modeling
3.1.1.3 Complex modeling processes
3.1.2 Chapter outline
3.2 The data: subject or object of the modeling process
3.2.1 The data modeling process in the case of objective interestingness measures for rules
3.2.2 The data modeling process when considering the fixed row and column margins constraint
3.2.3 Recommendation
3.3 Phenotypic modeling and genotypic modeling of interestingness
3.3.1 Phenotypic and genotypic modeling: a definition
3.3.2 Phenotypic and genotypic approaches for measuring objective interestingness
3.3.2.1 Phenotypic approaches for defining objective interestingness measures
3.3.2.2 Genotypic approaches for modeling interestingness
3.3.3 When phenotypic modeling meets genotypic modeling: modeling information
3.3.3.1 Phenotypic approaches for modeling entropy
3.3.3.2 Genotypic approaches for modeling information
3.3.4 Recommendation
3.4 Pragmatic modeling
3.4.1 Pragmatic modeling of interestingness
3.4.2 Meaningfulness first, computability second
3.4.3 Recommendation
3.5 Patchwork and holistic modeling processes
3.5.1 Patchwork modeling in interesting pattern mining
3.5.1.1 Type 1 patchwork modelings (PW1)
3.5.1.2 Type 2 patchwork modelings (PW2)
3.5.2 Recommendation
3.6 Mathematical modeling of patterns
3.6.1 Measure spaces and Boolean lattices
3.6.2 Benefits of the measure space and Boolean lattice models
3.6.2.1 Modeling the dataset using a random variable
3.6.2.2 Pattern diversity
3.6.2.3 Pattern complexity
3.6.2.4 Type diversity
3.6.2.5 Sound and complete families of patterns
3.6.2.6 Rule mining
3.6.3 Recommendation
3.7 Modeling objectivity
3.7.1 A static finite model for a dynamic never-ending process?
3.7.2 Prerequisites to considering data-driven data models
3.7.2.1 Confidence in the empirical distribution
  Potential distributions
  Distance
  Statistical test
  Confidence
3.7.2.2 How many transactions are needed?
  Background knowledge
  Precision
  Minimizing the $\chi^2$ statistic
  Lower bounds for $n_{\alpha,f}$
  Upper bounds for $n_{\alpha,f}$
  The case of the uniform distribution
  The limits of pure empirical science
  Empirical distribution precision
3.7.3 Formulating hypotheses
3.7.3.1 Global hypotheses
3.7.3.2 Local hypotheses
3.7.3.3 Selecting hypotheses
3.7.4 Evaluating hypotheses
3.7.4.1 Evaluating global hypotheses
  Compression scores
  Statistical testing
3.7.4.2 Evaluating local hypotheses
3.7.5 Recommendation
3.8 Conclusion
4 Mutual constrained independence models
4.1 Theoretical foundations of MCI
4.1.1 Preliminaries
4.1.1.1 Notations
4.1.1.2 Transfer matrix
4.1.1.3 Problem statement
4.1.1.4 Formulating objective hypotheses
4.1.1.5 Application to the problem statement
4.1.2 Finite approach
4.1.2.1 Particular constrained sets
  Empty set
  Independence model
  All proper subitemsets
4.1.2.2 Computing µ
4.1.3 Asymptotic approach
4.1.3.1 MCI convergence theorem
4.1.3.2 Model justification
4.1.3.3 Proof of the convergence theorem
  Preliminary step 1: Reduced transfer matrix
  Preliminary step 2: Largest derivable constraint system
  Preliminary step 3: Equations
  Strong version of Theorem 4.1.1 and proof
4.1.4 Definition of MCI
4.1.5 MCI and maximum entropy
4.2 MCI Models: properties and computation
4.2.1 K = I \ {Id}
4.2.1.1 Algebraic expression of the model
4.2.1.2 Distance to the MCI model
4.2.1.3 Particularity of the MCI approach
4.2.2 Algebraic geometry for computing MCI models
4.2.2.1 Algebraic geometry for polynomial system solving
4.2.2.2 A zero-dimensional polynomial system
  Linear part
  Loglinear part
  Computing P
4.2.2.3 General structure of the algorithm
4.2.2.4 Speed-up for independence cases
4.2.2.5 Speed-ups for step 4
4.2.2.6 Algebraic solutions for all cases when m ≤ 4
  Solutions for m = 3
4.2.2.7 Pros and cons of the algebraic method
4.3 Conclusion
5 Extracting objectively interesting patterns from data
5.1 Testing the MCI hypothesis
5.1.1 Definition of the MCI hypothesis
5.1.2 Statistical testing of the MCI hypothesis
5.1.2.1 $\chi^2$ statistic
5.1.2.2 $\chi^2$ test
5.2 Discovering a valid global MCI hypothesis
5.2.1 Valid MCI hypotheses
5.2.2 Ordering P(I)
5.2.2.1 A possible order relation
5.2.2.2 Further discussions on the definition of an order relation
5.2.3 Search algorithms
5.2.3.1 Comprehensive search
5.2.3.2 Greedy algorithms
5.2.3.3 Efficiency of