Eindhoven University of Technology

BACHELOR

Regression models on bivariate count data in an exceptional model mining context

Raaijmakers, Boy A.J.


Regression Models on Bivariate Count Data in an Exceptional Model Mining Context

Bachelor Final Project - 2WH40

B.A.J. (Boy) Raaijmakers - 0845644

Supervision: dr. P.J. (Paulo) De Andrade Serra dr. W. (Wouter) Duivesteijn

Eindhoven, Friday 14th July, 2017

Abstract

Exceptional Model Mining (EMM) is a special form of Subgroup Discovery. It concerns the discovery of ‘interesting’ patterns in datasets using constraints on descriptive attributes to cluster target variables. More specifically, EMM assumes that the target variables adhere to a model which can be induced on the data itself. Using this property, this thesis introduces a way of analyzing bivariate count data by means of bivariate Poisson regression. Moreover, an implementation in R is made. This algorithm is applied to both simulated data and real-life Premier League football datasets.


Contents

1 Introduction
  1.1 History
  1.2 Significance
  1.3 Related work
  1.4 Contributions
  1.5 Structure

2 Subgroup Discovery & Exceptional Model Mining
  2.1 Subgroup Discovery
    2.1.1 Formal definition
    2.1.2 Quality measure
  2.2 Exceptional Model Mining
    2.2.1 Beam search

3 Regression models on count data
  3.1 Regression models
  3.2 Bivariate count data
    3.2.1 Poisson regression
    3.2.2 Bivariate Poisson distribution

4 Implementation of the framework
  4.1 The EMM framework
  4.2 Refinement Operator
  4.3 Beam sizes, result set size, constraints
  4.4 Used packages

5 Quality measure
  5.1 Correlation coefficient
  5.2 Absolute difference
  5.3 Correlation in bivariate Poisson regression

6 Simulation and real world data
  6.1 Simulation
  6.2 Premier League data
    6.2.1 Data Cleaning and filtering
    6.2.2 Results

7 Conclusion

8 Future work


1 Introduction

Data mining concerns the analysis of data to derive interpretable results from it. Examples of data mining are predictive analysis, pattern mining and clustering. A specific case of data mining is Subgroup Discovery. When analyzing a dataset, Subgroup Discovery concerns finding subsets of the data on which some variable of interest shows deviating behaviour. With this, one can detect interesting patterns in datasets; examples are Subgroup Discovery in social media [3], traffic accidents [29], the medical world [4] and many others. Exceptional Model Mining (EMM) [12] is a special form of Subgroup Discovery [39]. It concerns the discovery of ‘interesting’ patterns in datasets using constraints on descriptive attributes to cluster target variables. More specifically, EMM assumes the target variables to adhere to a statistical model which can be induced from the data itself [12]. In this thesis, I will elaborate on descriptions for the discovery of subgroups, using the Exceptional Model Mining framework as proposed in [12]. More specifically, the framework will be applied to bivariate count data. The techniques will be applied to both simulated and real-world datasets. I also review previous work on the analysis of count data. In [34], count data (there referred to as object-relational data) is analyzed using graphical models (specifically, Bayesian networks) to find elements of a dataset whose properties substantially deviate from the whole set. This thesis applies the EMM framework to the same situation in order to gain new insights into the same kind of data. After introducing the EMM framework and assessing its quality, the findings of this thesis will be compared to the ones from [34] to identify the differences.

1.1 History

As mentioned above, the umbrella field of the discussed EMM framework is Subgroup Discovery. Subgroup Discovery is a knowledge discovery method which can be used to explain and describe data [3]. Earlier work on Subgroup Discovery in, for example, medical and technical domains can be found in [19], [29], [4] and many others. To my knowledge, the first notion of Subgroup Discovery, together with the first discovery algorithm EXPLORA, was proposed in 1996 by Klösgen in [25]. The Exceptional Model Mining framework was first introduced by Leman et al. in 2008 [30]. Later, it was rewritten and applied in [12] and [13]. Since the framework is relatively new, its behavior has not yet been formally analyzed. There is, however, some related work on the framework, which will be discussed in Section 1.3 of this thesis. Regression models are an older concept; far-reaching documentation on the subject is found in, for example, [2], [10] and [18]. Due to the exhaustive research behind these models, they are relatively well understood. Also, several implementations in well-known programming languages (e.g. R, Python) are available for simulation and application (examples are [40] and [24]).

1.2 Significance

Exceptional Model Mining is a framework introduced to provide a new viewpoint on Subgroup Discovery. EMM assumes adherence of the response variables to a statistical model (as further explained in Section 2.2). This extension to Subgroup Discovery opens doors to new data analysis techniques by providing methods that can deal with an underlying model on the variables of interest. An example in which data is analysed using statistical models is found in the pharmaceutical industry [22].


1.3 Related work

Related work is, among others, found in the field of traditional Subgroup Discovery. As explained, first introduced by [25], traditional Subgroup Discovery concerns describing subsets of data of which one knows that some target variable takes a different value per part of the data. The limitation of this method, as stated in [21], is that there is just one variable of interest. Although proven to be applicable in many different contexts [21], having a single variable of interest comes with limitations. Closely related to Subgroup Discovery lies the field of outlier (or anomaly) detection. Whereas Subgroup Discovery considers interesting subgroups of arbitrary size, outlier detection focuses on rather small subsets of (not necessarily statistically related) deviating data. Moreover, whereas Subgroup Discovery distinguishes between different subsets of data, outlier detection focuses on constructing a model on the complete dataset only. After construction, all data points not adhering to this model are classified as outliers. An example of this is found in [28], where outlier detection is used to secure networks and identify hackers using the deviating behavior of their computers. [31] describes a method for multinomial outlier detection. The work in this thesis is based on the findings of a paper which concerned outlier detection on object-relational data [34]. In this thesis, I will use the findings of this previous work and review the conclusions from the angle of Exceptional Model Mining. Since EMM is a rather new framework, not much has been published on the subject. The initial framework was published by [30] and later refined by [12]. In the meantime, several publications were made on the framework [13], [37], [17], [14]. More closely related to this thesis, [11] already handles the case of EMM in combination with regression models. To the author's knowledge, no other research has been done regarding multivariate count data in Subgroup Discovery or EMM. For the case of count data, several papers and books have been published. An extensive overview of regression models on count data is given in [6]. Furthermore, among many others, [20] analyzes health-care utilization in Australia in 1977-1978 using count data models, and [5] writes about Dutch day-trips in 1991, which can be analyzed using a bivariate Poisson model.

1.4 Contributions

The main contributions of this thesis are:

• Introduce an R-based implementation of the Exceptional Model Mining framework.
• Extend the EMM toolbox with the bivariate Poisson regression model to provide means for finding exceptional behavior in data with bivariate count targets.

1.5 Structure

This thesis is structured as follows. Section 2 describes the process of Subgroup Discovery and in particular the Exceptional Model Mining framework. In Section 3, the bivariate Poisson regression model is presented. Section 4 describes the implementation of the EMM framework. Section 5 introduces different quality measures for EMM. The implementation and measures combined are applied to simulated and real-world data in Section 6. Section 7 concludes the work, after which Section 8 gives suggestions for future work.


2 Subgroup Discovery & Exceptional Model Mining

Subgroup Discovery and Exceptional Model Mining concern the search for ‘interesting’ subsets of a particular dataset. Subgroups are defined by descriptions or ‘patterns’. A measure of quality is used to assess the interestingness of a subgroup; it measures how (dis)similar a subgroup is compared to the whole dataset or to the complement of the subgroup. The goal of SD and EMM is to find the most interesting subgroups (i.e. the subgroups with the highest quality).

In this section, the general idea and purpose of SD and the EMM framework will be reviewed in order to get familiar with the capabilities of these techniques. First, an informal definition from literature will be formalized in order to define the related techniques in a structured and unambiguous manner. Next, the further definitions regarding Subgroup Discovery are given and are then translated into the EMM framework.

2.1 Subgroup Discovery

Subgroup Discovery is known as the field of finding ‘interesting’ subgroups. The subgroups are defined by descriptive attributes, while the notion of ‘interestingness’ is induced on the target attributes. A more structured and complete definition is given by [39]:

“In Subgroup Discovery, we assume we are given a so-called population of individuals (objects, customers, . . . ) and a property of those individuals that we are interested in. The task of Subgroup Discovery is then to discover the subgroups of the population that are statistically ‘most interesting’, i.e. are as large as possible and have the most unusual statistical (distributional) characteristics with respect to the property of interest.”

Although intuitively clear, this definition refers to a few terms which are not well defined and therefore need a better explanation. Terms like ‘interesting’ have an intuitively clear meaning, but are not precise enough to be used to analyze a certain dataset. Because of this, I will use the more precise and formal definition of [13].

2.1.1 Formal definition

In Subgroup Discovery, one is given a dataset D in which each element is represented by a vector x̄. Each element x̄ ∈ D can be written as x̄ = (a1, . . . , ak, t1, . . . , tm) with k, m ∈ N fixed for all elements of D. Here, ā = (a1, . . . , ak) represents the descriptive attributes and t̄ = (t1, . . . , tm) represents the target attributes of x̄ (throughout this thesis ‘attributes’ may be interchanged with ‘variables’). For the case of count data, ti is discrete for some i throughout the whole dataset. The indices i for which this holds will be named responses and are further defined in Section 3. Furthermore, |D| = N denotes the size of the dataset. Assume ā to be an element of an unspecified domain D, and assume t̄ to be an element of another unspecified domain T (this domain will be specified for the case of count data in Section 3.2).

Next, one can define a description on the data. A description is an indicator τ : D → {0, 1}, by which a vector x̄ is said to be accepted by τ when τ(ā) = 1. The collection of all descriptions τ is called the description language P. To each description τ corresponds a subgroup, defined by:

Dτ = {x̄ ∈ D | τ(ā) = 1},   τ ∈ P.

Furthermore, let |Dτ| = Nτ. Intuitively, a subgroup of a dataset is the set of elements satisfying the description defining that subgroup.
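As a small illustration (not part of the thesis code), a description and its subgroup can be expressed in R as a predicate on the descriptive attributes; the column names below are hypothetical.

# A description tau is a predicate on the descriptive attributes of the records.
# The column names 'gender' and 'income' are illustrative only.
tau <- function(data) data$gender == "M" & data$income < 20000

# The subgroup D_tau consists of all records accepted by tau.
subgroup <- function(data, tau) data[tau(data), , drop = FALSE]

# Hypothetical example:
d <- data.frame(gender = c("M", "V", "M"), income = c(10251, 24635, 36486))
subgroup(d, tau)    # keeps only the first row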


2.1.2 Quality measure

Now, one only has to define how ‘interesting’ or ‘good’ a subgroup is. The interestingness or goodness of a subgroup is quantified using a quality measure. A quality measure is a function φ : P → R; it thus maps the description of a particular subgroup to a real value (the quality). In traditional Subgroup Discovery, a subgroup is deemed interesting when the value of its target substantially deviates from the target values of the complement or of the whole dataset (i.e. the quality of the description is high enough) [39]. Furthermore, in traditional SD, the number of target attributes is equal to 1 (i.e. m = 1). In a more realistic setting, however, there might be multiple target variables, each having some statistical relation with the others. Then it is not enough to only look at the distribution of a particular subgroup, but rather at the statistical models and properties the targets adhere to. More detailed definitions of different quality measures are given in Section 5.

2.2 Exceptional Model Mining

In the Exceptional Model Mining framework, the target variables of the data are assumed to adhere to a model [12]. The type of model used for the analysis in EMM is determined by the nature of the data. In general, one may take the notion of ‘model’ in EMM as broad as possible, as long as it makes sense for the data. One might, for example, be interested in the correlation between targets or in the amount of dispersion in the targets of a subgroup. The main intuition is that the constraint of a single fixed target variable (as in traditional Subgroup Discovery) no longer exists; instead, any statistical relation between targets can be analyzed. In the next section of this thesis, the chosen models will be presented. These models will later be used to assess the ‘interestingness’ or quality of a specific subgroup. Using this measure, one can select the subgroup with the highest or lowest quality to identify deviating behavior. An example of EMM is found in Figure 1. In this illustration, taken from [12], one can see a population of 6 elements in which a clear distinction is present between two subgroups, each following its own (linear) model. In this example, the targets are given by t1 = X, t2 = Y, and t3 is given by the shape of the data point. Also, note that the descriptive attributes ā are not explicitly given here.

Figure 1: Example population with two clear subgroups and their trained linear models [12].


A problem with EMM is that the set of all descriptions of subgroups is typically very large, since a subgroup can be constructed from a constraint on every attribute and every attribute can be split on all values it might take. A binary attribute has just two possible splits, whereas a nominal attribute can have arbitrarily many. Taking all different possibilities into account results in a very large search space, and exhaustive analysis of this space is computationally costly. Therefore, a heuristic method to traverse the search space is introduced next.

2.2.1 Beam search

Analyzing the search space in a heuristic but effective manner can be achieved by beam search. The method was, although not formally, first introduced in [33]. The algorithm is based on the heuristic that when the quality (in this thesis given by the quality measure φ) of a data structure A is higher than the quality of a structure B, then the quality of some subset A′ ⊆ A is also higher than the quality of some B′ ⊆ B. Of course, this assumption is false in many cases, but it significantly shrinks the search space of the algorithm. Beam search makes use of a certain beam ‘width’ and ‘depth’. The width of the beam is the number of subsets (with the highest quality) which will be taken into account at the next layer of evaluation. The beam depth defines the number of layers or iterations in the search. An example is given by [35] and reproduced in Figure 2. In this example, the beam width is 2 and the beam depth is 4.

Figure 2: Beam search tree [35].

In order for beam search to be applied, a suitable operator is needed to traverse from one layer to the next. This ‘refinement’ operator will generally be a specialization operator. Given the descriptions of a certain layer, it specifies the descriptions for the subgroups in the next layer. By specialization is meant that, given a description, the operator adds an ingredient (typically a constraint on a specific attribute) to it. In this way, the data is divided into different (not necessarily disjoint) subsets. Per description, the quality of the corresponding subgroup is assessed in the next layer of the search. An example of such a refinement can be given using the data from Table 1. When analyzing the numbers of calls and complaints made to a customer service portal, one could construct subgroups from customer demographics. A refinement operator, in this case, might provide the splits on these demographics. Assume a beam width of 1 and a beam depth of 2. In the first layer, the operator


CustomerID | Gender | Income  | Nr. of children | Calls | Complaints
1          | M      | €10.251 | 0               | 6     | 2
2          | V      | €24.635 | 2               | 7     | 1
3          | V      | €12.145 | 0               | 1     | 3
4          | M      | €36.486 | 1               | 3     | 0
5          | M      | €25.154 | 1               | 4     | 0
6          | V      | €18.452 | 0               | 5     | 0
7          | M      | €68.873 | 2               | 3     | 1
8          | V      | €21.012 | 0               | 8     | 2

Table 1: Example data of contacting customer services.

will split the data on gender. From the two constructed subgroups, the quality is calculated; the quality of the male group turns out to be higher. Then, a split is made on the monthly income. Since these are numerical values, the operator splits on incomes higher and lower than €20.000,-. From the newly constructed subgroups, the quality is again assessed and the most exceptional subgroup is selected. A visualization of this procedure is found in Figure 3. In this thesis, the refinement operator η introduced by [12] is used. The choice of η is further explained in Section 4.

Figure 3: Example of a refinement operator.


3 Regression models on count data

With the aforementioned framework, one has a stable basis for discovery of subgroups in a broad range of datasets. However, to apply the framework, one still needs to have a hypothesis on how the target attributes behave. For this, the bivariate Poisson regression model will be described. First, some general notions and definitions on regression models are made. After that, the Poisson regression model is given. Finally, the Poisson regression model is extended to a bivariate Poisson regression model.

3.1 Regression models

The term regression implies that one is interested in the dependence or ‘impact’ that certain variables have on the outcomes of others [1]. Using regression models, one is able to characterize these impacts. One could, for example, see whether a higher cholesterol level implies a higher blood pressure or see the relation between the amount of diapers and beer bought at a supermarket. A formal definition of a regression model is given by [1] and states:

Suppose one has N observations (pi, ri), i = 1, . . . , N over a set of m predictors pi = (pi1, . . . , pim) and one response variable ri. A general regression model assumes that

ri = hβ(pi) + εi,   i = 1, . . . , N,

where the response function hβ(·) : R^m → R has a known parametric form and depends on N′ ≤ N unknown parameters β = (β1, . . . , βN′). Furthermore, the noise terms εi are i.i.d. random variables which are typically normal, but could also have some other distribution. Since both EMM and regression models concern different input and output variables, there has to be a clear distinction in notation. EMM uses descriptive attributes ā = (a1, . . . , ak) to describe subgroups and targets t̄ = (t1, . . . , tm) as the variables determining the interestingness of a subgroup. Since EMM is based on the target attributes adhering to a statistical model, the regression model will only consider the targets of EMM. More specifically, the target variables of EMM will be divided into l predictive variables and m − l response variables of the regression model, that is (t1, . . . , tm) = (p1, . . . , pl, r1, . . . , rm−l). A visualized explanation of this naming convention is:

\bar{x} = (\overbrace{a_1, \ldots, a_k}^{\text{descriptors}}, \overbrace{t_1, \ldots, t_m}^{\text{targets}}) = (a_1, \ldots, a_k, \underbrace{p_1, \ldots, p_l}_{\text{predictors}}, \underbrace{r_1, \ldots, r_{m-l}}_{\text{responses}})

Now that the definition of regression models is known, one can look at the link with EMM. One can use regression models to derive how interesting a subgroup is. For example, using the aforementioned definition of a regression model, one could derive a parameter vector β for the whole dataset and β′ for a subgroup and look at the Euclidean distance between both. On the other hand, one might also look at the dispersion of the noise ε in the dataset and the subgroup: if, for example, the variance or mean of ε within a subgroup differs from the assumed standard normal parameters, then the subgroup might be deemed interesting. The quality measure most suitable for a proper analysis should be chosen taking into account the nature of the data and the expected results.

3.2 Bivariate count data

Count data means that the response variables take non-negative integer values. These values usually correspond to counts of some event over a given amount of time. An example of count data is the number of tropical cyclones crossing the North Queensland coast between 1956 and 1968 [10] or the flying-bomb hits on London during WWII (which is discussed below).

Recall the formal definition of Subgroup Discovery, where the targets t̄ are members of an unspecified domain T. Reviewing this for the case of bivariate count data, this domain is no longer unspecified. More specifically, let tm−1 and tm be elements of Z+, the non-negative integers. The targets (t1, . . . , tm−2) are still elements of some unspecified domain. From here, I can define models on the targets to use in the EMM algorithm.

3.2.1 Poisson regression

When considering count data, or counts in general, the Poisson distribution is the first thing that comes to mind. The Poisson distribution is a well-known distribution for describing the number of event occurrences within a certain time period; each event time is independent of the previous event times, and the times between consecutive events have a common mean. The probability mass function of a random variable Y ∼ Poi(λ) is given by:

P(Y = y) = f(y; \lambda) = \frac{\lambda^y e^{-\lambda}}{y!}.

Note that E[Y] = Var(Y) = λ. An example of the application of the Poisson distribution is found in [15]. This example concerns the flying-bomb hits on London during World War II; the data can be found in Table 2. The city of London was divided into N = 576 small areas of 1/4 square kilometre each. The table records the number Nk of areas with exactly k hits. The total number of hits is T = Σ_k k · Nk = 537, and the average is λ = T/N ≈ 0.9323. Comparing with the expected counts of a Poisson distribution with λ = 0.9323, one can see that the fit is surprisingly accurate. Back in the day, people believed that the bombing attacks were clustered, which would be indicated by high frequencies in areas with either k = 0 or k ≥ 5. However, the fit of the Poisson model indicated almost perfect randomness and homogeneity.

k              | 0      | 1      | 2     | 3     | 4    | ≥ 5
Nk             | 229    | 211    | 93    | 35    | 7    | 1
Poi(k; 0.9323) | 226.74 | 211.39 | 98.54 | 30.62 | 7.14 | 1.57

Table 2: Flying-bomb hits on London [15].
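As a quick check of the expected counts in Table 2, the following R snippet (an illustrative sketch, not part of the thesis code) recomputes the Poisson fit from the totals given above.

# Fit of the Poisson model to the flying-bomb data from Table 2.
N <- 576                       # number of areas
T_hits <- 537                  # total number of recorded hits
lambda <- T_hits / N           # average number of hits per area (~0.9323)

# Expected number of areas with k = 0, ..., 4 hits and with 5 or more hits.
expected <- c(dpois(0:4, lambda),
              ppois(4, lambda, lower.tail = FALSE)) * N
round(expected, 2)             # approximately 226.74 211.39 98.54 30.62 7.14 1.57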

Apart from this example, [15] also shows examples of chromosome interchanges in cells, radioactive decay and calls mistakenly made to a wrong number. These examples all roughly follow a Poisson distribution. Thus, the Poisson distribution seems a good fit for the data handled in this thesis.

Directly related to this distribution is Poisson regression. The following definition of this model is paraphrased from [7].

A Poisson regression model can model the frequency of events, provided that the events ri, i = 1, . . . , Nτ, given the predictors pi, are independent. The assumption is that ri | pi ∼ Poi(λi), where λi is defined by:

log(\lambda_i) = \beta^T p_i,   i = 1, . . . , N_\tau.

Taking the logarithm ensures positivity of λi. The Poisson regression model is part of a broader family of models known as the generalized linear models [40].
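In R, a univariate Poisson regression of this form can be fitted with the base glm() function. The sketch below, with made-up coefficients and predictors, is only meant to illustrate the model just described.

# Univariate Poisson regression with a log link, on simulated data.
set.seed(1)
n  <- 200
p1 <- runif(n)                            # illustrative predictors
p2 <- sample(0:10, n, replace = TRUE)
lambda <- exp(0.2 + 0.5 * p1 + 0.05 * p2) # log(lambda) is linear in the predictors
r  <- rpois(n, lambda)                    # Poisson-distributed response counts

fit <- glm(r ~ p1 + p2, family = poisson(link = "log"))
coef(fit)                                 # estimates of the beta parameters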


3.2.2 Bivariate Poisson distribution

In order to also take into account dependence between different count-valued responses, a new distribution will be described. This distribution, given by [23], is called the multivariate Poisson distribution.

Let Yi, i = 1, . . . , a be independent Poisson random variables, i.e. Yi ∼ Poi(λi), i = 1, . . . , a. Now define the vectors Y = (Y1, . . . , Ya)^T and λ = (λ1, . . . , λa)^T. Furthermore, define the b × a matrix A, b ≤ a, with elements equal to 0 or 1. We say that the vector r = (r1, . . . , rb)^T, given as r = AY, follows a multivariate Poisson distribution. Moreover, [23] gives the facts that E[r] = Aλ and Var(r) = AΣA^T, where Σ = diag(λ1, . . . , λa). As this thesis only focusses on bivariate response data, b is equal to 2. Furthermore, A is taken equal to:

A = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{pmatrix},

which gives the following equations for the responses:

r1 = Y1 + Y3
r2 = Y2 + Y3

with Y1 ∼ Poi(λ1), Y2 ∼ Poi(λ2) and Y3 ∼ Poi(λ3). Note that the mean vector is given by:

E[r] = \begin{pmatrix} \lambda_1 + \lambda_3 \\ \lambda_2 + \lambda_3 \end{pmatrix}

and the variance-covariance matrix by:

Var(r) = \begin{pmatrix} \lambda_1 + \lambda_3 & \lambda_3 \\ \lambda_3 & \lambda_2 + \lambda_3 \end{pmatrix}.

From here, one can see that the covariance between r1 and r2 is controlled by λ3. To view this distribution as a bivariate Poisson regression model, one can assume that the parameters λi are functions of the predictive attributes. This brings us back to the definition of the Poisson regression for one response.
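The construction r = AY above is easy to simulate. The following R sketch (with arbitrary rates, not taken from the thesis) generates bivariate Poisson counts and checks that the sample covariance of (r1, r2) is close to λ3.

# Bivariate Poisson counts via the trivariate reduction r = A Y.
set.seed(42)
n  <- 100000
lambda <- c(2, 3, 1.5)            # arbitrary rates for Y1, Y2, Y3
Y1 <- rpois(n, lambda[1])
Y2 <- rpois(n, lambda[2])
Y3 <- rpois(n, lambda[3])         # shared component inducing the correlation
r1 <- Y1 + Y3
r2 <- Y2 + Y3

cov(r1, r2)                       # close to lambda[3] = 1.5
c(var(r1), var(r2))               # close to lambda[1] + lambda[3] and lambda[2] + lambda[3]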

Now, we have Yik ∼ Poi(λik) for the i-th observation and k ∈ {1, 2, 3}. The predictive attributes are denoted by pik = (pi1k, . . . , pilk), i = 1, . . . , Nτ, k ∈ {1, 2, 3}. So for every Yik, there is a corresponding λik given by:

log(\lambda_{ik}) = \beta_k^T p_{ik},   i = 1, . . . , N_\tau,  k ∈ {1, 2, 3}.

The observed data are r1 and r2. However, the parameters βk, k ∈ {1, 2, 3} are the parameters for λi = (λi1, λi2, λi3), which in turn describe the distribution of the Yik's, which we do not observe. Therefore, a method to estimate the unknown parameters βk = (β1k, . . . , βlk) for k ∈ {1, 2, 3} is needed. For this, I use the EM algorithm, introduced by [9]. The EM (or Expectation-Maximization) algorithm is an iterative method to estimate parameters by maximum likelihood in statistical models where the observations depend on unobserved latent variables [38]. The method consists of an expectation (E) step, in which the conditional expectation of the log-likelihood given the current estimate of the parameters is computed. After that, the maximization (M) step computes the parameters maximizing this expected log-likelihood. The parameters found in the M-step are then used again in the E-step of the next iteration. The method for the case of multivariate Poisson regression is proposed in [23]. The case of bivariate Poisson regression is given next.

• E-step: In the E-step, one computes the conditional expectation of the latent variables

Yk = (Y1k, . . . , YNτk), k ∈ {1, 2, 3}, given the observed data and the current values of the estimates of βk, k ∈ {1, 2, 3}, which are denoted by Θ^(h) = (β1^(h), β2^(h), β3^(h)). This expectation, denoted by s_ik^(h), is given by:

s_{ik}^{(h)} = E[Y_{ik} \mid r_i, \Theta^{(h-1)}] = \frac{\lambda_{ik} \, P(R = r_i - \phi_k)}{P(R = r_i)},   i = 1, . . . , N_\tau,  k ∈ {1, 2, 3},

where φk denotes the k-th column of A and P(·) denotes the joint probability function of the multivariate Poisson distribution. This probability function is given by [32]:

P(r_{i1}, r_{i2}) = e^{-(\lambda_{i1} + \lambda_{i2} + \lambda_{i3})} \sum_{j=0}^{\min(r_{i1}, r_{i2})} \frac{\lambda_{i1}^{r_{i1}-j} \, \lambda_{i2}^{r_{i2}-j} \, \lambda_{i3}^{j}}{(r_{i1}-j)! \, (r_{i2}-j)! \, j!},

where \lambda_{ik} = \exp(\beta_k^{(h-1)T} p_{ik}), i = 1, . . . , N_\tau, k ∈ {1, 2, 3}.

• M-step: Using the sik's as dependent variables and the pik's as predictive attributes, update Θ by fitting a Poisson regression model.

• If some convergence criterion is satisfied, stop iterating; else, go back to the E-step for the next iteration.

The initial values for the algorithm, Θ^(1), can be obtained by fitting separate Poisson regression models assuming independence of the responses. With this algorithm, the bivariate Poisson model can be fitted to the data. The detailed description of the associated quality measure for the EMM framework is given in Section 5.
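To illustrate the quantities used in the E-step, the following self-contained R sketch implements the joint probability function and the conditional expectations s_i1, s_i2, s_i3 for a single observation. It is a sketch of the formulas above only; the actual implementation in this thesis relies on the bivpois package instead.

# Joint probability of a bivariate Poisson observation (r1, r2)
# with underlying rates lambda = (lambda1, lambda2, lambda3).
dbivpois <- function(r1, r2, lambda) {
  j <- 0:min(r1, r2)
  exp(-sum(lambda)) * sum(lambda[1]^(r1 - j) * lambda[2]^(r2 - j) * lambda[3]^j /
                          (factorial(r1 - j) * factorial(r2 - j) * factorial(j)))
}

# E-step: conditional expectations of the latent Y1, Y2, Y3 given (r1, r2).
# phi_k is the k-th column of A = [1 0 1; 0 1 1].
estep <- function(r1, r2, lambda) {
  A <- matrix(c(1, 0, 0, 1, 1, 1), nrow = 2)        # columns are phi_1, phi_2, phi_3
  denom <- dbivpois(r1, r2, lambda)
  sapply(1:3, function(k) {
    rk <- c(r1, r2) - A[, k]
    if (any(rk < 0)) return(0)                      # a negative count has probability zero
    lambda[k] * dbivpois(rk[1], rk[2], lambda) / denom
  })
}

estep(3, 2, c(2, 3, 1.5))   # expected latent counts; they satisfy A %*% s = c(3, 2)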


4 Implementation of the framework

With the definition of EMM from Section 2, it is now possible to make an implementation of the algorithm. This algorithm will later be used on both simulated and real-life datasets to check its performance.

4.1 The EMM framework

In [12], an implementation of EMM using beam search is proposed. Since this implementation also suits the specialized case of count data, this algorithm will be implemented for this thesis. The pseudo-code is given in Algorithm 1 below.

Algorithm 1 Beam Search for Top-q Exceptional Model Mining
Input: dataset Ω, quality measure φ, refinement operator η, beam width w, beam depth d, result set size q, constraint set C
Output: PriorityQueue resultSet

candidateQueue ← new Queue
candidateQueue.enqueue({})                      ▷ start with the empty description
resultSet ← new PriorityQueue(q)
for (level ← 1; level ≤ d; level++) do
    beam ← new PriorityQueue(w)
    while (candidateQueue ≠ ∅) do
        seed ← candidateQueue.dequeue()
        set ← η(seed)
        for (desc ∈ set) do
            quality ← φ(desc)
            if (desc.satisfiesAll(C)) then
                resultSet.insert_with_priority(desc, quality)
                beam.insert_with_priority(desc, quality)
    while (beam ≠ ∅) do
        candidateQueue.enqueue(beam.get_front_element())
return resultSet

The implementation of this pseudo-code needs some definitions of the different operators which will be used. The different quality measures will be specified in Section 5. The rest of the inputs will be specified next.
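As a rough indication of how this pseudo-code translates to R, the sketch below implements the same loop with plain lists instead of the liqueueR classes used in the actual implementation; refine_fn and quality_fn stand in for η and φ and are placeholders, not the thesis code.

# Minimal beam search skeleton (plain R lists; descriptions are lists of conditions).
beam_search <- function(data, quality_fn, refine_fn, w, d, q) {
  top_n <- function(cands, n) cands[order(-sapply(cands, `[[`, "quality"))][seq_len(min(n, length(cands)))]
  candidates <- list(list(description = list(), quality = -Inf))
  result <- list()
  for (level in seq_len(d)) {
    scored <- list()
    for (seed in candidates) {
      for (desc in refine_fn(data, seed$description)) {
        scored[[length(scored) + 1]] <- list(description = desc,
                                             quality     = quality_fn(data, desc))
      }
    }
    result     <- top_n(c(result, scored), q)   # keep the overall top-q descriptions
    candidates <- top_n(scored, w)              # beam: best w descriptions of this level
  }
  result
}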

4.2 Refinement Operator

The refinement operator, already mentioned in Section 2.2, will now be defined. The operator introduced by [12] is used, which is defined as follows: “When η is presented with a description D to refine, it builds up the set η(D) by looping over all the descriptive attributes a1, . . . , ak. For each attribute, a number of descriptions are added to η(D), depending on the attribute type:

• if ai is binary: add D ∩ (ai = 0) and D ∩ (ai = 1) to η(D);

• if ai is nominal, with values v1, . . . , vg: add ∪_{j=1}^{g} {D ∩ (ai = vj), D ∩ (ai ≠ vj)} to η(D);

• if ai is numeric: order the values of ai that are covered by the description D; this gives us a list of ordered values v(1), . . . , v(n) (where n = |GD|). From this list, we select the split points s1, . . . , sb−1 by letting

s_j = v_{(\lfloor j \cdot n / b \rfloor)},   j = 1, . . . , b − 1.

Then, add {D ∩ (ai ≤ sj), D ∩ (ai ≥ sj)}_{j=1}^{b−1} to η(D).”
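For the numeric case, the split-point rule can be written in a few lines of R; this is a small illustrative sketch with variable names of my own choosing.

# Split points for a numeric attribute, following s_j = v_(floor(j * n / b)).
numeric_splits <- function(values, b) {
  v <- sort(values)                      # ordered values covered by the description
  n <- length(v)
  j <- seq_len(b - 1)
  v[floor(j * n / b)]                    # the b - 1 split points s_1, ..., s_(b-1)
}

# Each split point s yields two refinements: a_i <= s and a_i >= s.
numeric_splits(c(7, 3, 9, 1, 5, 8, 2, 6), b = 4)   # gives 2 5 7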

4.3 Beam sizes, result set size, constraints

Further inputs of the model are the beam width, beam depth, result set size and the initial constraints. The values of the beam width and depth influence the number of computations made by the algorithm and hence its running time. These parameters are implemented as global variables and can therefore be changed easily between runs of the algorithm. Later on, using the different datasets as input, these parameters will be set to the most suitable values. In the same sense, the size of the result set of descriptions is initially left large; in more specific cases, this parameter can be set to a smaller value, which will be determined later. Initial constraints are used when there is some restriction on the structure of the result set. It might, for example, be required that a management office wants to dismiss at most 50 employees and wants to find out which subgroup is functioning worst. These constraints are completely independent of the patterns generated by the EMM framework. In this general implementation, such constraints are not needed and might be filled in later. Hence, the constraint set is initially left empty.

4.4 Used packages

Due to the statistical nature of the source data, the implementation is made in R. The following packages were used for the model:

• liqueueR [8] for the Queue and PriorityQueue objects, which provide the functionality of (priority) queues;
• sandwich [41][42], used together with the base R function glm(), which provides generalized linear models and hence Poisson regression.

The whole code is found in the appendix of this thesis. Since the quality of a subgroup is dependent on the chosen model, multiple quality measures are implemented. These will be introduced and specified in Section 5.


5 Quality measure

As already mentioned in Section 2, a quality measure is used to assess the interestingness of a particular subgroup. In the case of EMM, the quality measure can take many different forms. In this section, different aspects of a quality measure are discussed and concrete measures are introduced which can be used for the analysis of data.

Figure 1: Example population with two clear subgroups and their trained linear models [12].

Review the EMM example given in Figure 1 (which was already shown in Section 2). It illustrates the question of what interestingness of a subgroup means. One can either assess the quality of a subgroup by comparing its model with a model for the complement of the subgroup, or by comparing it with the model for the whole dataset. Suppose the situation in Figure 1 arises. Fitting a simple linear regression using X to predict Y (i.e. hβ(X) is linear) on the whole dataset will result in a line with slope equal to −1. Now suppose that one has some description which takes the circles as a group. Again fitting a model gives a slope of 1. When the quality of a set is assessed using the slope of the trained model, one can see that this measure substantially differs from that of the whole dataset and therefore the subgroup might be deemed interesting. On the other hand, when comparing with the complement of the circles, the squares, one can see that the slopes are equal and hence the subgroup shows no deviating behavior. The choice of quality assessment depends entirely on the nature of the data, the structure of the target variables and what one wants to discover from the data. This has to be taken into account in further analysis and assessments of quality. First, some simpler heuristic measures will be introduced. These measures are used to get familiar with the framework and its implementation. In addition to the simpler measures, more complex ones will be introduced, which can be used to draw significant conclusions on different datasets.

5.1 Correlation coefficient

As a first step in the complete implementation, I start with a rather straightforward quality measure. This quality measure is proposed in Section 5.1 of [12] and is known as φ_scd. The measure considers two numeric response variables, r1 and r2, and is related to the correlation coefficient ρ. This coefficient is estimated by the Pearson correlation coefficient

applied to samples, defined by:

\hat{\rho}^{D_\tau} = \frac{\sum_{i=1}^{N_\tau} (r_1^i - \bar{r}_1)(r_2^i - \bar{r}_2)}{\sqrt{\sum_{i=1}^{N_\tau} (r_1^i - \bar{r}_1)^2 \sum_{i=1}^{N_\tau} (r_2^i - \bar{r}_2)^2}},

where r_1^i, r_2^i denote the i-th entries of r1 and r2 respectively, and r̄1, r̄2 denote their respective means; the sample estimate for the complement is defined analogously. Now let ρ^{Dτ} and ρ^{Dτ^C} be the correlation coefficients for a (sub)group Dτ and its complement Dτ^C, respectively, and let ρ̂^{Dτ} and ρ̂^{Dτ^C} denote their sample estimates. Using this, it is interesting to find a substantially deviating correlation of a subgroup with respect to its complement. Therefore, one can derive the quality based on the test

H_0 : \rho^{D_\tau} = \rho^{D_\tau^C}   against   H_1 : \rho^{D_\tau} \neq \rho^{D_\tau^C}.

In general, however, the sampling distributions of ρ̂^{Dτ} and ρ̂^{Dτ^C} are not known, even if the target variables are Gaussian. If r1 and r2 follow a bivariate normal distribution (which they do not in count data; this will be commented on later), then we can apply the Fisher z transformation, given by

z' = \frac{1}{2} \cdot \log \frac{1 + \hat{\rho}}{1 - \hat{\rho}}.

The sampling distribution of z' is approximately normal. As a consequence,

z^* = \frac{z' - z'^C}{\sqrt{\frac{1}{N_\tau - 3} + \frac{1}{(N - N_\tau) - 3}}}

approximately follows a standard normal distribution under H_0. As a rule of thumb, if both Nτ and N − Nτ are greater than 25, the normal approximation gives quite accurate results. The quality measure is then simply given by 1 minus the p-value of the test using z^*. In order to generate useful results from this measure, an assumption should be made on the size of the subgroup. This property is to be added to the constraints given as input to the algorithm.
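A direct R transcription of this measure could look as follows (a sketch under the normality assumption discussed above; subgroup is a logical vector marking membership of Dτ).

# phi_scd: 1 minus the p-value of the Fisher-z test comparing the correlation
# of (r1, r2) inside a subgroup with the correlation in its complement.
phi_scd <- function(r1, r2, subgroup) {
  n1 <- sum(subgroup); n2 <- sum(!subgroup)
  if (n1 < 4 || n2 < 4) return(NA_real_)           # z-statistic undefined otherwise
  rho1 <- cor(r1[subgroup],  r2[subgroup])
  rho2 <- cor(r1[!subgroup], r2[!subgroup])
  z1 <- 0.5 * log((1 + rho1) / (1 - rho1))          # Fisher z transformation
  z2 <- 0.5 * log((1 + rho2) / (1 - rho2))
  z  <- (z1 - z2) / sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
  1 - 2 * pnorm(-abs(z))                            # 1 minus the two-sided p-value
}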

5.2 Absolute difference

Using the definition of ρ̂ given above, one might also think of a more straightforward quality measure, given by the absolute difference, i.e.

φ_abs(Dτ) = |ρ̂^{Dτ} − ρ̂^{Dτ^C}|.

Also, as explained, the target variables should follow a bivariate normal distribution. As this thesis focusses on count data, this will not be the case. Hence, this measure is to be used as an exploratory measure and cannot be used to derive any final results. A shortcoming of this measure is that it does not take the coverage of a description into account, which can result in overfitting. Therefore, it is convenient to also take the entropy of a given description into account. This can be done by the entropy function φ_ef, given by

φ_ef(Dτ) = −(Nτ/N) · log(Nτ/N) − ((N − Nτ)/N) · log((N − Nτ)/N).

Both measures combined result in

φ_ent(Dτ) = φ_ef(Dτ) · φ_abs(Dτ).
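For completeness, these exploratory measures translate to R as follows (again an illustrative sketch, with subgroup a logical membership vector).

# Entropy-weighted absolute difference of Pearson correlations.
phi_abs <- function(r1, r2, subgroup) {
  abs(cor(r1[subgroup], r2[subgroup]) - cor(r1[!subgroup], r2[!subgroup]))
}

phi_ef <- function(subgroup) {
  p <- mean(subgroup)                       # N_tau / N
  -p * log(p) - (1 - p) * log(1 - p)        # undefined (NaN) for empty or full subgroups
}

phi_ent <- function(r1, r2, subgroup) phi_ef(subgroup) * phi_abs(r1, r2, subgroup)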


Despite being a fine quality measure for data which is (at least closely) normally distributed, this measure will probably not deliver any significant results on count data, since count responses are discrete and non-negative, whereas the normal distribution satisfies neither of these constraints. In addition to this analysis, one might look at the Pearson correlation coefficient and simply assume the means and the data itself to be Poisson distributed. However, in Chapter 6 of [27] it is mentioned that “normal correlation analyses should be limited to situations in which (r1, r2) is (at least very nearly) normal.” In the same work, several strong conclusions are drawn on why the quality of analyzing non-normal data with the normal correlation coefficients is poor. Therefore, other options for the quality measure should be analyzed, in which it might be possible to incorporate the count properties of our data.

5.3 Correlation in bivariate Poisson regression

The bivariate Poisson regression model is described in Section 3.2.2. In this quality measure, a specific statistic from this model is used. Recall that the EM algorithm takes the observed data (both predictors and responses) as input and returns the estimates of the different β's of the latent variables. The measure specifically looks at the correlation between the observed responses and hence one is interested in:

\hat{\rho}_i = \frac{\lambda_{i3}}{\sqrt{(\lambda_{i1} + \lambda_{i3}) \cdot (\lambda_{i2} + \lambda_{i3})}},   i = 1, . . . , N_\tau,

where ρ̂_i denotes the correlation for the i-th entry of subgroup Dτ. For the implementation of this quality measure, the bivpois package [24] is used. This package implements the EM algorithm with the matrix A as specified in Section 3.2.2. The algorithm takes the targets of a subgroup as input and gives the estimates of β as output. Since only the β's are estimated after the algorithm has finished, the value of ρ̂ will differ per observation. For the quality measure, I take:

\bar{\rho} = \frac{1}{N_\tau} \sum_{i=1}^{N_\tau} \hat{\rho}_i.

Now, the quality measure is defined by:

φ_bp(Dτ) = |ρ̄_{Dτ} − ρ̄_{Dτ^C}|.

Note that Dτ and Dτ^C are here written as subscripts of ρ̄ to avoid ambiguity between the bar and a superscript. Since this thesis only focusses on the case of count data, the first two quality measures mentioned are not suitable for further analysis, as they assume the target data to be Gaussian. Thus, from here on, the only quality measure used for further analysis is the bivariate Poisson measure φ_bp.
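Assuming the three fitted rate vectors λ_i1, λ_i2, λ_i3 are available as the columns of a matrix (for example from the EM fit provided by the bivpois package, whose exact interface is not reproduced here), the quality measure reduces to a few lines of R.

# Mean within-group correlation from fitted bivariate Poisson rates.
# 'lambda' is an n x 3 matrix with columns lambda_i1, lambda_i2, lambda_i3.
rho_bar <- function(lambda) {
  rho_i <- lambda[, 3] / sqrt((lambda[, 1] + lambda[, 3]) * (lambda[, 2] + lambda[, 3]))
  mean(rho_i)
}

# phi_bp: absolute difference of the mean correlations of a subgroup and its complement.
phi_bp <- function(lambda_subgroup, lambda_complement) {
  abs(rho_bar(lambda_subgroup) - rho_bar(lambda_complement))
}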


6 Simulation and real world data

Now that I have the framework, it is time to verify its implementation. Assessing the accuracy as well as the correctness of the algorithm is done in this section. To verify the quality of the algorithm, it is convenient to test it against some known ground truth. For this, I simulate data and apply the algorithm. After the quality of the algorithm against this ground truth has been assessed, it is used to find subgroups in real-life datasets originating from Premier League football matches.

6.1 Simulation

The quality of the algorithm can be assessed by analyzing data of which the distribution is known. This ground truth can be used to see how accurately the algorithm finds subgroups. To illustrate this, different simulated datasets will be given as input to the algorithm.

The dataset consists of 100 samples of two binary descriptive attributes a = (a1, a2), three predictive attributes p = (p1, p2, p3) and two responses r = (r1, r2). The descriptive and predictive attributes are generated randomly using the sample function of R. The descriptive attributes are picked from {0, 1}. The predictive attributes are picked from {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. Furthermore, the following values for β1, β2 and β3 are chosen:

β1 = (0.066, 0.066, 0.066)

β2 = (0, 0, 0)

β3 = \begin{cases} (0.1, 0.01, 0.003) & \text{if } (a_1, a_2) = (0, 1), \\ (-\infty, -\infty, -\infty) & \text{otherwise.} \end{cases}

Note that β3 = (−∞, −∞, −∞) implies that exp(p^T β3) = 0 and hence λ3 = 0.

Making a distinction in the values of β3 between (a1, a2) = (0, 1) and the other data defines the interesting subgroup. Since the correlation between the targets in the former set is higher than in its complement, it should be deemed interesting by the algorithm when the quality measure from the previous section is used.

From these β's, one can generate Y1, Y2 and Y3 by drawing from Poisson distributions with parameters exp(p^T β1), exp(p^T β2) and exp(p^T β3), respectively. Now, r can be determined by r = AY.
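A sketch of this data-generating process in R, following the description above and implementing the β3 = (−∞, −∞, −∞) case as λ3 = 0:

# Simulate one dataset with an exceptional subgroup at (a1, a2) = (0, 1).
set.seed(123)
n  <- 100
a1 <- sample(0:1, n, replace = TRUE)              # descriptive attributes
a2 <- sample(0:1, n, replace = TRUE)
P  <- matrix(sample(0:10, 3 * n, replace = TRUE), ncol = 3)   # predictors p1, p2, p3

beta1 <- c(0.066, 0.066, 0.066)
beta2 <- c(0, 0, 0)
beta3 <- c(0.1, 0.01, 0.003)

lambda1 <- exp(P %*% beta1)
lambda2 <- exp(P %*% beta2)
lambda3 <- ifelse(a1 == 0 & a2 == 1, exp(P %*% beta3), 0)     # beta3 = -Inf outside the subgroup

Y1 <- rpois(n, lambda1)
Y2 <- rpois(n, lambda2)
Y3 <- rpois(n, lambda3)                                       # rpois with rate 0 is always 0
r1 <- Y1 + Y3
r2 <- Y2 + Y3
sim <- data.frame(a1, a2, p1 = P[, 1], p2 = P[, 2], p3 = P[, 3], r1, r2)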

To test the quality of the EMM framework, data is first generated using the above description. After this, the EMM framework with φ_bp is applied to the simulated data. The output of the algorithm is the set of subgroups with the highest quality, which should hence contain the subgroup with (a1, a2) = (0, 1). One run is obviously not enough to get a view of the quality of the algorithm; therefore, this procedure is repeated 1000 times. During these iterations, five pieces of information are gathered. For this analysis, the true subgroup ((a1, a2) = (0, 1)) is denoted by DT and the subgroup with the highest quality found by the EMM framework is denoted by DB. The gathered information consists of:

1. The description of DB;

2. φbp: The quality of DB;


Figure 4: Bar plot showing the frequency of different descriptions.

3. R1 = |DT \ DB| / |DT|: the number of elements of DT which are not present in DB, relative to the size of DT;

4. R2 = |DB \ DT| / |DB|: the number of elements of DB which are not present in DT, relative to the size of DB;

5. R3 = |DT ∩ DB| / sqrt(|DT| · |DB|): the number of elements in the intersection of DT and DB, relative to the geometric mean of the sizes of both sets.

The results regarding the description of DB are found in Figure 4. As the descriptions typically are long, they are mapped to a more compact notation. In this notation, a 0 or 1 in the first position means that a1 is 0 or 1, respectively, in that subgroup; the second position is handled analogously for a2. A bar (-) means that the attribute is not specified and can be either 0 or 1. For example, ‘01’ is our true subgroup (a1, a2) = (0, 1) and ‘-0’ means a2 = 0. From the figure, one can see that DB is mostly given by the correct description. More specifically, in 46.5% of the cases, the correct subgroup is found as the best subgroup.

Next, a histogram of the quality φ_bp of DB over the different runs is made. The resulting qualities are shown in Figure 5. In this histogram, one can identify a bell curve in the data, which might indicate that φ_bp approximately follows a normal distribution. To test this, the Shapiro-Wilk test for normality [36] is used. This procedure tests the hypothesis that a dataset is normally distributed. The test returned a p-value of 2.951 · 10−12; hence I conclude that φ_bp is not normally distributed.
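In R, this test is available as shapiro.test(); a minimal sketch, assuming the 1000 recorded qualities are stored in a vector phi_values:

# Shapiro-Wilk normality test on the recorded best-subgroup qualities.
# phi_values is assumed to hold the 1000 phi_bp values gathered over the runs.
shapiro.test(phi_values)   # a p-value far below 0.05 rejects normality of phi_bp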


Figure 5: Histogram showing the frequencies of the highest quality.

Last, the different ratios of elements in DT and DB were analyzed. The boxplots containing the results are found in Figure 6. From these measures, I can review the amount of overlap between the found subsets. If, for example, DB is larger than DT and DB ∩ DT = DT, then one might also see a higher quality, since the exceptional subgroup is entirely contained in the best subgroup.

Figure 6: Boxplots showing the latter three measures (R1, R2 and R3).

Now that the quality of the algorithm is known for this particular case of simulated data, it is good to see what the influence is of changing different parameters of the simulated dataset. The following properties of the simulated dataset will be varied:

• the average value of ρ̄_{DT};
• the expected size of DT;
• the number of descriptive attributes in ā (k).

All of these properties will be tested using N = 50 or N = 100 observations, and the same quantities as in the above analysis will be assessed. The results of this analysis are found in Table 3. In this table, the mean and standard deviation of the quality of the best scoring subgroup are given, together with the mean and standard deviation of the three aforementioned ratios between the best and the true subgroup (denoted by R1, R2, R3). Last, the percentage of runs in which the best subgroup was equal to the true subgroup is reported.

Clearly, increasing the average value of ρ by increasing the values of β3 shows a significant increase in the quality of the algorithm for both sample sizes. To test the influence of the other parameters on the quality of the model, the value of β3 which results in an average ρ of 0.7 is used. Increasing the number of descriptors shows a significant decrease in the quality of the algorithm. This is interesting, since it implies that a more complex search space will result in the algorithm failing to find the true subgroup. However, interestingly, the best subgroup with k = 3 was, on average, given a higher quality than the best subgroup from the set with k = 2. Also, increasing the size of the interesting subgroup shows a decrease in the quality of the algorithm. The new subgroup is defined by a1 = 0 and hence is less specialized than the earlier explored one. Surprisingly, the new subgroup is never chosen as the best subgroup. Nevertheless, one can see that the means of the ratios R1, R2 and R3 are all around 0.5. This indicates that the best chosen sets, on average, overlapped half with the true subgroup. Indeed, the subgroups (a1, a2) = (0, 0) and (a1, a2) = (0, 1) are chosen in 79.2% of the runs when N = 50 and in 84.9% when N = 100. From this, I conclude that the chosen subgroup, although not the true subgroup, can still be deemed exceptional and the algorithm hence gives useful results.


β3                    | ρ̄_DT | N   | |DT| | k | φ_bp         | R1          | R2          | R3          | % correct
(−0.1, −0.07, −0.002) | 0.2  | 50  | 12   | 2 | 0.39 ± 0.12  | 0.85 ± 0.35 | 0.85 ± 0.35 | 0.15 ± 0.35 | 14.6%
(0.1, 0.01, 0.003)    | 0.5  | 50  | 13   | 2 | 0.43 ± 0.11  | 0.58 ± 0.49 | 0.58 ± 0.49 | 0.42 ± 0.49 | 41.8%
(0.2, 0.08, 0.007)    | 0.7  | 50  | 16   | 2 | 0.52 ± 0.09  | 0.25 ± 0.43 | 0.25 ± 0.43 | 0.75 ± 0.43 | 74.8%
(0.2, 0.2, 0.2)       | 0.92 | 50  | 17   | 2 | 0.65 ± 0.06  | 0.04 ± 0.19 | 0.04 ± 0.19 | 0.96 ± 0.19 | 95.9%
(−0.1, −0.07, −0.002) | 0.2  | 100 | 26   | 2 | 0.35 ± 0.10  | 0.91 ± 0.29 | 0.91 ± 0.29 | 0.10 ± 0.29 | 30.2%
(0.1, 0.01, 0.003)    | 0.5  | 100 | 27   | 2 | 0.39 ± 0.09  | 0.53 ± 0.50 | 0.53 ± 0.50 | 0.47 ± 0.50 | 47.2%
(0.2, 0.08, 0.007)    | 0.7  | 100 | 23   | 2 | 0.49 ± 0.08  | 0.13 ± 0.33 | 0.13 ± 0.33 | 0.87 ± 0.33 | 87.4%
(0.2, 0.2, 0.2)       | 0.92 | 100 | 22   | 2 | 0.64 ± 0.05  | 0.00 ± 0.04 | 0.00 ± 0.04 | 1.00 ± 0.04 | 99.8%
(0.2, 0.08, 0.007)    | 0.7  | 50  | 16   | 2 | 0.52 ± 0.09  | 0.25 ± 0.43 | 0.25 ± 0.43 | 0.75 ± 0.43 | 74.8%
(0.2, 0.08, 0.007)    | 0.7  | 50  | 25   | 3 | 0.57 ± 0.65  | 0.81 ± 0.39 | 0.81 ± 0.39 | 0.19 ± 0.39 | 18.6%
(0.2, 0.08, 0.007)    | 0.7  | 100 | 23   | 2 | 0.49 ± 0.08  | 0.13 ± 0.33 | 0.13 ± 0.33 | 0.87 ± 0.33 | 87.4%
(0.2, 0.08, 0.007)    | 0.7  | 100 | 13   | 3 | 0.61 ± 0.084 | 0.76 ± 0.43 | 0.76 ± 0.43 | 0.24 ± 0.43 | 23.0%
(0.2, 0.08, 0.007)    | 0.7  | 50  | 16   | 2 | 0.52 ± 0.09  | 0.25 ± 0.43 | 0.25 ± 0.43 | 0.75 ± 0.43 | 74.8%
(0.2, 0.08, 0.007)    | 0.7  | 50  | 51   | 2 | 0.53 ± 0.07  | 0.51 ± 0.50 | 0.51 ± 0.50 | 0.49 ± 0.50 | 0%
(0.2, 0.08, 0.007)    | 0.7  | 100 | 23   | 2 | 0.49 ± 0.08  | 0.13 ± 0.33 | 0.13 ± 0.33 | 0.87 ± 0.33 | 87.4%
(0.2, 0.08, 0.007)    | 0.7  | 100 | 28   | 2 | 0.57 ± 0.08  | 0.50 ± 0.50 | 0.50 ± 0.50 | 0.50 ± 0.50 | 0%

Table 3: Statistics of simulated data.

6.2 Premier League data

Now that the quality of the algorithm is known, it is possible to apply it to real-world data and derive useful results. For this, data originating from the English Premier League football season of 2010 [16] is used. This dataset contains information on all matches played during that season: 380 records with 19 attributes, which will later be divided into descriptors, predictors and responses. Per match, the following data is available:

• Date
• Home team
• Away team
• Referee
• FTHG/FTAG - (target attributes) Full time home/away team goals
• FTR - Full time result
• HS/AS - Home/away team shots
• HST/AST - Home/away team shots on target
• HF/AF - Home/away team fouls committed
• HC/AC - Home/away team corners
• HY/AY - Home/away team yellow cards
• HR/AR - Home/away team red cards

All attributes other than the Date, Home team name, Away team name, FTR and Referee name are counts of events. The most interesting question on this dataset is of course: is there a subgroup of matches in which the result of the match is exceptional with respect to the other matches?

6.2.1 Data Cleaning and filtering

I did not find any suspicious records in the dataset. However, there is a lot of redundancy between the Full Time Result FTR and the targets, so FTR is excluded. Also, because the dates on which matches were played are widely dispersed, the Date attribute was omitted.


6.2.2 Results

For this analysis, the parameters as specified in Section 4 were used. The attributes are divided into descriptors = {Home team, Away team, Referee}, predictors = {HS, AS, HST, AST, HF, AF, HC, AC, HY, AY, HR, AR} and responses = {FTHG, FTAG}. The resulting exceptional subgroups were:

• Subgroup 1a
  – Description: Referee != L Mason, AwayTeam != Sunderland, Referee != P Walton
  – Quality: 0.0342
  – Subgroup size: 313
• Subgroup 2a
  – Description: Referee != L Mason, AwayTeam != Sunderland, Referee != H Webb
  – Quality: 0.0287
  – Subgroup size: 312
• Subgroup 3a
  – Description: Referee != L Mason, AwayTeam != Sunderland, AwayTeam != Newcastle
  – Quality: 0.0266
  – Subgroup size: 323
• Subgroup 4a
  – Description: Referee != L Mason, AwayTeam != Sunderland, HomeTeam != West Brom
  – Quality: 0.0260
  – Subgroup size: 322
• Subgroup 5a
  – Description: Referee != L Mason, AwayTeam != Sunderland, HomeTeam != Man City
  – Quality: 0.0258
  – Subgroup size: 322

From here, one can see that matches which do not have L Mason as referee and do not have Sunderland as away team have an exceptional correlation compared to the rest of the data. More specifically, the results show that the correlation between the scores is higher in the subgroup than in its complement. This indicates that matches involving this referee or this away team have a lower correlation. A more detailed view on the subgroup with the highest quality might give some understanding of why this subgroup differs that much from the rest. For this, it is interesting to see whether there are any significant differences in behavior between the subgroup and its complement. To do this, the distributions of the predictive attributes are analyzed. The difference in distribution of an attribute between the subgroup and its complement can be assessed using a Kolmogorov-Smirnov test [26], as sketched in the snippet below. Testing all attributes in this way, the attribute AY (Away Yellow) is identified as significantly different, with a p-value of 0.0591. Histograms of the counts in the subgroup and its complement are found in Figure 7. Although they have a similar shape, the distribution differs between both sets. Apart from this attribute, no significant differences in distribution were found.
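A sketch of this per-attribute comparison in R (the data frame premier and the logical vector in_subgroup marking membership of Subgroup 1a are assumed names, used only for illustration):

# Compare the distribution of each predictive attribute between the best
# subgroup and its complement with a two-sample Kolmogorov-Smirnov test.
predictors <- c("HS", "AS", "HST", "AST", "HF", "AF", "HC", "AC", "HY", "AY", "HR", "AR")
p_values <- sapply(predictors, function(att) {
  ks.test(premier[[att]][in_subgroup], premier[[att]][!in_subgroup])$p.value
})
sort(p_values)    # attributes with the smallest p-values deviate most in distribution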


Figure 7: Histograms of AY in Subgroup 1a and its complement.

Moreover, simple statistics on Lee Mason show that he was the referee in 23 of the 380 matches of the 2010 season. With a total of 18 different referees and 380 matches, a referee officiates 21 matches on average, so Lee Mason is slightly above this average. Interestingly, just one of the matches he refereed involved Sunderland as either home or away team, which suggests that the two attributes were chosen for the best subgroup independently of each other. For Sunderland as away team, one can see that the matches they played were mostly led by Phil Dowd (three times); apart from this, there is no clear pattern in the referees of matches with Sunderland as away team. Expanding this research by changing the beam depth from 3 to 5 yields the following results:

• Subgroup 1b
  – Description: Referee != P Walton, HomeTeam != Newcastle, HomeTeam != West Brom, AwayTeam != Sunderland, Referee != M Atkinson
  – Quality: 0.0774
  – Subgroup size: 276
• Subgroup 2b
  – Description: Referee != P Walton, HomeTeam != Newcastle, HomeTeam != West Brom, AwayTeam != Sunderland, Referee != A Marriner
  – Quality: 0.0719
  – Subgroup size: 275
• Subgroup 3b
  – Description: Referee != P Walton, HomeTeam != Newcastle, HomeTeam != West Brom, HomeTeam != Wolves, Referee != A Marriner
  – Quality: 0.0689
  – Subgroup size: 272
• Subgroup 4b
  – Description: Referee != P Walton, HomeTeam != Newcastle, HomeTeam != West Brom, HomeTeam != Wolves, Referee != M Atkinson
  – Quality: 0.0689
  – Subgroup size: 276
• Subgroup 5b
  – Description: Referee != L Mason, AwayTeam != Sunderland, HomeTeam != Fulham, AwayTeam != Newcastle, Referee != M Atkinson
  – Quality: 0.0681
  – Subgroup size: 286

From these subgroups, we find more interesting subgroups, and one can see that more teams and referees are deemed exceptional. Moreover, one can clearly see the structure of beam search here. Looking at beam depth 4, the results originate from 3 different beams, namely:

• Referee != P Walton, HomeTeam != Newcastle, HomeTeam != West Brom, AwayTeam != Sunderland
• Referee != P Walton, HomeTeam != Newcastle, HomeTeam != West Brom, HomeTeam != Wolves
• Referee != L Mason, AwayTeam != Sunderland, HomeTeam != Fulham, AwayTeam != Newcastle

Setting the depth to a higher number results in the chosen subgroups having a different referee in the first layer of the beam search: now, Peter Walton is up front in most of the chosen subgroups. Sunderland is still mentioned in three of the five groups. Simple statistics on Peter Walton show that he was the referee in 28 out of 380 matches. This is above the average of 21, indicating that he is a more experienced referee and hence should be able to lead a fairer game. The new teams in the subgroups are Newcastle and West Brom; for these teams, however, no exceptional simple statistics were found. Again, one can assess the differences in distribution of the attributes between the best subgroup and its complement using the Kolmogorov-Smirnov test [26]. The same analysis this time yields a significant difference in the distribution of the attribute AC (Away Corner), with a p-value of 0.0481. The histograms of AC in the best subgroup and its complement are found in Figure 8. Again, the two figures show a similar shape; according to the test, however, they have different distributions.

Figure 8: Histograms of AC in Subgroup 1b and its complement.
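For reference, the depth-5 run above was obtained by changing only the hard-coded search parameters inside the EMM function listed in the appendix; a minimal sketch of that change:

#Inside the EMM function from the appendix, the beam search parameters are
#hard-coded; rerunning the analysis at a larger depth only requires changing
#beam_depth (the other parameters are left as in the depth-3 run).
beam_width  <- 100
beam_depth  <- 5    #was 3 for Subgroups 1a-5a
result_size <- 5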


As this thesis is based on the work done in [34], it is interesting to review both results and identify similarities and differences between the two approaches. In [34], outlier detection is applied using Bayesian Networks. As explained, outlier detection concerns building a model (in this case a Bayesian Network) on the whole dataset and finding elements which do not comply with this model. For this, the paper introduces a new metric to quantify the extent to which an element is an outlier with respect to the model. As an example, the authors use the Premier League dataset with slightly different information per match. As a result, they find specific players of teams which show outlying behavior from the model on the whole dataset.

Reviewing that analysis against the work in this thesis shows that the same data can be analyzed in different manners. Whereas [34] looked at individual outliers in the data, this thesis focused on subgroups with outlying behavior. Despite the different approaches, both methods show significant results in revealing exceptional (or outlying) behavior. When looking at applications, the work of [34] could be used to find individual exceptionally behaving objects: it could identify which football players show positively outstanding or fraudulent behaviour, or which computer in a network might be used by hackers to attack an information system. The method of this thesis, in turn, could identify groups of football players or groups of computers used by hackers. Both methods are valuable for finding outstanding behavior.


7 Conclusion

In this thesis, one has seen an application of the Exceptional Model Mining framework. In EMM, a subset of the data is deemed interesting when the model trained on its target attributes deviates significantly from the model trained on either its complement or the whole dataset. Combining this notion with a specific type of data narrows down the possible models that can be applied. Here, the EMM framework was applied to bivariate count data. Since count data represent the number of events that took place in a specific amount of time, this kind of data is associated with distributions having the same property. From this, the Poisson distribution was selected as a suitable model for the EMM framework. Poisson regression was introduced such that the responses depend on their corresponding predictors. Due to the bivariate nature of the data used in this thesis, a bivariate Poisson distribution and its corresponding regression model were introduced.

The quality of the EMM framework with the bivariate Poisson regression model was assessed using simulated data. Using a known ground truth, I could assess how sensitive the algorithm is to changes in the size of the input data, the input parameters β, the number of descriptive attributes and the size of the true exceptional subgroup. From this analysis, I found that the algorithm behaves as expected when the average correlation between the two responses is varied. I also found that, when the true exceptional subgroup can be specialized to a smaller one (by refining on its descriptors), the algorithm is less likely to pick the true subgroup and will rather choose the refined one.

Knowing the quality of the method, I used real-life data from the English Premier League of 2010. Applying the algorithm to this dataset indicated that the correlation between the final scores depends on the referee or the away team of a football match. It was particularly interesting that the absence of a specific referee or club corresponded to a higher correlation, which might imply that this referee and this team play a more ‘fair’ game. In the same manner, my method could be used for fraud detection in other sports matches in the future. The results were concluded with a comparison with the previous publication on football data. From this, I concluded that the works differ in approach, but are both useful for future applications regarding outlier detection and subgroup discovery on count data.


8 Future work

As explained, this thesis is limited to the application of the Exceptional Model Mining framework with the bivariate Poisson regression model to football data. However, more distributions might be suitable for bivariate count data, for example the negative binomial distribution, which models the number of successes in i.i.d. Bernoulli trials before a specified number of failures occurs. Closely related to this (and also a member of the generalized linear models) is the negative binomial regression model [40]. Whereas the Poisson distribution has the property that the mean of a random variable equals its variance, the negative binomial regression model provides the option of a variance that is not tied to the mean. This might result in refinements of the parameter choices and hence more accurate results; a minimal sketch of such a model fit is given at the end of this section.

Apart from choosing other models for the EMM framework, one might also look at different applications of this method. This thesis focused on sports data, but one might also look at cases from the medical world or other fields of expertise. For example, one might look at the correlation between different diseases and observe exceptional behaviour in subgroups formed by the different medication given to the patients. A model can then be formed using patient demographics, and exceptional properties might be found.
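As a first step in that direction, the sketch below fits the two responses marginally with negative binomial regression using glm.nb from the MASS package, on the football data frame input_data from the appendix. This is only a sketch under those assumptions; a genuinely bivariate negative binomial analogue of lm.bp would still require a dedicated implementation.

#Minimal sketch: marginal negative binomial regressions for the two responses,
#as a starting point for replacing the Poisson model in the quality measure.
#Assumes the data frame 'input_data' from the appendix.
library(MASS)

nb_home <- glm.nb(FTHG ~ HS + AS + HST + AST + HF + AF + HC + AC + HY + AY,
                  data = input_data)
nb_away <- glm.nb(FTAG ~ HS + AS + HST + AST + HF + AF + HC + AC + HY + AY,
                  data = input_data)

#The estimated dispersion parameter theta indicates how far the variance
#deviates from the Poisson assumption (variance = mu + mu^2/theta).
nb_home$theta
nb_away$theta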


References

[1] F Abramovich and Y Ritov. Statistical Theory: A Concise Introduction. CRC Texts in Statistical Science, 2013.
[2] A Agresti. Categorical Data Analysis. Wiley, 1990.
[3] M Atzmueller. Subgroup discovery. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 5(1):35–49, 2015.
[4] M Atzmueller, F Puppe, and H P Buscher. Exploiting background knowledge for knowledge-intensive subgroup discovery. IJCAI International Joint Conference on Artificial Intelligence, pages 647–652, 2005.
[5] P Berkhout. A bivariate Poisson count data model using conditional probabilities. Statistica Neerlandica, 58(3):349–364, 2004.
[6] A C Cameron and P K Trivedi. Regression Analysis of Count Data. Cambridge University Press, April 2012.
[7] H Chen and J K Lindsey. Applying Generalized Linear Models. Springer, 1998.
[8] A Collier. liqueueR: Implements Queue, PriorityQueue and Stack Classes, 2016.
[9] A P Dempster, N M Laird, and D B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1–38, 1977.
[10] A J Dobson. Introduction to Generalized Linear Models. CRC Press, 2000.
[11] W Duivesteijn, A J Feelders, and A Knobbe. Different slopes for different folks: mining for exceptional regression models with Cook's distance. Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 868–876, 2012.
[12] W Duivesteijn, A J Feelders, and A Knobbe. Exceptional Model Mining: Supervised descriptive local pattern mining with complex target concepts. Data Mining and Knowledge Discovery, 30(1):47–98, 2016.
[13] W Duivesteijn, A Knobbe, A J Feelders, and M van Leeuwen. Subgroup discovery meets Bayesian networks - An Exceptional Model Mining approach. Proceedings - IEEE International Conference on Data Mining, ICDM, pages 158–167, 2010.
[14] W Duivesteijn, E Loza Mencía, J Fürnkranz, and A Knobbe. Multi-label LeGo - enhancing multi-label classifiers with local patterns. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 7619 LNCS:114–125, 2012.
[15] W Feller. An Introduction to Probability Theory and Its Applications. Wiley, 1959.
[16] Football-Data. The Premier League dataset. http://www.football-data.co.uk/englandm.php. Accessed: 2017-07-04.
[17] R Goebel. Lecture Notes in Artificial Intelligence, Subseries of Lecture Notes in Computer Science, LNAI Series Editors. 2011.
[18] W Greene. Functional forms for the negative binomial model for count data. Economics Letters, 99(3):585–590, 2008.
[19] H Grosskreutz and S Rüping. On subgroup discovery in numerical domains. Data Mining and Knowledge Discovery, 19(2):210–226, 2009.
[20] S Gurmu and J Elder. Generalized bivariate count data regression models. 68:31–36, 2000.


[21] F Herrera, C J Carmona, P González, and M J del Jesus. An overview on subgroup discovery: foundations and applications. Knowledge and Information Systems, 29(3):495–525, 2011.
[22] M Hocine, D Guillemot, P Tubert-Bitter, and T Moreau. Testing independence between two Poisson-generated multinomial variables in case-series and cohort studies. Statistics in Medicine, 24(24):4035–4044, 2005.
[23] D Karlis and L Meligkotsidou. Multivariate Poisson regression with covariance structure. Statistics and Computing, 15(4):255–265, 2005.
[24] D Karlis and I Ntzoufras. Bivariate Poisson and Diagonal Inflated Bivariate Poisson Regression Models in R. Journal of Statistical Software, 30(April):1–3, 2009.
[25] W Klösgen. Explora: A multipattern and multistrategy discovery assistant. In Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 249–271. American Association for Artificial Intelligence, Menlo Park, CA, USA, 1996.
[26] A Kolmogorov. Sulla determinazione empirica di una legge di distribuzione. G. Ist. Ital. Attuari, 4:83–91, 1933.
[27] C J Kowalski. On the Effects of Non-Normality on the Distribution of the Sample Product-Moment Correlation Coefficient. Journal of the Royal Statistical Society, Series C (Applied Statistics), 21(1):1–12, 1972.
[28] V Kumar. Parallel and Distributed Computing for Cybersecurity. IEEE Distributed Systems Online, 6(10):1–9, 2005.
[29] N Lavrač, B Kavšek, P Flach, and L Todorovski. Subgroup Discovery with CN2-SD. The Journal of Machine Learning Research, 5:153–188, 2004.
[30] D Leman, A J Feelders, and A Knobbe. Exceptional model mining. In Proceedings of ECML/PKDD, vol. 2, pages 1–16, 2008.
[31] W R Mebane and J S Sekhon. Robust Estimation and Outlier Detection for Overdispersed Multinomial Models of Count Data. American Journal of Political Science, 48(2):392–411, 2004.
[32] W L Poston. Discrete multivariate distributions. Technometrics, 40(2):161–162, 1998.
[33] D R Reddy. Speech understanding systems: summary of results of the five-year research effort at Carnegie-Mellon University. page 181, 1977.
[34] F Riahi and O Schulte. Model-Based Outlier Detection for Object-Relational Data. In 2015 IEEE Symposium Series on Computational Intelligence, pages 1590–1598, December 2015.
[35] I Sabuncuoglu and M Bayiz. Job shop scheduling with beam search. European Journal of Operational Research, 118(2):390–412, 1998.
[36] S S Shapiro and M B Wilk. An Analysis of Variance Test for Normality (Complete Samples). Biometrika, 52(3/4):591, 1965.
[37] M van Leeuwen. Maximal exceptions with minimal descriptions. Data Mining and Knowledge Discovery, 21(2):259–276, 2010.
[38] Wikipedia. Expectation–maximization algorithm — Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Expectation–maximization algorithm&oldid=783069174. Accessed: 2017-05-25.
[39] S Wrobel. Inductive Logic Programming for Knowledge Discovery in Databases. Springer Berlin Heidelberg, Berlin, Heidelberg, 2001.
[40] T W Yee. Vector Generalized Linear and Additive Models. Springer.


[41] A Zeileis. Econometric Computing with HC and HAC Covariance Matrix Estimators. Journal of Statistical Software, 11(10):1–17, 2004.
[42] A Zeileis. Object-Oriented Computation of Sandwich Estimators. Journal of Statistical Software, 16(9):1–16, 2006.


Appendix: R code

Quality measures

#Note: setdiff() and intersect() are applied to data frames throughout this code;
#this assumes a row-wise implementation such as the one provided by dplyr.

#Specify here which measure you want to use!
phi <- function(set) {
  return(phi_bp(set))
}

#Entropy function (N is read from the global environment)
phi_ef <- function(set) {
  n <- nrow(set)
  return(-n/N * log(n/N) - (N - n)/N * log((N - n)/N))
}

#Correlation statistical test measure
phi_corr <- function(set) {
  n <- nrow(set)
  n_C <- N - n

  set_C <- setdiff(data, set)

  #Fisher z-transformation of the sample correlations
  r_prime <- cor(set$V6, set$V7)
  z_prime <- 1/2 * log((1 + r_prime)/(1 - r_prime))

  r_prime_C <- cor(set_C$V6, set_C$V7)
  z_prime_C <- 1/2 * log((1 + r_prime_C)/(1 - r_prime_C))

  z_star <- (z_prime - z_prime_C)/sqrt(1/(n - 3) + 1/(n_C - 3))

  pValue <- 1 - pnorm(abs(z_star))
  #pValue <- pValue * phi_ef(set)

  return(pValue)
}

#Correlation absolute difference measure
phi_abs <- function(set) {
  set_C <- setdiff(data, set)

  r_prime <- cor(set$V6, set$V7)
  r_prime_C <- cor(set_C$V6, set_C$V7)

  quality <- abs(r_prime - r_prime_C)
  quality <- quality * phi_ef(set)
  return(quality)
}

#Bivariate Poisson measure
phi_bp <- function(set) {
  set_C <- setdiff(data, set)

  if (nrow(set_C) == 0) {
    return(0)
  }

  #Fit a bivariate Poisson regression on the subgroup and on its complement.
  #Football setting:
  model <- lm.bp(FTHG ~ HS+AS+HST+AST+HF+AF+HC+AC+HY+AY,
                 FTAG ~ HS+AS+HST+AST+HF+AF+HC+AC+HY+AY,
                 l3 = ~HS+AS+HST+AST+HF+AF+HC+AC+HY+AY,
                 data = set, pres = 1e-03)
  model_C <- lm.bp(FTHG ~ HS+AS+HST+AST+HF+AF+HC+AC+HY+AY,
                   FTAG ~ HS+AS+HST+AST+HF+AF+HC+AC+HY+AY,
                   l3 = ~HS+AS+HST+AST+HF+AF+HC+AC+HY+AY,
                   data = set_C, pres = 1e-03)
  #Simulation setting (use instead of the football setting when analyzing simulated data):
  #model <- lm.bp(r1 ~ c1+c2+c3, r2 ~ c1+c2+c3, l3 = ~c1+c2+c3, data = set, pres = 1e-03)
  #model_C <- lm.bp(r1 ~ c1+c2+c3, r2 ~ c1+c2+c3, l3 = ~c1+c2+c3, data = set_C, pres = 1e-03)

  m <- nrow(set)
  n <- nrow(set_C)

  #Average estimated correlation in the subgroup
  rho_sum <- 0
  for (i in 1:m) {
    #Football setting:
    vector <- c(1, set$HS[i], set$AS[i], set$HST[i], set$AST[i], set$HF[i],
                set$AF[i], set$HC[i], set$AC[i], set$HY[i], set$AY[i])
    #Simulation setting:
    #vector <- c(1, set$c1[i], set$c2[i], set$c3[i])
    l1 <- exp(vector %*% model$beta1)
    l2 <- exp(vector %*% model$beta2)
    l3 <- exp(vector %*% model$beta3)

    rho_sum <- rho_sum + l3/sqrt((l1 + l3) * (l2 + l3))
  }
  rho_hat <- rho_sum/m

  #Average estimated correlation in the complement
  rho_sum_C <- 0
  for (i in 1:n) {
    #Football setting:
    vector <- c(1, set_C$HS[i], set_C$AS[i], set_C$HST[i], set_C$AST[i], set_C$HF[i],
                set_C$AF[i], set_C$HC[i], set_C$AC[i], set_C$HY[i], set_C$AY[i])
    #Simulation setting:
    #vector <- c(1, set_C$c1[i], set_C$c2[i], set_C$c3[i])
    l1 <- exp(vector %*% model_C$beta1)
    l2 <- exp(vector %*% model_C$beta2)
    l3 <- exp(vector %*% model_C$beta3)

    rho_sum_C <- rho_sum_C + l3/sqrt((l1 + l3) * (l2 + l3))
  }
  rho_hat_C <- rho_sum_C/n

  return(abs(rho_hat - rho_hat_C))
}

Refinement operator

nu <- function(seed) {
  set <- list()

  for (i in 1:length(attributes)) {

    if (attributes[i] == "tar") {
      #Skip target variables
    }

    else if (attributes[i] == "bin") {
      #Binary descriptor: refine on both values
      newset <- seed[seed[i] == "0", ]
      if (!(nrow(newset) == 0 | grepl(paste(colnames(seed)[i], "== 0"), newset$description[1]))) {
        newset$description <- paste(newset$description, "|", colnames(seed)[i], "== 0")
        set[[length(set) + 1]] <- newset
      }

      newset <- seed[seed[i] == "1", ]
      if (!(nrow(newset) == 0 | grepl(paste(colnames(seed)[i], "== 1"), newset$description[1]))) {
        newset$description <- paste(newset$description, "|", colnames(seed)[i], "== 1")
        set[[length(set) + 1]] <- newset
      }
    }

    else if (attributes[i] == "nom") {
      #Nominal descriptor: refine on equality and inequality for each value
      types <- lapply(seed[i][!duplicated(seed[i]), ], as.character)
      for (type in types) {
        newset <- seed[seed[i] == type, ]
        if (!(nrow(newset) == 0 | grepl(paste(colnames(seed)[i], "==", type), newset$description[1]))) {
          newset$description <- paste(newset$description, "|", colnames(seed)[i], "==", type)
          set[[length(set) + 1]] <- newset
        }

        newset <- seed[seed[i] != type, ]
        if (!(nrow(newset) == 0 | grepl(paste(colnames(seed)[i], "==", type), newset$description[1]))) {
          newset$description <- paste(newset$description, "|", colnames(seed)[i], "!=", type)
          set[[length(set) + 1]] <- newset
        }
      }
    }
    else {
      #Numerical attributes not implemented
    }
  }

  return(set)
}

EMM algorithm

library(liqueueR)   #provides the Queue and PriorityQueue classes [8]

EMM <- function(data, attributes) {

  #Load the bivariate Poisson regression functions
  source("bivpois/lm.bp.r")
  source("bivpois/pbivpois.r")
  source("bivpois/newnamesbeta.r")
  source("bivpois/splitbeta.r")

  #EMM algorithm

  #Specify some input values
  N <- nrow(data)   #N is also read globally by phi_ef and phi_corr
  beam_width <- 100
  beam_depth <- 3
  result_size <- 5
  constraints <- NULL

  #Add output columns
  data$description <- ""
  data$quality <- 0

  candidateQueue <- Queue$new()
  candidateQueue$push(data)
  resultSet <- PriorityQueue$new()

  for (depth in 1:beam_depth) {
    beam <- PriorityQueue$new()

    while (!candidateQueue$empty()) {
      seed <- candidateQueue$pop()
      set <- nu(seed)
      for (desc in set) {
        #satisfyAll() checks the (here empty) constraint set; assumed to be defined elsewhere
        if (satisfyAll(desc, constraints)) {
          quality <- phi(desc)
          desc$quality[1] <- quality

          resultSet$push(desc, quality)
          beam$push(desc, quality)
        }
      }

      #Keep at most beam_width candidates for the next search level
      finalBeam <- PriorityQueue$new()
      for (index in 1:beam_width) {
        if (beam$empty()) {
          break
        }
        finalBeam$push(beam$pop())
      }
    }
    while (!finalBeam$empty()) {
      candidateQueue$push(finalBeam$pop())
    }
  }

  #Print results (set to TRUE to print the full result set)
  if (FALSE) {
    for (index in 1:result_size) {
      result <- resultSet$pop()
      cat(paste("Subgroup:", index, "\n",
                "Quality:", result$quality[1], "\n",
                "Size of group:", nrow(result), "\n",
                result$description[1], "\n\n"))
    }
  }
  result <- resultSet$pop()

  return(result)
}

Simulate function

simulate <- function(items, beta3) {
  betaH <- c(0.066, 0.066, 0.066)
  betaA <- c(0, 0, 0)

  #Trivariate reduction: Y1, Y2, Y3 are independent Poisson counts and the
  #responses are (Y1 + Y3, Y2 + Y3), so Y3 induces the correlation
  multiVarPoi <- function(covariates, useI = 0, betaI = c(0, 0, 0)) {
    A <- matrix(c(1, 0, 1,
                  0, 1, 1),
                nrow = 2, ncol = 3, byrow = TRUE)

    if (useI) {
      Y <- matrix(c(rpois(1, exp(betaH %*% covariates)),
                    rpois(1, exp(betaA %*% covariates)),
                    rpois(1, exp(betaI %*% covariates))),
                  nrow = 3, ncol = 1)
    } else {
      Y <- matrix(c(rpois(1, exp(betaH %*% covariates)),
                    rpois(1, exp(betaA %*% covariates)),
                    rpois(1, 0)),
                  nrow = 3, ncol = 1)
    }

    return(A %*% Y)
  }

  #Simulate binary descriptors, count covariates and responses
  simulationBin <- data.frame()
  for (i in 1:items) {
    row <- data.frame("d1" = character(1), "d2" = character(1), "d3" = character(1),
                      "c1" = numeric(1), "c2" = numeric(1), "c3" = numeric(1),
                      "r1" = numeric(1), "r2" = numeric(1))

    descriptive <- c(sample(0:1, 1), sample(0:1, 1), sample(0:1, 1))
    covariates <- c(sample(0:10, 1), sample(0:10, 1), sample(0:10, 1))

    row$d1 <- descriptive[1]
    row$d2 <- descriptive[2]
    row$d3 <- descriptive[3]
    row$c1 <- covariates[1]
    row$c2 <- covariates[2]
    row$c3 <- covariates[3]

    if (descriptive[1] == 0 & descriptive[2] == 1) {
      #True exceptional subgroup: responses get the covariance component beta3
      response <- multiVarPoi(covariates, useI = 1, betaI = beta3)
      row$r1 <- response[1]
      row$r2 <- response[2]
    } else {
      response <- multiVarPoi(covariates)
      row$r1 <- response[1]
      row$r2 <- response[2]
    }
    simulationBin <- rbind(simulationBin, row)
  }
  return(simulationBin)
}

Simulation procedure

phi_list <- list()
desc_list <- list()
m1_list <- list()
m2_list <- list()
m3_list <- list()
attributes <- c("bin", "bin", "bin", "tar", "tar", "tar", "tar", "tar")

iterations <- 1000
number_of_items <- 100
beta <- c(0.2, 0.08, 0.007)
for (i in 1:iterations) {
  data <- simulate(number_of_items, beta)
  print(i)

  output <- EMM(data, attributes)

  #The true exceptional subgroup: d1 == 0 and d2 == 1
  correct <- data[data$d1 == 0, ]
  correct <- correct[correct$d2 == 1, ]

  quality <- output$quality[1]
  description <- output$description[1]

  output <- output[, !(names(output) %in% c("quality", "description"))]

  #Overlap measures between the returned subgroup and the true subgroup
  m1 <- nrow(setdiff(correct, output))/nrow(correct)
  m2 <- nrow(setdiff(output, correct))/nrow(output)
  m3 <- nrow(intersect(output, correct))/sqrt(nrow(output) * nrow(correct))

  m1_list <- rbind(m1_list, m1)
  m2_list <- rbind(m2_list, m2)
  m3_list <- rbind(m3_list, m3)

  phi_list <- rbind(phi_list, quality)
  desc_list <- rbind(desc_list, description)
}

hist(as.double(phi_list), xlab = "Phi Bivariate Poisson", ylab = "Frequency", main = NULL)
boxplot(as.double(m1_list), as.double(m2_list), as.double(m3_list),
        names = c("R1", "R2", "R3"), ylab = "Ratio")
barplot(sort(prop.table(table(unlist(desc_list))), decreasing = TRUE),
        xlab = "Descriptions", ylab = "Ratio")
paste(mean(as.double(phi_list)), sd(as.double(phi_list)),
      mean(as.double(m1_list)), sd(as.double(m1_list)),
      mean(as.double(m2_list)), sd(as.double(m2_list)),
      mean(as.double(m3_list)), sd(as.double(m3_list)),
      sort(table(unlist(desc_list)), decreasing = TRUE)[1]/1000)

Football Analysis Procedure

#Read the football data
input_data <- read.csv("pl.csv", sep = ";")
#Filter out Full Time Result and Date to prevent trivial results
input_data <- input_data[, !(names(input_data) %in% c("FTR", "Date"))]

#Characterize the attributes: three nominal descriptors, the remaining columns
#are target attributes (predictors and responses) skipped by the refinement operator
input_attributes <- c("nom", "nom", "nom", "tar", "tar", "tar", "tar", "tar", "tar",
                      "tar", "tar", "tar", "tar", "tar", "tar", "tar", "tar")

#The quality measures and the refinement operator read 'data' and 'attributes'
#from the global environment
data <- input_data
attributes <- input_attributes

EMM(input_data, input_attributes)
