Eindhoven University of Technology

BACHELOR

Regression models on bivariate count data in an exceptional model mining context

Raaijmakers, Boy A.J.


Regression Models on Bivariate Count Data in an Exceptional Model Mining Context

Bachelor Final Project - 2WH40

B.A.J. (Boy) Raaijmakers - 0845644

Supervision: dr. P.J. (Paulo) De Andrade Serra dr. W. (Wouter) Duivesteijn

Eindhoven, Friday 14th July, 2017

Abstract

Exceptional Model Mining (EMM) is a special form of Subgroup Discovery. It concerns the discovery of ‘interesting’ patterns in datasets using constraints on descriptive attributes to cluster target variables. More specifically, EMM assumes that the target variables adhere to a model which can be induced on the data itself. Using this property, this thesis introduces a way of analyzing bivariate count data by means of bivariate Poisson regression. Moreover, an implementation in R is made. This algorithm is applied to both simulated data and real-life Premier League football datasets.


Contents

1 Introduction
  1.1 History
  1.2 Significance
  1.3 Related work
  1.4 Contributions
  1.5 Structure

2 Subgroup Discovery & Exceptional Model Mining
  2.1 Subgroup Discovery
    2.1.1 Formal definition
    2.1.2 Quality measure
  2.2 Exceptional Model Mining
    2.2.1 Beam search

3 Regression models on count data
  3.1 Regression models
  3.2 Bivariate count data
    3.2.1 Poisson regression
    3.2.2 Bivariate Poisson distribution

4 Implementation of the framework
  4.1 The EMM framework
  4.2 Refinement Operator
  4.3 Beam sizes, result set size, constraints
  4.4 Used packages

5 Quality measure
  5.1 Correlation coefficient
  5.2 Absolute difference
  5.3 Correlation in bivariate Poisson regression

6 Simulation and real world data
  6.1 Simulation
  6.2 Premier League data
    6.2.1 Data Cleaning and filtering
    6.2.2 Results

7 Conclusion

8 Future work


1 Introduction

Data mining concerns the analysis of data to derive interpretable results from it. Examples of data mining are predictive analysis, pattern mining and clustering. A specific case of data mining is Subgroup Discovery. When analyzing a dataset, Subgroup Discovery concerns finding subsets of the data on which some variable of interest shows deviating behaviour. With this, one can detect interesting patterns in datasets; examples are Subgroup Discovery in social media [3], traffic accidents [29], the medical world [4] and many others. Exceptional Model Mining (EMM) [12] is a special form of Subgroup Discovery [39]. It concerns the discovery of ‘interesting’ patterns in datasets using constraints on descriptive attributes to cluster target variables. More specifically, EMM assumes the target variables to adhere to a statistical model which can be induced from the data itself [12]. In this thesis, I will elaborate on descriptions for the discovery of subgroups, using the Exceptional Model Mining framework as proposed in [12]. More specifically, the framework will be applied to bivariate count data. The techniques will be applied to both simulated and real-world datasets. I also review previous work on the analysis of count data. In [34], count data (there referred to as object-relational data) is analyzed using graphical models (specifically, Bayesian networks) to find elements of a dataset whose properties substantially deviate from the whole set. This thesis applies the EMM framework to the same situation in order to gain new insights into the same kind of data. After introducing the EMM framework and assessing its quality, the findings of this thesis will be compared to the ones from [34] to identify the differences.

1.1 History

As mentioned above, the umbrella field of the discussed EMM framework is Subgroup Discovery. Subgroup Discovery is a knowledge discovery method which can be used to explain and describe data [3]. Earlier work on Subgroup Discovery in, for example, medical and technical domains can be found in [19], [29], [4] and many others. To my knowledge, the first notion of Subgroup Discovery, together with the first discovery algorithm EXPLORA, was proposed in 1996 by Klösgen in [25]. The Exceptional Model Mining framework was first introduced by Leman et al. in 2008 [30]. Later, it was rewritten and applied in [12] and [13]. Since the framework is relatively new, its behavior has not yet been formally analyzed. There is, however, some related work on the framework, which will be discussed in Section 1.3 of this thesis. Regression models are an older concept; far-reaching documentation on the subject is found in, for example, [2], [10] and [18]. Due to the exhaustive research behind these models, they are relatively well understood. Also, several implementations in well-known programming languages (e.g. R, Python) are available for simulation and application (examples are [40] and [24]).

1.2 Significance

Exceptional Model Mining is a framework introduced to provide a new viewpoint on Subgroup Discovery. EMM assumes adherence of the response variables to a statistical model (as further explained in Section 2.2). This extension to Subgroup Discovery opens doors to new data analysis techniques by providing methods that can deal with an underlying model on the variables of interest. An example in which data is analysed using statistical models is found in the pharmaceutical industry [22].


1.3 Related work

Related work is, among others, found in the field of traditional Subgroup Discovery. As explained, first introduced by [25], traditional Subgroup Discovery concerns describing subsets of data of which one knows that some target variable takes a different value per part of the data. The limitation of this method, as stated in [21], is that there is just one variable of interest. Although proven to be applicable in many different contexts [21], having a single variable of interest comes with limitations. Closely related to Subgroup Discovery lies the field of outlier (or anomaly) detection. Whereas Subgroup Discovery considers interesting subgroups of arbitrary size, outlier detection focuses on rather small subsets of (not necessarily statistically related) deviating data. Moreover, whereas Subgroup Discovery distinguishes between different subsets of data, outlier detection focuses on constructing a model on the complete dataset only. After construction, all data points not adhering to this model are classified as outliers. An example of this is found in [28], where outlier detection is used to secure networks and identify hackers using the deviating behavior of their computers. [31] describes a method for multinomial outlier detection. The work in this thesis is based on the findings of a paper which concerned outlier detection on object-relational data [34]. In this thesis, I will use the findings of this previous work and review the conclusions from the angle of Exceptional Model Mining. Since EMM is a rather new framework, not much has been published on the subject. The initial framework was published by [30] and later refined by [12]. In the meantime, several publications were made on the framework [13], [37], [17], [14]. More closely related to this thesis, [11] already handles the case of EMM in combination with regression models. To the author's knowledge, no other research has been done regarding multivariate count data in Subgroup Discovery or EMM. For the case of count data, several papers and books have been published. An extensive overview of regression models on count data is given in [6]. Furthermore, among many others, [20] analyzes health-care utilization in Australia in 1977-1978 using count data models, and [5] writes about Dutch day-trips in 1991, which can be analyzed using a bivariate Poisson model.

1.4 Contributions

The main contributions of this thesis are:

• Introduce an R-based implementation of the Exceptional Model Mining framework.
• Extend the EMM toolbox with the bivariate Poisson regression model to provide means for finding exceptional behavior in data with bivariate count targets.

1.5 Structure

This thesis is structured as follows. Section 2 describes the process of Subgroup Discovery and in particular the Exceptional Model Mining framework. In Section 3, the bivariate Poisson regression model is presented. Section 4 describes the implementation of the EMM framework. Section 5 introduces different quality measures for EMM. The implementation and measures combined are applied to simulated and real-world data in Section 6. Section 7 concludes the work, after which Section 8 gives suggestions for future work.


2 Subgroup Discovery & Exceptional Model Mining

Subgroup Discovery and Exceptional Model Mining concern the search for ‘interesting’ subsets of a particular dataset. Subgroups are defined by descriptions or ‘patterns’. A measure of quality is used to assess the interestingness of a subgroup; it measures how (dis)similar a subgroup is compared to the whole dataset or to the complement of the subgroup. The goal of SD and EMM is to find the most interesting subgroups (i.e. the subgroups with the highest quality).

In this section, the general idea and purpose of SD and the EMM framework will be reviewed in order to get familiar with the capabilities of these techniques. First, an informal definition from literature will be formalized in order to define the related techniques in a structured and unambiguous manner. Next, the further definitions regarding Subgroup Discovery are given and are then translated into the EMM framework.

2.1 Subgroup Discovery

Subgroup Discovery is known as the field of finding ‘interesting’ subgroups. The subgroups are defined by descriptive attributes, while the notion of ‘interestingness’ is induced on the target attributes. A more structured and complete definition is given by [39]:

“In Subgroup Discovery, we assume we are given a so-called population of individuals (objects, customers, . . . ) and a property of those individuals that we are interested in. The task of Subgroup Discovery is then to discover the subgroups of the population that are statistically ‘most interesting’, i.e. are as large as possible and have the most unusual statistical (distributional) characteristics with respect to the property of interest.”

Although intuitively clear, this definition refers to a few terms which are not well defined and therefore need a better explanation. Terms like ‘interesting’ have an intuitively clear meaning, but are not precise enough to be used to analyze a certain dataset. Because of this, I will use the more precise and formal definition of [13].

2.1.1 Formal definition

In Subgroup Discovery, one is given a dataset D in which each element is represented by a vector x̄. Each element x̄ ∈ D can be written as x̄ = (a1, . . . , ak, t1, . . . , tm) with k, m ∈ N fixed for all elements of D. Here, ā = (a1, . . . , ak) represents the descriptive attributes and t̄ = (t1, . . . , tm) represents the target attributes of x̄ (throughout this thesis ‘attributes’ may be interchanged with ‘variables’). For the case of count data, ti is discrete for some i throughout the whole dataset. The indices i for which this holds will be named responses and are further defined in Section 3. Furthermore, |D| = N denotes the size of the dataset. Assume ā to be an element of an unspecified domain D, and assume t̄ to be an element of another unspecified domain T (this domain will be specified for the case of count data in Section 3.2).

Next, one can define a description on the data. A description is an indicator τ : D → {0, 1}, by which a vector x̄ is said to be accepted by τ when τ(ā) = 1. The collection of all descriptions τ is called the description language P. To each description τ corresponds a subgroup, defined by:

Dτ = {x̄ ∈ D | τ(ā) = 1},   τ ∈ P.

Furthermore, let |Dτ| = Nτ. Intuitively, a subgroup of a dataset is the set of elements satisfying the description defining that subgroup.
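As a small illustration (not part of the thesis code), a description and its subgroup can be expressed in R as a predicate on the descriptive attributes; the column names below are hypothetical.

# A description tau is a predicate on the descriptive attributes of the records.
# The column names 'gender' and 'income' are illustrative only.
tau <- function(data) data$gender == "M" & data$income < 20000

# The subgroup D_tau consists of all records accepted by tau.
subgroup <- function(data, tau) data[tau(data), , drop = FALSE]

# Hypothetical example:
d <- data.frame(gender = c("M", "V", "M"), income = c(10251, 24635, 36486))
subgroup(d, tau)    # keeps only the first row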


2.1.2 Quality measure

Now, one only has to define how ‘interesting’ or ‘good’ a subgroup is. The interestingness or goodness of a subgroup is quantified using a quality measure. A quality measure is a function φ : P → R; it thus maps the description of a particular subgroup to a real value (the quality). In traditional Subgroup Discovery, a subgroup is deemed interesting when the value of its target substantially deviates from the target values of the complement or of the whole dataset (i.e. the quality of the description is high enough) [39]. Furthermore, in traditional SD, the number of target attributes is equal to 1 (i.e. m = 1). In a more realistic setting, however, there might be multiple target variables, each having some statistical relation with the others. Then it is not enough to only look at the distribution of a particular subgroup, but rather at the statistical models and properties the targets adhere to. More detailed definitions of different quality measures are given in Section 5.

2.2 Exceptional Model Mining

In the Exceptional Model Mining framework, the target variables of the data are assumed to adhere to a model [12]. The type of model used for the analysis in EMM is determined by the nature of the data. In general, one may take the notion of ‘model’ in EMM as broad as possible, as long as it makes sense for the data. One might, for example, be interested in the correlation between targets or in the amount of dispersion in the targets of a subgroup. The main intuition is that the constraint of a single fixed target variable (as in traditional Subgroup Discovery) no longer exists; instead, any statistical relation between targets can be analyzed. In the next section of this thesis, the chosen models will be presented. These models will later be used to assess the ‘interestingness’ or quality of a specific subgroup. Using this measure, one can select the subgroup with the highest or lowest quality to identify deviating behavior. An example of EMM is found in Figure 1. In this illustration, taken from [12], one can see a population of 6 elements in which a clear distinction is present between two subgroups, each following its own (linear) model. In this example, the targets are given by t1 = X, t2 = Y, and t3 is given by the shape of the data point. Also, note that the descriptive attributes ā are not explicitly given here.

Figure 1: Example population with two clear subgroups and their trained linear models [12].


A problem with EMM is that the set of all descriptions of subgroups is typically very large, since a subgroup can be constructed from a constraint on every attribute and every attribute can be split on all values it might take. A binary attribute has just two possible splits, whereas a nominal attribute can have arbitrarily many. Taking all different possibilities into account results in a very large search space, and exhaustive analysis of this space is computationally costly. Therefore, a heuristic method to traverse the search space is introduced next.

2.2.1 Beam search

Analyzing the search space in a heuristic but effective manner can be achieved by beam search. The method was, although not formally, first introduced in [33]. The algorithm is based on the heuristic that when the quality (in this thesis given by the quality measure φ) of a data structure A is higher than the quality of a structure B, then the quality of some subset A′ ⊆ A is also higher than the quality of some B′ ⊆ B. Of course, this assumption is false in many cases, but it significantly shrinks the search space of the algorithm. Beam search makes use of a certain beam ‘width’ and ‘depth’. The width of the beam is the number of subsets (with the highest quality) which will be taken into account at the next layer of evaluation. The beam depth defines the number of layers or iterations in the search. An example is given by [35] and reproduced in Figure 2. In this example, the beam width is 2 and the beam depth is 4.

Figure 2: Beam search tree [35].

In order for beam search to be applied, a suitable operator is needed to traverse from one layer to the next. This ‘refinement’ operator will generally be a specialization operator. Given the descriptions of a certain layer, it specifies the descriptions for the subgroups in the next layer. By specialization is meant that, given a description, the operator adds an ingredient (typically a constraint on a specific attribute) to it. In this way, the data is divided into different (not necessarily disjoint) subsets. Per description, the quality of the corresponding subgroup is assessed in the next layer of the search. An example of such a refinement can be given using the data from Table 1. When analyzing the numbers of calls and complaints made to a customer service portal, one could construct subgroups from customer demographics. A refinement operator, in this case, might provide the splits on these demographics. Assume a beam width of 1 and a beam depth of 2. In the first layer, the operator


CustomerID | Gender | Income  | Nr. of children | Calls | Complaints
1          | M      | €10.251 | 0               | 6     | 2
2          | V      | €24.635 | 2               | 7     | 1
3          | V      | €12.145 | 0               | 1     | 3
4          | M      | €36.486 | 1               | 3     | 0
5          | M      | €25.154 | 1               | 4     | 0
6          | V      | €18.452 | 0               | 5     | 0
7          | M      | €68.873 | 2               | 3     | 1
8          | V      | €21.012 | 0               | 8     | 2

Table 1: Example data of contacting customer services.

will split the data on gender. From the two constructed subgroups, the quality is calculated; the quality of the male group turns out to be higher. Then, a split is made on the monthly income. Since these are numerical values, the operator splits on incomes higher and lower than €20.000,-. From the newly constructed subgroups, the quality is again assessed and the most exceptional subgroup is selected. A visualization of this procedure is found in Figure 3. In this thesis, the refinement operator η introduced by [12] is used. The choice of η is further explained in Section 4.

Figure 3: Example of a refinement operator.


3 Regression models on count data

With the aforementioned framework, one has a stable basis for discovery of subgroups in a broad range of datasets. However, to apply the framework, one still needs to have a hypothesis on how the target attributes behave. For this, the bivariate Poisson regression model will be described. First, some general notions and definitions on regression models are made. After that, the Poisson regression model is given. Finally, the Poisson regression model is extended to a bivariate Poisson regression model.

3.1 Regression models

The term regression implies that one is interested in the dependence or ‘impact’ that certain variables have on the outcomes of others [1]. Using regression models, one is able to characterize these impacts. One could, for example, see whether a higher cholesterol level implies a higher blood pressure or see the relation between the amount of diapers and beer bought at a supermarket. A formal definition of a regression model is given by [1] and states:

Suppose one has N observations (pi, ri), i = 1, . . . , N over a set of m predictors pi = (pi1, . . . , pim) and one response variable ri. A general regression model assumes that

ri = hβ(pi) + εi,   i = 1, . . . , N,

where the response function hβ(·) : R^m → R has a known parametric form and depends on N′ ≤ N unknown parameters β = (β1, . . . , βN′). Furthermore, the noise terms εi are i.i.d. random variables which are typically normal, but could also have some other distribution. Since both EMM and regression models concern different input and output variables, there has to be a clear distinction in notation. EMM uses descriptive attributes ā = (a1, . . . , ak) to describe subgroups and targets t̄ = (t1, . . . , tm) as the variables determining the interestingness of a subgroup. Since EMM is based on the target attributes adhering to a statistical model, the regression model will only consider the targets of EMM. More specifically, the target variables of EMM will be divided into l predictive variables and m − l response variables of the regression model, that is (t1, . . . , tm) = (p1, . . . , pl, r1, . . . , rm−l). A visualized explanation of this naming convention is:

\bar{x} = (\overbrace{a_1, \ldots, a_k}^{\text{descriptors}}, \overbrace{t_1, \ldots, t_m}^{\text{targets}}) = (a_1, \ldots, a_k, \underbrace{p_1, \ldots, p_l}_{\text{predictors}}, \underbrace{r_1, \ldots, r_{m-l}}_{\text{responses}})

Now that the definition of regression models is known, one can look at the link with EMM. One can use regression models to derive how interesting a subgroup is. For example, using the aforementioned definition of a regression model, one could derive a parameter vector β for the whole dataset and β′ for a subgroup and look at the Euclidean distance between both. On the other hand, one might also look at the dispersion of the noise ε in the dataset and the subgroup: if, for example, the variance or mean of ε within a subgroup differs from the assumed standard normal parameters, then the subgroup might be deemed interesting. The quality measure most suitable for a proper analysis should be chosen taking into account the nature of the data and the expected results.

3.2 Bivariate count data

Count data means that the response variables take non-negative integer values. These values usually correspond to counts of some event over a given amount of time. An example of count data is the number of tropical cyclones crossing the North Queensland coast between 1956 and 1968 [10] or the flying-bomb hits on London during WWII (which is discussed below).

Recall the formal definition of Subgroup Discovery, where the targets t̄ are members of an unspecified domain T. Reviewing this for the case of bivariate count data, this domain is no longer unspecified. More specifically, let tm−1 and tm be elements of Z+, the non-negative integers. The targets (t1, . . . , tm−2) are still elements of some unspecified domain. From here, I can define models on the targets to use in the EMM algorithm.

3.2.1 Poisson regression

When considering count data, or counts in general, the Poisson distribution is the first thing that comes to mind. The Poisson distribution is a well-known distribution for describing the number of event occurrences within a certain time period; each event time is independent of the previous event times, and the times between consecutive events have a common mean. The probability mass function of a random variable Y ∼ Poi(λ) is given by:

P(Y = y) = f(y; \lambda) = \frac{\lambda^y e^{-\lambda}}{y!}.

Note that E[Y] = Var(Y) = λ. An example of the application of the Poisson distribution is found in [15]. This example concerns the flying-bomb hits on London during World War II; the data can be found in Table 2. The city of London was divided into N = 576 small areas of 1/4 square kilometre each. The table records the number Nk of areas with exactly k hits. The total number of hits is T = Σ_k k · Nk = 537, and the average is λ = T/N ≈ 0.9323. Comparing with the expected counts of a Poisson distribution with λ = 0.9323, one can see that the fit is surprisingly accurate. Back in the day, people believed that the bombing attacks were clustered, which would be indicated by high frequencies in areas with either k = 0 or k ≥ 5. However, the fit of the Poisson model indicated almost perfect randomness and homogeneity.

k              | 0      | 1      | 2     | 3     | 4    | ≥ 5
Nk             | 229    | 211    | 93    | 35    | 7    | 1
Poi(k; 0.9323) | 226.74 | 211.39 | 98.54 | 30.62 | 7.14 | 1.57

Table 2: Flying-bomb hits on London [15].
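As a quick check of the expected counts in Table 2, the following R snippet (an illustrative sketch, not part of the thesis code) recomputes the Poisson fit from the totals given above.

# Fit of the Poisson model to the flying-bomb data from Table 2.
N <- 576                       # number of areas
T_hits <- 537                  # total number of recorded hits
lambda <- T_hits / N           # average number of hits per area (~0.9323)

# Expected number of areas with k = 0, ..., 4 hits and with 5 or more hits.
expected <- c(dpois(0:4, lambda),
              ppois(4, lambda, lower.tail = FALSE)) * N
round(expected, 2)             # approximately 226.74 211.39 98.54 30.62 7.14 1.57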

Apart from this example, [15] also shows examples of chromosome interchanges in cells, radioactive decay and calls mistakenly made to a wrong number. These examples all roughly follow a Poisson distribution. Thus, the Poisson distribution seems a good fit for the data handled in this thesis.

Directly related to this distribution is Poisson regression. The following definition of this model is paraphrased from [7].

A Poisson regression model can model the frequency of events, provided that the events ri, i = 1, . . . , Nτ, given the predictors pi, are independent. The assumption is that ri | pi ∼ Poi(λi), where λi is defined by:

log(\lambda_i) = \beta^T p_i,   i = 1, . . . , N_\tau.

Taking the logarithm ensures positivity of λi. The Poisson regression model is part of a broader family of models known as the generalized linear models [40].
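In R, a univariate Poisson regression of this form can be fitted with the base glm() function. The sketch below, with made-up coefficients and predictors, is only meant to illustrate the model just described.

# Univariate Poisson regression with a log link, on simulated data.
set.seed(1)
n  <- 200
p1 <- runif(n)                            # illustrative predictors
p2 <- sample(0:10, n, replace = TRUE)
lambda <- exp(0.2 + 0.5 * p1 + 0.05 * p2) # log(lambda) is linear in the predictors
r  <- rpois(n, lambda)                    # Poisson-distributed response counts

fit <- glm(r ~ p1 + p2, family = poisson(link = "log"))
coef(fit)                                 # estimates of the beta parameters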


3.2.2 Bivariate Poisson distribution

In order to also take into account dependence between different count-valued responses, a new distribution will be described. This distribution, given by [23], is called the multivariate Poisson distribution.

Let Yi, i = 1, . . . , a be independent Poisson random variables, i.e. Yi ∼ Poi(λi), i = 1, . . . , a. Now define the vectors Y = (Y1, . . . , Ya)^T and λ = (λ1, . . . , λa)^T. Furthermore, define the b × a matrix A, b ≤ a, with elements equal to 0 or 1. We say that the vector r = (r1, . . . , rb)^T, given as r = AY, follows a multivariate Poisson distribution. Moreover, [23] gives the facts that E[r] = Aλ and Var(r) = AΣA^T, where Σ = diag(λ1, . . . , λa). As this thesis only focusses on bivariate response data, b is equal to 2. Furthermore, A is taken equal to:

A = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{pmatrix},

which gives the following equations for the responses:

r1 = Y1 + Y3
r2 = Y2 + Y3

with Y1 ∼ Poi(λ1), Y2 ∼ Poi(λ2) and Y3 ∼ Poi(λ3). Note that the mean vector is given by:

E[r] = \begin{pmatrix} \lambda_1 + \lambda_3 \\ \lambda_2 + \lambda_3 \end{pmatrix}

and the variance-covariance matrix by:

Var(r) = \begin{pmatrix} \lambda_1 + \lambda_3 & \lambda_3 \\ \lambda_3 & \lambda_2 + \lambda_3 \end{pmatrix}.

From here, one can see that the covariance between r1 and r2 is controlled by λ3. To view this distribution as a bivariate Poisson regression model, one can assume that the parameters λi are functions of the predictive attributes. This brings us back to the definition of the Poisson regression for one response.
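The construction r = AY above is easy to simulate. The following R sketch (with arbitrary rates, not taken from the thesis) generates bivariate Poisson counts and checks that the sample covariance of (r1, r2) is close to λ3.

# Bivariate Poisson counts via the trivariate reduction r = A Y.
set.seed(42)
n  <- 100000
lambda <- c(2, 3, 1.5)            # arbitrary rates for Y1, Y2, Y3
Y1 <- rpois(n, lambda[1])
Y2 <- rpois(n, lambda[2])
Y3 <- rpois(n, lambda[3])         # shared component inducing the correlation
r1 <- Y1 + Y3
r2 <- Y2 + Y3

cov(r1, r2)                       # close to lambda[3] = 1.5
c(var(r1), var(r2))               # close to lambda[1] + lambda[3] and lambda[2] + lambda[3]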

Now, we have Yik ∼ Poi(λik) for the i-th observation and k ∈ {1, 2, 3}. The predictive attributes are denoted by pik = (pi1k, . . . , pilk), i = 1, . . . , Nτ, k ∈ {1, 2, 3}. So for every Yik, there is a corresponding λik given by:

log(\lambda_{ik}) = \beta_k^T p_{ik},   i = 1, . . . , N_\tau,  k ∈ {1, 2, 3}.

The observed data are r1 and r2. However, the parameters βk, k ∈ {1, 2, 3} are the parameters for λi = (λi1, λi2, λi3), which in turn describe the distribution of the Yik's, which we do not observe. Therefore, a method to estimate the unknown parameters βk = (β1k, . . . , βlk) for k ∈ {1, 2, 3} is needed. For this, I use the EM algorithm, introduced by [9]. The EM (or Expectation-Maximization) algorithm is an iterative method to estimate parameters by maximum likelihood in statistical models where the observations depend on unobserved latent variables [38]. The method consists of an expectation (E) step, in which the conditional expectation of the log-likelihood given the current estimate of the parameters is computed. After that, the maximization (M) step computes the parameters maximizing this expected log-likelihood. The parameters found in the M-step are then used again in the E-step of the next iteration. The method for the case of multivariate Poisson regression is proposed in [23]. The case of bivariate Poisson regression is given next.

• E-step: In the E-step, one computes the conditional expectation of the latent variables

Yk = (Y1k, . . . , YNτk), k ∈ {1, 2, 3}, given the observed data and the current values of the estimates of βk, k ∈ {1, 2, 3}, which are denoted by Θ^(h) = (β1^(h), β2^(h), β3^(h)). This expectation, denoted by s_ik^(h), is given by:

s_{ik}^{(h)} = E[Y_{ik} \mid r_i, \Theta^{(h-1)}] = \frac{\lambda_{ik} \, P(R = r_i - \phi_k)}{P(R = r_i)},   i = 1, . . . , N_\tau,  k ∈ {1, 2, 3},

where φk denotes the k-th column of A and P(·) denotes the joint probability function of the multivariate Poisson distribution. This probability function is given by [32]:

P(r_{i1}, r_{i2}) = e^{-(\lambda_{i1} + \lambda_{i2} + \lambda_{i3})} \sum_{j=0}^{\min(r_{i1}, r_{i2})} \frac{\lambda_{i1}^{r_{i1}-j} \, \lambda_{i2}^{r_{i2}-j} \, \lambda_{i3}^{j}}{(r_{i1}-j)! \, (r_{i2}-j)! \, j!},

where \lambda_{ik} = \exp(\beta_k^{(h-1)T} p_{ik}), i = 1, . . . , N_\tau, k ∈ {1, 2, 3}.

• M-step: Using the sik's as dependent variables and the pik's as predictive attributes, update Θ by fitting a Poisson regression model.

• If some convergence criterion is satisfied, stop iterating; else, go back to the E-step for the next iteration.

The initial values for the algorithm, Θ^(1), can be obtained by fitting separate Poisson regression models assuming independence of the responses. With this algorithm, the bivariate Poisson model can be fitted to the data. The detailed description of the associated quality measure for the EMM framework is given in Section 5.
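To illustrate the quantities used in the E-step, the following self-contained R sketch implements the joint probability function and the conditional expectations s_i1, s_i2, s_i3 for a single observation. It is a sketch of the formulas above only; the actual implementation in this thesis relies on the bivpois package instead.

# Joint probability of a bivariate Poisson observation (r1, r2)
# with underlying rates lambda = (lambda1, lambda2, lambda3).
dbivpois <- function(r1, r2, lambda) {
  j <- 0:min(r1, r2)
  exp(-sum(lambda)) * sum(lambda[1]^(r1 - j) * lambda[2]^(r2 - j) * lambda[3]^j /
                          (factorial(r1 - j) * factorial(r2 - j) * factorial(j)))
}

# E-step: conditional expectations of the latent Y1, Y2, Y3 given (r1, r2).
# phi_k is the k-th column of A = [1 0 1; 0 1 1].
estep <- function(r1, r2, lambda) {
  A <- matrix(c(1, 0, 0, 1, 1, 1), nrow = 2)        # columns are phi_1, phi_2, phi_3
  denom <- dbivpois(r1, r2, lambda)
  sapply(1:3, function(k) {
    rk <- c(r1, r2) - A[, k]
    if (any(rk < 0)) return(0)                      # a negative count has probability zero
    lambda[k] * dbivpois(rk[1], rk[2], lambda) / denom
  })
}

estep(3, 2, c(2, 3, 1.5))   # expected latent counts; they satisfy A %*% s = c(3, 2)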


4 Implementation of the framework

With the definition of EMM from Section 2, it is now possible to make an implementation of the algorithm. This algorithm will later be used on both simulated and real-life datasets to check its performance.

4.1 The EMM framework

In [12], an implementation of EMM using beam search is proposed. Since this implementation also suits the specialized case of count data, this algorithm will be implemented for this thesis. The pseudo-code is given in Algorithm 1 below.

Algorithm 1 Beam Search for Top-q Exceptional Model Mining
Input: dataset Ω, quality measure φ, refinement operator η, beam width w, beam depth d, result set size q, constraint set C
Output: PriorityQueue resultSet

candidateQueue ← new Queue
candidateQueue.enqueue({})                      ▷ start with the empty description
resultSet ← new PriorityQueue(q)
for (level ← 1; level ≤ d; level++) do
    beam ← new PriorityQueue(w)
    while (candidateQueue ≠ ∅) do
        seed ← candidateQueue.dequeue()
        set ← η(seed)
        for (desc ∈ set) do
            quality ← φ(desc)
            if (desc.satisfiesAll(C)) then
                resultSet.insert_with_priority(desc, quality)
                beam.insert_with_priority(desc, quality)
    while (beam ≠ ∅) do
        candidateQueue.enqueue(beam.get_front_element())
return resultSet

The implementation of this pseudo-code needs some definitions of the different operators which will be used. The different quality measures will be specified in Section 5. The rest of the inputs will be specified next.
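As a rough indication of how this pseudo-code translates to R, the sketch below implements the same loop with plain lists instead of the liqueueR classes used in the actual implementation; refine_fn and quality_fn stand in for η and φ and are placeholders, not the thesis code.

# Minimal beam search skeleton (plain R lists; descriptions are lists of conditions).
beam_search <- function(data, quality_fn, refine_fn, w, d, q) {
  top_n <- function(cands, n) cands[order(-sapply(cands, `[[`, "quality"))][seq_len(min(n, length(cands)))]
  candidates <- list(list(description = list(), quality = -Inf))
  result <- list()
  for (level in seq_len(d)) {
    scored <- list()
    for (seed in candidates) {
      for (desc in refine_fn(data, seed$description)) {
        scored[[length(scored) + 1]] <- list(description = desc,
                                             quality     = quality_fn(data, desc))
      }
    }
    result     <- top_n(c(result, scored), q)   # keep the overall top-q descriptions
    candidates <- top_n(scored, w)              # beam: best w descriptions of this level
  }
  result
}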

4.2 Refinement Operator

The refinement operator, already mentioned in Section 2.2, will now be defined. The operator introduced by [12] is used, which is defined as follows: “When η is presented with a description D to refine, it builds up the set η(D) by looping over all the descriptive attributes a1, . . . , ak. For each attribute, a number of descriptions are added to η(D), depending on the attribute type:

• if ai is binary: add D ∩ (ai = 0) and D ∩ (ai = 1) to η(D);

• if ai is nominal, with values v1, . . . , vg: add ∪_{j=1}^{g} {D ∩ (ai = vj), D ∩ (ai ≠ vj)} to η(D);

• if ai is numeric: order the values of ai that are covered by the description D; this gives us a list of ordered values v(1), . . . , v(n) (where n = |GD|). From this list, we select the split points s1, . . . , sb−1 by letting

s_j = v_{(\lfloor j \cdot n / b \rfloor)},   j = 1, . . . , b − 1.

Then, add {D ∩ (ai ≤ sj), D ∩ (ai ≥ sj)}_{j=1}^{b−1} to η(D).”
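For the numeric case, the split-point rule can be written in a few lines of R; this is a small illustrative sketch with variable names of my own choosing.

# Split points for a numeric attribute, following s_j = v_(floor(j * n / b)).
numeric_splits <- function(values, b) {
  v <- sort(values)                      # ordered values covered by the description
  n <- length(v)
  j <- seq_len(b - 1)
  v[floor(j * n / b)]                    # the b - 1 split points s_1, ..., s_(b-1)
}

# Each split point s yields two refinements: a_i <= s and a_i >= s.
numeric_splits(c(7, 3, 9, 1, 5, 8, 2, 6), b = 4)   # gives 2 5 7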

4.3 Beam sizes, result set size, constraints

Further inputs of the model are the beam width, beam depth, result set size and the initial constraints. The values of the beam width and depth influence the number of computations made by the algorithm and hence its running time. These parameters are implemented as global variables and can therefore be changed easily between runs of the algorithm. Later on, using the different datasets as input, these parameters will be set to the most suitable values. In the same sense, the size of the result set of descriptions is initially left large; in more specific cases, this parameter can be set to a smaller value, which will be determined later. Initial constraints are used when there is some restriction on the structure of the result set. It might, for example, be required that a management office wants to dismiss at most 50 employees and wants to find out which subgroup is functioning worst. These constraints are completely independent of the patterns generated by the EMM framework. In this general implementation, such constraints are not needed and might be filled in later. Hence, the constraint set is initially left empty.

4.4 Used packages

Due to the statistical nature of the source data, the implementation is made in R. The following packages were used for the model:

• liqueueR [8] for the Queue and PriorityQueue objects, which provide the functionality of (priority) queues;
• sandwich [41][42], used together with the base R function glm(), which provides generalized linear models and hence Poisson regression.

The whole code is found in the appendix of this thesis. Since the quality of a subgroup is dependent on the chosen model, multiple quality measures are implemented. These will be introduced and specified in Section 5.


5 Quality measure

As already mentioned in Section 2, a quality measure is used to assess the interestingness of a particular subgroup. In the case of EMM, the quality measure can take many different forms. In this section, different aspects of a quality measure are discussed and concrete measures are introduced which can be used for the analysis of data.

Figure 1: Example population with two clear subgroups and their trained linear models [12].

Review the EMM example given in Figure 1 (which was already shown in Section 2). It illustrates the question of what interestingness of a subgroup means. One can either assess the quality of a subgroup by comparing its model with a model for the complement of the subgroup, or by comparing it with the model for the whole dataset. Suppose the situation in Figure 1 arises. Fitting a simple linear regression using X to predict Y (i.e. hβ(X) is linear) on the whole dataset will result in a line with slope equal to −1. Now suppose that one has some description which takes the circles as a group. Again fitting a model gives a slope of 1. When the quality of a set is assessed using the slope of the trained model, one can see that this measure substantially differs from that of the whole dataset and therefore the subgroup might be deemed interesting. On the other hand, when comparing with the complement of the circles, the squares, one can see that the slopes are equal and hence the subgroup shows no deviating behavior. The choice of quality assessment depends entirely on the nature of the data, the structure of the target variables and what one wants to discover from the data. This has to be taken into account in further analysis and assessments of quality. First, some simpler heuristic measures will be introduced. These measures are used to get familiar with the framework and its implementation. In addition to the simpler measures, more complex ones will be introduced, which can be used to draw significant conclusions on different datasets.

5.1 Correlation coefficient

As a first step in the complete implementation, I start with a rather straightforward quality measure. This quality measure is proposed in Section 5.1 of [12] and is known as φ_scd. The measure considers two numeric response variables, r1 and r2, and is related to the correlation coefficient ρ. This coefficient is estimated by the Pearson correlation coefficient

applied to samples, defined by:

\hat{\rho}^{D_\tau} = \frac{\sum_{i=1}^{N_\tau} (r_1^i - \bar{r}_1)(r_2^i - \bar{r}_2)}{\sqrt{\sum_{i=1}^{N_\tau} (r_1^i - \bar{r}_1)^2 \sum_{i=1}^{N_\tau} (r_2^i - \bar{r}_2)^2}},

where r_1^i, r_2^i denote the i-th entries of r1 and r2 respectively, and r̄1, r̄2 denote their respective means; the sample estimate for the complement is defined analogously. Now let ρ^{Dτ} and ρ^{Dτ^C} be the correlation coefficients for a (sub)group Dτ and its complement Dτ^C, respectively, and let ρ̂^{Dτ} and ρ̂^{Dτ^C} denote their sample estimates. Using this, it is interesting to find a substantially deviating correlation of a subgroup with respect to its complement. Therefore, one can derive the quality based on the test

H_0 : \rho^{D_\tau} = \rho^{D_\tau^C}   against   H_1 : \rho^{D_\tau} \neq \rho^{D_\tau^C}.

In general, however, the sampling distributions of ρ̂^{Dτ} and ρ̂^{Dτ^C} are not known, even if the target variables are Gaussian. If r1 and r2 follow a bivariate normal distribution (which they do not in count data; this will be commented on later), then we can apply the Fisher z transformation, given by

z' = \frac{1}{2} \cdot \log \frac{1 + \hat{\rho}}{1 - \hat{\rho}}.

The sampling distribution of z' is approximately normal. As a consequence,

z^* = \frac{z' - z'^C}{\sqrt{\frac{1}{N_\tau - 3} + \frac{1}{(N - N_\tau) - 3}}}

approximately follows a standard normal distribution under H_0. As a rule of thumb, if both Nτ and N − Nτ are greater than 25, the normal approximation gives quite accurate results. The quality measure is then simply given by 1 minus the p-value of the test using z^*. In order to generate useful results from this measure, an assumption should be made on the size of the subgroup. This property is to be added to the constraints given as input to the algorithm.
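A direct R transcription of this measure could look as follows (a sketch under the normality assumption discussed above; subgroup is a logical vector marking membership of Dτ).

# phi_scd: 1 minus the p-value of the Fisher-z test comparing the correlation
# of (r1, r2) inside a subgroup with the correlation in its complement.
phi_scd <- function(r1, r2, subgroup) {
  n1 <- sum(subgroup); n2 <- sum(!subgroup)
  if (n1 < 4 || n2 < 4) return(NA_real_)           # z-statistic undefined otherwise
  rho1 <- cor(r1[subgroup],  r2[subgroup])
  rho2 <- cor(r1[!subgroup], r2[!subgroup])
  z1 <- 0.5 * log((1 + rho1) / (1 - rho1))          # Fisher z transformation
  z2 <- 0.5 * log((1 + rho2) / (1 - rho2))
  z  <- (z1 - z2) / sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
  1 - 2 * pnorm(-abs(z))                            # 1 minus the two-sided p-value
}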

5.2 Absolute difference

Using the definition of ρ̂ given above, one might also think of a more straightforward quality measure, given by the absolute difference, i.e.

φ_abs(Dτ) = |ρ̂^{Dτ} − ρ̂^{Dτ^C}|.

Also, as explained, the target variables should follow a bivariate normal distribution. As this thesis focusses on count data, this will not be the case. Hence, this measure is to be used as an exploratory measure and cannot be used to derive any final results. A shortcoming of this measure is that it does not take the coverage of a description into account, which can result in overfitting. Therefore, it is convenient to also take the entropy of a given description into account. This can be done by the entropy function φ_ef, given by

φ_ef(Dτ) = −(Nτ/N) · log(Nτ/N) − ((N − Nτ)/N) · log((N − Nτ)/N).

Both measures combined result in

φ_ent(Dτ) = φ_ef(Dτ) · φ_abs(Dτ).
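For completeness, these exploratory measures translate to R as follows (again an illustrative sketch, with subgroup a logical membership vector).

# Entropy-weighted absolute difference of Pearson correlations.
phi_abs <- function(r1, r2, subgroup) {
  abs(cor(r1[subgroup], r2[subgroup]) - cor(r1[!subgroup], r2[!subgroup]))
}

phi_ef <- function(subgroup) {
  p <- mean(subgroup)                       # N_tau / N
  -p * log(p) - (1 - p) * log(1 - p)        # undefined (NaN) for empty or full subgroups
}

phi_ent <- function(r1, r2, subgroup) phi_ef(subgroup) * phi_abs(r1, r2, subgroup)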


Despite being a fine quality measure for data which is (at least closely) normally distributed, this measure will probably not deliver any significant results on count data, since count responses are discrete and non-negative, whereas the normal distribution satisfies neither of these constraints. In addition to this analysis, one might look at the Pearson correlation coefficient and simply assume the means and the data itself to be Poisson distributed. However, in Chapter 6 of [27] it is mentioned that “normal correlation analyses should be limited to situations in which (r1, r2) is (at least very nearly) normal.” In the same work, several strong conclusions are drawn on why the quality of analyzing non-normal data with the normal correlation coefficients is poor. Therefore, other options for the quality measure should be analyzed, in which it might be possible to incorporate the count properties of our data.

5.3 Correlation in bivariate Poisson regression

The bivariate Poisson regression model is described in Section 3.2.2. In this quality measure, a specific statistic from this model is used. Recall that the EM algorithm takes the observed data (both predictors and responses) as input and returns the estimates of the different β's of the latent variables. The measure specifically looks at the correlation between the observed responses and hence one is interested in:

\hat{\rho}_i = \frac{\lambda_{i3}}{\sqrt{(\lambda_{i1} + \lambda_{i3}) \cdot (\lambda_{i2} + \lambda_{i3})}},   i = 1, . . . , N_\tau,

where ρ̂_i denotes the correlation for the i-th entry of subgroup Dτ. For the implementation of this quality measure, the bivpois package [24] is used. This package implements the EM algorithm with the matrix A as specified in Section 3.2.2. The algorithm takes the targets of a subgroup as input and gives the estimates of β as output. Since only the β's are estimated after the algorithm has finished, the value of ρ̂ will differ per observation. For the quality measure, I take:

\bar{\rho} = \frac{1}{N_\tau} \sum_{i=1}^{N_\tau} \hat{\rho}_i.

Now, the quality measure is defined by:

φ_bp(Dτ) = |ρ̄_{Dτ} − ρ̄_{Dτ^C}|.

Note that Dτ and Dτ^C are here written as subscripts of ρ̄ to avoid ambiguity between the bar and a superscript. Since this thesis only focusses on the case of count data, the first two quality measures mentioned are not suitable for further analysis, as they assume the target data to be Gaussian. Thus, from here on, the only quality measure used for further analysis is the bivariate Poisson measure φ_bp.
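Assuming the three fitted rate vectors λ_i1, λ_i2, λ_i3 are available as the columns of a matrix (for example from the EM fit provided by the bivpois package, whose exact interface is not reproduced here), the quality measure reduces to a few lines of R.

# Mean within-group correlation from fitted bivariate Poisson rates.
# 'lambda' is an n x 3 matrix with columns lambda_i1, lambda_i2, lambda_i3.
rho_bar <- function(lambda) {
  rho_i <- lambda[, 3] / sqrt((lambda[, 1] + lambda[, 3]) * (lambda[, 2] + lambda[, 3]))
  mean(rho_i)
}

# phi_bp: absolute difference of the mean correlations of a subgroup and its complement.
phi_bp <- function(lambda_subgroup, lambda_complement) {
  abs(rho_bar(lambda_subgroup) - rho_bar(lambda_complement))
}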


6 Simulation and real world data

Now that I have the framework, it is time to verify its implementation. Assessing the accuracy as well as the correctness of the algorithm is done in this section. To verify the quality of the algorithm, it is convenient to test it against some known ground truth. For this, I simulate data and apply the algorithm. After the quality of the algorithm against this ground truth has been assessed, it is used to find subgroups in real-life datasets originating from Premier League football matches.

6.1 Simulation

The quality of the algorithm can be assessed by analyzing data of which the distribution is known. This ground truth can be used to see how accurately the algorithm finds subgroups. To illustrate this, different simulated datasets will be given as input to the algorithm.

The dataset consists of 100 samples of two binary descriptive attributes a = (a1, a2), three predictive attributes p = (p1, p2, p3) and two responses r = (r1, r2). The descriptive and predictive attributes are generated randomly using the sample function of R. The descriptive attributes are picked from {0, 1}. The predictive attributes are picked from {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. Furthermore, the following values for β1, β2 and β3 are chosen:

β1 = (0.066, 0.066, 0.066)

β2 = (0, 0, 0)

β3 = \begin{cases} (0.1, 0.01, 0.003) & \text{if } (a_1, a_2) = (0, 1), \\ (-\infty, -\infty, -\infty) & \text{otherwise.} \end{cases}

Note that β3 = (−∞, −∞, −∞) implies that exp(p^T β3) = 0 and hence λ3 = 0.

Making a distinction in the values of β3 between (a1, a2) = (0, 1) and the other data defines the interesting subgroup. Since the correlation between the targets in the former set is higher than in its complement, it should be deemed interesting by the algorithm when the quality measure from the previous section is used.

From these β's, one can generate Y1, Y2 and Y3 by drawing from Poisson distributions with parameters exp(p^T β1), exp(p^T β2) and exp(p^T β3), respectively. Now, r can be determined by r = AY.
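A sketch of this data-generating process in R, following the description above and implementing the β3 = (−∞, −∞, −∞) case as λ3 = 0:

# Simulate one dataset with an exceptional subgroup at (a1, a2) = (0, 1).
set.seed(123)
n  <- 100
a1 <- sample(0:1, n, replace = TRUE)              # descriptive attributes
a2 <- sample(0:1, n, replace = TRUE)
P  <- matrix(sample(0:10, 3 * n, replace = TRUE), ncol = 3)   # predictors p1, p2, p3

beta1 <- c(0.066, 0.066, 0.066)
beta2 <- c(0, 0, 0)
beta3 <- c(0.1, 0.01, 0.003)

lambda1 <- exp(P %*% beta1)
lambda2 <- exp(P %*% beta2)
lambda3 <- ifelse(a1 == 0 & a2 == 1, exp(P %*% beta3), 0)     # beta3 = -Inf outside the subgroup

Y1 <- rpois(n, lambda1)
Y2 <- rpois(n, lambda2)
Y3 <- rpois(n, lambda3)                                       # rpois with rate 0 is always 0
r1 <- Y1 + Y3
r2 <- Y2 + Y3
sim <- data.frame(a1, a2, p1 = P[, 1], p2 = P[, 2], p3 = P[, 3], r1, r2)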

To test the quality of the EMM framework, data is first generated using the above description. After this, the EMM framework with φ_bp is applied to the simulated data. The output of the algorithm is the set of subgroups with the highest quality, which should hence contain the subgroup with (a1, a2) = (0, 1). One run is obviously not enough to get a view of the quality of the algorithm; therefore, this procedure is repeated 1000 times. During these iterations, five pieces of information are gathered. For this analysis, the true subgroup ((a1, a2) = (0, 1)) is denoted by DT and the subgroup with the highest quality found by the EMM framework is denoted by DB. The gathered information consists of:

1. The description of DB;

2. φbp: The quality of DB;


Figure 4: Bar plot showing the frequency of different descriptions.

3. R1 = |DT \ DB| / |DT|: the number of elements of DT which are not present in DB, relative to the size of DT;

4. R2 = |DB \ DT| / |DB|: the number of elements of DB which are not present in DT, relative to the size of DB;

5. R3 = |DT ∩ DB| / sqrt(|DT| · |DB|): the number of elements in the intersection of DT and DB, relative to the geometric mean of the sizes of both sets.

The results regarding the description of DB are found in Figure 4. As the descriptions typically are long, they are mapped to a more compact notation. In this notation, a 0 or 1 in the first position means that a1 is 0 or 1, respectively, in that subgroup; the second position is handled analogously for a2. A bar (-) means that the attribute is not specified and can be either 0 or 1. For example, ‘01’ is our true subgroup (a1, a2) = (0, 1) and ‘-0’ means a2 = 0. From the figure, one can see that DB is mostly given by the correct description. More specifically, in 46.5% of the cases, the correct subgroup is found as the best subgroup.

Next, a histogram of the quality φ_bp of DB over the different runs is made. The resulting qualities are shown in Figure 5. In this histogram, one can identify a bell curve in the data, which might indicate that φ_bp approximately follows a normal distribution. To test this, the Shapiro-Wilk test for normality [36] is used. This procedure tests the hypothesis that a dataset is normally distributed. The test returned a p-value of 2.951 · 10−12; hence I conclude that φ_bp is not normally distributed.
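In R, this test is available as shapiro.test(); a minimal sketch, assuming the 1000 recorded qualities are stored in a vector phi_values:

# Shapiro-Wilk normality test on the recorded best-subgroup qualities.
# phi_values is assumed to hold the 1000 phi_bp values gathered over the runs.
shapiro.test(phi_values)   # a p-value far below 0.05 rejects normality of phi_bp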


Figure 5: Histogram showing the frequencies of the highest quality.

Last, the different ratios of elements in DT and DB were analyzed. The boxplots containing the results are found in Figure 6. From these measures, I can review the amount of overlap between the found subsets. If, for example, DB is larger than DT and DB ∩ DT = DT, then one might also see a higher quality, since the exceptional subgroup is entirely contained in the best subgroup.

Figure 6: Boxplots showing the latter three measures (R1, R2 and R3).

Now that the quality of the algorithm is known for this particular case of simulated data, it is good to see what the influence is of changing different parameters of the simulated dataset. The following properties of the simulated dataset will be varied:

• the average value of ρ̄_{DT};
• the expected size of DT;
• the number of descriptive attributes in ā (k).

All of these properties will be tested using N = 50 or N = 100 observations, and the same quantities as in the above analysis will be assessed. The results of this analysis are found in Table 3. In this table, the mean and standard deviation of the quality of the best scoring subgroup are given, together with the mean and standard deviation of the three aforementioned ratios between the best and the true subgroup (denoted by R1, R2, R3). Last, the percentage of runs in which the best subgroup was equal to the true subgroup is reported.

Clearly, increasing the average value of ρ by increasing the values of β3 shows a significant increase in the quality of the algorithm for both sample sizes. To test the influence of the other parameters on the quality of the model, the value of β3 which results in an average ρ of 0.7 is used. Increasing the number of descriptors shows a significant decrease in the quality of the algorithm. This is interesting, since it implies that a more complex search space will result in the algorithm failing to find the true subgroup. However, interestingly, the best subgroup with k = 3 was, on average, given a higher quality than the best subgroup from the set with k = 2. Also, increasing the size of the interesting subgroup shows a decrease in the quality of the algorithm. The new subgroup is defined by a1 = 0 and hence is less specialized than the earlier explored one. Surprisingly, the new subgroup is never chosen as the best subgroup. Nevertheless, one can see that the means of the ratios R1, R2 and R3 are all around 0.5. This indicates that the best chosen sets, on average, overlapped half with the true subgroup. Indeed, the subgroups (a1, a2) = (0, 0) and (a1, a2) = (0, 1) are chosen in 79.2% of the runs when N = 50 and in 84.9% when N = 100. From this, I conclude that the chosen subgroup, although not the true subgroup, can still be deemed exceptional and the algorithm hence gives useful results.


β3                    | ρ̄_DT | N   | |DT| | k | φ_bp         | R1          | R2          | R3          | % correct
(−0.1, −0.07, −0.002) | 0.2  | 50  | 12   | 2 | 0.39 ± 0.12  | 0.85 ± 0.35 | 0.85 ± 0.35 | 0.15 ± 0.35 | 14.6%
(0.1, 0.01, 0.003)    | 0.5  | 50  | 13   | 2 | 0.43 ± 0.11  | 0.58 ± 0.49 | 0.58 ± 0.49 | 0.42 ± 0.49 | 41.8%
(0.2, 0.08, 0.007)    | 0.7  | 50  | 16   | 2 | 0.52 ± 0.09  | 0.25 ± 0.43 | 0.25 ± 0.43 | 0.75 ± 0.43 | 74.8%
(0.2, 0.2, 0.2)       | 0.92 | 50  | 17   | 2 | 0.65 ± 0.06  | 0.04 ± 0.19 | 0.04 ± 0.19 | 0.96 ± 0.19 | 95.9%
(−0.1, −0.07, −0.002) | 0.2  | 100 | 26   | 2 | 0.35 ± 0.10  | 0.91 ± 0.29 | 0.91 ± 0.29 | 0.10 ± 0.29 | 30.2%
(0.1, 0.01, 0.003)    | 0.5  | 100 | 27   | 2 | 0.39 ± 0.09  | 0.53 ± 0.50 | 0.53 ± 0.50 | 0.47 ± 0.50 | 47.2%
(0.2, 0.08, 0.007)    | 0.7  | 100 | 23   | 2 | 0.49 ± 0.08  | 0.13 ± 0.33 | 0.13 ± 0.33 | 0.87 ± 0.33 | 87.4%
(0.2, 0.2, 0.2)       | 0.92 | 100 | 22   | 2 | 0.64 ± 0.05  | 0.00 ± 0.04 | 0.00 ± 0.04 | 1.00 ± 0.04 | 99.8%
(0.2, 0.08, 0.007)    | 0.7  | 50  | 16   | 2 | 0.52 ± 0.09  | 0.25 ± 0.43 | 0.25 ± 0.43 | 0.75 ± 0.43 | 74.8%
(0.2, 0.08, 0.007)    | 0.7  | 50  | 25   | 3 | 0.57 ± 0.65  | 0.81 ± 0.39 | 0.81 ± 0.39 | 0.19 ± 0.39 | 18.6%
(0.2, 0.08, 0.007)    | 0.7  | 100 | 23   | 2 | 0.49 ± 0.08  | 0.13 ± 0.33 | 0.13 ± 0.33 | 0.87 ± 0.33 | 87.4%
(0.2, 0.08, 0.007)    | 0.7  | 100 | 13   | 3 | 0.61 ± 0.084 | 0.76 ± 0.43 | 0.76 ± 0.43 | 0.24 ± 0.43 | 23.0%
(0.2, 0.08, 0.007)    | 0.7  | 50  | 16   | 2 | 0.52 ± 0.09  | 0.25 ± 0.43 | 0.25 ± 0.43 | 0.75 ± 0.43 | 74.8%
(0.2, 0.08, 0.007)    | 0.7  | 50  | 51   | 2 | 0.53 ± 0.07  | 0.51 ± 0.50 | 0.51 ± 0.50 | 0.49 ± 0.50 | 0%
(0.2, 0.08, 0.007)    | 0.7  | 100 | 23   | 2 | 0.49 ± 0.08  | 0.13 ± 0.33 | 0.13 ± 0.33 | 0.87 ± 0.33 | 87.4%
(0.2, 0.08, 0.007)    | 0.7  | 100 | 28   | 2 | 0.57 ± 0.08  | 0.50 ± 0.50 | 0.50 ± 0.50 | 0.50 ± 0.50 | 0%

Table 3: Statistics of simulated data.

6.2 Premier League data

Now that the quality of the algorithm is known, it is possible to apply it to real-world data and derive useful results. For this, data originating from the English Premier League football season of 2010 [16] is used. This dataset contains information on all matches played during that season: 380 records with 19 attributes, which will later be divided into descriptors, predictors and responses. Per match, the following data is available:

• Date
• Home team
• Away team
• Referee
• FTHG/FTAG - (target attributes) Full time home/away team goals
• FTR - Full time result
• HS/AS - Home/away team shots
• HST/AST - Home/away team shots on target
• HF/AF - Home/away team fouls committed
• HC/AC - Home/away team corners
• HY/AY - Home/away team yellow cards
• HR/AR - Home/away team red cards

All attributes other than the Date, Home team name, Away team name, FTR and Referee name are counts of events. The most interesting question on this dataset is of course: is there a subgroup of matches in which the result of the match is exceptional with respect to the other matches?

6.2.1 Data Cleaning and filtering

I did not find any suspicious records in the dataset. However, there is a lot of redundancy between the Full Time Result FTR and the targets, so FTR is excluded. Also, because the dates on which matches were played are widely dispersed, the Date attribute was omitted.


6.2.2 Results

For this analysis, the parameters as specified in Section 4 were used. The attributes are divided into descriptors = {Home team, Away team, Referee}, predictors = {HS, AS, HST, AST, HF, AF, HC, AC, HY, AY, HR, AR} and responses = {FTHG, FTAG}. The resulting exceptional subgroups were:

• Subgroup 1a
  – Description: Referee != L Mason, AwayTeam != Sunderland, Referee != P Walton
  – Quality: 0.0342
  – Subgroup size: 313
• Subgroup 2a
  – Description: Referee != L Mason, AwayTeam != Sunderland, Referee != H Webb
  – Quality: 0.0287
  – Subgroup size: 312
• Subgroup 3a
  – Description: Referee != L Mason, AwayTeam != Sunderland, AwayTeam != Newcastle
  – Quality: 0.0266
  – Subgroup size: 323
• Subgroup 4a
  – Description: Referee != L Mason, AwayTeam != Sunderland, HomeTeam != West Brom
  – Quality: 0.0260
  – Subgroup size: 322
• Subgroup 5a
  – Description: Referee != L Mason, AwayTeam != Sunderland, HomeTeam != Man City
  – Quality: 0.0258
  – Subgroup size: 322

From here, one can see that matches which do not have L Mason as referee and do not have Sunderland as away team have an exceptional correlation compared to the rest of the data. More specifically, the results show that the correlation between the scores is higher in the subgroup than in its complement. This indicates that matches involving this referee or this away team have a lower correlation. A more detailed view on the subgroup with the highest quality might give some understanding of why this subgroup differs that much from the rest. For this, it is interesting to see whether there are any significant differences in behavior between the subgroup and its complement. To do this, the distributions of the predictive attributes are analyzed. The difference in distribution of an attribute between the subgroup and its complement can be assessed using a Kolmogorov-Smirnov test [26], as sketched in the snippet below. Testing all attributes in this way, the attribute AY (Away Yellow) is identified as significantly different, with a p-value of 0.0591. Histograms of the counts in the subgroup and its complement are found in Figure 7. Although they have a similar shape, the distribution differs between both sets. Apart from this attribute, no significant differences in distribution were found.
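A sketch of this per-attribute comparison in R (the data frame premier and the logical vector in_subgroup marking membership of Subgroup 1a are assumed names, used only for illustration):

# Compare the distribution of each predictive attribute between the best
# subgroup and its complement with a two-sample Kolmogorov-Smirnov test.
predictors <- c("HS", "AS", "HST", "AST", "HF", "AF", "HC", "AC", "HY", "AY", "HR", "AR")
p_values <- sapply(predictors, function(att) {
  ks.test(premier[[att]][in_subgroup], premier[[att]][!in_subgroup])$p.value
})
sort(p_values)    # attributes with the smallest p-values deviate most in distribution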


Figure 7: Histograms of AY in Subgroup 1a and its complement.

Moreover, simple statistics on Lee Mason show that he was the referee in 23 of the 380 matches of the 2010 season. With a total of 18 different referees and 380 matches, a referee officiates 21 matches on average, so Lee Mason is slightly above this average. Interestingly, just one of the matches he refereed involved Sunderland as either home or away team, which suggests that the two attributes were chosen for the best subgroup independently of each other. For Sunderland as away team, one can see that the matches they played were mostly led by Phil Dowd (three times); apart from this, there is no clear pattern in the referees of matches with Sunderland as away team. Expanding this research by changing the beam depth from 3 to 5 yields the following results:

• Subgroup 1b
  – Description: Referee != P Walton, HomeTeam != Newcastle, HomeTeam != West Brom, AwayTeam != Sunderland, Referee != M Atkinson
  – Quality: 0.0774
  – Subgroup size: 276
• Subgroup 2b
  – Description: Referee != P Walton, HomeTeam != Newcastle, HomeTeam != West Brom, AwayTeam != Sunderland, Referee != A Marriner
  – Quality: 0.0719
  – Subgroup size: 275
• Subgroup 3b
  – Description: Referee != P Walton, HomeTeam != Newcastle, HomeTeam != West Brom, HomeTeam != Wolves, Referee != A Marriner
  – Quality: 0.0689
  – Subgroup size: 272
• Subgroup 4b
  – Description: Referee != P Walton, HomeTeam != Newcastle, HomeTeam != West Brom, HomeTeam != Wolves, Referee != M Atkinson
  – Quality: 0.0689
  – Subgroup size: 276
• Subgroup 5b
  – Description: Referee != L Mason, AwayTeam != Sunderland, HomeTeam != Fulham, AwayTeam != Newcastle, Referee != M Atkinson
  – Quality: 0.0681
  – Subgroup size: 286

From these subgroups, we find more interesting subgroups, and one can see that more teams and referees are deemed exceptional. Moreover, one can clearly see the structure of beam search here. Looking at beam depth 4, the results originate from 3 different beams, namely:

• Referee != P Walton, HomeTeam != Newcastle, HomeTeam != West Brom, AwayTeam != Sunderland
• Referee != P Walton, HomeTeam != Newcastle, HomeTeam != West Brom, HomeTeam != Wolves
• Referee != L Mason, AwayTeam != Sunderland, HomeTeam != Fulham, AwayTeam != Newcastle

Setting the depth to a higher number results in the chosen subgroups having a different referee in the first layer of the beam search: now, Peter Walton is up front in most of the chosen subgroups. Sunderland is still mentioned in three of the five groups. Simple statistics on Peter Walton show that he was the referee in 28 out of 380 matches. This is above the average of 21, indicating that he is a more experienced referee and hence should be able to lead a fairer game. The new teams in the subgroups are Newcastle and West Brom; for these teams, however, no exceptional simple statistics were found. Again, one can assess the differences in distribution of the attributes between the best subgroup and its complement using the Kolmogorov-Smirnov test [26]. The same analysis this time yields a significant difference in the distribution of the attribute AC (Away Corner), with a p-value of 0.0481. The histograms of AC in the best subgroup and its complement are found in Figure 8. Again, the two figures show a similar shape; according to the test, however, they have different distributions.

Figure 8: Histograms of AC in Subgroup 1b and its complement.
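For reference, the depth-5 run above was obtained by changing only the hard-coded search parameters inside the EMM function listed in the appendix; a minimal sketch of that change:

#Inside the EMM function from the appendix, the beam search parameters are
#hard-coded; rerunning the analysis at a larger depth only requires changing
#beam_depth (the other parameters are left as in the depth-3 run).
beam_width  <- 100
beam_depth  <- 5    #was 3 for Subgroups 1a-5a
result_size <- 5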


As this thesis is based on the work done in [34], it is interesting to review both results and identify similarities and differences between the two approaches. In [34], outlier detection is applied using Bayesian Networks. As explained, outlier detection concerns building a model (in this case a Bayesian Network) on the whole dataset and finding elements which do not comply with this model. For this, the paper introduces a new metric to quantify the extent to which an element is an outlier with respect to the model. As an example, the authors use the Premier League dataset with slightly different information per match. As a result, they find specific players of teams which show outlying behavior from the model on the whole dataset.

Reviewing that analysis against the work in this thesis shows that the same data can be analyzed in different manners. Whereas [34] looked at individual outliers in the data, this thesis focused on subgroups with outlying behavior. Despite the different approaches, both methods show significant results in revealing exceptional (or outlying) behavior. When looking at applications, the work of [34] could be used to find individual exceptionally behaving objects: it could identify which football players show positively outstanding or fraudulent behaviour, or which computer in a network might be used by hackers to attack an information system. The method of this thesis, in turn, could identify groups of football players or groups of computers used by hackers. Both methods are valuable for finding outstanding behavior.


7 Conclusion

In this thesis, one has seen an application of the Exceptional Model Mining framework. In EMM, a subset of the data is deemed interesting when the model trained on its target attributes deviates significantly from the model trained on either its complement or the whole dataset. Combining this notion with a specific type of data narrows down the possible models that can be applied. Here, the EMM framework was applied to bivariate count data. Since count data represent the number of events that took place in a specific amount of time, this kind of data is associated with distributions having the same property. From this, the Poisson distribution was selected as a suitable model for the EMM framework. Poisson regression was introduced such that the responses depend on their corresponding predictors. Due to the bivariate nature of the data used in this thesis, a bivariate Poisson distribution and its corresponding regression model were introduced.

The quality of the EMM framework with the bivariate Poisson regression model was assessed using simulated data. Using a known ground truth, I could assess how sensitive the algorithm is to changes in the size of the input data, the input parameters β, the number of descriptive attributes and the size of the true exceptional subgroup. From this analysis, I found that the algorithm behaves as expected when the average correlation between the two responses is varied. I also found that, when the true exceptional subgroup can be specialized to a smaller one (by refining on its descriptors), the algorithm is less likely to pick the true subgroup and will rather choose the refined one.

Knowing the quality of the method, I used real-life data from the English Premier League of 2010. Applying the algorithm to this dataset indicated that the correlation between the final scores depends on the referee or the away team of a football match. It was particularly interesting that the absence of a specific referee or club corresponded to a higher correlation, which might imply that this referee and this team play a more ‘fair’ game. In the same manner, my method could be used for fraud detection in other sports matches in the future. The results were concluded with a comparison with the previous publication on football data. From this, I concluded that the works differ in approach, but are both useful for future applications regarding outlier detection and subgroup discovery on count data.


8 Future work

As explained, this thesis is limited to the application of the Exceptional Model Mining framework with the bivariate Poisson regression model to football data. However, more distributions might be suitable for bivariate count data, for example the negative binomial distribution, which models the number of successes in i.i.d. Bernoulli trials before a specified number of failures occurs. Closely related to this (and also a member of the generalized linear models) is the negative binomial regression model [40]. Whereas the Poisson distribution has the property that the mean of a random variable equals its variance, the negative binomial regression model provides the option of a variance that is not tied to the mean. This might result in refinements of the parameter choices and hence more accurate results; a minimal sketch of such a model fit is given at the end of this section.

Apart from choosing other models for the EMM framework, one might also look at different applications of this method. This thesis focused on sports data, but one might also look at cases from the medical world or other fields of expertise. For example, one might look at the correlation between different diseases and observe exceptional behaviour in subgroups formed by the different medication given to the patients. A model can then be formed using patient demographics, and exceptional properties might be found.
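As a first step in that direction, the sketch below fits the two responses marginally with negative binomial regression using glm.nb from the MASS package, on the football data frame input_data from the appendix. This is only a sketch under those assumptions; a genuinely bivariate negative binomial analogue of lm.bp would still require a dedicated implementation.

#Minimal sketch: marginal negative binomial regressions for the two responses,
#as a starting point for replacing the Poisson model in the quality measure.
#Assumes the data frame 'input_data' from the appendix.
library(MASS)

nb_home <- glm.nb(FTHG ~ HS + AS + HST + AST + HF + AF + HC + AC + HY + AY,
                  data = input_data)
nb_away <- glm.nb(FTAG ~ HS + AS + HST + AST + HF + AF + HC + AC + HY + AY,
                  data = input_data)

#The estimated dispersion parameter theta indicates how far the variance
#deviates from the Poisson assumption (variance = mu + mu^2/theta).
nb_home$theta
nb_away$theta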


References

[1] F Abramovich and Y Ritov. Statistical Theory: A Concise Introduction. CRC Texts in Statistical Science, 2013.
[2] A Agresti. Categorical Data Analysis. Wiley, 1990.
[3] M Atzmueller. Subgroup discovery. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 5(1):35–49, 2015.
[4] M Atzmueller, F Puppe, and H P Buscher. Exploiting background knowledge for knowledge-intensive subgroup discovery. IJCAI International Joint Conference on Artificial Intelligence, pages 647–652, 2005.
[5] P Berkhout. A bivariate Poisson count data model using conditional probabilities. Statistica Neerlandica, 58(3):349–364, 2004.
[6] A C Cameron and P K Trivedi. Regression Analysis of Count Data. Cambridge University Press, April 2012.
[7] H Chen and J K Lindsey. Applying Generalized Linear Models. Springer, 1998.
[8] A Collier. liqueueR: Implements Queue, PriorityQueue and Stack Classes, 2016.
[9] A P Dempster, N M Laird, and D B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1–38, 1977.
[10] A J Dobson. Introduction to Generalized Linear Models. CRC Press, 2000.
[11] W Duivesteijn, A J Feelders, and A Knobbe. Different slopes for different folks: mining for exceptional regression models with Cook's distance. Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 868–876, 2012.
[12] W Duivesteijn, A J Feelders, and A Knobbe. Exceptional Model Mining: Supervised descriptive local pattern mining with complex target concepts. Data Mining and Knowledge Discovery, 30(1):47–98, 2016.
[13] W Duivesteijn, A Knobbe, A J Feelders, and M van Leeuwen. Subgroup discovery meets Bayesian networks - An Exceptional Model Mining approach. Proceedings - IEEE International Conference on Data Mining, ICDM, pages 158–167, 2010.
[14] W Duivesteijn, E Loza Mencía, J Fürnkranz, and A Knobbe. Multi-label LeGo - enhancing multi-label classifiers with local patterns. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 7619 LNCS:114–125, 2012.
[15] W Feller. An Introduction to Probability Theory and Its Applications. Wiley, 1959.
[16] Football-Data. The Premier League dataset. http://www.football-data.co.uk/englandm.php. Accessed: 2017-07-04.
[17] R Goebel. Lecture Notes in Artificial Intelligence, Subseries of Lecture Notes in Computer Science, LNAI Series Editors. 2011.
[18] W Greene. Functional forms for the negative binomial model for count data. Economics Letters, 99(3):585–590, 2008.
[19] H Grosskreutz and S Rüping. On subgroup discovery in numerical domains. Data Mining and Knowledge Discovery, 19(2):210–226, 2009.
[20] S Gurmu and J Elder. Generalized bivariate count data regression models. 68:31–36, 2000.


[21] F Herrera, C J Carmona, P González, and M J del Jesus. An overview on subgroup discovery: foundations and applications. Knowledge and Information Systems, 29(3):495–525, 2011.
[22] M Hocine, D Guillemot, P Tubert-Bitter, and T Moreau. Testing independence between two Poisson-generated multinomial variables in case-series and cohort studies. Statistics in Medicine, 24(24):4035–4044, 2005.
[23] D Karlis and L Meligkotsidou. Multivariate Poisson regression with covariance structure. Statistics and Computing, 15(4):255–265, 2005.
[24] D Karlis and I Ntzoufras. Bivariate Poisson and Diagonal Inflated Bivariate Poisson Regression Models in R. Journal of Statistical Software, 30(April):1–3, 2009.
[25] W Klösgen. Explora: A multipattern and multistrategy discovery assistant. In Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 249–271. American Association for Artificial Intelligence, Menlo Park, CA, USA, 1996.
[26] A Kolmogorov. Sulla determinazione empirica di una legge di distribuzione. G. Ist. Ital. Attuari, 4:83–91, 1933.
[27] C J Kowalski. On the Effects of Non-Normality on the Distribution of the Sample Product-Moment Correlation Coefficient. Journal of the Royal Statistical Society, Series C (Applied Statistics), 21(1):1–12, 1972.
[28] V Kumar. Parallel and Distributed Computing for Cybersecurity. IEEE Distributed Systems Online, 6(10):1–9, 2005.
[29] N Lavrač, B Kavšek, P Flach, and L Todorovski. Subgroup Discovery with CN2-SD. The Journal of Machine Learning Research, 5:153–188, 2004.
[30] D Leman, A J Feelders, and A Knobbe. Exceptional model mining. In Proceedings of ECML/PKDD, vol. 2, pages 1–16, 2008.
[31] W R Mebane and J S Sekhon. Robust Estimation and Outlier Detection for Overdispersed Multinomial Models of Count Data. American Journal of Political Science, 48(2):392–411, 2004.
[32] W L Poston. Discrete multivariate distributions. Technometrics, 40(2):161–162, 1998.
[33] D R Reddy. Speech understanding systems: summary of results of the five-year research effort at Carnegie-Mellon University. page 181, 1977.
[34] F Riahi and O Schulte. Model-Based Outlier Detection for Object-Relational Data. In 2015 IEEE Symposium Series on Computational Intelligence, pages 1590–1598, December 2015.
[35] I Sabuncuoglu and M Bayiz. Job shop scheduling with beam search. European Journal of Operational Research, 118(2):390–412, 1998.
[36] S S Shapiro and M B Wilk. An Analysis of Variance Test for Normality (Complete Samples). Biometrika, 52(3/4):591, 1965.
[37] M van Leeuwen. Maximal exceptions with minimal descriptions. Data Mining and Knowledge Discovery, 21(2):259–276, 2010.
[38] Wikipedia. Expectation–maximization algorithm — Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Expectation–maximization algorithm&oldid=783069174. Accessed: 2017-05-25.
[39] S Wrobel. Inductive Logic Programming for Knowledge Discovery in Databases. Springer Berlin Heidelberg, Berlin, Heidelberg, 2001.
[40] T W Yee. Vector Generalized Linear and Additive Models. Springer.


[41] A Zeileis. Econometric Computing with HC and HAC Covariance Matrix Estimators. Journal of Statistical Software, 11(10):1–17, 2004.
[42] A Zeileis. Object-Oriented Computation of Sandwich Estimators. Journal of Statistical Software, 16(9):1–16, 2006.


Appendix: R code

Quality measures

#Note: setdiff() and intersect() are applied to data frames throughout this code;
#this assumes a row-wise implementation such as the one provided by dplyr.

#Specify here which measure you want to use!
phi <- function(set) {
  return(phi_bp(set))
}

#Entropy function (N is read from the global environment)
phi_ef <- function(set) {
  n <- nrow(set)
  return(-n/N * log(n/N) - (N - n)/N * log((N - n)/N))
}

#Correlation statistical test measure
phi_corr <- function(set) {
  n <- nrow(set)
  n_C <- N - n

  set_C <- setdiff(data, set)

  #Fisher z-transformation of the sample correlations
  r_prime <- cor(set$V6, set$V7)
  z_prime <- 1/2 * log((1 + r_prime)/(1 - r_prime))

  r_prime_C <- cor(set_C$V6, set_C$V7)
  z_prime_C <- 1/2 * log((1 + r_prime_C)/(1 - r_prime_C))

  z_star <- (z_prime - z_prime_C)/sqrt(1/(n - 3) + 1/(n_C - 3))

  pValue <- 1 - pnorm(abs(z_star))
  #pValue <- pValue * phi_ef(set)

  return(pValue)
}

#Correlation absolute difference measure
phi_abs <- function(set) {
  set_C <- setdiff(data, set)

  r_prime <- cor(set$V6, set$V7)
  r_prime_C <- cor(set_C$V6, set_C$V7)

  quality <- abs(r_prime - r_prime_C)
  quality <- quality * phi_ef(set)
  return(quality)
}

#Bivariate Poisson measure
phi_bp <- function(set) {
  set_C <- setdiff(data, set)

  if (nrow(set_C) == 0) {
    return(0)
  }

  #Fit a bivariate Poisson regression on the subgroup and on its complement.
  #Football setting:
  model <- lm.bp(FTHG ~ HS+AS+HST+AST+HF+AF+HC+AC+HY+AY,
                 FTAG ~ HS+AS+HST+AST+HF+AF+HC+AC+HY+AY,
                 l3 = ~HS+AS+HST+AST+HF+AF+HC+AC+HY+AY,
                 data = set, pres = 1e-03)
  model_C <- lm.bp(FTHG ~ HS+AS+HST+AST+HF+AF+HC+AC+HY+AY,
                   FTAG ~ HS+AS+HST+AST+HF+AF+HC+AC+HY+AY,
                   l3 = ~HS+AS+HST+AST+HF+AF+HC+AC+HY+AY,
                   data = set_C, pres = 1e-03)
  #Simulation setting (use instead of the football setting when analyzing simulated data):
  #model <- lm.bp(r1 ~ c1+c2+c3, r2 ~ c1+c2+c3, l3 = ~c1+c2+c3, data = set, pres = 1e-03)
  #model_C <- lm.bp(r1 ~ c1+c2+c3, r2 ~ c1+c2+c3, l3 = ~c1+c2+c3, data = set_C, pres = 1e-03)

  m <- nrow(set)
  n <- nrow(set_C)

  #Average estimated correlation in the subgroup
  rho_sum <- 0
  for (i in 1:m) {
    #Football setting:
    vector <- c(1, set$HS[i], set$AS[i], set$HST[i], set$AST[i], set$HF[i],
                set$AF[i], set$HC[i], set$AC[i], set$HY[i], set$AY[i])
    #Simulation setting:
    #vector <- c(1, set$c1[i], set$c2[i], set$c3[i])
    l1 <- exp(vector %*% model$beta1)
    l2 <- exp(vector %*% model$beta2)
    l3 <- exp(vector %*% model$beta3)

    rho_sum <- rho_sum + l3/sqrt((l1 + l3) * (l2 + l3))
  }
  rho_hat <- rho_sum/m

  #Average estimated correlation in the complement
  rho_sum_C <- 0
  for (i in 1:n) {
    #Football setting:
    vector <- c(1, set_C$HS[i], set_C$AS[i], set_C$HST[i], set_C$AST[i], set_C$HF[i],
                set_C$AF[i], set_C$HC[i], set_C$AC[i], set_C$HY[i], set_C$AY[i])
    #Simulation setting:
    #vector <- c(1, set_C$c1[i], set_C$c2[i], set_C$c3[i])
    l1 <- exp(vector %*% model_C$beta1)
    l2 <- exp(vector %*% model_C$beta2)
    l3 <- exp(vector %*% model_C$beta3)

    rho_sum_C <- rho_sum_C + l3/sqrt((l1 + l3) * (l2 + l3))
  }
  rho_hat_C <- rho_sum_C/n

  return(abs(rho_hat - rho_hat_C))
}

Refinement operator

nu <- function(seed) {
  set <- list()

  for (i in 1:length(attributes)) {

    if (attributes[i] == "tar") {
      #Skip target variables
    }

    else if (attributes[i] == "bin") {
      #Binary descriptor: refine on both values
      newset <- seed[seed[i] == "0", ]
      if (!(nrow(newset) == 0 | grepl(paste(colnames(seed)[i], "== 0"), newset$description[1]))) {
        newset$description <- paste(newset$description, "|", colnames(seed)[i], "== 0")
        set[[length(set) + 1]] <- newset
      }

      newset <- seed[seed[i] == "1", ]
      if (!(nrow(newset) == 0 | grepl(paste(colnames(seed)[i], "== 1"), newset$description[1]))) {
        newset$description <- paste(newset$description, "|", colnames(seed)[i], "== 1")
        set[[length(set) + 1]] <- newset
      }
    }

    else if (attributes[i] == "nom") {
      #Nominal descriptor: refine on equality and inequality for each value
      types <- lapply(seed[i][!duplicated(seed[i]), ], as.character)
      for (type in types) {
        newset <- seed[seed[i] == type, ]
        if (!(nrow(newset) == 0 | grepl(paste(colnames(seed)[i], "==", type), newset$description[1]))) {
          newset$description <- paste(newset$description, "|", colnames(seed)[i], "==", type)
          set[[length(set) + 1]] <- newset
        }

        newset <- seed[seed[i] != type, ]
        if (!(nrow(newset) == 0 | grepl(paste(colnames(seed)[i], "==", type), newset$description[1]))) {
          newset$description <- paste(newset$description, "|", colnames(seed)[i], "!=", type)
          set[[length(set) + 1]] <- newset
        }
      }
    }
    else {
      #Numerical attributes not implemented
    }
  }

  return(set)
}

EMM algorithm

library(liqueueR)   #provides the Queue and PriorityQueue classes [8]

EMM <- function(data, attributes) {

  #Load the bivariate Poisson regression functions
  source("bivpois/lm.bp.r")
  source("bivpois/pbivpois.r")
  source("bivpois/newnamesbeta.r")
  source("bivpois/splitbeta.r")

  #EMM algorithm

  #Specify some input values
  N <- nrow(data)   #N is also read globally by phi_ef and phi_corr
  beam_width <- 100
  beam_depth <- 3
  result_size <- 5
  constraints <- NULL

  #Add output columns
  data$description <- ""
  data$quality <- 0

  candidateQueue <- Queue$new()
  candidateQueue$push(data)
  resultSet <- PriorityQueue$new()

  for (depth in 1:beam_depth) {
    beam <- PriorityQueue$new()

    while (!candidateQueue$empty()) {
      seed <- candidateQueue$pop()
      set <- nu(seed)
      for (desc in set) {
        #satisfyAll() checks the (here empty) constraint set; assumed to be defined elsewhere
        if (satisfyAll(desc, constraints)) {
          quality <- phi(desc)
          desc$quality[1] <- quality

          resultSet$push(desc, quality)
          beam$push(desc, quality)
        }
      }

      #Keep at most beam_width candidates for the next search level
      finalBeam <- PriorityQueue$new()
      for (index in 1:beam_width) {
        if (beam$empty()) {
          break
        }
        finalBeam$push(beam$pop())
      }
    }
    while (!finalBeam$empty()) {
      candidateQueue$push(finalBeam$pop())
    }
  }

  #Print results (set to TRUE to print the full result set)
  if (FALSE) {
    for (index in 1:result_size) {
      result <- resultSet$pop()
      cat(paste("Subgroup:", index, "\n",
                "Quality:", result$quality[1], "\n",
                "Size of group:", nrow(result), "\n",
                result$description[1], "\n\n"))
    }
  }
  result <- resultSet$pop()

  return(result)
}

Simulate function

simulate <- function(items, beta3) {
  betaH <- c(0.066, 0.066, 0.066)
  betaA <- c(0, 0, 0)

  #Trivariate reduction: Y1, Y2, Y3 are independent Poisson counts and the
  #responses are (Y1 + Y3, Y2 + Y3), so Y3 induces the correlation
  multiVarPoi <- function(covariates, useI = 0, betaI = c(0, 0, 0)) {
    A <- matrix(c(1, 0, 1,
                  0, 1, 1),
                nrow = 2, ncol = 3, byrow = TRUE)

    if (useI) {
      Y <- matrix(c(rpois(1, exp(betaH %*% covariates)),
                    rpois(1, exp(betaA %*% covariates)),
                    rpois(1, exp(betaI %*% covariates))),
                  nrow = 3, ncol = 1)
    } else {
      Y <- matrix(c(rpois(1, exp(betaH %*% covariates)),
                    rpois(1, exp(betaA %*% covariates)),
                    rpois(1, 0)),
                  nrow = 3, ncol = 1)
    }

    return(A %*% Y)
  }

  #Simulate binary descriptors, count covariates and responses
  simulationBin <- data.frame()
  for (i in 1:items) {
    row <- data.frame("d1" = character(1), "d2" = character(1), "d3" = character(1),
                      "c1" = numeric(1), "c2" = numeric(1), "c3" = numeric(1),
                      "r1" = numeric(1), "r2" = numeric(1))

    descriptive <- c(sample(0:1, 1), sample(0:1, 1), sample(0:1, 1))
    covariates <- c(sample(0:10, 1), sample(0:10, 1), sample(0:10, 1))

    row$d1 <- descriptive[1]
    row$d2 <- descriptive[2]
    row$d3 <- descriptive[3]
    row$c1 <- covariates[1]
    row$c2 <- covariates[2]
    row$c3 <- covariates[3]

    if (descriptive[1] == 0 & descriptive[2] == 1) {
      #True exceptional subgroup: responses get the covariance component beta3
      response <- multiVarPoi(covariates, useI = 1, betaI = beta3)
      row$r1 <- response[1]
      row$r2 <- response[2]
    } else {
      response <- multiVarPoi(covariates)
      row$r1 <- response[1]
      row$r2 <- response[2]
    }
    simulationBin <- rbind(simulationBin, row)
  }
  return(simulationBin)
}

Simulation procedure

phi_list <- list()
desc_list <- list()
m1_list <- list()
m2_list <- list()
m3_list <- list()
attributes <- c("bin", "bin", "bin", "tar", "tar", "tar", "tar", "tar")

iterations <- 1000
number_of_items <- 100
beta <- c(0.2, 0.08, 0.007)
for (i in 1:iterations) {
  data <- simulate(number_of_items, beta)
  print(i)

  output <- EMM(data, attributes)

  #The true exceptional subgroup: d1 == 0 and d2 == 1
  correct <- data[data$d1 == 0, ]
  correct <- correct[correct$d2 == 1, ]

  quality <- output$quality[1]
  description <- output$description[1]

  output <- output[, !(names(output) %in% c("quality", "description"))]

  #Overlap measures between the returned subgroup and the true subgroup
  m1 <- nrow(setdiff(correct, output))/nrow(correct)
  m2 <- nrow(setdiff(output, correct))/nrow(output)
  m3 <- nrow(intersect(output, correct))/sqrt(nrow(output) * nrow(correct))

  m1_list <- rbind(m1_list, m1)
  m2_list <- rbind(m2_list, m2)
  m3_list <- rbind(m3_list, m3)

  phi_list <- rbind(phi_list, quality)
  desc_list <- rbind(desc_list, description)
}

hist(as.double(phi_list), xlab = "Phi Bivariate Poisson", ylab = "Frequency", main = NULL)
boxplot(as.double(m1_list), as.double(m2_list), as.double(m3_list),
        names = c("R1", "R2", "R3"), ylab = "Ratio")
barplot(sort(prop.table(table(unlist(desc_list))), decreasing = TRUE),
        xlab = "Descriptions", ylab = "Ratio")
paste(mean(as.double(phi_list)), sd(as.double(phi_list)),
      mean(as.double(m1_list)), sd(as.double(m1_list)),
      mean(as.double(m2_list)), sd(as.double(m2_list)),
      mean(as.double(m3_list)), sd(as.double(m3_list)),
      sort(table(unlist(desc_list)), decreasing = TRUE)[1]/1000)

Football Analysis Procedure

#Read the football data
input_data <- read.csv("pl.csv", sep = ";")
#Filter out Full Time Result and Date to prevent trivial results
input_data <- input_data[, !(names(input_data) %in% c("FTR", "Date"))]

#Characterize the attributes: three nominal descriptors, the remaining columns
#are target attributes (predictors and responses) skipped by the refinement operator
input_attributes <- c("nom", "nom", "nom", "tar", "tar", "tar", "tar", "tar", "tar",
                      "tar", "tar", "tar", "tar", "tar", "tar", "tar", "tar")

#The quality measures and the refinement operator read 'data' and 'attributes'
#from the global environment
data <- input_data
attributes <- input_attributes

EMM(input_data, input_attributes)
