Finding interesting itemsets using a probabilistic model for binary databases

Tijl De Bie University of Bristol, Department of Engineering Mathematics Queen’s Building, University Walk, Bristol, BS8 1TR, UK [email protected]

ABSTRACT

A good formalization of interestingness of a pattern should satisfy two criteria: it should conform well to intuition, and it should be computationally tractable to use. The focus has long been on the latter, with the development of frequent pattern mining methods. However, it is now recognized that more appropriate measures than frequency are required.

In this paper we report results in this direction for itemset mining in binary databases. In particular, we introduce a probabilistic model that can be fitted efficiently to any binary database, and that has a compact and explicit representation. We then show how this model enables the formalization of an intuitive and tractable interestingness measure for itemsets, relying on concepts from information theory.

Our probabilistic model is closely related to the uniform distribution over all databases that can be obtained by means of swap randomization [8]. However, in contrast to the swap randomization model, our model is explicit, which is key to its use for defining practical interestingness measures.

Categories and Subject Descriptors

H.2.8 [Database management]: Database applications—Data mining; I.5.1 [Pattern recognition]: Models—Statistical

General Terms

Algorithms

Keywords

Interestingness measures, frequent pattern mining, probabilistic modelling, binary databases, maximum entropy.

1. INTRODUCTION

Frequent Pattern Mining (FPM), and Frequent Itemset Mining (FIM) in particular, has quickly become a prototypical problem in data mining. The reasons are arguably the elegance and efficiency of the algorithms to search for frequent patterns. Unfortunately, the frequency of a pattern is only loosely related to interestingness. The output of frequent pattern mining methods is usually an immense bag of patterns that are not necessarily interesting, and often highly redundant with each other. This has hampered the uptake of FPM and FIM in data mining practice.

Recent research has shifted focus to the search for more useful formalizations of interestingness that match practical needs more closely, while still being amenable to efficient algorithms. As FIM is arguably the simplest special case of frequent pattern mining, it is not surprising that most of the recent work has focussed on itemset patterns, see e.g. [4, 7, 13, 5, 6, 8, 15, 9], and we will follow this in this paper.

Statistically inspired interestingness measures.

A sensible approach is to search for 'large' itemsets specified in a more comprehensive way than by means of frequency alone, but with a close eye on algorithmic scalability. For example, tiles (also known as hyperrectangles [16]) are itemset-tidset pairs covering ones in the database at their intersection, and their surface (the product of the itemset's size and frequency) has been suggested as a more effective interestingness measure than the itemset's frequency alone [7]. Mining large tiles can be done relatively efficiently, which makes it an attractive alternative to frequent itemset mining. Still, the surface of a tile, while better than the frequency of an itemset, is often not sufficiently close to the interestingness (see also the empirical results in Sec. 4).

Other approaches are rooted more directly in statistics, and in particular in hypothesis testing. Such methods attempt to compute the statistical significance of each itemset, or of a global measure that assesses the interestingness of the entirety of all itemsets found (e.g. the number of frequent itemsets) [4, 5, 6, 8, 15].

Such approaches are confronted with the challenge of specifying and computing a probabilistic hypothesis for the database, to which the patterns are contrasted. Too naive a null hypothesis will result in the discovery of fairly trivial patterns (which are known to be present or easily spotted), rather than the interesting ones.

The null hypothesis from [8] is a particularly attractive one, being the uniform distribution over all databases with the same item frequencies and transaction sizes as in the given database. They argued that item frequencies and transaction sizes can often be thought of as prior knowledge, such that this null hypothesis is an adequate representation of the prior state of mind of the data mining practitioner.

Unfortunately, no explicit characterization of this distribution is known. Nevertheless, random databases can be sampled from it to a good approximation and with reasonable efficiency using a Markov chain of swap operations. This makes it possible to statistically assess global measures of the whole of the discovered set of patterns by means of resampling methods. Still, the lack of an explicit characterization makes it useless for the assessment of discovered patterns individually, let alone for directly discovering patterns that are significant with respect to this null hypothesis.

Contributions of this paper.

Here, we overcome some of the limitations of the above-mentioned approaches by complementing them with new ideas from statistical modelling and convex optimization theory. In particular, we show how it is possible to fit a probabilistic model to a database that respects the item frequencies and transaction sizes, much like in [8] using swap randomization, but here in a direct and explicit manner.

We envision that this new probabilistic model will prove useful in a variety of ways. Of particular interest in this paper, we show how it allows us to define new interestingness measures for tiles (or equivalently, itemsets). The proposed interestingness measure is the ratio of the information content of a tile with respect to the probability distribution fitted to the database, and the description length of the itemset-tidset pair describing the tile (see Sec. 3.1).

Moreover, like the swap randomization model, our model can be used to assess the statistical significance of data mining results. However, our model has a significant advantage over swap randomization: it does not rely on computation-intensive Markov chain sampling techniques, of which the convergence to the desired distribution is hard to assess.

We conclude this paper with some empirical results that illustrate how the model can be efficiently fitted to a binary database; how the newly defined interestingness measure compares to two important existing measures; and how the model can be used for assessing the significance of data mining results.
2. THE MAXIMUM ENTROPY MODEL

In this Section, we introduce the principles of our probabilistic database model. We endeavour to create the model such that it most adequately reflects the prior knowledge of the data miner about the data and the patterns it contains.

2.1 The rationale of the model

It may seem debatable whether any single such model can have sufficiently general applicability. Nevertheless, e.g. [8, 5, 6] have provided a convincing argumentation that the individual item frequencies, which we will refer to as the empirical item marginals, may be thought of as known by the practitioner. Similarly, in many cases it is warranted to regard the transaction sizes as prior knowledge. These transaction sizes, normalized by the total number of items, are also known as the empirical transaction marginals.

Indeed, items and transactions are atomic entities that can usually be readily understood—this as opposed to correlations between items or transactions, which can be much harder to understand, and certainly harder to predict.

For example, items like the plastic carrier bag or the daily newspaper typically overwhelm market basket analyses. Similarly, given that a basket is large, it is not unexpected that it simultaneously contains two specified items. As another example, if a text miner were told that the words 'is' and 'the' occur frequently together in the texts in his corpus, he would not receive much additional information—it is not surprising given his prior information about the language. Similarly, if his corpus consists of texts that widely vary in length, overlaps between long texts should be considered less interesting than overlaps between short texts.

In the exposition of this work we adopt the same assumption on this type of prior knowledge, following previous authors in believing that this assumption is practically relevant indeed. Nevertheless, we hope that our focus on this particular type of prior knowledge will not hide the generality of our methodology, and its potential to be applied to different types of prior knowledge.

As the dependency structure is what one typically would like to discover, our model should not encode any dependencies between items or transactions that cannot be accounted for by their individual frequencies. To this end, we choose a data model in which all database entries are independent, while conforming to the prior knowledge about the item and transaction marginals. Still, this additional assumption does not unambiguously determine the model. So as not to introduce any additional bias, we therefore choose the model with the largest entropy satisfying these constraints.

2.2 The mathematical formalization

In this paper we will represent the database D by the binary matrix D ∈ {0, 1}^{n×m}, with rows indexed by transactions t ∈ T and columns by items i ∈ I. The element on row t and column i is denoted d_{t,i}, and d_{t,i} = 1 iff item i was part of transaction t. We will now mathematically formalize the above description of the model.

The independence assumption.

As we do not want to encode any spurious prior information, we will assume that all entries d_{t,i} are independent of each other. This means that the probabilistic database model P we are looking for is the product distribution of a set of Bernoulli distributions, one for each entry d_{t,i} in D. Each such Bernoulli distribution is parameterised completely by the probability p_{t,i} that d_{t,i} = 1, so that we need to identify n × m numbers p_{t,i}.

Constraining the item and transaction marginals.

The empirical marginal for any item i can be computed from D as:

    p_i = (1/n) Σ_t d_{t,i}.    (1)

Similarly, the empirical marginal for any transaction t is equal to:

    p_t = (1/m) Σ_i d_{t,i}.    (2)

In our model, we will constrain the marginal probabilities for items and transactions to be equal to these empirical estimates derived from the database, or formally:

    (1/n) Σ_t p_{t,i} = p_i,
    (1/m) Σ_i p_{t,i} = p_t.

The MaxEnt objective.

The number of constraints is only n + m, while the number of free variables in the database model P is equal to n × m: all entry probabilities p_{t,i}. Hence, P is typically not fully identified by the constraints on the marginals. In such cases, in order to avoid biasing the distribution, one searches for the distribution of maximal entropy among those that satisfy the constraints [10]. We will refer to this distribution as the MaxEnt distribution. Despite the wide acceptance of the choice for MaxEnt in the absence of sufficient information, we appreciate that this choice may seem arbitrary at this point. However, we will provide an additional justification in Sec. 2.3.

The complete optimization problem.

Stated mathematically, the MaxEnt database model P is thus found by solving the following optimization problem:

    max_{p_{t,i}}  − Σ_{t,i} [ p_{t,i} log p_{t,i} + (1 − p_{t,i}) log(1 − p_{t,i}) ]    (3)
    s.t.  Σ_t p_{t,i} = n p_i,    (4)
          Σ_i p_{t,i} = m p_t.    (5)

The final shape of the model.

In Appendix A, we show that the optimal values of p_{t,i} for this optimization problem can be stated compactly as:

    p_{t,i} = exp(λ_i + µ_t) / (1 + exp(λ_i + µ_t)),

where λ_i and µ_t are dual variables corresponding to the primal constraints, which can be computed efficiently using standard convex optimization techniques (e.g. pseudo-Newton, see Table 4 for an algorithm). Additionally, there is a lot of redundancy among these variables: many λ_i are equal to each other, and similarly for the µ_t. More specifically, the number of different λ_i and µ_t is upper bounded by min{m, n, √(2N)}, where N is the number of nonzero entries in D. Hence, the probabilistic model can be characterized using a number of variables sublinear in the size of D.
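To make this concrete, the following minimal sketch (ours, not from the paper) computes the empirical marginals of Eqs. (1,2) and evaluates the model entries p_{t,i}, assuming the database is available as a dense 0/1 NumPy array D and that fitted dual variables lam and mu are given (for instance by the algorithm of Appendix A):

import numpy as np

def empirical_marginals(D):
    """Empirical item marginals p_i (Eq. 1) and transaction marginals p_t (Eq. 2)."""
    p_i = D.mean(axis=0)   # fraction of transactions containing item i
    p_t = D.mean(axis=1)   # fraction of items present in transaction t
    return p_i, p_t

def model_probabilities(lam, mu):
    """MaxEnt entry probabilities p_{t,i} = exp(lam_i + mu_t) / (1 + exp(lam_i + mu_t))."""
    z = lam[np.newaxis, :] + mu[:, np.newaxis]   # shape (n, m)
    return 1.0 / (1.0 + np.exp(-z))              # logistic form of the model

Once lam and mu satisfy the constraints (4,5), the column means of the returned matrix equal the empirical item marginals and its row means equal the empirical transaction marginals.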
2.3 Connections to swap randomization

Previous research [8] has looked into the use of Monte Carlo randomizations to generate databases with the same empirical transaction and item marginals. Comparing patterns found in a given database with thus randomized versions of that database allows one to assess the significance of the patterns, by estimating empirical p-values for example. These randomizations are based on swaps, which are elementary operations on a database that transform it into another database while leaving the marginals unchanged.

To explain what is meant by a swap operation on a binary database D, consider a 2 × 2 submatrix for transactions t_1 and t_2 and items i_1 and i_2, where d_{t_1,i_1} = d_{t_2,i_2} = 1 and d_{t_1,i_2} = d_{t_2,i_1} = 0. A swap is defined as the operation that changes these values into d_{t_1,i_1} = d_{t_2,i_2} = 0 and d_{t_1,i_2} = d_{t_2,i_1} = 1.

By carrying out swaps for randomly selected 2 × 2 submatrices, a Markov chain of random databases is generated, each of which has the same empirical transaction and item marginals. In the limit, if the swaps are randomly sampled according to appropriate distributions, this Markov chain has been shown to converge to the uniform distribution over all databases with these marginals, so that approximate uniform sampling from these databases is made possible. However, the mixing time of the Markov chain is hard to assess, such that it is unknown how many random swaps need to be done in order to arrive at the uniform distribution.

Evidently, our MaxEnt model is different from the uniform distribution over all databases with given marginals. In the MaxEnt model, the empirical marginals of sampled databases are only equal to the given empirical marginals in expectation, rather than exactly. In fact, this may be a desirable feature, as restricting the empirical marginals to be exactly equal may be an overly strong prior, and indeed the authors of [8] suggest that further work could look into approaches that allow relaxing this constraint. A second and related difference is that, while the database entries are all independent in our model, this is not the case in the uniform distribution over databases with fixed marginals.

Still, there is a very strong connection between both probabilistic models, as exemplified by the following Theorem.

Theorem 2.1. Databases with the same specific set of empirical transaction and item marginals are equally probable under the MaxEnt model. Furthermore, given constraints of the form Eq. (4,5), the MaxEnt model is the only probabilistic model with independent p_{t,i} for which this holds.

In particular, this means that all databases with empirical marginals equal to those of the given database are equally probable under our MaxEnt model, just as in the uniform limit distribution obtained by swap randomizations. The proof is given in Appendix B.

3. FINDING INTERESTING ITEMSETS

A crucial difference between the swap randomization approach and our database modelling approach is that the latter allows one to create the model itself, rather than just to sample from it. We foresee that this has the potential of becoming the foundation on which many measures of interestingness can be based. This could be so for itemsets, but also for association rules, and possibly other patterns in transactional databases. However, in this paper we will confine our attention to itemsets, for which we define an interestingness measure in Sec. 3.1. In Sec. 3.2, as a second (and more direct) application of our model, we will discuss its use for the statistical assessment of data mining results.

3.1 Information ratio: a new interestingness measure

The information content of a tile.

A first attempt to design a measure of interestingness that springs to mind is the probability of a tile under our probabilistic model. We deliberately use the terminology 'tile', an itemset together with its supporting transaction set, as it conveys the fact that items and transactions are treated on the same footing in the MaxEnt model.

For a tile τ = (T, I) with tidset T and itemset I, this probability P(τ) can be calculated as the product of all p_{t,i} for t ∈ T and i ∈ I. More conveniently, we can define the interestingness measure as minus the log-probability. Note that this log-probability is also equal to the Shannon information content of the tile. It is given by:

    InformationContent(τ) = − Σ_{t∈T, i∈I} log(p_{t,i}).

Written in this way, one can see that the information content is a refinement of the surface of a tile: it would be equivalent to it if all p_{t,i} were equal to each other, which is the case when all transaction marginals are equal to each other as well as all item marginals.
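As an illustration (our sketch, not the paper's code), the information content of a tile can be computed directly from the fitted probability matrix; here P is assumed to be the n × m array of entry probabilities p_{t,i} (e.g. as produced by the hypothetical model_probabilities above), and tids and items are index lists specifying the tile:

import numpy as np

def information_content(P, tids, items):
    """Shannon information content of the tile (tids, items):
    minus the sum of log p_{t,i} over all entries covered by the tile."""
    block = P[np.ix_(tids, items)]   # the |T| x |I| block of entry probabilities
    return -np.sum(np.log(block))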
While certainly relevant, the information content may not be appropriate directly as a measure of informativeness. For example, tiles with a singleton itemset (or a singleton tidset) could be considered very interesting by this measure, even though they do not reveal any associations between items (or transactions).

Rather than considering its information content alone, we should contrast a tile's information content with the length of the description (i.e., the information) required to encode the tile. This is exactly what our proposed measure of interestingness does: the Information Ratio.

Information ratio: a new interestingness measure.

A tile will contain 'net' information when it helps to understand the database better, without itself being too complex to describe. Hence, we need to take into account two aspects: the length by which the description of the database can be reduced once the tile is known to be present, and the description length of the tile itself.

The first of these aspects is exactly InformationContent(τ): the database entries covered by the tile need not be described anymore in order to describe the database, so that this sum of negative log-probabilities is what is gained in terms of the description length of the database.

The description length of a tile is harder to quantify, as we need to decide on an encoding scheme for tiles to do this. One sensible way of doing it would be to assume a probabilistic model for tiles in which transactions occur independently in T and items occur independently in I, with probabilities equal to their marginal probabilities p_t and p_i. The description length for a tile using a code based on this model is then given by

    DescriptionLength(τ) = − Σ_{t∈T} log(p_t) − Σ_{i∈I} log(p_i) − Σ_{t∉T} log(1 − p_t) − Σ_{i∉I} log(1 − p_i).

Putting this together, we suggest the information ratio as an interestingness measure, defined as:

    InformationRatio(τ) = InformationContent(τ) / DescriptionLength(τ).

Other choices for the tile encoding will obviously yield different results. However, typically the tile description will remain linear in the number of items and transactions in I and T, whereas the description length of a tile entry by entry is quadratic. This means that as soon as the number of items and the number of transactions are above a certain level, the tile will be deemed more interesting—just a large tidset with a singleton itemset (or vice versa) will not suffice to receive a high interestingness score.

Remember that InformationContent(τ) can be seen as a refinement of the surface |T| · |I| of a tile. Similarly, InformationRatio(τ) can be seen as a refinement of |T|·|I| / (a|T| + b|I| + 1), with a, b positive constants. They would be equivalent if all transaction and item marginals were equal to each other.
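Continuing the same hypothetical sketch, and under the same assumptions (p_t and p_i are the empirical marginal vectors), the description length and the information ratio of a tile could be computed as follows:

import numpy as np

def description_length(p_t, p_i, tids, items):
    """Description length of the tile under the independent encoding model:
    transactions in T and items in I are encoded via their marginal probabilities."""
    in_t = np.zeros(len(p_t), dtype=bool)
    in_t[tids] = True
    in_i = np.zeros(len(p_i), dtype=bool)
    in_i[items] = True
    return (-np.sum(np.log(p_t[in_t])) - np.sum(np.log(p_i[in_i]))
            - np.sum(np.log(1.0 - p_t[~in_t])) - np.sum(np.log(1.0 - p_i[~in_i])))

def information_ratio(P, p_t, p_i, tids, items):
    """InformationRatio(tau) = InformationContent(tau) / DescriptionLength(tau)."""
    return information_content(P, tids, items) / description_length(p_t, p_i, tids, items)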
The interestingness of a set of tiles.

One is rarely interested in only the single most interesting pattern from a database. Therefore, it has been suggested to output maximally interesting sets of patterns. This is non-trivial: since patterns may convey overlapping information, the interestingness of a set of tiles cannot simply be reduced to the sum of the interestingness of each tile it contains.

The information content of a tile can, however, easily be generalized to the information content of a set of tiles, namely as the negative log-probability of all database entries covered by the set of tiles. For the same reasons mentioned above, we should trade off this information content against the total description length of all tiles in the set, e.g. by considering the ratio of the information content and the description length of the set of tiles (as we did for a single tile).

An essentially equivalent way to trade off these quantities is to search for the set of tiles with the largest information content, with an upper bound on the description length of the set of tiles encoded as itemset-tidset pairs. This problem is an instantiation of the budgeted maximum coverage problem [11], a hard combinatorial optimization problem. Still, it is well known that a near-optimal solution can be found using a greedy algorithm. Applied to the context of this paper, this near-optimal greedy algorithm selects tiles in decreasing order of what could be called an added interestingness, an adaptation of InformationRatio(τ):

    InformationRatio+(τ) = InformationContent+(τ) / DescriptionLength(τ).    (6)

Here, the numerator is defined as the additional information conveyed by the specification of the tile τ:

    InformationContent+(τ) = − Σ_{t∈T, i∈I, (t,i) uncovered} log(p_{t,i}),

where the sum is only over the transaction-item pairs in the tile that have not yet been covered by already selected tiles.

The algorithm can be applied to any set of possibly interesting tiles, e.g. the set of all tiles corresponding to all sufficiently frequent closed itemsets, as found using a standard FIM algorithm. The greedy algorithm then efficiently sorts them in decreasing order of InformationRatio+.

It can be proven for the greedy approximation that, for any k, the first k tiles in this list have a total information content that is at least 1 − 1/e times the maximum that is achievable given the total description length of these tiles. Hence, each set of top-k tiles provides a good compression of the database, and therefore, we argue, a good intelligible summary.

Similar greedy approximations to set-covering-type problems have been used earlier, for example for the interestingness measure defined as the surface of a tile [7, 12, 16], and for interestingness measures defined using p-values [5, 6].
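A minimal sketch of this greedy selection is given below (our illustration, not the paper's implementation); it reuses the hypothetical quantities defined above, and assumes candidates is a list of (tids, items) pairs, e.g. the tiles of sufficiently frequent closed itemsets:

import numpy as np

def greedy_rank_tiles(P, p_t, p_i, candidates, k=30):
    """Rank candidate tiles by added interestingness (Eq. 6): the information content
    of their not-yet-covered entries divided by their description length."""
    covered = np.zeros(P.shape, dtype=bool)
    neg_log_P = -np.log(P)
    lengths = [description_length(p_t, p_i, tids, items) for tids, items in candidates]
    remaining = set(range(len(candidates)))
    ranked = []
    for _ in range(min(k, len(candidates))):
        best_j, best_score = None, 0.0
        for j in remaining:
            tids, items = candidates[j]
            block = np.ix_(tids, items)
            added_info = np.sum(neg_log_P[block][~covered[block]])  # InformationContent+
            score = added_info / lengths[j]
            if score > best_score:
                best_j, best_score = j, score
        if best_j is None:          # nothing adds information anymore
            break
        ranked.append((candidates[best_j], best_score))
        covered[np.ix_(*candidates[best_j])] = True
        remaining.discard(best_j)
    return ranked

The quadratic scan over candidates is kept only for clarity; a practical implementation would exploit the submodularity of the covered information content, e.g. with lazy evaluation.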
3.2 Assessing data mining results

Another direct application of our model is for the statistical assessment of data mining results by means of p-values. The probabilistic model can be used to sample a large number of databases, and we can run the data mining algorithm on these randomized versions. Then, the value of a global measure of choice (such as the number of frequent closed itemsets, the size of the largest frequent itemset, etc.) on the given database can be compared with the value of the same measure as obtained on the random databases. This can be used to compute an empirical p-value, which is smaller the more significant the pattern is in the given database.

The authors of [8] generated random databases by means of a chain of swap randomizations. Our explicit database model allows us to improve on this in two respects:

• Little is known about the convergence of the swap randomization process. It is therefore unclear how many random swaps need to be done in order to arrive at a sufficiently randomized database, as if it were randomly sampled from the uniform distribution over all databases with the same marginals. In contrast, the MaxEnt distribution is characterized explicitly, and can be directly sampled from.

• A MaxEnt-based sampling method is extremely simple to implement using a double for loop over the items and transactions (a sketch of such a sampler is given after this list). The overall complexity of such an implementation is O(mn), and does not depend on the mixing time of the chain of random swaps.

The main issue is probably the first one, although the authors of [8] argue that it is less relevant in practice: it seems to suffice to do roughly as many swaps as the number of nonzero entries in the database, or a small multiple thereof. As a swap (for their Self loop algorithm) can be done in constant time, this means that their method could be faster than O(mn) for very sparse databases, assuming sparsity does not negatively affect the number of swaps required for mixing. Still, even on sparse datasets, the computational performance of our method appears comparable. For example, for the Retail dataset [3], which has a density of 0.06%, [8] reports a computation time of 1 minute and 1 second on a 3GHz Pentium. Our method requires 1 minute and 35 seconds on a 2GHz Pentium Centrino. Furthermore, we wish to point out that our simple sampling algorithm is subject to significant improvements by relying on results in [14], but we postpone a more in-depth discussion to further work.
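For instance, sampling and empirical p-value computation along these lines could be sketched as follows (an illustration under our own assumptions, with P the fitted probability matrix from before and global_measure a user-supplied function, such as one counting the frequent closed itemsets):

import numpy as np

def sample_database(P, rng):
    """Draw one random database: each entry d_{t,i} is an independent Bernoulli(p_{t,i})."""
    return (rng.random(P.shape) < P).astype(np.int8)

def empirical_p_value(D, P, global_measure, n_samples=100, seed=0):
    """Fraction of randomized databases whose global measure is at least as large
    as the value observed on the actual database D (one-sided empirical p-value)."""
    rng = np.random.default_rng(seed)
    observed = global_measure(D)
    hits = sum(global_measure(sample_database(P, rng)) >= observed
               for _ in range(n_samples))
    return (hits + 1) / (n_samples + 1)   # add-one smoothing, a common convention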

4. EMPIRICAL RESULTS

We report empirical results on four datasets: two textual datasets, and two other datasets commonly used for evaluation purposes:

KDD  All KDD paper abstracts since 2001 (from all sessions) downloaded from the ACM website. Each abstract is represented by a transaction and words are items, after stop word removal and stemming.

Mushroom  A publicly available item-transaction dataset [1].

Pubmed  All Pubmed abstracts retrieved by querying with the search query "data mining" (first used in [6]), after stop word removal and stemming.

Retail  A dataset about transactions in a Belgian supermarket store, where transactions and items have been anonymized [3].

Some statistics are gathered in Table 1. The Table also mentions the support thresholds used for some of the experiments reported below, and the numbers of closed itemsets satisfying these support thresholds.

Table 1: Some statistics for the datasets investigated.
            # items   # tids   support   # closeds
KDD           6,154      843         5   2,787,847
Mushroom        120    8,124       812       6,298
Pubmed       12,661    1,683        10   1,249,913
Retail       16,470   88,162         8     191,088

All experiments were done on a 2GHz Pentium Centrino with 1GB of memory, and implemented in C++.

4.1 Computing the probabilistic model

Fig. 1 shows the squared norm of the dual gradient (see Eqs. (12,13) in Appendix A for details) as it converges to zero during the first 20 iterations. Noting the logarithmic vertical axis, the convergence appears clearly exponential. The lower plot in Fig. 1 shows the convergence of the dual objective (see Eq. (11) in Appendix A) to its minimum, clearly a very fast convergence in just a few iterations.

Figure 1: Top: the squared norm of the dual gradient on a logarithmic scale as a function of the iteration number, plotted for the four datasets investigated. This plot shows the exponential decrease of the gradient. In the second plot, the convergence of the Lagrange dual Eq. (11) is shown for the same datasets.

We wish to note that the dual gradient contains as its elements the differences between the empirical marginal probabilities p_t and p_i on the one hand, and their counterparts under the estimated probabilistic model on the other (see Eqs. (12,13)). Hence, the dual gradient norm quantifies the total violation of the constraints. This means that the norm of the gradient, normalized by the number of items and transactions, provides a suitable stopping criterion. In all our experiments we stopped the iterations as soon as it reached 10^{-12}.

The number of iterations and the computation time needed to fit the model to the data are summarized in Table 2.

Table 2: Number of iterations and computation time in seconds required to fit the probabilistic model.
            # iterations   time (s)
KDD                   13       0.54
Mushroom              37       0.02
Pubmed                15       1.24
Retail                18       2.80
4.2 Information ratio as interestingness

We implemented the greedy algorithm described in Sec. 3.1, which sorts a set of tiles in order of decreasing added interestingness, as formalized by InformationRatio+ in Eq. (6). We then applied it to the list of closed tiles as found by CHARM [17]. The support thresholds we used in running CHARM, and the resulting numbers of tiles, are listed in Table 1.

In Fig. 2, InformationRatio+(τ) is plotted for the 30 top-ranking tiles, as a function of their rank in the list. The dashed lines show the same result for a randomly generated dataset, sampled from the fitted probabilistic model. Note that in the random databases, the information ratio is typically smaller than 1, even for the most highly ranked tiles.

Figure 2: These plots show the scores of the ranked closed itemsets for the four datasets investigated (full line). In dotted line, the same is plotted for a randomized version of that dataset.

The textual datasets were included to allow for a subjective assessment of the interestingness measures. We believe that such a subjective assessment is crucial in an evaluation. Even though it may be hard to make any strong statements, it does provide one with an insight into the practical use and behaviour of the interestingness measure.

The top-10 itemsets returned for the textual datasets are shown in Table 3. For comparison, we also list the outputs given by the method based on the surface of tiles as proposed in [7], and by a method called KRIMP proposed in [13]. We would argue that InformationRatio+ achieves the most sensible results in terms of non-redundancy and interestingness of the highly ranked itemsets, many of them coinciding with major topics and concepts in data mining (KDD) and data mining applied to biological problems (Pubmed). In contrast, the tile-based method seems to favour tiles with few but individually frequent items. The KRIMP algorithm suffers from another problem when used for this purpose: the redundancy among the top-ranked itemsets is high (e.g., for the KDD abstracts almost all top-ranked itemsets contain 'data' or 'paper'), and their interestingness seems worse than for the itemsets that are ranked highly by our method.

Table 3: The top-10 ranked itemsets in the textual datasets (KDD and Pubmed) based on the Information Ratio interestingness measure (top), based on the surface of tiles (middle), and as outputted by KRIMP (bottom). Words were stemmed, and all itemsets are sorted alphabetically.
4.3 Assessing data mining results

Lastly, we wish to demonstrate the use of our probabilistic model for generating random databases with expected item and transaction marginals equal to the empirical ones. The availability of such random databases can then be used for the statistical assessment of data mining results.

Fig. 3 plots the number of closed itemsets retrieved on each of the four databases considered in this paper. Additionally, it shows box plots for the results obtained on databases randomly sampled from our probabilistic model fitted to the respective databases. If desired, these results allow one to extract a single global measure, as in [8], and to compute an empirical p-value by comparing that measure obtained on the actual data with the result on the randomized versions. However, the plots given here do not force one to make such a choice, and still they give a good idea of the patterns contained in the databases.

Figure 3: For the four datasets under investigation, these plots show the number of closed itemsets on a logarithmic scale, as a function of their size. Additionally, box plots are shown for the number of closed itemsets as a function of size found on 100 randomized datasets, based on our probabilistic model.

As a side note, we remark that the number of closed itemsets in the Retail database is smaller than what is typically obtained in a randomly generated database. However, this seems to be attributable to the fact that large tiles, only present in the actual database, are shattered in the randomized versions, thus leading to a larger number of smaller ones. This effect is even stronger for the Mushroom database. Hence, the total number of closed itemsets above a fixed threshold would not be a good quantification of the amount of structure in the database.

5. CONCLUSIONS

We have introduced a probabilistic model for transactional databases, and developed a highly scalable algorithm that fits it to a given database. Such a model can be used to sample random databases much like earlier approaches, in order to assess the significance of a data mining result.

More importantly, the explicit availability of a probabilistic model for the entire database, respecting sensible prior knowledge (the item and transaction marginals), opens up new possibilities to develop measures of interestingness for patterns on such databases. Specifically, we suggested a new interestingness measure for itemsets (or tiles) in binary databases, which we referred to as the information ratio, and evaluated it on real-life data.

We hope that this work may open two new research directions. The first of these is the search for relevant interestingness measures that are based on our database model, for itemsets but also for other patterns on transactional databases, including association rules and 'noisy' tiles (tiles that also cover zeroes in the database).

As a second research direction, we hope that the results in this paper may lead to the specification of similar probabilistic models for more complexly structured data than transactional databases, such as sequential data, strings, graphs, and more. It is quite likely that the core ideas presented in this paper will transfer relatively easily. Indeed, a closer examination of the optimization problem in this paper reveals that the simple shape of the MaxEnt distribution, and the computational tractability, are essentially due to the linearity of the constraints on the probabilities specifying the distribution. We believe that linear constraints on probabilities (such as the constraints on the item and transaction marginals in this paper) are sufficiently rich to quantify a broad variety of prior information.

All data used in this paper not previously made available, and all code used in the experiments, will be made publicly available on the author's website.

6. ACKNOWLEDGEMENTS

The author wishes to thank Bart Goethals for very interesting discussions relating to this work, and Matthijs Van Leeuwen for kindly making the code of KRIMP available.

7. REFERENCES

[1] C. Blake and C. Merz. UCI repository of machine learning databases. http://www.ics.uci.edu/mlearn/MLRepository.html, 1998.
[2] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[3] T. Brijs, G. Swinnen, K. Vanhoof, and G. Wets. Using association rules for product assortment decisions: a case study. In Proceedings of the fifth ACM SIGKDD international conference, 1999.
[4] S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: Generalizing association rules to correlations. In SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data, 1997.
[5] A. Gallo, T. De Bie, and N. Cristianini. MINI: Mining informative non-redundant itemsets. In Proceedings of 2007 KDD, 2007.
[6] A. Gallo, A. Mammone, T. De Bie, M. Turchi, and N. Cristianini. From frequent itemsets to informative patterns. Submitted, 2009.
[7] F. Geerts, B. Goethals, and T. Mielikäinen. Tiling databases. In Discovery Science, 2004.
[8] A. Gionis, H. Mannila, T. Mielikäinen, and P. Tsaparas. Assessing data mining results via swap randomization. TKDD, 1(3), 2007.
[9] J. Han, H. Cheng, D. Xin, and X. Yan. Frequent pattern mining: current status and future directions. Data Mining and Knowledge Discovery, 15(1):55–86, 2007.
[10] E. Jaynes. On the rationale of maximum-entropy methods. Proceedings of the IEEE, 70, 1982.
[11] S. Khuller, A. Moss, and J. Naor. The budgeted maximum coverage problem. Information Processing Letters, 70, 1999.
[12] T. Mielikäinen. Summarization Techniques for Pattern Collections in Data Mining. PhD thesis, Department of Computer Science, University of Helsinki, 2005.
[13] A. Siebes, J. Vreeken, and M. van Leeuwen. Item sets that compress. In SIAM Conference on Data Mining, 2006.
[14] Y. Tillé. Sampling Algorithms. Springer, 2006.
[15] G. Webb. Discovering significant patterns. Machine Learning, 68(1), 2007.
[16] Y. Xiang, R. Jin, D. Fuhry, and F. Dragan. Succinct summarization of transactional databases: an overlapped hyperrectangle scheme. In Proceedings of the 14th ACM SIGKDD international conference, 2008.
[17] M. Zaki and C. Hsiao. CHARM: An efficient algorithm for closed itemsets mining. In Proc. of the 2nd SIAM ICDM, 2002.

APPENDIX

A. EFFICIENTLY FITTING THE MODEL

The shape of the MaxEnt solution.

Optimization problem (3-5) cannot be solved analytically. However, the constraints are linear and the objective is strictly concave, so that a unique maximum must exist.

Duality theory [2] allows us to gain better insight into the shape of the solution. Let us introduce m dual variables λ_i for constraints (4) and n dual variables µ_t for constraints (5). Then problem (3) can be optimized by solving:

    max_{p_{t,i}} min_{λ_i,µ_t} L(p_{t,i}, λ_i, µ_t),    (7)

where L(p_{t,i}, λ_i, µ_t) is the Lagrangian:

    L(p_{t,i}, λ_i, µ_t) = − Σ_{t,i} [ p_{t,i} log p_{t,i} + (1 − p_{t,i}) log(1 − p_{t,i}) − λ_i p_{t,i} − µ_t p_{t,i} ] − n Σ_i λ_i p_i − m Σ_t µ_t p_t.    (8)

Thanks to convexity, strong duality holds, meaning that we can change the order of the maximization and the minimization operators in Eq. (7) without altering the result. After doing this, we can carry out the inner maximization explicitly, by equating the gradient of the Lagrangian with respect to the primal variables p_{t,i} to 0:

    ∂L/∂p_{t,i} = 0  ⇔  log p_{t,i} − log(1 − p_{t,i}) − λ_i − µ_t = 0,
                     ⇔  log( p_{t,i} / (1 − p_{t,i}) ) = λ_i + µ_t,    (9)
                     ⇔  p_{t,i} = exp(λ_i + µ_t) / (1 + exp(λ_i + µ_t)).    (10)

Note that 1 ≥ p_{t,i} ≥ 0, as required for probabilities. Hence we did not have to specify these constraints explicitly.

If we substitute this in the Lagrangian (Eq. (8)) we obtain the Lagrange dual of optimization problem (3-5):

    min_{λ_i,µ_t}  Σ_{t,i} log(1 + exp(λ_i + µ_t)) − n Σ_i λ_i p_i − m Σ_t µ_t p_t.    (11)

This is an unconstrained convex optimisation problem that can be solved efficiently by means of standard convex optimization techniques (see e.g. [2]).

Efficient solution of the optimisation problem.

In order to use widely established first or even second order methods, we need to be able to compute the gradient and second order information efficiently. The gradient of this objective function is readily computed as:

    ∂L/∂λ_i = Σ_t exp(λ_i + µ_t) / (1 + exp(λ_i + µ_t)) − n p_i,    (12)
    ∂L/∂µ_t = Σ_i exp(λ_i + µ_t) / (1 + exp(λ_i + µ_t)) − m p_t.    (13)

The Hessian is computed by:

    ∂²L/∂λ_i ∂λ_{i'} = 0  (for i ≠ i'),
    ∂²L/∂λ_i² = Σ_t exp(λ_i + µ_t) / (1 + exp(λ_i + µ_t))²,
    ∂²L/∂µ_t ∂µ_{t'} = 0  (for t ≠ t'),
    ∂²L/∂µ_t² = Σ_i exp(λ_i + µ_t) / (1 + exp(λ_i + µ_t))²,
    ∂²L/∂λ_i ∂µ_t = exp(λ_i + µ_t) / (1 + exp(λ_i + µ_t))².

The direct computation of both the gradient and the Hessian would be highly demanding for large n + m. However, typically there is a lot of redundancy in the gradient and the Hessian. Observe that if p_i = p_{i'}, the constraints remain satisfied and the value of the primal objective remains unchanged by interchanging p_{t,i} and p_{t,i'} for all t. Since the optimum is unique, this means that p_{t,i} = p_{t,i'}. Therefore, we can reduce the number of free parameters and constraints at the outset by taking this into account. A similar property holds when p_t = p_{t'}.
can change the order of the maximization and the minimiza- The result is that the total number of distinct primal pa- tion operators in Eq. (7) without altering the result. After rameters is only equal to the number of distinct values of doing this, we can carry out the inner maximization explic- the pi times the number of distinct values of the pt, and B. A LEMMA AND A PROOF OF THEO- Table 4: Algorithm for computing the model. In: The item marginals pi and pt. REM 2.1 1: Find all distinct values of pi and pt, Lemma B.1. Under the MaxEnt model, it holds that for and denote these by p˜i and p˜t. pair of transactions t and t and pair of items i and i , Compute the sets of transactions T˜t ⊆ T 1 2 1 2 the constellation such that pt = p˜t for all t ∈ T˜t, and similarly the sets I˜ ∈ I. i dt ,i = dt ,i = 1 and dt ,i = dt ,i = 0 Compute the multiplicities 1 1 2 2 1 2 2 1 m˜i = |I˜i| and n˜t = |T˜i|. is equally probable as 2: Pick initial values for λ and µ (e.g. = 0). ˜t t d = d = 0 and d = d = 1. 3: Compute the gradient: t1,i1 t2,i2 t1,i2 t2,i1 P exp (λ +µ ) ∂L ˜i ˜t In other words, ∂λ = ˜t n˜t 1+exp (λ +µ ) − np˜i ˜i P ˜i ˜t ∂L exp (λ˜i+µ˜t) = ˜ m˜i − mp˜t pt1,i1 pt2,i2 (1 − pt1,i2 )(1 − pt2,i1 ) ∂µ˜t i 1+exp (λ˜i+µ˜t) Compute the diagonal of the Hessian: = pt1,i2 pt2,i1 (1 − pt1,i1 )(1 − pt2,i2 ). (14) 2 P ∂ L exp (λ˜i+µ˜t) 2 = n˜t 2 ∂ λ˜ ˜t i (1+exp (λ˜i+µ˜t)) Proof. Let us rewrite condition (14) as: 2 P ∂ L exp (λ˜i+µ˜t) 2 = ˜ m˜i 2 ∂ µ˜t i 1+exp (λ +µ ) pt ,i pt ,i ( ˜i ˜t ) log 1 1 + log 2 2 5: Update: 1 − pt1,i1 1 − pt2,i2 2 λ ← λ − f · ∂L / ∂ L ˜i ˜i ∂λ ∂2λ pt1,i2 pt1,i2 ˜i ˜i = log + log . (15) ∂L ∂2L 1 − p 1 − p µ˜t ← µ˜t − f · / 2 t1,i2 t1,i2 ∂µ˜t ∂ µ˜t where f is a step length found by a line Clearly, a parametric solution to this set of equations is given search so as to decrease the Lagrangian. by Eq. (9). Hence, the maximum entropy solution is a solu- 6: If the gradient’s norm is small enough, go to 7. tion for which property Eq. (15) needs to be satisfied. Otherwise, go to 3. 7: For all t ∈ T˜t, equate µt = µ˜t. Said in other words, this means that a swap operation For all i ∈ I˜i, equate λi = λ˜i. applied to a database does not alter its probability under Out: All λ˜i and µ˜t, directly useable to compute the MaxEnt model. This Lemma allows us to prove Theo- exp (λ˜i+µ˜t) rem 2.1. pt,i = 1+exp (λ˜i+µ˜t) where p = p and p = p . ˜t t ˜i i Proof of Theorem 2.1. Proof. Let us first prove the first part of the Theorem: under the MaxEnt model, all databases with the same em- pirical transaction and item marginals are equally probable. the number of distinct dual variables is equal to the sum To this end, note that all these databases can be obtained of those two numbers. Typically, these numbers are much by using a chain of swap operations starting from one such database that has the required marginals. Since Lemma B.1 smaller than n and m, and it is not√ hard to show that they are both smaller than min{n, m, 2N} where N is the to- states that swap operations do not alter the probability of tal number of nonzero entries in D. Hence, their number is the database, all databases with the same marginals must small for sparse databases in particular, when itemset min- have the same probability. ing often proves most useful. This simplification makes the The second part of the Theorem can be proven as follows. optimization problem easily amenable to real-life problems. 
B. A LEMMA AND A PROOF OF THEOREM 2.1

Lemma B.1. Under the MaxEnt model, it holds that for any pair of transactions t_1 and t_2 and any pair of items i_1 and i_2, the constellation

    d_{t_1,i_1} = d_{t_2,i_2} = 1 and d_{t_1,i_2} = d_{t_2,i_1} = 0

is equally probable as

    d_{t_1,i_1} = d_{t_2,i_2} = 0 and d_{t_1,i_2} = d_{t_2,i_1} = 1.

In other words,

    p_{t_1,i_1} p_{t_2,i_2} (1 − p_{t_1,i_2})(1 − p_{t_2,i_1}) = p_{t_1,i_2} p_{t_2,i_1} (1 − p_{t_1,i_1})(1 − p_{t_2,i_2}).    (14)

Proof. Let us rewrite condition (14) as:

    log( p_{t_1,i_1} / (1 − p_{t_1,i_1}) ) + log( p_{t_2,i_2} / (1 − p_{t_2,i_2}) ) = log( p_{t_1,i_2} / (1 − p_{t_1,i_2}) ) + log( p_{t_2,i_1} / (1 − p_{t_2,i_1}) ).    (15)

Clearly, a parametric solution to this set of equations is given by Eq. (9). Hence, the maximum entropy solution is a solution for which property Eq. (15) is satisfied.

Said in other words, this means that a swap operation applied to a database does not alter its probability under the MaxEnt model. This Lemma allows us to prove Theorem 2.1.

Proof of Theorem 2.1. Let us first prove the first part of the Theorem: under the MaxEnt model, all databases with the same empirical transaction and item marginals are equally probable. To this end, note that all these databases can be obtained by using a chain of swap operations starting from one such database that has the required marginals. Since Lemma B.1 states that swap operations do not alter the probability of the database, all databases with the same marginals must have the same probability.

The second part of the Theorem can be proven as follows. Observe that the number of linearly independent equations of the form of Eq. (15) is equal to (n−1)(m−1) = nm − (n+m) + 1. For example, choose the values of t_1, i_1, t_2, and i_2 in Eq. (15) such that t_1 = i_1 = 1 and t_2 ∈ {2, 3, ..., n} and i_2 ∈ {2, 3, ..., m}. These equations are linear and homogeneous in the nm log-odds ratios, so the solution space is an (n + m − 1)-dimensional linear manifold. Now, Eq. (9) also defines an (n + m − 1)-dimensional linear manifold, as parameterized by the values of λ_i and µ_t. (Indeed, while the number of degrees of freedom seems to be n + m, in fact it is only n + m − 1, as adding a fixed value to all λ_i and subtracting the same value from all µ_t will not alter the values of the log-odds ratios.) From the first part of this Theorem, it follows that any solution satisfying Eq. (9) also satisfies Eq. (15). Additionally, we have just shown that the two solution spaces are equally large. Thus, the sets of conditions Eq. (9) and Eq. (15) are equivalent.