
Universität des Saarlandes
Max-Planck-Institut für Informatik

Redescription Mining Over non-Binary Data Sets Using Decision Trees

Masterarbeit im Fach Informatik Master’s Thesis in Computer Science von / by Tetiana Zinchenko

angefertigt unter der Leitung von / supervised by Dr. Pauli Miettinen

begutachtet von / reviewers Dr. Pauli Miettinen Prof. Dr. Gerhard Weikum

Saarbrücken, November 2014

Eidesstattliche Erklärung

Ich erkläre hiermit an Eides Statt, dass ich die vorliegende Arbeit selbstständig verfasst und keine anderen als die angegebenen Quellen und Hilfsmittel verwendet habe.

Statement in Lieu of an Oath

I hereby confirm that I have written this thesis on my own and that I have not used any other media or materials than the ones referred to in this thesis.

Einverständniserklärung

Ich bin damit einverstanden, dass meine (bestandene) Arbeit in beiden Versionen in die Bibliothek der Informatik aufgenommen und damit veröffentlicht wird.

Declaration of Consent

I agree to make both versions of my thesis (with a passing grade) accessible to the public by having them added to the library of the Computer Science Department.

Saarbrücken, November 2014 Tetiana Zinchenko

Acknowledgements

First of all, I would like to thank Dr. Pauli Miettinen for the opportunity to write my Master's thesis under his supervision and for his support and encouragement during the work on this thesis.

I would like to thank the International Max Planck Research School for Computer Science for giving me the opportunity to study at Saarland University and their constant support during all the time of my studies.

Special thanks go to my husband for being the most supportive and inspiring person in my life. He was the one who encouraged me to start and to finish this degree.


Abstract

Scientific data mining aims to extract useful information from huge data sets with the help of computational methods. Recently, scientists have encountered an overload of data describing domain entities from different sides. Many of these data sets provide alternative means of organizing information, and every alternative data set offers a different perspective on the studied problem.

Redescription mining is a tool whose goal is to find various descriptions of the same objects, i.e. to characterize entities from different perspectives. It is a knowledge discovery tool which helps to reason uniformly across data of diverse origin and integrates numerous forms of characterizing data sets.

Redescription mining has important applications, mainly in biology (e.g. finding bio-climatic niches for species), bioinformatics (e.g. dependencies between genes can assist in the analysis of diseases), and sociology (e.g. exploration of statistical and political data).

We initiate redescription mining with a data set consisting of two tables with Boolean and/or real-valued attributes. In redescription mining we look for queries which describe nearly the same objects in both given tables.

Among all redescription mining algorithms there exist approaches which exploit alternating decision tree induction. So far, only Boolean variables have been involved there. In this thesis we extend these approaches to non-Boolean data and develop two methods which allow redescription mining over non-binary data sets.

Contents

Acknowledgements

Abstract

Contents

1 Introduction
  1.1 Outline of Document

2 Preliminaries
  2.1 The Setting for Redescription Mining
  2.2 Query Languages
  2.3 Propositional Queries, Predicates and Statements
    2.3.1 Predicates
    2.3.2 Statements
  2.4 Exploration Strategies
    2.4.1 Mining and Pairing
    2.4.2 Greedy Atomic Updates
    2.4.3 Alternating Scheme

3 Related research
  3.1 Rule Discovery
  3.2 Decision Trees and Impurity Measures
  3.3 Redescription Mining Algorithms

4 Contributions
  4.1 Redescription Mining Over non-Binary Data Sets
  4.2 Algorithm 1
  4.3 Algorithm 2
  4.4 Stopping Criterion
  4.5 Extracting Redescriptions
  4.6 Extending to Fully non-Boolean Setting
    4.6.1 Data Discretization
  4.7 Quality of Redescriptions
    4.7.1 Support and Accuracy
    4.7.2 Assessing Significance

5 Experiments with Algorithms for Redescription Mining
  5.1 Finding Planted Redescriptions
  5.2 The Real-World Data Sets
  5.3 Experiments With Algorithms on the Bio-climatic Data Set
    5.3.1 Discussion
  5.4 Experiments With Algorithms on the Conference Data Set
    5.4.1 Discussion
  5.5 Experiments against the ReReMi Algorithm

6 Conclusions and Future Work

Bibliography

A Redescription Sets from Experiments with the Bio Data Set

B Redescription Sets from Experiments with the DBLP Data Set

Chapter 1

Introduction

Nowadays we encounter massive amounts of data everywhere, and increased technical capabilities accelerate its generation and acquisition. This data can be of different origin and describe diverse objects, which sets the stage for active data mining in the scientific domain. There are numerous techniques and approaches to find useful tendencies, dependencies or underlying patterns in it. The data derived from scientific domains is usually less homogeneous and more massive than the data stemming from the business domain. Despite the fact that many data mining techniques applied in business return good results for science as well, more sophisticated and tailored methods are needed to meet the needs arising in science.

According to Craford [12], there are two types of analytic tasks for science that can be supported by data mining: firstly, discovery-driven mining used for deriving hypotheses; secondly, verification-driven mining used to support (or discourage) hypotheses, i.e. experiments. In this setting, hypothesis formation requires more exquisite approaches and deeper domain-specific knowledge.

Facing such imposing data volumes, scientists experience an overload of data for describing domain entities. The accompanying issue is that all these data sets can offer alternative (or sometimes even contradictory) perspectives on the studied data. Thus, a universal tool suitable for data analysis is a necessary option to have at hand. Moreover, identifying correspondences between interesting aspects of the studied data is a natural task in many domains.

It is well known that viewing data from different perspectives is useful for a better understanding of the whole concept. Redescription mining aims to embody this. Its ultimate goal is finding different ways of looking at data and extracting alternative characteristics of the same (or nearly the same) objects. As can be concluded from the name, redescription mining aims to learn a model from data in order to describe it and to help with the interpretability of the investigated results. A redescription is a way of characterizing objects that can be described from at least two different sides. The number of views can be larger than two, but the setting with double-sided data is more common. The following example assists in understanding the concept of redescription mining:

Example 1. We consider a set of nine countries as our objects of interest, namely Canada, Mexico, Mozambique, Chile, China, France, Russia, the United Kingdom and the USA. A simple toy data set [48, 43, 63] consisting of four properties characterizing these countries, represented as a Venn diagram in Figure 1.1, is also included. Consider the two statements below:

1. Country outside the Americas with land area more than 8 million square kilometers.

2. Country is a permanent member of the UN Security Council with a history of state communism.

Figure 1.1: Geographic and geopolitical characteristics of countries represented as a Venn diagram. Adapted from [48].

Blue - Located in the Americas
Green - History of state communism
Yellow - Land area above 8 million square kilometers
Red - Permanent member of the UN Security Council

Two countries (Russia and China) satisfy both statements. The statements give alternative characterizations of the same subset of countries in terms of geographical and geopolitical properties. Thus, a redescription is formed. Its strength is given by the symmetric Jaccard coefficient (2/2 = 1). The descriptors on either side of a derived redescription can contain more than one attribute. This simple example provides an intuition for the concept of a redescription.

Thus, we are given a multi-view data set (in our case consisting of two sub-sets describing the same objects with different features). For example, in the setting of the niche-finding problem for species studied in [23, 49], we can be provided with one set containing the species which live in particular regions, while another set contains climatic data about the same regions. A redescription mined for such a problem can be a statement that some species resides in a terrain where the average June temperature is in a particular range, etc. Extracting such rules manually is often very laborious, because it requires picking particular species and investigating their peculiarities.

An application of redescription mining in bioinformatics can be associated with genes. In such a case, the task of finding such dependencies without a suitable tool seems unfeasible, because the amount of data is enormous and very often incomplete. Redescriptions mined using one of the existing methods are more informative and can reveal unexpected useful information in a domain. Of course, the usage of redescription mining techniques is not limited to only these two domains. However, to make use of the obtained redescriptions, knowledge of the domain is highly recommended.

Currently, redescription mining techniques are able to handle non-Boolean data without pre-processing. This is claimed to be a better option than a prior transformation of the data sets [18]. In a setting where one side of the data set is real-valued or categorical, redescription mining produced meaningful outcomes. In case both data sets contain real-valued entries, an exhaustive search is inevitable, which in turn might impose an unwanted computational burden. Besides this, redescription mining using decision trees, modified such that it can work with numerical entries (at least on one side), might perform well and become a competitive alternative to the aforementioned techniques. However, it has not been implemented so far; this is the starting point for the work conducted within this thesis. A stretch goal for the project is an algorithm which allows both sides of the data set to be non-binary. Finally, a comparison of the obtained outcomes with redescription mining conducted by existing methods is to be performed. It is also useful to test the new methods in a synthetic setting to study the behavior and performance of the algorithms. After this, conclusions about the quality of the methods can be made.

1.1 Outline of Document

This Thesis is organized as follows:

• Chapter 1 provides an introduction to the topic.

• In Chapter 2 the problem of redescription mining is formalized. Sections 2.2 and 2.4 cover query languages and exploration strategies that can be used within algorithms for redescription mining.

• Chapter 3 is devoted to related research; namely, it covers other approaches which share some features with redescription mining. Section 3.2 describes decision tree induction methods in detail, together with impurity measures. Section 3.3 is dedicated to other existing algorithms to mine redescriptions.

• Chapter 4 describes the contributions made within this Thesis. In particular, Sections 4.2 and 4.3 explain the two elaborated algorithms for redescription mining over non-binary data sets using decision trees. In Section 4.7 we outline the way we evaluate our results.

• In Chapter 5 all experiments are covered. In particular, Section 5.1 covers the synthetic setting, and Sections 5.3 and 5.4 report the results and discussion of the experiments on the real-world data sets: biological and bibliographic, respectively. In addition, in Section 5.5 we compare the results of our algorithms to the ReReMi algorithm [18].

• Finally, Chapter 6 contains the conclusions of this Thesis.

Chapter 2

Preliminaries

2.1 The Setting for Redescription Mining

We denote by O a set of elementary objects and by A a set of attributes which characterize properties of the objects or relations between them. The attributes originate from different sources and terminologies, which are denoted as a set of views V. The function v maps an attribute to the corresponding view: v : A → V. The data set can be represented in the form of a triplet (O, A, v). Redescriptions are composed of queries.

Definition 1. An expression formed with logical operators, expressed over attributes in A and evaluated against the data set, is called a query. Q denotes the set of valid queries and is called the query language.

In order to assess any statement against a data set, it is necessary to replace the variables in this statement with objects from the data set and identify the substitutions for which the formula holds. The support of a query q is this subset of objects; we denote it supp(q). All feasible substitutions for queries in a query language are called entities, and their set is denoted by E. By att(q) we denote the set of attributes which can be found in a query q. The function v is extended to queries as the union of the views of their attributes: v(q) = ∪A∈att(q) v(A). To make sure that two queries describe the data from different views, their sets of views are required to be disjoint. Similarity in support is captured by a symmetric binary relation ∼ used as a Boolean indicator. Finally, a set C can denote arbitrary constraints applied to redescriptions; for example, a maximal length of the set-theoretic expressions can be imposed to ensure ease of interpretation, or only conjunctions may be allowed. With this formalism, a redescription can be defined as follows:

Definition 2. Given a data set (O, A, v), a query language Q over A and a binary relation ∼, a redescription is a pair of queries (qA, qB) ∈ Q × Q such that v(qA) ∩ v(qB) = ∅ and supp(qA) ∼ supp(qB). Redescription mining is the process of discovering such pairs.

The problem of redescription mining: Given a data set (O, A, v) with query language Q and the binary relation ∼, find all redescriptions that satisfy the constraints from C.

Example 2. (Based on Figure 1.1.) Here nine countries (UK, France, USA, Mexico, Chile, Canada, China, Russia, Mozambique) form the set of objects. The attributes (Blue, Yellow, Red, Green - equivalently B, Y, R, G) are split into two views: G - geography (includes B and Y) and P - geopolitics (includes R and G). Thus, the set of attributes is written as A = {B, Y, R, G}. For example, v(B) = G. A first query over geographical attributes can be written as qG = ¬B ∧ Y. In our data set this query is supported by two countries: supp(qG) = {Russia, China}.

The next step is a query over geopolitics: qP = R ∧ G. Again, when evaluated against our data set, it is supported by the same two countries. Hence, supp(qP) ∼ supp(qG). Moreover, v(qP) ∩ v(qG) = {P} ∩ {G} = ∅. Then, based on Definition 2, (qG, qP) is a redescription.
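To make the formalism concrete, here is a minimal Python sketch that evaluates the two queries of this example against a hand-encoded version of the toy data from Figure 1.1. The encoding and the helper names (COUNTRIES, supp, jaccard) are illustrative assumptions, not part of the thesis implementation.

```python
# Minimal sketch: evaluating the queries of Example 2 on the toy country data.
# B = located in the Americas, Y = land area above 8 million km^2,
# R = permanent UNSC member, G = history of state communism (illustrative encoding).

COUNTRIES = {
    "Canada":        dict(B=1, Y=1, R=0, G=0),
    "Chile":         dict(B=1, Y=0, R=0, G=0),
    "China":         dict(B=0, Y=1, R=1, G=1),
    "France":        dict(B=0, Y=0, R=1, G=0),
    "Great Britain": dict(B=0, Y=0, R=1, G=0),
    "Mexico":        dict(B=1, Y=0, R=0, G=0),
    "Mozambique":    dict(B=0, Y=0, R=0, G=1),
    "Russia":        dict(B=0, Y=1, R=1, G=1),
    "USA":           dict(B=1, Y=1, R=1, G=0),
}

def supp(query):
    """Support of a query: the set of objects for which the query evaluates to True."""
    return {name for name, attrs in COUNTRIES.items() if query(attrs)}

def jaccard(e1, e2):
    """Jaccard coefficient of two support sets."""
    return len(e1 & e2) / len(e1 | e2) if (e1 | e2) else 0.0

supp_qG = supp(lambda a: not a["B"] and a["Y"])   # geographic query  qG = not B and Y
supp_qP = supp(lambda a: a["R"] and a["G"])       # geopolitical query qP = R and G

print(supp_qG, supp_qP, jaccard(supp_qG, supp_qP))
# Both queries are supported by {China, Russia}, so the Jaccard coefficient is 1.0.
```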

As can be derived from its name, redescription mining is an analysis focused on describing. It is not supposed to predict unknown data, but rather to describe the available data properly. In addition, the extent of expressiveness and interpretability of the outcome really matters. Expressiveness can be determined through the variety of concepts that a language can represent. At the same time, interpretability is more difficult to measure, since it refers to the ease with which the associated meaning can be grasped. Nevertheless, simpler queries facilitate the interpretability of an element of the language.

When solving any redescription mining task, a collection O (which consists of elementary objects/samples) is considered. Attributes in A characterize the properties of these objects. The set of views V denotes the various sources, domains or terminologies from which the data originate. Regarding particular tasks, for example in the case of the biological niche-finding problem, climate data on one side and fauna data on the other side create two fully diverse sets of attributes that fit the setting of the problem. In the case of a medically related problem, these sets can be formed by personal information about a patient's background, elements of diagnosis and symptoms. Since redescriptions are meant to find characteristics of the same (or nearly the same) objects, we require that the attributes over which both queries of a redescription are expressed come from disjoint sets of views. As already mentioned, we stick to the two-sided setting. This means there will be two data sets, denoted by L (for left) and R (for right), such that AL ∪ AR = A.

In case we have multiple views, the correspondence between the elementary objects across the views might not be available. This can be caused by the fact that the sets of objects occurring in distinct views do not coincide completely, or some objects might have many observations in one view and a single one in another. Setting up these correspondences appears to be a non-trivial task, which formulates a research question on its own [54].

The purpose of redescription mining is to find alternative characterizations of almost the same objects. This means that the similarity of the supports of the queries determines the quality of a derived redescription. A pair of queries is said to be accurate if they have similar supports. More generally, the similarity relation between support sets is determined by a similarity function f, together with a threshold σ such that the following holds:

Ea ∼ Eb ←→ f(Ea; Eb) ≥ σ

The function f is usually chosen to be the Jaccard coefficient [27]. We use this coefficient as our measure of choice for accuracy, but it can easily be replaced with another set similarity function. We consider the similarity between the supports of the queries of a redescription to be the main property of a redescription and call it accuracy. Thus, a pair of queries can be called accurate if their supports are similar, where by similar we imply that they pass the given threshold. The similarity coefficient is 1 when the two supports are identical, which means we have a perfect redescription. In practice, redescriptions with a similarity coefficient less than 1 are also useful in many domains. A chain of such redescriptions can be used to connect independent entities (applicable, for instance, in story telling) or, in bioinformatics, to find genes responsible for a particular disease.

For a pair of queries (qL, qR), we denote the following subsets of entities:

1. E1,1 - entities that support both queries (i.e. E1,1 = supp(qL) ∩ supp(qR))

2. E1,0 - entities that support only first query

3. E0,1 - entities that support only second query

4. E0,0 - entities that do not support any query.

As examples of similarity functions, the following can be applied:

• matching number: |E1,1| + |E0,0|

• matching ratio: (|E1,1| + |E0,0|) / (|E1,0| + |E1,1| + |E0,1| + |E0,0|)

• Russell & Rao coefficient: |E1,1| / (|E1,0| + |E1,1| + |E0,1| + |E0,0|)

• Jaccard's coefficient: |E1,1| / (|E1,0| + |E1,1| + |E0,1|)

• Rogers & Tanimoto coefficient: (|E1,1| + |E0,0|) / (2|E1,0| + |E1,1| + 2|E0,1| + |E0,0|)

• Dice coefficient: 2|E1,1| / (|E1,0| + 2|E1,1| + |E0,1|)

The choice of the Jaccard coefficient is the most common when talking about the evaluation of redescriptions. This is due to its simplicity and its agreement with the symmetric approach adopted in redescription mining. The Jaccard coefficient takes the supports of the two queries into account equally. Moreover, it is scaled to the unit interval without involving the set of entities that support neither query, E0,0.
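The coefficients above can be written down directly in terms of the four counts |E1,1|, |E1,0|, |E0,1| and |E0,0|. The following Python sketch is only illustrative (the function names are ours); for Rogers & Tanimoto the standard form with doubled mismatches is assumed.

```python
# Similarity functions over the four entity counts (illustrative sketch).

def matching_number(e11, e10, e01, e00):
    return e11 + e00

def matching_ratio(e11, e10, e01, e00):
    return (e11 + e00) / (e11 + e10 + e01 + e00)

def russell_rao(e11, e10, e01, e00):
    return e11 / (e11 + e10 + e01 + e00)

def jaccard(e11, e10, e01, e00=0):
    # E00 is ignored: Jaccard only relates the intersection and union of the supports.
    return e11 / (e11 + e10 + e01)

def rogers_tanimoto(e11, e10, e01, e00):
    # standard form: mismatching entities are counted twice
    return (e11 + e00) / (e11 + 2 * (e10 + e01) + e00)

def dice(e11, e10, e01, e00=0):
    return 2 * e11 / (2 * e11 + e10 + e01)

# Example: 2 entities support both queries, 1 supports only the left one,
# none support only the right one, 6 support neither.
print(jaccard(2, 1, 0), rogers_tanimoto(2, 1, 0, 6))   # 0.666..., 0.8
```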

2.2 Query Languages

The way we represent the results of redescription mining is determined by the query language. Query languages are an essential part of the whole redescription mining technique. Queries are logical statements that are evaluated against the given data set. These statements are obtained by combining distinct predicates using Boolean operators. Replacing the predicate variables with objects from the given data set and verifying whether the conditions of the predicates are satisfied returns a truth value. The objects which satisfy a given query are considered to be the support of this query. In this part we cover different types of query languages. In particular, we determine the query structures which are used for redescription mining. They offer a representation of logical combinations of constraints on a variety of individual attributes. Previous papers which cover redescription mining also discussed diverse formal representations of queries and query languages [48, 20].

2.3 Propositional Queries, Predicates and Statements

The queries are formed by logical statements evaluated against the data set. These statements are built from atomic predicates over individual attributes using Boolean operators. Substituting predicate variables with objects from the data set and verifying whether the conditions of the predicates are satisfied returns a truth value. The object tuples in substitutions satisfying the statement form the support of the query. We define a query language as the set of acceptable queries, dependent on the supported types of attributes and the principles for building predicates. The syntactic rules for combining them into statements also belong to the query language we use.

In this thesis we focus on propositional data sets. They contain attributes characterizing properties of individual objects. Sets of objects are deemed to be homogeneous, i.e. every attribute applies to all objects. A data set is called propositional if it contains attributes which characterize properties of distinct objects. In this setting, the values which the attributes from A take form a matrix D. This matrix contains |O| rows, one row per object, and |A| columns, each of which corresponds to an attribute. Thus, the value of an attribute Aj ∈ A is defined as D(i, j) = Aj(oi) for objects oi ∈ O.

Let us consider an example from [17] to exemplify query languages. The data set from Table 2.1 contains countries as objects. Each column represents some property of a country (geographical details). This data can be expressed as a matrix G with 7 columns, G = {G1, G2, ..., G7}, where Gn is a vector which corresponds to some property, e.g. maximal elevation, continent, etc.

Table 2.1: Example data set. World countries with their attributes.

Country         G1  G2  G3  G4  G5            G6     G7
Canada          0   1   0   1   N.America     9.98   5959
Chile           1   1   0   1   S.America     0.76   6893
China           0   0   0   1   Asia          9.71   8850
France          0   1   0   0   Europe        0.64   4810
Great Britain   0   1   0   0   Europe        0.24   1343
Mexico          0   1   0   1   N.America     1.96   5636
Mozambique      1   0   1   0   Africa        0.79   2436
Russia          0   0   0   1   Asia, Europe  17.1   5642
USA             0   1   0   1   N.America     9.63   6194

Here we have 7 vectors, constituting the following features:

1. G1 - Location in South Hemisphere

2. G2 - Border with Atlantic Ocean

3. G3- Border with Indian Ocean

4. G4- Border with Pacific Ocean

5. G5- Localization on a continent

6. G6 - Land area (10^6 km^2)

7. G7 - Maximal elevation of the surface in meters

2.3.1 Predicates

Attributes take values which form a range. By restricting the values to a selected subset of this range, we construct a predicate from an attribute. Consider some attribute Aj ∈ A with range R. Having fixed a subset RS ⊆ R, it is possible to transform the associated data column into a truth value assignment. That is, we turn it into a Boolean vector which indicates which values lie within the fixed range.

This predicate is denoted as [Aj ∈ RS]. It selects the subset of objects whose attribute Aj takes a value in RS. Membership in such a subset can then be written as s(Aj, RS) = {oi ∈ O : Aj(oi) ∈ RS}. Based on their range, attributes can be segregated into types: Boolean, nominal and real-valued.

Boolean predicates. Boolean attributes can take only two values: true or false, or equivalently 1 or 0. The interpretation of a Boolean variable naturally creates a predicate. For simplicity, the true value assignment (i.e. [A = true]) is written simply as A. Thus, [A = false] is the complementary assignment, which can be written with negation (i.e. ¬A). From the example above, vector G3 is a Boolean attribute corresponding to a predicate with the following truth assignment for this data:

⟨0, 0, 0, 0, 0, 0, 1, 0, 0⟩

Thus, the one country (i.e. Mozambique) which has a border with the Indian Ocean is selected.

Nominal predicates. An attribute A is called a nominal attribute when its range is a non-ordered set C or its power set. The values in C are considered to be the categories of the attribute A. To obtain a truth value assignment, a subset of the categories CS ⊆ C is chosen, or alternatively a single category c ∈ C is selected. Thus, nominal predicates are written as [A ∈ CS] and [A = c]. In practice, we consider only those nominal attributes which take a single value. In case there are nominal attributes with multiple values, we represent them with the help of multiple Boolean attributes, i.e. one attribute for each category. From the above example, six countries have borders with the Pacific Ocean:

G4 ∈ {Pacific Ocean}

This is satisfied by the truth assignment ⟨1, 1, 1, 0, 0, 1, 0, 1, 1⟩. If we look at the localization on a continent vector (G5), the attribute becomes multi-valued, because Russia falls into two categories: Asia and Europe. In practice, multi-valued attributes are expressed via several Boolean attributes, one per category.

Real-valued predicates. An attribute A is considered to be a real-valued attribute if its range is formed from real numbers, R ⊆ ℝ. A truth value assignment is derived by selecting any subset of R. Nevertheless, for ease of interpretation the truth value assignment is made based on some contiguous subset of R. That is, we use [A ∈ [a, b]] to denote an interval [a, b] ⊆ R. In addition, for any given real-valued attribute there are infinitely many possible intervals, and several of them may result in the same truth value assignment. Thus, the query language must also involve a criterion to select one among such equivalent intervals. As an exemplification, let's consider the following:

G7 = ⟨5959, 6893, 8850, 4810, 1343, 5636, 2436, 5642, 6194⟩

For a pair (a, b), with a ∈ (2000, 2200) and b ∈ (5000, 5500) the truth value assignment will look like:

[a ≤ G7 ≤ b] = ⟨0, 0, 0, 1, 0, 0, 1, 0, 0⟩

Thus, as a result we get several equivalent intervals for the truth value assignment, for example [2200 ≤ G7 ≤ 5000] and [2436 ≤ G7 ≤ 4810]. The decision between them depends on whether rounded bounds are considered to have better interpretability or not, which in turn depends on the task or problem we work with. For instance, the usage of rounded bounds can be adopted in case we work with big data sets involving many countries, where the range of values is large. In case of smaller data sets (e.g. like the one we consider here with 9 countries), exact bounds might be more desirable, because they provide a more precise description of each studied country.
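A small sketch of how the three predicate types translate into truth value assignments over Table 2.1 follows; the Python encoding of the columns (and the single-label simplification for Russia's continent) is an illustrative assumption.

```python
# Sketch: turning attributes of Table 2.1 into predicate truth assignments.

countries = ["Canada", "Chile", "China", "France", "Great Britain",
             "Mexico", "Mozambique", "Russia", "USA"]
G3 = [0, 0, 0, 0, 0, 0, 1, 0, 0]                              # border with Indian Ocean
G5 = ["N.America", "S.America", "Asia", "Europe", "Europe",
      "N.America", "Africa", "Asia/Europe", "N.America"]      # continent (simplified)
G7 = [5959, 6893, 8850, 4810, 1343, 5636, 2436, 5642, 6194]   # maximal elevation (m)

p_bool = [v == 1 for v in G3]                  # Boolean predicate:     [G3 = true]
p_nom  = [v in {"Europe"} for v in G5]         # nominal predicate:     [G5 in {Europe}]
p_real = [2200 <= v <= 5000 for v in G7]       # real-valued predicate: [2200 <= G7 <= 5000]

for name, b, n, r in zip(countries, p_bool, p_nom, p_real):
    print(f"{name:14s} {b!s:6s} {n!s:6s} {r!s:6s}")
```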

2.3.2 Statements

The predicates discussed previously are used as pieces to construct statements. Propositional predicates are joined with the help of Boolean operators:

1. Negation '¬';

2. Conjunction '∧';

3. Disjunction '∨'.

The truth assignment for a query is derived via the combination of the truth assignments of the individual predicates. The resulting subset of objects is the support of the query. Namely, the support of query q on D, suppD(q), is the set {o ∈ O : q is true for o}. For example, the query which is satisfied by countries that border the Atlantic Ocean but not the Pacific Ocean and have a maximal elevation of less than 4500 meters looks as follows:

q1 = G2 ∧ ¬G4 ∧ (G7 < 4500)

The size of the support of this query is 1, since only Great Britain in our data set is characterized by these features. Let us now move to possible query languages which deploy the predicates and statements from above. One of the most limited and restricted query languages is monotone conjunctions. That is, all predicates are only allowed to be combined with the conjunction operator. For example, the following query from the running example is a monotone conjunctive query:

q2 = G1 ∧ G4 ∧ [2000 ≤ G7 ≤ 5000]

The first query cannot be called a member of this query language because it is not monotone. Queries of this type (monotone conjunctions) correspond to itemsets in which every predicate represents an item. Itemsets are vigorously studied in the literature, and algorithms to mine frequent itemsets have received increased interest [24, 11]. For example, it is possible to partially order the queries by inclusion to exploit the downward closure property, which means that if some query qi is a subset of some query qj, then the support of qi is a superset of the support of qj. Thus, the search space in such a case can be explored more efficiently. Monotone conjunctions are easy to find and interpret; at the same time, the restriction on disjunctions and negations affects the expressiveness of the mined queries.

The opposite extreme to monotone conjunctions is unrestricted queries. Here predicates are allowed to be combined using any of the above mentioned operators without any restrictions. This extreme case provides full expressiveness for the queries. Examples of unrestricted queries can look as follows:

q3 = (G2 ∧ G4 ∧ G1) ∨ ¬(G3)

q4 = G2 ∧ [G6 < 1.9]

q5 = (¬G1) ∧ ([G5 = Asia] ∧ G3) ∧ [1.9 ≤ G6 ≤ 7.6] ∧ ¬G4

q6 = [2000 ≤ G7] ∧ G1 ∧ [1200 ≤ G7 ≤ 8000]

Both queries mentioned before belong to this query language as well. The expressiveness of queries without restrictions is maximal, but the queries can become more difficult to interpret. For example, they can contain deeply nested structures, meaning we have a query which involves numerous attributes in a complex structure. Despite the fact that the support of such a query may match the support of another query very well (i.e. the redescription formed by these queries will be highly accurate), the interpretation of this redescription will be obstructed by the many entangled conditions. As a consequence, the redescription loses its interestingness. Moreover, the space formed by such redescriptions looks disordered and becomes difficult to search. Here we can observe a rich structure of queries and full expressiveness, while nested structures make queries hard to interpret. Hence, a balance between expressiveness and interpretability is the most desirable feature.

A compromise between these two languages is a linearly parsable query language. Here queries are formed with the help of a simple formal grammar. Moreover, to ensure ease of interpretability it is possible to apply some moderate restrictions, for example to allow every attribute to appear only once.

The selection of a query language should theoretically be performed ahead of adopting the algorithm. In practice, practical constraints very often influence the choice; that is, the adopted algorithm might naturally result in a particular query language. For example, linearly parsable queries are more natural for algorithms with iterative atomic extensions which append a new literal to a query on each iteration [20].

In this Thesis we exploit decision tree induction to mine redescriptions, which affects the query language we use. We stick to data sets with Boolean predicates on the one side and real-valued ones on the other. We avoid the usage of negations by flipping the sign: for example, for a Boolean predicate, instead of ¬G1 we would have G1 < 0.5, meaning '0' (i.e. 'false'), and G1 ≥ 0.5, meaning 'true' or '1'. But, if necessary, negations can be used as well. Also, we allow both conjunctions and disjunctions to provide expressiveness of the resulting queries, and there is no restriction for a predicate to appear only once in a query.
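As a closing illustration of query evaluation, the following sketch combines predicate truth assignments with Boolean operators and recovers the support of query q1 from above; the column encodings are again illustrative.

```python
# Sketch: evaluating q1 = G2 AND (NOT G4) AND (G7 < 4500) on Table 2.1.

countries = ["Canada", "Chile", "China", "France", "Great Britain",
             "Mexico", "Mozambique", "Russia", "USA"]
G2 = [1, 1, 0, 1, 1, 1, 0, 0, 1]   # border with Atlantic Ocean
G4 = [1, 1, 1, 0, 0, 1, 0, 1, 1]   # border with Pacific Ocean
G7 = [5959, 6893, 8850, 4810, 1343, 5636, 2436, 5642, 6194]

q1 = [bool(g2) and not bool(g4) and g7 < 4500 for g2, g4, g7 in zip(G2, G4, G7)]
support = [c for c, holds in zip(countries, q1) if holds]
print(support)   # ['Great Britain']
```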

2.4 Exploration Strategies

There exist several strategies for redescription mining, i.e. a few approaches for finding redescriptions given a query language and a space of possible queries. Different constraints on the redescriptions might be used as well. The combination of these parameters results in different search spaces. Some properties (such as anti-monotonicity) assist in a more effective redescription mining process. There are three main generic exploration strategies for redescription mining.

2.4.1 Mining and Pairing

This simple strategy includes two main steps. Firstly, individual queries are found in the different data sets. Secondly, these queries are combined into pairs based on the similarity of their supports. Thus, a redescription is formed from two similar queries from different data sets. In recent times several authors devised algorithms to mine queries over a fixed set of propositional predicates [6, 11, 62]. This approach has some traits which make it suitable for data sets which include a small number of views, because finding separate queries and pairing them later can be performed very efficiently. In contrast, when data sets contain imposing numbers of views, this exploration strategy results in queries over all predicates pooled together; when combining them, the queries with similar supports might appear to have disjoint predicates. The scheme is advantageous because it allows the adaptation of frequent itemset mining algorithms for mining redescriptions.

As an extension of this independent mining and subsequent pairing, the second step can be replaced with a splitting procedure. This includes pooling together all predicates for the first mining step and later splitting the queries depending on views. Nevertheless, the fact that a query exists does not guarantee that it can be split into several smaller ones. When we have data coming from two different views, we can mine monotone conjunctive redescriptions in a level-wise fashion, similarly to the Apriori algorithm [6, 38]. The supports of queries and their intersections can be used safely for pruning since they are anti-monotonic. Finally, this exploration strategy finds its best applications in the case of exhaustive search, hence when the sets are not big enough to cause an undesirable computational burden.
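A much simplified Python sketch of the mine-and-pair strategy, assuming monotone conjunctions of at most two predicates and a view given as a mapping from predicate names to support sets; all names and thresholds are illustrative.

```python
# Toy mine-and-pair sketch: enumerate small monotone conjunctions per view,
# then pair queries whose supports are similar enough (Jaccard threshold).

from itertools import combinations

def conjunctions(view, max_size=2):
    """Yield (predicate tuple, support set) for monotone conjunctive queries."""
    for k in range(1, max_size + 1):
        for combo in combinations(view, k):
            support = set.intersection(*(view[p] for p in combo))
            if support:
                yield combo, support

def jaccard(a, b):
    return len(a & b) / len(a | b)

def mine_and_pair(left, right, threshold=0.8):
    return [(q_l, q_r, jaccard(s_l, s_r))
            for q_l, s_l in conjunctions(left)
            for q_r, s_r in conjunctions(right)
            if jaccard(s_l, s_r) >= threshold]

# Toy views: predicate name -> supporting objects
left = {"notAmericas": {"China", "Russia", "France", "GB", "Mozambique"},
        "bigArea": {"China", "Russia", "Canada", "USA"}}
right = {"UNSC": {"China", "Russia", "France", "GB", "USA"},
         "communism": {"China", "Russia", "Mozambique"}}
print(mine_and_pair(left, right))
```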

2.4.2 Greedy Atomic Updates

The next exploration strategy is based on an iterative search for the best atomic update to the current query pair. That is, one tries to apply atomic operations to one of the queries such that the resulting redescription becomes better. This process is continued until no further improvements are possible. Atomic updates are operations which include the addition, deletion and edition of predicates. Hence, a new predicate can be added, removed or changed (for example, negated). In order to prevent the algorithm from forming cycles, it is possible to remember the queries which have already been explored. As a starting point, a pair of perfectly matching queries from distinct views can be selected. This approach was first proposed by Gallo et al. [20] and used only addition operations to update the query. Later it was extended to the non-Boolean setting with the ReReMi algorithm [18], which also covers the issue of missing entries, a highly relevant aspect when working with real data.
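A rough sketch of the greedy update loop follows, restricted to additions of conjuncts for brevity; a faithful implementation (such as ReReMi) also evaluates deletions and editions and always applies the single best atomic update. The data layout and names are illustrative.

```python
# Greedy addition-only sketch: extend either query by one conjunct whenever
# this improves the Jaccard coefficient of the supports; stop when nothing helps.

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def greedy_conjunctive(preds_l, preds_r, init_l, init_r):
    """preds_*: dict mapping predicate name -> support set."""
    q_l, s_l = [init_l], set(preds_l[init_l])
    q_r, s_r = [init_r], set(preds_r[init_r])
    best, improved = jaccard(s_l, s_r), True
    while improved:
        improved = False
        for p, supp in preds_l.items():            # try extending the left query
            if p not in q_l and jaccard(s_l & supp, s_r) > best:
                q_l.append(p); s_l &= supp
                best, improved = jaccard(s_l, s_r), True
        for p, supp in preds_r.items():            # try extending the right query
            if p not in q_r and jaccard(s_l, s_r & supp) > best:
                q_r.append(p); s_r &= supp
                best, improved = jaccard(s_l, s_r), True
    return q_l, q_r, best

preds_l = {"notAmericas": {"China", "Russia", "France", "GB", "Mozambique"},
           "bigArea": {"China", "Russia", "Canada", "USA"}}
preds_r = {"UNSC": {"China", "Russia", "France", "GB", "USA"},
           "communism": {"China", "Russia", "Mozambique"}}
print(greedy_conjunctive(preds_l, preds_r, "bigArea", "communism"))
```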

2.4.3 Alternating Scheme

One more approach to build redescriptions is an alternating scheme. We use it as the main exploration strategy in this Thesis, because the algorithms we elaborate are based on decision tree induction. The main idea behind this strategy is to find one query and then find another one which matches it well. Then the first query is replaced with a new one which makes a better match. Alternations are continued until no better match can be made or a stopping criterion is met.

For example, we start with a query from the left-hand side, qL(0), and search for a well matching query from the right-hand side, qR(1). Then we proceed again to the left-hand side and try to find another query qL(2) that matches the one derived from the right. The algorithm runs in this manner until termination.

If one side of the redescription is fixed, the task of finding an optimal query for the other side can be defined as a binary classification task: entities that belong to the support of the fixed query are positive examples, while the entities not in the support are negative examples. Hence, the redescription mining task can potentially be solved with the help of any feature-based classification technique compatible with the query language. Finding a proper starting point for the alternating scheme is a question of the quality of the method on its own. The simplest option is to randomly split the data into examples and use this partition for initialization, or to start with queries which consist of only one predicate. Having fixed the number of starting points and the number of allowed alternations, the complexity of such an approach depends mainly on the complexity of the classification algorithm chosen for the alternations.

In this thesis we focus on the alternating scheme for the redescription mining task, and as the classification algorithm we use decision tree induction. This idea is not new: it was first adopted by the CARTwheels algorithm [48], which is able to process binary data sets and mines redescriptions by matching the terminal nodes (leaf nodes) into pairs of queries.
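A compact sketch of the alternating scheme with decision trees as the classifier. It uses scikit-learn only for illustration (the thesis implementation relies on the R package rpart, see Section 3.2); the initialization, depth and stopping rule shown here are simplified assumptions.

```python
# Alternating scheme sketch: fix the target induced by one side, fit a shallow
# tree on the other side's attributes, take its predictions as the new target,
# and alternate until the Jaccard coefficient of the supports stops improving.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def jaccard(a, b):
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def alternate(D_left, D_right, init_target, depth=2, max_iters=10):
    target, best = init_target.astype(int), -1.0
    for i in range(max_iters):
        side = D_right if i % 2 == 0 else D_left     # alternate between the views
        tree = DecisionTreeClassifier(max_depth=depth).fit(side, target)
        pred = tree.predict(side)
        acc = jaccard(pred, target)
        if acc <= best:
            break                                    # no improvement -> stop
        best, target = acc, pred                     # new query's support becomes target
    return best

# Tiny synthetic example: 100 objects, 3 real-valued attributes per view.
rng = np.random.default_rng(0)
D_left, D_right = rng.random((100, 3)), rng.random((100, 3))
init = (D_left[:, 0] > 0.5).astype(int)              # start from a single-predicate query
print(alternate(D_left, D_right, init))
```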

Chapter 3

Related research

3.1 Rule Discovery

The main feature inherent to redescription mining is its 'multi-view' character. This implies the description of entities with the help of different sets of variables. Nevertheless, this 'multi-view' feature is not unique to redescription mining. One of the most common similar approaches is supervised classification [57], yet it is not always perceived as such. In classification, entities are characterized by the observations on one hand and by the class label on the other hand. The idea of viewing the same object from different angles was introduced by Yarowsky [60], who initiated the aforementioned multi-view learning approaches. This was followed by Blum and Mitchell [7] and evoked high interest in the topic.

Mining a single query can be treated as a classification task: when we fix one query, we get binary class labels and we are looking for a good classifier for them. A particular example where we have Boolean attributes and targets is Logical Analysis of Data [8]. Its purpose is finding an optimal classifier of a pre-determined form (e.g. DNF, CNF, a Horn clause, etc.). Multi-label classification bears some more resemblance to redescription mining as well [55]. Here classifiers are supposed to be learned for conjunctions of labels. The restriction to conjunctions and the focus on prediction (not description) are the main differences of this approach to redescription mining.

Moreover, there are several more approaches that can be regarded as somewhat similar to redescription mining. Emerging Patterns [41] is targeted at Boolean data and itemsets (monotone conjunctive queries). It tries to detect those itemsets whose presence depends statistically on the negative or positive label assignment of the objects. In the case of a perfect outcome, the itemset will reside solely in the positive examples and will constitute a perfect classifier for the given data set. One more approach that can be related to redescription mining is Contrast Set Mining [41]. It can be used to detect monotone conjunctive queries which give the best discrimination of some distinct class from all other objects in the data. Subgroup Discovery [56] can also be mentioned here: it aims to find a query such that all objects from the determined subgroup possess atypical values for a target attribute compared to other objects.


Taking everything into account, it can be underlined that the main differences between redescription mining and these approaches are the following: the goal of redescription mining is to find simultaneously multiple descriptions for a subset of entities which was not previously determined, and it selects only several relevant variables among a big variety. Moreover, the redescription mining problem is one-dimensional despite there being two sets of describing attributes: queries are constructed over one set of attributes, determining subgroups whose quality is measured as their ability to be described by queries over the second set of attributes.

3.2 Decision Trees and Impurity Measures

Decision trees. Regardless of the domain where decision trees are used, they are aimed at using a given set of attributes to classify data into a set of predefined classes. Firstly, a training data set is used to help the tree learn about the specific data: the algorithm splits the source set into several subsets based on an attribute value, and this process is repeated on each resulting subset in a recursive manner, called recursive partitioning. The recursion is considered complete when the objects which fall into the same node carry the same class label, or when further splitting does not add value to the predictions. Secondly, test data sets are used to evaluate the accuracy of the built tree, to determine whether it is able to classify data properly. By properly, we mean placing each object into the correct class (i.e. minimizing instances of misclassification). A decision tree that has multiple discrete class labels is called a classification tree. Tree-based models have a variety of uses, from spam filtering [16] to astrophysics [28]. The concept of decision trees is not new: it was introduced in 1966 by Hunt, Marin, and Stone [13].

In this thesis we mainly concentrate on the classification tree aspect, because trees are used to mine redescriptions of the same (or nearly the same) objects. For example, in the biological niche-finding problem we do not focus on predicting the climatic conditions of any species; on the contrary, the idea is to find specific information about mammals which already live in particular surroundings.

Decision trees were one of the earliest methods used to build classifiers [34]. They have several advantages: they are easily interpretable by human experts; they provide effective induction and accuracy; and they are comparatively easy to build. When using decision trees, it is important to determine the algorithm used to actually build the tree. This includes investigating the different splitting rules used (for example, Information Gain, Entropy, Gini [34, 10]), because the quality of the result might be highly dependent on the choice of these parameters. There exist numerous implementations which are scalable and effective [9]; some of them are more suitable for smaller data sets and vice versa. Thus, the mechanism used to build a decision tree is to be studied in detail in order to provide strong support for redescription mining based on this approach.

In general, for a given set of attributes, there are exponentially many decision trees that can be constructed from it, and the resulting trees differ in their accuracy. Finding the optimal tree is usually unfeasible, since the search space is of exponential size. Nevertheless, there are numerous efficient algorithms that produce decision trees of reasonable quality within an acceptable time span. They mainly use a greedy strategy that deepens a tree by making a succession of locally optimal decisions.

One of the best known algorithms of this type is Hunt's algorithm [42]. It is used as a base in many common algorithms, e.g. ID3 [46], C4.5 [47], and CART [34]. Hunt's algorithm [13] grows a tree in a recursive fashion by partitioning the training set into several, increasingly pure subsets.

Any algorithm used for decision tree induction must deal with two main aspects. The first is how to split the training set: on each recursive step of growing the tree, the algorithm must split the training data into smaller subsets. To embody this, the algorithm must provide a method which specifies the test condition for attributes of diverse types. In addition, a way of measuring the goodness of every test condition should be defined; these 'goodness measures' are commonly called impurity measures and are discussed further below. The second aspect is the stopping criterion. The easiest approach to stop the process of tree-growing is to terminate it whenever all of the entries in the nodes belong to the corresponding classes (i.e. the nodes are pure) or all entries have identical attribute values. These two points are enough to terminate any algorithm which builds decision trees; however, early termination has some advantages. In this thesis we focus on the most famous algorithm for decision tree induction, called CART [34].

Classification and Regression Trees (CART). CART was first introduced by Breiman et al. [34]. It was invented independently within the same time span as ID3 [46], and both use a similar approach for learning a decision tree from training tuples. CART is a non-parametric decision tree training technique which returns classification or regression trees. It is among the most popular data mining techniques for classification purposes and helped to move data mining to a new level [1]. It is a statistical approach that allows selecting, from a huge number of explanatory variables, those which are most important for determining the response variable to be explained.

Decision trees partition (split) the data into mutually exclusive nodes (groups), which are supposed to be maximally pure. The building process begins with a root node which contains all objects; these are then split into nodes by recursive binary splitting. Each split is determined by a simple rule based on a single explanatory variable. The steps performed by CART to grow a classifier can be expressed as follows [34]:

1. All objects are assigned to the root node;

2. All possible splits over the explanatory variables and their values (splitting rules) are generated;

3. For each split from the previous stage, the objects of the parent node are separated into two child nodes based on the value (lower or higher than the split value);

4. The variable and value from step 2 which return the highest reduction of impurity are picked; impurity measures are discussed later in Section 3.2;

5. The split into two child nodes is conducted according to the selected splitting rule;

6. Steps 2-5 are repeated, applying them to all child nodes as if they were parents, until the tree has maximal size;

7. The tree is pruned with the help of cross-validation [31] to return a tree of optimal size. The pruning algorithm attempts to balance the optimistic estimate of the empirical risk by adding a complexity term which penalizes bigger sub-trees. In cross-validation, some objects are randomly removed from the data and then used to assess the predictive power of the tree. (A code sketch of this grow-then-prune procedure follows the list.)
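The following sketch mirrors steps 1-7 using scikit-learn's CART implementation on a standard example data set; it is only an illustration of the grow-then-prune procedure, not the rpart-based setup used in this thesis.

```python
# CART sketch: grow a maximal tree, compute the cost-complexity pruning path,
# and pick the complexity parameter alpha with the best cross-validated score.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Steps 1-6: grow the tree to maximal size, splitting on impurity reduction (Gini).
full_tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

# Step 7: every alpha on the path corresponds to one nested, smaller subtree.
alphas = full_tree.cost_complexity_pruning_path(X, y).ccp_alphas

scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                          X, y, cv=10).mean() for a in alphas]
best_alpha = alphas[int(np.argmax(scores))]
pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print(best_alpha, pruned.get_n_leaves())
```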

A common idea, stopping the tree building early (early termination), can result in insufficient coverage of the interactions between explanatory variables [51]. That is why in CART the tree is allowed to grow to its maximal size. In these maximal trees all leaf nodes will either be small (containing a single object or a desired, predetermined number of objects) or pure (i.e. no further split is needed). Such a tree is overfitted: not only does it fit the data, but also the noise and idiosyncrasies of the training set. Hence, the next steps are dedicated to pruning: the branches which lead to the smallest decrease in accuracy compared to pruning the other branches are pruned.

For each sub-tree T, the cost-complexity measure Rα(T) is defined as [44]:

Rα(T ) = R(T ) + α|T |

Here |T| is the number of terminal nodes (the complexity of the sub-tree), α is a complexity parameter, and R(T) is the overall misclassification rate for classification trees or the total residual sum of squares for regression trees. Every value of α has a corresponding unique smallest tree that minimizes the cost-complexity measure. As the complexity parameter increases from 0, this returns a nested sequence of trees which become smaller in size [44]. Selecting the best of these trees can thus be transferred into the problem of selecting the best size, and cross-validation defines this optimal size. The data set we work with is randomly divided into N subsets (commonly N is set to 10). One of these subsets is used as a test set; all other N - 1 sets are grouped together and used as the learning data set. The tree is grown and pruned N times, each time using a different subset as the test set. A prediction error (the sum of squared differences between the observations and the predictions) is calculated for every size of the decision tree. Then it is averaged over all subsets and matched with the sub-trees of the complete data set using the values of α. The optimal-sized tree is the one with the lowest cost-complexity measure [31].

CART makes the assumption that samples are independent when computing the classification rules [36]. Models produced by CART have positive features: the input data is not supposed to follow the normal distribution, and the predictor variables do not have to be independent. It is possible to model non-linear relations between predictor variables and observed data. CART enables the evaluation of the importance of the diverse explanatory variables for defining a splitting rule and splitting value; the technique used for this is the 'variable ranking method' [45]. Variables which do not show up in the resulting tree can be called less important for the description of the data set.

CART has numerous implementations that undergo continuous changes, extensions and improvements, and new units are written to make it more convenient or specific in distinct domains. In the implementations of our algorithms for redescription mining we use available packages in R, namely rpart [52] and rattle [59].

Impurity measures. The main aspect in decision tree building is to decide how to split the data set. The 'goodness' of a split is evaluated by impurity measures, which in fact are functions assessing how well a particular split separates the data into classes. The impurity measure is an objective to be minimized at each intermediate stage of decision tree building. In general, an impurity measure should satisfy the following:

1. It should be largest when the data is split evenly among the attribute values;

2. It should be 0 when all data belongs to the same class.

Quinlan’s information measure (Entropy). Originally, Quinlan offered to mea- sure this ’goodness’ based on on a classic formula from information theory:

Entropy = − Σi pi log(pi)

with pi the probability of the i-th message. Thus, the outcome depends entirely on the probabilities of the possible messages. If their probabilities are equal, there is the greatest amount of uncertainty, and thus the information gained will be the greatest. Consequently, if they are not uniform, less information will be gained. The value of this objective function also depends on the number of messages. The entropy of a pure node is zero, because then the probability becomes 1 and log(1) = 0. Vice versa, entropy is maximal when all classes have equal probability to appear.

Information gain. One of the most common impurity measures used while building decision trees in various implementations is Information Gain (IG), which in fact is a difference in entropy (i.e. it also involves computing the entropy of the nodes). Information Gain, popularized by Quinlan in [46], is the expected reduction in entropy provoked by partitioning the objects according to a given attribute. Let us denote by C a set which has p objects of one class (P) and n objects of another class (N). If the decision tree is accurate, it is supposed to classify these objects in the same proportion as they are present in C.

As the root of a decision tree, an attribute A (which takes values {A1, A2, ..., Av}) is picked, so that it partitions C into {C1, C2, ..., Cv}. If Ci contains pi objects of class P and ni objects of class N, and the information expected to be required for the sub-tree of Ci is denoted I(pi, ni), then the expected information needed for the tree with A as root is defined as the weighted average:

E(A) = Σi=1..v ((pi + ni) / (p + n)) · I(pi, ni)

The information gained by using A as a root is defined as follows:

IG(A) = I(p, n) − E(A)

Whenever Information Gain is used as the impurity measure in decision tree algorithms, all candidate attributes are investigated and the one which maximizes the information gain is chosen. The process then continues on the residual subsets {C1, C2, ..., Cv}.

Classification Error. The classification error can also be used as an impurity measure. It is likewise aimed at determining the 'goodness' of a split at a node by considering the entries which go to the child nodes. It measures the misclassification error made by a node and, for a node t, looks as follows: Error(t) = 1 − maxi P(i|t). Thus, the classification error is maximal (1 − 1/N for N classes) when all entries are evenly distributed across the classes; this means we gain the least interesting information. The classification error becomes minimal when all entries represent the same class (Error = 0).

The GINI index (Gini). An impurity measure very similar to Quinlan's was presented by Breiman et al. [34] and is called the Gini index. Gini measures how often a randomly chosen element from the data would be erroneously labeled if it were randomly labeled according to the distribution of labels in the subset. The Gini impurity is computed as the sum over all items of the probability of each item being chosen multiplied by the probability of a mistake in categorizing this item. Thus, it equals zero when all entries of a node belong to the same class. Formally, Gini looks as follows:

Gini = 1 − Σi pi²

with pi the probability for each class to appear. In case we have a pure class in a node, the probability becomes 1 (Gini = 1 − 1² = 0). Similarly to entropy, Gini becomes maximal when all classes carry equal probability. Originally, Gini measures the probability of misclassification of a set of objects, rather than the impurity of a split. The Gini index together with Information Gain are the most commonly used measures in classifiers built with the help of decision trees. However, the Gini index behaves a bit differently with the data. As mentioned before, Information Gain tries to split the data into distinct classes, whereas Gini seeks the largest class and extracts it first; then, in the residual data, it looks for the next attribute which would help in extracting the next largest class, and this continues until the final tree is built. If the data is such that the split into classes is quite clear, the tree will end up with pure nodes (i.e. leaf nodes that contain objects of only one class). In practice, pure decision trees are attainable only in very rare circumstances. In our algorithms for redescription mining we use Information Gain and the Gini index as impurity measures. We discuss their effect on the different real-world data sets in Subsection 5.3.1.
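For reference, the impurity measures discussed in this section can be computed from the class counts at a node as in the following sketch; the two-class examples and function names are illustrative.

```python
# Impurity measures from class counts at a node (illustrative sketch).

import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def classification_error(counts):
    return 1.0 - max(counts) / sum(counts)

def information_gain(parent_counts, children_counts):
    """Entropy of the parent minus the weighted entropy of the children."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# A pure node has zero impurity; an even split has maximal impurity.
print(entropy([5, 0]), gini([5, 0]), classification_error([5, 0]))   # 0.0 0.0 0.0
print(entropy([5, 5]), gini([5, 5]), classification_error([5, 5]))   # 1.0 0.5 0.5
print(information_gain([6, 4], [[6, 0], [0, 4]]))                    # ~0.971
```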

3.3 Redescription Mining Algorithms

Since redescription mining was first introduced by Ramakrishnan et al. [48], there have been several other contributions to the topic. In particular, both Kumar [33] and Ramakrishnan et al. [48] worked on redescription mining using decision trees. They introduced an alternating approach to grow decision trees, which are later used to derive redescriptions.

Beside decision trees, redescription mining was presented through such ideas as Karnaugh maps in [63], where it is depicted as a simple game with only 2 rules. A pair of two identical maps contains variables on its sides, and the blocks inside the maps represent intersections of these variables. Blocks can be uncolored (if there is no intersection of the corresponding variables in the data set) or colored, if they are non-empty. The rules are: a colored cell can be removed as long as it is removed from both maps, and an uncolored cell can be removed from either (or both) maps. If one thinks of objects as transactions and descriptors as items, then a colored cell in the Karnaugh map corresponds to a closed itemset from the association mining literature [61].

Redescription mining algorithms based on frequent itemsets are covered in [20]. Here the task is considered as finding subgroups having several descriptions. The authors offer algorithms, based on heuristic methods, that produce pairs of formulae that are almost equivalent on the given data set. The methods also use different pruning strategies to avoid useless paths in the search space. Another algorithm for redescription mining, based on co-clusters, was presented in [43]. The redescription mining task is viewed as a form of conceptual clustering, where the goal is to identify clusters that afford dual characterizations, i.e. the mined clusters are required to have two meaningful descriptions.

The greedy search algorithm used in [20] to mine redescriptions was extended with efficient on-the-fly discretization by the ReReMi algorithm, introduced in [18]. Here the algorithm defines initial pairs of variables from a data set and updates them until no further improvements can be made. Updates can include the addition, deletion and edition of predicates.

Since in this thesis we use decision tree induction to elaborate algorithms for redescription mining over non-binary data sets, the existing approach which exploits the same idea is to be covered. Previously, CART was incorporated into the CARTwheels algorithm [48] to mine redescriptions.

CARTwheels. The main contribution to making redescription mining a relevant research topic was made by Ramakrishnan et al. in 2004 [48]. Here the CARTwheels algorithm was presented, which derives redescriptions with the help of decision trees grown in opposite directions and then matched at their leaves. Further, the authors in [43] explicitly showed applications for redescriptions and structured the formalism of the topic. Moreover, Kumar in his PhD thesis [33] exploits decision trees as well, to characterize sets of genes. He also extended CARTwheels by presenting a theoretical framework which allowed a systematic exploration of the redescription space, the involvement of redescriptions across domains, etc.

CARTwheels was the first introduced algorithm for redescription mining which involves decision tree induction [48]. It uses two binary data sets to grow two decision trees which are matched at their leaves. When they are matched, the paths which lead to similar leaf nodes can be written as queries to form redescriptions.
In the original paper [48] the authors use a data set consisting of two matrices, where they assign class labels based on a greedy set covering of the objects using entities of the left-hand side part. The decision conditions from one tree are combined with the corresponding decision conditions from the second tree; hence, the paths which lead to the same class can be treated as queries to form a redescription. The algorithm returns as many redescriptions as there are matching paths in a pair of resulting trees. This approach selects paths in the grown trees and combines the splitting rules of the corresponding terminal nodes via Boolean operators. Negations are also involved whenever a path belongs to the 'no' side of the decision tree. As a visual example, a pair of resulting trees used to form redescriptions is depicted in Figure 3.1.

Figure 3.1: Tree growing with alternations by CARTwheels. (left) The top tree defines set-theoretic expressions to be matched. (middle) The bottom tree is grown to match the first one. (right) The bottom tree is fixed and the top tree is re-grown to match its leaves. Arrows represent matching paths which form redescriptions. (Following [48])

Fig. 3.1 shows three frames of the tree growing process. The right-most frame depicts the final version of the trees that form redescriptions. The matching paths can be written as the following redescriptions:

(X3 ∩ X1) ∪ (X4 − X3) ←→ Y4
(O − X3 − X4) ←→ (Y3 − Y4)
(X3 − X1) ←→ (O − Y3 − Y4)

These alternations can be continued until the leaves match well enough or until the maximal number of unsuccessful alternations is reached. However, it is important to notice that in this approach the authors set the depth as a constant (in this example d = 2) and re-grow trees of the same depth in each iteration.

The CARTwheels algorithm uses the duality between path partitions and class partitions. Thus, the crucial issue is to combine paths into redescriptions only when they lead to the same class label; further evaluation of this partition determines the quality of the result. CARTwheels results in a single pair of trees, which are re-grown with the same user-fixed depth and cover the whole data set. Inspired by the idea of using decision trees for redescription mining, in this thesis we elaborate two algorithms that grow decision trees to match at their leaves while gradually increasing the depth. We also enable them, firstly, to work with real-valued data on one side and, secondly, to process real-valued attributes on both sides, using the data discretization routine described in Section 4.6.

Chapter 4

Contributions

4.1 Redescription Mining Over non-Binary Data Sets

Redescription mining techniques based on decision tree induction were previously able to handle only Boolean data and could not handle other cases without data pre-processing. Techniques based on other redescription mining approaches, for example the one presented by Galbrun et al. [18], are able to handle numerical and categorical data directly. In this thesis we extend redescription mining techniques that exploit decision tree induction to the non-Boolean setting, apply them to real-world data, test their ability to find planted redescriptions, and compare them with existing redescription mining techniques. In particular, we work with two methods which both have decision tree induction as a basis.

As a result, we expect our algorithms to return interesting and informative redescriptions which are useful in a particular domain or can assist in solving an existing problem. Beyond that, these methods can be applied and tested in other domains as well, since redescription mining might be useful for them too; bioinformatics, for example, is a good candidate. As long as the data sets have the required form, our approaches can be exploited in any domain. Very often, domain knowledge is essential to draw conclusions from the outcomes. One possible domain is the biological niche-finding problem, where we look for rules that determine in detail the specific living conditions of species. It is comparatively easy for a layman to assess the quality of a redescription mined in such a domain: if we get a rule saying that the Polar Bear lives in places where the average January temperature is below 2 degrees Celsius, this statement is quite understandable even for a person without profound knowledge of biology. Nevertheless, the user might encounter more specific cases where background knowledge of the domain becomes crucial. Also, the configuration of parameters, which is very often a key to success in data mining, might require some consideration of the data and domain at hand.

Redescriptions are meant to bring new, interesting insight into data. Thus, it is crucial for a method to deliver not only intuitively expected rules but also to reveal specific traits that assist in the niche-finding problem, or any other one. We introduce two algorithms for redescription mining over non-binary data sets. Both

of them involve decision tree induction. In particular, we grow trees in opposite directions, gradually increasing their depth, so that they match in the end. As input, a data set (O, A, v) consisting of two matrices is used: one side contains binary attributes, the other side is composed of real-valued attributes. Not all real-world data sets meet this requirement, so in Section 4.6 we discuss a way to overcome this restriction. The target vectors needed for each step are initially formed from the binary data set; afterwards they are formed based on the previous split result, so that every iteration is adjusted based on the previous one. In the end we get pairs of decision trees, grown in parallel to match at their leaves. Queries are then derived from the resulting trees for further analysis. As a measure of accuracy of a redescription we use the Jaccard coefficient, chosen for its computational simplicity and its ability to provide a reasonable assessment of the similarity of the two queries that form a redescription. The statistical significance of a result is determined with the help of a p-value computation, since we want the results not only to be informative but also to carry statistically meaningful information.

4.2 Algorithm 1

Algorithm 1 extends redescription mining based on decision tree induction to the non-Boolean world. As already mentioned, the starting point of an algorithm with an alternation scheme is an important aspect to be defined. The algorithm expects two arrays (e.g. matrices) as input data: the left matrix (L) is expected to contain Boolean data, the right matrix (R) contains numerical data. To initialize tree induction, the algorithm needs a data set with a target vector (the vector based on which the tree is built). The target vector consists of entries from the left matrix: namely, each column of the left-hand side is used as the target vector for one run of Algorithm 1. We initiate tree induction (CART) on the right data set and build a tree of depth 1. Thus, we obtain a short classifier which uses some parameter from the right-hand side matrix as a splitting rule. Further, we form a new target vector based on this first split: after dividing the data, we get two child nodes whose class labels correspond to the majority class in them, in our case 0 and 1. Having that, we proceed to grow the second tree to match the first one. To do so, the new target vector formed from the right-hand side split is used and the algorithm is run on the left side with depth 2. This process of forming new targets and building deeper trees continues until one of the stopping criteria is met.

Algorithm outline. Figure 4.1 represents the steps undertaken by the first algorithm. As the initialization stage, we use a target vector taken from the left (binary) matrix and perform a split of depth 1 on the right matrix, which possibly contains real-valued data. Figure 4.1 depicts trees (left and right) with maximal depth 2. The nodes are enumerated in the following manner: every parent node is marked as n, every left child node is enumerated as 2n, and every right child node as 2n+1; this holds for both trees and both algorithms. The first frame (d=1) depicts the initial split of both data arrays with depth 1, where the split of the right array is made with a target vector from the left (an arrow). Further, the algorithm forms a target vector based on the right split and proceeds to split the initial left matrix (but with the newly modified target vector) with depth 2. Thus, every time the tree is re-grown from scratch using the CART algorithm and the current target vector, which in turn is formed based on the previous split result (i.e. class labels are assigned depending on the leaf nodes the entities fall into). New targets are formed and the depth is increased until termination. As a result we get a pair of trees: the left tree classifies the binary data, the right tree classifies the real-valued data. For instance, if we work with the biological niche-finding problem, one side includes species attributes from the data and the other consists of climatic data. At each split the algorithm picks a splitting parameter and a splitting value (together called a splitting rule) which maximize the purity of the resulting nodes (i.e. the purity measure). The actual impurity function used for this does not play a crucial role for now. The splitting rules along the paths to the terminal nodes of both trees will later be used to build redescriptions.

Figure 4.1: Tree-growing process in Algorithm 1

Algorithmic framework of Algorithm 1. The listing Algorithm 1 below describes the algorithmic framework in detail. First, a data set suitable for CART induction is formed. construct_tree creates a decision tree with the provided parameters using one part of the data set, either left or right: that is, the target vector formed from the previous split result and the min_bucket parameter, which is responsible for the minimal size of tree nodes. min_bucket is an important tunable parameter which controls a trade-off between redundancy and interpretability. We pay attention to this parameter since it prevents overfitting and helps to terminate tree induction earlier, which makes the resulting queries less massive and more interpretable. In particular, in problems related to biological niche finding the user might be interested in nodes which include the majority of a particular species, because for a reasonable redescription we expect the majority of the population to share similar living conditions; if set too high, the parameter might not give any insight for animals that are rare in Europe, such as the Polar Bear or the Moose. In other cases the min_bucket parameter is also crucial: it helps to adjust CART to split the data set in such a way that every node contains at least a defined number of entities. The function construct_target_vector forms a vector based on the result of the previous split, to be given to the next split of the data. In the end, the list of redescriptions is formed and each of them is evaluated by its Jaccard coefficient.

Algorithm 1: Algorithmic framework

Data: descriptor sets {Li}, {Ri}
Result: redescriptions Rd, Θ - Jaccard coefficients
Parameters: d - maximal depth of the tree; min_bucket - minimal number of entries in a node
Initialization:
    Set answer set Rd = {}
    Set Jaccard coefficient set Θ = {}
    Set left matrix L = {Li}, right matrix R = {Ri}
Alternations:
foreach column i in L do
    Set all_paths_tl = {}; Set all_paths_tr = {}
    Set target vector tv = construct_target_vector(Li)
    Set tree tr = construct_tree(R, tv, max_depth = 1, min_bucket)
    tv = construct_target_vector(tr)
    if all entries in tv are of the same class then
        Rdi = NULL; Θi = NULL; flag = false
    else
        flag = true; depth = 2
    end
    while flag do
        if depth ≤ d then
            tl = construct_tree(L, tv, depth, min_bucket)
            tv = construct_target_vector(tl)
            tr = construct_tree(R, tv, depth, min_bucket)
            tv = construct_target_vector(tr)
            if tl_current != tl_previous && tr_current != tr_previous then
                depth = depth + 1
            else
                flag = false
            end
        else
            flag = false
        end
    end
    Rdi = paths_tl ←→ paths_tr
    Θi = Jaccard(tl, tr)
end
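To make the alternation more concrete, the following R sketch shows roughly how one pass of Algorithm 1 at a given depth could look when rpart is used as the CART implementation. This is a simplified illustration, not the thesis implementation: the function name alternate_once is ours, and re-deriving the target vector via predict() stands in for construct_target_vector.

library(rpart)

# One alternation of Algorithm 1 (illustrative sketch): split the right matrix R
# with a target taken from column i of the Boolean left matrix L, turn the leaf
# assignment into a new target, then re-grow deeper trees on L and R in turn.
alternate_once <- function(L, R, i, depth, min_bucket) {
  ctrl <- function(d) rpart.control(maxdepth = d, minbucket = min_bucket, cp = 0)

  tv <- factor(L[, i])                                   # initial target vector
  tr <- rpart(tv ~ ., data = as.data.frame(R),
              method = "class", control = ctrl(1))       # depth-1 right tree
  tv <- predict(tr, type = "class")                      # new target from the split

  tl <- rpart(tv ~ ., data = as.data.frame(L),
              method = "class", control = ctrl(depth))   # deeper left tree
  tv <- predict(tl, type = "class")

  tr <- rpart(tv ~ ., data = as.data.frame(R),
              method = "class", control = ctrl(depth))   # deeper right tree
  list(left = tl, right = tr)
}

In the actual algorithm this step is repeated with an increasing depth until one of the stopping criteria of Section 4.4 is met.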

4.3 Algorithm 2

Configuration of Algorithm 2. Algorithm 2 is also based on alternating decision tree induction and starts with the same initialization. It has some specific features: instead of increasing the depth at every step and re-building the trees from scratch, Algorithm 2 continues to build the existing trees, making them deeper at every iteration. Every new depth of a tree is built on either the left or the right matrix using the target vectors from the right or left tree, respectively. This procedure continues until the stopping criteria are met. Thus, we start again with an initial target vector taken from the left-hand side data and build a decision tree of depth 1 using the CART algorithm, which picks a splitting rule that maximizes the purity of the nodes. After this, a new target is formed based on the class label assignment of the first split, and the left-hand side data is split with depth 1 using this target. The trees are grown in a level-wise fashion until the stopping criteria are met. As a result we get two tree structures (left for the binary matrix, right for the real-valued matrix) where each new depth is built based on the previous split result of the other tree. The final trees are used to extract and evaluate redescriptions.

Outline of Algorithm 2. Figure 4.2 depicts a sequence of frames representing the steps undertaken by the algorithm. First, the initial split is performed with Target vector 1 on the right matrix; the split is performed with the CART algorithm, and the features used (impurity measure, node size) are to be set based on preferences. The algorithm then proceeds to split the left matrix with Target vector 2, which is formed from the previous split (on the right side). This process continues until no further split can be performed by the CART algorithm (i.e. no split results in purer nodes) or any of the other termination criteria are met. Hence, each branch in the trees receives a target vector formed after the split at the previous depth. As an outcome, we get two tree-like structures. In practice these structures are parts of decision trees built with the CART algorithm (several decision trees of depth 1) that can be assembled into final trees. At the end, we move to query extraction: each tree in a pair yields one query, and the extent of their correspondence is evaluated via the Jaccard coefficient. This coefficient is computed between two resulting vectors which are formed based on the final trees (Figure 4.3). Finally, the two queries mined from the final trees form a redescription. All redescriptions mined from a given data set are then analyzed further.

Figure 4.2: Tree-growing process in Algorithm 2

Algorithmic framework of Algorithm 2. The listing Algorithm 2 below describes the algorithmic framework in detail. As previously, we form a data set consisting of two views and construct the decision trees in a step-wise manner. The maximal depth is a parameter defined by the user.

After the whole data set is processed, the redescription set is formed and returned. Each of the redescriptions is to be evaluated for future interpretation.

Algorithm 2: Algorithmic framework

Data: descriptor sets {Li}, {Ri}
Result: redescriptions Rd, Θ - Jaccard coefficients
Parameters: min_bucket - minimal number of entries in a node; md - maximal depth
Initialization:
    Set answer set Rd = {}
    Set Jaccard coefficient set Θ = {}
    Set left matrix L = {Li}, right matrix R = {Ri}
Alternations:
foreach column i in L do
    Set all_paths_tl = {}; Set all_paths_tr = {}
    Set tl = {}; Set tr = {}
    Set target vector tv = construct_target_vector(Li)
    flag = true; count = 0
    tr = construct_tree(R, tv, max_depth = 1, min_bucket)
    while (count ≤ depth(tl)) && (count ≤ depth(tr)) && (count ≤ md) do
        foreach leaf in tree tr do
            tv_leaf = construct_target_vector(tr, leaf)
            tl.add(construct_tree(L_leaf, tv_leaf, max_depth = 1, min_bucket))
        end
        foreach leaf in tree tl do
            tv_leaf = construct_target_vector(tl, leaf)
            tr.add(construct_tree(R_leaf, tv_leaf, max_depth = 1, min_bucket))
        end
        count = count + 1
    end
    Rdi = paths_tl ←→ paths_tr
    Θi = Jaccard(tl, tr)
end

4.4 Stopping Criterion

While building decision trees, a very common issue is over-fitting: the trees grow too large and tend to make the whole redescription mining process ineffective. In the end we might get huge trees, and the redescriptions derived from them will be massive and contain a large variety of variables; interpretation of such long queries is difficult and undesirable. We adopted several mechanisms that help to terminate the process early. Firstly, the user is able to determine the maximal depth of the resulting trees. This enables the algorithm to be tailored to many domains, and this flexibility allows building different trees so that the user can compare the results and discover a suitable depth parameter.

However, in practical experiments users rarely need trees deeper than a maximal depth of 3, since redescriptions derived from deeper trees would be difficult to interpret. Secondly, min_bucket is the parameter responsible for the minimal number of entries in a node. Limiting this parameter is very useful: usually, the lower the minimal number of entries per node, the bigger the returned tree. The data set we work with should be taken into consideration when setting this parameter. We therefore suggest starting with a small min_bucket value and gradually increasing it until the resulting redescriptions have an optimal size, depending on the data set and the problem being solved. Moreover, a logical point to stop the tree building process is when further splitting does not produce any change. Thus, we check whether the split performed at the next depth has reorganized the data across the nodes compared with the previous result; if yes, we continue to split the data until no change occurs. Both our algorithms contain this check as a built-in feature. However, in practice this stopping criterion sometimes produces quite deep trees (i.e. it does not prevent overfitting entirely). The impurity measure used within the algorithm is not of principal importance. In our experiments with real-world data sets (discussed in Section 5.2) the Gini index and Information Gain were used; the user is able to pick the one which is more suitable, or try all of them and select the most prolific. As an outcome we get the final trees, namely a set of tree pairs which are used to derive redescriptions.
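A minimal sketch of the "no change" check, assuming rpart trees fitted on the same rows in the same order (the helper names are ours, not the thesis code): it compares the partition of the training rows over the leaves of two consecutive trees using rpart's $where field.

# Compare the leaf partitions of two rpart fits on the same rows:
# if the partition did not change, the deeper split brought nothing new.
canon_partition <- function(fit) {
  groups <- split(seq_along(fit$where), fit$where)   # rows grouped by leaf
  sort(vapply(groups, function(g) paste(sort(g), collapse = ","), character(1)))
}
no_change <- function(prev_fit, curr_fit) {
  identical(unname(canon_partition(prev_fit)), unname(canon_partition(curr_fit)))
}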

4.5 Extracting Redescriptions

We use the trees to mine one-dimensional rules and combine them with each other to form a redescription. Figure 4.3 exemplifies a pair of trees obtained after the algorithms are run. To extract a redescription from it we use the splitting parameters (with their splitting values) and the Boolean operators 'OR' and 'AND'. To combine the paths of one of these trees into a query, we join the splitting rules within one path via the 'AND' operator, and the paths are joined with each other via the 'OR' operator. Labels which correspond to the 'yes' and 'no' assignments are also taken into consideration by flipping the sign. For instance, a query corresponding to the left-most tree in the figure would look like (1 ≥ 0.5 ∧ 2 < 0.5) ∨ (1 < 0.5 ∧ 3 < 0.5), or, if we use negations, (1 ∧ ¬2) ∨ (¬1 ∧ ¬3). Figure 4.3 also depicts an example of how we assess mined redescriptions with the Jaccard coefficient derived from two trees grown by either of the two presented algorithms. When one side holds Boolean data and the other real-valued data, after processing with Algorithm 1 or 2 we get nodes which belong to class '0' or '1'. The leaf nodes of the resulting trees are then grouped into two binary vectors (left and right) and the Jaccard coefficient is computed.

Figure 4.3: Redescription extraction and evaluation

Jaccard’s coefficient is equal to 1 when we have a perfect match. In theory, we should be interested only in such redescriptions. However, in practice redescriptions even with lower Jaccard coefficient pose an interest. In addition, support of two queries is especially important (i.e. E1,1- where both queries hold), since we do not want to get a redescription which covers all attributes. This would mean that a redescription does not provide any interesting insight. Or vice versa, if the support is really low, it holds for almost no entries from the data set. Chapter 4 Contributions 35

4.6 Extending to Fully non-Boolean setting

So far we have considered data sets which contain one binary and one real-valued matrix. However, this setting poses quite a restrictive constraint when solving real-world redescription mining problems, because many domains produce real-valued data. Thus, data discretization has to be performed. This issue has so far been studied in the context of Association Rule Discovery by Srikant and Agrawal [50]. Their methods are based on a-priori bucketing, but they are very specific to association rule discovery, which makes them inappropriate for redescription mining. We therefore adopt a clustering-based discretization routine for the real-valued side of our data set. This binarized matrix is then used to initialize both of our algorithms. Having used each of its columns as a target vector for the very first split of the data set, we can afterwards use the initial (pre-binarization) left-hand side matrix, because CART only requires the target vector to be binary.

4.6.1 Data Discretization

It is possible to apply Algorithms 1 and 2 to a fully non-Boolean data set as well. To enable them to work with real-valued data on both sides, we apply a binarization routine to one of the sides. This routine can be considered as a pre-processing step which prepares the data set to look exactly the way the algorithms expect it to be; it is applied to real-valued matrices before running the algorithms. To implement it we use three of the available clustering techniques, although the list of applicable techniques is not limited to those three. A good example of data that can be transformed from real-valued to binary is the DBLP data [2], which contains information about a computer science bibliography (more details in Section 5.4). Here the left matrix corresponds to conferences and the number of papers published by each author in them; for example, author N has submitted 4 papers to the FOCS conference. The right-hand side matrix contains the same authors and the number of co-authored papers between them, i.e. it describes how often each author worked on a paper with every other author. The left-hand side matrix can be transformed into a binary one with the help of the clustering techniques covered in Section 5.4, to allow the application of the elaborated redescription mining methods. Regardless of the clustering method used, the binarization routine is conducted with the following steps.

1. Select the first column as initial point;

2. Perform clustering of the values of this column into several clusters using one of the available clustering techniques;

3. Split taken column into several based on clustering result (initial attribute values are split into several intervals);

4. Assign to the attributes new values ’0’ or ’1’, according to initial values;

5. Repeat the procedure until all columns from the initial matrix are split into several intervals and filled with '0' or '1'.

The algorithmic framework for data discretization is given in Algorithm 3.

Algorithm 3: Algorithmic framework for data discretization

Data: real-valued descriptor set {L} of size i × j
Result: Boolean descriptor set {Lnew} of size i × (n · j)
Parameters:
    Cluster - function used to perform the clustering, one of {DBSCAN, hclust, k-means}
    parms - parameters for the selected clustering method
    n - number of clusters
    Range_cluster - range of values which fall into cluster n
Algorithm:
Set {Lnew} = {}
Set Cluster = one of {DBSCAN, hclust, k-means}
foreach column Lj in {L} do
    Cluster_parms(Lj) into n clusters and split Lj into n columns according to Range_cluster
    foreach entry L_{i,j} do
        if value L_{i,j} ∈ Range_cluster then
            set Lnew_{i,j} = 1
        else
            set Lnew_{i,j} = 0
        end
    end
end
Return {Lnew}

As a result we obtain a binarized matrix with an increased number of columns. This matrix can easily be used with both methods to find redescriptions. The parameters used within the clustering routine are mostly determined by the user and the data at hand; some traits and peculiarities are discussed in Section 5.4 on the real-world data experiments.
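As a rough sketch of this routine, assuming k-means as the clustering step (the helper name binarize_by_clustering is ours), each real-valued column is clustered and replaced by one 0/1 indicator column per cluster:

# Binarize every real-valued column of a matrix M by clustering its values
# into n clusters and emitting one indicator column per cluster.
binarize_by_clustering <- function(M, n = 3) {
  cn  <- if (is.null(colnames(M))) paste0("V", seq_len(ncol(M))) else colnames(M)
  out <- list()
  for (j in seq_len(ncol(M))) {
    km <- kmeans(M[, j], centers = n)          # cluster the values of column j
    for (c in seq_len(n)) {                    # one 0/1 indicator column per cluster
      out[[paste0(cn[j], "_c", c)]] <- as.integer(km$cluster == c)
    }
  }
  as.data.frame(out)
}

With hclust or DBSCAN the cluster assignment would simply replace the km$cluster vector; the surrounding loop stays the same.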

4.7 Quality of Redescriptions

The quality of a redescription is a rather abstract notion. It is a compound of several characteristics which we try to evaluate with objective criteria. For example, a good redescription can be one which is easy to interpret, has reasonable support, and is statistically significant.

4.7.1 Support and Accuracy

One of the most defining features of a redescription is its support (E1,1). There are no strict bounds on the support which make a redescription good or bad; this depends on the data set we work with. Intuitively, we are not interested in redescriptions which are supported by either a single row or by almost all rows of the data set. It might be desirable to fix lower or upper bounds on the support cardinality of the queries, and possibly on that of the individual predicates involved, in each case individually. In our experiments we adopt the Jaccard measure [27] to assess the accuracy of mined redescriptions. It provides a nice balance between simplicity of computation and agreement with the symmetric approach: the weights in the Jaccard coefficient take the supports of both queries into account equally, it is scaled to the unit interval, and it does not involve entities that are not supported by either of the two queries. The Jaccard coefficient is computed analogously for both methods (Algorithms 1 and 2). Two resulting vectors are formed based on the final structure of the decision trees: the entities which fall into the corresponding nodes are arranged into an l-vector for the left tree and an r-vector for the right tree. Since the rows of both sides of the data are keyed by id, it is convenient to compute the indices we need for the final assessment. This is depicted in Figure 4.3, where green arrows indicate the matching paths in the trees, i.e. the paths that compound a redescription mined from a particular pair of trees. Then, based on the resulting vectors, we compute the following quantities to be plugged into the Jaccard similarity function:

1. E1,1 - the number of entries where both queries hold (i.e. paths leading to the '1' class assignment);

2. E1,0 - number of entries where only the first (left) query holds;

3. E0,1 - number of entries where only the second (right) query holds.

Depending on the purpose, the user is able to determine the minimal Jaccard coefficient a redescription must reach to be relevant for further analysis. In many domains, queries with similarity lower than 1 are also desirable and pose scientific interest. The quality of the queries involved in a redescription also determines its expressiveness and interestingness. For instance, long and nested expressions are hard to interpret and hence carry minor interest for data mining tasks. Nevertheless, very strong restrictions on the syntactic complexity of queries might severely limit the expressiveness. Thus, a balance between these two partly conflicting characteristics, which at the same time are difficult to assess, is needed. The expressiveness of the language and the extent of interpretability of its individual elements are largely defined by the syntactic restrictions applied during the construction of queries (rules). One way to keep queries interpretable is to limit their maximal length; in our algorithms we do so by limiting the maximal depth of the decision trees. We combine the paths in a tree to form one side of a redescription and avoid negations by flipping the sign of the splitting rule whenever a node is connected to its child by a 'no'-labelled edge.
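Returning to the accuracy computation, a small sketch of the Jaccard coefficient with the two queries given as 0/1 support vectors over the same rows (the variable names are illustrative):

# Jaccard similarity of two queries represented as binary support vectors.
jaccard <- function(l_vec, r_vec) {
  e11 <- sum(l_vec == 1 & r_vec == 1)   # rows where both queries hold
  e10 <- sum(l_vec == 1 & r_vec == 0)   # rows where only the left query holds
  e01 <- sum(l_vec == 0 & r_vec == 1)   # rows where only the right query holds
  e11 / (e11 + e10 + e01)
}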

4.7.2 Assessing Significance

It is important to be able to determine how significant the mined redescriptions are: statistical significance is a crucial feature for assessing the quality of the returned results. The present-day concept of statistical significance, originating from R. Fisher [32], is widely used in statistical analysis and we exploit it in our experiments as well. The simplest constraint applied to mined redescriptions is an accuracy threshold, leaving out the redescriptions which do not exceed it. Nevertheless, the statistical significance of the redescriptions is also important: the support of a redescription (qL, qR) should carry some information beyond what is implied by the supports of the individual queries. To measure this, we test against the null model in which the two queries are independent [21]. Statistical significance plays a vital role in statistical hypothesis testing, where it is used to determine whether a null hypothesis should be rejected or retained [39]. The intuition is as follows: a redescription should not be likely to appear at random from the underlying data distribution. That is, the accuracy of a redescription should not be readily deducible from the supports of its queries. In particular, if both queries that form a redescription cover almost all objects, the overlap of their supports is necessarily large as well, and the high accuracy of such a redescription is naturally predictable. The p-value is computed to represent the probability that two random queries with marginal probabilities equal to those of qL and qR have an intersection equal to or greater than |supp(qL, qR)|. The binomial distribution [58] is used for this probability, given as follows:

pval_M(q_L, q_R) = \sum_{s=|supp(q_L, q_R)|}^{|E|} \binom{|E|}{s} (p_R)^s (1 − p_R)^{|E|−s}

with p_R = |supp(q_L)| · |supp(q_R)| / |E|^2. This is the probability of obtaining a set of cardinality |E1,1| or greater if each element of a set of size |E| is selected with probability equal to the product of the marginal probabilities of q_L and q_R, according to the independence assumption. The authors of [18] used the same approach to evaluate the statistical significance of redescriptions. The higher the p-value, the more likely it is to encounter the same support for two independent queries; thus the null hypothesis cannot be rejected and the redescription becomes less significant. This theoretical p-value computation relies on an assumption about the underlying data distribution, namely that all elements of the population can be sampled with equal probability from a pre-defined distribution. The sampling distribution is calculated based only on past expectation, while the future relies on the stronger assumption of fixed marginals. Real data sets can deviate from these assumptions, which makes such significance tests weaker. These questions are not central to our contribution, so we do not discuss them here in detail; instead we refer the reader to the relevant literature [35, 14].
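Under this independence assumption, the tail sum above can be evaluated directly with the binomial distribution; a sketch in R, with illustrative argument names:

# Binomial p-value of a redescription: probability that the intersection of two
# independent queries with the given supports is at least as large as observed.
redescription_pvalue <- function(supp_l, supp_r, supp_both, n_rows) {
  p_r <- (supp_l / n_rows) * (supp_r / n_rows)           # product of the marginals
  # P[X >= supp_both] for X ~ Binomial(n_rows, p_r)
  pbinom(supp_both - 1, size = n_rows, prob = p_r, lower.tail = FALSE)
}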

Chapter 5

Experiments with Algorithms for Redescription Mining

5.1 Finding Planted Redescriptions

To assess the power of any elaborated method or technique it is essential to study its behavior on synthetic data, where we have complete control over the data format and parameters. Thus, we create synthetic data sets that imitate the real-world setting in order to assess our algorithms' ability to find previously planted redescriptions, which gives an insight into their performance. To implement this, it is necessary to make sure that the planted redescriptions consist of queries on both sides of the data set with a perfect correspondence, i.e. their Jaccard coefficient is 1. For Algorithm 1 the size of each matrix is set to 300 × 5, and two queries involving 3 parameters each are planted in this pair in such a way as to form an exact correspondence. For Algorithm 2 the size of each matrix is set to 300 × 10, since it builds several decision trees of depth 1 and every new depth is restricted from picking up a splitting rule which has already been used; thus, planting a redescription involving 6 variables is vital for the algorithm to be able to reach the maximal allowed depth. Planting queries into such a large data array, especially when using a randomized procedure to turn the right-hand side into real values, results in a noisy data set. However, to study the behavior and ability of the algorithm to deal with noise, we can track the accuracy in the same manner as for Algorithm 1. In total we planted different-looking redescriptions with support from 30 to 50 rows for both algorithms. Random noise is then added with densities ranging between 0.01 and 0.1. The noise can be both constructive (not interfering with the actual query) and destructive (damaging the queries). To generate the real-valued side of a data set, we substitute the values in one matrix: each 0 is replaced by a value uniformly distributed on the interval [0, 0.25], and each 1 by a value on the interval [0.75, 1]. On data sets without noise, Algorithm 1 was able to find the planted redescriptions with the highest accuracy. With constructive noise applied, Algorithm 1 was able to find the planted redescriptions up to density 0.03; in the other cases it returned redescriptions which had better accuracy than the planted one in the 'noisy' matrices. This confirms the anticipated behavior of the algorithm.
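For concreteness, the following R snippet sketches one possible way to generate such a planted data set following the conventions described above; the concrete attributes, support and noise density are illustrative, not the exact matrices used in the experiments.

# Plant a matching pair of queries in two Boolean matrices, add random noise,
# and turn the right-hand side into real values as described above.
set.seed(1)
n <- 300
L <- matrix(rbinom(n * 5, 1, 0.3), n, 5)
R <- matrix(rbinom(n * 5, 1, 0.3), n, 5)
planted <- 1:40                               # rows supporting the redescription
L[planted, 1] <- 1; L[planted, 3] <- 0        # left query:  x1 AND NOT x3
R[planted, 3] <- 1; R[planted, 2] <- 0        # right query: y3 AND NOT y2
# (other rows may match by chance; good enough for a sketch)
noise <- matrix(rbinom(n * 5, 1, 0.03), n, 5) # density-0.03 flip mask, left side only
L <- abs(L - noise)
# real-valued right side: 0 -> [0, 0.25], 1 -> [0.75, 1]
R_real <- ifelse(R == 1, runif(length(R), 0.75, 1), runif(length(R), 0, 0.25))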


Figure 5.1 compares the Jaccard coefficients for Algorithms 1 (a) and 2 (b). The red line in each chart shows the Jaccard coefficient of the planted redescription in the matrices with noise (the x-axis gives the density of the applied noise, from 0.01 to 0.1), while the blue line represents the Jaccard coefficients of the redescriptions mined by the algorithms on the noisy data. It can be seen that the Jaccard coefficient of the planted redescription is lower than (or equal to) that of the mined redescription. This happens because the applied noise occasionally forms a better match, which is then naturally mined by the algorithms. Thus, the redescriptions found in the 'noisy' data possess greater accuracy than the planted one, and the algorithms are not at fault.

(a) Algorithm 1 (b) Algorithm 2

Figure 5.1: Jaccard’s coefficients of planted and mined redescriptions on ’noisy’ data

Note that, due to the character of the input data for Algorithm 2, it was able to mine the redescription with a Jaccard coefficient of 0.67. The reason is that the generated synthetic matrix with a query involving 6 variables naturally contains noise; hence more detailed tests would be needed to overcome this issue. With destructive noise we destroyed the planted redescriptions gradually and were able to find them up to density 0.09. For example, when we planted a redescription of the form (with a support of 30 rows and Jaccard 1):

(x1 ≥ 0.5 ∧ x3 < 0.5) ←→ (x3 ≥ 0.7602 ∧ x2 < 0.4984)

with destructive noise of density 0.09, Algorithm 1 mined (support 26 and Jaccard 0.838):

(x1 ≥ 0.5 ∧ x3 < 0.5) ←→ (x3 < 0.7602 ∧ x3 ≥ 0.2132 ∧ x3 < 0.2137) ∨(x3 ≥ 0.7602 ∧ x2 < 0.4984)

Note that CART is allowed to select the same splitting parameter several times within one decision tree, uncovering the context dependency of the effects of certain variables [3]. Thus, the discovered redescription sometimes involves additional branches composed of the same splitting rule. The first disjunct on the right-hand side of the mined redescription, (x3 < 0.7602 ∧ x3 ≥ 0.2132 ∧ x3 < 0.2137), is formed by such an additional branch of the tree induced by CART; here the variable x3 is picked several times, as the CART algorithm permits [3]. Yet, taking the noise level into account, the planted redescription was mined accurately. These kinds of 'additional branches' arise in experimental runs and are caused solely by peculiarities of CART. The tree building procedure of Algorithm 2 prevents CART from using the same splitting rule twice; thus it managed to mine the planted redescriptions precisely up to noise level 0.4 (depending on the particular run). For both algorithms, additional tests would be advantageous for a more profound assessment. In Section 5.5 we provide a comparison to the ReReMi algorithm to give a better insight into the performance of our contributions.

5.2 The Real-World Data Sets

In order to test and evaluate the devised methods under practical conditions, it is important to apply them to real-world data. For this we use two data sets: Bio, for the biological niche-finding problem, and DBLP, for mining redescriptions from computer science bibliography data. Both our algorithms were initially implemented in R. In this section we exemplify the mined results using the available tool for interactive redescription mining called Siren [19]; its plotting capabilities provide a visual impression. As input we use two matrices, composed from publicly available sources: the Bio data set uses data from the European mammal atlas [40] and climatic data from [26] (available at http://www.worldclim.org), and the DBLP data set is formed from [2] (http://www.informatik.uni-trier.de/~ley/db/), where the left matrix contains conferences and the number of papers published by each author in them, and the right matrix contains authors and the frequency of their cooperation with each other. Table 5.1 describes the data sets used in the experiments with real-world data.

Table 5.1: Real-world data sets used in experiments

Data set   Descriptions              Dimensions     Type
Bio        Locations × Mammals       2575 × 194     Boolean
           Locations × Climate       2575 × 48      Real values
DBLP       Authors × Conferences     2345 × 19      Integer
           Authors × Authors         2345 × 2345    Integer

5.3 Experiments With Algorithms on Bio-climatic Data Set

Algorithm 1. First, experiments are run on the biological data from Table 5.1, called Bio. In biology, for a species to survive, the terrain where it lives should satisfy certain bio-climatic constraints which form that species' bioclimatic envelope (or niche) [23]. Finding these constraints with redescription mining algorithms assists in determining bio-climatic envelopes. In the Bio data set the left side is represented by a matrix which contains locations in Europe and the mammals living there: if an animal is present in a particular area the entry is 1, and vice versa 0 for the places where this animal does not live. Thus, the left matrix contains only Boolean data. The right side (matrix R) consists of the same locations (keyed by IDs) and climatic data; in particular, we take into consideration the minimal, maximal and average temperature in each month and the average rainfall measurements (in millimeters per month). Algorithm 1 was run on the Bio data with different parameters (impurity measures and min_bucket) and returned a redescription set for each of them. Example redescriptions are shown in Tables 5.2 and 5.3, each of which is composed of several redescriptions mined by Algorithm 1 with different parameters (indicated in the tables' headers).

The resulting p-values make these redescriptions statistically significant at the highest level (99%); we did not encounter any redescription with a p-value higher than 0.0003 for any of the selected parameters on the Bio data set.

Table 5.2: Redescriptions mined by Algorithm 1 from the Bio data set (with Gini impurity measure and min_bucket = 20). LHS is the left-hand side part of the redescription; RHS is the right-hand side part; J is the Jaccard similarity; E1,1 is the support. t^min_X, t^max_X and t^avg_X stand for the minimum, maximum and average temperature of month X in degrees Celsius, and p^avg_X stands for the average precipitation of month X in millimeters.

LHS: (Polar.bear ≥ 0.5)
RHS: (t^max_Mar < −7.05)
J = 0.947, E1,1 = 36

LHS: (Arctic.Fox < 0.5 ∧ Polar.bear < 0.5)
RHS: (t^max_Aug < 14.65 ∧ t^min_Jul ≥ 6.35) ∨ (t^max_Aug ≥ 14.65)
J = 0.979, E1,1 = 2379

LHS: (Moose ≥ 0.5 ∧ Wood.mouse < 0.5)
RHS: (t^max_Feb ≥ −1.15 ∧ t^max_Apr < 7.55 ∧ t^max_Jul ≥ 14.05) ∨ (t^max_Feb < −1.15 ∧ p^avg_Aug ≥ 58.85)
J = 0.801, E1,1 = 449

Thus, we have a set of rules (redescriptions) which can be interpreted by analysts to find environmental envelopes for species. Looking at Table 5.2, we can see that the rule:

(Polar.bear ≥ 0.5) ←→ (t^max_Mar < −7.05)

or equivalently:

Polar.bear ←→ (t^max_Mar < −7.05)

implies that the Polar Bear lives in areas where the maximal temperature in March is lower than −7.05 degrees Celsius, with a support of 36 rows and a high Jaccard coefficient (above 0.9). This redescription outlines logical conditions for the Polar Bear to live in (a cold climate). The pair of decision trees which resulted in this redescription is depicted in Figure 5.2. The redescription and the trees look simple and interpretable in this particular case; yet, the user may often encounter more complex cases (exemplified further), where visualization becomes more essential.

Figure 5.2: A pair of decision trees returned by the Algorithm

The second redescription from Table 5.2:

(¬Arctic.Fox ∧ ¬Polar.bear) ←→ (t^max_Aug < 14.65 ∧ t^min_Jul ≥ 6.35) ∨ (t^max_Aug ≥ 14.65)

Can be formulated as follows:

In places where neither the Arctic Fox nor the Polar Bear lives, either the maximal temperature in August is below 14.65 degrees and the minimal temperature in July is greater than or equal to 6.35 degrees, or the maximal temperature in August is greater than or equal to 14.65 degrees Celsius.

This redescription rather describes living conditions which are not suitable for the Polar Bear and the Arctic Fox, since these mammals are negated in the left query. The corresponding pair of decision trees is depicted in Figure 5.3.

Figure 5.3: A pair of decision trees returned by the Algorithm

The final redescription from Table 5.2 is longer and has a more complex structure:

(Moose ∧ ¬Wood.mouse) ←→ (t^max_Feb ≥ −1.15 ∧ t^max_Apr < 7.55 ∧ t^max_Jul ≥ 14.05) ∨ (t^max_Feb < −1.15 ∧ p^avg_Aug ≥ 58.85)

can be expressed as follows:

The Moose lives in places without the Wood Mouse where the maximal temperature in February is above −1.15 degrees, in April below 7.55 degrees and in July above 14.05 degrees, or in places where the maximal temperature in February is below −1.15 degrees and the average rainfall in August is greater than 58.85 millimeters.

The two decision trees from which this redescription was formed are depicted in Figure 5.4.

Figure 5.4: A pair of decision trees returned by the Algorithm

All results can be interpreted in a similar manner. With the parameters indicated in Table 5.2, Algorithm 1 found 91 unique redescriptions, 55 of which have a Jaccard coefficient above 0.8. The redescriptions vary in support size: some of them cover only a small part of the data (below 200 rows), while others cover almost the whole data set (above 2000 rows out of 2575). Yet, all of them have high accuracy and are statistically significant. We limited the maximal depth of the trees to 3, because longer redescriptions have a more nested structure and are harder to interpret. However, for many instances Algorithm 1 terminated earlier, since either there were no changes compared to the previous depth or the resulting leaf nodes were pure, consequently resulting in shorter redescriptions. Table 5.3 presents one more run of Algorithm 1 on the Bio data set; here we use Information Gain as the impurity measure and set min_bucket = 100, meaning we force the underlying decision tree induction algorithm to perform splits in such a way that there are at least 100 entries in each node.

Table 5.3: Redescriptions mined by Algorithm 1 from the Bio data set (with IG impurity measure and min_bucket = 100). LHS is the left-hand side part of the redescription; RHS is the right-hand side part; J is the Jaccard similarity; E1,1 is the support. t^min_X, t^max_X and t^avg_X stand for the minimum, maximum and average temperature of month X in degrees Celsius, and p^avg_X stands for the average precipitation of month X in millimeters.

LHS: (Arctic.Fox < 0.5)
RHS: (t^avg_Jun < 10.25 ∧ t^max_Sep ≥ 10.75) ∨ (t^avg_Jun ≥ 10.25)
J = 0.965, E1,1 = 2347

LHS: (Wood.mouse < 0.5 ∧ Mountain.Hare ≥ 0.5)
RHS: (t^max_Oct < 10.85 ∧ t^max_Feb < −1.45 ∧ t^avg_Jul ≥ 10.65)
J = 0.701, E1,1 = 353

LHS: (European.Hamster ≥ 0.5)
RHS: (p^avg_Oct < 45.15 ∧ p^avg_Jun ≥ 61.85 ∧ p^avg_Apr < 48.25)
J = 0.483, E1,1 = 151

Let’s discuss redescription from this experimantal run as well. The first redescription here is:

(¬Arctic.Fox) ←→ (t^avg_Jun < 10.25 ∧ t^max_Sep ≥ 10.75) ∨ (t^avg_Jun ≥ 10.25)

can be expressed as follows:

The Arctic Fox does not live in places where the average temperature in June is below 10.25 degrees and the maximal temperature in September is greater than 10.75 degrees, nor in places where the average temperature in June is greater than 10.25 degrees Celsius.

This rule again describes living conditions which are not suitable for a mammal. Information about conditions that do not allow a species to survive, combined with other redescriptions that involve the same animal, can put all aspects of its preferences together and describe both suitable and inappropriate living conditions for that particular animal. Yet, in this particular case the redescription covers almost the whole Bio data set, which diminishes its value.

The decision trees built by Algorithm 1 which formed this redescription are shown in Figure 5.5:

Figure 5.5: A pair of decision trees returned by the Algorithm

Let’s consider the second redescriprion from Table 5.3:

(Wood.mouse < 0.5 ∧ Mountain.Hare ≥ 0.5) ←→ (t^max_Oct < 10.85 ∧ t^max_Feb < −1.45 ∧ t^avg_Jul ≥ 10.65)

or equivalently:

(¬Wood.mouse ∧ Mountain.Hare) ←→ (t^max_Oct < 10.85 ∧ t^max_Feb < −1.45 ∧ t^avg_Jul ≥ 10.65)

This redescription is formed from a pair of decision trees grown by Algorithm 1. They are depicted in Figure 5.6, and one can assess them even without the textual representation of the redescription.

Figure 5.6: A pair of decision trees returned by the Algorithm

This redescription can be expressed in natural language as follows:

The Mountain Hare lives in places without the Wood Mouse where the maximal temperature in October is below 10.85 degrees, the maximal temperature in February is lower than −1.45 degrees, and the average temperature in July is greater than or equal to 10.65 degrees Celsius.

Final redescription from Table 5.3:

(European.Hamster) ←→ (p^avg_Oct < 45.15 ∧ p^avg_Jun ≥ 61.85 ∧ p^avg_Apr < 48.25)

describes conditions for the European Hamster and can be formulated as:

The European Hamster dwells in territories in Europe where the rainfall in October is lower than 45.15 millimeters, in June greater than 61.85 millimeters, and in April lower than 48.25 millimeters.

The pair of decision trees for this particular case can be found in Figure 5.7.

Figure 5.7: A pair of decision trees returned by the Algorithm

This rule suggests that for the European Hamster precipitation is more essential than temperature conditions, because at each new depth of the decision tree CART selected a rainfall attribute as the splitting rule that maximizes node purity. Yet, only an expert can confirm the importance of rainfall measurements for the hamster. Moreover, in some instances the trees in a pair have different depths. This is not surprising, because Algorithm 1 builds each of them by increasing the depth at each iteration and comparing the result with the previous one; termination happens when there is no improvement compared to the split at the previous depth or the resulting nodes are pure. Such differently looking trees are assessed in the same manner as described in detail in Section 4.5. Using the parameters indicated in Table 5.3, Algorithm 1 found 44 unique redescriptions, 25 of them with Jaccard coefficients above 0.8. Similarly to the experiment with the parameters from Table 5.2, the mined redescriptions have varying supports (from a few rows to almost the whole data set) and high accuracy. All redescriptions involve different parameters, yet they are easy to interpret, since they do not include very complex or nested structures. The full results can be found in Appendix A.

Plotting on a map. The biological data set provides one more option: the spatial coordinates associated with each location in Europe assist in the visualization of the derived results. Plotting on a map makes it easier to evaluate and interpret the outcomes, so whenever the user encounters difficulties in reading mined redescriptions, plots of the resulting trees resolve the issue. Let us exemplify the redescriptions discussed above from Tables 5.2 and 5.3. Figure 5.8 represents the three aforementioned redescriptions from Table 5.2 on a map: (a) shows the first redescription, (b) the second, and (c) the third. For all plots, red indicates places where only the left-hand side query holds, blue where only the right-hand side query holds, and purple areas depict places where both queries hold.


Figure 5.8: Example redescriptions from Table 5.2.(a) first; (b) second; (c) third.

Moreover, plots on a map of the redescriptions from Table 5.3 can be found in Figure 5.9:


Figure 5.9: Example redescriptions from Table 5.3.(a) first; (b) second; (c) third. Chapter 5 Experiments with Algorithms for Redescription Mining 51

Algorithm 2. We tested Algorithm 2 on the same data set (Bio) using the same parameters. Some example redescriptions for one of the runs of Algorithm 2, using IG as the impurity measure and min_bucket = 50, are presented in Table 5.4. The full results can be found in Appendix A.

Table 5.4: Redescriptions mined by Algorithm 2 from the Bio data set (with IG impurity measure and min_bucket = 50). LHS is the left-hand side part of the redescription; RHS is the right-hand side part; J is the Jaccard similarity; E1,1 is the support. t^min_X, t^max_X and t^avg_X stand for the minimum, maximum and average temperature of month X in degrees Celsius, and p^avg_X stands for the average precipitation of month X in millimeters.

LHS: (Mediterranean.Water.Shrew ≥ 0.5 ∧ Alpine.Shrew ≥ 0.5) ∨ (Mediterranean.Water.Shrew < 0.5 ∧ Moose < 0.5 ∧ Arctic.Fox < 0.5)
RHS: (p^avg_May ≥ 58.65 ∧ p^avg_Jun ≥ 86.85) ∨ (p^avg_May < 58.65 ∧ t^max_Nov ≥ 6.85 ∧ t^max_Sep ≥ 10.75)
J = 0.912, E1,1 = 1406

LHS: (Kuhl.s.Pipistrelle ≥ 0.5 ∧ Alpine.marmot < 0.5) ∨ (Kuhl.s.Pipistrelle < 0.5 ∧ Common.Shrew < 0.5 ∧ House.mouse ≥ 0.5)
RHS: (t^max_Mar ≥ 11.05 ∧ t^min_Feb ≥ −3.95) ∨ (t^max_Mar < 11.05 ∧ t^avg_Mar ≥ 6.375 ∧ t^max_Jan ≥ 3.55)
J = 0.802, E1,1 = 759

LHS: (Brown.Bear ≥ 0.5 ∧ Mountain.Hare ≥ 0.5) ∨ (Brown.Bear < 0.5 ∧ Wood.mouse < 0.5 ∧ Moose ≥ 0.5)
RHS: (t^min_Jan < −8.25 ∧ t^max_Sep < 17.35) ∨ (t^min_Jan ≥ −8.25 ∧ t^max_Feb < −1.15 ∧ t^max_Mar < 5.55)
J = 0.762, E1,1 = 492

These redescriptions constitute a good example of why visualization options become crucial. Let us consider the first redescription:

(Mediterranean.Water.Shrew ∧ Alpine.Shrew) ∨ (¬Mediterranean.Water.Shrew ∧ ¬Moose ∧ ¬Arctic.Fox) ←→ (p^avg_May ≥ 58.65 ∧ p^avg_Jun ≥ 86.85) ∨ (p^avg_May < 58.65 ∧ t^max_Nov ≥ 6.85 ∧ t^max_Sep ≥ 10.75)

This rule implies that either both the Mediterranean Water Shrew and the Alpine Shrew, or neither the Mediterranean Water Shrew, nor the Moose, nor the Arctic Fox, live in areas where either it rains more than 58.65 millimeters in May and more than 86.85 millimeters in June, or it rains less than 58.65 millimeters in May and the maximal temperature in November is above 6.85 and in September above 10.75 degrees Celsius. The remaining redescriptions can be expressed in a similar manner. The pair of resulting decision trees for this redescription from Table 5.4 is shown in Figure 5.10.

Figure 5.10: A pair of decision trees returned by the Algorithm

The second redescription from Table 5.4:

(Kuhl.s.Pipistrelle ≥ 0.5 ∧ Alpine.marmot < 0.5) ∨ (Kuhl.s.Pipistrelle < 0.5 ∧ Common.Shrew < 0.5 ∧ House.mouse ≥ 0.5) ←→ (t^max_Mar ≥ 11.05 ∧ t^min_Feb ≥ −3.95) ∨ (t^max_Mar < 11.05 ∧ t^avg_Mar ≥ 6.375 ∧ t^max_Jan ≥ 3.55)

was formed from a pair of trees depicted on the Figure 5.11.

Figure 5.11: A pair of decision trees returned by the Algorithm

The final redescription from Table 5.4:

(Brown.Bear ∧ Mountain.Hare) ∨ (¬Brown.Bear ∧ ¬Wood.mouse ∧ Moose) ←→ (t^min_Jan < −8.25 ∧ t^max_Sep < 17.35) ∨ (t^min_Jan ≥ −8.25 ∧ t^max_Feb < −1.15 ∧ t^max_Mar < 5.55)

was formed from the pair of decision trees depicted in Figure 5.12:

Figure 5.12: A pair of decision trees returned by the Algorithm

With the parameters indicated in Table 5.4, Algorithm 2 returned 17 unique redescriptions, all of them statistically significant. The support of all of them lies in the range [300, 1600] rows, which is acceptable for the Bio data set. Half of the redescriptions have the highest accuracy (above 0.8), and even the others have Jaccard coefficients above 0.5. Plotting on a map. Plotting is available for the second algorithm as well. Figure 5.13 illustrates all redescriptions from Table 5.4 on a map of Europe; representation on a map makes it easier to evaluate the quality of a redescription. As before, red corresponds to the left query, blue to the right query, and their overlap is shown in purple on the map:


Figure 5.13: Support of redescription from Bio data set.(a) First redescription; (b) Second redescription; (c) Third redescription from Table 5.4

The maps also help to see the areas where the animals actually live. The overlap indicates the places in Europe where the whole redescription is true, i.e. the animals from the left query live there (or do not live there, if they are negated in the left query) and the climatic conditions from the right query hold. The size of the overlap is the support of the redescription (i.e. E1,1), which becomes a defining feature when assessing the quality of results on real-world data sets.

5.3.1 Discussion

While running the experiments with the Bio data set, we used two impurity measures: the Gini index and Information Gain (their default implementations from R's package rpart [52]); yet any other measure can be plugged into both algorithms. In addition, increasing the min_bucket parameter (the minimal number of entities in a leaf node) results in a smaller number of final redescriptions, which however have slightly higher Jaccard similarity and a shorter structure. Note that the decision tree induction routine will not be able to split the data if the min_bucket parameter is set erroneously, i.e. if min_bucket is set to be greater than the total number of places in Europe where a particular animal lives divided by two (min_bucket > (Σ L_i)/2), since we split each parent node into two children. In our experiments we used a minimal min_bucket size covering at least 1% of the rows of the data set. In such a setting, the algorithms run on the Bio data set returned quite big trees for the animals which live in many areas of Europe (more than 1000 areas out of the 2575 available). Nevertheless, animals such as the Polar Bear or the Moose, which have quite specific living conditions (cold climate), yielded nice, easy-to-analyze trees. When the minimal number of entries per node is set to 100 or 50, the trees naturally tend to be smaller. This is more suitable for animals which live in more than 500 locations in Europe (e.g. shrews, foxes, mice, etc.): since they live all over Europe, this limitation helps to find meaningful redescriptions for them, with more specific climatic features. Animals such as the Moose, the Polar Bear or the Seal are not very widespread and live in very specific climatic conditions (cold areas, and for the Seal the coastline), so setting min_bucket too high is not suitable for them. Setting the minimal node size to half the number of places where a particular animal lives can be considered as a 50% threshold: we then expect at least half of the population of that animal to share the same conditions, which makes sense, since in such a case we know that the given redescription describes a niche shared by at least half of the considered animal's population. Comparing both algorithms based on their results on the Bio data set, the following can be seen:

• Both algorithms returned statistically significant redescriptions (with p-value < 0.01 in all runs).

• Comparing the accuracy of the results (i.e. the Jaccard coefficients), it can be concluded that in every run Algorithm 1 returns up to 67% of redescriptions with accuracy above 0.8, while Algorithm 2 returns only up to 50%. Note that only Algorithm 1 managed to mine redescriptions with perfect accuracy (Jaccard exactly 1) in some runs (for instance, with Gini and IG, min_bucket = 100), but these have low support (below 30 rows), making them less informative. Algorithm 2 did not return any redescription with a Jaccard coefficient of exactly 1.

• The support size of a redescription is an important parameter which determines the extent of its interestingness. If we require at least 100 rows for a redescription to be regarded as interesting and compare only the redescriptions with E1,1 > 100, Algorithm 1 mines up to 75% of its redescriptions with accuracy above 0.8 and support of at least 100 rows, while for Algorithm 2 this fraction does not exceed 48% in any run.

• Looking at the top-20 redescriptions (by Jaccard) with supp > 100 and p-value < 0.01, the redescriptions from Algorithm 1 in most cases cover more than 1700 rows, while the support sizes for Algorithm 2 are more diverse (from 150 up to 2000). Hence Algorithm 1 returns rules which hold for the majority (or almost all) of the rows of the data set, i.e. redescriptions describing conditions which are true all over Europe. Its queries are also shorter, involving fewer attributes, because the decision tree induction very often terminates before reaching the maximal allowed depth (we used d = 3). In Algorithm 2 more attributes are involved and the structure of the queries is more nested (the trees mostly terminate only when reaching the maximal depth). These redescriptions reveal more specific details concerning fauna and climate peculiarities in Europe.

• One more aspect to compare is the overlap of the redescriptions (i.e. queries involving the same attributes, making some redescriptions similar to each other). Algorithm 1 tends to produce more overlapping redescriptions (approx. 65%), because CART selects the same splitting rules from the whole data set over and over again, regardless of the initialization point used. For Algorithm 2, overlapping redescriptions occur less frequently (approx. 50%), because every depth and branch of the tree is built independently, using the corresponding part of the data set. These percentages vary slightly depending on the parameters used within each run, but the global tendency holds across all experimental runs. If we sort the redescriptions by Jaccard (from highest to lowest) and discard those which involve identical animals, the accuracy of the remaining redescriptions for both algorithms is similarly high (around 0.8 on average) and the support of the redescriptions of Algorithm 1 is about 10% greater (depending on the parameters used) than for Algorithm 2.

• Using the Gini impurity measure on the Bio data set (with otherwise equal conditions) in Algorithm 1 resulted in slightly deeper trees and consequently in longer redescriptions. For example, Algorithm 1 with min bucket = 20 returned 91 redescriptions with Gini (average query length 5.51 variables) and 71 unique redescriptions with IG (average query length 4.87 variables). For Algorithm 2, in contrast, Information Gain returned slightly longer redescriptions than the Gini index: with min bucket = 20 it mined 67 redescriptions with Gini (average query length 7.06 variables) and 37 unique redescriptions with IG (average query length 7.75 variables). For both Algorithms, Information Gain tends to produce a higher percentage of repeating redescriptions in each experimental run and fewer unique redescriptions in total compared to the Gini index.
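Regarding the p-values reported above: the exact significance test is not restated here, but the general idea of assessing whether an observed Jaccard could have arisen by chance can be illustrated with a simple randomization test in the spirit of [14]. The sketch below is only a generic illustration (the helper name perm_pvalue is hypothetical), not necessarily the test used in this thesis.

```r
# lhs, rhs: logical support vectors of the two queries over the same entities.
# Shuffle one side and count how often a Jaccard as high as the observed one occurs.
perm_pvalue <- function(lhs, rhs, n_perm = 999) {
  jac  <- function(a, b) sum(a & b) / sum(a | b)
  obs  <- jac(lhs, rhs)
  null <- replicate(n_perm, jac(lhs, sample(rhs)))
  (1 + sum(null >= obs)) / (n_perm + 1)
}
```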

All in all, both algorithms return reasonable redescriptions with high accuracy for the Bio data set. When tested with equal parameters, Algorithm 1 found a greater number of redescriptions, which are shorter than the ones mined by Algorithm 2 and more similar to each other; for example, Moose, House Mouse and Stoat participate in many of them. This is caused by the fact that the CART algorithm tends to pick splitting rules which maximize the purity of the resulting nodes, and these splitting rules very often coincide for different initialization points, since they provide the greatest contribution to node purity. Redescriptions mined by Algorithm 2 involve more varied variables in both (left and right) queries. This is because each layer of the decision tree is built independently, using the corresponding part of the target vector. Moreover, each branch of a decision tree in Algorithm 2 also grows independently until the stopping criterion is met; essentially, we induce several decision trees of depth 1 using the corresponding part of either the left or the right matrix. This explains the inclination of Algorithm 2 to produce deeper trees: whenever a node contains rows of both classes ('1' and '0'), CART is able to split them into two leaf nodes.

Both elaborated algorithms found interesting rules when applied to the Bio data set. All of them were statistically significant and had varying support (from a few rows to almost the whole data set). Thus, they can be used for the problem of finding bio-climatic envelopes for species. Nevertheless, some resulting redescriptions with high support, despite being accurate, might be of little interest to biologists: they combine complex climatic queries with several, possibly unrelated, species. Using the p-value to check the significance of the results mitigates this issue but does not resolve it completely.
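As a concrete illustration of the parameter choices discussed in this subsection, the following R sketch shows how a single tree-induction step could be configured with rpart [52]. It is only a sketch under stated assumptions: the data frame bio and the target column Polar.bear are placeholder names, the thesis's actual implementation is not reproduced here, and the 1% leaf-size rule is interpreted here with respect to the target's positive rows.

```r
library(rpart)

# bio: data frame with one Boolean target column (placeholder name Polar.bear)
# plus the climate attributes; all names are illustrative only.
target <- "Polar.bear"
pos    <- sum(bio[[target]] == 1)           # places where the animal lives (sum of Li)
minbkt <- max(1, floor(pos / 100))          # ~1% rule for the minimal leaf size
stopifnot(minbkt <= floor(pos / 2))         # otherwise no split is possible

bio[[target]] <- factor(bio[[target]])      # classification target for rpart
tree <- rpart(
  as.formula(paste(target, "~ .")),
  data    = bio,
  method  = "class",
  parms   = list(split = "gini"),           # or split = "information" for IG
  control = rpart.control(minbucket = minbkt, maxdepth = 3, cp = 0)
)
printcp(tree)
```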

5.4 Experiments With Algorithms on Conference Data Set

Many real-world data sets cannot be represented as a pair of matrices of which one is binary. A binarization routine is therefore essential to enable the application of our algorithms in such cases. For the left-hand side of the DBLP data set we tried three different clustering techniques plugged into the discretization procedure described in Subsection 4.6.1; this set can be extended with other techniques if needed. Generally, when analyzing bibliography data, mined redescriptions shed light on the communities of researchers that contribute most to a field.

Density-based spatial clustering (DBSCAN). This clustering technique does not require the number of clusters to be specified; it automatically detects the necessary number of clusters based on the notion of density reachability [15]. A cluster, which is a subset of the points of the database, should satisfy two properties: all points within it are mutually density-connected, and if a point is density-reachable from any point of the cluster, it belongs to the cluster too. The algorithm itself requires two parameters from the user: the minimal number of points in a cluster and a distance [4]. We applied DBSCAN to our DBLP data set, where the left matrix initially had 19 conferences, and used clustering to split each of them into intervals, transforming the data into a binary matrix. These intervals represent the number of papers each author published at a particular conference, and the columns of the discretized matrix are used as initial target vectors for both Algorithms. In this particular case we should take the characteristics of the data set into account: most authors submitted fewer than 7-10 papers to a conference, and only rarely does an author submit more than 15 papers. Thus, the first clusters are quite dense and the last ones are mostly sparse. We picked the distance and the minimal number of points for DBSCAN so as to obtain a segregation of each conference into 5-10 clusters. New data sets may require a different set of initial parameters to perform well; as often in data mining, the results are highly dependent on parameter selection.

DBSCAN with Algorithm 1. A minimal sketch of this DBSCAN-based discretization is given below; Table 5.5 then illustrates several redescriptions mined by Algorithm 1 after this discretization of the left-hand-side matrix.
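The sketch assumes an author-by-conference count matrix called dblp and uses the dbscan R package; the eps and minPts values are placeholder assumptions that would have to be tuned so that each conference splits into roughly 5-10 clusters, as described above.

```r
library(dbscan)

# dblp: numeric matrix, rows = authors, columns = 19 conferences (paper counts);
# the name and the eps/minPts values are illustrative assumptions.
binarize_dbscan <- function(counts, eps = 1, minPts = 20) {
  cl  <- dbscan(matrix(counts, ncol = 1), eps = eps, minPts = minPts)$cluster
  ids <- sort(unique(cl[cl != 0]))                     # cluster 0 marks noise points
  m   <- vapply(ids, function(k) as.integer(cl == k), integer(length(cl)))
  matrix(m, nrow = length(cl))                         # one indicator column per interval
}

# apply the routine to every conference column and bind the indicator columns together
left_binary <- do.call(cbind, lapply(colnames(dblp), function(conf) {
  b <- binarize_dbscan(dblp[, conf])
  colnames(b) <- paste0(conf, "_c", seq_len(ncol(b)))
  b
}))
```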

Table 5.5: Redescriptions mined by Algorithm 1 from the DBLP data set (with DBSCAN binarization routine; Gini impurity measure; min bucket = 5). LHS is the left-hand side of the redescription; RHS is the right-hand side; J - Jaccard similarity; E1,1 - support.

LHS | RHS | J | E1,1
ECML ≥ 2.5 ∧ UAI ≥ 1.5 | Peter Grunwald ≥ 0.5 | 0.571 | 4
ICDE ≥ 12.5 ∧ EDBT < 3.5 | Anthony K. H. Tung ≥ 1.5 ∧ Jeffrey Xu Yu ≥ 0.5 | 0.5 | 5
STOC ≥ 8.5 ∧ SODA < 5.5 | Avi Wigderson ≥ 0.5 ∧ Silvio Micali ≥ 0.5 | 0.133 | 10

With these parameters Algorithm 1 mined 15 unique redescriptions with high accuracy. The majority of them cover either fewer than 10 rows, making them supported by an insufficient number of rows, or almost the whole data set, making them quite obvious or expected. The complete list of results can be found in Appendix B. The first redescription from Table 5.5

ECML ≥ 2.5 ∧ UAI ≥ 1.5 ←→ Peter Grunwald ≥ 0.5

implies that if an author has published at least 3 papers at ECML and at least 2 papers at UAI, he or she has likely co-authored with Peter Grunwald at least once. This redescription holds for only 4 rows of the DBLP data set, which makes it less informative; formally, there are no strict bounds on the minimal or maximal support that make a redescription interesting. The following redescription from Table 5.5:

ICDE ≥ 12.5 ∧ EDBT < 3.5 ←→ Anthony K. H. Tung ≥ 1.5 ∧ Jeffrey Xu Yu ≥ 0.5

claims that if you published at least 13 papers at ICDE and from 0 to 3 papers at EDBT, you have probably co-authored twice (or more) with Anthony K. H. Tung and at least once with Jeffrey Xu Yu. This redescription has support 5, which can also be considered low. However, not many people submit more than a dozen papers to a single conference, so in this case the support size can be considered acceptable to regard the redescription as informative. Let us consider the last redescription from this table:

STOC ≥ 8.5 ∧ SODA < 5.5 ←→ Avi Wigderson ≥ 0.5 ∧ Silvio Micali ≥ 0.5

It can be formulated in natural language as follows:

If you have 9 or more papers accepted at STOC and 5 or fewer papers at SODA, you have co-authored at least once with Avi Wigderson and at least once with Silvio Micali.
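For concreteness, the following R sketch shows how such a query pair can be checked directly against the data: dblp is assumed to be the author-by-conference paper-count matrix and coauth an author-by-co-author count matrix (both placeholder names), and the two sides of this last redescription are evaluated to obtain its support E1,1 and its Jaccard coefficient.

```r
# Placeholder matrices: dblp (authors x conferences), coauth (authors x co-authors).
lhs <- dblp[, "STOC"] >= 8.5 & dblp[, "SODA"] < 5.5
rhs <- coauth[, "Avi Wigderson"] >= 0.5 & coauth[, "Silvio Micali"] >= 0.5

E11     <- sum(lhs & rhs)            # support of the redescription
jaccard <- E11 / sum(lhs | rhs)      # J = E11 / (E11 + E10 + E01)
c(support = E11, J = jaccard)
```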

All other rules can be interpreted in an analogous way. The decision tree pair for this redescription is depicted in Figure 5.14; we provide a decision tree illustration only for this redescription from the DBLP data set, as the others can be plotted analogously.

DBSCAN with Algorithm 2. Similarly, we ran experiments on the same DBLP data set using Algorithm 2. Using the Information Gain impurity measure and min bucket = ΣLi/100, Algorithm 2 found 110 unique redescriptions with diverse support sizes (from a few rows to almost the whole data set), 15 of which have p-value > 0.01, making them statistically insignificant. 31 of the redescriptions have a Jaccard coefficient higher than 0.8. Unlike with the first algorithm, here we obtained lower Jaccard similarity but greater support for each redescription. Some of the results are listed in Table 5.6, while the full report, including p-values for each redescription, can be found in Appendix B.

Figure 5.14: A pair of decision trees returned by Algorithm 1.

Table 5.6: Redescriptions mined by Algorithm 2 from the DBLP data set (with DBSCAN binarization routine; IG impurity measure; min bucket = ΣLi/100). LHS is the left-hand side of the redescription; RHS is the right-hand side; J - Jaccard similarity; E1,1 - support.

LHS | RHS | J | E1,1
STOC < 0.5 ∧ FOCS ≥ 0.5 ∧ SIGMODConference < 0.5 ∨ STOC ≥ 0.5 ∧ SODA ≥ 0.5 ∧ ICDE < 0.5 | Tomas Feder < 0.5 ∧ Avi Wigderson ≥ 0.5 ∧ Catriel Beeri < 0.5 ∨ Tomas Feder ≥ 0.5 ∧ Amos Fiat ≥ 0.5 ∧ Serge A. Plotkin < 1.5 | 0.919 | 711
VLDB ≥ 1.5 ∨ VLDB < 1.5 ∧ SIGMODConference ≥ 0.5 ∧ ICDM < 0.5 | Rakesh Agrawal ≥ 0.5 ∨ Rakesh Agrawal < 0.5 ∧ Hamid Pirahesh ≥ 0.5 ∧ Jiawei Han < 0.5 | 0.809 | 689
COLT ≥ 3.5 | Manfred K. Warmuth ≥ 0.5 | 0.226 | 19

The first redescription from Table 5.6

STOC < 0.5 ∧ FOCS ≥ 0.5 ∧ SIGMODConference < 0.5 ∨ STOC ≥ 0.5 ∧ SODA ≥ 0.5 ∧ ICDE < 0.5 ←→ Tomas Feder < 0.5 ∧ Avi Wigderson ≥ 0.5 ∧ Catriel Beeri < 0.5 ∨ Tomas Feder ≥ 0.5 ∧ Amos Fiat ≥ 0.5 ∧ Serge A. Plotkin < 1.5

states that if you have not published any paper at either STOC or the SIGMOD Conference but have at least one publication at FOCS, or you have at least one paper at both STOC and SODA but no papers at ICDE, then you have likely co-authored with neither Tomas Feder nor Catriel Beeri but have collaborated with Avi Wigderson at least once; or you have collaborated with Tomas Feder and Amos Fiat at least once each and have worked with Serge A. Plotkin at most once. The support of this redescription is in an acceptable range to claim it is informative and accurate enough, and its p-value is zero, which makes the result statistically significant as well. The second redescription from Table 5.6

VLDB ≥ 1.5 ∨ VLDB < 1.5 ∧ SIGMODConference ≥ 0.5 ∧ ICDM < 0.5 ←→ Rakesh Agrawal ≥ 0.5 ∨ Rakesh Agrawal < 0.5 ∧ Hamid Pirahesh ≥ 0.5 ∧ Jiawei Han < 0.5

claims that if you have published at least 2 papers at VLDB, or from 0 to 1 papers at VLDB together with at least one paper at the SIGMOD Conference and none at ICDM, then you have probably co-authored with Rakesh Agrawal at least once, or you have co-authored with neither him nor Jiawei Han but have at least one joint publication with Hamid Pirahesh. And the final redescription from Table 5.6

COLT ≥ 3.5 ←→ Manfred K. Warmuth ≥ 0.5

states that if you have published 4 or more papers at COLT, then you have co-authored with Manfred K. Warmuth one or more times. Note that this redescription has quite low accuracy (0.226), which makes it less interesting regardless of the acceptable level of support.

Thus, Algorithm 2 mined more interesting and diverse redescriptions which are still statistically significant. They have longer structures and involve a greater number of variables compared to Algorithm 1. Whenever the resulting redescriptions become longer than desired, the user may limit the maximal depth of the trees (we used max depth = 3 so far). Using DBSCAN is advantageous due to its ability to detect the necessary number of clusters automatically; the user does not have to specify it, which makes the whole process easier.

k-means. As one more option to test our algorithms and compare results, we adopted the k-means clustering technique for the discretization of the data set. Unlike DBSCAN, k-means clustering [37] requires the user to indicate the desired number of clusters. This poses an issue on its own, which is vigorously discussed in the scientific literature [30, 53, 22]. The correct choice of the number of clusters is often ambiguous, with interpretations depending on the shape and scale of the distribution of points in the data set and the clustering resolution desired by the user. Increasing the number of clusters without penalty will always reduce the resulting clustering error, down to the extreme case of zero error when each data point forms its own cluster (i.e. when there are as many clusters as data points). Intuitively, the optimal number of clusters is a balance between these extremes. When working with the DBLP data set, we partitioned each conference into 5 clusters. This choice is motivated by prior knowledge about the data: partitioning into fewer clusters results in highly dense clusters which represent from 0 to 7 submitted papers per conference, while partitioning into more than 10 clusters turns some data points into separate clusters and adds unwanted computational burden. A minimal sketch of this k-means discretization is given below; some redescriptions returned by Algorithm 1 after it are listed in Table 5.7.
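The sketch mirrors the DBSCAN one above, with base R's kmeans and 5 clusters per conference as in the experiments; dblp is again a placeholder name for the author-by-conference count matrix.

```r
binarize_kmeans <- function(counts, k = 5) {
  # note: kmeans requires at least k distinct values in counts
  cl <- kmeans(counts, centers = k)$cluster
  sapply(seq_len(k), function(i) as.integer(cl == i))   # one indicator column per cluster
}

left_binary <- do.call(cbind, lapply(colnames(dblp), function(conf) {
  b <- binarize_kmeans(dblp[, conf], k = 5)
  colnames(b) <- paste0(conf, "_c", seq_len(ncol(b)))
  b
}))
```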

Table 5.7: Redescriptions mined by Algorithm 1 from the DBLP data set (with k-means (5 clusters) binarization routine; Gini impurity measure; min bucket = 5). LHS is the left-hand side of the redescription; RHS is the right-hand side; J - Jaccard similarity; E1,1 - support.

LHS | RHS | J | E1,1
UAI ≥ 2.5 ∧ KDD ≥ 2.5 | Tomi Silander ≥ 0.5 | 0.500 | 4
VLDB ≥ 18.5 ∧ SIGMODConference < 26.5 | Shaul Dar ≥ 0.5 | 0.357 | 5
STOC ≥ 8.5 ∧ SODA < 5.5 | Avi Wigderson ≥ 0.5 ∧ Silvio Micali ≥ 0.5 | 0.113 | 10

With these parameters Algorithm 1 mined 8 unique redescriptions (2 of them have p-value > 0.1). The other, statistically significant, redescriptions have support around 10 rows, leading to the conclusion that with k-means clustering used within the discretization routine, Algorithm 1 returns more intuitively expected rules (the full set of outcomes can be found in Appendix B). Algorithm 2 was also applied to this data set; some resulting redescriptions are listed in Table 5.8, while the full set is presented in Appendix B.

Table 5.8: Redescriptions mined by Algorithm 2 from the DBLP data set (with k-means (5 clusters) binarization routine; Gini impurity measure; min bucket = ΣLi/100). LHS is the left-hand side of the redescription; RHS is the right-hand side; J - Jaccard similarity; E1,1 - support.

LHS | RHS | J | E1,1
ICDE ≥ 12.5 ∧ EDBT ≥ 2 ∨ ICDE < 12.5 ∧ SIGMODConference ≥ 19.5 ∧ WWW ≥ 0.5 | Flip Korn ≥ 0.5 ∧ Krithi Ramamritham < 4.5 ∨ Flip Korn < 0.5 ∧ Sudarshan S. Chawathe ≥ 3 ∧ Mayank Bawa ≥ 0.5 | 0.833 | 15
SODA ≥ 17.5 ∨ SODA < 17.5 ∧ FOCS ≥ 10.5 ∧ STOC ≥ 9.5 | Richard Cole ≥ 0.5 ∨ Richard Cole < 0.5 ∧ Laszlo Lovasz ≥ 0.5 ∧ Juris Hartmanis < 0.5 | 0.534 | 43
STOC ≥ 8.5 ∨ STOC < 8.5 ∧ FOCS ≥ 8.5 ∧ SODA < 1.5 | Avi Wigderson ≥ 0.5 ∨ Avi Wigderson < 0.5 ∧ Salil P. Vadhan ≥ 4.5 ∧ Shafi Goldwasser ≥ 0.5 | 0.337 | 33

With these parameters Algorithm 2 returned 70 unique redescriptions, 30 of which have a Jaccard coefficient above 0.8, with supports ranging from several rows up to almost the whole data set. Comparing the performance of both algorithms on this data set, we see that, as before, Algorithm 1 results in simpler and more intuitive rules with lower support (around 10 rows) and high Jaccard similarity between the queries, while Algorithm 2 returns longer, more detailed redescriptions with higher support but lower Jaccard similarity.

Hierarchical clustering. For diversity, we tested one more clustering technique to discretize the DBLP data set. Hierarchical clustering [29, 25], similarly to k-means, does not detect the number of clusters automatically. We split each conference into 5 clusters and ran the presented redescription mining algorithms to assess the performance (a minimal sketch of this discretization step follows below). Some resulting redescriptions for Algorithm 1 are presented in Table 5.9; full results can be found in Appendix B.
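The hierarchical-clustering variant can be sketched analogously with base R's hclust and cutree (again with placeholder names; computing the pairwise-distance matrix makes this the most memory-hungry of the three options):

```r
binarize_hclust <- function(counts, k = 5) {
  # agglomerative clustering of the 1-D counts, cut into k groups;
  # dist() builds an O(n^2) distance matrix, so this is only a small-scale sketch
  cl <- cutree(hclust(dist(counts)), k = k)
  sapply(seq_len(k), function(i) as.integer(cl == i))
}

left_binary <- do.call(cbind, lapply(colnames(dblp), function(conf) {
  b <- binarize_hclust(dblp[, conf], k = 5)
  colnames(b) <- paste0(conf, "_c", seq_len(ncol(b)))
  b
}))
```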

Table 5.9: Redescriptions mined by Algorithm 1 from the DBLP data set (with hierarchical (5 clusters) binarization routine; IG impurity measure). LHS is the left-hand side of the redescription; RHS is the right-hand side; J - Jaccard similarity; E1,1 - support.

LHS | RHS | J | E1,1
ICDT ≥ 4.5 ∧ VLDB ≥ 0.5 ∧ ICDE ≥ 2.5 | Gosta Grahne < 2.5 ∧ Kotagiri Ramamohanarao ≥ 13 ∨ Gosta Grahne ≥ 2.5 ∧ Jignesh M. Patel < 0.5 | 1 | 3
WWW ≥ 4.5 ∧ ICDM ≥ 3.5 | Benyu Zhang ≥ 12 | 1 | 2
ECML ≥ 2.5 ∧ UAI ≥ 1.5 | Peter Grunwald < 1.5 ∧ Stephen D. Bay ≥ 3.5 ∨ Peter Grunwald ≥ 1.5 | 1 | 5

With the parameters indicated in Table 5.9, Algorithm 1 returned 15 unique redescriptions with high Jaccard coefficients and low supports (below 10 rows), while Algorithm 2 returned 76 unique redescriptions. These share analogous features with the ones returned by Algorithm 2 before (i.e. using DBSCAN and k-means). Table 5.10 contains some examples; the full report can be found in Appendix B.

Table 5.10: Redescriptions mined by Algorithm 2 from the DBLP data set (with hierarchical (5 clusters) binarization routine; IG impurity measure). LHS is the left-hand side of the redescription; RHS is the right-hand side; J - Jaccard similarity; E1,1 - support.

LHS | RHS | J | E1,1
SDM ≥ 1.5 ∨ SDM < 1.5 ∧ ICDM ≥ 0.5 ∧ KDD ≥ 0.5 | Philip S. Yu ≥ 4.5 ∨ Philip S. Yu < 4.5 ∧ Vipin Kumar ≥ 0.5 ∧ Sunil Prabhakar < 1.5 | 0.621 | 125
PODS ≥ 2.5 ∨ PODS < 2.5 ∧ ICDT ≥ 0.5 ∧ STACS ≥ 0.5 | Catriel Beeri ≥ 0.5 ∨ Catriel Beeri < 0.5 ∧ Leonid Libkin ≥ 0.5 ∧ Thomas Schwentick ≥ 0.5 | 0.342 | 13
SODA ≥ 5.5 ∨ SODA < 5.5 ∧ FOCS ≥ 0.5 ∧ STOC ≥ 8.5 | Moses Charikar ≥ 0.5 ∨ Moses Charikar < 0.5 ∧ Avi Wigderson ≥ 0.5 ∧ Moni Naor ≥ 0.5 | 0.245 | 39

5.4.1 Discussion

When running the two algorithms on the DBLP data set, regardless of the binarization procedure, Algorithm 2 tends to find a considerably greater number of redescriptions, which are at the same time longer and more complex in structure than the ones found by Algorithm 1. The first algorithm, however, finds more intuitive, or obvious, redescriptions, which are shorter and have a Jaccard coefficient of 1 or very close to it. The majority of them are supported by either fewer than 10 or more than 2000 rows. This leads to the conclusion that on the DBLP data Algorithm 1 tends to select obvious rules or rules which hold only for a few rows of the data set. This can be adjusted to some extent with the min bucket parameter: increasing it leads to redescriptions with higher supports, an effect that can be exploited on other data sets as well. Algorithm 2 on this data set tends to find more interesting results, supported by more than 10 rows but by few enough rows not to cover the whole data set, which makes the results more informative. However, redescriptions which carry almost no useful information appear here as well; as before, increasing the min bucket parameter can fix this. Investigating the performance of both algorithms on the DBLP data set in detail, we can see the following:

• When using DBSCAN, Algorithm 1 tends to mine considerably fewer redescriptions than Algorithm 2. The accuracy of the results varies as well. Algorithm 1 returns redescriptions with a perfect Jaccard coefficient (exactly 1) in most cases, but the support of these redescriptions is below 10 rows; still, all of them have p-value = 0, making them significant at the highest level. Algorithm 2 returns less uniform outcomes: we observed a variety of supports (from a few rows up to almost all) and accuracies, with Jaccard coefficients ranging from 0.99 down to 0.06. Algorithm 2 returned up to 20% statistically insignificant results (i.e. p-value > 0.01); these occur for redescriptions with high support, E1,1 > 1500.

• The structure of the resulting redescriptions is similar to the results received on the Bio data set, i.e. Algorithm 1 returns more compact structures involving fewer attributes, with the decision tree induction routine terminating before reaching the maximal allowed depth. Algorithm 2, respectively, returned deeper trees, resulting in longer redescriptions involving a greater number of attributes.

• When using k-means for the discretization of the left-hand side of the data set, both Algorithms returned a greater share of statistically insignificant results: up to 30% for Algorithm 1 and up to 10% for Algorithm 2. Looking at the top-5 redescriptions (by Jaccard), Algorithm 1 returns rules with support around 5 rows, but these redescriptions describe quite extreme cases, for example an author who published more than 10 papers at a single conference; the low support is therefore not surprising, because not many researchers in Computer Science publish that many articles at one venue.

Algorithm 2, in its top-5 redescriptions (by Jaccard), returned rules with E1,1 > 1700 rows, and the thresholds inside them reflect the more common number of papers a researcher submits to one conference (from 0 to 7 papers). Thus, these high supports are not surprising either.

• Having applied hierarchical clustering to turn the left-hand side of the data set into a binary matrix, both algorithms behave as before. Algorithm 1 returned only statistically significant redescriptions, while for Algorithm 2 up to 15% of them did not pass the p-value < 0.01 threshold. The accuracy of the results of Algorithm 1 is perfect (Jaccard exactly 1), but the redescriptions again describe cases where an author submits an unusually large number of papers to a conference (above 10); hence the support sizes are low. For Algorithm 2 we observed diverse supports, the majority of which are between 20 and 700 rows, giving the redescriptions the desirable interestingness.

There is no strict formal limitation on the support of the mined redescriptions; this criterion is rather dictated by the data set we work with. Thus, for the DBLP data we adopt the view that a support between 10 and 1800 rows is of interest, although this choice is influenced only by the nature of this particular data set.

All in all, the selection of the clustering method within the binarization routine (DBSCAN, k-means, hierarchical clustering) on the DBLP data set does not significantly affect either the number of mined redescriptions or their quality. The only noticeable difference is that with k-means both algorithms return more statistically insignificant redescriptions. Both algorithms tend to return the results typical for them with all clustering methods used. This is because the discretized data participates only in the inception of an algorithm's run; afterwards the algorithm works through the data in the fully non-binary setting. Hence, when the user has no previous knowledge of the data, we suggest using DBSCAN, because it determines the number of clusters automatically and can prevent the data from being clustered into too many clusters, which would lead to unwanted computational burden. On the other hand, when the user wants to segregate the data into a certain number of clusters, k-means, hierarchical clustering or any other available clustering algorithm can be used for this purpose.

5.5 Experiments against ReReMi algorithm

To evaluate Algorithms 1 and 2 we compared them to the ReReMi algorithm presented in [20] and extended with on-the-fly bucketing in [18]. ReReMi has reported meaningful redescription mining results on both real-world and synthetic data, so it is a logical baseline against which to compare Algorithms 1 and 2 on the same data sets. Comparing algorithms for redescription mining on real-world data sets is an intricate task, since they may produce different types of redescriptions, and a property such as 'interestingness' is hard to measure, yet important when analyzing a set of mined redescriptions. We ran the ReReMi algorithm with analogous parameters on both the Bio and the DBLP data set. We used a depth limit of 3 when running Algorithms 1 and 2, which corresponds to a maximum of 7 variables per query in the ReReMi algorithm. We allowed the use of conjunction and disjunction operators on both sides of a redescription in Algorithms 1, 2 and ReReMi. However, when running Algorithms 1 and 2 we vary the impurity measure, which has no identical equivalent in ReReMi; the min bucket parameter can be related to the minimal contribution in ReReMi (we used 0.05; details in [19]). In addition, we allow as many initial pairs as there are runs of our Algorithms 1 and 2 for each particular case.

Bio. On the same Bio data set, ReReMi returned 209 unique statistically significant redescriptions, 201 of which have a Jaccard coefficient higher than 0.8. However, the support sizes tend to be large, i.e. E1,1 > 1300, meaning that most redescriptions cover a high percentage of the rows of the data set. Algorithm 1 returned 140 unique statistically significant redescriptions, also with diverse supports; yet, unlike ReReMi, we observed redescriptions with a Jaccard of exactly 1 and low support (around 10 rows), which means that Algorithm 1 tends to mine more obvious and less informative rules than ReReMi. Algorithm 2 returned 156 unique statistically significant redescriptions, with support of at least 30 rows, which makes them informative; in general these results are closer to ReReMi's. Many of the mined redescriptions (by ReReMi or by Algorithms 1 and 2) overlap, describing similar parts of the Bio data set.

DBLP. We used the same DBLP data set to compare the ReReMi algorithm with Algorithms 1 and 2. ReReMi returned 102 redescriptions with support mainly around 10 rows, yet many of them have higher support (up to 68 rows), making them quite interesting; 37 of them have Jaccard coefficients above 0.5. Algorithm 1 (with the Gini index) mined only 32 redescriptions, the majority of which have support below 10 rows. These redescriptions have a shorter structure compared to ReReMi and, despite the maximal depth being set to 3 (i.e. allowing up to 7 attributes), include fewer attributes. Thus, these results are obvious rules that carry little interesting information. At the same time, Algorithm 2 returned 81 statistically significant redescriptions whose support confirms their interestingness (above 15 rows); 30 of them have a Jaccard above 0.5. They are more complex in structure than the ones returned by Algorithm 1, yet similar to the ones returned by ReReMi.

General remark. Algorithms 1 and 2 differ from ReReMi, since they use distinct approaches to mine and assess redescriptions (decision tree induction versus greedy atomic updates). The CART approach underlying Algorithms 1 and 2 involves the use of impurity measures, which have no counterpart in ReReMi. In addition, ReReMi allows the minimal support size of a resulting redescription to be indicated directly, while the min bucket parameter we use only adjusts the minimal number of entities in each leaf node of the decision tree, which does not guarantee a minimal support size. Our query language and the way we extract redescriptions from a pair of decision trees allow the same variable to participate in a query several times, which is not the case in ReReMi; this makes the results difficult to compare with each other. Finally, when processing a fully numerical data set, Algorithms 1 and 2 use clustering as a pre-processing step, since they require binary targets at their inception, which adds one more aspect of incompatibility, because ReReMi uses an on-the-fly bucketing approach [18] on fully numerical data sets. Despite this, the rules mined by Algorithm 2 on the DBLP data set resemble the ones mined by ReReMi. Note that there are no strict limits on the minimal or maximal support for a redescription to be considered acceptable; usually this indicator is set based on the particular data set we work with. In the experiments with the computer science bibliography we consider redescriptions interesting when their support is higher than 10 rows; logically, redescriptions which are supported by nearly the whole data set carry no useful information either, because they simply describe a rule which holds for (almost) all rows of the data.

Chapter 6

Conclusions and Future Work

This thesis is dedicated to the data analysis task called redescription mining, which aims to discover objects that have multiple common descriptions or, vice versa, to reveal shared characteristics of a set of objects. Redescription mining gives insight into the data with the help of queries that relate different views of the objects and provides a domain-neutral way to cast complex data mining scenarios in terms of simpler primitives.

In this thesis we extended the alternating algorithm for redescription mining beyond propositional Boolean queries to real-valued attributes and presented two algorithms based on decision tree induction to mine redescriptions. The peculiarities of the parameters used were discussed in detail and their influence on real-world data sets was explored. We ran our Algorithms on two distinct real-world data sets and obtained results that can be used for the problems discussed in these domains. Numerous runs of the algorithms showed that they are able to find reasonable, statistically significant redescriptions in the studied domains. The actual value of the outcomes can only be evaluated by putting them to use in collaboration with experts of the corresponding fields. The underlying principle of redescription mining seems easy and intuitive, yet it forms a powerful tool for data exploration that can find practical application in numerous domains. Existing algorithms for redescription mining, augmented by our contributions, empower scientists to create their own descriptors and reason with them for a better understanding of scientific data sets.

There is a large field for future work on redescription mining. In particular, effective methods with profound theoretical foundations are needed to model the information content of redescriptions in the subjective interestingness framework. Combining the elaborated algorithms with existing methods for filtering and post-processing of redescriptions is of interest as well. Since uncertainties are inherent to most real-world scenarios, redescription mining should also be tailored to take them into consideration, potentially by making use of other data analysis developments [5].


Bibliography

[1] http://www.salford-systems.com/products/cart.
[2] http://www.informatik.uni-trier.de/~ley/db/.
[3] https://www.salford-systems.com/resources/whitepapers/115-technical-note-for-statisticians.
[4] http://en.wikipedia.org/wiki/DBSCAN.
[5] C. C. Aggarwal. Managing and Mining Uncertain Data, volume 35. Springer, 2010.
[6] R. Agrawal, R. Srikant, et al. Fast algorithms for mining association rules. In Proc. 20th Int. Conf. on Very Large Data Bases (VLDB), volume 1215, pages 487–499, 1994.
[7] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92–100. ACM, 1998.
[8] E. Boros, P. L. Hammer, T. Ibaraki, A. Kogan, E. Mayoraz, and I. Muchnik. An implementation of logical analysis of data. IEEE Transactions on Knowledge and Data Engineering, 12(2):292–306, 2000.
[9] P. Bradley, J. Gehrke, R. Ramakrishnan, and R. Srikant. Scaling mining algorithms to large databases. Communications of the ACM, 45(8):38–43, 2002.
[10] L. Breiman. Technical note: Some properties of splitting criteria. Machine Learning, 24(1):41–47, 1996.
[11] T. Calders and B. Goethals. Mining all non-derivable frequent itemsets. In Principles of Data Mining and Knowledge Discovery, pages 74–86. Springer, 2002.
[12] J. Crawford and F. Crawford. Data mining in a scientific environment. In Proceedings of AUUG 96 and Asia Pacific World Wide Web, 1996.
[13] E. B. Hunt, J. Marin, and P. J. Stone. Experiments in Induction. Academic Press, New York, 1966.
[14] E. Edgington and P. Onghena. Randomization Tests. CRC Press, 2007.
[15] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226–231, 1996.
[16] J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, New York, 2001.
[17] E. Galbrun et al. Methods for redescription mining. 2013.
[18] E. Galbrun and P. Miettinen. From black and white to full color: extending redescription mining outside the Boolean world. Statistical Analysis and Data Mining, pages 284–303, 2012.
[19] E. Galbrun and P. Miettinen. Siren demo at SIGMOD 2014. 2014.
[20] A. Gallo, P. Miettinen, and H. Mannila. Finding subgroups having several descriptions: Algorithms for redescription mining. In SDM, pages 334–345. SIAM, 2008.
[21] G. Gigerenzer and Z. Swijtink. The Empire of Chance: How Probability Changed Science and Everyday Life, volume 12. Cambridge University Press, 1989.
[22] C. Goutte, P. Toft, E. Rostrup, F. Å. Nielsen, and L. K. Hansen. On clustering fMRI time series. NeuroImage, 9(3):298–310, 1999.
[23] J. Grinnell. The niche-relationships of the California Thrasher. The Auk, pages 427–433, 1917.
[24] J. Han, H. Cheng, D. Xin, and X. Yan. Frequent pattern mining: current status and future directions. Data Mining and Knowledge Discovery, 15(1):55–86, 2007.
[25] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning, volume 2. Springer, 2009.
[26] R. J. Hijmans, S. E. Cameron, J. L. Parra, P. G. Jones, and A. Jarvis. Very high resolution interpolated climate surfaces for global land areas. International Journal of Climatology, 25(15):1965–1978, 2005.
[27] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37.
[28] C. Kamath, E. Cantú-Paz, I. K. Fodor, and N. A. Tang. Classifying of bent-double galaxies. Computing in Science & Engineering, 4(4):52–60, 2002.
[29] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis, volume 344. John Wiley & Sons, 2009.
[30] D. J. Ketchen and C. L. Shook. The application of cluster analysis in strategic management research: an analysis and critique. Strategic Management Journal, 17(6):441–458, 1996.
[31] J. P. Kleijnen. Cross-validation using the t statistic. European Journal of Operational Research, 13(2):133–141, 1983.
[32] M. Krzywinski and N. Altman. Points of significance: Significance, p values and t-tests. Nature Methods, 10(11):1041–1042, 2013.
[33] D. Kumar. Redescription mining: Algorithms and applications in bioinformatics. PhD thesis, Virginia Polytechnic Institute and State University, 2007.
[34] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Chapman and Hall/CRC, 1984.
[35] E. L. Lehmann and J. P. Romano. Testing Statistical Hypotheses. Springer, 2006.
[36] S. C. Lemon, J. Roy, M. A. Clark, P. D. Friedmann, and W. Rakowski. Classification and regression tree analysis in public health: methodological review and comparison with logistic regression. Annals of Behavioral Medicine, 26(3):172–181, 2003.
[37] J. MacQueen. Some methods for classification and analysis of multivariate observations, 1967.
[38] H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. In KDD-94: AAAI Workshop on Knowledge Discovery in Databases, pages 181–192, 1994.
[39] K. Meier, J. Brudney, and J. Bohte. Applied Statistics for Public and Nonprofit Administration. Cengage Learning, 2011.
[40] A. J. Mitchell-Jones. The Atlas of European Mammals. Academic Press, London, 1999.
[41] P. K. Novak, N. Lavrač, and G. I. Webb. Supervised descriptive rule discovery: A unifying survey of contrast set, emerging pattern and subgroup mining. The Journal of Machine Learning Research, 10:377–403, 2009.
[42] P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley, 2006.
[43] L. Parida and N. Ramakrishnan. Redescription mining: Structure theory and algorithms. In AAAI, volume 5, pages 837–844, 2005.
[44] F. Questier, R. Put, D. Coomans, B. Walczak, and Y. V. Heyden. The use of CART and multivariate regression trees for supervised and unsupervised feature selection. Chemometrics and Intelligent Laboratory Systems, 76(1):45–54, 2005.
[45] J. R. Quevedo, A. Bahamonde, and O. Luaces. A simple and efficient method for variable ranking according to their usefulness for learning. Computational Statistics & Data Analysis, 52(1):578–595, 2007.
[46] J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.
[47] J. R. Quinlan. C4.5: Programs for Machine Learning, volume 1. Morgan Kaufmann, 1993.
[48] N. Ramakrishnan, D. Kumar, B. Mishra, M. Potts, and R. F. Helm. Turning CARTwheels: an alternating algorithm for mining redescriptions. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 266–275. ACM, 2004.
[49] J. Soberón and M. Nakamura. Niches and distributional areas: concepts, methods, and assumptions. Proceedings of the National Academy of Sciences, 106(Supplement 2):19644–19650, 2009.
[50] R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. In ACM SIGMOD Record, volume 25, pages 1–12. ACM, 1996.
[51] D. Steinberg and P. Colla. CART: classification and regression trees. The Top Ten Algorithms in Data Mining, 9:179, 2009.
[52] T. M. Therneau, B. Atkinson, and M. B. Ripley. The rpart package, 2010.
[53] R. L. Thorndike. Who belongs in the family? Psychometrika, 18(4):267–276, 1953.
[54] A. Tripathi, A. Klami, M. Orešič, and S. Kaski. Matching samples of multiple views. Data Mining and Knowledge Discovery, 23(2):300–321, 2011.
[55] G. Tsoumakas, I. Katakis, and I. Vlahavas. Mining multi-label data. In Data Mining and Knowledge Discovery Handbook, pages 667–685. Springer, 2010.
[56] L. Umek, B. Zupan, M. Toplak, A. Morin, J.-H. Chauchat, G. Makovec, and D. Smrke. Subgroup Discovery in Data Sets with Multi-dimensional Responses: A Method and a Case Study in Traumatology. Springer, 2009.
[57] V. N. Vapnik. The Nature of Statistical Learning Theory. Statistics for Engineering and Information Science. Springer-Verlag, New York, 2000.
[58] G. P. Wadsworth and J. G. Bryan. Introduction to Probability and Random Variables, volume 7. McGraw-Hill, New York, 1960.
[59] G. J. Williams. Rattle: a data mining GUI for R. The R Journal, 1(2):45–55, 2009.
[60] D. Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 189–196. Association for Computational Linguistics, 1995.
[61] M. J. Zaki. Generating non-redundant association rules. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 34–43. ACM, 2000.
[62] M. J. Zaki and C.-J. Hsiao. CHARM: An efficient algorithm for closed itemset mining. In SDM, volume 2, pages 457–473. SIAM, 2002.
[63] M. J. Zaki and N. Ramakrishnan. Reasoning about sets using redescription mining. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 364–373. ACM, 2005.

Appendix A

Redescription Sets from experiments with Bio Data Set

Table A.1: Redescriptions mined by Algorithm 1 from the Bio data set (with Gini impurity measure and min bucket = 20). Columns: J (Jaccard similarity), E1,1 (support), p-value, and the redescription itself. In the queries, tn min, tn max and tn avg stand for the minimum, maximum and average temperature of month n in degrees Celsius, and pn stands for the average precipitation of month n in millimeters.
Northern.Red.backed.V ole Laxmann.s.Shrew ∧ 10 . ∧ 0 ∧ 35) ≥ − 0 0 . ∧ ∧ House.mouse. ≥ 3 ∧ 5 5 . 55 . 5 5 . ≥ . ≥ ∧ . . 8 5 42 0 . 5 Common.Genet max 0 0 5 1 . 0 − max < 146 ∧ 0 − max avg < min < ≥ − 7 5 3 − < . ≥ − − − p t 0 ( ( 2 1 stand for minimum, maximum, and average temperature of month 8 2 6 t t t t p ∨ } ≥ min ∧ ∧ ∧ ∧ ∧ max < Southwestern.W ater.V ole − ←→ avg 25) 45 635) ∧ − . 95 . 85 . 45 ; . . . 0 8 575 5 5) 8 11 2 . . . Stoat Muskrat < Grey.Red.Backed.V ole < Grey.Red.Backed.V ole < Arctic.F ox < t P olar.bear < House.mouse < Kuhl.s.P ipistrelle < Laxmann.s.Shrew < t P olar.bear < Raccoon.Dog Alpine.marmot < − Black.rat 10 Eurasian.W ater.Shrew < − Common.V ole < ( 36 2 0 0 ( − House.mouse. 89 min ; max Redescriptions mined by Algorithm 1 from Bio data set (with Gini impurity measure and min p-val. Redescription − { tn 1 , 1 E support ; Table A.1: 0.721 536 0 ( J 0.722 636 0 ( 0.718 140 0 ( 0.707 188 0 ( 0.715 178 0 ( 0.684 117 0 ( 0.708 600 0 ( 0.681 552 0 ( 0.657 142 0 ( 0.642 70 0 ( 0.638 347 0 ( 0.605 130 0 ( 84 Appendix A Redescription Sets from experiments with Bio Data Set ∧ ∧ ∧ ∧ ≥ ≥ ≥ < ≥ ≥ ≥ ≥ ≥ 1 - 5 5 5 . 5 1 . . . 10 , 0 0 0 1 0 p max ( max E ≥ ∨ − − Continued ≥ ≥ 7 3 ≥ Chamois t t ( ( 75) . ∧ ∨ ∨ 75) 73 . 55 5) . . House.mouse. Alpine.marmot 13 65) ≥ 0 . stands for average 58 Common.Shrew ∧ Edible.dormouse ∧ 5 ≥ ( 6 5 ≥ 25) 5 ∧ . . . p ≥ pn ∨ 0 European.Hamster 0 5 5 ( ∧ . 17 p 0 5) ≥ ( max ∨ . ≥ 15 0 . Etruscan.Shrew ∨ min < ≥ − ( 5) Common.Genet . 3 ≥ 45 ( − 5) 0 t Edible.dormouse . ∨ Common.Genet ( max 5 ( ∧ < t ∨ − 5) 107 ∨ . ∧ Gerbe.s.V ole 25 ∨ 3 10 0 . 5) t ∧ ≥ . p 05 5) 0 . 20 5 ∧ . 5 . 5) ∧ T undra.V ole < bucket=20.) J - Jaccard similarity ≥ 0 . p 0 Alpine.Shrew 5) 11 . 0 ∧ ( 25 ∧ . 75 5 ≥ . ∨ . ≥ in degrees Celsius, and 0 13 104 15 Alpine.marmot 42 5) . ( . max < n ≥ ≥ < 0 15) 21 . ≥ ∨ − max 5) 1 . Southern.V ole < p 92 25) P yrenean.Desman 5) . − 10 10 . ∧ ∧ t max Coypu p ∧ 0 < 3 106 Chamois < 5) 34 ( t . ( 5 ∧ 5 − 6 . 25 . ∧ ≥ . max < p 0 ∧ ≥ ∨ 15 3 0 t 95 5 . ∧ 5 54 ( − 5) p < . ≥ ←→ 95 ≥ . Savi.s.P ine.V ole ( . 0 7 11 0 9 ≥ t 85 max . p 11 5) ∧ . ( ≥ ≥ ∧ 5) ←→ 10 . − 0 61 5 ←→ ≥ p . American.Mink < 7 55 5) 0 in millimeters. t ≥ . . ≥ ∧ ( 152 5) ←→ 0 . Brown.rat < max ∧ n 6 European.Hamster 58 0 ( p < 75 max < ≥ ∧ 5) . − 5 . ∨ European.Hamster < ≥ ∧ . ≥ Egyptian.Mongoose < 5 ←→ 3 0 − 0 10 . 12 t ∧ 5 5) 9) p 0 ( . . 3 15 p 5 ≥ 5) . t ∧ . 0 ≥ ( . ( ∧ ∨ 0 90 0 ≥ 45 ≥ 5 Alpine.Shrew . ≥ ≥ 35) avg 45) 0 < 286 . ∧ ←→ . . 5 ←→ T undra.V ole 6 5 0 p − 6) . 5) . 59 10 − 5) ∧ . 0 − . 5) ∧ p . 10 0 ( 49 5 < t 5 0 . . ( Kuhl.s.P ipistrelle 107 0 ≥ < European.ground.squirrel 12 ≥ Etruscan.Shrew Common.Shrew < ≥ P olar.bear 4 p ∧ 132 ≥ ←→ ∧ precipitation of month ∧ avg < min < p 5 ∧ Mediterranean.W ater.Shrew ∧ ←→ ∧ 5 5 < Alpine.Shrew 5 p . 5) 5 ∧ . − − . . . ( Alpine.F ield.Mouse ∧ 5 0 0 Iberian.Lynx . 1 0 5) 25 0 1 2 0 ∧ . . 5 0 t p t ∧ Alpine.marmot < 55 . ∧ 0 . ≥ 5 ≥ 0 . 45) 5 ∧ 9 20 ∧ ∧ ∧ 5 . . 0 ←→ . 1 5 7 0 5 . 0 . ≥ Gerbe.s.V ole < 15 95 0 . . 5) ≥ . ≥ ∧ 0 ≥ 90 103 75 5 Mediterranean.Monk.Seal . max < max ≥ 0 ≥ ≥ ≥ ∧ min < − − 5 5 5 5 . − p p p 0 ( ( ( stand for minimum, maximum, and average temperature of month 11 5 10 t t t European.Hamster Mediterranean.W ater.Shrew < } ( ∧ ∧ ∧ House.mouse. ∨ ←→ ←→ ←→ ∧ avg ∧ 75 15 95 ; . . . 5 5) 5) 5 5) 5) ...... 
European.Hamster < Alpine.Ibex Common.Genet < European.Hamster < Alpine.Shrew Edible.dormouse < Etruscan.Shrew < Alpine.marmot < Chamois < Egyptian.Mongoose Stoat < Alpine.marmot Granada.Hare 0 0 0 42 P yrenean.Desman < 0 Mediterranean.W ater.Shrew 21 Lesser.W hite.toothed.Shrew < Lesser.W hite.toothed.Shrew 11 0 0 min ; max Redescriptions mined by Algorithm 1 from Bio data set (with Gini impurity measure and min p-val. Redescription − { tn 1 , 1 E support ; Table A.1: J 0.583 200 0 ( 0.571 100 0 ( 0.551 194 0 ( 0.492 405 0 ( 0.494 198 0 ( 0.455 35 0 ( 0.444 179 0 ( 0.407 48 0 ( 0.392 143 0 ( 0.383 31 0 ( 0.349 22 0 ( 0.323 200.294 15 0 0 ( ( Appendix A Redescription Sets from experiments with Bio Data Set 85 - 1 , 1 E stands for average pn bucket=20.) J - Jaccard similarity in degrees Celsius, and n in millimeters. n precipitation of month stand for minimum, maximum, and average temperature of month } avg ; min ; max Redescriptions mined by Algorithm 1 from Bio data set (with Gini impurity measure and min p-val. Redescription − { tn 1 , 1 E support ; Table A.1: J 86 Appendix A Redescription Sets from experiments with Bio Data Set ∨ − − 65) 3 - . t ←→ 5) 11 1 . 4 , ( t 1 avg < 0 ( 5) max < max < max < ∨ ≥ E . max < − ∨ 0 − − − 2 − 7 t 5 7 Continued 45) t t t ( ≥ . 45) ( 2 . ∧ max ∧ t ∨ 13 55 − 85 ∧ 15) 05 . . . ←→ 3 7 t 55) < Stoat . ( 12 13 85 ( 5) . Mountain.Hare < . ∨ 45) stands for average 19 10 . ∨ 0 ≥ ∧ Gray.Seal < p 10 max < 5 ≥ 11 5) pn . 45) ∧ ∧ . . − 15) 0 . max < 0 avg < 5 7 55 . max 13 t . − 40 − ≥ − 0 max 4 85) ∧ 9 − . t 19 t ≥ ( − 55) 9 ( max < . Moose < t 10 8 65 8 ≥ ∨ . ( ∨ t min p ∧ − 4 12 ( ≥ ∧ 5 ∨ max < − . 15) 45) ∨ . ≥ 0 avg . 10 − 1 45 7 . t t 25) 7 85) ( . ( − t 11 . max ≥ 13 54) 8 avg ∨ ∧ 10 t 10 ≥ max < W ood.mouse < bucket=100.) J - Jaccard similarity − ≥ ≥ ( W ild.boar < 55) − House.mouse < ∧ 65 ≥ . − ≥ . ←→ ∨ 05) 8 1 10 ∧ ∧ . 4 max 3 11 t 05 t in degrees Celsius, and t . ( 5) 5 p ( max 5 . − ≥ ( . 5) 20 . max avg . n 0 0 ∨ 11 ∧ 0 ∨ − 0 max 10 − ≥ − t ≥ 7 ≥ 10 15 − 6 t ∧ t ←→ ≥ . 95) max 45) t . max < . 9 ∧ ( ∧ t 1 05 40 − 5) − . max 13 . ∨ 05 ∧ 45 3 max 2 W ood.Lemming < . 0 . < Stoat < t t ( 14 − ( ( ( − ≥ − 8 16 15 ∨ 13 7 75) 65) . ≥ . p ∨ t ∨ . ( 11 5) ≥ 12 t . ∧ 10 5) 18 max < ( ←→ . 0 max 05) 0 . ≥ 45 max − . 5) ←→ − . 05) max < W ood.mouse ≥ max 7 12 − . in millimeters. 0 ( t 2 ←→ 11 7 5) − t − . t 11 n max < − Mountain.Hare < 5 ∧ ∨ max 7 max < 0 5) t ( ∧ ∧ t . ( − − 0 − ∧ 55 5) ≥ ∨ Striped.F ield.Mouse 85 . . 85 9 . 9 . 05) 0 t max < t ≥ ∧ 7 99) . 05 10 ( 12 . 5) . 5 t . ←→ 10 Mountain.Hare < . 4 ∧ − min < ( max < 16 ( Moose < 0 0 13 ( 6 Stoat 5) ∨ − ∨ Brown.Bear < t − 25 . ≥ ( ∨ ←→ . ≥ ≥ 0 Stoat 1 ≥ − ∧ ∧ t ( 5) ∨ 11 avg < 5) Common.V ole 5 . 10 max < ( 73) 5) . t . ≥ . . 0 ∨ ( 0 max < 0 55 ∧ 5) − − avg < 9 0 . max . avg 4 5 ∨ 1 5) 0 8 − . − t ≥ . t − ( − ←→ 0 9 0 ( t 5 2 10 avg < ( 55) t t t ≥ 5) . ( . precipitation of month ( ( avg < − W ild.boar 0 ←→ 19 ∨ ←→ ∨ ∨ 6 Moose < − max < t Moose ∧ W ild.boar ←→ ( 5) Mountain.Hare < 5) ∧ 05) . 85) ∧ ∧ − 5 . . . 45) 11 . 85) 65) 0 . 5 5 5) 5 t 0 . . 7 ∧ 2 Mountain.Hare . . . . 0 t 13 W olverine < 0 0 0 0 ≥ ∧ 13 avg < ∧ ( 5 10 10 ←→ ≥ . 5 ∧ ≥ ≥ . 0 ≥ − 65 0 5) 5 . . . 8 Mountain.Hare < t ←→ 0 0 Granada.Hare < House.mouse House.mouse < 18 max avg ∧ ∧ avg < avg < ∧ ∧ ∧ 5 5) max < − − ≥ . . 5 5 5 − − 0 0 . . . 05 4 9 − . t t 0 0 0 65) 7 7 stand for minimum, maximum, and average temperature of month . ( ( 7 t t ≥ t 4 11 } max ∨ ∨ W ood.mouse ∧ ∧ ∧ ( ≥ ≥ − ∨ avg 99 45 . 05) . 05 45) ; . . . 4 5) 1 10 . 
Arctic.F ox < Arctic.F ox < W ood.Lemming < Stoat < Norway.lemming < Grey.Red.Backed.V ole < W ood.mouse < Moose Mountain.Hare Moose < Stoat < W ood.mouse < t Mountain.Hare < Mountain.Hare Mountain.Hare < Stoat < − 0 14 max − ( ( 16 13 max min ; max Redescriptions mined by Algorithm 1 from Bio data set (with IG - impurity measure and min p-val. Redescription − { tn 1 , 1 E Table A.2: support ; J 0.966 2347 0.000 ( 0.958 2262 0.000 ( 0.956 2159 0.000 ( 0.949 2372 0.000 ( 0.947 22860.947 0.000 22670.947 0.000 2178 ( 0.000 ( ( 0.938 1905 0.000 ( 0.938 1905 0.000 ( 0.907 17890.906 0.000 2079 0.000 ( ( 0.932 2072 0.000 ( 0.900 1816 0.000 ( 0.891 1750 0.000 ( 0.883 1968 0.000 ( Appendix A Redescription Sets from experiments with Bio Data Set 87 ∨ ∨ ≥ 45) - . 5) 5) 1 . . , 65) 1 0 0 avg 10 55) . avg < . max < max < E − 65) ≥ 57 ≥ − 12 . − 75) − 1 . Continued 8 t 3 05) ≥ 10 t . 45) ≥ 8 t ( . ( t 51 8 ( max < 22 ≥ ∧ p 40 ≥ − avg ∧ ≥ House.mouse < ←→ 8 35 < ←→ ( . p − avg 10 ←→ 4 2 ∨ 05 t 5) . stands for average 7 ( p . 5) ∧ − . t 5) 0 max . ≥ 5) 0 22 ∧ 7 . ∧ 0 pn 15 t 35) 0 . . − ←→ 05 ∧ ≥ . 8 35 22 83 . t American.Mink Common.Shrew 5 1 ( 4 max 5) 45 . < ∧ . ∧ ∨ 0 max < 1 − 5 5 . 45) . 10 − . 1 75) − 0 0 . t p 65) ( 8 max < . 11 35) max < t ∧ . 16 max < ( − 4 57 − ≥ − 85 9 Gray.Seal < . House.mouse < 1 t < ←→ 1 ( t max < ( t 85) ( Beech.Marten < ←→ ∧ 8 39 . bucket=100.) J - Jaccard similarity ( Stoat < ∨ p − max < House.mouse. ∧ ( 5) 5 ∨ max . . 5) 10 ≥ 2 ∧ 5 ∧ . max < 5) 0 0 − t . ∨ ←→ . 5 0 in degrees Celsius, and 7 − . 0 Muskrat < ≥ 0 ←→ 15) − p ∧ 0 ( 05 7 . ≥ 5) ( n 10 5) . ≥ t . . Beech.Marten < 1 3 t ≥ ≥ 5) t ∨ 0 0 ∨ 85 . ∧ 22 ( ∧ . avg ∧ 0 5 5) ≥ . ≥ . 10 − 42 0 45 ≥ 0 . 7 745) ←→ . 1 t ≥ ≥ max < 13 2 95) ≥ Moose < . 7 ∧ 5) max < Stoat 0 . p − 1 ∧ ≥ ( ( 35) 0 5 . − 1 75 max < . . t ∨ 4 0 8 1 t − ∧ avg < Moose 05) ( ≥ 5) − max ( . ←→ . Common.V ole avg < Mountain.Hare < − 10 0 American.Mink Common.Shrew ∨ 45 American.Mink ( − t 12 ( ∧ . ( in millimeters. 5) ( 1 ( − . 5) t ≥ 7 5 ←→ ∨ ∨ . 0 . ≥ 40 t ∨ max n 1 ∨ 0 t House.mouse. 0 ∧ ( ( ∧ 5) 5) 5) ≥ − ≥ 5) max < 5) . . . . ←→ . ≥ ∨ 35 1 ∨ 0 0 0 2 0 House.mouse. 0 . t 05 − 15) max p . . ( 5) ≥ ∧ 5) 2 . ≥ ≥ 83 . t ∧ − 65) 0 House.mouse < 13 21 . 5 ( 0 Stoat ( Eurasian.W ater.Shrew . 6 ≥ ( House.mouse < t 0 ≥ 05 ≥ 22 ←→ ( ∨ ∧ ≥ . ∨ 5 ∧ ∨ 10 5 ≥ . ←→ 5) 5) p . . 5) 5) 0 avg < . 55 . 0 0 . ∧ max 0 5) 0 . 1 American.Mink − ≥ 0 − House.mouse < max ≥ 85 9 ≥ ∧ . t 8 ∧ max < ≥ ( t 5 − 5 . ( 39 . precipitation of month − 55) 8 0 House.mouse < 0 . Beech.Marten t ∨ 1 ≥ Mountain.Hare < Mountain.Hare Mountain.Hare max < t ∧ Raccoon.Dog 49 ∧ ←→ ∧ ( 7 ∧ ∧ ∧ 35) − 5 p Raccoon Raccoon < Y ellow.necked.Mouse 45) . 5 ∧ . ≥ 5 5 5 Mountain.Hare ( . . 5) Raccoon 35 4 . . . 2 . 0 . 5 0 ∧ ∧ ∧ t 8 ∧ 0 0 0 . 0 ∧ 0 13 ( ←→ p 5 5 5 0 5 Common.Shrew < . . . . 5 ( − ≥ . ≥ 0 0 0 0 ←→ ∧ < 5) 0 . Brown.Bear 5 Red.F ox < American.Mink < ≥ ≥ ≥ 1 ≥ 0 ←→ 5) European.Mole Black.rat < . . max < ∧ ←→ ∧ 0 ∧ 0 ≥ ∧ ∧ 5) 5 5 max < − . . . 5 min < 5) 5 5 . ≥ 0 0 0 . . . 1 − 0 t 0 0 0 − stand for minimum, maximum, and average temperature of month ( 7 ≥ ≥ t 1 } ≥ t ∨ ∧ Moose ∧ avg ∧ 15) 15 ; . . 95 235) 5 . . . Stoat < American.Mink < Stoat < House.mouse House.mouse. Moose House.mouse American.Mink < American.Mink < Mountain.Hare Moose < Moose W ood.mouse < Mountain.Hare Muskrat Muskrat < House.mouse Stoat < Common.Shrew < Stoat House.mouse 21 ( 0 6 ( 0 21 min ; max Redescriptions mined by Algorithm 1 from Bio data set (with IG - impurity measure and min p-val. 
Redescription − { tn 1 , 1 E Table A.2: support ; J 0.877 1613 0.000 ( 0.870 1667 0.000 ( 0.849 1625 0.000 ( 0.835 1701 0.000 ( 0.842 1691 0.000 ( 0.829 1414 0.000 ( 0.823 1599 0.000 ( 0.808 1325 0.000 ( 0.804 1436 0.000 ( 0.802 1167 0.000 ( 0.781 10130.767 0.000 6030.749 870 ( 0.000 0.000 ( ( 0.748 935 0.000 ( 0.745 825 0.000 ( 0.741 611 0.000 ( 0.738 637 0.000 ( 0.714 2820.702 353 0.000 0.000 ( ( 88 Appendix A Redescription Sets from experiments with Bio Data Set ∧ ≥ 5 - . 1 , 0 1 max < E − 535) . 9 t 6 ( ←→ avg < stands for average − 5) . pn 9 0 t ∧ Common.Shrew < ≥ Greater.Horseshoe.Bat ( 55 ∧ 125) ∨ . . 5 6 . 235) 12 5) . 0 − . 6 0 ≥ ≥ Black.rat avg < avg < ∧ 657) avg bucket=100.) J - Jaccard similarity . − 5 0 − . − 8 05) Moose < 0 . t 3 ≥ ( in degrees Celsius, and ( 12 t t 20 n ∨ ∧ 25) ∧ . Black.rat avg 5) ←→ . 35 48 . ∧ − 0 195 4 5) . 1 5 . < Stoat < . t 3 max < ≥ 0 ( ≥ 0 4 ∧ p − ≥ ∨ ≥ 7 ∧ 55 t . 5) max . avg ∧ 85 0 10 . − − 45 1 61 ≥ . 5 t t ( in millimeters. ≥ 11 35) Gray.W olf . ∧ n 15) 6 − Arctic.F ox . ∧ max p 97 ( 65 ←→ . 5 90 ∧ 25) − . ∨ 0 ≥ . 0 5) 3 ≥ . − 5 t 5) 15 16 0 . . p ( 5 min < 0 p ∧ 45 ≥ − ≥ ∧ 1 < 75 ←→ 15) t . . ( 85 max < min < . American.Mink < 5) 10 51 22 . p − − 86 ∧ 0 ( ≥ ≥ 9 5 ←→ 12 t ≥ ≥ . 4 t 0 p ( 6 ∧ 5) ( ←→ . precipitation of month p Greater.W hite.toothed.Shrew < max ( 0 25 House.mouse . 5) ∧ . − ≥ ←→ ∧ ←→ 0 5 14 9 . t ←→ 5 5) 0 . ( ≥ . 5) ≥ . 0 Norway.lemming 0 5) ∨ 0 . ∧ 0 ≥ ≥ Greater.Horseshoe.Bat < 5 . Black.rat < max 75) ≥ . 0 ∧ ∧ − 5 51 . 5 . 0 10 < 0 t ( 8 stand for minimum, maximum, and average temperature of month p } ∧ ←→ avg 15 ; . 5) . Stoat < W ood.Lemming Moose < Grey.Red.Backed.V ole European.Hamster Alpine.marmot Alpine.Shrew Common.Shrew < Arctic.F ox < Common.Shrew < Greater.W hite.toothed.Shrew 22 0 min ; max Redescriptions mined by Algorithm 1 from Bio data set (with IG - impurity measure and min p-val. Redescription − { tn 1 , 1 E Table A.2: support ; J 0.701 604 0.000 ( 0.693 7890.663 0.000 212 ( 0.000 ( 0.655 1820.641 567 0.000 0.000 ( ( 0.634 772 0.000 ( 0.591 182 0.000 ( 0.484 151 0.000 ( 0.442 76 0.000 ( 0.418 107 0.000 ( Appendix A Redescription Sets from experiments with Bio Data Set 89 ∨ ∨ ∧ ∧ ∧ ∧ − ≥ ≥ − 45) 3 1 75) - . t 5) . t ←→ 5) 5) 5) 1 5) 5) . . ( . . 0 ( , . . 0 1 0 0 0 0 0 10 ∧ 5) ∧ max < max max max < E . 95) . 0 ≥ − − − ≥ − Moose < ≥ ≥ 55) ( 2 8 10 2 . 9 ≥ t t t t 375) ( ( ( . ∧ ( max < 49 6 ∨ ∧ max ∧ − 5) 35) < . 2 ≥ 45) . − t House.mouse < 0 . 8 85) ←→ 95) ( ( 4 . 9 . max < 17 p 365) t ∧ ( . stands for average ( ∧ 5) avg 36 18 − 2 . ∧ 0 − 5) 65) pn ≥ House.mouse < W ood.mouse < . . 11 ( 3 ( t 0 ←→ 5 7 t ( 85) max < ( p ∧ max < . ∧ ( 5) ∧ Common.Shrew < American.Mink avg < 6 ∧ . − max < ( ( ∧ Mountain.Hare − 5) 0 Common.Shrew 5) . ( − . ( ≥ 9 − 0 ∧ ∧ 12 0 85) 1 t 05) ∧ . t 9 t . 55) ( max < ∨ t ( . 8 ( 5) 5) ( 5) 3 . . ∧ − 11 . ∧ 0 max 0 ∨ 5) 0 Gray.Seal < . 12 ( Eurasian.W ater.Shrew 0 − t ( ←→ 25) ( 25) . ∧ 95) . . ∧ ∧ 6 bucket=50.) J - Jaccard similarity 11 max < 5) 5 20 ≥ . t 5) max < . max < ( 0 ≥ − 0 ≥ ≥ 5) 25) − . 1 . ∧ − 2 in degrees Celsius, and 1 0 t 7 3 t ( t ( n ( avg ≥ 65) max min ∧ ∧ . ≥ Etruscan.Shrew < ∨ − − ( − 58 Mediterranean.W ater.Shrew < 5 7 American.Mink < ( 2 55) 25) ∧ t t . . t ( max 95) European.P ine.Marten < ( ( < ( . 7 ( ∨ 5) 3 − 49 ∨ . ∧ 5 ∨ ∨ ∨ 0 1 ≥ p European.W ater.V ole < 5) t ( . ≥ ( 5) ( Garden.dormouse < . 0 Kuhl.s.P ipistrelle < 25) W ood.mouse < House.mouse. 5) ( 25) ≥ − . 8 ∨ ( 75) . ( 0 ∨ ∨ ( . . p 0 ≥ ∨ ( ∧ max 22 ∨ Brown.rat 20 5) ∧ ≥ 36 ( . 
85) 55) American.Mink < ∨ ≥ Eurasian.W ater.Shrew < − . . 5) min 0 5) ( ( . 5) . ≥ 2 . ∨ 5) 0 t 0 86 . 10 ∧ in millimeters. − ∨ 0 ≥ 7 ( 65) 0 . 2 p n 5) ≥ t 5) 5) ( avg < 1 . max < . . ( 0 6 0 0 ∧ − − p ∧ ←→ ( 8 7 ≥ t t max < ≥ 5) ∧ ( ( 05) . 05) . − . 0 Alpine.Shrew ∧ 4 max < ∧ 3 ( t 11 65) . ( − ∧ 95) 25) ∧ . ≥ American.Mink 1 . 58 t ( Common.Shrew 5) Lusitanian.P ine.V ole < ( ( . 18 ( 38 ≥ 0 ∧ 25) max < Common.Shrew ∧ . ∧ ∧ ( 5 7 max Common.Shrew < ≥ ≥ − 5) p Alpine.marmot < ( 05) House.mouse < . Common.Shrew < ∧ 5) . 5) ( ( 8 ( . − . ( 1 45) 0 ∨ Chamois < p . t 0 0 3 ( ( 76 ∧ 5) precipitation of month ( 5 ∧ t ∧ . max < American.Mink 5) ( ∧ ≥ ( . 0 ∨ ∧ ≥ < Garden.dormouse < ←→ ≥ 5) − 0 Common.Shrew ( 5) . max < 5) ∧ 9) ( 5) . 5 . 9 ≥ . 1) . 0 t 0 . 5) p ∨ 0 ≥ − 0 . ( 5) ( ←→ ∧ 15 . 1 365) 0 1 46 ≥ t . ∨ max 0 5) ≥ ∧ . ( 2 5) ≥ 5) . 0 ≥ . − 0 ≥ 85) 0 7 3 . 55) ≥ . t p ≥ ←→ ( ( 61 max 59 avg ∨ ∧ 5) − < . < − 0 5 7 4 t 45) 55) 1 25) p stand for minimum, maximum, and average temperature of month . . ( . t p ≥ ( 5 3 ( ( Arctic.F ox < } House.mouse. ∧ ∧ ( 38 ( ∨ ∧ ≥ ∧ ∧ avg < 95) 25) ; . . 8 5) 5) 25) 25) . . . . Mediterranean.W ater.Shrew Brown.rat < Common.Shrew < p Eurasian.P ygmy.Shrew Eurasian.P ygmy.Shrew < European.P ine.Marten Beech.Marten Garden.dormouse Common.Shrew < Kuhl.s.P ipistrelle < European.W ater.V ole House.mouse Kuhl.s.P ipistrelle Moose 0 ( ( ( max < 0 6 ( 18 22 ( 7 ( max ( min ; max Redescriptions mined by Algorithm 2 from Bio data set (with IG impurity measure and min p-val. Redescription − { tn 1 , 1 E support ; Table A.3: J 0.912 1406 0.000 ( 0.876 1590 0.000 ( 0.841 1671 0.0000.823 ( 1293 0.000 ( 0.811 1368 0.000 ( 0.803 1237 0.000 ( 0.802 759 0.000 ( 0.801 1180 0.000 (
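To make the J and support columns concrete: both queries of a redescription are evaluated over the same set of geographic sites, the support is the number of sites satisfying both queries, and J is the Jaccard similarity of the two support sets. The following minimal sketch (not the thesis implementation; the toy data, thresholds, and variable names are purely illustrative) computes these quantities for one redescription in the style of Table A.1.

    import numpy as np

    # Toy data over six sites: the left view holds species occurrences (0/1),
    # the right view holds climate statistics; all values are illustrative.
    house_mouse = np.array([1, 1, 0, 1, 0, 1])
    stoat       = np.array([0, 1, 0, 0, 1, 1])
    t3_max      = np.array([4.0, 9.5, -2.0, 11.0, 1.0, 8.0])   # max temperature of month 3 (deg C)
    t11_avg     = np.array([2.0, 6.0, -7.5, 7.0, -1.0, 5.5])   # avg temperature of month 11 (deg C)

    # One query per view (thresholds are made up for the example).
    supp_left  = (house_mouse >= 0.5) & (stoat < 0.5)          # species-side query
    supp_right = (t3_max >= 3.5) | (t11_avg >= 6.5)            # climate-side query

    support = int(np.sum(supp_left & supp_right))               # "support" column
    union   = int(np.sum(supp_left | supp_right))
    jaccard = support / union                                   # "J" column

    print(f"support = {support}, J = {jaccard:.3f}")            # support = 2, J = 0.500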

Appendix B

Redescription Sets from experiments with DBLP data Set

Table B.1: Redescriptions mined by Algorithm 1 from the DBLP data set (with DBSCAN binarization routine; Gini impurity measure; min bucket = 5). Columns: J - Jaccard similarity; support; E_{1,1} p-value; Redescription. CONF[a-b] means that the author submitted from a to b papers to conference CONF; LHS is the left-hand side part of the redescription and RHS is the right-hand side part. The reported Jaccard similarities range from 0.998 down to 0.133.

Table B.2: Redescriptions mined by Algorithm 1 from the DBLP data set (with k-means (5 clusters) binarization routine; Gini impurity measure; min bucket = 5). Columns and notation as in Table B.1; the reported Jaccard similarities range from 0.996 down to 0.133.

Table B.3: Redescriptions mined by Algorithm 1 from the DBLP data set (with hierarchical clustering binarization routine; IG impurity measure). Columns and notation as in Table B.1; the reported Jaccard similarities range from 1.000 down to 0.667.

Table B.4: Redescriptions mined by Algorithm 2 from the DBLP data set (with DBSCAN binarization routine; IG impurity measure). Columns and notation as in Table B.1; the reported Jaccard similarities range from 0.985 down to 0.063.

Table B.5: Redescriptions mined by Algorithm 2 from the DBLP data set (with k-means (5 clusters) binarization routine; Gini impurity measure). Columns and notation as in Table B.1.
Redescription = 1 , 1 E bucket min Table B.5: J 0.945 121 0.000 0.922 2144 0.170 0.909 110 0.000 0.898 184 0.000 0.880 44 0.000 0.847 116 0.000 0.833 15 0.000 0.826 271 0.000 0.800 4 0.000 108 Appendix B Redescription Sets from experiments with DBLP data Set 1 ∨ ∧ ∧ ∧ ∧ ∧ ∧ ∧ ≥ ≥ 5 5 5 5 5 5 1 ≥ 5 - 22 ...... ←→ . 1 , 0 2 0 3 2 1 5 1 ≥ ≥ 5 E ≥ . ≥ ≥ ≥ 9 22 P ODS < Continued ≥ ∧ ≥ MoniNaor < 5 . ICDE LungW u BengChinOoi 13 NinaMishra ∧ − ∧ ←→ ∧ MichaelJ.Carey 5 ≥ . 5 1 . 5 7 PODS RichardCole < LungW u ∨ . 0 F rankNeven < 5 2 Y uriBreitbart < Kun 5 . ∨ − ≥ ∧ . ≥ ∨ ≥ 1 JohnLangford 1 ∧ 5 ∧ ≥ . SantoshV empala 5 . 5 ≥ 0 22 5 1 . P raveenSeshadri . 3 ICDE 4 Kun ≥ 11 ≥ ∧ ←→ ∧ RajmohanRajaraman ≥ ≥ ∧ ←→ ≥ 5 5 5 5 5 . . ∧ . . 5 . . 0 V LDB 0 6 1 5 1 14 . ∧ ≥ 3 ≥ 5 ≥ SungRanCho . ≥ 5 ∨ MosesCharikar 2 ≥ ∧ UAI RichardCole P hilipS.Y u < 5 . F rankNeven ≥ ShafiGoldwasser ErichJ.Neuhold < SODA < ∧ JustinA.Boyan 0 ∨ V LDB ∧ 5 ∧ 5 ∧ HuiXiong < 5 SIGMODConference ∧ . . . ≥ ←→ ∨ 5 0 9 5 5 5 . ∨ ∨ . . . NarainH.Gehani ←→ 7 SODA 5 11 0 2 . 0 5 5 . 13 9 . ≥ ∧ ≥ 6 ≥ 0 10 ≥ ≥ SallyA.Goldman ≥ 5 ≥ ≥ SIGMODConference . ←→ ≥ ∧ 8 ≥ ≥ ShaulDar 1 5 ∧ papers for conference CONF . . ∧ ≥ 3 b 5 MarkH.Overmars MoniNaor . 5 AbrahamSilberschatz < ICML . ≥ ∨ EDBT to 4 STOC 18 STOC ∨ ∧ 5 . a ∧ PODS V LDB ∧ ∧ 5 5 0 ≥ P hilipS.Y u ←→ . . BelaBollobas < DavidCohn 5 HuiXiong ∧ 5 SilvioMicali 5 1 ∧ . STOC . . ∧ 16 5 ∧ 5 . W W W < 5 ∧ . 19 H.V.Jagadish . ∨ 5 10 11 5 4 ←→ . ≥ . 5 5 ∧ . 5 5 . 10 ←→ ∧ 1 19 . 5 2 ≥ ≥ . 0 1 1 1 ≥ 5 5 5 ≥ . . . 1 ≥ ≥ 10 0 0 ≥ ≥ V LDB < 15 ≥ ≥ ≥ ∨ ≥ COLT < FOCS 5 F OCS < DavidP.Helmbold . STOC ∨ ShaulDar 4 ∧ ∨ V LDB PODS ICDT ∧ KeW ang 5 SDM COLT < NarainH.Gehani 5 . 5 ∧ . ∧ ∧ . ∨ 1 ←→ 5 ∧ ≥ ∧ . ∧ 5 ICDM 5 5 - author submitted from 1 17 . 5 0 19 . . 5 JeffreyC.Jackson < 8 . . ] 1 0 5 ←→ 7 5 ≥ ∧ b . 4 6 ∧ JiaweiHan JohnLangford < 5 2 2 − 5 ≥ . 5 ≥ . a . ≥ [ ≥ 0 ∧ ∨ 5 RaymondT.Ng 0 SantoshV empala < JurisHartmanis < 5 5 5 . . . ≥ ∧ T irthankarLahiri ∨ 1 ≥ ∧ EDBT 1 0 5 MichaelJ.Carey < ∧ 5 5 . SIGMODConference FOCS . . ≥ ICDT < ∧ ≥ 0 ≥ COLT < SODA < 5 0 0 ICDT CONF . FOCS ∧ ICML ICDT < ∧ ICDM < ∧ 5 8 ICDT < ∨ ∧ ≥ . ≥ ∧ 5 ∧ SDM < ∧ 5 ∨ ←→ 5 . ∨ . 5 5 ∧ . ≥ . . 18 5 5 ∨ 5 5 . 5 . 5 . 14 25 . 16 . . 26 8 5 17 11 5 6 . 1 22 SungRanCho < ≥ 16 5 ≥ ≥ ≥ support ; ≥ ≥ ≥ ≥ ≥ ≥ ≥ ≥ ShojiroNishio P hilipS.Y u AviW igderson ←→ ∧ ∧ ∧ 5 5 5 5 . . . . 0 V LDB < KDD FOCS 0 Y uriBreitbart COLT W ayneEberly < V LDB SODA HerbertEdelsbrunner < SDM SIGMODConference < 0 ST OC < SDM 0 ICDT SophieCluet V LDB AbrahamSilberschatz JohannesGehrke COLT SODA SallyA.Goldman LaszloLovasz ICDM ) LHS is a left-hand side part of the redescription; RHS is a right hand-side part of a redescription; J - Jaccard similarity i L 100 Redescriptions mined by Algorithm 2 from DBLP data set (with k-means (5 clusters) binarization routine; Gini-impurity measure; p-val. Redescription = 1 , 1 E bucket min Table B.5: J 0.800 4 0.000 0.800 40.671 47 0.000 0.000 0.800 4 0.000 0.800 4 0.000 0.667 2 0.000 0.667 2 0.000 0.615 40 0.000 0.605 52 0.000 0.600 3 0.000 0.571 4 0.000 0.556 5 0.000 0.524 43 0.000 0.500 6 0.000 Appendix B Redescription Sets from experiments with DBLP data Set 109 ∧ ∧ ∧ ∧ ∧ ∧ ≥ ≥ 5 5 5 5 5 5 - . 5 5 . . . . . ←→ . . 1 1 , 0 5 6 0 3 3 7 1 5 . ≥ E 0 ≥ ≥ ≥ KDD ≥ ICML < ≥ Continued ∧ 5 ∧ . BlaiBonet < 2 ∧ 23 FOCS ≥ LeonidLibkin < 5 SasoDzeroski < . AviW igderson < STOC ∧ ∨ ≥ 0 ICML P eterL.Bartlett ∨ ∧ 5 ∧ 5 5 . . . DavidHeckerman < 5 5 Y oavF reund 5 0 SophieCluet ∧ ∧ RudiStuder < . 5 . 0 . 0 1 ∨ 1 ∧ 5 ∧ ≥ 5 ∨ AviW igderson < . 
UAI T omiSilander ECML . 5 5 0 5 ≥ . ≥ . 7 5 . ∨ ∨ ∨ . 0 ∧ 0 0 6 5 5 5 5 . . . ≥ ≥ . 0 0 0 DavidA.McAllester < 1 ≥ ≥ ∧ ≥ ≥ ICML ≥ ST ACS < 5 ∧ 5 . . IpLin ∨ 1 5 0 . 5 − 9 LeonidLibkin . COLT < ≥ 1 2 STOC MoisesGoldszmidt < ICDM ∨ ∧ MosesCharikar < ≥ ≥ ∧ P eterL.Bartlett 5 ←→ F rankNeven < ∧ ∧ . 5 RudiStuder King . 5 RiccardoSilvestri < 0 . ∧ 5 5 ∨ UAI < . ∧ 11 . 2 ∨ 14 5 0 ≥ 5 0 RobertE.Schapire < . ∨ 5 . 5 . ≥ AviW igderson ≥ 0 . 0 5 SasoDzeroski ←→ 0 ≥ ∧ . 0 STOC ≥ DavidA.McAllester ≥ 7 5 ≥ ∨ 4 . ∧ ∧ 0 ≥ 5 ←→ 5 . 5 . . ≥ ICML 0 1 5 5 . ∧ 5 SDM 1 UAI ShojiroNishio . ≥ 5 . 0 papers for conference CONF . ∧ ∧ 2 ∧ COLT b 7 ManfredK.W armuth 23 5 ∧ ManfredK.W armuth < . EDBT to ≥ 5 3 ∧ . AviW igderson ∨ a ∧ F rankNeven 0 5 V ipinKumar < F OCS < . ∧ 5 5 DavidHeckerman 0 . . ∧ SODA < ∧ 5 1 . ∨ 5 5 ≥ 5 25 ←→ ∧ . 5 . RobertE.Schapire < 1 ECML < MaxHenrion < . 1 . UAI < KennethW.Regan < 0 2 0 SIGMODConference 5 ≥ 5 5 ∨ ∧ . . ∨ ≥ ∨ ∧ ∧ 7 8 ≥ 5 ≥ 5 1 . . 5 5 5 . ICML < . . 7 1 ≥ ≥ 1 1 2 5 ICML < ∧ . ≥ 1 5 ≥ ≥ . ≥ ∧ 5 9 . ManfredK.W armuth < ≥ V LDB ManfredK.W armuth 2 5 . ∨ ICDT AnthonyK.H.T ung ∧ 7 ST ACS < ∧ FOCS ≥ 5 ∧ . COLT 5 5 ∨ . ∧ ICDT 0 . KDD 5 4 ∧ . 0 1 - author submitted from 5 ∧ ChrisClifton ←→ UAI < . RobertE.Schapire ≥ 5 ∧ ] . 5 P ascalP oncelet < 11 8 b 3 . ShivakumarV enkataraman < ∨ ∧ 0 ∧ 5 ≥ − 5 RiccardoSilvestri < . ∧ 5 5 ∧ a . 5 ≥ . 0 [ DavidHeckerman < ShafiGoldwasser 5 . DmitryP avlov < ∨ 0 . 0 SasoDzeroski < 5 0 . 0 ∧ 5 ∧ ManfredK.W armuth 4 . ≥ 5 COLT < 0 . ≥ 5 1 ←→ . RiccardoSilvestri ≥ 4 JigneshM.P atel V LDB < ←→ 0 W W W < EllaBingham ≥ 5 ∨ CONF ≥ P ODS < . ST OC < ≥ ICML < 5 PKDD ∧ ←→ 0 ∨ P ODS < . ≥ 5 ∨ . ∨ ∧ SDM < 2 5 ←→ ∧ 5 ∨ ←→ . 5 5 7 V LDB < . ←→ 5 . . 5 5 5 5 5 . ∧ 4 . . . 0 . ≥ ∧ 8 2 8 28 5 11 5 5 ≥ 20 ≥ . 23 ≥ ≥ ≥ support ; ≥ ≥ 9 ≥ ≥ ≥ RobertE.Schapire UAI COLT < MaxHenrion < MoniNaor MarcGyssens ≥ V ipinKumar ∧ ∧ ∧ ∧ ∨ ∧ ∧ 5 5 5 5 5 5 5 5 ...... PKDD V LDB ECML < 2 1 P eterGrunwald PODS 0 UAI MaxHenrion 1 KDD UAI < 1 2 0 ST ACS SODA < 0 COLT FOCS DavidMaxwellChickering PODS STOC SalilP.V adhan WWW NarainH.Gehani ) LHS is a left-hand side part of the redescription; RHS is a right hand-side part of a redescription; J - Jaccard similarity i L 100 Redescriptions mined by Algorithm 2 from DBLP data set (with k-means (5 clusters) binarization routine; Gini-impurity measure; p-val. Redescription = 1 , 1 E bucket min Table B.5: 0.333 1 0.001 0.333 2 0.000 0.328 19 0.000 0.314 11 0.000 J 0.500 39 0.000 0.500 1 0.001 0.418 133 0.000 0.409 199 0.000 0.375 18 0.000 0.353 6 0.000 0.337 33 0.000 0.333 4 0.000 110 Appendix B Redescription Sets from experiments with DBLP data Set ∧ ∧ ∧ ≥ ≥ ≥ ≥ ≥ 5 5 5 - . . . ←→ ←→ ←→ 1 , 0 0 0 1 5 5 5 . . . E 4 ≥ 2 1 ≥ ≥ ICDT Continued ≥ 24 5 ≥ ≥ ∧ . KeW ang 5 6 ≥ . ∧ 5 5 ≥ . 5 PODS 0 H.V.Jagadish . 4 StefanKramer ∧ ICDM KurtMehlhorn < FOCS PODS ∧ ≥ 5 ∧ ∨ ∧ 5 . ∧ EmmanuelW aller < T homasHofmann < . ←→ 4 5 0 5 LungW u 5 5 H.V.Jagadish . ∧ . . ∨ H.V.Jagadish . 5 . P ODS < 3 0 4 5 KeW ang − ≥ ∨ ≥ 5 14 2 . ∧ . ∨ 5 ∧ 0 ≥ ≥ ≥ . 5 11 5 ≥ ≥ . . 5 2 . 0 ≥ 3 DavidMaxwellChickering Kun 0 ≥ ≥ ∧ ∧ SophieCluet 5 5 5 ArnoJ.Knobbe < . . . KDD ICML 0 V LDB ∧ 1 0 PODS ∨ ∧ 5 5 ∧ ∧ . . ≥ ∧ 5 0 4 . 5 5 5 . SasoDzeroski . 4 . 3 Molina < 2 ≥ ≥ 8 5 . 0 − ≥ KurtMehlhorn AlonY.Levy ←→ ≥ MosheY.V ardi < T omF awcett < 5 ∧ . ∨ 3 5 ∨ ShengMa < 5 . T homasHofmann SIGMODConference . ←→ 5 5 0 . . ≥ 0 ∨ P KDD < ∨ COLT < 0 0 V LDB 5 5 . ≥ . ≥ 5 ∨ ∨ P hilipS.Y u 6 . ≥ 1 ←→ ∧ ≥ 3 5 ∧ 5 SIGMODConference SallyA.Goldman . 5 UAI . 5 . ≥ 5 . ≥ 1 . 
∧ ∧ ∧ 4 7 ≥ 14 5 papers for conference CONF . 5 HectorGarcia 5 5 . ≥ . . DavidA.McAllester < b 0 ≥ EmmanuelW aller 2 0 ∨ ≥ 5 CraigBoutilier . ∧ ∧ ≥ to 5 2 ≥ . ∧ 5 5 2 a FOCS . . ≥ 5 ICDT 1 . SDM M.R.Garey UAI ShengMa ∧ 11 0 ≥ T omF awcett SIGMODConference < ∧ Y uriBreitbart ∧ ∧ ICDM ∧ 5 . ≥ 5 5 ∨ 5 ∧ ICDT < ICML . . 5 1 . ∧ KeW ang < 5 . 5 . 5 ←→ ∧ 5 ∧ 3 DiveshSrivastava 5 ∧ . . ←→ 0 14 11 5 . 1 5 0 . ≥ 5 3 5 8 . 5 . ≥ . ∧ 0 . ≥ ≥ 0 5 ≥ ≥ ≥ 5 10 ≥ . 24 0 ≥ 6 ≥ P hilipS.Y u ≥ STOC KDD < ∧ V LDB ICML < SDM KDD SIGMODConference < ∧ DavidA.McAllester ∧ 5 SurajitChaudhuri ∧ MosheY.V ardi < . ∧ ECML < SDM ∧ P ODS < 5 ∧ ∨ 5 ∧ . 0 . ICDE ∨ 5 MoniNaor ∧ 5 ∨ 5 - author submitted from ∧ 5 2 SatinderP.Singh . 5 ∨ . 5 4 S.Seshadri NaderH.Bshouty < . 5 . . . ] 5 4 . 5 3 ∧ ∧ 2 8 . b 4 ≥ . 1 ∧ ∧ 0 ∨ 10 13 2 5 5 0 − . 5 . MosheY.V ardi 5 a . ≥ 5 [ ≥ . ≥ 1 7 0 W eiW ang . ≥ ≥ 0 ≥ ArnoJ.Knobbe < 0 1 ∧ ≥ ∨ StephenA.F enner H.V.Jagadish < 5 ←→ 5 ≥ . . 5 5 0 5 Molina . SODA < . CONF P KDD < COLT < 0 ICML < 9 ←→ FOCS ≥ ≥ ICDM < PKDD − ∨ ∨ ←→ KDD < ∨ 5 ∨ ≥ ∧ . ∧ ≥ ∨ 5 5 ∨ 5 5 5 . . 5 5 . . . . 12 . 2 4 5 0 3 5 8 10 . 5 7 ≥ ≥ ≥ ≥ support ; ≥ ≥ ≥ ≥ ≥ ≥ V LDB KeW ang NaderH.Bshouty < DmitryP avlov < SasoDzeroski < MichaelJ.Carey H.V.Jagadish < AviW igderson ∧ ∧ ∨ ∧ ∨ ∧ ∨ ∧ 5 5 5 5 5 5 5 5 ...... SIGMODConference < PODS ICDE RaghuRamakrishnan SODA 0 ICML 0 COLT NaderH.Bshouty 0 SIGMODConference HectorGarcia CatrielBeeri ST ACS PKDD ArnoJ.Knobbe ECML 5 0 ICDM KDD SIGMODConference 0 0 2 ) LHS is a left-hand side part of the redescription; RHS is a right hand-side part of a redescription; J - Jaccard similarity i L 100 Redescriptions mined by Algorithm 2 from DBLP data set (with k-means (5 clusters) binarization routine; Gini-impurity measure; p-val. Redescription = 1 , 1 E bucket min Table B.5: 0.139 5 0.000 0.155 43 0.000 0.132 48 0.000 J 0.294 5 0.000 0.258 16 0.000 0.228 18 0.000 0.200 1 0.004 0.192 14 0.000 0.175 11 0.000 0.167 1 0.003 0.167 4 0.000 0.166 28 0.000 Appendix B Redescription Sets from experiments with DBLP data Set 111 ∧ ∧ 5 5 - . . ←→ 1 , 0 1 1 E 20 ≥ ICDE HenryF.Korth < SrujanaMerugu < ∧ ∨ DanielF.Lieuwen < ∨ 5 LyleH.Ungar < 5 ElenaBaralis < ∨ . . 5 . 4 7 5 ∨ ∨ 8 5 5 . ≥ ≥ . ≥ ≥ 1 0 ≥ ≥ ICDM ∧ 5 . SophieCluet 1 ∧ HenryF.Korth 5 . LyleH.Ungar DanielF.Lieuwen 0 ElenaBaralis ←→ ←→ ←→ ←→ SDM < 3 5 5 . 5 ∨ . . ≥ 1 0 5 26 . 6 ≥ ≥ ≥ papers for conference CONF . ≥ b 5 . 5 to . 1 PKDD 1 JianyongW ang < a FOCS ≥ ∧ PKDD V LDB ≥ ∧ ICDT ∧ 5 ∧ ∧ . 5 . 5 ∧ 5 . 5 1 . . 12 5 4 5 . 11 21 6 4 ≥ ≥ ≥ 5 ≥ ≥ . 5 ≥ . 1 1 ≥ ≥ ShojiroNishio V LDB EDBT V LDB ICDE ∧ ∧ ICDM < ∧ 9 W eiW ang ∧ ∧ W eiW ang 5 5 ∧ 5 5 . ∧ . ≥ . . - author submitted from 6 ∧ 2 1 5 1 5 ] . . b ShaulDar 5 1 SrujanaMerugu < 3 . − 0 ∧ a [ ∨ ≥ 5 Y ossiMatias . 5 ≥ . 4 ∧ 1 5 ≥ . ≥ F riedhelmMeyeraufderHeide 2 CONF ICDE < W W W < EDBT < W W W < SDM < ≥ ∨ ∨ ∨ ∨ ∨ ←→ 5 5 5 5 . . . . 5 . 2 6 1 1 1 27 ≥ ≥ ≥ support ; ≥ ≥ ≥ JianyongW ang LaksV.S.Lakshmanan CharuC.Aggarwal ∧ ∧ 5 5 ∧ . . SDM SrujanaMerugu 1 WWW NarainH.Gehani WWW 5 EDBT H.V.Jagadish ICDE 7 FOCS ) LHS is a left-hand side part of the redescription; RHS is a right hand-side part of a redescription; J - Jaccard similarity i L 100 Redescriptions mined by Algorithm 2 from DBLP data set (with k-means (5 clusters) binarization routine; Gini-impurity measure; p-val. Redescription = 1 , 1 E bucket min Table B.5: J 0.130 6 0.000 0.122 5 0.000 0.102 5 0.000 0.065 70.060 4 0.000 0.000 0.056 1 0.020 112 Appendix B Redescription Sets from experiments with DBLP data Set ∨ ∧ ∨ ∨ ∧ ≥ ≥ ≥ ≥ ≥ ≥ ≥ 5 5 5 5 . 5 . 5 . . 
. . 0 0 0 1 0 0 ≥ ≥ ≥ KDD 5 ICDE < SODA FOCS . ICDE < Continued ∧ 1 ∧ ∨ ∨ 5 . ∧ 5 5 . 5 AmosF iat . 3 . JiaweiHan 0 5 0 8 . ∧ ≥ ∧ 0 ≥ 5 AviW igderson < - support. 5 . . 1 , 0 0 ∧ ≥ 1 NaderH.Bshouty 5 DiveshSrivastava < 5 E . . ≥ ≥ MayurDatar ∧ 0 ∧ 0 ICDE MichaelE.Saks < 5 ∧ . 5 V LDB < AmelieMarian . ∧ 0 ≥ 5 0 SIGMODConference . ∧ 5 5 . 0 . ≥ 5 SODA ←→ 1 ≥ . ∧ 7 5 3 ≥ SridharRajagopalan . 5 ←→ ∧ 5 . ≥ . 0 2 4 5 5 . AlbertoO.Mendelzon < . F rankT homsonLeighton < 2 0 ≥ 8 SridharRajagopalan < ←→ ≥ ∨ ∧ T omasF eder ≥ H.V.Jagadish ∧ 5 5 ≥ . . ∨ 5 5 ∨ AmosF iat . 8 0 . COLT < MadhuSudan < 0 5 5 0 SIGMODConference T omiSilander . ∧ . SIGMODConference < ∨ ∧ COLT 0 1 ∧ 5 ≥ ∨ 5 . 5 ∧ ∧ . . 5 0 SODA ≥ 5 . 7 5 0 . 5 MoniNaor < . P eterKriegel < 0 . MichaelJ.Carey ST OC < STOC 0 SergeA.P lotkin ≥ ∧ 4 0 ≥ ∧ ∧ − ≥ ∧ ∨ ∨ 5 ≥ 5 . 5 ≥ ST OC < . . 5 5 0 MayurDatar 5 . . 0 0 . ∧ 0 0 0 ∧ ≥ 5 ≥ 5 Hans . ≥ SIGMODConference . ≥ 0 FOCS 0 ∧ ∨ SODA ICDE AviW igderson < ∨ 5 5 ≥ ≥ AviW igderson . . ∧ 5 ∨ CatrielBeeri < ∧ 0 1 . 5 ∨ . 0 5 5 ∧ . MosesCharikar < AviW igderson 1 . 5 STOC Bianchi < . 1 0 AviW igderson FOCS MichaelE.Saks 5 ∧ 0 . ≥ ∧ ∨ − ∨ 0 ∨ 5 ≥ ∨ . FOCS 5 5 5 5 0 . . . . 5 ≥ . ∨ CatrielBeeri 1 0 1 0 SurajitChaudhuri < AviW igderson 0 ≥ SODA < 5 ∧ . SIGMODConference < ∧ ∨ SODA < STOC ∧ 0 5 5 . 5 ∧ ∧ F OCS < . 5 . ∨ 0 . 5 0 0 5 5 . 0 . ∧ . 0 ≥ 1 1 SantoshV empala 24 5 NicoloCesa ≥ . ≥ ∧ 2 ∨ ≥ MichaelJ.Carey < SIGMODConference < 5 5 V LDB < F OCS < . . ≥ ∧ 1 0 F OCS < ∧ ∧ ∧ 5 RakeshAgrawal < . AviW igderson ≥ ≥ ∧ 5 5 5 0 V LDB ∧ . . . ST OC < V LDB ∧ 5 0 AviW igderson < 0 0 V LDB < UAI 5 MichaelJ.Carey ∧ . . ∨ 5 ∧ 1 ∧ ∧ 5 . 0 LungW u ∨ ∧ . 5 5 ≥ 5 0 ≥ 5 5 . . 1 . . 5 5 . − ≥ . 0 . 3 0 5 AviW igderson 1 F rankT homsonLeighton 0 T omiSilander < . AviW igderson < 0 0 AlbertoO.Mendelzon < 0 Bianchi ∨ ≥ ∧ ∧ 5 − ≥ . 5 Bianchi < Kun 5 . ←→ ←→ . 0 1 0 FOCS FOCS − W W W < 5 5 . W W W < . 1 8 ≥ ∧ ∨ ∧ ←→ F OCS < COLT < FOCS ∨ ICDE < 5 5 5 ≥ 5 ∨ ∧ . ∧ 5 . . . . 0 0 0 ∧ 5 5 5 1 . . . 15 F riedhelmMeyeraufderHeide < 5 3 1 1 T omasF eder < . ≥ ≥ NicoloCesa 2 ≥ ≥ ≥ H.V.Jagadish < ST OC < PODS SurajitChaudhuri BengChinOoi SergeA.P lotkin < NicoloCesa ←→ ←→ ←→ ∧ ∧ ∨ ∧ ∧ ∨ ←→ 5 5 5 5 5 5 5 5 5 ...... AmelieMarian < COLT 0 0 ICDM SIGMODConference < UAI < 1 3 1 F riedhelmMeyeraufderHeide 0 ST OC < WWW SergeA.P lotkin < 2 STOC 0 SridharRajagopalan < AviW igderson ST OC < 0 ST OC < 0 0 WWW Redescriptions mined by Algorithm 2 from DBLP data set (with hierarchical (5 clusters) binarization routine; IG-impurity measure;) p-val. Redescription 1 , 1 LHS is a left-hand side part of the redescription; RHS is a right hand-side part of a redescription; J - Jaccard similarity E Table B.6: 0.879 2034 0.091 J 1 10.959 0 745 0 0.965 2239 0.022 0.957 707 0 0.949 2197 0.015 0.959 1557 0 0.949 636 0 0.919 711 0 0.904 2094 0.091 Appendix B Redescription Sets from experiments with DBLP data Set 113 5 ∨ ∨ ∧ ∨ ∨ . − ≥ 0 5 5 5 5 5 . . . . . ←→ ←→ ←→ 0 0 0 0 0 5 5 5 . . . ≥ ≥ 0 0 0 Hans ≥ ≥ ICDE < V LDB < Continued ∧ ∧ ∧ 5 5 . 5 . . 0 0 5 0 LeonardP itt < . 0 ∨ AviW igderson < - support. 5 MichaelE.Saks 1 P ODS < ≥ . EDBT < ICDM < , ∨ 1 0 5 ∧ . ∧ OdedGoldreich < ∧ E 0 5 5 5 ∧ . ←→ . . AviW igderson 0 V LDB < ICDE < ≥ 1 StefanKramer 5 0 SIGMODConference < . 5 ∨ . RakeshAgrawal ∧ 0 ∧ ∧ ∨ W olfgangKafer 1 ≥ ≥ ≥ 5 DiveshSrivastava < 5 5 . 5 . 5 . ≥ . . 0 1 1 ∧ 0 0 ←→ 5 . ≥ ←→ SudiptoGuha ≥ 0 5 MichaelJ.Carey < . ∧ FOCS ICDE 5 ICDE . 0 ≥ ∧ 5 0 P iotrIndyk ∧ ∨ . ∨ 1 V LDB < 5 0 ∧ . 5 5 5 5 RakeshAgrawal < COLT < . . . ∧ . COLT < ICDE 0 5 7 1 ≥ 9 . 
1 ∧ 5 ∨ ∧ ∨ 0 . 5 5 5 1 5 . . ≥ ≥ . . 0 AviW igderson 0 5 7 8 . ≥ ∨ 0 ≥ SDM < ICDM < AviW igderson < 5 SasoDzeroski . 5 . 5 ∧ 0 ∧ . JinyanLi < 3 ∧ ST OC < 0 SODA 5 5 . PKDD ∧ 5 . 5 . ←→ . 0 ∨ SODA 0 ∧ F lipKorn < 5 0 MichaelJ.Carey 1 ∧ . V LDB < 5 5 5 MadhuSudan ∨ 0 . . ∨ . 1 ∧ ∧ ≥ ≥ 1 0 1 5 ∧ P eterKriegel < 5 5 . ≥ 3 . . ≥ 1 5 0 3 ≥ . − ≥ 0 5 ≥ . SudiptoGuha < 0 ∧ P hilipS.Y u < Hans JiaweiHan < SurajitChaudhuri 5 ∧ . ECML SODA SODA < AviW igderson < FOCS ∧ ∧ ∧ 0 5 SODA < . ∧ 5 5 5 ∧ ∧ ∧ . . . F rankT homsonLeighton < 0 SIGMODConference < 5 SIGMODConference < ∧ 5 0 0 . 0 5 5 5 DoinaP recup < . . ∧ . . ←→ 0 ∧ ∧ ≥ 5 3 0 F lipKorn 1 2 5 . ≥ ∨ 5 . 5 5 1 . . MoniNaor < . 0 MichaelJ.Carey 5 5 0 . 2 0 . P hokionG.Kolaitis < ∨ 0 ∧ 0 ≥ ←→ ≥ 5 ∧ 5 . . 0 StephenA.F enner < 5 5 SIGMODConference 0 SIGMODConference . . 0 0 5 ∨ . ∧ MichaelJ.Carey < ∧ ICML < ST OC < F OCS < ST OC < 0 3 ≥ ST OC < 5 5 ∧ SIGMODConference < . . ∨ ∨ ICDE F OCS < ∧ ∨ V LDB 1 5 ∨ 1 LeonardP itt < ∨ MichaelE.Saks < ≥ . 5 5 5 5 ∧ ∧ ∧ . . . . 5 5 0 ∨ Y ingMa < . . ∨ 0 5 5 5 1 5 1 . . . 3 5 HamidP irahesh 0 . 5 MichaelJ.Carey < 1 − 1 HamidP irahesh 1 . 0 ∧ ≥ 0 JiaweiHan < ∧ AviW igderson < ∧ ≥ P KDD < F lipKorn < 5 ≥ . 3 5 ∨ ∧ . 0 ∨ ∧ W ei 5 5 NickKoudas < 0 . 5 . . 5 AbrahamSilberschatz < 5 0 ∧ 0 . . V LDB < V LDB < ∧ 1 JiaweiHan < 7 0 5 ICDE < ICDE < ST OC < F OCS < 5 . F OCS < ≥ ∨ . ∨ COLT < STOC COLT < ≥ 0 ≥ 0 ∧ ∧ ∨ ∧ ≥ ←→ ∧ 5 5 ∨ ∨ ∨ . . ←→ 5 ≥ 5 5 5 5 5 ≥ . 1 1 . . . 5 5 5 . . . . . 5 3 2 1 1 . 0 1 1 1 1 StephenA.F enner 0 ≥ ≥ LeonardP itt ≥ ≥ ≥ ≥ ≥ ≥ ≥ V LDB SurajitChaudhuri < F rankT homsonLeighton AviW igderson < ←→ ∧ ←→ ∧ ∧ ∧ 5 5 5 5 5 5 ...... V LDB W olfgangKafer < STOC MoniNaor STOC OdedGoldreich < COLT 0 0 StephenA.F enner < 0 COLT V LDB RakeshAgrawal < ST OC < P ODS < KDD < SIGMODConference 3 P eterKriegel ICML DoinaP recup JiaweiHan STOC AviW igderson 0 STOC 0 Redescriptions mined by Algorithm 2 from DBLP data set (with hierarchical (5 clusters) binarization routine; IG-impurity measure;) p-val. Redescription 1 , 1 LHS is a left-hand side part of the redescription; RHS is a right hand-side part of a redescription; J - Jaccard similarity E Table B.6: 0.676 606 0 J 0.878 361 0 0.872 2014 0.052 0.864 1996 0.101 0.861 1992 0.157 0.842 717 0 0.839 1929 0.026 0.815 1876 0.054 0.822 1900 0.095 0.789 830 0 0.734 80 0 0.733 641 0 0.698 1586 0.02 114 Appendix B Redescription Sets from experiments with DBLP data Set ∧ ∨ ∧ ∧ ∧ ≥ ≥ ≥ ≥ ≥ 5 5 5 5 5 . . . . ←→ ←→ . 1 0 0 0 2 5 3 . ≥ ≥ ≥ 12 Continued P ODS < ≥ ∧ 5 . ICDT < V ipinKumar V ipinKumar 2 ICDT < - support. ∧ ∧ AviW igderson < ∧ 1 JianP ei ∧ , HamidP irahesh < 5 5 2 ≥ SODA . 1 ∨ . 5 4 DavidHeckerman ∧ ∧ 6 5 E . 5 ∧ . CatrielBeeri < . 0 5 0 5 ∧ . 0 5 . ≥ . ∨ 0 2 5 0 SIGMODConference ≥ ≥ RichardM.Karp . 5 ≥ 0 . NicoleImmorlica ∨ ∨ FOCS 0 ≥ F rankT homsonLeighton ≥ 5 5 ≥ . . ∧ ∧ ≥ 0 0 STOC 5 ←→ 5 . . ≥ V LDB 5 ∧ 0 5 . P hilipS.Y u < ∨ 5 FOCS 6 . ∨ 5 5 ∧ . . 10 5 ≥ 3 . 0 5 KDD . 4 OdedGoldreich JiongY ang 1 ArindamBanerjee < ≥ ∧ AviW igderson < Molina < ≥ ∧ ∨ 5 ∧ . 5 − 5 CatrielBeeri SODA < . . 0 5 EmmanuelW aller . 0 0 ←→ 5 FOCS SatinderP.Singh 1 . ∨ UAI ≥ ∧ 2 ≥ F OCS < ∧ 5 ∧ 5 ∧ 5 ←→ . . . ≥ 5 1 5 ∨ 1 ST OC < 5 . 0 . 5 . 0 . AviW igderson < 1 5 0 ∨ . 0 ≥ 0 P hilipS.Y u ∧ 5 ≥ . ICDM ≥ ≥ 5 3 . ∧ 0 HectorGarcia ←→ 5 ≥ . 5 ∧ 0 P hokionG.Kolaitis < . STOC ICDT < JiaweiHan 0 5 STOC . ∧ ICML AviW igderson ∧ 0 ∧ ∧ 5 ≥ ∧ ST ACS NogaAlon < . ∧ ∧ 5 5 RichardM.Karp < ICDT 5 0 . . 5 ≥ 5 MartinL.Kersten < . ∧ 5 ∧ . 5 . 5 0 2 . . 3 . ∧ 0 0 ≥ ∧ 2 5 5 0 0 . 5 . SDM < ≥ 5 5 5 . 2 . 0 . . 
0 ←→ KDD 0 ≥ StefanKramer < ≥ ≥ 3 ∨ 0 2 3 5 ≥ ∧ . ≥ 4 ∨ ≥ ≥ 0 5 ≥ . 5 . 0 0 MosesCharikar < ≥ F OCS < SODA < ECML < SIGMODConference < ∨ ICDT ∧ 5 ∨ . F OCS < 5 ∨ AviW igderson ∧ V LDB . ∧ 2 5 5 . . AviW igderson 0 5 5 ∧ ∧ ∧ . . 5 5 2 . ≥ 0 ∧ 6 5 5 5 ICDM 2 . . . ≥ JosephNaor 5 5 1 2 . 1 ∧ . MartinL.Kersten JianP ei < 1 NicoleImmorlica < ∧ 0 AviW igderson < 5 MartinF arach . ≥ ∨ 5 ∨ ∧ 1 . ∨ MichaelBenedikt < ∧ T homasSchwentick ArindamBanerjee 5 0 5 5 . . 5 . 5 ∧ ∧ 0 . . 0 1 11 ≥ 0 0 5 SIGMODConference 5 SurajitChaudhuri < . T ovaMilo . ≥ SODA < ST OC < ←→ V LDB < ST ACS < 0 ≥ 0 P ODS < ≥ SODA ∧ ≥ ST OC < 5 ∧ DavidMaxwellChickering ∨ ∧ . SIGMODConference < ∧ ∨ ≥ ∧ ≥ 5 SDM < ∨ 2 5 . 5 ←→ 5 ←→ . 5 5 . ∧ 5 . 8 . . ∨ 5 . 5 . 5 2 5 0 5 2 . 10 5 . ←→ . 1 . 10 0 4 MosesCharikar 0 ≥ 1 ≥ ≥ ≥ ≥ 17 ≥ ≥ ≥ ≥ ≥ RichardM.Karp < JianP ei SunilP rabhakar < MosesCharikar < P hilipS.Y u NogaAlon ≥ ←→ V LDB < ∨ ∨ ∧ ∨ ∧ ∧ 5 5 5 5 5 5 5 ∧ ...... ECML StefanKramer 0 ICDE < F OCS < FOCS 0 S.Muthukrishnan SDM 0 ICDM ChristianBohm < SODA HarryBuhrman < SODA 0 1 UAI SDM 4 0 PODS LeonidLibkin STOC AviW igderson 0 ICDT Redescriptions mined by Algorithm 2 from DBLP data set (with hierarchical (5 clusters) binarization routine; IG-impurity measure;) p-val. Redescription 1 , 1 LHS is a left-hand side part of the redescription; RHS is a right hand-side part of a redescription; J - Jaccard similarity E Table B.6: J 0.674 58 0 0.649 14880.637 0.084 1399 0 0.621 195 0 0.571 4 0 0.564 211 0 0.558 373 0 0.417 5 0 0.411 92 0 0.342 13 0 0.333 31 0 0.313 5 0 Appendix B Redescription Sets from experiments with DBLP data Set 115 ∨ ∧ ∨ ∧ ≥ ≥ ≥ 5 5 5 . 5 . . ←→ ←→ . 5 1 0 0 . 8 5 0 . 13 ≥ 0 ≥ ≥ ≥ ≥ ≥ Continued ≥ ShengMa JudeaP earl LeonidLibkin < - support. STOC StefanKramer < FOCS ←→ ∨ 1 , MosesCharikar < ←→ MosesCharikar < HarryBuhrman < 1 5 ∧ ∨ ∧ 5 . RobertE.Schapire ST ACS 5 . DiveshSrivastava < E ∨ 5 ∨ . 0 5 5 . 2 . ∧ . 0 ∨ 0 5 5 0 ∧ 0 . . 5 ←→ ≥ . 5 Y ehoshuaSagiv 0 0 ≥ ≥ SIGMODConference < . DavidJ.DeW itt ≥ 0 JaySethuraman 5 4 ≥ ≥ 0 . ∧ ∧ ≥ 0 ∧ ≥ ∧ 5 5 . 5 . ≥ . 1 5 5 1 . 0 0 ≥ ≥ FOCS ≥ ICDT ICML ∧ 5 ∧ ∧ . W W W < 0 SureshV enkatasubramanian 5 5 . . ∧ 3 ∧ 2 ≥ EDBT JudeaP earl < 2 5 RolfW iehagen LungW u ∧ . V LDB < ∨ 0 5 ∧ ≥ . 5 − . ∧ 5 2 HarryBuhrman MosesCharikar . ≥ 0 SridharRajagopalan 5 0 . 5 ICML . ∧ ≥ JeffreyXuY u 0 DiveshSrivastava 2 SIGMODConference Kun ∧ 5 P ODS < ∧ ←→ . ←→ ECML < 5 ∧ ∧ RadekV ingralek < . 0 V LDB 5 ∨ 5 . 0 ∨ 5 5 5 . . . 2 ∧ ←→ . ∧ 5 8 5 DiveshSrivastava < . 8 1 0 . W W W < 5 5 5 3 ≥ . . . 3 5 ∨ ≥ . ∨ ≥ 0 5 8 ≥ ≥ 0 5 ≥ 5 ≥ . . ≥ ≥ 2 UAI < 5 0 ≥ . 1 ∨ ≥ MosesCharikar 5 ≥ RolfW iehagen . ≥ STOC AshishGupta < ∨ SDM 3 STOC ∧ 5 COLT ∧ ∧ WWW ∧ KeW ang . 5 ICDE ∧ ≥ RobertE.Schapire < SODA . 0 P hilipS.Y u 5 5 ∧ 5 . ∧ . 0 . 5 ∧ ∧ ∨ . 0 PODS 0 ∧ 5 3 ≥ Y ehoshuaSagiv < 5 . 5 5 5 0 5 . . . . ←→ 5 ∧ . 0 . ≥ 0 0 0 ∧ ≥ 0 10 0 5 5 0 . ≥ . 5 5 COLT . ≥ ≥ 2 ≥ ≥ 0 ∧ ≥ AndrewT omkins < 5 SIGMODConference < . 0 ∨ FOCS ∨ ICDM < FOCS 5 ICML < ICDT < ∧ . 5 StephenA.F enner STOC ManfredK.W armuth . V LDB 5 0 ∨ ∧ ∧ 5 . 0 ∧ EDBT < . ∧ V LDB ∧ ∧ 2 MihalisY annakakis ∧ 5 5 5 0 OdedGoldreich . ShengMa < 5 . . MoniNaor MoniNaor ∧ ≥ 5 5 ≥ . 5 ∧ 5 4 . . ∧ 5 2 . . 5 3 ∧ ∨ . ∧ ∧ 0 0 5 5 0 2 5 . . RadekV ingralek 2 . ICML < ≥ 5 5 RobertE.Schapire < 5 5 ≥ 0 0 . . . . 0 ≥ ≥ StefanKramer < ∧ ∧ 0 0 ∧ 0 0 ≥ LeonidLibkin < ≥ 5 ∨ 5 ≥ 5 . . . ≥ ≥ ≥ ≥ 5 0 ∨ 0 0 . 0 5 . ST ACS < ICDE 0 SODA < ECML < P ODS < ≥ FOCS ∨ 4 W W W < ∧ ∨ ∨ ≥ WWW ∨ ∧ 5 ∨ 5 . 5 UAI < 5 . 5 ∧ 5 . 5 . . 0 . . 3 2 5 ∨ 5 3 5 2 AndrewT omkins . 5 ≥ 8 . 
≥ ≥ ≥ ≥ ≥ 0 ≥ DavidJ.DeW itt < AviW igderson AviW igderson RobertE.Schapire ManfredK.W armuth AviW igderson Y ehoshuaSagiv RaymondT.Ng ≥ ←→ JudeaP earl < ∧ ∧ ∧ ∧ ∧ ∧ ∧ ∧ ∨ 5 . 5 5 5 5 5 5 5 5 5 ...... KDD ICDM 0 WWW 14 AndrewT omkins < SODA < 0 T.S.Jayram < SODA 0 ECML StefanKramer 0 SIGMODConference SIGMODConference < 0 JosephM.Hellerstein UAI 0 0 LeonidLibkin ST ACS 0 PODS 0 Redescriptions mined by Algorithm 2 from DBLP data set (with hierarchical (5 clusters) binarization routine; IG-impurity measure;) p-val. Redescription 1 , 1 LHS is a left-hand side part of the redescription; RHS is a right hand-side part of a redescription; J - Jaccard similarity E Table B.6: 0.2 2 0 0.194 7 0 0.192 37 0 J 0.253 19 0 0.245 39 0 0.224 24 0 0.22 29 0 0.212 25 0 0.203 41 0 0.2 26 0 116 Appendix B Redescription Sets from experiments with DBLP data Set ∨ ∨ ∨ ∧ ≥ ≥ ≥ ≥ 5 5 5 5 . . . ←→ ←→ . 0 0 0 0 5 5 . . 0 0 ≥ ≥ ≥ PODS V LDB Continued V LDB ≥ ≥ ∧ ∧ 5 ∧ DanSuciu < . 5 . QiangY ang < 5 0 ∨ 12 . 3 5 ∨ Y uriBreitbart < ≥ . - support. ≥ FOCS FOCS 0 P eterGrunwald < 1 5 , ∨ T omM.Mitchell < . ErikD.Demaine < ≥ 1 ∨ ∧ ∧ 0 JuhoRousu < 5 ≥ E . ∨ 5 ∨ 5 5 . 0 . . ∨ 5 ≥ 0 5 0 0 . . 5 PODS ICDE ≥ 0 . SIGMODConference 0 MichaelJ.Carey ≥ 0 ≥ ≥ ∧ ∧ ∧ SDM ≥ 5 ∧ ≥ SurajitChaudhuri 5 5 . . . ≥ ∧ 2 5 0 0 . 5 0 . ≥ 2 AbrahamSilberschatz ←→ ICML ICML 5 . 0 5 ∧ ∧ . 7 5 5 HamidP irahesh ≥ H.V.Jagadish . . ←→ ICDT < 2 1 EDBT < ICDE ∧ ∧ ≥ ∨ ∨ RolfW iehagen ∧ 5 MichaelJ.Carey 5 5 5 . 5 . 5 5 . . ∧ 5 . . 0 . . 0 ∧ 0 3 5 1 1 0 ICDM < T omM.Mitchell . RolfW iehagen 15 5 ErikD.Demaine 0 . ≥ ≥ ≥ ≥ ∨ ∧ 0 ≥ 5 5 . . ←→ ECML < ECML < 0 3 ←→ 5 ∨ ∨ . 5 . 1 SurajitChaudhuri < ≥ 5 5 . . W W W < 6 3 3 5 ∧ ∨ ≥ . ≥ 0 V LDB 5 5 ≥ ≥ . 0 ≥ JosephM.Hellerstein F lipKorn < ∧ 5 . 5 . ≥ ICDE 2 ∧ ∧ 5 RaymondT.Ng 0 RakeshAgrawal . JianyongW ang < 5 5 0 SIGMODConference ∧ . ≥ . ∧ ∧ PKDD ≥ FOCS ∧ 0 0 COLT COLT 5 5 5 ∧ 5 . . . RobertE.Schapire < ∧ . 5 5 ∧ . ≥ ≥ ∧ ∧ 0 1 3 . 5 AbrahamSilberschatz < 0 ∧ . 1 5 0 5 V LDB SIGMODConference 5 5 . 5 . SIGMODConference 0 5 . . ≥ ∧ ≥ . ≥ 0 . 0 ∧ 0 0 ∧ ≥ 1 6 1 ∧ 5 RobertE.Schapire < 5 . 5 5 ≥ . . . 5 ≥ 0 ≥ ≥ . ≥ 0 0 ∧ 0 12 ICDT 5 5 ≥ . . 0 0 SDM < ∧ H.V.Jagadish DanielF.Lieuwen < ≥ ∧ ∧ 5 V LDB STOC ICML < ICML < . ∨ ECML 5 ManfredK.W armuth 0 5 . Y uriBreitbart < ∧ . 5 ∧ ∧ ∧ ∧ ICDE < Y ehoshuaSagiv 0 2 QiangY ang < ∧ ICDE < P ODS < W eiW ang ∨ 5 5 5 5 5 ∧ . . . MadhuSudan . ∧ 5 ≥ . ∧ StefanKramer ≥ ∨ 5 . 0 ∧ 2 1 ∧ DanSuciu < 5 0 MichaelJ.Carey . 4 5 5 . ∧ 0 ManfredK.W armuth . . 0 5 5 ∧ 5 1 . ∨ . ∧ . 2 0 5 ∧ SurajitChaudhuri 5 . 1 0 ≥ 0 P eterGrunwald < RakeshAgrawal . 5 5 ≥ 0 . . 5 0 ∧ . ∨ 0 0 ∧ ≥ ≥ 0 JuhoRousu < 5 ≥ 5 ≥ . 5 ≥ . RiccardoSilvestri P ODS < 1 ∨ ArnoJ.Knobbe ≥ 0 ICDM < 5 ICDE < ∨ ≥ . SODA < ECML < ECML < ICML < ≥ ∨ 0 W W W < EDBT < ICDT < ←→ ∨ 5 ∨ ∨ ∨ ∨ ←→ . 5 ∨ ∨ . 5 ∨ 5 5 0 ≥ 5 5 5 5 . . . 5 2 . 5 . . . . . 5 4 2 1 0 0 4 4 . 1 2 0 DanielF.Lieuwen ≥ ≥ ≥ QiangY ang DanSuciu Y uriBreitbart ≥ ≥ ≥ ≥ ≥ ≥ ≥ ≥ ≥ RobertE.Schapire AviW igderson F lipKorn SasoDzeroski AbrahamSilberschatz JianyongW ang ←→ ←→ ←→ ∧ ∧ ∧ ∧ ∧ ∧ ←→ 5 5 5 5 5 5 5 5 5 ...... PODS AbrahamSilberschatz < ICDT 9 0 ST ACS ICDM 11 DanielF.Lieuwen < 8 1 WWW ECML P eterGrunwald 1 SODA 0 ECML JuhoRousu RobertE.Schapire EDBT 0 3 PKDD ICML 0 SurajitChaudhuri < ICDE Redescriptions mined by Algorithm 2 from DBLP data set (with hierarchical (5 clusters) binarization routine; IG-impurity measure;) p-val. 
Redescription 1 , 1 LHS is a left-hand side part of the redescription; RHS is a right hand-side part of a redescription; J - Jaccard similarity E Table B.6: 0.132 23 0 0.125 57 0 0.115 7 0 0.114 39 0 0.114 39 0 J 0.185 24 0 0.165 44 0 0.162 23 0 0.154 49 0 0.143 2 0 0.143 11 0 0.138 30 0 Appendix B Redescription Sets from experiments with DBLP data Set 117 ∨ ∧ ≥ ≥ ≥ ≥ ≥ ≥ ≥ 3 5 . 1 ≥ V LDB 5 Continued . W eiF an ∧ 0 5 ∧ . ≥ 0 5 NickKoudas < GeorgGottlob < . ElenaBaralis < HaixunW ang 1 ∨ ≥ - support. ∨ GioW iederhold ∨ 5 1 ∧ , . HamidP irahesh 5 5 1 2 AndrewW.Moore < ∧ 0 . . E 0 0 DiveshSrivastava < 5 XingquanZhu < ∨ . ≥ 0 ∧ ≥ KDD ≥ ←→ 5 ∨ . SIGMODConference SIGMODConference < SIGMODConference 5 0 5 ∧ RakeshAgrawal 3 . . ∨ ∧ ∧ 0 5 4 . 5 ≥ 5 5 . ≥ . 0 . 1 ≥ 2 HuanLiu < 3 ←→ ≥ ∨ ≥ ≥ 5 5 . . XiongW ang < 8 1 NickKoudas 5 . ICDE ∨ 0 InderpalSinghMumick JiaweiHan < ≥ ≥ 2 ∧ ICDM < ICDE ICDE 5 ∧ PODS ∧ ≥ . 5 ←→ MichaelJ.Carey ∧ BruceG.Lindsay ∨ . ≥ ∧ 0 5 5 ∧ 5 . 0 5 . 5 ∧ ∧ 5 . . . 0 . 5 0 ≥ 4 5 5 . XingquanZhu 5 0 1 ≥ . . . 11 1 ≥ 0 0 0 RaymondT.Ng < AndrewW.Moore ≥ ≥ ≥ ∨ HuanLiu ←→ 5 . ←→ 5 0 . V LDB 5 2 . ←→ V LDB < ICDE XiongW ang ∧ EDBT < ICDT < ≥ 1 ∧ ∧ 5 5 ≥ ∨ ∨ . . 5 . 5 ≥ 1 1 5 5 5 . . . 5 . ←→ 5 0 RaymondT.Ng 1 . 0 1 V ipinKumar < ≥ 5 0 5 . ∧ . ≥ ≥ ≥ ∨ CatrielBeeri < SIGMODConference 4 H.V.Jagadish < RaghuRamakrishnan 0 5 ≥ . 5 ICDE ∧ ∧ . DiveshSrivastava ∧ 0 ∧ ≥ ≥ 2 5 5 PKDD ∧ 5 . 5 ∧ . . ≥ 5 . KDD SurajitChaudhuri 0 0 . 5 0 KDD < ∧ 5 0 . . 0 ∧ ∧ V LDB 0 5 V LDB 0 ∧ Molina ≥ . SIGMODConference RaymondT.Ng 5 5 ≥ ≥ ∧ 0 5 . ∧ . ICDE . ∧ − ≥ 5 0 0 5 . 0 5 ∧ . ≥ . 2 5 3 ≥ 2 . 5 ≥ . ←→ 0 0 ≥ 5 ≥ . ICDE KDD ≥ 7 ∧ ∧ ECML SIGMODConference < ≥ 5 5 ICDM < DmitryP avlov < ICDM MichaelJ.Carey SIGMODConference < GeorgGottlob < ICDE < . . ElenaBaralis < JeffreyF.Naughton ∧ P ODS < 0 0 ∨ V LDB ∨ ∧ ∨ ∧ ∧ ∧ SDM ∨ ∨ 5 ∧ HectorGarcia ∧ 5 . 5 ∧ 5 JiaweiHan 5 5 5 AbrahamSilberschatz . . 5 5 RaymondT.Ng < 5 . RakeshAgrawal . ∧ . 1 5 . . . . 5 5 1 . 5 ∧ 2 . 1 0 . ∧ 0 0 0 ∧ 0 5 ∨ SurajitChaudhuri 0 ∧ . 1 5 1 5 . 5 . ≥ ≥ 5 2 5 PODS . 5 ≥ ≥ . ∧ ≥ 0 . 0 ≥ . 0 0 0 0 5 ∧ . ≥ ≥ RaymondT.Ng 1 ≥ ≥ 5 3 ≥ . ∧ EDBT < P KDD < 4 ≥ 5 ICML < . ICML < ∨ SDM < ∨ EDBT < ICDT < 1 ICDE < KDD < ≥ ∨ ∧ 5 5 ∨ ∨ . ∨ ∨ . ∨ 5 ≥ 5 5 0 0 . . 5 5 . 5 . . 5 . 1 0 1 . 1 0 1 2 ≥ V ipinKumar ≥ GeorgGottlob ElenaBaralis ≥ ≥ ≥ ≥ ≥ ≥ ≥ V LDB CatrielBeeri V ipinKumar < HamidP irahesh < HaixunW ang RakeshAgrawal H.V.Jagadish W eiW ang NickKoudas NadaLavrac ←→ ∧ ←→ ←→ ∧ ∨ ∨ ∧ ∧ ∧ ∧ ∧ ∧ 5 5 5 5 5 5 5 5 5 5 5 5 5 ...... 0 SIGMODConference 0 ICML 5 0 ICDT 2 ICDM 0 0 EDBT 0 RakeshAgrawal < SIGMODConference 0 SDM 0 ICDE 0 JiaweiHan PKDD EDBT 0 1 KDD Redescriptions mined by Algorithm 2 from DBLP data set (with hierarchical (5 clusters) binarization routine; IG-impurity measure;) p-val. Redescription 1 , 1 LHS is a left-hand side part of the redescription; RHS is a right hand-side part of a redescription; J - Jaccard similarity E Table B.6: 0.081 37 0 0.081 9 0 J 0.108 70 0 0.108 93 0 0.102 33 0 0.101 42 0 0.095 20 0 0.089 40 0 0.085 170.084 0 55 0 0.084 12 0 118 Appendix B Redescription Sets from experiments with DBLP data Set ∧ ∧ ∧ ∧ ≥ ≥ ≥ ≥ 5 5 5 5 . . . ←→ . 0 0 0 0 5 . 0 SDM 5 . ∧ ≥ 0 5 . Y ossiAzar ≥ 0 ∧ ≥ 5 Molina < . - support. 1 0 V LDB , StephaneLallich < 5 1 − . GeoffreyI.W ebb < ∧ 0 E ∨ V ipinKumar < ∨ 5 Y asuhikoMorimoto < 5 P eterA.F lach < . ≥ . 5 ICDM ∨ 0 . ∨ 0 SIGMODConference SIGMODConference ∨ 0 5 ∨ 2 XueminLin 5 RakeshAgrawal < . ∧ ≥ ∧ . 5 ≥ 0 . 5 0 ∧ 5 5 . ≥ ≥ 0 . ∧ . 0 5 5 1 ≥ . ≥ 5 . 
0 ≥ ≥ 0 KDD ≥ HectorGarcia JiaweiHan ∧ AviW igderson < ∧ ∧ 5 . 5 5 F lipKorn 2 . V LDB . V LDB < V LDB < 0 0 ←→ ∧ ∧ ∨ ∨ 5 5 5 ≥ 5 5 . . GioW iederhold . . . 0 2 0 0 0 ∧ P hilipS.Y u GeoffreyI.W ebb GioW iederhold P eterA.F lach ≥ ≥ ≥ ≥ 5 ≥ . ∧ ∧ 0 5 . 5 P KDD < . Y asuhikoMorimoto Molina 0 ←→ ←→ 1 ∨ − 5 ≥ KDD 5 5 . ICDT ICDE . WWW . 0 3 ∧ 0 ←→ ∧ ∧ P hilipA.Bernstein < ∧ 5 5 5 5 . . . ≥ 5 . ≥ ≥ ∨ . 0 0 7 0 0 5 RakeshAgrawal < . ≥ ≥ 0 ∨ RakeshAgrawal JiaweiHan < UAI 5 5 ∧ ICDE . ∧ ≥ 5 . ICDE ∧ . 0 JiaweiHan < 5 0 5 V ipinKumar 0 . HectorGarcia ∧ ∧ . 5 5 . 0 . ∧ 0 ≥ ICDM < ∨ 5 V LDB SODA 5 ∧ 0 ≥ 0 ≥ . . 5 5 0 ∨ . 5 0 ∧ ∧ . . ≥ 0 ≥ 0 5 5 0 5 5 . . . . ≥ 0 3 1 0 ≥ 5 . ≥ ≥ ≥ ≥ 0 ≥ KDD < KDD ECML ∧ ∧ ∧ ICDE STOC SIGMODConference < SIGMODConference < 5 5 5 . ICDM . . MosesCharikar P hilipA.Bernstein ∧ ∧ ∧ ∨ 2 0 NirF riedman 0 ∧ RakeshAgrawal 5 5 ∧ 5 5 . . . . ∧ StephaneLallich < 5 SurajitChaudhuri JianP ei P hilipS.Y u V ipinKumar < 1 5 5 . 0 0 . 5 P hilipA.Bernstein < 0 ∨ . ∧ ∧ ∧ 0 ←→ ≥ RakeshAgrawal < 0 ∨ 5 5 5 5 ←→ . . . . ≥ ∨ 5 ←→ 0 SurajitChaudhuri 0 0 0 ≥ 5 . . 5 5 0 . 5 3 . ∧ ≥ ≥ ≥ . ≥ 0 1 P KDD < ICDM < ICML < 1 5 ≥ KDD < . V LDB < V LDB < FOCS ∨ ≥ ∨ KDD < 1 ≥ ∨ ≥ ∧ ≥ ∨ ∨ ∧ 5 5 5 ∨ . . ≥ 5 . 5 5 . 5 2 . . 0 5 0 . 0 . 1 5 1 0 ≥ ≥ ≥ ≥ ≥ ≥ PODS ICDE EDBT JiaweiHan JiaweiHan JiaweiHan AviW igderson ∧ ∧ ∧ ∧ ∧ ∧ ∨ 5 5 5 5 5 5 5 ...... ICDM < 0 JiaweiHan StephaneLallich PKDD 0 ICML StephenMuggleton V LDB 0 DilysT homas KDD ICDM 0 V LDB 0 7 0 ST OC < GioW iederhold Redescriptions mined by Algorithm 2 from DBLP data set (with hierarchical (5 clusters) binarization routine; IG-impurity measure;) p-val. Redescription 1 , 1 LHS is a left-hand side part of the redescription; RHS is a right hand-side part of a redescription; J - Jaccard similarity E Table B.6: J 0.081 66 0 0.079 70 0 0.068 10 0 0.068 48 0 0.068 20 0 0.065 200.062 0 52 0 0.058 33 0
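The table legends above report, for each redescription q_L ←→ q_R, the support |E_{1,1}|, the Jaccard similarity J, and a p-value. As a reading aid, the following minimal sketch (Python; not the implementation used in this thesis) shows how these three quantities can be computed for a single redescription, assuming the standard definitions: supp(q) is the set of entities satisfying query q on its side of the data, E_{1,1} = supp(q_L) ∩ supp(q_R), J = |E_{1,1}| / |supp(q_L) ∪ supp(q_R)|, and the p-value is taken here to be the binomial tail probability of observing at least |E_{1,1}| common entities if the two supports were drawn independently. All names in the snippet are illustrative.

    # Minimal sketch of the quantities reported in Tables B.4-B.6,
    # under the assumptions stated above (not the thesis implementation).
    from math import comb

    def jaccard_and_support(supp_left, supp_right):
        """Return (J, |E_{1,1}|) for the entity sets satisfying the two queries."""
        e11 = len(supp_left & supp_right)      # entities covered by both sides
        union = len(supp_left | supp_right)    # entities covered by either side
        j = e11 / union if union else 0.0
        return j, e11

    def binomial_pvalue(n, supp_l, supp_r, e11):
        """P[at least e11 common entities] if the two supports were independent."""
        p = (supp_l / n) * (supp_r / n)        # chance a random entity is in both
        return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(e11, n + 1))

    # toy example: 10 authors, left query covers {0..4}, right query covers {3..7}
    left, right, n = set(range(5)), set(range(3, 8)), 10
    j, e11 = jaccard_and_support(left, right)
    print(j, e11, binomial_pvalue(n, len(left), len(right), e11))

For instance, the toy redescription above covers two authors on both sides (|E_{1,1}| = 2) out of eight covered overall, giving J = 0.25; the high p-value signals that such an overlap is unsurprising for independent supports, whereas the redescriptions retained in the tables have p-values close to zero.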