Redescription Mining Over Non-Binary Data Sets Using Decision Trees

Universitätdes Saarlandes Max-Planck-Institut fürInformatik Redescription Mining Over non-Binary Data Sets Using Decision Trees Masterarbeit im Fach Informatik Master's Thesis in Computer Science von / by Tetiana Zinchenko angefertigt unter der Leitung von / supervised by Dr. Pauli Miettinen begutachtet von / reviewers Dr. Pauli Miettinen Prof. Dr. Gerhard Weikum Saarbrücken, November 2014 Eidesstattliche Erklärung Ich erklärehiermit an Eides Statt, dass ich die vorliegende Arbeit selbstständigverfasst und keine anderen als die angegebenen Quellen und Hilfsmittel verwendet habe. Statement in Lieu of an Oath I hereby confirm that I have written this thesis on my own and that I have not used any other media or materials than the ones referred to in this thesis. Einverständniserklärung Ich bin damit einverstanden, dass meine (bestandene) Arbeit in beiden Versionen in die Bibliothek der Informatik aufgenommen und damit veröffentlicht wird. Declaration of Consent I agree to make both versions of my thesis (with a passing grade) accessible to the public by having them added to the library of the Computer Science Department. Saarbrücken, November 2014 Tetiana Zinchenko Acknowledgements First of all, I would like to thank Dr. Pauli Mittienen for the opportunity to write my Master thesis under his supervision and for his support and encouragement during the work on this thesis. I would like to thank the International Max Planck Research School for Computer Science for giving me the opportunity to study at Saarland University and their constant support during all the time of my studies. And special thanks I want to address to my husband for being the most supportive and inspiring person in my life. He was the first trigger for me to start and finish this degree. v Abstract Scientific data mining is aimed to extract useful information from huge data sets with the help of computational efforts. Recently, scientists encounter an overload of data which describe domain entities from different sides. Many of them provide alternative means to organize information. And every alternative data set offers a different perspective onto the studied problem. Redescription mining is tool with a goal of finding various descriptions of the same objects, i.e. giving information on entity from different perspectives. It is a tool for knowledge discovery which helps uniformly reason across data of diverse origin and integrates numerous forms of characterizing data sets. Redescription mining has important applications. Mainly, redescriptions are useful in biology (e.g. to find bio niches for species), bioinformatics (e.g. dependencies in genes can assist in analysis of diseases) and sociology (e.g. exploration of statistical and political data), etc. We initiate redescription mining with data set consisting of 2 arrays with Boolean and/or real-valued attributes. In redescription mining we are looking for such queries which would describe nearly the same objects from both given arrays. Among all redescription mining algorithms there exist approaches which exploits alternating decision tree induction. Only Boolean variables were involved there so far. In this thesis we extend these approaches to non-Boolean data and adopt two methods which allow redescription mining over non-binary data sets. Contents Acknowledgementsv Abstract vii Contents ix 1 Introduction1 1.1 Outline of Document..............................3 2 Preliminaries5 2.1 The Setting for Redescription Mining.....................5 2.2 Query Languages................................8 2.3 Propositional Queries, Predicates and Statements..............8 2.3.1 Predicates................................9 2.3.2 Statements............................... 10 2.4 Exploration Strategies............................. 13 2.4.1 Mining and Pairing........................... 13 2.4.2 Greedy Atomic Updates........................ 13 2.4.3 Alternating Scheme.......................... 14 3 Related research 15 3.1 Rule Discovery................................. 15 3.2 Decision Trees and Impurity Measures.................... 16 3.3 Redescription Mining Algorithms....................... 21 4 Contributions 25 4.1 Redescription Mining Over non-Binary Data Sets.............. 25 4.2 Algorithm 1................................... 27 4.3 Algorithm 2................................... 30 4.4 Stopping Criterion............................... 32 ix x CONTENTS 4.5 Extraxting Redescriptions........................... 34 4.6 Extending to Fully non-Boolean setting................... 35 4.6.1 Data Discretization........................... 35 4.7 Quality of Redescriptions........................... 37 4.7.1 Support and Accuracy......................... 37 4.7.2 Assessing of Significance........................ 38 5 Experiments with Algorithms for Redescription Mining 41 5.1 Finding Planted Redescriptions....................... 41 5.2 The Real-World Data Sets........................... 44 5.3 Experiments With Algorithms on Bio-climatic Data Set.......... 44 5.3.1 Discussion................................ 54 5.4 Experiments With Algorithms on Conference Data Set........... 57 5.4.1 Discussion................................ 64 5.5 Experiments against ReReMi algorithm................... 66 6 Conclusions and Future Work 69 Bibliography 71 A Redescription Sets from experiments with Bio Data Set 75 B Redescription Sets from experiments with DBLP data Set 91 Chapter 1 Introduction Nowadays we encounter massive amounts of data everywhere and increased capabilities accelerate the generation and acquisition of it. This data can be of different origin and describe diverse objects which provides the stage for active data mining in the scientific domain. There are numerous techniques and approaches to find useful tendencies, dependencies or underlying patterns in it. The data derived from scientific domains is usually less homogeneous and more massive than the one stemming from business domain. Despite the fact that a lot of data mining techniques applied in business return nice results for the science as well, some more sophisticated and tailored methods are needed to meet needs arising in science. According to Craford [12] there are two types of analytic tasks for science that can be supported by data mining. Firstly, discovery driven mining used to deriving hypothe- sizes. Secondly, verification driven mining used to support (or discourage) hypothesis, i.e. experiments. In this setting hypothesis formation requires more exquisite approaches and deeper domain-specific knowledge. Facing imposing data volumes, scientist experience overload of data for describing domain entities. The issue which comes along with it is the fact that all these data sets can offer alternative (or even sometimes contradictory) perspective on a studied data. Thus, a universal tool which is suitable for data analysis is a necessary option to have on hand. Moreover, identifying correspondences between interesting aspects of studied data is a natural task in many domains. It is well known that viewing the data from different prospective is useful for better understanding of a whole concept. Redescription mining is aimed to embody this. The ultimate goal of it is finding different ways of looking at data and extracting alternative characteristics of the same (or nearly the same) objects. As it can be concluded from the name, redescription mining is aimed to learn model from data in order to describe it and help with interpretability of investigated results. Redescription is a way of finding objects that can be described from at least two different sides. The number of views can be larger than two, but the setting with double-sided data is more common. To assist in understanding of a redescription mining concept the following example can be used: Example 1. We consider a set of nine countries as our objects of interest, namely Canada, Mexico, Mozambique, Chile, China, France, Russia, the United Kingdom and the USA. Simple toy data set [48, 43, 63] consisting of four properties characterizing 1 2 Chapter 1 Introduction these countries, represented as a Venn diagram in Figure 1.1. is also included. Consider the couple of statements below: 1. Country outside the Americas with land area more than 8 billion square kilometers. 2. Country is a permanent member of the UN security council with a history of state communism. Figure 1.1: Geographic and geopolitical characteristics of countries represented as a Venn diagram. Adapted from [48]. Blue - Located in the Americas Green - History of state communism Yellow- Land area above 8 Billion square kilometers Red - Permanent member of the UN security council Two countries (Russia and China) satisfy both statements. They show alternative characterizations of the same subset of countries from geographical and geopolitical properties. Thus, the redescription is formed. The strength of it is given by symmetric Jaccard coefficient (1/1=1). Descriptors of any side of derived redescription can contain more than one entity. This simple example provides an intuition in understanding concept of redescription. Thus, we are given multi-view data set (in our case consisting of two sub-sets describing same objects with different features). For example, in a setting of niche-finding problem for mammals studied in [23, 49], we can be provided with the one set containing species which live in particular regions. Another set will contain climatic data about same regions. Thus, the redescription mined for such a problem, can be a statement that some mammal resides

Redescription Mining Over Non-Binary Data Sets Using Decision Trees

Etruscan Shrew Muscle: the Consequences of Being Small Klaus D

Species Examined.Xlsx 8:17 PM 5/31/2011

World Record World

List of 28 Orders, 129 Families, 598 Genera and 1121 Species in Mammal Images Library 31 December 2013

Urotrichus Talpoides)

Bonner Zoologische Beiträge Band 51 (2002) Heft 4 Seiten 229-254 Bonn, Dezember 2003

1 Checklist of Indian Mammals FINAL.Pmd

Mammalia, Soricidae) from Vaskapu Cave (N-Hungary

Tactile Guidance of Prey Capture in Etruscan Shrews

List of Taxa for Which MIL Has Images

Checklist of the Central European Mammal Species 6

HEART and RESPIRATORY RATES in the SMALLEST MAMMAL, the ETRUSCAN SHREW SUNCUS ETRUSCUS (INSECTIVORA : SORICIDAE) K Jurgens, R Fons, T Peters, S Sender