UNIVERSITY LUMIÈRE LYON 2
DOCTORAL SCHOOL INFOMATHS INFORMATIQUE ET MATHÉMATIQUES (ED 512)

PhD Thesis to obtain the title of

PhD of Science

Specialty : Computer Science

Defended by Marian-Andrei RIZOIU

Semi-supervised structuring of complex data

Thesis Advisers: Stéphane LALLICH and Julien VELCIN
Prepared at the ERIC laboratory
Defended on June 24th, 2013

Jury :

Reviewers: Adrião Dória Neto - Universidade Federal do Rio Grande do Norte; Christel Vrain - LIFO Laboratory, Univ. Orleans
Advisers: Stéphane Lallich - ERIC Laboratory, Univ. Lyon 2; Julien Velcin - ERIC Laboratory, Univ. Lyon 2
Examiners: Maria Rifqi - LIP6 Laboratory, Univ. Panthéon-Assas; Frédéric Precioso - I3S Laboratory, Univ. Nice Sophia Antipolis; Gilbert Ritschard - Institute IDEMO, Univ. de Genève

Semi-Supervised Structuring of Complex Data

Abstract: The objective of this thesis is to explore how complex data can be treated using unsupervised machine learning techniques, in which additional information is injected to guide the exploratory process. The two main research challenges addressed in this work are (a) leveraging semantic information in the numerical representation of data and in the learning algorithms and (b) making use of the temporal dimension when analyzing complex data. These research challenges are derived, through a dialectical relation between theory and practice, into more specific learning tasks, which range from (i) detecting typical evolution patterns to (ii) improving data representation by using semantics, (iii) embedding expert information into the numerical description of images and (iv) using semantic resources (e.g., WordNet) when evaluating topics extracted from text. The methods we privilege when tackling these learning tasks are unsupervised and, mainly, semi-supervised clustering. Therefore, the general context of this thesis lies at the intersection of the two large domains of complex data analysis and semi-supervised clustering. We divide our work into four parts. The first is dedicated to the temporal component of data, for which we propose a temporal clustering algorithm with contiguity constraints and use it to detect typical evolutions. The second part is dedicated to the semantic reconstruction of the description space of the data, for which we propose an unsupervised feature construction algorithm that replaces highly correlated pairs of features with conjunctions of literals. In the third part, we tackle the problem of constructing a semantically enriched image representation starting from a baseline representation, and we propose two approaches for leveraging external expert knowledge, given under the form of non-positional labels. We dedicate the fourth part of our work to textual data, and more precisely to the tasks of topic extraction, using an overlapping text clustering algorithm, topic labeling, using frequent complete phrases, and semantic topic evaluation, using an external concept hierarchy. We add a fifth part, which describes the applied side of our work, CommentWatcher, an open-source platform for analyzing online discussion forums.

Keywords: complex data analysis, semi-supervised clustering, semantic data representation, temporal clustering, topic extraction, semantically enriched image representation, feature construction

Structuration semi-supervisée des données complexes Résumé : L’objectif du travail présenté dans cette thèse est d’explorer comment les données complexes peuvent être analysées en utilisant des techniques d’apprentissage automatique non-supervisé, dans lesquelles des connaissances supplémentaires sont introduites pour guider le processus exploratoire. Ce travail de recherche traite deux grandes problématiques : d’une part, l’utilisation d’informations sémantiques dans la construction de la représentation numérique ainsi que dans les algorithmes d’apprentissage automatique, et d’autre part, l’utilisation de la dimension temporelle dans l’analyse de données complexes. De ces problématiques de recherche ont émergé, au travers d’une relation dialectique entre la théorie et la pratique, des tâches plus précises, à savoir : (i) la détection d’évolutions typiques, (ii) l’amélioration de la représentation des données en utilisant leur sémantique, (iii) l’introduction d’informations expertes dans la représentation numérique des images et (iv) l’utilisation de ressources sémantiques additionnelles (comme WordNet) pour l’évaluation des thématiques extraites à partir du texte. Les méthodes qu’on privilégie dans notre travail sont des méthodes non-supervisées et, notamment, des méthodes semi-supervisées. Par conséquent, le contexte général de cette thèse se situe au croisement des domaines de l’analyse de données complexes et du clustering semi-supervisé. Nous divisons notre travail en quatre parties. La première partie est dédiée à la dimension temporelle des données, et dans cette optique nous proposons un algorithme de clustering temporel avec des contraintes de contiguïté, que nous appliquons à la détection d’évolutions typiques. La deuxième partie quant à elle s’intéresse à la reconstruction sémantique de l’espace de représentation, tâche pour laquelle nous proposons un algorithme non-supervisé de construction d’attributs, dont le principe de base est de remplacer les paires d’attributs hautement corrélés par des conjonctions de ceux-ci. Dans la troisième partie, nous traitons le problème de la construction de représentations numériques sémantiquement enrichies des images et pour cela nous proposons deux approches qui utilisent des connaissances expertes sous la forme d’annotations. Enfin, la dernière partie de nos travaux théoriques est dédiée aux données textuelles, plus précisément aux tâches d’extraction de thématiques à l’aide d’un algorithme de clustering avec recouvrement, de nommage de thématiques par des expressions intelligibles, ainsi qu’à l’évaluation sémantique des thématiques, en utilisant une hiérarchie de concepts. A ce travail vient s’ajouter une cinquième partie pratique, dont l’aboutissement est la plateforme CommentWatcher qui permet d’analyser les forums de discussion en ligne.

Mots clés : analyse de données complexes, clustering semi-supervisé, représentation sémantique de données, clustering temporel, extraction de thématiques, représentation des images enrichie avec de la sémantique, construction non-supervisée des attributs

Acknowledgments

First of all, I would like to thank my two PhD advisers, Stéphane Lallich and Julien Velcin, for accompanying me on this four-year journey. They have shown me what research means and they have taught me how to appreciate it. They have shown me how one can be rigorous, yet creative, and they have become, in the meantime, my role models. I would also like to thank Christel Vrain and Adrião Dória Neto for accepting to dedicate their time and energy to reviewing my thesis. I equally thank the examiners, Maria Rifqi, Frédéric Precioso and Gilbert Ritschard, for doing me the honor of participating in the jury of my thesis. A special thanks goes to Julien Ah-Pine, who, in addition to being a good and supportive friend, kindly accepted to read this manuscript and to provide educated opinions, which often made me see things in a new light. I also acknowledge my work colleagues and friends, Adrien and Mathilde, who transformed the office into an exhilarating place. And last, but not least, I thank my friends and family for putting up with me.

Contents

1 Introduction
  1.1 The big picture
  1.2 Research project
  1.3 The constituent parts of the thesis
  1.4 Content of the different chapters

2 Overview of the Domain
  2.1 Complex Data Mining
    2.1.1 Specificities of complex data
    2.1.2 Dealing with complex data of different natures
    2.1.3 Temporal/dynamic dimension
    2.1.4 High data dimensionality
  2.2 Semi-Supervised Clustering
    2.2.1 Similarity-based approaches
    2.2.2 Search-based approaches

3 Detecting Typical Evolutions
  3.1 Learning task and motivations
  3.2 Formalisation
  3.3 Related work
  3.4 Temporal-Driven Constrained Clustering
    3.4.1 The temporal-aware dissimilarity measure
    3.4.2 The contiguity penalty function
    3.4.3 The TDCK-Means algorithm
    3.4.4 Fine-tuning the ratio between components
  3.5 Experiments
    3.5.1 Dataset
    3.5.2 Qualitative evaluation
    3.5.3 Evaluation measures
    3.5.4 Quantitative evaluation
    3.5.5 Impact of parameters β and δ
    3.5.6 The tuning parameter α
  3.6 Current work: Role identification in social networks
    3.6.1 Context
    3.6.2 The framework for identifying social roles
    3.6.3 Preliminary experiments
  3.7 Conclusion and future work

4 Using Data Semantics to Improve Data Representation
  4.1 Learning task and motivations
    4.1.1 Why construct a new feature set?
    4.1.2 A brief overview of our proposals
  4.2 Related work
  4.3 uFRINGE - adapting FRINGE for unsupervised learning
  4.4 uFC - a greedy heuristic
    4.4.1 uFC - the proposed algorithm
    4.4.2 Searching co-occurring pairs
    4.4.3 Constructing and pruning features
  4.5 Evaluation of a feature set
    4.5.1 Complexity of the feature set
    4.5.2 The trade-off between two opposing criteria
  4.6 Initial Experiments
    4.6.1 uFC and uFRINGE: Qualitative evaluation
    4.6.2 uFC and uFRINGE: Quantitative evaluation
    4.6.3 Impact of parameters λ and limititer
    4.6.4 Relation between number of features and feature length
  4.7 Improving the uFC algorithm
    4.7.1 Automatic choice of λ
    4.7.2 Stopping criterion. Candidate pruning technique
  4.8 Further Experiments
    4.8.1 Risk-based heuristic for choosing parameters
    4.8.2 Pruning the candidates
    4.8.3 Algorithm stability
  4.9 Usage of the multi-objective optimization techniques
  4.10 Conclusion and future work

5 Dealing with images: Visual Vocabulary Construction
  5.1 Learning task and motivations
    5.1.1 An overview of our proposals
    5.1.2 Constructing a baseline “bag-of-features” image numerical description
  5.2 Context and related work
    5.2.1 Sampling strategies and numerical description of image features
    5.2.2 Unsupervised visual vocabulary construction
    5.2.3 Leveraging additional information
  5.3 Improving the BoF representation using semantic knowledge
    5.3.1 Dedicated visual vocabulary generation
    5.3.2 Filtering irrelevant features
  5.4 Experiments and results
    5.4.1 Experimental protocol
    5.4.2 The learning task: content-based image classification
    5.4.3 Datasets
    5.4.4 Qualitative evaluation
    5.4.5 Quantitative evaluation
    5.4.6 Overfitting
    5.4.7 Influence of parameter α
  5.5 Conclusions and future work

6 Dealing with text: Extracting, Labeling and Evaluating Topics
  6.1 Learning task and motivations
  6.2 Transforming text into numerical format
    6.2.1 Preprocessing
    6.2.2 Text numeric representation
  6.3 An overview on Topic Extraction
    6.3.1 Text Clustering
    6.3.2 Topic Models
    6.3.3 Topic Labeling
    6.3.4 Topic Evaluation and Improvement
  6.4 Extract, Evaluate and Improve topics
    6.4.1 Topic Extraction using Overlapping Clustering
    6.4.2 Topic Evaluation using a Concept Hierarchy
  6.5 Applications
    6.5.1 Improving topics by Removing Topical Outliers
    6.5.2 Concept Ontology Learning
  6.6 Conclusion and future work

7 Produced Prototypes and Software
  7.1 Introduction
  7.2 Discussion Forums
    7.2.1 Current limitations
    7.2.2 Related works
    7.2.3 Introducing CommentWatcher
  7.3 Platform Design
    7.3.1 Software technologies
    7.3.2 Platform architecture
    7.3.3 The fetching module
    7.3.4 Topic extraction and textual classification
    7.3.5 Visualization
  7.4 License and source code
  7.5 Conclusion and future work

8 Conclusion and Perspectives
  8.1 Thesis outline
  8.2 Original contributions
  8.3 General conclusions
  8.4 Current and Future work
    8.4.1 Current work
    8.4.2 Future work

A Participation in Research Projects
  A.1 Participation in projects
  A.2 The ImagiWeb project

B List of Publications

Bibliography

Chapter 1 Introduction

Contents
  1.1 The big picture
  1.2 Research project
  1.3 The constituent parts of the thesis
  1.4 Content of the different chapters

1.1 The big picture

The early stages of the Web (i.e., the Web 1.0) were made of static pages: users could consult their content, but not contribute to it. The Web 2.0 allowed users to interact and collaborate with each other, while dynamically generating the content of web pages. The new paradigm contributed to changing the way in which information is produced, shared and consumed. Users read, watch or listen to existing material, then they react, post, describe and tag, thereby enriching the available information. All this freely accessible information is a non-exhaustible source of data.

Internet-originating data is just one example of a broader class of data, called complex data. Complex data are heterogeneous data (e.g., text, images, video, audio, etc.), which are further interlinked through the structure of the complex document (i.e., the webpage, in the case of the Internet) in which they reside. These data have a high dimensionality and very often a temporal dimension attached. The temporal aspect is particularly important for news articles or online social network postings. The difficulties of dealing with the complex data originating from the Web 2.0 (i.e., the immense quantities of unstructured and semi-structured heterogeneous data) are the central points of the main applications related to the Internet, such as Information Search and Retrieval (finding useful information in the enormous amounts of available data is still the most prevalent user task on the Internet), Categorization (a collective effort to organize the available information, e.g., folksonomies such as Delicious), or Recommender Systems (recommending new content based on the habits of the user inferred from the currently viewed content).

The difficulties introduced by the Web 2.0 led to the emergence of the Semantic Web, which is linked to converting the current unstructured and semi-structured documents into a “Web of data” by including machine-readable semantic content into web pages. The purpose of the Semantic Web is to provide a common framework that allows information to be shared and reused across application, enterprise and community boundaries. It involves publishing in languages specifically designed for data (such as RDF, OWL and XML). The machine-readable descriptions enable content managers to add meaning to the content. In this way, the machine can process information at a semantic level, instead of at the level of text, thereby obtaining more meaningful results. This semantic information is gathered in knowledge repositories, such as freely accessible ontologies (e.g., DBpedia [Bizer et al. 2009], Freebase). One of the main challenges of the Semantic Web is obtaining a semantic representation of data. The main problem of representing data of different natures (e.g., image, text) is that the low-level features used to digitally represent data are far removed from the semantics of the content.
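To give a concrete feel for what such machine-readable semantic content looks like, the following minimal sketch builds a few RDF statements about a country with the rdflib Python library; the example.org namespace and the property names (capital, population) are invented for the illustration and are not taken from DBpedia.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS, XSD

# Hypothetical namespace, used only for this illustration.
EX = Namespace("http://example.org/")

g = Graph()
g.bind("ex", EX)

# Machine-readable statements (subject, predicate, object) about "France".
g.add((EX.France, RDF.type, EX.Country))
g.add((EX.France, RDFS.label, Literal("France", lang="en")))
g.add((EX.France, EX.capital, EX.Paris))
g.add((EX.France, EX.population, Literal(65000000, datatype=XSD.integer)))

# Serialize the graph in Turtle, one of the textual syntaxes for RDF.
print(g.serialize(format="turtle"))
```

Such triples are what allow a machine to answer queries at the level of concepts (countries, capitals) rather than at the level of raw text.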

Our work: research challenges and privileged methods. The main research challenge of the work presented in this thesis is leveraging semantics when dealing with complex data. Chapters 4, 5 and 6 approach the problems of introducing human knowledge (e.g., labels, knowledge repositories) into the learning process and of semantically reconstructing the description space of data. We distinguish between two sub-challenges: (i) translating data into a semantic-aware representation space, which deals with constructing a representation space that better embeds the semantics and which can be used directly with classical machine learning algorithms, and (ii) injecting knowledge into machine learning algorithms, which deals with modifying the machine learning algorithms so that they take semantics into account while inferring knowledge. The second research challenge of this thesis is leveraging the temporal dimension of complex data. The temporal dimension is more than just another descriptive dimension of data, since it profoundly changes the learning problem. The description of data becomes contextualized (i.e., a certain description is true during a given time frame) and new learning problems arise: following the temporal evolution of individuals, detecting trends, topic burstiness, tracking popular events, etc. The temporal dimension is intimately related to the interactive aspect of the Web 2.0. We approach the temporal dimension in Chapter 3, where we develop a clustering algorithm that takes time into account in order to construct temporally coherent clusters. In the work presented in this thesis, we deal with each of these research challenges individually. We currently have ongoing work (detailed in Chapter 8), which will allow us to integrate our two research challenges. The methods we privilege when tackling our two major research problems are unsupervised and, mainly, semi-supervised clustering. Semi-supervised clustering [Davidson & Basu 2007] is essentially an unsupervised learning technique, in which partial knowledge is leveraged in order to guide the clustering process. Unlike semi-supervised learning [Chapelle et al. 2006], where the accent is on dealing with missing data in supervised algorithms, semi-supervised clustering is used when the expert knowledge is incomplete or in such low quantity that it would be impossible to apply supervised techniques. We use this partial knowledge to model the semantic information and the temporal dimension when analyzing complex data. Therefore, the general context of this thesis lies at the intersection of the two large domains of complex data analysis and semi-supervised clustering.

1. https://delicious.com/
2. Semantic Web and Web 3.0 are often used as synonyms; their definition is not yet standardized.
3. http://www.w3.org/RDF/
4. http://www.w3.org/OWL/
5. http://www.w3.org/XML/
6. http://www.dbpedia.org
7. http://www.freebase.com/

[Figure 1.1 is a diagram reading from left to right: the input complex data (text, image, numeric and temporal data), the privileged methods and tools (unsupervised and semi-supervised machine learning), the research challenges (semantic representation, leveraging the temporal dimension), the derived tasks (description space reconstruction, content-based image classification, topic extraction, labeling and evaluation, temporal clustering, social role identification) and example applications (information access, knowledge discovery from data, social network analysis, CommentWatcher), with human knowledge / metadata as an additional input.]

Figure 1.1 – The schema structuring the work in this thesis: starting with the input complex data and ending with examples of applications.

The remainder of this chapter presents an overview of the research project and the detailed motivations of our work (Section 1.2), describes the organization of the different parts of our work (Section 1.3) and gives a detailed plan of the manuscript (Section 1.4). The research work presented in this thesis was performed at the ERIC Laboratory in Lyon, France, in the Data Mining and Decision team.

1.2 Research project

Complex Data Mining is a very vast domain, touching Computer Vision, Natural Language Processing, Artificial Intelligence and even Sociology (e.g., constructing and analyzing online social networks). I have, therefore, derived the two central research challenges and focused on more specific learning tasks, which range from (i) detecting typical evolution patterns to (ii) improving data representation by using semantics or (iii) embedding expert information into the numerical description of images. My research project was built incrementally, through a dialectical relation between theory and practice. The two advanced together, influencing each other along the process. The aforementioned learning tasks I am interested in are partially motivated by the specific problems and applications needed by the different research projects in which I was involved (see Appendix A). Figure 1.1 presents the schema of the work presented in this manuscript. We start from a subset of types of complex data (i.e., we focus on text, image, numeric data and data with a temporal dimension), shown on the left of the schema. In the work we performed, we analyze each type of data independently. A perspective of our work is a broader integration of all the information provided by complex data, in order to take advantage of every available piece of information. On the right side of the schema in Figure 1.1 are the final abstract applications of our work, such as Information Access, Knowledge Discovery from Data, Social Network Analysis or CommentWatcher, an online media analysis tool and the result of our applied work. In between we present, from right to left (from the purpose of our work, i.e., the output, to the input), (a) the more specific learning tasks we approach in our work, (b) the research challenges from which the learning tasks derive and (c) the methods and tools we privilege in order to address our research challenges. The arrows indicate how the different parts were used in our research. For example, we are interested in leveraging semantics in the construction of the representation for images, text or numeric data. Similarly, the research challenge of leveraging the temporal dimension is derived into the two specific tasks of temporal data clustering and social role identification. These can, in turn, be used in applications such as Knowledge Discovery from Data, Social Network Analysis or CommentWatcher, the developed software.

The motivations of our work. The motivations behind our work can be summarized at different abstraction levels. On an applied level, our work is related to the various research needs of the projects in which I was involved (more details in Appendix A). The task of detecting typical evolution patterns is related to the interests of researchers in Political Sciences involved in the ImagiWeb project. Another example is the multi-sided link between the research collaboration with the Technicolor laboratories, the Crtt-Eric project, our work concerning the textual dimension, the task of Social Role Identification and CommentWatcher (which is presented in detail in Chapter 7). On the problems and solutions level, our work was motivated by the need to propose solutions for a series of specific learning tasks. We proposed new algorithms (e.g., the temporal clustering algorithm TDCK-Means, the feature construction algorithm uFC), new measures (e.g., the temporal-aware dissimilarity measure), parameter choice heuristics (e.g., the χ2 hypothesis testing-based heuristic), etc. On the research challenges level, at the core of our research work are the two research challenges detailed earlier: (a) embedding semantics into data representation and machine learning algorithms and (b) leveraging the temporal dimension. On the abstract applications level, at a meta level, our work is motivated by and can be used in applications such as Information Access, Knowledge Discovery from Data or Social Network Analysis.

8. http://eric.univ-lyon2.fr/~jvelcin/imagiweb/
9. https://research.technicolor.com/rennes/

The three most important original ideas in our work. Throughout this manuscript, the reader will find a number of original proposals. In the following, we single out three of the most important ideas of our research.

Taking into account both the temporal dimension and the descriptive dimension in a clustering framework. The resulting clusters are coherent from both the temporal and the descriptive points of view. Constraints are added to ensure the contiguity of the entity segmentation.

Unsupervised construction of a feature set based on the co-occurrences issued from the dataset. This allows adapting a feature set to the dataset's semantics. The new features are constructed as conjunctions of the initial features and their negations, which renders the result comprehensible for the human reader.

Using non-positional user labels (denoting objects) to filter irrelevant visual features and to construct a semantically aware visual vocabulary for a “bag-of-features” image representation. We use the information about the presence of objects in images to detect and remove features unlikely to belong to the given object. Dedicated visual vocabularies are constructed, resulting in a numerical description which yields higher object categorization accuracy.
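To make the second of these ideas more concrete, here is a toy sketch of the underlying principle; it is only an illustration loosely in the spirit of the uFC algorithm presented in Chapter 4, not its actual implementation, and the function name, feature names and correlation threshold are invented for the example.

```python
import numpy as np

def replace_most_correlated_pair(X, names, threshold=0.6):
    """Toy illustration: find the most correlated pair of boolean features
    (f1, f2) and replace them with the conjunctions f1&f2, f1&~f2, ~f1&f2."""
    corr = np.nan_to_num(np.corrcoef(X.T.astype(float)))  # feature-feature correlations
    np.fill_diagonal(corr, 0.0)
    i, j = np.unravel_index(np.abs(corr).argmax(), corr.shape)
    if abs(corr[i, j]) < threshold:
        return X, names                                    # no pair is correlated enough
    f1, f2 = X[:, i], X[:, j]
    new_cols = np.column_stack([f1 & f2, f1 & ~f2, ~f1 & f2])
    new_names = [f"{names[i]} AND {names[j]}",
                 f"{names[i]} AND NOT {names[j]}",
                 f"NOT {names[i]} AND {names[j]}"]
    keep = [k for k in range(X.shape[1]) if k not in (i, j)]
    return (np.column_stack([X[:, keep], new_cols]),
            [names[k] for k in keep] + new_names)

# Example: "water" and "boat" co-occur strongly, so they are merged into conjunctions.
X = np.array([[1, 1, 0], [1, 1, 1], [0, 0, 1], [1, 1, 0]], dtype=bool)
print(replace_most_correlated_pair(X, ["water", "boat", "sky"])[1])
```

The constructed features remain readable boolean expressions over the original attributes, which is what keeps the result comprehensible for a human reader.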

1.3 The constituent parts of the thesis

Given the great diversity of the approached subjects, I divide my work into four distinct, yet complementary parts. The four parts deal, respectively, with (a) the temporal dimension, (b) semantic data representation, and the different natures of complex data, i.e., (c) image and (d) text. Each part is dealt with in an individual chapter, which contains an overview of the state of the art of the domain, the proposals, conclusions about the work and some plans for future work. A fifth chapter is dedicated to the practical aspects of my work, most notably CommentWatcher, an open-source platform for online discussion analysis. Therefore, each of the five chapters can be seen as autonomous, while remaining connected through the directive guidelines, the transverse links between them and their conceptual articulation. Each of these is further detailed in the following paragraphs.

Directive guidelines. The core research challenges are translated into directive guidelines that run throughout my research: (i) human comprehension, (ii) translating data of different natures into a semantic-aware description space and (iii) devising algorithms and methods that embed semantics and the temporal component. In each of our proposals, we consider it crucial to generate human-comprehensible outputs. Black-box approaches exist for many problems (e.g., Principal Component Analysis is a solution for re-organizing the description space), but the semantic meaning of their output is not always clear and, therefore, the latter is difficult to interpret. Our proposals are developed with human comprehensibility in mind. Another directive guideline of our work is translating data of different natures into a semantic-aware description space, which we call throughout this manuscript the Numeric Vectorial Space. Constructing such a description space usually consists in (a) rendering the data into a common usable numeric format, which succeeds in capturing the information present in the native format, and (b) efficiently using external information to improve the numeric representation. Finally, a central axis of our research is devising algorithms and methods that embed semantics and the temporal component, based on unsupervised and semi-supervised techniques. Often, additional information and knowledge is attached to the data, under the form of (a) user labels, (b) the structure of interconnected documents or (c) external knowledge bases. We use this additional knowledge on multiple occasions, usually through semi-supervised clustering techniques. We also use semi-supervised constraints to model the temporal dependencies in the data.

Transverse links. There are multiple transverse links between the individual parts of our research. Our work with textual data is intimately linked with the CommentWatcher software. The text from online discussion forums is retrieved, we extract topics from it and we infer a social network using the forum's reply-to relation. The social network is modeled and visualized as a multidigraph, in which links between nodes are associated to topics. Furthermore, the temporal-driven clustering we propose is applied to detect social roles in the social network. Another transverse link concerns our feature construction algorithm, which was initially motivated by the need to re-organize the user label set we use to create the semantically-enabled image representation. We also have ongoing work which deals with embedding the temporal dimension into this feature construction algorithm. The idea is to detect whether features are correlated with a certain time lag.

Conceptual articulation of the different parts. It is noteworthy that the work presented in this thesis is not a blueprint of an integrated complex data analysis system. Realizing such a system would have been possible only in the context of a very specific (applied) problem, which is not the case for the different collaborations and projects in which I was involved. Nevertheless, a conceptual articulation exists between all the parts of our work: data of different natures is translated into a common semantic-aware numeric format, which is afterwards used together with the temporal dimension or with external knowledge bases. Over the next chapters, we evolve the schematic representation of our work in Figure 1.1 into a complete conceptual integration of our proposals. At the end of each chapter, the reader is shown how the work presented in the given chapter can be conceptually integrated with the work in previous chapters. We incrementally evolve the schema in Figure 1.1 in Figures 3.1 (p. 30), 4.17 (p. 93), 5.13 (p. 122) and 6.13 (p. 165) into the complete schema in Figure 8.1 (p. 182).

1.4 Content of the different chapters

Excepting the current chapter, this manuscript is structured over six chapters, as follows.

In Chapter 2, we present a general overview of complex data mining. Starting from the specificities of complex data, we identify some of the difficulties of analyzing them and we present some of the solutions existing in the literature. We present the field of semi-supervised clustering in a similar fashion: we start from the necessities of semi-supervised clustering, its advantages and difficulties. We present the taxonomy and briefly describe some of the most relevant existing approaches. All along this chapter, we position our work in the broader context of these two domains.

In Chapter 3, we leverage the temporal dimension of complex data and we apply our proposals to the learning task of detecting typical evolution patterns. We propose a new temporal-aware dissimilarity measure and a segmentation contiguity penalty function. We combine the temporal dimension of complex data with a semi-supervised clustering technique. We propose a novel time-driven constrained clustering algorithm, called TDCK-Means, which creates a partition of coherent clusters, both in the multidimensional space and in the temporal space. We also show how this temporal clustering algorithm can be applied to a different task: finding behavioral roles in an online community.

In Chapter 4, we regroup our research concerning the task of semantic description space reconstruction. We seek to construct, in an unsupervised way, a new description space which embeds some of the semantics present in a given dataset. The constructed features (i.e., the dimensions of the new description space) are, at the same time, comprehensible for a human user. We propose two algorithms that construct the new features as conjunctions of the initial primitive features or their negations. The generated feature sets have reduced correlations between features and succeed in catching some of the hidden relations between individuals in a dataset. We also propose a method based on statistical testing for setting the values of the parameters.

Chapter 5 presents our research concerning image data. We are particularly interested in the task of improving image representation using semi-supervised visual vocabulary construction. We present the “bag-of-features” representation, one of the most widely used methods for translating images from their native format to a numeric Vectorial Space. We are interested in using expert knowledge, under the form of non-positional labels attached to the images, in the process of creating the numerical representation. We propose two approaches: the first one is a label-based visual vocabulary construction algorithm, while the second deals with filtering the irrelevant features for a given object, in order to improve object categorization accuracy.

Chapter 6 presents in detail our research concerning textual data; more precisely, we are interested in the task of topic extraction and evaluation. After a presentation of the “bag-of-words” representation, we make an in-depth review of the topic extraction and evaluation literature, while referencing methods related to our general domain of interest (e.g., incorporating the temporal dimension or external semantic knowledge). We complete this bibliographic research with the presentation of a textual clustering-based topic extraction system and a topic evaluation system based on an external semantic knowledge base.
At the end of the chapter, we present some applications of this system to the Ontology Learning process and to topic improvement by removing spurious words.

Chapter 7 presents the practical prototype production. The most prominent produced software is CommentWatcher, an open-source tool for analyzing discussions on web forums. Constructed as a web platform, CommentWatcher features (i) automatic fetching of forums, using a versatile parser architecture, (ii) topic extraction from a selection of texts and (iii) a temporal visualization of extracted topics and the underlying social network of users. It targets both media watchers (it allows quick identification of important subjects in the forums and of user interest) and researchers in social media (who can use it to constitute temporal textual datasets).

In the last chapter, Chapter 8, we draw some general conclusions about our work. We also present the work we currently have under way and outline our plans for near-term and long-term future research.

Chapter 2 Overview of the Domain

Contents
  2.1 Complex Data Mining
    2.1.1 Specificities of complex data
    2.1.2 Dealing with complex data of different natures
    2.1.3 Temporal/dynamic dimension
    2.1.4 High data dimensionality
  2.2 Semi-Supervised Clustering
    2.2.1 Similarity-based approaches
    2.2.2 Search-based approaches

The purpose of this chapter is to present a general overview and to familiarize the reader, if this is not already the case, with the two large domains around which our work revolves: Complex Data Mining (in Section 2.1) and Semi-Supervised Clustering (in Section 2.2). For each domain we discuss the motivations, the difficulties that arise and some of the solutions present in the literature. All along this chapter, we relate our work to the domain and position our proposals relative to existing solutions.

2.1 Complex Data Mining

In this section, we present a general overview of the domain of Complex Data Mining. Using the example of a Wikipedia article, we incrementally single out the particularities of complex data. In Section 2.1.1, we identify and summarize the five most important specificities that define complex data, and we point out how our proposals address them. We further detail some of the identified specificities (in Sections 2.1.2, 2.1.3 and 2.1.4) by presenting the difficulties they pose and some of the existing solutions in the literature. Each of these subsections ends with a paragraph in which we position our work.

Complex Data Mining is a very vast domain, incorporating a large range of related problems. Data Mining can be defined as the computational process of discovering patterns in large data sets, involving methods from the domains of artificial intelligence, machine learning, statistics and database systems. Complex Data Mining is the application of Data Mining to Complex Data, i.e., data with a series of particularities that we identify later in this section, and which are a non-standard input for classical data mining algorithms.

Vectorial description space. The input data format for classical data mining algorithms is the (attribute, value) pair format. In this format, each individual is described by a set of measurements over a set of attributes. Each measurement is (a) a real value (numeric attributes), (b) a choice from a set of available options (categorical attributes) or (c) a value of true or false (boolean attributes). When considering numeric variables, this format can be associated with a multidimensional vectorial description space, in which each individual is described by its measurement vector. The assumption is that, in this multidimensional vectorial description space, machine learning algorithms are effective (e.g., the data is separable by means of a classification algorithm).
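As a minimal sketch of this mapping (the attribute names and values below are invented for the example), individuals described by numeric, categorical and boolean attributes can be turned into points of a numeric vectorial space by one-hot encoding the categorical attributes and coding booleans as 0/1, here using pandas:

```python
import pandas as pd

# Toy (attribute, value) data: a numeric, a categorical and a boolean attribute.
individuals = pd.DataFrame({
    "age":      [34, 51, 27],          # numeric attribute
    "country":  ["FR", "RO", "FR"],    # categorical attribute
    "employed": [True, False, True],   # boolean attribute
})

# One-hot encode the categorical attribute and cast booleans to 0/1,
# so that every individual becomes a vector in a multidimensional space.
vectors = pd.get_dummies(individuals, columns=["country"]).astype(float)
print(vectors.to_numpy())   # one measurement vector per individual
```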

Complex Data. Complex data are profoundly heterogeneous data. In addition to the classic (attribute, value) numeric format, a complex document can contain data of different natures (e.g., text, images, video, audio, etc.). These data are interlinked through the structure (e.g., titles, paragraphs, sections) of the complex document in which they reside. In addition, complex data can have expert knowledge attached (e.g., experts categorize documents using labels). Sometimes, complex data have a temporal dimension attached: it either records the evolution of an entity/object over time, or the complex document undergoes modifications over time. Each of the evoked particularities is a facet of the considered complex document and they all must be taken into account in the learning process in order to infer complete knowledge [Zighed et al. 2009]. On a more abstract level, semantics is the main challenge when dealing with complex data. Complex data come in high volumes and many different types, and the main objective is to piece together the underlying knowledge and recreate the semantic links. Semantics is crucial for the comprehension of the generated results, especially from a human point of view. The semantic representation of complex data can be improved by using the freely accessible resources of the new Semantic Web. Knowledge repositories are increasingly available, most often under the form of (a) ontologies, such as the general purpose DBpedia [Bizer et al. 2009] and Freebase, which are further interlinked by projects such as Linking Open Data, or (b) more specialized datasets, issued from the domains of Social Sciences and Humanities (e.g., History, Communication sciences, Sociology, etc.). It is not uncommon to tap into these distributed external repositories in order to introduce semantics into the learning process. Leveraging semantics and the temporal dimension in the analysis of complex data are the main research challenges of our work, presented in this thesis.

Let’s take an example! Figure 2.1 depicts the complex document which is the Wikipedia article about the country of France. The different components of the complex document are highlighted and numbered. The document is structured, having a title (denoted by number 6), subtitles, a table of contents, etc. The main description is in the main column, while additional information is given in the side (right) column. In some applications, additional information can be derived from the structure of the complex documents (e.g., the structure of a social network, i.e., how users are interlinked, is sometimes more informative than the content posted by the users).

1. http://www.dbpedia.org
2. http://www.freebase.com/
3. http://www.w3.org/wiki/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
4. http://en.wikipedia.org/wiki/France

[Figure 2.1 is an annotated screenshot of the Wikipedia article on France, with its components numbered: description text (1), flag/emblem and map images (2), anthem audio (3), numeric indicators (4), hyperlinks to external knowledge repositories (5) and the title/structure (6).]

Figure 2.1 – A Wikipedia article is a complex document, containing text (1), images (2), audio (3), numeric indicators (4), links to other pages (5) and a structure (6).

In Figure 2.1, text (numbered 1) is mainly used to give information about the geographic position, history, political system, etc. Data of other natures could be used to complete the information. Images (number 2) are added to portray the country's flag, emblem and geographic map. Audio data (number 3) is included to give the national anthem, and even video is used to present specific events (e.g., in the original Wikipedia article, a video sequence shows the French territorial evolution from 985 to 1947). Information like the country's surface, population, gross domestic product (GDP), geographic coordinates, etc. is given in an (attribute, value) numerical format (number 4). Finally, hyperlinks (number 5) are present in the text, most often linking this article to other articles. Hyperlinks can also point towards external resources in a knowledge ontology, therefore linking the complex document to structured information and, also, to more semantics. Complex data can also have a temporal dimension. In the Wikipedia article, the data can be updated yearly with the latest information about political events. Furthermore, the record of past values of social and economic indicators provides good hints for current events (e.g., high levels of debt and leverage in the banking system were early indicators of the economic crisis of 2008).

2.1.1 Specificities of complex data

In the following, we summarize and structure the aforementioned specificities of complex data.
– diverse nature of data (text, image or audio/video). Dealing with non-numeric data raises problems, out of which we mention the fact that (a) they are not directly “understandable” by a machine (i.e., they need to be translated first to a numerical space) and (b) the numerical space into which they are translated captures little semantic information and, consequently, machine learning algorithms exhibit low performance. Some approaches (more details in Section 2.1.2) deal with this problem by using data of different natures (e.g., images and text) to better guide the learning process.
– additional information (e.g., external knowledge). External information or resources might be available to complete the semantic information present in the data. This additional information can be under the form of (a) expert-provided tags or labels or (b) interlinked knowledge repositories (i.e., ontologies).
– temporal/dynamic dimension. It often happens that the same entity is described according to the same characteristics at different times or different places (e.g., a patient may consult several doctors, at different moments in time). These different data are associated with the same entity and the complex data describe the evolution of the entity in the given description space. A special kind of temporal data is dynamic data, which is available as a stream (this data cannot be stored and must be analyzed online). We give more details about the temporal dimension and dynamic data in Section 2.1.3.
– high dimensionality. Taking into account data of multiple natures and external knowledge repositories raises dimensionality problems. This problem can either concern the high volumes of data that need to be dealt with (the “scalability” problem) or, most often, the high dimensionality of the description space (the “curse of dimensionality”). In Section 2.1.4, we present in detail how this problem affects the learning process and some existing solutions.
– distributed and diverse sources. Complex data can originate from different sources, which, furthermore, need not be collocated. This is not a new problem (e.g., in older times, the same information could be found in different books, in different libraries), but it has been exacerbated with the arrival of the Web 2.0. Information is nowadays essentially distributed over many sources, instead of being centralized in libraries. The retrieval paradigm also shifted from classification (e.g., sorting books in a library based on a set of criteria) to searching (e.g., modern-day web-search engines query multiple distributed knowledge repositories to compile an answer).

Positioning our work. In our work, we have addressed multiple learning tasks related to the specificities of complex data identified above. We deal with complex data of two different natures. We deal with image data in Chapter 5, in which we propose a method to introduce semantic knowledge into the image numerical representation. We deal with textual data in Chapter 6, in which we address the tasks of topic extraction, topic labeling and topic evaluation. Topic labeling is important for the human comprehension of extracted topics, whereas for topic evaluation we employ semantic knowledge. Leveraging semantic information in the numerical representation of data and in the learning algorithms is one of the central research challenges of this thesis. In Chapter 5, we deal specifically with embedding semantic information, under the form of labels, into the numeric representation of images. In Chapter 6, we use an external semantic resource (i.e., WordNet) for mapping the statistically constructed topics to a semantic-aware structure and for evaluating and improving topics extracted from text. The second research challenge that we address in our work is the temporal dimension of complex data. In Chapter 3, we propose a new temporal-aware constrained clustering algorithm (TDCK-Means), which constructs temporally coherent clusters and contiguously segments the temporal observations belonging to an entity. In the following subsections, we further detail some of the specificities of complex data (i.e., dealing with data of different natures, the temporal dimension and the dimensionality), and we show some of the difficulties associated with each one and some of the solutions present in the literature.

2.1.2 Dealing with complex data of different natures

Traditional Data Mining algorithms (e.g., clustering algorithms, classification tree learning algorithms, etc.) were not designed to deal with data of diverse natures. Text and image are the two natures of data most widely used (for example on the Internet), but others are also popular, such as audio and video. The difficulty in analyzing data of different natures is that, while they are easy to transform, store and reproduce into/from a digital format, this format captures little of the semantic information needed by machine learning algorithms. Therefore, one of the main challenges when dealing with data of different natures is to translate them into a semantic-aware numeric description space, on which the results of a machine learning algorithm are “relevant”. In our context, we consider results “relevant” when, in addition to being the answer to a given task, they are also comprehensible for a human being. We summarize the difficulties related to the most common natures of complex data as follows:
– text. The morphological and syntactic rules of languages are not directly machine “comprehensible”. Furthermore, in most representations, the text is encoded at the level of the character, and the presence of a given character (e.g., the character ‘b’ or the space) gives almost no hint about the subject of the text.
– images. The native digital format for images is the pixel-based format. An image is represented as a matrix of pixels, where each pixel has a certain color. Low-level image features (e.g., the pixel's color) capture very little of the semantics of the image (e.g., the objects represented in the image). Passing from low-level features to high-level features, while capturing the semantics of the image, is known as the semantic gap.
– video. Video is digitally represented as a sequence of images, therefore video data can be considered as image data with a temporal component [Zaiane et al. 2003]. Consequently, the difficulties of processing video data inherit those concerning images, to which new ones are added with respect to the temporal evolution (e.g., tracking objects, finding patterns in the sequence of images, etc.).
– audio. Audio data is digitally represented by the frequencies of the sound present in the audio document. Knowing the presence of a certain frequency in an audio file gives little information about the overall genre of music. Other, more high-level tasks (e.g., speech recognition [de Andrade Bresolin et al. 2008]) are even more difficult without thorough pre-processing.
The above mentioned types of data can be translated into a numeric description space, on which classical machine learning algorithms can be applied. The general schema of this process is represented in Figure 2.2.

[Figure 2.2 is a diagram: image data and text data each go through a numerical representation construction step; together with classic numeric data, they yield data in the Numeric Vectorial Space, which is fed to a Data Mining algorithm that produces useful knowledge.]

Figure 2.2 – Conceptual schema of how classical numeric data, image data, or text data could be used with a traditional Data Mining algorithm.

Unlike classic numeric data, data of other types first need to have a numeric representation constructed. The key point is to embed enough semantics into the newly created representation so that the results obtained by the machine learning algorithm are “relevant” (as discussed earlier in this section). The schema presented in Figure 2.2 considers the case when knowledge is inferred from only one type of data. Learning simultaneously from data of multiple types is the field of information fusion (we discuss using text and images together later in this section). Most often, texts are transformed into a semantically-aware numeric representation by using the “bag-of-words” representation. The underlying assumption is that words that have similar meanings often appear together and that the semantics of a text is captured by the co-occurrence of certain words. This representation is presented in detail in Section 6.2 (p. 128). In a nutshell, after a preprocessing step which usually involves removing common words and reducing words to their stems, texts are represented as an orderless distribution of frequencies over words. For images, a similar representation is used, called the “bag-of-features” representation. This representation is presented in detail in Section 5.1.2 (p. 100). Images are represented similarly to texts, with the difference that the place of words is taken by visual words. The visual words are abstractions (normally created through means of clustering) of low-level patch descriptions. They serve a similar purpose as words in the “bag-of-words” representation: they are predictive for the presence of a certain “topic” in an image (e.g., visual words constructed from the patches extracted from an eye are predictors for the presence of an eye in the photo, which itself is a good predictor for the presence of a face and a human).
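As a minimal sketch of this bag-of-words construction (the toy corpus is invented for the example; it assumes a recent scikit-learn, and stemming is omitted but could be plugged in through the vectorizer's tokenizer argument):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for the example.
corpus = [
    "France is a country in Western Europe.",
    "The economy of France is among the largest in Europe.",
    "The anthem and the flag are national symbols of France.",
]

# Remove common (stop) words and count the remaining word occurrences.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus)        # documents x vocabulary matrix

print(vectorizer.get_feature_names_out())   # the "bag" of words
print(X.toarray())                          # orderless word-frequency vectors
```

Each document is thus reduced to a vector of word frequencies, with word order discarded, which is exactly the orderless distribution described above.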

Using both text and image for learning tasks. Multimedia information is intrinsically multi-modal [Bekkerman & Jeon 2007]. We interpret the word “modality” as the type of input / nature of the data. Learning from only one nature of data at a time (as described earlier and in Chapters 5 and 6) is called a uni-modal approach. The results obtained using uni-modal approaches could be aggregated to infer the overall knowledge, but in practice better results are obtained when using a multi-modal approach [Zaiane et al. 2003], in which information from different modalities is simultaneously available to the learning algorithm (e.g., the field of information fusion). For example, image captions (i.e., the text associated with an image) and low-level image features are different types of input to an image processing system and can, therefore, be considered as two separate modalities. Image captions tend to describe events captured in the image (i.e., they capture semantic information), while image features convey visual information to the system. Consequently, using multiple modalities in the learning process can yield higher performance, as each modality can be used to guide the learning process of another. We present some of the most common learning tasks that can benefit from using both the text and image natures of data. These tasks share a common trait: low-level visual features (e.g., color, texture, shape, spatial layout, local descriptors, etc.) used to describe images capture very little of the semantics of images. Their performance can be improved by using text alongside images.
– Image classification. Object-based image classification is challenging due to wide variation in object appearance, pose, and illumination effects. Low-level image features are far removed from the semantics of the scene, making it difficult to use them to infer object presence, and it is expensive to obtain enough manually labeled examples from which to learn. To cope with these constraints, the text that often accompanies visual data can be leveraged to learn more robust models. Such approaches [Mooney et al. 2008, Wang et al. 2009] make use of collections of unlabeled images and their textual snippets, usually issued from the Internet.
– Automatic image categorization. Unlike the image classification application, in automatic categorization there are no predefined classes. Images are automatically organized based on their similarity through means of clustering. Being fully unsupervised, clustering methods often demonstrate poor performance when based only on low-level image features. Clustering results can be improved by using a multi-modal learning paradigm, where image captions and annotations are used alongside image features. Text has been used to help in a number of applications, such as image clustering [Bekkerman & Jeon 2007], Web image search results clustering [Cai et al. 2004] or image sense discrimination [Loeff et al. 2006].
– Improving image numerical description. Low performance in the tasks described earlier (i.e., image categorization and image clustering) is usually due to the low semantic quality of the image numerical representation: the images are translated into a numerical space which is not easily learnable by machine learning algorithms.
Performance can be improved by using the text associated with the images, either in the learning algorithm (as previously seen) or in the creation of the numeric representation [Quattoni et al. 2007, Ji et al. 2010]. The goal is to embed textual semantics into the image representation and to improve learning in future image-related learning problems. Our work with images (in Chapter 5) deals with improving the numerical description.
– Content-based image retrieval. Content-based image retrieval deals with efficient image searching, browsing and retrieval, with applications in crime prevention, fashion, publishing, medicine, architecture, etc. Humans tend to use high-level features (concepts), such as keywords or text descriptors, to search and interpret images and measure their similarity. Features automatically extracted using computer vision techniques are mostly low-level features and, in general, there is no direct link between the high-level concepts and the low-level features [Sethi et al. 2001] (also called the “semantic gap”). Some content-based image retrieval systems [Cai et al. 2004, Zhuang et al. 1999] use both the visual content of images and the textual information for bridging the semantic gap. Other techniques exist which do not rely on the usage of associated text [Liu et al. 2007], i.e., (a) using an object ontology, (b) associating low-level features with query concepts, (c) introducing relevance feedback into the retrieval loop or (d) generating semantic templates to support high-level image retrieval.
– Automatic image annotation. This task deals with automatically annotating images with one or multiple labels, according to their content. One of the difficulties is the huge number of candidate labels and the scarce training examples. Automatic image annotation systems [Lu et al. 2009, Yang et al. 2010] seek to find a correspondence between text words and the visual features describing an image. This is most often achieved by using a machine translation approach, where the image-text pairs are seen as bilingual texts and alignment methods are applied [Barnard et al. 2003].
– Human-computer interface systems use multiple modes of input and output to increase robustness in the presence of noise (e.g., by performing audio-visual speech recognition) and to improve the naturalness of the interaction (e.g., by allowing gesture input in addition to speech). Such systems often employ classifiers based on supervised learning methods, which require manually labeled data. This is usually costly, especially for systems that must handle multiple users and realistic (noisy) environments. Semi-supervised learning techniques can be leveraged [Christoudias et al. 2006] to learn multi-modal (i.e., audio-visual speech and gesture) classifiers, thus eliminating the need to obtain large amounts of labeled data.

Positioning our work In our work concerning textual data (presented in Chapter 6), we have used a classical “bag-of-words” representation, and concentrated mainly on introducing semantic knowledge into topic evaluation, through a concept-topic mapping. Our work concerning image data (presented in Chapter 5) deals with embedding semantics into the image representation. We show how a semantically-enriched representation can be obtained starting from a baseline representation by employing non-positional labels (i.e., only the presence of objects in the images is known, but not their position). In our work, we have not dealt with text and image simultaneously. One of the avenues in this direction would be to use text instead of non-positional labels. However, we currently use a complete labeling paradigm, in which the absence of a label implies the absence of the object in the image. Passing from this strict labeling to a more relaxed labeling (i.e., authorizing missing labels) is one of the perspectives of our work and is discussed in Section 5.3.2 (p. 108). Once this transition is made, using text alongside images becomes foreseeable. Another research direction would be to embed into images semantic information originating from a concept hierarchy, by passing through text and using the mapping between topics (extracted from text) and concepts.

2.1.3 Temporal/dynamic dimension Introducing the temporal dimension usually changes the definition of the learning problem: the description of entities is contextualized (i.e., the description is valid for a period of time) and new learning problems emerge, e.g., detecting evolutions and trends, tracking through time, etc. Leveraging and interpreting the temporal dimension is one of the central research challenges of this thesis and is intrinsically connected to complex data. As discussed earlier, complex data often have a temporal dimension, describing the evolution of a number of entities throughout a period of time. For example, in Chapter 3, we detect typical evolutions of entities and we apply our proposal to a dataset which records the value of a number of socio-economical indicators for a set of countries, over a period of 50 years. Temporal Data Mining is the sub-domain of Data Mining, closely associated with Complex Data Analysis, which deals with detecting surprising regularities in data with temporal inter-dependencies. Datasets which contain temporal inter-dependencies are called sequential datasets, where sequential data are data ordered by some index. In the case of temporal datasets, the index is the timestamp associated with the observations. Time-series is a popular class of sequential data, which has enjoyed a lot of attention, especially from the statistics community. Time-Series Analysis [Brillinger 2001] has many applications, among which we mention weather forecasting, financial and stock market prediction, etc. A number of differences exist between Temporal Data Mining and Time-Series Analysis [Laxman & Sastry 2006], the most important being (a) the size and the nature of the studied dataset and (b) the purpose of the study. Temporal Data Mining deals with datasets of prohibitive size, and the data are not always numerical. These are two of the characteristics of complex data. Furthermore, the purpose of Temporal Data Mining goes beyond the forecast of future values in the series. Some of the learning tasks in Temporal Data Mining can be summarized as follows [Laxman & Sastry 2006]:
– prediction. The prediction task has to do with forecasting (typically) future values of the time series based on its past samples. In order to do this, one needs to build a predictive model [Dietterich & Michalski 1985, Hastie et al. 2005].
– classification. In sequence classification, each sequence is assumed to belong to one of finitely many (predefined) classes or categories and the goal is to automatically determine the corresponding category for the given input sequence. Applications of sequence classification include speech recognition [O’Shaughnessy 2000, Gold et al. 2011], gesture recognition [Darrell & Pentland 1993] and handwritten word recognition [Plamondon & Srihari 2000].
– clustering. Clustering of sequences or time series [Kisilevich et al. 2010] is concerned with grouping a collection of time series (or sequences) based on their similarity. Applications are extremely varied and include analyzing patterns in web activity logs, patterns in weather data [Hoffman et al. 2008] and trajectories of moving objects [Nanni & Pedreschi 2006], finding similar trends in financial data, and regrouping similar biological sequences like proteins or nucleic acids [Osato et al. 2002].
– search and retrieval. This task is concerned with efficiently locating subsequences in a database of sequences.
Query-based searches have been extensively studied in language and automata theory. While the problem of efficiently locating exact matches of substrings is well solved, the situation is quite different when looking for approximate matches [Navarro 2001]. In typical data mining applications, like content-based retrieval, it is approximate matching that we are more interested in.
– pattern discovery. Unlike the search and retrieval task, in pattern discovery there is no specific query with which to search the dataset. The objective is simply to unearth all patterns of interest. Two popular frameworks for frequent pattern discovery are sequential patterns [Agrawal & Srikant 1995, Fournier-Viger et al. 2011] and episodes [Mannila et al. 1997]. Sequential pattern mining is essentially an extension of the original association rule mining framework proposed for a database of unordered transaction records [Agrawal et al. 1993], known as the Apriori algorithm. The extension deals with incorporating the temporal ordering information into the patterns being discovered. In the frequent episodes framework, we are given a collection of sequences and the task is to discover (ordered) sequences of items (i.e. sequential patterns) that occur in sufficiently many of those sequences.
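The core operation behind both frameworks, counting in how many sequences an ordered pattern occurs, can be sketched in a few lines of Python; the event names and the support threshold below are purely illustrative and not taken from the cited works.

# Minimal sketch: support counting for an ordered pattern in a collection of
# sequences, the basic building block of sequential pattern / episode mining.

def occurs_in(pattern, sequence):
    """Return True if `pattern` appears in `sequence` as an ordered subsequence."""
    it = iter(sequence)
    return all(item in it for item in pattern)

def support(pattern, sequences):
    """Fraction of sequences in which the pattern occurs."""
    return sum(occurs_in(pattern, s) for s in sequences) / len(sequences)

if __name__ == "__main__":
    sequences = [
        ["login", "browse", "add_to_cart", "checkout"],
        ["login", "browse", "logout"],
        ["browse", "add_to_cart", "browse", "checkout"],
    ]
    pattern = ["browse", "checkout"]
    # The pattern is "frequent" if its support exceeds a user-chosen threshold.
    print(support(pattern, sequences))          # 0.666...
    print(support(pattern, sequences) >= 0.5)   # True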

Dynamic data There is a fundamental difference between temporal data and dynamic data. Dynamic data are data evolving over time, as new data arrive and old data become obsolete. Furthermore, the updates may change the structures learned so far. Dynamic data are intimately linked with the Internet, where enormous quantities of data are constantly created and need to be dealt with. These data are usually available as a stream, arriving at a rapid rate, and cannot be stored for later analysis. The temporal-aware algorithms described so far in this section (as well as our proposed TDCK-Means algorithm, which will be described later in Chapter 3) consider temporal data as recordings of the state of a set of individuals at different moments of time (e.g., macro-economical indicators for countries). These data are stored and studied offline (a posteriori), usually for determining causalities, evolutions, etc. Such “static” algorithms are not appropriate for dynamic data, because (a) huge amounts of data are accumulated over time and cannot be stored, (b) the generative distribution might change over time (which is known in supervised learning as “concept drift”), (c) only one-pass access to the data is available and (d) the data arrive at a rapid rate. Many online algorithms have been proposed, among which we mention CluStream [Aggarwal et al. 2003], which performs an online summarization and an offline clustering, and Single pass K-Means [Farnstrom et al. 2000], which is an extension of K-Means for streams and which performs the clustering in batches that fit into memory.
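The following Python sketch, written in the spirit of batch/single-pass K-Means for streams (it is not the exact algorithm of [Farnstrom et al. 2000]), illustrates the general mechanism: each incoming batch is clustered together with weighted centroids summarizing all previous batches, after which the raw points are discarded so that memory stays bounded. All parameters and data are invented for illustration.

import numpy as np

def weighted_kmeans(points, weights, centroids, n_iter=10):
    for _ in range(n_iter):
        # Assign every (weighted) point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids as weighted means; keep the old centroid if a cluster is empty.
        for j in range(len(centroids)):
            mask = labels == j
            if weights[mask].sum() > 0:
                centroids[j] = np.average(points[mask], axis=0, weights=weights[mask])
    return centroids, labels

def stream_kmeans(batches, k, rng):
    centroids, cent_weights = None, None
    for batch in batches:                      # a single pass over the stream
        if centroids is None:
            centroids = batch[rng.choice(len(batch), k, replace=False)].copy()
            cent_weights = np.zeros(k)
        points = np.vstack([batch, centroids])
        weights = np.concatenate([np.ones(len(batch)), cent_weights])
        centroids, labels = weighted_kmeans(points, weights, centroids)
        # Summarize: the new centroid weights absorb the mass of the whole batch.
        cent_weights = np.array([weights[labels == j].sum() for j in range(k)])
    return centroids

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    stream = (rng.normal(loc=(i % 3) * 5.0, size=(200, 2)) for i in range(9))
    print(stream_kmeans(stream, k=3, rng=rng))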

Positioning our work Our work concerning the temporal dimension does not deal with the dynamic data paradigm; we therefore consider that our dataset is stored in a database and available for querying and transformation to our needs. The learning task for which our TDCK-Means temporal clustering algorithm was constructed is not a classical clustering task (as defined in Temporal Data Mining), but is rather more similar to sequential itemset mining. Among the above-defined tasks of Temporal Data Mining, TDCK-Means would be classified as a special case of pattern discovery. There is a fundamental difference between the algorithms in the clustering category and TDCK-Means: the nature of the individuals that they cluster. For the former, an individual is an entire time-series: they employ similarity measures between time-series and they seek to regroup similar time-series together. For the latter (TDCK-Means), the individuals that it clusters are the observations (i.e., an observation is the description/state of an entity at a given moment of time, a tuple (entity, timestamp, description); a time-series is composed of all the observations corresponding to an entity over all moments of time). TDCK-Means seeks to detect groups of similar observations (preferably contiguous, therefore parts of a time-series), in order to detect similar evolutions.

Figure 2.3 – The “curse of dimensionality”: (a) groups of similar (“dense”) points may be identified when considering the relevant attribute/subspace, but not when considering an irrelevant attribute; (b) a subset of correlated attributes.

2.1.4 High data dimensionality Taking into account data of multiple natures and external knowledge repositories raises dimensionality problems. These problems usually concern the high dimensionality of the description space. [Kriegel et al. 2011] identify four problems related to high dimensionality: (a) the optimization problem, (b) the concentration effect, (c) the presence of relevant and irrelevant attributes and (d) the correlation among attributes. The optimization problem states that the difficulty of any global optimization approach increases exponentially with an increasing number of dimensions [Bellman & Kalaba 1959]. Considering the task of clustering, the fitting of the functions explaining clusters becomes more difficult with more degrees of freedom. The concentration effect of distances is identified in [Beyer et al. 1999, Hinneburg et al. 2000]. As the dimensionality of the description space increases, distances to near and to far neighbors become more and more similar. In other words, the separability of the description space decreases as the number of dimensions increases (a small numerical illustration is given below). For a clustering task, this effect has been shown to hold [Francois et al. 2007, Houle et al. 2010] only within clusters, and not between different clusters, as long as the clusters are well separated. Another problem is the presence of relevant and irrelevant attributes. In a large description space, sometimes only a subset of features is relevant for the learning algorithm. For example, in a classification task, the description space depicted in Figure 2.3a is separable only on a subset of features (shown on the horizontal axis). Together with the concentration effect, this problem causes a sharp decrease in the performance of distance function-based learning algorithms that assign equal weights to all dimensions. The fourth problem is the correlation among attributes. In the context of machine learning (supervised or unsupervised), a useful attribute needs to portray new information. Figure 2.3b shows the case when a subset of attributes are correlated among themselves. Features in the subset depicted on the vertical axis do not bring any new information into the learning process, since their values can be deduced from those depicted on the horizontal axis. Feature correlation only augments the dimension of the description space, without increasing the relevance of the descriptive space.
Our work presented in Chapter 4 is specifically targeted at the relevance of the descriptive space, with respect to a given dataset. In Section 4.2 (p. 63), we identify three types of solutions for this problem:
– feature selection. Feature selection techniques [Lallich & Rakotomalala 2000, Mo & Huang 2011] seek to filter the original feature set in order to remove redundant features.
– feature extraction. Feature extraction involves generating a new set of features by means of functional mapping, so that the new description space is relevant for the learning task. Examples of such approaches are the SVM’s kernel [Cortes & Vapnik 1995] and principal component analysis (PCA) [Dunteman 1989].
– feature construction. Feature construction is a process that discovers missing information about the relationships between features. Most constructive induction systems [Pagallo & Haussler 1990, Zheng 1998] construct features as conjunctions or disjunctions of initial attributes.
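As a numerical illustration of the concentration effect discussed above (independent of the cited studies), the following short Python sketch measures the relative contrast between the nearest and the farthest neighbor of a random query point as the dimensionality grows; the sample sizes are arbitrary.

import numpy as np

rng = np.random.default_rng(42)
for dim in (2, 10, 100, 1000):
    data = rng.random((1000, dim))      # 1000 uniformly distributed points
    query = rng.random(dim)
    dists = np.linalg.norm(data - query, axis=1)
    # Relative contrast (d_max - d_min) / d_min shrinks as the dimension grows.
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"dim={dim:5d}  relative contrast={contrast:.3f}")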
There are fundamental differences between feature extraction and feature construction algorithms: (i) the comprehensibility of the new description space, (ii) the underlying purpose of the process and (iii) the dimension of the new space. (i) The description space resulting from feature extraction algorithms is either completely synthetic (for PCA) or hidden/functioning as a black box (for SVM). This renders the interpretation of the results rather difficult. In the case of feature construction, the new attributes are easily comprehensible (e.g., a feature entitled motorbike and driver is easier to interpret than the third axis of PCA). (ii) The underlying purpose of feature extraction is solely improving the numerical relevance of the description space, whereas the purpose of feature construction is also discovering hidden relations between attributes (e.g., the correlation of people and grass for a subset of pictures has a certain meaning when portraying a barbecue). (iii) Feature extraction algorithms output a description space of either (a) lower dimensionality (e.g., PCA or Manifold Learning [Huo et al. 2006]) or (b) higher dimensionality than the original space (e.g., the SVM kernel maps into a very high-dimensional space, even an infinite-dimensional space for the RBF kernel [Chang et al. 2010]). Conversely, feature construction algorithms invariably increase the number of dimensions.
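The following toy Python sketch illustrates the general idea of conjunctive feature construction on boolean attributes; it is not the uFC algorithm presented in Chapter 4, and the correlation threshold, feature names and data are invented for illustration.

import numpy as np

def construct_conjunctions(X, names, threshold=0.6):
    """X: (n_samples, n_features) boolean matrix; returns new columns and their names."""
    n = X.shape[1]
    new_cols, new_names = [], []
    corr = np.corrcoef(X, rowvar=False)
    for i in range(n):
        for j in range(i + 1, n):
            if corr[i, j] >= threshold:
                # Replace / complete a highly correlated pair with its conjunction.
                new_cols.append(X[:, i] & X[:, j])
                new_names.append(f"{names[i]} AND {names[j]}")
    if new_cols:
        return np.column_stack(new_cols), new_names
    return np.empty((len(X), 0), dtype=bool), new_names

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    motorbike = rng.random(500) < 0.4
    driver = motorbike & (rng.random(500) < 0.9)   # almost always co-occurs with motorbike
    grass = rng.random(500) < 0.3
    X = np.column_stack([motorbike, driver, grass])
    cols, names = construct_conjunctions(X, ["motorbike", "driver", "grass"])
    print(names)   # e.g. ['motorbike AND driver']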

Positioning our work Our work concerning the description space, presented in Chapter 4, proposes a feature construction algorithm for discovering missing semantic links between the attributes describing a dataset. The novelty of the proposed algorithm is that, unlike the rest of the feature construction algorithms present in the literature, we construct features in an unsupervised context, based only on the correlations present in the dataset. Our uFC algorithm adapts the description space to the semantics of the given dataset. Our experiments in Section 4.6 (p. 75) show that the constructed features are highly comprehensible, while the constructed description space achieves a lower total correlation between dimensions.

2.2 Semi-Supervised Clustering

In the previous sections, we have shown that complex data often come bundled with additional information. For example, researchers in human sciences tag documents for easy referencing, and medical doctors annotate the files of their patients. The modern Internet is a prolific source of additional information. Many online photo sharing platforms (e.g., Instagram 5, Flickr 6) allow their users to tag the presence of certain objects or persons in their photos. Furthermore, general purpose knowledge ontologies (e.g., DBpedia, Freebase) are freely accessible and contain millions of records of general facts, such as information about countries, persons, events, movies, etc. They are interconnected with more specific ontologies, depicting facts about geo-localization, transport infrastructure, etc. Together, they form an enormous inter-linked reservoir of knowledge that can be used as guidance in many learning applications.

Making best use of the knowledge. In traditional machine learning, the learning guidance is performed through supervision, by showing the algorithm a number of correctly classified examples and requiring it to learn to distinguish between classes. This field of Machine Learning is mature, and modern classification algorithms (such as SVM [Cortes & Vapnik 1995]) show impressive results. There is a number of applications related to complex data for which the supervised learning paradigm is difficult to apply because (a) there might not be enough examples for each class, (b) not all classes are known beforehand (sometimes not even the number of classes is known) or (c) class information is not available at all, and the only available information concerns the relations between a subset of individuals. For example, in an image dataset, the presence of some of the objects might be labeled in some of the images. For this application, a supervised learning approach cannot be employed to learn how to differentiate between objects, since the quantity of examples for each label is not sufficient and no assumption can be made about the total number of objects. Leveraging partial expert knowledge is the domain of semi-supervised learning. Partial knowledge is either incomplete (e.g., only the relations between certain individuals are known) or insufficient (not enough examples are available for supervised algorithms to function). The domain of semi-supervised learning can be broadly divided into two categories: semi-supervised classification and semi-supervised clustering. In order to better differentiate the two approaches, we take into account two dimensions: (i) the type of learning problem to be solved and (ii) the quantity of available supervision. The learning problem can be essentially supervised (i.e., learning to distinguish between predefined classes) or unsupervised (i.e., automatically discovering a typology of the data).
– semi-supervised classification [Zhu 2005, Chapelle et al. 2006] is an essentially supervised task, and it can be applied to supervised learning problems in the presence of low quantities of supervision. Such methods can learn from a low number of examples for each category, by enabling learning from both labeled and unlabeled data. They still inherit from supervised methods a number of restrictions, among which

5. http://instagram.com/
6. http://www.flickr.com/

(a) the number of categories must be fixed and known beforehand and (b) labeled examples must be present for each category. These restrictions can prove to be too severe for complex data, for which the classification typology might not be known beforehand (sometimes not even the number of classes). Furthermore, supervised information might be available only under the form of some pairwise connections (e.g., it may be known that two individuals should be classified together, without any information about the category to which they should be assigned). We, therefore, consider that semi-supervised classification approaches are not completely adapted to complex data.
– semi-supervised clustering is an essentially unsupervised task, which is adapted to handle unsupervised learning problems for which some supervised information is available. These methods deal with using the supervision in order to guide the clustering. Semi-supervised clustering is useful when (i) classes are not known beforehand or not enough examples are available for each class and (ii) the available knowledge is not representative. We consider that this approach is more suitable for taking into account the additional information embedded in complex data and, therefore, we further detail it in the rest of this section.


Figure 2.4 – Three clustering problems: (a) an easy problem, (b) a difficult problem and (c) an impossible problem.

Why guide the clustering? Traditional clustering algorithms are adapted to structuring data from previously unknown domains (e.g., the Yahoo! problem [Cohn et al. 2003]), but they fail to meet expectations when some background knowledge exists. For example, if the data naturally form tight clusters that are well-separated (as in Figure 2.4a), there is no need for background knowledge at all; any reasonable clustering algorithm will detect the desired clusters. Likewise, if no distinction can be made between classes in the description space (as in Figure 2.4c), then little useful information can be found in the data itself, and supervision will again be of little use. Background knowledge will therefore be most useful when patterns are at least partially present in the data, but a clustering algorithm cannot detect them correctly without assistance (as seen in Figure 2.4b). The idea is to use the available background knowledge to guide the clustering algorithm towards finding the “correct” partition.

How to model the supervision? The expert knowledge can be modeled either by using class labels (i.e., as in supervised learning), or by using constraints. Constraints can be set, for example, on a subset of individuals or on the clusters (e.g., when we want to form clusters respecting a certain condition apart from the implicit cohesion). The constraints most used in the literature [Davidson & Basu 2007] are those that model the relations between pairs of individuals, the pairwise constraints. In [Wagstaff & Cardie 2000], two types of pairwise constraints are introduced. A “must-link” constraint between individuals x and y means that in the created partition, x and y must be placed into the same cluster. Similarly, a “cannot-link” constraint between x and y means that the two individuals cannot be placed into the same cluster. Must-link constraints are transitive (i.e., (x, y) ∈ M and (y, z) ∈ M ⇒ (x, z) ∈ M, where M is the set of must-link constraints), which means that the pairwise constraint set can be enriched with new constraints by calculating the transitive closure of the must-link set. In the next two paragraphs, we show that instance-level constraints are very versatile, being capable of modeling different types of information (e.g., class labels, cluster conditions), as well as incomplete information.

Modeling different types of information using instance-level constraints Supervision under the form of class labels can be considered a special case of supervision using pairwise constraints. Given a subset of labeled individuals, the expert information can be translated into the pairwise constraint form by adding must-link constraints between all individuals sharing a label and cannot-link constraints between any two individuals labeled differently. Furthermore, supervision in the form of constraints is generally more practical than providing class labels in the clustering framework [Basu et al. 2003], since true labels may be unknown a priori and it is easier for a human expert to specify whether pairs of points belong to the same cluster or to different clusters. For example, in [Cohn et al. 2003], human interaction is used to iteratively improve a document partition, by letting the user decide which pairs of documents are wrongly classified. We have a similar situation for cluster-level constraints. For example, in [Davidson & Ravi 2005], two types of cluster-level constraints are defined: (a) the ε-constraint enforces that each point in a cluster have a neighbor within a distance of at most ε (i.e., a constraint preventing “rare” clusters, in which individuals are far apart); (b) the δ-constraint enforces that every individual in a cluster be at a distance of at least δ from every individual in every other cluster (i.e., a constraint enforcing the separability of clusters). The two cluster-level constraints can easily be specified using instance-level constraints: the ε-constraint can be represented as a disjunction of must-link constraints (for every individual x, a must-link to at least one other individual y such that ||x − y|| ≤ ε) and the δ-constraint translates into a conjunction of must-link constraints (for every individual x, a must-link to all individuals y such that ||x − y|| < δ). In Chapter 3, we use pairwise constraints to model the temporal information, when trying to contiguously segment series of observations. We add must-link constraints between each pair of observations belonging to the same entity and inflict a penalty inversely proportional to their time difference when the constraint is broken.
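The translation of class labels into pairwise constraints and the transitive closure of the must-link set can be sketched in a few lines of Python; the point identifiers and class labels below are purely illustrative.

from itertools import combinations

def labels_to_constraints(labels):
    """labels: dict {point_id: class_label} for the labeled subset only."""
    must, cannot = set(), set()
    for (a, la), (b, lb) in combinations(labels.items(), 2):
        (must if la == lb else cannot).add((a, b))
    return must, cannot

def transitive_closure(must_links, points):
    """Union-find over the must-link set; returns the enriched (closed) set."""
    parent = {p: p for p in points}
    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p
    for a, b in must_links:
        parent[find(a)] = find(b)
    groups = {}
    for p in points:
        groups.setdefault(find(p), []).append(p)
    closed = set()
    for members in groups.values():
        closed.update(combinations(sorted(members), 2))
    return closed

if __name__ == "__main__":
    labeled = {"x1": "crisis", "x2": "crisis", "x3": "growth"}
    must, cannot = labels_to_constraints(labeled)
    print(must)    # {('x1', 'x2')}
    print(cannot)  # {('x1', 'x3'), ('x2', 'x3')}
    print(transitive_closure({("x1", "x2"), ("x2", "x4")}, ["x1", "x2", "x3", "x4"]))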

Partial information vs. complete information Introducing external information into clustering is not a new domain. One of the oldest applications is clustering geographic-related data, using clustering with spatial contiguity constraints. There is a fundamental difference between such approaches and the semi-supervised approaches: the geographic information is available for all individuals, whereas in semi-supervised clustering the supervision is available only for a subset of individuals. Consequently, in clustering with contiguity constraints, a quick solution is to modify the dissimilarity measure to take into account the geographic information. For example, [Webster & Burrough 1972] adds a factor to the dissimilarity measure:

d*_{ij} = d_{ij} × (1 − e^{−h_{ij}/w})

where d*_{ij} is the modified measure between individuals i and j, d_{ij} is the original value of the measure, h_{ij} is the geographic distance between i and j and w is a weighting factor. This type of supervision is a special case of the semi-supervised setting. The approaches presented in Section 2.2.1 also modify the similarity measure, but they use a supervised algorithm to leverage the partial information.
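The modified dissimilarity above is simple enough to transcribe directly in Python; the numeric values in the example are illustrative only.

import numpy as np

def contiguity_dissimilarity(d_ij, h_ij, w):
    """d_ij: original dissimilarity, h_ij: geographic distance, w: weighting factor."""
    return d_ij * (1.0 - np.exp(-h_ij / w))

# Two pairs with the same attribute dissimilarity but different geographic
# distances: the geographically close pair is made to look more similar.
print(contiguity_dissimilarity(d_ij=2.0, h_ij=0.5, w=1.0))   # ~0.79
print(contiguity_dissimilarity(d_ij=2.0, h_ij=5.0, w=1.0))   # ~1.99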

Taxonomy Traditional clustering algorithms employ a given similarity measure and a given search strategy in the solution space, in order to construct a partition of coherent clusters. The available supervised knowledge is leveraged by semi-supervised clustering algorithms to modify either (or both) the similarity measure or the search strategy. Therefore, semi-supervised clustering methods can be divided [Basu et al. 2003, Grira et al. 2005] into two classes: (a) the similarity-based approaches, which seek to learn new similarity measures in order to satisfy the constraints, and (b) the search-based approaches, in which the clustering algorithm itself is modified. The following two sections (2.2.1 and 2.2.2) present each type of approach in more detail, together with some examples present in the literature.

2.2.1 Similarity-based approaches


Figure 2.5 – General schema of similarity-based semi-supervised clustering approaches.

In similarity-based semi-supervised clustering approaches, an existing clustering algorithm that uses a similarity metric is employed. However, instead of using one of the existing predefined similarity measures [Lesot et al. 2009], the similarity metric used by these approaches is first trained to satisfy the labels or constraints in the supervised data. The general schema of this two-phase process is given in Figure 2.5. In phase 1, the dataset is used together with the must-link and cannot-link constraints to train the similarity measure.

In phase 2, the trained similarity measure is used with the dataset in a classical clustering algorithm.


Figure 2.6 – The description space before (a) and after (b) training the similarity distance, in similarity-based approaches.

Training a similarity metric is similar to modifying the description space in order to enlarge the frontiers between groups of clusters. Individuals which are must-linked are pulled closer together, whereas individuals which are cannot-linked are pushed apart. Figure 2.6b shows a simple example of learning a distance function from the constraints given in Figure 2.6a. Notice that in Figure 2.6b, the input data space has been stretched in the horizontal dimension and compressed in the vertical dimension, to draw the must-linked individuals closer and pull the cannot-linked individuals farther apart. Several distance measures have been used for distance-based constrained clustering:
– Mahalanobis distance trained with convex optimization [Xing et al. 2002, Bar-Hillel et al. 2003]. A parametrized Euclidean distance of the form ||x1 − x2||_A = √((x1 − x2)^T A (x1 − x2)) is used in [Xing et al. 2002]. A two-step optimization algorithm is used to fit the matrix A to the must-link and cannot-link constraints. New features are created that are linear combinations of the original features. The Relevant Component Analysis algorithm proposed in [Bar-Hillel et al. 2003] is similar, but uses a diagonal matrix instead, which boils down to simply assigning weights to the dimensions. In both approaches, the matrix A is responsible for deforming the description space according to the constraint set.
– Euclidean distance trained with a shortest path algorithm [Klein et al. 2002]. These methods start from the assumption that constraints suggest space-level generalizations beyond their explicit instance-level rules: not only should points that are must-linked be in the same cluster, but the points that are near these points (in the description space) should probably also be in the same cluster. A distance matrix between individuals is calculated, the distances of must-linked and cannot-linked individuals are modified and, in the end, the whole matrix is re-calculated by assigning as the distance between two individuals the length of the shortest path (a simplified sketch of this idea is given below).
– Kullback-Leibler divergence [Kullback & Leibler 1951] trained with gradient descent [Cohn et al. 2003]. Interaction with the user is used in [Cohn et al. 2003] to acquire the semi-supervised constraints.

A textual clustering is performed using a naive Bayes algorithm, with the KL divergence measuring the similarity of two documents. When the user considers that two documents were wrongly grouped in the same cluster, he/she adds a cannot-link between the two. The constraints are taken into account by augmenting the KL divergence with a positive weighting function. The procedure is then reiterated until the user is satisfied with the result.
One of the strong points of similarity-based approaches is that the additional information is taken into account at the level of the similarity measure. Once the measure is trained, any clustering algorithm can be used out of the box, without any modifications. The literature shows examples of several clustering algorithms which can be used with trained distance measures, including single-link [Bilenko & Mooney 2003] and complete-link [Klein et al. 2002] agglomerative clustering, EM [Cohn et al. 2003, Bar-Hillel et al. 2003], and K-Means [Bar-Hillel et al. 2003, Xing et al. 2002].
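As a simplified illustration of the space-level propagation idea attributed above to [Klein et al. 2002] (this is a sketch, not their exact algorithm), the following Python snippet sets the distances of must-linked pairs to zero, repairs the whole matrix with an all-pairs shortest-path pass, and only then imposes a large distance on the cannot-linked pairs; the points and constraints are invented.

import numpy as np

def constrained_distances(X, must_links, cannot_links, big=1e6):
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    for i, j in must_links:
        D[i, j] = D[j, i] = 0.0
    # Floyd-Warshall: the distance between two points becomes the length of the
    # shortest path, so must-link "shortcuts" pull whole neighbourhoods together.
    for k in range(n):
        D = np.minimum(D, D[:, [k]] + D[[k], :])
    # Simplification: cannot-links are imposed after the propagation step.
    for i, j in cannot_links:
        D[i, j] = D[j, i] = big
    return D

if __name__ == "__main__":
    X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0]])
    D = constrained_distances(X, must_links=[(1, 2)], cannot_links=[(0, 3)])
    print(D.round(2))   # points 0 and 2 are now close, via the must-linked pair (1, 2)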

Positioning our work Similarity-based semi-supervised clustering boils down to dividing the learning task into two: a smaller, simpler problem (i.e., learning a similarity distance) is solved in a supervised fashion, using the available supervision. The result of the supervised problem is later used in an unsupervised learning algorithm to improve the results of the bigger learning problem. We adopt a similar approach in Chapter 5, where we construct the “bag-of-features” visual vocabulary based on a small labeled image set. We employ a supervised algorithm to learn the visual words and an unsupervised algorithm to generate the actual numeric representation and the image clustering.

2.2.2 Search-based approaches

In search-based semi-supervised clustering approaches, the clustering algorithm itself is modified so that user-provided labels or constraints are used to bias the search for an appropriate partition. The literature shows several approaches to performing this bias:
– enforcing constraints [Wagstaff & Cardie 2000, Wagstaff et al. 2001]. Initial semi-supervised algorithms leveraged the additional information by enforcing the pairwise constraints during the clustering process. These are K-Means-like algorithms, which modify the cluster assignment step. Rather than performing a nearest centroid assignment, a nearest feasible centroid assignment is performed. This means that constraints are never broken (such constraints are called hard pairwise constraints). While such approaches have shown accuracy improvements, they are particularly sensitive to noisy and contradictory constraints [Davidson & Basu 2007] (which are to be expected in real-life labeling). Another problem of such approaches is that they are unstable with regard to the order of presentation of the constraints (i.e., the results vary greatly if the order is changed) [Hong & Kwong 2009].
– objective function modification [Demiriz et al. 1999, Gao et al. 2006, Lin & Hauptmann 2006]. In most relocating clustering algorithms, the search in the solution space is guided by the employed objective function. Most semi-supervised clustering algorithms modify the objective function in order to take supervision into account. In [Demiriz et al. 1999], an extra term is added to quantify the impurity of constructed clusters in terms of known class labels. The proposed clustering algorithm

minimizes the objective function

J = β × Cluster_Dispersion + α × Cluster_Impurity

where the parameters α and β control the impact of the supervision. The problems shown by the previous class of algorithms (i.e., algorithms enforcing constraints) can be solved by allowing constraints to be broken, in which case a penalty is inflicted. Such constraints are called soft pairwise constraints. In [Basu et al. 2003, Gao et al. 2006, Lin & Hauptmann 2006], the must-link and cannot-link pairwise constraints are also leveraged by modifying the objective function. The objective function includes penalization terms for each type of constraint and has the following form:

J = Σ_{xi∈X} ||xi − µ_li||² + Σ_{(xi,xj)∈M} w_ij · 1[li ≠ lj] + Σ_{(xi,xj)∈C} w̄_ij · 1[li = lj]    (2.1)

where:
– X is the given dataset;
– li is the cluster of individual xi;
– µ_li is the centroid of cluster li;
– M is the set of must-link constraints;
– C is the set of cannot-link constraints;
– w_ij is the weight of the must-link constraint between xi and xj;
– w̄_ij is the weight of the cannot-link constraint between xi and xj;
– 1[state] is a function that returns 1 if state is true and 0 otherwise.
Using such an objective function, the clustering algorithm converges towards a solution in which the partition breaks as few constraints as possible (this objective is illustrated in the short sketch below). In Chapter 3, we propose a temporal-aware constrained clustering algorithm which follows a similar approach.
– seeding [Basu et al. 2002]. The local optimum reached by a relocating clustering algorithm, such as K-Means, is influenced by the initial choice of centroids. Random initialization is usually used in “unsupervised” clustering, but in semi-supervised clustering this can be done using the supervised information, in a two-phase process. In the first phase, sets of individuals that should belong to the same cluster are generated through the transitive closure of the must-link constraints. In the second phase, centroids are initialized based on each of the computed sets and a classical clustering algorithm is used to further improve them, in order to attain a partition that respects the given constraints.
The different approaches presented earlier are not incompatible with one another. In [Basu et al. 2003], for example, a similarity-based approach using a trained Mahalanobis measure is combined with a search-based approach, in which the objective function is modified to penalize breaking the constraints. The two are integrated into a K-Means-like approach, in which the centroids are initialized with a seeding strategy, as described above.
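As an illustration of how the penalized objective of Equation (2.1) is computed, the following Python sketch sums the squared distances to the assigned centroids and adds the weights of the broken must-link and cannot-link constraints; the data, assignments and constraint weights are invented for the example.

import numpy as np

def constrained_objective(X, labels, centroids, must, cannot, w_ml, w_cl):
    """must / cannot: lists of index pairs; w_ml / w_cl: dicts of constraint weights."""
    J = sum(np.sum((X[i] - centroids[labels[i]]) ** 2) for i in range(len(X)))
    J += sum(w_ml[(i, j)] for i, j in must if labels[i] != labels[j])    # broken must-links
    J += sum(w_cl[(i, j)] for i, j in cannot if labels[i] == labels[j])  # broken cannot-links
    return J

if __name__ == "__main__":
    X = np.array([[0.0, 0.0], [0.2, 0.0], [4.0, 4.0], [4.2, 4.0]])
    centroids = np.array([[0.1, 0.0], [4.1, 4.0]])
    labels = [0, 1, 1, 1]                       # point 1 is (wrongly) placed in cluster 1
    must, cannot = [(0, 1)], [(2, 3)]
    w_ml, w_cl = {(0, 1): 10.0}, {(2, 3): 10.0}
    print(constrained_objective(X, labels, centroids, must, cannot, w_ml, w_cl))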

Positioning our work In Chapter 3, we propose a temporal-aware constrained clustering algorithm, which uses a semi-supervised technique to ensure the contiguous segmentation of observations. We follow a search-based approach and we guide the clustering algorithm by modifying the objective function. In order to ensure the contiguous segmentation of the temporal observations of an entity, we add soft pairwise must-link constraints between all observations belonging to the same entity. We penalize breaking these constraints using a penalization function which depends on the time difference between the constrained observations (the w_ij in Equation 2.1, while the w̄_ij are set to zero).

Chapter 3

Detecting Typical Evolutions

Contents
3.1 Learning task and motivations . . . . . 29
3.2 Formalisation . . . . . 31
3.3 Related work . . . . . 34
3.4 Temporal-Driven Constrained Clustering . . . . . 34
3.4.1 The temporal-aware dissimilarity measure . . . . . 35
3.4.2 The contiguity penalty function . . . . . 37
3.4.3 The TDCK-Means algorithm . . . . . 38
3.4.4 Fine-tuning the ratio between components . . . . . 40
3.5 Experiments . . . . . 43
3.5.1 Dataset . . . . . 43
3.5.2 Qualitative evaluation . . . . . 43
3.5.3 Evaluation measures . . . . . 45
3.5.4 Quantitative evaluation . . . . . 46
3.5.5 Impact of parameters β and δ . . . . . 49
3.5.6 The tuning parameter α . . . . . 50
3.6 Current work: Role identification in social networks . . . . . 52
3.6.1 Context . . . . . 52
3.6.2 The framework for identifying social roles . . . . . 53
3.6.3 Preliminary experiments . . . . . 54
3.7 Conclusion and future work . . . . . 57

3.1 Learning task and motivations

We consider that leveraging the temporal dimension in the learning process is a crucial point and one of the central research challenges of this thesis. The temporal dimension is more than just another descriptive dimension of the data, since it profoundly changes the learning problem. The description of the data becomes contextualized (i.e., a certain description is true during a given time frame) and new learning problems arise: following the temporal evolution of individuals, detecting trends and topic burstiness, tracking popular events, etc. The temporal dimension is intimately related to the interactive Web 2.0, where time is primordial for many learning tasks. In Section 2.1.3 (p. 16), we have discussed the difference between temporal algorithms and online (“on the fly”) algorithms. Online algorithms have seen a lot of attention, especially given the huge amounts of data produced on the Web 2.0. These data cannot be stored and the learning process must be done using single-pass algorithms.


Figure 3.1 – Streamlined schema of how temporal information is used.

The work presented in this chapter tackles one of our central research challenges, i.e., dealing with the temporal dimension of complex data, by employing semi-supervised clustering techniques. The solutions and the algorithms presented in the following sections were developed as an answer to a specific learning task: detecting typical evolution patterns. This specific problem was originally motivated by the research interests of the Political Studies researchers involved in the Imagiweb project (see Annex A). We show in Section 3.6 that the application of our proposals is not limited to Political Studies datasets. We also employ our temporal clustering algorithm in a learning task issued from the domain of Social Network Analysis: detecting user social roles in a social network inferred from online discussion forums. Our work concerning the temporal dimension does not concern the “on the fly” aspect. Our data are stored and we study them offline (a posteriori), in order to detect evolutions.

Motivation. Researchers in Political Studies have always gathered data and compiled databases of information. This information often has a temporal component: the evolution of a certain number of entities is recorded over a period of time. The idea is to inject the temporal component of the complex data into an automatic learning algorithm and detect typical evolution patterns. This is particularly interesting for Political Science researchers, since it would allow detecting hidden connections between the evolution of certain entities and certain events that took place later in time (e.g. the rise of extremist political leaders and later wars). Such an algorithm would also assist the researchers in the process of creating entity typologies: the classification of an entity at a certain moment in time is closely related to its previous evolution. This is not limited to Political Sciences, but can be generalized to many other fields of Social Sciences and Humanities. In Psychology, it is well known that the present mental state of a patient is very much influenced by traumatic events in his/her past.

We highlight in Figure 3.1 a schematic representation of the work performed in this chapter. The data are described in a numeric vectorial space and the main idea is to leverage the temporal dimension of the complex data together with its description. The temporal information is used at two levels: it is (a) embedded directly into the distance used to measure the similarity between instances and (b) injected into the clustering algorithm using semi-supervised techniques. In the case of the political science dataset, the results of the learning algorithm are typical evolutions. In Section 3.6, we exemplify the use of our proposal on another kind of data: social network data. This dataset describes the daily activity of users (the entities) on a web forum. The goal is to identify user roles in the underlying social network. Observations are the working blocks of our algorithm (i.e., observations are descriptions of entities at a given timestamp). Therefore, our work deals with clustering observations, i.e., tuples (entity, timestamp, description), while taking into account the temporal component and contiguously segmenting the observations belonging to an entity. The remainder of this chapter is organized as follows. In Section 3.2, we formalize the structure of the dataset and the objectives of our work. We also give an overview of the proposed solution. In Section 3.3 we present some existing work related to this specific problem and, in Section 3.4, we present our approach. We introduce the temporal-aware dissimilarity function and the contiguity penalty function, and we combine them in the TDCK-Means algorithm. In Section 3.5, we present the dataset that we use, the proposed evaluation measures and the obtained results. In Section 3.6, we apply our algorithms to a Social Network Analysis task and we show how the TDCK-Means algorithm can be used to detect behavioral roles, which are in turn used to detect user social roles. Finally, in Section 3.7, we draw conclusions and outline some future extensions.

3.2 Formalisation

We consider that the data are described in a numeric vectorial space (see Section 2.1, p. 9). It is not uncommon for Social Sciences and Humanities scientists to compile such datasets, which can be converted to a machine-readable format with minimal intervention. We give just a few examples of such datasets that are publicly accessible:
– Compared Political Dataset I [Armingeon et al. 2011]: the evolution of 23 democratic countries over a period of 50 years. [Available online] 1;
– Democracy Time-series Data [Norris 2008]: contains data on the social, economic and political characteristics of 191 countries, with over 600 variables, from 1971 to 2007. [Available online] 2;
– Archigos [Goemans et al. 2009]: a dataset with information on leaders in 188 countries from 1875 to 2004. [Available online] 3.

The dataset we analyze describes a set of entities φj ∈ Φ. Each entity is described, for each considered moment of time tm ∈ T, using multiple attributes, which form the multidimensional description space D. Therefore, an entry in such a database is an observation, a triple (entity, timestamp, description). An observation xi = (φl, tm, xi^d) signifies that the entity φl is described by the vector xi^d at the moment of time tm.

1. http://www.ipw.unibe.ch/content/team/klaus_armingeon/comparative_political_data_sets
2. http://www.nsd.uib.no/macrodataguide/set.html?id=56&sub=1
3. http://www.rochester.edu/college/faculty/hgoemans/data.htm


Figure 3.2 – Desired output: (a) the evolution phases and the entity trajectories, (b) the observations of 3 entities contiguously partitioned into 5 clusters.

We denote by xi^φ the entity to which the observation xi is associated. Similarly, xi^t is the timestamp associated with the observation xi. Each observation belongs to a single entity and, consequently, each entity is associated with multiple observations, at different moments of time. Formally:

∀ xi ∈ D : ∃! φl ∈ Φ so that xi^φ = φl
∀ (φl, tm) ∈ Φ × T : ∃! xi = (xi^φ, xi^t, xi^d) so that xi^φ = φl and xi^t = tm

For example, the database that studies the evolution of democratic states [Armingeon et al. 2011] stores, for each country and each year, the values of multiple economical, social, political and financial indicators. The countries are the entities, and the years are the timestamps. Starting from such a database, one of the interests of Political Studies researchers is to detect typical evolution patterns. The interest is twofold: a) obtaining a broader understanding of the phases that the entity collection went through over time (e.g. detecting periods of global political instability, of economic crisis, of wealthiness etc.); b) constructing the trajectory of an entity through the different phases (e.g. a country may have gone through a period of military dictatorship, followed by a period of wealthy democracy). The criteria describing each phase are not known beforehand (which indicators announce a world economic crisis?) and may differ from one phase to another. We address these issues by proposing a novel temporal-driven constrained clustering algorithm. The proposed algorithm partitions the observations into clusters µj ∈ M that are coherent both in the multidimensional description space and in the temporal space. We consider that the obtained clusters can be used to represent the typical phases of the evolution of the entities through time. Figure 3.2 shows the desired result of our clustering algorithm. Each of the three depicted entities (φ1, φ2 and φ3) is described at 10 moments of time (tm, m = 1, 2, ..., 10). The 30 observations of the dataset are partitioned into 5 clusters (µj, j = 1, 2, ..., 5). In Figure 3.2a, we observe how the clusters µj are organized in time. Each cluster has a limited extent in time, and the time extents of clusters can overlap. The temporal extent of a cluster is the minimal interval of time that contains all the timestamps of the observations in that cluster. The entities navigate through clusters. When an observation belonging to an entity is assigned to cluster µ2 and the previous observation of the same entity is assigned to cluster µ1, we consider that the entity has a transition from phase µ1 to phase µ2. Figure 3.2b shows how the series of observations belonging to each entity are assigned to clusters, thus forming continuous segments. This succession of segments is interpreted as the succession of phases through which the entity passes. For this succession to be meaningful, each entity should be assigned to a rather limited number of continuous segments. Passing through too many phases reduces comprehensibility. Similarly, evolutions which are alternations between two phases (e.g., µ1 → µ2 → µ1 → µ2) hinder comprehension. Figure 3.3 shows an example of such an alternating evolution (the bad segmentation) and a more comprehensible evolution where there is only one alternation (the good segmentation).


Figure 3.3 – Examples of a good and a bad segmentation into phases.

Based on these observations, we assume that the resulting partition must:
– regroup observations having similar descriptions into the same cluster (just as traditional clustering does). The clusters represent a certain type of evolution;
– create temporally coherent clusters, with limited extent in time. In order for a cluster to be meaningful, it should regroup observations which are temporally close (i.e., be contiguous on the temporal dimension). If there are two different periods with similar evolutions (e.g. two economic crises), it is preferable to have them regrouped separately, as they represent two distinct phases. Furthermore, while it is acceptable that some evolutions span the entire period, the resulting clusters should usually have a limited temporal extent;
– segment, as contiguously as possible, the series of observations of each entity. The sequence of segments will be interpreted as the sequence of phases through which the entity passes.
In order to construct such a partition, we propose a new temporal-aware dissimilarity measure that takes into account the temporal dimension. Observations that are close in the description space, but distant in time, are considered dissimilar. We also propose a method to enforce the segmentation contiguity, by introducing a penalty term based on the Normal Distribution Function. We combine the two propositions into a novel temporal-driven constrained clustering algorithm, TDCK-Means, which creates a partition of coherent clusters, both in the multidimensional space and in the temporal space. This algorithm uses soft semi-supervised constraints to encourage adjacent observations belonging to the same entity to be assigned to the same cluster. The proposed algorithm constructs the clusters that serve as evolution phases and segments the observation series of each entity. At the moment, our algorithm does not output the graph structure represented in Figure 3.2a. A promising avenue for organizing the clusters as a graph, rather than just as a post-processing step based on the clustering results obtained using TDCK-Means, is addressed in Section 3.7 and, more thoroughly, in Chapter 8.

3.3 Related work

The semi-supervised techniques seen in Section 2.2 are not limited to introducing expert knowledge into the clustering process. They can be used to inject any type of information external to the dataset. Previous works which model temporal information using semi-supervised tools already exist. The literature presents some examples of algorithms used to segment a series of observations into continuous chunks. In [Lin & Hauptmann 2006], the daily tasks of a user are detected by segmenting scenes from the recordings of his activities. Semi-supervised must-link constraints are set between all pairs of observations, and a fixed penalty is inflicted when the following conditions are fulfilled simultaneously: the observations are not assigned to the same cluster and the time difference between their timestamps is less than a certain threshold. A similar technique is used in [De la Torre & Agell 2007], where constraints are used to penalize non-smooth changes (over time) in the assigned clusters. This segmenting technique is used to detect tasks performed during a day, based on video, sound and GPS information. In [Sanders & Sukthankar 2001], the objects appearing in an image sequence are detected by using a hierarchical descending clustering that regroups pixels into large temporally coherent clusters. This method seeks to maximize the cluster size, while guaranteeing intra-cluster temporal consistency. All of these techniques consider only one series of observations (a single entity) and must be adapted for the case of multiple series. The main problem of a threshold-based penalty function is to set the value of the threshold, which is usually data-dependent. Optimal matching is used in [Widmer & Ritschard 2009] to discover trajectory models, while studying the de-standardization of typical life courses. The temporal dimension of the data is also used in some other fields of Information Retrieval. In [Talukdar et al. 2012], constrained clustering is used to scope temporal relational facts in knowledge bases, by exploiting temporal containment, alignment, succession, and mutual exclusion constraints among facts. In [Chen et al. 2009], clustering is used to segment temporal observations into continuous chunks, as a preprocessing phase. A graphical model is proposed in [Qamra et al. 2006], in which the timestamp is part of the observed variables and the story is the hidden variable to be inferred. Still, none of these approaches seeks to create temporally coherent partitions of the data; they mainly use the temporal dimension as secondary information. In the following sections, we propose a dissimilarity measure, a penalty function and a clustering algorithm in which the temporal dimension has a central role, and which address the limitations of the above-presented work.

3.4 Temporal-Driven Constrained Clustering

The observations xi ∈ X that need to be structured can be written as triples (entity, time, description): xi = (xi^φ, xi^t, xi^d). xi^d ∈ D is the vector in the multidimensional description space which describes the entity xi^φ ∈ Φ at the moment of time xi^t ∈ T. Traditional clustering algorithms input a set of multidimensional vectors, which they regroup in such a way that observations inside a group resemble each other as much as possible, and resemble observations in other groups as little as possible. K-Means [MacQueen 1967] is a clustering algorithm based on iterative relocation, which partitions a dataset into k clusters,

locally minimizing the sum of distances between each data point xi and its assigned cluster centroid µj ∈ M. At each iteration, the objective function

I = Σ_{µj∈M} Σ_{xi∈Cj} ||xi^d − µj^d||²

is minimized until it reaches a local optimum. Such a system is appropriate for constructing partitions based solely on xi^d, the description in the multidimensional space. It does not take into account the temporal order of the observations, nor the structure of the dataset, i.e., the fact that observations belong to entities. We extend it to the temporal case by adding to the centroids a temporal dimension µj^t, described in the same temporal space T as the observations. Just like its multidimensional description vector µj^d, the temporal component does not necessarily need to exist in the temporal set of the observations. It is an abstraction of the temporal information in the group, serving as a cluster timestamp. Therefore, a centroid µj will be the couple (µj^t, µj^d). We propose to adapt the K-Means algorithm to the temporal case by adapting the Euclidean distance, normally used to measure the distance between an element and its centroid. This novel temporal-aware dissimilarity measure takes into account both the distance in the multidimensional space and the distance in the temporal space. In order to ensure the temporal contiguity of observations for the entities, we add a penalty whenever two observations that belong to the same entity are assigned to different clusters. The penalty depends on the time difference between the two: the lower the difference, the higher the penalty. We integrate both into the Temporal-Driven Constrained K-Means (TDCK-Means), which is a temporal extension of K-Means. TDCK-Means seeks to minimize the following objective function:

J = Σ_{µj∈M} Σ_{xi∈Cj} ( ||xi − µj||_TA + Σ_{xk∉Cj, xk^φ = xi^φ} w(xi, xk) )    (3.1)

where ||•||_TA is our temporal-aware (TA) dissimilarity measure (detailed in the next section), w(xi, xj) is the cost function that determines the penalty of clustering adjacent observations of the same entity into different clusters, and Cj is the set of observations in cluster j.
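To make the structure of Equation (3.1) concrete, the following Python sketch computes J for a toy set of (entity, timestamp, description) observations. The dissimilarity and penalty functions passed in are simple placeholders, not the measures defined in Sections 3.4.1 and 3.4.2, and the toy data are invented.

def tdck_objective(observations, labels, centroids, dissim_ta, penalty_w):
    """observations: list of (entity, timestamp, description) tuples;
    labels[i]: index of the cluster of observation i; centroids: one per cluster."""
    J = 0.0
    for i, x_i in enumerate(observations):
        J += dissim_ta(x_i, centroids[labels[i]])
        for k, x_k in enumerate(observations):
            # Penalize observations of the same entity assigned to other clusters.
            if labels[k] != labels[i] and x_k[0] == x_i[0]:
                J += penalty_w(x_i, x_k)
    return J

if __name__ == "__main__":
    obs = [("FR", 1990, 0.1), ("FR", 1991, 0.2), ("DE", 1990, 3.0)]
    centroids = [(None, 1990, 0.15), (None, 1990, 3.0)]
    labels = [0, 1, 1]                  # the two FR observations are split up
    # Placeholder dissimilarity and penalty, for illustration only.
    dissim = lambda x, c: (x[2] - c[2]) ** 2 + 0.01 * (x[1] - c[1]) ** 2
    penalty = lambda x, y: 1.0 / (1.0 + abs(x[1] - y[1]))   # stronger when close in time
    print(tdck_objective(obs, labels, centroids, dissim, penalty))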

3.4.1 The temporal-aware dissimilarity measure

The proposed temporal-aware dissimilarity measure ||x_i − x_j||_{TA} combines the Euclidean distance in the multidimensional space D and the distance between the timestamps. We propose to use the following formula:

||x_i - x_j||_{TA} = 1 - \left(1 - \frac{||x_i^d - x_j^d||^2}{\Delta x_{max}^2}\right) \left(1 - \frac{||x_i^t - x_j^t||^2}{\Delta t_{max}^2}\right)    (3.2)

where ||\cdot|| is the classical L2 norm and \Delta x_{max} and \Delta t_{max} are the diameters of D and T, respectively (the largest distance encountered between two observations in the multidimensional description space and, respectively, in the temporal space). The following properties are immediate:


Figure 3.4 – Color map of the temporal-aware dissimilarity measure as a function of the multidimensional component and the temporal component.

– ||x_i − x_j||_{TA} ∈ [0, 1], ∀ x_i, x_j ∈ X
– ||x_i − x_j||_{TA} = 0 ⇔ x_i^d = x_j^d and x_i^t = x_j^t
– ||x_i − x_j||_{TA} = 1 (maximum) ⇔ ||x_i^d − x_j^d|| = Δx_max or ||x_i^t − x_j^t|| = Δt_max

Figure 3.4 plots the temporal-aware dissimilarity measure as a color map, depending on the multidimensional component and the temporal component. The horizontal axis represents the normalized multidimensional distance ||x_i^d − x_j^d||² / Δx²_max. The vertical axis represents the normalized temporal distance ||x_i^t − x_j^t||² / Δt²_max. The blue color shows a temporal-aware measure close to the minimum and the red color represents the maximum. The dissimilarity measure is zero if and only if the two observations have equal timestamps and equal multidimensional description vectors. Still, it suffices for only one of the components (temporal, multidimensional) to attain its maximum value for the measure to reach its maximum. The measure behaves similarly to a MAX operator, always choosing a value closer to the maximum of the two components. The formula for the temporal-aware dissimilarity measure was chosen so that any algorithm that seeks to minimize an objective function based on this measure will need to minimize both of its components. This makes it suitable for algorithms that seek to minimize both the multidimensional and the temporal variance of clusters.

Both components that intervene in the measure follow a function of the form 1 − ε², ε ∈ [0, 1]. This function provides a good compromise: it is tolerant for small values of ε (small time difference, small multidimensional distance), but decreases rapidly as ε grows. The temporal-aware dissimilarity measure is an extension of the Euclidean distance: if the timestamps are unknown and set to be all equal, the temporal component is canceled and the temporal-aware dissimilarity measure becomes a normalized Euclidean distance. In Section 3.5.4, we evaluate the behavior of the proposed dissimilarity function. We will call Temporal-Driven K-Means the algorithm that is based on K-Means' iterative structure and uses the temporal-aware dissimilarity measure to assess similarity between observations. Notice that Temporal-Driven K-Means, relative to TDCK-Means, has no contiguity penalty function (the contiguity penalty function is detailed in the next section).
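To make the behavior of the measure concrete, the following minimal Python sketch (not the thesis implementation; all names are illustrative) computes Equation (3.2) for two observations, assuming scalar timestamps and that the diameters Δx_max and Δt_max have been precomputed on the dataset.

```python
import numpy as np

def temporal_aware_dissim(xd_i, t_i, xd_j, t_j, dx_max, dt_max):
    """Equation (3.2): 1 - (1 - d^2/dx_max^2)(1 - dt^2/dt_max^2)."""
    d2 = np.sum((xd_i - xd_j) ** 2) / dx_max ** 2   # normalized squared Euclidean distance
    t2 = (t_i - t_j) ** 2 / dt_max ** 2             # normalized squared time difference
    return 1.0 - (1.0 - d2) * (1.0 - t2)

# two observations described in R^2, with yearly timestamps
x1, t1 = np.array([0.2, 0.4]), 1975
x2, t2 = np.array([0.3, 0.1]), 1980
print(temporal_aware_dissim(x1, t1, x2, t2, dx_max=1.0, dt_max=50.0))
```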


Figure 3.5 – Penalty function vs. time difference for multiple values of δ (β = 1).

3.4.2 The contiguity penalty function

The penalty function encourages temporally adjacent observations of the same entity to be assigned to the same cluster. We use the notion of soft pairwise constraints from semi-supervised clustering. A "must-link" soft constraint is added between all pairs of observations belonging to the same entity. The clustering is allowed to break the constraints, while inflicting a penalty for each of these violations. The penalty is more severe if the observations are closer in time. The function is defined as:

w(x_i, x_j) = \beta \, e^{-\frac{1}{2}\left(\frac{||x_i^t - x_j^t||}{\delta}\right)^2} \, \mathbb{1}\left[x_i^\phi = x_j^\phi\right]    (3.3)

where β is a scaling factor and, at the same time, the maximum value taken by the penalty function; δ is a parameter which controls the width of the function. β is dataset-dependent and can be set as a percentage of the average distance between observations. 1[statement] is a function that returns 1 if statement is true and 0 otherwise.

The function resembles the positive side of the normal distribution density, centered at zero. It has a particular shape, as represented in Figure 3.5: for small time differences, it descends very slowly, thus inflicting a high penalty for breaking a constraint; as the time difference increases, the penalty decreases rapidly, converging towards zero. When δ is small, the function's value descends very quickly with the time difference, and penalties are produced only if the constraint is broken for adjacent observations. For high values of δ, breaking constraints between distant observations also causes high penalties, therefore creating segmentations with large segments. Figure 3.5 shows the evolution of the penalty function with the time difference between two observations, for multiple values of δ and for β = 1.

An advantage of the proposed function is that it requires neither time discretization nor a fixed window width, as proposed in [Lin & Hauptmann 2006]; the δ parameter allows fine-tuning the penalty function. In Section 3.5.4, we evaluate Constrained K-Means, an extension of K-Means to which we add the proposed contiguity penalty function (but which does not take into account the temporal dimension when measuring the distance between observations). The influence of both β and δ is studied in Section 3.5.5.
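A minimal sketch of Equation (3.3), assuming scalar timestamps; the default values of β and δ shown here are those used later in the experiments, and all names are illustrative rather than taken from the thesis implementation.

```python
import numpy as np

def contiguity_penalty(entity_i, t_i, entity_j, t_j, beta=0.003, delta=3.0):
    """Equation (3.3): soft must-link penalty between observations of the same entity."""
    if entity_i != entity_j:          # the indicator term 1[x_i^phi = x_j^phi]
        return 0.0
    return beta * np.exp(-0.5 * ((t_i - t_j) / delta) ** 2)

# the closer the two observations are in time, the higher the penalty
print(contiguity_penalty("France", 1980, "France", 1981))   # close in time: penalty near beta
print(contiguity_penalty("France", 1980, "France", 1995))   # far in time: penalty near zero
print(contiguity_penalty("France", 1980, "Italy", 1981))    # different entities: no penalty
```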

3.4.3 The TDCK-Means algorithm

The time-dependent distance ||x_i − µ_j||_{TA} encourages the decrease of both the temporal and the multidimensional variance of clusters, while the penalty function w(x_i, x_j) favors the assignment of adjacent observations belonging to the same entity to the same cluster. The rest of the TDCK-Means algorithm is similar to K-Means: it seeks to minimize J by iterating an assignment phase and a centroid update phase until the partition does not change between two iterations. The outline of the algorithm is given in Algorithm 1. The choose_random function chooses randomly, for each centroid µ_j, an observation x_i and sets µ_j = (x_i^t, x_i^d). Furthermore, in the initialization phase we perform a K-Means iteration in order to compute an initial assignment of observations to clusters. For each observation, the best_initial_cluster function solves the following equation:

best_initial_cluster(i) = \operatorname{argmin}_{j=1,2,\dots,k} ||x_i - \mu_j^{(iter-1)}||_{TA}^2    (3.4)

therefore assigning each observation to the closest centroid, in terms of the temporal-aware dissimilarity measure. In the assignment phase, for every observation x_i, the best_cluster function chooses a cluster C_j so that the temporal-aware dissimilarity measure from x_i to the cluster's centroid µ_j, added to the cost of the penalties possibly incurred by this cluster assignment, is minimized. This function is similar to the best_initial_cluster function in Equation (3.4), to which the penalty term is added. It amounts to solving the following equation:

best_cluster(i) = \operatorname{argmin}_{j=1,2,\dots,k} \Big[ ||x_i - \mu_j^{(iter-1)}||_{TA}^2 + \sum_{\substack{x_k \notin C_j^{(iter-1)} \\ x_k^\phi = x_i^\phi}} w(x_i, x_k) \Big]    (3.5)

This guarantees that the contribution of x_i to the value of J diminishes or stays constant; overall, this ensures that J diminishes (or stays constant) in the assignment phase. It is noteworthy that, in Equation (3.5), the assignment of an observation in the current iteration depends on the assignment of the other observations in the previous iteration. This is also the reason why an extra K-Means iteration was necessary in the initialization of the algorithm and why the best_initial_cluster function was defined. In the centroid update phase, the update_centroid function recalculates the cluster centroids using the observations in X and the assignment at the previous iteration; therefore, the contribution of each cluster to the J function is minimized. The temporal and the multidimensional components are each calculated individually. In order to find the values that minimize the objective function, we need to solve the equations:

\frac{\partial J}{\partial \mu_j^d} = 0 \; ; \; \frac{\partial J}{\partial \mu_j^t} = 0    (3.6)

Algorithm 1 Outline of the TDCK-Means algorithm.
Input: x_i ∈ X – observations to cluster; k – number of requested clusters.
Output: C_j, j = 1, 2, ..., k – the k clusters; µ_j, j = 1, 2, ..., k – the centroid of each cluster.

for j = 1, 2, ..., k do
    µ_j ← choose_random(X)
    C_j^(0) ← ∅
for x_i ∈ X do
    C_j^(0) ← C_j^(0) ∪ {x_i}, where j = best_initial_cluster(X, M^(0))
iter ← 0
M^(iter) ← {µ_j | j = 1, 2, ..., k}    // set of centroids
P^(iter) ← {C_j^(0) | j = 1, 2, ..., k}    // set of clusters
repeat
    iter ← iter + 1
    for j = 1, 2, ..., k do
        C_j^(iter) ← ∅
    // assignment phase
    for x_i ∈ X do
        C_j^(iter) ← C_j^(iter) ∪ {x_i}, where j = best_cluster(X, M^(iter−1), P^(iter−1))
    // centroid update phase
    for j = 1, 2, ..., k do
        (µ_j^{d,(iter)}, µ_j^{t,(iter)}) ← update_centroid(j, X, M^(iter−1), P^(iter−1))
    M^(iter) ← {µ_j^(iter) | j = 1, 2, ..., k}
    P^(iter) ← {C_j^(iter) | j = 1, 2, ..., k}
until C_j^(iter) = C_j^(iter−1), ∀ j ∈ [1, k]
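For concreteness, the sketch below implements the assignment step of Equation (3.5) under simplifying assumptions (scalar timestamps, entities given as labels, the previous assignment given as a list of cluster indices); the helper and variable names are illustrative and do not come from the thesis implementation.

```python
import numpy as np

def ta_dissim(xd, t, mu_d, mu_t, dx_max, dt_max):
    d2 = np.sum((xd - mu_d) ** 2) / dx_max ** 2
    t2 = (t - mu_t) ** 2 / dt_max ** 2
    return 1.0 - (1.0 - d2) * (1.0 - t2)

def penalty(t_i, t_k, beta, delta):
    return beta * np.exp(-0.5 * ((t_i - t_k) / delta) ** 2)

def best_cluster(i, X_d, X_t, X_ent, prev_assign, centroids,
                 dx_max, dt_max, beta, delta):
    """Pick the cluster minimizing the (squared) TA dissimilarity to the centroid plus
    the penalties incurred by same-entity observations left in other clusters
    (using their assignment from the previous iteration)."""
    same_entity = [k for k in range(len(X_ent)) if X_ent[k] == X_ent[i] and k != i]
    costs = []
    for j, (mu_d, mu_t) in enumerate(centroids):
        cost = ta_dissim(X_d[i], X_t[i], mu_d, mu_t, dx_max, dt_max) ** 2
        cost += sum(penalty(X_t[i], X_t[k], beta, delta)
                    for k in same_entity if prev_assign[k] != j)
        costs.append(cost)
    return int(np.argmin(costs))

# toy usage: three observations of one entity, two candidate centroids
X_d = np.array([[0.1, 0.2], [0.15, 0.25], [0.9, 0.8]])
X_t = np.array([1970.0, 1971.0, 1995.0])
X_ent = ["FR", "FR", "FR"]
centroids = [(np.array([0.12, 0.22]), 1970.5), (np.array([0.9, 0.8]), 1995.0)]
print(best_cluster(0, X_d, X_t, X_ent, [0, 0, 1], centroids, 2.0, 50.0, 0.003, 3.0))
```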

By substituting equations (3.2) and (3.3) into (3.1), we obtain the following formula for the objective function:

J = |X| - \sum_{j=1}^{k} \sum_{x_i \in C_j} \left(1 - \frac{||x_i^d - \mu_j^d||^2}{\Delta x_{max}^2}\right) \left(1 - \frac{||x_i^t - \mu_j^t||^2}{\Delta t_{max}^2}\right) + \sum_{x_i \in X} \sum_{x_k \notin C_j} \beta \, e^{-\frac{1}{2}\left(\frac{||x_i^t - x_k^t||}{\delta}\right)^2} \mathbb{1}\left[x_i^\phi = x_k^\phi\right]    (3.7)

We exemplify the calculation for the update formula of µ_j^t:

\frac{\partial J}{\partial \mu_j^t} = 0 \;\Leftrightarrow\; \frac{\partial}{\partial \mu_j^t} \left[ \sum_{l=1}^{k} \sum_{x_i \in C_l} \left(1 - \frac{||x_i^d - \mu_l^d||^2}{\Delta x_{max}^2}\right) \left(1 - \frac{(x_i^t - \mu_l^t)^2}{\Delta t_{max}^2}\right) \right] = 0

\Leftrightarrow\; \frac{2}{\Delta t_{max}^2} \sum_{x_i \in C_j} \left(x_i^t - \mu_j^t\right) \left(1 - \frac{||x_i^d - \mu_j^d||^2}{\Delta x_{max}^2}\right) = 0

\Leftrightarrow\; \sum_{x_i \in C_j} \left(x_i^t - \mu_j^t\right) \left(1 - \frac{||x_i^d - \mu_j^d||^2}{\Delta x_{max}^2}\right) = 0

\Leftrightarrow\; \sum_{x_i \in C_j} x_i^t \left(1 - \frac{||x_i^d - \mu_j^d||^2}{\Delta x_{max}^2}\right) = \mu_j^t \sum_{x_i \in C_j} \left(1 - \frac{||x_i^d - \mu_j^d||^2}{\Delta x_{max}^2}\right)

\Leftrightarrow\; \mu_j^t = \frac{\sum_{x_i \in C_j} x_i^t \left(1 - \frac{||x_i^d - \mu_j^d||^2}{\Delta x_{max}^2}\right)}{\sum_{x_i \in C_j} \left(1 - \frac{||x_i^d - \mu_j^d||^2}{\Delta x_{max}^2}\right)}

Therefore, from equations (3.6) and (3.7), we obtain the centroid update formulas:

\mu_j^d = \frac{\sum_{x_i \in C_j} x_i^d \left(1 - \frac{||x_i^t - \mu_j^t||^2}{\Delta t_{max}^2}\right)}{\sum_{x_i \in C_j} \left(1 - \frac{||x_i^t - \mu_j^t||^2}{\Delta t_{max}^2}\right)} \qquad ; \qquad \mu_j^t = \frac{\sum_{x_i \in C_j} x_i^t \left(1 - \frac{||x_i^d - \mu_j^d||^2}{\Delta x_{max}^2}\right)}{\sum_{x_i \in C_j} \left(1 - \frac{||x_i^d - \mu_j^d||^2}{\Delta x_{max}^2}\right)}    (3.8)

Just like in the centroid update phase of K-Means, the new centroids in TDCK-Means are averages over the observations. Unlike in K-Means, the average of each component is weighted using the distance in the other space. For example, each observation contributes to the multidimensional description of the new centroid proportionally to its temporal centrality in the cluster: observations that are more distant in time (from the centroid) contribute less to the multidimensional description than those closer in time. A similar logic applies to the temporal component. The consequence is that the new clusters are coherent both in the multidimensional space and in the temporal one.
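A minimal sketch of the centroid update of Equation (3.8), assuming the members of one cluster are given as a description matrix and a timestamp vector, and that the previous centroid is used inside the weights; names and toy values are illustrative.

```python
import numpy as np

def update_centroid(X_d, X_t, mu_d, mu_t, dx_max, dt_max):
    """X_d: (n, p) descriptions of the cluster members; X_t: (n,) timestamps."""
    w_d = 1.0 - np.sum((X_d - mu_d) ** 2, axis=1) / dx_max ** 2   # descriptive closeness: weights the temporal update
    w_t = 1.0 - (X_t - mu_t) ** 2 / dt_max ** 2                   # temporal closeness: weights the descriptive update
    new_mu_d = (w_t[:, None] * X_d).sum(axis=0) / w_t.sum()
    new_mu_t = (w_d * X_t).sum() / w_d.sum()
    return new_mu_d, new_mu_t

# toy usage: three observations of one cluster
X_d = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8]])
X_t = np.array([1970.0, 1972.0, 1995.0])
print(update_centroid(X_d, X_t, X_d.mean(axis=0), X_t.mean(), dx_max=2.0, dt_max=50.0))
```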

Algorithm's complexity Equation (3.7) shows that TDCK-Means' complexity is O(n²k), due to the penalty term. Still, the equation can be rewritten so that only observations belonging to the same entity are tested. If p is the number of entities and q is the maximum number of observations associated with each entity, then n = p × q. The complexity of TDCK-Means is O(pq²k), which is well adapted to Social Science and Humanities datasets, where often a large number of individuals is studied over a relatively short period of time (p > q).

3.4.4 Fine-tuning the ratio between components

The temporal-aware dissimilarity measure, as presented in equation (3.2), gives equal importance to the multidimensional component and the temporal component. This might pose problems when the data are not uniformly distributed in both the multidimensional descriptive space and the temporal space. If the mean standard deviation, relative to the mean distance between pairs of observations, is greater in one space than in the other, giving equal weight to the components can introduce an important bias into the clustering process.


Figure 3.6 – Multidimensional component, temporal component and temporal-aware dissimilarity measure as a function of α.

Consider, for example, observations that are very uniformly distributed in the temporal space (the same number of observations for each timestamp) and, at the same time, rather compactly distributed in the description space. In this case, on average, the temporal component weighs more in the dissimilarity measure than the multidimensional component; consequently, the clustering is biased towards the temporal cohesion of clusters. Similarly, in some applications it is desirable to privilege one component over the other. For example, on a large enough time scale, user roles in social networks have a temporal component (new types of roles might appear over the years), but in a limited time span it is perfectly acceptable for the roles to coexist simultaneously; therefore, the temporal component should have only a mild impact on the overall measure. We adjust the ratio between the two components by using two tuning factors, γ_d and γ_t: γ_d weights the multidimensional component of the temporal-aware dissimilarity measure, whereas γ_t weights the temporal component. Equation (3.2) can be rewritten as:

||x_i - x_j||_{TA} = 1 - \left(1 - \gamma_d \frac{||x_i^d - x_j^d||^2}{\Delta x_{max}^2}\right) \left(1 - \gamma_t \frac{||x_i^t - x_j^t||^2}{\Delta t_{max}^2}\right)    (3.9)

When the tuning factor of a component is set to zero, that component does not contribute to the temporal-aware measure. When the tuning factor is set to one, the contribution of that component to the measure is not penalized. It is immediate that equation (3.2) is a special case of equation (3.9), with γ_d = 1 and γ_t = 1 (no weights).

Setting the weights γ_d and γ_t γ_d and γ_t are not independent of one another; their values are set using a single parameter α:

\gamma_d = \begin{cases} 1 + \alpha, & \text{if } \alpha \le 0 \\ 1, & \text{if } \alpha > 0 \end{cases} \qquad ; \qquad \gamma_t = \begin{cases} 1, & \text{if } \alpha \le 0 \\ 1 - \alpha, & \text{if } \alpha > 0 \end{cases}    (3.10)
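A minimal sketch of the mapping of Equation (3.10), showing how the slider α sets the two tuning factors of the weighted measure of Equation (3.9); the function name is illustrative.

```python
def tuning_factors(alpha):
    """alpha <= 0 shrinks the descriptive weight, alpha > 0 shrinks the temporal one."""
    gamma_d = 1.0 + alpha if alpha <= 0 else 1.0
    gamma_t = 1.0 if alpha <= 0 else 1.0 - alpha
    return gamma_d, gamma_t

for a in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(a, tuning_factors(a))   # (0,1), (0.5,1), (1,1), (1,0.5), (1,0)
```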


Figure 3.7 – Color map of the temporal-aware dissimilarity measure for α = −1 (a), α = −0.5 (b), α = 0.5 (c) and α = 1 (d).

α acts as a slider, taking values from −1 to 1. Figure 3.6 shows the evolution of γ_d and γ_t with α, and Figure 3.7 shows the color map of the temporal-aware dissimilarity measure for multiple values of α.

When α = −1, then γ_d = 0 and γ_t = 1: the multidimensional component is eliminated and only the time difference between the two observations is considered. The temporal-aware measure becomes a normalized time difference (||x_i − x_j||_{TA} = ||x_i^t − x_j^t||² / Δt²_max). The color map in Figure 3.7a (α = −1) shows that the value of the dissimilarity measure is independent of the multidimensional component. As the value of α increases, so does the weight of the descriptive component. In Figure 3.7b (α = −0.5), the multidimensional component has a limited impact on the overall measure. When α = 0, then γ_d = 1 and γ_t = 1: both components have equal importance, as proposed initially in equation (3.2). In Figure 3.7c (α = 0.5), the color map shows that the multidimensional component has a larger impact than the temporal component; large values of the temporal component have only a moderate influence on the measure. When α = 1 (color map in Figure 3.7d), then γ_d = 1 and γ_t = 0: the temporal dimension is eliminated and the measure becomes a normalized Euclidean distance (||x_i − x_j||_{TA} = ||x_i^d − x_j^d||² / Δx²_max).

Since the temporal-aware dissimilarity measure is used in the objective function in equation (3.7), the latter changes accordingly to integrate the tuning factors. γ_d and γ_t behave as constants in the derivation formulas in equation (3.6). As a result, the centroid update formulas in equation (3.8) are rewritten as:

\mu_j^d = \frac{\sum_{x_i \in C_j} x_i^d \left(1 - \gamma_t \frac{||x_i^t - \mu_j^t||^2}{\Delta t_{max}^2}\right)}{\sum_{x_i \in C_j} \left(1 - \gamma_t \frac{||x_i^t - \mu_j^t||^2}{\Delta t_{max}^2}\right)} \qquad ; \qquad \mu_j^t = \frac{\sum_{x_i \in C_j} x_i^t \left(1 - \gamma_d \frac{||x_i^d - \mu_j^d||^2}{\Delta x_{max}^2}\right)}{\sum_{x_i \in C_j} \left(1 - \gamma_d \frac{||x_i^d - \mu_j^d||^2}{\Delta x_{max}^2}\right)}

The tuning between the multidimensional and the temporal component in the temporal-aware dissimilarity measure thus propagates into the centroid update formula of TDCK-Means. We study the influence of the tuning parameter in Section 3.5.6, where we also propose a heuristic to set its value.

3.5 Experiments

3.5.1 Dataset

Experiments with Temporal-Driven Constrained K-Means are performed on a dataset issued from political sciences: the Comparative Political Data Set I [Armingeon et al. 2011]. It is a collection of political and institutional data, which consists of annual data for 23 democratic countries for the period from 1960 to 2009. The dataset contains 207 political, demographic, social and economic variables. The dataset was cleaned by removing redundant variables (e.g., country identifier and postal code) and the corpus was preprocessed by removing entity bias from the data. For example, it is difficult to compare, on the raw data, the evolution of population between a populous country and one with fewer inhabitants, since any evolution in the 50-year timespan of the dataset will be rendered meaningless by the initial difference. Inspired by panel data econometrics [Dormont 1989], we remove the entity-specific, time-invariant effects, since we assume them to be fixed over time: we subtract from each value the average over each attribute and over each entity. We retain the time-variant component, which is in turn normalized, in order to avoid giving too much importance to certain variables. The obtained dataset takes the form of triples (country, year, description).
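A hedged sketch of this preprocessing, assuming the dataset is loaded as a pandas DataFrame with one row per (country, year) pair; the column names and the z-score normalization of the time-variant component are illustrative assumptions, since the thesis does not specify the exact normalization used.

```python
import pandas as pd

def remove_entity_effects(df, entity_col="country", time_col="year"):
    values = df.drop(columns=[entity_col, time_col])
    # subtract the per-entity average of every attribute (the fixed, time-invariant effect)
    centered = values - values.groupby(df[entity_col]).transform("mean")
    # normalize the remaining time-variant component, attribute by attribute
    normalized = (centered - centered.mean()) / centered.std()
    return pd.concat([df[[entity_col, time_col]], normalized], axis=1)

# toy usage with two countries and one attribute
df = pd.DataFrame({"country": ["FR", "FR", "DE", "DE"],
                   "year": [1960, 1961, 1960, 1961],
                   "population": [45.5, 46.0, 72.5, 73.5]})
print(remove_entity_effects(df))
```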

3.5.2 Qualitative evaluation

When studying the evolution of countries over the years, it is quite obvious for the human reader why the evolutions of the Eastern European countries resemble each other for most of the second half of the twentieth century. The reader would create a group entitled "Communism", extending from right after the Second World War until roughly 1990, for defining the typical evolution of communist countries. One would expect that, based on a political dataset, the algorithms would succeed in identifying such typical evolutions and segment the time series of each of these countries accordingly.


Figure 3.8 – Typical evolution patterns constructed by TDCK-Means on the Comparative Political Data Set I with 8 clusters: the distribution over time of the observations in each cluster (a), the number of entities belonging to each cluster for each year (b), and the segmentation of entities over clusters (c).

Figure 3.8 shows the typical evolution patterns constructed by TDCK-Means (with β = 0.003 and δ = 3, obtained as shown in Section 3.5.5), when asked for 8 clusters. The distribution over time of the observations in each cluster is given in Figure 3.8a. All constructed clusters are fairly compact in time and have limited temporal extents. They can be divided into two temporal groups: in the first one, clusters µ1 to µ5 consistently overlap, and the same holds for clusters µ6 to µ8 in the second group. This indicates that the evolution of each country passes through at least one cluster from each group. The turning point between the two groups is around 1990. Figure 3.8b shows how many countries belong to each cluster for each year. Clusters µ5 and µ6 contain most of the observations, suggesting the general typical evolution. The meaning of each constructed cluster starts to unravel only when studying the segmentation of countries over clusters, in Figure 3.8c. For example, cluster µ2 regroups the observations belonging to Spain, Portugal and Greece from 1960 up until around 1975.

Historically, this coincides with the non-democratic regimes in those countries (Franco's dictatorship in Spain, the "Regime of the Colonels" in Greece). Likewise, cluster µ4 contains observations of countries like Denmark, Finland, Iceland, Norway, Sweden and New Zealand. This cluster can be interpreted as the "Swedish Social and Economic Model" of the Nordic countries, to which the algorithm added, interestingly enough, New Zealand. In the second period, cluster µ8 regroups observations of Greece, Ireland, Spain, Portugal and Belgium, the countries which seemed the most fragile in the aftermath of the economic crisis of 2008.

3.5.3 Evaluation measures

Since the dataset contains no labels to use as ground truth, we rely on classical variance- and entropy-based measures in order to numerically evaluate the proposed algorithms. We evaluate separately each of the three goals that we propose in Section 3.2.

Create clusters that are coherent in the multidimensional description space. It is desirable that observations that have similar multidimensional descriptions be partitioned into the same cluster. The similarity in the description space is measured by the multidimensional component of the temporal-aware dissimilarity measure. This goal is pursued by all classical clustering algorithms (like K-Means) and any traditional clustering evaluation measure [Halkidi et al. 2001] can be used to assess it. We choose the mean cluster variance, which is traditionally used in clustering to quantify the dispersion of observations in clusters. The MDvar measure is defined as:

MDvar = \frac{1}{|X|} \sum_{j=1}^{k} \sum_{x_i \in C_j} ||x_i^d - \mu_j^d||^2

Create temporally coherent clusters, with limited extent in time. This goal is very similar to the previous one, translated into the temporal space. It is desirable that observations assigned to the same cluster be similar in the temporal space (i.e., close in time). The similarity in the temporal space is measured by the temporal component of the temporal-aware dissimilarity measure. The limited time extent of a centroid implies small temporal distances between the observations' timestamps and the centroid timestamp. As a result, the variance can also be used to measure the dispersion of clusters in the temporal space. Similarly to MDvar, the Tvar measure is defined as:

Tvar = \frac{1}{|X|} \sum_{j=1}^{k} \sum_{x_i \in C_j} ||x_i^t - \mu_j^t||^2

Segment the temporal series of observations of each entity into a relatively small number of contiguous segments. The goal is to have successive observations belonging to an entity grouped together, rather than scattered in different clusters. The Shannon entropy can quantify the number of clusters which regroup the observations of an entity, but it is insensitive to alternations between two classes (evolutions like µ1 −→ µ2 −→ µ1 −→ µ2).


Figure 3.9 – Examples of a good and a bad segmentation in contiguous chunks and their related ShaP score.

We evaluate using an adapted mean Shannon entropy of clusters over entities, which weights the entropy by a penalty factor depending on the number of continuous segments in the series of each entity. The ShaP measure is calculated as:

ShaP = \frac{1}{|X|} \sum_{x_i \in X} \left[ \sum_{j=1}^{k} -p(\mu_j) \log_2\big(p(\mu_j)\big) \right] \times \left(1 + \frac{n_{ch} - n_{min}}{n_{obs} - 1}\right)

where p(µ_j) is the proportion of the observations of the entity of x_i that are assigned to cluster µ_j, n_ch is the number of changes in the cluster assignment series of the entity, n_min is the minimal required number of changes and n_obs is the number of observations of the entity. For example, in Figure 3.9, if the series of 11 observations of an entity is assigned to two clusters, but presents 4 changes, the entropy penalty factor will be 1 + (4 − 1)/(11 − 1) = 1.3. The ShaP score for this segmentation will be 1.23, compared to a score of 0.94 for the "ideal" segmentation (only two contiguous chunks). The "ideal" value for MDvar, Tvar and ShaP is zero and, in all of the experiments presented in the following sections, we seek to minimize the values of the three measures.
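The sketch below computes the ShaP measure as defined above, under the assumptions that p(µ_j) is the per-entity cluster distribution, that n_min is the number of distinct clusters of the entity minus one, and that entities are weighted by their number of observations (the sum over x_i ∈ X); it reproduces the scores of Figure 3.9 up to rounding. All names are illustrative.

```python
import numpy as np

def shap_score(assignments_per_entity):
    """assignments_per_entity: one list of cluster labels (in temporal order) per entity."""
    total, n_obs_total = 0.0, 0
    for labels in assignments_per_entity:
        labels = np.asarray(labels)
        n_obs = len(labels)
        # Shannon entropy of the entity's cluster distribution
        _, counts = np.unique(labels, return_counts=True)
        p = counts / n_obs
        entropy = -(p * np.log2(p)).sum()
        # number of changes in the assignment series vs. the minimum required
        n_ch = int((labels[1:] != labels[:-1]).sum())
        n_min = len(counts) - 1
        penalty = 1.0 + (n_ch - n_min) / (n_obs - 1)
        total += n_obs * entropy * penalty
        n_obs_total += n_obs
    return total / n_obs_total

# the segmentations of Figure 3.9: 11 observations, two clusters
print(shap_score([[0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0]]))   # 4 changes: penalized entropy
print(shap_score([[0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]]))   # 1 change: contiguous ("ideal") segmentation
```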

3.5.4 Quantitative evaluation

For each combination of algorithm and parameters, we perform 10 executions and report the average and the standard deviation. We vary k, the number of clusters, from 2 to 36. The performances of five algorithms are compared from a quantitative point of view:
– Simple K-Means – clusters the observations based solely on their resemblance in the multidimensional space;
– Temporal-Driven K-Means – optimizes only the temporal and multidimensional components, without any contiguity constraints; it combines K-Means with the temporal-aware dissimilarity measure defined in Section 3.4.1. Parameters: α = 0 (α defined in Equation (3.10)) and β = 0 (β defined in Equation (3.3));
– Constrained K-Means – uses only the multidimensional space (and not the temporal component) together with the penalty component, as proposed in Section 3.4.2. Parameters: α = 1, β = 0.003 and δ = 3 (δ defined in Equation (3.3));
– TDCK-Means – the Temporal-Driven Constrained Clustering algorithm proposed in Section 3.4.3. Parameters: α = 0, β = 0.003 and δ = 3;
– tcK-Means – the temporal constrained clustering algorithm proposed in [Lin & Hauptmann 2006]. It uses a threshold penalty function w(x_i, x_j) = α* · 1(|x_i^t − x_j^t| < d*) when observations x_i and x_j are not assigned to the same cluster. It was adapted to the multi-entity case by applying it only to observations belonging to the same entity. Parameters: α* = 2, d* = 4.


Figure 3.10 – MDvar (a), Tvar (b) and ShaP (c) values of the considered algorithms when varying the number of clusters.

The α* parameter in tcK-Means should not be mistaken for the α parameter in TDCK-Means, as they do not have the same meaning: in tcK-Means, α* controls the weight of the penalty function, whereas in TDCK-Means α is the fine-tuning parameter.

Obtained results. All the parameters are determined as shown in Section 3.5.5. Table 3.1 shows the average values of the indicators, as well as the average standard deviation (in parentheses), obtained by each algorithm over all values of k. The average standard deviation is only used to give an idea of the order of magnitude of the stability of each algorithm. Since Simple K-Means, Temporal-Driven K-Means and Constrained K-Means are designed to optimize mainly one component, it is not surprising that they show the best scores for, respectively, the multidimensional variance, the temporal variance and the entropy. TDCK-Means seeks to provide a compromise, obtaining the second best score in two out of three cases.

Table 3.1 – Mean values of the indicators and average standard deviations (in parentheses)

            Algorithm              MDvar            Tvar             ShaP
Scores      Simple K-Means         120.59 (2.97)    48.01 (8.87)     2.15 (0.23)
            Temp-Driven K-Means    122.98 (2.85)    19.97 (5.39)     2.58 (0.18)
            Constrained K-Means    132.69 (8.07)    103.15 (42.98)   1.24 (0.50)
            TDCK-Means             127.81 (3.96)    27.54 (5.81)     2.06 (0.20)
            tcK-Means              123.04 (3.80)    62.44 (24.16)    1.79 (0.32)
% Gain      Temp-Driven K-Means    -1.99%           58.40%           -19.63%
            Constrained K-Means    -10.04%          -114.84%         42.21%
            TDCK-Means             -5.99%           42.64%           4.19%
            tcK-Means              -2.03%           -30.05%          16.99%

It is noteworthy that the proposed temporal-aware dissimilarity measure used in Temporal-Driven K-Means provides the highest stability (the lowest average standard deviation) for all indicators. Meanwhile, the constrained algorithms (Constrained K-Means and tcK-Means) show high instability, especially on Tvar; TDCK-Means shows a very good stability. The second part of Table 3.1 gives the relative gain of performance of each of the proposed algorithms over Simple K-Means. Noteworthy is the effectiveness of the temporal-aware dissimilarity measure proposed in Section 3.4.1, with a 58% gain in temporal variance and less than 2% loss in multidimensional variance. The proposed dissimilarity measure greatly enhances the temporal cohesion of the resulting clusters, without a significant scattering of observations in the multidimensional space. Similarly, Constrained K-Means shows an improvement of 42% in the contiguity measure ShaP, while losing 10% in multidimensional variance. By comparison, tcK-Means shows modest results, improving ShaP by only 17% and still showing important losses on both Tvar (-30%) and MDvar (-2%). This shows that the threshold penalty function proposed in the literature has lower performance than our newly proposed contiguity penalty function. TDCK-Means combines the advantages of the other two algorithms, providing an important gain of 43% in temporal variance and improving the ShaP measure by more than 4%. Nonetheless, it shows a 6% loss in MDvar.

Varying the number of clusters Similar conclusions can be drawn when varying the number of clusters. MDvar (Figure 3.10a) decreases, for all algorithms, as the number of clusters increases; it is well known in the clustering literature that the intra-cluster variance decreases steadily as the number of clusters grows. As the number of clusters augments, so do the differences between TDCK-Means or Constrained K-Means and the Simple K-Means algorithm. This is due to the fact that the constraints do not allow too many clusters to be assigned to the same entity, resulting in convergence towards a local optimum with a higher value of MDvar. An opposite behavior is shown by the ShaP measure in Figure 3.10c, which increases with the number of clusters. It is interesting to observe how the MDvar and ShaP measures have almost opposite behaviors: an algorithm that shows the best performance on one of the measures also shows the worst on the other. The temporal variance in Figure 3.10b shows a very sharp decrease for low numbers of clusters, and afterwards remains relatively constant.


Figure 3.11 – MDvar and ShaP as a function of β (a) and of δ (b).

3.5.5 Impact of parameters β and δ

The β parameter controls the impact of the contiguity constraints in equation (3.3). When set to zero, no constraints are imposed and the algorithm behaves just like Simple K-Means. The higher the value of β, the higher the penalty inflicted when breaking a constraint. When β is set to large values, the penalty factor takes precedence over the similarity measure in the objective function: observations that belong to a certain entity get assigned to the same cluster, regardless of their resemblance in the description space. When this happens, the algorithm cannot create partitions with more clusters than the number of entities. In order to evaluate the influence of parameter β, we execute the Constrained K-Means algorithm with β varying from 0 to 0.017 with a step of 0.0005. The value of δ is set at 3, and 5 clusters are constructed. For each value of β, we executed the algorithm 10 times and we plot the average obtained values. Figure 3.11a shows the evolution of the measures MDvar and ShaP with β. When β = 0, both MDvar and ShaP have the same values as for Simple K-Means. As β increases, so does the penalty for a non-contiguous segmentation of entities: MDvar starts to increase rapidly, while ShaP decreases rapidly. Once β reaches higher values, the measures continue their evolution, but with a shallower slope. In the extreme case, in which all observations of an entity are assigned to the same cluster regardless of their similarity, the ShaP measure reaches zero.

The δ parameter controls the width of the penalty function in equation (3.3). As Figure 3.5 shows, when δ has a low value, a penalty is inflicted only if the time difference of a pair of observations is small; as the time difference increases, the function quickly converges to zero. As δ increases, the function decreases with a shallower slope, thus also taking into account observations which are farther away in time. In order to analyze the behavior of the penalty function when varying δ, we executed Constrained K-Means with δ ranging from 0.1 to 8, using a step of 0.1; β was set at 0.003 and 10 clusters were constructed. Figure 3.11b plots the evolution of the measures MDvar and ShaP with δ. The contiguity measure ShaP decreases almost linearly as δ increases, as the series of observations belonging to each entity gets segmented into larger chunks. At the same time, the multidimensional variance MDvar increases linearly with δ.


Figure 3.12 – Influence of the tuning parameter α on MDvar and Tvar (a) and on Tvar and ShaP (b).

Clusters become more heterogeneous and the variance increases, as observations get assigned to clusters based on their membership to an entity rather than on their descriptive similarities. Varying α* and d* for the tcK-Means proposed in [Lin & Hauptmann 2006] yields similar results, with MDvar increasing and ShaP decreasing when α* and d* increase. For tcK-Means, these evolutions are linear, whereas for Constrained K-Means they are exponential, following a trend line of the form e^{−const/x}. Plotting the evolution of the MDvar and ShaP indicators on the same graphic provides a heuristic for choosing the optimum values of the (β, δ) parameters of Constrained K-Means and TDCK-Means, respectively of the (α*, d*) parameters of tcK-Means. Both curves are plotted with the vertical axis scaled to the interval [min value, max value]; their point of intersection determines the values of the parameters (as shown in Figures 3.11a and 3.11b). The disadvantage of such a heuristic is that a large number of executions must be performed, with multiple values of the parameters, before the "optimum" can be found.

3.5.6 The tuning parameter α

The parameter α, proposed in Section 3.4.4, allows the fine-tuning of the ratio between the multidimensional component and the temporal component in the temporal-aware dissimilarity measure. When α is close to −1, the temporal component is predominant; conversely, when α is close to 1, the multidimensional component takes precedence. The two components have equal weights when α = 0. To evaluate the influence of parameter α, we executed Temporal-Driven K-Means with α varying from −1 to 1 with a step of 0.1. In order not to bias the results and to evaluate only the impact of the tuning parameter, we remove the contiguity constraints from the objective function J by setting β = 0. For each value of α, we performed 10 executions and we present the average values.

Figure 3.12a shows the evolution of the measures MDvar and Tvar with α. For low values of α, the value of the temporal-aware dissimilarity measure is given mainly by the temporal component, so Tvar shows its lowest value, while MDvar presents its maximum. As α increases, MDvar decreases as more importance is given to the multidimensional component. For α ∈ (−1, 0], the importance of the temporal component remains intact; the increase of Tvar is solely the result of the algorithm converging to a local optimum which also takes into account the multidimensional component. For α ∈ [0, 1), the impact of the multidimensional component stays constant, whereas the importance of the temporal component diminishes. As a result, MDvar continues its decrease and Tvar increases sharply. For α = 1, the temporal component is basically ignored in the measure and Temporal-Driven K-Means behaves just like Simple K-Means.

Figure 3.12b shows the evolution of ShaP alongside MDvar. Even though the contiguity penalty component was neutralized by setting β = 0, the value of ShaP is not constant: it descends with α. For low values of α, the temporal component is predominant in the similarity measure. This generates partitions where every cluster regroups all observations from a specific period, regardless of their multidimensional description. This means that all entities have segments in all the clusters, which leads to a high value of ShaP.

It is noteworthy that the evolution of the indicators is not linear with α. As α increases, Tvar augments only very slowly and picks up the pace only for large values of α. This indicates that the temporal component has an inherent advantage over the multidimensional one. As we presumed in Section 3.4.4, this is due to the intrinsic nature of the dataset, and it is the main reason why the tuning parameter α was introduced. The distributions of observations in the multidimensional and temporal spaces are different: in the temporal space, the observations tend to be evenly distributed, whereas in the multidimensional description space they cluster together. To quantify this, we calculate the ratio between the average standard deviation and the average distance between pairs of observations:

r_{dim} = \frac{1}{|X|} \sum_{i=1}^{|X|} \frac{\operatorname{stdev}\left\{ ||x_i^{dim} - x_j^{dim}||^2 \;\middle|\; x_j \in X,\; i \ne j \right\}}{\frac{1}{|X|} \sum_{j \ne i} ||x_i^{dim} - x_j^{dim}||^2}

where dim is replaced with d or t (the descriptive or the temporal dimension). On the Comparative Political Data Set I, r_d = 29.5% and r_t = 65.3%. This shows that observations are a lot more dispersed in the temporal space than in the multidimensional description space, and it explains why Tvar augments very slowly with α and starts to increase more rapidly only from α = 0.4 onward. Following the heuristic proposed in Section 3.5.5, we can determine a "compromise" value for α. As shown in Figure 3.12, all vertical axes are scaled between their functions' minimum and maximum values; the "compromise" value of α is found at the intersection point of MDvar and Tvar (and of MDvar and ShaP). This value is around 0.7, showing the dataset's bias towards the temporal component. This technique for setting the value of the tuning parameter is just a heuristic; the actual value of α is dependent on the dataset. This is why we are currently working on a method, inspired from multi-objective optimization using evolutionary algorithms [Zhang & Li 2007], to automatically determine the value of α, as well as the other parameters of TDCK-Means (β and δ).
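A small sketch of the dispersion ratio r_dim defined above: for each observation, the standard deviation of the squared distances to the other observations is divided by their mean, and the ratios are averaged. The toy data below are illustrative only and are not meant to reproduce the values reported for the Comparative Political Data Set I.

```python
import numpy as np

def dispersion_ratio(X):
    """X: (n, p) matrix of observations in one space (descriptive or temporal)."""
    n = len(X)
    ratios = []
    for i in range(n):
        d2 = np.sum((X - X[i]) ** 2, axis=1)   # squared distances from observation i
        d2 = np.delete(d2, i)                  # drop the distance to itself
        ratios.append(d2.std() / d2.mean())
    return float(np.mean(ratios))

rng = np.random.default_rng(0)
print(dispersion_ratio(rng.normal(size=(50, 5))))                    # a descriptive-like space
print(dispersion_ratio(np.arange(50, dtype=float).reshape(-1, 1)))   # evenly spread timestamps
```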

3.6 Current work: Role identification in social networks

We are at present working on multiple extensions and applications of our work. In this section, we present the most advanced of them: the application of TDCK-Means to user social role identification in social networks. This is ongoing work, the result of a research collaboration with Technicolor's Research & Innovation Laboratories in Rennes, France (https://research.technicolor.com/rennes/). Technicolor's research interests include those parts of the Web related to cinema and television, e.g., Web forums, social media, and professional sites. The end purpose is to enrich the content by linking meta-data, extracted by analyzing usage patterns and semantic information. This collaboration allows us to test and further analyze the behavior of our proposed TDCK-Means algorithm on social network data. This work is in an advanced state: the submission of an article is already planned. In the remainder of this section, we present the general context of this application of TDCK-Means and the dataset used (Section 3.6.1), followed by the description of the user role identification framework (Section 3.6.2) and some preliminary results that we obtained (Section 3.6.3).

3.6.1 Context

The base hypothesis of this work is that, when interacting in an online community, a user plays multiple roles over a given period of time. These roles are temporally coherent (when in a role, the user's activity is uniformly similar) and he/she can change between roles. We denote these roles as behavioral roles. The global social role is constructed as a mixture of different behavioral roles, which incorporates the dynamics of behavioral transitions. Therefore, we define the user social role as a succession of behavioral roles. We construct the user social roles based on a social network inferred from online discussion forums. The social roles are identified in a three-phase framework: (a) behavioral characteristics are identified, based on the structure of the inferred social network, (b) behavioral roles are created using TDCK-Means and (c) the user social roles are determined based on the transitions between behavioral roles.

Dataset and social network creation In this application of TDCK-Means, we used the TWOP dataset [Anokhin et al. 2012], an online forum discussion dataset. We discuss the online forum discussion environment in more detail in Chapter 7, where we introduce CommentWatcher, an online forum analysis platform. In online forums, users can start new discussion threads or reply to other users' messages, through a quote citing mechanism. Discussions are formed by multiple users simultaneously answering one another. This structure can be used to infer an implicit online social network of users: two users are considered to have a relation when they reply to one another. This implicit social network is modeled as a directed weighted graph. The nodes of the graph are the users posting in the forums, whereas the arcs signal their relations: a directed arc is added from user A to user B when A replies to B, and the strength of the relation between A and B is directly proportional to the number of replies of A to B.
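A hedged sketch of this implicit social network construction, assuming the posts have been reduced to (author, replied-to author) pairs; networkx is used here only for illustration, as the thesis does not prescribe a particular library.

```python
import networkx as nx

def build_reply_graph(replies):
    """Directed weighted graph: an arc A -> B whose weight counts A's replies to B."""
    g = nx.DiGraph()
    for author, target in replies:
        if g.has_edge(author, target):
            g[author][target]["weight"] += 1
        else:
            g.add_edge(author, target, weight=1)
    return g

g = build_reply_graph([("alice", "bob"), ("alice", "bob"), ("carol", "alice")])
print(g["alice"]["bob"]["weight"])   # 2 replies from alice to bob
```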


The TWOP dataset is constructed based on the Television Without Pity forum website (http://www.televisionwithoutpity.com/). It contains 58,994 posts by 7,066 authors. It was constructed by parsing messages posted in 6 forums, corresponding to 6 TV series, during the year 2007.

3.6.2 The framework for identifying social roles

The social roles are identified by passing through behavioral roles. The underlying assumption is that the user's behavior might change during the observed period of time: for example, he/she might be very active during a certain period, followed by a period of being absent from the online community, or simply exchanging less with other users. We consider these temporally coherent periods as being part of the same behavioral role. We therefore define the general social role as a succession of behavioral roles. We determine the social roles using a three-phase framework: (a) we calculate the temporal behavioral characteristics, based on the structure of the inferred social network; (b) the behavioral roles are extracted using TDCK-Means, by constructing temporally coherent clusters and contiguously segmenting the temporal measurement vectors associated with each user; (c) the user social roles are determined based on the transitions between the behavioral roles. In the next paragraphs of this section, we detail each of these phases.

Determining behavioral characteristics The users' interactions are quantified in phase (a) as shown in [Anokhin et al. 2012], by analyzing the directed graph of user interactions. We use measures adapted from the citation analysis literature to provide a measure of interaction and importance for our forum users. In particular, two well-known citation metrics are adapted: the h-index [Hirsch 2005] and its successor, the g-index [Egghe 2006]. The following characteristics are computed and used to describe the activity of a user:
– the node's in/out g-Index. Based on the g-index, it measures how active the neighborhood of the node is.
– catalytic power. This indicator differentiates between people who constantly receive replies and those who start just one or two debates.
– weighted in/out-degree. It measures the "quantity" of communication in the neighborhood of the node.
– activity, the number of posts of the user.
– cross-thread entropy, which measures the user's focus on a thread.
These measures are rendered temporal by calculating them on an adaptive-length time window, which ensures that periods with low activity do not bias the temporal clustering (e.g., in the summer there is no broadcast of TV series and, consequently, there is almost no activity on the chosen television forum site).

Creating user behavioral roles In phase (b), the proposed temporal-aware constrained clustering algorithm (TDCK-Means) is used to detect the behavioral roles.

As defined in Section 3.4, the clustered observations are triples (entity, time, description). In the case of the user role detection application, the entities are the users and the description vector is defined in the multidimensional description space, in which each behavioral measure is a dimension. As already discussed in Section 3.4.4, the importance of the temporal component depends on the application. There is a difference between the application of TDCK-Means to the detection of country evolutions and the construction of behavioral roles: evolution phases of countries are inherently temporal, while behavioral roles can appear during the entire observed period. Consequently, we reduce the importance of the temporal component by setting the α parameter (see Section 3.5.6) to values close to 1. The result of executing TDCK-Means is k temporally coherent clusters (µ_l, l = 1, 2, ..., k), which are interpreted as behavioral roles. This already allows us, for a given user, to evaluate the stability of his/her behavior.

Creating user social roles In phase (c), for each user u_i, we estimate the transition matrix Ψ_i = {ψ^i_{s,t}}, where ψ^i_{s,t} is the probability that user u_i has a transition from the behavioral state µ_s to the behavioral state µ_t. The matrices Ψ_i tend to be sparse and have high values on the main diagonal, since TDCK-Means privileges (a) consecutive observations belonging to the same behavioral role and (b) each user passing through rather few behavioral roles (see the discussion in Section 3.2). Each social role is a mixture of different behavioral roles and incorporates the dynamics of behavioral transitions. In other words, we interpret a social role as a succession of transitions through behavioral roles (similar to our previous application, the typical evolutions of countries). We use a simple K-Means to regroup the users' transition probabilities. Notice that, in this second clustering, the individuals being clustered are the entities (i.e., the users) and not the observations, as was the case for TDCK-Means. The users' transitions are represented by the matrices Ψ_i and the obtained cluster centroids are (a) described in the same numeric space as the clustered instances and (b) interpreted as the social roles. Each social role is, therefore, a matrix which gives two types of information: (a) the probability of transitions from one behavioral role to another and (b) the mixture of behavioral roles, denoted by the values on the main diagonal (ψ^i_{s,s} is the probability that the user u_i passes from the behavioral role µ_s to µ_s, which can be interpreted as the proportion of time in which u_i stays in the behavioral role µ_s). Such a representation has the inconvenience of being time-orderless: it gives the proportion in which behavioral roles are present in each social role, but it does not give their temporal order of succession. One of the directions we are currently pursuing deals with inferring a graph structure for the constructed clusters (more details in Section 3.7 and in Chapter 8). A graph approach would solve the ordering problem, and detecting a social role would become detecting frequent paths in the generated graph.
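A minimal sketch of phase (c), assuming each user's behavioral-role sequence (the per-user output of TDCK-Means, in temporal order) is available as a list of integers in 0..k−1; the normalization of the transition counts by the total number of transitions (so that the diagonal reads as a proportion of time) and the use of scikit-learn's KMeans are illustrative assumptions, not the thesis implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def transition_matrix(role_sequence, k):
    """Count transitions between consecutive behavioral roles of one user."""
    psi = np.zeros((k, k))
    for s, t in zip(role_sequence[:-1], role_sequence[1:]):
        psi[s, t] += 1
    total = psi.sum()
    return psi / total if total > 0 else psi

k = 3
sequences = [[0, 0, 1, 1, 2], [0, 1, 1, 1, 2], [2, 2, 2, 0, 0]]   # one sequence per user
features = np.array([transition_matrix(s, k).ravel() for s in sequences])
social_roles = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(social_roles)   # cluster (social role) of each user
```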

3.6.3 Preliminary experiments

This section presents the preliminary experiments performed with the framework presented in Section 3.6.2 and the results obtained. There is still work to be done, mainly in parameter choice, interpretation and qualitative evaluation: a thorough parameter sweep needs to be performed to motivate the choice of parameters, and qualitative evaluation measures are still lacking. Nevertheless, the first qualitative results are encouraging and we present them here.


Figure 3.13 – The presence of the behavioral roles during the considered period of time (a) and heat maps showing the transition probabilities between behavioral roles for each extracted social role (b).

Table 3.2 – The behavioral roles represented by the centroids constructed by TDCK-Means

Role  Date      in g-Index  out g-Index  cat. power  in-deg.  out-deg.  entropy  activ.
µ1    20/04/07  -0.898      -0.632       -0.856      -0.863   -0.540    1.333    -0.223
µ2    22/04/07  -0.075      -0.886       -0.038      0.013    -0.892    -0.579   -0.861
µ3    24/04/07  -0.977      -0.844       -0.964      -1.007   -0.843    -0.531   -0.923
µ4    09/05/07  0.113       -0.210       0.068       0.249    -0.102    1.494    0.296
µ5    07/06/07  -0.152      0.179        -0.202      -0.152   0.189     -0.351   -0.064
µ6    30/08/07  1.765       1.771        1.730       1.771    1.759     1.126    1.88
µ7    16/09/07  0.928       0.908        0.955       0.856    0.819     -0.431   0.736

In phase (a), the behavioral characteristics are calculated, normalized and transformed into a fixed range by first performing a min-max normalization and then a log transformation. In phase (b) of the social role extraction framework, we execute the TDCK-Means algorithm with the parameters k = 7, α = 0.95, β = 0.005, δ = 1.0 and we construct 7 temporal clusters. These clusters regroup similar user activities and are interpreted as behavioral roles. Figure 3.13a shows the presence of each of the behavioral roles over the course of the year 2007. We can see that some behavioral roles are only present in the first half of the year. This is due to a particularity of the TWOP dataset: there is very low activity during the summer (when there is no broadcast of TV series). Therefore, there is a temporal gap in the user activities, which causes TDCK-Means to create temporal clusters with a limited time span. Table 3.2 shows the centroids of each of the temporal clusters. Since a centroid summarizes the main features of the user activities in its cluster, we interpret the behavioral roles based on the associated centroids µ_l (in the following discussion we also denote a behavioral role with the name of the associated centroid). We can see, for example, that user activities that fall into the behavioral role µ1 describe users who post across multiple forums (hence the high entropy), but do not seem to engage other community members (low in/out g-Index, low in/out-degree, low catalytic power). By contrast, behavioral role µ4 also exhibits a high entropy, but there is evidence that users in this role are receiving replies (high in-degree). However, their replies are not from highly connected users, as evidenced by the only slightly positive in g-Index value.

In phase (c), a K-Means clustering is performed on the behavioral transition probability vectors, as described in Section 3.6.2, and 4 clusters are constructed, which are interpreted as social roles. Each social role is constructed as a mixture of different behavioral roles. Figure 3.13b shows, for each constructed social role, the obtained transition probabilities, depicted as heat maps. Most of the non-null values are on the main diagonal (lines are numbered from bottom to top, columns from left to right). The values on the main diagonal give the behavioral composition within each social role. Role 1 represents users who are active members, able to generate and participate in active conversations within the forum; sometimes these conversations are with highly connected users, though mostly at random. Role 2 represents highly influential central figures, who take part in and form the basis of conversation in multiple sub-forums. Role 3 represents slightly less active and more focused users. Role 4 regroups users who are present throughout the peak of conversations, and reduce their activity afterwards.

Interestingly, although users present in Role 1 move between nearly all the behavioral roles, they are very rarely present in µ6. Table 3.2 shows that in this behavioral role, users are highly central to conversation on more than one forum. This suggests that most of the users on the TWOP forum are only interested in a single show, and consequently will always have low entropy. Another distinct role played within the forum is the behavioral role µ4. Users presenting this role, although unable to generate (or contribute to) any conversation, remain on the forum: despite being ignored, they continue to participate in conversations.

In conclusion This section presented some of the work we are currently performing with the TDCK-Means algorithm in particular, and with the temporal dimension of data in general. We have shown that our proposed temporal-aware clustering algorithm opens promising new fields of investigation and can be applied to other types of data than the one presented in Section 3.5. The interpretation of the constructed social roles is still difficult, given that they are expressed as an orderless mixture of behavioral roles. Our plan is to extend TDCK-Means to include the inference of a graph structure over the constructed clusters: constructing the social roles as paths in the behavioral role graph would make the interpretation easier.

3.7 Conclusion and future work

In this chapter we have studied the detection of typical evolutions from a collection of observations. The presented work tackles one of our central research challenges, i.e., dealing with the temporal dimension of complex data. As we discussed at the beginning of the chapter, we consider that the temporal dimension is more than just another descriptive dimension of the data, since it changes the definition of the learning problem. Therefore, one of the original contributions presented in this chapter is a novel method to introduce temporal information directly into the dissimilarity measure, weighting the descriptive component by the temporal component. This new measure allows us to weight the importance of the temporal component relative to the descriptive component and to fine-tune its impact on the learning process. We have also proposed TDCK-Means, an extension of K-Means, which uses the temporal-aware dissimilarity measure and a new objective function which takes into consideration the temporal dimension. We use a penalty factor to make sure that the observation series related to an entity gets segmented into contiguous chunks. We infer a new centroid update formula, in which elements distant in time contribute less to the centroid than the temporally close ones. We have shown that our proposition consistently improves the temporal variance, without any significant loss in the multidimensional variance. The algorithm can be used in other applications where the detection of typical evolutions is required, e.g., the career evolution of politicians or abnormal disease evolutions. We have also shown how our proposal can be adapted to another specific problem, i.e., role identification in social networks, and to another dataset.

Perspectives of our work In our current work, we have only detected the centroids that serve as the evolution phases. We are currently working on an extension of TDCK-Means which embeds the construction of the evolution graph into the algorithm (as shown in Figure 3.2a, p. 32). The objective is to construct, starting from the available complex data, a graph structure over the clusters. The idea is to estimate, in addition to the centroids and the temporal membership of observations to clusters, an adjacency matrix that defines the graph. This will allow a succinct description of the evolution of an entity as a path through the constructed graph. Another direction of research is describing the clusters in a human-readable form. We are working on means to provide them with an easily comprehensible description by introducing temporal information into the unsupervised feature construction algorithm (we give more details about this ongoing work in Chapters 4 and 8). We are also experimenting with a method for automatically setting the values of TDCK-Means's parameters (α, β and δ), using an approach inspired by multi-objective optimization with evolutionary algorithms [Zhang & Li 2007]. The work presented in this chapter was published at the 24th IEEE International Conference on Tools with Artificial Intelligence, receiving the Best Student Paper Award [Rizoiu et al. 2012].

Chapter 4 Using Data Semantics to Improve Data Representation

Contents
4.1 Learning task and motivations . . . 59
4.1.1 Why construct a new feature set? . . . 61
4.1.2 A brief overview of our proposals . . . 62
4.2 Related work . . . 63
4.3 uFRINGE - adapting FRINGE for unsupervised learning . . . 66
4.4 uFC - a greedy heuristic . . . 67
4.4.1 uFC - the proposed algorithm . . . 68
4.4.2 Searching co-occurring pairs . . . 70
4.4.3 Constructing and pruning features . . . 71
4.5 Evaluation of a feature set . . . 71
4.5.1 Complexity of the feature set . . . 72
4.5.2 The trade-off between two opposing criteria . . . 74
4.6 Initial Experiments . . . 75
4.6.1 uFC and uFRINGE: Qualitative evaluation . . . 76
4.6.2 uFC and uFRINGE: Quantitative evaluation . . . 79
4.6.3 Impact of parameters λ and limit_iter . . . 79
4.6.4 Relation between number of features and feature length . . . 81
4.7 Improving the uFC algorithm . . . 82
4.7.1 Automatic choice of λ . . . 83
4.7.2 Stopping criterion. Candidate pruning technique . . . 83
4.8 Further Experiments . . . 84
4.8.1 Risk-based heuristic for choosing parameters . . . 84
4.8.2 Pruning the candidates . . . 86
4.8.3 Algorithm stability . . . 89
4.9 Usage of the multi-objective optimization techniques . . . 90
4.10 Conclusion and future work . . . 92

4.1 Learning task and motivations

Leveraging semantics when dealing with complex data is one of the core research challenges of the work in this thesis. This chapter tackles the crucial learning task of constructing a semantically improved representation space for describing the data. While in the rest of our work we either use externally available information or the temporal aspect of data in order to infer more complete knowledge, in the work presented in this chapter we concentrate on the semantics already available in the data itself. In the context of automatic classification, a useful feature needs to express new information compared to other features. Correlated features do not bring any new information (this problem was already presented in Section 2.1.4, p. 18). Nevertheless, when two features co-occur, it is usually the result of a semantic connection between them. Therefore, in this chapter we have two missions: (a) improve the representation space of the data by removing correlations between features and (b) discover semantic links between features by analyzing their co-occurrences in the data. To address these tasks, we propose a novel unsupervised algorithm, uFC, which improves the representation space by reducing the overall correlation between features, while discovering semantic links between features by performing feature construction: pairs of highly co-occurring features are replaced with Boolean conjunctions. The total correlation of the new feature set is reduced and the semantically-induced co-occurrences in the initial set are emphasized.

Motivations One of the limitations of representing data in the feature-value vector format is that the supplied features are sometimes not adequate to describe the data, in terms of classification. This happens, for example, when general-purpose features are used to describe a collection that contains certain relations between individuals. E.g., a user of an online photo sharing service might find that the proposed generic labels are not adapted for tagging his/her photo collection. When labeling an image of a cascade, what should he/she use: water, cascade or both (since a cascade is made out of water)? Given this problem, our starting premise is that there exists an underlying semantics in the feature set, which is dependent on the dataset: the way features are distributed over the observations is not random. On the contrary, it gives precious information about existing relations between features. If features co-occur in the description of individuals, we consider that it is due to a semantic connection between them. In the above example of the user labeling images, the tags people and trees co-occur in pictures depicting a barbecue because there exists a link between the two in the given context, and a new feature people and trees would better describe this context. Furthermore, considering the case of complete labeling (i.e., a labeling in which no labels are missing; we further discuss it in Section 4.10), given the co-occurrence pattern between water and cascade (cascade always appears together with water), we can deduce that there exists a special type of relation between the two (e.g., a "type-of" relation). In the work presented in this chapter, we are interested in how to augment the expressive power of the employed feature set, by taking into account the underlying semantics present in the dataset. Figure 4.1 presents a streamlined schema of this treatment. No external information intervenes; the feature set is rewritten based only on the information contained in the dataset. This means that the instances are translated into another space, defined by the new feature set. The work presented in this chapter originated in the need to reorganize a set of labels used for tagging images. We employ these labels later, in Chapter 5, in order to construct a semantically enriched numerical representation for images. The remainder of this chapter is structured as follows. The rest of this section further discusses the need to reorganize a feature set and presents an overview of the proposed solutions. In Section 4.2, we briefly present related works that deal with rewriting a feature set.

Figure 4.1 – Streamlined schema of improving data representation: the data in the numeric vectorial space, initially described by the primitive features, is passed through the unsupervised feature construction process, which outputs the constructed features.

In Sections 4.3 and 4.4, we present our proposed algorithms, and in Section 4.5 we describe the evaluation metrics and the complexity measures. In Section 4.6, we perform a set of initial experiments and outline some of the inconveniences of the algorithms. In Section 4.7, by means of statistical hypothesis testing, we address these weak points, notably the choice of the threshold parameter. In Section 4.8, a second set of experiments validates the proposed improvements. The evaluation metrics we employ are opposing criteria and, in order to optimize them, we make use of multi-objective optimization techniques, which we summarize in Section 4.9. Finally, Section 4.10 draws the conclusion and outlines some current and future works.

4.1.1 Why construct a new feature set?

In the context of automatic classification (supervised or unsupervised), a useful feature needs to express new information compared to other features. A feature p_j that is highly correlated with another feature p_i does not bring any new information, since the value of p_j can be deduced from that of p_i. Consequently, one could filter out "irrelevant" features before applying the classification algorithm. But by simply removing certain features, one runs the risk of losing important information about the semantic connection between the features, and this is the reason why we choose to perform feature construction instead (we discussed the differences between feature selection and feature construction in Section 2.1.4, p. 18). Feature construction attempts to increase the expressive power of the original features by discovering missing information about relationships between features. We deal primarily with datasets described with Boolean features. In real-life datasets, most binary features have specific meanings. For example, a collection of images is tagged using a set of labels (the Boolean features), where each label marks the presence (true) or the absence (false) of a certain object in the image or gives information about the context of the image. The objects could include water, cascade, manifestation, urban, groups or interior, whereas the contexts can be holiday or evening.

Figure 4.2 – Example of images tagged with {groups, road, building, interior}.

In the given example, part of the semantic structure of the feature set can be guessed quite easily. Relations like "is-a" and "part-of" are fairly intuitive: cascade is a sort of water, paw is part of animal, etc. But other relations might be induced by the semantics of the dataset (i.e., the images in our example): manifestation will co-occur with urban, for they usually take place in the city. Figure 4.2 depicts a simple image dataset described using the feature set {groups, road, building, interior}. The feature set is quite redundant and some of the features are non-informative (e.g., the feature groups is present for all individuals). Considering co-occurrences between features, we could create the more eloquent features people at the interior and not on the road (groups ∧ ¬road ∧ interior, describing the top row) and people on the road with buildings in the background (groups ∧ road ∧ building, describing the bottom row). The idea is to create a data-dependent feature set, so that the new features are as independent as possible, limiting co-occurrences between the new features. At the same time, given that one of our guidelines is to create human-comprehensible outputs, the newly created features should be easily comprehensible. The advantage is that the newly constructed features express the semantic connections between the primitive features. For example, the fact that a newly constructed feature holiday ∧ water is set for a great number of images is a good indicator of a vacation on the seaside. This already gives information about the dataset, without even looking at the images.
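As an illustration of the kind of rewriting we have in mind (and not of the uFC algorithm itself, which is introduced in Section 4.4), the following sketch builds the two conjunction features discussed above from a toy Boolean tag matrix; the matrix values are invented for the example.

    import numpy as np

    tags = ["groups", "road", "building", "interior"]
    # Six toy images: top row of Figure 4.2 = indoor groups,
    # bottom row = groups on a road with buildings.
    X = np.array([[1, 0, 0, 1],
                  [1, 0, 0, 1],
                  [1, 0, 0, 1],
                  [1, 1, 1, 0],
                  [1, 1, 1, 0],
                  [1, 1, 1, 0]], dtype=bool)

    col = {t: X[:, i] for i, t in enumerate(tags)}
    indoor_groups = col["groups"] & ~col["road"] & col["interior"]   # groups AND NOT road AND interior
    street_groups = col["groups"] & col["road"] & col["building"]    # groups AND road AND building
    print(indoor_groups.astype(int), street_groups.astype(int))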

4.1.2 A brief overview of our proposals

In order to obtain good results in classification tasks, many algorithms and preprocessing techniques (e.g., SVM [Cortes & Vapnik 1995], PCA [Dunteman 1989]) deal with inadequate variables by changing the description space (internally, in the case of the SVM). The main drawback of these approaches is that they function as a black box, where the new representation space is either hidden (for SVM) or completely synthetic and incomprehensible to human readers (PCA). The literature also proposes algorithms that construct features based on the original user-supplied features. However, to our knowledge, all of these algorithms construct the feature set in a supervised way, based on class information supplied a priori with the data.

The novelty of our proposals Relative to existing solutions in the literature, our solutions have two advantages. In addition to constructing a representation space in which features co-occur less, they (a) produce human-comprehensible features and (b) function in the absence of pre-classified examples, in an unsupervised manner. The first algorithm we propose is an adaptation of an established supervised algorithm, which we render unsupervised. For the second algorithm, we have developed a completely new heuristic that selects, at each iteration, pairs of highly correlated features and replaces them with conjunctions of literals. Therefore, the overall redundancy of the feature set is reduced. Later iterations create more complex Boolean formulas, which can contain negations (meaning the absence of features). We use statistical considerations (hypothesis testing) to automatically determine the values of the parameters depending on the dataset, and a method inspired by the Pareto front [Sawaragi et al. 1985] for the evaluation. The main advantage of the proposed methods over PCA or the kernel of the SVM is that the newly-created features are comprehensible to human readers (features like people ∧ manifestation ∧ urban and people ∧ ¬urban ∧ forest are easily interpretable). As mentioned earlier, our algorithms function in a complete labeling paradigm. We discuss in Section 4.10 how our approaches can be adapted to function with missing labels.

Using Boolean features Just like many other algorithms in the feature construction literature, our algorithms are limited to binary features. Any dataset described using the feature-value vector format can be converted to a binary format using discretization and binarization. The Data Mining literature presents extensive work [Fayyad & Irani 1993] on the discretization of continuous features. While it is true that such a process has its drawbacks (e.g., loss of detail and of order), discretization also has advantages: it renders the learning algorithms less sensitive to outliers and noise, it deals better with missing values by creating a special feature that regroups them, and it avoids the problems caused by the asymmetry of the distribution of observations corresponding to a variable. Some researchers [Kotsiantis & Kanellopoulos 2006, Elomaa & Rousu 2004] even conclude that "most machine learning algorithms produce better models when performing discretization on continuous variables". Some of the shortcomings of discretization can be overcome by using cut-points. Instead of discretizing by creating the feature "a ∈ [v1, v2]", we can create two features "a < v1" and "a < v2". A value lower than v1 would have both "a < v1" and "a < v2" set to true. In this way, the order relation between the binary features is preserved. This type of discretization is inspired by the proportional-odds cumulative model, used with ordinal data in the field of logistic regression [Agresti 2002].
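A minimal sketch of the cut-point binarization described above, assuming the cut points v1 and v2 are already given; the attribute values are toy data.

    import numpy as np

    a = np.array([0.3, 1.2, 2.7, 4.1])   # a continuous attribute (toy values)
    v1, v2 = 1.0, 3.0                    # cut points, assumed given

    a_lt_v1 = a < v1                     # binary feature "a < v1"
    a_lt_v2 = a < v2                     # binary feature "a < v2"
    # A value lower than v1 sets both features to true, so the order is preserved.
    print(np.column_stack([a_lt_v1, a_lt_v2]).astype(int))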

4.2 Related work

The literature proposes methods for augmenting the descriptive power of features. [Liu & Motoda 1998] collects some of them and divides them into three categories: feature selection, feature extraction and feature construction. Feature selection [Lallich & Rakotomalala 2000, Mo & Huang 2011] seeks to filter the original feature set in order to remove redundant features. This results in a representation space of lower dimensionality. These approaches directly address the problem of high data dimensionality (described in Section 2.1.4, p. 18) and create a space that is more adapted to machine learning. However, removing features runs the risk of losing potentially interesting information. In the example of the co-occurrence between cascade and water, a feature selection approach might remove cascade, since water is more general. But the "is-a" relation induced by the dataset would be lost. Therefore, we do not consider this kind of approach any further. Feature extraction is a process that builds a set of new features from the original features through functional mapping [Motoda & Liu 2002]. For example, while the SVM algorithm [Cortes & Vapnik 1995] does not properly build a new description space (the kernel function only maps the description space into a predefined larger space, which is separable in the context of supervised learning), we can assimilate this approach to feature extraction, since the purpose is to better describe the data and the new space is difficult to comprehend from a semantic point of view. Furthermore, supervised and unsupervised algorithms can be boosted by pre-processing with principal component analysis (PCA) [Dunteman 1989]. PCA is a mathematical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of uncorrelated variables, called principal components. Another technique that can be associated with feature extraction is manifold learning [Huo et al. 2006], which pursues the goal of embedding data that originally lies on a non-linear manifold into a lower dimensional space, while preserving characteristic properties. It can be assimilated to feature extraction methods, since the purpose is to build a new, lower-dimensional description space for the data. Feature extraction mainly seeks to reduce the description space and the redundancy between features (with the exception of the SVM, which builds a higher dimensional space than the original data). The problem with these approaches is that the newly created features are rarely human-comprehensible and are difficult to interpret from a semantic point of view. Therefore, we consider feature extraction methods inadequate for detecting relations between the original features.

Feature Construction Feature construction is a process that discovers missing information about the relationships between features and augments the space of features by inferring or creating additional features [Motoda & Liu 2002]. The emphasis in feature construction, unlike in the feature extraction presented earlier, is on the comprehensibility of the newly created features. These methods usually construct a representation space of larger dimension than the original space. Constructive induction [Michalski 1983] is a process of constructing new features using two intertwined searches [Bloedorn & Michalski 1998]: one in the representation space (modifying the feature set) and another in the hypothesis space (using classical learning methods).

Algorithm 2 General feature construction schema.
Input: P – set of primitive user-given features
Input: I – the data expressed using P, which will be used to construct features
Inner parameters: Op – set of operators for constructing features, M – machine learning algorithm to be employed
Output: F – set of new (constructed and/or primitive) features.

    F ← P
    iter ← 0
    repeat
        iter ← iter + 1
        I_iter ← construct(I_{iter-1}, F)
        output ← run M(I_iter, F)
        F ← F ∪ {new features constructed with Op(F, output)}
        prune useless features in F
    until stopping criteria are met

The actual feature construction is done using a set of constructing operators, and the resulting features are often conjunctions of primitives, therefore easily comprehensible to a human reader. Feature construction has mainly been used with decision tree learning: new features served as hypotheses and were used as discriminators in decision trees. Supervised feature construction can also be applied in other domains, like decision rule learning [Zheng 1995]. Algorithm 2, presented in [Gomez & Morales 2002, Yang et al. 1991], represents the general schema followed by most feature construction algorithms. The general idea is to start from I, the dataset described with the set of primitive features. Using a set of constructors and the results of a machine learning algorithm M, the algorithm constructs (in the construct step) new features that are added to the feature set. In the end, useless features are pruned. These steps are iterated until some stopping criterion is met (e.g., a maximum number of iterations performed or a maximum number of created features). Most constructive induction systems construct features as conjunctions or disjunctions of literals. Literals are the features or their negations; e.g., for the feature set {a, b} the literal set is {a, ¬a, b, ¬b}. The operator sets {AND, Negation} and {OR, Negation} are both complete sets for the Boolean space: any Boolean function can be created using only operators from one set. FRINGE [Pagallo & Haussler 1990] creates new features using a decision tree that it builds at each iteration. New features are conjunctions of the last two nodes in each positive path (a positive path connects the root with a leaf having the class label true). The newly-created features are added to the feature set and then used in the next iteration to construct the decision tree. This first feature construction algorithm was initially designed to solve the replication problem in decision trees. The replication problem [Pagallo & Haussler 1990] states that, in a decision tree representing a Boolean function in its Disjunctive Normal Form, the same sequence of decision tests leading to a positive leaf is replicated in the tree. Other algorithms have further improved this approach. CITRE [Matheus 1990] adds other search strategies like root (selects the first two nodes in a positive path) or root-fringe

(selects the first and last node in the path). It also introduces domain knowledge by applying filters to prune the constructed features. CAT [Zheng 1998] is another example of a hypothesis-driven constructive algorithm similar to FRINGE. It also constructs conjunctive features based on the output of decision trees. It uses a dynamic path-based approach (the conditions used to generate new features are chosen dynamically) and it includes a pruning technique.

Alternative representations There are alternative representations, other than the conjunctive and disjunctive ones. The M-of-N and X-of-N representations use feature-value pairs. A feature-value pair AVk(Ai = Vij) is true for an instance if and only if the feature Ai has the value Vij for that instance. The difference between M-of-N and X-of-N is that, while the second one counts the number of true feature-value pairs, the first one uses a threshold parameter to assign a truth value to the entire representation. The algorithm ID2-of-3 [Murphy & Pazzani 1991] uses M-of-N representations for the newly-created features. It has a specialization and a generalization construction operator and it does not need to construct a new decision tree at each step, but instead integrates the feature construction into the decision tree construction. The XofN algorithm [Zheng 1995] functions similarly, except that it uses the X-of-N representation. It also takes into account the complexity of the generated features. Comparative studies like [Zheng 1996] show that conjunctive and disjunctive representations have very similar performances in terms of prediction accuracy and theoretical complexity. M-of-N, while more complex, has a stronger representation power than the two before. The X-of-N representation has the strongest representation power, but the same studies show that it suffers from data fragmentation more than the other three. The problem with all of these algorithms is that they work in a supervised environment and cannot function without a class label. In the following sections, we propose two approaches towards unsupervised feature construction.

4.3 uFRINGE - adapting FRINGE for unsupervised learning

We propose uFRINGE, an unsupervised version of FRINGE, one of the first feature construction algorithms. FRINGE [Pagallo & Haussler 1990] is a framework algorithm (see Section 4.2), following the general schema shown in Algorithm 2. It creates new features using a logical decision tree, built using a traditional algorithm like ID3 [Quinlan 1986] or C4.5 [Quinlan 1993]. Taking a closer look at FRINGE, one observes that its only supervised component is the decision tree construction. The actual construction of features is independent of the existence of a class attribute. Hence, using an "unsupervised decision tree" construction algorithm renders FRINGE unsupervised.

Clustering trees Clustering trees [Blockeel et al. 1998] were introduced as generalized logical decision trees. They are constructed using a top-down strategy: at each step, the cluster under a node is split into two subclusters, seeking to minimize the intra-cluster variance. The authors argue that the supervised indicators used in traditional decision tree algorithms are special cases of intra-cluster variance, as they measure intra-cluster class diversity. Following this interpretation, clustering trees can be considered as generalizations of decision trees and are suitable candidates for replacing ID3 in uFRINGE. Adapting FRINGE to use clustering trees is straightforward: it is enough to replace M in Algorithm 2 with the clustering tree algorithm. At each step, uFRINGE constructs a clustering tree using the dataset and the current feature set. Just like in FRINGE, new features are created using the conditions under the last two nodes in each path connecting the root to a leaf. FRINGE constructs new features starting only from positive leaves (leaves labelled true). But unlike in decision trees, in clustering trees the leaves are not labelled using class features. Therefore, in uFRINGE, we choose to construct new features based on all paths from the root to a leaf. Newly-constructed features are added to the feature set and used in the next clustering tree construction. The algorithm stops either when no more features can be constructed from the clustering tree or when a maximum allowed number of features has been constructed.

Limitations uFRINGE is capable of constructing new features in an unsupervised context. It is also relatively simple to understand and implement, as it is based on the same framework as FRINGE. However, it suffers from a couple of drawbacks. Constructed features tend to be redundant and contain duplicates. Newly-constructed features are added to the feature set and are used, alongside old features, in later iterations. Older features are never removed from the feature set and they can be combined multiple times, thus resulting in duplicates in the constructed feature set. What is more, old features can be combined with new features in which they already participate, therefore constructing redundant features (e.g., f2 and f1 ∧ f2 ∧ f3 resulting in f2 ∧ f1 ∧ f2 ∧ f3). Another limitation is controlling the number of constructed features. The algorithm stops when a maximum number of features has been constructed. This is very inconvenient, as the dimension of the new feature set cannot be known in advance and is highly dependent on the dataset. Furthermore, constructing too many features leads to overfitting and an overly complex feature set. These shortcomings could be corrected by refining the construction operator and by introducing a filter operator.

4.4 uFC - a greedy heuristic

We address the limitations of uFRINGE by proposing a second, innovative algorithm, called uFC (Unsupervised Feature Construction). We propose an iterative approach that reduces the overall correlation of the features of a dataset by iteratively replacing pairs of highly correlated features with conjunctions of literals. We use a greedy search strategy to identify the features that are highly correlated, then we use a construction operator to create new features. From two correlated features fi and fj we create three new features: fi ∧ fj, fi ∧ ¬fj and ¬fi ∧ fj. In the end, both fi and fj are removed from the feature set. The algorithm stops when no new features are created or when it has performed a maximum number of iterations. The formalization and the different key parts of the algorithm (e.g., the search strategy, the construction operators or the feature pruning) are presented in the next sections. Figure 4.3 illustrates visually, using Venn diagrams, how the algorithm replaces the old features with new ones.


Figure 4.3 – Graphical representation of how new features are constructed (Venn diagrams): (a) Iter. 0: initial features (primitives), (b) Iter. 1: combining f1 and f2 and (c) Iter. 2: combining f1 ∧ f2 and f3.

Features are represented as rectangles, where the rectangle of each feature contains the individuals having that feature set to true. Naturally, the individuals in the intersection of two rectangles have both features set to true. Figure 4.3a shows the configuration of the original feature set. f1 and f2 have a large intersection, showing that they co-occur frequently. On the contrary, f2 and f5 have a small intersection, suggesting that they co-occur less often than chance would predict (they are negatively correlated). f3 is included in the intersection of f1 and f2, while f4 has no common elements with any other feature: f4 is incompatible with all of the others. The purpose of the algorithm is to construct a new feature set in which there are no intersections between the corresponding Venn diagrams.

In the first iteration (Figure 4.3b), f1 and f2 are combined and 3 features are created: f1 ∧ f2, f1 ∧ ¬f2 and ¬f1 ∧ f2. These new features replace f1 and f2, the original ones. At the second iteration (Figure 4.3c), f1 ∧ f2 is combined with f3. As f3 is contained in f1 ∧ f2, the feature ¬(f1 ∧ f2) ∧ f3 has a support equal to zero and is removed. Note that f2 and f5 are never combined, as they are considered uncorrelated. The final feature set is {f1 ∧ f2 ∧ f3, f1 ∧ f2 ∧ ¬f3, f1 ∧ ¬f2, ¬f1 ∧ f2, f4, f5}.
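The two iterations of Figure 4.3 can be traced directly on Boolean vectors, as in the short sketch below; the vectors are invented so that f3 is included in f1 ∧ f2, as in the figure.

    import numpy as np

    f1 = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=bool)
    f2 = np.array([1, 1, 1, 0, 1, 0, 0, 0], dtype=bool)
    f3 = np.array([1, 1, 0, 0, 0, 0, 0, 0], dtype=bool)

    # Iteration 1: replace (f1, f2) by f1&f2, f1&~f2, ~f1&f2.
    f1_and_f2, f1_not_f2, not_f1_f2 = f1 & f2, f1 & ~f2, ~f1 & f2
    # Iteration 2: combine f1&f2 with f3; ~(f1&f2)&f3 has zero support and is pruned.
    print((~f1_and_f2 & f3).any())        # False: the zero-support feature is dropped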

4.4.1 uFC - the proposed algorithm

We define the set P = {p1, p2, ..., pk} of k user-supplied initial Boolean features and I = {i1, i2, ..., in} the dataset described using P. We start from the hypothesis that even if the primitive set P cannot adequately describe the dataset I (because of the correlations in the feature set), there exists a data-specific feature set F = {f1, f2, ..., fm} that can be created in order to represent the data better (meaning that the total correlation between features is reduced). New features are created iteratively, using conjunctions of primitive features or their negations (as seen in Figure 4.3). Our algorithm does not use the output of a learning algorithm in order to create the new features. Instead, we use a greedy search strategy and a feature set evaluation function to determine whether a newly-obtained feature set is more appropriate than the former one. The schema of our proposal is presented in Algorithm 3. The feature construction is performed starting from the dataset I and the primitives P. The algorithm follows the general inductive schema presented in Algorithm 2.

Algorithm 3 uFC – Unsupervised feature construction.
Input: P – set of primitive user-given features
Input: I – the data expressed using P, which will be used to construct features
Inner parameters: λ – correlation threshold for searching, limit_iter – maximum number of iterations
Output: F – set of newly-constructed features.

    F_0 ← P
    iter ← 0
    repeat
        iter ← iter + 1
        O ← search_correlated_pairs(I_iter, F_{iter-1}, λ)
        F_iter ← F_{iter-1}
        while O ≠ ∅ do
            pair ← highest_scoring_pair(O)
            F_iter ← F_iter ∪ construct_new_feat(pair)
            remove_candidate(O, pair)
        prune_obsolete_features(F_iter, I_iter)
        I_{iter+1} ← convert(I_iter, F_iter)
    until F_iter = F_{iter-1} OR iter = limit_iter
    F ← F_iter

At each iteration, uFC searches for frequently co-occurring pairs in the feature set created at the previous iteration (F_{iter-1}). It determines the candidate set O and then creates new features as conjunctions of the highest scoring pairs. The new features are added to the current set (F_iter), after which the set is filtered in order to remove obsolete features. At the end of each iteration, the dataset I is translated to reflect the feature set F_iter. A new iteration is performed as long as new features were generated in the current iteration and the maximum number of iterations has not yet been reached (limit_iter is a parameter of the algorithm).
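The following Python sketch follows the structure of Algorithm 3 under the simplifying assumptions of this section (Boolean features stored as NumPy vectors, the Pearson coefficient of Equation 4.5 as r, and zero-support pruning only); the function and variable names are illustrative and do not correspond to our actual implementation.

    import numpy as np

    def pearson_r(x, y):
        # Equation 4.5, computed from the 2x2 contingency table of two Boolean vectors.
        a = int(np.sum(x & y)); b = int(np.sum(x & ~y))
        c = int(np.sum(~x & y)); d = int(np.sum(~x & ~y))
        denom = np.sqrt((a + b) * (a + c) * (b + d) * (c + d))
        return 0.0 if denom == 0 else (a * d - b * c) / denom

    def ufc(features, lam=0.2, limit_iter=3):
        """features: dict mapping a feature name to its Boolean NumPy vector."""
        for _ in range(limit_iter):
            names = list(features)
            candidates = [(pearson_r(features[fi], features[fj]), fi, fj)
                          for k, fi in enumerate(names) for fj in names[k + 1:]]
            candidates = sorted((c for c in candidates if c[0] > lam), reverse=True)
            if not candidates:
                break                                  # no new features were created
            used = set()
            for _, fi, fj in candidates:
                if fi in used or fj in used:           # one combination per feature per iteration
                    continue
                used.update((fi, fj))
                x, y = features.pop(fi), features.pop(fj)
                for name, vec in ((f"({fi})&({fj})", x & y),
                                  (f"({fi})&~({fj})", x & ~y),
                                  (f"~({fi})&({fj})", ~x & y)):
                    if vec.any():                      # prune zero-support features
                        features[name] = vec
        return features

For instance, calling ufc on a dictionary of Boolean NumPy vectors such as {"water": ..., "cascade": ..., "tree": ...} returns the rewritten feature set, with the constructed Boolean formulas as dictionary keys.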

Temporal complexity In order to calculate the complexity of the algorithm presented in Algorithm 3, we consider that vector operations are indivisible and executed in O(1), which is the case in modern vectorized mathematical environments (e.g., Octave). Vector operations (e.g., the sum of two vectors, element-wise multiplication) are performed in almost constant time, due to optimized memory access and the parallelization of computation. Consequently, we consider that calculating the correlation between a pair of features has a complexity of O(1). Therefore, the search for correlated pairs has a complexity of |O| × O(1) = O(|O|). The maximum size of the set O is

$$ |O| \le \frac{|F_{iter-1}| \times (|F_{iter-1}| - 1)}{2} \le |F_{iter-1}|^2 \tag{4.1} $$

where |F_{iter-1}| is the size of the feature set created at the previous iteration. In order to calculate the maximum value of |F_{iter-1}|, we consider that, at most, 3 features are constructed from a single pair. Consequently, the following holds:

$$ |F_{iter}| \le \frac{3}{2} \times |F_{iter-1}| \le \left(\frac{3}{2}\right)^2 \times |F_{iter-2}| \le \dots \le \left(\frac{3}{2}\right)^{iter} \times |F_0| = \left(\frac{3}{2}\right)^{iter} \times k \tag{4.2} $$

Knowing that iter ≤ limit_iter, we deduce from Equations 4.1 and 4.2 that the maximum size of the set O is:

$$ |O| \le |F_{iter-1}|^2 \le \left(\frac{3}{2}\right)^{2 \times iter} \times k^2 \le \left(\frac{3}{2}\right)^{2 \times limit_{iter}} \times k^2 \tag{4.3} $$

As the rest of the operations in Algorithm 3 are executed in O(1) or O(k), and based on the size of the set O calculated in Equation 4.3, the temporal complexity of uFC is
$$ limit_{iter} \times O\!\left( \left(\frac{3}{2}\right)^{2 \times limit_{iter}} \times k^2 \right) $$

Considering that limit_iter is a constant, we obtain the final complexity of uFC, which is O(k²), therefore quadratic in the initial size of the feature set. In practice, if a pair (fi, fj) is chosen, the remove_candidate function removes any other pair that contains fi or fj, which greatly reduces the number of considered pairs and the execution time of the algorithm.

4.4.2 Searching co-occurring pairs

The search_correlated_pairs function searches for frequently co-occurring pairs of features in a feature set F. We start with an empty set O ← ∅ and we investigate all possible pairs of features (fi, fj) ∈ F × F. We use a function r to measure the co-occurrence of a pair of features (fi, fj) and compare it to a threshold λ. If the value of the function is above the threshold, their co-occurrence is considered significant and the pair is added to O. Therefore, the set O is

$$ O = \{ (f_i, f_j) \in F \times F,\ i \neq j \mid r((f_i, f_j)) > \lambda \} \tag{4.4} $$

For the correlation function r, we choose to use the empirical Pearson correlation coefficient, which is a classical measure of the strength of the linear dependency between two variables. r ∈ [−1, 1] and it is defined as the covariance of the two variables divided by the product of their standard deviations. The sign of r gives the direction of the correlation (inverse correlation for r < 0 and positive correlation for r > 0), while its absolute value or its square gives the strength of the correlation. A value of 0 implies that there is no linear correlation between the variables. When applied to Boolean variables, with the contingency table shown in Table 4.1, the r function has the following formulation:
$$ r((f_i, f_j)) = \frac{a \times d - b \times c}{\sqrt{(a+b) \times (a+c) \times (b+d) \times (c+d)}} \tag{4.5} $$
The λ threshold parameter serves to fine-tune the number of selected pairs. Its impact on the behaviour of the algorithm is studied in Section 4.6.3. A method for automatically choosing λ using statistical hypothesis testing is presented in Section 4.7.1.

Table 4.1 – Contingency table for two Boolean features.

          f_j   ¬f_j
 f_i       a     b
 ¬f_i      c     d
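As a small worked example of Equation 4.5, with invented counts for the four cells of Table 4.1:

    from math import sqrt

    # Toy contingency table (Table 4.1 notation); the counts are hypothetical.
    a, b, c, d = 30, 5, 10, 55
    r = (a * d - b * c) / sqrt((a + b) * (a + c) * (b + d) * (c + d))
    print(round(r, 3))      # positive value: the two features tend to co-occur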

4.4.3 Constructing and pruning features

Once O is constructed, uFC performs a greedy search. The function highest_scoring_pair is iteratively used to extract from O the pair (fi, fj) that has the highest co-occurrence score. The function construct_new_feat constructs three new features: fi ∧ fj, fi ∧ ¬fj and ¬fi ∧ fj. fi and fj can be either primitives or features constructed in previous iterations. The new features represent, respectively, the intersection of the initial two features and the relative complements of one feature in the other. By construction, the new features are guaranteed to be negatively correlated: if one of them is set to true for an individual, the other two are necessarily false. At each iteration, very simple features are constructed: conjunctions of two literals. More complex and semantically rich features appear through the iterative process. In the example shown in Figure 4.3 (p. 68), the feature f1 ∧ f2 ∧ f3 is obtained by combining f1 and f2 in a first iteration and then combining f1 ∧ f2 and f3 in a second iteration. After the construction of the features, the remove_candidate function removes from O the pair (fi, fj), as well as any other pair that contains fi or fj. This is because each feature is allowed to participate in only one combination per iteration. The newly-generated features replace the old features. When there are no more pairs in O, prune_obsolete_features is used to remove two types of features from the feature set:
– features that are false for all individuals. These usually appear in the case of hierarchical relations. We consider that f1 and f2 have a hierarchical relation if all individuals that have feature f1 set to true automatically have feature f2 set to true (e.g., f1 "is a type of" f2 or f1 "is a part of" f2). We denote this relation by f2 ⊇ f1. One of the generated features (in the example, f1 ∧ ¬f2) is false for all individuals and is, therefore, eliminated. In the example of water and cascade, we create only water ∧ cascade and water ∧ ¬cascade, since there cannot exist a cascade without water. This is possible in the context of a complete labeling, in which a value of false means the absence of a feature and not missing data.
– features that participated in the creation of a new feature. Effectively, all features that appear in a selected pair are replaced by the newly-constructed features:
$$ \{ f_i, f_j \in F \mid (f_i, f_j) \in O \} \xrightarrow{\text{replaced by}} \{ f_i \wedge f_j,\ f_i \wedge \neg f_j,\ \neg f_i \wedge f_j \} $$

4.5 Evaluation of a feature set

To our knowledge, there are no widely accepted measures for evaluating the overall correlation between the features of a feature set. We propose a measure inspired by the "inclusion-exclusion" principle [Feller 1950]. In set theory, this principle makes it possible to express the cardinality of the finite union of finite sets using the cardinalities of those sets and of their intersections. In its Boolean form, it is used to calculate the probability of a clause (disjunction of literals) as a function of its composing terms (conjunctions of literals). For example, given a feature set with 3 features F3 = {f1, f2, f3}, the probability of the clause f1 ∨ f2 ∨ f3 is calculated as

$$ p(f_1 \vee f_2 \vee f_3) = p(f_1) + p(f_2) + p(f_3) - p(f_1 \wedge f_2) - p(f_1 \wedge f_3) - p(f_2 \wedge f_3) + p(f_1 \wedge f_2 \wedge f_3) $$

Generalizing, given the feature set F = {f1, f2, ..., fm}, we have:

m   X k−1 X p(f1 ∨ f2 ∨ ... ∨ fm) = (−1) p(fi1 ∧ fi2 ∧ ... ∧ fik ) k=1 1≤i1<...

m m   X X k−1 X p(f1 ∨ f2 ∨ ... ∨ fm) = p(fi) + (−1) p(fi1 ∧ fi2 ∧ ... ∧ fik ) i=1 k=2 1≤i1<...

Without loss of generality, we can consider that each individual has at least one feature set to true; otherwise, we can create an artificial feature "null" that is set to true when all the others are false. Consequently, the left-hand side of the equation is equal to 1. On the right-hand side, the second term captures the probabilities of the intersections of the features. Knowing that $1 \le \sum_{i=1}^{m} p(f_i) \le m$, the magnitude of this intersection term is zero when all features are incompatible (no overlapping), and reaches a "worst case scenario" value of m − 1 when all individuals have all the features set to true. Based on these observations, we propose the Overlapping Index evaluation measure:

$$ OI(F) = \frac{\sum_{i=1}^{m} p(f_i) - 1}{m - 1} $$
where OI(F) ∈ [0, 1] and needs to be minimized. Hence, a feature set F1 describes a dataset better than another feature set F2 when OI(F1) < OI(F2).
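A minimal sketch of the OI computation, with the feature probabilities estimated as column means of a Boolean data matrix; the matrix below is a toy example in which every individual has at least one feature set to true, as assumed above.

    import numpy as np

    def overlapping_index(X):
        # X: Boolean matrix, rows = individuals, columns = the features of F.
        p = X.mean(axis=0)               # empirical p(f_i) for each feature
        m = X.shape[1]
        return (p.sum() - 1) / (m - 1)   # 0 = no overlap, 1 = maximal overlap

    X = np.array([[1, 1, 0],
                  [1, 0, 1],
                  [1, 1, 1]], dtype=bool)
    print(overlapping_index(X))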

4.5.1 Complexity of the feature set

Number of features Considering the case of the majority of machine learning datasets, where the number of primitives is smaller than the number of individuals in the dataset, reducing correlations between features comes at the expense of increasing the number of features. Consider a pair of features (fi, fj) judged correlated. Unless fi ⊇ fj or fi ⊆ fj, the algorithm replaces {fi, fj} by {fi ∧ fj, fi ∧ ¬fj, ¬fi ∧ fj}, thus increasing the total number of features. A feature set that contains too many features is no longer informative, nor comprehensible. The function unique(I) counts how many unique individuals exist in the dataset I. Two individuals are considered different if and only if their description vectors are not equal. Therefore, unique(I) ≤ |I|, as some individuals in the dataset might not be unique. Considering F as the constructed feature set and knowing that uFC searches for correlated pairs, when |F| = unique(I) there is a constructed feature for each unique individual in the dataset. Consequently, the maximum number of features that can be constructed by uFC is mechanically limited by the number of unique individuals:

$$ |F| \le unique(I) \le |I| $$

To measure the complexity in terms of number of features, we use:

$$ C_0(F) = \frac{|F| - |P|}{unique(I) - |P|} $$
where P is the primitive feature set. C_0 measures the ratio between the number of extra features that have been constructed and the maximum number of extra features that can be constructed. 0 ≤ C_0 ≤ 1 and it needs to be minimized (a value closer to 0 means a feature set with a lower complexity). C_0 is guaranteed to stay within these bounds only for the uFC algorithm: uFRINGE does not have any filtering mechanism and, as the results in Section 4.6.1 show, the same pairs get combined repeatedly and the total number of constructed features explodes. Furthermore, C_0 can be used only for datasets in which the number of primitive attributes is lower than the number of individuals. In certain applications, this assumption does not hold; e.g., in text mining, a typical dataset might have several hundred documents (individuals) and several thousand words serving as attributes (we present the textual numeric representation in Section 6.2.2, p. 129). For these cases, we propose in the next paragraph another indicator, based on the length of the constructed features.

The average length of features At each iteration, simple conjunctions of two literals are constructed. Complex Boolean formulas are created by combining features constructed in previous iterations. Long and complicated expressions generate incomprehensible features, which are more likely a random side-effect rather than a product of the underlying semantics. We define C_1 as the average number of literals (a primitive or its negation) that appear in the Boolean formula representing a feature. Denoting by $\overline{P} = \{\neg p_i \mid p_i \in P\}$ the set of negated primitives, the literal set is $L = P \cup \overline{P}$ and

$$ C_1(F) = \frac{\sum_{f_i \in F} \left|\{ l_j \in L \mid l_j \text{ is a literal in } f_i \}\right|}{|F|} $$
where P is the primitive set and 1 ≤ C_1 < ∞; it does not have an upper bound and needs to be minimized. As more iterations are performed, the feature set contains more features (C_0 grows) which are increasingly more complex (C_1 grows). This suggests a correlation between the two. What is more, since C_1 can potentially double at each iteration while C_0 can have at most a linear increase, the correlation is exponential. This correlation is further studied in Section 4.6.4. For this reason, and because C_1 does not have an upper bound, in the following sections we choose to use only C_0 as the complexity measure.
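The two complexity measures can be sketched as follows; the encoding of constructed features as tuples of literal names and the toy numbers are assumptions made for the example, not the representation used in our implementation.

    import numpy as np

    def c0(n_features, n_primitives, X_primitive):
        # C_0: extra features constructed, relative to the maximum possible.
        n_unique = len(np.unique(X_primitive, axis=0))   # unique(I)
        return (n_features - n_primitives) / (n_unique - n_primitives)

    def c1(feature_literals):
        # C_1: average number of literals per feature.
        return sum(len(lits) for lits in feature_literals) / len(feature_literals)

    # Toy data: 6 individuals described by 3 primitives (5 unique description vectors).
    X = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1],
                  [1, 1, 0], [1, 0, 1], [1, 1, 0]], dtype=bool)
    print(c0(n_features=4, n_primitives=3, X_primitive=X))               # 0.5
    print(c1([("water", "cascade"), ("water", "~cascade"), ("tree",)]))  # 5/3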

Overfitting All machine learning algorithms risk overfitting the learning set. There are two ways in which uFC can overfit the resulting feature set, corresponding to the two complexity measures above: (a) constructing too many features (measured by C_0) and (b) constructing features that are too long (measured by C_1). The worst overfitting of type (a) occurs when the algorithm constructs as many features as the theoretical maximum (one for each unique individual in the dataset). The worst overfitting of type (b) appears under the same conditions, where each constructed feature is a conjunction of all the primitives appearing for the corresponding individual. The two complexity measures can be used to quantify the two types of overfitting. Since C_0 and C_1 are correlated, both types of overfitting appear simultaneously and can be considered as two sides of a single phenomenon.

4.5.2 The trade-off between two opposing criteria

C_0 is a measure of how overfitted a feature set is. In order to avoid overfitting, the feature set complexity should be kept at low values while the algorithm optimizes the co-occurrence score of the feature set (measured using the OI measure). Optimizing both the correlation score and the complexity at the same time is not possible, as they are opposing criteria; a compromise between the two must be achieved. This is equivalent to the optimization of two contrary criteria, which is a very well-known problem in multi-objective optimization. To acquire a trade-off between the two mutually contradicting objectives, we use the concept of Pareto optimality [Sawaragi et al. 1985], originally developed in economics. Given multiple feature sets, a set is considered Pareto optimal if there is no other set that has both a better correlation score and a better complexity for the given dataset. The Pareto optimal feature sets form the Pareto front. This means that no single optimum can be constructed, but rather a class of optima, depending on the ratio between the two criteria. We plot the solutions in the plane defined by the co-occurrence score, as the horizontal axis, and the complexity, as the vertical axis, as shown in Figure 4.4. Constructing the Pareto front in this plane makes a visual evaluation of several characteristics of the uFC algorithm possible, based on the deviation of the solutions from the front. The distance between the different solutions and the constructed Pareto front visually shows how stable the algorithm is. The convergence of the algorithm can be visually evaluated by how fast (in number of performed iterations) the algorithm transits the plane from the region of solutions with low complexity and high co-occurrence score to solutions with high complexity and low co-occurrence score. We can also visually evaluate overfitting, which corresponds to the region of the plane with high complexity and low co-occurrence score: solutions found in this region are considered overfitted. In order to avoid overfitting, we propose the "closest-point" heuristic for finding a trade-off between OI and C_0. We choose to give the two criteria equal importance. We consider as a good compromise the solution in which the gain in co-occurrence score and the loss in complexity are fairly equal. If one of the indicators has a value considerably larger than the other, the solution is considered unsuitable: such solutions would have either a high correlation between features or a high complexity. Therefore, we perform a battery of tests and we search the Pareto front a posteriori for solutions for which the two indicators have essentially equal values. In the space of solutions, this translates into a minimal Euclidean distance between the solution and the ideal point (the point (0, 0)).
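A small sketch of the Pareto-front construction and of the "closest-point" heuristic in the (OI, C_0) plane; the candidate solutions are invented values.

    import numpy as np

    # (OI, C_0) pairs of candidate feature sets; toy values only.
    solutions = np.array([[0.24, 0.00],
                          [0.08, 0.07],
                          [0.03, 0.20],
                          [0.10, 0.15]])

    def pareto_front(points):
        # Keep the points not dominated on both (minimized) criteria.
        front = []
        for i, p in enumerate(points):
            dominated = any((q <= p).all() and (q < p).any()
                            for j, q in enumerate(points) if j != i)
            if not dominated:
                front.append(i)
        return front

    front = pareto_front(solutions)
    best = min(front, key=lambda i: np.linalg.norm(solutions[i]))   # "closest-point" choice
    print("Pareto front:", front, "selected trade-off:", best)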


Figure 4.4 – Example of the distribution of solutions in the (Co-occurrence, Complexity) space and the Pareto optimal solutions.


Figure 4.5 – Images related to the newly-constructed feature sky ∧ ¬building ∧ panorama on hungarian: (a) Hungarian puszta, (b), (c) Hungarian Mátra mountains.

4.6 Initial Experiments

Throughout the experiments, uFC was executed by varying only its two parameters: λ (defined in Equation 4.4) and limit_iter (defined in Algorithm 3). We denote an execution with specific parameter values as uFC(λ, limit_iter), whereas an execution in which the parameters were determined a posteriori using the "closest-point" strategy is denoted uFC*(λ, limit_iter). For uFRINGE, the maximum number of features was set to 300. We perform a comparative evaluation of the two algorithms from a qualitative and a quantitative point of view, together with examples of typical executions. Finally, we study the impact of the two parameters of uFC.


Figure 4.6 – Images related to the newly-constructed feature water ∧ cascade ∧ tree ∧ forest on hungarian.

Datasets Experiments were performed on three Boolean datasets. The hungarian dataset 1 is a real-life collection of images depicting Hungarian urban and countryside settings. Images were manually tagged using one or more of 13 tags. Each tag represents an object that appears in the image (e.g., tree, cascade). The tags serve as features and a feature takes the value true if the corresponding object is present in the image, or false otherwise. The resulting dataset contains 264 individuals, described by 13 Boolean features. Notice that the work presented in this chapter deals only with reconstructing the description space and does not deal with the images themselves (which are the object of Chapter 5). Therefore, the hungarian dataset contains only the labels associated with the images; the images are used only for illustrative purposes in the qualitative evaluation. The street dataset 2 was constructed in a similar way, starting from images taken from the LabelMe dataset [Russell et al. 2008]. 608 urban images from Barcelona, Madrid and Boston were selected. Image labels were transformed into tags depicting objects by using the uniformization list provided with the toolbox. The dataset contains 608 individuals, described by 66 Boolean features. The third dataset is "Spect Heart" 3 from the UCI repository. It describes cardiac Single Proton Emission Computed Tomography (SPECT) images. Each patient is classified into one of two categories, normal and abnormal (the "class" attribute). The database of 267 SPECT image sets (patients) was processed to extract 22 binary feature patterns. The original corpus is divided into a learning corpus and a testing one; we eliminated the class attribute and concatenated the learning and testing corpora into a single dataset. Unlike for the first two datasets, the features of spect have no specific meaning, being called "F1", "F2", ..., "F22".

4.6.1 uFC and uFRINGE: Qualitative evaluation

For the human reader, it is quite obvious why water and cascade have the tendency to appear together or why road and interior have the tendency to appear separately. One would expect that, based on a given dataset, the algorithms would succeed in making these associations and capturing the underlying semantics.

1. http://eric.univ-lyon2.fr/~arizoiu/files/hungarian.txt
2. http://eric.univ-lyon2.fr/~arizoiu/files/street.txt
3. http://archive.ics.uci.edu/ml/datasets/SPECT+Heart


Figure 4.7 – Images related to the newly-constructed feature headlight ∧ windshield ∧ arm ∧ head on street.

Table 4.2 shows the features constructed with uFRINGE and uFC*(0.194, 2) on hungarian. A quick overview shows that the constructed features manage to make associations that seem "logical" to a human reader. For example, one would expect the feature sky ∧ ¬building ∧ panorama to denote images where there is a panoramic view and the sky, but no buildings, therefore suggesting images outside the city. Figure 4.5 supports this expectation. Similarly, the feature sky ∧ building ∧ groups ∧ road covers urban images where groups of people are present, and water ∧ cascade ∧ tree ∧ forest denotes a cascade in the forest (Figure 4.6). Comprehension quickly deteriorates when the constructed feature set is overfitted, i.e., when the constructed features are too complex. The execution of uFC(0.184, 5) reveals features like:

sky ∧ building ∧ tree ∧ building ∧ forest ∧ sky ∧ building ∧ groups ∧ road ∧ sky ∧ building ∧ panorama ∧ groups ∧ road ∧ person ∧ sky ∧ groups ∧ road

Even if the formula is not in Disjunctive Normal Form (DNF), it is obviously too complex to make any sense. If uFC tends to construct overly complex features, uFRINGE suffers from another type of dimensionality curse. Even if the complexity of its features does not impede comprehension, the fact that over 300 features are constructed from 13 primitives makes the newly-constructed feature set unusable. The number of features is actually greater than the number of individuals in the dataset, which proves that some of the features are redundant. The overall correlation score of the newly-created feature set is even greater than that of the initial primitive set. What is more, the new features present redundancy, just as predicted in Section 4.3; for example, the feature water ∧ forest ∧ grass ∧ water ∧ person contains the primitive water twice. The same conclusions can be drawn from the execution on the street dataset. uFC*(0.322, 2) creates comprehensible features. For example, headlight ∧ windshield ∧ arm ∧ head (Figure 4.7) suggests images in which the front part of cars appears. It is especially interesting how the algorithm specifies arm in conjunction with head in order to differentiate between people (head ∧ arm) and objects that have heads (but no arms).

Table 4.2 – Feature sets constructed by uFC and uFRINGE.

primitives | uFRINGE                                    | uFC(0.194, 2)
person     | water ∧ forest ∧ grass ∧ water ∧ person    | groups ∧ road ∧ interior
groups     | panorama ∧ building ∧ forest ∧ grass       | groups ∧ road ∧ interior
water      | tree ∧ person ∧ grass                      | groups ∧ road ∧ interior
cascade    | tree ∧ person ∧ grass                      | water ∧ cascade ∧ tree ∧ forest
sky        | groups ∧ tree ∧ person                     | water ∧ cascade ∧ tree ∧ forest
tree       | person ∧ interior                          | water ∧ cascade ∧ tree ∧ forest
grass      | person ∧ interior                          | sky ∧ building ∧ tree ∧ forest
forest     | water ∧ panorama ∧ grass ∧ groups ∧ tree   | sky ∧ building ∧ tree ∧ forest
statue     | water ∧ panorama ∧ grass ∧ groups ∧ tree   | sky ∧ building ∧ tree ∧ forest
building   | statue ∧ groups ∧ groups                   | sky ∧ building ∧ panorama
road       | statue ∧ groups ∧ groups                   | sky ∧ building ∧ panorama
interior   | panorama ∧ statue ∧ groups                 | sky ∧ building ∧ panorama
panorama   | grass ∧ water ∧ forest ∧ sky               | groups ∧ road ∧ person
           | grass ∧ water ∧ forest ∧ sky               | groups ∧ road ∧ person
           | person ∧ grass ∧ water ∧ forest            | groups ∧ road ∧ person
           | groups ∧ sky ∧ grass ∧ building            | sky ∧ building ∧ groups ∧ road
           | groups ∧ sky ∧ grass ∧ building            | sky ∧ building ∧ groups ∧ road
           | person ∧ water ∧ forest ∧ statue ∧ groups  | sky ∧ building ∧ groups ∧ road
           | person ∧ water ∧ forest ∧ statue ∧ groups  | water ∧ cascade
           | grass ∧ person ∧ statue                    | tree ∧ forest
           | grass ∧ person ∧ statue                    | grass
           | ... and 284 others                         | statue

Table 4.3 – Values of indicators for multiple runs on each dataset.

Dataset   | Strategy        | #feat | length | OI   | C0
hungarian | Primitives      | 13    | 1.00   | 0.24 | 0.00
hungarian | uFC*(0.194, 2)  | 21    | 2.95   | 0.08 | 0.07
hungarian | uFC(0.184, 5)   | 36    | 11.19  | 0.03 | 0.20
hungarian | uFRINGE         | 306   | 3.10   | 0.24 | 2.53
street    | Primitives      | 66    | 1.00   | 0.12 | 0.00
street    | uFC*(0.446, 3)  | 81    | 2.14   | 0.06 | 0.04
street    | uFC(0.180, 5)   | 205   | 18.05  | 0.02 | 0.35
street    | uFRINGE         | 233   | 2.08   | 0.20 | 0.42
spect     | Primitives      | 22    | 1.00   | 0.28 | 0.00
spect     | uFC*(0.432, 3)  | 36    | 2.83   | 0.09 | 0.07
spect     | uFC(0.218, 4)   | 62    | 8.81   | 0.03 | 0.20
spect     | uFRINGE         | 307   | 2.90   | 0.25 | 1.45

4.6.2 uFC and uFRINGE: Quantitative evaluation

Table 4.3 shows, for the three datasets, the values of several indicators: the size of the feature set, the average length of a feature C_1, and the OI and C_0 indicators. For each dataset, we compare four feature sets: the initial feature set (primitives), the execution of uFC* (parameters determined by the "closest-point" heuristic), uFC with a set of parameters that generate an overfitted solution, and uFRINGE. For the hungarian and street datasets, the same parameter combinations are used as in the qualitative evaluation. On all three datasets, uFC* creates feature sets that are less correlated than the primitive sets, while the increase in complexity is only marginal. Very few (2-3) iterations are needed, as uFC converges very fast. Increasing the number of iterations has very little impact on OI, but results in very complex vocabularies (large C_0 and C_1). In the feature set created by uFC(0.180, 5) on street, each feature contains, on average, more than 18 literals, which is obviously too much for human comprehension. For uFRINGE, the OI indicator shows very marginal or no improvement on the spect and hungarian datasets, and even a degradation on street (compared to the primitive set). Features constructed using this approach have an average length between 2.08 and 3.1 literals, roughly as much as the selected uFC* configurations. However, uFRINGE constructs between 2.6 and 13.9 times more features than uFC*. We consider this to be due to the lack of filtering in uFRINGE, which would also explain the poor OI score: old features remain in the feature set and amplify the total correlation by adding the correlation between old and new features.

Some conclusions The quantitative evaluation leads to conclusions similar to those of the qualitative evaluation. Both uFC and uFRINGE succeed in capturing the semantic links between the primitive features by constructing comprehensible boolean formulas. Nevertheless, each suffers from a dimensionality problem: uFRINGE constructs too many new features, while uFC constructs features that are too complex. Furthermore, uFRINGE fails to reduce the total correlation between the newly constructed features. For uFC, it is crucial to control the complexity of the created features, and the proposed heuristic achieves this task: only marginal complexity increases yield large correlation decreases.

4.6.3 Impact of parameters λ and limititer

In order to understand the impact of the parameters, we executed uFC with a wide range of values for λ and limititer and studied the evolution of the indicators OI and C0. For each dataset, we varied λ between 0.002 and 0.5 with a step of 0.002. For each value of λ, we executed uFC by varying limititer between 1 and 30 for the hungarian dataset, and between 1 and 20 for street and spect (for execution time reasons, street and spect being larger datasets than hungarian). We study the evolution of the indicators as a function of limititer and of λ, respectively, plot the solutions in the (OI, C0) space and construct the Pareto front.

For the study of limititer, we hold λ fixed at various values and vary only limititer. The evolution of the OI correlation indicator is given in Figure 4.8a.


Figure 4.8 – Variation of indicators OI (top row) and C0 (bottom row) with limititer on hungarian (left column) and with λ on street (right column).

As expected, the measure improves with the number of iterations. OI descends very rapidly and needs fewer than 10 iterations to converge, on all datasets, towards a value that depends on λ: the higher the value of λ, the higher the value of convergence. The complexity has a very similar evolution, shown in Figure 4.8c, but in the opposite direction: it increases with the number of iterations performed. It also converges towards a value that depends on λ: the higher the value of λ, the lower the complexity of the resulting feature set.

Similarly, we study λ by fixing limititer. Figure 4.8b shows how OI evolves when varying λ. As foreseen, for all values of limititer, the OI indicator increases with λ, while C0 decreases with λ. OI shows an abrupt increase between 0.2 and 0.3 for all datasets. For lower values of λ, many pairs get combined, as their correlation score is higher than the threshold. As λ increases, only highly correlated pairs get selected, and this usually happens in the first iterations. Performing more iterations does not bring any change, and the indicators become less dependent on limititer. For hungarian, no pair has a correlation score higher than 0.4; setting λ above this value causes uFC to output the primitive set (no features are created). In Figure 4.8d, the evolution of the complexity is, as in the previous case of limititer, very similar to the one of OI, but in the opposite direction: the complexity of the constructed feature set decreases as λ increases.


Figure 4.9 – The distribution of solutions, the Pareto front and the “closest-point” solution on the spect dataset.


Pareto optimality To study Pareto optimality, we plot the generated solutions in the (OI, C0) space. Figure 4.9a presents the distribution of solutions, the Pareto front and the solution chosen by the “closest-point” heuristic. The solutions generated by uFC with a wide range of parameter values are not dispersed in the solution space; their distribution is rather compact, which shows good algorithm stability. Even if not all the solutions are Pareto optimal, none of them is too distant from the front and there are no outliers. Most of the solutions densely populate the part of the curve corresponding to low OI and high C0. As pointed out in Section 4.5.2, the area of the front corresponding to high feature set complexity (high C0) represents the overfitting area. This confirms that the algorithm converges fast, then enters overfitting. Most of the improvement in quality is obtained in the first 2-3 iterations, while further iterating improves quality only marginally, at the cost of an explosion in complexity. The “closest-point” heuristic keeps the construction out of overfitting by stopping the algorithm at the point where the gain in co-occurrence score and the loss in complexity are fairly equal. Figure 4.9b zooms into the region of the solution space corresponding to low numbers of iterations; both axes have equal scales.

4.6.4 Relation between number of features and feature length

Both the average length of a feature (C1) and the number of features (C0) increase with the number of iterations. In Section 4.5.1 we speculated that the two are correlated: C1 = f(C0). For each λ in the batch of tests, we create the C0 and C1 series as functions of limititer. Figure 4.10 shows the series for λ ∈ {0.002, 0.07, 0.15, 0.19, 0.2, 0.3} (C1 is plotted on a logarithmic scale). We perform a statistical hypothesis test, using the Kendall rank coefficient as the test statistic.


Figure 4.10 – The variation of complexities C0 and C1 (log scale) for multiple values of λ on the hungarian dataset.

The Kendall rank coefficient is particularly useful as it makes no assumptions about the distributions of C0 and C1. For all values of λ, on all datasets, the statistical test revealed a p-value of the order of 10⁻⁹. This is consistently lower than the habitually used significance levels, which makes us reject the null hypothesis of independence and conclude that C0 and C1 are statistically correlated. Furthermore, the graphics in Figure 4.10 lead us to conclude empirically that there is an exponential relation between C0 and C1: C1 = f(e^C0).
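To make the test concrete, a minimal sketch (illustrative only, not code from the thesis) using SciPy's Kendall rank correlation could look as follows:

    # Illustrative sketch: test the independence of the C0 and C1 series with
    # the Kendall rank correlation, as described above.
    from scipy.stats import kendalltau

    def c0_c1_dependence(c0_series, c1_series, alpha=0.05):
        tau, p_value = kendalltau(c0_series, c1_series)
        # reject the independence hypothesis when the p-value falls below alpha
        return tau, p_value, p_value < alpha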

4.7 Improving the uFC algorithm

The major difficulty of uFC, shown by the initial experiments, is setting the values of its parameters. An unfortunate choice results in either an overly complex feature set or a feature set whose features are still correlated. Both parameters, λ and limititer, are dependent on the dataset, and finding suitable values would be a process of trial and error for each new corpus. The “closest-point” heuristic achieves an acceptable equilibrium between complexity and performance, but it requires multiple executions with a wide range of parameter values and the construction of the Pareto front, which might be very costly, especially for large datasets. We propose, in Section 4.7.1, a new method for choosing λ based on statistical hypothesis testing and, in Section 4.7.2, a new stopping criterion inspired by the “closest-point” heuristic. These are integrated into a new “risk-based” heuristic, which approximates the best solution while avoiding the time-consuming construction of multiple solutions and of the Pareto front. The only parameter is the significance level α, which is independent of the dataset and which simplifies the task of running uFC on new, unseen datasets. A pruning technique is also proposed.

4.7.1 Automatic choice of λ

The co-occurrence threshold λ is highly dependent on the dataset (e.g., on small datasets it should be set higher, while on large datasets only a small value is required to consider two features as correlated). Therefore, we propose to replace this user-supplied threshold with a technique that selects only the pairs of features for which the positive linear correlation is statistically significant. These pairs are added to the set O of co-occurring pairs (defined in Section 4.4.2) and, starting from O, new features are constructed. We use a statistical method: hypothesis testing. For each pair of candidate features, we test the independence hypothesis H0 against the positive correlation hypothesis H1. We use the Pearson correlation coefficient (calculated as defined in Section 4.4.2) as the test statistic and test the following formally defined hypotheses: H0 : ρ = 0 and H1 : ρ > 0, where ρ is the theoretical correlation coefficient between two candidate features. We can show that, in the case of Boolean variables having the contingency table shown in Table 4.1, the observed value of the χ² statistic of independence is χ²_obs = n·r² (n is the size of the dataset). Consequently, assuming that H0 is true, n·r² approximately follows a χ² distribution with one degree of freedom (n·r² ∼ χ²₁), so r·√n follows a standard normal distribution (r·√n ∼ N(0, 1)), given that n is large enough. We reject the H0 hypothesis in favour of H1 if and only if r·√n ≥ u_{1−α}, where u_{1−α} is the right critical value of the standard normal distribution. Two features are therefore considered significantly correlated when r((fi, fj)) ≥ u_{1−α}/√n.

The significance level α represents the risk of rejecting the independence hypothesis when it was in fact true. It can be interpreted as the false discovery risk in data mining. In our context of feature construction, α represents the false construction risk, since this is the risk of constructing new features based on a pair of features that are not really correlated. The statistical literature usually sets α at 0.05 or 0.01, but levels of 0.001 or even 0.0001 are often used. The proposed method repeats the independence test a great number of times, which inflates the number of type I errors (a type I error is the incorrect rejection of a true null hypothesis, a false positive). [Ge et al. 2003] presents several methods for controlling the false discoveries. Setting aside the Bonferroni correction, often considered too simplistic and too drastic, one has the option of using sequential rejection methods [Benjamini & Liu 1999, Holm 1979], the q-value method of Storey [Storey 2002] or bootstrap-based methods [Lallich et al. 2006]. In our case, applying these methods is not clear-cut, as the tests performed at each iteration depend on the results of the tests performed at previous iterations. It is noteworthy that a trade-off must be reached between the inflation of false discoveries and the inflation of missed discoveries. This makes us choose a risk between 5% and 5%/m, where m is the theoretical number of tests to be performed.
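As an illustration, the resulting selection rule can be sketched as follows (this is not the thesis' code; r is assumed to be the Pearson coefficient computed from the 2×2 contingency table of the two Boolean features, as in Section 4.4.2):

    # Illustrative sketch: keep a candidate pair (fi, fj) when r >= u_{1-alpha} / sqrt(n),
    # i.e. when its positive correlation is statistically significant at level alpha.
    from math import sqrt
    from scipy.stats import norm

    def significantly_correlated(r, n, alpha=0.01):
        u = norm.ppf(1.0 - alpha)   # right critical value u_{1-alpha} of N(0, 1)
        return r >= u / sqrt(n)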

4.7.2 Stopping criterion. Candidate pruning technique.

In this section, we deal with the second data-dependent parameter of uFC: limititer. limititer controls the number of iterations the algorithm performs and, implicitly, the complexity of the constructed features. We replace the limititer parameter with a stopping criterion and propose a new data-independent heuristic. We also introduce a pruning technique, derived from the theoretical conditions necessary for the χ² statistical test.

Risk-based heuristic We introduced in Section 4.5.2 the “closest-point” heuristic for choosing the values of the parameters λ and limititer. It searches the Pareto front for the solution for which the two indicators are approximately equal. We transform this heuristic into a stopping criterion: OI and C0 are combined into a single formula, the root mean square (RMS), also known as the quadratic mean. The algorithm iterates while the value of RMS decreases and stops when RMS has reached a minimum. The RMS function has the formula RMS(OI, C0) = √((OI² + C0²)/2) and tends to take a value closer to the maximum of OI and C0. This means that when either OI or C0 is very high, the RMS function has a high value; its value decreases as the difference between the two indicators decreases (as shown in the experiments, in Figure 4.11). In our case, RMS reaches its minimum value when OI and C0 have equal values. The data-dependent limititer parameter is thus replaced by the automatic RMS stopping criterion. This stopping criterion, together with the automatic λ choice strategy presented in Section 4.7.1, forms a data-independent heuristic for choosing parameters, which we call the risk-based heuristic. This new heuristic makes it possible to approximate the data-dependent parameters (λ is approximated by u_{1−α}/√n and limititer by the RMS criterion) and to avoid the time-consuming task of computing a batch of solutions and constructing the Pareto front.
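A minimal sketch of this stopping rule, assuming a hypothetical ufc_iteration function performing one uFC iteration and an evaluate function returning the (OI, C0) pair of a feature set, could be:

    # Illustrative sketch of the RMS stopping criterion: iterate while the root
    # mean square of OI and C0 decreases, and keep the last improving feature set.
    from math import sqrt

    def rms(oi, c0):
        return sqrt((oi ** 2 + c0 ** 2) / 2.0)

    def run_until_rms_minimum(feature_set, ufc_iteration, evaluate):
        best, best_rms = feature_set, rms(*evaluate(feature_set))
        while True:
            candidate = ufc_iteration(best)
            candidate_rms = rms(*evaluate(candidate))
            if candidate_rms >= best_rms:   # RMS stopped decreasing
                return best
            best, best_rms = candidate, candidate_rms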

Pruning The theoretical condition necessary in order to apply the χ² independence test is that the expected (theoretical) frequencies, computed under the H0 hypothesis, are greater than or equal to 5. We add this constraint to the new feature search strategy (defined in Section 4.4.2). With the notations of the contingency table in Table 4.1, pairs for which the expected frequencies (a+b)(a+c)/n, (a+b)(b+d)/n, (a+c)(c+d)/n and (b+d)(c+d)/n are not all greater than 5 are filtered out of the set of candidate pairs O. This prevents the algorithm from constructing features that are present for very few individuals in the dataset. While this pruning technique is inherently related to the χ² independence test, it can also be applied to the initial form of the uFC algorithm, proposed in Section 4.4. As the experiments presented in the next section show, applying this theoretical pruning successfully tackles the problem of overfitting.
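The filter itself reduces to a check on the four expected frequencies; a minimal sketch, assuming the cells a, b, c, d of the 2×2 contingency table of Table 4.1 are laid out with rows (a, b) and (c, d), is:

    # Illustrative sketch of the pruning rule: a candidate pair is kept only when
    # every expected frequency of its 2x2 contingency table is at least 5.
    def passes_pruning(a, b, c, d, min_expected=5.0):
        n = a + b + c + d
        expected = [(a + b) * (a + c) / n, (a + b) * (b + d) / n,
                    (c + d) * (a + c) / n, (c + d) * (b + d) / n]
        return all(e >= min_expected for e in expected)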

4.8 Further Experiments

We test the proposed improvements, similarly to the methods used in Section 4.6, on the same three datasets: hungarian, spect and street. We execute uFC in two ways: the classical uFC (Section 4.4) and the improved uFC (Section 4.7). The classical uFC needs to have parameters λ and limititer set (noted uFC(λ, limititer)). uFC*(λ, limititer) denotes the execution with parameters which were determined a posteriori using the “closest-point” heuristic. The improved uFC will be denoted as uFCα(risk). The “risk-based” heuristic will be used to determine the parameters and control the execution.

4.8.1 Risk-based heuristic for choosing parameters

Root Mean Square In the first batch of experiments, we study the variation of the Root Mean Square aggregation function for a series of selected values of λ. We vary limititer between 0 and 30 for hungarian, and between 0 and 20 for spect and street.


Figure 4.11 – RMS vs. limititer on spect.


Figure 4.12 – “Closest-point” and “risk-based” for multiple α on hungarian.

The evolution of RMS is presented in Figure 4.11.

For all λ, the RMS starts by decreasing, as OI descends more rapidly than C0 increases. In just 1 to 3 iterations, RMS reaches its minimum, after which its value starts to increase. This is due to the fact that complexity increases rapidly, with only marginal improvement in quality. This behaviour is consistent with the results presented in Section 4.6. As already discussed in Section 4.6.3, λ has a bounding effect on complexity, thus explaining why RMS reaches a maximum for higher values of λ.

The “risk-based” heuristic The “risk-based” heuristic approximates the data-dependent parameters λ and limititer using the data-independent significance level α and the RMS stopping criterion. The second batch of experiments compares the “risk-based” heuristic to the “closest-point” heuristic. The “closest-point” solution is determined as described in Section 4.6. The “risk-based” heuristic is executed multiple times, with values α ∈ {0.05, 0.01, 0.005, 0.001, 0.0008, 0.0005, 0.0003, 0.0001, 0.00005, 0.00001}.

Table 4.4 – “closest-point” and “risk-based” heuristics.

Dataset | Strategy | λ | limititer | #feat | #common | length | OI | C0
hung. | Primitives | - | - | 13 | - | 1.00 | 0.235 | 0.000
hung. | uFC*(0.194, 2) | 0.194 | 2 | 21 | - | 2.95 | 0.076 | 0.069
hung. | uFCα(0.001) | 0.190 | 2 | 22 | 19 | 3.18 | 0.071 | 0.078
street | Primitives | - | - | 66 | - | 1.00 | 0.121 | 0.000
street | uFC*(0.446, 3) | 0.446 | 3 | 87 | - | 2.14 | 0.062 | 0.038
street | uFCα(0.0001) | 0.150 | 1 | 90 | 33 | 1.84 | 0.060 | 0.060
spect | Primitives | - | - | 22 | - | 1.00 | 0.279 | 0.000
spect | uFC*(0.432, 3) | 0.432 | 3 | 36 | - | 2.83 | 0.086 | 0.071
spect | uFCα(0.0001) | 0.228 | 2 | 39 | 19 | 2.97 | 0.078 | 0.086

Table 4.4 gives a quantitative comparison between the two heuristics. A risk of 0.001 is used for hungarian and 0.0001 for spect and street (because of the size of these datasets; see the discussion in Section 4.7.1 about repeating an independence test multiple times). The feature sets created by the two approaches are very similar according to all indicators. Not only are the differences between the values of OI, C0, average feature length and feature set dimension negligible, but most of the created features are identical. On hungarian, 19 of the 21 features created by the two heuristics are identical. Table 4.5 shows the two feature sets, with the non-identical features in bold. Figure 4.12 presents the distribution of solutions created by the “risk-based” heuristic with multiple values of α, plotted on the same graphic as the Pareto front in the (OI, C0) space. Solutions for different values of the risk α are grouped closely together. Not all of them are on the Pareto front, but they are never too far from the “closest-point” solution, providing a good equilibrium between quality and complexity.

Degraded performances On street, the performances of the “risk-based” heuristic start to degrade compared to uFC*. Table 4.4 shows differences in the resulting complexity, and only 33% of the constructed features are common to the two approaches. Figure 4.13a shows that the solutions found by the “risk-based” approach move away from the “closest-point”. The cause is the large size of the street dataset: as the sample size increases, the null hypothesis tends to be rejected at lower p-values. The auto-determined λ threshold is set too low and the constructed feature sets are too complex. Pruning solves this problem, as shown in Figure 4.13b and Section 4.8.2.

4.8.2 Pruning the candidates

The pruning technique is independent of the “risk-based” heuristic and can be applied in conjunction with the classical uFC algorithm. An execution of this type will be denoted uFCP(λ, maxiter).

Table 4.5 – Feature sets constructed by “closest-point” and “risk-based” heuristics on hungarian.

primitives | uFC*(0.194, 2) | uFCα(0.001)
person | groups ∧ road ∧ interior | groups ∧ road ∧ interior
groups | groups ∧ road ∧ interior | groups ∧ road ∧ interior
water | groups ∧ road ∧ interior | groups ∧ road ∧ interior
cascade | water ∧ cascade ∧ tree ∧ forest | water ∧ cascade ∧ tree ∧ forest
sky | water ∧ cascade ∧ tree ∧ forest | water ∧ cascade ∧ tree ∧ forest
tree | water ∧ cascade ∧ tree ∧ forest | water ∧ cascade ∧ tree ∧ forest
grass | sky ∧ building ∧ tree ∧ forest | sky ∧ building ∧ tree ∧ forest
forest | sky ∧ building ∧ tree ∧ forest | sky ∧ building ∧ tree ∧ forest
statue | sky ∧ building ∧ tree ∧ forest | sky ∧ building ∧ tree ∧ forest
building | sky ∧ building ∧ panorama | sky ∧ building ∧ panorama
road | sky ∧ building ∧ panorama | sky ∧ building ∧ panorama
interior | sky ∧ building ∧ panorama | sky ∧ building ∧ panorama
panorama | groups ∧ road ∧ person | groups ∧ road ∧ person
 | groups ∧ road ∧ person | groups ∧ road ∧ person
 | groups ∧ road ∧ person | groups ∧ road ∧ person
 | water ∧ cascade | sky ∧ building ∧ groups ∧ road
 | sky ∧ building | sky ∧ building ∧ groups ∧ road
 | tree ∧ forest | sky ∧ building ∧ groups ∧ road
 | groups ∧ road | water ∧ cascade
 | grass | tree ∧ forest
 | statue | grass
 |  | statue

We execute uFCP(λ, maxiter) with the same parameters and on the same datasets as described in Section 4.6.3. We compare uFC with and without pruning by plotting on the same graphic the two Pareto fronts resulting from each set of executions. Figure 4.14a shows the pruned and non-pruned Pareto fronts on hungarian. The graphic should be interpreted in a manner similar to a ROC curve, since the algorithm seeks to minimize OI and C0 at the same time. When one Pareto front runs closer to the origin (0, 0) than a second, the first dominates the second and, thus, its corresponding approach yields better results. For all datasets, the pruned Pareto front dominates the non-pruned one. The difference is marginal, but it proves that filtering improves the results.

Some conclusions The most important conclusion is that filtering limits complexity. As the initial experiments (Figure 4.9a) showed, most of the non-pruned solutions correspond to very high complexities. Visually, the Pareto front is tangent to the vertical axis (the complexity), showing complexities around 0.8-0.9 (out of 1). On the other hand, the Pareto front corresponding to the pruned approach stops, for all datasets, at complexities lower than 0.15. This proves that filtering successfully discards solutions that are too complex to be interpretable.

Last, but not least, filtering corrects the problem of degraded performances of the “risk-based” heuristic on big datasets. We ran uFCP with risk α ∈ {0.05, 0.01, 0.005, 0.001, 0.0008, 0.0005, 0.0003, 0.0001, 0.00005, 0.00001}. Figure 4.13b presents the distribution of solutions found with the “risk-based pruned” heuristic on street. Unlike the results without pruning (Figure 4.13a), the solutions generated with pruning are distributed closely to those generated by “closest-point” and to the Pareto front.


Figure 4.13 – “Closest-point” and “Risk-based” heuristics for street without pruning (a) and with pruning (b).


Figure 4.14 – Pruned and non-pruned Pareto fronts on hungarian (a) and a zoom to the relevant part (b).


Figure 4.15 – uFCα(risk) stability on hungarian when varying the noise percentage: indicators (a) and number of constructed features (b).

4.8.3 Algorithm stability

In order to evaluate the stability of the uFCα algorithm, we introduce noise in the hungarian dataset. The percentage of noise varies between 0% (no noise) and 30%. Introducing a percentage x% of noise means that x% × k × n randomly chosen feature values in the dataset are inverted (false becomes true and true becomes false), where k is the number of primitives and n is the number of individuals. For each given noise percentage, 10 noised datasets are created and only the averages are presented. uFCα is executed on all the noised datasets, with the same combination of parameters (risk = 0.001 and no filtering). The stability is evaluated using five indicators:
– Overlapping Index (OI);
– Feature set complexity (C0);
– Number of features: the total number of features constructed by the algorithm;
– Common with zero noise: the number of identical features between the feature sets constructed on the noised datasets and on the non-noised dataset. This indicator evaluates the extent to which the algorithm is capable of constructing the same features, even in the presence of noise;
– Common between runs: the average number of identical features between feature sets constructed using datasets with the same noise percentage. This indicator evaluates how much the constructed feature sets differ for a given noise level.

As the noise percentage increases, the dataset becomes more random. Fewer pairs of primitives are considered as correlated and therefore fewer new features are created. Figure 4.15a shows that the overlapping indicator increases with the noise percentage, while the complexity decreases. Furthermore, most feature values in the initial dataset are set to false. As the percentage of noise increases, the ratio equilibrates (more false values become true than the contrary). As a consequence, for high noise percentages, the OI score is higher than for the primitive set. The same conclusions can be drawn from Figure 4.15b. The Number of features indicator descends when the noise percentage increases: fewer features are constructed and the resulting feature set is very similar to the primitive set. The number of constructed features stabilizes around 20% of noise, which is the point where most of the initial correlation between features is lost. Common with zero noise has a similar evolution: the number of features identical to those of the non-noised dataset descends quickly and stabilizes around 20%. After 20%, all the identical features are among the initial primitives. Similarly, the value of Common between runs descends at first. For small amounts of introduced noise, the correlation between certain features is reduced, modifying the order in which pairs of correlated features are selected in Algorithm 3. This results in a diversity of constructed feature sets. As the noise level increases and the noised datasets become more random, the constructed feature sets resemble the primitive set, therefore augmenting the value of Common between runs.
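As an illustration of this noising protocol (not code from the thesis), flipping x% of the k × n Boolean cells of a dataset could be sketched as follows:

    # Illustrative sketch: invert a fraction of the n*k boolean cells, chosen at
    # random without replacement, as in the stability protocol described above.
    import numpy as np

    def add_noise(data, noise_fraction, seed=None):
        rng = np.random.default_rng(seed)
        noised = data.copy()                      # boolean array, shape (n, k)
        n_flips = int(round(noise_fraction * noised.size))
        flips = rng.choice(noised.size, size=n_flips, replace=False)
        noised.flat[flips] = ~noised.flat[flips]  # false <-> true
        return noised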

Some conclusions Introducing noise in the dataset amounts to destroying the semantic connections between the features. As the noise percentage increases, the co-occurrence of features is increasingly the result of chance alone. As uFC was designed to detect the relationships between features, it behaves as expected: it creates fewer new features or simply outputs the primitive set from a certain noise level onward. This cannot be attributed to a particular sensitivity of uFC to noise, but simply to the fact that noise destroys the existing semantic information in the dataset, therefore making it impossible for uFC to detect any connections.

4.9 Usage of the multi-objective optimization techniques

Throughout this chapter, we have used techniques based on multi-objective optimization to visually evaluate the generated feature sets. We assess the distribution of the solutions in the measure space, the stability of the solutions, the overfitting, the convergence and the proposed heuristics. In this section, we summarize the usage of these techniques as an empirical evaluation approach. The solutions are visualized in the bi-dimensional space defined by the co-occurrence score (OI) and the feature set complexity (C0) evaluation measures. We vary the different parameters of our approaches and plot the obtained solutions in the (OI, C0) plane. The two are opposing criteria: decreasing one increases the other. Therefore, the notion of Pareto optimality can be applied. We construct the Pareto front a posteriori, by simply selecting the solutions that are not dominated. We use the created Pareto front to perform several evaluations.
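The a posteriori selection of the non-dominated solutions is straightforward; a minimal sketch, with each solution represented as an (OI, C0) pair, is:

    # Illustrative sketch: a solution is kept when no other solution is at least
    # as good on both OI and C0 and strictly better on at least one of them.
    def pareto_front(solutions):
        def dominated(s):
            return any(o != s and o[0] <= s[0] and o[1] <= s[1] for o in solutions)
        return [s for s in solutions if not dominated(s)]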

Quick visualization of the distribution of solutions and the stability of solutions The main usage of the Pareto front is that it allows a quick visualization of the distribution of the solutions in the (OI, C0) space. The visualization permits an empirical evaluation of how a choice of parameters impacts the obtained co-occurrence score and complexity, compared to the Pareto-optimal solutions. Figures 4.9 (p. 81) and 4.16 present typical distributions of solutions in the measure space. The distance between the different solutions and the Pareto front shows how stable the constructed solutions are. The distribution graphics show the solutions grouped closely to the front, indicating little variation and good solution stability.


Figure 4.16 – street dataset: Visualizing the overfitting (a) and the speed of convergence (b).

Visualization of the overfitting and convergence of the algorithm We have shown in Section 4.6.1 that one of the major concerns when building a new feature set is the overfitting of the new features to the data. Overfitted sets are too complex to be comprehensible by a human, and they can be detected numerically due to their high complexity and low co-occurrence score. By plotting a solution in the (OI, C0) space, we can visually assess whether it is overfitted. Figure 4.16a depicts overfitting as the region close to the vertical axis (low OI and high C0 score). Visibly, most of the constructed solutions can be found in the overfitting region. This is due to the high convergence speed of our algorithm. The speed of convergence (in number of iterations) can be visually evaluated by how fast the algorithm transits from solutions with low complexity and high co-occurrence score (south-east region of the graphic) to solutions with high complexity and low co-occurrence (north-west region). Figure 4.16b shows the points in the (OI, C0) space corresponding to the feature sets constructed by uFC at each iteration. Most of the gain in co-occurrence score is obtained in the first 2 or 3 iterations; starting from this point, solutions are usually overfitted.

Limiting the overfitting and detecting abnormal solutions Given the risk of overfitting, we proposed in Section 4.5.2 the “closest-point” heuristic, which consists in choosing, on the constructed Pareto front, the point where the loss in complexity and the gain in co-occurrence score are fairly equal. Apart from avoiding overfitting, the “closest-point” heuristic has the advantage of automatically choosing the values of the λ and limititer parameters. However, the “closest-point” heuristic demands a full sweep of parameter values, which can be quite time consuming. In Section 4.7.1 we propose the “risk-based” heuristic, based on statistical testing. In Figure 4.12 (p. 85) we visually evaluate how the solutions obtained using the “risk-based” heuristic are positioned in comparison with the solution obtained using the “closest-point” heuristic. We further evaluate the behavior of the “risk-based” heuristic by plotting the obtained solutions on the previously constructed Pareto front. In this way, we detect, in Figure 4.13a (p. 88), abnormal solutions on the street dataset by visualizing their deviation from the front.

Comparing two approaches We correct the abnormal solutions by introducing, in Section 4.7.2, a theoretical pruning technique. In Section 4.8.2, we construct the two Pareto fronts corresponding to the execution of uFC with and without pruning. In Figure 4.14b (p. 88), we plot the two Pareto fronts on the same graphic and show that the one corresponding to the filtered version consistently dominates the non-filtered version. We draw the conclusion that filtering is beneficial, since it obtains better scores for both conflicting criteria at the same time. Furthermore, it prevents entering the overfitting region.

4.10 Conclusion and future work

Conclusion The work presented in this chapter tackles one of the core research problems of this thesis: leveraging semantics into data representation. More specifically, we adapt the feature set used to describe the data to the semantic relationships between features induced by the data itself. We propose two approaches towards augmenting the expressive power of the feature set employed to describe a Boolean dataset. Our proposals construct new features by taking into account the underlying semantics present in the dataset. Unlike the other feature construction algorithms proposed so far in the literature, our proposals work in an unsupervised learning paradigm. uFRINGE is an unsupervised adaptation of the FRINGE algorithm, while uFC is a new approach that replaces linearly correlated features with conjunctions of literals. We prove that our approaches succeed in reducing the overall correlation in the feature set, while constructing comprehensible features. We have performed extensive experiments to highlight the impact of the parameters on the total correlation measure and on the feature set complexity. Based on the first set of experiments, we have proposed a heuristic that finds a suitable balance between quality and complexity and avoids the time-consuming multiple executions followed by a Pareto front construction. We use statistical hypothesis testing and confidence levels for parameter approximation, and reasoning on the Pareto front of the solutions for evaluation. We also propose a pruning technique, based on hypothesis testing, that limits the complexity of the generated features and speeds up the construction process.

Conceptual articulation with previously presented work Our interest in data representation originally stemmed from the need to reorganize a label set used for tagging images. We employ these labels later, in Chapter 5, in order to improve the semantic description of image data. This chapter's proposals concerning data representation can be articulated with those presented in Figure 3.1 of Chapter 3, as shown in Figure 4.17. The connecting point is the data being represented in a semantic-aware numeric space. This space can be improved by removing co-occurrences between features, before being used for detecting typical evolutions or identifying social roles. Of course, integrating our approaches would require additional development, some of which is presented in the next paragraph.

Figure 4.17 – Streamlined schema showing how the contributions in this chapter can be conceptually articulated with those in the previous chapters.

Current work We already have ongoing work toward incorporating temporal information into the feature construction algorithm. This will allow us to simultaneously address the two major research problems of this thesis: the semantic representation of data and the use of its temporal dimension. The datasets presented in Section 4.6 have no temporal component. The building block of the uFC algorithm is feature co-occurrence. For example, “manifestation” co-occurs with “urban” because manifestations usually take place in cities. With the introduction of temporal information, the definition of the problem changes and the question of co-occurrence in a temporal context arises. Some features might co-occur, but not simultaneously. For example, the arrival to power of a socialist government and the increase of the country's public deficit might be correlated, but with a time lag, as macro-economic indicators have a large inertia. The purpose of this work is to detect such “correlations with a time lag” and create new features of the kind “socialist” and “public_deficit” co-occur with a time lag δ. This would allow us to (i) improve the data representation by using the temporal dimension in addition to data semantics and (ii) detect hidden temporal correlations, which might prove to be causalities. In addition, the newly constructed temporal features can serve to create comprehensible labels for the temporal clusters extracted using TDCK-Means (proposed in Chapter 3). We have extended the correlation coefficient defined in Equation 4.5 to calculate the correlation at a given fixed lag δ. The experiments we have performed so far show that an “optimum” lag δ can be determined, which maximizes the temporal correlation. We are currently working on an extension of the “and” operator to the temporal case. The new features are no longer constructed as boolean expressions, but as temporal chains of the form fi →(δ1) fj →(δ2) fk, meaning that fi precedes fj at a time distance δ1 and fj precedes fk at a time distance δ2.
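For illustration only (this is not the thesis' Equation 4.5, which is not reproduced here), estimating such a lagged correlation and scanning for the lag that maximizes it could be sketched as follows, treating each feature as a time-indexed 0/1 series:

    # Illustrative sketch: correlate feature f_i at time t with feature f_j at
    # time t + delta, and scan delta for the lag that maximizes the correlation.
    import numpy as np

    def lagged_correlation(fi_series, fj_series, delta):
        x = np.asarray(fi_series, dtype=float)
        y = np.asarray(fj_series, dtype=float)
        if delta > 0:
            x, y = x[:-delta], y[delta:]
        return np.corrcoef(x, y)[0, 1]

    def best_lag(fi_series, fj_series, max_delta):
        return max(range(max_delta + 1),
                   key=lambda d: lagged_correlation(fi_series, fj_series, d))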

Other future work A research direction we privilege is adapting our algorithms to Web 2.0 data (e.g., the automatic treatment of labels on the web). Several challenges arise, such as very large label sets (it is common to have over 10 000 features), non-standard label names (see the standardization preprocessing that we performed for the LabelMe dataset in Section 4.6) and missing data. The problem of missing data is intimately linked to the task of constructing semantically-enriched numeric representations for images and to the assumption of complete labeling that we made at the beginning of this chapter. Under complete labeling, if a label is not present, then the object it denotes is absent (in the case of object labeling in images). This assumption supposes binary labeling, where true means presence and false means absence. In the case of incomplete labeling, the absence of a label might also mean that the user forgot, or chose not, to label the given image/document. Therefore, a value of false is no longer a sure indicator of the absence of the given object. For example, when a user labeling an image depicting a cascade has a choice between water, cascade or both, he/she might choose only cascade, as it is the most specific. This adds new challenges for the feature construction algorithm, since the co-occurrence of water and cascade is no longer present. Similarly, created features which are absent for all individuals can no longer be simply removed, since it is not certain whether the objects are really missing or simply their presence was not marked by the user. Other planned developments include taking into account non-linear correlations between variables, by modifying the metric of the search and the co-occurrence measure. We also consider converting the generated features to Disjunctive Normal Form for easier reading, and suppressing features that have a low support in the dataset. This would reduce the size of the feature set by removing rare features, but it would introduce new difficulties, such as detecting nuggets.

A related problem A somewhat related class of problems is multi-label classification [Tsoumakas & Katakis 2007]. These algorithms work in another paradigm (supervised learning) and tackle another learning task (learning from a dataset with multiple class variables, aiming to differentiate between classes and to predict the class of unseen examples). The connection with our learning problem is how these solutions tackle the co-occurrence between the class variables (i.e., when an individual has two labels attached). In our work we do not relate to or compare with these approaches; we mention them here for completeness.

Solutions to the problem of multi-label classification usually fall into two categories [Tsoumakas & Katakis 2007]: (a) problem transformation methods and (b) algorithm adaptation methods. (a) Problem transformation methods transform the multi-label classification problem into one or more single-label classification or regression problems. For example, protein classification using machine learning algorithms is studied in [Diplaris et al. 2005]: for proteins that belong to several classes, new classes are constructed using boolean conjunctions: c′ = ci AND cj. (b) Algorithm adaptation methods extend specific learning algorithms in order to handle multi-label data directly. For example, [Clare & King 2001] adapted the C4.5 algorithm for multi-label data: they modified the formula of entropy in order to take into account the relative frequency of each class, and they allow multiple labels in the leaves of the tree.

While showing some success in predicting multiple labels, this family of methods does not tackle the same task as the one stated in Section 4.1: they do not deal with the co-occurrence in the description space. They are mostly interesting at the level of the chosen approach and for their similarity towards the construction of new class attributes/labels. Nevertheless, even this construction is limited in the case of multi-label classification: (a) negations cannot exist and, because of this, (b) not all boolean formulas can be created (the AND operator alone is not a complete operator set); furthermore, (c) only very simple conjunctions of the initial labels can be created: they construct only conjunctions of two initial labels, which are not further evolved into more complex formulas.

Most of the work presented in this chapter was published in the international Journal of Intelligent Information Systems [Rizoiu et al. 2013a].

Chapter 5 Dealing with images: Visual Vocabulary Construction

Contents
5.1 Learning task and motivations . . . . . 97
    5.1.1 An overview of our proposals . . . . . 99
    5.1.2 Constructing a baseline “bag-of-features” image numerical description . . . . . 100
5.2 Context and related work . . . . . 101
    5.2.1 Sampling strategies and numerical description of image features . . . . . 102
    5.2.2 Unsupervised visual vocabulary construction . . . . . 103
    5.2.3 Leveraging additional information . . . . . 103
5.3 Improving the BoF representation using semantic knowledge . . . . . 106
    5.3.1 Dedicated visual vocabulary generation . . . . . 107
    5.3.2 Filtering irrelevant features . . . . . 108
5.4 Experiments and results . . . . . 110
    5.4.1 Experimental protocol . . . . . 111
    5.4.2 The learning task: content-based image classification . . . . . 112
    5.4.3 Datasets . . . . . 113
    5.4.4 Qualitative evaluation . . . . . 113
    5.4.5 Quantitative evaluation . . . . . 114
    5.4.6 Overfitting . . . . . 119
    5.4.7 Influence of parameter α . . . . . 120
5.5 Conclusions and future work . . . . . 121

5.1 Learning task and motivations

In this chapter, we address one of our core research challenges: leveraging semantics when dealing with complex data. Images are one of the most widely encountered types of complex data. Therefore, in this work we are interested in how to leverage external information in order to construct a semantic-aware numeric representation for images. The most prevalent learning task involving images is content-based image classification. This is a difficult task, especially because the low-level features used to digitally describe images usually capture little information about the semantics of the images. In our work, presented schematically in Figure 5.1, we tackle this difficulty by enriching the semantic content of the image representation using external knowledge. The underlying hypothesis of our work is that creating a more semantically rich representation for images yields higher machine learning performances, without the need to modify the learning algorithms themselves. This idea is similar to the similarity-based approaches in the semi-supervised clustering literature (presented in Section 2.2.1, p. 24), which introduce knowledge into unsupervised algorithms by modifying the distance measure used to judge the similarity of individuals and, afterwards, running the unmodified unsupervised algorithm. As a learning task, we apply our proposition to content-based image classification and we show that semantically enriching the image representation yields higher classification performances.


Figure 5.1 – Streamlined schema of our work presented in this chapter: leveraging external knowledge to construct a semantically-enriched numeric description space for images.

The purpose of our work It is noteworthy that semantics, and not the task of content-based image classification itself, is the focus of our work with images. This point of view is crucial for the rest of this chapter. The content-based image classification literature provides many examples (some of which are mentioned in Section 5.2) of systems which achieve good results. Our objective is not to compare with these approaches or to show the superiority of our methods on well-known image benchmarks. Likewise, we do not propose a new image representation system. The objective of our work is to show how embedding semantics into an existing image representation can be beneficial for a learning task, in this case image classification. Starting from a baseline image representation (described in Section 5.1.2), we propose two algorithms that make use of external information, in the form of non-positional tags, to enrich the semantics of the image representation. We use both the baseline representation and our semantically improved representation in an image classification task and we show that leveraging semantics consistently provides higher scores.

Context and motivations Image creation and sharing pre-date written (textual) sources: some of the oldest cave paintings go as far back as approximately 40,000 years ago. But the true explosion and large-scale production of image data started in modern days, with the maturing of image acquisition, storage, transmission and reproduction devices and techniques. At the same time, the Web 2.0 allowed easy image sharing and, recently, even search capabilities (e.g., Instagram 1, Flickr 2). Social networks rely heavily on image sharing (which sometimes may pose privacy problems, but such a discussion is outside the scope of this document).

1. http://instagram.com/
2. http://www.flickr.com/

Because of the sheer volume of created images, automatic summarization, search and classification methods are required. The difficulty in analyzing images comes from the fact that the digital image format does not embed the needed semantic information. For example, images acquired using a digital photo camera are most often stored in a raster format, based on pixels. A pixel is an atomic image element with several characteristics, the most important being its size (as small as possible) and its color; other information can include color coding, alpha channel etc. An image is therefore stored numerically as a matrix of pixels. The difficulty arises from the fact that low-level features, such as the position and color of individual pixels, do not capture much information about the semantic content of the image (e.g., shapes, objects). To address this issue, multiple representation paradigms have been proposed, some of which are presented in Section 5.2. The one showing the most promising results is the “bag-of-features” representation, inspired from the textual “bag-of-words” representation (detailed later in Section 6.2, p. 128). The remainder of the chapter is structured as follows: the rest of this section presents an overview of our proposals (in Section 5.1.1) and how to construct a baseline “bag-of-features” image description (in Section 5.1.2). In Section 5.2, we present a brief overview of how to construct a numerical image representation, concentrating on some of the state-of-the-art papers that relate to visual vocabulary construction and knowledge injection into the image representation. Section 5.3 explains the two proposed approaches, followed, in Section 5.4, by the experiments that were performed. Some conclusions are drawn and future work perspectives are given in Section 5.5.

5.1.1 An overview of our proposals

The focus of our work is embedding semantic information into the construction of the image numerical representation. The external information is in the form of non-positional labels, which signal the presence in the image of an object (e.g., car, motorcycle) or give information about the context of the image (e.g., holiday, evening), but do not give any information about its position in the image (in the case of objects). Furthermore, the labels are available only for a part of the image collection, therefore positioning our work in a semi-supervised learning context. Our work is focused on the visual vocabulary construction (also referred to in the literature as the codebook or the model). In the “bag-of-features” (BoF) representation, the visual words serve a role similar to that of the actual words in the “bag-of-words” representation. We propose two novel contributions that leverage external semantic information and that allow the visual vocabulary to capture more accurately the semantics behind a collection of images. The first proposal introduces the provided additional information early in the creation of the visual vocabulary. A dedicated visual vocabulary is constructed starting from the visual features sampled from the images labeled with a given label; such a vocabulary therefore contains visual words adapted to describing the object denoted by that label. In the end, the complete visual vocabulary is created by merging the dedicated vocabularies. In the second proposal, we add a filtering phase as a pre-processing of the visual vocabulary construction. For any given image, we construct a known positive set (images labeled with the same labels as the given image) and a known negative set (images that do not share any labels with the given image).

If a visual feature sampled from the target image is more similar to the features in the known negative set than to the features in the known positive set, then there are high chances that it does not belong to the objects denoted by the labels of the given image, and it can therefore be eliminated. This reduces the influence of irrelevant features in the vocabulary construction and increases the accuracy of the classification process. The two approaches are combined into a visual vocabulary construction technique and shown to consistently provide better performances than the baseline technique presented in Section 5.1.2.
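A minimal sketch of this filtering idea (illustrative only; the nearest-neighbour rule and the Euclidean distance used here are assumptions, not necessarily the exact criterion detailed in Section 5.3.2) could be:

    # Illustrative sketch: discard a sampled feature when its nearest neighbour
    # among the "known negative" features is closer than its nearest neighbour
    # among the "known positive" features.
    import numpy as np

    def keep_feature(feature, positive_feats, negative_feats):
        d_pos = np.min(np.linalg.norm(positive_feats - feature, axis=1))
        d_neg = np.min(np.linalg.norm(negative_feats - feature, axis=1))
        return d_pos <= d_neg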

5.1.2 Constructing a baseline “bag-of-features” image numerical description

The “bag-of-features” (BoF) representation [Csurka et al. 2004, Zhang et al. 2007] is an image representation inspired from the “bag-of-words” (BoW) textual representation, which is detailed in Section 6.2 (p. 128). The BoW representation is an orderless document representation, in which each document is depicted by a vector of frequencies of words over a given dictionary. BoF models have proven to be effective for object classification [Csurka et al. 2004, Willamowski et al. 2004], unsupervised discovery of categories [Fei-Fei & Perona 2005, Quelhas et al. 2005, Sivic et al. 2005] and video retrieval [Sivic & Zisserman 2003, Chavez et al. 2008]. For object recognition tasks, local features play the role of “visual words”, being predictive of a certain “topic” or object class. For example, a wheel is highly predictive of a bike being present in the image. If the visual dictionary contains words that are sufficiently discriminative when taken individually, then it is possible to achieve a high degree of success for whole-image classification. The identification of the object class contained in the image is possible without attempting to segment or localize that object, simply by looking at which visual words are present, regardless of their spatial layout. Overall, there is an emerging consensus in recent literature that BoF methods are effective for image description [Zhang et al. 2007].

[Figure 5.2: an image dataset goes through four phases: (1) feature sampling, (2) feature description, (3) visual vocabulary construction and (4) assignment of features to visual words, resulting in the “bag-of-features” representation.]

Figure 5.2 – Construction flow of a “bag-of-features” numerical representation for images

Baseline construction Typically, constructing a BoF image representation is a four-phase process, as shown in Figure 5.2. Starting from a collection P containing n images, the purpose is to translate the images into a numerical space in which the learning algorithm is efficient. In phase 1, each image pi ∈ P is sampled and li patches (features) are extracted 3. Many sampling techniques have been proposed, the most popular being dense grid sampling [Fei-Fei & Perona 2005, Vogel & Schiele 2007] and salient keypoint detectors [Csurka et al. 2004, Fei-Fei & Perona 2005, Sivic et al. 2005].

3. li depends on the content of the image (number of objects, shapes etc.) and on the extraction algorithm used. It can vary from a couple of hundred features up to several tens of thousands.


Figure 5.3 – Examples of features corresponding to the visual words associated with “wheel” (in red) and “exhaust pipe” (in green).

In phase 2, using a local descriptor, each feature is described by an h-dimensional vector 4. The SIFT [Lowe 2004] and SURF [Bay et al. 2006] descriptors are popular choices. Therefore, after this phase, each image pi is numerically described by Vi ⊂ R^h, the set of h-dimensional vectors describing the features sampled from pi.

Based on these numeric features, in phase 3, a visual vocabulary is constructed using, for example, one of the techniques presented in Section 5.2.2. This is usually achieved by clustering the described features, the usual choice being the K-Means clustering algorithm, for its linear execution time, required by the high number of features. The visual vocabulary is a collection of m visual words, which are described in the same numerical space as the features and which serve as the basis of the numerical space into which the images are translated. More precisely, the centroids created by the clustering algorithm serve as visual words. In clustering, centroids are the abstractions of a group of documents, summarizing what the documents have in common. In the above example, all the visual features extracted from the region of an image depicting the wheel of a bike will be regrouped into one or several clusters. The centroid of each cluster represents a visual word, which is associated with the wheel. In Figure 5.3, we depict three examples of images portraying bikes. In each image, we highlight three features: two corresponding to visual words associated with “wheel” and one corresponding to a visual word associated with “exhaust pipe”.

In phase 4, each sampled feature is assigned to a visual word. Similarly to the BoW numerical description of texts, each image is described as a distribution over the visual words, using one of the term weighting schemes (e.g., tf, tf×idf) described in Section 6.2 (p. 128). In the previous example, the distribution vector associated with each of the images in Figure 5.3 has a high count for the visual words associated with “wheel”, “exhaust pipe” and “saddle”. The resulting numerical description can then be used for classification, information retrieval or indexing tasks.

4. e.g., for the SIFT descriptor, h = 128.
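To make phases 3 and 4 concrete, the following Python sketch builds a visual vocabulary by K-Means clustering and turns each image into a normalized tf histogram over the visual words. It is a minimal illustration, assuming the features have already been extracted and described as numerical arrays and that NumPy and scikit-learn are available; the function names are ours.

import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(feature_sets, m, seed=0):
    # Phase 3: cluster all sampled features; the m centroids are the visual words.
    all_features = np.vstack(feature_sets)               # shape (sum of li, h)
    km = KMeans(n_clusters=m, n_init=1, random_state=seed).fit(all_features)
    return km.cluster_centers_                            # shape (m, h)

def bof_histogram(features, vocabulary):
    # Phase 4: assign each feature to its nearest visual word, then build a tf vector.
    d = ((features[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    words = d.argmin(axis=1)                              # index of nearest visual word
    tf = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return tf / max(tf.sum(), 1.0)                        # normalized term frequencies

# Usage: feature_sets is a list with one (li, h) array of descriptors per image.
# vocabulary = build_vocabulary(feature_sets, m=1000)
# X = np.array([bof_histogram(V, vocabulary) for V in feature_sets])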

5.2 Context and related work

Over the past decades, the computer vision domain has attracted considerable interest from the research community. Its applications extend beyond image analysis and include augmented reality, robotic vision, gesture recognition etc. Nevertheless, in the context of Internet-originating images, one of the prevailing tasks is content-based image classification. Some of the initial image classification systems used color histograms [Swain & Ballard 1991] for image representation. Such a representation does not retain any information about the shapes of objects in images and obtains moderate results. Other systems [Haralick & Shanmugam 1973, de Medeiros Martins et al. 2002, Varma & Zisserman 2003, Lazebnik et al. 2003b] rely on texture detection. Texture is characterized by the repetition of basic elements, or textons. For stochastic textures, it is the identity of the textons, not their spatial arrangement, that matters. The BoF orderless representation has imposed itself as the state of the art in image representation for classification and indexation purposes. The process of constructing the representation includes sampling the image (phase 1 in Figure 5.2), describing each feature using an appearance-based descriptor (phase 2), constructing a visual vocabulary (phase 3) and describing images as histograms over the visual words (phase 4). The remainder of this section presents a brief overview (i) of the sampling strategies and numerical descriptors for image keypoints present in the literature (in Section 5.2.1) and (ii) of the visual vocabulary construction techniques, concentrating on how external information can be used to improve the vocabularies' representativeness (in Section 5.2.2).

5.2.1 Sampling strategies and numerical description of image features

Image sampling methods Image sampling for the BoF representation is the process of deciding which regions of a given image should be numerically described. In Figure 5.2, it corresponds to phase 1 of the construction of a BoF numerical representation. The output of feature detection is a set of patches, identified by their locations in the image and their corresponding scales and orientations. Multiple sampling methods exist [O'Hara & Draper 2011], including Interest Point Operators, Visual Saliency and random or dense grid sampling. Interest Point Operators [Lowe 1999, Kadir & Brady 2001] seek patches that are stable under minor affine and photometric transformations. Interest point operators detect locally discriminating features, such as corners, blob-like regions, or curves. A filter is used to detect these features, measuring the responses in a three-dimensional space. Extreme values of the responses are considered as interest points. A popular choice is the Harris-Affine detector [Mikolajczyk & Schmid 2004], which uses a scale space representation with oriented elliptical regions. Visual Saliency [Frintrop et al. 2010] feature detectors are based on biomimetic computational models of the human visual attention system. Less used in the BoF literature, these methods are concerned with finding locations in images that are visually salient. In this case, fitness is often measured by how well the computational methods predict human eye fixations recorded by an eye tracker. Some works [Sivic & Zisserman 2003] argue that interest point-based patch sampling, while useful for image alignment, is not adapted to image classification tasks. Examples are city images, for which the interest point detector does not consider relevant most of the concrete and asphalt surroundings, which are nonetheless good indicators of the images' semantics. Some approaches sample patches by using random sampling [Maree et al. 2005]. [Nowak et al. 2006] compare a random sampler with two interest point detectors: Laplacian of Gaussian [Lindeberg 1993] and Harris-Laplace [Lazebnik et al. 2003a]. They show that, when using enough samples, random sampling exceeds the performance of interest point operators. Spatial Pyramid Matching, proposed in [Lazebnik et al. 2006], introduces spatial information into the orderless BoF representation by creating a pyramid representation, in which each level divides the image into increasingly smaller regions. A feature histogram is calculated for each of these regions. The distance between two images under this spatial pyramid representation is a weighted histogram intersection function, where the weights are largest for the smallest regions.
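As an illustration of this weighting scheme, the sketch below computes the pyramid match kernel between two images from their per-level histograms, following the formulation in [Lazebnik et al. 2006]; the function names and the list-of-levels data layout are our own illustrative choices.

import numpy as np

def histogram_intersection(h1, h2):
    # Sum of the element-wise minima of two (concatenated per-cell) histograms.
    return np.minimum(h1, h2).sum()

def spatial_pyramid_kernel(hists1, hists2):
    # hists1[l] and hists2[l] hold the concatenated per-cell histograms of
    # pyramid level l (level 0 = whole image, level l = a 2^l x 2^l grid).
    L = len(hists1) - 1
    I = [histogram_intersection(hists1[l], hists2[l]) for l in range(L + 1)]
    # K = I_L + sum_{l < L} (I_l - I_{l+1}) / 2^(L - l): finer levels weigh more.
    k = I[L]
    for l in range(L):
        k += (I[l] - I[l + 1]) / 2 ** (L - l)
    return k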

Feature descriptors With the image sampled and a set of patches extracted, the next question is how to numerically represent the neighborhood of pixels around a localized region. In Figure 5.2, this corresponds to phase 2 of the construction of a BoF numerical representation. Initial feature descriptors simply used the pixel intensity values, scaled to the size of the region. The normalized pixel values have been shown [Fei-Fei & Perona 2005] to be outperformed by more sophisticated feature descriptors, such as the SIFT descriptor. The SIFT (Scale Invariant Feature Transform) [Lowe 2004] descriptor is today's most widely used descriptor. The responses to 8 gradient orientations at each of the 16 cells of a 4x4 grid generate the 128 components of the description vector. Alternatives have been proposed, such as the SURF (Speeded Up Robust Features) [Bay et al. 2006] descriptor. The SURF algorithm contains both feature detection and description. It is designed to speed up the process of creating features similar to those produced by a SIFT descriptor on Hessian-Laplace interest points by using efficient approximations.
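For readers who wish to reproduce phases 1 and 2, the following sketch extracts SIFT keypoints and descriptors with OpenCV. This is only an assumed setup: it relies on OpenCV 4.4 or later, where SIFT is available in the main module, and OpenCV's SIFT uses a difference-of-Gaussians detector rather than the Hessian-Affine detector employed in our experiments (Section 5.4.1).

import numpy as np
import cv2  # assumes OpenCV >= 4.4

def extract_sift_features(image_path):
    # Phase 1 (keypoint detection) and phase 2 (128-d SIFT description) in one call.
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(image, None)
    if descriptors is None:          # no keypoint found in the image
        descriptors = np.empty((0, 128), dtype=np.float32)
    return keypoints, descriptors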

5.2.2 Unsupervised visual vocabulary construction

The visual vocabulary is a mid-level transition key between the low-level features and a high-level representation. It is a prototypic representation of features that are discriminative in a classification context. The visual vocabulary is used to reduce dimensionality and to create a fixed-length numerical representation for all images 5. Most BoF approaches use clustering to create the visual vocabulary, usually the K-Means algorithm [Sivic & Zisserman 2003, Lazebnik et al. 2006, Jiang et al. 2007]. K-Means is used because it produces centroids, which are prototypes of similar features in the same cluster. Its linear execution time is a plus considering the high volume of individuals to be processed. Some authors [Jurie & Triggs 2005] argue that, in K-Means, centroids are attracted by dense regions and under-represent sparser, but equally informative, regions. Therefore, methods were proposed for allocating centers more uniformly, inspired by mean shift [Comaniciu & Meer 2002] and on-line facility location [Meyerson 2001]. Other visual vocabulary construction techniques do not rely on K-Means. For example, [Moosmann et al. 2007] use an Extremely Randomized Clustering Forest, an ensemble of randomly created clustering trees.

5. The number of extracted features can vary greatly depending on the image and the method used for sampling.

This technique provides good resistance to background clutter, but its main advantage over K-Means is the faster training time. One of the most important parameters in the construction of the visual vocabulary is its dimension, which has a strong impact on both performance and computational complexity [Csurka et al. 2004, Jurie & Triggs 2005]. It has been shown [Jiang et al. 2007, López-Sastre et al. 2010, Nowak et al. 2006] that a large vocabulary may lead to overfitting for construction techniques based on interest point detection. As our experiments show (in Section 5.4.6), even a random vocabulary (in which a number of features are randomly chosen to serve as visual words) can lead to overfitting if its dimension is high enough.

5.2.3 Leveraging additional information

The BoF representation yields surprisingly good results for image classification and indexing. This is because there is an intrinsic relation between the “quantity” of semantic information captured by the description space and the performances of machine learning algorithms (e.g., in a classification task, the separability of individuals in the description space is crucial). Therefore, one direction to further improve results is to construct new representations that capture even more semantics from the raw image data. Another direction, the one that we privilege in our work, is to use external information to further enrich the semantic content of the constructed representation. In the case of Internet-originating images, precious information is given either by the textual context of images (e.g., titles, descriptions etc.), or by labels attached to the images (e.g., on social network websites, users have the option to label the presence of their friends in images). Of course, the literature presents approaches that leverage other resources to semantically enrich the image representation (e.g., [Athanasiadis et al. 2005] propose a system that links low-level visual descriptors to high-level, domain-specific concepts in an ontology). In Section 2.1.2 (p. 13), we have already presented a number of learning tasks that can benefit from using image and textual data simultaneously. In the following paragraphs, we detail some of the methods present in the literature that use additional information, under the form of text or labels, to improve image classification results, and we position our work relative to these approaches.

Leveraging the image’s textual context In [Morsillo et al. 2009], the text that comes alongside the images is used to improve the visual query accuracy. A BoF representation for images is created as shown in Section 5.1.2, with the exception that color information is also added to the keypoint description. An 11-dimension vector coding the color information of the sampled patches is added to the 128-dimension vector generated by the SIFT. The text that surrounds the images in the web pages is used to extract topics, using LDA [Blei et al. 2003]. The inferred topics are, afterwards, used to describe the textual information (therefore functioning as a dimension reduction technique). The textual and the image data are used together to estimate the parameters of a probabilistic graphical model, which is trained using a small quantity of labeled data. Another approach that uses the text accompanying images originating from the Internet is presented in [Wang et al. 2009]. An auxiliary collection of Internet-originating images, with text attached, is used to create a 5.2. Context and related work 105 textual description of a target image. Images are described using three types of features: the SIFT features, the GIST features [Oliva & Torralba 2001] and local patch color information. For each test image, the K most similar images (in terms of visual features) are identified in the auxiliary collection. The text associated with these near neighbor images is summarized to build the text feature. The label of each image is considered as a unit (i.e., a whole phrase is considered as an item) and the text feature is constructed as a normalized histogram over labels. A text classifier and a visual classifier are trained and the outputs of the two classifiers are merged for a more accurate description of the photo. [Mooney et al. 2008] use co-training [Blum & Mitchell 1998] to construct a classifier starting from textual and visual data. Text is described using a BoW representation, whereas images are described using region-based features. Each image is divided into a number of regions of fixed dimension (4-by-6 pixels), which are described using texture and color features. Co-training is a semi- supervised classification technique, which first learns a separate classifier for textual data and image data, using any labeled examples. The most confident predictions of each classifier on the unlabeled data are then used to iteratively construct additional labeled training data, and the classifiers are re-trained.

Leveraging external semantic knowledge Other solutions rely on external expert knowledge in order to guide the visual vocabulary construction. This knowledge is most often expressed under the form of class/category annotations or labels (e.g., signaling the presence of an object inside an image) or semantic resources, such as WordNet. [Zhang et al. 2009] use an iterative boosting-like approach. Each iteration of boosting begins by learning a visual vocabulary according to the weights assigned by the previous boosting iteration. The resulting visual vocabulary is then applied to encode the training examples, a new classifier is learned and new weights are computed. The visual vocabulary is learned by clustering, using K-Means, a “learning” subset of image features. Features from images with high weights have more chances of being part of the learning subset. To classify a new example, the AdaBoost [Freund & Schapire 1997] weighted voting scheme is used. [Perronnin et al. 2006] construct both a generic vocabulary and a specific one for each class. The generic vocabulary describes the content of all the considered classes of images, while the specific vocabularies are obtained through the adaptation of the universal vocabulary using class-specific data. Any given image can, afterwards, be described using the generic vocabulary or one of the class-specific vocabularies. A semi-supervised technique [Ji et al. 2010], based on Hidden Markov Random Fields, uses local features as observed fields and semantic labels as hidden fields and employs WordNet to make correlations. Some works [Fulkerson et al. 2008, Hsu & Chang 2005, Lazebnik & Raginsky 2009, Winn et al. 2005] use the mutual information between features and class labels in order to learn class-specific vocabularies, by merging or splitting initial visual words quantized by K-Means. Another work [Liu et al. 2009] presents an algorithm for learning a generic visual vocabulary, while trying to preserve and use the semantic information in the form of a point-wise mutual information vector. It uses the diffusion distance to measure intrinsic geometric relations between features. Other approaches [Marszałek & Schmid 2006] make use of label positioning in the images to distinguish between foreground and background features. They use weights for features, higher for the ones corresponding to objects and lower for the background.

Our positioning In the methods presented earlier, we identify several approaches to improving the results of classification algorithms: (a) improving image representation semantics by combining multiple types of visual features (e.g., SIFT, color, texture etc.; no external information is leveraged), (b) modifying the classification algorithm to take into account the text/label information (usually by training separate classifiers (i) for text and image or (ii) based on each label), (c) training and using multiple vocabularies to describe an image and (d) making use of positional labels to filter features unlikely to be relevant. Positional labels are labels in which the position of the objects in images is known, in addition to their presence. This kind of labeling is usually more costly to perform than non-positional labeling. Our proposals deal with leveraging external information to enrich the semantics of the image representation. The additional information is taken into account at the level of the representation construction. We do not modify the learning algorithm; our proposals are therefore compatible with existing classification algorithms. Our proposals can be classified under the previous point (c), since we construct multiple dedicated visual vocabularies. The original contribution is that we propose a filtering algorithm that removes features unlikely to be relevant for a given object, using only non-positional labels.

5.3 Improving the BoF representation using semantic knowledge

In this section, we present two novel methods that leverage external semantic information, under the form of non-positional object labels, into the visual vocabulary construction. Our work is positioned in a weakly supervised context, similar to the one defined by [Zhang et al. 2007]. Each label signals the presence of a given object in an image, but not its position or boundaries. Our approaches use the semantic information to increase the relevancy of the visual vocabulary. Our first approach follows an idea similar to some of the systems already presented in Section 5.2.2. For each label, we construct a dedicated visual vocabulary, based only on the images carrying that label. Such approaches have been shown [Perronnin et al. 2006, Jianjia & Limin 2011] to improve accuracy over a general-purpose vocabulary, since specialized vocabularies contain visual words that more appropriately describe the objects appearing in the image collection. In our second approach, we further improve accuracy by proposing a novel pre-processing phase, which filters out features that are unlikely to belong to the respective object. Our filtering proposal follows the framework of the object recognition algorithm proposed in [Lowe 2004] and uses a positive and a negative example set, constructed based on the labels. The filtering pre-processing is combined with the dedicated visual vocabulary construction, and we show in Section 5.4 that this approach consistently achieves higher accuracy than both a dedicated vocabulary (with no filtering) and a general-purpose vocabulary.

Including semantic knowledge The semantic knowledge is presented under the form of a collection T of k labels, T = {ti | i = 1, 2...k}. Each label is considered to denote an object in the image (e.g., a car, a person, a tree), but no positional markers are available. We make the assumption that the objects denoted by labels do not overlap in the images and that their appearances in the dataset are not correlated (e.g., if a car appears, it does not necessarily mean that there is a person next to it). While these are strong assumptions, we will discuss ways of relaxing them in Section 5.5. Furthermore, as in our work concerning unsupervised feature construction, we consider the labeling to be complete (i.e., if an image does not have a given label, then the object does not appear in the image). We discuss in further detail the effects of incomplete labeling after presenting our proposals, in Section 5.3.2 and in Chapter 8. Only a fraction of the image dataset is labeled and we use both labeled and unlabeled images to construct the semantic-aware representation, therefore positioning our work in the domain of semi-supervised learning. We denote by P the input collection, having n images. n1 images are labeled, thus forming the labeled set (P1), while the remaining images have no labels. The a priori label information is presented in the form of a boolean matrix Y ∈ {0, 1}^(n1×k), having n1 lines and k columns, so that yi,j = 1 if image pi ∈ P1 is labeled using tj, and yi,j = 0 otherwise.
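As a concrete illustration of this notation, the sketch below builds the boolean matrix Y from a list of per-image label sets; the function and variable names are ours and the labels in the usage example are hypothetical.

import numpy as np

def build_label_matrix(image_labels, label_set):
    # Y in {0,1}^(n1 x k): Y[i, j] = 1 iff labeled image i carries label t_j.
    labels = sorted(label_set)                    # fix an ordering t_1 ... t_k
    index = {t: j for j, t in enumerate(labels)}
    Y = np.zeros((len(image_labels), len(labels)), dtype=int)
    for i, tags in enumerate(image_labels):
        for t in tags:
            Y[i, index[t]] = 1
    return Y, labels

# Usage:
# Y, labels = build_label_matrix([{"motorbike"}, {"motorbike", "person"}, {"city"}],
#                                {"motorbike", "person", "city"})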

5.3.1 Dedicated visual vocabulary generation

The idea behind the BoF representation is that the visual words are predictive of certain objects (as seen in Section 5.1.2). The quality of the visual words (and their predictive power) would be enhanced if they were constructed only from the features extracted from the respective objects. This would eliminate the background-originated features and the features belonging to other objects. In a weakly supervised context, the object boundaries are unknown, but selecting only the images that contain a certain object increases the relevant/noise feature ratio. Consequently, the resulting visual words are more accurate descriptions of the objects denoted by the labels. We propose to construct a dedicated visual vocabulary for each label ti ∈ T, starting only from features extracted from the images labeled with ti. The proposed method is presented in Algorithm 4. We make no assumptions about the number of visual words needed to describe each object and, therefore, visual words are distributed equally among objects. We construct k dedicated vocabularies, each one containing m/k visual words. Other division techniques can be imagined and are part of the perspectives of our work. Each dedicated vocabulary is created in the standard BoF approach, shown in Section 5.1.2. For a given label ti, we create Ci, the collection of all the features extracted from images labeled with ti. Formally:

Ci = ∪ { Vj | j = 1, 2..n1 and yj,i = 1 }

where Vj is the set of numerically described features sampled from image pj. The function choose_features_at_random is used to initialize the dedicated vocabulary Mi with m/k features randomly picked from Ci. The function ameliorate_using_K-Means evolves the visual vocabulary Mi by clustering the features in Ci around the visual words, using the K-Means algorithm.

Algorithm 4 Dedicated vocabulary generation algorithm.
Input: C = {Vi | i = 1, 2..n1} - set of features sampled from labeled images
Input: Y ∈ {0, 1}^(n1×k) - image/label association matrix
Input: m - the dimension of the visual vocabulary M
Output: the visual vocabulary M having m visual words
// for each label
for i = 1 to k do
    mi ← m/k // size of the dedicated vocabulary
    Ci ← ∪ { Vj | j = 1, 2..n1 and yj,i = 1 } // set of features in images labeled with ti
    // construct dedicated visual vocabulary Mi
    Mi ← choose_features_at_random(mi, Ci)
    Mi ← ameliorate_using_K-Means(Mi, Ci)
// merge the dedicated visual vocabularies
M ← ∅
for i = 1 to k do
    M ← concatenate_vocabularies(M, Mi)

The Euclidean distance is used to measure the similarity of the numeric descriptions of two features. The set of resulting visual words represents more accurately the object denoted by the label ti. At the end of the algorithm, the concatenate_vocabularies function merges the dedicated vocabularies Mi, i = 1, 2..k into the general visual vocabulary M. This ensures that the generated visual vocabulary contains visual words which describe all the objects denoted by the labels in T.

Temporal complexity Algorithm 4 has a linear execution time, if we consider that matrix operations are indivisible and executed in O(1), which is the case in modern vectorial mathematical environments. Since we execute K-Means k times, the temporal complexity is noiter × k × O(m/k × nti), where nti is the number of images labeled with ti and noiter is the number of performed iterations (usually bounded, thus ignored in practice). This leads to a theoretical complexity of O(m × n), equal to that of K-Means.
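The Python sketch below mirrors Algorithm 4, with scikit-learn's K-Means standing in for the ameliorate_using_K-Means step and a random subset of Ci standing in for choose_features_at_random; the library choice and initialization details are our own assumptions, not part of the algorithm itself.

import numpy as np
from sklearn.cluster import KMeans

def dedicated_vocabulary(C, Y, m, seed=0):
    # C: list of (li, h) feature arrays for the n1 labeled images.
    # Y: (n1, k) boolean image/label matrix. Returns M, roughly m visual words.
    n1, k = Y.shape
    mi = m // k                                   # words allotted to each label
    rng = np.random.default_rng(seed)
    vocabularies = []
    for i in range(k):
        # C_i: all features sampled from images labeled with t_i
        Ci = np.vstack([C[j] for j in range(n1) if Y[j, i] == 1])
        # choose_features_at_random, then K-Means refinement around those words
        init = Ci[rng.choice(len(Ci), size=mi, replace=False)]
        km = KMeans(n_clusters=mi, init=init, n_init=1).fit(Ci)
        vocabularies.append(km.cluster_centers_)  # dedicated vocabulary M_i
    return np.vstack(vocabularies)                # concatenate_vocabularies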

5.3.2 Filtering irrelevant features

We propose a filtering mechanism in order to further increase the relevant/noise feature ratio in the dedicated vocabulary construction technique presented in the previous Section 5.3.1: we detect and filter the features that are unlikely to be related to the object denoted by a given label. Given an image pi ∈ P1, we construct two auxiliary image collections: the known positive set, which contains only images that are labeled identically to pi, and the known negative set, which contains images that do not share any labels with pi (given the complete labeling assumption). In practice, we limit the sizes of the known positive set and the known negative set to a maximum number of images, given by a parameter maxFiles. We define KPpi as the set of features sampled from images in the positive set and KNpi as the set of features sampled from the negative set:

KPpi = {f+ ∈ Vj | ∀ tl ∈ T for which yi,l = 1 =⇒ yj,l = 1}
KNpi = {f− ∈ Vj | ∀ tl ∈ T for which yi,l = 1 =⇒ yj,l = 0}


Figure 5.4 – (a) An image labeled “motorbike”, (b) an image from the known positive set and (c) an image from the known negative set

Consider a feature sampled from pi (f ∈ Vi) which is more similar to the features in the negative collection (f− ∈ KNpi) than to the ones in the positive collection (f+ ∈ KPpi). Such a feature has a higher chance of belonging to the background of pi rather than to the objects in the image. It can, therefore, be filtered. To measure the similarity of two features, the Euclidean distance is usually used: ||f1 − f2|| = √( Σ_{i=1..h} (f1,i − f2,i)² ). Formally, for a feature f sampled from an image pi:

f ∈ Vi is filtered ⇔ ∄ f+ ∈ KPpi so that ||f − f+|| ≤ δ, with δ = α × min_{f− ∈ KNpi} ||f − f−||    (5.1)

where δ is the filtering threshold and α ∈ R+ is a parameter which allows the fine tuning of the filtering threshold. The filtering threshold δ is defined as the distance from the feature f to the closest feature in the known negative set, scaled by the tuning parameter α. The influence of the parameter α on the effectiveness of the filtering is studied in Section 5.4.7. A feature f is considered similar to a feature f+ ∈ KPpi if and only if ||f − f+|| is lower than the filtering threshold. Therefore, the feature f is removed when it has no similar feature in the known positive set. Let us take the example of the image collection depicted in Figure 5.4. The images in Figures 5.4a and 5.4b are labeled “motorbike”, whereas the image in Figure 5.4c is labeled “city”. The target image in Figure 5.4a has buildings in the background, and any feature sampled from that region of the image would be irrelevant for the object motorbike. Figure 5.4b serves as the known positive set, while Figure 5.4c serves as the known negative set. We take the example of two features of the target image: f1, sampled from the wheel of the motorbike (shown in green), and f2, sampled from the buildings in the background (shown in red). For f1, at least one similar feature exists in the known positive set. For f2, no similar feature exists in the known positive set. f2 is, therefore, eliminated, as it is considered not relevant for the object motorbike. Algorithm 5 presents the proposed filtering algorithm. The algorithm has two parameters: maxFiles, which controls the maximum size of the KPpi and KNpi sets, and α, which controls how strict the filtering is. For each labeled image pi, the functions create_KP and create_KN are used to create the feature sets KPpi and KNpi, respectively.

Algorithm 5 Filtering irrelevant features.
Input: C = {Vi | i = 1, 2..n1} - set of features sampled from labeled images
Input: Y ∈ {0, 1}^(n1×k) - image/label association matrix
Parameter: α - parameter controlling the threshold
Parameter: maxFiles - controls the size of the known positive and known negative sets
Output: Vi^f, i = 1, 2..n1 - sets of the filtered features in each labeled image
// for each labeled image
for i = 1 to n1 do
    Vi^f ← ∅
    Ti ← {tj | yi,j = 1} // the labels of image pi
    KPpi ← create_KP(i, Ti, Y, C, maxFiles) // KnownPositive set
    KNpi ← create_KN(i, Ti, Y, C, maxFiles) // KnownNegative set
    // process each feature in current image pi
    for each f ∈ Vi do
        δ ← α × min_distance(f, KNpi)
        count ← count_similar(f, KPpi, δ)
        if count > 0 then
            Vi^f ← Vi^f ∪ {f}

The count_similar function counts how many features in KPpi have a distance to f lower than the filtering threshold δ. If there exists at least one such feature in the KPpi set, then f is added to Vi^f, the filtered feature set of pi.

Temporal complexity In Algorithm 5, for readability, the operations are presented for each feature f sampled from the image pi. In practice, in vectorial mathematical environments (e.g., Octave), matrix operations are unitary and considered as executed in O(1). Thus, the algorithm has a linear execution time O(n1 × maxFiles).
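A vectorized Python sketch of the core of Algorithm 5 (the test of Equation 5.1) is given below; it assumes SciPy's cdist for the Euclidean distances, and it leaves the construction of the known positive and known negative sets (create_KP, create_KN) outside its scope.

import numpy as np
from scipy.spatial.distance import cdist

def filter_features(Vi, KP, KN, alpha=1.0):
    # Keep a feature of image p_i only if the known positive set KP contains at
    # least one feature closer than delta = alpha * (distance to nearest negative).
    d_neg = cdist(Vi, KN)                  # (li, |KN|) Euclidean distances
    d_pos = cdist(Vi, KP)                  # (li, |KP|)
    delta = alpha * d_neg.min(axis=1)      # per-feature filtering threshold
    keep = (d_pos <= delta[:, None]).any(axis=1)
    return Vi[keep]                        # V_i^f, the filtered feature set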

Incomplete labeling In the proposed approaches, as well as in the experiments presented in Section 5.4, we make the assumption of complete labeling: if an object occurs in an image, then the image is guaranteed to have the corresponding label attached. In the case of incomplete labeling, an object might appear in an image p while the associated label t is not set for p. For the dedicated vocabulary construction, incomplete labeling has a limited impact, especially if the dataset is large enough. It only means that the image p is left out when constructing the vocabulary for label t. For the filtering proposal, missing labels mean that the image p has a chance of being selected for the known negative set of an image labeled with t. This translates into a very high filtering threshold. Still, this should not pose problems if the known positive set also contains images depicting the given object: a given feature needs to have only one similar feature in the known positive set to be considered representative for the object. Furthermore, considering that our algorithms are devised to work in a semi-supervised context, only a limited number of completely labeled images is required, which considerably reduces the manual labeling effort.

[Figure: Image Dataset → (1) Image Sampling → (2) Feature Description → (3) Visual Vocabulary Construction Techniques 1..N → (4) Assign Features to Visual Words → “Bag-of-features” representations → (5) Evaluation Algorithm → Evaluation Results 1..N]

Figure 5.5 – Schema for evaluating multiple visual vocabulary construction techniques.

5.4 Experiments and results

As already pointed out in Section 5.1, the focus of our work is enriching the semantics of the numerical representation of images. Therefore, the purpose of the experiments presented in this section is to compare the semantically-enriched representations created by our proposals to a standard baseline representation, created as described in Section 5.1.2. However, comparing the semantically-enriched and the baseline representations cannot be done directly, so they are evaluated in the context of a machine learning task, in this case content-based image classification. More precisely, given the fact that we perform the semantic injection at the level of the visual vocabulary construction, the experimental protocol outlined in Figure 5.5, and further detailed in Section 5.4.1, is designed to quantify the performance differences due only to the visual vocabulary construction. The evaluation is a five-phase process, out of which four phases (1, 2, 3 and 5) are identical for all techniques. The first four phases correspond to the BoF representation construction (see Figure 5.2, p. 100), while the last phase corresponds to the learning algorithm. We summarize hereafter each of the phases, which are detailed in the next sections:
– phase 1: image sampling, identical for all compared approaches;
– phase 2: feature numerical description of patches, identical for all compared approaches;
– phase 3: visual vocabulary construction, using the baseline approaches and our semantically-enriching approaches;
– phase 4: feature assignment to visual words, identical for all compared approaches;
– phase 5: learning algorithm, where each resulting representation is used with two classifiers (a clustering-based classifier and an SVM), identical for all compared approaches.

5.4.1 Experimental protocol

Starting from a given image dataset, we construct, for each image, four BoF representations corresponding to the four evaluated visual vocabulary construction techniques (in phase 3). The image sampling (phase 1), the feature description (phase 2) and the image description (phase 4) are performed each time using the same algorithms and with the same parameters. In the end, the performances of each obtained BoF representation are measured and compared in the context of a content-based image classification task (detailed in Section 5.4.2). The visual vocabulary construction phase is the only phase to vary between the different constructed representations. Therefore, we consider the classifier performance differences a direct consequence of the vocabulary construction.

The invariant phases 1, 2 and 4 In phase 1, images are sampled using a Hessian-Affine region detector and patches are described, in phase 2, using the SIFT descriptor [Lowe 2004]. We use the default parameters for these algorithms and we keep them unchanged during the experiments. The visual vocabulary is constructed in phase 3 using the construction technique to be evaluated. In phase 4, the final numerical representation is created, for each image, by associating features to visual words, using the tf term weighting scheme. To reduce the random component present in all the considered techniques, each construction is repeated 3 times and average results are presented.

Compared vocabulary construction techniques (phase 3) Four visual vocabulary construction techniques are evaluated: two classical techniques (random and random+km) and our proposals (model and filt+model). random constructs a random vocabulary (features are randomly chosen to serve as visual words). For random+km, we take the random features selected previously and we ameliorate them by using the ameliorate_using_K-Means function presented in Section 5.3.1. random+km is the baseline construction technique presented in Section 5.1.2. model is our proposal for dedicated vocabulary construction, presented in Algorithm 4. In filt+model, we apply the filtering technique presented in Algorithm 5 as a pre-processing phase before the dedicated vocabulary construction.

5.4.2 The learning task: content-based image classification

Each of the image representations obtained as shown in the previous sections is used in a content-based image classification task. Two classifiers, an SVM and a clustering-based classifier, are trained and evaluated on each representation, as described in the following paragraphs.

The SVM classifier [Cortes & Vapnik 1995] The SVM classifier evaluation respects the experimental setup recommended by the authors of the Caltech101 dataset 6. One of the challenges when evaluating in Data Mining is the disequilibrium between class cardinalities (usually it is the minority class that is of interest). This disequilibrium can cause errors in estimating the generalization error of the constructed model.

6. http://www.vision.caltech.edu/Image_Datasets/Caltech101/

Usually, the disequilibrium is the result of a certain reality in the population from which the sample was extracted (e.g., the population of sick individuals is a minority compared to the healthy population). But in the case of image datasets like Caltech101, the disequilibrium is only the result of the choices of its creators and reflects no reality that needs to be taken into account. We choose to equilibrate the classes before training the classifier, by randomly selecting 30 examples of each label to be part of the learning set. 15 images in the learning corpus are randomly selected to be part of the labeled set P1. We test on all remaining individuals, which means that the generalization error on majority classes will be better estimated. Evaluation indicators are calculated for each class and we report only the non-weighted averages. The process is repeated 10 times: we create 10 learning sets and the corresponding 10 testing sets. We report the average performances over the 10 executions. The results are expressed using the True Positive Rate, because this measure is usually used in the literature when reporting results on Caltech101 and RandCaltech101.
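As a sketch of this class-balancing protocol, the following function selects, for each class, 30 training images (15 of which keep their labels for the semi-supervised setting) and sends every remaining image to the test set; the function name and parameter defaults are illustrative only.

import numpy as np

def balanced_split(labels, n_train=30, n_labeled=15, seed=0):
    # labels: one class label per image in the dataset.
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train, labeled, test = [], [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        train.extend(idx[:n_train])        # 30 learning images per class
        labeled.extend(idx[:n_labeled])    # 15 of them form the labeled set P1
        test.extend(idx[n_train:])         # everything else is tested on
    return np.array(train), np.array(labeled), np.array(test)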

A clustering-based classifier The clustering-based evaluation task is inspired by the unsupervised information retrieval field and is based on clustering. A learning set of the image collection is clustered into a number of clusters and each cluster is assigned a label, using a majority vote. Each image in the test corpus is assigned to its nearest centroid and is given the predicted label of that cluster. Predicted labels are compared to the real labels and classical information retrieval measures (i.e., precision, recall, Fscore) are calculated. The evaluation of the clustering-based classifier is performed using a stratified holdout strategy. The images are divided into a learning corpus (67% of the images in each category) and a test corpus (33% of the images in each category). 50% of the images in the learning corpus are randomly selected to be part of the labeled set P1. For the rest, the labels are hidden. Images in the learning set are then clustered into nc clusters using K-Means. nc varies between 50 and 1000 (step 50) for Caltech101 and RandCaltech101 and between 3 and 90 (step 3) for Caltech101-3 (Caltech101-3 contains only 3 classes, see Section 5.4.3). To eliminate the effect of the disequilibrium between class sizes, we calculate and report the non-weighted averages of these indicators over labels. To measure the classification accuracy, we use the Fscore (the harmonic mean of precision and recall), a classical Information Retrieval measure. For each combination (vocabulary dimension, nc, vocabulary algorithm), the clustering and prediction phase is repeated 25 times, to eliminate the influence of the random initialization of the K-Means in the clustering-based classifier.
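A minimal sketch of this clustering-based classifier is given below, assuming scikit-learn's K-Means; the majority vote uses only the labeled part of the learning corpus, and the fallback for clusters containing no labeled image is our own simplification.

import numpy as np
from sklearn.cluster import KMeans

def clustering_based_classifier(X_learn, X_labeled, y_labeled, X_test, nc, seed=0):
    # Cluster the learning set, label each cluster by majority vote over the
    # labeled images it contains, then give each test image the label of its
    # nearest centroid.
    km = KMeans(n_clusters=nc, random_state=seed).fit(X_learn)
    labeled_clusters = km.predict(X_labeled)
    cluster_labels = np.empty(nc, dtype=y_labeled.dtype)
    for c in range(nc):
        members = y_labeled[labeled_clusters == c]
        pool = members if len(members) else y_labeled   # fallback for empty clusters
        values, counts = np.unique(pool, return_counts=True)
        cluster_labels[c] = values[counts.argmax()]
    return cluster_labels[km.predict(X_test)]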

5.4.3 Datasets

Experiments were performed on the Caltech101 [Fei-Fei et al. 2007] and RandCaltech101 [Kinnunen et al. 2010] datasets. Caltech101 contains 9144 images, most of them in medium resolution (300×300 pixels). It is a heterogeneous dataset, with 101 object categories plus a supplementary background category. Each object category is considered to be a label. The spatial positioning of objects is not used, thus placing ourselves in a weakly supervised context. Some authors argue that Caltech101 is not diverse enough and that backgrounds often provide more information than the objects themselves. RandCaltech101 is obtained from Caltech101 by randomly modifying the backgrounds and the posture (position, orientation) of objects. It has been shown [Kinnunen et al. 2010] that classification is more challenging on RandCaltech101 than on Caltech101. Because Caltech101 is an unbalanced dataset, with category sizes ranging from 31 to 800 images, we have taken 3 of the biggest categories (airplanes, Motorbikes and Faces_easy) and created another corpus, denoted Caltech101-3. It contains 2033 images. The advantage of the new corpus is that it provides many examples for each category and it is balanced category-wise. This allows us to study how our propositions behave on both balanced and unbalanced datasets.

5.4.4 Qualitative evaluation

Figure 5.6 – Example of images from “easy” classes (top row) and “difficult” classes (bottom row)

In a classification task, some classes are naturally easier to recognize than others. This happens when the numerical description is better adapted to translating them into a separable numerical space. On Caltech101, the best classification scores are almost invariably obtained by the same categories, independently of the choice of visual vocabulary construction algorithm or parameters. Figure 5.6 shows some examples of images belonging to “easy classes”, categories that obtain good classification scores (on the upper row), and examples of “difficult classes”, categories that obtain low scores (on the bottom row). The objects belonging to the “easy classes” either appear in the same posture in all examples or have a specific color pattern that makes them easily recognizable. Most of the examples of airplanes and garfield appear with the same shape, size and orientation. Other categories, like yin_yang, soccer_ball or dalmatian, have a specific white-black alternation pattern, which makes them easily recognizable even in the real world. By contrast, the objects depicted in pictures of “difficult classes”, like seahorse or butterfly, appear in different colors, in multiple postures and sometimes hidden in the background. We perform the same analysis on RandCaltech101. Table 5.1 presents a comparative view of the “easy classes” and “difficult classes” constructed for Caltech101 and RandCaltech101. We observe the high degree of overlap between the constructed sets: most of the “easy classes” in Caltech101 also appear as “easy” for RandCaltech101. Similarly, difficult classes on Caltech101 remain difficult on RandCaltech101. In Table 5.1, the only category that changes difficulty is metronome, which is an “easy class” in Caltech101

Table 5.1 – “Easy” classes and “difficult” classes in Caltech101 and RandCaltech101

“Easy” classes – Caltech101: airplanes, car_side, dalmatian, dollar_bill, Faces_easy, garfield, grand_piano, Leopards, metronome, Motorbikes, panda, scissors, snoopy, soccer_ball, stop_sign, tick, watch, windsor_chair, yin_yang

“Easy” classes – RandCaltech101: accordion, airplanes, car_side, dalmatian, dollar_bill, Faces_easy, garfield, laptop, Motorbikes, panda, snoopy, soccer_ball, stop_sign, watch, windsor_chair, yin_yang

“Difficult” classes – Caltech101: beaver, buddha, butterfly, ceiling_fan, cougar_body, crab, crayfish, cup, dragonfly, ewer, ferry, flamingo, flamingo_head, ibis, kangaroo, lamp, lobster, mandolin, mayfly, minaret, pigeon, platypus, pyramid, rhino, saxophone, schooner, sea_horse, stapler, strawberry, wild_cat, wrench

“Difficult” classes – RandCaltech101: bass, binocular, brontosaurus, buddha, butterfly, crab, crayfish, crocodile, cup, dragonfly, ewer, flamingo, flamingo_head, gerenuk, helicopter, ibis, kangaroo, lamp, lobster, mandolin, mayfly, metronome, minaret, okapi, pigeon, platypus, saxophone, sea_horse, stapler, wrench

and a “difficult class” in RandCaltech101. This shows that the background randomization performed in order to create RandCaltech101, while it makes the dataset more challenging to classify as a whole, does not change the relative difficulty between categories. Categories that obtain good classification scores on Caltech101 also obtain good scores on RandCaltech101.


Figure 5.7 – A typical Fscore evolution for the clustering-based classifier for m = 1000 on Caltech101 (a) and on RandCaltech101 (b)

5.4.5 Quantitative evaluation

In this section, we show how the performances of the two classifiers vary, depending on the visual vocabulary construction technique and the size of the visual vocabulary. We show that the semantically-enriched representations clearly outperform the baseline approach, mostly by increasing the scores of “difficult” categories, and we discuss overfitting. For all the experiments presented in this subsection, the parameter α (introduced in Equation 5.1, p. 108) of the filtering heuristic filt+model is set to one (α = 1); its influence is studied later, in Section 5.4.7.

Aggregating the number of clusters in the clustering-based classifier When using the clustering-based classification algorithm, for a fixed visual vocabulary size, varying the number of clusters nc leads to an Fscore variation as shown in Figure 5.7. For all visual vocabulary techniques, the Fscore improves steeply for lower values of nc and stabilizes once nc reaches a value approximately two to three times the number of categories. From this point on, the Fscore increases slowly and reaches its theoretical maximum when nc equals the number of individuals in the testing set. Since, once stabilized, the score can be considered relatively constant, we compute the mean Fscore over all values of nc. We obtain, for each visual vocabulary dimension, an aggregated Fscore.

Obtained graphics Figures 5.8, 5.9 and 5.10 present the score evolution as a function of the visual vocabulary size on, respectively, the datasets Caltech101, Caltech101-3 and RandCaltech101. More precisely, Figures 5.8a, 5.9a and 5.10a show the evolution of the aggregated Fscore, for the clustering-based classifier, and Figures 5.8b, 5.9b and 5.10b show the variation of the TruePositiveRate, using the SVM classifier. We vary the vocabulary dimension between 100 and 5300 for Caltech101 and


Figure 5.8 – Caltech101: Aggregated Fscore with the clustering-based classifier (a) and TruePositiveRate for the SVM (b) as functions of the vocabulary size


Figure 5.9 – Caltech101-3: Aggregated Fscore with the clustering-based classifier (a) and TruePositiveRate for the SVM (b) as functions of the vocabulary size

RandCaltech101 and between 10 and 5500 for Caltech101-3, using a variable step. For the three datasets, the horizontal axis is logarithmic. For every tuple (dataset, classifier, vocabulary construction technique), the graphics show a dome-like shape, corresponding to three phases: under-fitting, maximum performance and overfitting. We analyze the overfitting behavior of each vocabulary construction technique in more detail in Section 5.4.6. Furthermore, the somewhat low results obtained by the clustering-based classifier can be explained by the fact that the clustering-based classifier is a weak classifier (i.e., a classifier which performs only slightly better than a random classifier), whereas the SVM is a strong classifier.


Figure 5.10 – RandCaltech101: Aggregated Fscore with the clustering-based classifier (a) and TruePositiveRate for the SVM (b) as functions of the vocabulary size

Results interpretation When comparing the relative performances of the different techniques presented in Figures 5.8, 5.9 and 5.10, we observe that our semantic-aware proposals (model and filt+model) generally obtain better results than the generic (random+km) and random ones. The three regions of evolution are wider (they enter overfitting later) for model and filt+model than for random and random+km. On the other hand, they also exit under-fitting later. The generic random+km obtains better results than model and filt+model for lower dimensions of the visual vocabulary, on Caltech101 and RandCaltech101. After exiting the under-fitting region, model and filt+model constantly obtain better scores than random+km, even when overfitted. Applying our filtering proposal (filt+model) consistently provides a gain of performance over model, but also causes the visual vocabulary to enter overfitting earlier.

Table 5.2 – Average gain of performance relative to random.

                         model    filt+model   random+km
clust.  Caltech101       13.96%     15.69%       4.36%
        Caltech101-3      6.58%      7.36%       2.73%
        RandCaltech101   20.49%     26.27%      12.07%
SVM     Caltech101        5.98%     12.02%      12.05%
        Caltech101-3      4.71%      5.24%       1.90%
        RandCaltech101    5.89%     15.20%      13.21%

Table 5.2 gives the average gain of performance relative to random for the generic random+km and our semantic-aware proposals model and filt+model. For the clustering-based classifier, we show the average relative Fscore gain, while for the SVM we show the average relative TruePositiveRate gain. In five out of six cases, the best scores are obtained by filt+model.


Figure 5.11 – ROC curves: clustering-based classifier on Caltech101 (a) and RandCaltech101 (b)

model also performs better than the generic random+km in four out of the six cases. This shows that a semantically-enriched representation outperforms the generic method random+km in a classification task. The maximum gain of performance is achieved on RandCaltech101, where, by eliminating the background noise, our filtering algorithm considerably improves the classification performances. When used with the SVM classifier on Caltech101 and RandCaltech101, the model technique obtains average scores lower than random+km. This is because model exits under-fitting later than the other techniques, thus lowering its average score (as shown in Figures 5.8b and 5.10b).

The ROC curves Similar conclusions regarding the overfitting and the relative performances of the different visual vocabulary construction techniques can be drawn by plotting the evolution using ROC curves [Fawcett 2006]. Figure 5.11 shows the ROC curves obtained using the clustering-based classifier on Caltech101 (Figure 5.11a) and on RandCaltech101 (Figure 5.11b). The visual vocabulary size varies between 100 and 5300. The sign * on the graphic indicates the smallest size. The plots are zoomed in on the relevant part. Overfitting is clearly visible on the ROC curves. All the curves start by climbing towards the ideal point (0, 1) (first and second regions on the graphics in Figures 5.8a and 5.10a). After reaching a maximum, the ROC curves start descending towards the “worst” point (1, 0), showing the overfitting region. The curve corresponding to filt+model clearly dominates all the others, confirming the conclusions drawn from studying Table 5.2: the proposed approaches, and especially their combination in filt+model, achieve higher classification results.

Scores for “easy” and “difficult” categories In Section 5.4.4, we have shown that, in both Caltech101 and RandCaltech101, some classes are easier to learn than others. Regardless of the visual vocabulary construction technique, “easy classes” obtain higher classification scores. Nonetheless, the construction particularities of each technique influence the accuracy for difficult categories. In random, features are randomly picked to serve as visual words. Score differences between easy and difficult categories are pronounced and the overall accuracy is low. The K-Means iterations in random+km fit the visual vocabulary to “easy” classes. Few categories achieve good scores, accentuating the gap between easy and difficult categories. The model and filt+model techniques achieve, for “difficult” categories, better scores than random and random+km. The visual vocabulary is representative for all categories, and difficult categories like pyramid, minaret or stapler obtain higher scores than those obtained with a baseline representation.

5.4.6 Overfitting

Evaluating using the clustering-based classifier In the clustering-based classifier, for each pair (dataset, vocabulary construction technique), the Fscore graphic shows a dome-like shape with three regions. In the first one, corresponding to low vocabulary dimensions, the visual vocabulary is under-fitted: there are not enough visual words to describe the objects [Jiang et al. 2007]. Consequently, in the assign phase (phase 4 in the “bag-of-features” construction schema in Figure 5.2), features are assigned to the same visual word even if they are not similar to each other. The second region represents the interval in which the vocabulary obtains the best results. In the third region (corresponding to large sizes of the visual vocabulary), performance degrades gradually. This is due to the fact that, in the assign phase, relevant features are grouped densely, while noise is evenly distributed. Some of the visual words regroup relevant features, while others regroup only noise. As the visual vocabulary dimension grows, more and more visual words will regroup only noise. This generates a numerical space of high dimensionality, which is separable only along a few dimensions. This degrades the overall separability of the numerical space and, consequently, the classification performances.

Evaluating using the SVM classifier The same conclusions apply to the SVM classifier. Since the SVM is a strong classifier, the dome shape is less visible in Figures 5.8b (Caltech101) and 5.10b (RandCaltech101). Overfitting appears at larger visual vocabulary sizes than with the clustering-based classifier. For example, in Figure 5.10a, for random+km, the clustering-based classifier starts to overfit at a vocabulary size of 300. When using the SVM, in Figure 5.10b, overfitting starts only at 1300. The model technique does not appear to enter overfitting in Figure 5.10b, but this is likely to happen for dimensions higher than 5300 (the maximum considered), because model is the last technique to enter overfitting for the clustering-based classifier (as shown in Figure 5.10a). The overfitting region is even more visible for Caltech101-3 (Figure 5.9), where the visual vocabulary sizes are considerably higher, relative to the number of classes, than for the other datasets. In Figure 5.9a, the performances of all visual vocabulary techniques drop sharply for higher values of the vocabulary size. The evaluation using the SVM classifier, in Figure 5.9b, also clearly shows the dome-like shape.

5.4.7 Influence of parameter α

In Equation 5.1 (p. 108), we have defined δ, the filtering threshold, which is used to decide if a feature has any similar features in the known positive set. The parameter α is used to fine-tune this threshold.


Figure 5.12 – RandCaltech101: influence of the parameter α on the filt+model construction technique in the clustering-based task (a) and with the SVM classifier (b)

If α is set too low, only features that are very close (in terms of Euclidean distance) are considered to be similar. Consequently, the filtering is very strict, lowering the number of false positives at the risk of an inflation of false negatives. On the other hand, setting α too high allows distant features to be considered as similar, causing a high number of false positives. In the previous experiments, we set the parameter α = 1. In this section, we study the influence of this parameter on the performances obtained by the filt+model construction technique. Figure 5.12 shows the evolution of the filt+model visual vocabulary construction technique as a function of the visual vocabulary size, when using α ∈ {0.8, 1, 1.25, 1.5}. The horizontal axis is logarithmic. A value of α = 0.8 is too strict and the high number of false negatives decreases classification performances. Increasing α to 1 improves performances, both when using the clustering-based classifier (Figure 5.12a) and when using the SVM classifier (Figure 5.12b). If α is set too high, performances decrease again: too many features are considered similar and fewer features get filtered. Performances approach those obtained when no filtering is applied. α = 1.25 and α = 1.5 show similar performances, since both levels are already too high for the filtering to be effective. For α ≥ 1.25, filt+model is equivalent to the model visual vocabulary construction technique. In Figure 5.12a, filt+model with α ∈ {1.25, 1.5} obtains, for high visual vocabulary sizes (m > 2000), better results than filt+model with α ∈ {0.8, 1}. This behaviour is similar to that already seen in Figure 5.10a, where model enters overfitting later than filt+model and obtains better results for high vocabulary sizes. These initial experiments lead us to believe that α is dataset independent (a value of 1 provided the best results on all three datasets), but further experiments on other datasets are required before drawing this conclusion. Furthermore, a heuristic for automatically determining its value is part of our future plans.

5.5 Conclusions and future work

Conclusion In this chapter, we have focused on one of the core research challenges of this thesis: leveraging semantics when dealing with complex data. More precisely, we are interested in constructing a semantically-enriched image representation by leveraging additional information under the form of non-positional image labels. We argue that enriching the semantics of the image representation would boost the performance of learning algorithms, and we apply our proposed method to the learning task of content-based image classification. We use the additional information in the visual vocabulary construction phase, when building a “bag-of-features” image representation. We have proposed two novel approaches for incorporating this semantic knowledge into the visual vocabulary creation. The first approach creates dedicated vocabularies for each label, while the second uses a pre-processing phase to filter out visual features unlikely to be associated with a given object. We have shown that the semantically-enriched image representations built using our proposals obtain higher scores than a baseline BoF representation, in the context of a content-based image classification task. This shows that incorporating semantic knowledge in the vocabulary construction results in more descriptive visual words, especially on datasets where the background noise is significant. Even when overfitted, our proposals continue to outperform the generic approach.

Passage to semantic scene classification and label co-occurrence Passing from object categorization to scene classification raises the difficulty of object co-occurrence. For example, a picnic scene is defined by the simultaneous presence of “people”, “trees”, “grass” and “food”. In terms of labels, this translates into label co-occurrence. In the approaches proposed in Section 5.3, we assume that labels which denote objects appear independently. The label correlation in complex scenes can be problematic when using the keypoint filtering proposal together with the dedicated vocabulary proposal (the filt+model vocabulary construction technique). For the filtering, the known positive set contains images labeled identically with the target image. Therefore, in the case of scene classification, the filtering is applied for a set of labels, whereas the dedicated model is constructed for each label individually. This inconsistency does not necessarily pose problems: filtering the keypoints for the labels {t1, t2, t3} and constructing the vocabulary only for label t1 does not produce erroneous results, only noisy results. Our approaches can be scaled to scene classification by addressing the label co-occurrence issue. For this purpose, we can adapt the label set to the image collection by reconstructing labels so as to reduce, or even eliminate, their co-occurrence. The feature construction technique 7 presented in Chapter 4 can be used to construct a new label set that properly describes the image collection. The new labels are constructed as conjunctions of existing labels and their negations, and would actually no longer be used to label objects, but scenes. For example, if the labels “motorcycle” and “rider” often appear together, a new label “motorcycle ∧ rider” will be created to mark the scene identified by the presence of the two objects.

7. Note that, in this context, the word feature is used in the sense defined in Chapter 4, synonymous with attribute. It should not be confused with the definition of feature in the image processing literature, where it has the sense of visual feature.

Figure 5.13 – Streamlined schema showing how the contributions in this chapter can be articulated with those in previous chapters.

Articulation with the previous work Conceptually, the work presented in this chapter articulates with that of previous chapters as shown in Figure 5.13, where the work of previous chapters is drawn with faded colors. The joining point is, as in the previous chapters, the description of the data in a semantic-aware numeric description space. Using the “bag-of-features” format presented in this chapter, an image can be described as a multidimensional vector of distributions over visual words. Furthermore, the work presented in Chapter 4 has a direct application in label reorganization for scene classification, as described in the previous paragraph. Most of the work presented in this chapter was submitted and is under review for the International Journal of Artificial Intelligence Tools (IJAIT) [Rizoiu et al. 2013b].

Chapter 6 Dealing with text: Extracting, Labeling and Evaluating Topics

Contents
6.1 Learning task and motivations 125
6.2 Transforming text into numerical format 128
6.2.1 Preprocessing 128
6.2.2 Text numeric representation 129
6.3 An overview on Topic Extraction 131
6.3.1 Text Clustering 132
6.3.2 Topic Models 134
6.3.3 Topic Labeling 136
6.3.4 Topic Evaluation and Improvement 139
6.4 Extract, Evaluate and Improve topics 140
6.4.1 Topic Extraction using Overlapping Clustering 141
6.4.2 Topic Evaluation using a Concept Hierarchy 148
6.5 Applications 159
6.5.1 Improving topics by Removing Topical Outliers 159
6.5.2 Concept Ontology Learning 161
6.6 Conclusion and future work 163

6.1 Learning task and motivations

This chapter presents our work concerning one of the most important and most abundant types of complex data: text. More specifically, we focus on topics, which provide a quick means of summarizing the main “ideas” that emerge from a collection of textual documents. Topics are usually defined as statistical distributions of probabilities over words and, therefore, they are sometimes hard to interpret for human beings. The work detailed in this chapter, and presented schematically in Figure 6.1, approaches the research challenge of leveraging semantics when analyzing textual data. This core research challenge is materialized into three learning tasks, associated with two of the guidelines of this thesis: (i) extracting topics, (ii) labeling topics with humanly comprehensible labels (this task is related to the guideline of creating humanly comprehensible outputs) and (iii) using semantic knowledge, under the form of concept hierarchies, in topic evaluation (this task is related to the guideline of devising methods that embed semantics in the learning process).


Figure 6.1 – Streamlined schema of our work with the textual dimension of the complex data: extracting and evaluating topics.

For the first task, we propose a solution for topic extraction using overlapping text clustering, which allows a document to be assigned to multiple topics, based on its semantics. For the second task, we assign to topics humanly comprehensible labels using an approach based on suffix arrays. For the last task, we propose to map topics to subtrees of concepts in a knowledge base (here, WordNet [Miller 1995]), by passing through words. The semantic cohesion of topics is then evaluated based on the height and depth of the corresponding topical subtrees in the concept hierarchy.

Practical and applied context Unlike the work presented in previous chapters, for which the motivations were mainly research driven, the work presented in this chapter has a very close connection to my applied work and the various research projects in which I was involved during my thesis. While not being central to my research work (for which the accent lies on semantic representation and the temporal dimension), the work concerning text analysis through topic extraction and evaluation constantly ran in parallel with it. Topic extraction is an important tool, especially in the context of applied projects in which textual analysis is involved. The topic extraction system described in Section 6.4.1 was implemented 1 into the CKP 2 topic extraction platform, in the context of the project Conversession, which involved the creation of a start-up enterprise 3 (see Annex A). The object of the project was Online Media Watching, and the development of the resulting prototype was later continued in the context of the projects Eric-Elico, Crtt-Eric and ImagiWeb. All these projects shared the need to retrieve text from the Internet, usually from online discussion forums, and to analyze it from various points of view (e.g., detecting the discussion topics, their emergence and evolution, differences of discourse etc.). The software resulting from this continuous development is CommentWatcher, an open-source web-based platform for analyzing online forum discussions. CommentWatcher will be described in detail in Chapter 7. Multiple publications also resulted from the context of these projects and collaborations. Some of the works concerning topic extraction were initially developed during my Master's research internship. The further extensions were published in the proceedings of a French national conference [Rizoiu et al. 2010], while the application of topic extraction to ontology learning was published in a book chapter [Rizoiu & Velcin 2011]. The topic evaluation methodology was developed in collaboration with the Computer Science department of the Polytechnic University of Bucharest, and more precisely during the PhD research internship of Claudiu Cristian Muşat at the ERIC laboratory. It was proposed in the proceedings of an international conference [Musat et al. 2011b]. The application of outlier detection was also published in the proceedings of an international conference [Musat et al. 2011a].

1. The first beta version was implemented during my Master’s thesis.
2. Download here: http://eric.univ-lyon2.fr/~arizoiu/files/CKP-src.jar
3. http://www.conversationnel.fr/

Research context and motivations When automatically analyzing text, difficulties arise that are similar to those raised by image data. The native digital representation of text (which handles encoding at the character level) usually captures little information about the semantics of the text. We define the semantics of a text as the information that is transmitted through the text, its intent. To correctly structure text, most languages possess syntactic and morphological rules, which were employed by the Natural Language Processing research community to develop automatic language processing systems. These systems usually provide good accuracy, but they are costly to develop and maintain (due to the expert supervised component), they are specialized on particular domains and they often fail to capture the subtleties of language (e.g., humor, irony, complex opinions). Other, more statistically oriented, approaches have been proposed. The most widely used textual representation employed by numeric systems is the orderless “bag-of-words” representation, presented in detail in Section 6.2. Among these numeric approaches, topic extraction systems emerged as a response to the need to summarize large quantities of textual data and infer the main “ideas” behind a collection of textual documents. The work presented in this chapter deals with leveraging semantic information into the tasks of topic extraction and topic evaluation. We also present some of their applications, such as ontology learning and topic improvement. We propose a topic extraction system that infers topics from text by means of overlapping text clustering. We address the problem of human comprehension of topics by assigning each cluster a “humanly comprehensible” name, made out of a frequent complete expression. As one of the guidelines of our work, we consider it crucial to generate humanly comprehensible outputs, since topics defined as distributions of probabilities over words are difficult to interpret for a human being. For the topic evaluation task, the idea is to emulate the human judgment of topics. To this end, we use a semantic resource, such as a concept hierarchy (e.g., WordNet). Each topic is mapped to a subtree in the concept hierarchy and we redefine the specificity and coverage of the subtree based on its height and depth in the concept hierarchy. We use these measures to evaluate the cohesion of topics. The remainder of this chapter is structured as follows. In Section 6.2, we present how raw text can be translated into a numerical format using the “bag-of-words” representation. An overview of research related to topic extraction, labeling and evaluation is presented in Section 6.3. In Section 6.4, we propose a topic extraction system and a concept-based topic evaluation system. We continue in Section 6.5 by presenting two applications of our proposals: topic extraction for ontology learning and the improvement of topics, based on concepts, by removing a topic's spurious words. We end with Section 6.6, in which we draw conclusions and outline future work concerning the textual dimension of complex data.

6.2 Transforming text into numerical format

One of the simplest text numerical representations is the “bag-of-words” representation [Harris 1954, Harris 1968] (BoW). This representation is extensively used in Natural Language Processing and Information Retrieval tasks. Lately, it has also been used in Image Processing, as shown in Chapter 5, under the form of the “bag-of-features” representation. In the BoW model, text is represented as an unordered collection of words, disregarding grammar and even word order. Each word is assigned a score, calculated according to a term weighting scheme.


Figure 6.2 – Schema of transforming texts to a numerical representation.

Figure 6.2 shows the typical treatment chain that transforms textual data into the tabular numeric format. In the Preprocessing phase, each document is preprocessed, as shown in Section 6.2.1. This usually includes (i) eliminating common words that do not bring any information about the theme of the text and (ii) stemming or lemmatizing inflected words, in order to increase their descriptive value. In the Translation to numeric representation phase, the preprocessed documents are translated into a numeric vectorial description space, as shown in Section 6.2.2, using one of the term weighting schemes (e.g., term frequency, TFxIDF etc.). At the end of this phase, the textual collection can be represented as a term/document matrix: each numeric feature (a column) corresponds to a word in the vocabulary and each vector (a row) defined in this multidimensional space corresponds to a textual document. We present each of these two phases in further detail over the next two subsections.

6.2.1 Preprocessing

The purpose of preprocessing is to augment the descriptive power of the terms and limit the size of the vocabulary. There are typically two problems that arise when dealing with natural language text using a BoW approach: inflected/derived words and stopwords. Therefore, the preprocessing is usually composed of two elements: stemming/lemmatization and stopwords removal.

Stemming/Lemmatization Many languages apply inflections to words to express possession, verb tenses etc. For example, in English, the verb “to walk” may appear as “walk”, “walked”, “walks”, “walking”. The base form (i.e., “walk”), the infinitive for verbs or the masculine singular form for nouns, is called the lemma of the word. While crucial for the human comprehension of the text, inflections can usually be ignored for many applications. When using the BoW representation, having different forms of the same base form leads to creating multiple bags for a single word and to incorrectly evaluating the word's score in the document. Stemming is the process of reducing inflected words to their stem or root form, by removing inflections, prefixes and/or suffixes. The stem of a word is not necessarily identical to the word's lemma and it may not even be a valid word or the morphological root of the word. It is sufficient that all inflected variants of a word map to the same stem. Stemming is language dependent, but algorithms have been developed for the majority of widely used languages. For English, the most widely used stemmer is Porter's stemmer [Porter 1980], and various initiatives (e.g., the CLEF Project 4) have proposed solutions for European languages. Lemmatization refers to determining the lemma of a given word. This process usually involves complex tasks such as understanding context and determining the part of speech of a word in a sentence. Lemmatization is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on the part of speech (e.g., “meeting” as a noun in “during our last meeting” or as a verb in “we are meeting again tomorrow”). Stemmers are typically easier to implement and run faster, while lemmatization can in principle select the appropriate lemma depending on the context.

Stopwords removal Stopwords are commonly used words, such as articles, prepositions etc., that do not present any descriptive value, as they are not associated with a certain theme. When using a BoW representation, they increase the size of the dictionary and bias the values of certain term weighting schemes (such as Term Frequency). Stopwords are typically removed using stopword lists, which contain short function words (e.g., the, is, at, which, on etc.), but any words judged unnecessary can also be included. Nevertheless, stopwords are important for the human reader (as shown in Section 6.4.1) and their removal might render the results humanly incomprehensible. Therefore, when results are shown to the reader (e.g., cluster labels, search results), they typically include them. Our work concerning topic labeling, presented in Section 6.4.1, uses complete expressions that also include stopwords.
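As a concrete illustration of the preprocessing phase, the following is a minimal sketch combining stopword removal and stemming. It assumes NLTK's English stopword list and Porter stemmer, which are stand-ins for whatever resources an actual implementation would use; the tokenization is deliberately crude.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)

stemmer = PorterStemmer()
stop = set(stopwords.words("english"))

def preprocess(document):
    # crude tokenization on letter runs, then stopword removal and stemming
    tokens = re.findall(r"[a-z]+", document.lower())
    return [stemmer.stem(t) for t in tokens if t not in stop]

print(preprocess("We are meeting again tomorrow, after our last meeting."))
# e.g. ['meet', 'tomorrow', 'last', 'meet'] (exact output depends on the stopword list)
```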

6.2.2 Text numeric representation

Numerous textual representations exist in the Information Retrieval domain [Singhal 2001]. The Boolean Model compares True/False query statements with the word set that describes a document. Such a system has the shortcoming that there is no inherent notion of document ranking. The Probabilistic Model [Maron & Kuhns 1960] is based on the general principle that documents in a collection should be ranked by decreasing probability of their relevance to a query. In the Inference Network Model [Turtle & Croft 1989], document retrieval is modeled as an inference process in an inference network. A document instantiates a term with a certain strength, and the credit from multiple terms is accumulated given a query to compute the equivalent of a numeric score for the document. The Vector Space Model [Salton et al. 1975] is the representation most widely used in modern text clustering algorithms and Information Retrieval tasks.

4. http://www.clef-campaign.org/

Just as the BoW representation, the Vector Space Model does not conserve the order of words or their semantic relations. Each document is represented as a multidimensional vector in a space having as many dimensions as there are terms 5 in the considered dictionary, where each dimension is associated to a term. The score on each dimension is directly proportional to the strength of the relationship between the considered word and the document. The Vector Space Model allows defining a distance between textual documents, which can be interpreted as the similarity between documents. It is usually calculated using the cosine distance between the two multidimensional vectors:

\[ \|d_i - d_j\|_{\cos} = 1 - \cos(d_i, d_j) = 1 - \frac{\sum_{k=1}^{|V|} d_{i,k} \times d_{j,k}}{\sqrt{\sum_{k=1}^{|V|} d_{i,k}^2}\ \sqrt{\sum_{k=1}^{|V|} d_{j,k}^2}} \qquad (6.1) \]

where di and dj are two documents in the document collection D, V is the word vocabulary and di,k is the score of word wk ∈ V associated to the document di. There are numerous methods for measuring the scores of words, also known as term weighting schemes [Salton & Buckley 1988], out of which we single out (i) the presence/absence scheme, (ii) the term frequency scheme, (iii) the inverse document frequency scheme and (iv) the TFxIDF scheme, detailed hereafter.

1. Presence/Absence is also known as binary weighting and it is the simplest way to measure the belonging of a word to a document. Its formula is:

\[ d_{i,k} = pa(d_i, w_k) = \begin{cases} 1, & \text{if the word } w_k \text{ is found in document } d_i \\ 0, & \text{otherwise} \end{cases} \]

This weighting scheme can only show if a word is related to a document, but it does not measure the strength of the relationship [Dumais et al. 1998].

2. Term Frequency is also known as term count. It is defined as the number of times a given word appears in a document. To avoid favoring longer documents, normalization is usually used:

\[ d_{i,k} = tf(d_i, w_k) = \frac{n_{i,k}}{\sum_{l=1}^{|V|} n_{i,l}} \]

where ni,k is the number of occurrences of the word wk in the document di and the denominator is the total number of words in the document di.

3. Inverse Document Frequency is a measure of the general importance of a word in the whole corpus. It favors rare words, giving a low score to words that appear in many documents. IDF is defined as:

\[ idf(w_k) = \log \frac{|D|}{|\{d \mid w_k \in d,\ d \in D\}|} \]

where D is the collection of textual documents and |{d | wk ∈ d, d ∈ D}| is the total number of documents in which the word wk appears. In practice, IDF is never used alone, as it lacks the power to quantify the relationship between a word and a document and it favors very rare words, which often prove to be typographic errors. IDF is usually used in conjunction with TF in the TFxIDF weighting scheme.

5. A term is a word that carries a certain meaning (i.e., most often a term refers to objects, ideas, events, states of affairs etc.). All terms are words, but only some words are terms.

4. TFxIDF [Jones 1972] is the most widely used scheme in Information Retrieval. It is the product of the Term Frequency and the Inverse Document Frequency:

di,k = tfxidf(di, wk) = tf(di, wk) × idf(wk)

TFxIDF aims at balancing local and global occurrences of a word. It assigns a high score to a word wk that appears multiple times in a document di (high Term Frequency), but which is scarce in the rest of the collection (high Inverse Document Frequency).
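The weighting schemes above and the cosine distance of Equation 6.1 can be summarized in a few lines of code. The sketch below uses plain Python dictionaries as sparse document vectors; the toy corpus is purely illustrative.

```python
import math
from collections import Counter

def tf(tokens):
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}          # term frequency

def idf(corpus):
    df = Counter(w for doc in corpus for w in set(doc))        # document frequency
    return {w: math.log(len(corpus) / d) for w, d in df.items()}

def tfxidf(tokens, idf_scores):
    return {w: s * idf_scores[w] for w, s in tf(tokens).items()}

def cosine_distance(di, dj):
    # Equation 6.1, with documents given as sparse {word: score} vectors
    dot = sum(s * dj.get(w, 0.0) for w, s in di.items())
    ni = math.sqrt(sum(s * s for s in di.values()))
    nj = math.sqrt(sum(s * s for s in dj.values()))
    return 1.0 - dot / (ni * nj) if ni and nj else 1.0

corpus = [["coal", "mining", "industry"],
          ["coal", "mining", "safety"],
          ["data", "mining", "algorithms"]]
idf_scores = idf(corpus)
d1, d2, d3 = (tfxidf(doc, idf_scores) for doc in corpus)
print(cosine_distance(d1, d2), cosine_distance(d1, d3))
```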

Other term weighting schemes for the BoW representation exist, often variations of the classical schemes. Okapi BM25 [Robertson et al. 1995] is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document. It is based on the probabilistic retrieval framework. Other weighting schemes seek to include additional external information into the scoring function. For example, [Karkali et al. 2012] propose tDF (temporal Document Frequency), which is a variation of the BM25 measure. It includes temporal information in order to “keep topics fresh”, by adding a decay factor under the form of a temporal penalization function.

6.3 An overview on Topic Extraction

It is generally accepted that a topic represents an “idea” that emerges from a collection of textual documents, but there is no widely accepted formal definition of a topic. Throughout the literature, a topic is considered to be (i) a cluster of documents that share a thematic [Cleuziou 2008], (ii) a distribution of probabilities over words and over documents [Blei et al. 2003] or (iii) an abstraction of the regrouped texts that needs a linguistic materialization: a word, a phrase or a sentence that summarizes the idea emerging from the texts [Osinski 2003]. Plenty of applications can benefit greatly from inferring topics from text, including information retrieval systems, database summarization, ontology learning, document clustering and querying large document collections. We present an overview of topic extraction, focusing on flat clustering techniques, which create a single-level partition of documents. These algorithms divide a collection of textual documents into groups, so that documents inside a group are similar in terms of their topic (politics, economics, social etc.) and dissimilar from documents in other groups. For each cluster, K-Means-based clustering algorithms present a centroid, which is an abstraction of the cluster and which summarizes the common part of the documents in the cluster. The centroid (a multidimensional vector or a distribution of probabilities) is not a real document and rarely makes any sense to a human. Therefore, it is convenient to choose a comprehensible name for it. There are a number of ways of labeling a topic [Roche 2004]: (i) choosing an arbitrary number of highly rated words, (ii) selecting a document as representative, (iii) assigning a meaningful, human-readable expression or phrase etc. The polysemy of words is important when labeling topics: the same word can have different meanings, depending on the context. For example, the word “mining” has one meaning in the expression “coal mining” and another in “data mining”.

Other topic extraction methods exist, most notably those issued from computational linguistics [Ferret 2006] or graph-based methods [Ng et al. 2002]. In the following sections, we concentrate only on statistics-based methods: we present text clustering based methods in Section 6.3.1 and probabilistic topic models in Section 6.3.2. In Section 6.3.3, we address the issue of topic labeling, and we finish in Section 6.3.4 with a short overview of methods which deal with topic evaluation and improvement.

6.3.1 Text Clustering

We choose to divide text clustering algorithms into categories based on their ability to create overlapping groups. Other classifications are possible: some authors [Dermouche et al. 2013] divide them into families of methods, namely (i) distance-based methods, (ii) matrix factorization-based methods and (iii) probabilistic methods. Some of the solutions presented below were designed specifically for text mining (like LDA), others are general purpose clustering algorithms. The latter can be used for text clustering by translating textual documents into the Vector Space Model (described in Section 6.2.2).

Crisp solutions

Crisp clustering algorithms regroup a collection of documents into a partition of disjoint classes. K-Means [MacQueen 1967] is one of the most well-known crisp clustering algorithms. The algorithm iteratively optimizes an objective criterion, typically the distortion function. In the case of text mining and information retrieval, the cosine distance is used to calculate the similarity between texts. Bisecting K-Means [Steinbach et al. 2000] is a hierarchical variant of K-Means, which has been emphasized as more accurate than K-Means for text clustering. It is a top-down algorithm that partitions, at each step, one cluster of documents into two crisp sub-clusters. At the first iteration, the collection is divided into two subsets according to multiple restarts of 2-means. At the successive iterations, one subset is split into two, and n + 1 text clusters are obtained from n initial clusters. The process is iterated until a stopping criterion is satisfied (e.g., a fixed number of clusters). The final output of Bisecting K-Means can be seen as a truncated dendrogram. Hierarchical agglomerative clustering (HAC) [Jain & Dubes 1988] is a hierarchical clustering technique used more frequently in Information Retrieval. It constructs the hierarchy bottom-up and consists in merging pairs of clusters at each step. A single-level crisp partition can be obtained by cutting the dendrogram at a certain level, chosen using a given heuristic (e.g., the biggest gap between levels). Nevertheless, hierarchical clustering methods are ill-suited to the large volumes of data encountered in text mining.

Overlapping solutions

In overlapping clustering, documents are authorized to simultaneously be part of multiple clusters. The result is no longer a strict partition of the document collection, since groups have non-empty intersections. Considering that longer texts have the tendency to approach multiple subjects, it is only natural to allow them to be part of each of the corresponding topics (e.g., a text that talks about the “economic outcomes of a political decision” should be part of both the “politics” and the “economics” group). Therefore, an overlapping technique seems more appropriate for text clustering [Cleuziou et al. 2004]. Overlapping K-Means (OKM) [Cleuziou 2008] is an extension of K-Means. It shares the same general outline, trying to iteratively minimize an objective function. In OKM, a document can be assigned to multiple clusters. Whereas K-Means assigns each document to the closest centroid, OKM assigns each document to a set of centroids whose gravity center is closest. The objective function minimized by OKM is the distortion in the dataset:

\[ \text{distortion}(D) = \frac{1}{k \times |D|} \sum_{d_i \in D} \|d_i - \bar{d}_i\|^2 \qquad (6.2) \]

where ||•|| is the considered distance between documents (usually the cosine distance ||•||cos defined in Equation 6.1), k is the number of clusters, D is the document collection and d̄i is the image of document di (d̄i is the gravity center of the clusters to which di is assigned). wOKM [Cleuziou 2009] is a weighted version of OKM, which uses weights internally to limit the overlapping of clusters. wOKM can be seen as a special case of subspace clustering [Parsons et al. 2004].
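As an illustration of the OKM principle described above (and detailed further in Section 6.4.1.1), the following sketch greedily assigns a document to the centroids whose gravity center best approximates it, and computes the distortion of Equation 6.2. It uses the Euclidean distance for brevity (the thesis uses the cosine distance) and should be read as a schematic illustration, not as the reference OKM implementation.

```python
import numpy as np

def okm_assign(doc, centroids):
    # Greedily add centroids (closest first) as long as the gravity center of
    # the selected centroids (the document "image") gets closer to the document.
    order = np.argsort([np.linalg.norm(doc - c) for c in centroids])
    assigned = [order[0]]
    image = centroids[order[0]]
    for idx in order[1:]:
        candidate = centroids[assigned + [idx]].mean(axis=0)
        if np.linalg.norm(doc - candidate) < np.linalg.norm(doc - image):
            assigned.append(idx)
            image = candidate
        else:
            break
    return assigned, image

def distortion(docs, images, k):
    # Equation 6.2: averaged squared distance between documents and their images
    return sum(np.linalg.norm(d - im) ** 2 for d, im in zip(docs, images)) / (k * len(docs))

rng = np.random.default_rng(0)
docs, centroids = rng.random((20, 5)), rng.random((3, 5))
assignments, images = zip(*(okm_assign(d, centroids) for d in docs))
print(assignments[0], distortion(docs, images, k=len(centroids)))
```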

Fuzzy solutions

In fuzzy clustering, each document belongs with a certain degree (or probability) to all clusters, rather than belonging completely to just one cluster (as in crisp clustering) or to several clusters (as in overlapping clustering). Each document d is associated with the cluster µl with the degree u(d, µl). A document di situated at the edge of a cluster µl is associated with the cluster to a lower degree than a central document dj:

\[ u(d_i, \mu_l) < u(d_j, \mu_l), \quad \forall\, d_i, d_j \in D \ \text{such that} \ \|d_i - \mu_l\|^2 > \|d_j - \mu_l\|^2 \]

Fuzzy logic clustering algorithms can be adapted to output a crisp partition by selecting, for each document d, the cluster with the highest degree of belonging:

\[ \text{chosen\_cluster}(d) = \underset{l = 1, 2, \dots, k}{\operatorname{argmax}} \ u(d, \mu_l) \]

and to output an overlapping partition by choosing a threshold θ and considering only the clusters for which the degree of belonging is greater than the threshold:

\[ \text{chosen\_clusters}(d) = \{\mu_l \mid u(d, \mu_l) > \theta,\ l = 1, 2, \dots, k\} \]
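The two conversions above are straightforward to express in code; the membership degrees in this sketch are toy values rather than the output of a real fuzzy clustering run.

```python
import numpy as np

memberships = np.array([0.05, 0.60, 0.35])   # u(d, mu_1), u(d, mu_2), u(d, mu_3)

def chosen_cluster(u):
    return int(np.argmax(u))                 # crisp: the highest degree of belonging

def chosen_clusters(u, theta=0.3):
    return [l for l, degree in enumerate(u) if degree > theta]   # overlapping

print(chosen_cluster(memberships))    # 1
print(chosen_clusters(memberships))   # [1, 2]
```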

Fuzzy K-Means [Dunn 1973] is an adaptation of the K-Means algorithm to fuzzy logic. Fuzzy K-Means differs from the original version in several aspects: the way the objective function is calculated, the centroid update phase and the output of the algorithm. Every pair (document, cluster) contributes to the objective function proportionally to the membership degree of the document in the cluster. Similarly, in the centroid update phase, all documents contribute according to their weights. The output of the algorithm is, for each document, a vector with the probabilities of membership to clusters. Latent Semantic Indexing (LSI) [Berry et al. 1995] is a statistical topic discovery algorithm using Singular Value Decomposition (SVD) as its underlying mathematical ground.

The algorithm decomposes the term/document matrix A into a product of three matrices: A = USV^T. U and V are orthogonal matrices, containing the left and the right singular vectors of A. S is a diagonal matrix with the singular values of A ordered decreasingly. LSI allows defining heuristics for reducing dimensionality and determining the number of topics. A_k, the k-approximation of A (A_k = U_k S_k V_k^T), is obtained by selecting the k highest ranking singular values from S, along with the corresponding columns in U and rows in V. There exist heuristics for determining k, for example, (i) ordering the singular values of A decreasingly and cutting at the highest difference between two consecutive values [Osinski 2003] or (ii) the scree test [Cattell 1966]. The columns in U corresponding to the k highest singular values create an orthogonal basis for the document space. Any multidimensional vector in this space can be expressed as a weighted sum of the elements of the basis:

di = α1µ1 + α2µ2 + ... + αkµk , i ∈ 1, 2,..., |D| and l ∈ 1, 2, . . . , k

Considering the basis elements µl as the centroids of clusters, documents are described in a fuzzy logic: the document di has the probability αl of belonging to the cluster µl. LSI has the inconvenience that the factorization can produce negative values, which can pose interpretability problems [Lee & Seung 1999]. Non-negative Matrix Factorization (NMF) [Lee & Seung 1999] deals with this problem by finding a non-unique factorization of the non-negative matrix V, so that V = WH. NMF-based methods have been shown to be efficient in topic extraction [Seung & Lee 2001, Xu et al. 2003].
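A rank-k truncated SVD, as used by LSI, can be obtained directly with numpy; the term/document matrix below is random and only illustrates the shapes involved.

```python
import numpy as np

A = np.random.rand(500, 40)          # term/document matrix (|V| terms x |D| documents)
k = 5                                # number of topics to keep

U, S, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]   # rank-k approximation A_k = U_k S_k V_k^T

# The columns of U_k span the reduced topic space; each document (column of A)
# is described by its coordinates alpha_1..alpha_k in that basis.
doc_coords = np.diag(S[:k]) @ Vt[:k, :]       # k x |D| matrix of alpha coefficients
print(A_k.shape, doc_coords.shape)
```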

6.3.2 Topic Models

Many concurrent solutions exist for topic extraction, some of which, falling in the category of text clustering, have been presented in Section 6.3.1. However, in recent years, generative methods, and specifically topic models, have imposed themselves as the state of the art. Starting from the assumption that observable data can be randomly generated following an a priori determined set of rules, topic models seek to detect the abstract “topics” that occur in a collection of documents. Latent Dirichlet Allocation (LDA) [Blei et al. 2003] is probably the most well-known probabilistic generative model designed to extract topics from text corpora. It considers documents as “bags-of-words” and models each word in a document as a sample from a mixture model. Each component of the mixture can be seen as a “topic”. Each word is generated from a single topic, but different words in a document are generally generated from different topics. Each document is represented as a list of mixing proportions of these mixture components and thereby reduced to a probability distribution over a fixed set of topics. LDA is highly related to probabilistic Latent Semantic Analysis (pLSA) [Hofmann 1999], except that in LDA the topic distribution is assumed to have a Dirichlet prior. This point is highly important because it allows going beyond the often-criticized shortcomings of pLSA, namely that it is not a proper generative model for new documents and that it is prone to overfitting. More precisely, LDA is based on the hierarchical generative process illustrated in Figure 6.3.

Figure 6.3 – Schema of Latent Dirichlet Allocation.

The hyperparameters α and β are the basis of two Dirichlet distributions. The first Dirichlet distribution deals with the generation of the topic mixture for each of the |D| documents. The second Dirichlet distribution regards the generation of the word mixture for each of the k topics. Each topic is then a distribution over the |W| words of the vocabulary. The generative process is the following: for each word wd,i of the corpus, draw a topic µ depending on the mixture θ associated to the document d and then draw a word from the topic µ. Note that words without special relevance, like articles and prepositions, will have roughly even probability between classes (or they can be placed into a separate category). As each document is a mixture of different topics, in the clustering process, the same document can be placed into more than one group, thus resulting in a (kind of) overlapping clustering process. Learning the parameters θ and µ, and sometimes the hyper-parameters α and β, is rather difficult because the posterior p(θ, µ/D, α, β, k) cannot be fully calculated, because of an infinite sum in the denominator. Therefore, various approximation algorithms must be used, such as variational EM, Markov chain Monte Carlo methods etc. This probabilistic approach presents advantages and disadvantages:
– The theoretical framework is well-known in Bayesian statistics and well-grounded. It has led to much fruitful research (see below).
– It is designed to make inferences on new documents: what are the associated topics and with what proportions? What part of the document is associated with which topic? Depending on the likelihood p(d/Θ), does a new document describe an original mixture of topics or a new, never-before-seen topic?
– LDA is a complex mathematical model, which considers each document as a combination of possibly many topics. While this may be interesting for describing the documents, in the case of clustering, it could lead to a situation where each document belongs, more or less, to many clusters (similar to a fuzzy approach). An issue is therefore to be able to choose a finite (and hopefully short) list of topics to be associated with the document, beyond setting a simple threshold parameter.
– This method does not present a center for each cluster, but a distribution of the topics over the terms. This could make it difficult to associate a readable name to the cluster. Note that recent works relative to LDA are seeking to find useful names using

n-grams [Wang et al. 2007].
– As in the other presented methods, this probabilistic approach does not solve the classical problems of finding the global optimum and choosing the number k of topics. For the latter, some methods have been proposed, inspired by work in model selection [Rodríguez 2005].
Numerous works have followed the hierarchical generative model of LDA to deal with various related issues: extracting topic trees (hLDA) [Blei et al. 2004], inducing a correlation structure between topics [Blei & Lafferty 2006a], modeling topics through time [Blei & Lafferty 2006b, Wang et al. 2008], finding n-grams instead of single words to describe topics [Wang et al. 2007], social networks [Chang et al. 2009b], opinion mining [Mei et al. 2007a] etc.
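For illustration, a topic model of this family can be fitted with off-the-shelf tools. The sketch below uses scikit-learn's LatentDirichletAllocation, which is not the implementation discussed in this thesis; the small corpus is invented for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the economic outcome of the political decision",
        "oil prices and the energy market",
        "parliament voted the new energy policy"]

vectorizer = CountVectorizer(stop_words="english")    # bag-of-words counts
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)                       # theta: document-topic mixtures

words = vectorizer.get_feature_names_out()
for t, topic in enumerate(lda.components_):            # topic-word weights
    top = topic.argsort()[-3:][::-1]
    print("topic", t, [words[i] for i in top])
```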

6.3.3 Topic Labeling

Distributions of probabilities and multidimensional vectors are hardly comprehensible for a human reader. As shown in Section 6.3, there is no unanimously accepted method for presenting topics. Most authors limit themselves to showing the top scoring words of each topic. While a list of words already gives the main idea of a topic, it is more interesting to present the human reader with a phrase that summarizes the idea behind the group of documents. To provide better topic descriptions, systems like Topical N-Grams [Wang et al. 2007] embed “spatial” connections between words in the topical learning process. Therefore, topic names are inferred simultaneously with the topics. Topical N-Grams extracts topics as distributions over n-grams (sequences of multiple words which often appear together). Presenting highly scored expressions is more comprehensible than a list of words [Wang et al. 2007]. Most topic labeling algorithms assign topic names in a post-processing phase of the topic extraction. This type of approach is a two-phase process: (i) the name candidates are extracted from the text corpus and (ii) the meaningfulness of each candidate is calculated and the highest scoring one is chosen. [Mei et al. 2007b] address the labeling task as an optimization problem that aims at minimizing the Kullback-Leibler (KL) divergence [Kullback & Leibler 1951] between word distributions. In this case, the two compared distributions are the topic itself and the name candidate's word distribution. The obtained distance is an indicator of the name candidate's meaningfulness to the analyzed topic. In [Osinski 2003], frequent keyphrases are extracted from web search snippets and assigned to topics, using the cosine distance to calculate their similarity.

Phrase learning All algorithms that label topics in a post-processing phase share the need for relevant and unambiguous phrases, which later serve as name candidates and are used to synthesize the idea of the group of documents associated to the topic. Comprehensible topic names need to be complete phrases: they take into account the property of polysemy and they are humanly readable. As shown in Section 6.3, words can have different meanings, depending on the context. This is one of the reasons why single words rarely make good topic names. Name candidates should be expressions which are precise enough to specify the meaning of words (e.g., “data mining” compared to the single word “mining”). Humanly readable topic names contain words in their original textual form (not lemmatized, nor stemmed) and all the needed prepositions and articles (e.g., “the Ministry of Internal Affairs”).

Name candidates are also called keyphrases: sequences of one or multiple words that are considered highly relevant as a whole. Based on the learning paradigm, keyphrase learning algorithms can be divided into constructive algorithms and extractive algorithms [Hammouda et al. 2005]. Constructing keyphrases is usually a supervised learning task, often regarded as a more intelligent way of summarizing the text. These approaches make use of external knowledge, under the form of expert validation of the extracted phrases. The obtained results are usually less noisy, but involving a human supervisor makes the extraction process slow, expensive and biased towards a specific field (e.g., specialized to microbiology vocabulary). These approaches do not scale to large datasets of general purpose texts. Examples of such systems are ESATEC [Biskri et al. 2004], EXIT [Roche 2004] and XTRACT [Smadja 1991]. Extracting keyphrases is an unsupervised learning task, in which candidate names are discovered using predefined patterns. This kind of approach scales well to large datasets, but it has the drawback of an almost exponential quantity of extracted keyphrases and a noisy output. Examples of such systems are CorePhrase [Hammouda et al. 2005], Armil [Geraci et al. 2006] and SuffixTree Extraction [Osinski 2003]. According to the employed method, keyphrase learning algorithms can be divided into linguistic approaches, statistical approaches and hybrid approaches [Roche 2004, Buitelaar et al. 2005, Cimiano et al. 2006].

Linguistic approaches

Linguistic approaches are issued from the domains of Terminology and Natural Language Processing. They employ linguistic processing, like phrase analysis or dependency structure analysis. A part-of-speech tagger is usually employed for the morphological characterization of words (e.g., to determine (i) if the word is a noun, adjective, verb or adverb, and (ii) its number, person, mood, tense etc.). Words are further characterized with semantic information and their lemma (see Section 6.2.1). From the texts enriched with syntactic and morphological information, keyphrases are extracted, most often using predefined patterns. TERMINO [Lauriston 1994] uses such predefined patterns to discover keyphrases. LEXTER [Bourigault 1992] uses the morphological information to extract nominal groups from the text and then searches for unambiguous maximal nominal groups. INTEX [Silberztein 1994, Ibekwe-SanJuan & SanJuan 2004] allows the simple definition of morpho-syntactic rules to infer keyphrases. Such rules are, for instance, (i) a keyphrase can contain an adverb, one or multiple adjectives, a single preposition and at least one noun, or (ii) phrases made of a preposition and a determiner are excluded. The keyphrases extracted by INTEX can be further enriched [Ibekwe-SanJuan & SanJuan 2004] by adding variant phrases, using the FASTR system [Jacquemin 1999]. Variant phrases are variations of initial phrases obtained by altering the word order or inserting words (e.g., “online customer support” is a variant phrase of “online support”). Keyphrase extraction methods based on linguistic approaches obtain less noisy output than purely statistical methods, but they are vulnerable to multilingual corpora and neologisms, and they have the tendency to adapt to stereotypical texts (texts from a specified narrow field) [Biskri et al. 2004]. Furthermore, the morphological and syntactic analyzers they employ, as well as their predefined rules, are sensitive to the text quality. This makes linguistic-based approaches scale badly to internet-originated text, which is usually written in an informal style (this is especially true for text on social networks and micro-blogging). Furthermore, they are costly to develop and maintain, due to the expert supervised task of defining patterns, updating them etc.

Statistical approaches

Statistical keyphrase extraction algorithms are based on information retrieval techniques for term indexing [Salton & Buckley 1988]. The underlying assumption is that the sense of a word (see the property of polysemy) is given by its context (i.e., the other words which have a strong relationship with the given word). Statistical measures are employed to detect keyphrases of strongly related words. A widely used measure which quantifies the dependency between the two words of a binary collocation (also called a bigram) is the Mutual Information, given by the formula:

\[ mi(w_i, w_j) = p(w_i, w_j) \times \log \frac{p(w_i, w_j)}{p(w_i) \times p(w_j)} \]

where p(wi) and p(wj) are the probabilities with which the words wi and, respectively, wj appear in the text, while p(wi, wj) represents the probability of the words wi and wj appearing together in a window of a specified length. In [Anaya-Sánchez et al. 2008], bigrams are detected using a window of dimension 11 (5 words before + the considered word + 5 words after). Other tools (e.g., EXIT [Roche 2004], ESATEC [Biskri et al. 2004]) rely on constructing n-grams, by iteratively combining bigrams or increasing the length of a previously discovered (n−1)-gram. Longer collocations, with higher Mutual Information scores, are thus obtained. Many other statistical measures have been proposed to calculate the strength of the relationship between two words. In [Anaya-Sánchez et al. 2008], a modified entropy function is used to determine frequent bigrams from a set of frequent terms. LocalMaxs [da Silva et al. 1999, Dias et al. 2000] uses the Symmetric Conditional Probability measure to extract continuous multiple word units and the Mutual Expectation measure to extract non-continuous multiple word units. Some works study the impact of the most used measures, judging their ability to identify lexically associated bigrams; the compared measures are the t-score, Pearson's χ-square test, the log-likelihood ratio, pointwise mutual information and mutual dependency [Thanopoulos et al. 2002]. There are other approaches that do not rely on bigram detection and n-gram construction for keyphrase extraction. In CorePhrase [Hammouda et al. 2005], keyphrases are considered to naturally lie at the intersection of textual documents in a cluster. The algorithm compares every pair of documents to extract matching phrases. It employs a document phrase indexing graph structure, the Document Index Graph. It constructs a cumulative graph, representing the currently processed documents. When introducing a new document, its associated subgraph is matched against the existing cumulative graph to extract the matching phrases between the new document and all previous documents. The graph maintains a complete phrase structure identifying the containing document and the phrase location, so cycles can be uniquely identified. Simultaneously, it calculates some predefined phrase features that are used for later ranking. Suffix Array-based techniques [Osinski et al. 2004, Osinski & Weiss 2004] can also be used to discover keyphrases.

Incomplete phrases (e.g., “President Barack” instead of “President Barack Obama”), which are often meaningless, can be avoided by using the notion of phrase completeness. A phrase is complete if and only if all of its components appear together in all occurrences of the phrase. In the previous example, if the phrase “President Barack” is followed in all of its occurrences by the term “Obama”, then it is not a complete phrase. Starting from this definition, right and left completeness can be defined (the example above is left complete, but not right complete). Using a Suffix Array data structure [Manber & Myers 1993], the complete phrases can be detected, and the ones occurring with a minimum frequency populate the topic label candidate set. More details about a suffix array-based algorithm for keyphrase extraction can be found in Section 6.4.1. Unlike linguistic approaches, statistical keyphrase extraction systems produce noisy results [Biskri et al. 2004]. While the extracted name candidates pass the frequency threshold and get good statistical scores, many of them hardly capture any semantic meaning. Such are the phrases which (i) are made out of common words, like articles, prepositions, certain verbs etc. (e.g., “he responded that”, “the biggest part of the”) and (ii) bring no information to the topic. The advantage of statistical methods is that they scale well to big datasets and are virtually language independent.
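As a small worked example of the Mutual Information measure defined above, the sketch below scores bigrams using a sliding co-occurrence window; the corpus and window size are toy values chosen for illustration.

```python
import math
from collections import Counter
from itertools import combinations

tokens = ("data mining is the analysis step of knowledge discovery "
          "data mining extracts patterns from data").split()
window = 5

unigrams = Counter(tokens)
pairs = Counter()
for start in range(len(tokens)):
    # count ordered co-occurrences inside each window of the given length
    for wi, wj in combinations(tokens[start:start + window], 2):
        pairs[(wi, wj)] += 1

n_tokens, n_pairs = len(tokens), sum(pairs.values())

def mutual_information(wi, wj):
    p_ij = pairs[(wi, wj)] / n_pairs
    p_i, p_j = unigrams[wi] / n_tokens, unigrams[wj] / n_tokens
    return p_ij * math.log(p_ij / (p_i * p_j)) if p_ij else 0.0

print(mutual_information("data", "mining"))
```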

Hybrid approaches

The advantages and inconveniences of linguistic and statistical approaches are complementary. Hybrid approaches aim at picking the best of the two worlds: scaling well to big datasets and producing less noisy outputs. Hybrid systems usually add linguistic information to an essentially statistical system, or add statistical information to an essentially linguistic system. For example, predefined keyphrase formats can be used to filter the output of statistical methods. [Roche 2004] presents a review of hybrid approaches. For example, XTRACT [Smadja 1991] uses morphological and syntactic taggers in a final phase to filter out the noise from the name candidate set resulting from statistical extraction. In the first phase, bigrams are extracted from a grammatically tagged corpus, using an 11-word window. The second phase consists in extracting longer frequent phrases, called “rigid noun phrases”. The third phase is the linguistic phase: it uses statistical filtering based on the bigram's frequency to associate a syntactic label to the extracted bigrams. Longer phrases can then be constructed based on the bigrams and filtered using predefined syntactic rules.

6.3.4 Topic Evaluation and Improvement

Topic extraction is an unsupervised machine learning task, meaning that no ground truth exists to evaluate the results. This is a classical problem in clustering. Traditional unsupervised tasks are usually evaluated using adapted statistical measures, which quantify the fitness of the obtained results over a particular criterion. Evaluating topics adds another level of difficulty, since constructed topics must not only regroup similar documents, but a semantically coherent idea must also emerge from each topic. For example, topic extraction algorithms based on text clustering can be evaluated using any traditional internal clustering measure [Halkidi et al. 2001] (e.g., cluster variance). But applying such measures does not give any semantic information about the fitness of topics. To take into account the semantic dimension, some authors [Cleuziou 2008, Rizoiu et al. 2010] use semantically annotated datasets for evaluating topics. Each text in the dataset is labeled using one or multiple given tags (e.g., economics, oil). Techniques issued from the supervised learning literature are used: topics are trained on a learning set and measures like precision and recall are calculated on a previously unseen test set. We present such an evaluation technique in Section 6.4.1.

Generative topic modeling is today’s de facto state of the art in topic extraction. Traditionally, these approaches have been evaluated both qualitatively and quantitatively. From the qualitative point of view, a sample of topics is usually exhibited in order to convince the reader. Each exhibited topic is a short list of the first terms, ordered by decreasing probability. Most of the time, a topic label is manually assigned by the authors. A topic is considered to be good if most of its most probable words are semantically similar. Quantitatively, statistical measures, such as the perplexity [Wallach et al. 2009], are employed to assess the ability of extracted topics to generalize to unseen data (either through a train set/test set approach or through cross-validation). Even though topic models show good predictive power on new text, their weak point is the underlying assumption that the latent space is semantically meaningful. In [Chang et al. 2009a], an extensive quantitative and qualitative evaluation of the interpretability of the latent space is performed. It is shown that human judgment does not always coincide with the statistical evaluation measures. Highly scoring topics are sometimes not humanly comprehensible, showing that statistical measures do not always manage to capture the semantics of the dataset. [Newman et al. 2010] addresses this lack of semantic comprehension by using external semantic resources (e.g., Wikipedia, Google) for the evaluation task and uses topic coherence as an evaluation metric. Similarly, concept hierarchies (such as WordNet [Miller 1995]) are used for topic evaluation purposes. In [Musat et al. 2011b, Musat 2011], we propose a topic-concept mapping, in which each topic is assigned a topical subtree in the concept hierarchy. The measures of specificity and coverage are defined and used to evaluate topics. This method will be presented in detail in Section 6.4.2.
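To give a flavour of such a topic-concept mapping, the sketch below maps a topic's top words onto WordNet noun concepts with NLTK. The depth statistic it prints is only illustrative: it is not the specificity or coverage measure defined in Section 6.4.2, and the topic words are invented for the example.

```python
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

topic_words = ["oil", "petrol", "fuel", "energy", "barrel"]

# keep the first noun sense of each word that WordNet knows about
synsets = [wn.synsets(w, pos=wn.NOUN)[0]
           for w in topic_words if wn.synsets(w, pos=wn.NOUN)]

# the depth of a concept in the hypernym hierarchy gives a rough idea of how
# specific the mapped topical subtree is
depths = [s.min_depth() for s in synsets]
print({s.name(): d for s, d in zip(synsets, depths)})
print("average depth of mapped concepts:", sum(depths) / len(depths))
```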

Many approaches have been proposed in recent years for improving topic extraction or for adapting it to specific applications. Most often, the improvement techniques deal with embedding or leveraging external semantic information. In [Musat et al. 2011a, Musat 2011], we propose a technique for removing topic outliers (i.e., words unrelated to a certain topic). It uses the same topic-concept mapping presented earlier and it will be further detailed in Section 6.5.1. sLDA [Blei & McAuliffe 2008] introduces semantic information under the form of supervision of the topic modeling process. Other approaches adapt the topical model to specific applications. Latent Dirichlet Allocation with WordNet [Boyd-Graber et al. 2007] is a topic model that uses semantic information (under the form of WordNet) for word sense disambiguation. It is a version of LDA that considers the word senses as hidden variables and attempts to select the right sense when constructing topics.

6.4 Extract, Evaluate and Improve topics

In this section, we present our solutions to the tasks of topic extraction and topic evaluation, discussed in Section 6.1. The focus of our work is, as for most of this thesis, leveraging semantic information into the learning algorithms. The topic extraction system presented in Section 6.4.1 addresses the problem of inferring topics through means of overlapping text clustering. Such an approach has the advantage of authorizing textual documents to be part of multiple topics, depending on the approached subjects. Unlike previous work which uses overlapping text clustering [Cleuziou 2008], our approach also addresses the problem of human comprehension of topics by assigning each cluster a “humanly comprehensible” name, chosen from a list of frequent complete expressions. Therefore, the user is presented with a readable label instead of a distribution of frequencies. The topic extraction system we present in Section 6.4.1 was implemented in the open-source CKP topic extraction software, one of the algorithms used by the CommentWatcher platform. For the topic evaluation task, the underlying assumption is that statistical measures do not completely succeed in emulating the human judgment of topics. Therefore, we propose, in Section 6.4.2, an approach which uses an existing semantic resource, such as a concept hierarchy (e.g., WordNet [Miller 1995]). Using a topic's highly ranked terms, we map it to a subtree in the concept hierarchy, therefore linking a statistically extracted distribution of frequencies to a semantic-aware structure. We redefine the specificity and coverage of the subtree, based on its height and depth in the concept hierarchy, and we use them to evaluate the semantic cohesion of topics. We have ongoing work to integrate the proposed semantic-aware evaluation of topics into CommentWatcher. Once the topic evaluation is implemented into the CommentWatcher framework, our developed software will become a veritable integrated topic extraction and evaluation system. Given its modular nature, multiple topic extraction systems could be compared and semantically evaluated. In the remainder of this section, we present in detail the topic extraction component (in Section 6.4.1) and the topic evaluation component (in Section 6.4.2). The experimental validation of each component is succinctly presented; for more details about the validation, the reader is invited to refer to the corresponding publications.

6.4.1 Topic Extraction using Overlapping Clustering

We present a topic extraction system which relies on overlapping text clustering and complete keyphrase extraction. Starting from a collection of textual documents (e.g., online discussions, forums, chats, newspaper articles etc.), the algorithm extracts the discussion topics and presents (i) the topic labels, in the form of humanly readable keyphrases, and (ii) the partition of texts around the topics. Figure 6.4 presents the schema of the proposed topic extraction system. In phase 1, each of the documents in the dataset is preprocessed, as discussed in Section 6.2.1: stopwords are removed and the inflected words are stemmed. After the preprocessing, the documents are translated into the Vector Space Model representation (see Section 6.2.2) using one of the term weighting schemes. In phase 2, the documents are clustered using the OKM algorithm and an overlapping partition is obtained. Each document can be part of one or multiple groups.

Figure 6.4 – Schema of a topic extraction algorithm using overlapping text clustering.

From the original text of the documents, complete frequent keyphrases are extracted, in phase 3, using a Suffix Array-based algorithm. The extracted keyphrases serve as topic label candidates. In phase 4, the topic label candidates are reintroduced as pseudo-documents into the multidimensional space defined by the Vector Space Model, and the cosine distance is used to choose the best name for each topic.

6.4.1.1 Phase 2 : Clustering

The text clustering is performed using OKM [Cleuziou 2008], a K-Means variant which allows documents to belong to more than one cluster. It inherits from K-Means most of its drawbacks (i.e., its strong dependence on the initialization and on the number of clusters, which must be set arbitrarily by the user) and its advantages (i.e., linear execution time, exposing a centroid). Nonetheless, OKM is chosen for the text clustering task for its capacity to create an overlapping partition of the dataset. This is especially important for two reasons: the polysemy of words and multi-topic documents. Words should be allowed to be associated with multiple topics, according to their corresponding senses. Similarly, documents that address multiple themes should be authorized to be part of their multiple corresponding topics. OKM follows the general K-Means schema of iteratively optimizing an objective function by relocating the cluster centroids and re-assigning documents to clusters. In K-Means, each document is associated with only one centroid, the one closest in terms of the employed distance. OKM assigns a document to multiple clusters by constructing a document image as the gravity center of all the associated centroids. Let us take the example of a document d and centroids c1, c2 and c3. Without loss of generality, we consider that

$$\|d - c_1\|_{\cos} < \|d - c_2\|_{\cos} < \|d - c_3\|_{\cos}$$

where $\|\cdot\|_{\cos}$ is the cosine distance defined in Equation 6.1. Let $\bar{d}$ be the image of the document d and A be the set of centroids to which the document d is assigned. Initially, the document d is assigned to the closest centroid ($A \leftarrow c_1$) and, therefore, $\bar{d} = c_1$. The second closest centroid (c2) is now considered and the new document image is computed as the gravity center of c1 and c2 ($\bar{d}_{new} = gravity(c_1, c_2)$). If the new document image is closer to the document than the old image ($\|d - \bar{d}_{new}\|_{\cos} < \|d - \bar{d}_{old}\|_{\cos}$), then the document d is also assigned to the cluster with the centroid c2 ($A \leftarrow A \cup \{c_2\} = \{c_1, c_2\}$). The process

is continued with c3: $\bar{d}_{new} = gravity(c_1, c_2, c_3)$. If the new document image is not closer to d than the old image ($\|d - \bar{d}_{new}\|_{\cos} \geq \|d - \bar{d}_{old}\|_{\cos}$), then the process stops and the document d is assigned to the clusters with the centers c1 and c2. In the centroid update step, each document di contributes to each associated centroid (each cj ∈ Ai). The contribution is inversely proportional to the number of clusters to which the document belongs: the more clusters the document is part of, the less influence it has on the update of their respective centroids. The update formula for the centroids is:

$$c_j = \frac{1}{\sum_{d_i \in C_j} \frac{1}{\delta_i^2}} \times \sum_{d_i \in C_j} \frac{\hat{d}_i^{\,j}}{\delta_i^2} \qquad (6.3)$$

where $\hat{d}_i^{\,j} = \delta_i d_i - (\delta_i - 1)\,\bar{d}_i^{\,A_i \setminus \{c_j\}}$. Notations:
– Ai is the set of centroids to which the document di is assigned;
– δi = |Ai| is the number of centroids associated with document di;
– Cj is the collection of documents associated with the centroid cj;
– $\bar{d}_i^{\,A_i \setminus \{c_j\}}$ is the image of document di, excluding centroid cj (the gravity center of the centroids to which di is associated, except centroid cj);
– cj is the centroid to be updated.
Unlike K-Means, in OKM the update of a centroid (Equation 6.3) depends not only on the documents in its cluster, but also on the other centroids (through the document image). As a result, centroids can continue to evolve in the multidimensional space even if the cluster composition has stopped changing. In the topic labeling process, topic names are associated with each resulting cluster based on the similarity between the name candidate and the cluster’s centroid. Therefore, we modify the stopping criterion of the original OKM algorithm proposed in [Cleuziou 2008] in order to allow the centroids to evolve until they are fully adapted to the documents in the cluster. We define a threshold ε and we stop the clustering process when the variation of the objective function (see Equation 6.2, p. 133) between two iterations falls below the threshold. In practice, this means that the last few iterations of the clustering algorithm are performed only to refine the centroids and adapt them to the documents in their respective clusters. The clustering phase creates a data partition that regroups documents according to their topic similarity. Simultaneously, this phase outputs the centroids of each cluster, which can be regarded as abstract representations of the topics. These centroids are multidimensional vectors in the Vector Space Model, with high scores for the words that are specific to the cluster (i.e., the words that are characteristic of the specific topic).
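To make the assignment rule above concrete, here is a minimal sketch of the OKM assignment step in Python (an illustrative re-implementation, not the original OKM code; the helper names cosine_distance, gravity and okm_assign are ours). A document is greedily assigned to additional centroids as long as the gravity center of the selected centroids gets closer to the document.

import numpy as np

def cosine_distance(a, b):
    # cosine distance: 1 - cosine similarity (assumes non-zero vectors)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def gravity(centroids):
    # gravity center (mean) of a list of centroid vectors
    return np.mean(centroids, axis=0)

def okm_assign(doc, centroids):
    """Assignment step of OKM: return the indices of the centroids associated
    with the document and the resulting document image."""
    order = np.argsort([cosine_distance(doc, c) for c in centroids])
    assigned = [order[0]]              # start with the closest centroid
    image = centroids[order[0]]        # current document image
    for j in order[1:]:
        candidate = gravity([centroids[i] for i in assigned] + [centroids[j]])
        if cosine_distance(doc, candidate) < cosine_distance(doc, image):
            assigned.append(j)         # adding c_j brings the image closer to the document
            image = candidate
        else:
            break                      # stop at the first centroid that does not improve the image
    return assigned, image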

6.4.1.2 Phase 3 : Keyphrase Extraction

In order to populate the topic label candidate set, we extract frequent keyphrases from the original text of the documents. Common words (e.g., prepositions, articles) that we removed in the preprocessing phase of the text clustering are necessary for the human comprehension of topic labels. The topic labels need to be correctly formed expressions. The initial text is usually correctly formed; therefore, it suffices to extract sequences of words from the initial text which fulfill several conditions [Osinski 2003]:

Table 6.1 – The suffix array for the phrase “to be or not to be”

No   Suffix                 Start Pos
1    be                     6
2    be or not to be        2
3    not to be              4
4    or not to be           3
5    to be                  5
6    to be or not to be     1

– they appear in the text with a minimum specified frequency. The underlying assumption is that keyphrases which occur often in the text are related to the discussed topic to a higher degree than rare ones;
– they do not cross the sentence boundary. Usually, meaningful keyphrases are contained within a sentence, since sentences represent markers of topical shift;
– they are a complete phrase, as defined in Section 6.3.3 (e.g., “President Barack” vs. “President Barack Obama”);
– they do not begin or end with a stopword. For increased readability, cluster name candidates are stripped of leading and trailing stopwords, though stopwords inside the phrase are preserved.
The keyphrases are extracted using a Suffix Array-based [Manber & Myers 1993] algorithm, motivated by its capability to process raw untreated text, its language independence, its efficient execution time and its power to extract humanly readable phrases. This approach has proved very efficient. [Yamamoto & Church 2001] used a Suffix Array-based algorithm to compute term frequency and document frequency for all n-grams in large corpora. Further execution time optimizations are proposed in [Abouelhoda et al. 2002], who apply it to the problem of optimum exact string matching. [Kim et al. 2003] propose an algorithm for linear time construction of the suffix array data structure. The keyphrase extraction algorithm uses the property of phrase completeness and has two phases: in the first phase, the left complete and right complete expressions are found; in the second phase, the two sets are intersected to obtain the set of complete expressions.

Suffix Array Construction A Suffix Array is an alphabetically ordered array of all the suffixes of a string. We note that in the case of keyphrase extraction, the fundamental unit is not the letter (as in the case of classical strings), but the word. For example, the suffix array for the phrase “to be or not to be” is shown in Table 6.1. The bottleneck of the construction of the suffix array data structure is the sorting of the suffixes. Two approaches are compared in [Larsson 1998] from the theoretical and practical performance point of view: Manber and Myers’s algorithm and Sadakane’s algorithm. Our keyphrase extraction algorithm implements the latter, as it was shown to obtain better results in terms of execution time. Although in Table 6.1, for the sake of clarity, we have ordered the suffixes alphabetically, the sorting algorithm only requires that the terms have a lexicographic order. The arrival order of words into the collection can also be used, which further speeds up the sorting. Sadakane’s sorting algorithm is a modified bucket sort, which takes into consideration the unequal lengths of the suffixes. Considering that a keyphrase cannot cross the boundary of a sentence, we modify the construction of the suffix array so that it is sentence-based. In practice, we build a suffix array for each sentence and then we merge everything into a single structure. Therefore, a suffix is identified by its starting position and the index of its containing sentence.
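As an illustration, the following sketch builds a word-level, sentence-based suffix array by plain lexicographic sorting (a simplified stand-in for Sadakane’s bucket-based algorithm, whose implementation is not reproduced here):

def build_suffix_array(sentences):
    """sentences: list of sentences, each a list of words.
    Returns (sentence_index, start_position) pairs sorted by the suffix they denote."""
    entries = [(s, i) for s, words in enumerate(sentences) for i in range(len(words))]
    # suffixes never cross sentence boundaries; sort them by their word sequence
    entries.sort(key=lambda e: sentences[e[0]][e[1]:])
    return entries

# the classical example of Table 6.1
words = ["to", "be", "or", "not", "to", "be"]
for s, i in build_suffix_array([words]):
    print(i + 1, " ".join(words[i:]))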

Complete Phrase Discovery The right complete keyphrases are discovered by linearly scanning the constructed suffix array in search of frequent prefixes and counting their number of occurrences. Information about position and frequency is stored alongside the identified prefixes. Discovering the left complete phrases is achieved by applying the same algorithm to the inverse of the document. A version of the document which has the words in reverse order is created, right complete phrases are detected and the correct left complete phrases are recovered by inverting the order of the extracted prefixes. Both the left complete and the right complete phrase sets are in lexicographic order, therefore they can be intersected in linear time. Name candidates are returned along with their frequency. The last phase is filtering the obtained phrase set using the minimum frequency condition and stripping leading and trailing stopwords. Only phrases that appear in the text with the minimum frequency are kept; the rest are eliminated. [Osinski 2003] sets the value of this threshold between 2 and 5: the most frequent expressions are not necessarily the most expressive. They are usually frequent expressions made out of common words (e.g., “in order to”). In the end, leading and trailing stopwords are recursively eliminated from the phrases. This further filters the candidate set. Some of the most frequent candidates disappear completely (those composed solely of stopwords). Others become duplicates of existing phrases (e.g., “the president” and “president of” both duplicate “president” when the leading “the” and the trailing “of” are stripped).
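The final filtering step described above can be sketched as follows (a simplified illustration; the stopword set and the frequency threshold are placeholders):

def filter_candidates(candidates, stopwords, min_freq=3):
    """candidates: dict mapping a phrase (tuple of words) to its frequency in the corpus."""
    kept = {}
    for phrase, freq in candidates.items():
        if freq < min_freq:
            continue                              # minimum frequency condition
        words = list(phrase)
        while words and words[0] in stopwords:
            words.pop(0)                          # strip leading stopwords
        while words and words[-1] in stopwords:
            words.pop()                           # strip trailing stopwords
        if not words:
            continue                              # candidate made solely of stopwords
        kept[tuple(words)] = kept.get(tuple(words), 0) + freq   # merge resulting duplicates
    return kept

# "the president" and "president of" collapse onto "president"; "of the" disappears
print(filter_candidates({("the", "president"): 4, ("president", "of"): 3, ("of", "the"): 9},
                        stopwords={"the", "of"}, min_freq=2))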

6.4.1.3 Phase 4 : Topic Labeling

The result of the text clustering phase is a multidimensional space into which documents are translated, the clusters that thematically regroup the documents and the centroids, which summarize each cluster. The keyphrase extraction phase generates a list of topic label candidates. In the last phase of the topic extraction, a suitable name is chosen among the name candidates to label each topic. This is done by introducing the name candidates as “pseudo-documents” into the same vectorial space defined for the document collection. The keyphrases are extracted from natural language texts, so they may contain inflected words and stopwords. The same preprocessing (i.e., stopword removal, stemming) and the same term weighting scheme are used for the name candidates as for the original documents. After translating them into the Vector Space Model, the last step is to calculate the similarity between the “pseudo-documents” and the centroids of each cluster. The highest scoring candidate is chosen to serve as the topic label. Other than labeling topics, this approach can be used to filter semantically irrelevant expressions from a phrase set. As centroids are abstractions of the documents in their respective clusters, choosing phrases that are close in terms of the considered distance naturally eliminates phrases that are too general or semantically irrelevant. This has a similar effect to adding linguistic filters to statistical methods (presented in Section 6.3.3), but without their language and field dependency. For example, in a document group that talks mainly about politics, the highest scoring words would naturally be “parliament”, “govern”, “president”, “party”, “politics” etc. When calculating the similarity between the group’s centroid and the phrase candidates, phrases like “presidential elections” would have a higher similarity than semantically irrelevant phrases like “as a matter of fact”.
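A minimal sketch of this labeling step is given below; the vectorize argument stands for a hypothetical function applying the same preprocessing and term weighting as for the documents.

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def label_topics(centroids, candidates, vectorize):
    """centroids: list of centroid vectors; candidates: list of keyphrase strings;
    vectorize: maps a phrase to a vector in the same Vector Space Model as the documents."""
    candidate_vectors = [vectorize(phrase) for phrase in candidates]
    labels = []
    for centroid in centroids:
        scores = [cosine_similarity(centroid, v) for v in candidate_vectors]
        labels.append(candidates[int(np.argmax(scores))])   # the best scoring candidate labels the topic
    return labels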

6.4.1.4 Experimental Validation

In this subsection, we briefly present some experiments and results that can be obtained with this topic extraction system. The performed experiments are still preliminary, being performed on small datasets. We plan to perform a more complete batch of experiments and to compare with other topic extraction algorithms once we complete the integration of the semantic-aware topic evaluation into CommentWatcher. We choose to present these preliminary experiments mostly from a qualitative point of view, to show the capabilities of the proposed system, and most notably the topic labeling part. The experiments are performed on an English and a French dataset. The English dataset is a subset of the Reuters 6 corpus, composed of 262 documents. The writing style is that of journal articles; each text contains between 21 and 1000 words. The French dataset 7 comes from a reader discussion forum attached to a news article entitled Commemoration 8. It has an informal writing style and is composed of 272 documents, each one containing between 1 and 713 words.

Qualitative evaluation Table 6.2 presents an example of three topics extracted from the Reuters subset. The 10 highest rated words, the 3 highest scoring documents and the chosen topic label are presented for each topic. Note that the highest rated words are in their stemmed version and, therefore, they are not always existing words. The first column shows the extracted topic labels: “oil and gas company”, “tonnes of copper” and “united food and commercial workers”. The second column presents the ten top scoring words associated with each topic. The words are presented in their stemmed version. The next two columns indicate the number of documents covered by each topic and three examples of documents from each cluster. Let us consider the example of the topic which covers the maximum number of texts: “oil and gas company”. The first two texts talk explicitly about the economic activities of companies that operate in the oil and natural gas business (buying oil and natural gas properties in the first case and estimating reserves in the second). On the other hand, the third document talks about the “food for oil” program between Brazil and Iraq. Unlike the first two documents, the text does not refer to an oil company. Nevertheless, the document is placed under the same topic, because it addresses the theme of “oil and gas”. It also deals with the theme of food, which is why the document is also found under the topic “united food and commercial workers”. This emphasizes the importance of the overlapping property of the clustering algorithm.

6. http://mlr.cs.umass.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection
7. http://eric.univ-lyon2.fr/~arizoiu/files/commemoration.tar.bz2
8. http://www.liberation.fr/societe/0101220668-y-a-t-il-trop-de-commemorations-en-france

Table 6.2 – Example of topics extracted from the Reuters dataset.

Topic label: “oil and gas company”
Highest rated words: oil, mln, ga, year, barrel, billion, lt, compani, reserv, natur
Number of docs: 169
Text excerpts:
– Kelley Oil and Gas Partners Ltd said it has agreed to purchase all of CF Industries Inc’s oil and natural gas properties for about 5,500,000 dlrs, effective July 1. It said the Louisiana properties had proven reserves at year-end of 11 billion cubic feet of natural gas and 85,000 barrels of oil, condensate and natural gas liquids. Kelley said it currently owns working interests in some of the properties.
– Hamilton Oil Corp said reserves at the end of 1986 were 59.8 mln barrels of oil and 905.5 billion cubic feet of natural gas, or 211 mln barrels equivalent, up 10 mln equivalent barrels from a year before.
– Brazil will export 6,000 tonnes of poultry and 10,000 tonnes of frozen meat to Iraq in exchange for oil, Petrobras Commercial Director Carlos Sant’Anna said. Brazil has a barter deal with Iraq and currently imports 215,000 barrels per day of oil, of which 170,000 bpd are paid for with exports of Brazilian goods to that country.

Topic label: “tonnes of copper”
Highest rated words: tonn, copper, cent, price, mine, effect, beef, lb, meat, export
Number of docs: 100
Text excerpts:
– Mountain States Resources Corp said it acquired two properties to add to its strategic minerals holdings. The acquisitions include a total of 5,100 acres of titanium, zirconium and rare earth resources, the company said. (...) The company also announced the formation of Rare Tech Minerals Inc, a wholly-owned subsidiary.
– Magma Copper Co, a subsidiary of Newmont Mining Corp, said it is cutting its copper cathode price by 0.75 cent to 66 cents a lb, effective immediately.
– Newmont Mining Corp said Magma Copper Co anticipates being able to produce copper at a profit by 1991, assuming copper prices remain at their current levels. In an information statement distributed to Newmont shareholders explaining the dividend of Magma shares declared Tuesday

Topic label: “united food and commercial workers”
Highest rated words: unit, compani, plant, union, beef, lt, offer, contract, iowa, term
Number of docs: 93
Text excerpts:
– The United Food and Commercial Workers union, Local 222 said its members voted Sunday to strike the Iowa Beef Processors Inc Dakota City, Neb., plant, effective Tuesday. The company said it submitted its latest offer to the union at the same time announcing that on Tuesday it would end a lockout that started December 14. (...)
– Brazil will export 6,000 tonnes of poultry and 10,000 tonnes of frozen meat to Iraq in exchange for oil, Petrobras Commercial Director Carlos Sant’Anna said. Brazil has a barter deal with Iraq and currently imports 215,000 barrels per day of oil, of which 170,000 bpd are paid for with exports of Brazilian goods to that country.
– European Community agriculture ministers agreed to extend the 1986/87 milk and beef marketing years to the end of May, Belgian minister Paul de Keersmaeker told a news conference. He said the reason for the two-month extension of the only EC farm product marketing years which end during the spring months was that it would be impossible for ministers formally to agree 1987/88 farm price arrangements before May 12. (...)


Figure 6.5 – Fscore as a function of the number of clusters for (a) OKM and K-Means with Term Frequency and (b) OKM with different term weighting schemes.

Quantitative evaluation The text clustering phase and the topic labeling phase are evaluated individually. We study (a) the behavior of OKM in text clustering compared to the K-Means baseline and (b) the influence of the term weighting scheme on text clustering results. The number of clusters is varied between 5 and 30, the clustering is re-initialized 10 times and averages are reported. The main approach towards evaluating the quality of the resulting partition is to use the classical precision, recall and Fscore indicators on a corpus that has been tagged a priori by human experts. A sub-collection of the Reuters corpus is used, more precisely the documents that have at least one associated tag. Two documents are considered to be “correctly” clustered if they are partitioned into the same cluster and they have at least one tag in common. The results in Figure 6.5a show that OKM outperforms the classical crisp algorithm when used for text clustering. In Figure 6.5b, we show the evolution of the Fscore for multiple term weighting schemes. Our experiments show that the Term Frequency weighting scheme outperforms the other classical schemes presented in Section 6.2.2.
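Our reading of this pairwise evaluation is sketched below (an assumption on our part: a pair of documents counts as clustered together if the documents share at least one cluster, and as correctly clustered if they also share at least one expert tag):

from itertools import combinations

def pairwise_fscore(clusters, tags):
    """clusters: list of sets of cluster ids per document (overlaps allowed);
    tags: list of sets of expert tags per document."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(clusters)), 2):
        same_cluster = bool(clusters[i] & clusters[j])
        same_tag = bool(tags[i] & tags[j])
        if same_cluster and same_tag:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_tag:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0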

Human judges are often used in the literature (e.g., [Osinski 2003, Chang et al. 2009a]) to assess the interpretability of the extracted topics. The argued reason is that the literature does not provide a widely accepted measure for quantifying topic label quality. Moreover, topic names need to be humanly readable and they need to synthesize the idea behind a group of texts. Therefore, evaluating them amounts to evaluating “human tastes”. We choose a similar approach in order to evaluate the topic labels associated with each cluster. Five experts were given the labeled topics and were asked to assess whether the label gives no information about the topic (grade 0), has an average quality (grade 1) or expresses a comprehensible idea (grade 2). The results are presented in Figure 6.6a for Reuters and in Figure 6.6b for Commemoration, in the form of stacked bars. The results show a rather good acceptance by the users of the constructed labels. The Term Frequency and the Presence/Absence term weighting schemes obtain around 90% of good and average scores.


Figure 6.6 – Topic labeling quality as a function of the term weighting scheme for (a) the Reuters dataset and (b) the Commemoration dataset.

6.4.2 Topic Evaluation using a Concept Hierarchy

As shown in Section 6.3, most topic extraction algorithms are statistical-only approaches. With the exception of a minority of algorithms, some of which, concerning topic labeling, were already presented in Section 6.3.3, text clustering-based algorithms and generative model-based algorithms use numeric-only methods to capture the semantics of Natural Language. Furthermore, even the evaluation is often performed using statistical measures only, as seen in Section 6.3.4. Therefore, the entire topic extraction and evaluation process short-circuits language semantics, assuming that the statistical process succeeds in capturing the meaning. Recent works like [Chang et al. 2009a] have proved this assumption to be too optimistic, by showing that human judgment does not always coincide with the statistical results. The topic extraction community has addressed this problem and the literature shows examples of algorithms that leverage external semantic resources (see Section 6.3.2). One of the most popular and most employed lexical resources in Natural Language Processing is WordNet [Miller 1995]. It is also an important resource used in computational linguistics, text analysis and other areas. WordNet is a lexical database for the English language and one of the largest existing semantic networks. The relations of hypernymy and hyponymy exist between certain WordNet building blocks. These relations can be interpreted as specialization relations between conceptual categories, thus WordNet can be interpreted and used as a lexical ontology [Gangemi et al. 2003]. Words are grouped in WordNet into synsets, sets of synonyms. These are paired with glosses, which are short, general definitions of the synset. The resulting tuples are then linked by various types of relations, including hypernymy, hyponymy, meronymy, holonymy or antonymy. Given the property of polysemy, each word may have several senses and, for each sense, a set of synonyms. As for topic evaluation, automatic semantic-based evaluation systems are scarce (see Section 6.3.4). We propose an original system that uses a semantic resource in the form of a concept hierarchy (in our case, WordNet) to automatically evaluate topics. The underlying idea is, when evaluating topics, to emulate the human judgment which served in the creation of the semantic resource.

The key point of the system is a topic-concept mapping that passes through words. This is equivalent to associating a statistical-only result with a semantic-aware structure. The mapping is very similar to the task of Word Sense Disambiguation (WSD, also known as term disambiguation) needed in automatic ontology learning. The idea is to search for a concept or a set of concepts which are semantically related to at least one sense of each of the highly ranking words of a topic. Each topic is thus associated with a topical subtree in the concept hierarchy. This allows us to associate each topic with a “most related” concept and to evaluate the topics based on the strength of their topic-concept relation. The relation’s strength is calculated by redefining the topical subtree’s coverage of the topic and the specificity of its root concept.

In the next subsections, we present in detail the topic-concept mapping and the employed notions and measures. To help the reader, we gather in Table 6.3 all the notations that gradually appear in the next subsections. For each notation, we give its meaning and the equation and/or page where it first appears.

Table 6.3 – Notations used in the next sections.

Notation         Equation definition    Meaning
C                6.4 (p. 151)           The employed concept hierarchy.
c ∈ C            6.4 (p. 151)           A concept in the concept hierarchy C.
c0(C)            6.6 (p. 155)           The root concept of the concept hierarchy C.
Cc               6.9 (p. 156)           The subtree having the concept c as a root.
µ                p. 154                 A topic to be evaluated.
P(µ)             6.9 (p. 156)           The set of relevant words of the topic µ.
w                p. 152                 A word that appears in a given topic.
Cµ               6.11 (p. 157)          The topical subtree of topic µ in the entire concept hierarchy.
copt(µ)          6.11 (p. 157)          The root of the “optimum” subtree of the topic µ.
δ                6.4 (p. 151)           An operator that returns the set of nodes in a given structure.
δ(C)             6.4 (p. 151)           The set of nodes in the concept hierarchy C.
δ(ci, cj)        6.4 (p. 151)           Set of nodes within the branch connecting the concepts ci and cj.
d(ci, cj)        6.4 (p. 151)           Edge-based distance between concepts ci and cj.
δ(w)             p. 152                 Set of concepts (senses of the word w) covered by the hierarchy C.
δ(µ)             6.7 (p. 156)           Set of concepts related to the topic µ.
δ(µ, Cc)         6.9 (p. 156)           Set of concepts in the topical subtree of the topic µ and concept c.
d(w, c)          6.5 (p. 152)           The distance between a word w and a concept c.
depth(c)         6.6 (p. 155)           The depth of the concept c in the hierarchy C.
iheight(µ, c)    6.7 (p. 156)           Inverse height of the topical subtree of the topic µ and concept c.
spec(µ, c)       6.8 (p. 156)           Specificity of the topical subtree of the topic µ and concept c.
cov(µ, c)        6.9 (p. 156)           Coverage of the topical subtree of the topic µ and concept c.
φ(µ, c)          6.10 (p. 156)          The fitness of the topical subtree of the topic µ and concept c.
φ(µ, copt(µ))    p. 157                 The topic fitness score of the topic µ.
ωd               6.8 (p. 156)           The weight of depth(c) in spec(µ, c).
ωh               6.8 (p. 156)           The weight of iheight(µ, c) in spec(µ, c).
ωspec            6.10 (p. 156)          The weight of spec(µ, c) in φ(µ, c).
ωcov             6.10 (p. 156)          The weight of cov(µ, c) in φ(µ, c).

6.4.2.1 Topic - Concept Mapping

The main difficulty when embedding semantic knowledge into an essentially statistical process is the mapping between the two. Given a semantic resource in the form of a concept hierarchy, the key point is to map topics to concepts (or to a structure of concepts). We consider that each of the concepts in the hierarchy represents senses of words. Knowing that a topic is most often defined as a list of words, ordered decreasingly by their score, we propose a mapping function between topics and concepts that passes through words. Figure 6.7 presents, in a nutshell, the proposed mapping. The most relevant words are selected for each topic, by using, for example, a threshold on their topic score. Each word can be mapped to a set of concepts, which represent the different senses of the word. Therefore, each topic is mapped to a topical subtree in the concept hierarchy (denoted with a dashed line).


Figure 6.7 – Mapping topics to concepts through words.

Topic - word mapping Each topic is associated with its most relevant words. This idea is similar to the one used in [Chang et al. 2009a], where those words alone are considered sufficient to satisfyingly assess a topic’s quality. The most relevant words are usually the highest scoring words for a topic (e.g., in the case of generative topic modeling, the word’s probability under the given topic). Any user-defined function can be used, as long as it outputs numerical scores for (topic, word) couples, which can be used to rank words and filter a set of relevant words by using a threshold. The topic-concept mapping and the consequent evaluation consider each topic to be defined by its most relevant words. Thus it does not interfere with the creation of the topics and is compatible with most existing topic models.
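For instance, such a thresholded selection of relevant words could be as simple as the following sketch (the threshold value is purely illustrative):

def relevant_words(topic_word_scores, threshold=0.05):
    """topic_word_scores: dict word -> score (e.g., the word's probability under the topic)."""
    return {w for w, score in topic_word_scores.items() if score >= threshold}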

Concept hierarchy The prior semantic knowledge is available to our evaluation approach in the form of a hierarchy of concepts, linked in a tree structure using a specific relation. Our system uses WordNet as the source of semantic knowledge. We assimilate a (synset, gloss) tuple to a concept and use the hypernymy/hyponymy relation as the linking relation of the concepts. A branch in the concept hierarchy is a path between two concepts, which are either related directly (in which case the branch contains just the two concepts) or indirectly (in which case the branch contains the two respective concepts and all the concepts in between). In the case of WordNet’s hypernymy/hyponymy relation, a branch between two concepts implies that one concept is the generalization of the other. Given such a structure, it is possible to determine when concepts are related and to what degree. We consider that the semantic similarity of concepts is inversely correlated with a measure of how far apart they are positioned within the tree. If the distance between two concepts is low, then they are considered to be semantically similar. We choose to define an edge-based distance between two concepts: the length of the shortest branch between the two concepts, or infinity if the concepts are unrelated. Of course, other measures exist in the literature [Pedersen et al. 2004, Patwardhan & Pedersen 2006], and trying them is a future perspective of our work. Formally, we define the distance between two concepts:

$$d(c_i, c_j) = \begin{cases} |\delta(c_i, c_j)| - 1, & \text{if there exists a branch between } c_i \text{ and } c_j \\ \infty, & \text{otherwise} \end{cases} \qquad (6.4)$$

where ci and cj are two concepts in the concept hierarchy C and δ(ci, cj) is the set of all the nodes within the branch connecting ci and cj. More generally, δ is an operator that returns the set of nodes in a given structure. δ(C) is, for example, the set of nodes in the concept hierarchy C.

Word - concept mapping Each concept in the concept hierarchy that we employ is associated with a list of words that have the same meaning (the synset). Inversely, given the polysemy of words, each word can have multiple meanings and be, therefore, part of the synsets of multiple concepts. When a word is part of the synset of a concept, we say that the concept is a sense of the respective word. In conclusion, each word w is associated with a set of concepts δ(w) ⊂ δ(C), which is the set of its senses covered by the concept hierarchy C. Given the set of senses of a word and a distance which quantifies the strength of the semantic relation between two concepts, we can define the distance between a word w and a concept c as the minimum distance between c and a sense of w. For example, Figure 6.8 shows two of the senses of the word “mining”: the concepts “metal mining” and “data mining”. The distance between the word “mining” and the concept “knowledge management” is 2 (they are semantically related), since there is a branch of length 2 (highlighted in Figure 6.8) between the target concept “knowledge management” and the concept “data mining”. The concepts “metal mining” and “knowledge management” are unrelated and have a distance of ∞. Formally, we define the distance between a word w and a concept c as:

$$d(w, c) = \min_{c_w \in \delta(w)} d(c_w, c) \qquad (6.5)$$

The subtree of a concept c is the subtree that has c as its root. It is constructed as the union of all the branches that connect the concept c to leaf concepts (concepts on the last level, the most specific concepts in the hierarchy). Intuitively, the subtree of a concept contains all the possible specializations of the given concept (e.g., the subtree of “mammal” would contain “cat”, “dog”, “bear” and the even more specific “birman cat”, “hunting dog” or “polar bear”, but not “lizard”). A word’s subtree is the union of all the branches that connect the concepts that are senses of the given word (δ(w)) to the root.


Figure 6.8 – Calculating the distance between the word “mining” and the concept “knowledge management”. Here the distance is 2.

A word’s subtree contains all the possible generalizations of a word (more precisely, the generalizations of the senses of the word). In the example of the word “mining”, the word’s subtree would contain “resource mining”, “natural resource extraction”, “activity”, but also “data handling”, “artificial intelligence”, “computer science”. Figure 6.8 shows an excerpt of the subtree of the word “mining”. A word’s subtree of a concept is the concept hierarchy that contains all the generalizations of the senses of a word up to a certain concept. It is defined as the union of all the branches that connect the senses of the word w to the concept c; it is the intersection of the word’s subtree and the subtree of the concept c. This allows us to specify the semantic domain in which to generalize the senses of the word. In the previous example, the subtree of the word “mining” of the concept “computer” would contain only “data handling”, “artificial intelligence”, “computer science”, and it would remove the concepts related to natural resource mining. The distance between a word and a concept defined earlier can be redefined as the minimum height of the word’s subtree of the specified concept. For the example shown in Figure 6.8, the distance is 2.
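To make Equations 6.4 and 6.5 concrete, here is a sketch over a toy concept hierarchy encoded as child-to-parent (hypernym) pointers; the hierarchy fragment loosely follows Figure 6.8 and is an assumption of ours, not an extract of WordNet.

import math

PARENT = {"knowledge management": "root", "data handling": "knowledge management",
          "data mining": "data handling",
          "natural resource mining": "root", "metal mining": "natural resource mining"}
SENSES = {"mining": {"metal mining", "data mining"}, "data": {"data mining"}}

def ancestors(c):
    """The concept c, its hypernyms, ..., up to the root."""
    path = [c]
    while c in PARENT:
        c = PARENT[c]
        path.append(c)
    return path

def concept_distance(ci, cj):
    """Edge-based distance along a generalization branch (Equation 6.4)."""
    path_i, path_j = ancestors(ci), ancestors(cj)
    if ci in path_j:
        return path_j.index(ci)
    if cj in path_i:
        return path_i.index(cj)
    return math.inf   # the two concepts are unrelated

def word_concept_distance(w, c):
    """Minimum distance between a sense of w and the concept c (Equation 6.5)."""
    return min(concept_distance(cw, c) for cw in SENSES[w])

print(word_concept_distance("mining", "knowledge management"))   # -> 2, as in Figure 6.8
print(concept_distance("metal mining", "knowledge management"))  # -> inf, unrelated concepts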

Topic-concept mapping Mapping a topic to a concept structure is a task similar to that of mapping words. We start from a topic’s relevant words, defined earlier in the topic-word mapping. We define the topic’s related concept set, which is the union of all the senses of the topic’s relevant words present in the concept hierarchy. While this set is complete, meaning that it contains all the possible senses associated with the topic, it is too large, since many (or most) of the concepts in it are not relevant for the topic. In a given text, words rarely tend to be associated with more than one sense. Let us take the example of a topic extracted from texts talking about data mining. The topic’s related concept set will contain “data mining” and “text mining”, but also “metal mining”, “oil mining” and “data structures”. We need to filter the set to retain only concepts that are semantically related to the topic. This task is similar to the task of term disambiguation needed in automatic ontology learning, which will be presented in Section 6.5.2.


Figure 6.9 – Topical subtree of a topic having the relevant words “data” and “mining”, with the common subtrees highlighted

A topical subtree is the union of all the word subtrees of all the topic’s relevant words. Similarly to a word’s subtree, the topical subtree contains all the generalizations of all the possible senses associated with the topic. Considering that the relevant words of a topic are semantically related, their senses are semantically related and many of their generalizations are identical. This translates into overlapping word subtrees and makes possible the identification of the semantically relevant concepts in the topic’s related concept set. Going back to the earlier example of a topic extracted from data mining related texts, the relevant words for this topic are “data” and “mining”. Figure 6.9 shows a part of the topical subtree. All the possible senses of the relevant words are present, such as “metal mining”, “data mining” and “data types”. When looking at the individual word subtrees, intuitively the subtree of the concept “natural resource mining” is related to the word “mining”, just like the subtree of the concept “knowledge management”. The left and the central subtrees in Figure 6.9 correspond to two of the senses of the word “mining”. The central and the right subtrees (those of the concepts “knowledge management” and “computer programming”) are contained in the subtree of the word “data”. The concepts that are relevant to the given topic emerge from the overlapping of the two word subtrees. The topical subtree is generally a very wide structure that covers many semantically irrelevant synsets. A filtering mechanism is, therefore, needed to select only the overlapping region of the subtree, which has the highest chance of mapping to the topic’s semantics. A topical subtree of a concept c is the intersection between the topical subtree and the subtree of the concept c. It is made out of all the generalizations of all the concepts associated with a topic which are specializations of the concept c. In the example given in Figure 6.9, the topical subtree of the concept “computer programming” contains only

“data types”, “data structure”, “boolean”, “real” and “integer”. Using the topical subtree of a concept c, it is possible to filter out regions that are not semantically relevant and focus on smaller, denser structures. Starting from a given concept, it is possible to follow the hypernymy/hyponymy relations to specialize or generalize the topical subtree.

6.4.2.2 Metrics and Topic Evaluation


Topic µ = (“mining” 0.53, “data” 0.25, “ontology” 0.12, “learning” 0.06, “error” 0.02, ...)

Figure 6.10 – Topical subtree of a topic having the relevant words “data” and “mining”, with the common subtrees highlighted.

Example of desired output In the previous section, we have shown how to map a topic µ to a concept c, more precisely to a topical subtree whose root concept is c. Among all the possible topical subtrees of the concepts ci ∈ δ(C), we aim to identify the topical subtree that includes at least one sense for as many of the topic’s words as possible, while having a root concept that is as specific as possible. The idea is to map the topic to a dense and compact concept structure, while not losing too much of the topic’s meaning. Figure 6.10 shows the desired mapping for the previously mentioned example. The topic µ is defined by a list of words, ordered decreasingly by their score. We select “mining” and “data” as the relevant words for µ (highlighted in the word list). Each of the relevant words is mapped to its senses in the concept hierarchy ((i) “mining” is mapped to “metal mining” and “data mining” and (ii) “data” is mapped to “data mining” and “data types”). Therefore, the concept set related to the topic µ is {“metal mining”, “data mining”, “data types”}. The topical subtree of the topic µ is the entire subtree shown in Figure 6.10. We want to choose a denser, smaller substructure of this tree, which covers at least one sense for each of the relevant words of µ. This optimum structure is the topical subtree of the concept “knowledge management”, which covers one sense for both “mining” and “data”.

Topical tree measures When choosing the optimum subtree, we are facing the old Machine Learning dilemma of choosing between precision and recall. In our context, we define the precision as the specificity of the root concept of the topical subtree in the concept hierarchy. Concepts close to the root of the hierarchy are very general and, thus, have a low specificity. The specificity also depends on the distance between the concept and the topic’s relevant words (the subtree’s height). The recall is defined in this context as the coverage of the topical tree: the proportion of the topic’s relevant words that have at least one sense in the topical subtree. Both specificity and coverage need to be maximized. The specificity of the topical subtree of the concept c depends on the depth of c in the concept hierarchy and on the subtree’s height. The depth (depth(c)) is defined as the normalized distance between the concept c and the root of the hierarchy. When the concept c is found deep in the concept hierarchy (depth close to 1), the topical subtree of the concept c is very specialized, which means that the concepts in the subtree tend to be specialized. Conversely, when c is close to the root of the hierarchy (depth close to 0), the concepts in the topical tree tend to be general. Formally, the depth ∈ [0, 1], needs to be maximized and has the following formula:

$$depth(c) = \frac{d(c, c_0(C))}{\max_{c_j \in \delta(C)} d(c_j, c_0(C))} \qquad (6.6)$$

where c0(C) is the root concept of the entire concept hierarchy. The iheight (inverse height) of the topical subtree is inversely proportional to the maximum distance between the topical subtree’s root concept c and its leaves, the topical related concepts (concepts that are senses of the topical relevant words), normalized by the topical subtree’s height (in the entire concept hierarchy). The idea is to assess how general the concept c is compared to the senses of the relevant words. A topical tree with a high iheight is very specific. When the iheight is low, the tree is very general. Formally, the iheight ∈ [0, 1], needs to be maximized and has the following formula:

$$iheight(\mu, c) = 1 - \frac{\max_{c_\mu \in \delta(\mu)} d(c, c_\mu)}{\max_{c_\mu \in \delta(\mu)} d(c_\mu, c_0(C))} \qquad (6.7)$$

where δ(µ) is the set of concepts related to the topic µ. Finally, we define the specificity as a weighted sum of the depth and the iheight, in order to allow the fine tuning of the two components. When the weight of the depth (ωd) is high, the evaluation algorithm tends to map topics to topical subtrees which are deep in the concept hierarchy. When the weight of the iheight (ωh) is high, the topical subtrees tend to have few levels (they are compact). Formally, spec, ωd, ωh ∈ [0, 1], ωd + ωh = 1, spec needs to be maximized and has the following formula:

$$spec(\mu, c) = \omega_d \times depth(c) + \omega_h \times iheight(\mu, c) \qquad (6.8)$$

We define the coverage of the topical subtree of the topic µ and concept c as:

$$cov(\mu, c) = \frac{\left|\{w \mid w \in P(\mu),\ \delta(w) \cap \delta(\mu, C_c) \neq \emptyset\}\right|}{|P(\mu)|} \qquad (6.9)$$

where P(µ) is the set of relevant words of the topic µ, δ(w) is the set of senses (concepts) associated with the word w in the concept hierarchy C and δ(µ, Cc) is the set of concepts in the topical subtree of the topic µ and concept c. The coverage ∈ [0, 1] and needs to be maximized.

Topic fitness function Using the specificity and the coverage, the strength of the relation between a topic and a concept subtree can be quantified. The concept hierarchy used is a semantic resource in which small distances mean semantically close concepts. Therefore, a compact, dense subtree structure means that all the concepts in the structure are semantically very similar. A topical tree with high specificity and high coverage is a tree in which all concepts are semantically close, which is specific and which covers most of the relevant words of a topic. We define the general score of the topical subtree of a given concept c as the weighted sum of the specificity and the coverage, to allow the fine tuning of the two components. Note that other aggregation formulas could be used (though not tested), like the classical Fscore. Formally, φ, ωspec, ωcov ∈ [0, 1], ωspec + ωcov = 1, φ needs to be maximized and has the following formula:

$$\phi(\mu, c) = \omega_{spec} \times spec(\mu, c) + \omega_{cov} \times cov(\mu, c) \qquad (6.10)$$

We define the “optimum” topical subtree of a topic µ as the subtree of the concept copt that maximizes φ(µ, cµ). The concept copt chosen as the root of the topical subtree is defined as:

$$c_{opt}(\mu) = \operatorname*{argmax}_{c_\mu \in C_\mu} \phi(\mu, c_\mu) \qquad (6.11)$$

where Cµ is the topical subtree of topic µ in the entire concept hierarchy. In practice, the optimum subtree is found by performing a tree search starting from c0(C) (the root of the concept hierarchy) and following the specialization relations (i.e., the hyponymy relation in the case of WordNet) in the tree. We define the topic fitness score as the score obtained by its optimum topical subtree (φ(µ, copt(µ))). A topic with a high fitness score is a topic with a high semantic cohesion, since its relevant words can be mapped onto a compact semantic concept structure. The semantic cohesion is the degree to which the most relevant words of a topic are similar in meaning. The general score obtained by a set of topics extracted from a document collection is the average fitness score of the individual topics. The fitness function makes it possible to leverage semantic knowledge in the form of a concept hierarchy to evaluate the semantic cohesion of topics extracted from a document collection.
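Putting Equations 6.6 to 6.11 together, the following sketch computes the fitness and searches for the optimum root concept; it reuses the toy PARENT, SENSES, ancestors and concept_distance helpers from the previous sketch, and the weight values are arbitrary illustrations rather than tuned settings.

NODES = set(PARENT) | {"root"}

def depth(c, root="root"):
    """Equation 6.6: normalized distance between c and the root."""
    max_depth = max(concept_distance(n, root) for n in NODES)
    return concept_distance(c, root) / max_depth

def topic_senses(relevant_words):
    """delta(mu): all senses of the topic's relevant words."""
    return set().union(*(SENSES[w] for w in relevant_words))

def iheight(relevant_words, c, root="root"):
    """Equation 6.7: 1 - height of the topical subtree of c, normalized."""
    senses = topic_senses(relevant_words)
    below = [cm for cm in senses if c in ancestors(cm)]   # senses covered by the subtree of c
    num = max((concept_distance(c, cm) for cm in below), default=0)
    den = max(concept_distance(cm, root) for cm in senses)
    return 1 - num / den

def coverage(relevant_words, c):
    """Equation 6.9: fraction of relevant words with at least one sense under c."""
    covered = [w for w in relevant_words if any(c in ancestors(cw) for cw in SENSES[w])]
    return len(covered) / len(relevant_words)

def fitness(relevant_words, c, w_d=0.5, w_h=0.5, w_spec=0.5, w_cov=0.5):
    """Equations 6.8 and 6.10 combined."""
    spec = w_d * depth(c) + w_h * iheight(relevant_words, c)
    return w_spec * spec + w_cov * coverage(relevant_words, c)

def optimum_root(relevant_words):
    """Equation 6.11: the concept whose topical subtree maximizes the fitness."""
    candidates = {a for cm in topic_senses(relevant_words) for a in ancestors(cm)}
    return max(candidates, key=lambda c: fitness(relevant_words, c))

print(optimum_root({"mining", "data"}))   # -> "data mining" with these toy senses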

6.4.2.3 Experimental Evaluation

In this subsection, we briefly present some experiments and results that can be obtained with this system. The purpose of the experiments is to prove the connection between the proposed automatic concept-based topic evaluation and the way the human mind judges topics. The performed experiments are still very exploratory, being performed on only two datasets, with a limited range of values for the parameters (ωd = ωh = ωspec = ωcov = 1 and k ∈ {30, 50, 100, 200, 300}). We choose to present here a part of the results that we have shown in [Musat et al. 2011b]. While the experiments still need improvement (e.g., more datasets, varying the parameters, trying other distances between concepts, comparison with other methods present in the literature [Newman et al. 2010] etc.), we consider these first results encouraging, since they show that our proposed method succeeds in emulating human judgment.

Datasets and protocol The experiments were performed on two datasets: the Suall11 9 dataset and a custom-made dataset 10 containing economic news articles. Suall11, initially used in [Wang & McCallum 2006], is a general dataset on American history. The economic dataset was built from publicly available Associated Press articles published in the Yahoo! Finance section. A total of 23986 news broadcasts which had originally appeared between July and October 2010 were gathered. The experimental protocol was devised to assess the validity of our assumption that mapping topics to concept subtrees and measuring the strength of the mapping relation (the topic fitness function) is a good indicator of topical coherence. The aim is to capture the correlation between the human verdict and the calculated topic fitness. Latent Dirichlet Allocation [Blei et al. 2003] (presented in Section 6.3.2 and built into the Mallet suite [McCallum 2002]) was used to generate the topics, and the word probability scores were used to detect the relevant words for each topic. Human evaluation was performed by 37 external judges and followed a framework similar to the one employed in [Chang et al. 2009a]. The evaluators were asked to identify the unrelated (spurious) words in a group containing the 5 most relevant words associated with one topic and an additional spurious word. One or more unrelated words were chosen for each group by each evaluator.

Evaluation of two dimensions The analyzed topics are separated into high fitness scoring and low fitness scoring, on the basis of the automatically calculated topic fitness. The relevant set contains the 10 highest scoring topics, while the irrelevant set contains the 10 bottom scoring topics. The aim is to see whether an improvement of the spurious word detection is visible from one category to the other. For each topic, two spurious words were chosen. The Kullback-Leibler divergence [Kullback & Leibler 1951] is used to select the closest and the farthest topics, and one of their relevant words was used as the spurious word. Intuitively, spurious words from close topics are more difficult to detect than those from distant topics. Therefore, the experiments have two dimensions: to assess whether the spurious word detection improves (i) between topics with high fitness scores and topics with low fitness scores and (ii) between spurious words originating from close topics and from distant topics. Based on the evaluators’ responses, we calculate hit+ (the average percentage of correctly identified spurious words for topics in the relevant set) and hit− (the average percentage of correctly identified spurious words for topics in the irrelevant set). Each indicator is calculated in two situations: (i) when the spurious word is chosen from a close topic and (ii) when the spurious word is chosen from a distant topic. Table 6.4 presents the obtained results and the relative gains between (a) hit+ and hit− and (b) the values for close and distant topics.
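For reference, the Kullback-Leibler divergence between two topics viewed as word distributions can be sketched as follows (a generic, smoothed implementation; the smoothing constant is an arbitrary choice and not the exact setting used in our experiments):

import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two topics given as dicts word -> probability."""
    vocab = set(p) | set(q)
    return sum(p.get(w, eps) * math.log(p.get(w, eps) / q.get(w, eps)) for w in vocab)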

9. Download: http://www.gutenberg.org/dirs/etext04/suall11.txt
10. Download: http://eric.univ-lyon2.fr/~arizoiu/files/economic_corpus_AP.tar.bz2

The results show that the source of the spurious word has a crucial importance: the detection rate almost doubles (an average close-distant gain of 92.7%) when the spurious word is chosen from distant topics instead of close topics. This proves that spurious words from distant topics are easier to identify than those chosen from close topics. The detection rate also increases between topics from the relevant set and topics from the irrelevant set. The relative hit gain in the detection rate varies between 6.93% and 66.76%. The fact that the detection rate is consistently better for topics with a high fitness score gives a first positive indication that the topic-concept mapping succeeds in emulating the human semantic judgment. Furthermore, we observe a significant difference between the relative hit gain corresponding to spurious words from close topics and the one corresponding to spurious words from distant topics. When the spurious word originates from a distant topic, it is easier to identify, regardless of the quality of the topic at hand; whereas, when the spurious word originates from a close topic, it is harder to spot. It is this measure that actually shows that our concept-topic mapping emulates the human judgment: when topics are coherent, humans tend to better identify the spurious word, and our system succeeds in identifying these coherent topics.

Table 6.4 – Spurious Word Detection Rates.

Dataset   Type                  hit+      hit−      Gain hit
AP        Close                 0.37      0.27      39.33%
          Distant               0.69      0.65      6.93%
          Gain close-distant    86.49%    140.74%
Suall     Close                 0.51      0.30      66.76%
          Distant               0.75      0.59      28.55%
          Gain close-distant    47.06%    96.67%

In conclusion, these results show that (i) non-related words from distant topics (from the Kullback-Leibler divergence point of view) are easier to detect than spurious words from close topics and (ii) spurious words inserted into topics with a high fitness score (from the concept-based evaluation point of view) are easier to detect. This proves that (i) the Kullback-Leibler divergence can be used as a measure of the semantic distance between two topics expressed as probability distributions over words and (ii) the proposed topic-concept mapping and evaluation measures succeed in capturing and quantifying the semantic cohesion of a topic.

6.5 Applications

In this section, we discuss how the topic extraction and evaluation techniques presented in this chapter can be used in other applications. More precisely, the reader will find, in Section 6.5.1, how the topic-concept mapping, together with the measures defined in Section 6.4.2, can be used for improving topics. We examine a method for automatically detecting and removing spurious words from a topic’s description. In Section 6.5.2, we give a very brief presentation of the automatic ontology learning field, as well as a couple of hints on how topics extracted from a collection of documents can be used in ontology learning.

6.5.1 Improving topics by Removing Topical Outliers

After presenting how topics can be created (in Sections 6.3.1, 6.3.2 and 6.4.1) and evaluated (in Sections 6.3.4 and 6.4.2), a natural follow-up is topic improvement. We use the same general framework as described in Section 6.4.2 (p. 148). A semantic resource in the form of a concept hierarchy is used and a topic-concept mapping is created, passing through words. Topics are associated with topical subtrees, which contain all the generalizations of the senses of their most relevant words. The idea of the topic improvement system is to create a projection of the topic as a whole onto the given ontology and to decide which part of the topic, if any, is separated from the others. The underlying assumption is that the understandability of the given topic can be improved by removing the parts which are unrelated from a human perspective. The presented system is designed to improve individual topics according to their semantic cohesion. We use an established simplified representation of topics, used in most of the literature, which is the list of most relevant words. To improve topic readability and meaningfulness, we prune the outliers from the relevant word set, using the concept hierarchy. Topical outliers are the words that are unrelated to the other words from a semantic perspective. After the pruning, the remaining relevant words are more inter-related as a set and convey more meaning to the user. The core of this work is establishing the conceptual context of a single given topic, as shown in Section 6.4.2. We detect which concepts from the used ontology are relevant to the topic as a whole and choose the topical words unrelated to those concepts as the outliers to be eliminated.

Methodology In order to detect the topical words that are unrelated to the conceptual context created by the others, we must first identify the related concepts. In Section 6.4.2, we have shown that, for any given topic, multiple topical subtrees, each rooted at a specific concept, can be constructed. Each of these topical subtrees is assigned a general topical subtree score, given by Equation 6.10 (p. 156). In the evaluation process, we have selected the “optimum” topical subtree as the one having the highest topical subtree score. In order to detect topical outliers, we do not choose a single “optimum” subtree (belonging to just one concept), but the l ∈ N topical subtrees that have the highest scores. If we are to make a parallel, the highest ranking topical subtrees are like the axes of Principal Component Analysis (PCA) [Dunteman 1989]. Just like PCA’s axes, the topical subtrees contain decreasing amounts of semantic information. Selecting the first few highest ranking subtrees allows mapping most of the semantic information contained in the topic. We define the topical outliers as the words not covered by the reunion of the l highest scoring topical subtrees for a given topic µ; l is a parameter of the system.

In Figure 6.11, we present a simplified version of the topical subtrees for the topic µ(“mining”, “data”, “error”, “ontology”, “learning”). For l = 2, the two highest scoring topical subtrees are those of the concepts “knowledge management” and “semantic web” (highlighted in orange). The topical subtree of the concept “knowledge management” contains senses for the words “mining” and “data”, whereas the topical subtree of the concept “semantic web” contains the senses of “ontology” and “learning”. The only word that does not have at least one sense in the reunion of the two highest scoring topical subtrees is “error”, which is therefore declared an outlier and eliminated.

Figure 6.11 – Removing topical outliers using a topic-concept mapping.

We have, therefore, designed a method for detecting topical outliers, which are words semantically unrelated to the rest of the topic’s relevant words. This method can be used to improve the topics, by providing the user with a more semantically coherent list of relevant words. But its usages are not limited to topics. In fact, this method can be used to filter outliers from any kind of word list, as long as most of the words are semantically related among themselves. Such an approach can be used to filter the terms which are semantically unrelated to a specific domain concept from a list of automatically extracted terms. As we show in the next section, manual term filtering is one of the bottlenecks of Ontology Learning.
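A minimal sketch of the outlier detection step described above is given below. It assumes the topical subtrees and their scores have already been computed with the topic-concept mapping of Section 6.4.2; the example values only mirror Figure 6.11 and are illustrative.

```python
def topical_outliers(topic_words, subtrees, scores, l=2):
    """Keep the l highest-scoring topical subtrees and flag every topic word
    with no sense covered by their union. `subtrees` maps a concept to the set
    of topic words it covers; `scores` maps a concept to its subtree score."""
    top_l = sorted(scores, key=scores.get, reverse=True)[:l]
    covered = set().union(*(subtrees[c] for c in top_l)) if top_l else set()
    return [w for w in topic_words if w not in covered]

# Example mirroring Figure 6.11 (scores are illustrative):
topic = ["mining", "data", "error", "ontology", "learning"]
subtrees = {"knowledge management": {"mining", "data"},
            "semantic web": {"ontology", "learning"},
            "statistics": {"error", "data"}}
scores = {"knowledge management": 0.8, "semantic web": 0.7, "statistics": 0.3}
print(topical_outliers(topic, subtrees, scores, l=2))  # -> ['error']
```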

6.5.2 Concept Ontology Learning

Figure 6.12 – Schema of the Ontology Learning Layer Cake (layers, bottom to top: terms and synonyms, concepts, taxonomy, relations, instances).

Ontologies are collections of concepts linked together through a set of relations. Ontology construction is a complex and time-consuming process requiring the knowledge of highly specialized experts. To overcome this knowledge acquisition bottleneck, Ontology Learning techniques for the automatic construction of ontologies have been proposed. The process of Ontology Learning from Text involves Natural Language Processing techniques and, as shown in Figure 6.12, can be divided into five main steps, also known as the Ontology Learning Layer Cake [Buitelaar et al. 2005]:

1. extraction of domain terminology and synonyms from a corpus of documents (the terms and synonyms layer);
2. definition of the main concepts on the basis of the detected relevant terms and the classes of synonyms (the concepts layer);
3. structuring of the concepts into a taxonomy (the taxonomy layer);
4. definition of non-taxonomic relations between concepts (the relations layer);
5. population of the ontology with concept and relation instances (the instances layer).

An overview of the state of the art of ontology learning and the proposed solutions for each step can be found in [Cimiano et al. 2006]. The main approach toward learning concepts and their taxonomy (the hierarchical relations between concepts) is Conceptual Clustering [Michalski & Stepp 1983], an unsupervised machine learning technique closely connected to unsupervised hierarchical clustering. Examples of algorithms developed for this purpose are the well-known COBWEB [Fisher 1987] and the more recent WebDCC [Godoy & Amandi 2006]. We take a look at alternative methods and discuss the usage of topic extraction at the two bottom layers of the ontology layer cake: the terms and synonyms layer and the concepts layer.

Work at the terms and synonyms layer Topic extraction systems have already been used at the terms and synonyms layer, where the challenges are (i) extracting relevant terms that unambiguously refer to a domain-specific concept and (ii) dealing with disambiguation. The literature provides many examples of term extraction methods [Wong et al. 2008, Wong et al. 2009] that could be used as a first step in ontology learning from text, but the resulting list of relevant terms needs to be filtered by a domain expert [Cimiano et al. 2006]. The topic-concept mapping presented in Section 6.4.2 and the topic improvement approach presented in Section 6.5.1 allow further automation of this process. Semantic outliers can be detected by using a general purpose semantic resource, such as WordNet, therefore simplifying or completely eliminating the supervision of the field expert.
Disambiguation deals with choosing, given the property of polysemy, the right meaning of a word in a given context. Most of today’s word sense disambiguation algorithms, like the one in [Lesk 1986], rely on the use of synonym sets. The literature presents two main approaches towards finding synonyms [Buitelaar et al. 2005]:
– algorithms that rely on readily available synonym sets such as WordNet [Miller 1995] or EuroWordNet 11 [Turcato et al. 2000, Kietz et al. 2000];
– algorithms that directly discover synonyms by means of clustering. These algorithms are based on statistical measures mainly used in Information Retrieval and start from the hypothesis that terms are similar in meaning to the extent in which they share syntactic contexts [Harris 1968]. Text clustering places documents which share the same context into the same group, which in turn leads to synonyms being placed under the same topic. As terms have different meanings depending on the context, it is only natural to allow them to be part of more than one group. In this case, the clustering algorithm can be used to find synonyms, but also for term disambiguation (choosing between the different meanings).

11. http://www.illc.uva.nl/EuroWordNet/

The crisp text clustering solutions presented in Section 6.3.1 have the inconvenience that they output a crisp partition, where each document belongs to only one group. These approaches can be used for regrouping synonyms, but they cannot be used for disambiguation. An overlapping solution, such as the one presented in Section 6.4.1, addresses the disambiguation problem by allowing terms to be in more than one cluster: terms with multiple meanings can be regrouped together with their synonyms, for each of their meanings.
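To make the contrast concrete, the sketch below assigns a vector (a document, or a term context) to every cluster whose centroid is similar enough, instead of only to the closest one. The similarity threshold rule is an illustrative assumption and not the actual overlapping algorithm of Section 6.4.1.

```python
import numpy as np

def overlapping_assign(vectors, centroids, threshold=0.3):
    # Each vector joins every cluster whose centroid cosine similarity
    # exceeds the threshold (overlapping), instead of only the argmax (crisp).
    assignments = []
    for v in vectors:
        sims = centroids @ v / (
            np.linalg.norm(centroids, axis=1) * np.linalg.norm(v) + 1e-12)
        assignments.append(np.flatnonzero(sims >= threshold).tolist())
    return assignments
```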

The concepts layer: evolving topics to concepts A topic is not a concept, since it represents the abstraction of the idea behind a group of texts rather than a notion in itself. While the difference between the two is subtle, evolving a topic into a fully-fledged concept is still to be achieved, and the literature does not provide any largely accepted solution. Of course, the simplest way to do it is to have a human expert manually evolve the topics into concepts, by adding relations and building the structure of the ontology. But in the long term, the objective is to completely automate the ontology building process. That is why relations need to be found in a human-independent way. Some of the recent topic extraction algorithms have already made the first steps towards this objective. hLDA [Blei et al. 2004] outputs a hierarchy of topics, which can provide, to a certain extent, the hierarchical relations between concepts. Other algorithms, like cLDA [Blei & Lafferty 2006a], obtain a correlation structure between topics by using the logistic normal distribution instead of the Dirichlet. Some authors consider that a hierarchy of topics can already be considered an ontology: [Yeh & Yang 2008] extract the topics from the text, using LSA, LDA or pLSA, and then regroup them into super-topics, using hierarchical agglomerative clustering with the cosine distance. They consider that “because the latent topics contain semantics, the clustering process is regarded as some kind of semantic clustering”. In the end, they obtain an ontology in OWL. The passage from topic to concept is also related to other perspectives, such as reconciling the similarity-based dendrograms built by traditional Hierarchical Agglomerative Clustering and the concept hierarchies used in Formal Concept Analysis [Ganter & Wille 1999]; the recent work of [Estruch et al. 2008] proposes, in this line, an original framework to fill the gap between statistics and logic. In Section 6.4.2 (p. 148), we presented a method for mapping topics to a concept hierarchy. Such a mapping can be used as an intermediary step from a general purpose semantic resource, such as WordNet, to a domain-specific ontology, by passing through topics.
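As an illustration of the super-topic grouping of [Yeh & Yang 2008], the sketch below clusters topics, assumed to be available as word-probability vectors, with hierarchical agglomerative clustering under the cosine distance; the data is random and purely illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
topics = rng.dirichlet(np.ones(50), size=10)            # 10 topics over a 50-word vocabulary
Z = linkage(topics, method="average", metric="cosine")  # average-link HAC, cosine distance
super_topics = fcluster(Z, t=3, criterion="maxclust")   # group the topics into 3 super-topics
```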

6.6 Conclusion and future work

In this chapter, we have focused on one of the core research challenges of this thesis, leveraging semantics, and applied it to dealing with textual data. More precisely, we are interested in the tasks of (i) topic extraction, (ii) topic labeling and (iii) topic evaluation. For the topic extraction task, we have proposed an overlapping text clustering-based solution, which allows textual documents to be associated with more than one topic, depending on the subjects they approach. For the topic labeling task, we associate with topics humanly-comprehensible labels, which are chosen from a candidate set of frequent complete phrases. For the topic evaluation task, we have proposed a system that aims at emulating the human judgment of topics by using an external concept hierarchy (e.g., WordNet).

Conclusion Topics are the central point of our work with the textual dimension. A topic is a general idea behind a group of similar documents. We have given an overview of the different approaches present in the literature for extracting and evaluating topics. We argue that the property of polysemy of words and the fact that a single textual document can approach multiple topics are crucial in topic extraction. Therefore, we present two novel systems, one for extracting topics and the other for evaluating topics. The topic extraction system is based on an overlapping clustering algorithm and assigns to each of the extracted topics a humanly readable name, using a suffix array-based keyphrase extraction algorithm. The topic evaluation system is based on an external semantic resource under the form of a concept hierarchy. The core of the system is the topic-concept mapping, in which each topic is associated with a topical subtree of a certain concept c. This structure is a concept subtree having (i) c as its root and (ii) the senses of the most relevant words of the given topic as leaves. It contains, therefore, all the generalizations of the senses of relevant words which are simultaneously specializations of the concept c. We search in the concept hierarchy for the topical subtree which has the largest coverage (contains at least one sense for as many as possible of the topic’s relevant words) and is as specific as possible. We propose an evaluation measure which quantifies the semantic coherence of a topic. Towards the end of the chapter, we have presented two applications of the proposed systems: the detection of outliers from a list of words (outliers are the words semantically unrelated to the others) and the usage of topics in automatic Ontology Learning.

Practical and applied context Some of the work concerning topic extraction was initially developed during my Master’s research internship. The further extensions were published in the proceedings of a French national conference [Rizoiu et al. 2010], while the application of topic extraction to ontology learning was published in a book chapter [Rizoiu & Velcin 2011]. The topic evaluation methodology was developed in collaboration with the Computer Science department of the Polytechnic University of Bucharest, and more precisely, during the PhD research internship of Claudiu Cristian Muşat at the ERIC laboratory. It was published in the proceedings of an international conference [Musat et al. 2011b]. The application to outlier detection was also published in the proceedings of an international conference [Musat et al. 2011a]. The work presented in this chapter has the particularity of being in close connection with the different research projects in which I was involved (as discussed in Section 6.1) and with the applied part of my work. The proposals discussed in Section 6.4 either are already integrated or will soon be integrated into CommentWatcher, our open-source web-based platform for analyzing online discussion forums. CommentWatcher will be described in detail next, in Chapter 7.

Future work We identify several development tracks for our work with the textual dimension. We consider applying the proposed topic-concept mapping to other fields of natural language processing, such as (i) integrating semantic knowledge into the topic extraction algorithm by using the proposed topic-concept mapping, (ii) automatically filtering unrelated terms at the terms and synonyms layer or (iii) word sense disambiguation in the Ontology Learning Layer Cake. Given the intimate link between our work concerning text and our applied work (most notably CommentWatcher), many of our future perspectives are linked to practical aspects.

Figure 6.13 – Streamlined schema showing how the contributions in this chapter can be combined with those in previous chapters.

We currently have ongoing work to integrate the semantic-aware topic evaluation into CommentWatcher. Once this is achieved, we plan to perform a thorough comparison of our topic extraction algorithm with other state-of-the-art extraction algorithms, from a semantic evaluation point of view. Furthermore, given CommentWatcher’s versatile parser architecture, we plan to test the proposed concept-based evaluation method on other, larger and more diverse textual datasets. We also plan to (i) complete the study of the influence of parameters (e.g., the different weights defined in Table 6.3, p. 150), (ii) try other distances to compare the semantic similarity of concepts and (iii) compare with existing Word Sense Disambiguation systems.

Articulation with the previous work Conceptually, the work presented in this chapter articulates with the work of previous chapters as shown in Figure 6.13, in which the work presented in previous chapters is drawn with faded colors. The joining point is, as in the case of the previous chapters, the passing of the data through a semantic-aware numeric description space. Furthermore, while in this chapter we presented topics and concepts from a textual point of view, neither of the two is actually limited to words. For example, the pLSA topic extraction technique has been used with visual words when dealing with images [Kandasamy & Rodrigo 2010]. Concept ontologies are represented using words, but concepts represent abstract notions. Therefore, it is foreseeable to use such knowledge repositories to treat other types of complex data: for example, in Section 2.1.2 (p. 13) we give examples of algorithms dealing with images and text, and therefore images could be linked to concepts by passing through text.

Chapter 7 Produced Prototypes and Software

Contents
7.1 Introduction ...... 167
7.2 Discussion Forums ...... 169
7.2.1 Current limitations ...... 169
7.2.2 Related works ...... 170
7.2.3 Introducing CommentWatcher ...... 171
7.3 Platform Design ...... 171
7.3.1 Software technologies ...... 171
7.3.2 Platform architecture ...... 171
7.3.3 The fetching module ...... 172
7.3.4 Topic extraction and textual classification ...... 173
7.3.5 Visualization ...... 173
7.4 License and source code ...... 175
7.5 Conclusion and future work ...... 175

7.1 Introduction

As previously hinted, my theoretical research work was constantly accompanied by applied research work. Scripts were developed for the experimental part of each of the approaches presented in previous chapters. These are usually just proofs-of-concept, written in scripting languages (like Octave 1). CommentWatcher, the most prominent piece of software produced and the one presented in this chapter, is an open source tool aimed at analyzing discussions on web forums. Constructed as a web platform, CommentWatcher features automatic fetching of the forums using a versatile parser architecture, topic extraction from a selection of texts and a temporal visualization of the extracted topics and of the underlying social network of users. It is aimed both at media watchers (it allows quick identification of important subjects in the forums and of user interest) and at researchers in social media (who can use it to build temporal textual datasets). CommentWatcher is currently used in the Crtt-Eric research project to study the evolution of specialized discourse in the domain of nuclear medicine, while taking into account the temporal evolution and the different populations involved (e.g., doctors, nurses, patients). It is also planned to be used to construct a dataset for studying the detection of behavioral roles by using temporal-driven constrained clustering (as seen in Section 3.6 (p. 52)).

1. http://www.gnu.org/software/octave/

Features of CommentWatcher CommentWatcher addresses a series of limitations concerning online forum discussions (some of which are detailed in Section 7.2): (a) concerning discussion forum benchmark datasets and (b) concerning existing software solutions. (a) We identify the following limitations concerning discussion forum benchmark datasets: (i) the scarcity of forum benchmarks, (ii) the fact that the existing forum benchmarks are issued from a single forum website, which makes it impossible to study inter-website user behavior, (iii) the continuously changing structure of forum websites, which renders forum parsers useless and (iv) the unclear copyright of forum data, which hinders sharing within the research community. (b) The limitations concerning existing software solutions are that (i) they are proprietary solutions, unusable for research purposes, and (ii) they do not deal with crawling the forum sources and need to be supplied directly with formatted data. CommentWatcher addresses these limitations by featuring a modular parser architecture, capable of handling the ever-changing structure of websites. Furthermore, it is open source, meaning that it can be freely distributed. It can be used to solve the problem of content copyright, since only the tool is distributed and the forum benchmark can be easily reconstructed locally by each researcher. To our knowledge, CommentWatcher is the only solution dedicated to online discussion forums that integrates (a) forum parsing, (b) topic extraction and visualization and (c) online social network inference and visualization.

Experimenting with CommentWatcher A public demo installation of CommentWatcher is available 2. The reader is able to interact with CommentWatcher in a normal browser window, through the tool’s web interface. The tool itself is hosted and executed on its dedicated machine, located at the ERIC laboratory. The reader can experience first-hand the tool’s capabilities by (a) seeing how multiple discussion forums can be fetched by searching the web using keywords, (b) applying topic extraction algorithms and tweaking their parameters, (c) visualizing the extracted topics as an expression cloud and their temporal evolution and (d) visualizing the social network constructed starting from the initial forums. A short presentation movie is also available 3, showing the main features of CommentWatcher.

CommentWatcher’s history The platform evolved from a simple prototype into a fully-fledged academic software, driven by the needs and purposes of the different research projects (detailed in Appendix A) in which I was involved. It started in the context of the applied research project Conversession, which involved the creation of a start-up enterprise 4. The purpose of the project was Online Media Watching, more precisely focused on news discussion forums. The resulting prototype is called Discussion Analysis 5, a Java desktop application that features (i) fetching of discussion forums from two French websites (www.liberation.fr and www.rue89.fr) and two English websites (www.huffingtonpost.com and forums.sun.com), (ii) textual preprocessing (as shown in Section 6.2.1 (p. 128)) and (iii) topic extraction using the system described in Section 6.4.1 (p. 141). The development of the prototype was continued in the context of the projects Eric-Elico, Crtt-Eric and ImagiWeb. The interest for each of these projects was to constitute discussion forum datasets and to analyze the discussion topics. The platform evolved accordingly. The interface was migrated to a web-based interface to allow the simultaneous work of multiple users and a unified discussion database. A new discussion forum fetching module, described in Section 7.3.3, was implemented, which eliminates the dependency of the code base on the website structure. Support for new topic extraction algorithms (details in Section 7.3.4) was added. The visualization module (described in Section 7.3.5) was added for temporal topic visualization and social network visualization. The development of the platform continues at this date: a force-directed graph drawing is being implemented, as well as a temporal topic model system.

2. Online here: http://mediamining.univ-lyon2.fr:8080/CommentWatcher/
3. Presentation website: http://mediamining.univ-lyon2.fr/commentwatcher
4. http://www.conversationnel.fr/
5. Download here: http://mediamining.univ-lyon2.fr/rizoiu/files/discussion-analysis.jar

Planning of the chapter The remainder of this chapter is structured as follows. In Section 7.2, we present an overview of applied discussion forum analysis: the current limitations, some of the solutions existing in the literature and an overview of our proposed approach. In Section 7.3, we present the general design and detail the different components of CommentWatcher. Section 7.4 gives the license under which CommentWatcher is released and describes how to obtain the software. We conclude in Section 7.5, where we also present the work we are currently undertaking and plan future developments.

7.2 Discussion Forums

CommentWatcher’s vocation is to analyze online discussions. The Web 2.0 has changed the way users discuss with other users. One of the preferred online discussion environments is the web forum. Users can react, post their opinions, discuss and debate any kind of subject. Forums are usually thematic (e.g., Java programming forums 6) and new users have access to the past discussions (e.g., solutions posted by other users to a specific problem). Therefore, users become full collaborative participants in the information creation process. The subjects of discussion between readers are very dynamic and the overall sum of reactions gives a snapshot of the general trends that emerge in the user population. At the same time, the way users reply to one another suggests an underlying social network structure. The forum’s “reply-to” structural relations can be used to add links between users, and other types of relations can be added, like name and textual citations [Forestier et al. 2011]. Furthermore, based on such social networks constructed from web forums, adapted graph measures can be used to detect user social roles [Anokhin et al. 2012]. In Section 3.6 (p. 52), we have shown how the temporal-driven constrained clustering technique that we proposed in Section 3.4 (p. 34) can be applied to detect behavioral roles in such a social network constructed from web forums.

7.2.1 Current limitations

Forum data are still ill explored, even though they represent an important source of knowledge. News article analysis and microblogging (e.g., Twitter) analysis receive a lot of attention from the community. There are available tools that perform the analysis of news media [Amer-Yahia et al. 2012], but without treating the social network aspect. Other tools concentrate on analyzing and visualizing the social dynamics [Guille et al. 2013] or on detecting events [Marcus et al. 2011] based on Twitter data. To the best of our knowledge, there are no publicly available tools that treat forums while inferring a social network structure. Another limitation concerns forum benchmarks. There is a multitude of general purpose information retrieval datasets (e.g., the ClueWeb12 dataset 7 of the Lemur project) and of Twitter datasets (e.g., the infochimps collections 8), but dedicated web forum benchmark datasets are scarce. Those that exist are usually issued from a single forum website (e.g., the boards.ie Forums Dataset 9, based on the boards.ie website, or the Ancestry.com Forum Dataset 10, based on the ancestry.com website). This is due to the diverse and ever-changing structure of the websites hosting the discussions and to copyright problems. Each host website has its own license on the user-produced data, which is not always clearly stipulated. This leads researchers to develop their own house-bred parsers and create their own datasets. These datasets are rarely shared with the community, which poses problems when testing new proposals and comparing with existing approaches.

6. http://www.javaprogrammingforums.com/

7.2.2 Related works

Several tools intending to extract knowledge from on-line discussions have been proposed in recent years. MAQSA [Amer-Yahia et al. 2012] is a system for social analytics on news that allows its users to define their own topic of interest, in order to gather related articles, identify related topics, and extract the time-line and network of comments that show who commented on which article and when. Eddi [Bernstein et al. 2010] offers visualizations such as time-lines and tag clouds of topics extracted from tweets, using a simple topic detection algorithm that relies on a search engine as an external knowledge base. OpinionCrawl 11 is an on-line service that crawls various web sources – such as blogs, news, forums and Twitter – searching for a user-defined topic, and then presents key concepts as a tag cloud, provides a visualization of the temporal dynamics of the topic and performs a sentiment analysis. SONDY [Guille et al. 2013] is an open-source platform for analyzing on-line social network data. It features a data import and pre-processing service, a topic detection and trend analysis service, as well as a service for the interactive exploration of the corresponding networks (i.e., active authors for the considered topic(s)). The aforementioned tools are limited for various reasons: they are either proprietary software, and thus cannot be extended for scientific purposes, or they cannot directly crawl web sources and can only be used to analyze formatted datasets provided by the user. CommentWatcher intends to provide researchers with an open-source, extendable tool that permits crawling the web and building datasets that suit their needs.

7. http://lemurproject.org/clueweb12/specs.php
8. http://www.infochimps.com/collections/twitter-census
9. http://www.icwsm.org/2012/submitting/datasets/
10. http://www.cs.cmu.edu/~jelsas/data/ancestry.com/
11. http://opinioncrawl.com

7.2.3 Introducing CommentWatcher

We address these issues by introducing CommentWatcher, an open source web-based platform for analyzing discussions on web forums. CommentWatcher was designed with two types of users in mind: the forum analyst, who seeks to understand the main topics of discussion and the social interactions between users, and the researcher, who needs a benchmark to test his/her proposed approaches. Using CommentWatcher, the researcher can create forum discussion benchmarks without worrying about copyright issues, since the platform is open source and the text itself is not distributed (each researcher can locally recreate the benchmark dataset). When building CommentWatcher, we address the challenges that arise from retrieving forums from multiple web sources. Not only are these sources profoundly heterogeneous in structure, but they tend to change often and render parsers obsolete. We implement a parser architecture which is independent of the website structure and allows simple on-the-fly adding of new sources and updating of existing ones. CommentWatcher also supports mass fetching of forums from supported sources by using keyword search on the internet, extracting discussion topics, creating the underlying social network structure of users and visualizing it in relation with the extracted topics.

7.3 Platform Design

In this section, we describe the software technologies used in developing CommentWatcher and its general architecture, and we detail the different components to highlight their aims and the way they interact.

7.3.1 Software technologies

CommentWatcher is written using Java Servlets for server-side computing and Java Server Pages for dynamic webpage generation. The support for fetching forum discussions from websites is implemented using the XSL Transformation (XSLT) technology. New websites can be added dynamically, without changing the source code. A MySQL database is used for storing the forum structure, the user characteristics and the text. The visualization is performed client-side, in a Java Applet.

7.3.2 Platform architecture

The application has three main modules, interconnected as shown in Figure 7.1. The fetching module deals with downloading the forums, parsing the web pages and storing the data into the database. Optionally, it can perform a keyword web search to find forums that can be fetched. The topic extraction module performs topic extraction, using an algorithm implemented as a library, on a selection of forums. The visualization module has two views: (i) topic visualization as an expression cloud and as a temporal evolution graphic and (ii) social network visualization.

Figure 7.1 – CommentWatcher: overview of the platform’s architecture.

Figure 7.2 – The design of the fetching module (a) and a screenshot of the keyword mass fetching process (b).

7.3.3 The fetching module

This module deals with downloading, parsing and importing the forum data into the application. The main difficulty when parsing web pages is that the structure of each page is different. What is more, the structure of a certain web page tends to change over time. With CommentWatcher, we have designed and implemented a meta-parser, which is independent of the website structure. The actual adaptation of the parser to a specific page is done using an external definition file, written in XSLT, a standardized and well documented language. Therefore, adding support for new websites or modifying existing ones boils down to just adding or modifying definition files, without any change in the parser’s source code. The design schema of the fetching module, as well as its interactions with the user interface and the database, are given in Figure 7.2a. The download action specifies the URL of a forum to be downloaded. The bulk download follows the same idea, but a keyword web search is performed using the Bing API and all the results from supported websites are downloaded. A screenshot of the keyword web search and mass fetching is given in Figure 7.2b. The specified page is downloaded in raw HTML format and then undergoes cleaning, XSL transformation and deserialization. The cleaning step transforms the HTML document into well-formed XML. In the following step, the XSL transformation is applied to the valid XML document, using one of the XSLT definition files of the supported websites. The result of the transformation is an XML document which uses the same XML schema for all supported websites. The required data is then deserialized into Java objects, which can subsequently be stored in and retrieved from the database. The advantages of implementing such a parsing process are that it is simple, reliable, and easy to understand and modify. Furthermore, it does not hard-code the website’s structure and it allows adding new supported websites on-the-fly.
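The sketch below illustrates the same clean-transform chain in Python with lxml, purely for exposition (CommentWatcher itself implements it in Java); the URL and the per-website definition file "example_site.xsl" are hypothetical.

```python
from urllib.request import urlopen
from lxml import etree, html

def fetch_forum(url: str, xslt_path: str):
    raw = urlopen(url).read()                       # download the raw HTML page
    cleaned = html.fromstring(raw)                  # clean it into a well-formed tree
    transform = etree.XSLT(etree.parse(xslt_path))  # load the website's XSLT definition file
    unified = transform(cleaned)                    # map it onto the unified XML schema
    return unified                                  # ready to be deserialized and stored

# posts_xml = fetch_forum("http://example.org/forum/thread42", "example_site.xsl")
```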

7.3.4 Topic extraction and textual classification

This module allows extracting topics from the texts of a selection of forums already fetched in the database. The design is modular, the extraction itself being performed by external libraries. The text from the selected forums is prepared and packaged in the format required by the topic extraction library and then passed to the library. The user interface allows setting the parameters for each library. Once the extraction is finished, the results are saved into an XML document, which has the same format for all topic extraction libraries. The XML document contains the expressions associated with each topic and their scores. At present, CommentWatcher supports two topic extraction algorithms, provided by two libraries: Topical N-Grams [Wang et al. 2007], provided by the Mallet Toolkit library [McCallum 2002], and CKP [Rizoiu et al. 2010], provided by the CKP library. Topical N-Grams is a graphical model algorithm, which models topics as probability distributions over n-grams. CKP, which has already been presented in Section 6.4.1 (p. 141), uses overlapping textual clustering (one text can belong to multiple clusters) and considers each cluster of the partition as a topic. The expressions stored in the XML result document are either (i) the resulting n-grams (for Topical N-Grams) or (ii) the frequent expressions (for CKP). Their score is (i) the probability with which an n-gram is associated with a topic (for Topical N-Grams) or (ii) 1 − d(ei, µ), where d(ei, µ) is the normalized cosine distance between the frequent expression ei and the topic’s centroid µ (for CKP). Support for new algorithms and libraries can be added easily, but it requires writing adapters for the inputs and outputs.
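For the CKP case, the expression score can be sketched as follows, assuming d(ei, µ) is the cosine distance rescaled to [0, 1] (the exact normalization used by the library is not restated here).

```python
import numpy as np

def ckp_expression_score(e: np.ndarray, mu: np.ndarray) -> float:
    # 1 - d(e, mu), with d the cosine distance mapped from [0, 2] to [0, 1].
    cos_sim = np.dot(e, mu) / (np.linalg.norm(e) * np.linalg.norm(mu) + 1e-12)
    d = (1.0 - cos_sim) / 2.0
    return 1.0 - d
```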

7.3.5 Visualization

Temporal topic visualization The visualization module is designed to help the user quickly understand the extracted topics and visualize their temporal evolution. It is the only module that is executed client-side, in a Java Applet. After the XML object resulting from the topic extraction is loaded by the applet, two visualizations are available: the expression cloud and the temporal evolution graphic. Figure 7.3 shows a screenshot with the two visualizations. The expression cloud visualization is similar to the word cloud visualization, with the exception that it uses the expressions generated by the topic extraction module, their sizes being proportional to their scores. The temporal evolution graphic portrays the popularity of each topic over time. The time is discretized into a configurable number of intervals, the user posts associated with each topic in each interval are counted, and graphics are generated for each forum or for each hosting website.
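The temporal evolution graphic boils down to counting, per time interval, the posts associated with each topic; a minimal sketch of this discretization is given below (the data layout is hypothetical).

```python
from collections import Counter
from datetime import timedelta

def topic_popularity(posts, n_intervals):
    # posts: list of (timestamp, topic_id) pairs; returns one Counter per interval.
    times = [t for t, _ in posts]
    start, end = min(times), max(times)
    step = (end - start) / n_intervals or timedelta(seconds=1)  # guard against a single instant
    counts = [Counter() for _ in range(n_intervals)]
    for t, topic in posts:
        idx = min(int((t - start) / step), n_intervals - 1)
        counts[idx][topic] += 1
    return counts
```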

Social network visualization To facilitate the exploration of the interactions between the members of the forum, we compute a visualization of the underlying social network.

Figure 7.3 – The expression cloud visualization of topics and their temporal evolution.

Figure 7.4 – Visualizing the constructed social network, enriched with topical and user features. One can see, inter alia, that the reply of “Robert” to “David VIETI” is associated with topic #5.

The network is colored according to the topics on which the users are interacting. We construct the social network as a labeled multidigraph, as shown in [Forestier et al. 2011]. We map the network nodes onto the authors of messages, and we add an arc labeled with a topic between two nodes when there is, between the two users, at least one direct reply belonging to the respective topic. We further enrich the network with user features, such as the number of posts, the number of topics a user participates in, the number of threads a user participates in, etc. Further measures are calculated on the graph, such as the weighted in- and out-degree, the betweenness centrality and the closeness centrality.
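The construction sketched below mirrors this description using the networkx library (the actual visualization relies on the Jung Graph Library in Java; the reply data is hypothetical).

```python
import networkx as nx

G = nx.MultiDiGraph()
replies = [("Robert", "David", 5), ("Alice", "Robert", 5), ("Alice", "David", 2)]
for author, target, topic in replies:      # one arc per direct reply, labeled with its topic
    G.add_edge(author, target, topic=topic)

for node in G.nodes:                       # enrich nodes with simple user features
    G.nodes[node]["n_posts"] = G.out_degree(node)
    G.nodes[node]["n_topics"] = len({d["topic"] for _, _, d in G.out_edges(node, data=True)})

simple = nx.DiGraph(G)                     # collapse parallel arcs for the centrality measures
betweenness = nx.betweenness_centrality(simple)
closeness = nx.closeness_centrality(simple)

topic_5 = nx.MultiDiGraph()                # filter the relations belonging to topic #5
topic_5.add_edges_from((u, v, d) for u, v, d in G.edges(data=True) if d["topic"] == 5)
```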

Figure 7.4 shows how CommentWatcher displays the induced social network. The visualization is created with the Jung Graph Library 12 and is interactive, so nodes can be selected in order to see their features. Relations can also be filtered in order to show only the network corresponding to certain topics.

12. http://jung.sourceforge.net

7.4 License and source code

CommentWatcher is released under the open-source license GNU General Public License version 3 (GNU GPL v3) 13. The individual topic extraction and textual clustering software packages are covered by their respective licenses. The present version of CommentWatcher comes with two Natural Language Processing toolkits: the Mallet Toolkit [McCallum 2002] v2.0.7, released under the open source Common Public License, and CKP [Rizoiu et al. 2010] v0.2, released under the GNU GPL v3. The install files and the source code of CommentWatcher are available through a public Mercurial repository 14.

7.5 Conclusion and future work

Conclusion This chapter presented CommentWatcher, an open source web-based platform for analyzing discussions on web forums. Our tool is designed for end-users as well as for researchers. End-users have at their disposal an easy-to-use, integrated tool that retrieves forum discussions from multiple websites, performs topic extraction to identify the main discussion topics and provides an expression cloud visualization to identify the most important expressions associated with each topic. The temporal popularity of topics can be evaluated using an evolution graphic. CommentWatcher also features the extraction of the underlying social network, using the direct citation links between users. The visualization of the social network is interactive: the features of nodes can be visualized and relations can be filtered to show only the network corresponding to a certain topic. For researchers, CommentWatcher tackles the problem of creating multi-source web forum datasets, thanks to its versatile parser, which is independent of the structure of webpages. Support for new websites can be added on-the-fly. It also solves the problem of copyright when sharing forum datasets, since no text is distributed and each researcher can easily recreate the dataset.

Current and future development With the beginning of the Crtt-Eric project, CommentWatcher has officially become one of the academic software packages supported by the DMD (Data Mining and Decision) team of the ERIC laboratory. It is currently the center point of multiple student research and development internships and has recently acquired a dedicated machine. The ongoing development focuses on (i) a better plotting of the social network, by using force-directed graph drawing, (ii) integrating a temporal topic model algorithm and (iii) importing external data into the discussion database, in order to be able to treat other types of discussions (e.g., chat discussions). As future work, we intend to add a credential mechanism and transform CommentWatcher into a multi-user tool, and we consider implementing a topic evaluation based on ontologies of concepts, as presented in Section 6.4.2 (p. 148).

13. http://www.gnu.org/licenses/ 14. http://eric.univ-lyon2.fr/~commentwatcher/cgi-bin/CommentWatcher.cgi/CommentWatcher/

Chapter 8 Conclusion and Perspectives

Contents
8.1 Thesis outline ...... 177
8.2 Original contributions ...... 179
8.3 General conclusions ...... 180
8.4 Current and Future work ...... 182
8.4.1 Current work ...... 183
8.4.2 Future work ...... 184

This final chapter of my thesis is dedicated to drawing some general conclusions, briefly presenting the current work and outlining directions for future work. The chapter is structured into four parts: a summary of the work, the original ideas, the conclusions and an outline of future work. In Section 8.1, we present, for each chapter, an outline of the original contributions and their positioning relative to existing methods. The three most important original ideas in our work are singled out in Section 8.2 and shown in the context of the publications they generated. The conclusions follow in Section 8.3 and present a meta-view of our work, underlining how the directive guidelines manifested themselves in our work, the different transverse links that appear between the different parts and their conceptual articulation. Finally, Section 8.4 closes the chapter by detailing the ongoing work, as well as the planned extensions of the research presented in this thesis.

8.1 Thesis outline

The purpose of this section is to present an overview of the work presented in this thesis and, most notably, to position our contributions relative to the current state of the art. It also follows the logical flow of ideas throughout our work and creates a summary of the tasks and accomplishments. All the work presented in this thesis lies at the intersection of Complex Data Analysis and Semi-Supervised Clustering. More specifically, we address two research challenges: (i) embedding semantics into data representation and machine learning algorithms and (ii) leveraging the temporal dimension. We investigate how data of different natures can be analyzed, while considering the temporal dimension and the additional information that may come with the data. Given the great heterogeneity of complex data (e.g., different natures, temporal dimension), our work touches on many different aspects of Machine Learning. Our research ranges from introducing partial supervision into clustering, to using ontologies for topic evaluation and to extracting visual features from images. After a chapter in which we present an overview of the domain, we present our original research in four chapters: (a) one for the temporal dimension, (b) a second one dedicated to reconstructing a feature set, with a direct application in re-organizing user-supplied additional information under the form of labels, and the last two corresponding to the considered natures of complex data ((c) text and (d) images). A fifth chapter is dedicated to CommentWatcher, an academic software, the result of the applied research and in close connection with the work on the textual dimension. Given the great diversity of the approached subjects, each chapter contains dedicated sections presenting the related work in the given field, the proposed contributions and partial conclusions. Therefore, each of the five chapters can be seen as autonomous, while remaining connected by the general research topics and the transverse links between them (some detailed in Section 8.3).

Chapter 1 starts by positioning our work and presenting the general context of the thesis. It is followed, in Chapter 2, by an overview of the two domains around which our work revolves: Complex Data Analysis and Semi-Supervised Clustering.

Chapter 3 addresses the task of leveraging the temporal dimension of complex data in clustering. Relative to existing methods, the work proposed in this chapter introduces (a) a new temporal-aware dissimilarity measure, which combines the descriptive dimension with the temporal dimension and allows the fine-tuning of their ratio; (b) a new penalty function, calculated using a function inspired by the Normal Distribution, to ensure a contiguous segmentation of the observations belonging to an entity. Unlike existing solutions, which are basically threshold functions, our proposal inflicts a high penalty for breaking the constraints for observations close in time and a low penalty for distant observations; (c) a novel temporal-driven constrained clustering algorithm, called TDCK-Means, which creates a partition of coherent clusters, both in the multidimensional space and in the temporal space; (d) a new measure (ShaP), to evaluate the contiguity of the segmentation of the series of observations belonging to an entity and (e) a new method to infer social roles in a social network as a mixture of temporal behavioral roles.
As far as we know, this is the first proposal to infer social roles as a succession of temporal states; previous solutions concentrate on calculating a set of measures on the network’s graph.

Chapter 4 regroups our research concerning the task of semantic data representation reconstruction. In this chapter, we address the problem of improving the representation space of the data by using the underlying semantics of the dataset. Relative to existing methods, the work proposed in this chapter introduces (a) two algorithms that construct the new features as conjunctions of the initial features and their negations. The constructed features are more appropriate for describing the dataset and, at the same time, are comprehensible for a human user. The methods presented so far in the literature either construct non-comprehensible features (e.g., PCA, the kernel of SVM) or construct comprehensible features in a supervised way. As far as we know, this is the first solution for an unsupervised construction of comprehensible features. We also propose (b) a measure to quantify the total co-occurrence of a feature set; (c) a method, based on statistical testing, for setting the value of parameters; (d) a method, based on statistical considerations, for pruning the candidate pairs of correlated features.

Chapter 5 presents our research concerning image data, and more specifically the task of improving image representation using semi-supervised visual vocabulary construction. We are interested in using expert knowledge, under the form of non-positional labels attached to the images, in the process of creating the image numerical representation. Relative to existing methods, the work proposed in this chapter introduces (a) a method for constructing a dedicated visual vocabulary starting from features sampled only from a subset of labeled images. This ensures that the generated visual vocabulary has words adapted to describing each of the objects appearing in the image collection; (b) a novel method for filtering features irrelevant to a given object, using non-positional labels. We show that the filtering approach consistently improves the accuracy of a content-based image classification task.

Chapter 6 presents in detail our research concerning textual data; more precisely, we are interested in the tasks of topic extraction and evaluation. We make an in-depth review of the topic extraction and evaluation literature, while referencing methods related to our general domain of interest (e.g., incorporating the temporal dimension or external semantic knowledge). Relative to existing methods, the work proposed in this chapter introduces (a) a textual clustering-based topic extraction system and (b) a topic evaluation system. Most of the solutions present in the literature use statistical measures (e.g., the perplexity) to assess the fitness of topics. Our solution uses an external semantic resource, such as WordNet, to evaluate the semantic cohesion of topics.

Chapter 7 presents the practical prototype production, most notably CommentWatcher, an open source tool for analyzing discussions on web forums. Most of the solutions present in the literature have the inconvenience of being either (a) proprietary (i.e., they cannot be used for scientific purposes) or (b) unable to crawl web sources, so that they can only be used to analyze formatted datasets provided by the user. CommentWatcher has the advantage of being constructed as a web platform, giving the possibility of being used collaboratively.
It also allows by-passing the problem of copyright issues: the platform is open-source and the forum benchmark needs not be distributed (each researcher can locally recreate the benchmark). CommentWatcher features (i) automatic fetching of forums, using a versatile parser architecture, (ii) topic extraction from a selection of texts and (iii) a temporal visualization of the extracted topics and of the underlying social network of users.

8.2 Original contributions

We summarize our work by presenting three of the most important ideas of our research. Most of the original proposals presented in this thesis are related to these ideas:
– taking into account both the temporal dimension and the descriptive dimension in a clustering framework (in Chapter 3). The resulting clusters are coherent from both the temporal and the descriptive point of view. Constraints are added to ensure the contiguity of the entity segmentation. This idea is presented in a paper in the proceedings of the International Conference on Tools with Artificial Intelligence (ICTAI ’12), which won the best student paper award [Rizoiu et al. 2012];
– unsupervised construction of a feature set based on the co-occurrences issued from the dataset (in Chapter 4). This allows adapting a feature set to the dataset’s semantics. The new features are constructed as conjunctions of the initial features and their negations, which renders the result comprehensible for the human reader. This idea was published in an article in the Journal of Intelligent Information Systems (JIIS) [Rizoiu et al. 2013a];
– using non-positional user labels (denoting objects) to filter irrelevant visual features

and to construct a semantically aware visual vocabulary for a “bag-of-feature” image representation (in Chapter 5). Even if the position of objects is unknown, we use the information about the presence of objects in images to detect and remove features unlikely to belong to the given object. Dedicated visual vocabularies are constructed, resulting in a numerical description which yields a higher object categorization accuracy. This idea is presented in an article under review with the International Journal of Artificial Intelligence Tools (IJAIT) [Rizoiu et al. 2013b].
Throughout this manuscript, multiple contributions were proposed, some of which derive from the major ideas presented above. Others are related to the textual nature of complex data and to the prototype production, and were published in the proceedings of the International Joint Conference on Artificial Intelligence (IJCAI ’11) [Musat et al. 2011b], the International Symposium on Methodologies for Intelligent Systems (ISMIS ’11) [Musat et al. 2011a], the French national conference Extraction et Gestion des Connaissances (EGC ’10) [Rizoiu et al. 2010] and a chapter in the book Ontology Learning and Knowledge Discovery Using the Web: Challenges and Recent Advances [Rizoiu & Velcin 2011]. A complete list of the publications issued from the research presented in this thesis can be found in Annex B.

8.3 General conclusions

There are two main research challenges that our work addresses: (i) embedding semantics into data representation and machine learning algorithms and (ii) leveraging the temporal dimension. The two challenges are linked to (a) the applications concerning the structuring of the immense quantities of complex data produced by the Web 2.0 and (b) the embedding of semantics into webpages, which is the object of the Semantic Web. The general context of my thesis lies at the intersection of Complex Data Analysis and Semi-Supervised Clustering. In our work concerning the semantic representation, we concentrate on leveraging additional external information, under the form of expert-provided information (e.g., image labels in Chapter 5) and external semantic resources (e.g., concept ontologies in Chapter 6), while our work concerning the temporal dimension of data is presented in Chapter 3. As mentioned earlier, our work is composed of four distinct, yet complementary parts. The four parts deal, respectively, with (a) the temporal dimension, (b) semantic data representation and the different natures of complex data, i.e., (c) image and (d) text. Each part is dealt with in an individual chapter, which contains an overview of the state of the art of the domain, the proposals, conclusions about the work and some plans for future work. A fifth chapter is dedicated to my applied work. Therefore, each of the five chapters can be seen as autonomous, while remaining connected by the directive guidelines, the transverse links between them and their conceptual articulation. We detail in the following paragraphs how each of the above were taken into account in our work.

Directive guidelines While the different aspects of our work are distinct, they are not independent from one another. Each deals with different particularities of complex data, while respecting the same directive guidelines: (i) human comprehension, (ii) translating data of different natures into a semantic-aware description space and (iii) devising algorithms and methods that embed semantics and the temporal component.
We consider it crucial to generate human-comprehensible outputs, and we have stressed this aspect all along this thesis. Our proposal for feature construction, in Chapter 4, was partly motivated by this need. Black-box approaches (i.e., the feature extraction and feature selection algorithms presented in Section 4.2, p. 63) exist and they can effectively reduce the co-occurrences in the feature set, but the new features are completely synthetic and make results difficult to interpret. Human comprehension is also central for topic labeling, presented in Chapter 6: we argue that a complete expression is more meaningful for a human being than a probability distribution over words. Similarly, when generating the typical evolution phases in Chapter 3, human comprehension motivates the choice of segmenting contiguously the observations corresponding to an entity.
As shown in Section 2.1 (p. 9), some of today’s challenges concerning complex data lie in (a) rendering the data into a common usable numeric format, which succeeds in capturing the information present in the native format, and in (b) efficiently using external information for improving the numeric representation. The second directive guideline throughout our work is translating data of different natures into a semantic-aware description space, which we call the Numeric Vectorial Space. Our work concerning images in Chapter 5 is specifically targeted at embedding semantic information in the image numerical description. Section 6.2 (p. 128) in Chapter 6 is dedicated to translating textual data into the “bag-of-words” numerical representation. Finally, the purpose of the feature construction algorithm in Chapter 4 is to improve the representation of the data, by adapting the features to the dataset they describe.
Finally, a central axis of our research is devising algorithms and methods that embed semantics and the temporal component, based on unsupervised and semi-supervised techniques. Often, additional information and knowledge is attached to the data, under the form of (a) user labels, (b) the structure of interconnected documents or (c) external knowledge bases. We use this additional knowledge at multiple instances, usually using semi-supervised clustering techniques. We use such an approach when constructing a semantically improved image numeric representation, in Chapter 5: dedicated visual vocabularies are constructed starting from the provided labeled subset of images, but the entire image collection (labeled and unlabeled) is used to generate the actual image representation. Furthermore, our work with text also deals with leveraging external semantic knowledge in the topic evaluation process. Similarly, in Chapter 3, we use semi-supervised soft pair-wise constraints to model the temporal dependencies in the data.

Transverse links There are multiple transverse links between the different parts of our work. (a) The feature construction algorithm, uFC, was initially motivated by the need to re-organize the user label set we use to create the semantic-enabled image representation. In Section 5.5 (p. 121), we discuss how the proposed semantic visual vocabulary construction can be adapted to scene classification (where the main problem is label co-occurrence) by using the proposed feature construction algorithm. (b) Our work with textual data is intimately linked with the CommentWatcher software: the text of online discussion forums is retrieved, topics are extracted from it and a social network is inferred using the forum's reply-to relation. The social network is modeled and visualized as a multidigraph, in which links between nodes are associated with topics.

[Figure 8.1 schematic: its blocks include Textual Data, Image Data, User Labels, Topic Extraction, Semantic Image Representation Construction, Unsupervised Feature Construction, Data in the Numeric Vectorial Space, Concept Ontologies, Concept-based Topic Evaluation, Temporal-Driven Constrained K-Means, the Temporal Dimension, Detect Typical Evolutions and Identify Social Roles.]

Figure 8.1 – Conceptual integration of the work in the thesis.

(c) Furthermore, the temporal-driven clustering algorithm TDCK-Means is applied to detect social roles in the social network: behavioral roles are first identified, as shown in Section 3.6 (p. 52), and social roles are inferred as successions of behavioral roles. (d) Finally, we have ongoing work that deals with embedding the temporal dimension into the feature construction algorithm. The idea is to detect whether features are correlated with a certain time lag. We give more details about this ongoing work in Section 8.4.1.

Conceptual articulation of the different parts The semantic-enabled numeric representation space is also the conceptual joining point of the different parts of our research, as shown in Figure 8.1. Given the breadth of the subjects addressed, the schema is not a blueprint of an integrated system; rather, it is intended to give the reader an overview of the articulation between the different parts of our work: data of different natures is translated into a semantic-aware numeric format, which is afterwards used together with the temporal dimension or with external knowledge bases. This space is not common to all the types of data considered and, therefore, we treat each data type individually in our work. One of the long-term plans of our work is a broader integration of all the information provided by complex data (e.g., dealing simultaneously with data of different natures, the temporal component and expert knowledge).

The schema presented in Figure 8.1 was incrementally constructed in Figures 3.1 (p. 30), 4.17 (p. 93), 6.13 (p. 165) and 5.13 (p. 122). At the end of each chapter, the reader was shown how the work presented in that chapter can be conceptually integrated with the work of the previous chapters.

8.4 Current and Future work

The domain of this thesis is vast and plenty of work still remains. Future work for each of the parts has already been outlined in the corresponding chapters. The remainder of this section revisits some of these plans and presents them in a broader context. We divide our research plans into current work and future work, depending on whether they are ongoing or only planned. The order of presentation roughly reflects how advanced each idea is, especially for the current work.

8.4.1 Current work

Detecting behavioral roles At present, we are simultaneously working on multiple extensions and applications of our work. One direction of our current work is to apply TDCK-Means to the detection of user social roles in social networks. This work is performed in collaboration with the Technicolor Laboratories in Rennes, France, and it is in an advanced state: the submission of an article is already planned. The reader has already seen a more detailed description of this work in Section 3.6 (p. 52). Our underlying hypothesis is that, when interacting in an online community, a user plays multiple roles over a given period of time. We assume that these roles are temporally coherent (i.e., a user's activity is uniformly similar while in a role) and that a user can switch between roles. We call these roles behavioral roles and we construct a global social role as a mixture of different behavioral roles, which incorporates the dynamics of behavioral transitions. Therefore, we define the user social role as a succession of behavioral roles. The social roles of users are constructed based on the social network inferred from online discussion forums. The TWOP dataset [Anokhin et al. 2012] that we use is built from the Television Without Pity 1 forum website. We briefly showed in Chapter 7 how such a social network can be constructed (when we presented CommentWatcher, our tool designed to analyze online discussion forums). The nodes of the graph are the users posting in the forums; a directed arc is added between users A and B when A replies to B. Social roles are identified in a three-phase framework: (a) behavioral features are identified based on the structure of the inferred social network, (b) behavioral roles are created using TDCK-Means and (c) the user social roles are determined based on the transitions between behavioral roles. Section 3.6 (p. 52) describes this process in more detail, along with some preliminary results.
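To make the three phases more concrete, the sketch below shows, under simplifying assumptions, how the reply-to network and a first set of behavioral roles could be derived. The posts structure, the toy degree-based features and the use of scikit-learn's KMeans as a stand-in for TDCK-Means are illustrative assumptions only, not the implementation actually used in this work.

# Minimal sketch of the three-phase social-role pipeline described above.
# Assumptions: `posts` is a list of (author, replied_to_author, timestamp) tuples,
# and scikit-learn's KMeans stands in for TDCK-Means, which is not reproduced here.
import networkx as nx
from sklearn.cluster import KMeans

def build_reply_graph(posts):
    """Phase (a): infer the social network; an arc A -> B is added when A replies to B."""
    g = nx.MultiDiGraph()
    for author, replied_to, ts in posts:
        g.add_edge(author, replied_to, time=ts)
    return g

def behavioral_features(g):
    """Toy behavioral features derived from the graph structure (one vector per user)."""
    return {u: [g.out_degree(u), g.in_degree(u)] for u in g.nodes()}

def behavioral_roles(features, n_roles=4):
    """Phase (b): cluster user descriptions into behavioral roles (stand-in for TDCK-Means)."""
    users = sorted(features)
    labels = KMeans(n_clusters=n_roles, n_init=10).fit_predict([features[u] for u in users])
    return dict(zip(users, labels))

# Phase (c) would then read, for each user, the sequence of behavioral roles over
# successive time slices and define the social role as that succession of roles.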

Inferring a graph structure for temporal clusters In Chapter 3, when we constructed our temporal clustering algorithm, TDCK-Means, we concentrated on the temporal and descriptive coherence of clusters, as well as on the contiguous segmentation of the observations belonging to an entity. We are currently working on an extension of TDCK-Means which also organizes the resulting temporal clusters into a graph structure. This would be very useful for the human comprehension of the constructed evolution phases (see the discussion on the human comprehensibility guideline of our work in Section 8.3): the trajectory of an individual through an evolution graph is easier to follow and more informative about the relations between phases.

1. http://www.televisionwithoutpity.com/

We are working on (i) a temporal distance between clusters and (ii) a function that quantifies the intersection of two clusters. We plan to extend the objective function defined in Equation 3.7 (p. 39) with these two functions, which quantify the relation between two clusters. With these modifications, the graph's adjacency matrix could be inferred simultaneously with the temporal clusters.
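As an illustration of how such pairwise quantities could yield an adjacency matrix, the sketch below summarizes each temporal cluster by its time interval and the set of entities it covers, and links clusters that follow each other closely in time while sharing entities. The interval-based distance, the overlap measure and the thresholds are assumptions made for this example; they are not the functions that will actually extend Equation 3.7.

# Illustrative sketch only: a simple way to derive an adjacency matrix between
# temporal clusters once each one is summarised by a time interval and an entity set.
import numpy as np

def temporal_distance(a, b):
    """Gap between the two clusters' time intervals (0 if they overlap)."""
    return max(0.0, max(a["start"], b["start"]) - min(a["end"], b["end"]))

def overlap(a, b):
    """Shared entities, normalised by the smaller cluster."""
    common = len(a["entities"] & b["entities"])
    return common / max(1, min(len(a["entities"]), len(b["entities"])))

def adjacency(clusters, max_gap=1.0, min_overlap=0.3):
    """Arc i -> j when cluster i precedes j closely in time and they share enough entities."""
    n = len(clusters)
    A = np.zeros((n, n))
    for i, ci in enumerate(clusters):
        for j, cj in enumerate(clusters):
            if i != j and ci["end"] <= cj["start"] \
               and temporal_distance(ci, cj) <= max_gap \
               and overlap(ci, cj) >= min_overlap:
                A[i, j] = 1
    return A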

Temporal feature construction Another research direction on which we have already started working is incorporating temporal information into the feature construction algorithm presented in Chapter 4. From a research challenge point of view, this work will allow the integration of our two research challenges, with which we deal individually for now. This work is motivated by the fact that the introduction of temporal information changes the problem definition. The datasets used in Chapter 4 have no temporal evolution. The building block of the uFC algorithm is feature co-occurrence. We have argued that this co-occurrence is not due to chance, but has a semantic meaning. For example, “manifestation” co-occurs with “urban” because manifestations usually take place in cities. With the introduction of temporal information, new questions arise and new semantic information can be induced. One such question is: what does co-occurrence mean in a temporal context? Some features might co-occur, but not simultaneously. For example, the coming to power of a socialist government and the increase of the country's public deficit might be correlated, but with a time lag, since macro-economic indicators have considerable inertia. The purpose of this work is to detect such “correlations at a time lag” and create new features such as “socialist” and “public_deficit” co-occur at a time lag δ. We have extended the correlation coefficient defined in Equation 4.5 (p. 70) to calculate the correlation with a given fixed lag δ. The experiments we have performed so far show that an “optimum” lag δ, which maximizes the temporal correlation, can be determined. We are currently working on an extension of the “and” operator to the temporal case. The new features are no longer constructed as boolean expressions, but as temporal chains of the form f_i →(δ1) f_j →(δ2) f_k, meaning that f_i precedes f_j at a time distance δ1 and f_j precedes f_k at a time distance δ2. This approach allows us to improve our feature construction algorithm so that it uses the temporal dimension, in addition to data semantics, to improve the data representation. In addition, the newly constructed temporal features can be used as easily comprehensible labels for the temporal clusters extracted using TDCK-Means.
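A minimal sketch of the lag search is given below, assuming that each feature is available as a presence/absence series aligned over time. Pearson correlation is used here only as a stand-in for the correlation coefficient of Equation 4.5 (p. 70), which is not reproduced.

# Sketch: find the lag delta that maximises the correlation between two feature series.
# fi and fj are assumed to be presence/absence series of equal length, aligned in time.
import numpy as np

def lagged_correlation(fi, fj, delta):
    """Correlation between fi at time t and fj at time t + delta (0 if undefined)."""
    if len(fi) - delta < 2:
        return 0.0
    x = np.asarray(fi[:len(fi) - delta], float)
    y = np.asarray(fj[delta:], float)
    if x.std() == 0 or y.std() == 0:
        return 0.0
    return float(np.corrcoef(x, y)[0, 1])

def best_lag(fi, fj, max_delta=10):
    """Return the lag in [0, max_delta] that maximises the temporal correlation."""
    return max(range(max_delta + 1), key=lambda d: lagged_correlation(fi, fj, d))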

8.4.2 Future work

One of our short-term plans is devising a method for automatically setting the values of TDCK-Means's parameters (α, β and δ), using an approach inspired from multi-objective optimization with evolutionary algorithms [Zhang & Li 2007]. The idea is to transform the learning process into a multi-objective learning problem. Each of the additive components of the objective function defined in Equation 3.7 (p. 39) becomes an objective (in the sense of multi-objective optimization). Multiple instances of TDCK-Means, with multiple initial combinations of values for α, β and δ, will be launched simultaneously and a small seed set of solutions will be obtained. Using evolutionary algorithms, the seed set can be evolved to approximate the Pareto front (in the space defined by the different objectives). In the end, it suffices to choose, on the generated Pareto front, a compromise solution, for example using a technique similar to the one employed in Section 4.5.2 (p. 74). Such a technique would eliminate the need to set the parameters arbitrarily and would guarantee an “optimum” (meaning Pareto non-dominated) solution, given certain criteria.
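The sketch below illustrates only the non-dominance filtering step assumed in this plan: given candidate (α, β, δ) settings and the values of the objective components (all to be minimized), it keeps the non-dominated candidates. The candidate values are made up for the example; how the objective components are computed by TDCK-Means is not reproduced here.

# Sketch of Pareto non-dominance filtering over candidate parameter settings.
def dominates(u, v):
    """True if objective vector u dominates v (no worse everywhere, strictly better somewhere)."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def pareto_front(candidates):
    """candidates: list of (params, objectives); return the non-dominated subset."""
    return [(p, obj) for p, obj in candidates
            if not any(dominates(other, obj) for _, other in candidates if other is not obj)]

# Example with made-up objective values for three (alpha, beta, delta) settings:
seed = [((0.3, 0.3, 0.5), (1.2, 0.8)),
        ((0.5, 0.2, 0.5), (1.0, 1.1)),
        ((0.4, 0.4, 0.5), (1.3, 1.2))]
print(pareto_front(seed))  # the third setting is dominated by the first and is dropped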

Another planned extension is adapting the image representation construction proposal to incomplete labeling. This boils down to adapting the feature construction algorithm to data originating from the Internet (e.g., labels on an image sharing platform). We have made several assumptions about the feature set (in Chapter 4) and the label set (in Chapter 5). For comprehensibility reasons, in the remainder of the discussion we consider the two as being the same problem in two different contexts and we denote both as labels. Throughout Chapters 4 and 5, we considered the labeling to be complete: if a label is not present, then the corresponding object is absent. This assumption implies binary labeling, where true means presence and false means absence. In the case of real-world labeling, the absence of a label might also mean that the user forgot, or chose not, to label the given image/document. Therefore, a value of false is no longer a sure indicator of the absence of the given object. For example, when a user is labeling an image depicting a cascade and has a choice between water, cascade or both, he/she might choose only cascade, as it is the most specific. This adds new challenges for both (i) the feature construction algorithm (i.e., the co-occurrence of water and cascade is no longer present) and (ii) the filtering algorithm for image representation construction (i.e., a keypoint belonging to a cascade is not different from a keypoint belonging to water).

Our short-term plans for the applied part of our work are closely related to textual data. We intend to implement the proposed topic evaluation using a concept hierarchy in CommentWatcher. This, alongside the integration of other topic extraction algorithms, will make it possible to perform a thorough comparative evaluation of topic models in the context of texts originating from the Internet. Most of the extensions planned for CommentWatcher concern the visualization module, and more specifically the generation of the social network based on the discussion forums. Other, longer-term plans include extending (a) the natures of complex data that can be processed and (b) the knowledge that can be used as additional information (e.g., processing video, using knowledge from the Semantic Web etc.). In the foreseeable future, the treatment of structured data of multiple natures will be integrated into CommentWatcher and it will become a veritable “media mining” platform, capable of retrieving and analyzing text and images from discussion forums, online news media and other imaginable online sources.

The long-term goal, for both the theoretical and the applied parts of our work, is a broader integration of all the information provided by complex data: text, image, video, audio, numerical measurements, the temporal dimension, user labels and ontologies. Throughout this manuscript, we have shown the reader how to solve a number of different tasks (e.g., how to take time into account, how to evaluate topics using external resources), but the real objective is devising algorithms and data representations that are capable of taking advantage of every available piece of information and of constructing a complete knowledge inference system.

Appendix A Participation in Research Projects

A.1 Participation in projects

Some of the learning tasks presented in Chapter 1 were partially motivated by the specific problems and applications of the different research projects in which I was involved. I present, in Table A.1, the list of these projects along with my contributions to them.

A.2 The ImagiWeb project

On a daily basis, millions of people post their opinions on the Web 2.0 and discuss various topics such as the news, politics, the latest athletics results, etc. These postings contribute to the production and dissemination of the image of different entities, such as politicians or companies. The image, as we define it here, is a structured and dynamic representation which can be seen in at least two ways: the representation that an entity wishes to assign to itself, and the view that a person or a group of persons has of this entity. Thus, the Internet appears to play a privileged role in disseminating, strengthening and imposing representations and opinions, and is a place where the logic of influence is at work. In this framework, the ImagiWeb project aims precisely at studying the image of entities of various kinds (companies, politicians, etc.) as it is diffused and viewed on the Internet. The study of these representations and their dynamics is considered today a real challenge which, if resolved (even partially), will not only address specific needs, especially in the field of press monitoring, but will also answer important current issues in political science and sociology in general. The project proposes two major novelties. The first is to jointly address a set of issues treated separately so far (i.e., the study of opinions, taking context into account, topic evolution, social network analysis, study of the topology of the Web) around a common object, the image (in the sense of representation) of the entities that populate the Web. Emphasis is given to the fact that one entity can be associated with several different images, and also to the underlying temporal aspect of the dynamics of these images. The second novelty concerns the involvement in the project of researchers from the Social Sciences and Humanities, which is still quite rare in such computer science projects.

1. https://research.technicolor.com/rennes/
2. http://eric.univ-lyon2.fr/~jvelcin/imagiweb/
3. http://recherche.univ-lyon2.fr/crtt/
4. http://www.elico-recherche.eu/?lang=en
5. http://www.conversationnel.fr/
6. Download here: http://eric.univ-lyon2.fr/~arizoiu/files/discussion-analysis.jar

Table A.1 – List of research projects in which I was involved during my PhD thesis.

– 2012 - present: Research project with Technicolor R&I Labs 1, Rennes. Apply and develop research in the domain of temporal clustering for social role identification in social networks. Contribution: research concerning the learning task of social role identification, detailed in Section 3.6 (p. 52).
– 2012 - present: The ImagiWeb 2 project (financed by the ANR). Analyze the evolution of the image of politicians and enterprises through social media and Twitter. ImagiWeb is further detailed in Section A.2. Contribution: research concerning the task of detecting typical evolutions and temporal data clustering, applied to a political science dataset. More details in Chapter 3.
– 2012 - present: The Crtt 3-Eric project (financed by the Lyon 2 University). Study the evolution of specialized discourse in the domain of nuclear medicine, while taking into account the temporal evolution and the different populations involved (e.g., doctors, nurses, patients). Contribution: research concerning the tasks of topic extraction, labeling and evaluation (details in Chapter 6); applied work and student internship supervision towards developing CommentWatcher.
– 2010 - 2011: The Eric-Elico 4 project (financed by the Lyon 2 University). Analyze the information extracted, either by means of machine learning algorithms or manually, by experts in Communication Sciences. Contribution: research concerning the tasks of topic extraction, labeling and evaluation and content-based image classification; applied work towards developing CommentWatcher.
– 2009: The Conversession project (financed by the Rhône-Alpes region). Develop a new platform for organizing and analyzing online forum discussions. The project was associated with the creation of a start-up enterprise 5. Contribution: research concerning the tasks of topic extraction, labeling and evaluation, and applied work: I developed the prototype 6, dealing with the parsing of web forums, topic extraction and topic labeling.

Thus, the case study about EDF (the French electricity company) will be carried out with the assistance of a semiologist, who will be able to enlighten the automated analysis provided by the computer tools produced during the project. In addition, the involvement of the social scientists of CEPEL will make it possible not only to conduct a relevant study on the image of politicians, but also to provide answers to the issues of representing the data extracted from the Web and characterizing the panels of Internet users. The ImagiWeb project involves six partners: (a) three academic partners, the ERIC Laboratory 7, the CEPEL laboratory 8 and the LIA laboratory 9, and (b) three companies, AMI Software 10, EDF 11 and XEROX 12. The ImagiWeb project is financed by the French National Research Agency (ANR).

7. http://eric.univ-lyon2.fr/
8. http://www.cepel.univ-montp1.fr/
9. http://lia.univ-avignon.fr/
10. http://www.amisw.com/en/
11. http://innovation.edf.com/innovation-et-recherche-20.html
12. http://www.xrce.xerox.com/

Appendix B List of Publications

International journals

– Marian-Andrei Rizoiu, Julien Velcin and Stéphane Lallich. Unsupervised Feature Construction for Improving Data Representation and Semantics. Journal of Intelligent Information Systems, vol. 40, no. 3, pages 501–527, 2013.

Proceedings of international conferences

– Marian-Andrei Rizoiu. Semi-Supervised Structuring of Complex Data. In Doctoral Consortium of International Joint Conference on Artificial Intelligence, Proceedings of the Twenty-Third, IJCAI 2013. AAAI Press, November 2013.
– Marian-Andrei Rizoiu, Julien Velcin and Stéphane Lallich. Structuring typical evolutions using Temporal-Driven Constrained Clustering. In International Conference on Tools with Artificial Intelligence, Proceedings of the Twenty-Fourth, ICTAI 2012, pages 610–617. IEEE, November 2012. Best Student Paper Award.
– Claudiu Musat, Julien Velcin, Stefan Trausan-Matu and Marian-Andrei Rizoiu. Improving topic evaluation using conceptual knowledge. In International Joint Conference on Artificial Intelligence, Proceedings of the Twenty-Second, volume 3 of IJCAI 2011, pages 1866–1871. AAAI Press, 2011.
– Claudiu Musat, Julien Velcin, Marian-Andrei Rizoiu and Stefan Trausan-Matu. Concept-based Topic Model Improvement. In International Symposium on Methodologies for Intelligent Systems, volume 369 of ISMIS 2011, pages 133–142. Springer, June 2011.

National journals and proceedings of national conferences

– Marian-Andrei Rizoiu, Julien Velcin and Jean-Hugues Chauchat. Regrouper les données textuelles et nommer les groupes à l’aide des classes recouvrantes. In Extraction et Gestion des Connaissances, (EGC 10) 10ème Conférence, volume E-19 of Revue des Nouvelles Technologies de l’Information, pages 561–572. Cépaduès, January 2010.
– Claudiu Musat, Marian-Andrei Rizoiu and Stefan Trausan-Matu. An Intra and Inter-Topic Evaluation and Cleansing Method. Romanian Journal of Human-Computer Interaction, vol. 3, no. 2, pages 81–96, 2010.

Book chapters

– Marian-Andrei Rizoiu and Julien Velcin. Topic Extraction for Ontology Learning. In Wilson Wong, Wei Liu and Mohammed Bennamoun, editors, Ontology Learning and Knowledge Discovery Using the Web: Challenges and Recent Advances, chapter 3, pages 38–61. Hershey, PA: Information Science Reference, 2011.

Under review and submitted

– Marian-Andrei Rizoiu, Julien Velcin and Stéphane Lallich. Visual Vocabulary Construction for Image Classification in a Weakly Supervised Context. International Journal of Artificial Intelligence Tools, 2012. Under review.
– Marian-Andrei Rizoiu, Julien Velcin and Stéphane Lallich. How to use Temporal-Driven Constrained Clustering to detect typical evolutions. International Journal of Artificial Intelligence Tools, 2013. Submitted.

Bibliography

[Abouelhoda et al. 2002] Mohamed Abouelhoda, Enno Ohlebusch and Stefan Kurtz. Opti- mal exact string matching based on suffix arrays. In String Processing and Informa- tion Retrieval, pages 175–180. Springer, 2002. (Cited on page 143.) [Aggarwal et al. 2003] Charu C. Aggarwal, Jiawei Han, Jianyong Wang and Philip S. Yu. A Framework for Clustering Evolving Data Streams. In Very large data bases, Pro- ceedings of the 29th International Conference On, pages 81–92, 2003. (Cited on page 18.) [Agrawal & Srikant 1995] Rakesh Agrawal and Ramakrishnan Srikant. Mining sequential patterns. In Data Engineering, 1995. Proceedings of the Eleventh International Con- ference on, pages 3–14. IEEE, 1995. (Cited on page 17.) [Agrawal et al. 1993] Rakesh Agrawal, Tomasz Imieliński and Arun Swami. Mining asso- ciation rules between sets of items in large databases. In ACM SIGMOD Record, volume 22, pages 207–216. ACM, 1993. (Cited on page 17.) [Agresti 2002] Alan Agresti. Categorical data analysis, volume 359. Wiley-interscience, 2002. (Cited on page 63.) [Amer-Yahia et al. 2012] Sihem Amer-Yahia, Samreen Anjum, Amira Ghenai, Aysha Sid- dique, Sofiane Abbar, Sam Madden, Adam Marcus and Mohammed El-Haddad. MAQSA: a system for social analytics on news. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD ’12, pages 653–656, New York, NY, USA, 2012. ACM. (Cited on pages 169 and 170.) [Anaya-Sánchez et al. 2008] Henry Anaya-Sánchez, Aurora Pons-Porrata and Rafael Berlanga-Llavori. A New Document Clustering Algorithm for Topic Discovering and Labeling. In Iberoamerican congress on Pattern Recognition, Proceedings of the 13th, CIARP ’08, pages 161–168, Berlin, Heidelberg, 2008. Springer-Verlag. (Cited on page 138.) [Anokhin et al. 2012] Nikolay Anokhin, James Lanagan and Julien Velcin. Social Citation: Finding Roles in Social Networks. An Analysis of TV-Series Web Forums. In Mining Communities and People Recommenders, The Second International Workshop on, pages 49–56, 2012. (Cited on pages 52, 53, 169 and 183.) [Armingeon et al. 2011] Klaus Armingeon, David Weisstanner, Sarah Engler, Panajotis Po- tolidis, Marlène Gerber and Philipp Leimgruber. Comparative Political Data Set 1960-2009. Institute of Political Science, University of Berne., 2011. (Cited on pages 31, 32 and 43.) [Athanasiadis et al. 2005] Thanos Athanasiadis, Vassilis Tzouvaras, Kosmas Petridis, Fred- eric Precioso, Yannis Avrithis and Yiannis Kompatsiaris. Using a multimedia ontol- ogy infrastructure for semantic annotation of multimedia content. In International Workshop on Knowledge Markup and Semantic Annotation, collocated with Inter- national Semantic Web Conference (ISWC 2005), SemAnnot ’05, Galway, Ireland, November 2005. (Cited on page 104.) 194 Bibliography

[Bar-Hillel et al. 2003] Aharon Bar-Hillel, Tomer Hertz, Noam Shental and Daphna Wein- shall. Learning distance functions using equivalence relations. In Machine Learn- ing, International Workshop then conference, volume 20, page 11, 2003. (Cited on page 25.) [Barnard et al. 2003] Kobus Barnard, Pinar Duygulu, David Forsyth, Nando De Freitas, David M. Blei and Michael I. Jordan. Matching words and pictures. The Journal of Machine Learning Research, vol. 3, pages 1107–1135, 2003. (Cited on page 16.) [Basu et al. 2002] Sugato Basu, Arindam Banerjee and Raymond J. Mooney. Semi- supervised clustering by seeding. In International Conference on Machine Learning, pages 19–26, 2002. (Cited on page 27.) [Basu et al. 2003] Sugato Basu, Mikhail Bilenko and Raymond J. Mooney. Comparing and Unifying Search-Based and Similarity-Based Approaches to Semi-Supervised Clus- tering. In Workshop on the Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining, Collocated with ICML ’03, ICML ’03, pages 42–49, 2003. (Cited on pages 23, 24, 26 and 27.) [Bay et al. 2006] Herbert Bay, Tinne Tuytelaars and Luc Van Gool. Surf: Speeded up robust features. Computer Vision–ECCV 2006, pages 404–417, 2006. (Cited on pages 100 and 103.) [Bekkerman & Jeon 2007] Ron Bekkerman and Jiwoon Jeon. Multi-modal clustering for multimedia collections. In Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on, pages 1–8. IEEE, 2007. (Cited on pages 14 and 15.) [Bellman & Kalaba 1959] Richard Bellman and Robert Kalaba. On adaptive control pro- cesses. Automatic Control, IRE Transactions on, vol. 4, no. 2, pages 1–9, 1959. (Cited on page 19.) [Benjamini & Liu 1999] Yoav Benjamini and Wei Liu. A step-down multiple hypotheses testing procedure that controls the false discovery rate under independence. Journal of Statistical Planning and Inference, vol. 82, no. 1-2, pages 163–170, 1999. (Cited on page 83.) [Bernstein et al. 2010] Michael S. Bernstein, Bongwon Suh, Lichan Hong, Jilin Chen, San- jay Kairam and Ed H. Chi. Eddi: interactive topic-based browsing of social status streams. In UIST ’10, pages 303–312, 2010. (Cited on page 170.) [Berry et al. 1995] Michael W. Berry, Susan T. Dumais and Gavin W. O’Brien. Using Linear Algebra for Intelligent Information Retrieval. SIAM Review, vol. 37, no. 4, pages 573–595, 1995. (Cited on page 133.) [Beyer et al. 1999] Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan and Uri Shaft. When is “nearest neighbor” meaningful? Database Theory, Proceedings of the Inter- natioanal Conference on, vol. ICDT ’99, pages 217–235, 1999. (Cited on page 19.) [Bilenko & Mooney 2003] Mikhail Bilenko and Raymond J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In Knowledge Discovery and Data Mining, Proceedings of the ninth ACM SIGKDD international conference on, pages 39–48. ACM New York, NY, USA, 2003. (Cited on page 25.) Bibliography 195

[Biskri et al. 2004] Ismaïl Biskri, Jean-Guy Meunier and Sylvain Joyal. L’extraction des termes complexes : une approche modulaire semiautomatique. In Gérard Purnelle, Cédrick Fairon and Anne Dister, editors, Analyse Statistique des Données Textuelles, Actes des 7èmes Journées Internationales de, volume 1, pages 192–201. Presses Uni- versitaires de Louvain, 2004. (Cited on pages 137 and 138.) [Bizer et al. 2009] Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören. Auer, Christian Becker, Richard Cyganiak and Sebastian Hellmann. DBpedia-A crystallization point for the Web of Data. Web Semantics: Science, Services and Agents on the World Wide Web, vol. 7, no. 3, pages 154–165, 2009. (Cited on pages2 and 10.) [Blei & Lafferty 2006a] David M. Blei and John D. Lafferty. Correlated Topic Models. In Advances in Neural Information Processing Systems, Proceedings of the 2005 Con- ference on, volume 18, page 147. MIT, 2006. (Cited on pages 136 and 162.) [Blei & Lafferty 2006b] David M. Blei and John D. Lafferty. Dynamic topic models. In International Conference on Machine Learning, Proceedings of the 23rd, pages 113– 120. ACM, 2006. (Cited on page 136.) [Blei & McAuliffe 2008] David M. Blei and Jon D. McAuliffe. Supervised topic models. Advances in Neural Information Processing Systems, vol. 20, pages 121–128, 2008. (Cited on page 140.) [Blei et al. 2003] David M. Blei, Andrew Y. Ng and Michael I. Jordan. Latent dirichlet allocation. The Journal of Machine Learning Research, vol. 3, pages 993–1022, 2003. (Cited on pages 104, 131, 134 and 157.) [Blei et al. 2004] David M. Blei, Thomas L. Griffiths, Michael I. Jordan and Joshua B. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. In Advances in Neural Information Processing Systems, Proceedings of the 18th Annual Conference on, volume 16, pages 106–123. MIT Press, 2004. (Cited on pages 136 and 162.) [Blockeel et al. 1998] Hendrik Blockeel, Luc De Raedt and Jan Ramon. Top-down induction of clustering trees. In Proceedings of the 15th International Conference on Machine Learning, pages 55–63, 1998. (Cited on page 66.) [Bloedorn & Michalski 1998] Eric Bloedorn and Ryszard S. Michalski. Data-driven con- structive induction. Intelligent Systems and their Applications, vol. 13, no. 2, pages 30–37, 1998. (Cited on page 64.) [Blum & Mitchell 1998] Avrim Blum and Tom Mitchell. Combining labeled and unla- beled data with co-training. In Computational Learning Theory, Proceedings of the eleventh annual conference on, COLT 98,´ pages 92–100. ACM, 1998. (Cited on page 104.) [Bourigault 1992] Didier Bourigault. Surface grammatical analysis for the extraction of terminological noun phrases. In Computational Linguistics, Proceedings of the In- ternational Conference on, volume 92, pages 977–981, 1992. (Cited on page 137.) [Boyd-Graber et al. 2007] Jordan Boyd-Graber, David M. Blei and Xiaojin Zhu. A topic model for word sense disambiguation. In Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 196 Bibliography

Proceedings of the 2007, EMNLP-CoNLL 2007, pages 1024–1033, 2007. (Cited on page 140.) [Brillinger 2001] David R. Brillinger. Time series: data analysis and theory, volume 36. Society for Industrial Mathematics, 2001. (Cited on page 17.) [Buitelaar et al. 2005] Paul Buitelaar, Philipp Cimiano and Bernardo Magnini. Ontology Learning from Texts: An Overview. In Paul Buitelaar, Philipp Cimiano and Bernardo Magnini, editors, Ontology Learning from Text: Methods, Evaluation and Applica- tions, volume 123 of Frontiers in Artificial Intelligence and Applications. IOS Press, 2005. (Cited on pages 137, 161 and 162.) [Cai et al. 2004] Deng Cai, Xiaofei He, Zhiwei Li, Wei-Ying Ma and Ji-Rong Wen. Hi- erarchical clustering of WWW image search results using visual, textual and link information. In International Conference on Multimedia, Proceedings of the 12th annual ACM, pages 952–959. ACM, 2004. (Cited on page 15.) [Cattell 1966] Raymond B. Cattell. The Scree Test For The Number Of Factors. Multivari- ate Behavioral Research, vol. 1, no. 2, pages 245–276, 1966. (Cited on page 134.) [Chang et al. 2009a] Jonathan Chang, Jonathan Boyd-Graber, Sean Gerrish, Chong Wang and David M. Blei. Reading Tea Leaves: How Humans Interpret Topic Models. In Advances in Neural Information Processing Systems, Proceedings of the 23rd Annual Conference on, volume 31 of NIPS 2009, 2009. (Cited on pages 140, 148, 149, 150 and 157.) [Chang et al. 2009b] Jonathan Chang, Jordan Boyd-Graber and David M. Blei. Connec- tions between the lines: augmenting social networks with text. In International Con- ference on Knowledge Discovery and Data Mining, Proceedings of the 15th ACM SIGKDD, pages 169–178. ACM, 2009. (Cited on page 136.) [Chang et al. 2010] Yin-Wen Chang, Cho-Jui Hsieh, Kai-Wei Chang, Michael Ringgaard and Chih-Jen Lin. Training and testing low-degree polynomial data mappings via linear SVM. The Journal of Machine Learning Research, vol. 99, pages 1471–1490, 2010. (Cited on page 20.) [Chapelle et al. 2006] Olivier Chapelle, Bernhard Schölkopf and Alexander Zien. Semi- supervised learning, volume 2 of Adaptive Computation and Machine Learning. The MIT Press, September 2006. (Cited on pages2 and 21.) [Chavez et al. 2008] Camara G. Chavez, Frederic Precioso, Matthieu Cord, Sylvie Phillip- Foliguet and A. de A. Araújo. An interactive video content-based retrieval system. In Systems, Signals and Image Processing, 15th International Conference on, IWS- SIP ’08, pages 133–136. IEEE, 2008. (Cited on page 100.) [Chen et al. 2009] Shixi Chen, Haixun Wang and Shuigeng Zhou. Concept clustering of evolving data. In Data Engineering, 2009. ICDE’09. IEEE 25th International Con- ference on, pages 1327–1330. IEEE, 2009. (Cited on page 34.) [Christoudias et al. 2006] C. Mario Christoudias, Kate Saenko, Louis-Philippe Morency and Trevor Darrell. Co-adaptation of audio-visual speech and gesture classifiers. In Proceedings of the 8th international conference on Multimodal interfaces, pages 84– 91. ACM, 2006. (Cited on page 16.) Bibliography 197

[Cimiano et al. 2006] Philipp Cimiano, Johanna Völker and Rudi Studer. Ontologies on De- mand? -A Description of the State-of-the-Art, Applications, Challenges and Trends for Ontology Learning from Text Information. Information, Wissenschaft und Praxis, vol. 57, no. 6-7, pages 315–320, 2006. (Cited on pages 137, 161 and 162.) [Clare & King 2001] Amanda Clare and Ross D. King. Knowledge discovery in multi-label phenotype data. In Principles of Data Mining and Knowledge Discovery, pages 42–53. Springer, 2001. (Cited on page 94.) [Cleuziou et al. 2004] Guillaume Cleuziou, Lionel Martin and Christel Vrain. PoBOC: an Overlapping Clustering Algorithm. Application to Rule-Based Classification and Tex- tual Data. In R. López de Mántaras and L. Saitta, editors, European Conference on Artificial Intelligence, Proceedings of the 16th, ECAI ’04, pages 440–444, Valencia, Spain, August 2004. (Cited on page 133.) [Cleuziou 2008] Guillaume Cleuziou. An extended version of the k-means method for over- lapping clustering. In Pattern Recognition, 2008. ICPR 2008. 19th International Conference on, pages 1–4. IEEE, 2008. (Cited on pages 131, 133, 139, 140, 142 and 143.) [Cleuziou 2009] Guillaume Cleuziou. OKMed et WOKM : deux variantes de OKM pour la classification recouvrante. In Jean-Gabriel Ganascia and Pierre Gançarski, editors, Extraction et Gestion des Connaisances, volume RNTI-E-15 of EGC 2009, pages 31–42. Cépaduès-Éditions, January 2009. (Cited on page 133.) [Cohn et al. 2003] David Cohn, Rich Caruana and Andrew McCallum. Semi-supervised clustering with user feedback. In Constrained Clustering: Advances in Algorithms, Theory, and Applications, volume 4, pages 17–32. Cornell University, 2003. (Cited on pages 22, 23 and 25.) [Comaniciu & Meer 2002] Dorin Comaniciu and Peter Meer. Mean shift: A robust approach toward feature space analysis. Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 24, no. 5, pages 603–619, 2002. (Cited on page 103.) [Cortes & Vapnik 1995] Corrina Cortes and Vladimir Vapnik. Support-vector networks. Ma- chine learning, vol. 20, no. 3, pages 273–297, 1995. (Cited on pages 20, 21, 62, 64 and 112.) [Csurka et al. 2004] Gabriela Csurka, Christopher R. Dance, Lixin Fan, Jutta Willamowski and Cédric Bray. Visual categorization with bags of keypoints. In Workshop on statistical learning in computer vision, ECCV, volume 1, pages 1–22, 2004. (Cited on pages 100 and 103.) [da Silva et al. 1999] Joaquim da Silva, Gaël Dias, Sylvie Guilloré and José Pereira Lopes. Using LocalMaxs Algorithm for the Extraction of Contiguous and Non-contiguous Multiword Lexical Units. Progress in Artificial Intelligence, page 849, 1999. (Cited on page 138.) [Darrell & Pentland 1993] Trevor Darrell and Alexander Pentland. Space-time gestures. In Computer Vision and Pattern Recognition, Proceedings of the IEEE Computer Society Conference on, CVPR ’93, pages 335–340. IEEE, 1993. (Cited on page 17.) 198 Bibliography

[Davidson & Basu 2007] Ian Davidson and Sugato Basu. A survey of clustering with in- stance level constraints. ACM Transactions on Knowledge Discovery from Data, pages 1–41, 2007. (Cited on pages2, 22 and 26.) [Davidson & Ravi 2005] Ian Davidson and SS Ravi. Clustering with constraints: Feasibility issues and the fc-means algorithm. In International Conference on Data Mining, Proceedings of the Fifth SIAM, volume 119, page 138. Society for Industrial Math- ematics, 2005. (Cited on page 23.) [de Andrade Bresolin et al. 2008] Adriano de Andrade Bresolin, Adrião Duarte Dória Neto and Pablo Javier Alsina. Digit recognition using wavelet and SVM in Brazilian Por- tuguese. In Acoustics, Speech and Signal Processing, IEEE International Conference on, ICASSP ’08, pages 1545–1548, 2008. (Cited on page 13.) [De la Torre & Agell 2007] Fernando De la Torre and Carlos Agell. Multimodal Diaries. In Multimedia and Expo, 2007 IEEE International Conference on, pages 839–842. IEEE, 2007. (Cited on page 34.) [de Medeiros Martins et al. 2002] Allan de Medeiros Martins, Wedson Torres de Almeida Filho, Agostinho Medeiros Brito Júnior and Adrião Duarte Dória Neto. A new method for multi-texture segmentation using neural networks. In Neural Networks, Proceedings of the 2002 International Joint Conference on, volume 3 of IJCNN’02, pages 2064–2069. IEEE, 2002. (Cited on page 102.) [Demiriz et al. 1999] Ayhan Demiriz, Kristin Bennett and Mark J. Embrechts. Semi- Supervised Clustering Using Genetic Algorithms. In Artificial Neural Networks in Engineering, pages 809–814. ASME Press, 1999. (Cited on page 26.) [Dermouche et al. 2013] Mohamed Dermouche, Julien Velcin, Sabine Loudcher and Leila Khouas. Une nouvelle mesure pour l’évaluation des méthodes d’extraction de théma- tiques: la Vraisemblance Généralisée. In Extraction et la Gestion des Connaissances, La 13ème Conférence Francophone sur, EGC ’13, pages 317–328. Revue des Nou- velles Technologies de l’Information, 2013. (Cited on page 132.) [Dias et al. 2000] Gaël Dias, Sylvie Guilloré and José Pereira Lopes. Extraction automa- tique d’associations textuelles à partir de corpora non traités. In M. Rajman and J. C. Chapelier, editors, Statistical Analysis of Textual Data, Proceedings of 5th International Conference on the, volume 2 of JADT 2000, pages 213–221, Lausanne, March 2000. Ecole Polytechnique Fédérale de Lausanne. (Cited on page 138.) [Dietterich & Michalski 1985] Thomas G. Dietterich and Ryszard S. Michalski. Discovering patterns in sequences of events. Artificial Intelligence, vol. 25, no. 2, pages 187–232, 1985. (Cited on page 17.) [Diplaris et al. 2005] Sotiris Diplaris, Grigorios Tsoumakas, Pericles A. Mitkas and Ioannis Vlahavas. Protein classification with multiple algorithms. Advances in Informatics, pages 448–456, 2005. (Cited on page 94.) [Dormont 1989] Brigitte Dormont. Petite apologie des données de panel. Économie & prévision, vol. 87, no. 1, pages 19–32, 1989. (Cited on page 43.) [Dumais et al. 1998] Susan Dumais, John Platt, David Heckerman and Mehran Sahami. Inductive learning algorithms and representations for text categorization. In Con- Bibliography 199

ference on Information and Knowledge Management, Proceedings of the Seventh International, CIKM ’98, pages 148–155. ACM, 1998. (Cited on page 130.) [Dunn 1973] Joseph C. Dunn. A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters. Journal of Cybernetics, vol. 3, no. 3, pages 32–57, 1973. (Cited on page 133.) [Dunteman 1989] G. H. Dunteman. Principal components analysis, volume 69. SAGE publications, Inc, 1989. (Cited on pages 20, 62, 64 and 160.) [Egghe 2006] Leo Egghe. Theory and practise of the g-index. Scientometrics, vol. 69, no. 1, pages 131–152, 2006. (Cited on page 53.) [Elomaa & Rousu 2004] Tapio Elomaa and Juho Rousu. Efficient multisplitting revisited: Optima-preserving elimination of partition candidates. Data Mining and Knowledge Discovery, vol. 8, no. 2, pages 97–126, 2004. (Cited on page 63.) [Estruch et al. 2008] Vicent Estruch, J. H. Orallo and M. J. R. Quintana. Bridging the gap between distance and generalisation: Symbolic learning in metric spaces. PhD thesis, Universitat Politècnica de València, 2008. (Cited on page 163.) [Farnstrom et al. 2000] Fredrik Farnstrom, James Lewis and Charles Elkan. Scalability for clustering algorithms revisited. ACM SIGKDD Explorations Newsletter, vol. 2, no. 1, pages 51–57, 2000. (Cited on page 18.) [Fawcett 2006] Tom Fawcett. An introduction to ROC analysis. Pattern recognition letters, vol. 27, no. 8, pages 861–874, 2006. (Cited on page 118.) [Fayyad & Irani 1993] Usama M. Fayyad and Keki B Irani. Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning. In Ruzena Bajcsy, editor, International Joint Conference on Uncertainty in AI, Proceedings of the, volume 2, pages 1022–1027. Morgan Kaufmann, 1993. (Cited on page 63.) [Fei-Fei & Perona 2005] Li Fei-Fei and Pietro Perona. A bayesian hierarchical model for learning natural scene categories. In Computer Vision and Pattern Recognition, IEEE Computer Society Conference on, volume 2 of CVPR 2005, pages 524–531. IEEE, 2005. (Cited on pages 100 and 103.) [Fei-Fei et al. 2007] Li Fei-Fei, Rob Fergus and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. Computer Vision and Image Understanding, vol. 106, no. 1, pages 59–70, 2007. (Cited on page 113.) [Feller 1950] W. Feller. An introduction to probability theory and its applications. vol. i. Wiley, 1950. (Cited on page 71.) [Ferret 2006] Olivier Ferret. Approches endogène et exogène pour améliorer la segmentation thématique de documents. Traitement Automatique des Langues, vol. 47, no. 2, pages 111–135, 2006. (Cited on page 132.) [Fisher 1987] Douglas H. Fisher. Knowledge Acquisition Via Incremental Conceptual Clus- tering. Machine Learning, vol. 2, no. 2, pages 139–172, 1987. (Cited on page 161.) [Forestier et al. 2011] Mathilde Forestier, Julien Velcin and Djamel A. Zighed. Extracting social networks to understand interaction. In Advances in Social Networks Analysis 200 Bibliography

and Mining, International Conference on, ASONAM ’11, pages 213–219. IEEE, 2011. (Cited on pages 169 and 174.) [Fournier-Viger et al. 2011] Philippe Fournier-Viger, Roger Nkambou and Vincent Shin-Mu Tseng. RuleGrowth: mining sequential rules common to several sequences by pattern- growth. In Proceedings of the 2011 ACM Symposium on Applied Computing, pages 956–961. ACM, 2011. (Cited on page 17.) [Francois et al. 2007] Damien Francois, Vincent Wertz and Michel Verleysen. The concen- tration of fractional distances. Knowledge and Data Engineering, IEEE Transactions on, vol. 19, no. 7, pages 873–886, 2007. (Cited on page 19.) [Freund & Schapire 1997] Yoav Freund and Robert E. Schapire. A decision-theoretic gen- eralization of on-line learning and an application to boosting. Journal of computer and system sciences, vol. 55, no. 1, pages 119–139, 1997. (Cited on page 105.) [Frintrop et al. 2010] Simone Frintrop, Erich Rome and Henrik I. Christensen. Compu- tational visual attention systems and their cognitive foundations: A survey. ACM Transactions on Applied Perception (TAP), vol. 7, no. 1, page 6, 2010. (Cited on page 102.) [Fulkerson et al. 2008] Brian Fulkerson, Andrea Vedaldi and Stefano Soatto. Localizing objects with smart . Computer Vision–ECCV 2008, pages 179–192, 2008. (Cited on page 105.) [Gangemi et al. 2003] Aldo Gangemi, Roberto Navigli and Paola Velardi. The OntoWord- Net Project: extension and axiomatization of conceptual relations in WordNet. In International Conference on Ontologies, Databases and Applications of SEmantics, Proceedings of the, ODBASE’03, pages 820–838. Springer, 2003. (Cited on page 149.) [Ganter & Wille 1999] Bernhard Ganter and Rudolf Wille. Formal concept analysis. Springer Berlin, 1999. (Cited on page 163.) [Gao et al. 2006] Jing Gao, Pang-Ning Tan and Haibin Cheng. Semi-supervised clustering with partial background information. In In Proceedings of the Sixth SIAM Interna- tional Conference on Data Mining, 2006. (Cited on page 26.) [Ge et al. 2003] Yongchao Ge, Sandrine Dudoit and Terence P. Speed. Resampling-based multiple testing for microarray data analysis. Test, vol. 12, no. 1, pages 1–77, 2003. (Cited on page 83.) [Geraci et al. 2006] Filippo Geraci, Marco Pellegrini, Marco Maggini and Fabrizio Sebas- tiani. Cluster Generation and Cluster Labelling for Web Snippets: A Fast and Ac- curate Hierarchical Solution. In String Processing and Information Retrieval, pages 25–36. Springer, 2006. (Cited on page 137.) [Godoy & Amandi 2006] Daniela Godoy and Analía Amandi. Modeling user interests by conceptual clustering. Information System, vol. 31, no. 4-5, pages 247–265, 2006. (Cited on page 161.) [Goemans et al. 2009] Hein E. Goemans, Kristian S. Gleditsch and Giacomo Chiozza. In- troducing Archigos: A dataset of political leaders. Journal of Peace Research, vol. 46, no. 2, pages 269–283, 2009. (Cited on page 31.) Bibliography 201

[Gold et al. 2011] Bernard Gold, Nelson Morgan and Dan Ellis. Speech and audio signal processing. Wiley Online Library, 2011. (Cited on page 17.) [Gomez & Morales 2002] Giovani Gomez and Eduardo Morales. Automatic feature con- struction and a simple rule induction algorithm for skin detection. In Proc. of the ICML workshop on Machine Learning in Computer Vision, pages 31–38, 2002. (Cited on page 64.) [Grira et al. 2005] Nizar Grira, Michel Crucianu and Nozha Boujemaa. Unsupervised and Semi-supervised Clustering: a Brief Survey. Technical report, A Review of Machine Learning Techniques for Processing Multimedia Content, Report of the MUSCLE European Network of Excellence (FP6), 2005. (Cited on page 24.) [Guille et al. 2013] Adrien Guille, Cécile Favre, Hakim Hacid and Djamel A. Zighed. SONDY: An Open Source Platform for Social Dynamics Mining and Analysis. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD ’13, 2013. (Cited on page 170.) [Halkidi et al. 2001] Maria Halkidi, Yannis Batistakis and Michalis Vazirgiannis. On clus- tering validation techniques. Journal of Intelligent Information Systems, vol. 17, no. 2, pages 107–145, 2001. (Cited on pages 45 and 139.) [Hammouda et al. 2005] Khaled M. Hammouda, Diego N. Matute and Mohamed S. Kamel. CorePhrase: Keyphrase Extraction for Document Clustering. Machine Learning and Data Mining in Pattern Recognition, pages 265–274, 2005. (Cited on pages 136, 137 and 138.) [Haralick & Shanmugam 1973] R. M. Haralick and K. Shanmugam. Computer classification of reservoir sandstones. Geoscience Electronics, IEEE Transactions on, vol. 11, no. 4, pages 171–177, 1973. (Cited on page 102.) [Harris 1954] Zellig S. Harris. Distributional structure. Word, vol. 10, pages 146–162, 1954. (Cited on page 128.) [Harris 1968] Zellig S. Harris. Mathematical structures of language. Wiley, 1968. (Cited on pages 128 and 162.) [Hastie et al. 2005] Trevor Hastie, Robert Tibshirani, Jerome Friedman and James Franklin. The elements of statistical learning: data mining, inference and predic- tion. The Mathematical Intelligencer, vol. 27, no. 2, pages 83–85, 2005. (Cited on page 17.) [Hinneburg et al. 2000] Alexander Hinneburg, Charu C. Aggarwal and Daniel A. Keim. What is the nearest neighbor in high dimensional spaces? Bibliothek der Universität Konstanz, 2000. (Cited on page 19.) [Hirsch 2005] Jorge E. Hirsch. An index to quantify an individual’s scientific research out- put. Proceedings of the National Academy of Sciences of the United states of Amer- ica, vol. 102, no. 46, page 16569, 2005. (Cited on page 53.) [Hoffman et al. 2008] Forrest M. Hoffman, William W. Hargrove, Richard T. Mills, Salil Mahajan, David J. Erickson and Robert J. Oglesby. Multivariate Spatio-Temporal Clustering (MSTC) as a data mining tool for environmental applications. In In- ternational Congress on Environmental Modelling and Software, Proceedings of the iEMSs Fourth Biennial Meeting, 2008. (Cited on page 17.) 202 Bibliography

[Hofmann 1999] Thomas Hofmann. Probabilistic latent semantic indexing. In Research and Development in Information Retrieval, Proceedings of the 22nd annual international ACM SIGIR conference on, RDIR ’99, pages 50–57. ACM, 1999. (Cited on page 134.) [Holm 1979] Sture Holm. A simple sequentially rejective multiple test procedure. Scandina- vian Journal of Statistics, vol. 6, no. 2, pages 65–70, 1979. (Cited on page 83.) [Hong & Kwong 2009] Yi Hong and Sam Kwong. Learning assignment order of instances for the constrained k-means clustering algorithm. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, vol. 39, no. 2, pages 568–574, 2009. (Cited on page 26.) [Houle et al. 2010] Michael E. Houle, Hans-Peter Kriegel, Peer Kröger, Erich Schubert and Arthur Zimek. Can shared-neighbor distances defeat the curse of dimensionality? In Scientific and Statistical Database Management, pages 482–500. Springer, 2010. (Cited on page 19.) [Hsu & Chang 2005] Winston Hsu and Shih-Fu Chang. Visual cue cluster construction via information bottleneck principle and kernel density estimation. Image and Video Retrieval, pages 82–91, 2005. (Cited on page 105.) [Huo et al. 2006] Xiaoming Huo, Xuelei S. Ni and Andrew K. Smith. A survey of manifold- based learning methods. In Mining of Enterprise Data, emerging nonparametric methodology, chapter 1, pages 06–40. Springer, New York, 2006. (Cited on pages 20 and 64.) [Ibekwe-SanJuan & SanJuan 2004] Fidelia Ibekwe-SanJuan and Eric SanJuan. Mining tex- tual data through term variant clustering: the TermWatch system. Coupling ap- proaches, coupling media and coupling languages for information retrieval., pages 487–503, 2004. (Cited on page 137.) [Jacquemin 1999] Christian Jacquemin. Syntagmatic and paradigmatic representations of term variation. In Meeting of the Association for Computational Linguistics on Com- putational Linguistics, Proceedings of the 37th annual, pages 341–348. Association for Computational Linguistics, 1999. (Cited on page 137.) [Jain & Dubes 1988] Anil K Jain and Richard C Dubes. Algorithms for clustering data. Prentice-Hall, Inc., 1988. (Cited on page 132.) [Ji et al. 2010] Rongrong Ji, Hongxun Yao, Xiaoshuai Sun, Bineng Zhong and Wen Gao. Towards semantic embedding in visual vocabulary. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 918–925, 2010. (Cited on pages 15 and 105.) [Jiang et al. 2007] Yu-Gang Jiang, Chong-Wah Ngo and Jun Yang. Towards optimal bag-of- features for object categorization and semantic video retrieval. In Proceedings of the 6th ACM International Conference on Image and Video Retrieval, pages 494–501. ACM, 2007. (Cited on pages 103 and 119.) [Jianjia & Limin 2011] Zhang Jianjia and Luo Limin. Combined Category Visual Vocabu- lary: A new approach to visual vocabulary construction. In Image and Signal Pro- cessing, 4th International Congress on, volume 3 of CISP 2011, pages 1409–1415, October 2011. (Cited on page 106.) Bibliography 203

[Jones 1972] Karen Sparck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of documentation, vol. 28, no. 1, pages 11–21, 1972. (Cited on page 131.) [Jurie & Triggs 2005] Frederic Jurie and Bill Triggs. Creating efficient codebooks for vi- sual recognition. In Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, volume 1, pages 604–610. IEEE, 2005. (Cited on page 103.) [Kadir & Brady 2001] Timor Kadir and Michael Brady. Saliency, scale and image descrip- tion. International Journal of Computer Vision, vol. 45, no. 2, pages 83–105, 2001. (Cited on page 102.) [Kandasamy & Rodrigo 2010] Kirthevasan Kandasamy and Ranga Rodrigo. Use of a visual word dictionary for topic discovery in images. In Information and Automation for Sustainability (ICIAFs), 2010 5th International Conference on, pages 510–515, 2010. (Cited on page 164.) [Karkali et al. 2012] Margarita Karkali, Vassilis Plachouras, Constantinos Stefanatos and Michalis Vazirgiannis. Keeping keywords fresh: a BM25 variation for personalized keyword extraction. In Proceedings of the 2nd Temporal Web Analytics Workshop, collocated with WWW’12, TempWeb ’12, pages 17–24, New York, NY, USA, 2012. ACM. (Cited on page 131.) [Kietz et al. 2000] Joerg-Uwe Kietz, Alexander Maedche and Raphael Volz. A method for semi-automatic ontology acquisition from a corporate intranet. In Workshop on On- tologies and Text, Collocated with EKAW ’2000, October 2000. (Cited on page 162.) [Kim et al. 2003] Dong Kim, Jeong Sim, Heejin Park and Kunsoo Park. Linear-time con- struction of suffix arrays. In Combinatorial Pattern Matching, pages 186–199. Springer, 2003. (Cited on page 143.) [Kinnunen et al. 2010] Teemu Kinnunen, Joni Kristian Kamarainen, Lasse Lensu, Jukka Lankinen and Heikki Kälviäinen. Making Visual Object Categorization More Chal- lenging: Randomized Caltech-101 Data Set. In 2010 International Conference on Pattern Recognition, pages 476–479. IEEE, 2010. (Cited on page 113.) [Kisilevich et al. 2010] Slava Kisilevich, Florian Mansmann, Mirco Nanni and Salvatore Rinzivillo. Spatio-temporal clustering. Data mining and knowledge discovery hand- book, pages 855–874, 2010. (Cited on page 17.) [Klein et al. 2002] Dan Klein, Sepandar D. Kamvar and Christopher D. Manning. From instance-level constraints to space-level constraints: Making the most of prior knowl- edge in data clustering. In International Conference on Machine Learning, pages 307–314, 2002. (Cited on page 25.) [Kotsiantis & Kanellopoulos 2006] Sotiris Kotsiantis and Dimitris Kanellopoulos. Dis- cretization techniques: A recent survey. GESTS International Transactions on Com- puter Science and Engineering, vol. 32, no. 1, pages 47–58, 2006. (Cited on page 63.) [Kriegel et al. 2011] Hans-Peter Kriegel, Irene Ntoutsi, Myra Spiliopoulou, Grigoris Tsoumakas and Arthur Zimek. Mining complex dynamic data. Tutorial at ECML/PKDD 2011, September 2011. (Cited on page 18.) 204 Bibliography

[Kullback & Leibler 1951] Solomon Kullback and Richard A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, vol. 22, no. 1, pages 79–86, 1951. (Cited on pages 25, 136 and 158.) [Lallich & Rakotomalala 2000] Stéphane Lallich and Ricco Rakotomalala. Fast feature se- lection using partial correlation for multi-valued attributes. In Djamel A. Zighed, J. Komorowski and J. M. Zytkow, editors, Proceedings of the 4th European Confer- ence on Principles of Data Mining and Knowledge Discovery, pages 221–231. LNAI Springer-Verlag, 2000. (Cited on pages 19 and 64.) [Lallich et al. 2006] Stéphane Lallich, Olivier Teytaud and Elie Prudhomme. Statistical inference and data mining: false discoveries control. In COMPSTAT: proceedings in computational statistics: 17th symposium, page 325. Springer, 2006. (Cited on page 83.) [Larsson 1998] N. Jesper Larsson. Notes on Suffix Sorting. Technical Report LU-CS-TR:98- 199, LUNDFD6/(NFCS-3130)/1–43/(1998), Department of Computer Science, Lund University, Sweden, June 1998. (Cited on page 144.) [Lauriston 1994] Andy Lauriston. Automatic recognition of complex terms: Problems and the TERMINO solution. Terminology, vol. 1, no. 1, pages 147–170, 1994. (Cited on page 137.) [Laxman & Sastry 2006] Srivatsan Laxman and P. Shanti Sastry. A survey of temporal data mining. Sadhana, vol. 31, no. 2, pages 173–198, 2006. (Cited on page 17.) [Lazebnik & Raginsky 2009] Svetlana Lazebnik and Maxim Raginsky. Supervised learn- ing of quantizer codebooks by information loss minimization. Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 31, no. 7, pages 1294–1309, 2009. (Cited on page 105.) [Lazebnik et al. 2003a] Svetlana Lazebnik, Cordelia Schmid and Jean Ponce. Affine- invariant local descriptors and neighborhood statistics for texture recognition. In Computer Vision, 2003. Proceedings of the Ninth IEEE International Conference on, ICCV 2003, pages 649–655. IEEE, 2003. (Cited on page 102.) [Lazebnik et al. 2003b] Svetlana Lazebnik, Cordelia Schmid and Jean Ponce. A sparse tex- ture representation using affine-invariant regions. In Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, vol- ume 2, pages II–319. IEEE, 2003. (Cited on page 102.) [Lazebnik et al. 2006] Svetlana Lazebnik, Cordelia Schmid and Jean Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 2, pages 2169–2178. IEEE, 2006. (Cited on pages 102 and 103.) [Lee & Seung 1999] Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, vol. 401, no. 6755, pages 788–791, 1999. (Cited on page 134.) [Lesk 1986] Michael Lesk. Automatic sense disambiguation using machine readable dictio- naries: how to tell a pine cone from an ice cream cone. In SIGDOC ’86: Proceedings of the 5th annual international conference on Systems documentation, pages 24–26, New York, NY, USA, 1986. ACM. (Cited on page 162.) Bibliography 205

[Lesot et al. 2009] Marie-Jeanne Lesot, Maria Rifqi and Hamid Benhadda. Similarity measures for binary and numerical data: a survey. International Journal of Knowledge Engineering and Soft Data Paradigms, vol. 1, no. 1, pages 63–84, 2009. (Cited on page 24.)

[Lin & Hauptmann 2006] Wei-Hao Lin and Alexander Hauptmann. Structuring continuous video recordings of everyday life using time-constrained clustering. In IS&T/SPIE Symposium on Electronic Imaging, 2006. (Cited on pages 26, 34, 37, 46 and 50.)

[Lindeberg 1993] Tony Lindeberg. Detecting salient blob-like image structures and their scales with a scale-space primal sketch: a method for focus-of-attention. International Journal of Computer Vision, vol. 11, no. 3, pages 283–318, 1993. (Cited on page 102.)

[Liu & Motoda 1998] Huan Liu and Hiroshi Motoda. Feature extraction, construction and selection: A data mining perspective. Springer, 1998. (Cited on page 63.)

[Liu et al. 2007] Ying Liu, Dengsheng Zhang, Guojun Lu and Wei-Ying Ma. A survey of content-based image retrieval with high-level semantics. Pattern Recognition, vol. 40, no. 1, pages 262–282, 2007. (Cited on page 15.)

[Liu et al. 2009] Jingen Liu, Yang Yang and Mubarak Shah. Learning semantic visual vocabularies using diffusion distance. In Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, pages 461–468. IEEE, 2009. (Cited on page 105.)

[Loeff et al. 2006] Nicolas Loeff, Cecilia Ovesdotter Alm and David A. Forsyth. Discriminating image senses by clustering with multimodal features. In Proceedings of the COLING/ACL on Main conference poster sessions, pages 547–554. Association for Computational Linguistics, 2006. (Cited on page 15.)

[López-Sastre et al. 2010] R. J. López-Sastre, T. Tuytelaars, F. J. Acevedo-Rodríguez and S. Maldonado-Bascón. Towards a more discriminative and semantic visual vocabulary. Computer Vision and Image Understanding, vol. 115, no. 3, pages 415–425, November 2010. (Cited on page 103.)

[Lowe 1999] David G. Lowe. Object recognition from local scale-invariant features. In Computer Vision, The Proceedings of the Seventh IEEE International Conference on, volume 2 of ICCV 1999, pages 1150–1157. IEEE, 1999. (Cited on page 102.)

[Lowe 2004] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, vol. 60, no. 2, pages 91–110, 2004. (Cited on pages 100, 103, 106 and 111.)

[Lu et al. 2009] Zhiwu Lu, Horace H. S. Ip and Qizhen He. Context-based multi-label image annotation. In Conference on Image and Video Retrieval, Proceeding of the ACM International, CIVR '09, pages 1–7. ACM, 2009. (Cited on page 16.)

[MacQueen 1967] James MacQueen. Some methods for classification and analysis of multivariate observations. In L. M. Le Cam and J. Neyman, editors, Berkeley Symposium on Mathematical Statistics and Probability, Proceedings of the Fifth, volume 1, pages 281–297. University of California Press, 1967. (Cited on pages 34 and 132.)

[Manber & Myers 1993] Udi Manber and Gene Myers. Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing, vol. 22, no. 5, pages 935–948, 1993. (Cited on pages 138 and 143.)

[Mannila et al. 1997] Heikki Mannila, Hannu Toivonen and Inkeri A. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, vol. 1, no. 3, pages 259–289, 1997. (Cited on page 17.)

[Marcus et al. 2011] Adam Marcus, Michael S. Bernstein, Osama Badar, David R. Karger, Samuel Madden and Robert C. Miller. Twitinfo: aggregating and visualizing microblogs for event exploration. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '11, pages 227–236, New York, NY, USA, 2011. ACM. (Cited on page 170.)

[Maree et al. 2005] Raphael Maree, Pierre Geurts, Justus Piater and Louis Wehenkel. Random subwindows for robust image classification. In Computer Vision and Pattern Recognition, IEEE Computer Society Conference on, volume 1 of CVPR 2005, pages 34–40. IEEE, 2005. (Cited on page 102.)

[Maron & Kuhns 1960] Melvin Earl Maron and John L. Kuhns. On relevance, probabilistic indexing and information retrieval. Journal of the ACM (JACM), vol. 7, no. 3, pages 216–244, 1960. (Cited on page 129.)

[Marszałek & Schmid 2006] Marcin Marszałek and Cordelia Schmid. Spatial weighting for bag-of-features. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 2, pages 2118–2125. IEEE, 2006. (Cited on page 105.)

[Matheus 1990] Christopher John Matheus. Adding domain knowledge to SBL through feature construction. In Proceedings of the Eighth National Conference on Artificial Intelligence, pages 803–808, 1990. (Cited on page 65.)

[McCallum 2002] Andrew McCallum. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu, 2002. (Cited on pages 157, 173 and 175.)

[Mei et al. 2007a] Qiaozhu Mei, Xu Ling, Matthew Wondra, Hang Su and ChengXiang Zhai. Topic sentiment mixture: modeling facets and opinions in weblogs. In International Conference on World Wide Web, Proceedings of the 16th, pages 171–180. ACM, 2007. (Cited on page 136.)

[Mei et al. 2007b] Qiaozhu Mei, Xuehua Shen and ChengXiang Zhai. Automatic labeling of multinomial topic models. In International Conference on Knowledge Discovery and Data Mining, Proceedings of the 13th ACM SIGKDD, ICKDDM 2007, pages 490–499. ACM, 2007. (Cited on page 136.)

[Meyerson 2001] Adam Meyerson. Online facility location. In Foundations of Computer Science, Proceedings of the 42nd IEEE Symposium on, pages 426–431. IEEE, 2001. (Cited on page 103.)

[Michalski & Stepp 1983] Ryszard S. Michalski and Robert E. Stepp. Learning from observation: Conceptual clustering. Machine Learning: An Artificial Intelligence Approach, vol. 1, pages 331–363, 1983. (Cited on page 161.)

[Michalski 1983] Ryszard S. Michalski. A theory and methodology of inductive learning. Artificial Intelligence, vol. 20, no. 2, pages 111–161, 1983. (Cited on page 64.)

[Mikolajczyk & Schmid 2004] Krystian Mikolajczyk and Cordelia Schmid. Scale & affine invariant interest point detectors. International Journal of Computer Vision, vol. 60, no. 1, pages 63–86, 2004. (Cited on page 102.)

[Miller 1995] George A. Miller. WordNet: a lexical database for English. Communications of the ACM, vol. 38, no. 11, pages 39–41, 1995. (Cited on pages 126, 140, 149 and 162.)

[Mo & Huang 2011] D. Mo and S. H. Huang. Feature selection based on inference correlation. Intelligent Data Analysis, vol. 15, no. 3, pages 375–398, 2011. (Cited on pages 19 and 64.)

[Mooney et al. 2008] Raymond J. Mooney, Sonal Gupta, Joohyun Kim and Kristen Grauman. Watch, Listen & Learn: Co-training on Captioned Images and Videos. Machine Learning and Knowledge Discovery in Databases, pages 457–472, September 2008. (Cited on pages 15 and 104.)

[Moosmann et al. 2007] Frank Moosmann, Bill Triggs and Frederic Jurie. Fast discriminative visual codebooks using randomized clustering forests. Advances in Neural Information Processing Systems, vol. 19, page 985, 2007. (Cited on page 103.)

[Morsillo et al. 2009] Nicholas Morsillo, Christopher Pal and Randal Nelson. Semi-supervised learning of visual classifiers from web images and text. In International Joint Conference on Artificial Intelligence, Proceedings of the 21st, IJCAI 2009, pages 1169–1174. Morgan Kaufmann Publishers Inc., 2009. (Cited on page 104.)

[Motoda & Liu 2002] Hiroshi Motoda and Huan Liu. Feature selection, extraction and construction. Communication of IICM (Institute of Information and Computing Machinery), vol. 5, pages 67–72, 2002. (Cited on page 64.)

[Murphy & Pazzani 1991] Patrick M. Murphy and Michael J. Pazzani. ID2-of-3: Constructive induction of M-of-N concepts for discriminators in decision trees. In Proceedings of the Eighth International Workshop on Machine Learning, pages 183–187, 1991. (Cited on page 66.)

[Musat et al. 2011a] Claudiu Musat, Julien Velcin, Marian-Andrei Rizoiu and Stefan Trausan-Matu. Concept-based Topic Model Improvement. In International Symposium on Methodologies for Intelligent Systems, volume 369 of ISMIS 2011, pages 133–142. Springer, June 2011. (Cited on pages 127, 140, 164 and 180.)

[Musat et al. 2011b] Claudiu Musat, Julien Velcin, Stefan Trausan-Matu and Marian-Andrei Rizoiu. Improving topic evaluation using conceptual knowledge. In International Joint Conference on Artificial Intelligence, Proceedings of the Twenty-Second, volume 3 of IJCAI 2011, pages 1866–1871. AAAI Press, 2011. (Cited on pages 127, 140, 157, 164 and 180.)

[Musat 2011] Claudiu Musat. The Analysis of Implicit Opinions in Economic Texts and Relations between Them. PhD thesis, Polytechnic University of Bucharest, Bucharest, Romania, October 2011. (Cited on page 140.)

[Nanni & Pedreschi 2006] Mirco Nanni and Dino Pedreschi. Time-focused clustering of trajectories of moving objects. Journal of Intelligent Information Systems, vol. 27, no. 3, pages 267–289, 2006. (Cited on page 17.)

[Navarro 2001] Gonzalo Navarro. A guided tour to approximate string matching. ACM Computing Surveys (CSUR), vol. 33, no. 1, pages 31–88, 2001. (Cited on page 17.)

[Newman et al. 2010] David Newman, Jey Han Lau, Karl Grieser and Timothy Baldwin. Automatic evaluation of topic coherence. In North American Chapter of the Association for Computational Linguistics, Human Language Technologies: The 2010 Annual Conference of the, pages 100–108. Association for Computational Linguistics, 2010. (Cited on pages 140 and 157.)

[Ng et al. 2002] Andrew Y. Ng, Michael I. Jordan and Yair Weiss. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, vol. 2, pages 849–856, 2002. (Cited on page 132.)

[Norris 2008] Pippa Norris. Driving democracy: do power-sharing institutions work? Cambridge University Press, New York, 2008. (Cited on page 31.)

[Nowak et al. 2006] Eric Nowak, Frederic Jurie and Bill Triggs. Sampling strategies for bag-of-features image classification. Computer Vision–ECCV 2006, pages 490–503, 2006. (Cited on pages 102 and 103.)

[O'Hara & Draper 2011] Stephen O'Hara and Bruce A. Draper. Introduction to the bag of features paradigm for image classification and retrieval. Technical report, Cornell University Library, 2011. arXiv preprint arXiv:1101.3354. (Cited on page 102.)

[Oliva & Torralba 2001] Aude Oliva and Antonio Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, vol. 42, no. 3, pages 145–175, 2001. (Cited on page 104.)

[Osato et al. 2002] Naoki Osato, Masayoshi Itoh, Hideaki Konno, Shinji Kondo, Kazuhiro Shibata, Piero Carninci, Toshiyuki Shiraki, Akira Shinagawa, Takahiro Arakawa and Shoshi Kikuchi. A computer-based method of selecting clones for a full-length cDNA project: simultaneous collection of negligibly redundant and variant cDNAs. Genome Research, vol. 12, no. 7, pages 1127–1134, 2002. (Cited on page 17.)

[O'Shaughnessy 2000] Douglas O'Shaughnessy. Speech communication: Human and machine. IEEE Press, Piscataway, USA, 2nd edition, 2000. (Cited on page 17.)

[Osinski & Weiss 2004] Stanislaw Osinski and Dawid Weiss. Conceptual Clustering Using Lingo Algorithm: Evaluation on Open Directory Project Data. In Intelligent Information Systems, pages 369–377, 2004. (Cited on page 138.)

[Osinski et al. 2004] Stanislaw Osinski, Jerzy Stefanowski and Dawid Weiss. Lingo: Search results clustering algorithm based on singular value decomposition. In Intelligent information processing and web mining, Proceedings of the International IIS, IIPWM'04, page 359, 2004. (Cited on page 138.)

[Osinski 2003] Stanislaw Osinski. An algorithm for clustering of web search results. Master's thesis, Poznań University of Technology, Poland, June 2003. (Cited on pages 131, 134, 136, 137, 143, 144 and 148.)

[Pagallo & Haussler 1990] Giulia Pagallo and David Haussler. Boolean feature discovery in empirical learning. Machine Learning, vol. 5, no. 1, pages 71–99, 1990. (Cited on pages 20, 65 and 66.)

[Parsons et al. 2004] Lance Parsons, Ehtesham Haque and Huan Liu. Subspace clustering for high dimensional data: a review. ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pages 90–105, 2004. (Cited on page 133.)

[Patwardhan & Pedersen 2006] Siddharth Patwardhan and Ted Pedersen. Using WordNet-based context vectors to estimate the semantic relatedness of concepts. In Workshop Making Sense of Sense: Bringing Computational Linguistics and Psycholinguistics Together, Collocated with EACL 2006, volume 1501 of EACL '06, pages 1–8, 2006. (Cited on page 151.)

[Pedersen et al. 2004] Ted Pedersen, Siddharth Patwardhan and Jason Michelizzi. WordNet::Similarity: measuring the relatedness of concepts. In Demonstration Track at North American Chapter of the Association for Computational Linguistics: Human Language Technologies, HLT-NAACL '04, pages 38–41. Association for Computational Linguistics, 2004. (Cited on page 151.)

[Perronnin et al. 2006] Florent Perronnin, Christopher R. Dance, Gabriela Csurka and Marco Bressan. Adapted vocabularies for generic visual categorization. Computer Vision–ECCV 2006, pages 464–475, 2006. (Cited on pages 105 and 106.)

[Plamondon & Srihari 2000] Réjean Plamondon and Sargur N. Srihari. Online and off-line handwriting recognition: a comprehensive survey. Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 22, no. 1, pages 63–84, 2000. (Cited on page 17.)

[Porter 1980] Martin F. Porter. An algorithm for suffix stripping. Program, vol. 14, no. 3, 1980. (Cited on page 129.)

[Qamra et al. 2006] Arun Qamra, Belle Tseng and Edward Y. Chang. Mining blog stories using community-based and temporal clustering. In Information and Knowledge Management, Proceedings of the 15th ACM international conference on, pages 58–67, New York, NY, USA, 2006. ACM. (Cited on page 34.)

[Quattoni et al. 2007] Ariadna Quattoni, Michael Collins and Trevor Darrell. Learning visual representations using images with captions. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2007. (Cited on page 15.)

[Quelhas et al. 2005] Pedro Quelhas, Florent Monay, J.-M. Odobez, Daniel Gatica-Perez, Tinne Tuytelaars and Luc Van Gool. Modeling scenes with local descriptors and latent aspects. In Computer Vision, Tenth IEEE International Conference on, volume 1 of ICCV 2005, pages 883–890. IEEE, 2005. (Cited on page 100.)

[Quinlan 1986] John R. Quinlan. Induction of decision trees. Machine Learning, vol. 1, no. 1, pages 81–106, 1986. (Cited on page 66.)

[Quinlan 1993] John R. Quinlan. C4.5: programs for machine learning. Morgan Kaufmann, 1993. (Cited on page 66.)

[Rizoiu & Velcin 2011] Marian-Andrei Rizoiu and Julien Velcin. Topic Extraction for Ontology Learning. In Wilson Wong, Wei Liu and Mohammed Bennamoun, editors, Ontology Learning and Knowledge Discovery Using the Web: Challenges and Recent Advances, chapter 3, pages 38–61. Hershey, PA: Information Science Reference, 2011. (Cited on pages 127, 164 and 180.)

[Rizoiu et al. 2010] Marian-Andrei Rizoiu, Julien Velcin and Jean-Hugues Chauchat. Regrouper les données textuelles et nommer les groupes à l'aide des classes recouvrantes. In Extraction et Gestion des Connaissances (EGC 10), 10ème Conférence, volume E-19 of Revue des Nouvelles Technologies de l'Information, pages 561–572. Cépaduès, January 2010. (Cited on pages 127, 139, 164, 173, 175 and 180.)

[Rizoiu et al. 2012] Marian-Andrei Rizoiu, Julien Velcin and Stéphane Lallich. Structuring typical evolutions using Temporal-Driven Constrained Clustering. In International Conference on Tools with Artificial Intelligence, Proceedings of the Twenty-Fourth, ICTAI 2012, pages 610–617. IEEE, November 2012. (Cited on pages 57 and 179.)

[Rizoiu et al. 2013a] Marian-Andrei Rizoiu, Julien Velcin and Stéphane Lallich. Unsupervised Feature Construction for Improving Data Representation and Semantics. Journal of Intelligent Information Systems, vol. 40, no. 3, pages 501–527, 2013. (Cited on pages 95 and 179.)

[Rizoiu et al. 2013b] Marian-Andrei Rizoiu, Julien Velcin and Stéphane Lallich. Visual Vocabulary Construction for Image Classification in a Weakly Supervised Context. International Journal of Artificial Intelligence Tools, 2013. (Cited on pages 123 and 180.)

[Robertson et al. 1995] Stephen E. Robertson, Steve Walker, Susan Jones, Micheline M. Hancock-Beaulieu and Mike Gatford. Okapi at TREC-3. NIST Special Publication SP, pages 109–109, 1995. (Cited on page 131.)

[Roche 2004] Mathieu Roche. Intégration de la construction de la terminologie de domaines spécialisés dans un processus global de fouille de textes. PhD thesis, Université de Paris 11, December 2004. (Cited on pages 131, 137, 138 and 139.)

[Rodríguez 2005] Carlos C. Rodríguez. The ABC of Model Selection: AIC, BIC and the New CIC. In Bayesian Inference and Maximum Entropy Methods in Science and Engineering, volume 803, pages 80–87, 2005. (Cited on page 135.)

[Russell et al. 2008] Bryan C. Russell, Antonio Torralba, Kevin P. Murphy and William T. Freeman. LabelMe: a database and web-based tool for image annotation. International Journal of Computer Vision, vol. 77, no. 1, pages 157–173, 2008. (Cited on page 76.)

[Salton & Buckley 1988] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, vol. 24, no. 5, pages 513–523, 1988. (Cited on pages 130 and 137.)

[Salton et al. 1975] Gerard Salton, Anita Wong and Chung-Shu Yang. A vector space model for automatic indexing. Communications of the ACM, vol. 18, no. 11, pages 613–620, November 1975. (Cited on page 129.)

[Sanders & Sukthankar 2001] Brandon C. S. Sanders and Rahul Sukthankar. Unsupervised discovery of objects using temporal coherence. Technical report, CVPR Technical Sketch, 2001. (Cited on page 34.)

[Sawaragi et al. 1985] Y. Sawaragi, H. Nakayama and T. Tanino. Theory of multiobjective optimization, volume 176. Academic Press, New York, 1985. (Cited on pages 63 and 74.)

[Sethi et al. 2001] Ishwar K. Sethi, Ioana L. Coman and Daniela Stan. Mining association rules between low-level image features and high-level concepts. In Aerospace/Defense Sensing, Simulation, and Controls, pages 279–290. International Society for Optics and Photonics, 2001. (Cited on page 15.)

[Seung & Lee 2001] H. Sebastian Seung and Daniel D. Lee. Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems, vol. 13, pages 556–562, 2001. (Cited on page 134.)

[Silberztein 1994] Max D. Silberztein. INTEX: a corpus processing system. In Computational Linguistics, Proceedings of the 15th conference on, volume 1, pages 579–583. Association for Computational Linguistics, 1994. (Cited on page 137.)

[Singhal 2001] Amit Singhal. Modern information retrieval: A brief overview. IEEE Data Engineering Bulletin, vol. 24, no. 4, pages 35–43, 2001. (Cited on page 129.)

[Sivic & Zisserman 2003] Josef Sivic and Andrew Zisserman. Video Google: A text retrieval approach to object matching in videos. In Computer Vision, Proceedings of the Ninth IEEE International Conference on, ICCV 2003, pages 1470–1477. IEEE, 2003. (Cited on pages 100, 102 and 103.)

[Sivic et al. 2005] Josef Sivic, Bryan C. Russell, Alexei A. Efros, Andrew Zisserman and William T. Freeman. Discovering objects and their location in images. In Computer Vision, Tenth IEEE International Conference on, volume 1 of ICCV 2005, pages 370–377. IEEE, 2005. (Cited on page 100.)

[Smadja 1991] Frank A. Smadja. From N-grams to collocations: an evaluation of Xtract. In Association for Computational Linguistics, Proceedings of the 29th annual meeting on, pages 279–284, Morristown, NJ, USA, 1991. Association for Computational Linguistics. (Cited on pages 137 and 139.)

[Steinbach et al. 2000] Michael Steinbach, George Karypis and Vipin Kumar. A Comparison of Document Clustering Techniques. In KDD Workshop on Text Mining, volume 400, pages 525–526, 2000. (Cited on page 132.)

[Storey 2002] John D. Storey. A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 64, no. 3, pages 479–498, 2002. (Cited on page 83.)

[Swain & Ballard 1991] Michael J. Swain and Dana H. Ballard. Color indexing. International Journal of Computer Vision, vol. 7, no. 1, pages 11–32, 1991. (Cited on page 102.)

[Talukdar et al. 2012] Partha Pratim Talukdar, Derry Wijaya and Tom Mitchell. Coupled temporal scoping of relational facts. In Web Search and Data Mining, Proceedings of the fifth ACM international conference on, pages 73–82. ACM, 2012. (Cited on page 34.)

[Thanopoulos et al. 2002] Aristomenis Nikos Thanopoulos, Nikos Fakotakis and George Kokkinakis. Comparative Evaluation of Collocation Extraction Metrics. In Language Resources Evaluation Conference, Proceedings of the 3rd, volume 2, pages 620–625, 2002. (Cited on page 138.)

[Tsoumakas & Katakis 2007] Grigorios Tsoumakas and Ioannis Katakis. Multi-label classification: An overview. International Journal of Data Warehousing and Mining (IJDWM), vol. 3, no. 3, pages 1–13, 2007. (Cited on page 94.)

[Turcato et al. 2000] Davide Turcato, Fred Popowich, Janine Toole, Dan Fass, Devlan Nicholson and Gordon Tisher. Adapting a synonym database to specific domains. In Workshop on Recent Advances in Natural Language Processing and Information Retrieval, collocated with ACL 2000, pages 1–11, Morristown, NJ, USA, 2000. Association for Computational Linguistics. (Cited on page 162.)

[Turtle & Croft 1989] Howard Turtle and W. Bruce Croft. Inference networks for document retrieval. In Research and Development in Information Retrieval, Proceedings of the 13th annual international ACM SIGIR conference on, pages 1–24. ACM, 1989. (Cited on page 129.)

[Varma & Zisserman 2003] Manik Varma and Andrew Zisserman. Texture classification: Are filter banks necessary? In Computer Vision and Pattern Recognition, Proceedings of the IEEE computer society conference on, volume 2 of CVPR 2003, pages II–691. IEEE, 2003. (Cited on page 102.)

[Vogel & Schiele 2007] Julia Vogel and Bernt Schiele. Semantic modeling of natural scenes for content-based image retrieval. International Journal of Computer Vision, vol. 72, no. 2, pages 133–157, 2007. (Cited on page 100.)

[Wagstaff & Cardie 2000] Kiri Wagstaff and Claire Cardie. Clustering with Instance-level Constraints. In International Conference on Machine Learning, Proceedings of the Seventeenth, pages 1103–1110, 2000. (Cited on pages 22 and 26.)

[Wagstaff et al. 2001] Kiri Wagstaff, Claire Cardie, Seth Rogers and Stefan Schroedl. Constrained K-means Clustering with Background Knowledge. In International Conference on Machine Learning, Proceedings of the Eighteenth, pages 577–584. Morgan Kaufmann, 2001. (Cited on page 26.)

[Wallach et al. 2009] Hanna M. Wallach, Iain Murray, Ruslan Salakhutdinov and David Mimno. Evaluation methods for topic models. In International Conference on Machine Learning, Proceedings of the 26th Annual, pages 1105–1112. ACM, 2009. (Cited on page 139.)

[Wang & McCallum 2006] Xuerui Wang and Andrew McCallum. Topics over time: a non-Markov continuous-time model of topical trends. In Knowledge Discovery and Data Mining, Proceedings of the 12th ACM SIGKDD international conference on, pages 424–433. ACM, 2006. (Cited on page 157.)

[Wang et al. 2007] Xuerui Wang, Andrew McCallum and Xing Wei. Topical n-grams: Phrase and topic discovery, with an application to information retrieval. In International Conference on Data Mining, Proceedings of the 7th IEEE, ICDM 2007, pages 697–702. IEEE, 2007. (Cited on pages 135, 136 and 173.)

[Wang et al. 2008] Chong Wang, David M. Blei and David Heckerman. Continuous time dynamic topic models. In Conference on Uncertainty in Artificial Intelligence, Proceedings of the 23rd, 2008. (Cited on page 136.)

[Wang et al. 2009] Gang Wang, Derek Hoiem and David Forsyth. Building text features for object image classification. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1367–1374. IEEE, 2009. (Cited on pages 15 and 104.)

[Webster & Burrough 1972] R. Webster and P. A. Burrough. Computer-based Soil Mapping of Small Areas from Sample Data: II. Classification Smoothing. Journal of Soil Science, vol. 23, no. 2, pages 222–234, 1972. (Cited on page 23.)

[Widmer & Ritschard 2009] Eric D. Widmer and Gilbert Ritschard. The de-standardization of the life course: Are men and women equal? Advances in Life Course Research, vol. 14, no. 1-2, pages 28–39, 2009. (Cited on page 34.)

[Willamowski et al. 2004] Jutta Willamowski, Damian Arregui, Gabriela Csurka, Christopher R. Dance and Lixin Fan. Categorizing nine visual classes using local appearance descriptors. In ICPR Workshop on Learning for Adaptable Visual Systems, 2004. (Cited on page 100.)

[Winn et al. 2005] J. Winn, A. Criminisi and T. Minka. Object categorization by learned universal visual dictionary. In Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, volume 2, pages 1800–1807. IEEE, 2005. (Cited on page 105.)

[Wong et al. 2008] Wilson Wong, Wei Liu and Mohammed Bennamoun. Determination of unithood and termhood for term recognition. Handbook of Research on Text and Web Mining Technologies. IGI Global, 2008. (Cited on page 161.)

[Wong et al. 2009] Wilson Wong, Wei Liu and Mohammed Bennamoun. A probabilistic framework for automatic term recognition. Intelligent Data Analysis, vol. 13, no. 4, pages 499–539, 2009. (Cited on page 161.)

[Xing et al. 2002] Eric P. Xing, Andrew Y. Ng, Michael I. Jordan and Stuart Russell. Distance Metric Learning with Application to Clustering with Side-Information. Advances in Neural Information Processing Systems, vol. 15, pages 505–512, 2002. (Cited on page 25.)

[Xu et al. 2003] Wei Xu, Xin Liu and Yihong Gong. Document clustering based on non-negative matrix factorization. In Research and Development in Information Retrieval, Proceedings of the 26th international SIGIR conference on, pages 267–273. ACM, 2003. (Cited on page 134.)

[Yamamoto & Church 2001] Mikio Yamamoto and Kenneth W. Church. Using suffix arrays to compute term frequency and document frequency for all substrings in a corpus. Computational Linguistics, vol. 27, no. 1, pages 1–30, 2001. (Cited on page 143.)

[Yang et al. 1991] Der-Shung Yang, Larry Rendell and Gunnar Blix. A scheme for feature construction and a comparison of empirical methods. In Proceedings of the Twelfth International Joint Conference on Artificial Intelligence, pages 699–704, 1991. (Cited on page 64.)

[Yang et al. 2010] Shuang Hong Yang, Jiang Bian and Hongyuan Zha. Hybrid Generative/Discriminative Learning for Automatic Image Annotation. In Uncertainty in Artificial Intelligence, Proceedings of the 26th Conference on, UAI 2010, Catalina Island, California, 2010. AUAI Press. (Cited on page 16.)

[Yeh & Yang 2008] J. Yeh and N. Yang. Ontology Construction Based on Latent Topic Extraction in a Digital Library. Digital Libraries: Universal and Ubiquitous Access to Information, pages 93–103, 2008. (Cited on page 162.)

[Zaiane et al. 2003] Osmar R. Zaiane, Simeon Simoff and Chabane Djeraba, editors. Mining multimedia and complex data, volume 2797 of Lecture Notes in Computer Science. Springer, 2003. (Cited on pages 13 and 14.)

[Zhang & Li 2007] Qingfu Zhang and Hui Li. MOEA/D: A multiobjective evolutionary algorithm based on decomposition. Evolutionary Computation, IEEE Transactions on, vol. 11, no. 6, pages 712–731, 2007. (Cited on pages 51, 57 and 184.)

[Zhang et al. 2007] Jianguo Zhang, Marcin Marszałek, Svetlana Lazebnik and Cordelia Schmid. Local features and kernels for classification of texture and object categories: A comprehensive study. International Journal of Computer Vision, vol. 73, no. 2, pages 213–238, 2007. (Cited on pages 100 and 106.)

[Zhang et al. 2009] Wei Zhang, Akshat Surve, Xiaoli Fern and Thomas G. Dietterich. Learning non-redundant codebooks for classifying complex objects. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1241–1248. ACM, 2009. (Cited on page 105.)

[Zheng 1995] Zijian Zheng. Constructing nominal X-of-N attributes. In Proceedings of International Joint Conference On Artificial Intelligence, volume 14, pages 1064–1070, 1995. (Cited on pages 64 and 66.)

[Zheng 1996] Zijian Zheng. A comparison of constructive induction with different types of new attribute. Technical report, School of Computing and Mathematics, Deakin University, Geelong, 1996. (Cited on page 66.)

[Zheng 1998] Zijian Zheng. Constructing conjunctions using systematic search on decision trees. Knowledge-Based Systems, vol. 10, no. 7, pages 421–430, 1998. (Cited on pages 20 and 65.)

[Zhu 2005] Xiaojin Zhu. Semi-Supervised Learning Literature Survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison, 2005. (Cited on page 21.)

[Zhuang et al. 1999] Yueting Zhuang, Xiaoming Liu and Yunhe Pan. Apply semantic template to support content-based image retrieval. In Proceedings of the SPIE, Storage and Retrieval for Media Databases, volume 3972, pages 442–449, 1999. (Cited on page 15.)

[Zighed et al. 2009] Djamel A. Zighed, Shusaku Tsumoto, Zbigniew W. Ras and Hakim Hacid, editors. Mining complex data, volume 165 of Studies in Computational Intelligence. Springer, 2009. (Cited on page 10.)