Learning data representations in unsupervised learning Maziar Moradi Fard

To cite this version:

Maziar Moradi Fard. Learning data representations in unsupervised learning. Data Structures and Algorithms [cs.DS]. Université Grenoble Alpes, 2020. English. NNT: 2020GRALM053. tel-03151389

HAL Id: tel-03151389 https://tel.archives-ouvertes.fr/tel-03151389 Submitted on 24 Feb 2021

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

THESIS
Submitted for the degree of Doctor of the Université Grenoble Alpes
Speciality: Computer Science (Informatique)
Ministerial decree: 25 May 2016

Presented by Maziar MORADI FARD

Thesis supervised by Eric GAUSSIER, prepared at the Laboratoire d'Informatique de Grenoble, within the École Doctorale Mathématiques, Sciences et Technologies de l'Information, Informatique.

Apprentissage de représentations de données dans un apprentissage non-supervisé
Learning Data Representations in Unsupervised Learning

Thesis publicly defended on 24 November 2020, before a jury composed of:
Mr. Eric GAUSSIER, Professor, Université Grenoble Alpes, Thesis supervisor
Ms. Sihem AMER-YAHIA, Professor, Université Grenoble Alpes, President of the jury
Mr. Julien VELCIN, Professor, Université Lyon 2, Examiner
Mr. Mustapha LEBBAH, Associate Professor (HDR), Université Sorbonne Paris Nord, Reviewer
Mr. Marc TOMMASI, Professor, Université de Lille, Reviewer

Acknowledgements

Firstly, I would like to express my sincere gratitude to my advisors, Prof. Eric Gaussier and Thibaut Thonet, for their continuous support of my Ph.D. study and related research, and for their patience, motivation, and immense knowledge. Their guidance helped me throughout the research and the writing of this thesis. I would also like to thank my fellow labmates, especially Prof. Massih-Reza Amini, for all the support they provided during my Ph.D. and all the fun we had together. Last but not least, I would like to thank my family: my parents and sisters, for supporting me spiritually throughout the writing of this thesis and my life in general. Without their support, I could not imagine myself in this situation.

Abstract

Due to the great impact of deep learning on a variety of machine learning fields, its ability to improve clustering approaches has recently been investigated. At first, deep learning approaches (mostly Autoencoders) were used to reduce the dimensionality of the original space and to remove possible noise (and also to learn new data representations). Clustering approaches that rely on deep learning are called Deep Clustering. This thesis focuses on developing Deep Clustering models which can be used for different types of data (e.g., images, text). First, we propose a Deep k-means (DKM) algorithm in which learning data representations (through a deep Autoencoder) and cluster representatives (through k-means) is performed jointly. The results of our DKM approach indicate that this framework is able to outperform similar Deep Clustering algorithms. Indeed, our proposed framework smoothly backpropagates the loss error through all learnable variables in a truly joint way. Moreover, we propose two frameworks, SD2C and PCD2C, which are able to integrate, respectively, seed words and pairwise constraints into end-to-end Deep Clustering frameworks. By utilizing such frameworks, users can see their needs reflected in the clustering. Finally, the results obtained with these frameworks indicate their ability to produce more tailored results.

Résumé

En raison du grand impact de l'apprentissage profond sur divers domaines de l'apprentissage automatique, leurs capacités à améliorer les approches de clustering ont récemment été étudiées. Dans un premier temps, des approches d'apprentissage profond (principalement des autoencodeurs) ont été utilisées pour réduire la dimensionnalité de l'espace d'origine et pour supprimer les éventuels bruits (également pour apprendre de nouvelles représentations de données). De telles approches de clustering qui utilisent des approches d'apprentissage en profondeur sont appelées deep clustering. Cette thèse se concentre sur le développement de modèles de deep clustering qui peuvent être utilisés pour différents types de données (par exemple, des images, du texte). Tout d'abord, nous proposons un algorithme DKM (Deep k-means) dans lequel l'apprentissage des représentations de données (via un autoencodeur profond) et des représentants de cluster (via k-means) est effectué de manière conjointe. Les résultats de notre approche DKM indiquent que ce modèle est capable de surpasser des algorithmes similaires en Deep Clustering. En effet, notre cadre proposé est capable de propager de manière lisse l'erreur de la fonction de coût à travers toutes les variables apprenables. De plus, nous proposons deux modèles nommés SD2C et PCD2C qui sont capables d'intégrer respectivement des mots d'amorçage et des contraintes par paires dans des approches de Deep Clustering de bout en bout. En utilisant de telles approches, les utilisateurs peuvent observer le reflet de leurs besoins en clustering. Enfin, les résultats obtenus à partir de ces modèles indiquent leur capacité à obtenir des résultats plus adaptés.

Contents

Abstract

Résumé

List of Figures

List of Tables

1 Introduction

2 Definitions and Notations
  2.1 Clustering
    2.1.1 k-means Clustering
    2.1.2 Fuzzy C-means
  2.2 Autoencoders
  2.3 Evaluation Metrics
    2.3.1 NMI
    2.3.2 ACC
    2.3.3 ARI
  2.4 Tools for Implementation

3 Related Work
  3.1 Overview
  3.2 Deep Clustering
    3.2.1 Autoencoder Based Deep Clustering Approaches
    3.2.2 Non-Autoencoder based Deep Clustering Approaches
  3.3 Constrained Clustering Approaches
    3.3.1 Seed Words Based Approaches
    3.3.2 Must-link and Cannot-link Based Approaches

4 Deep k-means Clustering
  4.1 Introduction
  4.2 Deep k-means
    4.2.1 Choice of G_k
    4.2.2 Choice of α
  4.3 Experiments
    4.3.1 Datasets
    4.3.2 Baselines and Deep k-Means variants
    4.3.3 Experimental setup
    4.3.4 Clustering results
    4.3.5 k-Means-friendliness of the learned representations
  4.4 Conclusion

5 Constrained Deep Document Clustering
  5.1 Introduction
  5.2 Seed-guided Deep Document Clustering
    5.2.1 SD2C-Doc
    5.2.2 SD2C-Rep
    5.2.3 SD2C-Att
  5.3 Pairwise-Constrained Deep Document Clustering
    5.3.1 Choice of Deep Clustering Framework
  5.4 Experiments
    5.4.1 Datasets
    5.4.2 SD2C Baselines and Variants
    5.4.3 PCD2C Baselines and Variants
    5.4.4 Constraints Selection
    5.4.5 Experimental Setup
    5.4.6 Clustering Results
  5.5 Conclusion

6 Conclusion
  6.1 Summary
  6.2 Future work
    6.2.1 Deep Clustering Without Seed Words
    6.2.2 Deep Clustering Including Seed Words

Bibliography

List of Figures

2.1 An example of Undercomplete Autoencoders. Blue layers represent the encoder and green layers represent the decoder, while the purple layer represents the embedding layer.

3.1 General DC architecture.
3.2 AAE architecture.
3.3 CATGAN architecture.
3.4 Pairwise Clustering Architecture.

4.1 Overview of our Deep k-Means approach instantiated with losses based on the Euclidean distance.
4.2 Examples from the MNIST dataset.
4.3 Annealing scheme for inverse temperature α, following the sequence α_{n+1} = 2^{1/log(n)^2} × α_n; α_1 = 0.1.
4.4 The architecture of the encoder part of the Autoencoder.
4.5 Accuracy evolution of different deep clustering methods with respect to hyper-parameter values (10^-N) on Reuters.
4.6 Accuracy evolution of different deep clustering methods with respect to hyper-parameter values (10^-N) on 20news.
4.7 Accuracy evolution of different deep clustering methods with respect to hyper-parameter values (10^-N) on Yahoo.
4.8 Accuracy evolution of different deep clustering methods with respect to hyper-parameter values (10^-N) on DBPedia.
4.9 Accuracy evolution of different deep clustering methods with respect to hyper-parameter values (10^-N) on AGnews.
4.10 t-SNE visualization of the embedding spaces learned on USPS.

5.1 Different vehicle types and colors.
5.2 Illustration of SD2C-Doc (left) and SD2C-Rep (right). Thick double arrows indicate the computation of a distance between two vectors.
5.3 Illustration of SD2C-Att. The function h denotes the seed words-based attention module applied to the documents' input representations. Thick double arrows indicate the computation of a distance between two vectors.
5.4 Clustering results in terms of ACC for SD2C-Doc-e, SD2C-Doc-c, SD2C-Rep-e and SD2C-Rep-c with 1 to 5 seed words.
5.5 Evolution of the clustering accuracy (ACC) of PCD2C on REUTERS with respect to different values of λ1, for different numbers of pairs, when pairwise constraints are used.
5.6 Evolution of the clustering accuracy (ACC) of PCD2C on 20NEWS with respect to different values of λ1, for different numbers of pairs, when pairwise constraints are used.

List of Tables

4.1 Detailed information about each class of the USPS dataset.
4.2 Different topics in the 20NEWS data. Similar topics are grouped in a shared cell.
4.3 Detailed information on the REUTERS dataset.
4.4 Statistics of the datasets.
4.5 Dataset-specific optimal values of the hyperparameter (λ for DKM-based and DCN-based methods and γ for IDEC-based ones) which trades off between the reconstruction loss and the clustering loss.
4.6 Clustering results on image data.
4.7 Clustering results of the compared methods on text data (TF-IDF). Performance is measured in terms of Accuracy, NMI and ARI (%); higher is better. Each cell contains the average and standard deviation computed over 10 runs. The best result for each dataset/metric is bolded. Underlined values correspond to results with no significant difference (p > 0.05) to the best.
4.8 The results of multiple clustering approaches obtained with general hyperparameters. Bold values represent the best results and underlined values have no statistically significant difference from the best results at α = 0.05.
4.9 The results of multiple clustering approaches obtained with the best hyperparameters. Bold values represent the best results and underlined values have no statistically significant difference from the best results at α = 0.05.
4.10 Clustering results for k-means applied to different learned embedding spaces, to measure the k-means-friendliness of each method. Performance is measured in terms of NMI and clustering accuracy (%); higher is better. Each cell contains the average and standard deviation computed over 10 runs. The best result for each dataset/metric is bolded. Underlined values correspond to results with no significant difference (p > 0.05) to the best.

5.1 Automatically selected seed words for each dataset. Each row corresponds to one class of the dataset; the seed words pertaining to a class are sorted from the most relevant to the least relevant.
5.2 The seed words obtained by presenting the documents returned by k-means to each user, alongside the time (in seconds) taken to finish the experiment. The seed words provided by each user are shown such that each line of seed words pertains to one class.
5.3 The seed words obtained by providing each user with seed words from LDA, alongside the time (in seconds) taken to finish the experiment. The seed words provided by each user are shown such that each line of seed words pertains to one class.
5.4 Detailed information about the number of must-link and cannot-link pairs for each user. The time (in seconds) taken to finish the experiment for each number of pairs is also shown.
5.5 Macro-average results in terms of accuracy (ACC), normalized mutual information (NMI) and adjusted Rand index (ARI). The double vertical line separates approaches which leverage seed words (right) from approaches which do not (left). Bold values correspond to the best results.
5.6 Seed-guided constrained clustering results on DBPEDIA and AGNEWS with 3 seed words per cluster. Bold results denote the best results, as well as results not significantly different from the best. Italicized SD2C results indicate a significant improvement over STM.
5.7 Seed-guided constrained clustering results on 20NEWS, REUTERS and YAHOO with 3 seed words per cluster. Bold results denote the best results, as well as results not significantly different from the best. Italicized SD2C results indicate a significant improvement over STM.
5.8 Results of different clustering approaches on selected documents from the DBPEDIA dataset, without using constraints. Bold results represent the best results.
5.9 Results of different deep clustering approaches with the constraints obtained by presenting the first sentence of each document (obtained by applying k-means). Bolded results represent the best results and italicized results are statistically equivalent to the best results (p = 0.01).
5.10 Results of different deep clustering approaches with the constraints obtained by presenting the first sentence of each document (obtained by applying k-means). Bolded results represent the best results and italicized results are statistically equivalent to the best results (p = 0.01).
5.11 Results of different deep clustering approaches with the constraints obtained by presenting the seed words obtained from LDA. Bolded results represent the best results and italicized results are statistically equivalent to the best results (p = 0.01).
5.12 Results of different deep clustering approaches with the constraints obtained by presenting the seed words obtained from LDA. Bolded results represent the best results and italicized results are statistically equivalent to the best results (p = 0.01).
5.13 Results of the PCD2C framework (using DKM) on 20NEWS, REUTERS and YAHOO with general hyperparameters.
5.14 Results of the PCD2C framework (using DKM) on DBPEDIA and AGNEWS with general hyperparameters.
5.15 Results of the PCD2C framework (using DKM) on 20NEWS, REUTERS and YAHOO obtained by tuning the λ1 hyperparameter (λ0 is set to 10^-5).
5.16 Results of the PCD2C framework (using DKM) on the DBPEDIA and AGNEWS datasets obtained by tuning the λ1 hyperparameter (λ0 is set to 10^-5).
5.17 Results of different versions of the PCD2C algorithm (DKM, DCN and IDEC) on a subset of the DBPEDIA dataset (25,000 documents) with human-selected seed words. λ1 and λ0 are set to 10^-5 (general hyperparameters).
5.18 Results of different versions of the PCD2C algorithm (DKM, DCN and IDEC) on a subset of the DBPEDIA dataset (25,000 documents) with human-selected seed words. λ1 and λ0 are set to 10^-5 (general hyperparameters).

Chapter 1
Introduction

Artificial intelligence, and more specifically machine learning, has drawn a lot of attention in recent years. The number of papers submitted to machine learning conferences is increasing rapidly. Online machine learning courses are frequently advertised on the internet, which reflects how popular this field of study is becoming. Nowadays, even Hollywood is attracted to machine learning and the possible ways it will influence our future. These attractions raise two questions: why is machine learning becoming more popular, and what is machine learning? To answer the latter question, we provide the standard definition proposed by Tom Mitchell, who defined machine learning as "the field of study which is concerned with the question of how to construct computer programs that automatically improve with experience". To answer the first question, we provide below a few examples of the usage of machine learning which reflect its importance. Machine learning algorithms have influenced our daily lives in a variety of ways. Discriminating benign tumors from malignant ones is a medical example of the use of machine learning algorithms. Nearly 93% of web traffic comes through search engines, and machine learning is used by them to identify spam, low-quality content, etc. Recent breakthroughs in the mobile industry have also been made with the help of machine learning. Identifying fingerprints to unlock cell phones in the most secure and fastest way is obtained through machine learning. Recognizing a face to unlock a mobile phone, where many constraints such as light, angle, and changes in the face (wearing eyeglasses) are present, is also one of the recent advances in the mobile industry.

Converting speech to text is another remarkable ability that machine learning has provided. Moreover, machine learning algorithms have been used to protect smart grids from cyberattacks. Thus, due to the increasing impact of machine learning on our daily lives, performing scientific research in this area seems essential.

Fundamentally, machine learning algorithms can be divided into three categories: 1) supervised, 2) semi-supervised, and 3) unsupervised learning. In the case of supervised learning, the input data is coupled with its corresponding output (target) values. For instance, in the case of document classification, the categories (e.g., politics, science, etc.) pertaining to each data point are provided in advance, which allows machine learning algorithms to discriminate the different classes from each other. Semi-supervised learning embraces partially labeled datasets, where the goal is to use labeled data points to categorize unlabeled ones. Unsupervised learning algorithms are used when no target information is available or required. In this thesis, we concentrate on clustering and Autoencoders. Different from supervised and semi-supervised learning tasks, in clustering the target values are not provided and the goal is to identify and discriminate the different categories that exist in the data. Clearly, due to the lack of target values, clustering is more difficult to deal with than supervised learning tasks.

So far, a variety of algorithms have been proposed in the domain of clustering, but probably the most well-known and widely used one is k-means [1]. k-means starts with randomly generated cluster representatives and tries to update them based on each observation of the data points. To cluster a data point, the distance between the data point and each of the cluster representatives is computed; the data point is then assigned to the nearest cluster representative. There are several benefits to k-means, including: 1) producing interpretable clusters (hard assignments), 2) being easy to understand, 3) being easy to implement, and 4) being efficient for large datasets. These benefits have helped k-means to be widely used in different areas. However, k-means also has some disadvantages: 1) k-means is an ill-posed problem, 2) k-means is performed in the original space of the data, which can be noisy and contain redundant features, and 3) subjectivity is a phenomenon that exists in clustering and standard k-means cannot deal with it. To address the second problem, learning new data representations seems essential. Autoencoders are one potential candidate for learning new data representations. Indeed, representations learned through an Autoencoder can be fed to the k-means algorithm to be clustered.

Predominantly, the representation learning phase is independent from the clustering phase: after training an Autoencoder, k-means is applied on the newly learned representations. Obviously, since these two phases are entirely disconnected, there is no guarantee that the learned representations are adequate for clustering. In order to combine these two disjoint phases into one joint phase, where data representations (obtained through an Autoencoder) and k-means cluster representatives are updated simultaneously, a new objective function for k-means has to be defined, since the k-means objective function is not differentiable and gradients cannot be computed. In this thesis we propose a Deep k-means framework in which a new differentiable objective function for k-means is proposed.
This differentiability enables gradients to be computed. Indeed, through our joint Deep Clustering framework, the Autoencoder parameters (weights and biases) and the k-means cluster representatives can be updated through backpropagation in a truly joint way. Our Deep k-means framework is evaluated on various kinds of data, including images and text. The results indicate that our proposed framework outperforms the Deep Clustering methods which share the most similarities with our framework.

As mentioned earlier, another important aspect of clustering is subjectivity. Indeed, while in standard clustering no side information is used, users might be interested in providing additional information to influence the clustering. In fact, additional information can help clustering to obtain better results. In the case of document clustering, additional information can take the form of: 1) pairwise constraints and 2) seed words. In the former case, the user provides additional information about pairs of documents as must-link and cannot-link constraints (indicating respectively whether the documents in the pair come from the same cluster or not). In the latter case, the user may provide seed words for each category of documents. Seed words correspond to sets of words which are able to characterize and define each cluster. For instance, {'player', 'coach', 'stadium'} can be considered as descriptive seed words for a sports cluster, while {'politician', 'congress', 'president'} are able to describe politics. In this thesis we discuss our two individual frameworks which are able to include such constraints. The framework which includes pairwise constraints is called Pairwise-Constrained Deep Document Clustering (PCD2C) and the one which includes seed words is called Seed-guided Deep Document Clustering (SD2C). In both the PCD2C and SD2C frameworks, the algorithms are able to bias data representations based on the constraints provided by the user.

To evaluate our PCD2C and SD2C frameworks, we have used constraints obtained from an automatic constraint selection procedure and from several human constraint selection experiments. In the former approach, constraints are extracted automatically from the data. In the latter approach, we carefully designed several experiments in which users were asked to participate and provide constraints for document clustering. Indeed, we tried to replicate a real-world scenario, which enables us to obtain a more accurate analysis of the performance of our proposed PCD2C and SD2C frameworks in such cases. The results from both automatic and human constraint selection indicate that our proposed PCD2C and SD2C frameworks are able to include constraints in an end-to-end Deep Clustering framework. Finally, it is shown that in most cases, constraints help SD2C and PCD2C to obtain better results compared to when no constraints are used.

The organization of the thesis is as follows: in Chapter 2 we discuss the basics of clustering (more specifically k-means) and Autoencoders. Chapter 3 covers related work in the domains of Deep Clustering and constrained clustering. Our Deep k-means framework is proposed in Chapter 4, and finally Chapter 5 contains our proposed PCD2C and SD2C frameworks.

Chapter 2
Definitions and Notations

Clustering is one of the essential tasks in machine learning, where the goal is to find existing patterns in unlabeled data points and group them into several clusters based on their similarities or dissimilarities. Up to now, several works have tried to improve clustering algorithms from different perspectives. One of the most important aspects of improving clustering algorithms is reducing the dimensionality of the underlying data by learning new data representations which are adequate for clustering. Indeed, the underlying data might include redundant and noisy features which can reduce the performance of the clustering. Using Autoencoders is a possible solution to learn new data representations; it also helps to reduce the dimensionality of the data and to avoid standard issues in machine learning such as the curse of dimensionality. Since clustering and Autoencoders are the two most important aspects of this thesis, in this chapter we review their basic concepts.

2.1 Clustering

Learning patterns from unlabeled data can be much more challenging than the case in which true labels are available. Clustering is one way to learn patterns from unlabeled data and can be defined in many different ways [2], but the standard definition is to group data points with similar patterns into the same cluster [3]. One can define several similarity or dissimilarity measures, such as the Euclidean distance, the cosine distance, etc. Several clustering algorithms with different perspectives on grouping data points have been proposed, but since the focus of this thesis is on the k-means algorithm, we only discuss this algorithm and Fuzzy C-means in detail.

2.1.1 k-means Clustering

k-means [1] is one of the most well-known and widely used machine learning algorithms. This algorithm starts with randomly generated cluster representatives and tries to update them based on each observation of the data points. To cluster a data point, the distance between the data point and each of the cluster representatives is computed; the data point is then assigned to the nearest cluster representative. Afterwards, the selected cluster representative is updated as the mean of all its assigned data points, so as to optimize the so-called k-means objective function. In the remainder, x denotes an object from a set \mathcal{X} of objects to be clustered, K denotes the number of clusters to be obtained, r_k \in \mathbb{R}^p the representative of cluster k, 1 \le k \le K, and R = \{r_1, \dots, r_K\} the set of representatives. We use the term representative rather than centroid here to emphasize the fact that, in the case of similarity or dissimilarity functions different from the Euclidean distance, the representative of a cluster does not necessarily coincide with its centroid. The k-means objective function can be defined as:

\min_{R} \sum_{x \in \mathcal{X}} \|x - c(x; R)\|^2 \qquad (2.1)

where c(x; R) = \arg\min_{r \in R} \|x - r\|^2 is the representative of the cluster nearest to x. Moreover, in [4], further advantages of the k-means algorithm are mentioned, such as:
– producing interpretable clusters (hard assignments);
– being easy to understand;
– being easy to implement;
– being efficient for large datasets.
Due to these advantages, we have used the k-means algorithm in our proposed framework. Despite its many advantages, k-means still has a few drawbacks that should be taken into account. In [5] the initialization of k-means is discussed: based on this survey, the performance of k-means greatly depends on the initial cluster representatives. To address this problem, k-means++ [6] has been proposed. In k-means, every data point must be assigned to one of the clusters, so k-means can be sensitive to noise and outliers [7]. Several other variants of k-means have been proposed to improve its performance. In [8], a distributed k-means is proposed which decreases the runtime of k-means for large datasets. More discussions about k-means shortcomings and some approaches for improving this algorithm can be found in [9].

Algorithm 1: k-means algorithm
1. Randomly initialize the cluster representatives.
2. Assign each data point to the closest cluster representative.
3. Update each cluster representative as the average of all data points assigned to it.
4. Repeat steps 2 and 3 until convergence.
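For illustration, the loop of Algorithm 1 can be written in a few lines of NumPy using the squared Euclidean distance of Eq. (2.1). This is only an illustrative sketch (function and variable names are ours), not the code used in our experiments.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal k-means following Algorithm 1."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly initialize the representatives with K data points.
    R = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its closest representative (Eq. 2.1).
        d2 = ((X[:, None, :] - R[None, :, :]) ** 2).sum(axis=-1)   # shape (N, K)
        assign = d2.argmin(axis=1)
        # Step 3: update each representative as the mean of its assigned points.
        new_R = np.stack([X[assign == k].mean(axis=0) if np.any(assign == k)
                          else R[k] for k in range(K)])
        # Step 4: stop when the representatives no longer move.
        if np.allclose(new_R, R):
            break
        R = new_R
    return R, assign
```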

2.1.2 Fuzzy C-means

Fuzzy C-means (FCM) is an improved version of the k-means algorithm, proposed in [10]. In this algorithm, each data point can be assigned to more than one cluster. In other words, it introduces a membership function which determines how much a data point belongs to each cluster; this function takes values between 0 and 1. FCM can be formulated as below:

\min_{R} \sum_{k=1}^{K} \sum_{x \in \mathcal{X}} \mu(x, r_k)^m \, \|x - r_k\|^2, \quad 1 < m < \infty, \qquad
\mu(x, r_k) = \left[ \sum_{l=1}^{K} \frac{\|x - r_k\|^2}{\|x - r_l\|^2} \right]^{-1} \qquad (2.2)
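A minimal NumPy sketch of one FCM iteration is given below: the memberships follow Eq. (2.2), while the update of the representatives uses the usual weighted mean with weights μ^m. That update rule is the standard one and is not spelled out in the text above; the function and variable names are ours.

```python
import numpy as np

def fcm_step(X, R, m=2.0, eps=1e-12):
    """One Fuzzy C-means iteration: memberships from Eq. (2.2), then update R."""
    # Squared distances between every point and every representative, shape (N, K).
    d2 = ((X[:, None, :] - R[None, :, :]) ** 2).sum(axis=-1) + eps
    # Eq. (2.2): mu(x, r_k) = [ sum_l ||x - r_k||^2 / ||x - r_l||^2 ]^{-1}
    mu = 1.0 / (d2 * (1.0 / d2).sum(axis=1, keepdims=True))
    # Standard FCM update: each representative is the mu^m-weighted mean of the points.
    w = mu ** m
    R_new = (w.T @ X) / w.sum(axis=0)[:, None]
    return mu, R_new
```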

Similarly to the k-means algorithm, FCM suffers from sensitivity to the initialization of the cluster representatives, but several algorithms have been proposed to deal with this problem [11, 12]. Moreover, both k-means and FCM are performed in the original space of the data, which can be noisy and include redundant features. To mitigate these issues, one potential solution is to use Autoencoders, which are able to learn new data representations. In the following section, we discuss the details of the Autoencoder used in our framework.

2.2 Autoencoders

Autoencoders are at the core of modern (deep) clustering approaches. In this chapter, we do not dive into deep clustering approaches but rather give an overview of Autoencoders and their applications; more details regarding deep clustering approaches are discussed in Chapter 3. Autoencoders are a special kind of neural network in which the objective is to reconstruct the input at the output layer. These networks can be divided into two parts: 1) an encoder and 2) a decoder. The encoder is the first part of an Autoencoder, where the data is passed through several neural network layers and finally embedded into a new space, called the embedding space. Afterwards, the embedded input is fed to the decoder. The decoder is the mirrored version of the encoder, in which the number of neurons in the final layer is equal to the number of dimensions of the input. Indeed, the final layer of an Autoencoder yields a vector which aims to reconstruct the input. Finally, after training an Autoencoder, the embedded data (obtained through the encoder) can be used for a variety of purposes (supervised or unsupervised tasks). The Autoencoder loss function takes the following form:

\mathcal{L}_{rec}(\mathcal{X}, g_\eta \circ f_\theta(\mathcal{X})) = \sum_{x \in \mathcal{X}} \delta^I(x, g_\eta \circ f_\theta(x)) \qquad (2.3)

where \mathcal{X} is a set of input samples and θ and η are respectively the encoder and decoder parameters. δ^I is a dissimilarity function (e.g., Euclidean, cosine, etc.). As mentioned earlier, an Autoencoder consists of two parts, an encoder and a decoder. The encoder is formulated as f_\theta: \mathbb{R}^d \rightarrow \mathbb{R}^p and the decoder as g_\eta: \mathbb{R}^p \rightarrow \mathbb{R}^d, where \mathbb{R}^d represents the input space and \mathbb{R}^p the space in which the learned data representations are embedded. Finally, an Autoencoder is formulated as g_\eta \circ f_\theta(x), meaning that the encoder first projects the input into the embedding space and the decoder then tries to reconstruct the original data from the embedded representation. This loss function is trained using backpropagation [13]. As Formula 2.3 implies, the Autoencoder loss function is optimized when the input is well reconstructed at the output of the Autoencoder. Indeed, while trying to reconstruct the input at the output layer, the embedding space (f_\theta(x)) captures the most salient information about the input data. The transformed data in the embedding space can be the input to both supervised [14] and unsupervised algorithms [15]. Autoencoders have been used in different fields of study such as medical image processing [16], image classification [14], image generation [17, 18], speech-to-text translation [19], etc. Up to now, different kinds of Autoencoders, such as Variational Autoencoders [20] and Regularized Autoencoders [21, 22], have been proposed. We have used undercomplete Autoencoders in our proposed framework, which are one of the simplest versions of Autoencoders.

Undercomplete Autoencoders: In Section 2.2 we discussed that Autoencoders consist of an encoder and a decoder (the mirrored version of the encoder). If the dimension of the projected data in the embedding space is lower than the dimension of the original input data, the Autoencoder is called undercomplete. In this case, the representations learned through the embedding layer are able to describe the original input in the most distinctive manner. In other words, not only is the dimensionality of the original data reduced (which can be helpful to get rid of redundant features), but new distinctive representations of the original data are also obtained. Different types of neural network layers can be used here, such as Feed-Forward Neural Networks (FFNNs), Convolutional Neural Networks (CNNs) [23], etc. Figure 2.1 illustrates an example of undercomplete Autoencoders.


Figure 2.1: An example of Undercomplete Autoencoders. Blue layers represent the encoder and green layers represent the decoder while the purple layer represents the embedding layer.
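A minimal undercomplete Autoencoder of this kind can be written in TensorFlow/Keras (the library used for our implementations, cf. Section 2.4). The sketch below is only illustrative: the layer sizes are placeholder values, not the architectures used in our experiments, and the reconstruction loss instantiates δ^I of Eq. (2.3) with the mean squared error (a scaled squared Euclidean dissimilarity).

```python
import tensorflow as tf

input_dim, embedding_dim = 784, 10   # placeholder dimensions d and p

# Encoder f_theta: R^d -> R^p (blue layers in Figure 2.1).
encoder = tf.keras.Sequential([
    tf.keras.layers.Dense(500, activation="relu", input_shape=(input_dim,)),
    tf.keras.layers.Dense(embedding_dim),               # embedding layer
])

# Decoder g_eta: R^p -> R^d, mirroring the encoder (green layers in Figure 2.1).
decoder = tf.keras.Sequential([
    tf.keras.layers.Dense(500, activation="relu", input_shape=(embedding_dim,)),
    tf.keras.layers.Dense(input_dim),                    # reconstructs the input
])

# Autoencoder g_eta o f_theta, trained on the reconstruction loss of Eq. (2.3).
autoencoder = tf.keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(X, X, epochs=50, batch_size=256)       # X: the input samples
```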

Due to the simplicity and good performance of undercomplete Autoencoders, we have used them in our proposed framework, which is discussed in Chapter 4. In Chapter 3 we discuss the usage of Autoencoders in other Deep Clustering frameworks.

2.3 Evaluation Metrics

In our experiments, the clustering performances of the different methods are evaluated with respect to three standard measures: Normalized Mutual Information (NMI), clustering accuracy (ACC) and the Adjusted Rand Index (ARI).

2.3.1 NMI

NMI is an information-theoretic measure based on the mutual information of the ground-truth classes and the clusters obtained from a clustering algorithm [24]. Formally, let S = \{S_1, \dots, S_K\} and C = \{C_1, \dots, C_K\} denote the ground-truth classes and the obtained clusters, respectively. S_i (resp. C_j) is the subset of data points from class i (resp. cluster j). Let N be the number of points in the dataset. The NMI is computed according to the following formula:

NMI(C, S) = \frac{I(C, S)}{\sqrt{H(C) \times H(S)}} \qquad (2.4)

where I(C, S) = \sum_{i,j} \frac{|C_i \cap S_j|}{N} \log \frac{N |C_i \cap S_j|}{|C_i| |S_j|} corresponds to the mutual information between the partitions C and S, and H(C) = -\sum_i \frac{|C_i|}{N} \log \frac{|C_i|}{N} is the entropy of C.

2.3.2 ACC

Different from NMI, ACC measures the proportion of data points for which the obtained clusters can be correctly mapped to the ground-truth classes, where the matching is based on the Hungarian algorithm [25]. Let s_i and c_i further denote the ground-truth class and the obtained cluster, respectively, to which data point x_i, i \in \{1, \dots, N\}, is assigned. Then the clustering accuracy is defined as follows:

ACC(C, S) = \max_{\phi} \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\{s_i = \phi(c_i)\} \qquad (2.5)

where \mathbb{1} denotes the indicator function (\mathbb{1}\{true\} = 1 and \mathbb{1}\{false\} = 0) and \phi is a mapping from cluster labels to class labels.
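The maximization over φ in Eq. (2.5) is solved with the Hungarian algorithm; a short sketch using scipy is given below (the helper name and variable names are ours).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """ACC of Eq. (2.5): best one-to-one mapping phi via the Hungarian algorithm."""
    n = max(y_true.max(), y_pred.max()) + 1
    # Co-occurrence matrix: rows are clusters c, columns are classes s.
    cost = np.zeros((n, n), dtype=np.int64)
    for s, c in zip(y_true, y_pred):
        cost[c, s] += 1
    rows, cols = linear_sum_assignment(-cost)   # negate to maximize agreement
    return cost[rows, cols].sum() / len(y_true)
```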

2.3.3 ARI

The third metric that we have used to evaluate the performance of the different clustering algorithms is the ARI [26]. ARI counts the pairs of data points on which the classes and clusters agree or disagree, and is corrected for chance. Formally, ARI is given by:

ARI(C, S) = \frac{\sum_{ij} \binom{|C_i \cap S_j|}{2} - \left[\sum_i \binom{|C_i|}{2} \sum_j \binom{|S_j|}{2}\right] \binom{N}{2}^{-1}}{\frac{1}{2}\left[\sum_i \binom{|C_i|}{2} + \sum_j \binom{|S_j|}{2}\right] - \left[\sum_i \binom{|C_i|}{2} \sum_j \binom{|S_j|}{2}\right] \binom{N}{2}^{-1}} \qquad (2.6)
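NMI and ARI, in turn, are available in scikit-learn; the snippet below shows the calls corresponding to Eqs. (2.4) and (2.6), with toy label arrays as placeholders (y_true and y_pred play the same roles as in the ACC sketch above). The geometric averaging option matches the square root in Eq. (2.4).

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

y_true = np.array([0, 0, 1, 1, 2, 2])   # ground-truth classes S (toy example)
y_pred = np.array([1, 1, 0, 0, 2, 2])   # obtained clusters C (toy example)

nmi = normalized_mutual_info_score(y_true, y_pred, average_method="geometric")  # Eq. (2.4)
ari = adjusted_rand_score(y_true, y_pred)                                       # Eq. (2.6)
```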

2.4 Tools for Implementation

Python is the main programming language that we have used to implement our proposed methods and most of the baselines. For the methods based on DNNs, we have used TensorFlow for the implementation. Although we have re-implemented most of the baselines, for some others we have used the original code published by their authors (in e.g. Java, Keras, etc.). More details about the implementations and libraries can be found at the following GitHub link:

https://github.com/MaziarMF

Chapter 3
Related Work

3.1 Overview

In Chapter 2, we reviewed the basics of clustering and Autoencoders, and we discussed that Autoencoders can be used for learning new data representations in a Deep Clustering (DC) framework. Autoencoders are a special case of using DNNs in DC, and standard DNN approaches have been utilized in DC frameworks as well. Indeed, the usage of standard DNNs in DC frameworks is quite similar to their usage in supervised learning tasks, with the difference that instead of obtaining class probabilities, cluster probabilities are obtained through a softmax (last) layer. We can categorize DC algorithms based on different aspects, but we propose to categorize them based on how DNNs are used in DC frameworks, as suggested in [27]. These approaches can thus be divided into two groups: 1) Autoencoder based approaches and 2) non-Autoencoder based approaches. In the DC frameworks where Autoencoders are used for learning data representations, the focus is on incorporating the Autoencoder reconstruction loss into the final loss such that it helps the algorithm obtain more suitable representations, which are apt for clustering. On the other hand, non-Autoencoder based approaches, which utilize standard DNNs in their frameworks, try to embed the data points into m (the number of layers) distinctive spaces and finally obtain the clustering probabilities. Later in this chapter, we review papers in both categories and provide more detailed information about them. Additionally, in this chapter we discuss the different kinds of constraints which are applicable in standard clustering, and we review some of the papers in this domain.

3.2 Deep Clustering

As mentioned earlier, DC approaches can be categorized based on different aspects (loss function, type of neural network, etc.), but in this thesis we propose to categorize them based on the use of an Autoencoder. These approaches can thus be divided into Autoencoder based and non-Autoencoder based approaches. In the following, we discuss both categories and review works in these domains.

3.2.1 Autoencoder Based Deep Clustering Approaches

DC algorithms mostly use Autoencoders to learn new representations from the raw data. Usually in these algorithms, the Autoencoder loss is added to the clustering loss to jointly learn data representations and cluster representatives through backpropagation. In this case, to stabilize the algorithms, pretraining is essential [28]. Pretraining consists in training an Autoencoder, followed by applying k-means on the learned representations (this process is performed in a disjoint way). Then, the learned Autoencoder parameters (e.g., weights and biases) and the cluster representatives (obtained by applying k-means on the learned representations) are used to initialize the Autoencoder parameters and cluster representatives before training a DC framework. One of the best-known algorithms in this category is the Deep Clustering Network (DCN) [29]. This k-means based algorithm is one of the first Autoencoder based DC algorithms. DCN first pretrains an Autoencoder and then, after initializing the Autoencoder parameters and cluster representatives, optimizes the Autoencoder loss and the clustering loss in an alternating way. This algorithm claims to be k-means friendly, meaning that the final representations are close to a k-means friendly space. Since our proposed method is also based on k-means, we will later provide more details and analysis of this algorithm. Preserving the local structure of the data is a challenging task that many papers have tried to address. Deep Embedding Network (DEN) [30] is one of these algorithms: it first pretrains an Autoencoder and then applies a locality-preserving constraint. Improved Deep Embedded Clustering (IDEC) [31] is another approach which tries to preserve the local structure of the data. This soft clustering algorithm also employs Autoencoders to learn representations for clustering. Moreover, its clustering loss contains the Kullback-Leibler (KL) divergence between the soft labels and an auxiliary target distribution. Finally, the total loss function is a combination of the clustering loss and the Autoencoder loss. IDEC obtained state-of-the-art results for different types of datasets (e.g., image, text). We have selected this algorithm as another baseline to compare our proposed framework against. Later in this chapter, the details of this algorithm are discussed as well.
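As an illustration of the pretraining step described above, the sketch below trains an Autoencoder on the reconstruction loss alone and then runs k-means on the resulting embeddings to initialize the cluster representatives. It assumes a Keras autoencoder/encoder pair like the one sketched in Section 2.2; the function name and hyperparameter values are placeholders, not those of any specific method.

```python
from sklearn.cluster import KMeans

def pretrain(autoencoder, encoder, X, n_clusters, epochs=50, batch_size=256):
    """Disjoint pretraining: reconstruction-only training, then k-means on embeddings."""
    # 1) Train the Autoencoder with the reconstruction loss only.
    autoencoder.fit(X, X, epochs=epochs, batch_size=batch_size, verbose=0)
    # 2) Embed the data and run k-means to get initial cluster representatives.
    embeddings = encoder.predict(X)
    km = KMeans(n_clusters=n_clusters, n_init=20).fit(embeddings)
    # The pretrained AE weights and km.cluster_centers_ then initialize the joint DC training.
    return km.cluster_centers_, km.labels_
```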

Deep Multi-Manifold Clustering (DMC) [32] is a deep learning based multi-manifold clustering approach that proposes a loss function containing three elements. The first element is the clustering loss, which makes the data representations cluster-friendly. The second term is the Autoencoder loss, which helps to obtain new data representations, and the third is a locality-preserving loss which makes the representations useful and meaningful. In addition to these approaches, in [33] a Gaussian Mixture Model (GMM) [34] and k-means are used in a deep continuous framework where cluster representatives and network parameters are updated jointly. In this paper, a new relaxation function is introduced based on the Alternating Direction Method of Multipliers (ADMM) [35]. A GMM based deep learning model is also proposed in [36]. This paper first tries to decrease the intra-cluster variance by obtaining more suitable representations, and then increases the inter-cluster variance to make the clusters more separable.

Figure 3.1 illustrates the general architecture of Autoencoder based DC approaches.

Figure 3.1: General DC architecture.

Different from the approaches mentioned earlier, in [37] a new fast clustering approach based on denoising Autoencoders is proposed. This method is inspired by the objective functions of k-means and spectral clustering. In [38], deep stacked Autoencoders and graph clustering are combined such that, with the help of stacked Autoencoders, the nonlinear embedding of the graph is obtained first. This algorithm then performs k-means on the learned embeddings to obtain the final cluster representatives. It has been shown that the proposed method obtains state-of-the-art results compared to spectral clustering. Clustering can also be performed by utilizing pairwise constraints [39]. The main idea of this paper is that similar pairs should have similar representations in the embedding space, while dissimilar pairs should have distant embeddings. The authors use K-Nearest Neighbors (KNN) to obtain the pairwise relations (for each data point, the closest point is considered similar). This approach can be extended when partial labels are available (semi-supervised learning); in this case, both KNN and labels are used to build the pairwise relations. In [40] pairwise relations are used in the pretraining phase. Indeed, this paper focuses on the pretraining phase of clustering, since it assumes that achieving better results in the pretraining phase leads to overall better results in DC as well. The pretraining loss contains the weighted sum of all pairwise similarities. It has been illustrated that by utilizing this strategy, one can obtain state-of-the-art results on the MNIST dataset. An ensemble learning framework is proposed in [41]. This paper focuses on non-linear dimensionality reduction using an Autoencoder and attempts to fuse different layers of the Autoencoder to capture salient information from high-dimensional data. This framework can achieve robust results, especially for imbalanced clusters. In [42], different Autoencoders are combined to produce different representations for each cluster; no regularization technique is needed for this algorithm to avoid singleton clusters. In [43] the embeddings obtained from an Autoencoder are fed to a DNN which assigns them to different clusters. It is claimed that this approach can yield a k-means friendly space (similarly to DCN): by applying k-means on the learned embeddings (after the training phase), more suitable results are obtained compared to similar methods. Maximizing the inter-cluster variance and minimizing the intra-cluster variance for image datasets has been studied in [44]. In this paper, for each image, one related image and one non-related image are selected. The goal is to make the representation of the input image close enough to the related image while keeping it well separated from the non-related one. In [45] a dual deep Autoencoder is used which maps the input image and its noisy version into a latent space and then applies clustering. The results have shown that the method is robust to noisy data, as expected. The Dual-Adversarial Autoencoder is another method which uses dual adversarial Autoencoders; it is able to simultaneously maximize the likelihood function and the mutual information between observed examples and a subset of latent variables [46]. A dynamic Autoencoder is designed and used in an unsupervised framework such that the reconstruction error is gradually eliminated from the loss function [47]. Inspired by PCA-k-means [48], an Autoencoder producing orthogonal features has been proposed in [49] and employed in a joint clustering framework. A new definition of denoising Autoencoder for text data has been proposed in [50]. In this paper, each document is influenced not only by its own information but by the neighboring documents as well; neighboring documents are identified by cosine similarity. This algorithm has shown advantageous results, especially on real-world datasets, when a clustering algorithm is applied on the learned embeddings. The combination of feature selection algorithms and Autoencoders has also shown promising results [51]: since not all hidden units are beneficial for obtaining a well-separable embedding space, feature selection algorithms are used to extract only the useful hidden units. Variational Deep Embedding [52] is an unsupervised DC framework which employs Variational Autoencoders and a GMM. In this algorithm, the GMM first picks a cluster, a latent embedding is generated from the chosen cluster, and finally DNNs decode the latent embedding into an output. Deep Adversarial Clustering (DAC) [53] is another deep clustering approach, based on Adversarial Autoencoders [54]. This algorithm utilizes an adversarial training procedure to match the posterior of the latent representation with a prior distribution. DAC comprises three losses: a clustering loss, an adversarial loss and a reconstruction loss. Figure 3.2 illustrates the architecture of the Adversarial Autoencoder used by DAC. The Categorical Generative Adversarial Network (CATGAN) [55] is another variation of GANs which has been used for unsupervised representation learning. The framework of CATGAN is slightly different from that of GAN: the discriminator tries to classify the data into a chosen number k of categories instead of two categories (real or fake) as in GAN. In this case, the generator generates samples belonging to one of the k clusters. Figure 3.3 illustrates the architecture of this approach. ADEC [56] utilizes adversarial Autoencoders and k-means in a joint DC framework where cluster representatives and network parameters are updated in a joint way.


Figure 3.2: AAE architecture.


Figure 3.3: CATGAN architecture.

Subspace clustering is another deep learning concept which has recently been combined with DNNs. In subspace clustering, the goal is to project each cluster into a different subspace. In [57] a novel deep subspace clustering architecture is proposed, in which both the local and the global structure of the data are preserved in the embedding space. Moreover, in this paper prior sparsity information is considered in the representation learning phase to preserve the sparse reconstruction relations over the whole data. Another DNN-based subspace clustering approach is proposed in [58], where learning representations and cluster assignment are done simultaneously. Using the self-expressiveness property [59, 60] in deep subspace clustering is another novelty, presented in [61]. In this paper, deep clustering is framed as a subspace clustering problem such that the mapping from the original data space to a low-dimensional subspace is learned by an Autoencoder. The key contribution of this paper is the self-expressive layer, placed between the encoder and the decoder, which is defined to encode the self-expressiveness property. In [62] a deep generative CNN is used to preserve the local information of the data. Deep Continuous Clustering (DCC) [63] is another deep clustering algorithm, based on Robust Continuous Clustering (RCC) [64]. As the name implies, this method also benefits from jointly optimizing an Autoencoder and cluster representatives: the data is embedded into a lower dimension by an undercomplete Autoencoder and then assigned to one of the clusters. Being independent of any prior knowledge of the number of ground-truth clusters is the highlight of this paper.

Deep Spectral Clustering (DSC) is another family of DC algorithms which has been studied recently. In [65], a few points called landmarks are first used to generate the adjacency matrix of the graph corresponding to the dataset. Then, a deep Autoencoder is used to perform the eigendecomposition through the Laplacian matrix of the graph. Spectral Clustering via Ensemble Deep Autoencoder Learning (SC-EDAE) [66] is another approach which combines spectral clustering and DNNs. In this paper, several Autoencoders with different hyperparameters are first used to produce different representations. Then, these representations are combined into a space which is shared among the Autoencoders. Finally, the ensemble representations are fed to a subspace clustering algorithm. SpectralNet [67] is a more recent approach in DSC which is proposed to address major drawbacks of spectral clustering. In this algorithm, a deep network first learns a function which maps the input data into the eigenspace of the associated Laplacian matrix and then groups them into different clusters. Additionally, a new training procedure based on stochastic optimization is proposed. The learned mapping function allows SpectralNet to reach better generalization, while the proposed stochastic optimization improves its scalability.

3.2.2 Non-Autoencoder based Deep Clustering Approaches

In this category, instead of utilizing Autoencoders, any DNN can be used to map the data from the original space directly into k separate clusters. Deep Embedded Regularized Clustering (DEPICT) [68] utilizes several CNN layers followed by a multinomial logistic regression (softmax) function. A KL divergence loss is used as the clustering loss, normalized by a prior reflecting the frequency of cluster assignments; moreover, a regularization term is defined which encourages balanced clustering. This algorithm shows promising results for clustering image data. In the majority of DC algorithms designed for image datasets, the attention is mostly on introducing new CNN architectures. For instance, Generative Adversarial Networks (GANs) [69] are among the most popular DNN frameworks and have been used for both supervised and unsupervised tasks. In this framework, a generative model and a discriminative model compete to defeat each other: the generative model tries to capture the distribution of the data, while the discriminative model tries to estimate the probability that a sample comes from the training samples rather than from the generative model. GANs were first studied for unsupervised representation learning in [70]. In this paper, a new CNN-based GAN with a few constraints on the architecture is proposed for learning representations of image datasets in an unsupervised manner. Although it is not guaranteed that this architecture can be used for types of data other than images, this approach obtains promising results on image datasets. HashGAN [71] is another approach, which addresses the limitations of deep hash clustering [72] approaches by introducing a new loss function and sharing the discriminator and encoder parameters. This approach has shown improvements on real image datasets compared to similar approaches. A new unsupervised learning framework which employs a GAN is proposed in [73], where clustering is used as a final step to solve the problem of unsupervised fine-grained object category discovery.

In [74] a novel algorithm is proposed which uses a joint optimization to learn cluster representatives and network parameters simultaneously. In this algorithm, first data is embedded using NNs then a cost function is defined such that learning cluster representatives and data representations are performed simultaneously. Finally, an additional layer is introduced to map the embedded data into logits which allows obtaining clustering memberships. In [75] multiple convolutional layers are stacked on top of each other and in the last layer, softmax is applied to obtain clustering distributions. In this paper, a discriminative loss is applied which enables the network to decrease intra-cluster variance and increase inter-cluster variance. This method achieves state-of-the-art results on image datasets. Clustering Convolutional Neural Network (CCNN) [76] is another ap- proach which as the name implies, is strongly dependent on CNNs. This approach first uses a pretrained model on ImageNet dataset to learn initial data representations and cluster representatives. In each iteration of the training phase, cluster representatives, cluster assignment and network pa- rameters are updated (through back propagation). This algorithm reduces memory consumption compared to other CNN-based algorithms and proposes a novel way of updating cluster representatives. In [77] a recurrent process has been proposed. In this algorithm a CNN has been used to map the data from original space into several new spaces and finally hierarchical clustering is applied on the learned representations to obtain cluster representatives. In the forward pass, two clusters can be merged based on the predefined loss while in the backward pass, data representations obtained from CNN will be updated based on the merged clusters. In [78] the clustering task has been transformed into a binary pairwise- classification framework to determine whether pairs of images belong to the same cluster or not. Due to the unsupervised nature of clustering, the CNN-based classifier in this approach is only trained on the noisily labeled examples obtained by selecting difficult-to-cluster samples in a curriculum learning fashion. Information Maximizing Self-Augmented Training (IMSAT) [79] is an- other method which is a combination of NNs and regularized Information Maximization (RIM) [80]. In this paper, the goal is to design a function which maps the data into several discrete representations (clustering is a special task of this approach). Deep Nonparametric Clustering (DNC) [81] is another clustering method which relies on Deep Belief Networks (DBN) [82]. In this algorithm first data representations are obtained by passing the input data through a DBN. Sequentially, a nonparametric maximum margin clustering [83] is applied 22 3. RELATED WORK on the learned representations. Afterwards, a fine-tuning process will be performerd to update DBN parameters. Deep Density Clustering is another novel approach in the domain of DC [84] which has been used for image data. As the name implies, this method is based on density of the local neighborhood in the feature space. Deep Embedding Clustering (DEC) [85] is one of the most remarkable methods in DC. In this algorithm, first an Autoencoder is pretrained by optimizing the reconstruction loss. Then, the decoder part of the network will be eliminated and the encoder part will be fine-tuned in the training process of clustering. 
In this approach, cluster representatives are updated based on the KL divergence between soft labels and an auxiliary target distribution. This algorithm has become one of the main baselines among DC algorithms. DEC with Data Augmentation (DEC-DA) [86] is an extension of DEC that uses data augmentation: an Autoencoder is first pretrained using both original and augmented data, and clustering is finally applied on the original, non-augmented data (similarly to DEC). Discriminatively Boosted Clustering (DBC) [87] is another clustering method which has exactly the same structure as DEC but, instead of using FFNNs, it benefits from CNNs. Using CNNs clearly empowers DBC for clustering image datasets compared to DEC. Completely different from other DC algorithms that utilize only the encoder part of an Autoencoder, in [88] partial pairwise relationships between the samples of a given dataset are used to build a DC framework. In this paper, the relation between each pair is identified as must-link (ML) (same cluster) or cannot-link (CL) (different clusters). The data samples and their relationships (ML or CL) are fed to a DNN and the goal is to: 1) minimize the KL divergence loss between ML pairs and 2) maximize the KL divergence loss between CL pairs. Clustering assignments are obtained at the output layer, where a softmax function is applied. To obtain reasonable results, this algorithm requires additional information about the pairwise relations of the samples; in real scenarios where such labels are not available, this algorithm cannot function well. Figure 3.4 illustrates the architecture of this algorithm. Despite interesting recent advances in DC algorithms, most of these algorithms are designed for specific tasks or specific kinds of data (e.g., images, text). Only a few algorithms have attempted to address the limitations existing in DC. Since our proposed DC framework is based on the k-means algorithm, DCN and IDEC are the ones which share the most characteristics with our approach. The details of these algorithms are given below.

Figure 3.4: Pairwise Clustering Architecture (Input → Neural Network → Softmax Layer → probability of belonging to cluster 1, cluster 2 or cluster 3).

IDEC: As mentioned earlier, IDEC is an improved version of DEC where, instead of discarding the decoder part of the Autoencoder, the reconstruction loss obtained through the decoder is kept during training. The reconstruction loss is defined as:

$$ L_{rec}(\mathcal{X}, \eta, \theta) = \sum_{x \in \mathcal{X}} \delta^I\big(x,\, g_\eta \circ f_\theta(x)\big) \qquad (3.1) $$

where f_θ(x) is the embedding of the data and g_η ∘ f_θ(x) is the reconstructed version of the data obtained from the output layer. δ^I can be any dissimilarity measure (e.g., Euclidean distance, cosine dissimilarity). After pretraining the Autoencoder on the whole data, the clustering loss is computed. In this algorithm, clustering is based on the KL divergence between the soft computed labels (Q) and the target distribution (P) obtained from Q. The soft labels are computed as follows:

$$ q_{i,j} = \frac{\left(1 + \|f_\theta(x_i) - r_j\|^2\right)^{-1}}{\sum_{j'} \left(1 + \|f_\theta(x_i) - r_{j'}\|^2\right)^{-1}} \qquad (3.2) $$

where x_i is the i'th input and r_j is the j'th cluster representative; q_{i,j} is the similarity between the i'th embedded input and the j'th cluster representative. As mentioned earlier, p_{i,j} is computed from q_{i,j}. The target distribution is thus computed as follows:

$$ p_{i,j} = \frac{q_{i,j}^2 / \sum_i q_{i,j}}{\sum_{j'} \left(q_{i,j'}^2 / \sum_i q_{i,j'}\right)} \qquad (3.3) $$

Finally the clustering loss is defined as:

$$ L_{clus} = KL(P\,\|\,Q) = \sum_i \sum_j p_{i,j} \log \frac{p_{i,j}}{q_{i,j}} \qquad (3.4) $$

Formula 3.5 gives the IDEC loss function. The IDEC loss (like the loss functions of most DC algorithms) is a combination of the reconstruction loss and the clustering loss; λ is a hyper-parameter which balances the emphasis between the reconstruction loss and the clustering loss.

$$ L = L_{rec} + \lambda\, L_{clus} \qquad (3.5) $$

In IDEC, the labels can be obtained from the following formula:

$$ s_i = \arg\max_j\; q_{i,j} \qquad (3.6) $$

The whole IDEC algorithm is summarized in Algorithm 2.
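To make Formulas 3.2–3.6 concrete, the following minimal NumPy sketch (our own illustration, not the IDEC authors' code; the variable names embeddings and reps are assumptions) computes the soft assignments Q, the target distribution P, the KL clustering loss and the resulting labels:

import numpy as np

def soft_assignments(embeddings, reps):
    # Formula 3.2: Student's t-kernel between embedded points and cluster representatives
    sq_dist = ((embeddings[:, None, :] - reps[None, :, :]) ** 2).sum(axis=-1)
    q = 1.0 / (1.0 + sq_dist)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    # Formula 3.3: sharpen Q and renormalize per document
    weight = q ** 2 / q.sum(axis=0)
    return weight / weight.sum(axis=1, keepdims=True)

def clustering_loss(p, q):
    # Formula 3.4: KL(P || Q)
    return float(np.sum(p * np.log(p / q)))

# Toy usage: 5 embedded points in R^3 and K = 2 representatives
embeddings = np.random.randn(5, 3)
reps = np.random.randn(2, 3)
q = soft_assignments(embeddings, reps)
p = target_distribution(q)
labels = q.argmax(axis=1)          # Formula 3.6
loss = clustering_loss(p, q)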

DCN: This algorithm shares even more similarities with our proposed DC algorithm than IDEC does. Generating a k-means-friendly space is the highlight of the paper. Although the paper claims that DCN is able to update the Autoencoder weights and the cluster representatives in a joint way, based on the proposed formulation the learning process is performed by alternating between optimizing the reconstruction loss and the k-means loss. The proposed reconstruction objective for DCN is similar to the one defined in Formula 3.1 (as in IDEC). In this algorithm, the clustering loss is defined as:

$$ L_{clus} = \sum_{x \in \mathcal{X}} \|f_\theta(x) - R\, s_x\|_2^2 \qquad (3.7) $$

Algorithm 2: IDEC algorithm
Inputs: input data X; number of clusters K; target distribution update interval T; stopping threshold δ; maximum number of iterations MaxIter
Outputs: cluster representatives R; labels S
1  for iter ∈ 0, 1, ..., MaxIter do
2      if iter % T == 0 then
3          Compute the embeddings for all points
4          Compute P using Formula 3.3
5          Save the last label assignments: s_old = s
6          Compute the new label assignments based on Formula 3.6
7          if sum(s_old ≠ s)/n < δ then
8              Stop training
9      Choose a minibatch
10     Update the Autoencoder parameters and the cluster representatives

where R gathers the cluster representatives and s_x is the (one-hot) assignment of data point x to one of them. Finally, the total loss is defined as:

$$ L = L_{clus} + L_{rec} \qquad (3.8) $$

In this algorithm, an Autoencoder is first pretrained, and then the Autoencoder weights, the cluster representatives (r_k, k ∈ 1...K) and the cluster assignments (S) are initialized. To optimize the loss function of Formula 3.8, an alternating optimization technique is used. First, the reconstruction loss (Formula 3.1) is optimized using backpropagation. Then, to optimize the clustering loss (Formula 3.7), a dedicated optimization procedure is used, in which the cluster assignments are first obtained as follows:

 2  1, ifj = arg min||fθ(x) − rk|| sx,j = k∈1...K (3.9)  0, otherwise

To optimize the cluster representatives (R), let us denote by c_k^x the number of samples the algorithm has assigned to cluster k when sample x is processed. Representative r_k is then updated as:

$$ r_k = r_k - \left(\frac{1}{c_k^x}\right)\big(r_k - f_\theta(x)\big)\, s_{x,k} \qquad (3.10) $$

Because of this alternating optimization scheme, DCN is not able to update the cluster representatives and the Autoencoder weights in a truly joint way. In the next chapter, we discuss how our proposed DC framework updates cluster representatives and Autoencoder weights in a truly joint way, which yields better performance and a space that is more k-means friendly than the one produced by DCN. Algorithm 3 summarizes the DCN algorithm.

Algorithm 3: DCN algorithm
Inputs: input data X; number of clusters K; cluster representatives R; cluster assignments S
Outputs: cluster representatives R; labels S
1  for i in number of epochs do
2      for j in number of batches do
3          Withdraw a minibatch
4          Compute the reconstruction loss using Formula 3.1
5          Update the cluster assignments (S) using Formula 3.9
6          Update the cluster representatives (R) using Formula 3.10
7          Compute the clustering loss (L_clus) using Formula 3.7
8          Compute the total loss (L) using Formula 3.8
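As an illustration of the assignment and representative updates of Formulas 3.9 and 3.10 inside Algorithm 3, here is a minimal NumPy sketch; it assumes the embeddings f_θ(x) of a minibatch are already computed, and the function and variable names are ours, not those of the DCN implementation:

import numpy as np

def dcn_cluster_step(batch_embeddings, reps, counts):
    # Clustering step of Algorithm 3 for one minibatch of embedded samples
    for h in batch_embeddings:
        # Formula 3.9: hard assignment of the sample to its closest representative
        k = int(np.argmin(((reps - h) ** 2).sum(axis=1)))
        counts[k] += 1                               # running count of samples assigned to cluster k
        # Formula 3.10: move representative k towards the sample with step size 1 / count
        reps[k] -= (1.0 / counts[k]) * (reps[k] - h)
    return reps, counts

# Toy usage: K = 3 representatives in a 10-dimensional embedding space
reps = np.random.randn(3, 10)
counts = np.zeros(3, dtype=int)
batch = np.random.randn(32, 10)                      # embeddings f_theta(x) of one minibatch
reps, counts = dcn_cluster_step(batch, reps, counts)

The network parameters are updated separately through backpropagation of the reconstruction loss, which is precisely the alternating scheme criticized above.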

3.3 Constrained Clustering Approaches

Due to the subjective aspect of clustering, different users might impose different constraints on clustering. For instance, to cluster a set of news articles, one user might be interested in having two clusters, one gathering articles expressing liberal views and the other articles expressing democratic views, while another user might prefer to divide the documents into economy and environment topics. Constraints can take different forms, such as seed words and pairwise constraints. Seed words correspond to small sets of words which characterize and define each cluster. For instance, 'player', 'coach' and 'stadium' can be considered descriptive seed words for a sports cluster, while 'politician', 'congress' and 'president' describe politics. Alternatively, one might use pairwise constraints, where a set of must-link and cannot-link document pairs is used as side information which can lead towards better clustering results. As the names imply, must-link pairs contain documents coming from the same category while cannot-link pairs contain documents coming from different categories. In the remainder of this chapter, we review some of the works in these domains. As this thesis mainly focuses on document clustering, we concentrate on constraints in this domain; in other domains (e.g., image clustering), the nature of the constraints can be different [89].

3.3.1 Seed Words Based Approaches

The constrained clustering problem based on seed words in fact bears strong similarity with seed-guided dataless text classification, which consists in categorizing documents based on a small set of seed words describing the classes/clusters. The task of dataless text classification was introduced independently in [90] and [91]. In [90], the seed words are provided by a user and exploited to automatically label a part of the unlabeled documents. In [91], on the other hand, seed words initially correspond to labels/titles for the classes of interest and are extended based on co-occurrence patterns. In both cases, a Naive Bayes classifier is applied to estimate the documents' class assignments. In the wake of these seminal works, several studies further investigated the exploitation of seed words for text classification [92–94]. Both an 'on-the-fly' approach and a bootstrapping approach, projecting seed words and documents into the same space, were introduced in [92]. The former approach simply consists in assigning each document to the nearest class in the space, whereas the latter learns a bootstrapping Naive Bayes classifier with the class-informed seed words as initial training set. Another bootstrapping approach is studied in [94], where two different methods are considered to build the initial training set from the seed words: Latent Semantic Indexing and Gaussian Mixture Models. The maximum entropy classifier proposed in [93] instead directly uses the seed words' class information by assuming that documents containing seed words from a class are more likely to belong to this class. More recently, the dataless text classification problem was addressed through topic modeling approaches [95–98], extending the Latent Dirichlet Allocation model [99]. The model devised in [95] integrates the seed words as pseudo-documents, where each pseudo-document contains all the seed words given for a single class. The co-occurrence mechanism underlying topic models, along with the known class membership of pseudo-documents, helps guide the actual documents to be classified towards their correct class.

In [96], the Seed-guided Topic Model (STM) distinguishes between two types of topics: category topics and general topics. The former describe the class information and are associated with an informed prior based on the seed words, whereas the latter correspond to the general topics underlying the whole collection. The category topics assigned to a document are then used to estimate its class assignment. STM was extended in [98] to simultaneously perform classification and document filtering – which consists in identifying the documents related to a given set of categories while discarding irrelevant documents – by further dividing category topics into relevant and non-relevant topics. Similarly to STM, the Laplacian Seed Word Topic Model (LapSWTM) introduced in [97] considers both category topics and general topics. It however differs from previous models in that it enforces a document manifold regularization to overcome the issue of documents containing no seed words. While these models outperform previously proposed ones, they suffer from a lack of flexibility regarding the input representations they rely on. Indeed, topic models require documents to be organized as sets of discrete units – the word tokens. This prohibits the use of representation learning techniques such as word embeddings (e.g., Word2Vec [100] and GloVe [101]). Moreover, in these approaches, representation learning is not performed. For a more general survey on constrained clustering, we invite the reader to refer to [102]. In this thesis, we discuss our proposed SD2C framework, which is able to employ Autoencoders for representation learning. Indeed, SD2C is able to bias data representations based on the seed words provided by the users. We discuss our proposed framework in more detail in Chapter 5.

3.3.2 Must-link and Cannot-link Based Approaches

As mentioned earlier, additional information can help users reflect their needs in clustering, and pairwise constraints are one such form of information. So far, few deep learning-based studies have addressed this type of constraints. For instance, in [103] a deep clustering framework with a KL-divergence loss has been designed which is able to handle multiple types of constraints simultaneously. There have also been many works that utilized this information in non-deep clustering frameworks [104–109]. In [104], pairwise constraints are used in a generative framework where Expectation Maximization (EM) and Generalized EM procedures handle must-link and cannot-link constraints, respectively. A boosting clustering algorithm named BoostCluster [106] iteratively improves the accuracy of any given clustering algorithm by using pairwise constraint information. Maximum Margin Clustering (MMC) [110], which borrows from the theory of support vector machines [111], is coupled with pairwise constraints in [107] to improve the accuracy of the original MMC model. The k-means objective function is modified in [105] to utilize pairwise constraint information. While in all the aforementioned approaches an auxiliary term is added to the clustering loss to improve its performance, [108] and [109] investigate the use of pairwise constraints in the domain of semi-supervised learning. Using pairwise constraints can be beneficial and lead towards better results, yet these approaches suffer from using the data in the original space and do not learn any new representations from the data. Similarly to the seed-words-based approaches, representation learning is not performed in any of the above approaches. In Chapter 5, we introduce our PCD2C framework, which is able to bias data representations based on the must-link and cannot-link constraints provided by users. In the next chapter, we discuss our proposed DKM framework, which is able to learn data representations and k-means cluster representatives in a truly joint way.

Chapter 4: Deep k-means Clustering

4.1 Introduction

In the previous chapters, we reviewed clustering and recent advances in DC. Although there is a variety of DC algorithms, the applicability of these algorithms to different types of data (e.g., image, text) remains unclear. Indeed, most of the proposed approaches in DC are specifically designed for one type of data (mostly images). In this chapter, we propose our Deep k-means (DKM) approach. We first study the problem of jointly clustering and learning representations. As several previous studies have shown, learning representations that are both faithful to the data to be clustered and adapted to the clustering algorithm can lead to better clustering performance, all the more so when the two tasks are performed jointly. We propose here such an approach for k-Means clustering, based on a continuous reparametrization of the objective function that leads to a truly joint solution. Indeed, we show that one can solve the k-means non-differentiability problem simply by using a softmax function instead of the argmin. The behavior of our approach is illustrated on various datasets, showing its efficacy in learning representations for objects while clustering them. The proposed DKM approach has been tested on a variety of data types, including image and text data, and the obtained results show the generalizability of our proposed approach. The rest of this chapter provides more details on the proposed DKM framework.


4.2 Deep k-means

As mentioned earlier, due to the non-differentiability of the k-means objective function, jointly learning data representations and cluster representatives is not directly feasible. Indeed, in order to employ k-means in a DC framework, a new k-means objective function which allows gradients to be computed is necessary. DCN [29] tries to address the k-means non-differentiability problem through an alternating update of cluster representatives and data representations. On the contrary, our proposed framework benefits from jointly updating cluster representatives and data representations. In this section, we introduce our proposed DKM framework, in which all network and clustering parameters are updated simultaneously. In the remainder, x denotes an object from a set X of objects to be clustered. K denotes the number of clusters to be obtained, r_k ∈ ℝ^p the representative of cluster k, 1 ≤ k ≤ K, and R = {r_1, ..., r_K} the set of representatives. We use the term representative rather than centroid here to emphasize the fact that, in case of similarity functions or dissimilarity functions different from the Euclidean distance (e.g., cosine distance), the representative of a cluster does not necessarily coincide with its centroid. δ^I and δ^E denote similarity or dissimilarity functions in ℝ^p; for any vector y ∈ ℝ^p, c_{δ^E}(y; R) denotes the closest representative of y according to δ^E. The deep k-Means problem takes the form:

$$ \min_{R, \theta, \eta} \sum_{x \in \mathcal{X}} \Big[ \delta^I(g_\eta \circ f_\theta(x), x) + \lambda\, \delta^E\big(f_\theta(x), c_{\delta^E}(f_\theta(x); R)\big) \Big], $$
$$ \text{with: } c_{\delta^E}(f_\theta(x); R) = \arg\min_{r \in R} \delta^E(f_\theta(x), r) \qquad (4.1) $$

δ^I(g_η ∘ f_θ(x), x) is the reconstruction loss function that measures the error between an object x and its reconstruction by the Autoencoder g_η ∘ f_θ(x). θ and η represent, as before, the set of parameters of the Autoencoder. Figure 4.1 provides an overview of the proposed Deep k-Means approach. However, as most Autoencoders do not use regularization, we dispense with such a term here. δ^E(f_θ(x), c_{δ^E}(f_θ(x); R)) is the clustering loss corresponding to the generalized k-Means objective function in the embedding space. Finally, λ in Formula 4.1 regulates the trade-off between seeking good representations for x – i.e., representations that are faithful to the original examples – and representations that are useful for clustering purposes.

Standard k-Means is obtained by setting δ^E to the Euclidean distance, whereas generalized k-Means with the cosine similarity is obtained by setting δ^E to the cosine function. δ^I can also be set to the Euclidean or cosine function. In the remainder, we focus on functions δ^I that are differentiable wrt θ and η, and functions δ^E that are differentiable wrt θ, η and R, where differentiability wrt R means differentiability wrt all dimensions of r_k, 1 ≤ k ≤ K. We now introduce a parametrized version of the above problem that constitutes a continuous generalization, by which we mean that all functions considered are continuous wrt the introduced parameter. To do so, we first note that the clustering objective function can be rewritten as:

$$ \delta^E\big(f_\theta(x), c_{\delta^E}(f_\theta(x); R)\big) = \sum_{k=1}^{K} \delta_k^E(f_\theta(x); R) $$
with:
$$ \delta_k^E(f_\theta(x); R) = \begin{cases} \delta^E(f_\theta(x), r_k) & \text{if } r_k = c_{\delta^E}(f_\theta(x); R) \\ 0 & \text{otherwise} \end{cases} $$

Let us now assume that we know some function Gk(fθ(x), α; R) such that:

(i) G_k is differentiable wrt θ, R and continuous wrt α (differentiability wrt R has the same meaning as before);

(ii) ∃ α_0 ∈ ℝ ∪ {−∞, +∞} such that:
$$ \lim_{\alpha \to \alpha_0} G_k(f_\theta(x), \alpha; R) = \begin{cases} 1 & \text{if } r_k = c_{\delta^E}(f_\theta(x); R) \\ 0 & \text{otherwise} \end{cases} $$

Then, one has:

Lemma 4.2.1. ∀x ∈ X :

$$ \lim_{\alpha \to \alpha_0} \delta^E(f_\theta(x), r_k)\, G_k(f_\theta(x), \alpha; R) = \delta_k^E(f_\theta(x); R) $$

The proof directly derives from the definitions of δ_k^E and G_k. From Lemma 4.2.1, one can see that:

Figure 4.1: Overview of our Deep k-Means approach instantiated with losses based on the Euclidean distance.

Property 4.2.1. The problem given in (4.1) is equivalent to:

$$ \min_{R, \theta}\; \lim_{\alpha \to \alpha_0} F(\mathcal{X}, \alpha; \theta, R), $$
$$ \text{with: } F(\mathcal{X}, \alpha; \theta, R) = \sum_{x \in \mathcal{X}} \Big[ \delta^I(g_\eta \circ f_\theta(x), x) + \lambda \sum_{k=1}^{K} \delta^E(f_\theta(x), r_k)\, G_k(f_\theta(x), \alpha; R) \Big] \qquad (4.2) $$

All functions in the above formulation are fully differentiable wrt both θ and R. One can thus estimate θ and R through a simple, joint optimization based on stochastic gradient descent (SGD) for a given α:

$$ (\theta, R) \leftarrow (\theta, R) - \frac{\eta}{|\tilde{\mathcal{X}}|}\, \nabla_{(\theta, R)} F(\tilde{\mathcal{X}}, \alpha; \theta, R) \qquad (4.3) $$

with η the learning rate, X̃ a random mini-batch of X, and ∇_{(θ, R)} the gradients with respect to θ and R.

4.2.1 Choice of Gk

Several choices are obviously possible for G_k. A simple choice, used throughout this study, is based on a parametrized softmax function. The fact that the softmax function can be used as a differentiable approximation of arg max or arg min is well known and has been applied in different contexts, as in the recently proposed Gumbel-softmax distribution employed to approximate categorical samples [112, 113]. We introduce a parametrized version here to better control the approximation and rely on a deterministic annealing scheme (see Section 4.2.2). The parametrized softmax function which we adopted takes the following form:

$$ G_k(f_\theta(x), \alpha; R) = \frac{e^{-\alpha\, \delta^E(f_\theta(x), r_k)}}{\sum_{k'=1}^{K} e^{-\alpha\, \delta^E(f_\theta(x), r_{k'})}} \qquad (4.4) $$
with α ∈ [0, +∞). The function G_k defined by Eq. 4.4 is differentiable wrt θ, R and α (condition (i)), as it is a composition of functions differentiable wrt these parameters. Furthermore, one has:

Property 4.2.2. (condition (ii))

Assuming that c_{δ^E}(f_θ(x); R) is unique for all x ∈ X,
$$ \lim_{\alpha \to +\infty} G_k(f_\theta(x), \alpha; R) = \begin{cases} 1 & \text{if } r_k = c_{\delta^E}(f_\theta(x); R) \\ 0 & \text{otherwise} \end{cases} $$

Proof: Let r_j = c_{δ^E}(f_θ(x); R) and let us assume that δ^E is a dissimilarity function. One has:

$$ \sum_{k'=1}^{K} e^{-\alpha\, \delta^E(f_\theta(x), r_{k'})} = e^{-\alpha\, \delta^E(f_\theta(x), r_j)} \times \left(1 + \sum_{k' \neq j} e^{-\alpha\left(\delta^E(f_\theta(x), r_{k'}) - \delta^E(f_\theta(x), r_j)\right)}\right) $$

As δ^E(f_θ(x), r_j) < δ^E(f_θ(x), r_{k'}), ∀k' ≠ j, one has:

$$ \lim_{\alpha \to +\infty} e^{-\alpha\left(\delta^E(f_\theta(x), r_{k'}) - \delta^E(f_\theta(x), r_j)\right)} = 0 $$

Thus lim_{α→+∞} G_k(f_θ(x), α; R) = 0 if k ≠ j and 1 if k = j. The case where δ^E is a similarity function can be treated in the same way. The assumption that c_{δ^E}(f_θ(x); R) is unique for all objects is necessary for G_k to take on binary values in the limit; it does not need to hold for small values of α. In fact, when α is set to 0, the optimization problem given in (4.2) has a unique solution, obtained by setting all representatives to the (single) vector that optimizes Σ_{x∈X} δ^E(f_θ(x), r) (this vector corresponds to the centroid of all embedded representations when δ^E is the Euclidean distance). In the unlikely event that the above assumption does not hold for some x and large values of α, one can slightly perturbate the representatives equidistant to x prior to updating them. We have never encountered this situation in practice.

Finally, under the uniqueness assumption above, Eq. 4.4 defines a valid (according to conditions (i) and (ii)) function G_k that can be used to solve the optimization problem given in (4.2). We adopt this function in the remainder of this study. Prior to studying the effect of α in Eq. 4.4, we want to mention another possible choice for G_k. As G_k(f_θ(x), α; R) plays the role of a closeness function for object x wrt representative r_k, membership functions used in fuzzy clustering are potential candidates for G_k. In particular, the membership function of the fuzzy C-Means algorithm [10] is a valid candidate according to conditions (i) and (ii). It takes the form, for dissimilarity functions:

$$ G_k(f_\theta(x), \alpha; R) = \left( \sum_{k'=1}^{K} \left( \frac{\delta^E(f_\theta(x), r_k)}{\delta^E(f_\theta(x), r_{k'})} \right)^{\frac{2}{\alpha - 1}} \right)^{-1} $$
with α defined on [1; +∞] and α_0 (condition (ii)) equal to 1. However, in addition to being slightly more complex than the parametrized softmax, this formulation presents the disadvantage that it may be undefined when a representative coincides with an object; another assumption (in addition to the uniqueness assumption) is required here to avoid such a case.
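To illustrate the parametrized softmax of Eq. 4.4 and the objective F of Eq. 4.2, the following TensorFlow sketch computes the DKM loss for one minibatch with Euclidean choices for δ^I and δ^E; this is a minimal illustration under our own naming assumptions (encoder, decoder, reps, lam), not the released implementation:

import tensorflow as tf

def dkm_loss(x, encoder, decoder, reps, alpha, lam):
    # DKM objective of Eq. 4.2 for one minibatch, with Euclidean delta^I and delta^E
    h = encoder(x)                                             # f_theta(x), shape (batch, p)
    x_rec = decoder(h)                                         # g_eta o f_theta(x)
    rec_loss = tf.reduce_sum(tf.square(x - x_rec), axis=1)     # delta^I: squared Euclidean distance
    # Squared Euclidean distances between embeddings and the K representatives, shape (batch, K)
    dist = tf.reduce_sum(tf.square(h[:, None, :] - reps[None, :, :]), axis=2)
    # Eq. 4.4: parametrized softmax G_k; alpha -> +inf recovers hard k-Means assignments
    g = tf.nn.softmax(-alpha * dist, axis=1)
    clus_loss = tf.reduce_sum(dist * g, axis=1)                # sum_k delta^E(f_theta(x), r_k) G_k
    return tf.reduce_mean(rec_loss + lam * clus_loss)

Since the softmax is differentiable, this loss can be backpropagated through both the autoencoder parameters and the representatives reps (a tf.Variable of shape (K, p)), which is what enables the truly joint update of Eq. 4.3.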

4.2.2 Choice of α

The parameter α can be defined in different ways. Indeed, α plays the role of an inverse temperature: when α is 0, each data point in the embedding space is equally close, through G_k, to all the representatives (corresponding to a completely soft assignment), whereas when α is +∞, the assignment is hard. In the first case, for the deep k-Means optimization problem, all representatives are equal and set to the point r ∈ ℝ^p that minimizes Σ_{x∈X} δ^E(f_θ(x), r). In the second case, the solution corresponds to exactly performing k-Means in the embedding space, the latter being learned jointly with the clustering process. Following a deterministic annealing approach [114], one can start with a low value of α (close to 0) and gradually increase it until a sufficiently large value is reached. At first, representatives are randomly initialized. As the problem is smooth when α is close to 0, different initializations are likely to lead to the same local minimum in the first iteration; this local minimum is used as the new value of the representatives for the second iteration, and so on. The continuity of G_k wrt α implies that, provided the increment in α is not too large, one evolves smoothly from the initial local minimum to the last one. In this deterministic annealing scheme, α thus also provides a way to initialize the cluster representatives.

Algorithm 4: Deep k-Means algorithm
Input: data X; number of clusters K; balancing parameter λ; scheme for α; number of epochs T; number of minibatches N; learning rate η
Output: autoencoder parameters θ (encoder) and η (decoder); cluster representatives R
1  Initialize θ and r_k, 1 ≤ k ≤ K (randomly or through pretraining)
2  for α = m_α to M_α do              ▷ inverse temperature levels
3      for t = 1 to T do              ▷ epochs per α
4          for n = 1 to N do          ▷ minibatches
5              Draw a minibatch X̃ ⊂ X
6              Update (θ, R) using SGD (Eq. 4.3)
7          end
8      end
9  end

The initialization of the cluster representatives can have an important impact on the results, and prior studies (e.g., [29–31, 85]) have relied on pretraining for this matter. In such a case, one can choose a high value for α to directly obtain the behavior of the k-Means algorithm in the embedding space. We evaluate both approaches in our experiments (Section 4.3). Algorithm 4 summarizes the deep k-Means algorithm for the deterministic annealing scheme, where m_α (respectively M_α) denotes the minimum (respectively maximum) value of α, and T is the number of epochs for each value of α in the stochastic gradient updates. Even though M_α is finite, it can be set sufficiently large to obtain, in practice, a hard assignment to representatives. Lastly, when using pretraining, one sets m_α = M_α (i.e., a constant α is used).
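A minimal sketch of the training loop of Algorithm 4 is given below, reusing the hypothetical dkm_loss function sketched above; the use of Adam, the batching of the data and all variable names are assumptions on our part:

import tensorflow as tf

def train_dkm(dataset, encoder, decoder, reps, alphas, lam, epochs_per_alpha, lr=1e-3):
    # Deterministic-annealing training loop of Algorithm 4 (reps is a tf.Variable of shape (K, p))
    opt = tf.keras.optimizers.Adam(learning_rate=lr)
    variables = encoder.trainable_variables + decoder.trainable_variables + [reps]
    for alpha in alphas:                        # inverse temperature levels, from m_alpha to M_alpha
        for _ in range(epochs_per_alpha):       # T epochs per value of alpha
            for batch in dataset:               # minibatches of the (shuffled) data
                with tf.GradientTape() as tape:
                    loss = dkm_loss(batch, encoder, decoder, reps, alpha, lam)
                grads = tape.gradient(loss, variables)
                opt.apply_gradients(zip(grads, variables))   # joint update of (theta, eta, R), Eq. 4.3
    return encoder, decoder, reps

With pretraining, the list alphas reduces to a single large constant value (m_α = M_α), which recovers the DKMp variant described in the experiments.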

4.3 Experiments

In order to evaluate our proposed DKM framework and other clustering and DC approaches, we conducted extensive experiments on different datasets. In the following, we detail these experiments, including the datasets used for evaluation, the experimental setup, etc.

4.3.1 Datasets

Different kinds of data, including text and images, have been used to evaluate the different clustering and DC approaches. These datasets are detailed in the following.

Image datasets

The datasets used in the experiments are standard clustering benchmark collections. We considered both image and text datasets to demonstrate the general applicability of our proposed approach. The image datasets consist of MNIST (70,000 images, 28 × 28 pixels, 10 classes) and USPS (9,298 images, 16 × 16 pixels, 10 classes), which both contain hand-written digit images. We reshaped the images to one-dimensional vectors and normalized the pixel intensity levels (between 0 and 1 for MNIST, and between -1 and 1 for USPS). More detailed information about these image datasets is given below.

MNIST dataset

MNIST is one of the most widely used datasets in the field of image clustering; it contains handwritten digits and has also been used for training and testing many supervised learning algorithms. MNIST is derived from another dataset called NIST. The training and test sets of NIST were obtained by sampling from two different sources: 1) the American Census Bureau and 2) American high school students. Although unsupervised approaches typically use the whole MNIST data (70,000 samples), MNIST originally consists of 60,000 training and 10,000 test samples. Half of the training set and half of the test set were obtained from NIST's training samples, while the other halves were obtained from NIST's test samples. Figure 4.2 shows examples from the MNIST dataset.

Figure 4.2: Examples from the MNIST dataset.

USPS dataset

Similar to MNIST, this dataset contains handwritten digits (0–9). The samples were collected from the U.S. Post Office: the envelopes were scanned and the digits were obtained as binary images of different sizes and orientations. The AT&T Research Lab (in collaboration with Yann LeCun) made this dataset available for research purposes. This dataset is imbalanced and its test set is considered difficult to discriminate (an error rate of 2.5% is regarded as notoriously good). USPS includes 7,291 training and 2,007 test samples. Table 4.1 provides more detailed information about this dataset.

Table 4.1: Detailed information about each class of USPS dataset.

Set             0     1     2    3    4    5    6    7    8    9    Total
Training        1194  1005  731  658  652  556  664  645  542  644  7291
Test            359   264   198  166  200  160  170  147  166  177  2007
Training+Test   1553  1269  929  824  852  716  834  792  708  821  9298

Text datasets

The text collections considered for the experiments are: the 20 Newsgroups dataset (hereafter, 20NEWS), the RCV1-v2 dataset (hereafter, RCV1), and the REUTERS, YAHOO, DBPEDIA and AGNEWS datasets. We used two different word representation methods in our experiments: 1) TF-IDF [115] and 2) Word2Vec [100]. When using TF-IDF, for 20NEWS we used the whole dataset comprising 18,846 documents labeled into 20 different classes. Similarly to [31, 85], we sampled from the full RCV1-v2 collection a random subset of 10,000 documents, each of which pertains to only one of the four largest classes. Because of the text datasets' sparsity, and as proposed in [85], we selected the 2,000 words with the highest TF-IDF values to represent each document. When using Word2Vec, REUTERS, 20NEWS, YAHOO, DBPEDIA and AGNEWS were selected for the experiments. We trained a Word2Vec model for each collection (after preprocessing), using the skip-gram version of Word2Vec with a window size of 50 for 20NEWS and 10 for the rest of the datasets; the embedding size is 100 for all collections. Other word representation approaches, such as pretrained Word2Vec, FastText [116] (both pretrained and non-pretrained) and Doc2Vec [117], were tested as well; among all of them, training Word2Vec on each collection yielded the best results. Please note that, for all word representation techniques, the hyperparameters which are not mentioned here were set to the default values provided by the Gensim¹ library in Python.

¹ https://radimrehurek.com/gensim/models/word2vec.html
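The Word2Vec configuration described above can be reproduced along the following lines with Gensim (a sketch only: the tokenized corpus sentences is a placeholder, and the keyword names follow Gensim 4.x, where vector_size replaced the older size argument):

from gensim.models import Word2Vec

# `sentences` is a placeholder for the tokenized documents of one collection
sentences = [["player", "coach", "stadium"], ["politician", "congress", "president"]]

model = Word2Vec(
    sentences,
    vector_size=100,   # embedding size used for all collections
    window=10,         # 50 for 20NEWS, 10 for the other datasets
    sg=1,              # skip-gram version of Word2Vec
    min_count=1,       # kept at 1 only so this toy corpus is not pruned
)

vector = model.wv["player"]   # 100-dimensional vector for the word 'player'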

20NEWS dataset

As the name implies, this dataset contains news documents pertaining to different topics. 20NEWS consists of almost 20,000 documents categorized almost evenly into 20 different topics. Some topics are difficult to discriminate from each other since their contents can be quite similar: for instance, comp.sys.ibm.pc.hardware and comp.sys.mac.hardware both cover computer hardware documents. Other topics are easier to separate: for instance, misc.forsale and soc.religion.christian represent completely different topics. Table 4.2 lists the different topics in the 20NEWS dataset; similar topics are grouped in the same cell.

Table 4.2: Different topics in 20NEWS data. Similar topics are grouped in a shared cell.

comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x
misc.forsale
rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey
talk.politics.misc, talk.politics.guns, talk.politics.mideast
sci.crypt, sci.electronics, sci.med, sci.space
talk.religion.misc, alt.atheism, soc.religion.christian

REUTERS dataset

REUTERS is one of the most important and widely used text datasets in the domain of text clustering. This dataset was collected by Carnegie Group, Inc. and Reuters, Ltd. Similar to other approaches [118–121], we only use the 10 largest (and highly imbalanced) categories. There are different versions of REUTERS, but the version that we used (REUTERS-21578) is the most recent one. Table 4.3 gives the number of documents in each category; the figures reported there indicate a highly imbalanced dataset.

Table 4.3: Detailed information about the REUTERS dataset.

Category        acq   coffee  crude  earn  gold  interest  money-fx  ship  sugar  trade  Total
#documents      2292  112     374    3923  90    272       309       144   122    326    7964
Percentage (%)  28.7  1.4     4.6    49.2  1.1   3.4       3.8       1.8   1.5    4.0    ~100

YAHOO dataset

The YAHOO dataset was introduced in [122]. The authors obtained the Yahoo! Comprehensive Questions and Answers corpus (version 1.0) through the Yahoo! Webscope program; it contains 4,483,032 questions and corresponding answers. The authors constructed a topic classification dataset from this corpus using the 10 largest categories. We only use the test set and all 10 categories, which results in 60,000 documents uniformly distributed over the 10 categories (6,000 documents per category).

DBPEDIA dataset

The DBPEDIA dataset was also introduced in [122]. DBpedia is a crowd-sourced community effort to extract information from Wikipedia. The DBPEDIA dataset is formed by selecting 14 non-overlapping classes from DBpedia. In our experiments, we only use the test set, made of 70,000 documents uniformly distributed over the 14 classes (5,000 documents per category).

AGNEWS dataset

AGNEWS is a collection of news articles available on the web. We used the 4 largest classes of the training set of this dataset, resulting in 120,000 documents evenly distributed over 4 categories (30,000 documents per category).

Table 4.4: Statistics of the datasets.

Dataset             #Samples  #Classes  Dimensions
MNIST               70,000    10        28 × 28
USPS                9,298     10        16 × 16
20NEWS (TF-IDF)     18,846    20        2,000
RCV1                10,000    4         2,000
REUTERS             10,000    10        100
20NEWS (Word2Vec)   18,846    20        100
YAHOO               60,000    10        100
DBPEDIA             70,000    14        100
AGNEWS              120,000   4         100

4.3.2 Baselines and Deep k-Means variants

Clustering models may use different strategies and different clustering losses, leading to different properties. As our goal in this work is to study the k-Means clustering algorithm in embedding spaces, we focus on the family of k-Means-related models and compare our approach against state-of-the-art models from this family, using both standard and deep clustering models. For the standard clustering methods, we used: the k-Means clustering approach [1] with initial cluster center selection [6], denoted KM; and an approach denoted AE-KM in which dimensionality reduction is first performed using an auto-encoder, followed by k-Means applied to the learned representations. For the joint deep clustering models, the only previous, "true" deep clustering k-Means-related method is the Deep Clustering Network (DCN) approach described in [29]. In addition, we consider the Improved Deep Embedded Clustering (IDEC) model [31] as an additional baseline. IDEC is, to the best of our knowledge, the state-of-the-art approach in the centroid-based – and therefore k-Means-related – deep clustering family. As mentioned in Chapter 3, IDEC is an improved version of the DEC model [85]. For both of these approaches (DCN and IDEC), we studied two variants: with pretraining (DCNp and IDECp) and without pretraining (DCNnp and IDECnp). The pretraining we performed here simply consists in initializing the weights by training the auto-encoder on the data to minimize the reconstruction loss in an end-to-end fashion – we did not use layer-wise pretraining [123]. The proposed Deep k-Means (DKM) is, like DCN, a "true" k-Means approach in the embedding space: it jointly learns Autoencoder-based representations

and relaxes the k-Means problem by introducing a parameterized softmax as a differentiable surrogate to the k-Means argmin. In the experiments, we considered two variants of this approach. DKMa implements an annealing strategy for the inverse temperature α and does not rely on pretraining. The scheme we used for the evolution of the inverse temperature α in DKMa is given by the recursive sequence α_{n+1} = 2^{1/log(n)^2} × α_n with m_α = α_1 = 0.1. The 40 first terms of (α_n) are plotted in Figure 4.3. The rationale behind the choice of this scheme is that we want α to spend more iterations on smaller values and less on larger values while preserving a gentle slope. Alternatively, we studied the variant DKMp, which is initialized by pretraining an auto-encoder and then follows Algorithm 4 with a constant α such that m_α = M_α = 1000. Such a high α is equivalent to having hard cluster assignments while maintaining the differentiability of the optimization problem.

Figure 4.3: Annealing scheme for the inverse temperature α, following the sequence α_{n+1} = 2^{1/log(n)^2} × α_n; α_1 = 0.1.
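For illustration, the annealing sequence of Figure 4.3 can be generated as follows (a small sketch; log denotes the natural logarithm and starting the recursion at n = 2, to avoid log(1) = 0, is our own reading of the formula):

import math

def annealing_schedule(alpha_1=0.1, n_terms=40):
    # Each new term multiplies the previous one by 2**(1 / log(n)**2), starting from alpha_1 = 0.1
    alphas = [alpha_1]
    for n in range(2, n_terms + 1):
        alphas.append(2 ** (1.0 / math.log(n) ** 2) * alphas[-1])
    return alphas

alphas = annealing_schedule()   # 40 inverse-temperature levels used by DKM_a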

Implementation details. For IDEC, we used the Keras code shared by its authors². We used this version instead of https://github.com/XifengGuo/IDEC as only the former enables auto-encoder pretraining in a non-layer-wise fashion. Our own code for DKM is based on TensorFlow. To enable full control of the comparison between DCN and DKM – DCN being the closest competitor to DKM – we also re-implemented DCN in TensorFlow. The code for both DKM and DCN is available online³.

4.3.3 Experimental setup

AE description and training details. The auto-encoder we used in the experiments is the same across all datasets and is borrowed from previous deep clustering studies [31, 85]. Its encoder is a fully-connected multilayer perceptron with dimensions d-500-500-2000-K, where d is the original data space dimension and K the number of clusters (cf. Figure 4.4). The decoder is a mirrored version of the encoder. All layers, except the one preceding the embedding layer and the one preceding the output layer, are applied a ReLU activation function [124] before being fed to the next layer. For the sake of simplicity, we did not rely on any complementary training or regularization strategies such as batch normalization or dropout. The auto-encoder weights are initialized following the Xavier scheme [125]. For all deep clustering approaches, the training is based on the Adam optimizer [126] with standard learning rate η = 0.001 and momentum rates β₁ = 0.9 and β₂ = 0.999. The minibatch size is set to 256 on all datasets, following [31]. We emphasize that we chose exactly the same training configuration for all models to facilitate a fair comparison. The number of pretraining epochs is set to 50 for all models relying on pretraining. The number of fine-tuning epochs for DCNp and IDECp is fixed to 50 (or, equivalently, in terms of iterations: 50 times the number of minibatches). We set the number of training epochs for DCNnp and IDECnp to 50 and 200, respectively – in the case of DCNnp, we observed little to no improvement over successive epochs and therefore limited its number of epochs. For DKMa, we used the 40 terms of the sequence α given in Figure 4.3 as the annealing scheme and performed 5 epochs for each α term.

² https://github.com/XifengGuo/IDEC-toy
³ https://github.com/MaziarMF/deep-k-means

DKMp is fine-tuned by performing 100 epochs with a constant α = 1000. The cluster representatives are initialized randomly from a uniform distribution U(−1, 1) for the models without pretraining. In case of pretraining, the cluster representatives are initialized by applying k-Means to the pretrained embedding space. Figure 4.4 illustrates the encoder part of the auto-encoder; as mentioned before, the decoder is simply the mirror of the encoder.

Figure 4.4: The architecture of the encoder part of the Autoencoder (Input → HL1 = 500 → HL2 = 500 → HL3 = 2000 → HL4 = #clusters).
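A possible Keras sketch of this architecture is given below; it is our own illustration rather than the exact code used in the experiments, input_dim and embedding_dim are assumptions (the latter set to the number of clusters, as in Figure 4.4), and we read the activation description as: ReLU on all hidden layers, no activation on the embedding and output layers.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_autoencoder(input_dim, embedding_dim):
    # d-500-500-2000-embedding encoder with a mirrored decoder, Xavier (glorot_uniform) init
    init = "glorot_uniform"
    encoder = models.Sequential([
        tf.keras.Input(shape=(input_dim,)),
        layers.Dense(500, activation="relu", kernel_initializer=init),
        layers.Dense(500, activation="relu", kernel_initializer=init),
        layers.Dense(2000, activation="relu", kernel_initializer=init),
        layers.Dense(embedding_dim, kernel_initializer=init),    # embedding layer, no ReLU
    ])
    decoder = models.Sequential([
        tf.keras.Input(shape=(embedding_dim,)),
        layers.Dense(2000, activation="relu", kernel_initializer=init),
        layers.Dense(500, activation="relu", kernel_initializer=init),
        layers.Dense(500, activation="relu", kernel_initializer=init),
        layers.Dense(input_dim, kernel_initializer=init),         # output layer, no ReLU
    ])
    return encoder, decoder

encoder, decoder = build_autoencoder(input_dim=784, embedding_dim=10)   # e.g., MNIST with K = 10
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)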

Hyperparameter selection. The hyperparameters λ for DCN and DKM and γ for IDEC, which define the trade-off between the reconstruction and the clustering error in the loss function, were determined by performing a line search on the set {10^i | i ∈ [−5, 3]}. The hyper-parameters can be selected in two different ways: 1) tune the hyper-parameters for each collection and select the value yielding the best accuracy (one could alternatively use NMI or ARI for this selection), or 2) use a general hyper-parameter for each approach (DKM, DCN and IDEC) which is the same for all datasets (no tuning). We detail these two selection procedures below.

Optimal λ and γ. To do so, we randomly split each dataset into a validation set (10% of the data) and a test set (90% of the data). Each model is trained on the whole data, and only the validation set labels are leveraged in the line search to identify the optimal λ or γ (optimality is measured with respect to the clustering accuracy metric). The performance of the different approaches with the optimal λ and γ is reported in Tables 4.6 and 4.7.

General λ and γ. One can also set a fixed value for λ and γ across all datasets. To do so, for each approach, the results obtained with each candidate hyperparameter value are computed over all collections; for each candidate value, the clustering accuracy is averaged over all datasets, and the value which obtains the highest average accuracy is selected as the general hyperparameter. The results corresponding to the general hyperparameters are reported in Table 4.8.
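This selection can be summarized by the following sketch, where run_model is a hypothetical placeholder returning the clustering accuracy of a given approach on one dataset for a given trade-off value:

import numpy as np

candidate_values = [10.0 ** i for i in range(-5, 4)]    # line-search grid {10^i | i in [-5, 3]}

def select_general_hyperparameter(run_model, datasets, candidates=candidate_values):
    # Pick the value whose clustering accuracy, averaged over all datasets, is the highest
    average_accuracy = {
        value: np.mean([run_model(dataset, value) for dataset in datasets])
        for value in candidates
    }
    return max(average_accuracy, key=average_accuracy.get)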

Experimental protocol We observed in pilot experiments that the clustering performance of the different models is subject to non-negligible variance from one run to another. This variance is due to the randomness in the initialization and in the minibatch sampling for the stochastic optimizer. When pretraining is used, the variance of the general pretraining phase and that of the model-specific fine-tuning phase add up, which makes it difficult to draw any confident conclusion about the clustering ability of a model. To alleviate this issue, we compared the different approaches using seeded runs whenever this was possible. This has the advantage of removing the variance of pretraining as seeds guarantee exactly the same results at the end of pretraining (since the same pretraining is performed for the different models). Additionally, it ensures that the same sequence of minibatches will be sampled. In practice, we used seeds for the models implemented in TensorFlow (KM, AE-KM, DCN and DKM). Because of implementation differences, seeds could not give the same pretraining states in the Keras-based IDEC. All in all, we randomly selected 10 seeds and for each model performed one run per seed. Additionally, to account for the remaining variance and to report statistical significance, we performed a Student’s t-test from the 10 collected samples (i.e., runs).
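The significance testing over the 10 seeded runs can be carried out with a standard Student's t-test, for example with scipy as in the sketch below (the accuracy values are made up for illustration, and ttest_ind is only one possible choice of test):

import numpy as np
from scipy import stats

# Accuracy of two methods over the same 10 seeds (made-up values for illustration)
acc_method_a = np.array([84.0, 83.1, 85.2, 82.7, 84.8, 83.9, 84.4, 85.0, 83.5, 84.1])
acc_method_b = np.array([81.1, 80.4, 82.0, 79.8, 81.5, 80.9, 81.3, 82.1, 80.6, 81.0])

t_stat, p_value = stats.ttest_ind(acc_method_a, acc_method_b)
equivalent_to_best = p_value > 0.05   # differences with p > 0.05 are reported as ties (underlined)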

Statistics and optimal hyperparameters. We summarize in Table 4.4 the statistics of the different datasets used in the experiments; Table 4.5 reports the dataset-specific optimal values of the hyperparameter (λ for DKM-based and DCN-based methods and γ for IDEC-based ones) which trades off between the reconstruction loss and the clustering loss.

4.3.4 Clustering results

The results for the evaluation of the compared clustering methods on the different benchmark datasets are summarized in Tables 4.6, 4.7 and 4.8.

Table 4.5: Dataset-specific optimal values of the hyperparameter (λ for DKM-based and DCN-based methods and γ for IDEC-based ones) which trades off between the reconstruction loss and the clustering loss.

Dataset             λ_DKMa  λ_DKMp  λ_DCNnp  λ_DCNp  γ_IDECnp  γ_IDECp
MNIST               10^-1   1       10^-2    10      10^-3     10^-2
USPS                10^-1   1       10^-1    10^-1   10^-1     10^3
20NEWS (TF-IDF)     10^-4   10^-1   10^-4    10^-1   10^-3     10^-1
RCV1                10^-4   10^-2   10^-3    10^-1   10^-4     10^-3
REUTERS             N/A     1       N/A      1       N/A       10^-1
20NEWS (Word2Vec)   N/A     10^-5   N/A      10^-3   N/A       10^0
YAHOO               N/A     10^-5   N/A      10^-4   N/A       10^-4
DBPEDIA             N/A     10^-1   N/A      10^-5   N/A       10^-3
AGNEWS              N/A     10^-5   N/A      10^-5   N/A       10^-3

The clustering performance is evaluated with respect to three standard measures [127]: Normalized Mutual Information (NMI), Adjusted Rand Index (ARI) and the clustering accuracy (ACC). We report for each dataset/method pair the average and standard deviation of these metrics computed over 10 runs and conduct significance testing as described in Section 4.3.3. The bolded result in each column of all tables corresponds to the best result for the corresponding dataset/metric. Underlined results can be considered as equivalent to the best as their difference thereto is not statistically significant (p > 0.05).
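For reference, NMI and ARI are available directly in scikit-learn, while ACC requires the usual best one-to-one matching between predicted clusters and ground-truth classes (Hungarian algorithm). The following sketch is our own implementation of these measures, not necessarily the one used to produce the reported numbers:

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    # ACC: accuracy under the best one-to-one mapping between clusters and classes
    n = max(y_true.max(), y_pred.max()) + 1
    counts = np.zeros((n, n), dtype=int)
    for t, p in zip(y_true, y_pred):
        counts[p, t] += 1
    row_ind, col_ind = linear_sum_assignment(-counts)   # Hungarian algorithm, maximizing matches
    return counts[row_ind, col_ind].sum() / len(y_true)

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([1, 1, 0, 0, 2, 2])
acc = clustering_accuracy(y_true, y_pred)               # 1.0: perfect clustering up to relabeling
nmi = normalized_mutual_info_score(y_true, y_pred)
ari = adjusted_rand_score(y_true, y_pred)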

Results on Image Data. The results of all approaches are reported in Table 4.6. Clearly, using an auto-encoder helped k-Means achieve better results by providing new representations for the data. This improvement is observable on all datasets and for all metrics; the basic intuition that using auto-encoders results in better data representations is thus confirmed. DKMa achieves far better results than the non-pretrained versions of IDEC and DCN on all datasets and for all metrics. The considerable difference between the results of DKMa and those of the non-pretrained versions of IDEC and DCN indicates that an annealing scheme which starts from a soft clustering assignment and progressively turns it into a hard one is clearly advantageous compared to simply initializing the weights and biases of the network randomly and training from there.

IDECp achieves the best results on MNIST and DKMp obtains the best results on USPS. The results also show that pretrained approaches generally improve over non-pretrained ones, which indicates that starting from a good local optimum helps all algorithms achieve better results compared to launching them from a random initialization. The difference between the results of DKMp and DKMa also indicates that pretraining helps DKM achieve better results than the annealing scheme does. Another important observation is that DKMp performed better than DCNp on all image datasets.

Text data with TF-IDF representation. The results of the different clustering approaches on 20NEWS and RCV1 are presented in Table 4.7. On 20NEWS, we again notice that the results of AE-KM are higher than those of KM. DKMa outperforms all non-pretrained approaches, but its results are still lower than those of AE-KM. By comparing the pretrained versions with the rest, we can notice that:
– the results of IDECp on 20NEWS are worse than those of AE-KM;
– the results of DCNp are equal to those of AE-KM;
– DKMp outperforms the other approaches.
For the RCV1 dataset, we again observe that AE-KM achieves better results than KM. Among both the non-pretrained and pretrained versions, IDEC performs better than the rest.

Text data with Word2Vec representation. As mentioned earlier, since the results of the pretrained versions surpass those of the non-pretrained ones, we only use pretrained versions in the rest of the thesis. When using general hyper-parameters (without tuning), both versions of DKMp, (EC) and (CC), achieve better results than the other approaches in terms of the Macro-average (the average results of one approach over all datasets) of ACC and ARI, while for NMI, AE-KM and DKMp achieve better results. On 20NEWS, both variants of DKMp obtain the best results for ACC, ARI and NMI. On REUTERS, the DCNp variants seem to obtain promising results on ACC and ARI while, surprisingly, KM obtains a better NMI. On YAHOO, the DCNp (CC) results on ACC and NMI are superior to the others, while DKMp (EC) is superior in terms of ARI. On DBPEDIA, DCNp (EC) achieves state-of-the-art results in terms of all metrics. DCNp (CC) shows promising results on AGNEWS.

The results of the different approaches with tuned hyper-parameters are presented in Table 4.9. Similar to the general hyper-parameter results, the Macro-average results indicate that, for ACC and ARI, DKMp (CC) achieves state-of-the-art results, while for NMI, DKMp (EC) and DCNp (EC) share the same best results. On 20NEWS, DKMp (CC) obtains remarkable results for all metrics. On REUTERS, the DKMp (EC) results are superior in terms of ACC and ARI while, as with the general hyper-parameters, AE-KM obtains better results on NMI. On YAHOO, the best results for ACC, ARI and NMI belong to DCNp (CC), DKMp (EC) and DCNp (CC), respectively. On DBPEDIA, DCNp (EC) achieves state-of-the-art results for all metrics, while DCNp (CC) obtains the best results on AGNEWS.

Table 4.6: Clustering results of the compared methods on image data. Performance is measured in terms of Accuracy, NMI and ARI (%); higher is better. Each cell contains the average and standard deviation computed over 10 runs. The best result for each dataset/metric is bolded. Underlined values correspond to results with no significant difference (p > 0.05) to the best.

                 MNIST                              USPS                               Macro-Average
Model            ACC        ARI        NMI          ACC        ARI        NMI          ACC    ARI    NMI
KM               53.5±0.3   36.6±0.1   49.8±0.5     67.3±0.1   53.5±0.1   61.4±0.1     60.4   45.0   56.9
AE-KM            80.8±1.8   69.4±1.8   75.2±1.1     72.9±0.8   63.2±1.5   71.7±1.2     76.8   66.3   73.4
DCNnp (EE)       16.8±0.5   1.4±0.1    2.0±0.2      16.3±0.9   1.0±0.2    1.7±0.3      16.5   1.2    1.8
IDECnp (EE)      61.8±3.0   49.1±3.0   62.4±1.6     53.9±5.1   40.2±5.1   50.0±3.8     57.8   44.6   56.2
DKMa (EE)        82.3±3.2   73.6±3.1   78.0±1.9     75.5±6.8   66.3±4.9   73.0±2.3     78.9   69.9   75.5
DCNp (EE)        81.1±1.9   70.2±1.8   75.7±1.1     73.0±0.8   63.4±1.5   71.9±1.2     77.0   66.8   73.8
IDECp (EE)       85.7±2.4   81.5±2.4   86.4±1.0     75.2±0.5   68.1±0.5   74.9±0.6     80.4   74.8   80.6
DKMp (EE)        84.0±2.2   75.0±1.8   79.6±0.9     75.7±1.3   68.5±1.8   77.6±1.1     79.8   70.4   78.6

Table 4.7: Clustering results of the compared methods on text data (TF-IDF). Performance is measured in terms of Accuracy, NMI and ARI (%); higher is better. Each cell contains the average and standard deviation computed over 10 runs. The best result for each dataset/metric is bolded. Underlined values correspond to results with no significant difference (p > 0.05) to the best.

                 20NEWS                             RCV1                               Macro-Average
Model            ACC        ARI        NMI          ACC        ARI        NMI          ACC    ARI    NMI
KM               23.2±1.5   7.6±0.9    21.6±1.8     50.8±2.9   20.6±2.8   31.3±5.4     37.0   17.1   26.4
AE-KM            49.0±2.9   31.0±1.6   44.5±1.5     56.7±3.6   23.9±4.3   31.5±4.3     53.3   27.4   38.0
DCNnp (EE)       7.9±0.2    0.3±0      1.1±0.1      29.2±1.6   0.5±0.2    0.6±0.2      18.5   0.4    0.8
IDECnp (EE)      22.3±1.5   9.8±1.5    22.3±1.5     56.7±5.3   28.5±5.3   31.4±2.8     39.5   19.1   26.8
DKMa (EE)        44.8±2.4   26.7±1.5   42.8±1.1     53.8±5.5   20.7±4.4   28.0±5.8     49.3   23.7   35.4
DCNp (EE)        49.2±2.9   31.3±1.6   44.7±1.5     56.7±3.6   24.0±4.3   31.6±4.3     52.9   27.6   38.1
IDECp (EE)       40.5±1.3   26.0±1.3   38.2±1.0     59.5±5.7   32.9±5.7   34.7±5.0     50.0   29.4   36.4
DKMp (EE)        51.2±2.8   33.9±1.5   46.7±1.2     58.3±3.8   26.5±4.9   33.1±4.9     54.7   30.2   39.9

Table 4.8: The results of multiple clustering approaches obtained by general hyperparam- eters. Bold values represent the best results and underlined values have no statistically significant difference from the best results at α = 0.05.

Dataset          Metric  KM         AE         DKMp (EC)  DKMp (CC)  DCNp (EC)  DCNp (CC)  IDECp
20NEWS           ACC     68.3±1.8   75.2±1.4   77.4±1.4   77.0±1.7   76.0±1.7   67.8±2.4   73.3±3.0
                 ARI     51.9±1.3   60.6±1.4   64.6±1.1   61.8±1.8   61.4±1.4   52.0±2.1   56.9±3.9
                 NMI     68.8±0.5   70.8±0.7   71.9±0.7   72.0±0.5   70.9±0.7   66.2±1.3   69.7±1.6
REUTERS          ACC     39.8±1.0   39.7±1.2   42.2±1.1   41.9±4.5   40.5±0.9   42.6±5.3   42.4±1.1
                 ARI     30.9±1.1   32.3±1.3   32.0±1.2   32.3±4.2   32.4±1.0   33.1±4.3   32.3±0.8
                 NMI     54.3±1.4   56.1±1.3   54.7±1.1   54.3±1.8   55.8±1.3   55.2±1.8   55.4±0.7
YAHOO            ACC     52.5±2.1   53.8±2.6   56.3±2.7   54.4±3.1   54.3±2.7   58.7±2.7   53.2±2.0
                 ARI     27.7±0.8   29.9±1.4   32.8±1.2   31.4±1.6   30.6±1.6   32.7±1.4   27.3±3.4
                 NMI     37.4±0.5   37.5±0.7   38.2±0.8   37.5±1.0   37.6±0.7   39.9±0.9   36.3±0.6
DBPEDIA          ACC     61.1±1.8   67.2±3.0   67.0±2.6   68.5±3.2   69.0±2.2   68.0±3.4   68.0±2.2
                 ARI     51.9±2.1   57.6±4.0   58.0±4.0   57.8±3.4   58.9±3.6   54.5±3.4   56.6±3.2
                 NMI     70.6±1.0   72.5±2.0   71.1±1.9   71.8±1.6   73.3±1.4   69.6±1.5   72.1±1.3
AGNEWS           ACC     84.1±0.0   84.4±0.7   84.9±0.3   83.1±1.7   84.2±0.5   85.4±0.5   84.0±1.0
                 ARI     64.7±0.0   63.8±1.5   64.7±0.6   61.2±3.7   63.3±1.0   65.7±1.1   62.3±2.3
                 NMI     64.0±0.0   59.6±1.1   60.3±0.5   57.7±3.4   59.1±0.8   61.3±0.9   58.7±1.6
Macro-average    ACC     61.1       64.0       65.5       65.0       64.8       64.5       64.1
                 ARI     45.4       48.8       50.4       48.9       49.3       47.6       47.0
                 NMI     59.0       59.3       59.2       58.6       59.3       58.4       58.4

Table 4.9: The results of multiple clustering approaches obtained by best hyperparameters. Bold values represent the best results and underlined values have no statistically significant difference from the best results at α = 0.05.

Dataset          Metric  KM         AE         DKMp (EC)  DKMp (CC)  DCNp (EC)  DCNp (CC)  IDECp
20NEWS           ACC     68.3±1.8   75.2±1.4   77.4±1.4   79.5±0.6   76.0±1.7   67.8±2.4   73.3±3.0
                 ARI     51.9±1.3   60.6±1.4   64.6±1.1   64.8±0.7   61.4±1.4   52.0±2.1   56.9±3.9
                 NMI     68.8±0.5   70.8±0.7   71.9±0.7   71.9±0.5   70.9±0.7   66.2±1.3   69.7±1.6
REUTERS          ACC     39.8±1.0   39.7±1.2   46.9±3.6   64.3±6.0   40.5±0.9   43.4±6.3   42.4±1.1
                 ARI     30.9±1.1   32.3±1.3   37.0±3.4   50.3±10.4  32.4±1.0   33.5±5.2   32.3±0.8
                 NMI     54.3±1.4   56.1±1.3   55.1±1.9   54.4±6.8   55.8±1.3   55.2±1.8   55.4±0.7
YAHOO            ACC     52.5±2.1   53.8±2.6   56.3±2.7   55.1±2.8   54.7±3.0   58.7±2.7   53.4±2.4
                 ARI     27.7±0.8   29.9±1.4   32.8±1.2   31.9±1.6   30.6±1.6   32.7±1.4   27.4±1.3
                 NMI     37.4±0.5   37.5±0.7   38.2±0.8   37.7±1.0   37.8±0.7   39.9±0.9   36.4±0.6
DBPEDIA          ACC     61.1±1.8   67.2±3.0   67.0±2.6   68.5±3.2   69.0±2.2   68.0±3.4   68.0±2.2
                 ARI     51.9±2.1   57.6±4.0   58.0±4.0   57.8±3.4   58.9±3.6   54.5±3.4   56.6±3.2
                 NMI     70.6±1.0   72.5±2.0   71.1±1.9   71.8±1.6   73.3±1.4   69.6±1.5   72.1±1.3
AGNEWS           ACC     84.1±0.0   84.4±0.7   84.9±0.3   83.1±1.7   84.6±0.4   85.4±0.5   84.0±1.0
                 ARI     64.7±0.0   63.8±1.5   64.7±0.6   61.2±3.7   64.0±0.8   65.7±1.1   62.3±2.3
                 NMI     64.0±0.0   59.6±1.1   60.3±0.5   57.7±3.4   59.1±0.6   61.3±0.9   58.7±1.6
Macro-average    ACC     61.1       64.0       66.5       70.1       64.9       64.6       64.1
                 ARI     45.4       48.8       51.4       53.2       49.4       47.6       47.0
                 NMI     59.0       59.3       59.3       58.7       59.3       58.4       58.4

Figure 4.5: Accuracy evolution of different deep clustering methods with respect to hyper-parameter values (10^-N) on Reuters.

Figure 4.6: Accuracy evolution of different deep clustering methods with respect to hyper-parameter values (10^-N) on 20news.

Figure 4.7: Accuracy evolution of different deep clustering methods with respect to hyper-parameter values (10^-N) on Yahoo.

Figure 4.8: Accuracy evolution of different deep clustering methods with respect to hyper-parameter values (10^-N) on DBPedia.

Figure 4.9: Accuracy evolution of different deep clustering methods with respect to hyper-parameter values (10^-N) on AGnews.

Table 4.10: Clustering results for k-Means applied to different learned embedding spaces to measure the k-Means-friendliness of each method. Performance is measured in terms of NMI and clustering accuracy (%); higher is better. Each cell contains the average and standard deviation computed over 10 runs. The best result for each dataset/metric is bolded. Underlined values correspond to results with no significant difference (p > 0.05) to the best.

                 MNIST                 USPS                  20NEWS                RCV1
Model            ACC        NMI        ACC        NMI        ACC        NMI        ACC        NMI
AE-KM            80.8±1.8   75.2±1.1   72.9±0.8   71.7±1.2   49.0±2.9   44.5±1.5   56.7±3.6   31.5±4.3
DCNp + KM        84.9±3.1   79.4±1.5   73.9±0.7   74.1±1.1   50.5±3.1   46.5±1.6   57.3±3.6   32.3±4.4
DKMa + KM        84.8±1.3   78.7±0.8   76.9±4.9   74.3±1.5   49.0±2.5   44.0±1.0   53.4±5.9   27.4±5.3
DKMp + KM        85.1±3.0   79.9±1.5   75.7±1.3   77.6±1.1   52.1±2.7   47.1±1.3   58.3±3.8   33.0±4.9

4.3.5 k-Means-friendliness of the learned representations

Figure 4.10: t-SNE visualization of the embedding spaces learned on USPS: a) AE-KM, b) DCN, c) DKMa, d) DKMp.

In the previous experiment, we investigated the clustering ability of the different models. While the quality of the clustering results and that of the representations learned by the models are likely to be correlated, it is relevant to study to what extent the learned representations are distorted to facilitate clustering – in other words, how they are biased towards "clustering-friendliness". More specifically, we focus on the representations learned by the deep clustering approaches related to k-Means: DCNp, DKMa, and DKMp (DCNnp was discarded due to its poor clustering performance). We analyze how effective applying k-Means to these representations is in comparison to applying k-Means to the AE-learned representations (i.e., AE-KM). We can observe in Table 4.10 that, on most datasets, the representations learned by the k-Means-related deep clustering approaches lead to significant improvements with respect to the AE-learned representations. This confirms that all these deep clustering methods truly bias their representations. Interestingly, we note that applying k-Means to the representations learned by DCNp yields substantial improvements in comparison to the end-to-end DCNp results reported in Table 4.6. Overall, although the difference is not statistically significant on all datasets/metrics, the representations learned by DKMp are shown to be the most appropriate to k-Means. This is in line with the insights gathered from Section 4.3.4. To further support this latter finding and to bring a more interpretable view of the learned representations, we illustrate the embedded samples provided by AE, DCNp, DKMa and DKMp on USPS in Figure 4.10 (best viewed in color). We used for that matter the t-SNE visualization method [128], which projects the embeddings into a 2D space. We observe that the representations of points from different clusters are clearly better separated and disentangled with DKMp than with the other models. This once again supports the superior ability of DKMp to learn k-Means-friendly representations.
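The visualization of Figure 4.10 can be reproduced along the lines of the following sketch, using scikit-learn's t-SNE with default settings; the embeddings and labels variables are assumed to come from a trained encoder and the ground-truth classes:

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embeddings(embeddings, labels, title):
    # Project the learned representations to 2D with t-SNE and color points by ground-truth class
    points_2d = TSNE(n_components=2).fit_transform(embeddings)
    plt.scatter(points_2d[:, 0], points_2d[:, 1], c=labels, s=3, cmap="tab10")
    plt.title(title)
    plt.show()

# Hypothetical usage: embeddings = encoder(x_usps).numpy(); plot_embeddings(embeddings, y_usps, "DKM_p on USPS")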

4.4 Conclusion

We have presented in this chapter a new approach for jointly clustering with k-Means and learning representations, by considering the k-Means clustering loss as the limit of a differentiable function. To the best of our knowledge, this is the first approach that truly jointly optimizes, through simple stochastic gradient descent updates, representation and k-Means clustering losses. In addition to pretraining, which can be used in all methods, this approach can also rely on a deterministic annealing scheme for parameter initialization. Based on the results, the deterministic annealing scheme can be useful and in some cases even outperforms DCNp. We further conducted careful comparisons with previous approaches by ensuring that the same architecture, initialization and sequence of minibatches are used. The experiments conducted on several image and text datasets confirm the good behavior of the proposed approach. A first glance at the results presented earlier shows that pretraining is an essential step in deep clustering that should not be neglected. On the other hand, the proposed deterministic annealing scheme can be improved, or a new scheme could be introduced to outperform the current results. On image datasets, both DKM and IDEC can be chosen as the best approaches. On text datasets using TF-IDF, except for 20NEWS where all the DC methods perform equally (including

AE-KM), on RCV1 our proposed DKM method outperforms IDEC and DCN. The results obtained with general hyper-parameters indicate that the DKM variations obtain the best Macro-average ACC and ARI, while other approaches can achieve better NMI results. In the detailed results we can notice that, on each collection, different methods can achieve the best results, but, as mentioned earlier, on average the DKM variants perform the best on ACC and ARI. The results obtained with tuned hyper-parameters indicate that DKMp (CC) outperforms other approaches on the Macro-average ACC and ARI results, while DKMp (EC) and DCNp (EC) share the same best Macro-average result for NMI. In the next chapter, we will investigate different deep clustering approaches while enforcing constraints.

Chapter 5

Constrained Deep Document Clustering

Different users may be interested in different clustering views underlying a given collection (e.g., topic and writing style in documents). Enabling them to provide constraints reflecting their needs can then help obtain tailored clustering results. For document clustering, constraints can be provided in the form of seed words, each cluster being characterized by a small set of words. This seed-guided constrained document clustering problem was recently addressed through topic modeling approaches. Another form of constraints consists of must-link (ML) and cannot-link (CL) pairs: given a pair of documents, one annotates them based on their relatedness, related pairs being marked as ML and unrelated ones as CL. In this chapter, we jointly learn deep representations and bias the clustering results through seed words or ML/CL constraints, leading respectively to the Seed-guided Deep Document Clustering [129] and Pairwise-Constrained Deep Document Clustering [130] approaches.

5.1 Introduction

Clustering traditionally consists in partitioning data into subsets of similar instances with no prior knowledge on the clusters to be obtained. However, clustering is an ill-defined problem in the sense that the data partitions output by clustering algorithms are not guaranteed to satisfy end users’ needs. Indeed, different users may be interested in different views underlying

the data [131]. Because of this subjective nature of clustering, including constraints seems essential. Indeed, by including constraints in a deep clustering framework, more suitable and tailored partitions can be obtained. In deep clustering, constraints can be included by biasing the learned representations.

Figure 5.1: Different vehicle types and colors.

As can be seen in Fig. 5.1, one can separate the vehicles based on their colors (blue or red) or based on vehicle types (car or bicycle). This is why integrating external expert knowledge into the clustering can be important. As another example, considering either the topics or the writing style in a collection of documents leads to different clustering results. In this study, we consider a setting where clustering is guided through user-defined constraints, which is known as constrained clustering [102]. Enabling users to provide clustering constraints in the context of an exploratory task can help obtain results better tailored to their needs. Typically, ML and CL constraints are considered (e.g., see [104, 132]), which state whether two data instances should be (respectively, should not be) in the same cluster. However, important manual annotation efforts may still be required to provide such constraints in sufficient number. In the specific case of document clustering, constraints can otherwise be provided in the form of seed words: each cluster that the user wishes to obtain is described by a small set of words (e.g., 3 words) which characterize the cluster. For example, a user who wants to explore a collection of news articles might provide the sets of seed words {‘sport’, ‘competition’, ‘champion’}, {‘finance’, ‘market’, ‘stock’}, {‘technology’, ‘innovation’, ‘science’} to guide the discovery of three clusters on sport, finance, and technology, respectively. Recent studies which include seed word constraints for document clustering are mostly focused on topic modeling approaches [95–98], inspired by the Latent Dirichlet Allocation model [99].

In the previous chapter we introduced our DKM framework and thoroughly analyzed its results with respect to IDEC and DCN. One advantage of deep clustering approaches lies in their ability to leverage semantic representations based on word embeddings, enabling related documents to be close in the embedding space even when they use different (but related) words.

The main contributions of this chapter can be summarized as follows: (a) We introduce the Seed-guided Deep Document Clustering (SD2C) framework, the first attempt, to the best of our knowledge, to constrain clustering with seed words based on a deep clustering approach. (b) We introduce the Pairwise-Constrained Deep Document Clustering (PCD2C) framework. (c) We validate these frameworks through experiments based on automatically selected seed words on five publicly available text datasets with various sizes and characteristics. (d) We conducted user experiments in which participants selected seed words and ML/CL pairs, leading to close-to-realistic settings where we can analyze the SD2C and PCD2C approaches in detail and measure their performance, annotation time, etc. in a real case scenario.

5.2 Seed-guided Deep Document Clustering

Deep clustering consists in jointly performing clustering and deep representation learning in an unsupervised fashion (e.g., with an Autoencoder). All deep clustering approaches aim at obtaining representations that are both faithful to the original data and more suited to clustering purposes than the original representation. To do so, Autoencoder-based deep clustering approaches trade off between a reconstruction loss, denoted L_rec, and a clustering loss, denoted L_clust, through a joint optimization problem of the form: L_rec + λ_0 L_clust, where λ_0 is a hyperparameter balancing the contribution of the reconstruction and clustering losses. In the remainder, X will denote the set of documents to cluster. Each document x ∈ X is associated with a representation x in R^d – hereafter, the input space – defined as the average of the (precomputed) embeddings of the words in x, where d is the dimension of the space. Each word w is thus represented as a d-dimensional vector w corresponding to its embedding. Let f_θ : R^d → R^p and g_η : R^p → R^d be an encoder and a decoder with parameters θ and η, respectively; g_η ∘ f_θ then defines an Autoencoder. R^p denotes the space in which we wish to embed the learned document representations – hereafter, the embedding space. Lastly, we denote by R the parameters of the clustering algorithm. With a slight abuse of notation

[Architecture diagrams: both models combine the Autoencoder reconstruction loss, the clustering loss in the embedding space and a constraint-enforcing loss. SD2C-Doc (left) masks out non-seed words in the input document and pulls f(x) towards f(m_S(x)); SD2C-Rep (right) pulls each cluster representative r_k towards the encoded seed-word average f(s_k).]

Figure 5.2: Illustration of SD2C-Doc (left) and SD2C-Rep (right). Thick double arrows indicate the computation of a distance between two vectors.

in which f_θ(X) corresponds to the application of the function f_θ to each element of the set X, the overall deep clustering (DC) optimization problem takes the form:

arg min_{θ,η,R}  L_rec(X, g_η ∘ f_θ(X)) + λ_0 L_clust(f_θ(X), R),    (5.1)

where the sum of these two losses is denoted L_dc(X, θ, η, R).

We propose to integrate constraints on seed words in this framework by biasing the embedding representations, which guarantees that the information pertaining to seed words will be used in the clustering process. This can be done by enforcing that seed words have more influence either on the learned document embeddings, a solution we refer to as SD2C-Doc, or on the cluster representatives, a solution we refer to as SD2C-Rep. Note that the second solution can only be used when the clustering process is based on K cluster representatives (i.e., R = {r_k}_{k=1}^K with K the number of clusters), which is indeed the case for most current deep clustering methods [133].

In addition to the notations introduced previously, we will denote by s_k the subset of seed words corresponding to cluster k; the collection {s_k}_{k=1}^K defines the prior knowledge on the K clusters to recover. We further define S = ⋃_{k=1}^K s_k, the set of seed words from all clusters.

5.2.1 SD2C-Doc

To bias the document representations according to the seed words, we first define, for each document, a masked version of it that is based on seed words. This can be done aggressively, by retaining, in the masked version, only the words that correspond to seed words and by computing an average of their word embeddings, or smoothly, by reweighting all words in the original document according to their proximity to seed words. A weighted average of their embeddings then defines the smooth, masked version of the documents. The equation below formalizes these two approaches:

m_S(x) = (1 / Σ_{w∈S} tf_x(w)) Σ_{w∈S} tf_x(w) · w                              (aggressive masking)
m_S(x) = (1 / (|S| · |x|)) Σ_{w'∈x} Σ_{w∈S} ((1 + cos(w, w')) / 2) · w'          (smooth masking)    (5.2)

where cos denotes the cosine similarity. If a document x does not contain any seed word, m_S(x) is ill-defined when using the first version of Eq. 5.2, as Σ_{w∈S} tf_x(w) is null in that case. To address this issue, one can simply discard the documents without seed words. In practice, the two masked versions of Eq. 5.2 yielded the same results in our experiments. Because of its simplicity, we rely on the first one in the remainder of this chapter. One can then force the embedding representation of documents to be close to the embedding of their masked version by minimizing the dissimilarity in the embedding space, denoted by δ^E, between f_θ(x) and f_θ ∘ m_S(x), leading to:

arg min_{θ,η,R}  L_dc(X, θ, η, R) + λ_1 Σ_{x∈X} δ^E(f_θ(x), f_θ ∘ m_S(x)),    (5.3)

where λ_1 is a hyperparameter controlling the relative importance of the deep clustering loss L_dc and the loss associated with seed words. Fig. 5.2 (left) illustrates the global architecture corresponding to this problem.
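As an illustration, the following sketch (an assumption about one possible implementation, not the thesis code) computes the first, aggressive masked version of Eq. 5.2 and the SD2C-Doc constraint term of Eq. 5.3 in PyTorch, taking δ^E to be the cosine dissimilarity; `word_emb`, `encoder`, `seed_words` and the tensor shapes are placeholders.

```python
# Sketch of the SD2C-Doc constraint term under simplifying assumptions:
# `word_emb` maps a word to its precomputed d-dimensional embedding (a tensor),
# `encoder` is f_theta (a torch module) and `seed_words` is the flat set S.
import torch
import torch.nn.functional as F

def masked_doc(doc_tokens, seed_words, word_emb):
    # First (aggressive) variant of Eq. 5.2: tf-weighted average of the
    # embeddings of the seed words occurring in the document. Documents that
    # contain no seed word are assumed to be discarded beforehand.
    vecs = [word_emb[w] for w in doc_tokens if w in seed_words]
    return torch.stack(vecs).mean(dim=0)

def sd2c_doc_loss(doc_inputs, masked_inputs, encoder):
    # delta^E is taken to be the cosine dissimilarity, as in the experiments.
    z = encoder(doc_inputs)            # f_theta(x)
    z_masked = encoder(masked_inputs)  # f_theta(m_S(x))
    return (1.0 - F.cosine_similarity(z, z_masked, dim=-1)).mean()
```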

5.2.2 SD2C-Rep

The other bias one can consider in the embedding space is the one related to cluster representatives. Here, one can naturally push cluster representatives towards the representation of seed words, in order to ensure that the discovered clusters will account for the prior knowledge provided by them. For that purpose, we first build a representation for each subset of seed words by averaging the word embeddings of the seed words it contains:

s_k = (1 / |s_k|) Σ_{w∈s_k} w.

s_k thus corresponds to the seed word-based representation of cluster k in R^d. The optimization problem solved by SD2C-Rep, depicted in Fig. 5.2 (right), then takes the form:

arg min_{θ,η,R}  L_dc(X, θ, η, R) + λ_1 Σ_{k=1}^{K} δ^E(r_k, f_θ(s_k))    (5.4)

As before, δ^E denotes a dissimilarity in the embedding space. The last term in Eq. 5.4 forces cluster representatives to be close to subsets of seed words, the alignment between the two being defined by the initialization of the cluster representatives performed after pretraining.
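A possible implementation of the SD2C-Rep constraint term, again only a sketch under the cosine-dissimilarity assumption, could look as follows; `seed_sets`, `word_emb`, `encoder` and `reps` are placeholder names.

```python
# Sketch of the SD2C-Rep constraint term: `seed_sets` is a list of K lists of
# seed words, `word_emb` holds their precomputed embeddings, `encoder` is
# f_theta and `reps` is the (K, p) tensor of cluster representatives.
import torch
import torch.nn.functional as F

def seed_cluster_inputs(seed_sets, word_emb):
    # s_k: average of the word embeddings of the seed words of cluster k.
    return torch.stack([torch.stack([word_emb[w] for w in sk]).mean(dim=0)
                        for sk in seed_sets])

def sd2c_rep_loss(seed_inputs, reps, encoder):
    # Push each representative r_k towards f_theta(s_k) (cosine dissimilarity).
    z = encoder(seed_inputs)  # (K, p)
    return (1.0 - F.cosine_similarity(reps, z, dim=-1)).sum()
```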

5.2.3 SD2C-Att

The last approach directly operates on the input representations of the documents that are fed to the model to account for the seed words. Indeed, allocating different weights to the words in a document according to their proximity to seed words might help provide input representations which are better tailored to the constrained clustering problem we wish to solve. We here propose to learn those weights based on an attention mechanism [134].1

[Architecture diagram: the attention module h computes, for each document, K cluster-specific views h(x, s_k) from the document's word embeddings and the seed-word representations s_k; these views feed the clustering loss, while the Autoencoder reconstruction loss is unchanged.]

Figure 5.3: Illustration of SD2C-Att. The function h denotes the seed words-based attention module applied to the documents’ input representations. Thick double arrows indicate the computation of a distance between two vectors.

In addition to the biasing of document representations (as in SD2C-Doc) or the biasing of cluster representatives (as in SD2C-Rep), this approach would consist in operating on the input representations directly.

1 An alternative parameter-free attention mechanism simply consists in fixing those weights based on the similarity between words in a document and the seed words. This is however only a special case of the learned attention mechanism we study here.

The seed words from different clusters are used to compute different ‘views’ of each document (one view per cluster). These different views are then integrated in the k-Means clustering loss. Note that, in this version, no term is added to the formulation in Problem 5.1 – only the input representation fed to the clustering module is changed. The resulting model, which we denote as SD2C-Att, is illustrated in Fig. 5.3. Its optimization problem corresponds to:

arg min_{θ,η,ζ,R}  Σ_{x∈X} δ^I(x, g_η ∘ f_θ(x)) + λ Σ_{x∈X} min_{1≤k≤K} δ^E(f_θ ∘ h_ζ(x, s_k), r_k)    (5.5)

where h_ζ(x, s_k) learns a new input representation for document x biased by the k-th seed word cluster representation s_k through an attention mechanism. We define h_ζ(x, s_k) as follows, introducing the attention scores e^x_{jk} and attention weights α^x_{jk} for j = 1, ..., |x| and k = 1, ..., K:

h_ζ(x, s_k) = Σ_{j=1}^{|x|} α^x_{jk} w^x_j,    α^x_{jk} = exp(e^x_{jk}) / Σ_{j'=1}^{|x|} exp(e^x_{j'k}),    e^x_{jk} = a_ζ(w^x_j, s_k)    (5.6)

where w^x_j is the word embedding of the j-th word of document x, |x| is the length of document x, and a_ζ is a map parameterized by ζ. Here, we use an additive attention mechanism and define a_ζ as a neural network with one hidden layer as follows:

a_ζ(w^x_j, s_k) = u^T tanh(W [w^x_j ; s_k])    (5.7)

where ζ = {u, W} are the parameters to learn (u ∈ R^h, W ∈ R^{h×2d}, with h the dimension of the hidden space) and [w^x_j ; s_k] ∈ R^{2d} denotes the vertical concatenation of w^x_j and s_k. Note that because the attention mechanism has to compute weights α^x_{jk} for every word j in x, every cluster k, and every document x, SD2C-Att is computationally more expensive than the two previous variants. Further, as the results for SD2C-Att were less promising than those of SD2C-Doc and SD2C-Rep in our preliminary study, we essentially focus on the latter approaches in the experiments.
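The additive attention of Eqs. 5.6-5.7 could be sketched as follows in PyTorch (an illustrative sketch, not the thesis code); the module processes one document at a time and `d`, `h` follow the notation above.

```python
# Sketch of the additive attention module a_zeta of Eqs. 5.6-5.7.
# `words` has shape (|x|, d) (the word embeddings of one document) and
# `s_k` has shape (d,) (the seed-word representation of cluster k).
import torch
import torch.nn as nn

class SeedAttention(nn.Module):
    def __init__(self, d, h):
        super().__init__()
        self.W = nn.Linear(2 * d, h, bias=False)  # W in R^{h x 2d}
        self.u = nn.Linear(h, 1, bias=False)      # u in R^h

    def forward(self, words, s_k):
        # e_jk = u^T tanh(W [w_j ; s_k]) for every word j of the document.
        pairs = torch.cat([words, s_k.expand_as(words)], dim=-1)   # (|x|, 2d)
        scores = self.u(torch.tanh(self.W(pairs))).squeeze(-1)     # (|x|,)
        alphas = torch.softmax(scores, dim=0)                      # attention weights
        return (alphas.unsqueeze(-1) * words).sum(dim=0)           # h_zeta(x, s_k)
```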

5.3 Pairwise-Constrained Deep Document Clustering

In the case of document clustering, topical relatedness can be used to measure the similarity between documents, similar documents being expected to fall into the same category. Due to the subjective nature of clustering, different users might impose different constraints on the clustering. For instance, to cluster a set of news articles, one might be interested in having two clusters where one reflects liberal views and the other democratic views, while another user might be interested in dividing the documents into economy and environment topics. Constraints can take different forms such as seed words (discussed earlier) and pairwise constraints. In this section, we study pairwise constraints, where a set of must-link and cannot-link document pairs is used as side information which can lead to better clustering results. As the names imply, must-link pairs contain documents from the same category while cannot-link pairs contain documents from different categories. This additional information is provided by the user before applying clustering and is taken into account while clustering the underlying data. Several studies have investigated the use of pairwise constraints as side information in clustering [105, 106, 135], but none of them has explored the impact of pairwise constraints in an end-to-end Autoencoder-based deep clustering framework. To address this problem in an end-to-end deep clustering framework, we propose the Pairwise-Constrained Deep Document Clustering (PCD2C) model [130], which benefits from a fully differentiable objective function.

PCD2C Formulation. Based on the prior deep clustering and SD2C definitions, pairwise constraints can also be considered as additional information for the deep clustering framework, which leads to a new objective function. In order to integrate pairwise constraints in a deep clustering framework, we can simply integrate them in Problem 5.1. This can be done by minimizing the distance between must-link documents and maximizing the distance between cannot-link documents (both in the learned document embedding space). In this case, pairwise information is used to bias the data representations, which yields improved clustering results. Let ML be the list containing all must-link document pairs and CL the list containing all cannot-link document pairs. To involve pairwise constraints, the following formulation is proposed:

arg min_{θ,η,R}  L_dc(X, θ, η, R) + λ_1 [ (1 / |ML(minibatch)|) Σ_{(i,j)∈ML(minibatch)} δ^E(f_θ(x_i), f_θ(x_j))
                                          + (1 / |CL(minibatch)|) Σ_{(i,j)∈CL(minibatch)} max(0, µ − δ^E(f_θ(x_i), f_θ(x_j))) ]    (5.8)

Thus, one can minimize the dissimilarity – denoted by δ^E – between must-link documents and maximize the dissimilarity between cannot-link documents to exploit the pairwise constraint information. λ_1 is a hyperparameter which controls the relative importance of the clustering loss and the pairwise constraint loss (µ being a margin hyperparameter in the cannot-link term). In this formulation, if a pair of documents linked by an ML or CL constraint occurs in the current minibatch during training, the corresponding constraint loss is activated; otherwise it does not affect the total loss (it is set to zero).
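For illustration, a minimal PyTorch sketch of this pairwise-constraint term (with cosine dissimilarity for δ^E, and placeholder names `z`, `ml_pairs`, `cl_pairs`, `mu`) could be written as follows.

```python
# Sketch of the pairwise-constraint term of Eq. 5.8. `z` contains the
# embeddings f_theta(x) of the current minibatch, and `ml_pairs` / `cl_pairs`
# list the (i, j) index pairs of the minibatch that are must-linked /
# cannot-linked; `mu` is the margin of the hinge term.
import torch
import torch.nn.functional as F

def pcd2c_loss(z, ml_pairs, cl_pairs, mu=1.0):
    loss = z.new_zeros(())
    if ml_pairs:  # pull must-link documents together
        i, j = zip(*ml_pairs)
        d_ml = 1.0 - F.cosine_similarity(z[list(i)], z[list(j)], dim=-1)
        loss = loss + d_ml.mean()
    if cl_pairs:  # push cannot-link documents at least mu apart (hinge)
        i, j = zip(*cl_pairs)
        d_cl = 1.0 - F.cosine_similarity(z[list(i)], z[list(j)], dim=-1)
        loss = loss + torch.clamp(mu - d_cl, min=0.0).mean()
    return loss
```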

5.3.1 Choice of Deep Clustering Framework

In practice, we use fully differentiable formulations of Problems 5.3, 5.4 and 5.8. In the context of the k-Means algorithm, a popular clustering method, such differentiable formulations can be directly developed on top of the algorithms provided in [29] (called DCN) and [136] (called DKM), the latter proposing a truly joint formulation of the deep clustering problem. Other state-of-the-art deep clustering approaches, such as IDEC [31], which is also based on cluster representatives, could naturally be adopted as well. The comparison between these approaches performed in [136] (shown in the previous chapter as well) nevertheless suggests that DKM outperforms the other approaches. This difference was confirmed on the text collections retained in this study. We thus focus here on the DKM algorithm introduced in [136] with:

L_rec(X, g_η ∘ f_θ(X)) = Σ_{x∈X} δ^I(x, g_η ∘ f_θ(x)),    (5.9)

where δ^I denotes a dissimilarity in the input space, and:

L_clust(f_θ(X), R) = Σ_{x∈X} Σ_{k=1}^{K} δ^E(f_θ(x), r_k) G_k(f_θ(x), α; R)    (5.10)

where α is an inverse temperature parameter and G_k(f_θ(x), α; R) is a softmax function parameterized by α, defined as follows:

G_k(f_θ(x), α; R) = exp(−α · δ^E(f_θ(x), r_k)) / Σ_{k'=1}^{K} exp(−α · δ^E(f_θ(x), r_{k'})).    (5.11)

The k-means solution is recovered when α tends to +∞.
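As an illustration, the DKM clustering loss of Eqs. 5.10-5.11 can be sketched as below (not the released implementation; `z`, `reps` and `alpha` are placeholders, and the cosine dissimilarity plays the role of δ^E here).

```python
# Sketch of the DKM clustering loss (Eqs. 5.10-5.11). `z` is the (n, p) batch
# of embeddings f_theta(x), `reps` the (K, p) cluster representatives and
# `alpha` the inverse temperature.
import torch
import torch.nn.functional as F

def dkm_clustering_loss(z, reps, alpha=1000.0):
    # Pairwise cosine dissimilarities delta^E(f_theta(x), r_k), shape (n, K).
    z_n = F.normalize(z, dim=-1)
    r_n = F.normalize(reps, dim=-1)
    d = 1.0 - z_n @ r_n.t()
    # G_k: softmax over clusters of -alpha * delta^E (Eq. 5.11).
    g = torch.softmax(-alpha * d, dim=1)
    # Eq. 5.10: soft-weighted sum of dissimilarities; the hard k-Means
    # assignment is recovered as alpha tends to infinity.
    return (g * d).sum(dim=1).sum()
```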

5.4 Experiments

In this section, we discuss thoroughly the way we conducted the experiments and provide all of the required information.

5.4.1 Datasets

The experiments we performed to evaluate the proposed SD2C and PCD2C frameworks are based on five publicly available datasets with various sizes and characteristics that have been extensively used in the context of text classification and clustering: The 20 Newsgroups2 dataset, referred to as 20NEWS; the Reuters-215783 dataset, referred to as REUTERS, from which, similarly to [95–98], we use only the 10 largest (and highly imbalanced) categories; the Yahoo! Answers dataset introduced in [122], referred to as YAHOO, from which we use only the test set comprising 60,000 documents evenly split into 10 classes; the DBPedia dataset curated in [122], referred to as DBPEDIA, from which we also only use the test set made of 70,000 documents uniformly distributed in 14 classes; and the AG News dataset, introduced as well in [122] and referred to as AGNEWS, from which we use the training set, composed of 120,000 documents evenly split into 4 classes.

5.4.2 SD2C Baselines and Variants

For both SD2C-Doc and SD2C-Rep, different dissimilarities can be adopted for δ^I and δ^E. As the cosine distance performed consistently better for δ^E than the Euclidean distance in our preliminary experiments, it is adopted here. We nevertheless did not observe such a clear trend for δ^I, and we indicate here the results obtained both for the cosine distance and the Euclidean

distance. This yields two versions for each method, which we denote as SD2C-Doc-e/SD2C-Rep-e and SD2C-Doc-c/SD2C-Rep-c, depending on whether the Euclidean or the cosine distance is used for δ^I, respectively. To compare against SD2C, we considered the following baseline methods:
– KM, AE-KM and different deep clustering frameworks: KM corresponds to k-Means [1] applied on the same input for documents as the one used for SD2C (average of documents’ word embeddings); AE-KM first trains an Autoencoder on the collection and then applies k-Means to the document embeddings learned by the Autoencoder; DKM is the deep k-Means algorithm4 presented in [136], which we also study under the two variants DKM-e and DKM-c. Other deep clustering frameworks such as DCN [29] and IDEC [31] can also be used in this framework, and their results are compared against the DKM algorithm.5
– NN: This method is similar to the ‘on-the-fly’ nearest neighbor-like classification described in [92]. Each document, represented by the average of its word embeddings, is assigned to the nearest class in terms of the cosine distance (which outperformed the Euclidean distance), each class being represented by the average embedding of its seed words (denoted as {s_k}_{k=1}^K in Sec. 5.2).
– STM: In our experiments, we ran the Java implementation of the Seed-guided Topic Model [96] provided by the authors6 and used the standard hyperparameters indicated in the paper. Given that this approach was not scalable when the whole vocabulary is used, we only kept the 2000 most frequent words (after preprocessing) for each dataset.7

2 http://qwone.com/~jason/20Newsgroups/
3 http://www.daviddlewis.com/resources/testcollections/reuters21578/

5.4.3 PCD2C Baselines and Variants

Similar to SD2C, in this case KM, AE-KM and different deep clustering frameworks such as DKM, DCN and IDEC are used to evaluate the effectiveness of the pairwise constraints.8

4 https://github.com/MaziarMF/deep-k-means
5 Seed words are not utilized in these approaches.
6 https://github.com/ly233/Seed-Guided-Topic-Model
7 Very recently, another topic modeling approach, the Laplacian seed word Topic Model (LapSWTM), was proposed in [97]. However, firstly, LapSWTM counts 8 hyperparameters that were empirically optimized in the original paper, and it is not straightforward how these hyperparameters should be tuned on the additional datasets used here. Secondly, LapSWTM shares a lot with the STM model in its construction and performance. Thirdly, the code for LapSWTM is, as far as we are aware, not publicly available. For these different reasons, we simply chose STM to represent the state of the art in topic modeling-based dataless text classification.

5.4.4 Constraints Selection

To evaluate the proposed SD2C and PCD2C algorithms, we followed two approaches to select constraints: 1) an automatic approach and 2) a manual approach, which are detailed below.

5.4.4.1 SD2C: automatic seed words selection

Recent works on dataless text classification [95–98] only considered the 20NEWS and REUTERS datasets in their experiments, relying respectively on the seed words induced by the class labels and on the manually curated seed words from [95]. To perform an evaluation on all the collections retained here, we devised a simple heuristic based on tf-idf to propose seed words. For a given collection and for each class k of the collection, all words w in the vocabulary are scored according to:

score(w, k) = ( tf_k(w) − (1 / (K − 1)) Σ_{k'≠k} tf_{k'}(w) ) × idf(w),    (5.12)

where idf(w) is the inverse document frequency computed on the documents of the whole collection and tf_k(w) is the term frequency for class k, which we define as the sum of tf_x(w) for all documents x in class k. The rationale for this score is that one wishes to select words that are frequent in class k and infrequent in the other classes, hence the penalization term inside the brackets. Based on this score, one can then select the top words for each class as seed words. We emphasize that this heuristic is only adopted for the purpose of simulating seed words during the evaluation: it is not destined to be used to identify seed words in a real-world application, where the ground truth is unknown. In our preliminary studies, we generated five seed words per cluster, while in most cases we used only three seed words per cluster. The remaining seed words have been used for further evaluation and analysis of our proposed framework. Table 5.1 shows the obtained seed words for each dataset.
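A minimal sketch of this heuristic (illustrative only; `tf`, `idf` and `vocab` are assumed to be precomputed) is given below.

```python
# Sketch of the automatic seed-word heuristic of Eq. 5.12. `tf` is a (K, V)
# array of class-level term frequencies (sum of tf_x(w) over the documents of
# each class), `idf` a (V,) array of inverse document frequencies and `vocab`
# the list of V words.
import numpy as np

def select_seed_words(tf, idf, vocab, n_seeds=3):
    K = tf.shape[0]
    seeds = []
    for k in range(K):
        other = (tf.sum(axis=0) - tf[k]) / (K - 1)   # mean tf over the other classes
        score = (tf[k] - other) * idf                # frequent in k, rare elsewhere
        top = np.argsort(-score)[:n_seeds]
        seeds.append([vocab[i] for i in top])
    return seeds
```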

8 Pairwise constraints are not utilized in these approaches.

Table 5.1: Automatically selected seed words for each dataset. Each row corresponds to a class of the dataset; the seed words of a class are sorted from most to least relevant.

20NEWS:
atheist god moral islam livesey
graphic imag file format program
window file driver microsoft ms
drive card scsi ide monitor
mac appl monitor drive quadra
window motif server widget xterm
sale offer ship sell condit
car engin ford auto oil
bike dod ride motorcycl bmw
game basebal pitch player team
game team hockey play player
key encrypt clipper chip secur
circuit batteri electron wire power
pitt doctor medic gordon food
space nasa orbit henri moon
god christian church jesu sin
gun fbi stratu atf weapon
israel armenian arab jew muslim
cramer optilink gay homosexu clayton
christian sandvik god jesu moral

REUTERS:
share compani dlr acquir pct
coffe quota export ico bag
oil crude barrel price opec
net shr loss dlr rev
gold ounc ton feet coin
rate pct bank prime cut
bank stg currenc market dollar
ship port strike vessel gulf
sugar tonn white trader intervent
trade japan billion export japanes

YAHOO!:
god peopl christian jesu believ
water earth number answer time
doctor weight eat day pain
school word colleg mean answer
comput yahoo file download window
team win game play cup
job compani money work busi
song movi music love favorit
love guy girl friend feel
peopl countri bush presid state

DBPEDIA:
compani base oper servic product
school high colleg univers locat
born singer american known music
footbal play born leagu profession
politician repres member born serv
navi class ship built aircraft
histor church build hous locat
river lake mountain tributari romania
villag district popul counti provinc
famili speci moth genu snail
speci plant famili genu flower
album releas record band studio
film direct star drama comedi
publish novel book journal seri

AGNEWS:
kill iraq presid minist afp
game win team season night
oil stock price compani profit
new microsoft softwar internet compani

5.4.4.2 SD2C: manual seed words selection

Evaluating algorithms in real case scenarios can be useful to analyze them better. Selecting seed words manually and comparing the resulting performance against other algorithms provides useful insights into SD2C. Manually selecting seed words from thousands of documents can be difficult and frustrating. Also, since selecting seed words from different datasets can become a tedious task, we decided to focus only on the DBPEDIA dataset and design two different user experiment tasks to collect seed words with the help of different users. In order to make the user experiments more feasible, we selected only five classes of this dataset (containing 25,000 documents in total). We have designed two different experiments, which are detailed below.

Using K-means for subsampling the documents. Presenting 25,000 documents to users so that they extract seed words from them is impractical. To deal with this problem, we used the k-means algorithm to select the most discriminative and informative documents. Indeed, without using any class information, K-means has been applied on the whole collection (containing 25,000 documents), where K (the number of clusters) has been set to 20 (the value of K is a hyperparameter and one can choose a different value). Then, the n closest documents to each cluster center have been selected. Finally, by setting n = 10, we ended up with 200 documents. In our preliminary studies we noted that this approach yields representative documents where all classes are covered. However, this approach can lead to an imbalance across classes: at the beginning of the experiments, the input contained 25,000 documents equally distributed in five different classes, while this approach yields K × n (here 200) documents which are not equally distributed among all classes. Moreover, since documents can be quite long, we kept only the first sentence of each document to present to the users. Finally, these documents have been presented to different users for selecting seed words. Five users participated in this case study; they were asked to first identify the different classes and then provide three seed words per class. More detailed information about the seed words selected by each user and the time taken for the experiment is given in Table 5.2. Note that to avoid biasing the users by the order of the presented documents obtained from K-means, we shuffled the documents for each user.
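The subsampling step can be sketched as follows (an illustrative sketch with placeholder names, assuming scikit-learn; `doc_vecs` holds the averaged word embeddings of the 25,000 documents).

```python
# Sketch of the k-Means-based subsampling used to pick documents for the user
# study: cluster the collection, then keep the n documents closest to each
# cluster center.
import numpy as np
from sklearn.cluster import KMeans

def subsample_documents(doc_vecs, n_clusters=20, n_per_cluster=10):
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(doc_vecs)
    selected = []
    for k in range(n_clusters):
        dist = np.linalg.norm(doc_vecs - km.cluster_centers_[k], axis=1)
        selected.extend(np.argsort(dist)[:n_per_cluster].tolist())
    return selected  # K x n document indices (200 with the default values)
```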

Using Latent Dirichlet Allocation (LDA) for extracting seed words. Instead of presenting the first sentence of each document to the users, one might apply LDA to the original collection (containing 25,000 documents) and use it to extract relevant seed words for each detected

Table 5.2: Seed words obtained by presenting the K-means-selected documents to each user, alongside the time (in seconds) taken to finish the experiment. The seed words provided by each user are shown such that each line pertains to one class.

User #1 (594 s):
book newspaper,journal
company,business technology
film comedy history
school,university college
aircraft navy war

User #2 (600 s):
car motor automotive
Film cinema drama
navy ship war
education school college
newspaper novel,book

User #3 (623 s):
film movie direct
school university college
company,industry job
ship aircraft tank
write publish book

User #4 (698 s):
newspaper book journal
company,business,industry
flim hollywood bollywood
university college school
navy ship aircraft

User #5 (1061 s):
school college university
book newspaper magazin
fiction comedy movie
technology business industry
airplane submarine car

Average time consumption: 715.2 s

topic. LDA is one of the most well-known and widely used topic modeling algorithms [137] and is mostly used in unsupervised settings. In our experiments, we fed the whole collection to LDA to extract the relevant seed words. In this case, we can simply dispense with the documents and present seed words to the users instead. The number of topics has been set to 20 and the number of seed words for each topic to 10. LDA thus produces 200 seed words in total to be presented to the participants. Similarly to the K-means solution described above, five users participated in this case study; we asked them to first go through all seed words and identify five different classes, and then provide three seed words pertaining to each class. Table 5.3 shows more detailed information about the selected seed words and the time required for each user to finish the experiment. Note that to avoid biasing the users by the order of the presented seed words obtained from LDA, we shuffled the seed words for each user.
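As an illustration, the seed-word extraction could be sketched with gensim as follows (gensim is an assumption here, not necessarily the implementation used; `tokenized_docs` is a placeholder for the preprocessed collection).

```python
# Sketch of the LDA-based seed-word extraction: fit a 20-topic LDA model and
# keep the 10 most probable words of each topic (200 candidate seed words).
from gensim.corpora import Dictionary
from gensim.models import LdaModel

def lda_seed_words(tokenized_docs, n_topics=20, n_words=10):
    dictionary = Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=n_topics)
    return [[w for w, _ in lda.show_topic(t, topn=n_words)]
            for t in range(n_topics)]
```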

Table 5.3: Seed words obtained by presenting the LDA-extracted seed words to each user, alongside the time (in seconds) taken to finish the experiment. The seed words provided by each user are shown such that each line pertains to one class.

Participant #1 (1502 s):
film drama comedy
publish journal novel
class school college
navy ship uss
company technology produce

Participant #2 (554 s):
film play series
school university faculty
book novel journal
cars train ship
war aircraft navy

Participant #3 (302 s):
film comedy series
boston virginia vegas
car volvo airplane
college elementary university
company found public

Participant #4 (581 s):
company samsung volvo
school college university
film action comedy
japanese american english
dockyard navy ship

Participant #5 (384 s):
united american state
american german bangladeshi
book stories publish
aircraft airline international
college university class

Average time consumption: 664.6 s

5.4.4.3 PCD2C: Automatic Pairwise Constraints Selection

To provide pairwise constraints, we have assumed that a few labeled data points are available, such that pairs with the same label are considered to be must-link pairs and pairs with different labels are considered to be cannot-link pairs. We have used different numbers of pairs (50, 100, 200, 500 and 1000) to measure the effect of increasing this number. We started by randomly selecting 50 pairs of documents and assigning them to must-link or cannot-link (using the true cluster labels). Then, to obtain 100 pairs, the 50 already generated pairs are used as a base and 50 more pairs are randomly selected and added to them. The same process is repeated iteratively to obtain 200, 500 and 1000 pairs.
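A sketch of this incremental pair generation (illustrative only; `labels` is a placeholder for the ground-truth labels) is shown below.

```python
# Sketch of the incremental ML/CL pair generation from ground-truth labels:
# each larger constraint set extends the smaller ones, as described above.
import random

def grow_pairs(labels, sizes=(50, 100, 200, 500, 1000), seed=0):
    rng = random.Random(seed)
    n, pairs, out = len(labels), [], {}
    for size in sizes:
        while len(pairs) < size:
            i, j = rng.sample(range(n), 2)
            kind = "ML" if labels[i] == labels[j] else "CL"
            pairs.append((i, j, kind))
        out[size] = list(pairs)  # each set is a superset of the previous one
    return out
```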

5.4.4.4 PCD2C: Manual Pairwise Constraints Selection

In order to validate our proposed framework in a real case scenario, we have designed a user-based experiment including pairwise constraints. Similar to the automatic pairwise constraint selection described above, we started by randomly selecting 50 pairs of documents. The same iterative process has then been used to generate 100 and 300 pairs of documents. These pairs are shown to users for annotation. Again, five users participated in this experiment; they were asked to assign a must-link or cannot-link constraint to each pair. The pairs of documents have been shuffled for each user to avoid biasing the users based on the order of the pairs. More details can be found in Table 5.4.

Table 5.4: Detailed information about the number of must-link and cannot-link pairs for each user. The time taken to finish the experiment for each number of pairs is shown in seconds.

Users   #Pairs  #Must-links  #Cannot-links  Time (in seconds)
#1      50      4            46             600
        100     11           89             1140
        300     28           272            2940
#2      50      4            46             584
        100     6            94             1129
        300     21           279            3012
#3      50      2            48             739
        100     7            93             1433
        300     16           284            3697
#4      50      5            45             671
        100     12           88             1295
        300     21           279            3442
#5      50      7            43             660
        100     17           83             1380
        300     37           273            3540
Using true labels (no user experiment)
        50      10           40             N/A
        100     19           81             N/A
        300     53           247            N/A
Average time consumption
        50      N/A          N/A            650.8
        100     N/A          N/A            1275.4
        300     N/A          N/A            3326.2

5.4.5 Experimental Setup

Following prior deep clustering works [29, 31, 85, 136], we initialize the Autoencoder parameters through pretraining by first only optimizing the reconstruction loss of the Autoencoder. In the pretraining of SD2C-Doc, we also include the constraint-enforcing term (second term in Problem 5.3) so that learned representations are impacted by seed words early in the training. At the end of pretraining, the cluster centers are initialized by the seed

words cluster embeddings {s_k}_{k=1}^K.9 Then, in the fine-tuning phase, the whole loss – including the clustering loss and the constraint-enforcing loss (for SD2C-Doc and SD2C-Rep) – is optimized. The same pretraining has been done for PCD2C, except that the cluster representatives are initialized by performing k-means on the learned representations obtained from the Autoencoder (similar to [29, 31, 85, 136]).

Architecture and hyperparameters. The Autoencoder used in our experiments on all datasets is similar to the ones adopted in prior deep clustering works [29, 31, 85, 136]. The encoder and decoder are mirrored fully-connected neural networks with dimensions d-500-500-2000-50 and 50-2000-500-500-d, respectively – d is the input space dimension and 50 corresponds to the dimension p of the Autoencoder embedding space. Neural network weights are initialized based on the Xavier scheme [125]. The SD2C, DKM, and AE-KM models are trained with the Adam optimizer [126] with standard hyperparameters (η = 0.001, β_1 = 0.9, and β_2 = 0.999) and minibatches of 256 documents. The numbers of epochs for the Autoencoder pretraining and model finetuning are fixed to 50 and 200, respectively, as in [136]. We also use the inverse temperature α = 1000 from [136] for the parameterized softmax-based differentiable reformulations of the SD2C models.
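For illustration, the encoder/decoder pair could be sketched as follows (written here in PyTorch as an assumption about an equivalent implementation; the ReLU activations are also an assumption, the layer sizes follow the text above).

```python
# Sketch of the fully-connected autoencoder used in the experiments:
# encoder d-500-500-2000-50, mirrored decoder 50-2000-500-500-d.
import torch.nn as nn

def build_autoencoder(d, p=50):
    encoder = nn.Sequential(
        nn.Linear(d, 500), nn.ReLU(),
        nn.Linear(500, 500), nn.ReLU(),
        nn.Linear(500, 2000), nn.ReLU(),
        nn.Linear(2000, p),                 # embedding space of dimension p
    )
    decoder = nn.Sequential(
        nn.Linear(p, 2000), nn.ReLU(),
        nn.Linear(2000, 500), nn.ReLU(),
        nn.Linear(500, 500), nn.ReLU(),
        nn.Linear(500, d),                  # reconstruction back to the input space
    )
    return encoder, decoder
```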

The balancing hyperparameters λ_0 and λ_1 of SD2C-Doc and SD2C-Rep (e.g., DKM-Doc/Rep, IDEC-Doc/Rep, DCN-Doc/Rep) were both set to 10^-5 – referred to as the general hyperparameters. As mentioned in the previous chapter, we trained a Word2Vec model on each collection individually.

5.4.6 Clustering Results

5.4.6.1 SD2C with Automatic Seed Word Selection

We measure the clustering performance in terms of clustering accuracy (ACC), normalized mutual information (NMI) and adjusted Rand index (ARI), which are standard clustering metrics [127]. Table 5.5 first provides the macro-average (over the 5 datasets) of these measures for all methods, using the top 3 automatically selected seed words per cluster. As one can note, the use of seed words is beneficial to the clustering. Indeed, the approaches which use seed words (NN, STM, SD2C (DKM-based)) have markedly higher

9 This is especially important for SD2C-Rep, which is based on the assumption that the clusters defined by the seed words and those defined by the cluster representatives are aligned.

ACC, NMI and ARI than those which do not (KM, AE-KM, DKM, DCN (e/c), IDEC (e/c)). Among these latter methods, DKM is the best one (as a comparison, DCN and IDEC, mentioned in Section 5.2, respectively obtain 64.8 and 64.1 for ACC, and 49.3 and 47 for ARI). Among the methods exploiting seed words, the SD2C methods are the best ones, outperforming the baseline NN and the STM method by up to 2.6 points for ACC and 3.5 points for ARI. DKM-based SD2C approaches obtain markedly better results than DCN-based and IDEC-based SD2C approaches; indeed, adding constraints and using general hyperparameters seems ineffective for the latter approaches.

We further provide in Tables 5.6 and 5.7 a detailed account of the performance of the methods based on seed words. The results have been averaged over 10 runs and are reported with their standard deviation. We furthermore performed an unpaired Student t-test with a significance level of 0.01 to study whether differences are significant or not (all results in bold are not statistically different from the best result). As one can note, the proposed DKM-based SD2C models compare favorably against STM, the strongest baseline. Indeed, all DKM-based SD2C approaches significantly outperform STM on 20NEWS, and DKM-Doc-e/c as well as DKM-Rep-e also significantly outperform STM on YAHOO and AGNEWS. On the other hand, STM obtained significantly better results in terms of both ACC and ARI on REUTERS and DBPEDIA, the difference on these collections (and especially on DBPEDIA) being nevertheless small. Among the DKM methods, DKM-Doc-c yields the best performance overall (as shown in Table 5.5). Again, we can notice that the DKM-based SD2C approaches obtain significantly better results than the DCN-based and IDEC-based approaches.
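For completeness, the clustering accuracy metric, which requires the best one-to-one mapping between predicted clusters and ground-truth classes, can be computed as sketched below (a standard formulation based on the Hungarian algorithm, not necessarily the exact evaluation script used in the thesis).

```python
# Sketch of clustering accuracy (ACC): build the contingency matrix between
# predicted clusters and true classes, then find the best one-to-one mapping
# with the Hungarian algorithm. Labels are assumed to be 0-indexed integers.
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1
    rows, cols = linear_sum_assignment(-cost)   # maximize matched pairs
    return cost[rows, cols].sum() / y_true.size
```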

Impact of the number of seed words. In our general setting used to report the previous results, the number of seed words per class was arbitrarily set to 3. For comprehensiveness, we study the clustering results of the DKM-based SD2C models when the number of (automatically selected) seed words per cluster is varied from 1 to 5. The evolution of the performance of the SD2C models in terms of accuracy is illustrated in Figure 5.4. We observe that using more seed words leads to notable improvements in most cases. This trend is particularly apparent when the number of seed words is increased from 1 to 2. Although only slight performance gains are observed between 2 and 5 seed words, the results exhibit greater stability. This suggests that providing as few as 2 seed words per cluster – which constitutes a modest annotation effort for humans – can prove highly beneficial for the clustering results obtained by our DKM-based SD2C approaches.

Table 5.5: Macro-average results in terms of accuracy (ACC), normalized mutual informa- tion (NMI) and adjusted rand index (ARI). The double vertical line separates approaches which leverage seed words (right) from approaches which do not (left). Bold values correspond to the best results.

Model         ACC   ARI   NMI
KM            61.4  45.4  59.0
AE-KM         64.0  48.8  59.3
DKM-e         65.5  50.4  59.2
DKM-c         65.0  48.9  58.6
DCN-e         64.8  49.3  59.3
DCN-c         64.5  47.6  59.0
IDEC          64.1  47.0  58.4
NN            72.6  50.9  58.1
STM           73.3  53.6  57.6
DKM-Doc-e     73.6  56.2  61.5
DKM-Doc-c     75.9  57.1  61.0
DKM-Rep-e     75.6  55.7  59.1
DKM-Rep-c     74.1  53.5  58.5
DCN-Doc-e     56.1  37.4  51.7
DCN-Doc-c     61.4  43.8  54.2
DCN-Rep-e     52.7  33.0  44.5
DCN-Rep-c     63.8  46.9  58.8
IDEC-Doc-e    45.1  19.6  34.7
IDEC-Doc-c    45.8  18.8  34.8
IDEC-Rep-e    46.1  18.4  35.0
IDEC-Rep-c    37.5  17.2  32.3

Table 5.6: Seed-guided constrained clustering results on DBPEDIA and AGNEWS with 3 seed words per cluster. Bold results denote the best, as well as not significantly different from the best, results. Italicized SD2C results indicate a significant improvement over STM.

Model (ACC ARI NMI)   DBPEDIA | AGNEWS
NN          79.9±0.0 64.7±0.0 75.7±0.0 | 77.3±0.0 51.9±0.0 51.1±0.0
STM         80.9±0.4 72.7±0.4 79.1±0.2 | 79.7±0.2 55.8±0.3 52.5±0.2
DKM-Doc-e   76.1±0.2 63.0±0.2 72.8±0.3 | 84.8±0.2 64.4±0.5 60.1±0.4
DKM-Doc-c   79.4±1.9 66.3±2.1 76.0±0.8 | 84.1±1.1 63.3±2.3 59.9±1.5
DKM-Rep-e   80.3±0.5 67.0±0.6 75.0±0.4 | 81.1±0.3 58.1±0.5 54.7±0.4
DKM-Rep-c   79.8±1.4 66.1±1.8 74.9±0.9 | 79.4±2.0 55.0±3.6 53.4±2.0
DCN-Doc-e   52.7±4.5 34.4±3.8 55.6±2.4 | 80.1±1.3 58.8±3.2 58.4±2.3
DCN-Doc-c   66.2±3.5 54.1±3.2 69.4±1.2 | 84.2±1.1 63.1±2.4 59.1±1.9
DCN-Rep-e   53.1±3.4 38.6±3.0 56.7±2.3 | 71.7±4.0 41.3±6.9 39.4±6.0
DCN-Rep-c   68.0±3.2 57.1±4.1 73.2±1.5 | 85.1±0.8 65.0±1.8 61.1±1.5
IDEC-Doc-e  39.5±7.0 19.5±5.8 40.5±4.5 | 39.8±6.5 6.2±5.5 16.5±5.4
IDEC-Doc-c  41.1±5.7 20.6±5.7 42.1±3.5 | 46.7±7.6 12.6±10.2 23.0±6.8
IDEC-Rep-e  41.8±6.1 21.2±4.7 42.9±3.4 | 48.5±6.3 13.1±7.0 24.1±5.9
IDEC-Rep-c  27.4±5.9 14.3±6.4 29.3±4.1 | 44.2±8.3 10.5±9.2 19.7±7.1

Table 5.7: Seed-guided constrained clustering results on 20NEWS, REUTERS and YAHOO with 3 seed words per cluster. Bold results denote the best, as well as not significantly different from the best, results. Italicized SD2C results indicate a significant improvement over STM.

Model (ACC ARI NMI)   20NEWS | REUTERS | YAHOO
NN          72.3±0.0 53.2±0.0 67.2±2.0 | 79.0±0.0 58.3±0.0 61.3±0.0 | 54.7±0.0 26.5±0.0 35.0±0.0
STM         65.7±0.9 47.5±1.0 58.3±0.9 | 83.0±0.7 66.3±1.2 64.4±0.6 | 57.1±0.1 29.3±0.2 33.8±0.1
DKM-Doc-e   80.5±0.6 66.4±0.7 73.0±0.5 | 66.3±3.7 53.0±3.2 62.1±1.2 | 60.4±0.3 34.3±0.3 39.5±0.2
DKM-Doc-c   77.0±1.5 61.7±2.2 70.0±1.3 | 78.1±1.8 60.2±1.1 58.7±0.9 | 61.1±0.8 34.4±1.3 40.4±0.3
DKM-Rep-e   76.1±0.3 60.1±0.5 68.3±0.2 | 80.2±0.8 59.9±0.8 59.8±1.1 | 60.2±0.3 33.5±0.4 37.8±0.3
DKM-Rep-c   72.1±1.5 55.7±1.7 65.8±1.0 | 81.4±0.7 61.1±0.9 62.0±1.0 | 57.8±1.7 29.7±2.7 36.8±0.7
DCN-Doc-e   53.8±5.1 34.1±3.9 52.8±2.7 | 41.1±3.3 30.4±2.6 54.1±1.9 | 53.1±2.8 29.4±1.7 37.6±0.8
DCN-Doc-c   60.4±4.3 42.2±3.4 57.9±2.2 | 40.9±4.3 29.5±4.0 47.2±2.6 | 55.6±3.1 30.2±1.3 37.8±0.9
DCN-Rep-e   55.9±2.1 36.6±1.9 53.1±1.1 | 38.5±2.0 28.1±2.4 46.1±2.1 | 44.6±2.6 20.7±2.0 27.4±1.5
DCN-Rep-c   66.4±2.7 48.5±2.4 65.4±1.2 | 41.9±4.7 32.3±4.7 54.6±2.7 | 57.7±3.8 31.6±1.9 39.8±1.1
IDEC-Doc-e  44.1±0.9 26.9±2.8 47.1±1.8 | 69.6±2.9 36.1±5.3 49.3±2.4 | 32.9±3.8 9.5±3.1 20.1±3.2
IDEC-Doc-c  36.9±7.1 13.7±5.5 39.9±4.4 | 69.2±3.0 35.3±4.7 47.2±3.0 | 35.3±4.7 11.8±3.8 21.8±3.3
IDEC-Rep-e  36.3±7.0 13.4±5.4 39.3±4.5 | 68.4±2.8 33.5±6.5 46.9±3.0 | 35.6±3.6 11.0±4.1 21.8±3.1
IDEC-Rep-c  32.5±5.9 11.2±6.1 38.7±5.5 | 38.8±3.1 29.2±4.8 46.3±3.7 | 44.9±3.4 21.2±2.6 27.5±1.1

[Five line plots of accuracy as a function of the number of seed words (1 to 5) for SD2C-Doc-e, SD2C-Doc-c, SD2C-Rep-e and SD2C-Rep-c; panels: a) 20NEWS, b) REUTERS, c) YAHOO, d) DBPEDIA, e) AGNEWS.]

Figure 5.4: Clustering results in terms of ACC for SD2C-Doc-e, SD2C-Doc-c, SD2C-Rep-e and SD2C-Rep-c with 1 to 5 seed words.

5.4.6.2 SD2C with Human Selected Seed Words

The results of different SD2C approaches on the subset of DBPEDIA (including 25,000 documents equally distributed into 5 different classes) are shown in Table 5.9 and Table 5.11 (on this subset, the results of multiple clustering and deep clustering algorithms are shown in Table 5.8). The results indicate that using DKM as the choice of clustering method leads to better results than DCN and IDEC. Moreover, the DKM-Doc-e approach is able to obtain better results than the case where seed words are not used in the framework (e.g., DKM). The results obtained by presenting LDA-extracted seed words to different users also indicate that DKM-Doc-e outperforms the other baselines.

Table 5.8: Results of different clustering approaches on the selected documents from the DBPEDIA dataset, without using constraints. Bold results represent the best results.

Metric  KM        AE-KM     DKM-e     DKM-c     DCN-e     DCN-c     IDEC
ACC     83.0±0.0  84.3±0.3  83.6±0.5  84.1±1.0  84.0±0.4  84.0±0.6  83.2±0.3
ARI     65.9±0.0  68.5±0.6  66.4±0.8  67.6±1.5  67.8±0.6  67.8±1.1  67.1±0.5
NMI     70.7±0.0  71.6±0.5  68.3±0.7  69.9±1.1  70.3±0.6  70.3±0.8  69.9±0.7

They however suggest that the improvement gained by using these seed words compared to the case where no seed words are used (e.g., DKM) is quite negligible. In fact, the results obtained by presenting selected documents (through K-means) to the users outperform the results obtained by presenting LDA-extracted seed words to the users. Based on the time required to complete the experiments for these two approaches, presented in Table 5.2 and Table 5.3, there is no significant difference in the time needed to extract seed words, but presenting the documents (through K-means) yields significantly better results than presenting LDA-extracted seed words.

Table 5.9: Results of different deep clustering approaches with the constraints obtained by presenting the first sentence of each document (selected by applying k-means). Bolded results represent the best results and italicized results are statistically equivalent to the best results (p=0.01).

Model (ACC ARI NMI)   kw1 | kw2 | kw3
DKM-Rep-e   84.9±1.2 67.4±2.3 66.1±2.0 | 76.3±2.6 55.9±3.7 58.6±3.4 | 80.4±1.4 61.7±1.8 62.0±1.8
DKM-Rep-c   84.5±2.2 66.7±3.8 66.3±2.9 | 74.1±2.9 52.7±3.6 57.0±3.0 | 81.4±2.2 61.9±3.8 62.5±2.9
DKM-Doc-e   85.2±1.6 69.2±2.6 69.3±2.0 | 85.1±1.7 69.0±2.5 69.1±1.7 | 85.0±1.6 68.8±2.4 69.0±2.1
DKM-Doc-c   85.8±2.3 69.3±4.1 69.2±2.6 | 74.7±4.7 54.8±5.6 60.1±3.7 | 86.3±1.6 70.6±2.4 70.2±1.7
DCN-Rep-e   40.3±0.4 6.7±0.3 6.9±0.3 | 36.9±0.6 5.3±0.3 5.6±0.3 | 39.0±0.5 6.3±0.2 6.6±0.2
DCN-Rep-c   40.0±0.4 6.5±0.2 6.8±0.2 | 36.9±0.8 5.2±0.4 5.6±0.4 | 39.2±0.8 6.4±0.3 6.7±0.3
DCN-Doc-e   38.3±1.6 6.1±0.6 6.3±0.3 | 35.9±2.3 5.1±0.8 5.4±0.8 | 37.0±1.7 6.2±0.6 6.3±0.6
DCN-Doc-c   86.0±1.9 69.7±3.3 69.5±2.3 | 73.6±5.3 53.9±6.3 59.5±4.4 | 86.1±1.6 70.2±2.1 69.7±1.7
IDEC-Rep-e  39.1±8.1 6.9±6.7 22.6±8.1 | 44.6±5.4 14.9±6.7 30.3±5.6 | 35.2±5.3 4.9±4.3 20.9±4.3
IDEC-Rep-c  37.1±9.0 8.1±8.3 20.4±7.6 | 40.0±4.1 9.8±5.0 24.8±4.2 | 28.1±2.6 1.6±1.2 15.3±3.5
IDEC-Doc-e  34.6±5.6 4.1±3.2 19.5±4.7 | 47.8±7.3 20.3±9.6 33.7±6.4 | 39.5±7.1 9.8±7.5 26.8±6.3
IDEC-Doc-c  43.0±8.1 12.1±8.4 25.0±6.8 | 41.1±7.5 11.6±8.7 25.9±7.4 | 27.5±2.2 1.3±0.8 15.0±3.1

Table 5.10: Results of different deep clustering approaches with the constraints obtained by presenting the first sentence of each document (selected by applying K-means). Bolded results represent the best results and italicized results are statistically equivalent to the best results (p=0.01).

Model (ACC ARI NMI)   kw4 | kw5 | Macro-AV
DKM-Rep-e   83.0±1.7 63.9±3.3 62.9±2.7 | 79.5±2.2 59.7±3.2 61.7±2.2 | 80.8 61.7 62.2
DKM-Rep-c   81.5±2.4 60.9±3.8 61.2±2.7 | 77.6±2.3 57.4±2.6 60.4±1.5 | 79.8 59.9 61.4
DKM-Doc-e   84.4±1.6 68.2±2.5 68.9±1.9 | 84.6±1.1 68.4±1.5 69.0±1.2 | 85.0 68.7 69.0
DKM-Doc-c   85.4±1.1 68.8±2.0 69.2±1.2 | 83.1±2.7 65.2±4.5 66.5±3.4 | 83.0 65.7 67.0
DCN-Rep-e   39.5±0.4 6.2±0.5 6.5±0.2 | 38.1±0.7 5.9±0.3 6.2±0.3 | 38.7 6.0 6.3
DCN-Rep-c   38.8±0.6 5.9±0.3 6.1±0.3 | 37.7±0.4 5.8±0.1 6.0±0.1 | 38.5 5.9 6.2
DCN-Doc-e   37.9±2.3 6.1±0.8 6.3±0.8 | 36.1±1.1 5.8±0.8 6.0±0.6 | 37.0 5.8 6.0
DCN-Doc-c   84.9±1.7 67.9±2.8 68.6±1.6 | 83.5±3.2 66.2±5.0 67.3±3.6 | 82.8 65.5 66.9
IDEC-Rep-e  29.4±3.3 1.3±1.0 14.3±3.7 | 49.7±0.7 22.8±8.2 32.8±5.8 | 39.6 10.1 24.1
IDEC-Rep-c  33.4±4.9 4.8±5.1 15.4±4.3 | 41.1±9.5 16.0±12.4 25.8±11.3 | 35.9 8.0 20.3
IDEC-Doc-e  26.7±1.9 0.8±0.5 12.6±2.7 | 48.9±5.5 23.0±4.9 33.7±3.8 | 39.5 11.6 25.2
IDEC-Doc-c  36.5±3.9 6.8±4.4 18.7±3.7 | 44.5±6.5 20.0±9.8 29.1±9.3 | 38.5 10.3 22.7

Table 5.11: Results of different deep clustering approaches with the constraints obtained by presenting the seed words obtained from LDA. Bolded results represent the best results and italicized results are statistically equivalent to the best results (p=0.01).

Model (ACC ARI NMI)   kw1 | kw2 | kw3
DKM-Rep-e   83.3±1.1 65.0±2.1 64.2±2.3 | 76.1±2.9 55.7±2.9 58.6±2.3 | 60.5±1.4 40.0±1.0 44.4±1.3
DKM-Rep-c   83.5±1.6 65.2±2.6 65.5±1.9 | 76.4±3.1 56.3±3.4 59.8±2.7 | 59.3±2.3 41.7±2.1 47.4±1.8
DKM-Doc-e   84.6±1.3 68.3±2.1 68.9±1.7 | 84.3±1.4 67.9±2.3 68.7±2.0 | 82.7±5.4 66.2±7.0 67.7±5.0
DKM-Doc-c   83.8±1.9 66.8±3.2 68.0±2.5 | 84.3±2.1 67.3±3.3 68.2±2.0 | 63.0±7.3 48.3±6.7 56.1±4.4
DCN-Rep-e   39.9±0.5 6.4±0.3 6.7±0.3 | 36.6±0.6 5.3±0.2 5.6±0.2 | 32.7±0.5 3.9±0.2 4.3±0.2
DCN-Rep-c   40.1±0.7 6.5±0.4 6.8±0.3 | 38.6±0.6 5.8±0.6 6.1±0.2 | 32.3±0.7 4.1±0.3 4.5±0.3
DCN-Doc-e   38.4±1.4 5.9±0.6 6.2±0.6 | 36.4±2.1 5.8±0.9 6.0±0.8 | 34.0±1.2 4.9±0.5 5.1±0.5
DCN-Doc-c   83.5±1.0 66.6±1.7 68.2±1.4 | 83.8±3.8 65.8±5.0 66.9±3.3 | 63.9±6.2 49.2±5.0 56.7±3.4
IDEC-Rep-e  59.0±7.8 26.8±9.4 38.0±7.1 | 55.8±6.8 31.3±9.0 41.1±6.7 | 46.1±5.1 20.2±5.9 27.0±4.3
IDEC-Rep-c  44.6±5.2 13.1±7.4 24.6±5.0 | 39.5±7.4 11.8±8.7 25.5±7.8 | 31.8±4.2 3.9±2.5 10.7±2.9
IDEC-Doc-e  59.3±9.8 29.9±11.3 39.3±9.0 | 54.0±7.8 29.1±8.4 38.9±6.0 | 46.4±5.3 20.3±6.6 28.4±4.9
IDEC-Doc-c  40.1±9.5 9.5±8.9 23.1±8.4 | 43.0±9.8 16.8±10.6 28.2±10.1 | 32.3±2.8 3.8±1.8 12.5±2.7

Table 5.12: Results of different deep clustering approaches with the constraints obtained by presenting the seed words obtained from LDA. Bolded results represent the best results and italicized results are statistically equivalent to the best results (p=0.01).

Model (ACC ARI NMI)   kw4 | kw5 | Macro-AV
DKM-Rep-e   78.0±2.3 56.7±3.6 58.4±3.7 | 59.2±6.1 33.6±6.4 37.0±6.0 | 71.4 50.2 52.5
DKM-Rep-c   74.3±4.6 52.3±4.9 55.8±3.7 | 59.2±3.7 37.5±3.5 42.9±2.6 | 70.5 50.6 54.2
DKM-Doc-e   85.0±1.2 68.9±1.7 69.4±1.3 | 81.8±6.9 65.2±8.9 67.1±5.7 | 83.6 67.3 68.3
DKM-Doc-c   83.7±1.3 66.8±2.1 68.1±1.9 | 74.8±7.6 56.7±8.2 59.6±6.4 | 77.9 61.1 64.0
DCN-Rep-e   37.9±0.7 5.6±0.3 5.9±0.3 | 30.3±0.9 2.6±0.3 2.9±0.4 | 35.4 4.7 5.0
DCN-Rep-c   37.4±0.8 5.4±0.4 5.7±0.4 | 31.3±1.2 3.1±0.6 3.5±0.5 | 35.9 4.9 5.3
DCN-Doc-e   36.4±2.0 5.7±0.7 5.9±0.7 | 32.6±2.4 4.5±0.8 4.8±0.8 | 35.5 5.8 5.6
DCN-Doc-c   84.1±1.8 67.0±3.0 68.0±2.2 | 78.4±5.2 59.8±6.7 61.9±5.2 | 79.6 61.6 64.3
IDEC-Rep-e  49.8±5.2 20.6±5.2 30.4±5.3 | 36.9±6.5 8.7±5.2 17.9±5.7 | 49.5 21.5 30.8
IDEC-Rep-c  38.1±4.7 8.7±4.2 20.8±4.1 | 34.0±6.8 6.6±5.8 14.6±5.5 | 37.6 8.8 19.2
IDEC-Doc-e  49.9±5.5 18.4±0.5 30.3±3.9 | 31.8±5.2 4.6±4.0 13.8±3.9 | 48.2 20.4 30.1
IDEC-Doc-c  36.9±4.0 8.1±4.6 20.7±3.9 | 32.8±5.2 5.9±4.3 13.0±4.9 | 37.0 8.8 19.5

5.4.6.3 PCD2C Using Groundtruth-based ML/CL Constraints

Table 5.13: Results of PCD2C framework (using DKM) on 20NEWS, REUTERS and YAHOO with general hyperparameters.

Model (ACC ARI NMI)   20NEWS | REUTERS | YAHOO
DKM-e(50)    75.0±2.0 62.5±1.4 71.6±0.6 | 42.7±1.3 33.6±0.8 55.8±1.3 | 54.7±2.5 31.6±1.1 37.8±0.7
DKM-e(100)   73.5±1.8 61.7±1.4 71.1±0.9 | 43.3±1.7 33.9±1.0 56.3±1.1 | 56.0±3.1 32.2±1.5 38.1±0.9
DKM-e(200)   73.7±3.1 61.8±1.6 71.0±1.1 | 43.8±3.6 34.2±3.0 56.1±1.3 | 56.4±2.7 32.5±1.3 38.4±0.8
DKM-e(500)   73.8±4.2 61.7±1.1 71.2±1.2 | 42.5±0.9 33.6±0.9 55.9±1.0 | 55.7±2.8 32.1±1.1 38.1±0.7
DKM-e(1000)  74.2±2.8 62.0±1.3 71.2±0.9 | 42.6±2.0 33.0±1.5 55.5±0.9 | 55.5±2.9 31.9±1.4 38.0±0.8

Table 5.14: Results of PCD2C framework (using DKM) on DBPEDIA and AGNEWS with general hyperparameters.

Model (ACC ARI NMI)   DBPEDIA | AGNEWS | Macro-average
DKM-e(50)    67.9±2.6 58.7±3.2 72.3±1.5 | 85.2±0.4 65.4±0.7 61.0±0.6 | 65.1 50.3 59.7
DKM-e(100)   68.2±3.4 59.5±3.7 72.9±1.5 | 85.5±0.4 65.9±0.7 61.4±0.6 | 65.3 50.6 59.9
DKM-e(200)   67.8±2.2 58.9±3.5 72.6±1.7 | 85.3±0.5 65.6±1.0 61.3±0.8 | 65.4 50.6 59.8
DKM-e(500)   66.1±2.5 56.9±3.3 71.7±1.6 | 85.3±0.5 65.6±1.0 61.2±0.9 | 64.6 49.9 59.6
DKM-e(1000)  67.5±1.7 58.0±3.0 72.3±1.3 | 85.4±0.5 65.9±1.0 61.5±0.8 | 65.0 50.1 59.7

Similar to the SD2C approach, one may select different deep clustering algorithms in the PCD2C framework.

Table 5.15: Results of the PCD2C framework (using DKM) on 20NEWS, REUTERS and YAHOO obtained by tuning the λ1 hyperparameter (λ0 has been set to 10−5).

Model | 20NEWS (ACC / ARI / NMI) | REUTERS (ACC / ARI / NMI) | YAHOO (ACC / ARI / NMI)
DKM-e(50) | 75.6±3.1 / 62.5±1.4 / 71.7±0.9 | 80.8±4.6 / 83.1±3.0 / 69.1±2.5 | 54.7±2.5 / 31.6±1.1 / 37.8±0.7
DKM-e(100) | 75.3±3.5 / 62.5±2.9 / 71.8±1.0 | 82.7±2.7 / 84.2±2.4 / 69.8±3.2 | 56.0±3.1 / 32.2±1.5 / 38.1±0.9
DKM-e(200) | 74.3±2.3 / 62.1±1.7 / 71.9±0.7 | 84.4±2.0 / 86.4±2.4 / 70.8±2.7 | 56.4±2.7 / 32.5±1.3 / 38.4±0.8
DKM-e(500) | 73.9±2.1 / 61.9±2.1 / 71.6±0.7 | 83.7±3.6 / 86.5±4.6 / 72.1±3.4 | 55.7±2.8 / 32.1±1.1 / 38.1±0.7
DKM-e(1000) | 74.8±3.6 / 62.2±1.8 / 72.0±1.1 | 85.2±1.8 / 89.9±1.4 / 76.3±2.1 | 55.5±2.9 / 31.9±1.4 / 38.0±0.8

Table 5.16: Results of the PCD2C framework (using DKM) on the DBPEDIA and AGNEWS datasets obtained by tuning the λ1 hyperparameter (λ0 has been set to 10−5).

Model | DBPEDIA (ACC / ARI / NMI) | AGNEWS (ACC / ARI / NMI) | Macro-average (ACC / ARI / NMI)
DKM-e(50) | 67.9±2.6 / 58.7±3.2 / 72.3±1.5 | 85.2±0.4 / 65.4±0.7 / 61.0±0.6 | 72.8 / 60.1 / 62.3
DKM-e(100) | 68.2±3.4 / 59.5±3.7 / 72.9±1.5 | 85.5±0.4 / 65.9±0.7 / 61.4±0.6 | 73.5 / 60.8 / 62.8
DKM-e(200) | 67.8±2.2 / 58.9±3.5 / 72.6±1.7 | 85.3±0.5 / 65.6±1.0 / 61.3±0.8 | 73.6 / 61.1 / 63.0
DKM-e(500) | 66.1±2.5 / 56.9±3.3 / 71.7±1.6 | 85.3±0.5 / 65.6±1.0 / 61.2±0.9 | 72.9 / 60.6 / 62.9
DKM-e(1000) | 67.5±1.7 / 58.0±3.0 / 72.3±1.3 | 85.4±0.5 / 65.9±1.0 / 61.5±0.8 | 73.6 / 61.5 / 64.0

The proposed PCD2C framework has been evaluated with different deep clustering algorithms. In this thesis, we used DKM [136], DCN [29] and IDEC [31] and performed 10 individual seeded runs to measure the effectiveness of the model. Tables 5.13 and 5.14 report the results of the different PCD2C variants on 5 publicly available datasets using general hyperparameters, while Tables 5.15 and 5.16 report the results obtained by tuning the λ1 hyperparameter (λ0 has been set to 10−5). As the results obtained with general hyperparameters indicate, using pairwise constraints in this setting does not yield improvements over the corresponding deep clustering approaches (DKM, DCN and IDEC) in which no constraints are integrated. Moreover, contrary to our expectations, adding more constraints (e.g., from 50 to 100) does not necessarily lead to better results.
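In this ground-truth-based setting, the ML/CL pairs can be derived directly from the gold labels: a sampled pair is marked must-link when the two documents share a label and cannot-link otherwise. The snippet below is only a minimal sketch of such a sampling step (the function name, the uniform sampling and the toy labels are illustrative assumptions; the exact protocol used in the experiments is the one described earlier in this chapter).

```python
import numpy as np

def sample_pairwise_constraints(labels, n_pairs, seed=0):
    """Illustrative sketch: draw `n_pairs` random document pairs and tag them as
    must-link (ML) if they share a ground-truth label, cannot-link (CL) otherwise.
    The actual experimental protocol may differ (e.g., balancing ML and CL pairs)."""
    rng = np.random.default_rng(seed)
    n_docs = len(labels)
    constraints = []
    while len(constraints) < n_pairs:
        i, j = rng.choice(n_docs, size=2, replace=False)
        link = "ML" if labels[i] == labels[j] else "CL"
        constraints.append((int(i), int(j), link))
    return constraints

# Example: 50 constraints from dummy labels, mirroring the DKM-e(50) setting.
dummy_labels = np.array([0, 0, 1, 1, 2, 2, 3, 3])
print(sample_pairwise_constraints(dummy_labels, n_pairs=50)[:5])
```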

On the contrary, the improvements obtained by tuning the λ1 hyperparameter are clearer. Indeed, in terms of Macro-average results, the proposed framework outperforms DKM, DCN and IDEC without pairwise constraints. Moreover, as in the setting with general hyperparameters, there is no clear trend when adding more constraints to the proposed framework, except on REUTERS. We can thus conclude that if a few labeled data points are available, tuning the hyperparameters allows us to obtain state-of-the-art results. Figures 5.5 and 5.6 illustrate how the clustering accuracy (ACC) changes with the λ1 hyperparameter for the REUTERS and 20NEWS datasets.

5.4.6.4 PCD2C User Constraints

We discussed earlier the details of the user experiment that was conducted; in this section we provide more details about the performance of the proposed PCD2C algorithm in that setting.

Figure 5.5: Evolution of the clustering accuracy (ACC) of PCD2C on REUTERS with respect to λ1, for different numbers of pairwise constraints.

Figure 5.6: Evolution of the clustering accuracy (ACC) of PCD2C on 20NEWS with respect to λ1, for different numbers of pairwise constraints.

Similar to the setting where true labels are used to derive must-link and cannot-link pairs, we used DKM [136], DCN [29] and IDEC [31] as the candidate deep clustering algorithms for evaluating the performance of the proposed PCD2C framework. In order to simulate a realistic scenario in which (possibly) no labeled documents are available, no hyperparameter tuning has been performed (as in the user experiments with the SD2C approach) and both λ0 and λ1 have been set to 10−5 for all variants of the PCD2C framework (DKM, DCN and IDEC). The results, presented in Tables 5.17 and 5.18, indicate that using DKM as the deep clustering algorithm within the proposed PCD2C framework leads to better results than using DCN or IDEC.

Indeed, only when DKM is selected as the deep clustering algorithm does PCD2C obtain better results than when no pairwise constraints are used (original DKM). Here, similar to the case where true labels are used to obtain the pairwise constraints, no clear trend is observed when increasing the number of constraints.

Table 5.17: Results of different versions of the PCD2C framework (DKM, DCN and IDEC) on a subset of DBPEDIA (25,000 documents) with human-selected pairwise constraints (Users 1 to 3). λ1 and λ0 have been set to 10−5 (general hyperparameters).

Model | User 1 (ACC / ARI / NMI) | User 2 (ACC / ARI / NMI) | User 3 (ACC / ARI / NMI)
DKM-e(50) | 86.0±1.1 / 71.4±1.5 / 72.7±0.7 | 86.3±1.0 / 71.8±1.3 / 72.9±0.9 | 84.3±6.0 / 71.6±3.6 / 72.8±0.7
DKM-e(100) | 86.0±1.3 / 71.4±1.8 / 72.7±1.1 | 85.8±0.7 / 70.9±0.8 / 72.3±0.8 | 85.8±1.0 / 71.0±1.4 / 72.4±0.9
DKM-e(300) | 85.9±1.2 / 71.1±1.7 / 72.4±1.2 | 85.9±1.2 / 71.2±1.6 / 72.5±1.1 | 86.2±1.1 / 71.7±1.6 / 72.9±0.9
DKM-c(50) | 80.3±9.0 / 64.8±9.9 / 68.6±5.5 | 81.8±8.8 / 67.0±8.6 / 69.7±4.6 | 80.3±8.9 / 64.7±10.2 / 68.6±5.7
DKM-c(100) | 82.6±8.4 / 67.8±8.9 / 70.4±4.9 | 83.6±6.5 / 68.3±6.6 / 70.5±3.8 | 83.4±6.8 / 68.2±7.3 / 70.4±4.3
DKM-c(300) | 84.1±6.1 / 68.9±6.8 / 70.9±4.0 | 83.8±7.2 / 68.8±7.5 / 70.8±4.4 | 82.2±8.1 / 67.0±8.8 / 69.9±5.0
DCN-e(50) | 83.7±1.3 / 67.4±1.5 / 71.0±1.3 | 83.6±0.8 / 67.4±1.2 / 70.8±0.9 | 83.4±1.3 / 67.4±1.7 / 70.8±1.3
DCN-e(100) | 83.7±1.0 / 67.6±1.7 / 71.0±1.4 | 83.5±0.8 / 67.4±1.3 / 70.7±1.0 | 83.5±1.0 / 67.4±1.6 / 70.8±1.3
DCN-e(300) | 83.9±1.5 / 67.9±2.2 / 71.0±1.4 | 84.0±1.3 / 68.1±1.9 / 71.2±1.2 | 83.7±1.4 / 67.5±2.0 / 70.7±1.2
DCN-c(50) | 83.7±1.1 / 67.6±1.8 / 71.0±1.4 | 83.4±1.3 / 67.2±2.1 / 70.7±1.5 | 83.9±1.3 / 67.9±1.8 / 71.1±1.0
DCN-c(100) | 84.0±1.2 / 67.9±1.7 / 71.0±1.2 | 83.8±1.5 / 67.6±2.2 / 70.8±1.3 | 83.8±1.4 / 67.6±2.1 / 70.7±1.4
DCN-c(300) | 83.4±1.0 / 67.2±1.7 / 70.7±1.2 | 83.6±0.9 / 67.5±1.4 / 71.0±1.2 | 83.7±1.0 / 67.7±1.7 / 71.1±1.3
IDEC-e(50) | 83.7±1.5 / 67.7±2.3 / 71.2±1.5 | 83.7±1.5 / 67.6±2.4 / 71.2±1.6 | 83.7±1.5 / 67.6±2.4 / 71.2±1.5
IDEC-e(100) | 83.7±1.5 / 67.6±2.3 / 71.1±1.5 | 83.7±1.5 / 67.6±2.4 / 71.2±1.6 | 83.7±1.5 / 67.6±2.4 / 71.2±1.6
IDEC-e(300) | 83.7±1.5 / 67.7±2.4 / 71.2±1.5 | 83.7±1.5 / 67.6±2.4 / 71.2±1.6 | 83.7±1.5 / 67.6±2.4 / 71.2±1.6
IDEC-c(50) | 75.8±8.3 / 56.8±9.1 / 64.0±5.6 | 74.2±10.3 / 55.2±10.5 / 63.5±6.7 | 74.9±10.0 / 55.9±11.2 / 63.8±6.9
IDEC-c(100) | 74.5±9.9 / 55.7±10.3 / 63.6±6.3 | 74.4±9.9 / 55.6±11.0 / 63.5±7.1 | 78.0±8.4 / 60.3±9.0 / 66.1±5.6
IDEC-c(300) | 77.9±8.1 / 59.8±8.1 / 65.8±5.2 | 75.3±10.2 / 57.2±10.2 / 64.4±6.3 | 74.2±10.1 / 55.9±10.6 / 63.9±6.5

Table 5.18: Results of different versions of the PCD2C framework (DKM, DCN and IDEC) on a subset of DBPEDIA (25,000 documents) with human-selected pairwise constraints (Users 4 and 5, and macro-average). λ1 and λ0 have been set to 10−5 (general hyperparameters).

Model | User 4 (ACC / ARI / NMI) | User 5 (ACC / ARI / NMI) | Macro-average (ACC / ARI / NMI)
DKM-e(50) | 86.3±1.3 / 71.7±1.8 / 72.8±1.0 | 86.0±1.0 / 71.2±1.4 / 72.5±0.9 | 85.7 / 71.5 / 72.7
DKM-e(100) | 86.3±0.9 / 71.6±1.2 / 72.7±0.8 | 86.1±0.9 / 71.3±1.2 / 72.5±1.0 | 86.0 / 71.2 / 72.5
DKM-e(300) | 85.8±0.9 / 70.9±1.2 / 72.2±0.9 | 85.9±1.0 / 71.2±1.3 / 72.6±0.9 | 85.9 / 71.2 / 72.8
DKM-c(50) | 82.2±7.1 / 67.0±8.6 / 69.8±4.9 | 81.9±8.9 / 67.0±8.9 / 69.7±5.1 | 81.3 / 66.1 / 69.2
DKM-c(100) | 82.4±8.3 / 67.3±9.0 / 70.1±5.0 | 81.5±8.7 / 66.8±8.2 / 69.8±4.6 | 82.7 / 67.6 / 70.2
DKM-c(300) | 82.0±8.8 / 67.3±8.7 / 70.0±4.8 | 84.2±6.2 / 69.2±7.5 / 71.0±4.4 | 83.2 / 68.2 / 70.5
DCN-e(50) | 83.5±1.1 / 67.8±1.8 / 70.8±1.6 | 83.6±0.9 / 67.2±1.8 / 70.7±1.0 | 83.3 / 67.4 / 70.8
DCN-e(100) | 83.8±1.4 / 67.8±2.1 / 70.9±1.3 | 83.7±1.5 / 67.6±2.1 / 70.8±1.2 | 83.6 / 67.5 / 70.8
DCN-e(300) | 83.5±0.9 / 67.3±1.5 / 70.7±1.2 | 83.7±1.6 / 67.6±2.3 / 70.8±1.4 | 83.7 / 67.6 / 70.8
DCN-c(50) | 83.6±0.8 / 67.5±1.3 / 70.8±1.1 | 84.0±1.5 / 68.1±2.0 / 71.1±1.3 | 83.7 / 67.6 / 70.9
DCN-c(100) | 84.0±1.5 / 68.0±2.2 / 70.9±1.5 | 83.6±1.6 / 67.5±2.3 / 70.7±1.4 | 83.8 / 67.7 / 70.8
DCN-c(300) | 83.3±1.1 / 67.0±1.8 / 70.6±1.3 | 83.8±1.6 / 67.7±2.3 / 70.7±1.4 | 83.5 / 67.4 / 70.8
IDEC-e(50) | 83.7±1.5 / 67.6±2.3 / 71.2±1.6 | 83.7±1.5 / 67.6±2.3 / 71.2±1.6 | 83.7 / 67.6 / 71.2
IDEC-e(100) | 83.7±1.5 / 67.6±2.4 / 71.2±1.6 | 83.7±1.5 / 67.6±2.3 / 71.2±1.6 | 83.7 / 67.6 / 71.2
IDEC-e(300) | 83.7±1.5 / 67.6±2.4 / 71.2±1.6 | 83.7±1.5 / 67.6±2.3 / 71.2±1.6 | 83.7 / 67.6 / 71.2
IDEC-c(50) | 76.1±9.3 / 57.0±10.1 / 64.4±6.2 | 73.8±10.7 / 56.0±11.4 / 63.8±6.5 | 74.9 / 56.0 / 64.1
IDEC-c(100) | 77.4±7.5 / 59.3±7.3 / 65.6±6.5 | 73.8±10.6 / 55.1±11.7 / 63.0±7.5 | 75.6 / 57.1 / 64.3
IDEC-c(300) | 76.3±9.3 / 58.2±9.9 / 65.1±6.1 | 73.4±9.9 / 54.3±10.9 / 62.8±6.6 | 75.4 / 57.8 / 64.4

5.5 Conclusion

In this chapter we proposed two frameworks, SD2C and PCD2C, which are able to include constraints in end-to-end deep clustering methods. SD2C-Rep, SD2C-Doc and SD2C-Att are able to include seed words in deep clustering methods, while PCD2C is able to include pairwise constraints. Moreover, we discussed that any deep clustering method, such as DKM, DCN or IDEC, can be used within the SD2C and PCD2C frameworks. To evaluate our proposed frameworks, we designed two different experiments: 1) automated constraint selection and 2) human constraint selection. The results of both automated and human constraint selection indicate that using constraints in an end-to-end deep clustering framework helps to obtain better results compared to when constraints are not used. Indeed, SD2C and PCD2C are able to bias the data representations (learned by an Autoencoder) through the constraints integrated into their frameworks. Moreover, the results also indicate that, in most cases, using DKM as the deep clustering algorithm in SD2C and PCD2C helps to obtain better results than using DCN or IDEC.

Chapter 6

Conclusion

6.1 Summary

In this thesis we proposed the DKM algorithm, in which learning data representations and cluster representatives is performed in a truly joint way. The results indicate that our proposed DKM approach is able to outperform DCN and IDEC in most cases. In fact, the results showed that optimizing the Autoencoder weights and the cluster representatives in a truly joint way yields better performance: the DKM objective function allows the gradients to be computed smoothly, which leads to better optimization, whereas similar methods such as DCN cannot compute the gradients smoothly. To validate our approach, we tested our DKM framework, DCN and IDEC on various types of data, including text and images. The obtained results confirmed our intuition that optimizing the Autoencoder weights and the cluster representatives jointly leads to better results.
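To make the idea of joint optimization concrete, the following sketch (in PyTorch, with toy dimensions, an arbitrary trade-off weight and a fixed softmax temperature alpha, all of which are assumptions rather than the settings used in the thesis) shows how replacing the hard k-means assignment with a softmax over negative distances lets a single backward pass update the Autoencoder weights and the cluster representatives at the same time.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_emb, k = 20, 5, 3
encoder = nn.Linear(d_in, d_emb)                 # toy encoder (stand-in for the Autoencoder)
decoder = nn.Linear(d_emb, d_in)
centers = nn.Parameter(torch.randn(k, d_emb))    # cluster representatives
alpha = 10.0                                     # softmax inverse temperature (annealed in DKM)

x = torch.randn(32, d_in)                        # a dummy mini-batch
z = encoder(x)
dist = torch.cdist(z, centers) ** 2              # squared distances to each representative
soft_assign = torch.softmax(-alpha * dist, dim=1)  # differentiable surrogate of the hard argmin
cluster_loss = (soft_assign * dist).sum(dim=1).mean()
recon_loss = ((decoder(z) - x) ** 2).mean()
loss = recon_loss + 0.1 * cluster_loss           # 0.1 is an arbitrary trade-off for this sketch

loss.backward()                                  # gradients reach encoder, decoder AND centers
print(centers.grad.norm(), encoder.weight.grad.norm())
```

The key point is that the cluster representatives receive gradients through the same differentiable loss as the encoder, which is what enables the truly joint optimization described above.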

Moreover, we tried to address the subjectivity aspect of clustering (more precisely, of deep clustering) by proposing two different approaches, named SD2C and PCD2C. In other words, we provided two deep clustering frameworks which can reflect user needs: users can provide seed words (SD2C) or pairwise constraints (PCD2C) as additional information, and both kinds of additional information can be handled by the proposed frameworks. SD2C is able to include seed-word constraints into any deep document clustering approach. In fact, by biasing the data representations (learned by an Autoencoder) towards the provided seed words, we can obtain results that better satisfy user needs. The results of this framework indicate that including such constraints is useful for obtaining more tailored results. In fact, if the provided seed words define each cluster well (e.g., football for a sport cluster), the representations are biased accordingly and the algorithm obtains better results; otherwise, if the quality of the seed words is poor, we may not obtain better results.
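As a rough illustration of this biasing idea (not the exact SD2C-Doc or SD2C-Rep losses, which are defined earlier in the thesis), one can add a penalty that pulls each document embedding toward the closest seed-word representation; the mean-pooling of seed-word vectors, the dimensions and the weight used below are all illustrative assumptions.

```python
import torch

torch.manual_seed(0)
n_docs, d_emb, k = 16, 5, 3
doc_emb = torch.randn(n_docs, d_emb, requires_grad=True)    # document representations
# One embedding per cluster, obtained here by mean-pooling (hypothetical) seed-word vectors.
seed_word_vecs = [torch.randn(4, d_emb) for _ in range(k)]  # 4 seed words per cluster
seed_emb = torch.stack([v.mean(dim=0) for v in seed_word_vecs])

dist = torch.cdist(doc_emb, seed_emb) ** 2
# Pull every document toward its closest seed representation.
seed_loss = dist.min(dim=1).values.mean()
lambda_seed = 1e-5                                          # weight of the constraint term
total_loss = lambda_seed * seed_loss                        # would be added to the clustering loss
total_loss.backward()
print(doc_emb.grad.abs().mean())
```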

Moreover, we devised two different ways of obtaining seed words, namely automatically selected seed words and manually selected ones. For the manual selection, we performed several experiments in which multiple participants were asked to extract seed words by reading a variety of sentences; in these experiments, the users did not receive any help from computers or external tools to select the seed words. The results obtained with both manually and automatically selected seed words confirmed the relevance of our automatic seed word extraction procedure. Moreover, the results obtained with manually selected seed words indicate that our proposed framework is able to handle real-world scenarios.
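For the automatic route, candidate seed words can be obtained by fitting a topic model and keeping the top words of each topic, in the spirit of the LDA-based selection used in our experiments. The sketch below relies on scikit-learn purely as an illustration; the toy corpus, the number of topics and the number of kept words are placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the team won the football match",
    "the striker scored a late goal",
    "the government passed a new law",
    "parliament debated the budget today",
]
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

# One LDA topic per target cluster; keep its most probable words as candidate seeds.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
vocab = vectorizer.get_feature_names_out()
n_seed_words = 3
for topic_id, topic in enumerate(lda.components_):
    top = [vocab[i] for i in topic.argsort()[::-1][:n_seed_words]]
    print(f"candidate seed words for cluster {topic_id}: {top}")
```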

We also discussed another type of constraint, namely pairwise constraints: if a pair of documents is similar (the two documents belong to the same cluster), the pair is labeled as must-link (ML); otherwise, if they belong to different clusters, it is labeled as cannot-link (CL). The PCD2C approach has been designed to include pairwise constraints provided by the users into any deep document clustering approach. In this case, by including ML and CL constraints into the PCD2C framework, we bias the data representations learned by an Autoencoder (similarly to SD2C).
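A common way to turn such ML/CL pairs into a differentiable penalty, shown here only as a generic sketch rather than the exact PCD2C formulation given earlier in the thesis, is to minimize the embedding distance of ML pairs and to push CL pairs at least a margin apart; the margin and the weight λ1 below are illustrative assumptions.

```python
import torch

def pairwise_constraint_loss(emb, constraints, margin=1.0):
    """Generic ML/CL penalty on document embeddings `emb` (n_docs x d).
    `constraints` is a list of (i, j, "ML"/"CL") triples."""
    loss = emb.new_zeros(())
    for i, j, link in constraints:
        d = torch.norm(emb[i] - emb[j])
        if link == "ML":
            loss = loss + d ** 2                               # pull similar documents together
        else:
            loss = loss + torch.clamp(margin - d, min=0) ** 2  # push dissimilar ones apart
    return loss / max(len(constraints), 1)

emb = torch.randn(6, 5, requires_grad=True)
constraints = [(0, 1, "ML"), (0, 2, "CL"), (3, 4, "ML")]
lambda_1 = 1e-5                                                # same order of magnitude as in the experiments
(lambda_1 * pairwise_constraint_loss(emb, constraints)).backward()
print(emb.grad.abs().sum())
```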

Similar to SD2C, we also performed several experiments in which different participants were asked to assign ML and CL labels to multiple pairs of documents. The manually selected pairwise constraints were then fed into the PCD2C framework to measure the performance of the algorithm.

The results obtained with both manually and automatically selected pairwise constraints showed that PCD2C is able to integrate pairwise constraints in a useful way and to produce more tailored results. Compared to SD2C, PCD2C is less affected by poor choices made by the user; in other words, the variance of the results of PCD2C is lower than that of SD2C.

Another strong aspect of all the proposed frameworks, including DKM, SD2C and PCD2C, is that no hyperparameter tuning has been performed for the Autoencoder part. Certainly, better results could be obtained through hyperparameter tuning.

6.2 Future work

In this section, we discuss future work that might follow from this thesis. First, we discuss future work on DKM and, more generally, on deep clustering approaches without seed words; we then discuss possible future work on deep clustering approaches that integrate seed words.

6.2.1 Deep Clustering Without Seed Words

In the case of deep clustering approaches (without constraints), clustering algorithms other than k-means can be used and tested. Indeed, in this thesis we focused on designing an alternative objective function for k-means which allows gradients to be computed; many other clustering algorithms could be redesigned to be used in an end-to-end deep clustering framework. Moreover, the capabilities of memory-based neural network algorithms could be investigated. In this case, different representations per cluster can be defined and placed in a virtual memory (a matrix); similar to [138], reading from and writing to the memory can then be defined. Different types of Autoencoders can be explored as well. Due to infrastructure limitations, we were not able to explore Seq2Seq Autoencoders, as these networks tend to process the input sequence token by token (when RNN-based networks are used in the Autoencoder architecture). Since such Autoencoders seem particularly beneficial when the input data is text, trying and testing such models is recommended. Moreover, we used the Word2Vec method for word representations. As mentioned in the previous chapters, we also used fastText, but the results obtained with Word2Vec were noticeably better than those obtained with fastText. In the future, more recent techniques such as BERT [139] could be used. Since this method provides a representation of the whole document (along with the representation of each word in the document), it is quite possible that using it would yield better results.
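Since the document representations in our experiments were built from Word2Vec word vectors, a simple point of comparison with BERT-style document embeddings is mean pooling of word vectors, sketched below with made-up vectors (the actual preprocessing and pooling used in the thesis are described in the previous chapters).

```python
import numpy as np

# Hypothetical pretrained word vectors (in practice, loaded from a Word2Vec model).
word_vectors = {
    "football": np.array([0.9, 0.1, 0.0]),
    "goal":     np.array([0.8, 0.2, 0.1]),
    "budget":   np.array([0.1, 0.9, 0.3]),
}

def doc_embedding(tokens, word_vectors, dim=3):
    """Mean-pool the vectors of in-vocabulary tokens; zero vector if none are known."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

print(doc_embedding(["football", "goal", "tonight"], word_vectors))
```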

6.2.2 Deep Clustering Including Seed Words

In the case of using seed words in the deep clustering architecture, the same future directions mentioned for deep clustering without seed words can be explored.

Moreover, in this case, a combination of SD2C-Doc and SD2C-Rep could be investigated: their losses can be summed, or a new formulation based on both losses can be defined. Using attention mechanisms in such frameworks can also be explored. In this thesis, we tried a single version of an attention-based model, but this model suffers from high complexity and did not produce good results; reformulating the current model and reducing its complexity is a direction that we could explore as well.

Bibliography

[1] J. Macqueen. Some methods for classification and analysis of multivariate observations. In 5-th Berkeley Symposium on Mathematical Statistics and Probability, pages 281–297, 1967. (Cited on pages 2, 6, 42, and 69.)

[2] Lior Rokach and Oded Maimon. The data mining and knowledge discovery handbook: A complete guide for researchers and practitioners, 2005. (Cited on page 5.)

[3] Anil K. Jain. Data clustering: 50 years beyond k-means, 2008. (Cited on page 5.)

[4] Amandeep Kaur Mann and Navneet Kaur. Paper on clustering techniques. 2013. (Cited on page 6.)

[5] K. A. Nazeer and M. Sebastian. Improving the accuracy and efficiency of the k-means clustering algorithm. Lecture Notes in Engineering and Computer Science, 2176, 07 2009. (Cited on page 6.)

[6] David Arthur and Sergei Vassilvitskii. K-means++: the advantages of careful seeding. In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, 2007. (Cited on pages 6 and 42.)

[7] Rui Xu and D. Wunsch. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3):645–678, May 2005. (Cited on page 6.)

[8] Shi Li and Xiangyu Guo. Distributed k-clustering for data with heavy noise. In Advances in Neural Information Processing Systems, pages 7838–7846, 2018. (Cited on page 6.)

[9] S. Na, L. Xumin, and G. Yong. Research on k-means clustering algorithm: An improved k-means clustering algorithm. In 2010 Third International Symposium on Intelligent Information Technology and Security Informatics, pages 63–67, April 2010. (Cited on page 7.)

[10] James C Bezdek, Robert Ehrlich, and William Full. Fcm: The fuzzy c-means clustering algorithm. Computers & Geosciences, 10(2-3):191–203, 1984. (Cited on pages 7 and 36.)

[11] R. R. Yager and D. P. Filev. Approximate clustering via the mountain method. IEEE Transactions on Systems, Man, and Cybernetics, 24(8):1279–1284, Aug 1994. (Cited on page 7.)

[12] I. Gath and A. B. Geva. Unsupervised optimal fuzzy clustering. IEEE Trans. Pattern Anal. Mach. Intell., 11(7):773–780, July 1989. (Cited on page 7.)

[13] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML'15, pages 1180–1189. JMLR.org, 2015. (Cited on page 8.)

[14] Peicheng Zhou, Junwei Han, Gong Cheng, and Baochang Zhang. Learning compact and discriminative stacked autoencoder for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 2019. (Cited on page 8.)

[15] S Mostafa Mousavi, Weiqiang Zhu, William Ellsworth, and Gregory Beroza. Unsupervised clustering of seismic signals using deep convolutional autoencoders. IEEE Geoscience and Remote Sensing Letters, 2019. (Cited on page 8.)

[16] FJ Martinez-Murcia, A Ortiz, JM Gorriz, J Ramirez, and D Castillo-Barnes. Studying the manifold structure of alzheimer's disease: A deep learning approach using convolutional autoencoders. IEEE Journal of Biomedical and Health Informatics, 2019. (Cited on page 8.)

[17] Lei Cai, Hongyang Gao, and Shuiwang Ji. Multi-stage variational auto-encoders for coarse-to-fine image generation. In Proceedings of the 2019 SIAM International Conference on Data Mining, pages 630–638. SIAM, 2019. (Cited on page 8.)

[18] Shigeki Karita, Shinji Watanabe, Tomoharu Iwata, Marc Delcroix, Atsunori Ogawa, and Tomohiro Nakatani. Semi-supervised end-to-end speech recognition using text-to-speech and autoencoders. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6166–6170. IEEE, 2019. (Cited on page 8.)

[19] Yu-An Chung, Wei-Hung Weng, Schrasing Tong, and James Glass. Towards unsupervised speech-to-text translation. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7170–7174. IEEE, 2019. (Cited on page 8.)

[20] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014. (Cited on page 9.)

[21] Guillaume Alain and Yoshua Bengio. What regularized auto-encoders learn from the data-generating distribution. The Journal of Machine Learning Research, 15(1):3563–3593, 2014. (Cited on page 9.)

[22] Yoshua Bengio, Li Yao, Guillaume Alain, and Pascal Vincent. Generalized denoising auto-encoders as generative models. In Advances in Neural Information Processing Systems, pages 899–907, 2013. (Cited on page 9.)

[23] Yann LeCun, Bernhard E Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne E Hubbard, and Lawrence D Jackel. Handwritten digit recognition with a back-propagation network. In Advances in neural information processing systems, pages 396–404, 1990. (Cited on page 9.)

[24] Deng Cai, Xiaofei He, and Jiawei Han. Locally consistent concept factorization for document clustering. IEEE Transactions on Knowledge and Data Engineering, 23(6):902–913, 2010. (Cited on page 10.)

[25] Harold W Kuhn. The hungarian method for the assignment prob- lem. Naval research logistics quarterly, 2(1-2):83–97, 1955. (Cited on page 10.)

[26] Nguyen Xuan Vinh, Julien Epps, and James Bailey. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11(Oct):2837–2854, 2010. (Cited on page 11.)

[27] E. Min, X. Guo, Q. Liu, G. Zhang, J. Cui, and J. Long. A survey of clustering with deep learning: From the perspective of network architecture. IEEE Access, 6:39501–39514, 2018. (Cited on page 13.)

[28] Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Man- zagol, Pascal Vincent, and Samy Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Re- search, 11(Feb):625–660, 2010. (Cited on page 14.)

[29] Bo Yang, Xiao Fu, Nicholas D Sidiropoulos, and Mingyi Hong. Towards k-means-friendly spaces: Simultaneous deep learning and clustering. In Proceedings of the 34th International Conference on Machine Learning- Volume 70, pages 3861–3870. JMLR. org, 2017. (Cited on pages 14, 32, 37, 42, 67, 69, 75, 76, 82, and 83.)

[30] Peihao Huang, Yan Huang, Wei Wang, and Liang Wang. Deep embed- ding network for clustering. In 2014 22nd International Conference on Pattern Recognition, pages 1532–1537. IEEE, 2014. (Cited on pages 14 and 37.)

[31] Xifeng Guo, Long Gao, Xinwang Liu, and Jianping Yin. Improved deep embedded clustering with local structure preservation. In IJCAI, pages 1753–1759, 2017. (Cited on pages 14, 37, 39, 42, 43, 67, 69, 75, 76, 82, and 83.)

[32] Dongdong Chen, Jiancheng Lv, and Yi Zhang. Unsupervised multi- manifold clustering by learning deep representation. In Workshops at the Thirty-First AAAI Conference on Artificial Intelligence, 2017. (Cited on page 15.)

[33] Kai Tian, Shuigeng Zhou, and Jihong Guan. Deepcluster: A general clustering framework based on deep learning. In Joint European Con- ference on Machine Learning and Knowledge Discovery in Databases, pages 809–825. Springer, 2017. (Cited on page 15.)

[34] Christopher M Bishop. Pattern recognition and machine learning. springer, 2006. (Cited on page 15.)

[35] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011. (Cited on page 15.)

[36] Jinghua Wang and Jianmin Jiang. An unsupervised deep learning framework via integrated optimization of representation learning and gmm-based modeling. In Asian Conference on Computer Vision, pages 249–265. Springer, 2018. (Cited on page 15.)

[37] Ming Shao, Sheng Li, Zhengming Ding, and Yun Fu. Deep linear coding for fast graph clustering. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015. (Cited on page 16.)

[38] Fei Tian, Bin Gao, Qing Cui, Enhong Chen, and Tie-Yan Liu. Learning deep representations for graph clustering. In Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014. (Cited on page 16.)

[39] Sharon Fogel, Hadar Averbuch-Elor, Daniel Cohen-Or, and Jacob Gold- berger. Clustering-driven deep embedding with pairwise constraints. IEEE computer graphics and applications, 39(4):16–27, 2019. (Cited on page 16.)

[40] Elad Tzoreff, Olga Kogan, and Yoni Choukroun. Deep discriminative latent space for clustering. arXiv preprint arXiv:1805.10795, 2018. (Cited on page 16.)

[41] Mengqi Wu, Guannan Liu, Junjie Wu, Peng Li, and Xin Wan. Ensemble clustering via learning representations from auto-encoder. In 2018 15th International Conference on Service Systems and Service Management (ICSSSM), pages 1–6. IEEE, 2018. (Cited on page 16.)

[42] Shlomo E Chazan, Sharon Gannot, and Jacob Goldberger. Deep clustering based on a mixture of autoencoders. arXiv preprint arXiv:1812.06535, 2018. (Cited on page 16.)

[43] P. Dahal. Learning embedding space for clustering from deep repre- sentations. In 2018 IEEE International Conference on Big Data (Big Data), pages 3747–3755, Dec 2018. (Cited on page 16.)

[44] D. Das, R. Ghosh, and B. Bhowmick. Deep representation learning characterized by inter-class separation for image clustering. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 628–637, Jan 2019. (Cited on page 16.)

[45] Xu Yang, Cheng Deng, Feng Zheng, Junchi Yan, and Wei Liu. Deep spectral clustering using dual autoencoder network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4066–4075, 2019. (Cited on page 16.)

[46] Pengfei Ge, Chuan-Xian Ren, Dao-Qing Dai, Jiashi Feng, and Shuicheng Yan. Dual adversarial autoencoders for clustering. IEEE Transactions on Neural Networks and Learning Systems, 2019. (Cited on page 17.)

[47] Nairouz Mrabah, Naimul Khan, and Riadh Ksantini. Deep clustering with a dynamic autoencoder, 01 2019. (Cited on page 17.)

[48] Chris Ding and Xiaofeng He. K-means clustering via principal compo- nent analysis. In Proceedings of the twenty-first international conference on Machine learning, page 29, 2004. (Cited on page 17.)

[49] W. Wang, D. Yang, F. Chen, Y. Pang, S. Huang, and Y. Ge. Clustering with orthogonal autoencoder. IEEE Access, 7:62421–62432, 2019. (Cited on page 17.)

[50] Milad Leyli Abadi, Lazhar Labiod, and Mohamed Nadif. Denoising autoencoder as an effective dimensionality reduction and clustering of text data. pages 801–813, 04 2017. (Cited on page 17.)

[51] Shuyang Wang, Zhengming Ding, and Yun Fu. Feature selection guided auto-encoder. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI’17, pages 2725–2731. AAAI Press, 2017. (Cited on page 17.)

[52] Zhuxi Jiang, Yin Zheng, Huachun Tan, Bangsheng Tang, and Hanning Zhou. Variational deep embedding: An unsupervised and generative approach to clustering. arXiv preprint arXiv:1611.05148, 2016. (Cited on page 17.)

[53] Warith Harchaoui, Pierre-Alexandre Mattei, and Charles Bouveyron. Deep adversarial gaussian mixture auto-encoder for clustering. 2017. (Cited on page 17.)

[54] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfel- low, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015. (Cited on page 17.)

[55] Jost Tobias Springenberg. Unsupervised and semi-supervised learn- ing with categorical generative adversarial networks. arXiv preprint arXiv:1511.06390, 2015. (Cited on page 17.)

[56] W. Zhou and Q. Zhou. Deep embedded clustering with adversarial distribution adaptation. IEEE Access, 7:113801–113809, 2019. (Cited on page 17.)

[57] Xi Peng, Shijie Xiao, Jiashi Feng, Wei-Yun Yau, and Zhang Yi. Deep subspace clustering with sparsity prior. In IJCAI, pages 1925–1931, 2016. (Cited on page 19.)

[58] Xi Peng, Jiashi Feng, Jiwen Lu, Wei-Yun Yau, and Zhang Yi. Cascade subspace clustering. In Thirty-First AAAI Conference on Artificial Intelligence, 2017. (Cited on page 19.)

[59] Ehsan Elhamifar and Rene Vidal. Sparse subspace clustering: Algo- rithm, theory, and applications. IEEE transactions on pattern analysis and machine intelligence, 35(11):2765–2781, 2013. (Cited on page 19.)

[60] Shankar R. Rao, Roberto Tron, René Vidal, and Yi Ma. Motion segmentation in the presence of outlying, incomplete, or corrupted trajectories. IEEE Trans. Pattern Anal. Mach. Intell., 32(10):1832–1845, 2010. (Cited on page 19.)

[61] Pan Ji, Tong Zhang, Hongdong Li, Mathieu Salzmann, and Ian Reid. Deep subspace clustering networks. In Advances in Neural Information Processing Systems, pages 24–33, 2017. (Cited on page 19.)

[62] Changlu Chen, Chaoxi Niu, Xia Zhan, and Kun Zhan. A genera- tive approach to unsupervised deep local learning. arXiv preprint arXiv:1906.07947, 2019. (Cited on page 19.)

[63] Sohil Atul Shah and Vladlen Koltun. Deep continuous clustering. arXiv preprint arXiv:1803.01449, 2018. (Cited on page 19.)

[64] Sohil Atul Shah and Vladlen Koltun. Robust continuous clustering. Proceedings of the National Academy of Sciences, 114(37):9814–9819, 2017. (Cited on page 19.)

[65] Ershad Banijamali and Ali Ghodsi. Fast spectral clustering using au- toencoders and landmarks. In International Conference Image Analysis and Recognition, pages 380–388. Springer, 2017. (Cited on page 19.)

[66] Séverine Affeldt, Lazhar Labiod, and Mohamed Nadif. Spectral clustering via ensemble deep autoencoder learning (sc-edae). arXiv preprint arXiv:1901.02291, 2019. (Cited on page 19.)

[67] Uri Shaham, Kelly Stanton, Henry Li, Boaz Nadler, Ronen Basri, and Yuval Kluger. Spectralnet: Spectral clustering using deep neural networks. arXiv preprint arXiv:1801.01587, 2018. (Cited on page 19.)

[68] Kamran Ghasedi Dizaji, Amirhossein Herandi, Cheng Deng, Weidong Cai, and Heng Huang. Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization. In Proceedings of the IEEE International Conference on Computer Vision, pages 5736–5745, 2017. (Cited on page 20.)

[69] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Gen- erative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014. (Cited on page 20.)

[70] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised rep- resentation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015. (Cited on page 20.)

[71] Kamran Ghasedi Dizaji, Feng Zheng, Najmeh Sadoughi, Yanhua Yang, Cheng Deng, and Heng Huang. Unsupervised deep generative ad- versarial hashing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3664–3673, 2018. (Cited on page 20.)

[72] Hanjiang Lai, Yan Pan, Ye Liu, and Shuicheng Yan. Simultaneous feature learning and hash coding with deep neural networks. In Proceed- ings of the IEEE conference on computer vision and pattern recognition, pages 3270–3278, 2015. (Cited on page 20.)

[73] Krishna Kumar Singh, Utkarsh Ojha, and Yong Jae Lee. Finegan: Unsupervised hierarchical disentanglement for fine-grained object gen- eration and discovery. arXiv preprint arXiv:1811.11155, 2018. (Cited on page 20.)

[74] Thomas Brox, Andrés Bruhn, and Mario Fritz, editors. Pattern Recognition - 40th German Conference, GCPR 2018, Stuttgart, Germany, October 9-12, 2018, Proceedings, volume 11269 of Lecture Notes in Computer Science. Springer, 2019. (Cited on page 21.)

[75] X. Shaol, K. Ge, H. Su, L. Luo, B. Peng, and D. Li. Deep discriminative clustering network. In 2018 International Joint Conference on Neural Networks (IJCNN), pages 1–7, July 2018. (Cited on page 21.)

[76] Chih-Chung Hsu and Chia-Wen Lin. Cnn-based joint clustering and representation learning with feature drift compensation for large-scale image data. IEEE Transactions on Multimedia, 20(2):421–429, 2017. (Cited on page 21.)

[77] Jianwei Yang, Devi Parikh, and Dhruv Batra. Joint unsupervised learning of deep representations and image clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5147–5156, 2016. (Cited on page 21.)

[78] Jianlong Chang, Lingfeng Wang, Gaofeng Meng, Shiming Xiang, and Chunhong Pan. Deep adaptive image clustering. In Proceedings of the IEEE International Conference on Computer Vision, pages 5879–5887, 2017. (Cited on page 21.)

[79] Weihua Hu, Takeru Miyato, Seiya Tokui, Eiichi Matsumoto, and Masashi Sugiyama. Learning discrete representations via informa- tion maximizing self-augmented training. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1558– 1567. JMLR. org, 2017. (Cited on page 21.)

[80] Andreas Krause, Pietro Perona, and Ryan G Gomes. Discriminative clustering by regularized information maximization. In Advances in neural information processing systems, pages 775–783, 2010. (Cited on page 21.)

[81] Gang Chen. Deep learning with nonparametric clustering. arXiv preprint arXiv:1501.03084, 2015. (Cited on page 21.)

[82] Geoffrey E. Hinton. Deep belief networks. Scholarpedia, 4:5947, 01 2009. (Cited on page 21.)

[83] Hamed Valizadegan and Rong Jin. Generalized maximum margin clustering and unsupervised kernel learning. In Advances in neural information processing systems, pages 1417–1424, 2007. (Cited on page 21.)

[84] Wei-An Lin, Jun-Cheng Chen, Carlos D Castillo, and Rama Chellappa. Deep density clustering of unconstrained faces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8128–8137, 2018. (Cited on page 22.)

[85] Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering analysis. In International conference on machine learning, pages 478–487, 2016. (Cited on pages 22, 37, 39, 42, 43, 75, and 76.)

[86] Xifeng Guo, En Zhu, Xinwang Liu, and Jianping Yin. Deep embedded clustering with data augmentation. In Asian Conference on Machine Learning, pages 550–565, 2018. (Cited on page 22.)

[87] Fengfu Li, Hong Qiao, and Bo Zhang. Discriminatively boosted image clustering with fully convolutional auto-encoders. Pattern Recognition, 83:161–173, 2018. (Cited on page 22.)

[88] Yen-Chang Hsu and Zsolt Kira. Neural network-based clustering using pairwise constraints. arXiv preprint arXiv:1511.06321, 2015. (Cited on page 22.)

[89] Thi-Bich-Hanh Dao, Chia-Tung Kuo, SS Ravi, Christel Vrain, and Ian Davidson. Descriptive clustering: Ilp and cp formulations with applications. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 1263–1269, 2018. (Cited on page 27.)

[90] Bing Liu, Xiaoli Li, Wee Sun Lee, and Philip S. Yu. Text Classification by Labeling Words. In Proceedings of AAAI/IAAI, pages 425–430, 2004. (Cited on page 27.)

[91] Youngjoong Ko and Jungyun Seo. Learning with Unlabeled Data for Text Categorization Using Bootstrapping and Feature Projection Techniques. In Proceedings of ACL, pages 255–262, 2004. (Cited on page 27.)

[92] Ming-Wei Chang, Lev Ratinov, Dan Roth, and Vivek Srikumar. Im- portance of Semantic Representation: Dataless Classification. In Pro- ceedings of AAAI, pages 830–835, 2008. (Cited on pages 27 and 69.)

[93] Gregory Druck, Gideon Mann, and Andrew Mccallum. Learning from Labeled Features using Generalized Expectation Criteria. In Proceedings of SIGIR, pages 595–602, 2008. (Cited on page 27.)

[94] Alfio Gliozzo, Carlo Strapparava, and Ido Dagan. Improving Text Categorization Bootstrapping via Unsupervised Learning. ACM Trans- actions on Speech and Language Processing, 6(1):1–24, 2009. (Cited on page 27.)

[95] Xingyuan Chen, Yunqing Xia, Peng Jin, and John Carroll. Dataless Text Classification with Descriptive LDA. In Proceedings of AAAI, pages 2224–2231, 2015. (Cited on pages 27, 60, 68, and 70.)

[96] Chenliang Li, Jian Xing, Aixin Sun, and Zongyang Ma. Effective Doc- ument Labeling with Very Few Seed Words: A Topic Model Approach. In Proceedings of CIKM, pages 85–94, 2016. (Cited on pages 27, 28, 60, 68, 69, and 70.)

[97] Ximing Li, Changchun Li, Jinjin Chi, Jihong Ouyang, and Chenliang Li. Dataless Text Classification: A Topic Modeling Approach with Document Manifold. In Proceedings of CIKM, pages 973–982, 2018. (Cited on pages 27, 28, 60, 68, 69, and 70.)

[98] Chenliang Li, Shiqian Chen, Jian Xing, Aixin Sun, and Zongyang Ma. Seed-Guided Topic Model for Document Filtering and Classification. ACM Transactions on Information Systems, 37(1), 2018. (Cited on pages 27, 28, 60, 68, and 70.)

[99] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, 2003. (Cited on pages 27 and 61.)

[100] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013. (Cited on pages 28 and 39.)

[101] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global Vectors for Word Representation. In Proceedings of EMNLP, pages 1532–1543, 2014. (Cited on page 28.)

[102] Sugato Basu, Ian Davidson, and Kiri Wagstaff. Constrained Cluster- ing: Advances in Algorithms, Theory, and Applications. Chapman & Hall/CRC, first edition, 2008. (Cited on pages 28 and 60.)

[103] H. Zhang, S. Basu, and I. Davidson. A framework for deep constrained clustering - algorithms and advances. In ECML/PKDD, 2019. (Cited on page 28.)

[104] Noam Shental, Aharon Bar-Hillel, Tomer Hertz, and Daphna Weinshall. Computing Gaussian Mixture Models with EM using Equivalence Constraints. In Proceedings of NeurIPS, pages 465–472, 2003. (Cited on pages 28 and 60.)

[105] Kiri Wagstaff, Claire Cardie, Seth Rogers, Stefan Schrödl, et al. Constrained k-means clustering with background knowledge. In ICML, volume 1, pages 577–584, 2001. (Cited on pages 28, 29, and 66.)

[106] Yi Liu, Rong Jin, and Anil Jain. Boostcluster: Boosting clustering by pairwise constraints. pages 450–459, 01 2007. (Cited on pages 28 and 66.)

[107] Yang Hu, Jingdong Wang, Nenghai Yu, and Xian-Sheng Hua. Maximum margin clustering with pairwise constraints. In 2008 Eighth IEEE International Conference on Data Mining, pages 253–262. IEEE, 2008. (Cited on pages 28 and 29.)

[108] Hong Zeng and Yiu-ming Cheung. Semi-supervised maximum margin clustering with pairwise constraints. IEEE Transactions on Knowledge and Data Engineering, 24(5):926–939, 2011. (Cited on pages 28 and 29.)

[109] Sugato Basu, Arindam Banerjee, and Raymond J Mooney. Active semi-supervision for pairwise constrained clustering. In Proceedings of the 2004 SIAM international conference on data mining, pages 333–344. SIAM, 2004. (Cited on pages 28 and 29.)

[110] Linli Xu, James Neufeld, Bryce Larson, and Dale Schuurmans. Maxi- mum margin clustering. In Advances in neural information processing systems, pages 1537–1544, 2005. (Cited on page 29.)

[111] M. A. Hearst, S. T. Dumais, E. Osuna, J. Platt, and B. Scholkopf. Sup- port vector machines. IEEE Intelligent Systems and their Applications, 13(4):18–28, July 1998. (Cited on page 29.)

[112] Eric Jang, Shixiang Gu, and Ben Poole. Categorical Reparameteriza- tion with Gumbel-Softmax. In Proceedings of the 5th International Conference on Learning Representations, ICLR ’17, 2017. (Cited on page 35.)

[113] Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. In Proceedings of the 5th International Conference on Learning Repre- sentations, ICLR ’17, 2017. (Cited on page 35.)

[114] Kenneth Rose, Eitan Gurewitz, and Geoffrey Fox. A Determinis- tic Annealing Approach to Clustering. Pattern Recognition Letters, 11(9):589–594, 1990. (Cited on page 36.)

[115] Juan Ramos. Using tf-idf to determine word relevance in document queries. (Cited on page 39.)

[116] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification, 2016. (Cited on page 39.)

[117] Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In International conference on machine learning, pages 1188–1196, 2014. (Cited on page 39.)

[118] Xingyuan Chen, Yunqing Xia, Peng Jin, and John Carroll. Dataless text classification with descriptive lda. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015. (Cited on page 40.)

[119] Chenliang Li, Shiqian Chen, Jian Xing, Aixin Sun, and Zongyang Ma. Seed-guided topic model for document filtering and classification. ACM Trans. Inf. Syst., 37(1):9:1–9:37, December 2018. (Cited on page 40.)

[120] Chenliang Li, Jian Xing, Aixin Sun, and Zongyang ma. Effective document labeling with very few seed words: A topic model approach. pages 85–94, 10 2016. (Cited on page 40.)

[121] Ximing Li, Changchun Li, Jinjin Chi, Jihong Ouyang, and Chenliang Li. Dataless text classification: A topic modeling approach with document manifold. pages 973–982, 10 2018. (Cited on page 40.)

[122] Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level con- volutional networks for text classification. In Advances in neural information processing systems, pages 649–657, 2015. (Cited on pages 41 and 68.)

[123] Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy Layer-Wise Training of Deep Networks. In Proceedings of the 20th Annual Conference on Neural Information Processing Systems, NIPS ’06, pages 153–160, 2006. (Cited on page 42.)

[124] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814, 2010. (Cited on page 43.)

[125] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thir- teenth international conference on artificial intelligence and statistics, pages 249–256, 2010. (Cited on pages 43 and 76.)

[126] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. (Cited on pages 43 and 76.)

[127] Deng Cai, Xiaofei He, and Jiawei Han. Locally Consistent Concept Factorization for Document Clustering. IEEE Transactions on Knowl- edge and Data Engineering, 23(6):902–913, 2011. (Cited on pages 46 and 76.)

[128] Laurens van der Maaten and Geoffrey Hinton. Visualizing Data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008. (Cited on page 56.)

[129] MM Fard, T Thonet, and E Gaussier. Seed-guided deep document clustering. Advances in Information Retrieval (ECIR 2020), Lecture Notes in Computer Science, 12035:3–16, 2020. (Cited on page 59.)

[130] Maziar Moradi Fard, Thibaut Thonet, and Eric Gaussier. Pairwise- constrained deep document clustering. In Igor Kabashkin, Irina Yatskiv, and Olegas Prentkovskis, editors, Reliability and Statistics in Trans- portation and Communication, pages 12–21, Cham, 2020. Springer International Publishing. (Cited on pages 59 and 66.)

[131] Donglin Niu, Jennifer G Dy, and Michael I Jordan. Multiple Non- Redundant Spectral Clustering Views. In Proceedings of ICML, pages 831–838, 2010. (Cited on page 60.)

[132] Kiri Wagstaff, Claire Cardie, Seth Rogers, and Stefan Schrödl. Constrained K-means Clustering with Background Knowledge. In Proceedings of ICML, pages 577–584, 2001. (Cited on page 60.)

[133] Elie Aljalbout, Vladimir Golkov, Yawar Siddiqui, Maximilian Strobel, and Daniel Cremers. Clustering with deep learning: Taxonomy and new methods. arXiv preprint arXiv:1801.07648, 2018. (Cited on page 62.)

[134] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014. (Cited on page 64.)

[135] Noam Shental, Aharon Bar-Hillel, Tomer Hertz, and Daphna Wein- shall. Computing gaussian mixture models with em using equivalence constraints. In Advances in neural information processing systems, pages 465–472, 2004. (Cited on page 66.)

[136] Maziar Moradi Fard, Thibaut Thonet, and Eric Gaussier. Deep k- Means: Jointly Clustering with k-Means and Learning Representations. arXiv:1806.10069, 2018. (Cited on pages 67, 69, 75, 76, 82, and 83.)

[137] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. Journal of machine Learning research, 3(Jan):993–1022, 2003. (Cited on page 73.)

[138] Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476, 2016. (Cited on page 89.)

[139] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding, 2018. (Cited on page 89.)