Multi-View Learning for Hierarchical Topic Detection on Corpus of Documents
Total Page:16
File Type:pdf, Size:1020Kb
Multi-view learning for hierarchical topic detection on corpus of documents Juan Camilo Calero Espinosa Universidad Nacional de Colombia Facultad de Ingenieria, Departamento de Ingenieria de Sistemas e Industrial. Bogot´a,Colombia 2021 Multi-view learning for hierarchical topic detection on corpus of documents Juan Camilo Calero Espinosa Tesis presentada como requisito parcial para optar al t´ıtulode: Magister en Ingenier´ıade Sistemas y Computaciøn´ Director: Ph.D. Luis Fernando Ni~noV. L´ıneade Investigaci´on: Procesamiento de lenguaje natural Grupo de Investigaci´on: Laboratorio de investigaci´onen sistemas inteligentes - LISI Universidad Nacional de Colombia Facultad de Ingenieria, Departamento de Ingenieria en Sistemas e Industrial. Bogot´a,Colombia 2021 To my parents Maria Helena and Jaime. To my aunts Patricia and Rosa. To my grandmothers Lilia and Santos. Acknowledgements To Camilo Alberto Pino, as the original thesis idea was his, and for his invaluable teaching of multi-view learning. To my thesis advisor, Luis Fernando Ni~no,and the Laboratorio de investigaci´onen sistemas inteligentes - LISI, for constantly allowing me to learn new knowl- edge, and for their valuable recommendations on the thesis. V Abstract Topic detection on a large corpus of documents requires a considerable amount of com- putational resources, and the number of topics increases the burden as well. However, even a large number of topics might not be as specific as desired, or simply the topic quality starts decreasing after a certain number. To overcome these obstacles, we propose a new method- ology for hierarchical topic detection, which uses multi-view clustering to link different topic models extracted from document named entities and part of speech tags. Results on three different datasets evince that the methodology decreases the memory cost of topic detection, improves topic quality and allows the detection of more topics. Keywords| Named entities, POS tagging, topic detection, multi-view clustering, graph fusion Resumen La detecci´onde temas en grandes colecciones de documentos requiere una considerable cantidad de recursos computacionales, y el n´umerode temas tambi´enpuede aumentar la carga computacional. Incluso con un elevado n`umerode temas, estos pueden no ser tan es- pec´ıficos como se desea, o simplemente la calidad de los temas comienza a disminuir despu´es de cierto n´umero. Para superar estos obst´aculos,proponemos una nueva metodolog´ıapara la detecci´onjer´arquicade temas, que utiliza agrupamiento multi-vista para vincular diferentes modelos de temas extra´ıdosde las partes del discurso y de las entidades nombradas de los documentos. Los resultados en tres conjuntos de documentos muestran que la metodolog´ıa disminuye el costo en memoria de la detecci´onde temas, permitiendo detectar m`astemas y al mismo tiempo mejorar su calidad. Keywords| Entidades nombradas, etiquetado gramatical, detecci´onde temas, agrupamiento multi-vista, fusi´onde grafos VI Esta tesis de maestr´ıase sustent´oel 14 de mayo de 2021 a las 08:00 am, y fue evaluada por los siguientes jurados: white Sergio Gonzalo Jim´enezVargas, Ph.D. Grupo de investigaci´onde ling¨u´ısticaComputacional Instituto Caro y Cuervo white Elizabeth Le´onGuzm´an,Ph.D. Profesora Asociada Departamento de ingenier´ıade Sistemas e Industrial Universidad Nacional de Colombia, Sede Bogot´a Contents Acknowledgements IV Abstract V 1 Introduction 2 2 Theoretical Background 6 2.1 Hierarchical topic detection ........................... 6 2.2 Multi-view data.................................. 9 2.3 Multi-view clustering............................... 9 2.4 Multi-view learning on LDA........................... 10 3 Experimental setup 11 3.1 Datasets...................................... 11 3.1.1 Wikipedia organisms category...................... 11 3.1.2 Universidad Nacional de Colombia thesis abstracts .......... 13 3.1.3 Universidad Nacional de Colombia thesis dataset ........... 13 3.2 Text preprocessing ................................ 15 3.2.1 Language detection............................ 15 3.2.2 Named entities and POS tags ...................... 15 3.2.3 TF-IDF score............................... 17 3.3 Topic detection.................................. 17 4 Multi-view topic clustering 22 5 Graph fusion and evaluation 25 5.1 Graph fusion ................................... 25 5.2 Evaluation..................................... 26 5.2.1 Inter-topic evaluation........................... 28 5.3 Empirical evaluations............................... 32 5.4 Topic models initial reviews ........................... 33 6 Conclusions and recommendations 35 6.1 Conclusions.................................... 35 6.2 Recommendations................................. 35 A Appendix: Topic tree examples 37 B Appendix: Topic clustering results 45 C Appendix: Topic hierarchies 48 D Appendix: Topic fusion 51 E Appendix: Abstracts single-view topic clustering results 57 F Appendix: Theses single-view topic clustering results 60 G Appendix: Organisms single-view topic clustering results 63 H Appendix: Feature concatenation topic clustering results 66 I Appendix: Organisms subcategories 69 Bibliography 73 List of Tables 3.1 Pages downloaded per category.......................... 14 3.2 Entities and tags used on the Organisms dataset. ............... 16 3.3 Entities and tags used on the abstract and thesis datasets........... 16 3.4 Vocabulary size used for each organism entity sub-dataset........... 18 3.5 Vocabulary size used for each thesis entity sub-datasets............. 19 3.6 Vocabulary size used for each abstract entity sub-datasets........... 19 3.7 Organism topics evaluation with a total of 3,409 topics. ........... 19 3.8 Thesis topics evaluation with a total 2889 topics................ 20 3.9 Abstract topics evaluation with a total of 1041 topics............. 20 4.1 Number of topics on each dataset topic tree levels................ 23 5.1 Average coherence scores for the different datasets. .............. 27 5.2 Average compactness scores for the different datasets.............. 28 5.3 Average Inter-topic coherence scores for the different datasets. 29 5.4 Inter-topic coherence per tree level for the HLTA topics............ 30 5.5 Inter-topic coherence per tree level for the entity topics............. 30 5.6 Inter-topic coherence per tree level for the fusion topics............. 30 5.7 Average parent-child coherence scores for the different datasets. 32 5.8 Survey results for topic meaningfulness based on a scale from 1 to 10. 34 List of Figures 1.1 Methodology diagram. .............................. 4 3.1 Example of preprocessing using regular expressions to a fragment from the Ant page from Wikipedia. ............................ 13 5.1 Coherence scores for the different datasets.................... 27 5.2 Compactness scores for the different datasets. ................. 28 5.3 Inter-topic coherence scores for the different datasets.............. 29 5.4 Inter-topic coherence per tree level for the HLTA topics. ........... 30 5.5 Inter-topic coherence per tree level for the entity topics............. 31 5.6 Inter-topic coherence per tree level for the fusion topics............. 31 5.7 Parent-child coherence scores for the different datasets............. 32 5.8 Survey results performed by 30 people where in a scale of 1 to 10, they had to choose how meaningful some topics were. 5 people had previous experience analyzing topic models and 25 did not have any experience........... 34 A.1 Some adjective topics from the abstracts dataset. ............... 37 A.2 Some noun topics from the thesis dataset. ................... 38 A.3 Some organization topics from the thesis dataset. ............... 39 A.4 Some verb topics from the thesis dataset..................... 40 A.5 Some event topics from the organisms dataset.................. 41 A.6 Some location topics from the organisms dataset. ............... 42 A.7 Some object topics from the organisms dataset. ................ 43 A.8 Some people topics from the organisms dataset................. 44 B.1 Topic clustering per tree level from the abstracts entity topic models. 45 B.2 Topic clustering per tree level from the thesis entity topic models. 46 B.3 Topic clustering per tree level from the organisms entity topic models. 47 X List of Figures C.1 Topic hierarchies of the abstracts entities, the color represents the cluster topics belong to. ................................. 48 C.2 Topic hierarchies of the theses entities, the color represents the cluster topics belong to. ..................................... 49 C.3 Topic hierarchies of the Organisms entities, the color represents the cluster topics belong to. ................................. 50 D.1 Topic hierarchies after fusion from the abstracts entities, the color represents the cluster topics belong to. ........................... 51 D.2 Some entity topics after fusion from the abstracts dataset. .......... 52 D.3 Topic hierarchies after fusion from the thesis entities, the color represents the cluster topics belong to. ............................. 53 D.4 Some entity topics after fusion from the thesis dataset............. 54 D.5 Topic hierarchies after fusion from the organisms entities, the color represents the cluster topics belong to. ........................... 55 D.6 Some entity topics after fusion from the organisms dataset........... 56 E.1 Abstracts topic clustering per tree level using only each topic words probabil- ity information................................... 57 E.2 Abstracts topic clustering