Semantic Web Topic Models: Integrating Ontological Knowledge and Probabilistic Topic Models

by

Mehdi Allahyari

(Under the Direction of Krys Kochut)

Abstract

Currently we are coping with a plethora of text (more than 80% of the web) generated and disseminated globally on the web, and thanks to new technologies such as smart devices and social networks it keeps growing exponentially every day. This tremendous amount of text, which is mostly unstructured, is easy for humans to process and comprehend, but significantly harder for machines to understand. Needless to say, this volume of text is an invaluable source of information and knowledge. Thus, there is an increasing need to design methods and algorithms to effectively process this sheer volume of text and extract high quality information automatically. Probabilistic topic models are a class of latent variable models for textual data that can be used to produce interpretable summarizations of documents in the form of their constituent topics. However, because topic models are entirely unsupervised, they may create topics that are not always meaningful and understandable to humans. In this dissertation, we develop novel topic models that combine probabilistic topic modeling with domain knowledge in the form of ontologies within a single framework. These models effectively enhance the topic modeling process and produce topics that are best aligned with user modeling goals. We first describe the ontology-based topic model OntoLDA, in which a document is a mixture of topics, whereas topics are distributions over ontology concepts and concepts are multinomial distributions over words. We demonstrate the utility of this model for automatically generating labels for the topics. We next propose the sOntoLDA topic model, which combines the DBpedia ontology with topic modeling, and use this model for semantic tagging of web documents. For all these models, we develop learning algorithms and show their usefulness with experiments conducted on real-world datasets.

Index words: Semantic Web, Ontologies, DBpedia, Topic models, Statistical learning, Domain knowledge

Semantic Web Topic Models: Integrating Ontological Knowledge and Probabilistic Topic Models

by

Mehdi Allahyari

B.S., University of Kashan, 2005

A Dissertation Submitted to the Graduate Faculty of The University of Georgia in Partial Fulfillment of the Requirements for the Degree

Doctor of Philosophy

Athens, Georgia

2016

© 2016 Mehdi Allahyari
All Rights Reserved

Semantic Web Topic Models: Integrating Ontological Knowledge and Probabilistic Topic Models

by

Mehdi Allahyari

Approved:

Major Professor: Krys Kochut

Committee: John Miller, Will York

Electronic Version Approved:

Suzanne Barbour
Dean of the Graduate School
The University of Georgia
August 2016

Semantic Web Topic Models: Integrating Ontological Knowledge and Probabilistic Topic Models

Mehdi Allahyari

June 2016

For my parents and my family: Parisa, and Ryan

Acknowledgments

Over the past six years I have received support from a great many individuals. I owe my appreciation to all these people without whom this dissertation could not have been completed. First and foremost I want to express my sincere appreciation and gratitude to my advisor Dr. Krys Kochut who has been a mentor, colleague and friend. His willingness to give me the freedom to explore research on topics for which I have been truly passionate, and at the same time his guidance to recover when I faltered made my Ph.D. experience productive and rewarding. I am also deeply grateful to other professors in my advising committee, Dr. John Miller and Will York, for their encouragement and valuable advice during the years I was striving to advance my research. Most importantly, I owe my deepest gratitude to my wife Dr. Parisa Darkhal and my parents for their unflagging love and continuous support throughout my life and during my studies to whom this dissertation is dedicated.

Contents

Acknowledgments

List of Figures

List of Tables

1 Introduction
  1.1 Outline

2 Background
  2.1 Semantic Web
  2.2 Probabilistic Topic Models

3 Related Work
  3.1 LDA-based Topic Models for Modeling Observed Features of Documents
  3.2 LDA-based Topic Models exploiting Prior Knowledge

4 Motivating Example
  4.1 Combining Ontologies with Topic Models

5 Ontology-Based Text Classification
  5.1 Introduction
  5.2 Ontology-based Text Categorization
  5.3 Classification Categories
  5.4 Categorization Algorithm
  5.5 Experiments
  5.6 Conclusion and Future Work

6 OntoLDA: An Ontology-based Topic Model for Automatic Topic Labeling
  6.1 Introduction
  6.2 Background
  6.3 Motivating Example
  6.4 Related Work
  6.5 Problem Formulation
  6.6 Concept-based Topic Labeling
  6.7 Experiments
  6.8 Conclusions

7 Combining Topic Models with Wikipedia Category Network for Semantic Tagging
  7.1 Introduction
  7.2 Related Work
  7.3 Probabilistic Topic Models
  7.4 Ontology-based Topic Models for Semantic Tagging
  7.5 Experiments
  7.6 Classifying the Tagged Documents
  7.7 Conclusions

8 Conclusions and Future Work
  8.1 Summary of Contributions
  8.2 Future Work

Bibliography

List of Figures

2.1 Graph structure of the example RDF statement
2.2 Linked Open Data Cloud
2.3 LDA Graphical Model

5.1 Thematic graph from the example text
5.2 Instanced graph with selected matched entities for topics defined as union, intersection, and a combination of contexts

6.1 LDA Graphical Model
6.2 Graphical representation of OntoLDA model
6.3 Example of a topic represented by top concepts learned by OntoLDA
6.4 Semantic graph of the example topic φ described in Fig. 6.3 with |Vφ| = 13
6.5 Dominant thematic graph of the example topic described in Fig. 6.4
6.6 Core concepts of the dominant thematic graph of the example topic described in Fig. 6.5
6.7 Label graph of the concept “Oakland Raiders” along with its mScore to the category “American Football League teams”
6.8 Comparison of the systems using human evaluation

7.1 Graphical representation of LDA model
7.2 Graphical representation of sOntoLDA model
7.3 Precision, Coverage and MAP of Exact Match for Wikipedia Dataset
7.4 Precision, Coverage and MAP of Hierarchical Match for Wikipedia Dataset
7.5 Relations between the top 4 Wikipedia categories assigned to the article “Tooth brushing”
7.6 Precision, Coverage and MAP for Reuters Dataset

List of Tables

4.1 Example learned topics that are less useful or meaningful to the user
4.2 Top-10 words for topics from a document set
4.3 Top-10 words for topics from a document set. The second row presents the manually assigned labels
4.4 Sample topics with top-10 words and top-5 generated labels by the Mei et al. method
4.5 Sample topics with top-10 words and top-5 generated labels by our topic labeling method

5.1 Category details of used text corpus
5.2 Fine-grained categorization of main categories
5.3 High-level ontological contexts categorization
5.4 Categorization based on unions of sub-categories
5.5 Categorization based on unions of sub-categories

6.1 Example of a topic with its label
6.2 Example topics with top-10 words learned from a document set
6.3 Example topics with top-10 words learned from a document set. The second row presents the manually assigned labels
6.4 Example of topic-word representation learned by LDA and topic-concept representation learned by OntoLDA
6.5 Notation used in this paper
6.6 Example of a topic with top-10 concepts (first column) and top-10 labels (second column) generated by our proposed method
6.7 Sample topics of the BAWE corpus with top-6 generated labels for the Mei method and OntoLDA + Concept Labeling, along with top-10 words
6.8 Sample topics of the Reuters corpus with top-6 generated labels for the Mei method and OntoLDA + Concept Labeling, along with top-10 words
6.9 Example topics from the two document sets (top-10 words are shown). The third row presents the manually assigned labels
6.10 Topic Coherence on top T words. A higher coherence score means the topics are more coherent
6.11 Example topics with top-10 concept distributions in OntoLDA model

7.1 Examples of five topics with their labels
7.2 Precision, Coverage and MAP values of “exact match” for Wikipedia Dataset
7.3 Precision, Coverage and MAP values of “Hierarchical match” for Wikipedia Dataset
7.4 Top 5 categories selected for the article “Tooth brushing”
7.5 Precision, Coverage and MAP values of “Hierarchical match” for Reuters Dataset
7.6 An example of top-10 words for 5 categories (topics) in Wikipedia Dataset
7.7 An example of top-10 words for 5 categories (topics) in Reuters Dataset
7.8 Multi-label classification Precision, Recall and F-Measure values of different algorithms
7.9 Precision, Recall and F-Measure values of different algorithms

Chapter 1

Introduction

Extracting high quality and useful information from massive and constantly growing collections of text documents has become a challenging task and has gained a great deal of attention in recent years. This tremendous amount of text data, which is often unstructured, is created in a variety of forms such as social networks, the web, and other types of information-centric applications. Understanding and modeling the content of documents can be very beneficial in many applications like information retrieval, natural language processing, document classification, and document summarization. Consequently, there is an increasing need to design methods and algorithms to effectively process this avalanche of text in a wide variety of text applications. Probabilistic topic models are widely used to address these complex problems by virtue of their sound theoretical foundations in statistics and their capability to be extended and combined with other models in a systematic manner.

Probabilistic topic models such as Latent Dirichlet Allocation [17] are powerful techniques to analyze the content of documents and extract the underlying topics represented in the collection. Topic models usually assume that individual documents are mixtures of one or more topics, while topics are probability distributions over words. These models have been extensively used in a variety of text processing tasks, such as word sense disambiguation [47, 20], relation extraction [121], text classification [42, 65], and information retrieval [117]. Thus, topic models provide an effective framework for extracting the latent semantics from unstructured text collections. In addition to modeling textual data such as webpages, news articles, emails, and scientific and medical publications [86, 67, 108, 94, 100, 111], topic models have proven useful in modeling non-textual data like image collections [13, 7, 119].

However, because topic models are entirely unsupervised, purely statistical, and data driven, they may produce topics that are not always meaningful and understandable to humans. In other words, the discovered topics may not always correspond to what the user had in mind. We introduce a mechanism to cope with this issue. We develop topic models that allow the user to influence and guide the learned topics, while still maintaining the statistical pattern discovery abilities that make topic models a powerful tool.

In this dissertation, we first propose an ontology-based method for automatic

document classification into dynamically defined topics of interest. We investigate what benefits can be obtained by taking advantage of ontologies compared to using traditional supervised machine learning algorithms. We next explore how to integrate prior knowledge in the form of ontologies into the topic modeling framework. We propose knowledge-based topic models that incorporate domain knowledge to guide the topic identification process. In particular, we develop topic models that combine ontological knowledge bases such as the DBpedia ontology and Linked Open Data (LOD) with probabilistic topic models to get the best of both worlds. Integration of prior knowledge with topic models enhances the effectiveness of topic modeling and can be very helpful in directing the model towards the topics that are best aligned with user modeling goals when multiple candidate topic decompositions exist for a given corpus of documents. The main contributions of this research can be summarized as follows:

• We identify the restrictions of using traditional supervised learning techniques for the document classification task. We introduce an ontology-based method for automatic text document classification into dynamically defined topics. We show the benefits of our method over traditional methods and demonstrate its effectiveness through comprehensive evaluation.

• We introduce a new ontology-based topic model, called OntoLDA, for the automatic topic labeling task. It captures the relationships between ontology concepts and the topics learned from text corpora, and generates labels for the topics relying on the ontological concepts.

• We develop a collapsed Gibbs sampling inference algorithm for the OntoLDA topic model.

• We evaluate the effectiveness of ontology-based topic models by conducting extensive experiments on different datasets.

• We develop a knowledge-based topic model, called sOntoLDA, which creates semantic tags for Web resources and online documents. It combines prior knowledge from the DBpedia ontology with statistical topic models in a principled manner. Furthermore, it captures the distributions of DBpedia categories (concepts) over the words of the document collection.

• We create a collapsed Gibbs sampling inference algorithm for the sOntoLDA topic model.

• We show how the sOntoLDA topic model functions through illustrative examples and demonstrate its utility by running comprehensive evaluations on multiple datasets.

1.1 Outline

The remainder of this dissertation is organized as follows:

• We formally define the Latent Dirichlet Allocation (LDA) topic model and inference algorithms for topic models in Chapter 2. This chapter also describes several primary concepts associated with the Semantic Web.

• We survey a wide variety of related work in Chapter 3, from extensions and modifications researchers have carried out on the standard LDA model to some of the recent research on exploiting prior knowledge in topic models.

• In Chapter 4, we describe a motivating example to answer questions such as “How can ontologies benefit probabilistic topic models?” and “How can domain knowledge from ontologies be integrated with unsupervised topic models?”

• Chapter 5 describes an ontology-based method for text classification into a dynamically defined set of topics. In this method, the ontology effectively acts as the classifier: not only does it not require a training set, but it also allows the user to change the topics of interest without re-training the classifier.

• Chapters 6 and 7 describe different mechanisms for the integration of ontological domain knowledge with unsupervised topic models. Chapter 6 introduces an ontology-based topic model for the task of automatic topic labeling, and illustrates the theory behind it and how it functions. This chapter additionally presents experiments showing the usefulness of ontology-based topic models. In Chapter 7, we describe a knowledge-based topic model for automatically tagging documents. We discuss the generative process of this model and provide the inference algorithm based on collapsed Gibbs sampling. Moreover, we conduct extensive experiments showing the utility and robustness of this model.

• Chapter 8 concludes the dissertation, summarizing the contributions and

describing directions for further research building on the foundations established in this work.

Chapter 2

Background

In this chapter, we formally describe some of the related concepts and notation. We begin with the formal definitions of several fundamental Semantic Web concepts. We then formally define Latent Dirichlet Allocation (LDA), the state-of-the-art probabilistic topic modeling technique.

2.1 Semantic Web

The Semantic Web is considered an extension of the current Web through standards by the World Wide Web Consortium (W3C)1. It was initiated by Tim Berners-Lee and formally introduced to the world in the May 2001 Scientific American article “The Semantic Web”. The Semantic Web brings structure to Web content, making the information not only human readable but also representing it in a form that is machine-processable [9]. Tim Berners-Lee articulated

1www.w3.org

a Web that, apart from being an infrastructure for documents and their links, would express a Web of data (i.e., a Web of resources with relations), which enables information to be shared and reused across applications. For example, resources could represent objects such as organizations, people, and locations, and links between them describe the relationships among them. In order to bring structure to the Web and allow information exchange across applications, the Semantic Web requires a few technologies. In the following sections, we outline a few fundamental Semantic Web technologies that are necessary for achieving the functionality previously mentioned.

2.1.1 Ontology

Ontologies have been designed as a way to express knowledge about a domain in the Semantic Web. Tom Gruber defines an ontology as an “explicit and formal specification of a conceptualization” [39]. We define the concept of ontology using the definition presented in W3C’s OWL Use Case and Requirements Documents2 as follows:

An ontology O formally defines a common set of terms that are used to describe and represent a domain. An ontology defines the terms used to describe and represent an area of knowledge.

According to the definition above, we should mention a few points about ontologies:

2http://www.w3.org/TR/webont-req/

1) An ontology is domain specific, i.e., it is used to describe and represent an area of knowledge such as education or medicine [122]. 2) An ontology consists of terms and the relationships among these terms. Terms are often called classes or concepts, and relationships are called properties. By virtue of the introduction of standard languages and recent advancements in ontology creation, working with ontologies has become much easier, which has significantly impacted knowledge and data integration, exchange, and collaborative work. Ontologies can be broadly classified into two categories: (a) domain-specific ontologies, which are an important source of knowledge in their particular domains. Biomedical ontologies such as the “Gene Ontology (GO)” and the “Ontology for Biomedical Investigations3” are examples of this type, where the first is an ontology for describing the function of genes and gene products and the latter is an integrated ontology for the description of life-science and clinical investigations. (b) In contrast, general-scope ontologies cover multiple domains and include concepts from numerous areas. DBpedia [5], an encyclopedic ontology derived from Wikipedia, contains knowledge from biology, science, art, music, and many more domains.

2.1.2 Resource Description Framework (RDF)

RDF was originally created in early 1999 by the W3C as a standard for encoding metadata. RDF is the Semantic Web’s data model, designed to make statements about resources, especially web resources, in the form of ⟨subject, predicate, object⟩ expressions. These expressions are called triples. For example,

3http://www.obofoundry.org/

we can express the fact “Barack Obama is the president of the United States” as an RDF triple ⟨Barack Obama, isPresidentOf, United States⟩, whose graph structure is represented in Figure 2.1. RDF allows structured and semi-structured data to be mixed and shared across different applications.

(Barack Obama) —isPresidentOf→ (United States)

Figure 2.1: Graph structure of the example RDF statement
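To make the triple representation concrete, the statement above can also be constructed programmatically. The following is a minimal sketch, assuming the rdflib Python library; the http://example.org/ namespace and the resource names are hypothetical, chosen only for illustration.

```python
# Minimal sketch of the example triple, assuming the rdflib library.
# The example.org namespace and resource names are hypothetical.
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")
g = Graph()

# The triple <Barack Obama, isPresidentOf, United States>
g.add((EX.Barack_Obama, EX.isPresidentOf, EX.United_States))

# Serialize the one-triple graph in N-Triples form
print(g.serialize(format="nt"))
```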

2.1.3 RDF Schema

RDFS is a set of classes and properties used to describe and encode RDF triples. According to the W3C4, RDFS is formally defined as:

“RDFS is a recommendation from W3C and it is an extensible knowledge representation language that one can use to create a vocabulary for describing classes, sub-classes and properties of RDF resources.”

By this definition, RDFS is RDF’s vocabulary description language. For example, we can define a common vocabulary for various classes (types) of laptops and their properties.

4 http://www.w3.org/TR/rdf-schema/
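To make the laptop example concrete, a minimal RDFS vocabulary might be sketched as follows, again assuming rdflib; the class and property names are hypothetical.

```python
# Minimal sketch of an RDFS vocabulary for laptops, assuming rdflib.
# Class and property names are hypothetical.
from rdflib import Graph, Namespace, RDF, RDFS

EX = Namespace("http://example.org/laptops#")
g = Graph()

g.add((EX.Laptop, RDF.type, RDFS.Class))           # a class of RDF resources
g.add((EX.Ultrabook, RDFS.subClassOf, EX.Laptop))  # a sub-class of Laptop
g.add((EX.screenSize, RDF.type, RDF.Property))     # a property
g.add((EX.screenSize, RDFS.domain, EX.Laptop))     # whose subject is a Laptop
```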

2.1.4 Linked Open Data (LOD)

Linked Data is about creating typed links between data from various sources [10]. In other words, Linked Data is a method of publishing structured data in such a way that it is interlinked with other data sources. Linked Data is based on standard Web technologies such as HTTP, RDF, and URIs. Tim Berners-Lee outlined a set of rules for publishing linked data on the web as follows:

1. Use URIs as the identifiers for things.

2. Use HTTP so that the things can be looked up.

3. Provide useful information when people look up a URI, using standards such as RDF, SPARQL, etc.

4. Include links to other URIs, so that they can find more things.

Since Linked Data was introduced, many publishers have published their datasets in the Linked Data format. These datasets come in different forms such as XML, RDF, CSV, and text, and cover multiple domains. As of 2014, the number of datasets in the LOD cloud was over 1,014⁵. Figure 2.2 illustrates the most recent image representing the datasets in the Linked Open Data cloud. One of the most important and primary datasets of the LOD is DBpedia [5, 11]. DBpedia is an ontology containing structured information extracted from Wikipedia and is publicly available on the Web. The English version of the DBpedia knowledge base

5http://linkeddatacatalog.dws.informatik.uni-mannheim.de/state/

describes 4.58 million things, out of which 4.22 million are classified in a consistent ontology, including 1,445,000 persons, 735,000 places (including 478,000 populated places), 411,000 creative works (including 123,000 music albums, 87,000 films and 19,000 video games), 241,000 organizations (including 58,000 companies and 49,000 educational institutions), 251,000 species, and 6,000 diseases6. The DBpedia knowledge base is very useful and provides many advantages: it covers many domains; because it is extracted from Wikipedia, it automatically evolves as Wikipedia changes; and it is multilingual, providing localized versions in 125 languages. Altogether it contains 3 billion pieces of information (RDF triples), out of which 580 million were extracted from the English edition of Wikipedia. Because DBpedia is structured, it allows us to ask quite complex queries against Wikipedia. Hence, it should be feasible to leverage this invaluable knowledge in a given data/text mining task. In fact, rich knowledge sources such as ontologies in the Semantic Web have been extensively utilized in a variety of data mining and knowledge discovery tasks [90].

6http://dbpedia.org/about
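Since DBpedia exposes its triples through a public SPARQL endpoint, such queries can be issued directly from code. The following is a minimal sketch, assuming the SPARQLWrapper Python library; the particular query (fetching the English abstract of the DBpedia article about DBpedia itself) is only an illustration.

```python
# Minimal sketch of querying DBpedia's public SPARQL endpoint,
# assuming the SPARQLWrapper library; the query is illustrative.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    SELECT ?abstract WHERE {
        dbr:DBpedia dbo:abstract ?abstract .
        FILTER (lang(?abstract) = 'en')
    }
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["abstract"]["value"])
```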

[Figure 2.2 image: the Linked Open Data cloud diagram, “Linked Datasets as of August 2014”.]

Figure 2.2: Linked Open Data Cloud

2.2 Probabilistic Topic Models

Probabilistic topic models are a set of algorithms that are used to uncover the hidden thematic structure of a collection of documents. The main idea of topic modeling is to create a probabilistic generative model for the corpus of text documents. In topic models, documents are mixtures of topics, where a topic is a probability distribution over words. The two main topic models are Probabilistic Latent Semantic Analysis (pLSA) [44] and Latent Dirichlet Allocation (LDA) [17]. Hofmann (1999) introduced pLSA for document modeling. The pLSA model does not provide a probabilistic model at the document level, which makes it difficult to generalize to new, unseen documents. Blei et al. [17] extended this model by introducing a Dirichlet prior on the mixture weights of topics per document, and called the model Latent Dirichlet Allocation (LDA). In this section we describe the LDA method. Latent Dirichlet Allocation (LDA) [17] is a generative probabilistic model for extracting the thematic information (topics) of a collection of documents. LDA assumes that each document is made up of various topics, where each topic is a probability distribution over words.

Let $D = \{d_1, d_2, \ldots, d_{|D|}\}$ be the corpus and $V = \{w_1, w_2, \ldots, w_{|V|}\}$ be the vocabulary of the corpus. A topic $z_j$, $1 \leq j \leq K$, is represented as a multinomial probability distribution over the $|V|$ words, $p(w_i|z_j)$, with $\sum_{i=1}^{|V|} p(w_i|z_j) = 1$. LDA generates the words in a two-stage process: words are generated from topics and topics are generated by documents. More formally, the distribution of words given the

14 document is calculated as follows:

$$p(w_i|d) = \sum_{j=1}^{K} p(w_i|z_j)\,p(z_j|d) \qquad (2.1)$$

The graphical model of LDA is shown in Figure 2.3 and the generative process for the corpus $D$ is as follows:

1. For each topic $k \in \{1, 2, \ldots, K\}$, sample a word distribution $\phi_k \sim \mathrm{Dir}(\beta)$

2. For each document $d \in \{1, 2, \ldots, |D|\}$,

(a) Sample a topic distribution $\theta_d \sim \mathrm{Dir}(\alpha)$

(b) For each word $w_n$, where $n \in \{1, 2, \ldots, N\}$, in document $d$,

i. Sample a topic $z_i \sim \mathrm{Mult}(\theta_d)$

ii. Sample a word $w_n \sim \mathrm{Mult}(\phi_{z_i})$

The joint distribution of the model (hidden and observed variables) is:

$$P(\phi_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{j=1}^{K} P(\phi_j|\beta) \prod_{d=1}^{|D|} \left( P(\theta_d|\alpha) \prod_{n=1}^{N} P(z_{d,n}|\theta_d)\,P(w_{d,n}|\phi_{1:K}, z_{d,n}) \right)$$
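The generative process above can be simulated directly. Below is a minimal numpy sketch that samples a toy corpus from the model; the corpus sizes and hyperparameter values are arbitrary choices for illustration, not values used in this dissertation.

```python
# Minimal sketch of the LDA generative process; sizes and
# hyperparameters are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
K, num_docs, N, V = 3, 5, 20, 50   # topics, documents, words per document, vocabulary
alpha, beta = 0.1, 0.01            # symmetric Dirichlet hyperparameters

# 1. For each topic k, sample a word distribution phi_k ~ Dir(beta)
phi = rng.dirichlet(np.full(V, beta), size=K)        # shape (K, V)

corpus = []
for d in range(num_docs):
    theta = rng.dirichlet(np.full(K, alpha))         # 2(a): theta_d ~ Dir(alpha)
    doc = []
    for n in range(N):
        z = rng.choice(K, p=theta)                   # 2(b)i:  z ~ Mult(theta_d)
        w = rng.choice(V, p=phi[z])                  # 2(b)ii: w ~ Mult(phi_z)
        doc.append(w)
    corpus.append(doc)
```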

2.2.1 Inference and Parameter Estimation for LDA

In the LDA model, the word-topic distribution $p(w|z)$ and the topic-document distribution $p(z|d)$ are learned entirely in an unsupervised manner, without any prior knowledge about which words are related to which topics and which topics are related to individual documents.

15 ↵

z

w

K N

D

Figure 2.3: LDA Graphical Model

We now need to compute the posterior distribution of the hidden variables (topics), given the observed documents. Thus, the posterior is:

$$P(\phi_{1:K}, \theta_{1:D}, z_{1:D} \mid w_{1:D}) = \frac{P(\phi_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})}{P(w_{1:D})} \qquad (2.2)$$

This distribution is intractable to compute [17] due to the denominator (the probability of seeing the observed corpus under any topic model). While exact inference of the posterior distribution is not tractable, a wide variety of approximate inference techniques can be used, including variational inference [17] and Gibbs sampling [38]. Gibbs sampling is a Markov Chain Monte Carlo [37] algorithm that collects samples from the posterior in order to approximate it with an

empirical distribution. Gibbs sampling begins with a random assignment of words to topics; the algorithm then iterates over all the words in the training documents for a number of iterations (usually on the order of 100). In each iteration, it samples a new topic assignment for each word using the conditional distribution of that word given all other current word-topic assignments. After the iterations are finished, the algorithm reaches a steady state, and the word-topic probability distributions can be estimated from the word-topic assignments. Gibbs sampling computes the posterior over topic assignments for every word as follows:

$$P(z_i = k \mid w_i = w, z_{-i}, w_{-i}, \alpha, \beta) = \frac{n^{(d)}_{k,-i} + \alpha}{\sum_{k'=1}^{K} n^{(d)}_{k',-i} + K\alpha} \times \frac{n^{(k)}_{w,-i} + \beta}{\sum_{w'=1}^{W} n^{(k)}_{w',-i} + W\beta} \qquad (2.3)$$

where $z_i = k$ is the assignment of word $i$ to topic $k$, and $z_{-i}$ refers to the topic assignments of all other words. $n^{(k)}_{w,-i}$ is the number of times word $w$ is assigned to topic $k$, excluding the current assignment. Similarly, $n^{(d)}_{k,-i}$ is the number of times topic $k$ is assigned to any word in document $d$, excluding the current assignment. For a theoretical overview of Gibbs sampling, see [22, 41].
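For concreteness, one sweep of a collapsed Gibbs sampler implementing Eq. (2.3) can be sketched as follows. The count-array layout, initialization, and function name are assumptions made for illustration; note that the denominator of the first factor in Eq. (2.3) is constant in k and can therefore be dropped when normalizing.

```python
# Minimal sketch of one collapsed Gibbs sweep for LDA, following Eq. (2.3).
# Count arrays (n_dk: docs x topics, n_kw: topics x vocab, n_k: topic totals)
# and their initialization are assumptions made for illustration.
import numpy as np

def gibbs_sweep(corpus, z, n_dk, n_kw, n_k, K, V, alpha, beta, rng):
    for d, doc in enumerate(corpus):
        for i, w in enumerate(doc):
            k_old = z[d][i]
            # Remove the current assignment (the "-i" counts in Eq. 2.3)
            n_dk[d, k_old] -= 1; n_kw[k_old, w] -= 1; n_k[k_old] -= 1
            # Unnormalized conditional for z_i = k; the first factor's
            # denominator is constant in k and is dropped here
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
            k_new = rng.choice(K, p=p / p.sum())
            # Record the new assignment and restore the counts
            z[d][i] = k_new
            n_dk[d, k_new] += 1; n_kw[k_new, w] += 1; n_k[k_new] += 1
```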

Chapter 3

Related Work

In this chapter, we review the most important existing research related to the use of prior knowledge in topic models. LDA is a well-defined probabilistic topic model, which has allowed researchers to exploit it as a building block for creating customized and richer topic models. We explore a variety of extensions to the standard unsupervised LDA topic model that use some type of additional information or structure to learn more informative models. We first review some of the existing topic models that deal with various aspects of documents by modeling additional observed information beyond the text of the documents. We then describe the prior works that have utilized domain knowledge in topic models.

3.1 LDA-based Topic Models for Modeling Observed Features of Documents

In this section, we discuss extensions to the standard LDA model that not only model the text of documents, but also incorporate additional observed data into the topic models. This additional data ranges from document labels to images associated with documents or links between documents. Intuitively, the learned topics have to “explain” these additional features as well as the document text. These models have various applications, such as labeling documents, annotating images with tags, or predicting unseen links between documents. [15] introduces supervised LDA (sLDA) for labeled documents. sLDA pairs each document with a response variable $y_d$, which can be categorical or continuous. This approach jointly models the documents and the responses and can predict responses for an unlabeled test document by calculating the latent topics. Ramage et al. [89] propose a supervised topic model, Labeled LDA (L-LDA), for multi-label corpora. The L-LDA topic model assigns a K-dimensional binary vector of labels Λ to each document. K is the total number of unique labels as well as the number of topics in Labeled LDA. For example, a document can be tagged with multiple labels such as “business” and “politics”. Each one of these labels is associated with its own topic and can only be used in the documents that have that label. Thus, L-LDA incorporates the supervised information by restricting the topic model to pick only those topics that correspond to a document’s observed label set. Blei and Jordan [13] propose Correspondence LDA, which jointly models images and their associated captions.

This model finds relationships between image regions and words. Each latent topic z is a multivariate Gaussian distribution over the image regions and a multinomial distribution over the words for generating captions. Intuitively, this model captures the notion that the image is first created and the caption then annotates the image. This model allows interesting applications such as automatic image annotation, automatic region annotation, and text-based image retrieval. [109] develops a model for jointly modeling the image, its label, and its annotations. It should be noted that topic-model applications for vision tasks have a fundamental research literature of their own [34, 113, 116], which is not the focus of this dissertation. There are existing prior topic models that deal with various aspects of document metadata. For example, in [92], the authors integrate authorship information into the topic model and discover a topic mixture over the documents and authors. In this model, each word is generated by first sampling an author a, and then sampling a topic z from the topic mixture distribution specific to the author a. Incorporating authorship with the topics can be used for different applications such as finding the affinity of a reviewer to a paper, assigning reviewers to scientific papers [79], or mining a developer’s contributions to a given codebase [68]. There are many types of documents that are inherently inter-linked, such as bibliographic information, citations within scientific articles, weblogs and comments, or webpages and links. [24] proposes a topic model for modeling documents and the links between them. This model can be used to summarize a network of documents, predict links between them, or predict words within them. Liu et al. [69] develop a topic

model that combines topic models and author community discovery in a unified framework. [84] introduces a topic model that combines contentions (agreements) in discussion forums with discussion topics. Recall that in standard LDA, the document-topic distribution θ is drawn independently for each document from a Dirichlet distribution with parameter α. This assumption ignores the likely correlations between topics. Another line of related work is the set of topic models that are primarily concerned with the correlations that may exist among topics and the document-topic associations. Hierarchical LDA (hLDA) [16], Correlated Topic Models (CTM) [12], and Pachinko Allocation Machines (PAM) [66, 81] are examples of topic models that alter the document-topic sampling process in different ways in order to capture the correlations among topics. CTM replaces the Dirichlet distribution over θ with the logistic normal distribution, which gives a more realistic model of latent topic structure in which the existence of one topic may correlate with another. hLDA learns the hierarchical structure of the topics from data. This model assumes that there is a tree-structured hierarchy over the topics. Each document is generated by choosing a path through the topic tree from the root to a leaf, and each word is assigned to a topic at one of the levels of that path. hLDA expresses the data more accurately by organizing topics into a hierarchy and reflects the underlying semantic notions of generality and specificity. PAM is another approach to representing the organization of topics into a hierarchy, in which each document is a distribution over a single set of topics from root to leaf, using a directed acyclic graph (DAG) to represent topic co-occurrences. Every topic in PAM is a distribution over sub-topics and a distribution over

the vocabulary. Encoding the connections between topics in topic models has been exploited in interesting applications such as entity disambiguation [53] and document summarization [23]. Mimno and McCallum [80] propose the Dirichlet-multinomial regression topic model (DMR), which combines text data with document metadata. DMR replaces the document-topic mixture θ with a log-linear prior on document-topic distributions, which is a function of observed features of the document such as authors, references, publication venues, and dates. For each document d, there is a feature vector $x_d$ that encodes metadata values. The advantage of DMR over previous works in metadata-rich topic modeling is that DMR incorporates arbitrary types of features, including continuous and categorical ones, with no additional coding and with fairly simple inference algorithms. There is also prior work that integrates timestamps (e.g., scientific article publication dates) with topic models. It makes sense to incorporate temporal information into topic modeling if we can learn something about the evolution of topics and topic trends over time. Blei and Lafferty [14] propose a dynamic topic model (DTM) that analyzes the time evolution of topics in a sequentially organized corpus of documents. DTM assumes that the data is divided by time slice, for instance by year. It models the documents of each slice with a K-dimensional topic model, where the topics of slice t evolve from the topics associated with slice t − 1. DTM substitutes the document-specific topic proportions θ with logistic normal distributions. One obvious restriction of DTM is that it requires time to be discretized. Wang et al. [110] develop the continuous time dynamic topic model (cDTM), which replaces the discrete Gaussian evolution with

its continuous generalization, Brownian motion [61]. Wang and McCallum [114] develop the topics over time (TOT) topic model in which, unlike prior works that rely on a Markov assumption or discretized time, the model itself generates the timestamps. In other words, this topic model jointly models time with word co-occurrences in an explicit way. The TOT topic model can be used to predict a timestamp given the words in a document. Another interesting use of the TOT model is that, by obtaining a distribution over topics, it allows us to see topic occurrence patterns over time. Another line of related work is topic models that combine topic modeling with the network structure of the data (TMN) using a graph-based regularization framework. Mei et al. [76] develop a method that regularizes statistical topic models with a harmonic regularizer based on the graph structure in the data. TMN leverages the power of both topic modeling and discrete regularization, jointly optimizing the likelihood of the generation of topics and topic smoothness on the graph. The regularization framework that TMN utilizes to model topics with the network structure of the data is quite natural: vertices that are connected to one another should have similar topic weights ($f(\theta, v)$), where f is a weighting function of a topic θ on vertex v. TMN enables a wide variety of applications such as mapping topics onto networks, topical community discovery, and spatial text mining. The limitation of TMN is that it can only integrate with homogeneous information networks. [32] proposes a topic model with biased propagation (TMBP), which integrates heterogeneous information networks (i.e., networks with multi-typed objects) with topic modeling in a single framework. TMBP is effectively

applied to object clustering, document modeling, link prediction in multi-relational and heterogeneous networks [120], and user behavior learning in social networks [123].

3.2 LDA-based Topic Models exploiting Prior Knowledge

Standard LDA is entirely unsupervised, purely statistical, and data driven, and may therefore produce topics that are not meaningful and understandable to humans. Recently, several knowledge-based topic models have been proposed to cope with this issue. Boyd-Graber et al. [20] develop an LDA-based topic model with WordNet (LDAWN) in which the sense of a word is a hidden variable inferred from data. Thus, it discovers both the topics of the corpus and the meaning assigned to each of its words. LDAWN replaces the multinomial topic-word distributions with a WordNet-Walk, a probabilistic process of word generation that relies on the hyponymy relationship in WordNet [78]. LDAWN is used for word sense disambiguation tasks with the basic intuition that words in a topic have similar meanings and therefore share paths within WordNet. Chemudugunta et al. [25] describe a Concept-Topic model (CTM), which combines human-defined concepts with LDA. The key idea in their framework is that topics from statistical topic models and concepts of an ontology are both represented by a set of focused words, and their model exploits this similarity. Thus, CTM essentially extends the number of topics by including the

human-generated concepts as special topics. A concept c is represented by a finite subset $W_c$ of words from the vocabulary, which constrains the CTM model to set the probability of words that are not a priori mentioned in the concept to 0, i.e., $p(w_i|c) = 0$ for $w_i \notin W_c$. These subsets of the vocabulary can be provided by some source of external knowledge such as the Open Directory Project (ODP)1, a human-edited hierarchical directory of the web. Since CTM assigns concepts at the word level, it allows many appealing applications such as creating summaries for documents at different levels of granularity or labeling documents. In [26], the authors extend the CTM model and propose HTCM, the Hierarchical Concept-Topic model, in order to leverage the known hierarchical structure among concepts. HTCM incorporates this hierarchical information to propagate the words upwards in the concept tree; therefore, each internal concept node is associated not only with its own words but also with all the words of its children. Andrzejewski et al. [3] propose a topic model, DF-LDA, that integrates domain knowledge in the form of must-links and cannot-links into LDA. A must-link indicates that two words should be in the same topic, whereas a cannot-link states that two words should not be in the same topic. DF-LDA encodes the set of must-links and cannot-links associated with the domain knowledge using a Dirichlet Forest prior, replacing the Dirichlet prior over the topic-word multinomial distributions. It allows the user to control the strength of the domain knowledge. In [4], the authors propose the First-Order Logic LDA topic model (Fold.all), which allows the user to specify general domain knowledge in First-Order Logic (FOL). The primary idea

1 http://www.dmoz.org

is that a domain expert specifies the domain knowledge as First-Order Logic rules, and the Fold.all model automatically incorporates them into LDA and produces topics influenced by both data and rules. This approach enables domain experts to concentrate on high-level modeling goals rather than the low-level issues involved in creating a custom topic model. Patterson et al. [88] leverage word features as side information to boost topic cohesion. The intuition is to treat word information as features instead of explicit restrictions and to modify the smoothing prior over the topic distributions for words in such a way that correlation is stressed. In this way, we can learn the prior probability of how words are distributed over different topics based on how similar they are. Jagarlamudi et al. [49] introduce topic models that utilize prior knowledge in the form of seed words to learn topics of specific interest to a user. Seed words are user-provided words that represent the topics underlying the corpus. For example, {“gas”, “oil”, “products”, “petrol”} is a set of seed words representing a seed topic. Seed topic information can be utilized to improve the topic-word probability distributions; it can first be transferred to the document level based on the document words and then used to enhance document-topic distributions; or it can be combined at the topic and then the document level. Interactive Topic Modeling (ITM) [47] allows the user to incorporate knowledge interactively during the topic modeling process. Chen et al. [30] develop the LDA with Multi-Domain Knowledge (MK-LDA) topic model, which exploits knowledge from multiple domains to enhance topic coherence in a new domain. The knowledge is called an s-set (semantic-set) and refers to a set of words sharing the same semantic meaning in the domain. Each document is a mixture of topics, whereas each topic is

a distribution over s-sets. MK-LDA can handle multiple senses because the latent variable for s-sets allows it to choose the right sense represented by an s-set. In [28], the authors propose GK-LDA, a general knowledge-based model, which exploits general knowledge of lexical semantic relations in the topic model. The general knowledge includes synonyms, antonyms, and adjective attributes, and is domain independent. GK-LDA utilizes the knowledge of lexical relations in dictionaries in order to deal with wrong knowledge (i.e., a meaning of a word that is not suitable or correct for a domain). The lexical knowledge is extracted from dictionaries to form general knowledge and can be applied to any domain without user involvement. Other related works are [31] and [29], which develop topic models built upon the GK-LDA topic model that have been used for the aspect extraction task in sentiment analysis.

Chapter 4

Motivating Example

The standard LDA topic model is entirely unsupervised, i.e., it divides the corpus into latent topics based on a completely data-driven objective function such as maximum likelihood. Thus, topic modeling may produce topics that are lists of words that do not bear useful information for human use or are not aligned with the user's goals. For instance, Table 4.1 presents a few of the topics learned from a collection of news articles along with their high-probability words [87]. The first row is a topic that has associated Carolina with Korea through the words “north” and “south”. The topic in the second row represents “comparisons” and is learned from a large collection of MEDLINE abstracts. The last row shows a topic that consists of names, days of the week, and months of the year. A purely unsupervised LDA topic model may not be able to capture important structure or extract meaningful, interpretable, and coherent topics. Thus, researchers have extended LDA and developed richer topic models to address this

Table 4.1: Example learned topics that are less useful or meaningful to the user.

High-probability words                                Issue
north south carolina korea korean                     north and south combination
effect significant increase decrease significantly    comparisons
weekend december monday scott wood                    combination of names

issue. Incorporating ontological knowledge allows these topic models to uncover hidden semantic themes (topics) in a more effective way. In the following section, we describe how to integrate ontology concepts with the standard LDA topic model to discover more coherent topics and automatically generate semantic labels for them.

4.1 Combining Ontologies with Topic Models

Let us presume that we are given a collection of news articles and told to extract the common themes present in this corpus. Manual inspection of the articles is the simplest approach, but it is not practical for a large collection of documents. We can make use of topic models to solve this problem by assuming that a collection of text documents comprises a set of hidden themes, called topics. Each topic z is a multinomial distribution P(w|z) over the words w of the vocabulary. Similarly, each document is made up of these topics, which allows multiple topics to be present in the same document. We estimate both the topics and document-topic mixtures from the data simultaneously. When the topic proportions of documents are estimated, they can be used as the themes (high-level semantics) of the

documents. Top-ranked words in a topic-word distribution indicate the meaning of the topic. For example, Table 4.2 shows a sample of four topics with their top-10 words learned from a corpus of news articles.

Table 4.2: Top-10 words for topics from a document set.

Topic 1      Topic 2      Topic 3      Topic 4
company      film         drug         republican
mobile       show         drugs        house
technology   music        cancer       senate
facebook     year         fda          president
google       television   patients     state
apple        singer       reuters      republicans
online       years        disease      political
industry     movie        treatment    campaign
video        band         virus        party
business     actor        health       democratic
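Topics such as those in Table 4.2 can be estimated with any off-the-shelf LDA implementation. As a minimal illustration, the sketch below uses the gensim library on a tiny placeholder corpus; with so little data the learned topics will not be meaningful, but the same calls apply to a real news corpus.

from gensim import corpora, models

# Toy stand-in for a tokenized news corpus (placeholder documents).
texts = [["mobile", "technology", "company", "google"],
         ["film", "music", "television", "actor"],
         ["drug", "cancer", "patients", "treatment"],
         ["senate", "president", "campaign", "party"]]

dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]

# Fit LDA with 4 topics and print the top words of each topic.
lda = models.LdaModel(bow_corpus, num_topics=4, id2word=dictionary,
                      passes=10, random_state=1)
for k in range(4):
    print(k, lda.show_topic(k, topn=4))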

However, even though the topic-word distributions are usually meaningful, it is very challenging for the users to accurately interpret the meaning of the topics based only on the word distributions extracted from the corpus, particularly when they are not familiar with the domain of the corpus. The standard LDA model does not automatically provide the labels of the topics. Essentially, for each topic it gives a distribution over the words. A label is one or a few phrases that sufficiently explain the meaning of the topic. For instance, as shown in Table 4.2, topics do not have any labels, therefore they must be manually assigned. The topic labeling task can be labor-intensive, particularly when dealing with hundreds of topics. Table 4.3 illustrates the same topics that have been labeled (second row in the table) manually by a human.

Table 4.3: Top-10 words for topics from a document set. The second row presents the manually assigned labels.

Topic 1        Topic 2            Topic 3     Topic 4
“Technology”   “Entertainment”    “Health”    “Politics”
company        film               drug        republican
mobile         show               drugs       house
technology     music              cancer      senate
facebook       year               fda         president
google         television         patients    state
apple          singer             reuters     republicans
online         years              disease     political
industry       movie              treatment   campaign
video          band               virus       party
business       actor              health      democratic

Automatic topic labeling has recently attracted increasing attention [115, 75, 71, 60, 48]. However, all previous works have basically focused on the topics learned via the LDA topic model (i.e., topics are multinomial distributions over words). For example, Mei et al. [75] proposed an approach to automatically label the topics by converting the labeling problem to an optimization problem. Thus, for each topic a candidate label is chosen that has the minimum Kullback-Leibler (KL) divergence and the maximum mutual information with the topic. Table 4.4 presents sample results of the topic labeling method described in [75], along with the top-5 generated labels. We believe that the knowledge in the ontology can be integrated with the topic models to automatically generate topic labels that are semantically relevant,

Table 4.4: Sample topics with top-10 words and top-5 labels generated by the method of Mei et al.

Topic 1           Topic 3               Topic 7              Topic 8
Label             Label                 Label                Label
rice production   cell lineage          hockey league        mobile devices
southeast asia    cell interactions     western conference   ralph lauren
rice fields       somatic blastomeres   national hockey      gerry shih
crop residues     cell stage            stokes editing       huffington post
weed species      maternal effect       field goal           analysts average
Top Words         Top Words             Top Words            Top Words
soil              cell                  game                 company
control           cells                 team                 million
organic           heading               season               billion
crop              expression            players              business
heading           al                    left                 executive
production        figure                time                 revenue
crops             protein               games                shares
system            genes                 sunday               companies
water             gene                  football             chief
biological        par                   pm                   customers

understandable to humans and highly cover the discovered topics. In other words, our aim is to incorporate the semantic graph of concepts in an ontology (e.g., DBpedia) and their various properties within unsupervised topic models, such as LDA. We introduce an ontology-based topic model, called the OntoLDA model, which incorporates an ontology into the topic model in a systematic manner. For the complete theoretical background and the learning and inference algorithms, see Chapter 6.

The basic intuition behind our model is that topics are distributions over ontology concepts, i.e., topics are represented using concepts in the ontology, where concepts are distributions over words. Having the concept latent variable as another layer between topics and words benefits us in several ways: (1) it gives us much more information about the topics; (2) it allows us to illustrate topics more specifically, based on ontology concepts rather than words, which can be used to label topics; (3) it automatically integrates topics with knowledge bases. Table 4.5 shows the top-5 labels generated by our proposed method for the same topics that are illustrated in Table 4.4. As can be seen, the labels generated by our proposed model are more understandable, semantically relevant and highly cover the topics.

Table 4.5: Sample topics with top-10 words and top-5 labels generated by our topic labeling method.

Topic 1                      Topic 3               Topic 7                               Topic 8
Label                        Label                 Label                                 Label
agriculture                  structural proteins   national football league teams        investment banks
tropical agriculture         autoantigens          washington redskins                   house of morgan
horticulture and gardening   cytoskeleton          sports clubs established in 1932      mortgage lenders
model                                              american football teams in maryland   jpmorgan chase
rice                         genetic mapping       green bay packers                     banks established in 2000

Chapter 5

Ontology-Based Text Classification1

1Mehdi Allahyari, Krys Kochut and Maciej Janik. “Ontology-based Text Classification into Dynamically Defined Topics”. 2014 IEEE International Conference on Semantic Computing (ICSC), Pages: 273 - 278. Reprinted here with permission of the publisher.

Abstract

We present a method for the automatic classification of text documents into a dynamically defined set of topics of interest. The proposed approach requires only a domain ontology and a set of user-defined classification topics, specified as contexts in the ontology. Our method is based on measuring the semantic similarity of the thematic graph created from a text document and the ontology sub-graphs resulting from the projection of the defined contexts. The domain ontology effectively becomes the classifier, where classification topics are expressed using the defined ontological contexts. In contrast to the traditional supervised categorization methods, the proposed method does not require a training set of documents. More importantly, our approach allows dynamically changing the classification topics without retraining of the classifier. In our experiments, we used the English language Wikipedia converted to an RDF ontology to categorize a corpus of current Web news documents into a selection of topics of interest. The high accuracy achieved in our tests demonstrates the effectiveness of the proposed method, as well as the applicability of Wikipedia for semantic text categorization purposes.

5.1 Introduction

Text categorization is the task of assigning one or more predefined categories to the analyzed document, based on its content. People categorize text documents based on their general knowledge and their interests, which determine which facts are treated as more important. While reading a news document, we can capture the most important actors, facts and places, connecting them into one coherent event. Computers equipped with proper knowledge, represented by an ontology that is comprehensive enough, can spot the same actors and facts in the document. Furthermore, using predefined semantic relationships between recognized entities and knowledge from the ontology, they can construct a model of a presented event, augmenting it with important background facts that are not directly present in the document. The relative importance of facts and entities is determined by the defined ontological contexts (topics) of interest. Leveraging the ontological knowledge in the categorization process not only allows us to eliminate the training step in building a categorizer but also to dynamically change the topics of the categorization without any retraining when the user’s interests change. Traditional supervised text categorization methods use machine learning to perform the task. Most of them learn category definitions and create the categorizer from a set of training documents pre-classified into a number of fixed categories. Such methods, including Support Vector Machines [97], Naïve Bayes [97], decision trees [97], and Latent Semantic Analysis [58], are effective, but they require a set of pre-classified documents to train the categorizer. In this paper, we propose to use an ontology and dynamically defined

ontological contexts as classification categories. The novelty of our categorization method is that it does not require a training set of documents divided into a fixed set of categories and relies exclusively on the knowledge represented in the ontology: (1) named entities, relationships between them, entity classification and the class hierarchy, and (2) dynamically definable ontology contexts, representing the topics of interest (classification categories). Since our categorization method relies exclusively on a supplied ontology, in a way, the ontology itself can be regarded as a classifier. Using a general, encyclopedic knowledge-based ontology, such as one derived from Wikipedia, allows us to recognize and classify entities from numerous domains. Furthermore, having the ability to dynamically define classification categories turns such a classifier into a universal text classifier, as we can define our topics of interest as combinations of any existing domains in the ontology.

5.2 Ontology-based Text Categorization

We argue that automatic text classification can be accomplished by relying on the semantic similarity between the information included in a text document and a suitable fragment of the ontology. Our argument is based on the assumption that entities occurring in the document text, along with the relationships among them, can determine the document’s categorization, and that entities classified into the same or similar domains in the ontology are semantically closely related to each other. In order to achieve meaningful results, we require that the

ontology (i) cover the categorization domain(s), (ii) include a rich instance base of named entities and meaningful relationships among them, (iii) have proper labels for named entities that enable their recognition in categorized documents, and (iv) have the entities classified according to a class taxonomy included in the ontology. A Wikipedia-derived ontology fulfills most of the requirements for text categorization purposes. Its major advantages are in the richness of represented domains, the high number of entities, and the included categorization scheme. Wikipedia has already been successfully used for supervised text categorization [36] and predicting document topics [96]. We successfully used an RDF ontology created from Wikipedia in our previous ontology-based text categorization experiments described in [50] and [51]. A related task of predicting concepts that characterize sets of documents using an ontology created from Wikipedia is presented in [101]. A conversion of Wikipedia into a Wikipedia-based ontology has been done by the DBpedia project [11]. We used a modified method of creating a DBpedia-like ontology to (1) pre-process different ways an entity can be connected to its name variants and (2) introduce properties to represent them. The ontology-based categorization method proposed in this paper can be adjusted to use any encyclopedic-type ontology.

Motivating Example

Let us present a fragment of a recent news article to illustrate the process of ontology-based categorization: Fiat has completed its buyout of Chrysler, making the U.S. business a wholly-

owned subsidiary of the Italian carmaker as it gears up to use their combined resources to turn around its loss-making operations in Europe. The company announced on January 1 that it had struck a $4.35 billion deal - cheaper than analysts had expected - to gain full control of Chrysler, ending more than a year of tense talks that had obstructed Chief Executive Sergio Marchionne’s efforts to create the world’s seventh-largest auto maker [. . . ] Marchionne said at the Detroit car show last week that a listing of the combined entity was on the agenda for this year. While New York is the most liquid market, Hong Kong is also an option, the CEO said, pledging to stay at the helm of the merged group for at least three years. The first big test for the merged Fiat-Chrysler will be a three-year industrial plan Marchionne is expected to unveil in May, in which he will outline planned investments and models. [. . . ] Fiat has said its new strategy will focus on revamping its Alfa Romeo brand and keeping production of the sporty marque in Italy as it seeks to utilize plants operating below capacity, protect jobs and compete in the higher-margin premium segment of the market. To categorize the above article, we identify the entities (underlined) and, using the DBpedia-based ontology, induce relationships among them, and introduce the initial semantic graph of connected entities that were recognized in the document. Note that several matched entities do not belong to the main subject of the document and some can even be matched ambiguously (initial entity recognition is based on matching their names occurring in the text). Disambiguation issues are addressed in the subsequent analysis of the graph.

Figure 5.1: Thematic graph from the example text

Distinct connected components in the semantic graph are associated with different domains of interest, as we assume that entities from the same domain are closely connected. The selection of the most important component (a sub-graph of the semantic graph), based on its size and the weights of the included entities, not only effectively eliminates entities assigned to different domain(s), but also helps to disambiguate multiple entities that were matched by the same phrase in the text by choosing a specific interpretation context. Even after the elimination of less important connected components, the thematic graph may contain multiple sub-domains or different interpretations of

entities and relationships among them. In order to establish the focus of the categorization process, we choose the core entities in the thematic graph (shaded entities in Figure 5.1). These are the entities discovered as (1) the best hubs and authorities by the HITS algorithm [54], (2) the best entities describing the graph, taking into account global information recursively computed from the entire graph by using the TextRank algorithm [77], as well as (3) the most central entities in the analyzed thematic graph. They are the starting points of the categorization. The categorization of the thematic graph into classification topics, defined as compositions of ontology contexts, requires measuring the semantic similarity between the supplied context definition and the thematic graph. This similarity measure shows how close the thematic graph is to the selected context. Continuing with the example article, our categorization assigns a number of most likely Wikipedia categories to the analyzed document. The top 5 assigned categories are: “Stock Market”, “Automotive Industry”, “Debating”, “Corporate Finance” and “Legal Entities”.

5.3 Classification Categories

Our text classification method allows dynamic specification of contexts as classification categories. Our definition of a context is to some extent based on the previous research on views in semi-structured databases and ontologies. We define it in terms of an RDF/RDFS ontology.

Definition 1. The hierarchical distance between an instance entity e from a description base R and a class c from an RDFS schema S, denoted as distH(e, c), is defined as the length of the shortest path formed by one rdf:type and zero or more rdfs:subClassOf properties connecting e and c. In case the entity e is not an instance of class c (directly or via the rdfs:subClassOf properties), distH(e, c) is set to 0.

By extension of Def. 1, the hierarchical distance between an instance entity e and a set of classes C, denoted as distH(e, C), is defined as the minimum positive value among all distH(e, c), where c ∈ C; distH(e, C) is set to 0 otherwise.

Definition 2. Let C be a set of schema classes included in an RDFS schema S. A projection of classes C onto an RDF description base R is a set of instance entities in R paired with their corresponding hierarchical distances to C, defined as:

Π(C, R) = {e(k) : e ∈ R ∧ k = distH(e, C) ∧ k > 0}    (5.1)

Definition 3. A categorization context (topic) defined by a set of schema classes C is a projection of C onto R.
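As a minimal sketch of Definitions 1–3, assume the ontology is available as plain Python dictionaries mapping each entity to its rdf:type classes and each class to its rdfs:subClassOf parents; in practice these would be queries against a triple store, so the dictionaries and the breadth-first search below are illustrative assumptions.

from collections import deque

def dist_h(entity, classes, rdf_type, subclass_of):
    """distH(e, C): shortest path of one rdf:type edge plus zero or more
    rdfs:subClassOf edges from `entity` to any class in `classes`; 0 if none."""
    best = 0
    for start in rdf_type.get(entity, []):
        queue, seen = deque([(start, 1)]), {start}
        while queue:
            cls, k = queue.popleft()
            if cls in classes:
                best = k if best == 0 else min(best, k)
                break
            for parent in subclass_of.get(cls, []):
                if parent not in seen:
                    seen.add(parent)
                    queue.append((parent, k + 1))
    return best

def projection(classes, entities, rdf_type, subclass_of):
    """Pi(C, R) of Eq. 5.1: entities paired with positive hierarchical distances."""
    return {e: k for e in entities
            if (k := dist_h(e, classes, rdf_type, subclass_of)) > 0}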

Definition 4. Given two categorization contexts m1 and m2, the following context expressions are also categorization contexts:

m1 ∩ m2 ≡ intersection of contexts
        ≡ {e(k) : e(k1) ∈ m1 ∧ e(k2) ∈ m2 ∧ k = min(k1, k2)}    (5.2)

m1 ∪ m2 ≡ union of contexts
        ≡ {e(k) : (e(k1) ∈ m1 ∨ e(k2) ∈ m2) ∧ k = min(k1, k2)}    (5.3)

m1 \ m2 ≡ difference of contexts
        ≡ {e(k) : e(k) ∈ m1 ∧ ∀k2 > 0 : e(k2) ∉ m2}    (5.4)

We say that an instance entity e, which is a member of a categorization context, is covered by the context.
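Representing a categorization context as a dictionary from an entity to its distance k, the three context expressions of Definition 4 have a direct implementation; the following is a small sketch, with hypothetical example contexts:

def ctx_intersection(m1, m2):
    """Eq. 5.2: entities covered by both contexts, keeping the smaller distance."""
    return {e: min(m1[e], m2[e]) for e in m1.keys() & m2.keys()}

def ctx_union(m1, m2):
    """Eq. 5.3: entities covered by either context, keeping the smaller distance."""
    return {e: min(m1.get(e, float("inf")), m2.get(e, float("inf")))
            for e in m1.keys() | m2.keys()}

def ctx_difference(m1, m2):
    """Eq. 5.4: entities covered by m1 and not covered by m2 at all."""
    return {e: k for e, k in m1.items() if e not in m2}

business = {"Fiat": 2, "Chrysler": 1, "NYSE": 3}   # hypothetical projections
sports = {"Chrysler": 4, "Detroit_Lions": 1}
print(ctx_intersection(business, sports))          # {'Chrysler': 1}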

Composition of Contexts

Topic definitions based on ontology context projections may not offer sufficient flexibility in defining classification topics. More specifically, a classification topic should capture a user’s interest in a specific area or even a combination of areas. Hence, we extend the definition of a classification topic to include a linear combination of a number of selected categorization contexts. A combination of categorization contexts gives the user much greater flexibility and precision in defining a category of interest. The use of a linear combination of contexts enables us to define categories involving multiple contexts which cannot be expressed as an intersection, union or difference of these contexts. As an example, consider contexts defining “business” and “sports”. Different

topics represented as a combination of the two contexts are presented in Figure 5.2. Topic (A), defined as a union of the two contexts, will match documents that belong to “business”, “sports” or both. Topic (B), defined as an intersection of the two contexts, will match documents with entities that belong at the same time to both “business” and “sports”. Using only the context expressions introduced in Def. 4, we are not able to specify a topic of documents that fall into both contexts, meaning that the document belongs both to the “business” and to the “sports” category (for example, business activities of football teams, or a football league). Such documents must include entities both from the “business” context (area 1) and the “sports” context (area 2), but not necessarily entities from their intersection. Intuitively, we describe such documents as belonging to the intersection of “sports” and “business”, but at the instance level such a topic (C) should be defined as a linear combination of both contexts. It must contain entities from each of the included contexts, whereas the union of contexts is too wide and the intersection too narrow. Even using the symmetric difference of contexts still does not guarantee that entities from both contexts will be represented in a document graph. The following is an extension of Def. 4.

Definition 5. Let Ci, 1 ≤ i ≤ n, be categorization contexts. A composition of categorization contexts is defined as a vector of pairs (Ci, ai), 1 ≤ i ≤ n, where the coefficients ai, indicating the relative importance of the contexts, are normalized.

Figure 5.2: Instanced graph with selected matched entities for topics defined as union, intersection, and a combination of contexts.

5.4 Categorization Algorithm

DBpedia offers a tool for converting Wikipedia into an RDF/S format. We used it with our modifications to facilitate more precise discovery of named entities in a document. Literals (name, alias, redirection or disambiguation names) associated with an entity are the key information for the matching process. Each of the literal types has a different confidence in the identification of the entity. We associated such literals with each entity using our own relations to distinguish their confidence levels during the matching process. Our categorization algorithm consists of three main steps: (1) construction of the semantic graph, (2) selection and analysis of the thematic graph, and (3) categorization of the selected thematic graph. The categorization topics are defined as ontology contexts, introduced in the previous section. They can be perceived as ontology views that specify the user’s contexts of interest. Classification topics are defined dynamically and independently from the document corpora.

Semantic Graph Construction

A document’s semantic graph is constructed from the named entities identified in the document. An entity in the ontology has one or more associated literals that can be used for its identification. For each such literal, we assign a confidence level that reflects how uniquely it can identify the entity. Note that one literal can be associated with multiple entities and produce ambiguous entity matches, which is discussed later.

Definition 6. Given a document d, the entity matching function, E(d), returns a set of ontology entities e, such that for each e ∈ E(d), there exists a phrase in d matching one of e’s identifying labels. Each entity in E(d) is assigned a weight w(e), given by the formula:

w(e) = 1 − 1 / (1 + Σ_{i=1}^{n} pi · s(li, spi))    (5.5)

where n is the number of occurrences in d of a phrase matching e, pi is the confidence of the relationship (property) used for entity identification, and s(li, spi) is the similarity of the spotted phrase spi and the entity’s identifying literal li. The function s measures the similarity between the spotted phrase sp in document d and the entity e’s label (literal) l in the ontology, taking into account the removed stop words and/or stemming. For more details refer to [51]. Because we did not use stemming for spotted phrases, s(l, sp) is set to 1.
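As a minimal sketch of Equation 5.5, each matched occurrence contributes its property confidence multiplied by the phrase similarity (fixed to 1 in our setting, as noted above); the confidence values in the example are hypothetical:

def entity_weight(matches):
    """Eq. 5.5: w(e) = 1 - 1 / (1 + sum_i p_i * s(l_i, sp_i)).

    `matches` is a list of (confidence, similarity) pairs, one pair per
    occurrence of a phrase matching the entity in the document.
    """
    evidence = sum(p * s for p, s in matches)
    return 1.0 - 1.0 / (1.0 + evidence)

# Two matches via a high-confidence name (p = 1.0) and one via a
# redirection label (p = 0.7); similarities are 1 since no stemming is used.
print(entity_weight([(1.0, 1.0), (1.0, 1.0), (0.7, 1.0)]))  # ~0.73

The weight grows toward 1 as matching evidence accumulates, so frequently and confidently matched entities dominate the semantic graph.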

Definition 7. A semantic graph of a document d, denoted SG(d), is a labeled graph with a set of vertices E(d) and a set of labeled edges {(ei, ej) with label r, such that ei, ej ∈ E(d), and ei and ej are connected by a relationship (property) r in the ontology}.

Even though the ontology relationships induced in SG(d) are directed, from now on we will consider SG(d) as an undirected graph. Since the semantic graph of a document is created by forming associations among the identified entities based on the properties existing in the ontology, it can be seen as adding background knowledge to the document in order to explain the associations between the entities.
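Given the matched entities E(d) and the ontology relations, SG(d) can be assembled directly. The sketch below uses the networkx library and assumes the relevant ontology relations are available as (subject, property, object) triples; the example triples for the Fiat article are hypothetical.

import networkx as nx

def semantic_graph(matched_entities, triples):
    """Build SG(d): nodes are the matched entities, edges are ontology
    properties connecting two matched entities (treated as undirected)."""
    sg = nx.Graph()
    sg.add_nodes_from(matched_entities)
    for s, prop, o in triples:
        if s in matched_entities and o in matched_entities:
            sg.add_edge(s, o, label=prop)
    return sg

triples = [("Fiat", "subsidiary", "Chrysler"),
           ("Sergio_Marchionne", "occupation", "Fiat"),
           ("Alfa_Romeo", "parentCompany", "Fiat")]
sg = semantic_graph({"Fiat", "Chrysler", "Sergio_Marchionne", "Alfa_Romeo"}, triples)
print(sg.number_of_edges())  # 3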

Thematic Graph Selection

The selection of the thematic graph is based on the assumption that entities related to a single topic are closely associated in the ontology, while entities from different topics are placed far apart, or even not connected at all. As a result, the analyzed semantic graph may be composed of multiple connected components, as each set of connected entities represents a different topic recognized in the document.

Definition 8. A sub-graph of SG(d) is called an interpretation of a document d, denoted I(d), if the sub-graph does not contain any ambiguous entities.

Definition 9. A connected component of I(d) is called a thematic sub-graph. In particular, if the whole I(d) is a connected graph, it is also a thematic sub-graph.

In general, an interpretation of a document may have many thematic sub-graphs, one for each of its connected components. The importance of entities in a thematic graph of a document is determined not only by their initial weights

but also by their placement in the graph. We utilize the HITS algorithm with the assigned initial weights for both entities and relationships to find the authoritative entities in the semantic graph.

Definition 10. A thematic sub-graph with the largest number of nodes and the highest total of entity weights is selected as the dominant thematic graph for the document.

Selecting a dominant thematic graph sets a specific interpretation context and effectively disambiguates any incorrectly matched entities. Furthermore, we locate the central entities in the graph (based on the geographical centrality measure), since they can be identified as the thematic landmarks of the graph. We also use the TextRank algorithm [77] to find the best entities describing the graph.

Definition 11. The core of the dominant thematic graph is composed of k most authoritative, descriptive and most central entities.

From now on, we will simply write thematic graph when referring to the dom- inant thematic graph of a document.
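The selection of the dominant thematic graph and its core can be sketched as follows, assuming a networkx semantic graph from the previous step together with the entity weights of Definition 6. For brevity, the core is approximated here by HITS authority scores alone, whereas the full method also folds in TextRank and centrality scores.

import networkx as nx

def dominant_thematic_graph(sg, weights):
    """Definition 10: pick the connected component with the most nodes,
    breaking ties by the total weight of its entities."""
    components = [sg.subgraph(c).copy() for c in nx.connected_components(sg)]
    return max(components,
               key=lambda g: (g.number_of_nodes(),
                              sum(weights.get(n, 0.0) for n in g.nodes)))

def core_entities(tg, k=3):
    """Approximate the core (Definition 11) by the top-k HITS authorities."""
    _, authorities = nx.hits(tg.to_directed(), max_iter=1000)
    return sorted(authorities, key=authorities.get, reverse=True)[:k]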

Classification into Defined Ontological Contexts

Classification of a document into the defined ontological contexts (topics) is based on calculating the similarity of the document’s thematic graph to each of the defined contexts. In general, the similarity is calculated based on the following objectives:

• The intersection of the context projection with the thematic graph should be maximized (coverage).

• The hierarchical distance of the entities in the thematic graph to the classes included in the context should be minimized (closeness).

• The highest number of the core entities should be covered.

To establish the similarity of document d’s thematic graph to each of the defined contexts c1, c2, . . . , cn (topics), we perform the following steps:

1. Find the expanded core entities in d, which include the core entities in the thematic graph of d and all of their immediate neighbors.

2. Construct a taxonomy graph out of the Wikipedia category network for the expanded core entities. It should be noted that we empirically restrict the hierarchy’s height to 3, due to the fact that increasing the height further quickly leads to excessively general categories.

3. Compute the semantic associativity score of the categories located in step 2 to the expanded core entities.

4. Find the top-k categories based on their score as the best categories describing the document.

5. For every defined context (user defined topic)2, execute Algorithm 1 and return the semantic relatedness score of document d to the defined context.

Steps 3 and 5 are described in the following sections, respectively.

2“Ontology context”, “user-defined category” and “topic” are used interchangeably.
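Steps 1 and 2 above can be sketched as follows, assuming a networkx thematic graph plus dictionaries mapping entities to their Wikipedia categories and categories to their parent categories; the height-3 restriction is enforced by a bounded breadth-first walk. The data structures are assumptions for illustration.

from collections import deque

def expanded_core_entities(tg, core):
    """Step 1: the core entities plus all of their immediate neighbors."""
    expanded = set(core)
    for e in core:
        expanded.update(tg.neighbors(e))
    return expanded

def taxonomy_graph(expanded, entity_cats, parent_cats, max_height=3):
    """Step 2: categories reachable from the expanded core entities,
    climbing at most `max_height` levels of the category network."""
    queue = deque((c, 1) for e in expanded for c in entity_cats.get(e, []))
    seen = {c for c, _ in queue}
    while queue:
        cat, h = queue.popleft()
        if h >= max_height:
            continue  # do not climb past the height limit
        for parent in parent_cats.get(cat, []):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, h + 1))
    return seen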

Computing Category Semantic Associativity Score

To compute the semantic associativity of a document to a categorization context, we first calculate the membership score and the coverage score. We have adopted a modified Vector-based Vector Generation method (VVG), described in [99], to calculate the category membership score. Given Wikipedia as a directed graph

G = {W, V, E}, and a Wikipedia concept wi and a category vj, the membership score mScore(wi, vj) of concept wi to category vj is defined as follows:

mScore(wi, vj) = ∏_{ek ∈ El} m(ek)    (5.6)

m(ek) = 1/n    (5.7)

where m(ek) is the weight of the membership links (category links) ek from node vi (or wi) to a category v ∈ V, n is the number of membership links, and El = {e1, e2, . . . , em} represents the set of all membership links forming the shortest path p from the concept wi to the category vj. The coverage score cScore(c, e) of an entity e by a Wikipedia category c is computed by the following formula:

cScore(c, e) = { 1  if there is a path between c and e
               { 0  otherwise    (5.8)

The semantic associativity score between a category and a set of entities is

Algorithm 1: computeSemRelatedness(d, C)
Input: d is a document, c1, . . . , cn is a set of categorization contexts (topics), and t is a threshold
1:  foreach ci, 1 ≤ i ≤ n do
2:      Find the intersection S of the top-k categories of d and ci
3:      foreach category Cp in S do
4:          Find all the entities belonging to Cp, denoted as Ecp
5:          Find the maximum of sr(ej, Ecp), ej ∈ Eee
6:      end
7:      Sort the maximums in descending order and compute the average of the top-k of them, denoted as ζi(ci)
8:      if ζi(ci) < t then
9:          ζi(ci) ← 0
10:     end
11: end
12: return ζ1(c1) + ζ2(c2) + · · · + ζn(cn)

defined as follows:

cSAssScore(c, Eee) = β · Σ_{e ∈ Eee} mScore(c, e) + (1 − β) · Σ_{e ∈ Eee} cScore(c, e)    (5.9)

where Eee = {expanded core entities}, c is the Wikipedia category, and β is a smoothing factor that controls the relative influence of the two scores. We used β = 0.8 in our experiments. It should be noted that expanding the core entities is used to reach the first objective, while mScore and cScore are used to satisfy the second and third objectives, respectively.
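A minimal sketch of Equations 5.6–5.9 follows, assuming the Wikipedia category network is given as a dictionary mapping each node to the categories it links to; mScore multiplies the 1/n link weights along the first shortest path found by a breadth-first search, and β = 0.8 as in our experiments.

from collections import deque

def m_score(concept, category, cat_links):
    """Eqs. 5.6-5.7: product of 1/n membership-link weights along a
    shortest path from `concept` to `category` in the category network."""
    queue, seen = deque([(concept, 1.0)]), {concept}
    while queue:
        node, score = queue.popleft()
        links = cat_links.get(node, [])
        if not links:
            continue
        w = 1.0 / len(links)  # m(e_k) = 1/n for each outgoing membership link
        for cat in links:
            if cat == category:
                return score * w
            if cat not in seen:
                seen.add(cat)
                queue.append((cat, score * w))
    return 0.0

def c_score(category, entity, cat_links):
    """Eq. 5.8: 1 if some path connects the entity to the category."""
    return 1.0 if m_score(entity, category, cat_links) > 0 else 0.0

def csass_score(category, expanded_entities, cat_links, beta=0.8):
    """Eq. 5.9: weighted combination of membership and coverage scores."""
    m = sum(m_score(e, category, cat_links) for e in expanded_entities)
    c = sum(c_score(category, e, cat_links) for e in expanded_entities)
    return beta * m + (1 - beta) * c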

Document Categorization to Ontological Context

To find the categorization score of a document to an ontological context (topic), we start by measuring the semantic relatedness among Wikipedia entities (concepts). In order to do that, we adopted the Wikipedia Link-based Measure (WLM) introduced in [118]. Given two Wikipedia entities a and b, we define the semantic relatedness between them as follows:

sr(a, b) = (log(max(|A|, |B|)) − log(|A ∩ B|)) / (log(|W|) − log(min(|A|, |B|)))    (5.10)

where A and B are the sets of Wikipedia entities that link to a and b, respectively, and W is the set of all entities in Wikipedia. By extension, the semantic relatedness between a Wikipedia entity a and a categorization context C (a categorization context is a projection of C onto the background ontology, including entities {e1, e2, . . . , et}) is defined as follows:

sr(a, C) = (1/t) Σ_{i=1}^{t} sr(a, ei)    (5.11)

If ζi(ci) ≥ t, we conclude that document d belongs to topic ci. The threshold t is established empirically. Note that in case a topic is defined as a composition of contexts, we determine the final score of a document based on the set operators used in the composition, as follows:

• ci ∩ cj : ζ = min(ζi(ci), ζj(cj))

• ci ∪ cj : ζ = max(ζi(ci), ζj(cj))

• !ci : ζ = 1 − ζi(ci)
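Equations 5.10 and 5.11, together with the composition rules above, admit a compact sketch, assuming the inlink sets of each Wikipedia entity and the total number of entities are available; returning 0 when two entities share no inlinks is an assumption of this sketch rather than part of the definition.

import math

def wlm_sr(a, b, inlinks, total_entities):
    """Eq. 5.10: Wikipedia Link-based Measure between two entities,
    where inlinks[x] is the set of entities linking to x."""
    A, B = inlinks[a], inlinks[b]
    common = len(A & B)
    if common == 0:
        return 0.0  # no shared inlinks; score set to 0 by assumption
    num = math.log(max(len(A), len(B))) - math.log(common)
    den = math.log(total_entities) - math.log(min(len(A), len(B)))
    return num / den

def sr_to_context(a, context_entities, inlinks, total_entities):
    """Eq. 5.11: average relatedness of entity a to the context's entities."""
    return sum(wlm_sr(a, e, inlinks, total_entities)
               for e in context_entities) / len(context_entities)

def compose(op, zi, zj=None):
    """Final score of a context composition per the rules above."""
    return {"and": lambda: min(zi, zj),
            "or": lambda: max(zi, zj),
            "not": lambda: 1 - zi}[op]()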

5.5 Experiments

In our experiments, we used an RDF ontology created from the full version of the English Wikipedia XML dump from 2013-06-04. The created ontology contained 5,047,075 entities connected by 287,016,171 statements and 13,062,411 literals describing the entities. They were classified using 930,472 categories defined in Wikipedia. We used Virtuoso3 for ontology storage (triple store) and querying. We evaluated our system on a text corpus obtained from the Reuters4 RSS feed (2013-10-24 to 2014-01-30). We divided some of the main topics into fine-grained sub-topics in order to evaluate our classification method. The details of the text corpus, as well as the fine-grained categorization of the main topics, are presented in Table 5.1 and Table 5.2, respectively. Due to the nature of the analyzed news documents, we decided to exclude time-related entities from the categorization process, since they provided highly misleading connections among other entities.

Experiment Results

We conducted three experiments on our Reuters corpus. In the first experiment, we wanted to assess the basic topic categorization of our system. Here, we created

3http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/
4http://www.reuters.com/

Table 5.1: Category details of the text corpus used

Reuters Category    Number of Documents
Sports              254
Technology          927
Business            786
Arts                94
Science             140
Health              864
Politics            807
Total               3,872

Table 5.2: Fine-grained categorization of main categories

Main Topics
Sports                     Technology           Business
Sub-topics                 Sub-topics           Sub-topics
Baseball                   Digital Technology   Economics
Basketball                 Space Technology     Industry
National Hockey League     Mobile Technology    Financial Markets
Tennis                     Telecommunications
Golf
National Football League
Football (Soccer)

Table 5.3: High-level ontological contexts categorization

TOPICS     MAP
Arts       96.80%
Science    92.10%
Health     90.60%
Politics   95.70%
Total      93.80%

categorization contexts consisting of high-level Wikipedia categories5 to represent the topics best corresponding to those in the Reuters corpus. The defined contexts included Wikipedia categories with names directly corresponding to the Reuters category names. Table 5.3 shows the micro-averaged precision (MAP) of the first experiment. In the second experiment, we evaluated the effectiveness of categorizing into topics composed of unions of contexts. Therefore, we created topics for Sports, Technology and Business as the unions of their sub-topics shown in Table 5.2 (e.g., Business = Economics ∪ Industry ∪ Financial Markets). Thus, we identified high-level topics of documents and specific sub-topics within them. The results are presented in Table 5.4. In the third experiment, we assessed our system’s ability to categorize documents into topics expressed as more complex context compositions. Consequently, we created compositions of contexts from the “technology”, “business” and “politics” topics. The topics were defined as follows:

5We have experimented with YAGO, but due to its coarse-grained categorization we used a Wikipedia-based ontology.

Table 5.4: Categorization based on unions of sub-categories

TOPICS       MAP
Sports       97.80%
Technology   85.60%
Business     79.40%
Total        87.60%

• (Digital ∩ Telecom) ∩ (!Mobile)

• Economics ∩ (!Financial Markets)

• !(Economics ∪ Industry ∪ Financial Markets)

• Politics ∩ (!Immigration)

We chose random samples of 94, 68 and 236 documents from the “technology”, “business” and “politics” topics, respectively, to measure the MAP. Table 5.5 presents the details of the third experiment.

Table 5.5: Categorization based on compositions of contexts

TOPICS                                          MAP
(Digital ∩ Telecom) ∩ (!Mobile)                 86.70%
Economics ∩ (!Financial Markets)                80.00%
!(Economics ∪ Industry ∪ Financial Markets)     100%
Politics ∩ (!Immigration)                       90.40%
Total                                           89.30%

Results Analysis

Our ontology-based categorization method achieved very good results. These results are especially promising in view of the fact that our method did not rely on classifier training and that it can be readily applied to any other set of topics defined as classification contexts or their compositions. The analysis of the incorrectly classified documents and the created semantic graphs revealed the following possible causes of misclassification:

(a) The prepared categorization contexts used the Wikipedia category hierarchy, which does not always reflect the topics covered in Reuters news.

(b) In some cases, highly connected domains in Wikipedia favored a context different from the major one described in the document. This was caused by the imbalance between densely and sparsely populated domains in Wikipedia.

(c) In some cases, although the categories are entirely different, one is a subcategory of the other. For example, “American football” and “football (soccer)” are two different sports, but the former is a subcategory of the latter in Wikipedia, which results in categorizing the “American football” documents into the “football (soccer)” category as well.

Despite the imprecise coverage of the news topics by the Wikipedia-based categorization contexts and the imbalance in the coverage of some domains, our ontology-based, training-less categorization method was able to achieve results comparable to the traditional, training-based categorization methods. We intend

to conduct a thorough evaluation of our categorization method and compare it with traditional methods, e.g., Naïve Bayes, SVM, etc. An important aspect of the proposed method is that with a change of the user’s topics of interest, the classification contexts can be easily redefined and the documents can be re-classified into the newly defined contexts without the need for a new set of training documents and classifier re-training.

5.6 Conclusion and Future Work

We presented a novel approach to text categorization, relying only on ontological knowledge. Categories of interest can be defined as context projections or their combinations. Our experiments proved the applicability of ontologies for automatic text categorization and demonstrated the significant value of the knowledge represented in Wikipedia when applied to this problem. We intend to conduct additional testing of our method, especially involving combinations of classification contexts.

Chapter 6

OntoLDA: An Ontology-based Topic Model for Automatic Topic Labeling1

1Mehdi Allahyari and Krys Kochut. “OntoLDA: An Ontology-based Topic Model for Automatic Topic Labeling”. Submitted to the Semantic Web Journal. A shorter, preliminary version of the paper, “Automatic Topic Labeling Using Ontology-Based Topic Models”, has been published in 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), Pages 259 - 264.

Abstract

Topic models, which frequently represent topics as multinomial distributions over words, have been extensively used for discovering latent topics in text corpora. Topic labeling, which aims to assign meaningful labels for discovered topics, has recently gained significant attention. In this paper, we argue that the quality of topic labeling can be improved by considering ontology concepts rather than words alone, in contrast to previous works in this area, which usually represent topics via groups of words selected from topics. We have created (1) a topic model that integrates ontological concepts with topic models in a single framework, where each topic is represented as a multinomial distribution over concepts and each concept is a multinomial distribution over words, and (2) a topic labeling method based on the ontological meaning of the concepts included in the discovered topics. In selecting the best topic labels, we rely on the semantic relatedness of the concepts and their ontological classifications. The results of our experiments conducted on two different data sets show that introducing concepts as additional, richer features between topics and words and describing topics in terms of concepts offers an effective method for generating meaningful labels for the discovered topics.

6.1 Introduction

Topic models such as Latent Dirichlet Allocation (LDA) [17] have recently gained considerable attention. They have been successfully applied to a wide variety of text mining tasks, such as word sense disambiguation [47, 20], sentiment analysis [62], information retrieval [117] and others, in order to identify hidden topics in text documents. Topic models typically assume that documents are mixtures of topics, while topics are probability distributions over the vocabulary. When the topic proportions of documents are estimated, they can be used as the themes (high-level representations of the semantics) of the documents. The highest-ranked words in a topic-word distribution indicate the meaning of the topic. Thus, topic models provide an effective framework for extracting the latent semantics from unstructured text collections. For example, Table 6.1 shows the top words of a topic learned from a collection of computer science abstracts; the topic has been labeled “relational databases” by a human. However, even though the topic-word distributions are usually meaningful, it is very challenging for the users to accurately interpret the meaning of the topics based only on the word distributions extracted from the corpus, particularly when they are not familiar with the domain of the corpus. It would be very difficult to answer questions such as “What does a topic inform about?” and “What is a good enough label for a topic?” Topic labeling means finding one or a few phrases that sufficiently explain the meaning of the topic. This task, which can be labor-intensive particularly when dealing with hundreds of topics, has recently attracted considerable attention.

Table 6.1: Example of a topic with its label.

Human Label: relational databases
query
database
databases
queries
processing
efficient
relational
object
xml
systems

The aim of this research is to automatically generate good labels for the topics. But what makes a label good for a topic? We assume that a good label: (1) should be semantically relevant to the topic; (2) should be understandable to the user; and (3) should highly cover the meaning of the topic. For instance, “relational databases”, “databases” and “database systems” are a few good labels for the example topic illustrated in Table 6.1. Within the Semantic Web, numerous data sources have been published as ontologies. Many of them are inter-connected as Linked Open Data (LOD)2. Linked Open Data provides rich knowledge in multiple domains, which is a valuable asset when used in combination with various analyses based on unsupervised topic models, in particular for topic labeling. For example, DBpedia [11] (as part of LOD) is a publicly available knowledge base extracted from Wikipedia in the form of an ontology of concepts and relationships, making this vast amount of information programmatically accessible on the Web. The principal objective of the research presented here is to leverage and incorporate the semantic graph of concepts in an ontology, DBpedia in this work, and their various properties within unsupervised topic models, such as LDA. In

2http://linkeddata.org/

our model, we introduce another latent variable called concept, i.e., an ontological concept, between topics and words. Thus, each document is a multinomial distribution over topics, where each topic is represented as a multinomial distribution over concepts, and each concept is defined as a multinomial distribution over words. Defining the concept latent variable as another layer between topics and words has multiple advantages: (1) it gives us much more information about the topics; (2) it allows us to illustrate topics more specifically, based on ontology concepts rather than words, which can be used to label topics; (3) it automatically integrates topics with knowledge bases. We first presented our ontology-based topic model, the OntoLDA model, in [1], where we showed that incorporating ontological concepts with topic models improves the quality of topic labeling. In this paper, we elaborate on and extend these results. We also extensively explore the theoretical foundation of our ontology-based framework, demonstrating the effectiveness of our proposed model over two datasets. Our contributions in this work are as follows:

1. We propose an ontology-based topic model, OntoLDA, which incorporates an ontology into the topic model in a systematic manner. Our model integrates the topics with external knowledge bases, which can benefit other research areas such as information retrieval, classification and visualization.

2. We introduce a topic labeling method, based on the semantics of the concepts that are included in the discovered topics, as well as ontological relationships existing among the concepts in the ontology. Our model improves the labeling accuracy by exploiting the topic-concept relations and can automatically

generate labels that are meaningful for interpreting the topics.

3. We demonstrate the usefulness of our approach in two ways. We first show how our model can be exploited to link text documents to ontology concepts and categories. Then we illustrate automatic topic labeling by performing a series of experiments.

The paper is organized as follows. In section 2, we formally define our model for labeling the topics by integrating the ontological concepts with probabilistic topic models. We present our method for concept-based topic labeling in section 3. In section 4, we demonstrate the effectiveness of our method on two different datasets. Finally, we present our conclusions and future work in section 5.

6.2 Background

In this section, we formally describe some of the related concepts and notations that will be used throughout this paper.

6.2.1 Ontologies

Ontologies are fundamental elements of the Semantic Web and can be thought of as knowledge representation methods, which are used to specify the knowledge shared among different systems. An ontology is defined as an “explicit specification of a conceptualization” [39]. In other words, an ontology is a structure consisting of a set of concepts and a set of relationships existing among them.

Ontologies have been widely used as the background knowledge (i.e., knowledge bases) in a variety of text mining and knowledge discovery tasks such as text clustering [35, 46, 45], text classification [2, 70, 21], word sense disambiguation [19, 63, 64], and others. See [90] for a comprehensive review of Semantic Web in data mining and knowledge discovery.

6.2.2 Probabilistic Topic Models

Probabilistic topic models are a set of algorithms that are used to uncover the hidden thematic structure of a collection of documents. The main idea of topic modeling is to create a probabilistic generative model for the corpus of text documents. In topic models, documents are mixtures of topics, where a topic is a probability distribution over words. The two main topic models are Probabilistic Latent Semantic Analysis (pLSA) [44] and Latent Dirichlet Allocation (LDA) [17]. Hofmann (1999) introduced pLSA for document modeling. The pLSA model does not provide a probabilistic model at the document level: each document is represented by a list of probability values p(z|d), but these numbers are not generated from a probabilistic model, which makes it difficult for pLSA to generalize to new, unseen documents. Blei et al. [17] extended this model by introducing a Dirichlet prior on the per-document mixture weights of topics, and called the model Latent Dirichlet Allocation (LDA). In this section, we describe the LDA method. Latent Dirichlet Allocation (LDA) [17] is a generative probabilistic model for extracting the thematic information (topics) of a collection of documents. LDA assumes that each document is made up of various topics, where each topic is a

probability distribution over words.

Let D = {d1, d2, . . . , dD} be the corpus and V = {w1, w2, . . . , wV} be the vocabulary of the corpus. A topic zj, 1 ≤ j ≤ K, is represented as a multinomial probability distribution over the V words, p(wi|zj), with Σ_{i=1}^{V} p(wi|zj) = 1. LDA generates the words in a two-stage process: words are generated from topics and topics are generated by documents. More formally, the distribution of words given the document is calculated as follows:

p(wi|d) = Σ_{j=1}^{K} p(wi|zj) p(zj|d)    (6.1)

The graphical model of LDA is shown in Figure 6.1 and the generative process for the corpus D is as follows:

1. For each topic k ∈ {1, 2,...,K}, sample a word distribution φk ∼ Dir(β)

2. For each document d ∈ {1, 2,...,D},

(a) Sample a topic distribution θd ∼ Dir(α)

(b) For each word wn, where n ∈ {1, 2,...,N}, in document d,

i. Sample a topic zi ∼ Mult(θd)

ii. Sample a word wn ∼ Mult(φzi )

The joint distribution of the model (hidden and observed variables) is:

P(φ1:K, θ1:D, z1:D, w1:D) = ∏_{j=1}^{K} P(φj|β) ∏_{d=1}^{D} ( P(θd|α) ∏_{n=1}^{N} P(zd,n|θd) P(wd,n|φ1:K, zd,n) )    (6.2)

Figure 6.1: LDA Graphical Model

In the LDA model, the word-topic distribution p(w|z) and the topic-document distribution p(z|d) are learned entirely in an unsupervised manner, without any prior knowledge about which words are related to the topics and which topics are related to individual documents. One of the most widely used approximate inference techniques is Gibbs sampling [38]. Gibbs sampling begins with a random assignment of words to topics; then the algorithm iterates over all the words in the training documents for a number of iterations (usually on the order of 100). In each iteration, it samples a new topic assignment for each word using the conditional distribution of that word given all other current word-topic assignments. After the iterations are finished, the algorithm reaches a steady state, and the word-topic probability

distributions can be estimated using word-topic assignments.
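The procedure just described can be written compactly. Below is a minimal collapsed Gibbs sampler for LDA over a small bag-of-words corpus, intended purely as an illustration of the sampling step; the toy corpus and hyperparameter values are assumptions, and a real run would use a proper corpus, more iterations and convergence checks.

import numpy as np

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs: list of word-id lists; V: vocabulary size; K: number of topics.
    Returns the estimated topic-word distribution phi (K x V).
    """
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), K))  # document-topic counts
    nkw = np.zeros((K, V))          # topic-word counts
    nk = np.zeros(K)                # topic totals
    z = []                          # current topic assignment per token

    # Random initialization of word-topic assignments.
    for d, doc in enumerate(docs):
        zd = rng.integers(K, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]  # remove the current assignment from the counts
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # Conditional p(z = k | everything else), up to a constant.
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][n] = k  # resample and restore the counts
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

    return (nkw + beta) / (nk[:, None] + V * beta)

# Tiny toy corpus: word ids over a vocabulary of size 6.
docs = [[0, 1, 2, 0], [3, 4, 5, 4], [0, 2, 1, 1]]
print(gibbs_lda(docs, V=6, K=2).round(2))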

6.3 Motivating Example

Let us presume that we are given a collection of news articles and told to extract the common themes present in this corpus. Manual inspection of the articles is the simplest approach, but it is not practical for a large collection of documents. We can make use of topic models to solve this problem by assuming that a collection of text documents comprises a set of hidden themes, called topics. Each topic z is a multinomial distribution p(w|z) over the words w of the vocabulary. Similarly, each document is made up of these topics, which allows multiple topics to be present in the same document. We estimate both the topics and document-topic mixtures from the data simultaneously. When the topic proportions of documents are estimated, they can be used as the themes (high-level semantics) of the documents. Top-ranked words in a topic-word distribution indicate the meaning of the topic. For example, Table 6.2 shows a sample of four topics with their top-10 words learned from a corpus of news articles. Although the topic-word distributions are usually meaningful, it is very difficult for the users to accurately infer the meanings of the topics just from the top words, particularly when they are not familiar with the domain of the corpus. The standard LDA model does not automatically provide the labels of the topics. Essentially, for each topic it gives a distribution over the entire vocabulary. A label is one or a few phrases that

Table 6.2: Example topics with top-10 words learned from a document set.

Topic 1      Topic 2      Topic 3      Topic 4
company      film         drug         republican
mobile       show         drugs        house
technology   music        cancer       senate
facebook     year         fda          president
google       television   patients     state
apple        singer       reuters      republicans
online       years        disease      political
industry     movie        treatment    campaign
video        band         virus        party
business     actor        health       democratic

sufficiently explain the meaning of the topic. For instance, as shown in Table 6.2, topics do not have any labels, therefore they must be manually assigned. The topic labeling task can be labor-intensive, particularly when dealing with hundreds of topics. Table 6.3 illustrates the same topics that have been labeled (second row in the table) manually by a human. Automatic topic labeling, which aims to automatically generate meaningful labels for the topics, has recently attracted increasing attention [115, 75, 71, 60, 48]. Unlike previous works that have essentially concentrated on the topics learned from the LDA topic model and represented the topics by words, we propose an ontology-based topic model, OntoLDA, where topics are labeled by ontological concepts. We believe that the knowledge in the ontology can be integrated with the topic models to automatically generate topic labels that are semantically relevant, understandable for humans and highly cover the discovered topics. In other words,

Table 6.3: Example topics with top-10 words learned from a document set. The second row presents the manually assigned labels.

Topic 1        Topic 2            Topic 3     Topic 4
“Technology”   “Entertainment”    “Health”    “U.S. Politics”
company        film               drug        republican
mobile         show               drugs       house
technology     music              cancer      senate
facebook       year               fda         president
google         television         patients    state
apple          singer             reuters     republicans
online         years              disease     political
industry       movie              treatment   campaign
video          band               virus       party
business       actor              health      democratic

our aim is to incorporate the semantic graph of concepts in an ontology (e.g., DBpedia) and their various properties with unsupervised topic models, such as LDA, in a principled manner and exploit this information to automatically generate meaningful topic labels.

6.4 Related Work

Probabilistic topic modeling has been widely applied to various text mining tasks, such as text classification [42, 65, 94], word sense disambiguation [47, 20], sentiment analysis [62, 67], and others. A main challenge in such topic models is to interpret the semantics of each topic in

an accurate way. Early research on topic labeling usually considers the top-n words, ranked based on their marginal probability p(wi|zj) in that topic, as the primitive labels [17, 38]. This option is not satisfactory, because it requires significant effort to interpret the topic, particularly if the user is not familiar with the domain of the topic. For example, it would be very hard to infer the meaning of the topic shown in Table 6.1 based only on the top terms if someone is not knowledgeable about the “database” domain. The other conventional approach for topic labeling is to manually generate topic labels [74, 114]. This approach has disadvantages: (a) the labels are prone to subjectivity; and (b) the method cannot scale up, especially when dealing with a massive number of topics. Recently, automatic topic labeling has been an area of active research. [115] represented topics as multinomial distributions over n-grams, so the top n-grams of a topic can be used to label the topic. Mei et al. [75] proposed an approach to automatically label the topics by converting the labeling problem to an optimization problem. First, they generate candidate labels by extracting either bigrams or noun chunks from the collection of documents. Then, they rank the candidate labels based on the Kullback-Leibler (KL) divergence with a given topic, and choose a candidate label that has the minimum KL divergence and the maximum mutual information with the topic to label the corresponding topic. [71] introduced an algorithm for topic labeling based on a given topic hierarchy. Given a topic, they generate a candidate label set using the Google Directory hierarchy and find the best label according to a set of similarity measures.
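At its core, the ranking step of [75] scores each candidate label’s word distribution against the topic’s word distribution. The following is a minimal sketch of the KL-divergence part only; candidate generation and the mutual-information term are omitted, and the distributions and smoothing constant are illustrative assumptions.

import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) with smoothing to avoid taking the log of zero."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def rank_labels(topic_dist, label_dists):
    """Rank candidate labels by ascending KL divergence to the topic.

    label_dists: {label: word distribution over the same vocabulary}."""
    return sorted(label_dists,
                  key=lambda l: kl_divergence(topic_dist, label_dists[l]))

# Toy 5-word vocabulary; the distributions are illustrative only.
topic = [0.40, 0.30, 0.20, 0.05, 0.05]
labels = {"relational databases": [0.35, 0.35, 0.20, 0.05, 0.05],
          "machine learning":     [0.05, 0.05, 0.10, 0.40, 0.40]}
print(rank_labels(topic, labels))  # "relational databases" ranks first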

Lau et al. [59] introduced a method for topic labeling by selecting the best topic word as its label based on a number of features. They assume that the topic terms are representative enough and appropriate to be considered as labels, which is not always the case. Lau et al. [60] reused the features proposed in [59] and also extended the set of candidate labels by exploiting Wikipedia. For each topic, they first select the top terms and query Wikipedia to find the top article titles having these terms according to the features, and consider them as extra candidate labels. Then they rank the candidates to find the best label for the topic. Mao et al. [73] proposed a topic labeling approach which enhances the labeling by using the sibling and parent-child relations between topics. They first generate a set of candidate labels by extracting meaningful phrases using Ngram Testing [27] for a topic and adding the top topic terms to the set based on marginal term probabilities. They then rank the candidate labels by exploiting the hierarchical structure between topics and pick the best candidate as the label of the topic. In a more recent work, Hulpus et al. [48] proposed an automatic topic labeling approach by exploiting structured data from DBpedia3. Given a topic, they first find the terms with the highest marginal probabilities, and then determine a set of DBpedia concepts, where each concept represents the identified sense of one of the top terms of the topic. After that, they create a graph out of the concepts and use graph centrality algorithms to identify the most representative concepts for the topic. Our work is different from all previous works in that we propose a topic model

3http://dbpedia.org

Prior works basically focus on the topics learned via the LDA topic model (i.e., topics are multinomial distributions over words), whereas in our model we introduce another latent variable, called concept, between topics and words: each document is a multinomial distribution over topics, each topic is represented as a multinomial distribution over concepts, and each concept is defined as a multinomial distribution over words. The hierarchical topic models, which represent correlations among topics, are conceptually related to our OntoLDA model. Mimno et al. [81] proposed the hPAM model, which models a document as a mixture of distributions over super-topics and sub-topics, using a directed acyclic graph to represent a topic hierarchy. The OntoLDA model is different, because in hPAM, the distribution of each super-topic over sub-topics depends on the document, whereas in OntoLDA, the distributions of topics over concepts are independent of the corpus and are based on an ontology. The other difference is that sub-topics in the hPAM model are still unigram words, whereas in OntoLDA, ontological concepts are n-grams, which makes them more specific and more meaningful, a key point in OntoLDA. [25, 26] introduced topic models that combine concepts with data-driven topics. The key idea in their frameworks is that topics from the statistical topic models and concepts of the ontology are both represented by a set of "focused" words, i.e., distributions over words, and they use this similarity in their models. However, our OntoLDA model is different from these models in that they treat the concepts and topics in the same way, whereas in OntoLDA, concepts and topics form two distinct layers in the model.

6.5 Problem Formulation

In this section, we formally describe our model and its learning process. We then explain, in Section 6.6, how to leverage the topic-concept distribution to generate meaningful semantic labels for each topic. The notation used in this paper is summarized in Table 6.5. Most topic models, like LDA, consider each document as a mixture of topics, where each topic is defined as a multinomial distribution over the vocabulary. Unlike LDA, OntoLDA defines another latent variable, called concept, between topics and words: each document is a multinomial distribution over topics, each topic is represented as a multinomial distribution over concepts, and each concept is defined as a multinomial distribution over words. The intuition behind our model is that using words from the vocabulary of the document corpus to represent topics is not a good way to convey the meaning of the topics. Words usually describe topics in a broad way, while ontological concepts express the topics in a more focused way. Additionally, the concepts representing a topic are semantically more closely related to each other. As an example, the first column of Table 6.4 lists a topic learned by standard LDA and represented by its top words, whereas the second column shows the same topic learned by the OntoLDA model, which represents the topic using ontology concepts. From the topic-word representation we can conclude that the topic is about "sports", but the topic-concept representation indicates that the topic is not only about "sports" but, more specifically, about "American sports".

Table 6.4: Example of topic-word representation learned by LDA and topic-concept representation learned by OntoLDA.

LDA (Human Label: Sports)       OntoLDA (Human Label: American Sports)
Topic word (Probability)        Topic concept (Probability)
team (0.123)                    oakland raiders (0.174)
est (0.101)                     san francisco giants (0.118)
home (0.022)                    red (0.087)
league (0.015)                  new jersey devils (0.074)
games (0.010)                   boston red sox (0.068)
second (0.010)                  kansas city chiefs (0.054)

Let C = {c_1, c_2, . . . , c_C} be the set of DBpedia concepts, and let D = {d_i}_{i=1}^{D} be a collection of documents. We represent a document d in the collection D as a bag of words, i.e., d = {w_1, w_2, . . . , w_V}, where V is the size of the vocabulary.

Definition 12. (Concept): A concept in a text collection D is represented by c and defined as a multinomial distribution over the vocabulary V, i.e., {p(w|c)}_{w∈V}. Clearly, we have Σ_{w∈V} p(w|c) = 1. We assume that there are |C′| concepts in D, where C′ ⊂ C.

Definition 13. (Topic): A topic φ in a given text collection D is defined as a multinomial distribution over the concepts C, i.e., {p(c|φ)}_{c∈C}. Clearly, we have Σ_{c∈C} p(c|φ) = 1. We assume that there are K topics in D.

Definition 14. (Topic representation): The topic representation of a document d, θ_d, is defined as a probability distribution over the K topics, i.e., {p(φ_k|θ_d)}_{k∈K}.

Table 6.5: Notation used in this paper

Symbol   Description
D        number of documents
K        number of topics
C        number of concepts
V        number of words
N_d      number of words in document d
α_t      asymmetric Dirichlet prior for topic t
β        symmetric Dirichlet prior for the topic-concept distribution
γ        symmetric Dirichlet prior for the concept-word distribution
z_i      topic assigned to the word at position i in document d
c_i      concept assigned to the word at position i in document d
w_i      word at position i in document d
θ_d      multinomial distribution of topics for document d
φ_k      multinomial distribution of concepts for topic k
ζ_c      multinomial distribution of words for concept c

Definition 15. (Topic Modeling): Given a collection of text documents D, the task of Topic Modeling aims at discovering and extracting K topics, i.e., {φ_1, φ_2, . . . , φ_K}, where the number of topics, K, is specified by the user.

6.5.1 The OntoLDA Topic Model

The key idea of the OntoLDA topic model is to integrate ontology concepts directly with topic models. Thus, topics are represented as distributions over concepts, and concepts are defined as distributions over the vocabulary. Later in this paper, concepts will also be used to identify appropriate labels for topics. The OntoLDA topic model is illustrated in Figure 6.2 and the generative process is defined in Algorithm 2.

Figure 6.2: Graphical representation of OntoLDA model

Algorithm 2: OntoLDA Topic Model

1  foreach concept c ∈ {1, 2, . . . , C} do
2      Draw a word distribution ζc ∼ Dir(γ)
3  end
4  foreach topic k ∈ {1, 2, . . . , K} do
5      Draw a concept distribution φk ∼ Dir(β)
6  end
7  foreach document d ∈ {1, 2, . . . , D} do
8      Draw a topic distribution θd ∼ Dir(α)
9      foreach word w of document d do
10         Draw a topic z ∼ Mult(θd)
11         Draw a concept c ∼ Mult(φz)
12         Draw a word w from concept c, w ∼ Mult(ζc)
13     end
14 end
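To make the generative story concrete, the following is a minimal numpy sketch of Algorithm 2. All corpus sizes, hyperparameter values, and variable names are illustrative choices, not the settings used in our experiments.

    # A minimal sketch of the OntoLDA generative process (Algorithm 2).
    import numpy as np

    rng = np.random.default_rng(0)
    D, K, C, V, N_d = 5, 3, 8, 50, 20         # documents, topics, concepts, vocabulary, doc length
    alpha, beta, gamma = 0.1, 0.01, 0.01      # Dirichlet hyperparameters

    zeta = rng.dirichlet(np.full(V, gamma), size=C)   # concept-word distributions (line 2)
    phi = rng.dirichlet(np.full(C, beta), size=K)     # topic-concept distributions (line 5)

    corpus = []
    for d in range(D):
        theta_d = rng.dirichlet(np.full(K, alpha))    # document-topic distribution (line 8)
        words = []
        for _ in range(N_d):
            z = rng.choice(K, p=theta_d)              # draw a topic (line 10)
            c = rng.choice(C, p=phi[z])               # draw a concept (line 11)
            w = rng.choice(V, p=zeta[c])              # draw a word from the concept (line 12)
            words.append(w)
        corpus.append(words)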

Following this process, the joint probability of generating a corpus D = {d_1, d_2, . . . , d_{|D|}}, the topic assignments z, and the concept assignments c, given the hyperparameters α, β and γ, is:

P(w, c, z | α, β, γ) = ∫_ζ P(ζ|γ) ∏_d Σ_{c_d} P(w_d | c_d, ζ) ∫_φ P(φ|β) ∫_θ P(θ|α) P(c_d | θ, φ) dθ dφ dζ    (6.3)

6.5.2 Inference using Gibbs Sampling

Since exact posterior inference in OntoLDA is intractable, we need an algorithm to estimate the posterior distribution. A variety of algorithms have been used to estimate the parameters of topic models, such as variational EM [17] and Gibbs sampling [38]. In this paper, we use a collapsed Gibbs sampling procedure for the OntoLDA topic model. Collapsed Gibbs sampling [38] is a Markov Chain Monte Carlo (MCMC) [91] algorithm which constructs a Markov chain over the latent variables in the model and converges to the posterior distribution after a number of iterations. In our case, we aim to construct a Markov chain that converges to the posterior distribution over z and c, conditioned on the observed words w and the hyperparameters α, β and γ. We use blocked Gibbs sampling to jointly sample z and c, although we could alternatively perform hierarchical sampling, i.e., first sample z and then sample c. Nonetheless, Rosen-Zvi et al. [93] argue that in cases where latent variables are highly dependent, blocked sampling boosts the convergence of the Markov chain and decreases auto-correlation as well. We derive the posterior inference from Eq. 6.3 as follows:

P(z, c | w, α, β, γ) = P(z, c, w | α, β, γ) / P(w | α, β, γ) ∝ P(z, c, w | α, β, γ) ∝ P(z) P(c|z) P(w|c)    (6.4)

where

P(z) = ( Γ(Kα) / Γ(α)^K )^D ∏_{d=1}^{D} [ ∏_{k=1}^{K} Γ(n_k^{(d)} + α) ] / Γ( Σ_{k′} (n_{k′}^{(d)} + α) )    (6.5)

P(c|z) = ( Γ(Cβ) / Γ(β)^C )^K ∏_{k=1}^{K} [ ∏_{c=1}^{C} Γ(n_c^{(k)} + β) ] / Γ( Σ_{c′} (n_{c′}^{(k)} + β) )    (6.6)

P(w|c) = ( Γ(Vγ) / Γ(γ)^V )^C ∏_{c=1}^{C} [ ∏_{w=1}^{V} Γ(n_w^{(c)} + γ) ] / Γ( Σ_{w′} (n_{w′}^{(c)} + γ) )    (6.7)

where P(z) is the probability of the joint topic assignments z to all the words w in the corpus D, P(c|z) is the conditional probability of the joint concept assignments c to all the words w in the corpus D given all topic assignments z, and P(w|c) is the conditional probability of all the words w in the corpus D given all concept assignments c. For a word token w at position i, the full conditional distribution can be written as:

P(z_i = k, c_i = c | w_i = w, z_{−i}, c_{−i}, w_{−i}, α, β, γ) ∝
    (n_{k,−i}^{(d)} + α_k) / Σ_{k′} (n_{k′,−i}^{(d)} + α_{k′}) × (n_{c,−i}^{(k)} + β) / Σ_{c′} (n_{c′,−i}^{(k)} + β) × (n_{w,−i}^{(c)} + γ) / Σ_{w′} (n_{w′,−i}^{(c)} + γ)    (6.8)

where n_w^{(c)} is the number of times word w is assigned to concept c, n_c^{(k)} is the number of times concept c occurs under topic k, and n_k^{(d)} denotes the number of times topic k is associated with document d. The subscript −i indicates that the contribution of the current word w_i being sampled is removed from the counts. In most probabilistic topic models, the Dirichlet parameters α are assumed to be given and fixed, which still produces reasonable results. However, as described in [107], an asymmetric Dirichlet prior α has substantial advantages over a symmetric prior, so we learn these parameters in our proposed model. We could use maximum likelihood or maximum a posteriori estimation to learn α; however, these methods have no closed-form solutions, so for the sake of simplicity and speed we use moment matching methods [83] to approximate the parameters of α. In each iteration of Gibbs sampling, we update

mean_{dk} = (1/N) Σ_d ( n_k^{(d)} / n^{(d)} )

var_{dk} = (1/N) Σ_d ( n_k^{(d)} / n^{(d)} − mean_{dk} )^2

m_{dk} = mean_{dk} (1 − mean_{dk}) / var_{dk} − 1

α_{dk} ∝ mean_{dk},    Σ_{k=1}^{K} α_{dk} = exp( Σ_{k=1}^{K} log(m_{dk}) / (K − 1) )    (6.9)

For each document d and topic k, we first compute the sample mean mean_{dk} and the sample variance var_{dk}; N is the number of documents and n^{(d)} is the number of words in document d. Algorithm 3 shows the Gibbs sampling process for our OntoLDA model. After Gibbs sampling, we can use the sampled topics and concepts to estimate the probability of a topic given a document, θ_{dk}, the probability of a concept given a topic, φ_{kc}, and the probability of a word given a concept, ζ_{cw}:

θ_{dk} = (n_k^{(d)} + α_k) / Σ_{k′} (n_{k′}^{(d)} + α_{k′})    (6.10)

φ_{kc} = (n_c^{(k)} + β) / Σ_{c′} (n_{c′}^{(k)} + β)    (6.11)

ζ_{cw} = (n_w^{(c)} + γ) / Σ_{w′} (n_{w′}^{(c)} + γ)    (6.12)
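As an illustration, these point estimates can be computed directly from the final count matrices of the sampler. The sketch below assumes numpy arrays n_dk (D × K), n_kc (K × C), and n_cw (C × V) holding the counts defined above; it is a sketch of the estimation step, not the implementation used in our experiments.

    # Point estimates of Eqs. 6.10-6.12 from the Gibbs count matrices.
    import numpy as np

    def estimate_parameters(n_dk, n_kc, n_cw, alpha, beta, gamma):
        # theta_dk = (n_k^(d) + alpha_k) / sum_k' (n_k'^(d) + alpha_k')   (Eq. 6.10)
        theta = (n_dk + alpha) / (n_dk + alpha).sum(axis=1, keepdims=True)
        # phi_kc = (n_c^(k) + beta) / sum_c' (n_c'^(k) + beta)            (Eq. 6.11)
        phi = (n_kc + beta) / (n_kc + beta).sum(axis=1, keepdims=True)
        # zeta_cw = (n_w^(c) + gamma) / sum_w' (n_w'^(c) + gamma)         (Eq. 6.12)
        zeta = (n_cw + gamma) / (n_cw + gamma).sum(axis=1, keepdims=True)
        return theta, phi, zeta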

6.6 Concept-based Topic Labeling

The intuition behind our approach is that the entities (i.e., ontology concepts and instances) occurring in the text, along with the relationships among them, can determine the document's topic(s). Furthermore, entities classified into the same or similar domains in the ontology are semantically closely related to each other. Hence, we rely on the semantic similarity between the information included in the text and a suitable fragment of the ontology in order to identify good labels for the topics. The research presented in [2] uses a similar approach to perform ontology-based text categorization.

Definition 16. (Topic Label): A topic label ℓ for topic φ is a sequence of words which is semantically meaningful and sufficiently explains the meaning of φ.

Our approach focuses only on the ontology concepts and their class hierarchy as topic labels.

Algorithm 3: OntoLDA Gibbs Sampling
Input: A collection of documents D, the number of topics K, and α, β, γ
Output: ζ = {p(w_i|c_j)}, φ = {p(c_j|z_k)} and θ = {p(z_k|d)}, i.e., the concept-word, topic-concept and document-topic distributions

1  /* Randomly initialize concept-word assignments for all word tokens, topic-concept assignments for all concepts and document-topic assignments for all documents */
2  initialize the parameters φ, θ and ζ randomly;
3  if computing parameter estimation then
4      initialize the alpha parameters, α, using Eq. 6.9;
5  end
6  t ← 0;
7  while t < MaxIteration do
8      foreach word w do
9          c = c(w)  // get the current concept assignment
10         k = z(w)  // get the current topic assignment
11         // Exclude the contribution of the current word w
12         n_w^(c) ← n_w^(c) − 1;
13         n_c^(k) ← n_c^(k) − 1;
14         n_k^(d) ← n_k^(d) − 1  // w is a word of document d
15         (new_k, new_c) = sample a new topic-concept pair for word w using Eq. 6.8;
16         // Increment the count matrices
17         n_w^(new_c) ← n_w^(new_c) + 1;
18         n_{new_c}^(new_k) ← n_{new_c}^(new_k) + 1;
19         n_{new_k}^(d) ← n_{new_k}^(d) + 1;
20         // Update the concept assignment and topic assignment vectors
21         c(w) = new_c;
22         z(w) = new_k;
23         if computing parameter estimation then
24             update the alpha parameters, α, using Eq. 6.9;
25         end
26     end
27     t ← t + 1;
28 end
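The heart of Algorithm 3 is the blocked update of Eq. 6.8. Below is a compact numpy sketch of a single update for one word token; the count matrices, assignment bookkeeping, and random generator are assumed to be set up elsewhere, and the constant denominator of the document-topic term is dropped since the distribution is normalized at the end.

    # One blocked Gibbs update (Eq. 6.8) for a single word token.
    import numpy as np

    def sample_topic_concept(d, w, k_old, c_old, n_dk, n_kc, n_cw, alpha, beta, gamma, rng):
        # Exclude the current token from the counts (the "-i" subscript)
        n_dk[d, k_old] -= 1; n_kc[k_old, c_old] -= 1; n_cw[c_old, w] -= 1

        # Unnormalized joint probability over all (topic, concept) pairs
        p_dk = n_dk[d] + alpha                                            # document-topic term
        p_kc = (n_kc + beta) / (n_kc + beta).sum(axis=1, keepdims=True)   # topic-concept term
        p_cw = (n_cw[:, w] + gamma) / (n_cw + gamma).sum(axis=1)          # concept-word term
        probs = p_dk[:, None] * p_kc * p_cw[None, :]                      # shape (K, C)
        probs /= probs.sum()

        flat = rng.choice(probs.size, p=probs.ravel())                    # jointly sample (z, c)
        k_new, c_new = np.unravel_index(flat, probs.shape)

        # Add the token back under its new assignments
        n_dk[d, k_new] += 1; n_kc[k_new, c_new] += 1; n_cw[c_new, w] += 1
        return k_new, c_new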


Finding meaningful and semantically relevant labels for an identified topic φ involves four primary steps: (1) construction of the semantic graph from the top concepts in the given topic; (2) selection and analysis of the thematic graph, a subgraph of the semantic graph; (3) topic label graph extraction from the thematic graph concepts; and (4) computation of the semantic similarity between topic φ and the candidate labels of the topic label graph.

Figure 6.3: Example of a topic represented by top concepts learned by OntoLDA.

Ontology Concepts (Topic 2)   Probability
oakland_raiders               0.17
san_francisco_giants          0.12
red                           0.09
new_jersey_devils             0.07
boston_red_sox                0.07
kansas_city_chiefs            0.05
aaron_rodgers                 0.04
kobe_bryant                   0.04
rafael_nadal                  0.04
Korean_War                    0.03
Paris                         0.02
Ryanair                       0.01
Dublin                        0.01
...

6.6.1 Semantic Graph Construction

We use the marginal probabilities p(c_i|φ_j) associated with each concept c_i in a given topic φ_j and extract the concepts with the highest marginal probabilities to construct the topic's semantic graph. Figure 6.3 shows the top-10 concepts of a topic learned by OntoLDA.

Figure 6.4: Semantic graph of the example topic φ described in Fig. 6.3, with |V^φ| = 13

Definition 17. (Semantic Graph): A semantic graph of a topic φ is a labeled graph G^φ = ⟨V^φ, E^φ⟩, where V^φ is a set of labeled vertices, which are the top concepts of φ (their labels are the concept labels from the ontology), and E^φ is a set of edges {⟨v_i, v_j⟩ with label r, such that v_i, v_j ∈ V^φ and v_i and v_j are connected by a relationship r in the ontology}.

For instance, Figure 6.4 shows the semantic graph of the example topic φ in Fig. 6.3, which consists of three sub-graphs (connected components). Although the ontology relationships induced in G^φ are directed, in this paper we treat G^φ as an undirected graph.

6.6.2 Thematic Graph Selection

The selection of the thematic graph is based on the assumption that concepts under a given topic are closely associated in the ontology, whereas concepts from different topics are placed far apart, or are not connected at all. Because topic models are statistical and data-driven, they may produce topics that are not coherent. In other words, for a given topic represented as a list of the K most probable concepts, there may be a few concepts that are not semantically close to the other concepts and, accordingly, to the topic. As a result, the topic's semantic graph may be composed of multiple connected components.

Definition 18. (Thematic graph): A thematic graph is a connected component of G^φ. In particular, if the entire G^φ is a connected graph, it is also a thematic graph.

Definition 19. (Dominant Thematic Graph): A thematic graph with the largest number of nodes is called the dominant thematic graph for topic φ.

Figure 6.5 depicts the dominant thematic graph for the example topic φ along with the initial weights of nodes, p(ci|φ).

6.6.3 Topic Label Graph Extraction

Figure 6.5: Dominant thematic graph of the example topic described in Fig. 6.4

The idea behind topic label graph extraction is to find ontology concepts to serve as candidate labels for the topic. We determine the importance of concepts in a thematic graph not only by their initial weights, which are the marginal probabilities of the concepts under the topic, but also by their relative positions in the graph. Here, we utilize the HITS algorithm [54] with the assigned initial weights for concepts to find the authoritative concepts in the dominant thematic graph. Subsequently, we locate the central concepts in the graph based on the geographical centrality measure, since these nodes can be identified as the thematic landmarks of the graph.
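The following sketch illustrates this step with networkx; the graph construction is assumed to have been done already. Closeness centrality is used here as one concrete stand-in for the centrality measure, and multiplying the authority and centrality scores is one simple way to combine the two rankings, which the text does not fix.

    # Core-concept selection on the dominant thematic graph (illustrative).
    import networkx as nx

    def core_concepts(thematic_graph, concept_weights, top_n=4):
        # HITS seeded with the topic's marginal concept probabilities p(c|phi)
        hubs, authorities = nx.hits(thematic_graph, nstart=concept_weights)
        # Centrality of each node within the thematic graph
        centrality = nx.closeness_centrality(thematic_graph)
        # Combine authority and centrality to rank candidate core concepts
        score = {v: authorities[v] * centrality[v] for v in thematic_graph}
        return sorted(score, key=score.get, reverse=True)[:top_n]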

Definition 20. (Core Concepts): The set of the most authoritative and central concepts in the dominant thematic graph forms the core concepts of the topic φ and is denoted by CC^φ.

The top-4 core concept nodes of the dominant thematic graph of the example topic φ are highlighted in Figure 6.6. It should be noted that "Boston Red Sox" has not been selected as a core concept, because its score is lower than that of the concept "Red", based on the HITS and centrality computations ("Red" has far more relationships to other concepts in DBpedia).

Figure 6.6: Core concepts of the dominant thematic graph of the example topic described in Fig. 6.5

From now on, we will simply write thematic graph when referring to the dominant thematic graph of a topic. To extract the topic label graph for the core concepts CC^φ, we primarily focus on the ontology class structure, since we can consider topic labeling as assigning class labels to topics. We introduce definitions similar to those in [48] for describing the label graph and the topic label graph.

Definition 21. (Label Graph): The label graph of a concept c_i is an undirected graph G_i = ⟨V_i, E_i⟩, where V_i is the union of {c_i} and a subset of ontology classes (c_i's types and their ancestors), and E_i is a set of edges labeled by rdf:type and rdfs:subClassOf connecting the nodes. Each node in the label graph, excluding c_i, is regarded as a label for c_i.

Definition 22. (Topic Label Graph): Let CC^φ = {c_1, c_2, . . . , c_m} be the core concept set. For each concept c_i ∈ CC^φ, we extract its label graph, G_i = ⟨V_i, E_i⟩, by traversing the ontology from c_i and retrieving all the nodes lying at most three hops away from c_i. The union of these graphs, G_{CC^φ} = ⟨V, E⟩ where V = ∪ V_i and E = ∪ E_i, is called the topic label graph.

It should be noted that we empirically restrict the ancestors to three levels, because increasing the distance further quickly leads to excessively general classes.
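A label graph can be extracted with a simple bounded breadth-first traversal, as in the sketch below. The neighbors function is a hypothetical helper that returns a node's rdf:type, rdfs:subClassOf (or skos:broader) successors in the ontology.

    # Bounded BFS extraction of a concept's label graph (Definition 22).
    from collections import deque

    def label_graph(core_concept, neighbors, max_hops=3):
        nodes, edges = {core_concept}, set()
        frontier = deque([(core_concept, 0)])
        while frontier:
            node, depth = frontier.popleft()
            if depth == max_hops:          # stop at three levels of ancestors
                continue
            for parent in neighbors(node):
                edges.add((node, parent))
                if parent not in nodes:
                    nodes.add(parent)
                    frontier.append((parent, depth + 1))
        return nodes, edges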

6.6.4 Semantic Relevance Scoring Function

In this section, we introduce a semantic relevance scoring function to rank the candidate labels by measuring their semantic similarity to a topic. Mei et al. [75] argue that the semantics of a topic should be interpreted based on two parameters: (1) the distribution of the topic; and (2) the context of the topic. Our topic label graph for a topic φ is extracted taking into account the topic distribution over the concepts, as well as the context of the topic in the form of semantic relatedness between the concepts in the ontology.

In order to find the semantic similarity of a label ℓ in G_{CC^φ} to a topic φ, we compute the semantic similarity between ℓ and all of the concepts in the core concept set CC^φ, rank the labels, and then select the best labels for the topic.

Figure 6.7: Label graph of the concept "Oakland Raiders" along with its mScore to the category "American Football League teams": mScore(Oakland_Raiders, American_Football_League_teams) = 1/5 × 1/4 = 0.05

A candidate label is scored according to three main objectives: (1) the label should cover the important concepts of the topic (i.e., concepts with higher marginal probabilities); (2) the label should be specific (lower in the class hierarchy) to the core concepts; and (3) the label should cover the highest number of core concepts in G_{CC^φ}. To compute the semantic similarity of a label to a concept, we first calculate the membership score and the coverage score. We have adopted a modified Vector-based Vector Generation method (VVG), described in [99], to calculate the membership score of a concept to a label. In the experiments described in this paper, we used DBpedia, an ontology created out of Wikipedia. All concepts in DBpedia are classified into DBpedia categories, and categories are inter-related via subcategory relationships, including skos:broader, skos:broaderOf, rdfs:subClassOf, rdf:type and dcterms:subject. We rely on these relationships for the construction of the label graph. Given the topic label graph G_{CC^φ}, we compute the similarity of a label ℓ to the core concepts of topic φ as follows.

If a concept c_i has been classified into N DBpedia categories or, similarly, if a category C_j has N parent categories, we set the weight of each of the membership (classification) relationships e to:

m(e) = 1/N    (6.13)

The membership score, mScore(c_i, C_j), of a concept c_i to a category C_j is defined as follows:

mScore(c_i, C_j) = ∏_{e_k ∈ E_l} m(e_k)    (6.14)

where E_l = {e_1, e_2, . . . , e_m} represents the set of all membership relationships forming the shortest path p from concept c_i to category C_j. Figure 6.7 illustrates a fragment of the label graph for the concept "Oakland Raiders" and shows how its membership score to the category "American Football League teams" is computed.

The coverage score, cScore(c_i, C_j), of a concept c_i to a category C_j is defined as follows:

cScore(c_i, C_j) = 1/d(c_i, C_j) if there is a path from c_i to C_j, and 0 otherwise    (6.15)

where d(c_i, C_j) is the length of the shortest path from c_i to C_j.

The semantic similarity between a concept c_i and a label ℓ in the topic label graph G_{CC^φ} is defined as follows:

SSim(c_i, ℓ) = w(c_i) · [ λ · mScore(c_i, ℓ) + (1 − λ) · cScore(c_i, ℓ) ]    (6.16)

where w(c_i) is the weight of c_i in G_{CC^φ}, which is the marginal probability of concept c_i under topic φ, i.e., w(c_i) = p(c_i|φ).

92 Table 6.6: Example of a topic with top-10 concepts (first column) and top-10 labels (second column) generated by our proposed method

Topic 2 (top concepts)   Top Labels
oakland raiders          National Football League teams
san francisco giants     American Football League teams
red                      American football teams in the San Francisco Bay Area
new jersey devils        Sports clubs established in 1960
boston red sox           National Football League teams in Los Angeles
kansas city chiefs       American Football League
nigeria                  American football teams in the United States by league
aaron rodgers            National Football League
kobe bryant              Green Bay Packers
rafael nadal             California Golden Bears football

Similarly, the semantic similarity between the set of core concepts CC^φ and a label ℓ in the topic label graph G_{CC^φ} is defined as:

SSim(CC^φ, ℓ) = (λ / |CC^φ|) Σ_{i=1}^{|CC^φ|} w(c_i) · mScore(c_i, ℓ) + (1 − λ) Σ_{i=1}^{|CC^φ|} w(c_i) · cScore(c_i, ℓ)    (6.17)

where λ is a smoothing factor that controls the influence of the two scores; we used λ = 0.8 in our experiments. It should be noted that the SSim(CC^φ, ℓ) score is not normalized and needs to be normalized. The scoring function aims to satisfy the three criteria by using the concept weight, mScore, and cScore for the first, second, and third objectives, respectively. This scoring function ranks a label node higher if the label covers more important topical concepts, if it is closer to the core concepts, and if it covers more core concepts. Top-ranked labels are selected as the labels for the given topic. Table 6.6 illustrates a topic along with the top-10 labels generated using our ontology-based framework.
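The sketch below illustrates how Eqs. 6.13-6.17 fit together. The helpers shortest_path (returning the list of membership edges on the shortest path from a concept to a label, or None) and out_degree (the number of parent categories of a node) are hypothetical stand-ins for the underlying graph queries.

    # Label scoring per Eqs. 6.13-6.17 (illustrative helpers assumed).
    def m_score(path, out_degree):
        score = 1.0
        for node, _parent in path:      # product of 1/N edge weights (Eqs. 6.13-6.14)
            score *= 1.0 / out_degree(node)
        return score

    def c_score(path):
        return 1.0 / len(path) if path else 0.0     # 1/d(c_i, l)  (Eq. 6.15)

    def label_score(core_concepts, weights, label, shortest_path, out_degree, lam=0.8):
        m_total = c_total = 0.0
        for c in core_concepts:         # aggregate over the core concept set (Eq. 6.17)
            path = shortest_path(c, label)
            if path is None:
                continue
            m_total += weights[c] * m_score(path, out_degree)
            c_total += weights[c] * c_score(path)
        return lam * m_total / len(core_concepts) + (1 - lam) * c_total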

6.7 Experiments

In order to demonstrate the effectiveness of our OntoLDA method, utilizing ontology-based topic models, we compared it to one of the state-of-the-art traditional, text-based approaches, described in [75]. We will refer to that method as Mei07. We selected two different data sets for our experiments. First, we extracted the top-2000 bigrams using the N-gram Statistics Package [6]. Then, we tested the significance of the bigrams using Student's t-test, and extracted the top 1000 candidate bigrams L. For each label ℓ ∈ L and topic φ, we computed the score s, defined by the authors as:

s(ℓ, φ) = Σ_w p(w|φ) · PMI(w, ℓ | D)    (6.18)

where PMI is the point-wise mutual information between the label ℓ and the topic words w, given the document corpus D. We selected the top-6 labels generated by the Mei07 method as the labels of the topic φ.
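For reference, the Mei07 score can be sketched as follows; the PMI function is assumed to be backed by precomputed document-level co-occurrence counts, for which an illustrative helper is included.

    # Mei07 label relevance score (Eq. 6.18) with an illustrative PMI helper.
    import math

    def mei07_score(label, topic_word_probs, pmi):
        # s(l, phi) = sum_w p(w|phi) * PMI(w, l | D)
        return sum(p_w * pmi(w, label) for w, p_w in topic_word_probs.items())

    def pmi_from_counts(n_w, n_l, n_wl, n_docs):
        # Document-level PMI: log [ p(w, l) / (p(w) p(l)) ]
        p_w, p_l, p_wl = n_w / n_docs, n_l / n_docs, n_wl / n_docs
        return math.log(p_wl / (p_w * p_l)) if p_wl > 0 else float("-inf")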

6.7.1 Data Sets and Concept Selection

The experiments in this paper are based on two text corpora and the DBpedia ontology. The text collections are the British Academic Written English Corpus (BAWE) [85] and a subset of the Reuters4 news articles. BAWE contains 2,761 documents of proficient university-level student writing that are fairly evenly divided into four broad disciplinary areas (Arts and Humanities, Social Sciences, Life Sciences and Physical Sciences) covering 32 disciplines. In this paper, we focused on the documents categorized as Life Sciences (covering Agriculture, Biological Sciences, Food Sciences, Health, Medicine and Psychology), consisting of D = 683 documents and 218,692 words. The second dataset is composed of D = 1,414 Reuters news articles divided into four main topics: Business, Politics, Science, and Sports, consisting of 155,746 words. Subsequently, we extracted 20 major topics from each dataset using OntoLDA and, similarly, 20 topics using Mei07. The DBpedia ontology, created from the English language subset of Wikipedia, includes over 5,000,000 concepts. Using the full set of concepts included in the ontology is computationally very expensive. Therefore, we selected a subset of concepts from DBpedia that were relevant to our datasets. We identified 16,719 concepts (named entities) mentioned in the BAWE dataset and 13,676 in the Reuters news dataset, and used these concept sets in our experiments.

6.7.2 Experimental Setup

We pre-processed the datasets by removing punctuation, stopwords, numbers, and words occurring fewer than 10 times in each corpus. For each concept in the two concept sets, we created a bag of words by downloading its Wikipedia page and collecting the text.

4http://www.reuters.com/

Eventually, we constructed a vocabulary for each concept set. Then, we created a vocabulary of W = 4,879 words based on the intersection of the vocabulary of the BAWE corpus and that of its corresponding concept set. We used this vocabulary for the experiments on the BAWE corpus. Similarly, we constructed a vocabulary of W = 3,855 words by intersecting the vocabulary of the Reuters news articles with that of its concept set, and used it for the Reuters experiments. We assumed symmetric Dirichlet priors and set β = 0.01 and γ = 0.01. We ran the Gibbs sampling algorithm for 500 iterations and computed the posterior inference after the last sampling iteration.

6.7.3 Results

Tables 6.7 and 6.8 present sample results of our topic labeling method, along with the labels generated by the Mei07 method, as well as the top-10 words for each topic. For example, the columns titled "Topic 1" show and compare the top-6 labels generated for the same topic under Mei07 and the proposed OntoLDA method, respectively; the top words for each topic are also shown in the respective tables. We believe that the labels generated by OntoLDA are more meaningful than the corresponding labels created by the Mei07 method. In order to quantitatively evaluate the two methods, we asked three human assessors to compare the labels. We selected a subset of topics in a random order and, for each topic, the judges were given the top-6 labels generated by the OntoLDA method and by Mei07. The labels were listed randomly, and for each label the assessors had to choose between "Good" and "Unrelated". We compared the two methods using Precision@k, taking the top-1 to top-6 generated labels into consideration. Precision for a topic at top-k is defined as follows:

Precision@k = (# of "Good" labels with rank ≤ k) / k    (6.19)
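A direct transcription of Eq. 6.19, assuming a ranked label list and a mapping from labels to the assessors' judgments:

    def precision_at_k(ranked_labels, judgments, k):
        # Precision@k = (# of "Good" labels with rank <= k) / k   (Eq. 6.19)
        top_k = ranked_labels[:k]
        good = sum(1 for label in top_k if judgments[label] == "Good")
        return good / k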

We then averaged the precision over all the topics. Figure 6.8 illustrates the results for each corpus. The results in Figure 6.8 reveal two interesting observations: (1) in Figure 6.8(a), the precision difference between the two methods illustrates the effectiveness of our method, particularly for up to the top-3 labels; and (2) the average precision for the BAWE corpus is higher than for the Reuters corpus. Regarding (1), our method assigns labels that are more specific and meaningful to the topics. As we select more labels, they become more general and likely too broad for the topic, which reduces the precision. For the BAWE corpus, as shown in Figure 6.8(b), the precision begins to rise as we select more top labels and then starts to fall. The reason for this is that OntoLDA finds labels that are likely too specific to match the topics. But, as we choose further labels (1 < k ≤ 4), they become more general but not too broad to describe the topics, and eventually (k > 4) the labels become too general and consequently not appropriate for the topics. Regarding observation (2), the BAWE documents are educational and scientific, and phrases used in scientific documents are more discriminative than in news articles. This makes the constructed semantic graph include more inter-related concepts and ultimately leads to the selection of concepts that are good labels for the scientific documents, which is also discussed in [75].

Table 6.7: Sample topics of the BAWE corpus with the top-6 labels generated by the Mei07 method and by OntoLDA + Concept Labeling, along with the top-10 words per topic

Mei07 labels:
Topic 1: rice production; southeast asia; rice fields; crop residues; weed species; weed control
Topic 3: cell lineage; cell interactions; somatic blastomeres; cell stage; maternal effect; germline blastomeres
Topic 12: nuclear dna; eukaryotic organelles; hydrogen hypothesis; qo site; iron sulphur; sulphur protein
Topic 9: disabled people; health inequalities; social classes; lower social; black report; health exclusion
Topic 6: mg od; red cells; heading mr; colorectal carcinoma; cyanosis oedema; jaundice anaemia

OntoLDA + Concept Labeling labels:
Topic 1: agriculture; tropical agriculture; horticulture and gardening; model organisms; rice; agriculture in the united kingdom
Topic 3: structural proteins; autoantigens; cytoskeleton; epigenetics; genetic mapping; teratogens
Topic 12: bacteriology; bacteria; prokaryotes; gut flora; digestive system; firmicutes
Topic 9: gender; biology; sex; sociology and society; identity; sexuality
Topic 6: aging-associated diseases; smoking; chronic lower respiratory; inflammations; human behavior; arthritis

Topic top-10 words:
Topic 1: soil, water, crop, organic, land, plant, control, environmental, production, management
Topic 3: cell, cells, protein, dna, gene, acid, proteins, amino, binding, membrane
Topic 12: bacteria, cell, cells, bacterial, immune, organisms, growth, host, virus, number
Topic 9: health, care, social, professionals, life, mental, medical, family, children, individual
Topic 6: history, blood, disease, examination, pain, medical, care, heart, physical, information

Table 6.8: Sample topics of the Reuters corpus with the top-6 labels generated by the Mei07 method and by OntoLDA + Concept Labeling, along with the top-10 words per topic

Mei07 labels:
Topic 20: hockey league; western conference; national hockey; stokes editing; field goal; seconds left
Topic 1: mobile devices; ralph lauren; gerry shih; huffington post; analysts average; olivia oran
Topic 18: upgraded falcon; commercial communications; falcon rocket; communications satellites; cargo runs; earth spacex
Topic 19: investment bank; royal bank; america corp; big banks; biggest bank; hedge funds
Topic 3: russel said; territorial claims; south china; milk powder; china sea; east china

OntoLDA + Concept Labeling labels:
Topic 20: national football league teams; washington redskins; sports clubs established in 1932; american football teams in maryland; american football teams in virginia; american football teams in washington d.c.
Topic 1: investment banks; house of morgan; mortgage lenders; jpmorgan chase; banks established in 2000; banks based in new york city
Topic 18: space agencies; space organizations; european space agency; science and technology in europe; organizations based in paris; nasa
Topic 19: investment banking; great recession; criminal investigation; madoff investment scandal; corporate scandals; taxation
Topic 3: island countries; liberal democracies; countries bordering the philippine sea; east asian countries; countries bordering the pacific ocean; countries bordering the south china sea

Topic top-10 words:
Topic 20: league, team, game, season, football, national, york, games, los, angeles
Topic 1: company, stock, buzz, research, profile, chief, executive, quote, million, corp
Topic 18: space, station, nasa, earth, launch, florida, mission, flight, solar, cape
Topic 19: bank, financial, reuters, stock, fund, capital, research, exchange, banks, group
Topic 3: china, chinese, beijing, japan, states, south, asia, united, korea, japanese

Figure 6.8: Comparison of the systems using human evaluation: (a) precision for the Reuters corpus; (b) precision for the BAWE corpus

Topic Coherence. In our model, the topics are represented as distributions over concepts. Hence, in order to compute the word distribution for each topic t under OntoLDA, we can use the following formula:

ϑ_t(w) = Σ_{c=1}^{C} ζ_c(w) · φ_t(c)    (6.20)
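In matrix form, Eq. 6.20 is a single product, as the following sketch shows; phi is assumed to be the K × C topic-concept matrix and zeta the C × V concept-word matrix.

    # Topic-word distributions from topic-concept and concept-word matrices.
    import numpy as np

    def topic_word_distributions(phi, zeta):
        return phi @ zeta    # vartheta[t, w] = sum_c phi[t, c] * zeta[c, w]  (Eq. 6.20)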

Table 6.9 shows three example topics from the BAWE corpus. Each "topic" column lists the top words from LDA and OntoLDA, respectively. Based on Table 6.9, we can draw an interesting observation: although both LDA and OntoLDA present the top words for each topic, the topic coherence under OntoLDA is qualitatively better than under LDA. For each topic, we italicized and marked in red the wrong topical words. We can see that OntoLDA produces much better topics than LDA does. For example, "Topic 3" in Table 6.9 shows the top words for the same topic under standard LDA and OntoLDA. LDA did not perform well, as some words in most of the topics were not relevant to the topic. We performed a quantitative comparison of the coherence of the topics created using OntoLDA and LDA, computing the coherence score based on the formula presented in [82], which has become the most commonly used topic coherence evaluation method.

101 Table 6.9: Example topics from the two document sets (top-10 words are shown). The third row presents the manually assigned labels

BAWE Corpus:
Topic 1 (Agriculture) — LDA: soil, control, organic, crop, heading, production, crops, system, water, biological; OntoLDA: soil, water, crop, organic, land, plant, control, environmental, production, management
Topic 2 (Medicine) — LDA: list, history, patient, pain, examination, diagnosis, mr, mg, problem, disease; OntoLDA: history, blood, disease, examination, pain, medical, care, heart, physical, treatment
Topic 3 (Gene Expression) — LDA: cell, cells, heading, expression, al, figure, protein, genes, gene, par; OntoLDA: cell, cells, protein, dna, gene, acid, proteins, amino, binding, membrane

Reuters Corpus:
Topic 7 (Sports-Football) — LDA: game, team, season, players, left, time, games, sunday, football, pm; OntoLDA: league, team, game, season, football, national, york, games, los, angeles
Topic 8 (Financial Companies) — LDA: company, million, billion, business, executive, revenue, shares, companies, chief, customers; OntoLDA: company, stock, buzz, research, profile, chief, executive, quote, million, corp

Table 6.10: Topic coherence on the top T words. A higher coherence score means the topics are more coherent

             BAWE Corpus                     Reuters Corpus
T            5         10         15        5         10         15
LDA      −223.86   −1060.90   −2577.30   −270.48   −1372.80   −3426.60
OntoLDA  −193.41    −926.13   −2474.70   −206.14   −1256.00   −3213.00

Given a topic φ and its top T words V^{(φ)} = (v_1^{(φ)}, . . . , v_T^{(φ)}), ordered by P(w|φ), the coherence score is defined as:

C(φ; V^{(φ)}) = Σ_{t=2}^{T} Σ_{l=1}^{t−1} log [ ( D(v_t^{(φ)}, v_l^{(φ)}) + 1 ) / D(v_l^{(φ)}) ]    (6.21)

where D(v) is the document frequency of word v and D(v, v′) is the number of documents in which the words v and v′ co-occur. It has been demonstrated that this coherence score is highly consistent with human-judged topic coherence [82]. Higher coherence scores indicate higher-quality topics. The results are shown in Table 6.10. As we mentioned before, OntoLDA represents each topic as a distribution over concepts. Table 6.11 illustrates the top-10 highest-probability concepts in the topic distribution under the OntoLDA framework for the same three topics ("topic 1", "topic 2" and "topic 3") of Table 6.9. Because concepts are more informative than individual words, the interpretation of topics is more intuitive in OntoLDA than in standard LDA.
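A direct transcription of Eq. 6.21; doc_freq and co_doc_freq are assumed to return document frequencies precomputed over the reference corpus.

    # UMass-style topic coherence (Eq. 6.21) for one topic's top words.
    import math

    def coherence(top_words, doc_freq, co_doc_freq):
        score = 0.0
        for t in range(1, len(top_words)):
            for l in range(t):
                v_t, v_l = top_words[t], top_words[l]
                score += math.log((co_doc_freq(v_t, v_l) + 1) / doc_freq(v_l))
        return score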

Table 6.11: Example topics with top-10 concept distributions in the OntoLDA model

Topic 1 Topic 2 Topic 3 rice 0.106 hypertension 0.063 actin 0.141 agriculture 0.095 epilepsy 0.053 epigenetics 0.082 commercial agriculture 0.067 chronic bronchitis 0.051 mitochondrion 0.067 sea 0.061 stroke 0.049 breast cancer 0.066 sustainable living 0.047 breastfeeding 0.047 apoptosis 0.057 agriculture in the united kingdom 0.039 prostate cancer 0.047 ecology 0.042 fungus 0.037 consciousness 0.047 urban planning 0.040 egypt 0.037 childbirth 0.042 0.039 novel 0.034 right heart 0.024 biodiversity 0.037 diabetes management 0.033 rheumatoid arthritis 0.023 industrial revolution 0.036

6.8 Conclusions

In this paper, we presented OntoLDA, an ontology-based topic model, along with a graph-based method for the task of topic labeling. Experimental results show the effectiveness and robustness of the proposed method when applied to text collections from different domains. The proposed ontology-based topic model improves topic coherence in comparison to the standard LDA model by integrating ontological concepts with probabilistic topic models into a unified framework. There are many interesting future extensions to this work. It would be interesting to define a global optimization scoring function for the labels instead of Eq. 6.17. Furthermore, how to incorporate the hierarchical relations, as well as the lateral relationships, between the ontology concepts into the topic model is also an interesting future direction.

Chapter 7

Combining Topic Models with Wikipedia Category Network for Semantic Tagging 1

1Mehdi Allahyari and Krys Kochut. "Semantic Tagging Using Topic Models exploiting Wikipedia Category Network". Submitted to the Semantic Web Journal. A shorter, preliminary version of the paper has been published in the 2016 IEEE 10th International Conference on Semantic Computing (ICSC), pages 63-70.

Abstract

The volume of documents and online resources has been increasing significantly on the Web for many years. Effectively organizing this huge amount of information has become a challenging problem. Tagging is a mechanism to aggregate information and a great step towards the Semantic Web vision. Tagging aims to organize, summarize, share and search Web resources in an effective way. One important problem facing tagging systems is to automatically determine the most appropriate tags for Web documents. In this paper, we propose a probabilistic topic model that incorporates DBpedia knowledge into the topic model for tagging Web pages and online documents with the topics discovered in them. Our method is based on the integration of the DBpedia hierarchical category network with statistical topic models, where DBpedia categories are considered as topics. We have conducted extensive experiments on two different datasets to demonstrate the effectiveness of our method.

7.1 Introduction

The advent of the World Wide Web has made a huge volume of online resources and documents freely accessible. The big challenge is to effectively organize this tremendous amount of information. Tagging is the process of assigning labels to Web resources with the purpose of organizing, sharing, discovering and recovering them easily. One of the important steps towards the Semantic Web is the automatic tagging of documents and Web pages with ontology concepts, which is also called

ontology-based semantic tagging. A semantic tag is a phrase (a sequence of words) belonging mainly to the topic which describes the tagged resource. Semantic tagging of textual content can significantly benefit information access tasks, for example, by enhancing the development of tools for classification and retrieval of documents, and has attracted significant attention in recent years. In this paper, we address this issue and propose an approach that integrates prior knowledge (i.e., ontological concepts) with unsupervised topic models into a unified probabilistic framework. We use DBpedia's [11] hierarchical category network as our background knowledge, which includes the categories organized into a hierarchical structure and a set of articles from Wikipedia. We need to note that the DBpedia knowledge base is extracted from Wikipedia in the form of an ontology of concepts and relationships, which includes the Wikipedia classification schema. Thus, we refer to the DBpedia category network and the Wikipedia category network interchangeably throughout this paper. Probabilistic topic models such as Latent Dirichlet Allocation (LDA) [17] are powerful techniques which are widely used for discovering topics or semantic content from a large collection of documents. Topic models typically assume that documents are mixtures of topics, while topics are probability distributions over the words. When the proportions of topics in a document are estimated, the top-proportion topics can be used as the themes (high-level representations of the semantics) of the document. Similarly, the top-ranked words in a topic-word distribution indicate the meaning of the topic. Thus, topic models provide an effective framework for extracting the latent semantics from unstructured text collections.

For example, Table 7.1 shows the top words of five topics learned by LDA from a collection of news articles; the topics have been labeled by a human as "Healthcare", "Farming", "Finance", "Disease" and "Olympic Sports", respectively.

Table 7.1: Examples of five topics with their labels.

Healthcare    Farming      Finance     Disease      Olympic Sports
health        farm         bank        disease      world
care          food         fed         blood        games
social        products     banks       infection    year
patient       organic      financial   cells        brazil
people        consumers    central     damage       cup
patients      market       fund        system       sochi
client        quality      market      due          olympic
mental        business     markets     type         host
individual    production   billion     brain        team
family        product      funds                    sports

We consider the Wikipedia categories as the topics in the probabilistic model, i.e., each document is a multinomial distribution over the Wikipedia categories. Thus, we combine the ontology concepts and data-driven topics, which enables us to semantically tag the documents with Wikipedia categories once the topic mixtures of the documents are estimated. Our proposed method for semantic tagging is entirely different from supervised text categorization techniques. Supervised text categorization methods are typically based on a set of predefined categories and a set of documents with pre-assigned categories, which is used as a training set. A classifier is created based on the training set and is then used to predict the categories of previously unseen documents. In the work presented here, we assign categories (topics) from Wikipedia

to text documents for which there are no predefined or known categories. We learn the probability distribution of each category over the words using statistical topic models, taking into account the prior knowledge from Wikipedia about the words and their associated probabilities in various categories. For instance, in Wikipedia, the words "rule", "reasoning" and "triple" likely have higher weights (see Section 7.4.2) under the "Knowledge Representation" category and, similarly, the words "democracy", "debate", and "campaign" are more related to the "Politics" category. We should point out that there exist several knowledge bases, such as DBpedia [11] (constructed based on the content of Wikipedia [101]), YAGO [43], and Freebase [18], that could be exploited as the prior knowledge in this work. DBpedia provides different classification schemes, including the Wikipedia and YAGO categorization systems. For this research, we selected DBpedia as arguably the most frequently used for Semantic Web tasks, but our approach could be used with other knowledge bases as well. In recent years, several attempts have been made at annotating Web pages and online documents. For example, [95] uses linguistic techniques to address the annotation of Web resources. [102, 33] utilize various natural language processing and information extraction techniques, and [57] employs regular expression patterns for semantic annotation. Our approach differs from previous works in that they are primarily focused on entities mentioned in the documents, whereas we take all the words into consideration. Furthermore, our method tags (annotates) the whole document as a unit, as opposed to annotating entities and other phrases

occurring within the document. Some other related works include [96], where the author uses the article titles and categories of Wikipedia to identify document topics. In that method, the author first finds all the Wikipedia articles related to a document by matching their titles with the words of the document. Then, he selects the categories assigned to these articles, ranks them, and finally chooses the categories with the highest weights as the topics of the document. [40] proposes a method that constructs a category-term matrix C from Wikipedia, exploiting the categories and article text. Then, for the input document, a document-term matrix D is constructed. They eventually calculate the document-category similarity matrix S = DC^T in order to find the relevant topics of a document. Our method is different from the aforementioned works in that we use a probabilistic model that incorporates ontological concepts with data-driven topics in a unified framework. Several other publications have focused on combining ontological concepts with statistical topic models. In [25], the authors describe the Concept-Topic Model (CTM), which combines human-defined concepts with LDA. The key idea in their framework is that both topics from the statistical topic models and concepts of the ontology are similarly represented by a set of "focused" words, and they use this representation similarity as the key idea in their model. In [26], the authors extended their previous work and proposed the Hierarchical Concept-Topic Model (HCTM), in order to leverage the known hierarchical structure among concepts. Our method is somewhat similar to [25, 26] in terms of exploiting ontologies in topic models, yet it differs from them in that they model concepts where they

are directly associated with words, whereas in our model, the concepts (Wikipedia categories) are not associated directly with words but with documents. Moreover, our method learns the probability distributions of the Wikipedia categories over the vocabulary, exploiting the information provided by the background knowledge. In this paper, we propose a probabilistic approach that exploits prior knowledge from the ontology concepts and integrates it with statistical topic modeling. DBpedia is used as the background knowledge, as it is a rich source of semantically related concepts organized into a category network. Concepts (categories) are directly associated with the documents, not with words, and Wikipedia articles with their assigned categories provide labeled features from which we can infer a concept-word distribution that is later used to tag other documents, such as Web pages, news articles, and other online documents.

7.2 Related Work

Recently, automatic semantic tagging and annotation of documents has attracted a great deal of attention. Semantic annotation, or ontology-based semantic tagging, is an important component of the Semantic Web that can certainly bring significant benefits to many text mining tasks, such as information retrieval [98] and text classification [112]. Thus, several attempts have been made to address this issue. Most of the existing approaches for semantic annotation of documents have primarily focused on tagging entities and phrases appearing in the textual content,

using a variety of techniques, such as Natural Language Processing (NLP), information extraction, and probabilistic methods. For example, [102] uses a Conditional Random Fields (CRF) approach for semantic annotation. [33] introduces a system called SemTag to perform semantic tagging utilizing NLP techniques. In more recent works, [95] uses linguistic patterns and learning methods to discover entities in text and associate them with classes of an ontology. In [57], the authors employ regular expression patterns for semantic annotation of documents. Wikipedia's category network has previously been used for document topic identification. In [96], the authors propose a method that uses Wikipedia article titles, as well as the category network, to identify the topics of documents. [40] introduces a method where they first construct a category-term matrix C from the Wikipedia categories and article text. Then, they construct a document-term matrix D for the input document and, as the final step, calculate the document-category similarity matrix S = DC^T in order to find the relevant topics of a document. Our work, presented in this paper, is different from all previous works, because we combine the ontological concepts with the probabilistic topic models within a unified framework. Several authors have published research results on methods that integrate concepts of an ontology with statistical modeling. As we already mentioned, [25] proposes a Concept-Topic Model which combines human-defined concepts with topic models. Topics from the statistical models and concepts of the ontology both represent sets of "focused" words that relate to some abstract notions. In [26], the authors describe a Hierarchical Concept-Topic Model that extends the

CTM in [25] to integrate the hierarchical relations between the concepts. Our work, presented in this paper, differs from the aforementioned works in that those previous works model concepts that are associated directly with the words of the documents, whereas we associate the concepts with documents and incorporate the supervised data provided by concept features into the LDA-based model to infer the concept-word probability distributions. This is a transition from unsupervised topic models to a supervised setting, where labeled information for the concepts exists. There are also prior works that use probabilistic models for tag recommendation [56, 8, 105, 55]. [56, 55] build an LDA model that uses resources and their associated tags previously assigned by users. Then, they represent each resource with the tags from the topics discovered by LDA. The basic idea in [8, 105] is that each document is represented as a set of textual features and its assigned tags. Then, the authors train a model from the training data and use that model to output a ranked list of tags for new documents. Our proposed method is different from these prior works in that we do not use documents' previously assigned tags in our model to assign tags to documents. We rely on the documents' texts and prior knowledge from the ontology to identify the most appropriate semantic tags for them. The rest of this paper is organized as follows. We begin by presenting a brief overview of Latent Dirichlet Allocation (LDA), the state-of-the-art probabilistic topic modeling technique. In Section 7.4, we describe our proposed sOntoLDA model and compare it with the standard LDA. We demonstrate the effectiveness of our proposed method in Section 7.5. We present our conclusions and future work in Section 7.7.

113 ↵

Figure 7.1: Graphical representation of LDA model

7.3 Probabilistic Topic Models

Latent Dirichlet Allocation (LDA) [17] is a generative probabilistic model which has been extensively used for discovering topics or semantic content from a large collection of documents. LDA assumes that each document is made up of various topics, where each topic is a probability distribution over words. The graphical model of LDA is shown in Figure 7.1 and the generative process is as follows:

1. For each topic k ∈ {1, 2,...,K},

(a) Draw a word distribution φk ∼ Dir(β)

2. For each document d ∈ {1, 2, . . . , D},

(a) Draw a topic distribution θd ∼ Dir(α)

(b) For each word wi of document d,

i. Draw a topic zi ∼ Mult(θd)

ii. Draw a word w_i from topic z_i, w_i ∼ Mult(φ_{z_i})

where α and β are parameters of the symmetric Dirichlet priors. In LDA, the words are generated from the topics and the topics are generated from the documents. In other words, the probability of a word w given a document d is defined as:

P(w|d) = Σ_{j=1}^{K} P(w|z_j) P(z_j|d)    (7.1)
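Numerically, Eq. 7.1 is a mixture over topics. A one-line numpy sketch, assuming theta_d is the length-K topic proportion vector of document d and phi the K × V topic-word matrix:

    # Word probabilities for one document as a topic mixture (Eq. 7.1).
    import numpy as np

    def word_probabilities(theta_d, phi):
        return theta_d @ phi    # P(w|d) = sum_j P(z_j|d) * P(w|z_j), for all w at once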

In the standard LDA model, the topic-word probability distributions P(w|z) and the document-topic distributions P(z|d) are learned in an entirely unsupervised manner, without integrating any prior knowledge into the statistical framework. In the following section, we describe our sOntoLDA model, in which we incorporate the prior knowledge of an ontology, that is, the DBpedia category network, into the LDA model.

7.4 Ontology-based Topic Models for Semantic Tagging

In this section, we formally introduce our model. We then describe how to integrate the prior knowledge from the DBpedia category network into the topic model. Our objective is to tag (annotate) a corpus of documents with DBpedia (Wikipedia) categories to indicate their semantic content. We assume that the documents are not assigned to any predefined categories. Consequently, we do not rely on or require such information in our sOntoLDA model. Our goal is to assign k categories to each document as the topics of the document. This is fundamentally different from the supervised text classification task, where a classifier is trained based on a training set of documents that have already been assigned to a fixed set of predefined categories and is then used to predict the categories of previously unseen documents.

7.4.1 The sOntoLDA Topic Model

sOntoLDA is a generative topic model for semantic tagging of Web pages and other online documents. The key idea of our model is to integrate prior knowledge from the ontology concepts directly with topic models. The intuition is that the presence of words in documents can be described by both learned topics and human prior knowledge about the words. In standard LDA, the word proportions of a topic are drawn from a symmetric Dirichlet distribution. However, in our proposed model, we modify the Dirichlet priors of the topic-word distribution by encoding the background knowledge derived from the DBpedia (Wikipedia) hierarchical

Figure 7.2: Graphical representation of the sOntoLDA model

category network in the form of a λ matrix. The graphical representation of the sOntoLDA model is illustrated in Figure 7.2 and the generative process is as follows:

1. For each Wikipedia category c ∈ {1, 2,...,C},

(a) Draw a word distribution φc ∼ Dir(λc × β)

2. For each document d ∈ {1, 2,...,D},

(a) Draw a category distribution θd ∼ Dir(α)

(b) For each word wi of document d,

i. Draw a category ci ∼ Mult(θd)

ii. Draw a word wi from category ci, wi ∼ Mult(φci )
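The only change relative to standard LDA is the asymmetric prior on the category-word distributions: each φc is drawn from Dir(λc × β) instead of Dir(β). The following is a minimal sketch, assuming λ is stored with one row of tf-idf weights per category; the names are ours.

    import numpy as np

    def draw_category_word_dists(lam, beta=0.01, seed=0):
        """phi_c ~ Dir(lambda_c * beta), one distribution per category."""
        rng = np.random.default_rng(seed)
        eps = 1e-12  # keep every Dirichlet parameter strictly positive
        return np.vstack([rng.dirichlet(row * beta + eps) for row in lam])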

The joint distribution of the model (hidden and observed variables) is:

P(\phi_{1:C}, \theta_{1:D}, z_{1:D}, w_{1:D} \mid \alpha, \beta, \lambda_{1:C}) = \prod_{j=1}^{C} P(\phi_j \mid \lambda_j \times \beta) \prod_{d=1}^{D} \Big( P(\theta_d \mid \alpha) \prod_{n=1}^{N} P(z_{d,n} \mid \theta_d) \, P(w_{d,n} \mid \phi_{1:C}, z_{d,n}) \Big) \qquad (7.2)

Since the task is to tag documents with Wikipedia categories, the latent topics in our model that are associated with each document are Wikipedia categories, and a few of the most important ones are then used as the document's tags. Unlike LDA, we add an additional dependency link to the topic-word distribution φ through the matrix λ of size C × V that we use to encode word prior knowledge.

7.4.2 Building Word-Category Prior Matrix λ

The first step in creating the λ matrix is to prepare the DBpedia category network. Wikipedia has a massive categorization system, which is loosely organized in a hierarchical manner. It contains over 940,000 categories, where the relationships between the categories are represented using the SKOS² vocabulary in the DBpedia ontology. The relation between a Wikipedia article and a category is defined by the subject property of the Dublin Core³ vocabulary (prefixed by dcterms:). Moreover, a category's parent and child categories are extracted by querying for the properties skos:broader and skos:broaderOf, respectively.

² http://www.w3.org/2004/02/skos/
³ http://dublincore.org/
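As an illustration, queries of this kind can be issued against a DBpedia SPARQL endpoint with the SPARQLWrapper package; the public endpoint URL, the example category, and the use of the skos core namespace in the query are our assumptions here, not prescribed by the model.

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setReturnFormat(JSON)
    # Fetch the immediate sub-categories of a category and the articles
    # placed under it (dcterms:subject), as described above.
    sparql.setQuery("""
        PREFIX skos:    <http://www.w3.org/2004/02/skos/core#>
        PREFIX dcterms: <http://purl.org/dc/terms/>
        SELECT ?child ?article WHERE {
          { ?child skos:broader <http://dbpedia.org/resource/Category:Semantic_Web> }
          UNION
          { ?article dcterms:subject <http://dbpedia.org/resource/Category:Semantic_Web> }
        }
    """)
    for binding in sparql.query().convert()["results"]["bindings"]:
        print(binding)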

Each category in Wikipedia has a collection of articles placed within it. These articles provide labeled data from which we can infer the category-word distributions in sOntoLDA. We use the aforementioned properties and extract these articles to represent each Wikipedia category. Thus, we create a vector of representative terms λc for each category c by merging the term vectors of the articles defined under c. We assign a tf-idf weight δ_w^{(c)} to each term w based on its significance to the category as follows:

\delta_w^{(c)} = tf_w \times \log\left(\frac{C}{cf_w}\right) \qquad (7.3)

where tf_w is the number of occurrences of word w in category c, C is the total number of categories in Wikipedia, and cf_w is the number of categories that contain this word. Therefore, for each category c:

\lambda_c = \left[ \delta_{w_1}^{(c)}, \delta_{w_2}^{(c)}, \ldots, \delta_{w_V}^{(c)} \right]^{T} \qquad (7.4)

where V is the size of the vocabulary and \sum_{i=1}^{V} \delta_{w_i}^{(c)} = 1. Using λc^T as the c'th row, we construct the C × V word-category matrix λ. This matrix encodes the prior knowledge about the word probabilities in various categories and incorporates this domain knowledge into the topic model. For example, suppose the word "RDF" has a very high weight in the category "Semantic Web". This word then has a much higher probability of being drawn from the "Semantic Web" category word distribution in Eq. 7.8, which indicates that documents containing the word "RDF" are more likely to be related to the topic "Semantic Web".
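A sketch of the construction of λ from Eqs. 7.3 and 7.4 follows; the function and variable names are ours, and we assume each category's member articles have already been merged into a single token list.

    import numpy as np
    from collections import Counter

    def build_lambda(category_tokens, vocab):
        """Rows are the normalized tf-idf vectors lambda_c (C x V)."""
        C, V = len(category_tokens), len(vocab)
        word_ix = {w: i for i, w in enumerate(vocab)}
        tf = np.zeros((C, V))
        for c, tokens in enumerate(category_tokens):
            for w, n in Counter(tokens).items():
                if w in word_ix:
                    tf[c, word_ix[w]] = n
        cf = np.count_nonzero(tf, axis=0)       # cf_w: categories containing w
        idf = np.log(C / np.maximum(cf, 1))     # log(C / cf_w) from Eq. 7.3
        lam = tf * idf
        sums = lam.sum(axis=1, keepdims=True)
        return lam / np.maximum(sums, 1e-12)    # each row sums to 1 (Eq. 7.4)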

We also encode the hierarchical structure of the categories into the word-category matrix λ, as it augments the amount of information associated with each category and increases the generality of the categories (topics) assigned to documents. In order to do that, we enhance the vector of representative terms for each category of interest with the set of term-mapping vectors associated with the descendant categories in the hierarchy of categories, under the category of interest. For example, we add the term vectors associated with the "Knowledge representation" and "Machine learning" categories to the "Artificial intelligence" category, as they both are sub-categories of "Artificial intelligence" in the Wikipedia hierarchical category network. This includes all of the sub-categories in the hierarchy down to a specific level ℓ. Based on our initial experiments, we empirically restrict the hierarchy height to ℓ = 3. The reasons for this restriction are: (1) going deeper and adding more sub-categories makes the λ matrix larger and, accordingly, computing the sOntoLDA parameters computationally more expensive; and (2) although increasing the sub-categories' information enhances the quantity of information related to the main category, it also augments the amount of noise. By noise, we mean a subset of sub-categories that becomes very particular and contains information that is specifically related to the sub-categories, but not related to the main category. For instance, "Speech synthesis software" is a sub-category of the "Health" category if ℓ = 6, but this category primarily includes articles and sub-categories that are more related to the "Technology" category.
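The level-restricted expansion can be sketched as a simple breadth-first walk over the category network; `children`, mapping a category to its immediate sub-categories, is assumed to have been extracted from DBpedia beforehand.

    def descendant_categories(category, children, max_level=3):
        """Sub-categories up to max_level links below `category`; their
        term vectors are then merged into the category's lambda_c row."""
        frontier, seen = {category}, {category}
        for _ in range(max_level):
            frontier = {sub for cat in frontier
                        for sub in children.get(cat, ())} - seen
            seen |= frontier
        return seen - {category}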

7.4.3 Parameter Estimation using Gibbs Sampling

Since exact posterior inference in sOntoLDA is intractable, we need an algorithm for approximating the posterior. A variety of algorithms have been used to estimate the parameters of topic models, such as variational EM [17] and Gibbs sampling [38]. In the sOntoLDA topic model presented in this paper, we use the collapsed Gibbs sampling procedure. Collapsed Gibbs sampling [38] is a Markov Chain Monte Carlo (MCMC) algorithm, which constructs a Markov chain over the latent variables in the model and converges to the posterior distribution after a number of iterations. In our case, we aim to construct a Markov chain that converges to the posterior distribution over c conditioned on the observed words w, the word-category prior matrix λ, and the hyperparameters α and β. We derive the posterior inference from Eq. 7.2 as follows:

P(c \mid w, \lambda, \alpha, \beta) = \frac{P(c, w \mid \lambda, \alpha, \beta)}{P(w \mid \lambda, \alpha, \beta)} \propto P(c, w \mid \lambda, \alpha, \beta) \propto P(c) \, P(w \mid \lambda, c) \qquad (7.5)

where

P(c) = \left( \frac{\Gamma(C\alpha)}{\Gamma(\alpha)^{C}} \right)^{D} \prod_{d=1}^{D} \frac{\prod_{c=1}^{C} \Gamma(n_c^{(d)} + \alpha)}{\Gamma\left( \sum_{c'} (n_{c'}^{(d)} + \alpha) \right)} \qquad (7.6)

P(w \mid \lambda, c) = \left( \frac{\Gamma\left( \sum_{w=1}^{V} \lambda_w \beta \right)}{\prod_{w=1}^{V} \Gamma(\lambda_w \beta)} \right)^{C} \prod_{c=1}^{C} \frac{\prod_{w=1}^{V} \Gamma(n_w^{(c)} + \lambda_{wc} \beta)}{\Gamma\left( \sum_{w'} (n_{w'}^{(c)} + \lambda_{w'c} \beta) \right)} \qquad (7.7)

P(c_i = c \mid w_i = w, \mathbf{c}_{-i}, \mathbf{w}_{-i}, \lambda, \alpha, \beta) \propto \frac{n_{c,-i}^{(d)} + \alpha_c}{\sum_{c'} (n_{c',-i}^{(d)} + \alpha_{c'})} \times \frac{n_{w,-i}^{(c)} + \lambda_{wc} \beta_w}{\sum_{w'} (n_{w',-i}^{(c)} + \lambda_{w'c} \beta_{w'})} \qquad (7.8)

where n_w^{(c)} is the number of times the word w is assigned to the concept c, and n_c^{(d)} denotes the number of times the concept c is associated with the document d. The subscript −i indicates that the contribution of the current word w_i being sampled is disregarded. Instead of using a symmetric estimate of the parameter α, we use the moment matching method [83] to approximate these parameters. After Gibbs sampling, we can use the sampled categories to estimate the probability of a category given a document, θ_{cd}, and the probability of a word given a category, φ_{wc}:

\theta_{cd} = \frac{n_c^{(d)} + \alpha_c}{\sum_{c'} (n_{c'}^{(d)} + \alpha_{c'})} \qquad \phi_{wc} = \frac{n_w^{(c)} + \lambda_{wc} \beta_w}{\sum_{w'} (n_{w'}^{(c)} + \lambda_{w'c} \beta_{w'})} \qquad (7.9)
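One sweep of the collapsed Gibbs sampler implementing Eq. 7.8 can be sketched as follows; the count-matrix layout and the names (n_dc, n_cw) are our assumptions.

    import numpy as np

    def gibbs_sweep(docs, c_assign, n_dc, n_cw, lam, alpha, beta, rng):
        """Resample every category assignment once (Eq. 7.8).
        n_dc[d, c] and n_cw[c, w] are the count matrices from the text;
        lam is the C x V word-category prior matrix."""
        C = n_cw.shape[0]
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                old = c_assign[d][i]
                n_dc[d, old] -= 1          # exclude token i: the "-i" counts
                n_cw[old, w] -= 1
                left = n_dc[d] + alpha     # document-category factor
                right = ((n_cw[:, w] + lam[:, w] * beta) /
                         (n_cw.sum(axis=1) + lam.sum(axis=1) * beta))
                p = left * right
                new = rng.choice(C, p=p / p.sum())
                c_assign[d][i] = new
                n_dc[d, new] += 1          # add the token back in
                n_cw[new, w] += 1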

7.5 Experiments

In order to test sOntoLDA, we performed two different types of experiments. In the first experiment, we focused on how well the proposed method is able to predict the categories of a collection of Wikipedia articles. Here, we were able to compare the quality of the sOntoLDA-generated tags (categories) to those assigned by Wikipedia's human curators. In the second experiment, we assigned Wikipedia categories to a corpus of Reuters news articles and investigated the relevance of the

top-k topics assigned to the documents, as compared to the pre-assigned categories of the Reuters documents.

Wikipedia is an enormous knowledge base consisting of millions of articles (over 5,000,000 in the English language section, as of this writing) and nearly a million categories (940,000). Using the full set of articles and categories (category network) included in Wikipedia is computationally very expensive. Thus, we selected a subset of categories and their associated articles that were relevant to our datasets. We created a topic graph from the Wikipedia hierarchical category graph for each of the main categories, including Business, Applied Sciences, and Health. For each category's sub-graph, we restricted the levels of hierarchy to three and removed the Wikipedia administrative and maintenance categories. The final topic graph, which we used as the prior knowledge, was the union of these three topic graphs. For each category in the final topic graph, we retrieved all of the associated articles that had at least 200 words. The final topic graph included C = 1,353 categories, a vocabulary of size V = 99,665 (excluding punctuation, stopwords, numbers, and words occurring fewer than 5 times in the corpus), and |A| = 30,300 articles. From the final topic graph, we constructed the λ matrix of size 1353 × 99665.

7.5.1 Tagging Wikipedia Articles

For this experiment, we first extracted the Wikipedia categories and sub-categories

(1,353 of them) from the three main categories, including Business, Applied Sciences, and Health. We then randomly selected 5 articles from each category and constructed an initial corpus of |Dinitial| = 6,765 articles. Then, we divided the corpus into a training set (80%) and a test set (20%) and retrieved the corresponding Wikipedia articles. The final sizes of the training and test sets were |Dtrain| = 3,142 and |Dtest| = 725 documents, respectively. We used the training dataset to estimate the parameters of the sOntoLDA topic model. We assumed the symmetric Dirichlet prior and set α = 50/K and β = 0.01, respectively. We ran the Gibbs sampling algorithm for 500 iterations and computed the posterior inference after the last sampling iteration.

After estimating the parameters of sOntoLDA, we ran the model on the previously unseen documents of the test set and assigned the top-k categories to each document, using Eq. 7.9. Then, we evaluated how many of the official (i.e., curator-assigned) categories of each document had been assigned by sOntoLDA, which we called the "exact match". In order to quantitatively measure the quality of the assigned categories (tags), we adopted the Precision@k and Mean Average Precision (MAP) measures, which have been widely used in the area of information retrieval [72]. We also utilized the Coverage metric described in [48]. Precision@k is the percentage of correctly identified categories among the top-k categories in the test documents. In other words, this metric assesses how many relevant/irrelevant categories are retrieved at the top-k ranks and is defined as follows:

Precision@k = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{CI@k}{k} \qquad (7.10)

Table 7.2: Precision, Coverage and MAP values of "exact match" for the Wikipedia Dataset.

Top-k   Precision   Coverage   MAP
1       0.479       0.479      0.479
2       0.509       0.584      0.494
3       0.559       0.645      0.516
4       0.586       0.68       0.533
5       0.605       0.698      0.548
10      0.648       0.744      0.592
15      0.678       0.775      0.617
20      0.702       0.799      0.636
25      0.71        0.804      0.65
30      0.719       0.811      0.661

where |Q| is the number of test documents and CI@k is the number of official (Wikipedia-assigned) categories retrieved among the top-k categories. Note that if k is smaller than the number of official categories, we presume that there are only k official categories. The measurement values are presented in Table 7.2, and Figure 7.3 illustrates the corresponding plot of the evaluation results of "exact match" for the Wikipedia dataset. It shows that on average 65% of the official categories have been retrieved among the top-10 categories, and the Precision grows to 72% as we increase the number of top-k categories assigned to documents to k = 30. It should be noted that the top categories are the immediate official categories assigned to each document in Wikipedia's hierarchical category network.

Figure 7.3: Precision, Coverage and MAP of Exact Match for the Wikipedia Dataset.

The other metric that we used in our evaluation was the Mean Average Precision (MAP), which measures how well the retrieved relevant categories are ranked at top-k. It is formally defined as follows:

MAP(Q) = \frac{1}{|Q|} \sum_{j=1}^{|Q|} \frac{1}{m_j} \sum_{k=1}^{m_j} Precision(R_{jk}) \qquad (7.11)

where |Q| is the number of test documents and m_j is the number of relevant categories for document j. Precision(R_{jk}) is the Precision@k of document j. The higher the MAP, the better the relevant categories are ranked among the top-k. As with Precision, we calculated the MAP for different numbers of top-k categories, ranging from 1 to 30. As shown in Figure 7.3, the MAP at the top-10 categories is 59% and increases to 66% for k = 30. Coverage is the proportion of the documents for which the method has found at least one Hit and is defined as follows:

Coverage@k = \frac{\#\text{documents with at least one Hit at rank} \le k}{\#\text{documents}} \qquad (7.12)
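For reference, Precision@k (Eq. 7.10) and Coverage@k (Eq. 7.12) can be computed over a document collection as in the sketch below; the data layout (one ranked category list and one official category set per document) is our assumption.

    def precision_at_k(ranked, official, k):
        """Eq. 7.10 for one document: fraction of the top-k assigned
        categories that are official ones."""
        return len(set(ranked[:k]) & set(official)) / k

    def coverage_at_k(all_ranked, all_official, k):
        """Eq. 7.12: share of documents with at least one Hit in the top k."""
        hits = sum(1 for r, o in zip(all_ranked, all_official)
                   if set(r[:k]) & set(o))
        return hits / len(all_ranked)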

As illustrated in Figure 7.3, we can see that our proposed method recognized at least one official category for 55% of the examined documents within the first top-5 categories, and this grows to over 66% for k = 30. The above results are for the "exact match" and are based on the constraint that only the official categories must be among the top categories. However, in Wikipedia, the categories are hierarchically related via the "sub-category" relation.

Figure 7.4: Precision, Coverage and MAP of Hierarchical Match for the Wikipedia Dataset.

Table 7.3: Precision, Coverage and MAP values of "Hierarchical match" for the Wikipedia Dataset.

Top-k   Precision   Coverage   MAP
1       0.527       0.527      0.527
2       0.568       0.644      0.547
3       0.619       0.704      0.571
4       0.661       0.75       0.594
5       0.688       0.771      0.613
10      0.749       0.832      0.671
15      0.782       0.859      0.703
20      0.81        0.88       0.727
25      0.824       0.887      0.746
30      0.834       0.895      0.76

In other words, the structure of the Wikipedia categorization system and the relationships between the categories are represented by SKOS properties, including skos:broader and skos:broaderOf. Moreover, there are thousands of very fine-grained, specific categories created and assigned to Wikipedia articles. These highly specific categories may not be of high interest to users or may not be very informative and meaningful. For example, the Wikipedia article "Semantic Web" contains several categories, including "Internet ages". This category is very specialized and only assigned to two articles. But one of its super-categories, "World wide web", is more general and informative, which makes it more likely to be interesting to users and a better choice for tagging documents. As another example, the article "Tim Berners-Lee" has 31 categories, including "Fellows of the British Computer Society". This category is very particular and possibly not suitable for tagging, as opposed to "Information technology", which is one of

its ancestor categories and a better choice for tagging the article.

Although the results for the "exact match" indicate that sOntoLDA works very well, it would be a better approach to also consider the super-categories of the official categories as suitable tags for documents. If we relax the constraint of only considering the official, exact categories and also take their super-categories into account, which we call "Hierarchical match", Precision, Coverage and MAP improve by approximately 5-12%, 5-10% and 5-10%, respectively. The values of these measurements are presented in Table 7.3. Figure 7.4 shows the results when Hierarchical match is taken into consideration, which indicates a significant enhancement in the performance results.

7.5.2 Example of Tagging a Wikipedia Article

As an example, Table 7.4 shows the top five categories that our sOntoLDA model assigned as tags to the article "Tooth brushing". In Wikipedia, only a single official category, "Oral hygiene", is assigned to this article, which our method has identified as the top category with the highest probability. The only official category has received roughly four times the probability of the second category, and except for the "Chiropractic treatment techniques" category, which might not be very relevant, the other categories are strongly related to the main category "Oral hygiene" and, correspondingly, to the more general category "Health" through the "super-category" relationship (skos:broader). Figure 7.5 shows some of the relationships between the top four categories and the article in the Wikipedia hierarchical network. The thickness of the ellipse encapsulating a category node is proportional

to the probability of the category given the article.

Table 7.4: Top 5 categories selected for the article “Tooth brushing”.

Article Title: Tooth brushing
Category                              Probability
Oral hygiene                          0.1533
Dentistry                             0.0478
Self care                             0.0403
Personal hygiene products             0.0302
Chiropractic treatment techniques     0.0227

7.5.3 Tagging Evaluation

To evaluate our method on a real-world document set, we selected a corpus of D = 2,914 documents from Reuters news articles, divided (by Reuters editors) into three main categories: Business, Science and Health. The reason we chose our corpus from these categories was that our prior knowledge was created out of the corresponding Wikipedia categories. We can also map the categories of the documents in this text corpus to their corresponding Wikipedia categories. It should be noted that the number of categories assigned to each document in the Reuters corpus was at least 1 and at most 3. Therefore, in order to be able to directly evaluate the performance of the top-k Wikipedia categories assigned to these documents by our method, we employed the "Hierarchical match" method used for the Wikipedia dataset. For each main category, not only the corresponding Wikipedia category but also all of the descendant categories resulting from the sub-graph of that main category were considered as the correct topics of the document.

[Figure: a fragment of the Wikipedia category network linking the categories Health, Health_care, Health_sciences, Self_care, Dentistry, Hygiene, Personal_hygiene_products, Dentistry_branches, and Oral_hygiene down to the article Tooth_brushing.]

Figure 7.5: Relations between the top 4 Wikipedia categories assigned to the article "Tooth brushing".

For example, if one of the top-k categories of a test document was "Scientific phenomena", this document was classified under the "Science" category, because "Scientific phenomena" is a descendant of the "Science" category in Wikipedia's hierarchical category network. Similarly to the first experiment, we pre-processed the dataset by removing punctuation, stopwords, numbers, and words appearing fewer than five times in the corpus. We need to note that any word found in D that was not defined in the matrix λ was considered to be out-of-vocabulary and removed from D.

In this experiment, we did not train sOntoLDA on a training set and run it

on a test set, but directly estimated the sOntoLDA parameters using the entire corpus and evaluated its performance utilizing the same metrics described in the previous section. The results are shown in Figure 7.6 and the measurement values are presented in Table 7.5. The results indicate that our sOntoLDA topic model performs very well on various types of documents. An important difference can be seen for Precision@1, which is 62% for the Reuters corpus while it reaches 53% on the Wikipedia collection. Similarly, as shown in Figure 7.6, the Mean Average Precision (MAP) is 62% at top-1, which indicates that the top categories are very relevant. Regarding Coverage, we can see that our method finds at least one relevant category (tag) for 96% of the documents among the top-5 categories, compared to 77% for the Wikipedia dataset. For most documents in this dataset the number of pre-assigned categories is 1, which explains why the Precision and Coverage lines nearly overlap. This experiment demonstrates a superior coverage over the entire document collection and a much greater ability to identify broader categories (topics). The prior knowledge about the word probabilities in diverse categories encoded in the λ matrix leads to better document modeling and semantic tagging, which demonstrates the power of the prior knowledge.

7.5.4 Examples of Topics and Word Distributions

In this section, we present some examples of the topics from both datasets and their probability distributions over the vocabulary. Note that, as mentioned in previous sections, the topics of the model are the same as the Wikipedia categories. Thus, we essentially find the distributions of Wikipedia categories over the words by learning the topics of sOntoLDA.

Figure 7.6: Precision, Coverage and MAP for the Reuters Dataset.

Table 7.5: Precision, Coverage and MAP values of "Hierarchical match" for the Reuters Dataset.

Top-k   Precision   Coverage   MAP
1       0.617       0.617      0.617
2       0.795       0.808      0.706
3       0.897       0.899      0.77
4       0.935       0.935      0.811
5       0.964       0.964      0.842
10      0.995       0.995      0.915
15      0.999       0.999      0.942
20      1           1          0.957
25      1           1          0.965
30      1           1          0.971

Table 7.6 describes examples of five topics as learned by the sOntoLDA model from the Wikipedia dataset. Each topic is extracted from a sample at the 500th iteration of the Gibbs sampler. The total number of topics in the model was set equal to the number of categories in the Wikipedia hierarchical ontology, K = 1,353. Each topic is represented by the top 10 words most likely to be generated conditioned on that topic. The first row of the table shows the titles of the Wikipedia categories (topics). Considering the title of each topic and its topical words, we can see that our topic model has qualitatively produced coherent results. For each topic, we italicized and marked in red the incorrect topical words (although this is a subjective task, we relied on two human judges).

Table 7.6: An example of the top-10 words for 5 categories (topics) in the Wikipedia Dataset

Taxation    Bankruptcy   Space Medicine   Pharmacology      Sports Business
rate        security     solar            patients          games
pay         bankruptcy   lunar            treatment         teams
paid        secured      disturbances     effects           sport
taxes       creditors    astronauts       efficacy          club
dividend    payment      mmhg             pharmacists       promotion
levied      debts        spaceflight      therapeutic       clubs
taxpayer    bankrupt     weightless       compounding       championship
irs         priority     garn             pharmacological   fans
made        estate       skylab           antidepressants   season
years       creditor     astronautical    due               fan

Table 7.7 illustrates similar results for 5 selected topics from the Reuters dataset, where again the incorrect topical words are shown in italics and marked in red. However, since the λ matrix is constructed based on the vocabulary of the Wikipedia category network, many words of the Reuters dataset vocabulary may not occur in λ and are consequently discarded; it is therefore more likely (as shown in Table 7.7) that the top words of the topics include more general and incorrect words.

Table 7.7: An example of the top-10 words for 5 categories (topics) in the Reuters Dataset

Taxation       Bankruptcy     Space Medicine   Pharmacology   Sports Business
taxes          claims         risk             patients       million
asset          settlement     acute            patient        screens
corporations   property       ros              large          premier
stamp          pay            disturbances     number         fan
taxation       loan           crewmembers      years          fans
roads          equity         including        longer         bing
levies         judgment       made             make           giants
based          petition       early            made           attendance
make           transactions   years            cognizant      years
large          liabilities    include          eventually     longer

7.6 Classifying the Tagged Documents

We evaluated our method on a Wikipedia dataset as well as on a collection of documents from Reuters news articles. In the case of the Wikipedia dataset, one or more categories from the Wikipedia category network have been assigned to each article by Wikipedia editors. These categories can later be used to evaluate our proposed method and measure its performance straightforwardly. However, there are no such Wikipedia categories pre-assigned to the articles in the Reuters dataset, which makes the performance analysis of our method more difficult. Hence, for further assessment of the proposed method, we set up another experiment. For this experiment, we considered the same corpus as the one used in section 7.5.3. In this experiment, we first created a gold standard by classifying the corpus of documents into their predefined categories based solely on their content (we used three standard text classification methods). Subsequently, we generated semantic tags for all the documents using our method. We then treated documents as composed of tags only (i.e., a document was regarded as a bag of tags). We classified the tag-represented documents again and investigated how well the semantic tags described the categories of the documents by comparing the results with the classification results based on the full content of the documents (i.e., the gold standard).

Tagging web resources and online documents can significantly benefit many other information access tasks: (a) tagging facilitates future retrieval of documents; and (b) it can be used to categorize, summarize and share documents in an effective way. One of the primary applications of tagging is that it enables us to automatically classify web pages into semantic categories, which consequently enhances searching and browsing on the Web. In this section, we demonstrate how semantic tags can be used to benefit the automatic classification of Web documents. For evaluation, we created a gold standard by classifying the documents via various classification algorithms taking only the text of the documents into account. This allows us to draw quantitative conclusions about how the semantic tags assigned by our proposed method are beneficial in Web document classification. In addition, this evaluation also shows that our method generates and assigns appropriate semantic tags to documents. For this evaluation, we selected the same corpus that was used in section 7.5.3. We define the web document classification task as follows:

1. Create a gold standard to compare against by utilizing the original text (bag of words Bw) of the documents.

2. Given a collection of documents represented by their top-k semantic tags (bag of tags Bt), classify the documents into their predefined categories using different classification algorithms.

3. Compare the outputs produced by the tag-represented classification to the gold standard results using the Precision, Recall and F-Measure evaluation metrics.

Following this setup, we obtained results, according to our evaluation metrics, that reveal two interesting observations. First, the results show that our proposed method generates and assigns appropriate semantic tags to documents. Second, the results also demonstrate that using a bag of tags instead of a bag of words not only gives us comparable categorization results, but also significantly reduces the training and testing time. This is because representing the documents by a bag of tags substantially decreases the size of the vocabulary, which consequently lowers the classification time.

7.6.1 Creating a Gold Standard

In order to derive the gold standard, we use the text of the documents. Each document consists of a bag of words from a word vocabulary W. Since the documents are divided into three main categories (labels), namely Business, Science and Health, each document d ∈ D has l_i labels, where 1 ≤ l_i ≤ 3. For example, a document that talks about the business aspects of a technology belongs to both the "Business" and "Science" categories. Thus, to derive the gold standard, we have to choose a multi-label classification approach. For multi-label classification, there is a large body of prior work, which has been well explained in the literature (e.g., see [94, 104, 103]). Most approaches have employed some variation of the "binary problem transformation" technique to convert the multi-label classification problem into a set of binary classification problems, each of which can then be solved using a proper binary classifier. We employed a method in which L independent binary classifiers are trained, one classifier for each label. We considered decision trees, naïve Bayes and Support Vector Machines (SVM) as the binary classifiers in our evaluation.
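As a sketch of this binary problem transformation, scikit-learn's OneVsRestClassifier trains exactly one binary classifier per label; the toy documents and the choice of a linear SVM here are our assumptions (the evaluation also used decision trees and naïve Bayes).

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import MultiLabelBinarizer
    from sklearn.svm import LinearSVC

    # Toy stand-ins for the Reuters documents and their label sets.
    docs = ["quarterly earnings rose sharply", "new vaccine trial results"]
    labels = [["Business"], ["Health", "Science"]]

    Y = MultiLabelBinarizer().fit_transform(labels)   # one column per label
    X = TfidfVectorizer().fit_transform(docs)
    clf = OneVsRestClassifier(LinearSVC()).fit(X, Y)  # L independent SVMs
    print(clf.predict(X))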

7.6.2 Tag-represented Document Classification

The sOntoLDA topic model generates the top-k tags for the documents of the corpus. We first construct a tag vocabulary V consisting of all the words extracted from the set of tags T assigned to the entire collection. We then represent each document of the corpus as a bag of top-k semantic tags. In other words, we model each document d in the collection D using a bag of tags Bt, i.e., d = {w1, w2, . . . , wV }, where V is the size of the tag vocabulary. We run the same three supervised classification algorithms, decision trees, naïve Bayes and SVM, on the tag-represented document collection.
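A minimal sketch of the bag-of-tags representation follows, assuming the top-k tags per document have already been produced by sOntoLDA; the names are illustrative.

    def to_bag_of_tags(doc_tags, tag_vocab):
        """Count vector over the tag vocabulary for one document."""
        ix = {t: i for i, t in enumerate(tag_vocab)}
        vec = [0] * len(tag_vocab)
        for tag in doc_tags:
            for word in tag.split():  # tags may be multi-word category names
                if word in ix:
                    vec[ix[word]] += 1
        return vec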

7.6.3 Evaluation Metrics

We compare the classification results of the tag-represented documents with the outputs of the model that classifies the corpus considering only the text of the documents, using the Precision, Recall and F-Measure evaluation metrics [72]. For binary classification problems, precision, recall and F-Measure are defined as:

Prec = \frac{TP}{TP + FP} \qquad (7.13)

Rec = \frac{TP}{TP + FN} \qquad (7.14)

F1 = 2 \times \frac{precision \times recall}{precision + recall} \qquad (7.15)

where TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives. In a multi-label classification problem, let TP_i, FP_i and FN_i be the numbers of true positives, false positives and false negatives for label i, respectively. The precision and recall are then, according to [106], defined as:

Prec = \frac{\sum_i TP_i}{\sum_i TP_i + \sum_i FP_i} \qquad (7.16)

Rec = \frac{\sum_i TP_i}{\sum_i TP_i + \sum_i FN_i} \qquad (7.17)
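These micro-averaged quantities reduce to a few sums over the per-label counts, as in the following sketch (function name ours):

    def micro_precision_recall(tp, fp, fn):
        """Eqs. 7.16-7.17: tp, fp, fn are per-label count lists."""
        prec = sum(tp) / (sum(tp) + sum(fp))
        rec = sum(tp) / (sum(tp) + sum(fn))
        return prec, rec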

7.6.4 Evaluation Results

Table 7.8 presents the results of the multi-label classification using 10-fold cross validation. As can be seen, SVM produced the best results for the original document collection as well as for the tag-represented documents. Moreover, we notice

Table 7.8: Multi-label classification Precision, Recall and F-Measure values of different algorithms.

               Original Corpus                    Tag-represented Corpus
               Precision  Recall  F-Measure       Precision  Recall  F-Measure
Decision Tree  0.881      0.903   0.892           0.831      0.836   0.833
Naïve Bayes    0.886      0.942   0.913           0.870      0.892   0.881
SVM            0.921      0.931   0.926           0.885      0.884   0.884

that the performance of the algorithms on the tag-represented corpus is comparable to the model that only takes the text of the documents into account, and the difference is less than 4% for all the evaluation metrics. More interestingly, because the size of the tag vocabulary V is an order of magnitude smaller than the word vocabulary W, the training and testing time is significantly lower and classification is much faster. Hence, many text processing tasks can benefit from the semantic tags assigned to the documents. Table 7.9 shows the performance measures of each binary classifier individually for both the original corpus and the tag-represented documents.

7.7 Conclusions

In this paper, we presented a probabilistic topic model, sOntoLDA, that integrates prior knowledge from the DBpedia hierarchical category network with statistical topic modeling in a single framework. We employed our model for the semantic annotation of Web pages and online documents with Wikipedia categories.

Table 7.9: Precision, Recall and F-Measure values of different algorithms.

Business
               Original Corpus                    Tag-represented Corpus
               Precision  Recall  F-Measure       Precision  Recall  F-Measure
Decision Tree  0.809      0.807   0.808           0.748      0.748   0.748
Naïve Bayes    0.858      0.838   0.840           0.819      0.806   0.807
SVM            0.839      0.839   0.839           0.803      0.803   0.803

Health
               Original Corpus                    Tag-represented Corpus
               Precision  Recall  F-Measure       Precision  Recall  F-Measure
Decision Tree  0.949      0.95    0.949           0.886      0.886   0.886
Naïve Bayes    0.953      0.953   0.952           0.931      0.931   0.931
SVM            0.969      0.969   0.969           0.937      0.937   0.937

Science
               Original Corpus                    Tag-represented Corpus
               Precision  Recall  F-Measure       Precision  Recall  F-Measure
Decision Tree  0.868      0.869   0.868           0.792      0.793   0.793
Naïve Bayes    0.903      0.902   0.902           0.858      0.850   0.852
SVM            0.938      0.938   0.938           0.863      0.863   0.863

Experimental results demonstrate the effectiveness and robustness of the proposed method when applied to text collections from various domains. We observed that utilizing the prior knowledge about the word probabilities (tf-idf weights) obtained from Wikipedia's hierarchical ontology and encoded in the λ matrix can be successfully used for the semantic tagging of documents, which is an important step towards the Semantic Web.

There are many interesting future extensions to this work. We did not take the hierarchical structure of the Wikipedia categories into account directly in the topic model. Thus, exploring richer topic models that consider the hierarchical relations between the categories would be interesting future work. It also would be interesting to investigate the usage of this model for the text classification task. [2] introduced an ontology-based text classification method, which, in contrast to traditional supervised text classification methods, does not need a training set. Since topic models are naturally unsupervised techniques, exploring the possibilities of developing topic models where topics of interest are defined based on ontological concepts included in DBpedia, Freebase, and other ontologies would be a promising direction for future work. Another direction of research is to explore further generative topic models that incorporate hierarchical knowledge bases for personalization and recommendation tasks [52].

Chapter 8

Conclusions and Future Work

In this dissertation, we proposed two primary mechanisms to exploit domain knowledge from ontologies for a variety of text and data mining tasks. First, we introduced an ontology-based text classification method in which the ontology itself is the classifier. Thus, we do not need to train the classifier. It also enables us to dynamically define (or change) the topics of interest without retraining the classifier. Second, we proposed knowledge-based topic models that combine probabilistic topic models with prior knowledge from ontologies such as DBpedia or Linked Open Data (LOD). Knowledge-based topic models allow the users to guide and impact the learned topics and direct the models towards the topics that are best aligned with user modeling goals.

8.1 Summary of Contributions

The major contributions of this dissertation are as follows:

1. Ontology-based text classification method for dynamically defined topics. In chapter 5, we proposed an ontology-based text classification technique for classifying documents into a dynamically defined set of topics of interest. This method, which only needs a domain ontology and a set of user-defined topics known as contexts in the ontology, measures the semantic similarity between the thematic graph created from a text document and the ontology sub-graphs resulting from the projection of the defined contexts. Hence, the domain ontology becomes the classifier and, unlike traditional supervised categorization methods, does not require a set of training documents. More importantly, our proposed approach allows dynamically changing the classification topics without retraining the classifier.

2. Ontology-based topic model for automatic topic labeling. Chapter 6 introduced a knowledge-based topic model, OntoLDA, which integrates ontology concepts with the LDA model for the task of automatic topic labeling. Unlike previous works in this area, which usually represent topics via groups of words selected from the topics, OntoLDA automatically generates topic labels by considering ontology concepts rather than words alone. We also proposed a topic labeling method based on the semantics of the concepts in the discovered topics, as well as on the ontological relationships existing among the concepts in the ontology. We applied our OntoLDA model on two datasets

to demonstrate how our model can be used for automatic topic labeling as well as for linking text documents to ontology concepts and categories.

3. Inference algorithm using collapsed Gibbs sampling for the OntoLDA model. We developed an inference algorithm using collapsed Gibbs sampling for the OntoLDA topic model. The inference algorithm can be extended to ontologies containing tens of thousands of concepts by utilizing the association between the words and the concepts of the ontology.

4. A knowledge-based topic model for semantic tagging. In chapter 7, we proposed a probabilistic topic model, sOntoLDA, that incorporates DBpedia knowledge into the topic model for tagging Web pages and online documents with the topics discovered in them. We use DBpedia's hierarchical category network as our background knowledge, which includes the categories organized into a hierarchical structure and a set of articles from Wikipedia. We assign categories (topics) from Wikipedia to text documents for which there are no predefined or known categories. We learn the probability distribution of each category over the words using statistical topic models, taking into account the prior knowledge from Wikipedia about the words and their associated probabilities in various categories. We evaluated the effectiveness of our approach in terms of automatically assigning semantic tags to documents by conducting extensive experiments on two different datasets. Additionally, we performed an experiment to show how our method can benefit the document classification task.

5. Inference algorithm using collapsed Gibbs sampling for the sOntoLDA model. We developed an efficient inference algorithm using collapsed Gibbs sampling for the sOntoLDA topic model. The computational complexity of this algorithm is the same as that of the standard LDA model.

8.2 Future Work

1. Combining hierarchical relations between ontological concepts with the topic models. In our OntoLDA model, we did not encode the hierarchical relations between the ontology concepts directly in the topic model. One potential future work is to model concept relations explicitly in the topic model, which would make the automatically generated labels more consistent with the meaning of the topics. Another similar interesting future extension, to the sOntoLDA model, is to take the hierarchical structure of the Wikipedia categories into account directly in the topic model.

2. Document classification using knowledge-based topic models. Since topic models are naturally unsupervised techniques, we believe that exploring the possibilities of developing topic models, where topics of interest are defined based on ontological concepts included in DBpedia, Freebase, and other ontologies, would be a promising direction for future work.

3. Ontology-based topic models for discovering coherent topics. An interesting future project is to develop entity-based topic models that effectively integrate an ontology with an entity topic model to improve the coherence of the discovered topics. There are prior works in this direction, but they primarily use word-level domain knowledge in the model to enhance topic coherence and ignore the rich information carried by the entities (e.g., persons, locations, organizations, etc.) associated with the documents. We believe that the entities occurring in a document, together with the relationships among them, can determine the document's topics. Thus, utilizing this plentiful information is of great interest and can potentially improve topic modeling and topic coherence. Furthermore, leveraging the information in individual documents, including the entities mentioned in the document text, and joining it with the graph structure of the ontology by regularizing the topic model based on the entity network would be a promising direction for future work.

4. Semantic context-aware recommendation. There are several potentially useful directions in which knowledge-based topic models can be extended. One interesting extension to explore is to develop probabilistic topic models that incorporate user interests, item representations and context information in a single framework for Context-Aware Recommendation Systems (CARS). In this setting, contextual information is represented as a subset of the items' feature space, which is acquired from the knowledge available in the Linked Open Data (LOD). In this Semantic Context-aware Recommendation Model (SCRM), each user profile is represented as a multinomial distribution over a set of latent topics, while topics are distributions over items and item features. This probabilistic model would allow us to both infer the semantic context and model this context in a systematic way. For a given user profile u and context c, we can then compute the recommendation score for each item v as p(v|c, u), rank the items based on these scores, and select the top-n recommendations for the user.

Bibliography

[1] Mehdi Allahyari and Krys Kochut. Automatic topic labeling using ontology-based topic models. In 14th International Conference on Machine Learning and Applications (ICMLA), 2015. IEEE, 2015.

[2] Mehdi Allahyari, Krys J Kochut, and Maciej Janik. Ontology-based text classification into dynamically defined topics. In IEEE International Conference on Semantic Computing (ICSC), 2014, pages 273–278. IEEE, 2014.

[3] David Andrzejewski, Xiaojin Zhu, and Mark Craven. Incorporating domain knowledge into topic modeling via dirichlet forest priors. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 25–32. ACM, 2009.

[4] David Andrzejewski, Xiaojin Zhu, Mark Craven, and Benjamin Recht. A framework for incorporating general domain knowledge into latent dirichlet allocation using first-order logic. In IJCAI Proceedings-International Joint Conference on Artificial Intelligence, volume 22, page 1171, 2011.

[5] Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. Dbpedia: A nucleus for a web of open data. Springer, 2007.

[6] S. Banerjee and T. Pedersen. The design, implementation, and use of the Ngram Statistic Package. In Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, pages 370–381, Mexico City, February 2003.

[7] Evgeniy Bart, Ian Porteous, Pietro Perona, and Max Welling. Unsupervised learning of visual taxonomies. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.

[8] Fabiano M Belém, Eder F Martins, Jussara M Almeida, and Marcos A Gonçalves. Personalized and object-centered tag recommendation methods for web 2.0 applications. Information Processing & Management, 50(4):524–553, 2014.

[9] Tim Berners-Lee, James Hendler, Ora Lassila, et al. The semantic web. Scientific american, 284(5):28–37, 2001.

[10] Christian Bizer, Tom Heath, and Tim Berners-Lee. Linked data-the story so far. Semantic Services, Interoperability and Web Applications: Emerging Concepts, pages 205–227, 2009.

[11] Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak, and Sebastian Hellmann. Dbpedia - a crystallization point for the web of data. Web Semantics: science, services and agents on the world wide web, 7(3):154–165, 2009.

[12] David Blei and John Lafferty. Correlated topic models. Advances in neural information processing systems, 18:147, 2006.

[13] David M Blei and Michael I Jordan. Modeling annotated data. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pages 127–134. ACM, 2003.

[14] David M Blei and John D Lafferty. Dynamic topic models. In Proceedings of the 23rd international conference on Machine learning, pages 113–120. ACM, 2006.

[15] David M Blei and Jon D McAuliffe. Supervised topic models. In NIPS, volume 7, pages 121–128, 2007.

[16] David M Blei, Thomas L Griffiths, Michael I Jordan, and Joshua B Tenenbaum. Hierarchical topic models and the nested chinese restaurant process. In NIPS, volume 16, 2003.

[17] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. the Journal of machine Learning research, 3:993–1022, 2003.

[18] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 1247–1250. ACM, 2008.

[19] Christopher Boston, Hui Fang, Sandra Carberry, Hao Wu, and Xitong Liu. Wikimantic: Toward effective disambiguation and expansion of queries. Data & Knowledge Engineering, 90:22–37, 2014.

[20] Jordan L Boyd-Graber, David M Blei, and Xiaojin Zhu. A topic model for word sense disambiguation. In EMNLP-CoNLL, pages 1024–1033. Citeseer, 2007.

[21] Li Cai, Guangyou Zhou, Kang Liu, and Jun Zhao. Large-scale question classification in cqa by leveraging wikipedia semantic knowledge. In Proceedings of the 20th ACM international conference on Information and knowledge management, pages 1321–1330. ACM, 2011.

[22] Bob Carpenter. Integrating out multinomial parameters in latent dirichlet allocation and naive bayes for collapsed gibbs sampling. Technical report, LingPipe, 2010.

[23] Asli Celikyilmaz and Dilek Hakkani-Tür. Discovery of topically coherent sentences for extractive summarization. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 491–499. Association for Computational Linguistics, 2011.

[24] Jonathan Chang and David M Blei. Relational topic models for document networks. In International conference on artificial intelligence and statistics, pages 81–88, 2009.

[25] Chaitanya Chemudugunta, America Holloway, Padhraic Smyth, and Mark Steyvers. Modeling documents by combining semantic concepts with unsupervised statistical learning. In The Semantic Web-ISWC 2008, pages 229–244. Springer, 2008.

[26] Chaitanya Chemudugunta, Padhraic Smyth, and Mark Steyvers. Combining concept hierarchies and statistical topic models. In Proceedings of the 17th ACM conference on Information and knowledge management, pages 1469–1470. ACM, 2008.

[27] Jilin Chen, Jun Yan, Benyu Zhang, Qiang Yang, and Zheng Chen. Diverse topic phrase extraction through latent semantic analysis. In Data Mining, 2006. ICDM’06. Sixth International Conference on, pages 834–838. IEEE, 2006.

[28] Zhiyuan Chen, Arjun Mukherjee, Bing Liu, Meichun Hsu, Malu Castellanos, and Riddhiman Ghosh. Discovering coherent topics using general knowledge. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management, pages 209–218. ACM, 2013.

[29] Zhiyuan Chen, Arjun Mukherjee, Bing Liu, Meichun Hsu, Malu Castellanos, and Riddhiman Ghosh. Exploiting domain knowledge in aspect extraction. In EMNLP, pages 1655–1667, 2013.

[30] Zhiyuan Chen, Arjun Mukherjee, Bing Liu, Meichun Hsu, Malu Castellanos, and Riddhiman Ghosh. Leveraging multi-domain prior knowledge in topic models. In Proceedings of the Twenty-Third international joint conference on Artificial Intelligence, pages 2071–2077. AAAI Press, 2013.

[31] Zhiyuan Chen, Arjun Mukherjee, and Bing Liu. Aspect extraction with automated prior knowledge learning. In Proceedings of ACL, pages 347–358, 2014.

[32] Hongbo Deng, Jiawei Han, Bo Zhao, Yintao Yu, and Cindy Xide Lin. Probabilistic topic models with biased propagation on heterogeneous information networks. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1271–1279. ACM, 2011.

[33] Stephen Dill, Nadav Eiron, David Gibson, Daniel Gruhl, R Guha, Anant Jhingran, Tapas Kanungo, Sridhar Rajagopalan, Andrew Tomkins, John A Tomlin, et al. Semtag and seeker: Bootstrapping the semantic web via automated semantic annotation. In Proceedings of the 12th international conference on World Wide Web, pages 178–186. ACM, 2003.

[34] Li Fei-Fei and Pietro Perona. A bayesian hierarchical model for learning natural scene categories. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 2, pages 524–531. IEEE, 2005.

[35] Samah Fodeh, Bill Punch, and Pang-Ning Tan. On ontology-driven document clustering using core semantic features. Knowledge and information systems, 28(2):395–421, 2011.

[36] Evgeniy Gabrilovich and Shaul Markovitch. Overcoming the brittleness bottleneck using wikipedia: Enhancing text categorization with encyclopedic knowledge. In AAAI, volume 6, pages 1301–1306, 2006.

[37] Walter R Gilks, Sylvia Richardson, and David J Spiegelhalter. Introducing markov chain monte carlo. In Markov chain Monte Carlo in practice, pages 1–19. Springer, 1996.

[38] Thomas L Griffiths and Mark Steyvers. Finding scientific topics. Proceedings of the National academy of Sciences of the United States of America, 101 (Suppl 1):5228–5235, 2004.

[39] Thomas R Gruber. Toward principles for the design of ontologies used for knowledge sharing? International journal of human-computer studies, 43 (5):907–928, 1995.

[40] Mostafa M Hassan, Fakhri Karray, and Mohamed S Kamel. Automatic document topic identification using wikipedia hierarchical ontology. In Information Science, Signal Processing and their Applications (ISSPA), 2012 11th International Conference on, pages 237–242. IEEE, 2012.

[41] Gregor Heinrich. Parameter estimation for text analysis. Technical report, 2005.

[42] Swapnil Hingmire and Sutanu Chakraborti. Topic labeled text classification: a weakly supervised approach. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, pages 385–394. ACM, 2014.

[43] Johannes Hoffart, Fabian M Suchanek, Klaus Berberich, and Gerhard Weikum. Yago2: A spatially and temporally enhanced knowledge base from wikipedia. In Proceedings of the Twenty-Third international joint conference on Artificial Intelligence, pages 3161–3165. AAAI Press, 2013.

[44] Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 50–57. ACM, 1999.

[45] Andreas Hotho, Alexander Maedche, and Steffen Staab. Ontology-based text document clustering. KI, 16(4):48–54, 2002.

[46] Xiaohua Hu, Xiaodan Zhang, Caimei Lu, Eun K Park, and Xiaohua Zhou. Exploiting wikipedia as external knowledge for document clustering. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 389–396. ACM, 2009.

[47] Yuening Hu, Jordan Boyd-Graber, Brianna Satinoff, and Alison Smith. Interactive topic modeling. Machine Learning, 95(3):423–469, 2014.

[48] Ioana Hulpus, Conor Hayes, Marcel Karnstedt, and Derek Greene. Unsupervised graph-based topic labelling using dbpedia. In Proceedings of the sixth ACM international conference on Web search and data mining, pages 465–474. ACM, 2013.

[49] Jagadeesh Jagarlamudi, Hal Daumé III, and Raghavendra Udupa. Incorporating lexical priors into topic models. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 204–213. Association for Computational Linguistics, 2012.

[50] Maciej Janik and Krys Kochut. Training-less ontology-based text categorization. In workshop on Exploiting Semantic Annotations in Information Retrieval (ESAIR), pages 3–17, 2008.

[51] Maciej Janik and Krys J Kochut. Wikipedia in action: Ontological knowledge in text categorization. In Semantic Computing, 2008 IEEE International Conference on, pages 268–275. IEEE, 2008.

[52] Pavan Kapanipathi, Prateek Jain, Chitra Venkataramani, and Amit Sheth. User interests identification on twitter using a hierarchical knowledge base. In The Semantic Web: Trends and Challenges, pages 99–113. Springer, 2014.

[53] Saurabh S Kataria, Krishnan S Kumar, Rajeev R Rastogi, Prithviraj Sen, and Srinivasan H Sengamedu. Entity disambiguation with hierarchical topic models. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1037–1045. ACM, 2011.

[54] Jon M Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM), 46(5):604–632, 1999.

[55] Ralf Krestel and Peter Fankhauser. Personalized topic-based tag recommendation. Neurocomputing, 76(1):61–70, 2012.

[56] Ralf Krestel, Peter Fankhauser, and Wolfgang Nejdl. Latent dirichlet allocation for tag recommendation. In Proceedings of the third ACM conference on Recommender systems, pages 61–68. ACM, 2009.

[57] Michal Laclavik, Martin Šeleng, Marek Ciglan, and Ladislav Hluchý. Ontea: Platform for pattern based automated semantic annotation. Computing and Informatics, 28(4):555–579, 2012.

[58] Thomas K Landauer, Peter W Foltz, and Darrell Laham. An introduction to latent semantic analysis. Discourse processes, 25(2-3):259–284, 1998.

[59] Jey Han Lau, David Newman, Sarvnaz Karimi, and Timothy Baldwin. Best topic word selection for topic labelling. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pages 605–613. Association for Computational Linguistics, 2010.

[60] Jey Han Lau, Karl Grieser, David Newman, and Timothy Baldwin. Automatic labelling of topic models. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 1536–1545. Association for Computational Linguistics, 2011.

[61] Gregory F Lawler. Introduction to stochastic processes. CRC Press, 2006.

[62] Angeliki Lazaridou, Ivan Titov, and Caroline Sporleder. A bayesian model for joint unsupervised induction of sentiment, aspect and discourse representations. In ACL (1), pages 1630–1639, 2013.

[63] Chenliang Li, Aixin Sun, and Anwitaman Datta. A generalized method for word sense disambiguation based on wikipedia. In Advances in Information Retrieval, pages 653–664. Springer, 2011.

[64] Chenliang Li, Aixin Sun, and Anwitaman Datta. Tsdw: Two-stage word sense disambiguation using wikipedia. Journal of the American Society for Information Science and Technology, 64(6):1203–1223, 2013.

[65] Jiwei Li, Claire Cardie, and Sujian Li. Topicspam: a topic-model based approach for spam detection. In ACL (2), pages 217–221, 2013.

[66] Wei Li and Andrew McCallum. Pachinko allocation: Dag-structured mixture models of topic correlations. In Proceedings of the 23rd international conference on Machine learning, pages 577–584. ACM, 2006.

[67] Chenghua Lin and Yulan He. Joint sentiment/topic model for sentiment analysis. In Proceedings of the 18th ACM conference on Information and knowledge management, pages 375–384. ACM, 2009.

[68] Erik Linstead, Paul Rigor, Sushil Bajracharya, Cristina Lopes, and Pierre Baldi. Mining eclipse developer contributions via author-topic models. In Mining Software Repositories, 2007. ICSE Workshops MSR'07. Fourth International Workshop on, pages 30–30. IEEE, 2007.

[69] Yan Liu, Alexandru Niculescu-Mizil, and Wojciech Gryc. Topic-link lda: joint models of topic and author community. In Proceedings of the 26th annual international conference on machine learning, pages 665–672. ACM, 2009.

[70] Qiming Luo, Enhong Chen, and Hui Xiong. A semantic term weighting scheme for text categorization. Expert Systems with Applications, 38(10): 12708–12716, 2011.

[71] Davide Magatti, Silvia Calegari, Davide Ciucci, and Fabio Stella. Automatic labeling of topics. In Intelligent Systems Design and Applications, 2009. ISDA’09. Ninth International Conference on, pages 1227–1232. IEEE, 2009.

[72] Christopher D Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to information retrieval, volume 1. Cambridge university press Cambridge, 2008.

[73] Xian-Ling Mao, Zhao-Yan Ming, Zheng-Jun Zha, Tat-Seng Chua, Hongfei Yan, and Xiaoming Li. Automatic labeling hierarchical topics. In Proceedings of the 21st ACM international conference on Information and knowledge management, pages 2383–2386. ACM, 2012.

[74] Qiaozhu Mei, Chao Liu, Hang Su, and ChengXiang Zhai. A probabilistic approach to spatiotemporal theme pattern mining on weblogs. In Proceedings of the 15th international conference on World Wide Web, pages 533–542. ACM, 2006.

[75] Qiaozhu Mei, Xuehua Shen, and ChengXiang Zhai. Automatic labeling of multinomial topic models. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 490–499. ACM, 2007.

[76] Qiaozhu Mei, Deng Cai, Duo Zhang, and ChengXiang Zhai. Topic modeling with network regularization. In Proceedings of the 17th international conference on World Wide Web, pages 101–110. ACM, 2008.

[77] Rada Mihalcea and Paul Tarau. Textrank: Bringing order into texts. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2004.

[78] George A Miller. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41, 1995.

[79] David Mimno and Andrew McCallum. Expertise modeling for matching papers with reviewers. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 500–509. ACM, 2007.

[80] David Mimno and Andrew McCallum. Topic models conditioned on arbitrary features with dirichlet-multinomial regression. arXiv preprint arXiv:1206.3278, 2012.

[81] David Mimno, Wei Li, and Andrew McCallum. Mixtures of hierarchical topics with pachinko allocation. In Proceedings of the 24th international conference on Machine learning, pages 633–640. ACM, 2007.

[82] David Mimno, Hanna M Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum. Optimizing semantic coherence in topic models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 262–272. Association for Computational Linguistics, 2011.

[83] Thomas Minka. Estimating a dirichlet distribution. Technical report, MIT, 2000.

[84] Arjun Mukherjee and Bing Liu. Mining contentions from discussions and debates. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 841–849. ACM, 2012.

[85] Hilary Nesi. Bawe: an introduction to a new resource. New trends in corpora and language learning, pages 212–28, 2011.

[86] David Newman, Chaitanya Chemudugunta, and Padhraic Smyth. Statistical entity-topic models. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 680–686. ACM, 2006.

[87] David Newman, Sarvnaz Karimi, and Lawrence Cavedon. External evaluation of topic models. In Australasian Document Computing Symposium, 2009.

[88] James Petterson, Wray Buntine, Shravan M Narayanamurthy, Tibério S Caetano, and Alex J Smola. Word features for latent dirichlet allocation. In Advances in Neural Information Processing Systems, pages 1921–1929, 2010.

[89] Daniel Ramage, David Hall, Ramesh Nallapati, and Christopher D Manning. Labeled lda: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1, pages 248–256. Association for Computational Linguistics, 2009.

[90] Petar Ristoski and Heiko Paulheim. Semantic web in data mining and knowledge discovery: A comprehensive survey. Web Semantics: Science, Services and Agents on the World Wide Web, 2016.

[91] Christian P Robert and George Casella. Monte Carlo statistical methods, volume 319. Springer, 2004.

[92] Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers, and Padhraic Smyth. The author-topic model for authors and documents. In Proceedings of the 20th conference on Uncertainty in artificial intelligence, pages 487–494. AUAI Press, 2004.

[93] Michal Rosen-Zvi, Chaitanya Chemudugunta, Thomas Griffiths, Padhraic Smyth, and Mark Steyvers. Learning author-topic models from text corpora. ACM Transactions on Information Systems (TOIS), 28(1):4, 2010.

[94] Timothy N Rubin, America Chambers, Padhraic Smyth, and Mark Steyvers. Statistical topic models for multi-label document classification. Machine Learning, 88(1-2):157–208, 2012.

[95] David Sánchez, David Isern, and Miquel Millan. Content annotation for the semantic web: an automatic web-based approach. Knowledge and Information Systems, 27(3):393–418, 2011.

[96] Peter Schönhofen. Identifying document topics using the wikipedia category network. Web Intelligence and Agent Systems: An International Journal, 7(2):195–207, 2009.

[97] Fabrizio Sebastiani. Machine learning in automated text categorization. ACM computing surveys (CSUR), 34(1):1–47, 2002.

[98] Bracha Shapira, Nir Ofek, and Victor Makarenkov. Exploiting wikipedia for information retrieval tasks. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1137–1140. ACM, 2015.

[99] Masumi Shirakawa, Kotaro Nakayama, Takahiro Hara, and Shojiro Nishio. Concept vector extraction from wikipedia category network. In Proceedings of the 3rd International Conference on Ubiquitous Information Management and Communication, pages 71–79. ACM, 2009.

[100] Mark Steyvers, Padhraic Smyth, Michal Rosen-Zvi, and Thomas Griffiths. Probabilistic author-topic models for information discovery. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 306–315. ACM, 2004.

[101] Zareen Saba Syed, Tim Finin, and Anupam Joshi. Wikipedia as an ontology for describing documents. In ICWSM, 2008.

[102] Jie Tang, Mingcai Hong, Juanzi Li, and Bangyong Liang. Tree-structured conditional random fields for semantic annotation. In The Semantic Web-ISWC 2006, pages 640–653. Springer, 2006.

[103] Grigorios Tsoumakas and Ioannis Katakis. Multi-label classification: An overview. Dept. of Informatics, Aristotle University of Thessaloniki, Greece, 2006.

[104] Grigorios Tsoumakas, Ioannis Katakis, and Ioannis Vlahavas. Mining multi-label data. In Data mining and knowledge discovery handbook, pages 667–685. Springer, 2009.

[105] Suppawong Tuarob, Line C Pouchard, and C Lee Giles. Automatic tag recommendation for metadata annotation using probabilistic topic modeling. In Proceedings of the 13th ACM/IEEE-CS joint conference on Digital Libraries, pages 239–248. ACM, 2013.

[106] Celine Vens, Jan Struyf, Leander Schietgat, Sašo Džeroski, and Hendrik Blockeel. Decision trees for hierarchical multi-label classification. Machine Learning, 73(2):185–214, 2008.

[107] Hanna M Wallach, David Mimno, and Andrew McCallum. Rethinking lda: Why priors matter. In Advances in Neural Information Processing Systems, 2009.

[108] Chong Wang and David M Blei. Collaborative topic modeling for recommending scientific articles. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 448–456. ACM, 2011.

[109] Chong Wang, David Blei, and Fei-Fei Li. Simultaneous image classification and annotation. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1903–1910. IEEE, 2009.

[110] Chong Wang, David Blei, and David Heckerman. Continuous time dynamic topic models. arXiv preprint arXiv:1206.3298, 2012.

[111] Dingding Wang, Shenghuo Zhu, Tao Li, and Yihong Gong. Multi-document summarization using sentence-based topic models. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 297–300. Association for Computational Linguistics, 2009.

[112] Pu Wang, Jian Hu, Hua-Jun Zeng, and Zheng Chen. Using wikipedia knowledge to improve text classification. Knowledge and Information Systems, 19(3):265–281, 2009.

[113] Xiaogang Wang and Eric Grimson. Spatial latent dirichlet allocation. In Advances in neural information processing systems, pages 1577–1584, 2008.

[114] Xuerui Wang and Andrew McCallum. Topics over time: a non-markov continuous-time model of topical trends. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 424–433. ACM, 2006.

[115] Xuerui Wang, Andrew McCallum, and Xing Wei. Topical n-grams: Phrase and topic discovery, with an application to information retrieval. In Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on, pages 697–702. IEEE, 2007.

[116] Yang Wang, Payam Sabzmeydani, and Greg Mori. Semi-latent dirichlet allocation: A hierarchical model for human action recognition. In Human Motion–Understanding, Modeling, Capture and Animation, pages 240–254. Springer, 2007.

[117] Xing Wei and W Bruce Croft. Lda-based document models for ad-hoc retrieval. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 178–185. ACM, 2006.

[118] Ian H Witten and David Milne. An effective, low-cost measure of semantic relatedness obtained from wikipedia links. In Proceedings of the AAAI Workshop on Wikipedia and Artificial Intelligence: an Evolving Synergy, pages 25–30. AAAI Press, Chicago, USA, 2008.

[119] Ke Xu, Weiguo Yang, Guo-Ping Liu, and Hongbin Sun. Unsupervised satellite image classification using markov field topic model. Geoscience and Remote Sensing Letters, IEEE, 10(1):130–134, 2013.

[120] Yang Yang, Nitesh Chawla, Yizhou Sun, and Jiawei Han. Predicting links in multi-relational and heterogeneous networks. In Data Mining (ICDM), 2012 IEEE 12th International Conference on, pages 755–764. IEEE, 2012.

[121] Limin Yao, Aria Haghighi, Sebastian Riedel, and Andrew McCallum. Structured relation discovery using generative models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1456–1466. Association for Computational Linguistics, 2011.

[122] Liyang Yu. A developer’s guide to the semantic Web. Springer, 2011.

[123] Erheng Zhong, Wei Fan, and Qiang Yang. User behavior learning and transfer in composite social networks. ACM Transactions on Knowledge Discovery from Data (TKDD), 8(1):6, 2014.
