Utilising Wikipedia for Text Mining Applications

Muhammad Atif Qureshi

College of Engineering and Informatics
National University of Ireland, Galway

Department of Informatics, Systems and Communications
University of Milano-Bicocca, Milan

A thesis submitted for the degree of Doctor of Philosophy

2015

Acknowledgements

I am very grateful to my supervisors, Dr. Colm O'Riordan and Dr. Gabriella Pasi. Their guidance and support made this day possible. In particular, I have found a great friend in Dr. Colm O'Riordan; besides being my supervisor, he made my stay in Ireland a delight, and his kindness showed me that I have more to learn from him than the scientific subject of this thesis. When my mother came to Ireland after the birth of my daughter, Colm visited our home twice with his family. We felt truly thankful for this kind gesture, and my mother asked me on multiple occasions to thank him in the best possible way for his gentleness; hence, on behalf of myself and my mother, I am writing a special thanks in this acknowledgement.

I am also thankful for the many suggestions and interesting discussions throughout my PhD programme with different people at the National University of Ireland, Galway and the University of Milano-Bicocca, Milan.

I am very grateful to the College of Engineering and Informatics scholarship committee at the National University of Ireland, Galway, who considered me worthy of the opportunity. It surely would not have been possible to complete the research conducted in this thesis without their support.

I am grateful to my wife Arjumand, who has partnered me in all walks of life, ranging from scientific discussions to home affairs. We have seen the ups and downs of life together and have always stood by each other's side; we share a workplace and home affairs, and enjoy different hobbies and activities.
She developed an interest in SciFi because I am a fan of it, and she enjoys watching sports (a problem solved for me in terms of TV time). Together, we have a beautiful daughter, Fareeha Qureshi, whose smile makes the world look so easy and pleasant. Fareeha will be 5 months old at the time of submission of this thesis, but even this short period is so dear and valuable that words cannot express it.

I am very grateful to my mother, who has been a source of strength for me throughout my life; she raised me after the passing of my father at an early age, and she showed with her actions that nothing is impossible. I admire her for completing her PhD at a time when she was a single parent and the sole earner for our family. It is from her that I learned that nothing is impossible to achieve if we have strong motivation; she is my inspiration and the lamp at home that turned me into the person that I am today.

Lastly, and most importantly, I am humbled by the Blessings of the Creator and Sustainer of the Universe, Allah swt. Indeed, it is He who grants us what we do not deserve, and none is worthy of praise except Him.

Abstract

The process whereby inferences are made from textual data is broadly referred to as text mining. In order to ensure the quality and effectiveness of the derived inferences, several approaches have been proposed for different text mining applications. Among these applications, classifying a piece of text into pre-defined classes through the utilisation of training data falls into supervised approaches, while arranging related documents or terms into clusters falls into unsupervised approaches. In both these approaches, processing is undertaken at the level of documents to make sense of the text within those documents. Recent research efforts have begun exploring the role of knowledge bases in solving the various problems that arise in the domain of text mining.
Of all the knowledge bases, Wikipedia, on account of being one of the largest human-curated online encyclopaedias, has proven to be one of the most valuable resources in dealing with various problems in the domain of text mining. However, previous Wikipedia-based research efforts have not taken both Wikipedia categories and Wikipedia articles together as a source of information. This thesis serves as a first step in eliminating this gap, and throughout the contributions made in this thesis we show the effectiveness of the Wikipedia category-article structure for various text mining tasks.

Wikipedia categories are organized in a taxonomical manner, serving as semantic tags for Wikipedia articles, and this provides a strong abstraction and an expressive mode of knowledge representation. In this thesis, we explore the effectiveness of this mode of Wikipedia's expression (i.e., the category-article structure) via its application in the domains of text classification, subjectivity analysis (via a notion of "perspective" in news search), and keyword extraction.

First, we show the effectiveness of exploiting Wikipedia for two classification tasks: (1) classifying tweets (messages sent using Twitter) as relevant or irrelevant to an entity or brand, and (2) classifying tweets into different topical dimensions, such as tweets related to workplace, innovation, etc. To do so, we define the notion of relatedness between the text in a tweet and the information embedded within the Wikipedia category-article structure. Then, we present an application in the area of news search that uses the same notion of relatedness to show more information about each search result, highlighting the amount of perspective or subjective bias in each returned result towards a certain opinion, topical drift, etc. Finally, we present a keyword extraction strategy that uses community detection over the Wikipedia categories to discover related keywords arranged in different communities.

The relationship between Wikipedia categories and articles is explored via a textual phrase matching framework whose starting point is textual phrases that match the titles or redirects of Wikipedia articles. The Wikipedia articles for which a match occurs are then used to extract their associated categories, and these Wikipedia categories are used to derive various structural measures, such as those relating to taxonomical depth and the Wikipedia articles they contain. These measures are utilised in our proposed text classification, subjectivity analysis, and keyword extraction frameworks, and the performance is analysed via extensive experimental evaluations. These evaluations undertake comparisons with standard text mining approaches in the literature, and our Wikipedia framework based on the category-article structure outperforms the standard text mining techniques.

Contents

1 Introduction
  1.1 Motivation and Problem Statement
    1.1.1 Textual Data over the World Wide Web
    1.1.2 Role of Knowledge Bases in Text Mining Applications
  1.2 Open Challenges
  1.3 Research Questions
  1.4 Contributions
  1.5 Thesis Flow and Structure

2 Background
  2.1 Text Mining
    2.1.1 Document Representation Models
      2.1.1.1 Vector Space Model
    2.1.2 Unsupervised Learning Methods from Text Data
      2.1.2.1 Text Clustering
      2.1.2.2 Topic Modelling
    2.1.3 Supervised Learning Methods from Text Data
    2.1.4 Evaluation Measures
  2.2 Knowledge Bases
    2.2.1 DBPedia
    2.2.2 YAGO: Yet Another Great Ontology
    2.2.3 Freebase
    2.2.4 WordNet
    2.2.5 Cyc and OpenCyc
    2.2.6 Wikipedia
  2.3 The Data Source: Twitter
  2.4 Summary of the Chapter

3 Related Research
  3.1 Semantic Relatedness
  3.2 Named Entity Recognition
  3.3 Disambiguation Problem
    3.3.1 Word Sense Disambiguation (WSD)
    3.3.2 Named Entity Disambiguation (NED)
  3.4 Seeking Information for Complex Needs
    3.4.1 Search Result Diversification
    3.4.2 Exploratory Search
  3.5 Knowledge Extraction
    3.5.1 Document Summarization
    3.5.2 Keyword Extraction
      3.5.2.1 Supervised strategies
      3.5.2.2 Unsupervised strategies
  3.6 State-of-the-Art in Lieu of Thesis Contributions
  3.7 Summary of the Chapter

4 Wikipedia Based Semantic Relatedness Framework
  4.1 Generation of Candidate Phrases
    4.1.1 Variable-Length Phrase Chunking
  4.2 Relatedness Scores Using Wikipedia Category Hierarchies
    4.2.1 Generation of Relatedness Scores
    4.2.2 Relatedness Measures
      4.2.2.1 Heuristic 1: Depth_significance
      4.2.2.2 Heuristic 2: Cat_significance
      4.2.2.3 Heuristic 3: Phrase_significance
      4.2.2.4 Summary of Relatedness Scores
  4.3 Summary of the Chapter

5 Entity Filtering and Reputation Dimensions Classification for Online Reputation Management
  5.1 Introduction to Online Reputation Management
  5.2 Significant Subtasks within Online Reputation Management
    5.2.1 Filtering Task
    5.2.2 Reputation Dimensions Classification Task
  5.3 Challenging Nature of Task
    5.3.1 Explicit Challenges
    5.3.2 Implicit Challenges
  5.4 Overview of Our Approach
    5.4.1 Filtering Task
      5.4.1.1 Baseline System for the Filtering Task
    5.4.2 Reputation Dimensions' Classification Task
      5.4.2.1 Baseline System for the Reputation Dimensions' Classification Task
  5.5 Methodology
    5.5.1 Filtering Task
      5.5.1.1 Feature Set Based on Wikipedia Category-Article Structure
      5.5.1.2 Feature Set Based on Topic Modelling
      5.5.1.3 Twitter-Specific Feature Set
