UNIVERSITY OF SOUTHAMPTON
FACULTY OF PHYSICAL SCIENCES AND ENGINEERING
Electronics and Computer Science

Building Tag Hierarchies Based on Co-occurrences and Lexico-Syntactic Patterns

by Fahad Ibrahim Bin Moqhim

Thesis for the degree of Doctor of Philosophy
June 2016

ABSTRACT

Knowledge structures, such as taxonomies, are key to the organization and management of Web content, but are expensive to build manually. In this thesis we explore the issues around automatically building effective tag hierarchies from folksonomies (collective social classifications), and propose changes to the state-of-the-art methods that improve their performance. These changes aim to tackle the "generality-popularity" problem of tags, in which popularity is assumed (sometimes inaccurately) to be a proxy for generality, i.e. high-level taxonomic terms are assumed to occur more often than low-level ones.

The effectiveness of this research is demonstrated in four experiments. The first experiment explores whether taxonomic tag pairs captured directly from users change the quality of constructed tag hierarchies. The second experiment examines the possibility of using personal tag relationships constructed by users to improve the accuracy of learned taxonomic tags. The third experiment demonstrates the potential of using lexico-syntactic patterns applied to a closed text corpus to improve the direction of automatically derived tag pairs in order to build higher quality tag hierarchies. The last experiment investigates the possibility of using an open knowledge repository instead of a closed knowledge resource to increase the tag coverage in any tag collection, and consequently the quality of learned tag hierarchies.
The results of our experiments show, first, that collecting taxonomic tag pairs directly from users increases the semantic quality of the tag hierarchy, but at the expense of expressivity, and with some degradation of user experience. Secondly, personal tag relationships can be used to improve the accuracy of constructed taxonomic tags, but with limited success if the personal tag relationships and the learned taxonomic tags are not extracted from the same tagging system. Finally, lexico-syntactic patterns applied to a large closed text corpus (e.g. Wikipedia) can improve the accuracy of the directions of relations constructed between tags by a generality-based approach to tag hierarchy construction. Accuracy would improve further if an open corpus (e.g. the Web) were used instead of a closed one, which in turn improves the quality of the learned tag hierarchies in terms of structure and semantics.

Table of Contents

ABSTRACT
Table of Contents
List of Tables
List of Figures
Declaration of Authorship
Acknowledgements
Definitions and Abbreviations

1 Introduction
  1.1 Motivation for the Research
  1.2 Research Hypothesis and Questions
  1.3 Research Contributions
  1.4 Publications
  1.5 Outline of the Thesis

2 Knowledge Representation on the Web
  2.1 Knowledge Structures based on Collective Intelligence
  2.2 Motivation of Knowledge Structures
  2.3 Existing Knowledge Structures
      2.3.1.1 Glossaries
      2.3.1.2 Folksonomies
      2.3.1.3 Controlled Vocabularies
      2.3.1.4 Taxonomies
      2.3.1.5 Thesauri
      2.3.1.6 Conceptual Graphs
      2.3.1.7 Ontologies
      2.3.1.8 Summary of Existing Knowledge Structures
  2.4 Chapter Summary

3 Tag Hierarchies Construction and Evaluation
  3.1 Tagging Motivations
  3.2 Tagging Content
  3.3 Approaches for Tagging
  3.4 Approaches for Tag Hierarchies Construction
    3.4.1 Clustering Techniques based Approaches
    3.4.2 Knowledge Resources based Approaches
    3.4.3 Hybrid Approaches
    3.4.4 Limitations of the Approaches
  3.5 Evaluation of Building Tag Hierarchies Approaches
    3.5.1 Semantic Evaluation
      3.5.1.1 Evaluation against Reference Taxonomy
      3.5.1.2 Evaluation by Human Assessment
    3.5.2 Structural Evaluation
    3.5.3 Usability Evaluation
  3.6 Chapter Summary

4 Building Tag Hierarchies from Crowdsourced Taxonomic Tag Pairs
  4.1 Proposed Social Tagging Approach
  4.2 Proposed Algorithm
    4.2.1 Description of the Algorithm
    4.2.2 Settings of the Algorithms
  4.3 TagTree System
  4.4 Datasets
  4.5 Evaluation Methodology
  4.6 Results and Analysis
    4.6.1 Results of Semantic Evaluation
      4.6.1.1 Reference-based Evaluation
      4.6.1.2 Human-based Evaluation
    4.6.2 Results of Structural Evaluation
    4.6.3 Results of Usability Evaluation
  4.7 Chapter Summary

5 Improving the Accuracy of Taxonomic Directions When Building Tag Hierarchies
  5.1 Proposed Approach
  5.2 Proposed Algorithm
    5.2.1 Description of the Algorithm
    5.2.2 Similarity Measure
    5.2.3 Settings of the Algorithms
  5.3 Datasets
    5.3.1 Tag Collections