Rossi Setchi

Invited Lectures at China Agricultural University, 17-21 September 2012, Beijing, China

SEMANTICALLY ENHANCED DOCUMENT CLUSTERING

IVAN DIMITROV STANKOV

Professor Rossi Setchi
Leader of the Knowledge Engineering Systems Group
Leader of the Institute of Mechanical and Manufacturing Engineering

A thesis submitted to the School of Engineering, Cardiff University
In partial fulfilment of the requirements for the degree of Doctor of Philosophy
2012

Abstract

This thesis advocates the view that traditional document clustering could be significantly improved by representing documents at different levels of abstraction and considering the similarity between documents at each of those levels. The improvement concerns how well the clustering solutions align with human judgement. The proposed methodology employs semantics to measure the conceptual similarity between documents. The goal is to design algorithms which implement the methodology, in order to solve the following research problems: (i) how to obtain multiple deterministic clustering solutions; (ii) how to produce coherent large-scale clustering solutions across domains, regardless of the number of clusters; (iii) how to obtain clustering solutions which align well with human judgement; and (iv) how to produce specific clustering solutions from the perspective of the user’s understanding of the domain of interest.

By using levels of abstraction, the developed clustering methodology enhances the separation between clusters and improves the coherence within clusters generated across several domains. The methodology employs a semantically enhanced text stemmer, which is developed for the purpose of producing coherent clustering, and a concept index that provides a generic document representation with reduced dimensionality.

These characteristics make it possible to address the limitations of traditional text document clustering by employing computationally expensive similarity measures such as Earth Mover’s Distance (EMD), which in theory aligns the clustering solutions more closely with human judgement. A threshold for similarity between documents that employs many-to-many similarity matching is proposed and shown experimentally to help traditional clustering algorithms produce clustering solutions that are aligned more closely with human judgement.

The experimental validation demonstrates the scalability of the semantically enhanced document clustering methodology and supports the following contributions: (i) multiple deterministic clustering solutions and different viewpoints on a document collection are obtained; (ii) the use of concept indexing as a document representation technique in document clustering is beneficial for producing coherent clusters across domains; (iii) the SETS algorithm provides improved text normalisation by using external knowledge; (iv) a method is provided for measuring similarity between documents on a large scale by using many-to-many matching; (v) a semantically enhanced methodology is developed that employs levels of abstraction corresponding to a user’s background, understanding and motivation. The achieved results will benefit the research community working in the areas of document management, information retrieval, data mining and knowledge management.

Acknowledgements

I would like to thank the supervisors of my studies, Professor Rossi Setchi and Dr Yulia Hicks, for their invaluable guidance and support throughout my work.

I thank all members of the KES research group in the School of Engineering at Cardiff University for their friendship and help.

My deepest gratitude goes to my family, who have given me continuous support and encouragement.
TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGEMENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF PUBLICATIONS

CHAPTER 1: INTRODUCTION
1.1. MOTIVATION
1.2. AIMS AND OBJECTIVES
1.3. OUTLINE OF THE THESIS

CHAPTER 2: LITERATURE REVIEW
2.1. CLUSTERING METHODOLOGIES AND TECHNIQUES
2.1.1. Clustering methods
2.1.2. Clustering techniques
2.1.3. Clustering procedure
2.1.4. Feature selection
2.1.5. Clustering algorithm design and selection
2.1.6. Evaluation of clustering solutions
2.1.6.1. Evaluation methodology in information retrieval and cognitive psychology
2.1.6.2. Evaluation methodology employed
2.2. MODEL-BASED DOCUMENT CLUSTERING
2.2.1. Partitional approach to clustering
2.2.2. Hierarchical approach to clustering
2.3. SIMILARITY-BASED DOCUMENT CLUSTERING
2.3.1. Word Sense Disambiguation (WSD)
2.3.2. Document representations
2.3.3. Similarity measures
2.3.4. Clustering techniques
2.3.5. External semantic source
2.4. SUMMARY

CHAPTER 3: CONCEPTUAL MODEL OF SEMANTICALLY ENHANCED DOCUMENT CLUSTERING
3.1. LIMITATIONS OF TRADITIONAL DOCUMENT CLUSTERING
3.1.1. Clustering solutions generated are inconsistent and poorly aligned to human judgement
3.1.2. Document similarity across domains
3.1.3. Meaningful clustering solutions
3.2. REQUIREMENTS TOWARDS THE METHODOLOGY
3.2.1. Reduced Dimensionality
3.2.2. Multiple viewpoints to clustering solutions
3.2.3. Consistent to human judgement clustering solutions
3.2.4. Meaningful clustering solutions and intuitive browsing
3.2.5. Deterministic clustering solutions on a large scale
3.3. TOWARDS ADVANCED DOCUMENT CLUSTERING
3.3.1. Advanced document representation
3.3.2. Advanced document similarity
3.4. CONCEPTUAL MODEL
3.4.1. Pair-wise similarity measure
3.4.2. Concept indexing in clustering
3.4.3. Text normalisation
3.5. SUMMARY

CHAPTER 4: SEMANTICALLY ENHANCED TEXT NORMALISATION
4.1. IMPROVEMENT OF CLUSTERS COHERENCY
