Concept Mining: a Conceptual Understanding Based Approach
Concept Mining: A Conceptual Understanding based Approach

by

Shady Shehata

A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Doctor of Philosophy in Electrical and Computer Engineering

Waterloo, Ontario, Canada, 2009

© Shady Shehata 2009

I hereby declare that I am the sole author of this thesis. This is a true copy of the thesis, including any required final revisions, as accepted by my examiners.

I understand that my thesis may be made electronically available to the public.

Abstract

With the rapid daily growth of information, there is a considerable need to extract and discover valuable knowledge from data sources such as the World Wide Web. Most common text mining techniques are based on the statistical analysis of a term, either a word or a phrase. These techniques treat documents as bags of words and pay no attention to the meaning of the document content. In addition, statistical analysis of term frequency captures the importance of a term within a document only. However, two terms can have the same frequency in their documents while one contributes more to the meaning of its sentences than the other. There is therefore a pressing need for a model that captures the meaning of linguistic utterances in a formal structure. The underlying model should indicate the terms that capture the semantics of the text. Such a model can capture the terms that present the concepts of a sentence, which leads to discovering the topic of the document. A new concept-based model is introduced that analyzes terms on the sentence, document and corpus levels, rather than on the document level only as in traditional analysis. The concept-based model can effectively discriminate between terms that are unimportant with respect to sentence semantics and terms that hold the concepts representing the sentence meaning.
The proposed model consists of a concept-based statistical analyzer, a conceptual ontological graph representation, a concept extractor and a concept-based similarity measure. A term that contributes to the sentence semantics is assigned two different weights, one by the concept-based statistical analyzer and one by the conceptual ontological graph representation. These two weights are combined into a new weight. The concepts that have the maximum combined weights are selected by the concept extractor. The similarity between documents is calculated using a new concept-based similarity measure, which takes full advantage of the concept analysis measures on the sentence, document, and corpus levels.

Large sets of experiments using the proposed concept-based model on different datasets in text clustering, categorization and retrieval are conducted. The experiments provide an extensive comparison between traditional weighting and the concept-based weighting obtained by the concept-based model. Experimental results in text clustering, categorization and retrieval demonstrate a substantial enhancement in quality using: (1) concept-based term frequency (tf), (2) conceptual term frequency (ctf), (3) the concept-based statistical analyzer, (4) the conceptual ontological graph, and (5) the concept-based combined model.

In text clustering, the evaluation of results relies on two quality measures, the F-Measure and the Entropy. In text categorization, the evaluation of results relies on three quality measures, the Micro-averaged F1, the Macro-averaged F1 and the Error rate. In text retrieval, the evaluation of results relies on three quality measures, the precision at 10 documents retrieved P(10), the preference measure (bpref), and the mean uninterpolated average precision (MAP).
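As an illustration of two of the retrieval measures named above, the following is a minimal Python sketch of P(10) and MAP under their standard textbook definitions. It is not the evaluation code used in this thesis: the function names, the representation of a ranking as a list of document identifiers, and the set of relevant identifiers are all assumptions made for the example, and bpref is left out.

```python
from typing import Hashable, Iterable, Sequence, Set, Tuple


def precision_at_k(ranking: Sequence[Hashable], relevant: Set[Hashable], k: int = 10) -> float:
    """Fraction of the top-k ranked documents that are relevant (P(10) when k = 10)."""
    top_k = ranking[:k]
    # Divide by k, not by len(top_k): a short ranking is penalized for missing results.
    return sum(1 for doc in top_k if doc in relevant) / k


def average_precision(ranking: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """Uninterpolated average precision for a single query."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            # Precision measured at the rank of each relevant document.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs: Iterable[Tuple[Sequence[Hashable], Set[Hashable]]]) -> float:
    """Mean of the per-query average precision over (ranking, relevant) pairs."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

In this sketch each query is passed as a pair of a ranked list and a set of relevant identifiers, so that comparing two weighting schemes amounts to calling `mean_average_precision` once per scheme on the same relevance judgments.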
All of these quality measures improve when the newly developed concept-based model is used in text clustering, categorization and retrieval.

Acknowledgements

This thesis would not have been possible without the support of many individuals, to whom I would like to express my gratitude. I will always be indebted to my supervisors, Prof. Fakhri Karray and Prof. Mohamed Kamel, for their support, encouragement, guidance, and most importantly trust. Prof. Karray's trust and support were instrumental in giving me the confidence to achieve many accomplishments. I would like to thank Prof. Karray for his encouragement and guidance throughout my research. Prof. Kamel's input and guidance were invaluable to the quality and contribution of the work presented in this thesis, as well as in other publications. I would like to thank Prof. Kamel for his advice, valuable insights and feedback throughout my research. Without them this research would not have been possible.

I would also like to thank many faculty members of the University of Waterloo, most notably my committee members, Prof. Chrysanne DiMarco, Prof. Krzysztof Czarnecki and Prof. Kostas Kontogiannis, for their valuable input and suggestions. I wish to thank many of my colleagues at the Pattern Analysis and Machine Intelligence (PAMI) Lab, especially Khaled Hammouda, Mohamed El-Abd, Masoud Makrehchi, Moataz El Ayadi, Hanan Ayad and Rasha Kashef, for valuable discussions and insights. I would like to thank the PAMI administrative secretaries, Margaret Ulbrick, Heidi Campbell, and Rosalind Klein, for their great help during my program. I am also thankful to the Electrical and Computer Engineering Department graduate studies coordinator, Wendy Boles, and the Electrical and Computer Engineering administrative staff, Annette Dietrich and Lisa Hendel, for their help and support during my graduate studies.
I would like to thank the Natural Sciences and Engineering Research Council (NSERC) of Canada, the Ontario Graduate Scholarship (OGS) program, the Faculty of Engineering and the Graduate Studies Office at the University of Waterloo, and the Learning Objects Repository Network (LORNET) for their financial support. I also would like to thank Hazem Shehata for being a great friend and for his constant support and help during my studies. I would like to thank Muhammad Nummer, Hatem El-Behairy, Ayman Ismail, Hassan Hassan, Ahmed Youssef, Ismael El-Samahy, Mohamed El-Dery, Mohamed Hassan for being such great friends. I would like to thank my wife, Reem, for her unconditional support and love, without which many things would not be possible. I would like also to thank my mother Nabila and my father Hassan for their support and encouragement throughout my life.

Contents

List of Tables
List of Figures
List of Algorithms

1 Introduction
  1.1 Preface
  1.2 Motivations
  1.3 Contributions
  1.4 Organization of the Thesis

2 Background and Literature Review
  2.1 Text Mining
    2.1.1 Text Representation Model
    2.1.2 Text Similarity Measure
  2.2 Text Clustering
    2.2.1 Clustering Techniques
    2.2.2 The Vector Space Model and Document Clustering
    2.2.3 Evaluation of Cluster Quality For Clustering
  2.3 Text Categorization
    2.3.1 Text Categorization Techniques
    2.3.2 Evaluation of Text Categorization
  2.4 Information Retrieval
    2.4.1 Document Index
    2.4.2 Search Engines
    2.4.3 Lucene Search Engine
    2.4.4 Terrier Information Retrieval Platform
    2.4.5 Evaluation of Text Retrieval
    2.4.6 Information Retrieval and Natural Language
  2.5 Natural Language Processing and Understanding
    2.5.1 The Study of Language
    2.5.2 Different Levels of Language Analysis
    2.5.3 Thematic Roles
    2.5.4 The Role Labeling Task
    2.5.5 Applications of Natural Language Understanding
    2.5.6 Evaluating Language Understanding Systems
    2.5.7 Representing the Understanding
  2.6 Knowledge Representation
    2.6.1 Ontology
    2.6.2 Conceptual Graphs
    2.6.3 Forms of Knowledge Interchange
  2.7 Summary

3 Concept-based Model
    3.0.1 Overview
  3.1 Concept-based Statistical Analyzer
    3.1.1 Sentence-Based Concept Analysis
    3.1.2 Document-Based Concept Analysis
    3.1.3 Corpus-based Concept Analysis
  3.2 Illustrative Example
    3.2.1 Example: Role Labeler
    3.2.2 Example: Stemmer
    3.2.3 Example: Concept-based Statistical Analyzer
  3.3 Conceptual Ontological Graph (COG)
    3.3.1 Relations in the Conceptual Ontological Graph
    3.3.2 Conceptual Ontological Graph Construction
    3.3.3 Concept-based Weighting Scheme
    3.3.4 Example: Conceptual Ontological Graph Representation
  3.4 Concept Extractor
    3.4.1 Example: Concept Extractor
  3.5 A Concept-Based Similarity Measure
    3.5.1 Cosine Measure
  3.6 Mathematical Framework
    3.6.1 Notation
    3.6.2 Concept-based Sensitivity (Discrimination Ability)
    3.6.3 Concept-based Entropy
  3.7 Summary

4 Experimental Results
  4.1 Overview
  4.2 Text Clustering
  4.3 Text Categorization
  4.4 Text Retrieval
  4.5 Discussion
    4.5.1 Text Clustering Results
    4.5.2 Text Categorization Results
    4.5.3 Text Retrieval Results
    4.5.4 Observations and Conclusions
  4.6 Summary

5 Conclusions and Future Research
  5.1 Conclusions
  5.2 Limitation of the Approach
  5.3 Future Work
  5.4 List of Publications

A Appendix
  A.1 Text Clustering Results
  A.2 Text Categorization Results
  A.3 Text Retrieval Results

Bibliography