Automatic Multi-Word Term Extraction and Its Application to Web-Page Summarization
Total Page:16
File Type:pdf, Size:1020Kb
Automatic Multi-word Term Extraction and its Application to Web-page Summarization by Weiwei Huo A Thesis presented to The University of Guelph In partial fulfilment of requirements for the degree of Master of Science in Computer Science Guelph, Ontario, Canada © Weiwei Huo, December, 2012 ABSTRACT Automatic Multi-word Term Extraction and its Application to Web-page Summarization Weiwei Huo Advisor: University of Guelph, 2012 Prof. Fei Song In this thesis we propose three new word association measures for multi-word term extraction. We combine these association measures with LocalMaxs algorithm in our extraction model and compare the results of different multi-word term extraction methods. Our approach is language and domain independent and requires no training data. It can be applied to such tasks as text summarization, information retrieval, and document classification. We further explore the potential of using multi-word terms as an effective representation for general web-page summarization. We extract multi-word terms from human written summaries in a large collection of web-pages, and generate the summaries by aligning document words with these multi-word terms. Our system applies machine translation technology to learn the aligning process from a training set and focuses on selecting high quality multi-word terms from human written summaries to generate suitable results for web-page summarization. iii Acknowledgments I am grateful to Prof. Fei Song for his continued support, encouragement, and insights. And I would like to thank Prof. Mark Wineberg for his precious advice on my writing and thesis work. I would also like to thank my schoolmate Song Lin for his help and encouragement. Special thanks to Prof. Yang Xiang and Prof. Fangju Wang to be on my thesis examination committee and for their helpful feedback. iv Tables of Contents Abstract Acknowledgments Tables of Contents Chapter 1.Introduction ............................................................................................ - 1 - 1.1 Applications .................................................................................... - 2 - 1.2 Motivation ....................................................................................... - 5 - 1.3 Major Contributions ........................................................................ - 7 - 1.4 Overview ......................................................................................... - 8 - 2.Multi-word Term Extraction and Text Summarization ......................... - 9 - 2.1 Multi-word Terms ................................................................................ - 9 - 2.1.1 What is a Multi-word Term? ........................................................... - 9 - 2.1.2 Motivation for Extracting MWTs Automatically ............................ - 10 - 2.2 Automatic Multi-word Term Extraction .............................................. - 11 - 2.2.1 What MWTs to Extract? ................................................................ - 11 - 2.2.2 Challenges .................................................................................. - 12 - 2.3 Related Work on Multi-words Term Extraction ................................... - 14 - 2.3.1 Linguistic Approaches ................................................................. - 15 - 2.3.2 Statistical Approaches ................................................................. - 17 - 2.3.3 Hybrid Approaches ..................................................................... - 22 - 2.3.4 Graph-Based Approaches ........................................................... - 23 - 2.4 Web-Page Summarization .................................................................. - 24 - 2.4.1 Challenges for Web-Page Summarization .................................... - 26 - 2.4.2 Automatic Text Summarization .................................................... - 27 - 2.4.3 Related Work for Web-Page Summarization ................................ - 31 - 3.Multi-word Term Extraction Methods ................................................. - 35 - 3.1 Statistical Association Measures ........................................................ - 35 - 3.2 LocalMaxs Algorithm ........................................................................ - 37 - 3.3 Smoothed Probabilities of N-Grams .................................................. - 42 - 3.4 Normalized Sequence Probabilities .................................................. - 45 - 3.5 Post-Processing Filters ...................................................................... - 45 - 3.6 Data Preparation ................................................................................ - 46 - 3.7 Multi-word Term Extraction System ................................................... - 47 - 4.Application to Web-page Summarization ............................................ - 50 - v 4.1 System Overview .............................................................................. - 51 - 4.2 Content Selection .............................................................................. - 52 - 4.3 Translation and Ordering .................................................................. - 55 - 4.4 Building the Models for Summarization ............................................. - 58 - 4.4 Summary Generation ........................................................................ - 64 - 5.Experimental Results and Discussions .............................................. - 66 - 5.1 Data Sets............................................................................................ - 66 - 5.2 Evaluation Methods ........................................................................... - 69 - 5.2.1 Measures for Multi-word Term Extraction .................................... - 69 - 5.2.2 Measures for Web-page Summarization ...................................... - 71 - 5. 3 Results on Multi-word Term Extraction ............................................. - 76 - 5.3.1 Experiments and Results ............................................................. - 76 - 5.3.2 Discussion ................................................................................... - 80 - 5.4 Results on Web-page Summarization................................................. - 81 - 5.4.1 Experiments and Results ............................................................. - 82 - 5.4.2 Discussion ................................................................................... - 85 - 6.Conclusions and Future Work .............................................................. - 87 - 6.1 Summary of Contributions ................................................................ - 87 - 6.2 Future Work for Multi-word Term Extraction ...................................... - 88 - 6.3 Future Work for Web-page Summarization ........................................ - 89 - References Cited ...................................................................................... - 91 - Appendix .................................................................................................. - 97 - The Selecting proceeding for Multi-word term ....................................... - 97 - vi List of Figures Figure 2. 1: Syntagmatic Relationships between Words .............................. - 17 - Figure 3. 1: Ranks of Tokens According to Frequencies .............................. - 37 - Figure 3. 2: Association Measures Fluctuating under an Ideal Situation ..... - 40 - Figure 3. 3: Association Measures Fluctuating under the Actual Situation .. - 41 - Figure 4. 1: Alignment for an Example Translation ..................................... - 58 - Figure 4. 2: One-to-one Alignments between Words and MWTs ................ - 59 - Figure 5. 1: Sample Web-page from Open Directory Project ...................... - 67 - vii List of Tables Table 5. 1: Performance of Extraction Models before filtering ................... - 78 - Table 5. 2: Performance of Extraction Models after filtering ....................... - 78 - Table 5. 3: Length Distributions of the N-grams Extracted .......................... - 79 - Table 5. 4: Recall Performance of Three Different Systems ......................... - 83 - Table 5. 5: Performance of Web-page Summarization Systems ................... - 85 - Chapter 1 Introduction A multi-word term (MWT), or simply a term for short in this thesis, is an expression consisting of more than one word with a grammatical structure and a specific meaning such as nouns phrases (e.g., swimming pool, Natural Language Processing), fixed collocations or idioms (e.g., blue moon, apple and orange), compound verbs (e.g., take into account), prepositional phrases (e.g., on the contrary), compound determiners (e.g., a piece of), and many others. MWTs can intrinsically identify what a particular piece of writing is about, capturing the primary entities or key concepts for it. For example, given the sentence: “Natural Language Processing (NLP) is a field of computer science concerned with the understanding and generation of human languages .”, the MWT “Natural Language Processing” corresponds to the major entity and other MWTs such as “computer science” and “human languages” are used as related concepts. Therefore, recognizing MWTs is both advantageous and important to text representation and understanding. Many tasks in Natural Language Processing require techniques to compute MWTs such as Information Retrieval (IR) (Witschel, 2005), Text Summarization (Dunning, 1993; Silva et al . 1999), Document Classification (Monta et al . 2005; Hovy