Identifying Similarity in Text: Multi-Lingual Analysis for Summarization

David Kirk Evans

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate School of Arts and Sciences

COLUMBIA UNIVERSITY
2005

© 2005 David Kirk Evans
All Rights Reserved

ABSTRACT

Early work in the computational treatment of natural language focused on summarization and machine translation. In my research I have concentrated on the area of summarization of documents in different languages. This thesis presents my work on multi-lingual text similarity. This work enables the identification of short units of text (usually sentences) that contain similar information even though they are written in different languages. I present my work on SimFinderML, a framework for multi-lingual text similarity computation that makes it easy to experiment with parameters for similarity computation and add support for other languages. An in-depth examination and evaluation of the system is performed using Arabic and English data. I also apply the concept of multi-lingual text similarity to summarization in two different systems. The first improves readability of English summaries of Arabic text by replacing machine translated Arabic sentences with highly similar English sentences when possible. The second is a novel summarization system that supports comparative analysis of Arabic and English documents in two ways. First, given Arabic and English documents that describe the same event, SimFinderML clusters sentences to present information that is supported by both the Arabic and English documents. Second, the system provides an analysis of how the Arabic and English documents differ by presenting information that is supported exclusively by documents in only one language.
This novel form of summarization is a first step toward analyzing the difference in perspectives from news reported in different languages.

Contents

Acknowledgments

1 Introduction
  1.1 Goals
  1.2 Approaches to text similarity
  1.3 Similarity-based approaches to Multi-Document Summarization
    1.3.1 Highlighting Similarities and Differences between Foreign and Source Language Data
  1.4 Contributions

2 Similarity in English Texts: Simfinder
  2.1 Related work in English text similarity
    2.1.1 Information Retrieval
    2.1.2 Clustering Techniques
      2.1.2.1 Similarity measures - using term overlap
      2.1.2.2 Clustering methods
  2.2 English Simfinder
    2.2.1 Similarity measure - Combining Linguistics and Machine Learning
      2.2.1.1 Identifying and Relating Noun Phrases: LinkIT
      2.2.1.2 Other features
      2.2.1.3 Learning Method and Results
    2.2.2 Clustering Algorithm Tailored for Summarization
  2.3 A Flexible Framework for Simfinder

3 Similarity in Multi-Lingual Texts: SimFinderML
  3.1 Motivation
  3.2 Related work in Multi-lingual text similarity
    3.2.1 Example based machine translation
    3.2.2 Cross-Lingual Information Retrieval
    3.2.3 Statistical machine translation
    3.2.4 Sentence alignment cost functions
    3.2.5 Bilingual Phrase Translation
    3.2.6 Proper noun phrase transliteration
  3.3 SimFinderML Architecture
    3.3.1 Pre-processing
    3.3.2 Primitive Extraction
    3.3.3 Primitive Linking
    3.3.4 Similarity Computation
    3.3.5 Merging Feature Similarity Values
      3.3.5.1 Challenges for Multi-Lingual Feature Merging
    3.3.6 Clustering
  3.4 Creating an Arabic-English version of SimFinderML
    3.4.1 Arabic-language features
    3.4.2 Arabic to English Translation Facilities
      3.4.2.1 Word feature matching
      3.4.2.2 Using a probabilistic dictionary
      3.4.2.3 Named entity feature matching
    3.4.3 Learning a probabilistic Arabic-English dictionary
    3.4.4 Feature Merging Model Training Data
    3.4.5 Training Results
  3.5 Porting to other languages
    3.5.1 Extracting article text from web pages
    3.5.2 Using simple document translation for multilingual clustering
    3.5.3 Multilingual Clustering Evaluation
    3.5.4 Japanese Performance
  3.6 SimFinderML Conclusion

4 Finding Similar Arabic-English Sentences
  4.1 Sentence level evaluation
    4.1.1 Finding Similar English Sentences
      4.1.1.1 Chunking Machine Translated Arabic Text
    4.1.2 Sentence level evaluation results
      4.1.2.1 Full Arabic Sentences
      4.1.2.2 Chunked Arabic Sentences
    4.1.3 Error analysis
      4.1.3.1 Full Arabic sentences with chunked sentence similarity
  4.2 Clustering evaluation
    4.2.1 Aligning the DUC 2004 Corpus
    4.2.2 Using the Aligned Corpus for Evaluation
    4.2.3 Evaluation Overview
    4.2.4 Full machine translated and English Simfinder
    4.2.5 SimFinderML Token feature
      4.2.5.1 Arabic-English with word-level translation feature
      4.2.5.2 Results using token feature
    4.2.6 Token and Named Entity features with SimFinderML
      4.2.6.1 The Named Entity Feature in Arabic-English SimFinderML
      4.2.6.2 Results using Named Entity Feature
  4.3 Conclusions

5 Similarity-based Summarization
  5.1 Related work in multi-lingual, multi-document summarization
  5.2 DUC 2004 Arabic Corpus
  5.3 Summarizing Machine Translated text with Relevant English Text
    5.3.1 Summarization Approach
      5.3.1.1 Sentence Simplification
      5.3.1.2 Similarity Computation
      5.3.1.3 System Implementation
    5.3.2 Evaluation
      5.3.2.1 Summary level evaluation
      5.3.2.2 Summary level evaluation results
  5.4 Summarization that indicates similarities and differences in content
    5.4.1 System Architecture
      5.4.1.1 Sentence Simplification to Improve Clustering
      5.4.1.2 Text Similarity Computation
      5.4.1.3 Sentence clustering and pruning
      5.4.1.4 Identifying cluster languages
      5.4.1.5 Ranking clusters
      5.4.1.6 Sentence selection
      5.4.1.7 Summary generation
    5.4.2 Evaluation
      5.4.2.1 SCU Annotation
      5.4.2.2 Characterizing Arabic and English content by SCUs
      5.4.2.3 Evaluating language partitions with SCUs
      5.4.2.4 Importance evaluation
    5.4.3 Results
      5.4.3.1 Per-language Partition Evaluation
      5.4.3.2 Evaluating importance
      5.4.3.3 Example output
    5.4.4 Conclusions

6 Conclusions
  6.1 Contributions
    6.1.1 Linguistically motivated primitives
    6.1.2 A flexible framework for experimenting with multi-lingual text similarity
    6.1.3 Multi-lingual text similarity for resource-poor languages
    6.1.4 CAPS: Summarization that identifies similarities and differences across languages
  6.2 Limitations
    6.2.1 Experimentation with more Arabic primitives
    6.2.2 Better translation for named entities
    6.2.3 Per-language feature sets and merging models
      6.2.3.1 Combining Arabic and English training data
  6.3 Future Work
    6.3.1 Further integration of statistical machine translation methods
    6.3.2 Noun Phrase Variant Identification
      6.3.2.1 Related Work on Noun Phrase Variation
    6.3.3 Sense disambiguation

A Detailed SCU Annotation Instructions

Bibliography

List of Figures

2.1 Comparison of IR to Multiple Document Similarity
2.2 Two similar paragraphs; the primitive features indicating similarity that are captured by Simfinder are highlighted in bold.
2.3 A composite feature over word primitives, with the restriction that one primitive must be a noun and one must be a verb.
2.4 A pair of paragraphs that contain a composite match; a word match and a WordNet match (highlighted in bold) occur within a window of five words, excluding stopwords.
3.1 A CLIR query matching to one document from a collection of eight documents.
3.2 SimFinderML Architecture.
3.3 The primitive translation process
3.4 The primitive matching process using translations from a probabilistic learned dictionary
3.5 Examples of similar Arabic-English sentences found by SimFinderML. Machine translations of the Arabic sentences are provided for the reader, but only the Arabic and English is used for the actual matching.
4.1 Judgment scale used for sentence-level evaluation one and two.
4.2 Percentage of good sentence replacements using copy-split chunking technique and simple-split chunking technique.

