Phrase-based Compressive Cross-Language Summarization

Jin-ge Yao, Xiaojun Wan, Jianguo Xiao
Institute of Computer Science and Technology, Peking University, Beijing 100871, China
Key Laboratory of Computational Linguistics (Peking University), MOE, China
{yaojinge, wanxiaojun, xiaojianguo}@pku.edu.cn

Abstract

The task of cross-language document summarization is to create a summary in a target language from documents in a different source language. Previous methods only involve direct extraction of automatically translated sentences from the original documents. Inspired by phrase-based machine translation, we propose a phrase-based model to simultaneously perform sentence scoring, extraction and compression. We design a greedy algorithm to approximately optimize the score function. Experimental results show that our methods outperform state-of-the-art extractive systems while maintaining similar grammatical quality.

1 Introduction

The task of cross-language summarization is to produce a summary in a target language from documents written in a different source language. This task is particularly useful for readers who want to quickly get the main idea of documents written in a source language they are not familiar with. Following Wan (2011), we focus on English-to-Chinese summarization in this work.

The simplest and most straightforward way to perform cross-language summarization is pipelining general summarization and machine translation. Such systems either translate all the documents before running generic summarization algorithms on the translated documents, or summarize from the original documents and then only translate the produced summary into the target language. Wan (2011) shows that such pipelining approaches are inferior to methods that utilize information from both sides. In that work, the author proposes graph-based models and achieves a fair amount of improvement. However, to the best of our knowledge, no previous work on this task tries to go beyond pure sentence extraction.

On the other hand, cross-language summarization can be seen as a special kind of machine translation: translating the original documents into a brief summary in a different language. Inspired by phrase-based machine translation models (Koehn et al., 2003), we propose a phrase-based scoring scheme for cross-language summarization in this work.

Since our framework is based on phrases, we are not limited to producing extractive summaries. We can use the scoring scheme to perform joint sentence selection and compression. Unlike typical sentence compression methods, our proposed algorithm does not require additional syntactic preprocessing such as part-of-speech tagging or syntactic parsing. We only utilize information from translated texts with phrase alignments. The scoring function consists of a submodular term over compressed sentences and a bounded distortion penalty term. We design a greedy procedure to efficiently obtain approximate solutions.

For experimental evaluation, we use the DUC 2001 dataset with manually translated reference Chinese summaries. Results based on the ROUGE metrics show the effectiveness of our proposed methods. We also conduct manual evaluation, and the results suggest that the linguistic quality of the produced summaries does not decrease by much compared with extractive counterparts. In some cases, grammatical smoothness can even be improved by compression.

The contributions of this paper include:

• Utilizing the phrase alignment information, we design a scoring scheme for the cross-language document summarization task.

• We design an efficient greedy algorithm to generate summaries. The greedy algorithm is

Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 118–127, Lisbon, Portugal, 17–21 September 2015. © 2015 Association for Computational Linguistics.

partially submodular and has a provable constant approximation factor to the optimal solution up to a small constant.

• We achieve state-of-the-art results using the extractive counterpart of our compressive summarization framework. Performance in terms of ROUGE metrics can be significantly improved when simultaneously performing extraction and compression.

2 Background

Document summarization can be treated as a special kind of translation process: translating from a bunch of related source documents to a short target summary. This analogy also holds for cross-language document summarization, with the only difference that the languages of the source documents and the target summary are different.

Our design of a sentence scoring function for cross-language document summarization is inspired by phrase-based machine translation models. Here we briefly describe the general idea of phrase-based translation. One may refer to Koehn (2009) for a more detailed description.

2.1 Phrase-based Machine Translation

Phrase-based machine translation models currently give state-of-the-art translations for many pairs of languages and dominate modern statistical machine translation. Classical word-based IBM models cannot capture local contextual information and local reordering very well. Phrase-based translation models operate on lexical entries with more than one word on the source-language and target-language sides. The allowance of multi-word expressions is believed to be the main reason for the improvements that phrase-based models give. Note that these multi-word expressions, typically addressed as phrases in the machine translation literature, are essentially continuous n-grams and do not need to be linguistically integral, meaningful constituents.

Define y as a phrase-based derivation, or more precisely a finite sequence of phrases p_1, p_2, ..., p_L. For any derivation y we use e(y) to refer to the target-side translation text defined by y. This translation is derived by concatenating the strings e(p_1), e(p_2), ..., e(p_L). The scoring scheme for a phrase-based derivation y from the source sentence to the target sentence e(y) is:

f(y) = Σ_{k=1}^{L} g(p_k) + LM(e(y)) + η Σ_{k=1}^{L−1} |start(p_{k+1}) − 1 − end(p_k)|

where LM(·) is the target-side language model score, g(·) is the score function of phrases, and η < 0 is the distortion parameter for penalizing the distance between neighboring phrases in the derivation. Note that the phrases addressed here are typically continuous n-grams and need not be grammatical linguistic phrasal units. Later we will directly use phrases provided by modern machine translation systems.

Searching for the best translation under this score definition is difficult in general, so approximate decoding algorithms such as beam search should be applied. Meanwhile, several constraints should be satisfied during the decoding process. The most important one is to set a constant limit on the distortion term, |start(p_{k+1}) − 1 − end(p_k)| ≤ δ, to prohibit derivations with distant phrase translations.
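As a concrete illustration, the derivation score above can be sketched in a few lines of Python. The phrase scorer `g` and language model `lm` used in the example are invented stand-ins, not components of any actual decoder:

```python
# Illustrative sketch of the phrase-based derivation score
# f(y) = sum_k g(p_k) + LM(e(y)) + eta * sum_k |start(p_{k+1}) - 1 - end(p_k)|.
# A phrase is modeled as a (start, end, target_text) source span.

def derivation_score(phrases, g, lm, eta=-1.0):
    """Score a derivation given a phrase scorer g and a language model lm."""
    total = sum(g(p) for p in phrases)
    target = " ".join(p[2] for p in phrases)   # e(y): concatenated target side
    total += lm(target)
    # Distortion: penalize jumps between source spans of consecutive phrases.
    for prev, nxt in zip(phrases, phrases[1:]):
        total += eta * abs(nxt[0] - 1 - prev[1])
    return total

def within_distortion_limit(phrases, delta):
    """Decoding constraint: |start(p_{k+1}) - 1 - end(p_k)| <= delta for all k."""
    return all(abs(n[0] - 1 - p[1]) <= delta
               for p, n in zip(phrases, phrases[1:]))

# Toy monotone derivation over source positions 1..5 (invented data).
y = [(1, 2, "the storm"), (3, 3, "hit"), (4, 5, "the coast")]
g = lambda p: 1.0                            # uniform phrase score (assumption)
lm = lambda text: -0.1 * len(text.split())   # crude length-based LM proxy
print(round(derivation_score(y, g, lm), 2))  # prints 2.5
print(within_distortion_limit(y, delta=0))   # prints True
```

With a monotone derivation the distortion term vanishes, so the score is just the phrase scores plus the language-model term; any reordering jump would be charged η per skipped source position.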

3 Phrase-based Cross-Language Summarization

Inspired by the general idea of phrase-based machine translation, we describe our proposed phrase-based model for cross-language summarization in this section.

3.1 Phrase-based Sentence Scoring

In the context of cross-language summarization, we assume that we also have phrases in both the source and target languages, along with phrase alignments between the two sides. For summarization purposes, we may wish to select sentences containing more important phrases. It is then plausible to measure the scores of these aligned phrases via importance weighing.

Inspired by phrase-based translation models, we can assign phrase-based scores to sentences from the translated documents for summarization purposes. We define our scoring function for each sentence s as:

F(s) = Σ_{p∈s} d_0 g(p) + bg(s) + η dist(y(s))

Here in the first term, g(·) is the score of phrase p, which can be simply set to document frequency. The phrase score is penalized with a constant damping factor d_0 to decay scores for repeated phrases. The second term bg(s) is the bigram score of sentence s. It is used here to simulate the effect of language models in phrase-based translation models. Denoting y(s) as the phrase-based derivation (as mentioned in the previous section) of sentence s, the last distortion term dist(y(s)) = Σ_{k=1}^{L−1} |start(p_{k+1}) − 1 − end(p_k)| is exactly the same as the distortion penalty term in phrase-based translation models. This term can be used as a reflection of the complexity of the translation. All the above terms can be derived from bilingual sentence pairs with phrase alignments. Meanwhile, we may also wish to exclude unimportant phrases and badly translated phrases. Our definition can thus also be used to guide sentence compression by trying to remove redundant phrases.

Based on the definition over sentences, we define our scoring measure over a summary S:

F(S) = Σ_{p∈S} Σ_{i=1}^{count(p,S)} d^{i−1} g(p) + Σ_{s∈S} bg(s) + η Σ_{s∈S} dist(y(s))

where d is a predefined constant damping factor to penalize repeated occurrences of the same phrases, and count(p, S) is the number of occurrences of phrase p in the summary S. All other terms are inherited from the sentence score definition.

In the next section we describe our framework to efficiently utilize this scoring function for cross-language summarization.

3.2 A Greedy Algorithm for Compressed Sentence Selection

Utilizing the phrase-based score definition of sentences, we can use greedy algorithms to simultaneously perform sentence selection and sentence compression. Assume that we have a predefined budget B (e.g. the total number of Chinese characters allowed) to restrict the total length of a generated summary. We use C(S) to denote the cost of a summary S, measured by the total number of Chinese characters it contains. The greedy algorithm we use for our compressive summarization is listed in Algorithm 1.

Algorithm 1 A greedy algorithm for phrase-based summarization
1: S ← ∅
2: i ← 1
3: single_best ← argmax_{s∈U, C({s})≤B} F({s})
4: while U ≠ ∅ do
5:   s_i ← argmax_{s∈U} [F(S_{i−1} ∪ {s}) − F(S_{i−1})] / C({s})^r
6:   if C(S_{i−1} ∪ {s}) ≤ B then
7:     S_i ← S_{i−1} ∪ {s}
8:     i ← i + 1
9:   end if
10:  U ← U \ {s_i}
11: end while
12: return S* = argmax_{S∈{single_best, S_i}} F(S)

The space U denotes the set of all possible compressed sentences. In each iteration, the algorithm tries to find the compressed sentence with the maximum gain-cost ratio (Line 5, where we follow previous work and set r = 1) and merge it into the summary set at the current iteration (denoted as S_i). The target is to find the compression with the maximum gain-cost ratio; this will be discussed in the next section. Note that the algorithm is also naturally applicable to extractive summarization. For extractive summarization, Line 5 corresponds to direct calculation of sentence scores based on our proposed phrase-based function, and U then denotes all full sentences from the original translated documents.

The outline of this algorithm is very similar to the greedy algorithm used by Morita et al. (2013) for subtree extraction, except that in our context the increase of the cost function when adding a sentence is exactly the cost of that sentence.

When the distortion term is ignored (η = 0), the scoring function is clearly submodular¹ (Lin and Bilmes, 2010) in terms of the set of compressed sentences, since the score now only consists of functional gains of phrases along with bigrams of a compressed sentence. Morita et al. (2013) have proved that when r = 1, this greedy algorithm achieves a constant approximation factor (1/2)(1 − e^{−1}) to the optimal solution. Note that this only gives us a worst-case guarantee; what we can achieve in practice is usually far better.

¹A set function F : 2^U → R defined over subsets of a universe set U is said to be submodular iff it satisfies the diminishing returns property: ∀ S ⊆ T ⊆ U \ {u}, we have F(S ∪ {u}) − F(S) ≥ F(T ∪ {u}) − F(T).
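A minimal sketch of the greedy selection in Algorithm 1 follows. The score F and cost C here are toy stand-ins chosen for illustration: F is a coverage-style (submodular) count of distinct phrases, and C counts the phrases of a sentence in place of Chinese characters:

```python
# Sketch of Algorithm 1: greedy selection by gain-cost ratio under budget B,
# with the single-best fallback of line 3. Not the paper's implementation;
# F and C are invented toy functions.

def greedy_summary(candidates, F, C, B, r=1.0):
    """candidates: list of sentences; F: score of a list; C: cost of one."""
    U, S = list(candidates), []
    feasible = [s for s in U if C(s) <= B]
    single_best = max(feasible, key=lambda s: F([s]), default=None)
    while U:
        # Candidate with maximum gain-cost ratio (line 5 of Algorithm 1).
        s = max(U, key=lambda x: (F(S + [x]) - F(S)) / (C(x) ** r))
        if sum(C(t) for t in S) + C(s) <= B:
            S.append(s)
        U.remove(s)
    if single_best is not None and F([single_best]) > F(S):
        return [single_best]
    return S

# Toy instance: sentences as sets of phrase ids (invented data).
sents = [frozenset("ab"), frozenset("bc"), frozenset("d"), frozenset("abcde")]
F = lambda S: len(set().union(*S)) if S else 0   # distinct phrases covered
C = len                                          # cost = number of phrases
summary = greedy_summary(sents, F, C, B=4)
print(sorted(len(s) for s in summary))           # prints [1, 2]
```

Note how the long, high-coverage sentence is rejected for exceeding the budget, while two short sentences together achieve coverage 3 within cost 3; this is the gain-cost-ratio behaviour the approximation guarantee is about.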

On the other hand, setting η < 0 does not affect the performance guarantee too much. Intuitively, this is because in most phrase-based translation models a distortion limit constraint |start(p_{k+1}) − 1 − end(p_k)| ≤ δ is applied on the distortion terms, while performing sentence compression can never increase distortion. The main conclusion is formulated as:

Theorem 1. If Algorithm 1 outputs S^{greedy} while the optimal solution is OPT, we have

F(S^{greedy}) ≥ (1/2)(1 − e^{−1}) F(OPT) + (1/2) ηγ.

Here γ > 0 is a constant controlled by the distortion difference between sentences, which is relatively small in practice compared with phrase scores, and η < 0 is the distortion parameter. Note that when η is set to 0, the scoring function is submodular and we recover the (1/2)(1 − e^{−1}) approximation factor as studied by Morita et al. (2013). We leave the proof of Theorem 1 to the supplementary materials due to the space limit. The submodular part of the score plays an important role in the proof.

3.3 Finding the Maximum Density Compression

In Algorithm 1, the most important part is the greedy selection process (Line 5). The greedy selection criterion here is to maximize the gain-cost ratio. For compressive summarization, we try to compress each unselected sentence s into s̃, aiming to maximize the gain-cost ratio [F(S_{i−1} ∪ {s̃}) − F(S_{i−1})] / s̃.cost, where the gain corresponds to

F(S_{i−1} ∪ {s}) − F(S_{i−1}) = Σ_{p∈s} Σ_{i=1}^{count(p,S)} d^{i−1} g(p) + bg(s) + η dist(s),

and we then add the compressed sentence s̃ with the maximum gain-cost ratio to the summary. We refer to the compression process for each sentence as finding the maximum density compression. The whole framework forms a joint selection and compression process.

In our phrase-based scoring for sentences, although there is no apparent optimal substructure available for exact dynamic programming, due to the nonlocal distortion penalty, we can still have a tractable approximate procedure, since the search space is only defined by local decisions on whether a phrase should be kept or dropped.

Our compression process for each sentence s is displayed in Algorithm 2. It gradually expands the set of phrases to be kept in the final compression, starting from the initial set of large-density phrases (Line 4, assuming that phrases with large scores and small costs should always be kept); in this way we can recover the compression with maximum density. The function dist(·,·) is the unit distortion penalty defined as dist(a, b) = |start(b) − 1 − end(a)|. We define p.score to be the sum of damped phrase scores for phrase p, i.e. p.score = Σ_{i=1}^{count(p,S_{i−1})} d^{i−1} g(p), when the current partial summary is S_{i−1}. Therefore, during each iteration of the greedy selection process, the compression procedure is also affected by sentences that have already been included. Define p.cost as the number of characters p contains.

Algorithm 2 A growing algorithm for finding the maximum density compressed sentence
1: function GET_MAX_DENSITY_COMPRESSION(s, S_{i−1})
2:   queue Q ← ∅, kept ← ∅
3:   for each phrase p in s.phrases do
4:     if p.score / p.cost > 1 then
5:       kept ← kept ∪ {p}
6:       Q.enqueue(p)
7:     end if
8:   end for
9:   while Q ≠ ∅ do
10:    p ← Q.dequeue()
11:    ppv ← p.previous_phrase, pnx ← p.next_phrase
12:    if (ppv.score + bg(ppv, p) + η dist(ppv, p)) / (ppv.cost + p.cost) > 1 then
13:      Q.enqueue(ppv), kept ← kept ∪ {ppv}
14:    end if
15:    if (pnx.score + bg(pnx, p) + η dist(p, pnx)) / (p.cost + pnx.cost) > 1 then
16:      Q.enqueue(pnx), kept ← kept ∪ {pnx}
17:    end if
18:  end while
19:  return s̃ = kept, with ratio = gain / s̃.cost
20: end function

Empirically we find that this procedure gives almost the same results as exhaustive search while maintaining efficiency. Assuming that sentence length is no more than L, the asymptotic complexity of Algorithm 2 is O(L), since the algorithm requires two passes over all phrases. Therefore the whole framework requires O(kNL) time to generate a summary with k sentences for a document cluster containing N sentences in total.

In the final compressed sentence we simply leave the selected phrases in place as they are, relying on bigram scores to ensure local smoothness. The task is, after all, a summarization task, where bigram scores play the role of not only controlling grammaticality but also keeping the main information of the original documents.
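The growing procedure of Algorithm 2 can be sketched as follows. For simplicity the phrases are assumed contiguous in the source (so the distortion term is zero), and the scores, costs and bigram bonuses are invented toy values rather than quantities computed by the paper's system:

```python
from collections import deque

# Sketch of Algorithm 2: seed the kept set with "dense" phrases
# (score/cost > 1), then expand to a neighbour whenever the pair's
# joint density stays above 1.

def max_density_compression(phrases, score, cost, bg):
    """phrases: ordered phrase ids of one sentence; bg: bigram bonus per pair."""
    kept, queue = set(), deque()
    for p in phrases:                          # lines 3-8: seed phrases
        if score[p] / cost[p] > 1:
            kept.add(p)
            queue.append(p)
    while queue:                               # lines 9-18: grow outwards
        p = queue.popleft()
        i = phrases.index(p)
        for j in (i - 1, i + 1):               # previous / next phrase
            if 0 <= j < len(phrases) and phrases[j] not in kept:
                q = phrases[j]
                pair = (q, p) if j < i else (p, q)
                if (score[q] + bg[pair]) / (cost[q] + cost[p]) > 1:
                    kept.add(q)
                    queue.append(q)
    return [p for p in phrases if p in kept]   # compression keeps source order

# Invented toy sentence of four phrases.
phrases = ["p1", "p2", "p3", "p4"]
score = {"p1": 3.0, "p2": 0.5, "p3": 1.0, "p4": 0.2}
cost = {"p1": 2, "p2": 1, "p3": 2, "p4": 3}
bg = {("p1", "p2"): 3.0, ("p2", "p3"): 1.0, ("p3", "p4"): 0.0}
print(max_density_compression(phrases, score, cost, bg))  # prints ['p1', 'p2']
```

Here p1 is the only seed; p2 is pulled in because its strong bigram bonus with p1 keeps the pair's density above 1, while p3 and p4 are dropped, mirroring how weakly scored, badly aligned phrases are compressed away.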

Later we will see that this compression process does not hurt the grammatical fluency of translated sentences in general. In many cases it may even improve fluency by deleting redundant parentheses or removing incorrectly reordered (unimportant) phrases.

4 Experiments

4.1 Data

Currently there are not many available datasets for our particular setting of the cross-language summarization task. Hence we only evaluate our method on the same dataset used by Wan (2011). The dataset is created by manually translating into Chinese the reference summaries of the original DUC 2001 dataset in English. We will refer to this dataset as the DUC 2001 dataset in this paper. There are 30 English document sets in the DUC 2001 dataset for multi-document summarization. Each set contains several documents related to the same topic. Three generic reference English summaries are provided by NIST annotators for each document set. All these English summaries have been translated to Chinese by native Chinese annotators.

All the English sentences in the original documents have been automatically translated into Chinese using Google Translate. We also collect the phrase alignment information from the responses of Google Translate (stored in JSON format) along with the translated texts. We use the Stanford Chinese Word Segmenter² for Chinese word segmentation.

We experiment with two kinds of summary budgets for comparative study. The first limits the summary length to no more than five sentences. The second limits the total number of Chinese characters of each produced summary to no more than 300. They are addressed as Sentence Budgeting and Character Budgeting in the experimental results, respectively.

4.2 Evaluation

We report the performance of our compressive solution, denoted as PBCS (for Phrase-Based Compressive Summarization), in comparison with the following systems:

• PBES: The acronym comes from Phrase-Based Extractive Summarization. It is the extractive counterpart of our solution, without calling Algorithm 2.

• Baseline (EN): This baseline relies on merely the English-side information for English sentence ranking in the original documents. The scoring function is designed to be document frequencies of English bigrams, which is similar to the second term in our proposed sentence scoring function in Section 3.1 and is submodular.³ The extracted English summary is finally automatically translated into the corresponding Chinese summary. This is also known as the summary translation scheme.

• Baseline (CN): This baseline relies on merely the Chinese-side information for Chinese sentence ranking. The scoring function is similarly defined by document frequencies of Chinese bigrams. The Chinese summary sentences are then directly extracted from the translated Chinese documents. This is also known as the document translation scheme.

• CoRank: We reimplement the graph-based CoRank algorithm, which gives the state-of-the-art performance on the same DUC 2001 dataset, for comparison.

• Baseline (ENcomp): This is a compressive baseline in which the extracted English sentences of Baseline (EN) are compressed before being translated to Chinese. The compression process follows an integer linear program as described by Clarke and Lapata (2008). This baseline gives strong performance, as we have found on the English DUC 2001 dataset as well as other monolingual datasets.

The parameters in the algorithms are simply set to r = 1, d = 0.5, η = −0.5.

Similar to traditional summarization tasks, we use the ROUGE metrics for automatic evaluation of all systems in comparison. The ROUGE metrics measure summary quality by counting overlapping word units (e.g. n-grams) between the candidate summary and the reference summary. Following previous work on the same task, we report the following ROUGE F-measure scores: ROUGE-1 (unigrams), ROUGE-2 (bigrams), ROUGE-W (weighted longest common subsequence; weight = 1.2), ROUGE-L (longest common subsequences), and ROUGE-SU4 (skip bigrams with a maximum distance of 4). We investigate two kinds of ROUGE metrics for Chinese: ROUGE metrics based on words (after Chinese word segmentation) and ROUGE metrics based on single Chinese characters. The latter metrics do not suffer from the problem of word segmentation inconsistency.

To compare our method with extractive baselines in terms of information loss and grammatical quality, we also ask three native Chinese students as annotators to carry out manual evaluation. The aspects considered during evaluation include Grammaticality (GR), Non-Redundancy (NR), Referential Clarity (RC), Topical Focus (TF) and Structural Coherence (SC). Each aspect is rated with scores from 1 (poor) to 5 (good).⁴ This evaluation is performed on the same random sample of 10 document sets from the DUC 2001 dataset. One group of the gold-standard summaries is left out for evaluation of human-level performance. The other two groups are shown to the annotators, giving them a sense of the topics talked about in the document sets.

4.3 Results and Discussion

Table 1 and Table 2 display the ROUGE results for our proposed methods and the baseline methods, including both word-based and character-based evaluation. We also conduct pairwise t-tests and find that almost all the differences between PBCS and the other systems are statistically significant with p ≪ 0.01,⁵ except for the ROUGE-W metric.

We share the observations of previous work on the inferiority of using information from only one side, with Chinese-side information only being more beneficial than English-side only. The CoRank algorithm utilizes both sides of information together and achieves significantly better performance than Baseline (EN) and Baseline (CN). Our compressive system outperforms the CoRank algorithm⁶ in all metrics. Our system also outperforms the compressive pipelining system (Baseline (ENcomp)); note that the latter only considers information from the source language side, and sentence compression may sometimes cause worse translations compared with translating the full original sentence.

For manual evaluation, the average score and standard deviation for each metric are displayed in Table 3. Comparing compressive summarization with the extractive version, there are slight improvements in non-redundancy. This exactly matches what we can expect from sentence compression, which keeps only the important parts and drops redundancy. We also observe a certain amount of improvement in referential clarity. This may be a result of the deletion of phrases such as "he said". Most such phrases are semantically unimportant and are dropped during the process of finding the maximum density compression.

Despite not directly using syntactic information, our compressive summaries do not suffer much loss of grammaticality. This suggests that bigrams can be treated as good indicators of local grammatical smoothness. We reckon that sentences describing the same events may partially share descriptive bigram patterns; thus sentences selected by the algorithm will consist mostly of important patterns that appear repeatedly in the original document cluster. Only those words that are neither semantically important nor syntactically pivotal will be deleted.

Figure 1 lists the summary for the first document set (D04) in the DUC 2001 dataset produced by the proposed compressive system. The Chinese-side sentences have been split with spaces according to the phrase alignment results. Phrases that have been compressed are grayed out. We also include the original English sentences for reference, with deletions according to the word alignments from the Chinese sentences. We can observe that our compressive system tries to compress sentences by removing relatively unimportant phrases. The effect of translation errors (e.g. the word watch in storm watch has been incorrectly translated in the example) can also be reduced, since those incorrectly translated words will be dropped for having low information gains.

²http://nlp.stanford.edu/software/segmenter.shtml
³In our experiments this method gives performance similar to the graph-based pipelining baselines implemented in previous work.
⁴Fractional numbers are allowed for cases where the annotators feel uncertain.
⁵The significance level holds after Bonferroni adjustment for the purpose of multiple testing.
⁶There is an ignorable difference between the results of our reimplemented version of CoRank and those reported by Wan (2011). We believe this comes from different machine translation results output by Google Translate.
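For reference, a simplified character-based ROUGE-2-style F-measure can be sketched as below. The actual evaluation uses the standard ROUGE toolkit, so this toy version is only an illustrative stand-in for how character-bigram overlap is scored:

```python
from collections import Counter

# Character-bigram precision/recall/F, in the spirit of the character-based
# ROUGE-2 metric: count overlapping character bigrams between a candidate
# and a reference, ignoring whitespace.

def char_bigrams(text):
    chars = [c for c in text if not c.isspace()]
    return Counter(zip(chars, chars[1:]))

def rouge2_f(candidate, reference):
    cand, ref = char_bigrams(candidate), char_bigrams(reference)
    if not cand or not ref:
        return 0.0
    overlap = sum((cand & ref).values())   # clipped multiset intersection
    p = overlap / sum(cand.values())
    r = overlap / sum(ref.values())
    return 0.0 if overlap == 0 else 2 * p * r / (p + r)

print(round(rouge2_f("abcd", "abce"), 3))  # prints 0.667
```

Because the units are single characters, this variant sidesteps the word segmentation inconsistency mentioned above: two segmenters that split the same Chinese string differently still yield identical character bigrams.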

Sentence Budgeting          ROUGE-1  ROUGE-2  ROUGE-W  ROUGE-L  ROUGE-SU4
Baseline(EN)                0.23655  0.03550  0.05324  0.12559  0.06410
Baseline(CN)                0.23454  0.03858  0.05753  0.13120  0.06962
PBES                        0.25313  0.04073  0.06103  0.13583  0.06970
CoRank (reported)           N/A      0.04282  0.06158  0.14521  0.07805
CoRank (reimplemented)      0.24257  0.04115  0.06076  0.13717  0.07453
Baseline(ENcomp)            0.24879  0.04441  0.05865  0.13233  0.07543
PBCS                        0.26872  0.04815  0.06425  0.14607  0.08065

Character Budgeting         ROUGE-1  ROUGE-2  ROUGE-W  ROUGE-L  ROUGE-SU4
Baseline(EN)                0.21460  0.03494  0.05150  0.12343  0.06278
Baseline(CN)                0.21589  0.03732  0.05420  0.12867  0.06405
PBES                        0.22825  0.04037  0.05527  0.12856  0.06894
CoRank (reimplemented)      0.22593  0.04069  0.05887  0.12818  0.07241
Baseline(ENcomp)            0.23663  0.04245  0.06134  0.13070  0.07365
PBCS                        0.24917  0.04632  0.06252  0.13591  0.07953

Table 1: Results of word-based ROUGE evaluation

Sentence Budgeting          ROUGE-1  ROUGE-2  ROUGE-W  ROUGE-L  ROUGE-SU4
Baseline(EN)                0.34842  0.11823  0.05505  0.15665  0.12320
Baseline(CN)                0.34901  0.12015  0.05664  0.15942  0.12625
PBES                        0.36618  0.12281  0.05913  0.16018  0.11317
CoRank (reimplemented)      0.37601  0.12570  0.06088  0.17350  0.13352
Baseline(ENcomp)            0.36982  0.13001  0.06906  0.16233  0.13543
PBCS                        0.37890  0.13549  0.07102  0.17632  0.14098

Character Budgeting         ROUGE-1  ROUGE-2  ROUGE-W  ROUGE-L  ROUGE-SU4
Baseline(EN)                0.33602  0.10546  0.05263  0.15437  0.12161
Baseline(CN)                0.34075  0.12012  0.05678  0.15736  0.11981
PBES                        0.35483  0.11902  0.05642  0.15899  0.11205
CoRank (reimplemented)      0.36147  0.12305  0.05847  0.16962  0.13364
Baseline(ENcomp)            0.36654  0.12960  0.06503  0.15987  0.13421
PBCS                        0.37842  0.13441  0.07005  0.16928  0.13985

Table 2: Results of character-based ROUGE evaluation

System   GR         NR         RC         TF         SC
CoRank   3.00±0.75  3.35±0.57  3.55±0.82  3.90±0.79  3.55±0.74
PBES     2.90±0.89  3.25±0.70  3.50±0.87  3.96±0.80  3.45±0.50
PBCS     2.90±0.83  3.60±0.49  3.75±0.82  3.93±0.68  3.40±0.58
Human    4.60±0.49  4.15±0.73  4.35±0.73  4.93±0.25  3.90±0.94

Table 3: Manual evaluation results

In some cases, grammatical fluency can even be improved by sentence compression, as redundant parentheses may sometimes be removed. We leave the output summaries from all systems for the same document set to the supplementary materials.

In our experiments, we also study the influence of relevant parameter settings. Figure 2a depicts the variation of the ROUGE-2 F-measure when changing the damping factor d over the values {1, 2^−1, 3^−1, 4^−1, 5^−1}, with η = −0.5 fixed. We can see that within a proper range the value of d does not affect the result too much, while no damping or too much damping severely decreases the performance. Figure 2b shows the performance change under different settings of the distortion parameter η, taking values from {0, −0.2, −0.5, −1, −3}, while fixing d = 0.5. The results suggest that, for our purposes of summarization, the difference between considering the distortion penalty or not is obvious, while beyond a certain level the effect brought by different values of the distortion parameter becomes stable.

We also empirically study the effect of the approximation. The compressive summarization framework proposed in this paper can be trivially cast into an integer linear program (ILP), with the number of variables being too large to make the problem tractable.⁷

⁷By casting decisions on whether to select a certain phrase or bigram as binary variables, with additional linear constraints on phrase/bigram selection consistency, we get an ILP with essentially the same objective function and a linear budget constraint. This is conceptually equivalent to solving the original maximization problem with pruned brute-force enumeration, and is therefore exactly optimal but too costly.
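The exact-versus-greedy comparison can be illustrated on a small invented instance, using brute-force enumeration of feasible subsets in place of an ILP solver (which, as noted above, is conceptually equivalent for exact optimization):

```python
from itertools import combinations

# Toy check of approximation quality: exact enumeration vs. greedy
# gain-cost-ratio selection on a coverage-style (submodular) objective.
# The instance is invented purely for illustration.

def F(S):
    return len(set().union(*S)) if S else 0

def exact_best(sents, cost, B):
    """Brute force over all feasible subsets (the 'exact' solution)."""
    best_val = 0
    for k in range(len(sents) + 1):
        for combo in combinations(range(len(sents)), k):
            if sum(cost[i] for i in combo) <= B:
                best_val = max(best_val, F([sents[i] for i in combo]))
    return best_val

def greedy_best(sents, cost, B):
    """Greedy gain-cost-ratio selection under budget B (no fallback step)."""
    U, S, used = list(range(len(sents))), [], 0
    while U:
        i = max(U, key=lambda i: (F([sents[j] for j in S + [i]]) -
                                  F([sents[j] for j in S])) / cost[i])
        if used + cost[i] <= B:
            S.append(i)
            used += cost[i]
        U.remove(i)
    return F([sents[i] for i in S])

sents = [{"a", "b"}, {"c"}, {"b", "c", "d"}, {"e", "f"}]
cost = [2, 1, 3, 2]
print(exact_best(sents, cost, B=5), greedy_best(sents, cost, B=5))  # prints 5 5
```

On this instance the greedy objective matches the exact optimum, which is in line with the observation above that the approximation is close in practice, even though the worst-case guarantee is only (1/2)(1 − e^{−1}).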

124 凯特 女士 硬朗 , 紧急服务 在佛罗里达州 的 戴德 县, 承担了 风暴 的冲击 主 任 估计, 安德鲁 已经 造成 150亿 美元 到 200亿 美元 的损害 ( 75亿 英镑 , cially in the context of compressive summariza- 100亿 英镑 ) 。 Ms Kate Hale, director of emergency services in Florida's Dade County, which bore the brunt tion. Zajic et al. (2006) tries a pipeline strat- of the storm, estimated that Andrew had already caused Dollars 15bn to Dollars 20bn (Pounds 7.5bn-Pounds 10bn) of damage. egy with heuristics to generate multiple candidate

雨果飓风 , 袭击 东海岸 在 1989年9月 , 花费了 保险业 约 42亿 美元 。 compressions and extract from this compressed Hurricane Hugo, which hit the east coast in September 1989, cost the insurance industry sentences. Berg-Kirkpatrick et al. (2011) cre- about Dollars 4.2bn. ate linear models of weights learned by struc- 美国城市 沿 墨西哥湾的 阿拉巴马州 到得克萨斯州 东部 是 在 风暴 手表 昨晚 安 德鲁 飓风 向西 横跨 佛罗里达州南部 席卷 后 ,造成 至少 八人死亡 和严重的 财 tural SVMs for different components and tried 产损失 。 US CITIES along the Gulf of Mexico from Alabama to eastern Texas were on storm watch to jointly formulate sentence selection and last night as Hurricane Andrew headed west after sweeping across southern Florida, causing at least eight deaths and severe property damage. tree trimming in integer linear programs. Wood-

过去的 严重 飓风 美国 , 雨果 , 袭击 南卡罗来纳州 于1989年 , 耗资 从 保险 损失 行业 42亿 美元 , 但 造成的 总伤害 的 估计 60亿 美元 和 100亿 美元 之间 不等 。
(The last serious US hurricane, Hugo, which struck South Carolina in 1989, cost the industry Dollars 4.2bn from insured losses, though estimates of the total damage caused ranged between Dollars 6bn and Dollars 10bn.)

最初的 报道称 , 至少有一人 已经 死亡 , 75 人受伤 , 数千 取得 沿着 路易斯安那州海岸 无家可归 , 14 证实 在佛罗里达州和 死亡 三 巴哈马群岛 后 。
(Initial reports said at least one person had died, 75 been injured and thousands made homeless along the Louisiana coast, after 14 confirmed deaths in Florida and three in the Bahamas.)

Figure 1: Example compressive summary

We use the lp_solve package (http://lpsolve.sourceforge.net/) as the ILP solver to obtain an exact solution on the first document cluster (D04) of the DUC 2001 dataset; the solver effectively performs pruned brute-force enumeration over the original maximization problem and is therefore exactly optimal but too costly in practice. In Figure 2c, we depict the objective value achieved by the ILP as the exact solution, compared with the results from sentences that are gradually selected and compressed by our greedy algorithm. We can see that the approximation is close.

5 Related Work

The task we focus on in this paper is cross-language document summarization. Several pilot studies have investigated this task. Before Wan (2011)'s work, which explicitly utilizes bilingual information in a graph-based framework, earlier methods often used information from only one language (de Chalendar et al., 2005; Pingali et al., 2007; Orasan and Chiorean, 2008; Litvak et al., 2010).

In recent years, some research has made progress beyond extractive summarization. Sentence compression has been used as a component of multi-document summarization (Zajic et al., 2006), cast as global inference with integer linear programming (Clarke and Lapata, 2008), and learned jointly with extraction (Berg-Kirkpatrick et al., 2011). Woodsend and Lapata (2012) propose quasi tree substitution grammars for multiple rewriting operations. All these methods involve integer linear programming solvers to generate compressed summaries, which is time-consuming for multi-document summarization tasks. Almeida and Martins (2013) formulate the compressive summarization problem in a more efficient dual decomposition framework, with models for sentence compression and extractive summarization trained by multi-task learning techniques. Wang et al. (2013) explore different types of compression on constituent parse trees for query-focused summarization. Li et al. (2013) propose a guided sentence compression model with ILP-based summary sentence selection, and their follow-up work (Li et al., 2014) incorporates various constraints on constituent parse trees to improve the linguistic quality of the compressed sentences. In these studies, the best-performing systems require supervised learning for different subtasks. More recent work formulates document summarization as an optimization problem and uses its solution to guide sentence compression (Li et al., 2015; Yao et al., 2015). Bing et al. (2015) employ integer linear programming to conduct phrase selection and merging simultaneously, forming compressed sentences after phrase extraction.

This work is closely related to greedy algorithms for budgeted submodular maximization. Many studies have formalized text summarization tasks as submodular maximization problems (Lin and Bilmes, 2010; Lin and Bilmes, 2011; Morita et al., 2013). More recent work (Dasgupta et al., 2013) discussed the problem of maximizing a function with a submodular part and a non-submodular dispersion term, which may be the setting closest to our scoring function.

6 Conclusion and Future Work

In this paper we propose a phrase-based framework for the task of cross-language document summarization. The proposed scoring scheme operates naturally in compressive summarization, and we use an efficient greedy procedure to approximately optimize the scoring function. Experimental results show improvements of our compressive solution over state-of-the-art systems. Even though we do not explicitly use any syntactic information, the summaries generated by our system do not lose much grammaticality or fluency.
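For illustration, the contrast between greedy selection and exact enumeration (as in the Figure 2c comparison) can be sketched on toy data. The sketch below is a minimal example, not our implementation: a plain set-coverage score stands in for our phrase-based objective (it omits compression and the distortion penalty), and all function and variable names are illustrative.

```python
from itertools import combinations

def coverage(selected, concepts):
    """Monotone submodular score: number of distinct concepts covered
    by the selected sentences (a stand-in for the actual objective)."""
    covered = set()
    for i in selected:
        covered |= concepts[i]
    return len(covered)

def greedy_budgeted(concepts, costs, budget):
    """Cost-scaled greedy selection for budgeted submodular
    maximization, in the spirit of Lin and Bilmes (2010)."""
    selected, spent = [], 0
    candidates = set(range(len(concepts)))
    while candidates:
        base = coverage(selected, concepts)
        best, best_ratio = None, 0.0
        for i in candidates:
            if spent + costs[i] > budget:
                continue  # adding sentence i would exceed the budget
            gain = coverage(selected + [i], concepts) - base
            ratio = gain / costs[i]  # marginal gain per unit cost
            if ratio > best_ratio:
                best, best_ratio = i, ratio
        if best is None:
            break  # no affordable sentence improves the score
        selected.append(best)
        spent += costs[best]
        candidates.remove(best)
    return selected

def exact_budgeted(concepts, costs, budget):
    """Brute-force enumeration of all feasible subsets:
    exactly optimal but exponential in the number of sentences."""
    n = len(concepts)
    best_sel, best_val = [], 0
    for r in range(n + 1):
        for subset in combinations(range(n), r):
            if sum(costs[i] for i in subset) <= budget:
                val = coverage(list(subset), concepts)
                if val > best_val:
                    best_sel, best_val = list(subset), val
    return best_sel
```

On small instances the greedy objective value typically comes close to (and here matches) the brute-force optimum, mirroring the close approximation observed in Figure 2c; the theoretical guarantee for this style of greedy algorithm is a constant-factor approximation, not optimality.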

[Figure 2: Experimental analysis. (a) Effect of the damping factor d on ROUGE-2F; (b) effect of the distortion parameter η on ROUGE-2F; (c) objective values of the greedy algorithm compared with the exact solution as the number of selected sentences grows.]

The scoring function in our framework is inspired by earlier phrase-based machine translation models. Our next step is to try more fine-grained scoring schemes using similar techniques from modern approaches to statistical machine translation. To further improve the grammaticality of generated summaries, we may sacrifice a little time efficiency and use syntactic information provided by syntactic parsers.

Our framework currently uses only the single best translation. It would be more powerful to integrate machine translation and summarization by utilizing multiple possible translations.

Currently, many successful statistical machine translation systems are phrase-based and provide alignment information, and we utilize this fact in this work. It would be interesting to explore how the performance is affected if we are only provided with parallel sentences, so that alignments can only be derived using an independent aligner.

Acknowledgments

We thank all the anonymous reviewers for helpful comments and suggestions. This work was supported by the National Hi-Tech Research and Development Program (863 Program) of China (2015AA015403, 2014AA015102) and the National Natural Science Foundation of China (61170166, 61331011). The contact author of this paper, according to the meaning given to this role by Peking University, is Xiaojun Wan.

References

Miguel Almeida and Andre Martins. 2013. Fast and robust compressive summarization with dual decomposition and multi-task learning. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 196–206, Sofia, Bulgaria, August. Association for Computational Linguistics.

Taylor Berg-Kirkpatrick, Dan Gillick, and Dan Klein. 2011. Jointly learning to extract and compress. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 481–490, Portland, Oregon, USA, June. Association for Computational Linguistics.

Lidong Bing, Piji Li, Yi Liao, Wai Lam, Weiwei Guo, and Rebecca Passonneau. 2015. Abstractive multi-document summarization via phrase selection and merging. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1587–1597, Beijing, China, July. Association for Computational Linguistics.

James Clarke and Mirella Lapata. 2008. Global inference for sentence compression: An integer linear programming approach. Journal of Artificial Intelligence Research, 31:273–381.

Anirban Dasgupta, Ravi Kumar, and Sujith Ravi. 2013. Summarization through submodularity and dispersion. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1014–1022, Sofia, Bulgaria, August. Association for Computational Linguistics.

Gaël de Chalendar, Romaric Besançon, Olivier Ferret, Gregory Grefenstette, and Olivier Mesnard. 2005. Crosslingual summarization with thematic extraction, syntactic sentence simplification, and bilingual generation. In Workshop on Crossing Barriers in Text Summarization Research, 5th International Conference on Recent Advances in Natural Language Processing (RANLP 2005).

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Human Language Technologies: The 2003 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 48–54, Edmonton, May–June. Association for Computational Linguistics.

Philipp Koehn. 2009. Statistical Machine Translation. Cambridge University Press.

Chen Li, Fei Liu, Fuliang Weng, and Yang Liu. 2013. Document summarization via guided sentence compression. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 490–500, Seattle, Washington, USA, October. Association for Computational Linguistics.

Chen Li, Yang Liu, Fei Liu, Lin Zhao, and Fuliang Weng. 2014. Improving multi-documents summarization by sentence compression based on expanded constituent parse trees. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 691–701, Doha, Qatar, October. Association for Computational Linguistics.

Piji Li, Lidong Bing, Wai Lam, Hang Li, and Yi Liao. 2015. Reader-aware multi-document summarization via sparse coding. In IJCAI.

Hui Lin and Jeff Bilmes. 2010. Multi-document summarization via budgeted maximization of submodular functions. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 912–920, Los Angeles, California, June. Association for Computational Linguistics.

Hui Lin and Jeff Bilmes. 2011. A class of submodular functions for document summarization. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 510–520, Portland, Oregon, USA, June. Association for Computational Linguistics.

Marina Litvak, Mark Last, and Menahem Friedman. 2010. A new approach to improving multilingual summarization using a genetic algorithm. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 927–936, Uppsala, Sweden, July. Association for Computational Linguistics.

Hajime Morita, Ryohei Sasano, Hiroya Takamura, and Manabu Okumura. 2013. Subtree extractive summarization via submodular maximization. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1023–1032, Sofia, Bulgaria, August. Association for Computational Linguistics.

Constantin Orasan and Oana Andreea Chiorean. 2008. Evaluation of a cross-lingual Romanian-English multi-document summariser. In LREC.

Prasad Pingali, Jagadeesh Jagarlamudi, and Vasudeva Varma. 2007. Experiments in cross language query focused multi-document summarization. In Workshop on Cross Lingual Information Access: Addressing the Information Need of Multilingual Societies, IJCAI 2007.

Xiaojun Wan. 2011. Using bilingual information for cross-language document summarization. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1546–1555, Portland, Oregon, USA, June. Association for Computational Linguistics.

Lu Wang, Hema Raghavan, Vittorio Castelli, Radu Florian, and Claire Cardie. 2013. A sentence compression based framework to query-focused multi-document summarization. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1384–1394, Sofia, Bulgaria, August. Association for Computational Linguistics.

Kristian Woodsend and Mirella Lapata. 2012. Multiple aspect summarization using integer linear programming. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 233–243. Association for Computational Linguistics.

Jin-ge Yao, Xiaojun Wan, and Jianguo Xiao. 2015. Compressive document summarization via sparse optimization. In IJCAI.

David M. Zajic, Bonnie Dorr, Jimmy Lin, and Richard Schwartz. 2006. Sentence compression as a component of a multi-document summarization system. In Proceedings of the 2006 Document Understanding Workshop, New York.