Phrase-based Compressive Cross-Language Summarization

Jin-ge Yao, Xiaojun Wan, Jianguo Xiao
Institute of Computer Science and Technology, Peking University, Beijing 100871, China
Key Laboratory of Computational Linguistics (Peking University), MOE, China
{yaojinge, wanxiaojun, xiaojianguo}@pku.edu.cn

Abstract

The task of cross-language document summarization is to create a summary in a target language from documents in a different source language. Previous methods only involve direct extraction of automatically translated sentences from the original documents. Inspired by phrase-based machine translation, we propose a phrase-based model to simultaneously perform sentence scoring, extraction and compression. We design a greedy algorithm to approximately optimize the score function. Experimental results show that our methods outperform state-of-the-art extractive systems while maintaining similar grammatical quality.

1 Introduction

The task of cross-language summarization is to produce a summary in a target language from documents written in a different source language. This task is particularly useful for readers who want to quickly get the main idea of documents written in a source language they are not familiar with. Following Wan (2011), we focus on English-to-Chinese summarization in this work.

The simplest and most straightforward way to perform cross-language summarization is pipelining general summarization and machine translation. Such systems either translate all the documents before running generic summarization algorithms on the translated documents, or summarize from the original documents and then only translate the produced summary into the target language. Wan (2011) shows that such pipelining approaches are inferior to methods that utilize information from both sides. In that work, the author proposes graph-based models and achieves a fair amount of improvement. However, to the best of our knowledge, no previous work on this task tries to go beyond pure sentence extraction.

On the other hand, cross-language summarization can be seen as a special kind of machine translation: translating the original documents into a brief summary in a different language. Inspired by phrase-based machine translation models (Koehn et al., 2003), we propose a phrase-based scoring scheme for cross-language summarization in this work.

Since our framework is based on phrases, we are not limited to producing extractive summaries. We can use the scoring scheme to perform joint sentence selection and compression. Unlike typical sentence compression methods, our proposed algorithm does not require additional syntactic preprocessing such as part-of-speech tagging or syntactic parsing. We only utilize information from translated texts with phrase alignments. The scoring function consists of a submodular term over compressed sentences and a bounded distortion penalty term. We design a greedy procedure to efficiently obtain approximate solutions.

For experimental evaluation, we use the DUC 2001 dataset with manually translated reference Chinese summaries. Results based on the ROUGE metrics show the effectiveness of our proposed methods. We also conduct manual evaluation, and the results suggest that the linguistic quality of the produced summaries does not decrease by much compared with extractive counterparts. In some cases, grammatical smoothness can even be improved by compression.

The contributions of this paper include:

• Utilizing the phrase alignment information, we design a scoring scheme for the cross-language document summarization task.

• We design an efficient greedy algorithm to generate summaries. The greedy algorithm is

Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 118–127, Lisbon, Portugal, 17–21 September 2015. © 2015 Association for Computational Linguistics.

partially submodular and has a provable constant approximation factor to the optimal solution up to a small constant.

• We achieve state-of-the-art results using the extractive counterpart of our compressive summarization framework. Performance in terms of ROUGE metrics can be significantly improved when simultaneously performing extraction and compression.

2 Background

Document summarization can be treated as a special kind of translation process: translating from a bunch of related source documents to a short target summary. This analogy also holds for cross-language document summarization, with the only difference that the languages of the source documents and the target summary are different.

Our design of a sentence scoring function for cross-language document summarization is inspired by phrase-based machine translation models. Here we briefly describe the general idea of phrase-based translation. One may refer to Koehn (2009) for a more detailed description.

2.1 Phrase-based Machine Translation

Phrase-based machine translation models currently give state-of-the-art translations for many pairs of languages and dominate modern statistical machine translation. Classical word-based IBM models cannot capture local contextual information and local reordering very well. Phrase-based translation models operate on lexical entries with more than one word on the source-language and target-language sides. The allowance of multi-word expressions is believed to be the main reason for the improvements that phrase-based models give. Note that these multi-word expressions, typically addressed as phrases in the machine translation literature, are essentially continuous n-grams and do not need to be linguistically integral, meaningful constituents.

Define y as a phrase-based derivation, or more precisely a finite sequence of phrases p_1, p_2, ..., p_L. For any derivation y we use e(y) to refer to the target-side translation text defined by y. This translation is derived by concatenating the strings e(p_1), e(p_2), ..., e(p_L). The scoring scheme for a phrase-based derivation y from the source sentence to the target sentence e(y) is:

f(y) = Σ_{k=1}^{L} g(p_k) + LM(e(y)) + η Σ_{k=1}^{L−1} |start(p_{k+1}) − 1 − end(p_k)|

where LM(·) is the target-side language model score, g(·) is the score function of phrases, and η < 0 is the distortion parameter for penalizing the distance between neighboring phrases in the derivation. Note that the phrases addressed here are typically continuous n-grams and need not be grammatical linguistic phrasal units. Later we will directly use phrases provided by modern machine translation systems.

Searching for the best translation under this score definition is difficult in general, so approximate decoding algorithms such as beam search should be applied. Meanwhile, several constraints should be satisfied during the decoding process. The most important one is to set a constant limit on the distortion term, |start(p_{k+1}) − 1 − end(p_k)| ≤ δ, to prohibit derivations with distant phrase translations.
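As a concrete illustration, the derivation score above can be sketched in a few lines of Python. The phrase scorer `g` and language model `lm` used in the example are invented stand-ins, not components of any actual decoder:

```python
# Illustrative sketch of the phrase-based derivation score
# f(y) = sum_k g(p_k) + LM(e(y)) + eta * sum_k |start(p_{k+1}) - 1 - end(p_k)|.
# A phrase is modeled as a (start, end, target_text) source span.

def derivation_score(phrases, g, lm, eta=-1.0):
    """Score a derivation given a phrase scorer g and a language model lm."""
    total = sum(g(p) for p in phrases)
    target = " ".join(p[2] for p in phrases)   # e(y): concatenated target side
    total += lm(target)
    # Distortion: penalize jumps between source spans of consecutive phrases.
    for prev, nxt in zip(phrases, phrases[1:]):
        total += eta * abs(nxt[0] - 1 - prev[1])
    return total

def within_distortion_limit(phrases, delta):
    """Decoding constraint: |start(p_{k+1}) - 1 - end(p_k)| <= delta for all k."""
    return all(abs(n[0] - 1 - p[1]) <= delta
               for p, n in zip(phrases, phrases[1:]))

# Toy monotone derivation over source positions 1..5 (invented data).
y = [(1, 2, "the storm"), (3, 3, "hit"), (4, 5, "the coast")]
g = lambda p: 1.0                            # uniform phrase score (assumption)
lm = lambda text: -0.1 * len(text.split())   # crude length-based LM proxy
print(round(derivation_score(y, g, lm), 2))  # prints 2.5
print(within_distortion_limit(y, delta=0))   # prints True
```

With a monotone derivation the distortion term vanishes, so the score is just the phrase scores plus the language-model term; any reordering jump would be charged η per skipped source position.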

3 Phrase-based Cross-Language Summarization

Inspired by the general idea of phrase-based machine translation, we describe our proposed phrase-based model for cross-language summarization in this section.

3.1 Phrase-based Sentence Scoring

In the context of cross-language summarization, we assume that we also have phrases in both the source and target languages, along with phrase alignments between the two sides. For summarization purposes, we may wish to select sentences containing more important phrases. It is then plausible to measure the scores of these aligned phrases via importance weighing.

Inspired by phrase-based translation models, we can assign phrase-based scores to sentences from the translated documents for summarization purposes. We define our scoring function for each sentence s as:

F(s) = Σ_{p∈s} d_0 g(p) + bg(s) + η dist(y(s))

Here in the first term, g(·) is the score of phrase p, which can be simply set to document frequency. The phrase score is penalized with a constant damping factor d_0 to decay scores for repeated phrases. The second term bg(s) is the bigram score of sentence s. It is used here to simulate the effect of language models in phrase-based translation models. Denoting y(s) as the phrase-based derivation (as mentioned in the previous section) of sentence s, the last distortion term dist(y(s)) = Σ_{k=1}^{L−1} |start(p_{k+1}) − 1 − end(p_k)| is exactly the same as the distortion penalty term in phrase-based translation models. This term can be used as a reflection of the complexity of the translation. All the above terms can be derived from bilingual sentence pairs with phrase alignments. Meanwhile, we may also wish to exclude unimportant phrases and badly translated phrases. Our definition can thus also be used to guide sentence compression by trying to remove redundant phrases.

Based on the definition over sentences, we define our scoring measure over a summary S:

F(S) = Σ_{p∈S} Σ_{i=1}^{count(p,S)} d^{i−1} g(p) + Σ_{s∈S} bg(s) + η Σ_{s∈S} dist(y(s))

where d is a predefined constant damping factor to penalize repeated occurrences of the same phrases, and count(p, S) is the number of occurrences of phrase p in the summary S. All other terms are inherited from the sentence score definition.

In the next section we describe our framework to efficiently utilize this scoring function for cross-language summarization.

3.2 A Greedy Algorithm for Compressed Sentence Selection

Utilizing the phrase-based score definition of sentences, we can use greedy algorithms to simultaneously perform sentence selection and sentence compression. Assume that we have a predefined budget B (e.g. the total number of Chinese characters allowed) to restrict the total length of a generated summary. We use C(S) to denote the cost of a summary S, measured by the total number of Chinese characters it contains. The greedy algorithm we use for our compressive summarization is listed in Algorithm 1.

Algorithm 1 A greedy algorithm for phrase-based summarization
1: S ← ∅
2: i ← 1
3: single_best ← argmax_{s∈U, C({s})≤B} F({s})
4: while U ≠ ∅ do
5:   s_i ← argmax_{s∈U} [F(S_{i−1} ∪ {s}) − F(S_{i−1})] / C({s})^r
6:   if C(S_{i−1} ∪ {s}) ≤ B then
7:     S_i ← S_{i−1} ∪ {s}
8:     i ← i + 1
9:   end if
10:  U ← U \ {s_i}
11: end while
12: return S* = argmax_{S∈{single_best, S_i}} F(S)

The space U denotes the set of all possible compressed sentences. In each iteration, the algorithm tries to find the compressed sentence with the maximum gain-cost ratio (Line 5, where we follow previous work and set r = 1) and merge it into the summary set at the current iteration (denoted as S_i). The target is to find the compression with the maximum gain-cost ratio; this will be discussed in the next section. Note that the algorithm is also naturally applicable to extractive summarization. For extractive summarization, Line 5 corresponds to direct calculation of sentence scores based on our proposed phrase-based function, and U then denotes all full sentences from the original translated documents.

The outline of this algorithm is very similar to the greedy algorithm used by Morita et al. (2013) for subtree extraction, except that in our context the increase of the cost function when adding a sentence is exactly the cost of that sentence.

When the distortion term is ignored (η = 0), the scoring function is clearly submodular¹ (Lin and Bilmes, 2010) in terms of the set of compressed sentences, since the score now only consists of functional gains of phrases along with bigrams of a compressed sentence. Morita et al. (2013) have proved that when r = 1, this greedy algorithm achieves a constant approximation factor (1/2)(1 − e^{−1}) to the optimal solution. Note that this only gives us a worst-case guarantee; what we can achieve in practice is usually far better.

¹A set function F : 2^U → R defined over subsets of a universe set U is said to be submodular iff it satisfies the diminishing returns property: ∀ S ⊆ T ⊆ U \ {u}, we have F(S ∪ {u}) − F(S) ≥ F(T ∪ {u}) − F(T).
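A minimal sketch of the greedy selection in Algorithm 1 follows. The score F and cost C here are toy stand-ins chosen for illustration: F is a coverage-style (submodular) count of distinct phrases, and C counts the phrases of a sentence in place of Chinese characters:

```python
# Sketch of Algorithm 1: greedy selection by gain-cost ratio under budget B,
# with the single-best fallback of line 3. Not the paper's implementation;
# F and C are invented toy functions.

def greedy_summary(candidates, F, C, B, r=1.0):
    """candidates: list of sentences; F: score of a list; C: cost of one."""
    U, S = list(candidates), []
    feasible = [s for s in U if C(s) <= B]
    single_best = max(feasible, key=lambda s: F([s]), default=None)
    while U:
        # Candidate with maximum gain-cost ratio (line 5 of Algorithm 1).
        s = max(U, key=lambda x: (F(S + [x]) - F(S)) / (C(x) ** r))
        if sum(C(t) for t in S) + C(s) <= B:
            S.append(s)
        U.remove(s)
    if single_best is not None and F([single_best]) > F(S):
        return [single_best]
    return S

# Toy instance: sentences as sets of phrase ids (invented data).
sents = [frozenset("ab"), frozenset("bc"), frozenset("d"), frozenset("abcde")]
F = lambda S: len(set().union(*S)) if S else 0   # distinct phrases covered
C = len                                          # cost = number of phrases
summary = greedy_summary(sents, F, C, B=4)
print(sorted(len(s) for s in summary))           # prints [1, 2]
```

Note how the long, high-coverage sentence is rejected for exceeding the budget, while two short sentences together achieve coverage 3 within cost 3; this is the gain-cost-ratio behaviour the approximation guarantee is about.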

On the other hand, setting η < 0 does not affect the performance guarantee too much. Intuitively, this is because in most phrase-based translation models a distortion limit constraint |start(p_{k+1}) − 1 − end(p_k)| ≤ δ is applied on the distortion terms, while performing sentence compression can never increase distortion. The main conclusion is formulated as:

Theorem 1. If Algorithm 1 outputs S^{greedy} while the optimal solution is OPT, we have

F(S^{greedy}) ≥ (1/2)(1 − e^{−1}) F(OPT) + (1/2) ηγ.

Here γ > 0 is a constant controlled by the distortion difference between sentences, which is relatively small in practice compared with phrase scores, and η < 0 is the distortion parameter. Note that when η is set to 0, the scoring function is submodular and we recover the (1/2)(1 − e^{−1}) approximation factor as studied by Morita et al. (2013). We leave the proof of Theorem 1 to the supplementary materials due to the space limit. The submodular part of the score plays an important role in the proof.

3.3 Finding the Maximum Density Compression

In Algorithm 1, the most important part is the greedy selection process (Line 5). The greedy selection criterion here is to maximize the gain-cost ratio. For compressive summarization, we try to compress each unselected sentence s into s̃, aiming to maximize the gain-cost ratio [F(S_{i−1} ∪ {s̃}) − F(S_{i−1})] / s̃.cost, where the gain corresponds to

F(S_{i−1} ∪ {s}) − F(S_{i−1}) = Σ_{p∈s} Σ_{i=1}^{count(p,S)} d^{i−1} g(p) + bg(s) + η dist(s),

and we then add the compressed sentence s̃ with the maximum gain-cost ratio to the summary. We refer to the compression process for each sentence as finding the maximum density compression. The whole framework forms a joint selection and compression process.

In our phrase-based scoring for sentences, although there is no apparent optimal substructure available for exact dynamic programming, due to the nonlocal distortion penalty, we can still have a tractable approximate procedure, since the search space is only defined by local decisions on whether a phrase should be kept or dropped.

Our compression process for each sentence s is displayed in Algorithm 2. It gradually expands the set of phrases to be kept in the final compression, starting from the initial set of large-density phrases (Line 4, assuming that phrases with large scores and small costs should always be kept); in this way we can recover the compression with maximum density. The function dist(·,·) is the unit distortion penalty defined as dist(a, b) = |start(b) − 1 − end(a)|. We define p.score to be the sum of damped phrase scores for phrase p, i.e. p.score = Σ_{i=1}^{count(p,S_{i−1})} d^{i−1} g(p), when the current partial summary is S_{i−1}. Therefore, during each iteration of the greedy selection process, the compression procedure is also affected by sentences that have already been included. Define p.cost as the number of characters p contains.

Algorithm 2 A growing algorithm for finding the maximum density compressed sentence
1: function GET_MAX_DENSITY_COMPRESSION(s, S_{i−1})
2:   queue Q ← ∅, kept ← ∅
3:   for each phrase p in s.phrases do
4:     if p.score / p.cost > 1 then
5:       kept ← kept ∪ {p}
6:       Q.enqueue(p)
7:     end if
8:   end for
9:   while Q ≠ ∅ do
10:    p ← Q.dequeue()
11:    ppv ← p.previous_phrase, pnx ← p.next_phrase
12:    if (ppv.score + bg(ppv, p) + η dist(ppv, p)) / (ppv.cost + p.cost) > 1 then
13:      Q.enqueue(ppv), kept ← kept ∪ {ppv}
14:    end if
15:    if (pnx.score + bg(pnx, p) + η dist(p, pnx)) / (p.cost + pnx.cost) > 1 then
16:      Q.enqueue(pnx), kept ← kept ∪ {pnx}
17:    end if
18:  end while
19:  return s̃ = kept, with ratio = gain / s̃.cost
20: end function

Empirically we find that this procedure gives almost the same results as exhaustive search while maintaining efficiency. Assuming that sentence length is no more than L, the asymptotic complexity of Algorithm 2 is O(L), since the algorithm requires two passes over all phrases. Therefore the whole framework requires O(kNL) time to generate a summary with k sentences for a document cluster containing N sentences in total.

In the final compressed sentence we simply leave the selected phrases in place as they are, relying on bigram scores to ensure local smoothness. The task is, after all, a summarization task, where bigram scores play the role of not only controlling grammaticality but also keeping the main information of the original documents.
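The growing procedure of Algorithm 2 can be sketched as follows. For simplicity the phrases are assumed contiguous in the source (so the distortion term is zero), and the scores, costs and bigram bonuses are invented toy values rather than quantities computed by the paper's system:

```python
from collections import deque

# Sketch of Algorithm 2: seed the kept set with "dense" phrases
# (score/cost > 1), then expand to a neighbour whenever the pair's
# joint density stays above 1.

def max_density_compression(phrases, score, cost, bg):
    """phrases: ordered phrase ids of one sentence; bg: bigram bonus per pair."""
    kept, queue = set(), deque()
    for p in phrases:                          # lines 3-8: seed phrases
        if score[p] / cost[p] > 1:
            kept.add(p)
            queue.append(p)
    while queue:                               # lines 9-18: grow outwards
        p = queue.popleft()
        i = phrases.index(p)
        for j in (i - 1, i + 1):               # previous / next phrase
            if 0 <= j < len(phrases) and phrases[j] not in kept:
                q = phrases[j]
                pair = (q, p) if j < i else (p, q)
                if (score[q] + bg[pair]) / (cost[q] + cost[p]) > 1:
                    kept.add(q)
                    queue.append(q)
    return [p for p in phrases if p in kept]   # compression keeps source order

# Invented toy sentence of four phrases.
phrases = ["p1", "p2", "p3", "p4"]
score = {"p1": 3.0, "p2": 0.5, "p3": 1.0, "p4": 0.2}
cost = {"p1": 2, "p2": 1, "p3": 2, "p4": 3}
bg = {("p1", "p2"): 3.0, ("p2", "p3"): 1.0, ("p3", "p4"): 0.0}
print(max_density_compression(phrases, score, cost, bg))  # prints ['p1', 'p2']
```

Here p1 is the only seed; p2 is pulled in because its strong bigram bonus with p1 keeps the pair's density above 1, while p3 and p4 are dropped, mirroring how weakly scored, badly aligned phrases are compressed away.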

Later we will see that this compression process does not hurt the grammatical fluency of translated sentences in general. In many cases it may even improve fluency by deleting redundant parentheses or removing incorrectly reordered (unimportant) phrases.

4 Experiments

4.1 Data

Currently there are not many available datasets for our particular setting of the cross-language summarization task. Hence we only evaluate our method on the same dataset used by Wan (2011). The dataset is created by manually translating into Chinese the reference summaries of the original DUC 2001 dataset in English. We will refer to this dataset as the DUC 2001 dataset in this paper. There are 30 English document sets in the DUC 2001 dataset for multi-document summarization. Each set contains several documents related to the same topic. Three generic reference English summaries are provided by NIST annotators for each document set. All these English summaries have been translated to Chinese by native Chinese annotators.

All the English sentences in the original documents have been automatically translated into Chinese using Google Translate. We also collect the phrase alignment information from the responses of Google Translate (stored in JSON format) along with the translated texts. We use the Stanford Chinese Word Segmenter² for Chinese word segmentation.

We experiment with two kinds of summary budgets for comparative study. The first limits the summary length to no more than five sentences. The second limits the total number of Chinese characters of each produced summary to no more than 300. They are addressed as Sentence Budgeting and Character Budgeting in the experimental results, respectively.

4.2 Evaluation

We report the performance of our compressive solution, denoted as PBCS (for Phrase-Based Compressive Summarization), in comparison with the following systems:

• PBES: The acronym comes from Phrase-Based Extractive Summarization. It is the extractive counterpart of our solution, without calling Algorithm 2.

• Baseline (EN): This baseline relies on merely the English-side information for English sentence ranking in the original documents. The scoring function is designed to be document frequencies of English bigrams, which is similar to the second term in our proposed sentence scoring function in Section 3.1 and is submodular.³ The extracted English summary is finally automatically translated into the corresponding Chinese summary. This is also known as the summary translation scheme.

• Baseline (CN): This baseline relies on merely the Chinese-side information for Chinese sentence ranking. The scoring function is similarly defined by document frequencies of Chinese bigrams. The Chinese summary sentences are then directly extracted from the translated Chinese documents. This is also known as the document translation scheme.

• CoRank: We reimplement the graph-based CoRank algorithm, which gives the state-of-the-art performance on the same DUC 2001 dataset, for comparison.

• Baseline (ENcomp): This is a compressive baseline in which the extracted English sentences of Baseline (EN) are compressed before being translated to Chinese. The compression process follows an integer linear program as described by Clarke and Lapata (2008). This baseline gives strong performance, as we have found on the English DUC 2001 dataset as well as other monolingual datasets.

The parameters in the algorithms are simply set to r = 1, d = 0.5, η = −0.5.

Similar to traditional summarization tasks, we use the ROUGE metrics for automatic evaluation of all systems in comparison. The ROUGE metrics measure summary quality by counting overlapping word units (e.g. n-grams) between the candidate summary and the reference summary. Following previous work on the same task, we report the following ROUGE F-measure scores: ROUGE-1 (unigrams), ROUGE-2 (bigrams), ROUGE-W (weighted longest common subsequence; weight = 1.2), ROUGE-L (longest common subsequences), and ROUGE-SU4 (skip bigrams with a maximum distance of 4). We investigate two kinds of ROUGE metrics for Chinese: ROUGE metrics based on words (after Chinese word segmentation) and ROUGE metrics based on single Chinese characters. The latter metrics do not suffer from the problem of word segmentation inconsistency.

To compare our method with extractive baselines in terms of information loss and grammatical quality, we also ask three native Chinese students as annotators to carry out manual evaluation. The aspects considered during evaluation include Grammaticality (GR), Non-Redundancy (NR), Referential Clarity (RC), Topical Focus (TF) and Structural Coherence (SC). Each aspect is rated with scores from 1 (poor) to 5 (good).⁴ This evaluation is performed on the same random sample of 10 document sets from the DUC 2001 dataset. One group of the gold-standard summaries is left out for evaluation of human-level performance. The other two groups are shown to the annotators, giving them a sense of the topics talked about in the document sets.

4.3 Results and Discussion

Table 1 and Table 2 display the ROUGE results for our proposed methods and the baseline methods, including both word-based and character-based evaluation. We also conduct pairwise t-tests and find that almost all the differences between PBCS and the other systems are statistically significant with p ≪ 0.01,⁵ except for the ROUGE-W metric.

We share the observations of previous work on the inferiority of using information from only one side, with Chinese-side information only being more beneficial than English-side only. The CoRank algorithm utilizes both sides of information together and achieves significantly better performance than Baseline (EN) and Baseline (CN). Our compressive system outperforms the CoRank algorithm⁶ in all metrics. Our system also outperforms the compressive pipelining system (Baseline (ENcomp)); note that the latter only considers information from the source language side, and sentence compression may sometimes cause worse translations compared with translating the full original sentence.

For manual evaluation, the average score and standard deviation for each metric are displayed in Table 3. Comparing compressive summarization with the extractive version, there are slight improvements in non-redundancy. This exactly matches what we can expect from sentence compression, which keeps only the important parts and drops redundancy. We also observe a certain amount of improvement in referential clarity. This may be a result of the deletion of phrases such as "he said". Most such phrases are semantically unimportant and are dropped during the process of finding the maximum density compression.

Despite not directly using syntactic information, our compressive summaries do not suffer much loss of grammaticality. This suggests that bigrams can be treated as good indicators of local grammatical smoothness. We reckon that sentences describing the same events may partially share descriptive bigram patterns; thus sentences selected by the algorithm will consist mostly of important patterns that appear repeatedly in the original document cluster. Only those words that are neither semantically important nor syntactically pivotal will be deleted.

Figure 1 lists the summary for the first document set (D04) in the DUC 2001 dataset produced by the proposed compressive system. The Chinese-side sentences have been split with spaces according to the phrase alignment results. Phrases that have been compressed are grayed out. We also include the original English sentences for reference, with deletions according to the word alignments from the Chinese sentences. We can observe that our compressive system tries to compress sentences by removing relatively unimportant phrases. The effect of translation errors (e.g. the word watch in storm watch has been incorrectly translated in the example) can also be reduced, since those incorrectly translated words will be dropped for having low information gains.

²http://nlp.stanford.edu/software/segmenter.shtml
³In our experiments this method gives performance similar to the graph-based pipelining baselines implemented in previous work.
⁴Fractional numbers are allowed for cases where the annotators feel uncertain.
⁵The significance level holds after Bonferroni adjustment for the purpose of multiple testing.
⁶There is an ignorable difference between the results of our reimplemented version of CoRank and those reported by Wan (2011). We believe this comes from different machine translation results output by Google Translate.
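For reference, a simplified character-based ROUGE-2-style F-measure can be sketched as below. The actual evaluation uses the standard ROUGE toolkit, so this toy version is only an illustrative stand-in for how character-bigram overlap is scored:

```python
from collections import Counter

# Character-bigram precision/recall/F, in the spirit of the character-based
# ROUGE-2 metric: count overlapping character bigrams between a candidate
# and a reference, ignoring whitespace.

def char_bigrams(text):
    chars = [c for c in text if not c.isspace()]
    return Counter(zip(chars, chars[1:]))

def rouge2_f(candidate, reference):
    cand, ref = char_bigrams(candidate), char_bigrams(reference)
    if not cand or not ref:
        return 0.0
    overlap = sum((cand & ref).values())   # clipped multiset intersection
    p = overlap / sum(cand.values())
    r = overlap / sum(ref.values())
    return 0.0 if overlap == 0 else 2 * p * r / (p + r)

print(round(rouge2_f("abcd", "abce"), 3))  # prints 0.667
```

Because the units are single characters, this variant sidesteps the word segmentation inconsistency mentioned above: two segmenters that split the same Chinese string differently still yield identical character bigrams.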

Sentence Budgeting          ROUGE-1  ROUGE-2  ROUGE-W  ROUGE-L  ROUGE-SU4
Baseline(EN)                0.23655  0.03550  0.05324  0.12559  0.06410
Baseline(CN)                0.23454  0.03858  0.05753  0.13120  0.06962
PBES                        0.25313  0.04073  0.06103  0.13583  0.06970
CoRank (reported)           N/A      0.04282  0.06158  0.14521  0.07805
CoRank (reimplemented)      0.24257  0.04115  0.06076  0.13717  0.07453
Baseline(ENcomp)            0.24879  0.04441  0.05865  0.13233  0.07543
PBCS                        0.26872  0.04815  0.06425  0.14607  0.08065

Character Budgeting         ROUGE-1  ROUGE-2  ROUGE-W  ROUGE-L  ROUGE-SU4
Baseline(EN)                0.21460  0.03494  0.05150  0.12343  0.06278
Baseline(CN)                0.21589  0.03732  0.05420  0.12867  0.06405
PBES                        0.22825  0.04037  0.05527  0.12856  0.06894
CoRank (reimplemented)      0.22593  0.04069  0.05887  0.12818  0.07241
Baseline(ENcomp)            0.23663  0.04245  0.06134  0.13070  0.07365
PBCS                        0.24917  0.04632  0.06252  0.13591  0.07953

Table 1: Results of word-based ROUGE evaluation

Sentence Budgeting          ROUGE-1  ROUGE-2  ROUGE-W  ROUGE-L  ROUGE-SU4
Baseline(EN)                0.34842  0.11823  0.05505  0.15665  0.12320
Baseline(CN)                0.34901  0.12015  0.05664  0.15942  0.12625
PBES                        0.36618  0.12281  0.05913  0.16018  0.11317
CoRank (reimplemented)      0.37601  0.12570  0.06088  0.17350  0.13352
Baseline(ENcomp)            0.36982  0.13001  0.06906  0.16233  0.13543
PBCS                        0.37890  0.13549  0.07102  0.17632  0.14098

Character Budgeting         ROUGE-1  ROUGE-2  ROUGE-W  ROUGE-L  ROUGE-SU4
Baseline(EN)                0.33602  0.10546  0.05263  0.15437  0.12161
Baseline(CN)                0.34075  0.12012  0.05678  0.15736  0.11981
PBES                        0.35483  0.11902  0.05642  0.15899  0.11205
CoRank (reimplemented)      0.36147  0.12305  0.05847  0.16962  0.13364
Baseline(ENcomp)            0.36654  0.12960  0.06503  0.15987  0.13421
PBCS                        0.37842  0.13441  0.07005  0.16928  0.13985

Table 2: Results of character-based ROUGE evaluation

System   GR         NR         RC         TF         SC
CoRank   3.00±0.75  3.35±0.57  3.55±0.82  3.90±0.79  3.55±0.74
PBES     2.90±0.89  3.25±0.70  3.50±0.87  3.96±0.80  3.45±0.50
PBCS     2.90±0.83  3.60±0.49  3.75±0.82  3.93±0.68  3.40±0.58
Human    4.60±0.49  4.15±0.73  4.35±0.73  4.93±0.25  3.90±0.94

Table 3: Manual evaluation results

In some cases, grammatical fluency can even be improved by sentence compression, as redundant parentheses may sometimes be removed. We leave the output summaries from all systems for the same document set to the supplementary materials.

In our experiments, we also study the influence of relevant parameter settings. Figure 2a depicts the variation of the ROUGE-2 F-measure when changing the damping factor d over the values {1, 2^−1, 3^−1, 4^−1, 5^−1}, with η = −0.5 fixed. We can see that within a proper range the value of d does not affect the result too much, while no damping or too much damping severely decreases the performance. Figure 2b shows the performance change under different settings of the distortion parameter η, taking values from {0, −0.2, −0.5, −1, −3}, while fixing d = 0.5. The results suggest that, for our purposes of summarization, the difference between considering the distortion penalty or not is obvious, while beyond a certain level the effect brought by different values of the distortion parameter becomes stable.

We also empirically study the effect of the approximation. The compressive summarization framework proposed in this paper can be trivially cast into an integer linear program (ILP), with the number of variables being too large to make the problem tractable.⁷

⁷By casting decisions on whether to select a certain phrase or bigram as binary variables, with additional linear constraints on phrase/bigram selection consistency, we get an ILP with essentially the same objective function and a linear budget constraint. This is conceptually equivalent to solving the original maximization problem with pruned brute-force enumeration, and is therefore exactly optimal but too costly.
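The exact-versus-greedy comparison can be illustrated on a small invented instance, using brute-force enumeration of feasible subsets in place of an ILP solver (which, as noted above, is conceptually equivalent for exact optimization):

```python
from itertools import combinations

# Toy check of approximation quality: exact enumeration vs. greedy
# gain-cost-ratio selection on a coverage-style (submodular) objective.
# The instance is invented purely for illustration.

def F(S):
    return len(set().union(*S)) if S else 0

def exact_best(sents, cost, B):
    """Brute force over all feasible subsets (the 'exact' solution)."""
    best_val = 0
    for k in range(len(sents) + 1):
        for combo in combinations(range(len(sents)), k):
            if sum(cost[i] for i in combo) <= B:
                best_val = max(best_val, F([sents[i] for i in combo]))
    return best_val

def greedy_best(sents, cost, B):
    """Greedy gain-cost-ratio selection under budget B (no fallback step)."""
    U, S, used = list(range(len(sents))), [], 0
    while U:
        i = max(U, key=lambda i: (F([sents[j] for j in S + [i]]) -
                                  F([sents[j] for j in S])) / cost[i])
        if used + cost[i] <= B:
            S.append(i)
            used += cost[i]
        U.remove(i)
    return F([sents[i] for i in S])

sents = [{"a", "b"}, {"c"}, {"b", "c", "d"}, {"e", "f"}]
cost = [2, 1, 3, 2]
print(exact_best(sents, cost, B=5), greedy_best(sents, cost, B=5))  # prints 5 5
```

On this instance the greedy objective matches the exact optimum, which is in line with the observation above that the approximation is close in practice, even though the worst-case guarantee is only (1/2)(1 − e^{−1}).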

124 凯特 女士 硬朗 , 紧急服务 在佛罗里达州 的 戴德 县, 承担了 风暴 的冲击 主 任 估计, 安德鲁 已经 造成 150亿 美元 到 200亿 美元 的损害 ( 75亿 英镑 , cially in the context of compressive summariza- 100亿 英镑 ) 。 Ms Kate Hale, director of emergency services in Florida's Dade County, which bore the brunt tion. Zajic et al. (2006) tries a pipeline strat- of the storm, estimated that Andrew had already caused Dollars 15bn to Dollars 20bn (Pounds 7.5bn-Pounds 10bn) of damage. egy with heuristics to generate multiple candidate

雨果飓风 , 袭击 东海岸 在 1989年9月 , 花费了 保险业 约 42亿 美元 。 compressions and extract from this compressed Hurricane Hugo, which hit the east coast in September 1989, cost the insurance industry sentences. Berg-Kirkpatrick et al. (2011) cre- about Dollars 4.2bn. ate linear models of weights learned by struc- 美国城市 沿 墨西哥湾的 阿拉巴马州 到得克萨斯州 东部 是 在 风暴 手表 昨晚 安 德鲁 飓风 向西 横跨 佛罗里达州南部 席卷 后 ,造成 至少 八人死亡 和严重的 财 tural SVMs for different components and tried 产损失 。 US CITIES along the Gulf of Mexico from Alabama to eastern Texas were on storm watch to jointly formulate sentence selection and last night as Hurricane Andrew headed west after sweeping across southern Florida, causing at least eight deaths and severe property damage. tree trimming in integer linear programs. Wood-

过去的 严重 飓风 美国 , 雨果 , 袭击 南卡罗来纳州 于1989年 , 耗资 从 保险 损失 行业 42亿 美元 , 但 造成的 总伤害 的 估计 60亿 美元 和 100亿 美元 之间 不等 。
(The last serious US hurricane, Hugo, which struck South Carolina in 1989, cost the industry Dollars 4.2bn from insured losses, though estimates of the total damage caused ranged between Dollars 6bn and Dollars 10bn.)

最初的 报道称 , 至少有一人 已经 死亡 , 75 人受伤 , 数千 取得 沿着 路易斯安那州海岸 无家可归 , 14 证实 在佛罗里达州和 死亡 三 巴哈马群岛 后 。
(Initial reports said at least one person had died, 75 been injured and thousands made homeless along the Louisiana coast, after 14 confirmed deaths in Florida and three in the Bahamas.)

Figure 1: Example compressive summary

We use the lp_solve package (http://lpsolve.sourceforge.net/) as the ILP solver to obtain an exact solution on the first document cluster (D04) of the DUC 2001 dataset; the solver effectively performs pruned brute-force enumeration over the original maximization problem and is therefore exactly optimal but too costly in practice. In Figure 2c, we depict the objective value achieved by the ILP as the exact solution, compared with the results from sentences that are gradually selected and compressed by our greedy algorithm. We can see that the approximation is close.

5 Related Work

The task we focus on in this paper is cross-language document summarization. Several pilot studies have investigated this task. Before Wan (2011)'s work, which explicitly utilizes bilingual information in a graph-based framework, earlier methods often used information from only one language (de Chalendar et al., 2005; Pingali et al., 2007; Orasan and Chiorean, 2008; Litvak et al., 2010).

In recent years, some research has made progress beyond extractive summarization. Sentence compression has been used as a component of multi-document summarization (Zajic et al., 2006), cast as global inference with integer linear programming (Clarke and Lapata, 2008), and learned jointly with extraction (Berg-Kirkpatrick et al., 2011). Woodsend and Lapata (2012) propose quasi tree substitution grammars for multiple rewriting operations. All these methods involve integer linear programming solvers to generate compressed summaries, which is time-consuming for multi-document summarization tasks. Almeida and Martins (2013) formulate the compressive summarization problem in a more efficient dual decomposition framework, with models for sentence compression and extractive summarization trained by multi-task learning techniques. Wang et al. (2013) explore different types of compression on constituent parse trees for query-focused summarization. Li et al. (2013) propose a guided sentence compression model with ILP-based summary sentence selection, and their follow-up work (Li et al., 2014) incorporates various constraints on constituent parse trees to improve the linguistic quality of the compressed sentences. In these studies, the best-performing systems require supervised learning for different subtasks. More recent work formulates document summarization as an optimization problem and uses its solution to guide sentence compression (Li et al., 2015; Yao et al., 2015). Bing et al. (2015) employ integer linear programming to conduct phrase selection and merging simultaneously, forming compressed sentences after phrase extraction.

This work is closely related to greedy algorithms for budgeted submodular maximization. Many studies have formalized text summarization tasks as submodular maximization problems (Lin and Bilmes, 2010; Lin and Bilmes, 2011; Morita et al., 2013). More recent work (Dasgupta et al., 2013) discussed the problem of maximizing a function with a submodular part and a non-submodular dispersion term, which may be the setting closest to our scoring function.

6 Conclusion and Future Work

In this paper we propose a phrase-based framework for the task of cross-language document summarization. The proposed scoring scheme operates naturally in compressive summarization, and we use an efficient greedy procedure to approximately optimize the scoring function. Experimental results show improvements of our compressive solution over state-of-the-art systems. Even though we do not explicitly use any syntactic information, the summaries generated by our system do not lose much grammaticality or fluency.
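For illustration, the contrast between greedy selection and exact enumeration (as in the Figure 2c comparison) can be sketched on toy data. The sketch below is a minimal example, not our implementation: a plain set-coverage score stands in for our phrase-based objective (it omits compression and the distortion penalty), and all function and variable names are illustrative.

```python
from itertools import combinations

def coverage(selected, concepts):
    """Monotone submodular score: number of distinct concepts covered
    by the selected sentences (a stand-in for the actual objective)."""
    covered = set()
    for i in selected:
        covered |= concepts[i]
    return len(covered)

def greedy_budgeted(concepts, costs, budget):
    """Cost-scaled greedy selection for budgeted submodular
    maximization, in the spirit of Lin and Bilmes (2010)."""
    selected, spent = [], 0
    candidates = set(range(len(concepts)))
    while candidates:
        base = coverage(selected, concepts)
        best, best_ratio = None, 0.0
        for i in candidates:
            if spent + costs[i] > budget:
                continue  # adding sentence i would exceed the budget
            gain = coverage(selected + [i], concepts) - base
            ratio = gain / costs[i]  # marginal gain per unit cost
            if ratio > best_ratio:
                best, best_ratio = i, ratio
        if best is None:
            break  # no affordable sentence improves the score
        selected.append(best)
        spent += costs[best]
        candidates.remove(best)
    return selected

def exact_budgeted(concepts, costs, budget):
    """Brute-force enumeration of all feasible subsets:
    exactly optimal but exponential in the number of sentences."""
    n = len(concepts)
    best_sel, best_val = [], 0
    for r in range(n + 1):
        for subset in combinations(range(n), r):
            if sum(costs[i] for i in subset) <= budget:
                val = coverage(list(subset), concepts)
                if val > best_val:
                    best_sel, best_val = list(subset), val
    return best_sel
```

On small instances the greedy objective value typically comes close to (and here matches) the brute-force optimum, mirroring the close approximation observed in Figure 2c; the theoretical guarantee for this style of greedy algorithm is a constant-factor approximation, not optimality.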

[Figure 2: Experimental analysis. (a) Effect of the damping factor d on ROUGE-2F; (b) effect of the distortion parameter η on ROUGE-2F; (c) objective values of the greedy algorithm compared with the exact solution as the number of selected sentences grows.]

The scoring function in our framework is inspired by earlier phrase-based machine translation models. Our next step is to try more fine-grained scoring schemes using similar techniques from modern approaches to statistical machine translation. To further improve the grammaticality of generated summaries, we may sacrifice a little time efficiency and use syntactic information provided by syntactic parsers.

Our framework currently uses only the single best translation. It would be more powerful to integrate machine translation and summarization by utilizing multiple possible translations.

Currently, many successful statistical machine translation systems are phrase-based and provide alignment information, and we utilize this fact in this work. It would be interesting to explore how the performance is affected if we are only provided with parallel sentences, so that alignments can only be derived using an independent aligner.

Acknowledgments

We thank all the anonymous reviewers for helpful comments and suggestions. This work was supported by the National Hi-Tech Research and Development Program (863 Program) of China (2015AA015403, 2014AA015102) and the National Natural Science Foundation of China (61170166, 61331011). The contact author of this paper, according to the meaning given to this role by Peking University, is Xiaojun Wan.

References

Miguel Almeida and Andre Martins. 2013. Fast and robust compressive summarization with dual decomposition and multi-task learning. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 196–206, Sofia, Bulgaria, August. Association for Computational Linguistics.

Taylor Berg-Kirkpatrick, Dan Gillick, and Dan Klein. 2011. Jointly learning to extract and compress. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 481–490, Portland, Oregon, USA, June. Association for Computational Linguistics.

Lidong Bing, Piji Li, Yi Liao, Wai Lam, Weiwei Guo, and Rebecca Passonneau. 2015. Abstractive multi-document summarization via phrase selection and merging. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1587–1597, Beijing, China, July. Association for Computational Linguistics.

James Clarke and Mirella Lapata. 2008. Global inference for sentence compression: An integer linear programming approach. Journal of Artificial Intelligence Research, 31:273–381.

Anirban Dasgupta, Ravi Kumar, and Sujith Ravi. 2013. Summarization through submodularity and dispersion. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1014–1022, Sofia, Bulgaria, August. Association for Computational Linguistics.

Gaël de Chalendar, Romaric Besançon, Olivier Ferret, Gregory Grefenstette, and Olivier Mesnard. 2005. Crosslingual summarization with thematic extraction, syntactic sentence simplification, and bilingual generation. In Workshop on Crossing Barriers in Text Summarization Research, 5th International Conference on Recent Advances in Natural Language Processing (RANLP 2005).

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Human Language Technologies: The 2003 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 48–54, Edmonton, May–June. Association for Computational Linguistics.

Philipp Koehn. 2009. Statistical Machine Translation. Cambridge University Press.

Chen Li, Fei Liu, Fuliang Weng, and Yang Liu. 2013. Document summarization via guided sentence compression. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 490–500, Seattle, Washington, USA, October. Association for Computational Linguistics.

Chen Li, Yang Liu, Fei Liu, Lin Zhao, and Fuliang Weng. 2014. Improving multi-documents summarization by sentence compression based on expanded constituent parse trees. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 691–701, Doha, Qatar, October. Association for Computational Linguistics.

Piji Li, Lidong Bing, Wai Lam, Hang Li, and Yi Liao. 2015. Reader-aware multi-document summarization via sparse coding. In IJCAI.

Hui Lin and Jeff Bilmes. 2010. Multi-document summarization via budgeted maximization of submodular functions. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 912–920, Los Angeles, California, June. Association for Computational Linguistics.

Hui Lin and Jeff Bilmes. 2011. A class of submodular functions for document summarization. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 510–520, Portland, Oregon, USA, June. Association for Computational Linguistics.

Marina Litvak, Mark Last, and Menahem Friedman. 2010. A new approach to improving multilingual summarization using a genetic algorithm. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 927–936, Uppsala, Sweden, July. Association for Computational Linguistics.

Hajime Morita, Ryohei Sasano, Hiroya Takamura, and Manabu Okumura. 2013. Subtree extractive summarization via submodular maximization. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1023–1032, Sofia, Bulgaria, August. Association for Computational Linguistics.

Constantin Orasan and Oana Andreea Chiorean. 2008. Evaluation of a cross-lingual Romanian-English multi-document summariser. In LREC.

Prasad Pingali, Jagadeesh Jagarlamudi, and Vasudeva Varma. 2007. Experiments in cross language query focused multi-document summarization. In Workshop on Cross Lingual Information Access: Addressing the Information Need of Multilingual Societies, IJCAI 2007.

Xiaojun Wan. 2011. Using bilingual information for cross-language document summarization. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1546–1555, Portland, Oregon, USA, June. Association for Computational Linguistics.

Lu Wang, Hema Raghavan, Vittorio Castelli, Radu Florian, and Claire Cardie. 2013. A sentence compression based framework to query-focused multi-document summarization. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1384–1394, Sofia, Bulgaria, August. Association for Computational Linguistics.

Kristian Woodsend and Mirella Lapata. 2012. Multiple aspect summarization using integer linear programming. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 233–243. Association for Computational Linguistics.

Jin-ge Yao, Xiaojun Wan, and Jianguo Xiao. 2015. Compressive document summarization via sparse optimization. In IJCAI.

David M. Zajic, Bonnie Dorr, Jimmy Lin, and Richard Schwartz. 2006. Sentence compression as a component of a multi-document summarization system. In Proceedings of the 2006 Document Understanding Workshop, New York.