Using Supervised Bigram-based ILP for Extractive Summarization

Chen Li, Xian Qian, and Yang Liu
The University of Texas at Dallas, Computer Science Department
chenli,qx,[email protected]

Abstract

In this paper, we propose a bigram-based supervised method for extractive document summarization in the integer linear programming (ILP) framework. For each bigram, a regression model is used to estimate its frequency in the reference summary. The regression model uses a variety of indicative features and is trained discriminatively to minimize the distance between the estimated and the ground truth bigram frequency in the reference summary. During testing, the sentence selection problem is formulated as an ILP problem to maximize the bigram gains. We demonstrate that our system consistently outperforms the previous ILP method on different TAC data sets, and performs competitively compared to the best results in the TAC evaluations. We also conducted various analyses to show the impact of bigram selection, weight estimation, and ILP setup.

1 Introduction

Extractive summarization is a sentence selection problem: identifying important summary sentences from one or multiple documents. Many methods have been developed for this problem, including supervised approaches that use classifiers to predict summary sentences, graph-based approaches to rank the sentences, and recent global optimization methods such as integer linear programming (ILP) and submodular methods. These global optimization methods have been shown to be quite powerful for extractive summarization, because they try to select important sentences and remove redundancy at the same time under the length constraint.

Gillick and Favre (Gillick and Favre, 2009) introduced the concept-based ILP for summarization. Their system achieved the best result in the TAC 09 summarization task based on the ROUGE evaluation metric. In this approach the goal is to maximize the sum of the weights of the language concepts that appear in the summary. They used bigrams as such language concepts. The association between the language concepts and sentences serves as the constraints. This ILP method is formally represented as below (see (Gillick and Favre, 2009) for more details):

  max   \sum_i w_i c_i                                  (1)
  s.t.  s_j \cdot Occ_{ij} \le c_i,   \forall i, j      (2)
        \sum_j s_j \cdot Occ_{ij} \ge c_i,  \forall i   (3)
        \sum_j l_j s_j \le L                            (4)
        c_i \in \{0, 1\}  \forall i                     (5)
        s_j \in \{0, 1\}  \forall j                     (6)

c_i and s_j are binary variables (shown in (5) and (6)) that indicate the presence of a concept and a sentence respectively. w_i is a concept's weight and Occ_{ij} means the occurrence of concept i in sentence j. Inequalities (2) and (3) associate the sentences and concepts. They ensure that selecting a sentence leads to the selection of all the concepts it contains, and selecting a concept only happens when it is present in at least one of the selected sentences.
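To make the formulation concrete, the following is a minimal sketch of this concept-based ILP using the PuLP modeling library; the solver choice, function name, and data layout are illustrative assumptions and not part of the original systems.

```python
import pulp

def concept_ilp(sentences, concept_weights, length_limit):
    """sentences: list of (length, set_of_concepts); concept_weights: dict concept -> w_i."""
    concepts = list(concept_weights)
    prob = pulp.LpProblem("concept_ilp", pulp.LpMaximize)
    s = [pulp.LpVariable(f"s_{j}", cat="Binary") for j in range(len(sentences))]  # sentence vars
    c = [pulp.LpVariable(f"c_{i}", cat="Binary") for i in range(len(concepts))]   # concept vars

    # Objective (1): total weight of the concepts that appear in the summary.
    prob += pulp.lpSum(concept_weights[b] * c[i] for i, b in enumerate(concepts))
    # Constraint (2): a selected sentence forces all of its concepts to be selected.
    for j, (_, sent_concepts) in enumerate(sentences):
        for i, b in enumerate(concepts):
            if b in sent_concepts:
                prob += s[j] <= c[i]
    # Constraint (3): a concept can be selected only if some selected sentence contains it.
    for i, b in enumerate(concepts):
        prob += pulp.lpSum(s[j] for j, (_, sc) in enumerate(sentences) if b in sc) >= c[i]
    # Constraint (4): total length of the selected sentences stays within the budget.
    prob += pulp.lpSum(length * s[j] for j, (length, _) in enumerate(sentences)) <= length_limit

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [j for j in range(len(sentences)) if s[j].value() == 1]
```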

There are two important components in this concept-based ILP: one is how to select the concepts (c_i); the second is how to set up their weights (w_i). Gillick and Favre (Gillick and Favre, 2009) used bigrams as concepts, which are selected from a subset of the sentences, and their document frequency as the weight in the objective function.

In this paper, we propose to find a candidate summary such that the language concepts (e.g., bigrams) in this candidate summary and the reference summary have the same frequency. We expect this restriction is more consistent with the ROUGE evaluation metric used for summarization (Lin, 2004). In addition, in the previous concept-based ILP method, the constraints are with respect to the appearance of language concepts, hence it cannot distinguish the importance of different language concepts in the reference summary. Our method can decide not only which language concepts to use in ILP, but also the frequency of these language concepts in the candidate summary. To estimate the bigram frequency in the summary, we propose to use a supervised regression model that is discriminatively trained using a variety of features. Our experiments on several TAC summarization data sets demonstrate that this proposed method outperforms the previous ILP system and often the best performing TAC system.

2 Proposed Method

2.1 Bigram Gain Maximization by ILP

We choose bigrams as the language concepts in our proposed method since they have been successfully used in previous work. In addition, we expect that the bigram-oriented ILP is consistent with the ROUGE-2 measure widely used for summarization evaluation.

We start the description of our approach for the scenario where a human abstractive summary is provided, and the task is to select sentences to form an extractive summary. Our goal is then to make the bigram frequency in this system summary as close as possible to that in the reference. For each bigram b, we define its gain:

  Gain(b, sum) = \min\{n_{b,ref}, n_{b,sum}\}                          (7)

where n_{b,ref} is the frequency of b in the reference summary, and n_{b,sum} is the frequency of b in the automatic summary. The gain of a bigram is no more than its frequency in the reference summary, hence adding redundant bigrams will not increase the gain.

The total gain of an extractive summary is defined as the sum of every bigram gain in the summary:

  Gain(sum) = \sum_b Gain(b, sum)
            = \sum_b \min\{n_{b,ref}, \sum_s z(s) \cdot n_{b,s}\}      (8)

where s is a sentence in the document, n_{b,s} is the frequency of b in sentence s, and z(s) is a binary variable indicating whether s is selected in the summary. The goal is to find z that maximizes Gain(sum) (formula (8)) under the length constraint L.

This problem can be cast as an ILP problem. First, using the fact that

  \min\{a, x\} = 0.5 (-|x - a| + x + a),  \forall x, a \ge 0

we have

  \sum_b \min\{n_{b,ref}, \sum_s z(s) \cdot n_{b,s}\}
    = 0.5 \sum_b (-|n_{b,ref} - \sum_s z(s) \cdot n_{b,s}| + n_{b,ref} + \sum_s z(s) \cdot n_{b,s})

Now the problem is equivalent to:

  max_z \sum_b (-|n_{b,ref} - \sum_s z(s) \cdot n_{b,s}| + n_{b,ref} + \sum_s z(s) \cdot n_{b,s})
  s.t.  \sum_s z(s) \cdot |S| \le L;  z(s) \in \{0, 1\}

This is equivalent to the ILP:

  max   \sum_b (\sum_s z(s) \cdot n_{b,s} - C_b)                       (9)
  s.t.  \sum_s z(s) \cdot |S| \le L                                    (10)
        z(s) \in \{0, 1\}                                              (11)
        -C_b \le n_{b,ref} - \sum_s z(s) \cdot n_{b,s} \le C_b         (12)

where C_b is an auxiliary variable we introduce that is equal to |n_{b,ref} - \sum_s z(s) \cdot n_{b,s}|, and n_{b,ref} is a constant that can be dropped from the objective function.
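The sketch below instantiates this bigram-gain ILP (formulas (9)-(12)) in the same hypothetical PuLP setup as the earlier example; sentence lengths are measured in words, and the data layout is again an assumption.

```python
import pulp

def bigram_gain_ilp(sentences, n_ref, length_limit):
    """sentences: list of (length, {bigram: count n_{b,s}}); n_ref: dict bigram -> n_{b,ref}."""
    bigrams = list(n_ref)
    prob = pulp.LpProblem("bigram_gain_ilp", pulp.LpMaximize)
    # Binary sentence variables z(s); constraint (11) is enforced by the Binary category.
    z = [pulp.LpVariable(f"z_{s}", cat="Binary") for s in range(len(sentences))]
    # Auxiliary variables C_b standing in for |n_{b,ref} - sum_s z(s) * n_{b,s}|.
    C = [pulp.LpVariable(f"C_{i}", lowBound=0) for i in range(len(bigrams))]

    def coverage(b):  # sum_s z(s) * n_{b,s}
        return pulp.lpSum(z[s] * counts.get(b, 0) for s, (_, counts) in enumerate(sentences))

    # Objective (9); the constant n_{b,ref} is dropped, as noted in the text.
    prob += pulp.lpSum(coverage(b) - C[i] for i, b in enumerate(bigrams))
    # Length constraint (10).
    prob += pulp.lpSum(z[s] * length for s, (length, _) in enumerate(sentences)) <= length_limit
    # Constraint (12): -C_b <= n_{b,ref} - sum_s z(s) * n_{b,s} <= C_b.
    for i, b in enumerate(bigrams):
        prob += n_ref[b] - coverage(b) <= C[i]
        prob += coverage(b) - n_ref[b] <= C[i]

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [s for s in range(len(sentences)) if z[s].value() == 1]
```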

2.2 Regression Model for Bigram Frequency Estimation

In the previous section, we assume that n_{b,ref} is at hand (the reference abstractive summary is given) and propose a bigram-based optimization framework for extractive summarization. However, for the summarization task, the bigram frequency is unknown, and thus our first goal is to estimate such frequency. We propose to use a regression model for this.

Since a bigram's frequency depends on the summary length (L), we use a normalized frequency in our method. Let n_{b,ref} = N_{b,ref} \cdot L, where N_{b,ref} = n_{b,ref} / \sum_b n_{b,ref} is the normalized frequency in the summary. Now the problem is to automatically estimate N_{b,ref}.

Since the normalized frequency N_{b,ref} is a real number, we choose to use a logistic regression model to predict it:

  N_{b,ref} = exp\{w' f(b)\} / \sum_j exp\{w' f(b_j)\}                         (13)

where f(b_j) is the feature vector of bigram b_j and w' is the corresponding feature weight. Since even for identical bigrams b_i = b_j, their feature vectors may be different (f(b_i) \ne f(b_j)) due to their different contexts, we sum up frequencies for identical bigrams \{b_i | b_i = b\}:

  N_{b,ref} = \sum_{i: b_i = b} N_{b_i,ref}
            = \sum_{i: b_i = b} exp\{w' f(b_i)\} / \sum_j exp\{w' f(b_j)\}     (14)

To train this regression model using the given reference abstractive summaries, rather than trying to minimize the squared error as typically done, we propose a new objective function. Since the normalized frequency satisfies the probability constraint \sum_b N_{b,ref} = 1, we propose to use KL divergence to measure the distance between the estimated frequencies and the ground truth values. The objective function for training is thus to minimize the KL distance:

  min \sum_b \tilde{N}_{b,ref} log (\tilde{N}_{b,ref} / N_{b,ref})             (15)

where \tilde{N}_{b,ref} is the true normalized frequency of bigram b in the reference summaries.

Finally, we replace N_{b,ref} in Formula (15) with Eq (14) and get the objective function below:

  max \sum_b \tilde{N}_{b,ref} log (\sum_{i: b_i = b} exp\{w' f(b_i)\} / \sum_j exp\{w' f(b_j)\})   (16)

This shares the same form as the contrastive estimation proposed by (Smith and Eisner, 2005). We use the gradient descent method for parameter estimation; the initial w is set to zero.
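As a concrete illustration of Eqs. (13)-(16), the following is a small numpy sketch that scores every bigram instance with a softmax over its features, sums the scores of identical bigrams, and fits the weights by gradient descent on the KL objective; the feature matrix layout and the hyperparameters are assumptions for illustration only.

```python
import numpy as np

def estimate_N(w, F):
    """F: (num_instances, num_features) matrix, one row per bigram instance (Eq. 13)."""
    scores = np.exp(F @ w)
    return scores / scores.sum()

def train(F, instance_bigram, N_true, lr=0.1, iters=200):
    """instance_bigram: int array mapping each instance to its bigram id;
       N_true: true normalized reference frequency per bigram id (the tilde-N values)."""
    w = np.zeros(F.shape[1])                  # initial w set to zero, as in the text
    for _ in range(iters):
        p = estimate_N(w, F)
        # Sum instance probabilities for identical bigrams (Eq. 14).
        N_est = np.zeros(len(N_true))
        np.add.at(N_est, instance_bigram, p)
        # Gradient of the KL objective (Eq. 15) through the softmax:
        # dKL/dp_k = -N_true[b(k)] / N_est[b(k)],  dp_k/dw = p_k * (f_k - sum_j p_j f_j).
        dKL_dp = -N_true[instance_bigram] / np.maximum(N_est[instance_bigram], 1e-12)
        mean_F = p @ F
        grad = ((dKL_dp * p)[:, None] * (F - mean_F)).sum(axis=0)
        w -= lr * grad
    return w
```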

2.3 Features

Each bigram is represented using a set of features in the above regression model. We use two types of features: word level and sentence level features. Some of these features have been used in previous work (Aker and Gaizauskas, 2009; Brandow et al., 1995; Edmundson, 1969; Radev, 2001):

• Word Level:
  – 1. Term frequency1: The frequency of this bigram in the given topic.
  – 2. Term frequency2: The frequency of this bigram in the selected sentences [1].
  – 3. Stop word ratio: Ratio of stop words in this bigram. The value can be {0, 0.5, 1}.
  – 4. Similarity with topic title: The number of common tokens in these two strings, divided by the length of the longer string.
  – 5. Similarity with description of the topic: Similarity of the bigram with the topic description (see the next data section about the given topics in the summarization task).

• Sentence Level: (information of the sentence containing the bigram)
  – 6. Sentence ratio: Number of sentences that include this bigram, divided by the total number of the selected sentences.
  – 7. Sentence similarity: Sentence similarity with the topic's query, which is the concatenation of topic title and description.
  – 8. Sentence position: Sentence position in the document.
  – 9. Sentence length: The number of words in the sentence.
  – 10. Paragraph starter: Binary feature indicating whether this sentence is the beginning of a paragraph.

[1] See the next section about the sentence selection step.
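For illustration, a possible implementation of a few of these features is sketched below, following the textual definitions directly; the stop word list and tokenization are assumptions, and the remaining features would be computed analogously.

```python
# Assumed small stop word list; a real system would use a standard list.
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "for"}

def stop_word_ratio(bigram):
    # Feature 3: 0, 0.5, or 1 depending on how many of the two words are stop words.
    return sum(w in STOP_WORDS for w in bigram) / 2.0

def title_similarity(bigram, title_tokens):
    # Feature 4: common tokens divided by the length of the longer token sequence.
    common = len(set(bigram) & set(title_tokens))
    return common / max(len(bigram), len(title_tokens))

def sentence_ratio(bigram, selected_sentences):
    # Feature 6: fraction of the selected (tokenized) sentences containing this bigram.
    contains = sum(bigram in zip(s, s[1:]) for s in selected_sentences)
    return contains / len(selected_sentences)
```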

3 Experiments

3.1 Data

We evaluate our method using several recent TAC data sets, from 2008 to 2011. The TAC summarization task is to generate at most 100-word summaries from 10 documents for a given topic query (with a title and more detailed description). For model training, we also included two years' DUC data (2006 and 2007). When evaluating on one TAC data set, we use the other years of the TAC data plus the two DUC data sets as the training data.

3.2 Summarization System

We use the same system pipeline described in (Gillick et al., 2008; McDonald, 2007). The key modules in the ICSI ILP system (Gillick et al., 2008) are briefly described below (a code sketch of the pre-selection steps and our modified Step 4 follows at the end of this subsection).

• Step 1: Clean documents, split text into sentences.
• Step 2: Extract bigrams from all the sentences, then select those bigrams with document frequency equal to or more than 3. We call this subset the initial bigram set in the following.
• Step 3: Select relevant sentences that contain at least one bigram from the initial bigram set.
• Step 4: Feed the ILP with the sentences and the bigram set to get the result.
• Step 5: Order the sentences identified by the ILP as the final summary.

The difference between the ICSI system and ours is in the 4th step. In our method, we first extract all the bigrams from the selected sentences and then estimate each bigram's N_{b,ref} using the regression model. Then we use the top-n bigrams with their N_{b,ref} and all the selected sentences in our proposed ILP module for summary sentence selection. When training our bigram regression model, we use each of the 4 reference summaries separately, i.e., the bigram frequency is obtained from one reference summary. The same pre-selection of sentences described above is also applied in training, that is, the bigram instances used in training are from these selected sentences and the reference summary.
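The following sketch puts Steps 2-3 and our modified Step 4 together; the tokenization, the regression model interface (regression_model.estimate), and the reuse of bigram_gain_ilp from the earlier sketch are assumptions.

```python
from collections import Counter

def get_bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

def preselect(documents):
    """documents: list of documents, each a list of tokenized sentences."""
    # Step 2: initial bigram set = bigrams with document frequency >= 3.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update({b for sent in doc for b in get_bigrams(sent)})
    initial_bigrams = {b for b, df in doc_freq.items() if df >= 3}
    # Step 3: keep sentences containing at least one bigram from that set.
    selected = [sent for doc in documents for sent in doc
                if initial_bigrams & set(get_bigrams(sent))]
    return initial_bigrams, selected

def our_step4(selected_sentences, regression_model, summary_len=100, top_n=100):
    """Modified Step 4: estimate N_{b,ref} for every bigram in the selected sentences,
    keep the top-n bigrams, and run the bigram-gain ILP (see the earlier sketch)."""
    all_bigrams = {b for sent in selected_sentences for b in get_bigrams(sent)}
    # Hypothetical interface: returns the estimated normalized frequency of bigram b.
    est = {b: regression_model.estimate(b, selected_sentences) for b in all_bigrams}
    top = dict(sorted(est.items(), key=lambda kv: kv[1], reverse=True)[:top_n])
    n_ref = {b: w * summary_len for b, w in top.items()}     # n_{b,ref} = N_{b,ref} * L
    sents = [(len(s), Counter(get_bigrams(s))) for s in selected_sentences]
    return bigram_gain_ilp(sents, n_ref, summary_len)
```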
4 Experiment and Analysis

4.1 Experimental Results

Table 1 shows the ROUGE-2 results of our proposed system, the ICSI system, and also the best performing system in the NIST TAC evaluation. We can see that our proposed system consistently outperforms the ICSI ILP system (the gain is statistically significant based on ROUGE's 95% confidence interval results). Compared to the best reported TAC results, our method has better performance on three data sets, except the 2011 data. Note that the best performing system for the 2009 data is the ICSI ILP system, with an additional compression step. Our ILP method is purely extractive. Even without using compression, our approach performs better than the full ICSI system. The best performing system for the 2011 data also has some compression module. We expect that after applying sentence compression and merging, we will have even better performance; however, our focus in this paper is on bigram-based extractive summarization.

         ICSI ILP   Proposed System   TAC Rank1 System
  2008   0.1023     0.1076            0.1038
  2009   0.1160     0.1246            0.1216
  2010   0.1003     0.1067            0.0957
  2011   0.1271     0.1327            0.1344

Table 1: ROUGE-2 summarization results.

There are several differences between the ICSI system and our proposed method. First is the bigrams (concepts) used. We use the top 100 bigrams from our bigram estimation module, whereas the ICSI system just used the initial bigram set described in Section 3.2. Second, the weights for those bigrams differ. We used the estimated value from the regression model; the ICSI system just uses the bigram's document frequency in the original text as the weight. Finally, the two systems use different ILP setups. To analyze which factors (or all of them) explain the performance difference, we conducted various controlled experiments for these three factors (bigrams, weights, ILP). All of the following experiments use the TAC 2009 data as the test set.

4.2 Effect of Bigram Weights

In this experiment, we vary the weighting methods for the two systems: our proposed method and the ICSI system. We use three weighting setups: the estimated bigram frequency value in our method, document frequency, or term frequency from the original text. Tables 2 and 3 show the results using the top 100 bigrams from our system and the initial bigram set from the ICSI system respectively. We also evaluate using the two different ILP configurations in these experiments.

  #   Weight            ILP        ROUGE-2
  1   Estimated value   Proposed   0.1246
  2   Estimated value   ICSI       0.1178
  3   Document freq     Proposed   0.1109
  4   Document freq     ICSI       0.1132
  5   Term freq         Proposed   0.1116
  6   Term freq         ICSI       0.1080

Table 2: Results using different weighting methods on the top 100 bigrams generated from our proposed system.

  #   Weight            ILP        ROUGE-2
  1   Estimated value   Proposed   0.1157
  2   Estimated value   ICSI       0.1161
  3   Document freq     Proposed   0.1101
  4   Document freq     ICSI       0.1160
  5   Term freq         Proposed   0.1109
  6   Term freq         ICSI       0.1072

Table 3: Results using different weighting methods based on the initial bigram sets. The average number of bigrams is around 80 for each topic.

First of all, we can see that for both ILP systems, our estimated bigram weights outperform the other frequency-based weights. For the ICSI ILP system, using bigram document frequency achieves better performance than term frequency (which verifies why document frequency is used in their system). In contrast, for our ILP method, the bigram's term frequency is slightly more useful than its document frequency. This indicates that our estimated value is more related to the bigram's term frequency in the original text. When the weight is document frequency, the ICSI result is better than our proposed ILP; whereas when using term frequency as the weights, our ILP has better results, again suggesting term frequency fits our ILP system better. When the weight is the estimated value, the results depend on the bigram set used. The ICSI ILP performs slightly better than ours when it is equipped with the initial bigram set, but our proposed ILP has much better results using our selected top-100 bigrams. This shows that the size and quality of the bigrams has an impact on the ILP modules.

4.3 The Effect of Bigram Set's Size

In our proposed system, we use the 100 top bigrams. There are about 80 bigrams used in the ICSI ILP system. A natural question to ask is the impact of the number of bigrams and their quality on the summarization system. Table 4 shows some statistics of the bigrams. We can see that about one third of the bigrams in the reference summary are in the original text (127.3 out of 321.93), verifying that people do use different words/bigrams when writing abstractive summaries. We mentioned that we only use the top-N (N is 100 in the previous experiments) bigrams in our summarization system. On one hand, this is to save computational cost for the ILP module. On the other hand, we see from the table that only 127 of these more than 2K bigrams are in the reference summary and are thus expected to help the summary responsiveness. Including all the bigrams would lead to huge noise.

  # bigrams in ref summary                             321.93
  # bigrams in text and ref summary                    127.3
  # bigrams used in our regression model
    (i.e., in selected sentences)                      2140.7

Table 4: Bigram statistics. The numbers are the average ones for each topic.

Fig 1 shows the bigram coverage (number of bigrams used in the system that are also in the reference summaries) when we vary the number N of selected bigrams. As expected, we can see that as N increases, there are more reference summary bigrams included in the system. There are 25 summary bigrams in the top-50 bigrams and about 38 in the top-100 bigrams. Compared with the ICSI system that has around 80 bigrams in the initial bigram set and 29 in the reference summary, our estimation module has better coverage.

[Figure 1: Coverage of bigrams (number of bigrams in reference summary) when varying the number of bigrams used in the ILP systems; x-axis: number of selected bigrams (50-3200), y-axis: number of bigrams in both the selected set and the reference.]

Increasing the number of bigrams used in the system will lead to better coverage; however, the incorrect bigrams also increase and have a negative impact on the system performance. To examine the best tradeoff, we conduct experiments by choosing different top-N bigram sets for the two ILP systems, as shown in Fig 2. For both ILP systems, we used the estimated weight value for the bigrams.

We can see that the ICSI ILP system performs better when the input bigrams have less noise (those bigrams that are not in the summary). However, our proposed method is slightly more robust to this kind of noise, possibly because of the weights we use in our system – the noisy bigrams have lower weights and thus less impact on the final system performance. Overall the two systems have similar trends: performance increases at the beginning when using more bigrams, and after a certain point starts degrading with too many bigrams. The optimal number of bigrams differs for the two systems, with a larger number of bigrams in our method. We also notice that the ICSI ILP system achieved a ROUGE-2 of 0.1218 when using the top 60 bigrams, which is better than using the initial bigram set in their method (0.1160).

[Figure 2: Summarization performance when varying the number of bigrams for two systems; x-axis: number of selected bigrams (40-130), y-axis: ROUGE-2, curves: Proposed ILP and ICSI.]

4.4 Oracle Experiments

Based on the above analysis, we can see the impact of the bigram set and their weights. The following experiments are designed to demonstrate the best system performance we can achieve if we have access to good quality bigrams and weights. Here we use the information from the reference summary. The first is an oracle experiment, where we use all the bigrams from the reference summaries that are also in the original text. In the ICSI ILP system, the weights are the document frequency from the multiple reference summaries. In our ILP module, we use the term frequency of the bigram. The oracle results are shown in Table 5. We can see these are significantly better than the automatic systems.

  ILP System   ROUGE-2
  Our ILP      0.2124
  ICSI ILP     0.2128

Table 5: Oracle experiment: using bigrams and their frequencies in the reference summary as weights.

From Table 5, we notice that the ICSI ILP performs marginally better than our proposed ILP. We hypothesize that one reason may be that many bigrams in the reference summary only appear once. Table 6 shows the frequency of the bigrams in the summary. Indeed 85% of the bigrams appear only once and no bigram appears more than 9 times. For the majority of the bigrams, our method and the ICSI ILP are the same. For the others, our system has a slight disadvantage when using the reference term frequency. We expect the high term frequencies may need to be properly smoothed/normalized.

  Freq   1     2    3     4     5     6     7     8     9
  Ave#   277   32   7.5   3.2   1.1   0.3   0.1   0.1   0.04

Table 6: Average number of bigrams for each term frequency in one topic's reference summary.

We also treat the oracle results as the gold standard for extractive summarization and compare how the two automatic summarization systems differ at the sentence level. This is different from the results in Table 1, which are the ROUGE results comparing to human written abstractive summaries at the n-gram level. We found that among the 188 sentences in this gold standard, our system hits 31 and ICSI only has 23. This again shows that our system has better performance, not just at the word level based on ROUGE measures, but also at the sentence level. There are on average 3 different sentences per topic between these two results.

In the second experiment, after we obtain the estimated N_{b,ref} for every bigram in the selected sentences from our regression model, we only keep those bigrams that are in the reference summary, and use the estimated weights for both ILP modules. Table 7 shows the results. We can consider these as the upper bound the system can achieve if we use the automatically estimated weights for the correct bigrams. In this experiment the ICSI ILP still performs better than ours. This might be attributed to the fact that there is less noise (all the bigrams are the correct ones) and thus the ICSI ILP system performs well. We can see that these results are worse than the previous oracle experiments, but are better than using the automatically generated bigrams, again showing the bigram and weight estimation is critical for summarization.

  #   Weight            ILP        ROUGE-2
  1   Estimated value   Proposed   0.1888
  2   Estimated value   ICSI       0.1942

Table 7: Summarization results when using the estimated weights and only keeping the bigrams that are in the reference summary.

4.5 Effect of Training Set

Since our method uses supervised learning, we conduct an experiment to show the impact of the training size. In TAC's data, each topic has two sets of documents. For set A, the task is a standard summarization, and there are 4 reference summaries, each 100 words long; for set B, it is an update summarization task – the summary includes information not mentioned in the summary from set A. There are also 4 reference summaries, with 400 words in total. Table 8 shows the results on the 2009 data when using the data from different years and different sets for training. We notice that when the training data only contains set A, the performance is always better than using set B or the combined set A and B. This is not surprising because of the different task definition. Therefore, for the rest of the study on data size impact, we only use data set A from the TAC data and the DUC data as the training set. In total there are about 233 topics from the two years' DUC data (06, 07) and three years' TAC data (08, 10, 11). We incrementally add 20 topics every time (from DUC06 to TAC11) and plot the learning curve, as shown in Fig 3. As expected, more training data results in better performance.

  Training Set      # Topics   ROUGE-2
  08 Corpus (A)     48         0.1192
  08 Corpus (B)     48         0.1178
  08 Corpus (A+B)   96         0.1188
  10 Corpus (A)     46         0.1174
  10 Corpus (B)     46         0.1167
  10 Corpus (A+B)   92         0.1170
  11 Corpus (A)     44         0.1157
  11 Corpus (B)     44         0.1130
  11 Corpus (A+B)   88         0.1140

Table 8: Summarization performance when using different training corpora.

[Figure 3: Learning curve; x-axis: number of training topics (20-240), y-axis: ROUGE-2.]

4.6 Summary of Analysis

The previous experiments have shown the impact of the three factors: the quality of the bigrams themselves, the weights used for these bigrams, and the ILP module. We found that the bigrams and their weights are critical for both ILP setups. However, there is negligible difference between the two ILP methods.

An important part of our system is the supervised method for bigram and weight estimation. We have already seen that for the previous ILP method, when using our bigrams together with the weights, better performance can be achieved. Therefore we ask the question whether this is simply because we use supervised learning, or whether our proposed regression model is the key. To answer this, we trained a simple supervised binary classifier for bigram prediction (positive means that a bigram appears in the summary) using the same set of features as used in our bigram weight estimation module, and then used their document frequency in the ICSI ILP system. The result for this method is 0.1128 on the TAC 2009 data. This is much lower than our result. We originally expected that using the supervised method might outperform the unsupervised bigram selection which only uses term frequency information. Further experiments are needed to investigate this. From this we can see that it is not just the supervised methods or using annotated data that yields the overall improved system performance, but rather our proposed regression setup for bigrams is the main reason.

5 Related Work

We briefly describe some prior work on summarization in this section. Unsupervised methods have been widely used. In particular, recently several optimization approaches have demonstrated competitive performance for the extractive summarization task. Maximum marginal relevance (MMR) (Carbonell and Goldstein, 1998) uses a greedy algorithm to find summary sentences. (McDonald, 2007) improved the MMR algorithm to dynamic programming. They used a modified objective function in order to consider whether the selected sentence is globally optimal. Sentence-level ILP was also first introduced in (McDonald, 2007), but (Gillick and Favre, 2009) revised it to concept-based ILP. (Woodsend and Lapata, 2012) utilized ILP to jointly optimize different aspects including content selection, surface realization, and rewrite rules in summarization. (Galanis et al., 2012) uses ILP to jointly maximize the importance of the sentences and their diversity in the summary. (Berg-Kirkpatrick et al., 2011) applied a similar idea to conduct sentence compression and extraction for multiple document summarization. (Jin et al., 2010) made a comparative study on sentence/concept selection and pairwise and list ranking algorithms, and concluded that ILP performed better than MMR and the diversity penalty strategy in sentence/concept selection. Other global optimization methods include submodularity (Lin and Bilmes, 2010) and graph-based approaches (Erkan and Radev, 2004; Leskovec et al., 2005; Mihalcea and Tarau, 2004). Various unsupervised probabilistic topic models have also been investigated for summarization and shown promising results. For example, (Celikyilmaz and Hakkani-Tür, 2011) used them to model the hidden abstract concepts across documents as well as the correlation between these concepts to generate topically coherent and non-redundant summaries. (Darling and Song, 2011) applied them to separate the semantically important words from the low-content function words.

In contrast to these unsupervised approaches, there are also various efforts on supervised learning for summarization where a model is trained to predict whether a sentence is in the summary or not. Different features and classifiers have been explored for this task, such as the Bayesian method (Kupiec et al., 1995), maximum entropy (Osborne, 2002), CRF (Galley, 2006), and recently reinforcement learning (Ryang and Abekawa, 2012). (Aker et al., 2010) used discriminative reranking on multiple candidates generated by A* search. Recently, research has also been performed to address some issues in the supervised setup, such as the class data imbalance problem (Xie and Liu, 2010).

In this paper, we propose to incorporate the supervised method into the concept-based ILP framework. Unlike previous work using sentence-based supervised learning, we use a regression model to estimate the bigrams and their weights, and use these to guide sentence selection. Compared to the direct sentence-based classification or regression methods mentioned above, our method has an advantage. When abstractive summaries are given, one needs to use that information to automatically generate reference labels (a sentence is in the summary or not) for extractive summarization. Most researchers have used the similarity between a sentence in the document and the abstractive summary for labeling. This is not a perfect process. In our method, we do not need to generate this extra label for model training since ours is based on bigrams – it is straightforward to obtain the reference frequency for bigrams by simply looking at the reference summary. We expect our approach also paves an easy way for future automatic abstractive summarization. One previous study that is most related to ours is (Conroy et al., 2011), which utilized a Naive Bayes classifier to predict the probability of a bigram, and applied ILP for the final sentence selection. They used more features than ours, whereas we use a discriminatively trained regression model and a modified ILP framework. Our proposed method performs better than their reported results on the TAC 2011 data. Another study closely related to ours is (Davis et al., 2012), which leveraged Latent Semantic Analysis (LSA) to produce term weights and selected summary sentences by computing an approximate solution to the Budgeted Maximal Coverage problem.

6 Conclusion and Future Work

In this paper, we leverage the ILP method as a core component in our summarization system. Different from the previous ILP summarization approach, we propose a supervised learning method (a discriminatively trained regression model) to determine the importance of the bigrams fed to the ILP module. In addition, we revise the ILP to maximize the bigram gain (which is expected to be highly correlated with ROUGE-2 scores) rather than the concept/bigram coverage. Our proposed method yielded better results than the previous state-of-the-art ILP system on different TAC data sets. From a series of experiments, we found that there is little difference between the two ILP modules, and that the improved system performance is attributed to the fact that our proposed supervised bigram estimation module can successfully gather the important bigrams and assign them appropriate weights. There are several directions that warrant further research. We plan to consider the context of bigrams to better predict whether a bigram is in the reference summary. We will also investigate the relationship between concepts and sentences, which may help move towards abstractive summarization.

Acknowledgments

This work is partly supported by DARPA under Contract No. HR0011-12-C-0016 and FA8750-13-2-0041, and NSF IIS-0845484. Any opinions expressed in this material are those of the authors and do not necessarily reflect the views of DARPA or NSF.

References

Ahmet Aker and Robert Gaizauskas. 2009. Summary generation for toponym-referenced images using object type language models. In Proceedings of the International Conference RANLP.

Ahmet Aker, Trevor Cohn, and Robert Gaizauskas. 2010. Multi-document summarization using A* search and discriminative training. In Proceedings of the EMNLP.

Taylor Berg-Kirkpatrick, Dan Gillick, and Dan Klein. 2011. Jointly learning to extract and compress. In Proceedings of the ACL.

Ronald Brandow, Karl Mitze, and Lisa F. Rau. 1995. Automatic condensation of electronic publications by sentence selection. Inf. Process. Manage.

Jaime Carbonell and Jade Goldstein. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the SIGIR.

Asli Celikyilmaz and Dilek Hakkani-Tür. 2011. Discovery of topically coherent sentences for extractive summarization. In Proceedings of the ACL.

John M. Conroy, Judith D. Schlesinger, Jeff Kubina, Peter A. Rankel, and Dianne P. O'Leary. 2011. CLASSY 2011 at TAC: Guided and multi-lingual summaries and evaluation metrics. In Proceedings of the TAC.

William M. Darling and Fei Song. 2011. Probabilistic document modeling for syntax removal in text summarization. In Proceedings of the ACL.

Sashka T. Davis, John M. Conroy, and Judith D. Schlesinger. 2012. OCCAMS - an optimal combinatorial covering algorithm for multi-document summarization. In Proceedings of the ICDM.

H. P. Edmundson. 1969. New methods in automatic extracting. J. ACM.

Günes Erkan and Dragomir R. Radev. 2004. LexRank: graph-based lexical centrality as salience in text summarization. J. Artif. Int. Res.

Dimitrios Galanis, Gerasimos Lampouras, and Ion Androutsopoulos. 2012. Extractive multi-document summarization with integer linear programming and support vector regression. In Proceedings of the COLING.

Michel Galley. 2006. A skip-chain conditional random field for ranking meeting utterances by importance. In Proceedings of the EMNLP.

Dan Gillick and Benoit Favre. 2009. A scalable global model for summarization. In Proceedings of the Workshop on Integer Linear Programming for Natural Language Processing on NAACL.

Dan Gillick, Benoit Favre, and Dilek Hakkani-Tür. 2008. The ICSI summarization system at TAC 2008.

Feng Jin, Minlie Huang, and Xiaoyan Zhu. 2010. A comparative study on ranking and selection strategies for multi-document summarization. In Proceedings of the COLING.

Julian Kupiec, Jan Pedersen, and Francine Chen. 1995. A trainable document summarizer. In Proceedings of the SIGIR.

Jure Leskovec, Natasa Milic-Frayling, and Marko Grobelnik. 2005. Impact of linguistic analysis on the semantic graph coverage and learning of document extracts. In Proceedings of the AAAI.

Hui Lin and Jeff Bilmes. 2010. Multi-document summarization via budgeted maximization of submodular functions. In Proceedings of the NAACL.

Chin-Yew Lin. 2004. ROUGE: a package for automatic evaluation of summaries. In Proceedings of the ACL.

Ryan McDonald. 2007. A study of global inference algorithms in multi-document summarization. In Proceedings of the European Conference on IR Research.

Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. In Proceedings of the EMNLP.

Miles Osborne. 2002. Using maximum entropy for sentence extraction. In Proceedings of the ACL-02 Workshop on Automatic Summarization.

Dragomir R. Radev. 2001. Experiments in single and multidocument summarization using MEAD. In First Document Understanding Conference.

Seonggi Ryang and Takeshi Abekawa. 2012. Framework of automatic text summarization using reinforcement learning. In Proceedings of the EMNLP.

Noah A. Smith and Jason Eisner. 2005. Contrastive estimation: training log-linear models on unlabeled data. In Proceedings of the ACL.

Kristian Woodsend and Mirella Lapata. 2012. Multiple aspect summarization using integer linear programming. In Proceedings of the EMNLP.

Shasha Xie and Yang Liu. 2010. Improving supervised learning for meeting summarization using sampling and regression. Comput. Speech Lang.
