Using External Resources and Joint Learning for Bigram Weighting in ILP-Based Multi-Document Summarization

Using External Resources and Joint Learning for Bigram Weighting in ILP-Based Multi-Document Summarization Chen Li1, Yang Liu1, Lin Zhao2 1 Computer Science Department, The University of Texas at Dallas Richardson, Texas 75080, USA 2 Research and Technology Center, Robert Bosch LLC Palo Alto, California 94304, USA chenli,[email protected] { [email protected] } { } Abstract whether or not a sentence is in the summary, or un- supervised methods such as graph-based approaches Some state-of-the-art summarization systems to rank the sentences. Recently global optimiza- use integer linear programming (ILP) based tion methods such as integer linear programming methods that aim to maximize the important (ILP) have been shown to be quite powerful for this concepts covered in the summary. These con- task. For example, Gillick et al. (2009) used ILP cepts are often obtained by selecting bigrams to achieve the best result in the TAC 09 summa- from the documents. In this paper, we improve such bigram based ILP summarization meth- rization task. The core idea of this summarization ods from different aspects. First we use syn- method is to select the summary sentences by maxi- tactic information to select more important bi- mizing the sum of the weights of the language con- grams. Second, to estimate the importance of cepts that appear in the summary. Bigrams are often the bigrams, in addition to the internal features used as these language concepts because Gillick et based on the test documents (e.g., document al. (2009) stated that the bigrams gave consistently frequency, bigram positions), we propose to better performance than unigrams or trigrams for a extract features by leveraging multiple external resources (such as word embedding from variety of ROUGE measures. The association be- additional corpus, Wikipedia, Dbpedia, Word- tween the language concepts and sentences serves Net, SentiWordNet). The bigram weights are as the constraints. This ILP method is formally rep- then trained discriminatively in a joint learn- resented as below (see (Gillick et al., 2009) for more ing model that predicts the bigram weights details): and selects the summary sentences in the ILP framework at the same time. We demonstrate that our system consistently outperforms the max i wici (1) prior ILP method on different TAC data sets, s.t. s Occ c (2) and performs competitively compared to other j P ij ≤ i previously reported best results. We also con- sjOccij ci (3) j ≥ ducted various analyses to show the contribu- j ljsj L (4) tions of different components. P ≤ c 0, 1 i (5) Pi ∈ { } ∀ s 0, 1 j (6) j ∈ { } ∀ 1 Introduction ci and sj are binary variables that indicate the pres- Extractive summarization is a sentence selection ence of a concept and a sentence respectively. lj problem: identifying important summary sentences is the sentence length and L is maximum length of from one or multiple documents. Many methods the generated summary. wi is a concept’s weight have been developed for this problem, including su- and Occij means the occurrence of concept i in sen- pervised approaches that use a classifier to predict tence j. Inequalities (2)(3) associate the sentences 778 Human Language Technologies: The 2015 Annual Conference of the North American Chapter of the ACL, pages 778–787, Denver, Colorado, May 31 – June 5, 2015. c 2015 Association for Computational Linguistics and concepts. They ensure that selecting a sen- geted Maximal Coverage problem in (Khuller et al., tence leads to the selection of all the concepts it con- 1999). The concept-based ILP system performed tains, and selecting a concept only happens when it very well in the TAC 2008 and 2009 summariza- is present in at least one of the selected sentences. tion task (Gillick et al., 2008; Gillick et al., 2009). In such ILP-based summarization methods, how After that, the global optimization strategy attracted to determine the concepts and measure their weights increasing attention in the summarization task. Lin is the key factor impacting the system performance. and Bilmes (2010) treated the summarization task as Intuitively, if we can successfully identify the im- a maximization problem of submodular functions. portant key bigrams to use in the ILP system, or as- Davis et al. (2012) proposed an optimal combina- sign large weights to those important bigrams, the torial covering algorithm combined with LSA to system generated summary sentences will contain as measure the term weight for extractive summariza- many important bigrams as possible. The oracle ex- tion. Takamura and Okumura (2009) also defined periment in (Gillick et al., 2008) showed that if they the summarization problem as a maximum cover- just use the bigrams extracted from human generated age problem and used a branch-and-bound method summaries as the input of the ILP system, much bet- to search for the optimal solution. Li et al. (2013b) ter ROUGE scores can be obtained than using the used the same ILP framework as (Gillick et al., automatically selected bigrams. 2009), but incorporated a supervised model to es- In this paper, we adopt the ILP summarization timate the bigram frequency in the final summary. framework, but make improvement from three as- Similar optimization methods are also widely pects. First, we use the part-of-speech tag and used in the abstractive summarization task. Martins constituent parse information to identify important and Smith (2009) leveraged ILP technique to jointly bigram candidates: bigrams from base NP (noun select and compress sentences for multi-document phrases) and bigrams containing verbs or adjectives. summarization. A novel summary guided sentence This bigram selection method allows us to keep the compression was proposed by (Li et al., 2013a) and important bigrams and filter useless ones. Second, to it successfully improved the summarization perfor- estimate the bigrams’ weights, in addition to using mance. Woodsend and Lapata (2012) and Li et information from the test documents, such as doc- al. (2014) both leveraged constituent parser trees to ument frequency, syntactic role in a sentence, etc., help sentence compression, which is also modeled we utilize a variety of external resources, including in the optimization framework. But these kinds of a corpus of news articles with human generated sum- work involve using complex linguistic information, maries, Wiki documents, description of name en- often based on syntactic analysis. tities from DBpedia, WordNet, and SentiWordNet. Since the language concepts (or bigrams) can be Discriminative features are computed based on these considered as key phrases of the documents, the external resources with the goal to better represent other line related to our work is how to extract and the importance of a bigram and its semantic similar- measure the importance of key phrases from doc- ity with the given query. Finally, we propose to use uments. In particular, our work is related to key a joint bigram weighting and sentence selection pro- phrase extraction by using external resources. A cess to train the feature weights. Our experimental survey by (Hasan and Ng, 2014) showed that us- results on multiple TAC data sets show the competi- ing external resources to extract and measure key tiveness of our proposed methods. phrases is very effective. In (Medelyan et al., 2009), Wikipedia-based key phrases are determined based 2 Related Work on a candidate’s document frequency multiplied by the ratio of the number of Wikipedia articles con- Optimization methods have been widely used in taining the candidate as a link to the number of ar- extractive summarization lately. McDonald (2007) ticles containing the candidate. Query logs were first introduced the sentence level ILP for summa- also used as another external resource by (Yih et rization. Later Gillick et al. (2009) revised it to al., 2006) to exploit the observation that a candidate concept-based ILP, which is similar to the Bud- is potentially important if it was used as a search 779 query. Similarly terminological databases have been bigrams/concepts in the ILP optimization method. exploited to encode the salience of candidate key Details are described in (Gillick et al., 2009). phrases in scientific papers (Lopez and Romary, In this paper, rather than considering all the bi- 2010). In summarization, external information has grams, we propose to utilize syntactic information to also been used to measure word salience. Some help select important bigrams. Intuitively bigrams TAC systems like (Kumar et al., 2010; Jia et al., containing content words carry more topic related 2010) used Wiki as an important external resource information. As proven in (Klavans and Kan, 1998), to measure the words’ importance, which helped im- nouns, verbs, and adjectives were indeed beneficial prove the summarization results. Hong and Nenkova in document analysis. Therefore we focus on choos- (2014) introduced a supervised model for predicting ing bigrams containing these words. First, we use word importance that incorporated a rich set of fea- a bottom-up strategy to go through the constituent tures. Tweets information is leveraged by (Wei and parse tree and identify the ‘NP’ nodes in the low- Gao, 2014) to help generate news highlights. est level of the tree. Then all the bigrams in these In this paper our focus is on choosing useful bi- base NPs are kept as candidates. Second, we find grams and estimating accurate weights to use in the the verbs and adjectives from the sentence based on concept-based ILP methods. We explore many ex- the POS tags, and construct bigrams by concatenat- ternal resources to extract features for bigram candi- ing the previous or the next word of that verb or ad- dates, and more importantly, propose to estimate the jective. If these bigrams are not included in those feature weights in a joint process via structured per- already found from the base NPs, they are added to ceptron learning that optimizes summary sentence the bigram candidates.

Using External Resources and Joint Learning for Bigram Weighting in ILP-Based Multi-Document Summarization

Evaluation of Machine Learning Algorithms for Sms Spam Filtering

NLP - Assignment 2

3 Dictionaries and Tolerant Retrieval

N-Gram-Based Machine Translation

Title and Link Description Improving Distributional Similarity With

Arxiv:2006.14799V2 [Cs.CL] 18 May 2021 the Same User Input

Word Senses and Wordnet Lady Bracknell

A Unigram Orientation Model for Statistical Machine Translation

Heafield, K. and A. Lavie. "Voting on N-Grams for Machine Translation

Part I Word Vectors I: Introduction, Svd and Word2vec 2 Natural Language in Order to Perform Some Task

Phrase Based Language Model for Statistical Machine Translation

Phrase-Based Attentions