Topic-based Multi-Document Summarization with Probabilistic Latent Semantic Analysis

Leonhard Hennig DAI Labor, TU Berlin Berlin, Germany [email protected]

Abstract

We consider the problem of query-focused multi-document summarization, where a summary containing the information most relevant to a user's information need is produced from a set of topic-related documents. We propose a new method based on probabilistic latent semantic analysis, which allows us to represent sentences and queries as probability distributions over latent topics. Our approach combines query-focused and thematic features computed in the latent topic space to estimate the summary-relevance of sentences. In addition, we evaluate several different similarity measures for computing sentence-level feature scores. Experimental results show that our approach outperforms the best reported results on DUC 2006 data, and also compares well on DUC 2007 data.

Keywords

text summarization, probabilistic latent semantic analysis, PLSA

1 Introduction

Automatically producing summaries from large textual sources is an extensively studied problem in IR and NLP [17, 12]. In this paper, we investigate the problem of multi-document summarization, where a summary is created from a set of related documents and optionally fulfills a specific information need of a user. In particular, we focus on generating an extractive summary by selecting sentences from a document cluster [8]. Multi-document summarization is an increasingly important task: With the rapid growth of online information, and many documents covering the same topic, the condensation of information from different sources into an informative summary helps to reduce information overload. Automatically created summaries can either consist of the most important information overall (generic summarization) or of the information most relevant with respect to a user's information need (query-focused summarization).

A major aspect of identifying relevant information is to find out what a text is about. A document will generally contain a variety of information centered around a main theme, and covering different aspects of the main topic. Similarly, human summaries tend to cover different topics of the original source text to increase the informative content of the summary. Various approaches have exploited features based on the identification of topics (or thematic foci) to construct generic or query-focused summaries. Often, thematic features rely on identifying and weighting important keywords [21], or creating topic signatures [14, 10]. Sentences are scored by combinations of keyword scores, or by computing similarities between sentences and queries. Yet it is well known that term matching has severe drawbacks due to the ambivalence of words and to differences in usage and personal style across authors. This is especially important for summarization, as summaries produced by humans may differ significantly, potentially not sharing very many terms [16].

Latent Semantic Indexing (LSI) is an approach to overcome these problems by mapping documents to a latent semantic space, and has been shown to work well for text summarization [9, 23]. However, LSI has a number of drawbacks, namely its unsatisfactory statistical foundations. The technique of probabilistic latent semantic analysis (PLSA) assumes a lower-dimensional latent topic space as the origin of observed term co-occurrence distributions, and can be seen as a probabilistic analogue to LSI [11]. It has a solid statistical foundation, is based on the likelihood principle and defines a proper generative model for data. PLSA models documents as a list of mixing proportions for mixture components that can be viewed as representations of "topics" [4].

In this paper, we are primarily interested in the capability of the PLSA approach to model documents as mixtures of topics. Unlike previous approaches in PLSA-based extractive summarization, we represent sentences, queries, and documents as probability distributions over topics. We train the probabilistic model on the term-sentence matrix of all sentences in a document cluster, and proceed by folding queries, document titles and cluster centroid vectors into the trained model. This allows us to compute various thematic and query-focused similarity measures, as well as redundancy measures, in the space of latent topics, in order to estimate the summary-worthiness of sentences.

Our system improves on previous approaches in three ways: First, we investigate PLSA in the context of multi-document summarization, modeling topic distributions across documents and taking into account information redundancy. Second, we do not only pick sentences from topics with the highest likelihood in the training data as in [3], but compute a sentence's score based on a linear function of query-focused and thematic features. Third, we examine how a PLSA model can be used to represent documents, sentences and queries in the context of multi-document summarization, and investigate which measures are most useful for computing similarities in the latent topic space. We evaluate our approach on the data sets of the DUC 2006 and DUC 2007 text summarization challenges, and show that the resulting summaries compare favorably on ROUGE metrics with those produced by existing state-of-the-art summarization systems.

The rest of this paper is organized as follows: In Section 2 we describe the probabilistic latent semantic analysis algorithm. Next, in Section 3, we give details of our summarization system, the sentence-level features we use, as well as of the similarity measures we evaluate. In Section 4, we give experimental results showing that our approach leads to improvements over a LSI baseline, and that overall scores compare well with those of existing systems on ROUGE metrics. We then compare our system to related work in Section 5, and finally Section 6 concludes the paper.

2 Probabilistic Latent Semantic Analysis

Probabilistic latent semantic analysis is a latent variable model for co-occurrence data which has been found to provide better results than LSI for term matching in retrieval applications [11]. It associates an unobserved class variable z ∈ Z = {z_1, ..., z_k} with each observation (d, w), where word w ∈ W = {w_1, ..., w_i} occurs in document d ∈ D = {d_1, ..., d_j}. Each word in a document is considered as a sample from a mixture model, where the mixture components are multinomial random variables that can be viewed as representations of latent topics. A document is represented as a list of mixing proportions for the mixing components, i.e. it is reduced to a probability distribution over a fixed set of latent classes.

In terms of a generative model, PLSA can be defined as follows:

  • select a document d with probability P(d),
  • pick a latent class z with probability P(z|d),
  • generate a word w with probability P(w|z).

For each observation pair (d, w) the resulting likelihood expression is:

  P(d, w) = P(d) P(w|d), where    (1)

  P(w|d) = Σ_{z∈Z} P(w|z) P(z|d).    (2)

A document d and a word w are assumed to be conditionally independent given the unobserved topic z. Following the maximum likelihood principle, the mixing components and the mixing proportions are determined by the maximization of the likelihood function

  L = Σ_{d∈D} Σ_{w∈W} n(d, w) log P(d, w),    (3)

where n(d, w) denotes the term frequency, i.e. the number of times w occurred in d.

The standard procedure for maximizing the likelihood function in the presence of latent variables is the Expectation Maximization (EM) algorithm. EM is an iterative algorithm where each iteration consists of two steps, an expectation step where the posterior probabilities for the latent classes z are computed, and a maximization step where the conditional probabilities of the parameters given the posterior probabilities of the latent classes are updated. Alternating the expectation and maximization steps, one arrives at a converging point which describes a local maximum of the log likelihood. The output of the algorithm are the mixture components, as well as the mixing proportions over the components for each training document, i.e. the conditional probabilities P(w|z) and P(z|d). For details of the EM algorithm and its application to PLSA, see [11].
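To make the E- and M-steps concrete, the updates implied by Equations (1)-(3) can be written down directly in NumPy. The sketch below is not the authors' implementation; it assumes a dense term-sentence count matrix ts (terms × sentences, as used later in Section 3.1) and a fixed number of topics k, and returns estimates of P(w|z) and P(z|s):

```python
import numpy as np

def train_plsa(ts, k, n_iter=100, seed=0):
    """Minimal PLSA via EM on a term-sentence count matrix ts (terms x sentences).
    Returns P(w|z) as a (terms x k) matrix and P(z|s) as a (k x sentences) matrix."""
    rng = np.random.default_rng(seed)
    n_terms, n_sents = ts.shape
    p_w_z = rng.random((n_terms, k))
    p_w_z /= p_w_z.sum(axis=0, keepdims=True)          # each topic column sums to 1
    p_z_s = rng.random((k, n_sents))
    p_z_s /= p_z_s.sum(axis=0, keepdims=True)          # each sentence column sums to 1
    for _ in range(n_iter):
        # E-step: posterior P(z | w, s), proportional to P(w|z) P(z|s)
        post = p_w_z[:, :, None] * p_z_s[None, :, :]   # shape (terms, k, sentences)
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate parameters from expected counts n(w, s) * P(z | w, s)
        expected = ts[:, None, :] * post
        p_w_z = expected.sum(axis=2)
        p_w_z /= p_w_z.sum(axis=0, keepdims=True) + 1e-12
        p_z_s = expected.sum(axis=0)
        p_z_s /= p_z_s.sum(axis=0, keepdims=True) + 1e-12
    return p_w_z, p_z_s
```

The dense three-dimensional posterior array is used only for clarity; a practical implementation would iterate over the non-zero entries of the term-sentence matrix.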
3 Topic-based summarization

Our approach for producing a summary consists of three steps: First, we associate sentences and queries with a representation in the latent topic space of a PLSA model by estimating their mixing proportions P(z|d)^1. We then compute several sentence-level features based on the similarity of sentence and query distributions over latent topics. Finally, we combine individual feature scores linearly into an overall sentence score to create a ranking, which we use to select sentences for the summary. We follow a greedy approach for selecting sentences, and penalize candidate sentences based on their similarity to the partial summary.

3.1 Sentence representation in the latent topic space

Given a corpus D of topic-related documents, we perform sentence splitting on each document using the NLTK toolkit^2. Each sentence is represented as a bag-of-words w = (w_1, ..., w_m). During preprocessing, we remove stop words, and apply stemming using Porter's stemmer [22]. We discard all sentences which contain less than l_min = 5 or more than l_max = 20 content words, as these sentences are unlikely to be useful for a summary [24]. We create a term-sentence matrix TS containing all sentences of the corpus, where each entry TS(i, j) is given by the frequency of term i in sentence j. We then train the PLSA model on the term-sentence matrix TS.

After the model has been trained, it provides a representation of the sentences as probability distributions P(z|s) over the latent topics z. This representation can be interpreted as follows: Since the source documents cover multiple topics related to a central theme, each sentence can be viewed as representing one or more of these topics. By applying PLSA, we arrive at a representation of sentences as a vector in the "topic-space" of the document cluster D:

  P(z|s) = (p(z_1|s), p(z_2|s), ..., p(z_K|s)),    (4)

where p(z_k|s) is the conditional probability of topic k given the sentence s. The probability distribution P(z|s) hence tells us how many and which topics this sentence covers^3, and how likely the different topics are for this sentence.

In order to produce a query-focused summary, we also need to represent the query in the latent topic space. This is achieved by folding the query into the trained model. The folding is performed by EM iterations, where the factors P(w|z) are kept fixed, and only the mixing proportions P(z|q) are adapted in each M-step [11]. The representation of sentences and queries in the latent topic space allows us to apply similarity measures in this space. Furthermore, the topic space is much smaller than the original term vector space.

^1 From hereon, we will use P(z|s) and P(z|q) to denote topic distributions over sentences and queries respectively, but for all purposes these can be considered identical to the notation P(z|d) of the original PLSA model.
^2 http://nltk.org
^3 In terms of topics whose probability is not negligible, i.e. larger than some small quantity ε.
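A possible realization of the preprocessing pipeline described at the start of Section 3.1, using the NLTK components mentioned above, could look as follows. The tokenization details and the approximation of "content words" as alphabetic, non-stopword tokens are assumptions, not specified in the paper:

```python
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

# nltk.download('punkt') and nltk.download('stopwords') are required once.
STOP = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def preprocess_cluster(documents, l_min=5, l_max=20):
    """Split documents into sentences, remove stop words, stem, filter by length,
    and build a term-sentence frequency matrix TS (terms x sentences)."""
    sentences, bags = [], []
    for doc in documents:
        for sent in sent_tokenize(doc):
            tokens = [STEMMER.stem(t.lower()) for t in word_tokenize(sent)
                      if t.isalpha() and t.lower() not in STOP]
            if l_min <= len(tokens) <= l_max:      # discard very short/long sentences
                sentences.append(sent)
                bags.append(tokens)
    vocab = sorted({t for bag in bags for t in bag})
    index = {t: i for i, t in enumerate(vocab)}
    ts = np.zeros((len(vocab), len(bags)))
    for j, bag in enumerate(bags):
        for t in bag:
            ts[index[t], j] += 1                   # TS(i, j) = frequency of term i in sentence j
    return sentences, vocab, ts
```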

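Folding a new term vector into the trained model then amounts to running EM with P(w|z) held fixed. A minimal sketch, reusing the P(w|z) matrix from the training sketch in Section 2 and assuming query_tf is a preprocessed term-frequency vector over the same vocabulary:

```python
import numpy as np

def fold_in(query_tf, p_w_z, n_iter=50, seed=0):
    """Fold a term vector (e.g. a query, title or centroid) into a trained PLSA model:
    run EM with P(w|z) kept fixed, adapting only the mixing proportions P(z|q)."""
    rng = np.random.default_rng(seed)
    k = p_w_z.shape[1]
    p_z_q = rng.random(k)
    p_z_q /= p_z_q.sum()
    for _ in range(n_iter):
        # E-step: P(z | w, q), proportional to P(w|z) P(z|q), for each term w
        post = p_w_z * p_z_q[None, :]
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # M-step: only the mixing proportions P(z|q) are updated
        p_z_q = (query_tf[:, None] * post).sum(axis=0)
        p_z_q /= p_z_q.sum() + 1e-12
    return p_z_q
```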
3.2 Computing query-focused and thematic sentence features

Since we are interested in creating a summary that covers the main topics of a document set and is also focused on satisfying a user's information need, specified by a query, we create sentence-level features that attempt to capture these different aspects in the form of per-sentence scores. We then combine the feature scores to arrive at an overall sentence score.

Each of our evaluation data sets contains a title and a narrative for each cluster of topic-related documents. The narrative consists of one or more sentences describing a user's information need. This allows us to compute the following sentence features, where each feature measures the similarity of the sentence's topic distribution S with a "query" topic distribution:

  • r(S, CT) - cluster title
  • r(S, N) - cluster narrative
  • r(S, T_s) - document title
  • r(S, D_s) - document term vector
  • r(S, C) - cluster centroid vector

To compute the features, we fold the title and the narrative of the document clusters, the document titles, and document and cluster term vectors into the trained PLSA model. Query term vectors are preprocessed in the same way as training sentences, except that no sentence splitting is performed. Document and document cluster term vectors are computed by aggregating sentence term vectors.

We evaluate three similarity measures r in our approach: the symmetric Kullback-Leibler (KL) divergence, the Jensen-Shannon (JS) divergence and the cosine similarity, but a variety of other similarity measures can be utilized towards this end. The symmetric KL-divergence is defined as follows:

  KL(S, Q) = D_KL(S||Q) + D_KL(Q||S)
           = Σ_i S(i) log (S(i)/Q(i)) + Σ_i Q(i) log (Q(i)/S(i)).    (5)

To use the KL-divergence as a similarity measure, we scale divergence values to [0, 1] and invert by subtracting from 1, hence

  r_KL = 1 − KL(S, Q)_scaled.    (6)

The Jensen-Shannon divergence is a symmetrized and smoothed version of the KL-divergence, computing the KL-divergence of S and Q with respect to the average of the two input distributions. The JS-divergence based similarity r_JS is then defined as:

  r_JS(S, Q) = 1 − D_JS(S||Q)
             = 1 − [ (1/2) D_KL(S||M) + (1/2) D_KL(Q||M) ],    (7)

where M = 1/2 (S + Q). Finally, the cosine similarity is defined as r_COS(S, Q) = S^T Q.

As the training of a PLSA model using the EM algorithm with random initialization converges on a local maximum of the likelihood of the observed data, different initializations will result in different locally optimal models. As the authors of [5] have shown, the effect of random initialization can be reduced by generating several PLSA models, then computing features according to the different models, and finally averaging the feature values. We have implemented this model averaging in our approach using 5 iterations of training the PLSA model.
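The three similarity measures operate directly on the topic distributions produced by the model. A sketch of Equations (5)-(7) and the cosine measure is given below; how the symmetric KL-divergence is scaled to [0, 1] is not specified in the paper, so the scale argument (e.g. the maximum divergence observed in a cluster) is an assumption:

```python
import numpy as np

EPS = 1e-12

def kl(p, q):
    """KL-divergence D_KL(p||q) of two discrete distributions."""
    return float(np.sum(p * np.log((p + EPS) / (q + EPS))))

def r_kl(s, q, scale):
    """Symmetric KL-divergence, scaled to [0, 1] and inverted (Eqs. 5-6)."""
    return 1.0 - (kl(s, q) + kl(q, s)) / (scale + EPS)

def r_js(s, q):
    """Jensen-Shannon based similarity (Eq. 7)."""
    m = 0.5 * (s + q)
    return 1.0 - (0.5 * kl(s, m) + 0.5 * kl(q, m))

def r_cos(s, q):
    """Cosine-style similarity of two topic distributions, r_COS(S, Q) = S^T Q."""
    return float(np.dot(s, q))
```

Model averaging can then be realized by training several models with different random seeds and averaging the resulting feature values per sentence.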
3.3 Sentence scoring

The system described so far assigns a vector of similarity feature values to each sentence s. The overall score of a sentence s based on the feature vector (r_1, ..., r_P) is:

  score(s) = Σ_p w_p r_p,    (8)

where w_p is a feature-specific weight. Sentences are ranked by this score, and the highest-scoring sentences are selected for the summary.

For our system, we trained the feature weights by initializing all weights to a default value of 1. We then optimized one feature weight at a time while keeping the others fixed. The training was performed on the DUC 2006 data set. The most dominant features in our experiments are the sentence-narrative similarity r(S, N) and the sentence-document similarity r(S, D), which confirms previous research. On the other hand, the sentence-title similarity r(S, T) did not have a significant influence on the resulting summaries.

When generating a summary, we also need to deal with the problem of repetition of information. This problem is especially important for multi-document summarization, where multiple documents will discuss the same topic. We model redundancy similar to the maximum marginal relevance framework [6]. MMR is a greedy approach that iteratively selects the best-scoring sentence for the summary, and then updates sentence scores by computing a penalty based on the similarity of each sentence with the current summary:

  score(s) = λ score(s) − (1 − λ) r(S, SUM),    (9)

where the score of sentence s is scaled to [0, 1] and r(S, SUM) is the cosine similarity of the sentence and the summary centroid vector, which is based on the averaged topic distribution of sentences selected for the summary. λ is set experimentally to 0.5, weighting relevance and redundancy scores equally.
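Equations (8) and (9) together define a greedy selection loop. The following sketch combines both; the min-max scaling of scores to [0, 1] and λ = 0.5 follow the description above, while the whitespace word count and the stopping criterion (the summary may slightly exceed the word limit) are simplifications:

```python
import numpy as np

def greedy_select(sentences, features, weights, topic_dists, max_words=250, lam=0.5):
    """Greedy sentence selection: rank by the weighted feature score (Eq. 8) and
    penalize similarity to the current summary centroid, MMR-style (Eq. 9)."""
    base = features @ weights
    base = (base - base.min()) / (base.max() - base.min() + 1e-12)   # scale scores to [0, 1]
    selected, summary = [], []
    while sum(len(sentences[i].split()) for i in selected) < max_words:
        if len(selected) == len(sentences):
            break                                                     # no candidates left
        if selected:
            centroid = topic_dists[selected].mean(axis=0)             # averaged P(z|s) of the summary
            sims = topic_dists @ centroid
            sims /= np.linalg.norm(topic_dists, axis=1) * np.linalg.norm(centroid) + 1e-12
            scores = lam * base - (1.0 - lam) * sims
        else:
            scores = base.copy()
        scores[selected] = -np.inf                                    # never pick a sentence twice
        best = int(np.argmax(scores))
        selected.append(best)
        summary.append(sentences[best])
    return summary
```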
4 Experiments

For the evaluation of our summarization system, we use two data sets from recent summarization tasks: Multi-document summarization in DUC 2006 and in DUC 2007. For all our evaluations, we use ROUGE metrics^4. ROUGE metrics are recall-oriented and based on n-gram overlap. ROUGE-1 has been shown to correlate well with human judgements [15]. In addition, we also report the performance on ROUGE-2 (bigram overlap) and ROUGE-SU4 (skip bigram) metrics.

We implemented two baseline systems, Lead and a system using LSI [9]. The Lead system selects the lead sentences from the most recent news article in the document cluster as the summary. The LSI baseline computes the rank-k singular value decomposition of the term-sentence matrix. The resulting right-singular vectors, scaled by the singular values, represent the sentences in the latent semantic space. We compute the same sentence-level features as for the PLSA-based system, using the cosine similarity measure, and apply our greedy ranking and redundancy removal strategy to create a summary.

^4 ROUGE version 1.5.5, with arguments -n 4 -w 1.2 -m -2 4 -u -c 95 -r 1000 -f A -p 0.5 -t 0
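For reference, the LSI baseline described above reduces to a truncated SVD of the same term-sentence matrix; a minimal sketch, assuming NumPy and a dense matrix:

```python
import numpy as np

def lsi_sentence_vectors(ts, k=128):
    """LSI baseline: rank-k SVD of the term-sentence matrix. The right-singular
    vectors, scaled by the singular values, represent the sentences."""
    u, s, vt = np.linalg.svd(ts, full_matrices=False)
    return (np.diag(s[:k]) @ vt[:k, :]).T        # one row per sentence, k dimensions
```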
4.1 DUC 2006

In the multi-document summarization task in DUC-2006, participants are given 50 document clusters, where each cluster contains 25 news articles related to the same topic. Participants are asked to generate summaries of at most 250 words for each cluster. For each cluster, a title and a narrative describing a user's information need are provided. The narrative is usually composed of a set of questions or a multi-sentence task description.

We present the results of our system in Table 1. We compare the results to the best peer (peer24) and to the best reported results on this data set by the PYTHY system [25]. In addition, we also give the results for the LSI and the Lead baselines.

  System     k    ROUGE-1   ROUGE-2   ROUGE-SU4
  PLSA-JS    192  0.43283   0.09698   0.15568
  PYTHY      -    -         0.096     0.147
  PLSA-COS   256  0.42444   0.09588   0.15409
  peer24     -    0.40980   0.09505   0.15464
  PLSA-KL    256  0.42956   0.09465   0.15474
  LSI        128  0.42155   0.08880   0.14938
  Lead       -    0.30217   0.04947   0.09788

  Table 1: DUC-06: ROUGE recall scores for the best number of latent topics k. PLSA-JS, -KL and -COS are system variants using the Jensen-Shannon divergence, symmetric KL-divergence, and cosine similarity respectively. Best LSI model based on a rank-k approximation with k = 128.

In the table, system PLSA-JS uses the Jensen-Shannon divergence as the similarity measure r(S, Q), PLSA-KL the symmetric KL-divergence and PLSA-COS the cosine similarity. The results are given for the empirically best value of the parameter k (number of latent topics) for each system variant. The system using the JS-divergence outperforms the best existing systems at k = 192 with a ROUGE-2 score of 0.09698, although the improvements for ROUGE-2 and ROUGE-SU4 are not significant at p < 0.05. ROUGE-1 scores are significantly better than the results reported by peer24. A comparison to the PYTHY system on ROUGE-1 scores was not possible as the authors do not specify this score for their system. All variants of our system outperform the LSI baseline on ROUGE-2.

4.2 DUC 2007

The multi-document summarization task in DUC-2007 is the same as in DUC-2006, with participants asked to produce 250 word multi-document summaries for a total of 45 document clusters. The results of our system are presented in Table 2.

ROUGE-2 and ROUGE-SU4 scores of our system are lower than those of the best system (peer15), but still very competitive, with the PLSA-JS variant ranking 5th for ROUGE-2 and 2nd for ROUGE-SU4 when compared to other participating systems. Again we see that all three system variants outperform the LSI baseline. We observe that both the PLSA-JS and the PLSA-COS variant require a much smaller number of latent classes than the LSI model for comparable ROUGE-2 results.

We can also see that the PLSA-JS variant outperforms peer15 on ROUGE-1, and achieves almost the same score as the top-performing system for ROUGE-SU4, with the differences in both cases not being significant. This suggests that the PLSA model can adequately capture the importance of individual words for ROUGE-1 recall, and word co-occurrences for ROUGE-SU4 skip-bigram recall. The ROUGE-2 score, on the other hand, is significantly lower than that of peer15. This indicates that the PLSA model, which was trained on the co-occurrence counts of individual words, could benefit from the inclusion of bigram co-occurrence counts.

4.3 Effect of system variations

Next, we look at the effect of varying the number of latent topics: For all systems we find that using less than k = 32 latent classes, the model cannot cope with [...]

5 Related work

[...] a single document, then picks the topics with the highest posterior probabilities p(z), and selects sentences with the highest likelihood p(s|z) within these topics for the summary. The approach produces generic summaries based on the most likely topics of the PLSA model. In contrast, our system focuses on query-oriented multi-document summarization, and models redundancy when creating the summary.

More closely related to our approach is recent work by [1], who employ Latent Dirichlet Allocation [4] to create multi-document summaries on DUC 2002 data. The authors report an improvement of ROUGE-1 recall scores over the best known DUC 2002 system. However, their approach is similar to the approach of [3] in being restricted to selecting sentences from the topics with the largest likelihoods. As compared to our approach, their system does not seem to perform any redundancy checking except for relying on the discriminative quality of the latent classes. Furthermore, our approach utilizes narrative and other meta-information of the document cluster to create not only generic, but also query-focused summaries.

6 Conclusion

We introduced an approach to query-focused multi-document summarization based on probabilistic latent semantic analysis. After training a PLSA model on the term-sentence matrix of document clusters from recent summarization tasks, we represent each sentence as a distribution over latent topics. Using this representation, we combine query-focused and thematic sentence features into an overall sentence score. Sentences are ranked and selected for the summary according to this score, choosing a greedy approach for sentence selection and penalizing redundancy with a maximum marginal relevance method.

Our results are among the best reported on the DUC-2006 and DUC-2007 multi-document summarization tasks for ROUGE-1, ROUGE-2 and ROUGE-SU4 scores. Our approach outperforms the previous best performing system on DUC 2006 data, although the improvements are not statistically significant. We have achieved these very competitive results using a simple unsupervised approach. The comparison with a system using latent semantic indexing shows that the PLSA model can better capture the sparse information contained in a sentence than a comparable LSI model.

We also studied the effect of different measures to compute sentence-level similarity features in the latent topic space. We found that using the Jensen-Shannon divergence resulted in the best ROUGE scores, as well as being very robust to changes of the number of latent classes.

In future research, we would like to extend our method with additional linguistic knowledge. Given that the unigram, bag-of-words approach ignores syntactic structure information, we would like to study the effect of including such information, by means of bigram or n-gram co-occurrence counts, in a PLSA model. The performance differences of our system in terms of ROUGE-2 as compared to ROUGE-1 and ROUGE-SU4 suggest that the model could benefit from including n-gram co-occurrences.
References

[1] R. Arora and B. Ravindran. Latent Dirichlet allocation based multi-document summarization. In Proc. of AND '08, pages 91–97, 2008.
[2] R. Barzilay and M. Elhadad. Using Lexical Chains for Text Summarization, pages 111–121. MIT Press, 1999.
[3] H. Bhandari, M. Shimbo, T. Ito, and Y. Matsumoto. Generic text summarization using probabilistic latent semantic indexing. In Proc. of IJCNLP 2008, 2008.
[4] D. M. Blei, A. Y. Ng, M. I. Jordan, and J. Lafferty. Latent Dirichlet allocation. JMLR, 3:993–1022, 2003.
[5] T. Brants, F. Chen, and I. Tsochantaridis. Topic-based document segmentation with probabilistic latent semantic analysis. In Proc. of CIKM '02, pages 211–218, 2002.
[6] J. Carbonell and J. Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proc. of SIGIR '98, pages 335–336, 1998.
[7] G. Erkan and D. Radev. LexRank: graph-based centrality as salience in text summarization. JAIR, 2004.
[8] J. Goldstein, V. Mittal, J. Carbonell, and M. Kantrowitz. Multi-document summarization by sentence extraction. In NAACL-ANLP 2000 Workshop on Automatic Summarization, pages 40–48, 2000.
[9] Y. Gong and X. Liu. Generic text summarization using relevance measure and latent semantic analysis. In Proc. of SIGIR '01, pages 19–25, 2001.
[10] S. Harabagiu and F. Lacatusu. Topic themes for multi-document summarization. In Proc. of SIGIR '05, pages 202–209, 2005.
[11] T. Hofmann. Probabilistic latent semantic indexing. In Proc. of SIGIR '99, pages 50–57, 1999.
[12] K. S. Jones. Automatic summarising: The state of the art. Inf. Process. Manage., 43(6):1449–1481, 2007.
[13] J. Leskovec, N. Milic-Frayling, and M. Grobelnik. Impact of linguistic analysis on the semantic graph coverage and learning of document extracts. In Proc. of AAAI '05, 2005.
[14] C.-Y. Lin and E. Hovy. The automated acquisition of topic signatures for text summarization. In Proc. of COLING '00, pages 495–501, 2000.
[15] C.-Y. Lin and E. Hovy. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proc. of NAACL-HLT 2003, pages 71–78, 2003.
[16] C.-Y. Lin and E. Hovy. The potential and limitations of automatic sentence extraction for summarization. In Proc. of the HLT-NAACL 2003 Workshop on Text Summarization, pages 73–80, 2003.
[17] H. Luhn. The automatic creation of literature abstracts. IBM J. of Research & Development, 1958.
[18] I. Mani and E. Bloedorn. Machine learning of generic and user-focused summarization. In Proc. of AAAI '98/IAAI '98, pages 820–826, 1998.
[19] D. Marcu. The rhetorical parsing of natural language texts. In Proc. of the 35th Annual Meeting of the ACL, pages 96–103, 1997.
[20] R. Mihalcea. Graph-based ranking algorithms for sentence extraction, applied to text summarization. In Proc. of ACL 2004, page 20, 2004.
[21] A. Nenkova, L. Vanderwende, and K. McKeown. A compositional context sensitive multi-document summarizer: exploring the factors that influence summarization. In Proc. of SIGIR '06, pages 573–580, 2006.
[22] M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.
[23] J. Steinberger, M. A. Kabadjov, M. Poesio, and O. Sanchez-Graillet. Improving LSA-based summarization with anaphora resolution. In Proc. of HLT-EMNLP '05, pages 1–8, 2005.
[24] S. Teufel and M. Moens. Sentence extraction as a classification task. In ACL/EACL-97 Workshop on Intelligent and Scalable Text Summarization, pages 58–65, 1997.
[25] K. Toutanova, C. Brockett, M. Gamon, J. Jagarlamudi, H. Suzuki, and L. Vanderwende. The PYTHY summarization system: Microsoft Research at DUC 2007. In Proc. of DUC 2007, 2007.