Topic-based Multi-Document Summarization with Probabilistic Latent Semantic Analysis

Leonhard Hennig DAI Labor, TU Berlin Berlin, Germany [email protected]

Abstract

We consider the problem of query-focused multi-document summarization, where a summary containing the information most relevant to a user's information need is produced from a set of topic-related documents. We propose a new method based on probabilistic latent semantic analysis, which allows us to represent sentences and queries as probability distributions over latent topics. Our approach combines query-focused and thematic features computed in the latent topic space to estimate the summary-relevance of sentences. In addition, we evaluate several different similarity measures for computing sentence-level feature scores. Experimental results show that our approach outperforms the best reported results on DUC 2006 data, and also compares well on DUC 2007 data.

Keywords

text summarization, probabilistic latent semantic analysis, PLSA

1 Introduction

Automatically producing summaries from large textual sources is an extensively studied problem in IR and NLP [17, 12]. In this paper, we investigate the problem of multi-document summarization, where a summary is created from a set of related documents and optionally fulfills a specific information need of a user. In particular, we focus on generating an extractive summary by selecting sentences from a document cluster [8]. Multi-document summarization is an increasingly important task: With the rapid growth of online information, and many documents covering the same topic, the condensation of information from different sources into an informative summary helps to reduce information overload. Automatically created summaries can either consist of the most important information overall (generic summarization) or of the information most relevant with respect to a user's information need (query-focused summarization).

A major aspect of identifying relevant information is to find out what a text is about. A document will generally contain a variety of information centered around a main theme, and covering different aspects of the main topic. Similarly, human summaries tend to cover different topics of the original source text to increase the informative content of the summary. Various approaches have exploited features based on the identification of topics (or thematic foci) to construct generic or query-focused summaries. Often, thematic features rely on identifying and weighting important keywords [21], or creating topic signatures [14, 10]. Sentences are scored by combinations of keyword scores, or by computing similarities between sentences and queries. Yet it is well known that term matching has severe drawbacks due to the ambivalence of words and to differences in usage and personal style across authors. This is especially important for summarization, as summaries produced by humans may differ significantly, potentially not sharing very many terms [16].

Latent Semantic Indexing (LSI) is an approach to overcome these problems by mapping documents to a latent semantic space, and has been shown to work well for text summarization [9, 23]. However, LSI has a number of drawbacks, namely its unsatisfactory statistical foundations. The technique of probabilistic latent semantic analysis (PLSA) assumes a lower-dimensional latent topic space as the origin of observed term co-occurrence distributions, and can be seen as a probabilistic analogue to LSI [11]. It has a solid statistical foundation, is based on the likelihood principle and defines a proper generative model for data. PLSA models documents as a list of mixing proportions for mixture components that can be viewed as representations of "topics" [4].

In this paper, we are primarily interested in the capability of the PLSA approach to model documents as mixtures of topics. Unlike previous approaches in PLSA-based extractive summarization, we represent sentences, queries, and documents as probability distributions over topics. We train the probabilistic model on the term-sentence matrix of all sentences in a document cluster, and proceed by folding queries, document titles and cluster centroid vectors into the trained model. This allows us to compute various thematic and query-focused similarity measures, as well as redundancy measures, in the space of latent topics, in order to estimate the summary-worthiness of sentences.

Our system improves on previous approaches in three ways: First, we investigate PLSA in the context of multi-document summarization, modeling topic distributions across documents and taking into account information redundancy. Second, we do not only pick sentences from topics with the highest likelihood in the training data as in [3], but compute a sentence's score based on a linear function of query-focused and thematic features. Third, we examine how a PLSA model can be used to represent documents, sentences and queries in the context of multi-document summarization, and investigate which measures are most useful for computing similarities in the latent topic space. We evaluate our approach on the data sets of the DUC 2006 and DUC 2007 text summarization challenges, and show that the resulting summaries compare favorably on ROUGE metrics with those produced by existing state-of-the-art summarization systems.

The rest of this paper is organized as follows: In Section 2 we describe the probabilistic latent semantic analysis algorithm. Next, in Section 3, we give details of our summarization system, the sentence-level features we use, as well as of the similarity measures we evaluate. In Section 4, we give experimental results showing that our approach leads to improvements over a LSI baseline, and that overall scores compare well with those of existing systems on ROUGE metrics. We then compare our system to related work in Section 5, and finally Section 6 concludes the paper.

2 Probabilistic Latent Semantic Analysis

Probabilistic latent semantic analysis is a latent variable model for co-occurrence data which has been found to provide better results than LSI for term matching in retrieval applications [11]. It associates an unobserved class variable z ∈ Z = {z_1, ..., z_k} with each observation (d, w), where word w ∈ W = {w_1, ..., w_i} occurs in document d ∈ D = {d_1, ..., d_j}. Each word in a document is considered as a sample from a mixture model, where the mixture components are multinomial random variables that can be viewed as representations of latent topics. A document is represented as a list of mixing proportions for the mixing components, i.e. it is reduced to a probability distribution over a fixed set of latent classes.

In terms of a generative model, PLSA can be defined as follows:

  • select a document d with probability P(d),
  • pick a latent class z with probability P(z|d),
  • generate a word w with probability P(w|z).

For each observation pair (d, w) the resulting likelihood expression is:

  P(d, w) = P(d) P(w|d), where    (1)

  P(w|d) = Σ_{z∈Z} P(w|z) P(z|d).    (2)

A document d and a word w are assumed to be conditionally independent given the unobserved topic z. Following the maximum likelihood principle, the mixing components and the mixing proportions are determined by the maximization of the likelihood function

  L = Σ_{d∈D} Σ_{w∈W} n(d, w) log P(d, w),    (3)

where n(d, w) denotes the term frequency, i.e. the number of times w occurred in d.

The standard procedure for maximizing the likelihood function in the presence of latent variables is the Expectation Maximization (EM) algorithm. EM is an iterative algorithm where each iteration consists of two steps, an expectation step where the posterior probabilities for the latent classes z are computed, and a maximization step where the conditional probabilities of the parameters given the posterior probabilities of the latent classes are updated. Alternating the expectation and maximization steps, one arrives at a converging point which describes a local maximum of the log likelihood. The output of the algorithm are the mixture components, as well as the mixing proportions over the components for each training document, i.e. the conditional probabilities P(w|z) and P(z|d). For details of the EM algorithm and its application to PLSA, see [11].
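To make the E- and M-steps concrete, the updates implied by Equations (1)-(3) can be written down directly in NumPy. The sketch below is not the authors' implementation; it assumes a dense term-sentence count matrix ts (terms × sentences, as used later in Section 3.1) and a fixed number of topics k, and returns estimates of P(w|z) and P(z|s):

```python
import numpy as np

def train_plsa(ts, k, n_iter=100, seed=0):
    """Minimal PLSA via EM on a term-sentence count matrix ts (terms x sentences).
    Returns P(w|z) as a (terms x k) matrix and P(z|s) as a (k x sentences) matrix."""
    rng = np.random.default_rng(seed)
    n_terms, n_sents = ts.shape
    p_w_z = rng.random((n_terms, k))
    p_w_z /= p_w_z.sum(axis=0, keepdims=True)          # each topic column sums to 1
    p_z_s = rng.random((k, n_sents))
    p_z_s /= p_z_s.sum(axis=0, keepdims=True)          # each sentence column sums to 1
    for _ in range(n_iter):
        # E-step: posterior P(z | w, s), proportional to P(w|z) P(z|s)
        post = p_w_z[:, :, None] * p_z_s[None, :, :]   # shape (terms, k, sentences)
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate parameters from expected counts n(w, s) * P(z | w, s)
        expected = ts[:, None, :] * post
        p_w_z = expected.sum(axis=2)
        p_w_z /= p_w_z.sum(axis=0, keepdims=True) + 1e-12
        p_z_s = expected.sum(axis=0)
        p_z_s /= p_z_s.sum(axis=0, keepdims=True) + 1e-12
    return p_w_z, p_z_s
```

The dense three-dimensional posterior array is used only for clarity; a practical implementation would iterate over the non-zero entries of the term-sentence matrix.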
3 Topic-based summarization

Our approach for producing a summary consists of three steps: First, we associate sentences and queries with a representation in the latent topic space of a PLSA model by estimating their mixing proportions P(z|d)^1. We then compute several sentence-level features based on the similarity of sentence and query distributions over latent topics. Finally, we combine individual feature scores linearly into an overall sentence score to create a ranking, which we use to select sentences for the summary. We follow a greedy approach for selecting sentences, and penalize candidate sentences based on their similarity to the partial summary.

3.1 Sentence representation in the latent topic space

Given a corpus D of topic-related documents, we perform sentence splitting on each document using the NLTK toolkit^2. Each sentence is represented as a bag-of-words w = (w_1, ..., w_m). During preprocessing, we remove stop words, and apply stemming using Porter's stemmer [22]. We discard all sentences which contain less than l_min = 5 or more than l_max = 20 content words, as these sentences are unlikely to be useful for a summary [24]. We create a term-sentence matrix TS containing all sentences of the corpus, where each entry TS(i, j) is given by the frequency of term i in sentence j. We then train the PLSA model on the term-sentence matrix TS.

After the model has been trained, it provides a representation of the sentences as probability distributions P(z|s) over the latent topics z. This representation can be interpreted as follows: Since the source documents cover multiple topics related to a central theme, each sentence can be viewed as representing one or more of these topics. By applying PLSA, we arrive at a representation of sentences as a vector in the "topic-space" of the document cluster D:

  P(z|s) = (p(z_1|s), p(z_2|s), ..., p(z_K|s)),    (4)

where p(z_k|s) is the conditional probability of topic k given the sentence s. The probability distribution P(z|s) hence tells us how many and which topics this sentence covers^3, and how likely the different topics are for this sentence.

In order to produce a query-focused summary, we also need to represent the query in the latent topic space. This is achieved by folding the query into the trained model. The folding is performed by EM iterations, where the factors P(w|z) are kept fixed, and only the mixing proportions P(z|q) are adapted in each M-step [11]. The representation of sentences and queries in the latent topic space allows us to apply similarity measures in this space. Furthermore, the topic space is much smaller than the original term vector space.

^1 From hereon, we will use P(z|s) and P(z|q) to denote topic distributions over sentences and queries respectively, but for all purposes these can be considered identical to the notation P(z|d) of the original PLSA model.
^2 http://nltk.org
^3 In terms of topics whose probability is not negligible, i.e. larger than some small quantity ε.
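A possible realization of the preprocessing pipeline described at the start of Section 3.1, using the NLTK components mentioned above, could look as follows. The tokenization details and the approximation of "content words" as alphabetic, non-stopword tokens are assumptions, not specified in the paper:

```python
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

# nltk.download('punkt') and nltk.download('stopwords') are required once.
STOP = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def preprocess_cluster(documents, l_min=5, l_max=20):
    """Split documents into sentences, remove stop words, stem, filter by length,
    and build a term-sentence frequency matrix TS (terms x sentences)."""
    sentences, bags = [], []
    for doc in documents:
        for sent in sent_tokenize(doc):
            tokens = [STEMMER.stem(t.lower()) for t in word_tokenize(sent)
                      if t.isalpha() and t.lower() not in STOP]
            if l_min <= len(tokens) <= l_max:      # discard very short/long sentences
                sentences.append(sent)
                bags.append(tokens)
    vocab = sorted({t for bag in bags for t in bag})
    index = {t: i for i, t in enumerate(vocab)}
    ts = np.zeros((len(vocab), len(bags)))
    for j, bag in enumerate(bags):
        for t in bag:
            ts[index[t], j] += 1                   # TS(i, j) = frequency of term i in sentence j
    return sentences, vocab, ts
```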

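Folding a new term vector into the trained model then amounts to running EM with P(w|z) held fixed. A minimal sketch, reusing the P(w|z) matrix from the training sketch in Section 2 and assuming query_tf is a preprocessed term-frequency vector over the same vocabulary:

```python
import numpy as np

def fold_in(query_tf, p_w_z, n_iter=50, seed=0):
    """Fold a term vector (e.g. a query, title or centroid) into a trained PLSA model:
    run EM with P(w|z) kept fixed, adapting only the mixing proportions P(z|q)."""
    rng = np.random.default_rng(seed)
    k = p_w_z.shape[1]
    p_z_q = rng.random(k)
    p_z_q /= p_z_q.sum()
    for _ in range(n_iter):
        # E-step: P(z | w, q), proportional to P(w|z) P(z|q), for each term w
        post = p_w_z * p_z_q[None, :]
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # M-step: only the mixing proportions P(z|q) are updated
        p_z_q = (query_tf[:, None] * post).sum(axis=0)
        p_z_q /= p_z_q.sum() + 1e-12
    return p_z_q
```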
3.2 Computing query-focused and thematic sentence features

Since we are interested in creating a summary that covers the main topics of a document set and is also focused on satisfying a user's information need, specified by a query, we create sentence-level features that attempt to capture these different aspects in the form of per-sentence scores. We then combine the feature scores to arrive at an overall sentence score.

Each of our evaluation data sets contains a title and a narrative for each cluster of topic-related documents. The narrative consists of one or more sentences describing a user's information need. This allows us to compute the following sentence features, where each feature measures the similarity of the sentence's topic distribution S with a "query" topic distribution:

  • r(S, CT) - cluster title
  • r(S, N) - cluster narrative
  • r(S, T_s) - document title
  • r(S, D_s) - document term vector
  • r(S, C) - cluster centroid vector

To compute the features, we fold the title and the narrative of the document clusters, the document titles, and document and cluster term vectors into the trained PLSA model. Query term vectors are preprocessed in the same way as training sentences, except that no sentence splitting is performed. Document and document cluster term vectors are computed by aggregating sentence term vectors.

We evaluate three similarity measures r in our approach: the symmetric Kullback-Leibler (KL) divergence, the Jensen-Shannon (JS) divergence and the cosine similarity, but a variety of other similarity measures can be utilized towards this end. The symmetric KL-divergence is defined as follows:

  KL(S, Q) = D_KL(S||Q) + D_KL(Q||S)
           = Σ_i S(i) log (S(i)/Q(i)) + Σ_i Q(i) log (Q(i)/S(i)).    (5)

To use the KL-divergence as a similarity measure, we scale divergence values to [0, 1] and invert by subtracting from 1, hence

  r_KL = 1 − KL(S, Q)_scaled.    (6)

The Jensen-Shannon divergence is a symmetrized and smoothed version of the KL-divergence, computing the KL-divergence of S and Q with respect to the average of the two input distributions. The JS-divergence based similarity r_JS is then defined as:

  r_JS(S, Q) = 1 − D_JS(S||Q)
             = 1 − [ (1/2) D_KL(S||M) + (1/2) D_KL(Q||M) ],    (7)

where M = 1/2 (S + Q). Finally, the cosine similarity is defined as r_COS(S, Q) = S^T Q.

As the training of a PLSA model using the EM algorithm with random initialization converges on a local maximum of the likelihood of the observed data, different initializations will result in different locally optimal models. As the authors of [5] have shown, the effect of random initialization can be reduced by generating several PLSA models, then computing features according to the different models, and finally averaging the feature values. We have implemented this model averaging in our approach using 5 iterations of training the PLSA model.
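The three similarity measures operate directly on the topic distributions produced by the model. A sketch of Equations (5)-(7) and the cosine measure is given below; how the symmetric KL-divergence is scaled to [0, 1] is not specified in the paper, so the scale argument (e.g. the maximum divergence observed in a cluster) is an assumption:

```python
import numpy as np

EPS = 1e-12

def kl(p, q):
    """KL-divergence D_KL(p||q) of two discrete distributions."""
    return float(np.sum(p * np.log((p + EPS) / (q + EPS))))

def r_kl(s, q, scale):
    """Symmetric KL-divergence, scaled to [0, 1] and inverted (Eqs. 5-6)."""
    return 1.0 - (kl(s, q) + kl(q, s)) / (scale + EPS)

def r_js(s, q):
    """Jensen-Shannon based similarity (Eq. 7)."""
    m = 0.5 * (s + q)
    return 1.0 - (0.5 * kl(s, m) + 0.5 * kl(q, m))

def r_cos(s, q):
    """Cosine-style similarity of two topic distributions, r_COS(S, Q) = S^T Q."""
    return float(np.dot(s, q))
```

Model averaging can then be realized by training several models with different random seeds and averaging the resulting feature values per sentence.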
3.3 Sentence scoring

The system described so far assigns a vector of similarity feature values to each sentence s. The overall score of a sentence s based on the feature vector (r_1, ..., r_P) is:

  score(s) = Σ_p w_p r_p,    (8)

where w_p is a feature-specific weight. Sentences are ranked by this score, and the highest-scoring sentences are selected for the summary.

For our system, we trained the feature weights by initializing all weights to a default value of 1. We then optimized one feature weight at a time while keeping the others fixed. The training was performed on the DUC 2006 data set. The most dominant features in our experiments are the sentence-narrative similarity r(S, N) and the sentence-document similarity r(S, D), which confirms previous research. On the other hand, the sentence-title similarity r(S, T) did not have a significant influence on the resulting summaries.

When generating a summary, we also need to deal with the problem of repetition of information. This problem is especially important for multi-document summarization, where multiple documents will discuss the same topic. We model redundancy similar to the maximum marginal relevance framework [6]. MMR is a greedy approach that iteratively selects the best-scoring sentence for the summary, and then updates sentence scores by computing a penalty based on the similarity of each sentence with the current summary:

  score(s) = λ score(s) − (1 − λ) r(S, SUM),    (9)

where the score of sentence s is scaled to [0, 1] and r(S, SUM) is the cosine similarity of the sentence and the summary centroid vector, which is based on the averaged topic distribution of sentences selected for the summary. λ is set experimentally to 0.5, weighting relevance and redundancy scores equally.
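Equations (8) and (9) together define a greedy selection loop. The following sketch combines both; the min-max scaling of scores to [0, 1] and λ = 0.5 follow the description above, while the whitespace word count and the stopping criterion (the summary may slightly exceed the word limit) are simplifications:

```python
import numpy as np

def greedy_select(sentences, features, weights, topic_dists, max_words=250, lam=0.5):
    """Greedy sentence selection: rank by the weighted feature score (Eq. 8) and
    penalize similarity to the current summary centroid, MMR-style (Eq. 9)."""
    base = features @ weights
    base = (base - base.min()) / (base.max() - base.min() + 1e-12)   # scale scores to [0, 1]
    selected, summary = [], []
    while sum(len(sentences[i].split()) for i in selected) < max_words:
        if len(selected) == len(sentences):
            break                                                     # no candidates left
        if selected:
            centroid = topic_dists[selected].mean(axis=0)             # averaged P(z|s) of the summary
            sims = topic_dists @ centroid
            sims /= np.linalg.norm(topic_dists, axis=1) * np.linalg.norm(centroid) + 1e-12
            scores = lam * base - (1.0 - lam) * sims
        else:
            scores = base.copy()
        scores[selected] = -np.inf                                    # never pick a sentence twice
        best = int(np.argmax(scores))
        selected.append(best)
        summary.append(sentences[best])
    return summary
```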
4 Experiments

For the evaluation of our summarization system, we use two data sets from recent summarization tasks: Multi-document summarization in DUC 2006 and in DUC 2007. For all our evaluations, we use ROUGE metrics^4. ROUGE metrics are recall-oriented and based on n-gram overlap. ROUGE-1 has been shown to correlate well with human judgements [15]. In addition, we also report the performance on ROUGE-2 (bigram overlap) and ROUGE-SU4 (skip bigram) metrics.

We implemented two baseline systems, Lead and a system using LSI [9]. The Lead system selects the lead sentences from the most recent news article in the document cluster as the summary. The LSI baseline computes the rank-k singular value decomposition of the term-sentence matrix. The resulting right-singular vectors, scaled by the singular values, represent the sentences in the latent semantic space. We compute the same sentence-level features as for the PLSA-based system, using the cosine similarity measure, and apply our greedy ranking and redundancy removal strategy to create a summary.

^4 ROUGE version 1.5.5, with arguments -n 4 -w 1.2 -m -2 4 -u -c 95 -r 1000 -f A -p 0.5 -t 0
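For reference, the LSI baseline described above reduces to a truncated SVD of the same term-sentence matrix; a minimal sketch, assuming NumPy and a dense matrix:

```python
import numpy as np

def lsi_sentence_vectors(ts, k=128):
    """LSI baseline: rank-k SVD of the term-sentence matrix. The right-singular
    vectors, scaled by the singular values, represent the sentences."""
    u, s, vt = np.linalg.svd(ts, full_matrices=False)
    return (np.diag(s[:k]) @ vt[:k, :]).T        # one row per sentence, k dimensions
```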
4.1 DUC 2006

In the multi-document summarization task in DUC-2006, participants are given 50 document clusters, where each cluster contains 25 news articles related to the same topic. Participants are asked to generate summaries of at most 250 words for each cluster. For each cluster, a title and a narrative describing a user's information need are provided. The narrative is usually composed of a set of questions or a multi-sentence task description.

We present the results of our system in Table 1. We compare the results to the best peer (peer24) and to the best reported results on this data set by the PYTHY system [25]. In addition, we also give the results for the LSI and the Lead baselines.

  System     k    ROUGE-1   ROUGE-2   ROUGE-SU4
  PLSA-JS    192  0.43283   0.09698   0.15568
  PYTHY      -    -         0.096     0.147
  PLSA-COS   256  0.42444   0.09588   0.15409
  peer24     -    0.40980   0.09505   0.15464
  PLSA-KL    256  0.42956   0.09465   0.15474
  LSI        128  0.42155   0.08880   0.14938
  Lead       -    0.30217   0.04947   0.09788

  Table 1: DUC-06: ROUGE recall scores for the best number of latent topics k. PLSA-JS, -KL and -COS are system variants using the Jensen-Shannon divergence, symmetric KL-divergence, and cosine similarity respectively. Best LSI model based on a rank-k approximation with k = 128.

In the table, system PLSA-JS uses the Jensen-Shannon divergence as the similarity measure r(S, Q), PLSA-KL the symmetric KL-divergence and PLSA-COS the cosine similarity. The results are given for the empirically best value of the parameter k (number of latent topics) for each system variant. The system using the JS-divergence outperforms the best existing systems at k = 192 with a ROUGE-2 score of 0.09698, although the improvements for ROUGE-2 and ROUGE-SU4 are not significant at p < 0.05. ROUGE-1 scores are significantly better than the results reported by peer24. A comparison to the PYTHY system on ROUGE-1 scores was not possible as the authors do not specify this score for their system. All variants of our system outperform the LSI baseline on ROUGE-2.

4.2 DUC 2007

The multi-document summarization task in DUC-2007 is the same as in DUC-2006, with participants asked to produce 250 word multi-document summaries for a total of 45 document clusters. The results of our system are presented in Table 2.

ROUGE-2 and ROUGE-SU4 scores of our system are lower than those of the best system (peer15), but still very competitive, with the PLSA-JS variant ranking 5th for ROUGE-2 and 2nd for ROUGE-SU4 when compared to other participating systems. Again we see that all three system variants outperform the LSI baseline. We observe that both the PLSA-JS and the PLSA-COS variant require a much smaller number of latent classes than the LSI model for comparable ROUGE-2 results.

We can also see that the PLSA-JS variant outperforms peer15 on ROUGE-1, and achieves almost the same score as the top-performing system for ROUGE-SU4, with the differences in both cases not being significant. This suggests that the PLSA model can adequately capture the importance of individual words for ROUGE-1 recall, and word co-occurrences for ROUGE-SU4 skip-bigram recall. The ROUGE-2 score, on the other hand, is significantly lower than that of peer15. This indicates that the PLSA model, which was trained on the co-occurrence counts of individual words, could benefit from the inclusion of bigram co-occurrence counts.

4.3 Effect of system variations

Next, we look at the effect of varying the number of latent topics: For all systems we find that using less than k = 32 latent classes, the model cannot cope with [...]

5 Related work

[...] a single document, then picks the topics with the highest posterior probabilities p(z), and selects sentences with the highest likelihood p(s|z) within these topics for the summary. The approach produces generic summaries based on the most likely topics of the PLSA model. In contrast, our system focuses on query-oriented multi-document summarization, and models redundancy when creating the summary.

More closely related to our approach is recent work by [1], who employ Latent Dirichlet Allocation [4] to create multi-document summaries on DUC 2002 data. The authors report an improvement of ROUGE-1 recall scores over the best known DUC 2002 system. However, their approach is similar to the approach of [3] in being restricted to selecting sentences from the topics with the largest likelihoods. As compared to our approach, their system does not seem to perform any redundancy checking except for relying on the discriminative quality of the latent classes. Furthermore, our approach utilizes narrative and other meta-information of the document cluster to create not only generic, but also query-focused summaries.

6 Conclusion

We introduced an approach to query-focused multi-document summarization based on probabilistic latent semantic analysis. After training a PLSA model on the term-sentence matrix of document clusters from recent summarization tasks, we represent each sentence as a distribution over latent topics. Using this representation, we combine query-focused and thematic sentence features into an overall sentence score. Sentences are ranked and selected for the summary according to this score, choosing a greedy approach for sentence selection and penalizing redundancy with a maximum marginal relevance method.

Our results are among the best reported on the DUC-2006 and DUC-2007 multi-document summarization tasks for ROUGE-1, ROUGE-2 and ROUGE-SU4 scores. Our approach outperforms the previous best performing system on DUC 2006 data, although the improvements are not statistically significant. We have achieved these very competitive results using a simple unsupervised approach. The comparison with a system using latent semantic indexing shows that the PLSA model can better capture the sparse information contained in a sentence than a comparable LSI model.

We also studied the effect of different measures to compute sentence-level similarity features in the latent topic space. We found that using the Jensen-Shannon divergence resulted in the best ROUGE scores, as well as being very robust to changes of the number of latent classes.

In future research, we would like to extend our method with additional linguistic knowledge. Given that the unigram, bag-of-words approach ignores syntactic structure information, we would like to study the effect of including such information, by means of bigram or n-gram co-occurrence counts, in a PLSA model. The performance differences of our system in terms of ROUGE-2 as compared to ROUGE-1 and ROUGE-SU4 suggest that the model could benefit from including n-gram co-occurrences.
References

[1] R. Arora and B. Ravindran. Latent Dirichlet allocation based multi-document summarization. In Proc. of AND '08, pages 91–97, 2008.
[2] R. Barzilay and M. Elhadad. Using Lexical Chains for Text Summarization, pages 111–121. MIT Press, 1999.
[3] H. Bhandari, M. Shimbo, T. Ito, and Y. Matsumoto. Generic text summarization using probabilistic latent semantic indexing. In Proc. of IJCNLP 2008, 2008.
[4] D. M. Blei, A. Y. Ng, M. I. Jordan, and J. Lafferty. Latent Dirichlet allocation. JMLR, 3:993–1022, 2003.
[5] T. Brants, F. Chen, and I. Tsochantaridis. Topic-based document segmentation with probabilistic latent semantic analysis. In Proc. of CIKM '02, pages 211–218, 2002.
[6] J. Carbonell and J. Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proc. of SIGIR '98, pages 335–336, 1998.
[7] G. Erkan and D. Radev. LexRank: graph-based centrality as salience in text summarization. JAIR, 2004.
[8] J. Goldstein, V. Mittal, J. Carbonell, and M. Kantrowitz. Multi-document summarization by sentence extraction. In NAACL-ANLP 2000 Workshop on Automatic Summarization, pages 40–48, 2000.
[9] Y. Gong and X. Liu. Generic text summarization using relevance measure and latent semantic analysis. In Proc. of SIGIR '01, pages 19–25, 2001.
[10] S. Harabagiu and F. Lacatusu. Topic themes for multi-document summarization. In Proc. of SIGIR '05, pages 202–209, 2005.
[11] T. Hofmann. Probabilistic latent semantic indexing. In Proc. of SIGIR '99, pages 50–57, 1999.
[12] K. S. Jones. Automatic summarising: The state of the art. Inf. Process. Manage., 43(6):1449–1481, 2007.
[13] J. Leskovec, N. Milic-Frayling, and M. Grobelnik. Impact of linguistic analysis on the semantic graph coverage and learning of document extracts. In Proc. of AAAI '05, 2005.
[14] C.-Y. Lin and E. Hovy. The automated acquisition of topic signatures for text summarization. In Proc. of COLING '00, pages 495–501, 2000.
[15] C.-Y. Lin and E. Hovy. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proc. of NAACL-HLT 2003, pages 71–78, 2003.
[16] C.-Y. Lin and E. Hovy. The potential and limitations of automatic sentence extraction for summarization. In Proc. of the HLT-NAACL 2003 Workshop on Text Summarization, pages 73–80, 2003.
[17] H. Luhn. The automatic creation of literature abstracts. IBM J. of Research & Development, 1958.
[18] I. Mani and E. Bloedorn. Machine learning of generic and user-focused summarization. In Proc. of AAAI '98/IAAI '98, pages 820–826, 1998.
[19] D. Marcu. The rhetorical parsing of natural language texts. In Proc. of the 35th Annual Meeting of the ACL, pages 96–103, 1997.
[20] R. Mihalcea. Graph-based ranking algorithms for sentence extraction, applied to text summarization. In Proc. of ACL 2004, page 20, 2004.
[21] A. Nenkova, L. Vanderwende, and K. McKeown. A compositional context sensitive multi-document summarizer: exploring the factors that influence summarization. In Proc. of SIGIR '06, pages 573–580, 2006.
[22] M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.
[23] J. Steinberger, M. A. Kabadjov, M. Poesio, and O. Sanchez-Graillet. Improving LSA-based summarization with anaphora resolution. In Proc. of HLT-EMNLP '05, pages 1–8, 2005.
[24] S. Teufel and M. Moens. Sentence extraction as a classification task. In ACL/EACL-97 Workshop on Intelligent and Scalable Text Summarization, pages 58–65, 1997.
[25] K. Toutanova, C. Brockett, M. Gamon, J. Jagarlamudi, H. Suzuki, and L. Vanderwende. The PYTHY summarization system: Microsoft Research at DUC 2007. In Proc. of DUC 2007, 2007.