From Social Bookmarking to Social Summarization: An Experiment in Community-Based Summary Generation∗

Oisin Boydell, Barry Smyth
Adaptive Information Cluster
School of Computer Science and Informatics
University College Dublin
Belfield, Dublin 4
[email protected], [email protected]

ABSTRACT

We describe a novel document summarization technique that uses informational cues, such as social bookmarks or search queries, as the basis for summary construction by leveraging the snippet-generation capabilities of standard search engines. A comprehensive evaluation demonstrates how the social summarization technique can generate summaries that are of significantly higher quality than those produced by a number of leading alternatives.

ACM Classification: H.4 [Information Systems Applications]: Miscellaneous; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General terms: Design, Algorithms, Human Factors

Keywords: summarization, social bookmarks, click-through data, community, Web search

INTRODUCTION

The ability to effectively summarize a document — to accurately and concisely capture its key information — is an important area of research that is dominated by a wide range of techniques which employ language models of varying degrees of sophistication. These summarization techniques attempt to automatically capture the salient content from a document in order to present it to a human reader in a more condensed form. One of the problems with these traditional 'one size fits all' approaches, however, is that they place limited emphasis on the needs and preferences of the end users. While the resulting summaries may perform well in general terms — effectively extracting the core content of the document in question — they may not appeal to the needs and preferences of individual users or a community of users.

The Social Web

In this paper we suggest a novel approach to summarization that is inspired by the recent emergence of so-called Social Web services, in which communities of users are playing an increasingly important role when it comes to producing, enriching, organising, and facilitating access to Web content. For example, the rapid growth of Web logs (blogs) is just one example of the dynamic new world of user-generated content. We also find communities of users eager to contribute to existing content by submitting their own reviews and opinions. Sites like Amazon¹ and TripAdvisor² have learned to embrace this as a valuable source of social content, by allowing users to submit their reviews and opinions on consumer products (books, DVDs, etc. in the case of Amazon) or travel services (vacations, hotels, etc. in the case of TripAdvisor). Indeed, as we come to appreciate the willingness of users to participate in these types of services, a number of innovators have recognised the power of social interactions to drive entirely new types of social media. The social news site Digg³ is a case in point: by allowing users to submit and rate news stories found on the Web, Digg plays the role of a community-based news aggregator and in just 18 months has attracted a reader-base that is fast approaching that of the New York Times⁴. Services like Flickr⁵ and Del.icio.us⁶ harness a very different form of social content by encouraging users to label or tag content to facilitate the sharing of, and access to, said content. For instance, Flickr allows users to upload, tag, and share their photo libraries, whereas Del.icio.us allows users to manage and share their Web bookmarks. Importantly, these tagging services allow various users to express their views of content, by submitting the tags that they deem to be appropriate, leading to the development of an alternative content taxonomy (or folksonomy [1]) to serve as an alternative content index for facilitating search.

∗This material is based on works supported by Science Foundation Ireland under Grant No. 03/IN.3/I361.

¹http://www.amazon.com
²http://www.tripadvisor.com
³http://www.digg.com
⁴http://www.alexaholic.com/digg.com+nyt.com
⁵http://www.flickr.com/
⁶http://del.icio.us/

IUI'07, January 28–31, 2007, Honolulu, Hawaii, USA. Copyright 2007 ACM 1-59593-481-2/07/0001.


Towards Social Summarization

The point of all this is to emphasise how today's content consumers are no longer playing a passive role when it comes to accessing online content. Instead, there are more and more services that allow users to play more active roles, by providing feedback (opinions, annotations, ratings etc.) that can then be used to enrich or enhance the consumed content. In this work we argue that a combination of user feedback and Web search engines can be harnessed to drive an effective and efficient form of document summarization. The key to our idea is the use of search engines as a means of generating short, query-sensitive document summaries. Specifically, when we submit a query q to a search engine, each result r is accompanied by a so-called snippet text, a brief extract containing fragments of sentences from the document that are related to the target query. These snippets provide the raw material for document summaries: to generate a summary of document d we can look to the set of queries q1,...,qn that have been submitted to a search engine and that resulted in d being selected. The associated snippets, s1,...,sn, can then be combined to provide a document summary. The point is that the queries serve to identify key points of interest for users who have been interested in d and provide a skeleton around which a summary can be constructed.

Thus, by mining search logs we can leverage implicit user feedback (query-result selections) to construct social summaries. In this paper, however, we focus on an alternative form of feedback by demonstrating how the bookmarks, and their associated tags, in a social bookmarking service like Del.icio.us can be similarly harnessed. Consider a user u that has bookmarked a page p using some tags b. We can treat b as a query for the bookmarked page, and retrieve a snippet for p by submitting b to a search engine and locating the snippet associated with the result that corresponds to p.
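To make this harvesting step concrete, the sketch below gathers the snippets that a set of queries (or bookmark tags) produces for a given page. It is a minimal illustration, not the paper's implementation: the search function is a hypothetical stand-in for any engine that returns ranked (URL, snippet) result pairs.

```python
from typing import Callable, List, Tuple

# Hypothetical search API: query -> ranked list of (url, snippet) pairs.
SearchFn = Callable[[str], List[Tuple[str, str]]]

def harvest_snippets(page_url: str, queries: List[str], search: SearchFn) -> List[str]:
    """Collect the query-sensitive snippet produced for page_url by each query."""
    snippets = []
    for q in queries:
        for url, snippet in search(q):
            if url == page_url:  # this query leads to the selection of the page
                snippets.append(snippet)
                break
    return snippets
```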
Paper Summary

In this paper we describe the above technique in detail, focusing on how to construct a summary of the page by ranking the snippet fragments according to their relative importance. Furthermore, we demonstrate how the resulting summaries out-perform those produced by a number of leading benchmark summarization systems under a variety of experimental conditions. Finally, we consider a number of additional benefits of this social summarization technique, including the potential to produce community-focused and query-sensitive summaries and the role of these summaries as a form of enhanced, personalized result-snippets for search engines.

RELATED WORK

Automatic document summarization is a well established research area with a comprehensive programme of research providing a broad range of different approaches and techniques. While much work on summarization tends to focus on specialist types of document collections, such as news stories [14], technical papers [15] or medical documents [2], the rapid growth of the Web has motivated researchers to look at the summarization of more diverse Web content. In the following sections we will outline the main approaches to document summarization and discuss their relationship to the social summarization technique proposed in this paper.

Extraction vs. Abstraction

To begin with, there are two broad approaches to summarization: extraction or intrinsic summarization versus abstraction or extrinsic summarization. Extraction techniques attempt to summarize a document by identifying and extracting those parts of the document that are deemed to be the most important or salient, and the final summary is thus a collection of sentences or sentence fragments from the original document. In contrast, abstraction approaches do not preserve the original document content and instead prefer to paraphrase the source content to provide a more concise representation. In general, abstraction techniques have the potential to produce more condensed summaries than extraction techniques, but rely heavily on sophisticated natural language processing and generation. Extraction techniques, on the other hand, generally rely on shallow natural language techniques, usually relying on statistical term-counting as the basis for sentence selection and ordering.

One of the earliest automatic summarization techniques was described by [12] in 1958. Very briefly, a statistical approach was proposed in which the frequency of word occurrences in a document was used as an indicator of word significance. These significance indicators were then combined with positional information to obtain a sentence-level significance value so that a final summary could be produced from the top ranking sentences. This approach was refined in [6] with the inclusion of a range of additional document features for estimating the significance of each sentence. For example, structural elements such as title, subtitles and headings, as well as language features such as cue words, were all combined to guide a more sophisticated model of sentence scoring. Today, most commercial summarizers are still based on these basic approaches, with considerable research effort devoted to refinements such as the automatic learning of feature weights for sentence selection [9] and shallow language based approaches such as analysing the discourse structure of text to aid sentence selection [13].

Two extraction-based summarizers that are especially relevant to this work are the Open Text Summarizer (OTS) [17] and the MEAD summarizer [16], which we use as benchmark comparison systems in our evaluation. Both combine shallow NLP techniques with more conventional statistical word-frequency methods to produce document abstracts from high scoring sentences. For example, OTS incorporates NLP techniques via an English language lexicon with synonyms and cue terms as well as rules for stemming and parsing. These are used in combination with a statistical word frequency based method for sentence scoring. Similarly, MEAD harnesses statistical NLP data by using a database of English words and their corresponding inverse document frequency scores calculated from a large document corpus. Once again this information is combined with word occurrence and positional information to extract high scoring sentences.

Abstract-oriented summarization requires a more knowledge-rich approach, and the research can be divided into two main areas. Very briefly, the first relies heavily on syntactic parse trees for producing a structural condensate. For example, the work of [4] uses a model of topic progression derived from lexical chains.


The second approach also uses natural language processing, but the final source text representation is conceptual rather than syntactic. This semantic conceptual space is manipulated to eliminate redundant information, merge graphs and establish connectivity patterns to produce a conceptual condensate of the original text (see for example the work of [8]). In general, generating abstract summaries is a very challenging task, requiring a combination of natural language understanding and generation, and is beyond the scope of this work.

Web Page Summarization

With the advent of the World Wide Web, the need for document summarization has become more mainstream, and has brought new challenges to automatic summarization. As mentioned above, in the past many summarization techniques were carefully optimized for particular types of documents (news articles, scientific papers etc.). Such optimizations are often not feasible or appropriate in the more content-diverse world of the Web. That said, Web content introduces additional features which may assist and guide the summarization process. For instance, compared to a generic document, Web pages include information features beyond their core content, such as the structural information implicit in HTML mark-up. Moreover, Web pages do not exist in isolation, since the hyper-linked structure of the Web means that each document can be located within a network of inward and outward links. This connectivity information can also be used to guide summarization. The InCommonSense system [3] mines a Web page's context by extracting segments of text surrounding in-links to the page, followed by a filter process that chooses the most accurate segment to return as an extrinsic summary of the page. This contextual idea is elaborated on by [5], who look at combining this type of in-linking text with the original page content to produce a more sophisticated summary.

Particularly relevant to this paper is recent work on harnessing search engine click-through data to guide Web page summarization. For example, the work of [19] explains how two traditional approaches can be adapted to incorporate click-through information during extract-based summarization. Specifically, [19] proposes adapting Luhn's aforementioned sentence-selection algorithm by using both the local contents of a Web page and the query click-through data (queries submitted that have led to the selection of the target document) to modify the basic word selection metric, so that the frequency of a word in the document is combined with the frequency of the word in the queries to produce a hybrid word significance weight. A related approach is also used to adapt a Latent Semantic Analysis approach to summarization, originally described in [7]. Again, the weight of each query word is increased according to its frequency within the query collection. Both cases report significant improvements over the summary quality of non click-through based methods.

The work described in this paper is concerned with producing extract-based summaries of Web pages, and like the work described in [19] we too believe that interaction or usage data (such as search engine click-through data) can be used to good effect to generate high quality summaries of Web pages. The novelty of our approach, however, stems from the use of query-focused snippets in addition to the raw query terms. In addition, we focus on how this perspective can be adapted to utilize other types of interaction data, such as the terms used to annotate bookmarks in social bookmarking services such as Del.icio.us.

SOCIAL SUMMARIZATION

Social summarization is a method of producing intrinsic Web page summaries by using query-result selection pairs as a way to identify fragments of page content that may be combined to produce an effective page summary. The summaries are social in the sense that they are derived from the interactions of communities of users as they search. That being said, this technique is not limited to the mining of search logs, and in this section we will describe how our social summarization technique can harness the interaction information captured by social bookmarking services such as Del.icio.us.

The Core Idea

As mentioned in the introduction, our social summarization technique rests on three basic ideas:

1. A page p can be associated with a set of queries, $Q(p) = \{q_1, \ldots, q_n\}$, such that each $q_i$ has led to the selection of p among a given search engine's result-list;

2. For a given query, $q_i$, the search engine (SE) will produce a query-sensitive snippet, $S^{SE}(p, q_i)$, which contains a number of sentence fragments;

3. The social summary for p, $SS^{SE}(p)$, can be constructed from the combination of fragments associated with $Q(p)$, such that $SS^{SE}(p) = f\big(\bigcup_{q_i \in Q(p)} S^{SE}(p, q_i)\big)$, with the fragments rank-ordered according to their importance.⁷

⁷In the following we will drop the SE superscript for convenience, and refer to SS(p) and S(p, qi) without loss of generality.

In the following sections we will describe these stages in more detail, focusing on how individual snippet fragments are scored and ranked during the summarization process. But first, we will begin by describing how this methodology can also be applied to the interaction data that makes up social bookmarking services such as Del.icio.us.

From Bookmarks to Queries

Services like Del.icio.us allow users to maintain online collections of their favourite bookmarks, with each bookmarked page p associated with a set of terms, t1,...,tu, which make up the bookmark's tag. Thus a given page, p, which has been tagged by a set of users u1,...,uv, will be associated with a set of bookmark tags, b1,...,bv; each tag, bi, refers to the set of terms submitted by user ui when tagging p.
The set of p's tags, b1,...,bv, obviously refers to the different ways that the users who bookmarked p view its content; typically each tag contains 2-3 salient terms. As such, bookmark tags look very much like search queries — that is, $Q(p) = \{b_1, \ldots, b_v\}$ — and thus can be used to extract the necessary snippet texts for p during social summarization, as described below.
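As a small illustration of this mapping, the sketch below turns per-user tag sets into the query set Q(p). The dictionary layout is an assumption made for the example; it is not the Del.icio.us data format.

```python
from typing import Dict, List, Set

# Assumed layout: page URL -> one tag set per user who bookmarked it.
bookmarks: Dict[str, List[Set[str]]] = {
    "http://en.wikipedia.org/wiki/Java_platform": [
        {"java", "platform"},
        {"java", "virtual", "machine"},
    ],
}

def queries_for(page_url: str) -> List[str]:
    """Interpret each user's tag set b_i as a search query: Q(p) = {b_1, ..., b_v}."""
    return [" ".join(sorted(tags)) for tags in bookmarks.get(page_url, [])]
```

Each resulting query string can then be fed to the snippet-harvesting step sketched earlier.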


Generating a Social Summary

Given a bookmarked page, p, a set of bookmark tags for this page, b1,...,bv, and a standard Web search engine, SE, a social summary for p, SS(p), is generated in four basic steps: 1) extract the snippet texts, S(p, bi), to produce a set of sentence fragments; 2) normalise the sentence fragments to cope with fragment overlap and subsumption; 3) score each sentence fragment according to its frequency of occurrence across the snippets; and finally, 4) rank-order the normalised fragments to produce the final summary.
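Taken together, the four steps compose as in the outline below; the harvest, normalize, and rank callables stand for the routines sketched elsewhere in this section and are passed as parameters purely so that this outline stays self-contained.

```python
from typing import Callable, List

def summarize(page_url: str, tags: List[str],
              harvest: Callable[[str, List[str]], List[List[str]]],
              normalize: Callable[[List[List[str]]], List[List[str]]],
              rank: Callable[[List[List[str]]], List[str]]) -> str:
    """Steps 1-4: extract snippets, normalise fragments, then score and rank-order."""
    snippets = harvest(page_url, tags)   # step 1: one fragment list per tag/query
    normalized = normalize(snippets)     # step 2: resolve overlap and subsumption
    ranked = rank(normalized)            # steps 3-4: frequency scoring and ordering
    return " ... ".join(ranked)
```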

Snippet Extraction

Harvesting the snippet texts that form the raw material for the social summaries is relatively straightforward. In principle, each bi for p is submitted as a search engine query, p is located within the search engine's result-list, and its snippet text is recorded; see Equation 1. There are a couple of issues to consider here. First, there is no guarantee that p will appear near the top of the result-list for bi. In fact, there is no guarantee that p will even be retrieved for bi by the search engine, as it is not unusual for users to tag their bookmarks using terms that are not present in the bookmarked page. In practice, we limit the search for p to the top k results returned by the search engine for each bi; SE_k(bi) refers to these top k results. This limits the cost of snippet extraction to O(n·k), but does mean that certain bookmark tags may not lead to snippets.

$$S_k(p, b_1, \ldots, b_n) = \bigcup_{\forall b_i : p \in SE_k(b_i)} S(p, b_i) \qquad (1)$$

Of course an alternative snippet extraction approach is feasible if one has direct access to a given search engine's snippet generator, in which case the snippet for page p given a query can be obtained directly. In fact, in the evaluation described below we draw on the query-sensitive snippet extraction library from the Apache Foundation's Lucene project⁸ to do this.

⁸http://lucene.apache.org/

Normalizing Sentence Fragments

More formally, each snippet is composed of a set of m sentence fragments (Equation 2) that have been extracted from the text of the target page by the search engine. In general, for a given set of queries we can expect to generate a large collection of sentence fragments. Some of these fragments may be identical to each other, some fragments might subsume other fragments, and many fragments will overlap.

$$S(p, b_i) = \bigcup_{j = 1 \ldots m} S(p, b_i, j) \qquad (2)$$

Our final summaries will be produced directly from these sentence fragments, and significant overlaps will have an impact on summary quality; there is little to be gained from including two fragments in a summary that are all but identical, for example. While we could process the summaries to eliminate this type of redundancy at summary formation time, we choose instead to eliminate redundancy by producing a normalized set of fragments prior to summary formation. More formally, matching sentence fragments are identified according to Equations 3 and 4. If there is a significant overlap between fragments (in practice, t = 0.8 works well) then the shorter fragment is said to be dominated by the longer fragment; see Equation 5. To normalize the fragments across the snippets for page p we replace all dominated fragments with copies of their maximally dominating partners (Equation 6). In what follows we use $S'(p, b_j, y)$ to refer to the normalized version of $S(p, b_j, y)$.

$$overlap(S(p, b_i, x), S(p, b_j, y)) = \frac{|S(p, b_i, x) \cap S(p, b_j, y)|}{\max(|S(p, b_i, x)|, |S(p, b_j, y)|)} \qquad (3)$$

$$match?(S(p, b_i, x), S(p, b_j, y)) = \begin{cases} 1 & \text{if } overlap(S(p, b_i, x), S(p, b_j, y)) \geq t \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$

$$dominates?(S(p, b_i, x), S(p, b_j, y)) = \begin{cases} 1 & \text{if } match?(S(p, b_i, x), S(p, b_j, y)) \land |S(p, b_i, x)| > |S(p, b_j, y)| \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$

$$S'(p, b_j, y) = \begin{cases} S(p, b_i, x) & \text{if } dominates?(S(p, b_i, x), S(p, b_j, y)) \\ S(p, b_j, y) & \text{otherwise} \end{cases} \qquad (6)$$
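A minimal sketch of this normalization step follows. It assumes fragments are compared as bags of lower-cased terms and that fragment size (the |·| in Equation 5) is measured in terms; both are illustrative choices that the text leaves open.

```python
from typing import List

T = 0.8  # overlap threshold t; the paper reports t = 0.8 works well

def overlap(frag_a: str, frag_b: str) -> float:
    """Equation 3: shared terms relative to the larger fragment."""
    a, b = set(frag_a.lower().split()), set(frag_b.lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) / max(len(a), len(b))

def dominates(frag_a: str, frag_b: str) -> bool:
    """Equations 4-5: frag_a dominates frag_b if they match and frag_a is larger."""
    return (overlap(frag_a, frag_b) >= T
            and len(frag_a.split()) > len(frag_b.split()))

def normalize(snippets: List[List[str]]) -> List[List[str]]:
    """Equation 6: replace each dominated fragment with its maximally dominating partner."""
    pool = [f for snippet in snippets for f in snippet]
    normalized = []
    for snippet in snippets:
        replaced = []
        for f in snippet:
            dominators = [g for g in pool if dominates(g, f)]
            # keep f unless some larger, heavily-overlapping fragment exists
            replaced.append(max(dominators, key=lambda g: len(g.split()))
                            if dominators else f)
        normalized.append(replaced)
    return normalized
```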

Scoring Fragments

For a page p we now have a set of snippets (generated from queries over p), each made up of a normalized set of sentence fragments. Intuitively, it seems reasonable to assume that fragments which occur more frequently are likely to be more important; after all, they are associated with page segments that are linked to the common ways (queries or bookmark tags) that users refer to p. In this way the scoring model favours aspects of pages that many users are interested in, and these aspects will be more prominent in the resulting social summaries. Accordingly, we can compute the score of some fragment, f, as the number of times that f occurs in the snippets generated for p; see Equation 7.

$$score(f) = \sum_{i = 1 \ldots v} occurs?(f, S'(p, b_i)) \qquad (7)$$

$$occurs?(f, S'(p, b_i)) = \begin{cases} 1 & \text{if } f \in \{S'(p, b_i, 1), \ldots, S'(p, b_i, m)\} \\ 0 & \text{otherwise} \end{cases} \qquad (8)$$
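In code, the frequency score of Equations 7 and 8 reduces to counting the normalized snippets that contain a fragment; this sketch assumes the list-of-lists snippet representation used above.

```python
from typing import List

def occurs(fragment: str, normalized_snippet: List[str]) -> int:
    """Equation 8: 1 if the fragment appears in the normalized snippet, else 0."""
    return int(fragment in normalized_snippet)

def score(fragment: str, normalized_snippets: List[List[str]]) -> int:
    """Equation 7: number of normalized snippets for p that contain the fragment."""
    return sum(occurs(fragment, s) for s in normalized_snippets)
```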


Fragment Ranking & Summary Generation

Producing a final summary from the scored, normalised snippet fragments is now straightforward. First, compute the union of all of the normalised fragments (see Equation 9). Second, rank-order these fragments in descending order of their frequency scores, as shown in Equation 10.

$$Frags(p, b_1, \ldots, b_v) = \bigcup_{\forall i, j} S'(p, b_i, j) \qquad (9)$$

$$SS(p) = \{ f_i : 1 \leq i \leq |Frags(p, b_1, \ldots, b_v)| \ \land\ \forall i < |Frags(p, b_1, \ldots, b_v)|,\ score(f_i) \geq score(f_{i+1}) \} \qquad (10)$$
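Putting the pieces together, a sketch of this final ranking step (building on the normalize and score routines above); the optional max_frags cut-off is an illustrative way to truncate summaries to a target length, as done in the experiments below.

```python
from typing import List, Optional

def social_summary(normalized_snippets: List[List[str]],
                   max_frags: Optional[int] = None) -> List[str]:
    """Equations 9-10: pool all normalized fragments, then rank by frequency score."""
    frags = {f for snippet in normalized_snippets for f in snippet}      # Equation 9
    freq = {f: sum(f in s for s in normalized_snippets) for f in frags}  # Equation 7
    ranked = sorted(frags, key=lambda f: freq[f], reverse=True)          # Equation 10
    return ranked[:max_frags] if max_frags is not None else ranked
```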

An Example Social Summary

Figure 1 shows an overview of social summary generation for a portion of the Wikipedia page "Java Platform", using the queries 'java platform' and 'java virtual machine'. The snippet produced by a Web search engine for each query q1 and q2 is composed of text fragments extracted from the source document which are related to the query terms. For example, in Figure 1(a) and (b) we see that the snippet for query q2 = "java platform" is composed of three fragments from the source text:

• f1 = "The Java platform is the name for a bundle of related programs, or platform, from Sun Microsystems"
• f2 = "Java Platform (formerly Java 2 Platform[1])"
• f3 = "the current version of the Java Platform is alternatively specified as version 1.5 or version 5.0 or version 5"

The union of all such fragments from the two snippets generated for q1 and q2, namely $S^{SE}(p, q_1)$ and $S^{SE}(p, q_2)$, provides the core content for the social summary, and the fragments are scored and rank-ordered according to the method described above to produce the final summary, $SS^{SE}(p)$ (Figure 1(c)).

EVALUATION

So far we have described an approach to document summarization that harnesses external information cues, such as search queries or bookmark tags, to guide the summarization process. This social summarization technique constructs a summary by selecting fragments of sentences that recur within the snippets generated from these information cues, thereby tailoring the final summary according to the ways in which the page is used in practice. In this section we evaluate the quality of the summaries generated by our social summarization technique against comparable summaries generated by two leading benchmark systems, OTS [17] and MEAD [16].

Setup

Test Data: To begin with we will explain the test data, including: a collection of documents to summarize; a set of gold-standard summaries to compare against the automatic summaries; and a set of information cues to use as queries over the document collection.

As mentioned previously, we use data from the Del.icio.us social bookmarking service for this study. To begin with we downloaded a sample of 3781 bookmarked pages and their most recent bookmark tags, up to a maximum of 50 per page; these tags are the information cues that we will use to generate the snippets used by our summarizer (SS). Furthermore, 1386 of these pages contained description text embedded within the HTML meta-content description tag. This facility optionally allows a page author to provide a brief summary description of the page in question and, for the purposes of this experiment, plays the role of a gold-standard (human-generated) summary. These 1386 pages, their meta descriptions, and their recent tags form the test data used in this study.

Methodology: For each page we generate 3 different summary types from the visible page content; note that this means no meta data or HTML content or structural information is made available to any of the summarizers. Each SS summary is generated according to the approach described in this paper. Note that for the purpose of this experiment, rather than relying on an existing commercial search engine to produce snippets, we used the Lucene snippet generator from the Apache Foundation, configuring it to produce snippets of similar length and number of fragments to the main Web search engines such as Google, Yahoo and MSN Search. For each SS summary we also generate a comparable OTS and MEAD summary of the same length for the purpose of comparison. Moreover, we generate SS summaries under a number of different conditions, varying parameters such as the number of queries used to generate the SS summary and the target length of the summary, as discussed below.

Evaluation Metrics: When comparing each automatically generated summary to the corresponding gold-standard we used the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) package [10]. ROUGE measures summarization quality by counting overlapping n-grams, word sequences, and word pairs between the candidate and gold-standard summaries, a common approach that has been shown to correlate very well with human evaluations [11]. According to [11], the ROUGE-1 (unigram co-occurrence) metric is highly effective for single-document summarization tasks and for the evaluation of short summaries, so we used this in our experiments along with ROUGE-L, which is based on the longest common subsequence between candidate and gold-standard summaries.
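The experiments use the official ROUGE package [10]; purely as an illustration of what the headline metric measures, a simplified ROUGE-1 recall computation might look like the following (the real package adds options such as stemming and stop-word removal).

```python
from collections import Counter

def rouge1_recall(candidate: str, reference: str) -> float:
    """Clipped unigram overlap, normalized by reference length (ROUGE-1 recall, simplified)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    if not ref:
        return 0.0
    hits = sum(min(cand[w], n) for w, n in ref.items())
    return hits / sum(ref.values())
```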
Experiment 1 - A Comparison of Summary Quality

To begin with we look at overall summary quality by comparing the summaries produced for each page by each of the 3 techniques (SS, OTS, MEAD), with SS using the full complement of tags/queries retrieved for the page in question; this means that all fragments occur in the final SS summary. Note that the average length of the SS summaries was 24% of the original. In each case we compare the resulting summaries to the gold-standard using the ROUGE measures.

The results are presented in Figure 2 and show a clear benefit to SS across all evaluation metrics. For example, we see that SS achieves a relative improvement in its precision, recall, and F-measure scores over OTS of between 31% and 39%; for MEAD the relative advantage enjoyed by SS is between 24% and 29%.


Figure 1: An overview of social summarization. (a) A page p is associated with a set of queries used to access p (or tags used to bookmark p), in this case q1 and q2. (b) For each query, the search engine will produce a query-focused snippet, $S^{SE}(p, q_1)$ and $S^{SE}(p, q_2)$, composed of sentence fragments from the page content that are related to the queries. (c) Fragments are scored and rank-ordered to produce the final social summary, $SS^{SE}(p)$.


Figure 2: Overall summary quality in terms of precision (P), recall (R), and F-measure (F), under ROUGE-1 and ROUGE-L, for SS, OTS and MEAD summarizers.

Figure 3: Recall scores for SS, OTS, and MEAD summarizers generating summaries of different lengths.

In all cases the differences between SS and the benchmark summarizers are significant at the 95% confidence level, with the appropriate error bars shown in the figure.

Experiment 2 - Summary Length vs. Quality

One of the main motivations behind our work is the desire to produce competent, compact summaries of Web pages for use in applications such as search result summarization or converting content for small-screen mobile devices. As such we are interested in generating highly compact summaries. In this experiment we consider the quality of summaries of different lengths, produced by eliminating low-scoring fragments from the final social summary. The experiment above generated summaries with an average length of 24% of the source document; in this experiment we look at summaries that are 10%, 20%, 30%, 40% and 50% of the original document length.

One point to note here is that we cannot always generate a social summary above a certain length for a given document, because we only focus on a fixed set of queries during summarization and these available queries may not lead to snippets that provide broad coverage of the document. Thus when reporting quality results below we highlight how many documents were summarized for each size. Otherwise the experiment proceeded in the usual way: for each document we generated SS, OTS and MEAD summaries of size k% and evaluated the result, relative to the gold-standard summaries, using the various ROUGE metrics.

The results are presented in Figure 3. For clarity we only show the recall results in this experiment, because we are especially interested in demonstrating how, as each technique creates longer summaries, these summaries cover more and more of the gold-standard concepts and content. Thus, the graph presents the ROUGE-1 and ROUGE-L recall scores for each of the 3 summarizers when generating summaries that are 10% to 50% of the original document size, in increments of 10%. Note that the numbers in brackets along the x-axis denote the number of documents tested for each summary size.

As expected, as summary size increases we see a gradual improvement in the recall score for each technique. Once again the results point to a clear advantage for the SS approach, which produces summaries of significantly higher recall across all sizes. Moreover, the relative advantage of SS is largest for more compact summaries. For example, when generating summaries that are 10% of the original document, the SS approach produces summaries that are 43-45% better than those produced by OTS and 35%-37% better than those produced by MEAD, in terms of recall. Of course, as expected, we do see the quality of the OTS and MEAD summaries improving as summary size increases, but it is worth noting that both OTS and MEAD require significantly larger summaries to achieve the recall of the 10% SS summaries; in both cases summary size must reach about 30% before OTS or MEAD match the recall scores achieved by the 10% SS summaries.

Although we have just presented the recall data in this experiment, similar trends are found for the corresponding precision and F-measure scores. In each case we find that the SS method continues to significantly out-perform OTS and MEAD across all summary sizes.

Experiment 3 - Search Activity vs. Quality

So far we have demonstrated how the SS technique can generate summaries of superior quality to OTS and MEAD by using information cues, such as social bookmark tags (or indeed search queries), as a basis for fragment identification and selection. In this section we consider the influence of these information cues on summary quality. In particular, we consider the relationship between the number of available cues (bookmark tags, in this case) and summary quality. To investigate this we use different-sized subsets of queries as the basis for SS summary generation, focusing on query sets of size 1-10, 11-20, 21-30, 31-40, and 41-50 queries. For each page we generated SS summaries using different numbers of queries selected at random from those available, producing nearly 25,000 different summaries in total.

To begin with, it is worth speaking briefly about the relationship between average SS summary size and the number of unique queries/bookmarks available for use during summarization. As the number of available queries increases there is a tendency towards larger social summaries; in general, more queries means greater coverage of document content.


Bearing this in mind, we present the recall scores for SS summaries generated for various numbers of available queries in Figure 4, alongside the recall scores for the equivalent OTS and MEAD summaries. As expected, recall is seen to increase across the board, because more queries means larger summaries. Crucially, we see that SS performs well even with relatively modest numbers of queries; with no more than 10 queries, SS can generate summaries with recall scores in excess of 35%. And once again we observe a clear advantage for SS, which generates summaries whose recall scores are consistently and significantly better than those for OTS and MEAD across all conditions. Indeed, if we average the recall scores for each technique across the various query-set sizes, we find that SS produces summaries with recall scores that are 31% better than the OTS summaries and approximately 28% better than the MEAD summaries.

Figure 4: Recall scores for SS, OTS, and MEAD summarizers generating summaries according to the availability of different numbers of queries/bookmarks.

Finally, it is also worth pointing out that although precision and F-measure results are not presented here, we once again find our social summaries representing a significant improvement when compared to OTS and MEAD summaries. For example, SS summaries generated using 1-10 queries show a relative improvement in precision of over 50% for ROUGE-1 and nearly 55% for ROUGE-L compared to OTS, and a 36% precision improvement for both ROUGE-1 and ROUGE-L in comparison to MEAD. These improvements are consistent across all numbers of queries.

DISCUSSION

Query-Focused Social Summaries

So far we have considered the generation of a social summary for a particular page, SS(p), regardless of context. This technique can be readily adapted to provide a means of generating a more focused social summary that is informed by the context provided by some target user query, SS(p, qT). This is especially applicable in the context of Web search, for example, and could be used to generate query-focused social summaries to be used as enhanced snippet texts.

The work of [20] shows the advantages of standard query-focused snippets in information retrieval. These are, of course, the very types of snippets that we rely on at the heart of our social summarization technique, and they are generated with respect to the current target query. In contrast, we propose combining the snippets and text fragments that are associated with queries that are similar to qT. Moreover, when scoring a particular fragment we can give more weight to fragments that are associated with queries that are more similar to qT.

$$Sim(q_T, q_C) = \frac{|q_T \cap q_C|}{|q_T \cup q_C|} \qquad (11)$$

$$score(f, q_T) = \sum_{i = 1 \ldots v} occurs?(f, S'(p, q_i)) \cdot Sim(q_T, q_i) \qquad (12)$$

For example, consider a page p which has been selected for searches for queries q1,...,qv. Equation 11 presents a simple measure of term overlap as a query similarity metric, and this is used in Equation 12 as a modified version of our previous fragment scoring metric. This time the score that a fragment f accumulates depends not only on its frequency of occurrence within the various snippets, but also on the similarity between the target query and the queries that led to these snippets.
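A sketch of this query-focused scoring, assuming queries are compared as sets of lower-cased terms and that each query's normalized snippet is available as a list of fragments (the snippets_by_query layout is illustrative):

```python
from typing import Dict, List

def sim(q_target: str, q_candidate: str) -> float:
    """Equation 11: term overlap (Jaccard) between two queries."""
    a, b = set(q_target.lower().split()), set(q_candidate.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def query_focused_score(fragment: str, q_target: str,
                        snippets_by_query: Dict[str, List[str]]) -> float:
    """Equation 12: occurrence count weighted by similarity to the target query."""
    return sum(sim(q_target, q_i)
               for q_i, snippet in snippets_by_query.items()
               if fragment in snippet)
```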

A query-focused summary can then be produced in the normal way, by rank-ordering the fragments in descending order of their scores. In turn, summaries of arbitrary length (up to the total number of available fragments) can be generated by truncating the summary after the top k fragments. This approach can be used to associate query-focused social summaries with search engine results in place of traditional query-focused snippets. Indeed, the approach also facilitates the generation of social summaries where the size of the summary correlates with the predicted relevance of the search result. For example, top-ranking results may be associated with longer (more detailed) social summaries than lower-ranking results.

Community-Focused Social Summaries

One of the central ideas behind this work concerns the one-size-fits-all approach to snippet generation in traditional search engines. Specifically, when two different users submit the same query to a search engine, not only do they receive the same results in return, but they also receive the same snippets for these results, regardless of their interests or preferences. For example, consider a motoring enthusiast searching for 'jaguar parts'. At the time of writing, one of Google's top results was for a UK parts specialist supplying, according to the snippet, "Genuine Jaguar, Land Rover and Range Rover OEM and brand name aftermarket parts". Clearly this snippet has been generated with reference to the target query and no doubt is likely to appeal to many searchers using this query. But consider a searcher who is interested in finding a parts supplier to provide them with rare parts for their ongoing restoration of their classic Jaguar S Type. This searcher uses the same query but is not interested in most parts suppliers, only those that deal or specialise in classic components. As it turns out, the UK supplier above does deal in this niche market, but of course the query-sensitive snippet does not reflect this and so the result may be passed over by our searcher.


Community-Based Search

Recent work in the area of community-based search has demonstrated how the search results of a generic search engine such as Google, Yahoo or MSN Search can be adapted according to the learned preferences of a community of like-minded users [18]. Specifically, an approach called Collaborative Web Search (CWS) has been proposed that is capable of mining the search history of a community of users in order to identify results that can be promoted in line with inferred community interests. To date this work has focused on the identification and promotion of results, with no attention paid to the snippets associated with promoted results, for example. We believe that our social summarization technique can be used to generate query-focused snippets that better reflect the niche needs of a particular community of searchers.

Generating Community-Focused Snippets

To generate a snippet for page p and community C, relative to some target query, qT, we first identify those queries that have led to the past selection of p by community members and that are similar to qT. The required snippet can then be generated using the query-focused technique described above. The set of similar queries is localised to those used by the community of searchers and so will better reflect the community's perspective of relevant page content. In this way, a community of classic-car enthusiasts might receive a more relevant snippet such as "The one-stop-shop for genuine restoration Jaguar parts for all classic models including S Type, X Type, X300 - XJR, ..." for their 'jaguar parts' query, if the page in question has also been selected by the community for a range of related queries such as 'classic jaguar parts', 'jaguar restoration', etc.
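A sketch of this community localisation step, reusing the Equation 11 similarity; the community_log layout (query → pages selected by community members) and the similarity threshold are assumptions made for illustration.

```python
from typing import Dict, List

def sim(q_a: str, q_b: str) -> float:
    """Equation 11: term overlap (Jaccard) between two queries."""
    a, b = set(q_a.lower().split()), set(q_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def community_snippet(page: str, q_target: str,
                      community_log: Dict[str, List[str]],      # query -> pages selected
                      snippets_by_query: Dict[str, List[str]],  # query -> normalized fragments
                      sim_threshold: float = 0.3) -> List[str]:
    """Rank fragments from community queries that selected the page and resemble q_target."""
    related = [q for q, pages in community_log.items()
               if page in pages and sim(q_target, q) >= sim_threshold]
    frags = {f for q in related for f in snippets_by_query.get(q, [])}

    def weight(f: str) -> float:
        # Equation 12, restricted to the community's similar queries
        return sum(sim(q_target, q) for q in related
                   if f in snippets_by_query.get(q, []))

    return sorted(frags, key=weight, reverse=True)
```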
The point is that we can use this data as static summarizer such as OTS. a travel-related search community by interpreting the book- mark tags as queries for corresponding pages. CONCLUSION In this paper we have described an approach to summariza- Next we need a gold-standard summary to compare to each tion, which we call social summarization, that uses the differ- community-focused social summary. However, this time the ent ways that users refer to documents (e.g., search queries gold-standard also must be a community-focused summary or bookmark tags) as cues for the summarization process. and so static meta data will not suffice. Instead, for each These cues can be used to identify relevant fragments of text page we divide its ‘queries’ (bookmark tags) into a training within a document; for example, the standard snippet gen- set and a test set. The former are used to populate the com- eration techniques used by modern search engines perform munity history used for the generation of social summaries this very task. In turn, we have described how a social sum- for each page. Thus, for each page in the training set we gen- mary can be constructed from the these fragments after they erated a social summary using its most popular query as the have been scored and ranked according to their distribution target query. To evaluate the result summary we compute its frequencies.


We have compared this approach to summarization against two leading benchmark summarization systems (OTS and MEAD), using social bookmarking information as interest cues. The results clearly highlight the potential benefits of this social summarization technique, which is seen to produce higher-quality summaries, relative to a human-generated gold-standard, than either benchmark, under a wide variety of experimental conditions. We have also described how this approach can be used to generate query-focused social summaries that may be applied to search engine result summarization to provide searchers with improved result-snippet summaries. And we have speculated that this same technique can be used to generate community-focused summaries — summaries that better reflect the needs of communities of like-minded users than those produced by traditional techniques — providing preliminary experimental data to support this claim.

REFERENCES

1. Harris Wu, Mohammad Zubair, and Kurt Maly. Harvesting social knowledge from folksonomies. In HYPERTEXT '06: Proceedings of the seventeenth conference on Hypertext and hypermedia, pages 111–114, New York, NY, USA, 2006. ACM Press.

2. Stergos D. Afantenos, Vangelis Karkaletsis, and Panagiotis Stamatopoulos. Summarization from medical documents: a survey. Artificial Intelligence in Medicine, 33(2):157–177, 2005.

3. Einat Amitay and Cécile Paris. Automatically summarising web sites - is there a way around it? In CIKM, pages 173–179. ACM, 2000.

4. R. Barzilay and M. Elhadad. Using lexical chains for text summarization. In ACL'97/EACL'97 Workshop on Intelligent Scalable Text Summarization, Madrid, Spain, May 1997.

5. Jean-Yves Delort, Bernadette Bouchon-Meunier, and Maria Rifqi. Enhanced web document summarization using hyperlinks. In Hypertext, pages 208–215. ACM, 2003.

6. H. P. Edmundson. New methods in automatic extracting. Journal of the ACM, 16(2):264–285, 1969.

7. Yihong Gong and Xin Liu. Generic text summarization using relevance measure and latent semantic analysis. In SIGIR, pages 19–25. ACM, 2001.

8. U. Hahn and U. Reimer. Knowledge-based text summarization: Salience and generalization operators for knowledge base abstraction. The MIT Press, 1999.

9. Julian Kupiec, Jan O. Pedersen, and Francine Chen. A trainable document summarizer. In SIGIR, pages 68–73. ACM Press, 1995.

10. Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. In Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004), 2004.

11. Chin-Yew Lin and Eduard H. Hovy. Automatic evaluation of summaries using n-gram co-occurrence statistics. In HLT-NAACL, 2003.

12. H. P. Luhn. The automatic creation of literature abstracts. IBM Journal of Research and Development, 2:159–165, 1958.

13. Daniel Marcu. The theory and practice of discourse parsing and summarization. The MIT Press, 2000.

14. Kathleen McKeown and Dragomir R. Radev. Generating summaries of multiple news articles. In SIGIR, pages 74–82. ACM Press, 1995.

15. Chris D. Paice and Paul A. Jones. The identification of important concepts in highly structured technical papers. In SIGIR, pages 69–78. ACM, 1993.

16. Dragomir Radev, Timothy Allison, Sasha Blair-Goldensohn, John Blitzer, Arda Çelebi, Stanko Dimitrov, Elliott Drabek, Ali Hakim, Wai Lam, Danyu Liu, Jahna Otterbacher, Hong Qi, Horacio Saggion, Simone Teufel, Michael Topper, Adam Winkel, and Zhu Zhang. MEAD - a platform for multidocument multilingual text summarization. In LREC 2004, Lisbon, Portugal, May 2004.

17. Nadav Rotem. The Open Text Summarizer. http://libots.sourceforge.net/, 2003.

18. Barry Smyth, Evelyn Balfe, Jill Freyne, Peter Briggs, Maurice Coyle, and Oisín Boydell. Exploiting query repetition and regularity in an adaptive community-based web search engine. User Modeling and User-Adapted Interaction, 14(5):383–423, 2004.

19. Jian-Tao Sun, Dou Shen, Hua-Jun Zeng, Qiang Yang, Yuchang Lu, and Zheng Chen. Web-page summarization using clickthrough data. In SIGIR, pages 194–201. ACM, 2005.

20. Anastasios Tombros and Mark Sanderson. Advantages of query biased summaries in information retrieval. In SIGIR, pages 2–10. ACM, 1998.
