From Social Bookmarking to Social Summarization: An Experiment in Community-Based Summary Generation∗

Oisin Boydell, Barry Smyth
Adaptive Information Cluster
School of Computer Science and Informatics
University College Dublin
Belfield, Dublin 4
[email protected] [email protected]

ABSTRACT
We describe a novel document summarization technique that uses informational cues, such as social bookmarks or search queries, as the basis for summary construction by leveraging the snippet-generation capabilities of standard search engines. A comprehensive evaluation demonstrates how the social summarization technique can generate summaries that are of significantly higher quality than those produced by a number of leading alternatives.

ACM Classification: H.4 [Information Systems Applications]: Miscellaneous; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General terms: Design, Algorithms, Human Factors

Keywords: summarization, social bookmarks, click-through data, community, Web search

INTRODUCTION
The ability to effectively summarize a document, that is, to accurately and concisely capture its key information, is an important area of research that is dominated by a wide range of techniques which employ language models of varying degrees of sophistication. These summarization techniques attempt to automatically capture the salient content from a document in order to present it to a human reader in a more condensed form, but one of the problems of these traditional ‘one size fits all’ approaches is that there is limited emphasis on the needs and preferences of the end users. While the resulting summaries may perform well in general terms, effectively extracting the core content of the document in question, they may not appeal to the needs and preferences of individual users or a community of users.

The Social Web
In this paper we suggest a novel approach to summarization that is inspired by the recent emergence of so-called Social Web services, in which communities of users are playing an increasingly important role when it comes to producing, enriching, organising, and facilitating access to Web content. The rapid growth of Web logs (blogs) is just one example of the dynamic new world of user-generated content. We also find communities of users eager to contribute to existing content by submitting their own reviews and opinions. Sites like Amazon¹ and TripAdvisor² have embraced this as a valuable source of social content for some time now, by allowing users to submit their reviews and opinions on consumer products (books, DVDs, etc. in the case of Amazon) or travel services (vacations, hotels, etc. in the case of TripAdvisor).

Indeed, as we come to appreciate the willingness of users to participate in these types of services, a number of innovators have recognised the power of social interactions to drive entirely new types of social media. The news aggregator Digg³ is a case in point: by allowing users to submit and rate news stories found on the Web, Digg plays the role of a community-based news aggregator and in just 18 months has attracted a reader-base that is fast approaching that of the New York Times⁴. Services like Flickr⁵ and Del.icio.us⁶ harness a very different form of social content by encouraging users to label or tag content to facilitate the sharing of, and access to, said content. For instance, Flickr allows users to upload, tag, and share their photo libraries, whereas Del.icio.us allows users to manage and share their Web bookmarks. Importantly, these tagging services allow various users to express their views of content, by submitting the tags that they deem to be appropriate, leading to the development of an alternative content taxonomy (or folksonomy [1]) to serve as an alternative content index for facilitating search.

∗ This material is based on works supported by Science Foundation Ireland under Grant No. 03/IN.3/I361.
¹ http://www.amazon.com
² http://www.tripadvisor.com
³ http://www.digg.com
⁴ http://www.alexaholic.com/digg.com+nyt.com
⁵ http://www.flickr.com/
⁶ http://del.icio.us/

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
IUI’07, January 28–31, 2007, Honolulu, Hawaii, USA.
Copyright 2007 ACM 1-59593-481-2/07/0001 ...$5.00.

Towards Social Summarization
The point of all this is to emphasise how today’s content consumers are no longer playing a passive role when it comes to accessing online content. Instead, there are more and more services that allow users to play more active roles, by providing feedback (opinions, annotations, ratings, etc.) that can then be used to enrich or enhance the consumed content. In this work we argue that a combination of user feedback and Web search engines can be harnessed to drive an effective and efficient form of document summarization. The key to our idea is the use of search engines as a means of generating short, query-sensitive document summaries. Specifically, when we submit a query q to a search engine, each result r is accompanied by a so-called snippet text, a brief extract containing fragments of sentences from the document that are related to the target query. These snippets provide the raw material for document summaries: to generate a summary of document d we can look to the set of queries q1, ..., qn that have been submitted to a search engine and which resulted in d being selected. The associated snippets, s1, ..., sn, can then be combined to provide a document summary. The point is that the queries serve to identify key points of interest for users who have been interested in d and provide a skeleton around which a summary can be constructed.

Thus, by mining search logs we can leverage implicit user feedback (query-result selections) to construct social summaries. In this paper, however, we focus on an alternative form of feedback by demonstrating how the bookmarks, and their associated tags, in a social bookmarking service like Del.icio.us can be similarly harnessed. Consider a user u that has bookmarked a page p using some bookmark tags b. We can treat b as a query for the bookmarked page, and retrieve a snippet for p by submitting b to a search engine and locating the snippet associated with the result that corresponds to p.

Paper Summary
In this paper we describe the above technique in detail, focusing on how to construct a summary of the page by ranking the snippet fragments according to their relative importance. Furthermore, we demonstrate how the resulting summaries out-perform those produced by a number of leading benchmark summarization systems under a variety of experimental conditions. Finally, we consider a number of additional benefits of this social summarization technique, including the potential to produce community-focused and query-sensitive summaries and the role of these summaries as a form of enhanced, personalized result-snippets for search engines.

Extraction vs. Abstraction
To begin with, there are two broad approaches to summarization: extraction or intrinsic summarization versus abstraction or extrinsic summarization. Extraction techniques attempt to summarize a document by identifying and extracting those parts of the document that are deemed to be the most important or salient, and the final summary is thus a collection of sentences or sentence fragments from the original document. In contrast, abstraction approaches do not preserve the original document content and instead prefer to paraphrase the source content to provide a more concise content representation. In general, abstraction techniques have the potential to produce more condensed summaries than extraction techniques, but rely heavily on sophisticated natural language processing and generation. Extraction techniques, on the other hand, generally rely on shallow natural language techniques, usually with statistical term-counting as the basis for sentence selection and ordering.

One of the earliest automatic summarization techniques was described by [12] in 1958. Very briefly, a statistical approach was proposed in which the frequency of word occurrences in a document was used as an indicator of word significance. These significance indicators were then combined with positional information to obtain a sentence-level significance value so that a final summary could be produced from the top ranking sentences. This approach was refined in [6] with the inclusion of a range of additional document features for estimating the significance of each sentence. For example, structural elements such as title, subtitles and headings, as well as language features such as cue words, were all combined to guide a more sophisticated model of sentence scoring. Today, most commercial summarizers are still based on these basic approaches, with considerable research effort devoted to refinements such as the automatic learning of feature weights for sentence selection [9] and shallow language based approaches such as analysing the discourse structure of text to aid sentence selection [13].

Two extraction-based summarizers that are especially relevant to this work are the Open Text Summarizer (OTS) [17] and the MEAD summarizer [16], which we use as benchmark comparison systems in our evaluation. Both combine shallow NLP techniques with more conventional statistical word-frequency methods to produce document abstracts from high scoring sentences. For example, OTS incorporates NLP techniques via an English language lexicon with synonyms and
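The snippet-combination idea described under Towards Social Summarization (treat each tag set or query as a query for the page, collect the resulting snippets, and rank the fragments) can be illustrated with a small sketch. This is only an illustration under stated assumptions, not the authors' implementation: `toy_snippet` is a hypothetical stand-in for a real search engine's snippet generator, the sample page and tags are invented, and ranking fragments by how many queries return them is an assumed scoring rule, since this section does not specify how relative importance is computed.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def toy_snippet(document, query, max_fragments=2):
    """Hypothetical stand-in for a search engine's snippet generator.

    Returns up to max_fragments sentences of the document that share
    the most terms with the query (ties broken by document position).
    """
    terms = set(query.lower().split())
    scored = []
    for position, sentence in enumerate(split_sentences(document)):
        words = set(re.findall(r"\w+", sentence.lower()))
        overlap = len(terms & words)
        if overlap:
            scored.append((-overlap, position, sentence))
    scored.sort()
    return [sentence for _, _, sentence in scored[:max_fragments]]


def social_summary(document, queries, max_sentences=2):
    """Combine the snippets for all queries into one summary.

    Assumed ranking rule: a fragment is more important the more
    queries (or tag sets) produce it as part of their snippet.
    """
    votes = Counter()
    for query in queries:
        votes.update(toy_snippet(document, query))
    order = {s: i for i, s in enumerate(split_sentences(document))}
    top = sorted(votes, key=lambda s: (-votes[s], order[s]))[:max_sentences]
    # Present the chosen fragments in their original document order.
    return " ".join(sorted(top, key=order.get))


# Invented example page and bookmark tags, for illustration only.
page = ("The photo site lets users upload and tag photos. "
        "The site has a long history. "
        "Tags make shared photos easier to search. "
        "Contact the support team for help.")
tags = ["photos tag", "photos search", "tag sharing"]
summary = social_summary(page, tags)
print(summary)
```

With the invented page and tag sets above, only the two sentences that overlap with the tags receive votes, so the sentences about the site's history and support contact are left out of the summary. In a real setting the snippets would come from an actual search engine's results for the bookmarked page rather than from `toy_snippet`.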
