The Summarization of Technical Internet Relay Chats
Total Page:16
File Type:pdf, Size:1020Kb
Proceedings of the ACL conference. Ann Arbor, MI. July 2005. Digesting Virtual “Geek” Culture: The Summarization of Technical Internet Relay Chats Liang Zhou and Eduard Hovy University of Southern California Information Sciences Institute 4676 Admiralty Way Marina del Rey, CA 90292-6695 {liangz, hovy} @isi.edu used for analyzing the impact of virtual social in- Abstract teractions and virtual organizational culture on software/product development. This paper describes a summarization sys- The emergence of email thread discussions and tem for technical chats and emails on the chat logs as a major information source has Linux kernel. To reflect the complexity prompted increased interest in thread summariza- and sophistication of the discussions, they tion within the Natural Language Processing are clustered according to subtopic struc- (NLP) community. One might assume a smooth ture on the sub-message level, and imme- transition from text-based summarization to email diate responding pairs are identified and chat-based summarizations. However, chat through machine learning methods. A re- falls in the genre of correspondence, which re- sulting summary consists of one or more quires dialogue and conversation analysis. This mini-summaries, each on a subtopic from property makes summarization in this area even the discussion. more difficult than traditional summarization. In particular, topic “drift” occurs more radically than in written genres, and interpersonal and pragmatic 1 Introduction content appears more frequently. Questions about the content and overall organization of the sum- The availability of many chat forums reflects the mary must be addressed in a more thorough way formation of globally dispersed virtual communi- for chat and other dialogue summarization sys- ties. From them we select the very active and tems. growing movement of Open Source Software In this paper we present a new system that clus- (OSS) development. Working together in a virtual ters sub-message segments from correspondences community in non-collocated environments, OSS according to topic, identifies the sub-message developers communicate and collaborate using a segment containing the leading issue within the wide range of web-based tools including Internet topic, finds immediate responses from other par- Relay Chat (IRC), electronic mailing lists, and ticipants, and consequently produces a summary more (Elliott and Scacchi, 2004). In contrast to for the entire IRC. Other constructions are possi- conventional instant message chats, IRCs convey ble. One of the two baseline systems described in engaging and focused discussions on collaborative this paper uses the timeline and dialogue structure software development. Even though all OSS par- to select summary content, and is quite effective. ticipants are technically savvy individually, sum- We use the term chat loosely in this paper. Input maries of IRC content are necessary within a IRCs for our system is a mixture of chats and virtual organization both as a resource and an or- emails that are indistinguishable in format ob- ganizational memory of activities (Ackerman and served from the downloaded corpus (Section 3). Halverson, 2000). They are regularly produced In the following sections, we summarize previ- manually by volunteers. These summaries can be ous work, describe the email/chat data, intra- message clustering and summary extraction proc- questions among speech acts (sentences). A search ess, and discuss the results and future work. heuristic procedure then finds the corresponding answers. Ries (2001) shows how to use keyword 2 Previous and Related Work repetition, speaker initiative and speaking style to achieve topical segmentation of spontaneous dia- There are at least two ways of organizing dialogue logues. summaries: by dialogue structure and by topic. Newman and Blitzer (2002) describe methods 3 Technical Internet Relay Chats for summarizing archived newsgroup conversa- tions by clustering messages into subtopic groups GNUe, a meta-project of the GNU project1–one of and extracting top-ranked sentences per subtopic the most famous free/open source software pro- group based on the intrinsic scores of position in jects–is the case study used in (Elliott and Scacchi, the cluster and lexical centrality. Due to the techni- 2004) in support of the claim that, even in virtual cal nature of our working corpus, we had to handle organizations, there is still the need for successful intra-message topic shifts, in which the author of a conflict management in order to maintain order message raises or responds to multiple issues in the and stability. same message. This requires that our clustering The GNUe IRC archive is uniquely suited for component be not message-based but sub- our experimental purpose because each IRC chat message-based. log has a companion summary digest written by Lam et al. (2002) employ an existing summar- project participants as part of their contribution to izer for single documents using preprocessed email the community. This manual summary constitutes messages and context information from previous gold-standard data for evaluation. emails in the thread. 2 Rambow et al. (2004) show that sentence extrac- 3.1 Kernel Traffic tion techniques are applicable to summarizing email threads, but only with added email-specific Kernel Traffic is a collection of summary digests features. Wan and McKeown (2004) introduce a of discussions on GNUe development. Each digest system that creates overview summaries for ongo- summarizes IRC logs and/or email messages (later ing decision-making email exchanges by first de- referred to as chat logs) for a period of up to two tecting the issue being discussed and then weeks. A nice feature is that direct quotes and extracting the response to the issue. Both systems hyperlinks are part of the summary. Each digest is use a corpus that, on average, contains 190 words an extractive overview of facts, plus the author’s and 3.25 messages per thread, much shorter than dramatic and humorous interpretations. the ones in our collection. 3.2 Corpus Download Galley et al. (2004) describe a system that iden- tifies agreement and disagreement occurring in The complete Linux Kernel Archive (LKA) con- human-to-human multi-party conversations. They sists of two separate downloads. The Kernel Traf- utilize an important concept from conversational fic (summary digests) are in XML format and were analysis, adjacent pairs (AP), which consists of downloaded by crawling the Kernel Traffic site. initiating and responding utterances from different The Linux Kernel Archives (individual IRC chat speakers. Identifying APs is also required by our logs) are downloaded from the archive site. We research to find correspondences from different matched the summaries with their respective chat chat participants. logs based on subject line and publication dates. In automatic summarization of spoken dia- logues, Zechner (2001) presents an approach to 3.3 Observation on Chat Logs obtain extractive summaries for multi-party dia- logues in unrestricted domains by addressing in- Upon initial examination of the chat logs, we trinsic issues specific to speech transcripts. found that many conventional assumptions about Automatic question detection is also deemed im- chats in general do not apply. For example, in most portant in this work. A decision-tree classifier was trained on question-triggering words to detect 1 http://www.gnu.org 2 http://kt.hoser.ca/kernel-traffic/index.html instant-message chats, each exchange usually con- 4 Fine-grained Clustering sists of a small number of words in several sen- tences. Due to the technical nature of GNUe, half To model the subtopic structure of each chat mes- of the chat logs contain in-depth discussions with sage, we apply clustering at the sub-message level. lengthy messages. One message might ask and an- swer several questions, discuss many topics in de- 4.1 Message Segmentation tail, and make further comments. This property, First, we look at each message and assume that which we call subtopic structure, is an important each participant responds to an ongoing discussion difference from informal chat/interpersonal banter. by stating his/her opinion on several topics or is- Figure 1 shows the subtopic structure and relation sues that have been discussed in the current chat of the first 4 messages from a chat log, produced log, but not necessarily in the order they were dis- manually. Each message is represented horizon- cussed. Thus, topic shifts can occur sequentially tally; the vertical arrows show where participants within a message. Messages are partitioned into responded to each other. Visual inspection reveals multi-paragraph segments using TextTiling, which in this example there are three distinctive clusters reportedly has an overall precision of 83% and re- (a more complex cluster and two smaller satellite call of 78% (Hearst, 1994). clusters) of discussions between participants at sub-message level. 4.2 Clustering After distinguishing a set of message segments, we cluster them. When choosing an appropriate clus- tering method, because the number of subtopics under discussion is unknown, we cannot make an assumption about the total number of resulting clusters. Thus, nonhierarchical partitioning meth- ods cannot be used, and we must use a hierarchical method. These methods can be either agglomera- Figure 1. An example of chat subtopic structure tive, which begin with an unclustered data set and and relation between correspondences. perform N – 1 pairwise joins, or divisive, which add all objects to a single cluster, and then perform 3.4 Observation on Summary Digests N – 1 divisions to create a hierarchy of smaller To measure the goodness of system-produced clusters, where N is the total number of items to be summaries, gold standards are used as references. clustered (Frakes and Baeza-Yates, 1992). Human-written summaries usually make up the Ward’s Method gold standards. The Kernel Traffic (summary di- gests) are written by Linux experts who actively Hierarchical agglomerative clustering methods are contribute to the production and discussion of the commonly used and we employ Ward’s method open source projects.