Proceedings of the ACL conference. Ann Arbor, MI. July 2005. Digesting Virtual “Geek” Culture: The Summarization of Technical Internet Relay Chats

Liang Zhou and Eduard Hovy University of Southern California Information Sciences Institute 4676 Admiralty Way Marina del Rey, CA 90292-6695 {liangz, hovy} @isi.edu

used for analyzing the impact of virtual social in- Abstract teractions and virtual organizational culture on software/product development. This paper describes a summarization sys- The emergence of email thread discussions and tem for technical chats and emails on the chat logs as a major information source has Linux kernel. To reflect the complexity prompted increased interest in thread summariza- and sophistication of the discussions, they tion within the Natural Language Processing are clustered according to subtopic struc- (NLP) community. One might assume a smooth ture on the sub-message level, and imme- transition from text-based summarization to email diate responding pairs are identified and chat-based summarizations. However, chat through machine learning methods. A re- falls in the genre of correspondence, which re- sulting summary consists of one or more quires dialogue and conversation analysis. This mini-summaries, each on a subtopic from property makes summarization in this area even the discussion. more difficult than traditional summarization. In particular, topic “drift” occurs more radically than in written genres, and interpersonal and pragmatic 1 Introduction content appears more frequently. Questions about the content and overall organization of the sum- The availability of many chat forums reflects the mary must be addressed in a more thorough way formation of globally dispersed virtual communi- for chat and other dialogue summarization sys- ties. From them we select the very active and tems. growing movement of Open Source Software In this paper we present a new system that clus- (OSS) development. Working together in a virtual ters sub-message segments from correspondences community in non-collocated environments, OSS according to topic, identifies the sub-message developers communicate and collaborate using a segment containing the leading issue within the wide range of web-based tools including Internet topic, finds immediate responses from other par- Relay Chat (IRC), electronic mailing lists, and ticipants, and consequently produces a summary more (Elliott and Scacchi, 2004). In contrast to for the entire IRC. Other constructions are possi- conventional instant message chats, IRCs convey ble. One of the two baseline systems described in engaging and focused discussions on collaborative this paper uses the timeline and dialogue structure software development. Even though all OSS par- to select summary content, and is quite effective. ticipants are technically savvy individually, sum- We use the term chat loosely in this paper. Input maries of IRC content are necessary within a IRCs for our system is a mixture of chats and virtual organization both as a resource and an or- emails that are indistinguishable in format ob- ganizational memory of activities (Ackerman and served from the downloaded corpus (Section 3). Halverson, 2000). They are regularly produced In the following sections, we summarize previ- manually by volunteers. These summaries can be ous work, describe the email/chat data, intra- message clustering and summary extraction proc- questions among speech acts (sentences). A search ess, and discuss the results and future work. heuristic procedure then finds the corresponding answers. Ries (2001) shows how to use keyword 2 Previous and Related Work repetition, speaker initiative and speaking style to achieve topical segmentation of spontaneous dia- There are at least two ways of organizing dialogue logues. summaries: by dialogue structure and by topic. Newman and Blitzer (2002) describe methods 3 Technical Internet Relay Chats for summarizing archived newsgroup conversa- tions by clustering into subtopic groups GNUe, a meta-project of the GNU project1–one of and extracting top-ranked sentences per subtopic the most famous free/open source software pro- group based on the intrinsic scores of position in jects–is the case study used in (Elliott and Scacchi, the cluster and lexical centrality. Due to the techni- 2004) in support of the claim that, even in virtual cal nature of our working corpus, we had to handle organizations, there is still the need for successful intra-message topic shifts, in which the author of a conflict management in order to maintain order message raises or responds to multiple issues in the and stability. same message. This requires that our clustering The GNUe IRC archive is uniquely suited for component be not message-based but sub- our experimental purpose because each IRC chat message-based. log has a companion summary digest written by Lam et al. (2002) employ an existing summar- project participants as part of their contribution to izer for single documents using preprocessed email the community. This manual summary constitutes messages and context information from previous gold-standard data for evaluation. emails in the thread. 2 Rambow et al. (2004) show that sentence extrac- 3.1 Kernel Traffic tion techniques are applicable to summarizing email threads, but only with added email-specific Kernel Traffic is a collection of summary digests features. Wan and McKeown (2004) introduce a of discussions on GNUe development. Each digest system that creates overview summaries for ongo- summarizes IRC logs and/or email messages (later ing decision-making email exchanges by first de- referred to as chat logs) for a period of up to two tecting the issue being discussed and then weeks. A nice feature is that direct quotes and extracting the response to the issue. Both systems hyperlinks are part of the summary. Each digest is use a corpus that, on average, contains 190 words an extractive overview of facts, plus the author’s and 3.25 messages per thread, much shorter than dramatic and humorous interpretations. the ones in our collection. 3.2 Corpus Download Galley et al. (2004) describe a system that iden- tifies agreement and disagreement occurring in The complete Linux Kernel Archive (LKA) con- human-to-human multi-party conversations. They sists of two separate downloads. The Kernel Traf- utilize an important concept from conversational fic (summary digests) are in XML format and were analysis, adjacent pairs (AP), which consists of downloaded by crawling the Kernel Traffic site. initiating and responding utterances from different The Linux Kernel Archives (individual IRC chat speakers. Identifying APs is also required by our logs) are downloaded from the archive site. We research to find correspondences from different matched the summaries with their respective chat chat participants. logs based on subject and publication dates. In automatic summarization of spoken dia- logues, Zechner (2001) presents an approach to 3.3 Observation on Chat Logs obtain extractive summaries for multi-party dia- logues in unrestricted domains by addressing in- Upon initial examination of the chat logs, we trinsic issues specific to speech transcripts. found that many conventional assumptions about Automatic question detection is also deemed im- chats in general do not apply. For example, in most portant in this work. A decision-tree classifier was trained on question-triggering words to detect 1 http://www.gnu.org 2 http://kt.hoser.ca/kernel-traffic/index.html instant-message chats, each exchange usually con- 4 Fine-grained Clustering sists of a small number of words in several sen- tences. Due to the technical nature of GNUe, half To model the subtopic structure of each chat mes- of the chat logs contain in-depth discussions with sage, we apply clustering at the sub-message level. lengthy messages. One message might ask and an- swer several questions, discuss many topics in de- 4.1 Message Segmentation tail, and make further comments. This property, First, we look at each message and assume that which we call subtopic structure, is an important each participant responds to an ongoing discussion difference from informal chat/interpersonal banter. by stating his/her opinion on several topics or is- Figure 1 shows the subtopic structure and relation sues that have been discussed in the current chat of the first 4 messages from a chat log, produced log, but not necessarily in the order they were dis- manually. Each message is represented horizon- cussed. Thus, topic shifts can occur sequentially tally; the vertical arrows show where participants within a message. Messages are partitioned into responded to each other. Visual inspection reveals multi-paragraph segments using TextTiling, which in this example there are three distinctive clusters reportedly has an overall precision of 83% and re- (a more complex cluster and two smaller satellite call of 78% (Hearst, 1994). clusters) of discussions between participants at sub-message level. 4.2 Clustering

After distinguishing a set of message segments, we cluster them. When choosing an appropriate clus- tering method, because the number of subtopics under discussion is unknown, we cannot make an assumption about the total number of resulting clusters. Thus, nonhierarchical partitioning meth- ods cannot be used, and we must use a hierarchical method. These methods can be either agglomera- Figure 1. An example of chat subtopic structure tive, which begin with an unclustered data set and and relation between correspondences. perform N – 1 pairwise joins, or divisive, which add all objects to a single cluster, and then perform 3.4 Observation on Summary Digests N – 1 divisions to create a hierarchy of smaller To measure the goodness of system-produced clusters, where N is the total number of items to be summaries, gold standards are used as references. clustered (Frakes and Baeza-Yates, 1992). Human-written summaries usually make up the Ward’s Method gold standards. The Kernel Traffic (summary di- gests) are written by Linux experts who actively Hierarchical agglomerative clustering methods are contribute to the production and discussion of the commonly used and we employ Ward’s method open source projects. However, participant- (Ward and Hook, 1963), in which the text segment produced digests cannot be used as reference pair merged at each stage is the one that minimizes summaries verbatim. Due to the complex structure the increase in total within-cluster variance. of the dialogue, the summary itself exhibits some Each cluster is represented by an L-dimensional discourse structure, necessitating such reader guid- vector (xi1, xi2, …, xiL) where each xik is the word’s ance phrases such as “for the … question,” “on the tf • idf score. If mi is the number of objects in the … subject,” “regarding …,” “later in the same cluster, the squared Euclidean distance between thread,” etc., to direct and refocus the reader’s at- two segments i and j is: tention. Therefore, further manual editing and par- 2 L 2 dij = (xik # x jk ) titioning is needed to transform a multi-topic digest "K=1 into several smaller subtopic-based gold-standard When two segments are joined, the increase in reference summaries (see Section 6.1 for the trans- variance Iij is expressed as: formation). ! m m These corresponding pairs are formally intro- I = i j d 2 ij ij duced below, and the methods we experimented mi + m j with for identifying them are described. Number of Clusters 5.1 Adjacent Response Pairs The proces! s of joining clusters continues until the combination of any two clusters would destabilize An important conversational analysis concept, ad- the entire array of currently existing clusters pro- jacent pairs (AP), is applied in our system to iden- duced from previous stages. At each stage, the two tify initiating and responding correspondences from different participants in one chat log. Adja- clusters xik and xjk are chosen whose combination cent pairs are considered fundamental units of would cause the minimum increase in variance Iij, expressed as a percentage of the variance change conversational organization (Schegloff and Sacks, from the last round. If this percentage reaches a 1973). An adjacent pair is said to consist of two preset threshold, it means that the nearest two clus- parts that are ordered, adjacent, and produced by ters are much further from each other compared to different speakers (Galley et al., 2004). In our the previous round; therefore, joining of the two email/chat (LKA) corpus a physically adjacent represents a destabilizing change, and should not message, following the timeline, may not directly take place. respond to its immediate predecessor. Discussion Sub-message segments from resulting clusters participants read the current live thread and decide are arranged according to the sequence the original what he/she would like to correspond to, not nec- messages were posted and the resulting subtopic essarily in a serial fashion. With the added compli- structures are similar to the one shown in Figure 1. cation of subtopic structure (see Figure 1) the definition of adjacency is further violated. Due to 5 Summary Extraction its problematic nature, a relaxation on the adja- cency requirement is used in extensive research in Having obtained clusters of message segments fo- conversational analysis (Levinson, 1983). This re- cused on subtopics, we adopt the typical summari- laxed requirement is adopted in our research. zation paradigm to extract informative sentences Information produced by adjacent correspon- and segments from each cluster to produce sub- dences can be used to produce the subtopic-based topic-based summaries. If a chat log has n clusters, summary of the chat log. As described in Section then the corresponding summary will contain n 4, each chat log is partitioned, at sub-message mini-summaries. level, into several subtopic clusters. We take the All message segments in a cluster are related to message segment that appears first chronologically the central topic, but to various degrees. Some are in the cluster as the topic-initiating segment in an answers to questions asked previously, plus further adjacent pair. Given the initiating segment, we elaborative explanations; some make suggestions need to identify one or more segments from the and give advice where they are requested, etc. same cluster that are the most direct and relevant From careful analysis of the LKA data, we can responses. This process can be viewed equivalently safely assume that for this type of conversational as the informative sentence extraction process in interaction, the goal of the participants is to seek conventional text-based summarization. help or advice and advance their current knowl- edge on various technical subjects. This kind of 5.2 AP Corpus and Baseline interaction can be modeled as one problem- initiating segment and one or more corresponding We manually tagged 100 chat logs for adjacent problem-solving segments. We envisage that iden- pairs. There are, on average, 11 messages per chat tifying corresponding message segment pairs will log and 3 segments per message (This is consid- produce adequate summaries. This analysis follows erably larger than threads used in previous re- the structural organization of summaries from Ker- search). Each chat log has been clustered into one nel Traffic. Other types of discussions, at least in or more bags of message segments. The message part, require different discourse/summary organi- segment that appears earliest in time in a cluster zation. was marked as the initiating segment. The annota- tors were provided with this segment and one other • number of overlapping words segment at a time, and were asked to decide • number of overlapping content words whether the current message segment is a direct • ratio of overlapping words answer to the question asked, the suggestion that • ratio of overlapping content words was requested, etc. in the initiating segment. There • number of overlapping tech words are 1521 adjacent response pairs; 1000 were used for training and 521 for testing. Table 1. Lexical features. 1 Our baseline system selects the message seg- P (r | c) = exp( " f (c,r)) " Z (c) # i,r i,r ment (from a different author) immediately follow- " i ing the initiating segment. It is quite effective, with where Zλ(c) is a normalizing constant and the fea- an accuracy of 64.67%. This is reasonable because ture function for feature fi and response class r is not all adjacent responses are interrupted by mes- defi!n ed as: #1, if f 0 and r " r sages responding to different earlier initiating mes- i > = fi,r (c,r ") = $ . sages. %0, otherwise In the following sections, we describe two ma- λi,r is the feature-weight parameter for feature fi and chine learning methods that were used to identify response class r. Then, to determine the best class r the second in an adjacent response pair for the! can didate message segment c, we have: * and the features used for training. We view the r = arg maxrP(r | c) . problem as a binary classification problem, distin- guishing less relevant responses from direct re- 5.5 Support Vector Machine sponses. Our approach is to assign a candidate Suppo!rt vector machines (SVMs) have been shown message segment c an appropriate response class r. to outperform other existing methods (naïve Bayes, 5.3 Features k-NN, and decision trees) in text categorization (Joachims, 1998). Their advantages are robustness Structural and durational features have been dem- and the elimination of the need for feature selec- onstrated to improve performance significantly in tion and parameter tuning. SVMs find the hyper- conversational text analysis tasks. Using them, plane that separates the positive and negative Galley et al. (2004) report an 8% increase in training examples with maximum margin. Finding speaker identification. Zechner (2001) reports ex- this hyperplane can be translated into an optimiza- * cellent results (F > .94) for inter-turn sentence tion problem of finding a set of coefficients αi of r boundary detection when recording the length of the weight vector w for document di of class yi ∈ pause between utterances. In our corpus, dura- {+1 , –1}: r tional information is nonexistent because chats and r * w = #" i yid i, " i > 0 . emails were mixed and no exact time recordings i beside dates were reported. So we rely solely on Testing! data are classified depending on the side structural and lexical features. of the hyperplane they fall on. We used the 4 For structural features, we count the number of LIBSVM! package for training and testing. messages between the initiating message segment and the responding message segment. Lexical fea- Feature sets baseline MaxEnt SVM tures are listed in Table 1. The tech words are the 64.67% words that are uncommon in conventional litera- Structural 61.22% 71.79% ture and unique to Linux discussions. Lexical 62.24% 72.22% Structural + Lexical 72.61% 72.79% 5.4 Maximum Entropy

Maximum entropy has been proven to be an ef- fective method in various natural language proc- Table 2. Accuracy on identifying APs. essing applications (Berger et al., 1996). For training and testing, we used YASMET3. To esti- mate P(r | c) in the exponential form, we have:

3 http://www.fjoch.com/YASMET.html 4 http://www.csie.ntu.edu.tw/~cjlin/libsvm/ 5.6 Results Kernel Traffic digests are participant-written summaries of the chat logs. Each digest mixes the Entries in Table 2 show the accuracies achieved summary writer’s own narrative comments with using machine learning models and feature sets. direct quotes (citing the authors) from the chat log. As observed in Section 3.4, subtopics are inter- mingled in each digest. Authors use key phrases to 5.7 Summary Generation link the contents of each subtopic throughout texts. After responding message segments are identified, In Figure 3, we show an example of such a digest. we couple them with their respective initiating Discussion participants’ names are in italics and segment to form a mini-summary based on their subtopics are in bold. In this example, the conver- subtopic. Each initializing segment has zero or sation was started by Benjamin Reed with two more responding segments. We also observed zero questions: 1) asking for conventions for writing response in human-written summaries where par- /proc drivers, and 2) asking about the status of ticipants initiated some question or concern, but sysctl. The summary writer indicated that Linus others failed to follow up on the discussion. The Torvalds replied to both questions and used the AP process is repeated for each cluster created phrase “for the … question, he added…” to high- previously. One or more subtopic-based mini- light the answer to the second question. As the di- summaries make up one final summary for each chat log. Figure 2 shows an example. For longer Benjamin Reed wrote a wireless Ethernet driver that used /proc as its interface. But he was a little uncom- chat logs, the length of the final summary is arbi- fortable … asked if there were any conventions he trarily averaged at 35% of the original. should follow. He added, “and finally, what’s up with sysctl? …” Subtopic 1: Linus Torvalds replied with: “the thing to do is to cre- Benjamin Reed: I wrote a wireless ethernet driver a ate a …[program code]. The /proc/drivers/ directory is while ago... Are driver writers recommended to use already there, so you’d basically do something like … that over extending /proc or is it deprecated? [program code].” For the sysctl question, he added Linus Torvalds: Syscyl is deprecated. It’s useful in one “sysctl is deprecated. ...” way only ... Marcin Dalecki flamed Linus: “Are you just blind to the never-ending format/compatibility/… problems the Subtopic 2: whole idea behind /proc induces inherently? Benjamin Reed: I am a bit uncomfortable ... wondering …[example]” for a while if there are guidelines on … Linus Torvalds: The thing to do is to create ... Figure 3. An original Kernel Traffic digest.

Subtopic 3: Marcin Dalecki: Are you just blind to the never-ending Mini 1: format/ compatibility/ … problems the whole idea Benjamin Reed wrote a wireless Ethernet driver that behind /proc induces inherently? used /proc as its interface. But he was a little uncom- fortable … and asked if there were any conventions he Figure 2. A system-produced summary. should follow. Linus Torvalds replied with: the thing to do is to create a …[program code]. The /proc/drivers/ directory is 6 Summary Evaluation already there, so you’d basically do something like … [program code]. To evaluate the goodness of the system-produced Marcin Dalecki flamed Linus: Are you just blind to the never-ending format/ compatibility/ … problems the summaries, a set of reference summaries is used whole idea behind /proc induces inherently? for comparison. In this section, we describe the …[example] manual procedure used to produce the reference summaries, and the performances of our system Mini 2: Benjamin Reed: and finally, what’s up with sysctl? ... and two baseline systems. Linus Torvalds replied: sysctl is deprecated. ...

6.1 Reference Summaries Figure 4. A reference summary reproduced from a summary digest. gest goes on, Marcin Dalecki only responded to the the ongoing example. first question with his excited commentary. The summary digests from Kernel Traffic Since our system-produced summaries are sub- mostly consist of direct snippets from original topic-based and partitioned accordingly, if we use messages, thus making the reference summaries unprocessed Kernel Traffic as references, the com- extractive even after rewriting. This makes it pos- parison would be rather complicated and would sible to conduct an automatic evaluation. A com- increase the level of inconsistency in future as- puterized procedure calculates the overlap between sessments. We manually reorganized each sum- reference and system-produced summary units. mary digest into one or more mini-summaries by Since each system-produced summary is a set of subtopic (see Figure 4.) Examples (usually kernel mini-summaries based on subtopics, we also com- stats) and programs are reduced to “[example]” pared the subtopics against those appearing in ref- and “[program code].” Quotes (originally in sepa- erence summaries (precision = 77.00%, recall = rate messages but merged by the summary writer) 74.33 %, F = 0.7566). that contain multiple topics are segmented and the participant’s name is inserted for each segment. Recall Precision F-measure We follow clues like “to answer … question” to Baseline1 30.79% 16.81% .2175 pair up the main topics and their responses. Baseline2 63.14% 36.54% .4629 System Summary 52.57% 52.14% .5235 6.2 Summarization Results Topic-summ 52.57% 63.66% .5758

We evaluated 10 chat logs. On average, each con- Table 3. Summary of results. tains approximately 50 multi-paragraph tiles (parti- tioned by TextTile) and 5 subtopics (clustered by Table 3 shows the recall, precision, and F- the method from Section 4). measure from the evaluation. From manual analy- A simple baseline system takes the first sentence sis on the results, we notice that the original digest from each email in the sequence that they were writers often leave large portions of the discussion posted, based on the assumption that people tend to out and focus on a few topics. We think this is be- put important information in the beginning of texts cause among the participants, some are Linux vet- (Position Hypothesis). erans and others are novice programmers. Digest A second baseline system was built based on writers recognize this difference and reflect it in constructing and analyzing the dialogue structure their writings, whereas our system does not. The of each chat log. Participants often quote portions entry “Topic-summ” in the table shows system- of previously posted messages in their responses. produced summaries being compared only against These quotes link most of the messages from a the topics discussed in the reference summaries. chat log. The message segment that immediately follows the quote is automatically paired with the 6.3 Discussion quote itself and added to the summary and sorted according to the timeline. Segments that are not A recall of 30.79% from the simple baseline reas- quoted in later messages are labeled as less rele- sures us the Position Hypothesis still applies in vant and discarded. A resulting baseline summary conversational discussions. The second baseline is an inter-connected structure of segments that performs extremely well on recall, 63.14%. It quoted and responded to one another. Figure 5 is a shows that quoted message segments, and thereby shortened summary produced by this baseline for derived dialogue structure, are quite indicative of where the important information resides. Systems [0|0] Benjamin Reed: “I wrote an … driver … /proc built on these properties are good summarization …” systems and hard-to-beat baselines. The system [0|1] Benjamin Reed: “… /proc/ guideline …” described in this paper (Summary) shows an F- [0|2] Benjamin Reed: “… syscyl …” [1|0] Linus Torvalds responds to [0|0, 0|1, 0|2]: “the measure of .5235, an improvement from .4629 of thing to do is …” “sysctl is deprecated … “ the smart baseline. It gains from a high precision because less relevant message segments are identi- Figure 5. A short example from Baseline 2. fied and excluded from the adjacent response pairs, leaving mostly topic-oriented segments in summa- M. Galley, K. McKeown, J. Hirschberg, and E. ries. There is a slight improvement when assessing Shriberg. 2004. Identifying agreement and disagree- against only those subtopics appeared in the refer- ment in conversational speech: use of Bayesian net- ence summaries (Topic-summ). This shows that we works to model pragmatic dependencies. In the only identified clusters on their information con- Proceedings of ACL-04. tent, not on their respective writers’ experience and M. A. Hearst. 1994. Multi-paragraph segmentation of reliability of knowledge. expository text. In the Proceedings of ACL 1994. In the original summary digests, interactions and T. Joachims. 1998. Text categorization with support reactions between participants are sometimes de- vector machines: Learning with many relevant fea- scribed. Digest writers insert terms like “flamed”, tures. In Proceedings of the ECML, pages 137–142. “surprised”, “felt sorry”, “excited”, etc. To analyze D. Lam and S. L. Rohall. 2002. Exploiting e-mail struc- social and organizational culture in a virtual envi- ture to improve summarization. Technical Paper at ronment, we need not only information extracts IBM Watson Research Center #20–02. (implemented so far) but also passages that reveal the personal aspect of the communications. We S. Levinson. 1983. Pragmatics. Cambridge University Press. plan to incorporate opinion identification into the current system in the future. P. Newman and J. Blitzer. 2002. Summarizing archived discussions: a beginning. In Proceedings of Intelli- 7 Conclusion and Future Work gent User Interfaces. O. Rambow, L. Shrestha, J. Chen and C. Laurdisen. In this paper we have described a system that per- 2004. Summarizing email threads. In Proceedings of forms intra-message topic-based summarization by HLT-NAACL 2004: Short Papers. clustering message segments and classifying topic- initiating and responding pairs. Our approach is an K. Ries. 2001. Segmenting conversations by topic, ini- initial step in developing a framework that can tiative, and style. In Proceedings of SIGIR Work- shop: Information Retrieval Techniques for Speech eventually reflect the human interactions in virtual Applications 2001: 51–66. environments. In future work, we need to prioritize information according to the perceived knowl- E. A. Schegloff and H. Sacks. 1973. Opening up clos- edgeability of each participant in the discussion, in ings. Semiotica, 7-4:289–327. addition to identifying informative content and S. Wan and K. McKeown. 2004. Generating overview recognizing dialogue structure. While the approach summaries of ongoing email thread discussions. In to the detection of initiating-responding pairs is Proceedings of COLING 2004. quite effective, differentiating important and non- J. H. Ward Jr. and M. E. Hook. 1963. Application of an important topic clusters is still unresolved and hierarchical grouping procedure to a problem of must be explored. grouping profiles. Educational and Psychological Measurement, 23, 69–81. References K. Zechner. 2001. Automatic generation of concise M. S. Ackerman and C. Halverson. 2000. Reexaming summaries of spoken dialogues in unrestricted do- organizational memory. Communications of the mains. In Proceedings of SIGIR 2001. ACM, 43(1), 59–64. A. Berger, S. Della Pietra, and V. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71. M. Elliott and W. Scacchi. 2004. Free software devel- opment: cooperation and conflict in a virtual organi- zational culture. S. Koch (ed.), Free/Open Source Software Development, IDEA publishing, 2004. W. B. Frakes and R. Baeza-Yates. 1992. Information retrieval: data structures & algorithms. Prentice Hall.