Investigating Usage of Text Segmentation and Inter-passage Similarities to Improve Text Document Clustering

by

Shashank Paliwal, Vikram Pudi

in

8th International Conference on Machine Learning and Data Mining (MLDM 2012)

Report No: IIIT/TR/2012/-1

Centre for Data Engineering, International Institute of Information Technology, Hyderabad - 500 032, INDIA
July 2012

Investigating Usage of Text Segmentation and Inter-passage Similarities to Improve Text Document Clustering

Shashank Paliwal and Vikram Pudi

Center for Data Engineering, International Institute of Information Technology, Hyderabad [email protected], [email protected]

Abstract. Measuring inter-document similarity is one of the most essential steps in text document clustering. Traditional methods rely on representing text documents using the simple Bag-of-Words (BOW) model. A document, however, is an organized structure consisting of various text segments or passages. Single-term analysis of the text treats the whole document as one semantic unit and thus ignores other semantic units such as sentences and passages. In this paper, we attempt to take advantage of the underlying subtopic structure of text documents and investigate whether clustering of text documents can be improved if the text segments of two documents are utilized while calculating the similarity between them. We concentrate on examining the effect of combining the suggested inter-document similarities (based on inter-passage similarities) with traditional inter-document similarities, following a simple approach to do so. Experimental results on standard data sets suggest an improvement in the clustering of text documents.

Keywords: Text Document Clustering, Text Segmentation, Document Similarity.

1 Introduction

With a large explosion in the amount of data found on the web, it has become necessary to devise better methods to classify data. A large part of this web data (like blogs, webpages, tweets etc.) is in the form of text. Text document clustering techniques play an important role in the performance of information retrieval systems and search engines by organizing text documents. Traditional clustering techniques fail to provide satisfactory results for text documents, primarily because text data is very high dimensional and a single document contains a large number of unique terms. Most of these documents do not deal with a single topic, which makes it difficult to classify them under a single category. Such a scenario thus gives rise to the need for clustering methods which can classify documents on the basis of the topic on which the document is primarily written, i.e. the theme of most of the passages or segments which combine to form the whole document.


Text documents are often represented as a vector where each term is associated with a weight. The Vector Space Model [13] is a popular method that abstracts each document as a vector with weighted terms acting as features. Most term extraction algorithms follow the "Bag of Words" (BOW) representation to identify document terms. While such a representation is simple and easy to understand, it suffers from two problems. First, it relies heavily on the vocabulary used by the author to calculate the similarity between two documents. A pair of documents might be on similar topics, but still score very low on similarity because different sets of terms are used in the two documents. Second, it considers the whole document as a single semantic unit. Two documents may talk about related topics and share common vocabulary, but they could still be judged dissimilar because of other, unrelated topics present in one or both of them.

While the first problem can be tackled using dictionaries or a WordNet-like lexical database [7], applying a clustering algorithm to semantically independent units of text might help in reducing the adverse effect of drifting topics (as text segments on similar topics would be judged similar while those on unrelated topics would be judged dissimilar) and the varying-length problem (as text segments are of the same fixed size). It is our intuition that calculating document-document similarity with the help of text segments of a particular length may help in improving the quality of clustering by mitigating the varying-length and drifting-topics problems to a small extent.

In this paper, our primary aim is to investigate whether segmenting a document into various independent units can help in improving the clustering of text documents. We present a simple algorithm to efficiently calculate inter-passage similarities between text segments of two different documents and then effectively integrate these values with those obtained from considering each document as a single semantic unit, to obtain better clustering of text documents. Throughout this paper, we use the terms text segment and text window interchangeably to mean a segment of a document consisting of a particular number of words, which we refer to as the "Window Size".

The rest of the paper is organized as follows. Section 2 briefly describes related work. Section 3 explains the process of text segmentation and the motivation behind this paper. Section 4 describes our approach to calculating the similarity between two documents. Sections 5 and 6 describe the experimental results and the conclusion, respectively.

2 Related Work

Many Vector Space Model based document clustering approaches make use of single-term analysis only. To further improve the clustering of documents, considering term dependency while calculating document similarity, rather than treating a document as a bag of words, has gained attention [8, 14].

Passage retrieval is the task of retrieving only those segments of text which are relevant to a particular information need. It has been extensively utilized in the field of information retrieval to improve the quality of retrieval [2, 3] and to improve the performance of question answering systems [1]. [6] utilizes segmentation of web pages to improve the quality of web search. In [4], fragments of legal text documents are clustered; however, no segmentation algorithm is needed, as legal documents are naturally decomposable. [12] proposes a passage-based text categorization model, which segments a document and then merges passage categories into document categories to achieve the final categorization.

Perhaps the most closely related works are [5] and [11]. In [5], the authors evaluate the impact of text segmentation on query-specific clustering of text documents. [11] focuses on clustering of multi-topic documents using text segments. Our work differs from [11] in two major aspects. First, our focus is not on multi-topic documents, and second, we attempt to investigate the effect on hard clustering when the similarity between text segments is included in the combined similarity between two documents, while [11] attempts to improve soft clustering of multi-topic documents by utilizing each text segment as an independent semantic unit.

3 Basic Idea

The basis of this work is the intuition that two documents should be considered more similar for the purpose of clustering if their common terms are contained in a small region, compared to two other documents in which these terms are highly scattered across the documents. Traditional vector space model based techniques ignore the density of the regions in which these common terms fall and thus judge many similar (dissimilar) documents as dissimilar (similar).

3.1 Text Segmentation

Text segments can be categorized into three kinds of passages: discourse, semantic, and window. Discourse passages rely on the logical structure of the document marked by punctuation. Semantic passages are obtained by partitioning a document into topics or sub-topics according to its semantic structure (e.g. TextTiling [10]). The third type, fixed-length passages or windows, are defined to contain a fixed number of words and were introduced in [9]. For the sake of simplicity, we use fixed-length passages in our experiments. We use both non-overlapping and overlapping passages to investigate the effect of combining inter-document and inter-passage similarities on text document clustering; a small code sketch of both variants follows the example below.

Example: Document = “The flash washes out the photos, and the camera takes very long to turn on.” Window Size = 4

1. The non-overlapping passages are the following:

Passage 1: "The flash washes out"
Passage 2: "the photos and the"
Passage 3: "camera takes very long"
Passage 4: "to turn on"

2. The overlapping passages with overlap size = (Window Size / 2) are the following:

Passage 1: "The flash washes out"
Passage 2: "washes out the photos"
Passage 3: "the photos and the"
Passage 4: "and the camera takes"
Passage 5: "camera takes very long"
Passage 6: "very long to turn"
Passage 7: "to turn on"
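To make the windowing concrete, the following is a minimal Python sketch of both segmentation variants; the function names are ours and chosen for illustration only.

```python
def non_overlapping_windows(tokens, window_size):
    """Split a token list into consecutive fixed-length windows.

    The last window may be shorter when the number of tokens is not
    a multiple of the window size.
    """
    return [tokens[i:i + window_size]
            for i in range(0, len(tokens), window_size)]


def overlapping_windows(tokens, window_size):
    """Split a token list into windows that overlap by window_size // 2."""
    step = window_size // 2
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + window_size])
        if start + window_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return windows


# Example document from above (punctuation dropped for simplicity).
doc = ("The flash washes out the photos and the camera "
       "takes very long to turn on").split()

print(non_overlapping_windows(doc, 4))  # 4 passages, as listed in the example
print(overlapping_windows(doc, 4))      # 7 passages, as listed in the example
```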

4 Similarity Computation

Let D be a document set with N documents:

$D = \{d_1, d_2, d_3, \dots, d_N\}$

where $d_n = \{t_1, t_2, t_3, \dots, t_m\}$ is the $n$-th document in the corpus and $t_i$ is the $i$-th term in document $d_n$.

4.1 Traditional Inter-document Similarity

We calculate inter-document similarity as the cosine similarity between two document vectors, with each feature weighted using the tf-idf method:

$$w_{t,d} = \log\big(1 + tf(t,d)\big) \cdot \log\Big(1 + \frac{N}{x_t}\Big) \tag{1}$$

where $tf(t,d)$ is the term frequency of term $t$ in document $d$, $N$ is the total number of documents in the corpus, and $x_t$ is the number of documents in which term $t$ occurs. The cosine similarity between two document vectors $\vec{d_1}$ and $\vec{d_2}$ is calculated as

$$Sim_d(d_1, d_2) = \frac{\vec{d_1} \cdot \vec{d_2}}{\lVert \vec{d_1} \rVert \, \lVert \vec{d_2} \rVert} \tag{2}$$
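As a point of reference, here is a small self-contained Python sketch of the tf-idf weighting of Eq. (1) and the cosine similarity of Eq. (2), as reconstructed above; the representation of documents as token lists and the helper names are our own simplifications.

```python
import math
from collections import Counter


def tfidf_vector(doc_tokens, corpus):
    """Weight each term of a document by log(1 + tf) * log(1 + N / x_t), Eq. (1).

    corpus is a list of token lists, one per document.
    """
    N = len(corpus)
    doc_freq = Counter(t for d in corpus for t in set(d))   # x_t for every term
    tf = Counter(doc_tokens)
    return {t: math.log(1 + f) * math.log(1 + N / doc_freq[t])
            for t, f in tf.items()}


def cosine_similarity(v1, v2):
    """Cosine similarity between two sparse (dict) term vectors, Eq. (2)."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0
```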

4.2 Passage-Based Inter-document Similarity

For a document $d$ consisting of $m$ terms and assuming a window size of $w$, document $d$ will be segmented into

1. $k = \lceil m/w \rceil$ windows for non-overlapping text windows, i.e. $k = m/w$ if $m \bmod w = 0$ and $k = \lfloor m/w \rfloor + 1$ otherwise.

2. $k = \lceil 2m/w \rceil - 1$ windows for overlapping text windows with overlap size equal to $w/2$, i.e. $k = 2m/w - 1$ if $m \bmod (w/2) = 0$ and $k = \lfloor 2m/w \rfloor$ otherwise.

A window or passage, too, is represented using a feature vector, with the terms present in the passage being its features and the tf-idf weighting scheme used to weigh these features. However, for weighting the terms of passages, each passage is considered a single document and all the passages of one document together are treated as the full corpus.

Let $d_1$ consist of passages $\{P_1, P_2, \dots, P_r\}$ and $d_2$ of $\{P'_1, P'_2, \dots, P'_s\}$, and assume $r \leq s$. Then the passage-based inter-document similarity for $d_1$ and $d_2$ is

$$Sim_p(d_1, d_2) = \frac{1}{r} \sum_{i=1}^{r} \max_{j} \, Sim(P_i, P'_j) \tag{3}$$

where $j$ varies from 1 to $s$ and the inter-passage similarity $Sim(P_i, P'_j)$ is the cosine similarity between the feature vectors of the two passages.
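A sketch of how Eq. (3) can be computed, reusing cosine_similarity from the previous sketch. Note that the aggregation used here, averaging, for each passage of the shorter document, its best cosine match among the passages of the other document, follows our reading of Eq. (3) as reconstructed above.

```python
def passage_based_similarity(vectors_1, vectors_2):
    """Passage-based inter-document similarity in the spirit of Eq. (3).

    vectors_1 and vectors_2 are lists of tf-idf dictionaries, one per passage.
    For every passage of the document with fewer passages (r <= s), take its
    best cosine match among the passages of the other document, then average
    these best matches.
    """
    if len(vectors_1) > len(vectors_2):     # iterate over the smaller passage set
        vectors_1, vectors_2 = vectors_2, vectors_1
    if not vectors_1:
        return 0.0
    best_matches = [max(cosine_similarity(p, q) for q in vectors_2)
                    for p in vectors_1]
    return sum(best_matches) / len(best_matches)
```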

4.3 Combined Similarity Measure

Let the traditional inter-document similarity for documents $d_1$ and $d_2$ be represented as $Sim_d(d_1, d_2)$ and the suggested passage-based inter-document similarity as $Sim_p(d_1, d_2)$. Then the combined or effective similarity between $d_1$ and $d_2$ is:

$$Sim(d_1, d_2) = \alpha \cdot Sim_p(d_1, d_2) + (1 - \alpha) \cdot Sim_d(d_1, d_2) \tag{4}$$

where $\alpha$ is the similarity blend factor [8] and $0 \leq \alpha \leq 1$.
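The blend of Eq. (4) is then a one-liner; the default value of α below is only illustrative (0.45 is the value that performs best in most of the experiments reported in Section 5).

```python
def combined_similarity(sim_passage, sim_document, alpha=0.45):
    """Combined similarity of Eq. (4): a convex blend of the two measures."""
    assert 0.0 <= alpha <= 1.0
    return alpha * sim_passage + (1 - alpha) * sim_document
```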

5 Experimental Results

We conducted experiments to investigate the effectiveness of our method, i.e. using both inter-document and inter-passage similarities together to improve text document clustering. The experiments were conducted for two types of fixed-length passages: overlapping and non-overlapping. It is important to note that we do not apply any kind of dimensionality reduction to the original document vector, which consists of only single-term features, since our aim is to investigate whether inter-passage similarities can be successfully utilized to improve clustering. In other words, we want to credit any improvement or deterioration in clustering to the suggested similarity measure.

5.1 Data Sets

We used two data sets: one is a web document data set¹, manually collected and labeled from Canadian websites (UW-Can); the second is a collection of articles posted on various USENET newsgroups, a subset of the full 20-newsgroups data set, available from the UCI KDD archive². While the web data set has moderate overlap between different classes, the mini 20-newsgroups data set has varying overlap between different classes. The average length of a document in the UW-Can data set is much greater than that of a document from the mini 20-newsgroups data set.

¹ Link to web data set: http://pami.uwaterloo.ca/~hammouda/webdata
² Link to mini newsgroup data set: http://kdd.ics.uci.edu/

Table 1. Data set description

Data Set                 Type     # of docs   Classes   Avg. # of words/doc
1. UW-Can                HTML     314         10        469
2. Mini 20-newsgroups    USENET   2000        20        151

5.2 Evaluation Measure

We use the F-measure to evaluate the quality of the clustering. The F-measure combines precision and recall by calculating their harmonic mean. Let there be a class $i$ and a cluster $j$; then the precision, recall and F-score of cluster $j$ with respect to class $i$ are as follows:

$$P(i,j) = \frac{n_{ij}}{n_j}, \quad R(i,j) = \frac{n_{ij}}{n_i}, \quad F(i,j) = \frac{2 \cdot P(i,j) \cdot R(i,j)}{P(i,j) + R(i,j)} \tag{5}$$

where

• $n_{ij}$ is the number of documents belonging to class $i$ in cluster $j$.
• $n_i$ is the number of documents belonging to class $i$.
• $n_j$ is the number of documents in cluster $j$.

The F-score of class $i$ is then the maximum F-score it attains in any of the clusters:

$$F(i) = \max_{j} F(i,j) \tag{6}$$

The overall F-score of the clustering is the weighted average of the F-scores of the individual classes:

$$F = \frac{\sum_{i} n_i \cdot F(i)}{\sum_{i} n_i} \tag{7}$$

A higher F-score suggests better clustering, as the produced clusters map to the original classes with higher accuracy.
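For completeness, a small Python sketch of the clustering F-score of Eqs. (5)-(7); the function name and the label-list interface are our own.

```python
from collections import Counter


def clustering_f_score(class_labels, cluster_labels):
    """Overall F-score of a clustering, Eqs. (5)-(7).

    class_labels[k] is the true class of document k and cluster_labels[k]
    the cluster it was assigned to.
    """
    class_sizes = Counter(class_labels)                       # n_i
    cluster_sizes = Counter(cluster_labels)                   # n_j
    joint = Counter(zip(class_labels, cluster_labels))        # n_ij

    total = 0.0
    for cls, n_i in class_sizes.items():
        best_f = 0.0                                          # F(i), Eq. (6)
        for clu, n_j in cluster_sizes.items():
            n_ij = joint.get((cls, clu), 0)
            if n_ij == 0:
                continue
            precision, recall = n_ij / n_j, n_ij / n_i        # Eq. (5)
            best_f = max(best_f, 2 * precision * recall / (precision + recall))
        total += n_i * best_f
    return total / len(class_labels)                          # Eq. (7)
```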

5.3 Clustering Algorithm

For clustering, we use hierarchical agglomerative clustering with complete linkage, with the help of a Java-based tool³.
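The paper relies on an external Java tool for the clustering step; purely as an illustration, an equivalent complete-linkage agglomerative clustering can be run from a precomputed similarity matrix with SciPy, assuming the combined similarities of Eq. (4) are available as a symmetric matrix.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_documents(similarity_matrix, num_clusters):
    """Complete-linkage agglomerative clustering from a similarity matrix.

    Similarities in [0, 1] are converted to distances (1 - similarity)
    before being handed to SciPy's hierarchical clustering.
    """
    distance = 1.0 - np.asarray(similarity_matrix, dtype=float)
    np.fill_diagonal(distance, 0.0)                 # self-distance must be exactly 0
    condensed = squareform(distance, checks=False)  # condensed upper-triangular form
    tree = linkage(condensed, method="complete")
    return fcluster(tree, t=num_clusters, criterion="maxclust")
```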

5.4 Baseline Approach

We chose the traditional tf-idf weighting based single-term approach as our baseline, since our aim is to investigate whether clustering can be improved by combining traditional inter-document similarities with inter-passage similarities, as suggested by us.

³ Link to tool: http://www.cs.umb.edu/~smimarog/agnes/agnes.html

Table 2. Baseline F-score with the traditional vector-based approach, for both data sets

Data Set                 Baseline F-score
UW-Can                   0.7782
Mini 20-newsgroups       0.35126

5.5 Results

The results are summarized in Table 3. For the experiments with non-overlapping segments, we obtained a maximum improvement in F-score of 7.39% for the UW-Can data set and 10.86% for the mini 20-newsgroups data set. For the experiments with overlapping segments, we obtained a maximum improvement of 10.04% for the UW-Can data set and 7.02% for the mini 20-newsgroups data set. For every experiment with overlapping segments, the overlap size is equal to half the window size.

Table 3. Maximum improvement in F-score over the baseline approach, with the corresponding Window Size and Similarity Blend Factor

Data Set              Text Segments      Window Size   Similarity Blend Factor α   Maximum % improvement in F-score
UW-Can                Non-overlapping    225           0.45                        7.39 %
UW-Can                Overlapping        425           0.45                        10.04 %
Mini 20-Newsgroups    Non-overlapping    150           0.45                        10.86 %
Mini 20-Newsgroups    Overlapping        225           0.6                         7.02 %

5.5.1 Graphs for Selected Values of the Parameters Window Size and Similarity Blend Factor α

For all the experiments, the similarity blend factor α assumes only five values, namely 0.4, 0.45, 0.5, 0.55 and 0.6, as we want α to be moderate so that the effectiveness of our method can be judged fairly. A similarity blend factor of 0.45 performs best for most of the experiments with both data sets, as evident from Fig. 2, Fig. 4, Fig. 6 and Fig. 8. If Fig. 1 and Fig. 5 are compared with Fig. 3 and Fig. 7, it is clear that a larger window size is required for better performance when dealing with overlapping windows. The window sizes used for mini 20-newsgroups are smaller than those used for UW-Can, in accordance with their average document lengths; performance is reduced if larger windows are used for smaller documents.


1. For Data Set UW-Can

1.1 For Non-overlapping Text Segments

Fig. 1. Varying F-Score for different values of Window Size (100–275) with α = 0.45

Fig. 2. Varying F-Score for different values of α (0.4–0.6) with Window Size of 225

1.2 For Overlapping Text Segments

Fig. 3. Varying F-Score for different values of Window Size (300–475) with α = 0.45

Fig. 4. Varying F-Score for different values of α (0.4–0.6) with Window Size of 225

2. For Data Set Mini 20-Newsgroups

2.1 For Non-Overlapping Text Segments

Fig. 5. Varying F-Score for different values of Window Size (75–250) with α = 0.45

Fig. 6. Varying F-Score for different values of α (0.4–0.6) with Window Size of 225

2.2 For Overlapping Text Segments

Fig. 7. Varying F-Score for different values of Window Size (150–325) with α = 0.45

Fig. 8. Varying F-Score for different values of α (0.4–0.6) with Window Size of 225

6 Conclusion and Future Work

The presented approach might not provide the best results, but the results are definitely promising. It should be kept in mind that the purpose of this paper is not to suggest an alternative clustering algorithm for text documents, but to determine whether document clustering can be improved by the combined usage of both inter-document and inter-passage similarities. There are many other possibilities to explore, such as investigating the effect on models other than the vector space model, using a different similarity measure, or applying different weighting schemes to the terms of a text segment. In the future, we plan to develop a model which makes use of inter-passage similarities more efficiently. Based on the results obtained, it is our intuition that if such a simple approach can improve clustering, then a more complete approach can prove very useful and produce much better clustering.

References

1. Tellex, S., Katz, B., Lin, J., Fernandes, A., Marton, G.: Quantitative evaluation of passage retrieval algorithms for question answering. In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2003), pp. 41–47. ACM, New York (2003)
2. Salton, G., Allan, J., Buckley, C.: Approaches to passage retrieval in full text information systems. In: Korfhage, R., Rasmussen, E., Willett, P. (eds.) Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1993), pp. 49–58. ACM, New York (1993)
3. Kaszkiel, M., Zobel, J.: Effective ranking with arbitrary passages. J. Am. Soc. Inf. Sci. Technol. 52(4), 344–364 (2001)
4. Conrad, J.G., Al-Kofahi, K., Zhao, Y., Karypis, G.: Effective document clustering for large heterogeneous law firm collections. In: Proceedings of the 10th International Conference on Artificial Intelligence and Law, ICAIL 2005 (2005)
5. Lamprier, S., Amghar, T., Levrat, B., Saubion, F.: Using Text Segmentation to Enhance the Cluster Hypothesis. In: Dochev, D., Pistore, M., Traverso, P. (eds.) AIMSA 2008. LNCS (LNAI), vol. 5253, pp. 69–82. Springer, Heidelberg (2008)
6. Cai, D., Yu, S., Wen, J.-R., Ma, W.-Y.: Block-based web search. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2004), pp. 456–463. ACM, New York (2004)
7. Hotho, A., Staab, S., Stumme, G.: Wordnet improves text document clustering. In: Proceedings of the Semantic Web Workshop at SIGIR 2003, 26th Annual International ACM SIGIR Conference (2003)
8. Hammouda, K.M., Kamel, M.S.: Efficient Phrase-Based Document Indexing for Web Document Clustering. IEEE Trans. on Knowl. and Data Eng. 16(10), 1279–1296 (2004)
9. Callan, J.P.: Passage-level evidence in document retrieval. In: Bruce Croft, W., van Rijsbergen, C.J. (eds.) Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1994), pp. 302–310. Springer-Verlag New York, Inc., New York (1994)
10. Hearst, M.A., Plaunt, C.: Subtopic structuring for full-length document access. In: Korfhage, R., Rasmussen, E., Willett, P. (eds.) Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1993), pp. 59–68. ACM, New York (1993)
11. Tagarelli, A., Karypis, G.: A segment-based approach to clustering multi-topic documents. In: Proceedings of the Text Mining Workshop, SIAM Data Mining Conference (2008)
12. Kim, J., Kim, M.H.: An Evaluation of Passage-Based Text Categorization. J. Intell. Inf. Syst. 23(1), 47–65 (2004)
13. Salton, G., Wong, A., Yang, C.S.: A vector space model for automatic indexing. Commun. ACM 18(11), 613–620 (1975)
14. Chim, H., Deng, X.: A new suffix tree similarity measure for document clustering. In: Proceedings of the 16th International Conference on World Wide Web (WWW 2007), pp. 121–130. ACM, New York (2007)