http://www.paper.edu.cn

Vol.17 No.5 J. Comput. Sci. & Technol. Sept. 2002

A Semi-Structured Document Model for Text Mining

YANG Jianwu and CHEN Xiaoou

National Key Laboratory for Text Processing, Institute of Computer Science and Technology, Peking University, Beijing 100871, P.R. China

E-mail: {yjw, cxo}@icst.pku.edu.cn

Received May 14, 2001; revised February 4, 2002.

Abstract    A semi-structured document carries more structural information than an ordinary document, and the relations among semi-structured documents can be fully utilized. In order to exploit the structure and link information of semi-structured documents for better mining, this paper presents a Structured Link Vector Model (SLVM), in which a document is represented by a vector whose elements are determined by terms, the document structure and neighboring documents. For brevity and clarity, text mining based on SLVM is described through the two key steps of the K-means procedure: calculating document similarity and calculating cluster centers. In our experiments, clustering based on SLVM performs significantly better than clustering based on the conventional vector space model: the F value increases from 0.65-0.73 to 0.82-0.86.

Keywords    semi-structured document, XML, text mining, vector space model, structured link vector model

1 Introduction

Many documents, such as HTML pages, BibTeX files and SGML/XML documents, are semi-structured. With the development of the Internet and the explosive growth of the number of documents, the need for methods to search, filter, retrieve and mine semi-structured documents is increasing rapidly.

Text mining has been studied for decades with great progress; however, the results are still not satisfactory. In conventional mining, documents are treated as unstructured data and each term of a document is taken as a unit. The order of terms, the document structure and neighboring documents are not considered, because this information was missing in early documents. In semi-structured documents, especially in XML [1] documents, structural information is well preserved and neighborhood relations are shown clearly by links.

Recent progress in exploiting the structural information and links of documents has drawn much attention [2-4]. A model was proposed to enhance categorization using hyperlinks [2]: robust statistical models and a relaxation labeling technique were established for better classification by exploiting link information in a small neighborhood around each document, and the technique adapts gracefully to the fraction of neighboring documents with known topics. An approach to clustering XML documents based on the topological graph of XML links was explored, and a tool was created to cluster XML documents [3]; the links of the topological graph were weighted, and the clusters were obtained by minimizing the number of links cut between clusters. Jeonghee Yi et al. described a text classifier that copes with structured documents using a structure vector model, in which a structured vector represents a document and the vector's elements can be either terms or other structured vectors [4].
In that model, the vector is nested and defined recursively: e_d(i,j) is a vector consisting of all sub-vectors e_d(i+1,h) of its child elements, where 0 <= h <= m_d(i,j) and m_d(i,j) is the number of child nodes of e_d(i,j). The vector is complicated, and the model does not take advantage of the information in the links of the document. Each of these studies, however, is limited to some specific scope or attribute.

In this paper, a Structured Link Vector Model (SLVM) is proposed, in which the information in the structure and links of a document is effectively exploited. SLVM is based on the Vector Space Model (VSM) and represents a document with a Structured Link Vector (d_struct, d_out, d_in).

This research is supported by the National Technology Innovation Project and the Peking University Graduate Student Development Foundation as part of a doctoral dissertation's innovative research.


d_struct, d_out and d_in are the document's Structure Vector, Out-Link Vector and In-Link Vector, which are defined in Section 3. For brevity and clarity, they are described through the two key steps of the K-means procedure: calculating document similarity and calculating cluster centers based on SLVM. The model is also useful in other text mining applications.

Section 2 provides a brief review of conventional document models in text mining. In Section 3, we propose the SLVM model of semi-structured documents. Algorithms for similarity and cluster center calculation based on SLVM are given in Section 4. Section 5 shows the experimental results.

2 Review of Document Model in Text Mining

With the fast growth of the vast amount of text data, people may wish to compare different documents, rank the importance and relevance of documents, or find patterns and trends across multiple documents. Furthermore, the Internet can be viewed as a huge, interconnected and dynamic text database. Therefore, text mining is becoming an increasingly popular and essential theme in data mining.

Document clustering is one of the most important techniques of text mining, and has been investigated in a number of different areas [5]. It is the process of grouping a set of documents into several clusters. A document cluster is a collection of documents that are similar to one another within the same cluster and dissimilar to documents in other clusters. There are a number of algorithms for clustering; K-means is a clustering algorithm suitable for large document sets. The method of similarity calculation can also be used in classification, similarity-based retrieval and so on. In this paper, K-means is adopted to illustrate SLVM.

2.1 One of the Clustering Algorithms: K-Means

For a given instance set X with numerical attributes and a given integer k (k <= n), K-means divides X into k clusters so that the sum of distances between instances and their cluster centers is minimized. Each cluster center is the mean of the instances in the cluster. The process can be described as the following mathematical problem P:

Minimize:   P(W, Q) = sum_{l=1}^{k} sum_{i=1}^{n} w_{i,l} * d(X_i, Q_l)

Subject to: sum_{l=1}^{k} w_{i,l} = 1,  w_{i,l} >= 0,  i = 1,...,n,  l = 1,...,k

where W, an n x k partition matrix, describes the cluster membership of every instance (the sum of each row is 1); X_i is one instance of the given set X; Q_1, Q_2, ..., Q_k are the cluster centers; and d(.,.) is the distance between two objects. Problem P can be solved by repeatedly solving the following two sub-problems P1 and P2:

1) Sub-problem P1: fixing Q = Q^, problem P reduces to P(W, Q^), i.e., the cluster centers are fixed. The main task is to calculate the similarity/distance between each instance and each cluster center. Its solution is:

w_{i,l} = 1   if d(X_i, Q_l) <= d(X_i, Q_t) for every t, 1 <= t <= k
w_{i,t} = 0   for every t != l

2) Sub-problem P2: fixing W = W^, problem P reduces to P(W^, Q), i.e., the cluster centers are updated. Its solution is:

Q_l = sum_{i=1}^{n} w_{i,l} * X_i / sum_{i=1}^{n} w_{i,l},   l = 1,...,k

So, the algorithm for problem P repeats two steps, as follows.

K-means algorithm:
1) arbitrarily choose k objects as the initial cluster centers;
2) repeat
3)   (re)assign each object to the cluster to which it is the most similar, based on the mean value of the objects in the cluster;
4)   update the cluster means, i.e., calculate the mean value of the objects for each cluster;
5) until no change.

The key of the algorithm is the calculation of similarity and cluster centers.
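The two alternating steps above can be sketched in a few lines; this is a minimal illustration (the function and variable names are ours, not part of the paper):

```python
import random

def kmeans(points, k, dist, mean, max_iter=100):
    """Alternate sub-problem P1 (assignment) and sub-problem P2
    (center update) until the assignment no longer changes."""
    centers = random.sample(points, k)          # step 1: arbitrary initial centers
    assignment = None
    for _ in range(max_iter):
        # steps 2-3: (re)assign each object to the nearest cluster center
        new_assignment = [min(range(k), key=lambda j: dist(p, centers[j]))
                          for p in points]
        if new_assignment == assignment:        # step 5: until no change
            break
        assignment = new_assignment
        # step 4: update each cluster mean
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = mean(members)
    return assignment, centers

# Toy run with 1-D points, Euclidean distance and arithmetic mean:
pts = [1.0, 1.1, 0.9, 8.0, 8.2, 7.9]
labels, centers = kmeans(pts, 2,
                         dist=lambda a, b: abs(a - b),
                         mean=lambda ms: sum(ms) / len(ms))
```

With SLVM, only `dist` and `mean` change: they become the similarity and cluster-center calculations of Section 4.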

2.2 Vector Space Model

In text mining, the objects of mining are not as canonical as data in a DBMS. Generally, documents are translated into canonical forms, like the records of a DBMS, which preserve the characteristics of the document content. For most clustering algorithms, documents are represented using the Vector Space Model [6] (VSM). In VSM, each document d is considered to be a vector d in the term space (the set of document "words"): d = (d(1), d(2), ..., d(n)). In its simplest form, each document is represented by the term-frequency (TF) vector: d_tf = (tf_1, tf_2, ..., tf_n), where tf_i is the frequency of the i-th term in the document. Normally all common words are stripped out completely and different forms of a word are reduced to one canonical form. TFIDF [7], a term weighting approach for VSM, weights each term based on its inverse document frequency (IDF) in the document collection. This discounts frequent words with little discriminating power:

d(i) = TF(W_i, Doc) * IDF(W_i)

where TF(W_i, Doc) is the frequency of the term W_i in the document Doc.

IDF(W_i) = log(D / DF(W_i))

D is the total number of documents, and DF(W_i) is the number of documents in which the term W_i appears at least once. Usually, the similarity is defined as:

cos(d_i, d_j) = (d_i . d_j) / (||d_i|| ||d_j||)

where "." denotes the inner (dot) product and ||d|| is the length of vector d. The cluster center is defined as:

c = (1/|S|) * sum_{d in S} d

where d is a document, d is the vector of document d, S is the document set of the cluster, and |S| is the number of documents in S. The Vector Space Model does not consider the order of terms, the structure of documents or neighboring documents.
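The TFIDF weighting and cosine similarity above can be sketched as follows; this is an illustrative implementation with simplified tokenization (documents are given as pre-split term lists):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TFIDF vectors over a shared vocabulary.
    TF(W_i, Doc) is the raw count; IDF(W_i) = log(D / DF(W_i))."""
    D = len(docs)
    vocab = sorted({t for d in docs for t in d})
    df = {t: sum(t in d for d in docs) for t in vocab}     # document frequency
    idf = {t: math.log(D / df[t]) for t in vocab}
    return [[Counter(d)[t] * idf[t] for t in vocab] for d in docs]

def cosine(x, y):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

docs = [["xml", "mining", "mining"], ["xml", "cluster"], ["cluster", "mining"]]
vecs = tfidf_vectors(docs)
```

Note that a term occurring in every document gets IDF = log(1) = 0 and is effectively discarded, which is exactly the discounting behavior described above.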

3 Structured Link Vector Model (SLVM)

An XML document is a typical semi-structured document, and XML is becoming a standard for organizing and exchanging data on the new-generation Web. We focus on XML documents in this paper.

Example 1. XML document

<?xml version="1.0" encoding="gb2312"?>
<addressbook>
  <person sex="male">
    <name>John</name>
    <email>john@xml.net.cn</email>
  </person>
  <person sex="female">
    <name>Mary</name>
    <email>mary@xml.net.cn</email>
  </person>
</addressbook>

Fig.1. DOM tree of the XML document.

XML (Extensible Markup Language) is a markup language and a subset of SGML (Standard Generalized Markup Language) suitable for the Internet. In contrast to flat text, XML documents have the following characteristics:
1) the documents are structured, consisting of elements, and the structure of a document can be validated;
2) the elements form a tree structure, and one element can be nested in another;
3) the relation between elements or documents may be a link or a reference.

More information can be exploited in XML documents than in flat texts, so we propose a Structured Link Vector Model (SLVM), including a Structure Vector and Link Vectors. Given a document set D and a document d in D, its Structured Link Vector is composed of a Structure Vector, an In-Link Vector and an Out-Link Vector.
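The address-book document of Example 1 can be parsed into a DOM-style element tree; below is a minimal sketch using Python's standard library (the e-mail addresses are placeholders, since the originals are illegible in the scanned source):

```python
import xml.etree.ElementTree as ET

# A cut-down version of the Example 1 addressbook.
xml_text = """<addressbook>
  <person sex="male"><name>John</name><email>john@xml.net.cn</email></person>
  <person sex="female"><name>Mary</name><email>mary@xml.net.cn</email></person>
</addressbook>"""

root = ET.fromstring(xml_text)
# Element nodes form a tree; attributes and text hang off the elements.
for person in root.findall("person"):
    print(person.get("sex"), person.findtext("name"), person.findtext("email"))
```

The element tree produced here is exactly the structure that the Structure Vector of Section 3.1 is built over: each distinct path from the root becomes a labeled node.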

3.1 Structure Vector

The structure of a semi-structured document can be expressed as a structure tree. The structure of XML documents is described by a DTD (Document Type Definition), which is a list of regular expressions. In other data models, such as OEM [9], the structure of semi-structured documents is also expressed as a tree or another directed graph.

Example 2. Document Type Definition (DTD)

<!ELEMENT addressbook (person)+>
<!ELEMENT person (name, email?)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ATTLIST person sex CDATA #REQUIRED>

Fig.2. Structure tree of Example 2.    Fig.3. Pure structure tree.

In order to define the Structure Vector, the nodes of a structure tree are labeled e_0, ..., e_{m-1}; they include element nodes, attribute nodes and text nodes. The element nodes of the tree correspond to the elements of the document, but nodes with different paths from the root are viewed as different kinds and given different tokens, even when they correspond to the same element. We define the Structure Vector as follows.

Definition 1 (Structure Vector). Given a set D of documents and a document d in D, its Structure Vector is n-dimensional:

d_struct = (d_struct(1), ..., d_struct(n)),
d_struct(i) = sum_{j=1}^{m} (TF(W_i, Doc.e_j) * e_j) * IDF(W_i)    (1)

where e_j is a unit vector corresponding to node e_j, and m is the number of nodes. The relations between nodes are reflected by the unit vectors e_j, and we define a similarity matrix to capture these relations in Section 4. TF(W_i, Doc.e_j) is the frequency of term W_i in the node e_j of the current document Doc, and

IDF(W_i) = log(D / DF(W_i))


where D is the total number of documents and DF(W_i) is the number of documents in which the term W_i appears at least once; IDF(W_i) embodies the ability of a term to distinguish documents. Each component d_struct(i) of the document Structure Vector corresponds to a term and is itself a vector, because it contains the unit vectors e_j.
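Formula (1) can be illustrated with a toy example. Here each d_struct(i) is stored as an m-vector whose j-th component is TF(W_i, Doc.e_j) * IDF(W_i), so the unit vectors e_j are simply the coordinate axes; the node labels, terms and IDF values are invented for illustration:

```python
import math

def structure_vector(doc_nodes, idf, vocab, nodes):
    """Structure Vector per Definition 1: for each term W_i, an m-vector
    whose j-th component is TF(W_i, Doc.e_j) * IDF(W_i).
    doc_nodes maps a node label e_j to the list of terms it contains."""
    vec = {}
    for term in vocab:
        vec[term] = [doc_nodes.get(e, []).count(term) * idf[term]
                     for e in nodes]
    return vec

nodes = ["name", "email"]                             # labeled nodes e_0 ... e_{m-1}
idf = {"john": math.log(2.0), "mary": math.log(2.0)}  # toy IDF values
doc = {"name": ["john"], "email": ["john"]}           # term occurrences per node
d_struct = structure_vector(doc, idf, ["john", "mary"], nodes)
```

Unlike plain VSM, the same term contributes separately per node, so "john" under `name` and "john" under `email` remain distinguishable.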

3.2 Link Vector

For the link relation, absorbing neighbor text (adding a neighbor's terms to the local document) gives worse results than using local terms alone, as pointed out by S. Chakrabarti et al. when they enhanced hypertext categorization using hyperlinks [2]. We therefore propose an In-Link Vector and an Out-Link Vector to describe the link relations between documents; in this paper they are considered independent of the Structure Vector. They are based on the idea that the more links two documents have to (or from) similar documents, the more similar the two documents are.

In XLink, xlink:from defines the source resource of a link (the source anchor of the local document in an HTML link or simple link) and xlink:to defines the target resource (the target anchor of the remote document in an HTML link or simple link). A resource is either a whole document or elements of a document.

Definition 2 (Out-Link Vector). The Out-Link Target Resources of document Doc are the target resources (Doc_n.e_j in link.to) of links whose source resources are the whole or part of the document Doc (link.from in Doc). The Out-Link Vector of document Doc is defined as the sum of the Structure Vectors of its Out-Link Target Resources, as follows:

d_out = (d_out(1), ..., d_out(n)),

d_out(i) = lambda_out * sum_{link.from in Doc} sum_{Doc_n.e_j in link.to} (TF(W_i, Doc_n.e_j) * e_j) * IDF(W_i)

         = lambda_out * sum_{j=1}^{m} (TLout(W_i, Doc, e_j) * e_j) * IDF(W_i)    (2)

where lambda_out is a constant, the weight of the Out-Link Vector relative to the Structure Vector; Doc_n is a neighbor of document Doc; and TLout(W_i, Doc, e_j) is the frequency of term W_i in the node e_j of the Out-Link Target Resources of Doc. Other tokens are the same as in formula (1).

Definition 3 (In-Link Vector). The In-Link Source Resources of document Doc are the source resources (Doc_n.e_j in link.from) of links whose target resources are the whole or part of document Doc (link.to in Doc). The In-Link Vector of document Doc is defined as the sum of the Structure Vectors of its In-Link Source Resources:

d_in = (d_in(1), ..., d_in(n)),

d_in(i) = lambda_in * sum_{link.to in Doc} sum_{Doc_n.e_j in link.from} (TF(W_i, Doc_n.e_j) * e_j) * IDF(W_i)

        = lambda_in * sum_{j=1}^{m} (TLin(W_i, Doc, e_j) * e_j) * IDF(W_i)    (3)

where lambda_in is a constant, the weight of the In-Link Vector relative to the Structure Vector; Doc_n is a neighbor of document Doc; and TLin(W_i, Doc, e_j) is the frequency of term W_i in the node e_j of the In-Link Source Resources of Doc. Other tokens are the same as in formula (1).

3.3 Structured Link Vector

Definition 4 (Structured Link Vector). Given a set D of documents and a document d in D, its Structured Link Vector is:

d = (d_struct, d_out, d_in)


where d_struct, d_out, d_in are the Structure Vector, Out-Link Vector and In-Link Vector of document d.

Definition 5 (SLVM: Structured Link Vector Model). Given a set D of documents, a document d in D is viewed as its Structured Link Vector.

4 Document Clustering Algorithm Based on SLVM

The key of the K-means algorithm is the calculation of cluster centers and similarity. Based on SLVM, they are defined as follows.

Cluster center:

c = (1/|S|) * sum_{d in S} d

where c is an n-dimensional vector (c(1), ..., c(n)) and

c(i) = ( (1/|S|) * sum_{d in S} d_struct(i),  (1/|S|) * sum_{d in S} d_out(i),  (1/|S|) * sum_{d in S} d_in(i) )

Similarity between documents Doc_x and Doc_y:

cos(d_x, d_y) = (d_x . d_y) / (||d_x|| ||d_y||)

where

d_x . d_y = sum_{i=1}^{n} (d_struct_x(i) * d_struct_y(i) + d_out_x(i) * d_out_y(i) + d_in_x(i) * d_in_y(i))

d_struct_x(i) * d_struct_y(i) = ( sum_{j=1}^{m} (TF(W_i, Doc_x.e_j) * e_j) * IDF(W_i) ) * ( sum_{k=1}^{m} (TF(W_i, Doc_y.e_k) * e_k) * IDF(W_i) )

                              = IDF(W_i)^2 * sum_{j=1}^{m} sum_{k=1}^{m} TF(W_i, Doc_x.e_j) * TF(W_i, Doc_y.e_k) * (e_j . e_k)

d_out_x(i) * d_out_y(i) = lambda_out^2 * IDF(W_i)^2 * sum_{j=1}^{m} sum_{k=1}^{m} TLout(W_i, Doc_x, e_j) * TLout(W_i, Doc_y, e_k) * (e_j . e_k)

d_in_x(i) * d_in_y(i) can be calculated in the same way, and we thus obtain (d_x . d_y), ||d_x|| and ||d_y||. The calculation of similarity is not yet finished, however, because e_j . e_k still appears in the expressions. The product e_j . e_k means that the weight of the same term W_i differs across elements; this is one of the original intentions of SLVM. That is to say, the same term has the same importance (weight) everywhere in conventional VSM, but it has different importance in different nodes of the document in SLVM. In order to describe the relations between nodes in documents, we introduce the node similarity matrix S = (s_jk), s_jk = e_j . e_k (0 <= j, k <= m-1), where m is the number of nodes in the structure tree.

          e_0        ...  e_k        ...  e_{m-1}
e_0       s_00       ...  s_0k       ...  s_0(m-1)
 :         :               :               :
e_j       s_j0       ...  s_jk       ...  s_j(m-1)
 :         :               :               :
e_{m-1}   s_(m-1)0   ...  s_(m-1)k   ...  s_(m-1)(m-1)

The matrix can be defined by users according to their concrete applications, or it can be given by a default formula. For example,

s_jk = 2^{-distance(e_j, e_k)} * 10^{-(depth(e_j) + depth(e_k))}

where distance(e_j, e_k) is the distance between nodes e_j and e_k in the structure tree, and depth(e_j) and depth(e_k) are the depths of nodes e_j and e_k in the structure tree, respectively.
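This example formula can be computed directly from the structure tree; below is a sketch for the tree of Example 2 (the parent-map representation of the tree is our own choice, not part of the paper):

```python
def depth(node, parent):
    """Number of edges from node to the root."""
    d = 0
    while parent[node] is not None:
        node = parent[node]
        d += 1
    return d

def distance(a, b, parent):
    """Number of edges on the path between two nodes in the tree."""
    ancestors = {}
    n, d = a, 0
    while n is not None:                 # record a's ancestors with distances
        ancestors[n] = d
        n, d = parent[n], d + 1
    n, d = b, 0
    while n not in ancestors:            # climb from b to the common ancestor
        n, d = parent[n], d + 1
    return d + ancestors[n]

def similarity_matrix(nodes, parent):
    """s_jk = 2^{-distance(e_j, e_k)} * 10^{-(depth(e_j) + depth(e_k))}"""
    return [[2 ** -distance(a, b, parent)
             * 10 ** -(depth(a, parent) + depth(b, parent))
             for b in nodes] for a in nodes]

# Structure tree of Example 2: addressbook -> person -> {sex, name, email}
parent = {"addressbook": None, "person": "addressbook",
          "sex": "person", "name": "person", "email": "person"}
S = similarity_matrix(["addressbook", "person", "name", "email"], parent)
```

By construction the matrix is symmetric, and s_jk decays quickly as nodes get farther apart or deeper in the tree.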

5 Experiments and Results

5.1 Evaluation of Cluster Quality

There are many quality measures for text mining. In this paper, we use the F-measure introduced by Bjorner Larsen [10], a measure that combines the precision and recall ideas from information retrieval. We treat each cluster as if it were the result of a query and each class as if it were the desired set of documents for that query. We then calculate the recall and precision of each cluster for each given class. More specifically, for cluster j and class i,

Recall(i, j) = n_ij / n_i
Precision(i, j) = n_ij / n_j

where n_ij is the number of members of class i in cluster j, n_j is the number of members of cluster j and n_i is the number of members of class i. The F-measure of cluster j and class i is then given by

F(i, j) = (2 * Recall(i, j) * Precision(i, j)) / (Precision(i, j) + Recall(i, j))

The overall F-measure is given by the following:

F = sum_i (n_i / n) * max_j {F(i, j)}

where the maximum is taken over all clusters j for class i, n is the number of documents and n_i is the number of members of class i.
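The computation above can be sketched as follows; classes and clusters are represented as sets of document identifiers, and the toy data is ours:

```python
def f_measure(classes, clusters):
    """Overall F: for each class, take the best F(i, j) over all
    clusters, then average the best values weighted by class size."""
    n = sum(len(c) for c in classes)
    total = 0.0
    for ci in classes:
        best = 0.0
        for cj in clusters:
            nij = len(ci & cj)           # members of class i in cluster j
            if nij == 0:
                continue
            recall = nij / len(ci)
            precision = nij / len(cj)
            best = max(best, 2 * recall * precision / (precision + recall))
        total += len(ci) / n * best
    return total

# A perfect clustering recovers F = 1.0:
classes = [{1, 2, 3}, {4, 5}]
print(f_measure(classes, [{1, 2, 3}, {4, 5}]))
```

Merging the two classes into one cluster lowers precision for each class, so the same call with `[{1, 2, 3, 4, 5}]` returns a value strictly below 1.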

5.2 Datasets

The dataset on which we run our experiments is part of the Chinese Encyclopedia Database, one of the earliest large-scale national projects whose data adopt SGML/XML (http://www.ecph.com.cn/). In our experiments, the dataset contains hundreds of entries from several volumes of the "Chinese Encyclopedia", and each entry is an XML document. Table 1 lists a group of typical datasets, which use 762 entries from 12 volumes. We compare document clustering based on SLVM with clustering based on VSM (with TFIDF weighting). In our experiments, the initial values for SLVM are: lambda_out = 0.5; lambda_in = 0.5; node similarity matrix: s_jk = 2^{-distance(e_j, e_k)} * 10^{-(depth(e_j) + depth(e_k))}, where distance(e_j, e_k) is the distance between nodes e_j and e_k in the structure tree, and depth(e_j) and depth(e_k) are the depths of nodes e_j and e_k in the structure tree, respectively.

Table 1. A Group of Datasets

Volume        Philosophy  Demotic   Politics    Economics   Education  Psychology
# of entries  63          76        82          86          73         61
Volume        Culture     Strategic Archeology  Hydraulics  Mechanics  Wushu
# of entries  45          54        66          38          76         42

5.3 Results

Table 2 and Table 3 give the final experimental results on the dataset in Table 1. The n_ij and F(i, j) values are omitted because of the limitation of space. The tokens of Table 2 and Table 3 are the same as those in Subsection 5.1.


Table 2. Clustering Based on VSM (with TFIDF Weighting)

N_i                  63     76     82     86     73     61     45     54     66     38     76     42
N_j                  60     69     88     87     67     78     45     63     74     28     68     35
n_ij(max(F(i,j)))    37     53     62     68     49     41     32     38     52     20     53     23
max(F(i,j))          0.602  0.731  0.729  0.786  0.700  0.590  0.711  0.650  0.743  0.606  0.736  0.597

F-measure: F = 0.69

Table 3. Clustering Based on SLVM

N_i                  63     76     82     86     73     61     45     54     66     38     76     42
N_j                  59     71     89     86     70     69     42     58     71     35     72     40
n_ij(max(F(i,j)))    52     62     79     74     60     53     37     41     58     31     64     32
max(F(i,j))          0.852  0.844  0.924  0.860  0.839  0.815  0.851  0.732  0.847  0.849  0.865  0.780

F-measure: F = 0.84

In our experiments, the results are similar when we use other datasets. F is 0.65-0.73 in document clustering based on VSM (with TFIDF weighting), while F increases to 0.82-0.86 in document clustering based on SLVM, owing to the information in the structure and links of the documents.

6 Conclusion

In this paper, a novel model (SLVM) of semi-structured documents for text mining is proposed, in which the information in the structure and links of documents is taken advantage of. In addition, a new algorithm for calculating cluster centers and similarities in document clustering based on SLVM is presented. In our experiments, the model improves the accuracy of document clustering. As a next step, we plan to improve the efficiency of text mining for semi-structured documents on the basis of document structure and links.

References

[1] Bray T, Paoli J, Sperberg-McQueen C M. Extensible Markup Language (XML) 1.0. W3C Recommendation, World Wide Web Consortium, Feb. 1998. http://www.w3.org/TR/1998/REC-xml-19980210.
[2] Chakrabarti S, Dom B, Indyk P. Enhanced hypertext categorization using hyperlinks. In Proc. ACM SIGMOD Conference, Seattle, Washington, 1998.
[3] Damien Guillaume, Fionn Murtagh. Clustering of XML documents. Computer Physics Communications, 2000, (127): 215-227.
[4] Jeonghee Yi, Neel Sundaresan. A classifier for semi-structured documents. In KDD 2000, Boston, MA, USA, 2000.
[5] Steinbach M, Karypis G, Kumar V. A comparison of document clustering techniques. University of Minnesota, Technical Report #00-034, 2000. http://www.cs.umn.edu/tech_reports/
[6] Gerard Salton, McGill M J. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
[7] Gerard Salton, Chris Buckley. Term weighting approaches in automatic text retrieval. Technical Report 87-881, Cornell University, Computer Science Department, November 1987.
[8] Charles F Goldfarb, Paul Prescod. The XML Handbook. Prentice Hall PTR, 1998.
[9] Papakonstantinou Y, Garcia-Molina H, Widom J. Object exchange across heterogeneous information sources. In Proc. the Eleventh International Conference on Data Engineering, Taipei, March 1995, pp.251-260.
[10] Bjorner Larsen, Chinatsu Aone. Fast and effective text mining using linear-time document clustering. In KDD-99, San Diego, California, 1999.

YANG Jianwu is a Ph.D. candidate at the Institute of Computer Science and Technology, Peking University, China, where he received his M.S. degree in 1999. His current research interests include SGML/XML and data mining.

CHEN Xiaoou obtained his B.S. degree from the Department of Computer Science and Technology, National Defense University in 1983. He has been a research staff member at the Institute of Computer Science and Technology, Peking University since 1990, and has been a professor since 2000. He is the president of the Founder Research and Development Center. His current research interests are image processing, XML data exchange and representation.