A Semi-Structured Document Model for Text Mining
J. Comput. Sci. & Technol. Vol.17 No.5, Sept. 2002

YANG Jianwu and CHEN Xiaoou

National Key Laboratory for Text Processing, Institute of Computer Science and Technology, Peking University, Beijing 100871, P.R. China

E-mail: {yjw, cxo}@icst.pku.edu.cn

Received May 14, 2001; revised February 4, 2002.

Abstract  A semi-structured document carries more structural information than an ordinary document, and the relations among semi-structured documents can be fully utilized. In order to take advantage of the structure and link information in semi-structured documents for better mining, a Structured Link Vector Model (SLVM) is presented in this paper, where a vector represents a document and the vector's elements are determined by terms, the document structure and neighboring documents. For brevity and clarity, text mining based on SLVM is described through the procedure of K-means: calculating document similarity and calculating cluster centers. In our experiments, clustering based on SLVM performs significantly better than clustering based on a conventional vector space model, with the F value increasing from 0.65-0.73 to 0.82-0.86.

Keywords  semi-structured document, XML, text mining, vector space model, structured link vector model

1 Introduction

Many documents, such as HTML pages, BibTex files and SGML/XML documents, are semi-structured. With the development of the Internet and the explosive growth in the number of documents, the need for methods to search, filter, retrieve and mine semi-structured documents is increasing rapidly. Text mining has been studied for decades with great progress. However, the results are not yet satisfactory. In conventional mining, documents are treated as unstructured data and each term of a document is taken as a unit. The order of terms, the document structure and neighboring documents are not considered, because such information was missing in early document formats.
In semi-structured documents, especially XML [1] documents, structural information is well preserved and neighborhood relations are shown clearly by links. Work on the structural information and links of documents has recently drawn much attention [2-4]. A model was proposed to enhance hypertext categorization using hyperlinks [2]: robust statistical models and a relaxation labeling technique were established for better classification by exploiting link information in a small neighborhood around each document, and the technique also adapts gracefully to the fraction of neighboring documents with known topics. An approach was explored for clustering XML documents based on the topological graph of XML links, and a tool was created to cluster XML documents [3]. The links of the topological graph were weighted, and the clusters were obtained by minimizing the number of cut links between the clusters. Jeonghee Yi et al. described a text classifier that can cope with structured documents using a structure vector model, where a structured vector represents a document and the vector's elements can be either terms or other structured vectors [4]. In that model, the vector is nested and defined recursively: e_d(i, j) is a vector consisting of the sub-vectors e_d(i+1, h) of its child elements, where 0 <= h <= m_d(i, j) and m_d(i, j) is the number of child nodes of e_d(i, j). The vector is complicated, and the model does not take advantage of the information in the links of the document. Nevertheless, these studies are limited to specific scopes or attributes. In this paper, a Structured Link Vector Model (SLVM) is proposed that effectively exploits the information in both the structure and the links of a document. SLVM is based on the Vector Space Model (VSM) and represents a document with a structured link vector (d_struct, d_out, d_in).
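To make the recursion concrete, the nested structure vector of Yi et al. can be sketched as a small Python class. This is only an illustration: the class and field names (StructuredVector, terms, children) and the flatten helper are assumptions for this sketch, not part of the original model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StructuredVector:
    """Sketch of a recursively nested structure vector e_d(i, j):
    each element holds term weights of its own text plus the vectors
    e_d(i+1, h) of its m_d(i, j) child elements."""
    terms: Dict[str, float]                              # term -> weight for this element's own text
    children: List["StructuredVector"] = field(default_factory=list)

    def flatten(self) -> Dict[str, float]:
        """Collapse the recursion into one flat term-weight map (illustration only)."""
        out = dict(self.terms)
        for child in self.children:
            for term, weight in child.flatten().items():
                out[term] = out.get(term, 0.0) + weight
        return out
```

Flattening discards exactly the per-element structure that the nested model is meant to keep, which illustrates why the recursive form is more expressive but also more complicated to compare and cluster.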
This research is supported by the National Technology Innovation Project and the Peking University Graduate Student Development Foundation as part of innovative doctoral dissertation research.

d_struct, d_out and d_in are the document's Structure Vector, Out-Link Vector and In-Link Vector, which are defined in Section 3. For brevity and clarity, they are described through the procedure of K-means: calculating document similarity and cluster centers based on SLVM. The model is also useful in other text mining applications. Section 2 provides a brief review of conventional document models in text mining. In Section 3, we propose the SLVM model of semi-structured documents. Algorithms for similarity and cluster center calculation based on SLVM are given in Section 4. Section 5 shows the experimental results.

2 Review of Document Models in Text Mining

With the fast growth of the vast amount of text data, people may want to compare different documents, rank the importance and relevance of documents, or find patterns and trends across multiple documents. Furthermore, the Internet can be viewed as a huge, interconnected and dynamic text database. Therefore, text mining is becoming an increasingly popular and essential theme in data mining. Document clustering is one of the most important techniques of text mining and has been investigated in a number of different areas [5]. It is the process of grouping a set of documents into several clusters. A document cluster is a collection of documents that are similar to one another within the same cluster and dissimilar to documents in other clusters. There are a number of algorithms for clustering. K-means is a clustering algorithm suitable for large document sets, and its method of similarity calculation can also be used in classification, similarity-based retrieval and so on. In this paper, K-means is adopted to illustrate SLVM.
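The structured link vector (d_struct, d_out, d_in) can be pictured as a simple three-field container. This is only a hedged sketch: the three vectors are formally defined in Section 3, and the class and field names here are assumptions for illustration.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class SLVMDocument:
    """Hypothetical container for one document's structured link vector."""
    d_struct: np.ndarray  # Structure Vector: term weights organized by document structure
    d_out: np.ndarray     # Out-Link Vector: derived from documents this one links to
    d_in: np.ndarray      # In-Link Vector: derived from documents that link to this one
```

The point of the triple is that structure and link information live in separate components, so a similarity measure can weight them independently instead of collapsing everything into one flat term vector.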
2.1 One of the Clustering Algorithms: K-Means

For a given instance set X with numerical attributes and a given integer k (k <= n), K-means divides X into k clusters so that the sum of the distances between the instances and their cluster centers is minimized. Each cluster center is the mean of the instances in its cluster. The process can be described as the following mathematical problem P:

Minimize:  P(W, Q) = \sum_{l=1}^{k} \sum_{i=1}^{n} w_{i,l} \cdot d(X_i, Q_l)

Subject to:  \sum_{l=1}^{k} w_{i,l} = 1,  w_{i,l} \ge 0,  i = 1, \ldots, n,  l = 1, \ldots, k

where W, an n x k partition matrix, records the cluster membership of every instance (the sum of each row equals 1); X_i is one instance of the given set X; Q_1, Q_2, ..., Q_k are the resulting cluster centers; and d(., .) is the distance between two objects. Problem P can be solved by repeatedly solving the following two sub-problems P1 and P2:

1) Sub-problem P1: fixing Q = \hat{Q}, problem P reduces to P(W, \hat{Q}), i.e., the cluster centers are fixed and the main task is to calculate the similarity/distance between each instance and each cluster center. Its solution is:

w_{i,l} = 1  if d(X_i, Q_l) \le d(X_i, Q_t) for every t, 1 \le t \le k
w_{i,t} = 0  for every t \ne l

2) Sub-problem P2: fixing W = \hat{W}, problem P reduces to P(\hat{W}, Q), i.e., updating the cluster centers. Its solution is:

q_{l,j} = \frac{\sum_{i=1}^{n} w_{i,l} \, x_{i,j}}{\sum_{i=1}^{n} w_{i,l}},  where 1 \le l \le k, 1 \le j \le m

So the algorithm for problem P repeats two steps, as follows.

K-means algorithm:
1) arbitrarily choose k objects as the initial cluster centers;
2) repeat
3) (re)assign each object to the cluster whose center it is most similar to, based on the mean value of the objects in the cluster;
4) update the cluster means, i.e., calculate the mean value of the objects in each cluster;
5) until no change.

The key of the algorithm is the calculation of similarity and cluster centers.
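The alternation between sub-problem P1 (reassignment) and sub-problem P2 (center update) can be sketched in Python. This is a minimal illustration using Euclidean distance on numeric arrays, not the paper's SLVM-based variant; the function name and parameters are assumptions for this sketch.

```python
import numpy as np


def kmeans(X, k, iters=100, seed=0):
    """Plain K-means: alternate assignment (P1) and center update (P2)
    until the assignments stop changing."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Step 1: arbitrarily choose k objects as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(iters):
        # P1: assign each instance to its nearest cluster center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # Step 5: stop when no assignment changes.
        labels = new_labels
        # P2: recompute each center as the mean of its assigned instances.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels, centers
```

For document clustering one would replace the Euclidean distance in P1 with a document similarity measure such as the cosine similarity reviewed in Section 2.2.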
2.2 Vector Space Model

In text mining, the objects of mining are not as canonical as data in a DBMS. Generally, documents are translated into canonical forms, like the records of a DBMS, that preserve the characteristics of the document content. For most clustering algorithms, documents are represented using the Vector Space Model (VSM) [6]. In VSM, each document d is considered to be a vector d = (d(1), d(2), ..., d(n)) in the term space (the set of document "words"). In its simplest form, each document is represented by the term-frequency (TF) vector d_tf = (tf_1, tf_2, ..., tf_n), where tf_i is the frequency of the i-th term in the document. Normally, all common words are stripped out completely and different forms of a word are reduced to one canonical form. TFIDF [7], a term-weighting approach for VSM, weights each term based on its inverse document frequency (IDF) in the document collection, which discounts frequent words with little discriminating power:

d(i) = TF(W_i, Doc) \cdot IDF(W_i)

where TF(W_i, Doc) is the frequency of the term W_i in the document Doc, and

IDF(W_i) = \log(D / DF(W_i))

where D is the total number of documents and DF(W_i) is the number of documents in which the term W_i appears at least once. Usually, the similarity between two document vectors is defined as:

\cos(d_i, d_j) = (d_i \cdot d_j) / (\|d_i\| \cdot \|d_j\|)

where \cdot indicates the dot product and \|d\| is the length of vector d. The cluster center is defined as:

C = \frac{1}{|S|} \sum_{d \in S} \vec{d}

where d is a document, \vec{d} is the vector of document d, S is the document set of the cluster, and |S| is the number of documents in S.
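The TF-IDF weighting and cosine similarity defined above can be sketched as follows. The function names and the token-list representation of documents are assumptions for this illustration; a real system would first strip common words and stem, as noted above.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Compute d(i) = TF(W_i, Doc) * IDF(W_i) with IDF(W_i) = log(D / DF(W_i)).
    Each document is given as a list of tokens."""
    vocab = sorted({t for doc in docs for t in doc})
    D = len(docs)  # total number of documents
    df = {t: sum(1 for doc in docs if t in doc) for t in vocab}  # document frequency
    vecs = []
    for doc in docs:
        tf = Counter(doc)  # term frequency within this document
        vecs.append([tf[t] * math.log(D / df[t]) for t in vocab])
    return vocab, vecs


def cosine(a, b):
    """cos(d_i, d_j) = (d_i . d_j) / (||d_i|| * ||d_j||)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0
```

Note that a term appearing in every document gets IDF = log(1) = 0 and so contributes nothing to the similarity, which is exactly the discounting of non-discriminating words described above.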