
Lexical-semantic SLVM for XML Document Classification

Jun Long
School of Information Science and Engineering, Central South University, Changsha 410075, China
Email: [email protected]

Luda Wang
School of Information Science and Engineering, Central South University, Changsha 410075, China
Xiangnan University, Chenzhou 423000, China
Email: [email protected]

Zude Li, Zuping Zhang, Huiling Li and Guihu Zhao
School of Information Science and Engineering, Central South University, Changsha 410075, China
Email: [email protected], [email protected], [email protected]

Abstract—The structured link vector model (SLVM) and its improved version depend on statistical term measures to implement XML document representation. As a result, they ignore the lexical semantics of terms and their mutual information, which leads to text classification errors. This paper proposes an XML document representation method, the WordNet-based lexical-semantic SLVM, to solve this problem. Using WordNet, the method constructs a data structure for characterizing the lexical semantic contents of an XML document and adapts maximum entropy (ME) modeling to disambiguate word stems. A synset matrix of lexical semantic contents is then built in the lexical-semantic feature space for XML document representation, and lexical semantic relations are marked on it to construct the feature matrix of the lexical-semantic SLVM. On a categorized dataset of the Wikipedia XML corpus, using the NWKNN classification algorithm, the experimental results show that the feature matrix of our method achieves better F1 measures than the original SLVM and the frequent sub-tree SLVM based on TF-IDF.

Index Terms—Semi-structured document, SLVM, Lexical semantics, Classification, Feature matrix

I. INTRODUCTION

In order to record semi-structured information, many document standards, such as HTML, BibTeX and SGML/XML, are recommended by the main international standards organizations. The structural flexibility of these semi-structured documents offers an approach to representing information in a specified structure. In contrast to conventional plain documents, semi-structured documents express their syntactic structure through document structural elements marked by user-specified tags, together with an associated schema specified in a Schema format [1].

Extensible Markup Language (XML) is a semi-structured document format with strong support via Unicode for different human languages. Although the design of XML focuses on documents, it is widely used for the representation of arbitrary data structures [6], for example in web services.

Considering this structural property, classification methods for semi-structured documents have to exploit the information embedded in both the element tags and their associated contents. Recently, the Structured Link Vector Model (SLVM) [2] was proposed for XML document analysis; in particular, its extended version uses closed frequent sub-trees as the structural units for content extraction from XML documents. For XML document representation, existing models depend on statistical term measures for feature extraction. In the information retrieval field, however, statistical term measures cause XML document analysis to operate essentially at the level of term strings and to neglect the lexical semantic contents of the structured elements.

The semantic approach is an effective technology for document analysis. It captures the semantic features of the words under analysis and, on that basis, characterizes and classifies the document. The close relationship between the syntax and the lexical semantics of words has attracted considerable interest in both linguistics and computational linguistics. For XML document classification, the design and implementation of the lexical-semantic SLVM take particular account of lexical semantics. Unlike existing models, our model uses WordNet [3] to develop a new term measure that characterizes lexical semantic contents and relations, and it provides a practical method for XML document representation that can handle the impact of synonyms and other relations. Theoretical analysis and experiments are carried out to verify the effectiveness of this model.


II. AN OVERVIEW OF ORIGINAL SLVM AND FREQUENT SUB-TREE SLVM

The Structured Link Vector Model (SLVM), proposed by Jianwu Yang [2], forms the basis of our work. It extends the conventional vector space model (VSM) for representing XML documents by incorporating document structure (represented as term-by-element matrices), referencing links (extracted based on IDREF attributes), and element similarity (represented as an element similarity matrix). In addition, an extended VSM built on the original SLVM uses the closed frequent sub-trees as the structural units for content extraction from XML documents [1]; we call this variant the frequent sub-tree SLVM in this context.

A. XML Document Representations

SLVM represents an XML document $doc_x$ using a document feature matrix $\Delta_x \in R^{n \times m}$, given as [2]

$$\Delta_x = [\Delta_{x(1)}, \Delta_{x(2)}, \ldots, \Delta_{x(m)}], \qquad (1)$$

where $m$ is the number of distinct XML document elements and $\Delta_{x(i)} \in R^n$ is the TF-IDF feature vector representing the $i$-th XML element $e_i$, given as $\Delta_{x(i,j)} = TF(w_j, doc_x.e_i) \cdot IDF(w_j)$ for all $j = 1$ to $n$, where $TF(w_j, doc_x.e_i)$ is the frequency of the term $w_j$ in the element $e_i$ of $doc_x$.

In the frequent sub-tree SLVM, each document is represented as a matrix based on SLVM in which the selected closed frequent sub-trees are regarded as the structural units [1]. To handle the exception of a document that contains none of the selected closed frequent sub-trees, the model adds the VSM vector of the document to the matrix as an extra column.

B. Similarity Measures

The SLVM-based document similarity between two XML documents $doc_x$ and $doc_y$ is defined as [2]

$$Sim(doc_x, doc_y) = \sum_{i=1}^{m} \sum_{j=1}^{m} M_e(i,j)\,\big(\Delta_{x(i)}^T \Delta_{y(j)}\big), \qquad (2)$$

where $\Delta_{x(i)}$ and $\Delta_{y(j)}$ are columns of the normalized document feature matrices of $doc_x$ and $doc_y$, and $M_e$ is an $m \times m$ matrix named the element similarity matrix. $M_e$ captures both the similarity between a pair of document structural elements and the contribution of that pair to the overall document similarity. To obtain an optimal $M_e$ for a specific type of XML data, the SLVM-based document similarity learns the matrix from pair-wise similar training data (unsupervised learning) in an iterative manner [4].

C. Analysis of Feature Extraction

The models above can all be regarded as representations built on statistical term measures. As a sort of ontology method [5], XML document representations based on statistical term measures do not recognize lexical semantic contents. This causes the document representation to lose the mutual information [6] between term meanings that comes from synonyms in different samples. Our comment on statistical term measures and XML document representation can be clarified by analyzing the small XML corpus of Example 1.

Example 1

Doc A: <title>tree</title> <body>Human treasures trees.</body>
Doc B: <title>maple</title> <body>Men prize maple.</body>

$$\Delta_A = \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 0 & 0 \\ 0 & 1 \\ 0 & 0 \\ 1 & 1 \end{bmatrix}, \qquad \Delta_B = \begin{bmatrix} 0 & 1 \\ 0 & 0 \\ 0 & 1 \\ 0 & 0 \\ 1 & 1 \\ 0 & 0 \end{bmatrix}$$

Figure 1. The SLVM feature matrices for Example 1 (rows: men, human, prize, treasures, maple, trees; columns: title, body).

In Example 1, the two simple XML documents are viewed as two document samples, and these two documents comprise a small corpus. Evidently, the meanings of Doc A and Doc B are nearly equivalent, so the correlation and semantic similarity between these two documents are considerable. However, SLVM and the frequent sub-tree SLVM cannot exhibit the similarity between Doc A and Doc B: the document representation matrices shown in Fig. 1 make every $\Delta_{A(i)}^T \Delta_{B(j)} = 0$, so Eq. (2) yields $Sim(doc_A, doc_B) = 0$. Thus, on behalf of statistical term measures, the document representations of Example 1 do not perform well for semantic similarity.
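To make the failure concrete, the following Python sketch builds the Eq. (1) term-by-element matrices for Example 1 and evaluates the Eq. (2) similarity. The hard-coded term list, the stemmed title term, the omission of IDF weighting and normalization, and the identity element similarity matrix are illustrative assumptions for this two-document toy corpus, not part of the model itself.

```python
import numpy as np

terms = ["men", "human", "prize", "treasures", "maple", "trees"]
elements = ["title", "body"]

def slvm_matrix(doc):
    """Eq. (1): term-by-element matrix with entry (j, i) = TF(w_j, doc.e_i);
    IDF weighting is omitted for this two-document toy corpus."""
    m = np.zeros((len(terms), len(elements)))
    for i, e in enumerate(elements):
        for token in doc[e].lower().split():
            if token in terms:
                m[terms.index(token), i] += 1
    return m

def slvm_similarity(dx, dy, me=None):
    """Eq. (2): Sim = sum_{i,j} M_e(i, j) * (dx(i)^T dy(j)),
    with M_e defaulting to the identity for illustration."""
    me = np.eye(dx.shape[1]) if me is None else me
    return sum(me[i, j] * (dx[:, i] @ dy[:, j])
               for i in range(dx.shape[1]) for j in range(dy.shape[1]))

# The title term "tree" is taken in its inflected form "trees"
# so that the matrices match Fig. 1.
doc_a = {"title": "trees", "body": "human treasures trees"}
doc_b = {"title": "maple", "body": "men prize maple"}
print(slvm_similarity(slvm_matrix(doc_a), slvm_matrix(doc_b)))  # 0.0
```

Since no term string is shared between the two documents, every inner product vanishes and the similarity is exactly zero, despite the obvious semantic overlap.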


III. PROPOSED PROGRAM

A. Preliminary Conception and Theoretical Analysis

For document analysis, document representations that depend on statistical term measures lose the mutual information between term meanings. Moreover, across different documents, term meanings are related through specific synonyms, which belong to the lexical semantic contents. Our model therefore resorts to WordNet [3], a lexical database for English, to extract lexical semantics. The document representation method then constructs a lexical-semantic SLVM of the XML document in order to define a feature matrix for classification.

In WordNet, a form is represented by a string of ASCII characters, and a sense is represented by the set of (one or more) synonyms that have that sense [3]. Synonymy (syn "same", onyma "name") is a symmetric relation between word forms [3], and it is WordNet's basic relation, because WordNet uses sets of synonyms (synsets) to represent word senses. That is, as shown in Fig. 2, one word refers to several sets of synonyms (synsets).

Figure 2. A word and the several synonym sets (synsets) it refers to.

WordNet contains 117,659 sets of synonyms (synsets) [3]. Because one word or term refers to particular synonym sets (Fig. 2), our preliminary conception is that several particular synsets can strictly describe the meaning of one word and can thereby characterize its lexical semantic contents. Furthermore, those particular synsets can indicate lexical semantic relations between words, through the Antonymy, Hyponymy, Meronymy, Troponymy and Entailment relations between synsets [3]. Our method defines these particular synonym sets as the semantic-factors of the word.

Based on this definition, the involved semantic-factors can characterize the lexical semantic contents of Example 1, which accomplishes feature extraction of lexical semantic contents. For instance, in Fig. 3, the words human and man belong to different document samples of Example 1, and the common semantic-factor $synset_p$ (homo), which simultaneously describes the meanings of human and man, recovers the mutual information [6] between the term meanings. Moreover, our document representation is able to capture the lexical semantic mutual information between samples that resides in the synonyms appearing in different documents.

Figure 3. The common semantic-factor of words: $synset_p$ (homo) simultaneously describes human and man.
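The semantic-factor idea can be checked directly against WordNet. The sketch below, assuming NLTK with its WordNet corpus installed, looks up the synsets shared by the words human and man from Example 1; the shared synset plays the role of the common semantic-factor of Fig. 3.

```python
from nltk.corpus import wordnet as wn

# Intersect the synsets that each word form refers to.
common = set(wn.synsets("human")) & set(wn.synsets("man"))
for s in common:
    print(s.name(), "->", [lemma.name() for lemma in s.lemmas()])
# e.g. homo.n.02 -> ['homo', 'man', 'human_being', 'human'],
# which corresponds to the common semantic-factor synset_p (homo).
```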


According to the statistical theory of communications, our conception needs further analysis for theoretical proof. The analysis first introduces some basic formulae of information theory [6, 7], which are used in our theoretical development of sample mutual information. Let $x_i$ and $y_j$ be two distinct terms (events) from finite samples (event spaces) $X$ and $Y$, and let $X$ and $Y$ also denote random variables over the distinct lexical semantic contents of the samples, which occur with certain probabilities. Under these definitions, the mutual information between $X$ and $Y$ represents the reduction of uncertainty about either one when the other is known. The mutual information between samples, $I(X;Y)$, is defined as [7]

$$I(X;Y) = \sum_{x_i \in X} \sum_{y_j \in Y} P(x_i, y_j) \log \frac{P(x_i, y_j)}{P(x_i)\,P(y_j)}. \qquad (3)$$

In the statistical methods of SLVM, the probability $P(x_i)$ or $P(y_j)$ is estimated by counting the number of observations (the frequency) of $x_i$ or $y_j$ in sample $X$ or $Y$ and normalizing by $N$, the size of the corpus. The joint probability $P(x_i, y_j)$ is estimated by counting the number of times (the related frequency) that term $x_i$ equals (is related to) $y_j$ in the respective samples, again normalizing by $N$. Taking Example 1, according to the SLVM feature matrices shown in Fig. 1, between any term in Doc A and any term in Doc B there is no count of times that $x_i$ equals $y_j$. As a result, in Example 1 the statistical term measures give $P(x_i, y_j) = 0$, and hence the sample mutual information $I(X;Y) = 0$. The analysis thus verifies that the statistical methods of feature extraction lose the mutual information between term meanings.

On the other hand, for feature extraction of lexical semantic contents, our method uses several particular semantic-factors to describe the meaning of one word or term. In different samples, words can be related to other words that are described by the same synsets. The lexical semantic mutual information between samples, $I(X;Y)$, is then re-defined as

$$I(X;Y) = \sum_{x_i \in X} \sum_{y_j \in Y} F(s_{x_i, y_j}) \bmod N \; \log \frac{F(s_{x_i, y_j}) \bmod N}{F(s_{x_i}) \bmod N \times F(s_{y_j}) \bmod N}. \qquad (4)$$

To denote the probability $P(x_i)$ or $P(y_j)$, the function $F(s_{x_i})$ or $F(s_{y_j})$ is estimated by calculating the frequency of the semantic-factors that describe the meaning of $x_i$ or $y_j$ in sample $X$ or $Y$, and modulo $N$, the total number of semantic-factors in the corpus. Meanwhile, to denote the joint probability $P(x_i, y_j)$, the function $F(s_{x_i, y_j})$ is estimated by calculating the frequency of the common semantic-factors that simultaneously describe the meanings of $x_i$ and $y_j$, and modulo $N$.

In Example 1, the joint probability is therefore estimated by calculating the frequency of the common semantic-factors that relate to the lexical semantic contents or relations of $x_i$ and $y_j$, and modulo $N$. For instance, as shown in Fig. 3, the words human and man are described by the common semantic-factor $synset_p$ (homo); accordingly, by the document representation matrices shown in Fig. 4, $P(human, man) = F(synset_p) \bmod N > 0$. In addition, the joint probability of the semantic relation $P(tree, maple) = F(Hyponymy(tree, maple)) \bmod N > 0$. As a result, $I(X;Y) > 0$, so the mutual information from lexical semantic contents and relations between Doc A and Doc B is positive. Thus, the analysis proves that semantic-factors and feature extraction of lexical semantic contents can provide the probability-weighted amount of information (PWI) [7] between XML documents at the lexical semantic level.

B. Lexical-semantic SLVM

In our work, XML documents are represented using the lexical-semantic SLVM. In this model, each XML document is represented as a document feature matrix in the lexical-semantic structured link vector space. The lexical-semantic SLVM is organized by the following procedures. First, (1) a data structure of semantic-factor information is composed for feature extraction of lexical semantic contents. Secondly, (2) maximum entropy (ME) modeling is used to disambiguate word stems. Furthermore, (3) this work constructs the feature space of the lexical-semantic SLVM and builds the synset matrix in that space to characterize the lexical semantic contents of the XML document. Lastly, (4) to characterize lexical semantic relations, it marks each vector of the synset matrix with the weights of the 5 semantic relations between synsets [3]. Thus, the feature matrix of the lexical-semantic SLVM is constructed by marking semantic relations on the synset matrix.

Figure 4. The lexical-semantic matrices for the corpus of Example 1 (rows: $synset_1, \ldots, synset_n$; columns: title, body). In both $\Delta_A$ and $\Delta_B$, the row of the common semantic-factor $synset_p$ (homo) is (0 1) and the row of $synset_q$ is (1 1); the remaining synset rows are elided.

(1) The data structure of semantic-factor information comprises the relevant information of each synset in a document, formalized as a data element as shown in Table I. It records all essential information of the semantic-factors in an XML document, such as the synset ID, weight, sample ID and the relevant information of words. Note that, in a record of the data structure, each original word in inflected form [8] that refers to the semantic-factor, together with its word stem(s) in base form [8, 9], is recorded in the linked list of Semantic Member (Fig. 5(a)). According to the WordNet framework [3], when an original word refers to more than one word stem, the Semantic Member list expands the node of the original word to register all of its word stems. Meanwhile, the linked list of Semantic Members Frequency (Fig. 5(b)) records the frequency of each original word, one by one, in the order of the Semantic Member list.

Figure 5. The linked lists of Semantic Member (a) and Semantic Members Frequency (b): each node of (a) carries an original word and its word stem(s), and each node of (b) carries the frequency of the corresponding original word.
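As a rough illustration of procedure (1), the record of Table I together with the two linked lists of Fig. 5 could be sketched as follows; all class and field names are our own illustrative choices, not identifiers from the described implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SemanticFactor:
    """One record of the semantic-factor information (cf. Table I)."""
    synset_id: str     # Synset ID: identification of the synonym set
    sample_id: str     # Sample ID: the semi-structured document sample
    element_id: str    # Element ID: structural element or unit
    # Semantic Member list: original words (with their word stems)
    # that refer to this semantic-factor, cf. Fig. 5(a).
    members: List[str] = field(default_factory=list)
    # Semantic Members Frequency list, aligned with members, cf. Fig. 5(b).
    frequencies: List[int] = field(default_factory=list)

    @property
    def weight(self) -> int:
        """Weight = frequency of the factor in the element, i.e. the
        sum of the Semantic Members Frequency list."""
        return sum(self.frequencies)

# e.g. the factor synset_p (homo) in the body element of Doc A:
factor = SemanticFactor("homo.n.02", "doc_A", "body",
                        members=["human"], frequencies=[1])
print(factor.weight)  # 1
```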


TABLE I. DATA STRUCTURE OF SEMANTIC-FACTOR INFORMATION

Synset ID: Identification of the synonym set.
Set of Synonym: Synonymy is WordNet's basic relation; WordNet uses sets of synonyms (synsets) to represent word senses [3].
Weight (Frequency): Frequency of the semantic-factor in an element (the sum of the Semantic Members Frequency list).
Sample ID: Identification of the semi-structured document sample.
Element ID: Identification of the structural element or unit in the semi-structured document.
Semantic Member: A linked list (Fig. 5(a)) that carries all original words of terms referring to the semantic-factor, together with their word stem(s).
Semantic Members Frequency: A linked list (Fig. 5(b)) that carries the frequency of each original word of a term referring to the semantic-factor, one by one.

(2) On the basis of the data structure of semantic-factor information, the Semantic Member lists need disambiguation of the word stems of original words. When an original word refers to more than one word stem in base form, the semantic-factors must ensure that the original word refers to only one word stem. To select a single word stem for an original word (Fig. 6), we employ the Maximum Entropy Model [10]. ME modeling provides a framework for integrating information for classification from many heterogeneous information sources [11]. In our model, we adopt the assumption that the diversity [12] of the Semantic Member list implies the significance of the semantic-factor and the rationality of the existing Semantic Members in a document.

Assume a set of original words $X$ and a set of their word stems $C$. The function $cl(x): X \to C$ chooses the word stem $c$ with the highest conditional probability, which ensures that the original word $x$ refers to only one stem: $cl(x) = \arg\max_c p(c \mid x)$. Each feature [11] of an original word is calculated by a function associated with a specific word stem $c$, taking the form

$$f_i(x, c) = \begin{cases} -\sum_{j=1}^{S_i} P_j \log_2 P_j & \text{if } x \text{ refers to } c \text{ and } c \text{ refers to semantic-factor } i, \\ 0 & \text{otherwise}, \end{cases} \qquad (5)$$

where $S_i$ is the number of Semantic Members in semantic-factor $i$ and $P_j$ is the proportion of the Frequency of original word $j$ to the Weight of semantic-factor $i$; the sum $-\sum_{j=1}^{S_i} P_j \log_2 P_j$ expresses the Semantic Member diversity of semantic-factor $i$ in a document, in the form of the Shannon-Wiener index [12].

The conditional probability $p(c \mid x)$ is defined as

$$p(c \mid x) = \frac{1}{Z(x)} \prod_{i=1}^{K} \alpha_i^{f_i(x, c)}, \qquad (6)$$

where the parameter $\alpha_i$ of semantic-factor $i$ [11] is the Frequency of original word $x$ in semantic-factor $i$, $K$ is the number of semantic-factors that word stem $c$ refers to, and $Z(x)$ is a value ensuring that the conditional probabilities for this context sum to 1.

Figure 6. The 1:1 reference of an original word of a term to a word stem; the chosen word stem describes one or more semantic-factors.

The above equations aim at finding the highest conditional probability $p(c \mid x)$, with the function $cl(x)$ ensuring that original word $x$ refers to only one word stem (Fig. 6). After the semantic-factors have preliminarily characterized the lexical semantic contents of the XML document, the specified ME modeling is applied to disambiguate the word stems. The relevant items in the data structure of semantic-factor information are then modified as necessary, such as the Semantic Member, the Frequency of original word, and the Weight; furthermore, some of the related semantic-factors are eliminated.
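A hedged sketch of this disambiguation step follows, building on the SemanticFactor records above: diversity implements the Shannon-Wiener feature of Eq. (5), and score implements the product of Eq. (6) up to the normalizer $Z(x)$, which cancels in the arg max of $cl(x)$. The factors_of mapping from a word stem to the semantic-factors it refers to is an assumed input.

```python
import math

def diversity(factor):
    """Eq. (5) body: -sum_j P_j * log2(P_j) over the Semantic Members,
    where P_j is member j's share of the factor's Weight."""
    w = factor.weight
    return -sum((f / w) * math.log2(f / w)
                for f in factor.frequencies if f > 0)

def score(x, c, factors_of):
    """Eq. (6) up to Z(x): product over the K factors that stem c refers
    to of alpha_i ** f_i(x, c).  alpha_i is the frequency of original
    word x in factor i; when x is not a member, f_i(x, c) = 0 and the
    corresponding term is 1, so the factor is simply skipped."""
    s = 1.0
    for factor in factors_of[c]:
        if x in factor.members:
            alpha = factor.frequencies[factor.members.index(x)]
            s *= alpha ** diversity(factor)
    return s

def cl(x, stems, factors_of):
    """cl(x) = argmax_c p(c|x); Z(x) is constant in c, so it cancels."""
    return max(stems, key=lambda c: score(x, c, factors_of))
```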
(3) For an XML document dataset, all the referred semantic-factors are fixed by the disambiguation of word stems. Then, in the feature space of the lexical-semantic SLVM, an XML document $doc_x$ is preliminarily represented using a synset matrix $\Delta_x \in R^{n \times m}$, defined as

$$\Delta_x = [\Delta_{x(1)}, \Delta_{x(2)}, \ldots, \Delta_{x(m)}], \qquad (7)$$

$$\Delta_{x(i)} = (\Delta_{x(i,1)}, \Delta_{x(i,2)}, \ldots, \Delta_{x(i,n)}), \qquad (8)$$

where $m$ is the number of distinct XML document elements and units, and $\Delta_{x(i)} \in R^n$ is the synset vector representing the $i$-th XML element or unit, given as $\Delta_{x(i,j)} = FS(s_j, doc_x.e_i) \cdot IDF(s_j)$ for all $j = 1$ to $n$, where $FS(s_j, doc_x.e_i)$ is the frequency of the semantic-factor $s_j$ in $e_i$, the $i$-th XML element or unit of $doc_x$.
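As the sketch below suggests, the preliminary synset matrix of Eqs. (7)-(8) can be assembled exactly like a TF-IDF matrix, with semantic-factor frequencies in place of term frequencies. The document encoding (each element as a list of disambiguated factor IDs), the idf helper, and the sample synset IDs are illustrative assumptions.

```python
import math
import numpy as np

def idf(s, corpus):
    """IDF of semantic-factor s over a corpus, where each document is
    a dict mapping element name -> list of disambiguated factor IDs."""
    df = sum(any(s in factors for factors in doc.values()) for doc in corpus)
    return math.log(len(corpus) / df) if df else 0.0

def synset_matrix(doc, factor_ids, corpus):
    """Eq. (8): Delta_x(i, j) = FS(s_j, doc.e_i) * IDF(s_j), with one
    column per element or unit and one row per semantic-factor."""
    elements = list(doc)
    m = np.zeros((len(factor_ids), len(elements)))
    for i, e in enumerate(elements):
        for j, s in enumerate(factor_ids):
            m[j, i] = doc[e].count(s) * idf(s, corpus)
    return m

# e.g. the two documents of Example 1 after disambiguation
# (synset IDs are illustrative):
doc_a = {"title": ["tree.n.01"], "body": ["homo.n.02", "prize.v.01", "tree.n.01"]}
doc_b = {"title": ["maple.n.02"], "body": ["homo.n.02", "prize.v.01", "maple.n.02"]}
corpus = [doc_a, doc_b]
print(synset_matrix(doc_a, ["homo.n.02", "tree.n.01", "maple.n.02"], corpus))
```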


th factor sj in ei , in which ei is the i XML element or unit In the XML document classification, the dataset of docx. consisting of 20 categories is composed of XML (4) In feature space of lexical-semantic SLVM, to documents of the Wikipedia XML, the training set is characterize lexical semantic relations, our method marks composed of 5000 XML documents, and the test set is Antonymy, Hyponymy, Meronymy, Troponomy and composed of 3000 XML documents. Entailment on each dimension of the synset matrix. The In the experiments, to tackle unbalanced text dataset, processing is formulized as we select an optimized KNN classification, the NWKNN n (Neighbor-Weighted K-Nearest Neighbor) algorithm ΔΔ =R(,) p q Δ defined to be Eq. (14) [13]. As for NWKNN, each XML x(,)(,) i qå x i p , (9) document d is considered to be a feature matrix based on p= 1 original SLVM, frequent sub-tree SLVM, or lexical- ïì 0.5 Antonymy semantic SLVM. ï score(,) doc c = ï Hyponymy, Meronymy, i R(,) p q = í 0.2 ï Troponomy or Entailment , (10) 骣 ï ç ÷ ï Weighti ç å Sim ( doc , docjj )δ ( doc , ci )÷ îï 0 Unrelated ç ÷. (14) 桫çdj Î KNN() d where p and q=1 to n, n is dimensional number of the th ïì 1 dcÎ synset vector of the i XML element, and Δx(,) i p is ï ji subjected to δ(,) dji c í value of the pth synset vector element of the ith XML ï 0 dcÏ îï ji element. ΔΔx(,) i q is semantic relation increment to the In the process of Eq. (8), this algorithm uses the qth dimensional value of the synset vector, and SLVM-based document similarity between feature function R(,) p q denotes semantic relation coefficient matrices of doc and docj [2] to calculate the Sim(,) doc doc . Besides, according to experience for . Specifically, when the synset of qth j th of NWKNN algorithm [13], the parameter of , dimension is related to synset of p dimension via Weighti semantic relation such as Antonymy, Hyponymy, Exponent [13], is equal to 3.5. Meronymy, Troponomy or Entailment, the B. The Result assignment is shown in equation (7). The assignments of To evaluate the classification systems, we use the F1 reflect the semantic relations which are measure [14]. Then, we can observe the effect of different organized into synsets by WordNet. kinds of data on a classification system [14]. For ease of As for XML documents, all synset dimensions carry comparison, we summarize the F1 scores over the the corresponding semantic feature values. Then, lexical- different categories using the macro-averages and micro- semantic SLVM represents a XML docx using a feature averages of F1 score, and display mean average precision. nm´ Table II summarizes the experiment result. matrix Δx Î R , defined as 轾 …… TABLE II Δx= 臌Δ x(1), Δ x (2),,Δ x ( m ) , (11) THE EXPERIMENT RESULTS where m is the number of distinct XML elements and Macro- Micro- Mean Average Approaches n F1 F1 precision units. Δ Î R is the lexical-semantic vector xi() Original SLVM th 0.241853 0.295418 0.486702 representing the i XML element, given as [1] Frequent sub-tree Δ = …… [1] 0.479748 0.518635 0.701861 x() i(Δ x (,1) i, Δ x (,2) i,,Δ x (,) i n ) , (12) SLVM Lexical-semantic Δ =+Δ ΔΔ . (13) 0.492090 0.521977 0.727203 x(,)(,) i j x i j x(,) i j SLVM where n is the number of identical Synset ID of all semantic-factors in XML dataset.

B. The Result

To evaluate the classification systems, we use the F1 measure [14], which lets us observe the effect of different kinds of data on a classification system [14]. For ease of comparison, we summarize the F1 scores over the different categories using the macro-averages and micro-averages of the F1 score, and we also report the mean average precision. Table II summarizes the experimental results.

TABLE II. THE EXPERIMENT RESULTS

Approach: Macro-F1 / Micro-F1 / Mean Average Precision
Original SLVM: 0.241853 / 0.295418 / 0.486702
Frequent sub-tree SLVM [1]: 0.479748 / 0.518635 / 0.701861
Lexical-semantic SLVM: 0.492090 / 0.521977 / 0.727203

V. CONCLUSION

In this work, a data structure of semantic-factor information is constructed to record the relevant information of each semantic-factor in the elements of an XML document. It can characterize lexical semantic contents and be adapted for the disambiguation of word stems. Furthermore, in the lexical-semantic feature space, lexical semantic relations are marked on the synset matrix. Using the NWKNN algorithm, the lexical-semantic SLVM achieves better classification performance than the document feature matrices built by the original SLVM and the frequent sub-tree SLVM, which stand for the typical statistical methods of feature extraction. Our future research includes applying more current algorithms to the lexical-semantic matrix for XML document analysis, and developing a method for analyzing WSDL documents on the basis of semantic-factors.


ACKNOWLEDGMENT

This work was supported by the National High Technology Research and Development Program (863 Program) of China (No. 2012AA011205), the National Natural Science Foundation of China (Nos. 61272150, 61472450, 61379109, 61103034), the Research Fund for the Doctoral Program of Higher Education of China (No. 20120162110077), and the Excellent Youth Foundation of Scientific Committee, China (No. 11JJ1012).

REFERENCES

[1] Jianwu Yang, Songlin Wang. "Extended VSM for XML Document Classification Using Frequent Subtrees", In Proceedings of the 8th International Conference on Initiative for the Evaluation of XML Retrieval (INEX'09), 2010.
[2] Jianwu Yang, Xiaoou Chen. "Semi-Structured Document Model for Text Mining", Journal of Computer Science and Technology, vol. 17, no. 5, 2002, pp. 603-610.
[3] George A. Miller. "WordNet: a lexical database for English", Communications of the ACM, vol. 38, no. 11, 1995, pp. 39-41.
[4] Jianwu Yang, William Cheung, Xiaoou Chen. "Learning element similarity matrix for semi-structured document analysis", Knowledge and Information Systems, vol. 19, no. 1, 2009, pp. 53-78.
[5] David Sánchez, Montserrat Batet. "A semantic similarity method based on information content exploiting multiple ontologies", Expert Systems with Applications, vol. 40, no. 4, 2013, pp. 1393-1399.
[6] Akiko Aizawa. "An information-theoretic perspective of tf-idf measures", Information Processing and Management, vol. 39, no. 1, 2003, pp. 45-65.
[7] Wen Zhang, Taketoshi Yoshida, Xijin Tang. "A comparative study of TF*IDF, LSI and multi-words for text classification", Expert Systems with Applications, vol. 38, no. 3, 2011, pp. 2758-2765.
[8] Mihai Lintean, Vasile Rus. "Measuring Semantic Similarity in Short Texts through Greedy Pairing and Word Semantics", In Proceedings of the 25th International Florida Artificial Intelligence Research Society Conference, 2012, pp. 244-249.
[9] MIT Java Wordnet Interface, Version 2.3.3. http://projects.csail.mit.edu/jwi/api/edu/mit/jwi/morph/WordnetStemmer.html
[10] Armando Suárez, Manuel Palomar. "A maximum entropy-based word sense disambiguation system", In Proceedings of the 19th International Conference on Computational Linguistics (COLING '02), 2002, pp. 1-7.
[11] Myunggwon Hwang, Chang Choi, Pankoo Kim. "Automatic enrichment of semantic relation network and its application to word sense disambiguation", IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 6, 2011, pp. 845-858.
[12] C. J. Keylock. "Simpson diversity and the Shannon-Wiener index as special cases of a generalized entropy", Oikos, vol. 109, no. 1, 2005, pp. 203-207.
[13] Songbo Tan. "Neighbor-weighted k-nearest neighbor for unbalanced text corpus", Expert Systems with Applications, vol. 28, no. 4, 2005, pp. 667-671.
[14] C. J. van Rijsbergen. Information Retrieval. London: Butterworths, 1979.

Jun Long was born in Anhui Province, China, in 1972. He is a professor and Ph.D. supervisor in the School of Information Science and Engineering, Central South University. His research interests include distributed computation, software architecture, and trusted computing.

Luda Wang was born in Jinan, Shandong Province, China, in 1981. He received the M.S. degree in Computer Application Technology in 2009, and he is currently a Ph.D. candidate in Computer Science and Technology at Central South University. He has been a faculty member in Computer Science at Xiangnan University, Chenzhou, China, since 2005, where he is currently a lecturer. His research interests include AI, data analysis and information retrieval.

Zude Li is an Assistant Professor at Central South University (China). He obtained his Ph.D. degree in 2010 from The University of Western Ontario (Canada). His research interests are in the fields of software architecture, evolution and quality. He collaborates with IBM Canada and is appointed as a Software Engineering expert at SKANE SOFT.
