J. Intell. Syst. 2017; 26(2): 233–241

Eman Ismail and Walaa Gad*

CBER: An Effective Classification Approach Based on Enrichment Representation for Short Text Documents

*Corresponding author: Walaa Gad, Faculty of Computers and Information Sciences, Ain Shams University, Abbassia, Cairo 11566, Egypt, e-mail: [email protected]
Eman Ismail: Faculty of Computers and Information Sciences, Ain Shams University, Abbassia, Cairo, Egypt

DOI 10.1515/jisys-2015-0066 Received June 27, 2015; previously published online February 29, 2016.

Abstract: In this paper, we propose a novel approach called Classification Based on Enrichment Representation (CBER) of short text documents. The proposed approach extracts concepts occurring in short text documents and uses them to calculate the weight of the synonyms of each concept. Concepts with the same meanings will increase the weights of their synonyms. However, the text document is short and concepts are rarely repeated; therefore, we capture the semantic relationships among concepts and solve the disambiguation problem. The experimental results show that the proposed CBER is valuable in annotating short text documents to their best labels (classes). We used precision and recall measures to evaluate the proposed approach. CBER performance reached 93% and 94% in precision and recall, respectively.

Keywords: Semantic classification, lexical ontology, word sense disambiguation (WSD).

2010 Mathematics Subject Classification: 62H30, 68Q55.

1 Introduction

Short text document (STD) annotation plays an important role in organizing large amounts of information into a small number of meaningful classes. Annotating STDs is a challenge in many applications such as short message services, online chat, social network comments, tweets, and snippets, because STDs do not provide sufficient word occurrences and their words are rarely repeated.

The traditional methods for classifying such documents are based on a Bag of Words (BOW) [15], which indexes text documents as independent features. Each feature is a single term or word in a document, and a document is represented as a vector in feature space whose entries are term frequencies, i.e. the number of occurrences of each word in the document. Classification based on BOW has many drawbacks. STDs do not provide enough co-occurrence of words or shared context, so their BOW representation is very sparse because of empty weights, and this data sparsity leads to low classification accuracy. Moreover, the BOW approach treats synonymous words as different features and does not represent the relations between words and documents; therefore, it fails to solve the disambiguation problem among words (terms).

Therefore, semantic knowledge is introduced as background [6] to increase the classification accuracy. Wikipedia [7, 9, 13] and WordNet [3] are the two main types of semantic knowledge involved in document classification. Semantic knowledge approaches represent text documents as a bag of concepts (BOC): terms are treated as concepts with semantic weights that depend on the relationships among them. Wikipedia is a large repository on the Internet that contains more than 4 million articles at the time of writing. Each page (Wikipage) in Wikipedia describes a single topic, and the page title describes a concept in the hierarchically built Wikipedia structure.

In Ref. [3], the authors used the Wikipedia structure to represent documents as a BOC for the classification process. Using WordNet [12, 18], the BOW is enriched with new features representing the topics of the text. The classification performance of both methods is significantly better than that of BOW. However, data enrichment can also introduce noise.

We propose a novel approach, Classification Based on Enrichment Representation (CBER), for classifying documents using WordNet as a semantic background. CBER exploits the WordNet ontology hierarchical structure and relations to give terms (concepts) new weights. The new weights depend on an accurate assessment of the semantic similarities among terms. Moreover, the proposed approach enriches the STDs with semantic weights to solve disambiguation problems such as polysemy and synonymy. We propose two approaches: the first is the Semantic Analysis Based on WordNet model (SAWN), and the second is a hybrid of SAWN and the traditional document representation, SAWNWVTF. The word vector term frequency (WVTF) is a BOW representation of text documents. SAWN chooses the most suitable synonym for document terms by studying and understanding the surrounding terms in the same document. This is done without increasing the document features as in Ref. [18]. SAWNWVTF is a hybrid approach that discovers the hidden information in STDs by identifying the important words.

We applied CBER to short text "web snippets." These text documents are noisy, and terms are always rare. Snippets do not share enough words to overlap well; they contain few words and do not provide enough co-occurrence of terms. The CBER performance is compared to other approaches [3, 11, 18] that use WordNet as semantic knowledge to represent documents. The obtained results are very promising: the CBER performance reaches 93% and 94% in precision and recall, respectively.

The remainder of the paper is organized as follows. Previous work is presented in Section 2. The proposed approach, CBER, is described in Section 3. In Section 4, we present the experimental results and the evaluation process. Section 5 concludes the paper.

2 Literature Overview

In recent years, the classification of STDs has been a research focus. Two main types of methods have been proposed: enrichment-based and reduction-based classification. STDs do not have enough co-occurring terms or shared context for classification. Enrichment methods [1, 3, 11, 12, 14, 18, 22] enrich the short text with more semantic information to increase the number of document terms. In Ref. [11], the enrichment method is based on the BOW representation of text documents. New words derived from an external knowledge base, such as Wikipedia, are generated: Wikipedia is crawled to extract different topics associated with the document keywords (terms), and the extracted topics are added to the documents as new semantic features to enrich STDs with new information. In addition, document enrichment may be done by topic analysis using Latent Dirichlet Allocation (LDA) [3, 5, 12, 18], which uses probabilistic models to handle synonymy and polysemy. LDA uncovers the hidden topics of an STD and enriches the traditional BOW representation with these topics. In Ref. [18], the Wikipedia knowledge base is used to apply semantic analysis and extract all topics covered by the document. TAGME, a topical annotator, is used to identify different spots from Wikipedia to annotate the text document; moreover, latent topics derived from LSA (Latent Semantic Analysis) or LDA are used. As in Ref. [3], all training data are annotated with subtopics. The topics occurring in the input texts are detected using a recent set of information retrieval (IR) tools, called topic annotators [11]. These tools are efficient and accurate in identifying meaningful sequences of terms in a text and linking them to pertinent Wikipedia pages representing their underlying topics. A ranking function is then used to rank the topics that best represent the documents. In Ref. [21], the authors map the document terms to topics with different weights using LDA; each document is represented by topic features rather than term features. In Ref. [1], the authors used LDA to extract document topics; then, a semantic relationship is built among the extracted topics of a document and its words.

Moreover, reduction approaches have been proposed to solve the problems of short text classification [9, 14]. These approaches reduce the document features and exchange them for new terms. The new features are selected using WordNet for better classification accuracy. Soucy and Mineau [15] follow a similar strategy by extracting some terms as features: terms whose weights are greater than a specific threshold are selected based on the weighting function in Ref. [15]. In Ref. [16], the authors reduce document features by selecting a set of words to represent a document and its topics. They use the BOW representation and term frequency (tf) or term frequency-inverse document frequency (tf-idf) [17] to extract a few words to be used as query words. The words are extracted according to a clarity function that scores words sharing specific topics. The previous methods have several drawbacks:
– In enrichment methods [4, 7, 10, 13], new features or words are added to the text, which increases the dimensionality of the document representation and the classification time.
– In reduction methods [9, 16], documents are represented only by their topics using Wikipedia or WordNet; these methods focus on words related to the text topics and neglect the others.

3 CBER

We propose the CBER model. Figure 1 shows the main modules of CBER. The proposed approach enriches the short text with auxiliary information provided by WordNet. WordNet is a lexical database for the English language [16]. It groups English words into sets of synonyms called synsets, and records relations among these synonym sets or their members. We use the word document to refer to STDs. The proposed CBER consists of
– Document preprocessing;
– Document representation using WVTF;
– Document enrichment using the SAWN approach;
– Hybrid approach SAWNWVTF using both the WVTF and SAWN approaches.

Figure 1: CBER Approach. (Short text documents are fed to the words vector based on term frequency approach, WVTF, and the semantic analysis based on WordNet approach, SAWN, producing baseline, semantic, and hybrid SAWN_WVTF classified texts.)

3.1 Document Preprocessing

Each text document is introduced to CBER as a line of terms or words. CBER indexes the document into terms, removes stop words, and applies the Porter stemmer algorithm [10]. We apply the stemmer only to words that are not defined in WordNet. Moreover, a pruning step is performed to eliminate rare words. Rare words are terms that appear in only a few documents and are unlikely to be helpful for annotation; such unwanted rare words increase the size of the BOW. Therefore, we set a pruning threshold to reduce the number of features and discard words whose number of occurrences in the data set is less than the predefined threshold.
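As a rough illustration of this step (not code from the paper), the sketch below uses NLTK's stop word list, Porter stemmer, and WordNet lookup; the function name preprocess and the default pruning threshold are illustrative assumptions.

```python
# Illustrative preprocessing sketch: tokenize, remove stop words, stem words
# that WordNet does not define, and prune rare terms. Requires the NLTK
# "stopwords" and "wordnet" corpora (nltk.download).
from collections import Counter
from nltk.corpus import stopwords, wordnet as wn
from nltk.stem import PorterStemmer

def preprocess(docs, prune_threshold=2):
    stops = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    tokenized = []
    for doc in docs:
        terms = []
        for word in doc.lower().split():          # simple whitespace indexing
            if word in stops:
                continue
            # Stem only words that are not defined in WordNet.
            terms.append(word if wn.synsets(word) else stemmer.stem(word))
        tokenized.append(terms)
    # Pruning: drop terms occurring in fewer documents than the threshold.
    df = Counter(t for terms in tokenized for t in set(terms))
    return [[t for t in terms if df[t] >= prune_threshold] for terms in tokenized]
```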

3.2 WVTF

STDs are represented as vectors. Each vector is a set of keywords (terms). Each word has a weight, the term frequency tf, which is the number of occurrences of this word in a document. We apply the term frequency-inverse document frequency (tf-idf) [15] weighting function to improve the classification performance. The term weight, t_w, is defined as

t_{w_{t,d}} = \log(tf_{t,d} + 1) \cdot \log\frac{n}{D_t}, \quad (1)

where tf_{t,d} is the number of occurrences of term t in document d, n is the total number of documents, and D_t is the number of documents that contain term t. After that, we normalize the weights using the L2 norm, also known as the Euclidean norm:

L_{norm}(t_i) = \frac{t_{w_i}}{\sqrt{t_{w_0}^2 + t_{w_1}^2 + \cdots + t_{w_m}^2}}, \quad (2)

where t_i is term i in document d, t_{w_i} is the weight of term i, and m is the number of terms in document d. The norm function helps in computing similarities between documents [6]. We use a cosine function [13] to calculate the semantic similarities among documents.
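The following sketch illustrates Eqs. (1) and (2) and the cosine similarity on toy token lists; the helper names wvtf_vectors and cosine are our own and are not defined in the paper.

```python
# Illustrative WVTF weighting (Eqs. 1 and 2) plus cosine similarity.
import math
from collections import Counter

def wvtf_vectors(docs):
    """docs: list of token lists; returns one {term: normalized weight} per doc."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # D_t: docs containing t
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Eq. (1): t_w = log(tf_{t,d} + 1) * log(n / D_t)
        w = {t: math.log(f + 1) * math.log(n / df[t]) for t, f in tf.items()}
        # Eq. (2): divide each weight by the L2 (Euclidean) norm of the vector
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        vectors.append({t: v / norm for t, v in w.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    return sum(u[t] * v.get(t, 0.0) for t in u)  # vectors are already L2-normalized

vecs = wvtf_vectors([["short", "text", "snippet"], ["text", "classification"]])
print(cosine(vecs[0], vecs[1]))
```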

3.3 SAWN

The proposed semantic analysis based on WordNet, SAWN, uses WordNet to capture the semantic meaning of documents. SAWN chooses the concepts that best represent a document semantically. WordNet is a database of English words that are linked together by their semantic relationships; it is a dictionary organized as a hierarchical graph structure. It contains 155,327 terms, 597 senses, and 207,016 term-sense pairs. It groups nouns, verbs, adjectives, and adverbs into sets of synonyms; words having the same concept are grouped into synsets, and each synset contains a brief definition, the gloss. WordNet supports different relations such as hypernymy, hyponymy, and the is-a relation. SAWN enriches STDs with semantic information to understand the document meaning and overcome disambiguation problems [8]. The proposed SAWN captures the most meaningful sense of a document's terms by studying and understanding the surrounding terms in the same document. Many measures [2, 20] are used for calculating the relatedness among terms. The relatedness is based on gloss overlaps [19]; that is, if the glosses (definitions) of two concepts share words, then they are related, and the more words the two glosses share, the more related the concepts are. We adopt the similarity measure of Wu and Palmer [2, 7] to calculate the relatedness between two senses and solve the disambiguation problem.

Each document d_j, ∀ 1 ≤ j ≤ n, where n is the number of documents in the data set D, is represented as a vector of terms t_i, ∀ 1 ≤ i ≤ m, where m is the number of terms in document d_j. Document d_j is defined as a vector of terms d_j = (t_1, t_2, …, t_m).

The proposed SAWN searches for the best meaning of a term t_i. Each term has many senses, and each sense is given a score. The sense with the highest score is chosen as the best meaning of term t_i. For example, the term "dog" has many senses:

 < Synset(’dog’), Synset(’frump’), Synset(’cad’), Synset(’frank’), Synset(’pawl’), Synset(’andiron’), Synset(’chase’) >

SAWN captures the best synset (sense) of the term "dog" from its context, i.e. from the relatedness between its senses and the senses of the other terms in the document. SAWN(t_i) is defined as

\mathrm{SAWN}(t_i) = \sum_{t_n \neq t_i} \max_m \bigl( \mathrm{SimDist}(S(t_i)_j, S(t_n)_m) \bigr), \quad (3)

where S(t_i)_j is sense j of term t_i, SimDist is the similarity distance between term senses, and max_m selects the highest similarity score over the senses of t_n. SAWN calculates semantic similarities by considering the depths of the two senses, along with the depth of their lowest common ancestor (LCA). A score in the range 0 < score ≤ 1 is given to represent the semantic distance. The score cannot be zero because the depth of the LCA is never zero, and it is 1 if the two input senses are the same.

\mathrm{SimDist}\bigl(S(t_1), S(t_2)\bigr) = \frac{2 \times \mathrm{Depth}\bigl(\mathrm{LCA}(S(t_1), S(t_2))\bigr)}{\mathrm{Depth}(S(t_1)) + \mathrm{Depth}(S(t_2))}. \quad (4)

Equation (4) returns a score indicating how similar two senses are, based on the depths of the two senses in the taxonomy and the depth of their LCA.
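Equation (4) is the Wu and Palmer measure, which NLTK's WordNet interface exposes as wup_similarity. The sketch below applies Eqs. (3) and (4) to select the best sense of a term from its context; the helper name sawn_score and the treatment of undefined cross-part-of-speech similarities as zero are illustrative assumptions rather than details fixed by the paper.

```python
# Illustrative SAWN sense scoring (Eqs. 3 and 4) using NLTK's WordNet interface.
from nltk.corpus import wordnet as wn

def sawn_score(term, other_terms):
    """Return the best sense of `term` in this context and its SAWN score."""
    best_sense, best_score = None, 0.0
    for sense in wn.synsets(term):
        # Eq. (3): sum, over every other term, the highest Wu-Palmer similarity
        # (Eq. 4) between this candidate sense and any sense of that term.
        score = 0.0
        for other in other_terms:
            sims = [sense.wup_similarity(s) or 0.0 for s in wn.synsets(other)]
            if sims:
                score += max(sims)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense, best_score

# Example: disambiguating "dog" in a snippet about pets.
print(sawn_score("dog", ["puppy", "leash", "bark"]))
```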

3.4 Hybrid SAWNWVTF Approach

SAWN enhances classification performance if all document terms are defined in WordNet. However, some document terms may not be found in WordNet; they may be abbreviations or misspelled terms. Such undefined words have no senses. Therefore, the term frequency should also be considered in the weighting scores. The hybrid SAWNWVTF approach is proposed to sum the term frequency weight from WVTF and the semantic weight from SAWN. SAWNWVTF calculates the new semantic weight for term t_i as follows:

\mathrm{SAWN}_{\mathrm{WVTF}}(t_i) = t_{w_i} + \mathrm{SAWN}(t_i), \quad (5)

where t_{w_i} is the term weight computed by the WVTF approach. Combining the two approaches solves the limitations of the previous work by adapting the Lesk dictionary algorithm [16]. Experiments show that CBER, with its two approaches, overcomes the limitations found in other works [9, 18, 21].
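A minimal sketch of Eq. (5), reusing the illustrative sawn_score helper from the previous sketch: terms without WordNet senses receive a SAWN score of zero and therefore keep only their WVTF weight.

```python
# Illustrative hybrid SAWN_WVTF weight (Eq. 5). sawn_score is the helper
# sketched in Section 3.3; it returns 0.0 for terms with no WordNet senses,
# so such terms keep only their WVTF (tf-idf) weight.
def hybrid_weight(term, wvtf_weight, other_terms):
    _, sawn = sawn_score(term, other_terms)
    return wvtf_weight + sawn  # Eq. (5)

# e.g. hybrid_weight("dog", 0.42, ["puppy", "leash", "bark"])
```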

4 Experimental Results

We evaluate the proposed CBER with its two approaches, SAWN and the hybrid SAWNWVTF, over a snippets data set. We use a snippets data set because it is an example of STDs and is used in Refs. [11, 18]. The snippets data set was created by Phan et al. [11]. It is composed of 12 K snippets drawn from Google and is labeled with eight classes: business, computers, culture-arts, education-science, engineering, health, politics, and sports. Figure 2 shows the eight classes of the snippets data set. A naive Bayes classifier [6] is used to assess and evaluate the proposed model. It is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions between the features.


Figure 2: Snippets Data Set Classes (number of documents per class).

The naive Bayes classifier is highly scalable and helps to get very good results. We perform cross-folding validation on the snippets to validate the classification process. The snippets data set is partitioned into two groups: one partition is the training data set and the second is the testing data set. Cross-folding validation measures the classification accuracy over the training data set, with the number of folds taking values from 2 to 10. For each fold, we measure the relative absolute error of the WVTF approach and of CBER. The comparison of the relative absolute error is shown in Figure 3. We stop cross-validation at 10 folds as the error decreases. In our experiments, we reduce the average error rate from 24.94% using the WVTF approach to 15.0% using the proposed CBER; the improvement reaches 10% in the error rate measure. Four measures are used to evaluate the classification performance of the proposed model: Precision, Recall, F-measure, and Accuracy. The performance measures are defined as

\mathrm{Precision} = \frac{T_p}{T_p + F_p}, \quad (6)

\mathrm{Recall} = \frac{T_p}{T_p + F_n}, \quad (7)

\mathrm{F\text{-}Measure} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \quad (8)

\mathrm{Accuracy} = \frac{T_p + T_n}{T_p + T_n + F_p + F_n}, \quad (9)

where T_p, the true positives, is the number of documents correctly assigned to their classes; F_p, the false positives, is the number of documents incorrectly assigned to a class; F_n, the false negatives, is the number of documents incorrectly rejected from their classes; and T_n, the true negatives, is the number of documents correctly rejected from a class.
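For completeness, a small sketch that evaluates Eqs. (6)-(9) on illustrative counts (the numbers below are made up for the example, not taken from the experiments):

```python
# Evaluation measures (Eqs. 6-9) from per-class counts; example counts are
# illustrative only.
def evaluation_measures(tp, fp, fn, tn):
    precision = tp / (tp + fp)                                   # Eq. (6)
    recall = tp / (tp + fn)                                      # Eq. (7)
    f_measure = 2 * precision * recall / (precision + recall)    # Eq. (8)
    accuracy = (tp + tn) / (tp + tn + fp + fn)                   # Eq. (9)
    return precision, recall, f_measure, accuracy

print(evaluation_measures(tp=90, fp=10, fn=5, tn=95))
```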


Figure 3: Cross-folding Validation on the Training Data Set. (Relative absolute error per fold, from 2 to 10 folds, for the WVTF and Hybrid approaches.)

In Figure 4, we run the classifier on different data set sizes from 1 K to 10 K and compare the proposed SAWN and SAWNWVTF with WVTF as a baseline. The STD data set shares few common words and is very sparse; therefore, WVTF fails to label the documents with their correct classes [11]. CBER achieves higher classification accuracy, increasing it from 65.75% to 93%. CBER is efficient on sparse and noisy data sets because it is not based only on the frequency weight of terms but also adds a semantic weight that captures the semantic relatedness of the context of terms.
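The classification and cross-validation protocol described above (a naive Bayes classifier validated with cross-folding) can be sketched as follows; the paper does not name a specific library, so the use of scikit-learn, the tf-idf baseline features, and the accuracy scoring are illustrative assumptions, and the snippet data loading is left out.

```python
# Rough sketch of the evaluation protocol: naive Bayes with k-fold
# cross-validation over baseline tf-idf features (library choice assumed).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

def cross_validate_snippets(snippets, labels, folds=10):
    """snippets: list of raw short texts; labels: their class names."""
    X = TfidfVectorizer().fit_transform(snippets)
    scores = cross_val_score(MultinomialNB(), X, labels, cv=folds)
    return scores.mean()
```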

Figures 5–7 show a detailed comparison of SAWN and the hybrid SAWNWVTF against WVTF. Figure 8 compares SAWN and the hybrid SAWNWVTF with the classifiers found in Refs. [11, 18]. The authors proposed a classifier, called the topical classifier, that extracts topics related to documents. The topical classifier [11] is based on enrichment representation: it enriches documents with document topics using the TAGME tools [18]. The topical annotator [18] is connected to Wikipedia and classifies the extracted topics. Its results are not good because of data sparseness, since it loses parts of the original text, and they are close to the results of Phan et al. [11], who built a framework for classifying STDs.

Figure 4: Accuracy of the CBER Approach in Comparison with the Baseline Approach as in Ref. [11]. (Accuracy (%) versus data set size for the baseline, Phan et al., and Hybrid approaches.)

Figure 5: Evaluation of the CBER Approach in Terms of Precision. (Precision versus data set size for WVTF, SAWN, and Hybrid.)

Figure 6: Evaluation of the CBER Approach in Terms of Recall. (Recall versus data set size for WVTF, SAWN, and Hybrid.)

Figure 7: Evaluation of the CBER Approach in Terms of F-Measure. (F-measure versus data set size for WVTF, SAWN, and Hybrid.)

Figure 8: Results of the CBER Approach. (Accuracy versus data set size for SVM, MaxEnt, Topical, Phan, and SAWN.)

The framework gets document topics using LDA. It chooses a large number of words with special characteristics to cover all documents. Then, it uses these topics to classify and annotate the documents.

As shown in Figure 8, we compare the accuracies of the proposed SAWN and SAWNWVTF at different data set sizes with those of Phan et al. [11], Topical, MaxEnt, and SVM. SAWN increases the accuracy from 81% to 89% compared with Phan et al. SAWNWVTF reaches an accuracy of 93%, compared with the Topical classifier, which achieves a maximum of 81%, while SVM and MaxEnt reach accuracies of 74.93% and 65.75%, respectively. The results confirm that the proposed CBER outperforms the other approaches in terms of accuracy, precision, recall, and F-measure.

The main contribution of this work is to assign a weighting score to document terms that are related in meaning. We adopt Wu and Palmer's method to measure the semantic relatedness among the senses of terms. If terms do not have senses because they are not defined in the dictionary, only the traditional WVTF weighting score is considered. Therefore, CBER gives good results only if text documents are written in formal English; if they are not written well, CBER gives the same results as traditional methods such as WVTF.

5 Conclusion

In this paper, we proposed a novel, scalable, and efficient approach for classifying STDs. The proposed CBER focuses on enriching STDs with hidden information and on capturing the semantic context of STD concepts. The main contribution of this paper is employing WordNet to solve disambiguation problems in short text classification. CBER consists of two approaches: SAWN and SAWNWVTF. The proposed CBER captures the semantic context of documents by analyzing them and giving a semantic score to document terms.

The traditional methods are based on term frequency within the document. We applied CBER to short text web snippets; the snippets data set is sparse and noisy and does not share enough terms to overlap well. Extensive experimental evaluation shows that the additional semantic information increases the accuracy of the classification results. This performance improvement is a promising achievement compared with other document classification methods in terms of Precision, Recall, F-measure, and Accuracy.

Bibliography

[1] A. Bouaziz, C. Dartigues and P. Lloret, Short text classification using semantic random forest, Springer International Publishing, Switzerland, 2014.
[2] A. Budanitsky and G. Hirst, Evaluating WordNet-based measures of lexical semantic relatedness, J. Comput. Linguist. 32 (2006), 13–47.
[3] P. Ferragina and U. Scaiella, TAGME: on-the-fly annotation of short text fragments (by Wikipedia entities), in: Proceedings of the 19th ACM International Conference on Information and Knowledge Management (CIKM '10), pp. 1625–1628, ACM, New York, 2010.
[4] Y. Genc, Y. Sakamoto and J. Nickerson, Discovering context: classifying tweets through a semantic transform based on Wikipedia, in: FAC 2011, edited by D. D. Schmorrow and C. M. Fidopiastis, LNCS, Springer, Heidelberg, 2011.
[5] J. Hoffart, M. Yosef, I. Bordino, M. Pinkal, M. Spaniol, B. Taneva, S. Thater and G. Weikum, Robust disambiguation of named entities in text, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 782–792, Edinburgh, Scotland, UK, 2011.
[6] V. Korde and C. Mahender, Text classification and classifiers: a survey, Int. J. Artif. Intell. Appl. (IJAIA) 3 (2012).
[7] C. Makris, Y. Plegas and E. Theodoridis, Improved text annotation with Wikipedia entities, Coimbra, Portugal, 2013.
[8] R. Navigli, Word sense disambiguation: a survey, ACM Comput. Surv. 41 (2009), 1–69.
[9] L. Patil and M. Atique, A semantic approach for effective using WordNet, arXiv preprint arXiv:1303.0489, 2013.
[10] T. Pedersen, S. Patwardhan and J. Michelizzi, WordNet::Similarity: measuring the relatedness of concepts, American Association for Artificial Intelligence, 2004. www.aaai.org.
[11] X. Phan, L. Nguyen and S. Horiguchi, Learning to classify short and sparse text and web with hidden topics from large-scale data collections, International World Wide Web Conference Committee (IW3C2), ACM, April 2008.
[12] U. Scaiella, P. Ferragina, A. Marino and M. Ciaramita, Topical clustering of search results, in: Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, pp. 223–232, ACM, 2012.
[13] J. Sedding and D. Kazakov, WordNet-based text document clustering, in: Proceedings of the 3rd Workshop on Robust Methods in Analysis of Natural Language Data, Association for Computational Linguistics, 2004.
[14] G. Song, Y. Ye, X. Du, X. Huang and S. Bie, Short text classification: a survey, J. Multimedia 9 (2014), 635–643.
[15] P. Soucy and G. Mineau, Beyond TFIDF weighting for text in the vector space model, in: Proceedings of the 19th International Joint Conference on Artificial Intelligence, pp. 1130–1135, 2004.
[16] A. Sun, Short text classification using very few words, in: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1143–1144, 2012.
[17] X. Sun, W. Haofen and Y. Yong, Towards effective short text deep classification, in: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1143–1144, ACM, 2011.
[18] D. Vitale, P. Ferragina and U. Scaiella, Classification of short texts by deploying topical annotations, in: Advances in Information Retrieval, vol. 7224, pp. 376–387, Springer, Berlin, Heidelberg, 2012.
[19] B. Wang, Y. Huang, W. Yang and X. Li, Short text classification based on strong feature thesaurus, Zhejiang University and Springer-Verlag, Berlin, 2012.
[20] M. Warin, Using WordNet and semantic similarity to disambiguate an ontology, vol. 25, University of Stockholm, Stockholm, Sweden, 2004. Retrieved January.
[21] L. Yang, C. Li and O. Ding, Combining lexical and semantic features for short text classification, Proc. Comput. Sci. 22 (2013), 78–86.
[22] W. Yih and C. Meek, Improving similarity measures for short segments of text, in: AAAI'07 Proceedings of the 22nd National Conference on Artificial Intelligence, vol. 2, pp. 1489–1494, AAAI Press, Palo Alto, CA, USA, 2007.