CBER: An Effective Classification Approach Based on Enrichment Representation for Short Text Documents

J. Intell. Syst. 2017; 26(2): 233–241

Eman Ismail and Walaa Gad*

DOI 10.1515/jisys-2015-0066
Received June 27, 2015; previously published online February 29, 2016.

Abstract: In this paper, we propose a novel approach called Classification Based on Enrichment Representation (CBER) of short text documents. The proposed approach extracts the concepts occurring in short text documents and uses them to calculate the weights of the synonyms of each concept. Concepts with the same meaning increase the weights of their synonyms. However, because a short text document contains few words and concepts are rarely repeated, we also capture the semantic relationships among concepts to solve the disambiguation problem. The experimental results show that the proposed CBER is valuable in annotating short text documents with their best labels (classes). We used precision and recall measures to evaluate the proposed approach. CBER performance reached 93% and 94% in precision and recall, respectively.

Keywords: Semantic classification, lexical ontology, word sense disambiguation (WSD).

2010 Mathematics Subject Classification: 62H30, 68Q55.

1 Introduction

Short text document (STD) annotation plays an important role in organizing large amounts of information into a small number of meaningful classes. Annotation of STDs is a challenge in many applications, such as short message services, online chat, social network comments, tweets, and snippets. STDs do not provide sufficient word occurrences, and words are rarely repeated.

The traditional methods of classifying such documents are based on a Bag of Words (BOW) [15], which indexes text documents as independent features. Each feature is a single term or word in a document. A document is represented as a vector in the feature space. The document vector holds the term frequencies in the document.
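As a minimal illustration of the BOW representation described above, the following Python sketch builds term-frequency vectors for three short snippets and measures how sparse they are. The snippets, the whitespace tokenization, and the helper names `bow_vectors` and `sparsity` are assumptions made for illustration only; they are not taken from the paper.

```python
from collections import Counter


def bow_vectors(docs):
    """Return (vocabulary, term-frequency vectors) for whitespace-tokenized docs."""
    tokenized = [doc.lower().split() for doc in docs]
    # One feature per distinct term, in alphabetical order.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # A term's weight is simply its number of occurrences in the document.
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def sparsity(vectors):
    """Fraction of zero entries across all document vectors."""
    total = sum(len(v) for v in vectors)
    zeros = sum(1 for v in vectors for x in v if x == 0)
    return zeros / total if total else 0.0


# Hypothetical snippets (illustrative data, not from the paper's corpus).
snippets = ["cheap car rental", "automobile hire deals", "car insurance quotes"]
vocab, vectors = bow_vectors(snippets)
print(vocab)
print(vectors)
print(sparsity(vectors))  # 15 of the 24 entries are zero -> 0.625
```

Even in this tiny example, most entries are zero, and "car" and "automobile" occupy separate features although they mean the same thing.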
Word weights are the number of word occurrences in the document. Classification based on BOW has many drawbacks:
– STDs do not provide enough co-occurrence of words or shared context.
– The BOW representation of such documents is very sparse because most weights are empty.
– Data sparsity leads to low classification accuracy because of the lack of information.
– The BOW approach treats synonymous words as different features.
– BOW does not represent the relations between words and documents. Therefore, it fails to solve the disambiguation problem among words (terms).

To address these drawbacks, semantic knowledge is introduced as a background [6] to increase the accuracy of classification. Wikipedia [7, 9, 13] and WordNet [3] are the two main types of semantic knowledge involved in document classification. Semantic knowledge approaches represent text documents as a bag of concepts (BOC). They treat terms as concepts with semantic weights that depend on the relationships among them.

*Corresponding author: Walaa Gad, Faculty of Computers and Information Sciences, Ain Shams University, Abbassia, Cairo 11566, Egypt, e-mail: [email protected]
Eman Ismail: Faculty of Computers and Information Sciences, Ain Shams University, Abbassia, Cairo, Egypt

234 E. Ismail and W. Gad: An Effective CBER for Short Text Documents

Wikipedia is a large repository on the Internet that contained more than 4 million articles at the time this paper was written. Each page (Wikipage) in Wikipedia describes a single topic. The page title describes concepts in the Wikipedia semantic network, which is built hierarchically. In Ref. [3], the authors used the Wikipedia structure to represent documents as BOC for the classification process. Using WordNet [12, 18], BOW is enriched with new features representing the topics of the text. The performance of classification in both methods is significantly better than that of BOW. However, data enrichment can also introduce noise.
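To illustrate why a concept-level representation can help with short texts, the following minimal Python sketch maps terms to concepts before counting them and compares the overlap of two snippets under both representations. The table `CONCEPT_OF` is a hypothetical, hand-written stand-in for a lexical resource such as WordNet, and the snippets and function names are likewise assumptions; this is not the method of any of the cited works.

```python
from collections import Counter

# Hypothetical synonym table: each term maps to one representative concept.
# A real system would derive this from a lexical resource instead.
CONCEPT_OF = {
    "car": "car",
    "automobile": "car",
    "auto": "car",
    "rental": "hire",
    "hire": "hire",
}


def bag_of_words(doc):
    """Count raw terms (whitespace tokenization, lowercased)."""
    return Counter(doc.lower().split())


def bag_of_concepts(doc, concept_of=CONCEPT_OF):
    """Count concepts: synonyms collapse onto the same feature."""
    counts = Counter()
    for tok in doc.lower().split():
        # Terms missing from the table are kept as their own concept.
        counts[concept_of.get(tok, tok)] += 1
    return counts


def overlap(a, b):
    """Shared features, taking the minimum frequency of each shared feature."""
    return sum((a & b).values())


doc_a = "cheap car rental"
doc_b = "automobile hire deals"

bow_overlap = overlap(bag_of_words(doc_a), bag_of_words(doc_b))
boc_overlap = overlap(bag_of_concepts(doc_a), bag_of_concepts(doc_b))
print(bow_overlap, boc_overlap)
```

Under BOW, the two snippets share no features, whereas under the concept mapping they share the concepts "car" and "hire". A wrong entry in the table would merge unrelated terms, which is one way enrichment can introduce noise.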
We propose a novel approach, Classification Based on Enrichment Representation (CBER), for classifying documents using WordNet as a semantic background. CBER exploits the hierarchical structure and relations of the WordNet ontology to give terms (concepts) a new weight. The new weights depend on an accurate assessment of the semantic similarities among terms. Moreover, the proposed approach enriches the STDs with semantic weights to solve disambiguation problems such as polysemy and synonymy.

We propose two approaches. The first is the Semantic Analysis Based WordNet Model (SAWN), and the second is a hybrid of SAWN and the traditional document representation, SAWNWVTF. The word vector term frequency (WVTF) is a BOW representation of text documents. SAWN chooses the most suitable synonym for each document term by examining the surrounding terms in the same document. This is done without increasing the number of document features, unlike in Ref. [18]. SAWNWVTF is a hybrid approach that discovers the hidden information in STDs by identifying the important words.

We applied CBER to short text "web snippets." These text documents are noisy, and their terms are rare. Snippets do not share enough words to overlap well. They contain few words and do not provide enough co-occurrence of terms. The CBER performance is compared to that of other approaches [3, 11, 18] that use WordNet as semantic knowledge to represent documents. The obtained results are very promising. The CBER performance reaches 93% and 94% in precision and recall, respectively.

The remainder of the paper is organized as follows. Previous work is presented in Section 2. The proposed approach, CBER, is described in Section 3. In Section 4, we present the experimental results and the evaluation process. In Section 5, we summarize our results and draw overall conclusions about our approach.

2 Literature Overview

In recent years, the classification of STDs has been a focus of research.
Two main types were proposed: enrichment- and reduction-based classification.

STDs do not have enough co-occurring terms or shared context for classification. Enrichment methods [1, 3, 11, 12, 14, 18, 22] enrich the short text with more semantic information to increase the number of document terms. In Ref. [11], the enrichment methods are based on the BOW representation of text documents. They generate new words derived from an external knowledge base, such as Wikipedia. Wikipedia is crawled to extract different topics that are associated with document keywords (terms). The newly extracted topics are added to the documents as new semantic features to enrich STDs with new information.

In addition, document enrichment may be done by topic analysis using latent Dirichlet allocation (LDA) [3, 5, 12, 18], which uses probabilistic models to perform latent semantic analysis that accounts for synonymy and polysemy. LDA uncovers the hidden topics of an STD and enriches the traditional BOW text document representation with these topics.

In Ref. [18], the Wikipedia knowledge base is used to apply semantic analysis to extract all topics covered by the document. TAGME, a topical annotator, is used to identify different spots from Wikipedia to annotate the text document. Moreover, they use latent topics derived from latent semantic analysis (LSA) or LDA. As in Ref. [3], they annotate all training data with subtopics. They detect the topics occurring in the input texts by using a recent set of information retrieval (IR) tools called topic annotators [11]. These tools are efficient and accurate in identifying meaningful sequences of terms in a text and linking them to pertinent Wikipedia pages representing their underlying topics. Then, a ranking function is used to select the highest-ranked topics to represent documents.

In Ref. [21], the authors map the document terms to topics with different weights using LDA. Each document is represented by topic features rather than term features. In Ref.
[1], the authors used LDA to extract document topics; then, a semantic relationship is built between the extracted topics of a document and its words.

Moreover, a reduction approach is proposed to solve the problems of short text in classification [9, 14]. This approach reduces the document features and replaces them with new terms. The new features are selected using WordNet for better classification accuracy. Soucy and Mineau [15] take a similar approach by extracting some terms as features. Terms whose weights are greater than a specific threshold are selected based on the weighting function in Ref. [15]. In Ref. [16], the authors reduce document features by selecting a set of words to represent a document and its topics. They use the BOW representation and term frequency (tf) or term frequency-inverse document frequency (tf-idf) [17] to extract a few words to be used as search query words. The words are extracted according to a clarity function that scores the words sharing specific topics.

The previous methods have many drawbacks:
– In enrichment methods [4, 7, 10, 13], new features or words are added to the text, which increases the dimensions of the document representation and the classification time.
– In reduction methods [9, 16], documents are represented only by their topics using Wikipedia or WordNet. These methods focus on words that are related to the text topics and neglect the others.

3 CBER

We propose the CBER model. Figure 1 shows the main modules of CBER. The proposed approach enriches the short text with auxiliary information provided by WordNet. WordNet is a lexical database for the English language [16]. It groups English words into sets of synonyms
