Experiments with DBpedia, WordNet and SentiWordNet as resources for sentiment analysis in micro-blogging

Hussam Hamdan *,**,*** (hussam.hamdan@lsis.org)
Frederic Béchet ** (frederic.bechet@lif.univ-mrs.fr)
Patrice Bellot *,*** (patrice.bellot@lsis.org)

* LSIS, Aix-Marseille Université CNRS, Av. Esc. Normandie Niemen, 13397 Marseille Cedex 20, France
** LIF, Aix-Marseille Université CNRS, Avenue de Luminy, 13288 Marseille Cedex 9, France
*** OpenEdition, Aix-Marseille Université CNRS, 3 pl. V. Hugo, case n°86, 13331 Marseille Cedex 3, France

Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 455–459, Atlanta, Georgia, June 14-15, 2013. © 2013 Association for Computational Linguistics

Abstract

Sentiment analysis in Twitter has become an important task because of the huge amount of user-generated content published on such media. Such analysis could be useful in many domains such as marketing, finance, politics and social studies. We propose to use many features in order to improve a trained classifier of Twitter messages. These features extend the feature vector of the unigram model with the concepts extracted from DBpedia, the verb groups and the similar adjectives extracted from WordNet, the senti-features extracted using SentiWordNet, and some useful domain-specific features. We also built a dictionary of emotion icons, abbreviations and slang words found in tweets, which is applied before the tweets are extended with the different features. Adding these features improved the F-measure by 2% with SVM and by 4% with Naive Bayes.

1 Introduction

In recent years, the explosion of social media has changed the relation between users and the web. The world has become closer and more "real-time" than ever. People have increasingly become part of a virtual society in which they create their own content, share it, and interact with others in different ways and at an ever-increasing rate. Twitter is one of the most important social media, with 1 billion tweets [1] posted per week and 637 million users [2].

The availability of such content attracts the attention of those who want to understand the opinions and interests of individuals, which is useful in various domains such as politics, finance, marketing and social studies. In this context, sentiment analysis of Twitter has been shown to improve the prediction of box-office revenues of movies in advance of their release (Asur and Huberman, 2010). Sentiment analysis has also been used to study the impact of the Twitter accounts of 13 celebrities on their followers (Bae and Lee, 2012) and to forecast the interesting tweets that are most likely to be reposted many times by followers (Naveed, Gottron et al., 2011).

However, sentiment analysis of microblogs faces several challenges: the limited size of posts (e.g., a maximum of 140 characters in Twitter), the informal language of such content, which contains slang words and non-standard expressions (e.g., gr8 instead of great, LOL instead of laughing out loud, goooood, etc.), and the high level of noise in the posts, which are checked neither by the user nor by spell-checking tools.

Three different approaches can be identified in the sentiment analysis literature. The first is the lexicon-based approach, which uses specific types of lexicons to derive the polarity of a text. This approach suffers from the limited size of the lexicons and requires human expertise to build them (Joshi, Balamurali et al., 2011). The second is the machine learning approach, which uses texts annotated with a given label to learn a statistical model; an early work was done on a movie review dataset (Pang, Lee et al., 2002). The lexicon and machine learning approaches can be combined to achieve better performance (Khuc, Shivade et al., 2012). The third is the social approach, which exploits social network properties and data to enhance the accuracy of the classification (Speriosu, Sudan et al., 2011; Tan, Lee et al., 2011; Hu, Tang et al., 2013).

In this paper, we employ machine learning. Each text is represented by a vector whose features have to be selected carefully. They can be the words of the text, their part-of-speech (POS) tags, or any other syntactic or semantic features.

We propose to exploit some additional features (section 3) for sentiment analysis that extend the representation of tweets by:
• the concepts extracted from DBpedia [3],
• the related adjectives and verb groups extracted from WordNet [4],
• some "social" features such as the number of happy and bad emotion icons,
• the number of exclamation and question marks,
• the existence of a URL (binary feature),
• whether the tweet is re-tweeted (binary feature),
• the number of symbols the tweet contains,
• the number of uppercase words,
• some other senti-features extracted from SentiWordNet [5], such as the number of positive, negative and neutral words, which allow estimating a score of the negativity, positivity and objectivity of the tweets, their polarity and subjectivity.

We extended the unigram model with these features (section 4.2). We also constructed a dictionary of the abbreviations and slang words used in Twitter in order to overcome the ambiguity of the tweets.

We tested various combinations of these features (section 4.2) and chose the one that gave the highest F-measure for the negative and positive classes (submission for subtask B of the Sentiment Analysis in Twitter task of SemEval 2013 (Wilson, Kozareva et al., 2013)). We tested different machine learning models, namely Naive Bayes, SVM and IcsiBoost [6], but the submitted runs exploited SVM only.

The rest of this paper is organized as follows. Section 2 outlines existing work on sentiment analysis over Twitter. Section 3 presents the features we used for training a classifier. Our experiments are described in section 4 and future work is presented in section 5.

[1] http://blog.kissmetrics.com/twitter-statistics/
[2] http://twopcharts.com/twitter500million.php
[3] http://dbpedia.org/About
[4] http://wordnet.princeton.edu/
[5] http://sentiwordnet.isti.cnr.it/
[6] http://code.google.com/p/icsiboost/

2 Related Work

We can identify three main approaches to sentiment analysis in Twitter. The lexicon-based approaches depend on dictionaries of positive and negative words and calculate the polarity of a text according to the positive and negative words it contains. Many dictionaries have been created manually, such as ANEW (Affective Norms for English Words), or automatically, such as SentiWordNet (Baccianella, Esuli et al., 2010). Four lexicon dictionaries were used together to overcome the lack of words in each one (Joshi, Balamurali et al., 2011; Mukherjee, Malu et al., 2012). Automatic construction of a Twitter lexicon was implemented by Khuc, Shivade et al. (2012).

Machine learning approaches were applied to annotated tweets using Naive Bayes, Maximum Entropy (MaxEnt) and Support Vector Machines (SVM) (Go, Bhayani et al., 2009). Go et al. (2009) reported that SVM outperforms the other classifiers. They tried a unigram and a bigram model in conjunction with part-of-speech (POS) features; they noted that the unigram model outperforms all other models when using SVM and that POS features degrade the results. N-grams with lexicon features and microblogging features were useful, but POS features were not (Kouloumpis, Wilson et al., 2011). In contrast, Pak & Paroubek (2010) reported that POS and bigrams both help. Barbosa & Feng (2010) proposed using syntax features of tweets, such as retweets, hashtags, links, punctuation and exclamation marks, in conjunction with features such as the prior polarity and the POS of words. Agarwal et al. (2011) extended their approach by using real-valued prior polarity and by combining prior polarity with POS. They built models for classifying tweets into positive, negative and neutral sentiment classes and proposed three models: a unigram model, a feature-based model and a tree-kernel-based model that introduced a new tree representation for tweets. Both combining unigrams with their features and combining the features with the tree kernel outperformed the unigram baseline. Saif et al. (2012) proposed using semantic features, for which they extracted the hidden concepts in the tweets. They demonstrated that incorporating semantic features extracted using AlchemyAPI [7] improves the accuracy of sentiment classification on three different tweet corpora.

The third main approach takes into account the influence of users on their followers and the relation between the users and the tweets they wrote. Using the Twitter follower graph might improve polarity classification. Speriosu, Sudan et al. (2011) demonstrated that using label propagation with the Twitter follower graph improves polarity classification. Tan, Lee et al. (2011) employed social relations for user-level sentiment analysis. Hu, Tang et al. (2013) proposed a sociological approach to handling noisy and short text (SANT) for supervised sentiment classification; they reported that social theories such as Sentiment Consistency and Emotional Contagion could be [...]

[...] for the previous tweet, the DBpedia concepts for Chapel Hill are (Settlement, PopulatedPlace, Place). Therefore, if we suppose that people post positively about settlements, it would be more probable that they post positively about Chapel Hill.

- WordNet features

We used WordNet to extract the synonyms of nouns, verbs and adjectives, the verb groups (the hierarchies in which the verb synsets are arranged), the similar adjectives (synset) and the concepts of nouns that are related by the is-a relation in WordNet.

We chose the first synonym set for each noun, adjective and verb, then the concepts of the first noun synonym set, the similar adjectives of the first adjective synonym set and the verb group of the first verb synonym set. We think that these features would improve the accuracy because they could overcome the ambiguity and the diversity of the vocabulary.
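
The first-synonym-set expansion described for the WordNet features can be illustrated with a minimal sketch. It does not query WordNet itself: the ToySynset class, the TOY_LEXICON entries, the POS codes ("n", "v", "a", and "x" for anything else) and the expand_tokens function are all hypothetical stand-ins introduced here for illustration, and the paper does not specify how the extracted words are merged into the unigram vector. The sketch simply appends them as extra tokens before counting.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ToySynset:
    """Toy stand-in for a WordNet synset (illustrative entries, not real WordNet data)."""
    synonyms: list
    # is-a concepts for nouns, similar adjectives for adjectives, verb group for verbs
    related: list = field(default_factory=list)


# Hypothetical mini-lexicon keyed by (lemma, POS); senses are ordered, first sense first.
TOY_LEXICON = {
    ("movie", "n"): [ToySynset(["film", "picture"], ["show"])],
    ("great", "a"): [ToySynset(["great"], ["large", "big"]),
                     ToySynset(["outstanding"], ["important"])],
    ("love", "v"): [ToySynset(["love"], ["enjoy"])],
}


def expand_tokens(tagged_tokens, lexicon=TOY_LEXICON):
    """Append the synonyms and related words of the FIRST sense of each known token."""
    expanded = []
    for word, pos in tagged_tokens:
        expanded.append(word)
        senses = lexicon.get((word.lower(), pos))
        if not senses:
            continue  # word not covered by the lexicon: keep it unchanged
        first = senses[0]  # only the first synonym set is used, later senses are ignored
        for extra in first.synonyms + first.related:
            if extra != word.lower() and extra not in expanded:
                expanded.append(extra)
    return expanded


# A POS-tagged tweet (tags assumed to come from an external tagger).
tweet = [("I", "x"), ("love", "v"), ("this", "x"), ("great", "a"), ("movie", "n")]

# Extended unigram representation: counts over original and added tokens.
features = Counter(token.lower() for token in expand_tokens(tweet))
```

Restricting the expansion to the first sense keeps the added vocabulary small, at the cost of picking the wrong sense for words whose intended meaning is not their first synonym set.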
