SemEval-2016 Task 7: Determining Sentiment Intensity of English and Arabic Phrases
Svetlana Kiritchenko and Saif M. Mohammad (National Research Council Canada)
Mohammad Salameh (University of Alberta)

Abstract

We present a shared task on automatically determining the sentiment intensity of a word or a phrase. The words and phrases are taken from three domains: general English, English Twitter, and Arabic Twitter. The phrases include those composed of negators, modals, and degree adverbs, as well as phrases formed by words with opposing polarities. For each of the three domains, we assembled datasets that include multi-word phrases and their constituent words, both manually annotated for real-valued sentiment intensity scores. The three datasets were presented as the test sets for three separate tasks (each focusing on a specific domain). Five teams submitted nine system outputs for the three tasks. All datasets created for this shared task are freely available to the research community.

1 Introduction

Words have prior associations with sentiment. For example, honest and competent are associated with positive sentiment, whereas dishonest and dull are associated with negative sentiment. Further, the degree of positivity (or negativity), also referred to as intensity, can vary. For example, most people will agree that succeed is more positive (or less negative) than improve, and fail is more negative (or less positive) than setback. We present a shared task where automatic systems are asked to predict a prior sentiment intensity score for a word or a phrase. The words and phrases are taken from three domains: general English, English Twitter, and Arabic Twitter. For each domain, a separate task with its own development and test sets was set up. The phrases include those composed of negators (e.g., nothing wrong), modals (e.g., might be fun), and degree adverbs (e.g., fairly important), as well as phrases formed by words with opposing polarities (e.g., lazy sundays).

Lists of words and their associated sentiment are commonly referred to as sentiment lexicons, and they are widely used in sentiment analysis. For example, a number of unsupervised classifiers rely primarily on sentiment lexicons to determine whether a piece of text is positive or negative. Supervised classifiers also often use features drawn from sentiment lexicons (Mohammad et al., 2013; Pontiki et al., 2014). Sentiment lexicons are also beneficial in stance detection (Mohammad et al., 2016a; Mohammad et al., 2016b), literary analysis (Hartner, 2013; Kleres, 2011; Mohammad, 2012), and the detection of personality traits (Minamikawa and Yokoyama, 2011; Mohammad and Kiritchenko, 2015).
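To make the lexicon-classifier connection concrete, here is a minimal sketch of such an unsupervised classifier. The tiny lexicon and its scores are invented for illustration; real lexicons such as those cited above cover thousands of terms.

    # Minimal sketch of a lexicon-based (unsupervised) polarity classifier.
    # TOY_LEXICON is a made-up stand-in for a real sentiment lexicon; scores
    # follow the 0-1 scale used in this task (0.5 = neutral).
    TOY_LEXICON = {
        "honest": 0.85, "competent": 0.80, "succeed": 0.90,
        "dishonest": 0.10, "dull": 0.25, "fail": 0.08,
    }

    def classify(text: str, lexicon: dict = TOY_LEXICON) -> str:
        """Label a text by averaging lexicon scores of its in-vocabulary words."""
        scores = [lexicon[w] for w in text.lower().split() if w in lexicon]
        if not scores:
            return "unknown"  # the lexicon covers none of the words
        return "positive" if sum(scores) / len(scores) > 0.5 else "negative"

    print(classify("an honest and competent review"))  # -> positive
    print(classify("a dull and dishonest attempt"))    # -> negative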
Existing manually created sentiment lexicons tend to provide only lists of positive and negative words (Hu and Liu, 2004; Wilson et al., 2005; Mohammad and Turney, 2013). Such coarse-grained distinctions may be less useful in downstream applications than fine-grained (real-valued) sentiment association scores. Moreover, most existing sentiment resources are available only for English; non-English resources are scarce and often based on automatic translation of the English lexicons (Abdul-Mageed and Diab, 2014; Eskander and Rambow, 2015). Finally, manually created sentiment lexicons usually include only single words. Yet, the sentiment of a phrase can differ markedly from the sentiment of its constituent words. Sentiment composition is the determining of the sentiment of a multi-word linguistic unit, such as a phrase or a sentence, from its constituents. Lexicons that include sentiment associations for phrases and their constituents are useful in studying sentiment composition. We refer to them as sentiment composition lexicons.

Automatically created lexicons often have real-valued sentiment association scores, have high coverage, can include longer phrases, and can easily be collected for a specific domain. However, due to the lack of manually annotated real-valued sentiment lexicons, the quality of automatic lexicons has often been assessed only extrinsically, through their use in sentence-level sentiment prediction. In this shared task, we intrinsically evaluate automatic methods that estimate sentiment association scores for terms in English and Arabic. For this, we assembled three datasets of phrases and their constituent single words, manually annotated for sentiment with real-valued scores (Kiritchenko and Mohammad, 2016a; Kiritchenko and Mohammad, 2016c).

We first introduced this task as part of the SemEval-2015 Task 10 'Sentiment Analysis in Twitter' Subtask E (Rosenthal et al., 2015). The 2015 test set was restricted to English single words and simple two-word negated expressions commonly found in tweets. This year (2016), we broadened the scope of the task to include three different domains. Furthermore, we shifted the focus from single words to longer, more complex phrases to explore sentiment composition.

Five teams submitted nine system outputs for the three tasks. All submitted outputs correlated strongly with the gold term rankings (Kendall's rank correlation above 0.35). The best results on all tasks were achieved by supervised methods exploiting a variety of sentiment resources. The highest rank correlation was obtained by team ECNU on the General English test set (τ = 0.7). On the other two domains, the results were lower (τ of 0.4-0.5).

All datasets created as part of this shared task are freely available through the task website.[1] For ease of exploration, we also created online interactive visualizations for the two English datasets.[2]

[1] http://alt.qcri.org/semeval2016/task7/
[2] http://www.saifmohammad.com/WebPages/SCL.html

2 Task Description

The task is formulated as follows: given a list of terms (single words and multi-word phrases), an automatic system needs to provide a score between 0 and 1 that is indicative of the term's strength of association with positive sentiment. A score of 1 indicates maximum association with positive sentiment (or least association with negative sentiment), and a score of 0 indicates least association with positive sentiment (or maximum association with negative sentiment). If a term is more positive than another, it should receive a higher score (see the scoring sketch after the list below).

There are three tasks, one for each of the three domains:

• General English Sentiment Modifiers Set: This dataset comprises English single words and multi-word phrases from the general domain. The phrases are formed by combining a word and a modifier, where a modifier is a negator, an auxiliary verb, a degree adverb, or a combination of those, for example, would be very easy, did not harm, and would have been nice. The single-word terms are chosen from the set of words that are part of the multi-word phrases, for example, easy, harm, and nice.

• English Twitter Mixed Polarity Set: This dataset focuses on English phrases made up of opposing-polarity terms, for example, lazy sundays, best winter break, happy accident, and couldn't stop smiling. The dataset also includes single-word terms (as separate entries), chosen from the set of words that are part of the multi-word phrases. The multi-word phrases and single-word terms are drawn from a corpus of tweets, and include a small number of hashtag words (e.g., #wantit) and creatively spelled words (e.g., plssss). However, the majority of the terms are those that one would use in everyday English.

• Arabic Twitter Set: This dataset includes single words and phrases commonly found in Arabic tweets. The phrases in this set are formed only by combining a negator and a word.
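Since submissions are evaluated by Kendall's rank correlation between predicted and gold scores (see the results summarized above), scoring a submission could look like the following sketch. The tab-separated "term<TAB>score" file layout is an assumption made here for illustration, not the official submission format.

    # Sketch of scoring a submission against gold annotations with Kendall's
    # tau, the rank correlation used to evaluate this task. The file layout
    # (one "term<TAB>score" pair per line) is assumed for illustration.
    from scipy.stats import kendalltau

    def read_scores(path: str) -> dict:
        """Read 'term<TAB>score' lines into a term-to-score dict."""
        scores = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                term, score = line.rstrip("\n").split("\t")
                scores[term] = float(score)
        return scores

    def evaluate(gold_path: str, submission_path: str) -> float:
        gold = read_scores(gold_path)
        pred = read_scores(submission_path)
        terms = sorted(gold)  # a submission must score every gold term
        tau, _p_value = kendalltau([gold[t] for t in terms],
                                   [pred[t] for t in terms])
        return tau

    # Example: print(evaluate("gold.txt", "submission.txt"))

Note that only the induced ranking matters to this metric: rescaling all predicted scores monotonically leaves tau unchanged.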
Teams could participate in any one, two, or all three tasks; however, only one submission was allowed per task. For each task, the above description and a development set (200 terms) were provided to the participants in advance; there were no training sets. The three test sets, one for each task, were released at the start of the evaluation period. The test sets and the development sets have no terms in common. The participants were allowed to use the development sets in any way (for example, for tuning or training), and they were allowed to use any additional manually or automatically generated resources.

In 2015, the task was set up similarly (Rosenthal et al., 2015). Single words and multi-word phrases from English Twitter comprised the development and test sets (1,515 terms in total); the phrases were simple two-word negated expressions (e.g., cant waitttt). Participants were allowed to use these datasets for the development of their systems.

3 Datasets of English and Arabic Terms Annotated for Sentiment Intensity

                                              Development set        Test set
Task                                  Total  words phrases  all   words phrases    all
General English Sentiment Modifiers  2,999    101      99  200   1,330   1,469  2,799
English Twitter Mixed Polarity       1,269     60     140  200     358     711  1,069
Arabic Twitter                       1,366    167      33  200   1,001     165  1,166

Table 1: The number of single-word and multi-word terms in the development and test sets.

Term                                        Sentiment score
General English Sentiment Modifiers Set
  favor                                               0.826
  would be very easy                                  0.715
  did not harm                                        0.597
  increasingly difficult                              0.208
  severe                                              0.083
English Twitter Mixed Polarity Set
  best winter break                                   0.922
  breaking free                                       0.586
  isn't long enough                                   0.406
  breaking                                            0.250
  heart breaking moment                               0.102
Arabic Twitter Set
  مجد (glory)                                         0.931
  #السعادة_الزوجية (marital happiness)                0.900
  #يقين (certainty)                                   0.738
  لا يمكن (not possible)                              0.300
  إرهاب (terrorism)                                   0.056

Table 2: Examples of entries with real-valued sentiment scores from the three datasets.
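The English Twitter entries in Table 2 illustrate why sentiment composition is non-trivial: breaking scores 0.250 on its own, yet breaking free scores 0.586 and heart breaking moment only 0.102. A naive baseline that simply averages constituent scores cannot reproduce this contrast, as the following sketch shows; it is our illustration, not a system from the task, and the constituent scores for free, heart, and moment are invented here (only breaking comes from Table 2).

    # Naive composition baseline: score a phrase as the mean of its
    # constituents' scores. Only "breaking" (0.250) is taken from Table 2;
    # the other constituent scores are hypothetical.
    CONSTITUENT_SCORES = {
        "breaking": 0.250,
        "free": 0.70,    # hypothetical
        "heart": 0.65,   # hypothetical
        "moment": 0.55,  # hypothetical
    }

    def compose(phrase: str) -> float:
        """Average the constituent word scores of a phrase."""
        words = phrase.lower().split()
        return sum(CONSTITUENT_SCORES[w] for w in words) / len(words)

    # The baseline assigns both phrases near-neutral, similar scores,
    # missing the large annotated gap between them:
    print(compose("breaking free"))          # 0.475  (gold: 0.586)
    print(compose("heart breaking moment"))  # ~0.483 (gold: 0.102)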