Investigating Redundancy in Emoji Use: Study on a Twitter Based Corpus

Giulia Donato
University of Copenhagen
[email protected]

Patrizia Paggio
University of Copenhagen
University of Malta
[email protected]
[email protected]

Abstract

In this paper we present an annotated corpus created with the aim of analyzing the informative behaviour of emoji – an issue of importance for sentiment analysis and natural language processing. The corpus consists of 2475 tweets, all containing at least one emoji, which have been annotated using one of three possible classes: Redundant, Non-Redundant, and Non-Redundant + POS. We explain how the corpus was collected, describe the annotation procedure and the interface developed for the task. We provide an analysis of the corpus, considering also possible predictive features, discuss the problematic aspects of the annotation, and suggest future improvements.

Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 118–126, Copenhagen, Denmark, September 7–11, 2017. © 2017 Association for Computational Linguistics.

1 Introduction

Nowadays emoji are widespread throughout mobile and web communication, both in private conversations and in public contexts such as blog entries or comments. In 2015, Oxford Dictionaries declared the emoji Face with tears of joy "Word of the Year", and since then the academic interest in the topic, as well as the development of relevant resources, has grown substantially. Emoji are best known as markers for emotions, and in this sense they can be considered an evolution of emoticons. However, these pictographs can be used to represent a much wider range of concepts than emoticons, including objects, ideas and actions in addition to emotions, and thus they interact with the content expressed in the surrounding text in more complex ways. Furthermore, emoji are used not only at the end of a message, e.g. a tweet, but can occur anywhere and possibly in sequences. Therefore, understanding the semantic relation they have with the surrounding text, in particular whether emoji add independent meaning, is an important step in any approach attempting to process their contribution to the overall content of a given message, both for the purposes of sentiment analysis and of natural language processing.

We are interested in investigating to what extent it is possible for a human annotator, and subsequently for an automatic classifier, to determine whether emoji in tweets are used to emphasize or to add information – information which may well be emotional, but could also have a different semantic flavour. If emoji do add meaning, we also ask how easy it is to understand whether they are being used as syntactic substitutes for words. In this paper, we focus on the corpus of English tweets that was collected and annotated to provide training data for a number of classifiers aiming at predicting whether emoji in microblogs are used in a redundant or a non-redundant way.

The classification experiments achieved promising results (F-score of 0.7) for the best performing model, which combined LSA with handcrafted features and employed a linear SVM in a one-vs-all fashion. The process and results of the experiments will be described in a future paper (in preparation).

In Section 2 we review related research; in Section 3 we describe how the tweets were extracted and collected to create the corpus, and give counts of the various represented categories. In Section 4 the annotation process is described, Section 5 presents and discusses the results, and finally in Section 6 we provide a conclusion.

2 Related research

Several studies trace parallels between emoticons and emoji, sometimes using both terms interchangeably, with the purpose of dealing with emotion expression or automatic emotion detection, and thus only considering those pictographs that resemble facial features. Boia et al. (2013) focus on emoticons and their use in tweets. The authors attempted to determine the reliability of emoticon labels in sentiment classification by means of a user study, and generated a sentiment lexicon from a corpus of 2.1 million tweets. They found that agreement between the sentiment expressed by emoticons and the sentiment expressed by the surrounding words is only slightly higher than random, showing that emoticons are likely to be used as a means to add emotion to an otherwise neutral text. The experiment based on the sentiment lexicon proved that emoticons are good indicators of sentiment in the tweet, but are less effective in retrieving related sentiment words, thus confirming that emoticons complement the text rather than stressing what is already expressed by the words.

The paper by Hallsmar and Palm (2016) is instead focused on the effectiveness of using emoji to automatically annotate training data for multiclass emotion classification. The researchers employed a training corpus of 400,000 tweets, 100,000 for each of four classes (sadness, anger, fear and happiness), then tested against 80 instances manually collected and labeled according to their textual content. The results show that emoji can be effectively used to automatically annotate the emotion class in large sets of tweets, thus suggesting that emoji, in contrast with emoticons, may co-occur with semantically related words.

Other works have analyzed the semantics of emoji, mostly by means of distributional semantics. In Barbieri et al. (2016), the authors used the skip-gram model paired with different dataset sizes and different filtering methods to generate emoji embeddings. These were evaluated against a set of 50 emoji pairs manually annotated for similarity and relatedness scores. The similarity scores obtained by the models were strongly correlated with those in the gold standard, particularly if stop words and punctuation were removed from the dataset. This indicates that surrounding words and other emoji are useful for inferring the meaning of a given emoji, possibly indicating that the emoji is being used in a redundant way.

In Eisner et al. (2016), emoji embeddings were learnt from their descriptions in the Unicode emoji standard, and representations are thus obtained for all represented emoji, including those that appear infrequently in online text. In spite of the model being trained on much less data, the authors claim to outperform Barbieri et al. (2016) on the task of Twitter sentiment analysis. These results point to the fact that the emoji descriptions in the Unicode standard are a valid source from which to model their semantics.

The issue whether emoji add content to the text they occur in, particularly in tweets, or whether they are largely redundant, as well as how their specific use in this respect can be predicted, is not investigated directly in any of the studies mentioned so far. The paper by Zanzotto et al. (2011), however, addresses the problem of linguistic redundancy within the realm of microblogs. Although this study does not specifically target emoji, it is of particular interest for our work given the formal definitions provided for both redundancy and non-redundancy, as well as the methodology employed. The authors performed a classification experiment on 1242 pairs of tweets related to news, previously annotated with four possible relations, i.e. entailment (redundant), paraphrase (redundant), related/unrelated (non-redundant), and contradiction (non-redundant). They used the annotated corpus to test different models in a classification experiment, and obtained the best results with a combination of syntactic and similarity features computed across the word vectors of each pair.

The methodology adopted in our work builds on the Zanzotto et al. (2011) study, both as concerns the fundamental question we ask and the way we have collected and annotated our training corpus. A crucial difference is, however, that our analysis focuses on the use of emoji.

3 Corpus Preparation

To answer our research questions we set up a corpus of English tweets automatically extracted from Twitter with the aid of specific emoji keywords. The corpus was then annotated by four human coders to be further used in a machine learning experiment. The annotated corpus consists of tweets containing emoji paired with their counterparts where the emoji has been removed, for a total of 2475 pairs.

The purpose of the corpus collection and annotation was twofold. Our primary goal was to provide training data to develop classifiers that could predict the relation of emoji in unseen tweets. A secondary goal was to investigate how easy it is for human coders to distinguish different uses of emoji with respect to their semantic contribution. In order to clarify this aspect, we ran an inter-annotator agreement test on part of the annotated material.

Category              Emoji names
Traveling/Commuting   car, airplane, sailboat
Events                party popper, jack-o-lantern, graduation cap
Places                school, european castle, home + garden
Other Activities      artist palette, books, television
Feelings              smiling face with heart eyes, unamused face, crying face
People                man and woman holding hands, person walking, person raising one hand
Eating & Drinking     pizza, doughnut, hot beverage
Nature & Animals      dog, snowflake, maple leaf
Music                 microphone, guitar, musical notes
Sport                 trophy, swimmer, basketball and hoop

Table 1: List of the emoji used to extract tweets for the corpus collection.

3.1 Emoji Selection

To select a set of meaningful emoji to use for the data extraction, we start by defining a categorization of the whole emoji set. The Unicode Consortium website provides the full emoji dataset, in which every emoji is annotated with a code, fourteen different graphic renderings, the emoji name, the date of addition to the Unicode standard, and a set of keywords that identify the content of each pictograph. Unicode separates groups of emoji according to similar renderings and, possibly, semantic relatedness, but does not provide an official ontology.

Previous studies interested in emoji semantics use different categorizations for their purposes. Cappallo et al. (2015) relied on the categories listed in the MSCOCO dataset (Lin et al., 2014): Person & Accessory, Animal, Vehicle, Outdoor Object, Indoor Object, Sport, Kitchenware, Food, Furniture, Appliance, Electronics. These categories partially overlap with the ones in Emojipedia (Burge, 2013): Smileys & People, Animals & Nature, Food & Drink, Activity, Travel & Places, Objects, Symbols, Flags. Emojipedia categorizes the pictographs considering their graphical properties, while the MSCOCO categories are modeled for object recognition, and thus discriminate more precisely among inanimate objects.

Barbieri et al. (2016) used word embeddings, dimensionality reduction and clustering to identify 11 clusters, labeled as: Sports & Animals, Nature, Body gestures & Positive, Free Time, Unclear, Love & Parties, Letters, Barber & Symbols, Eating & Drinking, Music, Sad & Tears. These labels reflect the graphical and conceptual similarity of the data points included in a specific cluster. Nevertheless, some of the labels are claimed to be inconsistent, since the relevant clusters include few and apparently unrelated pictographs.

Vidal et al. (2016) used a categorization based on Emojipedia which includes six categories: Food & Drinks, Non-food objects, Celebrations, Activity, Travel & Places, Nature.
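The embedding-plus-clustering approach used by Barbieri et al. (2016) to derive such categories can be illustrated with a minimal k-means sketch. The implementation below is our own illustration, not the authors' code, and the toy 2-D points stand in for real emoji embeddings:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: returns final centroids and the point clusters."""
    centroids = random.Random(seed).sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid.
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Two obvious groups of "embeddings" separate into two clusters.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
          (10.0, 10.0), (10.2, 9.9), (9.8, 10.1)]
centroids, clusters = kmeans(points, k=2)
```

In the real setting the vectors would be high-dimensional skip-gram embeddings, typically reduced (e.g. by PCA) before clustering, and each resulting cluster would then be labeled by hand, as in the study above.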
After having considered the categorizations mentioned above, we developed our own, including the following labels: Nature & Animals, Places, Traveling/Commuting, Sport, Events, Other Activities, Music, Eating & Drinking, People, Feelings. Our intent was to select a small number of relatively broad and easily recognizable categories. Furthermore, we chose to keep events and activities separate from entities, as is done in many linguistically-oriented ontologies.

From each category in our list we selected three emoji; in order to get clearly distinguishable pictographs we considered both their frequency of use as given by the Emojitracker, thus favouring the most frequent tokens, and their graphical features. The full list of emoji is shown in Table 1.

3.2 Data Collection

All the data were collected between the 1st and 2nd of November 2016 by means of the Twitter Streaming API and the Python Tweepy wrapper. To extract the data we added to the script a filter for the English language and passed the list of the selected emoji as the keywords parameter. Both filtering by language and filtering by keywords are provided as features by the API.

The raw data included 501,342 tweets, subsequently reduced to 196,434 after removing all duplicate entries. A series of common preprocessing steps were applied before the annotation: in particular, all mentions of other users and all links were replaced with placeholders.

The accepted character length on Twitter is 140; in the cleaned corpus the average length of the tweets was 50 characters, and 555 tweets were longer than 140 characters, with a maximum length of 196 characters. Thus, as an additional step, all tweets below a threshold length of 10 characters and above a threshold length of 140 characters were discarded. We checked again for the presence of duplicates after replacing mentions and links, since tweets may have the same content and differ only in these elements; this led to a resulting collection of 180,958 instances. In this cleaned version of the corpus the average tweet length is 52 characters, with a standard deviation of 32.

The best represented category is, unsurprisingly, Feelings, with a total of 99,050 instances. Within Feelings the most frequent emoji is Smiling face with heart shaped eyes, with 60,479 extracted tweets. The least represented category is Places, with a total of 900 instances. Within this category the least frequent emoji is School, with 47 extracted tweets.

From these data we created a balanced corpus by sampling 900 instances from each category, since this is the size of the least populated one. From the resulting corpus of 9000 instances we further removed all the tweets containing only the emoji used for the data extraction, since this would have resulted in pairs containing one empty tweet and one tweet consisting of an emoji keyword repeated multiple times. The final collection contained 8985 pairs; from this corpus we randomly sampled 4100 pairs for the annotation. The size was chosen considering the corpus size in the Zanzotto et al. (2011) paper, which we used as a methodological model for our work.

4 Annotation

The annotation of the 4100 tweet pairs took place remotely between the 21st and the 31st of December 2016 and was performed by four annotators, three located in Greece and one in the Netherlands. All the annotators were fluent English speakers. For the annotation we developed an ad hoc user interface.

We chose a multiclass setup with three classes of interest: Redundant, Non-Redundant, and Non-Redundant + POS; we define these classes and illustrate them with examples below. The annotators were asked to assign a class to each pair in the corpus.

4.1 Classes Definition

The general definition of redundancy is repetition of already expressed information; to describe the classes for the annotation we relied on Zanzotto et al. (2011), who define as redundant tweet pairs which are in a relation of paraphrase or entailment, while pairs in a relation of contradiction or relatedness are considered non-redundant. We expect an emoji to be considered redundant if it represents an object or an action also expressed by words in the text (the emoji is a synonym of another word), or if it represents an object or action whose presence is directly implied by the text (the emoji is entailed by the words).

The final set of classes includes three labels: Redundant, Non-Redundant, and Non-Redundant + POS. The Redundant class indicates that the emoji of interest repeats information present in the text or that its meaning is implied by the text. On the contrary, we expect the Non-Redundant class to be assigned when the emoji adds information not already present or implied in the text. Lastly, the Non-Redundant + POS class, which can be considered a subset of the Non-Redundant class, indicates the case where the emoji is used with a syntactic function (and can be labeled with its POS), thus replacing a word.
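Each item in the corpus pairs a tweet with a counterpart from which the emoji has been removed (Section 3). A sketch of building such a pair is shown below; the codepoint ranges cover only the most common emoji blocks and are an approximation, not the paper's actual procedure:

```python
import re

# Rough emoji pattern: misc pictographs, emoticons, transport symbols,
# supplemental symbols, and the misc-symbols/dingbats range. Not exhaustive.
EMOJI = re.compile(
    "[\U0001F300-\U0001F5FF"
    "\U0001F600-\U0001F64F"
    "\U0001F680-\U0001F6FF"
    "\U0001F900-\U0001F9FF"
    "\u2600-\u27BF]"
)

def make_pair(tweet):
    """Return (tweet, counterpart with emoji removed and spacing fixed)."""
    counterpart = re.sub(r"\s+", " ", EMOJI.sub("", tweet)).strip()
    return tweet, counterpart

pair = make_pair("I wish you were here \U0001F622")
# pair[1] == "I wish you were here"
```

A production version would rather rely on the full emoji list published by the Unicode Consortium than on hand-written codepoint ranges.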

We provided a set of examples to the annotators and clarified possible edge cases. An extract from the examples is listed here (the position of the original emoji, lost in this rendering, is marked [emoji]):

1. Redundant
• "We'll always have Beer. I'll see to it. I got your back on that one. [emoji]"
• "@USER I need u in Paris girls [emoji]"

2. Non-Redundant
• "I wish you were here [emoji]"
• "Hopin for the best [emoji]"

3. Non-Redundant + POS
• "Thank you so so so so much ily Here's a [emoji] as a thank you gift x"
• "Good morning [emoji]"

An edge case could be represented by:

• "Reading is always a good idea [emoji]. Thank you for your sincere support @USER. Happy reading."

In this case the emoji represents books, which are related to the verb "reading"; however, the act of reading does not necessarily imply the presence of books (it is not an entailment), since it is possible to read newspapers, blogs, comments, or emails. The emoji is narrowing down the meaning of the verb; therefore it is adding information and we should consider it non-redundant.

Emotions also represent a challenge, since we need to rely on symbols or simplifications to depict complex expressions. While a case such as:

• "i'm so proud of myself *pats my back* [emoji]"

is clearly non-redundant (here the emoji is used ironically), a tweet like:

• "My forever love [emoji] @URL"

represents redundant use.

4.2 Interface

To annotate the tweets we set up a dynamic interface accessible online and hosted, until the completion of the task, on a server at the Demokritos Institute of Research in Athens (http://www.demokritos.gr/); we provided detailed guidelines explaining how to access and use the interface and describing the annotation criteria and the classes with the aid of examples.

Before the annotation started, we tested the interface on the latest versions of Mozilla Firefox and Chrome. Since browsers do not always render emoji automatically, we provided our interface with a link to the Symbola font, one of the richest in emoji renderings.

On the first page of the interface each annotator was shown a welcome message and a briefer version of the instructions already provided in the guidelines. After the instructions and three examples of tweet pairs with the corresponding class checked, the annotators could move on to the annotation page, which presented the pairs, a forced-choice form to select the class, and a submit button. The pairs were updated dynamically after each submission, and the checked value was stored together with the index of the pair and the annotator id. The default value of the form was set to blank; we gave the annotators the possibility to submit a blank value whenever they were undecided about the class to pick; the blank submissions were recorded as undefined.

Figure 1: Screen capture of the annotation interface.

The first 100 pairs were annotated by all the annotators to measure inter-annotator agreement; after this set of common pairs, the annotators had random access to a further 1,000 pairs each among the remaining 4,000. On completion of the task the annotators were redirected to a thank-you page. Furthermore, we gave them the option to interrupt and restart the annotation process in order to complete the task in multiple sessions. Their work was automatically saved to a CSV (comma-separated values) file after each session's interruption.
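The per-session storage just described can be sketched as follows; the column names and the small helper are our own assumptions, with blank submissions recorded as undefined as in the paper:

```python
import csv
import io

def save_session(rows, fh):
    """Write (pair_index, annotator_id, submitted_class) records as CSV.

    A blank submission means the annotator was undecided and is
    stored as "undefined".
    """
    writer = csv.writer(fh)
    writer.writerow(["pair", "annotator", "class"])
    for pair_index, annotator_id, value in rows:
        writer.writerow([pair_index, annotator_id, value or "undefined"])

buf = io.StringIO()
save_session([(101, "A1", "Redundant"), (102, "A1", "")], buf)
```

Appending one such file per session, keyed by pair index and annotator id, also makes it easy to detect the pairs annotated more than once that are discussed next.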

Due to the random access, after the first 100 pairs some of the 1,000 pairs left were presented and annotated more than once; these were discarded from the final corpus. Additionally, one of the annotators reported problems with the interface when saving the last part of her work. Therefore, and also considering the fact that we excluded the 100 pairs used to calculate the agreement, our final corpus consists of 2475 annotated pairs in total.

4.3 Annotation Reliability

To assess the inter-annotator agreement we adopted Cohen's kappa coefficient (Cohen, 1960). Cohen's κ is used to assess agreement between two annotators and is considered more robust than simple percentage agreement, since it corrects for chance agreement. Moreover, this choice allows us to compare our results with those obtained by Zanzotto et al. (2011) for a similar, although more complex, task.

Considering the agreement results described in Zanzotto et al. (2011), we expected to get a κ of 0.6, which is generally considered moderate agreement (Landis and Koch, 1977).

      A1    A2    A3    A4
A1    -     0.76  0.78  0.7
A2    0.76  -     0.81  0.8
A3    0.78  0.81  -     0.71
A4    0.7   0.8   0.71  -

Table 2: Observed agreement.

      A1    A2    A3    A4
A1    -     0.57  0.62  0.48
A2    0.57  -     0.66  0.64
A3    0.62  0.66  -     0.5
A4    0.48  0.64  0.5   -

Table 3: Cohen's κ agreement.

In Tables 2 and 3 we report the percentage and Cohen's κ agreement between each pair of annotators. The average percentage agreement is 76%, while the average Cohen's κ is 0.576, a value only slightly lower than what we were aiming for. A discussion of the difficulties encountered by the annotators is provided in the next section. To comply with the suggestion given by one of the anonymous reviewers, we also calculated agreement using Fleiss' kappa and Krippendorff's alpha. The values we obtained, however, are very similar (0.575 and 0.576, respectively).

5 Analysis and Discussion

5.1 Corpus Analysis

Our gold standard contains a total of 2475 annotated pairs, as stated in the previous section.

            End           Not End       Total
R           452 (0.357)   382 (0.316)   834 (0.337)
Non-R       768 (0.607)   660 (0.546)   1428 (0.577)
Non-R+POS   37 (0.029)    139 (0.115)   176 (0.071)
Undefined   9 (0.007)     28 (0.023)    37 (0.015)
Total       1266 (1)      1209 (1)      2475 (1)

Table 4: Conditional frequency of the emoji class given the emoji position: absolute counts and proportions. The largest proportion for each class in each condition is in boldface.

            CD            NN            Other         Total
R           362 (0.348)   328 (0.336)   144 (0.314)   834 (0.337)
Non-R       583 (0.560)   565 (0.580)   280 (0.610)   1428 (0.577)
Non-R+POS   73 (0.070)    79 (0.081)    24 (0.052)    176 (0.071)
Undefined   23 (0.022)    3 (0.003)     11 (0.024)    37 (0.015)
Total       1041 (1)      975 (1)       459 (1)       2475 (1)

Table 5: Conditional frequency of the emoji class given the emoji POS tag: counts and proportions. The largest proportion for each class in each condition is in boldface.

The distribution of the classes is as follows: the Redundant class has 834 instances (33.7%), the Non-Redundant class has 1428 instances (57.7%), and the Non-Redundant + POS class has 176 instances (7.1%). Additionally, 37 instances are annotated as undefined (1.5%).
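Cohen's κ, as used in Section 4.3 above, corrects observed agreement for the agreement expected by chance given each annotator's label distribution. A self-contained implementation for two label sequences (a sketch; the toy labels are our own):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators over the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: proportion of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from the marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[l] * count_b[l] for l in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

kappa = cohens_kappa(["R", "R", "NR", "NR"], ["R", "NR", "NR", "NR"])
# p_o = 0.75, p_e = 0.5, so kappa = 0.5
```

Running this over each of the six annotator pairs on the 100 shared items yields a pairwise matrix like Table 3.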

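The position effect discussed in Section 5.1 is tested with a χ-squared test of independence over the Table 4 counts. The statistic itself needs only the standard library (the p-value lookup would need scipy, so only the statistic is shown here):

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table (rows of counts)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row and column factors.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Table 4 counts: rows R, Non-R, Non-R+POS, Undefined;
# columns: emoji close to the end vs. not close to the end.
table_4 = [(452, 382), (768, 660), (37, 139), (9, 28)]
stat = chi_squared(table_4)  # ~81.6, with df = (4-1)*(2-1) = 3
```

Plugging in the Table 4 counts reproduces the statistic reported for the position analysis below.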
Table 4 details how the classes are distributed given the position of the emoji, as either close to the end of the tweet or not. (The emoji position was computed by dividing the index of the emoji in the tokenized tweet by the number of tokens in the tweet; we considered close to the end those emoji with a value equal to or above 0.7.) In the close-to-the-end condition, 35.7% of the instances are annotated as Redundant (R in the tables), 60.7% as Non-Redundant (Non-R), 2.9% as Non-Redundant + POS (Non-R+POS), and 0.7% are undefined. In the opposite condition (when the position of the emoji is not close to the end of the tweet), 31.6% of the instances are Redundant, 54.6% are Non-Redundant, 11.5% are Non-Redundant + POS, and 2.3% are undefined. Interestingly, although not surprisingly, the Non-Redundant + POS class is the only one (leaving the undefined instances out) to show a higher probability of occurrence in the "not close to the end" than in the "close to the end" condition.

From the distribution we can see that, at least in a corpus the size of ours, the distinction between close and not close to the end is not a strong indicator of whether the emoji is used to repeat or to add information, with the exception of the case in which the emoji not only adds information but also replaces a word. The differences in the distribution are significant, as demonstrated by a χ-squared test of independence (χ-squared = 81.644, df = 3, p-value < 0.001). An analysis of the residuals confirmed that the effect of position is highest in the case of the Non-Redundant + POS class.

We had an intuition that the part-of-speech category of the emoji might be an interesting feature to look at for the purposes of training classifiers to predict the relation of the emoji with the content of the rest of the text. Therefore, the corpus was run through the Stanford Tagger. We decided to use the standard Stanford POS Tagger from the Python NLTK wrapper, since traditional POS taggers have been reported to achieve satisfactory results when compared with domain-specific taggers (Derczynski et al., 2013), and also since Twitter-specific POS taggers do not seem to provide tags for emoji.

In Table 5 we report the frequencies for the most frequent tags, which are CD and NN (cardinal number and noun, respectively). The column Other sums the frequencies of the remaining categories. The Stanford POS Tagger considers several features prior to assigning a tag to unknown words. This set of features includes capitalization, context (n-grams), hyphens, numbers, and allcaps (Toutanova et al., 2003). Tokens containing allcaps, a slash or a dash, as well as numbers, are tagged as NN (since they might be company names). Thus the POS tag assigned to an emoji may either be the result of these specific features or may be based on the n-gram sequence in which the emoji is embedded.

From the numbers in the table, and again leaving out the undefined instances, it would appear that NN might be used as a predictor of the two Non-Redundant classes, while CD seems more predictive of Redundant use. The differences are significant on a χ-squared test of independence (χ-squared = 21.385, df = 6, p-value < 0.01). An analysis of the residuals showed that, if we ignore the undefined instances, the largest contributions to the differences are found in the negative effect of CD on the Non-Redundant class, the negative effect of Other on Non-Redundant + POS, and the positive effect of NN on that same class.

To sum up, the analysis shows that position (close to the end or not) and part-of-speech class might be useful features to consider when training a classifier to predict whether emoji in tweets are being used in a redundant or an additive way.

5.2 Annotation difficulties

We saw earlier that the inter-annotator results are slightly lower than expected. Some annotators reported difficulty in assigning a class when the tweet content was not meaningful; thus a possible way to improve the annotation design and increase the agreement would be to filter out all the spam and advertisement tweets that contain little or non-informative text, and keep only tweets from individual (possibly verified) users, avoiding corporate accounts and bots.

To gain a better understanding of the difficulties of the annotation process, we considered a small sample of pairs where two annotators assigned the Non-Redundant class and the other two assigned the Non-Redundant + POS class. Some examples are listed here:

• "[emoji] Legia Warsaw"
• "[emoji] - like who comments 'ifb'"

From these examples it can be seen that disagreement emerges when the tweet content is very short and unstructured and the function of the emoji is ambiguous, given also the lack of syntactic cues in the text. Such cases include occurrences where the text of the tweet consists of hashtags only.
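The position feature described in Section 5.1 (emoji index over token count, with 0.7 as the close-to-the-end threshold) is straightforward to compute; whitespace tokenization here is a simplifying assumption, and "X" stands in for an emoji token:

```python
def close_to_end(tokens, emoji_token, threshold=0.7):
    """True if the emoji's relative position in the tweet is >= threshold."""
    return tokens.index(emoji_token) / len(tokens) >= threshold

assert close_to_end("I wish you were here X".split(), "X")        # 5/6 ~ 0.83
assert not close_to_end("X marks the spot right here".split(), "X")  # 0/6 = 0.0
```

On very short or hashtag-only tweets of the kind discussed above, this ratio is of course unstable, which is consistent with the disagreement observed on such items.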

We also noted disagreement (mostly between a single annotator and the other three) in cases where the emoji is strongly related to other words in the text, e.g.:

• "I wish I was a pet so I could just stay home, lounge all day and have no responsibilities [emoji]"
• "@USER mom, my birthday is coming [emoji]"

In such cases it is possible that one or more annotators identified a relation of synonymy or entailment (thus labeling the instance as Redundant), while the others considered it relatedness or similarity (thus labeling the instance as Non-Redundant). This suggests that identifying entailment at token level, rather than from pairs of sentences, especially in unstructured and short text, is a hard task.

We must also note that even though we balanced the amount of tweets per category in our corpus, we did not further balance the tweets in each category according to the emoji used to retrieve them. Therefore, we cannot exclude a possible effect derived from the most common of these emoji, and we should consider improving this aspect in future research. Lastly, we cannot exclude that difficulties may have arisen due to renderings of other co-occurring emoji that were missing from the Symbola font we adopted.

There are several aspects we have discussed in this work that may constitute a limitation and are, therefore, open to improvements and changes. The most important is perhaps the fact that the three classes of interest are far from being equally represented. Thus, more data should be collected. Doing so could also reduce the effect of noisy examples, such as tweets consisting only of emoji.

Regarding this aspect, we could also consider the possibility of using a binary setup, thus merging Non-Redundant and Non-Redundant + POS into the same class and balancing the amount of instances related to each case within it. Improvements to the annotation interface should also be considered if more data is annotated.

Considering the confusion sometimes made by the annotators between similarity and entailment, more examples should be provided to train them more extensively to categorize such cases correctly. Furthermore, the agreement can be improved by including additional annotators and by removing from the corpus those tweets that turn out to be particularly problematic.

As future work it will be interesting to evaluate emoji behaviour in the context of specific NLP tasks such as thread summarization. Moreover, it would be important to verify whether the redundancy between emoji and words is equivalent to, or differs from, the redundancy among the words alone in the context of the same tweet.

6 Conclusion

We have presented an annotated corpus of tweets that was developed with the purpose of training models to classify the informative behaviour of emoji in tweets.

We have described the entire process of retrieving, cleaning, and presenting the data to the annotators through the graphical user interface specifically developed for the task. The interface source code is available at https://github.com/giuliadnt/Annotation_gui; the corpus can be provided by the main author on request.

The reliability of the annotation was measured and, although the average κ score was slightly lower than expected, it still showed close to moderate agreement among the annotators, an acceptable result given the difficulty of the task.

We have also provided an analysis of the corpus in terms of the distribution of the three classes of emoji behaviour (Redundant, Non-Redundant, and Non-Redundant + POS) given the position of the emoji in the tweet, as well as their part-of-speech category. Both dimensions seem to provide at least some predictive power, and have in fact been used as features to develop classifiers of emoji informative behaviour in tweets (paper in preparation).

Acknowledgments

We would like to express our gratitude to George Giannakopoulos, the Institute of Informatics and Telecommunications at NCSR Demokritos (Athens, GR) and everybody at SciFY for the help and support throughout the whole data collection and annotation process. We would also like to thank Daniela Schneevogt and Michael Schlichtkrull for all their suggestions, and the reviewers for the feedback provided.

References

Francesco Barbieri, Francesco Ronzano, and Horacio Saggion. 2016. What does this emoji mean? A vector space skip-gram model for Twitter emojis. In Language Resources and Evaluation Conference (LREC), Portorož, Slovenia.

Marina Boia, Boi Faltings, Claudiu-Cristian Musat, and Pearl Pu. 2013. A :) is worth a thousand words: How people attach sentiment to emoticons and words in tweets. In Social Computing (SocialCom), 2013 International Conference on, IEEE, pages 345–350.

Jeremy Burge. 2013. Emojipedia. https://emojipedia.org/.

Spencer Cappallo, Thomas Mensink, and Cees G. M. Snoek. 2015. Image2Emoji: Zero-shot emoji prediction for visual media. In Proceedings of the 23rd ACM International Conference on Multimedia, ACM, pages 1311–1314.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20(1):37–46. https://doi.org/10.1177/001316446002000104.

Leon Derczynski, Alan Ritter, Sam Clark, and Kalina Bontcheva. 2013. Twitter part-of-speech tagging for all: Overcoming sparse and noisy data. In RANLP, pages 198–206.

Ben Eisner, Tim Rocktäschel, Isabelle Augenstein, Matko Bošnjak, and Sebastian Riedel. 2016. emoji2vec: Learning emoji representations from their description. arXiv preprint arXiv:1609.08359.

Fredrik Hallsmar and Jonas Palm. 2016. Multi-class sentiment classification on Twitter using an emoji training heuristic.

J. Richard Landis and Gary G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics, pages 159–174.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, Springer, pages 740–755.

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology – Volume 1, Association for Computational Linguistics, pages 173–180.

Leticia Vidal, Gastón Ares, and Sara R. Jaeger. 2016. Use of emoticon and emoji in tweets for food-related emotional expression. Food Quality and Preference 49:119–128.

Fabio Massimo Zanzotto, Marco Pennacchiotti, and Kostas Tsioutsiouliklis. 2011. Linguistic redundancy in Twitter. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, pages 659–669.
