Assessing Emoji Use in Modern Text Processing Tools

Abu Awal Md Shoeb
Dept. of Computer Science
Rutgers University
New Brunswick, NJ, USA
[email protected]

Gerard de Melo
Hasso Plattner Institute / University of Potsdam
Potsdam, Germany
[email protected]

Abstract

Emojis have become ubiquitous in digital communication, due to their visual appeal as well as their ability to vividly convey human emotion, among other factors. The growing prominence of emojis in social media and other instant messaging also leads to an increased need for systems and tools to operate on text containing emojis. In this study, we assess this support by considering test sets of tweets with emojis, based on which we perform a series of experiments investigating the ability of prominent NLP and text processing tools to adequately process them. In particular, we consider tokenization, part-of-speech tagging, as well as sentiment analysis. Our findings show that many tools still have notable shortcomings when operating on text containing emojis.

1 Introduction

In our modern digital era, interpersonal communication often takes place via online channels such as instant messaging, email, social media, as well as websites and apps. Along with this growth of digital communication, there is an increasing need for tools that operate on the resulting digital data. For instance, online conversations can be invaluable sources of insights that reveal fine-grained consumer preferences with regard to products, services, or businesses. Modern text processing and natural language processing (NLP) tools address a range of different tasks, encompassing both fundamental ones such as tokenization and part-of-speech tagging as well as semantic tasks such as sentiment and emotion analysis, text classification, and so on.

However, the shifts in modality and medium also shape the way we express ourselves, making it increasingly natural for us to embed emojis, images, and hashtags into our conversations. In this paper, we focus specifically on emojis, which have recently become fairly ubiquitous in digital communication, with a 2017 study reporting 5 billion emojis being sent daily just on Facebook Messenger (Burge, 2017). Emojis are textual elements that are encoded as characters but rendered as small digital images or icons that can be used to express an idea or emotion.

Goals. Due to their increasing prominence, there is a growing need to properly handle emojis whenever one deals with textual data. In this study, we consider a set of popular text processing tools and empirically assess to what extent they support emojis.

Although emojis can be encoded as Unicode characters, there are special properties of emoji encoding that may need to be considered, such as skin tone modifiers, first introduced in 2015 for a small set of emojis. Some tools may handle regular emojis but fail to handle skin tones properly. Moreover, text harbouring emojis may adhere to subtly different conventions than more traditional forms of text, e.g., with regard to token and sentence boundaries. Finally, emojis may of course also alter the semantics of the text, which in turn may, for instance, affect its sentiment polarity.

Overview. For our analysis, we draw primarily on real tweets to study a range of different kinds of emojis. We run a series of experiments on this data evaluating each text processing tool to observe its behaviour at different stages in the text processing pipeline. Our study focuses on tokenization, part-of-speech tagging, and sentiment analysis. The results show that many tools have notable deficiencies in coping with modern emoji use in text.

2 Related Work

While emoji characters have a long history, they have quite substantially grown in popularity since their incorporation into Unicode 6.0 in 2010, followed by increasing support for them on mobile devices. Accordingly, numerous studies have sought to explain how the broad availability of emojis has affected human communication, considering grammatical, semantic, as well as pragmatic aspects (Kaye et al., 2017; McCulloch, 2019). Only a few studies have specifically considered some of the more advanced technical possibilities that the Unicode standard affords, such as zero width joiners to express more complex concepts. For instance, with regard to emoji skin tone modifiers, Robertson et al. (2020) study in depth how the use of such modifiers varies on social media, including cases of users modulating their skin tone, i.e., using a different tone than the one they usually pick.

Given the widespread use of emojis in everyday communication, it is important to consider their support in the most commonly used NLP toolkits, such as Stanford’s Stanza (Qi et al., 2020) and NLTK (Bird et al., 2009), which power a wide range of applications. There are numerous reports that compare the pros and cons of popular NLP libraries (Wolff, 2020; Kozaczko, 2018; Choudhury, 2019; Bilyk, 2020). These primarily consider the features and popularity of the tools, as well as their performance. However, there have not been studies comparing them with regard to their ability to cope with modern emoji-laden text. Since emojis are becoming increasingly ubiquitous, it is crucial for developers and institutions deploying such software to know whether it can cope with the kinds of text that nowadays may quite likely arrive as input data. In many real-world settings, applications and services are expected to operate on text containing emojis, and so it is important to investigate these capabilities.

Many academic studies present new models for particular NLP tasks relating to emojis. For instance, Felbo et al. (2017) developed an emoji prediction model for tweets. Weerasooriya et al. (2016) discussed how to extract essential keywords from a tweet using NLP tools. Cohn et al. (2019) attempted to understand the use of emojis from a grammatical perspective, seeking to determine the parts-of-speech of emoji occurrences in a sentence or tweet. Owoputi et al. (2013) proposed an improved part-of-speech tagging model for online conversational text based on word clusters. Proisl (2018) proposed a part-of-speech tagger for German social media. However, these studies mostly target just one specific task and are typically not well-integrated with common open source toolkits.

3 Experimental Data

As we want to assess the support of emojis provided by different text processing tools, we first consider some of the different cases of emoji use that one may encounter, in order to compile relevant data.

3.1 Emoji Use in Text

Emojis can appear in a sentence or tweet in different ways. They may show up at the beginning of a tweet or at the end of a tweet. Similarly, they may appear as part of a series of emojis separated by spaces, or could be clustered within a tweet without any interleaved spacing. Based on observations on a collection of tweets crawled from Twitter (Shoeb et al., 2019), we defined a series of cases distinguishing different aspects of emoji use, including the number of emojis, position of emojis, the use of skin tone modifiers, and so on.

Tweets                            Count      %
Total                             22.3 M     100
Unique                            21.4 M     95.84
Only single emoji                 5.67 M     25.38
Multiple emojis                   16.48 M    73.77
Emoji skin tone modifiers         1.31 M     5.85
Light Skin Tone emojis            382 K      1.71
Medium Light Skin Tone emojis     386 K      1.73
Medium Skin Tone emojis           337 K      1.51
Medium Dark Skin Tone emojis      274 K      1.23
Dark Skin Tone emojis             53 K       0.24
Zero Width Joiner (ZWJ) emojis    97 K       0.43

Table 1: Corpus statistics – the distribution of emojis over the ~22 million tweets with regard to the considered cases

Case 1: Single Emoji. This is the simple case of emoji use with only a single emoji occurrence in the entire tweet. However, in this case, an emoji can be space-separated from the text or it can be attached to the text without any leading or trailing spaces. This is the most rudimentary of all cases, designed to quickly assess whether the respective text processing tool offers bare minimum emoji support.

Case 1.1: Single Emoji with Space
Emojis are a new way of expressing emotions! #emoji

Case 1.2: Single Emoji without Space
Emojis are a new way of expressing emotions! #emoji

Case 2: Multiple Emojis. In real-world social media, we often observe multiple emojis within a single posting. In this case, emojis may be found in multiple places as single emojis or as a group. Use of the same emojis repeatedly within a tweet is a common phenomenon, especially when people wish to emphasize or express a high intensity of an emotion. Such use is akin to the repetition of individual characters in OOV words such as funnnn, heloooo, which may also be encountered. The two may also occur together: For example, when people write “Yahooooo!” to express excitement, they may also be likely to add multiple emojis “ ” instead of a single one.

Case 2.1: Multi Emoji Multi Positions
Emojis are a new way for expressing emotions ! #emoji

Case 2.2: Multi Emoji with Space
Another example is having multiple emojis together in a tweet.

Case 2.3: Multi Emoji Cluster
This gets a little complicated when having multiple emojis in a tweet without having any spaces in between emojis.

Case 3: Emojis with Skin Tone Modifiers. Not all software supports this more recent addition to the Unicode standard. For example, some software may fail to render such emojis. In our
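To make the case distinctions of Section 3.1 concrete, the following is a minimal sketch of how a tweet might be assigned to them programmatically. It is not the procedure used to compile Table 1. The name `describe_emoji_use` and the code point ranges used to approximate what counts as an emoji are assumptions made purely for illustration; only the skin tone modifier range (U+1F3FB to U+1F3FF) and the zero width joiner (U+200D) are taken directly from the Unicode standard.

```python
import re

ZWJ = "\u200D"                            # ZERO WIDTH JOINER
SKIN_TONES = "\U0001F3FB-\U0001F3FF"      # the five skin tone modifiers

# Assumption: a deliberately rough approximation of emoji base characters.
# It skips the skin tone modifiers themselves and ignores flags and keycaps.
BASE = "[\U0001F300-\U0001F3FA\U0001F400-\U0001FAFF\u2600-\u27BF]"
SUFFIX = "[" + SKIN_TONES + "\uFE0F]*"    # modifiers and variation selector-16
# One emoji unit: a base, optional suffixes, and any ZWJ-joined continuations.
EMOJI_UNIT = re.compile(BASE + SUFFIX + "(?:" + ZWJ + BASE + SUFFIX + ")*")
SKIN_TONE = re.compile("[" + SKIN_TONES + "]")


def describe_emoji_use(text):
    """Hypothetical helper: summarize which emoji cases a text falls under."""
    matches = list(EMOJI_UNIT.finditer(text))
    covered = {i for m in matches for i in range(m.start(), m.end())}

    def is_text(i):
        # True if position i holds a non-space character outside any emoji.
        return 0 <= i < len(text) and i not in covered and not text[i].isspace()

    count = len(matches)
    return {
        "emoji_count": count,
        "case": "none" if count == 0 else "single" if count == 1 else "multiple",
        # Emoji directly touching ordinary text (as in Case 1.2).
        "attached_to_text": any(
            is_text(m.start() - 1) or is_text(m.end()) for m in matches
        ),
        # Adjacent emojis with no space in between (as in Case 2.3).
        "clustered": any(a.end() == b.start() for a, b in zip(matches, matches[1:])),
        "skin_tone": any(SKIN_TONE.search(m.group()) for m in matches),
        "zwj": any(ZWJ in m.group() for m in matches),
    }


if __name__ == "__main__":
    for tweet in (
        "expressing emotions \U0001F600 #emoji",   # single emoji, spaced
        "expressing emotions\U0001F600!",          # single emoji, attached
        "so funny \U0001F602\U0001F602\U0001F602", # cluster of three
        "nice \U0001F44D\U0001F3FD",               # skin tone modifier
        "\U0001F469\u200D\U0001F4BB at work",      # ZWJ sequence
    ):
        print(describe_emoji_use(tweet))
```

Treating a base character together with its modifier or ZWJ continuation as one unit is a design choice of this sketch: it keeps a skin-toned or joined emoji from being counted as several emojis, which is exactly the kind of discrepancy that can arise when a tool processes such sequences code point by code point.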