A Study on Deep Learning-Based Humor Detecting Methods for Sentiment Analysis of Social Media
Total Page:16
File Type:pdf, Size:1020Kb
Title A Study on Deep Learning-based Humor Detecting Methods for Sentiment Analysis of Social Media Author(s) 栗, 達 Citation 北海道大学. 博士(情報科学) 甲第14177号 Issue Date 2020-06-30 DOI 10.14943/doctoral.k14177 Doc URL http://hdl.handle.net/2115/79059 Type theses (doctoral) File Information Da_Li.pdf Instructions for use Hokkaido University Collection of Scholarly and Academic Papers : HUSCAP Language Media Laboratory Research Group of Information Media Science and Technology Division of Media and Network Technologies Graduate School of Information Science and Technology Hokkaido University Da Li A Study on Deep Learning-based Humor Detecting Methods for Sentiment Analysis of Social Media A doctoral dissertation supervised by Prof. Kenji Araki Sapporo 2020 A Study on Deep Learning-based Humor Detecting Methods for Sentiment Analysis of Social Media mVndun Sdm n nlngiRol ihVbmS88 oden mlnjlbnmoshSPCH joniRntly lSihS nmVShnoj huhSWiont lojhiRdymjST nmVShldjyWjlhh Sjjlbn hhSmxj88no jinybndum nmdn mTjoiR onjhi ldhSjm onini RjhSyhS nmmjiRj ydhonS88 mxhSjbcn dymdmV ndunSddj niR mxmr nj npnd ndumnd 2HA8BHL9F3FHO62893 d mxjr mjd ndum djnm nd83 2 - ln mVShdjiR jd jS88 ,LLLCH-CCLCHE2893 ma P13581MFH35C8EA dn iR 2893 LNL. jrhSji 0 ))( djdmx o ( mlhSjWyuyj d mxmnpj dmVndunSd nd i Contents 1 Introduction 1 1.1 Background ................... 1 1.2 Contributions of the Thesis .......... 4 1.3 Structure of the thesis .............. 6 2 Related Research 8 2.1 Sentiment Analysis ............... 8 2.1.1 Rule-based methods .......... 8 2.1.2 Machine learning-based methods .. 9 2.2 Internet Slang .................. 11 2.3 Emoji ....................... 12 3 Humor in Chinese Culture 15 3.1 Theories of Humor ............... 15 3.2 Humor in Chinese Culture ........... 17 4 Analysis of Internet Slang 22 4.1 Internet Slang in Social Media ......... 22 4.2 Lexicon of Chinese Online Slang ....... 22 5 Analysis of Emoji 27 5.1 Emoji in Social Media .............. 27 5.2 Lexicon of Chinese Social Media Emojis ... 29 ii 5.3 Emoji Polarity .................. 30 5.3.1 Annotation ............... 32 5.3.2 Applications ............... 34 6 Experiment of Emoji Polarity for Sentiment Anal- ysis 38 6.1 EPLSTM Approach ............... 38 6.1.1 Model structure ............. 39 6.1.2 Emoji polarity .............. 41 6.2 Experiments ................... 41 6.2.1 Preprocessing .............. 41 6.2.2 LSTM model .............. 42 6.2.3 Testing .................. 43 6.3 Considerations ................. 44 7 Experiment of Emoji Lexicon and Slang Lexicon for Humor Detection 47 7.1 Machine Learning Approaches ........ 47 7.2 Experiments ................... 48 7.2.1 Preprocessing .............. 48 7.2.2 Applied Classifiers ........... 49 Logistic Regression ........... 49 Support Vector Machine ........ 49 Naïve Bayes ............... 50 k-Nearest Neighbors .......... 51 Random Forest ............. 51 Decision Tree .............. 51 iii 7.2.3 Performance Test ............ 52 7.3 Considerations ................. 54 8 Humor Detection 57 8.1 HEMOS System ................. 57 8.1.1 Attention-based Bi-Directional Long Short- Term Memory Recurrent Neural Net- work ................... 59 Word Encoder .............. 59 Attention Layer ............. 62 Softmax Layer .............. 64 8.2 Experiments ................... 64 8.2.1 Preprocessing .............. 64 8.2.2 Performance Test ............ 65 8.3 Considerations ................. 68 9 Conclusions of the Thesis 73 Bibliography 75 Acknowledgements 81 iv List of Figures 3.1 Example of Weibo post with Internet slang and emojis. The entry says “After jogging, I’m starving. Someone sent me a picture of skewers. I’m too tired for romance”. ..... 18 3.2 Example of Weibo post with “lemon man” emojis. ...................... 20 5.1 109 Weibo emojis which can be transformed into Chinese character tags. .......... 29 5.2 Most common emojis of Weibo. ........ 32 5.3 A example of erotic message which avoids automatic detection by inserting a pictogram inside the text. .................. 36 6.1 Overview of the proposed EPLSTM method for sentiment analysis. ............. 39 6.2 Example of correct prediction. ......... 44 v 6.3 Example of unconventional Chinese charac- ter choice in a Weibo post with images. En- try says “A notch on the screen.... speech- less.”, however one of characters of the word “notch” is replaced with another homonymous one changing the meaning to person’s name. 45 6.4 Example of wrong classification into “humor- ous” category. .................. 46 7.1 Example of correct classification of humor- ous post. ..................... 54 7.2 Example of wrong classification into “non- humorous” category. .............. 55 8.1 The architecture of AttBiLSTM model. .... 59 8.2 The architecture of LSTM model. ....... 62 8.3 Example of correct classification of humor- ous post. ..................... 69 8.4 Another example of correct classification of humorous post. ................. 70 8.5 An example of an “optimistic humorous” post misclassified as “pessimistic humorous”. .. 71 vi List of Tables 4.1 Examples of my Chinese Internet Slang Lex- icon. ........................ 23 5.1 Examples of Chinese Emoji Lexicon. ..... 31 5.2 Twenty-three emojis conveying humor typi- cal for Chinese culture. ............. 33 5.3 Occurrence of emojis used in 500 randomly chosen Weibo comments on trending topics. 37 6.1 Precision comparison results. ......... 43 7.1 Comparison results of machine learning with- out emojis and Internet slang lexicon. .... 52 7.2 Comparison results of machine learning with Internet slang lexicon only. ........... 53 7.3 Comparison results of machine learning with Chinese emojis lexicon only. .......... 53 7.4 Comparison results of machine learning with both emojis and slang. ............. 53 7.5 Comparison results of F1-score between fea- ture sets. ..................... 53 vii 8.1 Results of two categories sentiment classifi- cation by AttBiLSTM method only. ...... 66 8.2 Results of two categories sentiment classifi- cation by AttBiLSTM considering Internet slang and emojis lexicons. ............... 66 8.3 Results of four categories sentiment classifi- cation by AttBiLSTM method only. ...... 67 8.4 Results of my proposed method ........ 68 1 Chapter 1 Introduction 1.1 Background Nowadays, social media has become the essential part of our lives. With the tremendous popularity of web 2.0 ap- plications, people are free to express opinions on social net- works such as Twitter, Facebook or Weibo1 (the biggest Chi- nese social media network that was launched in 2009). These new communication channels generate massive amounts of data daily, which become an insightful information source for various research purposes, often using Natural Language Processing (Kavanaugh et al., 2012). One of its popular and essential tasks is sentiment analysis, which mainly fo- cuses on automatic sentiment predicting from online user- generated content but also draws interest from other fields as psychology, cognitive linguistics, or political science. Sen- timent analysis can be useful in election result forecasting, stock prediction, opinion mining, business analytics, and 1https://www.weibo.com Chapter 1. Introduction 2 other data-driven tasks (Li, Rzepka, and Araki, 2018). Sen- timent analysis has been widely used in real-world appli- cations that analyze the online user-generated data, such as opinion mining, product reviews analysis, and business- related activity analysis (Zhao et al., 2018). Sentiment anal- ysis in English has achieved noteworthy success in the re- cent two decades. On the other hand, classifying sentiment in other languages, for example Chinese, remains at an early stage (Wang et al., 2013). The current sentiment analysis mainly focuses on text-based user-generated content. How- ever, various new forms of semiotic information on social networks such as emoji, images, and memes, allow users to express themselves more creatively. Emojis and slangs have become a noticeable part of online informal expression on social networks. Humor is one of the personal aspects which defines us as human beings and social entities. It is a very complex, as well as ubiquitous concept, which we could simply define by the presence of amusing effects, such as laughter or well- being sensations, a set of phenomena which play a relevant role in our lives (Reyes, Rosso, and Buscaldi, 2012). Before we had spoken or written language, humans used laughter to express our enjoyment or accession in certain situations (Mathews, 2016). In today’s society, people tend to write Chapter 1. Introduction 3 funny jokes, play word games, use emojis or memes to com- ment on current affairs by “white sarcasm”2, and relieve life stress by self mocking, often in very creative ways. Emojis are ideograms and smileys used in electronic mes- sages and web pages. Originating from Japanese mobile phones in 1997, this kind of pictograms became increas- ingly popular worldwide in the 2010s after being added to several mobile operating systems. In 2015, Oxford Dic- tionaries named the “Face with Tears of Joy” emoji the “Word of the Year”. In my opinion ignoring emojis in sen- timent research is unjustifiable because they convey a piece of significant emotional information and play an important role in expressing emotions and opinions in social media (Novak et al., 2015; Guibon, Ochs, and Bellot, 2016; Li et al., 2019a). Internet slang is similarly ubiquitous on the Internet. The emergence of new social contexts like micro-blogs, discus- sion groups, community-driven question-and-answer