Analyzing Microtext: Papers from the 2011 AAAI Workshop (WS-11-05)

A Comparison between Microblog Corpus and Balanced Corpus from Linguistic and Sentimental Perspectives

Yi-jie Tang, Chang-Ye Li and Hsin-Hsi Chen
Department of Computer Science and Information Engineering
National Taiwan University, Taipei, Taiwan

Abstract

While microblogging has gained popularity on the Internet, analyzing and processing short messages has become a challenging task in natural language processing. This paper analyzes the differences between Internet short messages (or “microtext”) and general articles by comparing the Plurk Corpus and the Sinica Balanced Corpus. Likelihood ratio and the tóngyìcícílín thesaurus are adopted to analyze the lexical semantics of frequent terms in each corpus. Furthermore, the NTUSD sentiment dictionary is used to compare the sentiment distribution of the two corpora. The result is also applied to sentiment transition analysis.

Introduction

Microblogs have become one of the most important Internet media in recent years. They allow users to post short messages, which are usually limited to 140 characters. Since these messages are shorter than regular blog articles and other free texts, and the word usage is different from spoken dialogues, analyzing and processing these messages is a challenging task in natural language processing.

The analysis of microtext (Ellen 2011), such as short messages used in microblogs, instant messages, SMS, etc., has attracted much attention in recent years. To analyze Internet informal language usage, on the other hand, Xia, Wong and Gao (2005) collect textual data from bulletin board systems, and propose methods including pattern matching and support vector machines to recognize Chinese informal words on the Internet.

To better understand the characteristics of microblog short messages, a comparison between a microblog corpus and a free text corpus is required. In most of the previous studies, this comparison is done by analyzing the word frequency of each corpus. Kilgarriff and Rose (1998), for example, use word frequency to measure the homogeneity and similarity between two corpora. Rayson and Garside (2000) adopt word frequency profiling to discover key items that differentiate one corpus from another. While these comparisons are based on large corpora consisting of free text, Sahami and Heilman (2006) propose a new kernel function to measure the similarity of short text snippets by leveraging web search results.

In this paper, we compare microblog texts with general articles by word frequency, as well as lexical semantics, and perform sentiment analyses. We collect textual data from the Plurk microblogging platform (http://plurk.com) and adopt the Sinica Balanced Corpus, and study the lexical semantics and sentiment tendency of high frequency terms in each corpus.

Microblog Corpus vs. Balanced Corpus

This paper collects 20,265,405 Traditional Chinese posts of about 110,000 microbloggers in Plurk from April 1 to October 31, 2009. This microblog corpus, called Plurk Corpus, is used to study the language in microtexts. Plurk is the most popular microblogging platform in Taiwan,

Topic         philosophy  science  society  art     life     literature  Total
#characters   685.3K      102.4K   2761.3K  732.2K  1412.0K  1278.5K     7892.7K
#words        451.7K      675.0K   1820.3K  482.7K  930.8K   842.8K      5202.8K
Ratio         8.68%       12.97%   34.99%   9.28%   17.89%   16.20%      100%

Table 1. Statistics of Sinica 3.0 Corpus


which means it can provide a large amount of suitable data for this study. The messages on Plurk, like those on some other microblogging platforms, are limited to 140 characters. Plurk also serves as a social network system, since users can have friends and followers and interact with each other instantly.

In comparison with the Plurk corpus, the Sinica Balanced Corpus 3.0 (abbreviated as Sinica corpus), which is a segmented and POS-tagged Chinese balanced corpus, is adopted. The Sinica corpus is composed of 5 million words. Table 1 shows the topic distribution of documents in this corpus. The Plurk corpus is segmented by maximum matching with a dictionary collected from the Sinica corpus and Sinica BOW (Bilingual Ontological Wordnet).

Analysis of Lexical Semantics

Likelihood Ratio

To compare these two corpora by lexical semantics, we adopt the thesaurus tóngyìcícílín, which is abbreviated as Cilin (Mei et al. 1982). Cilin gathers 65,464 Chinese word entries. Cilin senses are decomposed into a four-layer semantic structure including 12 large sense categories, 94 middle sense categories, 1,428 small sense categories, and 3,925 word sense clusters. However, it does not support relationships such as hypernym, hyponym, similar, derived, antonym, etc. A word with n senses corresponds to n entries, so that the number of word types is less than 65,464. There are 53,644 words, including 45,586 unambiguous words and 8,058 ambiguous words. Table 2 shows the taxonomy of Cilin on the levels of large and middle sense categories. Symbols A, B, ..., L denote large sense categories and symbols Aa, Ab, Ac, ..., Ba, ... denote middle sense categories.

A. PERSON: Aa. generic name, Ab. people of all ages and both sexes, Ac. posture, Ad. nationality/citizenship, Ae. occupation, Af. identity, Ag. status, Ah. family member, Ai. seniority, Aj. relationship, Ak. temperament, Al. ability, Am. religion, An. negative appellation
B. OBJECT: Ba. generic name, Bb. shape, Bc. part of object, Bd. celestial body, Be. terrain, Bf. meteorological phenomena, Bg. natural substance, Bh. plant, Bi. animal, Bj. microorganism, Bk. body, Bl. secretions/excretions, Bm. material, Bn. building, Bo. machine and tool, Bp. appliance, Bq. clothing, Br. edible/medicine/drug
C. TIME AND SPACE: Ca. time, Cb. space
D. ABSTRACT THING: Da. event/circumstance, Db. reason/logic, Dc. look, Dd. function/property, De. character/talent, Df. consciousness, Dg. analogy, Dh. imaginary thing, Di. society/politics, Dj. economy, Dk. culture and education, Dl. disease, Dm. organization, Dn. quantity/unit
E. CHARACTERISTICS: Ea. appearance, Eb. phenomenon, Ec. color/taste, Ed. property, Ee. virtue, Ef. circumstance
F. MOTION: Fa. motion of hands, Fb. motion of legs, Fc. motion of head, Fd. motion of the whole body
G. MENTAL ACTIVITY: Ga. mental status, Gb. mental activity, Gc. capability and willingness
H. ACTIVITY: Ha. political activity, Hb. military activity, Hc. administrative management, Hd. production, He. economical activity, Hf. communications and transportation, Hg. education/hygiene/research, Hh. recreation and sport, Hi. social activity, Hj. life, Hk. religious activity, Hl. superstitious activity, Hm. police and judicature, Hn. wicked behavior
I. PHENOMENON AND CONDITION: Ia. natural phenomena, Ib. physiological phenomena, Ic. facial expression, Id. object status, Ie. situation, If. circumstance, Ig. beginning and end, Ih. change
J. RELATION: Ja. association, Jb. similarity and dissimilarity, Jc. coordination, Jd. existence, Je. influence
K. AUXILIARY: Ka. adverb, Kb. preposition, Kc. conjunction, Kd. particle, Ke. interjection, Kf. onomatopoeia
L. GREETING

Table 2. Taxonomy of Cilin

Chen, Lin and Lin (2002) sampled documents from different categories of the Sinica corpus, including philosophy (10%), science (10%), society (35%), art (5%), life (20%) and literature (20%). There were 35,921 words in the test corpus. They reported accuracies of 52.85% and 34.35% for tagging ambiguous words and unknown words, respectively, when the 1,428 small sense categories of Cilin are adopted. If unambiguous instances are also counted, the sense tagger achieved a performance of 76.04%.

In this paper, we do not disambiguate the word senses. We postulate that an ambiguous word with n senses contributes to each of its sense categories equally, i.e., 1/n times. We count the frequency of each Traditional Chinese word in the Plurk and Sinica corpora, respectively, select the top 100 and top 3,000 high frequency words, determine the sense categories of the selected words, compute the occurrences of each sense category, and finally compare the distribution of their senses in each corpus. For comparability, we normalize the occurrences of each sense category in each corpus by the total occurrences of all the sense categories in the same corpus. In this way,


Top 1st 2nd 3rd 4th 5th 6th 7th 8th 9th 10th lrPlurk-Sinica pū a good Lou Ye yíge (refer- good (sentence evening (sentence (sentence English (indefinite ring to a super sleep sleep morning -final or good -final -final article) Plurk particle) night particle) particle) message) Category Dn L Kf Hi L Ka Kd Kd Fd, Hj Hj

lrSinica-Plurk qí zé yú express; and; suŏ English (3rd-person (conjunc- each from every and (preposi- indicate also (particle) possessive) tion) tion) Category Aa, Ba, Ed Hi, Ja Ka, Kc Ja Dn Kb Dn, Kd Ed Kc Kb

Table 3. List of top 10 high lrPlurk-Sinica words and top 10 high lrSinica-Plurk words

the normalized value of a sense category will range between 0 and 1.

We union the top 100 high frequency words in the Plurk and Sinica corpora and get a set of 140 words. A likelihood ratio (lr) is defined as follows. It is used to distinguish which words are more critical in which corpus, and which are common in both corpora.

$$ lr_{AB}(w^i) = \log \frac{f_A(w^i) / |A|}{f_B(w^i) / |B|} \qquad (1) $$

where $lr_{AB}(w^i)$ is the log of the likelihood ratio of the relative frequencies of word $w^i$ in A and B, $f_A(w^i)$ and $f_B(w^i)$ are the frequencies of $w^i$ in A and in B, respectively, and $|A|$ and $|B|$ are the total numbers of words in A and in B, respectively. The following two likelihood ratios are calculated:

(1) $lr_{Plurk-Sinica}(w^i)$, where A=Plurk, B=Sinica
(2) $lr_{Sinica-Plurk}(w^i)$, where A=Sinica, B=Plurk

If $lr_{Plurk-Sinica}(w^i) > 0$, then $w^i$ tends to appear more in the Plurk corpus than in the Sinica corpus. Similarly, if $lr_{Sinica-Plurk}(w^i) > 0$, then $w^i$ tends to appear more in the Sinica corpus than in the Plurk corpus. If $lr(w^i) \approx 0$, then $w^i$ tends to appear equally in both corpora.

Results

Figure 1 shows the sense distribution in terms of 12 large sense categories of the top 100 high frequency words in the Plurk and Sinica corpora. The meanings of the sense symbols are given in Table 2. The normalized values of Motion sense (category F), Psychological Activity sense (category G), Phenomenon and Condition sense (category I), and Greetings sense (category L) in the Plurk corpus are two times larger than those of the corresponding senses in the Sinica corpus. In contrast, the normalized values of Person sense (category A) and Abstract Thing sense (category D) in the Sinica corpus are two times larger than those of the corresponding senses in the Plurk corpus.

Figure 1. The sense distribution in terms of 12 large categories of the top 100 words

In terms of the 100 most frequent words, the Plurk corpus has more words for Activity (category H), while the Sinica corpus has more words for Abstract Thing (category D). Most of the H category words in the Plurk corpus are related to personal experience, such as “eat”, “buy”, and “sleep”. Both the Plurk and Sinica corpora contain many auxiliaries (category K). The relative frequency of this category does not differ much between the two corpora. When we look into the subcategories, however, many more words in category Kd appear in the Plurk corpus. That is, Plurk uses more words like a, ma, and la, which are sentence-final particles in Chinese. Table 3 lists the top-10 high lrPlurk-Sinica words and the top-10 high lrSinica-Plurk words for reference.

We further compare the sense distribution of the top 3,000 high frequency words in both corpora. Figure 2 shows that the tendency is quite similar. In the Plurk corpus, the normalized values of Motion sense (category F), Phenomenon and Condition sense (category I), and Greetings sense (category L) are still two times larger. The normalized values of Thing sense (category B) in the Plurk corpus become much larger than those in the Sinica corpus, while the difference of Psychological Activity sense (category G) between these two corpora becomes smaller in the top 3,000 high frequency words. Comparatively, Abstract Thing sense (category D) is two times larger in the Sinica corpus than in the Plurk corpus.
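The likelihood ratio of Equation (1) and the 1/n sense weighting described in this section can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy frequency tables, and the toy sense assignments are hypothetical; the natural logarithm is assumed (the sign of lr does not depend on the base); words with zero frequency in either corpus are assumed to have been filtered out, since no smoothing is described; and words without a thesaurus entry are simply skipped.

```python
import math
from collections import Counter


def likelihood_ratio(word, freq_a, freq_b, total_a, total_b):
    """Equation (1): log of the ratio of the relative frequencies of `word` in A and B.

    Assumes the word has a positive frequency in both corpora.
    """
    rel_a = freq_a[word] / total_a
    rel_b = freq_b[word] / total_b
    return math.log(rel_a / rel_b)


def sense_distribution(word_freq, word_senses):
    """Normalized sense-category distribution.

    A word with n senses adds 1/n of its frequency to each of its categories;
    the result is divided by the total so that every value lies in [0, 1].
    """
    counts = Counter()
    for word, freq in word_freq.items():
        senses = word_senses.get(word, [])  # words missing from the thesaurus are skipped
        for sense in senses:
            counts[sense] += freq / len(senses)
    total = sum(counts.values())
    return {sense: value / total for sense, value in counts.items()}


# Hypothetical toy data, for illustration only.
plurk_freq = {"pu": 50, "de": 100}
sinica_freq = {"pu": 5, "de": 200}
plurk_total, sinica_total = 1000, 1000

lr_plurk_sinica = likelihood_ratio("pu", plurk_freq, sinica_freq, plurk_total, sinica_total)
lr_sinica_plurk = likelihood_ratio("pu", sinica_freq, plurk_freq, sinica_total, plurk_total)

toy_freq = {"sleep": 6, "eat": 4}
toy_senses = {"sleep": ["Fd", "Hj"], "eat": ["Hj"]}
toy_distribution = sense_distribution(toy_freq, toy_senses)
```

In this toy setting the word "pu" gets a positive lrPlurk-Sinica and a negative lrSinica-Plurk of the same magnitude, since swapping A and B only flips the sign of Equation (1).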


Figure 2. The sense distribution in terms of 12 large categories of the top 3,000 words

We extend this discussion from the coarsest sense set (i.e., 12 large sense categories) to a finer sense set (i.e., 94 middle sense categories). Figure 3 shows the sense distribution of the top 3,000 high frequency words in the Plurk and Sinica corpora. The normalized values of senses including shape (subcategory Bb), motion of hands (subcategory Fa), motion of legs (subcategory Fb), motion of head (subcategory Fc), motion of the whole body (subcategory Fd), natural phenomena (subcategory Ia), physiological phenomena (subcategory Ib), facial expression (subcategory Ic), object status (subcategory Id), circumstance (subcategory If), and greeting (category L) are two times larger in the Plurk corpus. In contrast, the Sinica corpus has two times larger values for senses including reason/logic (subcategory Db), look (subcategory Dc), function/property (subcategory Dd), society/politics (subcategory Di), and organization (subcategory Dm).

Figure 3. The sense distribution in terms of 94 middle categories of the top 3,000 words

Literary Text vs. Spoken Language

We also calculate and compare the average number of syllables of the top 100 frequent words in both corpora. The average word length in syllables is 1.29 for the Plurk corpus, while that for the Sinica corpus is 1.14. The difference between them is significant (p < 0.01). About 86% of the top 100 frequent words in the Sinica corpus are monosyllabic, while only 71% of the top 100 frequent words in the Plurk corpus are monosyllabic. Based on the above observations and Chao's (1968) statement that Literary Chinese has more monosyllabic words and fewer compounds, we argue that the relation between the two corpora is similar to the difference between Spoken Chinese and Literary Chinese. Unlike general spoken Chinese, however, the Plurk corpus also contains some Internet-specific terms like pū (referring to a Plurk message), ān’ān (a greeting word commonly used by Internet users), and shàng'àn (referring to the action of getting offline).

Analysis of Sentiment Polarity

NTU Sentiment Dictionary

To compare the sentiment distribution in the Plurk and Sinica corpora, the National Taiwan University Sentiment Dictionary (Ku and Chen 2007), or NTUSD, is used to classify the words in the two corpora. Table 4 shows the distribution of positive, negative and neutral words in NTUSD. The most frequent type is negative words, while the least frequent is neutral words, for which the proportion is only 4.136%.

              Positive  Negative  Neutral
#Words        21,055    22,750    1,890
Proportion    46.077%   49.786%   4.136%

Table 4. Distribution of sentiment words in NTUSD sentiment dictionary

Sentiment Analysis

The sentiment word frequency is calculated for the two corpora based on whether each word appears in NTUSD. Figure 4 shows the ratio of the sentiment word frequency of the Plurk corpus to the sentiment word frequency of the Sinica corpus. For the top 100 frequent words, the sentiment word frequency in the Plurk corpus is 4 times larger than that in the Sinica corpus. This suggests that the most used words in the Plurk corpus are related to informal social conversations with emotional expressions. On the other hand, the top 100 frequent words of the Sinica corpus contain many function words. Figure 4 also shows that the more words are collected, the lower the ratio is. One of the reasons for this effect may be that the Plurk corpus contains many informal and topical terms that cannot be found in NTUSD.

Figure 4. Ratio of sentiment words in top n high frequency words in the Plurk and Sinica corpora
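The ratio plotted in Figure 4 could be computed along the following lines. This is a hedged sketch rather than the procedure actually used: the paper does not say whether the sentiment word frequency is normalized, so the code assumes it is the share of token occurrences, among the top n words, that belong to words listed in the sentiment dictionary. All function names and the toy data are hypothetical, and the lexicon is represented as a plain set of words.

```python
def top_n_words(word_freq, n):
    """Return the n most frequent words of a corpus."""
    return sorted(word_freq, key=word_freq.get, reverse=True)[:n]


def sentiment_share(word_freq, lexicon, n):
    """Share of top-n token occurrences that belong to lexicon words (assumed definition)."""
    top = top_n_words(word_freq, n)
    total = sum(word_freq[w] for w in top)
    hits = sum(word_freq[w] for w in top if w in lexicon)
    return hits / total


def sentiment_ratio(freq_a, freq_b, lexicon, n):
    """Ratio of the sentiment share of corpus A to that of corpus B for the top n words.

    Assumes corpus B has at least one sentiment word among its top n words.
    """
    return sentiment_share(freq_a, lexicon, n) / sentiment_share(freq_b, lexicon, n)


# Hypothetical toy data, for illustration only.
toy_plurk = {"happy": 40, "sleep": 40, "the": 20}
toy_sinica = {"the": 80, "happy": 10, "and": 10}
toy_lexicon = {"happy"}

toy_ratio = sentiment_ratio(toy_plurk, toy_sinica, toy_lexicon, 3)
```

Evaluating `sentiment_ratio` for increasing n (for example 100, 500, 1,000, 3,000) would give one curve of the kind shown in Figure 4, under the normalization assumed here.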


Figure 5 shows the proportions of positive, negative and neutral words in the Plurk corpus. In the top 100 frequent words, many more positive words are used than negative words. As more words are taken into account, the ratio of negative words increases and that of neutral words decreases. This illustrates that people commonly show nonnegative feelings on a microblogging platform with social network characteristics.

Figure 5. Ratio of sentiment words in top n frequent words in the Plurk corpus

As shown in Figure 6, the Sinica corpus contains more positive words and fewer negative words. Although the ratio of neutral words is lower than that of positive words (only 4.136% of the sentiment words in the NTUSD dictionary are neutral words), the ratio of neutral words is larger than that of negative words. This suggests that neutral words play an important role in the Sinica corpus.

Figure 6. Ratio of sentiment words in top n frequent words in the Sinica corpus

Further Sentiment Analysis in Plurk Corpus

Since likelihood ratio is suitable for comparing lexical distribution in two corpora, the same algorithm is adopted to analyze sentiments on Plurk. The Plurk corpus is divided into four datasets based on sentiment types, and a comparison between these datasets is drawn.

The Plurk corpus contains conversations with both poster’s and replier’s messages, in which graphical emoticons are commonly used. In a conversation, the replier’s emotion sometimes disagrees with the poster’s. The purpose of this experiment is to understand what factors keep the sentiment from changing during writing and reading, and what factors change the emotions.

Dataset

A total of 35 emoticons are chosen from the 78 emoticons provided by Plurk, and categorized into the positive and negative groups as shown in Figure 7. The other 43 are either neutral or cannot be clearly categorized, so we exclude them to minimize uncertainty.

Panels: Positive; Negative

Figure 7. Positive and negative emoticons

A sentiment pair (writer_sentiment, reader_sentiment) is used to formulate the emotion transition, where writer_sentiment means the emotion expressed by a writer, i.e., a poster, and reader_sentiment means the emotion expressed by a reader, i.e., a replier. The sentiment can be positive (pos) or negative (neg), so that there are four possible sentiment transitions for a message: (pos, pos), (pos, neg), (neg, neg) and (neg, pos). The corpus was divided into four datasets based on these sentiment transition types. For clarity, the four datasets are named the PP, PN, NN, and NP datasets, respectively. A total of 79,042 conversations were selected to form our experimental corpus. The number of instances in each of the datasets PP, NN, and NP is 20,000. The number of instances in the dataset PN is 19,042, because fewer examples of (pos, neg) can be found.

Likelihood Ratio of Sentiment Words

We employ the likelihood ratio lr of words in two datasets A and B, i.e., Equation (1) described in the Analysis of Lexical Semantics section. The interpretations of $lr_{AB}(w^i)$ are as follows.

(1) A=PP, B=PN
It captures the sentiment transitions pos→pos and pos→neg. Words with high lr in PN-PP are likely to affect the sentiment transition from positive to negative. Words with high lr in PP-PN are likely to keep the sentiment unchanged, i.e., in the positive state.

(2) A=NP, B=NN


It captures the sentiment transitions neg→pos and neg→neg. Words with high lr in NP-NN may have some effect on the sentiment transition from negative to positive. Words with high lr in NN-NP may keep the sentiment unchanged.

Analysis of the Mined Words

We examine the top 200 words with higher lr in PN-PP, PP-PN, NN-NP and NP-NN, and calculate the word frequency in each Cilin category. Only the words that can be found in Cilin are analyzed. Figure 8 shows the distribution of words in the pos→pos transition and the pos→neg transition. Words used in positive writer contents are more likely to get a positive response, except for the categories B, F, I, J, and K. The most noticeable feature is greeting words (category L) such as ‘bye-bye’, ‘good morning’, and ‘good night’, which never cause the pos→neg transition. The words causing the pos→neg transition include some words in category K like ‘dubiously’, ‘fortunately’, and ‘exactly’. These words themselves do not contain negative sentiment, but are usually used in expressions related to negative sentiment.

Figure 8. Category distribution of sentiment words

The distribution of the words used for the neg→pos transition and the neg→neg transition is interpreted similarly. The words in group L can cause the neg→pos transition. The words used in the neg→pos transition include personal status words like ‘very tired’ and ‘sleepy’, which belong to category I and can receive encouragement or other positive responses. As expected, the words in the neg→neg transition, including ‘angry’ (category G) and ‘hateful’ (category E), are mostly used to express negative status or characteristics.

Conclusion and Future Work

In this paper, we compare the Plurk microblog corpus and the Sinica Balanced corpus in terms of word frequency, lexical semantics and sentimental expression. Likelihood ratio is first employed to select the critical terms from both corpora. Then, the Cilin thesaurus and the NTUSD sentiment dictionary are used to investigate the distribution of the semantic categories and the sentiment tendency of the selected terms. In addition, we also study which terms are critical for sentiment transition and which terms tend to keep emotion from changing. The results will be applied to emotion prediction in microtext. According to the observations in this paper, we argue that different linguistic and sentimental processing methods should be applied to microblog short texts.

Acknowledgments

This research was partially supported by National Science Council, Taiwan under grant NSC96-2628-E-002-240-MY3.

References

Chao, Yuen Ren. 1968. A Grammar of Spoken Chinese. University of California Press.

Chen, H.H.; Lin, C.C.; and Lin, W.C. 2002. Building a Chinese-English WordNet for Translingual Applications. ACM Transactions on Asian Language Information Processing 1(2): 103-122.

Chen, K.J.; and Hsieh, Y.M. 2004. Chinese Treebanks and Grammar Extraction. In Proceedings of International Joint Conference on Natural Language Processing, 560-565.

Ellen, J. 2011. All about Microtext: A Working Definition and a Survey of Current Microtext Research within Artificial Intelligence and Natural Language Processing. In Proceedings of the Third International Conference on Agents and Artificial Intelligence.

Kilgarriff, A.; and Rose, T. 1998. Measures for Corpus Similarity and Homogeneity. In Proceedings of 3rd Conference on Empirical Methods in Natural Language Processing, 46-52. Granada, Spain.

Ku, L.W.; and Chen, H.H. 2007. Mining Opinions from the Web: Beyond Relevance Retrieval. Journal of American Society for Information Science and Technology 58(12): 1838-1850.

Li, Z.; and Yarowsky, D. 2008. Mining and Modeling Relations between Formal and Informal Chinese Phrases from Web Corpora. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Mei, J.; Zhu, Y.; Gao, Y.; and Yin, H. 1982. Tóngyìcícílín. Shanghai Dictionary Press.

Rayson, P.; and Garside, R. 2000. Comparing Corpora Using Frequency Profiling. In Proceedings of Workshop on Comparing Corpora of ACL 2000, 1-6.

Sahami, M.; and Heilman, T. D. 2006. A Web-based Kernel Function for Measuring the Similarity of Short Text Snippets. WWW’06, 377-386. ACM Press.

Xia, Y.Q.; Wong, K.F.; and Gao, W. 2005. NIL is not Nothing: Recognition of Chinese Network Informal Language Expressions. In Proceedings of the 4th SIGHAN Workshop on Chinese Language Processing, 95-102.
