A Comparison Between Microblog Corpus and Balanced Corpus from Linguistic and Sentimental Perspectives

Analyzing Microtext: Papers from the 2011 AAAI Workshop (WS-11-05)

Yi-jie Tang, Chang-Ye Li and Hsin-Hsi Chen
Department of Computer Science and Information Engineering
National Taiwan University, Taipei, Taiwan
[email protected]; [email protected]

Abstract

While microblogging has gained popularity on the Internet, analyzing and processing short messages has become a challenging task in natural language processing. This paper analyzes the differences between Internet short messages (or “microtext”) and general articles by comparing the Plurk Corpus and the Sinica Balanced Corpus. The likelihood ratio and the tóngyìcícílín thesaurus are adopted to analyze the lexical semantics of frequent terms in each corpus. Furthermore, the NTUSD sentiment dictionary is used to compare the sentiment distribution of the two corpora. The result is also applied to sentiment transition analysis.

Introduction

Microblogs have become one of the most important Internet media in recent years. They allow users to post short messages, which are usually limited to 140 characters. Since these messages are shorter than regular blog articles and other free texts, and their word usage differs from that of spoken dialogues, analyzing and processing these messages is a challenging task in natural language processing.

The analysis of microtext (Ellen 2011), such as the short messages used in microblogs, instant messages, SMS, etc., has attracted much attention in recent years. To analyze informal language usage on the Internet, on the other hand, Xia, Wong and Gao (2005) collect textual data from bulletin board systems and propose methods including pattern matching and support vector machines to recognize Chinese informal words on the Internet.

To better understand the characteristics of microblog short messages, a comparison between a microblog corpus and a free text corpus is required. In most previous studies, this comparison is done by analyzing the word frequency of each corpus. Kilgarriff (1998), for example, uses word frequency to measure the homogeneity of and similarity between two corpora. Rayson and Garside (2000) adopt word frequency profiling to discover key items that differentiate one corpus from another. While these comparisons are based on large corpora consisting of free text, Sahami and Heilman (2006) propose a new kernel function to measure the similarity of short text snippets by leveraging web search results.

In this paper, we compare microblog texts with general articles by word frequency as well as lexical semantics, and perform sentiment analyses. We collect textual data from the Plurk microblogging platform (http://plurk.com), adopt the Sinica Balanced Corpus, and study the lexical semantics and sentiment tendency of high frequency terms in each corpus.

Microblog Corpus vs. Balanced Corpus

This paper collects 20,265,405 Traditional Chinese posts by about 110,000 microbloggers in Plurk from April 1 to October 31, 2009. This microblog corpus, called the Plurk Corpus, is used to study the language in microtexts. Plurk is the most popular microblogging platform in Taiwan, which means it can provide a large amount of suitable data for this study. The messages on Plurk, like those on Twitter and some other microblogging platforms, are limited to 140 characters. Plurk also serves as a social network and instant messaging system, since users can have friends and followers and interact with each other instantly.

For comparison with the Plurk corpus, the Sinica Balanced Corpus 3.0 (abbreviated as Sinica corpus), which is a segmented and POS-tagged Chinese balanced corpus, is adopted. The Sinica corpus is composed of 5 million words. Table 1 shows the topic distribution of documents in this corpus. The Plurk corpus is segmented by maximum matching with a dictionary collected from the Sinica corpus and Sinica BOW (Bilingual Ontological Wordnet).

Topic        philosophy  science  society  art     life     literature  Total
#characters  685.3K      102.4K   2761.3K  732.2K  1412.0K  1278.5K     7892.7K
#words       451.7K      675.0K   1820.3K  482.7K  930.8K   842.8K      5202.8K
Ratio        8.68%       12.97%   34.99%   9.28%   17.89%   16.20%      100%

Table 1. Statistics of Sinica 3.0 Corpus

Analysis of Lexical Semantics

Likelihood Ratio

To compare these two corpora by lexical semantics, we adopt the thesaurus tóngyìcícílín, which is abbreviated as Cilin (Mei 1982). Cilin gathers 65,464 Chinese word entries. Cilin senses are decomposed into a four-layer semantic structure including 12 large sense categories, 94 middle sense categories, 1,428 small sense categories, and 3,925 word sense clusters. However, it does not support relationships such as hypernym, hyponym, similar, derived, antonym, etc. A word with n senses corresponds to n entries, so the number of word types is less than 65,464. There are 53,644 words, including 45,586 unambiguous words and 8,058 ambiguous words. Table 2 shows the taxonomy of Cilin on the levels of large and middle sense categories. Symbols A, B, ..., L denote large sense categories and symbols Aa, Ab, Ac, ..., Ba, ... denote middle sense categories.

A. PERSON: Aa. generic name, Ab. people of all ages and both sexes, Ac. posture, Ad. nationality/citizenship, Ae. occupation, Af. identity, Ag. status, Ah. family member, Ai. seniority, Aj. relationship, Ak. temperament, Al. ability, Am. religion, An. negative appellation
B. OBJECT: Ba. generic name, Bb. shape, Bc. part of object, Bd. celestial body, Be. terrain, Bf. meteorological phenomena, Bg. natural substance, Bh. plant, Bi. animal, Bj. microorganism, Bk. body, Bl. secretions/excretions, Bm. material, Bn. building, Bo. machine and tool, Bp. appliance, Bq. clothing, Br. edible/medicine/drug
C. TIME AND SPACE: Ca. time, Cb. space
D. ABSTRACT THING: Da. event/circumstance, Db. reason/logic, Dc. look, Dd. function/property, De. character/talent, Df. consciousness, Dg. analogy, Dh. imaginary thing, Di. society/politics, Dj. economy, Dk. culture and education, Dl. disease, Dm. organization, Dn. quantity/unit
E. CHARACTERISTICS: Ea. appearance, Eb. phenomenon, Ec. color/taste, Ed. property, Ee. virtue, Ef. circumstance
F. MOTION: Fa. motion of hands, Fb. motion of legs, Fc. motion of head, Fd. motion of the whole body
G. MENTAL ACTIVITY: Ga. mental status, Gb. mental activity, Gc. capability and willingness
H. ACTIVITY: Ha. political activity, Hb. military activity, Hc. administrative management, Hd. production, He. economical activity, Hf. communications and transportation, Hg. education/hygiene/research, Hh. recreation and sport, Hi. social activity, Hj. life, Hk. religious activity, Hl. superstitious activity, Hm. police and judicature, Hn. wicked behavior
I. PHENOMENON AND CONDITION: Ia. natural phenomena, Ib. physiological phenomena, Ic. facial expression, Id. object status, Ie. situation, If. circumstance, Ig. beginning and end, Ih. change
J. RELATION: Ja. association, Jb. similarity and dissimilarity, Jc. coordination, Jd. existence, Je. influence
K. AUXILIARY: Ka. adverb, Kb. preposition, Kc. conjunction, Kd. particle, Ke. interjection, Kf. onomatopoeia
L. GREETING

Table 2. Taxonomy of Cilin

Chen, Lin and Lin (2002) sampled documents from different categories of the Sinica corpus, including philosophy (10%), science (10%), society (35%), art (5%), life (20%) and literature (20%). There were 35,921 words in the test corpus. They reported accuracies of 52.85% and 34.35% for tagging ambiguous words and unknown words, respectively, when the 1,428 small sense categories of Cilin were adopted. If unambiguous instances are also counted, the sense tagger achieved a performance of 76.04%.

In this paper, we do not disambiguate the word senses. We postulate that an ambiguous word with n senses contributes to each of its sense categories equally, i.e., 1/n times. We count the frequency of each Traditional Chinese word in the Plurk and Sinica corpora, respectively, select the top 100 and top 3,000 high frequency words, determine the sense categories of the selected words, compute the occurrences of each sense category, and finally compare the distribution of senses in each corpus. For comparability, we normalize the occurrences of each sense category in each corpus by the total occurrences of all the sense categories in the same corpus. In this way, the normalized value of a sense category ranges between 0 and 1.

We take the union of the top 100 high frequency words in the Plurk and Sinica corpora and get a set of 140 words. A likelihood ratio (lr) is defined as follows. It is used to distinguish which words are more critical in which corpus, and which are common to both corpora.

\[
lr_{AB}(w_i) = \log \frac{f_A(w_i) / |A|}{f_B(w_i) / |B|} \qquad (1)
\]

Rank                       1st         2nd     3rd     4th  5th  6th  7th     8th  9th     10th
lrPlurk-Sinica category    Dn          L       Kf      Hi   L    Ka   Kd      Kd   Fd, Hj  Hj
lrSinica-Plurk category    Aa, Ba, Ed  Hi, Ja  Ka, Kc  Ja   Dn   Kb   Dn, Kd  Ed   Kc      Kb

lrPlurk-Sinica words (romanizations and English glosses as extracted, not aligned to rank): pū (referring to a Plurk message), a, Lou, Ye, yíge (indefinite article), sentence-final particle (three words), super, sleep (two words), good morning, good evening or good night.
lrSinica-Plurk words (romanizations and English glosses as extracted, not aligned to rank): qí (3rd-person possessive), zé (conjunction), yú (preposition), suŏ (particle), express; indicate, and; also, each, from, every, and.

Table 3. List of top 10 high lrPlurk-Sinica words and top 10 high lrSinica-Plurk words

In terms of the 100 most frequent words, the Plurk corpus has more words for activity (category H), while the Sinica corpus has more words for abstract thing (category D). Most of the H category words in the Plurk corpus are related to personal experience, such as “eat”, “buy”, and “sleep”. Both the Plurk and Sinica corpora contain many auxiliaries (category K). The relative frequency of this category does not differ much between the two corpora. When we look into the subcategories, however, many more words in category Kd appear in the Plurk corpus. That is, Plurk uses more words like a, ma, and la, which are sentence-final particles in Chinese.
