Gender Prediction for Chinese Social Media Data
Total Page:16
File Type:pdf, Size:1020Kb
Gender Prediction for Chinese Social Media Data Wen Li Markus Dickinson Department of Linguistics Department of Linguistics Indiana University Indiana University Bloomington, IN, USA Bloomington, IN, USA [email protected] [email protected] Abstract shorter and noisier, as with social media data, given that there seem to be fewer and less reliable Social media provides users a platform indicators of a demographic trait (e.g., Zhang and to publish messages and socialize with Zhang, 2010; Burger et al., 2011), in addition to others, and microblogs have gained more the fact that many users produce language atyp- users than ever in recent years. With such ical of their demographic (Bamman et al., 2014; usage, user profiling is a popular task in Nguyen et al., 2014). This problem is potentially computational linguistics and text mining. compounded when examining languages such as Different approaches have been used to Chinese, where: a) the definition of a word is prob- predict users’ gender, age, and other in- lematic (Sproat et al., 1996); b) the collection of formation, but most of this work has been data with links to individual users is challenging, done on English and other Western lan- since Weibo (see below) requires users’ authoriza- guages. The goal of this project is to pre- tion before data collection; and c) there has been dictthegender of usersbased ontheirposts no published work (we are aware of) on this task, on Weibo, a Chinese micro-blogging plat- most work focusing on English and to some ex- form. Given issues in Chinese word seg- tent other Western languages (Rangel et al., 2015; mentation, we explore character and word Nguyen et al., 2013). n-grams as features for this task, as well We set as our first step that of predicting the gen- as using character and word embeddings der of users on Weibo1—the Twitter analogue in for classification. Given how the data is China—based on an individual post. This task re- extracted, we approach the task on a per- veals two sub-goals. First, we want to identify ar- post basis, and we show the difficulties of eas of development in moving to Chinese social the task for both humans and computers. media data. Having no work on identifying gen- Nonetheless, we present encouraging re- der or other author characteristics in Chinese is sults and point to future improvements. a pity, as previous work has debated the impor- 1 Introduction and Motivation tance of character-based n-grams vs. word-based n-grams (cf., e.g., Sapkota et al., 2015), and Chi- Author profiling, the task of determining some de- nese, with greater difficulty in word segmentation, mographic property of a language user (gender, is an excellent proving ground for such issues. In- age, personality, etc.), has become a significant deed, although there are many Chinese syntactic area within NLP and text mining, with many prac- issues to deal with (e.g., the nominal classification tical applications (see, e.g., Rangel et al., 2015), system), we focus our attention on the impact of and it links to a bevy of related subfields (au- different kinds of n-gram models. This is particu- thorship attribution, native language identification, larly relevant in Chinese as characters in the logo- sentiment analysis, etc.) (cf. Argamon et al., graphic (meaning-based) Chinese writing system 2009), in that they share in common methods and mean something different than characters in alpha- features (e.g., lexica, n-grams). Despite obtain- betic systems, and with a larger number of charac- ing promising results in certain tasks with cer- ters there will be sparser n-grams. As a side note, tain data sets (e.g., Schler et al., 2006), the chal- lenges in author profiling increase as the text gets 1http://weibo.com 438 Proceedings of Recent Advances in Natural Language Processing, pages 438–445, Varna, Bulgaria, Sep 4–6 2017. https://doi.org/10.26615/978-954-452-049-6_058 there is work utilizing Weibo for word segmenta- cation. We would thus like to help fill this gap. tion (e.g., Zhang et al., 2013) and sentiment anal- Users are required to specify their gender on ysis (e.g., Zhou, 2015); with our work we pave the Weibo, giving good experimental data (section 2), way for future connections by exploring the im- but making the task less immediately useful.2 The pact of data filtering, preprocessing, and n-gram task is still worth pursuing because the insights are features on system performance. applicable for predicting other demographics (e.g., A second sub-goal is to identify specific difficul- age) and for (current or future) data beyond Weibo. ties and specific opportunities with a per-post (vs. per-user) method of classification, as this means 2 Data we have very little data with which to work, as lit- Weibo (also known as Sina Weibo) is a Chinese tle as a few words (see section 2). Burger et al. microblogging site, with a market penetration sim- (2011) note that their “tweet text classifier’s accu- ilar to the United States’ Twitter. According to racy increases as the number of tweets from the Wikipedia,3 as of the third quarter of 2015, Weibo user increases,” and Nguyen et al. (2014) point out has 222 million subscribers and 100 million daily cases where features associated with one gender users. About 100 million messages are posted each are found in tweets of the opposite gender. We day on Weibo. assess how accurate such a classifier can be, for Weibo implements many features from Twit- humans or machines, and the impact of the choice ter. A user may post with a 140-character limit, of features on classification accuracy. Indeed, we mention or talk to other people using @UserName find automatic classification accuracy no higher formatting, add hashtags with #HashName# for- than 63% on this per-post task (section 4.1)—also matting, follow other users to make their posts true for human accuracy (section 4.3)—and our appear in one’s own timeline, re-post with work suggests that per-post classification should //@UserName similar to Twitter’s retweet function focus on the identification of posts which can be RT @UserName, and select posts for one’s favorites reliably classified rather than on improving over- list. The users of Weibo include Asian celebrities, all accuracy (sections 4.2 and 4.4). movie stars, singers, famous business and media Given short messages and associated data spar- figures, as well as some famous foreign individu- sity, we take the additional step of investigating als and organizations; like Twitter, Sina Weibo has the role that semantic similarity methods can have a verification program for known people and orga- in classification. The popularity of word2vec in nizations. recent years comes from the weakness of tradi- URLs are automatically shortened using the do- tional bag-of-words model: words are represented main name t.cn like Twitter’s t.co. Official as isolated indices, and the vectors to represent a and third-party applications make users able to ac- document are often sparse (Mikolov et al., 2013). cess Sina Weibo from other websites or platforms. As mentioned in Mikolov et al. (2013), word2vec In January 2016, Sina Weibo decided to remove captures some syntactic regularities and seman- the 140-character limit for any original posts, and tic similarities. Some research is starting to use users were thereby allowed to post with up to 2000 word2vec for text classification of Chinese texts, characters, while the 140-character limit was still especially for sentiment analysis (Su et al., 2014; applicable to re-posts and comments. Bai et al., 2014; Zhang et al., 2015), and we would like to explore the use of word2vec techniques in 2.1 Collection the gender classification task. Because of the at- We collected random Weibo users’ posts in Febru- tributes of Chinese characters, the “word” vectors ary and March 2015, using the Weibo developer could be built based on characters or words. While API.4 The API allows one to get 200 recent pub- most approaches train the vectors based on Chi- lic posts without user authorization. Since short nese words, with word segmentation done before- lag times might lead to duplicate documents, we hand, there is also research using character repre- sentations (Sun et al., 2014). However, the differ- 2Users may of course falsely report their gender, leaving ence between character and word vectors in Chi- some noise in the data; an accurate classifier may in the long run help pinpoint such misreporting. nese is not clear, and no one has conducted a de- 3https://en.wikipedia.org/wiki/Sina_Weibo tailed comparison between them for text classifi- 4http://open.weibo.com 439 called the API every three minutes. Thus, the col- Char. Word lected data are organized by time (i.e., per-post), Uni. 6,463 81,280 not by user. Bi. 333,748 517,639 Tri. 987,781 877,571 2.2 Filtering Total 1,327,992 1,476,490 We filter posts from users with more than 500,000 followers, since these accounts are often main- Table 1: Number of n-grams in training tained by organizations or public relations teams (e.g., for celebrities), which produce very differ- ford Chinese Word Segmenter (Tseng et al., 2005). ent contents than for ordinary users. We also try to Although developed for well-edited data, hand- filter posts containing headline news, usually start- examination reveals no major issues for posts; ing with “【”. importantly, the segmenter is at least consistent Some posts on Weibo are written in languages across posts.