Gender Prediction for Chinese Social Media Data

Wen Li, Department of Linguistics, Indiana University, Bloomington, IN, USA
Markus Dickinson, Department of Linguistics, Indiana University, Bloomington, IN, USA
Abstract

Social media provides users a platform to publish messages and socialize with others, and microblogs have gained more users than ever in recent years. With such usage, user profiling is a popular task in computational linguistics and text mining. Different approaches have been used to predict users' gender, age, and other information, but most of this work has been done on English and other Western languages. The goal of this project is to predict the gender of users based on their posts on Weibo, a Chinese micro-blogging platform. Given issues in Chinese word segmentation, we explore character and word

shorter and noisier, as with social media data, given that there seem to be fewer and less reliable indicators of a demographic trait (e.g., Zhang and Zhang, 2010; Burger et al., 2011), in addition to the fact that many users produce language atypical of their demographic (Bamman et al., 2014; Nguyen et al., 2014). This problem is potentially compounded when examining languages such as Chinese, where: a) the definition of a word is problematic (Sproat et al., 1996); b) the collection of data with links to individual users is challenging, since Weibo (see below) requires users' authorization before data collection; and c) there has been no published work (we are aware of) on this task, most work focusing on English and to some extent other Western languages (Rangel et al., 2015; Nguyen et al., 2013).