Tracking and Quantifying Censorship on a Chinese Microblogging Site
Total Page:16
File Type:pdf, Size:1020Kb
Tracking and Quantifying Censorship on a Chinese Microblogging Site Tao Zhu David Phipps Adam Pridgen Jedidiah R. Crandall Dan S. +allach [email protected] Computer Science Computer Science Computer Science Computer Science Independent Researcher Bowdoin College Rice Uni"ersity Uni"ersity of Ne% Mexico Rice Uni"ersity Abstract—We present measurements and analysis of censor- Since the media is heavily controlled in China, some current ship on Weibo, a popular microblogging site in China. Since e"ents do not appear in traditional media, and are not as we were limited in the rate at which we could download posts, prominent in social media as less sensiti"e topics are. +e we identified users likely to participate in sensitive topics and recursively followed their social contacts. We also leveraged new implemented a system which can find these current e"ents and natural language processing techniques to pick out trending the related discussions in social media before such content is topics despite the use of neologisms, named entities, and informal remo"ed for censorship reasons. +e ha"e de"eloped a user language usage in Chinese social media. tracking method for -nding sensiti"e topic posts before they We found that Weibo dynamically adapts to the changing are deleted. +e also apply no"el natural language processing interests of its users through multiple layers of filtering. he filtering includes both retroactively searching posts by keyword methods. To our kno%ledge, ours is the first system that can or repost links to delete them, and rejecting posts as they are successfully extract topics from posts that use unknown %ords posted. The trend of sensitive topics is short-lived, suggesting or phrases in Chinese social media. +e also document "arious that the censorship is effective in stopping the “viral” spread proacti"e and reacti"e filtering and deletion methods applied of sensitive issues. We also give evidence that sensitive topics in by +eibo, and find that these mechanisms are effecti"e at Weibo only scarcely propagate beyond a core of sensitive posters. restricting discussion of sensiti"e topics to a core set of posters. I. I)T ,!'CTI,) The rest of this paper is structured as follows. Section II Many countries %ant to ha"e the benefits of the Internet gi"es some basic background information about microblogging while also controlling the dissemination of content. There and Internet censorship in China. Then Section III describes are many reasons for censoring the Internet, but a major the algorithms and methods we used for our measurement and reason is to pre"ent ci"il unrest [1]. Censorship is difficult analysis, followed by Section IV that describes the natural to maintain at scale 03]. Social media companies in countries language processing %e applied to the data and presents that tightly control Internet content find themselves in a risky results from topical analysis. Section ? describes the "arious position. +ithout li"ely discussions, they will not ha"e user mechanisms that +eibo uses to filter sensiti"e content. Then, engagement and the advertising re"enues that come with it. after discussion in Section VI, we conclude. If discussions turn to sensiti"e topics, they risk go"ernment intervention that could ha"e negati"e consequences for the II. $&C@> ,')! company. Consequently. many +estern social network sites, China has in"ested heavily in a mixture of Internet filtering such as Twitter or 5acebook, are una"ailable inside nations technologies 0A2. 0B2. 0C2. [7], cyber police surveillance 0D2. 0E2. such as China, and homegrown competitors, such as Sina 0162. and policy 0112. 0132 in order to tightly control access +eibo, ha"e de"eloped sophisticated censorship systems. to, and the spread of, Internet content. +eibo is the most popular microblogging site in China. Starting from 3616. %hen microblogs debuted in China, In February 3613. it had o"er 766 million users [3]. Like not only ha"e there been many top news stories where the Twitter in other countries, +eibo plays an important role reporting %as dri"en by social media, but social media has in the discourse surrounding current e"ents in China. Both also been part of the story itself for a number of prominent professional reporters and amateurs can provide immediate, e"ents, including the protests of +ukan 0172. the Deng Fujiao first-hand accounts and opinions of e"ents as they unfold. :"en incident 01A2. the Fao Jiaxin murder case 01B2 and the recent though a user might normally only see messages from other Shifang protest [16]. There ha"e also been e"ents where social users that they directly follo%. any message can rapidly “go media has forced the go"ernment to address issues directly. viral” by being shared or “retweeted” from one user to the such as the Beijing rainstorms in July. next. 5or companies such as Twitter or 5acebook, viral posts Sina +eibo Gweibo.com. referred to in this paper simply impose technical challenges to the scalability and response as ;+eibo”) has the most acti"e user community of any time of their service. To +eibo and other such firms, they microblog site in China [17]. +eibo provides services which create huge censorship challenges as well. +eibo must conduct are similar to Twitter. with @usernames, #hashtags, reposting, its own censorship rather than relying on external mechanisms, and ' 8 shortening. Posts can also contain 1A6 UTF-8 char9 such as China’s “Great Fire%all.< acters. Because %ords in written Chinese often only require two characters, 1A6 Chinese characters can ha"e 79B times that were later allowed to be searched. From the search results, as much information as 1A6 characters in English. T%o other we picked 3B users who stood out for posting about sensiti"e major differences are that +eibo allows embedded images and topics. videos, and users can comment on each others’ posts. Next, we needed to broaden our search to a larger group There is relati"ely little research on Chinese social me- of users. +e assumed that anybody who has been reposted dia 01D2. despite the fact that English-speaking social media more than -"e times by our sensiti"e users must be sensiti"e scholars perform research on Russian and Persian language as well. +e followed them for a period of time and measured online corpora e"en though they cannot speak Russian or ho% often their posts %ere deleted. Any user with more than Persian (see, e.g.. 019]). One of many concerns that can hinder B deleted posts %as considered to be sensiti"e. By expanding this %ork is the general difficulty of mechanically processing outward in this fashion. after -"e days of crawling we settled Chinese text. +estern speakers (and algorithms) expect %ords on a group of o"er 7.B66 sensiti"e +eibo users, collecti"ely to be separated by whitespace or punctuation. In written experiencing more than 7.666 deletions per day (roughly AM Chinese, howe"er. there are no such %ord boundary delimiters. of the total posts from those users). The %ord segmentation problem in Chinese is exacerbated While our methodology cannot be considered to yield a by the existence of unknown %ords such as named entities representati"e sample of +eibo users o"erall, we belie"e it is (people, companies, movies, etc.) or neologisms (substituting representati"e of ho% users who discuss sensiti"e topics will characters that appear similar to others, or otherwise coining experience +eibo’s censorship. +e also belie"e our sample ne% euphemisms or expressions, to defeat keyword-based is broad enough to also speak to the topics that +eibo is censorship) [20]. Furthermore, since social media is heavily censoring on any gi"en day. centered around current e"ents, it may well contain ne% named entities that will not appear in any static lexicon [21]. B. Crawling Despite these concerns, +eibo censorship has been the Once we settled on our list of users to follo%. we %anted to subject of previous research. Bamman et al. 0332 performed a follo% them with suf-cient fidelity to see posts as they were statistical analysis of deleted posts, showing that the presence made and measure ho% long they last prior to being deleted. of some sensiti"e terms indicated a higher probability of the +e use two APIs pro"ided by +eibo, allowing us to query deletion of a post. Their %ork also showed some geographic individual user timelines as well as the public timeline [25]. patterns in post deletion, with posts from the provinces of Starting in July 3613. %e queried each of our 7B66 users, once Tibet and Qinghai exhibiting a higher deletion rate than other per minute, for which +eibo returns the most recent B6 posts. provinces. +e also queried the public timeline roughly once e"ery four +eiboScope 0372 %as de"eloped at the Uni"ersity of Hong seconds, for which +eibo returns the most recent 366 posts. @ong in order to track trends on +eibo, concurrently with These posts appear to be 39B minutes older than real-time. +e our own study of +eibo. The core difference between our belie"e they represent anywhere from 1.7% to 16M of the full %ork and +eiboScope is that they track a large sample: around public timeline Gi.e.. we seem to obtain more complete traces 766 thousand users who each ha"e more than 1666 followers, at night, when many of +eibo’s users are sleeping and the allowing them to observe usually no more than 166 deletions posting rate is correspondingly lower). per day. +e follo% a much smaller sample of users, as we will +eibo appears to enforce rate limits both on individual describe shortly.