Neural Network Prediction of Censorable Language
Kei Yin Ng, Anna Feldman, Jing Peng, Chris Leberknight
Montclair State University, Montclair, New Jersey, USA
{ngk2, feldmana, pengj, [email protected]}

Abstract

Internet censorship imposes restrictions on what information can be publicized or viewed on the Internet. According to Freedom House's annual Freedom on the Net report, more than half of the world's Internet users now live in a place where the Internet is censored or restricted. China has built the world's most extensive and sophisticated online censorship system. In this paper, we describe a new corpus of censored and uncensored social media tweets from a Chinese microblogging website, Sina Weibo, collected by tracking posts that mention 'sensitive' topics or are authored by 'sensitive' users. We use this corpus to build a neural network classifier to predict censorship. Our model achieves 88.50% accuracy using only linguistic features. We discuss these features in detail and hypothesize that they could potentially be used for censorship circumvention.

Proceedings of the Third Workshop on Natural Language Processing and Computational Social Science, pages 40-46, Minneapolis, Minnesota, June 6, 2019. © 2019 Association for Computational Linguistics.

1 Introduction

Free flow of information is absolutely necessary for any democratic society. Unfortunately, political censorship exists in many countries, whose governments attempt to conceal or manipulate information to make sure their citizens are unable to read or express views that are contrary to those in power. One such example is Sina Weibo, a Chinese microblogging website. It was launched in 2009 and became the most popular social media platform in China. Sina Weibo has over 431 million monthly active users[1]. In cooperation with the ruling regime, Weibo sets strict controls over the content published under its service. According to Zhu et al. (2013), Weibo uses a variety of strategies to target censorable posts, ranging from keyword list filtering to individual user monitoring. Among all posts that are eventually censored, nearly 30% are censored within 5-30 minutes, and nearly 90% within 24 hours (Zhu et al., 2013). Research shows that some censorship decisions are not necessarily driven by criticism of the state (King et al., 2013), the presence of controversial topics (Ng et al., 2018a,b), or posts that describe negative events. Rather, censorship is triggered by other factors, such as, for example, collective action potential (King et al., 2013).

The goal of this paper is to compare censored and uncensored posts that contain the same sensitive keywords and topics. Using the linguistic features extracted, a neural network model is built to explore whether censorship decisions can be deduced from the linguistic characteristics of the posts.

[1] https://www.investors.com/news/technology/weibo-reports-first-quarter-earnings/

2 Previous Work

There have been significant efforts to develop strategies to detect and evade censorship. Most work, however, focuses on exploiting technological limitations of existing routing protocols (Leberknight et al., 2012; Katti et al., 2005; Levin et al., 2015; McPherson et al., 2016; Weinberg et al., 2012). Research that pays more attention to the linguistic properties of online censorship in the context of censorship evasion includes, for example, Safaka et al. (2016), who apply linguistic steganography to circumvent censorship. Lee (2016) uses parodic satire to bypass censorship in China and claims that this stylistic device delays and often evades censorship. Hiruncharoenvate et al. (2015) show that the use of homophones of censored keywords on Sina Weibo could help extend the time a Weibo post remains available online. All these methods rely on a significant amount of human effort to interpret and annotate texts to evaluate the likelihood of censorship, which might not be practical for common Internet users in real life.

There has also been research that uses linguistic and content clues to detect censorship. Knockel et al. (2015) and Zhu et al. (2013) propose detection mechanisms to categorize censored content and automatically learn keywords that get censored. King et al. (2013) in turn study the relationship between political criticism and the chance of censorship. They come to the conclusion that posts that have Collective Action Potential get deleted by the censors even if they support the state. Bamman et al. (2012) uncover a set of politically sensitive keywords and find that the presence of some of them in a Weibo blogpost contributes to a higher chance of the post being censored. Ng et al. (2018b) also target a set of topics that have been suggested to be sensitive, but unlike Bamman et al. (2012), they cover areas not limited to politics. Ng et al. (2018b) investigate how the textual content as a whole might be relevant to censorship decisions when both the censored and uncensored blogposts include the same sensitive keyword(s).

3 Tracking Censorship

Tracking censorship topics on Weibo is a challenging task due to the transient nature of censored posts and the scarcity of censored data from well-known sources such as FreeWeibo[2] and WeiboScope[3]. The most straightforward way to collect data from a social media platform is to make use of its API. However, Weibo imposes various restrictions on the use of its API[4], such as restricted access to certain endpoints and a restricted number of posts returned per request. Above all, the Weibo API does not provide any endpoint that allows easy and efficient collection of the target data of this paper (posts that contain sensitive keywords). Therefore, an alternative method is needed to track censorship for our purpose.

[2] https://freeweibo.com/
[3] http://weiboscope.jmsc.hku.hk/
[4] https://open.weibo.com/wiki/API文档/en

4 Data Collection

4.1 Web Scraping

topic                     censored   uncensored
cultural revolution           55         66
human rights                  53         71
family planning               14         28
censorship & propaganda       32         56
democracy                    119        107
patriotism                    70        105
China                        186        194
Trump                        320        244
Meng Wanzhou                  55         76
kindergarten abuse            48          5
Total                        952        952

Table 1: Data collected by scraper for classification

4.2 Decoding Censorship

According to Zhu et al. (2013), the unique ID of a Weibo post is the key to distinguishing whether a post has been censored by Weibo or has instead been removed by the author himself. If a post has been censored by Weibo, querying its unique ID through the API returns the error message "permission denied" (system-deleted), whereas a user-removed post returns the error message "the post does not exist" (user-deleted). However, since the Topic Timeline (the data source of our web scraper) can be accessed only on the front-end (i.e., there is no API endpoint associated with it), we rely on both the front-end and the API to identify system- and user-deleted posts. It is not possible to distinguish the two types of deletion by directly querying the unique IDs of all scraped posts because, through empirical experimentation, uncensored posts and censored (system-deleted) posts both return the same error message ("permission denied"). Therefore, we need to first check whether a post still exists on the front-end, and then send an API request using the unique ID of a post that no longer exists to determine whether it has been deleted by the system or by the user.

The steps to identify the censorship status of each post are illustrated in Figure 1. First, we check whether a scraped post is still available by visiting the user interface of each post. This is carried out automatically in a headless browser 2 days after a post is published. If a post has been removed (either by the system or by the user), the headless browser is redirected to an interface that says "the page doesn't exist"; otherwise, the browser brings us to the original interface that displays the post content. Next, after 14 days, we use the same method as in step 1 to check the posts' status again. This step allows our dataset to include posts that have been removed at a later stage. Finally, we send a follow-up API query using the unique IDs of posts that no longer exist on the browser in steps 1 and 2 to determine censorship status, using the same decoding techniques proposed by Zhu et al. (2013) as described above. Altogether, around 41 thousand posts are collected, of which 952 posts (2.28%) are censored by Weibo.

Figure 1: Logical flow to determine censorship status

4.3 Pre-existing Corpus

Zhu et al. (2013) collected over 2 million posts published by a set of around 3,500 sensitive users during a 2-month period in 2012. We extract around 20 thousand text-only posts using 64 keywords across 26 topics (which partially overlap with those of the scraped data; see Table 3) and filter out all duplicates. Among the extracted posts, 930 (4.63%) are censored by Weibo, as verified by Zhu et al. (2013). The extracted data from Zhu et al. (2013) are also used in building classification models.

topic                     censored   uncensored
cultural revolution           19         29
human rights                  16         10
family planning                4          4
censorship & propaganda       47         38
democracy                     94         53
patriotism                    46         30
China                        300        458
Bo Xilai                       8          8
brainwashing                  57          3
emigration                    10         11
June 4th                       2          5
food & env. safety            14         17
wealth inequality              2          4
protest & revolution           4          5
stability maintenance         66         28
political reform              12          9
territorial dispute           73         75
Dalai Lama                     2          2
HK/TW/XJ issues                2          4
political dissidents           2          2
Obama                          8         19
USA                           62         59
communist party               37         10
freedom                       12         11
economic issues               31         37
Total                        930        930

Table 3: Data extracted from Zhu et al.

dataset                            N     H          features          accuracy
baseline                                                              49.98
human baseline (Ng et al., 2018b)                                     63.51
scraped                            500   50,50,50   Seed 1            80.36
scraped                            800   60,60,60   Seed 1            80.2
Zhu et al.'s                       800   50,7       Seed 1            87.63
Zhu et al.'s                       800   30,30      Seed 1            86.18
both                               800   60,60,60   Seed 1            75.4
both                               500   50,50,50   Seed 1            73.94
scraped                            800   30,30,30   all except LIWC   72.95
Zhu et al.'s                       800   60,60,60   all except LIWC   70.64
both                               500   40,40,40   all except LIWC   84.67
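The decision logic of the decoding procedure in Section 4.2 can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function name `classify_post` and its arguments are assumptions, with the front-end availability check and the API query abstracted into plain parameters; only the two error strings come from Zhu et al. (2013) as quoted in the text.

```python
from typing import Optional

# Hypothetical sketch of the censorship-status decoding in Section 4.2.
# frontend_alive: result of visiting the post in a headless browser
#                 (2 days and again 14 days after publication).
# api_error:      error message returned when querying the post's unique
#                 ID through the API, or None if no query was needed.
def classify_post(frontend_alive: bool, api_error: Optional[str]) -> str:
    """Combine the front-end check with the API error message."""
    if frontend_alive:
        return "uncensored"        # post is still visible on the front-end
    if api_error == "permission denied":
        return "censored"          # system-deleted (Zhu et al., 2013)
    if api_error == "the post does not exist":
        return "user-deleted"      # removed by the author
    return "unknown"

print(classify_post(False, "permission denied"))  # censored
```

The front-end check comes first because, as the paper notes, uncensored and system-deleted posts both return "permission denied" when queried directly by ID, so the API response alone is only informative for posts already known to be missing from the front-end.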
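The network configurations in the results table (column H lists hidden layer sizes such as 50,50,50) correspond to a standard feed-forward classifier. The sketch below is an assumption for illustration only: the paper does not specify the framework or activation functions, the 100-dimensional input is a stand-in for the actual linguistic feature vectors (e.g., LIWC categories), and the data here is random rather than real Weibo posts.

```python
import numpy as np

# Illustrative feed-forward network with three hidden layers of 50 units,
# matching one H configuration in the results table. All details beyond the
# layer sizes (ReLU, sigmoid, 100 input features) are assumptions.
rng = np.random.default_rng(0)

def init_layers(sizes):
    """Create (weights, biases) pairs for consecutive layer sizes."""
    return [(rng.normal(0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x, layers):
    """ReLU hidden layers; sigmoid output = P(post is censored)."""
    for w, b in layers[:-1]:
        x = np.maximum(0, x @ w + b)
    w, b = layers[-1]
    return 1 / (1 + np.exp(-(x @ w + b)))

layers = init_layers([100, 50, 50, 50, 1])   # 100 input features (assumed)
probs = forward(rng.normal(size=(4, 100)), layers)
print(probs.shape)  # (4, 1): one censorship probability per post
```

A post would be labeled censored when its output probability exceeds 0.5; the ~50% baseline in the table reflects the balanced censored/uncensored splits of Tables 1 and 3.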