Detecting Media Self-Censorship Without Explicit Training Data

Detecting Media Self-Censorship without Explicit Training Data Rongrong Tao ∗ Baojian Zhou y Feng Chen z David Mares x Patrick Butler ∗ Naren Ramakrishnan ∗ Ryan Kennedy { Abstract jailed annually 2. The motives and means of explicit state censorship have One of the responses to this stifling environmen- been well studied, both quantitatively and qualitatively. tal context is self-censorship, i.e., the act of decid- Self-censorship by media outlets, however, has not received ing not to publish about certain topics. However, there nearly as much attention, mostly because it is difficult to is currently no efficient and effective approach to au- systematically detect. We develop a novel approach to tomatically detect and track self-censorship events in identify news media self-censorship by using social media real time. We can draw some parallels to social me- as a sensor. We develop a hypothesis testing framework to dia censorship. Here, censorship often takes the form of identify and evaluate censored clusters of keywords and a active censors identifying offending posts and deleting near-linear-time algorithm (called GraphDPD) to identify them and therefore tracking post deletions supports the the highest scoring clusters as indicators of censorship. We use of supervised learning approaches [1]. On the other evaluate the accuracy of our framework, versus other state- hand, censorship in news media typically has no labeled of-the-art algorithms, using both semi-synthetic and real- information and must rely on unsupervised techniques world data from Mexico and Venezuela during Year 2014. instead. These tests demonstrate the capacity of our framework to In this paper, we present a novel unsupervised ap- identify self-censorship, and provide an indicator of broader proach that views social media as a sensor to detect media freedom. The results of this study lay the foundation censorship in news media wherein statistically signifi- for detection, study, and policy-response to self-censorship. cant differences between information published in the news media and the correlated information published 1 Introduction in social media are automatically identified as can- didate censored events. A generalized log-likelihood News media censorship is generally defined as a restric- ratio test (GLRT) statistic is formulated for hypoth- tion on freedom of speech to prohibit access to public esis testing, and the problem of censorship detec- information, and is taking place more than ever before. tion is cast as the maximization of the GLRT statis- According to the Freedom of the Press Report, 40.4 per- tic over all possible clusters of keywords. We pro- cent of nations fit into the \free" category in 2003. By pose a near-linear-time algorithm called GraphDPD 2014, this global percentage fell to 32 percent 1. Hun- to identify the highest scoring clusters as indicators of dreds of journalists are jailed every year, according to censorship events in the local news media, and further the Committee to Protect Journalists. In fact, in the apply randomization testing to estimate the statistical past three years, more than 200 journalists have been significance of these clusters. We consider the detection of censorship in the news media of Mexico and Venezuela, and utilize Twitter as the uncensored source. 1Virginia Tech, VA, USA [email protected], [email protected], Starting in January 2012, a \Country-Withheld [email protected] Content" policy has been launched by Twitter, with 2University at Albany, SUNY, Albany, NY, USA which governments are able to request withholding and [email protected] deletion of user accounts and tweets. At the same time, 3 University of Texas at Dallas, Richardson, TX, USA Twitter started to release a transparency report, which [email protected] 4University of California at San Diego, San Diego, CA, USA provided worldwide information about such removal [email protected] requests. The Transparency Report lists information 5University of Houston, Houston, TX, USA [email protected] 1https://freedomhouse.org/report/freedom-press/freedom- 2http://saccityexpress.com/defending-freedom-of- press-2014 speech/#sthash.cbI7lWbw.dpbs Copyright c 2020 by SIAM Unauthorized reproduction of this article is prohibited Table 1: Summary of Twitter Transparency Report for 2 Related Work Year 2014 on selected countries. Previous studies have focused on explicit censorship Requests Requests of social media posts, especially in Turkey. Turkey is (Govt, Accounts Tweets Country (Court Police, Withheld Withheld Order) the country issuing the largest number of censorship etc.) Argentina 0 1 0 0 requests to Twitter (see Table 1). The authors in [14] Australia 0 0 0 0 Brazil 35 0 5 101 studied patterns of Twitter censorship by collecting 20 Colombia 0 1 0 0 Greece 0 3 0 0 million Turkish tweets and applying topic extraction Japan 6 21 0 43 Mexico 0 2 0 0 and clustering on those that were censored. They found Turkey 393 270 79 2; 003 Venezuela 0 0 0 0 that the vast bulk of censored tweets contained political information critical of the Turkish government. and removal requests from Year 2012 on a half-year These methods are not easily adapted for the study basis. Table 1 summarizes the information and removal of self-censorship, since they require that the story or requests for Year 2014 on our selected countries. For post be published (or submitted) and removed, allowing Mexico and Venezuela, Twitter did not participate in for direct observation of explicit censorship. To detect any social media censorship. Based on this observation, self-censorship using social media, we need to be able to Twitter can be considered as a reliable and uncensored detect major events in social media apriori, i.e. events source to detect news self censorship events in Latin the media would have reported with a high likelihood America. The main contributions of this paper are if not for self-imposed restrictions. The detection of summarized as follows: such events has largely been done in the field of event detection. [16], for example, developed a system which • Analysis of censorship patterns between news identifies tweets posted closely in time and location, and media and Twitter: We carried out an extensive determined whether they are mentions of the same event analysis of information in Twitter deemed relevant by co-occurring keywords. [12] presented the first open- to censored information in news media. In doing so, domain system for event extraction and an approach we make important observations that highlight the to classify events based on latent variable models. [13] importance of our work. formulated event detection in activity networks as a graph mining problem and proposed effective greedy • Formulation of an unsupervised censorship de- approaches to solve this problem. In addition to textual tection framework: We propose a novel hypothesis- information, [4] proposed an event detection method testing-based statistical framework for detecting clus- which utilizes visual content and intrinsic correlation ters of co-occurred keywords that demonstrate statis- in social media. tically significant differences between the information We must be careful not to overstate the utility of published in news media and the correlated informa- social media for detecting major events. Analysis of tion published in a uncensored source (e.g., Twitter). the coverage of various topics across social media and To the best of our knowledge, this is the first unsuper- news media have found many similarities, but also some vised framework for automatic detection of censorship systematic differences. [10] studied topic and timing events in news media. overlapping in newswire and Twitter and concluded that Twitter covers not only topics reported by news • Optimization algorithms: The inference of our media during the same time period, but also minor proposed framework involves the maximization of topics ignored by news media. Through analysis of a GLRT statistic function over all clusters of co- hundreds of news events, [9] observed both similarities occurred keywords, which is hard to solve in general. and differences of coverage of events between social We propose a novel approximation algorithm to solve media and news media. this problem in nearly linear time. • Extensive experiments to validate the pro- 3 Data Analysis posed techniques: We conduct comprehensive ex- Table 3 summarizes the notation used in this work. The periments on real-world Twitter and News datasets. EMBERS project [11] provided a collection of Latin The results demonstrate that our proposed approach American news articles and Twitter posts. The news outperforms existing techniques in the accuracy of dataset was sourced from around 6000 news agencies censorship detection. In addition, we perform case during 2014 across the world. From \4 International studies on the detected censorship patterns and ana- Media & Newspapers", we retrieved a list of top news- lyze the reasons behind censorship. papers with their domain names in the target country. News articles are filtered based on the domain names Copyright c 2020 by SIAM Unauthorized reproduction of this article is prohibited in the URL links. Twitter data was collected by ran- across Twitter in Mexico while Mexican news outlets do domly sampling 10% (by volume) tweets during Year not depict significant changes. 2014. Mexico and Venezuela were chosen as target Topic is of interest only in news media: In countries in this work since they had no censorship in late September 2014, heads of state and governments Twitter (as shown in Table 1) but severe level of cen- attended the global Climate Summit. This incident is sorship in news media according to the Freedom of the widely discussed in news media, while relatively less at- Press Report. tention in social media. Topic is censored in news media: Fig. 1 com- 3.1 Data Preprocessing The inputs to our pro- pares TSDF in El Mexicano Gran Diario Regional (el- posed approach are keyword co-occurrence graphs.

Load more