Detecting Media Self- without Explicit Training Data

Rongrong Tao ∗ Baojian Zhou † Feng Chen ‡ David Mares § Patrick Butler ∗ Naren Ramakrishnan ∗ Ryan Kennedy ¶

Abstract jailed annually 2. The motives and means of explicit state censorship have One of the responses to this stifling environmen- been well studied, both quantitatively and qualitatively. tal context is self-censorship, i.e., the act of decid- Self-censorship by media outlets, however, has not received ing not to publish about certain topics. However, there nearly as much attention, mostly because it is difficult to is currently no efficient and effective approach to au- systematically detect. We develop a novel approach to tomatically detect and track self-censorship events in identify news media self-censorship by using real time. We can draw some parallels to social me- as a sensor. We develop a hypothesis testing framework to dia censorship. Here, censorship often takes the form of identify and evaluate censored clusters of keywords and a active censors identifying offending posts and deleting near-linear-time algorithm (called GraphDPD) to identify them and therefore tracking post deletions supports the the highest scoring clusters as indicators of censorship. We use of supervised learning approaches [1]. On the other evaluate the accuracy of our framework, versus other state- hand, censorship in news media typically has no labeled of-the-art algorithms, using both semi-synthetic and real- information and must rely on unsupervised techniques world data from and Venezuela during Year 2014. instead. These tests demonstrate the capacity of our framework to In this paper, we present a novel unsupervised ap- identify self-censorship, and provide an indicator of broader proach that views social media as a sensor to detect media freedom. The results of this study lay the foundation censorship in news media wherein statistically signifi- for detection, study, and policy-response to self-censorship. cant differences between information published in the news media and the correlated information published 1 Introduction in social media are automatically identified as can- didate censored events. A generalized log-likelihood News media censorship is generally defined as a restric- ratio test (GLRT) statistic is formulated for hypoth- tion on to prohibit access to public esis testing, and the problem of censorship detec- information, and is taking place more than ever before. tion is cast as the maximization of the GLRT statis- According to the Report, 40.4 per- tic over all possible clusters of keywords. We pro- cent of nations fit into the “free” category in 2003. By pose a near-linear-time algorithm called GraphDPD 2014, this global percentage fell to 32 percent 1. Hun- to identify the highest scoring clusters as indicators of dreds of journalists are jailed every year, according to censorship events in the local news media, and further the Committee to Protect Journalists. In fact, in the apply randomization testing to estimate the statistical past three years, more than 200 journalists have been significance of these clusters. We consider the detec- tion of censorship in the news media of Mexico and Venezuela, and utilize as the uncensored source. 1Virginia Tech, VA, USA {[email protected], [email protected], Starting in January 2012, a “Country-Withheld [email protected]} Content” policy has been launched by Twitter, with 2University at Albany, SUNY, Albany, NY, USA which governments are able to request withholding and {[email protected]} deletion of user accounts and tweets. At the same time, 3 University of Texas at Dallas, Richardson, TX, USA Twitter started to release a transparency report, which {[email protected]} 4University of California at San Diego, San Diego, CA, USA provided worldwide information about such removal {[email protected]} requests. The Transparency Report lists information 5University of Houston, Houston, TX, USA {[email protected]} 1https://freedomhouse.org/report/freedom-press/freedom- 2http://saccityexpress.com/defending-freedom-of- press-2014 speech/#sthash.cbI7lWbw.dpbs

Copyright c 2020 by SIAM Unauthorized reproduction of this article is prohibited Table 1: Summary of Twitter Transparency Report for 2 Related Work Year 2014 on selected countries. Previous studies have focused on explicit censorship

Requests Requests of social media posts, especially in . Turkey is (Govt, Accounts Tweets Country (Court Police, Withheld Withheld Order) the country issuing the largest number of censorship etc.) Argentina 0 1 0 0 requests to Twitter (see Table 1). The authors in [14] Australia 0 0 0 0 Brazil 35 0 5 101 studied patterns of Twitter censorship by collecting 20 Colombia 0 1 0 0 Greece 0 3 0 0 million Turkish tweets and applying topic extraction Japan 6 21 0 43 Mexico 0 2 0 0 and clustering on those that were censored. They found Turkey 393 270 79 2, 003 Venezuela 0 0 0 0 that the vast bulk of censored tweets contained political information critical of the Turkish government. and removal requests from Year 2012 on a half-year These methods are not easily adapted for the study basis. Table 1 summarizes the information and removal of self-censorship, since they require that the story or requests for Year 2014 on our selected countries. For post be published (or submitted) and removed, allowing Mexico and Venezuela, Twitter did not participate in for direct observation of explicit censorship. To detect any social media censorship. Based on this observation, self-censorship using social media, we need to be able to Twitter can be considered as a reliable and uncensored detect major events in social media apriori, i.e. events source to detect news self censorship events in Latin the media would have reported with a high likelihood America. The main contributions of this paper are if not for self-imposed restrictions. The detection of summarized as follows: such events has largely been done in the field of event detection. [16], for example, developed a system which • Analysis of censorship patterns between news identifies tweets posted closely in time and location, and media and Twitter: We carried out an extensive determined whether they are mentions of the same event analysis of information in Twitter deemed relevant by co-occurring keywords. [12] presented the first open- to censored information in news media. In doing so, domain system for event extraction and an approach we make important observations that highlight the to classify events based on latent variable models. [13] importance of our work. formulated event detection in activity networks as a graph mining problem and proposed effective greedy • Formulation of an unsupervised censorship de- approaches to solve this problem. In addition to textual tection framework: We propose a novel hypothesis- information, [4] proposed an event detection method testing-based statistical framework for detecting clus- which utilizes visual content and intrinsic correlation ters of co-occurred keywords that demonstrate statis- in social media. tically significant differences between the information We must be careful not to overstate the utility of published in news media and the correlated informa- social media for detecting major events. Analysis of tion published in a uncensored source (e.g., Twitter). the coverage of various topics across social media and To the best of our knowledge, this is the first unsuper- news media have found many similarities, but also some vised framework for automatic detection of censorship systematic differences. [10] studied topic and timing events in news media. overlapping in newswire and Twitter and concluded that Twitter covers not only topics reported by news • Optimization algorithms: The inference of our media during the same time period, but also minor proposed framework involves the maximization of topics ignored by news media. Through analysis of a GLRT statistic function over all clusters of co- hundreds of news events, [9] observed both similarities occurred keywords, which is hard to solve in general. and differences of coverage of events between social We propose a novel approximation algorithm to solve media and news media. this problem in nearly linear time.

• Extensive experiments to validate the pro- 3 Data Analysis posed techniques: We conduct comprehensive ex- Table 3 summarizes the notation used in this work. The periments on real-world Twitter and News datasets. EMBERS project [11] provided a collection of Latin The results demonstrate that our proposed approach American news articles and Twitter posts. The news outperforms existing techniques in the accuracy of dataset was sourced from around 6000 news agencies censorship detection. In addition, we perform case during 2014 across the world. From “4 International studies on the detected censorship patterns and ana- Media & Newspapers”, we retrieved a list of top news- lyze the reasons behind censorship. papers with their domain names in the target country. News articles are filtered based on the domain names

Copyright c 2020 by SIAM Unauthorized reproduction of this article is prohibited in the URL links. Twitter data was collected by ran- across Twitter in Mexico while Mexican news outlets do domly sampling 10% (by volume) tweets during Year not depict significant changes. 2014. Mexico and Venezuela were chosen as target Topic is of interest only in news media: In countries in this work since they had no censorship in late September 2014, heads of state and governments Twitter (as shown in Table 1) but severe level of cen- attended the global Climate Summit. This incident is sorship in news media according to the Freedom of the widely discussed in news media, while relatively less at- Press Report. tention in social media. Topic is censored in news media: Fig. 1 com- 3.1 Data Preprocessing The inputs to our pro- pares TSDF in El Mexicano Gran Diario Regional (el- posed approach are keyword co-occurrence graphs. mexicano.com.mx) and TSDF in Twitter during a 2- Each node represents a keyword associated with four month period on a connected set of keywords. All the attributes: (1) time-series daily frequency (TSDF) in example keywords are relevant to the 43 missing stu- Twitter, (2) TSDF in News, (3) expected daily fre- dents from Ayotzinapa in the city of Iguala protesting quency in Twitter, and (4) expected daily frequency in the government’s education reforms. The strong con- News. Each edge represents the co-occurrence of con- nectivity of these keywords, as shown in Fig. 1e, guar- necting nodes in Twitter, or News, or both. However, antees that they are mentioned together frequently in constructing such graphs is not trivial due to data in- Twitter and local news media. The time region dur- tegration. One challenge is to handle the different vo- ing which anomalous behavior is detected is highlighted cabularies used in Twitter and News, with underlying with two yellow markers. Since volume of Twitter is distinct distributions. much larger than volume of News, TSDF in Fig. 1a to To find words that behave differently in News com- Fig. 1d are normalized to [0, 500] for visualization. Fig. paring to Twitter, we only retained keywords which are 1a to Fig. 1d depict that TSDF in El Mexicano Gran mentioned in both Twitter and News. For each key- Diario Regional is well correlated with TSDF in Twitter word, linear correlation between its TSDF in Twitter except during the highlighted time region, where abnor- and News during Year 2014 is required to be greater mal absenteeism in El Mexicano Gran Diario Regional than a predefined threshold (e.g. 0.15) in order to guar- can be observed for all example keywords. In order to antee the keyword is well correlated in two data sources. validate the deviation between them is not due to dif- TSDF in Twitter and News for each node are normal- ference in topics of interests, we also compare with a ized with quantile normalization. An edge is removed if number of other local news outlets. Fig. 1a to Fig. 1d its weight is less than Γ, where Γ is the threshold used shows that TSDF in El Universal in Mexico City is con- to tradeoff graph sparsity and connectivity. Empirically sistent with TSDF in Twitter and does not depict an we found Γ = 10 to be an effective threshold. A key- abnormal absenteeism during the highlighted time pe- word co-occurrence graph for a continuous time window riod. Using Twitter and El Universal in Mexico City as is defined as the maximal connected component from a sensors, we can conclude an indicator of self-censorship union of daily keyword co-occurrence graph during the in El Mexicano Gran Diario Regional with respect to the time window. 43 missing students during the highlighted time region. Inspired by these observations, we say that a cen- 3.2 Pattern Analysis It’s challenging to claim that sorship pattern exists if for a cluster of connected any deviation between social media and news media is keywords, evidence of censorship or different topics of interest. Ta- ble 2 summarizes various co-occurring patterns between 1. their TSDF in at least one local news media is Twitter and news media that we are able to observe consistently different from TSDF in Twitter during from our real world dataset and more details are dis- a time period, cussed as follows. Topic is of interest both in social media and news media: In early March 2014, Malaysia Airlines 2. their TSDF in local news media are consistently Flight MH370 disappeared while flying. We are able to well correlated to TSDF in Twitter before the time observe sparks in mentions of relevant keywords across period, and both social media and news media. Topic is of interest only in social media: In 3. their TSDF in at least one different local news late June 2014, there is one soccer game between Mexico outlet does not depict abnormal absenteeism during and Holland during the 2014 FIFA World Cup. We are the time period. able to observe spikes in mentions of relevant keywords

Copyright c 2020 by SIAM Unauthorized reproduction of this article is prohibited Table 2: Different patterns of co-occurrence observed between social media and news media sources.

Topic is of interest in both Topic is of interest in social Topic is of interest in news Censorship in one news me- social media and news me- media but not in news me- media but not in social me- dia source. dia. dia. dia.

Example: Late June 2014 Example: In late September 2014, 125 Example: In late September 2014, Example: In early March featured a soccer game be- heads of state and governments at- 43 students from Ayotzinapa Rural 2014, Malaysia Airlines tween Mexico and Holland as tended the Global Climate Summit, Teachers’ College went missing in Flight MH 370 went missing. part of the 2014 FIFA World which was seen as a milestone to a new Mexico. Cup. legal agreement on climate change.

Table 3: Description of major notation.

Variable Meaning t T time series of daily frequency of node v in uncensored {a (v)}t=1 Twitter dataset expected daily frequency of node v in the Twitter λ (v) a dataset. t T time series of daily frequency of node v in the {b (v)}t=1 censored news dataset λb(v) expected daily frequency of node v in data source b TSDF time series of daily frequency

(a) ayotzinapa (b) desaparecidos (missing)

4 Methodology 4.1 Problem Formulation Suppose we have a dataset of news reports and a dataset of tweets within a shared time period in a country of interest. Each news report or tweet is represented by a set of keywords and is indexed by a time stamp (e.g., day). We model the joint information of news reports and tweets using an (c) normalistas (students undirected keyword co-occurrence graph G = (V, E), trained to become teachers) (d) iguala where V = {1, 2, ··· , n} refers to the ground set of nodes/keywords, n refers to the total number of nodes, and E ⊆ V × V is a set of edges, in which an edge (i, j) indicates that the keywords i and j co-occur in at least one news report or tweet. Each node v ∈ V is associ- t T t T ated with four attributes: {a (v)}t=1, λa(v), {b (v)}t=1, and λb(v) as defined in Table 3. As our study is based on the analysis of correlations between frequencies of keywords in the news and Twitter datasets, we only (e) Left: The strong connectivity of these keywords indicates consider the keywords whose frequencies in these two their frequent co-occurrence in Twitter and News. A larger size of node indicates higher keyword frequency and a larger datasets are well correlated (with correlations above a width of edge indicates more frequently co-occurrence; Right predefined threshold 0.15). Our goal is to detect a clus- : word cloud representing censored keywords in News around 2014-09-26 in Mexico ter (subset) of co-occurred keywords and a time window as an indicator of censorship pattern, such that the dis- Figure 1: Example TSDF in News vs. TSDF in Twitter for tribution of frequencies of these keywords in the news a set of connected keywords. These keywords are relevant dataset is significantly different from that in the Twitter to the 43 missing students from Ayotzinapa Rural Teachers’ dataset. College on Sep 26th, 2014 in Mexico. We can find consistent Suppose the chosen time granularity is day and the censorship pattern in El Mexicano Gran Diario Regional (el- shared time period is {1, ··· ,T }. We consider two hy- mexicano.com.mx) shortly after the students are missing. potheses: under the null (H0), the daily frequencies of each keyword v in the news and Twitter datasets follow two different Poisson distributions with the mean pa-

Copyright c 2020 by SIAM Unauthorized reproduction of this article is prohibited rameters λa(v) and λb(v), respectively; under the alter- Problem 1. [ GLRT Optimization Problem] native (H1(S, R)), there is a connected cluster S of key- Given a keyword co-occurrence graph G(V, E) and a words and a continuous time window R ⊆ {1, ··· ,T }, predefined significance level α, the GLRT optimization in which the daily frequencies of each keyword v in problem is to find the set of highest scoring and sig- the Twitter dataset follow a Poisson with an elevated nificant clusters O. Each cluster in O is denoted as a mean parameter q ·λ (v), but those in the news dataset specific pair of connected subset of keywords (Si ⊆ V) a a and continuous time window (R ⊆ {1, ··· ,T }), in follows a Poisson with a down-scaled mean parameter i which Si is the highest scoring subset within the time qb · λb(v). Formally, they can be defined as follows: window Ri: • Null hypothesis H0: (4.3) maxS⊆V F (S, Ri) s.t. S is connected, t a (v) ∼ Pos(λa(v)), ∀v ∈ V, t ∈ {1, ··· ,T } t and is significant w.r.t the significance level α. b (v) ∼ Pos(λb(v)), ∀v ∈ V, t ∈ {1, ··· ,T }

• Alternative hypothesis H1(S, R): 4.2 GraphDPD Algorithm Our proposed algo- rithm GraphDPD decomposes Problem 1 into a set at(v) ∼ Pos(q · λ (v)) , bt(v) ∼ Pos(q · λ (v)), ∀v ∈ S, t ∈ R a a b b of sub-problems, each of which has a fixed continu- t t a (v) ∼ Pos(λa(v)), b (v) ∼ Pos(λb(v)), ∀v∈ / S or t∈ / R ous time window, as shown in Algorithm 4.1. For where qa > 1, qb < 1, S ⊆ V, the subgraph induced each specific day i (the first day of time window R by S (denoted as GS) must be connected to ensure in Line 6) and each specific day j (the last day of that these keywords are semantically related, and R ⊆ time window R of Line 6), we solve the sub-problem {1, 2, ··· ,T } is a continuous time window defined as (Line 7) with this specific R = {i, i + 1, ··· , j} us- {i, i+1, ··· , j}, 1 ≤ i ≤ j ≤ T . Given the Poisson prob- ing Relaxed-GrapMP algorithm which will be elab- ability mass function denoted as p(x; λ) = λxe−λ/x!, orated later. For each connected subset of keywords a generalized log likelihood ratio test (GLRT) statistic can then be defined to compare these two hypotheses, S returned by Relaxed-GraphMP, its p-value is esti- and has the form: mated by randomization test procedure [7](Line 8). The Q Q t pair (S, R) will be added into the result set O (Line 9) maxq >1 p(a (v); qaλa(v)) F (S, R) = log a t∈R v∈S if its empirical p-value is less than a predefined signifi- Q Q p(at(v); λ (v)) t∈R v∈S a cance level α (e.g., 0.05). The procedure getPValue in Q Q t maxqb<1 t∈R v∈S p(b (v); qbλb(v)) Line 8 refers to a randomization testing procedure based (4.1) + log Q Q t . t∈R v∈S p(b (v); λb(v)) on the input graph G to calculate the empirical p-value In order to maximize the GLRT statistic, we need to of the pair (S, R) [7]. Finally, we return the set O of signifiant clusters as indicators of censorship events in obtain the maximum likelihood estimates of qa and qb, which we set ∂F (S, R)/∂qa = 0 and ∂F (S, R)/∂qb = 0, the news data set. respectively and get the best estimateq ˆa = Ca/Ba of P t Algorithm 4.1. (GraphDPD) 1: Input: Graph qa andq ˆb = Cb/Bb of qb where Ca = v∈S,t∈R a (v), P t P Instance G and significant level α; Cb = b (v), Ba = λa(v), Bb = v∈S,t∈R v∈S,t∈R 2: Output: set of anomalous connected subgraphs O; P λ (v). Substituting q and q with the best v∈S,t∈R b a b 3: O ← ∅; estimationsq ˆ andq ˆ , we obtain the parametric form of a b 4: for i ∈ {1, ··· ,T } do the GLRT statistic as follows: 5: for j ∈ {i + 1, ··· ,T } do (4.2) 6: R ← {i, i + 1, ··· , j} ; // time window R  Ca   Cb  F (S, R) = Ca log + Ba − Ca + Cb log + Bb − Cb 7: S ← ( ,R); Ba Bb Relaxed-GraphMP G 8: if getPValue( , S, R) ≤ α then Given the GLRT statistic F (S, R), the problem of G 9: ← ∪ (S, R); censorship detection can be reformulated as Problem O O 10: end if 1 that is composed of two major components: 1) High- 11: end for est scoring clusters detection. The highest scoring 12: end for clusters are identified by maximizing the GLRT statis- 13: return ; tic F (S, R) over all possible clusters of keywords and O time windows; 2) Statistical significance analysis. Line 7 in Algorithm 4.1 aims to solve an instance of The empirical p-values of the identified clusters are es- Problem 1 given a specific time window R, which is timated via a randomization testing procedure [7], and a set optimization problem subject to a connectivity are returned as significant indicators of censorship pat- constraint. Tung-Wei et. al. [6] proposed an approach terns in the news dataset, if their p-values are below a for maximizing submodular set function subject to predefined significance level (e.g., 0.05). a connectivity constraint on graphs. However, our

Copyright c 2020 by SIAM Unauthorized reproduction of this article is prohibited objective function F (S, R) is non-submodular as shown Algorithm 4.2. (Relaxed-GraphMP) 1: Input: in Theorem 4.1 and this approach is not applicable here. Graph instance G, continous time window R; 2: Output: the co-occurrence subgraph ; Theorem 4.1. [] Given a specific window R, our GS 3: i ← 0; xi ← an initial vector; objective function F (S, R) defined in (4.2) is non- 4: repeat submodular. i ˆ i ∂Fˆ(x ,R) 5: ∇F (x ,R) ← ∂xi by Equation (4.5) ; We propose a novel algorithm named Relaxed- i 6: g ← Head(∇Fˆ(x ,R), G); // Head projection GraphMP to approximately solve Problem 1 in nearly step linear time with respect to the total number of nodes 7: Ω ← supp(g) ∪ supp(xi); in the graph. We first transform the GLRT statistic ˆ 8: b ← arg maxx∈[0,1]n F (x,R) s.t. supp(x) ⊆ Ω; in Equation(4.2) to a vector form. Let x be an n- 9: xi+1 ← Tail(b, ); // Tail projection step T G dimensional vector (x1, x2, ··· , xn) , where xi ∈ {0, 1} 10: i ← i + 1,S ← supp(xi); and xi = 1 if i ∈ S, xi = 0 otherwise. We define 11: until halting condition holds; P, Q, Λa, Λb as follows: 12: return (S, R);

" #T Our proposed algorithm Relaxed-GraphMP de- X t X t T P = a (1), ··· , a (n) , Λa = [λa(1), ··· , λa(n)] , composes Problem 2 into two sub-problems that are t∈R t∈R easier to solve: 1) a single utility maximization prob- " #T lem that is independent of the connectivity constraint; X t X t T Q = b (1), ··· , b (n) , Λb = [λb(1), ··· , λb(n)] . and 2) head projection and tail projection problems [5] t∈R t∈R subject to connectivity constraints. We call our method Relaxed-GraphMP which is analogous to GraphMP Therefore, Ca, Cb, Ba, and Bb in Equation(4.2) can be reformulated as follows: proposed by Chen et al. [3]. The high level of Relaxed- T T T T GraphMP is shown in Algorithm 4.2. It contains 4 Ca = P x,Cb = Q x,Ba = |R|Λa x,Bb = |R|Λb x main steps as described below. Hence, F can be reformulated as a relaxed function Fˆ: • Step 1: Compute the gradient of relaxed GLRT T problem (Line 5). The calculated gradient is T P x T T Fˆ(x,R) = P x log + |R|Λ x − P x i T a ∇Fˆ(x ,R). Intuitively, it maximizes this gradient |R|Λa x T with connectivity constraint that will be solved in T Q x T T (4.4) + Q x log T + |R|Λb x − Q x next step. |R|Λb x • Step 2: Compute the head projection (Line 6). This We relax the discrete domain {0, 1}n of S to the step is to find a vector g so that the corresponding continuous domain [0, 1]n of x, and obtain the relaxed subset supp(g) can maximize the norm of the projec- version of Problem 1 as described in Problem 2. tion of gradient ∇Fˆ(xi,R) ( See details in [5]). Problem 2. [Relaxed GLRT Optimization Prob- • Step 3: Solve the maximization problem without lem] Let Fˆ be a continuous surrogate function of F that connectivity constraint. This step (Line 7,8) solves is defined on the relaxed domain [0, 1]n and is identi- the maximization problem subject to the supp(x) ⊆ cal to F (S, R) on the discrete domain {0, 1}n. The re- Ω, where Ω is the union of the support of the previous laxed form of GLRT Optimization Problem is de- solution supp(xi) with the result of head projection fined the same as the GLRT optimization problem, ex- supp(g) (Line 7). A gradient ascent based method is cept that, for each pair (Si,Ri) in O, the subset of key- proposed to solve this problem. Details is not shown words Si is identified by solving the following problem here due to space limit. with Si = supp(xˆ): • Step 4: Compute the tail projection (Line 9). This ˆ xˆ = arg max F (x,Ri) s.t. supp(x) is connected. final step is to find a subgraph S so that bS is x∈[0,1]n G close to b but with connectivity constraint. This where supp(x) = {i|xi 6= 0} is the support of x. The tail projection guarantees to find a subgraph GS with gradient of Fˆ(x,R) has the form: constant approximation guarantee (See details in [5]). ∂Fˆ(x,R) PTx  PTx  • Halting: The algorithm terminates when the condi- = log T P + |R| − T Λa ∂x |R|Λa x Λa x tion holds. Our algorithm returns a connected sub- QTx  QTx  graph GS where the connectivity of GS is guaranteed (4.5) + log Q + |R| − Λ T T b by Step 4. |R|Λb x Λb x

Copyright c 2020 by SIAM Unauthorized reproduction of this article is prohibited Time Complexity Analysis: The GraphDPD al- but not necessarily connected subsets of data records by gorithm is efficient as its time complexity is propor- maximizing a score function. We also compare our pro- tional to the total number of continous time windows posed method with two state-of-art baseline methods T 2. Therefore, the time complexity of GraphDPD designed specifically for connected anomalous subgraph is mainly dependent on the run time of Relaxed- detection, namely, EventTree [13] and NPHGS [2]. GraphMP. We give the detailed time complexity anal- Performance Metrics: The performance metrics ysis in Theorem 4.2. include: (1) precision (Pre), (2) recall (Rec), and (3) f- measure (F-score). Given the returned subset of nodes ∗ Theorem 4.2. []GraphDPD runs in O(T 2 · t(nT + S and the corresponding true subset of anomalies S , 3 we can calculate these metrics as follows: nl + |E|log n)) time, where T is the maximal time |S ∩ S∗| |S ∩ S∗| 2|S ∩ S∗| window size, nT is the time complexity of Line 5 in Pre = , Rec = , F-score = Algorithm 4.1, nl is the run time of Line 8 using |S| |S∗| |S∗| + |S| 3 gradient ascent, |E|log n is the total run time of head Collecting labels for real data: We collect labels projection and tail projection algorithms, and t is the for real-world instances of censorship from all abnormal total number of iterations needed in Algorithm 4.2. absence patterns identified in News by all baseline methods. For each abnormal absence pattern in News, 5 Experiments we need to first identify if there are any relevant events Through experiments conducted on semi-synthetic data of interest taking place around the associated time and real data, we evaluate the performance of our pro- region. Although there is no publicly available gathered posed approach in censorship pattern detection com- information of all existing censorship, we can validate pared with alternative methods. The results of these the correctness of the detected self-censorship through experiments show the superiority of GraphDPD al- evidences in news reports. For example, news articles gorithm for detecting likely patterns of media self- 3 4 verified self-censorship in Ultimas Noticias, the censorship over other state-of-the-art options. largest daily in Venezuela, about the massive protests in February 2014. An indicator of censorship pattern 5.1 Experimental Design Real world datasets: is also considered as valid if we can find the event of Table 4 gives a detailed description of real-world interest is: 1) not reported in some local news outlets datasets we used in this work. Details of Twitter and while reported in some different local news outlets, news data access have been provided in Section 3. Daily 2) reported in influential international news outlets, keyword co-occurrence graphs, which integrate News and 3) reported of censorship activity in local news with Twitter, are generated as described in Section 3.1. media from other news outlets during the associated time window. The evaluation process is analyzed with the inner-annotator agreement by multiple independent Table 4: Real-world dataset used in this work. Tweets: annotators. average number of daily tweets. News Articles: average number of daily local news articles. News Outlets: number 5.2 Semi-synthetic Data Evaluation We evaluate of news outlets used in this study. the accuracy of our approach to detect the disrupted Daily Daily News # of News ground truth anomalies. We find that overall our ap- Country Tweets Articles Outlets proach consistently outperforms all other baseline meth- Mexico 84556 444 13 Venezuela 65916 91 6 ods. Here are the major findings: (1) Our approach outperforms baseline methods especially at low pertur- bation intensities where the detection is harder to carry Data Preprocessing: The preprocessing of the out, and the performance increases gradually with the real world datasets has been discussed in detail in Sec- increase of perturbation intensity. (2) The recall of tion 3.1. In particular, we considered keywords whose NPHGS becomes quite low when true subgraph is rel- day-by-day frequencies in news media and Twitter data atively large. (3) The recall of EventTree is among the have linear correlations above 0.15, in order to filter best when perturbation intensity is low. (4) LTSS did noisy keywords. well in average recall but poorly in average precision as Semi-synthetic datasets: We create semi- the size of anomalous graph increases. synthetic datasets by using the coordinates from real- world datasets and injecting anomalies [15]. 3http://www.nybooks.com/daily/2014/04/09/venezuela- Our proposed Graph-DPD and baseline protests-censorship/ methods: We compare our proposed method with 4https://www.pri.org/stories/2014-03-27/venezuela-protests- one baseline method LTSS [8], which finds anomalous shed-light-extent-media-censorship

Copyright c 2020 by SIAM Unauthorized reproduction of this article is prohibited Table 5: Example indicators of censorship identified by our approach in Mexico and Venezuela during Year 2014 (with significance level ≤ 0.05)

Mexico Example local news media Date Example censored keywords Reasons for censorship in news media detected with censorship Tens of thousands of people marched in Mexico City on Labor reforma(reform), gasolina(petrol), edu- Day to protest the new laws, which target at Mexico’s education 2014-05-01 Noroeste caci´on(education) system and opening up the state controlled oil industry to foreign investors. ayotzinapa, iguala, normalistas, desa- 43 students from the Ayotzinapa Rural Teachers’ College went 2014-09-27 parecidos(missing), detenidos(detained), El Mexicano Gran Diario Regional missing and kidnapped in Iguala on September 26, 2014. protesta(protest) Protests in Mexico City demanding the return of the missing ayotzinapa, estudiantes(students), normalistas, students, who came from Ayotzinapa Rural Teachers’ College and 2014-11-10 desaparecidos(missing), protesta(protest), mil- El Mexicano Gran Diario Regional went missing in Iguala on September 26, 2014, turned violent for itares(military), iguala the first time. Venezuela represi´on(repression), dispar´o(shooting), Mass protests led by opposition leaders, including Leopoldo L´opez, marchamos(march), heridos(wounded), nico- 2014-02-18 Ultimas Not´ıcias occurred in 38 cities across Venezuela asking for the release of the lasmaduro, armados(armed), leopoldolopez, arrested students. apresar(arrest), ntn24 muertes(deaths), cambio(change), caracas, Thousands of Venezuelans demonstrated in Caracas to commemo- 2014-05-01 El Tiempo in Anzoategui presidente(president), labor rate Labor Day and denounce shortages. Venezuela is the only country in Latin America with increasing gubernamental(government), anticontrabando, 2014-08-12 El Nacional number of malaria. With the spreading of Ebola virus, Venezuela contrabando, ´ebola, muerte(death) is one of the most vulnerable countries in Latin America .

5.3 Real Data Evaluation As before, we ap- identified by our approach in Mexico and Venezuela ply three anomaly detection baseline methods, LTSS, with significance level ≤ 0.05. The rest of this section NPHGS, and EventTree, to detect anomalies in News will evaluate a couple of these instances in greater detail. on graphs with all possible time windows from 3 days Mexico May 2014. In December 2013, Mexi- to 7 days during Year 2014. The baseline methods can can president and Congress amended the Constitution, find anomalous subgraphs according to their own score opening up the state controlled oil industry to foreign functions; however, they are not able to evaluate the investors. Tens of thousands of protesters demonstrated significance level of each subgraph. For the purpose of in Mexico City on Labor Day (May 1) to protest against comparison, we remove duplicate subgraphs with over- the energy reform. In additions, protesters were also un- lapping time regions in the same manner as our method. satisfied with the 2013 reforms of the educational sector. The remaining subgraphs are ranked from the best to However, this incident was not reported in a number of the worst according to their function values and the influential newspapers in Mexico, which is an indicator same number of subgraphs are selected from the top to of censorship. Fig. 2a shows a cluster of censored key- compare with subgraphs detected by our method. words detected by our method around May 1, 2014 in Mexico. Our approach has successfully captured con- Table 6: Comparison of false positive rates in censorship sistent censorship patterns among a collection of rele- detection between GraphDPD and three baseline methods: vant keywords, which well describe the topics around LTSS, NPHGS, and EventTree on real data of Mexico and which the May 1 demonstrations were organized (re- Venezuela during year 2014. forma, gasolina, dinero, educaci´on,escuela). Country LTSS NPHGS EventTree GraphDPD Venezuela February 2014. As a result of the collapse of the price of oil, Venezuela suffered from in- Mexico 0.722 0.667 0.556 0.278 Venezuela 0.714 0.786 0.643 0.357 flation, shortages of basic foodstuffs and other necessi- ties. Mass opposition protests led by opposition leaders demanding the release of the students occurred in 38 Table 6 summarizes the comparison of false posi- cities across Venezuela on February 12, 2014. While tive rates in censorship detection and our method out- this incident was reported by a number of major inter- performs LTSS, NPHGS, and EventTree. The baseline national newspapers, there was significant censorship in methods, which are designed for event detection instead the country’s largest daily Ultimas Not´ıcias,an event re- of censorship detection, will capture all falling patterns ported by a number of international news outlets. Fig. in News. In particular, the baseline methods are not 2b shows a cluster of censored keywords detected by able to successfully differentiate censored events from our method around February 18, 2014 in Venezuela, non-censored events, e.g., the normal end of attention which well describes the populations involved (estudi- paid to breaking events. ante, chavistas, opositores, leopoldolopez) and the tar- get of the demonstrations (nicolasmaduro). 5.4 Case Studies This section provides some more detailed case studies to illustrate both the method and the types of cases that were self-censored. Table 5 summarizes a list of example instances of censorship

Copyright c 2020 by SIAM Unauthorized reproduction of this article is prohibited [3] F. Chen and B. Zhou. A generalized matching pursuit approach for graph-structured sparsity. In Proc. IJCAI, pages 1389–1395, 2016. [4] Y. Gao, S. Zhao, Y. Yang, and T.-S. Chua. Multimedia social event detection in microblog. In MMM, pages 269– 281. Springer, 2015. [5] C. Hegde, P. Indyk, and L. Schmidt. A nearly-linear (a) Mexico 2014-05-01 (b) Venezuela 2014-02-18 time framework for graph-structured sparsity. In Proc. ICML, pages 928–937, 2015. Figure 2: Word cloud representing censored keywords in [6] T.-W. Kuo, K. C.-J. Lin, and M.-J. Tsai. Maximiz- News identified by our method ing submodular set function with connectivity constraint: Theory and application to networks. IEEE/ACM Trans- 6 Conclusion actions on Networking (TON), 23(2):533–546, 2015. [7] D. B. Neill. An empirical comparison of spatial scan In this paper, we have presented a novel unsupervised statistics for outbreak detection. International Journal of approach to identify censorship patterns in domestic Health Geographics, 8(1):20, 2009. news media using social media as a sensor. This ap- [8] D. B. Neill. Fast subset scan for spatial pattern detec- proach has demonstrated great promise in detecting self- tion. Journal of the Royal Statistical Society: Series B censorship as compared to current event-detection tech- (Statistical Methodology), 74(2):337–360, 2012. nologies. It also has demonstrated utility for providing [9] A. Olteanu, C. Castillo, N. Diakopoulos, and K. Aberer. an assessment of freedom of the press in countries with Comparing events coverage in online news and social active social media populations, and for understanding media: The case of climate change. ICWSM, 15:288–297, the patterns of self-censorship within countries. 2015. [10] S. Petrovic, M. Osborne, R. McCreadie, C. Macdonald, This method may also have applications in other I. Ounis, and L. Shrimpton. Can twitter replace newswire areas. Indeed, the GraphDPD methodology should for breaking news? In ICWSM, 2013. be useful for many applications where text is compared [11] N. Ramakrishnan, P. Butler, S. Muthiah, N. Self, between sources over time. In future work, we intend R. Khandpur, P. Saraf, W. Wang, J. Cadena, A. Vul- to explore these themes further, including developing likanti, G. Korkmaz, et al. ’beating the news’ with em- strategies for forecasting censorship. bers: forecasting civil unrest using open source indicators. In Proc. KDD, pages 1799–1808. ACM, 2014. Acknowledgements [12] A. Ritter, O. Etzioni, S. Clark, et al. Open domain event extraction from twitter. In Proc. KDD, pages 1104– Supported by the Intelligence Advanced Research 1112. ACM, 2012. Projects Activity (IARPA) via DoI/NBC contract num- [13] P. Rozenshtein, A. Anagnostopoulos, A. Gionis, and ber D12PC000337, the US Government is authorized N. Tatti. Event detection in activity networks. In Proc. to reproduce and distribute reprints of this work for KDD, pages 1176–1185. ACM, 2014. Governmental purposes notwithstanding any copyright [14] R. S. Tanash, Z. Chen, T. Thakur, D. S. Wallach, annotation thereon. Aravind Srinivasan and Khoa- and D. Subramanian. Known unknowns: An analysis Trinh were also supported in part by NSF Award CNS of twitter . In ACM Privacy in the 1010789. Disclaimer: The views and conclusions con- Electronic Society, pages 11–20. ACM, 2015. tained herein are those of the authors and should not [15] H. Tong, C. Faloutsos, and J.-Y. Pan. Fast random be interpreted as necessarily representing the official walk with restart and its applications. In Proc. ICDM, policies or endorsements, either expressed or implied, pages 613–622. IEEE Computer Society, 2006. [16] K. Watanabe, M. Ochi, M. Okabe, and R. Onai. of IARPA, DoI/NBC, or the US Government. Jasmine: a real-time local-event detection system based on geolocation information propagated to microblogs. In References Proc. CIKM, pages 2541–2544. ACM, 2011.

[1] A. A. Casilli and P. Tubaro. Social media censorship in times of political unrest-a social simulation experi- ment with the uk riots. Bulletin of Sociological Method- ology/Bulletin de Methodologie Sociologique, 115(1):5–20, 2012. [2] F. Chen and D. B. Neill. Non-parametric scan statistics for event detection and forecasting in heterogeneous social media graphs. In Proc. KDD, pages 1166–1175. ACM, 2014.

Copyright c 2020 by SIAM Unauthorized reproduction of this article is prohibited