This paper will appear at the 14th ACM Conference on Computer and Communications Security, Oct. 29-Nov. 2, 2007

ConceptDoppler: A Weather Tracker for Internet Censorship

Jedidiah R. Crandall, Univ. of New Mexico
Daniel Zinn, Univ. of California at Davis
Michael Byrd, Univ. of California at Davis
Earl Barr, Univ. of California at Davis
Rich East, Independent Researcher

ABSTRACT

The text of this paper has passed across many Internet routers on its way to the reader, but some routers will not pass it along unfettered because of censored words it contains. We present two sets of results: 1) Internet measurements of keyword filtering by the Great “Firewall” of China (GFC); and 2) initial results of using latent semantic analysis as an efficient way to reproduce a blacklist of censored words via probing.

Our Internet measurements suggest that the GFC’s keyword filtering is more a panopticon than a firewall, i.e., it need not block every illicit word, but only enough to promote self-censorship. China’s largest ISP, ChinaNET, performed 83.3% of all filtering of our probes, and 99.1% of all filtering that occurred at the first hop past the Chinese border. Filtering occurred beyond the third hop for 11.8% of our probes, and there were sometimes as many as 13 hops past the border to a filtering router. Approximately 28.3% of the Chinese hosts we sent probes to were reachable along paths that were not filtered at all. While more tests are needed to provide a definitive picture of the GFC’s implementation, our results disprove the notion that GFC keyword filtering is a firewall strictly at the border of China’s Internet.

While evading a firewall a single time defeats its purpose, it would be necessary to evade a panopticon almost every time. Thus, in lieu of evasion, we propose ConceptDoppler, an architecture for maintaining a censorship “weather report” about what keywords are filtered over time. Probing with potentially filtered keywords is arduous due to the GFC’s complexity and can be invasive if not done efficiently. Just as an understanding of the mixing of gases preceded effective weather reporting, understanding of the relationship between keywords and concepts is essential for tracking Internet censorship. We show that LSA can effectively pare down a corpus of text and cluster filtered keywords for efficient probing, present 122 keywords we discovered by probing, and underscore the need for tracking and studying censorship blacklists by discovering some surprising blacklisted keywords, such as the Chinese terms for “conversion rate”, “Mein Kampf”, and “International geological scientific federation (Beijing)”.

Everybody talks about the weather but nobody does anything about it.
Charles Dudley Warner (1829–1900)

Categories and Subject Descriptors

K.4.m [Computers and Society]: Miscellaneous

General Terms

Experimentation, human factors, legal aspects, measurement, security

Keywords

LSA, latent semantic analysis, latent semantic indexing, firewall ruleset discovery, Internet censorship, Great Firewall of China, Internet measurement, panopticon, ConceptDoppler, keyword filtering, blacklist

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CCS’07, October 29–November 2, 2007, Alexandria, Virginia, USA.
Copyright 2007 ACM 978-1-59593-703-2/07/0011 ...$5.00.

1. INTRODUCTION

Societies have always divided speech into objectionable and permissible categories. By facilitating the flow of information, the Internet has sharpened this debate over categorization. Inspired by initial work [8] on the Great Firewall of China (GFC)’s keyword filtering mechanism, we sought a better understanding of its implementation and found it to be not a firewall at all, but rather a panopticon where the presence of censorship, even if easy to evade, promotes self-censorship.

Clayton et al. [8] provide more details about how the GFC’s keyword filtering operates. GFC routers scan for keywords in GET requests or HTML responses (and possibly in other protocols) that are on a blacklist of keywords considered sensitive. If a packet containing a keyword passes through one of these routers, the router sends one or more reset (RST) packets to both the source and destination IP addresses of the packet in an attempt to reset the connection.

While we do not wish to take sides in any particular Internet censorship debate in a technical paper, much of this paper is written to develop a technology to perform surveillance on a censorship mechanism. However, the technical material in this paper can aid those on either side of a censorship debate. We probed the GFC to find out the locations of the filtering routers and how reliably they perform the filtering. Two insights came from the results of these probes, one that motivates a research focus on surveillance rather than evasion, and a second that motivates doing the surveillance efficiently:

• Contrary to common belief, the filtering mechanism is not a firewall that peremptorily filters all offending packets at the international gateway of the Internet between China and other countries.

• Probing is very arduous because of the complexity of the GFC.

The first of these insights motivates the need for surveillance. Our results suggest that only roughly one fourth of the filtering occurs at the international gateway, with a much larger constituent of the filtering occurring several hops into China, and some filtering occurring as many as 13 hops past the border. In fact, depending on the path packets take from our source point into China, 28.3% of the IP addresses we probed were reachable without traversing a GFC router, therefore no filtering at all occurred for these destinations. Combined with the fact that a single ISP did a disproportionate amount of the filtering, our results show that the GFC’s implementation is much more centralized at the AS-level¹ than had been previously thought. Even on routes where there are filtering routers, the filtering is inconsistent and tends to allow many packets through during busy network periods. While evading a firewall a single time defeats the firewall, it would be necessary to evade a panopticon almost every time to defeat its purpose. This is why we propose ConceptDoppler as a first step towards an Internet censorship weather report.

The second of these insights motivates the need for efficient probing. As pointed out by the Open Net Initiative, “China’s sophisticated filtering system makes testing its blocking difficult” [3]. Not only is the filtering heterogeneous in its implementation [8] and inconsistent both during busy periods and depending on the path, but there is also much noise in such forms as RSTs generated by misconfigured routers and hosts, inconsistent paths because of traffic shaping, IP tunnelling, Internet Exchange Points (IXPs) [25], and routers that do not conform to the RFC for handling TTLs. We can send candidate words through a filtering router and receive

censorship is used. A censorship weather report would give policy makers an exact record of how a censorship mechanism was used and how it was implemented over a period of time. For example, policy makers cannot ask important questions such as why the Chinese terms for “Judicial yuan grand justices” or “Virgin prostitution law case” were filtered at a particular place and time without first knowing what was filtered.

As a first step toward an Internet censorship weather report, we explore the keyword filtering mechanism of the GFC. Keyword filtering is an important tool for censorship, and a complete picture of the blacklist of keywords that are filtered, over time and for different geographic locations within a specific country, can prove invaluable to those who wish to understand that government’s use of keyword-based Internet censorship.

1.1 Keyword-based Censorship

The ability to filter keywords is an effective tool for governments that censor the Internet. Numerous techniques comprise censorship, including IP address blocking, DNS redirection, and a myriad of legal restrictions, but the ability to filter keywords in URL requests or HTML responses allows a high granularity of control that achieves the censor’s goals with low cost.

As pointed out by Danezis and Anderson [10], censorship is an economic activity. The Internet has economic benefits and more blunt methods of censorship than keyword filtering², such as blocking entire web sites or services, decrease those benefits. There is also a political cost of more blunt censorship mechanisms due to the dissatisfaction of those censored. For example, while the Chinese government has shut down e-mail service for entire ISPs, temporarily blocked Internet traffic from overseas universities [5], and could conceivably stop any flow of information [14], they have also been responsive to complaints about censorship from Chinese citizens, recently allowing access to the Chinese language version of Wikipedia [18, 19], before restricting access again [20]. Keyword-based censorship gives censoring governments the ability to control Internet content in a less draconian way than other technologies, making censorship more effective in achieving their goals.

To motivate the need to track an Internet keyword blacklist over time, we must first refute the notion that censorship is always trying in vain to stop a flood of ideas.