ConceptDoppler: A Weather Tracker for Internet Censorship

Abstract

Technical research about Internet censorship has centered around evasion and counter-evasion. In this paper we offer a new research direction: surveillance of the mechanisms of censorship, and counter-surveillance, where those against censorship track its implementation and application over time for scrutiny by others.

We present two sets of results in this paper: Internet measurement results on the keyword filtering of the Great ``Firewall'' of China (GFC), and initial results of using latent semantic analysis (LSA) as an efficient way to probe for unknown keywords and reproduce a blacklist of censored words. Our Internet measurements suggest that the GFC's keyword filtering is more of a panopticon than a firewall, and show that probing is arduous due to the GFC's complexity. This motivates ConceptDoppler, an architecture for maintaining a censorship ``weather report'' about what words are blocked in different places around the world, using LSA for efficient probing. This can lead to a much more complete understanding of censorship. For example, we discovered 666 words on the GFC blacklist, including surprises such as 乙醇 (ethanol) and 希特勒 (Hitler).

    Everybody talks about the weather but nobody does anything about it.
    -- Charles Dudley Warner (1829--1900)

1. Introduction

Inspired by recent work [?] on the Great Firewall of China (GFC)'s keyword filtering mechanism, we sought a better understanding of its implementation and found it to be not a firewall at all, but rather more of a panopticon, where the presence of censorship, even if not pervasive, promotes self-censorship. Clayton et al. [?] provide more details about how the GFC's keyword filtering operates. Basically, filtering-capable routers watch for keywords in GET requests or HTML responses (and possibly in other protocols) that are on a blacklist of keywords considered to be sensitive. If a packet containing a keyword passes through one of these routers, the router sends a reset packet (RST) to both the source and destination IP address of the packet in an attempt to reset the connection.

Our probes were designed to find out where the filtering routers are and how reliably they do the filtering. Two insights came from the results of these probes:

- Contrary to common belief, the filtering mechanism is not a firewall that peremptorily blocks all offending packets at the international gateway of the Internet between China and other countries. Our results suggest that only roughly one fourth of the filtering occurs at the international gateway, with a much larger constituent of the blocking occurring several hops into China, and some filtering occurring as many as 14 hops past the border. In fact, depending on the path packets take from our source point into China, XX% of the IP addresses we probed were not blocked at all because they did not pass a filtering router. Combined with the fact that a single ISP did a disproportionate amount of the blocking, our results show that the GFC's implementation is much more centralized than previously thought. Even on routes where there are filtering routers, the filtering is inconsistent and tends to let many packets through during busy network periods.

- Probing can be very arduous because of the complexity of the GFC. Not only is the filtering heterogeneous in its implementation and inconsistent both during busy periods and depending on the path, but there is also a great deal of noise in such forms as RSTs generated by misconfigured routers and hosts, and inconsistent paths because of traffic shaping, IP tunneling, Internet Exchange Points (IXPs) [ref], and routers that do not conform to the RFC for handling TTLs [ref].

The first of these insights motivates the need for surveillance/counter-surveillance as a focus of censorship research rather than evasion/counter-evasion. Evasion is nugatory for a censorship mechanism that was not designed to keep up with high loads of traffic during peak periods and that does not filter on a significant number of paths. This is why we propose ConceptDoppler as a first step towards an Internet censorship weather report.

The second of these insights motivates the need for efficient probing. We can send candidate keywords through a blocking router and receive an answer, in the form of a RST, as to whether that word is on the blacklist in that location or not, with some probability; but tracking a blacklist where words can be added and removed at different places over time requires efficiency. It is not possible to take the encyclopedic dictionary of words for a particular language and probe each word in thousands of places every day. Even if it were, this amount of traffic would be very invasive. This is why we propose latent semantic analysis (LSA) as a way to efficiently probe for unknown keywords on the blacklist, by testing only words related to concepts that have been deemed potentially sensitive.

Regardless of whether we are considering the censorship of Nazi-related material in Germany [?], the blocking of child pornography in England [?], the filtering of sexual topics in libraries in the United States [?], or the more global restrictions of countries such as Iran [?] or China [?], it is imperative, when developing policy about Internet censorship, that we understand both the technical mechanisms of censorship and the way in which censorship is used. This gives policy makers an exact record of how a censorship mechanism was used and how it was implemented over a period of time. For example, policy makers cannot ask important questions such as why ``司法院大法官'' (Judicial Yuan Grand Justices) was blocked at a particular place and time without first knowing that it was blocked.

As a first step toward an Internet censorship weather report, we explore the keyword filtering mechanism of the GFC. Keyword filtering is an important tool for censorship, and a complete picture of the blacklist of words that are blocked, over time and for different geographic locations within a specific country, can prove invaluable to those who wish to understand that government's use of keyword-based Internet censorship.

1.1 Keyword-based Censorship

The ability to block keywords is an effective tool for governments that censor the Internet. Censorship can comprise numerous techniques, including IP address blocking, DNS redirection, and a myriad of legal restrictions, but the ability to block keywords in URL requests or HTML responses allows for a high granularity of control that achieves the censor's goals with low cost.

As pointed out by Danezis and Anderson [?], censorship is an economic activity. The Internet has economic benefits, and blunter methods of censorship than keyword filtering, such as blocking entire web sites or services, decrease those benefits. There is also a political cost of blunter censorship mechanisms due to the dissatisfaction of those censored. For example, while the Chinese government has shut down e-mail service for entire ISPs, temporarily blocked Internet traffic from overseas universities [?], and could conceivably stop any flow of information [?], it has also been responsive to complaints about censorship from Chinese citizens, recently allowing access to the Chinese language version of Wikipedia [?, ?] before restricting access again [?]. Keyword-based censorship gives censoring governments the ability to control Internet content in a less draconian way than other technologies, making censorship much more effective in achieving their goals.

1.2 Proposed Framework

In this paper, we propose a framework for understanding keyword-based censorship. We seek not only to discover which words are part of the blacklist used for a keyword-based censorship mechanism, but also to monitor the blacklist over time as words are added or deleted, even when the implementation of the censorship mechanism itself is heterogeneous and varies in different parts of the Internet infrastructure. With such a framework the research community could maintain a ``censorship weather report.'' While this could be used to evade censorship (Zittrain and Edelman [?] propose putting HTML comments within blocked words, and we discuss other possibilities in Section 5, where we discuss future work), more importantly we can use real-time monitoring of an entire country's Internet infrastructure to understand the ways in which keyword blocking correlates with current events. This can aid those on both sides of a particular censorship debate, either by adding weight to efforts to reduce censorship by pressuring the censor, or by giving policy makers a complete picture of the application and implementation of different mechanisms.

We design and evaluate our framework for the Great Firewall of China, the most advanced keyword-based Internet censorship mechanism. As pointed out by the Open Net Initiative, ``China's sophisticated filtering system makes testing its blocking difficult'' [?]. Essentially, we perform active probing of the GFC from outside of China, focusing exclusively on keyword-based filtering of HTTP traffic.

In addition to covering a broad cross section of the country, probing should also be continuous, so that if a current event means that a word is temporarily blocked, as has been observed for URL blocking [?], we will know when the keyword was added to the blacklist and in what regions of the country it was blocked. While a snapshot of the blacklist from one router at one time is a gold nugget of information, our goal is to refine a large quantity of ore and maintain a complete picture of the blacklist.

This requires great efficiency in probing for new keywords; thus we propose the use of conceptual web search techniques, notably latent semantic analysis [?], to continually monitor the blacklist of a keyword-based censorship mechanism, such as the GFC. We propose to apply latent semantic analysis to pare down a corpus of text (the Chinese version of Wikipedia [ref] in this paper) into a small list of words that, based on the
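The LSA paring step just described can be sketched as follows. This is a minimal illustrative reconstruction, not the paper's implementation: the tiny English corpus, the seed term, the function name `lsa_related_terms`, and the choice of rank `k` are all invented for the example. The idea is standard LSA: weight a term-document matrix with TF-IDF, truncate its SVD to `k` latent dimensions, and rank candidate terms by cosine similarity to a seed term in that latent space, so that only terms conceptually close to a known-sensitive seed need to be probed.

```python
import numpy as np

def lsa_related_terms(docs, seed, k=2, top_n=3):
    """Rank corpus terms by LSA similarity to a seed term.

    docs: list of documents, each a list of tokens
    seed: a term assumed (for illustration) to be conceptually sensitive
    Returns the top_n terms closest to the seed in the k-dim latent space.
    """
    vocab = sorted({t for d in docs for t in d})
    idx = {t: i for i, t in enumerate(vocab)}

    # Term-document count matrix
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for t in d:
            A[idx[t], j] += 1

    # TF-IDF weighting: damp terms that appear in many documents
    df = (A > 0).sum(axis=1)
    A = A * np.log(len(docs) / df)[:, None]

    # Truncated SVD: rows of U[:, :k] * S[:k] are term vectors in latent space
    U, S, _ = np.linalg.svd(A, full_matrices=False)
    T = U[:, :k] * S[:k]

    # Cosine similarity of every term vector to the seed's vector
    seed_v = T[idx[seed]]
    sims = T @ seed_v / (np.linalg.norm(T, axis=1) * np.linalg.norm(seed_v) + 1e-12)
    ranked = sorted(vocab, key=lambda t: -sims[idx[t]])
    return [t for t in ranked if t != seed][:top_n]
```

On a toy corpus where "protest" co-occurs with "square" and "crackdown" but not with the weather-related documents, seeding with "protest" surfaces the co-occurring terms first; at the scale the paper proposes (Wikipedia), the same ranking would yield a short probe list per sensitive concept instead of an encyclopedic dictionary.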