Automated Discovery of Internet Censorship by Web Crawling
Alexander Darer (Dept. of Computer Science, University of Oxford, Oxford, UK)
Oliver Farnan (Dept. of Computer Science, University of Oxford, Oxford, UK)
Joss Wright (Oxford Internet Institute, University of Oxford, Oxford, UK)

ABSTRACT
Censorship of the Internet is widespread around the world. As access to the web becomes increasingly ubiquitous, filtering of this resource becomes more pervasive. Transparency about the specific content that citizens are denied access to is atypical. To counter this, numerous techniques for maintaining URL filter lists have been proposed by various individuals and organisations that aim to provide empirical data on censorship for the benefit of the public and the wider censorship research community.
We present a new approach for discovering filtered domains in different countries. This method is fully automated and requires no human interaction. The system uses web crawling techniques to traverse between filtered sites and implements a robust method for determining whether a domain is filtered. We demonstrate the effectiveness of the approach by running experiments to search for filtered content in four different censorship regimes. Our results show that we perform better than the current state of the art and have built domain filter lists an order of magnitude larger than the most widely available public lists as of January 2018. Further, we build a dataset mapping the interlinking nature of blocked content between domains and exhibit the tightly networked nature of censored web resources.

CCS CONCEPTS
• Social and professional topics → Censorship; • Security and privacy → Social aspects of security and privacy; • Networks → Naming and addressing;

KEYWORDS
censorship; DNS; filtering; transparency; monitoring

ACM Reference Format:
Alexander Darer, Oliver Farnan, and Joss Wright. 2018. Automated Discovery of Internet Censorship by Web Crawling. In WebSci ’18: 10th ACM Conference on Web Science, May 27–30, 2018, Amsterdam, Netherlands. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3201064.3201091

ACKNOWLEDGMENTS
This work was supported by EPSRC through the Centre for Doctoral Training in Cyber Security, University of Oxford, and The Alan Turing Institute under the EPSRC grant EP/N51012. Alexander Darer and Oliver Farnan are funded by the Centre for Doctoral Training in Cyber Security. Joss Wright is partially funded by the Alan Turing Institute as a Turing Fellow under Turing Award Number TU/B/000044.

1 INTRODUCTION
The effort expended by censorship regimes around the world attempting to filter Internet resources they deem to be too sensitive or against the morality of their own interests is on-going. As Internet access has become more ubiquitous, the scale of deployed filtering systems is increasing. A recent study has shown that blocking the reachability of popular sites at national levels is widespread and disruptive [24][29]. Advocates for free speech and a free Internet push for transparency and openness, while censors attempt to repress the flow of certain information within their networks. Key to this is the blocking of specific webpages and the URLs that point to them.
In response to large scale filtering of web resources, there have been numerous studies over recent years aimed at determining the type of content being blocked in different countries. Of particular interest are periods of time when blocking has occurred and the development of techniques to monitor filtered URLs and keywords [12][13][18][23][33].
We introduce an approach for discovering filtered domains in different countries at scale and at reasonable cost. The system is fully automated and does not require per-country expertise or cooperation, meaning that the safety of individuals within censored regimes is not compromised.
Our method applies web crawling techniques to find blocked content and uses a seed list of known filtered URLs to initiate the search. We make use of DNS servers within a target country as measurement devices. This allows us to monitor the filter status of individual domains and sub-domains without human intervention. The system is recursive, so newly discovered filtered URLs are fed back into the search to allow on-going measurement. Results from our experiments using four different test countries have shown that our approach can be used to find filtered URLs that are not present in the original seed lists. Furthermore, we collect data about the linked nature of various filtered domains to gain further insight into how different pieces of filtered content are associated.

1.1 Related Work
Over the last decade there have been many approaches for detecting censorship of the Internet around the world. Of these, many are country specific and have focused on China [10][18][22][25][40], Indonesia [21][36], Iran [7][8], Pakistan [2][27] and Thailand [19] among others.
The most widely adopted and current URL filter lists are maintained by the CitizenLab [9]. They are constructed using local knowledge and reports of filtering in different countries, and collate data from different sources such as OONI [16].
Developing new techniques for discovering filtered URLs is a challenging problem. Yet this is a rich research field, with numerous techniques published over recent years [4]. The use of DNS as a means of testing censorship of web content is not new, but can be advantageous due to its scalability and remote nature [35]. These attributes make DNS a common tool for other censorship monitoring architectures such as UBICA [3], FilteredWeb [13] and CensMon [33].
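To make the idea of remote, DNS-based measurement concrete, the sketch below shows one way such a check could look. This is an illustration only, not any of the cited systems' implementations: it assumes the third-party dnspython package, and the resolver addresses and test domain are hypothetical placeholders (documentation-range IPs).

```python
# Illustrative sketch of a remote DNS-based filtering check.
# Assumes the third-party `dnspython` package (pip install dnspython).
import dns.resolver
import dns.exception

def query_a_records(domain, nameserver, timeout=5.0):
    """Query a specific (e.g. in-country) resolver for a domain's A records."""
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [nameserver]   # use only the chosen vantage point
    resolver.lifetime = timeout
    try:
        answer = resolver.resolve(domain, "A")
        return sorted(rr.address for rr in answer)
    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer,
            dns.resolver.NoNameservers, dns.exception.Timeout):
        return None  # no usable answer from this resolver

if __name__ == "__main__":
    # Hypothetical resolver addresses and test domain, for illustration only.
    in_country = query_a_records("example.com", "203.0.113.53")  # resolver inside the target country
    control = query_a_records("example.com", "198.51.100.53")    # control resolver outside it
    # A crude comparison: differing or missing answers are only a hint of
    # interference and would need more robust checks in practice.
    print("in-country:", in_country)
    print("control:   ", control)
    print("possible filtering?", in_country != control)
```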
Building in-depth and accurate URL filter lists is an important aspect of censorship research. These collections are in widespread use among the research community for various different measurements and tests of internet reachability, web content blocking and circumvention techniques [30][38]. Furthermore, the subsequent and on-going maintenance of these lists provides opportunities for insight into the condition of internet filtering around the world. The data collected by the aforementioned monitoring architectures is vital if we are to construct a model of censorship as it develops.

1.2 Contributions
This paper introduces a new approach for discovering filtered domains within target censorship regimes. We have created an implementation of the technique and, through experimentation, shown it to be an effective tool for building URL filter lists. Furthermore, our results reveal that the approach has found significantly more filtered URLs for the test countries than were currently available in the largest public filter lists. Our formalised contributions are:
• A new approach for discovering previously unknown filtered domains
• Experimental analysis of the technique through measurement of filtering activity within four known censorship regimes
• Category breakdown of the types of content being blocked within these regimes
• Analysis of forward filtered links and filtered backlinks of webpages on filtered domains
A substantial research output from this body of work is a test list containing a large number of currently filtered domains within China, Indonesia, Iran and Turkey that have previously been unpublished. We aim to make this list available as soon as possible to the wider censorship research community.

2 TRAVERSAL OF FILTERED WEBPAGES
Traversal between webpages using embedded hyperlinks is not dissimilar to conventional web crawling techniques widely used by large search engines. Given this, we aim to build a new dataset that contains information pertaining to backlinks of filtered webpages.
This technique works on a simple premise: filtered webpages contain links to other filtered webpages. We begin the discovery by seeding the system with a number of known filtered URLs, with the presupposition that these will contain hyperlinks to further blocked content. A high-level overview of the technique is shown in Figure 1 and works as follows (a minimal code sketch of this loop is given below):
(1) Start with a list of known filtered URLs for country c
(2) Retrieve webpages for all known filtered URLs in our list
(3) Extract any URLs from the downloaded webpages
(4) Isolate the URLs that are filtered in country c from the extracted URLs
(5) Add the newly identified filtered URLs to the list, then go to step 2

[Figure 1: High-level overview of filtered webpage traversal — Seed URLs → Retrieve webpages → Extract URLs from HTML → Isolate filtered URLs → feed back into the seed list]

A number of hyperlinks in any webpage will point to resources that do not provide utility for our discovery. We ignore any URLs that point to static HTML assets, such as JavaScript, CSS or image files, and also remove any self-referencing URLs (hyperlinks to the same domain as the webpage). In this way we aim to reduce the possibility of the crawler becoming stuck in cliques such as affiliate or adult site networks. For the purposes of analysing the approach, we visit each unique URL only once.

2.1 Methods for Checking Webpage Availability
A key part of this system is the capability to determine whether a domain is filtered in a certain country or not.
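The following is a minimal sketch of the traversal loop in steps (1)–(5) above, showing where the per-country availability check that this subsection introduces plugs in. It is not the paper's implementation: the `is_filtered` function, the static-asset suffix list and the page limit are hypothetical placeholders, and the filter check is left as a stub.

```python
# Minimal sketch of the filtered-webpage traversal loop (steps 1-5 above).
# Not the paper's implementation; `is_filtered` stands in for the
# per-country availability check described in Section 2.1.
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import requests

STATIC_SUFFIXES = (".js", ".css", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".ico")

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_urls(page_url, html):
    parser = LinkExtractor()
    parser.feed(html)
    return {urljoin(page_url, href) for href in parser.links}

def is_filtered(domain, country):
    """Placeholder: query in-country DNS resolvers (Section 2.1) and decide."""
    raise NotImplementedError

def crawl(seed_urls, country, max_pages=1000):
    queue = list(seed_urls)                  # (1) start from known filtered URLs
    seen, filtered = set(queue), set(queue)
    while queue and len(seen) <= max_pages:
        url = queue.pop(0)
        try:
            html = requests.get(url, timeout=10).text   # (2) retrieve the webpage
        except requests.RequestException:
            continue
        for link in extract_urls(url, html):            # (3) extract candidate URLs
            parsed = urlparse(link)
            if link in seen or not parsed.scheme.startswith("http"):
                continue
            # Skip static assets and self-referencing (same-domain) links.
            if parsed.path.lower().endswith(STATIC_SUFFIXES):
                continue
            if parsed.netloc == urlparse(url).netloc:
                continue
            seen.add(link)
            if is_filtered(parsed.netloc, country):     # (4) isolate filtered URLs
                filtered.add(link)
                queue.append(link)                      # (5) feed back into the search
    return filtered
```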