Website Fingerprinting at Internet Scale

Andriy Panchenko∗, Fabian Lanze∗, Andreas Zinnen†, Martin Henze‡, Jan Pennekamp∗, Klaus Wehrle‡, and Thomas Engel∗
∗University of Luxembourg (LU), †RheinMain University of Applied Sciences (DE), ‡RWTH Aachen University (DE)
E-mail: ∗{firstname.lastname}@uni.lu, †[email protected], ‡{lastname}@comsys.rwth-aachen.de

Abstract—The website fingerprinting attack aims to identify the content (i.e., a webpage accessed by a client) of encrypted and anonymized connections by observing patterns of data flows such as packet size and direction. This attack can be performed by a local passive eavesdropper – one of the weakest adversaries in the attacker model of anonymization networks such as Tor. In this paper, we present a novel website fingerprinting attack. Based on a simple and comprehensible idea, our approach outperforms all state-of-the-art methods in terms of classification accuracy while being computationally dramatically more efficient. To evaluate the severity of the website fingerprinting attack in reality, we collected the most representative dataset that has ever been built, avoiding the simplified assumptions made in related work regarding the selection and type of webpages and the size of the universe. Using this data, we explore the practical limits of website fingerprinting at Internet scale. Although our novel approach is computationally more efficient by orders of magnitude and superior in terms of detection accuracy, we show for the first time that no existing method – including our own – scales when applied in realistic settings. With our analysis, we explore neglected aspects of the attack and investigate the realistic probability of success for different strategies a real-world adversary may follow.

Permission to freely reproduce all or part of this paper for noncommercial purposes is granted provided that copies bear this notice and the full citation on the first page. Reproduction for commercial purposes is strictly prohibited without the prior written consent of the Internet Society, the first-named author (for reproduction of an entire paper only), and the author's employer if the paper was prepared within the scope of employment.
NDSS '16, 21-24 February 2016, San Diego, CA, USA
Copyright 2016 Internet Society, ISBN 1-891562-41-X
http://dx.doi.org/10.14722/ndss.2016.23477

I. INTRODUCTION

Anonymous communication on the Internet is about hiding the relationship between communicating parties. For many people, in particular those living in oppressive regimes, the use of anonymization techniques is the only way to exercise their right to freedom of expression and to freely access information without fearing the consequences. Besides, these techniques are often used to bypass country-level censorship. Hence, the users of anonymization techniques strongly rely on the underlying protection as defined in their attacker model. For them, it is particularly important to know the level of protection provided against the considered adversaries. Several methods for low-latency anonymous communication have been proposed by the research community, but only a few systems have actually been deployed. The Tor network [8] – the most popular such system nowadays, used by millions of daily users – promises to hide the relationship between the sender of a message and its destination from a local observer. This is an entity that eavesdrops on traffic between the sender and the first anonymization node. It can be, for example, a local system administrator, an ISP, or anyone within the sending range of the signal if the user is connected via a wireless link. An entity with such capabilities is one of the weakest adversaries in the attacker model of this and other anonymization techniques [8].

The website fingerprinting (WFP) attack is a special case of traffic analysis. Performed by a local eavesdropper, it aims to infer information about the content (i.e., the website visited) of encrypted and anonymized connections by observing patterns of data flows. Here, the attacker merely utilizes meta information, such as packet size and direction of traffic, without breaking the encryption. Before the end of 2011, Tor was considered to be secure against this threat [11], [21]. Since then, the attack has become an active field of research. Several related works have shown the feasibility of the WFP attack against Tor – albeit using relatively small datasets compared to the size of the World Wide Web – and proposed voluminous countermeasures. A recent work [14] questions the practical realizability of the attack in light of the assumptions typically made and the impact of the base-rate fallacy on the classification results.

In this paper, we propose a novel WFP attack based on a subtle method for mapping network traces into a robust representation of a class (in machine learning, a class is a set or collection of abstracted objects that share a common characteristic; in our case, these are the traces recorded for the same webpage). We abstract the loading process of a webpage by generating a cumulative behavioral representation of its trace. From this representation, we extract the features for our classifier. These implicitly cover characteristics of the traffic that other classifiers have to consider explicitly, e.g., packet ordering or burst behavior. By design, our classifier is robust against differences in bandwidth, congestion, and the timing of a webpage load. As we will show, this approach outperforms all existing state-of-the-art classifiers in terms of classification accuracy while being computationally tremendously more efficient.

To evaluate the severity of the WFP attack in reality, we constructed the most representative dataset that has ever been assembled in this domain. It consists of over 300,000 webpages (ten times larger than the biggest set used before, i.e., the one described in [14]) and is not subject to the simplified assumptions made in related work. For instance, most researchers consider only index pages, i.e., those that web servers provide for a requested domain (e.g., [26], [9], [11]). Their objective is thus limited to differentiating index pages, i.e., to detecting certain index pages within a set of other index pages. We clearly differentiate between the classification of webpages and of websites. Our datasets enable, for the first time, studying the detectability of both single webpages and complete websites within realistic Internet traffic (serving as background noise). Further, we do not limit our dataset to the most popular websites according to Alexa, since we argue that their index pages cannot serve as realistic background traffic. For our datasets, we combined different sources of information, such as links distributed via Twitter or the traces of a Tor exit node, to create a random and representative sample of the webpages actually visited on the Internet (or over Tor in particular) at the time of evaluation.

We use our superior attack together with our collected data to study the limits of webpage and website fingerprinting at Internet scale. We investigate the probability of success for different strategies a realistic adversary may follow. We show that with existing classifiers under realistic conditions, fingerprinting any particular webpage is similar to finding a needle in a haystack – in general, it is doomed to failure. However, website fingerprinting – besides being a more realistic scenario – is also easier for existing classifiers to handle. Our evaluation reveals several tactics that increase the probability of a successful attack.

The contributions of this paper are as follows:

1) We propose a novel WFP attack on Tor based on the idea of sampling features from a cumulative representation of a trace. We show that our approach outperforms all attacks existing in the literature on the state-of-the-art dataset in terms of classification accuracy while being computationally more efficient by orders of magnitude.

2) We provide the most comprehensive dataset to evaluate the WFP attack. Instead of being limited to the index pages of popular websites, it contains various web pages actually retrieved on the Internet. We managed to assemble more than 300,000 such pages.

Information leakage based on these metrics constitutes the foundation of the website fingerprinting attack. The objective is to match the patterns of a website load trace to a previously recorded trace in order to reveal which particular website a user is visiting over the anonymized and encrypted path. Typically, multiple traces of a single website are retrieved and analyzed. These are called instances.

Website fingerprinting is commonly evaluated in two scenarios. In the closed-world scenario, the number of websites a user may visit is limited to a fixed number. Obviously, this scenario is not realistic; however, it is suitable for comparing and analyzing the performance of classification approaches. In the more realistic open-world scenario, the adversary tries to identify whether a visited website belongs to a given set of monitored websites, even though the user may also visit sites unknown to the adversary. Here, we call the set of unknown sites the background set and the set of monitored sites the foreground set, correspondingly. In the remainder of this paper, we clearly distinguish between the terms "website" and "web page". A website is a collection of web pages, which are typically served from a single web domain. The initial web page of a website is called the index page. This page is served by the web server when a user queries the domain name of the corresponding website. In related work, website fingerprinting is commonly applied only to such index pages. In our evaluation, we extend the universe to arbitrary web pages and differentiate between the objectives of an adversary, e.g., monitoring all pages belonging to a particular website or monitoring a single particular web page.

Attacker Model

We assume the attacker to be a passive observer.
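The cumulative trace representation described in the introduction can be illustrated in code. The snippet below is a simplified, hypothetical sketch – the function name, interpolation scheme, and feature count are our own choices, not the paper's exact feature set. A trace is taken as a sequence of signed packet sizes; its running sum forms a curve describing the page load, and sampling that curve at a fixed number of equidistant points yields feature vectors of equal length regardless of trace length.

```python
# Hypothetical sketch of a cumulative trace representation for website
# fingerprinting. A trace is a list of signed packet sizes: positive values
# are outgoing packets, negative values are incoming packets.

def cumulative_features(trace, n_points=100):
    """Sample n_points equidistant values from the cumulative sum of a trace."""
    # Build the cumulative behavioral curve of the page load.
    cum, total = [], 0
    for size in trace:
        total += size
        cum.append(total)
    # Linearly interpolate the curve at n_points equidistant positions so that
    # traces of different lengths map to feature vectors of the same dimension.
    features, last = [], len(cum) - 1
    for i in range(n_points):
        pos = i * last / (n_points - 1)
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        features.append(cum[lo] * (1 - frac) + cum[hi] * frac)
    return features

# Two loads of the same page may differ in packet chunking and congestion,
# yet produce similar curves - this is what makes the representation robust
# against differences in bandwidth and timing.
trace_a = [600, -1500, -1500, 80, -1500, -700]
trace_b = [600, -3000, 80, -1500, -1500, -200, 20]
features_a = cumulative_features(trace_a, n_points=10)
features_b = cumulative_features(trace_b, n_points=10)
```

Note that the resulting feature vector depends only on the order and size of packets, not on absolute timestamps, which matches the robustness properties claimed for the classifier.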
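The base-rate fallacy mentioned above can be made concrete with a short calculation. The rates below are illustrative values chosen by us, not results from this paper: even a classifier with a high true-positive rate and a low false-positive rate produces mostly false alarms when the monitored page is rare among all pages a user visits.

```python
# Illustration of the base-rate fallacy in an open-world setting.
# All numbers are hypothetical and chosen only to demonstrate the effect.
tpr = 0.90        # true-positive rate: flagging an actual monitored-page visit
fpr = 0.005       # false-positive rate: flagging a background-page visit
base_rate = 1e-4  # fraction of all visits that go to the monitored page

# Bayes' rule: probability that a flagged visit really is the monitored page.
precision = (tpr * base_rate) / (tpr * base_rate + fpr * (1 - base_rate))
print(f"precision = {precision:.4f}")  # roughly 0.018: most alarms are false
```

This is the "needle in a haystack" effect: the rarer the foreground page is within realistic background traffic, the lower the precision of any detector, regardless of its closed-world accuracy.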