Privacy Leakage Via IP-Based Website Fingerprinting

Proceedings on Privacy Enhancing Technologies; 2021 (4):1–21

Nguyen Phong Hoang, Arian Akhavan Niaki, Phillipa Gill, and Michalis Polychronakis

Domain Name Encryption Is Not Enough: Privacy Leakage via IP-based Website Fingerprinting

Nguyen Phong Hoang: Stony Brook University
Arian Akhavan Niaki: University of Massachusetts - Amherst
Phillipa Gill: University of Massachusetts - Amherst
Michalis Polychronakis: Stony Brook University

arXiv:2102.08332v2 [cs.CR] 16 Jun 2021
DOI 10.2478/popets-2021-0058
Received 2021-02-28; revised 2021-06-15; accepted 2021-06-16.

Abstract: Although the security benefits of domain name encryption technologies such as DNS over TLS (DoT), DNS over HTTPS (DoH), and Encrypted Client Hello (ECH) are clear, their positive impact on user privacy is weakened by the still-exposed IP address information. However, content delivery networks, DNS-based load balancing, co-hosting of different websites on the same server, and IP address churn all contribute towards making domain–IP mappings unstable, and prevent straightforward IP-based browsing tracking. In this paper, we show that this instability is not a roadblock (assuming a universal DoT/DoH and ECH deployment), by introducing an IP-based website fingerprinting technique that allows a network-level observer to identify at scale the website a user visits. Our technique exploits the complex structure of most websites, which load resources from several domains besides their primary one. Using the generated fingerprints of more than 200K websites studied, we could successfully identify 84% of them when observing solely destination IP addresses. The accuracy rate increases to 92% for popular websites, and 95% for popular and sensitive websites. We also evaluated the robustness of the generated fingerprints over time, and demonstrate that they are still effective at successfully identifying about 70% of the tested websites after two months. We conclude by discussing strategies for website owners and hosting providers towards hindering IP-based website fingerprinting and maximizing the privacy benefits offered by DoT/DoH and ECH.

Keywords: Domain Name Encryption, DoT, DoH, Encrypted Client Hello, Website Fingerprinting

1 Introduction

Due to the increase of Internet surveillance in recent years [13, 31], users have become more concerned about their online activities being monitored, leading to the development of privacy-enhancing technologies. While various mechanisms can be used depending on the desired level of privacy [48], encryption is often an indispensable component of most privacy-enhancing technologies. This has led to increasing amounts of Internet traffic being encrypted [3].

Having a dominant role on the Internet, the web ecosystem thus has witnessed a drastic growth in HTTP traffic being transferred over TLS [28]. Although HTTPS significantly improves the confidentiality of web traffic, it cannot fully protect user privacy on its own when it comes to preventing a user's visited websites from being monitored by a network-level observer. Specifically, under current web browsing standards, the domain name information of a visited website can still be observed through DNS queries/responses, as well as the Server Name Indication (SNI) field of the TLS handshake packets. To address this problem, several domain name encryption technologies have been proposed recently to prevent the exposure of domain names, including DNS over TLS (DoT) [51], DNS over HTTPS (DoH) [49], and Encrypted Client Hello (ECH) [88].

Assuming an idealistic future in which all network traffic is encrypted and domain name information is never exposed on the wire as plaintext, packet metadata (e.g., time, size) and destination IP addresses are the only remaining information related to a visited website that can be seen by a network-level observer. As a result, tracking a user's browsing history requires the observer to infer which website is hosted on a given destination IP address. This task is straightforward when an IP address hosts only one domain, but becomes more challenging when an IP address hosts multiple domains. Indeed, recent studies have shown an increasing trend of websites being co-located on the same hosting server(s) [47, 95]. Domains are also often hosted on multiple IP addresses, while the dynamics of domain–IP mappings may also change over time due to network configuration changes or DNS-based load balancing.

Given these uncertainties in reliably mapping domains to their IPs, we investigate the extent to which accurate browsing tracking can still be performed by network-level adversaries based solely on destination IPs. In this work, we introduce a lightweight website fingerprinting (WF) technique that allows a network-level observer to identify with high accuracy the websites a user visits based solely on IP address information, enabling network-level browsing tracking at scale [74]. For instance, an adversary can use IPFIX (Internet Protocol Flow Information Export) [101] or NetFlow [2] records, which existing routers already collect, to easily obtain the destination IP addresses contacted by certain users, and track their browsing history.

For our attack, we first crawl a set of 220K websites, comprising popular and sensitive websites selected from two website ranking lists (§5). After visiting each website, we extract the queried domains to construct a domain-based fingerprint. The corresponding IP-based fingerprint is then obtained by continuously resolving the domains into their IPs via active DNS measurement (§4.1). By matching these IPs from the generated fingerprints to the IP sequence observed from the network traffic when browsing the targeted websites, we could successfully fingerprint 84% of them (§6.3). The successful identification rate increases to 92% for popular websites, and 95% for popular and sensitive websites.

To further enhance the discriminatory capacity of the fingerprints, we consider the critical rendering path [38] to capture the approximate ordering structure of the domains that are contacted at different stages while a website is being rendered in the browser (§4.2). Our results show that the enhanced fingerprints could allow for 91% of the tested websites to be successfully identified based solely on their destination IPs (§6.4).

Given the high variability of website content and domain–IP mappings across time, we expect that once generated, a fingerprint's quality will deteriorate quickly over time. To assess the aging behavior of the fingerprints, we conducted a longitudinal study over a period of two months. As expected, fingerprints become less accurate over time, but surprisingly, after two months, they are still effective at successfully identifying about 70% of the tested websites (§7).

As our WF technique is based on the observation of the IPs of network connections that fetch HTTP resources, it is necessary to evaluate the impact of HTTP caching on the accuracy of the fingerprints. This is because cached resources can be loaded directly from the browser's cache when visiting the same website for a second time, resulting in the observation of fewer connections per fingerprint. Furthermore, our attack exploits the fact that websites often load many external resources, including third-party analytics scripts, images, and advertisements, making their fingerprints more distinguishable. We thus also investigated whether the removal of these resources due to browser caching or ad blocking could help to make websites less prone to IP-based fingerprinting (§8).

By analyzing the HTTP response headers of the websites studied, we find that 86.1% of web resources are cacheable, causing fewer network connections to be observable by the adversary if these resources are loaded from the browser's cache (§8.1). Moreover, using the Brave browser to crawl the same set of websites, we found that the removal of third-party analytics scripts and advertisements can impact the order in which web resources are loaded (§8.2), significantly reducing the accuracy of the enhanced fingerprints from 91% to 76%. Nonetheless, employing the initially proposed WF technique in which the critical rendering path [38] is not taken into account, we could still fingerprint 80% of the websites even when browser caching and ad blocking are considered.

Regardless of the high degree of website co-location and the dynamics of domain–IP mappings, our findings show that domain name encryption alone is not enough to protect user privacy when it comes to IP-based WF. As a step towards mitigating this situation, we discuss potential strategies for both website owners and hosting providers towards hindering IP-based WF and maximizing the privacy benefits offered by domain name encryption. To the extent possible, website owners who wish to make IP-based website fingerprinting harder should try to (1) minimize the number of references to resources that are not served by the primary domain of a website, and (2) refrain from hosting their websites on static IPs that do not serve any other websites. Hosting providers can also help by (1) increasing the number of co-located websites per hosting IP, and (2) frequently changing the mapping between domain names and their hosting IPs, to further obscure domain–IP mappings, thus hindering IP-based WF attacks.

2 Background and Motivation

In this section, we review some background information on domain name encryption technologies and discuss the motivation behind our study. In particular, we highlight how our IP-based fingerprinting attack is different from prior works, allowing network-level adversaries to effectively mount the attack at scale.

2.1 Domain Name Encryption

[...] cannot provide any meaningful privacy benefit without the use of DoT/DoH, and vice versa.
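
As a rough illustration of the matching idea summarized in the introduction (a site's contacted domains are resolved to IPs, and those IPs are compared with the destination IPs an observer sees), the following Python sketch may help. It is not the paper's algorithm: the names `match_score` and `identify`, the scoring rule (domain coverage multiplied by the share of observed IPs that the fingerprint explains), the 0.9 threshold, and the example sites are all assumptions made for illustration. The addresses come from the reserved documentation ranges.

```python
from typing import Dict, Iterable, List, Set

# Assumed fingerprint shape: contacted domain -> IPs that domain resolved to.
Fingerprint = Dict[str, Set[str]]

# Hypothetical fingerprints; both sites share a primary IP and a CDN.
FINGERPRINTS: Dict[str, Fingerprint] = {
    "news.example": {
        "news.example": {"192.0.2.10"},
        "cdn.example": {"198.51.100.5", "198.51.100.6"},
        "ads.example": {"203.0.113.7"},
    },
    "blog.example": {
        "blog.example": {"192.0.2.10"},
        "cdn.example": {"198.51.100.5", "198.51.100.6"},
    },
}


def match_score(fingerprint: Fingerprint, observed_ips: Iterable[str]) -> float:
    """Score in [0, 1]: domain coverage times share of observed IPs explained.

    This scoring rule is an illustrative assumption, not the paper's metric.
    """
    observed = set(observed_ips)
    if not fingerprint or not observed:
        return 0.0
    # A domain counts as covered if any of its resolved IPs was observed.
    covered = sum(1 for ips in fingerprint.values() if ips & observed)
    # Observed IPs that belong to some domain of this fingerprint.
    known_ips = set().union(*fingerprint.values())
    explained = len(observed & known_ips)
    return (covered / len(fingerprint)) * (explained / len(observed))


def identify(
    fingerprints: Dict[str, Fingerprint],
    observed_ips: Iterable[str],
    threshold: float = 0.9,  # assumed cut-off, chosen only for the example
) -> List[str]:
    """Return candidate sites whose score meets the threshold, best first."""
    observed = set(observed_ips)
    scored = [(match_score(fp, observed), site) for site, fp in fingerprints.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [site for score, site in ranked if score >= threshold]


if __name__ == "__main__":
    print(identify(FINGERPRINTS, ["192.0.2.10", "198.51.100.6", "203.0.113.7"]))
```

In this toy data the shared primary IP alone cannot separate the two co-hosted sites, while the combination of IPs contacted for secondary domains can, which is the intuition behind using the full set of connections rather than a single address.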
