A Deeper Look at Web Content Availability and Consistency Over HTTP/S
Total Page:16
File Type:pdf, Size:1020Kb
A Deeper Look at Web Content Availability and Consistency over HTTP/S Muhammad Talha Paracha*, Balakrishnan Chandrasekaran†, David Choffnes*, Dave Levin‡ *Northeastern University, †Max-Planck-Institut fur¨ Informatik, ‡University of Maryland, College Park Abstract—Recent studies of HTTPS adoption have found rapid progress towards an HTTPS-by-default web. However, these studies make two (sometimes tacit) assumptions: (1) That server- side HTTPS support can be inferred by the landing page (“/”) alone, and (2) That a resource hosted over HTTP and HTTPS has the same content over both. In this paper, we empirically show that neither of these assumptions hold universally. We crawl beyond the landing page to better understand HTTPS content unavailability and inconsistency issues that remain in today’s popular HTTPS-supporting websites. Our analysis shows that 1.5% of the HTTPS-supporting websites from the Alexa top 110k have at least one page available via HTTP but not HTTPS. Surprisingly, we also find 3.7% of websites with at least one URL where a server returns substantially different content over HTTP Fig. 1: Examples of webpages with different content using compared to HTTPS. We propose new heuristics for finding HTTP vs HTTPS. these unavailability and inconsistency issues, explore several root causes, and identify mitigation strategies. Taken together, our secure variant—as many papers and tools have suggested— findings highlight that a low, but significant fraction of HTTPS- risks unexpected side-effects in usability. supporting websites would not function properly if browsers use In this work, we empirically show that this conventional HTTPS-by-default, and motivate the need for more work on wisdom does not universally hold. We identify inconsistencies automating and auditing the process of migrating to HTTPS. between HTTP and HTTPS requests to the same URI, in Index Terms—HTTPS, Internet Measurements, HTTP/S Con- sistency terms of content unavailability over one protocol and content differences between them. Rather than restrict our study to I. INTRODUCTION landing pages, we conduct a deep crawl (up to 250 pages) of each of the Alexa top 100k sites, and for a 10k sample The importance and ease of deploying HTTPS websites of the remaining 900k sites on the Alexa top 1M list [7]. has increased dramatically over recent years. Recent studies We then analyze the retrieved pages to identify substantial of HTTPS adoption have found rapidly increasing adoption, content inconsistencies between HTTP and HTTPS versions leading some to speculate that most major websites will soon of the same page, using a novel combination of state-of-the-art be able to redirect all HTTP requests to HTTPS [1]. Indeed, heuristics. some client-side tools even go so far as to force all web We find that a small but significant fraction of sites have requests to be made via HTTPS [2]. content unavailability (1.5%) and differences (3.7%) for at However, to our knowledge, all previous work measuring least one page on the website. Fig. 1 provides two examples the deployment of HTTPS [1], [3]–[6] makes two basic but of how pages can differ significantly over HTTP/S. These fundamental assumptions: (1) They assess server-side HTTPS issues affect pages across all levels of popularity, with content support by looking at a single page: the domain’s landing page differences being slightly more likely on popular pages than (also known as its root document, “/”), and (2) They assume unpopular ones. While the number of pages per site with that any resource that is available over both HTTP and HTTPS inconsistencies is low on average, we find that such issues has the same content (in the absence of an attack, of course)— affect more than 50 pages for 15% of sites analyzed. We char- that is, that the only difference between http://URI and acterize common patterns for such inconsistencies, identifying https://URI is that the latter is served over TLS. that pages on websites are unavailable due to partial support Unfortunately, neither of these assumptions has been em- of HTTPS (some directories are never served over HTTPS), pirically evaluated. This is important because, if they do not different versions of webpage content available only over one hold, they threaten our understanding of how HTTPS is truly protocol, and misconfigurations related to redirections and deployed. For example, the fact that a landing page is (or is hard-coded port numbers. In addition, we find that content is not) secure might not necessarily indicate the security of the different between HTTP/S for reasons such as HTTPS-support site writ large. Moreover, if there are content differences be- “in name only” (default server page is returned over HTTPS), tween HTTP and HTTPS, then merely defaulting to the more redirections exclusive to one protocol, or misconfigured HTTP 978-3-903176-27-0 © 2020 IFIP status codes. Last, we return to the HTTPS adoption metrics used in prior research and evaluate their efficacy in light of Kumar et al. [13] explored how third-party dependencies our findings. can hinder HTTPS adoption. In their study, 28% of HTTP- The main contributions of this paper are a large-scale mea- only sites were blocked from migrating to HTTPS due to surement to reassess the consistency of HTTPS support beyond their reliance on at least one HTTP-only third-party Javascript the landing page, new heuristics for automatically identifying and/or CSS resource. But their analysis also assumed that significant content differences between HTTP/S, a character- resources available over both HTTP/S serve the same content, ization of observed HTTP/S inconsistencies, and analysis of as they suggested the other 72% sites to migrate to HTTPS mitigation opportunities. In addition to these main contribu- by merely updating the protocol in the resource URLs. tions, for the purposes of repeatability we have made our efficient web crawler code, dataset, and analysis code publicly HTTPS Misconfigurations Fahl et al. [14] surveyed 755 available at https://github.com/NEU-SNS/content-differences. webmasters of the sites with invalid HTTPS certificates, in an effort to determine whether the behavior was intentional II. RELATED WORK or accidental. Two-thirds of the respondents admitted to de- liberately using an invalid certificate, for reasons that include Relevant HTTPS attacks and mitigations There are using a custom PKI or administering test pages not intended severe security and privacy risks to users caused by a com- for real-world users. In this work, we find inconsistent content bination of HTTP-by-default behavior in web browsers, and and availability for HTTP/S-supporting websites and manually limited deployment of HTTPS support (e.g., [8]–[10]). Prior investigate a sample of cases to understand their root causes. work argues for HTTPS to be supported on every page of a website to mitigate these threats. In this paper, we investigate III. METHODOLOGY whether sites conform to this approach (and discover signifi- This section describes our measurement methods for crawl- cant discrepancies). ing websites and identifying inconsistencies. To force a web browser to use HTTPS to access a site, web servers may use HTTP Strict Transport Security [11] A. HTTP/S Inconsistencies (HSTS) and users can install extensions like HTTPS Ev- We focus our study on websites that support both HTTP and erywhere in their web browser [2]. While potentially help- HTTPS on the landing page. For this set of sites, we measure ful, these approaches rely on manually maintained lists1 of whether there are different results when accessing the same (non)conforming websites that cannot scale to the entire web. resource over HTTP and HTTPS. We categorize such cases of In light of such limitations, Sivakorn et al. [12] argue that inconsistencies between protocols as follows: an HTTPS-by-default approach might be the only solution. In Content unavailability This occurs when a resource is this work, we identify that there remain a small but substantial successfully retrieved via HTTP, but not HTTPS (i.e., HTTP fraction of sites with HTTPS deployment gaps (including those status code >= 400,2 or an error3). using HSTS) that must be addressed before the web is ready Content difference This occurs when a resource is available for it. over both HTTP and HTTPS, but the content retrieved differs HTTPS adoption There is a significant body of work on significantly (as defined in §III-C). measuring HTTPS adoption using both passive data sets and active scans (e.g., [1], [3]–[6]). Durumeric et al. [3] measured B. Crawling Overview HTTPS adoption using one year of repeated scans of the entire We limit our analysis to differences in HTML content served IPv4 address space. In contrast, Felt et al. [1] conducted active by a website, which we refer to as pages, and do not consider scans and combined them with passive browser telemetry embedded resources (e.g., CSS and Javascript files). When data—obtained from the Google Chrome and Mozilla Firefox crawling a website, we conduct a depth-first search of all browsers—to provide server- as well as client-based perspec- links to other webpages with the same second-level (TLD+1) tives on HTTPS adoption. Overall, the results from these prior domain—which we call internal links)—starting at the landing studies indicate a steady trend towards HTTPS-everywhere. page for the site. However, they either implicitly or explicitly assume that if The websites we crawled are a subset of the Alexa global the landing page is hosted via HTTPS then the same is true top 1 M domains. We chose this list because it represents sites for all of that website’s resources; and that when both HTTP visited by users via web browsers [15]. Our analysis covers and HTTPS versions of a page are available, then the content all of the top 100K most popular domains.