ABSTRACT
JUECKSTOCK, JORDAN PHILIP. Enhancing the Security and Privacy of the Web Browser Platform via Improved Web Measurement Methodology. (Under the direction of Alexandros Kapravelos.)
The web browser platform today serves as a dominant vehicle for commerce, communication, and content consumption, rendering the assessment and improvement of that platform’s user security and privacy important research priorities. Accurate web measurement via simulated user browsing across popular real-world web sites is essential to the process of assessing and improving web browser platform security and privacy, particularly when developing improved policies that can be deployed in production to millions of real-world users. However, the state of the art in web browser platform measurement instrumentation and methodology leaves much to be desired in terms of robust instrumentation, reproducible experiments, and realistic design parameters. We propose that enhancing web browser policies to improve privacy while retaining compatibility with legacy content requires robust and realistic web measurement methodologies leveraging deep browser instrumentation. This document comprises research results supporting the above-stated thesis. We demonstrate the limitations of shallow, in-band JavaScript (JS) instrumentation in web browsers, then describe and demonstrate an open source out-of-band instrumentation tool, VisibleV8 (VV8), embedded in the V8 JS engine. We show that VV8 consistently outperforms equivalent in-band instrumentation and provides coverage unavailable to in-band techniques, yet has proved readily maintainable across numerous updates to Chromium and the V8 JS engine. Next, we test the assumption, implicit in typical web measurement studies, that automated crawls generalize to the experience of typical web users, using a robustly controlled parallel web measurement experiment comparing observations from multiple network vantage points (VP) and via naive or realistic browser configurations (BC).
Our results indicate that VP and especially BC selection result in measurable shifts in HTTP traffic and JS behaviors observed from third-party content providers, underscoring the importance of realism in web measurement experiment design. Finally, we apply the insights gained from our work on instrumentation and experiment design to evaluate a novel web browser third-party storage policy designed to improve user protection against stateful online tracking while retaining compatibility with real-world content. Our evaluation results suggest that our proposed policy achieves its privacy and compatibility goals, as does Brave Software’s recent public deployment of a directly derived storage policy.

© Copyright 2021 by Jordan Philip Jueckstock

All Rights Reserved

Enhancing the Security and Privacy of the Web Browser Platform via Improved Web Measurement Methodology
by Jordan Philip Jueckstock
A dissertation submitted to the Graduate Faculty of North Carolina State University in partial fulfillment of the requirements for the Degree of Doctor of Philosophy
Computer Science
Raleigh, North Carolina
2021
APPROVED BY:
Anupam Das
William Enck
Bradley Reaves
Alexandros Kapravelos
Chair of Advisory Committee

DEDICATION
To my parents, who laid the moral and mental foundations of my life at great personal cost. To my wife, who built with me a loving and stable home for our three children and sustained it through this entire saga despite my late nights and frayed nerves. And to my Creator, without Whom none of this would matter. Soli Deo gloria.
BIOGRAPHY
Jordan Jueckstock was born in Princeton, West Virginia, and raised near Vicenza, Italy. He was homeschooled by his mother, a former secretary who never encountered a job too unimportant to do carefully, and by his father, a musicologist and former music teacher with a fearless talent for practical engineering. Jordan earned his Bachelor of Science in Computer Science from Bob Jones University (BJU) in Greenville, SC, in May 2009. After starting a graduate program at Clemson University the following fall, he transferred to the NSF CyberCorps program at The University of Tulsa in Tulsa, OK, completing a Master of Science in Computer Science there in December 2011. Following two-and-a-half years of work at the National Security Agency in Ft. Meade, MD, Jordan returned to BJU as an instructor. He set out to complete his formal education in computer science by joining the doctoral program at NC State in the fall of 2017. He collaborated with privacy researchers at Brave Software as a summer intern in 2020. Following his graduation from NC State, he will be resuming full-time teaching at BJU.
ACKNOWLEDGEMENTS
This document and the work it represents have been possible only with tremendous support, help, and encouragement from many people and sources. The following deserve particular attention and thanks for their essential role in whatever success I have achieved in this process:

...my advisor: Dr. Alexandros Kapravelos. Thanks to his proactive outreach, I actually missed out on that most stressful of freshman-PhD-student activities: finding an advisor. My advisor found me! His practical approach to research removed my chief barriers to entry, and his personal manner made meeting and working with him a genuine pleasure. His bleeding-edge approach to lab infrastructure may have caused me some uncomfortably deep dives into Kubernetes documentation and code, but it forced me to grow both my technical and management skills. He made me a researcher, to the extent that I am one; a better teacher; and a better hacker.

...my committee members: Drs. Will Enck, Brad Reaves, and Anupam Das. Individually they have provided both encouragement and challenges to me in classrooms, lab meetings, and personal conversations. As a committee, they have provided a healthy blend of confirmation, criticism, and counsel in directing me to the conclusion of my studies and rounding out my education in the art and science of research.

...my WSPR lab colleagues who shared valuable educational and technical advice, daily commiseration, and memorable life stories. At the risk of leaving out somebody important, memorable names (past and present) include: Micah Bushouse, Lucas Enloe, Abida Haque, Igibek Koishybayev, Nikolaos Pantelaios, and Isaac Polinsky. Two of my lab mates require special mention: Shaown Sarker and Kyle Martin. Shaown has shared with me friendship, serial coauthorship, intriguing philosophical discussion, and the special misery of debugging distributed systems written in NodeJS.
Kyle has shared with me friendship, serial late-night collaboration at DARPA hackathons, and the mystical bond of brothers-in-arms formed in joint combat against recalcitrant routers, switches, servers, and Ansible playbooks. He does not even hate me—too much, anyway—for making him learn Rust for that compiler class project.

...the collaborators and mentors I met working with Brave Software: Pete Snyder, Matteo Varvello, Panos Papadopoulos, and Ben Livshits. Special thanks to Pete for multiple research project ideas and collaborations, for engineering my Brave internship at the last possible moment, and for tearing apart and reworking my writing when necessary (which was ... frequently).

...my family, already mentioned but impossible to thank enough. My parents, John and Judy Jueckstock, deserve all credit for whatever positive character traits and skills I possessed when starting my higher education saga, to say nothing of life in general. My in-laws, David and Deborah Andrews, are responsible for the raising of the most wonderful woman in the world: Jessica Jueckstock, née Andrews, my darling wife. Our three children, Johnny, Josie, and Jadyn, have suffered much in the way of an absent-minded if not simply absent father at various points over the last four years, but their love and joy and energy in spite of it are reflections of their mother’s steadfast home-making magic. This is yours, too, Jessica. It simply could not have happened without you.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
Chapter 1 Introduction
    1.1 Thesis Statement
    1.2 Contributions
    1.3 Thesis Organization
Chapter 2 Background & Motivation
    2.1 Overview
    2.2 JavaScript Instrumentation for Browsers
        2.2.1 Trends and Trade-offs
        2.2.2 Fundamental Criteria
        2.2.3 The Case Against In-Band JS Instrumentation
        2.2.4 Summary
    2.3 Web Browser Storage & Security Policies
        2.3.1 Same-Origin Policy & Storage Basics
        2.3.2 User Tracking
        2.3.3 Threat Model
        2.3.4 Deployed Stateful Tracking Defenses
        2.3.5 Compatibility and Tracking Protections
Chapter 3 VisibleV8: In-browser Monitoring of JavaScript in the Wild
    3.1 Introduction
    3.2 System Architecture
        3.2.1 Chromium/V8 Internals
        3.2.2 VisibleV8 Implementation
        3.2.3 Performance
        3.2.4 Maintenance & Limitations
        3.2.5 Collection System
    3.3 Data Collection
        3.3.1 Methodology
        3.3.2 Data Post-Processing
        3.3.3 Results
    3.4 Bot Detection Artifacts
        3.4.1 Artifact Discovery Methodology
        3.4.2 Artifact Analysis Results
        3.4.3 Case Studies
    3.5 Related Work
    3.6 Conclusion
    3.7 Availability
Chapter 4 Towards Realistic and Reproducible Web Crawl Measurements
    4.1 Introduction
    4.2 Methodology
        4.2.1 Approach to Realism
        4.2.2 Realism Variables
        4.2.3 Control Constants
        4.2.4 Web Site Selection
        4.2.5 Implementation Details
        4.2.6 Precautions & Pilot Experiments
        4.2.7 Quantifying Measurement Bias
    4.3 Results
        4.3.1 Refusenik Sites
        4.3.2 Volume Biases in HTTP Traffic
        4.3.3 Content-Level Biases in JavaScript
    4.4 Discussion and Future Work
        4.4.1 Application to Future Research
        4.4.2 Limitations
        4.4.3 Future Work
    4.5 Related Work
        4.5.1 Generalizability
        4.5.2 Cloaking and Bot Detection
        4.5.3 Network Endpoint Discrimination
    4.6 Conclusion
Chapter 5 Page-Length Storage: A Solution to the Privacy vs. Compatibility Trade-off in Preventing Third-Party Stateful Tracking
    5.1 Introduction
    5.2 Design & Implementation
        5.2.1 Policy Design
        5.2.2 Prototype Implementation
        5.2.3 Implementation Remarks
    5.3 Methodology
        5.3.1 Stateful Crawl Methodology
        5.3.2 Primary Evaluation Methodology
        5.3.3 Performance Evaluation
    5.4 Results
        5.4.1 Stateful Crawl Statistics
        5.4.2 Privacy: Cross-Site Tracking Potential
        5.4.3 Privacy: Cross-Time Tracking Potential
        5.4.4 Compatibility: Quantitative Assessment
        5.4.5 Compatibility: Qualitative Assessment
        5.4.6 Performance Evaluation
    5.5 Discussion
        5.5.1 Limitations
        5.5.2 Next Steps
    5.6 Related Work
    5.7 Conclusion
Chapter 6 Conclusions & Future Work
    6.1 Thesis Statement Revisited
    6.2 Directions for Future Work
BIBLIOGRAPHY
LIST OF TABLES
Table 2.1 Survey of published JS instrumentation systems
Table 3.1 Final domain status after collection
Table 3.2 Bot detection seed artifacts
Table 3.3 Candidate artifacts classified
Table 3.4 Highest ranked visit domains probing identified bot artifacts
Table 3.5 Top security origin domains probing bot artifacts
Table 3.6 Most-probed bot artifacts
Table 4.1 Some “refusenik” sites always fail navigation from a single configuration but not its complements
Table 4.2 Total HTTP requests by EasyList/EasyPrivacy match and frame context
Table 4.3 Many domains showing no overall request volume bias serve script content with distinct VP/BC biases
Table 5.1 Candidate URL deviations as assessed by holistic manual grading (n=100)
Table 5.2 Per-policy/per-temperature mean JS execution time for third-party execution contexts, normalized relative to the permissive baseline
LIST OF FIGURES
Figure 2.1 Reference monitors (RM) in traditional OS & application security
Figure 2.2 Third-party storage (a) fully allowed, (b) fully blocked, (c) partitioned by first-party context, and (d) scoped to hosting page life time (our proposal). A, B, & T are distinct domains; T is embedded as a third-party within A & B.
Figure 2.3 Stock market graph broken by strict third-party storage blocking (left) and working with page-length storage (right)
Figure 3.1 V8 architecture with VV8’s additions
Figure 3.2 Instrumentation performance on BrowserBench.org [Bro] and Mozilla Dromaeo [Dro]
Figure 3.3 The complete data collection and post-processing system
Figure 3.4 Cumulative feature use over the Alexa 50k
Figure 4.1 Workflow from Domain List to Target Server
Figure 4.2 Distributions of cross-VP request volume bias by 3rd-party domains (stealth BC only); nearly twice as many domains consistently favor residential VP over the cloud VP as vice versa (3.5% > 1.8%)
Figure 4.3 Distribution of stealth-vs.-naive traffic volume bias scores for 3rd-party domains (residential VP only); more symmetric than its cross-VP counterparts
Figure 4.4 Distributions of cross-VP ad/tracker request volume bias by 3rd-party domains (stealth BC only); little change from global cross-VP distributions
Figure 4.5 Distribution of stealth-vs.-naive ad/tracker traffic volume bias scores for 3rd-party domains (residential VP only); BC bias is more common among these domains than the global population
Figure 4.6 Distributions of stealth-vs.-naive ad/tracker traffic volume bias scores for 3rd-party domains (residential VP only) broken down by browser frame context; sub-frames show radically different (and less intuitive) BC bias distributions than main frames
Figure 4.7 Distribution of cross-VP execution frequency bias for families of JS code (stealth BC only)
Figure 4.8 Distribution of stealth-vs.-naive execution frequency bias for families of JS code (residential VP only)
Figure 5.1 Crawl success rate varied modestly across policies but was always reasonably high
Figure 5.2 Of our tested policies, all but permissive essentially eliminated stateful cross-site tracking potential
Figure 5.3 Cross-time tracking comparisons considering cookies and local storage
Figure 5.4 Our page-length policy produces page behaviors within third-party frames much closer to the permissive baseline than does the breakage-prone strict third-party storage blocking policy
Figure 5.5 Time to Largest Contentful Paint: page-length is always comparable/favorable to “cold” permissive
Figure 5.6 Policy overhead within JS execution time and HTTP request volume
CHAPTER 1

INTRODUCTION
1.1 Thesis Statement
The undeniable dominance of the web browser platform as a vehicle for commerce, communication, and content consumption makes assessment and improvement of the platform’s security and privacy compelling research priorities. Securing control of the browser platform itself against, e.g., 0-day vulnerabilities, is an obvious imperative with obvious solutions. Browser vendors have become aggressively adept at finding and patching critical vulnerabilities and rolling out new browser versions at a dizzying pace. Securing user identity and assets against vulnerabilities in web applications is likewise a generally straightforward problem that calls for straightforward solutions (i.e., patches) or workarounds. Such solutions may or may not involve changing the web browser platform, though robustness against common vulnerabilities of the past is a criterion that has guided the development of modern browser mechanisms such as the SameSite cookie1 attribute. Securing the privacy of user interests and associations is a much more subtle and ultimately harder problem, one that typically defies straightforward solutions. Such solutions as may be designed might call for web browser platform changes that are unpopular with prominent players in a complex online commerce ecosystem rife with paradoxical and even perverse incentives.

Essential to the process of assessing and improving web browser platform security and privacy is the ubiquitous and often haphazard practice of web measurement. A typical web measurement study attempts to approximate the user browsing experience across top-ranked web sites while extracting and measuring features of interest from the pages visited. Security studies can quantify how many servers or sites are vulnerable to specific, newly identified attacks. Studies of platform
1https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Set-Cookie/SameSite
evolution and emerging threats can identify web browser platform features that are popular, obscure, obsolete, or being put to surprising (and potentially harmful) uses. Privacy studies measure the pervasiveness of online tracking to quantify just how widely (and accurately) user interests and associations can be tracked across sites.

The state of the art in web browser platform measurement instrumentation and methodology leaves much to be desired. To begin, even though web measurement is pervasive in security and privacy research involving the web platform, the specific measurement methodologies employed are unfortunately often poorly documented, much less repeatable. Furthermore, typical approaches to browser instrumentation (e.g., browser extensions or automation tools injecting JavaScript prototype patches into the global object) are subject to platform limitations in coverage and performance and may further be exposed to easy identification by evasive or even hostile content. Finally, crawls using unrealistic network endpoints or browser configurations are vulnerable to measurement bias caused by selective or even adversarial response.

Incomplete or biased measurement results stand as obstacles to improving the security and privacy of the web browser platform, since the scale of a given problem, or the efficacy of a proposed solution, or both, may not be correctly observed. Furthermore, even where a novel browser mechanism or policy could demonstrably improve user security or privacy, the problem of evaluating the solution’s compatibility with real-world legacy content remains serious and unsolved in the general case. An effective but incompatible, site-breaking solution is no solution at all, as it cannot be deployed at scale to real-world users. The traditional approach to evaluating site breakage inevitably relies on human agents to define “breakage,” preventing effective scaling.
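To make the identification risk of injected prototype patches concrete, consider a minimal sketch, runnable in Node.js. A measurement script that patches a built-in function in-band is trivially distinguishable from the native original, because native functions stringify to a `[native code]` stub:

```javascript
// In-band instrumentation patches a built-in API in place, as an injected
// measurement script would (the logging body is a stand-in for a real tool).
const realGetTime = Date.prototype.getTime;
Date.prototype.getTime = function () {
  // ...a real tool would record this call somewhere...
  return realGetTime.call(this);
};

// Evasive page code can probe for exactly this kind of patch: native
// functions stringify to "function getTime() { [native code] }".
const looksNative = (fn) =>
  /\[native code\]/.test(Function.prototype.toString.call(fn));

console.log(looksNative(realGetTime));            // true
console.log(looksNative(Date.prototype.getTime)); // false: the patch is exposed
```

An in-band tool can try to hide by also patching `Function.prototype.toString`, but that patch is itself subject to the same kind of detection, which is one reason stealthy in-band instrumentation is so difficult.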
We summarize these research problems impeding the improvement of web browser security and privacy and our contributions toward solving them in the following thesis statement:
Enhancing web browser policies to improve privacy while retaining compatibility with legacy content requires robust and realistic web measurement methodologies leveraging deep browser instrumentation.
The sufficiency of such measurement methodologies to assess the real-world practicality of novel privacy policies is demonstrated by the results presented in Chapter 5 and by the recent adoption and deployment within the Brave browser of a policy directly derived from that proposed and evaluated in our work. Their necessity is supported by findings from our study of in-browser JavaScript instrumentation (Chapter 3) compared to its popular alternatives and our study of measurement bias attributable to choice of network vantage point and browser realism when performing web crawls (Chapter 4).
1.2 Contributions
The work documented in this thesis makes the following specific research contributions:
• Documentation of critical limitations of in-band browser instrumentation. We illustrate how browser platform and JavaScript language peculiarities make comprehensive and stealthy JavaScript API instrumentation implemented in-band, via injected JavaScript (JS) code, impractical if not impossible.
• Presentation, release, and demonstration of effective and maintainable out-of-band JS instru- mentation. Despite being deeply embedded into the V8 JS engine, VisibleV8 has been readily maintained from Chromium 63 to Chrome 84 without major effort. It enabled discovery of 46 client-side bot-detection artifacts, in use by content on 29% of the Alexa top 50k sites, that would not have been detectable with conventional in-band JS instrumentation. It has since proved useful to independent research efforts, such as a recently published study of JavaScript obfuscation in the wild [Sar20].
• Design, documentation, and evaluation of a robustly controlled methodology for assessing web measurement bias attributable to design criteria such as network vantage point or browser configuration. We compare three vantage points (university, cloud, and residential networks), finding a modest but consistent bias effect of third-party content providers serving more traffic to non-cloud clients. To assess browser realism’s impact, we compare a naive headless client with a non-headless browser hardened against basic bot detection, finding significant bias in traffic volume served by third-party domains (up to 19% of domains associated with advertising and user tracking). Other results suggest that our findings represent a lower bound on the real effect of unrealistic vantage points or browser configurations. We go to significant lengths to eliminate all other sources of client-side noise in web measurement, documenting and where possible empirically justifying all experiment parameters.
• Design, implementation, and evaluation of a novel third-party browser storage policy which preserves both user privacy and content compatibility. The policy uses elements of existing mechanisms (ephemeral storage, storage partitioning by first-party domain context) in a novel way, restricting third-party storage to the lifetime of a top-level document. This approach prevents not only traditional cross-site tracking but also more subtle threats such as cross-visit session linking within one site context. Our evaluation experiment uses the open-source PageGraph deep browser instrumentation system to collect and compare fine-grained elements of content behavior among our proposed policy and several representative alternatives. The results show tracking prevention equivalent to full blocking of third-party storage (effective, but known to break popular content) while maintaining compatibility comparable to the default baseline (allowing all third-party storage).
• Direct influence on a new privacy-preserving storage policy recently deployed in production by Brave Software. Despite modest relaxations in policy strictness to facilitate multi-tab browsing scenarios, the deployed policy is substantially the same as ours, confirming both the effectiveness of our policy proposal and the general accuracy of our compatibility assessment.
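The page-length storage policy described above can be modeled in miniature. This is an illustrative sketch only, not the actual implementation (which modifies Chromium’s storage layer); the `PageLengthStorage` class and its methods are invented for the example:

```javascript
// Minimal model of page-length third-party storage: each top-level page load
// gets a fresh, isolated store per embedded third-party origin, and the whole
// store is discarded when the page is unloaded.
class PageLengthStorage {
  constructor() {
    this.stores = new Map(); // thirdPartyOrigin -> Map(key -> value)
  }
  // Stands in for a third-party frame's cookie/localStorage writes.
  setItem(origin, key, value) {
    if (!this.stores.has(origin)) this.stores.set(origin, new Map());
    this.stores.get(origin).set(key, value);
  }
  getItem(origin, key) {
    const store = this.stores.get(origin);
    return store ? (store.get(key) ?? null) : null;
  }
}

// During one page visit, an embedded tracker writes an identifier and can
// read it back, so storage-dependent content keeps working...
const pageA = new PageLengthStorage();
pageA.setItem("https://tracker.example", "uid", "1234");
console.log(pageA.getItem("https://tracker.example", "uid")); // "1234"

// ...but unloading the page discards the store, so a later visit (to this
// or any other site) starts empty and the identifier cannot link visits.
const pageB = new PageLengthStorage();
console.log(pageB.getItem("https://tracker.example", "uid")); // null
```

The key design choice the sketch captures is that compatibility comes from allowing writes and reads within a page’s lifetime, while privacy comes from never letting that state survive the page.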
1.3 Thesis Organization
In Chapter 3, we describe VisibleV8, a system for improving the coverage and stealth of JavaScript API instrumentation in web browsers. We identify in-band (i.e., non-browser-modifying) instrumentation approaches as the prevailing state of the art in comparable work, and we then identify fundamental limitations of the browser platform and JavaScript language that inhibit effective in-band instrumentation. We present the design and implementation of an out-of-band JavaScript native API instrumentation framework providing greatly improved visibility over existing alternatives. VisibleV8 allows us to detect pages’ attempts to probe for the existence of non-standard properties on built-in objects in the native browser APIs, a behavior associated with “bot detection,” the classification of browsers visiting a site as automated robots, or “bots.” We use VisibleV8 to discover novel bot detection artifacts and to measure the prevalence of testing for them across the Alexa top 50K sites, finding a significant number (29%) to engage in apparent bot detection via such probes.

In Chapter 4, we describe and execute a methodology for measuring the impact of expected realism factors on repeated measurements of web network traffic and JavaScript behavior. We perform a large-scale, repeated, synchronized stateless crawl of the Tranco top 25K sites, capturing network request data and JavaScript execution traces, the latter via VisibleV8. Realism factors compared are network vantage point (VP; cloud vs. residential vs. university) and browser configuration (BC; off-the-shelf automated headless browser vs. automated browser camouflaged to appear like a typical desktop browser).
While overall traffic levels stayed comparable across all variables, significant numbers of domains showed consistent, large-scale differences in traffic volume depending on the realism of the VP or BC used, including up to 19% of domains serving traffic flagged by popular filter lists (EasyList, EasyPrivacy). JavaScript loading and execution analysis revealed similar patterns of selective behavior favoring one VP or BC over another. We conclude that simple realism factors, either unconsidered or left undocumented in too many web measurement studies [Ahm20], can significantly impact measurement results that are closely related to security and privacy (namely, context-sensitive script behavior and HTTP traffic to known advertising and tracking domains).

In Chapter 5, we propose and evaluate a policy improvement for the web browser platform to prevent stateful user tracking without breaking user experience on sites embedding third-party content. Our policy is inspired by a simple insight: third-party browser storage allows tracking, but only when accessed later in time or from the context of a different site. To prevent tracking without breaking content that assumes it can access third-party storage, simply restrict the scope and lifetime of that storage to the lifespan of a single loaded web page in the browser so it cannot be accessed on a subsequent visit to that or any other site. Our policy proves straightforward, if not easy, to prototype as minor modifications to Chromium. We perform quantitative and qualitative measurement studies to assess the privacy and compatibility efficacy of our proposed policy. Our quantitative measurements are heavily informed by our prior work: out-of-band (i.e., in-browser) instrumentation, in this case using PageGraph [Iqb; Che21], giving deep coverage with good performance and stealth; and a carefully documented crawl design with realism factors informed by our multi-endpoint study (a university VP and a non-headless BC). Our results support our expectations of improved privacy without a high cost in web compatibility. The Brave browser (on which PageGraph is based) has recently adopted a new third-party storage policy directly derived from ours, demonstrating that our approach can produce effective privacy enhancements that are sufficiently robust to deploy in the real world.
CHAPTER 2

BACKGROUND & MOTIVATION
2.1 Overview
This section provides background material, first on the technical challenges and pitfalls our work faces and then on the policy landscape we seek to improve. Our thesis presumes a solid understanding of both the technical and policy details underlying the evolution of user privacy in web browsers. The choice of deep browser instrumentation over its traditional alternatives is motivated by a solid grasp of modern browser architecture and the distinctive semantics of the JavaScript language at the heart of rich web content and browser-based applications. Any proposed solution in the applied problem space of user tracking and legacy compatibility is constrained by the details of both historical and modern browser security and privacy policies, or the lack thereof.

Web browsers are complex, not only in their implementation details but also in their conceptual structure and policy details. Modern browser implementations must parse dozens of complex data formats, perform typographic layout and graphical rendering of large documents with high quality and low latency, and compile and execute arbitrary programs written in a highly dynamic scripting language (JavaScript) with extremely high performance relative to traditional scripting language interpreters. Unsurprisingly, then, modern browser implementations are large, typically comprising millions of lines of code.

The rapid, ad hoc evolution of browser features over three decades has produced a sometimes incoherent platform stance toward security and privacy. Broadly standardized privacy policies (e.g., the Same Origin Policy, or SOP) exist but are marked by edge cases within, and inconsistencies between, vendor implementations. Recent societal and industry growth in privacy awareness has prompted various mechanisms and policies, from both browser vendors and independent privacy advocates, to mitigate traditional weaknesses in the browser’s privacy stance. This trend is encouraging but also increases the complexity of the platform and complicates any attempt to introduce universally helpful policy improvements.
2.2 JavaScript Instrumentation for Browsers
2.2.1 Trends and Trade-offs
The state of the art in web measurements for security and privacy relies on full browsers, not simple crawlers or browser-simulators, to visit web sites. Experience with the OpenWPM framework [Eng16] indicated measurements from standard browsers (e.g., Firefox and Chromium) to be more complete and reliable than those from lightweight headless browsers (e.g., PhantomJS [Pha]). However, the question of how best to monitor and measure JS activity within a browser remains open. Assuming an open-source browser, researchers can modify the implementation itself to provide in-browser (i.e., out-of-band) JS instrumentation. Alternatively, researchers can exploit the flexibility of JS itself to inject language-level (i.e., in-band) instrumentation directly into JS applications at run-time. We provide a summary of recent security- and privacy-related research that measured web content using JS instrumentation in Table 2.1. Note that here “taint analysis” implies “dynamic analysis” but additionally includes tracking tainted data flows from source to sink. Fine-grained taint analysis is a heavy-weight technique, as is comprehensive forensic record and replay, so it is not surprising that these systems employed out-of-band (in-browser) implementations in native (C/C++) code. Lighter-weight methodologies that simply log (or block) use of selected API features have been implemented both in- and out-of-band, but the in-band approach is more popular, especially in more recent works.
Table 2.1 Survey of published JS instrumentation systems
System | Implementation | Role | Problem | Platform

OpenWPM [Das18; Mir17; Eng16] [1] | In-band | Dynamic analysis | Various | Firefox SDK [2]
Snyder et al., 2016 [Sny16] | In-band | Dynamic analysis | Attack surface | Firefox SDK
FourthParty [May12] | In-band | Dynamic analysis | Privacy/tracking | Firefox SDK
TrackingObserver [Roe12] | In-band | Dynamic analysis | Privacy/tracking | WebExtension
JavaScript Zero [Sch18] | In-band | Policy enforcement | Side-channels | WebExtension
Snyder et al., 2017 [Sny17a] | In-band | Policy enforcement | Privacy/tracking | Firefox SDK
Li et al., 2014 [Li14] | Out-of-band | Dynamic analysis | Malware | Firefox (unspecified)
FPDetective [Aca13] [3] | Out-of-band | Dynamic analysis | Privacy/tracking | Chrome 32
WebAnalyzer [Sin10] | Out-of-band | Dynamic analysis | Privacy/tracking | Internet Explorer 8
JSgraph [Li18] [4] | Out-of-band | Forensic record/replay | Malware/phishing | Chrome 48
WebCapsule [Nea15] | Out-of-band | Forensic record/replay | Malware/phishing | Chrome 36
Mystique [Che18] | Out-of-band | Taint analysis | Privacy/tracking | Chrome 58
Lekies et al. [Lek13; Lek17] | Out-of-band | Taint analysis | XSS | Chrome (unspecified)
Stock et al. [Sto15] | Out-of-band | Taint analysis | XSS | Firefox (unspecified)
Tran et al., 2012 [Tra12] | Out-of-band | Taint analysis | Privacy/tracking | Firefox 3.6

[1] Including only OpenWPM usage depending on JS instrumentation
[2] Supported only through Firefox 52 (end-of-life 2018-09-05)
[3] Used both PhantomJS and Chrome built from a common patched WebKit
[4] Binaries only
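The in-band approach favored by many of the surveyed systems boils down to wrapping target API functions from injected JavaScript so that every call is logged before being forwarded. A minimal sketch of such a logging shim follows; the `instrument` helper and the stand-in object are invented for illustration (a real tool would target browser APIs such as `document.createElement`):

```javascript
// Sketch of lightweight in-band instrumentation: wrap a target API function
// so each call is recorded (feature name + arguments) and then forwarded.
const log = [];

function instrument(obj, name) {
  const original = obj[name];
  obj[name] = function (...args) {
    log.push({ api: name, args });     // record the API use
    return original.apply(this, args); // preserve the original behavior
  };
}

// Stand-in for a browser API object, so the sketch runs outside a browser.
const fakeNavigator = {
  sendBeacon(url, data) { return true; }
};
instrument(fakeNavigator, "sendBeacon");

fakeNavigator.sendBeacon("https://stats.example/ping", "payload");
console.log(log[0].api); // "sendBeacon"
```

The simplicity of this pattern explains its popularity, but note that the wrapper lives in the same execution environment as the code it monitors, which is the root of the coverage and stealth limitations discussed in this chapter.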
[Figure 2.1 diagram: untrusted JS code runs in “user space” above the browser’s “kernel space”; a traditional RM sits in the browser layer beneath the execution context, while an inlined RM cohabits the execution context itself. Legend: protected browser API; instrumentation layer.]
Figure 2.1 Reference monitors (RM) in traditional OS & application security
2.2.2 Fundamental Criteria
The problem of monitoring untrusted code dates to the very dawn of computer security and inspired the concept of a reference monitor [And72], a software mediator that intercepts and enforces policy on all attempted access to protected resources. The traditional criteria of correctness for reference monitors are that they be tamper proof, be always invoked (i.e., provide complete coverage), and be provably correct, though this last element may be lacking in practical implementations. For security and privacy measurements, we add the additional criterion of stealth: evasive or malicious code should not be able to trivially detect its isolation within the reference monitor and hide its true intentions from researchers, since such an evasion could compromise the integrity of the derived results. A classic example of a practical reference monitor is an operating system kernel: it enforces access control policies on shared resources, typically using a rings of protection scheme (Figure 2.1) assisted by hardware. In order to enforce security policies like access controls and audits, the kernel must run in a privileged ring. Untrusted user code runs in a less-privileged ring, where the kernel (i.e., the reference monitor) can intercept and thwart any attempt to violate system policy. Alternatively, inlined reference monitors (IRMs) [Erl03] attempt to enforce policy while cohabiting the user ring with the monitored application, typically by rewriting and instrumenting the application’s code on the fly at load or run time. On the web, the browser and JS engine provide the equivalent of a kernel, while JS application code runs in a “user ring” enforced by the semantics of the JS language. JS instrumentation in general
is a kind of reference monitor; implemented in-band, it constitutes an IRM. We argue that the JS language’s inherent semantics and current implementation details make it impossible to build sound, efficient, general-purpose IRMs in JS on modern web browsers.
2.2.3 The Case Against In-Band JS Instrumentation
The standard approach to in-band JS instrumentation, which we call “prototype patching,” is to replace references to target JS functions or objects with references to instrumented wrapper functions or proxy objects1. The wrappers can access the original target through references captured in a private scope inaccessible to any other code. Note that the target objects themselves are not replaced or instrumented, only the references to them (a potential pitfall highlighted in prior work [Mey10]). Structural Limits. The JS language relies heavily on a global object (window in browsers) which doubles as the top-level namespace. There is no mutable root reference to the global object, and thus no way to replace it with a proxy version. Specific properties of the global object may be instrumented selectively, but this process naturally requires a priori knowledge of the target properties. In-band instrumentation cannot be used to collect arbitrary global property accesses, as required for our methodology in Section 3.4. This limitation means that in-band JS instrumentation fails the complete-coverage criterion. Policy Limits. Not all Chrome browser API features can be patched or wrapped, by design policy. These features can be identified using the WebIDL [Webb] (interface definition language) files included in the Chromium sources. For Chrome 64, these files defined 5,755 API functions and properties implemented for use by web content (there are more available only to internal test suites). Of these, 21 are marked Unforgeable and cannot be modified at all. Notably, this set includes window.location and window.document, preventing in-band instrumentation of arbitrary-property accesses on either of these important objects. Again, such a restriction would have eliminated many of our results in Section 3.4.
1 Other forms of JS IRM exist, like nested interpreters [Cao12; Ter12] and code-rewriting systems [Chu15], but these have not yet proven fast enough for real-world measurement work.
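To make the technique concrete, the prototype-patching pattern can be sketched in a few lines of JS. This is an illustrative stand-in (the api object and calls array are invented for the example) rather than code from any real instrumentation system:

```javascript
// Sketch of prototype patching: replace a reference to a target API
// function with an instrumented wrapper that logs and then delegates.
function patchMethod(obj, name, log) {
  const original = obj[name];            // captured in a private scope
  obj[name] = function (...args) {
    log.push({ name, args });            // instrumentation side channel
    return original.apply(this, args);   // delegate to the real target
  };
}

// Usage against a stand-in "API" object (a real patcher would target
// e.g. Document.prototype.createElement in a browser).
const api = { createElement: (tag) => ({ tag }) };
const calls = [];
patchMethod(api, 'createElement', calls);
const el = api.createElement('div');     // logged, then delegated
```

Note that only the reference obj[name] changes; any code holding a pre-patch reference to the original function bypasses the wrapper entirely.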
/* from https://cdn.flashtalking.com/xre/275/2759859/1948687/js/j-2759859-1948687.js */
/* (all variable names original) */
var badWrite = !(document.write instanceof Function &&
    ~document.write.toString().indexOf('[native code]'));

/* (later on, among other logic checks) */
if (badWrite || o.append) {
    o.scriptLocation.parentNode.insertBefore(
        /* omitted for brevity */);
} else {
    document.write(div.outerHTML);
}
Listing 2.1 Prototype patch evasion in the wild
Patch Detection. Prototype patches of native API functions (or property accessors) can be detected directly and thus fail the criterion of stealth. JS functions are objects and can be coerced to strings. In every modern JS engine, the resulting string reveals whether the function is a true JS function or a binding to a native function. Patching a native function (e.g., window.alert) with a non-native JS wrapper function is a dead giveaway of interposition. The function-to-string probe has been used to detect fingerprinting countermeasures [Vas18b] and appears commonly in real-world JS code. In many cases, such checks appear strictly related to testing available features for browser compatibility. But there also exist cases like Listing 2.1, in which the script changes its behavior in direct response to a detected patch. Function-to-string probe evasions abound, from the obvious (patch the right toString function, too) to the subtle. In Listing 2.2, the "[native code]" string literal in the patch function appears in the output of toString and will fool a sloppy function-to-string probe that merely tests for the presence of that substring.
/* from https://clarium.global.ssl.fastly.net
   ("..." comment means irrelevant portions elided for brevity) */
patchNodeMethod: function(a) {
    var b = this,
        c = Node.prototype[a];
    Node.prototype[a] = function(a) {
        "[native code]";
        var d = a.src || "";
        return /*...*/
            c.apply(this, arguments)
    }
    /*...*/
}
Listing 2.2 Function patches hiding in plain sight
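The probe both listings revolve around is easy to state directly. The sketch below assumes V8-style output from Function.prototype.toString, where native bindings stringify with a { [native code] } body; the wrapped function is an invented example:

```javascript
// Minimal function-to-string probe: native bindings stringify to a
// body of "{ [native code] }"; a plain JS wrapper function does not.
function looksNative(fn) {
  return typeof fn === 'function' &&
    /\{\s*\[native code\]\s*\}$/.test(Function.prototype.toString.call(fn));
}

const realMax = Math.max;
const wrapped = function max(...args) { return realMax(...args); };

// On V8, looksNative(realMax) holds while looksNative(wrapped) does not,
// exposing the patch.
```

Requiring the surrounding braces defeats the string-literal trick from Listing 2.2, but the probe itself can still be blinded by also patching Function.prototype.toString, which is exactly the arms race described above.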
Let us assume a “perfect” patching system invisible to toString probes has been used to instrument createElement, the single most popular browser API observed in our data collection across the Alexa 50k (Section 3.3). Such a patch is still vulnerable to a probe that exploits JS’s type coercion rules with a Trojan argument to detect patches on the call stack at runtime (Listing 2.3). For brevity, the provided proof-of-concept calls the Error constructor, which could itself be patched, but there are other ways of obtaining a stack trace in JS. The Byzantine complexity of JS’s pathologically dynamic type system offers many opportunities for callback-based exposure of patches and proxies via stack traces. Here prototype patches face a cruel dilemma: either invoke the toString operation and open the gates to side effects (allowing detection and evasion), or refuse to invoke it and break standard JS semantics (also allowing detection and evasion).
function paranoidCreateElement(tag) {
    return document.createElement({
        toString: function() {
            var callers = new Error().stack.split('\n').slice(1);
            if (/at paranoidCreateElement/.test(callers[1])) {
                return tag;  /* no patch */
            } else {
                throw new Error("evasive action!");  /* patched! */
            }
        },
    });
}
Listing 2.3 Trojan argument attack (Chrome variant)
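The coercion hook Listing 2.3 exploits is plain JS semantics and can be reproduced outside the browser. In this sketch, fakeNativeApi is an invented stand-in for a (possibly patched) native API, and the exact stack-string format is engine-specific:

```javascript
// Type-coercion side channel: passing an object where a string is
// expected invokes its toString, which can capture a stack trace and
// observe which functions (e.g., unexpected wrappers) called the API.
let capturedStack = null;

function fakeNativeApi(tag) {
  return String(tag);   // coerces the Trojan argument to a string
}

fakeNativeApi({
  toString() {
    capturedStack = new Error().stack;  // observe callers at runtime
    return 'div';
  },
});
// capturedStack now lists the frames above toString, so evasive code
// can check whether an instrumentation wrapper appears on the stack.
```

A patched fakeNativeApi would add its wrapper frame to this trace, which is precisely what the Trojan argument in Listing 2.3 tests for.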
Patch Subversion. Finally, prototype patches can be subverted through abuse of <iframe> elements.
/* from https://an.yandex.ru/resource/context_static_r_4583.js */
/* (some names changed for clarity; cachedJSON is initially null) */
if (window.JSON
        && a.checkNativeCode(JSON.stringify)
        && a.checkNativeCode(JSON.parse))
    return window.JSON;
if (!cachedJSON) {
    var t = getInjectedIFrameElement();
    cachedJSON = t.contentWindow.JSON;
    var e = t.parentNode;
    e.parentNode.removeChild(e)
}
Listing 2.4 Patch subversion in the wild
In Listing 2.4, the script resorts to frame injection to avoid patched JSON encoding/decoding functions. Short of blocking the creation of all sub-documents, in-band instrumentation can defeat this trick only if it is guaranteed to run first inside every newly created JS global context. Such a guarantee is available through some browser automation and debugging frameworks, like Chrome DevTools. But the web extension APIs of both Firefox and Chrome, as currently implemented, do not provide such a guarantee to extension authors, effectively crippling privacy or security (or research [Roe12; Sny16; Sch18]) extensions that rely on this technique. (The author of a prior related work [Sny17a] reported this bug to both the Firefox [Buga] and Chrome [Bugb] projects; neither report has been resolved at the time of this writing, and we have confirmed that Chrome 71 is still affected.)
2.2.4 Summary
Robust JS instrumentation systems must be tamper proof, must provide comprehensive coverage, and must not introduce unmistakable identifying artifacts. At present, JS language semantics and browser implementation details prevent in-band implementations from meeting these criteria. We believe that security-critical JS instrumentation, like traditional operating system auditing and enforcement logic, belongs in “kernel space,” i.e., within the browser implementation itself. But to be useful such a system must be cost-effective, both to develop and to maintain.
2.3 Web Browser Storage & Security Policies
Modern browser technologies and the security policies that govern them are complex, so we must clearly define our terms and provide essential background on browser storage policy, user tracking techniques, and the state of the art in tracking countermeasures.
2.3.1 Same-Origin Policy & Storage Basics
Sites & Origins. Browsers isolate storage (e.g., cookies, localStorage, IndexedDB) according to the Same-Origin Policy (SOP) [Wha]. The SOP has grown complex, multifaceted, and inconsistent [Sch17], and applies to many aspects of the web; here we describe only its most basic and universal elements, particularly as they relate to storage. An origin comprises a scheme (e.g., https), a complete DNS hostname, and an optional TCP port number. All state-impacting activities in a browser are associated with an origin derived from some relevant URL. For example, a script’s execution origin is derived from the URL of the frame in which the script executes, and an HTTP request origin is derived from the URL being fetched. Many activities are restricted to same-origin boundaries. For example, a script executing in origin A cannot access cookies stored for origin B. This is true even when a sub-document from origin B is embedded in a page from origin A. Storage is strictly isolated according to the SOP: scripts can access cookies and DOM storage (e.g., localStorage) only for their execution origin, and HTTP requests store and transmit cookies only for their destination origin. First and Third Parties. We now define two terms used through the rest of this paper, first-party and third-party. These terms are not unique to this work, but are frequently used to mean similar
[Figure 2.2 diagram: four panels of pages/frames from sites A and B embedding third-party T, with the storage each sees: (a) traditional/permissive (one shared storage jar for T), (b) blocking third-party storage (no jar for T), (c) partitioning third-party storage by site (separate T/A and T/B jars), (d) page-length third-party storage.]
Figure 2.2 Third-party storage (a) fully allowed, (b) fully blocked, (c) partitioned by first-party context, and (d) scoped to hosting page lifetime (our proposal). A, B, & T are distinct domains; T is embedded as a third-party within A & B.
but not-quite-the-same things in research and web standards, so we define their use in this work explicitly. When loading a website, the first-party is the “site” portion of the top-level document. This is the eTLD+1 of the URL shown in the navigation bar of the browser. Any sub-resources or sub-documents included in the page are considered first-party if they are fetched from the same eTLD+1 as the top-level document. Third-parties, then, are any sites not sharing the eTLD+1 of the top-level document. A sub-document (i.e., an <iframe>) is thus third-party whenever its eTLD+1 differs from that of the top-level document.
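The origin/site distinction can be made concrete with the WHATWG URL API (available in browsers and Node). Real browsers compute the eTLD+1 using the Public Suffix List; the naiveSite helper below is an invented toy that assumes a one-label public suffix (e.g., .com) and is therefore wrong for suffixes like .co.uk:

```javascript
// Origin = scheme + host (+ explicit port); "site" (eTLD+1) is coarser.
function origin(url) {
  const u = new URL(url);
  return `${u.protocol}//${u.host}`;
}

// Naive eTLD+1: keep the last two DNS labels. A real implementation
// must consult the Public Suffix List instead of this assumption.
function naiveSite(url) {
  const labels = new URL(url).hostname.split('.');
  return labels.slice(-2).join('.');
}

// https://cdn.example.com and https://example.com are distinct
// origins (so SOP separates their storage), yet the same "site"
// (so content from either is first-party on an example.com page).
```

This is why a sub-resource can be first-party for storage-policy purposes while still being cross-origin for SOP purposes.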
2.3.2 User Tracking
Types of Behaviors Tracked. This work uses the term “tracking” to refer to a third-party re-identifying a visitor across visits to first-party sites. Unless otherwise specified, we use “tracking” to refer both to cross-site tracking (i.e., a third-party can link an individual’s behavior across first-parties) and cross-time tracking (i.e., a third-party can identify the same person returning to the same first-party across sessions). Stateful Tracking. The oldest, simplest, and most common form of online tracking is “stateful” tracking, where a third-party stores a unique value on the user’s browser and reads that value back across different first-parties. While the terms explicit and inferred used by Roesner et al. [Roe12] appear more precise, as both techniques involve state of some kind, the stateful/stateless terminology popularized by Mayer and Mitchell [May12] appears dominant in subsequent research. In the simplest case, stateful tracking works as follows. Sites A and B both include an <img> or <iframe> fetched from tracker T. On the first request, T stores a unique identifier in a cookie scoped to T’s origin; every subsequent request to T, from whichever first-party context, transmits that identifier back, letting T link the user’s visits to A and B.
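The linking mechanism can be modeled in a few lines; the Tracker class below is a toy stand-in for third-party server logic, not any real tracking system:

```javascript
// Toy model of stateful cross-site tracking: tracker T assigns a
// unique ID on first contact (Set-Cookie) and reads it back on every
// later request, regardless of which first party embedded T's content.
class Tracker {
  constructor() {
    this.nextId = 1;
    this.visits = new Map();   // id -> list of first parties seen
  }
  // `cookie` models the browser's stored cookie for T's origin
  request(firstParty, cookie) {
    const id = cookie !== null ? cookie : `uid-${this.nextId++}`;
    if (!this.visits.has(id)) this.visits.set(id, []);
    this.visits.get(id).push(firstParty);  // browsing history accrues
    return id;                             // echoed back via Set-Cookie
  }
}

const t = new Tracker();
let cookie = null;                    // one unified third-party cookie jar
cookie = t.request('a.com', cookie);  // visit to A embeds T
cookie = t.request('b.com', cookie);  // visit to B embeds T
// t.visits now links the user's visits to a.com and b.com under one ID.
```

The defenses in Section 2.3.4 all work by disrupting some part of this loop: deleting the cookie, refusing to store it, or splitting the single jar into per-first-party jars.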
• Browser fingerprinting refers to uniquely identifying a browser (or browser user) not through the storage and transmission of a unique identifier, but by identifying unique characteristics of the browser’s configuration (e.g., plugins, preferred language, “dark mode”) and execution environment (e.g., operating system, hardware capabilities).
• Server-side tracking is a broad term that loosely means tracking users across sites not through stored identifiers (i.e., stateful tracking) or unique configuration (i.e., fingerprinting), but through information the user provides to the site. For example, if a user uses the same email address when registering on two different sites, a tracker could later use the repeated email address to link the user’s behavior across sites.
2.3.3 Threat Model
Here we present a simple threat model defining the scope boundaries for our proposed storage policy improvements. It provides useful criteria for evaluating both deployed and experimental policy alternatives. Actors. We exclusively consider threats originating from third-party content providers engaged in user tracking. While we do consider the possibility of first-party errors or carelessness amplifying the threat posed by third-party actors, we consider active collusion between third-parties and first-parties (e.g., the disturbing new tactic of cloaking third-party content behind first-party CNAME DNS records) to be out of scope. Mechanisms. Our focus is on stateful tracking, though we consider instances where stateless tracking mechanisms may be used to bridge or synchronize stateful sessions. We consider the threat of fully stateless tracking (i.e., a universal, per-user fingerprint needing no state transfer) to be out of scope. This choice is deliberate: we believe stateful tracking is where browsers most lack practical, robust, compatible defenses. While significant research has gone into building web-compatible defenses against stateless tracking (e.g., [Lap17; Nik15]), the existing techniques for preventing stateful third-party tracking are either incomplete (i.e., they still allow significant privacy harm to occur) or incompatible (i.e., they break a significant number of websites). Threats. The primary threat considered is classic cross-site user tracking as enabled by traditional unified, persistent third-party storage. We do not believe it is controversial to consider such tracking, which amounts to disclosure of a user’s browsing history, to be an undesirable breach of personal privacy. However, there exist additional subtle cross-time tracking concerns raised by persistent third-party storage even when it is partitioned by first-party context (a relatively common proposed defense mechanism; see Figure 2.2 and Section 2.3.4).
Such third-party tracking of return-visit activity within a single first-party context can enable or amplify attacks like session linking or cookie syncing. By cookie syncing we refer not to cross-vendor syncing [Pap19] but to the possibility of cross-site syncing enabled by browser implementation flaws. E.g., consider a scenario in which a first-party site embeds a third-party frame. One week later, that frame gains the ability to cookie sync (e.g., a new browser feature adds enough entropy to fingerprint). But the next day, that ability is lost (e.g., a high-priority browser update removes the privacy leak). Effectively, this disaster scenario temporarily neuters any attempt to partition third-party stored state by first-party context. The impact on privacy is determined by how much longitudinal data is available in third-party storage to be synced across first-party boundaries. In the example scenario, it is one week of browsing data with stable third-party storage and one day of data with only ephemeral storage. What we are considering a threat, then, is not the possibility of cookie syncing itself, but rather the scale of damage it could cause. Our concern is defense in depth, just as cryptographers implementing perfect forward secrecy do so not because they expect frequent key exposure but because they wish to mitigate the impact of its hopefully unlikely occurrence.
By session linking we mean third-parties exploiting any flaw that allows inference of first-party login state to link two or more login identities that the user intended to keep disassociated. Robust SOP enforcement should prevent such inference, but loopholes (e.g., Referer leaks, postMessage mishandling) have been and probably will continue to be found and exploited in the wild. If such a vulnerability is ever found in a sensitive first-party site (e.g., a web mail or personal finance portal), the persistence of third-party state across first-party session boundaries opens up the possibility of a session linking attack by any third-party content embedded in that site. Finally, we consider breakage of essential web content to be a threat, too. Availability has always been a critical component of information security. If a storage policy prevents all cross-site and cross-time third-party tracking perfectly but breaks any significant amount of the web in the process, users will not tolerate the breakage and will revert to policies that are vulnerable to one or more of the other threats described above.
2.3.4 Deployed Stateful Tracking Defenses
With one notable exception, all of today’s major web browsers implement proactive user tracking defenses. Here we present summaries of the third-party storage policies currently deployed in Google Chrome, Microsoft Edge, Mozilla Firefox, Apple Safari, and Brave. These illustrate a range of possible trade-offs between privacy and compatibility. Unless otherwise indicated, all policy and mechanism details are drawn from the Cookie Status project [Coo]. Note that these summaries do not cover tracking defenses unrelated to third-party storage (e.g., third-party content blocking, first-party storage lifetime restrictions, “bounce” tracking defenses, fingerprinting defenses, etc.). User tracking defenses can be decomposed into two independent aspects: mechanism (i.e., how storage access is affected) and policy (i.e., for what actors, under what conditions). Mechanisms include altering the lifetime of third-party storage, partitioning it by first-party site context, or even blocking it entirely. Such defense mechanisms can be applied to all third-party storage or to a restricted subset of storage mechanisms (e.g., cookies vs. local storage, HTTP Cookie headers vs. JS code). Defense policies may be global, for all third-party domains; or selective, for third-party domains classified as trackers based on a priori filter lists or dynamic behavior analysis and scoring.
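The mechanism axis can be illustrated schematically by how a browser might key its third-party storage “jars”; the partitioned case keys on the (first-party site, third-party origin) pair, as in Figure 2.2(c). The function below is an invented sketch, not any browser’s actual implementation:

```javascript
// Sketch of storage keying under three mechanisms from Figure 2.2.
// The returned key models which "cookie jar" third party T sees when
// embedded in first-party site A or B; null models blocked storage.
function storageKey(mechanism, firstPartySite, thirdPartyOrigin) {
  switch (mechanism) {
    case 'permissive':   // (a) one jar for T everywhere -> cross-site linkage
      return thirdPartyOrigin;
    case 'blocking':     // (b) no jar at all -> breakage risk
      return null;
    case 'partitioned':  // (c) one jar per (first party, T) pair
      return `${firstPartySite}^${thirdPartyOrigin}`;
    default:
      throw new Error(`unknown mechanism: ${mechanism}`);
  }
}

// Under partitioning, T's state inside a.com and inside b.com never
// shares a key, so an ID stored in one cannot be read from the other.
```

The policy axis is orthogonal: a browser applies storageKey-style logic either to all third-party origins (global) or only to origins on a tracker list (selective).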
2.3.4.1 Major Browsers
Google Chrome. Unlike all other browsers considered here, Google Chrome [Chrb] permits full third-party storage use, including sending cookies on HTTP requests to third-party sub-resources. It is therefore wide open to both cross-site and cross-time stateful user tracking per our threat model (Section 2.3.3). Google has announced intentions to phase out third-party cookie support [Chra] in the near future; technical details remain vague, but their wording implies eliminating only cookies on third-party HTTP requests, not restricting third-party storage in general. Chrome dominates as the world’s most popular browser for both desktop and mobile markets [Staa], understandably prompting web developers to target its behavior for maximum compatibility and indirectly perpetuating the status quo of stateful tracking in the process.
Microsoft Edge. Microsoft Edge [Edg] uses selective third-party storage restrictions based primarily on filter lists. The mechanism used is strong: full blocking of all third-party storage. The enforcement policy, however, is conservative and prone to false negatives: only “tracking” domains are subject to blocking, unless exempted by the user via the Storage Access API escape hatch. “Tracking” domains are so classified by Microsoft’s Trust Protection List (derived from Disconnect [Dis], tuned by user-selected strictness levels), augmented by an in-browser “site engagement” score. Edge’s compromise avoids significant site breakage while mitigating cross-site and cross-time tracking by known-bad actors, but not that done by unknown-bad actors.
Mozilla Firefox. Like Edge, Mozilla Firefox [Fir] currently restricts third-party storage using a filter list by default, providing a similar trade-off between privacy and compatibility. It uses a strong mechanism, full third-party storage blocking, but applies it only to “tracking” domains as defined by Disconnect unless exempted via the Storage Access API. However, Mozilla has recently announced2 a strict “Total Cookie Protection” opt-in mode that partitions third-party storage by first-party site context globally, i.e., for all third-party domains. This announcement is consistent with Firefox’s general trend towards more aggressive privacy protections (e.g., per-site caches). Turning on Total Cookie Protection would eliminate cross-site tracking via third-party storage entirely, but would not prevent cross-time tracking and its associated threats.
Apple Safari. Apple Safari [Saf] has long been advancing the state of the art in user tracking defenses via Intelligent Tracking Prevention (ITP), a sophisticated combination of storage restriction policies, opt-in APIs, and on-client classification of tracking domains via machine learning [Weba]. It uses an aggressive blend of mechanisms for restricting third-party storage: full blocking of all cookies and IndexedDB use, and per-site partitioning of local storage (which is flushed on browser restart). More significantly, its policy is to apply these restrictions globally, across all third-party domains, except where exempted via the Storage Access API. Modern Safari thus eliminates cross-site tracking and leaves only a short window (i.e., the lifetime of one browser instance) for cross-time tracking based on local storage. Previous versions of ITP had allowed but partitioned third-party cookies, too, and had allowed partitioned third-party storage to persist beyond browser restart, significantly extending the time window for cross-time tracking risks. Safari’s evolution suggests that keeping third-party storage ephemeral is as important as keeping it partitioned between sites.
Brave. Brave [Bra], a Chromium fork featuring aggressive privacy protections called “Shields”, has traditionally defaulted to blocking all forms of third-party storage. The blocking mechanism is augmented by a number of tweaks to reduce site breakage, such as white-listing a small number of common third-party widget components and silently ignoring JavaScript access to third-party storage instead of raising JavaScript exceptions (as specified in the relevant standards). Policy enforcement, global by default, is augmented by a user interface for selectively “lowering shields” on sites exhibiting breakage. This aggressive stance eliminates cross-site and cross-time tracking per Section 2.3.3, but is prone to frequent site breakage. Brave has since introduced revised Shields that still block all third-party cookies on HTTP sub-resources but that allow partitioned and ephemeral
2 https://hacks.mozilla.org/2021/02/introducing-state-partitioning/
(i.e., flushed on last same-site tab closed, or browser restart) third-party storage within third-party sub-frames. This new policy closely mirrors the one we propose in this work; see Section 5.2.1 for more details.
2.3.4.2 Storage Access API
As mentioned above, several browsers have implemented the Storage Access API as an “escape hatch” for letting the user explicitly grant a third-party domain access to persistent storage. Semantic details vary across browsers; user prompts are typically shown only if the API request comes in response to user interaction with the page, to prevent prompt spam. Motivating use cases3 all revolve around third-party widgets (e.g., comment sections, payment processors, embedded video players) keeping cross-site login state for user convenience. Re-enabling cross-site tracking is an explicit non-goal of the API specification, though the lack of a concrete semantic specification leaves this concern up to the browser vendor’s judgment.
2.3.5 Compatibility and Tracking Protections
Finally, we present some ways that existing protections against third-party stateful tracking break websites. We present these as moderating examples, and useful constraints, in designing page-length storage. Without considering these compatibility concerns, solutions tend toward simplistic, “block-everything” approaches that end up not being used, and so not being effective in protecting privacy. We gather the following examples from Brave’s public issue tracker4. This is a useful source, as Brave has traditionally featured the most aggressive restrictions on third-party storage of the surveyed browsers, making storage-related compatibility a more frequent issue for its users.
2.3.5.1 Uncaught Exceptions from Blocking Storage
Strict third-party storage blocking breaks embedded SlideShare slide show widgets5 on Chrome. The widget becomes inert, not responding to clicks, when Chrome’s implementation of strict third-party storage blocking (correctly, per the specification) raises JavaScript exceptions on access to storage APIs. Brave’s silent no-op implementation of strict third-party storage blocking is sufficient to prevent breakage in this case; successful storage access is clearly not essential to this widget’s functionality. Page-length storage also prevents breakage by providing fully functional, but ephemeral and isolated, third-party storage. A similar example is provided by a data plot widget6 broken by strict third-party storage blocking. Once again, strict third-party storage blocking causes a JavaScript run-time error which results in a
3 https://github.com/privacycg/storage-access#motivating-use-cases
4 https://github.com/brave/brave-browser/issues
5 https://support.blogactiv.eu/2015/04/24/how-to-embed-slideshare/
6 https://www.otcmarkets.com/stock/NSRGY/overview
Figure 2.3 Stock market graph broken by strict third-party storage blocking (left) and working with page-length storage (right).
blank data plot (see Figure 2.3). In this case, Brave’s silent no-op blocking does not help: the error is caused by property access on a null value returned from a no-op API stub.
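This failure mode is easy to reproduce. A silent no-op stub that drops writes and returns null from reads satisfies code that merely writes, but cannot help code that dereferences the read result. The stub below mimics the Web Storage API method names; widgetB is an invented illustration of the broken pattern:

```javascript
// A silent no-op storage stub in the style of Brave's blocking:
// writes are dropped, reads return null, and no exception is raised.
const noopStorage = {
  setItem(_key, _value) { /* silently dropped */ },
  getItem(_key) { return null; },
};

// Pattern A only writes: the no-op stub prevents breakage here.
noopStorage.setItem('seen', '1');   // no exception, nothing stored

// Pattern B dereferences the read result: the stub cannot help.
function widgetB(storage) {
  const settings = JSON.parse(storage.getItem('settings')); // -> null
  return settings.theme;  // TypeError: reading a property of null
}
```

Page-length storage avoids both failure modes by handing the widget a real, working (but ephemeral and partitioned) store, so reads return whatever the widget itself wrote.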
2.3.5.2 Breaking Cookie-Based Third-Party Sessions
A server-side example of strict third-party storage blocking causing breakage is provided by a live code editing/running widget7 embedded in the R language documentation. The embedded widget tries to establish a cookie-based session with the third-party domain multiplexer-prod.datacamp.com. Failure to persist third-party cookies results in HTTP 403 errors on subsequent HTTP requests, preventing code execution and output display. A broken video player8 on the popular commentary and analysis site fivethirtyeight.com provides another example. With third-party storage blocked, the video player remains blank indefinitely. In this case, the video player functionality is broken because the frame attempts to use localStorage to persist values across pages.
7 https://www.rdocumentation.org/packages/grid/versions/3.6.2/topics/grid.plot.and.legend
8 https://fivethirtyeight.com/videos/do-you-buy-that-biden-should-pick-a-running-mate-from-a-swing-state/
CHAPTER 3

VISIBLEV8: IN-BROWSER MONITORING OF JAVASCRIPT IN THE WILD
Modern web security and privacy research depends on accurate measurement of an often evasive and hostile web. No longer just a network of static, hyperlinked documents, the modern web is alive with JavaScript (JS) loaded from third parties of unknown trustworthiness. Dynamic analysis of potentially hostile JS currently presents a cruel dilemma: use heavyweight in-browser solutions that prove impossible to maintain, or use lightweight inline JS solutions that are detectable by evasive JS and which cannot match the scope of coverage provided by in-browser systems. We present VisibleV8, a dynamic analysis framework hosted inside V8, the JS engine of the Chrome browser, that logs native function or property accesses during any JS execution. At less than 600 lines (only 67 of which modify V8’s existing behavior), our patches are lightweight and have been maintained from Chrome versions 63 through 84 without difficulty. VV8 consistently outperforms equivalent inline instrumentation, and it intercepts accesses impossible to instrument inline. This comprehensive coverage allows us to isolate and identify 46 JavaScript namespace artifacts used by JS code in the wild to detect automated browsing platforms and to discover that 29% of the Alexa top 50k sites load content which actively probes these artifacts.
3.1 Introduction
“Software is eating the world” [And11], and that software is increasingly written in JavaScript (JS) [Red; Oct]. Computer applications increasingly migrate into distributed, web-based formats, and web
application logic increasingly migrates to client-side JS. Web applications and services collect and control vast troves of sensitive personal information. Systematic measurements of web application behavior provide vital insight into the state of online security and privacy [May12]. With modern web development practices depending heavily on JS for even basic functionality, and with increasingly rich browser APIs [Sny17a] providing an inviting attack surface for fingerprinting [Aca13] and tracking [Eng16], effective security and privacy measurement of the web must include some degree of JS behavioral analysis. As if dealing with the quirky, analysis-hostile dynamism of JS itself were not enough, researchers are also locked in an arms race with evasive and malicious web content that frequently cloaks [Inv16] itself from unwanted clients. The ad-hoc approaches to JS instrumentation common in the literature suffer obvious shortcomings. Heavyweight systems built by modifying a browser’s JS engine can provide adequate coverage and stealth, but suffer from high development costs (on the order of thousands of lines of C/C++ code [Tra12; Nea15; Li18]) and poor maintainability, with patches rarely or never ported forward to new releases. Lighter-weight systems built by injecting in-band, JS-based instrumentation hooks directly into the target application’s namespace avoid these pitfalls but suffer their own drawbacks: structural and policy limits on coverage, and vulnerability to well-known detection and subversion techniques (Section 2.2). We argue in this paper that a maintainable in-browser JS instrumentation system that matches or exceeds in-band equivalents in coverage, performance, and stealth is possible. As proof, we present VisibleV8 (VV8): a transparently instrumented variant of the Chromium browser1 for dynamic analysis of real-world JS that we have successfully maintained across eight Chromium release versions (63 to 72).
VV8 lets us passively observe native (i.e., browser-implemented) API feature usage by popular websites with fine-grained execution context (security origin, executing script, and code offset) regardless of how a script was loaded (via static script tag, dynamic inclusion, or any form of eval). Native APIs are to web applications roughly what system calls are to traditional applications: security gateways through which less privileged code can invoke more privileged code to access sensitive resources. As such, VV8 provides a JS analog to the classic Linux strace utility. VV8 can be used interactively like any other browser, but is primarily intended for integration with automated crawling and measurement frameworks like OpenWPM [Eng16]. We demonstrate VV8 by recording native feature usage across the Alexa top 50k websites, identifying feature probes indicative of bot detection, and analyzing the extent of such activity across all domains visited. Our collection methodology takes inspiration from Snyder et al. [Sny16], using an automated browser instrumentation framework to visit popular domains, to randomly exercise JS-based functionality on the landing page, and to collect statistics on JS feature usage. Our identification and analysis of bot detection artifacts used in the wild showcases VV8’s unique advantages over traditional JS instrumentation techniques: improved stealth in the face of evasive scripts, and universal property access tracking on native objects that do not support proxy-based interposition.
1Chromium is the open-source core of Chrome, equivalent in functionality but lacking Google’s proprietary branding and service integration. We use the names interchangeably in this paper.
The contributions of the VisibleV8 project include:
• We demonstrate that state-of-the-art inline JS instrumentation fails to meet classical criteria for reference monitors [And72] and cannot prevent evasive scripts from deterministic detection of instrumentation.

• We present the first maintainable in-browser framework for transparent dynamic analysis of JS interaction with native APIs. Using instrumented runtime functions and interpreter bytecode injection, VV8 monitors native feature usage from inside the browser without disrupting JIT compilation or leaving incriminating artifacts inside JS object space. We will open source our patches and tools upon publication.

• We use VV8’s universal property access tracing to discover non-standard features probed by code that detects “bots” (i.e., automated browsers, such as those used by researchers). We find that on 29% of the Alexa top 50k sites, at least one of 49 identified bot artifacts is probed. This is clear evidence that web measurements can be affected by the type of browser or framework that researchers use.
3.2 System Architecture
We present cost-effective out-of-band JS instrumentation via VisibleV8, a variant of Chrome that captures and logs traces of all native API accesses made by any JS execution during browsing. Here we explain VV8’s internal design, relate our experience maintaining it over several Chrome update cycles, evaluate its raw performance against several alternatives, and describe the data collection and analysis system we have built around VV8 to demonstrate its potential for real-world measurement work.
3.2.1 Chromium/V8 Internals

Chromium is a massive project (over 20 million lines of code at time of writing), actively developed, and frequently updated. Fortunately, the Chromium browser’s architecture is modular, as is the design of its V8 JS engine, so we can restrict our changes to a tiny subset of the entire browser code base. Modern versions of V8 handle JS parsing and execution via the Ignition bytecode interpreter and the TurboFan JIT compiler (Figure 3.1). Ignition parses JS source code, generates bytecode, and executes bytecode; its design is optimized for low latency, not high throughput. When run-time statistics indicate that JIT compilation is desired, TurboFan aggressively optimizes and translates the relevant bytecode into native machine code. Ignition and TurboFan rely on a large supporting run-time library (RTL) that includes a foreign-function interface allowing JS code to call into native (i.e., C++) code via type-safe API function bindings created by the hosting application (e.g., Chrome). Functionality not implemented by JS code or by V8 built-in code (e.g., Math.sqrt) must use an API binding to native browser code. V8’s
Figure 3.1 V8 architecture with VV8’s additions
RTL thus forms a software-enforced JS/native boundary not unlike a system call boundary in a traditional OS kernel. In V8’s 2-stage implementation, source code translation happens once, and the run-time behavior of that source code is fixed (modulo JIT compiler optimizations) at the point of bytecode generation. This predictable workflow keeps our patches to V8 small and self-contained.
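As a toy illustration of this software-enforced boundary (a hypothetical Python sketch, not V8’s actual C++ internals): calls from script code reach native functionality only through type-safe bindings, so a single dispatch point can observe every crossing.

```python
# Toy model (not V8 code): every call from script code into a native API
# binding passes through one trampoline, so one hook observes all calls.
trace = []

NATIVE_BINDINGS = {
    # Hypothetical bindings standing in for browser-provided C++ functions.
    "Math.sqrt": lambda x: x ** 0.5,
    "Date.now": lambda: 1234567890,
}

def call_native(name, *args):
    trace.append((name, args))  # the single, centralized instrumentation hook
    return NATIVE_BINDINGS[name](*args)

print(call_native("Math.sqrt", 16.0))  # 4.0, and the call is traced
```

Because every binding is reached through `call_native`, tracing lives in one place; this is the property that makes the single-hook approach of the next section viable.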
3.2.2 VisibleV8 Implementation
VV8 intercepts and logs all native API access from JS execution during browsing. Native API accesses comprise API function calls and all get and set operations on properties of JS objects constructed by native (i.e., C++) code in the browser itself. The resulting traces include context information (e.g., the source code location triggering this event), feature names, and some abbreviated activity details like function call parameters and the value being stored during property writes. Our patches are small: 67 lines of code changed or added inside V8 itself to insert our instrumentation hooks, and 472 lines of new code for filtering and logging. Instrumenting Native Function Calls. All foreign function calls through the JS/native boundary are routed through a single V8 runtime function that handles the transition from the bytecode interpreter to native execution and back. By adding a single call statement invoking our centralized tracing logic to this C++ function, we can hook all such calls made under Ignition bytecode interpretation. However, when the bytecode is JITted to optimized machine code, one of TurboFan’s hundreds of optimization transforms will reduce the call through that hooked runtime function into a more direct alternative. This transformation would disrupt our function call tracing, so we disabled that single specific reduction, leaving the rest of the JIT compiler untouched. This removal slows V8 down by 1.3% on the Mozilla Dromaeo micro-benchmark suite (Section 3.2.3), with margins of
error near 1% as well. For the cost of two trivial code modifications and very modest overhead, we gain full visibility of JS calls into native API bindings under both bytecode interpretation and JITted code execution. Instrumenting Native Property Accesses. V8 provides no similarly convenient single chokepoint from which all native object property accesses can be observed. Property access is a frequent and complicated operation in JS, and V8 has multiple fast-paths for different access scenarios. Therefore, we target not the execution of any bytecode here, but rather the generation of property-accessing bytecode. JS code entering V8 is first processed by Ignition’s source-to-bytecode compiler before any execution. The bytecode compiler uses a classic syntax-directed architecture. First, a parser constructs an abstract syntax tree (AST) from JS source code. Then, the bytecode generator walks this AST while generating bytecode to implement the semantics required by the original JS syntax. We instrument property access to native objects by patching the bytecode generator. Specifically, we add statements to the AST visitor logic for property get and set expressions to emit additional bytecode instructions in each case. These instructions call a custom V8 runtime function containing our tracing logic. Since such runtime calls are effectively opaque black boxes to the TurboFan optimizer, our hook instruction cannot be automatically optimized away during JIT compilation. So our injected hook’s semantics are preserved from bytecode generation, through interpretation and JIT compilation, to optimized machine code execution. For completeness, we also hook the built-in implementation of Reflect.get and Reflect.set in the RTL using the same approach as for native function calls. Thus, we also capture property accesses via calls to the JS Reflection API, not only through member-access expressions. Capturing Execution Context.
All of our hooks, whether in runtime functions or in injected bytecode, call into our central tracing logic. Written in C++ and compiled into V8, this code is responsible for filtering events and capturing execution context information for the trace log. Native API calls are always logged. But since our property-access hooks intercept all syntax- and reflection-based property accesses, we must filter those events. We log only property accesses on native objects as indicated by V8’s internal object metadata API. V8 treats the JS global object as a unique special case, but we treat it as a standard native object for logging purposes. Fine-grained feature-usage analysis requires a significant amount of execution context to be logged along with each function call or property access. We link feature usage not just with a visited domain, but also with the active security origin, active script, and location within that script. We use V8’s C++ APIs to extract the invoking script and location from the top frame of the JS call stack and the security origin from the origin property of the active global object. V8 and Blink sometimes execute internal JS code in a non-Web context, where the global object has no origin property or it has a non-string value. In this case an “unknown” origin is recorded (and we can later discard this activity from our analysis). The visit domain (i.e., from the URL displayed in the browser’s address bar) is associated with the log during post-processing. Logging Trace Data. Reliably recording JS trace data at low cost introduced its own engineering
Figure 3.2 Instrumentation performance on BrowserBench.org [Bro] and Mozilla Dromaeo [Dro] (scores relative to a normalized baseline; variants: Chrome 64 plain, Chrome 64 w/in-band, VisibleV8 light, VisibleV8 full)

challenges. Repeatedly looking up and logging identical context for successive events wastes CPU time, I/O bandwidth, and storage space. We therefore track execution context state (such as active script) over time, and log it only when it has changed since the last logged event. This optimization introduces state-tracking and synchronization issues. Chrome uses multiple processes and threads to achieve good performance and strong isolation. Even with just a single browser tab open, JS code can be executing simultaneously across multiple threads. To keep our traces coherent, we must track context and log events on a per-thread basis. To store our separate trace log streams without races or synchronization bottlenecks, we create per-thread log files.
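The per-thread, log-context-only-on-change scheme can be sketched as follows (a hypothetical Python simplification; VV8’s actual implementation is C++ inside V8):

```python
import threading

class ThreadTraceLogger:
    """Sketch of per-thread trace logging that records execution context
    (e.g., active script and origin) only when it changes between events."""

    def __init__(self):
        self._local = threading.local()

    def _state(self):
        # Each thread lazily gets its own log buffer and last-seen context,
        # so concurrently executing JS threads never contend on shared state.
        if not hasattr(self._local, "log"):
            self._local.log = []
            self._local.last_ctx = None
        return self._local

    def record(self, ctx, event):
        st = self._state()
        if ctx != st.last_ctx:                 # log context deltas only
            st.log.append(("context", ctx))
            st.last_ctx = ctx
        st.log.append(("event", event))

logger = ThreadTraceLogger()
ctx = ("https://example.com", "analytics.js")
logger.record(ctx, "call Math.random")
logger.record(ctx, "get Navigator.userAgent")  # same context: not re-logged
entries = logger._state().log
```

Successive events under an unchanged context cost one log record each; the context record is amortized across the run of events it covers.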
3.2.3 Performance
With every JS object property access intercepted and possibly logged, we expected VV8 to be significantly slower than stock Chrome in the worst case. We verified this expectation by benchmarking a set of Chrome and VV8 variants with both the WebKit project’s BrowserBench [Bro] and Mozilla’s Dromaeo [Dro]. Unless otherwise noted, tests were performed under Linux 4.19.8 on an Intel Core i7-7700 (4 cores, 3.6GHz) with 16GiB RAM and an SSD. See Figure 3.2. We tested four variants of Chrome 64 for Linux, including a baseline variant with no JS instrumentation at all (plain). We include two VV8 builds: the complete system as described above (full) and a variant with property access interception disabled (light). VV8-light is roughly equivalent in coverage and logging to our final variant: stock Chrome running a custom prototype-patching extension comparable to the (unreleased) instrumentation used by Snyder et al. to measure browser features across the Alexa top 10K [Sny16]. This last variant (w/in-band) attempts to provide
an apples-to-apples comparison of in-band and out-of-band instrumentation, instrumenting a comparable number of APIs (functions only) and recording comparably rich trace logs. All Chrome builds were based on the Chrome 64 stable release for Linux and use the same settings and optimizations. BrowserBench tests the JS engine in isolation (JetStream, ARES-6), JS and the DOM (Speedometer), and JS and Web graphics (MotionMark). VV8-light either meets or decisively beats its in-band equivalent in every case. VV8-full consistently suffers 60% to 70% overhead vs. the baseline, but on the whole-browser tests (Speedometer, MotionMark) it performs comparably to the in-band variant, which captures significantly less data (i.e., no property accesses). These numbers match our experience interacting with VV8, where we observe it providing acceptable performance on real-world, JS-intensive web applications like Google Maps. Significantly, VV8-full on the workstation compared favorably (i.e., equal or better BrowserBench scores) to Chrome 64 plain on a battery-throttled laptop running Linux 4.18.15 on an Intel Core i7-6500 (2 cores, 2.5GHz) with 16GiB RAM and an SSD. The Mozilla Dromaeo suite of micro-benchmarks focuses exclusively on JS engine performance. It avoids the browser’s layout and render logic as much as possible, and reveals more slowdowns for all instrumented variants. Dromaeo’s recommended test suite comprises 49 micro-benchmarks, too many to effectively visualize in a single figure, so we provide only the reported aggregate score.2 VV8-light still handily outperforms in-band instrumentation, but VV8-full is significantly slower than the baseline (6x in aggregate). VV8-full showed a wide range of performance on Dromaeo micro-benchmarks, from six showing no slowdown at all to three showing pathological slowdown over 100x.
3.2.4 Maintenance & Limitations
Thanks to the small size and minimal invasiveness of VV8’s patches, maintenance has thus far proved inexpensive. Development began on Chrome 63, then easily transitioned to Chrome 64, which was used for primary data collection. We have since ported our patches through Chrome 72 and encountered only trivial issues in the process (e.g., whitespace changes disrupting patch merge, capitalization changes in internal API names). Our trace logs must be created on the fly as new threads are encountered. Since the Chrome sandbox prevents file creation, we currently run VV8 with the sandbox disabled as an expedience. In production, we run VV8 inside isolated Linux containers, mitigating the loss of the sandbox somewhat. Future development will include sandbox integration should the need arise. Past work [Mow11; Mul13] on fingerprinting JS engines indicates that sophisticated adversaries could use relative scores across micro-benchmarks as a side-channel to identify VV8. However, such benchmarks and evasions would be detectable in VV8’s trace logs, and JS timing side-channel attacks can be disrupted [Sch18; Cao17]. In any case, it is unlikely that an adversary sophisticated enough to fingerprint VV8 in the wild would not also be able to fingerprint in-band instrumentation,
2The full results can be viewed at http://dromaeo.com/?id=276022,276023,276026,276027; the four columns are Chrome (plain), Chrome (w/in-band), VisibleV8 (light), and VisibleV8 (full), respectively.
Figure 3.3 The complete data collection and post-processing system
which also shows measurable deviation from baseline performance. Furthermore, we expect to improve VV8 performance in future iterations by exploring asynchronous log flushing, log-filtering tests placed in the injected bytecode (where they can be JIT optimized), and cheaper forms of context tracking.
3.2.5 Collection System
To collect data at large scale using VV8, we built the automated crawling and post-processing system diagrammed in Figure 3.3. Worker nodes (for collection, post-processing, and work queues) are deployed across a Kubernetes cluster backed by 80 physical CPU cores and 512GiB of RAM distributed across 4 physical servers. Initial jobs (i.e., URLs to visit) are placed in a Redis-based work queue to be consumed by collection worker nodes. Post-processing jobs (i.e., archived logs to parse and aggregate) are placed in another work queue to be consumed by post-processing worker nodes. Collection metadata and trace logs are archived to a MongoDB document store. Aggregate feature usage data is stored in a PostgreSQL RDBMS for analytic queries. The collection worker node Docker image contains the VV8 binary itself and a pair of accompanying programs written in Python 3: Carburetor and Manifold. Carburetor is responsible for fueling VV8: using the Chrome DevTools API to open a tab, navigate to a URL, and monitor progress of the job. Manifold handles the byproducts of execution, compressing and archiving the trace log files
emitted during automated browsing. The post-processor worker node Docker image contains a work queue dispatcher and the main post-processor engine. The dispatcher interfaces with our existing work queue infrastructure and is written in Python 3. The post-processor engine is written in Go, which provides ease-of-use comparable to Python but significantly higher performance.
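The queue-driven pipeline can be sketched in miniature as follows; Python’s in-process `queue` module stands in for the Redis-backed work queues, and the “crawl” is faked, since the real workers drive VV8 via DevTools:

```python
import queue
import threading

# Hypothetical stand-in for the Redis-backed queues: URLs feed collection
# workers; each finished job enqueues its archived logs for post-processing.
collection_q = queue.Queue()
postproc_q = queue.Queue()

def collection_worker():
    while True:
        url = collection_q.get()
        if url is None:          # sentinel value: shut this worker down
            break
        # A real worker would visit the URL with VV8 here; we just pretend
        # the visit produced one compressed trace archive.
        archive = f"trace-{abs(hash(url)) % 10000:04d}.log.xz"
        postproc_q.put((url, archive))

for u in ["http://example.com/", "http://example.org/"]:
    collection_q.put(u)
collection_q.put(None)

worker = threading.Thread(target=collection_worker)
worker.start()
worker.join()
```

Decoupling collection from post-processing through queues is what lets the two worker pools scale independently on the cluster.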
3.3 Data Collection
3.3.1 Methodology
Overview. We collected native feature usage traces and related data by automating VV8 via the Chrome DevTools interface to visit the Alexa top 50k web domains. We began each visit to DOMAIN using the simple URL template http://DOMAIN/. We visited each domain in our target list 5 times (see below); each planned visit constituted a job. We recorded headers and bodies of all HTTP requests and responses along with the VV8 trace logs. Trace log files were compressed and archived immediately during jobs, then queued for post-processing. Post-processing associated logs with the originating job/domain and produced our analytic data set. User Input Simulation. Simply visiting a page may result in much JS activity, but there is no guarantee that this activity is representative. The classic challenge of dynamic analysis—input generation—rears its head. We borrowed a solution to this problem from Snyder et al. [Sny16]: random “monkey testing” of the UI using the open source gremlins.js library [Gre]. To preserve some degree of reproducibility, we used a deterministic, per-job seed for gremlins.js’s random number generator. Once a page’s DOM was interaction-ready, we unleashed our gremlins.js interaction for 30 seconds. We blocked all main-frame navigations that led to different domains (e.g., from example.com to bogus.com). When allowing intra-domain navigation (e.g., from example.com to www.example.com), we paused the interaction clock until the new destination loaded, then resumed the monkey testing. We immediately closed any dialog boxes (e.g., alert()) opened during the monkey testing to keep JS execution from blocking. This 30-second mock-interaction procedure was performed 5 times, independently, per visited domain. (Snyder et al. [Sny16] arrived at these parameters experimentally.)
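The navigation policy above (block cross-domain, allow intra-domain) can be sketched as follows; the naive suffix test is a hypothetical simplification standing in for a real registrable-domain (eTLD+1) comparison:

```python
from urllib.parse import urlsplit

def classify_navigation(visit_domain: str, target_url: str) -> str:
    """Allow navigations that stay on the visited domain (or a subdomain
    of it); block navigations that leave for an unrelated domain."""
    host = urlsplit(target_url).hostname or ""
    if host == visit_domain or host.endswith("." + visit_domain):
        return "allow"   # e.g. example.com -> www.example.com
    return "block"       # e.g. example.com -> bogus.com

assert classify_navigation("example.com", "http://www.example.com/home") == "allow"
assert classify_navigation("example.com", "http://bogus.com/") == "block"
```

Note that the suffix test requires the leading dot: a bare `endswith("example.com")` would wrongly allow `badexample.com`.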
3.3.2 Data Post-Processing
We parsed the trace logs to reconstruct the execution context of each recorded event and to aggregate results by that context. The resulting output included all the distinct scripts encountered and aggregate feature usage tuples. Script Harvesting. VisibleV8 records the full JS source of every script it encounters in its trace log (exactly once per log). We extracted and archived all such scripts, identifying them by script hash and lexical hash. Script hashes are simply the SHA256 hash of the full script source (encoded as UTF-8); they served as the script’s unique ID. Lexical hashes were computed by tokenizing the script
and SHA256-hashing the sequence of JS token type names that result. These are useful because many sites generate “unique” JS scripts that differ only by timestamps embedded in comments or unique identifiers in string literals. Such variants produced identical lexical hashes, letting us group variant families. Feature Usage Tuples. We recorded a feature usage tuple for each distinct combination of log file, visit domain, security origin, active script, access site, access mode, and feature name. The log file component let us distinguish collection runs. The visit domain is the Alexa domain originally queued for processing. The security origin was the value of window.origin in the active execution context, which may be completely different from the visit domain in the context of an embedded third-party frame.
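The two hashes can be sketched as follows; Python’s own tokenize module serves here as a stand-in for a JS tokenizer (the real pipeline tokenizes JS):

```python
import hashlib
import io
import tokenize  # stands in for a JS tokenizer in this sketch

def script_hash(source: str) -> str:
    """Unique script ID: SHA256 over the UTF-8-encoded source text."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()

def lexical_hash(source: str) -> str:
    """SHA256 over the sequence of token *type* names: literal values and
    comments are ignored, so scripts differing only in embedded timestamps
    or identifiers inside string literals hash identically."""
    kinds = [
        tokenize.tok_name[tok.type]
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type not in (tokenize.COMMENT, tokenize.NL,
                            tokenize.NEWLINE, tokenize.ENDMARKER)
    ]
    return hashlib.sha256("\n".join(kinds).encode("utf-8")).hexdigest()

a = 'version = "build 2019-01-01"'
b = 'version = "build 2019-02-02"'  # differs only inside a string literal
assert script_hash(a) != script_hash(b)      # distinct script IDs
assert lexical_hash(a) == lexical_hash(b)    # same variant family
```

Hashing token type names rather than token values is what collapses timestamp-stamped script variants into one lexical family.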
3.3.3 Results
Success and Failure Rates. Our methodology called for 5 visits to the Alexa 50k, so the whole experiment consisted of 250,000 distinct jobs. Successful jobs visited the constructed URL, launched gremlins.js, and recorded at least 30 seconds of pseudo-interaction time. Jobs resulting in immediate redirects (by HTTP or JS) to a different domain before any interaction began were deemed “dead ends”. From job status we extracted the per-domain coverage listed in Table 3.1. For “active” domains, all 5 jobs succeeded and we observed native JS API usage. For “silent” domains, all 5 jobs succeeded, but we observed no native JS API usage. For “facade” domains, all 5 jobs were “dead ends” (i.e., the domain is an alias). All of the above are considered “successful” domains. Some domains were “broken,” with all 5 jobs failing; a tiny number were “inconsistent,” with a mix of failed and succeeded jobs. This failure rate is not out of line with prior results crawling web sites. Snyder et al. [Sny16] reported a lower per-domain failure rate (2.7%), but this was over the Alexa top 10,000 only. On the other extreme, a recent measurement study by Merzdovnik et al. [Mer17] reported successful visits to only about 100k out of the top Alexa 200k web domains. Aggregate Feature Usage. Over the entire Alexa 50k, we observed 53% of Chrome-IDL-defined standard JS API features used at least once. Note that our observations comprise a lower bound
Table 3.1 Final domain status after collection

Status          Domains     Share
Active           42,845    85.69%
Silent            1,702     3.40%
Facade            1,508     3.02%
Broken            3,214     6.43%
Inconsistent        731     1.46%
TOTAL            50,000   100.00%

(Cumulative shares: successful domains (active + silent + facade) 92.11%; consistent domains (successful + broken) 98.54%.)
Figure 3.4 Cumulative feature use over the Alexa 50k (cumulative share of WebIDL features observed, reaching 53% over the top 50k)
on usage, since we did not crawl applications requiring authentication (e.g., Google Documents), which we intuitively anticipate may use a wider range of APIs than generic, public-facing content. Most modern sites use JS heavily, but no site uses all available features. The plot in Figure 3.4 thus climbs steeply before leveling out into a gentle upward slope. The small but distinctive “cliffs” observed at rocket-league.com (Alexa 16,495) and noa.al (Alexa 22,184) are caused by large clumps of SVG-related features being used for the first time.
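The cumulative curve in Figure 3.4 is, in effect, a running union of per-domain feature sets taken in Alexa-rank order; a minimal sketch (hypothetical helper, not our actual post-processing code):

```python
def cumulative_feature_curve(domain_features):
    """domain_features: iterable of per-domain feature-name sets, ordered
    by Alexa rank. Returns the count of distinct features seen up to each
    rank; a clump of first-time features appears as a 'cliff' in the plot."""
    seen = set()
    curve = []
    for features in domain_features:
        seen |= features   # union in this domain's observed features
        curve.append(len(seen))
    return curve

# A clump of new features at the third domain produces a visible jump.
assert cumulative_feature_curve([{"a", "b"}, {"b"}, {"c", "d", "e"}]) == [2, 2, 5]
```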
3.4 Bot Detection Artifacts
Modern websites adapt their behavior based on the capabilities of the browser that is visiting them. The identification of a specific browser implementation is called user-agent fingerprinting and it is often used for compatibility purposes. To provide a case study of VV8’s unique abilities, we use it to automatically discover artifacts employed by a form of user-agent fingerprinting used by some websites in the wild to detect automated browser platforms. The technique we study exploits the presence of distinctive, non-standard features on API objects like Window (which doubles as the JS global object) and Navigator as provided by automated browsers and browser simulacra. (Since even modern search engine indexers need some degree of JS support [Goob], we do not consider mechanisms used to identify “dumb,” non-JS-executing crawlers like wget.) Here VV8’s ability to trace native property accesses without a priori knowledge of the properties to instrument sets it apart from in-band instrumentation, which cannot wrap a proxy around the global object or the unforgeable window.document property. Note that “native API property access” here means a property access on an object that crosses the JS/native API boundary, regardless of whether or not that specific property is standardized or even implemented. Bot detection is a special case of user-agent fingerprinting, where “bots” are automated web clients not under the direct control of a human user (e.g., headless browsers used as JS-supporting web crawlers). Bots may be a nuisance or even a threat to websites [Jac12], and they may cause
Table 3.2 Bot detection seed artifacts

Artifact Name          Bot Platform Indicated
Window._phantom        PhantomJS [Pha]
Window.webdriver       Selenium [Sel] WebDriver
Window.domAutomation   ChromeDriver (WebDriver for Chrome)
financial loss to advertisers via accidental (or intentional) impression and/or click fraud. If the visitor’s user-agent fingerprint matches a known bot, a site can choose to “defend” itself against undesired bot access by taking evasive action (e.g., redirecting to an error page) [Inv16]. Non-standard features distinctive to known bot platforms, then, constitute bot artifacts. We exploit VV8’s comprehensive API-property-access tracing to systematically discover novel artifacts.
3.4.1 Artifact Discovery Methodology
We discover previously unknown bot artifacts by clustering the access sites (i.e., script offsets of feature accesses) for candidate features near those of known “seed” artifacts. The key insight underlying our approach is code locality: in our experience, artifact tests tend to be clustered near each other in user-agent fingerprinting code encountered across the web. We exploit this locality effect to automate the process of eliminating noise and identifying a small set of candidates for manual analysis. Candidate Feature Pool. Before searching for artifacts, we prune our search space to eliminate impossible candidates. We eliminate features defined in the Chrome IDL files, since these are standard-derived features unlikely to be distinctive to a bot platform. We also eliminate features seen set or called: these are likely distinctive to JS libraries, not the browser environment itself. This second round of pruning is especially important because JS notoriously conflates its global namespace with its “global object.” Thus, in web browsers, global JS variables are accessible as properties of the window object along with all the official members of the Window interface. Retaining only features we never see set or called eliminates significant noise (e.g., references to the Window.jQuery feature) from our pool of candidate features: from 7,928,522 distinct names to 1,907,499. Seed Artifact Selection. We further narrow our candidate pool using access site locality clustering around “seed” artifacts (Table 3.2). These features are among the most commonly listed in anecdotal bot detection checklists found in developer hubs like Stack Exchange [Stab], reflecting the popularity of Selenium’s browser automation suite and the lighter-weight PhantomJS headless browser. Candidate Artifact Discovery.
With a pruned candidate feature pool and a set of seed artifacts in hand, we can automatically discover candidate artifacts by following these steps: (1) find all distinct access sites for seed artifacts in all archived scripts, (2) find all candidate feature access sites no more than 1,024 characters away from one of the located seed access sites, and (3) extract the set of all candidate features whose access sites matched the seed locality requirement. From our
Table 3.3 Candidate artifacts classified

Count   Category
3       Seed Bot Artifact
46      New Bot Artifact
10      Possible Bot Indicator
19      Device/Browser Fingerprint
46      Property Pollution/Iteration
11      Type Error/Misspelling
8       Missing Dependency
5       Other
initial set of 3 seed artifacts, the above process yields a set of 209 candidate artifacts (0.01% of the candidate pool) found near seed access sites in 7,528 distinct scripts (of which only 1,813 scripts were lexically-distinct). Modern Browser Artifacts. We next eliminated from our candidates any artifacts found in a current, major web browser. We tested a total of nine browser variants manually: two for Chrome (v70 on Linux, v69 on macOS), three for Firefox (v63 on Linux, v62 on macOS, v63 on macOS), one for Safari (v12.0 on macOS), one for Edge (v17 on Windows 10), and two for Internet Explorer (v8 on Windows 7, v11 on Windows 10). In total we found 61 of the candidates present on at least one tested browser, leaving 148 candidates that might be indicative of a distinctive bot platform. The only 2 present on all 9 browsers were in fact standard JS (but not WebIDL-defined) features in the global object: Object and Function. Manual Classification. The remaining 148 candidate artifacts we classified manually. Intuitively, if for every one of a candidate artifact’s access sites there exists a data flow from that site to apparent data exfiltration or evasion logic, we consider that candidate a true artifact. If there exist benign or inconclusive examples, we conservatively assume the candidate is not a true artifact. (We also attempt to categorize false positives, but that process often depends on subjective judgment of programmer intent.) To assist this process, we classified artifact access sites into 3 categories: direct if the feature name appears in the source code at the exact offset of the access site; indirect if the name appears only elsewhere in the code; and hidden if the name does not appear at all. A candidate found in a small number of distinct scripts and accessed mostly via hidden, monomorphic access sites almost always proved to be a bot detection artifact. 
Conversely, candidates found far and wide and accessed mostly via direct or polymorphic sites usually proved to not be true bot artifacts. Table 3.3 shows the breakdown of manual classifications. We identified a total of 49 artifacts (including our seeds) used exclusively, as far as we could tell, for bot detection. We identified 10 more that we did see used for bot detection activity but not exclusively so. (To avoid false positives, we exclude these “maybe” artifacts from our aggregate results.) An additional 19 appeared to be known or suspected fingerprinting artifacts of specific browsers or devices (e.g., standard features
with vendor prefixes like moz- and WebGL information query constants). Almost all of the remaining candidate artifacts appear to be side-effects of JS language quirks and sloppy programming. An example, extracted from a lightly obfuscated bot detection routine, explains some of the 46 artifacts we attribute to property pollution in iteration (Listing 3.1). This code iterates over an array of property names to check (in this case, all true bot artifacts). However, JS arrays intermingle indexed values with named properties, and this code fails to exclude properties (e.g., findAll) inherited from the array’s prototype. As a result, a single polymorphic access site within our clustering radius would access both true bot artifacts and unrelated array method names, bloating our initial candidate artifact pool with spurious features that had to be weeded out manually. Fortunately, the small size of the candidate set combined with the insights provided by access site classification made this task straightforward and tractable. Other identifiable sources of noise in the final candidate set include obvious type errors or misspellings (11) and what appear to be missing dependencies (8).
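The locality-based discovery of Section 3.4.1 can be sketched as follows. This is a deliberate simplification: it searches script text for feature names, whereas the real analysis clusters access-site offsets recorded in VV8’s trace logs.

```python
def find_sites(source: str, name: str):
    """All character offsets at which a feature name occurs in a script."""
    sites, i = [], source.find(name)
    while i != -1:
        sites.append(i)
        i = source.find(name, i + 1)
    return sites

def candidates_near_seeds(source, seed_names, candidate_names, radius=1024):
    """Keep candidate features with at least one access site within `radius`
    characters of a seed artifact access site (the code-locality heuristic)."""
    seed_sites = [s for name in seed_names for s in find_sites(source, name)]
    return {
        cand
        for cand in candidate_names
        for site in find_sites(source, cand)
        if any(abs(site - seed) <= radius for seed in seed_sites)
    }

# "spawn" is probed right next to the seed; "jQuery" is ~2,000 characters
# away and is discarded by the locality requirement.
script = ("if (window._phantom || window.spawn) bail();"
          + " " * 2000
          + "use(window.jQuery);")
assert candidates_near_seeds(script, ["_phantom"], ["spawn", "jQuery"]) == {"spawn"}
```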
3.4.2 Artifact Analysis Results
Across Visited Domains. Our trace logs recorded probes of at least one definite bot artifact during visits to 14,575 (29%) of the Alexa top 50k. This number includes artifact accesses from both monomorphic (sites accessing only one feature; 24%) and polymorphic access sites (those accessing more than one feature name; 5%). If we consider only monomorphic access sites, the number drops to 11,830 visited sites, which is under 24% of the top 50k. The modest size of that drop implies that most bot detection scripts, even if obfuscated, perform artifact probes on a one-by-one basis rather than through changing, loop-carried indirect member accesses. Table 3.4 shows the top 10 visited domains (by Alexa ranking at the time of data collection) on which bot artifact probes were detected.

Across Security Origin Domains. When we consider security origin domains as well as visit domains, we find that the majority (over 73%) of bot artifact accesses happen inside third-party sourced scripts.
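The monomorphic/polymorphic distinction above amounts to grouping logged accesses by access site and counting distinct feature names per site. The following sketch uses an invented log format for illustration, not our actual trace-log schema:

```javascript
// Count monomorphic vs. polymorphic access sites in a list of logged accesses,
// where each entry records an access site identifier and the feature it probed.
function siteKinds(accessLog) {
  const bySite = new Map();
  for (const { site, feature } of accessLog) {
    if (!bySite.has(site)) bySite.set(site, new Set());
    bySite.get(site).add(feature);
  }
  const kinds = { monomorphic: 0, polymorphic: 0 };
  for (const features of bySite.values()) {
    kinds[features.size === 1 ? "monomorphic" : "polymorphic"]++;
  }
  return kinds;
}

const log = [
  { site: "s1", feature: "_phantom" },
  { site: "s1", feature: "webdriver" },  // s1 probes two names: polymorphic
  { site: "s2", feature: "_phantom" },   // s2 probes one name: monomorphic
];
console.log(siteKinds(log)); // one monomorphic site, one polymorphic site
```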
/* originally obfuscated via opaque string concatenation */
var d = ["_phantom", "__nightmare", "_selenium", "callPhantom",
         "callSelenium", "_Selenium_IDE_Recorder"],
    e = window;

for (var l in d) {
    var v = d[l];
    if (e[v]) return v;
}

Listing 3.1 Noisy artifact probing
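For contrast, a corrected variant of the probe loop in Listing 3.1 would filter out inherited names before probing. The pollution below is simulated for illustration; the guard is the standard hasOwnProperty check the original code lacks:

```javascript
// Simulate the prototype pollution discussed above: a library patching a
// "findAll" method onto Array.prototype makes it show up in for...in loops.
Array.prototype.findAll = function () {};

var d = ["_phantom", "__nightmare", "_selenium"];
var probed = [];
for (var l in d) {
  // The guard the original probe loop lacks: skip inherited names like "findAll".
  if (!Object.prototype.hasOwnProperty.call(d, l)) continue;
  probed.push(d[l]);
}
delete Array.prototype.findAll; // clean up the simulated pollution

console.log(probed); // only the intended artifact names; no inherited "findAll"
```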
Table 3.4 Highest ranked visit domains probing identified bot artifacts
Visit Domain         Alexa Rank
youtube.com          2
yahoo.com            6
reddit.com           7
amazon.com           11
tmall.com            13
weibo.com            21
google.de            23
ebay.com             45
mail.ru              50
stackoverflow.com    55
Table 3.5 Top security origin domains probing bot artifacts
Origin Domain                      Visit Domains
tpc.googlesyndication.com          10,291
googleads.g.doubleclick.net        3,980
ad.doubleclick.net                 1,853
secure.ace.advertising.com         1,150
www.youtube.com                    1,041
nym1-ib.adnxs.com                  699
media.netseer.com                  321
adserver.juicyads.com              175
openload.co                        168
aax-us-east.amazon-adsystem.com    121
We found bot artifact probes in the contexts of 6,257 distinct security origin domains. Table 3.5 lists the top 10 origin domains for bot detection activity. Naturally, four of the top five are affiliated with Google's advertising platform. Scripts running in the context of the top domain, tpc.googlesyndication.com, probed no less than 42 of our 49 confirmed artifacts (85%). We believe most of these instances to be benign in intent. Advertisers have a legitimate incentive to avoid paying for pointless ad impressions by blocking bots. But large-scale (i.e., automated) web measurement accuracy may become collateral damage in this arms race. The future is not bright for naive, off-the-shelf web crawling infrastructure.

Popular Artifacts. In Table 3.6 we list our 15 most popular (by visit domain cardinality) bot detection artifacts. Unsurprisingly, given our seed artifacts, most results appear associated with variants of Selenium and PhantomJS. But our locality search pattern also discovered artifacts of additional automation platforms: Awesomium, NightmareJS, and Rhino/HTMLUnit. The full list of discovered artifacts includes a superset of all the Selenium and PhantomJS artifacts tested for in the latest available version [Vas19] of Fp-Scanner [Vas18b]. Most of the artifact names are highly suggestive and/or self-explanatory, with the single most
Table 3.6 Most-probed bot artifacts
Artifact Feature Name                        Visit Domains    Security Origins
HTMLDocument.$cdc_asdjflasutopfhvcZLmcfl_    11,409           887
Window.domAutomationController               11,032           2,317
Window.callPhantom                           10,857           5,088
Window._phantom                              10,696           5,052
Window.awesomium                             10,650           203
HTMLDocument.$wdc_                           10,509           18
Window.domAutomation                         7,013            2,674
Window._WEBDRIVER_ELEM_CACHE                 6,123            1,803
Window.webdriver                             2,756            1,832
Window.spawn                                 1,722            1,559
HTMLDocument.__webdriver_script_fn           1,526            1,390
Window.__phantomas                           1,363            1,103
HTMLDocument.webdriver                       1,244            529
Window.phantom                               953              820
Window.__nightmare                           909              628
common association being Selenium, but a few require explanation. The $cdc_... artifact is an indicator of ChromeDriver, as $wdc_ is of WebDriver; notably, these are among the relative minority of artifacts found on non-global objects like window.document. spawn is an artifact of the Rhino JS engine, which is itself an indicator of the HTMLUnit headless browser system.
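A probe for these document-attached artifacts can be sketched as a scan over an object's own property names. The function and sample object below are hypothetical illustrations, not code observed in the wild:

```javascript
// Look for ChromeDriver/WebDriver markers among an object's own keys,
// matching the $cdc_ and $wdc_ prefixes discussed above.
function findDriverArtifacts(docLike) {
  return Object.keys(docLike).filter(
    (k) => k.startsWith("$cdc_") || k.startsWith("$wdc_")
  );
}

// A stand-in for window.document with one planted ChromeDriver artifact.
const fakeDocument = { title: "page", $cdc_asdjflasutopfhvcZLmcfl_: {} };
console.log(findDriverArtifacts(fakeDocument)); // just the planted $cdc_ key
```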
3.4.3 Case Studies
Explicit Bot Identification. Listing 3.2 shows part of a script loaded from http://security.iqiyi.com/static/cook/v1/cooksdk.js which we observed on visits to multiple domains: iqiyi.com, qiyi.com, zol.com, and pps.tv. The script, which appeared to be the result of automatically bundling many related library modules together, was minified but not obfuscated. It provides a rare example in which the attribution logic is fairly obvious: the presence of specific artifacts directly triggers what appears to be bot labeling via string concatenation. Note that this example uses Window.Buffer, one of our "possible" bot artifacts, which implies execution in the Node.js environment. Code locality strikes again: the code immediately adjacent to this excerpt includes functions that collect attributes of a containing
detectExecEnv: function() {
    var e = "";
    return (window._phantom
            || /* more PhantomJS probes */)
               && (e += "phantomjs"),
           window.Buffer && (e += "nodejs"),
           window.emit && (e += "couchjs"),
           window.spawn && (e += "rhino"),
           window.webdriver && (e += "selenium"),
           (window.domAutomation
            || window.domAutomationController)
               && (e += "chromium-based-automation-driver"),
           e
},

Listing 3.2 Artifact attribution in the wild
all fail, it reloads the current frame/page (presumably, with new content deemed too valuable for consumption by bots).

Feature Test Vectors. An obfuscated suite of bot detection routines found in https://js.ad-score.com/score.min.js illustrates more sophisticated use of bot detection artifacts. A total of 101 lexically-equivalent variants of this script were loaded from that URL with different query strings; this form of script was encountered on 126 distinct visit domains. In total, it probes 27 of our definite or probable artifacts, all through property accesses obfuscated via indirection through a large string look-up table which is de-scrambled at run time. The script includes 3 suites of artifact tests: one for variants of Selenium (including Watir [Wat]) and two for Chromium-based embedded-browser toolkits (Awesomium and CEF [Cef]). These produce vectors of multiple results each, implying a sophisticated, multi-faceted scoring system on the backend. This script again highlights the code locality principle: a few seed artifact accesses reveal the location of other artifacts and sophisticated fingerprinting techniques.
/* Original obfuscated code excerpt */
_ = window;
if (u82222.w(u82222.O(/*...*/))) {}
else location[u82222.f(u82222.r(11) + /*...*/)]();

/* Deobfuscated version */
if (_["phantom"] || /* more PhantomJS probes */
    || _["Buffer"] || _["emit"] || _["spawn"]
    || _["webdriver"] || _["domAutomation"]
    || _["domAutomationController"]) {}
else location["reload"]();

Listing 3.3 Bot deflection in the wild
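The indirection pattern used by such scripts (property names recovered at run time from a scrambled look-up table, so they never appear literally at the access site) can be sketched as follows. The table contents and descrambling scheme here are invented for illustration, not taken from the actual score.min.js code:

```javascript
// Scrambled fragments of the feature name; the literal string "_phantom"
// appears nowhere in this script, making the access site "hidden".
const table = ["mot", "nahp_"];
const descramble = (i) => table[i].split("").reverse().join("");
// A single polymorphic access site: the property name depends on its arguments.
const probe = (obj, a, b) => obj[descramble(a) + descramble(b)];

const target = { _phantom: true }; // stand-in for window on a PhantomJS bot
console.log(probe(target, 1, 0));  // → true: resolves "_phantom" without naming it
```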
3.5 Related Work
Security and Sandboxing. JStill [Xu13] used static code signatures to detect known classes of JS obfuscation commonly employed by malware. Revolver [Kap13] employed static lexical fingerprints and spatial clustering to detect and track the evolution of JS malware samples as their authors modified them to evade detection by the Wepawet browser honeypot/sandbox. Saxena et al. used off-line symbolic execution [Sax10] to discover cross-site-scripting (XSS) vulnerabilities in JS code from web applications. The Rozzle [Kol12] malware detection system employed a pragmatic form of symbolic execution to dramatically enhance the effectiveness of existing malware classifiers based on both dynamic and static analysis. Hulk [Kap14] employed JS dynamic analysis techniques to elicit and detect malicious behavior of extensions for the Chrome browser. Taint analysis of JS has been used to identify cross-site-scripting (XSS) vulnerabilities [Sto15; Lek17] or leaks of private data to third parties [Tra12; Che18]. Taint analysis typically depends on substantial patches to a fixed (and soon obsolete) version of a browser; an exception [Chu15] uses JS source rewriting to achieve inline flow monitoring without JS engine modifications, but the overhead is prohibitive. Ambitious forensic browsing record and replay systems built via browser modification include WebCapsule [Nea15] and JSgraph [Li18]. JSgraph in particular provides sophisticated causality tracking across related HTTP, DOM, and JS events (although it does not provide the breadth of API logging VV8 does). These systems provide impressive capabilities, but they quickly become obsolete as the upstream browser code bases rapidly evolve and the patches are left unmaintained (if made available at all). Published JS sandboxing systems include both in-band systems like JSand [Agt12] and Phung et al. [Phu09] and out-of-band systems like ConScript [Mey10].
Attempts [Ter12; Cao12] to fully sandbox JS execution inside a JS engine implemented in JS, while technically sound, inevitably exhibit unacceptable performance.

Measurements. Richards' ironically titled survey [Ric11] of real-world usage of JS's infamous eval feature provides an exhaustive catalog of uses and abuses and prompted at least one direct follow-up mitigation effort [Jen12]. Nikiforakis's measurement [Nik12] of remote JS script inclusions on top web sites, while not technically an analysis of JS code per se, clearly documented the distributed nature of JS web applications and many practical trust and security issues raised by that structure. Mayer and Mitchell produced the influential FourthParty web measurement framework and demonstrated the value of comprehensive web measurements while measuring third-party web tracking [May12]. Acar et al. used in-browser instrumentation of select features to detect and measure browser fingerprinting with FPDetective [Aca13]. Englehardt and Narayanan's survey of online trackers [Eng16] served as a showcase for the mature and popular [Mir17; Das18] OpenWPM web privacy measurement platform built around Firefox. Like FourthParty before it, OpenWPM favors the flexibility of JS-based instrumentation over the in-browser approach taken by VV8. For the specific measurement
goals of this paper, our in-browser approach provided coverage OpenWPM's in-band instrumentation could not match (Section 2.2). Merzdovnik et al. [Mer17] measured the effectiveness of tracker blocking tools like Ghostery and AdBlock Plus while visiting over 100,000 sites within the Alexa top 200,000 domains. Their focus was on identifying sources of 3rd-party tracking and measuring the success or failure of blockers; ours is on fine-grained feature usage attribution and analysis on a script-by-script basis. Lauinger et al. [Lau17] surveyed 133,000 top sites and discovered widespread use of outdated or vulnerable JS libraries using a browser automation system like ours but without instrumenting or logging API usage. Snyder's measurement [Sny16] of JS browser API usage on top web sites found approximately 50% of the available features completely unused across the Alexa top 10K at the time of measurement. A follow-up work [Sny17a] explored the degree to which potentially dangerous or undesirable JS browser API features could be disabled to reduce the browser's attack surface without disrupting the user's web browsing experience.
3.6 Conclusion
We have made the case for choosing out-of-band over in-band JS instrumentation when measuring the web for security and privacy concerns. We also presented VisibleV8, a custom variant of Chrome for measuring native JS/browser features used on the web. VV8 is modern, stealthy, and fast enough for both interactive use and web-scale automation. Our implementation is a small, highly maintainable patch easily ported to new browser versions. The resulting instrumentation, hidden inside the JS engine itself, is transparent to visited pages, performs as well as or better than in-band equivalents, and provides fine-grained feature tracking by source script and security origin. With VV8 we have observed JS code, loaded directly or by frames on 29% of the Alexa top 50k sites, actively testing for common automated browser frameworks. As many web measurements rely on such tools, this result marks a concerning development for security and privacy research on the web. VisibleV8 has proven itself a transparent, efficient, and effective observation platform. We hope its public release contributes to the development of next-generation web instrumentation and measurement tools for security and privacy research.
3.7 Availability
The VisibleV8 patches to Chromium, along with tools and documentation, are publicly available at:
https://kapravelos.com/projects/vv8
CHAPTER 4

TOWARDS REALISTIC AND REPRODUCIBLE WEB CRAWL MEASUREMENTS
Accurate web measurement is critical for understanding and improving security and privacy online. Implicit in these measurements is the assumption that automated crawls generalize to the experiences of typical web users, despite significant anecdotal evidence to the contrary. That evidence suggests the web behaves differently when approached from well-known measurement endpoints, or with well-known measurement and automation frameworks, for reasons ranging from DDoS mitigation to cloaking of malicious behavior to bot detection. This work improves the state of web privacy and security by investigating how, and in what ways, privacy and security measurements change when using typical web measurement tools, compared to measurement configurations intentionally designed to match "real" web users. We build a web measurement framework encompassing network endpoints and browser configurations, ranging from the off-the-shelf defaults commonly used in research studies to configurations more representative of typical web users, and we note the effect of these realism factors on security- and privacy-relevant measurements when applied to the Tranco top 25k web domains. We find that web privacy and security measurements are significantly affected by measurement vantage point and browser configuration, and conclude that unless researchers carefully consider if and how their web measurement tools match real-world users, the research community is likely systematically missing important signals. For example, we find that browser configuration alone
can cause shifts in 19% of known ad and tracking domains encountered, and similarly affects the loading frequency of up to 10% of distinct families of JavaScript code units executed. We also find that choice of network vantage point has similar, though less dramatic, effects on privacy and security measurements. To aid measurement replicability and future web research, we share our dataset and precise measurement configurations.
4.1 Introduction
Research into web security and privacy depends on web measurements for ground truth [May12; Roe12]. Automated, web-scale crawls have been used to estimate user tracking granularity [Eng16] and regulatory compliance [Fru15], among other important questions. Ahmad et al. found nearly 16% of papers recently published across many top security, privacy, and network measurement venues relying at least in part on data collected via automated web crawls [Ahm20]. Implicit in most automated web measurement work is the assumption that the web encountered through automated measurement is the same as the web encountered by typical web users, or at least similar enough that the findings for the former generalize to the experiences of the latter. However, as crucial as this assumption is to much security and privacy work, we find that it has not been systematically studied or assessed. More concerning, the related work that does exist suggests that the gap between the "measured" web and the "experienced" web may be large. For example, Zeber et al. [Zeb20] compared results from automated crawls launched from different network endpoints against anonymized browsing sessions provided by volunteers; they found dramatic mismatches between the crawls and user sessions in key privacy metrics (e.g., prominence of 3rd-party domains contacted), but left unresolved the question of how much impact was attributable to sites deliberately discriminating against "unrealistic" clients. Similar work by Ahmad et al. [Ahm20] assessed the impact of user agent choice on web crawl observations, from primitive headless agents like cURL to sophisticated full-browser frameworks like OpenWPM [Eng16]. While these results again showed dramatic divergence in common security and privacy metrics, the authors' emphasis was on experiment reproducibility, and their analysis did not attempt to quantify the direct effects of bot detection/discrimination on results.
Other recent publications document that many web sites perform dynamic bot detection [Jue19] and that some malicious content actively evades visitors from non-residential networks [Vad19]. This work aims to improve the state of web privacy and security by investigating how the web changes when observed with typical web measurement techniques, compared to measurement configurations carefully designed to closely match those of typical web users. More specifically, we measure how choices in browser configuration (BC) and network vantage point (VP) affect common privacy and security metrics. We design browser and vantage point configurations to closely match those of typical web users and treat them as ground truth against which commonly used alternatives can be compared in a robustly controlled and repeated web crawl over the Tranco top 25K web
domains [Le 19]. This study considers three commonly used measurement vantage points (i.e., a popular cloud provider, a research university, and a residential ISP) and two common browser configurations (one with the default configuration of a popular browser automation framework, the other configured to more closely approximate a standard desktop browser). We treat the measurements taken from the standard desktop browser BC and residential ISP VP as "realistic", or ground truth (i.e., the web as encountered by typical web users), and consider consistent differences in results between our ground truth configurations and the other VP and BC configurations to be a form of measurement bias. Higher levels of consistent difference between configurations constitute larger measurement bias, and thus a greater threat to the validity of measurement studies employing unrealistic configurations. Our analyses ignore ephemeral outliers and identify data points showing consistently significant differences among VP and BC variants across all repeated crawls. We believe our approach establishes a lower bound on the measurement bias (see Section 4.2.7) that can be expected from unrealistic web measurement methodology. We find significant, and sometimes dramatic, differences in common privacy and security measurements attributable to VP and BC selection. This work's findings include that certain commonly used measurement configurations introduce significant measurement bias regarding which domains are encountered, and how much traffic is sent to those domains. We find that measurements from the cloud VP introduce higher measurement bias than the other measured VPs. We also find that using non-realistic BCs introduces significant measurement bias regarding which well-known advertising and tracking domains are encountered; measurements taken with the default puppeteer1 configuration, for example, miss up to 19% of the advertising and tracking domains encountered by realistic configurations.
We observe that non-realistic choices in BC and VP can introduce similar measurement bias into which JavaScript libraries are observed on the web. We also present case studies demonstrating that non-realistic measurement configurations cause different patterns in JS API calls. Finally, we use our findings to provide recommendations for future web privacy and security research, to maximize "realism" in measurement results. We provide more detailed guidance and discussion in Section 4.4, but in summary, we conclude that researchers should avoid lowest-common-denominator crawlers, such as stock Puppeteer driving headless Chromium, when assessing real-world security and privacy concerns.

Contributions. Our core contributions include:
1. Comprehensive documentation and implementation details of our synchronized parallel web crawl methodology, measuring how the privacy and security characteristics of the web change under different measurement configurations.
2. The complete dataset generated by crawling the Tranco 25K top sites from 3 measurement vantage points, and under two representative browser configurations.
1https://github.com/puppeteer/puppeteer/
3. Conclusions and guidance on how future research can incorporate this work's findings so that results from automated web measurements generalize more accurately to real-world human browsing behavior.
4.2 Methodology
Our experiments center on simultaneous visits to top-ranked web sites by multiple clients differing in realism of network vantage point (VP) and browser configuration (BC), with controls in place to neutralize differences introduced by external factors such as available system resources, DNS resolution, and sources of programmatic entropy available to client-side JS scripts. Recent work [Ahm20] has documented an unfortunate propensity of authors employing web crawls to under-specify the design parameters of their crawls, frustrating reproduction of results. Here we specify and justify our crawl design criteria in reproducible detail.
4.2.1 Approach to Realism
As we attempt to measure the extent to which automated web measurements can be distorted by unrealistic (i.e., non-human-like) crawlers, we face a challenge defining and deploying a “realistic” crawler. Ideally, we would compare a typical automated crawler directly against a live, human counterpart. Such an ideal experiment is impractical for several reasons: human volunteers do not scale well, and using real-world browsing data is fraught with ethical concerns, if it is even available at all. Our solution is to select VP and BC alternatives that can be reasonably ranked in order of relative realism based on known instances of adversarial response (e.g., bot detection, malicious cloaking). We expect the differences observed (if any) between lower-realism and higher-realism crawlers, if all other factors are controlled, to provide a lower bound for expected differences between typical crawlers and actual human users.
4.2.2 Realism Variables
Two variables control the range of realism attempted by our clients: the network endpoint from which we visit pages (vantage point, or VP) and the browser settings employed (browser configuration, or BC). Each target URL (Section 4.2.4) is visited to produce a page set, the result of visiting that URL simultaneously from each distinct VP/BC pairing.
4.2.2.1 Vantage Point (VP)
We collected data from three distinct, representative VPs: a major research university network, a nearby residential ISP network, and a popular cloud provider’s network (Amazon AWS). The university and residential endpoints are co-located in the same city. The cloud endpoint was placed in the cloud provider’s nearest available datacenter, which is within the same national border, in a neighboring province. The residential network provides the ostensible best-case in VP realism, as
Figure 4.1 Workflow from Domain List to Target Server
it is used exclusively for end-user activities. The university network combines both end-user and infrastructure activities; its realism is presumed to fall somewhere between the residential and cloud extremes. The cloud network provides an expected worst-case in VP realism as its typical use is for infrastructure rather than end-user network access. Connectivity via these endpoints is achieved via implementation details discussed in Section 4.2.5. According to IP geolocation data provided by https://ifconfig.co, the university and residential endpoints were 6 km apart, while the university and cloud endpoints were 375 km apart. Naturally, the distance separating the cloud endpoint from its counterparts raises concerns about the effect of geo-targeted web content. We address this concern in the discussion of our analysis approach (Section 4.3).
4.2.2.2 Browser Configuration (BC)
We crawl using two variants of Google Puppeteer controlling Chromium 80: a lower-realism naive variant and a higher-realism stealth variant. Relative realism is inferred from the ongoing arms race between developers of bots (automatic, non-human user agents) and bot detectors. The naive BC, running Chromium in headless mode using stock Puppeteer, is easy to detect as a bot thanks to identifying quirks [Vas18a] of Chrome's headless mode (e.g., "HeadlessChrome" in the User-Agent string). The stealth BC presents a harder target by running the browser in non-headless mode and using a community-provided stealth plugin for Puppeteer that adds bot-detection countermeasures such as spoofing available media codecs to better match consumer devices and suppressing the tell-tale Navigator.webdriver attribute. The existence and continued maintenance of the stealth plugin, an explicit workaround for bot detection of headless Puppeteer crawlers, indicates that there exists some population of content in the wild for which our naive BC will be considered "unrealistic" and our stealth BC "realistic."
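The tells distinguishing the two BCs can be illustrated with a toy classifier. The user-agent strings and webdriver values below are illustrative stand-ins, not captured data:

```javascript
// A detector-style check combining two well-known headless-automation signals:
// the headless UA token and the navigator.webdriver flag.
function looksAutomated(nav) {
  return /HeadlessChrome/.test(nav.userAgent) || nav.webdriver === true;
}

// Stand-ins for the navigator object under each browser configuration.
const naiveBC   = { userAgent: "Mozilla/5.0 ... HeadlessChrome/80.0.3987.87", webdriver: true };
const stealthBC = { userAgent: "Mozilla/5.0 ... Chrome/80.0.3987.87",         webdriver: undefined };

console.log(looksAutomated(naiveBC));   // → true: both signals present
console.log(looksAutomated(stealthBC)); // → false: both signals suppressed
```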
4.2.2.3 Summary
Each of our page sets comprises 6 synchronized parallel visits to the same URL, one for each combination of our 3 network vantage points and 2 browser profiles representing a range of relative realism levels.
43 4.2.3 Control Constants
Our analysis of results across VP/BC depends on eliminating as many sources of irrelevant differences across clients as possible. To this end we aggressively homogenize all readily controllable aspects of our crawls across all clients.
4.2.3.1 Workflow & Timeouts
All page visits follow the same workflow and employ the same timeout limits for each phase. First, the browser is launched with a clean user profile (i.e., no cookies or cached content), instrumentation callbacks are established, and the browser is navigated to the target URL. If this initial navigation fails to successfully fetch an HTML document within the navigation timeout of TN = 30 seconds, the page visit is aborted. Otherwise, the browser is left idle, running JS code and firing timers as needed, for the loiter timeout of TL = 15 seconds. Once the loiter time is expended, the crawler begins to "tear down" the visit by first capturing a number of page artifacts (such as final DOM HTML and screenshot). If this "tear down" process exceeds the watchdog timeout limit of TD = 15 seconds, the page visit is aborted. (All non-aborted page navigations are considered successful, even if they never fire the official "load" event.) The theoretical maximum time taken per page, then, is 60 seconds. We discuss selection of these parameter values in Section 4.2.6.
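The phase/timeout structure described above might be sketched as follows. The withTimeout helper and placeholder phase callbacks are illustrative, not our crawler's actual implementation:

```javascript
// Per-phase timeout budget from the text (milliseconds).
const TN = 30_000, TL = 15_000, TD = 15_000; // navigation, loiter, watchdog

// Race a phase's promise against its timeout, cleaning up the timer either way.
function withTimeout(promise, ms, label) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(label + " timeout")), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// The three-phase visit workflow; navigate/teardown are caller-supplied stubs.
async function visitPage(navigate, teardown, tn = TN, tl = TL, td = TD) {
  await withTimeout(navigate(), tn, "navigation"); // abort: no HTML within TN
  await new Promise((r) => setTimeout(r, tl));     // loiter: let JS and timers run
  await withTimeout(teardown(), td, "watchdog");   // capture DOM, screenshot, etc.
}
```

The worst case is a navigation that consumes nearly all of TN, a full loiter of TL, and a teardown that consumes nearly all of TD, matching the 60-second maximum stated above.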
4.2.3.2 DNS Resolution
We configured all browsers at all endpoints to resolve DNS names using CloudFlare’s popular 1.1.1.1 resolver network [Clo]. Our experiment is not designed to measure the impact of DNS resolution on web content, so we did not perform A/B crawls with and without CloudFlare’s DNS service. Rather, our goal is to reduce potential noise caused by using different DNS servers from different providers with completely different priorities and quality of service.
4.2.3.3 JS Entropy Sources
Dynamic resource loading triggered by JS code has been known to rely on sources of randomness available to JS programmers, such as the Math.random API, or on the current timestamp as returned (with millisecond granularity) by new Date(). Whenever a new frame is created, our crawler preloads its execution context with a JS polyfill from the Google Catapult project's Web Page Replay framework2 that provides deterministic alternatives to these APIs.
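A minimal sketch of such a determinism shim, in the spirit of (but much simpler than) the Web Page Replay polyfill, might look like this. The MINSTD generator, the seed, and the frozen-clock factory are illustrative choices, not the shim's actual contents, and for clarity this version returns replacement functions rather than patching Math.random and Date in place:

```javascript
// Seeded pseudo-random generator (MINSTD LCG): the same seed always yields
// the same sequence, making Math.random-dependent loads repeatable.
function makeDeterministicRandom(seed) {
  let state = seed % 2147483647;
  if (state <= 0) state += 2147483646;
  return function () {
    state = (state * 48271) % 2147483647;
    return (state - 1) / 2147483646;
  };
}

// Frozen clock: a stand-in for patching Date.now()/new Date() with a
// fixed millisecond timestamp.
function makeFrozenNow(fixedMs) {
  return function () { return fixedMs; };
}

const rand = makeDeterministicRandom(42);
const now = makeFrozenNow(1580000000000); // arbitrary fixed timestamp

console.log(rand() === makeDeterministicRandom(42)()); // → true: same seed, same value
console.log(now() === now());                          // → true: time never advances
```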
4.2.3.4 Bandwidth/Latency
To compensate for differences in available bandwidth and typical latency between our vantage points, we used Chromium's network throttling support to limit maximum throughput and minimum request latency to the lowest common denominator in our setup. Unsurprisingly, our network
2https://chromium.googlesource.com/catapult/
bottleneck was the residential vantage point, equipped with asymmetric bandwidth (200 Mbps down, 10 Mbps up) that had to be shared with two residents compelled to work from home by the COVID-19 pandemic. We set bandwidth limits arithmetically, by dividing 50% of the available residential bandwidth (in each direction) among the workers deployed there and setting identical limits on all other workers. We measured the highest round-trip time for a simple HTTP request timing experiment conducted through each VP and used the maximum of 64 ms (from residential) as the minimum latency for all workers.
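The throttling arithmetic can be illustrated as follows. Note that the worker count is a hypothetical value for illustration; the text above does not state how many workers shared the residential link:

```javascript
// Bandwidth figures from the text; worker count is ASSUMED for illustration.
const downMbps = 200, upMbps = 10;  // residential link (asymmetric)
const shareForCrawling = 0.5;       // reserve half the link for the residents
const workers = 10;                 // hypothetical number of residential workers

// Identical per-worker throughput caps applied at every vantage point.
const perWorkerDownMbps = (downMbps * shareForCrawling) / workers; // 10 Mbps
const perWorkerUpMbps = (upMbps * shareForCrawling) / workers;     // 0.5 Mbps
const minLatencyMs = 64;            // worst measured RTT, from residential

console.log(perWorkerDownMbps, perWorkerUpMbps, minLatencyMs);
```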
4.2.3.5 Summary
All anticipated entropy sources are held as constant as is practical across all visits: workflow timeouts, DNS resolution (via CloudFlare), JS entropy sources, and lowest-common-denominator bandwidth/latency throttling.
4.2.4 Web Site Selection
We visit the top 25,000 web domains as ranked by the Tranco list of top sites [Le 19] (snapshot 77PX). In keeping with our focus on web security and privacy measurements, our interest is primarily 3rd-party infrastructure content such as advertisement frameworks, trackers, and analytics scripts rather than 1st-party application content or behavior, so we do not expend resources crawling recursively into a website's contents beyond the "landing page" provided by navigating to the domain name itself as an HTTP URL.

Much web content is inherently dynamic [Ada09] (e.g., news headlines, advertisements) or even personalized (e.g., recommended content, advertisements). Furthermore, the web depends on a strictly-best-effort Internet, where ephemeral connectivity issues are common. Such ephemeral noise threatens our ability to isolate meaningful differences across endpoints. In addition to all the controls enumerated above, we combat such noise with repetition, visiting our itinerary 3 times and factoring consistency across repetitions into our analysis. Repetition count is not a well standardized parameter of web crawl methodologies. If mentioned at all, it is typically justified in relation to a particular metric or analysis technique [Sny16; Ahm20]. We chose 3 repetitions pragmatically: it provides some robustness against temporary connectivity issues and provides measurement of stability in observations across crawls while retaining modest resource overhead. Since some environmental factors outside our control (e.g., diurnal activity patterns affecting network load) directly relate to time, and since even responsible web crawling at slow rates might well result in IP blacklisting over the course of a week-long crawl, we decouple website popularity from time elapsed during the experiment by randomizing the order of domains visited at the start of each crawl repetition.
This shuffling prevents diurnal patterns from coinciding with domain spacing in our crawl and gives us some confidence that any rank-based metrics used in our analysis are not accidental proxies for time-of-day or other temporal patterns.
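The per-pass shuffle can be sketched as a standard Fisher–Yates permutation; the text does not show our crawler's actual shuffling code, so this is an illustrative equivalent:

```javascript
// Return an unbiased random permutation of the domain list, leaving the
// master list untouched; re-drawn at the start of each crawl repetition.
function shuffled(domains, rand = Math.random) {
  const a = domains.slice();
  for (let i = a.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1)); // pick from the unshuffled prefix
    [a[i], a[j]] = [a[j], a[i]];
  }
  return a;
}

const pass = shuffled(["a.com", "b.com", "c.com", "d.com"]);
console.log(pass.length); // 4: same sites every pass, different visit order
```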
4.2.4.1 Summary
We visit the Tranco top 25,000 domains in a series of 3 independent crawls. We randomize the order of site visits within each crawl to decouple our results from potential time-based confounding factors.
4.2.5 Implementation Details
Figure 4.1 illustrates the high-level design of our crawling experiment and hints at some of the implementation and infrastructure details briefly discussed here.
4.2.5.1 Infrastructure
All university and cloud visits were hosted on an 8-node Kubernetes cluster, comprising 352 total CPUs and 1.5 TiB total RAM. Cloud visits were proxied through our Amazon AWS endpoint using Go Simple Tunnel’s³ SOCKS5-over-KCP low-latency encrypted transport mode. Given the asymmetric bandwidth available from the residential ISP, described in Section 4.2.2, we could not tunnel crawls through this endpoint, as the tunnel would be effectively throttled by the crippled upstream rate. Instead, we placed a single-node (16 CPUs, 32 GiB RAM) Kubernetes cluster at the residential endpoint. Workers on this outpost cluster handled all visits from the residential vantage point. All workers in the experiment (in both clusters) were configured with CPU and memory limits derived to strictly prevent saturation of the outpost’s limited resources. Bulk data (e.g., HTTP response bodies, VisibleV8 trace logs) was stored in a local MongoDB server running alongside each cluster. Post-processed summary data was stored in a single PostgreSQL server colocated with the primary cluster; outpost post-processing jobs communicated with this server via persistent SSH tunnel.
4.2.5.2 DNS Customization
The Chromium browser does not provide runtime options to select a custom DNS resolver. Our workaround was to configure all Chromium instances, not just those visiting via the cloud endpoint, to use a local SOCKS5 proxy configured to use our desired remote DNS server for name resolution. This approach gave us easy control over DNS server selection without altering the originating IP address of the request, and had the additional benefit of normalizing connection overhead, proxy latency, and Chrome error reporting between the local and cloud endpoints.
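One way to realize this configuration is to build Chromium's command line so that both traffic and name resolution go through the local SOCKS5 proxy. The two switches below are standard Chromium flags; the host/port values and helper name are illustrative:

```python
def chromium_socks_args(host="127.0.0.1", port=1080):
    """Command-line switches directing Chromium to send all traffic,
    including DNS lookups, through a local SOCKS5 proxy (which in turn
    forwards queries to the desired remote resolver)."""
    return [
        f"--proxy-server=socks5://{host}:{port}",
        # Prevent Chromium from resolving names locally and leaking
        # queries to the system resolver; the proxy resolves instead.
        f"--host-resolver-rules=MAP * ~NOTFOUND , EXCLUDE {host}",
    ]
```

The `--host-resolver-rules` mapping is the documented companion to a SOCKS5 `--proxy-server` setting: without it, Chromium may still issue DNS queries directly.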
4.2.5.3 Synchronization
Even with careful resource tuning and throttling, page sets will not remain synchronized across a multi-day crawl experiment without help. Unsynchronized tests over the top 1K sites revealed that page visits from the same set would be spaced far apart soon after the experiment began: e.g., by the time 400 domains had been processed, the residential visits were already over 30 minutes behind
³ https://github.com/ginuerzh/gost
their cloud and university counterparts (which were less than a minute apart), despite uniform CPU and memory limits and no sign of network bandwidth saturation at any vantage point. To provide maximum comparability across control variables, we set out to synchronize visit starts within page sets to be nearly simultaneous. Workers in both clusters pulled page visit jobs from a single Redis-backed work queue hosted in the primary cluster. Outpost workers accessed this Redis server via a persistent SSH tunnel between the clusters. We augmented the off-the-shelf work queue logic with a custom synchronization barrier implemented using atomic counters and pub/sub notifications provided by the same central Redis server. Under this scheme, each job in a page set is tagged with a shared sync tag which serves as a Redis key for storing an atomic counter (initially 0). After pulling a job from the queue but before starting the visit, workers subscribe to notifications for that sync tag, atomically increment the sync counter and, if the returned value matches the expected total count (e.g., 6), publish a notification to release all workers waiting on that tag. With this implementation in place, 99.8% of all page sets in our primary experiment saw all 6 visits launched within a 1 second window, with a mean launch window of 91 ms.
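The barrier's control flow can be illustrated in-process, with Python threading primitives standing in for the Redis operations (a `Condition` in place of `INCR` on the sync-tag key and `PUBLISH`/`SUBSCRIBE` on the tag channel). This is an illustrative analogue, not the deployed implementation:

```python
import threading

class PageSetBarrier:
    """In-process analogue of the Redis sync barrier: each worker
    increments an atomic counter on arrival; the last arrival (e.g.,
    the 6th of 6 visits in a page set) broadcasts the release."""
    def __init__(self, expected):
        self._expected = expected
        self._count = 0                     # Redis: INCR <sync-tag>
        self._cond = threading.Condition()  # Redis: SUBSCRIBE/PUBLISH <sync-tag>

    def arrive_and_wait(self, timeout=None):
        with self._cond:
            self._count += 1
            if self._count >= self._expected:
                self._cond.notify_all()     # last arrival publishes the release
            else:
                self._cond.wait_for(lambda: self._count >= self._expected,
                                    timeout=timeout)
```

In the distributed setting the counter and channel live on the central Redis server, so workers in both clusters block on the same tag regardless of which cluster pulled the job.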
4.2.5.4 Summary
We split our infrastructure into a primary and an outpost cluster to work around asymmetric bandwidth limits for the residential vantage point. DNS resolution customization is controlled via local and remote SOCKS5 proxies, simplifying implementation and unifying behavior and reporting across all variants. Work is distributed and synchronized across the clusters via a central Redis server using standard work-queuing and custom barrier synchronization techniques.
4.2.6 Precautions & Pilot Experiments
As there is no universal “ground truth” for web crawl data collection, we can validate our system only in a precautionary, best-effort sense. We list here predictable threats to validity which we considered and mitigated, along with experimental confirmation of reasonable results.
Is the navigation timeout TN = 30s reasonable? We believe so, for two reasons: the 30 second timeout is comparable to timeouts in similar work [Sny16], and it is longer than what we expect a typical user to tolerate, based both on widely agreed-upon web user experience guidelines [Gooa] and past user behavioral studies [Nah04].
Is the loiter timeout TL = 15s adequate to allow full page load before shutting down? Yes. We performed a pilot experiment over the Tranco top 1K testing loiter times ranging from 15s to 60s in 5s increments and found no variation in how many pages achieved a full “page load” event. We did not test loiter times below 15s, as the distribution of successful page load times indicated 15s to be the minimum reasonable lower bound.

Do the network bandwidth and latency throttling controls distort page performance? No. We tested the Tranco top 1K with and without bandwidth/latency throttling and found essentially no difference in error rate and other core statistics. The results were in fact so similar that we were tempted to eliminate the throttling from the experiment controls. But given past experience suggesting that top sites behave better than average, we left the controls enabled in case lower-ranked sites generate load such that the mismatch in available bandwidth between clusters might harm the comparability of results.

Are the limited resources available at the residential outpost able to keep pace with their beefier counterparts? Yes. We set CPU, memory, and bandwidth limits on all workers to lock total use of system resources below potential saturation for the lowest-common-denominator environment (the residential outpost). Pilot tests during heavy residential network usage (e.g., video conferencing) revealed neither impact to our collection speed or success nor degradation observable by the residents. During this test we specifically monitored the upstream bandwidth used by post-processing when shipping data back to Postgres via SSH tunnel and found it well-constrained below a peak rate of 1.5 Mbps without any specific limits or throttling being applied.
4.2.6.1 Primary Experiment Consistency
The primary experiment ran from 30 April to 9 May 2020. We queued a total of 450,000 page visits, of which 449,936 (99.99%) completed without fatal error. Of these completed pages, 375,246 (83.40%) were completely successful and 74,690 (16.60%) experienced some level of failure. Note that we are extremely conservative in labelling failure, including pages that loaded and collected content successfully but which failed to shut down collection in a timely manner and were thus forcibly aborted late in the page visit workflow. The reported success rate is thus a lower bound on the share of visits producing useful data for analysis. We confirmed that our collection was free of any unexpected patterns in failure rates relating to Tranco rank, time of day, or day of experiment. The Tranco rank independence reassured us that our pilot experiments focused on the top 1K applied reasonably to the rest of the crawl. The time of day independence reassured us that our residential outpost workers were coexisting peacefully with the residential traffic. And the day of experiment independence reassured us that our endpoints were not subjected to effective blacklisting or other progressive service degradation over the course of the experiment, as confirmed independently by reputation monitoring provided by https://hetrixtools.com/.
4.2.6.2 Summary
We justify our choice of timeouts from prior practice and specific testing. We verify that our results are not contaminated with noise from resource saturation within either cluster or from resource mismatch between clusters. We find that errors do not appear correlated to potential time- or rank-based confounding factors, implying that our observations are consistent and reasonably robust.
4.2.7 Quantifying Measurement Bias
Our analysis depends on quantifying the extent to which crawlers differing only in relative realism record consistently and significantly different web measurement results. A specific measurement that consistently and significantly differs across a realism variable like VP or BC constitutes measurement bias.
4.2.7.1 Bias Scores
To facilitate comparing and reasoning about bias, we quantify it to produce a concrete score. We begin by aggregating an additive metric (e.g., total requests) grouped by an entity (e.g., eTLD+1 domain of an HTTP request URL) and a control variable (e.g., VP or BC). We then compute each entity’s bias scores, one for each distinct combination of control variables (e.g., stealth-vs.-naive, or residential-vs.-cloud). For a pair of metrics a and b, the score formula is R(log2 a − log2 b), where R(x) is the common integer rounding function (rounding away from 0 at .5). Using differences of logarithms provides more descriptive power than a simple count difference while avoiding the extreme outliers likely when using ratios. A stealth-vs.-naive score of 2.0 for the domain example.com, for instance, shows that sites sent 4× as many 3rd party requests to example.com during stealth crawls as on naive crawls. A score of −1.0 would indicate 2× as many requests on naive crawls compared to stealth crawls. A score close to 0 (i.e., the majority of scores in practice) indicates insignificant bias across our experiments for that domain.
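A minimal sketch of the score computation (the function name is ours); note that Python's built-in `round` uses banker's rounding, so away-from-zero rounding at .5 is implemented explicitly:

```python
import math

def bias_score(a, b):
    """R(log2 a - log2 b): the rounded log2 ratio of an additive metric
    (e.g., request counts) between two configurations, rounding .5 away
    from zero as the text specifies."""
    x = math.log2(a) - math.log2(b)
    return math.floor(x + 0.5) if x >= 0 else math.ceil(x - 0.5)
```

For example, 400 stealth requests vs. 100 naive requests yields a score of 2 (4× as many under stealth), while 100 vs. 200 yields −1 (2× as many under naive).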
4.2.7.2 Bias Consistency
Transient outliers are eliminated by independently scoring results from each of our 3 crawl repetitions and keeping only the entities (e.g., the 3rd party domains or JS script families) found in all 3 measurement sets. This intersection constructs a synthetic score set, where each datapoint is computed as the median of corresponding bias scores from the 3 measurement sets. Intersecting the measurement sets also provides insight into how consistent the bias scores are across crawl repetitions. For each median bias score recorded in the synthetic score set, we keep the count of distinct bias scores from which the median was picked, a value in the range [1, n] where n is the number of measurement sets (n = 3 in our experiment). Given a mean score-count Cm, a consistency score Cs can be computed, ranging from 0 (total inconsistency) to 1 (total consistency).
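This computation can be sketched as follows, assuming the linear mapping Cs = 1 − (Cm − 1)/(n − 1), which sends a mean distinct-score count of 1 (every repetition agreed) to 1 and a count of n (all repetitions disagreed) to 0, matching the stated endpoints:

```python
from statistics import median

def synthesize(score_sets):
    """Given one entity's bias scores from the n crawl repetitions,
    return (median score, count of distinct scores behind it)."""
    return median(score_sets), len(set(score_sets))

def consistency_score(mean_count, n=3):
    """Map the mean distinct-score count Cm in [1, n] onto [0, 1]."""
    return 1 - (mean_count - 1) / (n - 1)
```

The helper names are ours; the synthetic score set is built by applying `synthesize` to every entity surviving the 3-way intersection, then averaging the distinct-score counts to obtain Cm.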
Cs = 1 − (Cm − 1) / (n − 1)

4.2.7.3 Summary
We quantify magnitude and consistency scores of bias for identical datapoints across endpoints. Visualizations of these metrics are introduced and explained in Section 4.3.2.
Table 4.1 Some “refusenik” sites always fail navigation from a single configuration but not its complements

VP/BC          # of Refuseniks
Cloud          72
Naive          69
Stealth        30
Residential    11
University      2
4.3 Results
We analyze the data collected via the experiments described in Section 4.2 to assess the impact of crawler realism on simple, quantifiable metrics such as volume of HTTP requests to 3rd party domains. We apply our quantified measurement bias methodology (Section 4.2.7) to progressively finer-grained breakdowns of the data collected in our crawls: first using all 3rd party HTTP requests, then such requests flagged by ad and tracker filter lists, then flagged requests divided by Same Origin Policy (SOP) isolation context, and finally considering some of the content loaded itself (i.e., JS script bodies). The results show a significant number of 3rd party domains and JS script families exhibiting consistently mismatched results across VP/BC, implying lack of crawler realism can significantly bias crawl results.
4.3.1 Refusenik Sites
We identified a small but notable collection of “refuseniks”: sites that always failed to load from a particular vantage point (VP) or browser configuration (BC) but which never failed to load from complementary configurations (naive vs. stealth, for instance). The total number and category of these sites is provided in Table 4.1. The largest categories for which service was refused (cloud for VP, naive for BC) are intuitive, conforming to expectations that some sites aggressively block probable bots or crawlers altogether. The residential and university share is small enough to be within the realm of accident, but the number of stealth-refusing sites is puzzling. One possible explanation is that since each visit involves 2 parallel requests (one naive, one stealth) from each endpoint IP, with high likelihood that the stealthy request will be slightly later than the naive request (as headless browsers have lower startup overhead), the stealth requests are more likely to run afoul of over-aggressive per-IP rate limiting. We do not investigate further, as the number of refuseniks is too low to impact our other measurements.
4.3.2 Volume Biases in HTTP Traffic
Defining and Visualizing Request Bias. The volume of HTTP requests sent during a page visit, broken down by the target domain (specifically the eTLD+1, or the public DNS suffix plus one
Figure 4.2 Distributions of cross-VP request volume bias by 3rd-party domains (stealth BC only); nearly twice as many domains consistently favor residential VP over the cloud VP as vice versa (3.5% > 1.8%). [Plot omitted: log-scale domain counts vs. bias score for curves R/C, U/C, R/U; legend: R=C 94.8% / 97.0% (0.96), U=C 95.5% / 97.5% (0.96), R=U 97.6% / 98.7% (0.96).]
additional label), is a useful metric for quickly gauging both the richness of a page’s content and the potential advertising/tracking privacy footprint of a visit to that page. When applied to the volume of HTTP requests sent to 3rd party domains during page visits, the measurement bias methodology described in Section 4.2.7 identifies domains which (1) are contacted on each of our 3 experiments and which (2) serve significantly different levels of traffic to different VP/BC crawlers (i.e., are biased for/against a given configuration). Bias visualizations such as Figure 4.2 plot the distribution of bias scores along with, for each curve, the total percentages of eTLD+1 domains having bias scores < 0, = 0, and > 0, the percentage of total HTTP requests associated with that set of contexts, and the corresponding bias consistency scores. Logarithmic scale is used on the Y axis to keep the curves meaningful, as the central 0 column (i.e., the non-biased contexts) typically overpowers the tails containing the significantly biased entities. E.g., in Figure 4.2, the “R/C” curve plots bias scores for request volume per domain compared between residential (R) and cloud (C) VPs. 3.5% of the consistently-present domains exhibited bias scores > 0 (i.e., pro-residential) and 1.8% bias scores < 0 (i.e., pro-cloud). The pro-residential domains account for 1.7% of all requests observed, while their pro-cloud counterparts account for 1.4% of all requests. The pro-residential domain set scores noticeably higher in consistency (0.77 vs. 0.59).
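The per-curve legend percentages can be derived directly from a curve's set of per-domain bias scores; a small sketch (function and key names are ours):

```python
def legend_stats(scores):
    """Percentages of entities with bias score < 0, == 0, and > 0,
    as reported in each curve's legend (e.g., 3.5% pro-residential
    vs. 1.8% pro-cloud for the R/C curve in Figure 4.2)."""
    n = len(scores)
    neg = sum(s < 0 for s in scores)
    pos = sum(s > 0 for s in scores)
    return {
        "pct_negative": 100 * neg / n,
        "pct_zero": 100 * (n - neg - pos) / n,
        "pct_positive": 100 * pos / n,
    }
```

The zero bucket dominates in practice, which is why the plots use a logarithmic Y axis to keep the biased tails visible.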
Figure 4.3 Distribution of stealth-vs.-naive traffic volume bias scores for 3rd-party domains (residential VP only); more symmetric than its cross-VP counterparts. [Plot omitted: log-scale domain counts vs. bias score; legend: S=N 92.8% / 94.3% (0.97).]
4.3.2.1 Measuring Request Bias
We find measurement bias in the volume of all requests per target eTLD+1 across both VP and BC (Figures 4.2 & 4.3). Some of the pro-cloud biased domains probably serve ad content that geo-targets the cloud endpoint’s location. But the asymmetry of the anti-cloud bias (i.e., about twice as many domains show pro-residential bias as show pro-cloud bias) argues against geo-targeting, which ought to have roughly symmetric effect between two regions, as the defining factor. Cross-BC measurement bias has no obvious geo-targeting component, so the near parity in total domains showing pro-naive vs. pro-stealth bias is somewhat surprising. The opposing sides of the BC bias curves do show subtly asymmetric shape (which we note is strongest on the most realistic VP, residential): pro-stealth bias is concentrated among a smaller number of domains with higher bias scores, while pro-naive bias is concentrated among more domains with less pronounced bias scores. We find that domains showing significant measurement bias account for reasonable shares of overall request volume. The set of consistently-present (i.e., not ephemeral, or seen in only one sub-experiment) domains visualized in Figures 4.2 & 4.3 represents 92-93% of all domains encountered in any of our sub-experiments, and these consistent domains account for 98-99% of all requests recorded. These numbers demonstrate that our bias set intersection (Section 4.2.7) successfully captures the core of domains that account for the overwhelming majority of traffic while eliminating transient red herrings. The share of total requests associated with the biased domains for each curve (the second percentage listed for each curve) is consistently lower than the corresponding share
Figure 4.4 Distributions of cross-VP ad/tracker request volume bias by 3rd-party domains (stealth BC only); little change from global cross-VP distributions. [Plot omitted: curves R/C, U/C, R/U; legend: R=C 91.6% / 97.8% (0.95), U=C 92.7% / 98.0% (0.95), R=U 96.5% / 99.1% (0.94).]
of domains, but never so much so as to be trivial. These domains showing cross-VP and cross-BC traffic volume measurement bias are clearly not dominant, high-traffic providers, but in aggregate they account for non-trivial volumes of traffic, especially considering cross-BC bias. We believe bot-detection, whether to shield proprietary data from scrapers or to avoid useless/fraudulent advertising impressions, to be a significant source of the observed measurement bias, as predicted by our realism-by-proxy design argument (Section 4.2.1). Manual inspection of highly-Tranco-ranked domains present at both extremes of the VP/BC bias distribution curves generally supports this theory. The intersection of pro-stealth and pro-residential bias outlier domains ordered by Tranco rank revealed a number of high-profile brands and content providers within the top 25: usnews.com, accuweather.com, lego.com, lowes.com, dhl.com, hotels.com, expedia.com, and ti.com. Only a few similarly high-profile brands appear in the top-ranked 25 domains from the complementary intersection of pro-naive, pro-cloud bias outliers, e.g. amazon.de and audible.com. The notion that premium content providers, especially those serving dynamic and potentially proprietary price data, would show significant anti-bot bias is hardly surprising, nor is it of great interest to security and privacy researchers. A more useful question is whether known ad and tracking traffic exhibits similar biases.
Figure 4.5 Distribution of stealth-vs.-naive ad/tracker traffic volume bias scores for 3rd-party domains (residential VP only); BC bias is more common among these domains than the global population. [Plot omitted: log-scale domain counts vs. bias score; legend: S=N 81.3% / 91.0% (0.96).]
Table 4.2 Total HTTP requests by EasyList/EasyPrivacy match and frame context

Blocklists    All Requests    Main Frame    1st Party     3rd Party
Involved                                    Sub-frames    Sub-frames
-/-           23,970,789      21,051,929      819,194     2,043,325
-/EP           5,470,406       3,855,858      227,974     1,333,700
EL/-           5,032,267       2,085,177      647,012     2,197,150
EL/EP          1,142,609         581,767       71,877       475,750
4.3.2.2 Request Bias by Known Ad/Tracker Domains
Of more concern to security and privacy researchers, we find traffic volume measurement bias patterns more pronounced among requests flagged by one or both of the popular, community-maintained EasyList (EL) and EasyPrivacy (EP) filter lists. In our experiment roughly a third of all recorded requests matched blocking rules in one or both of these lists; see Table 4.2. Figures 4.4 & 4.5 illustrate the shifts in measurement bias distributions across VP (using stealth BC) and bias across BC (using residential VP). The maximum bias score values shrink, but the total share of biased domains increases to nearly 20% of those contacted in all crawls. Furthermore, the consistency of cross-BC bias distributions for flagged requests increases over that of the global cross-BC distributions (0.89 > 0.81 and 0.82 > 0.72). As when considering all requests, we find biased domains to account for a respectable share of
Figure 4.6 Distributions of stealth-vs.-naive ad/tracker traffic volume bias scores for 3rd-party domains (residential VP only) broken down by browser frame context; sub-frames show radically different (and less intuitive) BC bias distributions than main frames. [Plot omitted: three panels (Main Frame Only, 1st-Party Sub-frame Only, 3rd-Party Sub-frame Only); legends: S=N 82.7% / 82.0% (0.97), S=N 49.6% / 33.4% (0.88), S=N 69.0% / 82.3% (0.89).]
all EL/EP-flagged requests. The set of all consistently-present domains visualized in Figures 4.4 & 4.5 accounts for only 85-89% of all domains associated with any EL/EP requests, but in every case these consistent domains account for over 99% of all EL/EP requests. Once again, the filtering effect of intersecting results across multiple crawls eliminates significant chaff from the results. Domains showing high bias cross-VP saw both a relative domain share increase and a relative request volume share decrease. But domains showing high bias cross-BC showed a more intuitive correlation between request volume and domain share, with both increasing significantly over the all-requests distributions. Again, these domains clearly do not dominate traffic volume, but particularly in the case of BC bias outliers, they account for a non-trivial amount of requests flagged by our filter lists. When considering only likely advertising and tracker content, the puzzle of the BC curves’ near symmetry is amplified. It makes sense for advertisers or trackers to show pro-stealth bias as a result of detecting and avoiding an obviously automated browser. But the nearly equal share of domains showing apparent pro-naive bias is counter-intuitive. Some portion of this activity, especially that with low bias scores, is probably still related to defensive analytic behavior triggered by the presence of a headless client. But the presence of more extreme bias scores does not fit this explanation as well, given that such fingerprinting and reporting behavior ought to require significantly less request volume than serving typical ads or monitoring user activity over time. A partial explanation may be displacement, during naive crawls, of content providers that aggressively block bots by those which do not.
There is a statistically significant (though modest in effect size) difference in the distribution of Tranco ranks between the pro-stealth and pro-naive sets of biased domains across all VPs, with the median rank of the pro-stealth bias set being consistently higher (i.e., a lower number) than its pro-naive counterpart. E.g., considering only residential VP data, the median naive rank = 18,360.0, the median stealth rank = 12,198.5, and a Mann-Whitney one-tailed test finds a significant difference with P < 0.001⁴. Other VPs give slightly different values but show the same direction and scale of gap in median ranks, the same order of magnitude for
⁴ U = 427615.5, n1 = 801, n2 = 882
P, and near-identical effect size (0.60). The modest effect size is unsurprising given the significant overlap in distributions, but the bias is unmistakable and consistent. We note that the rank skew between stealth and naive biased-domain sets is likewise visible (and statistically significant to the same level, albeit with slightly smaller effect size) in the bias sets derived from all requests, as opposed to only those flagged by filter lists.
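For reference, the U statistic can be computed from scratch, and the common-language effect size U/(n1·n2) is one conventional choice consistent with the reported values (427615.5 / (801 × 882) ≈ 0.605). A minimal sketch with mean ranks for ties (function names are ours):

```python
def _ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def mann_whitney(xs, ys):
    """Return (U1, common-language effect size) for sample xs vs. ys."""
    r = _ranks(list(xs) + list(ys))
    r1 = sum(r[:len(xs)])            # rank sum of the first sample
    u1 = r1 - len(xs) * (len(xs) + 1) / 2
    return u1, u1 / (len(xs) * len(ys))
```

An effect size of 1.0 means every value in the first sample outranks every value in the second; 0.5 indicates no separation.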
4.3.2.3 Request Bias & Frame Context
Continuing to consider only requests tagged by filter lists for ad and tracker content, we find more dramatic and surprising shifts in measurement bias when breaking down comparisons by browser frame/security contexts⁵ and security origins. Here we consider only cross-BC measurement bias within the residential VP, as shown in Figure 4.6. Cross-VP measurement bias patterns are unchanged from previous analyses and are not discussed further. Note that the traffic share percentages discussed here are not restricted to a particular frame context but are computed globally for all requests tagged by our filter lists. Sub-frames, of both first and third party domain origin, account for only 22% of all requests but 44% of requests matched by EasyList or EasyPrivacy rules. Unsurprisingly given their overall traffic share (Table 4.2), main frame bias score distributions closely follow the overall distributions. Sub-frame BC bias scores, however, both grow in overall share and swing counter-intuitively to the pro-naive side, probably in part because the much smaller share of traffic being considered is more readily influenced by popular outliers. The pattern of lower Tranco ranks (higher popularity) found in the pro-stealth distributions vs. their pro-naive counterparts, present in both previous breakdowns, evaporates for sub-frames. Many factors contribute: the relatively small set of domains involved, the extreme mismatch in set size, and the fact that, on manual inspection, we found a fair number of extremely high-ranked domains had crept into the pro-naive bias set (e.g., 4 Google domains, including google.com, in the top 10 pro-naive outliers). The presence of “heavy-hitter” domains is a first in our outlier analysis so far, and is underscored by traffic share analysis. The set of domains consistent across all crawls for main frame EL/EP requests comprised 83-84% of all domains associated with any EL/EP request, much like the previous all-frames breakdown.
As seen above, though, the sub-frames are different beasts altogether. The total set of consistent domains considered here comprises only 9-11% of all EL/EP-associated domains, but these account for an impressive 89-90% of all EL/EP flagged requests (obviously, when considering all frames). That large share of traffic going to a small subset of domains is not surprising when considering the presence of dominant players like Google. What is surprising is how decisively skewed this bias distribution is to the pro-naive side. We suspected that the presence of Google and other top-tier players in this small pocket of bias might be related to CAPTCHA deployment, but a search of the
⁵ All frames are associated with a security origin (URL scheme, hostname, and port). It may match the origin of the main document (a 1st party frame) or not (a 3rd party frame), in which case the browser’s Same Origin Policy (SOP) will restrict its access to the main frame’s contents. 3rd party frames commonly host advertising and tracking content.
31,787 distinct stemmed URLs (consisting only of hostname eTLD+1 and path component, sans query string) within this context and BC yielded only a single hit on any obvious variant of the word “CAPTCHA,” in a single URL requested by a single site, once per visit. Of course, there is no reason to believe adversarial content will always advertise itself as such in domain names or URL paths, and it remains plausible that at least some of this pro-naive phenomenon is related to active adversarial response to suspected bots.
4.3.2.4 Summary
Around 5% of content-providing domains show significant measurement bias across VP, clearly favoring non-cloud endpoints. From our most realistic VP (residential), measurement bias across BC among HTTP traffic domains is more prominent, accounting for over 7% of domains and over 5% of total HTTP traffic volume. Request volume measurement bias becomes notably more pronounced among domains flagged by filter lists as serving ads and trackers, with nearly 20% of domains’ traffic strongly correlated to choice of BC.
4.3.3 Content-Level Biases in JavaScript
4.3.3.1 Biases in Scripts Loaded
We find that the loading and execution of JS script families shows measurement bias patterns comparable to domain bias in requests. We define a "script family" to be a set of JS scripts observed loading and executing by the VisibleV8 [Jue19] JS API tracing system that all share a common lexical hash. We compute lexical hashes by tokenizing all JS scripts using the industry standard Esprima JS parser and computing the SHA256 hash of the resulting sequence of token type names. Lexical hashes thus ignore variance in whitespace, comments, identifiers, and atomic values like number or string literals. We computed 258,236 total distinct lexical hashes from 1,517,281 total distinct scripts, not including 4,364 distinct scripts that failed tokenization because of syntactic irregularities (such scripts are excluded from lexical hash based analysis). To facilitate correlating script loading and HTTP traffic patterns, we consider only scripts loaded via URL (as opposed to eval or similar means). For reference, 87.9% of script families we observed were loaded at least once via a script URL (either the source of a