A Promising Direction for Web Tracking Countermeasures
Jason Bau, Jonathan Mayer, Hristo Paskov and John C. Mitchell
Stanford University, Stanford, CA
{jbau, jmayer, hpaskov, [email protected]

Abstract—Web tracking continues to pose a vexing policy problem. Surveys have repeatedly demonstrated substantial consumer demand for control mechanisms, and policymakers worldwide have pressed for a Do Not Track system that effectuates user preferences. At present, however, consumers are left in the lurch: existing control mechanisms and countermeasures have spotty effectiveness and are difficult to use.

We argue in this position paper that machine learning could enable tracking countermeasures that are effective and easy to use. Moreover, by distancing human expert judgments, machine learning approaches are both easy to maintain and palatable to browser vendors. We briefly explore some of the promise and challenges of machine learning for tracking countermeasures, and we close with preliminary results from a prototype implementation.

I. MOTIVATION

It is a transformative time for web privacy.

Just a decade ago, web tracking received scant scrutiny from researchers, policymakers, and consumers. There were few third-party services following users about the web, they were concentrated almost exclusively in the behavioral advertising market, and they relied solely on easy-to-block identifier cookies.¹ In the intervening years, tracking has boomed: there are hundreds of companies, dozens of business models, and myriad tracking technologies in use [1].

¹ Detailed definitions of "third party" and "tracking" are hotly contested. For purposes of this position paper, we mean simply unaffiliated websites and the collection of a user's browsing history.

Web users are concerned: surveys have consistently demonstrated that consumers want control over web tracking [2], [3], [4], [5], [6]. Not coincidentally, governments have taken steps to intervene. Beginning in 2010, policymakers in the United States [7], [8], Canada [9], and the European Union [10], [11], [12] issued vocal calls for new technical approaches to consumer choice about web tracking. The policy discourse has thus far largely centered on Do Not Track [13], an initiative that combines preference-signaling mechanisms with a substantive compliance policy and out-of-band regulatory enforcement.

As of early 2013, however, Do Not Track remains a nascent proposal. Companies and trade groups involved in web tracking have vigorously opposed the concept, and progress towards standardization in the World Wide Web Consortium has essentially stalled [14], [15].

While Do Not Track remains pending, there has been a renewal of interest in technical countermeasures against web tracking. Mozilla [16] and Microsoft [17], [18] have already shipped new anti-tracking features, and Apple has indicated openness to improving its longstanding countermeasures [19]. The chairman of the United States Federal Trade Commission recently departed from the agency's longstanding focus on Do Not Track and signaled that it views technical countermeasures as a promising direction for consumer control [14].

Moreover, even if Do Not Track is widely adopted, some websites may ignore or deceptively misinterpret the signal. Rigorous technical countermeasures would provide a necessary backstop for Do Not Track.

But there's a problem: existing technical countermeasures are grossly inadequate. Web browsers and extensions have long offered a range of web tracking countermeasures, representing a variety of approaches to tracking detection, mitigation, and accommodation. The most effective [1] solutions—manually-curated block lists of tracking content—are difficult to maintain and difficult to use. Meanwhile, the most usable [20] countermeasures—those built into web browsers—are among the least effective. Though several browser vendors seek to provide user control over web tracking, they are loath to individually identify companies and invite the business and legal scrutiny of picking and choosing among marketplace participants. Browser vendors consequently rely on inaccurate heuristics to identify trackers (e.g. comparing domains). Owing to the prevalence of false positives, browser countermeasures are limited to small interventions (e.g. blocking cookies).

Machine learning charts a promising new course. It could provide the accuracy of a curated block list—allowing for rigorous privacy measures (e.g. HTTP blocking). And it distances expert human judgments from identification, reducing costs and potential risks and allowing usable implementations by browser vendors. We believe a machine learning approach that identifies tracking websites is now viable, and in Section II we sketch possible architectures and associated challenges. Section III closes with preliminary results from a prototype implementation.

II. MACHINE LEARNING

In this section, we briefly review the necessary design components for a machine-learning system that identifies web trackers. We discuss possible architectures for data collection and what data should be collected. We then turn to sources of training data and, finally, machine learning output.

A. Data Collection Architecture

There is a continuum of possible architectures for continually collecting the web measurements needed to identify trackers. At one end, a centralized crawler could periodically visit websites, much like modern approaches to web search. At the other end, a decentralized reporting system could collect information from users' browsers. A crawler has the advantage of easy implementation and the disadvantages of possibly unrealistic browsing patterns and special-case behavior by websites. Crowdsourced data provides the benefit of closely modeling user experience, with drawbacks of privacy risks and potential gaming.

The two poles are hardly exclusive; a tracking detection system could make use of both crawled and crowdsourced data in varying degrees. We believe a centralized approach is most feasible in the near term, with gradual and careful expansion into decentralized collection.

B. What Data to Collect

At a high level of generality, there are two categories of data to collect.

First, information that reflects the relationships among web content. Effective tracking requires observing a user's behavior across multiple websites, and many trackers collaborate to share data (e.g. advertising exchanges). The Document Object Model (DOM) hierarchy provides a starting point, reflecting the context that web content is loaded into. The DOM's value is limited, however, in that its purpose is to reflect layout and security properties—not to provide privacy-relevant attribution of how content came to be loaded. A machine learning system would have to account for script embeds, redirects, and other dynamic mechanisms that introduce content into a webpage.

Second, information that reflects the properties of particular web content. Type, size, cache expiry, and many more features vary between trackers and non-trackers. Similarly, usage of various browser features differs by a website's line of business.

[…] that it does not set cookies. Conversely, consider the case of a website that uses outsourced—but carefully siloed—analytics from a third-party domain. The analytics content may appear to be highly intrusive, setting unique cookies and inquiring about browser features. Without knowing that the analytics service is directly embedded by first parties and never draws in other third parties (Figure 1), it might be erroneously classified as tracking.

Some trackers may attempt to evade detection by tuning their behavior. Efforts at masking tracking are fundamentally limited, however, by the need to provide content on many websites, collaborate with other trackers, and invoke particular browser functionality. A technological cat-and-mouse game between new tracking techniques and the detection of those techniques would also seem to favor detection, owing to detection's much smaller vested infrastructure. Furthermore, as we discuss in Section II-D, adjustments in practices to circumvent countermeasures may both be impractical and subject a business to media, business, and legal penalties.

Web browser instrumentation would be sufficient to capture both categories of data. The FourthParty platform [22] offers a first step, though modifications are necessary to properly attribute content that is inserted by a script.
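To make these feature categories concrete, the following minimal sketch aggregates per-request measurements into per-domain feature vectors. It is illustrative only: the log fields (first-party site, third-party domain, whether a cookie was set, content size) are assumed placeholders for the richer output of an instrumented browser such as FourthParty.

    from collections import defaultdict

    def extract_features(requests):
        """Aggregate request logs into one feature vector per third-party domain.

        Each entry of `requests` is assumed to be a dict with keys:
          'site'        -- first-party domain the user visited
          'domain'      -- third-party domain serving the content
          'sets_cookie' -- whether the response set a cookie
          'size'        -- response size in bytes
        """
        sites = defaultdict(set)         # domain -> distinct first parties seen
        cookie_hits = defaultdict(int)   # domain -> responses setting cookies
        total = defaultdict(int)         # domain -> total requests observed
        bytes_served = defaultdict(int)  # domain -> total bytes served

        for r in requests:
            d = r["domain"]
            sites[d].add(r["site"])
            cookie_hits[d] += int(bool(r["sets_cookie"]))
            total[d] += 1
            bytes_served[d] += r["size"]

        return {
            d: [len(sites[d]),                # breadth across first parties
                cookie_hits[d] / total[d],    # cookie-setting rate
                bytes_served[d] / total[d]]   # mean content size
            for d in total
        }

A production system would extend this sketch with the relationship features discussed above, such as attribution of script-inserted content and redirect chains.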
C. Labeled Training Sets

Curated block lists are one possible source of training data. Lists vary in quality; some are very comprehensive [1]. Industry-provided lists are another potential source, and possibly less objectionable for browser vendors. The online advertising industry's trade group, for example, maintains a list of member company domains.

D. Machine Learning Output

The machine-learning system must be able to provide a real-time judgment about whether particular web content is tracking the user. These determinations might be entirely precomputed and static, or they might be dynamically generated by a classifier within the browser. In our view, a static list is preferable in the near term owing to simpler deployment and compatibility with existing block list mechanisms.
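As a minimal sketch of how training data and output could fit together (for illustration only; Section III describes our actual prototype), one could label the feature vectors above against a curated block list and precompute a static, domain-level block list from the classifier's scores. The use of scikit-learn's logistic regression here is an assumption of convenience, not a recommendation; any classifier producing calibrated scores would serve.

    from sklearn.linear_model import LogisticRegression

    def emit_static_list(features, curated_blocklist, threshold=0.5):
        """Train on block-list labels; emit a precomputed domain block list.

        `features`          -- dict of domain -> feature vector
        `curated_blocklist` -- set of domains from a curated list (positives)
        """
        domains = sorted(features)
        X = [features[d] for d in domains]
        y = [int(d in curated_blocklist) for d in domains]

        clf = LogisticRegression().fit(X, y)
        scores = clf.predict_proba(X)[:, 1]  # estimated P(tracker) per domain

        # A plain list of domains keeps the output compatible with the
        # block list mechanisms already deployed in browsers and extensions.
        return [d for d, s in zip(domains, scores) if s >= threshold]

The same scores could instead drive a dynamic classifier inside the browser; a precomputed list simply trades freshness for deployability.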
We believe domain-level granularity would be sufficient² in these static lists, for the following reasons.

First, tracking often occurs on a dedicated domain. Exclusively third-party businesses usually only have the one domain. Firms that operate both first-party and third-party web services tend to separate those services by domain, reflecting […]