Tracing Information Flows Between Ad Exchanges Using Retargeted Ads

Muhammad Ahmad Bashir, Sajjad Arshad, William Robertson, and Christo Wilson, Northeastern University

https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/bashir

This paper is included in the Proceedings of the 25th USENIX Security Symposium, August 10–12, 2016, Austin, TX. ISBN 978-1-931971-32-4

Open access to the Proceedings of the 25th USENIX Security Symposium is sponsored by USENIX.

Abstract

Numerous surveys have shown that Web users are concerned about the loss of privacy associated with online tracking. Alarmingly, these surveys also reveal that people are unaware of the amount of data sharing that occurs between ad exchanges, and thus underestimate the privacy risks associated with online tracking.

In reality, the modern ad ecosystem is fueled by a flow of user data between trackers and ad exchanges. Although recent work has shown that ad exchanges routinely perform cookie matching with other exchanges, these studies are based on brittle heuristics that cannot detect all forms of information sharing, especially under adversarial conditions.

In this study, we develop a methodology that is able to detect client- and server-side flows of information between arbitrary ad exchanges. Our key insight is to leverage retargeted ads as a tool for identifying information flows. Intuitively, our methodology works because it relies on the semantics of how exchanges serve ads, rather than focusing on specific cookie matching mechanisms. Using crawled data on 35,448 ad impressions, we show that our methodology can successfully categorize four different kinds of information sharing behavior between ad exchanges, including cases where existing heuristic methods fail.

We conclude with a discussion of how our findings and methodologies can be leveraged to give users more control over what kind of ads they see and how their information is shared between ad exchanges.

1 Introduction

People have complicated feelings with respect to online behavioral advertising. While surveys have shown that some users prefer relevant, targeted ads to random, untargeted ads [60, 14], this preference has caveats. For example, users are uncomfortable with ads that are targeted based on sensitive Personally Identifiable Information (PII) [44, 4] or specific kinds of browsing history (e.g., visiting medical websites) [41]. Furthermore, some users are universally opposed to online tracking, regardless of circumstance [46, 60, 14].

One particular concern held by users is their digital footprint [33, 65, 58], i.e., which first- and third-parties are able to track their browsing history. Large-scale web crawls have repeatedly shown that trackers are ubiquitous [24, 19], with DoubleClick alone being able to observe visitors on 40% of websites in the Alexa Top-100K [11]. These results paint a picture of a balkanized web, where trackers divide up the space and compete for the ability to collect data and serve targeted ads.

However, this picture of the privacy landscape is at odds with the current reality of the ad ecosystem. Specifically, ad exchanges routinely perform cookie matching with each other, to synchronize unique identifiers and share user data [2, 54, 21]. Cookie matching is a precondition for ad exchanges to participate in Real Time Bidding (RTB) auctions, which have become the dominant mechanism for buying and selling advertising inventory from publishers. Problematically, Hoofnagle et al. report that users naïvely believe that privacy policies prevent companies from sharing user data with third-parties, which is not always the case [32].

Despite user concerns about their digital footprint, we currently lack the tools to fully understand how information is being shared between ad exchanges. Prior empirical work on cookie matching has relied on heuristics that look for specific strings in HTTP messages to identify flows between ad networks [2, 54, 21]. However, these heuristics are brittle in the face of obfuscation: for example, DoubleClick cryptographically hashes its cookies before sending them to other ad networks [1]. More fundamentally, analysis of client-side HTTP messages is insufficient to detect server-side information flows between ad networks.

In this study, we develop a methodology that is able to detect client- and server-side flows of information between arbitrary ad exchanges that serve retargeted ads. Retargeted ads are the most specific form of behavioral ads, where a user is targeted with ads related to the exact products she has previously browsed (see § 2.2 for definition). For example, Bob visits nike.com and browses for running shoes but decides not to purchase them. Bob later visits cnn.com and sees an ad for the exact same running shoes from Nike.

Our key insight is to leverage retargeted ads as a mechanism for identifying information flows. This is possible because the strict conditions that must be met for a retarget to be served allow us to infer the precise flow of tracking information that facilitated the serving of the ad. Intuitively, our methodology works because it relies on the semantics of how exchanges serve ads, rather than focusing on specific cookie matching mechanisms.

To demonstrate the efficacy of our methodology, we conduct extensive experiments on real data. We train 90 personas by visiting popular e-commerce sites, and then crawl major publishers to gather retargeted ads [9, 12]. Our crawler is an instrumented version of Chromium that records the inclusion chain for every resource it encounters [5], including 35,448 chains associated with 5,102 unique retargeted ads. We use carefully designed pattern matching rules to categorize each of these chains, which reveal 1) the pair of ad exchanges that shared information in order to serve the retarget, and 2) the mechanism used to share the data (e.g., cookie matching).

In summary, we make the following contributions:

• We present a novel methodology for identifying information flows between ad networks that is content- and ad exchange-agnostic. Our methodology allows us to identify four different categories of information sharing between ad exchanges, of which cookie matching is one.

• Using crawled data, we show that the heuristic methods used by prior work to analyze cookie matching are unable to identify 31% of ad exchange pairs that share data.

• Although it is known that Google’s privacy policy allows it to share data between its services [26], we provide the first empirical evidence that Google uses this capability to serve retargeted ads.

• Using graph analysis, we show how our data can be used to automatically infer the roles played by different ad exchanges (e.g., Supply-Side and Demand-Side Platforms). These results expand upon prior work [25] and facilitate a more nuanced understanding of the online ad ecosystem.

Ultimately, we view our methodology as a stepping stone towards more balanced privacy protection tools for users that also enable publishers to earn revenue. Surveys have shown that users are not necessarily opposed to online ads: some users are just opposed to tracking [46, 60, 14], while others simply desire more nuanced control over their digital footprint [4, 41]. However, existing tools (e.g., browser extensions) cannot distinguish between targeted and untargeted ads, thus leaving users with no alternative but to block all ads. Conversely, our results open up the possibility of building in-browser tools that just block cookie matching, which will effectively prevent most targeted ads from RTB auctions, while still allowing untargeted ads to be served.

Open Source. As a service to the community, we have open sourced all the data from this project. This includes over 7K labeled behaviorally targeted and retargeted ads, as well as the inclusion chains and full HTTP traces associated with these ads. The data is available at:

http://personalization.ccs.neu.edu/

2 Background and Definitions

In this section, we set the stage for our study by providing background about the online display ad industry, as well as defining key terminology. We focus on techniques and terms related to Real Time Bidding and retargeted ads, since they are the focus of our study.

2.1 Online Display Advertising

Online display advertising is fundamentally a matching problem. On one side are publishers (e.g., news websites, blogs, etc.) who produce content, and earn revenue by displaying ads to users. On the other side are advertisers who want to display ads to particular users (e.g., based on demographics or market segments). Unfortunately, the online user population is fragmented across hundreds of thousands of publishers, making it difficult for advertisers to reach desired customers.

Ad networks bridge this gap by aggregating inventory from publishers (i.e., space for displaying ads) and filling it with ads from advertisers. Ad networks make it possible for advertisers to reach a broad swath of users, while also guaranteeing a steady stream of revenue for publishers. Inventory is typically sold using a Cost per Mille (CPM) model, where advertisers purchase blocks of 1000 impressions (views of ads), or a Cost per Click (CPC) model, where the advertiser pays a small fee each time their ad is clicked by a user.

Figure 1: The display advertising ecosystem. (Diagram labels: User, Publisher, SSP, Ad Exchange, DSPs, Advertisers; 1) Impression, 2) RTB, 3) Ad; Ads & $$$.)

Ad Exchanges and Auctions. Over time, ad networks are being supplanted by ad exchanges that rely on an auction-based model. In Real Time Bidding (RTB) exchanges, advertisers bid on individual impressions, in real-time; the winner of the auction is permitted to serve