Tracing Information Flows Between Ad Exchanges Using Retargeted Ads Muhammad Ahmad Bashir, Sajjad Arshad, William Robertson, and Christo Wilson, Northeastern University https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/bashir
This paper is included in the Proceedings of the 25th USENIX Security Symposium August 10–12, 2016 • Austin, TX ISBN 978-1-931971-32-4
Open access to the Proceedings of the 25th USENIX Security Symposium is sponsored by USENIX
Tracing Information Flows Between Ad Exchanges Using Retargeted Ads
Muhammad Ahmad Bashir, Northeastern University, ahmad@ccs.neu.edu
Sajjad Arshad, Northeastern University, arshad@ccs.neu.edu
William Robertson, Northeastern University, wkr@ccs.neu.edu
Christo Wilson, Northeastern University, cbw@ccs.neu.edu
Abstract

Numerous surveys have shown that Web users are concerned about the loss of privacy associated with online tracking. Alarmingly, these surveys also reveal that people are unaware of the amount of data sharing that occurs between ad exchanges, and thus underestimate the privacy risks associated with online tracking.

In reality, the modern ad ecosystem is fueled by a flow of user data between trackers and ad exchanges. Although recent work has shown that ad exchanges routinely perform cookie matching with other exchanges, these studies are based on brittle heuristics that cannot detect all forms of information sharing, especially under adversarial conditions.

In this study, we develop a methodology that is able to detect client- and server-side flows of information between arbitrary ad exchanges. Our key insight is to leverage retargeted ads as a tool for identifying information flows. Intuitively, our methodology works because it relies on the semantics of how exchanges serve ads, rather than focusing on specific cookie matching mechanisms. Using crawled data on 35,448 ad impressions, we show that our methodology can successfully categorize four different kinds of information sharing behavior between ad exchanges, including cases where existing heuristic methods fail.

We conclude with a discussion of how our findings and methodologies can be leveraged to give users more control over what kind of ads they see and how their information is shared between ad exchanges.

1 Introduction

People have complicated feelings with respect to online behavioral advertising. While surveys have shown that some users prefer relevant, targeted ads to random, untargeted ads [60, 14], this preference has caveats. For example, users are uncomfortable with ads that are targeted based on sensitive Personally Identifiable Information (PII) [44, 4] or specific kinds of browsing history (e.g., visiting medical websites) [41]. Furthermore, some users are universally opposed to online tracking, regardless of circumstance [46, 60, 14].

One particular concern held by users is their "digital footprint" [33, 65, 58], i.e., which first- and third-parties are able to track their browsing history. Large-scale web crawls have repeatedly shown that trackers are ubiquitous [24, 19], with DoubleClick alone being able to observe visitors on 40% of websites in the Alexa Top-100K [11]. These results paint a picture of a balkanized web, where trackers divide up the space and compete for the ability to collect data and serve targeted ads.

However, this picture of the privacy landscape is at odds with the current reality of the ad ecosystem. Specifically, ad exchanges routinely perform cookie matching with each other, to synchronize unique identifiers and share user data [2, 54, 21]. Cookie matching is a precondition for ad exchanges to participate in Real Time Bidding (RTB) auctions, which have become the dominant mechanism for buying and selling advertising inventory from publishers. Problematically, Hoofnagle et al. report that users naïvely believe that privacy policies prevent companies from sharing user data with third-parties, which is not always the case [32].

Despite user concerns about their digital footprint, we currently lack the tools to fully understand how information is being shared between ad exchanges. Prior empirical work on cookie matching has relied on heuristics that look for specific strings in HTTP messages to identify flows between ad networks [2, 54, 21]. However, these heuristics are brittle in the face of obfuscation: for example, DoubleClick cryptographically hashes their cookies before sending them to other ad networks [1]. More fundamentally, analysis of client side HTTP messages is insufficient to detect server side information flows between ad networks.
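To make the brittleness of string-matching heuristics concrete, the following minimal Python sketch checks whether a raw cookie value appears in the query string of a redirect URL. It is an illustration only, not the heuristic used in any of the cited studies: the function name, the d.example host, the cookie value, and the choice of SHA-256 as the hash are all assumptions made for this example.

```python
import hashlib
from urllib.parse import parse_qs, urlparse


def raw_cookie_in_url(cookie_value: str, url: str) -> bool:
    """String-matching heuristic: does the raw cookie value appear in any query parameter?"""
    params = parse_qs(urlparse(url).query)
    return any(cookie_value in value for values in params.values() for value in values)


# Hypothetical cookie ID that exchange s holds for one user.
cookie_id = "123456"

# Case 1: s forwards its cookie to partner d in the clear.
plain_url = f"http://d.example/trackpixel?id={cookie_id}"

# Case 2: s forwards only a digest of the cookie (SHA-256 is an assumption here).
digest = hashlib.sha256(cookie_id.encode()).hexdigest()
hashed_url = f"http://d.example/trackpixel?id={digest}"

plain_detected = raw_cookie_in_url(cookie_id, plain_url)
hashed_detected = raw_cookie_in_url(cookie_id, hashed_url)
print(plain_detected, hashed_detected)
```

Under these assumptions the heuristic flags the cleartext redirect but misses the hashed one, even though both requests convey the same identifier to the partner. A flow that happens entirely between servers would produce no client-side URL to inspect at all.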
In this study, we develop a methodology that is able to detect client- and server-side flows of information between arbitrary ad exchanges that serve retargeted ads. Retargeted ads are the most specific form of behavioral ads, where a user is targeted with ads related to the exact products she has previously browsed (see § 2.2 for definition). For example, Bob visits nike.com and browses for running shoes but decides not to purchase them. Bob later visits cnn.com and sees an ad for the exact same running shoes from Nike.

Our key insight is to leverage retargeted ads as a mechanism for identifying information flows. This is possible because the strict conditions that must be met for a retarget to be served allow us to infer the precise flow of tracking information that facilitated the serving of the ad. Intuitively, our methodology works because it relies on the semantics of how exchanges serve ads, rather than focusing on specific cookie matching mechanisms.

To demonstrate the efficacy of our methodology, we conduct extensive experiments on real data. We train 90 personas by visiting popular e-commerce sites, and then crawl major publishers to gather retargeted ads [9, 12]. Our crawler is an instrumented version of Chromium that records the inclusion chain for every resource it encounters [5], including 35,448 chains associated with 5,102 unique retargeted ads. We use carefully designed pattern matching rules to categorize each of these chains, which reveal 1) the pair of ad exchanges that shared information in order to serve the retarget, and 2) the mechanism used to share the data (e.g., cookie matching).

In summary, we make the following contributions:

• We present a novel methodology for identifying information flows between ad networks that is content- and ad exchange-agnostic. Our methodology allows us to identify four different categories of information sharing between ad exchanges, of which cookie matching is one.

• Using crawled data, we show that the heuristic methods used by prior work to analyze cookie matching are unable to identify 31% of ad exchange pairs that share data.

• Although it is known that Google's privacy policy allows it to share data between its services [26], we provide the first empirical evidence that Google uses this capability to serve retargeted ads.

• Using graph analysis, we show how our data can be used to automatically infer the roles played by different ad exchanges (e.g., Supply-Side and Demand-Side Platforms). These results expand upon prior work [25] and facilitate a more nuanced understanding of the online ad ecosystem.

Ultimately, we view our methodology as a stepping stone towards more balanced privacy protection tools for users that also enable publishers to earn revenue. Surveys have shown that users are not necessarily opposed to online ads: some users are just opposed to tracking [46, 60, 14], while others simply desire more nuanced control over their digital footprint [4, 41]. However, existing tools (e.g., browser extensions) cannot distinguish between targeted and untargeted ads, thus leaving users with no alternative but to block all ads. Conversely, our results open up the possibility of building in-browser tools that just block cookie matching, which will effectively prevent most targeted ads from RTB auctions, while still allowing untargeted ads to be served.

Open Source. As a service to the community, we have open sourced all the data from this project. This includes over 7K labeled behaviorally targeted and retargeted ads, as well as the inclusion chains and full HTTP traces associated with these ads. The data is available at:

http://personalization.ccs.neu.edu/

2 Background and Definitions

In this section, we set the stage for our study by providing background about the online display ad industry, as well as defining key terminology. We focus on techniques and terms related to Real Time Bidding and retargeted ads, since they are the focus of our study.

2.1 Online Display Advertising

Online display advertising is fundamentally a matching problem. On one side are publishers (e.g., news websites, blogs, etc.) who produce content, and earn revenue by displaying ads to users. On the other side are advertisers who want to display ads to particular users (e.g., based on demographics or market segments). Unfortunately, the online user population is fragmented across hundreds of thousands of publishers, making it difficult for advertisers to reach desired customers.

Ad networks bridge this gap by aggregating inventory from publishers (i.e., space for displaying ads) and filling it with ads from advertisers. Ad networks make it possible for advertisers to reach a broad swath of users, while also guaranteeing a steady stream of revenue for publishers. Inventory is typically sold using a Cost per Mille (CPM) model, where advertisers purchase blocks of 1000 impressions (views of ads), or a Cost per Click (CPC) model, where the advertiser pays a small fee each time their ad is clicked by a user.

Ad Exchanges and Auctions. Over time, ad networks are being supplanted by ad exchanges that rely on an auction-based model. In Real-time Bidding (RTB) exchanges, advertisers bid on individual impressions, in real-time; the winner of the auction is permitted to serve
an ad to the user. Google's DoubleClick is the largest ad exchange, and it supports RTB.

As shown in Figure 1, there is a distinction between Supply-side Platforms (SSPs) and Demand-side Platforms (DSPs) with respect to ad auctions. SSPs work with publishers to manage their relationships with multiple ad exchanges, typically to maximize revenue. For example, OpenX is an SSP. In contrast, DSPs work with advertisers to assess the value of each impression and optimize bid prices. MediaMath is an example of a DSP. To make matters more complicated, many companies offer products that cross categories; for example, Rubicon Project offers SSP, ad exchange, and DSP products. We direct interested readers to [45] for more discussion of the modern online advertising ecosystem.

[Figure 1 diagram. Nodes, left to right: User, Publisher, SSP, Ad Exchange, DSPs, Advertisers. Labels: 1) Impression, 2) RTB, 3) Ad; Ads & $$$.]

Figure 1: The display advertising ecosystem. Impressions and tracking data flow left-to-right, while revenue and ads flow right-to-left.

2.2 Targeted Advertising

Initially, the online display ad industry focused on generic brand ads (e.g., "Enjoy Coca-Cola") or contextual ads (e.g., an ad for Microsoft on StackOverflow). However, the industry quickly evolved towards behavioral targeted ads that are served to specific users based on their browsing history, interests, and demographics.

Tracking. To serve targeted ads, ad exchanges and advertisers must collect data about online users by tracking their actions. Publishers embed JavaScript or invisible tracking pixels that are hosted by tracking companies into their web pages, thus any user who visits the publisher also receives third-party cookies from the tracker (we discuss other tracking mechanisms in § 3). Numerous studies have shown that trackers are pervasive across the Web [38, 36, 55, 11], which allows advertisers to collect users' browsing history. All major ad exchanges, like DoubleClick and Rubicon, perform user tracking, but there are also companies like BlueKai that just specialize in tracking.

Cookie Matching. During an RTB ad auction, DSPs submit bids on an impression. The amount that a DSP bids on a given impression is intrinsically linked to the amount of information they have about that user. For example, a DSP is unlikely to bid highly for user u whom they have never observed before, whereas a DSP may bid heavily for user v whom they have recently observed browsing high-value websites (e.g., the baby site TheBump.com).

However, the Same Origin Policy (SOP) hinders the ability of DSPs to identify users in ad auctions. As shown in Figure 1, requests are first sent to an SSP which forwards the impression to an exchange (or holds the auction itself). At this point, the SSP's cookies are known, but not the DSPs'. This leads to a catch-22 situation: a DSP cannot read its cookies until it contacts the user, but it cannot contact the user without first bidding and winning the auction.

To circumvent SOP restrictions, ad exchanges and advertisers engage in cookie matching (sometimes called cookie syncing). Cookie matching is illustrated in Figure 2: the user's browser first contacts ad exchange s.com, which returns an HTTP redirect to its partner d.com. s reads its own cookie, and includes it as a parameter in the redirect to d. d now has a mapping from its cookie to s's. In the future, if d participates in an auction held by s, it will be able to identify matched users using s's cookie. Note that some ad exchanges (including DoubleClick) send cryptographically hashed cookies to their partners, which prevents the ad network's true cookies from leaking to third-parties.

    1) User to s.com:  GET /pixel.jpg HTTP/1.1
                       Cookie: id=123456
    2) s.com to User:  HTTP/1.1 302 Found
                       Location: d.com/trackpixel?id=123456
    3) User to d.com:  GET /trackpixel?id=123456 HTTP/1.1
                       Cookie: id=ABCDEF
    4) d.com to User:  HTTP/1.1 200 OK

Figure 2: SSP s matches their cookie to DSP d using an HTTP redirect.

Retargeted Ads. In this study, we focus on retargeted ads, which are the most specific type of targeted display ads. Two conditions must be met for a DSP to serve a retargeted ad to a user u: 1) the DSP must know that u browsed a specific product on a specific e-commerce site, and 2) the DSP must be able to uniquely identify u during an auction. If these conditions are met, the DSP can serve u a highly personalized ad reminding them to purchase the product from the retailer. Cookie
matching is crucial for ad retargeting, since it enables DSPs to meet requirement (2).

3 Related Work

Next, we briefly survey related work on online advertising. We begin by looking at more general studies of the advertising and tracking ecosystem, and conclude with a more focused examination of studies on cookie matching and retargeting. Although existing studies on cookie matching demonstrate that this practice is widespread and that the privacy implications are alarming, these works have significant methodological shortcomings that motivate us to develop new techniques in this work.

3.1 Measuring the Ad Ecosystem

Numerous studies have measured and broadly characterized the online advertising ecosystem. Guha et al. were the first to systematically measure online ads, and their carefully controlled methodology has been very influential on subsequent studies (including this one) [27]. Barford et al. take a much broader look at the adscape to determine who the major ad networks are, what fraction of ads are targeted, and what user characteristics drive targeting [9]. Carrascosa et al. take an even finer grained look at targeted ads by training personas that embody specific interest profiles (e.g., cooking, sports), and find that advertisers routinely target users based on sensitive attributes (e.g., religion) [12]. Rodriguez et al. measure the ad ecosystem on mobile devices [61], while Zarras et al. analyzed malicious ad campaigns and the ad networks that serve them [66].

Note that none of these studies examine retargeted ads; Carrascosa et al. specifically excluded retargets from their analysis [12].

Trackers and Tracking Mechanisms. To facilitate ad targeting, participants in the ad ecosystem must extensively track users. Krishnamurthy et al. have been cataloging the spread of trackers and assessing the ensuing privacy implications for years [38, 36, 37]. Roesner et al. develop a comprehensive taxonomy of different tracking mechanisms that store state in users' browsers (e.g., cookies, HTML5 LocalStorage, and Flash LSOs), as well as strategies to block them [55]. Gill et al. use large web browsing traces to model the revenue earned by different trackers (or aggregators in their terminology), and found that revenues are skewed towards the largest trackers (primarily Google) [24]. More recently, Cahn et al. performed a broad survey of cookie characteristics across the Web, and found that less than 1% of trackers can aggregate information across 75% of websites in the Alexa Top-10K [11]. Falahrastegar et al. expand on these results by comparing trackers across geographic regions [20], while Li et al. show that most tracking cookies can be automatically detected using simple machine learning methods [42].

Note that none of these studies examine cookie matching, or information sharing between ad exchanges.

Although users can try to evade trackers by clearing their cookies or using private/incognito browsing modes, companies have fought back using techniques like Evercookies and fingerprinting. Evercookies store the tracker's state in many places within the browser (e.g., Flash LSOs, etags, etc.), thus facilitating regeneration of tracking identifiers even if users delete their cookies [34, 57, 6, 47]. Fingerprinting involves generating a unique ID for a user based on the characteristics of their browser [18, 48, 50], browsing history [53], and computer (e.g., the HTML5 canvas [49]). Several studies have found trackers in-the-wild that use fingerprinting techniques [3, 52, 35]; Nikiforakis et al. propose to stop fingerprinting by carefully and intentionally adding more entropy to users' browsers [51].

User Profiles. Several studies specifically focus on tracking data collected by Google, since their trackers are more pervasive than any others on the Web [24, 11]. Alarmingly, two studies have found that Google's Ad Preferences Manager, which is supposed to allow users to see and adjust how they are being targeted for ads, actually hides sensitive information from users [64, 16]. This finding is troubling given that several studies rely on data from the Ad Preferences Manager as their source of ground-truth [27, 13, 9]. To combat this lack of transparency, Lecuyer et al. have built systems that rely on controlled experiments and statistical analysis to infer the profiles that Google constructs about users [39, 40]. Castelluccia et al. go further by showing that adversaries can infer users' profiles by passively observing the targeted ads they are shown by Google [13].

3.2 Cookie Matching and Retargeting

Although ad exchanges have been transitioning to RTB auctions since the mid-2000s, only three empirical studies have examined the cookie matching that enables these services. Acar et al. crawled websites in the Alexa Top-3K and found that hundreds of domains passed unique identifiers to each other [2]. Olejnik et al. noticed that ad auctions were leaking the winning bid prices for impressions, thus enabling a fascinating behind-the-scenes look at RTB auctions [54]. In addition to examining the monetary aspects of auctions, Olejnik et al. found 125 ad exchanges using cookie matching. Finally, Falahrastegar et al. examine the clusters of domains that all share unique, matched cookies using crowdsourced browsing data [21]. Additionally, Ghosh et al. use game
theory to model the incentives for ad exchanges to match cookies with their competitors, but they provide no empirical measurements of cookie matching [23].

Several studies examine retargeted ads, which are directly facilitated by cookie matching and RTB. Liu et al. identify and measure retargeted ads served by DoubleClick by relying on unique AdSense tags that are embedded in ad URLs [43]. Olejnik et al. crawled specific e-commerce sites in order to elicit retargeted ads from those retailers, and observe that retargeted ads can cost advertisers over $1 per impression (an enormous sum, considering contextual ads sell for <

In this section, we discuss the methods and data we use to meet this goal. First, we briefly sketch our high-level approach, and discuss key enabling insights. Second, we introduce the instrumented version of Chromium that we use during our crawls. Third, we explain how we designed and trained shopper personas that view products on the web, and finally we detail how we collected ads using the trained personas.

4.1 Insights and Approach