Tracing Information Flows Between Ad Exchanges Using Retargeted Ads
Muhammad Ahmad Bashir, Sajjad Arshad, William Robertson, and Christo Wilson, Northeastern University
https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/bashir

This paper is included in the Proceedings of the 25th USENIX Security Symposium August 10–12, 2016 • Austin, TX ISBN 978-1-931971-32-4

Open access to the Proceedings of the 25th USENIX Security Symposium is sponsored by USENIX

Muhammad Ahmad Bashir, Northeastern University, ahmad@ccs.neu.edu
Sajjad Arshad, Northeastern University, arshad@ccs.neu.edu
William Robertson, Northeastern University, wkr@ccs.neu.edu
Christo Wilson, Northeastern University, cbw@ccs.neu.edu

Abstract

Numerous surveys have shown that Web users are concerned about the loss of privacy associated with online tracking. Alarmingly, these surveys also reveal that people are unaware of the amount of data sharing that occurs between ad exchanges, and thus underestimate the privacy risks associated with online tracking.

In reality, the modern ad ecosystem is fueled by a flow of user data between trackers and ad exchanges. Although recent work has shown that ad exchanges routinely perform cookie matching with other exchanges, these studies are based on brittle heuristics that cannot detect all forms of information sharing, especially under adversarial conditions.

In this study, we develop a methodology that is able to detect client- and server-side flows of information between arbitrary ad exchanges. Our key insight is to leverage retargeted ads as a tool for identifying information flows. Intuitively, our methodology works because it relies on the semantics of how exchanges serve ads, rather than focusing on specific cookie matching mechanisms. Using crawled data on 35,448 ad impressions, we show that our methodology can successfully categorize four different kinds of information sharing behavior between ad exchanges, including cases where existing heuristic methods fail.

We conclude with a discussion of how our findings and methodologies can be leveraged to give users more control over what kind of ads they see and how their information is shared between ad exchanges.

1 Introduction

People have complicated feelings with respect to online behavioral advertising. While surveys have shown that some users prefer relevant, targeted ads to random, untargeted ads [60, 14], this preference has caveats. For example, users are uncomfortable with ads that are targeted based on sensitive Personally Identifiable Information (PII) [44, 4] or specific kinds of browsing history (e.g., visiting medical websites) [41]. Furthermore, some users are universally opposed to online tracking, regardless of circumstance [46, 60, 14].

One particular concern held by users is their digital footprint [33, 65, 58], i.e., which first- and third-parties are able to track their browsing history. Large-scale web crawls have repeatedly shown that trackers are ubiquitous [24, 19], with DoubleClick alone being able to observe visitors on 40% of websites in the Alexa Top-100K [11]. These results paint a picture of a balkanized web, where trackers divide up the space and compete for the ability to collect data and serve targeted ads.

However, this picture of the privacy landscape is at odds with the current reality of the ad ecosystem. Specifically, ad exchanges routinely perform cookie matching with each other, to synchronize unique identifiers and share user data [2, 54, 21]. Cookie matching is a precondition for ad exchanges to participate in Real Time Bidding (RTB) auctions, which have become the dominant mechanism for buying and selling advertising inventory from publishers. Problematically, Hoofnagle et al. report that users naively believe that privacy policies prevent companies from sharing user data with third-parties, which is not always the case [32].

Despite user concerns about their digital footprint, we currently lack the tools to fully understand how information is being shared between ad exchanges. Prior empirical work on cookie matching has relied on heuristics that look for specific strings in HTTP messages to identify flows between ad networks [2, 54, 21]. However, these heuristics are brittle in the face of obfuscation: for example, DoubleClick cryptographically hashes their cookies before sending them to other ad networks [1]. More fundamentally, analysis of client-side HTTP messages is insufficient to detect server-side information flows between ad networks.
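To make the brittleness concrete, the following is a minimal sketch of the kind of string-matching heuristic described above, not the implementation used by any of the cited studies. The function name `shared_id_pairs`, the tuple layout of the observations, the example domains and IDs, and the use of SHA-256 as the hash are all assumptions made for illustration; the text only states that some exchanges hash their cookies.

```python
import hashlib
from urllib.parse import urlparse, parse_qsl


def shared_id_pairs(observations, min_len=6):
    """Naive heuristic (hypothetical): report (sender, receiver) pairs where a
    cookie value owned by `sender` appears verbatim as a URL parameter in a
    request to a different domain."""
    pairs = set()
    for sender, cookie_value, url in observations:
        parsed = urlparse(url)
        receiver = parsed.hostname
        # Skip same-domain requests and values too short to be user IDs.
        if receiver == sender or len(cookie_value) < min_len:
            continue
        param_values = [value for _, value in parse_qsl(parsed.query)]
        if cookie_value in param_values:
            pairs.add((sender, receiver))
    return pairs


# Hypothetical observations: (cookie owner, cookie value, URL the browser was sent to).
plain = [("s.com", "123456", "http://d.com/trackpixel?id=123456")]

# Same flow, but the sender hashes its ID first (SHA-256 is an assumption).
hashed_id = hashlib.sha256(b"123456").hexdigest()
hashed = [("s.com", "123456", f"http://d.com/trackpixel?id={hashed_id}")]

print(shared_id_pairs(plain))   # the verbatim ID is detected
print(shared_id_pairs(hashed))  # the hashed ID is missed
```

In this sketch the same information flow is found in the first case and missed in the second, although in both cases d.com learns a stable identifier for the user. A flow that happens entirely server-side would never appear in `observations` at all.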

In this study, we develop a methodology that is able to detect client- and server-side flows of information between arbitrary ad exchanges that serve retargeted ads. Retargeted ads are the most specific form of behavioral ads, where a user is targeted with ads related to the exact products she has previously browsed (see § 2.2 for definition). For example, Bob visits nike.com and browses for running shoes but decides not to purchase them. Bob later visits cnn.com and sees an ad for the exact same running shoes from Nike.

Our key insight is to leverage retargeted ads as a mechanism for identifying information flows. This is possible because the strict conditions that must be met for a retarget to be served allow us to infer the precise flow of tracking information that facilitated the serving of the ad. Intuitively, our methodology works because it relies on the semantics of how exchanges serve ads, rather than focusing on specific cookie matching mechanisms.

To demonstrate the efficacy of our methodology, we conduct extensive experiments on real data. We train 90 personas by visiting popular e-commerce sites, and then crawl major publishers to gather retargeted ads [9, 12]. Our crawler is an instrumented version of Chromium that records the inclusion chain for every resource it encounters [5], including 35,448 chains associated with 5,102 unique retargeted ads. We use carefully designed pattern matching rules to categorize each of these chains, which reveal 1) the pair of ad exchanges that shared information in order to serve the retarget, and 2) the mechanism used to share the data (e.g., cookie matching).

In summary, we make the following contributions:

• We present a novel methodology for identifying information flows between ad networks that is content- and ad exchange-agnostic. Our methodology allows us to identify four different categories of information sharing between ad exchanges, of which cookie matching is one.
• Using crawled data, we show that the heuristic methods used by prior work to analyze cookie matching are unable to identify 31% of ad exchange pairs that share data.
• Although it is known that Google's privacy policy allows it to share data between its services [26], we provide the first empirical evidence that Google uses this capability to serve retargeted ads.
• Using graph analysis, we show how our data can be used to automatically infer the roles played by different ad exchanges (e.g., Supply-Side and Demand-Side Platforms). These results expand upon prior work [25] and facilitate a more nuanced understanding of the online ad ecosystem.

Ultimately, we view our methodology as a stepping stone towards more balanced privacy protection tools for users, that also enable publishers to earn revenue. Surveys have shown that users are not necessarily opposed to online ads: some users are just opposed to tracking [46, 60, 14], while others simply desire more nuanced control over their digital footprint [4, 41]. However, existing tools (e.g., browser extensions) cannot distinguish between targeted and untargeted ads, thus leaving users with no alternative but to block all ads. Conversely, our results open up the possibility of building in-browser tools that just block cookie matching, which will effectively prevent most targeted ads from RTB auctions, while still allowing untargeted ads to be served.

Open Source. As a service to the community, we have open sourced all the data from this project. This includes over 7K labeled behaviorally targeted and retargeted ads, as well as the inclusion chains and full HTTP traces associated with these ads. The data is available at:

http://personalization.ccs.neu.edu/

2 Background and Definitions

In this section, we set the stage for our study by providing background about the online display ad industry, as well as defining key terminology. We focus on techniques and terms related to Real Time Bidding and retargeted ads, since they are the focus of our study.

2.1 Online Display Advertising

Online display advertising is fundamentally a matching problem. On one side are publishers (e.g., news websites, blogs, etc.) who produce content, and earn revenue by displaying ads to users. On the other side are advertisers who want to display ads to particular users (e.g., based on demographics or market segments). Unfortunately, the online user population is fragmented across hundreds of thousands of publishers, making it difficult for advertisers to reach desired customers.

Ad networks bridge this gap by aggregating inventory from publishers (i.e., space for displaying ads) and filling it with ads from advertisers. Ad networks make it possible for advertisers to reach a broad swath of users, while also guaranteeing a steady stream of revenue for publishers. Inventory is typically sold using a Cost per Mille (CPM) model, where advertisers purchase blocks of 1000 impressions (views of ads), or a Cost per Click (CPC) model, where the advertiser pays a small fee each time their ad is clicked by a user.

Ad Exchanges and Auctions. Over time, ad networks are being supplanted by ad exchanges that rely on an auction-based model. In Real-time Bidding (RTB) exchanges, advertisers bid on individual impressions, in real-time; the winner of the auction is permitted to serve

an ad to the user. Google's DoubleClick is the largest ad exchange, and it supports RTB.

[Figure 1 diagram: User, Publisher, SSP, Ad Exchange, DSPs, Advertisers, with flows labeled 1) Impression, 2) RTB, 3) Ad, and Ads & $$$.]
Figure 1: The display advertising ecosystem. Impressions and tracking data flow left-to-right, while revenue and ads flow right-to-left.

As shown in Figure 1, there is a distinction between Supply-side Platforms (SSPs) and Demand-side Platforms (DSPs) with respect to ad auctions. SSPs work with publishers to manage their relationships with multiple ad exchanges, typically to maximize revenue. For example, OpenX is an SSP. In contrast, DSPs work with advertisers to assess the value of each impression and optimize bid prices. MediaMath is an example of a DSP. To make matters more complicated, many companies offer products that cross categories; for example, Rubicon Project offers SSP, ad exchange, and DSP products. We direct interested readers to [45] for more discussion of the modern online advertising ecosystem.

2.2 Targeted Advertising

Initially, the online display ad industry focused on generic brand ads (e.g., Enjoy Coca-Cola) or contextual ads (e.g., an ad for Microsoft on StackOverflow). However, the industry quickly evolved towards behavioral targeted ads that are served to specific users based on their browsing history, interests, and demographics.

Tracking. To serve targeted ads, ad exchanges and advertisers must collect data about online users by tracking their actions. Publishers embed JavaScript or invisible tracking pixels that are hosted by tracking companies into their web pages, thus any user who visits the publisher also receives third-party cookies from the tracker (we discuss other tracking mechanisms in § 3). Numerous studies have shown that trackers are pervasive across the Web [38, 36, 55, 11], which allows advertisers to collect users' browsing history. All major ad exchanges, like DoubleClick and Rubicon, perform user tracking, but there are also companies like BlueKai that just specialize in tracking.

Cookie Matching. During an RTB ad auction, DSPs submit bids on an impression. The amount that a DSP bids on a given impression is intrinsically linked to the amount of information they have about that user. For example, a DSP is unlikely to bid highly for user u whom they have never observed before, whereas a DSP may bid heavily for user v whom they have recently observed browsing high-value websites (e.g., the baby site TheBump.com).

However, the Same Origin Policy (SOP) hinders the ability of DSPs to identify users in ad auctions. As shown in Figure 1, requests are first sent to an SSP which forwards the impression to an exchange (or holds the auctions itself). At this point, the SSP's cookies are known, but not the DSPs'. This leads to a catch-22 situation: a DSP cannot read its cookies until it contacts the user, but it cannot contact the user without first bidding and winning the auction.

To circumvent SOP restrictions, ad exchanges and advertisers engage in cookie matching (sometimes called cookie syncing). Cookie matching is illustrated in Figure 2: the user's browser first contacts ad exchange s.com, which returns an HTTP redirect to its partner d.com. s reads its own cookie, and includes it as a parameter in the redirect to d. d now has a mapping from its cookie to s's. In the future, if d participates in an auction held by s, it will be able to identify matched users using s's cookie. Note that some ad exchanges (including DoubleClick) send cryptographically hashed cookies to their partners, which prevents the ad network's true cookies from leaking to third-parties.

[Figure 2 diagram: messages exchanged between the User, s.com, and d.com.]
  1) User to s.com:  GET /.jpg HTTP/1.1
                     Cookie: id=123456
  2) s.com to User:  HTTP/1.1 302 Found
                     Location: d.com/trackpixel?id=123456
  3) User to d.com:  GET /trackpixel?id=123456 HTTP/1.1
                     Cookie: id=ABCDEF
  4) d.com to User:  HTTP/1.1 200 OK
Figure 2: SSP s matches their cookie to DSP d using an HTTP redirect.

Retargeted Ads. In this study, we focus on retargeted ads, which are the most specific type of targeted display ads. Two conditions must be met for a DSP to serve a retargeted ad to a user u: 1) the DSP must know that u browsed a specific product on a specific e-commerce site, and 2) the DSP must be able to uniquely identify u during an auction. If these conditions are met, the DSP can serve u a highly personalized ad reminding them to purchase the product from the retailer. Cookie

matching is crucial for ad retargeting, since it enables DSPs to meet requirement (2).

3 Related Work

Next, we briefly survey related work on online advertising. We begin by looking at more general studies of the advertising and tracking ecosystem, and conclude with a more focused examination of studies on cookie matching and retargeting. Although existing studies on cookie matching demonstrate that this practice is widespread and that the privacy implications are alarming, these works have significant methodological shortcomings that motivate us to develop new techniques in this work.

3.1 Measuring the Ad Ecosystem

Numerous studies have measured and broadly characterized the online advertising ecosystem. Guha et al. were the first to systematically measure online ads, and their carefully controlled methodology has been very influential on subsequent studies (including this one) [27]. Barford et al. take a much broader look at the adscape to determine who the major ad networks are, what fraction of ads are targeted, and what user characteristics drive targeting [9]. Carrascosa et al. take an even finer grained look at targeted ads by training personas that embody specific interest profiles (e.g., cooking, sports), and find that advertisers routinely target users based on sensitive attributes (e.g., religion) [12]. Rodriguez et al. measure the ad ecosystem on mobile devices [61], while Zarras et al. analyzed malicious ad campaigns and the ad networks that serve them [66].

Note that none of these studies examine retargeted ads; Carrascosa et al. specifically excluded retargets from their analysis [12].

Trackers and Tracking Mechanisms. To facilitate ad targeting, participants in the ad ecosystem must extensively track users. Krishnamurthy et al. have been cataloging the spread of trackers and assessing the ensuing privacy implications for years [38, 36, 37]. Roesner et al. develop a comprehensive taxonomy of different tracking mechanisms that store state in users' browsers (e.g., cookies, HTML5 LocalStorage, and Flash LSOs), as well as strategies to block them [55]. Gill et al. use large web browsing traces to model the revenue earned by different trackers (or aggregators in their terminology), and found that revenues are skewed towards the largest trackers (primarily Google) [24]. More recently, Cahn et al. performed a broad survey of cookie characteristics across the Web, and found that less than 1% of trackers can aggregate information across 75% of websites in the Alexa Top-10K [11]. Falahrastegar et al. expand on these results by comparing trackers across geographic regions [20], while Li et al. show that most tracking cookies can be automatically detected using simple machine learning methods [42].

Note that none of these studies examine cookie matching, or information sharing between ad exchanges.

Although users can try to evade trackers by clearing their cookies or using private/incognito browsing modes, companies have fought back using techniques like Evercookies and fingerprinting. Evercookies store the tracker's state in many places within the browser (e.g., Flash LSOs, etags, etc.), thus facilitating regeneration of tracking identifiers even if users delete their cookies [34, 57, 6, 47]. Fingerprinting involves generating a unique ID for a user based on the characteristics of their browser [18, 48, 50], browsing history [53], and computer (e.g., the HTML5 canvas [49]). Several studies have found trackers in-the-wild that use fingerprinting techniques [3, 52, 35]; Nikiforakis et al. propose to stop fingerprinting by carefully and intentionally adding more entropy to users' browsers [51].

User Profiles. Several studies specifically focus on tracking data collected by Google, since their trackers are more pervasive than any others on the Web [24, 11]. Alarmingly, two studies have found that Google's Ad Preferences Manager, which is supposed to allow users to see and adjust how they are being targeted for ads, actually hides sensitive information from users [64, 16]. This finding is troubling given that several studies rely on data from the Ad Preferences Manager as their source of ground-truth [27, 13, 9]. To combat this lack of transparency, Lecuyer et al. have built systems that rely on controlled experiments and statistical analysis to infer the profiles that Google constructs about users [39, 40]. Castelluccia et al. go further by showing that adversaries can infer users' profiles by passively observing the targeted ads they are shown by Google [13].

3.2 Cookie Matching and Retargeting

Although ad exchanges have been transitioning to RTB auctions since the mid-2000s, only three empirical studies have examined the cookie matching that enables these services. Acar et al. found that hundreds of domains passed unique identifiers to each other while crawling websites in the Alexa Top-3K [2]. Olejnik et al. noticed that ad auctions were leaking the winning bid prices for impressions, thus enabling a fascinating behind-the-scenes look at RTB auctions [54]. In addition to examining the monetary aspects of auctions, Olejnik et al. found 125 ad exchanges using cookie matching. Finally, Falahrastegar et al. examine the clusters of domains that all share unique, matched cookies using crowdsourced browsing data [21]. Additionally, Ghosh et al. use game

theory to model the incentives for ad exchanges to match cookies with their competitors, but they provide no empirical measurements of cookie matching [23].

Several studies examine retargeted ads, which are directly facilitated by cookie matching and RTB. Liu et al. identify and measure retargeted ads served by DoubleClick by relying on unique AdSense tags that are embedded in ad URLs [43]. Olejnik et al. crawled specific e-commerce sites in order to elicit retargeted ads from those retailers, and observe that retargeted ads can cost advertisers over $1 per impression (an enormous sum, considering contextual ads sell for <$0.01) [54].

Limitations. The prior work on cookie matching demonstrates that this practice is widespread. However, these studies also have significant methodological limitations, which prevent them from observing all forms of information sharing between ad exchanges. Specifically:

1. All three studies identify cookie matching by locating unique user IDs that are transmitted to multiple third-party domains [2, 54, 21]. Unfortunately, this will miss cases where exchanges send permuted or obfuscated IDs to their partners. Indeed, DoubleClick is known to do this [1].
2. The two studies that have examined the behavior of DoubleClick have done so by relying on specific cookie keys and URL parameters to detect cookie matching and retargeting [54, 43]. Again, these methods are not robust to obfuscation or encryption that hide the content of HTTP messages.
3. Existing studies cannot determine the precise information flows between ad exchanges, i.e., which parties are sending or receiving information [2]. This limitation stems from analysis techniques that rely entirely on analyzing HTTP headers. For example, a script from t1.com embedded in pub.com may share cookies with t2.com using dynamic AJAX, but the referrer appears to be pub.com, thus potentially hiding t1's role as the source of the flow.

In general, these limitations stem from a reliance on analyzing specific mechanisms for cookie matching. In this study, one of our primary goals is to develop a methodology for detecting cookie matching that is agnostic to the underlying matching mechanism, and instead relies on the fundamental semantics of ad exchanges.

4 Methodology

In this study, our primary goal is to develop a methodology for detecting flows of user data between arbitrary ad exchanges. This includes client-side flows (i.e., cookie matching), as well as server-side flows. In this section, we discuss the methods and data we use to meet this goal. First, we briefly sketch our high-level approach, and discuss key enabling insights. Second, we introduce the instrumented version of Chromium that we use during our crawls. Third, we explain how we designed and trained shopper personas that view products on the web, and finally we detail how we collected ads using the trained personas.

4.1 Insights and Approach

Although prior work has examined information flow between ad exchanges, these studies are limited to specific types of cookie matching that follow well-defined patterns (see § 3.2). To study arbitrary information flows in a mechanism-agnostic way, we need a fundamentally different methodology.

We solve this problem by relying on a key insight: in most cases, if a user is served a retargeted ad, this proves that ad exchanges shared information about the user (see § 6.1.1). To understand this insight, consider that two preconditions must be met for user u to be served a retarget ad for shop by DSP d. First, either d directly observed u visiting shop, or d must be told this information by SSP s. If this condition is not met, then d would not pay the premium price necessary to serve u a retarget. Second, if the retarget was served from an ad auction, SSP s and d must be sharing information about u. If this condition is not met, then d would have no way of identifying u as the source of the impression (see § 2.2).

In this study, we leverage this observation to reliably infer information flows between SSPs and DSPs, regardless of whether the flow occurs client- or server-side. The high-level methodology is quite intuitive: have a clean browser visit specific e-commerce sites, then crawl publishers and gather ads. If we observe retargeted ads, we know that ad exchanges tracking the user on the shopper side are sharing information with exchanges serving ads on the publisher side. Specifically, our methodology uses the following steps:

• § 4.2: We use an instrumented version of Chromium to record inclusion chains for all resources encountered during our crawls [5]. These chains record the precise origins of all resource requests, even when the requests are generated dynamically by JavaScript or Flash. We use these chains in § 6 to categorize information flows between ad exchanges.
• § 4.3: To elicit retargeted ads from ad exchanges, we design personas (to borrow terminology from [9] and [12]) that visit specific e-commerce sites. These sites are carefully chosen to cover different types of products, and include a wide variety of common trackers.

• § 4.4: To collect ads, our personas crawl 150 publishers from the Alexa Top-1K.
• § 5: We leverage well-known filtering techniques and crowdsourcing to identify retargeted ads from our corpus of 571,636 unique crawled images.

[Figure 3 diagram: an example web page, a.com/index.html, with the resources a.com/img.png, a.com/animate.js, a.com/cats.gif, and b.com/adlib.js.]
Figure 3: (a) DOM Tree, and (b) Inclusion Tree.

To solve this issue, we make use of a heavily instrumented version of Chromium that produces inclusion trees directly from Chromium's resource loading code [5]. Inclusion trees capture the semantic inclusion structure of resources in a webpage (i.e., which objects cause other objects to be loaded), unlike DOM trees which only capture syntactic structures. Our instrumented Chromium accurately captures relationships between elements, regardless of where they are located (e.g., within a single page or across frames) or how the relevant code executes (e.g., via an inline