On the Privacy Implications of Real Time Bidding
Total Page:16
File Type:pdf, Size:1020Kb
On the Privacy Implications of Real Time Bidding A Dissertation Presented by Muhammad Ahmad Bashir to The Khoury College of Computer Sciences in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science Northeastern University Boston, Massachusetts August 2019 To my parents, Javed Hanif and Najia Javed for their unconditional support, love, and prayers. i Contents List of Figures v List of Tables viii Acknowledgmentsx Abstract of the Dissertation xi 1 Introduction1 1.1 Problem Statement..................................3 1.1.1 Information Sharing Through Cookie Matching...............3 1.1.2 Information Sharing Through Ad Exchanges During RTB Auctions....5 1.2 Contributions.....................................5 1.2.1 A Generic Methodology For Detecting Information Sharing Among A&A companies..................................6 1.2.2 Transparency & Compliance: An Analysis of the ads.txt Standard...7 1.2.3 Modeling User’s Digital Privacy Footprint..................9 1.3 Roadmap....................................... 10 2 Background and Definitions 11 2.1 Online Display Advertising.............................. 11 2.2 Targeted Advertising................................. 13 2.2.1 Online Tracking............................... 14 2.2.2 Retargeted Ads................................ 14 2.3 Real Time Bidding.................................. 14 2.3.1 Overview................................... 15 2.3.2 Cookie Matching............................... 16 2.3.3 Advertisement Served via RTB........................ 17 3 Related Work 18 3.1 The Online Advertising Ecosystem.......................... 18 3.2 Online Tracking.................................... 19 3.2.1 Tracking Mechanisms............................ 19 3.2.2 Users’ Perceptions of Tracking........................ 20 ii 3.2.3 Blocking & Anti-Blocking.......................... 21 3.2.4 Transparency................................. 22 3.3 Real Time Bidding and Cookie Matching...................... 23 4 Tracing Information Flows Between A&A companies Using Retargeted Ads 25 4.1 Methodology..................................... 26 4.1.1 Insights and Approach............................ 27 4.1.2 Instrumenting Chromium........................... 28 4.1.3 Creating Shopper Personas.......................... 29 4.1.4 Collecting Ads................................ 30 4.2 Image Labeling.................................... 31 4.2.1 Basic Filtering................................ 31 4.2.2 Identifying Targeted & Retargeted Ads................... 32 4.2.3 Limitations.................................. 35 4.3 Analysis........................................ 35 4.3.1 Information Flow Categorization....................... 36 4.3.2 Cookie Matching............................... 40 4.3.3 The Retargeting Ecosystem......................... 42 4.4 Summary....................................... 44 5 A Longitudinal Analysis of the ads.txt Standard 46 5.1 Background...................................... 48 5.1.1 Ad Fraud and Spoofing............................ 49 5.1.2 A Brief Intro to ads.txt .......................... 49 5.1.3 ads.txt File Format............................ 50 5.2 Related Work..................................... 51 5.2.1 Ad Fraud................................... 51 5.2.2 ads.txt Adoption............................. 52 5.3 Methodology..................................... 53 5.3.1 Collection of ads.txt Data........................ 53 5.4 Adoption of ads.txt ................................ 55 5.4.1 Publisher’s Perspective............................ 56 5.4.2 Seller’s Perspective.............................. 59 5.5 Compliance...................................... 63 5.5.1 Isolating RTB Ads.............................. 63 5.5.2 Compliance Verification Metrics....................... 63 5.5.3 Results.................................... 64 5.6 Summary....................................... 67 6 Diffusion of User Tracking Data in the Online Advertising Ecosystem 69 6.1 Related Work on Graph Models........................... 71 6.2 Methodology..................................... 72 6.2.1 Dataset.................................... 72 6.2.2 Graph Construction.............................. 73 6.2.3 Detection of A&A Domains......................... 75 iii 6.2.4 Coverage................................... 75 6.3 Graph Analysis.................................... 76 6.3.1 Basic Analysis................................ 76 6.3.2 Cores and Communities........................... 78 6.3.3 Node Importance............................... 79 6.4 Information Diffusion................................. 80 6.4.1 Simulation Goals............................... 80 6.4.2 Simulation Setup............................... 81 6.4.3 Validation................................... 87 6.4.4 Results.................................... 89 6.4.5 Random Browsing Model.......................... 95 6.5 Limitations...................................... 96 6.6 Sumamry....................................... 97 7 Conclusion 99 7.1 Contributions & Impact................................ 100 7.2 Limitations...................................... 102 7.3 Lessons Learned................................... 103 7.3.1 Future Directions............................... 105 Bibliography 108 8 Appendices 126 8.1 Clustered Domains.................................. 126 8.2 Selected Ad Exchanges................................ 126 iv List of Figures 2.1 The display advertising ecosystem. Impressions and tracking data flow left-to-right, while revenue and ads flow right-to-left........................ 15 2.2 Examples of (a) cookie matching and (b) showing an ad to a user via RTB auctions. (a) The user visits publisher p1 Ê which includes JavaScript from advertiser a1 Ë. a1’s JavaScript then cookie matches with exchange e1 by programmatically gener- ating a request that contains both of their cookies . (b) The user visits publisher p2, which then includes resources from SSP s1 and exchange e2 Ê–. e2 solicits bids Í and sells the impression to e1 ÎÏ, which then holds another auction Ð, ultimately selling the impression to a1 ...................... 16 4.1 (a) DOM Tree, and (b) Inclusion Tree......................... 28 4.2 Overlap between frequent A&A domains and A&A domains from Alexa Top-5K.. 29 4.3 Unique A&A domains contacted by each A&A domain as we crawl more pages.. 29 4.4 Average number of images per persona, with standard deviation error bars...... 31 4.5 Screenshot of our AMT HIT, and examples of different types of ads......... 33 4.6 Regex-like rules we use to identify different types of ad exchange interactions. shop and pub refer to chains that begin at an e-commerce site or publisher, respectively. d is the DSP that serves a retarget; s is the predecessor to d in the publisher-side chain, and is most likely an SSP holding an auction. Dot star (:∗) matches any domains zero or more times................................... 36 5.1 Adoption of the ads.txt standard by Alexa Top-100K publishers over time.... 56 5.2 Publisher adoption over alexa ranks.......................... 56 5.3 Number of ads.txt records per publisher...................... 56 5.4 Number of publisher using the same ads.txt file.................. 58 5.5 Clusters of jxj, where x is the # of publishers using the same file........... 58 5.6 Number of seller domains over time.......................... 59 5.7 Authorized sellers over Alexa............................. 59 5.8 Sellers across two snapshots.............................. 59 5.9 Number of sellers and associated publisher IDs (April 2019)............. 60 5.10 Sellers by publisher relationships (April 2019).................... 60 5.11 Number of unique publishers and total entries (across all publishers) for sellers... 60 5.12 Percentage of non-compliant (non-clustered) Seller/Buyer tuples per publisher... 64 v 5.13 Percentage of non-compliant (clustered) Seller/Buyer tuples per publisher...... 64 5.14 Average distance of buyer from first seller across all publishers. Distances are shown for both compliant and non-compliant tuples.................. 66 6.1 An example HTML document and the corresponding inclusion tree, Inclusion graph, and Referer graph. In the DOM representation, the a1 script and a2 img appear at the same level of the tree; in the inclusion tree, the a2 img is a child of the a1 script because the latter element created the former. The Inclusion graph has a 1:1 correspondence with the inclusion tree. The Referer graph fails to capture the relationship between the a1 script and a2 img because they are both embed- ded in the first-party context, while it correctly attributes the a4 script to the a3 iframe because of the context switch........................ 73 6.2 k-core: size of the Inclusion graph WCC as nodes with degree ≤ k are recursively removed........................................ 78 6.3 CDF of the termination probability for A&A nodes.................. 82 6.4 CDF of the mean weight on incoming edges for A&A nodes............. 82 6.5 Examples of my information diffusion simulations. The observed impression count for each A&A node is shown below its name. (a) shows an example graph with two publishers and two ad exchanges. Advertisers a1 and a3 participate in the RTB auctions, as well as DSP a2 that bids on behalf of a4 and a5. (b)–(d) show the flow of data (dark grey arrows) when a user generates impressions on p1 and p2 under three diffusion models. In all three examples, a2 purchases both impressions on behalf of a5, thus they both directly receive information.