Cookie Synchronization: Everything You Always Wanted to Know But Were Afraid to Ask

Panagiotis Papadopoulos, Brave Software, panpap@brave.com
Nicolas Kourtellis, Telefonica Research, Spain, nicolas.kourtellis@telefonica.com
Evangelos P. Markatos, FORTH-ICS, Greece, markatos@ics.forth.gr

arXiv:1805.10505v3 [cs.IR] 25 Feb 2020

ABSTRACT

User data is the primary input of digital advertising, fueling the free Internet as we know it. As a result, web companies invest a lot in elaborate tracking mechanisms to acquire user data that they can sell to data markets and advertisers. However, with same-origin policy and cookies as a primary identification mechanism on the web, each tracker knows the same user with a different ID. To mitigate this, Cookie Synchronization (CSync) came to the rescue, facilitating an information sharing channel between 3rd-parties that may or may not have direct access to the website the user visits. In the background, with CSync, they merge user data they own, but also reconstruct a user’s browsing history, bypassing the same-origin policy.

In this paper, we perform what is, to our knowledge, the first in-depth study of CSync in the wild, using a year-long weblog from 850 real mobile users. Through our study, we aim to understand the characteristics of the CSync protocol and the impact it has on web users’ privacy. For this, we design and implement CONRAD, a holistic mechanism to detect CSync events in real time, and the privacy loss on the user side, even when the synced IDs are obfuscated. Using CONRAD, we find that 97% of the regular web users are exposed to CSync: most of them within the first week of their browsing, and the median userID gets leaked, on average, to 3.5 different domains. Finally, we see that CSync increases the number of domains that track the user by a factor of 6.75.

CCS CONCEPTS

• Security and privacy → Web protocol security; • Networks → Network privacy and anonymity; • Social and professional topics → Surveillance;

KEYWORDS

Cookie Synchronization, Cross-domain tracking, HTTP Cookies

ACM Reference Format:
Panagiotis Papadopoulos, Nicolas Kourtellis, and Evangelos P. Markatos. 2019. Cookie Synchronization: Everything You Always Wanted to Know But Were Afraid to Ask. In Proceedings of the 2019 World Wide Web Conference (WWW ’19), May 13–17, 2019, San Francisco, CA, USA. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3308558.3313542

This paper is published under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution.
WWW ’19, May 13–17, 2019, San Francisco, CA, USA. © 2019 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC BY 4.0 License. ACM ISBN 978-1-4503-6674-8/19/05.

1 INTRODUCTION

In the online era, where behavioural advertising fuels the majority of the Internet, user privacy has become a commodity that is being bought and sold in a complex, and often ad-hoc, data ecosystem [17, 24, 35, 38]. Users’ personal data collected by IT companies constitute a valuable asset, whose quality and quantity significantly affect each company’s overall market value [40]. As a consequence, there is no doubt that, in order to gain an advantage over their competitors, web companies such as advertisers and trackers participate in a user data collecting spree, aiming to retrieve as much information as possible and form user profiles. These detailed profiles contain personal data [12] such as interests, preferences, personal identifying information, geolocations, etc. [5, 31, 44], and could be sold to 3rd-parties for advertising or other purposes beyond the control of the user [30, 36, 37].

Highlighting the importance of this data collection, web companies have invested a lot in elaborate user tracking mechanisms. The most traditional one includes the use of cookies: they have been commonly used in the Web to save and maintain some kind of state on the web client’s side. This state has been used as an identifier to authenticate users across different sessions and domains. Initially, 1st-party cookies were used to track users when they repeatedly visited the same site, and later, 3rd-party cookies were invented to track users when they move from one website to another. The same-origin policy (SOP) was invented a few years later [45] to restrict the potential amount of information trackers can collect about a user and share with other 3rd-party platforms.

To overcome this restriction, and create unified identifiers for each user, the ad-industry invented the Cookie Synchronization (CSync) process [19]: a mechanism that can practically “circumvent” the same-origin policy, and allow web companies to share (synchronize) cookies, and match the different IDs they assign to the same user while they browse the web. Sadly, recent results show that most of the 3rd-parties are involved in CSync: 157 of the top 200 websites (i.e., 78%) have 3rd-parties which synchronize cookies with at least one other party, and they can reconstruct 62-73% of a user’s browsing history [11]. Furthermore, 95% of pages visited contain 3rd-party requests to potential trackers and 78% attempt to transfer unsafe data [46]. Finally, a mechanism for respawning cookies has been identified, with consequences in the reconstruction of users’ browsing history, even if they delete their cookies [1].

Although past works highlight the use of CSync across the most popular sites worldwide, they avoid diving into the details and fail to thoroughly explore this increasingly popular technique. In fact, the majority of related works perform their analysis using crawled data by fetching the top Alexa websites. Consequently, little is known about how CSync works in the wild, how many of real users’ cookies are getting synced during their everyday browsing, and to what extent it affects the end users’ privacy. Importantly, existing studies focus solely on desktop web; however, today more than 52.2% of all web traffic is generated through mobile phones (up from 50.3% in the previous year [42]). This traffic points to a whole new, unexplored ecosystem of mobile Web, where users browse through their very personal devices, while employing a variety of sensors to experience highly personalized content, but always at the expense of privacy. Thus, what is the impact of CSync in this new mobile ecosystem? What are the basic characteristics of CSync, and the mechanics used? How does CSync impact the user’s privacy and anonymity on the mobile web?

In this paper, we aim to answer these questions, by studying Cookie Synchronization using a large, year-long dataset of 850 real mobile users, and exploring in depth its use and growth through time, dominant companies, along with its side-effects on user privacy. The contributions of this work are as follows:

• We design and implement CONRAD (COokie syNchRonizAtion Detector), a holistic mechanism to detect in real time CSync events and the privacy loss on the user side, even when synced IDs are concealed or obfuscated by participating companies aiming to reduce identifiability from traditional, heuristic-based detection algorithms. In such cases, when non-ID related features are used, our approach achieves high accuracy (84%-90%) and AUC (0.89-0.97).

• We perform the first of its kind, large-scale, longitudinal study of Cookie Synchronization on mobile users in the wild. We perform a passive data collection of the activity of 850 volunteering mobile users, lasting an entire year. This means that the data collected are not crawled, like in past studies, and therefore do not capture a distorted or biased picture of CSync on the Web. Instead, these data provide a rare glimpse of CSync in the mobile space, and an opportunity to study CSync and its impact on real mobile users’ privacy and anonymity.

• Using the proposed detection mechanism, we conduct an in-depth privacy analysis of CSync. Our results show that 97% of regular web users are exposed to CSync.

[Figure 1 diagram: on website1.com and website2.com the browser receives new cookies (1), (2) from tracker.com (Set-cookie: user123) and advertiser.com (Set-cookie: userABC); on website3.com a GET to tracker.com carrying Cookie: {cookie_ID=user123} is followed by the cookie sync (3), GET advertiser.com?syncID=user123&publisher=website3, carrying Cookie: {cookie_ID=userABC}.]

Figure 1: Example of advertiser.com and tracker.com synchronizing their cookieIDs. Interestingly, and without having any code in website3, advertiser.com learns that: (i) cookieIDs userABC==user123 and (ii) userABC has just visited the given website. Finally, both domains can conduct server-to-server user data merges.

tracker.com but not from advertiser.com. Thus, advertiser.com does not (and cannot) know which users visit website3.com. However, as soon as the code of tracker.com is called, a GET request is issued by the browser to tracker.com (step 1), and it responds back with a REDIRECT request (step 2), instructing the user’s browser to issue another GET request to its collaborator advertiser.com this time, using a specifically crafted URL (step 3):