Tracking of the Finnish Web Users

John Bailey

Department of Information Systems Science

Hanken School of Economics

Helsinki

2019

HANKEN SCHOOL OF ECONOMICS

Department of: Information Systems Science

Type of work: Master’s Thesis

Author and student number: John Bailey 021090

Date: May 6th 2019

Title of thesis: Tracking of the Finnish Web Users

Abstract:

The development of the World Wide Web has always been a tightrope walk between the commercial needs of companies and the privacy rights of web users. Understanding the behaviors and preferences of web users provides the online ecosystem with valuable targeting data. The mechanism facilitating this data collection is called online tracking, in which small snippets of code enable the surveillance of users across the websites they visit. User acceptance of tracking and targeted advertising includes some interesting paradoxes, and global data breaches and misapplications have attracted growing media coverage and public outcry.

Online tracking prevalence has been researched from many perspectives, including the longitudinal and geographical, but very little is known about who is tracking and profiling the Finnish web users. As research has shown, there are geographical differences in tracking behavior and prevalence. Therefore, the aim of this thesis was to study the online tracking landscape from the perspective of a Finnish web user.

This thesis used the ranking service Alexa’s list of the 500 websites that Finnish users visit most frequently as a proxy for Finnish web usage. The sites were observed using the online tracking measurement tool Tracker Tracker, which documented the online trackers found on these websites. The resulting list of trackers was then enriched with organizational ownership data provided by the Disconnect dataset. The measurement found 466 unique trackers from 408 organizations used on 410 of the 500 websites.

The core findings of this thesis supported contemporary research, with Google having an overwhelming lead in tracker coverage, mostly through Google Analytics and Doubleclick, reaching a combined 75 % of the websites. The second most prevalent tracking organization was Facebook, reaching 46 % of the websites. Beyond the top 2 organizations, the competition was much tighter, followed by a long tail of organizations. There were also notable differences when comparing Finnish websites to non-Finnish sites, displaying some level of geographical preference in publishers’ choices of advertising platforms and analytical tools.

Keywords: online tracking, cookies, tracking prevalence, 3rd party tracking, targeted advertising, privacy, Finnish web users

SVENSKA HANDELSHÖGSKOLAN

Department: Information Systems Science

Type of work: Master’s Thesis

Author and student number: John Bailey 021090

Date: 6.5.2019

Title of thesis: Tracking of the Finnish Web Users

Summary:

Development on the web has always been a difficult balancing act between the commercial needs of companies and web users’ right to data protection. Good insights into users’ behavior and preferences have proven valuable in connection with targeted ads and personalized web pages. These insights are built on data collected through web tracking, where short snippets of code can enable the surveillance of users across multiple websites. Users’ opinions and behavior regarding web tracking and targeted ads do not always align, but global data breaches and other misuse have recently been highlighted in the media and have caused strong public criticism.

The extent of web tracking has been researched from several perspectives, including the longitudinal and the geographical, but the understanding of by whom and how Finnish users are tracked is limited. As previous research has shown geographical differences in web tracking, the thesis aimed to fill this knowledge gap by studying web tracking from the perspective of Finnish web users.

The thesis used Alexa’s top 500 websites visited by Finnish users to represent Finnish web usage. The sites’ tracking requests were analyzed and documented with the measurement tool Tracker Tracker, and the result was enriched with data from Disconnect, which linked the observed tracking scripts to the organizations behind them. The measurement found 466 unique tracking scripts from 408 organizations on 410 of the 500 websites.

The results of the thesis supported previous research, with a commanding lead for Google, whose tracking scripts, led by Google Analytics and Doubleclick, were observed on 75 % of the websites. Facebook’s tracking tools were the second most popular at 46 %. The subsequent ranking was seemingly more even and finally formed a long, sloping tail. There were also noticeable differences between Finnish and non-Finnish sites, which indicated a geographical preference for advertising platforms and analytical tools on the Finnish websites.

Keywords: web tracking, tracking requests, cookies, third-party tracking, targeted advertising, data protection, Finnish web users

CONTENTS

1 INTRODUCTION
1.1 Aim of the study
1.2 Research questions
1.3 Delimitations
1.4 Key concepts
1.5 Structure of the thesis
2 EXISTING RESEARCH
2.1 What is online tracking?
2.2 Online tracking methods
2.3 Online tracking justification
2.4 Online tracking prevalence
2.4.1 Longitudinal research
2.4.2 Regional research
2.5 Tracking measurement tools
2.6 Privacy
2.6.1 Data collection
2.6.2 Data processing
3 RESEARCH METHODS
3.1 Research design
3.2 Input data gathering
3.2.1 Alexa
3.2.2 Disconnect
3.3 Data collection and measurement tool
3.3.1 Tracker Tracker
3.3.2 Ghostery
3.4 Data gathering considerations
3.5 Data gathering process
3.6 Data categorization
3.7 Data interpretation choices
3.8 Quality of the study
4 RESULTS AND ANALYSIS
4.1 Data description
4.2 Tracker Prevalence
4.3 The Finnish digital footprint
4.4 Tracking in Finland
4.5 Other findings
5 DISCUSSION
5.1 Results and implications
5.2 Limitations
6 CONCLUSIONS
7 REFERENCES

APPENDICES

Appendix 1 Tracking defense strategies
Appendix 2 Tracking defense tools
Appendix 3 Number of trackers per site
Appendix 4 Tracker Reach
Appendix 5 Tracking organization reach

TABLES

Table 1 Tracking vs. non-tracking activities (Center for Democracy & Technology 2011)
Table 2 Ghostery tracking categories (Ghostery 2019)
Table 3 Number 3rd party scripts found during each top sites group and measurement run
Table 4 Comparison to the maximum number of 3rd party scripts for each top sites group and measurement run
Table 5 Share of sites with identified 3rd party scripts for each top sites group and measurement run
Table 6 Tracker categories for the top 500 sites
Table 7 Frequency table of tracker prevalence
Table 8 Frequency table for tracking organization coverage
Table 9 Average number of tracking scripts per category on Finnish and non-Finnish sites
Table 10 Frequency table for number of trackers per site

FIGURES

Figure 1 Data sharing settings in Google Analytics
Figure 2 Example content from cookie written by upcommons.upc.edu
Figure 3 1st party vs. 3rd party cookies
Figure 4 3rd party tracking
Figure 5 Summary of tracking mechanisms (Bujlow et al. 2017:5)
Figure 6 Screenshot from PANOPTICLICK
Figure 7 Top third party coverage on the top 1 million Alexa sites (Englehardt & Narayanan 2016:8)
Figure 8 Organizations with the highest 3rd party presence on the top 1 million Alexa sites (Englehardt & Narayanan 2016:9)
Figure 9 Historical 3rd party coverage by Lerner et al. (2016:1008)
Figure 10 Distribution of 3rd party requests (Lerner et al. 2016:1007)
Figure 11 Distribution of average number of trackers across websites (Karaj et al. 2018:7)
Figure 12 Company tracker reach top 10 (Karaj et al. 2018:8)
Figure 13 Tracker country location heatmap (Falahrastegar 2014B:7)
Figure 14 Summary of 3rd party cookies used in the Finnish web (Ruohonen & Leppänen 2017:2)
Figure 15 Tracking measurement tools (Bujlow et al. 2017:27)
Figure 16 The Tracker Tracker tool's user interface
Figure 17 The Tracker Tracker data output options
Figure 18 Example Tracker Tracker result set converted from csv to Excel format
Figure 19 mashable.com in Chrome with the Ghostery extension open (screenshot from 17.9.2015)
Figure 20 Simple ad serving process
Figure 21 Variability of observed trackers on Alexa top 100 sites (Lerner et al. 2016:1002)
Figure 22 Top 20 most prevalent trackers
Figure 23 All 466 trackers’ prevalence; the long tail of 3rd party trackers
Figure 24 Reach of the top 20 tracker organizations
Figure 25 Reach of the top 10 trackers on Finnish sites
Figure 26 Top 10 most prevalent trackers on Finnish sites compared to non-Finnish sites
Figure 27 Top 10 most prevalent trackers on non-Finnish sites compared to Finnish sites
Figure 28 Reach of the top 10 tracking organizations on Finnish sites
Figure 29 Top 10 most prevalent tracking organizations on Finnish sites compared to non-Finnish sites
Figure 30 Top 10 most prevalent tracking organizations on non-Finnish sites compared to Finnish sites
Figure 31 Top 20 most tracked sites
Figure 32 The reach of the top 10 tracking organizations compared over the top sites groups
Figure 33 Tracking defense strategies (Bujlow et al. 2017:21)
Figure 34 Tracking defense tools (Bujlow et al. 2017:22)

1 INTRODUCTION

When the commercial Internet was taking its first steps, Steiner (1993) accidentally created an adage in his New Yorker comic strip “On the Internet, nobody knows you’re a dog”. What was apparently (Cavna 2013) the result of a creative doodle can now be seen as an accurate contemporary view of the new technology that was the Internet. But we have come a long way since 1993, and with all the data, machine power, and algorithms available today, “they” would most certainly know you’re a dog, probably even before you knew it yourself.

The concept of “if you are not paying for the product or service you are using, you are the product” has become much more prevalent as social media has matured. It is easy to enjoy all the free web content and services when the privacy cost of doing so is hidden behind the browser. The idea behind this concept is that companies provide products or services free of charge, with no monetary compensation from the user, but then monetize the user’s attention or behavior through other means. This is by no means a new phenomenon; rather, it describes the central business model for a substantial part of commercial media: the television viewer pays for the programming with her attention during commercial breaks, and the same goes for the radio listener or the free newspaper reader. From the point of view of advertising business models, the Internet provides organizations with the opportunity to go beyond the momentary attention of a fleeting eye on a web page, by enabling them to follow users’ behavior throughout their website or, in some cases, on other websites as well.

It is this trans-website behavioral tracking that has revealed great potential, both in the business opportunities it provides and in the privacy risks it entails. So-called 3rd party tracking enables some organizations to follow users over a wide range of websites, creating a detailed profile of the user’s behaviors and preferences, which in turn can be used for targeted advertising or personalized service. Although this may sound like a positive development, the aggregation of large amounts of data comes with serious risks, not only to the individual, but also to society as a whole. Who uses what data, how, and to what end? Are web users even aware of the prevalence of these tracking mechanics that relentlessly follow them around the web?

Online tracking is by no means a new area of research in academia, as will be demonstrated in the literature overview of this thesis. The three key themes in tracking research revolve around technical tracking mechanisms, tracking prevalence, and privacy impact. A subtheme under tracking prevalence is the variation measured between different geographical areas, exploring the distinct tracking practices and the organizations behind them. As there is currently no thorough research on which organizations track Finnish web users, this thesis aims to create a basic level of understanding in this research area. In order to uncover the current tracking landscape from a Finnish web user’s perspective, the research presented here employs an empirical analysis of the trackers and their owners, and how the Finnish sites might differ from their foreign counterparts.

1.1 Aim of the study

The aim of this master’s thesis is to describe the online tracking landscape and quantify the prevalence of mechanisms and organizations that track the online behavior of Finnish web users, in order to obtain a better understanding of the share of the Finnish digital footprint these organizations theoretically possess and whether their behavior differs on Finnish and non-Finnish websites.

1.2 Research questions

The research questions for this thesis are:

1) What are the most prevalent trackers on the websites which Finnish web users frequently visit?

2) How complete a digital footprint can organizations build from their trackers on websites Finnish web users frequently visit?

3) How does tracking differ between Finnish sites and non-Finnish sites?

1.3 Delimitations

The delimitations of this thesis are privacy regulations and the implementation of privacy protection. In regard to regulation, the regulative frameworks provided by the EU ePrivacy Directive and later also EU GDPR (enforced on May 25th, 2018; after the empirical data collection of this thesis, but before the analysis) are not evaluated in this thesis, and the data collected for the thesis is taken as is. The same approach is used for analyzing the possible implementation of privacy protection on the websites, which the regulative frameworks require. Some regulative influences are discussed as part of data gathering limitations, but further investigation is left for and suggested as future research. From the tracking organization’s point of view, this thesis does not consider the privacy, data protection, or terms of service that the websites or tracking organizations provide in regard to their tracking practices, nor the possible usage limitations they set for any collected data. Roesner, Kohno & Wetherall (2012) used a similar approach, as they “… analyze trackers according to the capabilities granted by the behaviors we observe and not, for example, the privacy policies of the tracking sites” (Ibid:3). Further, although discussed in conjunction with privacy issues (2.6.2 Data processing), no actual data use scenarios are implied beyond the ability to build a digital footprint of the tracked users.

1.4 Key concepts

This section contains some of the key concepts and definitions of the chosen research area, as used by the author in this thesis.

• Cookie: In the context of web technology, a cookie is a small text file that a browser saves on the user’s computer. When a user browses the web, the websites they visit can send additional information to the user’s browser in order to provide a stateful (continuous) experience across different pages on the site. Every time the user visits a website, the data contained in the cookie is sent to the website server when the browser requests the page content and can therefore be used to decide what content is to be sent back to the browser. Examples of data saved in cookies are shopping cart content, login or authentication information, progress in a multi-page process, identifiers for website analytics, preferences for website personalization, etc. Although the content of a cookie is a simple list of name-value pairs, the service it enables is crucial to most current websites.

• 1st-party cookies: A cookie provided by the website that the user is browsing is called a 1st-party cookie. These provide the key functionalities of stateful tracking as described in the above “Cookie” explanation.

• 3rd-party cookies: Cookies that originate from a domain other than the current website the user is visiting are called 3rd-party cookies. If a user is for example browsing the website abc.org and sees an ad from xyz.org, a cookie provided by xyz.org is defined as a 3rd-party cookie.

• 3rd-party tracking: 3rd-party cookies make it possible to track users across different websites. If a user visits the website abc.org and sees an ad from xyz.org, a 3rd-party cookie can be set on the user’s computer. If the user then visits def.org and sees another ad from xyz.org, the same 3rd-party cookie from xyz.org is activated and can be used to create a historical view of pages and websites that the user has visited. In addition to just relying on cookies for user data, 3rd-party tracking is also possible through many other web components that are used to create the web pages a user browses. Shared JavaScript libraries or font packages, that are loaded from 3rd-parties like content delivery networks (CDNs), are examples of these 3rd-party web components that enable tracking.

• User: The person who browses the web, is tracked, and most probably also exposed to online advertising, is called the User.

• Publisher: The entity that provides the website and content that the user visits is called the Publisher. The publisher supplies the ad-driven online economy with users. Larger publishers usually provide their user inventory. Media companies are the most evident examples of publishers.

• Advertiser: The entity that wants to advertise is called the Advertiser. The advertiser creates demand for suitable users that it wants to communicate with.

• Ad network: An ad network is an entity that creates a marketplace that tries to match the publisher’s supply with the advertiser’s demand for relevant users. In the online space, ad networks are usually large interconnected systems that provide ad buying through user interfaces or machine interfaces (APIs) for publisher and/or advertiser integration. Many, especially smaller, ad networks also integrate with each other for improved reach.

• Targeted advertising: Targeted advertising is online advertising that aims to select users based on their traits or on the context of the ad placement they are observing. The traits can be demographic (age, gender, income, etc.), behavioral (buyer profile, purchase intent, browser history, previous response to ads, etc.), or any other suitable data point that is available to the advertiser. The context can be anything from the text on the page the user is viewing, to the device they are viewing it on, even to the weather in the location they are in. 3rd-party tracking is the key enabler of behavior-based targeted advertising, as it provides the data needed for profile analysis. These profiles can then be used to create audiences by expanding the profiles to users who share the behavior of the already identified profiles. Contextual targeting is usually based on data provided by the publisher or the website.

• OBA, Online Behavioral Advertising: Online behavioral advertising is a subset of targeted advertising where the user’s web browsing history and possibly other data is used to construct a detailed profile of the user for ad targeting. The more comprehensive a view of the user’s online activities is available, the better and more accurate the resulting profile becomes. This is why 3rd-party tracking and the reach thereof is most relevant for succeeding with behavioral advertising.

• Programmatic advertising: Programmatic advertising describes a method of direct ad buying, e.g. buying through a machine interface. Instead of calling the publisher to ask for 1000 impressions (users seeing the ad) and sending over the ad material, the ad buyer can use an ad buying user interface to directly purchase the impressions for the selected campaign. The possible targeting options or audiences are provided by the publisher, at their discretion.

• RTB, Real-time bidding: Real-time bidding is a subset of programmatic buying, where ad buying is facilitated through instant speed auctions on an impression by impression basis. As a user visits a website, a request for an ad is sent to a real-time auction. Any user profile and context data that the publisher wants to share with the auction can be included in the request. The advertiser’s ad buying platforms automatically decide if and how much they wish to bid, based on data provided by the publisher and/or any data they already have of the user. This process is then duplicated for every ad space on the web page where the publisher decides to use real-time buying. Real-time buying is the key enabler with which advertisers can utilize their user data for a more efficient ad buying strategy. They can collect and analyze user data by deploying their own 3rd-party trackers, or more customarily, use the data provided by their ad buying platforms or data aggregators.
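The real-time bidding flow described in the last key concept can be sketched in a few lines. This is a minimal illustration only, assuming a simple highest-bid-wins auction; the `BidRequest` fields, the bidder functions, the prices, and the `run_auction` helper are all hypothetical and are not taken from any real ad exchange.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class BidRequest:
    """Data the publisher chooses to share with the auction (hypothetical fields)."""
    site: str
    ad_slot: str
    user_segment: Optional[str] = None  # e.g. a behavioral profile label, if shared


# A bidder looks at the request and returns a bid (euros per impression here),
# or None if it does not want to bid at all.
Bidder = Callable[[BidRequest], Optional[float]]


def run_auction(request: BidRequest, bidders: dict) -> Optional[tuple]:
    """Collect bids for one impression; return (winner, bid) or None if nobody bids."""
    bids = {}
    for name, bidder in bidders.items():
        bid = bidder(request)
        if bid is not None:
            bids[name] = bid
    if not bids:
        return None
    winner = max(bids, key=bids.get)  # assumption: the highest bid wins
    return winner, bids[winner]


def sports_brand(request: BidRequest) -> Optional[float]:
    # Bids more when the shared profile matches its target audience.
    return 0.004 if request.user_segment == "sports_fan" else 0.001


def car_brand(request: BidRequest) -> Optional[float]:
    # Only bids on users it considers relevant.
    return 0.003 if request.user_segment == "car_buyer" else None


bidders = {"sports_brand": sports_brand, "car_brand": car_brand}
result = run_auction(BidRequest("news.example", "top_banner", "sports_fan"), bidders)
```

In this sketch the auction would be run once for every ad space on the page, which mirrors the per-impression nature of real-time bidding; how the winning price is actually determined varies between platforms and is not modeled here.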

1.5 Structure of the thesis

This thesis is structured around four main parts: related research overview, research methods used, results presentation, and finally the discussion of findings and concluding thoughts.

The first part of the thesis, Chapter 2, consists of four related research themes. The first theme describes the online tracking landscape from a technical and business justification point of view. The second theme presents a literature overview of related tracking prevalence research. The third theme defines some of the tool requirements for tracking research. The fourth theme describes the underlying complexity of some of the privacy issues and concerns related to targeted advertising, online tracking, and surveillance in general.

The second part of the thesis, Chapter 3, describes the chosen research approach using three themes. The first theme presents the need and validation for using secondary data provided by the companies Alexa and Disconnect. The second theme presents the tracking measurement tool Tracker Tracker and the Ghostery browser extension. The third theme discusses the data gathering process and some key considerations and choices regarding the data evaluation process and analysis.

The third part of the thesis, Chapter 4, presents the empirical results in five sections. The first section consists of a descriptive analysis of the results and the quality of the data. The second section presents the results that answer the first research question of tracker prevalence. The third section presents the results that answer the second research question of tracking organizations and their coverage of the Finnish user’s digital footprint. The fourth section presents the results that answer the third research question concerning the possible difference between tracking on Finnish versus non-Finnish websites. The last, fifth section, presents some additional findings related to the research theme.

The fourth part of the thesis, Chapters 5 and 6, contains the discussion and conclusions. In Chapter 5, the research findings and implications are discussed more broadly and tied into the surrounding research landscape. Chapter 6 ends the thesis with concluding remarks and suggestions for future research.

2 EXISTING RESEARCH

This chapter consists of four related research themes. The first theme describes the online tracking landscape from a technical and business justification point of view (2.1 What is online tracking?, 2.2 Online tracking methods, 2.3 Online tracking justification). The second theme presents a literature overview of related tracking prevalence research (2.4 Online tracking prevalence). The third theme defines some of the tool requirements for tracking research (2.5 Tracking measurement tools). The fourth theme describes the underlying complexity of some of the privacy issues and concerns related to targeted advertising, online tracking, and surveillance in general (2.6 Privacy).

2.1 What is online tracking?

The Center for Democracy & Technology (2011) proposal paper discusses tracking in relation to the “Do Not Track” option, the mechanic that would enable users to signal their possible tracking dissent to the web pages they visit. The paper presents a short but descriptive overview of some of the complex issues surrounding the definition and scoping of online tracking. The authors define tracking as:

“Tracking is the collection and correlation of data about the Internet activities of a particular user, computer, or device, over time and across non-commonly branded websites, for any purpose other than fraud prevention or compliance with law enforcement requests.” (Center for Democracy & Technology 2011:3)

The paper (Ibid) continues with a recommendation on how to distinguish between activities that should and should not be considered as tracking in the context of “Do Not Track”, as seen in Table 1.

Tracking

• 3rd party online behavioral advertising; using tracking data to create a profile for online ad targeting and delivery.

• 3rd party behavioral data collecting for first party uses; using tracking data to advertise on or personalize the content of a publisher’s own websites.

• Behavioral data collected by first parties and transferred to third parties in identifiable form; sharing first party tracking data including user identifiers to third parties.

Non-tracking

• 3rd party ad and content delivery; ad or content delivery that does not rely on the specific user and does not use any tracking activities described in the tracking section.

• 3rd party reporting; the reporting of ad or other content using a 3rd party, i.e. ad engagement, and does not use any tracking activities described in the tracking section.

• 3rd party analytics; providing first party website analytics and reports that does not use cross-site data aggregation and does not use any tracking activities described in the tracking section.

• 3rd party contextual advertising; delivering advertising based on the current website or web page, regardless of the user’s previous online behavior.

• 1st party data collection and use; using the data users have directly provided to deliver first party behavioral ads or personalized service without using any tracking activities described in the tracking section.

• Federated identity transaction data; 3rd party user registration and authentication providers that do not use any tracking activities described in the tracking section.

• Data collection required by law and for legitimate fraud prevention purposes; e.g. saving IP addresses, as long as it is not used for any tracking activities described in the tracking section.

Table 1 Tracking vs. non-tracking activities (Center for Democracy & Technology 2011)

The issue with the above list of tracking vs. non-tracking activities is that they all rely on an “as long as no tracking activities are used” clause for separation. The explanation for this is quite simple: the actual number of tracking activities and mechanics is too large and varied to conform to any scoped explanation. For example, Google Analytics provides 3rd party analytics, but also cross-domain analytics¹, and it is even possible (and apparently recommended, as can be seen in Figure 1) to share the analytics data with Google. With all the possible scenarios, how would it even be possible for a user to identify and assess the actual tracking vs. non-tracking activities deployed by the websites they visit, the third parties present on those websites, or the large organizations that aggregate the data at the end of the line? The neutral approach, and the one chosen for this thesis, is to view all 3rd parties as possible trackers, as noted by Englehardt & Narayanan (2016).

Figure 1 Data sharing settings in Google Analytics²

¹ https://support.google.com/analytics/answer/1033876?hl=en

² https://analytics.google.com

3rd party services can also become regulatory risks for websites, depending on the content and context the web user is in. In 2015 one Finnish bank used Google Analytics in their online bank service, tracking everything the user did. Räisänen (2015) used demo credentials to gain access and analyze the data being shared. One of her findings was that the demo user’s account number was sent as a parameter to Google using only a simple hash to protect it. She wrote a 25-line program that could brute-force the real account number in 0.5 seconds. Overall, she claimed that the information shared with Google was in breach of the Finnish bank secrecy guidelines, in which even the existence of a customer relationship is considered secret. The bank later stopped using Google Analytics (S-Pankki 2015).
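Why a simple hash offers little protection for something like an account number can be illustrated with a short sketch. The source does not describe the bank’s hashing scheme, so the following assumes an unsalted SHA-256 hash and a hypothetical 6-digit identifier; the function names are also illustrative. Because the set of possible inputs is small, every candidate can simply be hashed and compared with the observed value.

```python
import hashlib


def simple_hash(identifier: str) -> str:
    """Unsalted SHA-256 hex digest (an assumed scheme, for illustration only)."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()


def brute_force(target_hash: str, digits: int = 6):
    """Hash every zero-padded number with `digits` digits until one matches."""
    for candidate in range(10 ** digits):
        text = str(candidate).zfill(digits)
        if simple_hash(text) == target_hash:
            return text
    return None  # the hashed value was not a number of this length


# What a third party would receive: only the hash of a made-up identifier.
observed = simple_hash("042917")

# Enumerates at most 1,000,000 candidates to recover the original value.
recovered = brute_force(observed)
```

The point of the sketch is that the protection depends on the size of the input space rather than on the strength of the hash function; a longer identifier raises the cost, but a structured number with a known format remains far easier to enumerate than arbitrary data.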

2.2 Online tracking methods

The first cookie, a small text file the browser saves on the user’s computer³ (example content shown in Figure 2), was developed by Lou Montulli of Netscape Communications Inc during the summer of 1994. The cookie was built to provide stateful storage space for users’ shopping carts in an e-commerce setting. The cookie would enable the user’s shopping cart to stay available even when the user left the website and came back, a big improvement over forgetful URL methods, i.e. saving shopping cart information in the URL (the text in the address bar of the browser). The development was rapid, and cookie capabilities were deployed in Netscape’s browsers from the beginning of October 1994. One core functionality of cookies was implemented from the beginning: domain matching. Domain matching requires the cookie and the website to share a domain, i.e. example.com can read and write to a cookie set by example.com, but not one set by example2.com. This capability was developed as a privacy enabler, making sure that users were not trackable between different sites. (Shah & Kesan 2009)⁴

³ For a longer definition, see 1.4 Key concepts.

⁴ Shah & Kesan (2009) reference http://home.netscape.com/newsref/std/cookie_spec, a now unavailable web page, for Lou Montulli’s involvement and the timeline of cookie development.
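The domain-matching rule described above can be expressed as a small check. This is a simplified sketch and not Netscape’s original implementation; the function name `domain_matches` is hypothetical, and the sketch assumes that a cookie is visible to the exact domain that set it and to that domain’s subdomains, which is roughly how later cookie standards define the rule. Special cases such as IP addresses are ignored.

```python
def domain_matches(request_host: str, cookie_domain: str) -> bool:
    """Return True if a cookie set for `cookie_domain` may be sent to `request_host`."""
    host = request_host.lower().rstrip(".")
    domain = cookie_domain.lower().strip(".")
    if host == domain:
        return True
    # Subdomains match only on a label boundary, so "example2.com" and
    # "notexample.com" never match a cookie set for "example.com".
    return host.endswith("." + domain)


same_site = domain_matches("example.com", "example.com")
sub_domain = domain_matches("shop.example.com", "example.com")
other_site = domain_matches("example2.com", "example.com")
```

Comparing on a label boundary (the leading dot) rather than on a plain string suffix is what keeps look-alike domains from reading each other’s cookies.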

Figure 2 Example content from cookie written by upcommons.upc.edu

Domain matching worked as planned, but as history has shown, it was not enough to protect the users from cross-domain tracking. If a website contained any 3rd party elements, such as pictures, font libraries, or other content components, the 3rd party could also write and read their own cookie. If the 3rd party’s elements were used on other websites as well, the 3rd party could track the users over these websites. This technical design, unforeseen by Netscape, eventually enabled the dawn of 3rd party tracking and advertising networks, through companies like DoubleClick. Even though the IETF⁵ identified the risks as early as December 1995, common standards were not available until five years later. (Shah & Kesan 2009)

⁵ Internet Engineering Task Force - https://www.ietf.org/

Figure 3 1st party vs. 3rd party cookies

Figure 3 shows the basic difference between cookie access for 1st party and 3rd party entities. School runs a website with content for students. They save authentication (login) data in cookies on the user’s machine but because of domain matching restrictions, no cookie data is available to Tracker. E-commerce runs a web store with authentication and shopping cart data saved in cookies. Tracker provides the web store with a web component (picture, code library, etc.) and can then store a unique user id in a cookie. Because of domain matching restrictions, E-commerce cannot access the unique user id and Tracker cannot access the authentication or shopping cart data.

Figure 4 3rd party tracking

Figure 4 visualizes the tracking capabilities of 3rd party cookies. Tracker provides web store with content and saves a unique user id on the user’s machine. When the same user later visits web store 2, Tracker identifies the user through the unique user id it has previously saved in the cookie on the user’s machine. Tracker can follow the user on all websites that use content provided by Tracker and use the user’s browsing habits to create behavioral profiles. A real-world example of this scenario is Google’s DoubleClick advertising network, which is able to track users globally by providing websites with ads and placing 3rd party cookies on the users’ machines.
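The scenario in Figure 4 can be sketched as a small simulation. All names here (Tracker, Browser, the example domains, the uid cookie) are hypothetical and chosen only for illustration; the sketch assumes that every embedded 3rd party element triggers a request that carries the 3rd party’s own cookie and reveals the visited site.

```python
import uuid
from collections import defaultdict


class Tracker:
    """Hypothetical 3rd party that serves content embedded on other sites."""

    def __init__(self, domain: str) -> None:
        self.domain = domain
        self.profiles = defaultdict(list)  # user id -> list of visited sites

    def serve(self, own_cookies: dict, visited_site: str) -> None:
        # The tracker only ever sees the cookies stored under its own domain.
        user_id = own_cookies.get("uid")
        if user_id is None:
            user_id = uuid.uuid4().hex  # first encounter: assign a unique id
            own_cookies["uid"] = user_id
        self.profiles[user_id].append(visited_site)


class Browser:
    """Hypothetical browser with a cookie jar keyed by cookie domain."""

    def __init__(self) -> None:
        self.jar = defaultdict(dict)  # cookie domain -> cookies

    def visit(self, site: str, embedded_trackers: list) -> None:
        for tracker in embedded_trackers:
            # Domain matching: each tracker receives only its own cookies.
            tracker.serve(self.jar[tracker.domain], site)


tracker = Tracker("tracker.example")
browser = Browser()
browser.visit("webstore.example", [tracker])
browser.visit("webstore2.example", [tracker])
browser.visit("school.example", [])  # no 3rd party content, so no tracking

# One user id is linked to both web stores, but not to the school site.
print(dict(tracker.profiles))
```

Domain matching is never violated in this sketch; the tracker links the two visits simply because its own cookie is sent back from every site that embeds its content.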

Figure 5 Summary of tracking mechanisms (Bujlow et al. 2017:5)

Cookies are the easiest and most prevalent way of tracking users online but are far from the only one. Figure 5 summarizes a 2017 snapshot of known tracking mechanisms described in the literature review by Bujlow, Carela-Español, Sole-Pareta & Barlet-Ros (2017). It should be emphasized that the 26 tracking mechanisms listed are only the ones previously identified by academia; there might be, and probably are (Sanchez-Rola & Santos 2018), countless mechanisms still unidentified. One of the most interesting contemporary tracking mechanisms is called Fingerprinting. Fingerprinting uses any information it can gather regarding the network, device, operating system, and browser, and tries to build a unique profile that describes only the user’s current setup. The amount of data the browser makes available can be surprising, and individual users might be more unique than they think. Figure 6 shows an example browser privacy test, demonstrating that the author’s browser provided enough data for a unique fingerprint amongst over 300 000 browsers that had taken the same test during the previous 45 days.
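As a rough illustration of the fingerprinting idea, the sketch below combines a handful of browser and device attributes into one identifier. The attribute names and values are invented for this example, and hashing a canonical JSON string is only one assumed way of deriving an identifier; it is not a description of how any particular fingerprinting service works.

```python
import hashlib
import json


def fingerprint(attributes: dict) -> str:
    """Derive a stable identifier from a set of observed attributes.

    Keys are sorted so that the same attributes always produce the same
    identifier, regardless of the order in which they were collected.
    """
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical attribute values, for illustration only.
setup_a = {
    "user_agent": "ExampleBrowser/1.0",
    "screen": "1920x1080x24",
    "timezone": "Europe/Helsinki",
    "fonts": ["Arial", "Helvetica", "Verdana"],
    "language": "fi-FI",
}
setup_b = dict(setup_a, timezone="Europe/Stockholm")  # one attribute differs

print(fingerprint(setup_a))
print(fingerprint(setup_a) == fingerprint(setup_b))
```

No cookie or other stored state is needed in this sketch: the identifier is recomputed from the attributes on every visit, and the more attributes are combined, the more likely the combination is to be unique.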

Figure 6 Screenshot from PANOPTICLICK6

6 https://panopticlick.eff.org

Even though there is a plethora of options available, both Bujlow et al. (2017) and Li, Hang, Faloutsos & Efstathopoulos (2015) concluded that cookies are the most common way of tracking users. Li et al. (2015) saw technical maturity and legality as key factors for cookie prevalence as a tracking mechanism:

“Although HTTP cookies are not the only means with which 3rd party trackers keep track of users, they are the most popular. There are three reasons to this. Firstly, all browsers can accept and send cookies. Secondly, other non-HTTP cookies exist and can be used for tracking, but they are inefficient or will create legal issues for the entities who utilize them. Finally, even though third-party websites can track a user by their browser fingerprint [13], this method incurs a much higher overhead, thus is unlikely to adopted widely” (Li et al. 2015:4)

2.3 Online tracking justification

Several different approaches to describing the justification for online tracking were found in the literature review. Data security expert Bruce Schneier (2015) describes it as follows:

“The primary goal of all this corporate Internet surveillance is advertising. There’s a little market research and customer service in there, but those activities are secondary to the goal of more effectively selling you things.” (Schneier 2015:32)

Or Solove (2004) over 10 years earlier:

“The Internet’s greater targeting potential and the fierce competition for the consumer’s attention have given companies an unquenchable thirst for information about web users.” (Solove 2004:23)

Bujlow et al. (2017) presented 10 applications for online tracking in their literature review:

1) User-oriented search; search results that are personalized based on what the user would most likely react to; partially represented by the term Filter Bubble, as defined by Pariser (2011).

2) Online advertising; the original reason for the development of online tracking, found on most current websites.

3) Web analytics and usability tests; understanding what users do on a specific website and how they react to changes, e.g. A/B-testing.

4) Assessing financial information; some financial companies use online information to assess creditworthiness, e.g. social media relationships, shopping habits.

5) Price discrimination; information such as the user’s location, demographic data, browser, OS, and work or private computer has been shown to influence product or service pricing.

6) Determining insurance coverage; combining online and offline information for insurance risk assessment.

7) Impact on the job market; employers making background checks on employees or applicants.

8) Government surveillance; explicit information requests by governmental agencies from corporations such as Google, and the more clandestine kind exposed by Edward Snowden.

9) Identity theft; information shared on the Internet, especially social media sites such as LinkedIn, Google+, and Facebook, creates opportunities for identity theft.

10) 3rd party tracking; tracking performed by services other than the one the user is visiting, e.g. the Facebook “like” button or Google Analytics.

Mayer and Mitchell (2012) defined six broad business models for what they called 3rd party websites, i.e. websites that provide a service to other websites:

1) Advertising companies; companies that enable online ad serving through direct buying, advertising networks, or advertising exchanges.

2) Analytics services; 3rd party suppliers of website analytics services, such as Adobe or Google Analytics.

3) Social integration; social media sites providing sharing or other functionalities through embeddable website widgets or buttons.

4) Content providers; services that host embeddable content such as maps, videos, weather, e.g. YouTube or Google Maps.

5) Frontend services; services that host common JavaScript libraries or other APIs that speed up website loading.

6) Hosting platforms; services that provide platforms for content creation and sharing (e.g. Wordpress.com) or content distribution networks (e.g. Akamai) that help publishers share their own content globally.

The value of using user data on an individual level has always been a key advantage for online media, compared to traditional offline media. Already during the early days of the commercial Internet, Gallagher & Parsons (1997:9) presented a basic framework for “leveraging information technology to target online banner advertising more effectively to benefit both users (who would be exposed only to advertising that is very probably of interest to them) and advertisers (whose advertisements would reach only those users who fit the target audience profile)”. They also identify the opportunity to use the same framework to automatically personalize web site content in order to minimize the need to search for the content on the web site. The wording they use is an interesting indication of the web surfing custom of 1997; using a web directory to find the web site, then searching the site for content. Gill, Erramilli, Chaintreau, Krishnamurthy, Papagiannaki & Rodriguez (2013:1) conceded that: “little is known about the economics of online advertising, chiefly the economics of collecting and using personal information of users for facilitating targeted advertising.” Their research suggests that Google is a dominant data aggregator, with Facebook taking second place, and that the top 5 % of aggregators account for 90 % of the advertising revenues.

Yan, Liu, Wang, Zhang, Jiang & Chen (2009) studied what effect behavioral targeting could have on the click-through-rate (CTR) of sponsored search ads. Most importantly, they showed that “users who clicked the same ad will be more similar than the users who clicked different ads” (Ibid:270), which is one of the key ideas behind the effectiveness of targeted advertising. They also showed that it could be possible to increase the CTR by as much as 670%, if the targeted user segment was selected based on suitable clustering algorithms. The caveat here is of course that these suitable clusters were identified only after the ads had already run, by choosing the clusters with the highest CTR without any a priori segmentation; in other words, optimization after the fact.

Beales’ (2010) early research into the value of behavioral targeting indicated that advertising rates (shared by the ad networks and publishers) more than doubled when providing behavioral targeting. This is, on the other hand, mirrored by a more than doubling of the advertising effectiveness, i.e. conversion rates, compared to regular non-targeted advertising. Mayer & Mitchell (2012) critiqued these results by pointing out that the participating companies were aware of the purpose of the study, the behavioral targeted ads were not tested against another targeted option, and that the resulting increase in effectiveness and price actually drives down the marginal value for the advertisers. Farahat & Bailey (2012) shared a critical view on the added value of targeting, as they also considered how the choice of target group might in itself skew the results. They even suggested that more sophisticated targeting might harm the advertiser and called for further research on the subject.

Beyond the financial cost/benefit analysis, Anand and Shachar (2009) suggested that the actual targeting of an ad can in itself serve as a net positive signal towards the consumer. This idea is constrained by the assumption that the consumer knows that the advertiser is targeting them based on their interests; i.e. “why would they spend money if they did not expect it to work for me”. As the current European self-regulatory framework 7 8, developed for transparency and consumer-friendly principles, requires the use of an “Advertising Option Icon” in or near any advertising that uses behavioral targeting, compliance might actually improve the ad effectiveness.

From the advertising industry point of view, IHS Markit (2017) analyzed the economic value of behavioral targeting on behalf of two European non-profit advertising lobbyist organizations, IAB Europe and EDAA. They reported that behavioral data is present in 66% of the €16 billion European digital display market, with an average efficiency (click-through-rate, CTR) uplift of 530%. Johnson (2013:2) came to the conclusion that “[…] improved targeting raises the profits of all firms. This is despite the facts that consumers endogenously adjust their advertising avoidance decisions and that competition may increase for some ad inventory.”

2.4 Online tracking prevalence

In the literature review it was found that Roesner et al. (2012) identified over 500 unique 3rd party trackers when analyzing the 500 most popular websites on Alexa9. They also found that the most prevalent cross-site tracker was the Google-owned Doubleclick advertising platform, which could record user visits from almost 40% of the top 500 pages. The tracker data gathered from Alexa’s top 500 sites was compared to actual user behavior, based on AOL search queries.

Li et al. (2015) on the other hand analyzed the Alexa top 10k sites using machine learning, and only found Google (.com and Doubleclick) on 25% of the sites. Compared to other sources, even earlier sources such as Roesner et al. (2012), this number seems low. Li et al. (2015) did point out that their numbers did not take into consideration Google Analytics, because:

“[…] by contract, Google Analytics provides statistics only to the 1st party websites and the cookies set by Google Analytics are always associated with the domains of the 1st party websites and therefore are not 3rd party cookies. Furthermore, the same user who visits different websites monitored by Google Analytics will likely receive different IDs, which makes tracking him or her non-trivial.” (Li et al. 2015:9)

7 https://www.iabeurope.eu/wp-content/uploads/2016/05/2013-11-11-IAB-Europe-OBA-Framework_.pdf

8 https://www.iab.com/wp-content/uploads/2015/06/OBA_OneSheet_Final.pdf

9 https://www.alexa.com/

This does not, however, explain the difference in Google’s prevalence. Beyond that, cross-domain tracking has since been made available and it is possible to share Google Analytics data with Google, as discussed at the end of chapter 2.1 What is online tracking?.

Englehardt & Narayanan (2016) measured tracking behavior over a very large sample of 1 million top sites provided by Alexa using the OpenWPM web privacy measurement tool they created. They discovered an astonishing long tail of 81,000+ third parties that were present on at least two websites, out of which only 123 were found on more than 1 % of the sites. The results are segmented into Tracking Context and Non-Tracking Context, based on whether or not a consumer privacy tool would have blocked the resource. They do concede that “[e]very third party is potentially a tracker” (Ibid:7), but that they want to provide a more conservative definition. They also note that by including internal pages, instead of only the home pages of the 1 million websites, the average number of trackers would have increased from 22 to 34, indicating that their results were lower than what a real user would experience. Figure 7 shows the top third parties found in the 1 million sites, with Google represented in all top 5 trackers, and in 12 of the top 20. Figure 8 shows the top organizations behind the 3rd party resources, reiterating Google’s dominance, followed by Facebook, Twitter, Amazon, and Adnexus.

Figure 7 Top third party coverage on the top 1 million Alexa sites (Englehardt & Narayanan 2016:8)

Figure 8 Organizations with the highest 3rd party presence on the top 1 million Alexa sites (Englehardt & Narayanan 2016:9)

2.4.1 Longitudinal research

One of the most extensive longitudinal investigations of online tracking was presented by Lerner, Simpson, Kohno & Roesner (2016) in the suitably named Internet Jones and the Raiders of the Lost Trackers: An Archaeological Study of Web Tracking from 1996 to 2016. Analyzing each year’s top 500 sites from the Wayback Machine archives, they created historical context for the growth of online tracking, both for the reach of different trackers and the number of separate trackers per web site. Using the Wayback Machine10 as the data source was described as underestimating the overall situation, as some scripts and resource calls do not function properly. Even with this handicap, the growth of 3rd party requests was clearly indicated in their results.

10 https://archive.org/web/

Figure 9 Historical 3rd party coverage by Lerner et al. (2016:1008)

Figure 9 shows how 3rd party coverage of top trackers has evolved between 1996 and 2016. The first panel shows the coverage of 3rd parties that are confirmed trackers, whereas the second panel shows the coverage for all 3rd parties, based on data from the Wayback Machine archives. The third panel, on the right, shows the contemporary 2016 view of confirmed 3rd party trackers. (Ibid)

Figure 10 Distribution of 3rd party requests (Lerner et al. 2016:1007)

The second prevalent longitudinal shift evident in the study by Lerner et al. (2016) is the ever-increasing number of 3rd party requests per site, as can be seen in Figure 10. The maximum number of 3rd party requests, which was labeled as underestimated because of the Wayback Machine data, was recorded for a site in 2015 that contacted 34 separate 3rd parties.

As a summary of their research, Lerner et al. (2016) stated that:

“We have uncovered trends suggesting that tracking has become more prevalent and complex in the 20 years since 1996: there are now more unique trackers exhibiting more types of behaviors; websites contact increasing numbers of third parties, giving them the opportunity to track users; the scope of top trackers has increased, providing them with a broader view of user browsing behaviors; and the complexity and interconnectedness of the tracking ecosystem has increased markedly.” (Lerner et al. 2016:1009)

Similar results are found in Krishnamurthy & Wills’ (2009) longitudinal privacy diffusion study, which shows that during the timeframe of October 2005 to September 2008, the top-10 most prolific 3rd party trackers’ prevalence grew from 40 % to 70 %. A key feature in the growing coverage of top trackers was also the acquisitions that reduced the number of independent trackers, increasing the dominance of the top five companies: Google, Omniture, Microsoft, Yahoo, and AOL.

During the writing of this thesis, arguably the most comprehensive ongoing research was presented by Karaj, Macbeth, Benson & Pujol (2018), who all work for Cliqz GmbH, the company behind the Cliqz11 privacy browser and the Ghostery12 browser privacy addon. They analyzed a dataset of over 780 million pageloads from a time period of 10 months using data provided by over 500 000 users of the Cliqz and Ghostery browser extensions. The main advantage of their approach was the use of actual user browsing data, whereas almost all other studies rely on automated scripts that try to mimic browsing behavior. The downside, as Karaj et al. (2018) noted, is the loss of data granularity enforced by privacy constraints on the user-generated browser data. Some of their core findings were:

• Average number of trackers per site, as shown in Figure 11 - “89% of the traffic to the top 600 websites, contains tracking. On average a website loads about 10 trackers, and for each page load, doing 33 requests.” (Ibid:9)

• Tracking reach of the tracker-owning companies, as shown in Figure 12 - “3rd party scripts owned by Google are present in about 80% of the measured web traffic, and operate in a tracking context for more than half that time. Facebook and Amazon follow next.” (Ibid:10). They also noted that as the data was from a mostly German userbase, some German services, such as InfOnline, ranked in the top 10.

11 https://cliqz.com/en/

12 https://www.ghostery.com/

Figure 11 Distribution of average number of trackers across websites (Karaj et al. 2018:7)

Figure 12 Company tracker reach top 10 (Karaj et al. 2018:8)

The key takeaway from the work of Karaj et al. (2018) is the WhoTracks.Me13 website, where the team continues to publish monthly datasets and analysis. It is a valuable resource for anyone who is interested in the current state of the tracking landscape.

13 https://whotracks.me

2.4.2 Regional research

In this section, the central tracking research with a regional or geographical focus is reviewed.

Falahrastegar, Haddadi, Uhlig & Mortier published two separate papers in 2014 analyzing the tracking ecosystems from a regional perspective. In the first study (Falahrastegar et al. 2014A) they looked at the top 500 sites of the USA, UK, Australia, China, Egypt, Iran, and Syria. Their results showed clear regional differences, but with significant cultural or language-based similarities. Google’s dominance was evident and two Google properties, namely DoubleClick and Google Analytics, were the only ones found in the top 20 of every country researched.

Figure 13 Tracker country location heatmap (Falahrastegar 2014B:7)

In their second paper (Falahrastegar et al. 2014B) the authors continued their work and examined 3rd party tracking of the top 500 sites in 29 countries divided into geographical regions: North America, South America, Europe, East Asia, Middle East, and Oceania. The countries included Sweden and Norway, but not Finland. They found a very clear indication of strong local trackers in all regions and countries, as is shown in Figure 13. Overall, they found a strong presence of trackers from the USA, Russia, and Germany in most countries. In the Nordic countries, Sweden and Norway had a clearly similar tracker ecosystem. As for the actual trackers, Google-owned properties were by far the most prevalent in all regions, with Facebook, Amazon, Yahoo, and Twitter sharing the next spots. One clear exception to this rule was East Asia, where Baidu and Sina were the top contenders, with Facebook and Twitter completely outside the top 20. (Ibid)

Fruchter, Miao, Stevenson & Balebako (2015) researched if and how different privacy regulations affect the amount of web tracking in four countries: the US, Japan, Germany and Australia. They were unable to identify any clear relationship between privacy regulations and tracking from their data, as for example the US has much more tracker activity than Japan, even though they have similar regulation. Their paper suggested cultural or societal reasons but stated that more research is needed to verify this.

Purra & Carlsson’s (2016) research on tracking and HTTPS shows that the top 10,000 sites in Sweden and Denmark have a similar tracker hierarchy as the global top 10,000 sites, dominated by Google and followed up by Facebook and Twitter. Facebook and Twitter were more prominent in the News media category of sites, but much less so in other categories.

A study by Ruohonen & Leppänen (2017) is currently the only academic research paper investigating tracking prevalence from a Finnish perspective. They measured the number of 3rd party cookies on the top 206 Finnish sites, as identified by the TNS Metrix media measurement service. Their results are surprising, as can be seen in Figure 14. The top 3rd party cookies are placed by rubiconproject.com, with the Google-owned doubleclick.net coming in at fourth place. As the authors themselves note, this does not follow previous global results. For some reason, the authors counted the number of individual cookies, as can be seen from the 1,500+ cookies set by rubiconproject.com on just 206 sites in the data. Most other research on 3rd party prevalence measures the share of sites the 3rd party trackers are set on, not the number of cookies they set. It is not clear why this kind of cookie counting measurement was chosen, as one cookie per site is sufficient for tracking. By using a list provided by TNS Metrix, the authors made two choices, one apparently conscious, the other unclear: Finnish sites and Finnish sites working with TNS Metrix. Their research question set out their purpose to investigate “persistent third-party cookies used in the Finnish web”, meaning sites that “are primarily designed for Finnish users, using Finnish as the primary language for the content” (Ibid:1). This choice is apparently due to their interest in a company-level view of 3rd party tracking, where company refers to Finnish media companies. This might also explain the use of TNS Metrix, as their ranking apparently only includes media sites.

Figure 14 Summary of 3rd party cookies used in the Finnish web (Ruohonen & Leppänen 2017:2)

Whereas Ruohonen & Leppänen (2017) measured the number of cookies which 3rd parties placed on Finnish media sites, this thesis takes a different approach. The focus of this thesis is on the Finnish web user and on investigating which tracker domains and tracking organizations are prevalent on the sites that Finnish web users visit. By using a site list provided by Alexa, it is possible to identify the Finnish user’s global digital footprint, not limiting the data set only to include Finnish language media sites. In the results, the difference between Finnish and non-Finnish sites will also be explored.
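The difference between counting cookies and measuring the share of sites can be made concrete with a small sketch. The observations below are invented for illustration and do not come from any of the cited studies; the function names and the (site, 3rd party, cookie) record layout are likewise assumptions.

```python
from collections import Counter, defaultdict


def cookie_counts(observations: list) -> Counter:
    """Number of individual cookies set by each 3rd party (cookie counting)."""
    return Counter(third_party for _site, third_party, _cookie in observations)


def site_coverage(observations: list, total_sites: int) -> dict:
    """Share of measured sites on which each 3rd party appears at least once."""
    sites_by_party = defaultdict(set)
    for site, third_party, _cookie in observations:
        sites_by_party[third_party].add(site)
    return {party: len(sites) / total_sites for party, sites in sites_by_party.items()}


# Hypothetical measurement of four sites: (site, 3rd party domain, cookie name)
observations = [
    ("news.example", "adnetwork.example", "c1"),
    ("news.example", "adnetwork.example", "c2"),
    ("news.example", "adnetwork.example", "c3"),
    ("news.example", "adnetwork.example", "c4"),
    ("news.example", "adnetwork.example", "c5"),
    ("news.example", "analytics.example", "id"),
    ("shop.example", "analytics.example", "id"),
    ("blog.example", "analytics.example", "id"),
]

print(cookie_counts(observations))     # adnetwork.example leads with 5 cookies
print(site_coverage(observations, 4))  # analytics.example leads with 3 of 4 sites
```

In this invented data the two measures rank the 3rd parties in opposite order, which shows why results based on cookie counts are difficult to compare with results based on site coverage.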

2.5 Tracking measurement tools

In order to measure and analyze the online tracking landscape, a certain set of tools or services is required. When Englehardt, Eubank, Zimmerman, Reisman & Narayanan (2015) developed OpenWPM14, “a flexible, modular web privacy measurement platform” (Ibid:1), they identified three main components that make up the infrastructure of automated web privacy measurement:

1) Simulating users; being able to simulate web browsing behavior by visiting and interacting with websites.

2) Recording observations; being able to track and record how the websites react to a user browsing “(tracking, content, personalization)” (Ibid:2). Also defined as browser instrumentation.

3) Analysis; being able to analyze the previously recorded content, identify correlations, and provide conclusions.

14 https://github.com/mozilla/OpenWPM

Simulating users (1) can further be divided into two sub-categories: user type and browser type. User type describes the actual user running the browser, which can be an actual person (crowdsourced study) or a bot (automated scripts). Surprisingly, both approaches have their own ethical challenges; crowdsourcing user data can bring issues of privacy, whereas bots can cause ad costs and misrepresented campaign metrics. Using real users provides a more accurate picture of real browsing habits, whereas automated bots can be experimentally configured to help identify website reactions to very small browser or behavioral changes. Many websites also deploy some sort of bot detection, i.e. they identify and possibly block bot-based clients, in order to reduce the negative impact on advertising visibility, to prohibit fraud, or to prevent content extraction. This can impact the results depending on whether the measurements were gathered by actual people or by automated scripts. Browser type describes the technical framework of the browser used to visit websites. There are three main browser types: HTTP libraries, like curl or wget, that can download websites but do not support even basic browser functionalities such as JavaScript; lightweight browsers, like PhantomJS, that support most browser functionalities but do not run plugins; and actual browsers, such as Chrome or Firefox, that provide the full browsing experience. (Englehardt et al. 2015)

Recording observations (2) can be implemented through three main solutions: browser extensions, modified browser source code, and data collection layers, e.g. proxies. By using a browser extension, it is possible to use the website’s JavaScript for measurement or page interactions. By altering the browser source code, it is possible to gain access to all of the browser’s capabilities, not only JavaScript. Finally, by implementing a proxy or other similar data capturing service, it is possible to record and even alter all the network traffic between the browser and the website. (Englehardt et al. 2015)

The analysis (3) component is very use case-dependent. For larger or automated continuous research, a suitable data structure and automated analytics scripts are not only needed, but essential for comparable results between runs. In the case of their complementary work, Englehardt & Narayanan (2016) expressed an intent to continue working on their analysis platform, aiming to implement a service where analysts can save and share their scripts and results.
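A minimal sketch of how the recording and analysis components could fit together is shown below. It assumes that the recording step has already produced a list of requested URLs for a visited page (for example from a proxy log), and it approximates the site’s base domain with the last two labels of the hostname. That approximation is an assumption made for brevity; a real measurement would need a proper list of public suffixes. The URLs and function names are hypothetical.

```python
from urllib.parse import urlsplit


def base_domain(url: str) -> str:
    """Approximate the registrable domain as the last two hostname labels.

    Simplifying assumption: this is wrong for suffixes such as co.uk.
    """
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def third_party_domains(page_url: str, request_urls: list) -> list:
    """Return the domains contacted during a page load that differ from the
    domain of the visited page itself."""
    first_party = base_domain(page_url)
    contacted = {base_domain(url) for url in request_urls}
    return sorted(contacted - {first_party, ""})


# Hypothetical recorded requests for one page visit.
page = "https://www.shop.example/cart"
requests_seen = [
    "https://static.shop.example/app.js",    # same base domain: 1st party
    "https://cdn.tracker.example/pixel.gif",
    "https://ads.adnet.example/banner.js",
]

print(third_party_domains(page, requests_seen))
```

Repeating this for every site in a list and aggregating the results per 3rd party domain gives the kind of prevalence figures discussed in chapter 2.4.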

Most tracker research follows the architecture described above, although the tools might differ. For example, Falahrastegar et al. (2014A) used a combination of Python scripts, a Chrome extension, and Linux command line tools in order to gather their data for manual analysis. Roesner et al. (2012) developed a Firefox add-on (extension) called TrackingTracker that supported both automation and data collection, for their research on detecting 3rd party tracking on the web. Later, in Lerner et al. (2016), some of the same researchers developed a tool called TrackingExcavator, which could be used to gather a longitudinal view of how tracking has changed from 1996 to 2016 by using the Internet Archive’s Wayback Machine as a source for historical website content. Bujlow et al. (2017) listed some examples of current measurement tools, shown in Figure 15 below.

Figure 15 Tracking measurement tools (Bujlow et al. 2017:27)

The tracker measurement tool choices and reasoning for this thesis are presented and further described in chapter 3.3 Data collection and measurement tool.

2.6 Privacy

The other side of the online tracking economy is represented by the users that are tracked and targeted for advertising and personalization. There is a continuous “arms race” between the companies that want to use the data they have access to and the governments and agencies that try to find the right approaches for the difficult and layered online privacy topic. And in the middle of all of this is the consumer or user, trying to make sense of it all.

The Finnish Office of the Data Protection Ombudsman15 offers the following guidelines on data protection:

“Everyone has the right to the protection of personal data concerning him or her. Data protection is a fundamental right that safeguards the rights and freedoms of data subjects when personal data is processed.” (Data protection Ombudsman 2019A)

“All data related to an identified or identifiable person are personal data.” (Data protection Ombudsman 2019B)

“The processing of personal data refers to activities such as the collection, storage, use, transfer and disclosure of personal data. All activities involving personal data, from the planning of processing to the erasure of personal data, constitute processing of personal data.” (Data protection Ombudsman 2019C)

Mendel, Puddephatt, Wagner, Hawtin & Torres (2012) see privacy as a fundamental right supporting many other rights, but like many others, have difficulty actually defining what privacy necessitates. They present a duality of content for online privacy: what information is considered private or public and how this information is used, by whom and in which ways. They also present five new challenges that the Internet facilitates: 1) the collection of new kinds of information, previously unavailable, 2) the tracking of location as personal information, 3) large scale analytics by governments and private companies, 4) new free-to-use business models with data commercialization, and 5) the multifaceted challenges of regulating the global Internet. All of the above challenges can be seen as the technical and economical enablers for the themes discussed in this thesis; it is easier to collect, analyze and use user data, while it is more and more difficult to legislate because of the global span and the fast rate of technical progression.

15 https://tietosuoja.fi

This thesis does not take any specific stance on the nature of privacy or other risks involved with the tracking and gathering of user data, but key issues can be found in what Solove (2004) called The Aggregation Effect and Uncertain Future Uses. The Aggregation Effect describes how the value of information grows in a non-linear fashion as the amount of information grows. As the separate information transactions seem so small on the individual level, it is difficult for the user to correctly evaluate the scope and availability of his or her personal information. Or as Cohen (2000) famously said: “A comprehensive collection of data about an individual is vastly more than the sum of its parts”. This is even more true today than when Cohen wrote that line, as the data sharing and crunching capabilities currently available are vastly superior. Solove’s (2004) Uncertain Future Uses describes the risks involved with the uncertainty of potential future use of an individual’s personal information. It is impossible to guarantee, despite possible legislative efforts, that data gathered today will never be used beyond what the current privacy rules and regulations demand, be it because of data breaches, clandestine corporate concepts, or simple merger and acquisition.

A very interesting phenomenon within the privacy realm is the so-called privacy paradox. The privacy paradox describes the contradiction between how concerned people claim to be about their privacy and how they actually behave. Brown (2001) was one of the first studies in which this phenomenon presented itself, when people who were worried about privacy, and about who was using their data on the Internet, were at the same time willing to use loyalty cards for offline purchases. An in-depth discussion describing the privacy paradox can be found in Barnes (2006), looking at the difference between what is understood as a public versus private space, in relation to the then new and rapidly growing social media. In a comprehensive literature review by Barth & De Jong (2017), the authors identify many academically suggested explanations for the privacy paradox, mainly falling under two themes: 1) risk-benefit calculations and 2) little to no risk assessment. The first theme contains explanations relating to rational choice, resource exchange, immediate gratification, and habits. The second theme contains explanations relating to conformity, peer pressure, and incomplete information.

The EMC Privacy Index (EMC 2014), surveying global consumer attitudes on privacy, presents the privacy paradox as a key finding. Three separate paradoxes are presented: 1) consumers want all the benefits of new technologies, but do not want to sacrifice their privacy for it; 2) despite their expressed valuation of their privacy, consumers take almost no action to protect it, but rather place the responsibility on companies and government; 3) consumers share vast amounts of personal information on social media sites, even though they do not trust them to protect their privacy. Some examples of the privacy paradox are presented at the end of the next chapter, 2.6.1 Data collection, in research on how users value their privacy.

Finnish people are by no means isolated from the privacy paradox, as can be seen in a Finnish report on online privacy and anonymity by Sirkkunen & Haara (2017). Of the respondents, 68 % were worried about the growing amount of online data tracking and 76 % wanted to have a better understanding of what data was collected and for what purpose. At the same time, not many took the time even to acquaint themselves with the terms of use of the services they use (Facebook 63 %, Google 40 %, 38 %, and WhatsApp 36 %), and even these numbers were seen as implausibly high by the experts consulted for the report. Most respondents felt that they must give away their data in order to be able to use the services (64 %), and that the data would be collected anyway, whatever they do (69 %).

2.6.1 Data collection

“Since technique is omnipresent and, as an old sociological wisdom says, humans are a data producing animal, all tracks and traces left behind by human beings can be and are collected. As we have seen, since information is money and power, government and industry are using almost all means at their disposal for this data collection regime.” (Holvast 2009:37-38)

In their book, “Liquid surveillance: A conversation”, Bauman & Lyon (2012) discuss their views of the state of surveillance in our modern world. In their post-panoptical world, we all voluntarily leak personal information, be it for economic gain or out of plain negligence or indifference. This information then becomes the fodder that feeds the economic machinery:

“[…] what Bauman calls the post-panoptical world of liquid modernity much of the personal information vacuumed so vigorously by organizations is actually made available by people using their cellphones, shopping in malls, travelling on vacation, being entertained or surfing the internet. We swipe our cards, repeat our postcodes and show our ID routinely, automatically, willingly.” (Bauman & Lyon 2012:17)

“The gear for the assembly of DIY, mobile and portable, single-person mini-panopticons is of course commercially supplied. It is the would-be inmates who bear responsibility for choosing and purchasing the gear, assembling it and putting it into operation. Though the monitoring, collating and processing of the volatile distribution of individual synoptical initiatives once again requires professionals; but it is the ‘users’ of the services of Google or Facebook who produce the ‘database’ – the raw material which professionals remould into Gandy’s ‘targeted categories’ of prospective buyers – through their scattered, apparently autonomous yet synoptically pre-coordinated actions.” (Bauman & Lyon 2012:66)

This goes back to the notion of “if you are not paying for the product, you are the product”. As Hal Varian (2014), chief economist at Google, describes: “Google runs about 10,000 experiments a year in search and ads. There are about 1,000 running at any one time, and when you access Google you are in dozens of experiments” (Ibid:7).

This raises the question: how many of Google’s users are aware that they are taking part in dozens of experiments, or that Google also lets its partners, advertisers and publishers, experiment on its platform? At least for Finnish users, of whom almost 60% have skipped Google’s terms of service, this might come as a surprise (Sirkkunen & Haara 2017). The same lack of interest in privacy policies was identified earlier by Tuunainen, Pitkänen & Hovi (2009) in the context of Finnish users on the then still quite new Facebook social media platform. Social media privacy research by Liu, Gummadi, Krishnamurthy & Mislove (2011) similarly suggests that most users leave their privacy settings untouched (64 %) and that the settings they have only sometimes (37 %) match their expectations. Continuing with Facebook, Kramer, Guillory & Hancock (2014) can be seen as an example of the scary possibilities that social media enables. Working at Facebook, they notoriously tested their research hypothesis regarding emotional contagion by manipulating the Facebook content stream called the “News Feed” of almost 700 000 users without their knowledge, successfully influencing their emotional states.

Zuboff (2015) describes the opportunistic behavior of the technology firms and the “asymmetry of knowledge and rights” as a backdrop for what she calls surveillance capitalism. The speed at which the online surveillance landscape has changed was missed by all but the experts, and it took a long time before social pushback and lawsuits emerged, leaving the field open for exploitation. Just by looking up the market value of companies like Google or Facebook, it is easy to see the role and value of Zuboff’s surveillance capitalism. Zuboff sees the development of what she calls an intelligent global organism, the Big Other:

“Data about the behaviors of bodies, minds, and things take their place in a universal real-time dynamic index of smart objects within an infinite global domain of wired things. This new phenomenon produces the possibility of modifying the behaviors of persons and things for profit and control. In the logic of surveillance capitalism there are no individuals, only the world-spanning organism and all the tiniest elements within it.” (Zuboff 2015:85)

From a practical use case point of view, an earlier survey by Turow, King, Hoofnagle, Bleakley & Hennessy (2009) showed a clear attitude against targeted online advertising among US users: 66% did not want ads tailored based on behavior on the current website the user is visiting. That number increased to 84% when they were told that the data used to tailor the ad was collected on other websites as well. Furthermore, 83% did not want their news content to be tailored based on their behavior on the current and other websites. This practice could enable the creation of personal “Filter Bubbles”, as defined in the book by Pariser (2011), where the online world becomes a self-fulfilling echo chamber, showing only content that induces a positive reaction. McDonald & Cranor’s (2010) similar work found that only a comparable 45% of US users would like to receive tailored online advertising, whereas 40% of respondents self-reported that they “would change their online behavior if advertisers were collecting data” (Ibid:1). The respondents understood that advertising made it possible to provide free content, but preferred that it should be done without their data. This will most likely not be the case, as Lerner et al. (2016:1009) found that “[…] over time, more third parties are in a position to gather and utilize increasing amounts of information about users and their browsing behaviors.”

A contradiction to the above findings on claimed privacy issues in online advertising can be found from studies on the valuation of personal information, e.g. selling or protecting personal information. Spiekermann, Grossklags & Berendt (2001) demonstrated that even though people might report privacy as an important issue to them, their behavior does not mirror this concern. Even users identified as “privacy aware” shared this trait. Grossklags & Acquisti (2007) found that people are willing to share their personal information for very little compensation and are much more likely to sell their data than to pay to protect it. In a study by Beresford, Kübler & Preibusch (2012), users were presented with two online stores from which to purchase a DVD, one with the regular price and the other with a 1 Euro discount but requiring more personal information. Almost all participants that purchased the DVD chose the discounted version requiring the additional information. Surprisingly, after removing the 1 Euro discount, half of the purchases still went through the store requiring extra information, even though the users were dissatisfied with divulging more information. Tsai, Egelman, Cranor & Acquisti (2011) on the other hand concluded that users were more likely to purchase from online stores with salient privacy information and were even prepared to pay a premium price for improved privacy. Carrascal, Riederer, Erramilli, Cherubini & de Oliviera (2013) discovered that users value offline personal information and certain online data, such as photos or financial transactions, higher than merely the fact that they had visited certain websites. They also identified a mismatch in the user attitudes concerning the utilization of their personal information: “[…] while users are overwhelmingly in favor of exchanging their PI in return for improved online services, they are uncomfortable if these same providers monetize their PI” (Ibid:1). 
All this supports the concept of the privacy paradox, as discussed in the previous section.


2.6.2 Data processing

“In the Information Age, personal data is being combined to create a digital biography about us. Information that appears innocuous can sometimes be the missing link, the critical detail in one’s digital biography, or the key necessary to unlock other stores of personal information.” (Solove 2004:44)

Solove’s (2004) Aggregation Effect and Uncertain Future Uses, presented in the previous section, are incorporated in Solove (2006), where he introduces five forms of information processing, i.e. the storage and use of collected data:

1) Aggregation; the gathering and combining of personal information from many different sources, creating an even more comprehensive profile of the person. The capabilities to aggregate data have grown rapidly in scope and ease of use, in the wake of new computer technologies. Similar to the Aggregation Effect of Solove (2004).

2) Identification; creating a link from the personal information to the actual person, making it true information.

3) Insecurity; when aggregated and identified personal information is available, security risks regarding how the information is handled and protected arise. This puts people at risk for future harm as, for example, these kinds of data collections are treasure troves for identity thieves.

4) Secondary use; using data for purposes or in ways that were not originally intended or approved by the user. If one does not have confidence that the consent given binds the data gatherer, what is the role of consent? Similar to the Uncertain Future Use of Solove (2004).

5) Exclusion; not informing users about or giving users access to their data, or not accepting corrections to the data. Not being sure what is known, who knows, and what is done with this information places individuals in a state of uncertainty and vulnerability.

All of the five forms of data processing presented by Solove (2006) function as a background to why the theme of tracking and data discussed in this thesis is relevant to the Internet audience as a whole. The rest of this section presents examples of research and points of view that illustrate this relevance.

Using data that Netflix16 published for their 2005 “Netflix Prize”, a crowdsourced competition to improve their recommendation algorithm, Narayanan & Shmatikov (2008) showed how easy it is to de-anonymize allegedly anonymous data. With only the movie titles, eight user-assigned ratings and approximate dates (+/- 14 days) of those ratings, they were able to uniquely distinguish 99 % of the users in the Netflix data. With a 3-day margin on the date of the ratings, only 2 ratings were needed for a 68 % identification. Further, by obtaining a small sample of only 50 users on the online movie rating platform IMDB17, they were able to identify two of the users as belonging to the Netflix data, with a very high statistical probability. In a similar but more hands-on fashion, Barbaro, Zeller & Hansell (2006) used what AOL18 had published as anonymized search engine data to show how easily search terms can be used to re-identify people. They found user 4417749, a 62-year-old widow in a town called Lilburn (Georgia, USA), based simply on her search history. Further, as Jones, Kumar, Pang & Tomkins (2007:5) concluded: “Removing classes of identifying terms such as names, digits and places is not sufficient to prevent attacks which use a combination of techniques.” Considering the fact that these re-identifications of users were made by researchers and journalists with what was thought to be anonymous data, one can only imagine how simple re-identification would be for those entities with access to the full data.

Kosinski, Stillwell & Graepel (2013) used 58 000 volunteers to demonstrate how easy it is to predict psychodemographic profiles based on Facebook likes. By accessing the volunteers’ Facebook likes, the researchers were able to automatically and accurately predict sexual preference (88 % for men, 75 % for women), ethnicity (African American vs. Caucasian American 95 %), political parties (Democrat vs. Republican 85 %), and religion (Christianity vs. Islam 82 %), among others.

“In essence, our machines, armed with our data, can increasingly figure things out about us beyond any previous level, and completely unaccounted for in law, policy or even basic awareness among the general public. What this kind data analysis can reveal about a person is not only far reaching; it may include information that the person herself did not know.” (Tufekci 2015:211)

16 https://www.netflix.com/
17 https://www.imdb.com/
18 https://www.aol.com/

The Center for Democracy & Technology (2011) defines 3rd party online behavioral advertising as:

“[…] the collection of data about a particular user, computer, or device, regarding web usage over time and across non-commonly branded websites for the purpose of using such data to predict user preferences or interests and to deliver advertising to that individual or her computer or device based on the preferences or interests inferred from such web viewing behaviors.” (Center for Democracy & Technology 2011:5)

Online behavioral advertising, or OBA, is a complex use case both technically and from a user perspective. Ur, Leon, Cranor, Shay & Wang (2012) report that “participants found OBA smart, useful, scary, and creepy at the same time” (Ibid:1) and that “many participants liked that OBA would show them more useful ads, yet they were concerned about privacy” (Ibid:6). Rao, Schaub & Sadeh (2015) found that users feel both surprised and concerned when shown the content and extent of profile data linked to cookies on their devices. Johnson (2013) found that users’ attitudes towards targeted advertising may be quasiconvex, U-shaped, where early targeting efforts might annoy users but that eventually users would appreciate the increased ad accuracy. Watson (2014) describes this as the “Uncanny Valley of Personalization”, when one becomes unsure whether the targeting algorithms are wrong or if they know something one does not know oneself: “Personalized ads and experiences are supposed to reflect individuals, so when these systems miss their mark, they can interfere with a person’s sense of self.” (Ibid). The term “Uncanny Valley of Personalization” is a play on the concept of “Uncanny Valley”, first described by the Japanese roboticist Masahiro Mori in 1970 (Mori, MacDorman & Kageki 2012:98): “a person's response to a humanlike robot would abruptly shift from empathy to revulsion as it approached, but failed to attain, a lifelike appearance.”

Further, in regard to data accuracy, Rao et al. (2015) studied cookie-based behavioral profiles from Bluekai Registry, Google Ad Settings, and Yahoo Ad Interests. The term “Cookie-based” entails that the data that has been gathered can be linked to a specific cookie on a user’s device and that the user can access the profile data using that same device, without the need to log in. Only a small number (6%) of the participants in the small online survey documented by Rao et al. (2015) felt that their profiles were accurate, whereas 17% reported empty profiles and 45% reported inaccurate profiles. Their research also presented, based on semi-structured in-person interviews, an interesting duality regarding the quality of the profile data: in general, accurate profile data caused concern whereas inaccurate data did not. Even when presented with sensitive data (credit and income), the inaccuracy made the respondent feel unconcerned. Only one of the participants noted that “he would be concerned about incorrect data if it was used to make adverse decisions about him” (Ibid:7). Watson (2014) describes a virtual self, living in the aggregated data centers as the “data doppelgänger”, composed of “my browsing history, my status updates, my GPS locations, my responses to marketing mail, my credit card transactions, and my public records” (Ibid).

3 RESEARCH METHODS

This chapter describes the chosen research approach using three themes. The first theme presents the need and validation for using secondary data provided by Alexa and Disconnect (3.2 Input data gathering). The second theme presents the tracking measurement tool Tracker Tracker and the Ghostery browser extension (3.3 Data collection and measurement tool). The third theme discusses the data gathering process and some key considerations and choices regarding the data evaluation process and analysis (3.4 Data gathering considerations, 3.5 Data gathering process, 3.6 Data categorization, 3.7 Data interpretation choices).

3.1 Research design

The purpose of this thesis was to examine the prevalence of online trackers on sites that Finnish web users most frequently visit. The research is comparable to many other tracking studies, albeit with a different user-based geographical scoping. Thus, the empirical research of the thesis follows a similar approach to that of current online tracking research, as described in many references in 2.4 Online tracking prevalence (e.g. Roesner et al. 2012; Falahrastegar et al. 2014A; Falahrastegar et al. 2014B; Englehardt & Narayanan 2016; Karaj et al. 2018). The approach is quantitative in nature, as it “examines relationships between variables, which are measured numerically and analysed using a range of statistical and graphical techniques” (Saunders, Lewis & Thornhill 2015:166).

The research approach involved three main choices:

1) Target group selection; Which sites to investigate.

2) What to measure; Which tracking mechanisms to identify and which tracking measurement tool/s to use.

3) Which relationships to look for; How the data is analyzed.

These choices and their reasoning are constrained by the overall research aim in each case. The choices made for this thesis and their implications are described in the following sections of this chapter.

3.2 Input data gathering

The research approach of this thesis requires two sets of secondary data, as described by Saunders et al. (2015): the list of the top 500 sites which Finns frequent and the connection between a tracker and the organization that owns it. For the purposes of this thesis, Alexa was chosen for the top 500 sites list and Disconnect was chosen for the tracker-owner relationship. The next two sub-sections will describe and validate these two data choices.

3.2.1 Alexa

Alexa.com is an Amazon-owned web service that provides tools and data for marketing and analytics purposes. One key functionality for academic research is the worldwide tracking of popular websites called Top Sites. These provide a daily updated list of top sites globally or per country. The Alexa top sites list is based on web usage from panel members and certified contributing sites19. The top site lists are the most utilized site lists in tracking research because, as Libert (2015) describes it:

“The best population of sites was determined to be the top one million sites list published by Alexa. Alexa is a subsidiary of Amazon who provides website traffic metrics and rankings derived from a “panel of toolbar users which is a sample of all [I]nternet users”. The degree to which Alexa’s data accurately represents popularity on the web is debatable, as Alexa Toolbar users are by no means a random sample of all Internet users. However, it has become common practice for researchers to use the Alexa list in the absence of other, better lists […]” (Libert 2015:4)

When comparing tracker popularity rankings between Alexa’s top 500 sites and sites identified through AOL search query responses, Roesner et al. (2012:10) found that “…in general, the ranking of the trackers in the top 500 corresponds with how much real users may encounter them”. This would indicate that the Alexa top 500 list can be used as a decent web usage proxy when analyzing tracker prevalence. Another source for top websites of Finnish users could have been the now discontinued TNS Metrix20 (or the new media measurement service provided by FIAM21), but they only provide data on websites that have added the TNS tracker code, which means mainly Finnish media websites. In order to get an accurate picture of the Finnish user’s browsing habits, including websites that are non-media or non-Finnish, the Alexa dataset is clearly a better option. The data which TNS Metrix provides can still be used to validate the internal ranking of Finnish media websites, i.e. the top 10 websites on TNS can be found in a similar order in the data provided by Alexa.

Beyond the research of Roesner et al. (2012) and Libert (2015), it is clear that Alexa has become the main source for site lists when it comes to tracker research; e.g. Casteluccia, Grumbach & Olejnik (2013), Falahrastegar et al. (2014A, 2014B, 2016), Fruchter et al. (2015), Metwalley, Traverso, Stefano, Mellia, Miscovic & Baldi (2015), and Lerner et al. (2016).

19 https://support.alexa.com/hc/en-us/articles/200449744-How-are-Alexa-s-traffic-rankings-determined-
20 http://tnsmetrix.tns-gallup.fi/public/
21 https://fiam.fi/tulokset/

3.2.2 Disconnect

In order to map the tracking landscape based on what trackers can be found on websites, it is crucial to identify the organizations behind the trackers. For some trackers, ownership is quite evident based on the name, e.g. Google Analytics is owned by Google. For other, well known trackers, the link can easily be created with some industry knowledge, e.g. DoubleClick is also owned by Google. But this is definitely not the case for all trackers, especially if the measurement tool used only provides the tracker domain and tracking script, e.g. “https://trackerexampledomain.com/tracking.script”.

Libert (2015) performed manual detective work in order to identify the tracker owners, but for this thesis, a pre-compiled list was chosen. The link between trackers and their owners can be found using the source code of a tracking detection application called Disconnect. Disconnect is open source and has published some of their applications on GitHub. For this thesis, the Disconnect tracking protection list, available on GitHub22, is used to identify ownership of the tracker’s domains and then to cross-reference the data from the measurement tool. Similar approaches have also been used in previous research: Englehardt & Narayanan (2016) combined the manual work by Libert (2015) and the Disconnect list for mapping domains to organizations, whereas Falahrastegar et al. (2014B) used similar reference data available from the Firefox addon Collusion.
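As a minimal sketch of this cross-referencing step, the following Python snippet inverts a nested category, organization, and domain structure into a flat domain-to-owner lookup. The sample data, its exact layout, and the function names are illustrative assumptions; they do not reproduce the actual Disconnect file or the procedure used in this thesis.

```python
# Hypothetical excerpt shaped like a tracking protection list (assumed layout):
# category -> list of {organization: {homepage: [domains]}}.
SAMPLE_LIST = {
    "categories": {
        "Advertising": [
            {"ExampleAds": {"https://exampleads.test/": ["exampleads.test", "ads-cdn.test"]}},
        ],
        "Analytics": [
            {"ExampleMetrics": {"https://examplemetrics.test/": ["examplemetrics.test"]}},
        ],
    }
}


def build_owner_index(tracking_list):
    """Return a dict mapping each tracker domain to its owning organization."""
    index = {}
    for entries in tracking_list["categories"].values():
        for entry in entries:
            for organization, homepages in entry.items():
                for domains in homepages.values():
                    for domain in domains:
                        index[domain] = organization
    return index


def find_owner(domain, index):
    """Look up a domain, falling back to its parent domains (a.b.c -> b.c)."""
    parts = domain.lower().split(".")
    for start in range(len(parts) - 1):
        candidate = ".".join(parts[start:])
        if candidate in index:
            return index[candidate]
    return None  # no known owner for this domain


owner_index = build_owner_index(SAMPLE_LIST)
print(find_owner("static.ads-cdn.test", owner_index))
print(find_owner("unknown.test", owner_index))
```

The parent-domain fallback is a design choice of this sketch: trackers are often served from subdomains, so matching only on the exact host would leave many observations without an owner.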

3.3 Data collection and measurement tool

The core architecture choices related to data collection for tracking measurement research are discussed in 2.5 Tracking measurement tools. A perfect measurement tool should be able to simulate a user browsing the site, record the observations of 3rd party trackers, and analyze the results. There were a few different solutions available, but most only handle some of the architecture needed, and the work of implementing the more elaborate tools would place them out of the scope of a master’s thesis; setting up a virtual machine, installing and configuring a collection and measurement tool like OpenWPM would put the focus of the thesis in the technical realm, instead of focusing on the chosen research questions. Using a browser, like Chrome or Firefox, with a suitable browser extension and manually surveying the chosen sites would have taken too much time and would have introduced an unnecessary risk of data corruption, when manually recording the results of each site.

22 https://github.com/disconnectme/disconnect-tracking-protection

In order to find a suitable balance between gathering the empirical data and the actual analysis, the measurement tool should be easy to implement and should focus on automating the first two architectural components, the site browsing and observation recording. As the research conducted in this thesis does not have a longitudinal component, there was little reason to build a library of or a framework for analytics scripts that would support repeated runs. Based on these prerequisites, the Tracker Tracker measurement tool was identified as the most suitable tool for this thesis, as it provided an easy-to-use web interface for 3rd party tracking data collection and measurement, and could automatically browse up to 100 sites, with subpages, in one request. The next section describes the tool and how it was used.

3.3.1 Tracker Tracker

The Tracker Tracker tool23 24 was created by participants at the Digital Methods Initiative25 Winter School 201226. They built a web service on top of PhantomJS, a scriptable headless browser, using the Ghostery browser extension for tracker detection. As such, it does not “click on any buttons, accepts no cookies, and cannot move beyond logins, cookies, or paywalls” (Digital Methods Initiative 2014).

The Tracker Tracker tool gives anyone with a web browser and a list of websites the possibility to identify 3rd party calls on those websites. The tool then gives the user the choice of how the result data should be formatted and a simple view of previously run tests. The rest of this section describes how to use the Tracker Tracker tool and the results it provides.

23 https://tools.digitalmethods.net/beta/trackerTracker/
24 https://wiki.digitalmethods.net/Dmi/ToolTrackerTracker
25 https://wiki.digitalmethods.net/Dmi/DmiAbout
26 https://wiki.digitalmethods.net/Dmi/DmiWinterSchool2012TrackingTheTrackers

Figure 16 The Tracker Tracker tool's user interface

The Tracker Tracker web user interface can be seen in Figure 16. The tool is used as follows:

1) Write or paste the full URL of the site, sites, or pages to examine

2) Select how deep the measurement should go, i.e. how many in-site links the tool should follow. This is a good addition, as some trackers are not used on the front page.

3) Name the result set, making it easier to identify and retrieve from the ‘Past Jobs’ menu item.

4) Run the tool by clicking the “Track trackers” button

5) When the results are ready, the Output link turns dark and displays the different result format options, as shown in Figure 17 below

Figure 17 The Tracker Tracker data output options

The Tracker Tracker tool offers 6 different result formats:

1) CSV (exhaustive); All tracker data in a tab-delimited csv file. This format gives the user the most room to maneuver, as it retains all data from the Ghostery output. This is the format used in this thesis and presented in Figure 18.

2) CSV (per host); Only one row per URL is returned, listing the trackers that were observed with a counter of how many times separate trackers were observed.

3) GEXF (Gephi); The Graph Exchange XML Format that describes network structures. Used by the Gephi project, which is a data visualization tool27

4) HTML; Displays the same data as the CSV (exhaustive) format, but in HTML

5) # of trackers per host (HTML); Similar to CSV (per host) but also returns how many times the trackers were observed per URL examined. Shown in HTML

6) # of trackers per host (CSV); Like CSV (per host) but also returns how many times the trackers were observed per URL examined
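To illustrate how the exhaustive format relates to the per-host formats, the sketch below aggregates tab-delimited rows into tracker counts per host. The sample rows are made up, and the column names "host" and "name" are assumptions about the header of the file; this is not the tool's own code.

```python
import csv
import io
from collections import Counter, defaultdict

# Made-up tab-delimited rows; the column names "host", "path" and "name" are assumed.
SAMPLE_TSV = (
    "host\tpath\tname\n"
    "site-a.test\t/\tTracker One\n"
    "site-a.test\t/news\tTracker One\n"
    "site-a.test\t/news\tTracker Two\n"
    "site-b.test\t/\tTracker Two\n"
)


def trackers_per_host(tsv_text):
    """Count how often each tracker name was observed on each host."""
    counts = defaultdict(Counter)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        counts[row["host"]][row["name"]] += 1
    return counts


per_host = trackers_per_host(SAMPLE_TSV)
for host, trackers in sorted(per_host.items()):
    print(host, dict(trackers))
```

Starting from the exhaustive rows keeps every observation available, so other aggregations (per page path or per tracker type, for example) can be derived later without rerunning the measurement.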

27 https://gephi.org/gexf/format/

Figure 18 Example Tracker Tracker result set converted from csv to Excel format

Figure 18 shows an example CSV (exhaustive) result set converted from tab-delimited csv to Excel. The columns that are relevant for this thesis are:

• host; The web site examined

• path; The page path under the web site that is examined.

• type; The 3rd party resource’s category, as determined and categorized by the Ghostery tool. This categorization is further discussed in 3.6 Data categorization.

• patterns; The pattern the Ghostery tool used to identify the 3rd party resource. In this thesis, the patterns are cleaned up and reformatted to contain only their domains, e.g. “google-analytics\.com\/analytics.js” becomes “google-analytics.com”. The resulting domain information is then used to identify ownership using the Disconnect dataset, as described in 3.2.2 Disconnect.

• name; The 3rd party resource’s name, as determined and labeled by Ghostery

• affiliation; the naming of this column might indicate some sort of relationship or ownership of the tracker, but no data was available in any of the results checked.
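The clean-up of the patterns column could be sketched as follows. The function name and the regular expression are assumptions made for illustration; the sketch only handles simple escaped URL patterns like the example above and returns None for anything more complex, such as patterns containing alternation groups.

```python
import re

# Matches a plain host name such as "google-analytics.com".
DOMAIN_RE = re.compile(r"^(?:[a-z0-9-]+\.)+[a-z]{2,}$")


def pattern_to_domain(pattern):
    """Reduce a simple escaped URL pattern to its domain, or return None."""
    text = pattern.replace("\\", "")  # drop regex escapes: "\." becomes "."
    text = re.sub(r"^[a-z]+://", "", text, flags=re.IGNORECASE)  # drop a scheme, if any
    host = text.split("/", 1)[0].lower()  # keep the part before the first "/"
    return host if DOMAIN_RE.match(host) else None


print(pattern_to_domain(r"google-analytics\.com\/analytics.js"))
print(pattern_to_domain("https://trackerexampledomain.com/tracking.script"))
```

Returning None instead of a best guess makes the unparsed patterns easy to list and review by hand before the ownership lookup.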

3.3.2 Ghostery

Ghostery is a privacy extension for modern web browsers. It detects 3rd party JavaScript and other tracking tools based on a database of known trackers and enables the user to block them if needed. Figure 19 shows an example screenshot of the Ghostery extension running on a Chrome browser identifying 93 trackers on the mashable.com frontpage.

Figure 19 mashable.com in Chrome with the Ghostery extension open (screenshot from 17.9.2015)

During the writing of this thesis, the team behind Ghostery have themselves published a paper on tracking prevalence (Karaj et al. 2018) and set up a website for their continuing work on the subject28. The website might be of great utility for future researchers of tracking prevalence, as discussed at the end of 2.4 Online tracking prevalence.

Even though Ghostery is used for ad and tracker blocking, it itself shares data with 3rd parties. The data on which trackers users encounter online was anonymized and sold to companies that use the data for web security and optimization (Greenberg 2016). As is the case with most ad and tracker blockers, at least those run by businesses, there needs to be a business model in the background. In Ghostery’s case, data sharing is voluntary and can be turned off.

3.4 Data gathering considerations

When using a stripped-down browser, like Tracker Tracker’s PhantomJS utilized in this thesis, Englehardt et al. (2015:4) recommended running “explicit checks to see if it accurately mimics a full-fledged browser”. In preparation for this thesis, when searching for a suitable tracking measurement tool, Tracker Tracker’s capabilities were compared to a full-fledged Chrome browser with the Ghostery extension installed. The results proved to be similar enough to be a reliable source of tracking data. Before the actual data gathering runs, the Tracker Tracker tool’s results for 10 separate sites were compared to the results of Chrome with Ghostery run on the same sites, with similar results. All measurement tools come with some technical limitations and Tracker Tracker is no exception to this rule.

28 https://whotracks.me/
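One way to quantify such a comparison is the Jaccard similarity between the tracker sets that two tools report for the same site. The sketch below uses made-up observations and is only an illustration; the thesis does not state which similarity measure, if any, was applied.

```python
def jaccard(a, b):
    """Share of items found by both tools among items found by either tool."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty result sets agree trivially
    return len(a & b) / len(a | b)


# Made-up tracker observations per site for two hypothetical measurement setups.
headless = {"site-a.test": {"t1", "t2", "t3"}, "site-b.test": {"t1"}}
full_browser = {"site-a.test": {"t1", "t2", "t4"}, "site-b.test": {"t1"}}

scores = {
    site: jaccard(trackers, full_browser.get(site, set()))
    for site, trackers in headless.items()
}
for site, score in sorted(scores.items()):
    print(site, score)
```

A score of 1.0 means both setups saw exactly the same trackers on a site, while lower scores flag sites worth inspecting manually.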

It is also important to note that when analyzing website content and especially linked content such as 3rd party requests, the results are always snapshots of that exact moment, browser, web page, and other environmental factors. Even the page content can vary for multiple reasons, from simple edits to adaptive content, but for this thesis the central issue is the way in which online advertising is deployed.

Figure 20 Simple ad serving process

When a browser loads up a web page that contains advertising elements, the advertising content is usually requested from an ad server. Figure 20 depicts a simplification of the ad serving process:

1) The web page is loaded and a script requests content from the ad server.

2) The ad server decides what content to show for that specific page and browser, and returns the ad URL to the browser.

3) The browser requests the advertising element content from the ad content provider.

4) The advertising content is returned, and the browser populates the advertising element.

Because of the nature of the online advertising architecture, the list of 3rd party trackers can differ between every page refresh. When a web page is loaded, programmatic advertising tags will request a real-time auction for the available ad spaces. As the winners of these auctions can vary between each request, there is theoretically no way to identify all possible participants of the auction unless one has direct access to the actual auction system. Depending on how the ad server is set up, the advertising content shown on the page can also come from multiple sources. The content might be hosted on the ad server or on another ad server owned by yet another 3rd party, possibly operated by the advertiser or their advertising or media agency.

Another factor to consider is frequency capping, which sets a maximum number of ad views for a certain browser. Theoretically, when the cap is reached, the same ad is not shown to that browser again, giving way for another advertiser to be shown and a tracking cookie to be set. Lerner et al. (2016) also identified the issue of 3rd party variation in their longitudinal research based on Wayback Machine data when analyzing the same Alexa Top 100 pages three times during August-September 2015, as shown in Figure 21. They stated that “…variation between runs even a week apart is notable…” (Ibid:1002).

Figure 21 Variability of observed trackers on Alexa top 100 sites (Lerner et al. 2016:1002)

Finally, it is also necessary to acknowledge the page/content structure of websites and the trackers within. The 1 million site census by Englehardt & Narayanan (2016) only measured the front pages of these sites, but according to their preliminary analysis, the inclusion of 4 subpages could have increased the average number of 3rd parties per site by over 50 % (34 vs. 24 trackers). The top 20 trackers were found on 6 % to 57 % more sites when also measuring the 4 subpages.

3.5 Data gathering process

The data gathering process for this thesis required many separate steps using different data, technologies, and tools. Following the data and tool choices described in previous sections of this chapter, the data gathering process for this thesis is presented below:

1) Download the Alexa Finnish top 500 list; Sign up for Amazon AWS and create a suitable user and policy29. Create an API call script or modify one of the examples provided by Amazon30. As the Alexa Top Sites API31 only returns results for 100 sites at a time, the script needs to make 5 incremental requests in order to get the full top 500 data.

2) Download the Disconnect tracker-owner reference files; Download the services.json files from GitHub32 and convert the json into a suitable format, if necessary. For this thesis, the data analysis was performed in Excel, which required considerable manual work before the .json data could be used.

3) Run Tracker Tracker; Use the sites returned by the Alexa Top Sites API in the first step. As with the Amazon API, Tracker Tracker also limits the service to 100 sites per request, so five separate runs are needed for the full top 500 sites. In order to sample some of the trackers that are not available on the front page, remember to choose to measure subpages as well. The data in this thesis was gathered using five subpages per host. Finally, in order to alleviate some of the issues identified in section 3.4 Data gathering considerations, related to the rotation of ads or ad auctions, it is recommended to run the measurements more than once in order to trigger a larger number of trackers. The data in this thesis was gathered by running the Tracker Tracker tool five times for all of the top 500 sites, resulting in a total of 25 runs: 5 repetitions for each of the 5 groups of at most 100 sites.

29 https://docs.aws.amazon.com/AlexaTopSites/latest/MakingRequestsChapter.html
30 https://aws.amazon.com/code/alexa-web-information-service-query-example-in-php/
31 https://aws.amazon.com/alexa-top-sites/
32 https://github.com/disconnectme/disconnect-tracking-protection/blob/master/services.json

4) Combine the csv files into one large Excel file; Add the data acquired from the previous steps into an aggregated Excel file: 1) site ranking information, 2) Tracker ownership information, and 3) tracking information per site.
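The batching and merging logic of the steps above can be sketched in Python as follows. This is an illustration only, not the scripts used for this thesis: the function names, the constants, and the CSV column names `site` and `tracker` are assumptions, as the actual Tracker Tracker output format is not reproduced here.

```python
import csv
import io
from itertools import product

GROUP_SIZE = 100   # maximum number of sites per request/run (steps 1 and 3)
REPETITIONS = 5    # each group of sites is measured five times


def split_into_groups(sites, size=GROUP_SIZE):
    """Split the ranked site list into consecutive groups of at most `size`."""
    return [sites[i:i + size] for i in range(0, len(sites), size)]


def run_plan(sites):
    """Return (repetition, group_index) pairs, one per measurement run."""
    groups = split_into_groups(sites)
    return list(product(range(1, REPETITIONS + 1), range(len(groups))))


def combine_runs(csv_texts, site_col="site", tracker_col="tracker"):
    """Merge the CSV output of all runs into unique site-tracker pairs.

    The column names are hypothetical; adjust them to the real file headers.
    """
    pairs = set()
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            pairs.add((row[site_col], row[tracker_col]))
    return pairs


# Hypothetical example data: two runs over the same site.
run_a = "site,tracker\nexample.fi,Tracker A\nexample.fi,Tracker B\n"
run_b = "site,tracker\nexample.fi,Tracker B\nexample.fi,Tracker C\n"
unique_pairs = combine_runs([run_a, run_b])
```

Taking the union over repeated runs is what allows trackers that only appear in some ad rotations to be included, at the cost of losing information about how often each tracker was observed.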

3.6 Data categorization

Roesner et al. (2012) created tracking categories based on technical tracker behavior. They manually identified five different behaviors, taking into account possible cross-site functions and the possibility of users visiting the same domain that the tracker resides on. Lerner et al. (2016) built on this categorization and introduced an additional category called Referred Analytics Tracking, as a combination of the Analytics and Referred tracker categories. Considering the limitations and choices described in previous sections of this chapter, this thesis uses the prebuilt tracker categorization provided by Ghostery. As the categories are not seen as key data points vis a vis the research questions or their answers, this is a choice of convenience. At the time when the data gathering was implemented, Ghostery segmented trackers into 8 separate categories. These categories were built based on common tracking behaviors and the trackers were assigned to the most suitable category based on the perceived purpose of the tracker. The Ghostery tracker segments are described in Table 2.

Tracker Category      Tracker Purpose
Advertising           Provides advertising or advertising-related services such as data collection, behavioral analysis or retargeting.
Comments              Enables comments sections for articles and product reviews.
Customer Interaction  Includes chat, email messaging, customer support, and other interaction tools.
Essential             Includes tag managers, privacy notices, and technologies that are critical to the functionality of a website.
Adult Advertising     Delivers advertisements that generally appear on adult content sites.
Site Analytics        Collects and analyzes data related to site usage and performance.
Social Media          Integrates features related to social media sites.
Audio/Video Player    Enables websites to publish, distribute, and optimize video and audio content.

Table 2 Ghostery tracking categories (Ghostery 2019)

3.7 Data interpretation choices

For the data generated by Tracker Tracker to be able to answer the research questions of this thesis, some additional labeling and interpretation is required.

As Englehardt & Narayanan (2016) claimed, all 3rd parties can be used for tracking and are therefore counted as trackers in this thesis.

Sites (e.g. google.com, facebook.com) that are owned by companies identified as trackers (e.g. Google, Facebook) are manually identified as being tracked by said parties. As these are 1st party sites in relation to the tracking companies, tracking measurement tools would not, and correctly so, register any 3rd party trackers on these sites.

To answer the third research question, regarding differences between tracking on Finnish sites and non-Finnish sites, the top 500 sites were manually labeled according to three rules:

• Content is available in at least one of the three official languages (Finnish, Swedish, Sami).

• Content originally created for the Finnish market, not only translated.

• Possible physical presence in Finland.

Examples:

• xxl.fi = Finnish; Even though the company behind XXL is Norwegian, the site is in Finnish, has content for the Finnish market, a local organization and physical presence in Finland.

• google.fi = Non-Finnish; Even though the site is in Finnish and Google has a local organization, the site is only a translated version of the site google.com.

• ikea.com = Non-Finnish; Even though the content is in Finnish, represents the Finnish market, and IKEA has local presence, ikea.com is a multilanguage and multimarket site.

As is evident, the Finnish/Non-Finnish labeling rules are not without issues, but a best interpretation was used to categorize the top 500 sites.

3.8 Quality of the study

The quality of quantitative research is guided by the scientific canons of inquiry: reliability and validity. Reliability describes the ability to replicate the research design and to obtain the same results. Validity can further be divided into two parts: internal validity and external validity. Internal validity requires establishing a statistical causal relationship between two variables but is not applicable to this thesis because of its exploratory nature. External validity is concerned with the generalizability of the research findings beyond the chosen settings of the study. (Saunders et al. 2015:202-204)

The reliability of this study is dependent on two data gathering-related limitations that are difficult to circumvent if trying to replicate the study:

1) The sites and web pages measured will not be identical to the versions measured for this study.

2) The advertising and other content shown on the web pages will not be the same as the content measured for this study.

If the study could be replicated using exactly the same websites and 3rd party calls as the original used, then the results would be the same. Otherwise, because of the issues discussed in 3.4 Data gathering considerations, even running the same data gathering twice in succession will most probably not deliver the same results. The ever-changing nature of the websites and real-time advertising auctions make every measurement a temporally anchored snapshot of the current state. This should not be seen as a reliability issue with this study, as the same limitations bind all tracker prevalence studies.

As for external validity, the findings of this study are only generalizable to the extent that the scope of this thesis overlaps with that of other research. For example, other tracker prevalence research using sites that are also measured in this study should have similar findings, bearing in mind the same limitations described above in relation to reliability.

4 RESULTS AND ANALYSIS

This chapter presents the empirical results in five sections. The first section consists of a descriptive analysis of the results and the quality of the data (4.1 Data description). The second section presents the results that answer the first research question of tracker prevalence (4.2 Tracker Prevalence). The third section presents the results that answer the second research question of tracking organizations and their coverage of the Finnish user’s digital footprint (4.3 The Finnish digital footprint). The fourth section presents the results that answer the third research question of the possible difference between tracking on Finnish versus non-Finnish websites (4.4 Tracking in Finland). The last, fifth section, presents some additional findings related to the research theme (4.5 Other findings).

4.1 Data description

The data was collected with the Tracker Tracker tool by running five separate requests for five subsets of 100 sites between 02:40 19.8.2017 and 02:16 20.8.2017, when the tool was using a tracker database from March 24th, 2017. The combined csv files produced by the Tracker Tracker tool contained 89,001 lines of data, with 88,976 lines of site tracking data and 25 lines (5 separate runs for each incremental top 100 group) of csv column headers. 466 unique 3rd party trackers from 408 organizations were identified on the top 500 sites, and after removing duplicate site-tracker combinations, 7253 site-tracker pairs were left for analysis. Only 17 tracking organizations used more than one tracker script, the top represented by Google (19 tracker scripts), Microsoft, Facebook, and Yahoo (6 tracker scripts), and Adobe and Yandex (5 tracker scripts). The other 391 organizations were only observed to use one tracker script per organization.

This result is in line with that of Roesner et al. (2012), who found over 500 unique 3rd party trackers on their top 500 Alexa sites, but as the study was published 5 years before the data for this thesis was gathered and measured the global top 500 rather than the Finnish top 500, this similarity is approximate at best.

The tracking data from the measurement runs had very high variations, as can be seen in Table 3 and Table 4. Table 3 illustrates the number of 3rd party trackers found during each of the 25 measurement runs, while Table 4 shows a comparison between each run and the maximum number of scripts found for each top sites group. For example, the number of trackers identified in the top 400 group (sites 301-400) varies between 5854 (100 %) in run 3 and an 82 % smaller 1050 in run 2.

Sites     Run 1   Run 2   Run 3   Run 4   Run 5
Top 100    2110    4221    1721    4045    4520
Top 200    4312    4938    3573    3204    2339
Top 300    4100    5232    4991    2807    2893
Top 400    2221    1050    5854    4317    1352
Top 500    4516    2437    3165    2552    5253

Table 3 Number of 3rd party scripts found for each top sites group and measurement run

Sites     Run 1   Run 2   Run 3   Run 4   Run 5
Top 100    47 %    93 %    38 %    89 %   100 %
Top 200    87 %   100 %    72 %    65 %    47 %
Top 300    78 %   100 %    95 %    54 %    55 %
Top 400    38 %    18 %   100 %    74 %    23 %
Top 500    86 %    46 %    60 %    49 %   100 %

Table 4 Comparison to the maximum number of 3rd party scripts for each top sites group and measurement run
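The relationship between Table 3 and Table 4 can be reproduced with a few lines of Python. The sketch below is an illustration rather than the Excel procedure used for this thesis; the counts are copied from Table 3, while the variable and function names are assumptions.

```python
# Number of 3rd party scripts per top sites group and run, copied from Table 3.
TABLE_3 = {
    "Top 100": [2110, 4221, 1721, 4045, 4520],
    "Top 200": [4312, 4938, 3573, 3204, 2339],
    "Top 300": [4100, 5232, 4991, 2807, 2893],
    "Top 400": [2221, 1050, 5854, 4317, 1352],
    "Top 500": [4516, 2437, 3165, 2552, 5253],
}


def share_of_group_maximum(counts):
    """Express each run as a whole-number percentage of the group's best run."""
    best = max(counts)
    return [round(100 * count / best) for count in counts]


table_4 = {group: share_of_group_maximum(c) for group, c in TABLE_3.items()}

for group, shares in table_4.items():
    print(group, " ".join(f"{share:>3} %" for share in shares))
```

For the top 400 group, for instance, run 2 yields 1050 / 5854, or about 18 % of the group maximum, which corresponds to the 82 % drop mentioned in the text.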

The number of sites on which Tracker Tracker found 3rd party trackers also varied considerably between runs, as can be seen in Table 5. For example, 95 % of the top 400 sites had 3rd party trackers identified during run 3, whereas only 83 % had tracker data in run 2. This accounts for some of the variation in Table 3 and Table 4, but does not explain its cause. Only 47 sites had 3rd party scripts successfully identified in all the five runs of which they were a part.

Sites     Run 1   Run 2   Run 3   Run 4   Run 5
Top 100    88 %    94 %    87 %    94 %    94 %
Top 200    92 %    94 %    90 %    89 %    87 %
Top 300    90 %    93 %    93 %    87 %    88 %
Top 400    87 %    83 %    95 %    90 %    84 %
Top 500    92 %    88 %    89 %    88 %    94 %

Table 5 Share of sites with identified 3rd party scripts for each top sites group and measurement run

A manual analysis of the result data suggests that the high data variation was most probably due to a combination of reasons:

• The scripts running the Tracker Tracker tool failed to contact the web sites. This explains the wide variation of sites without 3rd party scripts, shown in Table 5.

• Dynamic content on the web pages led to the Tracker Tracker tool following different links when exploring the 5 subpages, as assigned in the tool’s settings. Different subpages can contain a completely different set of scripts.

• As previously discussed in 3.4 Data gathering considerations, advertising mechanisms can result in a different set of 3rd party scripts.

• 1st party cookie privacy or other page scripts that the Tracker Tracker tool cannot bypass.

• Some form of bot shields, denying automated browsers access to the sites.

90 sites, out of the 500 analyzed, contained no 3rd party trackers in the data produced by Tracker Tracker. Some of these trackerless sites, like .org or governmental sites (vero.fi, suomi.fi), are credible results. Others, like the media site anna.fi, the movie rating site rottentomatoes.com, or the travel site momondo.fi, are clearly results of some of the issues discussed above. A manual measurement using Chrome and the Ghostery extension identified 3rd party trackers on each of these pages. As a consequence, the results presented in this thesis should be seen as the minimum level of tracking imposed on Finnish web users. Beyond these examples, some sites, such as google.com or googleusercontent.com owned by Google or t.co owned by Twitter, contain no 3rd party trackers, as the tracking organization can utilize 1st party tracking instead.

4.2 Tracker Prevalence

This chapter presents the data analysis needed to answer the first research question: “What are the most prevalent trackers on the websites which Finnish web users frequently visit?”.

As can be seen in Table 6, advertising trackers were by far the most prevalent trackers identified, as categorized by Ghostery. Out of the 466 separate tracking scripts identified, 308 (66 %) were advertising trackers, with site_analytics and customer_interaction scripts trailing far behind. Compared to the results of Karaj et al. (2018), with the advertising category only representing 42 % of the trackers, these results are more skewed towards advertising.

Category              Number of trackers
advertising                          308
site_analytics                        79
customer_interaction                  26
social_media                          18
essential                             16
pornvertising                         10
audio_video_player                     5
comments                               4
Total                                466

Table 6 Tracker categories for the top 500 sites

Google dominated the list of most prevalent trackers, with three of the top four. The full top 20 tracker prevalence is shown in Figure 22. Google Analytics and the Google-owned DoubleClick had a lead of more than 20 percentage points over the next most prevalent tracker, Facebook Connect. Except for the most prevalent tracker, Google Analytics, the rest of the top 20 trackers were all advertising trackers. Within the site_analytics category, Google Analytics had a very dominant position with 65 % of sites utilizing the tool. The second most prevalent site_analytics tracker was TNS, with only a 15 % share. These results are similar to those of other research (Englehardt & Narayanan 2016; Macbeth 2017; Karaj et al. 2018) as to the top trackers, although with higher prevalence.

Figure 22 Top 20 most prevalent trackers (vertical axis: site coverage)

After the top trackers, there is a sharp decline in the number of sites using each tracking script. Google Analytics and DoubleClick were the only trackers to reach over 60% of the sites, with Facebook Connect reaching 41 % and Google Tag Manager 32 % of the sites. No other tracker was found on over 30 % of the sites, as is shown in Table 7 which shows the prevalence frequency in 10 % bins. The majority of trackers, 92 %, had a prevalence of less than 10 % and 139 trackers (30 %) were only found on single sites, somewhat mirroring the long tail found by Englehardt & Narayanan (2016), if not in numbers, then at least in shape. This, the long tail of tracking scripts, is further illustrated in Figure 23. The full list of trackers and their reach can be found in Appendix 4.

Prevalence   Number of trackers
>90 %                         0
80-89 %                       0
70-79 %                       0
60-69 %                       2
50-59 %                       0
40-49 %                       1
30-39 %                       1
20-29 %                       8
10-19 %                      23
<10 %                       431
Total                       466

Table 7 Frequency table of tracker prevalence
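The binning behind a frequency table such as Table 7 can be sketched as follows. The sketch is illustrative and not the Excel procedure used for this thesis: the function names are assumptions, and only the four coverage figures stated in the text are used as input, not the full data set.

```python
from collections import Counter

# Bin labels in ascending order, matching the rows of Table 7.
BIN_LABELS = ["<10 %", "10-19 %", "20-29 %", "30-39 %", "40-49 %",
              "50-59 %", "60-69 %", "70-79 %", "80-89 %", ">90 %"]


def prevalence_bin(share):
    """Map a site coverage share (in percent, 0-100) to its 10 % bin label."""
    index = min(int(share // 10), len(BIN_LABELS) - 1)
    return BIN_LABELS[index]


def frequency_table(shares):
    """Count how many trackers fall into each 10 % prevalence bin."""
    counts = Counter(prevalence_bin(share) for share in shares)
    return {label: counts.get(label, 0) for label in BIN_LABELS}


# Site coverage of the four most prevalent trackers, as reported in the text.
top_four = {"Google Analytics": 65, "DoubleClick": 63,
            "Facebook Connect": 41, "Google Tag Manager": 32}
table = frequency_table(top_four.values())
```

These four trackers alone fill the 60-69 %, 40-49 %, and 30-39 % rows of Table 7; the remaining 462 trackers all fall below 30 % coverage.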

Figure 23 All 466 trackers’ prevalence; the long tail of 3rd party trackers (vertical axis: site coverage; horizontal axis: tracker position by prevalence)

4.3 The Finnish digital footprint

This chapter presents the data analysis needed to answer the second research question: “How complete a digital footprint can organizations build from their trackers on websites Finnish web users frequently visit?”.

In order to understand the reach of the organizations tracking Finnish users, the overall coverage of the trackers linked to the organizations is analyzed. In addition to the trackers on other sites, the organization’s own sites were also considered to be “tracked” by the organizations, as described in 3.7 Data interpretation choices. For example, Google owns many sites in the top 100, which were then manually identified as a part of Google’s reach for this analysis. All in all, 408 tracking organizations were identified.
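The reach calculation described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the procedure used for this thesis, which was carried out in Excel: the function name, the data structures, and the example sites are hypothetical. An organization's reach is taken as the union of the sites where any of its trackers was observed, plus its own 1st party sites.

```python
def organization_reach(tracker_sites, tracker_owner, own_sites, total_sites):
    """Share of measured sites on which each organization can observe users.

    tracker_sites: tracker name -> set of sites where the tracker was found
    tracker_owner: tracker name -> owning organization
    own_sites:     organization -> set of its own (1st party) measured sites
    """
    reach = {}
    for tracker, sites in tracker_sites.items():
        org = tracker_owner[tracker]
        reach.setdefault(org, set()).update(sites)
    for org, sites in own_sites.items():
        reach.setdefault(org, set()).update(sites)
    # A site counts once per organization, however many trackers it carries.
    return {org: len(sites) / total_sites for org, sites in reach.items()}


# Hypothetical toy data with four measured sites.
tracker_sites = {
    "Google Analytics": {"news.example", "shop.example"},
    "DoubleClick": {"news.example", "blog.example"},
    "Facebook Connect": {"news.example"},
}
tracker_owner = {"Google Analytics": "Google", "DoubleClick": "Google",
                 "Facebook Connect": "Facebook"}
own_sites = {"Google": {"google.com"}}
reach = organization_reach(tracker_sites, tracker_owner, own_sites, total_sites=4)
```

Using a set union rather than a sum matters here: a site carrying both Google Analytics and DoubleClick contributes only once to Google's reach.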

After seeing the site coverage results in the previous section, it comes as no surprise that Google and Facebook had the widest reach when measuring the sites that Finns frequent the most, as can be seen in Figure 24. Google’s dominance is clear, with the ability to track Finnish users on 75 % of the pages. Facebook’s reach is 46 %, while the third most prevalent tracking organization AppNexus, an online advertising platform, had a 29 % reach. Once again, the results are similar to those of previous research, although there are also some differences. For example, Englehardt & Narayanan (2016) found Amazon and Oracle in the top 10, whereas Macbeth (2017) and Karaj et al. (2018) had Amazon and Yandex. Compared to these studies, the high prevalence of AppNexus is noticeable. As for the dominant position of Google, almost all research shows the same results, including the three mentioned above. Purra & Carlsson (2016) reported that Google trackers have a coverage of over 70 % in all their domain categories, including .se and .dk sites (Swedish and Danish), supporting the results of this thesis.

Figure 24 Reach of the top 20 tracker organizations (vertical axis: site coverage)

The rest of the identified organizations had below 25 % reach, with only 27 organizations having the ability to track Finnish users on over 10 % of the sites. 381 (93 %) of the organizations have a reach of less than 10 %, with 126 (31 %) only tracking one site of the top 500 measured. The share of sites tracked per organization is shown in Table 8, indicating an even more skewed long tail compared to the one presented for separate tracking scripts (Figure 23 and Table 7). The full list of tracking organizations and their reach can be found in Appendix 5.

Share of sites   Number of organizations
>90 %                                  0
80-89 %                                0
70-79 %                                1
60-69 %                                0
50-59 %                                0
40-49 %                                1
30-39 %                                0
20-29 %                                7
10-19 %                               18
<10 %                                381
Total                                408

Table 8 Frequency table for tracking organization coverage

4.4 Tracking in Finland

This chapter presents the data analysis needed to answer the third research question: “How does tracking differ between Finnish sites and non-Finnish sites?”.

The top 500 sites were labeled as Finnish or non-Finnish according to the rules presented in 3.7 Data interpretation choices. Of the 500 sites, 187 were labeled Finnish and 313 were labeled non-Finnish. Of the 90 sites with no 3rd party tracking scripts identified, 31 were Finnish and 59 were non-Finnish.

Category              Finnish   Non-Finnish
advertising             10.27         10.59
site_analytics           2.27          1.93
social_media             1.19          0.96
essential                0.72          0.42
customer_interaction     0.10          0.16
pornvertising            0.01          0.15
comments                 0.02          0.11
audio_video_player       0.00          0.03
Total                   14.57         14.36

Table 9 Average number of tracking scripts per category on Finnish and non-Finnish sites

Table 9 shows the difference between tracker prevalence on the 187 Finnish sites compared to the 313 non-Finnish sites. Overall, the number of trackers per site was almost the same: 14.57 for the Finnish and 14.36 for the non-Finnish sites, which is considerably above the 10 trackers per site average measured by Karaj et al. (2018). The order of the five most prevalent tracker categories was the same in both groups, and the top three most used tracker categories showed little difference. Advertising was by far the largest category, with site_analytics, social_media and essential coming far behind. A proportionally large difference was found in the smaller groups, especially pornvertising and comments, which might indicate a difference of content type in the Finnish sites compared to the non-Finnish.

Figure 25 Reach of the top 10 trackers on Finnish sites (vertical axis: site coverage; series: share Finnish, share non-Finnish)

Figure 25 shows the top 10 most prevalent trackers on Finnish sites. The top three are the same as the result presented in 4.2 Tracker Prevalence for overall tracker reach: Google Analytics, DoubleClick, and Facebook Connect. These trackers were also much more prevalent on Finnish sites compared to non-Finnish sites. From the fourth position (Google Tag Manager) forward, the Finnish tracker prevalence order differs from the total top 500.

Figure 26 Top 10 most prevalent trackers on Finnish sites compared to non-Finnish sites (vertical axis: site coverage; series: share Finnish, share non-Finnish, difference)

In order to illustrate the different tracker findings, Figure 26 shows a top 10 of trackers that were most prevalent in Finnish sites compared to non-Finnish sites. This shows that not only did local (Nordic) trackers such as Adform, TNS, Frosmo Optimizer, and Enreach enjoy a higher reach, but that Facebook Connect, Facebook Custom Audiences, Google Tag Manager, and Google Analytics were used more on Finnish sites than on non-Finnish sites. The website tracking tools (e.g. heatmaps or A/B testing) Crazy Egg and Hotjar were fourth and ninth, respectively, possibly indicating some difference in the kind of content between the Finnish and non-Finnish labeled sites, or even some preference or customs by Finnish web developers.
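A comparison of this kind can be sketched in Python as follows. The sketch assumes that trackers are ranked by the difference between their share of Finnish sites and their share of non-Finnish sites; the function name and the example sites and trackers are hypothetical and do not come from the data of this thesis.

```python
def prevalence_difference(tracker_sites, finnish_sites, non_finnish_sites):
    """Rank trackers by how much more prevalent they are on Finnish sites.

    Returns (tracker, share_finnish, share_non_finnish, difference) tuples,
    sorted from the largest to the smallest difference.
    """
    rows = []
    for tracker, sites in tracker_sites.items():
        share_fi = len(sites & finnish_sites) / len(finnish_sites)
        share_non_fi = len(sites & non_finnish_sites) / len(non_finnish_sites)
        rows.append((tracker, share_fi, share_non_fi, share_fi - share_non_fi))
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Hypothetical toy data: two Finnish and four non-Finnish sites.
finnish_sites = {"a.fi", "b.fi"}
non_finnish_sites = {"c.com", "d.com", "e.com", "f.com"}
tracker_sites = {
    "Local tracker": {"a.fi", "b.fi", "c.com"},
    "Global tracker": {"a.fi", "c.com", "d.com", "e.com"},
}
ranking = prevalence_difference(tracker_sites, finnish_sites, non_finnish_sites)
```

Because the two groups differ in size (187 Finnish and 313 non-Finnish sites in this thesis), the shares are computed within each group before the difference is taken. Reading the same ranking from the bottom gives the trackers that are relatively more prevalent on non-Finnish sites.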

Figure 27 Top 10 most prevalent trackers on non-Finnish sites compared to Finnish sites (vertical axis: site coverage; series: share Finnish, share non-Finnish, difference)

Figure 27 turns the results around and presents a top 10 of trackers that were most prevalent in non-Finnish sites compared to Finnish sites. The trackers represented here were clearly more prevalent on non-Finnish sites, with none of the trackers having over 10 % reach on Finnish sites and only LiveRamp having over 5 %. The Russian-focused Yandex.Metrics and TopMail, the email tracking tool, had no presence on Finnish pages.

Figure 28 Reach of the top 10 tracking organizations on Finnish sites (vertical axis: site coverage; series: Finnish, non-Finnish)

Figure 28 shows the top 10 tracking organizations for Finnish sites and their prevalence on non-Finnish sites. As with the results from separate trackers, this figure shows a clear uplift for Facebook, Adform, and TNS. Although Google’s uplift was not as great, it kept the lead and still had the widest reach.

Figure 29 Top 10 most prevalent tracking organizations on Finnish sites compared to non-Finnish sites (vertical axis: site coverage; series: Finnish, non-Finnish, difference)

Figure 29 shows a top 10 of tracker organizations that were most prevalent in Finnish sites compared to non-Finnish sites. As with the results from separate trackers, this figure shows a preference for the local (Nordic) organizations such as Adform, TNS, Frosmo Optimizer, Enreach, and Leiki. Of the big tracking organizations, Facebook had a clear preference on Finnish sites, with a smaller uplift for comScore and Adobe. The website tracking tools Crazy Egg and Hotjar were third and eighth respectively, slightly increasing their position, as Google’s overall prevalence difference was leveled out.

Figure 30 Top 10 most prevalent tracking organizations on non-Finnish sites compared to Finnish sites (vertical axis: site coverage; series: Finnish, non-Finnish, difference)

Figure 30 turns the results around and presents a top 10 of tracking organizations that were most prevalent in non-Finnish sites compared to Finnish sites. As with the results from the separate trackers, the organizations presented here were clearly more prevalent on non-Finnish sites, with none of the organizations having over 10 % reach on Finnish sites and only Rapleaf and Microsoft having over 5 %. Yandex and Mail.ru had no presence on Finnish pages, with all tracked sites being Russian or with Russian content. Amazon.com was only present on one site labeled Finnish, lataayoutube.com, which was an uncertain labeling choice and apparently an outlier from the tracking point of view. When comparing the top 20 tracking organizations from this thesis (Figure 24) to that of Macbeth (2017), seven of the ten organizations in Figure 30 are missing from the former, while present in the latter. The three exceptions are Rapleaf, Krux, and Econda, which are not present in either top 20. This is a strong indication of different tracking behavior on Finnish sites compared to non-Finnish sites.

4.5 Other findings

This section will present other relevant findings made during the analysis needed to answer the research questions.

Figure 31 Top 20 most tracked sites (vertical axis: number of trackers)

Number of trackers   Number of sites
>120                               0
110-119                            1
100-109                            0
90-99                              1
80-89                              3
70-79                              7
60-69                              9
50-59                             14
40-49                             13
30-39                             30
20-29                             52
10-19                             90
0-9                              280
Total                            500

Table 10 Frequency table for number of trackers per site

Figure 31 shows the top 20 most tracked sites, i.e. the sites with most tracker scripts. The most tracked site was gsmarena.com, with 113 separate tracking scripts observed. Overall, it is news and gaming sites that dominate this list, with 8 sites being news- or weather-related and 5 sites gaming-related. Table 10 shows the bigger picture and the long tail of tracking. As 90 sites had no observed trackers, that leaves 190 sites with 1-9 trackers and 220 with 10 or more trackers. This result indicates a higher average number of trackers per site than in the study of Karaj et al. (2018), as presented in Figure 11. The full list of sites and the number of trackers observed on them can be found in Appendix 3.

Figure 32 The reach of the top 10 tracking organizations compared over the top sites groups (vertical axis: number of sites; series: top sites groups 100, 200, 300, 400, 500)

Figure 32 shows the variance of observed tracker organizations between the top 500 site groups: 1-100, 101-200, 201-300, 301-400, 401-500. It is clear that Google had a strong position in all 5 groups, while Facebook was lower in the 1-100 group, and Twitter was lower in the first, second, and fifth groups. An explanation for the low coverage of Facebook in the top 100 sites might be that multiple sites of the same organization are also present, e.g. Google owns 4 sites in the top 25 and Microsoft owns 5 in the top 100.

5 DISCUSSION

In this chapter the research findings and implications are discussed more broadly and tied into the surrounding research landscape.

5.1 Results and implications

The findings of this thesis fit well into the prevailing scientific literature on online tracking prevalence. This was expected but by no means a certainty, as the combination of sites analyzed for this thesis differs from the sites in other studies, although major overlap is self-evident. The dominant position of Google, with notable trackers such as Google Analytics and DoubleClick, has apparently not changed during the past 10 years. Facebook’s runner-up status is likewise mirrored in similar studies, but after that, the situation becomes much more volatile. The same organizations are mostly present, but the order in which they show up differs between almost all studies. “The usual suspects”, e.g. AppNexus, comScore, Twitter, Adobe, Amazon, and Yahoo, to name just a few, are present and even with higher than average coverage of sites.

Beyond these usual suspects, and even though tracking prevalence is on a similar level on Finnish and non-Finnish pages, there is a clear geographical influence on the trackers observed. Not only are Google’s and Facebook’s trackers much more prevalent on the Finnish sites compared to the non-Finnish ones, but there are Nordic organizations that thrive there as well. Trackers from the likes of Adform, TNS, Enreach, and Leiki are much more common on the Finnish pages. This geographical influence is not only logical, but also supported by previous research (e.g. Falahrastegar et al. 2014B). It works both ways; the Russian trackers Yandex.Metrics and Mail.ru were observed on Russian sites that Finns frequent, whereas none of the Finnish sites had these trackers, indicating a geographical bias to Russian content.

One surprising find in the empirical data was that the website measurement tools Crazy Egg and Hotjar were much more prevalent on Finnish than on non-Finnish sites. These tools enable the tracking, analyzing and visualizing of users’ behavior on websites with heatmaps, A/B-testing, and other capabilities. There is no clear explanation for their prevalence in the data, and the sites using them share no obvious commonalities. The reason might be that, as Finland is quite a small market, the preferences and familiarity of a few key developers have kept these tools top-of-mind in local organizations, making them a simple choice when selecting site analytics tools.

Contemplating the motivation behind this thesis, it is clearly shown in the empirical data that Google has an overwhelmingly dominant presence when it comes to online tracking. Google can capture the behavior of Finnish web users on at least 75 % of the top 500 pages. This should be seen as a minimum level, as there were quite a few sites on which the automatic measurement failed to identify any trackers, whereas manual sample observations proved these findings to be wrong. Some have previously argued that Google Analytics is a site analytics service and that it should not be counted as tracking. But Google does provide the data from Google Analytics to the site owner for use in targeted advertising. Furthermore, in this thesis, the focus has been on the companies behind the trackers; Google is able to track the users on sites with Google Analytics, whatever their actual data usage is. Even if Google Analytics was for some reason removed from the results, Google’s next largest tracker, DoubleClick, has a reach of 63 % all by itself. It is nigh impossible to evade Google’s trackers when surfing on the world wide web.
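The relationship between tracker-level and organization-level reach can be illustrated with a minimal sketch. The site names, the tracker-to-owner mapping, and the resulting figures below are hypothetical toy data, not the thesis dataset, and the function name is an illustrative assumption. The sketch only shows the principle: an organization's coverage is the share of sites carrying at least one of its trackers, so excluding one tracker lowers the figure only by the sites that no sibling tracker covers.

```python
from collections import defaultdict

# Hypothetical toy data: site -> set of trackers observed on it.
observations = {
    "site-a.example": {"Google Analytics", "DoubleClick"},
    "site-b.example": {"Google Analytics"},
    "site-c.example": {"DoubleClick", "Facebook Connect"},
    "site-d.example": set(),
}

# Hypothetical tracker -> owner mapping (the thesis takes ownership
# data from the Disconnect dataset; this dict merely stands in for it).
owners = {
    "Google Analytics": "Google",
    "DoubleClick": "Google",
    "Facebook Connect": "Facebook",
}


def organization_coverage(observations, owners, exclude=frozenset()):
    """Return {organization: share of sites with at least one of its trackers}."""
    sites_by_org = defaultdict(set)
    for site, trackers in observations.items():
        for tracker in trackers:
            # Skip trackers that are excluded or have no known owner.
            if tracker in exclude or tracker not in owners:
                continue
            sites_by_org[owners[tracker]].add(site)
    total = len(observations)
    return {org: len(sites) / total for org, sites in sites_by_org.items()}


coverage = organization_coverage(observations, owners)
without_ga = organization_coverage(observations, owners, exclude={"Google Analytics"})
print(coverage["Google"], without_ga["Google"])
```

In this toy example, excluding Google Analytics removes only the one site on which it was Google's sole tracker, which mirrors the argument that DoubleClick alone would still give Google a large reach.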

Even though Google sits comfortably on the throne of tracking prevalence, one should not too readily dismiss Facebook’s position. Facebook was the second largest tracker, present on 46 % of the top 500 sites which Finns frequent. This number by itself is very impressive, but when one adds the billions of people who use any of the Facebook-owned platforms, like Facebook, Instagram, Messenger, or WhatsApp, the picture becomes very different. Facebook can track Finnish users on 46 % of the sites they frequent and on most of the key social media platforms they use. Considering the kind of information that can be gathered when visiting websites vs. using social media, the overall understanding Facebook has of its users certainly competes with that of Google.

After the two leaders, there are seven other organizations that have the capability to track Finnish users on over 20 % of the top 500 sites. These were mostly advertising platforms, which enable both the selling of ad space on publisher’s websites and the behavioral analysis that fuels the targeting of each ad. The advertising category is the predominant tracker category throughout the list of identified trackers, accounting for 66 % of the trackers. This does not come as a surprise, as the main reasoning behind tracking is to provide superior targeting through the behavioral analysis of user preferences and needs.

The long tail of tracking is visible both from the tracker and the website side. Over 90 % of the observed trackers were each found on less than 10 % of the sites and 30 % of the trackers were only found on a single site. On the other hand, although there was only one site with over 100 trackers, there were 88 other sites (out of 500) that had over 20 trackers each. Every pageview on these sites shared the user’s actions with a plethora of tracking organizations. To say that online tracking is common practice is definitely downplaying the realities of the current landscape.
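The long-tail figures above are simple counts over the site-to-tracker observations. The following is a minimal sketch of how such a summary could be computed; the function name, parameter names, and the toy data are hypothetical and only illustrate the calculation, not the Excel-based analysis actually used in this thesis.

```python
from collections import Counter


def long_tail_summary(observations, rare_share=0.10, heavy_site_threshold=20):
    """Summarize the long tail of a {site: set of trackers} mapping."""
    total_sites = len(observations)
    # Number of sites on which each tracker was observed.
    site_counts = Counter(
        tracker for trackers in observations.values() for tracker in set(trackers)
    )
    n_trackers = len(site_counts)
    rare = sum(1 for c in site_counts.values() if c / total_sites < rare_share)
    single = sum(1 for c in site_counts.values() if c == 1)
    heavy = sum(
        1 for trackers in observations.values()
        if len(set(trackers)) > heavy_site_threshold
    )
    return {
        "trackers": n_trackers,
        "share_rare": rare / n_trackers if n_trackers else 0.0,
        "share_single_site": single / n_trackers if n_trackers else 0.0,
        "heavy_sites": heavy,
    }


# Hypothetical toy data: 20 sites, one common tracker and a small tail.
sites = {f"site-{i}.example": {"common"} for i in range(20)}
sites["site-0.example"] |= {"mid", "rare1"}
sites["site-1.example"] |= {"mid", "rare2"}
sites["site-2.example"] |= {"mid"}

summary = long_tail_summary(sites, heavy_site_threshold=2)
print(summary)
```

With the default thresholds, `share_rare` corresponds to the share of trackers found on less than 10 % of the sites and `heavy_sites` to the number of sites with more than 20 trackers.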

Online tracking is in many ways an inevitable part of the core online business models; in the “free-to-use” model the user data is the asset being exploited, in the online commerce model it is personalization of content and retargeting of the user, and in the advertising model it is user targeting, impression verification, and behavioral analytics loops that require the tracking of users. In “free-to-use” services, like Facebook, the trade of user data for the social media services provided is quite evident and straightforward, but still many people complain about Facebook using their data. Might it be that they have not read the terms of service, like research suggests? Probably yes.

Even though legislation, like the GDPR33 (EU’s General Data Protection Regulation), tries to regulate the environment, the combination of websites, geographical differences, interpretation of the regulation and compliance thereto makes it an outright jungle for the regular user to navigate. This, and the fact that the empirical data was collected before the GDPR was mandatory, is the reason this thesis does not consider the regulatory environment and what effect it possibly has on the results. The data gathered came from a virtual browser that is technically incapable of giving any cookie or tracking consent, even though such consent should have been mandatory under pre-GDPR EU ePrivacy regulation. But once again, it is all up for interpretation.

Some important questions emerge as a result of the empirical findings: How can users protect themselves from online tracking? Why should they? The issue boils down to personal preference. The advice to each individual is this: if you are apprehensive about the thought of the Big Other or the surveillance state and do not want others to use data about you, then do not use any free online services; better yet, do not go online at all and do not use mobile phones and apps. More realistically, be conscious about what data you give and to whom. If you do not want targeted advertising, you will still see advertising, but it might be less relevant to you. Read and understand the terms of service and modify the privacy settings to your liking.

33 https://ec.europa.eu/commission/priorities/justice-and-fundamental-rights/data-protection/2018-reform-eu-data-protection-rules_en


If this is not enough, there are many ways to limit, or at least try to limit, by technical means the tracking opportunities offered to the online tracking community. The author uses a combination of Pi-hole34 DNS-level blocking running on a Raspberry Pi mini-computer on the home network (restricting access on a per-domain basis, e.g. if a browser tries to load content from the blocked domain advertising.com, it receives an empty page instead) and the Chrome browser with the Ghostery and uBlock Origin tracker blocking extensions. The reasoning is halfway between privacy and efficiency, as removing tracking and advertising makes the online browsing experience much faster and smoother, while keeping trackers at bay. Although this might sound excessive, it is actually quite simple (and fun, if one has the geek gene) and effective. If one wanted to truly invest in ensuring privacy when browsing, one might try to implement most of the 30 separate defense strategies presented in Appendix 1 using a combination of the 13 tools presented in Appendix 2. Sadly, this would still not be enough, as the online tracking arms race continues inexorably. Regulation and tracker blocking certainly help, but as recent research by Sanchez-Rola & Santos (2018) has shown, the tracking landscape is much wider than previously perceived. In their analysis of the Alexa top 1M websites, using code similarity and machine learning, they found 5.5 million previously unknown scripts that were probably tracking scripts.
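The idea of per-domain blocking at the DNS level can be illustrated with a minimal sketch. This is not Pi-hole's actual implementation: the function names, the sinkhole address, and the rule that a listed domain also covers its subdomains are simplifying assumptions made for illustration only. The blocked domain is the example used in the text above, and `upstream` is a hypothetical callable standing in for a real DNS resolver.

```python
def is_blocked(hostname, blocklist):
    """Return True if the hostname or any of its parent domains is blocklisted."""
    labels = hostname.lower().rstrip(".").split(".")
    # Check "ads.advertising.com", then "advertising.com", then "com".
    return any(".".join(labels[i:]) in blocklist for i in range(len(labels)))


def resolve(hostname, blocklist, upstream):
    """Answer blocked names with a non-routable address, others via upstream."""
    if is_blocked(hostname, blocklist):
        return "0.0.0.0"  # assumed sinkhole answer; the browser loads nothing
    return upstream(hostname)


BLOCKLIST = {"advertising.com"}


def fake_upstream(hostname):
    # Hypothetical stand-in for a real resolver; returns a documentation address.
    return "192.0.2.1"


print(resolve("cdn.advertising.com", BLOCKLIST, fake_upstream))
print(resolve("example.org", BLOCKLIST, fake_upstream))
```

Because the decision is made before any connection is opened, blocking of this kind applies to every device and application on the network, which is why it complements rather than replaces browser extensions.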

In the end, the above-board use of tracking and targeting is probably not the thing people worry about. Going back to the five forms of information processing by Solove (2006), presented in chapter 2.6.2 Data processing, one might combine them and conclude that: people do not want large corporations to collect and combine their personal identifiable data without their knowledge and input, and use the results in ways that were never approved or communicated.

5.2 Limitations

The technical limitation of this thesis is bound by the chosen research approach, as discussed in chapter 3 RESEARCH METHODS. The main limitations are set by the quality and correctness of the top 500 site list provided by Alexa, the tracker-owner relationship provided by Disconnect, and finally the tracker measurement capabilities provided by the Tracker Tracker tool. By using the 3rd party tracking measurement tool Tracker Tracker, the author had little influence on the environment from which the actual data gathering was conducted. Geographically attributable information, like the IP address of the server the browser was running on, is a typical example of an environmental variable that most probably influences the results. Another technical limitation was the quality of the results provided by the Tracker Tracker tool, which was exemplified by the high variation in the number of trackers identified and the high number of sites with no trackers identified, even after four redundant measurement runs. Further, there are many tracking mechanics that are not measured by the Tracker Tracker tool and which, although possibly less prevalent, could identify other interesting entities that are tracking users. Taking these limitations into account, the results presented in this thesis should be understood as the minimum level of tracking prevalence, as any missing trackers would only increase the reach of the identified tracking organizations.

34 https://pi-hole.net

As the data analysis was manual and required many iterations in Excel, there is a risk of errors in the analysis, but no greater than that in other similarly sized data sets. Another limitation related to the data analysis is the manual labeling of sites as either Finnish or non-Finnish. As the rules presented in chapter 3.4 Data gathering considerations are not exclusive, and neither do they clearly document each distinct labeling decision, the author’s own common sense and interpretation were involved. This might have introduced some bias in the results.

6 CONCLUSIONS

The motivation for this master’s thesis was the apparent research gap found in the literature regarding a geographically focused online tracking research area. Contemporary research on the prevalence of online tracking from a Finnish perspective only measured Finnish media websites and only counted the number of tracking cookies found per site. This data alone does not sufficiently describe what a Finnish web user can expect to experience when browsing the world wide web. This resulted in the following aim for this thesis: to “describe the online tracking landscape and quantify the prevalence of mechanisms and organizations that track the online behavior of Finnish web users, in order to obtain a better understanding of the share of the Finnish digital footprint these organizations theoretically possess and whether their behavior differs on Finnish and non-Finnish websites”. In order to fulfill this aim, three research questions regarding the Finnish web users were proposed: 1) What are the most prevalent trackers on the websites which Finnish web users frequently visit?; 2) How complete a digital footprint can organizations build from their trackers on websites Finnish web users frequently visit?; 3) How does tracking differ between Finnish sites and non-Finnish sites?

The research design followed the prevailing online tracking measurement approaches, with key choices constrained by the nature and aim of the thesis: 1) using Alexa’s Finnish top 500 sites as the measurement input; 2) using Tracker Tracker, an online tracker measurement service, for the tracker measurement; 3) enriching the data with relationship data of tracker owners obtained from Disconnect.me; 4) analyzing and visualizing the results in Microsoft Excel.

The chosen research design resulted in findings that successfully answered the proposed research questions and supported contemporary research. Google’s dominant role in the tracking ecosystem was absolute, both when looking at the individual trackers and their combined organization-level coverage. The Google-owned tracking scripts Google Analytics, DoubleClick, and Google Tag Manager were all in the top 4 most prevalent trackers, with only Facebook Connect breaking the trio as the third most prevalent tracker. This resulted in Google having a combined tracking coverage of 75 % of the sites which Finnish web users frequent the most. Facebook was positioned in an equally clear second place with a site coverage of 46 %, and although AppNexus, a large advertising technology company, came third in both trackers and overall tracking coverage, the differences were much smaller in the successive placements after Google and Facebook. There was also a long tail of trackers, with 139 (30 %) of the trackers only appearing on single sites.

There were also notable differences between tracking behavior on Finnish and non-Finnish sites, indicating at least some geographical or social differences. Organizations with roots in the Nordic countries, such as Adform, Frosmo, and Leiki, were all much more prevalent on the Finnish sites compared to the non-Finnish sites. On the other hand, the difference was even more distinct when reversing the comparison. Quantcast and Amazon were clearly more prevalent on non-Finnish sites and the Russian Yandex and Mail.ru were not found on any Finnish sites. These types of findings were also expected, based on the results of previous research.

Overall, the research design and approach were successful, and the findings were credible and supported by previous research. Some technical limitations were identified, resulting in the interpretation that the findings of this thesis could be viewed as the minimum level of tracking prevalence, i.e. that tracking might be even more prevalent than indicated here. For further research on the tracking theme used in this thesis, the author suggests the following:

• Repeating the data gathering process using the same Tracker Tracker measurement tool and analyzing the possible changes that the EU GDPR has made on how the Finnish web users are tracked.

• Repeating the data gathering process utilizing the datasets made available by the researchers behind whotracks.me35 (owned by Cliqz, who also owns the Ghostery tracking extension used to identify trackers in the Tracker Tracker tool).

• Using the above-mentioned whotracks.me datasets and comparing the pre- and post-GDPR tracking practices of the same list of the top 500 sites frequented by Finnish web users.

To conclude, the findings in this thesis are well suited to provide a solid foundation for future research of the Finnish online tracking landscape.

35 https://github.com/cliqz-oss/whotracks.me/tree/master/whotracksme/data/assets

7 REFERENCES

Anand, B. N., & Shachar, R. (2009). Targeted advertising as a signal. Qme, 7(3), 237-266. Retrieved from http://www.people.hbs.edu/banand/TargetedAdvertising.pdf Accessed May 5th, 2019

Barbaro, ., Zeller, T., & Hansell, S. (2006). A face is exposed for AOL searcher no. 4417749. New York Times, 9(2008), 8. Retrieved from https://www.nytimes.com/2006/08/09/technology/09aol.html?ei=5090&e Accessed May 5th, 2019

Barnes, S. B. (2006). A privacy paradox: Social networking in the united states. First Monday, 11(9). Retrieved from https://firstmonday.org/ojs/index.php/fm/article/view/1394/1312%2523 Accessed May 5th, 2019

Barth, S., & De Jong, M. D. (2017). The privacy paradox–Investigating discrepancies between expressed privacy concerns and actual online behavior–A systematic literature review. Telematics and Informatics, 34(7), 1038-1058. Retrieved from https://www.sciencedirect.com/science/article/pii/S0736585317302022 Accessed May 5th, 2019

Bauman, Z., & Lyon, D. (2012). Liquid surveillance: A conversation. Hoboken: John Wiley & Sons. Retrieved from https://upenn.instructure.com/courses/1243137/files/47829090/download?verifier=EAfXQ0BcJ3WyajVLFTXMgQ73YMtQV2mKcpBO9BwY&wrap=1 Accessed May 5th, 2019

Beales, H. (2010). The value of behavioral targeting. Network Advertising Initiative, 1. Retrieved from https://www.researchgate.net/profile/Howard_Beales/publication/265266107_The_Value_of_Behavioral_Targeting/links/599eceeea6fdcc500355d5af/The-Value-of-Behavioral-Targeting.pdf Accessed May 5th, 2019

Beresford, A. R., Kübler, D., & Preibusch, S. (2012). Unwillingness to pay for privacy: A field experiment. Economics Letters, 117(1), 25-27. Retrieved from https://www.econstor.eu/bitstream/10419/56724/1/654787352.pdf Accessed May 5th, 2019

Brown, B. (2001). Studying the internet experience. Hp Laboratories Technical Report Hpl, 49. Retrieved from http://shiftleft.com/mirrors/www.hpl.hp.com/techreports/2001/HPL-2001-49.pdf Accessed May 5th, 2019

Bujlow, T., Carela-Español, V., Sole-Pareta, J., & Barlet-Ros, P. (2017). A survey on web tracking: Mechanisms, implications, and defenses. Proceedings of the IEEE, 105(8), 1476-1510. Retrieved from https://upcommons.upc.edu/bitstream/handle/2117/108437/web_tracking_survey-postprint.pdf Accessed May 5th, 2019

Carrascal, J. P., Riederer, C., Erramilli, V., Cherubini, M., & de Oliveira, R. (2013). Your browsing behavior for a big mac: Economics of personal information online. Paper presented at the Proceedings of the 22nd International Conference on World Wide Web, 189-200. Retrieved from https://arxiv.org/pdf/1112.6098.pdf Accessed May 5th, 2019

Castelluccia, C., Grumbach, S., & Olejnik, L. (2013). Data harvesting 2.0: From the visible to the invisible web. Paper presented at the The Twelfth Workshop on the Economics of Information Security. Retrieved from https://hal.inria.fr/file/index/docid/832784/filename/WEIS13-final.pdf Accessed May 5th, 2019

Cavna, M. (2013). ‘NOBODY KNOWS YOU’RE A DOG’: As iconic internet cartoon turns 20, creator peter steiner knows the joke rings as relevant as ever. Retrieved from https://www.washingtonpost.com/blogs/comic-riffs/post/nobody-knows-youre-a-dog-as-iconic-internet-cartoon-turns-20-creator-peter-steiner-knows-the-joke-rings-as-relevant-as-ever/2013/07/31/73372600-f98d-11e2-8e84-c56731a202fb_blog.html Accessed May 5th, 2019

Center for Democracy & Technology. (2011). What does “do not track” mean? (Proposal). Washington: Center for Democracy & Technology. Retrieved from https://www.cdt.org/files/pdfs/CDT-DNT-Report.pdf Accessed May 5th, 2019

Cohen, J. E. (2000). Examined lives: Informational privacy and the subject as object. Stan.L.Rev., 52, 1373. Retrieved from https://scholarship.law.georgetown.edu/cgi/viewcontent.cgi?article=1819&context=facpub Accessed May 5th, 2019

Data Protection Ombudsman. (2019A). Data protection. Retrieved from https://tietosuoja.fi/en/data-protection Accessed May 5th, 2019

Data Protection Ombudsman. (2019B). What is personal data. Retrieved from https://tietosuoja.fi/en/what-is-personal-data Accessed May 5th, 2019

Data Protection Ombudsman. (2019C). Processing of personal data. Retrieved from https://tietosuoja.fi/en/processing-of-personal-data Accessed May 5th, 2019

Digital Methods Initiative. (2014). Tracker Tracker. Retrieved from https://wiki.digitalmethods.net/Dmi/ToolTrackerTracker Accessed May 5th, 2019

EMC. (2014). The EMC privacy index - global & in-depth country results. EMC. Retrieved from https://www.emc.com/collateral/brochure/privacy-index-global-in-depth-results.pdf Accessed May 5th, 2019

Englehardt, S., Eubank, C., Zimmerman, P., Reisman, D., & Narayanan, A. (2015). OpenWPM: An automated platform for web privacy measurement. Manuscript. Retrieved from https://pdfs.semanticscholar.org/2ccd/e1be1ff725aa8740868929cd1a6f5072ab3a.pdf Accessed May 5th, 2019

Englehardt, S., & Narayanan, A. (2016). Online tracking: A 1-million-site measurement and analysis. Paper presented at the Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 1388-1401. Retrieved from https://chromium.woolyss.com/f/OpenWPM-1-million-site-tracking-measurement.pdf Accessed May 5th, 2019

Falahrastegar, M., Haddadi, H., Uhlig, S., & Mortier, R. (2014A). The rise of panopticons: Examining region-specific third-party web tracking. Paper presented at the International Workshop on Traffic Monitoring and Analysis, 104-114. Retrieved from https://link.springer.com/content/pdf/10.1007/978-3-642-54999-1_9.pdf Accessed May 5th, 2019

Falahrastegar, M., Haddadi, H., Uhlig, S., & Mortier, R. (2014B). Anatomy of the third-party web tracking ecosystem. ArXiv Preprint arXiv:1409.1066. Retrieved from https://arxiv.org/pdf/1409.1066.pdf Accessed May 5th, 2019

Falahrastegar, M., Haddadi, H., Uhlig, S., & Mortier, R. (2016). Tracking personal identifiers across the web. Paper presented at the International Conference on Passive and Active Network Measurement, 30-41. Retrieved from https://haddadi.github.io/papers/pam2k16.pdf Accessed May 5th, 2019

Farahat, A., & Bailey, M. C. (2012). How effective is targeted advertising? Paper presented at the Proceedings of the 21st International Conference on World Wide Web, 111-120. Retrieved from https://www2012.universite-lyon.fr/proceedings/proceedings/p111.pdf Accessed May 5th, 2019

Fruchter, N., Miao, H., Stevenson, S., & Balebako, R. (2015). Variations in tracking in relation to geographic location. Paper presented at the Proceedings of the 9th Workshop on Web 2.0 Security and Privacy (W2SP) 2015. Retrieved from https://arxiv.org/pdf/1506.04103.pdf Accessed May 5th, 2019

Gallagher, K., & Parsons, J. (1997). A framework for targeting banner advertising on the internet. Paper presented at the Proceedings of the Thirtieth Hawaii International Conference on System Sciences, 4 265-274. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.95.8298&rep=rep1&type=pdf Accessed May 5th, 2019

Ghostery. (2019). What are the new tracker categories? Retrieved from https://ghostery.zendesk.com/hc/en-us/articles/115000740394-What-are-the-new-tracker-categories- Accessed May 5th, 2019

Gill, P., Erramilli, V., Chaintreau, A., Krishnamurthy, B., Papagiannaki, K., & Rodriguez, P. (2013). Follow the money: Understanding economics of online aggregation and advertising. Paper presented at the Proceedings of the 2013 Conference on Internet Measurement Conference, 141-148. Retrieved from http://conferences.sigcomm.org/imc/2013/papers/imc184s-gillAemb.pdf Accessed May 5th, 2019

Greenberg, J. (2016). Ad blockers are making money off ads (and tracking, too). Retrieved from https://www.wired.com/2016/03/heres-how-that-adblocker-youre-using-makes-money/ Accessed May 5th, 2019

Grossklags, J., & Acquisti, A. (2007). When 25 cents is too much: An experiment on willingness-to-sell and willingness-to-protect personal information. Paper presented at the Sixth Workshop on the Economics of Information Security. Retrieved from http://weis07.infosecon.net/papers/66.pdf Accessed May 5th, 2019

Holvast, J. (2009). History of privacy. Paper presented at the The Future of Identity in the Information Society - 4th IFIP WG 9.2, 9.6/11.6, 11.7, FIDIS International Summer School, Brno, Czech Republic. Retrieved from https://link.springer.com/content/pdf/10.1007/978-3-642-03315-5_2.pdf Accessed May 5th, 2019

IHS Markit. (2017). The economic value of behavioural targeting in digital advertising. (Research report). London: IHS Markit. Retrieved from https://www.iabeurope.eu/wp-content/uploads/2017/09/BehaviouralTargeting_FINAL.pdf Accessed May 5th, 2019

Johnson, J. P. (2013). Targeted advertising and advertising avoidance. The Rand Journal of Economics, 44(1), 128-144. Retrieved from https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.491.2999&rep=rep1&type=pdf Accessed May 5th, 2019

Jones, R., Kumar, R., Pang, B., & Tomkins, A. (2007). I know what you did last summer: Query logs and user privacy. Paper presented at the Proceedings of the Sixteenth ACM Conference on Conference on Information and Knowledge Management, 909-914. Retrieved from http://www.rosiejonesphd.com/papers/cikm2007.kanon.pdf Accessed May 5th, 2019

Karaj, A., Macbeth, S., Berson, R., & Pujol, J. M. (2018). WhoTracks.me: Monitoring the online tracking landscape at scale. ArXiv Preprint arXiv:1804.08959. Retrieved from https://www.researchgate.net/publication/324744573_WhoTracksMe_Monitoring_the_online_tracking_landscape_at_scale Accessed May 5th, 2019

Kosinski, M., Stillwell, D., & Graepel, T. (2013). Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences of the United States of America, 110(15), 5802-5805. doi:10.1073/pnas.1218772110

Kramer, A. D., Guillory, J. E., & Hancock, J. T. (2014). Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Sciences of the United States of America, 111(24), 8788-8790. doi:10.1073/pnas.1320040111

Krishnamurthy, B., & Wills, C. (2009). Privacy diffusion on the web: A longitudinal perspective. Paper presented at the Proceedings of the 18th International Conference on World Wide Web, 541-550. Retrieved from http://www2009.eprints.org/55/1/p541.pdf Accessed May 5th, 2019

Lerner, A., Simpson, A. K., Kohno, T., & Roesner, F. (2016). Internet Jones and the raiders of the lost trackers: An archaeological study of web tracking from 1996 to 2016. Paper presented at the 25th USENIX Security Symposium (USENIX Security 16). Retrieved from https://www.usenix.org/system/files/conference/usenixsecurity16/sec16_paper_lerner.pdf Accessed May 5th, 2019

Li, T., Hang, H., Faloutsos, M., & Efstathopoulos, P. (2015). Trackadvisor: Taking back browsing privacy from third-party trackers. Paper presented at the International Conference on Passive and Active Network Measurement, 277-289. Retrieved from https://www.symantec.com/content/dam/symantec/docs/research-papers/trackadvisor-taking-back-browsing-privacy-from-third-party-trackers-en.pdf Accessed May 5th, 2019

Libert, T. (2015). Exposing the hidden web: An analysis of third-party HTTP requests on one million websites. International Journal of Communication, 9. Retrieved from https://arxiv.org/pdf/1511.00619.pdf Accessed May 5th, 2019

Liu, Y., Gummadi, K. P., Krishnamurthy, B., & Mislove, A. (2011). Analyzing facebook privacy settings: User expectations vs. reality. Paper presented at the Proceedings of the 2011 ACM SIGCOMM Conference on Internet Measurement Conference, 61-70. Retrieved from https://conferences.sigcomm.org/imc/2011/docs/p61.pdf Accessed May 5th, 2019

Macbeth, S. (2017). Tracking the Trackers: Analysing the Global Tracking Landscape with GhostRank. Retrieved from https://www.ghostery.com/wp-content/themes/ghostery/images/campaigns/tracker-study/Ghostery_Study_-_Tracking_the_Trackers.pdf Accessed May 5th, 2019

Mayer, J. R., & Mitchell, J. C. (2012). Third-party web tracking: Policy and technology. Paper presented at the 2012 IEEE Symposium on Security and Privacy, 413-427. Retrieved from https://jonathanmayer.org/publications/trackingsurvey12.pdf Accessed May 5th, 2019

McDonald, A., & Cranor, L. F. (2010). Beliefs and behaviors: Internet users' understanding of behavioral advertising. Washington DC: TPRC. Retrieved from https://www.researchgate.net/profile/Lorrie_Cranor/publication/228237033_Beliefs_and_Behaviors_Internet_Users'_Understanding_of_Behavioral_Advertising/links/00b7d5319b862a63bb000000.pdf Accessed May 5th, 2019

Mendel, T., Puddephatt, A., Wagner, B., Hawtin, D., & Torres, N. (2012). Global survey on internet privacy and freedom of expression. UNESCO. Retrieved from https://www.sbs.ox.ac.uk/cybersecurity-capacity/system/files/UNESCO%20-%20Survey%20on%20Internet%20Privacy%20and%20Freedom%20of%20Expression.pdf Accessed May 5th, 2019

Metwalley, H., Traverso, S., Mellia, M., Miskovic, S., & Baldi, M. (2015). The online tracking horde: A view from passive measurements. Paper presented at the International Workshop on Traffic Monitoring and Analysis, 111-125. Retrieved from https://hal.archives-ouvertes.fr/hal-01411188/file/336978_1_En_8_Chapter.pdf Accessed May 5th, 2019

Mori, M., MacDorman, K. F., & Kageki, N. (2012). The uncanny valley [from the field]. IEEE Robotics & Automation Magazine, 19(2), 98-100. Retrieved from https://ieeexplore.ieee.org/document/6213238 Accessed May 5th, 2019

Narayanan, A., & Shmatikov, V. (2008). Robust de-anonymization of large sparse datasets. Paper presented at the 2008 IEEE Symposium on Security and Privacy, 111-125. Retrieved from https://pdfs.semanticscholar.org/c40e/5c8b4957074644acdaf1f9f4332e63b5846b.pdf Accessed May 5th, 2019

Pariser, E. (2011). The filter bubble: What the internet is hiding from you. London: Penguin UK.

Purra, J., & Carlsson, N. (2016). Third-party tracking on the web: A Swedish perspective. Paper presented at the Local Computer Networks (LCN), 2016 IEEE 41st Conference on, 28-34. Retrieved from https://www.diva-portal.org/smash/get/diva2:1071640/FULLTEXT01.pdf Accessed May 5th, 2019

Räisänen, O. (2015). Trackers leaking bank account data. Retrieved from http://www.windytan.com/2015/04/trackers-and-bank-accounts.html Accessed May 5th, 2019

Rao, A., Schaub, F., & Sadeh, N. (2015). What do they know about me? contents and concerns of online behavioral profiles. ArXiv Preprint arXiv:1506.01675. Retrieved from https://arxiv.org/pdf/1506.01675.pdf Accessed May 5th, 2019

Roesner, F., Kohno, T., & Wetherall, D. (2012). Detecting and defending against third-party tracking on the web. Paper presented at the Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. Retrieved from https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final17.pdf Accessed May 5th, 2019

Ruohonen, J., & Leppänen, V. (2017). Whose hands are in the Finnish cookie jar? Paper presented at the 2017 European Intelligence and Security Informatics Conference (EISIC), 127-130. Retrieved from https://arxiv.org/pdf/1801.07759.pdf Accessed May 5th, 2019

Sanchez-Rola, I., & Santos, I. (2018). Knockin’ on trackers’ door: Large-scale automatic analysis of web tracking. Paper presented at the International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, 281-302. Retrieved from http://paginaspersonales.deusto.es/isantos/papers/2018/2018-sanchez-rola-dimva-knocking.pdf Accessed May 5th, 2019

Saunders, M., Lewis, P., & Thornhill, A. (2015). Research methods for business students (7th ed.). London: Pearson Education.

Schneier, B. (2015). Data and goliath: The hidden battles to collect your data and control your world. New York: WW Norton & Company.

Shah, R. C., & Kesan, J. P. (2009). Recipes for cookies: How institutions shape communication technologies. New Media & Society, 11(3), 315-336.

Sirkkunen, E., & Haara, P. (2017). Yksityisyys ja notkea valvonta: Yksityisyys ja anonymiteetti verkkoviestinnässä-projektin loppuraportti. Tampere: Tampereen Yliopisto. Retrieved from https://tampub.uta.fi/bitstream/handle/10024/100510/978-952-03-0331-0.pdf Accessed May 5th, 2019

Solove, D. J. (2004). The digital person: Technology and privacy in the information age. New York: NYU Press. Retrieved from https://scholarship.law.gwu.edu/cgi/viewcontent.cgi?article=2501&context=faculty_publications Accessed May 5th, 2019

Solove, D. J. (2006). A taxonomy of privacy. University of Pennsylvania Law Review, 154, 477. Retrieved from https://scholarship.law.gwu.edu/cgi/viewcontent.cgi?article=2074&context=faculty_publications Accessed May 5th, 2019

S-Pankki. (2015). Google Analytics -palvelun käyttö S-pankin verkkopalveluissa. Retrieved from https://www.s-pankki.fi/fi/tiedotteet/2015/google-analytics--palvelun-kaytto-s-pankin-verkkopalveluissa/ Accessed May 5th, 2019

Spiekermann, S., Grossklags, J., & Berendt, B. (2001). E-privacy in 2nd generation E-commerce: Privacy preferences versus actual behavior. Paper presented at the Proceedings of the 3rd ACM Conference on Electronic Commerce, 38-47. Retrieved from http://people.ischool.berkeley.edu/~jensg/research/paper/grossklags_e-Privacy.pdf Accessed May 5th, 2019

Steiner, P. (1993). On the internet, nobody knows you’re a dog. The New Yorker, 69(20), 61.

Tsai, J. Y., Egelman, S., Cranor, L., & Acquisti, A. (2011). The effect of online privacy information on purchasing behavior: An experimental study. Information Systems Research, 22(2), 254-268. Retrieved from https://pdfs.semanticscholar.org/e221/d15a4b9f2eb2ab07694aaa584bd59c85532c.pdf Accessed May 5th, 2019

Tufekci, Z. (2015). Algorithmic harms beyond Facebook and Google: Emergent challenges of computational agency. Colo. Tech. L.J., 13, 203. Retrieved from https://ctlj.colorado.edu/wp-content/uploads/2015/08/Tufekci-final.pdf Accessed May 5th, 2019

Turow, J., King, J., Hoofnagle, C. J., Bleakley, A., & Hennessy, M. (2009). Americans reject tailored advertising and three activities that enable it. Available at SSRN 1478214. Retrieved from https://repository.upenn.edu/cgi/viewcontent.cgi?article=1551&context=asc_papers Accessed May 5th, 2019

Tuunainen, V. K., Pitkänen, O., & Hovi, M. (2009). Users' awareness of privacy on online social networking sites-case Facebook. Bled 2009 Proceedings, 42. Retrieved from https://pdfs.semanticscholar.org/9b83/3ca55abd01f842c764804706750743c67115.pdf Accessed May 5th, 2019

Ur, B., Leon, P. G., Cranor, L. F., Shay, R., & Wang, Y. (2012). Smart, useful, scary, creepy: Perceptions of online behavioral advertising. Paper presented at the Eighth Symposium on Usable Privacy and Security, 4. Retrieved from https://www.cylab.cmu.edu/_files/pdfs/tech_reports/CMUCyLab12007.pdf Accessed May 5th, 2019

Varian, H. R. (2014). Beyond big data. Business Economics, 49(1), 27-31. Retrieved from http://edshare.soton.ac.uk/15212/7/BeyondBigDataPaperFINAL.pdf Accessed May 5th, 2019

Watson, S. M. (2014). Data doppelgängers and the uncanny valley of personalization. Retrieved from https://www.theatlantic.com/technology/archive/2014/06/data-doppelgangers-and-the-uncanny-valley-of-personalization/372780/ Accessed May 5th, 2019

Yan, J., Liu, N., Wang, G., Zhang, W., Jiang, Y., & Chen, Z. (2009). How much can behavioral targeting help online advertising? Paper presented at the Proceedings of the 18th International Conference on World Wide Web, 261-270. Retrieved from http://ra.ethz.ch/CDstore/www2009/proc/docs/p261.pdf Accessed May 5th, 2019

Zuboff, S. (2015). Big other: Surveillance capitalism and the prospects of an information civilization. Journal of Information Technology, 30(1), 75-89. Retrieved from http://edshare.soton.ac.uk/15212/5/SSRN-id2594754.pdf Accessed May 5th, 2019


APPENDIX 1 TRACKING DEFENSE STRATEGIES

Figure 33 Tracking defense strategies (Bujlow et al. 2017:21)

APPENDIX 2 TRACKING DEFENSE TOOLS

Figure 34 Tracking defense tools (Bujlow et al. 2017:22)

APPENDIX 3 NUMBER OF TRACKERS PER SITE

Site    Number of Trackers
gsmarena.com    113
tomshardware.co.uk    90
lataayoutube.com    84
icy-veins.com    83
thesaurus.com    80
huffingtonpost.com    79
iltapulu.fi    78
hearthpwn.com    76
independent.co.uk    76
op.gg    75
nytimes.com    74
foreca.fi    71
businessinsider.com    66
seasonvar.ru    66
wowhead.com    66
findit.fi    65
mtv.fi    65
teamliquid.net    63
lenta.ru    62
stara.fi    61
sharepoint.com    60
adlibris.com    59
office365.com    59
vimeo.com    59
ts.fi    58
delfi.ee    56
gamepedia.com    54
hs.fi    54
tilannehuone.fi    54
fishki.net    53
imdb.com    53
utorrent.com    53
urbandictionary.com    52
wowprogress.com    51
luukku.com    50
asos.com    49
g2a.com    46
aalto.fi    43
liveleak.com    43
telia.fi    43
microsoft.com    42
nordicbet.com    42
nimenhuuto.com    41
telegraph.co.uk    41
dna.fi    40
forbes.com    40
kuvaton.com    40
sourceforge.net    40
cdon.fi    39
katsomo.fi    39
washingtonpost.com    38
worldofwarcraft.com    38
cnn.com    37
power.fi    37
tallinksilja.fi    37
theverge.com    36
ticketmaster.fi    36
tokmanni.fi    36
kuvake.net    35
rbc.ru    35
banggood.com    34
dailymotion.com    34
echo.msk.ru    34
duunitori.fi    33
livetulokset.com    33
wikia.com    33
aliexpress.com    32
reddit.com    32
rutracker.org    32
tori.fi    32
unity3d.com    32
hintaseuranta.fi    31
quasargaming.com    31
ria.ru    31
zalando.fi    31
adobe.com    30
slack.com    30
wordpress.com    30
clasohlson.com    29
ebookers.fi    29
ign.com    29
rt.com    29
techradar.com    29
myanimelist.net    28
pikabu.ru    28
savonsanomat.fi    28
autotalli.com    27
dailymail.co.uk    27
ksml.fi    27
motonet.fi    27
muropaketti.com    27
nettimarkkina.com    27
vr.fi    27
ess.fi    26
monster.fi    26
salesforce.com    26
boredpanda.com    25
helsinginuutiset.fi    25
last.fm    25
playstation.com    25
socialblade.com    25
suomalainen.com    25
yaplakal.com    25
gazeta.ru    24
lily.fi    24
nettiauto.com    24
nettimoto.com    24
nettivaraosa.com    24
pcgamer.com    24
1tv.ru    23
bbc.com    23
futisforum2.org    23
polar.com    23
amazon.com    22
feissarimokat.com    22
mamba.ru    22
menaiset.fi    22
providr.com    22
skyscanner.fi    22
vauva.fi    22
4pda.ru    21
adme.ru    21
aftonbladet.se    21
iltalehti.fi    21
suomi24.fi    21
trafi.fi    21
yahoo.com    21
investing.com    20
mangafox.me    20
spotify.com    20
booking.com    19
finnkino.fi    19
.com    19
hubspot.com    19
talouselama.fi    19
amazon.co.uk    18
amazon.de    18
demi.fi    18
etuovi.com    18
fonecta.fi    18
gigantti.fi    18
gogoanime.io    18
livejournal.com    18
uusisuomi.fi    18
pizza-online.fi    17
ruutu.fi    17
seiska.fi    17
alypaa.com    16
fastly.net    16
hongkong.fi    16
kotikokki.net    16
mediafire.com    16
rambler.ru    16
valio.fi    16
bbc.co.uk    15
k-ruoka.fi    15
leagueoflegends.com    15
runescape.com    15
sanakirja.org    15
xxl.fi    15
aktia.fi    14
avito.ru    14
bauhaus.fi    14
evernote.com    14
gearbest.com    14
hdzog.com    14
hm.com    14
is.fi    14
k-rauta.fi    14
ladbible.com    14
mail.ru    14
satakunnankansa.fi    14
speedtest.net    14
tivi.fi    14
vertaa.fi    14
voice.fi    14
aamulehti.fi    13
amazonaws.com    13
asiakastieto.fi    13
flightradar24.com    13
hclips.com    13
ikea.com    13
prnt.sc    13
rantapallo.fi    13
sportbible.com    13
tekniikkatalous.fi    13
tieto.com    13
diply.com    12
elisa.fi    12
fontanka.fi    12
genius.com    12
imgur.com    12
.com    12
translit.net    12
txxx.com    12
airbnb.com    11
dropbox.com    11
goodreads.com    11
huutokaupat.com    11
karkkainen.com    11
kinopoisk.ru    11
postimees.ee    11
prisma.fi    11
sberbank.ru    11
tut.fi    11
verkkokauppa.com    11
verkkouutiset.fi    11
vuokraovi.com    11
biltema.fi    10
drive2.ru    10
foreca.com    10
muusikoiden.net    10
mvlehti.net    10
ontvtime.ru    10
saastopankki.fi    10
smi2.ru    10
stackoverflow.com    10
theguardian.com    10
tube8.com    10
yandex.ru    10
asus.com    9
doska.fi    9
episodi.fi    9
etsy.com    9
helsinki.fi    9
hotmovs.com    9
kauppalehti.fi    9
naurunappula.com    9
russian.fi    9
telkku.com    9
thomann.de    9
twitch.tv    9
upornia.com    9
wish.com    9
.com    9
battle.net    8
ebay.co.uk    8
ebay.de    8
finder.fi    8
gismeteo.ru    8
if.fi    8
ilmatieteenlaitos.fi    8
jysk.fi    8
motherless.com    8
nordnet.fi    8
norwegian.com    8
oikotie.fi    8
s-pankki.fi    8
unibet.com    8
vantaa.fi    8
vikingline.fi    8
w3schools.com    8
airbnb.fi    7
cambridge.org    7
casinohuone.com    7
furaffinity.net    7
hsl.fi    7
jatkoaika.com    7
matkahuolto.fi    7
metropolia.fi    7
patreon.com    7
paypal.com    7
poppankki.fi    7
researchgate.net    7
riemurasia.net    7
trello.com    7
wordpress.org    7
ebay.com    6
elisaviihde.fi    6
happypancake.fi    6
helmet.fi    6
kinokrad.co    6
kinozal.tv    6
linkedin.com    6
msn.com    6
nokia.com    6
ouka.fi    6
oulu.fi    6
porn555.com    6
posti.fi    6
savefrom.net    6
tabletkoulu.fi    6
tallink.com    6
tiketti.fi    6
userapi.com    6
vk.com    6
alko.fi    5
banknorwegian.fi    5
bongacams.com    5
buzzfeed.com    5
.com    5
directrev.com    5
fanfiction.net    5
gfycat.com    5
gyazo.com    5
haahtela.fi    5
jyu.fi    5
kela.fi    5
kinoprofi.org    5
netflix.com    5
onclkds.com    5
.com    5
uef.fi    5
yle.fi    5
9gag.com    4
danskebank.fi    4
gidonline.club    4
luottokunta.fi    4
medium.com    4
mywatchseries.to    4
nets.eu    4
nih.gov    4
ok.ru    4
opensubtitles.org    4
paf.com    4
pinterest.com    4
roblox.com    4
ultimate-guitar.com    4
uta.fi    4
varusteleka.fi    4
veikkaus.fi    4
wikihow.com    4
.com    4
yr.no    4
ampparit.com    3
ask.com    3
azlyrics.com    3
bandcamp.com    3
beeg.com    3
discogs.com    3
espoo.fi    3
facebook.com    3
faceit.com    3
fbcdn.net    3
filmix.me    3
flickr.com    3
hel.fi    3
hinta.fi    3
lidl.fi    3
newsru.com    3
onnibus.com    3
op.fi    3
osuuspankki.fi    3
pakkotoisto.com    3
poliisi.fi    3
popads.net    3
.com    3
tumblr.com    3
twitter.com    3
tyk.info    3
utu.fi    3
4chan.org    2
apina.biz    2
auto24.ee    2
discordapp.com    2
gamefaqs.com    2
github.io    2
google.co.uk    2
google.de    2
google.fi    2
google.ru    2
google.se    2
gostream.is    2
io-tech.fi    2
jimms.fi    2
kanta.fi    2
.com    2
migri.fi    2
.org    2
.com    2
nametests.com    2
otava.fi    2
pathofexile.com    2
quora.com    2
stackexchange.com    2
streamable.com    2
terveyskirjasto.fi    2
timeanddate.com    2
tripadvisor.com    2
apple.com    1
bet365.com    1
bitmedianetwork.com    1
blogger.com    1
blogspot.com    1
handelsbanken.fi    1
hbonordic.com    1
hltv.org    1
instagram.com    1
kaleva.fi    1
messenger.com    1
my-hit.org    1
pcmdnsrv.com    1
perfecttoolmedia.com    1
ppy.sh    1
primewire.ag    1
reittiopas.fi    1
skype.com    1
steamcommunity.com    1
steampowered.com    1
tampere.fi    1
telegram.org    1
thepiratebay.org    1
tripadvisor.fi    1
tui.fi    1
udemy.com    1
vidzi.tv    1
.com    1
xnxx.com    1
.com    1
youtube.com    1
739c49a8c68917.com    0
ablogica.com    0
adexchangeprediction.com    0
afh32lkjwe.net    0
alastonsuomi.com    0
anna.fi    0
archive.org    0
bing.com    0
bp.blogspot.com    0
cloudfront.net    0
cnet.com    0
cpm20.com    0
deviantart.com    0
doublepimpssl.com    0
duckduckgo.com    0
enterfinland.fi    0
europa.eu    0
feedly.com    0
finlex.fi    0
finna.fi    0
finnair.com    0
flowfestival.com    0
fmovies.is    0
focuusing.com    0
github.com    0
google.com    0
googleusercontent.com    0
haa66855mo.club    0
href.li    0
humblebundle.com    0
huuto.net    0
inet.fi    0
inschool.fi    0
iwanttodeliver.com    0
junbi-tracker.com    0
kaksplus.fi    0
kinogo.club    0
kissanime.ru    0
knowyourmeme.com    0
live.com    0
local-finders.com    0
magicred.com    0
media.tumblr.com    0
mega.nz    0
microsoftonline.com    0
mmg.fi    0
mol.fi    0
momondo.fi    0
netposti.fi    0
nettivene.com    0
nordea.fi    0
office.com    0
openload.co    0
paytrail.com    0
peda.net    0
pinimg.com    0
pirateproxy.cc    0
pirnet.fi    0
poe.trade    0
proxybay.one    0
pulseonclick.com    0
punkinfinland.net    0
rarbg.to    0
redd.it    0
rottentomatoes.com    0
salainenseksiyhteys.com    0
sanomapro.fi    0
soundcloud.com    0
spankbang.com    0
spotscenered.info    0
stockmann.com    0
suomi.fi    0
supersaa.fi    0
swedbank.ee    0
t.co    0
take-game-bonus.com    0
taloon.com    0
te-palvelut.fi    0
tekniikanmaailma.fi    0
turku.fi    0
twimg.com    0
umblr.com    0
vero.fi    0
viaplay.fi    0
vice.com    0
wikimedia.org    0
wikipedia.org    0
wiktionary.org    0
xda-developers.com    0
ylilauta.org    0


APPENDIX 4 TRACKER REACH

Tracker    Number of Sites    Share (of 500 sites)
Google Analytics    327    65 %
DoubleClick    313    63 %
Facebook Connect    205    41 %
Google Tag Manager    159    32 %
AppNexus    147    29 %
ScoreCard Research Beacon    127    25 %
Facebook Custom Audience    125    25 %
Adform    115    23 %
Rubicon    112    22 %
Google Adsense    110    22 %
Google Publisher Tags    106    21 %
MediaMath    100    20 %
OpenX    95    19 %
Index Exchange (Formerly Casale Media)    95    19 %
Adobe Audience Manager    89    18 %
Twitter Button    86    17 %
Google Dynamic Remarketing    86    17 %
AddThis    85    17 %
Criteo    84    17 %
Google AdWords Conversion    82    16 %
TradeDesk    80    16 %
PubMatic    78    16 %
Advertising.com    76    15 %
TNS    75    15 %
BidSwitch    72    14 %
Google Syndication    65    13 %
BrightRoll    64    13 %
BlueKai    62    12 %
Quantcast    62    12 %
Turn Inc.    59    12 %
Rocket Fuel    57    11 %
LiveRamp    56    11 %
Yahoo Ad Manager Plus    54    11 %
Facebook Social Plugins    52    10 %
New Relic    50    10 %
eXelate    49    10 %
Yieldlab    49    10 %
ADTECH    48    10 %
DoubleClick Ad Exchange-Buyer    47    9 %
Yahoo Ad Exchange    45    9 %
Crazy Egg    45    9 %
DoubleClick Bid Manager    45    9 %
Hotjar    44    9 %
Improve Digital    43    9 %
Tapad    40    8 %
Google+ Platform    40    8 %
Aggregate Knowledge    39    8 %
Twitter Advertising    39    8 %
Datalogix    39    8 %
PulsePoint    38    8 %
DoubleClick Floodlight    37    7 %
Yandex.Metrics    37    7 %
SpotXchange    35    7 %
DataXu    35    7 %
SiteScout    35    7 %
(blank)    34    7 %
Drawbridge    33    7 %
Omniture (Adobe Analytics)    33    7 %
Media.net    32    6 %
SmartClip    32    6 %
Dotomi    31    6 %
SMART AdServer    31    6 %
Twitter Syndication    31    6 %
Krux Digital    30    6 %
sovrn (formerly Lijit Networks)    30    6 %
Econda    29    6 %
Amazon Associates    29    6 %
Facebook    29    6 %
Taboola    29    6 %
RadiumOne    29    6 %
Lotame    28    6 %
Admeta    28    6 %
Simpli.fi    28    6 %
Frosmo Optimizer    28    6 %
Yahoo Analytics    28    6 %
LiveInternet    28    6 %
AdScale    28    6 %
StickyAds    27    5 %
ChartBeat    26    5 %
Enreach    26    5 %
Optimizely    25    5 %
Visual Website Optimizer    25    5 %
ShareThis    25    5 %
Semasio    24    5 %
Typekit by Adobe    23    5 %
Bing Ads    23    5 %
Media Innovation Group    23    5 %
Platform161    23    5 %
VKontakte Widgets    22    4 %
Delta Projects    22    4 %
DoubleClick Ad Exchange-Seller    22    4 %
Ensighten    22    4 %
Dstillery    21    4 %
Akamai Cookie Sync    21    4 %
Leiki    21    4 %
Eyeota    21    4 %
Teads    21    4 %
Videology    20    4 %
i-Behavior    19    4 %
Infectious Media    19    4 %
Adap.tv    19    4 %
gumgum    19    4 %
Top Mail    18    4 %
AdRiver    18    4 %
LinkedIn Ads    18    4 %
TripleLift    18    4 %
Connexity    17    3 %
Mail.Ru Group    17    3 %
OwnerIQ    17    3 %
AdGear    17    3 %
Atlas    17    3 %
Alexa Metrics    17    3 %
adingo    17    3 %
LinkedIn Marketing Solutions    17    3 %
Tealium    16    3 %
Netmining    16    3 %
Tribal Fusion    16    3 %
cXense    16    3 %
AdRoll    16    3 %
Switch Concepts    15    3 %
NetRatings SiteCensus    15    3 %
myThings    15    3 %
Emediate    15    3 %
Amazon Mobile Ads    14    3 %
Sonobi    14    3 %
Sizmek    14    3 %
Snoobi    14    3 %
Google AJAX Search API    14    3 %
Acuity Ads    14    3 %
AT Internet    14    3 %
PowerLinks    13    3 %
Adobe Test & Target    13    3 %
ExoClick    13    3 %
Yandex.Direct    13    3 %
Pinterest    13    3 %
Integral Ad Science    13    3 %
Veruta    13    3 %
Vi    13    3 %
Yieldr    13    3 %
Crimtan    12    2 %
ShareThrough    12    2 %
AlmondNet    12    2 %
Pingdom    12    2 %
Adition    12    2 %
Magnetic    12    2 %
Optimax Media Delivery    12    2 %
rutarget    12    2 %
eyeReturn Marketing    11    2 %
Gravatar    11    2 %
EQ Advertising    11    2 %
Marketo    11    2 %
GetIntent    11    2 %
Eyeview    11    2 %
Adify    11    2 %
TubeMogul    11    2 %
Moat    10    2 %
BidTheatre    10    2 %
Audience Science    10    2 %
Clearstream.TV    10    2 %
Flashtalking    9    2 %
LiveRail    9    2 %
UserReport-Analytics    9    2 %
Nugg.Ad    9    2 %
Bombora    9    2 %
Rambler    9    2 %
Admedo    9    2 %
VisualDNA    9    2 %
Internet BillBoard    8    2 %
TrafficHaus    8    2 %
TRUSTe Notice    8    2 %
SOASTA mPulse    8    2 %
Parse.ly    7    1 %
FriendFinder Network    7    1 %
Nativo    7    1 %
Linkpulse    7    1 %
DoubleClick Spotlight    7    1 %
PageFair    7    1 %
Neustar AdAdvisor    7    1 %
Clickonometrics    7    1 %
Ancora    7    1 %
Facebook Exchange (FBX)    7    1 %
Weborama    7    1 %
Mail.Ru Banner Network    7    1 %
TradeDoubler    7    1 %
Smaato    7    1 %
Adglue    6    1 %
YieldMo    6    1 %
Content.ad    6    1 %
Parsely    6    1 %
Mixpanel    6    1 %
The ADEX    6    1 %
Jivox    6    1 %
JuggCash    5    1 %
AdFox    5    1 %
Unruly Media    5    1 %
Usabilla    5    1 %
Conversant    5    1 %
UserVoice    5    1 %
RichRelevance    5    1 %
Qualtrics    5    1 %
Disqus    5    1 %
[x+1]    5    1 %
Sekindo    5    1 %
Gemius    5    1 %
Adkontekst    5    1 %
LivePerson    5    1 %
TrafficJunky    5    1 %
MaxPoint Interactive    5    1 %
Marin Search Marketer    4    1 %
Demandbase    4    1 %
Symantec (Norton Secured Seal)    4    1 %
RUN    4    1 %
DoublePimp    4    1 %
Navegg    4    1 %
Outbrain    4    1 %
Google Translate    4    1 %
Resonate Networks    4    1 %
KXCDN    4    1 %
Widespace    4    1 %
ClickTale    4    1 %
LinkedIn Widgets    4    1 %
TradeTracker    4    1 %
eBay Stats    4    1 %
Intent IQ    4    1 %
Metrigo    4    1 %
Signal    4    1 %
Mark & Mini    4    1 %
Cedexis Radar    4    1 %
Convertro    4    1 %
OnThe.io    4    1 %
Nexage    4    1 %
33Across    4    1 %
Effective Measure    4    1 %
FACETz    4    1 %
DataMind    4    1 %
AdSniper    4    1 %
Zanox    4    1 %
SkimLinks    4    1 %
Bazaarvoice    4    1 %
Ghostery Privacy Notice    3    1 %
DoubleVerify    3    1 %
Elastic Beanstalk    3    1 %
Sailthru Horizon    3    1 %
Realtime    3    1 %
Eloqua    3    1 %
CPX Interactive    3    1 %
MyFonts Counter    3    1 %
eStat    3    1 %
PixFuture    3    1 %
AMP Platform    3    1 %
Sentry    3    1 %
Begun    3    1 %
SessionCam    3    1 %
Medley    3    1 %
Ooyala Player    3    1 %
Yandex    3    1 %
Mail.ru counter    3    1 %
Perfect Market    3    1 %
Po.st    3    1 %
Euroads    3    1 %
Openstat    3    1 %
Ve Interactive    3    1 %
TRUSTe Seal    3    1 %
MarketGid    3    1 %
HubSpot    3    1 %
Videoplaza    3    1 %
SiteImprove Analytics    3    1 %
AdLabs    3    1 %
Fidelity Media    3    1 %
Admatic    3    1 %
Twitter Badge    3    1 %
Browser Update    3    1 %
NetSeer    3    1 %
Perfect Audience    3    1 %
AdXpansion    3    1 %
RTB House    3    1 %
Impact Radius    3    1 %
Google Custom Search Engine    3    1 %
Underdog Media    3    1 %
Innovid    3    1 %
Amplitude    3    1 %
Dynamic Yield    3    1 %
Zopim    3    1 %
Rythmxchange    2    0 %
Rich    2    0 %
C3 Metrics    2    0 %
Lockerz Share    2    0 %
Burt    2    0 %
adsnative    2    0 %
Mouseflow    2    0 %
Tumblr Buttons    2    0 %
MarkMonitor    2    0 %
AdLantic    2    0 %
Adblade    2    0 %
Pornvertising    2    0 %
Dianomi    2    0 %
Janrain    2    0 %
Digital Analytix    2    0 %
Adloox    2    0 %
Petametrics    2    0 %
Cross Pixel Media    2    0 %
ZergNet    2    0 %
Tynt    2    0 %
Rhythmxchange    2    0 %
Kavanga    2    0 %
Branch Metrics    2    0 %
Baidu Ads    2    0 %
Between Digital    2    0 %
Undertone    2    0 %
WEDCS    2    0 %
Snap Engage    2    0 %
Meetrics    2    0 %
admitad    2    0 %
Ziff Davis    2    0 %
UserEcho    2    0 %
Digital Window    2    0 %
OnScroll    2    0 %
Facebook Beacon    2    0 %
Sociomantic    2    0 %
display block    2    0 %
FreeWheel    2    0 %
Zemanta    2    0 %
Venatus Media    2    0 %
Sift Science    2    0 %
Research Now    2    0 %
LinkShare    2    0 %
Livefyre    2    0 %
LifeStreet Media    2    0 %
Artificial Computation Intelligence    1    0 %
UserReport    1    0 %
Heap    1    0 %
Leadsius    1    0 %
IP Mappers    1    0 %
Legolas Media    1    0 %
Grapeshot    1    0 %
Coull    1    0 %
Hubspot Forms    1    0 %
ad4game    1    0 %
InsightExpress    1    0 %
Adtraction    1    0 %
PopAds    1    0 %
AudienceInsight    1    0 %
Qubit Opentag    1    0 %
Rokwell    1    0 %
Gridsum    1    0 %
Auditorius    1    0 %
nPario    1    0 %
CRM4D    1    0 %
AdHands    1    0 %
Azalead    1    0 %
uTarget    1    0 %
Live Journal    1    0 %
Vidible    1    0 %
LiveChat    1    0 %
Wikia Beacon    1    0 %
Adult Webmaster Empire    1    0 %
Yandex.API    1    0 %
SaleCycle    1    0 %
Yotpo    1    0 %
Salesforce Live Agent    1    0 %
Recreativ    1    0 %
Sanoma    1    0 %
Naver    1    0 %
ScaleOut    1    0 %
trueAnthem    1    0 %
Adnet Media    1    0 %
New York Times    1    0 %
BBelements    1    0 %
NY Times TagX    1    0 %
FeedBurner    1    0 %
iGoDigital    1    0 %
Feedsportal    1    0 %
Adscience    1    0 %
AdvertServe    1    0 %
Adconion    1    0 %
SexAdNetwork    1    0 %
INFOnline    1    0 %
Loggly    1    0 %
Conde Nast    1    0 %
advmaker.ru    1    0 %
Perform Group    1    0 %
Shopify Stats    1    0 %
iPerceptions    1    0 %
FreakOut    1    0 %
Connextra    1    0 %
AdX    1    0 %
Ad Man    1    0 %
SimpleReach    1    0 %
Propeller Ads    1    0 %
AB Tasty    1    0 %
Purch    1    0 %
Sirdata    1    0 %
Zedo    1    0 %
Adyoulike    1    0 %
AdStir    1    0 %
Bizible    1    0 %
adGENIE    1    0 %
Generic Affiliate Systems    1    0 %
C8 Network    1    0 %
GeoTrust    1    0 %
TrafMag    1    0 %
Direct/ADVERT    1    0 %
Acxiom    1    0 %
DirectREV    1    0 %
Gumroad    1    0 %
smartAD    1    0 %
Heatmap    1    0 %
Gigya Socialize    1    0 %
Centro    1    0 %
MdotLabs    1    0 %
HotLog    1    0 %
SnapEngage    1    0 %
eBay Partner Network    1    0 %
Adzerk    1    0 %
Ads by Yahoo!    1    0 %
Bluelithium    1    0 %
IHS Markit Online Shopper Insigh    1    0 %
Social Amp    1    0 %
UpToLike    1    0 %
mediaFORGE    1    0 %
Elastic Ad    1    0 %
Medialand    1    0 %
Optimonk    1    0 %
AffiliateLounge    1    0 %
UserZoom    1    0 %
Spongecell    1    0 %
Vdopia    1    0 %
Spotify Embed    1    0 %
Commission Factory    1    0 %
Mediapost Communications    1    0 %
Commission Junction    1    0 %
Statcounter    1    0 %
Intent Media    1    0 %
Bounce Exchange    1    0 %
VigLink    1    0 %
Sub2    1    0 %
Confirmit    1    0 %
SumoMe    1    0 %
ip-label    1    0 %
Adobe TagManager    1    0 %
AOL CDN    1    0 %
Brightcove    1    0 %
Wordpress Stats    1    0 %
Miaozhen    1    0 %
Polar Mobile    1    0 %
Tag Commander    1    0 %
Yahoo! Retargetting    1    0 %
Tanx    1    0 %
Yandex Kik    1    0 %
Aidata    1    0 %
Jumptap    1    0 %
ADARA Analytics    1    0 %
Yieldify    1    0 %
Monetate    1    0 %
Keen IO    1    0 %
Monster Advertising    1    0 %
YLE    1    0 %
ThreatMetrix    1    0 %
Kenshoo    1    0 %
Tinypass    1    0 %
Kiosked    1    0 %
Bunchbox    1    0 %
Klikki    1    0 %
Muscula    1    0 %
eXTReMe Tracker    1    0 %
TrackJS    1    0 %
AdOcean    1    0 %
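
The share column in the table above can be reproduced from the site counts. The following is a minimal sketch, assuming the share is the site count divided by the 500 measured sites and rounded to the nearest whole percent; the rounding rule is inferred from the table values and is not stated in the thesis, and the function name `share_of_sites` is hypothetical.

```python
TOTAL_SITES = 500  # number of measured sites, per the table header


def share_of_sites(site_count: int, total_sites: int = TOTAL_SITES) -> int:
    """Return the share of sites as a whole-number percentage.

    Assumption: rounding to the nearest integer, inferred from the table.
    """
    return round(100 * site_count / total_sites)


# A few rows copied from the table above: (tracker, number of sites)
rows = [
    ("Google Analytics", 327),
    ("DoubleClick", 313),
    ("Facebook Connect", 205),
    ("Hotjar", 44),
    ("Rich", 2),
]

for tracker, count in rows:
    print(f"{tracker}: {count} sites, {share_of_sites(count)} %")
```

With a sample of 500 sites, every count maps to a multiple of 0.2 percentage points, so no value falls exactly on a rounding boundary and the choice of tie-breaking rule does not affect the result.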


APPENDIX 5 TRACKING ORGANIZATION REACH

Tracking organization    Number of Sites    Share (of 500 sites)
Google    375    75 %
Facebook    232    46 %
AppNexus    147    29 %
comScore    127    25 %
Twitter    118    24 %
Adobe    117    23 %
Adform    115    23 %
Fox One Stop Media    112    22 %
MediaMath    100    20 %
OpenX    95    19 %
Index Exchange    95    19 %
Yahoo!    87    17 %
AddThis    85    17 %
Criteo    84    17 %
The Trade Desk    80    16 %
PubMatic    78    16 %
AOL    76    15 %
TNS    75    15 %
BidSwitch    72    14 %
BrightRoll    64    13 %
Quantcast    62    12 %
BlueKai    62    12 %
Turn    59    12 %
Rocket Fuel    57    11 %
Rapleaf    56    11 %
Microsoft    52    10 %
New Relic    50    10 %
eXelate    49    10 %
Yieldlab    49    10 %
ADTECH    48    10 %
Amazon.com    46    9 %
Crazy Egg    45    9 %
Hotjar    44    9 %
Improve Digital    43    9 %
Tapad    40    8 %
Datalogix    39    8 %
AK    39    8 %
CONTEXTWEB    38    8 %
Yandex    38    8 %
SiteScout    35    7 %
SpotXchange    35    7 %
DataXu    35    7 %
Drawbridge    33    7 %
SmartClip    32    6 %
media.net    32    6 %
ValueClick    32    6 %
Horyzon Media    31    6 %
Federated Media    30    6 %
Krux    30    6 %
Taboola    29    6 %
RadiumOne    29    6 %
Econda    29    6 %
adscale    28    6 %
Simpli.fi    28    6 %
Frosmo Optimizer    28    6 %
LiveInternet    28    6 %
Lotame    28    6 %
Admeta    28    6 %
Mail.Ru    27    5 %
StickyAds    27    5 %
Enreach    26    5 %
Chartbeat    26    5 %
Wingify    25    5 %
ShareThis    25    5 %
Optimizely    25    5 %
Semasio    24    5 %
WPP    23    5 %
VKontakte    23    5 %
ClickDistrict    23    5 %
Delta Projects    22    4 %
Ensighten    22    4 %
Akamai    21    4 %
m6d    21    4 %
Teads.tv    21    4 %
Eyeota    21    4 %
Leiki    21    4 %
Videology    20    4 %
I-Behavior    19    4 %
Adap.tv    19    4 %
GumGum    19    4 %
Infectious Media    19    4 %
TripleLift    18    4 %
AdRiver    18    4 %
Connexity    17    3 %
OwnerIQ    17    3 %
BLOOM Digital Platforms    17    3 %
Fluct    17    3 %
AdRoll    16    3 %
Tealium    16    3 %
cXense    16    3 %
Exponential Interactive    16    3 %
Netmining    16    3 %
Switch Concepts    15    3 %
myThings    15    3 %
Emediate    15    3 %
Nielsen    15    3 %
Acuity    14    3 %
Sonobi    14    3 %
DG    14    3 %
Snoobi    14    3 %
AT Internet    14    3 %
AdSafe Media    13    3 %
Yieldr    13    3 %
Digital Target    13    3 %
ExoClick    13    3 %
PowerLinks    13    3 %
Pinterest    13    3 %
MyBuys    13    3 %
Crimtan    12    2 %
RuTarget    12    2 %
Pingdom    12    2 %
Datonics    12    2 %
ShareThrough    12    2 %
Magnetic    12    2 %
ADITION    12    2 %
OptMD    12    2 %
TubeMogul    11    2 %
Automattic    11    2 %
Marketo    11    2 %
EQ Ads    11    2 %
Cox Digital Solutions    11    2 %
eyeReturn Marketing    11    2 %
Eyeview    11    2 %
GetIntent    11    2 %
AudienceScience    10    2 %
TRUSTe    10    2 %
BidTheatre    10    2 %
Clearstream.TV    10    2 %
Moat    10    2 %
LiveRail    9    2 %
Admedo    9    2 %
Rambler    9    2 %
Flashtalking    9    2 %
UserReport-Analytics    9    2 %
VisualDNA    9    2 %
Bombora    9    2 %
nugg.ad    9    2 %
Soasta    8    2 %
Internet BillBoard    8    2 %
TrafficHaus    8    2 %
FriendFinder Networks    7    1 %
PageFair    7    1 %
Weborama    7    1 %
Neustar    7    1 %
Ancora    7    1 %
Tradedoubler    7    1 %
Clickonometrics    7    1 %
Nativo    7    1 %
Smaato    7    1 %
Linkpulse    7    1 %
Parse.ly    7    1 %
Adglue    6    1 %
Content.ad    6    1 %
Parsely    6    1 %
Jivox    6    1 %
The ADEX    6    1 %
Mixpanel    6    1 %
Yieldmo    6    1 %
Sekindo    5    1 %
AdFox    5    1 %
Gemius    5    1 %
LivePerson    5    1 %
Usabilla    5    1 %
MaxPoint    5    1 %
JuggCash    5    1 %
UserVoice    5    1 %
TrafficJunky    5    1 %
Qualtrics    5    1 %
Unruly    5    1 %
33Across    5    1 %
[x+1]    5    1 %
RichRelevance    5    1 %
Adkontekst    5    1 %
Disqus    5    1 %
Skimlinks    4    1 %
ClickTale    4    1 %
TradeTracker    4    1 %
Nexage    4    1 %
Navegg    4    1 %
Cedexis Radar    4    1 %
Bazaarvoice    4    1 %
OnThe.io    4    1 %
eBay    4    1 %
Outbrain    4    1 %
AdSniper    4    1 %
BrightTag    4    1 %
Demandbase    4    1 %
DataMind    4    1 %
Symantec (Norton Secured Seal)    4    1 %
Mark & Mini    4    1 %
The Heron Partnership    4    1 %
zanox    4    1 %
Metrigo    4    1 %
DoublePimp    4    1 %
Effective Measure    4    1 %
Resonate Networks    4    1 %
KXCDN    4    1 %
RUN    4    1 %
Widespace    4    1 %
FACETz    4    1 %
Intent IQ    4    1 %
Convertro    4    1 %
Ve Interactive    3    1 %
Openstat    3    1 %
Dynamic Yield    3    1 %
Elastic Beanstalk    3    1 %
Realtime    3    1 %
Sailthru Horizon    3    1 %
Eloqua    3    1 %
Perfect Audience    3    1 %
Medley    3    1 %
Admatic    3    1 %
Zopim    3    1 %
Sentry    3    1 %
Evidon    3    1 %
SessionCam    3    1 %
HubSpot    3    1 %
Perfect Market    3    1 %
RTB House    3    1 %
SiteImprove Analytics    3    1 %
Amplitude    3    1 %
AdXpansion    3    1 %
Ooyala    3    1 %
Fidelity Media    3    1 %
MarketGid    3    1 %
DoubleVerify    3    1 %
Underdog Media    3    1 %
CPX Interactive    3    1 %
MyFonts Counter    3    1 %
PixFuture    3    1 %
Euroads    3    1 %
Po.st    3    1 %
Videoplaza    3    1 %
Begun    3    1 %
Collective    3    1 %
AdLabs    3    1 %
Impact Radius    3    1 %
Médiamétrie-eStat    3    1 %
Browser-Update.org    3    1 %
NetSeer    3    1 %
Innovid    3    1 %
sociomantic labs    2    0 %
MarkMonitor    2    0 %
Burt    2    0 %
Ziff Davis    2    0 %
AdLantic    2    0 %
LifeStreet    2    0 %
Cross Pixel    2    0 %
Rhythmxchange    2    0 %
Adiant    2    0 %
Rich    2    0 %
Lockerz    2    0 %
Meetrics    2    0 %
SnapEngage    2    0 %
Kavanga    2    0 %
Branch Metrics    2    0 %
Between Digital    2    0 %
Zemanta    2    0 %
Petametrics    2    0 %
Sift Science    2    0 %
Adloox    2    0 %
UserEcho    2    0 %
LinkShare    2    0 %
Janrain    2    0 %
Digital Window    2    0 %
Venatus Media    2    0 %
dianomi    2    0 %
display block    2    0 %
Mouseflow    2    0 %
C3 Metrics    2    0 %
adsnative    2    0 %
FreeWheel    2    0 %
Tumblr    2    0 %
Spotify    2    0 %
Pornvertising    2    0 %
Baidu    2    0 %
Livefyre    2    0 %
ZergNet    2    0 %
admitad    2    0 %
Undertone    2    0 %
Research Now    2    0 %
OnScroll    2    0 %
BBelements    1    0 %
uTarget    1    0 %
trueAnthem    1    0 %
Gigya    1    0 %
Legolas Media    1    0 %
Purch    1    0 %
AdOcean    1    0 %
advmaker.ru    1    0 %
Adara Media    1    0 %
DirectAdvert    1    0 %
VigLink    1    0 %
Qubit Opentag    1    0 %
Yotpo    1    0 %
Grapeshot    1    0 %
Intent Media    1    0 %
Gridsum    1    0 %
New York Times    1    0 %
AB Tasty    1    0 %
Ad Man    1    0 %
Loggly    1    0 %
UserReport    1    0 %
Recreativ    1    0 %
Kiosked    1    0 %
Renegade Internet    1    0 %
Perform Group    1    0 %
AdX    1    0 %
Brightcove    1    0 %
C8 Network    1    0 %
PopAds    1    0 %
Gumroad    1    0 %
Elastic Ad    1    0 %
Adyoulike    1    0 %
ThreatMetrix    1    0 %
Heap    1    0 %
IP Mappers    1    0 %
Artificial Computation Intelligence    1    0 %
TrafMag    1    0 %
Rokwell    1    0 %
iPerceptions    1    0 %
Heatmap    1    0 %
Dataium    1    0 %
Adnet Media    1    0 %
Adconion    1    0 %
AdStir    1    0 %
Keen IO    1    0 %
Acxiom    1    0 %
Vdopia    1    0 %
SaleCycle    1    0 %
Adult Webmaster Empire    1    0 %
Salesforce    1    0 %
CRM4D    1    0 %
Sanoma    1    0 %
adGENIE    1    0 %
ScaleOut    1    0 %
eXTReMe digital    1    0 %
Adzerk    1    0 %
GeoTrust    1    0 %
mediaFORGE    1    0 %
ZEDO    1    0 %
Medialand    1    0 %
LiveChat    1    0 %
AffiliateLounge    1    0 %
InsightExpress    1    0 %
SexAdNetwork    1    0 %
Naver    1    0 %
AudienceInsight    1    0 %
Betgenius    1    0 %
DirectREV    1    0 %
Tinypass    1    0 %
Shopify    1    0 %
TrackJS    1    0 %
Mediapost Communications    1    0 %
Conde Nast    1    0 %
SimpleReach    1    0 %
ip-label    1    0 %
iGoDigital    1    0 %
Confirmit    1    0 %
Sirdata    1    0 %
nPario    1    0 %
Feedsportal    1    0 %
NY Times TagX    1    0 %
Auditorius    1    0 %
FreakOut    1    0 %
Ad4Game    1    0 %
Jumptap    1    0 %
Miaozhen    1    0 %
UpToLike    1    0 %
smartAD    1    0 %
Optimonk    1    0 %
Aidata    1    0 %
Centro    1    0 %
Adtraction    1    0 %
UserZoom    1    0 %
Coull    1    0 %
Bizible    1    0 %
Commission Factory    1    0 %
Kenshoo    1    0 %
Social Amp    1    0 %
Klikki    1    0 %
Monetate    1    0 %
Vidible    1    0 %
Monster    1    0 %
Adscience    1    0 %
Spongecell    1    0 %
Leadsius    1    0 %
INFOnline    1    0 %
Wikia Beacon    1    0 %
Azalead    1    0 %
Wordpress Stats    1    0 %
StatCounter    1    0 %
Bounce Exchange    1    0 %
Muscula    1    0 %
Yieldify    1    0 %
Sub2    1    0 %
Generic Affiliate Systems    1    0 %
SumoMe    1    0 %
YLE    1    0 %
InfoStars    1    0 %
AdHands    1    0 %
Bunchbox    1    0 %
Polar Mobile    1    0 %
Commission Junction    1    0 %
Live Journal    1    0 %
Tag Commander    1    0 %
Propeller Ads    1    0 %
Tanx    1    0 %
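
An organization's reach in the table above appears to count each site once, however many of that organization's trackers the site carries: Google reaches 375 sites, far fewer than the sum of the 327 Google Analytics and 313 DoubleClick sites in Appendix 4. The following sketch illustrates that aggregation under stated assumptions. The site names and observations are invented for illustration, the small ownership mapping is a hypothetical stand-in for the Disconnect dataset, and the function `organization_reach` is not taken from the thesis tooling.

```python
from collections import defaultdict

# Hypothetical tracker -> organization mapping (stand-in for the Disconnect dataset)
TRACKER_OWNER = {
    "Google Analytics": "Google",
    "DoubleClick": "Google",
    "Facebook Connect": "Facebook",
}

# Hypothetical observations: site -> trackers detected on that site
observations = {
    "site-a.example": ["Google Analytics", "DoubleClick"],
    "site-b.example": ["DoubleClick", "Facebook Connect"],
    "site-c.example": ["Facebook Connect"],
    "site-d.example": [],
}


def organization_reach(observations, tracker_owner):
    """Count, per organization, the distinct sites carrying any of its trackers."""
    sites_by_org = defaultdict(set)
    for site, trackers in observations.items():
        for tracker in trackers:
            owner = tracker_owner.get(tracker)
            if owner is not None:  # skip trackers with no known owner
                sites_by_org[owner].add(site)  # a set keeps each site only once
    return {org: len(sites) for org, sites in sites_by_org.items()}


reach = organization_reach(observations, TRACKER_OWNER)
for org, count in sorted(reach.items()):
    print(f"{org}: {count} sites")
```

Collecting sites in a set per organization is what prevents double counting: in the invented data, site-a.example carries two Google trackers but contributes only one site to Google's reach.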