Master Thesis Law and Technology LLM

Nothing personal: The concepts of anonymization and pseudonymization in European Data Protection

Student: F. Stoitsev (ANR: 729037)
Supervisors: Lorenzo Dalla Corte (1st), Colette Cuijpers (2nd)

August 2016


Nothing personal: The concepts of anonymization and pseudonymization in European Data Protection

Table of Contents

List of Abbreviations
Chapter 1 - Introduction
Chapter 2 – Defining the Concepts
2.1. The concept of personal data
2.2. Anonymization
2.3. Pseudonymization
2.4. Data Protection Directive
2.4.1. Anonymization
2.4.2. Pseudonymization
2.5. GDPR
2.5.1. Anonymization
2.5.2. Pseudonymization
2.6. Anonymization and Pseudonymization techniques
2.7. Conclusion
Chapter 3 – Identifying the threats
3.1. Re-identification
3.2. The landmark re-identification studies
3.2.1. Massachusetts Medical Database
3.2.2. AOL
3.2.3. Netflix
3.3. Utility versus Privacy
3.4. New Challenges
3.4.1. Big Data
3.4.2. Profiling and Behavioral Advertising
3.5. Conclusion
Chapter 4 – Measures to address the challenges
4.1. Computer scientists’ recommendations
4.2. Risk
4.3. Risk-based approach in the European Data Protection Legislation
4.3.1. Risk-based approach and pseudonymization
4.4. The robustness of Anonymization
4.5. DPIA
4.6. Data protection by design and by default
4.7. Conclusion
Chapter 5 - Conclusion
Bibliography
Legislation and Case Law
Books, Articles and Papers
Documents and Reports
Other


List of Abbreviations

AOL – America Online
BD – Big Data
CNIL – Commission Nationale de l’Informatique et des Libertés (French Supervisory Authority)
DPD – Data Protection Directive
DPbD – Data Protection by Design
DPIA – Data Protection Impact Assessment
ECHR – European Convention on Human Rights
EDPB – European Data Protection Board
EDPS – European Data Protection Supervisor
EU – European Union
GDPR – General Data Protection Regulation
HHP – Heritage Health Prize
ICO – Information Commissioner’s Office
ICT – Information and Communications Technologies
IMDb – Internet Movie Database
IP – Internet Protocol
ISO – The International Organization for Standardization
MAC – Media Access Control (address)
MIT – Massachusetts Institute of Technology
MS – Member States
NYC – New York City
PbD – Privacy by Design
PETs – Privacy Enhancing Technologies
PSI – Public Sector Information
TFEU – Treaty on the Functioning of the European Union
UKAN – United Kingdom Anonymization Network
US – United States
WP29 – Article 29 Working Party


Nothing personal: The concepts of anonymization and pseudonymization in European Data Protection

Chapter 1 - Introduction

“In today’s era of instant information gratification, we have ready access to opinions, rationalizations, and superficial descriptions. Much harder to come by is the foundation knowledge that informs a principled understanding of the world.” Zoltan L. Torey 1

The clash between data use and data protection is one of the most relevant topics of our time, often portrayed as an ongoing conflict in which private companies and governments are the aggressors in search of data, while individuals are the victims, or rather the providers, of personal data - “the new oil of the internet and the new currency of the digital world”.2 Data protection law is meant to bring balance to this unequal dispute; however, its effectiveness has been repeatedly challenged by critics. In that sense, data protection legislation has been relegated to a marginal role, outpaced by the fast development of data processing techniques, specifically those permitting the automated processing of vast amounts of data.3

The dramatic changes in information technology and the widespread use of the internet have made the current Data Protection Directive4 (hereinafter referred to as the “Directive” or “DPD”) obsolete.5 This should come as no surprise, as the current data protection principles “were drawn up in 1990 and adopted in 1995, when only 1% of the European Union population was using the Internet and the founder of Facebook was only 11 years old!”6 The upcoming

1 Zoltan L. Torey, The Conscious Mind, MIT Press, (2014), 1. 2 Meglena Kuneva, Roundtable on Online Data Collection, Targeting and Profiling, (2009). 3 Orla Lynskey, The Foundations of EU Data Protection Law, OUP, (2015), 1. 4 The European Parliament and the Council Directive 95/46/EC of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data, OJ L 281, Data Protection Directive. 5 Bert-Jaap Koops, ‘The Trouble with European Data Protection Law’, 4 International Data Privacy Law, (2014), 250. 6 Viviane Reding, ‘Outdoing Huxley: Forging a High Level of Data Protection for Europe in the Brave New Digital World’, Speech at Digital Enlightenment Forum, (2012), 4.


General Data Protection Regulation7 (“GDPR” or “Regulation”) is the Directive’s successor and aims to resolve, inter alia, this particular challenge.

Besides the above, another fundamental right is also at stake, namely the right to privacy. The new generation of unobtrusive and powerful technologies affects this right, which is essential for the normal development and well-being of individuals, companies and organizations. The Universal Declaration of Human Rights (1948)8 and the Charter of Fundamental Rights of the European Union9 (the Charter) recognize the impact of this right on humankind, and its preservation calls for a specific and modernized system of protection. Violations of privacy will furthermore lead to chilling effects on other fundamental rights, such as freedom of speech and freedom of information.

The enormous value of available information and the use of sophisticated systems such as Big Data (BD) analytics are nowadays key elements in the progress and success of humankind.10 The main issue is how to process data without disclosing information related to identifiable individuals, especially when this data has a sensitive character.11 Anonymization has frequently been put forward as one of the possible solutions in this ongoing debate.

“Anonymization” describes the process of transforming data “into a form which does not identify individuals and where identification is not likely to take place”.12 Through it, the legislator thus pursues a more extensive use of the data.13 The Article 29 Working Party (WP29)14 characterizes anonymization as a “strategy to reap the benefits of this technological revolution for individuals and society at large whilst mitigating the risks for the individuals

7 European Commission, Proposal for a Regulation of the European Parliament and of the Council on the protection of individuals with regard to the processing of personal data and on the free movement of such data, COM (2012) 11 final, (General Data Protection Regulation). 8 United Nations General Assembly, Universal Declaration of Human Rights, 10 December 1948, 217 A (III). 9 European Union, Charter of Fundamental Rights of the European Union, 2012 OJ C 326. 10 Omer Tene and Jules Polonetsky, Big Data for All: Privacy and User Control in the Age of Analytics, 11 Nw. J. Tech. & Intell. Prop. 239 (2013). 11 Josep Domingo-Ferrer, David Sánchez, and Jordi Soria-Comas, Database Anonymization: Privacy Models, Data Utility, and Microaggregation-based Inter-model Connections, Morgan & Claypool Publishers, (2016), 13. 12 Information Commissioner’s Office, Anonymisation: managing data protection risk code of practice, (2012). 13 Ibid. 14 Article 29 of the Data Protection Directive provides for the creation of a ‘Working Party on the Protection of Individuals with regard to the Processing of Personal Data’ (the so-called ‘Article 29 Working Party’). The Working Party is an independent body which acts in an advisory capacity. It is composed of a representative of each of the national supervisory authorities, a representative of the European Data Protection Supervisor (the authority established for the Community institutions and bodies), and a representative of the Commission. Under the GDPR, the name will change to the European Data Protection Board (EDPB).


concerned”.15 The Opinion on Anonymisation Techniques16 concludes that anonymization techniques can provide privacy guarantees and could be used as tools for generating efficient anonymization processes.17 Complementary to this, it mentions the daunting challenge of generating completely anonymized datasets while, at the same time, preserving the information needed for the task at hand.18 Notwithstanding, this argument merely opens Pandora’s box and barely hints at all the implications related to the concept of anonymization under European Data Protection legislation. Therefore, more research needs to be conducted, especially in the light of the new General Data Protection Regulation.19

A number of government authorities, lawmakers and scholars have often highlighted anonymization’s many social and economic benefits. Their faith in anonymization lends itself to many interpretations, and has given this concept a central role in forming the core of standard procedures for ensuring confidentiality, data security and safe data processing. With regard to anonymization, Lawrence Lessig stresses its importance in all four dimensions of regulation (norms and ethics, the market, architecture, and the law) and encourages all administrators to use it.20

Conversely, other scholars, backed by case studies21 and recent technological tendencies (such as BD), have demonstrated the limitations and shortcomings of anonymization as a basis for policy. From a practical perspective, it is frequently quite challenging to determine whether data has been effectively anonymized, and whether there is a concrete risk of re-identification. A case study from 2013, for instance, showed that an attacker (a team of Whitehead Institute specialists), with only an ordinary computer, an internet connection, and freely available sources at their disposal, was able to reveal the personal genetic data of nearly 50 participants in genomic research.22 To some degree, this can be explained by the nature of

15 Article 29 Data Protection Working Party, Opinion 05/2014 on Anonymisation Techniques, Adopted 10.04.2014. 16 Ibid. 17 Ibid. 18 Ibid. 19 GDPR (n 7). 20 Lawrence Lessig, Code: Version 2.0, Basic Books, (2006), 56. 21 The AOL data breach already in 2006 was one of the first widely known cases of re-identification: the company’s release of users’ search logs allowed singling out certain users based on their searches. See more: section 3.2.2. 22 Francis Aldhouse, Anonymisation of personal data – A missed opportunity for the European Commission, Institute for Law and the Web, University of Southampton, (2014).


anonymized data, which includes risk as an inherent characteristic,23 and by anonymization’s reliance on a diversity of circumstances that are difficult to quantify. More specifically, this challenge is caused by the inability to predict the technology and data that will be available for re-identification. Additionally, there are difficulties in quantifying the harms to privacy.24 These factors seem to support the reasoning behind the adoption of a risk-based approach25 by the GDPR. This thesis will adopt a comprehensive approach by addressing the following research question: How can a risk-based approach be used to counteract the risks arising from anonymization and pseudonymization, as defined in the GDPR?

In order to answer this question, a better understanding of this “difficult area of law”26 requires further evaluation and clarification of the concepts of “anonymization”, “pseudonymization”, and “re-identification” under European Data Protection legislation. For that reason, the second chapter will address the essential concept of “personal data”, which is an integral part of the discussed concepts. “Understanding anonymization means understanding what personal data is”,27 mainly because the “processing” of “personal data” is the benchmark for the applicability of data privacy rules.28 Subsequently, in the same chapter, an analysis of the relevant concepts (anonymization and pseudonymization) will be provided in order to make the topic more intelligible. The primary sources analyzed will be the Data Protection Directive and the Regulation. Additionally, a brief overview of anonymization and pseudonymization techniques will be provided as technical background.

Considering the new technological trends and the undermined faith in the privacy-protecting power of anonymization,29 as underlined before, the consequences and the nature of the existing challenges will be assessed in order to clarify the grey areas hidden behind them. The third chapter will take into consideration the criticism raised by a number of scholars. A good example is the widespread skepticism about anonymization techniques functioning as

23 WP29 (n 15). 24 Samson Yoseph Esayas, The role of anonymisation and pseudonymisation under the EU data privacy rules: beyond the ‘all or nothing’ approach, European Journal of Law and Technology, Vol 6, No 2, (2015). 25 According to the Working Party, fundamental principles applicable to controllers (e.g., legitimacy, data minimization, purpose limitation, transparency, etc.) should continue to apply under a risk-based approach, though their implementation may be varied according to the risk at hand through the application of accountability tools such as impact assessments, PbD, breach notification requirements and security mechanisms. 26 ICO (n 12), 9. 27 Ibid. 28 Aldhouse (n 22). 29 Paul Ohm, Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization, UCLA Law Review, Vol. 57, (2010), 1707-1708.


“silver bullets”, as argued30 by Paul Ohm, an Associate Professor at the University of Colorado Law School, and a fact long known to statisticians and computer specialists.31 BD analytics, profiling (behavioral advertising), increasingly easy re-identification and the obstacles facing lawmakers will be discussed in order to reveal the possible shortcomings and drawbacks related to the concepts of anonymization and pseudonymization.

Although there are numerous downsides (e.g. reduced utility of the data and possible risks of re-identification), the primary reason to anonymize is the protection of data subjects when storing or disclosing data,32 and it is therefore important to analyze how these impediments could be reduced or removed. The fourth chapter will thus focus on the possible measures incorporated in the upcoming Regulation which are directly related to anonymization and pseudonymization. The proposed risk-based approach will be assessed as a significant game-changer that brings supportive and innovative measures in addition to the familiar anonymization techniques. Elements such as impact assessments, data breach notifications and privacy by design (PbD) will be examined to the extent they are relevant to the discussion on the role of anonymization and pseudonymization. Some of the measures proposed in the computer science literature and others provided by the GDPR will also be addressed.

One of the aims of this master’s thesis is to explore the concepts of anonymization and pseudonymization in European Data Protection legislation. For this reason, in the conclusion, the outcomes of each chapter will be brought together to answer the research question. More particularly, the focus will be on the data protection impact assessment, privacy by design (PbD) and privacy by default, tools that could help extract the positive effects of anonymization and pseudonymization and develop their proper use.

The thesis adopts a traditional legal method, namely desk research. In order to answer the central question, this thesis is based on the following “benchmark” sub-questions: What are the legal definitions of personal data, anonymization, and pseudonymization and what role do they play in the European data protection legal framework? What are the relevant challenges that surround the concepts of anonymization and pseudonymization?

30 Ibid. 31 Aldhouse (n 22). 32 Ohm (n 29).


Is it possible to solve the described problems by adopting a risk-based approach, as defined in the GDPR? These questions will be illustrated and explained in chapters 2 to 4. Providing an answer to each one of them will gradually lead to the answer to the central question. Therefore, the Data Protection Directive and the General Data Protection Regulation will be analyzed in detail and used as the prime sources of this master thesis. Owing to the interdisciplinary nature of the subject, this classic doctrinal research will be supported by literature from the fields of computer science and statistics in order to clarify the discussed concepts.


Chapter 2 – Defining the Concepts

The purpose of this chapter is to lay the groundwork for the thesis’s topic by exploring the key conceptual definitions of personal data, anonymization, and pseudonymization. The analysis furthermore defines when data is considered to be personal and when it is not. This differentiation will be supported by relevant cases and legal literature. After the clarification of the concepts, the next step is to explore them through the prism of the relevant data protection legislation (the DPD and the GDPR). Furthermore, a brief overview of anonymization and pseudonymization techniques will be given in the final section of the chapter in order to allow for a more complete analysis of the topic.

2.1. The concept of personal data

Before the legal analysis of anonymization and pseudonymization, and in order to find the appropriate interpretation, it is necessary to shed light on the cornerstone of data protection law: the concept of personal data. Article 16 of the Treaty on the Functioning of the European Union (TFEU)33 proclaims explicitly that “everyone has the right to the protection of personal data concerning them.” Article 7 of the Charter of Fundamental Rights of the European Union (EU Charter)34 replicates Article 8, “Right to respect for private and family life”, of the ECHR.35 Article 8 of the EU Charter36 explicitly enshrines the right to data protection as a separate right; as a consequence, both rights have become fundamental rights, thus raising the level of protection in European legislation. It should be noted that Articles 7 and 8 of the EU Charter overlap to some extent, while at the same time having a different scope.37

Secondary EU law contains the provisions which build the core framework of what constitutes personal data. Written in 1995 and still in force, the Data Protection Directive is the main Community instrument, which has a dual objective: the protection of individuals with regard to the processing of personal data and the free movement of such data within the EU.38 It is important to delineate which data qualifies as personal data under Data

33 European Union (Consolidated Version of the Treaty on the Functioning of the European Union) art. 16, 2008 O.J. C 115/47. 34 EU Charter (n 9). 35 CoE, European Convention on Human Rights, [1950], CETS No. 005. 36 EU Charter (n 9). 37 See, Juliane Kokott and Christoph Sobotta, The distinction between privacy and data protection in the jurisprudence of the CJEU and the ECtHR, International Data Privacy Law, Vol. 3, No. 4 (2013). 38 DPD (n 4), Art. 1.


Protection law and, as such, can be regulated by it. This means that the remaining data, whether or not it is perceived as personal data, is not covered by the basic data protection rules.39

The text of Article 2 of the Directive sets forth the definition of personal data as follows: “any information relating to an identified or identifiable natural person (“data subject”)”.40 The objective of the concept of “personal data” is to cover all data linked to an identified or identifiable natural person, either directly or indirectly.41 The recently adopted Regulation, which will apply from 25 May 2018, adheres to this definition in its Article 4(1).42 Particular attention will be given to what exactly constitutes “identifiable”, due to the more complex nature of the term and the legislator’s recent efforts to clarify it.

The lawmakers’ intention was to craft a wide-reaching concept of personal data, and this is expressed by the use of the wording “any information”.43 The concept is widely subject to interpretation: personal data includes, among other things, the name of an individual together with his telephone details, or information covering his working conditions or hobbies.44 Another example concerns the professional and business activities of individuals, which, according to the European Court of Human Rights, should not be excluded from the notion of “private life”.45 This is not to say that there are no limitations.

Additionally, it should be noted that even when the material scope of data protection legislation is satisfied, this does not necessarily mean that all the rules apply automatically or completely. The protection of personal data as a fundamental right “[…] is not, however, an absolute right, but must be considered in relation to its function in society”.46 Moreover, the EU Charter sets forth the general limitations of fundamental rights, including privacy and data protection, in its Article 52(1): the limitations must be provided for by law, must respect the essence of the affected right, and must be necessary and genuinely meet the

39 Paul De Hert and Vagelis Papakonstantinou, “The New General Data Protection Regulation: Still a Sound System for the Protection of Individuals?,” Computer Law and Security Review 32, no. 2, 179–94, (2016). 40 DPD (n 4), Art. 2. 41 Nearly two years into the discussions on the proposed reform, it is still not clear which provisions of the Commission Proposal, and in what shape, will make it into the final draft. Almost 4,000 amendments were tabled. 42 GDPR (n 7). 43 DPD (n 4), Art. 2. 44 See, for example, Judgment of the European Court of Justice C-101/01 of 6.11.2003 (Lindqvist), §24: "The term personal data used in Article 3(1) of Directive 95/46 covers, according to the definition in Article 2(a) thereof, any information relating to an identified or identifiable natural person. The term undoubtedly covers the name of a person in conjunction with his telephone coordinates or information about his working conditions or hobbies". 45 ECHR (n 35), Art. 8. 46 See, for example, EU Charter (n 9), Art. 8; CJEU, Joined cases C-92/09 and C-93/09, Volker and Markus Schecke GbR and Hartmut Eifert v. Land Hessen, 9 November 2010, para. 48.


objectives of general interest recognized by the European Union or the need to protect the rights and freedoms of others.47 This illustrates that the application of this right requires a certain flexibility, and that it should be balanced against other rights, taking into account the particular case.

The EU legislator recognized the importance of certain types of data, e.g. data related to health, and made them subject to a specific regime, defining them as sensitive data. This distinction between sensitive and “ordinary” data is based on the specific nature of the former. According to the Directive, these “special categories of data” constitute information that reveals “racial or ethnic origin, political opinions, religious or philosophical beliefs, trade-union membership, and the processing of data concerning health or sex life”.48 The General Data Protection Regulation specifies further categories of sensitive data, genetic and biometric data,49 which were, however, already interpreted as sensitive by doctrine and jurisprudence.50 The processing of sensitive data is in principle prohibited and therefore allowed only under specific circumstances.51 Sensitive personal data requires different treatment from the rest (ordinary personal data not listed in the above-mentioned provisions) because its sensitive character could expose data subjects to a higher risk.

The DPD proclaims that an “identifiable person is one who can be identified, directly or indirectly”.52 The identifiability of a natural person through direct or indirect identification is among the key elements of the definition of personal data. With regard to the former option, contained in the wording “directly identifiable”, the interpretation is clearer and does not give rise to legal disputes.53 To be directly identified, a natural person should be manifestly distinguishable from other persons. On the one hand, the elements contained in the informational set that make the direct identification of a specific individual possible are called “direct identifiers”.54 On the other hand, “indirect identifiers” refer to information regarding the physical, physiological, mental, economic, cultural, or social identity of that particular

47 EU Charter (n 9), Art. 52(1). 48 DPD (n 4), Art. 8. 49 The definition of these two categories of data is given by Article 4 of the Regulation, according to which they constitute additions to the data protection field that come as a result of scientific developments in their respective fields. 50 See, for example, European Court of Human Rights, S. and Marper v. the United Kingdom, Nos. 30562/04 and 30566/04, 04 December 2008. 51 DPD (n 4), Art. 8. 52 DPD (n 4), Art. 2. 53 De Hert and Papakonstantinou (n 39). 54 Examples are outward signs of the appearance of this person, such as height, hair colour, clothing, etc.


individual.55 For example, the personal name is considered one of the primary identifiers: it contains information that is strongly related to a particular person and may allow that person to be directly identified. In certain cases the personal name may not be enough, given that many names are not unique, and a further combination with other personal details is necessary in order to achieve identification.56 In general, this means that the identity of the data subject can be established either directly or by obtaining additional data.57 These examples illustrate that “identifiability” is a concept with a contextual meaning and should be approached as such.58 In particular, the indirect identifiers in the second scenario require further research before they can be related to a certain person.

The phrase “indirectly identifiable” is the most commented-upon and discussed part of the definition of personal data.59 The definition given by the DPD refers to personal identifiers such as “an identification number or to one or more factors specific to his physical, physiological, mental, economic, cultural or social identity”.60 The above characteristics, when cross-correlated with each other, can allow the identification of a particular individual through the identification of unique patterns. This covers situations where, while it would appear prima facie that the person concerned could not be singled out through the use of direct identifiers, the availability of additional indirect identifiers (or quasi-identifiers) allows the data controller or any other actor to distinguish the individual from the collectivity of reference.61
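
To make the mechanism of indirect identification more concrete, the following minimal Python sketch illustrates how quasi-identifiers shared between a “de-identified” dataset and an auxiliary public register can be cross-correlated to single out an individual, in the spirit of the study referred to in note 61. The records, names and field labels are entirely fictitious and purely illustrative; they are not drawn from any source cited in this thesis.

    # Hypothetical linkage sketch: both datasets share the quasi-identifiers
    # gender, date of birth and postal code, even though direct identifiers
    # (names) were removed from the first one.
    deidentified_records = [
        {"gender": "F", "dob": "1955-07-21", "zip": "02138", "diagnosis": "asthma"},
        {"gender": "M", "dob": "1960-01-02", "zip": "02139", "diagnosis": "diabetes"},
    ]
    public_register = [  # auxiliary data that still carries names
        {"name": "Jane Roe", "gender": "F", "dob": "1955-07-21", "zip": "02138"},
        {"name": "John Doe", "gender": "M", "dob": "1971-03-14", "zip": "02141"},
    ]

    def link(records, register, keys=("gender", "dob", "zip")):
        """Return (name, record) pairs where the quasi-identifiers match uniquely."""
        matches = []
        for record in records:
            candidates = [p for p in register if all(p[k] == record[k] for k in keys)]
            if len(candidates) == 1:  # a unique match singles the person out
                matches.append((candidates[0]["name"], record))
        return matches

    print(link(deidentified_records, public_register))
    # -> [('Jane Roe', {'gender': 'F', 'dob': '1955-07-21', 'zip': '02138', 'diagnosis': 'asthma'})]

On such a toy scale the match is trivial, but the same join logic, applied to large publicly available datasets, is what drives the re-identification risk discussed in Chapter 3.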

This understanding has been strongly challenged by the development of new technologies, where data analytics are increasingly sophisticated and the amount of publicly available data is enormous.62 In 2007, the WP29, in its Opinion 4/2007 on the concept of personal data, extensively addressed the issue of what exactly constitutes “identifiable”, in order to reflect the realities of the new digital era and to facilitate the appropriate interpretation by data

55 DPD (n 4), Art. 2(a). 56 Frederik J. Zuiderveen Borgesius, Singling Out People Without Knowing Their Names – Behavioural Targeting, Pseudonymous Data, and the New Data Protection Regulation, (2016); WP29 (n 15). 57 European Union Agency for Fundamental Rights and the Council of Europe, Handbook on European Data Protection Law, 2014. 58 WP29 (n 15). 59 DPD (n 4), Art. 2. 60 Ibid. 61 For example, Dr. Latanya Sweeney conducted a study which indicates that 87.1% of the citizens of the United States were uniquely identified by gender, date of birth, and ZIP code. This study will be further addressed in Chapter 3 of the thesis. 62 Luiz Costa and Yves Poullet, “Privacy and the Regulation of 2012,” Computer Law and Security Review 28, no. 3 (2012).


controllers.63 Article 4(1) of the GDPR follows the same line as the above-mentioned opinion by explicitly incorporating “online identifiers” and “location data”64 in the text of the definition of personal data.65

Article 4(1) of the GDPR, as originally proposed by the Commission, proved resilient and survived with little modification in the final text, supported also by the Parliament and the Council. Nevertheless, these new additions to the definition do not mean per se that all of the indirect identifiers addressed should be regarded as personal data.66 The definition of personal data, especially the part that refers to identifiability, should be read in conjunction with Recital 26,67 which presents the idea of a “proportionality test”68 of the efforts needed to identify a data subject. The next section will discuss in more detail this test and the situations in which it is not passed, leading to anonymous data which falls outside the scope of data protection legislation.69

2.2. Anonymization

The second term that requires further clarification is anonymization. To “anonymize” means to “remove identifying particulars or details from (something, especially medical test results) for statistical or other purposes”.70 Stated more generally, the process of anonymization relates to the transformation of data in such a way that it is no longer possible to pinpoint the identity of an individual. In essence, anonymization is used as a technique which protects privacy alongside security and compliance with data protection rules. For example, historical researchers share individuals’ data with other historical researchers, and online platforms sell aggregated customer information to advertising companies.71 In all of these scenarios the data has been anonymized in order to preserve the privacy of the individuals concerned and to comply with the rules of data protection legislation. There is a common misconception that labels pseudonymization as part of the concept of anonymization.72 However,

63 Article 29 Working Party (2007), Opinion 4/2007 on the concept of personal data, WP 136, Adopted on 20 June 2007. 64 GDPR (n 7), Art. 4 (1). 65 De Hert and Papakonstantinou, (n 39). 66 Christopher Kuner, The European Commission’s Proposed Data Protection Regulation: A Copernican Revolution in European Data Protection Law, in Bloomberg BNA Privacy and Security Law Report, 6 February 2012, pages 1-15 67 DPD (n 4), Recital 26 and GDPR (n 7), Recital 26. 68 De Hert and Papakonstantinou, (n 39). 69 Ibid. 70 Oxford Online Dictionary, Anonymized , accessed 29 January 2016. 71 Ohm (n 29), 1701. 72 WP29 (n 15).


as outlined below, they are two very distinct concepts, and their differentiation is of utmost importance for the interests of the data subject.

2.3. Pseudonymization

The dictionary definition presents the concept of pseudonymization as “the replacement of all data (e.g. in a database) that identifies a person with an artificial identifier.”73 In other words, pseudonymization is also a privacy-preserving technique, one which replaces an individual’s identifiers, such as a name or an address, with a pseudonym. This pseudonym is usually held by the data controller or a third party and makes linkability between the pseudonymized data and the individual possible. In principle, the link between data and individual can be made only by the person who pseudonymized the data, that is, the person who holds the pseudonymization key. According to the Working Party Opinion 05/2014, pseudonymization “merely reduces the linkability of a dataset with the original identity of a data subject”.74 In anonymization, by contrast, such linkability should not be possible, or should at least be very difficult to achieve. This difference plays a crucial role within European data protection law, according to which, where individuals are no longer identifiable on the basis of a truly anonymized dataset, the data does not fall within the scope of data protection.75
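
By way of illustration only, the following Python sketch shows one common way of implementing the mechanism just described: a direct identifier is replaced by a pseudonym derived with a keyed hash (HMAC), so that only the holder of the secret key can re-create the mapping and re-link the pseudonym to the individual. The record, the name and the key are hypothetical and do not stem from any source cited here.

    import hashlib
    import hmac

    # The "pseudonymization key": the additional information that, under the
    # approach described above, would be held separately and securely.
    SECRET_KEY = b"example-key-held-separately"

    def pseudonymize(identifier: str) -> str:
        """Replace a direct identifier (e.g. a name) with a stable pseudonym."""
        return hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

    record = {"name": "Maria Jansen", "city": "Tilburg", "diagnosis": "asthma"}
    pseudonymized_record = {**record, "name": pseudonymize(record["name"])}

    print(pseudonymized_record)  # 'name' is now a 16-character hex pseudonym

Because the same key always yields the same pseudonym, records belonging to the same person remain linkable to each other, and whoever holds the key can re-identify them; this is precisely why pseudonymized data, unlike truly anonymized data, is still treated as personal data.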

2.4. Data Protection Directive

2.4.1. Anonymization

The Directive and the new Regulation will be approached in chronological order, so as to trace the legal developments that the concepts of anonymization and pseudonymization have undergone. The starting point is the Data Protection Directive, which addresses anonymization in Recital 26:76

“Whereas the principles of protection must apply to any information concerning an identified or identifiable person; whereas, to determine whether a person is identifiable, account should be taken of all the means likely reasonably to be used either by the controller or by any other person to identify the said person; whereas the

73 Pseudonymization, Wiktionary, last accessed 10 February 2016. 74 Ibid. 75 Ibid. 76 It should be noted, however, that Recitals in European Union law are not considered to have independent legal value, but they can expand an ambiguous provision’s scope.


principles of protection shall not apply to data rendered anonymous in such a way that the data subject is no longer identifiable”.77

Recital 26 implicitly describes the notion of anonymization. The Directive proclaims that the data should be transformed in such a way that the individual can no longer be identified. The analysis of Recital 26 discloses two interesting elements which need further explanation for the better clarification of the concepts of anonymization and pseudonymization. Firstly, the benchmark for identifiability is incorporated in the wording “likely reasonably”, which should be understood in terms of the “probability” of identification and the “difficulties” involved in identification, such as the monetary costs and the time required.78 It can be seen that the legal text concentrates on the ex ante scenario, in which the data must be processed in a way which does not allow the individual to be identified through “all” “likely” and “reasonable” means. The second part of Recital 26 describes a scenario in which the data is already anonymized in such a way that the identity of the individual cannot be revealed. In other words, this data is no longer considered personal and does not fall under the scope of the DPD. Therefore, this anonymized data may be processed without taking into account the strict rules of European Data Protection legislation.79 However, there is no further clarification on how this de-identification could be achieved.

The WP29 discussed the concept of anonymization in a controversial way in its Opinion 05/2014 on anonymization techniques.80 The opinion addresses Recital 26 of the Directive, and more particularly anonymized data, in an ambiguous way. On the one hand, it states that current technological developments strongly challenge the concept of anonymization by making the process of re-identification easier. On the other hand, the text of the opinion describes the process of anonymization multiple times as irreversible, which implies that the aim is zero risk.81 On the contrary, all of the anonymization techniques envisaged in the opinion, and indeed all known techniques in general, rely on the assumption that zero risk is not practically achievable.82 This suggests that anonymization

77 DPD (n 4), Recital 26. 78 Lee Bygrave, Data Protection Law: Approaching Its Rationale, Logic and Limits, The Hague: Kluwer Law International, (2002), 44. 79 Esayas (n 24). 80 Khaled El Emam and Cecilia Álvarez, “A Critical Appraisal of the Article 29 Working Party Opinion 05/2014 on Data Anonymization Techniques,” International Data Privacy Law 5, no. 1, 73-87, (2015). 81 Ibid. 82 Cynthia Dwork, Differential Privacy, in Automata, Languages and Programming, 33rd Int’l Colloquium Proc. Part II, 2 (2006), last accessed 13 February 2016.


should be approached through a risk-based approach,83 a concept which has gained much more recognition in the legal text of the GDPR and will be addressed in Chapter 4. This is mainly because of the very nature of anonymization, which includes risk as an inherent characteristic.84 For example, the risk factor should be considered in evaluating the validity of any anonymization technique, including the potential further uses of data that has been “anonymized”.85

Another essential point is when anonymization serves as further processing of data, as discussed in the WP29 Opinion.86 This is extremely relevant in the context of BD and profiling (behavioral advertising); both concepts will be addressed as part of the technological challenges presented in Chapter 3. This is why the lawfulness of the anonymization process requires clarification in this section. The purpose limitation principle stipulated in Article 6(1)(b) of the Directive sets the requirement for the data controller that data should be “collected for specified, explicit and legitimate purposes and not further processed in a way incompatible with those purposes”.87 This is one of the key principles in applying data protection legislation, and it has a crucial role in the selection of the appropriate data protection measures for any processing operation. The rationale behind this principle is to create a framework in which personal data collected for a given legitimate purpose (purpose specification) can subsequently be used for further purposes only insofar as these are compatible (compatible use).88

Anonymization has been controversially discussed in several WP29 opinions in the light of its particular significance with regard to the purpose limitation principle. The process of anonymization goes through three different stages: the collection of the personal data, the application of the anonymization technique to this data (the anonymization process itself), and the final outcome, the anonymized data. The WP29 Opinion on anonymization techniques stated that the anonymization process applied to personal data constitutes “further processing”.

83 The core of the risk-based approach consists of incorporation by the data controllers of risk analysis and the adoption of risk-measured responses. 84 WP29 (n 15). 85 Ibid. 86 Gabe Maldoff, Top 10 operational impacts of the GDPR: PART 8 – Pseudonymization, (2016), last accessed 10 March 2016. 87 DPD (n 4), Art. 6(1) (b). 88 Article 29 Working Party, Opinion 03/2013 on purpose limitation, WP 203, Adopted on 02 April 2013.


The issue arises with regard to the compatibility89 of the anonymized data. In order to determine whether the use of anonymization is compatible with the initial purpose, the compatibility test stipulated in the WP29 Opinion on purpose limitation should be applied.90

According to this test, the assessment should take into account

“the relationship between the purposes for which the data have been collected and the purposes of further processing; the context in which the data have been collected and the reasonable expectations of the data subjects as to their further use; the nature of the data and the impact of the further processing on the data subjects; the safeguards applied by the controller to ensure fair processing and to prevent any undue impact on the data subjects”.91

This statement focuses, in a more general way, on how anonymization could be one of the useful safeguards to be taken into consideration when assessing the compatibility of further processing.92 With respect to the compatibility of the anonymization process itself, the Opinion’s rather general description leaves the matter unclear. The most logical conclusion is that the compatibility test described above should be applied to each of the different stages of the anonymization cycle. If the compatibility test shows that the subsequent purpose for which the data has been used is incompatible with the initial one, then this further processing requires one of the legal grounds set forth in Article 7 of the Directive.93

The WP29 Opinion on the purpose limitation principle proposed a case-by-case approach to determining the compatibility of the further processing purpose. The proposed appropriate measures include full anonymization, partial anonymization, pseudonymization, and encryption. The choice among these techniques depends heavily on the particular processing operations and purposes.94

The main cause of the ambiguity related to the compatibility of anonymization can be found in one of the core qualities of anonymization, namely that effectively anonymized data falls outside the scope of the data protection rules. Therefore, anonymized data could be

89 The concept of compatibility: the use of data for compatible purposes is allowed on the ground of the initial legal basis. What ‘compatible’ means, however, is not defined and is left open to interpretation on a case-by-case basis. 90 WP29 (n 88). 91 Ibid. 92 El Emam and Álvarez (n 80). 93 WP29 (n 88). 94 WP29 (n 88).


processed for purposes other than the initial collection purpose per se. For example, Khaled El Emam and Cecilia Álvarez, in their article criticizing the WP29 Opinion on Anonymisation Techniques, concluded that “anonymization should be deemed ‘compatible’ by its own nature”,95 disregarding the compatibility test requirements listed above and putting the topic in a particular context that serves to support their own views.

2.4.2. Pseudonymization

The next concept to be examined through the prism of the DPD is pseudonymization, although it is not explicitly mentioned in the text of the Directive. The definition of pseudonymization can, however, be found in the national legislation of some Member States, as well as in the above-mentioned guidelines.96 For instance, the German Data Protection Act sets forth the concept of “aliasing”, translated as pseudonymization or key-coding, which “means replacing a person’s name and other identifying characteristics with a label, in order to preclude identification of the data subject or to render such identification substantially difficult”.97 The interpretation of this concept in the WP29’s opinions has been significantly amended several times. It varies from a technique which in most cases allows the processing of indirectly identifiable data at a low level of risk exposure for the individuals at stake98 to a strategy that “merely reduces the linkability of a dataset with the original identity of a data subject”.99 The middle ground between these interpretations is that pseudonymization is not a sub-category of anonymization, and pseudonymized data is as such still deemed to be personal data, protected by the DPD.

However, some national guidelines, such as the one created by the UK Information Commissioner’s Office (ICO), interpret the way pseudonymization works somewhat differently. The ICO sets out limits to this technique in its guidance and concludes that proper pseudonymization could eventually lead to adequate anonymization.100 Still, the risk of re-identification in this case remains higher than with the use of anonymization techniques.101 The definition given by the International Organization for Standardization (ISO) makes the interpretation of this important concept even more complex. According to it, pseudonymization

95 El Emam and Álvarez (n 80). 96 ICO (n 12). 97 Bundesdatenschutzgesetz (Federal Data Protection Act, Dec. 20, 1990, BGBl. I at 2954, as amended), Art. 3(6)(a). 98 Article 29 Working Party, Opinion 4/2007 on the concept of personal data, WP 136, 20 June 2007. 99 WP29 (n 15). 100 ICO (n 12), 21; the ICO discussed this possibility in a negative formulation: “this does not mean though, that effective anonymisation through pseudonymisation becomes impossible.” 101 Ibid.


is a “particular type of anonymization that both removes the association with a data subject and adds an association between a particular set of characteristics relating to the data subject and one or more pseudonyms.”102

2.5. GDPR

The divergent interpretations of, and meanings given to, anonymization and pseudonymization by the different Member States have a significant negative effect on the practical utilization of these privacy-preserving techniques. According to the European Commission’s study of the implementation of the DPD among the EU countries, the anonymization and pseudonymization concepts constitute a “major area of divergent interpretation”,103 something that is meant to change once the GDPR enters into force. This is because, as a regulation, the GDPR is directly applicable in all Member States, in accordance with Article 288 of the Treaty on the Functioning of the European Union.104

2.5.1. Anonymization

The GDPR sets forth the definition of anonymization in Recital 26, clearly proclaiming that “the principles of data protection should therefore not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable”.105 The wording remains almost the same as that of the Directive; the difference lies in the clarification of what should constitute all “the means reasonably likely to be used”. The wording of this particular part of the Recital follows the opinion provided by the Working Party,106 even including an example, singling out, in the text of Recital 26.107 Similarly to the DPD, the “proportionality test”108 sets the requirements for identifiability: “the means reasonably likely to be used” must take into account “all objective factors”, such as technology, time, effort and cost, in a case-by-case assessment of which information may lead to identifiable persons.109

102 International Organization for Standardization, ISO/TS 25237:2008(E): Health informatics: pseudonymization, International Organization for Standardization, Geneva, (2009). 103 European Council, Annex 2, Evaluation of the Implementation of the Data Protection Directive 244, (2012). 104 TFEU (n 33). 105 GDPR (n 7), Recital 26. 106 WP29 (n 15). 107 GDPR (n 7), Recital 26. 108 De Hert and Papakonstantinou (n 39). 109 Ibid.


With regard to the further re-use of data, the GDPR presents a legitimacy test in Recital 50.110 The criteria for this test are almost identical to the ones stated by the WP29 for the compatibility test. Both tests incorporate an evaluation of the data-processing procedures in the relevant context (taking into consideration the interests at stake), the safeguards taken to prevent data protection breaches, and the proportionality of the processing. In order to be compliant, the most reliable and legally certain method of compatibility assessment for data controllers is to use a Data Protection Impact Assessment (DPIA),111 which can evaluate the relevant contextual implications of the use and re-use of data.112

2.5.2. Pseudonymization

One of the novelties presented in the GDPR is the explicit inclusion of the concept of pseudonymization in the text of the Regulation. The definition was not present in the initial Commission draft, but was added by the Parliament and confirmed by the Council. In the final text, the definition of pseudonymization is set forth in Article 4(5) of the Regulation, and reads “the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information”.113 It should be noted that there are additional conditions which have to be fulfilled in order to decrease the risks related to the processing of personal data while keeping the data valuable and usable. These requirements oblige that the additional information be “kept separately and is subject to technical and organizational measures to ensure that the personal data are not attributed to an identified or identifiable natural person”.114 Therefore, pseudonymization is a privacy-enhancing technique which provides non-attribution by keeping the directly identifying data separate from the processed information.115

Recital 28 describes one of the features of pseudonymization, namely the aim of reducing the risk of re-identification. Nevertheless, the same Recital does not exclude pseudonymized data from the scope of the GDPR and states clearly that this data remains personal: “[d]ata which has undergone pseudonymization, which could be attributed

110 GDPR (n 7) Recital 50, “[..] any link between those purposes and the purposes of the intended further processing; the context in which the personal data have been collected, in particular the reasonable expectations of data subjects based on their relationship with the controller as to their further use; the nature of the personal data; the consequences of the intended further processing for data subjects; and the existence of appropriate safeguards in both the original and intended further processing operations.” 111 Lokke Moerel and Corien Prins, Privacy for the homo digitalis, Proposal for a new regulatory framework for data protection in the light of Big Data and the Internet of Things, (2016). 112 GDPR (n 7), Art. 4 (5). 113 Ibid. 114 Ibid, Article 4 (3) b. 115 Maldoff (n 86).


to a natural person by the use of additional information, should be considered as information on an identifiable natural person”. Therefore, this type of privacy enhancing technology is “not intended to preclude any other measures of data protection”.116

As noted, Recital 28 proclaims that personal data which has undergone pseudonymization, and which could be attributed to a natural person by the use of additional information, should be considered information on an identifiable natural person. Notwithstanding, because pseudonymized data is recognized as information on an identifiable data subject, the “proportionality test” ought to be applied to the “separate additional information”, since there is otherwise a probability that the pseudonymized data will be processed as non-personal data, escaping the scope of the data protection rules.117 Many legal scholars have proposed the creation of different sub-categories of data, moving beyond the current “all or nothing” approach (personal versus non-personal data).118 This standpoint is built on a clear differentiation between the different degrees of identifiability, based on combined legal and technical knowledge. For instance, Solove and Schwartz suggested three categories of data: identified, identifiable, and non-identifiable.119 Consequently, creating a new sub-category of personal/non-personal data could be a good option that provides a certain level of flexibility, but it also raises many concerns and in general requires careful formulation and reflection, which should be specified further and more concretely.120

In one of his articles, Koops defined two possible scenarios in which pseudonymized data could serve as a “useful in-between category”.121 The first possibility covers the above-specified case, in which the pseudonymized data achieves a sufficient level of protection, which can be assessed by passing the “proportionality test” (there is no reasonable expectation that the “additional information” would be used to enable a match or identification); pseudonymized data could then be a substantial subcategory of non-personal data. As a result, there is a self-contradiction in trying to regulate something which presumably should be out of the scope of the rules because it is no longer personal data. Secondly, if the risk of linkability remains high even though the additional information is subject to the prescribed organizational and technical measures, then the pseudonymized data constitutes

116 GDPR (n 7), Recital 28. 117 De Hert and Papakonstantinou (n 39). 118 Esayas (n 24). 119 Paul Schwartz & Dan Solove, The PII Problem: Privacy and a New Concept of Personally Identifiable Information, 86 NYU L. Rev. 1814, (2011). 120 Koops (n 5).


personal data. In this scenario, pseudonymized data could qualify as a meaningful sub-category of personal data, which, according to the Regulation, is subject to certain safe harbors (these will be described in more detail in the fourth chapter).122

2.6. Anonymization and Pseudonymization Techniques

Now that the legal definitions have been given, a short overview of the technical aspects behind anonymization and pseudonymization will be presented. As identified by the Working Party, anonymization techniques can be divided into two sub-categories, namely generalization and randomization.123 “Generalization” refers to a method whose aim is to modify or generalize the data in such a way that it is no longer possible to single out the individual concerned,124 for example by changing a full birth date to a year or month of birth.125 The second category of techniques, “randomization”, aims to reduce the risk of identifying the data subject by adding elements to the data; it consists of different techniques such as noise addition, differential privacy, and swapping.126 Pseudonymization techniques have been categorized as a process in which one attribute (typically a unique attribute) in a dataset is replaced by another.127 Different pseudonymization techniques exist depending on the kind of pseudonym used. For example, the initial identifier can be replaced by a random pseudonym which is different from the initial value (e.g. a random surname picked by the data subject), or it can be derived from the original values of an attribute or set of attributes (e.g. through a hash function or encryption scheme).128 As an outcome, the data subject is still potentially indirectly identifiable. From a practical point of view, in many situations several of these techniques could be applied in combination in order to achieve better overall privacy and data protection. The WP29 also suggests a case-by-case assessment by the data controller in order to determine the most appropriate techniques.129
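
As a purely illustrative sketch of the two families of techniques just mentioned (and not of any particular standard or of the WP29’s own examples), the following Python snippet generalizes a birth date to the year of birth and adds random noise to an exact numeric value; the record and the noise scale are hypothetical. A real deployment would calibrate such perturbation carefully, for instance against a differential-privacy budget.

    import random

    def generalize_dob(dob: str) -> str:
        """Generalization: keep only the year of birth ('1984-06-02' -> '1984')."""
        return dob.split("-")[0]

    def add_noise(value: float, scale: float = 500.0) -> float:
        """Randomization (noise addition): perturb the value by a random amount."""
        return value + random.uniform(-scale, scale)

    record = {"dob": "1984-06-02", "salary": 42000.0}
    softened = {
        "dob": generalize_dob(record["dob"]),
        "salary": round(add_noise(record["salary"]), 2),
    }
    print(softened)  # e.g. {'dob': '1984', 'salary': 42311.57}

The trade-off discussed in Chapter 3 is already visible here: the coarser the generalization and the larger the noise, the lower the re-identification risk, but the lower the utility of the data as well.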

The Opinion analyses the main rationale behind the different techniques by assessing their robustness (strengths and weaknesses), as well as the common shortcomings in the use of each technique. The assessment of these techniques is based on three risk criteria, namely

122 Koops (n 5). 123 It should be noted that it is possible to choose different classifications for the techniques. Christopher Millard, Cloud computing law, Oxford University Press, New York, NY, (2013). 124 WP29 (n 15). 125 Ibid. 126 Ibid. 127 Ibid. 128 Ibid. 129 Ibid.


singling out, linkability, and inference.130 These criteria introduced by the WP29 will be addressed in more detail in Chapter 4.

2.7. Conclusion

This chapter has defined the key concepts that will be used in the analysis throughout this paper. Firstly, the definitions of personal data, anonymization, and pseudonymization were set out as an important starting point for identifying the challenges presented in the next chapter. Secondly, the compatibility of anonymization as further processing of personal data was explained. Finally, anonymization and pseudonymization techniques were briefly presented in order to complete the overview given in this chapter. The next chapter will identify the challenges to anonymization and pseudonymization in the light of the risk of re-identification and of new technologies.

130 Ibid.



Chapter 3 – Identifying the threats

This chapter delves further into the challenges that surround the concepts explained in the previous chapter. Over the past years, the debate over “utility versus privacy” has attracted considerable attention and controversy. Leading computer scientists describe the faith placed in anonymization as a “broken promise of privacy”.131 The main argument for this claim is that anonymization cannot both protect the personal data of the individuals concerned (privacy) and keep the data in a useful format for further use (utility).132 Re-identification, the biggest threat to and the conceptual opposite of anonymization, will be approached from different perspectives in order to present it in a coherent manner. In this light, several landmark cases revealing the tangible weaknesses of anonymization as a means of protecting personal data will be presented. The practical value of anonymization and pseudonymization will then be evaluated in the light of the new digital era’s challenges, such as BD and profiling.

3.1. Re-identification

The concepts of anonymization and pseudonymization face obstacles in the form of the ubiquitous development of sophisticated technologies and the huge amounts of available data. Due to these factors, the process opposite to anonymization, namely re-identification, becomes an “increasingly common and present threat”.133 WP29 recognizes the risk of re-identification as a serious threat and evaluates its impact on the different anonymization techniques.134 The actor that performs the re-identification is known as the adversary. The adversary aims to link the already anonymized data to outside or auxiliary data (e.g. publicly available information), with the final goal of learning the identity of the data subjects.135 Computer scientists do not label the adversary as good or bad;136 they simply define it as an actor whose aim is something the data controller does not want to happen.137 Likely adversaries include investigators, stalkers, nosy colleagues, employers,

131 Ohm (n 29). 132 Ibid, 1707–11 133 Article 29 Data Protection Working Party, Opinion 03/2013 on open data and public sector information ('PSI') reuse, Adopted on 05 of June 2013 134 WP29 (n 15); See more: Section 2.6. 135 Ohm (n 29). 136 Ibid; See Irit Dinur & Kobbi Nissim, Revealing Information While Preserving Privacy, in Proc. 22nd Acm Symp. On Principles Database Sys. 202-203 (2003), < http://portal.acm.org/citation.cfm?id=773173.>, last accessed 07 March 2016. 137 Ibid, 4.


neighbors, or data brokers employing re-identification against their existing datasets to enrich their records.138

This reversing process threatens the very distinction on which data protection is based: that between personal and non-personal data. This is mainly because a data controller processing already anonymized data may be unaware that a third party could be in a position to single out certain individuals from the anonymized data set.139 This can lead to a false or misleading faith in the “anonymized” data. Re-identification studies examine the so-called “risk of re-identification”, to which data protection legislation should respond appropriately in order to uphold its own principles.

In the next section three re-identification studies will be presented, all of which played an important role in the controversial debate surrounding re-identification. Further, to clarify the concept of re-identification, the different views on the level of risk it poses to individuals will be discussed. There are two prominent positions with regard to the risk of re-identification. Firstly, the critics of anonymization proclaim the potential threat of re-identification, which undermines the legal significance given to the concept of anonymization. Secondly, the defenders of anonymization argue that the risk of re-identification is exaggerated to some extent and that most of the established techniques will suffice to keep the data anonymized.140 The middle ground between both positions will also be discussed in order to derive the positives and negatives that surround the debate.

3.2. The landmark re-identification studies

There are three landmark cases which support the notion of easy re-identification.141 All of them occurred in the United States; however, they are also relevant to European Data Protection legislation, mostly because they expose drawbacks of certain anonymization techniques which go beyond the limits of any particular applicable law. Moreover, WP29 refers to them when discussing the present re-identification threat142 and the different anonymization techniques in order to express its concerns about re-identification.143 More recent studies,

138 Arvind Narayanan and Edward W. Felten, No silver bullet: De-identification still doesn't work, (2014), , last accessed 10 March 2016. 139 WP29 (n 133). 140 Jane Yakowitz, Tragedy of the Data Commons, 25 Harv. J.L. & Tech. 4, (2011). 141 Ohm (n 29). 142 WP29 (n 133). 143 WP29 (n 15).


which give an account of the increasing sophistication of de-anonymization science, will also be discussed.

3.2.1. Massachusetts Medical Database

A famous and widely cited study conducted by Latanya Sweeney indicates that 87.1% of the citizens of the United States could be uniquely identified by gender, date of birth, and ZIP code alone.144 Sweeney achieved these results using 1990 census data.145 Moreover, Dr. Sweeney’s research reveals that even different combinations of less distinctive data could lead to a high percentage of specific individuals being singled out. For example, knowing a particular citizen’s birth date, gender and city, 53% of all Americans could be identified.

Furthermore, by using these identifiers derived from publicly available data, Dr. Sweeney was able to re-identify publicly released “anonymized” data relating to specific hospital visits.146 The data had been announced as protected because patients’ identifiers such as name, social security number, and address had been removed.147 However, the entity that applied the privacy-preserving technique and made the public release did not remove the “deadly” combination of gender, date of birth, and ZIP code.148
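The mechanics of such a linkage attack can be illustrated with a minimal sketch: a “de-identified” medical release is joined with a public register (here a hypothetical voter roll; all records are invented) on the quasi-identifier combination of gender, date of birth, and ZIP code.

```python
# Illustrative sketch of a linkage attack of the kind Sweeney performed:
# joining a "de-identified" medical release with a public register on the
# quasi-identifiers (gender, date of birth, ZIP). All records are made up.

medical_release = [  # direct identifiers removed, quasi-identifiers kept
    {"gender": "F", "dob": "1957-07-22", "zip": "02138", "diagnosis": "asthma"},
    {"gender": "M", "dob": "1971-01-03", "zip": "02139", "diagnosis": "flu"},
]

voter_roll = [  # publicly available auxiliary data
    {"name": "Alice Example", "gender": "F", "dob": "1957-07-22", "zip": "02138"},
    {"name": "Bob Example", "gender": "M", "dob": "1980-11-30", "zip": "02139"},
]

def link(medical, voters):
    """Re-identify by exact match on the quasi-identifier combination."""
    index = {(v["gender"], v["dob"], v["zip"]): v["name"] for v in voters}
    for rec in medical:
        key = (rec["gender"], rec["dob"], rec["zip"])
        if key in index:
            yield index[key], rec["diagnosis"]

for name, diagnosis in link(medical_release, voter_roll):
    print(f"{name} -> {diagnosis}")   # Alice Example -> asthma
```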

3.2.2. AOL

The second case, which reveals weaknesses in the process of pseudonymization and, at the same time, the very real risk presented by re-identification, relates to American On Line (AOL).149 It took place in 2006, after the company announced a new product named “AOL Research”. This product was supposed to publicly release a huge amount of search queries generated on AOL’s website while maintaining a high

144 Latanya Sweeney, Uniqueness of Simple Demographics in the U.S. Population, Laboratory for Int'l Data Privacy, Working Paper LIDAP-WP4, (2000). 145 According to another study the above cited percentage is lower – 61% using the 1990 census data and 63% using the 2003 census data. Philippe Golle, Revisiting the Uniqueness of Simple Demographics in the US Population, Palo Alto Research Center, (2006), , last accessed 10 March 2016. 146 Sweeney (n 144). 147 Ibid. 148 Ohm (n 29) cited in Latanya Sweeney statement in Recommendations to Identify and Combat Privacy Problems in the Commonwealth: Hearing on H.R. 351 Before the H. Select Comm. on Information Security, 189th Sess. (2005), < http://dataprivacylab.org/dataprivacy/talks/Flick-05-10.html>, last accessed 10 March 2016. 149 Michael Barbaro & Tom Zeller, Jr., A Face Is Exposed for AOL Searcher No. 4417749, N.Y. TIMES, Aug. 9, 2006.


quality of data protection by attempting to anonymize the data prior to the public release.150 WP29 refers to this case as an example of a common misconception between anonymization and pseudonymization.151 In other words, AOL claimed to have used an anonymization technique, while in fact the technique used led merely to pseudonymization. AOL tried to keep hidden the identity of the individuals behind the search queries by replacing certain identifiers, such as users’ names and IP addresses, with a number; nevertheless, the privacy-preserving feature of this product was a complete failure.

The New York Times reporters Michael Barbaro and Tom Zeller showed that the data pseudonymized by AOL could be linked to the users’ direct identifiers even without employing sophisticated re-identification techniques.152 For example, the search query “dog that urinates on everything” was linked to user number 4417749, behind which stood the real identity of a sixty-two-year-old woman named Thelma Arnold.153 The two reporters were able to reveal the user’s identity through leftover details of specific uniqueness in the publicly released data. More particularly, the uniqueness of the user’s search queries was the key element enabling the re-identification of the pseudonymized data.154

3.2.3. Netflix

The third well-known re-identification case relates to the Netflix Prize contest and took place in 2006.155 The movie rental company Netflix organized a competition aimed at improving its movie recommendation service. Participants in the contest were allowed to use the Netflix database with one million ratings of thousands of movies by almost half a million users.156 Similarly to the AOL case, the database was “anonymized” by replacing all direct identifiers, such as users’ names, with indirect identifiers, and by “deliberately perturbing” the dataset: eliminating some ratings, adding “noise”157 to ratings and dates, and changing the dates on which the ratings were given.158

150 Ibid. 151 WP29 (n 15) 9. 152 Barbaro and Zeller (n 149). 153 Ibid 154 Ohm (n 29), 1723. 155 The Netflix Prize Rules, Netflix Prize, last accessed 14 March 2016. 156 Ibid. 157 Noise in the terms that the ratings were expressly increased or decreased slightly in order to be more difficult for re-identification. 158 Arvind Narayanan & Vitaly Shmatikov, Robust De-Anonymization of Large Datasets, 29 Proc. Ieee Symposium on Security & Privacy 111, 111–12 (2008).



The data protection promised to users through anonymization was again a failure. The researchers who revealed the shortcomings of the anonymized data were Narayanan and Shmatikov.159 They took part in the Netflix Prize contest; however, their aim was not to improve the company’s rating-prediction algorithm but to test the re-identifiability of the publicly released data.160 Their finding was that the publicly released data, albeit obfuscated, could easily be used to re-identify the initial data set if the potential adversary possessed additional information about some of a user’s movie rentals and preferences.161 For example, a rough knowledge of someone’s movie preferences (information that could be gathered during a friendly conversation) could be enough to uncover the rest of that individual’s movie-watching history.162 They further argued that such additional information can be found in many other publicly available sources. For their study, they used the publicly available information in the Internet Movie Database (IMDb), which has a similar user movie rating system.163 Using these re-identification techniques, Narayanan and Shmatikov were able to find that “user 1337 had rated Gattaca a 4 on March 3, 2003, and Minority Report on November 10, 2003”.164
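To make the mechanism concrete, the following sketch (not the authors’ actual code; all data, tolerances, and identifiers are invented) illustrates how approximate knowledge of a few ratings and dates can suffice to pick out a single record in a sparse, pseudonymized dataset.

```python
# Sketch of the idea behind the Narayanan-Shmatikov attack on sparse rating
# data: score every released record against a handful of approximately known
# (movie, rating, date) observations and pick the best match.

from datetime import date

released = {  # pseudonymous user id -> {movie: (rating, date)}
    "u1337": {"Gattaca": (4, date(2003, 3, 3)), "Minority Report": (5, date(2003, 11, 10))},
    "u2048": {"Gattaca": (2, date(2004, 6, 1)), "Amelie": (5, date(2004, 6, 2))},
}

# What the adversary roughly knows about the target (e.g. from IMDb or a chat).
auxiliary = {"Gattaca": (4, date(2003, 3, 1)), "Minority Report": (5, date(2003, 11, 14))}

def score(profile, aux, rating_tol=1, date_tol_days=14):
    """Count auxiliary observations approximately matched by a released profile."""
    hits = 0
    for movie, (aux_rating, aux_date) in aux.items():
        if movie in profile:
            rating, rated_on = profile[movie]
            if abs(rating - aux_rating) <= rating_tol and abs((rated_on - aux_date).days) <= date_tol_days:
                hits += 1
    return hits

best = max(released, key=lambda uid: score(released[uid], auxiliary))
print(best)  # "u1337": the obfuscated record that best fits the auxiliary data
```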

These three famous case studies demonstrate the existing risk of re-identification. The “pockets of surprising uniqueness”, specific elements remaining in the data, were enough to identify users even though the publicly released data had been anonymized or pseudonymized by the data controllers.165 This uniqueness in certain data elements has been compared to fingerprints left at a crime scene.166 The outcomes imply that perhaps everything can be personal data to someone with access to the right auxiliary information.167 These three cases therefore cast serious doubt on anonymization techniques, even though sophisticated re-identification techniques were not used in any of them. Following these critiques, a question mark hangs over the balance between preserving the personal data of the data subjects concerned and releasing it publicly in a useful format.

159 Ibid. 160 Felix T. Wu, Defining Privacy And Utility In Data Sets, University Of Colorado Law Review vol 84, (2013), 1119 161 Narayanan and Shmatikov (n 158). 162 Ibid 122 163 Ibid 122-123 164 Ibid 123 165 Ohm (n 29). 166 Ibid. 167 Ibid.



3.3. Utility versus Privacy

The rationale behind anonymization is that its use will preserve the privacy of the individuals concerned by removing the “personally identifiable information”, while the anonymized data retains its utility.168 European Data Protection legislation attributes an important role to anonymization, as demonstrated in the previous chapter. This role has been challenged by many computer scientists, epidemiologists, and statisticians. For the purposes of this paper, the divergent opinions surrounding the privacy-versus-utility debate will be divided, in a general way, into three sub-groups: the pragmatists, who proclaim the impossibility of co-existence between the two; the formalists, who support the opposite view; and a third group holding a middle-ground perspective.

The pragmatists describe the connection between utility and privacy as “two concepts at war”.169 In other words, anonymization should not be recognized as a “panacea” that could bring the desired balance between concepts such as security, innovation, and the free flow of data.170 Another commonly shared claim is that utility and privacy are so directly interrelated that “as the utility of data increases even a little, the privacy plummets”.171 Perfect protection of personal data would require no public release of the data at all, while perfect utility would require the information to be published in its initial format without any limitations.172 One of the main factors in the failure of anonymization is the enormous amount of publicly available data from which adversaries can draw.173 Furthermore, Charu Aggarwal has highlighted that records of individual transactions and preferences are “sparse” in their multi-dimensional space,174 which increases the probability that re-identification succeeds, decreases the amount of publicly available information needed, and

168 Ohm (n 29), 1701. 169 Shuchi Chawla et al., Toward Privacy in Public Databases, in 2 Theory Cryptography Conf. 363 (2005). 170 Ohm (n 29), 1736. 171 Ohm (n 29), 1751. 172 Ibid, 1752. 173 Cynthia Dwork, Differential Privacy, in 33rd International Colloquium on Automata, Languages And Programming Part Ii, at 1, 2 (2006), last accessed 02 March 2016; cited in Ira S. Rubinstein And Woodrow Hartzog, Anonymization And Risk, New York University School Of Law Public Law & Legal Theory Research Paper Series Working Paper No. 15-36, (2015), 712 174 “high dimensional data sets” - each record contains many attributes (i.e., columns in a database schema), which can be viewed as dimensions; “sparse” data set is one in which each individual record contains values only for a small fraction of attributes. sparse data sets include not only recommendation systems but also any real-world data sets of individual transactions or preferences. See, Narayanan and Shmatikov (n 158).


improves the accuracy of re-identification attacks.175 In such cases, almost any variable is likely to constitute a clear quasi-identifier.176 This view is supported by the Brickell-Shmatikov paper on the Netflix Prize contest, which “demonstrate[s] that even modest privacy gains require almost complete destruction of the data-mining utility”.177 This strong argument against anonymization is referred to as “the auxiliary information problem”.178
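Differential privacy, mentioned in Section 2.6 among the randomization techniques, makes this trade-off explicit. The following sketch (toy data and illustrative parameters only) adds Laplace noise to a simple counting query: the stronger the privacy guarantee (smaller epsilon), the noisier, and hence less useful, the answer.

```python
import math
import random

# Toy records; values are invented purely for illustration.
salaries = [31000, 45000, 52000, 38000, 61000, 47000]

def laplace_noise(scale):
    """Sample from a Laplace(0, scale) distribution via inverse transform."""
    u = random.random() - 0.5
    return -scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))

def noisy_count_above(threshold, epsilon):
    """Differentially private counting query (sensitivity 1, Laplace mechanism)."""
    true_count = sum(1 for s in salaries if s > threshold)
    return true_count + laplace_noise(1.0 / epsilon)

# Smaller epsilon = stronger privacy guarantee but a noisier (less useful) answer.
for epsilon in (0.1, 1.0, 10.0):
    print(epsilon, round(noisy_count_above(50000, epsilon), 2))
```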

On the other side of the debate are the supporters of the idea that anonymization is not yet dead.179 Firstly, they argue that the critics of anonymization have neglected the utmost importance of publicly releasing data.180 Yakowitz emphasized this vital role of publicly released data and warned that further restrictive measures, such as ending or limiting the amount of shared information, could lead to “a new tragedy of the data commons”.181 Yakowitz described the data commons as the “diffuse collections of data made broadly available to researchers with only minimal barrier to entry”.182 Secondly, authors such as Yakowitz and Daniel Barth-Jones claimed that the potential threats of re-identification are primarily theoretical and have been overstated. Moreover, they argued that the above-mentioned famous cases (the Massachusetts database, AOL, and Netflix) should not be a decisive factor in the ongoing debate, given their unrepresentative character and their exaggeration by the media.183 Accordingly, authors such as Yakowitz

175 Charu C. Aggarwal, On k-Anonymity and the Curse of Dimensionality, in Proceedings Of The 31st International Conference On Very Large Data Bases, 901, 909 (2005), http://www.vldb2005.org/program/paper/fri/p901-aggarwal.pdf [https://perma.cc/QZ9E-HQDV] 176 See more: Narayanan and Shmatikov (n 158). 177 Justin Brickell & Vitaly Shmatikov, The Cost of Privacy: Destruction of Data-Mining Utility in Anonymized Data Publishing, 14 Proc. Acm Sigkdd Int’l Conf. On Knowledge Discovery & Data Mining 70, 70 (2008). Cited in Wu (n 160). 178 Ira S. Rubinstein and Woodrow Hartzog, Anonymization And Risk, New York University School Of Law Public Law & Legal Theory Research Paper Series Working Paper No. 15-36, (2015), 713 179 For objections to the ‘death of anonymization’ narrative, see, for example, Jane Yakowitz Bambauer, Is De- Identification Dead Again?, Info/L. Blog (Apr. 28, 2015), last accessed 07 March 2016. 180 Jane Yakowitz, Tragedy of the Data Commons, 25 Harv. J.L. & Tech. 1, 2–3 (2011), ; last accessed 05 March 2016. 181 Ibid. 182 Ibid. 183 Jane Yakowitz Bambauer, Is De-Identification Dead Again?, Info/L. Blog (Apr. 28, 2015), last accessed 10 March 2016. ; Daniel C. Barth-Jones, Press and Reporting Considerations for Recent Re-Identification Demonstration Attacks: Part 2 (Re-Identification Symposium), Bill Health Harv. L. Blog (Oct. 1, 2013), , last accessed 10 March 2016.


proposed solutions which would make the dissemination of anonymized data easier.184

The third perspective balances the two previous extremes and seeks to reap the positives of both positions. It approaches the cited case studies differently, not as examples of the failure of anonymization, but as something from which data controllers can learn and upgrade their current level of data protection.185 For example, Overstock.com organized a product-recommendation contest similar to Netflix’s, but took the earlier anonymization failure into account: the company limited the anonymized prize data made available to contestants and thereby minimized the risk of re-identification.186 Moreover, both sides have “misinterpreted, or at least” selectively used the relevant computer science sources to support their own views.187 The largely hypothetical and contextual way in which the disputing parties discuss the “failure” or “success” of anonymization cannot serve as a key policymaking yardstick. Overall, anonymization covers a range of technical measures which are useful for certain purposes but not for others.188 What matters is how well those purposes fit the legislative and policy objectives, and the public concerns, that should be achieved,189 something which is more a social choice than pure mathematics.190 The benefits gained from this dispute can feed into a sound framework for this controversial subject; one such benefit is the definition of the risks associated with re-identification. The link between legislative interpretation, mathematical theory, and the practical meaning of concepts such as anonymization can be built by focusing on the potential threats. The digital era has opened up new technological challenges which require a stronger notion of privacy than ever before: data protection legislation able to deal with stronger, more sophisticated, and more knowledgeable attackers. This is where computer scientists’ theories can be vital.

184 For example, Yakowitz’s proposal imposes two conditions on a data controller: ‘(1) strip all direct identifiers, and (2) either check for minimum subgroup sizes on a preset list of common indirect identifiers—such as race, sex, geographic indicators, and other indirect identifiers commonly found in public records—or use an effective random sampling frame.’ Ibid 185 Rubinstein and Hartzog (n178), 723 186 See Steve Lohr, The Privacy Challenge in Online Prize Contests, N.Y. Times (May 21, 2011), http://bits.blogs.nytimes.com/2011/05/21/the-privacy-challenge-in-online-prize- contests/[https://perma.cc/RHS9-ZX29] last accessed 10 March 2016. 187 Wu (n 160), 1124. 188 Ibid. 189 Ibid. 190 Ibid.



3.4. New Challenges

More recent research studies demonstrate that the risk of re-identification is an increasingly common threat and should be approached with even greater consideration.191 The unprecedented amount of publicly available online and offline information fuels strong criticism of the achievability of effective anonymization. For example, publicly available photos of celebrities entering New York City taxicabs eventually resulted in a surprising failure of the “promised” anonymization.192 The logic used in these research studies was the same as in the previous examples.193 Comparing the publicly released anonymized data (173 million NYC taxi trips)194 with an insignificant amount of auxiliary data led to conclusions such as that Bradley Cooper took a taxi cab to Greenwich Village and then visited a restaurant called Melibea, where he had dinner and paid $10.50, with no recorded tip.195 Moreover, the increasing array of widespread sensors and different types of mobile devices has facilitated widespread end-to-end global connectivity.196 All these factors contribute to the vast accumulation of data and at the same time fuel the development of sophisticated tools such as BD, which businesses use for purposes including profiling of data subjects for behavioral advertising.197 These technological innovations have had an essential impact on our daily lives, from remodeling society to re-shaping existing concepts of identity.198 But what exactly is BD, and how does it affect anonymization?

3.4.1. Big Data

BD is a term used to describe “a massive volume of both structured and unstructured data that is so large that it’s difficult to process using traditional database and software techniques”.199 The initial definition of BD was generally related to the “three Vs” (volume, variety, and velocity). In recent years the concept of BD has broadened its scope (mostly because of private companies’ marketing policies) to “seven Vs” (the

191 Jules Polonetsky, Omer Tene and Kelsey Finch, Shades of Gray: Seeing The Full Spectrum Of Practical Data De-Identification, in Santa Clara Law Review, (2016). 192 Anthony Tockar, Riding with the Stars: Passenger Privacy in the NYC Taxicab Data Set, September 15, 2014, last accessed 10 March 2016. 193 Ibid. 194 NYC Taxi Trips, last accessed 10 March 2016. 195 Tockar (n 192). 196 Gloria González Fuster and Amandine Scherrer, Big data and smart devices and their impact on privacy, Study for LIBE Committee, (2015), last accessed 10 March 2016. 197 Ibid. 198 Ibid. 199 last accessed 10 March 2016.


primary three, plus viscosity, variability, volatility and veracity).200 BD builds upon data analytics capable of processing vast amounts of data, including unforeseen information, which can eventually generate unpredicted and potentially useful outcomes.201 The concept is defined by two intrinsic characteristics: on the one hand, powerful knowledge discovered by using large quantities of information fulfilling the criteria of the “three Vs”, namely large size (volume), generated in near real time (velocity), and diverse (variety); on the other hand, the use of advanced data processing techniques that make it possible to detect previously unknown patterns.202 These patterns can be used to identify general trends or recognize anomalies; based on past and present data, such patterns may possess a predictive quality, in the sense that they can be used to forecast what may happen in the future.203 Taken together, these two intrinsic characteristics explain BD’s main rationale: the larger the amount of available data to be processed, irrespective of its apparent value or interest, the greater the chance that unpredicted, and potentially useful, information can be derived.204 Supporters of BD stress that this new technique is a very promising area that can bring economic growth, noting that it “has been dubbed the oil of the digital economy”.205

A survey reveals that 57% of the businesses that took part in it consider themselves to be “managing BD”, in the sense of “very large datasets” including “streaming data from machines, sensors, web applications and social media”.206 The so-called “datafication” process involves an estimated 2.3 trillion gigabytes of data being stored and combined with other information every day. These figures reveal the trend that

200 Pierre Delort, Le Big Data, Que Sais-Je ?, Paris: Presses Universitaires de France, (2015); cited in Ibid. 201 Viscosity – refers to the amount of resistance that affects the flow of data; variability – refers to the rate of change of the flow of data; volatility – refers to how long the data should be stored for, and how long the data components will be valid for use; veracity – refers to any noise and/or bias that is part of the data. See more: John Girard, Deanna Klein, and Kristi Berg, Strategic Data-Based Wisdom In the Big Data Era, IGI Global Book Series AKATM, (2015), 3. 202 Gloria González Fuster and Amandine Scherrer, Big Data and smart devices and their impact on privacy, Study for LIBE Committee, 2015: last accessed 14 May 2016. 203 Ibid. 204 Ibid. 205 Alex Pentland, Society's nervous system: building effective government, energy, and public health systems. (2011). 206 Philip Russom, ‘TDWI Best Practices Report: Managing Big Data’, Fourth Quarter 2013. Cited in Preliminary Opinion of the European Data Protection Supervisor Privacy and competitiveness in the age of Big Data: The interplay between data protection, competition law and consumer protection in the Digital Economy, (2014), last accessed 10 May 2016.


data themselves are becoming the foundation for a separate branch of services, such as fitness trackers and global mapping.207

Moreover, BD extends far beyond personal data by including aggregated and anonymous data in its analytic process.208 This shifts the whole paradigm of data protection law: data controllers such as private companies may consider certain data to be non-personal owing to anonymization, while nowadays it is rare for data generated by user activity209 to be completely and irreversibly anonymized.210

That is exactly where the danger of anonymized data lies. The threat of sophisticated re-identification, enhanced by BD, distorts the current anonymization model by allowing the correlation of huge amounts of non-personal data which, linked together, become personal. This threat therefore undermines the current differentiation between personal and non-personal data.211 According to Solove, BD also threatens individuals through the accumulation of data relating to a particular person, an issue known as aggregation.212 This dignitary harm is directly related to the sophisticated analytic power, the continuous combination of numerous sources, and the large scale that BD incorporates.213 All these features make aggregation more revealing, more intrusive, and more granular.214 Ohm depicts this privacy-harmful interplay between aggregation and re-identification as the

207 Source for estimated daily generation of data: IBM. See Mayer-Schönberger, V., and Cukier, K., Big Data, A Revolution That Will Transform How We Live, Work and Think, Eamon Dolan/Houghton Mifflin Harcourt, Reprint edition (2013), 94 – 97. 208 Preliminary Opinion of the European Data Protection Supervisor Privacy and competitiveness in the age of Big Data: The interplay between data protection, competition law and consumer protection in the Digital Economy March 2014, last accessed 21 May 2016. 209 Masses of personal information are generated by over 369m internet users in the EU through their consumption of social media, games, search engines and e-commerce and other services. Information on subscribers to a given online service which is collected includes names, gender, personal preferences, location, email addresses, IP addresses and surfing history. Ibid 210 WP29 (n 211 Ohm (n 29), 1751. 212 Daniel J. Solove, Access and Aggregation: Public Records, Privacy and the Constitution, 86 Minn. L. Rev. 1137, 1185 (2002), (‘The aggregation problem arises from the fact that the digital revolution has enabled information to be easily amassed and combined.’). 213 See Daniel J. Solove, A Taxonomy of Privacy, 154 Penn. Law Review 477, (2006), 506, (noting that aggregation “can cause dignitary harms because of how it unsettles expectations. … Aggregation upsets these expectations, because it involves the combination of data in new, potentially unanticipated ways to reveal facts about a person that are not readily known”). 214 Ira. S. Rubenstein, Big Data: The End of Privacy or a New Beginning?, International Data Privacy Law Advance Access, (2013), last accessed 10 March 2016.



“database of ruin”,215 mainly because this interplay leads to ever more information being gathered and added to an individual’s profile in the hands of data controllers. The cases discussed above have already shown that re-identification is possible by combining anonymized data sets with related auxiliary information; this threat is now heightened by the involvement of BD.

The impact of BD on re-identification has been demonstrated by Yves-Alexandre de Montjoye as part of the Massachusetts Institute of Technology’s (MIT) BD initiative, which examined the re-identification of mobile phone and credit card metadata.216 More particularly, de Montjoye argued that, in terms of privacy, individuals can no longer “hide in the crowd” and that anonymization’s promise to “hide” them is broken.217 The already mentioned “auxiliary information problem” is also discussed in terms of BD in this study. The project introduces the concept of “unicity”, which quantifies “how much outside information one would need, on average, to re-identify a specific and known user in a simply anonymized data set”.218 As could be expected, the higher the unicity of a given data set, the higher the rate of re-identification.219 The study reveals the high level of uniqueness of mobile phone metadata: 95% of individuals were identifiable with the help of just four random spatiotemporal points (e.g. home address).220 This feature facilitates re-identification using less outside information.221 The quantification of credit card data shows quite similar results.222 De Montjoye demonstrates that it is reasonable to expect that “most large-scale metadata sets—for example, browsing history, financial records, and transportation and mobility data—will have a high unicity.”223 Nevertheless, he suggests that data protection legislation should take a risk-tolerance approach that includes tools such as quantitative assessments of the likelihood of re-identification.224
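The intuition behind unicity can be illustrated with a small sketch (the traces and parameters below are invented): in each trial, an adversary is assumed to know a handful of points from a target’s trace and checks whether those points match exactly one record in the released data.

```python
# Illustrative sketch of the "unicity" idea: what fraction of users is uniquely
# pinned down by p randomly chosen (location, hour) points from their own trace?

import random

traces = {  # pseudonymous user -> set of (cell tower, hour) observations
    "u1": {("A", 9), ("B", 13), ("C", 18), ("A", 22)},
    "u2": {("A", 9), ("D", 13), ("C", 19), ("E", 22)},
    "u3": {("F", 8), ("B", 13), ("C", 18), ("A", 23)},
}

def unicity(traces, p, trials=1000):
    """Estimate the share of users uniquely matched by p points of outside knowledge."""
    unique = 0
    users = list(traces)
    for _ in range(trials):
        target = random.choice(users)
        known = set(random.sample(sorted(traces[target]), p))   # adversary's auxiliary points
        matches = [u for u, t in traces.items() if known <= t]  # traces containing all points
        if matches == [target]:
            unique += 1
    return unique / trials

for p in (1, 2, 3):
    print(p, unicity(traces, p))   # unicity grows quickly with the number of points
```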

215 Ohm (n 29), 1747. 216 Yves-Alexandre de Montjoye et al., Unique in the Shopping Mall: On the Reidentifiability of Credit Card Metadata, 347 SCI. 536, 537 (2015). 217 Ibid. 218 Ibid. 219 Rubenstein (n 214). 220 Montjoye (n 216). 221 Ibid 439. 222 Ibid 537 (showing that only 4 spatiotemporal points are enough to uniquely re-identify 90% of shoppers using credit cards). 223 Ibid. 224 Ibid.



On the other side of the barricade, scholars such as Ann Cavoukian and Daniel Castro remain firmly of the opinion that anonymization is still an effective privacy-preserving tool, especially valuable and applicable in a BD environment.225 They argue that the above-mentioned re-identification studies lack a practical demonstration of how the proclaimed re-identification could be performed and that no specific individuals were singled out in these studies.226 A further argument in favor of this view is that acquiring the required auxiliary information, such as individuals’ work addresses (part of the four spatiotemporal elements in de Montjoye’s study that lead to the striking 95% re-identification rate), would be quite a challenging task for an attacker, especially when the sources used are limited to publicly available ones.227

To demonstrate how intense the debate is, just one month after the publication of the above-mentioned paper by Cavoukian and Castro, two other scientists responded directly to it, stating eight points of disagreement228 and proclaiming that anonymization is not a “silver bullet” for individuals’ privacy.229 Another paper reveals the effectiveness of a new re-identification algorithm targeting anonymized social network graphs, indicating that a third of the verified users with accounts on both Flickr230 and Twitter231 can be identified in the anonymous Twitter graph with only a twelve percent error rate.232 Moreover, with regard to the realities of BD empowerment, one study shows that even mere snippets of information, such as a mobile phone’s remaining battery life, can serve as identifiers singling out a person from the crowd.233

225 Ann Cavoukian and Daniel Castro, Big Data and Innovation, Setting the Record Straight: Deidentification Does Work (2014), available at last accessed 14 May 2016. 226 Ibid. 227 Ibid. 228 The main arguments are that: ‘(i) there is no evidence that de-identification works either in theory or in practice and (ii) attempts to quantify its efficacy are unscientific and promote a false sense of security by assuming unrealistic, artificially constrained models of what an adversary might do.’ 229 Arvind Narayanan and Edward W. Felten, No silver bullet: De-identification still doesn't work, July 9, 2014, last accessed 10 May 2016. 230 Online photo-sharing site. 231 Microblogging service. 232 Arvind Narayanan & Vitaly Shmatikov, De-Anonymizing Social Networks, 30th IEEE Symposium on Security & Privacy, (2009), 173. 233 Lukasz Olejnik, Gunes Acar, Claude Castelluccia & Claudia Diaz, The Leaking Battery: A Privacy Analysis of the HTML5 Battery Status API, (2015), last accessed 10 March 2016.



“Anonymization plays a central role in modern data handling, forming the core of standard procedures for storing or disclosing personal information”.234 The risk of re-identification undermines the legal effects given by the lawmakers to anonymization. More particularly, BD and the enormous availability of auxiliary information challenge the main goal of anonymization: to irreversibly transform personal data into non-personal data.235 This feature is directly related to the further use of the anonymized data. The status of anonymization as a reliable privacy-preserving and security measure in scenarios involving BD has also been strongly challenged.236

3.4.2. Profiling and Behavioral Advertising

Another use of BD is to facilitate and simplify online profiling by enabling a higher number of behavioral correlations that can lead to the identification of particular individuals.237 A profile is a digital representation of an individual, often generated by automated means. In scenarios involving the processing of large data sets, this means deriving a set of typical features of the user which can serve as a basis for decision-making.238 Definitions of profiling refer to “[t]he process of ‘discovering’ correlations between data in databases that can be used to identify and represent a human or nonhuman subject (individual or group) and/or the application of profiles (sets of correlated data) to individuate and represent a subject or to identify a subject as a member of a group or category”239 or to “the creation of a representation based on automated monitoring of individual behavior.”240 Two types of profiles need to be distinguished: individual and group profiles. Individual profiling covers the collection of data about a particular person.241 For example, such recognition could be based on information gathered through a cookie which enables the behavioral monitoring of the individual concerned. Group profiles are created through profiling and cover a set of attributes encompassing a group of individuals. A group profile can be created on different grounds. For instance, the group could be distinguishable as such by its public manifestation (e.g. a group of friends or family members) expressing a certain relationship

234 Ohm (n 29). 235 El Emam and Alvarez (n 82). 236 Ohm (n 29). 237 Moerel and Prins (n 111). 238 Arnold Roosendaal, Digital Personae and Profiles in Law, Protecting Individuals’ Rights in Online Context, Wolf Legal Publishers, (2014), 30-31. 239 A. Pfitzmann, and M. Hansen, Anonymity, Unlinkability, Undetectability, Unobservability, Pseudonymity, and Identity Management - A Consolidated Proposal for Terminology, Dresden/Kiel: TUD/ULD, (2008). 240 Roosendaal (n 238). 241 Ibid.


between the members. The members could also share a common attribute, on the basis of which they can be categorized. The aggregation of personal profiles can likewise be a ground for creating this type of profile. This differentiation matters for the level of identifiability of the persons concerned and therefore plays a key role with regard to anonymization.

The analysis of profiling will be restricted to one particular type, namely behavioral targeting, because pseudonymization and anonymization are widely used by companies in the data processing activities related to this type of profiling. This category of online profiling refers to personalized communication that involves monitoring individuals’ online behaviour and using the collected data to display individually targeted advertisements to users.242 This type of “seductive marketing”243 (behavioral advertising) can be deployed on a much larger scale and can easily turn into an excessively intrusive activity.244 Huge amounts of data about millions of persons are collected, accumulated and processed specifically for behavioral advertising.245 For instance, data on approximately 1.5 billion people have been collected by Facebook.246 Google states that it “reaches 90% of Internet users worldwide.”247 There are, of course, other companies that are not under the spotlight but also process huge amounts of data about millions or even billions of persons, such as AddThis (“1.9 billion”)248 and the Rubicon Project (“600 million”).249 Put simply, a typical example of behavioral targeting involves an advertising network that monitors users’ online behaviour in order to display tailor-made ads to a particular user. These so-called ad networks are private organizations which supply advertisements to a large portion of websites. Their services include tracking users’ website-visiting history in order to place ads on all the sites visited. One of the tracking tools most commonly used by ad networks is the cookie, which has already been described in Chapter II. The accumulation of this tremendous amount of

242 M. Hildebrandt, S. Gutwirth, ‘Defining Profiling: A New Type of Knowledge?’ in Profiling the European Citizen, Cross-Disciplinary Perspectives, Springer Science, (2008). 243 For example, “influential marketing or persuasion profiling”. 244 Moerel and Prins (n 111). 245 Borgesius (n 56). 246 Facebook says it had ‘1.55 billion monthly active users as of September 30, 2015’ last accessed 10 June 2016. 247 Google Adwords, ‘About the Google Display Network’ (publication date unknown) last accessed 11 June 2016. 248 ‘AddThis offers unparalleled insight into the interests and behaviors of over 1.9 billion web visitors’ last accessed 20 June 2016. 249 last accessed 21 June 2016.


information complicates the overall picture in terms of privacy and data protection compliance. Most of the companies employing behavioral advertising state that they do not involve any personal data in this process, justifying this claim by the fact that they do not link direct identifiers, such as users’ names, to the data they process.250 The reason behind this “promise” is simple: to escape the data protection rules by presenting the data they use as either non-personal or anonymized. For instance, the Interactive Advertising Bureau statement on behavioral targeting reads: “The information collected and used for this type of advertising is not personal, in that it does not identify you – the user – in the real world. No personal information, such as your name, address or email address, is used. Data about your browsing activity is collected and analyzed anonymously.”251 As already demonstrated in Chapter II, the scope of personal data goes far beyond a narrow identifier such as the data subject’s name, although that is the most recognizable one. An explicit confirmation of this, concerning behavioral advertising practices performed by Google, is included in a letter addressed to Google and signed by 27 national Data Protection Authorities, according to which Google processes personal data about its “passive users”.252 These behavioral advertising practices include the tracking of users by Google’s subsidiary DoubleClick (an ad network).253 CNIL (the French Data Protection Supervisory Authority) states with regard to Google: “[T]he sole objective pursued by the company is to gather a maximum of details about individualized persons in an effort to boost the value of their profiles for advertising purposes. Its business model then is not dependent on knowing the last name, first name, address or other directly identifying details about individuals, which it does not need to recognize them every time they use its services. (…) In other words, the accumulation of data that it holds about any one person allows it to individualize the person based on one

250 Ibid. 251 Interactive Advertising Bureau Europe. Your Online Choices. A Guide to Online Behavioural Advertising, www.youronlinechoices.com/uk/about-behavioural-advertising> accessed 24 January 2016. 252 Article 29 Working Party, Letter to Google (signed by 27 national Data Protection Authorities), 16 October 2012, last accessed 11 June 2016. 253 Ibid., appendix, p. 2, footnote 2. Passive users are ‘users who does not directly request a Google service but from whom data is still collected, typically through third party ad platforms, analytics or +1 buttons.’



or more uniquely personal details. These data must, as such, be considered as identifiable rather than anonymous”.254 The data subject’s name can be linked to a cookie profile, and this cookie profile can in turn be shared with third parties through cookie syncing. According to Narayanan, “there is no such thing as anonymous online tracking.”255 Another issue with behavioral advertising is the uncertainty of the information that has been correlated and linked to a certain individual. Commercial data that has been aggregated and linked to identified data or auxiliary information does not in any way exclude the possibility of erroneous information. In one of his articles for Time, Joel Stein reported errors in the information in his commercial profile:256 according to one commercial profile, Joel Stein is an eighteen- to nineteen-year-old girl.257 This example demonstrates the complexity of the concept of personal data and its constantly shifting scope (challenged by technological novelties), which is of vital importance for the applicability of the data protection rules. The uncertainty found in the information underlying commercial profiles suggests that a “database of ruin”258 is still a distant reality. The non-legal argument of the companies managing commercial profiles is that their primary purpose is advertising, and for this purpose they do not need identified or identifiable individuals. However, as already demonstrated by the landmark re-identification cases, and even more so in view of technologies such as BD, today’s reality requires careful crafting on the part of the lawmakers when regulating concepts such as anonymization and pseudonymization.

3.5. Conclusion

This chapter presented the challenges that surround the concepts of anonymization and pseudonymization. Firstly, the concept opposite to anonymization, namely re-identification,

254 Commission Nationale de l’Informatique et des Libertés, ‘Deliberation No. 2013-420 of the Sanctions Committee of CNIL imposing a financial penalty against Google Inc’ (2014), 11 ; last accessed 10 June 2016. The Dutch Data Protection Authority confirms this interpretation in a report on Google and DoubleClick. College bescherming persoonsgegevens 2013 (Google) – ‘Investigation into the combining of personal data by Google, Report of Definitive Findings’ (2013), 44, last accessed 28 June 2016. ‘Identification is also possible without finding out the name of the data subject. All that is required is that the data can be used to distinguish one particular person from others.’ See also Ibid, 49-57. 255 Arvind Narayanan, There is no such thing as anonymous online tracking,Center for Internet and Society, Stanford Law School, (2011) last accessed 05 August 2016. 256 Joel Stein, Your Data, Yourself, TIME, (2011), 40-44. 257 Ibid. 258 Ohm (n 29).


was introduced. Subsequently, the three landmark re-identification studies completed the overview of this specific regulatory issue. It was demonstrated that specialists hold different positions on what can be considered a plausible risk. Similar disagreement persists in the utility-versus-privacy debate, in which three different stands were depicted: the pragmatists, who proclaim the impossibility of co-existence between the two terms; the formalists, who support the opposite view; and a third group holding a middle-ground perspective. Finally, new technological challenges such as big data and profiling were described in the context of the complex issues arising from their clash with anonymization. The most relevant positions regarding the risks arising from the use of anonymization and pseudonymization were thus identified; whether there are solutions to these issues in the light of the GDPR will be analyzed in Chapter 4.



Chapter IV Measures to address the challenges

The fourth chapter will assess how European Data Protection Legislation deals with the challenges presented in the previous chapter. Prior to the analysis of the regulatory mechanisms, the recommendations given by computer scientists and statisticians to mitigate the risks related to the use of anonymization techniques will be examined. Afterward, the new GDPR will be the main source, with its regulatory paradigms such as the risk-based approach, data protection impact assessments, and the key concept of data protection by design.

4.1. Computer scientists’ recommendations

This section summarizes a number of solutions proposed by computer scientists and statisticians in response to the privacy challenges posed by anonymization presented above. The purpose is to see whether they fit the provisions set forth in the Regulation and how they could complement the approach taken by European Data Protection legislation. As discussed in the previous chapter, the debate over anonymization involves highly divergent opinions, which is of course reflected in the proposed solutions. These different stands may nevertheless serve as useful ideas and frameworks from which lawmakers can derive knowledge.

For instance, Paul Ohm suggests that in the “post-anonymization world” legislators should “incorporate risk assessment strategies” that can reflect challenges such as “easy re-identification”.259 This could be done by focusing on reducing the risks of disclosing uniqueness in certain data elements rather than concentrating on the ultimate process of anonymization.260 The main argument is that lawmakers rely too much on the premises implied in the concept of anonymization; the elements highlighted in this approach are the preventive measures and the anonymization process itself.261 In other words, the promise of anonymization implies a false threshold of protection, and that level of data protection is not easy to reach. It is relatively easier to create a set of steps or requirements that can be followed in order to reduce the risks, which could include anonymization and pseudonymization mechanisms along with

259 Ohm (n 29), 1734-1735. 260 Ira S. Rubinstein And Woodrow Hartzog, Anonymization And Risk, New York University School Of Law Public Law & Legal Theory Research Paper Series Working Paper No. 15-36, (2015), 723. 261 Ibid.


administrative and legal instruments,262 or to rely more widely on existing best practices such as risk assessments.263

For instance, in the area of data security it is well known that perfect security does not exist.264 In one of his papers, Bambauer states that “[s]cholars should cast out the myth of perfection, as Lucifer was cast out of heaven. In its place, we should adopt the more realistic, and helpful, conclusion that often good enough is . . . good enough”.265 As already mentioned, the same is true of anonymization (perfect anonymization is not possible), and Rubinstein, Wu, Bambauer, and even Ohm have admitted that this can be addressed through a risk-tolerance approach.266

Therefore, anonymization should be used as a measure additional to other privacy-preserving mechanisms.267 Legislators should identify the solutions that require specific regulation as well as situations for which a general approach is more appropriate;268 careful crafting between these two options is needed, and the desired balance requires contextual consideration. According to Ohm, European Data Protection Legislation as embodied in the DPD sets the bar too high by broadening the scope of personal data too much,269 especially in contemporary re-identification realities.270 As a solution he proposes lowering the overburdening data protection requirements set forth in the DPD and understanding data protection through a contextual prism, which leads to a case-by-case approach.271 Schwartz and Solove,272 for their part, have stated that privacy legislation should rely on a three-tiered model in which the required level of data protection depends strongly on whether the data pose a “remote,” “possible,”

262 Ibid. Such as: contracts prohibiting re-identification and sensitive attribute disclosure, data enclaves, posing sanctions for re-identification or even banning the re-identification. 263 PbD implies that compliance with and enforcement of legal standards is incorporated into technical designs. This concept will be discussed in more details later in this chapter. 264 Derek E. Bambauer, The Myth of Perfection, 2 Wake Forest L. Rev. Online 22, (2012), last accessed 10 Jully 2016. 265 Ibid.> 266 Rubenstein, (n 181), Ohm, ( n 20); Wu, (n 160); Yakowitz (n 183), and Stuart S. Shapiro, Separating the Baby from the Bathwater Toward a generic and practical framework for anonymization, Technologies for Homeland Security (HST), 2011 IEEE International Conference 15-17, (2011). 267 Ohm (n 29), 1763. 268 Ibid. 269 Ohm (n 29, Paul Ohm is given example with the inclusion of the IP address in the scope of personal data, plus the level of easy re-identification which impacts significantly the broadening of the personal data scope. 270 Ohm (n 29), 1763. 271 Ibid. 272 Paul M. Schwartz & Daniel J. Solove, The PII Problem: Privacy and a New Concept of Personally Identifiable Information, 86 N.Y.U. L. REV. 1814, (2011), 1877–78.


or “substantial” risk of re-identification.273 Similarly, Mayer-Schönberger and Kenneth Cukier depict several different situations, all of them directly affected by BD analytics, and thereby present a contextual differentiation between scenarios presenting low, medium, and high levels of risk to individuals’ privacy and data protection in terms of the risk of re-identification.274 Nevertheless, the proposal already discussed in Chapter II to broaden the categories of data with regard to the identifiability of the data subject is also a sustainable point of view.
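A minimal sketch of how a quantitative estimate could feed such a tiered model is given below; the records, the 1/k worst-case risk metric, and the tier thresholds are purely illustrative assumptions, not a method prescribed by these authors.

```python
# Minimal sketch of how a quantitative re-identification estimate could feed
# a tiered ("remote"/"possible"/"substantial") risk model of the kind Schwartz
# and Solove describe. Thresholds and data are purely illustrative.

from collections import Counter

records = [  # quasi-identifier tuples in a hypothetical release
    ("F", "1957", "021**"), ("F", "1957", "021**"),
    ("M", "1971", "021**"), ("M", "1980", "100**"),
]

def max_reidentification_risk(rows):
    """Worst-case risk: 1/k for the smallest quasi-identifier equivalence class."""
    class_sizes = Counter(rows).values()
    return 1.0 / min(class_sizes)

def risk_tier(risk, possible=0.05, substantial=0.2):
    # Illustrative thresholds only; a real assessment would be context-specific.
    if risk >= substantial:
        return "substantial"
    if risk >= possible:
        return "possible"
    return "remote"

risk = max_reidentification_risk(records)
print(risk, risk_tier(risk))   # 1.0 "substantial": two classes of size 1 remain
```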

As the previous section highlighted, the computer science literature reveals extremely divergent opinions, and certain aspects have been either exaggerated or in some ways misinterpreted by legal scholars.275 The common ground between all the different theories is that anonymization should be considered a concept that implies a varying level of risk, strongly related to the context in which this measure is used.276 The policy debate exists in the shadow of the computer science literature, evaluating anonymization as a compliance measure and at the same time as a security measure which could facilitate data protection. Lawmakers have referred numerous times to computer science studies in order to justify their own views,277 although their intentions have been expressed in different policy proposals.278 This can be seen in the long process of adopting and consolidating the new data protection rules of the Regulation, with its approximately 4000 amendment proposals as a product of unprecedented lobbying.279 The perspective of European data protection will be addressed in the next sections by discussing the essential role of the so-called risk-based approach, whose significance increases under the upcoming Regulation. In order to answer the main research question in a comprehensive manner, first the notion of risk will be defined and then the key elements of the risk-based approach will be addressed.

4.2. Risk

Besides anonymization, the most frequently used term in the sections above is the notion of "risk". There is no unified definition of risk; "the most common uses are: risk as a hazard, as

273 Rubinstein and Hartzog (n 260). 274 Viktor Mayer-Schönberger and Kenneth Cukier, Big Data: A Revolution That Will Transform How We Live, Work, And Think, (2013). 275 See Chapter III. 276 Rubinstein and Hartzog (n 260). 277 Wu (n 160). 278 See Ohm, supra note 21, at 1751–58 (explaining why “technology cannot save the day, and regulation must play a role”); Yakowitz, supra note 25, at 23–35 (describing “five myths about re-identification risk”); see also Schwartz & Solove, supra note 27, at 1879 (asserting that “practical tools also exist for assessing the risk of identification”). 279 See Data Protection Regulation provokes intense lobbying, EJC News, v49 issue 7/8, July/August 2013.

probability, as consequence, and as potential adversity or threat".280 Before discussing how data protection deals with the challenges of the "digital era", a short description of the concept of risk and its close relation to anonymization will be presented. In their landmark paper, Amos Tversky and Daniel Kahneman demonstrated that human beings are poor decision-makers when risk is involved and are therefore often biased, especially in situations involving uncertainty.281 This can be seen in the tendency to confuse the likelihood of a situation with its final outcome or impact.282 Furthermore, where risks are related to human actions, these biases themselves feature in the risk profile.283 Accordingly, the WP29 stated that "a risk factor is inherent to anonymization".284 For instance, in a re-identification scenario, if the adversary is convinced that a re-identification attack is unlikely to be successful, the adversary is less likely to make the effort required for the attack. The outcome is that the actual risk in this scenario will be lower than the objectively measured one.285 The concept of "risk" is incorporated into data protection legislation in the form of the risk-based approach, which has a strong impact on anonymization as well.
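The distinction between likelihood and impact, and between perceived and objectively measured risk, can be made concrete with a small worked example; all figures and the deterrence rule are hypothetical and serve only to illustrate the argument made above:

# Minimal sketch: risk as likelihood x impact, and the effect of a deterred adversary.
def risk_score(likelihood: float, impact: float) -> float:
    """Expected-harm style risk score (illustrative, not a normative metric)."""
    return likelihood * impact

objective_likelihood = 0.30    # assumed probability that a re-identification attack would succeed if attempted
impact = 100.0                 # assumed harm to the data subjects if it does succeed

# If the adversary believes the attack is unlikely to succeed, fewer attempts are made,
# so the realised likelihood falls below the objectively measured one.
perceived_chance_of_success = 0.05
attempt_rate = 0.2 if perceived_chance_of_success < 0.10 else 1.0
actual_likelihood = objective_likelihood * attempt_rate

print("objective risk:", risk_score(objective_likelihood, impact))   # 30.0
print("actual risk:", risk_score(actual_likelihood, impact))         # 6.0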

4.3. Risk-based approach in the European Data Protection Legislation

The risk-based approach has been defined as an approach that "goes beyond mere compliance with regulatory requirements. It goes to the heart of what responsible and accountable organizations seek to achieve, how they implement privacy requirements on the ground and how they demonstrate compliance. The risk-based approach may also help to clarify and communicate the underlying rationales for regulation."286 Data protection has long relied on the risk-based approach. The current Directive sets forth this concept in the articles related to data security, such as Article 17, according to which the required security measures must 'ensure a level of security appropriate to the risks represented by the processing and the

280 Paul Slovic and Elke U. Weber, Perception of Risk Posed by Extreme Events, Center for Decision Sciences (CDS) Working Paper, Columbia University, (2002), 4. 281 Amos Tversky and Daniel Kahneman, Judgment under uncertainty: Heuristics and Biases, Science, 185 (4157): 1124-1131, (1974), last accessed 21 July 2016. 282 Ibid. 283 Mark Elliot, Elaine Mackey, Kieron O'Hara and Caroline Tudor, The Anonymisation Decision-making Framework, UKAN Publications, (2016), last accessed 10 August 2016. 284 WP29 (n 15). 285 Ibid. 286 Center for Information Policy Leadership, Hunton & Williams LLP, A Risk-based Approach to Privacy: Improving Effectiveness in Practice, (2014), last accessed 10 June 2016.

nature of the data to be protected' and Article 20, with regard to the Data Protection Authorities' prior-checking obligations.287 In the same way, the rules applicable to the processing of special categories of data contained in Article 8 can also be recognized as an application of the risk-based approach: the increased level of measures and requirements to be taken is directly associated with types of processing which involve a greater risk for the data subjects.288 This is not to say that, where the risk-based approach applies, the rights of the persons concerned with regard to their privacy and the protection of their personal data will be weakened.289 The level of protection should remain the same even in scenarios which could be considered less "risky". Instead, it implies that legal requirements scale with the risk involved in the processing, which is how the risk-based approach addresses compliance.290 Hence, the legal requirements imposed on a data controller in relation to 'low risk' processing may demand less in order to comply with its data protection obligations than in scenarios in which the processing of personal data involves a high level of risk. However, the perception of risk under the Directive has leaned more towards the complete elimination of risk than towards a genuinely risk-based approach,291 something that can be seen clearly in the approach towards the concept of anonymization.292 Besides the aforementioned legal obligations, the risk-based approach has an essential role in the private and public sector, where it has been employed as an effective tool for enhancing data protection.293 Notwithstanding this, the strategy has recently gained a more important role in European data protection legislation.294 This results clearly from the final text of the GDPR. Accordingly, Article 24 of the GDPR, entitled "Responsibility of the controller", requires data controllers to take the risks into account:

287 Article 29 Data Protection Working Party, Statement on the role of a risk-based approach in data protection legal frameworks, 14/EN, WP 218, adopted on 30.05.2014. 288 Ibid. 289 Ibid. 290 Ibid. 291 Centre for Information Policy Leadership at Hunton & Williams LLP, Protecting Privacy in a World of Big Data, Paper 2: The Role of Risk Management, Discussion Draft, (2016), last accessed 10 June 2016. 292 WP29 (n 15). 293 Christopher Kuner, Fred H. Cate, Christopher Millard, Dan Jerker B. Svantesson, and Orla Lynskey, Risk management in data protection, International Data Privacy Law, Vol. 5, No. 2, (2015), 95. 294 Ibid.


Taking into account the nature, scope, context and purposes of processing as well as the risks of varying likelihood and severity for the rights and freedoms of natural persons, the controller shall implement appropriate technical and organizational measures to ensure and to be able to demonstrate that processing is performed in accordance with this Regulation. Those measures shall be reviewed and updated where necessary.295

The risk-based approach consists in the use of risk management mechanisms to be applied by the data controller, taking into account the level of risk of the particular data processing operation.296 Furthermore, this way of approaching privacy issues is directly related to the challenges posed by the concept of anonymization. In the Regulation there is a dual correlation between the risk-based approach and anonymization.297 Firstly, anonymization can be applied as a risk management tool ("as an appropriate technical measure"298) which reduces the level of risk.299 Other examples are the data security requirements300 and the data protection impact assessment obligation301 laid down in the Regulation, in which anonymization serves as an important privacy-preserving tool. Furthermore, the risk-based approach extends to other important privacy-preserving mechanisms specified in the Regulation, such as the data protection by design and by default principles (Article 25), data breach notifications (Articles 33 and 34) and the implementation of codes of conduct and certification (Articles 40 and 42).302 Secondly, the risk-based approach, and in particular its risk management tools, can usefully be applied to the anonymization techniques employed, in order to quantify the potential risks in the form of a data protection impact assessment. The ICO has underlined this by stating that the risk assessment should be applied in the initial phase of the anonymization process.303

295 GDPR (n 7), Article 24. 296 Raphaël Gellert, Data protection: a risk regulation? Between the risk management of everything and the precautionary alternative, International Data Privacy Law, Vol. 5, No. 1, (2015), last accessed 10 August 2016. 297 Antikainen Antti, Risk-Based Approach as a Solution to Secondary Use of Personal Data, University of Helsinki, Faculty of Law, (2014), last accessed 21 July 2016. 298 GDPR (n 7), Article 24. 299 WP29 (n 15), 9. "Anonymisation may be a good strategy to keep the benefits and to mitigate the risks." 300 E.g. GDPR (n 7), Art. 32 (1). 301 E.g. GDPR (n 7), Art. 35 (7)(d). 302 Article 29 Data Protection Working Party, Statement on the role of a risk-based approach in data protection legal frameworks, 14/EN, WP 218, adopted on 30.05.2014. 303 ICO (n 12) 19.


Complementary to this, the risk-based approach has featured in various discussions in the data protection field in relation to BD.304 For instance, Moerel and Prins argue that the purposes for which personal data might be used should no longer be considered the main focus of regulation and that legal compliance should rather shift to the interests that are served by the use of the collected data.305 Thus, they proclaim risk management as a basis for responsible data use, including in scenarios of further use of anonymized data.306 This different stand will be discussed in more detail in the section devoted to one of the main risk management tools, namely the data protection impact assessment; before that, the impact of the risk-based approach on pseudonymization will be addressed.

4.3.1. Risk-based approach and pseudonymization

The notion of pseudonymization led to vigorous debates in the European Parliament and in the Council about whether a lighter legal approach should apply to it, on the grounds that the level of identifiability is decreased and therefore the privacy and data protection risks for the data subjects are also reduced.307 Private companies such as Amazon and Yahoo lobbied for a lighter regime for pseudonymized data, arguing that pseudonymization significantly reduces the privacy risks.308 There were even proposals from some Members of Parliament that this type of data should remain outside the scope of the data protection rules, something that did not make it into the final version.309 European Commissioner Viviane Reding warned that 'pseudonymous data must not become a Trojan horse at the heart of the Regulation, allowing the non-application of its provisions.'310 As the final outcome of the long negotiation process, in the final text of the GDPR the role assigned to pseudonymization

304 Article 29 Data Protection Working Party, Statement on the role of a risk-based approach in data protection legal frameworks, 14/EN, WP 218, adopted on 30.05.2014. 305 Moerel and Prins (n 111). 306 Ibid. 307 Article 29 Data Protection Working Party, Statement on the role of a risk-based approach in data protection legal frameworks, 14/EN, WP 218, adopted on 30.05.2014. 308 Yahoo! Rationale for Amendments to Draft Data Protection Regulation as Relate to Pseudonymous, last accessed 10 May 2016; Amazon EU Sarl, Proposed amendments to MEP Gallo's opinion on data protection, last accessed 10 May 2016. 309 For instance, shadow rapporteur Alvaro proposed adding a rule that would legitimize the processing of pseudonymous data: 'Processing of pseudonymized data shall be lawful.' (Alexander Alvaro, 'Draft Amendments to the General Data Protection Regulation' (21 February 2013), last accessed 10 May 2016, amendment 48, p. 31.) 310 Viviane Reding, The EU Data Protection Regulation: Promoting Technological Innovation and Safeguarding Citizens' Rights (Speech, Intervention at the Justice Council, 4 March 2014), last accessed 10 June 2016.

is mainly that of an appropriate security measure.311 The fact that pseudonymization is considered a risk-reducing measure is signalled in the legal text of Recital 28.312 Moreover, this is also illustrated in Article 30 of the GDPR, according to which data controllers, in certain cases, will not be required to comply with data subjects' access requests if they have employed pseudonymization.313 Another example is found in the articles on data breach notification (Articles 33 and 34); although pseudonymization is not explicitly mentioned, the text refers to this type of privacy-enhancing technique by describing the process of "render[ing] the personal data unintelligible to any person who is not authorized to access it."314 The absence of an explicit mention of pseudonymization in these articles might be explained by the technologically neutral approach which is one of the main goals of the Regulation. The reasoning behind this 'safe harbor' option implied by the use of pseudonymization and encryption can also be explained by the fact that the notification requirement could otherwise result in an over-burdening obligation for data controllers.315 With regard to the applicability of these legal relaxations, pseudonymization should fulfill the requirements set forth in Article 4 of the GDPR, namely that the "additional information is kept separately" and "is subject to technical and organizational measures to ensure that the personal data are not attributed to an identified or identifiable natural person."316 Nonetheless, the applicability of these exceptions does not occur automatically; rather, it should be examined by the competent supervisory authority.317 As has been shown in this chapter so far, the risk-based approach has a prominent role with regard to anonymization and pseudonymization, as well as for the GDPR in general; however, it would be naive to believe that this concept by itself could be the "privacy panacea". The inherent role of risk management is to facilitate the prioritization of the investment needed for privacy-enhancing "organizational and technical measures", as well as the enforcement of the

311 Borgesius (n 56); GDPR (n 7), Art. 23(1); Art. 30(1)(a); Art. 83(1); Recital 60(a); Recital 61; Recital 67; Recital 125. 312 GDPR (n 7), Recital 28: "The application of pseudonymisation to personal data can reduce the risks to the data subjects concerned and help controllers and processors to meet their data-protection obligations. The explicit introduction of 'pseudonymisation' in this Regulation is not intended to preclude any other measures of data protection". 313 GDPR (n 7), Art. 30(1)(a). 314 GDPR (n 7), Art. 33. 315 See, for example, the discussion in Paul Schwartz and Edward Janger, Notification of Data Security Breaches, Michigan Law Review, Vol. 105, p. 913, 2007. 316 GDPR (n 7), Article 4 (5). 317 GDPR (n 7), Article 34 (4).

obligations that have been laid down in the legislation.318 Employing a risk-based approach can help to identify the challenges to data protection and to select the most effective tools for mitigating these risks.319 Furthermore, it can increase the collective knowledge about the possible risks related to data processing activities by identifying and quantifying them, notably in the current digital era with its pervasive means of re-identification, BD, and profiling. Moreover, it can bring the necessary balance between privacy and utility by lending discipline and rigor to the way the processing of data is perceived.320 Notwithstanding the value of risk management, its embodiment in data protection practices requires substantial work if it is to achieve its full potential for data protection. As discussed in the previous chapter, the potential risks with regard to anonymization can be understood differently even among specialists. Therefore, the next section will analyze the criteria which the WP29 sets out in order to quantify the "robustness of the anonymization techniques".321 These criteria set the scene for the DPIA, which is the first step in identifying risks.

4.4. The robustness of Anonymization

In the WP29 Opinion on Anonymisation Techniques, the inherent correlation between risk and anonymization is stressed multiple times. The so-called risk factor is one of the criteria that should be taken into consideration when examining the "validity of any anonymization technique" in terms of the potential ways in which the data has been "anonymized".322 The Opinion also presents the idea that the different anonymization techniques and practices used by data controllers possess "variable degrees of robustness".323 Hence, in order to assess this, data controllers should take into consideration the state of the art of the particular technology as well as the three main types of risk for anonymization, namely:

“(i) is it still possible to single out an individual,

(ii) is it still possible to link records relating to an individual, and

(iii) can information be inferred concerning an individual?”324

318 Christopher Kuner, Fred H. Cate, Christopher Millard, Dan Jerker B. Svantesson & Orla Lynskey, Risk management in data protection, International Data Privacy Law, 2015. 319 Ibid. 320 Moerel and Prins (n 111). 321 WP29 (n 15). 322 Ibid. 323 Ibid. 324 Ibid.


Thus, an anonymization technique addressing these three risks would be robust against re-identification performed by the most likely and reasonable means which may be employed. This is the approach that should be applied prior to the choice of the appropriate anonymization technique. These three questions could be incorporated into the most recognizable risk management tool, namely the DPIA, which will be discussed in the next section. Additionally, the ICO's 'Anonymisation: managing data protection risk code of practice' also underlines the close interrelatedness between anonymization and risk management by introducing the "motivated intruder" test.325 This test asks data controllers to consider whether a motivated intruder could achieve re-identification.326 The 'motivated intruder' test is a valuable tool which aims to set the threshold for the risk of identification by an individual (intruder) who is not an expert but is motivated to achieve re-identification.327 The test is not presumed to apply to a highly skilled hacker or a person with specific prior knowledge.328 In general, this code of practice aims at giving data controllers criteria that they should take into account when releasing anonymized data and selecting the appropriate anonymization techniques. At the same time, the ICO admits that assessing the potential risk of re-identification is a complex and not always possible329 procedure which has to be periodically reviewed.330 This is why the ICO's guideline avoids "zero-risk" framing and instead uses risk-tolerant language such as "mitigating", in place of eliminating, risk.331 However, as already demonstrated in the previous chapter, even well-known computer specialists may quantify the risk involved in identical situations (e.g. the three landmark studies addressed in Chapter 3) divergently. The risk mitigating tools incorporated in the legal text of the Regulation include the DPIA, which could use the above-mentioned criteria in its arsenal, especially when the assessment relates to the use of anonymization techniques. Therefore, the next section will explore the concept of the DPIA and its relation to the challenges presented in the previous chapter and the risks involved in employing anonymization and pseudonymization.
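For illustration only, the three WP29 questions could be operationalized by a data controller as a simple checklist applied to a candidate technique; the data structure, the example technique and its parameter are assumptions of this sketch and are not prescribed by the WP29 Opinion or the ICO code:

# Sketch of a checklist for the three WP29 robustness criteria (illustrative only).
from dataclasses import dataclass
from typing import List

@dataclass
class RobustnessAssessment:
    technique: str                 # name of the anonymization technique under review
    singling_out_possible: bool    # (i) can an individual still be singled out?
    linkability_possible: bool     # (ii) can records relating to one individual still be linked?
    inference_possible: bool       # (iii) can information about an individual still be inferred?

    def residual_risks(self) -> List[str]:
        """Return the WP29 risks that the technique leaves unaddressed."""
        flags = {
            "singling out": self.singling_out_possible,
            "linkability": self.linkability_possible,
            "inference": self.inference_possible,
        }
        return [name for name, present in flags.items() if present]

# Example technique and parameter are assumptions of this sketch.
assessment = RobustnessAssessment(
    technique="k-anonymity (k=5)",
    singling_out_possible=False,
    linkability_possible=True,
    inference_possible=True,
)
print(assessment.technique, "residual risks:", assessment.residual_risks())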

325 ICO (n 12) 23-24. 326 Ibid. 327 Ibid. 328 Ibid. 329 E.g. “It is worth stressing that the risk of re-identification through data linkage is essentially unpredictable because it can never be assessed with certainty what data is already available or what data may be released in the future.”, Ibid. 330 Something that is also stated by the WP29 in their opinion with regards Anonymization techniques. 331 Rubinstein and Hartzog (n 260), 746.


In addition, the United Kingdom Anonymisation Network (UKAN) recently introduced a four-step assessment of disclosure risk.332 This type of assessment takes into consideration both the risk itself and the particular control exercised over the data release. The key elements of this test are the following: (i) a detailed description of the data held, sorted by the relevant risk features; (ii) an analysis aimed at creating plausible intrusion scenarios for the concrete data situation; (iii) where necessary, the use of analytic tools to assess the risk attached to the scenarios established in the previous step; and (iv) a penetration test (which can again be used to test the same scenarios by simulating attacks using friendly intruders).
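A minimal sketch of how these four steps could be organized as a workflow is given below; every function body is a placeholder, since the actual content of each step depends entirely on the concrete data situation:

# Skeleton of a four-step disclosure risk assessment in the spirit of the UKAN framework.
# Every function body is a placeholder; real content depends on the concrete data situation.
def describe_data(dataset):
    """Step 1: describe the data held, sorted by the relevant risk features."""
    return {
        "direct_identifiers": [],                   # assumed already removed
        "quasi_identifiers": ["age", "postcode"],   # hypothetical example features
        "sensitive_attributes": ["diagnosis"],
    }

def build_scenarios(description):
    """Step 2: construct plausible intrusion scenarios for this data situation."""
    return ["linkage with a public register", "acquaintance with prior knowledge"]

def analyse_scenarios(dataset, scenarios):
    """Step 3: where necessary, use analytic tools to estimate the risk per scenario."""
    return {scenario: "risk not quantified in this sketch" for scenario in scenarios}

def penetration_test(dataset, scenarios):
    """Step 4: simulate the scenarios with friendly intruders and record the outcome."""
    return {scenario: "pending" for scenario in scenarios}

dataset = None  # stands in for the data considered for release
description = describe_data(dataset)
scenarios = build_scenarios(description)
print(analyse_scenarios(dataset, scenarios))
print(penetration_test(dataset, scenarios))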

4.5. DPIA

A DPIA can be defined as "a methodology or process for assessing the privacy-related risks associated with organizational activities that involve the processing of personal data."333 This notion is introduced in a very general way in Article 35 of the GDPR as a requirement which applies to certain types of data processing operations.334 The minimal components of a DPIA are the following: a description of the processing procedures, an evaluation of the risks, an assessment of the proportionality and necessity of the processing procedures, and the mitigating tools applied.335 The provision underlines the key role of the supervisory authorities, which should further develop checklists and guidelines as tools to help controllers implement the DPIA obligation effectively. The main advantages of the use of a DPIA can be summarized as follows: "(i) establish and maintain compliance with privacy and data protection laws and regulations, (ii) to manage privacy risks within an organization and in relation to third persons, and (iii) to provide public benefits to the success of PbD efforts".336 Moreover, the DPIA is useful as a tool that assesses and directly approaches privacy and data protection issues throughout the whole data processing life cycle from the very beginning, introduces safeguards or appropriate measures at the design stage of products, and is pivotal for future data protection compliance audits and checks.337 This procedure could serve as a credible source of knowledge to ease all the

332 Mark Elliot, Elaine Mackey, Kieron O'Hara and Caroline Tudor, The Anonymisation Decision-making Framework, UKAN Publications, 2016. 333 Rolf H. Weber, Privacy management practices in the proposed EU regulation, International Data Privacy Law, 2014, Vol. 4, No. 4, 293. 334 GDPR (n 7), Art. 35. 335 GDPR (n 7), Article 35 (7). 336 Rolf H. Weber, 'Can Data Protection be Improved through Privacy Impact Assessments', iJusletter IT, 12 September 2012, No. 4. 337 Ibid.

involved stakeholders (including all concerned data subjects). Additionally, the DPIA provides a cost-effective process for reducing privacy risks.338 This risk management tool has its own impact on the challenges related to anonymization.
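By way of illustration, the minimal components of Article 35(7) listed above could be captured in a simple record which the controller fills in and keeps under review; the field names and example values are the author's paraphrase and invention, not statutory wording:

# Hypothetical sketch of a DPIA record mirroring the minimal Article 35(7) GDPR components.
from dataclasses import dataclass, field

@dataclass
class DPIARecord:
    processing_description: str                 # systematic description of the processing and its purposes
    necessity_and_proportionality: str          # assessment of necessity and proportionality
    risks_to_data_subjects: list = field(default_factory=list)   # risks to the rights and freedoms of data subjects
    mitigating_measures: list = field(default_factory=list)      # measures envisaged to address the risks

dpia = DPIARecord(
    processing_description="release of an anonymized hospital dataset for research",
    necessity_and_proportionality="the research purpose cannot be met with aggregate statistics alone",
    risks_to_data_subjects=["re-identification through linkage", "attribute inference"],
    mitigating_measures=["k-anonymization", "contractual ban on re-identification", "periodic review"],
)
print(dpia)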

The DPIA is an essential part of the procedure of processing safe and useful data: it assesses whether the data should be released or shared in the first place, assesses the level of disclosure control, and considers the optimum ways of releasing or sharing the data.339 From a practical point of view, in the context of this thesis the most challenging part of this assessment will be to address the anonymization process itself. It will require a high level of expertise and experienced judgement on the part of the data controller.340 The difficulties that make this assessment so complex lie in the already mentioned "auxiliary problem"341 as well as in other factors that can impact effective anonymization.342 Additionally, the expertise of the adversary and the techniques he or she uses are not constant values. Here, the criteria to be taken into account by the data controller could be the already mentioned WP29 strategies regarding the robustness of anonymization techniques, as well as the "motivated intruder" test presented by the ICO.

The DPIA has an important role with regard to BD,343 as the WP29 stresses that all privacy and data protection principles (including purpose limitation and data minimization) apply to BD without any limitation.344 As already mentioned in Chapter II, the DPIA could be used to ensure that BD is compliant with the above-mentioned principles. The aim in this scenario would be to assess all potential contextual implications of the use and re-use of data.345 Moreover, the DPIA will adhere to the already accepted criteria of the legitimate interest test as well as the compatibility test.346 By applying this assessment, data controllers will ensure that the data processing procedures are carried out uniformly and are of a consistently high quality, while being compliant with the requirements laid down in the GDPR.347 However, as mentioned by

338 David Wright, The state of the art in privacy impact assessment, Comput Law Secur Rev 2012; 54–61. 339 Mark Elliot, Elaine Mackey, Kieron O’Hara and Caroline Tudor, The Anonymisation Decision-making Framework, UKAN Publications, 2016. 340 Ibid. 341 Section 3.1. 342 Ibid. 343 Lokke Moerel, Big Data Protection: How to Make the Draft EU Regulation on Data Protection Future Proof? (inaugural lecture 2014) 53-54, last accessed 10 May 2016. 344 Article 29 Working Party, 'Statement of the WP29 on the impact of the development of Big Data on the protection of individuals with regard to the processing of their personal data in the EU’. 345 Moerel and Prins (n 111). 346 Ibid. 347 Ibid.

recent doctrine "despite the longstanding role of, and intensified recent attention to, risk management in data protection, it is still a developing field that lacks many of the widely accepted principles and tools of risk management in other areas."348

The already existing risk assessment procedures in the field of privacy and data protection are mainly designed to concentrate on tangible harms like security weaknesses or financial loss.349 In the context of BD the spectrum of potential risks increases significantly.350 This requires a substantial extension of the scope of current risk assessments as well as a further examination of the possible ethical implications of BD.351 These ethical challenges include stigmatization, unfair discrimination and narrowcasting. This is why Jules Polonetsky, Omer Tene, and Joseph Jerome introduced a Data Benefit Analysis framework which goes beyond the DPIA and has been designed to be used in the context of BD.352 In order to facilitate the potential of BD it is necessary to establish a way of assessing not just the risks but also the benefits created by innovative information uses. This evaluation method can help with the "utility versus privacy" challenge presented in the previous chapter, and it consists of two key elements. The first is the evaluation of the "raw value" of the benefit (e.g. the benefit's nature, the benefit's size, and the potential beneficiaries). The second element is the discount value score, which can be obtained by discounting the raw value score by the probability that the benefit can be realized.353 This approach complements and supports assessments such as the DPIA, especially in BD projects. Nonetheless, in the BD scenario there can be many unexpected outcomes, and further research in this field is necessary in order to facilitate the development of better risk assessment methods. As already mentioned at the beginning of this chapter, research such as the book by Mayer-Schönberger and Kenneth Cukier can contribute significantly to the achievement of this goal.354
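The discounting step can be shown with a small worked example; the scores are invented and serve only to illustrate how the raw value of an expected benefit is reduced by the probability that it will actually be realized, before being weighed against the residual privacy risk:

# Worked example of the benefit-discounting idea (all figures are hypothetical).
def discounted_benefit(raw_value: float, probability_of_realisation: float) -> float:
    """Discount the raw benefit score by the likelihood that the benefit is actually realised."""
    return raw_value * probability_of_realisation

raw_value = 80.0               # assumed raw score of an expected public-health benefit
probability = 0.25             # assumed chance that the analytics project delivers that benefit
residual_privacy_risk = 15.0   # assumed risk score produced by a DPIA-style assessment

benefit = discounted_benefit(raw_value, probability)    # 80.0 * 0.25 = 20.0
print("discounted benefit:", benefit)
print("benefit outweighs residual risk:", benefit > residual_privacy_risk)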

348 Christopher Kuner, Fred H. Cate, Christopher Millard, Dan Jerker B. Svantesson & Orla Lynskey, Risk management in data protection, International Data Privacy Law, vol. 5, no. 2, 95 (2015). See also Jules Polonetsky, Omer Tene & Joseph Jerome, Benefit-Risk Analysis for Big Data Projects, (2014). 349 Jules Polonetsky, Omer Tene & Joseph Jerome, Benefit-Risk Analysis for Big Data Projects, (2014), last accessed 10 March 2016. 350 Ibid. 351 Ibid. 352 Ibid. 353 Office of Management & Budget, Circular No. A-94 Revised, Memorandum for Heads of Executive Departments and Establishments, Guidelines and Discount Rates for Benefit-Cost Analysis of Federal Programs (Oct. 29, 1992), <http://www.whitehouse.gov/omb/circulars_a094> last accessed 11 June 2016. 354 Viktor Mayer-Schönberger and Kenneth Cukier, Big Data: A Revolution That Will Transform How We Live, Work, And Think, (2013).


The next section will analyze data protection by design (DPbD) and data protection by default as the next mitigating tools introduced in the Regulation. The inherent connection between DPIAs and DPbD is crucial for understanding the complexity of the challenges which regulators and organizations are facing.355 Moreover, these concepts make it possible to address the privacy implications in a more meaningful and effective manner, as PbD can be described as a natural follow-up of the DPIA and should be based on its outcomes.356

4.6. Data protection by design and by default

The official position of the European Commission towards security development in terms of governance can be outlined as proclaiming "technological governance as a good site for concrete, operationalized engagement with tensions between the protection of privacy and the pursuit of security."357 The concept of PbD represents a pro-active, privacy- and data protection-friendly approach towards the development of a product from the very beginning, rather than a subsequent, more complicated and costly adjustment.358 This stance follows Lessig's vision of regulation built into code/architecture, which is also the core element of the PbD principle.359 Moreover, Ann Cavoukian's PbD theory offers a "win-win" approach360 and can be an appropriate middle ground between privacy and security. Anonymization and pseudonymization have important roles in achieving the primary aim of this concept, namely the preservation of privacy.361 Different key pillars have been identified when analyzing PbD, and one of them is dedicated to pseudonymity.362 For the rest, namely data minimization, user control, accountability, functional separation, and transparency, anonymization and pseudonymization are likewise integral components. Anonymization or

355 Simon Davies, Why Privacy by Design is the next crucial step for privacy protection, last accessed 10 April 2016. 356 Niels van Dijk, Raphaël Gellert, and Kjetil Rommetveit, A risk to a right? Beyond data protection risk assessments, Computer Law & Security Review 32, (2016), 286–306, last accessed 10 April 2016. 357 European Commission, "Ethical and Regulatory Challenges to Science and Research Policy at the Global Level," 24. 358 Ibid. 359 Lawrence Lessig, Code: Version 2.0, 2nd edn, Basic Books, New York, (2006), 227. 360 See for instance Ann Cavoukian, "Privacy by Design", (2009), last accessed 21 April 2016; Ann Cavoukian, Scott Taylor, and Martin E. Abrams, "PbD: Essential for Organizational Accountability and Strong Business Practices," Identity in the Information Society 3, (2010). 361 Ann Cavoukian, 'Operationalizing Privacy by Design: A Guide to Implementing Strong Privacy Practices', (December 2012), 11. 362 Simon Davies, Why Privacy by Design is the next crucial step for privacy protection, last accessed 21 May 2016.

pseudonymization by default are good examples of how the appropriate technical measure prescribed by EU data protection legislation can be enforced automatically.363

The principles of PbD are also found in Article 17 of the DPD364 when referring to the security measures that have to be taken by controllers and processors. More particularly, the DPD implies this concept in Recital 46, according to which: "[…] the processing of personal data requires that appropriate technical and organizational measures be taken, both at the time of the design of the processing system and at the time of the processing itself, particularly in order to maintain security and thereby to prevent any unauthorized processing."365

The full integration of "data protection by design" and "data protection by default" into the legal text is achieved in Article 25 of the GDPR. It should be noted that data protection by default is still a debated concept. According to Cavoukian, it refers to "privacy as the default setting": thus privacy should remain intact if the individual does nothing.366 Conversely, privacy by default could be presumed as a matter of course in ICT development and operation. Therefore, the definition should encompass both privacy by design as a default (as mentioned above) and the default settings.367 Moreover, Article 25 mentions pseudonymization as an option for data controllers. In this context, pseudonymization should be applied both prior to the processing and during the processing itself. The article links the appropriate technical and organizational measures strictly to the data minimization principle. The essence of the data minimization principle is that the processing of personal data should be strictly limited to what is necessary for the relevant purpose.368 The European Data Protection Supervisor states that the concept of Privacy Enhancing Technologies (PETs)369 is closely related to the principle of "data minimization", which is gaining popularity and has been progressively

363 Moerel and Prins (n 111). 364 DPD (n 4). 365 Ibid, Recital 46. 366 Cavoukian (n 17). 367 Marit Hansen, 'Data Protection by Default in Identity-Related Applications' in Simone Fischer-Hübner, Elisabeth de Leeuw and Chris Mitchell (eds), Policies and Research in Identity Management: Third IFIP WG 11.6 Working Conference (Springer 2013), 4. 368 Christopher Kuner, European data protection law: corporate compliance and regulation, 2nd edn, Oxford University Press, Oxford (2007), 74. 369 "Privacy-Enhancing Technologies is a system of ICT measures protecting informational privacy by eliminating or minimising personal data thereby preventing unnecessary or unwanted processing of personal data, without the loss of the functionality of the information system.", see more in John Borking and Jan Huizenga, Handbook of Privacy and Privacy-Enhancing Technologies - The case of Intelligent Software Agents, The Hague, 2003, <http://www.andrewpatrick.ca/pisa/handbook/Handbook_Privacy_and_PET_final.pdf> last accessed 10 June 2016.

implemented through the principle of "PbD", which is applied not only in the ICT field but also by organizations in general, and thus also facilitates the activities of the data protection authorities.370 The way in which this concept can facilitate the preservation of privacy in these scenarios is through the incorporation of the data minimization principle, enhanced by the use of appropriate anonymization or pseudonymization techniques (classified as PETs371). In other words, minimizing the collection of personally identifiable information by eliminating the personal identifiers linked to the data in the initial phase of every data processing procedure could reduce the risk of re-identification.
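A minimal sketch of how such a measure could be enforced automatically at the point of collection is given below: direct identifiers are dropped or replaced by a keyed pseudonym, with the key kept separately from the dataset, echoing the Article 4(5) conditions. The field names, the coarsening rule and the key handling are assumptions made for illustration, not a recommended implementation:

# Illustrative pseudonymization-at-collection sketch (not a vetted implementation).
import hmac, hashlib, secrets

# The "additional information" (here, the key) must be kept separately and protected.
PSEUDONYM_KEY = secrets.token_bytes(32)   # in practice stored outside the analytics environment

def pseudonymize(identifier: str) -> str:
    """Replace a direct identifier with a keyed pseudonym (HMAC-SHA256)."""
    return hmac.new(PSEUDONYM_KEY, identifier.encode(), hashlib.sha256).hexdigest()

def minimise_record(record: dict) -> dict:
    """Drop direct identifiers and keep only what the stated purpose needs."""
    return {
        "pseudonym": pseudonymize(record["email"]),
        "age_band": "30-39" if 30 <= record["age"] < 40 else "other",  # coarsened quasi-identifier
        "purchase_category": record["purchase_category"],
    }

raw = {"email": "alice@example.com", "age": 34, "purchase_category": "books"}
print(minimise_record(raw))

Without access to the separately stored key, the pseudonyms in the minimized records cannot be attributed back to the original identifiers, which is precisely the default behaviour the provision envisages.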

Furthermore, the inherent connection between the principle of PbD and anonymization consists in the fact that both concepts enable "a shift from zero-sum to positive-sum thinking".372 This claim should be approached carefully, since, as the pragmatists argue, "as the utility of data increases even a little, the privacy plummets".373 In other words, on the one hand it may be possible to have effective, strong anonymization while the information remains in a useful format; on the other hand, this may not be possible.374 For instance, in the U.S. Heritage Health Prize (HHP) case study, the approach used succeeded in achieving such a positive outcome.375 The implementation of appropriate anonymization techniques, based on an assessment of the potential re-identification risks, significantly reduces the possibility that an individual in an anonymized dataset could be re-identified.376

PbD could therefore be a valuable risk mitigating tool in the re-identification debate, especially in the areas of location/mobile data, BD, and profiling. In the context of these challenges, one of its core principles, data minimization, should be applied differently, requiring organizations to anonymize or pseudonymize data when possible, implement appropriate security measures, and limit uses of information to those that are acceptable from not only a personal but also a societal point of view.377

370 Peter Hustinx, Privacy by Design: delivering the promises, IDIS (2010) 3: 253, <http://www.springerlink.com/content/8258q1566232h0u4/fulltext.html> last accessed 28 June 2016. 371 ICO (n 12) 7. 372 Cavoukian and Castro (n 225). 373 Ohm (n 29), 1751. 374 Section 3.3. 375 See Khaled El Emam et al., "De-identification Methods for Open Health Data: The Case of the Heritage Health Prize Claims Dataset," Journal of Medical Internet Research 14, no. 1 (2012). 376 Ibid. 377 Tene and Polonetsky (n 10) 252.


4.6. Conclusion

Most of the privacy problems related to new technologies can occur unexpectedly, and sometimes they can be completely invisible.378 The DPIA and the general risk-based approach incorporated in the GDPR are supposed to play an essential role in future privacy and data protection compliance. Due to the complexity of the technological arena and the lack of legal certainty in identifying risks, they become a method of mitigation instead of a straightforward solution. In order to achieve this objective, privacy enhancing techniques such as anonymization and pseudonymization should be essential elements in the data protection toolkit of every organization. This is not to say that by implementing these strategies all privacy risks could ultimately be eliminated; that is impossible. The approach towards the risks identified in this thesis should be general, comprehensive and, from the legal perspective, risk-tolerant. Furthermore, factors such as unified risk assessment strategies (e.g. the DPIA) and the privacy by design and by default principles could facilitate the effectiveness of these techniques.

378 Chapter 3.


Chapter 5 – Conclusion

The enormous value of available information and the use of sophisticated systems such as Big Data (BD) analytics are nowadays among the key elements of the progress and success of humankind.379 The main issue is how to process data without disclosing information related to identifiable individuals, especially when this data has a sensitive character.380 Anonymization has frequently been put forward as one of the possible solutions to this ongoing debate. This understanding has been strongly challenged by computer scientists and statisticians, who have demonstrated the limitations and shortcomings of anonymization as a basis for policy. The research question of this paper was thus: how can the risk-based approach be used to counteract the risks arising from anonymization and pseudonymization, as defined in the GDPR?

Answering this question first required clarification of the legal definition of the essential concept of personal data, which has been strongly challenged by novel sophisticated technologies, by the increasing amount of data used by them, and by data which is publicly available.381 A critical aspect of this definition lies in what constitutes an "identifiable natural person", this being the cornerstone for the applicability of data protection rules. Therefore, the paper has illustrated that "identifiability" needs to be considered a contextual concept and should be approached as such, especially with regard to the wide employment of advanced data analytics, which allows unpredictable and unknown correlations.

The legal analysis of the definitions of anonymization and pseudonymization under the DPD reveals that there is a divergence among the MS, and even in the opinions of the WP29, about the exact interpretation of these concepts, which complicates the situation further. The GDPR is meant to overcome these definitional "dead-ends" through its direct applicability.382 The Regulation follows the DPD approach towards anonymized data by leaving it outside the scope of the data protection rules. Furthermore, the explicit introduction of "pseudonymization" into the legal text of the Regulation, and the clarification that pseudonymized data should be treated as personal data, brings legal certainty.

379 Tene and Polonetsky (n 10). 380 Domingo-Ferrer, Sánchez, and Soria-Comas (n 11). 381 Costa and Poullet (n 62). 382 Wu (n 161).


The analysis of the challenges that surround the concepts of anonymization and pseudonymization revealed the tangible weaknesses of both privacy-preserving techniques. The widely recognized threat of "re-identification" was presented and further clarified by the three landmark re-identification cases.383 The progress in re-identification methods has emphasized the conflict between utility and privacy, which needs to be balanced with regard to the use of anonymization and pseudonymization. Divergent views on the scale of the problems and how to address them were presented from three different perspectives: pragmatists, formalists, and the middle ground.384 The middle ground perspective showed that there are certain misleading interpretations on both extreme sides and that the knowledge derived from them can be used in a constructive way.

The GDPR will apply from 25 May 2018; until then there is an enormous amount of work to be done in preparation for its implementation. On the basis of the analysis conducted in this thesis, it appears that the risk-based approach and the risk management mechanisms presented remain vague and are overshadowed by uncertainty as to their final practical outcome. This vagueness is underlined by the nature and type of the genuine risks presented in Chapter 3.

As highlighted in this thesis, both anonymization and pseudonymization have risk as an inherent characteristic. Risk tolerance should therefore be preferred over approaches aiming at "perfect anonymization" or at the attainment of zero-risk solutions. The risk-based approach as set out in the GDPR provides the grounds to develop an adequate framework for a more efficient notion of anonymization. Nonetheless, it should not be considered a panacea for the challenges arising from anonymization and pseudonymization. It does, however, introduce useful tools, such as the DPIA or data protection by design and by default, which can help to mitigate the potential threats to data protection that such practices involve.

383 See section 3.2. 384 See section 3.3.


Bibliography

Legislation and Case Law

Charter of Fundamental Rights of the European Union, 26 October 2012, 2012/ C 326/02.

Consolidated version of the Treaty on the Functioning of the European Union (Treaty of Lisbon) (TFEU), 26 October 2012, OJ C 326

Council of Europe, European Convention on Human Rights, [1950], CETS No. 005.

Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data, Official Journal L 281, 23/11/1995 P. 0031 – 0050.

Bundesdatenschutzgesetz (Federal Data Protection Act, Dec. 20, 1990, BGBl. I at 2954, as amended).

Proposal for a Regulation of the European Parliament and of the Council on the protection of individuals with regard to the processing of personal data and on the free movement of such data (General Data Protection Regulation), Brussels, 25.1.2012 COM(2012) 11 final 2012/0011 (COD), European Commission.

Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation), OJ L 119, 04.05.2016.

Universal Declaration of Human Rights, 10 December 1948, 217 A (III) (UDHR)

European Court of Human Rights, S. and Marper v. the United Kingdom, Nos. 30562/04 and 30566/04, 04 December 2008.

European Court of Justice, Bodil Lindqvist, C-101/01 of 06 November 2003.


European Court of Justice, Volker and Markus Schecke GbR and Hartmut Eifert v. Land Hessen, Joined cases C-92/09 and C-93/09 of 09 November 2010.

Books, Articles and Papers

Aldhouse F., Anonymisation of personal data – A missed opportunity for the European Commission, Institute for Law and the Web, University of Southampton, (2014).

Bambauer D.E., The Myth of Perfection, 2 Wake Forest L. Rev. Online 22 (2012).

Borgesius Z., Frederik J., Singling Out People Without Knowing Their Names – Behavioural Targeting, Pseudonymous Data, and the New Data Protection Regulation, Computer Law & Security Review, (2016).

Brickell J. & Shmatikov V., The Cost of Privacy: Destruction of Data-Mining Utility in Anonymized Data Publishing, 14 Proc. Acm Sigkdd Int’l Conf. On Knowledge Discovery & Data Mining 70, (2008).

Bygrave L., Data Protection Law: Approaching Its Rationale, Logic and Limit, The Hague: Kluwer Law International, (2002).

Cavoukian A. & Castro D., Big Data and Innovation, Setting the Record Straight: Deidentification Does Work, (2014).

Cavoukian A., ‘Operationalizing Privacy by Design: A Guide to Implementing Strong Privacy Practices’, (2012).

Cavoukian A., Taylor S, and Abrams M. E., “Privacy by Design: Essential for Organizational Accountability and Strong Business Practices,” Identity in the Information Society 3, (2010).

Chawla S., Dwork C, McSherry F., Smith A. & Wee H., Toward Privacy in Public Databases, in 2 Theory Cryptography Conf. 363 (2005).


Costa L. & Poullet Y., Privacy and the Regulation of 2012, Computer Law and Security Review 28, no. 3 (2012).

De Hert P. & Papakonstantinou V., The New General Data Protection Regulation: Still a Sound System for the Protection of Individuals?, Computer Law and Security Review 32, no. 2 (2016).

De Montjoye Y-A, Radaelli L, Singh V. K., & Pentland A. Unique in the Shopping Mall: On the Reidentifiability of Credit Card Metadata, 347 SCI. 536, 537 (2015).

Delort P., Le Big Data, Que Sais-Je?, Paris: Presses Universitaires de France (2015).

Dinur I. & Nissim K, Revealing Information While Preserving Privacy, (2003).

Domingo-Ferrer J., Sánchez D., Soria-Comas J., Database Anonymization: Privacy Models, Data Utility, and Microaggregation-based Inter-model Connections, Morgan & Claypool Publishers, (2016).

Dwork C., Differential Privacy, Automata, Languages and Programming, 33rd Int'l Colloquium Proc. Part II, 2 (2006).

El Emam K & al., “De-identification Methods for Open Health Data: The Case of the Heritage Health Prize Claims Dataset,” Journal of Medical Internet Research 14, no. 1 (2012).

El Emam K. & Álvarez C., “A Critical Appraisal of the Article 29 Working Party Opinion 05/2014 on Data Anonymization Techniques,” International Data Privacy Law 5, no. 1 (2015).

Elliot M., Mackey E., O’Hara K. & Tudor C., The Anonymisation Decision-making Framework, UKAN Publications, (2016).

Esayas S.Y., The role of anonymization and pseudonymisation under the EU data privacy rules: beyond the ‘all or nothing’ approach, European Journal of Law and Technology, Vol 6, No 2, (2015).


Gellert R., Data protection: a risk regulation? Between the risk management of everything and the precautionary alternative, International Data Privacy Law, Vol. 5, No. 1, (2015).

Girard J., Klein D., Berg K., Strategic Data-Based Wisdom in the Big Data Era, IGI Global Book Series AKATM, (2015).

Hildebrandt M., Gutwirth S., ‘Defining Profiling: A New Type of Knowledge?’ in Profiling the European Citizen, Cross-Disciplinary Perspectives, Springer Science, (2008).

Hustinx P., Privacy by Design: delivering the promises, P. Idis 3: 253, (2010).

Kokott J. & Sobotta C., The distinction between privacy and data protection in the jurisprudence of the CJEU and the ECtHR, Oxford Journals Law International Data Privacy Law Volume 3, Issue 4, Pp. 222-228, (2013).

Koops B.-J., “The Trouble with European Data Protection Law,” International Data Privacy Law 4, no. 4 (2014).

Kosta E., “Peeking into the Cookie Jar: The European Approach towards the Regulation of Cookies,” International Journal of Law and Information Technology 21, no. 4 (2013).

Kuner C., Fred H. Cate, Christopher Millard, Dan Jerker B. Svantesson, and Orla Lynskey, Risk management in data protection, International Data Privacy Law, Vol. 5, No. 2, (2015).

Kuner C., The European Commission’s Proposed Data Protection Regulation: A Copernican Revolution in European Data Protection Law, in Bloomberg BNA Privacy and Security Law Report, (2012).

Lessig L., Code Version 2.0, Basic Books, (2006).

Lynskey O., The Foundation of the EU Data Protection, OUP (2015).


Mayer-Schönberger, V., and Cukier, K., Big Data: A Revolution That Will Transform How We Live, Work and Think, Eamon Dolan/Houghton Mifflin Harcourt, Reprint edition, (2013).

Millard C., Cloud computing law, Oxford University Press, New York, NY, (2013).

Moerel L., Prins C., Privacy for the homo digitalis, Proposal for a new regulatory framework for data protection in the light of Big Data and the Internet of Things, (2016).

Narayanan A. & Felten E.W., No silver bullet: De-identification still doesn't work, (2014).

Narayanan A. & Shmatikov V., Robust De-Anonymization of Large Datasets, 29 Proc. IEEE Symposium on Security & Privacy 111, 111–12 (2008).

Ohm P., Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization, UCLA Law Review, vol. 57, (2010).

Pentland A. Society's nervous system: building effective government, energy, and public health systems, MIT Human Dynamics Laboratory, (2011).

Pfitzmann A., & Hansen M., Anonymity, Unlinkability, Undetectability, Unobservability, Pseudonymity, and Identity Management - A Consolidated Proposal for Terminology. Dresden/Kiel: TUD/ULD. , (2008).

Polonetsky J., Tene O., Finch K., Shades Of Gray: Seeing The Full Spectrum Of Practical Data De-Identification, in Santa Clara Law Review, (2016).

Polonetsky J., Tene O., Jerome J., Benefit-Risk Analysis for Big Data Projects, Future of Privacy Forum, (2014).

Reding V., Outdoing Huxley: Forging a High Level of Data Protection for Europe in the Brave New Digital World, Speech at Digital Enlightenment Forum, (2012).


Roosendaal A., Digital Personae and Profiles in Law, Protecting Individuals’ Rights in Online Context, Wolf Legal Publishers, (2014).

Rubenstein I. S., Big Data: The End of Privacy or a New Beginning?, International Data Privacy Law Advance, (2015).

Rubinstein I. S. & Hartzog W., Anonymization And Risk, New York University School Of Law Public Law & Legal Theory Research Paper Series Working Paper No. 15-36, (2015).

Schwartz P. & Janger E., Notification of Data Security Breaches. Michigan Law Review, Vol. 105, (2007).

Schwartz P. & Solove D., The PII Problem: Privacy and a New Concept of Personally Identifiable Information, 86 NYU L. Rev. 1814 (2011).

Shapiro S.S., Separating the Baby from the Bathwater, Toward a generic and practical framework for anonymization, Technologies for Homeland Security (HST), 2011 IEEE International Conference 16-17, Nov (2011).

Solove D.J., Access and Aggregation: Public Records, Privacy and the Constitution, 86 Minn. L. Rev. 1137, (2002).

Solove D.J., Understanding Privacy, Harvard University Press, (2008).

Sweeney L., Uniqueness of Simple Demographics in the U.S. Population, Laboratory for Int'l Data Privacy, Working Paper LIDAP-WP4, (2000).

Tene O. & Polonetsky J., Big Data for All: Privacy and User Control in the Age of Analytics, 11 NW. J. TECH. & INTELL. PROP. 239 (2013).

Torey Z. L., The Conscious Mind, MIT Press, (2014).


Tversky A. & Kahneman D., Judgment under uncertainty: Heuristics and biases. Science, New Series, Vol. 185, No. 4157. pp. 1124-1131, (1974).

Van Dijk N., Gellert R. & Rommetveit K., A risk to a right? Beyond data protection risk assessments, Computer Law & Security Review 32, (2016).

Weber R. H., Can Data Protection be Improved through Privacy Impact Assessments, iJusletter IT No. 4, (2012).

Weber R. H., Privacy management practices in the proposed EU regulation, International Data Privacy Law, Vol. 4, No. 4 (2014).

Wright D., The state of the art in privacy impact assessment, Computer Law Security Review (2012).

Wu F.T., Defining Privacy And Utility In Data Sets, University Of Colorado Law Review vol. 84, (2013).

Bambauer J.Y., Tragedy of the Data Commons, 25 Harvard J.L. & Tech. 4, (2011).

Documents and Reports

Article 29 Data Protection Working Party, Opinion 4/2007 on the concept of personal data, WP 136, 20 June 2007.

Article 29 Data Protection Working Party, Opinion 03/2013 on purpose limitation, WP 203, 02 April 2013.

Article 29 Data Protection Working Party, Opinion 03/2013 on open data and public sector information ('PSI') reuse, WP 207, 05 June 2013.

Article 29 Data Protection Working Party, Opinion 05/2014 on Anonymisation Techniques,WP 216, 10, April.2014.


Article 29 Data Protection Working Party, Letter to Google (signed by 27 national Data Protection Authorities), 16 October 2012.

Article 29 Data Protection Working Party, Statement of the WP29 on the impact of the development of Big Data on the protection of individuals with regard to the processing of their personal data in the EU, WP 221, 16 September 2014.

Centre for Information Policy Leadership, Hunton & Williams LLP, A Risk-based Approach to Privacy: Improving Effectiveness in Practice, 19 June 2014, https://www.hunton.com/files/upload/Post-Paris_Risk_Paper_June_2014.pdf

Centre for Information Policy Leadership at Hunton & Williams LLP, Protecting Privacy in a World of Big Data, Paper 2: The Role of Risk Management, Discussion Draft, 16 February 2016, https://www.informationpolicycentre.com/uploads/5/7/1/0/57104281/protecting_privacy_in_a_world_of_big_data_paper_2_the_role_of_risk_management_16_february_2016.pdf

Commission Nationale de l’Informatique et des Libertés, ‘Deliberation No. 2013-420 of the Sanctions Committee of CNIL imposing a financial penalty against Google Inc’ (8 January 2014, English translation), https://www.cnil.fr/sites/default/files/typo/document/D2013-420_Google_Inc_EN.pdf

Dutch Data Protection Authority (College bescherming persoonsgegevens), ‘Investigation into the combining of personal data by Google, Report of Definitive Findings’ (z2013-00194), (2013), www.cbpweb.nl/sites/default/files/downloads/mijn_privacy/en_rap_2013-googleprivacypolicy.pdf

ENISA, Privacy by Design in Big Data: An overview of privacy enhancing technologies in the era of Big Data analytics, (2015), https://www.enisa.europa.eu/news/enisa-news/privacy-by-design-in-big-data-an-overview-of-privacy-enhancing-technologies-in-the-era-of-big-data-analytics

European Commission, Annex 2: Evaluation of the Implementation of the Data Protection Directive, 244, (2012), http://ec.europa.eu/justice/data-protection/document/review2012/sec_2012_72_en.pdf

European Union Agency for Fundamental Rights and the Council of Europe, Handbook on European Data Protection Law, (2014), http://fra.europa.eu/sites/default/files/fra-2014-handbook-data-protection-law-2nd-ed_en.pdf

González Fuster G. & Scherrer A., Big data and smart devices and their impact on privacy, Study for the LIBE Committee, (2015), http://www.europarl.europa.eu/RegData/etudes/STUD/2015/536455/IPOL_STU(2015)536455_EN.pdf

Information Commissioner’s Office, ‘Anonymisation: managing data protection risk code of practice’, (2012), https://ico.org.uk/media/1061/anonymisation-code.pdf

Interactive Advertising Bureau Europe, Your Online Choices: A Guide to Online Behavioural Advertising, www.youronlinechoices.com/uk/about-behavioural-advertising

International Organization for Standardization, ISO/TS 25237:2008(E): Health informatics: pseudonymization, International Organization for Standardization, Geneva, (2009), http://www.iso.org/iso/catalogue_detail?csnumber=42807

Borking J. & Huizenga J., Handbook of Privacy and Privacy-Enhancing Technologies: The Case of Intelligent Software Agents, The Hague, (2003), http://www.andrewpatrick.ca/pisa/handbook/Handbook_Privacy_and_PET_final.pdf

Office of Management & Budget, Circular No. A-94 Revised, Memorandum for Heads of Executive Departments and Establishments, Guidelines and Discount Rates for Benefit-Cost Analysis of Federal Programs, (1992), http://www.whitehouse.gov/omb/circulars_a094

European Data Protection Supervisor, Preliminary Opinion: Privacy and competitiveness in the age of Big Data: The interplay between data protection, competition law and consumer protection in the Digital Economy, March 2014, https://secure.edps.europa.eu/EDPSWEB/webdav/shared/Documents/Consultation/Opinions/2014/14-03-26_competitition_law_big_data_EN.pdf

Russom P., ‘TDWI Best Practices Report: Managing Big Data’, Fourth Quarter 2013, cited in European Data Protection Supervisor, Preliminary Opinion: Privacy and competitiveness in the age of Big Data: The interplay between data protection, competition law and consumer protection in the Digital Economy, March 2014, https://secure.edps.europa.eu/EDPSWEB/webdav/shared/Documents/Consultation/Opinions/2014/14-03-26_competitition_law_big_data_EN.pdf

Miscellaneous

Aggarwal C.C., On k-Anonymity and the Curse of Dimensionality, Proceedings of the 31st International Conference on Very Large Data Bases 901, 909, (2005), http://www.vldb2005.org/program/paper/fri/p901-aggarwal.pdf [https://perma.cc/QZ9E-HQDV].

Alvaro A., ‘Draft Amendments to the General Data Protection Regulation’, 21 February 2013, https://www.huntonprivacyblog.com/2013/03/21/libe-committee-debates-proposed-eu-general-data-protection-regulation/

Antti A., Risk-Based Approach as a Solution to Secondary Use of Personal Data, University of Helsinki, Faculty of Law, (2014), https://helda.helsinki.fi/handle/10138/136409

Bambauer J.Y., Is De-Identification Dead Again?, Info/Law Blog, (28 April 2015), https://blogs.law.harvard.edu/infolaw/2015/04/28/is-de-identification-dead-again/

Barbaro M. & Zeller T., Jr., A Face Is Exposed for AOL Searcher No. 4417749, N.Y. TIMES, Aug. 9, 2006: http://www.nytimes.com/2006/08/09/technology/09aol.html

Cavoukian A., “Privacy by Design.”, (2009), http://www.privacybydesign.ca/content/uploads/2009/01/privacybydesign.pdf

Storm H., Data Protection Regulation provokes intense lobbying, EJC News, Vol. 49, Issue 7/8, (2013), http://www.ejcancer.com/pb/assets/raw/Health%20Advance/journals/ejc/EJCNews_July2013_bothstories.pdf

Davies S., Why Privacy by Design is the next crucial step for privacy protection, (2010), http://i-comp.org/wp-content/uploads/2013/07/privacy-by-design.pdf

Google AdWords, ‘About the Google Display Network’ (publication date unknown), https://adwords.google.com/support/aw/bin/answer.py?hl=en&answer=57174

Facebook Newsroom, Company Info, http://newsroom.fb.com/company-info/

Kuneva M., Roundtable on Online Data Collection, Targeting and Profiling, Speech, (2009).

Webopedia, Big Data, http://www.webopedia.com/TERM/B/big_data.html

Lohr S., The Privacy Challenge in Online Prize Contests, N.Y. Times (May 21, 2011), http://bits.blogs.nytimes.com/2011/05/21/the-privacy-challenge-in-online-prize-contests/

Maldoff G., Top 10 operational impacts of the GDPR: Part 8 – Pseudonymization, 12 February 2016, https://iapp.org/news/a/top-10-operational-impacts-of-the-gdpr-part-8-pseudonymization/

Recommendations to Identify and Combat Privacy Problems in the Commonwealth: Hearing on H.R. 351 Before the H. Select Comm. on Information Security, 189th Sess. (Pa. 2005) (statement of Latanya Sweeney, Associate Professor, Carnegie Mellon University), available at http://dataprivacylab.org/dataprivacy/talks/Flick-05-10.html

Olejnik L., Acar G., Castelluccia C., Diaz C., The Leaking Battery: A Privacy Analysis of the HTML5 Battery Status API, 2015, http://eprint.iacr.org/2015/616.pdf.

Oxford Online Dictionary, Anonymized, http://www.oxforddictionaries.com/definition/english/anonymize

Reding V., The EU Data Protection Regulation: Promoting Technological Innovation and Safeguarding Citizens’ Rights, (2014), http://europa.eu/rapid/press-release_SPEECH-14-175_en.htm?locale=en

The Netflix Prize Rules, Netflix Prize, http://netflixprize.com/assets/rules.pdf

Tockar A., Riding with the Stars: Passenger Privacy in the NYC Taxicab Data Set, September 15, 2014, http://research.neustar.biz/author/atockar

Yahoo!, Rationale for Amendments to Draft Data Protection Regulation as Relate to Pseudonymous Data, http://www.centerfordigitaldemocracy.org/sites/default/files/Yahoo_on_Pseudonymous_Data-1.pdf

Amazon EU Sarl, Proposed amendments to MEP Gallo’s opinion on data protection, https://wiki.laquadrature.net/images/7/71/AMAZON-amendments.pdf
