Pseudonymization and De-identification Techniques

Anna Pouliou
Athens – November 21, 2017

The Value of Data: Benefits of Data Analytics

● Product/consumer safety

● Improvement of customer experience

● Overall improvement of products and services in many sectors, some critical

● Safety and efficiency of operations

● Regulatory compliance

● Predictive maintenance of equipment

● Reduction of costs

● …

De-Identification in the General Data Protection Regulation

GDPR and De-Identification – Definition

● Recital 26 and Art. 4(5):

“The processing of personal data in such a way that the data can no longer be attributed to a specific data subject without the use of additional information.”

The additional information must be “kept separately and be subject to technical and organizational measures to ensure non attribution to an identified or identifiable person”.

=> A privacy-enhancing technique where PII is held separately and securely to ensure non-attribution.

GDPR and De-Identification – Incentives

● Art 5: It may facilitate processing personal data beyond original collection purposes

● Art 89 (1): It is an important safeguard for processing personal data for scientific, historical and statistical purposes

● Art 25 (1): It is a central feature of “privacy by design”

● Art 32 (1): Controllers can use it to help meet the Regulation’s data security requirements

● Art 15-20: Controllers do not need to provide data subjects with access, rectification, erasure or data portability if they can no longer identify a data subject.

● Art 40 (2) (d): It encourages controllers to adopt codes of conduct that promote pseudonymization

Interpreting De-Identification

The Article 29 Working Party

● Opinion 05/2014 on Anonymization Techniques

● It examines the effectiveness and limits of anonymization techniques against the legal framework of the EU

● It identifies 7 techniques for de-identification:

● Noise Addition

● Substitution/Permutation

● Differential Privacy

● Aggregation/K-Anonymity

● L-Diversity

● Pseudonymization – Hash Functions

● Pseudonymization – Tokenization

● The opinion states that anonymization results in processing personal data in a manner to “irreversibly prevent identification.”

The Court of Justice of the European Union

● Case C-582/14 Patrick Breyer v. Bundesrepublik Deutschland

● A CJEU ruling that touched fundamental DP law questions and reversed the WP29 interpretation

● Request for a preliminary ruling under Article 267 TFEU from the Bundesgerichtshof (Federal Court of Justice, Germany)

“However, it must be determined whether the possibility to combine a dynamic IP address with the additional data held by the internet service provider constitutes a means likely reasonably to be used to identify the data subject.”

“Thus, as the Advocate General stated essentially in point 68 of his Opinion, that would not be the case if the identification of the data subject was prohibited by law or practically impossible on account of the fact that it requires a disproportionate effort in terms of time, cost and man-power, so that the risk of identification appears in reality to be insignificant.”

The UK ICO

Data Protection Bill, House of Lords second reading – Information Commissioner’s briefing (9 October 2017):

● Clause 162: Re-identification of de-identified personal data

● 33. In her evidence to Parliament during the passage of the Digital Economy Act 2017, the Commissioner recommended that Government consider stronger sanctions for deliberate and negligent re-identification of anonymised data. She is pleased that the government has included such an offence for knowingly or recklessly re-identifying de-identified personal data without the consent of the data controller. The rapid evolution of technology and growth in the digital economy has led to a vast increase in the availability and value of data. There is a clear need for extensive data processing to be accompanied by robust safeguards to guard against misuse and uphold the law.

● 34. The offence is accompanied by appropriate defences including that the re-identification was necessary for the purpose of preventing or detecting crime; was justified in the public interest in particular circumstances; or the person had the consent of the data controller. There are good reasons to have these defences - for example, for organisations testing security and anonymisation techniques. This would allow security testing and research to take place in appropriate circumstances.

De-Identification in Practice

Key Definitions

● “Direct Identifier”: data that identifies a person on its own, without additional information or linking (e.g. SSN, passport number).

● “Indirect Identifier”: data that identifies an individual indirectly (e.g. IP address, cookies, location, license plate)

● “Aggregation”: process by which information is compiled and expressed in summary form

The 7 De-Identification Techniques of WP29

1. Noise Addition: identifiers are expressed imprecisely (e.g. weight is expressed inaccurately, +/- 10 kg).
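Noise addition can be sketched in a few lines; the uniform ±10 kg spread below mirrors the weight example, and the function and field names are illustrative, not from the WP29 opinion:

```python
import random

def add_noise(weight_kg: float, spread: float = 10.0) -> float:
    """Return the weight perturbed by uniform noise in [-spread, +spread] kg."""
    return round(weight_kg + random.uniform(-spread, spread), 1)

weights = [72.5, 88.0, 65.3]
noisy = [add_noise(w) for w in weights]  # each value stays within ~10 kg of the original
```

The data remains usable for broad statistics (average weight of a cohort) while any single record no longer states the true value.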

2. Substitution/Permutation: identifiers are shuffled within a table or replaced with random values (e.g. a specific blood type is replaced with “Magenta”).
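A minimal sketch of the permutation variant: one attribute is shuffled across records, so every value stays realistic but is detached from its original record (the sample records are invented):

```python
import random

def permute_column(rows: list[dict], key: str) -> list[dict]:
    """Shuffle one attribute's values across all records, breaking the
    link between each record and its own original value."""
    values = [row[key] for row in rows]
    random.shuffle(values)
    return [{**row, key: v} for row, v in zip(rows, values)]

records = [
    {"id": 1, "blood_type": "O+"},
    {"id": 2, "blood_type": "A-"},
    {"id": 3, "blood_type": "AB+"},
]
permuted = permute_column(records, "blood_type")
```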

3. Differential Privacy: identifiers of one data set are compared against an anonymized data set held by a third party, with instructions on the noise function and the acceptable amount of data leakage.
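WP29 describes differential privacy at the level of whole data releases; the classic building block underneath is the Laplace mechanism, sketched here for a simple count query. The epsilon value and the query are illustrative assumptions, not part of the opinion:

```python
import random

def laplace_noise(scale: float) -> float:
    # The difference of two exponential draws is Laplace(0, scale).
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def dp_count(values, predicate, epsilon: float = 1.0) -> float:
    """Release a count with Laplace noise. A count query has sensitivity 1
    (one person changes it by at most 1), so scale 1/epsilon bounds the
    information leaked about any single individual."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon)
```

Smaller epsilon means more noise and stronger privacy; the "acceptable amount of data leakage" in the slide corresponds to the chosen epsilon.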

4. Aggregation/K-Anonymity: identifiers are generalized into a range or group (e.g. age 43 is generalized to age group 40-55).
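The generalization step and the k-anonymity check can be sketched as follows; the age bands are taken from the example above, while the function names are illustrative:

```python
from collections import Counter

def generalize_age(age: int) -> str:
    """Coarsen an exact age into a band, e.g. 43 -> '40-55'."""
    for lo, hi in [(0, 17), (18, 39), (40, 55), (56, 120)]:
        if lo <= age <= hi:
            return f"{lo}-{hi}"
    return "unknown"

def is_k_anonymous(rows: list[dict], quasi_ids: list[str], k: int) -> bool:
    """True if every combination of quasi-identifier values occurs at least k times."""
    counts = Counter(tuple(row[q] for q in quasi_ids) for row in rows)
    return all(c >= k for c in counts.values())

cohort = [{"age_band": generalize_age(a)} for a in (41, 43, 50)]  # all map to '40-55'
```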

5. L-Diversity: identifiers are first generalized, then each attribute within an equivalence class is made to occur at least “L” times (e.g. properties are assigned to personal identifiers, and each property is made to occur within a dataset, or partition, a minimum number of times).
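A small check of the L-diversity property under these assumptions: equivalence classes are formed on an already-generalized age band, and each class must contain at least L distinct values of the sensitive attribute (records and field names are invented):

```python
from collections import defaultdict

def is_l_diverse(rows: list[dict], quasi_ids: list[str], sensitive: str, l: int) -> bool:
    """True if every equivalence class (records sharing the same
    quasi-identifier values) has at least l distinct sensitive values."""
    classes = defaultdict(set)
    for row in rows:
        classes[tuple(row[q] for q in quasi_ids)].add(row[sensitive])
    return all(len(values) >= l for values in classes.values())

cohort = [
    {"age_band": "40-55", "diagnosis": "flu"},
    {"age_band": "40-55", "diagnosis": "asthma"},
]
```

This guards against the weakness of plain k-anonymity where a class is large enough but every member shares the same sensitive value.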

6. Pseudonymization – Hash Functions: identifiers of any size are replaced with artificial codes of a fixed size (e.g. blood type O+ is replaced with “01”, blood type A- with “02”, blood type A+ with “03”, etc.).
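One common way to sketch this is a keyed hash (HMAC-SHA256, truncated here to a fixed 16-character code) rather than a plain hash, since hashing a low-entropy identifier without a secret key can often be reversed by brute force. The secret key then plays the role of the GDPR's separately kept “additional information”; the key and identifier below are illustrative:

```python
import hashlib
import hmac

def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Map an identifier of any size to a fixed-size code via a keyed hash.
    The key must be stored separately under technical and organizational
    safeguards, so the mapping cannot be recomputed without it."""
    return hmac.new(secret_key, identifier.encode(), hashlib.sha256).hexdigest()[:16]

key = b"kept-separately-under-access-controls"  # illustrative key
code = pseudonymize("patient-12345", key)
```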

7. Pseudonymization – Tokenization: identifiers are replaced with a non-sensitive identifier that traces back to the original data but is not mathematically derived from it (e.g. a credit card number is exchanged in a token vault for a randomly generated token number).

Pseudonymization and De-Identification: a useful tool

● Pseudonymization and De-ID practices are already a reality

● Varying levels of de-identification (data sets vary per industry)

● Each company can define its own technical measures and use of such tools

● Legal obligation in certain regions (e.g. foreseen in Swiss law for employee data)

● Contractual commitment to customers to use certain data in an “aggregate or otherwise de-identified format” in certain situations

● “Customer Data De-Identification, Anonymization, and Aggregation Policies” – summary of technical measures => Transparency & Clarity

Thank you