Data Anonymization


Sara Szoc, CrossLang
Workshop: Data Anonymization

Introduction

• Concept
• Methods
• Risks
• Practical tips

What is data anonymization

What?
• Process of removing private or confidential information from raw data
• Results in anonymous data that cannot be associated with any individual or company

Why?
• Protection of identity and private activities
• Financial aspect

How?
• Using anonymization technique(s)
• Selection and assessment based on use case

Personal Data

Personal or identifiable data: information that can lead to the identification of an individual (or a group of individuals)

• Direct identifiers: person/company name, surname, email address containing name, phone number, ID card/social security number, medical record number …
• Indirect identifiers: date of birth, gender, zipcode (can uniquely identify about 80% of the US population)

• Pseudonymous or encrypted data can be used to re-identify a person and thus remains personal data

“Personal data that has been rendered anonymous in such a way that the individual is not or no longer identifiable is no longer considered personal data. For data to be truly anonymised, the anonymisation must be irreversible.”
(source: General Data Protection Regulation)

Sensitive Data

• Sensitive personal data
  • can cause harm or embarrassment to the individual
  • for limited dissemination only
  • examples: racial/ethnic origin, political/religious beliefs, genetic data, biometric data (fingerprints), health information, sexual orientation … (GDPR)
• Sensitive business information
  • poses a risk to the company in question if discovered
  • examples: trade secrets, acquisition plans, financial data, supplier and customer information

Structured versus unstructured data

• Structured data
  • stored in a structured way
  • easily searchable
  • examples: relational databases, spreadsheets, data in formats such as JSON, XML, CSV …
• Unstructured data
  • anything else
  • difficult to search
  • examples: text files, reports, email messages, audio files, images …

Anonymization methods

[Figure: suppression and masking, data before and after anonymization]
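Since the suppression and masking figure is not reproduced here, the two methods can be illustrated with a minimal sketch. The record, its field names, and the helper functions `suppress` and `mask` are hypothetical and chosen only for illustration; real masking rules depend on the use case.

```python
import re

# Hypothetical record; field names and values are illustrative assumptions.
record = {
    "name": "John Smith",
    "email": "john.smith@example.com",
    "phone": "+32 470 12 34 56",
    "illness": "Flu",
}


def suppress(rec, fields):
    """Suppression: drop the listed fields from the record entirely."""
    return {key: value for key, value in rec.items() if key not in fields}


def mask(value, keep_last=2, mask_char="x"):
    """Masking: replace letters and digits with mask_char,
    leaving the last keep_last characters readable."""
    if keep_last > 0:
        head, tail = value[:-keep_last], value[-keep_last:]
    else:
        head, tail = value, ""
    return re.sub(r"\w", mask_char, head) + tail


# Suppress the direct identifiers we do not need at all,
# and mask the one we still want to keep in a partially readable form.
anonymized = suppress(record, {"name", "email"})
anonymized["phone"] = mask(anonymized["phone"])
print(anonymized)
```

Suppression removes a value completely, while masking keeps the field and its shape but hides most of its characters; which one is appropriate depends on how much of the field the recipient actually needs.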

[Figure: classification, data before and after anonymization]

Swapping, perturbation, generalization

Before:

Name     Age   Location   Illness
John     40    Brussels   Flu
Ashley   56    Antwerp    Multiple Sclerosis
Luke     80    Berlin     Lung cancer
Roman    71    Munchen    Multiple Sclerosis

After:

Name     Age   Location   Illness
Luke     39    Belgium    Flu
Ashley   57    Belgium    Multiple Sclerosis
John     81    Germany    Lung cancer
Roman    72    Germany    Multiple Sclerosis

Pseudonymization

• Reversible process by using a key
• Still to be treated as personal data because it enables re-identification

Name     Pseudonymized   Anonymized
John     q0fdGL          xxxxx
Ashley   s8fhPd          xxxxx
Luke     EiuD5j          xxxxx
Roman    qOerd           xxxxx
Luke     EiuD5j          xxxxx

Measuring anonymization and risks

• K-anonymity, Differential privacy
• Focus on structured data

Gender   Age     Location   Illness
male     40-50   Belgium    Flu
male     40-50   Belgium    Multiple Sclerosis
female   >50     Germany    Lung cancer
female   >50     Germany    Multiple Sclerosis

2-anonymous data

Tools for structured data
• ARX
• Cornell Anonymization Toolkit
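The "2-anonymous data" label can be checked by hand on a small table. The following is a minimal sketch, not how any of the listed tools work internally: it counts how many rows share each combination of indirect identifiers and reports the smallest group size as k. The function name `k_anonymity` and the lowercase field names are assumptions made for this example; the rows are taken from the 2-anonymous table.

```python
from collections import Counter

# Rows of the 2-anonymous example table.
rows = [
    {"gender": "male", "age": "40-50", "location": "Belgium", "illness": "Flu"},
    {"gender": "male", "age": "40-50", "location": "Belgium", "illness": "Multiple Sclerosis"},
    {"gender": "female", "age": ">50", "location": "Germany", "illness": "Lung cancer"},
    {"gender": "female", "age": ">50", "location": "Germany", "illness": "Multiple Sclerosis"},
]


def k_anonymity(rows, indirect_identifiers):
    """Return k: the size of the smallest group of rows that share
    the same values for all indirect identifiers (0 for an empty table)."""
    groups = Counter(
        tuple(row[field] for field in indirect_identifiers) for row in rows
    )
    return min(groups.values(), default=0)


k = k_anonymity(rows, ["gender", "age", "location"])
print(f"The table is {k}-anonymous")  # prints "The table is 2-anonymous"
```

Every combination of gender, age range, and location occurs at least twice, so no row can be singled out by those three fields alone. Note that the sensitive column (illness) is deliberately left out of the grouping.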

Existing tools

Tools for unstructured data
• MITRE Identification Scrubber Toolkit (MIST)
• Natural language processing tools (e.g. OpenNLP or Stanford CoreNLP Named Entity Recognizers)

Practical tips

There is no “one size fits all” solution; different factors need to be taken into consideration:

• Analyze nature of data
• Analyze recipients (conclusions)
• Analyze risks (de-anonymization risk management)
• Analyze data utility
• Run anonymization process inside organization
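To make the "analyze data utility" tip concrete, here is a minimal sketch that applies generalization and perturbation to the sample records from the anonymization methods table and then compares one simple statistic before and after. The city-to-country mapping, the ±1 age shift, the fixed random seed, and the choice of mean age as a utility measure are all assumptions made for illustration; swapping is left out for brevity.

```python
import random
from statistics import mean

# Sample records from the anonymization methods table.
records = [
    {"name": "John", "age": 40, "location": "Brussels", "illness": "Flu"},
    {"name": "Ashley", "age": 56, "location": "Antwerp", "illness": "Multiple Sclerosis"},
    {"name": "Luke", "age": 80, "location": "Berlin", "illness": "Lung cancer"},
    {"name": "Roman", "age": 71, "location": "Munchen", "illness": "Multiple Sclerosis"},
]

# Assumed generalization hierarchy: city -> country.
COUNTRY = {
    "Brussels": "Belgium",
    "Antwerp": "Belgium",
    "Berlin": "Germany",
    "Munchen": "Germany",
}


def generalize_location(rec):
    """Generalization: replace the city with its country."""
    return {**rec, "location": COUNTRY[rec["location"]]}


def perturb_age(rec, rng, max_shift=1):
    """Perturbation: shift the age up or down by max_shift."""
    return {**rec, "age": rec["age"] + rng.choice([-max_shift, max_shift])}


rng = random.Random(0)  # fixed seed so the sketch is repeatable
anonymized = [perturb_age(generalize_location(r), rng) for r in records]

# One possible utility check: how far did the mean age move?
age_shift = abs(mean(r["age"] for r in records) - mean(r["age"] for r in anonymized))
print(anonymized)
print(f"Mean age shifted by {age_shift}")
```

A small shift in the statistic suggests the anonymized data is still useful for that particular analysis; other use cases would need other checks, and a stronger perturbation lowers re-identification risk at the cost of utility.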