Anonymisation Models for Text Data: State of the Art, Challenges and Future Directions
Pierre Lison1, Ildikó Pilán1, David Sánchez2, Montserrat Batet2, and Lilja Øvrelid3
1Norwegian Computing Center, Oslo, Norway
2Universitat Rovira i Virgili, CYBERCAT, UNESCO Chair in Data Privacy, Spain
3Language Technology Group, University of Oslo, Norway
{plison,[email protected] {david.sanchez,[email protected] liljao@ifi.uio.no

Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 4188–4203, August 1–6, 2021. ©2021 Association for Computational Linguistics

Abstract

This position paper investigates the problem of automated text anonymisation, which is a prerequisite for secure sharing of documents containing sensitive information about individuals. We summarise the key concepts behind text anonymisation and provide a review of current approaches. Anonymisation methods have so far been developed in two fields with little mutual interaction, namely natural language processing and privacy-preserving data publishing. Based on a case study, we outline the benefits and limitations of these approaches and discuss a number of open challenges, such as (1) how to account for multiple types of semantic inferences, (2) how to strike a balance between disclosure risk and data utility, and (3) how to evaluate the quality of the resulting anonymisation. We lay out a case for moving beyond sequence labelling models and incorporating explicit measures of disclosure risk into the text anonymisation process.

1 Introduction

Privacy is a fundamental human right (Art. 12 of the Universal Declaration of Human Rights) and a critical component of any free society, protecting citizens against, among other things, social control, stigmatisation, and threats to political expression. Privacy is also protected by multiple national and international legal frameworks, such as the General Data Protection Regulation (GDPR) introduced in Europe in 2018. This right to privacy imposes constraints on the usage and distribution of data including personal information, such as emails, court cases or patient records. In particular, personal data cannot be distributed to third parties (or even used for secondary purposes) without legal ground, such as the explicit and informed consent of the individuals to whom the data refers.
As informed consent is often difficult to obtain in practice, an alternative is to rely on anonymisation techniques that render personal data no longer personal. Access to anonymised data is a prerequisite for research advances in many scientific fields, notably in medicine and the social sciences. By facilitating open data initiatives, anonymised data can also help empower citizens and support democratic participation. For structured databases, anonymisation can be enforced through well-established privacy models such as k-anonymity (Samarati, 2001; Samarati and Sweeney, 1998) or differential privacy (Dwork et al., 2006). These privacy models and their implementations are, however, difficult to apply to unstructured data such as texts. In fact, text anonymisation has traditionally been enforced manually, a process that is costly, time-consuming and prone to errors (Bier et al., 2009). These limitations led to the development of various computational frameworks designed to extend automated or semi-automated anonymisation to the text domain (Meystre et al., 2010; Sánchez and Batet, 2016; Dernoncourt et al., 2017).

In this paper, we review the core concepts underlying text anonymisation and survey the approaches put forward to solve this task. These can be divided into two independent research directions. On the one hand, NLP approaches rely on sequence labelling to detect and remove predefined categories of entities that are considered sensitive or of a personal nature (such as names, phone numbers or medical conditions). On the other hand, privacy-preserving data publishing (PPDP) approaches take the notion of disclosure risk as their starting point and anonymise text by enforcing a privacy model. Anonymisation then consists of a sequence of transformations (such as removal or generalisation) applied to the document to ensure that the requirements derived from the privacy model are fulfilled.
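To make the notion of a privacy model concrete, the k-anonymity requirement mentioned above can be checked with a few lines of code. The sketch below is illustrative only (the record fields are invented): it verifies that every combination of quasi-identifier values is shared by at least k records.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    occurs in at least k records of the dataset."""
    combos = Counter(
        tuple(record[attr] for attr in quasi_identifiers)
        for record in records
    )
    return all(count >= k for count in combos.values())

# Toy dataset: gender and postal code act as quasi-identifiers,
# diagnosis is the confidential attribute.
records = [
    {"gender": "F", "zip": "0150", "diagnosis": "flu"},
    {"gender": "F", "zip": "0150", "diagnosis": "asthma"},
    {"gender": "M", "zip": "0460", "diagnosis": "flu"},
]

# The single (M, 0460) record makes its owner unique, so 2-anonymity fails.
print(is_k_anonymous(records, ["gender", "zip"], 2))  # False
```

A dataset failing this check is typically repaired by generalising or suppressing quasi-identifier values until all groups reach size k, which is precisely the kind of transformation that is hard to transfer to free-flowing text.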
This position paper makes the case that none of these approaches provides a fully satisfactory account of the text anonymisation problem. We illustrate their merits and shortcomings on a case study and discuss three open challenges:

1. How to ensure that anonymisation is robust against multiple types of semantic inferences, based on background knowledge assumed to be available to an adversary;

2. How to transform the text in order to minimise the risk of disclosing personal data, yet retain as much semantic content as possible;

3. How to empirically evaluate the quality (in terms of disclosure risk and utility preservation) of the resulting anonymisation.
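The NLP-style masking operations discussed above can be illustrated with a deliberately simplified sketch: detect predefined categories of direct identifiers in a document and replace them with category placeholders. Real systems use trained sequence-labelling models; the hand-written gazetteer and regex below are invented purely for illustration.

```python
import re

# Hypothetical gazetteer of person names and a simplistic phone pattern.
# A production system would use a trained named-entity recogniser instead.
KNOWN_NAMES = {"John Doe", "Jane Roe"}
PHONE_RE = re.compile(r"\b\d{3}[- ]\d{4}\b")

def mask_direct_identifiers(text):
    """Replace detected direct identifiers with category placeholders."""
    for name in KNOWN_NAMES:
        text = text.replace(name, "[PERSON]")
    text = PHONE_RE.sub("[PHONE]", text)
    return text

doc = "John Doe can be reached at 555-0142."
print(mask_direct_identifiers(doc))  # [PERSON] can be reached at [PHONE].
```

Note that this suppression-only strategy addresses neither quasi-identifiers nor the risk/utility trade-off raised in challenges (1) and (2): everything the patterns miss is disclosed, and everything they hit is removed outright rather than generalised.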
We argue in this paper that NLP and PPDP approaches should be viewed as complementary (one focusing on linguistic patterns, the other on disclosure risk) and that future anonymisation approaches for text should seek to reconcile these two views. In particular, we contend that text anonymisation models should combine a data-driven editor model (which selects masking operations on the document) with an adversary seeking to infer confidential attributes from the edited documents.

2 What is Anonymisation?

The most common definition of privacy amounts to self-determination, which is the ability of individuals, groups or organisations to seclude information about themselves selectively (Westin, 1967). Information related to an identified or identifiable person is known as personal data, or more precisely personally identifiable information (PII). Datasets with PII cannot be released without control, as this would impair the privacy of the data subjects.

Direct Identifier: A (set of) variable(s) unique for an individual (a name, address, phone number or bank account) that may be used to directly identify the subject.

Quasi Identifier: Information (such as gender, nationality, or city of residence) that in isolation does not enable re-identification, but may do so when combined with other quasi-identifiers and background knowledge.

Confidential Attribute: Private personal information that should not be disclosed (such as a medical condition).

Identity Disclosure: Unequivocal association of a record/document with a subject's identity.

Attribute Disclosure: Unequivocal inference of a confidential attribute about a subject.

Anonymisation: Complete and irreversible removal from a dataset of any information that, directly or indirectly, may lead to a subject's data being identified.

De-identification: Process of removing specific, predefined direct identifiers from a dataset.

Pseudonymisation: Process of replacing direct identifiers with pseudonyms or coded values (such as "John Doe" → "Patient 3"). The mapping between coded values and the original identifiers is then stored separately.

Table 1: Key terms related to data anonymisation.
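The pseudonymisation process defined in Table 1 can be sketched in a few lines. This is a minimal illustration: the "Patient n" coding scheme follows the table's example, and in practice the mapping would be stored in a separate, secured location rather than kept in memory.

```python
from itertools import count

class Pseudonymiser:
    """Replace direct identifiers with coded values, keeping the
    identifier-to-code mapping so that it can be stored separately
    from the pseudonymised text."""

    def __init__(self):
        self.mapping = {}          # original identifier -> coded value
        self._counter = count(1)

    def pseudonymise(self, text, identifiers):
        for identifier in identifiers:
            if identifier not in self.mapping:
                self.mapping[identifier] = f"Patient {next(self._counter)}"
            text = text.replace(identifier, self.mapping[identifier])
        return text

p = Pseudonymiser()
print(p.pseudonymise("John Doe visited. John Doe was discharged.", ["John Doe"]))
# Patient 1 visited. Patient 1 was discharged.
print(p.mapping)  # the mapping to be stored separately: {'John Doe': 'Patient 1'}
```

Because the mapping allows re-identification, pseudonymised data is still considered personal data under the GDPR, unlike properly anonymised data.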
2.1 Legal Requirements

Various legal frameworks regulate how PII can be collected and processed. In particular, the General Data Protection Regulation introduced in Europe (GDPR, 2016) states that data owners must have a legal basis for processing PII, the most important one being the explicit consent of the data subjects. Alternatively, data owners may choose to anonymise the data to ensure it can no longer be attributed to specific individuals. Anonymised data is no longer regulated by the GDPR and can therefore be freely released.

Table 1 defines some of the key terms related to data anonymisation (Elliot et al., 2016). This terminology is, however, not always applied consistently, as several authors seem to use e.g. the terms "anonymisation" and "de-identification" interchangeably (Chevrier et al., 2019).

GDPR-compliant anonymisation is the complete and irreversible process of removing personal identifiers, both direct and indirect, that may lead to an individual being identified. Direct identifiers correspond to values such as names or social security numbers that directly disclose the identity of the individual. However, removing direct identifiers is not sufficient to eliminate all disclosure risks, as individuals may also be re-identified by combining several pieces of information together with some background knowledge. For instance, the combination of gender, birth date and postal code can be exploited to identify between 63 and 87% of the U.S. population, due to the public availability of US Census data (Golle, 2006). These types of personal identifiers are called quasi-identifiers and encompass a large variety of data types, such as demographic and geospatial data. Anonymisation therefore necessitates both the removal of direct identifiers and the masking of quasi-identifiers.
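The re-identification risk created by quasi-identifiers can be quantified in a simple way: a record's risk is the inverse of the number of records sharing its quasi-identifier values, since an adversary matching on those values cannot distinguish between them. A minimal sketch (attribute names invented for illustration):

```python
from collections import Counter

def reidentification_risk(records, quasi_identifiers):
    """Per-record re-identification risk: 1 / (size of the group of
    records sharing the same quasi-identifier values)."""
    key = lambda r: tuple(r[a] for a in quasi_identifiers)
    group_sizes = Counter(key(r) for r in records)
    return [1 / group_sizes[key(r)] for r in records]

population = [
    {"gender": "F", "birth_year": 1980, "zip": "0150"},
    {"gender": "F", "birth_year": 1980, "zip": "0150"},
    {"gender": "M", "birth_year": 1975, "zip": "0460"},
]
print(reidentification_risk(population, ["gender", "birth_year", "zip"]))
# [0.5, 0.5, 1.0]: the third record is unique on its quasi-identifiers
```

Generalising values (e.g. replacing a full postal code with its region) enlarges these groups and lowers the risk, at the cost of data utility.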
Other legal frameworks have adopted a different approach. In the US, the Health Insurance Portability and Accountability Act (HIPAA) (HIPAA, 2004) lists 18 data types, such as a patient's name, address or social security number, which

Attribute disclosure, which is usually more harmful from a privacy perspective, should also be avoided. The removal of personal information necessarily entails some data utility loss. Because the ultimate purpose behind data releases is to produce usable data, the best anonymisation methods are those that optimise the trade-off between minimising the