The Quasi-identifiers are the Problem: Attacking and Reidentifying k-anonymous Datasets

Working Paper

Aloni Cohen∗

May 25, 2021

Abstract

Quasi-identifier-based (QI-based) deidentification techniques are widely used in practice, including k-anonymity [Swe98, SS98], ℓ-diversity [MKGV07], and t-closeness [LLV07]. We give three new attacks on QI-based techniques: one reidentification attack on a real dataset and two theoretical attacks. We focus on k-anonymity, but our theoretical attacks work as is against ℓ-diversity, t-closeness, and other QI-based techniques satisfying modest requirements.

Reidentifying EdX students. Harvard and MIT published data of 476,532 students from their online learning platform EdX [HRN+14]. This data was k-anonymized to comply with the Family Educational Rights and Privacy Act (FERPA). We show that the EdX data does not prevent reidentification and disclosure. For example, 34.2% of the 16,224 students in the dataset who earned a certificate of completion are uniquely distinguished by their EdX certificates plus basic demographic information. We reidentified 3 students on LinkedIn with high confidence, each of whom also enrolled in but failed to complete an EdX course. The limiting factor of this attack was missing data, not the privacy protection offered by k-anonymity.

Downcoding attacks. We introduce a new class of privacy attacks called downcoding attacks, which recover large fractions of the data hidden by QI-based deidentification. We prove that every minimal, hierarchical QI-based deidentification algorithm is vulnerable to downcoding attacks by an adversary who only gets the deidentified dataset. As such, any privacy offered by QI-based deidentification relies on distributional assumptions about the dataset. Our first attack uses a natural data distribution (i.e., clustered heteroskedastic data) and a tree-based hierarchy, and allows an attacker to completely recover a constant fraction of the deidentified records with high probability. Our second attack uses a less natural data distribution and hierarchy, and allows an attacker to recover 3/8ths of every record with 99% probability.

We convert the downcoding attacks into strong compound predicate singling-out (PSO) attacks against QI-based deidentification, greatly improving over prior work. (Non-compound) PSO attacks were recently introduced by Cohen and Nissim in the context of data anonymization under the General Data Protection Regulation (GDPR) [CN19].

∗Hariri Institute for Computing, Boston University; Boston University School of Law. We thank Kobbi Nissim for many helpful discussions throughout this project. This work was supported by the DARPA SIEVE program under Agreement No. HR00112020021 and the National Science Foundation under Grant Nos. CNS-1915763 and SaTC-1414119. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not reflect the views of our funders.

Why another attack against k-anonymity? Our attacks rebut the two main justifications for the continued use of QI-based deidentification, justifications that have gone unchallenged and that convince policymakers. Our downcoding attacks demonstrate that QI-based techniques may offer no meaningful protection even if every attribute is treated as a quasi-identifier. Our EdX attack demonstrates that deidentification by experts in accordance with strict privacy regulations does not prevent real-world attack. Furthermore, the attacks demonstrate that QI-based techniques may not achieve regulatory goals. The EdX data release enabled disclosure of FERPA-protected data. Minimal hierarchical mechanisms likely fail to anonymize under GDPR.

1 Introduction

Quasi-identifier-based (QI-based) deidentification is widely used in practice, especially in healthcare settings.1 k-anonymity and its refinements ℓ-diversity and t-closeness are the most well-known QI-based techniques (also called partition-based or syntactic techniques). k-anonymity aims to capture a sort of anonymity of a crowd [SS98]. A data release is k-anonymous if any individual row in the release cannot be distinguished from k − 1 other individuals in the release using certain attributes called quasi-identifiers.

We present three new attacks on QI-based deidentification techniques. We usually speak about k-anonymity specifically, but everything applies without modification to ℓ-diversity [MKGV07], t-closeness [LLV07], and most other QI-based techniques (see Appendix B).

1 https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html
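To make the definition concrete, the following minimal sketch (ours, not the paper's) checks whether a table is k-anonymous with respect to a chosen set of quasi-identifier columns, and also reports the fraction of rows uniquely distinguished by those columns, the kind of statistic behind the 34.2% EdX figure above. The toy records and column names are invented for illustration.

```python
from collections import Counter

def equivalence_classes(rows, qi_columns):
    """Group rows by their joint values on the quasi-identifier columns."""
    return Counter(tuple(row[c] for c in qi_columns) for row in rows)

def is_k_anonymous(rows, qi_columns, k):
    """k-anonymity: every equivalence class induced by the
    quasi-identifiers must contain at least k rows."""
    return min(equivalence_classes(rows, qi_columns).values()) >= k

def fraction_unique(rows, qi_columns):
    """Fraction of rows alone in their equivalence class,
    i.e., uniquely distinguished by the quasi-identifiers."""
    classes = equivalence_classes(rows, qi_columns)
    return sum(1 for n in classes.values() if n == 1) / len(rows)

# Invented toy records, not the EdX data.
rows = [
    {"zip": "91010", "year": 1990},
    {"zip": "91011", "year": 1990},
    {"zip": "20037", "year": 1985},
    {"zip": "20037", "year": 1985},
]
print(is_k_anonymous(rows, ["zip", "year"], 2))  # False: two singleton classes
print(fraction_unique(rows, ["zip", "year"]))    # 0.5
```

A QI-based mechanism coarsens values until a check like is_k_anonymous passes; the attacks in this paper concern what that coarsening still reveals.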
Motivation

Motivating our new attacks requires understanding the status quo of the debate on QI-based deidentification. At the heart of the debate is an unspoken, unexamined assumption that is fundamental to the whole QI-based program: if every attribute is quasi-identifying, then k-anonymity provides meaningful protection. Treating every attribute as quasi-identifying defines away one major critique: namely, that the ex ante categorization of attributes as quasi-identifying or not is untenable and reckless. Moreover, when all attributes are quasi-identifying, the distinctions among k-anonymity's refinements, and the attacks that motivated them, collapse (see Appendix B).

Defenders of QI-based deidentification unfairly dismiss attacks by academic researchers. This move is unconvincing to academic privacy researchers but very effective in policy spheres. First, they dismiss many attacked datasets as "improperly de-identified" [CEE14]. "Proper de-identification" must be done by a "statistical expert" and in accordance with procedures outlined in regulation [EEJAM11], the increasing availability of easy-to-use software for syntactic de-identification notwithstanding. Second, they brazenly dismiss attacks carried out by privacy researchers because they are privacy researchers. That these attacks are published in "research based articles within the highly specialized field of computer science" is used to argue that re-identification requires a "highly skilled 'expert'" and therefore is of little concern [otICCC14].

Summarizing, justifications for the use of QI-based deidentification techniques rest on two arguments that have hitherto gone unchallenged. First, k-anonymity provides meaningful protection when every attribute is a quasi-identifier. Second, no attacks have been shown against datasets de-identified by experts and in accordance with strict privacy regulations, let alone simple attacks.

Significance

This paper refutes both of the above arguments. Our attacks also demonstrate that QI-based techniques fail to satisfy three properties of a worthwhile measure of privacy of a computation (see Appendix A):

• Post-processing: Further processing of the output without access to the data should not diminish privacy.

• Composition: A combination of two or more private mechanisms should also be private (with worse parameters).

• No distributional assumptions: A general-purpose privacy guarantee should not rely on the data being nicely distributed.

Finally, our attacks demonstrate that QI-based techniques do not achieve regulatory goals. The EdX data release enables disclosure of data that the general counsel of Harvard considered FERPA-protected. Minimal hierarchical mechanisms likely fail to anonymize under GDPR, by way of the connection to predicate singling-out attacks [CN19, ACNW20].

Contributions

• We present a reidentification attack on the EdX dataset using LinkedIn, and analyze other avenues of disclosure for that dataset, finding that tens of thousands of students are potentially vulnerable.

• We define downcoding attacks and prove that minimal hierarchical k-anonymity enables such attacks. This is a new type of minimality-based attack that defeats previous defenses.

• We define compound predicate singling-out attacks, strengthening predicate singling-out attacks, which were introduced as a necessary condition for data anonymization under GDPR [CN19]. We prove that minimal hierarchical k-anonymity enables such attacks.

These are the first attacks that work even when every attribute is a quasi-identifier. As such, they apply to QI-based techniques beyond k-anonymity, and refute a foundational assumption of QI-based deidentification.

X:                        Y:                         Z:
ZIP    Income  COVID      ZIP    Income    COVID     ZIP    Income     COVID
91010  $125k   Yes        9101?  $75–150k  ?         91010  $125–150k  ?
91011  $105k   No         9101?  $75–150k  ?         9101?  $100–125k  ?
91012  $80k    No         9101?  $75–150k  ?         9101?  $75–150k   ?
20037  $50k    No         20037  $0–75k    ?         20037  $0–75k     No
20037  $20k    No         20037  $0–75k    ?         20037  $0–75k     ?
20037  $25k    Yes        20037  $0–75k    ?         20037  $25k       Yes

Figure 1: An example of downcoding. Y is a minimal hierarchical 3-anonymized version of X (treating every attribute as part of the quasi-identifier and leaving the generalization hierarchy implicit). Z is a downcoding of Y: it refines X and strictly refines Y.
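To make the refinement relation in Figure 1 concrete, here is a small sketch of ours, not the paper's formalism. It models each generalized value as the set of raw values it could stand for, so one record is at least as specific as another when, attribute by attribute, its value set is contained in the other's. The encodings (ZIP wildcards, incomes as $k intervals) are assumptions made for illustration; under them, Z is consistent with the raw records of X and strictly more specific than Y, which is the kind of hidden detail a downcoding attack recovers.

```python
def zip_set(z):
    """A ZIP like '9101?' stands for any digit in the last position."""
    return {z[:-1] + d for d in "0123456789"} if z.endswith("?") else {z}

def income_set(rng):
    """Income modeled as a (lo, hi) range in $k; a point value has lo == hi."""
    lo, hi = rng
    return set(range(lo, hi + 1))

def covid_set(v):
    return {"Yes", "No"} if v == "?" else {v}

def raw_values(rec):
    z, inc, cov = rec
    return (zip_set(z), income_set(inc), covid_set(cov))

def at_least_as_specific(r1, r2):
    """r1 is at least as specific as r2 iff, attribute by attribute, the
    raw values consistent with r1 are a subset of those consistent with r2."""
    return all(a <= b for a, b in zip(raw_values(r1), raw_values(r2)))

# The first three records of X, Y, and Z from Figure 1 (income in $k).
X = [("91010", (125, 125), "Yes"), ("91011", (105, 105), "No"), ("91012", (80, 80), "No")]
Y = [("9101?", (75, 150), "?")] * 3
Z = [("91010", (125, 150), "?"), ("9101?", (100, 125), "?"), ("9101?", (75, 150), "?")]

# Each raw record of X is consistent with the corresponding record of Z,
# Z is at least as specific as Y everywhere, and strictly so for its
# first two records.
assert all(at_least_as_specific(x, z) for x, z in zip(X, Z))
assert all(at_least_as_specific(z, y) for z, y in zip(Z, Y))
assert not all(at_least_as_specific(y, z) for y, z in zip(Y, Z))
```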
Organization

The remainder of Section 1 gives an introduction to downcoding attacks. Section 2 introduces notation. Section 3 defines k-anonymity and related concepts, along with hierarchical and minimal k-anonymity. Section 4 describes the EdX dataset and shows that it is vulnerable to reidentification. Section 5 defines downcoding attacks and proves that minimal hierarchical k-anonymous mechanisms enable them. Section 6 defines compound predicate singling-out attacks and proves that minimal hierarchical k-anonymous mechanisms enable them. The Appendix provides additional discussion.

Responsible disclosure

After reidentifying one EdX student, we reported the vulnerability to Harvard and MIT, who promptly replaced the dataset with a heavily redacted one. We were advised by our IRB