Data De-Identification, Pseudonymization, and Anonymization: Legal and Policy Considerations
Daniel Barth-Jones, Assistant Professor of Clinical Epidemiology, Mailman School of Public Health, Columbia University
Kate Godfrey, Vice President, Government Affairs, Chief Compliance & Privacy Officer, Adaptive Biotechnologies
Nancy L. Perkins, Counsel, Arnold & Porter
May 26, 2021

De-identification Concepts and Methodologies

HIPAA's Identification Spectrum

[Diagram: a spectrum running from least to most identifiable, with the permitted uses of each data form:]
• No Information (totally safe, but useless)
• De-Identified: useful for any purpose
• "Breach Safe" LDS: useful for breach avoidance
• Limited Data Set (LDS): research, public health, healthcare operations
• Fully Identified PHI: treatment, payment, healthcare operations

• Protected Health Information (PHI)
• Limited Data Set (LDS), §164.514(e): eliminate 16 direct identifiers (name, address, SSN, etc.)
• LDS without 5-digit Zip and date of birth (LDS "Breach Safe", 8/24/09 Fed. Reg.): eliminate the 16 direct identifiers plus Zip5 and DoB
• Safe Harbor De-identified Data Set (SHDDS), §164.514(b)(2): eliminate 18 identifiers (including geography smaller than 3-digit Zip, and all dates except the year)
• Expert Determination Data Set (EDDS), §164.514(b)(1): verified "very small" risk of re-identification

GDPR's Identification Spectrum

[Diagram: the GDPR's analogous spectrum:]
• Personal Data, Article 4(1)
• Pseudonymized: cannot be identified without additional information
• Controller cannot identify, Article 11: escapes Articles 15–20
• Anonymized, Recital 26: "no longer identifiable"; escapes the GDPR (similar to HIPAA's EDDS?)
[The HIPAA identification spectrum is repeated on this slide for comparison.]

← or in other words, "Science"

U.S. State-Specific Re-identification Risks: Population Uniqueness
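The quasi-identifier codings compared on these slides (DoB vs. YoB, Z5 vs. Z3) correspond to Safe Harbor's required generalizations: all dates reduced to the year, and geography no finer than a 3-digit ZIP. A minimal sketch of applying those two generalizations (a toy illustration, not the presenters' code; the record field names are hypothetical):

```python
from datetime import date

# Toy sketch of Safe Harbor's date and geography generalizations (not the
# presenters' code; field names are hypothetical). Safe Harbor additionally
# requires removing the other direct identifiers, and zeroing out any
# 3-digit ZIP prefix whose combined population is 20,000 or fewer.

def generalize_zip(zip5: str) -> str:
    """Z5 -> Z3: keep only the first three digits of the ZIP code."""
    return zip5[:3]

def generalize_dob(dob: date) -> int:
    """DoB -> YoB: reduce a full birth date to year of birth only."""
    return dob.year

record = {"zip5": "10032", "dob": date(1957, 3, 14)}
safe_harbor_view = {
    "zip3": generalize_zip(record["zip5"]),  # "100"
    "yob": generalize_dob(record["dob"]),    # 1957
}
```

The point of the chart that follows is that even this coarsened pair, {YoB, Z3}, still carries a small residual uniqueness risk.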
[Chart: population uniqueness risk per U.S. state (log scale, 1E-09 to 1), states ordered by population size (CA, TX, NY, FL, … , VT, DC, WY). Each line is a combined quasi-identifier: gender plus race (White, Black, Hispanic, Asian, Other) plus a date/geography coding. Legend: DoB = date of birth; MoB = birth month & year; YoB = year of birth; Z5 = 5-digit Zip Code; Z3 = 3-digit Zip Code. {YoB, Z3} and {YoB, Z3, Race} are Safe Harbor compliant; {DoB, Z3}, {MoB, Z3}, {YoB, Z5}, {MoB, Z5}, and {DoB, Z5} are not.† The HIPAA Safe Harbor risk estimate is 4/10,000. Data source: 2010 U.S. Decennial Census. Graph © DB-J 2013.]
† HIPAA Safe Harbor does not permit any dates more specific than the year, or geographic units smaller than 3-digit Zip Codes (Z3).

The Inconvenient Truth

"De-identification leads to information loss which may limit the usefulness of the resulting health information" (p. 8, HHS De-identification Guidance, Nov 26, 2012)

[Chart: disclosure protection (log scale) versus information quality. Complete protection means no information (totally safe, but useless) and leads to bad decisions / bad science; no protection yields optimal precision and lack of bias, but poor privacy protection. The ideal situation of perfect information and perfect protection is unfortunately not achievable due to mathematical constraints, so there is an inherent trade-off between data quality and privacy protection.]

Unfortunately, de-identification public policy has often been driven by largely anecdotal and limited evidence, and by re-identification demonstration attacks targeted at particularly vulnerable individuals, which fail to provide reliable evidence about real-world re-identification risks.

Misconceptions about HIPAA De-identified Data: "It doesn't work…"

• "easy, cheap, powerful re-identification" (Ohm, 2009, "Broken Promises of Privacy")
• *Pre-HIPAA re-identification risks: {Zip5,
Birth date, Gender} able to identify 87%?, 63%, 28%? of the US population (Sweeney, 2000; Golle, 2006; Sweeney, 2013)
• Reality: HIPAA-compliant de-identification provides important privacy protections.
• Safe Harbor re-identification risks have been estimated at 0.04% (4 in 10,000) (Sweeney, NCVHS Testimony, 2007).
• Reality: Under HIPAA de-identification requirements, re-identification is expensive and time-consuming to conduct, requires substantive computer/mathematical skills, is rarely successful, and is usually uncertain as to whether it has actually succeeded.

Misconceptions about HIPAA De-identified Data: "It works perfectly and permanently…"

• Reality:
o Perfect de-identification is not possible.
o De-identifying does not free data from all possible subsequent privacy concerns.
o Data is never permanently "de-identified"…
o There is no 100% guarantee that de-identified data will remain de-identified regardless of what you do with it after it is de-identified.

Re-identification Demonstration Attack Summary

• Publicized attacks are on data without HIPAA/SDL de-identification protection.
• Many attacks targeted especially vulnerable subgroups and did not use sampling to assure representative results.
• Press reporting often portrays re-identification as broadly achievable, when there isn't any reliable evidence supporting this portrayal.

Re-identification Demonstration Attack Summary

• For Ohm's famous "Broken Promises" attacks (Weld, AOL, Netflix), a total of n=4 people were re-identified out of 1.25 million.
• For attacks against HIPAA de-identified data (ONC, Heritage*), a total of n=2 people were re-identified out of 128 thousand.
• ONC attack quasi-identifiers: Zip3, YoB, gender, marital status, Hispanic ethnicity
• Heritage attack quasi-identifiers*: age, sex, days in hospital, physician specialty, place of service, CPT procedure codes, days since first claim, ICD-9 diagnoses (*not a complete list of the data available for an adversary attack)
• Both were "adversarial" attacks.
• For all the attacks listed, a total of n=268 people were re-identified out of 327 million opportunities.

Let's get some perspective on this…

[An all-black slide:] Obviously, this slide is BLACK. So clearly, de-identification doesn't work.

Precautionary Principle or Paralyzing Principle?

"When a re-identification attack has been brought to life, our assessment of the probability of it actually being implemented in the real world may subconsciously become 100%, which is highly distortive of the true risk/benefit calculus that we face." – DB-J

Re-identification Demonstration Attack Summary

• What can we conclude from the empirical evidence provided by these 11 highly influential re-identification attacks?
• The proportion of demonstrated re-identifications is extremely small.
o Which does not imply that data re-identification risks are necessarily very small (especially if the data has not been subject to Statistical Disclosure Limitation methods).
• But with only 268 re-identifications made out of 327 million opportunities, Ohm's "Broken Promises" assertion that "scientists have demonstrated they can often re-identify with astonishing ease" seems rather dubious.
• It also seems clear that the state of "re-identification science", and the "evidence" it has provided, need to be dramatically improved in order to better support good public policy regarding data de-identification.

[Two baseline charts: *Pre-HIPAA Risk Baseline and *HIPAA Safe Harbor Risk Baseline]

Re-identification Risks: Population Uniqueness
Starting Assumptions?
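The population-uniqueness statistic behind these risk charts can be illustrated with a simple group-by count: the share of people whose quasi-identifier combination occurs exactly once in the population. A minimal sketch on toy data (not the presenters' methodology or the Census analysis; records and field names are hypothetical):

```python
from collections import Counter

def uniqueness_rate(population, quasi_identifiers):
    """Fraction of records whose quasi-identifier combination is unique
    (i.e., its group size k equals 1)."""
    counts = Counter(tuple(rec[q] for q in quasi_identifiers)
                     for rec in population)
    unique = sum(1 for rec in population
                 if counts[tuple(rec[q] for q in quasi_identifiers)] == 1)
    return unique / len(population)

# Hypothetical mini-population keyed by {YoB, Z3, gender}:
people = [
    {"yob": 1980, "z3": "100", "sex": "F"},
    {"yob": 1980, "z3": "100", "sex": "F"},  # shares a combination: not unique
    {"yob": 1955, "z3": "100", "sex": "M"},  # unique
    {"yob": 1980, "z3": "021", "sex": "F"},  # unique
]
rate = uniqueness_rate(people, ["yob", "z3", "sex"])  # 2 of 4 unique -> 0.5
```

Coarsening the quasi-identifiers shrinks the rate: grouping the same toy data by Z3 alone leaves only one unique record, which is the effect the Safe Harbor generalizations aim for at population scale.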
[Box-and-whisker chart: state-specific population uniqueness risk on a log scale (0.0000001% to 100%), one box per combined quasi-identifier. Legend: DoB = date of birth; MoB = birth month & year; YoB = year of birth; Z5 = 5-digit Zip Code; Z3 = 3-digit Zip Code; race coded as White, Black, Hispanic, Asian, Other; gender also included as a quasi-identifier. The HIPAA Safe Harbor risk estimate is 4/10,000 (0.04%); combinations using DoB, MoB, or Z5 are not Safe Harbor compliant.† Data source: 2010 U.S. Decennial Census. Graph © DB-J 2013.]
† HIPAA Safe Harbor does not permit any dates more specific than the year, or geographic units smaller than 3-digit Zip Codes (Z3).

Anonymization Under the GDPR

How Do We Achieve Anonymization?

Under the GDPR, re-identification of a data subject (even by the company that anonymized the data) must be impossible.

That doesn't sound that hard… what's all the fuss about?

The challenge arises when considering whether the exposed data (what remains after anonymization) can be paired with other available information to construe the identity of data subjects.

Recital 26 states:
o "To determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly. To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments."

The plot thickens: the EDPB has published guidance stating that the test for whether a person is "identifiable" does not necessarily require that an actual subject be identified, only that it would be possible to identify the subject.

Is Anonymization Possible?
Experts Weigh In…

The IAPP has stated that true anonymization may be near impossible, especially considering the study highlighted by the NYT showing that it is possible to personally identify 87 percent of the U.S. population based on just three data points: five-digit ZIP code, gender, and date of birth.
o While these data points in a vacuum may be non-identifiable, storing them collectively is what makes it possible to identify an individual.

The UK Information Commissioner's Office provided a warning to companies:
o "In order to be truly anonymised under the GDPR, you must strip personal data of sufficient elements that mean the individual can no longer be identified. However, if you could at any point use any reasonably available means to re-identify the individuals to which the data refers, that data will not have been effectively anonymised but will have merely been pseudonymised."
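The ICO's anonymised-versus-pseudonymised distinction can be made concrete with a small sketch: replacing a direct identifier with a keyed-hash token is pseudonymization, not anonymization, because anyone holding the key (the "additional information") can regenerate the mapping and re-link the token to the person. The names, key, and email below are hypothetical; this is an illustrative sketch, not a recommended production scheme:

```python
import hashlib
import hmac

# Hypothetical secret held separately by the data controller; whoever has
# it can re-link every token, which is why this remains pseudonymization.
SECRET_KEY = b"held-separately-by-the-controller"

def pseudonymize(identifier: str) -> str:
    """Replace a direct identifier with a keyed-hash (HMAC-SHA256) token."""
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()

token = pseudonymize("jane.doe@example.com")
# The mapping is deterministic: the key holder can always recompute it,
# so under GDPR Recital 26 the tokenized data is still personal data.
assert token == pseudonymize("jane.doe@example.com")
```

Destroying the key weakens the link but, as the Recital 26 "all means reasonably likely to be used" test suggests, it does not by itself make the remaining attribute data anonymous if quasi-identifiers survive in the records.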