Data De-identification, Pseudonymization, and Anonymization: Legal and Policy Considerations

Daniel Barth-Jones, Assistant Professor of Clinical Epidemiology, Mailman School of Public Health, Columbia University

Kate Godfrey, Vice President, Government Affairs, & Chief Compliance Officer, Adaptive Biotechnologies

Nancy L. Perkins, Counsel, Arnold & Porter

May 26, 2021

De-identification Concepts and Methodologies

2 HIPAA’s Identification Spectrum

Spectrum: No Information (Totally Safe, But Useless) → De-Identified → LDS-"Breach Safe" → LDS → Fully Identified (PHI)

• Protected Health Information (PHI): Permitted Uses are Treatment, Payment, Healthcare Operations
• Limited Data Set (LDS) §164.514(e): Permitted Uses are Research, Public Health, Operations
  • Eliminate 16 Direct Identifiers (Name, Address, SSN, etc.)
• LDS w/o 5-digit Zip & Date of Birth (LDS-"Breach Safe"), 8/24/09 FedReg: Breach Avoidance
  • Eliminate 16 Direct Identifiers and Zip5, DoB
• Safe Harbor De-identified Data Set (SHDDS) §164.514(b)(2): Useful for Any Purpose
  • Eliminate 18 Identifiers (including Geo < 3-digit Zip, All Dates except Year)
• Expert Determination Data Set (EDDS) §164.514(b)(1): Useful for Any Purpose
  • Verified "very small" Risk of Re-identification
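To make the Safe Harbor row concrete, here is a minimal Python sketch (not from the slides) of the date and geography generalizations Safe Harbor requires; the field names are hypothetical, and this is an illustration rather than a complete implementation:

```python
from datetime import date

def safe_harbor_generalize(record: dict) -> dict:
    """Illustrative Safe Harbor-style generalization (hypothetical fields)."""
    out = dict(record)
    # Direct identifiers (name, SSN, street address, etc.) must be removed entirely.
    for field in ("name", "ssn", "street_address"):
        out.pop(field, None)
    # No date element more specific than the year is permitted.
    dob: date = out.pop("date_of_birth")
    out["year_of_birth"] = dob.year
    # Ages 90 and over must be aggregated into a single category
    # (age computed approximately here, ignoring birth month/day).
    age = date.today().year - dob.year
    out["age"] = "90+" if age >= 90 else age
    # No geographic unit smaller than a 3-digit ZIP; per HHS guidance,
    # low-population ZIP3 areas (<= 20,000 residents) must additionally be
    # set to "000" (that published list is omitted in this sketch).
    out["zip3"] = out.pop("zip5")[:3]
    return out

print(safe_harbor_generalize({
    "name": "Jane Doe", "ssn": "000-00-0000", "street_address": "1 Main St",
    "date_of_birth": date(1950, 7, 4), "zip5": "10027", "diagnosis": "E11.9",
}))
```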

3 GDPR's Identification Spectrum (Article 3, Article 4(1), Article 11, Recital 26)

• Identified (Article 4(1))
• Controller Can't Identify (Article 11): escapes Articles 15–20
• Pseudonymized: can't be identified without additional information
• Anonymized (Recital 26): "no longer identifiable"; escapes the GDPR; similar to HIPAA's EDDS?

4 HIPAA's Identification Spectrum (repeated)

• Spectrum: No Information (Totally Safe, But Useless) → De-Identified → LDS-"Breach Safe" → LDS → Fully Identified
• Permitted Uses: Treatment, Payment, Healthcare Operations; Research, Public Health, Operations <- or in other words, "Science"; Breach Avoidance; Useful for Any Purpose

5 U.S. State-Specific Re-identification Risks: Population Uniqueness

[Figure: combined quasi-identifier risk of population uniqueness for each U.S. state, ordered by population size (CA, TX, NY, FL, … ND, VT, DC, WY), plotted on a log scale from 1 down to 1E-09. Quasi-identifier legend: DoB = Date of Birth; MoB = Birth Month & Year; YoB = Year of Birth; Z5 = 5-digit Zip Code; Z3 = 3-digit Zip Code; Race coded as White, Black, Hispanic, Asian, Other; Gender also included as a quasi-identifier. Plotted combinations include {DoB, Z5}, {MoB, Z5}, {YoB, Z5}, {DoB, Z3}, {MoB, Z3} (not Safe Harbor compliant†) and {YoB, Z3}, {YoB, Z3, Race} (Safe Harbor). *HIPAA Safe Harbor risk estimate: 4/10,000. †HIPAA Safe Harbor does not permit any dates more specific than the year, or geographic units smaller than 3-digit Zip Codes (Z3). Data source: 2010 U.S. Decennial Census. Graph © DB-J 2013.]

6 The Inconvenient Truth

"De-identification leads to information loss which may limit the usefulness of the resulting health information" (p. 8, HHS De-ID Guidance, Nov 26, 2012)

[Figure: the trade-off between information quality and privacy protection, plotted on a log scale, with disclosure protection running from No Protection to Complete Protection and information quality from No Information to Perfect Information (Optimal Precision, Lack of Bias). The ideal situation, perfect information with perfect protection, is unfortunately not achievable due to mathematical constraints; too little protection yields poor disclosure protection, while too much information loss yields bad decisions / bad science.]

7–8 Unfortunately, de-identification public policy has often been driven by largely anecdotal and limited evidence, and by re-identification demonstration attacks targeted at particularly vulnerable individuals, which fail to provide reliable evidence about real-world re-identification risks.

9 Misconceptions about HIPAA De-identified Data: "It doesn't work…"

• "easy, cheap, powerful re-identification" (Ohm, 2009, "Broken Promises of Privacy")
• *Pre-HIPAA re-identification risks: {Zip5, Birth date, Gender} reportedly able to identify 87%?, 63%, 28%? of the U.S. population (Sweeney, 2000; Golle, 2006; Sweeney, 2013)

• Reality: HIPAA-compliant de-identification provides important privacy protections
  • Safe Harbor re-identification risks have been estimated at 0.04% (4 in 10,000) (Sweeney, NCVHS Testimony, 2007)
• Reality: Under HIPAA de-identification requirements, re-identification is expensive and time-consuming to conduct, requires substantive computer/mathematical skills, is rarely successful, and is usually uncertain as to whether it has actually succeeded

10 Misconceptions about HIPAA De-identified Data: "It works perfectly and permanently…"

• Reality:
  • Perfect de-identification is not possible.
  • De-identifying does not free data from all possible subsequent privacy concerns.
  • Data is never permanently "de-identified"…
  • There is no 100% guarantee that de-identified data will remain de-identified regardless of what you do with it after it is de-identified.

11 Re-identification Demonstration Attack Summary

• Publicized attacks have been on data without HIPAA/SDL de-identification protection.
• Many attacks targeted especially vulnerable subgroups and did not use sampling to assure representative results.
• Press reporting often portrays re-identification as broadly achievable, although there is no reliable evidence supporting this portrayal.

12 Re-identification Demonstration Attack Summary

• In Ohm's famous "Broken Promises" attacks (Weld, AOL, Netflix), a total of n=4 people were re-identified out of 1.25 million.
• In attacks against HIPAA de-identified data (ONC, Heritage*), a total of n=2 people were re-identified out of 128 thousand.
  • ONC attack quasi-identifiers: Zip3, YoB, Gender, Marital Status, Hispanic Ethnicity
  • Heritage attack quasi-identifiers*: Age, Sex, Days in Hospital, Physician Specialty, Place of Service, CPT Procedure Codes, Days Since First Claim, ICD-9 Diagnoses (*not a complete list of the data available for adversary attack)
  • Both were "adversarial" attacks.
• Across all attacks listed, a total of n=268 people were re-identified out of 327 million opportunities. Let's get some perspective on this…
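As a back-of-the-envelope check on that perspective, using the figures above: 268 / 327,000,000 ≈ 8.2 x 10^-7, i.e., roughly one demonstrated re-identification per 1.2 million opportunities, or about 0.00008%.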

13 Obviously, This Slide is BLACK.

So clearly, De-identification Doesn't Work.

14 Precautionary Principle or Paralyzing Principle?

“When a re-identification attack has been brought to life, our assessment of the probability of it actually being implemented in the real-world may subconsciously become 100%, which is highly distortive of the true risk/benefit calculus that we face.” – DB-J

15 Re-identification Demonstration Attack Summary

• What can we conclude from the empirical evidence provided by these 11 highly influential re-identification attacks?
  • The proportion of demonstrated re-identifications is extremely small.
  • This does not imply that data re-identification risks are necessarily very small (especially if the data has not been subject to Statistical Disclosure Limitation methods).
  • But with only 268 re-identifications made out of 327 million opportunities, Ohm's "Broken Promises" assertion that "scientists have demonstrated they can often re-identify with astonishing ease" seems rather dubious.
  • It also seems clear that the state of "re-identification science", and the "evidence" it has provided, needs to be dramatically improved in order to better support good public policy regarding data de-identification.

16–17 Risk Baselines

[Figures: the *Pre-HIPAA Risk Baseline (slide 16) and the *HIPAA Safe Harbor Risk Baseline (slide 17).]

18 Re-identification Risks: Population Uniqueness: Starting Assumptions?

[Figure: state-specific box/whisker plots of combined quasi-identifier population-uniqueness risk, on a log scale from 100% down to 0.0000001% (1 to 1E-09). Quasi-identifier legend as on slide 5: DoB = Date of Birth; MoB = Birth Month & Year; YoB = Year of Birth; Z5 = 5-digit Zip Code; Z3 = 3-digit Zip Code; Race coded as White, Black, Hispanic, Asian, Other; Gender also included as a quasi-identifier. *HIPAA Safe Harbor risk estimate: 4/10,000; combinations above that line are not Safe Harbor compliant†. †HIPAA Safe Harbor does not permit any dates more specific than the year, or geographic units smaller than 3-digit Zip Codes (Z3). Data source: 2010 U.S. Decennial Census. Graph © DB-J 2013.]

19 Anonymization Under the GDPR

20 How Do We Achieve Anonymization?

 Under GDPR, re-identification of a data subject (even by the company that anonymized the data) must be impossible. That doesn't sound that hard…what's all the fuss about?
 The challenge arises when considering whether the exposed data (what remains after anonymization) can be paired with other available information to reconstruct the identity of data subjects
 Recital 26 states:
o "To determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly. To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments."
 The plot thickens: the EDPB has published guidance stating that the test for whether a person is "identifiable" does not necessarily mean that an actual subject is identified, just that it would be possible to identify the subject

21 Is Anonymization Possible? Experts Weigh In…

 The IAPP has stated that true anonymization may be near impossible, especially considering the study highlighted by the NYT showing that it is possible to personally identify 87 percent of the U.S. population based on just three data points: five-digit ZIP code, gender, and date of birth
o And while these data points in a vacuum may be non-identifiable, storing them collectively is what makes it possible to identify an individual
 The UK Information Commissioner's Office provided a warning to companies:
o "In order to be truly anonymised under the GDPR, you must strip personal data of sufficient elements that mean the individual can no longer be identified. However, if you could at any point use any reasonably available means to re-identify the individuals to which the data refers, that data will not have been effectively anonymised but will have merely been pseudonymised. This means that despite your attempt at anonymisation you will continue to be processing personal data. You should also note that when you do anonymise personal data, you are still processing the data at that point."
 Ireland's Data Protection Commission (yes…the same DPC we all know from Schrems) stated:
o "There is a lot of research currently underway in the area of anonymisation, and knowledge about the effectiveness of various anonymisation techniques is constantly changing. It is therefore impossible to say that a particular technique will be 100% effective in protecting the identity of data subjects."

22 What Makes Anonymization So Hard?

The many different options for re-identification!

1. Singling Out: occurs where it is possible to distinguish the data relating to one individual from all other information in a dataset
a) This may be because information relating to one individual has a unique value, such as in a dataset that records the height of individuals (e.g., where only one person is 190cm tall, that individual is singled out)
b) It might also occur if different data relating to the same individual are connected in the dataset and one individual has a unique combination of values (e.g., there might be only one individual in a dataset who is 160cm tall and was born in 1990, even though there are many others who share either the height or the year of birth)
2. Data Linking: any linking of identifiers in a dataset will make it more likely that an individual is identifiable
a) The more identifiers that are linked together in a dataset, the more likely it is that the person to whom they relate will be identified or identifiable (e.g., taken individually, the first and second names "John" and "Smith" might not be capable of distinguishing one of a large company's customers from all other customers, but if the two pieces of information are linked, it is far more likely that "John Smith" will refer to a unique, identifiable individual)
3. Inference: it may be possible to infer a link between two pieces of information in a set of data, even though the information is not expressly linked
a) Although the data may not point directly to certain pieces of information in a dataset, an inference might be drawn between two pieces of information, allowing some individuals to be identified (e.g., if a dataset contains statistics regarding the seniority and pay of the employees of a company, even though the data does not point directly to the salaries of individuals, an inference may be drawn allowing some individuals to be identified)
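To make "singling out" concrete, here is a minimal sketch (Python with pandas; the columns reuse the hypothetical height/year-of-birth quasi-identifiers from the examples above) that counts how many records share each quasi-identifier combination; a count of 1 means the record is singled out:

```python
import pandas as pd

# Illustrative dataset: height and year of birth as quasi-identifiers.
df = pd.DataFrame({
    "height_cm": [160, 175, 175, 190, 160],
    "year_of_birth": [1990, 1985, 1985, 1972, 1984],
})

quasi_identifiers = ["height_cm", "year_of_birth"]

# k = how many records share each record's quasi-identifier combination
# (the record's "k-anonymity"); k == 1 means the record is singled out.
k = df.groupby(quasi_identifiers)["height_cm"].transform("size")

# The unique 190cm individual is singled out, and so is the one person who
# is 160cm AND born in 1990, even though others share either value alone.
print(df[k == 1])
```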

23 Different Techniques for Anonymization

1. Randomization: involves the alteration of the data, in order to cut the link between the individual and the data, without losing the value in the data
a) Randomization may include the addition of "noise" (random small changes) to data, to limit the ability of an intruder to connect the data to an individual (e.g., in a database that records the height of individuals, small increases or decreases could be made to the height of each data subject, and the data can be stated to be accurate only within the range of the additions and subtractions)
b) "Permutation" is another type of randomization technique. This involves swapping certain data between the records of individuals, making it more difficult to identify individuals by linking together different information relating to them (e.g., in the case of the height of individuals, instead of adding random noise to the data, the height values for different individuals are moved around, so that each value is no longer connected to the other information about that individual)
2. Generalization: involves reducing the granularity of data, so that only less precise data is disclosed
a) This means that it will be less likely that individuals can be singled out, as more people are likely to share the same values (e.g., a database containing the ages of data subjects might be adjusted so that only the band of ages an individual falls within is recorded (e.g., 18-25; 26-35; 36-45; etc.))
3. Masking: involves removing obvious or direct personal identifiers from data
a) Masking alone often leaves a very high risk of identification, and so will not normally be considered anonymisation in itself. This is because such a technique allows all of the original unmasked data to be seen, leaving it at risk of data-matching techniques being used to reveal the identity of data subjects.
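Here is a minimal Python sketch of two of these techniques, noise addition (randomization) and age banding (generalization); the noise scale and the band boundaries are illustrative choices, not values prescribed by any regulator:

```python
import random

def add_noise(height_cm: float, scale: float = 2.0) -> float:
    # Randomization: perturb each height by a small random amount, so the
    # released value is only accurate to within +/- scale centimeters.
    return round(height_cm + random.uniform(-scale, scale), 1)

def generalize_age(age: int) -> str:
    # Generalization: replace an exact age with a band, so that more data
    # subjects share the same released value.
    for lo, hi in [(18, 25), (26, 35), (36, 45), (46, 55), (56, 65)]:
        if lo <= age <= hi:
            return f"{lo}-{hi}"
    return "66+" if age > 65 else "under 18"

print(add_noise(172.0))    # e.g., 171.3
print(generalize_age(29))  # "26-35"
```

Permutation, by contrast, would shuffle the height values across records, preserving the overall distribution while breaking the link between each value and the rest of that individual's record.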

24 Genetic Data…Is it Possible to Safeguard?

 The Agencia Española de Protección de Datos (AEPD, Spain's data protection authority) recommends the hash technique for anonymization in data processing activities involving genetic data
o A hash function is a process that transforms any dataset, regardless of the size of the input data, into a character series of fixed length
o Randomization is often combined with the hash function process to reduce the likelihood of re-identification (a sketch follows below)
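Here is a minimal sketch of that combination as a keyed (salted) hash, in Python; this is a generic illustration of the idea, not a procedure prescribed by the AEPD, and the sample identifier is hypothetical:

```python
import hashlib
import hmac
import secrets

# A secret random key held separately from the data; this is the
# "randomization" element. Without it, an attacker could re-identify
# low-entropy inputs by hashing every plausible candidate and comparing.
SECRET_KEY = secrets.token_bytes(32)

def pseudonymize(identifier: str) -> str:
    # HMAC-SHA256 maps input of any size to a fixed-length string
    # (64 hex characters), as the hash-function definition above describes.
    return hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

print(pseudonymize("sample-donor-0001"))  # hypothetical sample identifier
```

Note that under the GDPR, keyed hashing of this kind is generally pseudonymization rather than anonymization, since whoever holds the key can re-link the data.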

 Art. 40 Codes of Conduct Approach
o The BBMRI-ERIC GDPR Code of Conduct for health research (http://code-of-conduct-for-health-research.eu)
o Guidelines being developed by Alliance Against Cancer (https://www.alleanzacontroilcancro.it/en/commissione-acc-gdpr/) for personal data processing in the context of the medical and research activities of Italian research hospitals (IRCCS)
o Guidelines for epigenetic data in the case of rare diseases, being developed by an international working group addressing ethical issues relating to human Locus Specific Variation Databases (LSDBs)

 What do YOU think? What do you do at your company?

25 US State Law Developments and Policy Questions

26 What Does US State Law Say About De-identification?

 How is “personal information” or “personal data” defined?

 What is considered “de-identified” or “anonymized” information?

 What treatment is required for de-identified information?

 What is desirable, necessary, helpful?

27 California Consumer Privacy Act (CCPA)

"Personal Information"
• "Information that identifies, relates to, describes, is reasonably capable of being associated with, or could reasonably be linked, directly or indirectly, with a particular consumer or household," and
• inferences drawn from any such identifiable information … to create a profile about a consumer

Examples (not exclusive):
• Identifiers: names, addresses, email addresses, IP addresses, geolocation data, characteristics of protected classes (race, religion, etc.), Internet or other electronic network activity, such as browsing history

• BUT: These identifiers are "personal information" ONLY if they "are reasonably capable of being associated with, or could reasonably be linked" to, a particular consumer or household. Does it matter who could reasonably make that linkage?

California Privacy Rights Act of 2020 (CPRA)

“Personal Information” Same definition as in CCPA

“Household” “A group, however identified, of consumers who cohabitate with one another at the same residential address and share use of common device(s) or service(s).”

“Device” “Any physical object that is capable of connecting to the Internet, directly or indirectly, or to another device.”

• So, for a "household," is knowing that a device is shared by persons with a CA address enough to make all actions of that device "personal information"?

Virginia Consumer Data Protection Act (VA CDPA)

“Personal Data” “Information that is linked or reasonably linkable to an identified or identifiable natural person.”

“Identified or identifiable natural person” “A person who can be readily identified, directly or indirectly.”

• Is a person “identified” simply by being distinguishable from other individuals?

• How little effort is involved in being able to “readily” identify an individual?

• What if some effort would be required and no effort is actually made?

Proposed New York Data Accountability and Transparency Act (NY DATA)

“‘Personal Information’ shall mean data relating to an identified or identifiable natural person provided further that: personal information shall include but is not limited to” (among other things): • Name, date of birth, postal address, Social Security number • Marital status, sexual orientation, physical characteristic or description • Employment, leases, assets, income • Political information • Web browsing history, interaction with an advertisement • Historical or real-time geolocation data • An inference drawn from other information to create a profile about an individual

• So is marital status, or income, or being tall, or being a Democrat, itself "personal information"?

De-identified and Anonymous Information

 Does a reference to “de-identified” information mean that the information was derived from identifiable information?

 Is “anonymized” information a broader or narrower category of data?

 Are there beneficial policy reasons to discourage linking anonymous information to an individual and thereby creating personal information?

 Is the answer different if the information was “de-identified” by stripping identifiable information of personal identifiers?

 Are there beneficial policy reasons not to discourage such linkage and instead to impose strict limits on use and disclosure of the resulting personal information?

32 CCPA: Deidentified Information

• "Deidentified" Information: Information that cannot reasonably identify, relate to, describe, be capable of being associated with, or be linked, directly or indirectly, to a particular consumer, provided that a business that uses the information:
  • Has technical safeguards that prohibit reidentification of the consumer to whom the information may pertain;
  • Has business processes that specifically prohibit reidentification of the information;
  • Has implemented business processes to prevent inadvertent release of deidentified information; and
  • Makes no attempt to reidentify the information
• Is information "capable" of being linked to a consumer if someone could do that?
• Do the references to "reidentification" tell us that "deidentified" information is exclusively information derived from personal information?

CPRA: Deidentified Information

• "Deidentified" Information: Information that cannot reasonably be used to infer information about, or otherwise be linked to, a particular consumer, provided that a business that uses the information:
  • Takes reasonable measures to ensure that the information cannot be associated with a consumer or household;
  • Publicly commits to maintain and use the information in deidentified form and not to attempt to reidentify the information, except that the business may attempt to reidentify the information solely for the purpose of determining whether its de-identification processes satisfy the requirements of [the CPRA]; and
  • Contractually obligates any recipients of the information to comply with [the CPRA].
• Does this mean that if a real estate company, for example, has information about the details of vacant apartments in a new development, it must take all these steps with respect to that information?

VA CDPA: De-identified Data

Much like the CPRA:
• "De-identified data": Data that cannot reasonably be linked to an identified or identifiable natural person, or a device linked to such person.
• A controller that processes "de-identified data" must:
  • Take reasonable measures to ensure that the data cannot be associated with a natural person;
  • Publicly commit to maintaining and using de-identified data without attempting to re-identify the data; and
  • Contractually obligate any recipients of the de-identified data to comply with all provisions of [the VA CDPA].

NY DATA Bill: De-identified Data

• Data that cannot be linked to a known natural person without additional information not available to the covered entity (i.e., pseudonymized data); or
• Data that:
  • Has been modified to a degree that the risk of re-identification is small, as determined by a person with appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods for de-identifying data;
  • Is subject to a public commitment by the controller not to attempt to re-identify the data; and
  • To which one or more enforceable controls to prevent re-identification have been applied (whether legal, administrative, technical, or contractual).
• So here, not all de-identified data is subject to "re-identification" controls

Consumer Requests: CCPA & CPRA

• The CCPA . . . shall not be construed to require a business to:
  • . . . reidentify or otherwise link information that, in the ordinary course of business, is not maintained in a manner that would be considered personal information.
• The CPRA . . . shall not be construed to require a business, service provider, or contractor to:
  • Reidentify or otherwise link information that, in the ordinary course of business, is not maintained in a manner that would be considered personal information; . . .
  • Maintain information in identifiable, linkable or associable form, or collect, obtain, retain, or access any data or technology, in order to be capable of linking or associating a verifiable consumer request with personal information.

Consumer Requests: VA CDPA

• Nothing in [the VA CDPA] shall be construed to require a controller or processor to:

A. (i) re-identify de-identified data or pseudonymous data, or (ii) maintain data in identifiable form, or collect, obtain, retain, or access any data or technology, in order to be capable of associating an authenticated consumer request with personal data; or

B. Comply with an anticipated consumer rights request, if the controller:

• is not reasonably capable of associating, or it would be unreasonably burdensome for the controller to associate, the request with the personal data;

• does not use the personal data to recognize or respond to the specific consumer who is the subject of the personal data, or associate the personal data with other personal data about the same specific consumer; and

• does not sell the personal data to any third party or otherwise voluntarily disclose the personal data to any third party other than a processor, except as otherwise permitted [under the VA CDPA].

• Under what circumstances would a controller maintain "personal data" but not be reasonably capable of associating a verifiable consumer request with that personal data?

Speaker Contact Information

Daniel Barth-Jones, Columbia University, [email protected]
Nancy L. Perkins, Arnold & Porter, [email protected]

Kate Godfrey, Adaptive Biotechnologies, [email protected]
