The Synthetic Data Paradigm for Using and Sharing Data
by Khaled El Emam and Richard Hoptroff
Synthetic data provides a privacy-protective mechanism to broadly use and share data for secondary purposes. Using and sharing data for secondary purposes can facilitate innovative big data initiatives and partnerships to develop novel analytics solutions. This Executive Update provides an overview of the use cases for synthetic data, how to generate synthetic data, and some legal considerations associated with synthetic data's use.

The Executive Update is a publication of Cutter Consortium's Data Analytics & Digital Technologies practice. ©2019 by Cutter Consortium. All rights reserved. Unauthorized reproduction in any form, including photocopying, downloading electronic copies, posting on the Internet, image scanning, and faxing, is against the law. Reprints make an excellent training tool. For information about reprints and/or back issues of Cutter Consortium publications, call +1 781 648 8700 or email [email protected]. ISSN: 2381-8816. DATA ANALYTICS & DIGITAL TECHNOLOGIES EXECUTIVE UPDATE | Vol. 19, No. 6

The Challenge

An ongoing challenge with big data and other secondary analytics initiatives is getting access to data. Secondary analytics are typically new and were unanticipated when the data was originally collected. These can be novel analyses to understand customer behavior, develop new products, and generate new revenue. The analysis can be performed internally within the organization that collected the data (use), or the data can be shared with external partners (e.g., academics, startups, or specialized consultancies).

Contemporary legal regimes allow the processing of personal client/patient information upon obtaining some form of consent or authorization for that purpose. In some jurisdictions the consent can be broad, allowing multiple uses of the data, even if these were unanticipated. But in other jurisdictions, such as in Europe under the General Data Protection Regulation (GDPR), the consent must be very specific. This means that any new uses of the data cannot be performed without reconsent (unless there is another legal basis).

Another legal basis for processing data for new, unanticipated purposes is to de-identify the data. De-identification renders the data nonpersonal and therefore effectively removes many constraints against performing novel analytics on it. De-identified data makes it difficult (i.e., a low probability) to assign identity to records. Many methods for de-identification have been developed by the statistical disclosure control community over the last few decades. In practice, we can turn to a risk-based approach along with recommendations from standards and regulatory bodies.

With a risk-based approach to de-identification, the data is transformed, and administrative and technical controls are put in place. The sidebar "Transformations to De-Identify Data" provides some typical ways that data can be transformed during de-identification. In addition to these transformations, there are also controls, which can mean, for example, appropriate access controls, encryption of devices with data, and data user agreements in which users commit to not re-identifying the data. This approach has worked well for many years.

Transformations to De-Identify Data

De-identification is used as a mechanism to ensure that the risk of re-identification of individuals in data is very small. This can entail adding technical and administrative controls, as well as transforming the data.
The following are typical data transformations that can be performed on structured data and documents during the de-identification process:

• Suppression — replacing a variable with some nonmeaningful text (e.g., "***" or a null value)

• Generalization — replacing the value of a variable with a coarser representation (e.g., replacing a date of birth with an age or an age range)

• Generalization and replacement — an extension of generalization in which the value is replaced with another value selected randomly from the generalization range (e.g., when generalizing a date of birth to a five-year interval, an age or even a date of birth selected with equal probability from within that range would replace the original value)

• Date offsetting — applied only to dates, to perturb their values by a fixed or random amount

• Noise injection — sampling random values from a predefined distribution and using them to perturb the values (e.g., adding Gaussian noise to an income value)

However, specific use cases exist where something beyond de-identification will be needed. These are situations in which data needs to be shared very broadly, where the overhead of signing agreements is high, as well as situations where the costs of implementing the necessary controls are high given the exploratory nature of the planned analysis and the need for rapid generation of nonpersonal data. In these situations, synthetic data can provide a solution. The sidebar "Use Cases for Synthetic Data" offers examples of where synthetic data can provide a good solution.

Use Cases for Synthetic Data

There are specific use cases for which synthetic data provides an ideal solution, including:

• Hackathons and data competitions/challenges. These require data sets that can be distributed widely with minimal demands on the entrants.

• Proof-of-concept and technology evaluations.
Oftentimes technology developers or technology acquirers need to quickly evaluate whether a new technology works well in practice; they need realistic data with which to work, with minimal constraints.

• Algorithm testing. One of the biggest challenges when developing artificial intelligence and machine learning algorithms is obtaining a sufficient number of data sets that are both large enough and sufficiently realistic on which to test the algorithms.

• Software testing. Testing data-driven applications requires realistic data for functional and performance testing. Random data cannot replicate what will happen when a system goes into production.

• Open data. Sharing complex data sets publicly is challenging because of privacy concerns. This can now be achieved by sharing synthetic data instead.

• Data exploration. Organizations that want to maximize the use of their data can make synthetic versions available for exploration and initial assessment by potential users; if the exploration yields positive results, users could then go through a process to obtain access to the de-identified data.

• Algorithm development. Data analysis programs can be developed on synthetic data and then submitted to the data custodian for execution on the real data; this brings the verified code to the data rather than sharing the data itself.

• Simple statistics. When the desired analytics require only a handful of variables, it is possible to use synthetic data as a proxy for real data and produce more or less the same results.

• Education and training. Synthetic data can be used for teaching practical courses on data analysis and for software training.

What Is Synthetic Data?

Synthetic data is fake data generated from real data. To generate a synthetic data set, we would take a real data set and model its distributions (e.g., shape and variance) and structure (e.g., correlations among the variables).
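A minimal numerical sketch of this idea: fit the distributions and correlation structure of the real data, then sample fake rows from the fitted model. This is a deliberately crude toy synthesizer, not the authors' method; it assumes purely numeric data and approximates the joint distribution with a multivariate normal, and all variable names are hypothetical.

```python
import numpy as np

def fit_synthesizer(real: np.ndarray):
    """Model the distributions (means, variances) and structure
    (correlations) of a numeric data set: here, simply its mean
    vector and covariance matrix."""
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)
    return mean, cov

def synthesize(mean, cov, n_rows, seed=0):
    """Sample new, fake observations from the fitted model; n_rows
    can be equal to, larger than, or smaller than the original."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov, size=n_rows)

# Toy "real" data: age and income with a positive correlation.
rng = np.random.default_rng(42)
age = rng.normal(45, 12, size=1000)
income = 800 * age + rng.normal(0, 5000, size=1000)
real = np.column_stack([age, income])

mean, cov = fit_synthesizer(real)
synthetic = synthesize(mean, cov, n_rows=1000)

# The synthetic data should roughly reproduce the statistical
# properties of the original, without copying any real record.
print(np.corrcoef(real, rowvar=False)[0, 1])       # correlation in real data
print(np.corrcoef(synthetic, rowvar=False)[0, 1])  # similar value in synthetic data
```

Real synthesizers are far more sophisticated (handling categorical variables, skewed distributions, and overfitting controls), but the two-step shape, fit a model and then sample from it, is the same.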
We then use that model to generate or synthesize the observations that make up the synthetic data set. This model is sometimes referred to as the "synthesizer." Because the values in the synthetic data are generated from the model (and typically randomly sampled), they should replicate the statistical properties of the original data set.

The model can be a regression model of some sort, a machine learning model, an artificial neural network, or a genetic optimization algorithm. Multiple techniques have been tried, and they vary in complexity and accuracy.

The synthetic data can have the same number of observations as the original data set. The synthesizer model can also be used to generate a much larger population than the original data set (e.g., to create data for performance testing of a software application) or to create a much smaller data set than the original (e.g., to synthesize a patient population with a rare disease).

If we model the structure of the original data very accurately, then we will end up effectively replicating the original data, which would have privacy implications. This is called "overfitting." As an example, if the data has a single male born in 1959 with prostate cancer and the model replicates that specific person, then the model is too specific in this instance. Therefore, the model should capture only partial structure.

There is a balancing act between how accurate the model needs to be and how close the synthetic data is to the original data. It is the classic tradeoff between data utility and data privacy. When done well, the synthetic data retains enough statistical properties of the original data and has a very low risk of identifying individuals in the data. This allows synthetic data to be shared more freely. Figure 1 illustrates these tradeoffs.
Figure 1 — The tradeoffs in the process of generating synthetic data.

Various national statistical agencies, such as the US Census Bureau, have created public data sets using synthetic data generation techniques. This approach has been gaining momentum over the last few years.

Managing Privacy Risks

When evaluating the privacy risks with de-identified data, there are two kinds of disclosures to manage: identity disclosure and attribute disclosure.

Identity Disclosure

Identity disclosure occurs when an adversary is able to correctly assign an identity to a record in a data set and, by doing so, learn something new about that individual. If an adversary assigns an identity to a record but does not learn anything new, then, arguably, that is a meaningless identity disclosure.
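One common way the statistical disclosure control community quantifies identity disclosure risk is through equivalence classes: group records by the quasi-identifiers an adversary could plausibly match on, and treat 1/(group size) as the probability of correctly assigning an identity to a record in that group. The sketch below is a simplified illustration of that idea, not a method prescribed in this Update; the records and field names are hypothetical.

```python
from collections import Counter

def identity_disclosure_risk(records, quasi_identifiers):
    """Estimate per-record re-identification risk as 1/k, where k is
    the number of records sharing the same quasi-identifier values
    (the record's equivalence class). Smaller classes mean higher risk."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    class_sizes = Counter(keys)
    return [1.0 / class_sizes[k] for k in keys]

# Hypothetical de-identified records; sex and year of birth are the
# quasi-identifiers an adversary might match against external data.
records = [
    {"sex": "M", "yob": 1959, "diagnosis": "prostate cancer"},
    {"sex": "F", "yob": 1980, "diagnosis": "asthma"},
    {"sex": "F", "yob": 1980, "diagnosis": "diabetes"},
    {"sex": "F", "yob": 1980, "diagnosis": "asthma"},
]

risks = identity_disclosure_risk(records, ["sex", "yob"])
print(risks)  # the unique male born in 1959 has risk 1.0; the others 1/3
```

This mirrors the overfitting example above: a record that is unique on its quasi-identifiers (the single male born in 1959) sits in an equivalence class of size one and carries maximal identity disclosure risk, which is exactly the situation de-identification and synthesis are meant to avoid.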