
The Synthetic Paradigm for Using and Sharing Data by Khaled El Emam and Richard Hoptroff

Synthetic data provides a privacy-protective mechanism to broadly use and share data for secondary purposes. Using and sharing data for secondary purposes can facilitate innovative big data initiatives and partnerships to develop novel analytics solutions. This Executive Update provides an overview of the use cases for synthetic data, how to generate it, and some legal considerations associated with its use.

The Challenge

An ongoing challenge with big data and other secondary analytics initiatives is getting access to data. Secondary analytics are typically new and were unanticipated when the data was originally collected. These can be novel analyses to understand customer behavior, develop new products, and generate new revenue. The analysis can be performed internally within the organization that collected the data (use) or can be shared with external partners (e.g., academics, startups, or specialized consultancies).

Contemporary legal regimes allow the processing of personal client/patient information upon obtaining some form of consent or authorization for that purpose. In some jurisdictions the consent can be broad, allowing multiple uses of the data, even if these were unanticipated. But in other jurisdictions, such as in Europe under the General Data Protection Regulation (GDPR), the consent must be very specific. This means that any new uses of the data cannot be performed without reconsent (unless there is another legal basis).

Another legal basis to process data for new, unanticipated purposes is to de-identify the data. De-identification renders the data nonpersonal and therefore effectively removes many constraints against performing novel analytics on it. De-identified data makes it difficult (i.e., a low probability) to assign identity to records. Many methods for de-identification have been developed by the statistical disclosure control community over the last few decades. In practice, we can turn to a risk-based approach along with recommendations from standards and regulatory bodies.

With a risk-based approach to de-identification, the data is transformed, and administrative and technical controls are put in place. The sidebar “Transformations to De-Identify Data” provides some typical ways that data can be transformed during de-identification. In addition to these transformations, there are also controls, which can include, for example, appropriate access controls, encryption of devices with data, and data user agreements in which users commit to not re-identifying the data. This approach has worked well for many years.

Transformations to De-Identify Data

De-identification is used as a mechanism to ensure that the risk of re-identification of individuals in data is very small. This can entail adding technical and administrative controls, as well as transform- ing the data. The following are typical data transformations that can be performed on structured data and documents during the de-identification process:

• Suppression — replacing a variable with some nonmeaningful text (e.g., “***” or a null value)

• Generalization — replacing the value of a variable with a coarser representation (e.g., replacing a date of birth with an age or an age range)

• Generalization and replacement — an extension of generalization by replacing the value with another value selected randomly from a generalization range (e.g., when generalizing a date of birth to a five-year interval, an age or even a date of birth selected with equal probability from within that range would be selected to replace the original value)

• Date offsetting — applied only to dates to perturb their values by a fixed or random amount

• Noise injection — using random values from a predefined distribution to perturb the values (e.g., adding Gaussian noise to an income value)
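To make these transformations concrete, the minimal Python sketch below applies each of them to a small, hypothetical data set using pandas; the column names, date offsets, and noise levels are illustrative assumptions, not recommended settings.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical personal data set with a direct identifier and two quasi-identifiers.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "birth_date": pd.to_datetime(["1959-03-02", "1987-11-20", "1972-06-15"]),
    "income": [120_000, 65_000, 88_000],
})

# Suppression: replace a direct identifier with nonmeaningful text.
df["name"] = "***"

# Generalization: replace date of birth with a five-year age band.
age_years = (pd.Timestamp("2019-04-02") - df["birth_date"]).dt.days // 365
df["age_band"] = (age_years // 5) * 5        # e.g., age 60 falls in the 60-64 band

# Date offsetting: shift each date by a random amount within +/- 30 days.
offsets = pd.Series(pd.to_timedelta(rng.integers(-30, 31, size=len(df)), unit="D"))
df["birth_date"] = df["birth_date"] + offsets

# Noise injection: add Gaussian noise to income.
df["income"] = df["income"] + rng.normal(0, 5_000, size=len(df))

print(df)
```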

However, specific use cases exist where something beyond de-identification will be needed. These are situations in which data needs to be shared very broadly, where the overhead of signing agreements is high, as well as in situations where the costs of implementing the necessary controls are high given the exploratory nature of the planned analysis and the need for rapid generation of nonpersonal data. In these situations, synthetic data can provide a solution. The sidebar “Use Cases for Synthetic Data” offers examples of where synthetic data can provide a good solution.


Use Cases for Synthetic Data

There are specific use cases for which synthetic data provides an ideal solution, including:

• Hackathons and data competitions/challenges. These require data sets that can be distributed widely with minimal demands on the entrants.

• Proof-of-concept and technology evaluations. Oftentimes technology developers or technology acquirers need to quickly evaluate whether a new technology works well in practice; they need realistic data with which to work, with minimal constraints.

• AI and machine learning model testing. One of the biggest challenges when developing artificial intelligence and machine learning models is getting a sufficient number of data sets that are both large enough and sufficiently realistic on which to test the algorithms.

• Software testing. Testing data-driven applications requires realistic data for functional and performance testing. Random data cannot replicate what will happen when a system goes into production.

• Open data. Sharing complex data sets publicly is challenging because of privacy concerns. This can now be achieved by sharing synthetic data instead.

• Data exploration. Organizations that want to maximize the use of their data can make synthetic versions available for exploration and initial assessment by potential users; if the exploration yields positive results, users could then go through a process to obtain access to the de-identified data.

• Algorithm development. Data analysis programs can be developed on synthetic data and then submitted to the data custodian for execution on the real data; this brings the verified code to the data rather than sharing the data itself.

• Simple statistics. When the desired analytics require only a handful of variables, it is possible to use synthetic data as a proxy for real data and to produce more or less the same results.

• Education and training. Synthetic data can be used for teaching practical courses on data analysis and for software training.

What Is Synthetic Data?

Synthetic data is fake data generated from real data. To generate a synthetic data set, we would take a real data set and model its distributions (e.g., shape and spread) and structure (e.g., correlations among the variables). We then use that model to generate or synthesize the observations that make up the synthetic data set. This model is sometimes referred to as the “synthesizer model.” Because the values in the synthetic data are generated from the model (and typically randomly sampled), they should replicate the statistical properties of the original data set. The model can be a regression model of some sort, a machine learning model, an artificial neural network, or a genetic optimization algorithm. Multiple techniques have been tried, and they vary in complexity and accuracy. The synthetic data can have the same number of observations as the original data set. The synthesizer model can also be used to generate a much larger population than the original data set (e.g., to create data for performance testing of a software application) or to create a much smaller data set than the original (e.g., to synthesize a patient population with a rare disease).
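As a rough illustration of the synthesize-from-a-model idea (not a production technique), the sketch below fits a deliberately simple Gaussian synthesizer to a hypothetical numeric data set and then samples a larger synthetic data set from it. Real synthesizers use richer models such as copulas, decision trees, or neural networks; this one captures only means, variances, and linear correlations.

```python
import numpy as np
import pandas as pd

def fit_synthesizer(real: pd.DataFrame):
    """Estimate the parameters of a simple Gaussian synthesizer model."""
    return real.mean().to_numpy(), real.cov().to_numpy()

def synthesize(mean, cov, n_rows: int, columns, seed: int = 0) -> pd.DataFrame:
    """Randomly sample n_rows synthetic observations from the fitted model."""
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(mean, cov, size=n_rows)
    return pd.DataFrame(samples, columns=columns)

# Hypothetical real data set: age and income for 1,000 people.
rng = np.random.default_rng(1)
age = rng.normal(45, 12, 1_000)
real = pd.DataFrame({
    "age": age,
    "income": 30_000 + 1_500 * age + rng.normal(0, 10_000, 1_000),
})

mean, cov = fit_synthesizer(real)
# The synthetic data set can be larger (or smaller) than the original.
synthetic = synthesize(mean, cov, n_rows=5_000, columns=real.columns)
print(real.corr(), synthetic.corr(), sep="\n")
```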

If we model the structure of the original data very accurately, then we will end up effectively replicating the original data, which would have privacy implications. This is called “overfitting.” As an example, if the data has a single male born in 1959 with prostate cancer and the model replicates that specific person, then the model is too specific in this instance. Therefore, the model should capture only partial structure. There is a balancing act between how accurate the model needs to be and how close the synthetic data is to the original data. It is the classic tradeoff between data utility and data privacy. When done well, the synthetic data retains enough statistical properties of the original data and has a very low risk of identifying individuals in the data. This allows synthetic data to be shared more freely. Figure 1 illustrates these tradeoffs.

Figure 1 — The tradeoffs in the process of generating synthetic data.
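One common, though by no means definitive, heuristic for detecting the kind of overfitting described above is to check whether synthetic records sit unusually close to real ones. The sketch below, written for purely numeric inputs, compares nearest-neighbor distances; the threshold interpretation is an assumption for illustration.

```python
import numpy as np

def nn_distances(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """For each row of a, the Euclidean distance to its nearest row in b."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return d.min(axis=1)

def overfit_ratio(real: np.ndarray, synthetic: np.ndarray) -> float:
    """Median synthetic-to-real NN distance over median real-to-real NN distance."""
    syn_to_real = nn_distances(synthetic, real)
    d = np.linalg.norm(real[:, None, :] - real[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)      # ignore each record's zero distance to itself
    real_to_real = d.min(axis=1)
    # A ratio well below 1 suggests synthetic records hug real ones too closely,
    # i.e., the synthesizer may be replicating individuals.
    return float(np.median(syn_to_real) / np.median(real_to_real))
```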


Various national statistical agencies, such as the US Census Bureau, have created public data sets using synthetic data generation techniques. This approach has been gaining momentum over the last few years.

Managing Privacy Risks

When evaluating the privacy risks with de-identified data, there are two kinds of disclosures to manage: identity disclosure and attribute disclosure.

Identity Disclosure

Identity disclosure is when an adversary is able to correctly assign an identity to a record in a data set and, by doing so, can learn something new about that individual. If an adversary assigns an identity to a record but does not learn anything new, then, arguably, that is a meaningless identity disclosure.

Consider the data set in Table 1. Imagine that an adversary knows Hiroshi is in the data set. Hiroshi is Japanese, and there is only one Japanese record in the data. In this case, the adversary will discover Hiroshi’s income. Therefore, something new is learned, and this is a meaningful identity disclosure.

Now consider the data set in Table 2. If the adversary knows that Hiroshi was born in 1959 and has an income of $120k (and that Hiroshi is in the data set), then it will be clear that the first record is Hiroshi’s. But the adversary needed all the information in the data to match that data to a real person. This is an identity disclosure with no information gain. In practice, only identity disclosures that provide information gain will matter.
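A simple way to see how such matching plays out in practice is a k-anonymity-style count of how many records share each combination of values the adversary is assumed to know. The sketch below uses hypothetical columns loosely mirroring Table 1; the quasi-identifier choice is an assumption about the adversary's background knowledge.

```python
import pandas as pd

# Hypothetical records, loosely mirroring Table 1.
df = pd.DataFrame({
    "origin": ["Japanese", "Canadian", "Canadian", "German"],
    "year_of_birth": [1959, 1962, 1963, 1970],
    "income_k": [120, 95, 88, 70],
})

quasi_identifiers = ["origin"]   # what the adversary is assumed to know
df["group_size"] = df.groupby(quasi_identifiers)["income_k"].transform("size")

# Records in a group of one (like the single Japanese record) are the ones
# exposed to a meaningful identity disclosure: matching them reveals income.
print(df[df["group_size"] == 1])
```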

Attribute Disclosure

Attribute disclosure is when the adversary learns something new without being able to assign an identity to a particular record. For example, consider the oncology data set in Table 3. Assume that the adversary knows that Hiroshi is in the data set and knows his decade of birth. The adversary will learn that Hiroshi has been diagnosed with prostate cancer without knowing which record belongs to him (it could be any one of the first three records). In this case, something new was learned without assigning an identity to a record.

Table 1 — Example data set where a meaningful identity disclosure can occur.

Table 2 — Example data set to illustrate identity disclosure with no information gain.


Table 3 — Sample oncology data set from a clinic for patients who visited on 2 April 2019.

Attribute disclosure is the essence of data analysis. Imagine a larger data set of males living in a particular region. And let’s say that in this hypothetical data set, 5% of males born in the 1950s were diagnosed with prostate cancer, which is much higher than the national rate. We have just learned from this data set that males born in the 1950s living in that region have an unusually high diagnosis rate for prostate cancer.

Learning that males born in the 1950s are diagnosed with prostate cancer at an unusually high rate in a particular region may or may not cause harm to individuals — it will depend on how that information is used. If it is used to provide better services to patients, then that would be a desirable outcome. If it is used to deny bank loans to men in that approximate age range, then that would be a harmful outcome. Therefore, attribute disclosure, or inferences from data, by itself is not a problem. It is the types of decisions made on the basis of these inferences from data that matter — and these are ethical questions that are orthogonal to de-identification and disclosure control.

In summary, the primary disclosure control concern when sharing data for secondary purposes is to protect against identity disclosure that has information gain.

Identity Disclosure for Synthetic Data

Broadly speaking, there are three types of synthetic data:

1. Fully synthetic data. All records and variables in the data set are created from the synthesizer model.

2. Partially synthetic data. Only some variables are synthesized; the rest retain their original values.

3. Hybrid synthetic data. Some records are synthesized, and some original records are included.

For partially synthetic data, if the variables that are retained are the ones that an adversary would know and the remainder synthesized, then identity disclosure is possible but with limited information gain. For example, consider Table 1. If an adversary knows that Hiroshi is in the data and the “Origin” variable is retained but the income is synthesized, then an adversary would know which record belongs to Hiroshi but would not learn his correct income. This is trickier because the synthesized income would have to be sufficiently different to be able to claim that the information gain is indeed limited. For example, if the synthesized income is $125k, would that be different enough?
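As a rough illustration of partial synthesis (with hypothetical columns and a deliberately crude per-group model), the sketch below retains “Origin” and replaces income with a model-sampled value. Note that for groups of one, the sampled income is still centered on the real value, which is exactly why the information-gain question above is tricky.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

df = pd.DataFrame({
    "origin": ["Japanese", "Canadian", "Canadian", "German"],
    "income": [120_000.0, 95_000.0, 88_000.0, 70_000.0],
})

# Stand-in synthesizer: per-origin mean and standard deviation of income.
# Single-record groups have an undefined std, so give them a default spread.
stats = df.groupby("origin")["income"].agg(["mean", "std"]).fillna(15_000.0)

def sample_income(origin: str) -> float:
    m, s = stats.loc[origin, "mean"], stats.loc[origin, "std"]
    return float(rng.normal(m, s))

# Origin is retained as-is; only income is synthesized.
partially_synthetic = df.assign(income=[sample_income(o) for o in df["origin"]])
print(partially_synthetic)
```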

Unless some additional de-identification is applied to the data, it is not clear how hybrid synthetic data would be considered nonpersonal. When any original data is released, there is a risk of identity disclosure because some original or real personal information is included.

The statistical community has considered fully synthetic data to have a low risk of identity disclosure, and the few attempts to measure it have found that to be the case. The argument has some merit in that the synthetic data can at best have imperfect clones of records about real individuals, assuming that the synthesizer model is not overfitted.

Therefore, in general, fully synthetic data is preferred to reliably protect against identity disclosure.

Data Utility

The utility of synthetic data directly relates to the quality of the synthesizer model. If the synthesizer model captures, say, all primary, secondary, and tertiary relationships among the variables, then the synthetic data will likely replicate many analyses that would have been performed on the original data.

In the case where the data set is simple, with few variables and a single observation per individual, capturing relationships among the variables can be more readily achieved by the synthesizer, giving higher-utility data. In cases where the original data is more complex, it is more difficult to capture the structure fully, and hence the synthetic data will have (relatively) less utility.
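In practice, utility is often assessed by rerunning the same analyses on the real and the synthetic data and comparing the results. The sketch below, which assumes purely numeric data, compares pairwise correlations and ordinary-least-squares coefficients; it is an illustrative check under those assumptions, not a complete utility-evaluation framework.

```python
import numpy as np
import pandas as pd

def ols_coefficients(df: pd.DataFrame, target: str) -> np.ndarray:
    """Fit an ordinary-least-squares model of target on the other columns."""
    X = df.drop(columns=[target]).to_numpy()
    X = np.column_stack([np.ones(len(X)), X])          # add intercept
    y = df[target].to_numpy()
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def utility_report(real: pd.DataFrame, synthetic: pd.DataFrame, target: str) -> dict:
    """Compare correlation structure and regression coefficients across the two data sets."""
    corr_gap = (real.corr() - synthetic.corr()).abs().max().max()
    coef_gap = np.abs(ols_coefficients(real, target) - ols_coefficients(synthetic, target))
    return {"max_correlation_gap": float(corr_gap), "coefficient_gaps": coef_gap}
```

For a richer assessment, the same comparison can be repeated for whatever analyses the eventual data users actually plan to run.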

The value of synthetic data is in situations summarized in the sidebar “Use Cases for Synthetic Data.” We will examine two of these below: innovative technology evaluation and exploratory analysis.

Innovative Technology Evaluation

Technology evaluation situations are increasingly common. There are many startups, established vendors, and academics developing innovative data shaping, handling, cleansing, integration, and analysis tools. It is difficult for organizations to evaluate these tools on their own data sets because they would need to de-identify the data, ensure that all the technology providers have adequate security and privacy controls, sign data-sharing agreements with each, and then ensure that the data is destroyed after the evaluation. To do this at scale on a continuous basis can become a burden. In such instances, realistic synthetic data can be used to evaluate the technologies quickly to filter them, and then only the small number of providers that make it through the filter may obtain access to real data to confirm their findings and perform deeper analysis.


Exploratory Analysis

For exploratory analysis, an analyst can understand the data and its limitations, prepare some analysis code, and test some general hypotheses. If that effort yields interesting results, then it would be worthwhile to obtain the de-identified data (or the original data) by going through the more complex data access process.

Synthetic data is close enough to real data, and is easy enough to obtain, that it can solve some practical problems and enable the acceleration of innovative analytics efforts.

Statutes and Contracts

Because many statutes and contracts did not anticipate synthetic data, some interpretation guidance is needed. In this Update, we consider only a couple of questions that we have encountered.1

Can synthetic data be treated in the same manner as de-identified data, meaning that the various obligations in privacy regulations (e.g., GDPR) would not apply? Given that fully synthetic data is considered to have a low probability of identity disclosure, and that is the definition of de-identified data, a strong case can be made that it should indeed be treated in the same manner. If that were not the case, then there would not be an incentive for creating synthetic data, and, in general, data protection authorities have tended to encourage the use of technologies that increase data protection.

A second scenario arises when a data custodian shares personal information with a subcontractor or processor and the data-sharing contract is silent on whether the processor can de-identify that personal data and use it for secondary purposes. Can the processor do so? In the US under the Health Insurance Portability and Accountability Act (HIPAA), for example, if such an agreement (called a “Business Associate Agreement”) does not specifically permit de-identification, then de-identification is considered to be prohibited. However, one can argue that fully synthetic data does not contain any of the original data, and, therefore, the prohibition would not apply, as the synthetic data is a completely new data set.

Let’s consider a situation where a researcher publishes a paper in a journal with a complex regression model that predicts the likelihood of diagnosis with prostate cancer. The publication includes all the error estimates for the model. Then let’s say we use that published model to generate synthetic data by sampling from the model. In that case, whatever restrictions applied to the data that the researcher used to produce the regression model do not carry over to the published model. Those restrictions could even have included a prohibition on the researcher’s creating more than one regression model.

1 This section is for informational purposes only and is not intended to, nor shall it be construed as, providing any legal opinion or conclusion. It does not constitute legal advice and is not a substitute for obtaining professional legal counsel from a qualified attorney on your specific matter.


One can argue that once the regression model is produced, sampling from it is not covered by restrictions on the original data. If that were not the case, then subsequent uses of the regression model for decision making, to provide care or influence the determinants of diagnosis, would inherit the data restrictions, which would be quite a limitation and obstacle to data analysis and model building.
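To make the scenario concrete, the sketch below generates synthetic outcomes by sampling from a hypothetical published logistic regression. The coefficients and the assumed age distribution are invented for illustration; they do not come from any real study.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical published logistic regression: log-odds = b0 + b1 * age.
b0, b1 = -9.0, 0.11

def synthesize_from_model(n: int) -> np.ndarray:
    """Sample synthetic (age, diagnosis) pairs from the published model alone."""
    age = rng.normal(60, 10, n)                  # assumed predictor distribution
    p = 1.0 / (1.0 + np.exp(-(b0 + b1 * age)))   # model-implied probability
    diagnosed = rng.random(n) < p                # sample outcomes from the model
    return np.column_stack([age, diagnosed])

synthetic = synthesize_from_model(10_000)
print(synthetic[:5])
```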

The Challenges

For certain data types, the generation of synthetic data can be somewhat straightforward. For example, to generate synthetic data from cross-sectional data with a handful of variables, irrespective of the number of observations, would be considered a classroom exercise. Handling longitudinal data with a large number of variables increases the complexity of synthesis, and sophisticated modeling techniques would be needed in that case. Domain knowledge of the type of data will be important in such cases to ensure the use of appropriate modeling techniques and the maintenance of deterministic or structural relationships.

Data sets with geographic markers also introduce additional complexity in that these can potentially directly identify individuals, households, or business premises. Furthermore, accounting for date and time of events in a meaningful way is important in the synthesis process. For example, events that have a natural order (e.g., discharges from hospitals always following admissions) should appear in that order in the synthetic data.
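Such deterministic rules are often enforced as a post-processing step on the synthesized records. The sketch below shows one possible policy for a single rule, with illustrative column names; it is an assumption about how one might do it, not a standard method.

```python
import pandas as pd

def enforce_admission_before_discharge(df: pd.DataFrame) -> pd.DataFrame:
    """Repair synthetic rows where the discharge date precedes the admission date."""
    out = df.copy()
    bad = out["discharge_date"] < out["admission_date"]
    # One possible repair policy: swap the two dates so the natural order holds.
    out.loc[bad, ["admission_date", "discharge_date"]] = \
        out.loc[bad, ["discharge_date", "admission_date"]].to_numpy()
    return out
```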

On the privacy side of the ledger, if the synthesis process is not for fully synthetic data, then it is important to understand how identity disclosure is measured and minimized. A user organization faces risk if it doesn’t address the possibility of identity disclosure.

Conclusion

Fully synthetic data can open new opportunities for broadly using, sharing, and disclosing data for secondary purposes. It makes it possible to directly address difficulties in obtaining access to data or in making data available to large communities of analysts.

Methods for the generation of synthetic data have been studied in the statistical disclosure control com- munity for more than 25 years, and there is a growing literature on techniques and experiences. We are seeing commercial offerings entering the market to support the adoption of this approach to sharing and obtaining access to data.

A key advantage of synthetic data is that it is generated by randomly sampling from a model of the original data's distributions and correlations, rather than by perturbing or transforming real individuals' records. It provides a compelling solution in terms of cost effectiveness and the ability to address regulatory requirements in certain situations. Such use cases include hackathons/datathons/challenges, open data, software and algorithm testing, education, and simple analytics, to name a few.


About the Authors

Khaled El Emam is Professor at the University of Ottawa, Canada, Faculty of Medicine, where he previously held the position of Canada Research Chair in Electronic Health Information. He is also founder, President, and CEO of Privacy Analytics Inc. As an entrepreneur, Dr. El Emam has founded or cofounded five companies involved with data management and data analytics over the last two decades. He has worked in technical and management positions in academic and business settings in the UK, Scotland, Germany, Japan, and Canada. In 2003–2004, Dr. El Emam was ranked as the top systems and software engineering scholar worldwide by the Journal of Systems and Software, based on his research on measurement and quality evaluation and improvement. Previously, he was Senior Research Officer at the National Research Council of Canada and served as head of the Quantitative Methods Group at the Fraunhofer Institute in Kaiserslautern, Germany. Dr. El Emam holds a PhD in electrical and electronics engineering from King’s College at the University of London, UK. He can be reached at [email protected].

Richard Hoptroff is a long-time technology inventor, investor, and entrepreneur. He is the founder of Hoptroff London, a timing synchronization service provider. Dr. Hoptroff has leveraged his expertise in timing technology and software to develop a hyperaccurate, synchronized timestamping solution for the financial services sector, based on a unique combination of grandmaster atomic clock engineering and proprietary software. In 2013, he established a new commercial category when he brought to market the first commercial atomic timepiece and atomic wristwatch. Dr. Hoptroff is also the cofounder of Right Information Systems, a neural net forecasting software company eventually sold to Cognos, and founder of Flexipanel Ltd., a company supplying Bluetooth modules to the electronics industry. He has worked as a postdoctoral researcher at the Research Laboratory for Archaeology and the History of Art at Oxford University, UK. Dr. Hoptroff holds a PhD in physics (optical computing and artificial intelligence) from King’s College at the University of London, UK. He can be reached at [email protected].

