
SynC: A Unified Framework for Generating Synthetic Population with Gaussian Copula

Colin Wan,1 Zheng Li,2,3* Alicia Guo,4 Yue Zhao5
1 Department of Statistical Sciences, University of Toronto, Toronto, ON, Canada
2 Northeastern University, Toronto Campus, Toronto, ON, Canada
3 Arima Inc., Toronto, ON, Canada
4 PwC Canada, Toronto, ON, Canada
5 H. John Heinz III College, Carnegie Mellon University, Pittsburgh, PA, USA
[email protected], [email protected], [email protected], [email protected]

Abstract

Synthetic population generation is the process of combining multiple socioeconomic and demographic datasets from different sources and/or granularity levels, and downscaling them to an individual level. Although it is a fundamental step for many tasks, an efficient and standard framework is absent due to various reasons such as privacy and security. In this study, we propose a multi-stage framework called SynC (Synthetic Population via Gaussian Copula) to fill the gap. SynC first removes potential outliers in the data and then fits the filtered data with a Gaussian copula model to correctly capture the dependencies and marginal distributions of the sampled survey data. Finally, SynC leverages predictive models to merge datasets into one and then scales them accordingly to match the marginal constraints. We make four key contributions in this work: 1) we propose a novel framework for generating individual-level data from aggregated data sources by combining state-of-the-art machine learning and statistical techniques, 2) we demonstrate its value as a feature engineering tool, as well as an alternative to data collection in situations where gathering is difficult, through two real-world datasets, 3) we release an easy-to-use implementation of the framework for reproducibility, and 4) we ensure the methodology is scalable at the production level and can easily incorporate new data.

Introduction

Synthetic population generation is used to combine socioeconomic and demographic data from multiple sources, such as censuses and market research, and downscale them to an individual level. Often for privacy reasons, sensitive information such as names, personal contact details, family histories, wealth and health records is considered personally identifiable, and hence in many jurisdictions, releasing such information is strictly prohibited. As an alternative, however, such data can be released at aggregated regional levels (e.g., only averages or percentages are released for a region with multiple residents). However, practitioners often find individual-level data far more appealing, as aggregated data lacks information such as the distributions of residents within a region, and therefore an alternative to real population data is required. For the downscaled synthetic population to be useful, it needs to be fair and consistent. The first condition requires that the simulated data mimic the realistic distributions and correlations of the true population as closely as possible. The second condition implies that when we aggregate the downscaled samples, the results need to be consistent with the original data. A more rigorous analysis is provided in a later section.

Synthetic data generation draws a lot of attention from data scientists and is often seen as a privacy-preserving way to augment training data in situations where collection is difficult. In applications where large-scale data collection involves manual surveys (e.g., demographics), or when the collected data is highly sensitive and cannot be fully released to the public (e.g., financial or health data), synthetically generated datasets become an ideal substitute. For example, due to the pseudonymisation article of the General Data Protection Regulation (2014), organizations across the world are forbidden to release personally identifiable data. As a result, such datasets are often anonymized and aggregated (such as by geographical aggregation, where variables are summed or averaged across a certain region). Being able to reverse-engineer the aggregation, therefore, is a key step to reconstructing the lost information.

Common techniques for synthetic population generation are synthetic reconstruction (SR) (1996) and combinatorial optimization (CO) (2001; 2000). Both approaches have specific data requirements and limitations which usually cannot be easily resolved. To address these challenges, we propose a new framework called SynC (Synthetic Population with Gaussian Copula) to simulate microdata by sampling features in batches. The concept is motivated by (Jeong et al. 2016) and (Kao et al. 2012), which are purely based on copula and distribution fitting. The rationale behind our framework is that features can be segmented into distinct batches based on their correlations, which reduces the high-dimensional problem to several sub-problems in lower dimensions. Feature dependency in high dimensions is hard to evaluate via common methods due to its complexity and computational requirements, and as such, Gaussian copulae, a family of multivariate distributions capable of capturing dependencies among random variables, become an ideal candidate for the application.

In this study, we make the following contributions:

1. We propose a novel combination framework which, to the best of our knowledge, is the first published effort to combine state-of-the-art machine learning and statistical instruments (e.g., outlier detection, Gaussian copula, and predictive models) to synthesize population data.

2. We demonstrate SynC's value as a feature engineering tool, as well as an alternative to data collection in situations where gathering is difficult, through two real-world datasets in the automotive and market research industries.

3. To foster reproducibility and transparency, all code, figures and results are openly shared.1 The implementation is readily accessible to be adapted for similar use cases.

4. We ensure the methodology is scalable at the production level and can easily incorporate new data without the need to retrain the entire model.

1 https://github.com/winstonll/SynC
* Corresponding Author.
Copyright (c) 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Related Works

Synthetic Reconstruction

Synthetic reconstruction (SR) is the most commonly used technique to generate synthetic data. This approach reconstructs the desired distribution from survey data while being constrained by the marginals. Simulated individuals are sampled from a joint distribution, estimated by an iterative process, to form a synthetic population. Typical iterative procedures used to estimate the joint distribution are iterative proportional fitting (IPF) and matrix ranking. The IPF algorithm fits an n-dimensional contingency table based on sampled data and fixed marginal distributions. The inner cells are then scaled to match the given marginal distribution, and the process is repeated until the entries converge.

IPF has many advantages, such as maximizing entropy and minimizing discrimination information (1968), and it results in the maximum likelihood estimator of the true contingency table (1991). However, IPF is only applicable to categorical variables. The SynC framework instead incorporates predictive models to approximate each feature, which can produce real-valued outputs as well as distributions that can be sampled from to produce discrete features.

Combinatorial Optimization

Given a subset of individual data with the features of interest, the motivation behind combinatorial optimization (CO) is to find the best combination of individuals that satisfies the marginal distributions while optimizing a fitness function (2013). CO is typically initialized with a random subset of individuals, and individuals are then swapped with a pool of candidates iteratively to increase the fitness of the group. Compared to SR approaches, CO can reach more accurate approximations, but often at the expense of exponentially growing computational requirements (2013).

Copula-Based Population Generation

A copula is a statistical tool used to understand the dependency structures among different distributions (refer to the Proposed Framework section for details), and it has been widely used in microsimulation tasks (2012). However, downscaling is not possible, and the sampled data stay at the same level of granularity as the input. Jeong et al. discuss an enhanced version of IPF in which the fitting process is done via copula functions (2016). Similar to IPF, this method relies on the integrity of the input data, which, as discussed before, can be problematic in real-world settings. Our method, SynC, is less restrictive on the initial condition of the input dataset, as it only requires aggregated-level data. Therefore, SynC is more accessible compared to previous approaches.

Proposed Framework

Problem Description

Throughout this paper, we assume access to X = (X_1, ..., X_M), where X_m = [X_m^1, ..., X_m^D]^T is a D-dimensional vector representing the input features of aggregation unit m. Each m ∈ {1, ..., M} represents an aggregation unit containing n_m individuals. When referring to specific individuals, we use x_{m,k}^d to denote the d-th feature of the k-th individual belonging to aggregation unit m. For every feature d and aggregation unit m, we only observe X_m^d at an aggregated level, which is assumed to be an average (i.e., X_m^d = Σ_{k=1}^{n_m} x_{m,k}^d / n_m) in the case of numerical measurements, or a percentage (i.e., X_m^d = Σ_{k=1}^{n_m} x_{m,k}^d / n_m with x_{m,k}^d ∈ {0, 1}) in the case of binary measurements. We use the term coarse data to refer to this type of observation. In applications, aggregation levels can be geopolitical regions, business units or other types of segmentation that make practical sense.

Our proposed framework, SynC, attempts to reverse-engineer the aggregation process to reconstruct the unobserved x_1^d, ..., x_{n_m}^d for each feature d and aggregation unit m. We assume that M is sufficiently large, especially relative to the size of D, so that fitting a reasonably sophisticated statistical or machine learning model is permissible, and we also assume that the aggregation units n_m are modest in size, so that not too much information is lost through aggregation and reconstruction is still possible. Thus, for M × D-dimensional coarse data X, the reconstruction should have dimensions N × D, where N = Σ_{m=1}^M n_m is the total number of individuals across all aggregation units. This finer-level reconstruction is referred to as the individual data.

SynC is designed to ensure the reconstruction satisfies the following three criteria (2003):

(i) For each feature, the marginal distribution of the generated individual data should agree with the intuitive distribution one expects to observe in reality.

(ii) Correlations among features of the individual data need to be logical and directionally consistent with the correlations of the coarse data.

(iii) Aggregating the generated individual data must agree with the original coarse data for each m ∈ {1, ..., M}.

To illustrate the importance of these criteria, suppose that one wishes to reconstruct individual incomes from given coarse personal incomes. The first criterion implies that the distribution of sampled individual incomes must align with the observed distribution of income, which is often modelled by the lognormal distribution. The second criterion implies that generated incomes should correlate with other money-related features such as spending on vacations, luxury goods, and financial investments. The third criterion implies that if we average the generated incomes, we should recover the input coarse personal incomes.

The main idea of SynC is to simulate individuals by generating features in batches based on core characteristics, and to scale the result to maintain hidden dependencies and marginal constraints. As illustrated in Fig. 1, SynC contains four core phases, and their formal descriptions are provided below.

Figure 1: Flowchart of the SynC framework

Phase I: Outlier Removal

Outliers are deviant samples from the general data distribution that may result from recording mistakes, incorrect responses (intentional or unintentional), or tabulation errors (2019). The presence of outliers can lead to unpredictable results (2019). Microsimulation tasks are sensitive to outliers (2013), and conducting outlier detection before any analysis is therefore important. Outlier removal methods should be selected based on the underlying assumptions and the actual condition of the data. Notably, the necessity of this step depends on the practitioners' definition of abnormality, although it is recommended in most cases.

Phase II: Dependency Modeling

We propose to use the copula model to address criteria (i) and (ii), since copulae, with their wide industrial applications, have been a popular multivariate modeling methodology, especially when the underlying dependency is essential. First introduced by Sklar (1959), a copula is a multivariate probability distribution for which the marginal probability distribution of each variable is uniform. Let X = (x_1, x_2, ..., x_D) be a random vector in R^D with marginal cumulative distribution functions P_i(x) = Pr[x_i < x], and define U as

U = (u_1, u_2, ..., u_D) = (P_1(x_1), P_2(x_2), ..., P_D(x_D))    (1)

A copula of the random vector X is defined as the joint CDF of the random uniform vector U:

C(u'_1, u'_2, ..., u'_D) = Pr(u_1 < u'_1, u_2 < u'_2, ..., u_D < u'_D)    (2)

In other words, we can describe the joint distribution of a random vector X using its marginal distributions and some copula function. Additionally, Sklar's Theorem states that for any set of random variables with continuous CDFs, there exists a unique copula as described above. This allows us to isolate the modeling of marginal distributions from their underlying dependencies. Sampling from copulae is widely used by empiricists to produce desired multivariate samples based on a given correlation matrix and the desired marginal distribution of each component. Nelsen (2007) outlines a simpler version of Algorithm 1 for bivariate sampling, which can be generalized to multivariate cases.

Algorithm 1: Gaussian copula sampling

Data: coarse data
Result: simulated individual data
Initialization:
    M = number of aggregation units
    D = dimension of the coarse data
    Γ = D-dimensional correlation matrix of the features of the coarse data
    Φ = CDF of a standard normal distribution
    F_d^{-1} = inverse CDF of the marginal distribution of the d-th feature
for m in 1...M do
    Draw Z_m = (Z_m^1, ..., Z_m^D) ~ N(0, Γ), where N(µ, Γ) denotes a D-dimensional normal distribution with mean µ and correlation matrix Γ
    for d in 1...D do
        u_m^d = Φ(Z_m^d)
        y_m^d = F_d^{-1}(u_m^d)   (this implies that Y_m^d follows the desired marginal distribution)
    end
    Return Y_m = {Y_m^1, ..., Y_m^D}
end
Return Y = {Y_1, ..., Y_M}

In order to properly specify F_d^{-1}, we make the reasonable assumption that each aggregation unit is significant and diverse enough that var(X_m^d) is approximately constant for all m, so that it can be estimated by the unbiased sample estimator Σ_{m=1}^M (X_m^d − X̄^d)^2 / (M − 1). Thus, given that µ_m^d is observed from the aggregation unit averages, F_d^{-1} can be uniquely specified so long as we are willing to make an assumption about the distribution family. For example, if the desired output is a positive continuous variable (such as income), F_{Y_m^d}(y) follows a lognormal distribution with mean µ_m^d and standard deviation σ_m^d. In the case of categorical variables, F_{Y_m^d}(y) follows a beta distribution with parameters α and β such that α / (α + β) = µ_m^d and αβ / ((α + β)^2 (α + β + 1)) = (σ_m^d)^2. Our assumptions imply that µ_m^d and σ_m^d can be derived from the coarse data mean and standard deviation for feature d:

µ_m^d = mean of feature d in aggregation unit m    (3)

σ_m^d = σ^d · √M · √n_m    (4)

Algorithm 1 incorporates the above assumptions with the Gaussian copula to satisfy criteria (i) and (ii).
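To make the sampling procedure concrete, here is a minimal Python sketch of Algorithm 1 using NumPy and SciPy. The correlation matrix and the marginal parameters below are illustrative assumptions, not values from the paper; we pair a lognormal income-like feature with a beta rate-like feature, the two marginal families suggested in the text.

```python
import numpy as np
from scipy import stats

def gaussian_copula_sample(corr, inv_cdfs, n_samples, seed=0):
    """Algorithm 1 sketch: draw Z ~ N(0, corr), push each coordinate through
    the standard normal CDF to get correlated uniforms, then through each
    feature's inverse CDF to impose the desired marginal distribution."""
    rng = np.random.default_rng(seed)
    d = corr.shape[0]
    z = rng.multivariate_normal(np.zeros(d), corr, size=n_samples)
    u = stats.norm.cdf(z)  # correlated Uniform(0, 1) variates
    return np.column_stack([inv_cdfs[j](u[:, j]) for j in range(d)])

# Illustrative setup (assumed parameters): a positive income-like feature
# and a [0, 1] rate-like feature with moderate positive dependency.
corr = np.array([[1.0, 0.6],
                 [0.6, 1.0]])
inv_cdfs = [
    lambda p: stats.lognorm.ppf(p, s=0.5, scale=np.exp(10)),  # income > 0
    lambda p: stats.beta.ppf(p, a=2.0, b=5.0),                # rate in [0, 1]
]
sample = gaussian_copula_sample(corr, inv_cdfs, n_samples=1000)
```

Because the dependency is injected before the inverse-CDF step, the two columns keep a positive association while each column follows its own marginal family exactly.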
Gaussian Copula

Being one of the most popular and simplest forms of copula, the Gaussian copula is easy to interpret, implement and sample from (and is the copula used in our experiments). It is constructed from the multivariate normal distribution through the probability integral transformation. For a given correlation matrix R, the Gaussian copula requires that the joint distribution of the given variables can be expressed as

C_R^{Gauss}(u) = Φ_R(Φ^{-1}(u_1), ..., Φ^{-1}(u_D))    (5)

where Φ is the CDF of the standard normal distribution and Φ_R is the joint CDF of a multivariate normal distribution with zero mean and correlation matrix R. Letting

V = (Φ^{-1}(u_1), ..., Φ^{-1}(u_D))    (6)

the copula density can be written as

c_R^{Gauss}(u) = (1 / √(det R)) exp(−(1/2) V^T · (R^{-1} − I) · V)    (7)

Because of the popularity of the normal distribution and the advantages mentioned above, the Gaussian copula has been the most widely applied copula. Its earliest applications were in the finance industry: Frees and Valdez applied it in insurance (1998), and Hull and White used it in credit derivative pricing (2006). More recently, applications of the Gaussian copula can be found in many fields, such as linkage analysis (2006), functional disability data modeling (2011), and stochastic power demand modeling (2012).

Archimedean Copula

Following the work of Abel (1826) and Ling (1965), the Archimedean copula is introduced with the definition

C(u_1, ..., u_p) = ψ(Σ_{i=1}^p ψ^{-1}(u_i))    (8)

where ψ : [0, ∞) → [0, 1] is a real-valued function satisfying a few conditions.1

With different ψ functions, Archimedean copulae have many extensions. Popular cases are Clayton's copula, which has been used in bivariate truncated data modelling (2007) and in modelling hazard scenarios and failure probabilities (2016), and Frank's copula, which has been used in storm volume analysis (Salvadori and De Michele 2004) and drought frequency analysis (Wong 2013). Archimedean copulae have gained attention in recent years, and their unique generator functions give outstanding performance in particular situations. Yet the fitting process for the generator function can be computationally expensive. As SynC performs data generation in batches and aims to be computationally efficient, we use the Gaussian copula to model the dependency among features.

Phase III: Predictive Model Fitting

In theory, one could construct a complex enough correlation matrix over all variables and fit one giant copula model using Algorithm 1. However, two reasons make this approach nearly impossible: 1) the curse of dimensionality makes it difficult when D is sufficiently large, and 2) certain columns of X may update more frequently, and it is inefficient to retrain the full model each time. We therefore introduce an alternative method, Algorithm 2, which resolves this issue with minimal computational requirements.

From X, we first select S, a subset of X containing the most important features in the given context, and apply Algorithm 1 to S; the resulting sample, Y, should satisfy the first two criteria mentioned in Phase II. One crucial question is the choice of core features, as it is somewhat arbitrary to decide which variables should be considered core. The choice should be made specifically for every application: in some cases the importance and influence of some variables clearly dominate others, while in other cases the quality (e.g., sampling method) or availability (e.g., sample size) of the data makes some variables better candidates for inclusion in the core feature set.

Secondly, we divide the non-core variables into a total of B batches, T_1, ..., T_B, based on their correlations. Again, the division is somewhat arbitrary, but in most cases batches can be based on content, method of collection, or other groupings that make sense to practitioners. Together with S, the T_i are input into Algorithm 2 sequentially. Each resulting sampled data set, K_i, contains features from S and T_i.

Algorithm 2: Batch sampling via Gaussian copula

Data: coarse data
Result: simulated individual data
Initialization:
    X = input coarse data
    B = total number of batches of non-core features
    S = predefined set of core features
    T_i = i-th batch of non-core features
    Y = initial sampled individual data with only the core features, obtained using Algorithm 1
for i in 1...B do
    X'_i = X[S ∪ T_i]
    K_i = sampled data obtained by applying Algorithm 1 to X'_i
    F(T_i | S = s) = an approximated distribution function trained on K_i
    Q_i = arg max_Q F(Q | S = Y)
    Y = Y ⋈ Q_i, where ⋈ is the natural join operator
end
Return Y

However, a problem immediately presents itself: the sampled individuals in K_i do not match Y. Furthermore, the individuals sampled in each K_i do not necessarily match those sampled in K_j (for j ≠ i). To fix this, we can train a predictive model on K_i to approximate the distribution P(T_i | S = s), and use the model to predict the values of the features in T_i given the values of the core features. The choice of predictive model depends on the complexity and nature of the data, on a case-by-case basis.
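The batch procedure of Algorithm 2 can be sketched as follows. This is a simplified illustration, not the paper's implementation: as a stand-in for the trained predictive model F(T_i | S = s), it transfers batch columns to the core individuals by nearest-neighbour matching on the (standardized) core features, and all distribution parameters in the toy example are assumptions.

```python
import numpy as np
from scipy import stats

def copula_sample(corr, inv_cdfs, n, seed=0):
    # Algorithm 1: correlated normals -> uniforms -> desired marginals.
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(np.zeros(corr.shape[0]), corr, size=n)
    u = stats.norm.cdf(z)
    return np.column_stack([inv_cdfs[j](u[:, j]) for j in range(corr.shape[0])])

def batch_sample(core_corr, core_inv, batches, n, seed=0):
    """Algorithm 2 sketch: sample the core features once, then for each batch
    of non-core features sample (core + batch) jointly and carry the batch
    columns over to the core individuals by nearest-neighbour matching on
    the core features -- a simple stand-in for the predictive model."""
    y = copula_sample(core_corr, core_inv, n, seed)  # core sample Y
    n_core = y.shape[1]
    for i, (corr_i, inv_i) in enumerate(batches):
        k_i = copula_sample(corr_i, inv_i, n, seed + i + 1)  # K_i on (S, T_i)
        # Standardize the core columns so distances are comparable.
        mu, sd = y[:, :n_core].mean(0), y[:, :n_core].std(0)
        core_y = (y[:, :n_core] - mu) / sd
        core_k = (k_i[:, :n_core] - mu) / sd
        # For each individual in Y, find the closest individual in K_i.
        dist = ((core_y[:, None, :] - core_k[None, :, :]) ** 2).sum(-1)
        nn = np.argmin(dist, axis=1)
        y = np.column_stack([y, k_i[nn, n_core:]])  # natural-join analogue
    return y

# Toy example (assumed parameters): one age-like core feature plus one batch
# containing an income-like feature correlated 0.7 with it.
core_corr = np.array([[1.0]])
core_inv = [lambda p: stats.norm.ppf(p, loc=50, scale=10)]
batch = (np.array([[1.0, 0.7], [0.7, 1.0]]),
         core_inv + [lambda p: stats.lognorm.ppf(p, s=0.5, scale=np.exp(10))])
pop = batch_sample(core_corr, core_inv, [batch], n=500)
```

Because the matching conditions on the shared core columns, the dependency between the core features and each batch is approximately preserved in the joined result, which is the point of sampling each batch jointly with S.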

Finally, we merge the predicted values with the original sampled data set Y, and iterate until all features are covered.

1 The conditions are that (−1)^k d^k ψ(x) / dx^k ≥ 0 for all x ≥ 0 and k = 1, ..., p − 2, and that (−1)^{p−2} ψ^{(p−2)}(x) is non-increasing and convex.

Postal    Population    Avg Age    % with Mortgage    % Speaks two languages
M5S3G2    467           35.1       0.32               0.69
V3N1P5    269           37.2       0.35               0.67
L5M6V9    41            49.1       0.67               0.43

Table 1: An excerpt of three variables from the census data

Phase IV: Marginal Scaling

The final step addresses criterion (iii): ensuring that the sampled individual data agree with the input coarse data. If Y^d is categorical with η classes, we first note that the predicted values from Phase III for individual k in aggregation unit m, represented by p_{m,k}^d = {p_{m,k,i}^d}_{i=1}^η, form a probability distribution. Hence it is natural to assume Y^d ~ Multinomial(1, p_{m,k}^d). To determine the exact class of individual k, we generate a random sample from this distribution. After the initial sampling, the percentage of each category may not match the marginal constraint. To resolve this, SynC first randomly removes individuals from the categories that are over-sampled until their marginal constraints are satisfied. SynC then resamples the same number of individuals using the normalized probabilities of the under-sampled categories. The above process is iterated until the desired result is achieved.

If Y^d is continuous, the sampled mean and variance should already be close to the original coarse data, given the way F_d^{-1} is constructed in Phase II. In case of a small discrepancy, SynC horizontally shifts each data point by the difference between the sample mean and the coarse data mean.
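The Phase IV repair described above can be sketched as follows. The function names are hypothetical, and the sketch assumes every row of `probs` is a proper probability vector with strictly positive entries; the remove-and-resample loop follows the iterative procedure in the text.

```python
import numpy as np

def scale_categorical(probs, target_counts, seed=0):
    """Phase IV sketch for one categorical feature in one aggregation unit.
    `probs` is an (n, eta) array of per-individual class probabilities from
    Phase III; `target_counts` (summing to n) is the marginal constraint
    implied by the coarse data.  Draw an initial class per individual, then
    repair: remove individuals from over-sampled classes and re-draw them
    from the renormalized probabilities of the under-sampled classes."""
    rng = np.random.default_rng(seed)
    n, eta = probs.shape
    labels = np.array([rng.choice(eta, p=p) for p in probs])
    counts = np.bincount(labels, minlength=eta)
    while not np.array_equal(counts, target_counts):
        over = np.flatnonzero(counts > target_counts)
        under = np.flatnonzero(counts < target_counts)
        for c in over:
            extra = counts[c] - target_counts[c]
            movers = rng.choice(np.flatnonzero(labels == c),
                                size=extra, replace=False)
            for k in movers:
                # Resample from the individual's probabilities restricted
                # (and renormalized) to the under-sampled classes.
                p = probs[k, under] / probs[k, under].sum()
                labels[k] = rng.choice(under, p=p)
        counts = np.bincount(labels, minlength=eta)
    return labels

def scale_continuous(values, coarse_mean):
    # Horizontal shift so the sample mean matches the coarse data mean.
    return values + (coarse_mean - values.mean())
```

Each pass moves every over-sampled class down to its target, so the total mismatch strictly decreases and the loop terminates with counts exactly matching the marginal constraint.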

Results and Discussion Education Table 3: An excerpt of auto leasing data In this section, we demonstrate the quality of SynC using two real-world examples. In both cases, our coarse data 4/9/2018 5/6/2019 4/1/2019 date first 2/20/2018 1/17/2019 3/29/2018 comes from the 2016 Canadian National Census, which is email sent

collected by Statistics Canada and compiled once every five Table 2: An excerpt of simulated data for one postal region years. The census is aggregated at the postal code level and Status Immigrants Immigrants Immigrants Immigrants

made available to the general public. There are 793,815 resi- Immigration dential postal codes in Canada (in the format L#L#L#, where purchased XXX Laval old car was XXX Canbec Dealer where Georgian XXX L is a letter and # is a digit), with an average of 47 residents XXX Autohaus

per postal code. The dataset contains more than 4,000 vari- Brian Jessel Dealership XXX Dealership in Laval Latin

ables ranging from demographics, spending habits, financial Korean Korean Chinese assets, and social values. Table 1 illustrates a subset of this Ethnicity dataset with 3 postal codes and 4 variables. 19 51 60 old car VOI XXX SUV 3 XXX SUV 4 XXX SUV 3 XXX Sedan 3 XXX Sedan 3 XXX Sedan 3 65+

Demographics variables from the Census is chosen as the Age core features because this subset is by far the most com- prehensive portion of all parts. All Canadian residents are F F 53 68 45 45 52 43 M M required to complete this section, while other sections are Sex optional and often complemented by other sources such as PersonAge F F

private market research data (usually much smaller in sam- M M M M

ple size and coverage). Gender Postal

In the following two case studies, we validate: V3N1P5 V3N1P5 V3N1P5 V3N1P5 1. the improvements on model accuracy when SynC is used L9X0S4 H7T1T4 H7P0B9 V3N1P5 H9B1A5 L4S1W5 as a feature engineering tool when training data is limited, PostalCode 2. the quality of SynC by comparing a sample of the gen- LR DT RF SVM NN erated individual data against real residents in these loca- Original Data 61.5% 63.9% 70.4% 69.3% 68.8% tions, and therefore leveraging it as an alternative to tradi- Augmented Data 66.2% 71.1% 73.0% 80.6% 73.9% tional market research. Table 4: Comparisons of accuracy measures of 5 different SynC as a Feature Engineering Tool classifiers trained on original data and synthetic population augmented data To assess the performance of SynC as a feature engineer- ing tool, we collaborate with a global automotive company (hereafter referred to as the ”Client”) that specializes in pro- persons for postal code V3N1P5 in Table 2, and use this as ducing high-end cars to build a predictive model to better an example to demonstrate how probabilistic matching can assist their sales team in identifying which of their current be done on the first customer. customers who have a leased vehicle are interested in buy- Client data show that a 53 year old male, who lives in ing the car in the next 6 months. This type of analysis is the area of V3N1P5 made a lease purchase. In this case, we extremely important in marketing, as contacting customers have three indicative measurement for this customer - this can be both expensive (e.g. hiring sales agents to make calls) buyer is 53 years old, male and lives in V3N1P5 (an area in and dangerous (e.g. potentially irritating and leading to un- Burnaby, British Columbia, Canada). In our synthetic data, subscriptions of emails or services). Therefore building ac- the closet match would be the tenth person, as two of the curate propensity models can be extremely valuable. 
three indicators (postal code and genders) match precisely, Our contains three steps. 1) We work with the and the last indicators matches closely (age of 54 vs. 53, Client to identify internal sales data and relevant customer which is a difference of 1 year). We can conclude that this profiles such as residential postal code, age, gender, 2) we customer, who leased an SUV of model type 3, is likely to be apply probabilistic matching to join sampled data together ethnically Chinese, an immigrant with a bachelor’s degree, with the Client’s internal data to produce an augmented ver- an income of between $90k to $99k and speaks 2 languages. sion of the training data, and 3) we run five machine learn- Notably, the third person is also a very close match. Both ing models on both the original and the augmented data, and postal code and gender matches perfectly, and age matches evaluate the effectiveness of SynC. closely as well (51 vs. 53, which is a difference of 2 years Data Description There are 7,834 customers who cur- rather than 1 year). It would also make sense to set a thresh- rently lease a specific model of car from the Client in old (such as within 5 years difference in age), and aggregate Canada. Our Client is interested in predicting who are more all matched records via averaging (for numerical variables) likely to make a purchase in the next 6 months. In Table 5, or voting/take proportions (for categorical variables). we attach an excerpt of the Client’s sale records. For security Method Evaluation We train 5 different classifiers on reasons, names and emails are removed. both the partner’s data, as well as the augmented dataset to To augment this dataset, the Client selects 30 variables predict whether a customer buys the leased vehicle. We train from the Census, which included information they do not Logistic Regression (LR), Decision Tree (DT), Random For- know but could be useful features. 
The variables includes, est with 500 tress (RF), SVM with Radial Basis Kernel and demographics (personal and family), financial situations and 2-Layer Neural Network (NN), and their performances can how they commute to work. We include an excerpt of the be found in Table 4. sampled individual data obtained by apply SynC to the Cen- In all five cases, augmented data produces a higher classi- sus in Table 5; both datasets are available upon request. fication accuracy than the raw data from our industry parnter. Probabilistic Matching One challenge of using SynC as Accuracy increases from slightly over 2.5% (RF) to as a feature engineering tool is the fact that synthetic popu- much as 11% (SVM), with an average increase of 6.2%. This lation is anonymous. In most applications, enterprise level increase is both technically significant, as well as practically data sources are individually identifiable through an identi- meaningful, as the Client would easily apply this model to fier which may be unique to the company (e.g. customer ID their business and achieve grow their sales. for multiple products/divisions at one company) or multiple This case study has shown that Synthetic Population is companies (e.g. cookie IDs which can be shared between an effective way to engineer additional features to situations multiple apps/websites). This makes merging different data where the original training data is limited. As explained in sources very easy as the identifier can be used as the primary early sections, SynC takes coarse datasets and generate esti- key. By construct, however, synthetic population individuals mates of individuals that are likely to reside within the given are model generated from anonymous data, and therefore postal code area. Although SynC does not produce real peo- cannot be joined by traditional means. Here we present an ple, the generated ”synthetic” residents both closely resem- alternative method called Probabilistic Matching. 
bles the behaviours of true population and is also consistent Because SynC produces anonymous information (i.e. data with the available sources. We demonstrate that it is a viable that cannot be attributed to a specific person, such as name, data augmentation technique. email address or ID), we use age, gender, ethnicity, and pro- fession as good identifiers to the real population. In table 3 SynC as an Alternative to Market Research we show the first few customers supplied by our industry The validity is hard to assess due to lack of true individual parnter. We also provide the list of synthetically generated data to benchmark. After all, if individual data was attain- able then SynC would not be needed in the first place. There- fore, We propose an alternative evaluation metric which re- quires 1) a set of surveyed response representing the popula- Gap Gap Gap Store tion and 2) simulated populations from the postal region of H&M Favorite Gap Gap N/A Banana Rep Banana Rep Store the surveyed individuals. For a surveyed dataset containing H&M Favorite D features from T individuals across M postal codes (each Old Navy Hot Renfrew containing nm surveyed individuals), M simulated popula- th tions each with Tm (the population of m postal region) individuals are generated using the proposed framework. The evaluation function, L is defined as the following: $10k $10k N/A Finance ¡$10k < < Profession M nm D Income 1 X 1 X 1 X d d L = I{xm,k = ym,k} (9) $50k to $59k $10k to $19k M nm D Scientific/Technical Scientific/Technical Scientific/Technical Scientific/Technical Scientific/Technical m=1 k=1 d=1 and

$$\{y_{m,k}\}_{k=1}^{n_m} = \arg\max_{\{y_{m,k}\}} \sum_{k=1}^{n_m}\sum_{d=1}^{D} L(x_{m,k}^d, y_{m,k}^d) \qquad (10)$$

where

$$L(x, y) = \begin{cases} \mathbb{I}\{x_{m,k}^d = y_{m,k}^d\}, & \text{if } x_{m,k}^d \text{ is categorical} \\[4pt] \dfrac{|x_{m,k}^d - y_{m,k}^d|}{x_{m,k}^d}, & \text{if } x_{m,k}^d \text{ is continuous.} \end{cases} \qquad (11)$$

Intuitively, $L$ measures the average similarity, across all surveyed postal codes, between each person from the true population and the simulated individual in their region who resembles them the most.
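As a concrete illustration, the computation in Eqs. (9)–(11) can be sketched in a few lines of Python. This is a minimal sketch with toy records: the feature values are invented, and for continuous features we take $1 - |x - y|/x$ as the similarity, one plausible reading of Eq. (11).

```python
from statistics import mean

def sim(x, y):
    # Per-feature similarity (Eq. 11): indicator for categorical features;
    # for continuous features, 1 - |x - y| / x (an assumed reading of Eq. 11).
    if isinstance(x, str):
        return 1.0 if x == y else 0.0
    return max(0.0, 1.0 - abs(x - y) / x)

def best_match_score(person, simulated):
    # Eq. (10): similarity to the simulated individual resembling `person` most.
    return max(mean(sim(x, y) for x, y in zip(person, cand)) for cand in simulated)

def metric_L(surveyed_by_region, simulated_by_region):
    # Eq. (9): average the best-match similarity over surveyed people and regions.
    return mean(
        mean(best_match_score(p, sim_pop) for p in survey)
        for survey, sim_pop in zip(surveyed_by_region, simulated_by_region)
    )

# Toy example: one postal region, two surveyed and three simulated individuals.
survey = [("F", 34, "Bachelor"), ("M", 60, "College")]
sim_pop = [("F", 32, "Bachelor"), ("M", 58, "College"), ("M", 25, "No degree")]
print(round(metric_L([survey], [sim_pop]), 3))  # → 0.985
```

Each surveyed person is compared against every simulated candidate in their region, and only the best match contributes to the score, mirroring the arg max in Eq. (10).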

We collect responses from 30 individuals in Toronto, Canada. For simplicity, the survey questions only contain five core features (Age, Sex, Ethnicity, Education, Income), seven additional personal features (Immigration Status, Marital Status, Family Size, Profession, etc.) and four spending behaviour features (Favourite Store for Clothing/Food/Grocery/Furniture). The five demographic features are selected as core variables, and the others are grouped into two batches, personal and spending behaviour. Using the SynC framework along with a 2-layered neural network in Phase III, a sample is produced and an excerpt is shown in Table 2.

Table 5: An excerpt of simulated data for one postal region.

Table 6: A comparison between simulated samples and surveyed samples.

Table 6 presents three comparison pairs between a few surveyed individuals and their simulated counterparts (pairs are highlighted by color). With this sample, $L = 82\%$. On a closer look, it is easy to see that the generated individuals are more realistic than the score can reflect. The metric uses an indicator function for each feature to evaluate accuracy, which does not capture the potential distance among categories (e.g., Income$_{50k-59k}$ should be "closer" to Income$_{60k-69k}$ than to Income$_{150k+}$).
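One way this limitation could be mitigated is by treating ordered categories as ranks and scoring by rank distance rather than exact equality. The sketch below is a hypothetical illustration, not part of SynC; the bracket labels and the linear decay are assumptions.

```python
# Hypothetical illustration: score ordered categories (e.g., income brackets)
# by rank distance instead of an exact-match indicator, so that "$50k to $59k"
# is closer to "$60k to $69k" than to "$150k+". Bracket labels are assumed.
BRACKETS = ["<$10k", "$10k to $19k", "$20k to $29k", "$30k to $39k",
            "$40k to $49k", "$50k to $59k", "$60k to $69k", "$150k+"]
RANK = {b: i for i, b in enumerate(BRACKETS)}

def ordinal_sim(a, b):
    """1.0 for identical brackets, decaying linearly with rank distance."""
    return 1.0 - abs(RANK[a] - RANK[b]) / (len(BRACKETS) - 1)

print(ordinal_sim("$50k to $59k", "$60k to $69k"))  # 6/7 ≈ 0.857
print(ordinal_sim("$50k to $59k", "$150k+"))        # 5/7 ≈ 0.714
```

Swapping such an ordinal similarity into Eq. (11) for ordered categorical features would reward near-misses like adjacent income brackets instead of scoring them as complete mismatches.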

Conclusion and Future Directions

In this work, we propose a novel framework, SynC, for generating individual-level data from aggregated data sources, using state-of-the-art machine learning and statistical methods. To show the proposed framework's effectiveness and boost reproducibility, we provide the code and data for the Canada National Census example described in the Proposed Method section. We also present a real-world business use case to demonstrate its data augmentation capabilities.

As a first attempt to formalize the problem, we see multiple areas where future work can improve upon ours. First, our method relies on Gaussian copulae, and this can be further extended by leveraging other families of copulae to better model the underlying dependency structures. Secondly, we use beta and log-normal distributions to approximate marginal distributions for categorical and continuous variables, respectively, and other families of distributions could be considered (e.g., the κ-generalized model (Clementi et al. 2016) can be used for money-related distributions). Thirdly, a better metric can be designed to measure how accurately SynC-generated data matches the actual population; for example, the earth mover's distance may be a better alternative. Fourthly, including more geographical features can potentially increase the accuracy of predicted individual-level traits. Lastly, it is valuable to conduct more large-scale analysis with real-world datasets.

References

Barthelemy, J., and Toint, P. L. 2013. Synthetic population generation without a sample. Transportation Science 47(2):266–279.
Beckman, R. J.; Baggerly, K. A.; and McKay, M. D. 1996. Creating synthetic baseline populations. Transportation Research Part A: Policy and Practice 30(6):415–429.
Clementi, F.; Gallegati, M.; Kaniadakis, G.; and Landini, S. 2016. κ-generalized models of income and wealth distributions: A review. The European Physical Journal Special Topics 225:1959–1984.
Council of European Union. 2014. Council regulation (EU) no 269/2014. http://eur-lex.europa.eu/legal-content/EN/TXT/?qid=1416170084502&uri=CELEX:32014R0269.
Dobra, A.; Lenkoski, A.; et al. 2011. Copula Gaussian graphical models and their application to modeling functional disability data. The Annals of Applied Statistics 5(2A):969–993.
Frees, E. W., and Valdez, E. A. 1998. Understanding relationships using copulas. North American Actuarial Journal 2(1):1–25.
Huang, Z., and Williamson, P. 2001. A comparison of synthetic reconstruction and combinatorial optimisation approaches to the creation of small-area microdata. University of Liverpool.
Hull, J. C., and White, A. D. 2006. Valuing credit derivatives using an implied copula approach. The Journal of Derivatives 14(2).
Ireland, C. T., and Kullback, S. 1968. Contingency tables with given marginals. Biometrika 55(1):179–188.
Jeong, B.; Lee, W.; Kim, D.-S.; and Shin, H. 2016. Copula-based approach to synthetic population generation. PLoS ONE 11(8):e0159496.
Kao, S.-C.; Kim, H. K.; Liu, C.; Cui, X.; and Bhaduri, B. L. 2012. Dependence-preserving approach to synthesizing household characteristics. Transportation Research Record 2302(1):192–200.
Li, M.; Boehnke, M.; Abecasis, G. R.; and Song, P. X.-K. 2006. Quantitative trait linkage analysis using Gaussian copulas. Genetics 173(4):2317–2327.
Little, R. J., and Wu, M.-M. 1991. Models for contingency tables with known margins when target and sampled populations differ. Journal of the American Statistical Association 86(413):87–95.
Lojowska, A.; Kurowicka, D.; Papaefthymiou, G.; and van der Sluis, L. 2012. Stochastic modeling of power demand due to EVs using copula. IEEE Transactions on Power Systems 27(4):1960–1968.
Münnich, R., and Schürle, J. 2003. On the simulation of complex universes in the case of applying the German microcensus. DACSEIS Research Paper Series No. 4.
Nelsen, R. B. 2007. An Introduction to Copulas. Springer Science & Business Media.
Passow, B. N.; Elizondo, D.; Chiclana, F.; Witheridge, S.; and Goodyer, E. 2013. Adapting traffic simulation for traffic management: A neural network approach. In International Conference on Intelligent Transportation Systems, 1402–1407. IEEE.
Salvadori, G., and De Michele, C. 2004. Analytical calculation of storm volume statistics involving Pareto-like intensity-duration marginals. Geophysical Research Letters 31(4).
Salvadori, G.; Durante, F.; De Michele, C.; Bernardi, M.; and Petrella, L. 2016. A multivariate copula-based framework for dealing with hazard scenarios and failure probabilities. Water Resources Research 52(5):3701–3721.
Sklar, M. 1959. Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris 8:229–231.
Voas, D., and Williamson, P. 2000. An evaluation of the combinatorial optimisation approach to the creation of synthetic microdata. International Journal of Population Geography 6(5):349–366.
Wang, A. 2007. The analysis of bivariate truncated data using the Clayton copula model. The International Journal of Biostatistics 3(1).
Wong, G. 2013. A comparison between the Gumbel-Hougaard and distorted Frank copulas for drought frequency analysis. International Journal of Hydrology Science and Technology 3(1):77–91.
Wong, W.; Wang, X.; and Guo, Z. 2013. Optimizing marker planning in apparel production using evolutionary strategies and neural networks. In Optimizing Decision Making in the Apparel Supply Chain Using Artificial Intelligence (AI): From Production to Retail, Woodhead Publishing Series in Textiles, 106–131.
Zhao, Y.; Nasrullah, Z.; Hryniewicki, M. K.; and Li, Z. 2019. LSCP: Locally selective combination in parallel outlier ensembles. In Proceedings of the 2019 SIAM International Conference on Data Mining, 585–593. SIAM.
Zhao, Y.; Nasrullah, Z.; and Li, Z. 2019. PyOD: A Python toolbox for scalable outlier detection. Journal of Machine Learning Research 20(96):1–7.