MTCopula: Synthetic Complex Data Generation Using Copula
Fodil Benali, Damien Bodénès, Nicolas Labroche, Cyril de Runz
To cite this version: Fodil Benali, Damien Bodénès, Nicolas Labroche, Cyril de Runz. MTCopula: Synthetic Complex Data Generation Using Copula. 23rd International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data (DOLAP), 2021, Nicosia, Cyprus. pp. 51-60. hal-03188317

HAL Id: hal-03188317
https://hal.archives-ouvertes.fr/hal-03188317
Submitted on 1 Apr 2021

Fodil Benali, Damien Bodénès
Adwanted Group, Paris, France
{fbenali,dbodenes}@adwanted.com

Nicolas Labroche, Cyril de Runz
BDTLN - LIFAT, University of Tours, Blois, France
{nicolas.labroche,cyril.derunz}@univ-tours.fr

© Copyright 2021 for this paper held by its author(s). Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT

Nowadays, marketing strategies are data-driven, and their quality depends significantly on the quality and quantity of available data. As it is not always possible to access this data, there is a need for synthetic data generation. Most of the existing techniques work well for low-dimensional data and may fail to capture complex dependencies between data dimensions. Moreover, the tedious task of identifying the right combination of models and their respective parameters is still an open problem. In this paper, we present MTCopula, a novel approach for synthetic complex data generation based on Copula functions. MTCopula is a flexible and extendable solution that automatically chooses the best Copula model, between Gaussian Copula and T-Copula models, and the best-fitted marginals to capture the data complexity. It relies on Maximum Likelihood Estimation to fit the possible marginal distribution models and introduces the Akaike Information Criterion to choose both the best marginals and Copula models, thus removing the need for a tedious manual exploration of their possible combinations. Comparisons with state-of-the-art synthetic data generators on a real use case private dataset, called AdWanted, and literature datasets show that our approach better preserves the variable behaviors and the dependencies between variables in the generated synthetic datasets.

1 INTRODUCTION

Nowadays, data are the new gold. Unfortunately, it is difficult to get this valuable data, as companies sometimes do not have the means to collect large data sets relevant to their business. Others have difficulties sharing sensitive data due to business contract confidentiality or record privacy [25], which is the case of ad planning, our industrial context. In this specific context, only a small amount of high-quality, complex data (multidimensional, multivariate, categorical/continuous, time series, etc.), supposedly representative of the whole dataset, is available for generating a large and realistic synthetic dataset. Therefore, there is a real need for a realistic complex data generator.

Our objective is to generate new data that maintains the same characteristics as the original data, such as the distribution of attributes and the dependency between them. Moreover, it must structurally and formally resemble the original data so that any work done on the original data can be done using the synthetic data [21]. This cannot be done using the usual one-dimensional synthetic data generation method [17] because, when applied in a high-dimensional context, it does not allow modeling the dependency between variables. To tackle those issues, several recent works focused on deep learning approaches such as Generative Adversarial Networks (GAN), but those approaches require a large amount of data for the learning step and thus cannot be used for our problem.

Nevertheless, recently, there has been a growing interest in Copula-based models for estimating [1, 26] and sampling [10, 29] from a multivariate distribution function. Copulas [15] are joint probability distributions in which any univariate continuous probability distribution can be plugged in as a marginal. The Copula captures the joint behavior of the variables and models the dependence structure, whereas each marginal models the individual behavior of its corresponding variable. Thus, our problem turns into building a joint probability distribution that best fits the marginal distribution of each variable and allows capturing different dependencies between these variables. This problem is often understood as a structure learning task that can be solved in a constructive way while attempting to maximize the likelihood or some information theory criterion [22].

The Copula is a flexible mathematical tool that can support different configurations in terms of marginal fitting distributions and copula models. Choosing the best configuration is not simple. For instance, Copula-based data generators in the literature use the Gaussian Copula model, but this model has difficulty capturing tail dependencies, which may affect the quality of the data generation.

In this work, we present MTCopula, a flexible and extendable Copula-based approach to model and generate complex data (e.g., multivariate time series) with automatic optimization of Copula configurations. Our contributions are the following: (1) we formalize the problem of synthetic complex data generation, (2) we propose an approach, MTCopula, to learn Copulas and automatically choose the marginals and Copula models that best fit the data we want to generate, and (3) we describe experiments showing how well MTCopula preserves implicit relationships between variables in the synthetic datasets on a real use case and state-of-the-art datasets.

This paper is organized as follows: Section 2 presents the related works. Sections 3 and 4 introduce the main concepts related to dependency structures and Copulas. Section 5 provides the problem description, while Section 6 describes MTCopula, our solution to model and generate data with their structure dependencies. Section 7 presents the experiments performed to show the properties and the efficiency of our approach. Finally, Section 8 presents the conclusion and opens future works.

2 RELATED WORK

The fundamental idea of synthetic data generation is to sample data from a pre-trained statistical model, then use the sampled data in place of the original data. In this section, we study related works with regard to this preliminary notion and our problem, which is the generation of synthetic complex data. Complex data denotes a case where data can be a mixture of continuous and categorical variables, in a high-dimensional context, with the possibility of having temporal relations in the order of variables (time series) and dependencies in the tails of the variables' distributions.

First, our problem is not about generating data from specifications: it is rather about generating synthetic data from real data samples, which, for different reasons, are generally available in small quantities but with good quality. Therefore, approaches such as AutoUniv¹ cannot be applied.

Second, in the simplest case of one-dimensional synthetic data generation, sampling from a random variable $X$ with a known probability distribution $F$ is usually done using the classical Inverse Transform Sampling (ITS) approach [17], in which pseudo-random samples $U_1, \dots, U_N$ are generated from a uniform distribution $U$ on $[0, 1]$ and then transformed by $F_X^{-1}(U_1), \dots, F_X^{-1}(U_N)$. The issue with applying such an approach in high-dimensional synthetic data generation is that it does not allow modeling the dependency between variables. As a consequence, it generates an independent joint distribution. Therefore, this approach cannot capture the dependency structure, which is one of our problem's key elements.

Then, traditionally, a perturbation technique called General Additive Data Perturbation (GADP) has been widely used for synthetic data generation [14]. The principle consists in fitting a multivariate Gaussian distribution on the input data, $X \sim \mathcal{N}(\mu, \Sigma)$. After that, the estimated multivariate Gaussian variable $X$ is used to generate [...]

[...] a need for parameter calibration automation. Before introducing the Copula, we present the dependency structure notions in the next section.

3 DEPENDENCY STRUCTURES

One of our goals is to capture the dependency structure relationship $D$ between data/variables to finally be able to generate data respecting those dependencies. This section focuses on the main measures used to summarize dependency between components of a random vector.

3.1 Pearson Product-Moment Correlation

The Pearson product-moment correlation $\rho$ is a measure of the linear relationship between two random variables $X_1$, $X_2$. A relationship is linear when a change in one variable is associated with a proportional change in the other variable. Pearson correlation takes values in the interval $[-1, 1]$, and it is defined as:

$$\rho(X_1, X_2) = \mathrm{Cor}(X_1, X_2) = \frac{\mathrm{Cov}(X_1, X_2)}{\sqrt{\mathrm{Var}(X_1)}\,\sqrt{\mathrm{Var}(X_2)}}. \qquad (1)$$
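As a hedged illustration of the Pearson correlation of Equation (1) and of the ITS limitation noted in Section 2, the following Python sketch samples each variable separately by inverse transform and compares the Pearson correlation before and after. It is not part of MTCopula: the toy data, the helper names `inverse_transform_sample` and `pearson`, and the use of the empirical quantile function (`numpy.quantile`) as a stand-in for $F_X^{-1}$ are all assumptions made for this example.

```python
import numpy as np


def inverse_transform_sample(inv_cdf, n, rng):
    """Draw n samples by pushing uniform draws on [0, 1] through an inverse CDF."""
    u = rng.uniform(0.0, 1.0, size=n)
    return inv_cdf(u)


def pearson(x1, x2):
    """Pearson correlation: Cov(X1, X2) / (sqrt(Var(X1)) * sqrt(Var(X2)))."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    cov = np.mean((x1 - x1.mean()) * (x2 - x2.mean()))
    return cov / (np.sqrt(x1.var()) * np.sqrt(x2.var()))


rng = np.random.default_rng(0)
n = 20_000

# Hypothetical "original" data: x2 depends linearly on x1 plus noise.
x1 = rng.exponential(scale=2.0, size=n)
x2 = 3.0 * x1 + rng.normal(0.0, 1.0, size=n)
rho_original = pearson(x1, x2)

# Per-variable ITS. Assumption: the empirical quantile function of each
# column plays the role of the inverse CDF of that variable.
s1 = inverse_transform_sample(lambda u: np.quantile(x1, u), n, rng)
s2 = inverse_transform_sample(lambda u: np.quantile(x2, u), n, rng)
rho_synthetic = pearson(s1, s2)

print(f"Pearson correlation, original data: {rho_original:.3f}")
print(f"Pearson correlation, per-variable ITS: {rho_synthetic:.3f}")
```

Under these assumptions, the original correlation is close to 1 while the correlation of the independently sampled columns is close to 0: each marginal is reproduced, but the dependency between the variables is lost.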