
MTCopula: Synthetic Complex Data Generation Using Copula


Fodil Benali, Damien Bodénès
Adwanted Group, Paris, France
{fbenali,dbodenes}@adwanted.com

Nicolas Labroche, Cyril de Runz
BDTLN - LIFAT, University of Tours, Blois, France
{nicolas.labroche,cyril.derunz}@univ-tours.fr

ABSTRACT

Nowadays, marketing strategies are data-driven, and their quality depends significantly on the quality and quantity of the available data. As it is not always possible to access this data, there is a need for synthetic data generation. Most of the existing techniques work well for low-dimensional data and may fail to capture complex dependencies between data dimensions. Moreover, the tedious task of identifying the right combination of models and their respective parameters is still an open problem. In this paper, we present MTCopula, a novel approach for synthetic complex data generation based on Copula functions. MTCopula is a flexible and extendable solution that automatically chooses the best Copula model, between the Gaussian Copula and T-Copula models, and the best-fitted marginals to catch the data complexity. It relies on Maximum Likelihood Estimation to fit the possible marginal distribution models and introduces the Akaike Information Criterion to choose both the best marginals and the best Copula model, thus removing the need for a tedious manual exploration of their possible combinations. Comparisons with state-of-the-art synthetic data generators on a real use case private dataset, called AdWanted, and on literature datasets show that our approach better preserves the variable behaviors and the dependencies between variables in the generated synthetic datasets.

1 INTRODUCTION

Nowadays, data are the new gold. Unfortunately, it is difficult to get this valuable data, as some companies do not have the means to collect large data sets relevant to their business. Others have difficulties sharing sensitive data due to business contract confidentiality or record privacy [25], which is the case of ad planning, our industrial context. In this specific context, only very few high quality and complex data (multidimensional, multivariate, categorical/continuous, time series, etc.), supposedly representative of the whole dataset, are available for generating a large and realistic synthetic dataset. Therefore, there is a true need for a realistic complex data generator.

Our objective is to generate new data that maintains the same characteristics as the original data, such as the distribution of attributes and the dependency between them. Moreover, it must structurally and formally resemble the original data, so that any work done on the original data can be done using the synthetic data [21]. This cannot be achieved with the usual one-dimensional synthetic data generation method [17] because, when applied in a high-dimensional context, it does not model the dependency between variables. To tackle those issues, several recent works focused on deep learning approaches such as Generative Adversarial Networks (GAN), but those approaches require a large amount of data for the learning step and thus cannot be used for our problem.

Nevertheless, recently, there has been a growing interest in Copula-based models for estimating [1, 26] and sampling [10, 29] from a multivariate distribution function. Copulas [15] are joint probability distributions in which any univariate continuous distribution can be plugged in as a marginal. The Copula captures the joint behavior of the variables and models the dependence structure, whereas each marginal models the individual behavior of its corresponding variable. Thus, our problem turns into building a joint probability distribution that best fits the marginal distribution of each variable and allows capturing different dependencies between these variables. This problem is often understood as a structure learning task that can be solved in a constructive way while attempting to maximize the likelihood or some information theory criterion [22].

The Copula is a flexible mathematical tool that can support different configurations in terms of marginal fitting distributions and copula models. Choosing the best configuration is not simple. For instance, the Copula-based data generators in the literature use the Gaussian Copula model, but this model has difficulties capturing tail dependencies, which may affect the quality of the data generation.

In this work, we present MTCopula, a flexible and extendable Copula-based approach to model and generate complex data (e.g., multivariate time series) with automatic optimization of Copula configurations. Our contributions are the following: (1) we formalize the problem of synthetic complex data generation, (2) we propose an approach, MTCopula, to learn Copulas and automatically choose the marginals and Copula models that best fit the data we want to generate, and (3) we describe experiments showing how well MTCopula preserves implicit relationships between variables in the synthetic datasets on a real use case and on state-of-the-art datasets.

This paper is organized as follows: Section 2 presents the related works. Sections 3 and 4 introduce the main concepts related to dependency structures and Copulas. Section 5 provides the problem description, while Section 6 describes MTCopula, our solution to model and generate data with their structure dependencies. Section 7 presents the experiments performed to show the properties and the efficiency of our approach. Finally, Section 8 presents the conclusion and opens future works.

2 RELATED WORK

The fundamental idea of synthetic data generation is to sample data from a pre-trained model, then use the sampled data in place of the original data. In this section, we study related works with regard to this preliminary notion and our problem, which is the generation of synthetic complex data. Complex data denotes a case where data can be a mixture of continuous and categorical variables, in a high-dimensional context, and with the possibility of having temporal relations in the order of variables (time series) and dependencies in the tails of the variables' distributions.

First, our problem is not about generating data from specifications: it is rather about generating synthetic data from real data samples, which, for different reasons, are generally available in small quantities but with good quality. Therefore, approaches such as AutoUniv (https://archive.ics.uci.edu/ml/datasets/AutoUniv) cannot be applied.

Second, in the simplest case of one-dimensional synthetic data generation, sampling from a random variable $X$ with a known probability distribution $F$ is usually done using the classical Inverse Transform Sampling (ITS) approach [17], in which pseudo-random samples $U_1, \dots, U_N$ are generated from a uniform distribution $U$ on $[0, 1]$ and then transformed by $F_X^{-1}(U_1), \dots, F_X^{-1}(U_N)$. The issue with applying such an approach to high-dimensional synthetic data generation is that it does not allow modeling the dependency between variables. As a consequence, it generates an independent joint distribution. Therefore, this approach cannot capture the dependency structure, which is one of our problem's key elements.

Then, traditionally, a perturbation technique called General Additive Data Perturbation (GADP) has been widely used for synthetic data generation [14]. The principle consists in fitting a multivariate Gaussian distribution on the input data, $X \sim \mathcal{N}(\mu, \Sigma)$. After that, the estimated multivariate Gaussian variable $X$ is used to generate the synthetic data $Y$ by adding a noise variable $e$, $Y = X + e$, where $e$ is a Gaussian error. The problem with this method is that it does not allow us to best model the marginal behaviors of the variables, since it considers only Gaussian marginal distributions by construction, which can be limiting as observed in our experiments. Moreover, it does not model the tail dependence, as it considers the correlation matrix $\Sigma$ only. Another variant of GADP is the Dirichlet multivariate synthesizer based on MLE [24]. The problem with MLE for multivariate distribution fitting is that it has to be maximized over a potentially high-dimensional parameter space, which is computationally very expensive.

The rise of deep learning in the last years has brought forth new techniques such as generative adversarial networks (GANs) [18, 23]. These techniques perform better than state-of-the-art works in many fields but require large datasets for training, which can be a significant problem because collecting data is often expensive or time-consuming. Even when data is already collected, this type of method cannot be applied due to privacy or confidentiality issues. Moreover, GANs, like most deep learning approaches, act as a black box and do not allow a business expert to understand how the synthetic data are actually generated.

Recently, there has been a growing interest in Copula-based modeling and synthetic data generation. Despite the fact that Copula models can best model the dependencies and the marginal behaviors of variables, most contributions suggested for synthetic data generation [10, 19] have focused on a single model: the Gaussian Copula. However, this model assumes a structure dependency that may only loosely capture the dependence between variables [11], as it does not allow to model the tail dependence. In addition, these contributions use the Pearson correlation factor to estimate the correlation matrix, which is not invariant under strictly monotone non-linear transformations, while this hypothesis is crucial in the Copula's context. As a consequence, this impacts the preservation of the dependency structure during the copula fitting. Nevertheless, Copulas, with both their marginal fittings and their dependency structure, allow for a transparent explanation of the generated data.

In conclusion, Copulas seem to be the best solution for generating datasets based on complex tiny real datasets, but there is a need for parameter calibration automation. Before introducing the Copula, we present the dependency structure notions in the next section.

3 DEPENDENCY STRUCTURES

One of our goals is to capture the dependency structure relationship $\mathcal{D}$ between data/variables to finally be able to generate data respecting those dependencies. This section focuses on the main measures used to summarize the dependency between the components of a random vector.

3.1 Pearson Product-Moment Correlation

The Pearson product-moment correlation $\rho$ is a measure of the linear relationship between two random variables $X_1$, $X_2$. A relationship is linear when a change in one variable is associated with a proportional change in the other variable. Pearson correlation takes values in the interval $[-1, 1]$, and it is defined as:

$$\rho(X_1, X_2) = Cor(X_1, X_2) = \frac{Cov(X_1, X_2)}{\sqrt{Var(X_1)}\sqrt{Var(X_2)}}. \quad (1)$$

The problem with the Pearson correlation $\rho$ is that it is not invariant under non-linear strictly increasing transformations of the marginals [9].

3.2 Rank Correlation

In practice, we often have a monotonic relationship between measurements, in which variables tend to change together, but not necessarily at a constant rate. In this case, rank correlation statistics are well suited for determining whether there is a correspondence between random variables. We mention here the two important rank correlation measures, namely Spearman and Kendall.

Definition 3.1 (Spearman $\rho_s$ correlation). Let $(X_1, X_2)$ be a bivariate random vector with continuous marginal distribution functions $F_1$ and $F_2$. The Spearman's factor $\rho_s$ is defined by:

$$\rho_s(X_1, X_2) = \rho(F_1(X_1), F_2(X_2)). \quad (2)$$

Definition 3.2 (Kendall's $\tau$ correlation). Kendall's $\tau$ is defined as the probability of concordance minus the probability of discordance of two random variables $X_1$ and $X_2$:

$$\tau(X_1, X_2) = P((X_{11} - X_{12})(X_{21} - X_{22}) > 0) - P((X_{11} - X_{12})(X_{21} - X_{22}) < 0), \quad (3)$$

where $(X_{11}, X_{21})$ and $(X_{12}, X_{22})$ are independent and identically distributed copies of $(X_1, X_2)$.

Both Kendall's $\tau$ and Spearman's $\rho_s$ are dependence measures invariant with respect to monotone transformations of the marginals. Their range of values is the interval $[-1, 1]$ [3].

3.3 Tail Dependence

Understanding the dependence structure of rare events is fundamental in order to best model the behaviors of random variables. Measures of dependence like the Pearson linear correlation and the Spearman and Kendall correlations are not able to correctly capture and characterize the joint occurrence of large and small values of random variables [8]. The Pearson correlation describes how well two random variables are linearly correlated with respect to their entire distribution. However, this information is not useful to model the extreme behavior of two random variables [27]. To evaluate tail dependence, the tail dependence coefficient is calculated as follows:

Definition 3.3 (Upper and lower tail dependence coefficient). The upper tail dependence coefficient of a bivariate distribution is defined as:

$$\lambda^{upper} = \lim_{t \to 1^-} P(X_2 > F_2^{-1}(t) \mid X_1 > F_1^{-1}(t)). \quad (4)$$

The lower tail dependence coefficient is:

$$\lambda^{lower} = \lim_{t \to 0^+} P(X_2 \le F_2^{-1}(t) \mid X_1 \le F_1^{-1}(t)). \quad (5)$$
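For readers who want to reproduce these measures, the short sketch below computes the Pearson (Eq. 1), Spearman (Eq. 2) and Kendall (Eq. 3) coefficients with scipy, together with a naive empirical estimate of the upper tail dependence coefficient of Eq. 4 at a fixed level t. The simulated data, the threshold t = 0.95 and the function name are illustrative assumptions, not material from the paper.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Two positively dependent variables (illustrative data only).
x1 = rng.normal(size=5000)
x2 = 0.7 * x1 + 0.3 * rng.normal(size=5000)

pearson_rho, _ = stats.pearsonr(x1, x2)    # linear correlation (Eq. 1)
spearman_rho, _ = stats.spearmanr(x1, x2)  # rank correlation (Eq. 2)
kendall_tau, _ = stats.kendalltau(x1, x2)  # concordance-based correlation (Eq. 3)

def empirical_upper_tail(u, v, t=0.95):
    """Empirical estimate of P(V > F_V^-1(t) | U > F_U^-1(t)) (Eq. 4) at a fixed level t."""
    qu, qv = np.quantile(u, t), np.quantile(v, t)
    above_u = u > qu
    return np.mean(v[above_u] > qv)

print(pearson_rho, spearman_rho, kendall_tau, empirical_upper_tail(x1, x2))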
Using those definitions, we are now able to introduce the Copula, on which our approach is based.

4 COPULA

This section is devoted to summarizing Copula principles, as they are the key part of data generation that conserves dependencies. A deeper explanation about copulas can be found in [15].

4.1 Copula Foundations

Copula is a Latin term which means link. In recent years, due to its ability to catch the core of multivariate data distributions and their dependencies, the copula has been applied in a wide range of areas such as econometric modeling [20] and quantitative risk management [12].

This concept was first introduced in statistical modeling in 1959 by Sklar [28] to describe the function that "joins together" one-dimensional distribution functions to form a multivariate distribution function. It is based on Sklar's Theorem 4.1.

Theorem 4.1 (Sklar's theorem). Let $(X_1, \dots, X_j, \dots, X_d)$ be a d-dimensional random vector with joint distribution function $H$ and marginal distribution functions $F_i$, $i = 1, \dots, d$; then there exists a d-copula $C : [0, 1]^d \to [0, 1]$ such that, for all $x$ in $\mathbb{R}^d$, the joint distribution function can be expressed as:

$$H(x_1, \dots, x_j, \dots, x_d) = C(F_1(x_1), \dots, F_j(x_j), \dots, F_d(x_d)), \quad (6)$$

with associated density function $h$, expressed as the multiplication of the copula density function $c$ and the marginal densities:

$$h(x_1, \dots, x_d) = c(F_1(x_1), \dots, F_d(x_d)) \times \prod_{k=1}^{d} f_k(x_k). \quad (7)$$

Conversely, the Copula $C$ corresponding to a multivariate distribution function $G$ with marginal distribution functions $F_i$ for $i = 1, \dots, d$ can be expressed as:

$$C(u_1, \dots, u_d) = G(F_1^{-1}(u_1), \dots, F_d^{-1}(u_d)), \quad \forall (u_1, \dots, u_d) \in [0, 1]^d, \quad (8)$$

where $u_i = F_i(x_i)$ and $F_i^{-1}$ is the inverse of the marginal distribution function $F_i$.

The first equation of Sklar's Theorem (Eq. 6) describes the role of the Copula function, which is connecting or coupling the marginal distribution functions $F_1, \dots, F_d$ to form the multivariate distribution function $H$. This allows large flexibility in constructing statistical models by considering, separately, the univariate behavior of the components of a random vector and their dependence properties captured by some copula. In particular, Copulas can serve for modeling situations where a different distribution is needed for each marginal, providing a valid substitute to several classical multivariate distribution functions such as the Gaussian, Laplace, Gamma, Dirichlet, etc. This particularity represents one of the main advantages of the Copula concept, as explained by Mikosch [13]: "[Copulas] generate all multivariate distributions with flexible marginals".

Equation 8 describes the construction of the Copula that captures and estimates the dependence between the standardized variables [3]. A typical example of this construction is the Gaussian Copula, which is obtained by taking $G$ in (Eq. 8) as the multivariate standard Gaussian distribution function. This illustrates the founding principle of the Copula, which states that the dependence of the data can be modeled independently from the marginals. It is thus possible to represent different original distributions just by changing the marginal distributions.

Real-world high-dimensional data may have different marginals and joint distributions. Therefore, Copulas seem to be the right tools to overcome these difficulties.

4.2 The Invariance Principle Of Copula

Here, we would like to mention one of the principal properties of copulas inferred from Sklar's Theorem 4.1. This theorem is central for data generation using copulas, as it guarantees that the normalization applied to the marginals by their respective cumulative distribution functions $F_i$ does not alter the measure of dependence between the variables that we want to capture with the copula.

Theorem 4.2 (Invariance Principle of Copula). Let $X = (X_1, \dots, X_j, \dots, X_d)$ be a d-dimensional random vector with continuous joint distribution $H$, marginal distribution functions $F_i$, $i = 1, \dots, d$, and a copula $C$. Let $Tr_1, \dots, Tr_d$ be strictly increasing transformations on the ranges of $X_1, \dots, X_d$ respectively. Then $C$ is also the copula of the random vector $(Tr_1(X_1), \dots, Tr_j(X_j), \dots, Tr_d(X_d))$.

Thus, Copulas, which describe the dependence of the components of a random vector, are invariant under increasing transformations of each variable. The power of this theorem manifests itself when moving from the multivariate distribution function ($H$) to the corresponding random vectors ($X$), in particular when we want to sample from a multivariate distribution function. It gives us guarantees about dependency preservation when standardizing the variables with their marginal distributions in order to capture the dependency, by taking $Tr_i = F_i$ (the cumulative distribution functions $F_i$ are strictly increasing by construction). After that, in order to return to the original data shape, we apply the inverse distribution $F_i^{-1}$ (or the quasi-inverse), since $Tr_i = (F_i^{-1} \circ F_i)(x_i)$ is a strictly increasing transformation on the range of $X_i$.

4.3 Families Of Copulas

In practice, there are many bivariate Copula families, like the elliptical copulas, the Archimedean Copulas, and the extreme-value Copulas [3], but only a few multivariate ones. This section focuses on the elliptical family because it contains two multivariate Copulas, the Gaussian Copula and the T-Copula.

4.3.1 Multivariate Gaussian Copula.

Definition 4.3 (Multivariate Gaussian Copula). The multivariate Gaussian Copula is the result of applying the inverse statement of Sklar's theorem (Eq. 8) to the multivariate Gaussian distribution with zero mean vector and correlation matrix $P$.

The main drawback of the Gaussian Copula is that it does not allow capturing tail dependence. The upper and lower tail dependence coefficients between two variables $(X_i, X_j)$ with correlation factor $\rho$ are the same and are given by [3]:

$$\lambda = \lim_{x \to \infty} 2\left(1 - \Phi\left(x\sqrt{\frac{1-\rho}{1+\rho}}\right)\right) = 0. \quad (9)$$
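As a quick numerical illustration of Eqs. 9 and 10 (a sketch, not code from the paper), the snippet below evaluates the theoretical tail dependence coefficient of a bivariate T-Copula for several degrees of freedom; as ν grows, the coefficient shrinks toward the Gaussian Copula value of 0. The chosen ρ and ν values are arbitrary.

import numpy as np
from scipy import stats

def t_copula_tail_dependence(rho, nu):
    """Upper (= lower) tail dependence coefficient of a bivariate T-Copula (Eq. 10)."""
    arg = -np.sqrt((nu + 1.0) * (1.0 - rho) / (1.0 + rho))
    return 2.0 * stats.t.cdf(arg, df=nu + 1)

rho = np.sin(np.pi / 2 * 0.7)  # correlation implied by Kendall's tau = 0.7 (Eq. 11)
for nu in (3, 10, 50, 1000):
    # Large nu approaches the Gaussian Copula, whose coefficient is exactly 0 (Eq. 9).
    print(nu, t_copula_tail_dependence(rho, nu))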
Further 푇 is the 푛 휈 1 Õ inverse of the univariate standard student c.d.f. 푇휈 . The main 퐹ˆ (푥) = 1 푓 표푟 푎푙푙 푥. (14) 푗 푛 + 1 {푥푖 ≤푥 } advantage of the T-Copula comparing to the Gaussian Copula is 푖=1 its ability to capture the tail dependence among extreme values 푢푝푝푒푟 4.4.2 Copula Fitting. In both previous cases, we end up with [16]. The upper tail dependence coefficient 휆푖 푗 between two data on the Copula scale, which will be used to estimate the variables (푋푖 , 푋푗 ) is equal to lower tail dependence coefficient Copula parameters 휃 of the chosen multivariate Copula family: 푙표푤푒푟 휆푖 푗 , because T-Copula is symmetric and is given by: p ! (푢 , ...,푢 ) = (퐹ˆ (푥 ), .., 퐹ˆ (푥 )), 푓 표푟 푖 = 1, .., 푛. (15) √ 1 − 휌푖 푗 푖1 푖푑 1 푖1 푑 푖푑 = − + 휆푖 푗 2푇휈+1 휈 1 p . (10) Similar to marginal distribution parameters estimation, one 1 + 휌푖 푗 method is Maximum likelihood estimation, which is commonly 4.3.3 Illustration. To compare T-Copula and Gaussian Cop- used to estimate the parameters vector 휃 of the Copula-based ula’s ability to capture tail dependence, Figure 1 shows two scatter on pseudo-Copula data. If parametric marginal models (Eq.13) plots that represent a bivariate distribution constructed using the are used, then we talk about inference for marginals approach two mentioned Copulas. (IFM)[6] and if the empirical distribution of (Eq.14) is applied One important common characteristic in this comparison is then we have a semi-parametric approach [5] also known as that both Copulas use the Kendall’s 휏 of two random variables (푋푖 , Canonical MLE (CMLE), and the is given by: 푡 푋푗 ) that has the same form for both T-Copula 퐶푃,휈 and Gaussian 푛 Copula 퐶Φ and it is defined by [4]: Ö 푃 L(휃 |푢1, ..,푢푗 , ..,푢푑 ) = 푐(푢푖1, ..,푢푖 푗 , ..,푢푖푑 |휃). (16) 휋 푖=1 휌 = 푠푖푛( 휏 (푋 , 푋 )), (11) 푖 푗 2 푖 푗 The success of the first approach (IFM) depends on finding where 휌푖 푗 is the Pearson correlation between the pair (푋푖 , 푋푗 ). appropriate parametric models for the marginals. If the marginals As we can notice from the lower left and upper right corners of are misidentified, the estimated parameter vector 휃 will be biased the two scatter plots, the constructed bivariate distributions have [7]. significantly different behavior in their bivariate tails, although Finally, another simple method, called the method of moments, they have the same marginals and correlation factor. In fact, in is based on the invariance property of Kendall’s 휏 under strictly the Gaussian Copula (left scatter), there seems to be no strong increasing transformations of the marginals. The method consists dependence in the lower left and upper right corners, while the of calculating Kendall’s 휏 for each bivariate marginal of the Cop- T-Copula with three degrees of freedom (right scatter) emerges ula and then using relationship in (Eq.11) to infer an estimate of to have more mass and more structure in the lower and upper the entire correlation matrix 푃 of the considered elliptical Copula tail. (Gaussian or T) [3]. In the case of T-Copula, to estimate the remaining parameter 4.4 Copula Learning 휈, MLE is generally used with correlation matrix held fixed [4]. Estimating Copula 퐶 as in (Eq.6) that belongs to a parametric fam- 5 PROBLEM FORMULATION ily of Copulas 퐶휃 such as the 푇 and 퐺푎푢푠푠푖푎푛 Copula, consists in estimating the vector 휃 of unknown parameters. If the marginal Our objective is, given a set of complex and representative obser- distribution 퐹1, ...,퐹푑 are known, the following sample would rep- vations (e.g. 
media channels with their user targets and respective resent independent, identically distributed (푖푖푑) random samples daytime audiences) 퐿표 , to generate a synthetic dataset 퐿푠 which is of Copula. similar to the original dataset 퐿표 under the following properties. Figure 2: Copula Learning Process.
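A minimal sketch of the two-step learning process of Section 4.4, assuming only numpy/scipy and a Beta candidate marginal chosen purely for illustration: parametric marginal fitting (Eq. 13), probability integral transform to the copula scale (Eq. 15), and Kendall's τ inversion (Eq. 11) for the correlation matrix of an elliptical copula. This is not the MTCopula implementation, only an outline of the steps schematized in Figure 2.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
raw = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=2000)
data = stats.norm.cdf(raw)  # illustrative bounded data in [0, 1], not the AdWanted data

# Step 1: parametric marginal fitting by MLE (Eq. 13), with a Beta candidate per column.
marginal_params = [stats.beta.fit(data[:, j]) for j in range(data.shape[1])]

# Step 2: probability integral transform to the copula scale (Eq. 15) ...
U = np.column_stack([stats.beta.cdf(data[:, j], *marginal_params[j])
                     for j in range(data.shape[1])])

# ... then Kendall's tau inversion (Eq. 11) gives the correlation matrix P of the
# elliptical copula (Gaussian or T) on which the copula fitting step operates.
tau, _ = stats.kendalltau(U[:, 0], U[:, 1])
rho = np.sin(np.pi / 2 * tau)
P = np.array([[1.0, rho], [rho, 1.0]])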

5 PROBLEM FORMULATION

Our objective is, given a set of complex and representative observations (e.g., media channels with their user targets and respective daytime audiences) $L_o$, to generate a synthetic dataset $L_s$ which is similar to the original dataset $L_o$ under the following properties:

• For each attribute (variable) in the dataset, the generated values must be consistent with the distribution of the variable.
• The dependence between variables must remain the same in the new dataset.

This objective can be reformulated as: automatically find the statistical model that best fits the process of data generation. Therefore, using Copulas and according to Section 4.4, this can be done by, first, estimating the marginals' parameters and, second, estimating the Copula distribution parameters. The fitting will almost never be exact, so the problem consists of determining the model parameters that minimize the relative amount of lost information.

In the literature, the Akaike Information Criterion (AIC) [2] is often used to this extent, but not in the context of automatic determination of the best marginals or Copula models for data generation. Noticeably, AIC provides a trade-off between the goodness of fit and the model's simplicity by penalizing proportionally to the number of parameters. This, in turn, allows decreasing the risk of overfitting and underfitting at the same time. In what follows, we formulate our problem based on AIC without loss of generality, as any other test could have been used, such as the Kolmogorov-Smirnov test, which does not penalize models with more parameters. Based on AIC, our synthetic data generation problem becomes the following two-step optimization problem:

(1) Sampling values consistent with each variable behavior consists in finding the corresponding marginal distribution density function $(f_j, \gamma_j)$ such that:

$$\text{minimize } AIC = 2k - 2\ln(\hat{\mathcal{L}}(\hat\gamma_j \mid x_j)), \quad j = 1, \dots, d, \quad (17)$$

where $\hat{\mathcal{L}}(\hat\gamma_j \mid x_j) = \prod_{i=1}^{n} f_j(x_{ij} \mid \hat\gamma_j)$ represents the maximized likelihood function of a candidate marginal density $f_j$ with a $k$-dimensional vector of parameters $\hat\gamma_j$ given by:

$$\hat\gamma_j = \arg\max_{\gamma_j} \prod_{i=1}^{n} f_j(x_{ij};\gamma_j). \quad (18)$$

(2) Characterizing the inter-dependency behavior of the variables together consists in finding the joint distribution density (copula parameters) $(h, \theta)$ that:

$$\text{minimize } AIC = 2k - 2\ln(\hat{\mathcal{L}}(\hat\theta \mid x_1, \dots, x_j, \dots, x_d)), \quad (19)$$

where $\hat{\mathcal{L}}(\hat\theta \mid x_1, \dots, x_j, \dots, x_d) = \prod_{i=1}^{n} h(x_{i1}, \dots, x_{ij}, \dots, x_{id} \mid \hat\theta)$ is the maximized likelihood of the model $h$ with parameters $\theta$, and $k$ is the number of parameters. $\hat\theta$ is given by:

$$\hat\theta = \arg\max_{\theta} \prod_{i=1}^{n} h(x_{i1}, \dots, x_{ij}, \dots, x_{id};\theta). \quad (20)$$
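To make the copula-level criterion of Eq. 19 concrete, the hedged sketch below computes the AIC of a Gaussian Copula candidate on pseudo-observations U in (0, 1), using its closed-form log-density; counting k as the d(d-1)/2 free correlation parameters is our assumption, not a convention stated in the paper, and the function name is ours.

import numpy as np
from scipy import stats

def gaussian_copula_aic(U, P):
    """AIC (Eq. 19) of a Gaussian Copula with correlation matrix P on pseudo-observations U.

    Gaussian copula log-density: log c(u) = -0.5*log|P| - 0.5 * z^T (P^{-1} - I) z,
    with z = Phi^{-1}(u); summed over the n observations it gives ln L_hat(theta_hat).
    """
    z = stats.norm.ppf(U)
    d = P.shape[0]
    _, logdet = np.linalg.slogdet(P)
    quad = np.einsum("ij,jk,ik->i", z, np.linalg.inv(P) - np.eye(d), z)
    log_lik = np.sum(-0.5 * logdet - 0.5 * quad)
    k = d * (d - 1) // 2  # assumed parameter count: the free entries of P
    return 2 * k - 2 * log_lik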
6 SOLUTION DESCRIPTION

This section illustrates the general problem and describes its solution in the specific context of complex data generation with multivariate time series paired with categorical variables, as found in our problem of media channel data generation. Our system, which is called MTCopula, is broken down into three steps: (1) data preparation, (2) copula model learning, and (3) synthetic data generation. Noticeably, only step (1) is specific to our problem, while steps (2) and (3) are entirely generic to any complex synthetic data generation scenario.

6.1 Data Preparation

6.1.1 General Pipeline. The Copula, as a multivariate distribution function, requires a continuous representation of independent and identically distributed d-dimensional random variables. Due to this requirement, the multiple multivariate time series in the input must be preprocessed before learning the Copula model that, in a next step, generates the synthetic data. Figure 3 illustrates the different steps of our data preparation process.

Figure 3: Data Preparation Workflow.

Our preprocessing includes data cleaning, which consists of first removing missing values and normalizing the data representation (e.g., lower casing). Then, each column representation of the multivariate time series data is converted into a row representation of multiple time series. This changes the observation structure and, as a consequence, removes the dependence due to the time series nature, where an observation at time t depends on previous time slots. In our case, the multivariate time series is defined by 6 time-dependent variables, {Women, Men} x {13-34 years, 34-65 years, 65+ years}, and two categorical features, the media channel and the day of the week, as visible in the first table in Figure 3. Each one will produce six time series paired with a vector of three categorical variables (Target, Channel, and Day). As a result of the preprocessing step, we have a set of independent and identically distributed observations defined by a vector of continuous and categorical (discrete) variables, as shown in Figure 3.

6.1.2 Categorical Variables Encoding and Copulas. Categorical data cannot be modeled directly by the Copula, so we propose to replace it with continuous data. To this end, we consider two options. The first option consists of only considering distribution-based encoding, but it fails to model the dependence between values of a categorical variable.

The second option consists of first performing a one-hot encoding to capture the dependence between values of the same categorical variable. Applying this to the Target variable allows modeling the multivariate dependence between the different values of this variable (women 13-34, men 13-34, women 34-65, men 34-65, women 65+, men 65+) and, as a consequence, models the multivariate time series behavior. The distribution-based encoding technique is then used in order to transfer the discrete representation of the categorical variable to a continuous representation in the range [0, 1]. Figure 4 illustrates the distribution-based encoding technique using a Truncated Gaussian. This process gives dense areas at the center of each interval and ensures that the numbers are well differentiated. This facilitates the inverse process (decoding): given a value v in [0, 1], we can identify the corresponding category based on the value interval. Once the categorical variables are transformed, we have a set of observations of d-dimensional continuous random variables (Table 3 in Figure 3). This dataset is the input of the next step, which estimates the copula parameters.

Figure 4: Categorical (Working Day, Saturday, Sunday) to Continuous Data Encoding using a Truncated Gaussian.

6.2 Copula Model Learning

As explained in Section 4.4, the Copula learning process is done in two steps: the marginal distribution fitting and the Copula fitting.

6.2.1 Marginal Distribution Fitting. Our system proposes two methods to estimate the marginal distributions. The first one is non-parametric, via the empirical distribution, as described in (Eq. 14), and the second one is parametric and uses MLE (Eq. 13). Algorithm 1 presents the steps of MLE to fit the marginals and, most importantly, AIC to automate the choice of the best marginal distribution among a set of preselected distributions. Currently, we choose, without loss of generality, among the following bounded distributions: Truncated Gaussian, GaussianKDE (Kernel Density Estimator), Beta, Truncated Exponential, and Uniform.

Algorithm 1: Marginal Distribution Fitting and Selection Using Maximum Likelihood and AIC.
Input: L_j, a dataset of n observations of the random variable X_j.
Output: the best fitted distribution F_j with estimated parameters g_j.
1   distributions = {Truncated Gaussian, GaussianKDE, Beta, Truncated Exponential, Uniform, or any bounded distribution}
2   best_aic = +infinity
3   for dist in distributions do
4       fitted_params = Fit(dist, L_j, method = 'maximum likelihood')
5       aic = AIC(dist, fitted_params)
6       if aic <= best_aic then
7           best_aic = aic
8           g_j = fitted_params
9           F_j = CDF(dist)
10      end
11  end

The estimated marginal distributions are used to construct pseudo-Copula observations via the probability integral transformation, as described in (Eq. 15). A criterion, such as AIC, is used to select the copula C that best fits the pseudo-Copula data and characterizes the dependence between the marginals. Algorithm 2 presents the steps of the Copula fitting using AIC.

Algorithm 2: Copula Fitting with AIC.
Input: Dataset L of n observations of a d-dimensional vector X, a method m (e.g., Kendall tau inversion) for parameter estimation, and marginal distributions F_1, ..., F_d.
Output: the best fitted copula C with estimated parameters theta.
1   copulas = {Gaussian Copula, T-Copula}
2   best_aic = +infinity
3   copula_data = standardize(L, F_1, ..., F_d)
4   for copula in copulas do
5       fitted_params = Fit(copula, copula_data, method = m)
6       aic = AIC(copula, fitted_params)
7       if aic <= best_aic then
8           best_aic = aic
9           theta = fitted_params
10          C = copula
11      end
12  end

6.2.2 Copula Fitting. Most of the works done in synthetic data generation based on Copulas use a Gaussian copula with an MLE approach to estimate the marginals. Our system gives flexibility in terms of Copula model choice based on AIC, which, in turn, allows learning different Copula models and choosing the model that best fits the input data. For the moment, we fit two models, the Gaussian and the T-Student Copula, as they are able to capture different dependence structures: linear behavior, like the correlation, using the Gaussian Copula, and non-linear behavior, like the tail dependency, using the T-Copula.

Interestingly, our work addresses a recurrent problem observed when using Copulas: most contributions use a Gaussian copula paired with a Pearson correlation [10, 19] in order to estimate the correlation factor of the Gaussian Copula. However, the Pearson correlation factor is not invariant under strictly monotone non-linear transformations, which may impact the estimation process when standardizing with the marginal distribution functions. Our contribution MTCopula uses the Kendall's τ inversion, which is based on the relationship between the elliptical Copula (T-Copula or Gaussian Copula) correlation parameter and the Kendall's τ of two random variables (see Eq. 11). For the T-Copula, another step is required to estimate the degrees of freedom, which is based on MLE with the correlation matrix held fixed.
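The degrees-of-freedom step described above can be sketched as a profile likelihood over ν with P held fixed, as below. The grid search is a stand-in for the MLE step, scipy.stats.multivariate_t is assumed to be available (scipy >= 1.6), and none of this is the authors' code.

import numpy as np
from scipy import stats

def t_copula_loglik(U, P, nu):
    """T-Copula log-likelihood (Eq. 16) on pseudo-observations U in (0, 1):
    joint multivariate-t log-density minus the univariate t log-densities."""
    z = stats.t.ppf(U, df=nu)
    joint = stats.multivariate_t(shape=P, df=nu).logpdf(z)
    margins = stats.t.logpdf(z, df=nu).sum(axis=1)
    return np.sum(joint - margins)

def fit_t_copula_dof(U, P, grid=(2, 3, 5, 8, 13, 21, 34)):
    # P is assumed to come from Kendall's tau inversion (Eq. 11) and is held fixed;
    # the candidate grid is an illustrative choice.
    return max(grid, key=lambda nu: t_copula_loglik(U, P, nu))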
6.3 Data Generation And Reconstruction

For synthetic data generation, copula samples are generated by sampling from the Copula density function c that corresponds to the estimated Copula joint distribution function C. Then, the inverse probability transformation ($F_j^{-1}$) is applied to transform the Copula samples back to the natural distribution of the data (see Eq. 8). Algorithm 3 presents the steps to sample based on the Copula C and the fitted marginal distributions $(F_1, F_2, \dots, F_d)$.

Algorithm 3: Sampling Based On Copula.
Input: best fitted Copula C with parameter vector theta, fitted marginal distributions (F_1, g_1), (F_2, g_2), ..., (F_d, g_d).
Output: synthetic d-dimensional observation X~.
1   Sample d-dimensional copula data U, U ~ (c, theta)
2   Return X~ = (F_1^{-1}(U_1; g_1), F_2^{-1}(U_2; g_2), ..., F_d^{-1}(U_d; g_d))

For the moment, our system MTCopula supports two Copula models: the Gaussian and the T-Copula. For generating correlated random variables, our method uses the Cholesky factorization, which is commonly used in Monte Carlo simulation to produce efficient estimates of simulated values [30].

Once the synthetic data generation process is finished, a reconstruction operation is performed in order to re-convert the categorical variables to their original representation by replacing the interval values with their corresponding, most likely, categories. Finally, the row representation of the time series is re-transformed into a column representation.

7 EXPERIMENTS

In this section, we report the experiments that were conducted to validate MTCopula's ability to generate synthetic data (the source code is available at https://github.com/cderunz/MTCopula). In order to evaluate our approach, we answer the following research questions:

(1) MTCopula relies on the central hypothesis that Copulas are pertinent to generate synthetic data. To confirm it, we propose experiments where state-of-the-art generators (ITS, GADP, MLE, and CMLE) are compared with different Gaussian Copulas and the T-Copula. As a Gaussian Copula is defined by its correlation matrix to model dependency, our test incorporates several ways to estimate this correlation matrix: Kendall's τ, Pearson and Spearman coefficients. In conclusion, this validates the choices of both the Copula and the Kendall's τ.

(2) The main bottleneck of methods based on Copulas is (i) to be able to choose among the marginal models, and (ii) to choose among the Copula models that may have different properties to capture the dependency. MTCopula automatises the process by using the AIC criterion as a measure to automatically determine the best model, either for the marginals or for the Copula. We show to which extent this choice is efficient in our context.

(3) Finally, to answer the first question raised in this paper, we show the efficiency of MTCopula to generate multiple/multivariate time series based on our initial real industrial use case on media planning and synthetic media channel data generation.

For our experiments, we use the 4 datasets presented in Table 1. The XYZ dataset was generated using a mixture of Beta and Gaussian distributions with a correlation between Y and Z only, in order to simulate complex marginal distributions. The Abalone and Breast Cancer Wisconsin datasets come from the UCI repository (https://archive.ics.uci.edu/ml/datasets.php). The AdWanted dataset comes from the Adwanted Group company and provides a rich and real use case for our approach based on media channels; it is not shareable due to privacy issues. For this specific dataset, the input data, which consists of 27000 instances in 10 dimensions, is first preprocessed following the methodology presented in Section 6.1 for Copula model learning. This produces a multivariate continuous dataset with 1440 instances of 60 dimensions that we use in our tests.

Dataset                               | Type                    | Number Attributes | Attribute Characteristics          | Number Instances
XYZ                                   | Multivariate            | 3                 | Continuous                         | 1000
Abalone                               | Multivariate            | 8                 | Continuous, Discrete, Categorical  | 4177
Breast Cancer Wisconsin (Diagnostic)  | Multivariate            | 32                | Continuous, Categorical            | 569
AdWanted                              | Multivariate Timeseries | 60                | Continuous, Categorical            | 1440
Table 1: Datasets Used For Experiments.

7.1 Copula For Synthetic Data Generation

This section evaluates to which extent Copula models answer our need to generate synthetic datasets that fit the two objectives presented in Section 5.

7.1.1 Copula versus Other State-of-the-Art Generators. We first evaluate the ability of the Copula framework to generate synthetic data that better preserves the dependency structure when compared to the following state-of-the-art approaches: ITS [17], GADP [14], MLE and CMLE [5]. In order to show the Copula framework efficiency, we couple different marginals by changing the copula itself: either the T-Copula or the Gaussian copula. For the Gaussian Copula, we use different methods to estimate the correlation matrix P: Gaussian Copula with Kendall's τ (GCK), Gaussian Copula with Spearman (GCS), and Gaussian Copula with Pearson (GCP).

We evaluate, on our four datasets, the dependence structure preservation based on the Root Mean Square Error (RMSE) between the correlation matrix of the original dataset and that of the generated dataset. The lower the RMSE, the better the dependency structure is captured. The final reported errors, presented in Table 2, are averaged over 50 runs, except for MLE and CMLE, due to their time computation costs on the three most complex datasets.
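The dependency-preservation score can be sketched as follows; the paper does not state which correlation is used inside the RMSE, so comparing the Pearson correlation matrices of the two datasets is an assumption here, and the function name is ours.

import numpy as np

def correlation_rmse(real, synthetic):
    """RMSE between the correlation matrices of the real and synthetic datasets,
    taken over the upper-triangular entries, as a dependency-preservation score."""
    c_real = np.corrcoef(real, rowvar=False)
    c_syn = np.corrcoef(synthetic, rowvar=False)
    iu = np.triu_indices_from(c_real, k=1)
    return np.sqrt(np.mean((c_real[iu] - c_syn[iu]) ** 2))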
7.1.2 Choice of Dependency Structure Estimation Method. In order to validate our choice that Kendall’s 휏 is relevant and accu- rate to estimate and preserve dependency structure, we compare several methods to estimate the correlation matrix 푃 of the Gauss- (a) Real Data (b) Synthetic Data. ian Copula: Kendall, Spearman, and Pearson. Noticeably, we limit our study to Copula whose dependency structure D is expressed Figure 6: Pair Plot of XYZ Dataset and Synthetic Data Gen- as a correlation matrix. erated Using Kendall Method. From Table 2, we can observe that Kendall, Spearman, and Pearson methods, for which the 푅푀푆퐸 median is between 0.01 to 0.2 depending on the dataset, are significantly more accurate 7.2 Interest Of AIC For Models Selection than RMSE scores for ITS, GADP, MLE, CMLE methods for which This section presents the benefits of using AIC to determine the the means are respectively between 0.34 and 0.72, 0.16 and 0.28, best model for both marginal fitting and copula choices. 0.44 and 0.89, and between 0.17 and 0.86. We can also observe that the Gaussian Copula with Kendall performs slightly better 7.2.1 Choice of the marginals. To evaluate the importance than the Gaussian Copulas with both Pearson and Spearman. of AIC in selecting the most appropriate marginal distribution These results illustrate the robustness and the effectiveness of that best fits the behavior of marginal variables, we fit alistof Kendall’s method against the others method for correlation ma- bounded distributions: Beta distribution, Uniform distribution, trix estimation in the specific case of Gaussian copula. Therefore, Truncated Exponential, Truncated Gaussian, and Kernel density our choice of Kendall’s 휏 to capture the dependency structure estimation, using the MLE method for each variable. For each of is validated both experimentally and theoretically, as illustrated these distributions, we evaluate the AIC using the fitted parame- before in Section 3. The dependency structure estimation method ters. The distribution with the minimum value of AIC is selected choice is thus confirmed. to model the behavior of the variable. Note that we use a list of bounded distribution in order to avoid generating outliers. In addition, we incorporate a Kernel algorithm to fit more complex distribution shapes. Table 3 illustrates the evaluation of AIC of the marginal distributions fitting of XYZ dataset variables. From Table 3 we can observe that for both 푋 and 푌 variables. Beta distribution has a very small value of AIC (−11718.86 and −11001.61 respectively ). As a consequence, we notice that the real data distribution (blue color in Figure 7) and the fitted distri- bution (orange color Figure 7) are almost identical (see Figures Figure 5: Marginals Fitting Evaluation Using Two Sided 7a and 7b). While, for the variable 푍, the value of the minimum Kolmogorov-Smirnov Test with 훼 = 0.05 on XYZ dataset. AIC is not as small (4435.44) compared to the other variables. As a result, we observe a significant difference between the fitted and the real data distribution in Figure 7c. This is because AIC es- 7.1.3 Impact of the marginal fitting on the quality of data timates the relative amount of information lost by a given model: generation. Figure 5 illustrates a for the variation of the the less information a model loses, the higher the quality of that P-Value of the two 2-Samples Kolmogorov-Smirnov Test, which model. 
determines whether the synthetic attributes values and the real attributes values are derived from the same distribution. We 7.2.2 Choice of the copula models. In this experiment, we notice that for the first 2 variables 푋 and 푌 , the median 푃-value investigate the impact of the copula model choice on the quality Truncated Truncated Variable Beta KDE Uniform dataset (9507.26). This confirms the AIC interest in choosing the Exponential Gaussian X -11718.86 -281.47 4.0 98.99 133.62 best copula model that best fits the data generation process. Y -11001.61 -1116.15 3.96 -1497.21 -690.05 Z 240273.73 4435.44 5480.97 5040.59 4896.43 Database Copula Model AIC Value Table 3: AIC Evaluation on XYZ Dataset Marginals. XYZ Gaussian 3993.73 T-Student 3998.18 Abalone Gaussian 12388.88 T-Student 9507.26 AdWanted Gaussian 202532.88 T-Student 127444.74 Table 4: AIC Evaluation of Gaussian and T-Copula Models.

(a) X Fitting using (b) Y Fitting using (c) Z Fitting using Beta distribution Beta distribution KDE distribution

Figure 7: Marginal Distribution Obtained After Fitting Using Algorithm 1. of data generation, and we demonstrate the importance of AIC to choose the best copula model. To this end, we fit two copulas models, the Gaussian and the T-Copula, on two different datasets XYZ and Abalone. For both models, we use the Kendall method to estimate the correlation matrix 푃. The degree of freedom 휈 of T-Copula is estimated by the CMLE method with correlation matrix 푃 held fixed. Results are averaged after 10 runs. Figure 8 (b) Synthetic Data using (a) Real Data. illustrates the RMSE evaluation of the dependency preservation T-Copula. using the two copulas. Figure 9: Pair Plot illustration of Abalone Dataset.

Through this section, we have demonstrated the effectiveness of MTCopula to select among different combinations of marginal fittings and Copula models, the most appropriate models that best represent the process of data generation, and we showed the importance and the relevance of the AIC criterion in this process.

7.3 MTCopula Applied To Media Channels

The objective of this experiment is to measure the effectiveness of MTCopula on the real media dataset provided by the Adwanted company. According to Table 4, as the AIC of the T-Copula (≈ 127k) is lower than the AIC of the Gaussian Copula (≈ 202k), MTCopula automatically selects the T-Copula for this dataset to sample synthetic multivariate time series. These data are used in the following experiments to evaluate the business-related qualities of the generated data. The RMSE results presented in Table 2 confirm this choice, as the T-Copula obtains a slightly better performance: ≈ 0.088 with a standard deviation of ≈ 0.0005 for the T-Copula, and ≈ 0.093 with a standard deviation of ≈ 0.002 for the Gaussian Copula with Kendall's τ.

To study the utility of the generated time series, we compare each time series in the generated dataset with its counterpart from the same target user category, the same day of the week, and the same channel in the real dataset. For each pair, we measure the MAE variation of the statistical properties of the time series, respectively the Min, Max, Mean, Median, Standard deviation, and 95th percentile. Figure 10 shows the MAE of those measures.

Figure 10: MAE Variation of Synthetic Time Series Statistics.

From this figure, we can observe an overall variation smaller than 0.2, which is a very good result as it is significantly smaller than the observed standard deviation of those statistics in the original dataset (respectively ≈ 1.66, ≈ 0.54, ≈ 0.46, ≈ 0.44, and 1.44). Noticeably, for the min statistic, because we have a standard deviation ≈ 0.2, this result reflects the ability of MTCopula to preserve the time series' characteristics when generating synthetic data. This overall good business-related performance gives guarantees on the utility of the synthetic time series in several situations when access to the real data is not possible.

8 CONCLUSION

This paper proposed MTCopula, a flexible, extendable, and generic solution for synthetic complex data generation. It incorporates different Copula models (for the moment, the Gaussian and the T-Copula) in order to capture different dependency structures, including tail dependence. To bypass the non-invariance problem of Pearson-correlation-based Copula methods, MTCopula involves Kendall's τ, which is robust to outliers and invariant under strictly monotone transformations. This ensures dependency preservation during the process of copula learning. Unlike the GADP approach, which uses only the Gaussian distribution to model the marginals, our solution incorporates a variety of bounded distributions in order to best fit the behavior of the variables and to not generate outliers. In addition, MTCopula is less restrictive in terms of the quantity of the input data and is more explainable than GANs.

MTCopula is able to automatically select both the univariate marginal distributions and the copula model that best fit the input data. For that, it uses MLE to fit the possible marginal distribution models, and then AIC to choose both the best distribution and the best Copula model between the T-Copula and the Gaussian one. MTCopula handles multiple data types, including complex tabular datasets and multiple/multivariate time series. The proposed experiments show MTCopula's interest and efficiency compared to existing methods.

In our future works, first, further experiments will be conducted to evaluate (i) the sensitivity of MTCopula to the number of parameters it has to fit to correctly estimate the marginal or Copula models, by varying the number and the nature of the variables, and (ii) how it deals with asymmetric tail dependency behaviors, as this problem is still open in MTCopula. Second, we will work on making our approach robust to missing values in the original datasets. Third, we plan to study the use of synthetic data for machine learning model fitting, in order to see how qualitative the new data is for different tasks. Fourth, an important way to see how much using MTCopula could be interesting for machine learning tasks is also to analyze its scalability according to the number of original and generated data. Fifth, we want to tackle a new research problem: how can MTCopula efficiently consider conditional dependencies between variables? Using Vine Copulas seems to be a promising solution that we need to study.

9 ACKNOWLEDGMENTS

This work is funded by the ANRT CIFRE Program (2019/0877).

REFERENCES

[1] Ruzanna Ab Razak and Noriszura Ismail. 2019. Dependence Modeling and Portfolio Risk Estimation using GARCH-Copula Approach. Sains Malaysiana 48, 7 (2019), 1547-1555.
[2] Hirotugu Akaike. 1974. A new look at the statistical model identification. IEEE Transactions on Automatic Control 19, 6 (1974), 716-723.
[3] Claudia Czado. 2019. Analyzing Dependent Data with Vine Copulas. Lecture Notes in Statistics, Springer (2019).
[4] Stefano Demarta and Alexander J. McNeil. 2005. The t copula and related copulas. International Statistical Review 73, 1 (2005), 111-129.
[5] Christian Genest, Kilani Ghoudi, and L.-P. Rivest. 1995. A semiparametric estimation procedure of dependence parameters in multivariate families of distributions. Biometrika 82, 3 (1995), 543-552.
[6] Harry Joe. 2005. Asymptotic efficiency of the two-stage estimation method for copula-based models. Journal of Multivariate Analysis 94, 2 (2005), 401-419.
[7] Harry Joe. 2014. Dependence modeling with copulas. CRC Press.
[8] Samuel Kotz and Saralees Nadarajah. 2000. Extreme value distributions: theory and applications. World Scientific.
[9] Dorota Kurowicka and Roger M. Cooke. 2006. Uncertainty analysis with high dimensional dependence modelling. John Wiley & Sons.
[10] Zheng Li, Yue Zhao, and Jialin Fu. 2020. SYNC: A Copula based Framework for Generating Synthetic Data from Aggregated Sources. arXiv preprint arXiv:2009.09471 (2020).
[11] Donald MacKenzie and Taylor Spears. 2014. 'The formula that killed Wall Street': The Gaussian copula and modelling practices in investment banking. Social Studies of Science 44, 3 (2014), 393-417.
[12] Alexander J. McNeil, Rüdiger Frey, and Paul Embrechts. 2015. Quantitative risk management: concepts, techniques and tools, revised edition. Princeton University Press.
[13] Thomas Mikosch. 2006. Copulas: Tales and facts. Extremes 9, 1 (2006), 3-20.
[14] Krishnamurty Muralidhar, Rahul Parsa, and Rathindra Sarathy. 1999. A general additive data perturbation method for database security. Management Science 45, 10 (1999), 1399-1415.
[15] Roger B. Nelsen. 2007. An introduction to copulas. Springer Science & Business Media.
[16] Aristidis K. Nikoloulopoulos, Harry Joe, and Haijun Li. 2009. Extreme value properties of multivariate t copulas. Extremes 12, 2 (2009), 129-148.
[17] Sheehan Olver and Alex Townsend. 2013. Fast inverse transform sampling in one and two dimensions. arXiv preprint arXiv:1307.1223 (2013).
[18] Noseong Park, Mahmoud Mohammadi, Kshitij Gorde, Sushil Jajodia, Hongkyu Park, and Youngmin Kim. 2018. Data Synthesis Based on Generative Adversarial Networks. Proc. VLDB Endow. 11, 10 (2018), 1071-1083.
[19] Neha Patki, Roy Wedge, and Kalyan Veeramachaneni. 2016. The synthetic data vault. In 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA). IEEE, 399-410.
[20] Andrew Patton. 2013. Copula methods for forecasting multivariate time series. In Handbook of Economic Forecasting. Vol. 2. Elsevier, 899-960.
[21] L. Petricioli, L. Humski, M. Vranić, and D. Pintar. 2020. Data Set Synthesis Based on Known Correlations and Distributions for Expanded Social Graph Generation. IEEE Access 8 (2020), 33013-33022.
[22] Stéphanie Portet. 2020. A primer on model selection using the Akaike information criterion. Infectious Disease Modelling 5 (2020), 111-128.
[23] Alexander J. Ratner, Henry Ehrenberg, Zeshan Hussain, Jared Dunnmon, and Christopher Ré. 2017. Learning to compose domain-specific transformations for data augmentation. In Advances in Neural Information Processing Systems. 3236-3246.
[24] Jerome P. Reiter, Quanli Wang, and Biyuan Zhang. 2014. Bayesian estimation of disclosure risks for multiply imputed, synthetic data. Journal of Privacy and Confidentiality 6, 1 (2014).
[25] Marko Robnik-Šikonja. 2015. Data generators for learning systems based on RBF networks. IEEE Transactions on Neural Networks and Learning Systems 27, 5 (2015), 926-938.
[26] David Salinas, Michael Bohlke-Schneider, Laurent Callot, Roberto Medico, and Jan Gasthaus. 2019. High-dimensional multivariate forecasting with low-rank Gaussian Copula Processes. In Advances in Neural Information Processing Systems. 6827-6837.
[27] Francesco Serinaldi. 2008. Analysis of inter-gauge dependence by Kendall's τ, upper tail dependence coefficient, and 2-copulas with application to rainfall fields. Stochastic Environmental Research and Risk Assessment 22, 6 (2008), 671-688.
[28] M. Sklar. 1959. Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris 8 (1959), 229-231.
[29] Natasa Tagasovska, Damien Ackerer, and Thibault Vatter. 2019. Copulas as High-Dimensional Generative Models: Vine Copula Autoencoders. In Advances in Neural Information Processing Systems. 6528-6540.
[30] Honggang Zhu, L. M. Zhang, Te Xiao, and X. Y. Li. 2017. Generation of multivariate cross-correlated geotechnical random fields. Computers and Geotechnics 86 (2017), 95-107.