OPTIMAL COPULA TRANSPORT FOR CLUSTERING MULTIVARIATE TIME SERIES

Gautier Marti⋆†, Frank Nielsen†, Philippe Donnat⋆

⋆ Hellebore Capital Management    † École Polytechnique

ABSTRACT

This paper presents a new methodology for clustering multivariate time series leveraging optimal transport between copulas. Copulas are used to encode both (i) intra-dependence of a multivariate time series, and (ii) inter-dependence between two time series. Optimal copula transport then allows us to define two distances between multivariate time series: (i) one for measuring intra-dependence dissimilarity, (ii) another one for measuring inter-dependence dissimilarity based on a new multivariate dependence coefficient which is robust to noise, deterministic, and which can target specified dependencies.

Index Terms— Clustering; Multivariate Time Series; Optimal Transport; Earth Mover's Distance; Empirical Copula; Dependence Coefficient

1. INTRODUCTION

Clustering is the task of grouping a set of objects in such a way that objects in the same group, also called a cluster, are more similar to each other than to those in different groups. This primitive in unsupervised machine learning is known to be hard to formalize and hard to solve. For practitioners, the proper choice of a pair-wise similarity measure, features or representation, normalizations, and number of clusters is a supplementary burden: it is most often task and goal dependent. Time series, sequences of points or ordered sets of random variables, add complexity to the clustering task, being dynamical objects. In the survey [1], the author classifies time series clustering into three main approaches: working (i) on raw data (e.g., time-frequency [2]), (ii) on features (e.g., wavelets, SAX [3]), (iii) on models (e.g., ARIMA time series [4]). Regardless of the method chosen, dependence between time series (usually measured with Pearson linear correlation) is a major piece of information to study. This is notably the case for fMRI, EEG and financial time series. Obviously, dependence does not account for the whole information in a set of time series. For example, in the specific case of N time series whose observed values are drawn from T independent and identically distributed random variables, one should take into account all the available information in these N time series, i.e. the dependence between them and the N marginal distributions, in order to design a proper distance for clustering [5].

Many of the time series datasets found in the literature consist in N real-valued variables observed T times, while in this work we focus on N × d × T time series datasets, i.e. N vector-valued variables in R^d observed T times. For instance, N horses can be monitored through time using d sensors placed on their body and limbs. Clustering these multivariate time series using dependence information only is already challenging: (i) dependence information can be found at two levels, intra-dependence between the d time series (x_1^(i), ..., x_d^(i)), 1 ≤ i ≤ N, and inter-dependence between the N time series (X_1, ..., X_N); (ii) efficient multivariate dependence measures are required. Back to the previous example, intra-dependence between the d time series quantifies how the d sensors jointly move and thus may help to cluster horses based on their gaits (e.g., walk, trot, canter, gallop, pace), while inter-dependence between the N time series quantifies how the N horses jointly move and thus may help to cluster horses based on their trajectories. Recently, several new dependence coefficients between random variables have been proposed in the literature ([6], [7], [8], [9]), demonstrating both the interest in and the difficulty of obtaining such measures. However, for the task of clustering multivariate time series, most of them are inappropriate: (i) some are not multivariate measures; (ii) some are not robust, i.e. the estimated dependence may strongly vary from one estimation to another and may yield erroneously high dependence estimates between independent variables, as is the case with the Randomized Dependence Coefficient (RDC) [7] as noticed in [8]; (iii) they aim to capture a wide range of dependencies equitably [9] and thus are not application-oriented, i.e. one cannot specify the dependencies to focus on and those to ignore. Consequently, shortcoming (ii) leads to spurious clusters, and (iii) to ill-suited clusters for specific tasks, besides increasing the risk of capturing spurious dependence, e.g., the Hirschfeld-Gebelein-Rényi Maximum Correlation Coefficient equals 1 too often [8]. In this work, we therefore propose a new multivariate dependence measure motivated by applications (clustering credit default swaps based on their noisy term structure time series [10]) and by the need of robustness on finite noisy samples. Our dependence measure, leveraging the statistical robustness of empirical copulas and optimal transport, achieves the best results on the benchmark datasets, yet the benchmark is biased in our favor since we specify to our coefficient the dependence we look for (by definition), unlike the other dependence measures.

Contributions. In this article, we introduce (i) a method to compare intra-dependence between two multivariate time series, (ii) a dependence coefficient to evaluate the inter-dependence between two such time series, and (iii) a method to specify the dependencies our coefficient should measure. The novel dependence coefficient is benchmarked on [11] using the R code from [7] (https://github.com/lopezpaz/randomized_dependence_coefficient). Tutorial, implementation and illustrations are available at www.datagrapple.com/Tech.

2. RELATED WORK

Clustering multivariate time series (MTS) datasets has been much less explored than clustering univariate time series [12], despite their ubiquity in fields such as motion recognition (e.g., gaits), medicine (e.g., EEG, fMRI) and finance (e.g., fixed-income securities yield curves or term structures). A general trend for clustering N multivariate time series is to consider them as N datasets of dimension d × T; then, in order to obtain a clustering of the N datasets, one leverages a similarity measure between two such d × T MTS datasets among Euclidean Distance, Dynamic Time Warping, Weighted Sum SVD, PCA similarity factor and other PCA-based similarity measures [13], before running a standard algorithm such as k-means. In [14], the authors improve on the PCA-based methodology by adding another similarity factor: a Mahalanobis distance similarity factor which discriminates between two datasets that may have similar spatial orientation (similar principal components) but are located far apart. The authors finally combine orientation (PCA) and location (Mahalanobis distance) with a convex combination to feed a k-means algorithm leveraging the resulting dissimilarities. Paving another way for research, in [12] the authors map each individual time series to a fixed-length vector of non-parametric statistical summaries before applying k-means on this feature space. In this work, we focus instead on dependence, which is not yet well understood in the multivariate setting (e.g., many different definitions of mutual information; the copula construction breaks down for non-overlapping multivariate marginals [15]). To alleviate the former shortcoming, we propose to study intra-dependence and inter-dependence separately. In line with the related research, we focus on defining proper distances between the multivariate time series rather than on elaborating the clustering algorithm.

3. CLUSTERING INTRA-DEPENDENCE

We refer to the dependence between the d univariate time series of a d-variate time series as intra-dependence. We present the mathematical tools to capture this intra-dependence and to compare it between two d-variate time series in order to perform a clustering based on this information.

3.1. The Copula Transform

Since "the study of copulas and their applications in statistics is a rather modern phenomenon" and "despite overlapping goals of multivariate modeling and dependence identification, until recently the fields of machine learning in general [...] have been ignorant of the framework of copulas" [16], we recall in this section the basic definitions and results of Copula Theory required for clustering with optimal copula transport.

Definition. The Copula Transform. Let X = (X_1, ..., X_d) be a random vector with continuous marginal cumulative distribution functions (cdfs) P_i, 1 ≤ i ≤ d. The random vector U = (U_1, ..., U_d) := P(X) = (P_1(X_1), ..., P_d(X_d)) is known as the copula transform. The U_i, 1 ≤ i ≤ d, are uniformly distributed on [0, 1] (the probability integral transform): for P_i the cdf of X_i, we have x = P_i(P_i^{-1}(x)) = Pr(X_i ≤ P_i^{-1}(x)) = Pr(P_i(X_i) ≤ x), thus P_i(X_i) ∼ U[0, 1].

Theorem. Sklar's Theorem [17]. For any random vector X = (X_1, ..., X_d) having continuous marginal cdfs P_i, 1 ≤ i ≤ d, its joint cumulative distribution P is uniquely expressed as

    P(X_1, ..., X_d) = C(P_1(X_1), ..., P_d(X_d)),

where C, the multivariate distribution of uniform marginals, is known as the copula of X.

Copulas are central for studying the dependence between random variables: their uniform marginals jointly encode all the dependence. One can observe that in most cases we do not know a priori the margins P_i, 1 ≤ i ≤ d, needed to apply the copula transform on (X_1, ..., X_d). [18] introduced a practical estimator for the uniform margins and the underlying copula: the empirical copula transform.

Definition. The Empirical Copula Transform [18]. Let (X_1^t, ..., X_d^t), t = 1, ..., T, be T observations from a random vector (X_1, ..., X_d) with continuous margins. Since one cannot directly obtain the corresponding copula observations (U_1^t, ..., U_d^t) = (P_1(X_1^t), ..., P_d(X_d^t)) without knowing a priori (P_1, ..., P_d), one can instead estimate the d empirical margins P_i^T(x) = (1/T) Σ_{t=1}^T 1(X_i^t ≤ x), 1 ≤ i ≤ d, to obtain the T empirical observations (Ũ_1^t, ..., Ũ_d^t) = (P_1^T(X_1^t), ..., P_d^T(X_d^t)). Equivalently, since Ũ_i^t = R_i^t / T, where R_i^t is the rank of observation X_i^t, the empirical copula transform can be considered as the normalized rank transform.
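To make the normalized rank transform concrete, here is a minimal sketch in Python with numpy/scipy (our choice of language and function name; the companion code referenced above is in R, and this is not the authors' implementation). It computes the empirical copula transform of a (T, d) sample via per-margin ranks and illustrates the invariance to strictly increasing transformations discussed around Fig. 1 below.

```python
import numpy as np
from scipy.stats import rankdata

def empirical_copula_transform(X):
    """Empirical copula transform as the normalized rank transform.

    X: (T, d) array holding T observations of a d-dimensional vector.
    Returns the (T, d) array of pseudo-observations U_i^t = R_i^t / T.
    """
    T = X.shape[0]
    # rank each margin independently, then normalize ranks into (0, 1]
    return np.apply_along_axis(rankdata, 0, X) / T

# Invariance to strictly increasing transformations (cf. Fig. 1):
# X ~ U[0, 1] and Y = ln(X) are perfectly monotonically dependent.
rng = np.random.default_rng(42)
x = rng.uniform(size=10_000)
y = np.log(x)
print(np.corrcoef(x, y)[0, 1])              # noticeably below 1 on raw data
U = empirical_copula_transform(np.column_stack([x, y]))
print(np.corrcoef(U[:, 0], U[:, 1])[0, 1])  # 1.0 on the copula transform
```

Ranking costs one sort per margin, which matches the O(dT log T) complexity noted in the remarks below.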

Fig. 1. The copula transform invariance property to strictly increasing transformations: let X ∼ U[0, 1] and Y = ln(X). Pearson correlation cannot retrieve the perfect deterministic dependence on raw data (left panel, ρ ≈ 0.84) but it can on the copula transform (right panel, ρ = 1).

A few remarks: the empirical copula transform is (i) easy and fast to compute, i.e. sorting d arrays of length T, O(dT log T); (ii) consistent, converging fast to the underlying copula [19], [6]. Authors have leveraged the empirical copula transform for several purposes: [6] benefit from its invariance to strictly increasing transformations of the X_i variables (Fig. 1) to improve feature selection, [7] use it to obtain a dependence coefficient invariant with respect to marginal distribution transformations, and [5] to study dependence and margins separately for clustering.

3.2. Optimal Transport between Copulas

Optimal transport is an old problem in mathematics, anchored in Gaspard Monge's seminal treatise, which has received renewed attention from both the pure and applied mathematics communities. Its recent theoretical developments are detailed in [20]. Meanwhile, applications have been found in several domains such as mathematical economics [21], image retrieval [22], image recoloring [23] and clustering of univariate empirical probability distributions [24].

Solving an optimal transport problem amounts to finding the optimal transportation and allocation of resources: for instance, finding the best mapping between n factories and m retail stores given the cost of shipment, or the best way to turn piles of dirt into others while minimizing the work, defined as the amount of dirt moved times the distance by which it is moved. The last example actually motivated the definition of a distance between two multi-dimensional distributions called the Earth Mover's Distance (EMD) [22], which has been found to be the discrete version of the first Wasserstein distance W_1(μ, ν) := inf_{γ ∈ Γ(μ,ν)} ∫_{M×M} d(x, y) dγ(x, y), whose definition is strongly related to optimal transport in its Kantorovich formulation, inf { ∫_{X×Y} c(x, y) dγ(x, y) | γ ∈ Γ(μ, ν) }. Since copulas encode all the dependence in their uniform marginals, we leverage them to quantify the intra-dependence similarity between two d-variate random variables, i.e. how similar the dependence between their d coordinates is.

Definition. Earth Mover's Distance between two copulas. Let Ũ_1, Ũ_2 ∈ [0, 1]^{d×T} be the empirical copula transforms of data X_1, X_2 ∈ R^{d×T}. We estimate the two empirical copula densities using histograms h_1 and h_2, which are converted into signatures s_1 = {(p_i, w_{p_i})}_{i=1}^n and s_2 = {(q_i, w_{q_i})}_{i=1}^n, where p_i, q_i ∈ [0, 1]^d are the central positions of the bins of h_1 and h_2 respectively, and w_{p_i}, w_{q_i} are the corresponding bin frequencies. We define the intra-dependence distance D_intra(X_1, X_2) := EMD(s_1, s_2), where [22]

    EMD(s_1, s_2) := min_f Σ_{1 ≤ i,j ≤ n} ||p_i − q_j|| f_ij

subject to

    f_ij ≥ 0,  1 ≤ i, j ≤ n,
    Σ_{j=1}^{n} f_ij ≤ w_{p_i},  1 ≤ i ≤ n,
    Σ_{i=1}^{n} f_ij ≤ w_{q_j},  1 ≤ j ≤ n,                    (1)
    Σ_{i=1}^{n} Σ_{j=1}^{n} f_ij = 1.

From a practical point of view, this distance is robust to the binning process (small misalignments of corresponding frequencies yield only a slightly larger distance), unlike standard bin-by-bin distances (where small misalignments yield a much larger distance) such as the Kullback-Leibler divergence for instance. However, its main drawback is the computational complexity. Finding the optimal transport between the n Diracs is an instance of the assignment problem. This fundamental combinatorial optimization problem is equivalent to finding a minimum weight matching in a complete weighted bipartite graph K_{n,n}, which can be solved by the Hungarian algorithm in O(n³).

In Fig. 2, we display three bivariate empirical copulas computed on real data (these copulas encode the dependence between two maturities of credit default swaps for three different entities). We can notice strong positive dependence (expressed by the diagonals), yet these copulas exhibit distinct dependence patterns which may not be taken into account by correlation coefficients (cf. the tutorial at www.datagrapple.com/Tech for an application of this methodology to credit default swaps time series).

Fig. 2. Optimal Copula Transport for measuring the intra-dependence similarity of two MTS; C1, C2, C3 are the empirical copulas of data X1, X2, X3 ∈ R^{2×T} respectively. According to the EMD, the dependence between the two coordinates of X1 is much more similar to that of X2 than to that of X3.
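To illustrate Eq. (1) end to end, the following sketch (ours, for the bivariate case) bins two empirical copulas into signatures and solves the transport program with the generic solver scipy.optimize.linprog; the helper names, the 10-bin default, and the solver choice are assumptions for exposition, and a dedicated solver (the Hungarian algorithm mentioned above, or Sinkhorn iterations [25]) would be preferable at scale. It reuses empirical_copula_transform from the sketch above.

```python
import numpy as np
from scipy.optimize import linprog

def copula_signature(U, n_bins=10):
    """Bin (T, 2) copula observations into an n_bins x n_bins histogram
    on [0, 1]^2; return the non-empty bin centers and their frequencies."""
    H, xe, _ = np.histogram2d(U[:, 0], U[:, 1],
                              bins=n_bins, range=[[0, 1], [0, 1]])
    centers = (xe[:-1] + xe[1:]) / 2               # same grid on both axes
    grid = np.array([(cx, cy) for cx in centers for cy in centers])
    w = H.ravel() / H.sum()
    keep = w > 0
    return grid[keep], w[keep]

def emd(p, wp, q, wq):
    """Earth Mover's Distance of Eq. (1) between signatures (p, wp), (q, wq).
    Both weight vectors sum to 1, so the marginal constraints are equalities."""
    n, m = len(wp), len(wq)
    cost = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1             # sum_j f_ij = w_{p_i}
    for j in range(m):
        A_eq[n + j, j::m] = 1                      # sum_i f_ij = w_{q_j}
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=np.concatenate([wp, wq]),
                  bounds=(0, None), method="highs")
    return res.fun

def d_intra(X1, X2, n_bins=10):
    """Intra-dependence distance D_intra between two (T, 2) datasets."""
    s1 = copula_signature(empirical_copula_transform(X1), n_bins)
    s2 = copula_signature(empirical_copula_transform(X2), n_bins)
    return emd(*s1, *s2)
```

On copulas such as C1, C2, C3 of Fig. 2, one would expect d_intra(X1, X2) < d_intra(X1, X3).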
4. CLUSTERING INTER-DEPENDENCE

We refer to the dependence between two d-variate time series as inter-dependence. In this section, we present our inter-dependence measure and compare it in Fig. 4 with the most popular coefficients in the literature (ACE, dCor, MIC, RDC).

Let Z^T = (X_1^T, ..., X_d^T, Y_1^T, ..., Y_d^T) ∈ R^{2d×T} be T stacked observations from the random vectors X = (X_1, ..., X_d) and Y = (Y_1, ..., Y_d). Informally, in order to estimate the dependence between X and Y, we use the relative position of the empirical copula C̃ built from the data Z^T on the shortest path that starts from the independence copula C_ind, passes through C̃, and ends at one of the copulas {C_i} encoding the target dependencies. This idea is depicted in Fig. 3 and benchmarked in Fig. 4.

Fig. 3. Optimal Copula Transport for estimating dependence between random variables; dependence can be seen as the relative distance between the independence copula and one or more target dependence copulas. In this picture, the target dependencies are "perfect dependence" and "perfect anti-dependence". The empirical copula (Data Copula) was built from positively correlated Gaussians, and is thus nearer to the "perfect dependence" copula (top right corner) than to the "perfect anti-dependence" copula (bottom left corner).

Definition. Target Dependencies Coefficient using Transport to Dependence Copulas. The Target Dependencies Coefficient using Transport to Dependence Copulas (TDC) is defined as

    TDC(X^T, Y^T) := EMD(C_ind, C̃) / ( EMD(C_ind, C̃) + min_i EMD(C̃, C_i) ).

For C̃ = C_ind, TDC = 0; for C̃ ∈ {C_i}, TDC = 1; otherwise TDC quantifies the relative nearness to independence or to the specified dependencies. As a bonus, practitioners can discover which specified dependence is "activated", i.e. which of their hypotheses about the dependence between X and Y seems the most likely (qualitative) and how strong it is (quantitative).
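The sketch below (ours, not the paper's benchmark protocol) instantiates the TDC for univariate X and Y with the two targets of Fig. 3, perfect dependence and perfect anti-dependence, discretized as uniform mass on the main and anti-diagonal bins; it reuses empirical_copula_transform, copula_signature and emd from the sketches above, and the bin count and target set {C_i} are illustrative choices since the targets are specified by the user.

```python
import numpy as np

def tdc(x, y, n_bins=10):
    """Target Dependencies Coefficient (TDC) for two univariate samples,
    with perfect dependence / anti-dependence as the target copulas."""
    U = empirical_copula_transform(np.column_stack([x, y]))
    p_emp, w_emp = copula_signature(U, n_bins)

    centers = (np.arange(n_bins) + 0.5) / n_bins
    # independence copula C_ind: uniform mass over all bins of [0, 1]^2
    grid = np.array([(cx, cy) for cx in centers for cy in centers])
    w_ind = np.full(n_bins * n_bins, 1.0 / n_bins**2)
    # target copulas {C_i}: all mass on the main / anti diagonal bins
    diag = np.column_stack([centers, centers])
    anti = np.column_stack([centers, centers[::-1]])
    w_tgt = np.full(n_bins, 1.0 / n_bins)

    d_ind = emd(p_emp, w_emp, grid, w_ind)
    d_tgt = min(emd(p_emp, w_emp, diag, w_tgt),
                emd(p_emp, w_emp, anti, w_tgt))
    return d_ind / (d_ind + d_tgt)

# Positively correlated Gaussians (as in Fig. 3) should score high,
# independent samples near 0.
rng = np.random.default_rng(0)
z = rng.multivariate_normal([0, 0], [[1, 0.9], [0.9, 1]], size=2000)
print(tdc(z[:, 0], z[:, 1]))                              # high
print(tdc(rng.normal(size=2000), rng.normal(size=2000)))  # low
```

Checking which of the two target EMDs attains the minimum tells the practitioner which hypothesis is "activated", as noted above.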

Fig. 4. Experiments based on [11] and [7]; power of the dependence estimators (cor, dCor, MIC, ACE, RDC, TDC) as a function of the noise level, for several deterministic patterns plus noise. Their power is the percentage of times that they are able to distinguish between dependent and independent samples. TDC (dark blue curves) can deal with complex dependence patterns in the presence of noise, provided they are specified in its targets.

5. DISCUSSION

The proposed methodology presents several benefits: it is non-parametric, robust and deterministic (optimal transport), and it relies on an accurate and generic representation of dependence (empirical copulas). Yet, it also has some scalability drawbacks: (i) in dimension, non-parametric density estimation suffers from the curse of dimensionality; (ii) in time, EMD is costly to compute (but there exist methods for speeding up its computation [25]). To alleviate drawback (i), parametric methods may be a solution; we consider optimal copula transport in statistical manifolds for further research.

6. REFERENCES

[1] T Warren Liao, "Clustering of time series data—a survey," Pattern Recognition, vol. 38, no. 11, pp. 1857–1874, 2005.
[2] Robert H Shumway, "Time-frequency clustering and discriminant analysis," Statistics & Probability Letters, vol. 63, no. 3, pp. 307–314, 2003.
[3] Jessica Lin, Eamonn Keogh, Li Wei, and Stefano Lonardi, "Experiencing SAX: a novel symbolic representation of time series," Data Mining and Knowledge Discovery, vol. 15, no. 2, pp. 107–144, 2007.
[4] Konstantinos Kalpakis, Dhiral Gada, and Vasundhara Puttagunta, "Distance measures for effective clustering of ARIMA time-series," in Proceedings of the IEEE International Conference on Data Mining (ICDM 2001). IEEE, 2001, pp. 273–280.
[5] Philippe Donnat, Gautier Marti, and Philippe Very, "Toward a generic representation of random variables for machine learning," Pattern Recognition Letters, vol. 70, pp. 24–31, 2016.
[6] Barnabás Póczos, Zoubin Ghahramani, and Jeff Schneider, "Copula-based kernel dependency measures," in ICML, 2012.
[7] David Lopez-Paz, Philipp Hennig, and Bernhard Schölkopf, "The randomized dependence coefficient," in NIPS, 2013.
[8] A Adam Ding and Yi Li, "Copula correlation: An equitable dependence measure and extension of Pearson's correlation," arXiv preprint arXiv:1312.7214, 2013.
[9] Justin B Kinney and Gurinder S Atwal, "Equitability, mutual information, and the maximal information coefficient," Proceedings of the National Academy of Sciences, vol. 111, no. 9, pp. 3354–3359, 2014.
[10] Gautier Marti, Philippe Very, Philippe Donnat, and Frank Nielsen, "A proposal of a methodological framework with experimental guidelines to investigate clustering stability on financial time series," in IEEE ICMLA, 2015.
[11] Noah Simon and Robert Tibshirani, "Comment on 'Detecting novel associations in large data sets' by Reshef et al., Science Dec 16, 2011," arXiv preprint arXiv:1401.7645, 2014.
[12] Tamraparni Dasu, Deborah F Swayne, and David Poole, "Grouping multivariate time series: A case study," in Proceedings of the IEEE Workshop on Temporal Data Mining: Algorithms, Theory and Applications, in conjunction with the Conference on Data Mining, Houston, 2005, pp. 25–32.
[13] Kiyoung Yang and Cyrus Shahabi, "A PCA-based similarity measure for multivariate time series," in Proceedings of the 2nd ACM International Workshop on Multimedia Databases. ACM, 2004, pp. 65–74.
[14] Ashish Singhal and Dale E Seborg, "Clustering of multivariate time-series data," Journal of Chemometrics, vol. 19, pp. 427–438, 2005.
[15] Christian Genest, JJ Molina, and JA Lallena, "De l'impossibilité de construire des lois à marges multidimensionnelles données à partir de copules," Comptes rendus de l'Académie des sciences. Série 1, Mathématique, vol. 320, no. 6, pp. 723–726, 1995.
[16] Gal Elidan, "Copulas in machine learning," in Copulae in Mathematical and Quantitative Finance, pp. 39–60. Springer, 2013.
[17] A Sklar, "Fonctions de répartition à n dimensions et leurs marges," Publications de l'Institut de Statistique de l'Université de Paris, vol. 8, pp. 229–231, 1959.
[18] Paul Deheuvels, "La fonction de dépendance empirique et ses propriétés. Un test non paramétrique d'indépendance," Acad. Roy. Belg. Bull. Cl. Sci. (5), vol. 65, no. 6, pp. 274–292, 1979.
[19] Paul Deheuvels, "An asymptotic decomposition for multivariate distribution-free tests of independence," Journal of Multivariate Analysis, vol. 11, no. 1, pp. 102–113, 1981.
[20] Cédric Villani, Optimal Transport: Old and New, vol. 338, Springer Science & Business Media, 2008.
[21] Leonid Vitalievich Kantorovich, "On the translocation of masses," in Dokl. Akad. Nauk SSSR, 1942, vol. 37, pp. 199–201.
[22] Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas, "The earth mover's distance as a metric for image retrieval," International Journal of Computer Vision, vol. 40, no. 2, pp. 99–121, 2000.
[23] Sira Ferradans, Nicolas Papadakis, Julien Rabin, Gabriel Peyré, and Jean-François Aujol, Regularized Discrete Optimal Transport, Springer, 2013.
[24] Keith Henderson, Brian Gallagher, and Tina Eliassi-Rad, "EP-MEANS: An efficient nonparametric clustering of empirical probability distributions," 2015.
[25] Marco Cuturi, "Sinkhorn distances: Lightspeed computation of optimal transport," in Advances in Neural Information Processing Systems, 2013, pp. 2292–2300.