OPTIMAL COPULA TRANSPORT FOR CLUSTERING MULTIVARIATE TIME SERIES

Gautier Marti⋆†, Frank Nielsen†, Philippe Donnat⋆

⋆ Hellebore Capital Management    † École Polytechnique

ABSTRACT

This paper presents a new methodology for clustering multivariate time series leveraging optimal transport between copulas. Copulas are used to encode both (i) intra-dependence of a multivariate time series, and (ii) inter-dependence between two time series. Optimal copula transport then allows us to define two distances between multivariate time series: (i) one for measuring intra-dependence dissimilarity, (ii) another one for measuring inter-dependence dissimilarity based on a new multivariate dependence coefficient which is robust to noise, deterministic, and which can target specified dependencies.

Index Terms— Clustering; Multivariate Time Series; Optimal Transport; Earth Mover's Distance; Empirical Copula; Dependence Coefficient

1. INTRODUCTION

Clustering is the task of grouping a set of objects in such a way that objects in the same group, also called a cluster, are more similar to each other than to those in different groups. This primitive in unsupervised machine learning is known to be hard to formalize and hard to solve. For practitioners, the proper choice of a pair-wise similarity measure, features or representation, normalizations, and number of clusters is a supplementary burden: it is most often task and goal dependent. Time series, sequences of points or ordered sets of random variables, add complexity to the clustering task, being dynamical objects. In the survey [1], the author classifies time series clustering into three main approaches: working (i) on raw data (e.g., time-frequency [2]), (ii) on features (e.g., wavelets, SAX [3]), (iii) on models (e.g., ARIMA time series [4]). Regardless of the method chosen, dependence between time series (usually measured with Pearson linear correlation) is a major piece of information to study. This is notably the case for fMRI, EEG and financial time series. Obviously, dependence does not account for the whole information in a set of time series. For example, in the specific case of N time series whose observed values are drawn from T independent and identically distributed random variables, one should take into account all the available information in these N time series, i.e. the dependence between them and the N marginal distributions, in order to design a proper distance for clustering [5].

Many of the time series datasets found in the literature consist in N real-valued variables observed T times, while in this work we focus on N × d × T time series datasets, i.e. N vector-valued variables in R^d observed T times. For instance, N horses can be monitored through time using d sensors placed on their body and limbs. Clustering these multivariate time series using dependence information only is already challenging: (i) dependence information can be found at two levels, intra-dependence between the d time series (x_1^(i), ..., x_d^(i)), 1 ≤ i ≤ N, and inter-dependence between the N time series (X_1, ..., X_N); (ii) efficient multivariate dependence measures are required. Back to the previous example, intra-dependence between the d time series quantifies how the d sensors jointly move and thus may help to cluster horses based on their gaits (e.g., walk, trot, canter, gallop, pace), while inter-dependence between the N time series quantifies how the N horses jointly move and thus may help to cluster horses based on their trajectories. Recently, several new dependence coefficients between random variables have been proposed in the literature ([6], [7], [8], [9]), demonstrating both the interest in and the difficulty of obtaining such measures. However, for the task of clustering multivariate time series, most of them are inappropriate: (i) some are not multivariate measures; (ii) some are not robust, i.e. the estimated dependence may strongly vary from one estimation to another and may yield erroneously high dependence estimates between independent variables, as is the case with the Randomized Dependence Coefficient (RDC) [7] as noticed in [8]; (iii) they aim to capture a wide range of dependencies equitably [9] and thus are not application-oriented, i.e. one cannot specify the dependencies to focus on and those to ignore. Consequently, shortcoming (ii) leads to spurious clusters, and (iii) to ill-suited clusters for specific tasks, besides increasing the risk of capturing spurious dependence, e.g., the Hirschfeld-Gebelein-Rényi Maximum Correlation Coefficient equals 1 too often [8]. In this work, we therefore propose a new multivariate dependence measure motivated by applications (clustering credit default swaps based on their noisy term structure time series [10]) and by the need of robustness on finite noisy samples. Our dependence measure, leveraging the statistical robustness of empirical copulas and optimal transport, achieves the best results on the benchmark datasets, yet the benchmark is biased in our favor since we specify to our coefficient the dependence we look for (by definition), unlike the other dependence measures.

Contributions. In this article, we introduce (i) a method to compare intra-dependence between two multivariate time series, (ii) a dependence coefficient to evaluate the inter-dependence between two such time series, and (iii) a method to specify the dependencies our coefficient should measure. The novel dependence coefficient is benchmarked on [11] using the R code from [7] (https://github.com/lopezpaz/randomized_dependence_coefficient). Tutorial, implementation and illustrations are available at www.datagrapple.com/Tech.

2. RELATED WORK

Clustering multivariate time series (MTS) datasets has been much less explored than clustering univariate time series [12], despite their ubiquity in fields such as motion recognition (e.g., gaits), medicine (e.g., EEG, fMRI) and finance (e.g., fixed-income securities yield curves or term structures). A general trend for clustering N multivariate time series is to consider them as N datasets of dimension d × T; then, in order to obtain a clustering of the N datasets, one leverages a similarity measure between two such d × T MTS datasets among Euclidean Distance, Dynamic Time Warping, Weighted Sum SVD, PCA similarity factor and other PCA-based similarity measures [13], before running a standard algorithm such as k-means. In [14], the authors improve on the PCA-based methodology by adding another similarity factor: a Mahalanobis distance similarity factor which discriminates between two datasets that may have similar spatial orientation (similar principal components) but are located far apart. The authors finally combine orientation (PCA) and location (Mahalanobis distance) with a convex combination to feed a k-means algorithm leveraging the resulting dissimilarities. Paving another way for research, in [12] the authors map each individual time series to a fixed-length vector of non-parametric statistical summaries before applying k-means on this feature space. In this work, we focus instead on dependence, which is not yet well understood in the multivariate setting (e.g., many different definitions of mutual information; the copula construction breaks down for non-overlapping multivariate marginals [15]). To alleviate the former shortcoming, we propose to study intra-dependence and inter-dependence separately. In line with the related research, we focus on defining proper distances between the multivariate time series rather than on elaborating the clustering algorithm.

3. CLUSTERING INTRA-DEPENDENCE

We refer to the dependence between the d univariate time series of a d-variate time series as intra-dependence. We present the mathematical tools to capture this intra-dependence and to compare it between two d-variate time series in order to perform a clustering based on this information.

3.1. The Copula Transform

Since "the study of copulas and their applications in statistics is a rather modern phenomenon" and "despite overlapping goals of multivariate modeling and dependence identification, until recently the fields of machine learning in general [...] have been ignorant of the framework of copulas" [16], we recall in this section the basic definitions and results of Copula Theory required for clustering with optimal copula transport.

Definition. The Copula Transform. Let X = (X_1, ..., X_d) be a random vector with continuous marginal cumulative distribution functions (cdfs) P_i, 1 ≤ i ≤ d. The random vector U = (U_1, ..., U_d) := P(X) = (P_1(X_1), ..., P_d(X_d)) is known as the copula transform. The U_i, 1 ≤ i ≤ d, are uniformly distributed on [0, 1] (the probability integral transform): for P_i the cdf of X_i, we have x = P_i(P_i^{-1}(x)) = Pr(X_i ≤ P_i^{-1}(x)) = Pr(P_i(X_i) ≤ x), thus P_i(X_i) ∼ U[0, 1].

Theorem. Sklar's Theorem [17]. For any random vector X = (X_1, ..., X_d) having continuous marginal cdfs P_i, 1 ≤ i ≤ d, its joint cumulative distribution P is uniquely expressed as

    P(X_1, ..., X_d) = C(P_1(X_1), ..., P_d(X_d)),

where C, the multivariate distribution of uniform marginals, is known as the copula of X.

Copulas are central for studying the dependence between random variables: their uniform marginals jointly encode all the dependence. One can observe that in most cases we do not know a priori the margins P_i, 1 ≤ i ≤ d, needed to apply the copula transform on (X_1, ..., X_d). [18] introduced a practical estimator for the uniform margins and the underlying copula: the empirical copula transform.

Definition. The Empirical Copula Transform [18]. Let (X_1^t, ..., X_d^t), t = 1, ..., T, be T observations from a random vector (X_1, ..., X_d) with continuous margins. Since one cannot directly obtain the corresponding copula observations (U_1^t, ..., U_d^t) = (P_1(X_1^t), ..., P_d(X_d^t)) without knowing a priori (P_1, ..., P_d), one can instead estimate the d empirical margins P_i^T(x) = (1/T) Σ_{t=1}^T 1(X_i^t ≤ x), 1 ≤ i ≤ d, to obtain the T empirical observations (Ũ_1^t, ..., Ũ_d^t) = (P_1^T(X_1^t), ..., P_d^T(X_d^t)). Equivalently, since Ũ_i^t = R_i^t / T, where R_i^t is the rank of observation X_i^t, the empirical copula transform can be considered as the normalized rank transform.
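To make the normalized rank transform concrete, here is a minimal sketch in Python with numpy/scipy (our choice of language and function name; the companion code referenced above is in R, and this is not the authors' implementation). It computes the empirical copula transform of a (T, d) sample via per-margin ranks and illustrates the invariance to strictly increasing transformations discussed around Fig. 1 below.

```python
import numpy as np
from scipy.stats import rankdata

def empirical_copula_transform(X):
    """Empirical copula transform as the normalized rank transform.

    X: (T, d) array holding T observations of a d-dimensional vector.
    Returns the (T, d) array of pseudo-observations U_i^t = R_i^t / T.
    """
    T = X.shape[0]
    # rank each margin independently, then normalize ranks into (0, 1]
    return np.apply_along_axis(rankdata, 0, X) / T

# Invariance to strictly increasing transformations (cf. Fig. 1):
# X ~ U[0, 1] and Y = ln(X) are perfectly monotonically dependent.
rng = np.random.default_rng(42)
x = rng.uniform(size=10_000)
y = np.log(x)
print(np.corrcoef(x, y)[0, 1])              # noticeably below 1 on raw data
U = empirical_copula_transform(np.column_stack([x, y]))
print(np.corrcoef(U[:, 0], U[:, 1])[0, 1])  # 1.0 on the copula transform
```

Ranking costs one sort per margin, which matches the O(dT log T) complexity noted in the remarks below.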

Fig. 1. The copula transform invariance property to strictly increasing transformations: let X ∼ U[0, 1] and Y = ln(X). Pearson correlation cannot retrieve the perfect deterministic dependence on raw data (left panel, ρ ≈ 0.84) but it can on the copula transform (right panel, ρ = 1).

A few remarks: the empirical copula transform is (i) easy and fast to compute, i.e. sorting d arrays of length T, O(dT log T); (ii) consistent, converging fast to the underlying copula [19], [6]. Authors have leveraged the empirical copula transform for several purposes: [6] benefit from its invariance to strictly increasing transformations of the X_i variables (Fig. 1) to improve feature selection, [7] use it to obtain a dependence coefficient invariant with respect to marginal distribution transformations, and [5] to study dependence and margins separately for clustering.

3.2. Optimal Transport between Copulas

Optimal transport is an old problem in mathematics, anchored in Gaspard Monge's seminal treatise, which has received renewed attention from both the pure and applied mathematics communities. Its recent theoretical developments are detailed in [20]. Meanwhile, applications have been found in several domains such as mathematical economics [21], image retrieval [22], image recoloring [23] and clustering of univariate empirical probability distributions [24].

Solving an optimal transport problem amounts to finding the optimal transportation and allocation of resources: for instance, finding the best mapping between n factories and m retail stores given the cost of shipment, or the best way to turn piles of dirt into others while minimizing the work, defined as the amount of dirt moved times the distance by which it is moved. The last example actually motivated the definition of a distance between two multi-dimensional distributions called the Earth Mover's Distance (EMD) [22], which has been found to be the discrete version of the first Wasserstein distance W_1(μ, ν) := inf_{γ ∈ Γ(μ,ν)} ∫_{M×M} d(x, y) dγ(x, y), whose definition is strongly related to optimal transport in its Kantorovich formulation, inf { ∫_{X×Y} c(x, y) dγ(x, y) | γ ∈ Γ(μ, ν) }. Since copulas encode all the dependence in their uniform marginals, we leverage them to quantify the intra-dependence similarity between two d-variate random variables, i.e. how similar the dependence between their d coordinates is.

Definition. Earth Mover's Distance between two copulas. Let Ũ_1, Ũ_2 ∈ [0, 1]^{d×T} be the empirical copula transforms of data X_1, X_2 ∈ R^{d×T}. We estimate the two empirical copula densities using histograms h_1 and h_2, which are converted into signatures s_1 = {(p_i, w_{p_i})}_{i=1}^n and s_2 = {(q_i, w_{q_i})}_{i=1}^n, where p_i, q_i ∈ [0, 1]^d are the central positions of the bins of h_1 and h_2 respectively, and w_{p_i}, w_{q_i} are the corresponding bin frequencies. We define the intra-dependence distance D_intra(X_1, X_2) := EMD(s_1, s_2), where [22]

    EMD(s_1, s_2) := min_f Σ_{1 ≤ i,j ≤ n} ||p_i − q_j|| f_ij

subject to

    f_ij ≥ 0,  1 ≤ i, j ≤ n,
    Σ_{j=1}^{n} f_ij ≤ w_{p_i},  1 ≤ i ≤ n,
    Σ_{i=1}^{n} f_ij ≤ w_{q_j},  1 ≤ j ≤ n,                    (1)
    Σ_{i=1}^{n} Σ_{j=1}^{n} f_ij = 1.

From a practical point of view, this distance is robust to the binning process (small misalignments of corresponding frequencies yield only a slightly larger distance), unlike standard bin-by-bin distances (where small misalignments yield a much larger distance) such as the Kullback-Leibler divergence for instance. However, its main drawback is the computational complexity. Finding the optimal transport between the n Diracs is an instance of the assignment problem. This fundamental combinatorial optimization problem is equivalent to finding a minimum weight matching in a complete weighted bipartite graph K_{n,n}, which can be solved by the Hungarian algorithm in O(n³).

In Fig. 2, we display three bivariate empirical copulas computed on real data (these copulas encode the dependence between two maturities of credit default swaps for three different entities). We can notice strong positive dependence (expressed by the diagonals), yet these copulas exhibit distinct dependence patterns which may not be taken into account by correlation coefficients (cf. the tutorial at www.datagrapple.com/Tech for an application of this methodology to credit default swaps time series).

Fig. 2. Optimal Copula Transport for measuring the intra-dependence similarity of two MTS; C1, C2, C3 are the empirical copulas of data X1, X2, X3 ∈ R^{2×T} respectively. According to the EMD, the dependence between the two coordinates of X1 is much more similar to that of X2 than to that of X3.
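To illustrate Eq. (1) end to end, the following sketch (ours, for the bivariate case) bins two empirical copulas into signatures and solves the transport program with the generic solver scipy.optimize.linprog; the helper names, the 10-bin default, and the solver choice are assumptions for exposition, and a dedicated solver (the Hungarian algorithm mentioned above, or Sinkhorn iterations [25]) would be preferable at scale. It reuses empirical_copula_transform from the sketch above.

```python
import numpy as np
from scipy.optimize import linprog

def copula_signature(U, n_bins=10):
    """Bin (T, 2) copula observations into an n_bins x n_bins histogram
    on [0, 1]^2; return the non-empty bin centers and their frequencies."""
    H, xe, _ = np.histogram2d(U[:, 0], U[:, 1],
                              bins=n_bins, range=[[0, 1], [0, 1]])
    centers = (xe[:-1] + xe[1:]) / 2               # same grid on both axes
    grid = np.array([(cx, cy) for cx in centers for cy in centers])
    w = H.ravel() / H.sum()
    keep = w > 0
    return grid[keep], w[keep]

def emd(p, wp, q, wq):
    """Earth Mover's Distance of Eq. (1) between signatures (p, wp), (q, wq).
    Both weight vectors sum to 1, so the marginal constraints are equalities."""
    n, m = len(wp), len(wq)
    cost = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1             # sum_j f_ij = w_{p_i}
    for j in range(m):
        A_eq[n + j, j::m] = 1                      # sum_i f_ij = w_{q_j}
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=np.concatenate([wp, wq]),
                  bounds=(0, None), method="highs")
    return res.fun

def d_intra(X1, X2, n_bins=10):
    """Intra-dependence distance D_intra between two (T, 2) datasets."""
    s1 = copula_signature(empirical_copula_transform(X1), n_bins)
    s2 = copula_signature(empirical_copula_transform(X2), n_bins)
    return emd(*s1, *s2)
```

On copulas such as C1, C2, C3 of Fig. 2, one would expect d_intra(X1, X2) < d_intra(X1, X3).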
4. CLUSTERING INTER-DEPENDENCE

We refer to the dependence between two d-variate time series as inter-dependence. In this section, we present our inter-dependence measure and compare it in Fig. 4 with the most popular coefficients in the literature (ACE, dCor, MIC, RDC).

Let Z^T = (X_1^T, ..., X_d^T, Y_1^T, ..., Y_d^T) ∈ R^{2d×T} be T stacked observations from the random vectors X = (X_1, ..., X_d) and Y = (Y_1, ..., Y_d). Informally, in order to estimate the dependence between X and Y, we use the relative position of the empirical copula C̃ built from the data Z^T on the shortest path that starts from the independence copula C_ind, passes through C̃, and ends at one of the copulas {C_i} encoding the target dependencies. This idea is depicted in Fig. 3 and benchmarked in Fig. 4.

Fig. 3. Optimal Copula Transport for estimating dependence between random variables; dependence can be seen as the relative distance between the independence copula and one or more target dependence copulas. In this picture, the target dependencies are "perfect dependence" and "perfect anti-dependence". The empirical copula (Data Copula) was built from positively correlated Gaussians, and is thus nearer to the "perfect dependence" copula (top right corner) than to the "perfect anti-dependence" copula (bottom left corner).

Definition. Target Dependencies Coefficient using Transport to Dependence Copulas. The Target Dependencies Coefficient using Transport to Dependence Copulas (TDC) is defined as

    TDC(X^T, Y^T) := EMD(C_ind, C̃) / ( EMD(C_ind, C̃) + min_i EMD(C̃, C_i) ).

For C̃ = C_ind, TDC = 0; for C̃ ∈ {C_i}, TDC = 1; otherwise TDC quantifies the relative nearness to independence or to the specified dependencies. As a bonus, practitioners can discover which specified dependence is "activated", i.e. which of their hypotheses about the dependence between X and Y seems the most likely (qualitative) and how strong it is (quantitative).
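The sketch below (ours, not the paper's benchmark protocol) instantiates the TDC for univariate X and Y with the two targets of Fig. 3, perfect dependence and perfect anti-dependence, discretized as uniform mass on the main and anti-diagonal bins; it reuses empirical_copula_transform, copula_signature and emd from the sketches above, and the bin count and target set {C_i} are illustrative choices since the targets are specified by the user.

```python
import numpy as np

def tdc(x, y, n_bins=10):
    """Target Dependencies Coefficient (TDC) for two univariate samples,
    with perfect dependence / anti-dependence as the target copulas."""
    U = empirical_copula_transform(np.column_stack([x, y]))
    p_emp, w_emp = copula_signature(U, n_bins)

    centers = (np.arange(n_bins) + 0.5) / n_bins
    # independence copula C_ind: uniform mass over all bins of [0, 1]^2
    grid = np.array([(cx, cy) for cx in centers for cy in centers])
    w_ind = np.full(n_bins * n_bins, 1.0 / n_bins**2)
    # target copulas {C_i}: all mass on the main / anti diagonal bins
    diag = np.column_stack([centers, centers])
    anti = np.column_stack([centers, centers[::-1]])
    w_tgt = np.full(n_bins, 1.0 / n_bins)

    d_ind = emd(p_emp, w_emp, grid, w_ind)
    d_tgt = min(emd(p_emp, w_emp, diag, w_tgt),
                emd(p_emp, w_emp, anti, w_tgt))
    return d_ind / (d_ind + d_tgt)

# Positively correlated Gaussians (as in Fig. 3) should score high,
# independent samples near 0.
rng = np.random.default_rng(0)
z = rng.multivariate_normal([0, 0], [[1, 0.9], [0.9, 1]], size=2000)
print(tdc(z[:, 0], z[:, 1]))                              # high
print(tdc(rng.normal(size=2000), rng.normal(size=2000)))  # low
```

Checking which of the two target EMDs attains the minimum tells the practitioner which hypothesis is "activated", as noted above.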

Fig. 4. Experiments based on [11] and [7]; power of the dependence estimators (cor, dCor, MIC, ACE, RDC, TDC) as a function of the noise level, for several deterministic patterns plus noise. Their power is the percentage of times that they are able to distinguish between dependent and independent samples. TDC (dark blue curves) can deal with complex dependence patterns in the presence of noise, provided they are specified in its targets.

5. DISCUSSION

The proposed methodology presents several benefits: it is non-parametric, robust and deterministic (optimal transport), and it relies on an accurate and generic representation of dependence (empirical copulas). Yet, it also has some scalability drawbacks: (i) in dimension, non-parametric density estimation suffers from the curse of dimensionality; (ii) in time, EMD is costly to compute (but there exist methods for speeding up its computation [25]). To alleviate drawback (i), parametric methods may be a solution; we consider optimal copula transport in statistical manifolds for further research.

6. REFERENCES

[1] T Warren Liao, "Clustering of time series data—a survey," Pattern Recognition, vol. 38, no. 11, pp. 1857–1874, 2005.
[2] Robert H Shumway, "Time-frequency clustering and discriminant analysis," Statistics & Probability Letters, vol. 63, no. 3, pp. 307–314, 2003.
[3] Jessica Lin, Eamonn Keogh, Li Wei, and Stefano Lonardi, "Experiencing SAX: a novel symbolic representation of time series," Data Mining and Knowledge Discovery, vol. 15, no. 2, pp. 107–144, 2007.
[4] Konstantinos Kalpakis, Dhiral Gada, and Vasundhara Puttagunta, "Distance measures for effective clustering of ARIMA time-series," in Proceedings of the IEEE International Conference on Data Mining (ICDM 2001). IEEE, 2001, pp. 273–280.
[5] Philippe Donnat, Gautier Marti, and Philippe Very, "Toward a generic representation of random variables for machine learning," Pattern Recognition Letters, vol. 70, pp. 24–31, 2016.
[6] Barnabás Póczos, Zoubin Ghahramani, and Jeff Schneider, "Copula-based kernel dependency measures," in ICML, 2012.
[7] David Lopez-Paz, Philipp Hennig, and Bernhard Schölkopf, "The randomized dependence coefficient," in NIPS, 2013.
[8] A Adam Ding and Yi Li, "Copula correlation: An equitable dependence measure and extension of Pearson's correlation," arXiv preprint arXiv:1312.7214, 2013.
[9] Justin B Kinney and Gurinder S Atwal, "Equitability, mutual information, and the maximal information coefficient," Proceedings of the National Academy of Sciences, vol. 111, no. 9, pp. 3354–3359, 2014.
[10] Gautier Marti, Philippe Very, Philippe Donnat, and Frank Nielsen, "A proposal of a methodological framework with experimental guidelines to investigate clustering stability on financial time series," in IEEE ICMLA, 2015.
[11] Noah Simon and Robert Tibshirani, "Comment on 'Detecting novel associations in large data sets' by Reshef et al., Science Dec 16, 2011," arXiv preprint arXiv:1401.7645, 2014.
[12] Tamraparni Dasu, Deborah F Swayne, and David Poole, "Grouping multivariate time series: A case study," in Proceedings of the IEEE Workshop on Temporal Data Mining: Algorithms, Theory and Applications, in conjunction with the Conference on Data Mining, Houston, 2005, pp. 25–32.
[13] Kiyoung Yang and Cyrus Shahabi, "A PCA-based similarity measure for multivariate time series," in Proceedings of the 2nd ACM International Workshop on Multimedia Databases. ACM, 2004, pp. 65–74.
[14] Ashish Singhal and Dale E Seborg, "Clustering of multivariate time-series data," Journal of Chemometrics, vol. 19, pp. 427–438, 2005.
[15] Christian Genest, JJ Molina, and JA Lallena, "De l'impossibilité de construire des lois à marges multidimensionnelles données à partir de copules," Comptes rendus de l'Académie des sciences. Série 1, Mathématique, vol. 320, no. 6, pp. 723–726, 1995.
[16] Gal Elidan, "Copulas in machine learning," in Copulae in Mathematical and Quantitative Finance, pp. 39–60. Springer, 2013.
[17] A Sklar, "Fonctions de répartition à n dimensions et leurs marges," Publications de l'Institut de Statistique de l'Université de Paris, vol. 8, pp. 229–231, 1959.
[18] Paul Deheuvels, "La fonction de dépendance empirique et ses propriétés. Un test non paramétrique d'indépendance," Acad. Roy. Belg. Bull. Cl. Sci. (5), vol. 65, no. 6, pp. 274–292, 1979.
[19] Paul Deheuvels, "An asymptotic decomposition for multivariate distribution-free tests of independence," Journal of Multivariate Analysis, vol. 11, no. 1, pp. 102–113, 1981.
[20] Cédric Villani, Optimal Transport: Old and New, vol. 338, Springer Science & Business Media, 2008.
[21] Leonid Vitalievich Kantorovich, "On the translocation of masses," in Dokl. Akad. Nauk SSSR, 1942, vol. 37, pp. 199–201.
[22] Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas, "The earth mover's distance as a metric for image retrieval," International Journal of Computer Vision, vol. 40, no. 2, pp. 99–121, 2000.
[23] Sira Ferradans, Nicolas Papadakis, Julien Rabin, Gabriel Peyré, and Jean-François Aujol, Regularized Discrete Optimal Transport, Springer, 2013.
[24] Keith Henderson, Brian Gallagher, and Tina Eliassi-Rad, "EP-MEANS: An efficient nonparametric clustering of empirical probability distributions," 2015.
[25] Marco Cuturi, "Sinkhorn distances: Lightspeed computation of optimal transport," in Advances in Neural Information Processing Systems, 2013, pp. 2292–2300.