Embedding and learning with signatures Adeline Fermaniana aSorbonne Universit´e,CNRS, Laboratoire de Probabilit´es,Statistique et Mod´elisation,4 place Jussieu, 75005 Paris, France,
[email protected] Abstract Sequential and temporal data arise in many fields of research, such as quantitative fi- nance, medicine, or computer vision. A novel approach for sequential learning, called the signature method and rooted in rough path theory, is considered. Its basic prin- ciple is to represent multidimensional paths by a graded feature set of their iterated integrals, called the signature. This approach relies critically on an embedding prin- ciple, which consists in representing discretely sampled data as paths, i.e., functions from [0, 1] to Rd. After a survey of machine learning methodologies for signatures, the influence of embeddings on prediction accuracy is investigated with an in-depth study of three recent and challenging datasets. It is shown that a specific embedding, called lead-lag, is systematically the strongest performer across all datasets and algorithms considered. Moreover, an empirical study reveals that computing signatures over the whole path domain does not lead to a loss of local information. It is concluded that, with a good embedding, combining signatures with other simple algorithms achieves results competitive with state-of-the-art, domain-specific approaches. Keywords: Sequential data, time series classification, functional data, signature. 1. Introduction Sequential or temporal data are arising in many fields of research, due to an in- crease in storage capacity and to the rise of machine learning techniques. An illustra- tion of this vitality is the recent relaunch of the Time Series Classification repository (Bagnall et al., 2018), with more than a hundred new datasets.