Over Sampling for Time Series Classification

Matthew F. Dixon, Diego Klabjan and Lan Wei

2017-11-05

Contents

Abstract
Introduction
Overview
Background
Functionality
Examples
    Data loading & oversampling
    Applying OSTSC to medium size datasets
    Evaluating OSTSC on the large datasets
Summary
References

Abstract

The OSTSC package provides an oversampling approach for classifying univariate, but multinomial, time series data. This vignette provides a brief overview of the oversampling methodology implemented by the package, together with a tutorial on its use. We begin by providing three test cases for the user to quickly validate the functionality in the package. To demonstrate the performance impact of OSTSC, we provide two medium size imbalanced time series datasets. Each example applies a TensorFlow implementation of a Long Short-Term Memory (LSTM) classifier - a type of Recurrent Neural Network (RNN) classifier - to imbalanced time series. The classifier performance is compared with and without oversampling. Finally, larger versions of these two datasets are evaluated to demonstrate the scalability of the package. The examples demonstrate that the OSTSC package improves the performance of RNN classifiers applied to highly imbalanced time series data. In particular, OSTSC is observed to increase the AUC of an LSTM classifier from 0.51 to 0.786 on a high frequency trading dataset consisting of 30,000 time series observations.

Introduction

A significant number of learning problems involve the accurate classification of rare events or outliers from time series data. Examples include the detection of a flash crash, rogue trading, or heart arrhythmia from an electrocardiogram. Due to the rarity of these events, machine learning classifiers for detecting them may be biased towards avoiding false positives, because any potential for false positives is greatly exaggerated by the number of negative samples in the data set.

Class imbalance problems are most easily addressed by treating the observations as conditionally independent. Then standard statistical techniques, such as oversampling the minority class or undersampling the majority class, or both, are applicable. More (2016) compared the classification performance of a range of resampling techniques on imbalanced datasets. Besides the conventional resampling approaches, More showed how ensemble methods retain as much original information from the majority class as possible when performing undersampling. Ensemble methods perform well and have gained popularity in the data mining literature. Dubey et al. (2014) studied an ensemble system of feature selection and data sampling from an imbalanced Alzheimer's Disease Neuroimaging Initiative dataset.
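To make the contrast with the time series setting concrete, the following minimal R sketch (an illustration only, not part of the OSTSC package) shows conventional random oversampling when the observations are treated as conditionally independent: minority observations are simply resampled with replacement until the class counts match, with no attempt to preserve any temporal or covariance structure. The toy data and variable names are assumptions made for this illustration.

# Conventional random oversampling under conditional independence
# (illustration only; not part of the OSTSC package).
set.seed(1)
n_major <- 950; n_minor <- 50
x <- c(rnorm(n_major, mean = 0), rnorm(n_minor, mean = 3))    # one predictor
y <- c(rep(0, n_major), rep(1, n_minor))                      # imbalanced labels
minority <- which(y == 1)
extra <- sample(minority, n_major - n_minor, replace = TRUE)  # resample minority rows
x_balanced <- c(x, x[extra])
y_balanced <- c(y, y[extra])
table(y_balanced)                                             # both classes now have 950 observations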
However, the imbalanced time series classification problem is more complex when the time dimension needs to be accounted for. Not only is the assumption that the observations are conditionally independent too strong, but the predictors may also be cross-correlated. The sample correlation structure may weaken or be entirely lost under application of the conventional resampling approaches described above. There are two existing research directions for imbalanced time series classification. One, proposed by Cao et al. (2011), is to preserve the covariance structure during oversampling. Another, proposed by Liang and Zhang (2012), is to conduct undersampling with various learning algorithms. Both approaches are limited to binary classification and do not consider the more general problem of multinomial classification.

A key assertion by Cao, Tan, and Pang (2014) is that a time series sampling scheme should preserve the covariance structure. When the observations are conditionally dependent, this approach has been shown to outperform other sampling approaches such as undersampling the majority class, oversampling the minority class, and SMOTE. Our R package Over Sampling for Time Series Classification (OSTSC) is built on this idea. OSTSC first implements Enhanced Structure Preserving Oversampling (ESPO) of the minority class. It then uses a nearest neighbor method from the SMOTE family, the Adaptive Synthetic Sampling Approach for Imbalanced Learning (ADASYN), to generate synthetic positives. Note that other packages, such as Siriseriwan (2017), already implement SMOTE sampling techniques, including ADASYN. However, an implementation of ADASYN has been provided in OSTSC for compatibility with the format required for use with ESPO and TensorFlow.

For examining the performance of oversampling for time series classification, RNNs are preferred (Graves (2013)). Recently, Dixon (2017) applied RNNs to imbalanced time series data used in high frequency trading. The RNN classifier predicts a price-flip in the limit order book based on a sequence of limit order book depths and market orders. The approach uses standard undersampling of the majority class to improve the classifier performance. OSTSC provides a univariate sample of this data and demonstrates the application of ESPO and ADASYN to improve the performance of the RNN. The RNN is implemented in ‘TensorFlow’ (Abadi et al. (2016)) and made available in R by using a wrapper for ‘Keras’ (Allaire and Chollet (2017)), a high-level API for ‘TensorFlow’. The current version of the package only supports univariate classification of time series. The extension to multiple features requires tensor computations which are not implemented here.

Overview

This vignette provides a brief description of the sampling methodologies implemented. We introduce the OSTSC package and illustrate its application using various examples. For validation purposes only, we first apply OSTSC to three small built-in toy datasets. These datasets are not sufficiently large to demonstrate the methodology, but they can be used to quickly verify that the OSTSC function generates a balanced dataset. To demonstrate the effect of OSTSC on LSTM performance, we provide two medium size datasets that can be processed with moderate computation. Finally, to demonstrate scalability, we evaluate OSTSC on two larger datasets. The reader is advised that the total amount of computation in this case is significant. We would therefore expect a user to test the OSTSC functionality on the small or medium size datasets, and to reserve the larger dataset examples for a higher performance machine. The medium and large datasets are not built in, in order to keep the package size within 5MB.

Background

ESPO is used to generate a large percentage of the synthetic minority samples from univariate labeled time series, under the modeling assumption that the predictors are Gaussian. ESPO estimates the covariance structure of the minority-class samples and applies a spectral filter to reduce noise. ADASYN is a nearest neighbor interpolation approach which is subsequently applied to the ESPO samples (Cao et al. (2013)).
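To fix intuition before the formal description below, the following schematic R sketch illustrates the nearest neighbor interpolation idea behind ADASYN: each minority sample receives a number of synthetic replicates proportional to the share of majority samples among its k nearest neighbors, and each replicate is generated by interpolating towards a randomly chosen minority neighbor. This is an illustration only, not the package's internal implementation (in OSTSC the interpolation is applied to the ESPO-transformed samples); the toy data, variable names, and use of plain Euclidean distance are assumptions made for the sketch.

# Schematic ADASYN-style interpolation (illustration only; not OSTSC internals).
set.seed(1)
pos <- matrix(rnorm(10 * 8, mean = 2), nrow = 10)     # 10 minority series of length 8
neg <- matrix(rnorm(200 * 8, mean = 0), nrow = 200)   # 200 majority series of length 8
k <- 5
all_x  <- rbind(pos, neg)
is_neg <- c(rep(FALSE, nrow(pos)), rep(TRUE, nrow(neg)))
# Weight for each minority sample: share of majority samples among its k nearest neighbours
gamma <- sapply(seq_len(nrow(pos)), function(i) {
  d  <- sqrt(colSums((t(all_x) - pos[i, ])^2))        # Euclidean distance to every sample
  nn <- order(d)[2:(k + 1)]                           # k nearest neighbours, excluding itself
  sum(is_neg[nn]) / k
})
gamma <- if (sum(gamma) > 0) gamma / sum(gamma) else rep(1 / length(gamma), length(gamma))
n_new  <- 50                                          # total synthetic samples requested
counts <- round(gamma * n_new)                        # replicates per minority sample
synth <- do.call(rbind, lapply(seq_len(nrow(pos)), function(i) {
  if (counts[i] == 0) return(NULL)
  d_pos  <- sqrt(colSums((t(pos) - pos[i, ])^2))      # distances to the other minority samples
  nn_pos <- order(d_pos)[2:(k + 1)]
  t(sapply(seq_len(counts[i]), function(j) {
    nb <- pos[sample(nn_pos, 1), ]                    # pick one minority neighbour
    pos[i, ] + runif(1) * (nb - pos[i, ])             # interpolate towards it
  }))
}))
dim(synth)                                            # roughly n_new synthetic minority series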
More formally, given the time series of positive labeled predictors $P = \{x_{11}, x_{12}, \dots, x_{1|P|}\}$ and the negative time series $N = \{x_{01}, x_{02}, \dots, x_{0|N|}\}$, where $|N| \gg |P|$ and $x_{ij} \in \mathbb{R}^{n \times 1}$, the new samples are generated by the following steps:

1. Removal of the Common Null Space

Let $q_{ij} = L_s^T x_{ij}$ represent $x_{ij}$ in a lower-dimensional signal space, where $L_s$ consists of eigenvectors in the signal space.

2. ESPO

Let $\hat{D}$ denote the diagonal matrix of regularized eigenvalues $\{\hat{d}_1, \dots, \hat{d}_n\}$, organized in descending order. Let $V$ be the orthogonal eigenvector matrix from the positive-class covariance matrix

$$W_p = \frac{1}{|P|} \sum_{j=1}^{|P|} (q_{1j} - \bar{q}_1)(q_{1j} - \bar{q}_1)^T.$$

Let $b$ denote the synthetic positive sample to be generated. The transformed version of $b$ follows a zero-mean mixed Gaussian distribution which we denote as $z = \hat{F}(b - \bar{q}_1)$, where $\hat{F} = V \hat{D}^{-1/2}$ and $\bar{q}_1$ is the corresponding positive-class mean vector. Substituting the definition of $\hat{F}$ into the expression for $z$ and rearranging gives

$$z = \hat{F}(b - \bar{q}_1)$$
$$z = V \hat{D}^{-1/2} (b - \bar{q}_1)$$
$$\hat{D}^{1/2} V^T z = b - \bar{q}_1$$
$$b = \hat{D}^{1/2} V^T z + \bar{q}_1,$$

which is used to generate $b$ once $z$ is drawn from the mixed Gaussian distribution. The oversampling is repeated until all $(|N| - |P|)r$ required synthetic samples are generated, where $r \in [0, 1]$ is the integration percentage of synthetic samples contributed by ESPO, which is chosen empirically. The remaining $(1 - r)$ percentage of the samples is generated by the interpolation procedure described next.

3. ADASYN

Given the transformed positive data $P_t = \{q_{1i}\}$ and negative data $N_t = \{q_{0j}\}$, each sample $q_{1i}$ is replicated $\Gamma_i = |S_{i:k\text{-}NN} \cap N_t| / Z$ times, where $S_{i:k\text{-}NN}$ is the set of this sample's $k$ nearest neighbors in the entire dataset, and $Z$ is a normalization factor so that $\sum_{i=1}^{|P_t|} \Gamma_i = 1$. See Cao et al. (2013) for further technical details of this approach.

Functionality

The package imports ‘parallel’ (R Core Team (2017)), ‘doParallel’ (Microsoft Corporation and Weston (2017b)), ‘doSNOW’ (Microsoft Corporation and Weston (2017a)) and ‘foreach’ (Revolution Analytics and Weston (2015)) for multi-threaded execution on shared memory architectures. Parallel execution is strongly suggested for datasets consisting of at least 30,000 observations. OSTSC also imports ‘mvrnorm’ from ‘MASS’ (Venables and Ripley (2002)) to generate random vectors from the multivariate normal distribution, and ‘rdist’ from ‘fields’ (Douglas Nychka et al. (2015)) to calculate the Euclidean distance between vectors and matrices. This vignette displays some simple examples below. For calling the RNN and examining the classifier's performance, ‘keras’ (Allaire and Chollet (2017)), ‘dummies’