Over Sampling for Time Series Classification Matthew F. Dixon, Diego Klabjan and Lan Wei 2017-11-05

Contents

Abstract
Introduction
Overview
Background
Functionality
Examples
    Data loading & oversampling
    Applying OSTSC to medium size datasets
    Evaluating OSTSC on the large datasets
Summary
References

Abstract

The OSTSC package is a powerful oversampling approach for classifying univariate, but multinomial time series data. This vignette provides a brief overview of the oversampling methodology implemented by the package, followed by a tutorial. We begin by providing three test cases for the user to quickly validate the functionality in the package. To demonstrate the performance impact of OSTSC, we provide two medium size imbalanced time series datasets. Each example applies a TensorFlow implementation of a Long Short-Term Memory (LSTM) classifier, a type of recurrent neural network (RNN) classifier, to imbalanced time series. The classifier performance is compared with and without oversampling. Finally, two larger datasets are evaluated to demonstrate the scalability of the package. The examples demonstrate that the OSTSC package improves the performance of RNN classifiers applied to highly imbalanced time series data. In particular, OSTSC is observed to increase the AUC of LSTM from 0.51 to 0.786 on a high frequency trading dataset consisting of 30,000 time series observations.

Introduction

A significant number of learning problems involve the accurate classification of rare events or outliers from time series data. Examples include the detection of a flash crash, rogue trading, or heart arrhythmia from an electrocardiogram. Due to the rarity of these events, classifiers for detecting them may be biased towards avoiding false positives. This is because any potential for false positives is greatly exaggerated by the number of negative samples in the data set. Class imbalance problems are most easily addressed by treating the observations as conditionally independent. Then standard statistical techniques, such as oversampling the minority class or undersampling the majority class, or both, are applicable. More (2016) compared the classification performance of a range of resampling techniques on imbalanced datasets. Besides the conventional resampling approaches, More showed how ensemble methods retain as much original information from the majority class as possible when performing undersampling. Ensemble methods perform well and have gained popularity in the data mining literature. Dubey et al. (2014) studied an ensemble system of feature selection and data sampling from an imbalanced Alzheimer’s Disease Neuroimaging Initiative dataset.

However, the imbalanced time series classification problem is more complex when the time dimension needs to be accounted for. Not only is the assumption that the observations are conditionally independent too strong, but the predictors may also be cross-correlated. The sample correlation structure may weaken or be entirely lost under application of the conventional resampling approaches described above. There are two existing research directions for imbalanced time series classification. One is to preserve the covariance structure during oversampling, proposed by Cao et al. (2011). Another is to conduct undersampling with various learning algorithms, proposed by Liang and Zhang (2012). Both approaches are limited to binary classification and do not consider the more general problem of multi-class classification. A key assertion by Cao, Tan, and Pang (2014) is that a time series sampling scheme should preserve the covariance structure. When the observations are conditionally dependent, this approach has been shown to outperform other sampling approaches such as undersampling the majority class, oversampling the minority class, and SMOTE. Our package Over Sampling for Time Series Classification (OSTSC) is built on this idea. OSTSC first implements Enhanced Structure Preserving Oversampling (ESPO) of the minority class. It then uses a nearest neighbor method from the SMOTE family to generate synthetic positives. Specifically, it uses an Adaptive Synthetic Sampling Approach for Imbalanced Learning (ADASYN). Note that other packages such as Siriseriwan (2017) already implement SMOTE sampling techniques, including ADASYN. However, an implementation of ADASYN has been provided in OSTSC for compatibility with the format required for use with ESPO and TensorFlow. For examining the performance of oversampling for time series classification, RNNs are preferred (Graves (2013)).
Recently Dixon (2017) applied RNNs to imbalanced time series data used in high frequency trading. The RNN classifier predicts a price-flip in the limit order book based on a sequence of limit order book depths and market orders. The approach uses standard undersampling of the majority class to improve the classifier performance. OSTSC provides a univariate sample of this data and demonstrates the application of ESPO and ADASYN to improve the performance of the RNN. The RNN is implemented in ‘TensorFlow’ (Abadi et al. (2016)) and made available in R by using a wrapper for ‘keras’ (Allaire and Chollet (2017)), a high-level API for ‘TensorFlow’. The current version of the package supports only univariate classification of time series. The extension to multiple features requires tensor computations which are not implemented here.

Overview

This vignette provides a brief description of the sampling methodologies implemented. We introduce the OSTSC package and illustrate its application using various examples. For validation purposes only, we first apply OSTSC to three small built-in toy datasets. These datasets are not sufficiently large to demonstrate the methodology. However, they can be used to quickly verify that the OSTSC function generates a balanced dataset. For demonstrating the effect of OSTSC on LSTM performance, we provide two medium size datasets that can be processed with moderate computation. Finally, to demonstrate scalability, we evaluate OSTSC on two larger datasets. The reader is advised that the total amount of computation in this case is significant. We would therefore expect a user to test the OSTSC functionality on the small or medium size datasets, and to reserve the larger dataset examples for a higher performance machine. The medium and large datasets are not built in, to keep the package size within 5MB.

Background

ESPO is used to generate a large percentage of the synthetic minority samples from univariate labeled time series under the modeling assumption that the predictors are Gaussian. ESPO estimates the covariance structure of the minority-class samples and applies a spectral filter to reduce noise. ADASYN is a nearest neighbor interpolation approach which is subsequently applied to the ESPO samples (Cao et al. (2013)).
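The idea of drawing synthetic minority samples from an estimated Gaussian can be illustrated with a few lines of base R. This is a simplified sketch and not the OSTSC implementation: the function name sample_gaussian_minority and the eigenvalue floor floor_frac are hypothetical, the floor is only a stand-in for the regularization described above, and the sketch uses the textbook construction for drawing from a Gaussian with covariance V D V^T.

```r
# Simplified illustration only; this is NOT the OSTSC implementation.
# Draw synthetic minority samples from a Gaussian whose mean and covariance
# are estimated from the minority class, with small eigenvalues floored.
sample_gaussian_minority <- function(minority, n_new, floor_frac = 0.01) {
  mu <- colMeans(minority)
  eig <- eigen(cov(minority), symmetric = TRUE)
  # Crude regularization: floor eigenvalues at a fraction of the largest one
  # (floor_frac is a hypothetical tuning parameter).
  d <- pmax(eig$values, floor_frac * max(eig$values))
  # Each row of z is one standard normal draw in the eigenbasis.
  z <- matrix(rnorm(n_new * length(d)), nrow = n_new)
  # Row-wise version of mu + V D^{1/2} z.
  scaled <- z %*% diag(sqrt(d), nrow = length(d)) %*% t(eig$vectors)
  sweep(scaled, 2, mu, "+")
}

set.seed(1)
minority <- matrix(rnorm(20 * 5), nrow = 20)  # 20 toy series of length 5
synthetic <- sample_gaussian_minority(minority, n_new = 30)
dim(synthetic)
```

Each row of the result is one synthetic series with the same length as the input series.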

More formally, given the time series of positive labeled predictors $P = \{x_{11}, x_{12}, \dots, x_{1|P|}\}$ and the negative time series $N = \{x_{01}, x_{02}, \dots, x_{0|N|}\}$, where $|N| \gg |P|$ and $x_{ij} \in \mathbb{R}^{n \times 1}$, the new samples are generated by the following steps:

1. Removal of the Common Null Space

Let $q_{ij} = L_s^T x_{ij}$ represent $x_{ij}$ in a lower-dimensional signal space, where $L_s$ consists of eigenvectors in the signal space.

2. ESPO

Let $\hat{D}$ denote the diagonal matrix of regularized eigenvalues $\{\hat{d}_1, \dots, \hat{d}_n\}$ organized in descending order. Let $V$ be the orthogonal eigenvector matrix from the positive-class covariance matrix:

$$W_p = \frac{1}{|P|} \sum_{j=1}^{|P|} (q_{1j} - \bar{q}_1)(q_{1j} - \bar{q}_1)^T.$$

Let $b$ denote the synthetic positive sample to be generated. The transformed version of $b$ follows a zero-mean mixed Gaussian distribution which we denote as $z = \hat{F}(b - \bar{q}_1)$, where $\hat{F} = V \hat{D}^{-1/2}$ and $\bar{q}_1$ is the corresponding positive-class mean vector. Substituting the definition of $\hat{F}$ into the expression for $z$ and rearranging gives

$$
\begin{aligned}
z &= \hat{F}(b - \bar{q}_1) \\
z &= V \hat{D}^{-1/2} (b - \bar{q}_1) \\
\hat{D}^{1/2} V^T z &= b - \bar{q}_1 \\
b &= \hat{D}^{1/2} V^T z + \bar{q}_1
\end{aligned}
$$

which is used to generate $b$ once $z$ is drawn from the mixed Gaussian distribution. The oversampling is repeated until all $(|N| - |P|) r$ required synthetic samples are generated, where $r \in [0, 1]$ is the integration percentage of synthetic samples contributed by ESPO, which is chosen empirically. The remaining $(1 - r)$ percentage of the samples are generated by the interpolation procedure described next.

3. ADASYN

Given the transformed positive data $P_t = \{q_{1i}\}$ and negative data $N_t = \{q_{0j}\}$, each sample $q_{1i}$ is replicated $\Gamma_i = |S_{i:k\text{-}NN} \cap N_t| / Z$ times, where $S_{i:k\text{-}NN}$ is this sample’s kNN in the entire dataset and $Z$ is a normalization factor so that $\sum_{i=1}^{|P_t|} \Gamma_i = 1$. See Cao et al. (2013) for further technical details of this approach.
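The ADASYN weights can be sketched directly from their definition: for each positive sample, count how many of its k nearest neighbors in the whole dataset are negative, then normalize the counts to sum to one. The sketch below uses base R only. The function name adasyn_weights, the toy data and the uniform fallback when no positive sample has a negative neighbor are assumptions of this illustration, not part of the package.

```r
# Illustrative sketch of the ADASYN weighting step; not the OSTSC code.
adasyn_weights <- function(positive, negative, k = 5) {
  all_data <- rbind(positive, negative)
  n_pos <- nrow(positive)
  is_negative <- seq_len(nrow(all_data)) > n_pos
  d <- as.matrix(dist(all_data))  # pairwise Euclidean distances
  counts <- vapply(seq_len(n_pos), function(i) {
    # k nearest neighbours of sample i in the entire dataset, excluding itself
    nearest <- setdiff(order(d[i, ]), i)[seq_len(k)]
    as.numeric(sum(is_negative[nearest]))
  }, numeric(1))
  # Assumed fallback: uniform weights if no positive sample has a negative neighbour
  if (sum(counts) == 0) return(rep(1 / n_pos, n_pos))
  counts / sum(counts)  # division by Z so that the weights sum to 1
}

set.seed(42)
positive <- matrix(rnorm(6 * 4, mean = 1), nrow = 6)   # 6 toy positive samples
negative <- matrix(rnorm(40 * 4), nrow = 40)           # 40 toy negative samples
gamma_i <- adasyn_weights(positive, negative, k = 5)
gamma_i
```

Positive samples surrounded by many negatives receive larger weights, so more synthetic samples are generated near the class boundary.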

Functionality

The package imports ‘parallel’ (R Core Team (2017)), ‘doParallel’ (Microsoft Corporation and Weston (2017b)), ‘doSNOW’ (Microsoft Corporation and Weston (2017a)) and ‘foreach’ ( and Weston (2015)) for multi-threaded execution on shared memory architectures. Parallel execution is strongly suggested for datasets consisting of at least 30,000 observations. OSTSC also imports ‘mvrnorm’ from ‘MASS’ (Venables and Ripley (2002)) to generate random vectors from the multivariate normal distribution, and ‘rdist’ from ‘fields’ (Douglas Nychka et al. (2015)) in order to calculate the Euclidean distance between vectors and matrices. This vignette displays some simple examples below. For calling the RNN and examining the classifier’s performance, ‘keras’ (Allaire and Chollet (2017)), ‘dummies’ (Brown (2012)) and ‘pROC’ (Robin et al. (2011)) are required.
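For readers unfamiliar with the distance computation mentioned above, the following base R sketch computes the Euclidean distance between every row of one matrix and every row of another. The helper pairwise_euclidean is a hypothetical name introduced for illustration; it is not part of ‘fields’ or OSTSC.

```r
# Euclidean distance between each row of a and each row of b (base R sketch).
pairwise_euclidean <- function(a, b) {
  a <- as.matrix(a)
  b <- as.matrix(b)
  stopifnot(ncol(a) == ncol(b))
  # ||a_i - b_j||^2 = ||a_i||^2 + ||b_j||^2 - 2 a_i . b_j
  sq <- outer(rowSums(a^2), rowSums(b^2), "+") - 2 * tcrossprod(a, b)
  sqrt(pmax(sq, 0))  # clamp tiny negative values caused by rounding
}

a <- rbind(c(0, 0), c(3, 4))
b <- rbind(c(0, 0), c(6, 8), c(3, 0))
dists <- pairwise_euclidean(a, b)
dists
```

The result has one row per row of a and one column per row of b.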

Examples

Data loading & oversampling

The OSTSC package provides three small built-in datasets for verification that OSTSC has correctly installed and generates balanced time series. The first two examples use OSTSC to balance binary data while the third balances multinomial data.

The synthetically generated control dataset

The dataset Dataset_Synthetic_Control is a time series of sensor measurements of human body motion generated by Alcock et al. (1999). We introduce the following labeling: Class 1 represents the ‘Normal’ state, while Class 0 represents one of ‘Cyclic’, ‘Increasing trend’, ‘Decreasing trend’, ‘Upward shift’ or ‘Downward shift’ (Pham and Chan (1998)). Users load the dataset by calling data().

    library(OSTSC)
    data(Dataset_Synthetic_Control)

    train.label <- Dataset_Synthetic_Control$train.y
    train.sample <- Dataset_Synthetic_Control$train.x
    test.label <- Dataset_Synthetic_Control$test.y
    test.sample <- Dataset_Synthetic_Control$test.x

Each row of the dataset is a sequence of observations. The sequence is of length 60 and there are 300 observations.

    dim(train.sample)

    ## [1] 300 60

The imbalance ratio of the training data is 1:5.

    table(train.label)

    ## train.label
    ##   0   1
    ## 250  50

We now provide a simple example demonstrating oversampling of the minority data to match the number of observations of the majority class. The output ‘MyData’ stores the samples (a.k.a. features) and labels. There are ten parameters in the OSTSC function, the details of which can be found in the help documentation. Calling the OSTSC function requires the user to provide at least the labels and sample data; the other parameters have default values. It is important to note that the labels are separated from the samples.

    MyData <- OSTSC(train.sample, train.label, parallel = FALSE)
    over.sample <- MyData$sample
    over.label <- MyData$label

The positive and negative observations are now balanced. Let us check the (im)balance of the new dataset.

    table(over.label)

    ## over.label
    ##   0   1
    ## 250 250

The minority class data is oversampled to produce a balanced feature set. The minority-majority formation uses a one-vs-rest strategy. For this binary dataset, the Class 1 data has been oversampled to yield the same number of observations as Class 0.

    dim(over.sample)

    ## [1] 500 60
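The imbalance ratios quoted in this vignette can be read off the label counts. A minimal sketch follows, using a hypothetical helper imbalance_ratio and a toy label vector with the same class counts as the training data above.

```r
# Ratio of each class count to the smallest class count.
imbalance_ratio <- function(labels) {
  counts <- table(labels)
  round(counts / min(counts), 1)
}

# Toy labels with the same counts as the example above (250 negatives, 50 positives).
toy.label <- c(rep(0, 250), rep(1, 50))
ratio <- imbalance_ratio(toy.label)
ratio
```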

The automatic diatoms identification dataset

The dataset Dataset_Adiac, generated from a pilot study identifying diatoms (unicellular algae) from images by Jalba, Wilkinson, and Roerdink (2004), originally has 37 classes. For the purpose of demonstrating OSTSC we selected only one class as the positive class (Class 1) and set all others as the negative class (Class 0) to form a highly imbalanced dataset. Users load the dataset into R by calling data().

    data(Dataset_Adiac)

    train.label <- Dataset_Adiac$train.y
    train.sample <- Dataset_Adiac$train.x
    test.label <- Dataset_Adiac$test.y
    test.sample <- Dataset_Adiac$test.x

The training dataset consists of 390 observations of a sequence of length 176.

    dim(train.sample)

    ## [1] 390 176

The imbalance ratio of the training data is 1:29.

    table(train.label)

    ## train.label
    ##   0   1
    ## 377  13

The OSTSC function generates a balanced dataset:

    MyData <- OSTSC(train.sample, train.label, parallel = FALSE)
    over.sample <- MyData$sample
    over.label <- MyData$label

table() provides a summary of the balanced dataset.

    table(over.label)

    ## over.label
    ##   0   1
    ## 377 377

The high frequency trading dataset

The OSTSC function provides support for multinomial classification. The user specifies which classes should be oversampled. Typically, oversampling is first applied to the minority class, the class with the least number of observations. The dataset Dataset_HFT300 is extracted from a real high frequency trading datafeed (Dixon (2017)). It contains a feature representing instantaneous liquidity imbalance using the best bid to ask ratio. The data is labeled so that Y = 1 for a next event mid-price up-tick, Y = −1 for a down-tick, and Y = 0 for no mid-price movement. Users load the dataset into the R environment by calling data().

    data(Dataset_HFT300)

    train.label <- Dataset_HFT300$y
    train.sample <- Dataset_HFT300$x

The sequence length is set to 10 and 300 sequence observations are randomly drawn for this example dataset.

    dim(train.sample)

    ## [1] 300 10

The imbalance ratio of the three class dataset is 1:48:1.

    table(train.label)

    ## train.label
    ##  -1   0   1
    ##   6 288   6

This example demonstrates the case when there are two minority classes and both are oversampled. The oversampling is processed using a one-vs-rest strategy, which means that each minority class is oversampled to the same count as the sum of the count of all other classes. This results in a slight imbalance in the total number of labels.

    MyData <- OSTSC(train.sample, train.label, parallel = FALSE)
    over.sample <- MyData$sample
    over.label <- MyData$label

We observe the ratio of the classes after oversampling.

    table(over.label)

    ## over.label
    ##  -1   0   1
    ## 294 288 294

The above examples illustrate how OSTSC oversamples small datasets. In the next section, we demonstrate and evaluate the oversampled data on two medium size datasets.
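The class counts produced by the one-vs-rest strategy can be anticipated before running OSTSC. The sketch below assumes, based only on the outputs shown in this vignette, that every class smaller than the rest combined is oversampled up to the combined count of the other classes, while larger classes are left unchanged. The helper one_vs_rest_targets is hypothetical and does not call the package.

```r
# Expected class counts after one-vs-rest oversampling (assumed rule, see above).
one_vs_rest_targets <- function(labels) {
  tab <- table(labels)
  counts <- setNames(as.vector(tab), names(tab))
  rest <- sum(counts) - counts  # combined count of all other classes
  pmax(counts, rest)
}

# Toy labels with the same counts as Dataset_HFT300 above.
toy.label <- c(rep(-1, 6), rep(0, 288), rep(1, 6))
targets <- one_vs_rest_targets(toy.label)
targets
```

For these counts the sketch reproduces the 294, 288 and 294 observations reported by table(over.label) above.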

Applying OSTSC to medium size datasets

The Electrical Devices dataset

The dataset Dataset_ElectricalDevices is a sample collected from the ‘Powering the Nation’ study (Lines et al. (2011)). This study seeks to reduce the UK’s carbon footprint by collecting behavioural data on how consumers use electricity within the home. Each class represents a signal from a different electrical device. Classes 5 and 6 in the original dataset are set as the negative and positive classes respectively. The dataset is split into training and testing features and labels.

    ElectricalDevices <- Dataset_ElectricalDevices()

    train.label <- ElectricalDevices$train.y
    train.sample <- ElectricalDevices$train.x
    test.label <- ElectricalDevices$test.y
    test.sample <- ElectricalDevices$test.x

Each row in the data represents a sequence of length 96.

    dim(train.sample)

    ## [1] 2915 96

The imbalance ratio of the training data is 1:4.7.

    table(train.label)

    ## train.label
    ##    5    6
    ## 2406  509

After oversampling with OSTSC, the positive and negative observations are balanced.

    MyData <- OSTSC(train.sample, train.label, parallel = FALSE)
    over.sample <- MyData$sample
    over.label <- MyData$label

    table(over.label)

    ## over.label
    ##    5    6
    ## 2406 2406

Here, an LSTM classifier is used as the basis for performance assessment of oversampling with OSTSC. We use ‘keras’ (Allaire and Chollet (2017)) to configure the architecture, hyper-parameters and learning schedule of the LSTM classifier for sequence classification. As a baseline for OSTSC, we assess the performance of the LSTM model trained on the original unbalanced data. The procedure for applying Keras is outlined next:

1. One-hot encode the categorical label vectors as binary class matrices using the Keras ‘to_categorical()’ function. Then transform the feature matrices to tensors for LSTM.

    library(keras)
    train.y <- to_categorical(train.label)
    test.y <- to_categorical(test.label)
    train.x <- array(train.sample, dim = c(dim(train.sample), 1))
    test.x <- array(test.sample, dim = c(dim(test.sample), 1))

2. Initialize a sequential model, add layers and then compile it. Measure the losses and accuracies after each epoch and display them in Figure 1. The losses and accuracies indicate whether the model has been well trained.

    model <- keras_model_sequential()
    model %>%
      layer_lstm(10, input_shape = c(dim(train.x)[2], dim(train.x)[3])) %>%
      layer_dropout(rate = 0.2) %>%
      layer_dense(dim(train.y)[2]) %>%
      layer_dropout(rate = 0.2) %>%
      layer_activation("softmax")
    model %>% compile(
      loss = "categorical_crossentropy",
      optimizer = "adam",
      metrics = "accuracy"
    )
    lstm.before <- model %>% fit(
      x = train.x, y = train.y, validation_split = 0.2, epochs = 20
    )

    plot(lstm.before)

Figure 1: The loss and accuracy of the LSTM classifier trained on the unbalanced Electrical Devices dataset. Both metrics are evaluated at the end of each epoch.

We now train an identically configured LSTM classifier on the oversampled data, repeating Steps 1-3. First transform the data:

    over.y <- to_categorical(over.label)
    over.x <- array(over.sample, dim = c(dim(over.sample), 1))

Then construct and train the model. The loss and accuracy are measured over epochs and shown in Figure 2. From the losses and accuracies, we note that the model has not been well trained.

    model.over <- keras_model_sequential()
    model.over %>%
      layer_lstm(10, input_shape = c(dim(over.x)[2], dim(over.x)[3])) %>%
      layer_dropout(rate = 0.1) %>%
      layer_dense(dim(over.y)[2]) %>%
      layer_dropout(rate = 0.1) %>%
      layer_activation("softmax")
    model.over %>% compile(
      loss = "categorical_crossentropy",
      optimizer = "adam",
      metrics = "accuracy"
    )
    lstm.after <- model.over %>% fit(
      x = over.x, y = over.y, validation_split = 0.2, epochs = 20
    )
    plot(lstm.after)

Figure 2: The loss and accuracy of the LSTM classifier trained on the oversampled Electrical Devices dataset. Both metrics are evaluated at the end of each epoch.

In addition to the training history, Figures 3 and 4 compare the confusion matrices of the two models without and with oversampling.

    pred.label <- model %>% predict_classes(test.x)
    pred.label.over <- model.over %>% predict_classes(test.x)
    cm.before <- table(test.label, pred.label)
    cm.after <- table(test.label, pred.label.over)

              Predicted 5   Predicted 6
    True 5    1             0
    True 6    1             0

Figure 3: Normalized confusion matrix of LSTM applied to the Electrical Devices dataset without oversampling.

              Predicted 5   Predicted 6
    True 5    0.6597        0.3403
    True 6    0.1669        0.8331

Figure 4: Normalized confusion matrix of LSTM applied to the Electrical Devices dataset with oversampling.

Figure 5 compares the receiver operating characteristic (ROC) curves of the models. Before oversampling, the LSTM classifier performance is only marginally better than white noise. However, for the same number of epochs, oversampling improves the model performance.

Figure 5: ROC curves (true positive rate against false positive rate) comparing the effect of oversampling on the performance of LSTM applied to the Electrical Devices dataset. AUC without oversampling: 0.500. AUC with oversampling: 0.746.

The Electrocardiogram dataset

The dataset Dataset_ECG was originally created by Goldberger et al. (2000) and records heartbeats from patients with severe congestive heart failure. The dataset was pre-processed to extract heartbeat sequences and add labels by Chen et al. (2015). The OSTSC package uses 5,000 randomly selected heartbeat sequences.

    ECG <- Dataset_ECG()

    train.label <- ECG$train.y
    train.sample <- ECG$train.x
    test.label <- ECG$test.y
    test.sample <- ECG$test.x

Each row in the data represents a sequence of length 140.

    dim(train.sample)

    ## [1] 2910 140

This experiment uses 4 classes of the dataset to ensure a high degree of imbalance: the imbalance ratio is 119:4:8:1.

    table(train.label)

    ## train.label
    ##    1    3    4    5
    ## 2627   86  175   22

Let us check that the data is balanced after oversampling.

    MyData <- OSTSC(train.sample, train.label, parallel = FALSE)
    over.sample <- MyData$sample
    over.label <- MyData$label

    table(over.label)

    ## over.label
    ##    1    3    4    5
    ## 2627 2824 2735 2888

We evaluate the effect of oversampling on the performance of LSTM following Steps 1-3 above. First the data is transformed:

    library(keras)
    library(dummies)
    train.y <- dummy(train.label)
    test.y <- dummy(test.label)
    train.x <- array(train.sample, dim = c(dim(train.sample), 1))
    test.x <- array(test.sample, dim = c(dim(test.sample), 1))
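The shape of the transformed data may be easier to see on a toy example. The following base R sketch mimics the two transformations without requiring ‘keras’ or ‘dummies’: labels become a one-hot matrix with one column per class, and the feature matrix becomes a three-dimensional array with a trailing feature dimension of size one. The helper one_hot and the toy data are hypothetical and only illustrate the shapes.

```r
# One-hot encode a label vector: one column per observed class.
one_hot <- function(labels) {
  classes <- sort(unique(labels))
  m <- outer(labels, classes, "==") * 1  # logical matrix converted to 0/1
  colnames(m) <- classes
  m
}

toy.label <- c(1, 3, 4, 5, 1)
toy.sample <- matrix(seq_len(5 * 4), nrow = 5)  # 5 toy series of length 4

toy.y <- one_hot(toy.label)
# Add a third dimension of size 1: (observations, time steps, features)
toy.x <- array(toy.sample, dim = c(dim(toy.sample), 1))

dim(toy.y)
dim(toy.x)
```

The second and third dimensions of the array are the values passed as input_shape to the LSTM layer.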

After configuring and training the model, the loss and accuracy are measured at the end of each epoch and shown in Figure 6. The losses and accuracies indicate whether the model has been well trained.

    model <- keras_model_sequential()
    model %>%
      layer_lstm(10, input_shape = c(dim(train.x)[2], dim(train.x)[3])) %>%
      layer_dropout(rate = 0.2) %>%
      layer_dense(dim(train.y)[2]) %>%
      layer_dropout(rate = 0.2) %>%
      layer_activation("softmax")
    model %>% compile(
      loss = "categorical_crossentropy",
      optimizer = "adam",
      metrics = "accuracy"
    )
    lstm.before <- model %>% fit(
      x = train.x, y = train.y, validation_split = 0.2, epochs = 20
    )
    plot(lstm.before)

Figure 6: The loss and accuracy of the LSTM classifier trained on the unbalanced Electrocardiogram dataset. Both metrics are evaluated at the end of each epoch.

We now measure the effect of oversampling on the performance of LSTM, again applying Steps 1-3. First transform the data for LSTM:

    over.y <- dummy(over.label)
    over.x <- array(over.sample, dim = c(dim(over.sample), 1))

The model is then built and trained, and the loss and accuracy after each epoch are shown in Figure 7. From the losses and accuracies, we note that the model has not yet been adequately trained after 20 epochs. The user can of course choose to increase the number of epochs, but this will require more computation. Keep in mind that we are trying to demonstrate the utility of OSTSC with only a modest amount of computation. The user should also see the larger datasets section below, where more epochs are used for training LSTM.

    model.over <- keras_model_sequential()
    model.over %>%
      layer_lstm(10, input_shape = c(dim(over.x)[2], dim(over.x)[3])) %>%
      layer_dropout(rate = 0.1) %>%
      layer_dense(dim(over.y)[2]) %>%
      layer_dropout(rate = 0.1) %>%
      layer_activation("softmax")
    model.over %>% compile(
      loss = "categorical_crossentropy",
      optimizer = "adam",
      metrics = "accuracy"
    )
    lstm.after <- model.over %>% fit(
      x = over.x, y = over.y, validation_split = 0.2, epochs = 20
    )
    plot(lstm.after)

Figure 7: The loss and accuracy of the LSTM classifier trained on the oversampled Electrocardiogram dataset. Both metrics are evaluated at the end of each epoch.

Keeping the number of epochs fixed, Figures 8, 9 and 10 respectively compare the confusion matrices and the ROC curves of LSTM without and with oversampling. Before oversampling, the LSTM classifier performance is only marginally better than white noise. However, for the same number of epochs, oversampling improves the performance.

    pred.label <- model %>% predict_classes(test.x)
    pred.label.over <- model.over %>% predict_classes(test.x)
    cm.before <- table(test.label, pred.label)
    cm.after <- table(test.label, pred.label.over)

              Predicted 1   Predicted 3   Predicted 4   Predicted 5
    True 1    1             0             0             0
    True 3    1             0             0             0
    True 4    1             0             0             0
    True 5    1             0             0             0

Figure 8: Normalized confusion matrix of LSTM applied to the Electrocardiogram dataset without oversampling.

              Predicted 1   Predicted 3   Predicted 4   Predicted 5
    True 1    0.8596        0.0274        0.0068        0.1062
    True 3    0             0.7           0.3           0
    True 4    0             0.2632        0.7368        0
    True 5    0             0             0             1

Figure 9: Normalized confusion matrix of LSTM applied to the Electrocardiogram dataset with oversampling.

Figure 10: ROC curves (true positive rate against false positive rate) of LSTM applied to the Electrocardiogram dataset, with and without oversampling. AUC without oversampling: 0.500. AUC with oversampling: 0.878.

Evaluating OSTSC on the large datasets

The evaluation of oversampling next uses larger datasets: the MHEALTH and the HFT datasets. The purpose of this evaluation is to demonstrate how OSTSC performs at scale. We increase the data sizes by a factor of up to 10x. The evaluation of each dataset takes approximately three hours on a 1.7 GHz four-core laptop with 8 GB of RAM.

The MHEALTH dataset

The dataset Dataset_MHEALTH benchmarks techniques for human behavioral analysis applied to multimodal body sensing (Banos et al. (2014)). In this experiment, only Subjects 1-5 and Feature 12 (the x coordinate of the magnetometer reading from the left-ankle sensor) are used. The dataset is labeled with a dichotomous response (Banos et al. (2015)). Class 11 (Running) is set as the positive class and the remaining states are the negative class. The dataset is split into training and testing features and labels.

    mhealth <- Dataset_MHEALTH()

    train.label <- mhealth$train.y
    train.sample <- mhealth$train.x
    test.label <- mhealth$test.y
    test.sample <- mhealth$test.x

Each row in the data represents a sequence of length 30.

    dim(train.sample)

    ## [1] 10839 30

Class 1 represents the positive data and class 0 represents the negative. The imbalance ratio of the training dataset is 1:42.

    table(train.label)

    ## train.label
    ##     0     1
    ## 10584   255

After oversampling, the positive and negative observations are balanced.

    MyData <- OSTSC(train.sample, train.label, parallel = FALSE)
    over.sample <- MyData$sample
    over.label <- MyData$label

    table(over.label)

    ## over.label
    ##     0     1
    ## 10584 10584

We use the same LSTM classifier, except that we now increase the number of epochs to 200. While, by itself, increasing the number of epochs leads to a performance gain, we are more concerned here with the comparative performance without and with oversampling and less with the absolute gain (which is subject to further parameter tuning). Figures 11 and 12 show the convergence properties without and with oversampling.

    plot(lstm.before)
    plot(lstm.after)

Figure 11: The loss and accuracy of the LSTM classifier trained on the MHEALTH dataset without oversampling. Both metrics are evaluated at the end of each epoch.

Figure 12: The loss and accuracy of the LSTM classifier trained on the MHEALTH dataset with oversampling. Both metrics are evaluated at the end of each epoch.

Figures 13, 14 and 15 respectively compare the confusion matrices and the ROC curves of LSTM without and with oversampling. Before oversampling, the LSTM classifier performance is only marginally better than white noise. However, for the same number of epochs, oversampling improves the performance. Moreover, the comparative gain from using OSTSC has increased with a larger training dataset and more epochs.

              Predicted 0   Predicted 1
    True 0    0.9677        0.0323
    True 1    0.1725        0.8275

Figure 13: Normalized confusion matrix of LSTM applied to the MHEALTH dataset without oversampling.

              Predicted 0   Predicted 1
    True 0    0.9515        0.0485
    True 1    0             1

Figure 14: Normalized confusion matrix of LSTM applied to the MHEALTH dataset with oversampling.

Figure 15: ROC curves (true positive rate against false positive rate) of LSTM applied to the MHEALTH dataset, with and without oversampling. AUC without oversampling: 0.898. AUC with oversampling: 0.976.

The high frequency trading dataset

The dataset Dataset_HFT has already been introduced in the Data loading & oversampling section. The purpose of this example is to demonstrate the application of oversampling to a large dataset consisting of 30,000 observations instead of 300. For control, the imbalance ratio of the dataset is configured to be the same as in the smaller dataset. We split the training and testing data by a ratio of 1:1. The first half of the time ordered observations is used as training data.

    HFT <- Dataset_HFT()

    label <- HFT$y
    sample <- HFT$x
    train.label <- label[1:15000]
    train.sample <- sample[1:15000, ]
    test.label <- label[15001:30000]
    test.sample <- sample[15001:30000, ]

The imbalance ratio of the training data is 1:48:1.

    table(train.label)

    ## train.label
    ##    -1     0     1
    ##   297 14424   279

After oversampling the data is balanced.

    MyData <- OSTSC(train.sample, train.label, parallel = FALSE)
    over.sample <- MyData$sample
    over.label <- MyData$label

    table(over.label)

    ## over.label
    ##    -1     0     1
    ## 14703 14424 14721
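The time-ordered 1:1 split used for this dataset can be wrapped in a small helper so that the split fraction is explicit and the rows are never shuffled. The function split_time_ordered and the toy data below are hypothetical and only illustrate the indexing; they are not part of OSTSC.

```r
# Split rows in their original (time) order: first part for training, rest for testing.
split_time_ordered <- function(x, y, train_frac = 0.5) {
  stopifnot(nrow(x) == length(y), train_frac > 0, train_frac < 1)
  n_train <- floor(nrow(x) * train_frac)
  stopifnot(n_train >= 1, n_train < nrow(x))
  train_idx <- seq_len(n_train)
  test_idx <- seq(n_train + 1, nrow(x))
  list(
    train.x = x[train_idx, , drop = FALSE], train.y = y[train_idx],
    test.x = x[test_idx, , drop = FALSE], test.y = y[test_idx]
  )
}

set.seed(7)
toy.x <- matrix(rnorm(10 * 3), nrow = 10)  # 10 toy sequences of length 3
toy.y <- rep(c(0, 1), 5)
parts <- split_time_ordered(toy.x, toy.y)
dim(parts$train.x)
dim(parts$test.x)
```

Keeping the test set strictly later in time than the training set avoids evaluating the classifier on observations that precede its training data.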

Figures 16 and 17 display the loss and accuracy of LSTM without and with oversampling.

plot(lstm.before)

Figure 16: The loss and accuracy of the LSTM classifier trained on the HFT dataset without oversampling. Both metrics are evaluated at the end of each epoch.
(Plot: loss and acc against epoch, 0 to 200, for the training and validation data.)

plot(lstm.after)

Figure 17: The loss and accuracy of the LSTM classifier trained on the oversampled HFT dataset. Both metrics are evaluated at the end of each epoch.
(Plot: loss and acc against epoch, 0 to 200, for the training and validation data.)

Figures 18, 19 and 20 compare the confusion matrices and ROC curves of LSTM without and with oversampling. The comparative results are similar to those for the MHEALTH dataset: oversampling improves the performance, and the gain from using OSTSC increases with more training observations and more epochs.

Figure 18: Normalized confusion matrix of LSTM applied to the HFT dataset without oversampling.
(Rows are true labels, columns are predicted labels -1, 0, 1. True 1: 0.0685, 0.7975, 0.134. True 0: 0.0018, 0.9962, 0.002. True -1: 0.0924, 0.835, 0.0726.)

Figure 19: Normalized confusion matrix of LSTM applied to the HFT dataset with oversampling.
(Rows are true labels, columns are predicted labels -1, 0, 1. True 1: 0.1495, 0.1651, 0.6854. True 0: 0.0811, 0.8511, 0.0678. True -1: 0.7393, 0.1485, 0.1122.)

Figure 20: ROC curves of LSTM applied to the HFT dataset with and without oversampling.
(Plot: True Positive Rate against False Positive Rate. AUC: 0.510 before oversampling, 0.786 after oversampling.)

Summary

The OSTSC package is a powerful oversampling approach for classifying univariate, but multinomial time series data. This vignette provides a brief overview of the over-sampling methodology implemented by the package. We first provide three examples for the user to verify correct package installation and reproducibility of the results. Using a 'TensorFlow' implementation of an LSTM architecture, we compared the classifier with and without oversampling. Maintaining the imbalance ratio, we then repeated the evaluation on two medium size datasets. Finally, two large datasets are evaluated to demonstrate the scalability of the package. The examples serve to demonstrate that the OSTSC package improves the performance of RNN classifiers applied to highly imbalanced time series data.

References

Abadi, Martin, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, and others. 2016. “TensorFlow: A System for Large-Scale Machine Learning.” In Proceedings of the 12th Usenix Conference on Operating Systems Design and Implementation, 265–83. OSDI’16. Berkeley, CA, USA: USENIX Association. http://dl.acm.org/citation.cfm?id=3026877.3026899.

Alcock, R. J., Y. Manolopoulos, Data Engineering Laboratory, and Department Of Informatics. 1999. “Time-Series Similarity Queries Employing a Feature-Based Approach.” In 7th Hellenic Conference on Informatics,
Ioannina, 27–29.

Allaire, JJ, and Francois Chollet. 2017. Keras: R Interface to ’Keras’. https://CRAN.R-project.org/package=keras.

Banos, Oresti, Rafael Garcia, Juan A. Holgado-Terriza, Miguel Damas, Hector Pomares, Ignacio Rojas, Alejandro Saez, and Claudia Villalonga. 2014. “mHealthDroid: A Novel Framework for Agile Development of Mobile Health Applications.” In Ambient Assisted Living and Daily Activities: 6th International Work-Conference, IWAAL 2014, Belfast, UK, December 2-5, 2014, edited by Leandro Pecchia, Liming Luke Chen, Chris Nugent, and Jos Bravo, 91–98. Cham: Springer International Publishing.

Banos, Oresti, Claudia Villalonga, Rafael Garcia, Alejandro Saez, Miguel Damas, Juan A. Holgado-Terriza, Sungyong Lee, Hector Pomares, and Ignacio Rojas. 2015. “Design, implementation and validation of a novel open framework for agile development of mobile health applications.” BioMedical Engineering OnLine 14 (2): S6.

Brown, Christopher. 2012. Dummies: Create Dummy/Indicator Variables Flexibly and Efficiently. https://CRAN.R-project.org/package=dummies.

Cao, Hong, Xiaoli Li, David Yew-Kwong Woon, and See-Kiong Ng. 2011. “SPO: Structure Preserving Oversampling for Imbalanced Time Series Classification.” 2011 IEEE 11th International Conference on Data Mining, 1008–13.

Cao, Hong, Xiaoli Li, David Yew-Kwong Woon, and See-Kiong Ng. 2013. “Integrated Oversampling for Imbalanced Time Series Classification.” IEEE Transactions on Knowledge and Data Engineering 25: 2809–22.

Cao, Hong, Vincent Y. F. Tan, and John Z. F. Pang. 2014. “A Parsimonious Mixture of Gaussian Trees Model for Oversampling in Imbalanced and Multimodal Time-Series Classification.” IEEE Transactions on Neural Networks and Learning Systems 25: 2226–39.

Chen, Yanping, Yuan Hao, Thanawin Rakthanmanon, Jesin Zakaria, Bing Hu, and Eamonn Keogh. 2015. “A General Framework for Never-Ending Learning from Time Series Streams.” Data Mining and Knowledge Discovery 29 (6): 1622–64. doi:10.1007/s10618-014-0388-4.

Dixon, M. F. 2017. “Sequence Classification of the Limit Order Book using Recurrent Neural Networks.” ArXiv E-Prints, July.

Douglas Nychka, Reinhard Furrer, John Paige, and Stephan Sain. 2015. “Fields: Tools for Spatial Data.” Boulder, CO, USA: University Corporation for Atmospheric Research. doi:10.5065/D6W957CT.

Dubey, Rashmi, Jiayu Zhou, Yalin Wang, Paul M. Thompson, and Jieping Ye. 2014. “Analysis of Sampling Techniques for Imbalanced Data: An N = 648 Adni Study.” NeuroImage 87: 220–41.

Goldberger, Ary L., Luis A. N. Amaral, Leon Glass, Jeffrey M. Hausdorff, Plamen Ch. Ivanov, Roger G. Mark, Joseph E. Mietus, George B. Moody, Chung-Kang Peng, and H. Eugene Stanley. 2000. “PhysioBank, Physiotoolkit, and Physionet.” Circulation 101 (23). American Heart Association, Inc.: e215–e220. doi:10.1161/01.CIR.101.23.e215.

Graves, Alex. 2013. “Generating Sequences with Recurrent Neural Networks.” CoRR abs/1308.0850. http://arxiv.org/abs/1308.0850.

Jalba, Andrei C, Michael HF Wilkinson, and Jos BTM Roerdink. 2004. “Automatic Segmentation of Diatom Images for Classification.” Microscopy Research and Technique 65 (1-2). Wiley Online Library: 72–85.

Liang, Guohua, and Chengqi Zhang. 2012. “A Comparative Study of Sampling Methods and Algorithms for Imbalanced Time Series Classification.” In AI 2012: Advances in Artificial Intelligence: 25th Australasian Joint Conference, Sydney, Australia, December 4-7, 2012, edited by Michael Thielscher and Dongmo Zhang, 637–48. Berlin, Heidelberg: Springer Berlin Heidelberg.

Lines, Jason, Anthony Bagnall, Patrick Caiger-Smith, and Simon Anderson. 2011. “Classification of Household Devices by Electricity Usage Profiles.” In Intelligent Data Engineering and Automated Learning-Ideal 2011, 403–12. Springer.

Microsoft Corporation, and Stephen Weston. 2017a. DoSNOW: Foreach Parallel Adaptor for the ’Snow’ Package. https://CRAN.R-project.org/package=doSNOW.

Microsoft Corporation, and Steve Weston. 2017b. DoParallel: Foreach Parallel Adaptor for the ’Parallel’ Package. https://CRAN.R-project.org/package=doParallel.

More, A. 2016. “Survey of resampling techniques for improving classification performance in unbalanced datasets.” ArXiv E-Prints, August.

Pham, D, and AB Chan. 1998. “Control Chart Pattern Recognition Using a New Type of Self-Organizing Neural Network.” In Proceedings of the Institution of Mechanical Engineers Part I-Journal of Systems and Control Engineering - Proc Inst Mech Eng I-J Syst c, 212:115–27.

R Core Team. 2017. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Revolution Analytics, and Steve Weston. 2015. Foreach: Provides Foreach Looping Construct for R. https://CRAN.R-project.org/package=foreach.

Robin, Xavier, Natacha Turck, Alexandre Hainard, Natalia Tiberti, Frédérique Lisacek, Jean-Charles Sanchez, and Markus Müller. 2011. “PROC: An Open-Source Package for R and S+ to Analyze and Compare Roc Curves.” BMC Bioinformatics 12: 77.

Siriseriwan, Wacharasak. 2017. Smotefamily: A Collection of Oversampling Techniques for Class Imbalance Problem Based on Smote. https://CRAN.R-project.org/package=smotefamily.

Venables, W. N., and B. D. Ripley. 2002. Modern Applied Statistics with S. Fourth. New York: Springer. http://www.stats.ox.ac.uk/pub/MASS4.
