Bachelor Thesis Project

Electrocardiographic deviation detection - Using long short-term memory recurrent neural networks to detect deviations within electrocardiographic records

Author: Michael Racette Olsén
Supervisor: Anders Haggren
Semester: VT2017
Subject: Computer Engineering

Abstract

Artificial neural networks have been gaining attention in recent years due to their impressive ability to map out complex nonlinear relations within data. In this report, an attempt is made to use a Long short-term memory neural network for detecting anomalies within electrocardiographic records. The hypothesis is that if a neural network is trained on records of normal ECGs to predict future ECG sequences, it is expected to have trouble predicting abnormalities not previously seen in the training data. Three different LSTM model configurations were trained using records from the MIT-BIH Arrhythmia database. Afterwards the models were evaluated for their ability to predict previously unseen normal and anomalous sections. This was done by measuring the mean squared error of each prediction and the uncertainty of overlapping predictions. The preliminary results of this study demonstrate that recurrent neural networks with the use of LSTM units are capable of detecting anomalies.

Keywords: ECG, LSTM, Neural Network, Deeplearning4j, Time Series

Preface

I would like to thank HIQ and KIWOK for providing me the opportunity to do research in a field I find very interesting.

Contents

1 Introduction
  1.1 Background
  1.2 Hypothesis
  1.3 Problem formulation
  1.4 Objectives
  1.5 Scope/Limitations
  1.6 Target group
  1.7 Outline

2 Theory
  2.1 Electrocardiography (ECG)
  2.2 Supervised learning with features and labels
  2.3 Artificial Neural Network
    2.3.1 Artificial neuron
    2.3.2 Hidden Layers
    2.3.3 Gradient descent
    2.3.4 Recurrent neural network (RNN)
    2.3.5 Long short-term memory (LSTM)
  2.4 Preparing the data
  2.5 Training the network
    2.5.1 Hyper-parameter optimization
    2.5.2 Generalization and overfitting

3 Method
  3.1 Collecting data for training
  3.2 Evaluation
  3.3 Reliability and Validity

4 Implementation
  4.1 Preparing the data
  4.2 Structuring the neural network
  4.3 Preventing overfitting by using early stopping
  4.4 Evaluation

5 Results
  5.1 Sequence to One
  5.2 Sequence to Vector
  5.3 Sequence to Sequence

6 Analysis

7 Discussion

8 Conclusion

References

1 Introduction

The eHealth company Kiwok AB has developed a system called BodyKom for remotely monitoring a person's electrocardiogram. The recordings are measured with a soft electrode shirt and sent via a cellphone to a server where they can be downloaded by a caregiver for analysis. Kiwok AB has asked HIQ to develop an automatic system for real-time analysis of ECGs. The goal is to be able to report deviations directly to a healthcare provider, working preventively to avoid future cardiac complications. Kiwok has stated that they are particularly interested in finding a solution that involves machine learning.

1.1 Background

Studies show that better screening methods are becoming increasingly important as the older population continues to grow. A mass screening performed on individuals aged 75 to 76 revealed that a significant proportion of the participants had untreated atrial fibrillation [1]. T. Lindberg et al. later confirmed these findings with the use of BodyKom. In their conclusion they state:

...many older outpatients have undiagnosed and thus untreated persistent and paroxysmal AF. This is a challenge for health care providers, and it is essential to develop more effective strategies for the detection, treatment, and prevention of arrhythmias. This study confirms that the long-term wireless ECG recorder BodyKom has good feasibility for arrhythmia screening in older outpatient populations. [2, p. 1089].

Visually analyzing an ECG is a very time-consuming job. An ECG record of 72 hours contains approximately 300 thousand heartbeats. Being able to detect anomalies and identify people within risk zones remotely with the use of machine learning would make ECG monitoring available to far more people. The goal of the requested system is twofold: general deviation detection and individually adapted deviation detection. The latter is meant to recognize what is normal for a particular individual and detect changes as the individual ages. Long short-term memory neural networks have proven to be exceptionally good at learning long-term temporal dependencies within time-series data. In 2015 Malhotra et al. demonstrated how LSTM neural networks can be used to detect anomalies:

A network is trained on non-anomalous data and used as a predictor over a number of time steps. The resulting prediction errors are modeled as a multivariate Gaussian distribution, which is used to assess the likelihood of anomalous behavior. The efficacy of this approach is demonstrated on four datasets: ECG, space shuttle, power demand, and multi-sensor engine dataset. [3].

In this report a similar approach is evaluated using three different structures of LSTM neural networks, which are trained to predict future sequences of ECG data from the MIT-BIH Arrhythmia database. Their prediction accuracy and ability to detect anomalies are compared to each other. Additionally, two different metrics for anomaly detection are compared: prediction difficulty, measured through squared error, and prediction uncertainty, measured through variance in overlapping predictions.

1.2 Hypothesis

When a neural network model is trained to accurately predict sequential data a few steps ahead of time, it should encounter great difficulties in predicting anomalies not previously seen in the training data. The anomalies can then be identified by either computing an error vector or measuring the uncertainty in overlapping predictions.

1.3 Problem formulation

In order for the network to identify anomalous sections of data, it first needs to be able to accurately predict future sequences of non-anomalous ECG data. The three different network compositions need to be reliably evaluated against each other to answer the following questions:

• Which architecture has the most accurate predictions?
• Are the trained models capable of detecting anomalies?
• Which architecture has the best performance in terms of analysis speed?

1.4 Objectives

O1 Implement the different architectures using Deeplearning4j.
O2 Compare the different architectures to each other.
O3 Evaluate the effectiveness of the methods.

1.5 Scope/Limitations

This report will include a brief explanation of how Recurrent Neural Networks and Long Short-term memory neural networks work and the difference between them. How LSTMs can be implemented on electrocardiographic data and their effectiveness in detecting deviations will also be discussed.

1.6 Target group

This report is directed mainly at HIQ and KIWOK as an evaluation of a proposed development path for their live monitoring system.

1.7 Outline

The report begins with chapter 2, where the theory is outlined. Chapter 3 (Method) describes how the system will be evaluated and discusses the reliability and validity of the results. Chapter 4 (Implementation) describes how the systems were designed and implemented. It contains a brief description of the library used, the methods for finding optimal structures and the methods that were used for evaluating the structures. In chapter 5 the results are presented for each of the different models. Chapters 6, 7 and 8 contain the analysis, discussion and conclusion.

2 Theory

2.1 Electrocardiography (ECG)

Electrocardiography is the process of measuring and recording the electrical activity of the heart over time by attaching small electrodes to the surface of the skin. The electrodes measure the small electrical changes in the skin caused by the depolarization and repolarization of the different chambers of the heart muscle. These events form a series of waves, starting with the P-wave representing the depolarization of the atria, followed by the QRS-complex representing the depolarization of the ventricles and finally the T-wave representing the repolarization of the ventricles [4].

Figure 2.1: ECG of a single heartbeat in normal sinus rhythm

2.2 Supervised learning with features and labels

Supervised learning is useful when a system’s inputs and outputs are known in advance and can be measured. The measured characteristics of the input data are called "features", and the measurements of the known output data are called "labels". Features and labels are prepared in pairs. By using algorithms for supervised learning the neural network is able to map the relation between features and labels.

2.3 Artificial Neural Network

Artificial neural networks consist of self-learning algorithms inspired by the structure of the neural networks present in a biological brain. By simulating a network of interconnected artificial neurons, complex nonlinear relationships can be mapped [5]. They are particularly suitable for learning complex functions of systems where only the inputs and outputs are known. Provided there is enough data and the network contains enough neurons, an artificial neural network is theoretically capable of mapping out any function [6].

2.3.1 Artificial neuron

The artificial neuron provides a non-linear mapping between multiple input signals and an output signal. The input signals can either come from the environment (through the input layer) or be signals from other artificial neurons. The strength of each input connection is determined by its weight w_k. The neuron computes the weighted sum of the inputs, adds the bias and feeds the result to the activation function φ(·), which determines the activation strength of the neuron. The bias shifts the activation function along the x-axis. Recurrent neural networks commonly use a sigmoid function such as the hyperbolic tangent as their activation function:

\tanh(x) = \frac{\sinh x}{\cosh x} = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}

During the training phase of the neural network two parameters are adjusted, or "trained", to minimize the error between the output and the labeled data: the weights connected to the inputs of the neuron and the bias of the neuron. The weight of each individual input is adjusted to match its significance in determining the strength of the neuron's activation. The bias can be seen as a threshold as it shifts the entire function [7][6].

y = \varphi\!\left(\sum_{i=1}^{n} w_i x_i + b\right)

Figure 2.2: Artificial Neuron

2.3.2 Hidden Layers

Neural networks are structured into layers composed of multiple neurons. The first layer is referred to as the visible input layer and consists of a neuron for each input to the network. The last layer is called the output layer and consists of a neuron for each output. The layers in between are referred to as hidden layers, since their inputs and outputs are visible from neither the inputs nor the outputs of the network. The mappings of the hidden layers are thereby not directly defined by the data; instead, the learning algorithm needs to make use of these layers by creating its own mapping. The neurons in the hidden layers are connected to the neurons in the adjacent layers [8, pp. 164-165].


Figure 2.3: Example of a deep neural network with three hidden layers

2.3.3 Gradient descent

Neural networks are typically trained by initializing the weights randomly and computing an output. The output is then compared to the correct output using a cost function. The cost function measures how far off the output is from the label, typically by computing the squared error. The weights within each of the layers are updated using gradient descent by propagating backwards from the output layer towards the input layer. Gradient descent is used to update the parameters in the direction of the cost function’s local minima. Since the weights and biases of a neuron form a multivariate function, the partial derivative for each of the weights with respect to the cost function can be calculated to find the direction in which the weight needs to be adjusted in order for the cost to decrease. The same process can be repeated for earlier layers by viewing the parameter adjustments as a cost. The process of recursively updating the parameters backwards through the layers of the network is known as backpropagation [9, Sec. 3.3.1].
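Expressed compactly, each parameter is moved a small step in the negative direction of its partial derivative of the cost. With learning rate η (a symbol not used elsewhere in this report) and cost function C, the update for a weight w and a bias b can be written as:

w \leftarrow w - \eta \frac{\partial C}{\partial w}, \qquad b \leftarrow b - \eta \frac{\partial C}{\partial b}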

Stochastic gradient descent Performing gradient descent for every data sample takes a lot of time. By computing the gradients of a smaller, randomly selected subset of samples and using stochastic methods to estimate the parameter adjustments, the training speed can be dramatically increased [8, Sec. 8.3.1].

2.3.4 Recurrent neural network (RNN)

A recurrent neural network is able to operate on sequences of data by using feedback connections, which make the output dependent not only on the current input but also on the network's hidden state from the previous time-step. This gives the network a form of memory over the sequence [9, pp. 18-20].

Figure 2.4: Unfolded recurrent neural network
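The recurrence illustrated in Figure 2.4 can be written in one common form as follows (the weight matrices and symbols below are standard notation, not taken from the thesis):

h_t = \phi\left(W_{xh} x_t + W_{hh} h_{t-1} + b_h\right), \qquad y_t = W_{hy} h_t + b_y

where h_t is the hidden state at time-step t, x_t the input and y_t the output.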

Unfortunately standard RNNs have difficulties solving problems where long-term temporal dependencies are crucial. The reason is that during backpropagation the influence from previous hidden layers decays exponentially with each time-step, which makes it very difficult for the network to relate events separated by more than a few time-steps. This is known as the vanishing gradient problem and cannot be overcome by simply increasing the influence from previous time-steps, as that will ultimately result in exploding gradients where the influence from the feedback connection grows stronger than the input [9, pp. 31-32].

2.3.5 Long short-term memory (LSTM)

The vanishing gradient problem can be solved by replacing the RNN neurons with the more advanced LSTM unit. An LSTM unit is composed of a memory cell, an input-gate, an output-gate, and a forget-gate. Each of the gates has its own weight matrix which is trained similarly to a regular artificial neuron with the use of gradient descent. The gates essentially form an attention mechanism in the sense that the LSTM unit is able to learn what values to store in its memory cell, when to apply these values and when to forget them. By replacing RNN neurons with LSTM units the network is capable of learning the temporal dependencies within the data even when there are large gaps between important events [9, pp. 31-36] [10].
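One common formulation of these computations (without peephole connections; the notation is standard and not taken from the thesis) is:

i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
h_t = o_t \odot \tanh(c_t)

where σ is the logistic sigmoid, ⊙ denotes element-wise multiplication, c_t is the memory cell state and i_t, f_t, o_t are the input, forget and output gates.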

2.4 Preparing the data

Before training a neural network, the data needs to be divided into features and labels. The data that is fed into the network through the input layer is referred to as features. The actual measurements that the output of the network is compared to are referred to as labels. The data also needs to be normalized to a specific range depending on the activation function. When using gradient descent in combination with tanh, it is recommended that the data lie within the range of -1 to 1 [11].
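As an illustration, a simple min-max scaling into the range -1 to 1 could be implemented as follows (a sketch only; the report does not state how the normalization was actually implemented):

    // Scale a signal linearly into the range [-1, 1] (min-max normalization).
    static double[] normalizeToRange(double[] signal) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : signal) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double[] out = new double[signal.length];
        for (int i = 0; i < signal.length; i++) {
            // Map [min, max] to [0, 1], then to [-1, 1].
            out[i] = 2.0 * (signal[i] - min) / (max - min) - 1.0;
        }
        return out;
    }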

2.5 Training the network

Training is done in multiple stages. First the network is fitted to the training dataset, which means that the weights and biases in the network are updated through gradient descent. This is done multiple times; each pass is referred to as an epoch. After each epoch the network is evaluated on the test dataset to see if the predictions have improved. The number of epochs is usually a fixed number in the range of a few hundred iterations. Early stopping is often used to prevent the network from overfitting, by stopping training once the predictions are no longer improving on the test dataset (see section 2.5.2).

2.5.1 Hyper-parameter optimization

When configuring the neural network, we need to define some fixed parameters that do not change during training. Examples of such parameters are the number of hidden layers and their respective number of neurons, the learning rate, the activation function etc. These parameters are called hyperparameters and need to be configured before training. Manual configuration of these parameters requires an in-depth understanding of how each of them relates to the training results. Some of the parameters, such as the number of hidden layers or the number of neurons in a layer, also have a huge impact on resource requirements. One of the main goals when tuning the network architecture is to find a structure that is capable of learning the complexity of the function without becoming unreasonably heavy on the computational resources. Choosing the optimal parameters can be a difficult task. Fortunately, there are methods such as Grid Search, Random Search or Bayesian optimization that facilitate this process by automatically testing different model configurations to narrow down an optimal configuration over time [12][13].
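To make the idea of Random Search concrete, the sketch below samples a few hyperparameter combinations at random and keeps the best one; trainAndScore is a hypothetical helper standing in for a full training run, not a Deeplearning4j method:

    import java.util.Random;

    // Randomly sample hyperparameter combinations and keep the best-scoring one.
    public class RandomSearchSketch {
        public static void main(String[] args) {
            Random rng = new Random(42);
            double bestScore = Double.MAX_VALUE;
            double bestLearningRate = 0;
            int bestHiddenUnits = 0;
            for (int trial = 0; trial < 20; trial++) {
                double learningRate = Math.pow(10, -1 - 3 * rng.nextDouble()); // 1e-4 .. 1e-1
                int hiddenUnits = 25 * (1 + rng.nextInt(8));                   // 25 .. 200
                double score = trainAndScore(learningRate, hiddenUnits);       // hypothetical helper
                if (score < bestScore) {
                    bestScore = score;
                    bestLearningRate = learningRate;
                    bestHiddenUnits = hiddenUnits;
                }
            }
            System.out.println("Best: lr=" + bestLearningRate + ", units=" + bestHiddenUnits);
        }

        // Placeholder: trains a model with the given hyperparameters and
        // returns its error on the test dataset.
        static double trainAndScore(double learningRate, int hiddenUnits) {
            return Double.MAX_VALUE; // not implemented in this sketch
        }
    }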

2.5.2 Generalization and overfitting

When using supervised learning algorithms it is important to separate the data into a training set and a validation set. The goal of supervised learning is to adjust or "fit" the parameters of the network based on the training data, such that the trained network can later be used to predict outputs for data which has not yet been labeled. The accuracy measured on the training data is not a good indicator of how well the model will perform on previously unseen data. It is possible to overfit a neural network such that it learns to recognize patterns that are specific only to the training data and may not be representative of the problem in general. By evaluating the trained models on a dataset containing feature-label pairs not present during training, a better measure of the models' generalization capabilities can be obtained.

3 Method

Three different methods for sequence prediction will be compared.

Sequence to One Model structured such that it will step over a window of 700 previous time-steps and then predict a single time-step ahead of time. The prediction is then fed back into the network and the process repeated 24 times for a total prediction length of 25 time-steps.

Sequence to Vector Model structured to iterate over a window of 700 previous time-steps before outputting a vector containing predictions for 25 time-steps ahead of time.

Sequence to Sequence A more complex structure consisting of two separate parts known as encoder and decoder. The encoder reads the input sequence and encodes a summary of its context into a vector. The decoder is then used to iteratively build up the prediction sequence based on previous predictions and the context from the vector.

3.1 Collecting data for training

All data used for training and evaluation will be collected from the MIT-BIH Arrhythmia database. Sections containing normal sinus rhythm will be manually extracted and used for training, based on the accompanying annotations included with the dataset. The dataset contains 48 half-hour excerpts of two-channel ambulatory ECG recordings digitized at 360 samples per second with 11-bit resolution [14].

3.2 Evaluation

All three models will be evaluated for their ability to accurately predict previously unseen non-anomalous data and their ability to detect anomalies. The prediction accuracy is evaluated on randomly sampled sequences containing non-anomalous ECG. The error is measured using Mean Squared Error (MSE). Anomaly detection will be evaluated on an ECG section containing a premature ventricular contraction and the onset of atrial fibrillation (AFIB) with aberrated beats (a), taken from ECG record 203.dat, 7:40 to 8:10 (MIT-BIH Arrhythmia Database) [14].
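For reference, the mean squared error over N predicted samples is computed as (notation added here, not taken from the thesis):

\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2

where y_i is the actual sample value and \hat{y}_i the predicted value.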

Figure 3.5: Section from record 201.dat

3.3 Reliability and Validity

The evaluation method will give us a basic idea of whether the system works in practice or not. For a complete validation of the method, a much larger dataset needs to be used, which requires significantly more computational capacity than what is required for this analysis. The results are dependent on the configuration of the network and the data used. With different layer configurations, data or differently tuned hyperparameters the results may vary.

4 Implementation

Deeplearning4j The neural networks were configured, built and trained using the open source library Deeplearning4j. The library acts as a toolkit for Java, with its computational core written in C++ and CUDA. It is capable of running locally as well as in a computational cluster. The platform is able to utilize CPUs as well as GPUs, as long as the GPU supports the parallel computing platform CUDA.

4.1 Preparing the data

Features are the data that is fed into the network and labels are the measured outputs that we wish for the network to produce (see section 2.2). The data is collected in sequences of 725 consecutive sample values which are split into two parts. The first 700 samples are used as input and thereby stored as features. The remaining 25 samples are the values we want the network to learn to predict and are stored as labels. Before the data is prepared into features and labels it is normalized into the range of -1 to 1 (see section 2.4). Noise and baseline drift are removed by using Savitzky-Golay filtering. The filter is set to use a polynomial order of 4 and a frame length of 16 [15].

Selecting sequences for preparation Multiple sections containing normal sinus rhythm were manually collected from the MIT-BIH Arrhythmia database and stored to disk. 10 000 different sequences, each with a length of 725 samples (700 features and 25 labels), were extracted from random positions within these files and prepared for training. Similarly, 2500 different sequences were extracted for testing. In total 12 500 sequences were selected randomly. All three model structures were trained and evaluated using the same dataset.
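A minimal sketch of this windowing step, assuming the filtered and normalized signal is already available as a double array (the class and method names are illustrative, not taken from the thesis code):

    import java.util.Random;

    public class WindowExtractor {
        // Split a 725-sample window into 700 feature samples and 25 label samples.
        static double[][] splitWindow(double[] signal, int start) {
            double[] features = new double[700];
            double[] labels = new double[25];
            System.arraycopy(signal, start, features, 0, 700);
            System.arraycopy(signal, start + 700, labels, 0, 25);
            return new double[][] { features, labels };
        }

        // Extract one training example starting at a random position in the record.
        static double[][] randomExample(double[] signal, Random rng) {
            int start = rng.nextInt(signal.length - 725);
            return splitWindow(signal, start);
        }
    }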

Sequence to One The features are prepared in sequences of fixed length n, which represents the number of time-steps taken before a prediction. A sequence starts at sample x_k and ends at x_{k+n}, where k is the current sample number in the ECG record. The corresponding label is stored as x_{k+n+1}. A feature sequence and corresponding label value are prepared for each sample in the ECG record. Since the length of the feature sequence is different from the length of the label sequence, a masking table is used to specify at which time-step the output is compared to the label.

Figure 4.6: Sequence to One
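To illustrate the feedback loop described for the Sequence to One model, the sketch below produces 25 predictions by repeatedly appending the latest prediction to the input window; predictNext is a hypothetical placeholder for a single forward pass of the trained network, not a Deeplearning4j call, and the sliding-window mechanism shown is one possible way to realize the feedback:

    // Predict 'steps' future samples by feeding each prediction back as input.
    static double[] rollout(double[] window, int steps) {
        double[] context = window.clone();        // the 700-sample input window
        double[] predictions = new double[steps]; // 25 predicted samples
        for (int i = 0; i < steps; i++) {
            double next = predictNext(context);   // hypothetical single-step prediction
            predictions[i] = next;
            // Slide the window: drop the oldest sample and append the prediction.
            System.arraycopy(context, 1, context, 0, context.length - 1);
            context[context.length - 1] = next;
        }
        return predictions;
    }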

Sequence to Vector The Sequence to Vector method is slightly different: instead of outputting a single prediction at each time-step, the predictions for all 25 time-steps are output at the last time-step in the form of a vector. A masking table is used similarly to the Sequence to One model.

Figure 4.7: Sequence to Vector

Sequence to Sequence For Sequence to Sequence prediction using encoder-decoder LSTM networks, the data needs to be structured into features for the encoder input and the decoder input separately. The encoder layer receives the first 700 samples of the sequence, similarly to the previous models. The decoder input is different, since the decoder receives the previous predictions as input. During the training phase the previous predictions are substituted with the correct values, such that the error of the next prediction is not amplified by errors in previous predictions (see Table 4.1).

Encoder features   Decoder features   Labels
1                  0                  6
2                  6                  7
3                  7                  8
4                  8                  9
5                  9                  10

Table 4.1: Example of encoder-decoder training data used for a counter

The input sequence is encoded into a vector, which is then used by the decoder for building prediction sequences of arbitrary length. This essentially separates the process into two parts. The input length to the encoder is independent of the length of the input to the decoder, thus removing the need for a masking table.
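The decoder's training input can thus be built by shifting the labels one step to the right and inserting a start value, mirroring Table 4.1 (this is commonly called teacher forcing). A small sketch of that preparation step; the start value and the method name are illustrative assumptions:

    // Build the decoder's training input from the labels ("teacher forcing"):
    // the decoder always sees the correct previous value instead of its own,
    // possibly wrong, prediction.
    static double[] decoderInput(double[] labels, double startValue) {
        double[] input = new double[labels.length];
        input[0] = startValue;
        System.arraycopy(labels, 0, input, 1, labels.length - 1);
        return input;
    }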

Figure 4.8: Sequence to Sequence

4.2 Structuring the neural network

Since stochastic gradient descent on its own uses a fixed learning rate for all of the weights in the network which does not change during training, Adam (derived from adaptive moment estimation) was used to provide adaptive, independent adjustments for each of the weights in the network. This allows the parameter adjustments to decrease as the network converges towards the function. The following configuration was used, as recommended in the report Adam: A Method for Stochastic Optimization [16]: alpha = 0.001, beta1 = 0.9, beta2 = 0.999, epsilon = 10^-8.

For both the Sequence to One and Sequence to Vector structures a single LSTM layer is used containing 100 neurons, connected to an input layer with a single input. The output layer of Sequence to One has only a single output, whereas the Sequence to Vector model has 25 outputs, one for each prediction.
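For reference, a configuration along these lines could look roughly as follows with Deeplearning4j's builder API. This is a hedged sketch based on the library's publicly documented API (recent versions name the layer class LSTM; around 2017 it was GravesLSTM), not the author's actual code, and any setting beyond those stated above is an assumption:

    import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
    import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
    import org.deeplearning4j.nn.conf.layers.LSTM;
    import org.deeplearning4j.nn.conf.layers.RnnOutputLayer;
    import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
    import org.nd4j.linalg.activations.Activation;
    import org.nd4j.linalg.learning.config.Adam;
    import org.nd4j.linalg.lossfunctions.LossFunctions;

    public class NetworkSketch {
        public static void main(String[] args) {
            // One LSTM layer with 100 units and a single regression output,
            // trained with MSE loss and the Adam updater (alpha, beta1, beta2, epsilon).
            MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
                    .updater(new Adam(0.001, 0.9, 0.999, 1e-8))
                    .list()
                    .layer(new LSTM.Builder().nIn(1).nOut(100)
                            .activation(Activation.TANH).build())
                    .layer(new RnnOutputLayer.Builder(LossFunctions.LossFunction.MSE)
                            .activation(Activation.IDENTITY).nIn(100).nOut(1).build())
                    .build();

            MultiLayerNetwork net = new MultiLayerNetwork(conf);
            net.init();
        }
    }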

Encoder decoder The Sequence to Sequence structure has a single LSTM layer containing 100 neurons for the encoder. The encoder output at the last time-step is copied to a vector often referred to as a thought vector, since it contains an internal representation, or summary, of the input sequence. The thought vector is then merged with the decoder input for each prediction the decoder performs. Similarly to the encoder, the decoder is composed of a single LSTM layer containing 100 neurons; however, it is connected to only a single neuron in the output layer.

4.3 Preventing overfitting by using early stopping

Early stopping is a common method for preventing the network from becoming overfit. The goal during the training phase is for the network to improve its predictions on previously unseen data. If the network keeps improving on the training data but not on the testing data, it means that the patterns it is learning are too specific to the training data, and not necessarily representative of the data in general. To counter overfitting, the state of the model is stored to disk after an epoch only if its prediction accuracy on the test dataset has improved.
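A schematic version of that loop is shown below; trainOneEpoch, evaluateTestError and saveToDisk are hypothetical placeholders, not Deeplearning4j calls:

    // Keep only the model state that performed best on the test dataset.
    static void trainWithEarlyStopping(int maxEpochs) {
        double bestTestError = Double.MAX_VALUE;
        for (int epoch = 0; epoch < maxEpochs; epoch++) {
            trainOneEpoch();                        // fit the network on the training dataset
            double testError = evaluateTestError(); // e.g. MSE on the test dataset
            if (testError < bestTestError) {
                bestTestError = testError;
                saveToDisk();                       // store the improved model state
            }
        }
    }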

4.4 Evaluation

Prediction accuracy was measured by calculating the squared error between output and labels at each time step. The mean squared error was calculated for the entire test dataset and used to compare the models. For anomaly detection the Mean Absolute Error (MAE) and the unbiased sample variance of overlapping predictions were computed at each time-step. Outliers were filtered out by computing a running median with a window size of 50 time steps for both the MAE and the variance measurements.
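As an illustration of this post-processing, the sketch below computes the unbiased sample variance of the overlapping predictions at one time-step and a running median over a sliding window (a straightforward reference implementation, not the author's code; whether the window is trailing or centered is not stated in the report):

    import java.util.Arrays;

    public class AnomalyPostProcessing {
        // Unbiased sample variance of the overlapping predictions at one time-step.
        static double sampleVariance(double[] values) {
            double mean = 0;
            for (double v : values) mean += v;
            mean /= values.length;
            double sum = 0;
            for (double v : values) sum += (v - mean) * (v - mean);
            return sum / (values.length - 1); // divide by n - 1 for the unbiased estimate
        }

        // Running median over a trailing window (window = 50 in this report).
        static double[] runningMedian(double[] series, int window) {
            double[] out = new double[series.length];
            for (int i = 0; i < series.length; i++) {
                int from = Math.max(0, i - window + 1);
                double[] slice = Arrays.copyOfRange(series, from, i + 1);
                Arrays.sort(slice);
                int n = slice.length;
                out[i] = (n % 2 == 1) ? slice[n / 2]
                                      : 0.5 * (slice[n / 2 - 1] + slice[n / 2]);
            }
            return out;
        }
    }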

5 Results

In this section the results of the different models are presented, starting with a statistical comparison in Table 5.2. The table shows the Mean Squared Error (MSE) from evaluating the models on random sequences in the validation dataset; a lower value indicates a higher prediction accuracy. It also shows the complete training time and prediction speed for each of the trained models. In Figure 5.9 all of the overlapping predictions are displayed individually for a regular heartbeat and for a section of aberrated beats. Afterwards, plots for each of the different models are displayed. Figures 5.10, 5.12 and 5.14 show the different models attempting to predict randomly extracted ECG sequences. Figures 5.11, 5.13 and 5.15 show the different models' predictions on a section containing a premature ventricular contraction followed by 4 regular heartbeats, ending in an onset of atrial fibrillation with aberrated beats, see Figure 3.5. Each of these figures contains five charts, starting with the predictions compared to the actual values. Underneath, the Mean Absolute Error (MAE) and the variance are plotted on rows 2 and 4, with their respective running medians plotted on rows 3 and 5.

Statistical comparison

Architecture           MSE      Training time   Speed (samples/min)
Sequence to One        0.0328   18.8 hours      489
Sequence to Vector     0.0126   8.5 hours       5797
Sequence to Sequence   0.0115   18.7 hours      517

Table 5.2: Prediction accuracy of random sections, training time and evaluation speed

Individual predictions

Figure 5.9: Individual predictions (Sequence to Sequence)

5.1 Sequence to One

Prediction accuracy


Figure 5.10: Seq to One - prediction accuracy of random sections

Anomaly detection

Figure 5.11: Anomaly detection using Sequence to One

5.2 Sequence to Vector

Prediction accuracy


Figure 5.12: Seq to Vec - prediction accuracy of random sections

Anomaly detection

Figure 5.13: Anomaly detection using Sequence to Vector

5.3 Sequence to Sequence

Prediction accuracy


Figure 5.14: Prediction of random sequences

Anomaly detection

Figure 5.15: Anomaly detection using Sequence to Sequence

6 Analysis

All three models were able to learn the general pattern of the ECG curve. The Sequence to Sequence model was slightly more accurate in its predictions than the Sequence to Vector model; however, the Sequence to Vector model was substantially more efficient, capable of analyzing approximately 11.2 times more data in the same time frame. The Sequence to One model had the worst performance, both in terms of prediction speed and accuracy.

Neither prediction error nor variance on its own was able to distinguish the anomalies from the regular heartbeats. In Figure 5.9 it can be observed that the longer predictions had trouble estimating the timing of the R-wave, resulting in a spike in both the MAE and variance plots. If these raw measurements were used to find anomalies, the anomalies would not be distinguishable from the false-positive spikes. When a running median was applied to either the MAE or the variance, the anomalies were clearly distinguishable from the false positives.

7 Discussion

The first goal of this study was to find a suitable neural network architecture for predicting ECG curves. I discovered that recurrent neural networks, specifically long short-term memory recurrent neural networks, were a suitable choice for working with time-series data due to their ability to learn long-term temporal dependencies within the data. Three different model configurations were evaluated: Sequence to One, Sequence to Vector and Sequence to Sequence.

In order for the models to detect anomalous sections of data, they first had to learn to predict future sequences of non-anomalous ECG data. All three models were capable of learning the general shape of the ECG record and successfully predicted future sequences. The accuracy varied between the models, and the Sequence to One model's accuracy was clearly worse than that of the other two.

The models' ability to detect anomalies correlated directly with their prediction accuracy. Two of the models, Sequence to Vector and Sequence to Sequence, were able to distinguish the two anomalies from the false positives. The Sequence to Vector model was substantially more efficient in terms of analysis speed than the other two models. This was likely caused by the Sequence to One and Sequence to Sequence models building the output sequence iteratively, computing an output for each individual prediction, as opposed to the Sequence to Vector model, which computes the entire output sequence at once and outputs it as a vector.

8 Conclusion

My hypothesis was that by training the neural network to predict future sequences on normal sinus rhythms, the network should have trouble predicting anomalous sections of ECG data and that this could be used for general anomaly detection. I was able to verify that by measuring the prediction error using squared error and applying a running median to filter out outliers, the anomalous sections could be distinguished from the sections containing normal sinus rhythm. The data used for validation was limited, so further verification is needed.

References

[1] E. Svennberg, J. Engdahl, F. Al-Khalili, L. Friberg, V. Frykman, and M. Rosenqvist, “Mass screening for untreated atrial fibrillation: the strokestop study,” Circulation, pp. CIRCULATIONAHA–114, 2015.

[2] T. Lindberg, D. M. Bohman, S. Elmståhl, C. Jogréus, and J. S. Berglund, "Prevalence of unknown and untreated arrhythmias in an older outpatient population screened by wireless long-term recording ecg," Clinical interventions in aging, vol. 11, p. 1083, 2016.

[3] P. Malhotra, L. Vig, G. Shroff, and P. Agarwal, “Long short term memory networks for anomaly detection in time series,” in Proceedings. Presses universitaires de Louvain, 2015, p. 89.

[4] D. E. Becker, “Fundamentals of electrocardiography interpretation,” Anesthesia progress, vol. 53, no. 2, pp. 53–64, 2006.

[5] A. P. Engelbrecht, “Artificial Neural Networks,” in Computational intelligence : an introduction. Chichester: Wiley, 2007, pp. 15–16.

[6] M. A. Nielsen. (2015) Neural networks and deep learning. [Online]. Available: http://neuralnetworksanddeeplearning.com/chap4.html#universality_with_one_input_and_one_output

[7] A. P. Engelbrecht, "The Artificial Neuron," in Computational intelligence : an introduction. Chichester: Wiley, 2007, pp. 17–18.

[8] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT press Cambridge, 2016, vol. 1.

[9] K. Kawakami, "Supervised sequence labelling with recurrent neural networks," Ph.D. dissertation, Technical University of Munich, 2008.

[10] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv preprint arXiv:1412.3555, 2014.

[11] S. t. Chris V. Nicholson. (2018) Introduction to the core deeplearning4j concepts - deeplearning4j: Open-source, distributed deep learning for the jvm. [Online]. Available: https://deeplearning4j.org/core-concepts#normalizing-data

[12] J. Bergstra and Y. Bengio, "Random search for hyper-parameter optimization," Journal of Machine Learning Research, vol. 13, no. Feb, pp. 281–305, 2012.

[13] J. Snoek, H. Larochelle, and R. P. Adams, "Practical bayesian optimization of machine learning algorithms," in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 2951–2959. [Online]. Available: http://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms.pdf

[14] MIT Laboratory for Computational Physiology. (2010, Jun. 24) MIT-BIH Arrhythmia Database. [Online]. Available: https://physionet.org/physiobank/ database/mitdb/

[15] N. Rastogi and R. Mehra, “Analysis of savitzky-golay filter for baseline wander cancellation in ecg using wavelets,” International Journal of Engineering Sciences & Emerging Technologies, vol. 6, no. 1, pp. 15–23, 2013.

[16] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
