<<

Anomaly Detection on Gas Turbine Time-series Data Using Deep LSTM-Autoencoder

Marzieh Farahani

Marzieh Farahani
Autumn 2020
Degree Project in Computational Science and Engineering, 30 credits
Supervisor: Lili Jiang
External Supervisor: Mohamed Elhafiz Hassan
Examiner: Eddie Wadbro
Master of Science Programme in Computational Science and Engineering, 120 credits

Abstract

Anomaly detection, with the aim of identifying outliers, plays a very important role in various applications (e.g., online spam, manufacturing, finance). An automatic and reliable anomaly detection tool with accurate prediction is essential in many domains. This thesis proposes an anomaly detection method that applies deep LSTM (long short-term memory) networks to time-series data. Validated on real-world data at Siemens Industrial Turbomachinery (SIT), the proposed method shows promising performance and can be employed in different data domains, such as device logs of turbine machines, to provide useful information on abnormal behaviors.

In detail, the proposed method applies an autoencoder to perform feature selection by keeping vital features and learning an encoded representation of the time series. This approach reduces the extensive input data by extracting the autoencoder's latent output. For prediction, we then train a deep LSTM model with three hidden layers on the encoder's latent-layer output. Afterwards, given the output of the prediction model, we detect the anomalous sensors of a specific gas turbine using a threshold approach.

Our experimental results show that the proposed method performs well at detecting anomalies on a noisy, real-world dataset. Moreover, they confirm that making predictions based on the reduced, encoded representation is more accurate; applying an autoencoder can therefore improve both the anomaly detection and the prediction tasks. Additionally, the performance of deep neural networks improves significantly for data with high complexity.

Acknowledgements

We are grateful that the master's final project on anomaly detection of a gas turbine using a deep LSTM-Autoencoder was completed within the given time by Marzieh Farahani, a student of the master's programme in Computational Science and Engineering. This thesis could not have been completed without the effort and cooperation of Siemens and Umeå University. I also thank both supervisors at Siemens and Umeå University, Mr. Mohamed Elhafiz Hassan and Dr. Lili Jiang, for their guidance and encouragement in finishing the final project. Last but not least, I would like to thank the Siemens data scientists, my family, and Mr. Mehrdad Farahani for being a constant source of inspiration and guidance.

Contents

1 Introduction
  1.1 Objectives
  1.2 Scope and Limitation
  1.3 Literature Review
    1.3.1 Statistical-based methods
    1.3.2 Prediction-based methods
    1.3.3 Reconstruction-based methods
  1.4 Thesis Structure

2 Principles and Concepts
  2.1 Time Series and Anomaly Forecasting
    2.1.1 Key Components Associated with an Anomaly Detection Problem
      2.1.1.1 Nature/Type of Anomaly
      2.1.1.2 Type of Time-spaces
  2.2 Time Series and Deep LSTM
  2.3 Dimensionality Reduction (Autoencoder)

3 Methodology
  3.1 Dataset
  3.2 Model Design
  3.3 Prediction Model
    3.3.1 Reconstruction Autoencoder
      3.3.1.1 Reduction Using AE
    3.3.2 Deep LSTM
  3.4 Detection Model
    3.4.1 Anomaly Scoring and Selection of Candidate Set

4 Experimental Study and Results Analysis
  4.1 Prediction Model
    4.1.1 Reconstruction Autoencoder
    4.1.2 Deep LSTM
  4.2 Detection Model

5 Conclusion and Future Work
  5.1 Conclusion
  5.2 Future Work

References

1 Introduction

Anomaly detection (aka outlier detection) is the process of identifying unexpected items, observations, or events in data sets that differ from the norm. As an integral part of most companies and businesses, anomaly detection significantly reduces financial and technical losses. Time-ordered data in various platforms (industrial, health, economic, and financial) has been increasing exponentially along with the emerging Internet of Things (IoT) [18], which enables devices to collect and share data. This growth creates new business opportunities as well as new challenges for detecting outliers in time-series data.

However, many companies still rely on manual monitoring to identify anomalies on different underlying bases, which requires substantial human effort to monitor daily or weekly reports on operations or performance. It is therefore challenging for companies to track all metrics simultaneously and find correlations between them. Besides these difficulties, time series data in companies is noisy and large-scale, and, especially, labels or classes for the anomalous data are lacking. Many researchers are therefore trying to apply data-driven methods. These methods for anomaly detection can be mainly categorized into three types: statistical modeling, such as k-means clustering; temporal feature modeling, which is mainly based on Long Short-term Memory (LSTM); and spatial feature modeling, which takes advantage of Convolutional Neural Networks (CNN) [14]. The primary purpose of these methods is to develop stable algorithms that adapt to system conditions and detect outliers even in different environments. Deep learning methods have become successful due to their capacity to handle non-linearity in complex temporal correlations [8].

Deep learning (DL) is derived from classical machine learning (ML), yet deep learning is responsible for the growth of artificial intelligence usage by improving existing algorithms. It has shown high-grade performance because of its power to deal with unstructured and unlabeled data, and because no domain knowledge is needed to extract features. Nevertheless, it is fair to say that deep learning approaches have limitations, such as extra training time and the need for lots of training data. Moreover, one of the most cited drawbacks of deep learning is that the neural networks at its core are black boxes.

The project aims to provide and apply a deep LSTM method together with an autoencoder, in an unsupervised way, to detect time series anomalies. Apart from working with an unlabeled dataset, there is no need to conduct manual feature engineering, which is a complicated task. Instead, many deep neural network parameters are trained to learn the input data's critical features during the training stage. In addition, the autoencoder helps deal with large-scale, high-dimensional input data that is ordered in time.

1.1 Objectives

Siemens Industrial Turbomachinery (SIT) is one of the biggest international companies in power generation. The company has invested in diverse projects to examine and study machine lifetime and its corresponding components in order to identify how and when various failures influenced the system. In recent years, the digitalization transformation has benefited from collecting and maintaining data in database formats that carry various valuable information about unexpected events, component repairs, and operation outage history. The controlling system, which includes a computing device, gives us information about the hardware components' thermodynamic and operating parameters using sensors placed along the turbine sections. This thesis's principal goal is to develop an advanced maintenance strategy that can help the power plant operators increase their assets' availability and reliability and minimize their CAPEX and OPEX. Capital Expenditures (CAPEX) are significant purchases a company makes on goods or services to develop its future performance. Furthermore, Operating Expenses (OPEX) are the typical costs, such as salaries and rent, that a company incurs to run its day-to-day operations [36]. To reach this goal, we have stated a general Deep Learning (DL) model with two aims:

• Daily forecasting, in order to predict each sensor's value five minutes ahead for a specific gas turbine machine.

• Detection, in order to find the list of sensors with anomalous behavior that give the system a hard time. Together with the daily forecasting model, it reduces system cost.

The model is expected to satisfy the following:

• The model must be generalizable. In this study, the model was developed for a specific case study; for a different case study, results should be obtainable with minimal changes.

• The model should be accurate and validated under different environmental conditions.

The prediction/estimation approach detects abnormal behavior (in the shape of anomalies) in the collected data by comparing it with the desired network outputs. It helps automate the decision-making process for the power plant operators and classify useful patterns during operations. Detecting anomalies can give pre-warnings and reduce system costs for the manufactories. The anomaly detection work in this thesis especially provides useful information to the relevant department within Siemens Industrial Turbomachinery (SIT).

1.2 Scope and Limitation

The thesis scope was decided after careful analysis of the customer service dataset. It is required to declare that a turbomachine, such as a gas turbine, generally includes several sections, and each section includes numerous hardware components. Over time, multiple elements within the gas turbine, including thermal cycling, vibration, and pressure pulses, were measured by sensors (aka signals). Fifteen gas turbine units were considered according to customers' commonly requested units with the frequent turbine-model package. Finally, the project began with one final gas turbine unit; the rest were left out of the project because they were in the commissioning phase and the quality and quantity of their signal values were not good enough. Another limitation is related to the records of signal values for the specific gas turbine unit at several time intervals. The records were collected from 2012 until the ongoing year. Nevertheless, some signals did not have any records for some months between 2012 and 2020. As the project's complexity was high enough, the author of this thesis studied the quality and quantity of the signals for each year. Finally, it has to be pointed out that the primary dataset considered during the thesis is from the year 2013.

1.3 Literature Review

There has been a considerable amount of research in the field of anomaly detection. The most manageable and common way to do time series anomaly detection is to set thresholds and generate warnings whenever the metric goes above or below the threshold. However, finding the threshold for each metric requires a deep understanding of the indicator's behavior, and it is a difficult task to capture the desired output from the complex structures in the data. To overcome this difficulty, more advanced techniques, namely statistical-based, prediction-based, and reconstruction-based methods, are mainly applied.

1.3.1 Statistical-based methods

Statistical-based methods [35] can be classified into supervised and unsupervised approaches. Both supervised and unsupervised techniques aim to isolate anomalies within the time series. In the supervised method, observations are labeled as healthy or faulty based on previous historical data; this dataset is then used to create classification models that can predict unseen records. Support Vector Machine (SVM) and k-nearest neighbor (KNN) are representative algorithms of this category. These algorithms rely on a distance measure between objects: objects that are distant from the others are considered anomalies. This kind of detection is also called distance-based [30]. Both KNN and SVM are classical machine learning methods and are generally used for classification [15] [17]. Nevertheless, standard SVM and KNN may fall short when dealing with anomaly detection, so researchers have examined how these methods could be adapted to anomaly detection problems.

A vital factor for composing an anomaly-based detection model is to select significant features for making decisions. In recent research, the KNN (k-nearest-neighbor) approach, in combination with the mother algorithm, showed excellent and successful feature selection and weighting performance [34]. The procedure is done simply by weighting all initial features in the training stage based on the distance measures, and the top ones are selected to complete the testing stage. The KNN algorithm performs the identification of the nearest neighbors. In most cases, KNN is used as a classifier technique; in this study [35], KNN is presented as a semi-supervised approach to determine the indicator's performance in the health area. This paper [26] also showed how mapping the data into the kernel space and separating it from the origin with maximum margin could address the weaknesses of the standard SVM on anomaly detection problems. This method is called a one-class support vector machine (OCSVM).

The application of these techniques is restricted by the availability of training data of anomalies. Several researchers use density-based methods such as the Local Outlier Factor (LOF) and k-means clustering [37] to handle the limitation of distance-based methods. Still, these techniques' success depends on the similarities between the clusters and the anomalies' characteristics. According to their mutual similarity, observations are grouped into different clusters; for example, standard data may come from large and dense clusters, while anomalies may arise from small and sparse clusters.

In summary, the statistical-based methods cover two families, namely distance-based and density-based. They face two major obstacles with time-series data: they require previous knowledge about the anomaly duration, and they cannot capture temporal correlations.

1.3.2 Prediction-based methods

It is essential for all methods to try to highlight the difference between standard and faulty behaviors. Prediction-based methods learn a predictive model for the given time series data to predict future values. A data point is flagged as an anomaly if the difference between the predicted and original value exceeds a certain threshold. Several traditional prediction models employ the relationships between the time series and its lag features to predict future values, such as Auto-Regressive (AR), Moving Average (MA), Autoregressive Integrated Moving Average (ARIMA), and Seasonal Autoregressive Integrated Moving Average (SARIMA). There are many papers focusing on the above techniques; however, in most cases, these time series prediction methods were not applied to anomaly detection [4] [31]. Still, there exists work that extends traditional time series models so that they can detect anomalies [38] [23].

These techniques have some significant limitations. For instance, this study [24] discussed trend and seasonal time series forecasting methods and their importance for making critical decisions. The research shows that a traditional forecasting model, such as ARIMA, has difficulty modeling nonlinear relationships between variables. Moreover, the ARIMA model assumes a constant standard deviation in its errors, which may not hold for different problems [32].

Deep learning-based approaches attempt to overcome these challenges. LSTM (Long Short-Term Memory) is a particular form of recurrent neural network (RNN) that was initially proposed to solve the vanishing gradient problem in RNNs by replacing their simple internal loop with a different formation that makes LSTMs capable of tracking variables in a sequence and learning the long dependencies between them. In numerous studies, the LSTM, alone or combined with different approaches, can effectively detect anomalies [16] [20].

1.3.3 Reconstruction-based methods

Reconstruction-based models learn by encoding their input data to a lower-dimensional representation in the latent structure and decoding it back to the original input. According to this research [9], "Reconstruction-based methods assume that anomalies lose information when they are mapped to a lower dimension space, thereby cannot be effectively reconstructed; thus, high reconstruction errors suggest high chances of being anomalies." There are several dimensionality-reduction techniques, such as Principal Component Analysis (PCA) and the autoencoder. Of these two methods, the autoencoder has received more attention because it can better handle PCA's limitations; the most visible limitations are that PCA is restricted to linear reconstruction and requires positively correlated data that follows a Gaussian distribution.

Lately, the use of the autoencoder method for anomaly detection has grown. For instance, in this paper [28], a Variational Auto-encoder (VAE) with attention could provide a structured and expressive representation to detect anomalous behavior in time series. Furthermore, another paper [19] uses a Recurrent Neural Network (RNN) to generate multiple autoencoders with different neural network connection structures; as a result, the framework outperforms others on time series outlier detection problems. It is also worth mentioning that the Encoder-Decoder using Long Short-term Memory (LSTM) showed excellent performance on multi-sensor anomaly detection [25].

1.4 Thesis Structure

Chapter 2 presents the theoretical background of this project. It covers a summary of time series and deep learning forecasting methods, followed by the theory of the dimensionality reduction techniques (autoencoders) used in this work. Chapter 3 focuses on the methods used in this project; it starts with the dataset and data preparation, and then the scheme of the prediction and detection models is defined. Chapter 4 shows and discusses the results of the prediction and detection models based on the chosen case study. Chapter 5 concludes with an outline of the outputs and the future work along this project's path.

2 Principles and Concepts

This chapter defines some principles and concepts related to the knowledge needed in the context of this study. The first section demonstrates time series and their properties and reviews the necessary tools for studying time series anomaly detection, which provides the organization with useful information for making significant decisions. The second section introduces the definitions of Deep Learning (DL) and Long Short-term Memory (LSTM) algorithms. The third part discusses the autoencoder algorithm and its applicability to the anomaly detection problem.

2.1 Time Series and Anomaly Forecasting

A time series [30] is a collection of random observations S = {X_t, t ∈ T} made sequentially through time T. In time-series data, we have only one realization and a finite number of variable records. If only one variable is changing over time, the time series is specified as a univariate time series (UTS); otherwise, the set S is defined as a multivariate time series (MTS). In figure 1, one sensor variable of the time-series dataset has been chosen over a determined time duration.

Figure 1: active load sensor's behavior through time

Figure 2: active load sensor’s seasonal decomposition

The time interval for data collection could be, for example, seconds, minutes, hours, days, weeks, months, or years. Time-series data arise naturally in various disciplines, namely finance, economics, environmental science, electrical engineering, and computer science [11]. A stationary time series is said to have a constant long-term mean and variance independent of time. Detection of stationarity or non-stationarity is done by differencing the data from a shifted version of itself after subtracting the trend and seasonality. As a rule, non-stationary data is unpredictable and cannot be modeled or forecasted; when predicting a time series, the data is expected to be stationary. Forecasting [40] in time series is simply described as a process to predict the changes that happen within the given data and the moves that will happen in the future. The prediction methods can be used on the presented data whenever:

• Firstly, each variable's records must have the time dimension and be arranged in temporal order.
• Secondly, the record values are continuous over a settled period under specific laws.

Temporal features, such as trend, seasonality, and residuals, give us important and useful information for the prediction scheme; they can be obtained by decomposing the series as in equation 2.1. The result of the time series decomposition is shown in figure 2.

X_t = m_t + s_t + Y_t    (2.1)

• m_t, trend: a long-term, non-periodic movement in the mean.

• s_t, seasonal variation: cyclic fluctuations, for example due to calendar or daily variations.

• Y_t, residuals: random and all other unexplained variations.
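
To make the decomposition in equation 2.1 concrete, the following minimal sketch extracts the trend, seasonal, and residual components of a single sensor series with statsmodels. The series name sensor_series and the period of 288 samples (a daily cycle at 5-minute sampling) are illustrative assumptions, not values taken from the thesis.

import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

def decompose_sensor(sensor_series: pd.Series, period: int = 288):
    # Additive decomposition X_t = m_t + s_t + Y_t (equation 2.1).
    # period=288 assumes 5-minute samples with a daily cycle; adjust to the data at hand.
    result = seasonal_decompose(sensor_series, model='additive', period=period)
    return result.trend, result.seasonal, result.resid  # m_t, s_t, Y_t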

2.1.1 Key Components Associated with an Anomaly Detection Problem

To study time series anomaly detection and prediction, the first thing is to know what an anomaly is and what types exist; it is essential to agree on what counts as an exception. The second is to understand the different types of time-spaces used in prediction. As discussed before, anomalies are patterns in data that do not fit a well-defined notion of normal behavior. Most of the present anomaly detection techniques solve a particular problem, and solving the problem is influenced by numerous circumstances, such as the type of anomalies [2] and the prediction time-spaces that deal with numerical data.

2.1.1.1 Nature/Type of Anomaly

Point anomalies: a single record of the data deviates strongly from the rest of the data points in the dataset. A common example is credit card fraud detection.

Contextual anomalies: anomalies in the dataset whose detection hugely depends on contextual information. This type of anomaly is common in time-series data.

Collective anomalies: a collection of related data instances, considered with respect to the entire dataset, is regarded as anomalous, rather than any individual value.

2.1.1.2 Type of Time-spaces

Time series prediction on numerical data can be made over three types of time-spaces: short-term, mid-term, and long-term periods [24]. A short-term forecasting period is set as a time frame of fewer than three months, whereas the mid-term focuses on a time frame of three months to one year, and the long-term is considered more than a year. It is fair to state that this categorization of time frames can change based on the problem's circumstances. For example, in traffic time series prediction, it is possible to consider the short-term, mid-term, and long-term periods as seconds, minutes, and hours.

2.2 Time Series and Deep LSTM

The definition of deep learning varies slightly. However, most researchers agree at the core that deep learning is a sub-field of machine learning which can learn from high-dimensional data in a supervised, unsupervised, or hybrid manner [22]. The word "deep" pictures a network of layers stacked on top of each other. Each layer can be seen as a non-linear module that receives the previous layer's output as its input and transforms the input data into meaningful output automatically, which is one reason these models are quite popular. In recent years, deep learning has frequently been applied in various anomaly detection algorithms, as illustrated in figure 3. Deep anomaly detection (DAD) [1] techniques can automatically learn and extract features without manual feature engineering by domain experts. Unsupervised deep anomaly detection is expected to gain more attention because collecting labels for an imbalanced dataset has many difficulties; a dataset is imbalanced if anomalous behavior happens rarely and most of the records are normal.

Figure 3: Performance Comparison of traditional vs Deep Learning algorithms. Picture adopted from [1]

The first group of techniques deals with supervised classification. In these methods, records of the variable are labeled as anomalous or normal based on previous historical data; this dataset is then used to create classification models that can predict the state (normal or anomalous) of unseen records. The second group deals with unsupervised methodologies, which are based on unlabeled states. This approach aims to detect outlier behavior in contrast with the legitimate behavior; for this purpose, the model needs to extract the standard behavior of each state and then identify anomalous activities.

In summary, unsupervised deep learning models are usually used for denoising, compression, or finding correlations. One of these models is the Long Short-term Memory (LSTM). Long Short-term Memory networks (LSTMs) are well-suited to classifying, processing, and making predictions based on data that behaves like a time series. The LSTM was developed to deal with the exploding and vanishing gradient problems encountered when training traditional Recurrent Neural Networks (RNNs) [27]. LSTMs are capable of learning the dependencies between variables over a long period of time.

In general, as demonstrated in figure 4, an LSTM [13] contains a hidden state h_t, a cell state c_t, and LSTM gates (input, output, and forget). The hidden state and cell state are also known as the external and internal state, respectively. The external state is the output of the network; it reflects the LSTM capacity, and the choice of the hidden cell size is on the user's shoulders.

Figure 4: LSTM structure. Picture adopted from [5]

The cell state is one of the significant differences between LSTM and RNN networks, because the internal state can act as a memory cell for the LSTM and keeps information from the past; however, it is not required to appear at the output gate of the LSTM network. The gates of the LSTM [12] provide continuous analogs of writing, reading, and resetting information. It is essential to point out that the final result of each gate goes through the sigmoid function to map the values between zero and one. The forget gate is the first gate in the LSTM network. It is responsible for deciding how much information should be kept by the network: the closer the sigmoid result is to one, the more information from the past is stored by the LSTM unit; similarly, the closer the sigmoid result is to zero, the less information from the past is saved. This result affects the previous cell state c_{t-1}. The input gate is the second gate. It is responsible for choosing the amount of new information added to the previous LSTM knowledge to perform better. This choice is made after applying the sigmoid function to the new input and the past state.

The cell state is updated by multiplying the input gate's result with C̃_t to provide a new vector that is added to the recurrent cell state. The output gate decides on the LSTM output, and it affects the hidden state value as well.

Three different sets of weights (W_xh, W_hh, b) are included in the LSTM gates. The weights are matrices that represent a linear transformation of the input. The calculation of the weights is done automatically based on the input and the desired output shape. The functions of the LSTM unit are shown in detail in the following equations (2.2-2.6).

It is good to know that for an LSTM layer with h units, the number of parameters is 4 * (h_units * h_units + h_units * num_features + h_units * 1).

Forget gate:    f_t = σ(W_xhf x_t + W_hhf h_{t-1} + b_f)    (2.2)

Input gate:    i_t = σ(W_xhi x_t + W_hhi h_{t-1} + b_i)    (2.3)

Information:    C̃_t = tanh(W_xhc x_t + W_hhc h_{t-1} + b_c)    (2.4)

Cell state:    C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t    (2.5)

Hidden state/output:    h_t = o_t ⊙ tanh(C_t)    (2.6)
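
To make the parameter-count rule above concrete, the sketch below builds a single Keras LSTM layer with the functional API and compares Keras' own count with 4 * (h*h + h*num_features + h). The sizes (200 units, 12 timesteps, 10 features) are taken from the model in chapter 3, and TensorFlow 2.x is assumed.

import tensorflow as tf

h_units, timesteps, n_features = 200, 12, 10

inputs = tf.keras.layers.Input(shape=(timesteps, n_features))
outputs = tf.keras.layers.LSTM(h_units)(inputs)
model = tf.keras.Model(inputs, outputs)

# 4 gates, each with recurrent weights (h*h), input weights (h*num_features) and a bias (h)
expected = 4 * (h_units * h_units + h_units * n_features + h_units)
print(model.count_params(), expected)  # both print 168800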

2.3 Dimensionality Reduction (Autoencoder)

There are different techniques to reduce input dimensionality. For instance, Principal Component Analysis (PCA) is used for dimensionality reduction as a linear method, and the autoencoder is applied for dimensionality reduction as a non-linear method. PCA briefly uses statistical techniques to give an unlabeled, high-dimensional dataset a dimensionality reduction. Moreover, the autoencoder benefits from applying both dimensionality reduction and feature engineering; autoencoders are usually helpful for extracting useful features from the input data in an unsupervised way [22]. The autoencoder is a special design of neural network that tries to learn an image of its input. The autoencoder model is formed of two main models which are in charge of an operation called reconstruction: the encoder model encodes its input data into a lower or higher dimension in a hidden layer (aka latent space), and the decoder model tries to decode back the original input in the desired space. There are different types of autoencoders available, like variational, sparse, and denoising autoencoders [21] [10]. Autoencoders have some knowledge beforehand of how their output should look; therefore, they are considered self-supervised models [39]. Figure 5 shows the general structure of the autoencoder. The structure of the autoencoder is usually symmetric, meaning the encoder layer sizes are the same as the decoder layer sizes but in reverse order.

Figure 5: Autoencoder structure: f and g represent the encoder and decoder functions

The following equations 2.7 and 2.8 show the general functions for a basic autoencoder with one layer. In these equations, the functions f(x) and g(x) represent the encoder and decoder models, respectively. In the encoder stage, σ_1 and σ_2 are activation functions, W^(1) and W^(2) are weight matrices, and b^(1) and b^(2) are bias vectors. The entire reconstruction of the input x is determined by g∘f(x).

h = f(x) = σ_1(W^(1) x + b^(1))    (2.7)

x̃ = g(h) = σ_2(W^(2) h + b^(2))    (2.8)

It is essential to consider that, to better perform the input reconstruction, the system needs to minimize the error based on the loss function defined in equation 2.9.

L(x, x̃) = ||x - x̃||² = ||x - σ_2(W^(2) σ_1(W^(1) x + b^(1)) + b^(2))||²    (2.9)
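
A minimal NumPy sketch of equations 2.7-2.9 follows; the sizes and the sigmoid activations are arbitrary choices for illustration, not the configuration used later in the thesis.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hidden = 8, 3                                            # arbitrary sizes
x = rng.normal(size=n_in)

W1, b1 = rng.normal(size=(n_hidden, n_in)), np.zeros(n_hidden)   # encoder parameters
W2, b2 = rng.normal(size=(n_in, n_hidden)), np.zeros(n_in)       # decoder parameters

h = sigmoid(W1 @ x + b1)            # equation 2.7: h = f(x)
x_rec = sigmoid(W2 @ h + b2)        # equation 2.8: reconstruction g(h)
loss = np.sum((x - x_rec) ** 2)     # equation 2.9: squared reconstruction error
print(loss)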

In this paper [3], autoencoders are categorized based on the number of layers. There are two main types: shallow and deep. The shallow type, also known as the autoencoder's main structure, contains three layers: input, encoding (one hidden layer), and output. In contrast, a deep autoencoder has more than one hidden layer. Figure 6 shows four types of autoencoders made from these two combinations.

Figure 6: Different types of autoencoder structure. Picture adopted from [3]

3 Methodology

3.1 Dataset

For this project, we obtained data for one specific gas turbine at Siemens Industrial Turbomachinery (SIT), representing the records of 69 sensors in 2013 between January and December. The data is a daily log, at a one-minute interval, of each sensor's KPIs for one year. In figures (7-9), three arbitrary examples out of the 69 time series are shown. Each time series behaves differently through time depending on the sensor's location on the gas turbine. We consider the unexpected changes (increases and drops) in the time series patterns as anomalous behavior, which we intend to predict.

Figure 7: inlet pressure sensor after Standardization: The real value of the sensor could not be shown because of the Siemens company restriction

Figure 8: air temperature sensor after Standardization: The real value of the sensor could not be shown because of the Siemens company restriction

Figure 9: outlet pressure sensor after Standardization: The real value of the sensor could not be shown because of the Siemens company restriction

Due to the restriction on providing sensors with their realistic values, it is essential to mention important properties of the dataset that affect the preprocessing steps. Each of these time series is on a different scale. Furthermore, the original dataset contains two types of quality flags for the sensor values (good-quality and bad-quality); after removing the bad-quality values from the dataset, missing values may appear. Additionally, there are sensors that do not show any signal behavior, their values only lying on two numbers (zero or one), which makes the data noisy. Therefore, data preprocessing helps to make the raw data ready to be fed to the neural network. Data preprocessing is the way to handle missing values, normalization, and vectorization. There are standard preprocessing techniques related to time series, for instance Power Transformation,

Difference Transformation, Standardization, and Normalization. The power transformation is used to transform data into a normal (Gaussian) distribution. The difference transform removes the trend and seasonality structure from the time series. Standardization transforms the data to zero mean and standard deviation one, as shown in equation 3.1. Normalization is a scaling of the data to between zero and one, or minus one and plus one, as noted in equation 3.2; it is also called the Min-Max scaler.

Z_x = (x_i - x̄) / σ    (3.1)

MinMax_x = (x_i - x_min) / (x_max - x_min)    (3.2)
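
The two transformations in equations 3.1 and 3.2 can be written directly in NumPy; the following is only an illustrative sketch on a toy array, while the thesis itself applies scikit-learn's StandardScaler (code block 3.2).

import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])                 # toy sensor readings

z = (x - x.mean()) / x.std()                       # equation 3.1: standardization
min_max = (x - x.min()) / (x.max() - x.min())      # equation 3.2: min-max scaling

print(z)
print(min_max)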

The goal is to have a mid-term prediction for each of the 69 sensors based on the historical data and to predict whether there will be abnormal behavior in each sensor's performance. For this aim, the flowchart in figure 10 shows the steps we followed in the preprocessing stage.

Figure 10: data preprocess steps

The first step is the preparation of the time interval. This step allows the user to determine the time interval duration; for instance, the user could adjust the one-minute interval of the raw data to a five-minute interval. The implementation code for this part is shown in block 3.1.

import pandas as pd

def data_preparation(df, dt_col_name, val_col_name, interval='5T'):
    # Parse the timestamp column and resample the sensor value to the requested interval mean
    df[dt_col_name] = pd.to_datetime(df[dt_col_name])
    df = df.groupby(pd.Grouper(key=dt_col_name, freq=interval))[val_col_name].mean()
    df = pd.DataFrame(df)
    df[dt_col_name] = df.index
    df = df.reset_index(drop=True)
    return df

Block 3.1: time interval preparation code block
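
A hypothetical call to the helper above; the dataframe raw_df and the column names 'ts' and 'value' are illustrative and not taken from the original dataset.

# Resample the one-minute raw log of one sensor to five-minute means.
df_5min = data_preparation(raw_df, dt_col_name='ts', val_col_name='value', interval='5T')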

In the whole duration of 2013, some records can be missing because of the operation per- formed in the preprocessing phase and mechanical/electrical failures during the data recov- ery process. These missing values are considered unknown data (aka incompleted feature vector). Different types of approaches exist to deal with missing values. One of the ap- proaches is imputation or estimation of missing values. Imputation could be implemented 17(43)

based on statistical methods such as Mean imputation, Regression Imputation, and Multi- ple Imputation. [7]. In this project, we use Mean imputation to solve the missing sensors’ value, as it shown in block 3.1. This research deals with the time series prediction problem. It is clear to state prediction on a non-time series dataset is more accessible than a time series dataset. The reason behind that is scoring on new records can be performed independently of the other records. How- ever, in the non-time series data, scoring new records depends on recent records’ look-back window. Hence, The following steps are considered to be Normalizing data and generating the timesteps called the " Multi-feature Window Method." At first, we used standard scaling to scale the time series. Next, in the multi-feature window method, we chose a window size of 12 as a number of timesteps, and the prediction is for when timestep is equal to zero. In the case of the multi-feature prediction, each time series is considered one feature itself. Since there are 69 selected time series, it gives us 69 features. We put the first timestep of all 69 time-series at first positions, then the second timestep of all time-series go after them, and so on, as illustrated in the Figure 11.

Figure 11: Multi-feature approach: the train data starts with timestep 12 of all time series, then timestep 11 of all 69 time series, down to timestep 1 of all 69 time series. The targets are the future timesteps of all time series. Each time series (ts) is the data related to one sensor located on the gas turbine.

import pandas as pd
from sklearn import preprocessing

x_cols = list(d_tot_copy.columns[1:])
ts_col = d_tot_copy.columns[0]

# standard scaler: zero mean and unit variance for every sensor column
s_data = d_tot_copy[[ts_col] + x_cols]
scaler = preprocessing.StandardScaler()
scaler_data = scaler.fit_transform(s_data[x_cols].values).tolist()
scaler_data = pd.DataFrame(scaler_data, columns=x_cols)
s_data = pd.DataFrame(pd.concat([scaler_data, s_data[ts_col]], axis=1), columns=s_data.columns)
s_data.head()

Block 3.2: standardization code block
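
The multi-feature window method itself is not shown in the original listings; the following sketch is one possible implementation, assuming scaled is a NumPy array of shape (n_rows, 69) holding the standardized sensors. The helper name and the exact ordering of timesteps inside a window are our assumptions (figure 11 orders them from timestep 12 down to 1).

import numpy as np

def make_windows(scaled: np.ndarray, window: int = 12):
    # scaled: (n_rows, n_features) standardized sensor matrix
    # returns X with shape (n_samples, window * n_features) and targets y with shape (n_samples, n_features)
    X, y = [], []
    for t in range(window, len(scaled)):
        X.append(scaled[t - window:t].reshape(-1))   # flatten the last `window` timesteps of all sensors
        y.append(scaled[t])                          # the next timestep of all sensors is the target
    return np.asarray(X), np.asarray(y)

# X, y = make_windows(scaled)   # X: (n_samples, 828) for 69 sensors and a window of 12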

The last step in the data preprocessing flowchart is splitting the data into train, validation, and test sets. It is vital to consider that a random split is not the right choice for a time series dataset: choosing random rows from the dataset would cause the loss of valuable information, because of the continuous and time-ordered behavior of time-series datasets. Figures (12-14) show a normalized sample of the final data split into training, validation, and test datasets.

Figure 12: standardization data examples. The sensor-9 time series is split into train data (blue), validation data (red), and test data (green)

Figure 13: standardization data examples. The sensor-7 time series is split into train data (blue), validation data (red), and test data (green)

Figure 14: standardization data examples. The sensor-467 time series is split into train data (blue), validation data (red), and test data (green)

3.2 Model Design

This chapter presents the framework of the research methods followed in this study. It provides the prediction and detection model structures in detail, which helped solve the anomaly detection problem associated with Siemens Industrial Turbomachinery (SIT), as described in figure 15.

Figure 15: proposed model in the study

3.3 Prediction Model

The prediction model consists of two parts: a reconstruction autoencoder, and a Deep Long Short-term Memory model.

3.3.1 Reconstruction Autoencoder

This part generates a representation of our input time series. For this purpose, an autoencoder with a reconstruction goal was implemented. It receives 2D input data with shape (None, 828) and outputs a reconstruction of its input. The autoencoder model is composed of two models, an encoder and a decoder, each built as a multilayer network. The encoder model has two hidden dense layers and one latent dense layer. The decoder model has one latent dense layer, two hidden dense layers, and an output layer.

Figure 16: Encoder model structure: consists of the input layer, two hidden dense layers with 512 and 256 hidden units, and a latent layer with 120 units

Figure 16 shows the structure of the encoder model, built as a multilayer perceptron network: each neuron in one layer is connected to all neurons of the next layer. The input, for 69 features (the number of available time series) and 12 timesteps, has the shape (None, 12 * 69) = (None, 828). The first hidden layer has 512 neurons, so its output has the shape (None, 512). The second hidden layer has 256 neurons; hence, this layer's output has the shape (None, 256). The last layer in the encoder model is the latent layer, which has 120 neurons, since our goal is to extract essential features by reducing the total size of the input layer; therefore, the output of this layer is (None, 120). Equation 3.3 gives the computation of the whole encoder structure, where o_n^(3) indicates the output of the third dense layer (aka latent layer). In this equation, x_i is the input to the model, w_uv^(l) denotes the connection between the v:th neuron in layer l-1 and the u:th neuron in layer l, and the bias of the u:th neuron in layer l is represented by b_u^(l). σ_1 and σ_2 are the activation functions in the first and second dense layers, respectively.

" " " # # # 3 3 2 1 1 2 3 On=64 = ∑wnm σ2(∑wk j σ1(∑w jixi + b j ) + bk) + bn (3.3) m j i

A multilayer perceptron network builds the decoder model in the same way as the encoder: each neuron in one layer is connected to all neurons of the next layer. The decoder's input is the result of the encoder stage in the latent layer, which has the shape (None, 120). The first hidden layer has 256 neurons and the second hidden layer has 512 neurons; hence, their outputs have the shapes (None, 256) and (None, 512), respectively. The last layer in the decoder model has 828 neurons, since our goal is to reconstruct the input layer; therefore, the output of this layer is (None, 828). The structure is the same as in figure 16 but in the opposite direction.

We applied the mean squared error (MSE) as the loss function for this model. Equation 3.4 explains how to measure this loss, where y'_i is the reconstructed value, y_i is the input value, and N is the number of features (aka observations). The loss function represents the error of the reconstructed value compared to the expected result. Another vital step is updating the weights in order to improve the reconstruction result of the network. Gradient descent assists us in this step by minimizing the error given by the loss function. The process is done by calculating the gradient: first, we collect the different network parameters that affect the loss function, such as the weight and bias matrices, into θ; then, the gradient of the loss function with respect to these parameters is ∂L(θ)/∂θ. Equation 3.5 represents how to update the θ parameters for each layer using the gradient descent method. In the equation, γ is the learning rate, a parameter that controls how much the parameters are updated.

L = (1/N) Σ_{i=1}^{N} (y_i - y'_i)²    (3.4)

θ = θ - γ ∂L(θ)/∂θ    (3.5)

Meanwhile, the network is extensive and the size of the training data is vast; therefore, an optimizer algorithm benefits the model by speeding up learning and minimizing the loss function. Stochastic Gradient Descent (SGD) is an alternative to gradient descent: instead of computing the loss and updating the parameters on the whole dataset, the SGD algorithm divides the dataset into batches and updates the parameters for each batch's loss calculation. There are other optimizer algorithms, such as RMSprop, AdaGrad, and Adam. The optimizer is a hyperparameter, and tuning it helps to get a better result as well. In this study, we mainly used Adam.

3.3.1.1 Reduction Using AE

The autoencoder is a way to transform the representation of the input. There are two kinds of design for the autoencoder: sparse or compressed. A sparse autoencoder is obtained by keeping the number of hidden layer nodes greater than the number of original input nodes. On the other hand, a compressed autoencoder is obtained by selecting the number of hidden layer nodes to be less than the number of original input nodes. This study focuses on the compressed representation of the input, which achieves the desired dimensionality reduction effect. In this part, we are looking for a non-linear projection method that maps the data from a high-dimensional feature space to a lower-dimensional feature space, because sample data in high-dimensional space generally does not diffuse through the whole space but lies on a low-dimensional manifold embedded in the high-dimensional space. The dimensionality reduction process is done by designing the non-linear autoencoder reconstruction in the first stage, as shown in code block 3.3.

import tensorflow as tf

def build_ae(input_dim, latent_dims, lr, dropout_rate):
    # inputs
    inputs = tf.keras.layers.Input(shape=[input_dim], name='inputs')
    x = inputs

    hidden_dims = latent_dims[:-1]
    latent_dim = latent_dims[-1]

    # encoder: hidden dense layers followed by the latent layer
    for hidden_dim in hidden_dims:
        x = tf.keras.layers.Dense(hidden_dim, activation='linear')(x)
        x = tf.keras.layers.Dropout(rate=dropout_rate)(x)

    x = tf.keras.layers.Dense(latent_dim, activation='linear', name='latent_layer')(x)

    # decoder: mirror the hidden layers in reverse order
    for hidden_dim in hidden_dims[::-1]:
        x = tf.keras.layers.Dense(hidden_dim, activation='linear')(x)
        x = tf.keras.layers.Dropout(rate=dropout_rate)(x)

    outputs = tf.keras.layers.Dense(input_dim, activation='sigmoid')(x)

    model = tf.keras.Model(inputs=inputs, outputs=outputs)
    opt = tf.keras.optimizers.Adam(learning_rate=lr)
    model.compile(optimizer=opt, loss='mse')

    return model

Block 3.3: autoencoder build model code block
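
A hypothetical call to build_ae reproducing the configuration used in this study (828 input features, 512 and 256 hidden units, a 120-dimensional latent layer, dropout 0.1, learning rate 2e-4):

# 828 = 12 timesteps * 69 sensors; latent_dims lists the hidden sizes followed by the latent size.
ae = build_ae(input_dim=828, latent_dims=[512, 256, 120], lr=2e-4, dropout_rate=0.1)
ae.summary()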

The critical point in designing the autoencoder model is to keep the number of hidden layer nodes smaller than the number of original input layer nodes. The final step is selecting the latent layer, which contains the compressed information of the input layer. The scheme of the reduction process is shown in code block 3.4 and figure 17.

Figure 17: Summary of the reduction model

def dr_model(ae, layer_name='latent_layer'):
    # Cut the trained autoencoder at its latent layer to obtain the dimensionality reduction model
    inputs = ae.input
    outputs = ae.get_layer(layer_name).output

    model = tf.keras.Model(inputs=inputs, outputs=outputs)
    model.compile(optimizer='adam', loss='mse')

    return model

Block 3.4: reduction build model code block

3.3.2 Deep LSTM

As the second part of the model, we applied a deep LSTM model with three LSTM layers. It is worth discussing the model structure in more detail. One sample is a sequence of inputs that overlaps with the next sequence; one feature is one observation at a timestep; and the timestep count is the number of times the LSTM is unfolded (aka neurons). Accordingly, the LSTM's input must be a 3-dimensional tensor representing the time sequence order, with the shape (n_samples, timesteps, n_features), as shown in figure 18.

Figure 18: A 3D time series data tensor. Picture adopted from [6]
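
As a small illustration of this shape requirement, the sketch below turns a batch of flat feature vectors into the 3D tensor (n_samples, timesteps, n_features) that an LSTM layer expects; the sizes and the random data are placeholders.

import numpy as np

n_samples, timesteps, n_features = 1000, 12, 10
flat = np.random.rand(n_samples, timesteps * n_features)        # e.g. 120-dimensional encoded windows

lstm_input = flat.reshape(n_samples, timesteps, n_features)     # shape (1000, 12, 10)
print(lstm_input.shape)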

Figure 19 shows that this model’s input is in the shape of (None, 12, 10), where 12 is the

number of time steps, and 10 is the number of features. Therefore, the LSTM has been unfolded 12 times.

Figure 19: An LSTM layer with input shape of (None, 12, 10)

The units of the LSTM model are the number of hidden units, and they define the dimension of the output. The units can be considered the LSTM capacity; the larger the number of units, the more learning capacity the LSTM has. It is relevant to mention that this is one of the parameters that must be tuned to counter overfitting during the training phase. This model's hidden layers were chosen as 200, 100, and 200 units for the first, second, and third layers, respectively. The output layer is a dense layer with 69 hidden units to predict all 69 target (sensor) values for the next 5 minutes based on information from the last 60 minutes.

Depending on the desired LSTM model, LSTMs can have different output approaches; the hidden states can be considered the outputs of an LSTM layer. Each LSTM layer has an option called return-sequence. Return-sequence is set to False by default, which means only the last LSTM hidden state, i.e., the last timestep of the current sequence, is returned as output. By setting return-sequence to True, the output of the LSTM will be all hidden states from all timesteps in the sequence (not only the last one). In this model, we set return-sequence to True for the first and second hidden layers with 200 and 100 units; for the last hidden layer with 200 units, we set return-sequence to False. In addition, if the return-state option of an LSTM layer is set to True, then c_t is returned as output besides h_t. The outputs of the three layers are shown in table 1.

Table 1: Output of each of the three hidden layers

Input shape       hidden-layer 1 and 2 outputs          hidden-layer 3 output
(None, 12, 10)    (None, 12, 200), (None, 12, 100)      (None, 200)

import tensorflow as tf

def lstm_model(n_timestamps, n_features, n_outputs, n_units=None, dropout_rate=0.2, lr=2e-4):

    n_units = n_units if isinstance(n_units, list) else [100, 100]

    # create the input: entries have the shape (None, n_timestamps, n_features)
    inputs = tf.keras.layers.Input(shape=[n_timestamps, n_features], name='inputs')

    x = inputs

    # stacked LSTM layers; all but the last return the full sequence of hidden states
    for units in n_units[:-1]:
        x = tf.keras.layers.LSTM(units, return_sequences=True)(x)
        x = tf.keras.layers.Dropout(rate=dropout_rate)(x)

    x = tf.keras.layers.LSTM(n_units[-1], return_sequences=False)(x)
    x = tf.keras.layers.Dropout(rate=dropout_rate)(x)

    outputs = tf.keras.layers.Dense(n_outputs, activation='linear')(x)

    model = tf.keras.Model(inputs=inputs, outputs=outputs)

    opt = tf.keras.optimizers.Adam(learning_rate=lr)
    model.compile(optimizer=opt, loss='mse')

    return model

Block 3.5: Deep LSTM build model preparation code block
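
A hypothetical call to lstm_model with the configuration described above (three LSTM layers of 200, 100, and 200 units, 12 timesteps, 10 features, 69 outputs):

deep_lstm = lstm_model(n_timestamps=12, n_features=10, n_outputs=69,
                       n_units=[200, 100, 200], dropout_rate=0.1, lr=2e-4)
deep_lstm.summary()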

Statefulness in LSTMs is related to the use of batches in the training process. The batch size is the number of samples the network sees before updating the weights [33]; in this model, the batch size is 64. To prevent overfitting during the training process, besides the other methods and hyperparameter tuning, one can use dropout regularization. Dropout randomly sets the output of some hidden units of a layer to zero during training; the dropout rate in this study was chosen as 0.1. A simple explanation of statefulness is that a long sequence is divided into smaller pieces, or batches, and the cell state of the last timestep of the i:th sample of the current batch is passed to the i:th sample of the next batch to initialize its value. The math behind an LSTM was described in equations (2.2-2.6) of section 2.2, and we used MSE as the loss function. We also used Adam as the optimizer and chose a linear activation function for the LSTM layers. The implementation code of the deep LSTM and its summary are shown in code block 3.5.

The prediction model is a combination of the autoencoder reduction and the deep Long Short-term Memory models. First, a three-layer multilayer autoencoder is used for automatic feature selection and representation learning (encoded features), as explained before. The purpose of using the autoencoder is to learn the behavior across multiple time series with a variety of patterns, in order to capture the correlation among them and obtain useful features with a fixed dimension. Then, by extracting the reduced representation of the input from the autoencoder model's latent layer, we obtain the reduced form of the input features. It is essential to mention that if there is abnormal behavior in the input, it will be captured by the encoder. The next step is to feed these embedded features to the prediction part of the model. The prediction part is a deep LSTM with three layers whose input is the new, reduced representation created by the autoencoder reduction. This new representation needs to be in 3D shape, as described earlier in this section; to fulfill this need, we used the expand-dimension technique. Figure 20 shows the combination of the two models of sections 4.1.1 and 4.1.2.

Figure 20: A prediction model based on features extracted by an autoencoder model. Picture adopted from [41]
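
The following sketch wires the two parts together in the order described above. It reuses build_ae, dr_model, and lstm_model from the code blocks in this chapter; the arrays X_train, X_valid, and y_train and the training arguments are illustrative assumptions rather than the exact training script of the thesis.

# 1) train the reconstruction autoencoder on the flat (None, 828) windows
ae = build_ae(input_dim=828, latent_dims=[512, 256, 120], lr=2e-4, dropout_rate=0.1)
ae.fit(X_train, X_train, batch_size=64, epochs=100, validation_data=(X_valid, X_valid))

# 2) cut the network at the latent layer and encode the windows (828 -> 120 features)
encoder = dr_model(ae, layer_name='latent_layer')
latent_train = encoder.predict(X_train)

# 3) expand the 120-dimensional encoding into the 3D shape (n_samples, 12, 10) expected by the LSTM
latent_train_3d = latent_train.reshape(-1, 12, 10)

# 4) train the deep LSTM to predict the 69 sensor values five minutes ahead
deep_lstm = lstm_model(n_timestamps=12, n_features=10, n_outputs=69,
                       n_units=[200, 100, 200], dropout_rate=0.1, lr=2e-4)
deep_lstm.fit(latent_train_3d, y_train, batch_size=64, epochs=100)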

3.4 Detection Model

From the start, we have no prior knowledge about normal and abnormal data. Nevertheless, our goal is focused on predicting and detecting anomalous behavior. Section 3.3 thoroughly reported the prediction model and how the autoencoder reduction with the deep LSTM can foretell five minutes ahead for each of the 69 signals based on one hour of history. Figure 21 displays the principal detection model actions; the flowchart in the figure shows the steps we took to detect anomalous behavior. The first step in detecting anomalies is to feed the prediction model's output to the detection model. The next step is measuring the aggregate error. The prediction error is obtained by calculating the formula shown in equation 3.6, and the aggregate error is computed by taking the mean of the prediction errors of all 69 sensors; the aggregate error represents an error that wraps several errors into a single one. The next move in the flowchart is finding candidate anomaly dates and the involved sensors, which can be reached with the following critical steps.

Figure 21: scheme of the Detection Model Steps

Error_i = (y_i - y'_i)  ⇒  Aggregate_Error = ( Σ_{i=1}^{N} √((y_i - y'_i)²) ) / N    (3.6)

3.4.1 Anomaly Scoring and Selection of Candidate Set

To accomplish this step, we must distinguish the observations whose anomaly scores deviate significantly from the others. The scoring technique must be applied to the aggregate errors of equation 3.6. The critical problem is finding the best cut-off threshold, when the boundaries between normal and anomalous behavior are not noticeable, in order to minimize the false positive rate while maximizing the detection rate. According to two assumptions from the paper [29] about anomaly detection on unlabelled data, the anomalies are assumed to form a small portion of the data, which is assumed not to exceed five percent, since we are more interested in finding a fraction of the anomalies with high confidence than in finding all anomalies. However, the dataset might not approximately represent a large portion of standard data; to solve this problem, we consider a suitable portion of the data as normal, for example 80%. To score the aggregate errors, we applied the quantile method to fix the confidence area of anomalies. The implementation of this step can be found in block 3.6.

import numpy as np

# calculate the per-sensor squared error
error = np.square(yy - yy_pred).T

# calculate the aggregate error (mean over all sensors for every test timestep)
agg_error = np.mean(error, axis=0)

# candidate aggregate error: keep only the values above the 99% quantile
ac_agg_error = np.where(agg_error > np.quantile(agg_error, 0.99), agg_error, 0)

Block 3.6: threshold on detection model code block. yy represents the y-test and yy_pred represents the prediction result of the deep LSTM on the test data

A quantile determines how many values in a distribution are above or below a specific limit. Figure 22 shows that 1% of the dataset is considered anomalous and 99% as standard. Next, we need to select candidate sets of dates on which anomaly events occurred. Hence, the first step is obtaining all dates in the test dataset that fall in the anomaly confidence area, which is 1% of the whole dataset, as illustrated in figure 23, part one. Next, based on the dates suspected to have an anomalous state, we need to determine the involved sensors. This is arranged by investigating all 69 sensors in the anomaly confidence area with the quantile technique, as shown in figure 23, part two. In other words, we need to check whether each sensor lies in the top 1% of the dataset or not.
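
A sketch of the two-stage quantile selection of figure 23, continuing from code block 3.6; the array dates (the test-set timestamps) is assumed to exist, and error is the per-sensor squared error matrix of shape (69, n_test) from that block.

import numpy as np

# stage 1: dates whose aggregate error falls in the top 1% become candidate anomaly dates
date_threshold = np.quantile(agg_error, 0.99)
candidate_idx = np.where(agg_error > date_threshold)[0]
candidate_dates = dates[candidate_idx]

# stage 2: on those dates, flag every sensor whose own error exceeds its 99% quantile
sensor_thresholds = np.quantile(error, 0.99, axis=1)            # one threshold per sensor
involved_sensors = {}
for idx in candidate_idx:
    involved_sensors[dates[idx]] = np.where(error[:, idx] > sensor_thresholds)[0]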

Figure 22: Confidence Area of anomaly and Normality area

Figure 23: Quantile Technique to find candidate set of anomaly case

4 Experimental Study and Results Analysis

In this chapter, the results found by this study are presented and discussed. The chapter is arranged into two sections: the first shows the prediction model results obtained by tuning hyperparameters in order to find the best combination; the second discusses the detection model and its limitations.

4.1 Prediction Model

The first model concentrates on the prediction, which is a combination of the autoencoder reduction and the deep LSTM. Each model's results are described individually.

4.1.1 Reconstruction Autoencoder

For the reconstruction autoencoder model with multi-feature experiments, combinations of the values listed in table 2 were tried. Table 3 shows the settings applied to the autoencoder model experiment. The model is trained on the training data and validated on the validation dataset. Table 4 and figure 24 display the tuning results for the top five combinations given the parameters settled in table 3. As table 4 shows, the reconstruction autoencoder's best result with multi-features belongs to the model using the Adam optimizer, 512 and 256 as the numbers of first and second hidden layer units, and 0.1 as the dropout rate. Figure 25 shows the loss functions during training of the autoencoder model.

Table 2: Combinations of hyper-parameter values used for training the autoencoder model

First hidden layer     512, 256
Second hidden layer    256, 128
Dropout rate           0.1, 0.2

Table 3: Parameter settings for the autoencoder model

Parameter                Value
Batch-size               64
Epochs                   100
Train data shape         (89586, 828)
Validation data shape    (7905, 828)
Test data shape          (7905, 828)
Latent layer             120
Optimizer                Adam
Activation               linear
Learning rate            2e-4

Table 4: Error on the test data set during the hyperparameter tuning process for the autoencoder model. U1: number of first hidden layer units; U2: number of second hidden layer units

Optimizer    Best combination               Mean squared error (MSE)
Adam         dr=0.1, U1=512, U2=256         0.04205
Adam         dr=0.1, U1=256, U2=256         0.04787
Adam         dr=0.1, U1=512, U2=128         0.05704
Adam         dr=0.1, U1=256, U2=128         0.05902
Adam         dr=0.2, U1=512, U2=256         0.07366

Figure 24: Visualization of the error on the test data set during the hyperparameter tuning process for the autoencoder model. U1: number of first hidden layer units; U2: number of second hidden layer units. Each connection between parameters is shown by a color and ends with the value of the mean squared error; cold colors show the lowest MSE and warm colors show the highest MSE

Figure 25: Training and validation loss plots for the best result of Table 4 for the multi-feature AE model

In chapter 3, we selected three time series as samples to follow up and illustrated them in figures (12-14). Figure 26 shows the reconstruction results of the autoencoder model for those time series. The data in figure 26 correspond to the green part (test data) of figures (12-14). The error rates for those reconstructions can be found in Table 5.

Table 5 The reconstruction error on the train/valid/test data sets

                  MSE       RMSE
    Train error   0.01694   0.13015
    Valid error   0.00471   0.06862
    Test error    0.01642   0.12814

Figure 26: Reconstruction results of the autoencoder multi-feature model for three time series samples. The x-axis shows the date in the test data and the y-axis shows the reconstructed and expected values.

4.1.2 Deep LSTM

The second model in the prediction stage was the deep LSTM from the methodology chapter. The model was trained using combinations of the parameters listed in Table 6 during the tuning process.

Table 6 Combination of different hyper-parameter values and hidden network sizes used to train the Deep LSTM model

    Dropout             0.1, 0.2
    First LSTM layer    100, 200
    Second LSTM layer   100, 200
    Third LSTM layer    100, 200

Table 7 and figure 27 display the tuning results for the five best combinations with the learning rate fixed at 2e-4. As the table shows, the deep LSTM's best result belongs to the model using the Adam optimizer with 200, 100, and 200 units in the first, second, and third hidden layers. The best dropout rate and activation function are 0.1 and linear, respectively.

Figure 27: Visualization of the error on the test data set during the hyperparameter tuning process for the Deep LSTM model. lstm-U1, lstm-U2, lstm-U3: number of first, second, and third hidden layer units. Each connection between parameters is shown by a color and ends with the value of the mean-squared error; cold colors show the lowest MSE and warm colors the highest MSE.

Table 7 Error on the test data set during the hyperparameter tuning process for the Deep LSTM model. lstm-U1, lstm-U2, lstm-U3: number of first, second, and third hidden layer units

    Optimizer   Activation   Best combination                                MSE
    Adam        linear       dr=0.1, lstm-U1=200, lstm-U2=100, lstm-U3=200   0.13568
    Adam        linear       dr=0.1, lstm-U1=200, lstm-U2=200, lstm-U3=200   0.14199
    Adam        linear       dr=0.1, lstm-U1=200, lstm-U2=100, lstm-U3=100   0.14234
    Adam        linear       dr=0.2, lstm-U1=200, lstm-U2=200, lstm-U3=200   0.14546
    Adam        linear       dr=0.1, lstm-U1=100, lstm-U2=100, lstm-U3=200   0.14845

The final chosen parameters for the Deep LSTM model are presented in Table 8. Figure 28 shows the prediction results of the Deep LSTM model for the three example signals of the specific gas turbine introduced in figures (12-14) of chapter three. The data in this figure correspond to the green part (test data) of those figures. The error rates for those predictions can be found in Table 9. Figure 29 shows the loss functions during training of the Deep LSTM model.

Table 8 Parameter settings for the Deep LSTM model. lstm-U1, lstm-U2, lstm-U3: number of first, second, and third hidden layer units

    Parameter               Value
    Batch size              64
    Epochs                  100
    Train data shape        (89586, 12, 10)
    Validation data shape   (7905, 12, 10)
    Test data shape         (7905, 12, 10)
    Best combination        dr=0.1, lstm-U1=200, lstm-U2=100, lstm-U3=200, lr=2e-4
    Optimizer               Adam
    Activation              linear
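Under the settings of Table 8, one plausible realization of the stacked LSTM is sketched below; the output dimension (taken here as 69, one predicted value per sensor) and the dropout placement are assumptions, since the table only fixes the hidden sizes, dropout rate, optimizer, activation, and learning rate.

from tensorflow.keras import layers, models, optimizers

timesteps, n_features = 12, 10   # lookback window and encoded features per step (Table 8)
output_dim = 69                  # assumed: one predicted value per sensor

model = models.Sequential([
    layers.Input(shape=(timesteps, n_features)),
    layers.LSTM(200, return_sequences=True),   # lstm-U1
    layers.Dropout(0.1),
    layers.LSTM(100, return_sequences=True),   # lstm-U2
    layers.Dropout(0.1),
    layers.LSTM(200),                          # lstm-U3
    layers.Dropout(0.1),
    layers.Dense(output_dim, activation="linear"),
])

model.compile(optimizer=optimizers.Adam(learning_rate=2e-4), loss="mse")
# model.fit(x_train, y_train, batch_size=64, epochs=100,
#           validation_data=(x_valid, y_valid))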

Table 9 The error rate of the prediction results on the train/valid/test data sets

                  MSE       RMSE
    Train error   0.01754   0.13209
    Valid error   0.00957   0.09782
    Test error    0.14848   0.38533

Figure 28: Prediction results of the Deep LSTM multi-feature model for three time series samples. The x-axis shows the date in the test data and the y-axis shows the predicted and expected values.

Figure 29: Training and validation loss plots for the Deep LSTM model

4.2 Detection Model

The last model from the methodology chapter was the detection model, which was built on the prediction model's output. Two primary techniques were used in the detection model: the aggregate error and a quantile as a fixed threshold. Figure 30(a) depicts the aggregate error on the test data with the selected candidate date sets, illustrated as red points. As an example, Table 10 shows three dates with their candidate anomaly sensors.

Table 10 Three examples of selected candidate anomaly sets of sensors

    Date                  Selected anomaly sensors
    2013-12-30 08:20:00   SNSR-65, SNSR-7, SNSR-465, SNSR-45
    2013-12-29 21:40:00   SNSR-11, SNSR-6, SNSR-25, SNSR-51
    2013-12-30 16:05:00   SNSR-466, SNSR-44, SNSR-18, SNSR-55

Figure 30(b) depicts the detection result for sensor 7, which represents the gas turbine's air temperature; the active load sensor records the performance of the gas turbine machine. From the figure, the air temperature sensor experienced anomalous activity on 30-12-2013 between 04:40:00 and 09:10:00. Furthermore, we plot a second sensor, the outlet pressure. In figure 31, we can observe that on 29-12-2013, between 21:35:00 and 21:40:00, this sensor's behavior put a strain on the gas turbine machine. One limitation of the detection model was the validation of its results: no customer dataset was available to compare the detected anomalies with the real anomaly states of the corresponding sensors.

At this stage, the only possible way was to look at the gas turbine's performance on the anomaly dates. From figure 30(b), we can observe that there was a drop on 30-12-2013 in the active load plot. Since the drop reached zero, we can say the gas turbine was shut down at that time; hence, something unusual occurred that caused such a drop in the whole system. This issue may have been caused by a single sensor or by several sensors at the same time.

Figure 30: (a): aggregate error plot. (b): anomaly events of the air temperature sensor. The first plot represents the performance of the gas turbine through the active load sensor, the second plot represents the true and predicted values of the air temperature sensor, and the last plot shows the aggregate error with the selected candidate anomaly dates as red points. The x-axis and y-axis are the date in the test data and the value, respectively.

Figure 31: Anomaly events of the outlet pressure sensor. The first plot represents the performance of the gas turbine through the active load sensor, the second plot represents the true and predicted values of the outlet pressure sensor, and the last plot shows the aggregate error with the selected candidate anomaly dates as red points. The x-axis and y-axis are the date in the test data and the value, respectively.

5 Conclusion and Future Work

5.1 Conclusion

The results of reconstructing the time series with the autoencoder are presented in figure 26. As observed, the model reconstructed the general pattern of each selected time series very well. It could also reconstruct inputs with sudden changes, as pointed out in figure 26 for the time series named air temperature with ID Sensor_7. In general, the autoencoder model can reconstruct both the general patterns and the sudden changes in almost all time series. This part performed very well at selecting vital features by reducing the extensive input data.

Furthermore, the prediction model was developed as a deep LSTM model to produce short-term predictions. It is fair to mention that the data consist of sensors with different behaviors: there were sensors with indicator behavior, whose values switched between zero and one, and there were sensors with negative values. To handle this behavior, we decided to use a standardization scaler. The prediction model deals well with the data's complexity. Additionally, the model follows the time series trend quite well and does not fail to predict sudden changes, as can be observed for sensor 467, which represents the outlet pressure of the gas turbine.

An interesting thing to notice in Table 9 is that the training error is smaller than the test prediction error. It seems that our training data are more straightforward than the test data, and it is easier to find the trends in the training data set. This could be due to how we created our train, test, and validation datasets, illustrated in figures (12-14). As those figures show, our training data come from a completely different time of the year than the test (and validation) data, which causes a temporal bias: the training data have a different distribution than the test data. In other words, our training data cover January 1st 2013 to November 7th 2013, the validation data belong to November 8th to December 5th 2013, and the test data are almost the last month of the year.

As mentioned before, we labeled the test data set with the valid abnormal data points. As shown in figures 30 and 31, the proposed detection model detects abnormal states during specific time intervals. However, this part has only been validated by comparing the results with the performance of the gas turbine machine.

5.2 Future Work

In this thesis, the proposed model for anomaly detection could still be examined with other configurations to see whether more favorable outcomes can be obtained. Future studies could experiment with different lookback window sizes and investigate their effect on the prediction results. This study used a lookback window of 12 to capture one hour of history for each sensor.

To obtain the data's encoded representation, we employed an autoencoder that reconstructs its input. One could instead train an autoencoder that predicts the next timesteps rather than reconstructing the input; this would help the encoder learn the essential features of the data and make better predictions.

To obtain the sensors involved in anomaly cases, we used a threshold approach on the output of the prediction model. The threshold is chosen as a fixed number for all sensors of a particular gas turbine. One could implement a flexible threshold instead of the fixed one, where the threshold is chosen based on the behavior of each sensor; this way, the detection model would perform better (a minimal sketch of this idea is given at the end of this section).

It is fair to mention that there is a limitation in the evaluation of the detection model: there was no company source against which to evaluate the captured results, which depict the sensors responsible for the anomaly events. However, this part of the thesis is as essential as the prediction part, and it would be a valuable effort to address the evaluation of the detection model in future studies. Doing so has two main benefits: it provides a way to evaluate the prediction results, and it gives the company useful information to reduce the expenses related to the gas turbine's lifetime.
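As a minimal sketch of the flexible-threshold idea, each sensor could receive its own quantile-based cutoff computed from its own error distribution. The variable names follow Block 3.6, and the choice of a per-sensor 99% quantile is an illustrative assumption, not a method evaluated in this thesis.

import numpy as np

# error: per-sensor squared errors with shape (n_sensors, n_timesteps), as in Block 3.6
# one threshold per sensor, taken from that sensor's own error distribution
per_sensor_threshold = np.quantile(error, 0.99, axis=1, keepdims=True)

# a sensor is flagged at a timestep only if it exceeds its own threshold
flagged = error > per_sensor_threshold              # boolean matrix (n_sensors, n_timesteps)
candidate_timesteps = np.where(flagged.any(axis=0))[0]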
