
Anomaly Detection of Power Plant Equipment Using Long Short-Term Memory Based Autoencoder Neural Network

Di Hu, Chen Zhang, Tao Yang * and Gang Chen

School of Energy and Power Engineering, Huazhong University of Science and Technology, Wuhan 430074, China; [email protected] (D.H.); [email protected] (C.Z.); [email protected] (G.C.)
* Correspondence: [email protected]

Sensors 2020, 20, 6164; doi:10.3390/s20216164

Received: 7 October 2020; Accepted: 27 October 2020; Published: 29 October 2020

Abstract: Anomaly detection is of great significance in condition-based maintenance of power plant equipment. The conventional fixed threshold detection method is not able to perform early detection of equipment abnormalities. In this study, a general anomaly detection framework based on a long short-term memory-based autoencoder (LSTM-AE) network is proposed. A normal behavior model (NBM) is established to learn the normal behavior patterns of the operating variables of the equipment in space and time. Based on the similarity analysis between the NBM output distribution and the corresponding measurement distribution, the Mahalanobis distance (MD) is used to describe the overall residual (OR) of the model. The reasonable range is obtained using kernel density estimation (KDE) with a 99% confidence interval, and the OR is monitored to detect abnormalities in real-time. An induced draft fan is chosen as a case study. Results show that the established NBM has excellent accuracy and generalizability, with average root mean square errors of 0.026 and 0.035 for the training and test data, respectively, and average mean absolute percentage errors of 0.027%. Moreover, the abnormal operation case shows that the proposed framework can be effectively used for the early detection of abnormalities.

Keywords: anomaly detection; power plant; artificial neural networks; long short-term memory based autoencoder neural networks; normal behavior model

1. Introduction

In recent years, attention has been devoted to the improvement of condition monitoring systems (CMSs) for power plants [1,2]. One of the critical tasks is anomaly detection for equipment, which can be used to reduce the high cost of unplanned downtime and has great significance in condition-based maintenance. Anomaly detection refers to the detection, in a specified dataset, of patterns that do not conform to established normal behavior models (NBMs); the patterns detected are called anomalies [3]. The CMSs store a large amount of operation data in power plants [4–6], and the use of this data represents an important field for future smart power plants (SPPs) [7,8]. Therefore, anomaly detection using data-driven approaches is a popular research topic for SPPs.

Power plant equipment can be regarded as a dynamic system with n operating variables, and NBMs describe the normal behavior patterns among these variables. The data-driven modeling approach establishes NBMs based on a historical normal operation dataset, without knowledge of the precise mechanism of the system. The residuals between the NBM outputs and the measured variables are monitored in real-time. When a residual exceeds its threshold, the system is considered to be abnormal [9]. Artificial neural network (ANN) methods were used to create NBMs as early as 20 years ago [10–12], due to their ability to model nonlinear dynamic systems. However, early computing power was limited, and

the structure of NBMs was very simple and included only one output. Following the development of hardware and machine learning methods, many anomaly detection methods based on ANNs and support vector machines (SVMs) have been proposed, e.g., the back propagation neural network (BPNN) [13–15], SVM [16,17], least squares support vector machine (LS-SVM) [18,19], and restricted Boltzmann machine (RBM) [20]. However, these methods do not consider the time sequence correlation between variables. The research of Safdarnejad et al. [21] showed that a dynamic model considering time-series data has higher prediction accuracy. The nonlinear autoregressive neural network with exogenous input (NARX) model considers the time-series relationship through time lags or a tapped delay line (TDL) [22–25]. However, the time lags increase the input dimensions, which in turn increases the risk of over-fitting. Moreover, the time lags must be set manually. Recurrent neural networks (RNNs) are widely used in time-series modeling [26–29]. However, the conventional RNN still requires the model's input variables to be manually selected according to the output variables.

Equipment operation represents a multivariable time-series that requires an NBM to have multiple outputs. In addition, there is no clear criterion for selecting the input variables according to the outputs of an NBM, and monitoring multiple variables using multiple models is time-consuming and costly. Therefore, it is necessary to find a method that can monitor multiple variables simultaneously and does not require a complicated variable selection process. The autoencoder (AE) neural network is an unsupervised model whose output is the reconstruction of the input, thus avoiding manual selection of feature variables. However, the conventional AE-based methods also do not consider the time sequence correlation between variables in power plants, as previously discussed, e.g., the AE [30,31], stacked autoencoder (SAE) [32,33], variational autoencoder (VAE) [34], and deep autoencoder Gaussian mixture model [35].

In this study, a long short-term memory-based autoencoder (LSTM-AE) neural network is proposed to establish an NBM with multiple outputs, which learns the correlation between variables in time and space. The advantages of the LSTM-AE are as follows: firstly, the autoencoder (AE) does not require the inputs to be manually selected according to the outputs; secondly, the long short-term memory (LSTM) can learn the relationships of the variables in a time-series and reduce the long-term dependence problem. A general anomaly detection framework for power plant equipment is proposed. Based on the similarity analysis between the NBM output distribution and the corresponding measurement distribution, the Mahalanobis distance (MD) is used to describe the overall residual (OR) of the model. The reasonable range of residuals is obtained using kernel density estimation (KDE) with a 99% confidence interval. Taking an induced draft fan as an example, the NBM was established with eight output variables. Hyperparameters such as the look-back time steps and the number of hidden layer units were optimized. The model accuracy is compared with previous work. Furthermore, the process of anomaly detection is demonstrated via a case study. The main contributions of this study are in two parts:
1. The proposal of an anomaly detection framework for multivariable time-series based on an LSTM-AE neural network.
2. The similarity analysis of the NBM output distribution and the corresponding measurement distribution.

The remainder of this paper is organized as follows: In Section 2, the proposed anomaly detection framework based on the LSTM-AE neural network is presented in detail. In Section 3, the case study demonstrates the established NBM, a comparative analysis, and the process of anomaly detection in a power plant. The conclusions and future work are provided in Section 4.

2. The Proposed Anomaly Detection Framework

2.1. Framework Flow Chart

Diverse pieces of equipment are utilized in a power plant. In this paper, a general anomaly detection framework for power plants is proposed. The framework utilizes a large amount of operating data relating to the equipment and adopts the LSTM-AE network to establish the NBM. The residual is calculated using the NBM outputs and the corresponding measured variables. The

residual threshold is obtained using residual statistical analysis of the test dataset. The proposed anomaly detection framework includes two phases, as shown in Figure 1.

2.1.1. Offline Training NBM

The purpose of this phase is to establish the NBM of the target equipment and obtain a reasonable range of model residuals. The specific steps are as follows:

Step 1: According to the target equipment, obtain the historical dataset of the related operating variables.

Step 2: Perform data cleaning to obtain the normal behavior dataset, based on three aspects: (1) eliminate data recorded during equipment downtime; (2) eliminate data recorded during equipment failure, according to the operation logs; (3) eliminate abnormal data based on statistical characteristics (these abnormalities originate from the sensors or stored procedures; boxplots are used in this work). Then, divide the cleaned dataset into a training dataset and a test dataset.

Step 3: Establish the NBM based on an LSTM-AE with the training dataset.

Step 4: Perform statistical analysis of the residuals with the test dataset. The MD is proposed to describe the OR of multiple variables, and the reasonable range of residuals is obtained using KDE with a 99% confidence interval.


[Figure 1: flow chart of the framework. Phase 1, offline training of the NBM based on the LSTM-AE: Historical Data → Data cleaning → NBM based on LSTM-AE → Statistical analysis of residuals. Phase 2, online anomaly detection by the NBM: Real-time Data → NBM based on LSTM-AE → Real-time residuals → Anomaly detection.]

Figure 1. Anomaly detection framework based on the long short-term memory-based autoencoder (LSTM-AE) neural network.


2.1.2. Online Anomaly Detection by NBM

The purpose of this phase is to perform real-time anomaly detection on the target equipment based on the trained NBM. The specific steps are as follows:

Step 1: Based on the trained NBM and the measured variable values, generate the residuals in real-time to reflect the status of the equipment.

Step 2: Based on the real-time OR of the model and the residual of each variable, detect abnormal patterns using the reasonable residual range obtained in phase 1.

2.2. The Normal Behavior Model

The NBM can represent the dynamic relationships among variables in a system [9]. When the system is normal, the output values of the NBM are consistent with the measured values; when the system is abnormal, the output values differ from the measured values. In this study, the LSTM-AE neural network is proposed to establish the NBM. Suppose that certain equipment has n operating variables; x_n represents the measured values at a certain moment, and x̂_n represents the reconstruction of x_n. The residuals of x_n and x̂_n are used to judge whether the system is abnormal. The threshold α is determined by KDE of the reconstruction residuals with a 99% confidence interval. The NBM is trained with a normal operation dataset of the equipment. Algorithm 1 shows the anomaly detection using the NBM.

Algorithm 1 Anomaly detection using the NBM

INPUT: normal dataset X, the measured values at a certain moment x_n, threshold α
OUTPUT: reconstruction residual ||x_n − x̂_n||

f(·) represents the NBM trained by X
if reconstruction residual > α then
    x_n is an anomaly
else
    x_n is not an anomaly
end
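For concreteness, the logic of Algorithm 1 can be written in a few lines of Python. The sketch below is a minimal illustration, assuming nbm is a placeholder callable standing in for the trained reconstruction model f(·) and that the residual is the Euclidean norm of the reconstruction error; neither name is taken from the paper.

import numpy as np

def detect_anomaly(nbm, x_n, alpha):
    # Algorithm 1: flag the measurement vector x_n as anomalous when its
    # reconstruction residual under the trained NBM exceeds the threshold alpha.
    x_hat = nbm(x_n)                        # reconstruction f(x_n) by the NBM
    residual = np.linalg.norm(x_n - x_hat)  # ||x_n - x_hat||
    return residual > alpha                 # True -> anomaly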

2.3. The LSTM-AE Neural Network

2.3.1. The AE Neural Network

The AE neural network is an unsupervised ANN with a hidden layer [36]. It has a three-layer symmetrical structure, as shown in Figure 2, including an input layer, a hidden layer (internal representation), and an output layer (reconstruction). The input layer to the hidden layer is the encoding process, and the hidden layer to the output layer is the decoding process. The goal of the AE is to reconstruct the original input as closely as possible.

[Figure 2: three-layer AE with inputs x_1 ... x_n, an internal representation, and outputs x̂_1 ... x̂_n; encoding maps the input to the hidden layer, decoding maps the hidden layer to the output.]

Figure 2. The structure of the autoencoder (AE) neural network.

The encoding process:

H = f_1(W_1 \cdot X + b_1)    (1)

The decoding process:

\hat{X} = f_2(W_2 \cdot H + b_2)    (2)

where W_1 and b_1 represent the weight and bias from the input layer to the hidden layer, respectively; W_2 and b_2 represent the weight and bias from the hidden layer to the output layer, respectively; X, H, and X̂ represent the original input, the intermediate representation, and the reconstruction of the original data, respectively; f_1(·) and f_2(·) are activation functions. Commonly used activation functions are the sigmoid, tanh, and ReLU functions [37].


The most important feature of the AE is that the decoded X̂ should be as close to the original input X as possible; that is, the residual between X̂ and X needs to be minimized. Therefore, the reconstruction error can be calculated as follows:

J(W, b) = \sum_{i=1}^{N} \sum_{j=1}^{M} \| \hat{x}_{ij} - x_{ij} \|^2    (3)

where N and M represent the dimensions of the original data and the number of samples, respectively. The goal of model training is to find the W(W_1, W_2) and b(b_1, b_2) that minimize the loss function.
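As an illustration of Equations (1)–(3), the following NumPy sketch performs one forward pass of a basic AE and evaluates its reconstruction error; the sigmoid activations are an assumed choice, and the weights are taken as given rather than trained here.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ae_forward(X, W1, b1, W2, b2):
    H = sigmoid(W1 @ X + b1)       # Equation (1): encoding
    X_hat = sigmoid(W2 @ H + b2)   # Equation (2): decoding
    return X_hat

def reconstruction_error(X_hat, X):
    # Equation (3): squared residuals summed over dimensions and samples
    return np.sum((X_hat - X) ** 2)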

2.3.2. The LSTM Unit

The RNN uses internal state (memory) to process the time-series input and capture the relationships of the input variables in sequence. The backpropagation through time (BPTT) algorithm is typically employed to train the RNN [38–40]. However, the chain derivation rule of the BPTT training algorithm results in gradient vanishing or explosion problems for long-term dependency tasks. The LSTM unit combines short-term memory with long-term memory using subtle gate control and solves the problem of gradient disappearance to a certain extent [41], as illustrated in Figure 3.

[Figure 3: an LSTM unit, with forget, input, and output gates acting on the cell state update.]

Figure 3. Architecture of a long short-term memory (LSTM) unit.

The pre-built memory unit consists of three gate structures: an input gate, a forget gate, and an output gate. The input and output gates control input and output activation to the memory cell, respectively, whereas the forget gate updates the state of the cell. These gate structures constitute a self-connected constant error carousel (CEC), which allows constant error flow to pass through the internal state and thereby mitigates the problem of gradient disappearance. The memory cells are updated and the output is produced by the following equations [41]:

f_t = \mathrm{sigmoid}(W_{xf} \cdot x_t + W_{hf} \cdot h_{t-1} + b_f)    (4)

i_t = \mathrm{sigmoid}(W_{xi} \cdot x_t + W_{hi} \cdot h_{t-1} + b_i)    (5)

\breve{C}_t = \tanh(W_{xa} \cdot x_t + W_{ha} \cdot h_{t-1} + b_a)    (6)

o_t = \mathrm{sigmoid}(W_{xo} \cdot x_t + W_{ho} \cdot h_{t-1} + b_o)    (7)

C_t = f_t \otimes C_{t-1} + i_t \otimes \breve{C}_t    (8)

h_t = o_t \otimes \tanh(C_t)    (9)

where h_{t-1} and C_{t-1} are the output and cell state at the previous moment, respectively; x_t represents the current input; f represents the forget gate; f_t represents the forgetting control signal for the cell state at the previous moment; f_t ⊗ C_{t-1} represents the information retained from the previous moment; i represents the input gate; C̆_t represents the candidate cell state at the current moment; i_t represents the control signal for C̆_t; o represents the output gate; h_t represents the final output; o_t represents the output control signal; W_x represents the input-hidden layer connection matrix; W_h represents the hidden layer recurrent connection matrix; and ⊗ represents elementwise multiplication.
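The gate equations (4)–(9) map directly onto code. The following NumPy sketch of a single LSTM time step is included for clarity; the dictionary-based weight layout is an illustrative assumption rather than part of the paper.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W holds the input-hidden (W["x*"]) and recurrent (W["h*"]) matrices; b holds biases
    f_t = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev + b["f"])      # Eq. (4): forget gate
    i_t = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + b["i"])      # Eq. (5): input gate
    c_cand = np.tanh(W["xa"] @ x_t + W["ha"] @ h_prev + b["a"])   # Eq. (6): candidate state
    o_t = sigmoid(W["xo"] @ x_t + W["ho"] @ h_prev + b["o"])      # Eq. (7): output gate
    c_t = f_t * c_prev + i_t * c_cand                             # Eq. (8): cell state update
    h_t = o_t * np.tanh(c_t)                                      # Eq. (9): output
    return h_t, c_t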


2.3.3. The LSTM-AE Neural Network

Anomaly detection of power plant equipment requires simultaneous monitoring of multiple variables. An AE-based NBM can reconstruct multiple variables at the same time without manually selecting input variables. These variables in power plant equipment are not only spatially correlated, but also have a strong temporal correlation. The conventional AE neural network cannot mine the time correlation of the input layer variables.

For this purpose, the LSTM-AE neural network is proposed, with the hidden layer units replaced by LSTM units. The LSTM-AE neural network can not only learn the correlation between input variables, but also learn the correlation in the time series. At the same time, the LSTM unit can avoid the problem of long-term memory dependence. The structure of the LSTM-AE is depicted in Figure 4.


Figure 4. Architecture of the LSTM-AE.

In Figure 4, X is the original input; X̂ is the output, i.e., the reconstruction of X; I is the encoding result; k represents the look-back time steps in the LSTM unit; h_e and C_e are the output and cell state, respectively, of the LSTM unit in the encoding process; and h_d and C_d are the output and cell state, respectively, of the LSTM unit in the decoding process.

3. Case Study

3.1. Data Preparation

In this study, an induced draft fan is considered as an example, with a historical dataset of the related temperature variables obtained from the CMS. The obtained dataset covers 1 May 2019 to 1 May 2020, with a 1 min interval. Eight temperature variables are contained in the dataset, as follows:

Lubricating oil temperature: abbreviated as LOT. Lubricating oil is used to cool the bearings in the induced draft fan.

Outlet temperature of electrostatic precipitator A1: abbreviated as POT_A1; indicates the temperature of the flue gas from electrostatic precipitator A1 into the induced draft fan.

Outlet temperature of electrostatic precipitator A2: abbreviated as POT_A2; indicates the temperature of the flue gas from electrostatic precipitator A2 into the induced draft fan.

Bearing temperature at the non-drive end of the motor: abbreviated as MNDT; indicates the health of the bearing.

Bearing temperature at the drive end of the motor: abbreviated as MDT; indicates the health of the bearing.

Main bearing temperature 1: abbreviated as MBT_1; reflects the health status of the bearing.

Main bearing temperature 2: abbreviated as MBT_2; reflects the health status of the bearing.

Main bearing temperature 3: abbreviated as MBT_3; reflects the health status of the bearing.

The correlation coefficient matrix indicates the correlation between the variables; as
shown in Figure 5, there is a clear correlation between the variables.



Figure 5. The correlation coefficient between variables.

3.2. Data Cleaning

In principle, the normal behavior dataset is required to establish an NBM. It is therefore necessary to clean the acquired historical dataset. Firstly, the dataset is cleaned according to the equipment operating signal. Furthermore, the data during the failure period is eliminated according to the operation log. In this study, the boxplot method is used to eliminate abnormal data caused by sensors and the stored procedure from the perspective of statistics. A boxplot is a standardized way of displaying the dataset based on a five-number summary: the minimum, the maximum, the sample median, and the first and third quartiles. When the data exceeds the minimum and maximum range, it is considered an outlier [42]. The principle of outlier detection using a boxplot is shown in Figure 6. The data cleaning results in this study are shown in Figure 7, and the points in red circles are considered outliers that will be eliminated.
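As a sketch of the statistical part of this cleaning step, the boxplot rule of Figure 6 can be applied columnwise with pandas. Treating a row as abnormal when any variable falls outside its whiskers is an illustrative assumption, not a rule stated in the paper.

import pandas as pd

def remove_boxplot_outliers(df: pd.DataFrame) -> pd.DataFrame:
    # Keep rows where every variable lies within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    q1, q3 = df.quantile(0.25), df.quantile(0.75)
    iqr = q3 - q1
    inside = (df >= q1 - 1.5 * iqr) & (df <= q3 + 1.5 * iqr)
    return df[inside.all(axis=1)]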

[Figure 6: a boxplot with whiskers at the minimum (Q1 − 1.5·IQR) and maximum (Q3 + 1.5·IQR), where IQR = Q3 − Q1 is the interquartile range between the first quartile Q1 and the third quartile Q3; points beyond the whiskers are outliers.]

Figure 6. Boxplot outlier detection principle.

3.3. NBM Based on LSTM-AE

In this study, the NBM of an induced draft fan was established using an LSTM-AE neural network with eight temperature variables as model inputs. After data cleaning, 266,538 observations were obtained in the sample. Fifty percent of the sample comprised the training dataset, and the remaining 50% comprised the test dataset. Min-max normalization was adopted, and the range of the variables was scaled to [0, 1]. For the evaluation of the developed NBM model of the induced draft fan,

two indices, the root mean square error (RMSE) and the mean absolute percentage error (MAPE), were used. The definitions of the two indices are given in the following equations:

\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2}    (10)

\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{\hat{y}_i - y_i}{y_i} \right| \times 100\%    (11)

where y_i is the measured value, ŷ_i is the reconstructed value, and n is the number of variables.

The look-back time step and the number of hidden layer units are important hyperparameters in the LSTM-AE and determine the structure of the NBM. Many different combinations were tested in this work; e.g., "4–6" means that the look-back time step is 4 and the number of hidden layer units is 6. When the look-back time step is 0, the structure is a conventional AE. The performances of the different combinations, as measured by the two evaluation indices, are shown in Figure 8. The comparison between the LSTM-AE and the AE is shown in Table 1, and indicates that the LSTM-AE is significantly better.
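The paper does not name its implementation framework, so the following Keras sketch is only one plausible realization of the NBM: a sequence-to-sequence LSTM autoencoder using the selected "8–12" combination, with a RepeatVector bridge and Adam/MSE training, all of which are assumed choices rather than details from the paper.

import numpy as np
from tensorflow.keras import layers, models

k, n_vars, n_hidden = 8, 8, 12  # look-back steps, variables, hidden units ("8-12")

def make_windows(data, k):
    # data: (num_samples, n_vars), min-max normalized; returns (num_windows, k, n_vars)
    return np.stack([data[i:i + k] for i in range(len(data) - k + 1)])

model = models.Sequential([
    layers.LSTM(n_hidden, input_shape=(k, n_vars)),   # encoder -> internal representation
    layers.RepeatVector(k),                           # repeat the code for each decoding step
    layers.LSTM(n_hidden, return_sequences=True),     # decoder
    layers.TimeDistributed(layers.Dense(n_vars)),     # reconstruct all variables per step
])
model.compile(optimizer="adam", loss="mse")           # Equation (3)-style reconstruction loss
# windows = make_windows(train_data, k); model.fit(windows, windows, ...)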

Figure 7. Data cleaning by boxplot.

It can be clearly seen from Figure 8 that when the number of hidden layer units is small, the model performance is more sensitive to the number of hidden layer units than to the look-back time step; however, a small look-back time step can also effectively improve the model performance. Therefore, the time relationship between variables is meaningful for improving model performance. However, when the number of hidden layer units is very small, an overly large look-back time step will lead to the degradation of model performance. When the look-back time step exceeds 0 and the number of hidden layer units is greater than six, the RMSE and MAPE of the training and test datasets are highly similar and very low, which indicates that the accuracy and generalization of the model are very high. However, increasing the look-back time steps and the number of hidden layer units leads to an increase in the required model training resources. In this study, while ensuring the accuracy of the model, a small-scale model structure should be selected as far as possible; thus, the combination "8–12" was selected as the final model structure.

Comparative Analysis

In previous research, the principal component analysis-nonlinear autoregressive neural network with exogenous input (PCA-NARX) method was proposed to establish the NBM [25]. PCA was used to


reduce redundancy and retain the useful information of the dataset. The NARX is a nonlinear autoregressive model which has exogenous inputs and can be stated algebraically as:

y_t = F(y_{t-1}, y_{t-2}, \ldots, y_{t-n_y}, u_{t-1}, u_{t-2}, \ldots, u_{t-n_u}) + \varepsilon_t    (12)

where y is the variable of interest; u is the exogenous variable; n_y is the TDL (or time lags) of y, which indicates that previous values help to predict y; n_u is the TDL of u; ε is the error term; and F(·) is the neural network. In this model, information about u helps predict y, as do previous values of y itself.

Table 1. The comparison between the LSTM-AE and AE.

          RMSE on Training Dataset   RMSE on Test Dataset   MAPE on Training Dataset   MAPE on Test Dataset
AE                  0.111                    0.191                    0.172                     0.172
LSTM-AE             0.026                    0.035                    0.027                     0.027

Figure 8. The performance of the LSTM-AE model for different combinations.

In the PCA-NARX method, the exogenous variable is the PCA result of the object variable. In this study, an NBM based on PCA-NARX was established. The TDLs of the object and exogenous variables were hyperparameters, and numerous combinations were trained. The performance of PCA-NARX for different combinations is shown in Figure 9. The comparison of the best performance between the LSTM-AE and PCA-NARX is shown in Table 2. The PCA-NARX method considers the


time sequence correlation between variables, so it is better than the conventional AE method. However, the LSTM-AE is better than PCA-NARX. From the perspective of model principles, PCA-NARX is a predictive model, whereas LSTM-AE is a reconstruction model.

Figure 9. The performance of the principal component analysis-nonlinear autoregressive neural network with exogenous input (PCA-NARX) model for different combinations.

Table 2. The comparison between LSTM-AE and PCA-NARX.

          RMSE on Training Dataset   RMSE on Test Dataset   MAPE on Training Dataset   MAPE on Test Dataset
LSTM-AE             0.026                    0.035                    0.027                     0.027
PCA-NARX            0.044                    0.044                    0.032                     0.032

3.4. Statistical Analysis on the Residuals

In this study, a total of 133,405 observations were obtained from the corresponding reconstruction by the trained NBM. The MD was proposed to describe the overall residual (OR) of the NBM. The MD provides a univariate distance between multi-dimensional samples that obey the same distribution, with consideration of the correlation between variables and the dimensional scale, and has been successfully applied to capture different types of outliers [43]. The MD can be calculated as follows:

MD_{ij} = \sqrt{(X_i - X_j)^T C^{-1} (X_i - X_j)}    (13)

where X_i and X_j are different samples from the same distribution, and C is the covariance matrix between the variables of the distribution.

The distribution similarity between the measured dataset and the reconstructed dataset was used in this study. Kullback–Leibler (KL) divergence [44] was used to describe the distribution difference between the measurement dataset and the reconstructed dataset. KL divergence is an asymmetric and

non-negative measure of the difference between two probability distributions. A smaller KL divergence demonstrates a greater similarity between two distributions. KL divergence can be calculated as follows:

D_{KL}(P \| Q) = \sum_{i=1}^{N} p(x_i) \log \frac{p(x_i)}{q(x_i)}    (14)

where P and Q are two distributions; p(x_i) and q(x_i) represent the probability of the two distributions for sample x_i, respectively; and N is the number of samples.

Because the dataset has eight dimensions and each dimension takes 100 points uniformly, a total of 10^16 sample points exists, which causes calculation difficulties. In this work, 2000 observations were randomly selected each time, and 1000 experiments were performed. The KL divergence of the measurement dataset and the reconstructed dataset over the 1000 experiments is shown in Figure 10. The calculated KL divergences are very small and consistent. The red line indicates the trend of the average KL divergence as the number of experiments increases. The final KL divergence was calculated to be 0.0049 using the law of large numbers. The experimental results show that the distributions of the measurement dataset and the reconstructed dataset are approximately equal; therefore, it can be considered that the reconstructed dataset and the measurement dataset belong to the same distribution.
The MD was applied to describe the OR of the model, and the calculation method is as follows:

MD_{Residual}^{i} = \sqrt{(Y_i - \hat{Y}_i)^T C^{-1} (Y_i - \hat{Y}_i)}    (15)

where Y_i and Ŷ_i represent the measured values and reconstructed values of the operating variables at time i, respectively, and C is the covariance matrix between the operating variables.
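Equation (15) translates directly into NumPy; estimating the covariance matrix C from the measured operating variables is an assumption consistent with the definition above.

import numpy as np

def overall_residual(Y, Y_hat):
    # Equation (15): Mahalanobis distance between measured (Y) and reconstructed
    # (Y_hat) vectors at every time step; both arrays have shape (T, n_vars).
    C_inv = np.linalg.inv(np.cov(Y, rowvar=False))  # inverse covariance of the variables
    diff = Y - Y_hat
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, C_inv, diff))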

Figure 10. Kullback–Leibler (KL) divergence of the measured dataset and the reconstructed dataset in tests.

The residuals of each variable and of the overall model were subjected to statistical analysis. The KDE method was used to simulate the probability density function (PDF) of the residual distributions, in order to obtain a reasonable range of residuals. The KDE uses a smooth peak function to fit the observed data points to simulate the true probability distribution curve. Let x_1, x_2, ..., x_n be an independent and identically distributed data sample; then, the kernel density estimate of its PDF is:

\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)    (16)

where h is a smoothing parameter called the bandwidth. The larger the bandwidth, the smaller the proportion of the observed data points in the final curve shape, and the flatter the overall curve; the smaller the bandwidth, the greater the proportion of the observed data points in the final curve shape, and the steeper the overall curve. K(·) is the kernel function; commonly used kernel functions are the triangle, rectangle, Epanechnikov, and Gaussian functions.

In this study, the kernel function was a Gaussian function, and an adaptive bandwidth method [45] was adopted. The residual distributions of the overall NBM and of each variable are shown in Figures 11 and 12, respectively. F(·) represents the cumulative distribution function (CDF), the integral of the probability density function. The lower and upper thresholds of the reasonable residual range are recorded as R_lower and R_upper, respectively. In this work, the reasonable residual range was obtained based on the 99% confidence interval; this means F(R_lower) = 0.01 and F(R_upper) = 0.99. The reasonable residual ranges of the overall NBM and of each variable are shown in Table 3. For the overall residual of the model, only the upper limit is used.
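The threshold extraction can be sketched as follows. Note that scipy's gaussian_kde applies a global bandwidth (Scott's rule) rather than the adaptive bandwidth of [45], so this is only an approximation of the procedure described above.

import numpy as np
from scipy.stats import gaussian_kde

def residual_range(residuals, lower=0.01, upper=0.99):
    # Fit the residual PDF by Gaussian KDE (Equation (16)) and return the
    # reasonable range [R_lower, R_upper] from the numerical CDF F(.)
    kde = gaussian_kde(residuals)
    grid = np.linspace(residuals.min(), residuals.max(), 2000)
    cdf = np.cumsum(kde(grid))
    cdf /= cdf[-1]                          # normalize so F spans [0, 1]
    return grid[np.searchsorted(cdf, lower)], grid[np.searchsorted(cdf, upper)]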

Figure 11. OR distribution of the normal behavior model (NBM).


Figure 12. Residual distribution of each variable.

Table 3. Reasonable residual ranges of the overall NBM and each variable.

            OR       LOT      POT_A1   POT_A2   MNDT     MDT      MBT_1    MBT_2    MBT_3
R_lower     ---     −0.048   −0.228   −0.315   −0.041   −0.027   −0.063   −0.029   −0.037
R_upper     0.325    0.084    0.341    0.374    0.037    0.066    0.042    0.095    0.064

3.5. Abnormality Detection

Based on the trained NBM of the induced draft fan, as described in the previous section, this work demonstrates the proposed abnormality detection methods using two cases: normal operation and abnormal operation.

3.5.1. Normal Operation Case

A two-day operation dataset was collected from 1 May to 2 May 2020 with an interval of 1 min to monitor abnormalities in real-time. The real-time monitoring processes of the OR and of each variable are shown in Figures 13 and 14, respectively. The measured value obtained from the CMS and the reconstructed value given by the NBM are monitored in real-time. The reasonable residual ranges of each variable are shown in Table 3. The time lag and noise in the actual operation cause a small amount of discontinuity outside the allowable residual range in the real-time monitoring process. Therefore, in the actual monitoring process, an average sliding window, with size = 30, is used to process the monitored residuals. It can be seen from Figure 13 that the OR of the model is within the reasonable range during this period, indicating that the induced draft fan operates normally, which is consistent with this case. Figure 14 shows that the residual of each variable does not exceed the reasonable range, and the reconstructed values are very close to the measured values. This indicates that the trained NBM has good generalization and accuracy.
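The smoothing step described above can be written in one line with pandas; a trailing (backward-looking) mean is an assumption here, since the paper does not specify the window alignment.

import pandas as pd

def smooth_residuals(residuals, window=30):
    # Average sliding window (size = 30) over the monitored residual series
    return pd.Series(residuals).rolling(window, min_periods=1).mean().to_numpy()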

Figure 13. Real-time monitoring of the OR in a normal case.

Figure 14. Real-time monitoring of variables in a normal case: (a) LOT; (b) POT_A1; (c) POT_A2; (d) MNDT; (e) MDT; (f) MBT_1; (g) MBT_2; (h) MBT_3.

3.5.2. Abnormal Operation Case

In this study, the abnormal dataset was constructed manually for an early-warning demonstration, according to the correlation of the variables shown in Figure 5, because no real failure case of the induced draft fan was available. The correlation coefficients among LOT, MNDT, MDT, MBT_1, MBT_2, and MBT_3 are relatively high, so failure data generated among them would carry high uncertainty. The abnormal dataset was therefore generated using POT_A1, with the other variables left unchanged. Based on the normal operation case, starting at 12:00 on May 1, a linear cumulative drift with a coefficient of 0.02 °C per 1-min sample was added to POT_A1, as shown in Figure 15. The real-time monitoring process of the OR is shown in Figure 16: the OR exceeds the upper threshold at 19:50, when the cumulative drift of POT_A1 is 9.4 °C. The real-time monitoring of POT_A1 is shown in Figure 17: the residual of POT_A1 exceeds the upper threshold at 16:15, when the cumulative drift of POT_A1 is 5.1 °C. Under the conventional fixed threshold, an alarm would not be raised until POT_A1 reached 180 °C. In this case, the OR exceeds its threshold when POT_A1 is 132.23 °C, and the residual of POT_A1 exceeds its threshold when POT_A1 is 127.19 °C. This shows that the anomaly detection method proposed in this study can effectively detect early anomalies of equipment. Using the MD to characterize the overall residual of the model can reduce the false alarm rate compared with using a single-variable threshold. Therefore, in the monitoring process, the OR is used to detect the abnormal state of the equipment, and the single-variable thresholds are used to analyze the cause of the abnormality.

Figure 15. Constructed POT_A1 with linear cumulative drift.
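Since no real failure data were available, the drifted signal can be reproduced directly from this description. The sketch below is a hypothetical pandas reconstruction, with a constant placeholder standing in for the real POT_A1 measurements; it injects the 0.02 °C-per-minute cumulative drift from 12:00 on May 1 and checks the 9.4 °C drift at 19:50 reported above.

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the abnormal-case input: from 12:00 on
# May 1 each 1-min sample of POT_A1 gains an extra 0.02 °C of cumulative
# drift; all other variables keep their measured values.
idx = pd.date_range("2020-05-01 00:00", "2020-05-02 23:59", freq="1min")
pot_a1 = pd.Series(120.0, index=idx)  # placeholder for the real POT_A1 series

onset = pd.Timestamp("2020-05-01 12:00")
minutes_after_onset = ((idx - onset) / pd.Timedelta("1min")).to_numpy()
drift = 0.02 * np.clip(minutes_after_onset, 0, None)  # no drift before onset
pot_a1_drifted = pot_a1 + drift

# Consistency check with the paper: 470 min after onset (at 19:50) the
# cumulative drift is 470 * 0.02 = 9.4 °C, the point where the OR alarms.
print(pot_a1_drifted["2020-05-01 19:50"] - pot_a1["2020-05-01 19:50"])  # 9.4
```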

Figure 16. Real-time monitoring of the OR in an abnormal case.

Figure 17. Real-time monitoring of POT_A1 in an abnormal case.


4. Conclusions and Future Work

A general anomaly detection framework based on LSTM-AE neural networks is proposed for the early detection of abnormalities in power plants. The LSTM-AE-based NBM considers the spatial and temporal correlations of the related operating variables and enables the simultaneous monitoring of multiple variables. In the case study, hyperparameters of the LSTM-AE model, including the number of hidden layer units and the look-back time step, were carefully optimized to obtain a more accurate reconstruction and better generalization. A comparative analysis between the LSTM-AE and PCA-NARX models shows that the LSTM-AE model performs better: the average RMSEs on the training and test datasets are 0.026 and 0.035 °C, respectively, and the corresponding average MAPEs are within 0.027%. Based on the similarity analysis between the NBM output value distribution and the corresponding measured value distribution, the MD is used to describe the overall residual of the model. The reasonable residual ranges of the overall model and of each variable were obtained using KDE with a 99% confidence interval. The abnormal operation case shows that the proposed framework detected the abnormality when the variable value was 132.23 °C, whereas the conventional fixed threshold is 180 °C. The residual of the overall model is used to detect the abnormal state, and the residual of a single variable is used to analyze the cause of the abnormality. In summary, the proposed framework can be used for the early detection of abnormalities in real-time, which is of great significance to condition-based maintenance in power plants.

In future work, an abnormal dataset will be needed to further verify this framework, especially for cases in which multiple variables are abnormal simultaneously. Moreover, the accurate extraction of the normal behavior operation dataset also requires further research.

Author Contributions: Conceptualization, T.Y.; Formal analysis, C.Z.; Investigation, D.H.; Methodology, D.H.; Project administration, G.C.; Resources, T.Y. and G.C.; Software, D.H.; Validation, C.Z.; Writing—original draft, D.H.; Writing—review & editing, T.Y. All authors have read and agreed to the published version of the manuscript. Funding: This research received no external funding. Conflicts of Interest: The authors declare no conflict of interest.

References

1. Kim, H.; Na, M.G.; Heo, G. Application of monitoring, diagnosis, and prognosis in thermal performance analysis for nuclear power plants. Nucl. Eng. Technol. 2014, 46, 737–752. [CrossRef]
2. Fast, M. Artificial Neural Networks for Gas Turbine Monitoring; Division of Thermal Power Engineering, Department of Energy Sciences, Faculty of Engineering, Lund University: Lund, Sweden, 2010; ISBN 978-91-7473-035-7.
3. Wu, S.X.; Banzhaf, W. The use of computational intelligence in intrusion detection systems: A review. Appl. Soft Comput. 2010, 10, 1–35. [CrossRef]
4. Gómez, C.Q.; Villegas, M.A.; García, F.P.; Pedregal, D.J. Big Data and Web Intelligence for Condition Monitoring: A Case Study on Wind Turbines. In Big Data: Concepts, Methodologies, Tools, and Applications; IGI Global: Hershey, PA, USA, 2016; pp. 1295–1308. [CrossRef]
5. Dao, P.B.; Staszewski, W.J.; Barszcz, T.; Uhl, T. Condition monitoring and fault detection in wind turbines based on cointegration analysis of SCADA data. Renew. Energy 2018, 116, 107–122. [CrossRef]
6. Fast, M.; Palme, T. Application of artificial neural networks to the condition monitoring and diagnosis of a combined heat and power plant. Energy 2010, 35, 1114–1120. [CrossRef]
7. Liu, J.Z.; Wang, Q.H.; Fang, F. Data-driven-based application architecture and technologies of smart power generation. Proc. CSEE 2019, 39, 3578–3587. [CrossRef]
8. Moleda, M.; Mrozek, D. Big Data in Power Generation. In International Conference: Beyond Databases, Architectures and Structures; Springer: Cham, Switzerland, 2019; pp. 15–29. [CrossRef]
9. Garcia, M.C.; Sanz-Bobi, M.A.; Del Pico, J. SIMAP: Intelligent System for Predictive Maintenance. Comput. Ind. 2006, 57, 552–568. [CrossRef]

10. Muñoz, A.; Sanz-Bobi, M.A. An incipient fault detection system based on the probabilistic radial basis function network: Application to the diagnosis of the condenser of a coal power plant. Neurocomputing 1998, 23, 177–194. [CrossRef]
11. Sanz-Bobi, M.A.; Toribio, M.A.D. Diagnosis of Electrical Motors Using Artificial Neural Networks; IEEE International SDEMPED: Gijón, Spain, 1999; pp. 369–374.
12. Haykin, S. Neural Networks: A Comprehensive Foundation, 2nd ed.; Prentice Hall: Upper Saddle River, NJ, USA, 1999.
13. Lu, S.; Hogg, B.W. Dynamic nonlinear modelling of power plant by physical principles and neural networks. Int. J. Electr. Power Energy Syst. 2000, 22, 67–78. [CrossRef]
14. Sun, Y.; Gao, J.; Zhang, H.; Peng, D. The application of BPNN based on improved PSO in main steam temperature control of supercritical unit. In Proceedings of the 2016 22nd International Conference on Automation and Computing (ICAC), Colchester, UK, 7–8 September 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 188–192. [CrossRef]
15. Huang, C.; Li, J.; Yin, Y.; Zhang, J.; Hou, G. State monitoring of induced draft fan in thermal power plant by gravitational searching algorithm optimized BP neural network. In Proceedings of the 2017 Chinese Automation Congress (CAC), Jinan, China, 20–22 October 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 4616–4621. [CrossRef]
16. Tan, P.; Zhang, C.; Xia, J.; Fang, Q.; Chen, G. NOx Emission Model for Coal-Fired Boilers Using Principle Component Analysis and Support Vector Regression. J. Chem. Eng. Jpn. 2016, 49, 211–216. [CrossRef]
17. Zhang, L.; Zhou, L.; Zhang, Y.; Wang, K.; Zhang, Y.; Zhijun, E.; Gan, Z.; Wang, Z.; Qu, B.; Li, G. Coal consumption prediction based on least squares support vector machine. IOP Conf. Ser. Earth Environ. Sci. 2019, 227, 032007. [CrossRef]
18. Guanglong, W.; Meng, L.; Wenjie, Z. The LS-SVM modeling of power station boiler NOx emission based on genetic algorithm. Autom. Instrum. 2016, 2, 26. [CrossRef]
19. Wang, C.; Liu, Y.; Zheng, S.; Jiang, A. Optimizing combustion of coal fired boilers for reducing NOx emission using Gaussian Process. Energy 2018, 153, 149–158. [CrossRef]
20. Hu, D.; Chen, G.; Yang, T.; Zhang, C.; Wang, Z.; Chen, Q.; Li, B. An Artificial Neural Network Model for Monitoring Real-Time Variables and Detecting Early Warnings in Induced Draft Fan. In Proceedings of the ASME 2018 13th International Manufacturing Science and Engineering Conference, College Station, TX, USA, 18–22 June 2018; American Society of Mechanical Engineers Digital Collection. [CrossRef]
21. Safdarnejad, S.M.; Tuttle, J.F.; Powell, K.M. Dynamic modeling and optimization of a coal-fired utility boiler to forecast and minimize NOx and CO emissions simultaneously. Comput. Chem. Eng. 2019, 124, 62–79. [CrossRef]
22. Bangalore, P.; Tjernberg, L.B. An Artificial Neural Network Approach for Early Fault Detection of Gearbox Bearings. IEEE Trans. Smart Grid 2015, 6, 980–987. [CrossRef]
23. Asgari, H.; Chen, X.; Morini, M.; Pinelli, M.; Sainudiin, R.; Spina, P.R.; Venturini, M. NARX models for simulation of the start-up operation of a single-shaft gas turbine. Appl. Therm. Eng. 2016, 93, 368–376. [CrossRef]
24. Lee, W.J.; Na, J.; Kim, K.; Lee, C.J.; Lee, Y.; Lee, J.M. NARX modeling for real-time optimization of air and gas compression systems in chemical processes. Comput. Chem. Eng. 2018, 115, 262–274. [CrossRef]
25. Hu, D.; Guo, S.; Chen, G.; Zhang, C.; Lv, D.; Li, B.; Chen, Q. Induced Draft Fan Early Anomaly Identification Based on SIS Data Using Normal Behavior Model in Thermal Power Plant. In Proceedings of the ASME Power Conference, Salt Lake City, UT, USA, 15–18 July 2019; American Society of Mechanical Engineers: New York, NY, USA, 2019; Volume 59100, p. V001T08A002. [CrossRef]
26. Tan, P.; He, B.; Zhang, C.; Rao, D.; Li, S.; Fang, Q.; Chen, G. Dynamic modeling of NOx emission in a 660 MW coal-fired boiler with long short-term memory. Energy 2019, 176, 429–436. [CrossRef]
27. Laubscher, R. Time-series forecasting of coal-fired power plant reheater metal temperatures using encoder-decoder recurrent neural networks. Energy 2019, 189, 116187. [CrossRef]
28. Yang, G.; Wang, Y.; Li, X. Prediction of the NOx emissions from thermal power plant using long-short term memory neural network. Energy 2020, 192, 116597. [CrossRef]
29. Pan, H.; Su, T.; Huang, X.; Wang, Z. LSTM-based soft sensor design for oxygen content of flue gas in coal-fired power plant. Trans. Inst. Meas. Control 2020. [CrossRef]
30. Sakurada, M.; Yairi, T. Anomaly detection using autoencoders with nonlinear dimensionality reduction. In Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis, Gold Coast, QLD, Australia, 2 December 2014; pp. 4–11. [CrossRef]

31. Lu, K.; Gao, S.; Sun, W.; Jiang, Z.; Meng, X.; Zhai, Y.; Han, Y.; Sun, M. Auto-encoder based fault early warning model for primary fan of power plant. In IOP Conference Series: Earth and Environmental Science; IOP Publishing: Bristol, UK, 2019; Volume 358, p. 042060. [CrossRef]
32. Roy, M.; Bose, S.K.; Kar, B. A stacked autoencoder neural network based automated method for anomaly detection in on-line condition monitoring. In Proceedings of the 2018 IEEE Symposium Series on Computational Intelligence (SSCI), Bangalore, India, 18–21 November 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1501–1507. [CrossRef]
33. Li, Y.; Hong, F.; Tian, L.; Liu, J.; Chen, J. Early Warning of Critical Blockage in Coal Mills based on Stacked Denoising Autoencoders. IEEE Access 2020. [CrossRef]
34. An, J.; Cho, S. Variational autoencoder based anomaly detection using reconstruction probability. Spec. Lect. IE 2015, 2, 1–18.
35. Zong, B.; Song, Q.; Min, M.R.; Cheng, W.; Lumezanu, C.; Cho, D.; Chen, H. Deep autoencoding Gaussian mixture model for unsupervised anomaly detection. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
36. Ng, A. Sparse autoencoder. CS294A Lect. Notes 2011, 72, 1–19.
37. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
38. Williams, R.J.; Zipser, D. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Backpropagation: Theory, Architectures, and Applications; Psychology Press: Hove, UK, 1995; Volume 433.
39. Werbos, P.J. Backpropagation through time: What it does and how to do it. Proc. IEEE 1990, 78, 1550–1560. [CrossRef]
40. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [CrossRef] [PubMed]
41. Zaremba, W.; Sutskever, I.; Vinyals, O. Recurrent neural network regularization. arXiv 2014, arXiv:1409.2329.
42. Benjamini, Y. Opening the box of a boxplot. Am. Stat. 1988, 42, 257–262. [CrossRef]
43. Wang, Y.; Miao, Q.; Ma, E.W.; Tsui, K.L.; Pecht, M.G. Online anomaly detection for hard disk drives based on mahalanobis distance. IEEE Trans. Reliab. 2013, 62, 136–145. [CrossRef]
44. Kullback, S.; Leibler, R.A. On Information and Sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [CrossRef]
45. Shimazaki, H.; Shinomoto, S. Kernel bandwidth optimization in spike rate estimation. J. Comput. Neurosci. 2010, 29, 171–182. [CrossRef] [PubMed]

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).