
Article

An Autoencoder Gated Recurrent Unit for Remaining Useful Life Prediction

Yi-Wei Lu 1, Chia-Yu Hsu 1,* and Kuang-Chieh Huang 2

1 Department of Industrial Engineering and Management, National Taipei University of Technology, Taipei 10608, Taiwan; [email protected]
2 Department of Information Management, Yuan Ze University, Taoyuan 32003, Taiwan; [email protected]
* Correspondence: [email protected]; Tel.: +886-2-27712171

 Received: 6 August 2020; Accepted: 11 September 2020; Published: 15 September 2020 

Abstract: With the development of smart manufacturing, large numbers of sensors are used to record the variables associated with production equipment in order to detect abnormal conditions. This study focuses on the prediction of Remaining Useful Life (RUL). RUL prediction is part of predictive maintenance, which uses the degradation trend of a machine to predict when it will malfunction. Accurate RUL prediction not only reduces the consumption of manpower and materials, but also reduces the need for future maintenance. This study aims to detect faults as early as possible, before the machine needs to be replaced or repaired, to ensure the reliability of the system. Because it is difficult to extract meaningful features from sensor data directly, this study proposes a model based on an Autoencoder Gated Recurrent Unit (AE-GRU), in which the Autoencoder (AE) extracts the important features from the raw data and the Gated Recurrent Unit (GRU) selects the information from the sequences to forecast the RUL. To evaluate the performance of the proposed AE-GRU model, an aircraft turbofan engine degradation simulation dataset provided by NASA was used and different recurrent neural networks were compared. The results demonstrate that the AE-GRU outperforms other recurrent neural networks, such as Long Short-Term Memory (LSTM) and the GRU alone.

Keywords: remaining useful life; predictive maintenance; autoencoder; gated recurrent unit

1. Introduction

Industry 4.0 depends on automated, smart factories and employs sensor data and big data analysis methods, in the expectation that the equipment in the factory will operate automatically and self-correct to improve product yield [1]. Sensor-related data, such as temperature, pressure, power, humidity, and chemical analysis, are collected during manufacturing processes for equipment monitoring [2,3]. These temporal patterns represent the equipment condition, and poor product quality is often associated with abnormal environmental changes or inappropriate operation settings [4]. Equipment condition can be recorded by sensors and kept as time series data; when analyzing equipment sensor data, not only the large amount of data recorded but also its time series characteristics should be considered. The idea of a smart factory is to link and intellectualize the manufacturing process. In the past, automation merely used machines to improve production efficiency and yield and to reduce costs, but intelligence further applies technologies such as IoT sensors to monitor and control production lines. The combination of cloud computing, data analysis, and software and hardware integration is also an important part of intelligence, and machines communicate with automation equipment [5]. However, there are numerous parameters in the smart factory, and the data are often highly correlated between equipment and processes.

Therefore, the analysis must go beyond a single stage in the equipment or process and be comprehensive.

Equipment maintenance methods are divided into three types: (1) corrective maintenance [6]; (2) preventive maintenance [7]; (3) predictive maintenance [8]. Corrective maintenance is performed after some or all of the equipment fails. The most common application is preventive maintenance, also known as scheduled maintenance, which carries out machine inspections, component replacement, and other maintenance at prescribed times. Predictive maintenance is an equipment maintenance method based on the condition of the machine: it predicts when the machine may be damaged according to its past degradation trend. Predictive Maintenance (PdM) has been used to monitor the historical health status of equipment and make timely adjustments, which is quite different from the routine maintenance methods employed in the past. PdM not only saves unnecessary costs, but also allows early repairs when the equipment is about to break down. It can avoid unpredictable machine stoppages caused by unexpected breakdown and improper operation.

Remaining Useful Life (RUL) is an important indicator in PdM. The RUL of a device or system is defined as the period from the current state to the time when the device begins to operate abnormally [9]. There are two types of RUL prediction: model-based and data-driven [10]. The model-based method establishes the model based on the historical trend of the equipment; generally, model-based methods perform better than data-driven methods when there is little historical data. Data-driven methods use the health status data collected by sensors, and for modeling, the collected data must be closely related to the device status. The advantage of data-driven methods is that the algorithm learns the correlation between input and output data; if there are sufficient data, the stability and accuracy of data-driven methods are better than those of model-based methods.

RUL prediction can be described as a time series forecasting problem. The main purpose of time series forecasting is to build models from historical records and predict future situations. Time series forecasting methods can be divided into statistical models, machine learning models, and deep learning models. Common statistical models include Autoregression (AR), Moving Average (MA), Autoregressive Moving Average (ARMA), and Autoregressive Integrated Moving Average (ARIMA). Machine learning builds a function or model to detect patterns in training samples and then uses these patterns for prediction [11]. Support Vector Machine (SVM) [12], Random Forest (RF) [13], and Extreme Gradient Boosting (XGB) [14] are representative machine learning algorithms for time series forecasting, but it is difficult to design an effective machine learning algorithm without substantial knowledge of the data [15]. In addition, the existing methods for analyzing time series data need pre-processing based on experience, which causes some loss of information. With improved process technology and more sensors, few existing methods can handle multi-sensor data, which is likely to cause misjudgment or missed detection.
Deep learning methods are preferred for predicting and analyzing time series data without predefined features. Deep learning is a branch of machine learning based on artificial neural networks with representation learning [16–18], which can learn the key information needed to make a response by using many hidden layers integrated in the network. The main difference between deep learning and machine learning is feature extraction: usually, the features for machine learning are manually selected in advance, whereas deep learning not only identifies features through model training, but is also much better than other algorithms at calculating results. Makridakis et al. [19] compared several time series analysis methods on the M3-Competition dataset; most of the machine learning and deep learning algorithms performed better than traditional statistical methods. Alfred et al. [20] demonstrated that algorithms based on neural networks are more efficient in learning time series data than other algorithms.

Feature extraction is an important data pre-processing method in machine learning for high prediction accuracy. In the past, the dimension reduction of most data features was accomplished through Principal Component Analysis (PCA). PCA retains the maximum variation of the features by projecting variables into another space to represent the original complete data with fewer features, achieving the effect of data dimension reduction. The difference between PCA and the Autoencoder (AE) is that the AE is a non-linear dimension reduction method [21] and an unsupervised training method. First, the model compresses the original data through the encoder structure, and then restores the compressed data through the decoder. The AE rebuilds the original data with fewer features for data representation [22]; it can compress the key information of high-dimensional input data into low-dimensional features. The AE is often used for feature extraction and dimension reduction in time series forecasting problems [23]. Moreover, machine learning methods require manual selection of features, and the quality of the features affects the models' results. Deep learning has the ability to extract features automatically, which reduces the time spent on feature engineering and helps experts make decisions.

Most of the relevant studies on RUL prediction use deep learning as the prediction model, which not only saves time in selecting features, but also makes prediction results much better than those of traditional machine learning models. Recurrent Neural Networks (RNN) are deep learning models that deal specifically with time series data and can select important features from equipment sensors. The Long Short-Term Memory network (LSTM) is a variant of the RNN [24]. For example, Zhang et al. [25] constructed an LSTM model to predict the RUL of lithium batteries, predicting after how many full charge and discharge cycles the capacity will drop below the normal threshold, based on the historical decline rate of capacity. Heimes [9] set the maximum remaining engine life at 130 on the PHM08 data set and predicted the RUL using an RNN model. Zhang et al. [25] also constructed an LSTM model to predict the RUL of jet engines, with one hundred jet engines used as training data and 100 as testing data. Each jet engine has 24 features that record the sensor values from normal to fault, and the RUL is predicted from the changes in these values.
Mathew et al. [26] predicted the RUL of jet engines using the 24 parameters of the original data, compared ten machine learning methods (Linear Regression, Decision Tree, SVM, Random Forest, KNN, K-means, Gradient Boost, Adaboost, Deep Learning, ANOVA), and verified the validity of the models through the RMSE. Zheng et al. [27] used an LSTM model to predict the RUL on the C-MAPSS, PHM08, and Milling data sets; the LSTM model was better than the other models. Chen et al. [28] proposed a two-phase model for RUL prediction using Kernel Principal Component Analysis (KPCA) and a Gated Recurrent Unit (GRU), respectively. To the best of our knowledge, little research has integrated the AE and GRU for RUL prediction.

This study develops an Autoencoder Gated Recurrent Unit (AE-GRU) model to predict the RUL of equipment. In particular, the AE is used to select features, the correlation between sensors is then found through the GRU model, and the RUL is predicted by a Multi-Layer Perceptron (MLP) model. The first part is the Autoencoder (AE) model, which is composed of an encoder and a decoder. The second part is the GRU model, a type of RNN that can deal with time series data. The GRU model finds key information in historical sensor data in combination with the MLP model, and the MLP model processes the extracted information to predict the RUL effectively.

2. Literature Review

Time series are statistical data that arrange events or observations in order of their occurrence. The main purpose of time series forecasting is to build models from historical records and predict future situations.

2.1. Recurrent Neural Networks

The RNN is the most common time series analysis method in deep learning. The main difference between an RNN and a general neural network is that the neurons of the RNN between hidden layers are not independent but influence each other, and the neurons of the RNN operate in order. The neurons have the function of temporarily storing memory: previously input data are temporarily stored in the internal memory, so a neuron can produce different output values according to the previous state. However, in the training of an RNN, too many hidden layer neurons cause the problem of vanishing gradients, because the heavy calculation lets the gradient descent method stay at a local minimum. A traditional neural network has different parameters to be learned in each hidden layer, but the parameters of the RNN only receive different inputs; this feature greatly reduces the training burden of the model.

The LSTM solves the vanishing gradient problem of stochastic gradient descent in recurrent neural networks. The biggest difference between the LSTM and the RNN is that each LSTM neuron has control gate functions: input, forget, and output. These gates have their own weights, and the calculation of the weights after data input determines whether the switches are turned on or off. The input gate controls whether data can be written to the memory space, and the forget gate determines whether the contents of the previous memory space are retained. The output gate controls whether the result of the memory space operation can be output. Although adding these gates to each neuron generates more weights to be calculated, it solves the vanishing gradient problem found in recurrent neural networks.

Figure 1 is a diagram of the LSTM unit architecture, and the related formulas are shown as Equations (1)–(6). The LSTM unit receives the input vector $x_t$ and the previous output $h_{t-1}$. Each unit contains four gates (i, g, f, o) and a cell; $x$ is the input vector of this layer, $h$ is the output vector, and $\odot$ is element-wise multiplication. $W$ is a weight matrix, $b$ is a bias vector, and (i, f, o, g, c) represent the input gate, forget gate, output gate, cell input, and cell activation, respectively. The forget gate controls how much information each memory unit needs to forget, the input gate controls how much new information each memory unit adds, and the output gate controls how much information each memory unit outputs. The cell input combines the input of the current time with the information of the previous time, and the cell activation controls whether the input gate and forget gate information can be obtained.

$$i_t = \mathrm{sigmoid}(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \tag{1}$$
$$f_t = \mathrm{sigmoid}(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \tag{2}$$
$$o_t = \mathrm{sigmoid}(W_{xo} x_t + W_{ho} h_{t-1} + b_o) \tag{3}$$
$$g_t = \tanh(W_{xg} x_t + W_{hg} h_{t-1} + b_g) \tag{4}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot g_t \tag{5}$$
$$h_t = \tanh(c_t) \odot o_t \tag{6}$$

Figure 1. Structure of Long Short-Term Memory (LSTM) unit.
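To make Equations (1)–(6) concrete, the following is a minimal NumPy sketch of a single LSTM step. The weight layout (one matrix per gate acting on the concatenated input and previous output) and the helper name lstm_step are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following Equations (1)-(6); W[k] maps the
    concatenated [x_t, h_prev] to gate k, and b[k] is its bias."""
    z = np.concatenate([x_t, h_prev])
    i = sigmoid(W['i'] @ z + b['i'])   # input gate, Eq. (1)
    f = sigmoid(W['f'] @ z + b['f'])   # forget gate, Eq. (2)
    o = sigmoid(W['o'] @ z + b['o'])   # output gate, Eq. (3)
    g = np.tanh(W['g'] @ z + b['g'])   # cell input, Eq. (4)
    c = f * c_prev + i * g             # cell activation, Eq. (5)
    h = np.tanh(c) * o                 # output vector, Eq. (6)
    return h, c

# Toy dimensions: 24 sensor inputs, 8 hidden units.
n_in, n_hid = 24, 8
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(n_hid, n_in + n_hid)) for k in 'ifog'}
b = {k: np.zeros(n_hid) for k in 'ifog'}
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), W, b)
```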

The Gated Recurrent Unit (GRU) is a variant of the LSTM [29]. As shown in Figure 2, (z, r, H, H̃) are the update gate, reset gate, activation, and candidate activation, respectively. The detailed formulas are shown in Equations (7)–(10). The update gate is used to control how much historical information and new information needs to be forgotten in the current state. The reset gate is used to control how much information is available from the candidate state. The candidate activation can be regarded as the new information at the current time, and the reset gate controls how much historical information needs to be retained.

The activation is generated by the update gate and the candidate activation. The update gate in the activation controls how much new and old information is retained. New information and old information have a complementary relationship: if a lot of new information is retained, less old information is considered, and vice versa.

$$z_t = \mathrm{sigmoid}(W_{xz} x_t + W_{hz} H_{t-1} + b_z) \tag{7}$$
$$r_t = \mathrm{sigmoid}(W_{xr} x_t + W_{hr} H_{t-1} + b_r) \tag{8}$$
$$\tilde{H}_t = \tanh(W_{x\tilde{H}} x_t + W_{\tilde{H}\tilde{H}} (r_t \odot H_{t-1}) + b_{\tilde{H}}) \tag{9}$$
$$H_t = z_t \odot H_{t-1} + (1 - z_t) \odot \tilde{H}_t \tag{10}$$

Figure 2. Structure of gated recurrent neural network.
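Analogously, a minimal NumPy sketch of one GRU step following Equations (7)–(10); the weight layout and the helper name gru_step are again illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, H_prev, W, b):
    """One GRU step following Equations (7)-(10)."""
    zin = np.concatenate([x_t, H_prev])
    z = sigmoid(W['z'] @ zin + b['z'])    # update gate, Eq. (7)
    r = sigmoid(W['r'] @ zin + b['r'])    # reset gate, Eq. (8)
    H_cand = np.tanh(W['h'] @ np.concatenate([x_t, r * H_prev]) + b['h'])  # Eq. (9)
    return z * H_prev + (1.0 - z) * H_cand  # activation, Eq. (10)

n_in, n_hid = 24, 8
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(n_hid, n_in + n_hid)) for k in 'zrh'}
b = {k: np.zeros(n_hid) for k in 'zrh'}
H = gru_step(rng.normal(size=n_in), np.zeros(n_hid), W, b)
```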

The GRU simplifies the input gate and forget gate of the LSTM into an update gate, and combines the cell states and hidden states together. Therefore, the GRU unit retains the advantages of the LSTM and further reduces the model training time by reducing the number of parameters in the model. According to the analysis results in [30], the results of the RNN model are poor, while the results of the GRU and LSTM are similar and better than the RNN; however, the GRU has fewer parameters than the LSTM, so the trend in deep learning models applied to time series analysis is toward the GRU.

In view of the excellent feature extraction performance of the AE and the fast calculation advantage of the GRU, in this study the RUL prediction model works by extracting the important features from the original data using the AE. After pre-processing of the extracted features, the GRU model and a DNN (deep fully connected neural network) are applied to predict the RUL.

2.2. RNN Applications in Time Series Forecasting

Chen et al. [31] performed a prediction of mechanical state (PMS) and extracted the collected sensor data of the machine through the feature extraction of empirical mode decomposition; this method makes the unstable signal slightly more stable for building the LSTM model. Zheng et al. [32] predicted the short-term power system load capacity, with collection time ranges from a few hours to several weeks; the characteristics of power system load data are non-linear, non-stationary, and non-seasonal, so that study applied an LSTM model to solve the problem. ElSaid et al. [33] applied an LSTM model to predict aircraft engine vibration values in the next 5, 10, and 20 s. Seventy-six parameters are recorded in the aircraft flight data recorder; fifteen key features were selected by experts with a professional background, normalized to the range between 0 and 1, and used to make predictions. Cenggoro and Siahaan [34] constructed a DLSTM (Deep Long Short-Term Memory) neural network to predict traffic flow; the input layer had 5 input values, and there were 5 hidden layers, each with 100 neurons. Bao et al. [35] conducted a stock price prediction divided into three stages. Stage one applied a wavelet transform algorithm to remove noise from the original data. Stage two stacks AEs to reduce the data dimensions and pick out features automatically. Stage three applies an LSTM model to make predictions.

Zhang et al. [36] divided Sea Surface Temperature (SST) prediction into two parts. The first part is short-term forecasting, which predicts the SST after 1 and 3 days. The second part is long-term forecasting, which predicts the weekly and monthly averages of the SST. It finds the relationship between sequences through the LSTM and then makes the prediction through a fully connected layer. How et al. [37] used the angle changes on different motion sensors of the NAO robot to classify the current actions of the robot through an LSTM model. Truong et al. [38] constructed an LSTM model to identify changes in human type and object type, and even predict possible human actions through the action combinations. Kuan et al. [39] constructed an MS-GRU (multilayered self-normalizing gated recurrent units) model, which has good results in predicting power loading. Zhang and Kabuka [40] used the GRU model to predict traffic volume; the training data included the historical weather data of the previous 100 h to predict the traffic volume in the next 12 h, and it was found that weather conditions can improve the prediction accuracy.

3. Research Framework

This study constructs an AE-GRU model for RUL prediction to increase equipment life, reduce unexpected harm caused by sudden shutdown of machinery, and improve the reliability of system operation. The proposed AE-GRU includes the steps of data pre-processing, feature extraction, and RUL prediction, as shown in Figure 3. The data pre-processing step first defines the engine life to obtain the exact RUL of the engine, and then standardizes the data to avoid errors caused by the different scales of the data. It then uses the deep learning autoencoder model to extract the features of the original data.
The GRU is specialized in processing time-series-related data and considers the changes of historical characteristics; the data are converted from 2D to 3D, where the new dimension is the time dimension. The last step of data processing is to change the value range of the RUL: because the original RUL is linearly decreasing, the results in model training and prediction are otherwise not optimal. Finally, the RUL is predicted through the GRU model and a fully-connected neural network.

Figure 3. Research framework of Autoencoder Gated Recurrent Unit (AE-GRU).

The symbols used in this research are defined as follows:

N: number of engines
S: number of sensors
n: length of sensing data
Y: number of neurons
L: number of hidden layers
W: weight
b: bias (error) vector
Features: new features after extraction
X_ij: the sensing data of the jth sensor in the ith engine; i = 1…N, j = 1…S
X_it: the sensing data of the ith engine at the current time t; i = 1…N, t = 1…T
h_ly: the yth neuron in the lth hidden layer; l = 1…L, y = 1…Y
T: engine usage time
RUL: remaining useful life
RUL′: remaining useful life after transform
μ: the average value of the sensing data x_i
σ: the standard deviation of the sensing data x_i
x′_i: data after standardization; i = 1…n, where n is the data length
y_i: true remaining life; i = 1…n
y′_i: predicted remaining life; i = 1…n

3.1. Data Pre-Processing

In the original engine data, there are only usage records for each engine; no specific RUL of the engine is recorded. Therefore, this study defines the time from good to bad for each engine, so that supervised learning methods can be applied to the experiment and the RUL can be predicted accurately from the parameter changes recorded by the sensors. As shown in Table 1, after finding the available time for each engine record (Equation (11)), the difference between the maximum time and the current time (Equation (12)) is the RUL of the engine.

$$T = \mathrm{Max}(X_{it}) \tag{11}$$

$$RUL = T - X_{it} \tag{12}$$

Table 1. Definition of Remaining Useful Life (RUL).

Engine  Usage Time  Parameter 1  Parameter 2  ...  Parameter 24  RUL
1       1           10.0047      0.2501       ...  17.1735       222
1       2           0.0015       0.0003       ...  23.3619       221
...
1       223         34.9992      0.84         ...  8.6695        0
...
218     132         35.007       0.8419       ...  8.6761        1
218     133         25.007       0.6216       ...  8.512         0
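As a sketch of this labeling step (Equations (11) and (12)) in pandas, with column names chosen to match Table 1 (the dataset's actual headers may differ):

```python
import pandas as pd

# Toy records in the layout of Table 1: one row per (engine, usage time).
df = pd.DataFrame({
    'engine': [1, 1, 1, 2, 2],        # hypothetical engine IDs
    'usage_time': [1, 2, 3, 1, 2],
})

# Equation (11): T is the maximum recorded usage time of each engine.
T = df.groupby('engine')['usage_time'].transform('max')

# Equation (12): the RUL is the gap between that maximum and the current time.
df['RUL'] = T - df['usage_time']
print(df)
```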

In the original engine data, the scales of the values measured by different sensors, such as pressure, temperature, and humidity, are different. In order to remove the difference in scales and put all the sensor values on the same standard, the original data are all converted to Z-scores. The average value μ is the sum of the sensor data x_i divided by n, the number of observations (Equation (13)). The standard deviation σ is also calculated (Equation (14)). The transformed data x′_i are obtained by calculating the difference between each sensor value and the average value, divided by the standard deviation (Equation (15)). The average value of x′_i is then equal to 0 and its standard deviation equal to 1.

$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i \tag{13}$$

$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2} \tag{14}$$

$$x'_i = \frac{x_i - \mu}{\sigma} \tag{15}$$
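A minimal sketch of this per-sensor Z-score standardization (Equations (13)–(15)); the array layout (rows = cycles, columns = sensors) is an assumption:

```python
import numpy as np

def standardize(X):
    """Z-score each sensor column. X has shape (n_cycles, n_sensors)."""
    mu = X.mean(axis=0)        # Equation (13): per-sensor mean
    sigma = X.std(axis=0)      # Equation (14): per-sensor standard deviation
    return (X - mu) / sigma    # Equation (15): transformed data

X = np.random.default_rng(0).normal(50.0, 5.0, size=(100, 24))
Xs = standardize(X)
print(Xs.mean(axis=0).round(6), Xs.std(axis=0).round(6))  # ~0 and ~1 per sensor
```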

3.1.1. Feature Extraction

The AE is used to reduce the data dimension and achieve the effect of feature extraction. As shown in Figure 4, the encoder reduces the dimension of the data and transforms the original data into a new space, whose features can describe the data more concisely than the original features; the orange neurons in the middle of the figure are the features in the new space. A feature value is obtained by multiplying the output h_l of the previous hidden layer by the weight matrix and adding the bias (Equation (16)), where f() is the activation function, which is Relu in this model, W is the weight matrix, and b is the bias.

$$Features = f(W * h_l) + b \tag{16}$$

Figure 4. AE Structure
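A sketch of such an autoencoder in Keras, compressing the 24 standardized sensor features to the 15-dimensional code used later in Section 4.3; the intermediate width of 20 is an illustrative assumption, as the paper does not report the encoder's internal layer sizes:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_features, code_dim = 24, 15  # 24 raw sensor features compressed to 15

inputs = keras.Input(shape=(n_features,))
x = layers.Dense(20, activation='relu')(inputs)      # assumed intermediate width
code = layers.Dense(code_dim, activation='relu')(x)  # extracted features (Eq. (16))
x = layers.Dense(20, activation='relu')(code)
outputs = layers.Dense(n_features)(x)                # reconstruction of the input

autoencoder = keras.Model(inputs, outputs)
encoder = keras.Model(inputs, code)                  # feeds features to the GRU later
autoencoder.compile(optimizer='adam', loss='mse')

X = np.random.default_rng(0).normal(size=(1000, n_features)).astype('float32')
autoencoder.fit(X, X, epochs=5, batch_size=128, verbose=0)  # learn to reconstruct X
features = encoder.predict(X, verbose=0)                    # shape (1000, 15)
```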

The data collected by sensors are a series of continuous observations. However, a general model uses only the current sensor values to predict the RUL; the change in historical sensor values is not considered. In this study, a new data processing method is proposed to take historical information into account: the format of the data is converted from the original two dimensions (samples, features) to three dimensions (samples, time steps, features). Zero-padding is often used in signal processing and Convolutional Neural Networks (CNN); its purpose is to keep or increase the dimension of the data without affecting the information of the data itself. In this study, when the current sensor record has no historical data, the zero-padding method is used to keep the data dimension.
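A sketch of this 2D-to-3D conversion with front zero-padding; make_windows is a hypothetical helper, not code from the paper:

```python
import numpy as np

def make_windows(X, time_steps):
    """Convert (samples, features) to (samples, time_steps, features).
    Rows with too little history are zero-padded at the front."""
    n, f = X.shape
    out = np.zeros((n, time_steps, f), dtype=X.dtype)
    for t in range(n):
        start = max(0, t - time_steps + 1)
        window = X[start:t + 1]                     # current row plus its history
        out[t, time_steps - len(window):] = window  # left-pad with zeros
    return out

X = np.arange(12, dtype=float).reshape(6, 2)        # 6 cycles, 2 features
X3d = make_windows(X, time_steps=5)
print(X3d.shape)  # (6, 5, 2); the earliest rows are partially zero-padded
```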

3.1.2. Maximum RUL Transformation


The maximum and minimum life of the engines are 356 and 127 cycles, and the average engine life is 209 cycles. The literature related to RUL prediction converts the maximum RUL to a specific reasonable value in the experiment. In this study, the transform process follows the setting of T = 130 in Heimes [9]: an RUL greater than T is defined as T, and an RUL less than T is not changed (Equation (17)). Figure 5 shows the life transform of one of the engines in the data set. Through this conversion, the RUL changes from the original linear decline to a nonlinear decline, and both the model training and the estimation of the RUL give better results with this setting.

$$RUL' = \begin{cases} T, & \text{if } RUL \geq T \\ RUL, & \text{if } RUL < T \end{cases} \tag{17}$$


Figure 5. Maximum life value transform: (a) original RUL; (b) RUL after maximum useful life transform.
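Equation (17) amounts to clipping the RUL at T; a one-line sketch:

```python
import numpy as np

T = 130                          # maximum RUL, following the setting of Heimes [9]
rul = np.arange(355, -1, -1)     # linearly decreasing RUL of one engine (356 cycles)
rul_prime = np.minimum(rul, T)   # Equation (17): values above T are capped at T
```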

3.2. RUL Prediction Model

The RUL prediction model of this study combines the advantages of effective feature extraction and fast calculation, and is divided into four steps, as shown in Figure 6. Step 1 is the input layer: $X_t$ represents the sensor data collected at the current time, $X_{t-1}$ the sensor data collected at the previous moment, and $X_{t-2}$ the sensor data collected two moments earlier. The number of historical time points of sensor data input to the model is adjustable. Step 2 is the GRU layer; the number of GRU layers is set to two in this paper. The number of GRU neurons is tuned in the experimental part, and the best parameter combination is selected for the research. The purpose of this step is to identify the time series correlation in the input data through the GRU. However, the GRU cannot directly predict the RUL, so a DNN must be added. Step 3 is the DNN layer: the DNN converts the features extracted by the GRU to the prediction dimension to perform the prediction, with the number of neurons decided by the best parameter combination from the experimental results, as in Step 2. The objective function of the DNN calculates the difference between the predicted value and the true value; in this study, the Mean Square Error (MSE) is used as the objective function (Equation (18)), and the gradient descent method is used to minimize it and train the model. Step 4 is the output layer of the model, whose output is the predicted RUL at the current time. The model is evaluated by calculating the Root Mean Square Error (RMSE).

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - y'_i)^2 \tag{18}$$
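A sketch of the four-step architecture in Keras, using the settings reported in Section 4 (two GRU layers and two dense layers sized G(64,64)N(8,8), MSE loss, Nadam optimizer); the remaining details are illustrative assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

time_steps, n_features = 5, 15   # 5 historical points of the 15 AE-extracted features

model = keras.Sequential([
    keras.Input(shape=(time_steps, n_features)),  # Step 1: input layer
    layers.GRU(64, return_sequences=True),        # Step 2: first GRU layer
    layers.GRU(64),                               # Step 2: second GRU layer
    layers.Dense(8, activation='relu'),           # Step 3: DNN layer
    layers.Dense(8, activation='relu'),           # Step 3: DNN layer
    layers.Dense(1),                              # Step 4: predicted RUL
])
model.compile(optimizer='nadam', loss='mse')      # Equation (18) as the objective
model.summary()
```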


Figure 6. Model structure of AE-GRU for RUL prediction.

4. Result Evaluation

4.1. Data Collection

To validate the proposed AE-GRU, the Turbofan Engine Degradation Simulation Data Set (Prognostics and Health Management 2008, PHM08) (https://www.nasa.gov/) is used for performance evaluation in this section. It was produced with the NASA tool C-MAPSS to simulate real large commercial aircraft turbofan engines. The data consist of a large number of time series cycles, which come from different engines of the same type; each engine starts with a different wear level. The first subsection introduces the engine data collected by the sensors. The second subsection describes the parameter settings of the model: the performance of different activation functions and optimizers is compared, and the number of hidden layers and neurons is tuned. In the third subsection, we extract the best parameters, apply them to different models, and test the models by analyzing the root mean square error (RMSE).

There are data for 218 turbo engines, with a total of 45,918 records in the dataset. Every record has 26 original features and the customized RUL. The experimental design and verification of this study adopt k-fold cross-validation to ensure the stability of the model.

Figure 7 is a diagram of the sensor parameters of the turbine engines in this study, presenting the sensor parameters of Engine 1 visually. The parameter values of the sensors vary in range, and the parameter trends are not consistent. The RUL of the engine is predicted from the changes of these sensor parameters in this research.
the stability The experimental of the model. design and verification of this adopts k-fold cross validation to ensure the stability of the model. studyFigure adopts 7 isk-fold a diagram cross validation of the sens toor ensure parameters the stability of the turbineof the model. engines in this study. The sensor Figure7 is a diagram of the sensor parameters of the turbine engines in this study. The sensor parametersFigure of7 isEngine a diagram 1 are presentedof the sens visually.or parameters The parameter of the turbine values engi of thenes sensors in this study.vary in The range sensor and parameters of Engine 1 are presented visually. The parameter values of the sensors vary in range and theparameters parameter of Enginetrends are1 are not presented consistent. visually. The RUL The ofparameter the engine values will ofbe the predicted sensors byvary the in changes range and of thethesethe parameter parameter sensor parameters trends trends are are notin not this consistent. consistent. research. TheThe RULRUL ofof the engine will will be be predicted predicted by by the the changes changes of of thesethese sensor sensor parameters parameters in in this this research. research.

Figure 7. Diagram of turbine engine sensor parameters.


4.2. Hyperparameter Settings

Based on experience in tuning the model, the hyperparameters in the neural network significantly affect the results of the model. This section finds the best parameter combination to apply in the model to predict the RUL value.

In experiments on neural networks, the number of hidden layers and the number of neurons are often the most difficult to choose. The training results also depend on the batch size and the number of epochs. Batch size is the number of samples to work through in one iteration; an epoch is one pass of the model, in batches, through the entire training dataset. For example, if a training dataset has 3000 records and the batch size is set to 500, it takes 6 iterations to complete an epoch.

4.2.1. Activation Function

The main function of the activation function in a neural network is to introduce nonlinear characteristics. Without activation functions, the input and output would only be related by a simple linear relationship and could not handle complex problems, so the activation function is very important in deep learning.

This experiment sets the following parameters to observe the training result of the GRU model with different activation functions:

1. Epochs: 100;
2. Optimizer: Adam;
3. Batch_size: 128;
4. Hidden layers, number of neurons: 1, 50;
5. Inputs: 24.

This experiment uses 5-fold cross-validation to compare a total of seven activation functions and calculates the average loss of each: softmax (3055.5), softplus (517.0), softsign (438.9), relu (434.8), tanh (445.3), sigmoid (990.7), hard sigmoid (1051.9). According to the training results, the softmax performance is relatively poor and the relu activation function performs best (Figure 8). The softmax activation function is a blend of multiple sigmoids and is suited to multiclass classification problems rather than a regression problem. The advantage of the relu activation function is that a neuron is deactivated only when the output of its linear transformation falls below zero; the neurons are therefore not all activated at the same time, but a certain number of neurons are activated at a time. The objective function of relu can be expressed as [41]:

$$f(x) = \max(0, x) \tag{19}$$

In addition, since the relu activation function does not need to perform exponential calculations, its convergence speed is fast and its calculation complexity is low. Therefore, this study adopted relu as the activation function.

Figure 8. Comparison of training results with different activation functions.
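A sketch of how such an activation comparison can be run with 5-fold cross-validation in Keras; the stand-in data and the shortened training schedule are assumptions, not the authors' code:

```python
import numpy as np
from sklearn.model_selection import KFold
from tensorflow import keras
from tensorflow.keras import layers

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1, 24)).astype('float32')   # stand-in for the 24 sensor inputs
y = rng.uniform(0, 130, size=500).astype('float32')   # stand-in RUL targets

def build_gru(activation):
    # 1 hidden GRU layer with 50 neurons, as in the experiment settings.
    model = keras.Sequential([
        keras.Input(shape=(1, 24)),
        layers.GRU(50, activation=activation),
        layers.Dense(1),
    ])
    model.compile(optimizer='adam', loss='mse')
    return model

for act in ['softmax', 'softplus', 'softsign', 'relu', 'tanh', 'sigmoid', 'hard_sigmoid']:
    losses = []
    for tr, va in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        model = build_gru(act)
        # Epochs shortened for the sketch; the paper trains for 100.
        model.fit(X[tr], y[tr], epochs=5, batch_size=128, verbose=0)
        losses.append(model.evaluate(X[va], y[va], verbose=0))
    print(act, float(np.mean(losses)))
```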


4.2.2. Optimizer

The purpose of an optimizer is to minimize the loss function, the gap between the predicted value and the real value. This experiment sets the following parameters and observes the training result of the GRU model under different optimizers:

1. Epochs: 100;
2. Activation: relu;
3. Batch_size: 128;
4. Hidden layers, number of neurons: 1, 50;
5. Inputs: 24.

This experiment applies 5-fold cross-validation to compare a total of seven optimization algorithms: SGD (569.8), RMSprop (438.7), Adagrad (4586.9), Adadelta (426.6), Adam (436.6), Adamax (436.8), and Nadam (420.7). The parameter setting of each optimization algorithm is the best result proposed in the literature. Kingma and Ba [42] proposed the Adam optimization algorithm, and most current neural networks apply Adam to optimize the loss function; its advantages are quick convergence and the ability to deal with high noise and sparse gradients. The Nadam optimization algorithm proposed by Dozat [43] changes the momentum part of Adam and accelerates the convergence rate of the model. This study confirms that on the PHM08 dataset, Nadam converges faster than Adam (Figure 9). Therefore, the optimizer used in this study is Nadam.

Figure 9. Comparison of training results with different optimizers.

4.2.3. Number of Hidden Layers and Neurons

This experiment compares three different batch sizes: 32 (Table 2), 64 (Table 3), and 128 (Table 4). The early stopping function is adopted in the experiment: if the validation MSE does not improve for 10 consecutive epochs, training stops and the previous best model is used on the test dataset. Early stopping is commonly used with gradient descent to avoid overfitting (a sketch of this callback appears at the end of this subsection). The GRU and DNN are each set to two layers, and the RMSE results for different numbers of neurons in training are compared. The input of this experiment is the 24 parameters of the current raw sensor data, with no historical information (time steps = 1).

Table 2. Comparison of Root Mean Square Error (RMSE) result (batch size = 32).

Networks          Epochs  Training Time  RMSE
G(32,32)N(8,8)    33      98.4 s         20.18
G(64,64)N(8,8)    20      89.3 s         20.24
G(32,64)N(8,8)    19      76.8 s         20.37
G(32,64)N(16,16)  28      103.7 s        20.24
G(96,96)N(8,8)    25      150.3 s        20.20
G(96,96)N(16,16)  24      151.4 s        20.18


Table 3. Comparison of RMSE result (batch size = 64).

Networks          Epochs  Training Time  RMSE
G(32,32)N(8,8)    21      52.0 s         20.32
G(64,64)N(8,8)    24      75.8 s         20.25
G(32,64)N(8,8)    23      63.8 s         20.33
G(32,64)N(16,16)  24      66.7 s         20.36
G(96,96)N(8,8)    15      80.1 s         20.44
G(96,96)N(16,16)  14      78.3 s         20.44

Table 4. Comparison of RMSE result (batch size = 128).

Networks          Epochs  Training Time  RMSE
G(32,32)N(8,8)    42      60.4 s         20.16
G(64,64)N(8,8)    34      70.1 s         20.07 *
G(32,64)N(8,8)    33      60.3 s         20.20
G(32,64)N(16,16)  52      85.6 s         20.19
G(96,96)N(8,8)    42      119.4 s        20.15
G(96,96)N(16,16)  36      115.7 s        20.12

The experimental results show that the larger the batch size, the faster the model training; with the largest batch size the RMSE results are also the best. Therefore, the batch size in this study is set to 128, and the network architecture is set to G(64,64)N(8,8), which achieves the lowest RMSE in Table 4.
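In Keras, the early stopping rule described above can be expressed as a callback; a minimal sketch, with the validation split as an assumption:

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop when the validation loss (MSE) has not improved for 10 consecutive
# epochs, and roll back to the weights of the previous best model.
early_stop = EarlyStopping(monitor='val_loss', patience=10,
                           restore_best_weights=True)

# Hypothetical usage with the AE-GRU model and training arrays from above:
# model.fit(X_train, y_train, validation_split=0.2, epochs=100,
#           batch_size=128, callbacks=[early_stop])
```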

4.3. Model Evaluation

4.3.1. Model Comparison

This section uses 5-fold cross-validation to evaluate the validity of the AE-GRU RUL model of this study and compares the results with those of existing models. The 218 engines are divided into five parts of 44, 44, 44, 44, and 42 engines; four parts are used for training each time, and the remainder are the test data. After splitting the data, the AE model reduces the dimension of the data, and the RUL of the engine is then predicted through the GRU. The performance of the models is compared using the RMSE.

The dimension of the data is reduced from the 24 inputs of the original data to 15, 10, and 5. The AE extracts the characteristics of the original data by non-linear transformation through multiple hidden layers, and the loss of each setting in the compression and restoration process is observed (Table 5).

Table 5. Loss of dimension reduction by AE.

Input  15          10          5
Loss   7.7 × 10⁻⁶  2.6 × 10⁻⁵  3.0 × 10⁻⁵

The experimental results show that the dimension reduction effect is best when the input is reduced to 15 dimensions; with 10 or 5 dimensions, the restoration is poor because of the loss of original data information. In this study, the AE therefore reduces the data dimension from 24 to 15, and the performance is compared with the other models.

After 5-fold cross-validation (Tables 6 and 7), it is found that when the input of the AE-GRU is reduced from the original 24 features to 15, the RMSE results are better than those of the other models. This confirms that the AE-GRU can effectively extract features and accurately predict the RUL. Time steps is the number of historical data points: time steps = 5 uses the current time point and the previous four records, and time steps = 10 uses the current time point and the previous nine. The results show that when time steps = 5, the results of the models are quite close. When time steps = 10, because of the large amount of historical data, the RNN has the problem of vanishing gradients, while the general DNN has no time series characteristics, and its prediction results are the worst.

Table 6. RMSE results of different models under 5-fold cross-validation (Time steps = 5).

Fold       DNN         RNN         LSTM        GRU         AE-GRU
           Inputs:24   Inputs:24   Inputs:24   Inputs:24   Inputs:15
1          18.57       18.06       17.80       17.68       17.64
2          18.76       18.22       18.00       17.88       17.70
3          18.73       18.47       18.19       18.03       17.98
4          19.11       18.44       18.45       18.16       18.20
5          19.12       18.43       18.34       18.12       18.20
Average    18.86       18.32       18.16       17.96       17.94

Table 7. RMSE results of different models under 5-fold cross-validation (Time steps = 10).

Fold       DNN         RNN         LSTM        GRU         AE-GRU
           Inputs:24   Inputs:24   Inputs:24   Inputs:24   Inputs:15
1          19.21       15.08       11.20       10.31       10.39
2          19.21       14.26       11.52       11.05       10.02
3          19.07       15.86       10.78       10.75       10.70
4          19.77       15.07       11.49       11.33       10.69
5          19.69       17.27       10.81       10.54       11.31
Average    19.39       15.51       11.16       10.79       10.62
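For concreteness, here is a hypothetical sketch of the engine-wise fold split and the sliding-window construction, assuming NumPy and scikit-learn; all names are illustrative. Note that sklearn's KFold on 218 engines yields folds of 44/44/44/43/43, whereas the study's manual split is 44/44/44/44/42; in both cases all records of an engine stay within a single fold.

import numpy as np
from sklearn.model_selection import KFold

engine_ids = np.arange(218)                      # one id per engine
kf = KFold(n_splits=5, shuffle=True, random_state=0)

def make_windows(features, rul, time_steps=5):
    # Stack the current record with the previous (time_steps - 1) records,
    # so time steps = 5 means the current point plus the four before it.
    X, y = [], []
    for t in range(time_steps - 1, len(features)):
        X.append(features[t - time_steps + 1 : t + 1])
        y.append(rul[t])
    return np.asarray(X), np.asarray(y)

# for train_idx, test_idx in kf.split(engine_ids):
#     train_engines = engine_ids[train_idx]      # ~4/5 of the engines
#     test_engines = engine_ids[test_idx]        # remaining engines
#     ...build windows per engine and concatenate...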

4.3.2. RUL Prediction

Figures 10–12 show the test results for the five models (DNN, RNN, LSTM, GRU, and AE-GRU). As the figures show, all five models suffer from unstable fluctuation. Neither DNN nor RNN is ideal in training or testing. LSTM, GRU, and AE-GRU perform well on the training set but are precarious on the testing set, because there is not enough training data; deep learning requires a large amount of data to identify features. When the RUL is stable at 130 cycles, the AE-GRU proposed in this study predicts the RUL of the engine more accurately and does not fluctuate like DNN and RNN. When the RUL begins to decline, AE-GRU can predict it in advance.

Figure 10. Model testing results comparison (Engine 1).


Figure 11. Model testing results comparison (Engine 2).

Figure 12. Model testing results comparison (Engine 3).

5. Conclusions

This study proposes the AE-GRU model for RUL prediction. First, pre-processing was applied to the data collected by different sensors. Because the scales of different sensors are not the same, the sensor data are standardized so that they can be rearranged to the same scale while retaining the time series characteristics of the data. Next, the RUL is defined; the purpose of defining the RUL is to allow the user to accurately predict the remaining life of working equipment. Feature extraction finds the characteristics of the sensor data by deep learning. These characteristics directly affect the predictive ability of the model: characteristics that can fully represent the data are indispensable for a good model. Data dimension conversion takes the historical sensor data into account, because the data have the characteristics of a time series. The last step of data pre-processing is the transformation of the maximum life value: an engine with an RUL greater than 130 cycles is set to 130 cycles. When using the RUL prediction, attention focuses on the RUL value when the engine is about to break; the situation where the RUL value is still very large is less important. Then the GRU is combined with the back-propagation algorithm to learn the parameters. Because there are a large number of parameters to learn, the experiment uses a GPU graphics card in parallel operation to increase processing speed. At the same time, the ReLU activation function and the Nadam optimizer are used to make the learning of the parameters converge more quickly. Combined with the best combination of parameters found in the experimental design, the prediction of the RUL achieves better results.
The contribution of this study is that it applies the optimized version of LSTM, the GRU. The GRU can effectively capture the characteristics of the sensor data, with the characteristics of the data extracted by the AE before the RUL model prediction. The GRU model has many fewer parameters than LSTM but

retains the advantages of LSTM. The AE-GRU model proposed in this study has a shorter training time and also takes advantage of the features extracted by the AE, so the convergence speed of the model is much faster. The validity of the model is evaluated and verified through 5-fold cross-validation. The root mean square error results are better than those of other deep learning methods, and the model can find an engine that is about to fail early enough to maintain the equipment in time, reducing unnecessary costs. The AE-GRU model proposed in this study has good accuracy in the prediction of RUL.
With the improvement of process technology, the cost of equipment will become higher and higher. Therefore, more complex models may be needed in future research, such as a bi-directional recurrent network; in this study, only a unidirectional recurrent network is considered. In some cases, the output at the current moment is related not only to the previous state but also to the state after it. The pre-processing method in this study extracts features from the original data; although the features can effectively represent the data set, they may contain a little noise. In the future, the pre-processing method might also employ an advanced version of the AE, such as a denoising AE (DAE). The combination of these two methods could further improve the prediction of RUL. If the algorithm can be effectively implemented, predictive maintenance can be widely used in many applications, providing effective information for early maintenance of machinery, and self-correcting parameters can improve the yield to achieve the target of smart factories.
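As a recap of the pre-processing summarized above, the following minimal sketch (assuming NumPy; function names are illustrative, not from the original work) shows the per-sensor standardization and the piecewise RUL target capped at 130 cycles.

import numpy as np

def standardize(X, eps=1e-8):
    # Rescale each sensor column to zero mean and unit variance,
    # putting sensors with different scales on the same standard.
    return (X - X.mean(axis=0)) / (X.std(axis=0) + eps)

def capped_rul(n_cycles, max_rul=130):
    # Remaining cycles until failure, with values above 130 set to 130,
    # since the RUL far from failure matters less in practice.
    rul = np.arange(n_cycles - 1, -1, -1)
    return np.minimum(rul, max_rul)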

Author Contributions: Conceptualization, Y.-W.L. and C.-Y.H.; methodology, C.-Y.H.; software, K.-C.H.; validation, Y.-W.L. and K.-C.H.; formal analysis, Y.-W.L. and K.-C.H.; investigation, Y.-W.L. and K.-C.H.; resources, C.-Y.H.; data curation, K.-C.H.; writing—original draft preparation, Y.-W.L.; writing—review and editing, C.-Y.H.; visualization, Y.-W.L.; supervision, C.-Y.H.; project administration, C.-Y.H.; funding acquisition, C.-Y.H. All authors have read and agreed to the published version of the manuscript.
Funding: This research is supported by the Ministry of Science and Technology, Taiwan (MOST 107-2221-E-027-127-MY2; MOST 108-2745-8-027-003).
Conflicts of Interest: The authors declare no conflict of interest.

References

1. Wang, S.; Wan, J.; Li, D.; Zhang, C. Implementing Smart Factory of Industrie 4.0: An Outlook. Int. J. Distrib. Sens. Netw. 2016, 12, 3159805. [CrossRef]
2. Chien, C.-F.; Hsu, C.-Y.; Chen, P.-N. Semiconductor Fault Detection and Classification for Yield Enhancement and Manufacturing Intelligence. Flex. Serv. Manuf. J. 2013, 25, 367–388. [CrossRef]
3. Hsu, C.-Y.; Liu, W.-C. Multiple Time-Series Convolutional Neural Network for Fault Detection and Diagnosis and Empirical Study in Semiconductor Manufacturing. J. Intell. Manuf. 2020, 1–14. [CrossRef]
4. Fan, S.-K.S.; Hsu, C.-Y.; Tsai, D.-M.; He, F.; Cheng, C.-C. Data-Driven Approach for Fault Detection and Diagnostic in Semiconductor Manufacturing. IEEE Trans. Autom. Sci. Eng. 2020, 1–12. [CrossRef]
5. Shrouf, F.; Ordieres, J.; Miragliotta, G. Smart factories in Industry 4.0: A review of the concept and of energy management approached in production based on the Internet of Things paradigm. In Proceedings of the 2014 IEEE International Conference on Industrial Engineering and Engineering Management, Selangor, Malaysia, 9–12 December 2014; pp. 697–701.
6. Kothamasu, R.; Huang, S.H.; VerDuin, W.H. System health monitoring and prognostics—A review of current paradigms and practices. Int. J. Adv. Manuf. Technol. 2006, 28, 1012–1024. [CrossRef]
7. Peng, Y.; Dong, M.; Zuo, M.J. Current status of machine prognostics in condition-based maintenance: A review. Int. J. Adv. Manuf. Technol. 2010, 50, 297–313. [CrossRef]
8. Soh, S.S.; Radzi, N.H.; Haron, H. Review on scheduling techniques of preventive maintenance activities of railway. In Proceedings of the 2012 Fourth International Conference on Computational Intelligence, Modelling and Simulation, Kuantan, Malaysia, 25–27 September 2012; pp. 310–315.
9. Heimes, F.O. Recurrent neural networks for remaining useful life estimation. In Proceedings of the 2008 International Conference on Prognostics and Health Management, Denver, CO, USA, 6–9 October 2008; pp. 1–6.

10. Li, Y.; Shi, J.; Gong, W.; Liu, X. A data-driven prognostics approach for RUL based on principle component and instance learning. In Proceedings of the 2016 IEEE International Conference on Prognostics and Health Management (ICPHM), Ottawa, ON, Canada, 20–22 June 2016; pp. 1–7.
11. Murphy, K.P. Machine Learning: A Probabilistic Perspective; MIT Press: Cambridge, MA, USA, 2012.
12. Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297. [CrossRef]
13. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [CrossRef]
14. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; ACM: New York, NY, USA, 2016.
15. Han, Z.; Zhao, J.; Leung, H.; Ma, K.F.; Wang, W. A review of deep learning models for time series prediction. IEEE Sens. J. 2019. [CrossRef]
16. Bengio, Y.; Courville, A.; Vincent, P. Representation Learning: A Review and New Perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1798–1828. [CrossRef]
17. Schmidhuber, J. Deep Learning in Neural Networks: An Overview. Neural Netw. 2015, 61, 85–117. [CrossRef]
18. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [CrossRef] [PubMed]
19. Makridakis, S.; Spiliotis, E.; Assimakopoulos, V. Statistical and Machine Learning Forecasting Methods: Concerns and Ways Forward. PLoS ONE 2018, 13, e0194889. [CrossRef] [PubMed]
20. Alfred, R.; Obit, J.H.; Ahmad Hijazi, M.H.; Ag Ibrahim, A.A. A performance comparison of statistical and machine learning techniques in learning time series data. Adv. Sci. Lett. 2015, 21, 3037–3041.
21. Hinton, G.E.; Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science 2006, 313, 504–507. [CrossRef]
22. Lin, P.; Tao, J. A Novel Bearing Health Indicator Construction Method Based on Ensemble Stacked Autoencoder. In Proceedings of the 2019 IEEE International Conference on Prognostics and Health Management (ICPHM), San Francisco, CA, USA, 17–20 June 2019; pp. 1–9.
23. Yin, C.; Zhang, S.; Wang, J.; Xiong, N.N. Anomaly Detection Based on Convolutional Recurrent Autoencoder for IoT Time Series. IEEE Trans. Syst. Man Cybern. Syst. 2020, 1–11. [CrossRef]
24. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [CrossRef]
25. Zhang, Y.; Xiong, R.; He, H.; Liu, Z. A LSTM-RNN method for the lithium-ion battery remaining useful life prediction. In Proceedings of the 2017 Prognostics and System Health Management Conference (PHM-Harbin), Harbin, China, 9–12 July 2017; pp. 1–4.
26. Mathew, V.; Toby, T.; Singh, V.; Rao, B.M.; Kumar, M.G. Prediction of Remaining Useful Lifetime (RUL) of turbofan engine using machine learning. In Proceedings of the 2017 IEEE International Conference on Circuits and Systems (ICCS), Batumi, Georgia, 5–8 December 2017; pp. 306–311.
27. Zheng, S.; Ristovski, K.; Farahat, A.; Gupta, C. Long short-term memory network for remaining useful life estimation. In Proceedings of the 2017 IEEE International Conference on Prognostics and Health Management (ICPHM), Dallas, TX, USA, 19–21 June 2017; pp. 88–95.
28. Chen, J.; Jing, H.; Chang, Y.; Liu, Q. Gated recurrent unit based recurrent neural network for remaining useful life prediction of nonlinear deterioration process. Reliab. Eng. Syst. Saf. 2019, 185, 372–382. [CrossRef]
29. Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv 2014, arXiv:1406.1078.
30. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv 2014, arXiv:1412.3555.
31. Chen, Z.; Liu, Y.; Liu, S. Mechanical state prediction based on LSTM neural network. In Proceedings of the 2017 36th Chinese Control Conference (CCC), Dalian, China, 26–28 July 2017; pp. 3876–3881.
32. Zheng, J.; Xu, C.; Zhang, Z.; Li, X. Electric load forecasting in smart grids using long-short-term-memory based recurrent neural network. In Proceedings of the 2017 51st Annual Conference on Information Sciences and Systems (CISS), Baltimore, MD, USA, 22–24 March 2017; pp. 1–6.
33. ElSaid, A.; Wild, B.; Higgins, J.; Desell, T. Using LSTM recurrent neural networks to predict excess vibration events in aircraft engines. In Proceedings of the 2016 IEEE 12th International Conference on e-Science, Baltimore, MD, USA, 23–27 October 2016; pp. 260–269.

34. Cenggoro, T.W.; Siahaan, I. Dynamic bandwidth management based on traffic prediction using Deep Long Short Term Memory. In Proceedings of the 2016 2nd International Conference on Science in Information Technology (ICSITech), Balikpapan, Indonesia, 26–27 October 2016; pp. 318–323.
35. Bao, W.; Yue, J.; Rao, Y. A deep learning framework for financial time series using stacked autoencoders and long-short term memory. PLoS ONE 2017, 12, e0180944. [CrossRef]
36. Zhang, Q.; Wang, H.; Dong, J.; Zhong, G.; Sun, X. Prediction of sea surface temperature using long short-term memory. IEEE Geosci. Remote Sens. Lett. 2017, 14, 1745–1749. [CrossRef]
37. How, D.N.T.; Sahari, K.S.M.; Yuhuang, H.; Kiong, L.C. Multiple sequence behavior recognition on humanoid robot using long short-term memory (LSTM). In Proceedings of the 2014 IEEE International Symposium on Robotics and Manufacturing Automation (ROMA), Kuala Lumpur, Malaysia, 15–16 December 2014; pp. 109–114.
38. Truong, A.M.; Yoshitaka, A. Structured LSTM for human-object interaction detection and anticipation. In Proceedings of the 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Lecce, Italy, 29 August–1 September 2017; pp. 1–6.
39. Kuan, L.; Yan, Z.; Xin, W.; Yan, C.; Xiangkun, P.; Wenxue, S.; Zhe, J.; Yong, Z.; Nan, X.; Xin, Z. Short-term electricity load forecasting method based on multilayered self-normalizing GRU network. In Proceedings of the 2017 IEEE Conference on Energy Internet and Energy System Integration (EI2), Beijing, China, 26–28 November 2017; pp. 1–5.
40. Zhang, D.; Kabuka, M.R. Combining weather condition data to predict traffic flow: A GRU-based deep learning approach. IET Intell. Transp. Syst. 2018, 12, 578–585. [CrossRef]
41. Sharma, S. Activation functions in neural networks. Int. J. Eng. Appl. Sci. Technol. 2020, 4, 310–316. [CrossRef]
42. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
43. Dozat, T. Incorporating Nesterov momentum into Adam. In Proceedings of the 2016 International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016; pp. 1–4.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).