
Methodology of Adaptive Prognostics and Health Management in Dynamic Work Environment

A dissertation submitted to the Graduate School of the University of Cincinnati in partial fulfillment of the requirements for the degree of Doctor of Philosophy

In the Department of Mechanical and Materials Engineering of the College of Engineering and Applied Science by Jianshe Feng

June 2020

B.Sc. in Mechanical Engineering, Tongji University (2012)
M.Sc. in Mechatronics Engineering, University (2015)

Committee:
Prof. Jay Lee (Chair)
Prof. Jing Shi
Prof. Manish Kumar
Prof. Thomas Huston
Dr. Hossein Davari
Dr. Zongchang Liu

Abstract

Prognostics and health management (PHM) has gradually become an essential technique for improving the availability and efficiency of complex systems. With the rapid advancement of sensor and communication technology, a huge amount of real-time data is generated from various industrial applications, which brings new challenges to PHM in the context of big data streams. On one hand, high-volume stream data places a heavy demand on data storage, communication, and PHM modeling. On the other hand, continuous change and drift are essential properties of stream data in an evolving environment, which requires the PHM model to be capable of capturing the new information in stream data adaptively, efficiently, and continuously. This research proposes a systematic methodology to develop an effective online learning PHM with adaptive sampling techniques to fuse information from continuous stream data. An adaptive sample selection strategy is developed so that representative samples can be effectively selected in both off-line and online environments. In addition, various data-driven models, including probabilistic models, Bayesian algorithms, incremental methods, and ensemble algorithms, are employed and integrated into the proposed methodology for model establishment and updating with important samples selected from the streaming sequence. Finally, the effectiveness of the proposed systematic methodology is validated with four typical industrial applications: power forecasting of a combined cycle power plant, fault detection of hard disk drives, virtual metrology in semiconductor manufacturing processes, and prognosis of battery state of capacity.

The result comparison between the proposed methodology and state-of-the-art benchmark methods indicates that the proposed methodology is capable of building an adaptive PHM with sustainable performance to deal with dynamic issues in processes, which provides a promising way to prolong the PHM model lifetime after implementation.

Keywords: adaptive PHM; sample selection; sample importance test; online modeling; sequential model updating

To my beloved wife and family.

Declaration

I hereby declare that except where specific reference is made to the work of others, the contents of this dissertation are original and have not been submitted in whole or in part for consideration for any other degree or qualification in this, or any other university. This dissertation is my own work and contains nothing which is the outcome of work done in collaboration with others, except as specified in the text and Acknowledgements. This dissertation contains fewer than 40,000 words including appendices, bibliography, footnotes, tables, and equations and has fewer than 100 figures.

Jianshe Feng June 2020

Acknowledgements

First of all, I would like to express my foremost gratitude to my advisor, Prof. Jay Lee, for his continuous support and insightful guidance throughout my PhD study. Without his support, I would never have had the chance to accomplish this research in such an excellent research environment. My sincere gratitude also goes to my committee members: Prof. Jing Shi, Prof. Manish Kumar, Prof. Thomas Huston, Dr. Hossein Davari, and Dr. Zongchang Liu for their valuable feedback and helpful guidance during my research. I would like to thank Dr. Yilu and Dr. Xinyu from the GM R&D center for the internship opportunity to work on an autonomous vehicle diagnosis and prognosis project. I would like to thank Dr. Zwich Tang and Ms. Aisha Yousuf from Eaton for the internship opportunity to work on an electricity grid fault isolation and localization project. These internship experiences gave me the chance to put my research into a wider context, investigate how it fits into a bigger picture, and inspire new ideas for my research. I would like to thank all the collaborators from various companies including Applied Materials, Plastic Omnium, Electric, PWC, Kinpo-ACCL, etc. Special thanks go to Dr. James Moyne from the University of Michigan and the Applied Global Services Group; his expertise and insights in semiconductor manufacturing inspired me greatly during our collaboration. I would like to give my thanks to everyone in the IMS Center for the collaborative work and great support: Dr. Hossein Davari Ardakani, Dr. Wenjing, Dr. Chao Jin, Dr. Ann Kao, Dr. Zongchang Liu, Dr. Zhe Shi, Dr. Di, Dr. Xiaodong, Dr. Shaojie, Dr. Jaskaran Singh, Mr. Behrad Bagheri, Mr. Pin Li, Mr. Bin, Mr. Qibo, Ms. Laura Pahren, Ms. Sherry, Mr. Honghao, Mr. Runfeng, Mr. Yuan-Ming Hsu, Mr. Vibhor Pandhare, Mr. Cyrus Azamfar, Mr. Himanshu Grover, Mr. Feng, Mr. Wenzhe Li, Mr. Fei Li, Ms. Yang, Miss Yinglu Wang, Mr. Shaojie Yang, Mr. Shahin Siahpour, Ms. Marcella Miller, Dr. Ming, Dr. Yibing Yin, Dr. Yuan-Jen, Dr. Huijie, Mr. Yubin, Mr. Hongsheng, Dr. Mingqiang Zhu, Dr. Jianhai, Dr. Yang Tang, Dr. Zhongwei Wang, and many others. I appreciate Mr. Patrick Brown and Mr. Michael Lyons for their great organization and administration over the past years. It is my honor to be an IMSer and to have worked with all of you! Finally, I would like to thank my whole big family, especially my parents Yujie Feng (冯玉杰: Féng Yùjié) and An (安贤: Ān Xián), and my love Dantong. The consistent support and unwavering love from my big family have always given me the strength to get through tough times on this journey.

Table of contents

List of figures xv

List of tables xxi

Nomenclature xxiii

1 Introduction 1

2 Literature Review and Related Works 5

2.1 Overview of Prognostics and Health Management ...... 5
2.2 Recent Adaptive PHM Practices ...... 7
2.3 Related Research Topics ...... 9
2.4 Challenges and Research Gaps ...... 28

3 Development of Adaptive PHM Methodology 31

3.1 Framework of Adaptive PHM Methodology ...... 31
3.2 Sample Selection Strategy of Adaptive PHM ...... 34
3.3 Models of Adaptive PHM ...... 44
3.3.1 Bayesian Methods ...... 46
3.3.2 Adaptive Ensemble Methods ...... 48
3.3.3 Neural Nets ...... 49
3.3.4 Online Kernel Methods ...... 51


3.4 Off-line Sample Selection and Modeling Techniques ...... 52
3.5 On-line Sample Selection and Modeling Techniques ...... 53
3.6 Justification of Sample Importance Test ...... 54
3.7 An Intuitive Case of Sample Selection and Modeling ...... 57
3.7.1 Background ...... 58
3.7.2 Design of Experiments ...... 59
3.7.3 Results and Discussions ...... 59
3.7.4 Summary ...... 63

4 Case Studies 65

4.1 Overview of Case Studies ...... 65
4.2 Case Study I – Hard Disk Drive Online Fault Detection ...... 66
4.2.1 Background ...... 66
4.2.2 Data Description ...... 68
4.2.3 Methodology ...... 68
4.2.4 Results and Discussions ...... 73
4.2.5 Summary ...... 81
4.3 Case Study II – Adaptive Virtual Metrology of CMP Process ...... 82
4.3.1 Background ...... 82
4.3.2 CMP Process Introduction ...... 86
4.3.3 Methodology ...... 88
4.3.4 Results and Discussions ...... 100
4.3.5 Summary ...... 112
4.4 Case Study III – Battery Capacity Prognosis ...... 113
4.4.1 Background ...... 113
4.4.2 Methodology ...... 114
4.4.3 Data Description and Experiment Design ...... 121


4.4.4 Results and Discussions ...... 124
4.4.5 Summary ...... 128

5 Conclusions and Future Work 131

5.1 Conclusions ...... 131
5.2 Future Work ...... 134

References 137

Appendix A List of Publications in PhD Study 163

List of figures

2.1 General PHM analytics approach ...... 8
2.2 Traditional PHM modeling and deployment ...... 10
2.3 An illustration of two different types of concept drift in the context of classification problems. The samples with different colors are two classes, and the red dash line is the decision boundary: (a) original observed data (which can be seen as offline available data); (b) the observed data with real concept drift; (c) the observed data with virtual concept drift ...... 12
2.4 An illustration of change points in a time series, where scatter points are time series samples, and the horizontal lines indicate separate working regimes (which are different rotating speeds in this illustration) ...... 13
2.5 An illustration of the incremental learning process based on streaming samples (modified from Žliobaitė (2010)) ...... 16
2.6 Different learning mechanisms between traditional machine learning and transfer learning (Copyright: Pan and Yang (2009)) ...... 20
2.6 Intuitive cases for different learning method comparison ...... 25

3.1 Proposed adaptive PHM framework for offline initialization and online evolving ...... 33
3.2 Sample Importance Test (SIT) ...... 35


3.3 Alignment illustration: associate each element of sequence X to one or more elements of sequence Y and vice-versa; arrows show the desirable points of alignment ...... 38
3.4 DTW path illustration ...... 38
3.5 Data-based sample selection ...... 40
3.6 Model-based sample selection ...... 42
3.7 An illustration of dynamic weighted ensemble model ...... 49
3.8 Offline sample selection and modeling ...... 54
3.9 On-line sample selection and modeling ...... 56
3.10 Illustrations of SIT ...... 57
3.11 Diagram of a combined cycle power plant (CCPP) ...... 58
3.12 Flow chart of power prediction of CCPP ...... 60
3.13 Test settings with different train-test ratios ...... 60
3.14 Prediction RMSE comparison between the benchmarking method and proposed method ...... 61
3.15 Modeling time comparison between the benchmarking method and proposed method ...... 62
3.16 Sample size comparison ...... 62

4.1 Computer hard disk drive diagram (source: Wang et al. (2011)) ...... 67
4.2 Two typical variables with uncertainties from one healthy hard disk drive ...... 69
4.3 Data partition for model off-line training and on-line updating ...... 70
4.4 Flow chart of HDD fault diagnosis ...... 71
4.5 Off-line model result ...... 73


4.6 On-line model results when different online streaming samples are learned: (a) initial model performance from off-line modeling, no samples are learned; (b) model performance when streaming sample 4 is learned; (c) model performance when streaming sample 6 is learned; (d) model performance when streaming sample 18 is learned; (e) model performance when streaming sample 31 is learned; (f) model performance when streaming sample 40 is learned ...... 77
4.7 Prediction accuracy benchmarking with other algorithms ...... 80
4.8 Online sample selection benchmarking with other algorithms ...... 80
4.9 CMP process schematic diagram ...... 87
4.10 Metrology and Virtual Metrology in semiconductor manufacturing process ...... 87
4.11 Dynamic phenomena in PHM 2016 dataset: (a) recurring; (b) drifting ...... 88
4.12 Flowchart of Adaptive Virtual Metrology Modeling for MRR prediction in CMP ...... 94
4.13 Other VM methods for benchmarking ...... 95
4.14 The proposed methodology for off-line sample selection and model training ...... 96
4.15 The proposed methodology for on-line sample selection and model update ...... 96
4.16 Data separation for model training and testing ...... 100
4.17 The (a) hSI and (b) MRR prediction at the 50th iteration for data group 1. hSI = 0 means the sample is not selected; hSI = 1 means the sample is selected as a candidate. An example of a sample selected due to a large prediction error is given here ...... 103


4.18 The (a) hSI and (b) MRR prediction at the 50th iteration for data group 1. hSI = 0 means the sample is not selected; hSI = 1 means the sample is selected as a candidate. An example of a sample selected due to a large prediction uncertainty is given here ...... 104
4.19 Visualization of (a) hSI and (b) MRR prediction for the first 100 samples in data group 1 at the 100th iteration. All training samples can be categorized into 3 groups: samples with hSI = 0 failed to pass the SIT; samples with hSI = 1 passed the SIT in this iteration and are selected as candidates; the samples in DL were already selected into DL in previous iterations ...... 104
4.20 Visualization of (a) hSI and (b) MRR prediction for the first 100 samples in data group 1 at the 240th iteration. All training samples can be categorized into 3 groups: samples with hSI = 0 failed to pass the SIT; samples with hSI = 1 passed the SIT in this iteration and are selected as candidates; the samples in DL were already selected into DL in previous iterations ...... 105
4.21 Change of MSE for both training and testing samples over iteration steps when the split ratio is 0.6 ...... 105
4.22 The visualization of predicted MRR by the proposed method ...... 107

4.23 The visualization of (a) hSI, (b) dt for FT, (c) wt for ET, and (d) MRR predictions for one-step-ahead prediction ...... 108
4.24 The number of selected training samples ...... 109
4.25 The benchmarking of computation time ...... 109
4.26 Predicted MRR from proposed method vs actual MRR ...... 110
4.27 Plot regression of predicted MRR vs actual MRR ...... 110


4.28 Three GPR models for battery SoH prediction: (a) Single Input Single Output (SISO) GPR; (b) Multiple Input Multiple Output (MIMO) GPR; (c) Multiple Input Single Output (MISO) GPR ...... 116
4.29 Illustration of trajectory selection from DB by SIT and battery capacity prognosis based on up-to-date DL ...... 120
4.30 Normalized capacity trajectory of all battery samples ...... 121
4.31 Normalized capacity trajectory of offline available battery samples ...... 122
4.32 Normalized capacity trajectory of streaming-in battery samples ...... 122
4.33 Normalized capacity trajectory of online test samples ...... 123
4.34 Data partition and experiment design for model validation ...... 123
4.35 Normalized capacity prediction based on offline available trajectories (without trajectory selection) ...... 125
4.36 Normalized capacity prediction based on offline available trajectories (with trajectory selection) ...... 126
4.37 Normalized capacity prediction based on the proposed method: (a) after the 1st online model update when the 1st important streaming-in capacity trajectory passed the SIT; (b) after the 2nd online model update when the 2nd important streaming-in capacity trajectory passed the SIT; (c) after the 3rd online model update when the 3rd important streaming-in capacity trajectory passed the SIT ...... 127
4.38 RMSE comparison between M3 (incremental model) and M4 (proposed method). A pair of dashed and solid lines with the same color indicates the result comparison of the two models based on the same test. The vertical dashed lines represent M4 model updates when important trajectories stream in; e.g., the first vertical dashed line is located at x = 3, which means that the 3rd streaming-in trajectory passes the SIT and triggers model updating ...... 129


4.39 AEB comparison between M3 (incremental model) and M4 (proposed method). A pair of dashed and solid lines with the same color indicates the result comparison of the two models based on the same test. The vertical dashed lines represent M4 model updates when important trajectories stream in; e.g., the first vertical dashed line is located at x = 3, which means that the 3rd streaming-in trajectory passes the SIT and triggers model updating ...... 130

List of tables

3.1 SIT conclusions for example samples in Fig. 3.10 ...... 55
3.2 Detailed results comparison between benchmarking method and the proposed method ...... 63

4.1 Case study summary ...... 66
4.2 SMART attributes and their definitions ...... 67
4.3 Data description of 2016 PHM data challenge CMP data ...... 89
4.4 Typical VM methods adopted in the semiconductor manufacturing industry ...... 94
4.5 CMP VM data partition ...... 98
4.6 Design of experiments ...... 101
4.7 Benchmark of static VM models ...... 102
4.8 Benchmark of experiment B ...... 106
4.9 Benchmark with existing methods in literature ...... 111
4.10 Offline modeling results comparison between with and without sample selection ...... 125
4.11 Online prediction results by proposed method M4 ...... 126

Nomenclature

Acronyms / Abbreviations

AEB Area of Error Band

ANN Artificial Neural Networks

ARX Auto-Regressive eXogenous

AUC Area Under ROC

BLR Bayesian Linear Regression

CBM Condition-Based Maintenance

CCPP Combined Cycle Power Plant

CMP Chemical-Mechanical Planarization

CovF Covariance Function

CUSUM Cumulative Sum

CVD Chemical Vapor Deposition

DB Data Base

DBI Data-Based Importance


DevOps Development and Operations

DL Data Library

DOE Design of Experiment

DTW Dynamic Time Warping

ET Error Test

EWMA Exponentially Weighted Moving Average

FPR False Positive Rate

FT Freshness Test

GHSOM Growing Hierarchical Self-Organizing Map

GMDH Group Method of Data Handling

GPR Gaussian Process Regression

GSI Global Sampling Indicator

HDD Hard Disk Drive

HMM Hidden Markov Model

HVCB High-Voltage Circuit Breakers

IoT Internet of Things

JIT Just-in-Time

KerF Kernel Function

KNN k Nearest Neighbors


KPCA Kernel Principal Component Analysis

LASSO Least Absolute Shrinkage and Selection Operator

LDA Linear Discriminant Analysis

LiB Lithium-ion Batteries

LR Linear Regression

LW-PLS Locally Weighted Partial Least Square Regression

LWR Locally Weighted Regression

MA Matérn

MAPE Mean Absolute Percentage Error

MBI Model-Based Importance

MF Mean Function

MIMO Multiple Input Multiple Output

MISO Multiple Input Single Output

MLR Multiple Linear Regression

MMD Maximum Mean Discrepancy

MRR Material Removal Rate

MSE Mean Squared Error

MTBF Mean Time Between Failures

MTGP Multi-Task Gaussian Process


NLML Negative Log Marginalized Likelihood

NN Neural Nets

OGD Online Gradient Descent

PCA Principal Component Analysis

PF Particle Filters

PHM Prognostics and Health Management

PLSR Partial Least Square Regression

PSD Positive Semi-Definite

R2R Run-to-Run

RBF Radial Basis Function

RF Random Forest

RKHS Reproducing Kernel Hilbert Space

RMSE Root Mean Square Error

ROC Receiver Operating Characteristic Curve

RUL Remaining Useful Life

RVM Relevance Vector Machine

SE Squared Exponential

SGD Stochastic Gradient Descent

SISO Single Input Single Output


SIT Sample Importance Test

SMART Self-Monitoring, Analysis and Reporting Technology

SME Subject-Matter Expert

SoC State of Capacity

SoH State of Health

SUM Sequential Update Model

SVD Singular Value Decomposition

SVM Support Vector Machine

SVR Support Vector Regression

TPR True Positive Rate

VM Virtual Metrology

W2W Wafer-to-Wafer

Chapter 1

Introduction

“Don’t learn to do, but learn in doing.”
Samuel Butler (1912), novelist

In modern industry, the requirements of high reliability, high availability, and low risk of unexpected failure and downtime have become essential. Various condition monitoring and maintenance strategies and methodologies have been developed in both academia and industry over the last decade to ensure the high performance of industrial products and equipment (Zio, 2013). With the development of sensing technology and networked monitoring systems, condition-based maintenance (CBM) has been implemented as an efficient maintenance strategy to significantly reduce the high maintenance expense (Jardine et al., 2006; Liu et al., 2017a). Unlike earlier maintenance strategies such as unplanned maintenance, which takes place only when failure occurs (Yildirim et al., 2017; et al., 2009), or time-based preventive maintenance, which performs maintenance activities periodically regardless of the health condition of the targeted devices (Dekker, 1996; Jardine et al., 2006), CBM is a set of maintenance approaches which schedule maintenance actions based on the equipment's health status such that unnecessary maintenance tasks and costs can be avoided (Andrawus et al., 2007; Froger et al., 2016; Heng et al., 2009; Jardine et al., 2006; Lee et al., 2014; Peter et al., 2014; Yoon et al., 2019). Therefore, the evaluation of a device's current health status and the prediction of its future health status are two key aspects in the development of effective CBM activities (Ahmad and Kamaruddin, 2012; Jin et al., 2016).

According to IEEE standards (Sheppard et al., 2008), prognostics and health management (PHM) is an enabling technology which incorporates functions of condition monitoring, status assessment, fault or failure diagnostics, prognostics, and maintenance or operational decision support to maximize the operational availability and safety of the target system. Various works on the development and implementation of PHM systems, including theories and practical applications, have been published in the last decade, benefiting industrial applications. Among all the technologies in a PHM system, diagnosis and prognostics have become a major focus in both academia and industry given the crucial role they play in achieving zero-downtime performance.

To date, most industrial applications of failure diagnosis and prognostics are problem-focused scenarios of a single unit, piece of equipment, or manufacturing process in certain specific applications such as automotive, manufacturing, energy, process, aerospace, and so on (Davari Ardakani and Lee, 2018; Feng et al., 2019a,c; Garinei and Marsili, 2012; Jia et al., 2019; Jin et al., 2017; Kim et al., 2016; Lall et al., 2011; , 2017b). However, with the rapid advancement of sensor technology and the Internet of Things (IoT), a huge amount of real-time data is generated from various industrial applications, which brings new challenges to PHM in the context of big data streams (Ditzler et al., 2015; Feng, 2019; Feng et al., 2019b; Jia et al., 2018b; Lee et al., 2015; Yang et al., 2015).

On one hand, high-volume stream data places a heavy demand on data storage, communication, and PHM modeling. On the other hand, continuous change and drift are essential properties of stream data in an evolving environment, which requires diagnosis and prognosis techniques to be capable of capturing the new information in stream data adaptively and continuously. Moreover, current industrial systems have become much more complex than ever, being composed of multiple systems, sub-systems, and components with different usage patterns, ages, and working regimes (Gunes et al., 2014).

To address the gap between traditional PHM technologies and the growing needs of diagnosis and prognosis activities in modern industry, this research proposes a systematic framework of PHM methodology with adaptive learning capacity under a dynamic working environment. The feasibility of the proposed methodology is validated in four typical industrial applications: power forecasting of a combined cycle power plant, fault detection of hard disk drives, virtual metrology in semiconductor manufacturing processes, and prognosis of battery state of health. The result comparison between the proposed methodology and state-of-the-art benchmark methods indicates that the proposed methodology is capable of building an adaptive PHM with sustainable performance to deal with dynamic issues in processes, which provides a promising way to prolong the PHM model lifetime.

The layout of this dissertation is organized as follows:

• Chapter 1 introduces the background, motivation and research scope of this dissertation work.

• Chapter 2 reviews the methodology and analytics of PHM, summarizes the progress of current investigations of adaptive PHM practices and their existing issues, discusses the challenges of PHM modeling in a big stream data environment, and identifies the research gaps.

• Chapter 3 describes the proposed methodology for adaptive PHM modeling. The key elements and implementation procedures of the proposed methodology are proposed.

• Chapter 4 demonstrates the effectiveness and superiority of the proposed methodology with four typical industrial scenarios.

• Chapter 5 concludes the whole work, summarizes the broader impact, and highlights future research work.

Chapter 2

Literature Review and Related Works

In this chapter, research areas related to PHM and stream data mining, such as diagnosis and prognostics, stream data mining, sample selection, and online model updating, are reviewed. On one hand, the methodology of PHM, along with its major objectives and challenges, is overviewed. On the other hand, current research progress on stream data mining and related topics is overviewed as well. Further, online model learning and its capability to enhance PHM are investigated.

2.1 Overview of Prognostics and Health Management

Prognostics and Health Management (PHM) is an engineering discipline developed on the basis of widespread CBM deployment. Mature CBM technologies provide accurate process data, condition data, and other sensor data as input to the PHM system. The PHM system provides a systematic methodology to evaluate the health condition of a system or the critical components of interest, track their degradation process, and estimate the remaining useful life (RUL) before failure occurs (Kim et al., 2016). Specifically, PHM takes advantage of historical, current, and future information from the working environment, operational parameters, and process data of a system or piece of equipment to detect its degradation, diagnose its faults, and predict and proactively manage its failures (Zio, 2013). Unlike CBM techniques, which mainly schedule maintenance when certain indicators of decreasing performance or upcoming failure show up, PHM is capable of further improving the operation and maintenance cycle, optimizing the maintenance scheduling, and increasing the benefits over a lifetime (Tsui et al., 2015).

In general, there are three categories of approaches available for PHM modeling: (1) first-principle model-based approaches; (2) reliability model-based approaches; and (3) process sensor data-driven approaches (Zio, 2013). In first-principle model-based approaches, a mathematical model is usually established to describe the degradation process. The model is then deployed to predict the performance trend and estimate when failure will occur. These models perform well when the physical model is not very complex (Bole et al., 2014b; Liu et al., 2013), but it may turn out to be practically impossible to build an exact model representing the physical mechanism when the system becomes complex in a dynamic working environment. Reliability model-based approaches evolve from the discipline of reliability engineering, in which a reliability model is built to estimate reliability indicators such as mean time between failures (MTBF), reliability rate, etc. ( et al., 1993). These terms can be used to evaluate the current performance and predict the RUL (Jia et al., 2018b). For example, the exponential distribution is usually employed to describe the performance of electronic components (Coppola, 1984), and the Weibull distribution is usually used to describe a more general integrated system (Chookah et al., 2011). However, complete time-to-failure data is rare in real situations, especially for newly deployed or highly reliable equipment, which hinders the implementation of these approaches (Zio, 2009, 2013).
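The reliability indicators mentioned above follow directly from the chosen lifetime distribution. As a minimal illustrative sketch (not taken from the dissertation; function names are hypothetical), the MTBF of an exponential model with failure rate λ is 1/λ, while a Weibull model with shape k and scale η has mean η·Γ(1 + 1/k) and survival function R(t) = exp(−(t/η)^k):

```python
import math

def exponential_mtbf(failure_rate):
    # MTBF of an exponential lifetime model: the mean 1/lambda.
    return 1.0 / failure_rate

def weibull_mtbf(shape, scale):
    # Mean of a Weibull(shape k, scale eta) lifetime: eta * Gamma(1 + 1/k).
    return scale * math.gamma(1.0 + 1.0 / shape)

def weibull_reliability(t, shape, scale):
    # Survival probability R(t) = exp(-(t/scale)^shape).
    return math.exp(-((t / scale) ** shape))

# With shape = 1 the Weibull model reduces to the exponential model,
# so both formulas give the same MTBF.
assert abs(weibull_mtbf(1.0, 100.0) - exponential_mtbf(0.01)) < 1e-9
```

A shape parameter k > 1 models wear-out (increasing failure rate), which is why the Weibull model suits integrated mechanical systems better than the memoryless exponential model.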

Recently, process sensor data-driven approaches have attracted much attention with the advent of the Internet of Things (IoT) and big data analytics. These approaches do not rely on prior knowledge of the physical system. The model is developed to learn the system's behavior and to perform the prognostics mainly using process sensor data related to the change of system performance (Cai et al., 2020a,d; Li et al., 2018a, 2019a; Sutharssan et al., 2015). Various algorithms such as artificial neural networks (Chowdhury et al., 2013; Wang et al., 2019), support vector machines (Liu et al., 2017b; Patil et al., 2015), self-organizing maps (Jin et al., 2013; Yu, 2015), Gaussian process regression (Cai et al., 2019, 2020b,c; Liu et al., 2013; Pinson et al., 2009; Richardson et al., 2018), decision trees (Kimotho et al., 2013), etc. are adopted in many industrial applications. Several good summaries of data-driven PHM models and their applications are given in review papers from different perspectives (Lee et al., 2014; Sutharssan et al., 2015; Zio, 2009).

As presented in Fig. 2.1, a general PHM analytics approach typically has five steps: (1) task formulation; (2) data preparation; (3) PHM modeling; (4) PHM model evaluation; and (5) PHM deployment.
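The five steps above can be sketched as a skeletal pipeline. This is a hypothetical toy illustration, not the dissertation's implementation: the synthetic health series and the threshold rule stand in for real sensor data and a real PHM model.

```python
def formulate_task():
    # Step 1: define the target, e.g., flag degradation below a health threshold.
    return {"target": "health_below_threshold", "threshold": 0.5}

def prepare_data():
    # Step 2: in practice, cleaned sensor data; here a small synthetic series.
    return [1.0, 0.95, 0.9, 0.7, 0.55, 0.45, 0.3]

def build_model(task):
    # Step 3: the "model" is just a threshold rule in this sketch.
    return lambda x: x < task["threshold"]

def evaluate_model(model, data):
    # Step 4: count how many historical samples the model flags as degraded.
    return sum(model(x) for x in data)

def deploy(model, stream):
    # Step 5: apply the fixed model to new streaming samples.
    return [model(x) for x in stream]

task = formulate_task()
data = prepare_data()
model = build_model(task)
n_flagged = evaluate_model(model, data)  # 2 samples fall below 0.5
alerts = deploy(model, [0.6, 0.4])       # [False, True]
```

Note that in this traditional flow the model built in step 3 is frozen at deployment, which is precisely the limitation the adaptive methodology in this dissertation addresses.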

2.2 Recent Adaptive PHM Practices

It has been recognized that PHM is now widely used in various industrial areas as an effective and practical way to improve productivity and reduce repair and mainte- nance costs. However, the rapid advancement of the big data era brings new challenges. Large-volume and rapid-arrival stream data places a higher demand on PHM modeling,


Fig. 2.1 General PHM analytics approach

and continuous change and drift of stream data require the PHM model to be capable of capturing the new information in stream data adaptively and continuously. As shown in Fig. 2.2, in the traditional PHM modeling process, the model is built on the historical data available prior to deployment. Once the model is deployed, it is not adaptively changed. The performance of the model may gradually decrease if there are changes, drifts, or other dynamic issues in the working environment or the system itself (Ditzler et al., 2015).

To address this issue, researchers have investigated various approaches to enable the PHM model's learning ability in a stream data environment. Much research has been dedicated to retraining a local model to learn newly emerged patterns during data streaming. Chan et al. proposed a Just-in-Time (JIT) model for dealing with the dynamic changes in the semiconductor manufacturing process: once a new data point arrives, similar samples from the historical database are searched out for local model retraining (Chan et al., 2018). Pang et al. built a Gaussian process regression (GPR) model with a sliding window for anomaly detection in the data stream (Pang et al., 2014). Other studies discretize the environment parameters or health condition levels of the system and update the model once a new pattern is detected. Yu presented an adaptive Hidden Markov Model (HMM) for health assessment in bearing degradation monitoring (Yu, 2017a). The proposed approach dynamically measures the difference between the current condition and the baseline along with the monitoring and adds the newly found state into the HMM model. Ni et al. proposed an adaptive model based on kernel PCA (KPCA) and support vector machine (SVM) for fault diagnosis of high-voltage circuit breakers (HVCBs) in an online fashion (Ni et al., 2011). Ippoliti et al. proposed a model based on a growing hierarchical Self-Organizing Map (GHSOM) for abnormal activity detection in networks (Ippoliti and Zhou, 2010). The hierarchical GHSOM model was also employed by Yuan to build an enhanced system health assessment model (Di, 2018).
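The sliding-window idea behind these local retraining approaches can be illustrated with a minimal sketch. This is not the cited GPR method; a simple linear trend refit on the most recent window stands in for the local model, which is an assumption made purely for illustration:

```python
import numpy as np

def sliding_window_forecast(stream, window=20):
    # Refit a local linear trend on the most recent `window` samples and
    # predict one step ahead, mimicking sliding-window / JIT-style updating.
    preds = []
    for t in range(window, len(stream)):
        x = np.arange(window, dtype=float)
        y = np.asarray(stream[t - window:t], dtype=float)
        slope, intercept = np.polyfit(x, y, 1)    # refit on the latest window only
        preds.append(slope * window + intercept)  # one-step-ahead prediction
    return np.array(preds)

# A drifting series: the local model keeps tracking the new trend because
# old samples fall out of the window as the stream evolves.
series = np.concatenate([np.linspace(0, 1, 50), np.linspace(1, 3, 50)])
preds = sliding_window_forecast(series, window=10)
```

The trade-off is the one noted in the text: a small window adapts quickly to drift but forgets recurring patterns, which motivates the importance-based sample selection developed later in this dissertation.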

In summary, these approaches were developed for specific scenarios, so they are difficult to extend to other applications in a systematic way. Most of the publications mentioned above focus on system health assessment and anomaly detection during the manufacturing process, while little research has been performed on fault diagnosis and model performance prognosis, which means a systematic and comprehensive adaptive PHM model has not yet been thoroughly investigated.

2.3 Related Research Topics

In fact, model performance degradation is not a phenomenon unique to industrial processes. In general, slow and gradual process changes


Fig. 2.2 Traditional PHM modeling and deployment

(such as shift and drift) are observed in almost all time series data when using models trained from offline samples; therefore, the requirement of model adaptability along the process is ubiquitous. Theoretically, the development of an adaptive PHM framework can be categorized as an application of online learning in a dynamic environment with streaming time series sequences. Besides the above-mentioned investigations on model adaptability in the PHM area, several partially related topics have been studied in past years, such as change point detection, incremental learning, online learning, and transfer learning. One common ground between all these methods and adaptive PHM is that the streaming time series is expected to exhibit some changes (such as shift or drift) over time; in other words, some unlearned or changed information shows up in the streaming data after model implementation. Such a change of the data over time is defined as concept drift (Gama et al., 2014):

∃X : pt1(X, y) ≠ pt2(X, y), for t1 ≠ t2 (2.1)

where X is the data input, y is the corresponding label, t1 and t2 denote two different time points, and pt∗(X, y) denotes the joint distribution of X and y at time t∗.

More specifically, the concept drift can be categorized into two types:

• Real concept drift, which means that the conditional distribution pt1(y|X) ≠ pt2(y|X) for different time points t1 and t2. Here the change of the joint distribution is caused by the conditional distribution and is independent of the distribution of the input X.

• Virtual drift, which means that the variation of the joint distribution pt(X, y) is caused by a change of the distribution of the input data p(X), while the conditional distribution p(y|X) remains the same.

An illustration of concept drift is shown in Fig. 2.3. For real concept drift (Fig. 2.3(a) −→ Fig. 2.3(b)), the input distribution p(X) remains the same while the conditional distribution p(y|X) changes, which is reflected as a change of the decision boundary. For virtual concept drift (Fig. 2.3(a) −→ Fig. 2.3(c)), the conditional distribution p(y|X) remains the same, which is reflected as an unchanged decision boundary; however, the input data distribution p(X) becomes different.
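The two drift types can be made concrete with a small synthetic sketch (illustrative only; the generating distributions, decision boundary, and shift value below are our assumptions, not data from this work). Real drift changes the labeling rule p(y|X) while p(X) is fixed; virtual drift shifts p(X) while the labeling rule is fixed.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_original(n):
    # Original concept: inputs from N(0, 1), label given by boundary x1 + x2 > 0.
    X = rng.normal(0.0, 1.0, size=(n, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    return X, y

def sample_real_drift(n):
    # Real drift: p(X) unchanged, but the labeling rule p(y|X) has changed
    # (new decision boundary x1 - x2 > 0).
    X = rng.normal(0.0, 1.0, size=(n, 2))
    y = (X[:, 0] - X[:, 1] > 0).astype(int)
    return X, y

def sample_virtual_drift(n):
    # Virtual drift: p(X) shifts (mean moves to 1.5), while the labeling
    # rule p(y|X) is exactly the one from sample_original.
    X = rng.normal(1.5, 1.0, size=(n, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    return X, y
```

A model trained on `sample_original` data would misclassify under the first scenario even where inputs look familiar, and would extrapolate into unseen input regions under the second.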

To deal with this drift issue, the methods mentioned above (change point detection, incremental learning, online learning, transfer learning, etc.) investigate solutions from different aspects. For example, change point detection methods focus on how to find pattern changes in the streaming data, and the outcomes can be further used for anomaly detection or model updating; transfer learning tries to generalize the model to improve its performance on unseen data; and online learning and incremental learning investigate how to fuse the streaming information into the model.


Fig. 2.3 An illustration of two different types of concept drift in the context of classification problems. The samples with different colors belong to two classes, and the red dashed line is the decision boundary: (a) original observed data (which can be seen as offline available data); (b) the observed data with real concept drift; (c) the observed data with virtual concept drift.


Change point detection

Change point detection is a family of methods for detecting variations in time series data. The variations here usually refer to sudden changes or obvious regime/state/condition changes. Change point detection has proven useful in application areas such as healthcare, finance, meteorology, epidemiology, climate change, speech analysis, image recognition, and process quality control (Aminikhanghahi and Cook, 2017).

Fig. 2.4 An illustration of change points in a time series, where scatter points are time series samples, and the horizontal lines indicate separate working regimes (which are different rotating speeds in this illustration).

An illustration of a typical time series of machine tool speed signals with several different conditions is shown in Fig. 2.4, where the plot shows the rotating speed changing across the samples of a time sequence. The recognized change points help divide the whole sequential signal into different working regimes. Once a regime change is recognized, further actions will be performed to maintain the model performance, such as switching to the corresponding model for the same working regime, which is usually known as stitch modeling (Chan et al., 2018), or adjusting the offset based on the regime difference, which is usually known as feature calibration and alignment (Nehrkorn et al., 2003; Shelly, 2017). The main purpose of change point detection algorithms is to localize these change points in order to separate the different conditions.

The assumption in change point detection algorithms is that there exists a distribution change before and after the change point. Recalling the definition of concept drift in Eq. 2.1, change point detection aims to detect and locate the change point Xτ at time τ such that:

pt≤τ (X, y) ≠ pt>τ (X, y) (2.2)

Therefore, most change point detection methods use certain statistical indicators to describe the time series behavior, so that the locations of change points can be captured once the change in the statistical indicators exceeds a certain level. In general, change point detection methods are categorized as parametric or non-parametric. Popular parametric change point detection methods include the Shewhart chart, the Cumulative Sum (CUSUM), hypothesis tests based on different distributions, etc.; more details about parametric change point detection models can be found in (Tartakovsky et al., 2014). Non-parametric methods are motivated by the idea that the change point detection algorithm should be free of prior distributions defined in advance, and most of them are based on kernel-based non-parametric statistics (Li et al., 2015).
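As a concrete sketch of the parametric family, a minimal two-sided CUSUM detector can be written as follows (the function name and the threshold/drift parameters are our illustrative choices, not values from the cited works):

```python
def cusum(series, target_mean, threshold, drift=0.0):
    """Two-sided CUSUM: return the index of the first detected change, or -1.

    Accumulates deviations of the series from a reference (target) mean;
    an alarm is raised when either cumulative sum exceeds the threshold.
    The drift term suppresses alarms on small, in-control fluctuations.
    """
    s_hi = s_lo = 0.0
    for i, x in enumerate(series):
        s_hi = max(0.0, s_hi + (x - target_mean) - drift)  # upward shifts
        s_lo = max(0.0, s_lo - (x - target_mean) - drift)  # downward shifts
        if s_hi > threshold or s_lo > threshold:
            return i
    return -1
```

For a mean shift from 0 to 2 at sample 50, this indicator crosses the threshold a few samples after the shift; the detection delay versus false-alarm tradeoff is governed by `threshold` and `drift`.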

From the viewpoint of process monitoring, change point detection methods are divided into online detection based on continuous sampling and offline detection based on offline (historical) data. The former refers to continuously observing a streaming sequence and triggering an alarm when a change is detected, which is mainly employed for abnormal-condition warning, while the latter is mainly employed for change isolation and localization in an already obtained time series in an offline fashion. Therefore, in the context of PHM applications, change point detection can be adopted for abrupt regime change detection and anomaly detection in online monitoring. As it does not include further actions other than the event alarm, its potential application to adaptive PHM is limited to inspecting the informativeness of online samples from the streaming sequence, which helps figure out when and where to perform the model updating operation.

As mentioned above, change point detection methods are supposed to capture abrupt changes, whether caused by a working regime shift or an obvious environment change. As shown in Fig. 2.4, they have proven useful for piecewise-stationary scenarios with abrupt changes. However, in many industrial processes, gradual changes instead of abrupt changes are more commonly observed, such as in tool wear processes or the control parameters of semiconductor manufacturing processes. Compared with piecewise-stationary working regimes, these scenarios bring more challenges to model updating, and it is difficult to apply change point detection methods to find out accurately and effectively when and where to trigger the model updating.

Incremental learning and online learning

Incremental learning refers to a model that is capable of learning/retraining incrementally when new samples or batches arrive. Based on Gama et al. (2014), incremental learning is defined as:

"Incremental algorithms process input examples one-by-one and update the sufficient statistics stored in the model after receiving each example. Incremental algorithms may

15 Literature Review and Related Works

Fig. 2.5 An illustration of incremental learning process based o streaming samples (modified based on Žliobait˙e(2010))

16 2.3 Related Research Topics have random access to previous examples or selected examples."- Gama et al.(2014)

Online learning is a special type of incremental learning in which the samples are usually discarded after processing (Gama et al., 2014); this action is also called one-shot processing in machine learning. An informal difference between online and incremental learning is that online learning also includes the concept of decremental learning, which means removing the least informative samples from the model training. Most publications on incremental and online learning address classification problems; they update the current model in an error-driven way, depending on whether it misclassifies the current sample.

According to Hoi et al. (2014), a typical online learning algorithm for classification is presented below:

Algorithm 1: Online Learning Framework for Linear Classification
1: procedure
2:   for t = 1, 2, ..., T do
3:     The learner receives an incoming instance: xt ∈ X;
4:     The learner predicts the class label: ŷt = sgn(f(xt; wt));
5:     The true class label is revealed from the environment: yt;
6:     The learner calculates the suffered loss: ℓ(wt; (xt, yt));
7:     if ℓ(wt; (xt, yt)) > 0 then
8:       The learner updates the classification model:
9:         wt+1 ← wt + ∆(wt; (xt, yt))
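Algorithm 1 can be instantiated directly with the classic Perceptron update (a minimal sketch; the class and parameter names are ours, and Δ here is the standard Perceptron rule rather than a method from the cited works):

```python
import numpy as np

class OnlinePerceptron:
    """Error-driven online linear classifier instantiating the framework above."""

    def __init__(self, n_features, lr=1.0):
        self.w = np.zeros(n_features)  # w1 initialized to zero
        self.lr = lr

    def predict(self, x):
        # Step 4: predict the class label, ŷt = sgn(f(xt; wt))
        return 1 if np.dot(self.w, x) >= 0 else -1

    def partial_fit(self, x, y):
        # Steps 5-9: receive the true label, update only when a mistake occurs
        y_hat = self.predict(x)
        if y_hat != y:                   # suffered loss > 0
            self.w += self.lr * y * x    # wt+1 <- wt + Δ(wt; (xt, yt))
        return y_hat
```

Each sample is processed once and can be discarded afterwards, which is exactly the one-shot property distinguishing online learning from generic incremental learning.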

Compared with offline modeling, incremental/online learning sacrifices some memory and computational resources while keeping the model updated. Several typical incremental/online algorithms have been investigated. The classic Perceptron introduced by Rosenblatt (1958) can be considered one of the earliest algorithms with online learning capability; as a neural network, it is convenient to develop its online weight-updating form. Similar algorithms include Online Gradient Descent (OGD) (Zinkevich, 2003), Stochastic Gradient Descent (SGD) (Shamir and Zhang, 2013), the Second-order Perceptron (Cesa-Bianchi et al., 2005), etc. Another branch of incremental/online learning methods consists of Bayesian algorithms such as Naive Bayes (Gumus et al., 2014), recursive linear regression (Särkkä, 2013), online Gaussian processes (Csató and Opper, 2002), and Bayesian deep learning (Wang et al., 2009), which realize a sequential updating mechanism by continually updating the model posterior as new observations come in. Other methods with incremental updating capability include ensemble models such as online random forests (Wang et al., 2009), since the voting mechanism makes them easy to develop in an online form.
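The sequential Bayesian updating idea behind recursive linear regression can be sketched as a Kalman-style recursive least squares update (an illustrative implementation under assumed Gaussian prior and noise; the class name and hyperparameters are ours):

```python
import numpy as np

class RecursiveLinearRegression:
    """Bayesian recursive least squares: the weight posterior (mean, cov) is
    updated one sample at a time, so no past data need to be revisited."""

    def __init__(self, n_features, prior_var=100.0, noise_var=1.0):
        self.mean = np.zeros(n_features)           # posterior mean of weights
        self.cov = prior_var * np.eye(n_features)  # posterior covariance
        self.noise_var = noise_var                 # observation noise variance

    def update(self, x, y):
        x = np.asarray(x, dtype=float)
        s = x @ self.cov @ x + self.noise_var   # predictive variance at x
        k = (self.cov @ x) / s                  # Kalman-style gain
        self.mean = self.mean + k * (y - x @ self.mean)
        self.cov = self.cov - np.outer(k, x @ self.cov)

    def predict(self, x):
        return float(np.asarray(x, dtype=float) @ self.mean)
```

The posterior after each sample serves as the prior for the next one, which is precisely the "keep updating the model posterior" mechanism described above; the covariance also yields a predictive variance usable later for sample selection.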

The advantages of the incremental/online learning mechanism over traditional offline modeling can be summarized as: model scalability when new data come in; adaptability to dynamic environments; simple algorithm development and implementation; etc. However, there are still some crucial issues to deal with before applying the developed algorithms directly in the PHM field:

• Most attention is drawn to theoretical analysis and classification problems, while regression-related methods are not well discussed.

• Most of the current methods trigger the learning operation based on the prediction error; more criteria should be considered and designed for model updating.

• How to select the most representative samples while keeping the trade-off between efficiency and effectiveness is not well discussed.

• Besides the online learning part, how to integrate the online learning mechanism with offline model initialization also needs further investigation.

• What's more, how to effectively combine already-learned knowledge with streaming-in samples to improve learning performance needs to be studied systematically.


• Besides the prediction error and update cost, some other PHM-specific model performance evaluation considerations, such as robustness, stability, confidence level, and interpretability, need to be adopted for modeling.

Transfer learning

Transfer learning is a popular topic that has recently received tremendous attention in both theory development and application investigations. Similar to the methods mentioned above, transfer learning targets modeling problems related to distribution changes. Fig. 2.6 from Pan and Yang (2009) illustrates the different learning mechanisms of traditional machine learning and transfer learning. The former assumes that the data from different tasks (conditions/regimes) have the same distribution, which degrades performance when the testing data have a different distribution, while transfer learning divides the data into two domains: source and target. Usually, the data from the source domain are labeled, but the data from the target domain are rarely or not labeled, and the transfer learning model is expected to recognize and apply knowledge and skills learned in the labeled source domains/tasks to the target domains/tasks (Pan et al., 2012). Note that, as shown in Fig. 2.6, some researchers use the terms source and target "task" rather than "domain" in the classic literature, but by now the latter terminology is more commonly used (Chen and Liu, 2018). Compared with the assumption of concept drift in Eq. 2.1, the assumption of transfer learning mainly focuses on distribution changes between different domains rather than between different time horizons:

p(XS, yS) ≠ p(XT , yT ) (2.3)

The procedure of transfer learning can be summarized in 3 steps (Taylor and Stone, 2009):


Fig. 2.6 Different learning mechanism between traditional machine learning and transfer learning (Copyright: Pan and Yang(2009))

1. For a given task in target domain, select an appropriate source task or set of tasks from which to transfer.

2. Learn how the source and target domain are related based on certain similarity metrics.

3. Perform knowledge transformation from the source domain to the target domain.

There are different options for the knowledge/information that can be transferred from the source domain to the target domain for performance improvement. Based on the survey paper by Pan and Yang (2009), transfer learning algorithms are categorized into 4 classes:

• Instance-based transfer learning, which uses certain parts of the instances from the source domain based on instance reweighting rules and an importance sampling strategy, and then transfers them to the target domain. This type of method is similar to the Just-in-Time model and usually works well when the source and target domains have obvious overlap (Bickel et al., 2007; Dai et al., 2007).


• Feature-based transfer learning, which uses feature transformation to reduce the difference between the source and target domains ( and Yang, 2011), or designs certain domain-invariant feature projection functions to project the features from different domains into the same feature space (Li et al., 2018b; Pan et al., 2010).

• Model-based transfer learning, which takes advantage of the model parameters learned from the source domain as the initial condition of modeling in the target domain (Nater et al., 2011; Zhao et al., 2010). Given that neural nets are easily tuned by weight updating, many neural-net-related transfer learning methods belong to this class; e.g., Yosinski et al. (2014) investigated how to realize transfer learning in deep neural nets and discovered some characteristics of deep neural nets' transferability.

• Relation-based transfer learning, which focuses on investigating relations between the source and target domains, and has rarely been studied so far (Davis and Domingos, 2009).
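The instance-based class above can be sketched concretely: source samples are reweighted by an estimated density ratio p_target(x)/p_source(x). One common trick, assumed here purely for illustration (it is not the method of the cited papers), trains a logistic domain classifier and converts its odds into importance weights:

```python
import numpy as np

def instance_weights(source_X, target_X, n_iters=500, lr=0.1):
    """Estimate importance weights p_target(x)/p_source(x) for source samples
    with a simple logistic domain classifier (source=0, target=1), trained by
    batch gradient descent. Illustrative sketch only."""
    X = np.vstack([source_X, target_X])
    d = np.concatenate([np.zeros(len(source_X)), np.ones(len(target_X))])
    Xb = np.hstack([X, np.ones((len(X), 1))])   # add a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))       # P(domain = target | x)
        w -= lr * Xb.T @ (p - d) / len(d)       # logistic-loss gradient step
    Xs = np.hstack([source_X, np.ones((len(source_X), 1))])
    p_src = 1.0 / (1.0 + np.exp(-Xs @ w))
    # density ratio via the classifier's odds, corrected for sample sizes
    return (p_src / (1.0 - p_src)) * (len(source_X) / len(target_X))
```

Source samples lying in regions the target domain covers densely receive large weights and dominate the reweighted training objective, which is why this family works best when the two domains overlap substantially.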

Although research on transfer learning has been popular recently and has shown potential in PHM-related applications such as diagnosis and prognosis of machine tools (Davis and Domingos, 2009), aircraft engines ( et al., 2018), etc., there are still many problems to solve before transfer learning can be employed for adaptive PHM.

• First, recall the concept drift phenomenon mentioned above: change point detection methods detect and locate abrupt condition-change positions, and incremental/online learning methods enable the model's adaptability to deal with unexpected in-process changes, but transfer learning is designed for transferability between different conditions (domains). Usually, the regimes (domains) here refer to stationary conditions. Currently, most transfer learning methods are offline models in which both the source and target domains are given and no in-process domain changes are expected; transfer learning is mainly used to build an offline model with good generalization ability. The disadvantage of such a setting is crucial: the model is not capable of learning and updating based on the streaming sequence.

• Second, a strong assumption of transfer learning is that the target domain has very little or no label information at all, which means that the model is not capable of learning from model feedback or subject-matter expert (SME) knowledge interactively.

• Third, most current transfer learning research focuses on classification problems, which are suitable for fault diagnosis in the context of PHM, while prognosis- and virtual-metrology-related problems are not well investigated.

• Fourth, besides the model accuracy, more metrics are needed to evaluate the model performance.

Lifelong learning

The motivation of lifelong learning is to build a machine learning model, similar to human learning, that overcomes the shortcoming of traditional machine learning models of focusing only on the given dataset without considering previously learned knowledge (Chen and Liu, 2018). Note that the term "lifelong learning" has different meanings in different pieces of research; for example, sometimes the model development, operation, and management (DevOps) are also considered part of the connotation of lifelong learning. A formal definition of lifelong learning is given by Silver et al. (2013):

"Lifelong Machine Learning considers systems that can learn many tasks over a lifetime from one or more domains. They efficiently and effectively retain the knowledge they have learned and use that knowledge to more efficiently and effectively learn new

22 2.3 Related Research Topics tasks"- Silver et al.(2013)

Apart from characteristics such as a continuous learning process and knowledge accumulation, the following characteristics make lifelong learning different from incremental/online learning and transfer learning (Chen and Liu, 2018):

• the ability to discover new tasks;

• the ability to learn while working or to learn in process

Therefore, lifelong learning aims to develop a more intelligent model with self-learning capability in an open dynamic environment. Although incremental/online learning is also capable of updating the model based on the streaming sequence, their objectives are quite different: incremental/online learning keeps the task the same while incrementally improving performance, whereas lifelong learning generalizes the model to deal with different tasks. More investigation in the lifelong learning area is needed to improve its theoretical maturity.

To make the comparison among the different learning methods more intuitive, some illustrative cases are shown in Fig. 2.6. Fig. 2.6 (a) shows the data drift of machine A over time. Assume the first flat segment at the very beginning is the offline available data, from which an offline model can be trained. The change point detection method is capable of detecting the abrupt change highlighted with a gray box, but is incapable of detecting the incremental or gradual changes highlighted with the green box and the orange box, respectively. After a concept drift is detected from the time sequence, incremental or online learning methods can be utilized for model updating to guarantee scalability and model performance.


Different from Fig. 2.6 (a), which depicts the variance/drift over time, Fig. 2.6 (b) depicts a change across machine units, that is: assume we have some information about machine A; how do we transplant the knowledge we have learned to machine B? Here we can take machine A as the source domain and machine B as the target domain, and transfer learning can be adopted for the information transplant. However, the modeling and transplant here are offline-based, which means the variation along with time is usually not considered.

Fig. 2.6 (c) considers the change over a longer time horizon. The working task/regime of machine A from the beginning to time T can be considered the same. After a period with different working tasks, lifelong learning solves how the model works with a new task.

Sample selection strategy

Apart from investigations of incremental model updating, the selection of important samples, which determines when the model should be updated, is a practically meaningful topic. A passive model updating mechanism that simply takes every incoming sample for model updating is time- and resource-consuming, while updating the model only after a long interval cannot guarantee that the model is updated in a timely manner. Therefore, it is necessary to study how to perform sample selection with consideration of both effectiveness and efficiency. However, the adaptive modeling methods mentioned above pay little attention to sample selection and simply take the wrongly classified samples for model updating. The different sample selection methods with industrial applications are summarized below. The commonly used sampling strategies can be categorized into the following 3 classes: static sampling, adaptive sampling, and dynamic sampling.

(a)

(b)

(c)

Fig. 2.6 Intuitive cases for the comparison of different learning methods

Static sampling strategy. The static sampling strategy refers to fixed sampling rules designed prior to actual implementation, which has widespread applications because of its simplicity (Guldi, 2004). However, the static sampling strategy has obvious drawbacks, such as its inability to track dynamic drift promptly. Different from static sampling, the adaptive sampling strategy is based on rules adjusted at the beginning of each production run to deal with potential drift during the manufacturing process.

Adaptive sampling strategy. The key point of the various adaptive sampling approaches developed recently is to determine the sampling plan based on in-process parameters which can reflect the dynamic changes in manufacturing processes. Mouli and Scott (2007) proposed a systematic architecture for adaptive metrology sampling which adaptively changes sampling amounts based on previous observations of fab events and process control feedback. Lee (2002) proposed a methodology to change the sampling strategy adaptively if pattern changes are detected by a neural network model. Although the adaptive sampling strategy is capable of capturing the process dynamics in a timely manner, the complexity of the algorithms and the uncertainty of the resource usage make it difficult to implement in real industrial applications (Kang and Kang, 2017; Munga et al., 2011).

Dynamic sampling strategy. To deal with the issues of the adaptive sampling strategy mentioned above, the dynamic sampling strategy has appeared recently (Lee, 2008; Wan et al., 2013; Wu and Pearn, 2008). Different from adaptive sampling, the dynamic sampling rules are not defined at the start of each production run but are determined dynamically during manufacturing based on the in-process information and metrology capacity.


Thus the dynamic sampling strategy realizes a better tradeoff between the cost of metrology and product quality. Some recent publications illustrate the superiority of the dynamic sampling strategy compared with static and adaptive sampling methods (Lee, 2008; Wu and Pearn, 2008). The key point of dynamic sampling is to determine the best samples for sampling based on certain sampling rules or indicators which can indicate the instability of the process. Sun and Johnson (2008) proposed a multi-stage approach with a score-boarding algorithm to determine an optimal wafer sampling strategy which effectively selects a small subset of wafers yet maximizes the group coverage of process behaviors. Dauzere-Péres et al. (2010) proposed an indicator named the Global Sampling Indicator (GSI) to dynamically determine whether to select or skip each lot so as to minimize the tool risk, which was validated on real industrial data with significantly improved results.

What's more, several dynamic sampling strategies inspired by active learning techniques have recently been proposed to prioritize the important samples for process monitoring and control (, 2016; Kang and Kang, 2017; Wan et al., 2013; Wu et al., 2019). The key motivation of active learning is that the model's performance can be improved with fewer labeled samples if the model is allowed to choose the training data (usually via an expert in an interactive way) from which it learns (Settles, 2012). Kang and Kang (2017) proposed an intelligent virtual metrology system to select the samples with low reliability for estimation, in which an ensemble of neural networks is employed as the VM model and the variance of the predictions from the different neural networks is adopted to quantify reliability. Wan et al. (2013) proposed a dynamic sampling method for plasma etch processes with a GPR model, in which a sampling rule was set based on the prediction variance.
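The variance-based selection rule common to these works can be sketched generically (a hypothetical helper; the function name, the ensemble of callables, and the threshold are our assumptions): samples on which an ensemble disagrees most are flagged for real metrology and later model updating.

```python
import numpy as np

def select_for_metrology(X_stream, models, var_threshold):
    """Dynamic sampling sketch: flag stream samples whose ensemble prediction
    variance exceeds a threshold, i.e. samples where the current model is
    least reliable and real metrology (labeling) is most valuable.

    models: a list of fitted predictors, each a callable mapping an
    (n_samples, n_features) array to an (n_samples,) prediction array.
    """
    preds = np.stack([m(X_stream) for m in models])  # (n_models, n_samples)
    variance = preds.var(axis=0)                     # disagreement per sample
    selected = np.where(variance > var_threshold)[0]
    return selected, variance
```

Raising `var_threshold` reduces metrology cost at the risk of missing drifting regions, which is exactly the cost-versus-quality tradeoff dynamic sampling tries to balance.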


2.4 Challenges and Research Gaps

The challenges of adaptive PHM modeling in a big stream data environment are summarized as follows:

1. How to effectively select representative samples from the dynamic working environment? The majority of research focuses on improving model performance through feature engineering, such as enhancing weak degradation-related indicators through signal processing techniques ( et al., 2019) or feature selection using statistical indices such as the Fisher criterion (Chiang et al., 2004), F1 score (Lipton et al., 2014), information gain ratio (Raileanu and Stoffel, 2004), monotonicity (Coble and Hines, 2009), trendability (Jia et al., 2017), etc. But few pieces of research focus on improving the model's performance through sample selection. In the context of a big data environment, a mass of data is redundant for PHM modeling. For example, for a machine tool in very good health condition, the process sensor data gathered in the first two months may have very similar behaviors; it is inefficient to use up all the data to build a health assessment model, and the model's generalization capacity is impaired (Kulkarni and Harman, 2011). It has been proven that the generalization capacity depends on the geometrical characteristics of the training data rather than on the dimensionality of the feature space (Baudat and Anouar, 2003; Vapnik, 1998). If we can select representative samples that capture the characteristics of the data space, not only is the modeling efficiency improved, but the model's generalization capacity can also be enhanced.

2. How to build an adaptive model which can capture new information from streaming data continuously? Although some automated machine learning algorithms are emerging (Feurer et al., 2015), the study of an adaptive model with self-learning capacity, especially in real applications, is still at a very early stage. Additionally, stream data with a rapid arrival speed bring more challenges to this topic, requiring extra considerations such as how to develop a query strategy to capture the most informative samples and how to balance the tradeoff between the model's efficiency and its fast-learning ability.

3. How to combine both off-line modeling and online learning in a unified framework for the PHM model? According to state-of-the-art research, off-line modeling and on-line modeling are almost always considered separately (Ge, 2016; Lughofer and Pratama, 2017; Tang et al., 2018; Wu et al., 2019). Researchers consider either off-line modeling without regard for modeling efficiency and the model's sustainability, or on-line modeling without regard for absorbing prior knowledge from the available model. The main purpose of this research is to build a unified framework combining the off-line model and the online model, which could enhance the model's performance the most. On one hand, an online model with self-learning ability processes samples and updates the model incrementally, which itself has the efficiency property (Krempl et al., 2014). On the other hand, the off-line model can be built with more time- and resource-consuming techniques, as it does not have the harsh efficiency requirements of the online model.

From the literature review, several research gaps are identified:

1. As mentioned before, limited research concerns combining sample selection, off-line model training, and online model learning into a unified framework.

2. Limited research has investigated PHM modeling and online updating in the context of stream data in a dynamic work environment. With increasing system complexity, dynamic issues become inherent to the system. To guarantee the modeling performance over a life-long period, it is necessary to investigate an online model learning mechanism to address this issue.

3. The integrated modeling of sample selection strategies and adaptive model updating mechanisms in the PHM area is not well studied. Most publications focus on anomaly detection in networks or social media data. In PHM applications, besides anomaly detection, diagnosis and prognostics are two more important topics (Sheppard et al., 2008). However, the integrated modeling of sample selection strategies and adaptive model updating for diagnosis and prognostics is not well studied.

Chapter 3

Development of Adaptive PHM Methodology

This chapter proposes the methodology of adaptive prognostics and health management with streaming data in a dynamic work environment. The objective of the proposal is presented first, and a systematic framework of the proposed method is then introduced. The major tasks, including off-line sampling and modeling techniques, the on-line sample selection strategy, and the online model update mechanism, are described as well.

3.1 Framework of Adaptive PHM Methodology

The objective of this research is to develop a systematic methodology of adaptive prognostics and health management which is capable of evolving with the new information brought by streaming data during the online deployment process in a dynamic working environment. As shown in Fig. 3.1, the proposed methodology has 3 major modules:

1. Off-line sampling and modeling: this module selects the important samples from the original database (DB) and trains an initial PHM model.


2. On-line sampling: the on-line sampling module determines the importance of each arriving streaming sample and decides whether this new sample needs to be imported to update the data library (DL).

3. Model update: in this step, an online model self-learning module is developed, which enables the current model to learn from the selected samples sequentially and adapt to dynamic changes without model retraining. Model retraining is a commonly adopted technique to address dynamic issues in PHM modeling, but in the current big streaming data environment this approach is too time- and resource-consuming.

Two concepts, DB and DL, are introduced in this work. The DB here is not strictly the database of data science ( et al., 2011), but generally denotes the set of all collected historical data. The DL in the proposed method is much like the support vector set in sparse kernel machines such as the SVM (Bishop, 2006; Saunders et al., 1998): the important data points that determine the hyperplane and the margin between two classes (Saunders et al., 1998; Smola and Schölkopf, 2000; Vapnik, 1998), so that predictions for test samples depend only on the kernel function evaluated at a reduced subset of the training points (Bishop, 2006). In this research, the PHM model is developed directly from the DL, so the tedious training process over a huge DB full of redundant historical samples is avoided. As mentioned above, gradual behavior shift or drift naturally exists in industrial processes, especially after the model has been deployed for a long run. It is inefficient to retrain the PHM model whenever new samples stream in, and unrealistic to discard the model once it is found to perform worse than it used to. In the proposed method, whether the model needs updating is determined by monitoring changes of the DL. A sample selection strategy checks whether the new coming-in sample is informative

or not, which determines whether it will be included in the DL. Once the DL is updated, the model updating mechanism is triggered; otherwise the model stays the same no matter how many new samples stream in. In this way, the model can always follow the data behavior automatically while avoiding overly frequent updating or retraining.

Fig. 3.1 Proposed adaptive PHM framework for offline initialization and online evolving.

Compared with the conventional PHM model, the advantages of the proposed methodology are summarized as follows.

1. The data volume for modeling is significantly reduced and training efficiency is boosted by the off-line sampling and modeling module, while the model performance is only marginally affected;

2. Fast model selection and performance benchmarking are side benefits as results of the boosted training efficiency, which is proven instructive for industrial scenarios;

3. The built model is kept updated by the online model update module once an important sample is identified during online monitoring. Thus, enhanced model performance is expected when dynamic issues, such as slow drift, a changing environment, or system noise, exist in the streaming data.

4. The proposed methodology provides a good trade-off between model efficiency and accuracy.

3.2 Sample Selection Strategy of Adaptive PHM

In the proposed methodology, the sample selection operation is realized by a developed sampling strategy named the Sample Importance Test (SIT). The SIT determines whether a new coming-in sample is informative by comparing it with the current elements in the most up-to-date DL. Several sample selection methods have been presented in the literature on how to learn new data points for modeling (Mohamad et al., 2018; Tang et al., 2018; Wan et al., 2013; Wang et al., 2010; Wang and , 2015; Wu et al., 2019; Yu and Kim, 2010). In these works, sample selection is guided either by prediction accuracy or by density estimation over the inputs of the data points. As shown in Fig. 3.2, the SIT proposed in this work is composed of two tests based on both considerations. The data-based importance (DBI) test determines the informativeness of the new sample from the characteristics of the data itself; to keep the terminology general, the DBI test is called the freshness test (FT), as it evaluates whether the sample is "fresh" compared with previously seen samples. The model-based importance (MBI) test determines the informativeness of the new sample from the model's prediction performance on it. The prediction performance here has two aspects: (1) whether the prediction is accurate; and (2) whether the prediction has low uncertainty. Correspondingly, the MBI test is called the error test (ET), which identifies whether the sample has a large prediction error or a high prediction uncertainty.

Fig. 3.2 Sample Importance test (SIT)

FT is based on similarity to evaluate whether a coming-in sample is "fresh". Various distance metrics have been developed to describe the similarity between elements, samples, time series, or populations. Calculating the distance or similarity between two samples (points, vectors, matrices, tensors, etc.) is the basis of many machine learning algorithms, and a good distance metric largely determines their performance. For example, the KNN classifier is very sensitive to the distance metric; some multivariate time series anomaly detection methods perform better with the Mahalanobis distance than with the Euclidean distance; and a GPR model's performance relies heavily on the kernel function choice and hyper-parameter tuning. Therefore, similarity and distance metrics are very important in machine learning. Some frequently used measures in data-driven methods are given here:

1. Euclidean distance

In Cartesian coordinates, let $x_1 \in \mathbb{R}^d$ and $x_2 \in \mathbb{R}^d$ denote d-dimensional vectors (points) in Euclidean d-dimensional space. The Euclidean distance between $x_1$ and $x_2$ is:

$$d_{\mathrm{Euc}}(x_1, x_2) = \sqrt{(x_1 - x_2)^\top (x_1 - x_2)} \tag{3.1}$$

2. Minkowski distance

Minkowski distance, also known as the p-norm distance, is a general metric on a normed vector space that generalizes both the Euclidean distance and the Manhattan distance. Again, let $x_1 \in \mathbb{R}^d$ and $x_2 \in \mathbb{R}^d$. The Minkowski distance of order p (where p is an integer) between the two points $x_1$ and $x_2$ is:

$$d_{\mathrm{Minkow}}(x_1, x_2) = \|x_1 - x_2\|_p = \left( \sum_{i=1}^{d} |x_{1,i} - x_{2,i}|^p \right)^{1/p} \tag{3.2}$$

Minkowski distance is typically used with p being 1 or 2, which correspond to the Manhattan distance and the Euclidean distance, respectively.

3. Mahalanobis distance

The Mahalanobis distance is a measure of the distance between a point $x_1 \in \mathbb{R}^d$ and a distribution D (Chandra et al., 1936); it is a multi-dimensional generalization of the idea of measuring how many standard deviations away $x_1$ is from the mean of D. The Mahalanobis distance of an observation $x_1$ from a set of observations with mean $\mu$ and covariance matrix C is defined as (De Maesschalck et al., 2000):

$$d_{\mathrm{Mahala}}(x_1) = \sqrt{(x_1 - \mu)^\top C^{-1} (x_1 - \mu)} \tag{3.3}$$

The Mahalanobis distance can also quantify the dissimilarity between two points $x_1 \in \mathbb{R}^d$ and $x_2 \in \mathbb{R}^d$ drawn from the same distribution with covariance matrix C:

$$d_{\mathrm{Mahala}}(x_1, x_2) = \sqrt{(x_1 - x_2)^\top C^{-1} (x_1 - x_2)} \tag{3.4}$$

If the covariance matrix is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance.
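To make these metrics concrete, the following Python sketch (illustrative only; the function names are not from this work, and NumPy is assumed available) computes Eqs. 3.1, 3.2, and 3.4:

```python
# Illustrative sketch of the distance metrics above, written with NumPy.
import numpy as np

def euclidean(x1, x2):
    # Eq. (3.1): sqrt((x1 - x2)^T (x1 - x2))
    diff = np.asarray(x1, float) - np.asarray(x2, float)
    return float(np.sqrt(diff @ diff))

def minkowski(x1, x2, p=2):
    # Eq. (3.2): (sum_i |x1_i - x2_i|^p)^(1/p); p=1 Manhattan, p=2 Euclidean
    diff = np.abs(np.asarray(x1, float) - np.asarray(x2, float))
    return float(np.sum(diff ** p) ** (1.0 / p))

def mahalanobis(x1, x2, C):
    # Eq. (3.4): sqrt((x1 - x2)^T C^{-1} (x1 - x2)); solve avoids an explicit inverse
    diff = np.asarray(x1, float) - np.asarray(x2, float)
    return float(np.sqrt(diff @ np.linalg.solve(C, diff)))
```

With C set to the identity matrix, `mahalanobis` reduces to `euclidean`, matching the remark above.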

4. Kernel distance

Instead of measuring distance in the raw Euclidean space or with the Mahalanobis distance, the kernel distance is defined after the data are transformed to the kernel space. For two points $x_1 \in \mathbb{R}^d$ and $x_2 \in \mathbb{R}^d$, the kernel distance $d_{\mathrm{kern}}(x_1, x_2)$ is written as:

$$d_{\mathrm{kern}}(x_1, x_2) = \|\Phi(x_1) - \Phi(x_2)\| = \sqrt{k(x_1, x_1) - 2k(x_1, x_2) + k(x_2, x_2)} \tag{3.5}$$

where $\Phi(\cdot)$ is the kernel mapping, a basis function that maps observations $x_1$ and $x_2$ from the original input space to the projected space, and $k(\cdot,\cdot)$ is the kernel function. Rather than explicitly mapping the observations into the projected space, the distance in the projected space can be computed efficiently through the kernel function (Schölkopf, 2001). The kernel function $k(\cdot,\cdot)$ in Eq. 3.5 can be a linear kernel, a polynomial kernel, or a Radial Basis Function (RBF, Gaussian) kernel.

5. DTW distance

Dynamic Time Warping (DTW) is designed to measure the similarity between two temporal sequences, which may vary in speed (Senin, 2008). It searches for an optimal alignment between the sequences as shown in Fig. 3.3.

Fig. 3.3 Alignment illustration: associate each element of sequence X to one or more elements of sequence Y and vice-versa, arrows show the desirable points of alignment.

Fig. 3.4 DTW path illustration.

A DTW path is shown in Fig. 3.4. Assume there are two sequences x and y with lengths n and m. Denote the warping functions of an alignment $\pi$ as $\pi_1(i)$ and $\pi_2(i)$, with moves in three directions: $\rightarrow$, $\uparrow$, $\nearrow$. The cost of an alignment is:

$$D_{(x,y)}(\pi) = \sum_{i=1}^{|\pi|} \varphi\left(x_{\pi_1(i)}, y_{\pi_2(i)}\right) \tag{3.6}$$

where $|\pi|$ is the length of the alignment $\pi$, and $\varphi$ is the distance metric, which can be any distance measure such as the Euclidean distance or the kernel distance.

DTW searches for the optimal alignment:

$$\pi^{*} = \arg\min_{\pi \in \mathcal{A}(n,m)} D_{(x,y)}(\pi) \tag{3.7}$$

where $\mathcal{A}(n,m)$ is the set of all possible alignments. The corresponding DTW distance is:

$$d_{\mathrm{DTW}}(x, y) = \sum_{i=1}^{|\pi^{*}|} \varphi\left(x_{\pi_1^{*}(i)}, y_{\pi_2^{*}(i)}\right) \tag{3.8}$$
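The search over all alignments in $\mathcal{A}(n,m)$ is usually carried out by dynamic programming. A minimal sketch (my own illustration, not this work's implementation; the per-step distance phi defaults to the absolute difference):

```python
# Classic O(nm) DTW dynamic program with the three allowed moves
# (->, up, diagonal) and a pluggable per-step distance phi as in Eq. (3.6).
import numpy as np

def dtw_distance(x, y, phi=lambda a, b: abs(a - b)):
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = phi(x[i - 1], y[j - 1])
            # best predecessor among the three allowed moves
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])
```

Because DTW allows one element to align with several elements of the other sequence, sequences that differ only in local speed (e.g. a repeated value) obtain zero distance.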

6. MMD

Maximum Mean Discrepancy (MMD) represents the distance between distributions as the distance between mean embeddings of features in a Reproducing Kernel Hilbert Space (RKHS) (Borgwardt et al., 2006; Gretton et al., 2012). For two populations X and Y, the MMD is written as:

$$d_{\mathrm{MMD}}(X, Y) = \left\| \frac{1}{n_1}\sum_{i=1}^{n_1} \varphi(x_i) - \frac{1}{n_2}\sum_{j=1}^{n_2} \varphi(y_j) \right\|_{\mathcal{H}} \tag{3.9}$$

where $\varphi(\cdot)$ is the mapping from the original space to the RKHS, $n_1$ and $n_2$ are the sample sizes, and $\mathcal{H}$ denotes the corresponding RKHS.
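Via the kernel trick, the squared RKHS norm in Eq. 3.9 expands into kernel evaluations, which gives a simple empirical (biased) estimator. A sketch assuming an RBF kernel (names and the choice of gamma are illustrative):

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    # pairwise RBF kernel matrix between row-sample arrays a and b
    d = a[:, None, :] - b[None, :, :]
    return np.exp(-gamma * np.sum(d * d, axis=2))

def mmd(X, Y, gamma=1.0):
    # biased empirical estimate of Eq. (3.9):
    # ||mean_i phi(x_i) - mean_j phi(y_j)||_H, computed via the kernel trick
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    kxx = rbf(X, X, gamma).mean()
    kyy = rbf(Y, Y, gamma).mean()
    kxy = rbf(X, Y, gamma).mean()
    return float(np.sqrt(max(kxx + kyy - 2 * kxy, 0.0)))
```

Identical samples give an MMD of zero, while well-separated populations give a large value, which is what makes MMD usable as a drift indicator for streaming data.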

39 Development of Adaptive PHM Methodology

Fig. 3.5 Data-based sample selection

Fig. 3.5 gives an example of the DBI test (FT). The FT evaluates the freshness of a new sample by the distance between the streaming input $x_t$ at time t and the existing selected samples in the DL. If the distance is sufficiently large, this sample is regarded as an important sample, since it introduces novel information to the DL. As mentioned above, various distance metrics are available under different circumstances; the RBF kernel distance is mainly adopted here due to its generally better performance than other kernels, but to keep the formulation general, the term d denotes the selected distance metric. The test criterion of the FT can be written as:

$$d_t = \min\left\{ d(x_t, x_i) \;\middle|\; x_i \in \mathrm{DL}_t \right\} \tag{3.10}$$

$$d_t^{\mathrm{UL}} = \bar{d} + t_{\alpha, N_t - 1} \times \hat{\sigma}_d^{\mathrm{DL}_t} \times \sqrt{N_t - 1} \tag{3.11}$$


where $x_t$ is the current observation and $x_i$ is an element of $\mathrm{DL}_t$. The distance between $x_t$ and the most updated $\mathrm{DL}_t$ is defined as the minimum of all distances between $x_t$ and every element in $\mathrm{DL}_t$; this distance $d_t$ is employed as the test score of the FT. $\bar{d}$ and $\hat{\sigma}_d^{\mathrm{DL}_t}$ are the sample mean and standard deviation of the distance set D, defined as:

$$D = \left\{ d_1, d_2, \ldots, d_{N_t(N_t-1)/2} \right\} \tag{3.12}$$

D is the set of distances between every two important samples in $\mathrm{DL}_t$, which is assumed to follow a normal distribution. Therefore, a Student t-test is adopted to obtain the test upper bound in Eq. 3.11, where $\alpha$ defines the significance level and is empirically set to 0.05, $t_{\alpha, N_t - 1}$ is the critical value of the t-distribution at significance level $\alpha$, and $N_t$ is the cardinality of $\mathrm{DL}_t$. The i-th element of the set D is calculated as:

$$d_i = \min\left\{ d(x_i, x_j) \;\middle|\; j = 1, \ldots, N_t,\; j \neq i \right\} \tag{3.13}$$

When $d_t < d_t^{\mathrm{UL}}$, $x_t$ will not be incremented to the DL or the PHM model. However, when $d_t \geq d_t^{\mathrm{UL}}$, $x_t$ is regarded as an important sample.
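The FT decision of Eqs. 3.10-3.12 can be sketched as follows (illustrative only; the distance function and the Student-t critical value t_{alpha, N_t-1} are supplied by the caller, since the metric and significance level are application settings, and the pairwise-distance definition of Eq. 3.12 is used for D):

```python
import numpy as np
from itertools import combinations

def freshness_test(x_t, DL, dist, t_crit):
    """Sketch of the FT. `dist` is the chosen distance metric,
    `t_crit` the critical value t_{alpha, N_t - 1} (alpha = 0.05 here)."""
    N = len(DL)
    # Eq. (3.10): distance of x_t to its nearest neighbour in DL_t
    d_t = min(dist(x_t, x_i) for x_i in DL)
    # Eq. (3.12): pairwise distances among current DL members
    D = [dist(a, b) for a, b in combinations(DL, 2)]
    d_bar, sigma_d = float(np.mean(D)), float(np.std(D, ddof=1))
    # Eq. (3.11): upper test limit from the t-test
    d_UL = d_bar + t_crit * sigma_d * np.sqrt(N - 1)
    return d_t >= d_UL, d_t, d_UL
```

A far-away sample passes the test (important), while one close to the existing library fails it and is discarded.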

In the MBI test, besides the prediction accuracy mentioned in the previous literature, the prediction uncertainty is also considered, which provides the confidence of the prediction so that the corresponding decisions on the equipment can be made more effectively and reliably [1]. If the prediction error or the prediction uncertainty is very large, the input $x_t$ is valuable to the model, since the present model fails to perform well on $x_t$. Fig. 3.6 illustrates three streaming samples with different importance. Candidate algorithms in this application scenario include all regression techniques that fit into a probabilistic framework, such as Bayesian linear regression (BLR), GPR, ensemble models, etc. In the present demo case, without loss of generality, GPR is mainly adopted for modeling.

Fig. 3.6 Model-based sample selection

The testing criterion for the ET is proposed as:

$$w_t = \frac{|y_t - \bar{y}_t|}{y_t} \times \sigma_t \tag{3.14}$$

where $y_t$ is the measured value for input $x_t$, and $\bar{y}_t$ and $\sigma_t$ are the predicted value and the prediction uncertainty given by the PHM model. As with the FT in the previous discussion, the upper limit of the ET is also derived from a t-test, written as:

$$w_t^{\mathrm{UL}} = \bar{w} + t_{\alpha, N_t - 1} \times \hat{\sigma}_w^{\mathrm{DL}_t} \times \sqrt{N_t - 1} \tag{3.15}$$

where $\alpha$ is the significance level and is empirically set to 0.05, and $t_{\alpha, N_t - 1}$ is the critical value of the t-distribution at significance level $\alpha$. $\bar{w}$ and $\hat{\sigma}_w^{\mathrm{DL}_t}$ are the sample mean and standard deviation of the set $W = \{w_1, w_2, \ldots, w_{N_t}\}$. The k-th element of W is calculated as:

$$w_k = \frac{|y_k - \hat{y}_k|}{y_k} \times \sigma_k \tag{3.16}$$

When $w_t < w_t^{\mathrm{UL}}$, the input $x_t$ is not an important sample, since the model is well fitted around $x_t$. However, when $w_t \geq w_t^{\mathrm{UL}}$, it is important to include $x_t$ in the PHM model, since the current model either has a large error around $x_t$ or is less confident about its prediction. In this way, the streaming samples with high prediction uncertainty or large prediction error are considered the most informative for the existing DL and model. Overall, after the two-step importance test strategy, the samples that are important with respect to both the data property and the model property are selected effectively and efficiently.
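Analogously, the ET of Eqs. 3.14-3.15 can be sketched as follows (illustrative; `y_pred` and `sigma_pred` are assumed to come from a probabilistic model such as GPR, and `W` collects the past test scores over the current DL):

```python
import numpy as np

def error_test(y_t, y_pred, sigma_pred, W, t_crit):
    """Sketch of the ET. `W` holds past scores w_k for the current DL;
    `t_crit` is the critical value t_{alpha, N_t - 1}."""
    # Eq. (3.14): relative error scaled by the predictive uncertainty
    w_t = abs(y_t - y_pred) / y_t * sigma_pred
    N = len(W)
    w_bar, sigma_w = float(np.mean(W)), float(np.std(W, ddof=1))
    # Eq. (3.15): upper test limit from the t-test
    w_UL = w_bar + t_crit * sigma_w * np.sqrt(N - 1)
    return w_t >= w_UL, w_t
```

A sample on which the model errs badly (or is very unsure) scores far above the limit and is selected; an accurately and confidently predicted sample is not.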

Based on the proposed FT and ET, the overall test decision is obtained as Eq. 3.17, where $\|$ is the logical 'OR' operator. $h_t^{\mathrm{FT}}$, $h_t^{\mathrm{ET}}$, and $h_t^{\mathrm{SI}}$ in Eq. 3.17 are all binary outputs, where '0' means the test fails and '1' means the test passes. Besides these test decision indicators, $w_t^{\mathrm{SI}}$ in Eq. 3.18 ranks the importance of the candidate samples that pass the SIT. $w_t^{\mathrm{SI}}$ is useful in off-line sample selection to prioritize the most important sample, since only the most important sample is selected at each iteration.

$$h_t^{\mathrm{SI}} = h_t^{\mathrm{FT}} \,\|\, h_t^{\mathrm{ET}} \tag{3.17}$$

$$w_t^{\mathrm{SI}} = h_t^{\mathrm{SI}} \times w_t \tag{3.18}$$


3.3 Models of Adaptive PHM

In general, any data-driven method capable of updating its prediction rule based on new instances can be considered a candidate algorithm with adaptability. Some hierarchical models can deal with a multi-stationary environment (Di, 2018). Stitch or just-in-time (JIT) models can also handle piecewise stationary environments with satisfactory performance and have been thoroughly investigated in past years (Javadian Kootanaee et al., 2013; Shinn and Williams, 1998). However, most of these methods (Di, 2018; Javadian Kootanaee et al., 2013; Shinn and Williams, 1998) rest on the strong assumption that changes jump to another steady state, so that the streaming time sequence can be divided into segments representing different working regimes, with a model established for each segment accordingly. This is not always the scenario faced in adaptive PHM modeling. As Fig. 2.6 in Chapter 2 shows, in the more general case, adaptation refers to a learning process in a nonstationary environment in which model performance improves with more streaming samples. From this point of view, this study focuses on algorithms that can repeatedly update the model in a consistent form, instead of similarity-based methods such as stitch or JIT.

Another consideration is that the model should be active rather than passive. In general, algorithms with adaptability in the presence of concept drift follow either an active or a passive approach (Ditzler et al., 2015). Passive algorithms, also known as "lazy" ones, keep updating the model continuously whenever a new sample or measurement is obtained, regardless of whether new information is present. In some applications, such as control and motion tracking, such methods are necessary (e.g., the Kalman filter, Leffens et al. (1982)), but in an online environment where the time sequence streams at a relatively high velocity, the model should actively select the samples that trigger the updating mechanism, for efficient consumption of time and resources.

What is more, knowledge inheritance also needs to be considered. In some industrial processes, it is necessary to retain original knowledge, such as Design of Experiment (DOE) data, in the model. Sliding-window methods and the Exponentially Weighted Moving Average (EWMA) control chart are therefore not necessarily usable, as in their settings samples further removed in time retain less and less influence on the current model. Bayesian methods are a contrasting example, as previous knowledge can be encoded in the prior hyperparameters.

Based on the considerations mentioned above, a qualified learning algorithm for adaptive PHM modeling should have the following properties:

• The model can be made incrementally updatable.

• The model is capable of integrating prior knowledge (e.g., DOE, SME, etc.).

• The model is flexible enough to combine with an extra sample selection strategy.

According to these properties, the following four algorithm families are adopted as good candidates for adaptive PHM modeling:

• Bayesian methods;

• Adaptive ensemble methods;

• Neural nets;

• Incremental kernel methods.


3.3.1 Bayesian Methods

Bayesian methods form a large family in statistical learning with broad applications (Gelman et al., 2013), founded on Bayes' theorem. Bayesian methods are sequential by nature, which gives them the potential to deal with streaming sequences: the model parameters can be updated sequentially in an online fashion as new observations arrive, making them suitable for adaptive PHM modeling. The Bayesian family includes both classification and regression models, parametric and non-parametric models, and linear and non-linear models, providing comprehensive options for different application scenarios. Bayesian methods keep a consistent form after model updating, so the posterior can flexibly serve as the new prior, and model updating and inference can be made at any point in the data stream. Combining this property with the sample selection strategy proposed in the previous section makes Bayesian methods an excellent candidate for capturing the information in new data and updating the model efficiently and effectively.

By application, Bayesian methods can be categorized into Bayesian classification methods and Bayesian regression methods. The former can be adopted for fault detection and diagnosis in the context of PHM; typical algorithms include Naïve Bayes, Gaussian Naïve Bayes, Multinomial Naïve Bayes, etc. The latter can be adopted for prognosis, degradation condition estimation, and virtual metrology in the context of PHM; typical algorithms include Bayesian linear regression, Gaussian processes, Dirichlet processes, etc.

Assume there is a data library $\mathrm{DL}_t$ that contains all the important samples $(X_1, y_1), (X_2, y_2), \ldots, (X_m, y_m)$ at current time t. Without loss of generality, the Naïve Bayes classifier and Bayesian regression are presented for classification and regression modeling. More details of other Bayesian methods, such as Bayesian ARX and Gaussian process regression, are presented in Chapter 4 with case studies.

Bayes classifier

Assume there are in total K different classes $C_1, C_2, \ldots, C_K$. Based on Bayes' theorem, the conditional probability that a test sample $X_t$ belongs to class k is decomposed as:

$$p(C_k \mid X_t) = \frac{p(C_k)\, p(X_t \mid C_k)}{p(X_t)} \tag{3.19}$$

where $p(C_k)$ $(1 \leq k \leq K)$ is the prior and $p(X_t \mid C_k)$ is the likelihood. The predicted class $y_t$ is:

$$y_t = \arg\max_{C}\, p(C \mid X_t) \tag{3.20}$$

When a new important sample is detected by the sample importance test, the posterior $p(C_k \mid X_1, X_2, \ldots, X_m)$ $(1 \leq k \leq K)$ can be adopted as the prior for recursive updating. Specifically, Naïve Bayes is a special case of the Bayes classifier under the assumption that all elements of the input $X_i$ are independent.
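This posterior-to-prior recursion can be illustrated with a Gaussian Naïve Bayes classifier maintained through running sufficient statistics, so that each important sample updates the class-conditional densities without retraining (a toy sketch; the class names and the unit initialization of the squared sums, which acts as a crude unit-variance prior, are my own choices):

```python
import numpy as np

class SequentialGaussianNB:
    """Toy Gaussian Naive Bayes updated one important sample at a time
    via running sufficient statistics (counts, sums, sums of squares)."""
    def __init__(self, n_features, classes):
        self.classes = list(classes)
        self.n = {c: 0 for c in classes}                      # class counts
        self.s = {c: np.zeros(n_features) for c in classes}   # sum of x
        self.q = {c: np.ones(n_features) for c in classes}    # sum of x^2 (unit prior)

    def update(self, x, c):
        x = np.asarray(x, float)
        self.n[c] += 1
        self.s[c] += x
        self.q[c] += x * x

    def predict(self, x):
        x = np.asarray(x, float)
        best, best_lp = None, -np.inf
        total = sum(self.n.values())
        for c in self.classes:
            if self.n[c] == 0:
                continue
            mu = self.s[c] / self.n[c]
            var = self.q[c] / self.n[c] - mu * mu + 1e-6
            lp = np.log(self.n[c] / total)                    # log prior, Eq. (3.19)
            lp += np.sum(-0.5 * np.log(2 * np.pi * var)
                         - 0.5 * (x - mu) ** 2 / var)         # log likelihood
            if lp > best_lp:
                best, best_lp = c, lp
        return best                                            # arg max, Eq. (3.20)
```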

Bayes regression

Consider a regression model of the general form:

$$y_t = f_t(\theta; X_t) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2) \tag{3.21}$$

where $f_t$ is the regression model and $\theta$ collects its parameters. Based on the observed important samples in the current data library $\mathrm{DL}_t$, the posterior of the model parameters is decomposed as:

$$p(\theta, \sigma \mid y_{1:m}, X_{1:m}) \propto p(\theta, \sigma) \prod_{i=1}^{m} p(y_i \mid X_i, \theta, \sigma) \tag{3.22}$$

where $p(\theta, \sigma)$ denotes the prior distribution encoding the prior knowledge of the parameters, and $\prod_{i=1}^{m} p(y_i \mid X_i, \theta, \sigma)$ is the likelihood function, determined by the regression model and the distribution of the error $\varepsilon$ in Eq. 3.21; in this case it is:

$$\prod_{i=1}^{m} p(y_i \mid X_i, \theta, \sigma) = \frac{1}{(2\pi\sigma^2)^{m/2}} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{m} \bigl( y_i - f_i(\theta) \bigr)^2 \right) \tag{3.23}$$

When a new important sample $X_{m+1}$ passes the SIT, the sequentially updated posterior is:

$$\begin{aligned}
p(\theta, \sigma \mid y_{1:m+1}, X_{1:m+1}) &\propto p(\theta, \sigma) \prod_{i=1}^{m+1} p(y_i \mid X_i, \theta, \sigma) \\
&\propto \left[ p(\theta, \sigma) \prod_{i=1}^{m} p(y_i \mid X_i, \theta, \sigma) \right] p(y_{m+1} \mid X_{m+1}, \theta, \sigma) \\
&\propto p(\theta, \sigma \mid y_{1:m}, X_{1:m})\; p(y_{m+1} \mid X_{m+1}, \theta, \sigma)
\end{aligned} \tag{3.24}$$

which is the recursive form for model updating from important sample $(X_m, y_m)$ to $(X_{m+1}, y_{m+1})$. More details on model development together with the sample importance test can be found in case study II in Chapter 4.
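For a linear model with known noise variance, the recursion of Eq. 3.24 has a closed form: each important sample adds a rank-one term to the posterior precision. A sketch (illustrative; the Gaussian prior N(0, tau2 I) and the known noise variance sigma2 are simplifying assumptions not stated above):

```python
import numpy as np

class SequentialBLR:
    """Bayesian linear regression with known noise variance sigma2 and
    Gaussian prior N(0, tau2 I). The posterior after m samples becomes
    the prior for sample m+1, mirroring the recursion of Eq. (3.24)."""
    def __init__(self, dim, sigma2=0.1, tau2=10.0):
        self.sigma2 = sigma2
        self.P = np.eye(dim) / tau2    # posterior precision (starts at prior)
        self.b = np.zeros(dim)         # precision-weighted posterior mean

    def update(self, x, y):
        x = np.asarray(x, float)
        # rank-one precision update: P <- P + x x^T / sigma2
        self.P += np.outer(x, x) / self.sigma2
        self.b += y * x / self.sigma2

    def mean(self):
        # posterior mean of theta: solve P theta = b
        return np.linalg.solve(self.P, self.b)
```

Because each `update` call only modifies `P` and `b`, an important sample can be absorbed at any point in the stream without retraining, which is the property exploited by the proposed methodology.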

3.3.2 Adaptive Ensemble Methods

Adaptive ensemble methods usually contain multiple experts, and the final decision is taken as a weighted voting result, as shown in Fig. 3.7. To introduce "adaptability" into ensemble models, the weights of the experts can be dynamically updated. As a class of simple yet robust methods, various adaptive ensemble methods have been presented, such as the adaptive random forest (Schwing et al., 2011), on-line tree-based ensembles (Saffari et al., 2009), and dynamic weighted majority (Kolter and Maloof, 2007). A general and inspiring adaptive weighted ensemble algorithm is given below (Hoi et al., 2018):

Algorithm 2: Adaptive weighted ensemble algorithm
1: procedure
2:   Initialize the weights p_1, p_2, ..., p_N of all experts to 1/N.
3:   for t = 1, 2, ..., T do
4:     Get the predictions y'_1, y'_2, ..., y'_N from the N experts.
5:     Output ŷ_t = 1 if Σ_{i: y'_i = 1} p_i > Σ_{i: y'_i = 0} p_i; otherwise output ŷ_t = 0.
6:     Receive the true value y_t.
7:     if the i-th expert made a mistake then
8:       p_i = p_i × β  (β ∈ (0, 1) is a user-defined parameter)
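One round of Algorithm 2 can be sketched as follows (illustrative function; binary labels in {0, 1}):

```python
def weighted_ensemble(expert_preds, y_true, weights, beta=0.5):
    """One round of the adaptive weighted ensemble (Algorithm 2):
    weighted vote on binary labels, then down-weight mistaken experts."""
    vote1 = sum(w for p, w in zip(expert_preds, weights) if p == 1)
    vote0 = sum(w for p, w in zip(expert_preds, weights) if p == 0)
    y_hat = 1 if vote1 > vote0 else 0
    # penalise experts that were wrong on the revealed true label
    new_w = [w * beta if p != y_true else w
             for p, w in zip(expert_preds, weights)]
    return y_hat, new_w
```

Repeated rounds shrink the influence of consistently wrong experts geometrically, which is what lets the ensemble track a drifting environment.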

Fig. 3.7 An illustration of dynamic weighted ensemble model.

3.3.3 Neural Nets

The investigation of online learning based on neural nets can be traced back to the Perceptron, the classical neural net introduced by Rosenblatt (1958), a greedy gradient-descent algorithm for learning a linear classifier. Variants such as the second-order Perceptron extend the Perceptron algorithm by updating the weights with a second-order regularizer (Cesa-Bianchi et al., 2005).

Most neural nets are trained by the backpropagation algorithm, which improves the weights over multiple iterations using gradient descent and its variants. This facilitates converting neural net training from batch (off-line) to online. The Perceptron presents a natural and intuitive approach: propagate each training example and backpropagate its error through the net only once, realizing an online training process. After the initial model is trained on off-line data and deployed for online monitoring, each new sample $X_t$ that streams in is fed into the neural net, which returns a prediction $\hat{y}_t$. When the actual outcome $y_t$ becomes available, the prediction $\hat{y}_t$ is compared with $y_t$ and the error is employed to adjust the net weights by backpropagation.
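The predict-compare-correct loop described above is the classic online Perceptron; a minimal sketch (labels in {-1, +1}; class and parameter names are illustrative):

```python
import numpy as np

class OnlinePerceptron:
    """Classic Perceptron updated one streaming sample at a time:
    predict, compare with the revealed label, correct the weights once."""
    def __init__(self, dim, lr=1.0):
        self.w = np.zeros(dim)
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        return 1 if np.dot(self.w, x) + self.b >= 0 else -1

    def update(self, x, y):            # y in {-1, +1}
        x = np.asarray(x, float)
        if self.predict(x) != y:       # single error-driven correction
            self.w += self.lr * y * x
            self.b += self.lr * y
```

On linearly separable data the mistake-driven updates converge after finitely many corrections, which is the classical Perceptron convergence property.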

Besides, other methods that rely on gradient descent or its variants are convenient to convert to an online form, such as the Relaxed Online Maximum Margin algorithm (Li and Long, 2000), Approximate Maximal Margin Classification (Gentile, 2001), the Online Gradient Descent classifier (Zinkevich, 2003), the Passive-Aggressive method (Crammer et al., 2006), the Confidence-Weighted classifier (Dredze et al., 2008), the Soft Confidence-Weighted classifier (Wang et al., 2013), and Adaptive Regularization of Weight Vectors (Crammer et al., 2009), as listed in Hoi et al. (2014). More details and comparative results are given in the case studies in Chapter 4.

The main advantage of neural networks with online learning is the capability to process an effectively unbounded number of examples, but as shown later in the case study results in Fig. 4.8, these passive methods may exhibit unstable prediction behavior. Combining an active sample selection strategy with neural nets improves model stability.

3.3.4 Online Kernel Methods

Kernel-based online learning algorithms aim to handle nonlinear modeling with kernel methods. Recently, various kernel-based incremental/online learning methods have been proposed, such as online kernel ridge regression ( et al., 2019), incremental and decremental SVM learning (Cauwenberghs and Poggio, 2001; Laskov et al., 2006), incremental kernel Singular Value Decomposition (SVD) ( et al., 2006), and incremental kernel PCA (Kimura et al., 2005). Compared with direct modeling in the raw data space, incremental algorithms with kernels show better robustness and generalization. Without loss of generality, the incremental SVM is adopted here to illustrate how kernel-related online learning methods work. Note that the kernel methods may overlap with the other families; e.g., the Gaussian process can be considered both a Bayesian method and a kernel method.

The incremental SVM is introduced in the literature (Cauwenberghs and Poggio, 2001; Diehl and Cauwenberghs, 2003; Laskov et al., 2006) as a theoretical investigation in machine learning conferences. In general, it uses new data points to update the original support vectors, which is reflected in the adaptive change of the hyperplane as the online sequence streams in. In the incremental SVM, the training problem is formulated as:

$$\max_{\mu} \min_{0 \leq \alpha \leq C} \; W := -\mathbf{1}^\top \alpha + \frac{1}{2}\, \alpha^\top K \alpha + \mu\, y^\top \alpha \tag{3.25}$$

where K is the kernel matrix, $\alpha$ denotes the weights of the data points, $\mu$ is the Lagrange multiplier enforcing $y^\top \alpha = 0$, and the penalty parameter C controls the size of the margin.

The basic concept of the incremental SVM is to add a new data point to an existing optimal solution. When a new point $x_t$ is added, its weight $\alpha_t$ is initially set to 0, meaning the new point is not a support vector. However, when it is found that $x_t$ should become a support vector, the weights of the other points and the threshold $\mu$ must be updated to obtain an optimal solution for the updated data set. To guarantee the robustness of this update process, some prerequisites are needed; for example, the Kuhn-Tucker conditions must be satisfied for all available data points before the new point arrives, and the objective of the weight update is to find weights such that the Kuhn-Tucker conditions are satisfied for the updated dataset. It is proven that the number of support vectors is highly related to the hyperparameters of the SVM model (Baudat and Anouar, 2003; Smola and Schölkopf, 2000), which can bring extra computational burden. By combining the incremental SVM with the sample importance test, modeling efficiency can be improved and the overfitting risk reduced, as only the representative samples in the feature space are selected.
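The full incremental SVM bookkeeping (migrating points among margin, error, and reserve sets while maintaining the Kuhn-Tucker conditions) is involved; the kernel Perceptron below is a simplified stand-in (my own sketch, not Cauwenberghs and Poggio's algorithm) that illustrates the shared pattern: a new point enters the stored "support set" only when the current kernel expansion misclassifies it:

```python
import numpy as np

def rbf(a, b, gamma=0.5):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.exp(-gamma * np.sum((a - b) ** 2)))

class KernelPerceptron:
    """Online kernel learner: stores only the points that violated the
    current decision function, an analogue of the support vector set."""
    def __init__(self, kernel=rbf):
        self.kernel = kernel
        self.sv = []      # stored points ("support set")
        self.alpha = []   # their signed weights

    def decision(self, x):
        return sum(a * self.kernel(s, x) for s, a in zip(self.sv, self.alpha))

    def update(self, x, y):            # y in {-1, +1}
        if y * self.decision(x) <= 0:  # margin/KKT-style violation check
            self.sv.append(x)
            self.alpha.append(y)
```

As in the incremental SVM, well-classified streaming points leave the stored set untouched, so the model stays sparse while still adapting online.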

3.4 Off-line Sample Selection and Modeling Techniques

Based on the SIT discussed above, an off-line and an on-line sample selection procedure are subsequently proposed. The off-line sample selection aims to reduce the data volume by selecting important samples from the off-line database; its outcome is a DL that is employed to establish the PHM model. The detailed implementation procedure of the off-line sample selection is presented in Algorithm 3, which includes the preliminary selection in step 4.1 and the fine selection in step 4.2. The stopping criterion in Algorithm 3 is evaluated from two aspects: (1) when all remaining samples in the DB fail to pass the SIT, the algorithm stops; (2) when the improvement in prediction error is sufficiently small compared with the previous iteration, the algorithm stops.

Algorithm 3: Off-line sample selection strategy
Input: off-line database DB
Output: a DL with representative samples
1: procedure
2:   Randomly select a small subset of data points from DB as DL_0
3:   Randomly split the remaining samples in DB into a validation set V and a training set T
4:   Train an initial PHM model based on DL_0
5:   loop:
6:   while ∼StoppingCriterion do
7:     Select the candidate sample set S_c = {x_i | x_i ∈ T, h_i^{SI} = 1}
8:     Select s_k from DB with k = arg max_i {w_i^{SI} | x_i ∈ DB}
9:     loop:
10:    if ∼StoppingCriterion then
11:      Increment s_k to DL
12:      Re-train the PHM model based on the updated DL
13:      Evaluate the prediction error ϵ on the validation set V

3.5 On-line Sample Selection and Modeling Techniques

The on-line deployment procedure is proposed in Algorithm 4. The SIT in Sec. 3.2 is employed to evaluate the importance of an incoming sample online. Once the SIT is passed, the incoming sample is incremented to the current PHM model and added to the existing DL. If the number of samples in the DL exceeds the maximum number N_max, the oldest sample is decremented from the current model. In this way, the model and the representative samples are kept updated.

Fig. 3.8 Offline sample selection and modeling

3.6 Justification of Sample Importance Test

This section justifies the proposed SIT under different scenarios using synthesized data in Fig. 3.10. The 6 data points marked in red dots in Fig. 3.10 are all regarded as important samples. The reasons why they are important and the SIT outcomes are tabulated in Table 3.1.

In Fig. 3.10, samples 1, 4, and 5 are important since the model suffers large prediction errors around these samples, regardless of whether they lie close to or far from the existing observations. For samples 2 and 6, the model happens to give rather good predictions, but the large prediction uncertainties show that the model is not confident


Algorithm 4: On-line sample selection strategy
Input: off-line database DB, current selected data library DL
Output: updated DL and model
1:  procedure
2:    Select n training samples from DB using Algorithm 3; denote the selected training set as DL
3:    Train a model M0 by minimizing the model regularization error and obtain the initial parameter Θ0
4:    Check the importance of sample {xt, yt} at time t using the SIT
5:    if ht^SI == 1 then
6:      Update parameter Θt based on the previous model Mt−1 and parameter Θt−1
7:      Add {xt, yt} to DL
8:      if |DL| > Nmax then
9:        Decrement the oldest data sample
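Algorithm 4's increment/decrement cycle can be sketched as below; `increment` and `decrement` stand in for the model-specific update rules (e.g. the incremental learners discussed later) and are assumptions of this sketch:

```python
from collections import deque

def online_update(stream, dl, model, sit_pass, increment, decrement, n_max=100):
    """On-line deployment sketch: each incoming sample is screened by the
    SIT; if it passes, it is incremented into the model and appended to
    the data library DL. When DL exceeds n_max samples, the oldest one
    is decremented from the model and dropped."""
    dl = deque(dl)
    for x_t, y_t in stream:
        if sit_pass(model, x_t, y_t, dl):
            model = increment(model, x_t, y_t)   # learn the new important sample
            dl.append((x_t, y_t))
            if len(dl) > n_max:
                old = dl.popleft()               # forget the oldest sample
                model = decrement(model, *old)
    return dl, model
```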

Table 3.1 SIT conclusions for example samples in Fig. 3.10

Sample ID   FT   ET   SIT   Reason for selection
1, 7        1    1    1     The FT is passed; large prediction uncertainty; large prediction error
2           1    1    1     The FT is passed; large prediction uncertainty
3           1    0    1     The FT is passed
4           1    1    1     The FT is passed; large prediction error
5, 8        0    1    1     Large prediction uncertainty; large prediction error
6           1    1    1     The FT is passed; large prediction uncertainty


Fig. 3.9 On-line sample selection and modeling

about these predictions. Therefore, it is necessary to include these samples to reduce the prediction uncertainty. For sample 3 in Fig. 3.10, the model happens to predict well and the uncertainty is small. Because of the rapid change of the time series, few previous observations are available around this point, so the FT is essential to identify this critical data point.

Based on the above discussion and the test outcomes in Table 3.1, it is found that the ET and FT are equally essential in identifying important training samples for model performance improvement. This justifies why the proposed SIT uses the logic “OR”


Fig. 3.10 Illustrations of SIT operation

to fuse the FT and ET outcomes. In addition, sample 3 shows that an SIT relying solely on the prediction uncertainty could reach a wrong conclusion.
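The OR-fusion of the two tests can be sketched as follows; the threshold values and the nearest-neighbor form of the freshness check are illustrative choices, not the dissertation's exact settings:

```python
import numpy as np

def sit(pred_err, pred_std, x_new, X_dl,
        err_thresh=1.0, std_thresh=0.5, dist_thresh=2.0):
    """Sample importance test (sketch): the ET flags samples with a large
    prediction error OR a large prediction uncertainty; the FT flags
    samples far from the existing data library; the SIT fuses both with
    logic OR, as justified by Table 3.1."""
    et = (abs(pred_err) > err_thresh) or (pred_std > std_thresh)  # error test
    dist = np.min(np.linalg.norm(X_dl - x_new, axis=1))           # distance to DL
    ft = dist > dist_thresh                                       # freshness test
    return ft or et
```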

3.7 An Intuitive Case of Sample Selection and Modeling

In this section, the feasibility of off-line sampling and modeling is intuitively validated by applying it to predict the electrical power output of a combined cycle power plant using several different data-driven modeling algorithms. The prediction of electricity output is a challenging task because of the dynamic factors, such as ambient temperature and atmospheric pressure, involved in the power generation process. In this case, in order to validate the effectiveness of the proposed off-line sample selection and modeling method, the electrical power output prediction dataset from the UCI public data repository is employed (Ge and , 2010; Tüfekci, 2014), and the results are benchmarked with models trained using all the available samples.


3.7.1 Background

Fig. 3.11 Diagram of a combined cycle power plant (CCPP)

To analyze such a system accurately with thermodynamical approaches, a large number of assumptions is necessary to account for the unpredictability in the solution. Without these assumptions, a thermodynamical analysis of a real application involves thousands of nonlinear equations, whose solution is either almost impossible or takes too much computational time and effort. To eliminate this barrier, machine learning approaches are mostly used as an alternative to thermodynamical approaches, in particular to analyze systems with arbitrary input and output patterns. Fig. 3.11 shows a combined cycle power plant (CCPP) with two gas turbines, one steam turbine, and two heat recovery systems. Predicting the electrical power output of a power plant has been considered a critical real-life problem to address with machine learning techniques. Correctly predicting the full-load electrical power output of a base-load power plant is important for the efficient and economic operation of a power plant, as it helps maximize the income from the available megawatt hours. The reliability and sustainability of a gas turbine depend highly on the prediction of its power generation, particularly when it is subject to constraints of high profitability and contractual liabilities (Tüfekci, 2014).

3.7.2 Design of Experiments

The overall modeling flowchart is shown in Fig. 3.12. In order to validate the effectiveness of the sample importance test and modeling method, the whole dataset is split into different train-test ratios from 0.4 to 0.8. In each test setting, first all the training samples are used for modeling as the benchmark, and then the proposed method is applied for sample selection and modeling. The experiment settings are shown in Fig. 3.13. To ensure the universality of the proposed method, three different machine learning algorithms are employed, which are:

• Bayesian Linear Regression (BLR);

• Gaussian Process Regression (GPR);

• Ensemble neural networks.

These three algorithms cover different properties of machine learning algorithms: linear/non-linear, parametric/non-parametric, and single/ensemble.
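Of the three, BLR is the simplest one that natively provides both a predictive mean and a predictive uncertainty, which is what the error test consumes. A minimal numpy sketch of standard Bayesian linear regression (the prior precision `alpha` and noise precision `beta` are illustrative hyperparameters, not the dissertation's settings):

```python
import numpy as np

def blr_fit(X, y, alpha=1.0, beta=25.0):
    """Bayesian linear regression: Gaussian posterior over weights with
    prior precision alpha and noise precision beta."""
    Phi = np.column_stack([np.ones(len(X)), X])           # bias + feature
    S_inv = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi
    S = np.linalg.inv(S_inv)                              # posterior covariance
    m = beta * S @ Phi.T @ y                              # posterior mean
    return m, S, beta

def blr_predict(model, X):
    """Predictive mean and standard deviation for new inputs."""
    m, S, beta = model
    Phi = np.column_stack([np.ones(len(X)), X])
    mean = Phi @ m
    # predictive variance = noise variance + parameter uncertainty
    var = 1.0 / beta + np.einsum('ij,jk,ik->i', Phi, S, Phi)
    return mean, np.sqrt(var)
```

The predictive standard deviation returned here is exactly the kind of uncertainty signal the SIT's error test can threshold.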

3.7.3 Results and Discussions

The results are shown in the following figures. Fig. 3.14 compares the prediction errors of the two modeling methods in terms of the root mean square error (RMSE), which is defined as:

\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{N}} \qquad (3.26)
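Eq. (3.26) translates directly to code:

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean square error, Eq. (3.26)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))
```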


Fig. 3.12 Flow chart of power prediction of CCPP

Fig. 3.13 Test settings with different train-test ratios


where y_i is the ground truth of the i-th sample and ŷ_i is the corresponding prediction. From Fig. 3.14, it is found that the RMSE of the proposed method and the benchmarking method are comparable across the different machine learning algorithms. However, the computational cost of the proposed method is much lower than that of the benchmarking method, especially when the training data size becomes larger. Fig. 3.15 shows the modeling time costs of the proposed method and the benchmarking method. As the training data size increases, the modeling process of the proposed method with the sample selection strategy is more efficient than the benchmarking method using all training data. The modeling time difference is most obvious when GPR is applied, because as a non-linear, non-parametric modeling process (Rasmussen, 2003), GPR usually consumes more time and resources than the other algorithms. The detailed results are tabulated in Table 3.2.

Fig. 3.14 Prediction RMSE comparison between the benchmarking method and proposed method


Fig. 3.15 Modeling time comparison between the benchmarking method and proposed method

Fig. 3.16 Sample size comparison


Table 3.2 Detailed results comparison between benchmarking method and the proposed method

        Benchmark                      Proposed
Ratio   RMSE   Sample #   Time(s)     RMSE   Sample #   Time(s)
0.4     4.66   500        0.28        4.68   100        0.27
0.5     4.69   612        0.28        4.72   158        0.29
0.6     4.71   734        0.28        4.70   245        0.27
0.7     4.70   857        0.32        4.70   286        0.28
0.8     4.85   979        0.34        4.80   355        0.28

0.4     4.39   500        3.96        4.41   97         0.22
0.6     4.32   734        7.42        4.32   240        0.45
0.7     4.30   857        15.81       4.31   285        0.59
0.8     5.45   979        22.69       4.44   336        0.89

0.4     4.90   500        2.49        4.51   91         1.50
0.5     5.43   612        3.49        5.57   161        1.30
0.6     4.43   734        3.44        4.96   223        1.85
0.7     4.30   857        2.80        4.14   271        1.38
0.8     4.98   979        2.71        5.08   331        1.52

3.7.4 Summary

In conclusion, the feasibility of off-line sample selection and modeling is verified on a real-world data set for power plant electricity prediction. The results demonstrate that, compared with the traditional PHM modeling approaches adopted as benchmarks, the proposed method provides a much more efficient yet comparably accurate solution. This validates the effectiveness of the off-line sample selection algorithm and demonstrates the feasibility of achieving comparable prediction accuracy using only a subset of important samples, which significantly improves model training efficiency in real practice, especially in large-volume data environments. It is also proven that the proposed SIT effectively identifies important examples from abundant samples in an off-line environment, which sets up the base for further investigation of on-line sample selection and model updating, with typical applications in more case studies in Chapter 4.

Chapter 4

Case Studies

4.1 Overview of Case Studies

This chapter demonstrates the proposed methodology with three case studies, all industrial use cases with real-world data sets collected from regular usage: (1) hard disk drive (HDD) fault detection with various usage patterns; (2) adaptive Virtual Metrology (VM) of the Chemical-Mechanical Planarization (CMP) process; and (3) battery state-of-health prognosis. As shown in Table 4.1, the three cases are designed to validate the adaptive PHM through three typical applications (fault detection, virtual metrology, and prognosis) in a comprehensive manner. Both the off-line model efficiency and the on-line model adaptability of the proposed framework are validated in this chapter.


Table 4.1 Case study summary

Case studies                      Model                              SIT: DBI test metric   SIT: MBI test metric
I: HDD fault detection            Incremental SVM                    Mahalanobis distance   accuracy
II: VM of CMP process             Bayesian ARX; SVR; Random Forest   Kernel distance        error; uncertainty
III: Battery capacity prognosis   MISO GP                            DTW distance           error; uncertainty

4.2 Case Study I - Hard Disk Drive Online Fault Detection

In this section, the approach for online sample selection and adaptive model updating is applied to an industrial use case of HDD fault detection.

4.2.1 Background

Most of the data produced in the world are stored on hard disk drives (Mashhadi et al., 2018); therefore, the failure prediction of HDDs at an early stage has drawn more and more attention in the data storage and data security areas. A typical computer HDD is shown in Fig. 4.1. The complexity of the system brings challenges in monitoring HDD health conditions with an indicating process variable. Although vibration amplitude and acoustic emission signals could be good sources for HDD failure detection, in most cases it is impossible to identify HDD failures with non-intrusive signals (Wang et al., 2011). Previous efforts have been made to use Self-Monitoring, Analysis and Reporting Technology (SMART) data, which are real-time measurements of a drive's technical status, to predict HDD failures.


Fig. 4.1 Computer hard disk drive diagram (source: Wang et al. (2011))

Table 4.2 SMART attributes and their definitions

SMART Attribute   Definition
SMART 1           Read Error Rate
SMART 3           Spin Up Time
SMART 4           Start/Stop Count
SMART 5           Reallocated Sectors Count
SMART 7           Seek Error Rate
SMART 9           Power-On Hours
SMART 10          Spin Retry Count
SMART 12          Power Cycle Count
SMART 184         End-to-End Error / IOEDC
SMART 187         Reported Uncorrectable Errors
SMART 188         Command Timeout
SMART 189         High Fly Writes
SMART 190         Temperature Difference or Airflow Temperature
SMART 191         G-sense Error Rate
SMART 192         Power-off Retract Count
SMART 193         Load Cycle Count
SMART 194         Temperature
SMART 197         Current Pending Sector Count
SMART 198         Uncorrectable Sector Count
SMART 199         UltraDMA CRC Error Count
SMART 240         Head Flying Hours
SMART 241         Total LBAs Written
SMART 242         Total LBAs Read


4.2.2 Data Description

To evaluate the proposed on-line sample selection and adaptive model updating method, a real-world dataset collected from a data center of Baidu Corp. is used in this case study (Baidu, 2015). Samples were collected from working drives every hour, with the SMART attributes shown in Table 4.2. Each sample contains all the SMART attribute values for a single drive at an exact time. In total, the dataset covers 23,395 drives of an enterprise-class model, labeled good or failed. As a preliminary study, this case only uses part of the data for method validation. One challenging issue in this dataset is the strong uncertainty of the records combined with the long sampling interval (one sample per hour), which means failure-related data behaviors cannot be timely and effectively captured.

Fig. 4.2 shows two typical variables with uncertainties from one healthy hard disk drive. There are clear outliers and data drift phenomena in the selected features; therefore, adaptive model updating is necessary in this case to guarantee a robust online fault detection model with good performance.

4.2.3 Methodology

The overall modeling flowchart is shown in Fig. 4.4. The input and output variables from the SMART tool are partitioned into off-line training data, online learning streaming data, and testing data. Here, the off-line training data means the data available prior to the modeling process; the online learning data means the streaming data arriving during the on-line monitoring process; and the testing data are the data with a later timestamp than all off-line training and online learning data. The testing data are used to verify the performance of both the off-line model and the on-line adaptive model (the data


(a)

(b)

Fig. 4.2 Two typical variables with uncertainties from one healthy hard disk drive

partition and experiment design are shown in Fig. 4.3).

Fig. 4.3 Data partition for model off-line training and on-line updating

In the off-line part, representative samples are selected out of all the available off-line samples to formulate an initial DL. In this study, the Mahalanobis distance (see Eq. 3.4) is adopted, as it is capable of measuring the similarity of high-dimensional data. During the off-line stage, the DL and the fault detection model are updated using the SIT through iterations. At each iteration, only the single most important sample among all candidates is selected and added to the DL, and the fault detection model is updated accordingly. The iteration stops when the stopping criterion is met. Thereafter, during the online learning process, the DL and fault detection model obtained through off-line iterations are used as the starting point for online monitoring. Based on the online SIT, important samples are selected out of the streaming-in sample sequence and adopted for model updating. After each model update, the prediction on the testing data is repeated for model performance evaluation. Each sample is classified with either a healthy or a faulty label, which indicates whether the HDD is in a healthy condition at that time.
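The Mahalanobis distance used for the representativeness screening can be sketched as follows; the small regularization term `eps` is an assumption added here to keep the sample covariance invertible:

```python
import numpy as np

def mahalanobis(x, X_dl, eps=1e-6):
    """Mahalanobis distance from a new sample x to the distribution of
    the data library X_dl (rows = samples, columns = features)."""
    mu = X_dl.mean(axis=0)
    cov = np.cov(X_dl, rowvar=False) + eps * np.eye(X_dl.shape[1])
    d = x - mu
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))
```

Unlike the Euclidean distance, this accounts for the scale and correlation of the SMART attributes, which is why it suits high-dimensional data.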

It is worth mentioning that quite a number of machine learning/deep learning algorithms can be adopted to establish an HDD fault detection model, such as decision trees, Bayesian classifiers, Linear Discriminant Analysis (LDA), Neural Networks (NN), etc. In this case, the incremental SVM is adopted to update the fault detection model


Fig. 4.4 Flow chart of HDD fault diagnosis

by learning the novel information obtained from the important streaming-in samples which pass the SIT. The reasons why SVM is utilized include:

• The decision boundary of SVM can be fitted as linear or non-linear depending on the choice of kernel function.

• SVM has a regularization term in the objective function, which effectively avoids model overfitting.

• The incremental SVM is efficient for deployment since the model only requires a limited number of support vectors to build the decision boundary.

Incremental SVM was introduced in the literature (Cauwenberghs and Poggio, 2001; Diehl and Cauwenberghs, 2003; Laskov et al., 2006) as a theoretical investigation in machine learning. In general, it uses new data points to update the original support vectors, which is reflected as an adaptive change of the hyperplane as the online sequence streams in. In the incremental SVM, the training problem is formulated as:

\max_{\mu} \; \min_{0 \le \alpha \le C, \; y^{\top}\alpha = 0} \; W := -\mathbf{1}^{\top}\alpha + \frac{1}{2}\alpha^{\top} K \alpha + \mu y^{\top}\alpha \qquad (4.1)

where K is the kernel matrix, α denotes the weights of the data points, µ is the Lagrange multiplier, and the penalty parameter C controls the width of the margin.

The basic concept of the incremental SVM is to add a new data point to an existing optimal solution. When a new point xt is added, its weight αt is initially set to 0, meaning the newly added point is not a support vector. However, when it is found that xt should become a support vector, the weights of the other points and the threshold µ must be updated in order to obtain an optimal solution for the updated data set. In order to guarantee the robustness of the incremental SVM update process, some prerequisites are needed; for example, the Kuhn-Tucker conditions must be satisfied for all the available

data points before the new data point comes in. The objective of the weight updating is to find updated weights such that the Kuhn-Tucker conditions are satisfied for the updated dataset.
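The exact incremental SVM maintains the Kuhn-Tucker conditions analytically as points are added or removed (Cauwenberghs and Poggio, 2001). As a much simpler illustration of adapting a max-margin linear boundary online, a Pegasos-style stochastic subgradient step on the regularized hinge loss can be sketched; this approximates, rather than reproduces, the exact incremental solution:

```python
import numpy as np

def hinge_step(w, b, x, y, lam=0.01, t=1):
    """One stochastic subgradient step on the regularized hinge loss
    (Pegasos-style sketch). Label y must be +1 or -1."""
    eta = 1.0 / (lam * t)                  # decaying learning rate
    if y * (w @ x + b) < 1:                # margin violated: informative point
        w = (1 - eta * lam) * w + eta * y * x
        b = b + eta * y
    else:                                  # margin satisfied: shrink weights only
        w = (1 - eta * lam) * w
    return w, b
```

Only margin-violating points move the boundary, which loosely mirrors how the incremental SVM promotes a new point to a support vector only when it affects the solution.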

4.2.4 Results and Discussions

Four measures are adopted here to quantify the performance of the fault detection model: the true positive rate (TPR); the false positive rate (FPR); the receiver operating characteristic (ROC) curve, a graphical plot that illustrates the diagnostic ability of a binary classifier by plotting the TPR against the FPR at different threshold settings; and the area under the ROC curve (AUC), which tells how capable the model is of distinguishing between classes.

Fig. 4.5 Off-line model result


TPR and FPR are derived as:

TPR = (# of detected faulty samples) / (# of total faulty samples) × 100%  (4.2)

FPR = (# of healthy samples detected as faulty) / (# of total healthy samples) × 100%  (4.3)
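Eqs. (4.2)-(4.3) translate directly to code; the label encoding (1 = faulty, 0 = healthy) is an assumed convention:

```python
import numpy as np

def tpr_fpr(y_true, y_pred):
    """TPR and FPR in percent, per Eqs. (4.2)-(4.3).
    Labels: 1 = faulty, 0 = healthy (assumed encoding)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))   # detected faulty samples
    fp = np.sum((y_true == 0) & (y_pred == 1))   # healthy detected as faulty
    tpr = 100.0 * tp / max(np.sum(y_true == 1), 1)
    fpr = 100.0 * fp / max(np.sum(y_true == 0), 1)
    return tpr, fpr
```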

Fig. 4.5 shows the detection results of the off-line model. Fig. 4.6 shows the improved model performance as more and more online streaming samples are selected to update the data library. By comparing the results in Fig. 4.5 and Fig. 4.6, it is validated that the on-line sampling and adaptive model updating mechanism can help improve the fault detection accuracy.


(a)

(b)


(c)

(d)


(e)

(f)

Fig. 4.6 On-line model results when different online streaming samples are learned: (a) initial model performance from off-line modeling, no samples are learned; (b) model performance when streaming sample 4 is learned; (c) model performance when streaming sample 6 is learned; (d) model performance when streaming sample 18 is learned; (e) model performance when streaming sample 31 is learned; (f) model performance when streaming sample 40 is learned.

To validate the superiority of the proposed methodology, several typical online learning methods are adopted for comprehensive benchmarking. The benchmark methods chosen here are from the online algorithm library LIBOL (available at http://libol.stevenhoi.org/) developed by Hoi et al. (Hoi et al., 2014). According to the developers, LIBOL is designed as an online learning tool consisting of many classical and state-of-the-art online algorithms, serving both as a practical tool and as an experimental platform for algorithm research. The 9 chosen benchmark algorithms are listed below:

• Perceptron: the classical neural net introduced by Rosenblatt (Rosenblatt, 1958), which can be taken as a greedy gradient-descent algorithm for learning a linear classifier.

• 2nd order Perceptron: an extension of the Perceptron algorithm that updates the weights based on a second-order regularizer (Cesa-Bianchi et al., 2005).

• Relaxed Online Maximum Margin: developed by repeatedly choosing the hyperplane that classifies previously seen samples correctly with the maximum margin (Li and Long, 2000).

• Approximate Maximal Margin Classification: inspired by SVM, this algorithm provides an efficient solution to approximate the maximal margin hyperplane with regard to the p-norm (Gentile, 2001).

• Online Gradient Descent: first-order gradient descent developed in an online form to solve the online convex programming problem (Zinkevich, 2003).

• Passive Aggressive: also a linear online classifier, which introduces a Lagrange operator for online weight updating (Crammer et al., 2006).


• Confidence-Weighted: assigns each parameter a confidence level depicted by a Gaussian distribution; parameters with lower confidence levels are updated more frequently (Dredze et al., 2008).

• Soft Confidence-Weighted: a variant of the Confidence-Weighted algorithm that considers second-order terms during the online optimization (Wang et al., 2013).

• Adaptive Regularization of Weight Vectors: improves the Confidence-Weighted algorithm by designing an adaptive regularization of the prediction function, which is capable of handling non-separable or noisily labeled data (Crammer et al., 2009).

The chosen benchmark methods listed above cover different online algorithm families: classical, recent, first-order based, second-order based, etc. One main difference between these algorithms and the proposed method is that they only use the prediction error to trigger a model update, whereas the proposed method considers prediction error, prediction uncertainty, and data representativeness.
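As a contrast with the SIT-gated updates, the classical Perceptron (the first benchmark above) updates on every misclassified sample, with no screening of sample importance; a minimal sketch:

```python
import numpy as np

def perceptron_online(stream, n_features, eta=1.0):
    """Classical online Perceptron: the model is updated every time a
    sample is misclassified, with no importance screening."""
    w = np.zeros(n_features)
    b = 0.0
    n_updates = 0
    for x, y in stream:                       # y in {+1, -1}
        if y * (w @ x + b) <= 0:              # error-triggered update only
            w += eta * y * x
            b += eta * y
            n_updates += 1
    return w, b, n_updates
```

Counting `n_updates` on a drifting stream makes the contrast concrete: error-triggered learners update on every mistake, while the SIT additionally filters by uncertainty and representativeness.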

Fig. 4.7 and Fig. 4.8 show the comparison between the benchmark algorithms and the proposed method. One can see that the proposed method's prediction accuracy is quite comparable with the best online learning algorithms, but it has a much lower model updating frequency, which becomes more evident when abundant historical data samples are available. Moreover, after each model update, the proposed method shows more stable behavior than the benchmark algorithms. The reasons for the proposed method's superiority can be summarized as follows:


Fig. 4.7 Prediction accuracy benchmarking with other algorithms

Fig. 4.8 Online sample selection benchmarking with other algorithms


1. The proposed method is capable of continuously enriching the DL with informative and important samples for model updating, whereas the online algorithms do not keep a memory of previous samples.

2. The proposed method considers three metrics for sample selection via the SIT: sample representativeness, prediction accuracy, and prediction confidence. The online algorithms only consider model performance; they tend to add all misclassified samples for model updating in an inefficient way and introduce model performance fluctuations.

4.2.5 Summary

To sum up, based on the proposed adaptive PHM framework, an adaptive fault detection modeling method is developed and validated with a real application of HDD online fault detection. The proposed method, with its adaptive self-learning ability, is verified to predict HDD failures before they occur by monitoring the streaming-in samples. Compared with the traditional static fault detection model, the proposed method obtains better performance with acceptable computational cost. Compared with the online algorithms, the proposed method holds advantages in important sample selection from the streaming-in sequence, model prediction stability, and model updating efficiency.


4.3 Case Study II – Adaptive Virtual Metrology of Chemical-Mechanical Planarization Process

In this section, the proposed methodology is applied to an industrial use case in semiconductor manufacturing: building an adaptive virtual metrology (VM) model of the chemical-mechanical planarization (CMP) process.

4.3.1 Background

In this case study, an adaptive VM model is developed based on the proposed methodology and evaluated using a public dataset of the CMP manufacturing process from the PHM data challenge 2016 (PHM Society, 2016). VM is essential research for manufacturing processes, aiming to estimate difficult-to-measure parameters of products or processes through computer programs. In the literature, VM models can be either physics-based or data-driven, and the latter have become more popular due to the advancement of intelligent analytics.

In the literature, various data mining and machine learning techniques have been employed to build VM models. Wan et al. (2014) applied Multiple Linear Regression (MLR), the Least Absolute Shrinkage and Selection Operator (LASSO), Artificial Neural Networks (ANN), and Gaussian Process Regression (GPR) to the Chemical Vapor Deposition (CVD) process to perform a comparative study. Other data-driven models such as Partial Least Squares Regression (PLSR) (Park and Kim, 2016) and Support Vector Regression (SVR) (Kang et al., 2016) are also widely studied in the literature for VM applications. Kang et al. (2011) proposed a VM approach for Run-to-Run (R2R) control based on the Exponentially Weighted Moving Average (EWMA), which was validated with a simulated dataset. Kim et al. (2012) conducted a preliminary study benchmarking several

machine learning algorithms such as Principal Component Analysis (PCA), Linear Regression (LR), and SVM for faulty wafer detection, and the results showed a high true positive rate. Khan et al. (2008) investigated the application of VM for wafer-to-wafer (W2W) control, in which the quality of VM data for feedback control is comprehensively discussed. Lynn et al. (2011) proposed a sliding-window VM with PLSR, ANN, and GPR to estimate plasma etch rates over multiple chamber maintenance events. Hung et al. (2007) presented a VM scheme to predict CVD thickness, and the effectiveness of the proposed method was validated with metrology results. Jia et al. (2018a) proposed an adaptive VM method with Group Method of Data Handling (GMDH)-type polynomial neural networks for the CMP process. The validation results provide improved accuracy in comparison with several candidate methods.

One challenge for data-driven methods is that manufacturing processes are subject to drift over production cycles. This renders regular ML methods less accurate, since these models are usually trained on off-line data and are not updated online when a new example is observed. To combat this challenge, just-in-time (JIT) modeling is found useful. A typical JIT model estimates the target value by establishing a local model based on historical samples similar to the new sample (Uchimaru and Kano, 2016). Hirai and Kano (2015) proposed an adaptive VM with Locally Weighted Partial Least Squares Regression (LW-PLS) to predict the etching conversion differential in the dry etching process. Chan et al. (2018) proposed a GPR-based JIT model with a LASSO-based variable shrinkage criterion to select the significant variables in each local model. Several other JIT modeling techniques, such as the Sequential Update Model (SUM), Locally Weighted Regression (LWR), and LW-PLS, are proposed and benchmarked in Hirai and Kano (2015).


Although JIT modeling has several advantages, it is inefficient for online prediction since it requires querying similar samples from the off-line DL and training a local model at each time step. Moreover, JIT models do not have a long-term memory of previous models. Even worse, the search complexity of JIT rises rapidly as samples accumulate over time, which renders JIT less appropriate for online VM applications. To tackle this challenge, several dynamic sampling strategies have been proposed using active learning (AL) techniques to prioritize the important samples for metrology. Pioneering studies on AL-based dynamic sampling strategies can be found in (Baagøe-Engels and Stentoft, 2016; Kang and Kang, 2017; Wan et al., 2013). Kang and Kang (2017) proposed an intelligent VM system to select samples with low reliability for actual metrology, in which an ensemble of ANNs is employed and the variance of predictions from the different ANNs is adopted to quantify reliability. Wan et al. (2013) proposed a dynamic sampling method for plasma etch processes with a GPR model, in which a sampling rule was set by the prediction variance.

Another method to tackle drift is the stitching model, which divides the whole working region into different local regimes and builds a set of local linear models (Chan et al., 2018). However, this method needs a comprehensive data set covering the whole working region, and the model management and updating mechanisms are complicated.

Summarizing these studies, the important samples for online metrology are identified essentially based on prediction uncertainties provided by probabilistic models, such as ensemble ANNs (Kang and Kang, 2017), GPR (Wan et al., 2013), etc. In the author's experience, prediction uncertainties are often model dependent and sometimes closely related to the model parameter settings, such as the choice of kernel functions, initialization strategies, etc. Different choices of VM model or model parameters may draw

contradictory conclusions. Therefore, it might be dangerous to devise a sampling strategy solely based on data-driven models, and it is found that an SIT based only on prediction uncertainty might be misleading on certain occasions.

To counter the challenges mentioned above for semiconductor manufacturing processes with dynamic properties, and inspired by a preliminary study of VM model adaptation by Han et al. (2009), this study proposes a methodology to efficiently track the process drift online while taking advantage of the previous model information. In the proposed method, online tracking is achieved by an Auto-Regressive eXogenous (ARX) structure model and an online learning mechanism based on the Bayesian rule. To avoid frequent online model adjustment, an SIT is designed to pre-screen the online samples, and only the important ones that pass the SIT are adopted for model updating. The SIT determines the sample importance based on the usefulness and the freshness of a sample. The usefulness of a sample is determined by the model Error Test (ET), which identifies samples with a large prediction error or a large prediction uncertainty. The freshness of a sample is decided by the Freshness Test (FT), which judges the closeness of the incoming sample to an off-line DL. If an online sample passes either the ET or the FT, it is regarded as important and is incremented to the VM model. For the ET, the prediction uncertainty is depicted by a probabilistic prediction model, and the Bayesian ARX is mainly adopted in the discussion. For the FT, an off-line sample selection algorithm is designed to avoid a huge off-line DL and to boost the computational efficiency of the online test. This off-line sample selection algorithm is inspired by the idea of SVM, which reduces a large training set into a smaller set of support vectors. In the present case, the off-line algorithm finds a smaller set of important samples that can pass the proposed SIT.


As a summary, the contributions and novelties of this work are summarized as follows.
1) An SIT is proposed for both off-line and on-line sample selection purposes. The SIT evaluates the importance of a sample based on the sample freshness and usefulness.
2) An off-line sample selection algorithm is proposed based on the SIT. It reduces the size of the off-line DL and thus boosts the efficiency of the SIT in online deployment. In the following discussion, the DL is defined as the dataset for model training. It is initialized as a subset of the off-line DB and will be updated dynamically online.
3) An online tracking algorithm is proposed to model dynamic drift in manufacturing processes. Online observations are screened by the proposed SIT, and the important samples are incorporated into the VM model incrementally.

The effectiveness of the proposed methodology is demonstrated using public CMP data and the results are benchmarked with existing approaches.

4.3.2 CMP Process Introduction

The CMP tool is widely applied to remove material from the surface of the wafer. The material removal process in CMP uses corrosive chemical slurry together with a polishing pad and a retaining ring to polish the silicon oxide, polysilicon, or metal layers. A schematic diagram of CMP is shown in Fig. 4.9. During the CMP polishing process, the polishing pad's capability of material removal diminishes. Over time, the polishing pad needs to be replaced with a new pad. Meanwhile, the dresser's capability to roughen the polishing pads is also reduced, and the dresser must be replaced when its performance is unable to meet the polishing requirements. The material removal rate (MRR) is a crucial indicator for semiconductor process control, but direct measurement not only causes time delay but also interrupts process reliability and consistency. Therefore, it is necessary to develop a robust VM model with adaptive self-learning ability to predict the polishing MRR from a wafer.

Fig. 4.9 CMP process schematic diagram

Fig. 4.10 Metrology and Virtual Metrology in semiconductor manufacturing process

The available variables and their descriptions are listed in Table 4.3. The whole data set can be divided into different subgroups based on the chamber ID and stage. To achieve a better MRR prediction model, the VM model is trained in each subgroup individually, as several studies show that the different subgroups in this case have different random drifting properties (Di et al., 2017; Feng et al., 2019b; Jia et al., 2018a). The subgroups used in this case study are tabulated in Table 4.5. One challenging issue in this dataset is the strong uncertainty of MRR caused by dynamic issues such as variable values drifting over time, which impairs the prediction accuracy. Fig. 4.11 shows two typical variables with uncertainties. There are clear data recurring and data drifting phenomena in the selected features. Therefore, adaptive model updating is necessary in this case to guarantee a robust online VM model with good performance.

Fig. 4.11 Dynamic phenomena in PHM 2016 dataset: (a) recurring; (b) drifting

4.3.3 Methodology

Online ARX Bayesian

The overall modeling flowchart is shown in Fig. 4.12. An online Bayesian ARX model with SIT is proposed to track the dynamic processes with slow drift in semiconductor


Table 4.3 Data description of 2016 PHM data challenge CMP data

Sensor No.  Sensor Name                    Description
X1          machine_id                     Numeric ID of machine
X2          machine_data                   Numeric ID of wafer ring location in machine
X3          timestamp                      Seconds
X4          wafer_id                       Number representing ID of wafer
X5          stage                          A or B, representing a different type of processing stage
X6          chamber                        Chamber in machine for wafer processing
X7          usage_of_backing_film          A usage measure of polish-pad backing film
X8          usage_of_dresser               A usage measure of dresser
X9          usage_of_polishing_table       A usage measure of polishing table
X10         usage_of_dresser_table         A usage measure of dresser table
X11         pressurized_chamber_pressure   Chamber pressure
X12         main_outer_air_bag_pressure    Pressure related to wafer placement
X13         center_air_ba_pressure         Pressure related to wafer placement
X14         retainer_ring_pressure         Pressure related to wafer placement
X15         ripple_air_bag_pressure        Pressure related to wafer placement
X16         usage_of_membrane              A usage measure of polishing membrane
X17         usage_of_pressurized_sheet     A usage measure of wafer carrier flexible sheet
X18         slurry_flow_line_a             Flow rate of slurry type A
X19         slurry_flow_line_b             Flow rate of slurry type B
X20         slurry_flow_line_c             Flow rate of slurry type C
X21         wafer_rotation                 Rotation rate of wafer
X22         stage_rotation                 Rotation rate of stage
X23         head_rotation                  Rotation rate of head
X24         dressing_water_status          Status of dressing water
X25         edge_air_bag_pressure          Pressure of bag on edge of wafer

manufacturing, where ARX (Särkkä, 2013) is a linear representation of a dynamic system in discrete time and is the theoretical basis for many methods in process dynamics and control analysis. The online Bayesian ARX model developed in this study deals with online process changes, and the SIT is developed to screen online samples and determine the important ones for online model update. In addition, the SIT can be used for off-line sample selection to achieve a highly accurate model with less training data.

A Bayesian ARX model is studied for VM modeling in this work; it is a 1st-order ARX model as described in Eq. 4.4.

$$y_k = \theta_1 + \theta_2^T x_k + \theta_3 y_{k-1} + \varepsilon = \theta^T H_k + \varepsilon \tag{4.4}$$

where

$$H_k = \left[1,\, x_k^T,\, y_{k-1}\right]^T, \quad \theta = \left[\theta_1,\, \theta_2^T,\, \theta_3\right]^T \sim N(m_0, P_0), \quad \varepsilon \sim N(0, \sigma^2) \tag{4.5}$$

It is important to note that $y_{k-1}$ in Eq. 4.4 denotes the most recent metrology value, $x_k \in \mathbb{R}^d$ refers to the d-dimensional features extracted from the current sample, and $\varepsilon$ is the process noise that follows a Gaussian distribution.

Based on the model structure in Eq. 4.4, the posterior distribution of the unknown parameters $\theta$ is estimated using Bayes' rule. The prior distribution, likelihood function, and posterior distribution are described in Eq. 4.6 to Eq. 4.12.


Prior:

$$p(\theta) = N(\theta \mid m_0, P_0) \tag{4.6}$$

Likelihood:

$$p(y_k \mid \theta) = N\left(y_k \mid \theta^T H_k, \sigma^2\right) \tag{4.7}$$

Posterior (batch training):

$$p(\theta \mid y_{1:n}) \propto p(\theta)\prod_{k=1}^{n} p(y_k \mid \theta) = N(\theta \mid m_0, P_0)\prod_{k=1}^{n} N\left(y_k \mid \theta^T H_k, \sigma^2\right) = N(\theta \mid m_n, P_n) \tag{4.8}$$

where

$$m_n = \left(P_0^{-1} + \frac{1}{\sigma^2} H^T H\right)^{-1}\left(\frac{1}{\sigma^2} H^T y + P_0^{-1} m_0\right) \tag{4.9}$$

$$P_n = \left(P_0^{-1} + \frac{1}{\sigma^2} H^T H\right)^{-1} \tag{4.10}$$

$$H = \left[H_1, H_2, \ldots, H_n\right]^T \tag{4.11}$$

$$y = \left[y_1, y_2, \ldots, y_n\right]^T \tag{4.12}$$
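Eq. 4.9 and Eq. 4.10 are the standard Bayesian linear-regression update, and can be sketched directly in NumPy. The prior, noise level, and toy regressor values below are illustrative assumptions:

```python
import numpy as np

def batch_posterior(H, y, m0, P0, sigma2):
    """Batch posterior N(theta | m_n, P_n) of Eq. 4.8 to Eq. 4.12.
    H: (n, p) matrix whose rows are H_k^T; y: (n,) target vector."""
    P0_inv = np.linalg.inv(P0)
    Pn = np.linalg.inv(P0_inv + (H.T @ H) / sigma2)   # Eq. 4.10
    mn = Pn @ ((H.T @ y) / sigma2 + P0_inv @ m0)      # Eq. 4.9
    return mn, Pn

# toy ARX regressors [1, x_k, y_{k-1}] with known weights
rng = np.random.default_rng(0)
H = np.column_stack([np.ones(200), rng.normal(size=200), rng.normal(size=200)])
theta_true = np.array([1.0, 2.0, 0.5])
y = H @ theta_true + 0.1 * rng.normal(size=200)
mn, Pn = batch_posterior(H, y, np.zeros(3), np.eye(3), sigma2=0.01)
```

With enough samples the posterior mean approaches the data-generating weights, and the posterior covariance `Pn` shrinks accordingly.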

After the posterior distribution of the model parameters is obtained, the predictive distribution of the Bayesian ARX model can be further written as:


Predictive distribution:

$$p(\tilde{y} \mid y_{1:n}, \tilde{x}) = \int p(\tilde{y} \mid \theta)\, p(\theta \mid y_{1:n})\, d\theta = \int N\left(\tilde{y} \mid \theta^T \tilde{H}, \sigma^2\right) N(\theta \mid m_n, P_n)\, d\theta = N\left(\tilde{y} \mid m_n^T \tilde{H}, \sigma_n^2\right) \tag{4.13}$$

where

$$\sigma_n^2 = \sigma^2 + \tilde{H}^T P_n \tilde{H} \tag{4.14}$$

$$\tilde{H} = \left[1,\, \tilde{x}^T,\, \tilde{y}_{-1}\right]^T \tag{4.15}$$

$\tilde{y}_{-1}$ is the previous observation before $\tilde{y}$.
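Given a posterior mean and covariance, the predictive distribution of Eq. 4.13 and Eq. 4.14 reduces to two inner products; a minimal sketch (values in the usage note are illustrative):

```python
import numpy as np

def predictive(mn, Pn, H_new, sigma2):
    """Predictive mean and variance of Eq. 4.13 and Eq. 4.14 for a new
    regressor vector H_new = [1, x~^T, y~_{-1}]^T."""
    mean = float(mn @ H_new)                  # m_n^T H~
    var = float(sigma2 + H_new @ Pn @ H_new)  # sigma^2 + H~^T P_n H~
    return mean, var
```

For instance, with a posterior mean of `[1, 2, 0.5]`, a zero posterior covariance, and `H_new = [1, 1, 1]`, the predictive mean is 3.5 and the predictive variance equals the noise variance alone.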

One advantage of Bayesian ARX is that the model parameters can be updated recursively online when new observations arrive. This means the model parameters are time-variant to follow dynamic changes in processes. The posterior distribution of Bayesian ARX from time k−1 to time k is derived as:

$$p(\theta \mid y_{1:k}) \propto p(y_k \mid \theta)\, p(\theta \mid y_{1:k-1}) \propto N\left(y_k \mid \theta^T H_k, \sigma^2\right) N(\theta \mid m_{k-1}, P_{k-1}) \propto N(\theta \mid m_k, P_k) \tag{4.16}$$

where

$$m_k = \left(P_{k-1}^{-1} + \frac{1}{\sigma^2} H_k H_k^T\right)^{-1}\left(\frac{1}{\sigma^2} H_k y_k + P_{k-1}^{-1} m_{k-1}\right) \tag{4.17}$$

$$P_k = \left(P_{k-1}^{-1} + \frac{1}{\sigma^2} H_k H_k^T\right)^{-1} \tag{4.18}$$


By using the matrix inversion lemma presented in Murphy (2012), the recursive update of the model posterior can be further written as:

  T 2 Sk = HkPk−1H + σ  k    T −1 Kk = Pk−1Hk Sk (4.19)  mk = mk−1 + Kk[yk − Hkmk−1]     T Pk = Pk−1 − KkSkKk .

Model Development

The overall modeling flowchart of CMP MRR estimation is shown in Fig. 4.12. The other benchmark methods adopted in this study, namely static VM, incremental VM, and JIT VM, are illustrated in Fig. 4.13 and Table 4.4. The static VM model is trained merely on the off-line available data and is not updated once it is built and deployed for on-line MRR prediction. The incremental VM model with a retraining strategy is retrained periodically after newly arrived streaming samples accumulate, to ensure the model can always capture new emerging patterns in the streaming data. The JIT VM model updates after each new streaming sample arrives by simply adding the incoming sample to the database and then searching all historical samples for similar ones for modeling. A general summary of the advantages and disadvantages of these methods is tabulated in Table 4.4.

The flowcharts in Fig. 4.14 and Fig. 4.15 depict the off-line and on-line part of the proposed methodology. In the off-line part, the goals include:

• selecting a representative sample subset DL;

• training a VM model based on the sample subset DL.


Fig. 4.12 Flowchart of Adaptive Virtual Metrology Modeling for MRR prediction in CMP.

Table 4.4 Typical VM methods adopted in the semiconductor manufacturing industry

• Static VM. Pros: easy for modeling and implementation. Cons: can't deal with dynamic issues in the process.

• VM with retrain strategy. Pros: adaptive to data drift or a dynamic work environment. Cons: heavy time and resource consumption; can't adjust to data changes timely.

• Just-in-time model. Pros: efficient for remodeling. Cons: need to search the historical database; heavy time and resource consumption.



Fig. 4.13 Other VM methods for benchmarking


The VM model and DL are initialized based on a small number of randomly selected samples from the database. Then the algorithm updates the DL and VM model using the SIT through iterations. At each iteration, only one sample is added to the DL, and the VM model is updated accordingly. The iteration stops when the stopping criterion is met. In the online manufacturing environment, the DL and VM model obtained through the off-line iterations are used as a starting point for online tracking. The SIT keeps checking the importance of the incoming observation s_t. If s_t passes the SIT, the DL is updated and the VM model is incremented through the recursions described in Eq. 4.19.
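The off-line iteration described above can be sketched as a simple grow-one-sample-per-iteration loop. The test and error proxy below are simplified stand-ins for the SIT and the VM training error, and all sizes and tolerances are illustrative:

```python
import numpy as np

def offline_selection(DB, init_size=5, tol=0.01, rng=None):
    """Grow a data library DL from the database DB one sample per
    iteration; stop when no remaining sample passes the (stand-in)
    importance test or the error improvement falls below tol."""
    rng = rng or np.random.default_rng(0)
    chosen = list(rng.choice(len(DB), size=init_size, replace=False))
    prev_err = None
    while True:
        remaining = [i for i in range(len(DB)) if i not in chosen]
        if not remaining:
            break
        # stand-in SIT: the remaining sample farthest from the DL "passes"
        gaps = [min(np.linalg.norm(DB[i] - DB[j]) for j in chosen)
                for i in remaining]
        best = int(np.argmax(gaps))
        if gaps[best] == 0.0:       # no remaining sample passes the test
            break
        chosen.append(remaining[best])
        err = float(np.mean(gaps))  # stand-in for the model training error
        if prev_err is not None and abs(prev_err - err) / prev_err < tol:
            break                   # improvement below tolerance
        prev_err = err
    return chosen
```

The real algorithm uses the ET/FT-based SIT and the VM prediction error in place of the distance-based stand-ins, but the control flow of the iterative selection is the same.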

Fig. 4.14 The proposed methodology for off-line sample selection and model training.

Fig. 4.15 The proposed methodology for the on-line sample selection and model update.

The proposed SIT in Chapter 4 evaluates the sample importance based on the FT and the ET to realize the sample selection. The FT determines the freshness of the new sample by comparing its kernel distance with the existing samples in the most up-to-date DL, which selects samples to explore the data space. The ET decides the sample importance by considering both the prediction error and the prediction uncertainty, which selects samples to improve model performance. More details about the SIT can be found in Chapter 4.


The detailed procedure of the model realization is the same as described in Algorithm 1 and Algorithm 2 in Chapter 4. In Algorithm 1, the DL and VM model are updated through iterations and a new important sample is selected at each iteration. It is noted that the VM model update in Algorithm 1 can either use the batch re-training in Eq. 4.8 to Eq. 4.12 or use the online recursion in Eq. 4.16. In addition, the StoppingCriterion in Algorithm 1 is evaluated from two aspects: 1) when all the remaining samples in the DB fail to pass the SIT, the algorithm stops; 2) when the improvement in prediction error is sufficiently small (set as 1%) compared with the previous iteration, the algorithm stops. In Algorithm 2, the DL and VM model obtained in Algorithm 1 are used as the starting point for online tracking. In the on-line tracking, the SIT decides whether the incoming observation is important or not. If yes, it is added to the DL and the VM model is updated through the recursion equations in Eq. 4.16. In this way, the model can effectively follow the slow drift of the manufacturing process. One limitation of the proposed method in Algorithm 2 is that the model doesn't have a forgetting mechanism. This is limited by Bayesian ARX itself. Possible solutions involve replacing Bayesian ARX with sparse machine learning algorithms such as SVR or GPR, which will be explored in future investigations. Besides the proposed online SIT, the authors find it important to limit the duration to obtain a new metrology sample.

Data Description and Design of Experiments

The proposed method is validated based on the public CMP data published in PHM data competition in 2016. This dataset consists of the trace signals from multiple wafer runs. A summary of the dataset can be found in Table 4.3. More details and background of this dataset can be found in (Jia et al., 2018a,b) or the URL from PHM society website: https://www.phmsociety.org/events/conference/phm/16/data-challenge.


Table 4.5 CMP VM data partition

Group     Chamber ID   Stage   MRR Range (nm/min)   Training   Testing
Group 1   4, 5, 6      A       50-110               798        165
Group 2   4, 5, 6      B       50-110               815        186
Group 3   1, 2, 3      A       140-170              364        73

Typically in the semiconductor area, statistical features of process variables are extracted for VM modeling to predict MRR (Hirai and Kano, 2015). In this work, the step number is used to segment the trace signals, and statistical features, including the mean, variance, and peak-to-peak values, are extracted from each step. The raw features are then finely selected by forward and backward search based on MLR. For details about feature extraction and selection, please refer to the related research work published on this dataset (Feng et al., 2019b; Jia et al., 2018a).
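A minimal sketch of the step-wise feature extraction described above; the representation of the step-number column and the trace layout are assumptions about the raw data format:

```python
import numpy as np

def step_features(signal, step_ids):
    """Mean, variance, and peak-to-peak value of one trace signal,
    computed within each processing step."""
    signal = np.asarray(signal, dtype=float)
    step_ids = np.asarray(step_ids)
    feats = []
    for s in np.unique(step_ids):        # one group of readings per step
        seg = signal[step_ids == s]
        feats += [seg.mean(), seg.var(), seg.max() - seg.min()]
    return np.array(feats)
```

Applying this to every trace variable of a wafer run and concatenating the results yields the raw feature vector that the forward/backward MLR search then prunes.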

In this research, three experiments are designed, as listed in Table 4.6. The purposes of the experiment designs are described below:

• Experiment A is designed to validate the off-line part of the proposed method in Fig. 4.14 and Algorithm 3.

• Experiment B is designed to validate the on-line tracking part of the proposed method in Fig. 4.15 and Algorithm 4.

• Experiment C aims to compare the proposed method with existing methods in the literature.

In experiments A and B, the data samples in data group 1 and data group 2 are sorted by timestamp, and the sorted data is further divided into training and testing subsets in each data group, as shown in Fig. 4.16.


In experiment A, the off-line model training and testing are implemented to demonstrate the effectiveness of off-line sample selection and model construction. To validate the robustness of the proposed off-line modeling and sampling methods, various machine learning algorithms, such as Bayesian ARX, SVR, and Random Forest (RF), are benchmarked based on different train/test data partition ratios from 0.4 to 0.8 with a resolution of 0.1. In this experiment, the first part (training data) is taken as the available historical data for off-line model training and the second part is taken as online data for model performance evaluation.

In experiment B, the online tracking algorithm is validated through one-step-ahead prediction. In this experiment, the first part is taken as the available historical data for off-line model training and the second part is taken as online streaming data, which is used to test on-line sampling and model updating. To simulate the on-line MRR prediction process, only the MRR value one step ahead is predicted in each step, and the mean squared error (MSE) of all testing data is used to evaluate the performance of the model. Different from the first experiment, which focuses on off-line modeling and on-line prediction respectively, this experiment considers both off-line model initialization and on-line model evolution. Concerning the off-line training data, the initial model (or off-line model) is first established based on the selected samples. The details of the iterative sample selection and modeling process are presented in section 4.4 Off-line Sampling and Modeling. Thereafter, during the online learning process, the informative samples are selected step by step. Once a streaming sample passes the SIT, it is used to update the data library and the developed model, which guarantees that the data library and model can always adapt to the dynamic changes caused by either environmental noise or other unobserved drift issues. After each model update, the new model is used to predict the MRR value in the next step.

In experiment C, the proposed method and existing methods are benchmarked under the same setting as the data competition and the literature (Jia et al., 2018a; Li et al., 2019b; Wang et al., 2017). MATLAB 2017b was used in this work for model development. Experiments were run on a Windows 10 Enterprise computer with an Intel Xeon processor running at 3.50 GHz and 32 GB of RAM.

Fig. 4.16 Data separation for modeling training and testing

4.3.4 Results and Discussions

Experiment A: Off-Line Sample Selection and Modeling

The benchmark results of Bayesian ARX, SVR, and RF are shown in Table 4.7. Besides MSE, another measure, the mean absolute percentage error (MAPE), is also employed


Table 4.6 Design of experiments

Experiment   Design of Experiments

A            Step 1: Sort the timestamps of both training and testing samples;
             Step 2: Vary the data split ratio for model training and testing;
             Step 3: Benchmark multiple static VM models.

B            Step 1: Sort the timestamps of both training and testing samples;
             Step 2: Use the first 60% of samples for model training;
             Step 3: Make one-step-ahead predictions for the remaining 40% of samples.

C            Use the same training and testing data sets as in the PHM data
             competition to benchmark against the existing literature.

to quantify the prediction accuracy of the VM methods. MAPE is defined as:

$$\mathrm{MAPE} = \frac{1}{N}\sum_{i=1}^{N}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \times 100 \tag{4.20}$$

where $y_i$ is the ground truth of the i-th sample, and $\hat{y}_i$ is the corresponding prediction. Based on the MSE and MAPE of the VM models with and without sample selection, one can find that a comparable model is derived based on a smaller sample subset. Table 4.7 also tabulates the training complexity of SVR, RF, and Bayesian ARX by the number of selected training samples $N_{tr}$. Therefore, the number of selected training samples out of the total samples serves as a quantitative index for training efficiency: a smaller $N_{tr}$ means better efficiency. The proposed off-line sample selection algorithm effectively removes sample redundancy yet maintains model performance. In addition, Bayesian ARX performs well compared with more complex algorithms such as SVR and RF because it models $y_{k-1}$ as an exogenous variable. This result implies that the MRR is a dynamic time series, where $y_k$ and $y_{k-1}$ are strongly correlated.
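Eq. 4.20 in code form, as a sketch:

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error of Eq. 4.20, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)
```

Because each error is normalized by the true value, MAPE is comparable across data groups whose MRR ranges differ, which is why it complements MSE in Table 4.7 and Table 4.8.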

Table 4.7 Benchmark of static VM models. For data groups 1 and 2 and train/test split ratios from 0.4 to 0.8, the table reports, for the Bayesian ARX, SVR, and RF models, the MSE and number of training samples without sample selection, and the MSE and number of selected samples with sample selection. [The numeric entries of this table could not be recovered from the extracted text.]


Fig. 4.17 and Fig. 4.18 give two examples of the results of the SIT. In Fig. 4.17 (a) and Fig. 4.18 (a), hSI = 1 means the data sample is one of the candidate samples that might be selected. Fig. 4.17 (b) shows an example of a sample selected due to a large prediction error. Likewise, Fig. 4.18 (b) demonstrates an example of a sample selected due to a large prediction uncertainty. These two examples indicate that the proposed algorithm for off-line sample selection functions as expected. The prediction intervals in all figures are based on the 95% confidence interval (significance level α = 0.05).

Fig. 4.17 The (a) hSI and (b) MRR prediction at the 50th iteration for data group 1. hSI = 0 means the sample is not selected; hSI = 1 means the sample is selected as a candidate. Shown here is an example of a sample selected due to a large prediction error.

Fig. 4.19 and Fig. 4.20 visualize the sample selection for the first 100 training samples in data group 1. Approaching the end of the iterations, as shown in Fig. 4.20, very few samples can pass the SIT, and the training error tends to converge, as shown in Fig. 4.21. The training error converges after a certain number of iterations; similarly, the testing error also converges over the iterations. These results demonstrate that the proposed method converges as expected.


Fig. 4.18 The (a) hSI and (b) MRR prediction at the 50th iteration for data group 1. hSI = 0 means the sample is not selected; hSI = 1 means the sample is selected as a candidate. Shown here is an example of a sample selected due to a large prediction uncertainty.

Fig. 4.19 Visualization of (a) hSI and (b) MRR prediction for the first 100 samples in data group 1 at the 100th iteration. All training samples fall into 3 groups: samples with hSI = 0 failed to pass the SIT; samples with hSI = 1 passed the SIT in this iteration and are selected as candidates; samples marked as DL were already selected into the DL in previous iterations.


Fig. 4.20 Visualization of (a) hSI and (b) MRR prediction for the first 100 samples in data group 1 at the 240th iteration. All training samples fall into 3 groups: samples with hSI = 0 failed to pass the SIT; samples with hSI = 1 passed the SIT in this iteration and are selected as candidates; samples marked as DL were already selected into the DL in previous iterations.

Fig. 4.21 Change of MSE for both training and testing samples over iteration steps when split ratio is 0.6.


Table 4.8 Benchmark of experiment B

Method                                                   Data group 1      Data group 2
                                                         MSE     MAPE      MSE     MAPE

M0: MLR                                                  13.17   3.79      24.90   4.99
M1: Bayesian ARX with time-invariant parameters          9.73    3.17      12.87   3.38
M2: Online Bayesian ARX with online retraining           9.10    3.05      11.21   3.17
M3: JIT model                                            9.37    3.10      12.36   3.23
M4: Proposed method (online Bayesian ARX
    with sample selection)                               8.89    3.01      11.10   3.14

Experiment B: On-Line Sample Selection and Model Update

In the setting of this experiment, all samples are first sorted by time. This reorganizes the dataset into multivariate time sequences. The first 60% of the multivariate temporal sequences are adopted for model training; for the remaining 40%, one-step-ahead prediction is implemented. The benchmarked algorithms and results are tabulated in Table 4.8, where both MSE and MAPE are presented. In Table 4.8, M0 is the regular MLR without the AR term of the ARX model, and M1 is trained on off-line data and not updated online. M2 is trained off-line first and then retrained after every new observation. M3 is a regular JIT model with MLR adopted for modeling. The benchmarking results in Table 4.8 demonstrate that the proposed method M4 gives the best prediction accuracy. It is found that the Bayesian ARX (M1) reduces the prediction error significantly compared with the MLR in M0. The prediction error is further reduced by the online Bayesian ARX (M2) with time-variant model parameters. Next, sample selection by the SIT avoids overly frequent online model updates and improves model performance by tracking the process variation and keeping the model parameters updated with the online information from the most recent important samples. Furthermore, the performance of the proposed method is much better than that of the JIT model (M3). In comparison, M4 outperforms the other dynamic models M1, M2, and M3. A detailed comparison between the predicted MRR and the ground truth is shown in Fig. 4.22.

Fig. 4.22 The visualization of predicted MRR by the proposed method.

Fig. 4.23 shows the results of online sample selection at the first 50 time steps. It is found that the less well-predicted samples in Fig. 4.23 (d) are selected to update the DL by the proposed algorithm. Some of these points are selected by the FT while others are picked by the ET. It is important to note that the thresholds dUL and wUL for the FT and ET change dynamically over time, although the variations are too small to be visualized.

Fig. 4.24 shows the number of selected training samples at different time steps. This number increases linearly over time in M2 and M3, because the ARX is retrained at every time step for M2 and the data library for the JIT model (M3) updates at every time step. For M1, the number of training samples is fixed since the model is not updated over time. For the proposed method M4, the number of training samples slowly increases over time, as only the selected new observations are used to update the DL and the model. Fig. 4.25 compares the computation time for model update for M2, M3, and M4. The required computation time for M4 is significantly smaller than for the other methods, since the recursion is adopted for model update.

Fig. 4.23 The visualization of (a) hSI, (b) dt for the FT, (c) wt for the ET, and (d) MRR predictions for one-step-ahead prediction.

Experiment C: Benchmarking With Existing Approaches

Experiment C benchmarks the proposed method against existing approaches in the literature. The training data and testing data are kept the same as in the data competition. The reported prediction MSE of MRR is shown in Table 4.9. Fig. 4.26 and Fig. 4.27 show the detailed predicted MRR results and the regression plot between prediction and ground truth, from which one can tell that the predicted results from the proposed method are very accurate compared with the actual MRR values. Although this is just an initial result,


Fig. 4.24 The number of selected training samples.

Fig. 4.25 The benchmarking of computation time.

the effectiveness of the proposed method is validated.

Fig. 4.26 Predicted MRR from proposed method vs actual MRR

Fig. 4.27 Regression plot of predicted MRR vs. actual MRR

It is found that the prediction accuracy of the proposed method is comparable with that of other sophisticated machine learning models, especially deep learning methods. Meanwhile, the proposed method keeps a simple structure without parameter-tuning effort. Compared with other models, the proposed method holds the following advantages.

1. Both model simplicity and competitive prediction accuracy are achieved;

2. The model is constructed based on selected samples with a smaller size;

3. The computation efficiency for online model update is improved compared with M2 and M3.

Table 4.9 Benchmark with existing methods in literature

Method MSE

M0: MLR (Jia et al., 2018a) 16.24

M1: Bayesian ARX with time-invariant parameters 7.53

M2: Online Bayesian ARX with online retraining 7.09

M3: JIT model 16.03

M4: Proposed method (Bayesian ARX + SI) (Feng et al., 2019b) 7.06

PHM 2016 data challenge champion (Jia et al., 2018b) 7.07

ELM-stacking (Li et al., 2019b) 21.53

Integrated model (Jia et al., 2018a) 6.62

GMDH (Jia et al., 2018a) 6.79

DBN (Wang et al., 2017) 7.29

2nd physical model (Wang et al., 2017) 57.76


4.3.5 Summary

This study proposes an intelligent VM system with adaptive sample selection and online model learning. In the proposed method, the VM model is initially trained off-line and is then recursively incremented online. The off-line part employs an iterative strategy to identify the important samples from a large database. With the selected DL, the training efficiency of the initial VM model is improved while the model performance is maintained. By applying the online SIT, the importance of new observations in an online environment is dynamically identified, and the process drift is tracked by updating the model parameters through recursions. It is noted that the computational complexity of the online parameter estimation is constant, as determined by Eq. 4.19. The proposed online SIT helps improve efficiency by avoiding frequent model updates.

The effectiveness of the proposed method is demonstrated in experiments A and B using a CMP dataset. The advantages of the proposed method over other dynamic models are verified in experiment C.


4.4 Case Study III – Battery Capacity Prognosis

4.4.1 Background

State of Capacity (SoC) and State of Health (SoH) forecasts and RUL prediction are increasingly important in battery prognostics, as they are key parameters of an appropriate battery management strategy to avoid catastrophic failure, enhance battery durability, and optimize cost (Hu et al., 2015). Also, due to the widespread application of Lithium-ion Batteries (LiB) in many industrial sectors, research on SoH and SoC prediction holds great academic value and economic impact. In general, SoH is defined as a performance index describing the degree of degradation of a battery, and SoC denotes the capacity of a battery compared to the capacity in its fully charged state (Ng et al., 2020). In the present study, SoC, SoH, and RUL prediction mainly predicts future battery capacity, which helps the battery to be used to its designed potential and maximum life expectancy before failure.

In past decades, various methods have been proposed for battery SoH prediction. Similarity-based methods ( et al., 2015), fuzzy logic (Salkind et al., 1999), and neural networks (Qian et al., 2010) were explored in several early studies, and satisfactory prediction results were reported. However, a common shortcoming of these methods is the need for abundant historical data, together with the limited capability to describe the prediction uncertainties and to adapt to the in-process dynamics. To address this issue, GPR, particle filters (PF), and relevance vector machines (RVM) have recently been discussed to obtain battery SoH predictions with confidence intervals. Although many recent studies can effectively describe the prediction uncertainty, most of these methods still predict the capacity in a single-cell setting and fail to be robust against the variations among different batteries that go through different usage patterns. Improved prediction accuracy is reported in a more recent study using the multi-task GP (MTGP) (Richardson et al., 2017). In their study, the superiority of MTGP for battery prognostics, especially for long-term prediction, is well demonstrated, and the improvements in the results are attributed to exploiting the cross-correlations between the current SoH trajectory and historical trajectories. Although the results are promising, this method is computationally expensive, since it requires computing the inverse of a large kernel matrix, which makes it more useful in the off-line stage than in an online fashion. This renders it less practical in real applications.

To present an efficient, adaptive, and accurate solution for battery capacity prediction in a multi-cell setting, this study builds an adaptive battery capacity prognosis solution as a case study of the proposed Adaptive PHM methodology. The solution demonstrates an efficient yet effective way to exploit the cross-trajectory correlations without adding much computational complexity to the standard GPR model, while taking advantage of the historically available data.

4.4.2 Methodology

Gaussian Process for Battery Health Prediction

GPR is a non-parametric method that can be used to model complex systems, and it is preferred in battery prognostics due to its ability to provide uncertainty representations. GPR models battery degradation as a real-valued process f(x) characterized by the mean function (MF) m(x) and a covariance function (CovF) k(x, x′), as described below:

y = f(x) ∼ N (m(x), k(x, x′)) (4.21)

In Eq. 4.21, x and y denote the input and output pairs in the training dataset T = {x, y}, and m(x) denotes the mean function, which is often set to 0. The CovF k(x, x′), which describes the similarity between data points, is the key ingredient in GPR, since data points with similar inputs x are likely to have similar target values y (Rasmussen, 2003). By choosing the CovF properly, GPR can model arbitrarily complex systems. According to (Rasmussen, 2003), a large class of CovFs is available, among which the squared exponential (SE) and Matérn (MA) kernel functions (KerF) are commonly studied in battery prognostics (Richardson et al., 2017), as shown below:

kSE(d) = θ₁² exp(−d²/(2θ₂²)) (4.22)

kPER(d) = θ₁² exp(−2 sin²[(2πθ₂)d]) (4.23)

kMA(d) = σ² (2^(1−ν)/Γ(ν)) (√(2ν)d/ρ)^ν Kν(√(2ν)d/ρ) (4.24)

where d = ∥x − x′∥ is the Euclidean distance between two inputs, and Kν denotes the modified Bessel function. The parameters θ₁ and θ₂ in Eq. 4.22 and Eq. 4.23, and σ, ν, and ρ in Eq. 4.24, are hyperparameters that need to be estimated iteratively. Evaluating the covariance function over the inputs x yields a covariance matrix K(x, x), whose element Kij is determined by k(xi, xj). In general, covariance functions have to fulfill Mercer’s theorem, meaning that K(x, x) is symmetric and positive semi-definite (PSD). Complex covariance functions can be constructed from basic covariance functions through operations such as addition and multiplication (Bonilla et al., 2008; Rasmussen, 2003). During model training, the aim of GPR is to learn a regression model y = f(x) + ϵ,
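To make the kernel definitions concrete, the sketch below (an illustrative implementation, not code from this dissertation) evaluates the SE kernel of Eq. 4.22, the Matérn kernel of Eq. 4.24 for the common special case ν = 3/2 (where the modified-Bessel form reduces to a closed expression), and assembles the covariance matrix K(x, x):

```python
import numpy as np

def k_se(d, theta1, theta2):
    """Squared-exponential kernel, Eq. 4.22: theta1^2 * exp(-d^2 / (2 * theta2^2))."""
    return theta1 ** 2 * np.exp(-np.asarray(d, float) ** 2 / (2.0 * theta2 ** 2))

def k_matern32(d, sigma, rho):
    """Matern kernel of Eq. 4.24 for nu = 3/2, where it reduces to
    sigma^2 * (1 + sqrt(3) d / rho) * exp(-sqrt(3) d / rho)."""
    s = np.sqrt(3.0) * np.asarray(d, float) / rho
    return sigma ** 2 * (1.0 + s) * np.exp(-s)

def cov_matrix(x, kernel, **hyp):
    """Covariance matrix with K_ij = k(|x_i - x_j|); symmetric and PSD for valid kernels."""
    d = np.abs(np.asarray(x)[:, None] - np.asarray(x)[None, :])
    return kernel(d, **hyp)
```

For example, `cov_matrix(np.linspace(0, 1, 50), k_se, theta1=1.0, theta2=0.3)` yields a symmetric PSD 50 × 50 matrix whose diagonal equals θ₁².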

where f(x) represents the underlying regression function and ϵ ∼ N(0, σn²) is the random noise. The hyperparameters in the MF and CovF are obtained by minimizing the negative log marginal likelihood (NLML) in Eq. 4.25 based on its partial derivatives, which can be used in conjunction with a numerical optimization routine such as gradient descent to find the optimal hyperparameter settings. More details about the partial derivatives of the NLML and hyperparameter selection can be found in (Rasmussen, 2003).

NLML = −log(p(y|x, θ)) = (1/2) log|K + σn²I| + (1/2) yᵀ(K + σn²I)⁻¹y + (n/2) log(2π) (4.25)
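A numerically stable way to evaluate Eq. 4.25 replaces the explicit matrix inverse with a Cholesky factorization; the sketch below is an illustrative implementation for a zero-mean GP (not code from the dissertation):

```python
import numpy as np

def nlml(y, K, sigma_n):
    """Negative log marginal likelihood of Eq. 4.25 for a zero-mean GP:
    1/2 log|K + s^2 I| + 1/2 y^T (K + s^2 I)^{-1} y + n/2 log(2 pi)."""
    n = len(y)
    Ky = K + sigma_n ** 2 * np.eye(n)
    L = np.linalg.cholesky(Ky)                            # Ky = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K + s^2 I)^{-1} y
    log_det = 2.0 * np.sum(np.log(np.diag(L)))            # log|Ky| from the factor
    return 0.5 * log_det + 0.5 * float(y @ alpha) + 0.5 * n * np.log(2.0 * np.pi)
```

In practice this function is handed to a numerical optimizer (e.g. a gradient-based routine) that searches over the kernel hyperparameters and σn.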

In the model testing phase, the predictive distribution of GPR at input data point x∗ can be described as:

f∗|x, y, x∗ ∼ N(f∗, cov(f∗)) (4.26) where

f∗ = m(x∗) + K(x∗, x)[K(x, x) + σn²I]⁻¹(y − m(x)) (4.27)

cov(f∗) = K(x∗, x∗) − K(x∗, x)[K(x, x) + σn²I]⁻¹K(x, x∗) (4.28)

In the two equations above, f∗ represents the mean of the prediction output and cov(f∗) can be employed to describe the confidence interval.
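Eqs. 4.27 and 4.28 translate directly into code. The following sketch (an illustrative assumption using an SE kernel and a zero mean function, not the dissertation's implementation) computes the posterior mean and covariance via a Cholesky solve:

```python
import numpy as np

def se_kernel(a, b, theta1=1.0, theta2=1.0):
    """Squared-exponential covariance matrix K(a, b) between two input sets."""
    d = np.abs(np.asarray(a, float)[:, None] - np.asarray(b, float)[None, :])
    return theta1 ** 2 * np.exp(-d ** 2 / (2.0 * theta2 ** 2))

def gp_predict(x, y, x_star, sigma_n, kernel=se_kernel):
    """Posterior mean f* (Eq. 4.27) and covariance cov(f*) (Eq. 4.28), zero MF."""
    K = kernel(x, x) + sigma_n ** 2 * np.eye(len(x))
    K_s = kernel(x_star, x)                      # K(x*, x)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    f_star = K_s @ alpha                         # Eq. 4.27 with m(.) = 0
    v = np.linalg.solve(L, K_s.T)
    cov_star = kernel(x_star, x_star) - v.T @ v  # Eq. 4.28
    return f_star, cov_star
```

The diagonal of `cov_star` gives the predictive variance used to draw the confidence band around the capacity forecast.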

Fig. 4.28 Three GPR models for battery SoH prediction: (a) Single Input Single Output (SISO) GPR; (b) Multiple Input Multiple Output (MIMO) GPR; (c) Multiple Input Single Output (MISO) GPR.


For battery SoH prediction, three different GPR models are employed in the current literature, as shown in Fig. 4.28. The SISO-GPR in Fig. 4.28(a) is discussed in (Goebel et al., 2008) to predict the battery internal resistance, and the predicted values are subsequently transferred to the capacity domain. Though the result is satisfactory, the extrapolation accuracy deteriorates rapidly as the prediction horizon increases. To make GPR more useful for long-term prediction, Liu et al. (Liu et al., 2013) propose using a linear or quadratic mean function to improve the prediction accuracy. However, the assumed form of the MF is rather subjective. Another study (Li and Xu, 2015) proposes a mixture-of-GP model to predict the battery capacity in a multi-cell setting. This model employs the MISO-GPR model in Fig. 4.28(c). The unknown parameters are updated recursively using a PF, and the GPR is mainly employed to initialize the parameters of the PF model.

A more recent study using the MTGP model for battery capacity prediction demonstrates significant improvements in long-term prediction. The MTGP model was originally proposed in (Bonilla et al., 2008; Dürichen et al., 2014) and also demonstrates great potential in physiological time series prediction. The MTGP model employs the MIMO-GPR structure in Fig. 4.28(b), since it can predict multiple outputs (tasks) at the same time. In comparison, the models in Fig. 4.28(a) and (c) are referred to as single-task GPR models. The uniqueness of MTGP is that it models the cross-trajectory correlation by constructing a novel CovF, as shown in Eq. 4.29.

kMTGP(x, x′, l, l′) = kc(l, l′) × kt(x, x′) (4.29)

where kc and kt model the cross-trajectory correlation and the covariance for one trajectory, respectively. l, l′ ∈ {1, 2, ..., m} represent the indices of tasks, and there are m tasks in total. x, x′ represent the time indices for tasks l and l′, respectively. Based on the kernel function in Eq. 4.29, one can thus construct the covariance matrix for MTGP as in Eq. 4.30.

KMT GP (X, l, θc, θt) = Kc(l, θc) ⊗ Kt(X, θt) (4.30)

where KMTGP is a (Σt nᵗ) × (Σt nᵗ) matrix and nᵗ denotes the number of data points for the t-th task. Kc and Kt are m × m and nᵗ × nᵗ matrices, respectively. ⊗ is the Kronecker product. θc and θt are the hyperparameters of the MTGP model.
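Eq. 4.30 is simply a Kronecker product, which the sketch below makes explicit (illustrative; the Kc and Kt used here are arbitrary valid covariance matrices, not fitted ones):

```python
import numpy as np

def mtgp_cov(K_c, K_t):
    """MTGP covariance of Eq. 4.30: the (m*n) x (m*n) Kronecker product of the
    m x m task-correlation matrix K_c and the n x n temporal covariance K_t.
    Entry (l*n + p, l'*n + q) equals K_c[l, l'] * K_t[p, q], mirroring Eq. 4.29."""
    return np.kron(K_c, K_t)
```

With m = 5 trajectories of n = 100 points each, KMTGP is already a 500 × 500 matrix, and the O(m³n³) cost of inverting it is precisely what the proposed method avoids.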

Improvements of MTGP in battery capacity prognosis are mainly contributed by exploiting the cross-correlation among historical trajectories, as stated in (Richardson et al., 2018). However, the shortcoming of MTGP is also obvious, since it requires inverting the large kernel matrix in Eq. 4.30 when historical data are abundant. If it is assumed for simplicity that nj is the same for every training trajectory j ∈ {1, 2, ..., m}, i.e., nj = n, the computational complexity of evaluating MTGP is O(m³n³), compared with O(n³) for standard GPR (Dürichen et al., 2014).

By comparing the three model structures in Fig. 4.28, one can find that the SISO-GP is overly simplistic, as it predicts battery capacity in a single-cell setting, while the MIMO-GP is overly complex for battery prediction, since only one prediction output is desired in most scenarios. Moreover, most current GPR-related battery capacity and health research focuses on off-line scenarios, which hinders implementation in practice. Inspired by these observations, it is practically crucial to integrate the SIT with the MISO-GP to build a novel prognosis model that achieves the following advantages:

1. To enhance the prediction performance by predicting the battery capacity in a multi-cell setting;


2. To simplify the MTGP model while keeping all its advantages for battery capacity prediction.

3. To be adaptive to deal with the dynamic issues during the battery degradation process.

Model Development

Similar to the previous methodology, an online-adaptive battery prognosis model is developed based on the SIT. As each battery capacity degradation trajectory is taken as a "sample", the term "trajectory" will be used in this case instead of "sample". In the off-line part, representative trajectories are selected out of all offline trajectories to formulate an initial DL. In this study, DTW (see Eq. 3.8) is adopted, as it is capable of measuring the similarity between two temporal sequences even when the two vary in speed. During the offline stage, the DL and the prognosis model are updated through SIT iterations. At each iteration, only one trajectory is selected and added into the DL, and the prognosis model is updated accordingly. The iteration stops when the stopping criterion is met. During the online stage, the DL and prognosis model obtained through off-line iterations are used as a starting point for online prognosis. Once a new battery cell is observed, the selected trajectories in the DL are used as references for capacity prognosis, while the SIT keeps checking the importance of the incoming observation. If a new trajectory passes the SIT, the DL is updated.
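As a concrete reference, the classic dynamic-programming form of DTW used for the trajectory similarity above can be sketched as follows (an illustrative implementation; the dissertation's Eq. 3.8 defines the actual measure):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D capacity trajectories.
    The O(len(a) * len(b)) dynamic program tolerates local differences in
    degradation speed, unlike a point-by-point Euclidean distance."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])
```

For example, `dtw_distance([0, 0, 1], [0, 1])` is 0, since the repeated initial value can be warped away, which is exactly why DTW suits trajectories that degrade at different speeds.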

The main idea of the DL-based MISO-GP can be described by the following equation (Feng et al., 2020):

y = f(x) ∼ N(Σ_{i=1}^{m} αᵢTᵢ, k(x, x′)) (4.31)


Fig. 4.29 Illustration of trajectory selection from DB by SIT and battery capacity prognosis based on up-to-date DL.

where Ti, i ∈ {1, 2, ..., m}, are the m trajectories selected into the up-to-date DL, and αi, i ∈ {1, 2, ..., m}, are the weight factors that describe the similarity between each selected trajectory and the current capacity data. The idea is that the behavior of the current battery’s capacity can be divided into two parts: a main part that can be explained by the trajectories in the DL, and a residual part. Therefore, the weighted summation of the representative trajectories in the DL is taken as the mean function (MF), and the residual part is described by GPR. During model training, the parameters in the MF and CovF are optimized by minimizing the NLML in Eq. 4.25.
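The weighted-sum mean function of Eq. 4.31 can be sketched as below. Note this is an illustrative simplification: the weights αᵢ are fitted here by least squares on the observed prefix of the current trajectory, whereas in the dissertation they are optimized jointly with the CovF hyperparameters through the NLML of Eq. 4.25.

```python
import numpy as np

def dl_mean_function(current, dl_trajectories):
    """Weighted-sum mean function of Eq. 4.31 (illustrative sketch).

    `current` is the observed prefix (length p) of the new cell's normalized
    capacity trajectory; each row of `dl_trajectories` is one DL trajectory
    over the full prediction horizon (length n >= p).
    """
    current = np.asarray(current, float)
    T = np.asarray(dl_trajectories, float)         # shape (m, n)
    p = len(current)
    alpha, *_ = np.linalg.lstsq(T[:, :p].T, current, rcond=None)
    mean_full = alpha @ T                          # weighted sum over full horizon
    residual = current - mean_full[:p]             # the part left for the GPR
    return alpha, mean_full, residual
```

The returned `residual` is what the zero-mean GPR then models, so the extrapolated `mean_full` carries the shape information borrowed from the DL trajectories.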

The major advantage of the proposed SI-based MISO-GP method over standard GPR is that it predicts battery capacity in a growing multi-cell setting, so that (1) long-term prediction can be more accurate than with standard GPR, and (2) the model is more robust, as it can always integrate newly observed degradation paths by updating the DL. Compared with MTGP, the major advantage of the proposed SI-based MISO-GP is its reduced model complexity. The biggest difference between the proposed method and MTGP lies in how historical data are utilized in the model. In MTGP, all the historical information is introduced by constructing a large kernel matrix as in Eq. 4.30. This large kernel matrix is avoided in this work, as shown in Eq. 4.31. If the number of samples in each training trajectory is denoted as n, the number of trajectories as m, and the number of representative trajectories in the DL as m′, then the complexity of MTGP is O(m³n³), while the complexity of the proposed method is O(m′n³). When abundant historical trajectories are available, m′ ≪ m, so the efficiency gain becomes even more significant.

4.4.3 Data Description and Experiment Design

Fig. 4.30 Normalized capacity trajectory of all battery samples

The battery usage data in this case study are also provided by the NASA Prognostics Center of Excellence (Bole et al., 2014a). In this dataset, a number of LG Chem 18650 lithium-ion cells were charged and discharged at different temperatures using randomized currents varying from 0.5 A to 4 A, to test the battery degradation behavior under an accelerated but realistic usage pattern. The experimental data were collected every 50 cycles, and the experimental sequence was randomized to represent realistic battery usage well, which makes SoH prediction very challenging, since the degradation trajectories of different batteries have obviously dynamic properties (Bole et al., 2014b; Richardson et al., 2017). In the experiment, 7 groups of experimental conditions are considered, and 4 LiBs are tested under each condition. Because some experiments have much fewer samples than others, 18 battery capacity degradation trajectories are selected from this dataset for method validation, as shown in Fig. 4.30.

Fig. 4.31 Normalized capacity trajectory of offline available battery samples

Fig. 4.32 Normalized capacity trajectory of streaming-in battery samples


Fig. 4.33 Normalized capacity trajectory of online test samples

Fig. 4.34 Data partition and experiment design for model validation


To validate the superiority of the proposed method, the NASA randomized battery usage test data are partitioned into off-line data, online learning streaming-in data, and testing data. Plots of these three data subsets can be found in Fig. 4.31-4.33. The off-line training data are the battery capacity degradation trajectories available before model implementation. The online learning data are the streaming data arriving during the on-line monitoring process. The testing data are used to verify the performance of both the off-line model and the on-line adaptive model. First, the initial (off-line) model and the initial DL are established based on the off-line available trajectories; more details about the algorithm design can be found in Algorithm 3. Thereafter, during the online learning process, the informative capacity degradation trajectories are selected and adopted to update the DL. After each update, the prediction on the testing data is repeated for model performance evaluation, as shown in Fig. 4.34.

4.4.4 Results and Discussions

In the setting of this case study, besides the proposed method, three benchmark methods are introduced, which are:

• M1: MISO-GPR model developed by using all offline available trajectories;

• M2: the SIT is applied to the offline trajectories, and the MISO-GPR model is developed based on the DL of selected offline trajectories;

• M3: based on the developed M2 model, M3 updates the model online using a classic online learning mechanism, that is, it always updates the model with new samples whenever the prediction is not satisfactory;

• M4: the proposed method, which is based on M2 and updates the model only when a streaming-in trajectory passes the SIT.


Table 4.10 Offline modeling results comparison between with and without sample selection

            RMSE            AEB             Runtime (s)
         M1      M2      M1      M2      M1      M2
Test 1   0.062   0.052   5.14    5.8     6.44    3.76
Test 2   0.042   0.034   11.88   11.85   4.71    2.68
Test 3   0.021   0.019   2.49    2.99    3.90    2.60

The benchmark results of the offline models (M1 and M2) are tabulated in Table 4.10 using three metrics: RMSE, Area of Error Band (AEB), and Runtime. RMSE is used to quantify the prediction accuracy, AEB is used to evaluate the prediction uncertainty, and Runtime is employed to compare the modeling efficiency. The benchmark results in Table 4.10 demonstrate that method M2 generally has better prognosis performance than M1: M2 gives smaller prediction RMSE and less modeling time in all 3 tests, while the AEB results of the two models are quite comparable. This reflects that the offline trajectory selection and modeling strategy works well for the prognosis application: the trajectory selection by the SIT reduces the model complexity while keeping most of the historical information. More details of the prognosis results from M1 and M2 can be found in Fig. 4.35 and Fig. 4.36.
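The two error metrics can be computed as follows (a sketch; in particular the AEB formula is an assumption, since the text uses the metric without stating its exact definition, and the band width is integrated here by the trapezoidal rule):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error of the capacity prediction."""
    e = np.asarray(y_true, float) - np.asarray(y_pred, float)
    return float(np.sqrt(np.mean(e ** 2)))

def aeb(lower, upper, x=None):
    """Area of Error Band: area enclosed by the confidence interval, computed
    here by trapezoidal integration of the band width over the horizon."""
    width = np.asarray(upper, float) - np.asarray(lower, float)
    x = np.arange(len(width), dtype=float) if x is None else np.asarray(x, float)
    dx = np.diff(x)
    return float(np.sum(0.5 * (width[1:] + width[:-1]) * dx))
```

A smaller RMSE indicates a more accurate mean prediction, while a smaller AEB indicates a tighter (less uncertain) confidence band.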

Fig. 4.35 Normalized capacity prediction based on offline available trajectories (without trajectory selection)


Fig. 4.36 Normalized capacity prediction based on offline available trajectories (with trajectory selection)

During the online test, 9 streaming-in trajectories are fed into the SIT for model updating, among which 3 trajectories are found important enough to update the DL and are used for model updating accordingly. Table 4.11 and Fig. 4.37 show the battery capacity prediction results after each model update. One can see that the overall prediction performance is significantly improved. For Test 1 and Test 2, the proposed method M4 continuously improves the prediction accuracy and uncertainty as more important trajectories are detected, which provides much better results compared with the predictions from M2. For Test 3, as the offline model M2 already achieves a satisfying prediction result, M4 shows its ability to maintain the prediction at a very accurate level without negative impact from the newly added important trajectories.

Table 4.11 Online prediction results by proposed method M4

            RMSE                            AEB                           Runtime (s)
         M2      M4:U1¹  M4:U2   M4:U3    M2     M4:U1  M4:U2  M4:U3   M2     M4:U1  M4:U2  M4:U3
Test 1   0.052   0.018   0.014   0.013    5.8    1.4    1.0    0.9     3.76   2.89   3.42   3.88
Test 2   0.034   0.034   0.031   0.005    11.9   11.9   12.3   1.6     2.68   2.12   2.84   2.77
Test 3   0.019   0.017   0.016   0.019    3.0    5.6    2.8    3.8     2.60   2.34   2.55   2.21

¹ U1, U2, and U3 correspond to the M4 model after Update 1, Update 2, and Update 3 in Fig. 4.37, triggered when important trajectories are detected by the SIT from the streaming-in sequence.


(a)

(b)

(c)

Fig. 4.37 Normalized capacity prediction based on the proposed method: (a) after the 1st online model update, when the 1st important streaming-in capacity trajectory passed the SIT; (b) after the 2nd online model update, when the 2nd important streaming-in capacity trajectory passed the SIT; (c) after the 3rd online model update, when the 3rd important streaming-in capacity trajectory passed the SIT


The comparison between M3 and M4 can be found in Fig. 4.38 and Fig. 4.39. M3 can be taken as a classic online learning (or incremental learning) model, which keeps updating the model whenever new samples are available. Undoubtedly, M3 is more time- and resource-consuming than the proposed model M4; therefore, only the comparisons of RMSE and AEB between M3 and M4 are presented. From these comparisons, one can find that the proposed method outperforms the online learning model in battery prognostics. Although the final prediction accuracies of M4 and M3 are close, the efficiency of M4 is significantly better than that of M3, which would be even more evident when abundant historical data samples are available. Moreover, M4 shows more stable behavior than M3. For example, by comparing the RMSE of Test 1 from the two models, one can tell that M3 introduces fluctuations during the online stage, as it updates too frequently. A similar phenomenon is observed in Case Study II (Fig. 4.7), where several online learning methods are benchmarked against the proposed method for fault detection. The reason is that the proposed method is capable of enriching the DL with informative and important samples for model updating effectively and efficiently.

4.4.5 Summary

In summary, the benchmark between M1, M2, M3, and M4 illustrates the advantages of the proposed adaptive PHM framework in prognosis applications. To predict battery capacity degradation in a multi-cell setting, a GPR model with a novel mean function is employed to take advantage of the historical information, which is abstracted into the DL via the SIT. The uniqueness of the proposed method lies in providing an efficient way to integrate historical information and streaming-in information to enhance the prediction outcome. It is shown that the proposed method can be an effective and efficient tool to predict battery capacity in a multi-cell setting.


Fig. 4.38 RMSE comparison between M3 (incremental model) and M4 (proposed method). A pair of dashed and solid lines of the same color compares the results of the two models on the same test. The vertical dashed lines mark the M4 model updates triggered when important trajectories stream in; e.g., the first vertical dashed line is located at x = 3, meaning that the 3rd streaming-in trajectory passes the SIT and triggers a model update.


Fig. 4.39 AEB comparison between M3 (incremental model) and M4 (proposed method). A pair of dashed and solid lines of the same color compares the results of the two models on the same test. The vertical dashed lines mark the M4 model updates triggered when important trajectories stream in; e.g., the first vertical dashed line is located at x = 3, meaning that the 3rd streaming-in trajectory passes the SIT and triggers a model update.

Chapter 5

Conclusions and Future Work

5.1 Conclusions

To address the gap between traditional PHM technologies and the growing needs of modern industry, this research proposes a systematic framework of adaptive PHM that is capable of evolving with the new information brought by streaming data during online operation in a dynamic working environment. Unlike conventional static PHM models, which are usually trained on historical data or DOE data and remain unchanged after deployment, the proposed methodology evolutionarily transfers the static PHM model into an adaptive model to accommodate the dynamics in manufacturing processes. Based on the proposed methodology, the PHM model is no longer static after implementation, but continuously updates with an adaptive self-learning capability, which ensures the developed PHM model has a lifelong adaptability mechanism with sustainable performance to deal with model aging issues.

To realize such a goal, the proposed framework divides the PHM modeling process into two steps: (1) off-line model initialization before deployment; and (2) online model evolution after deployment. An adaptive sample selection strategy named the SIT is developed to effectively select the important samples off-line for model initialization, and to detect the important samples from the streaming-in sequence during online monitoring for model evolution.

To be more specific, three novel modules are developed to build up an adaptive PHM framework:

1. Off-line/initial sampling and modeling: the goals of this module include:

(a) select important samples from historical data to formulate a representative subset called the data library (DL);

(b) train an off-line PHM model based on the selected DL.

2. On-line sample selection: this module determines the importance of each streaming sample and decides whether or not to import the new sample into the DL and update the PHM model.

3. On-line model updating: this module enables the current model to learn from selected samples sequentially to adapt to dynamic changes. Considering that new samples arrive in sequence, several sequence learning algorithms, such as Bayesian-based methods, neural networks, incremental algorithms, and ensemble methods, are investigated.
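The interaction of the three modules above can be summarized by the following skeleton (illustrative Python pseudocode; `sit_test`, `train`, and `update` are placeholders for the concrete algorithms chosen in each case study, and the off-line selection is simplified to a single pass):

```python
def adaptive_phm(historical_data, stream, sit_test, train, update):
    """Skeleton of the adaptive PHM framework.

    Off-line: iterate over historical samples, keeping only those the sample
    importance test (SIT) flags as important, to form the data library (DL),
    then train the initial model. On-line: each streaming sample is screened
    by the SIT; only samples that pass enter the DL and trigger an update.
    """
    dl = []
    for sample in historical_data:        # module 1: off-line sampling/modeling
        if sit_test(sample, dl):
            dl.append(sample)
    model = train(dl)
    for sample in stream:                 # modules 2-3: on-line selection/updating
        if sit_test(sample, dl):
            dl.append(sample)
            model = update(model, sample)
    return model, dl
```

The key design choice is that `update` is incremental: the model never retrains from scratch, so the cost of each on-line step stays bounded by the size of the DL rather than the full stream.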

Besides the methodology development, the effectiveness of the presented methodology is validated with four real-world industrial applications. Based on the methodology development and application investigation, the contributions and broader impact of this research can be summarized as follows:

1. Effective PHM modeling from abundant historical data. Traditionally, all available historical data are employed equally for PHM modeling, which turns out to be ineffective in modern manufacturing processes where data are generated rapidly. Firstly, the large volume of in-process data brings heavy computational burdens; secondly, information redundancy is an inevitable issue in a big data environment. It is therefore necessary to develop an efficient PHM modeling method that processes a high volume of data with limited time and resources.

2. Modeling of new information discovery in manufacturing processes. Manufacturing processes are subject to disturbances and drifts caused by various issues such as changes in the ambient environment, device wear, adjustment of usage behaviors, etc. All these volatilities could undermine a static PHM model’s performance. However, it is practically impossible to update the PHM model every single time a new sample arrives. Therefore, to update the model adaptively and efficiently, it is crucial to capture the volatility during process monitoring.

3. Adaptive PHM modeling in an on-line fashion. Thanks to the informativeness detection model described in Contribution 2, the self-learning PHM model does not evolve every time, but updates only when a streaming-in sample is detected as informative. An online self-learning PHM modeling method is proposed to enable the current model to learn from selected samples sequentially without model retraining.

4. Development of a unified framework that combines the off-line PHM model and the online PHM model, which enhances the model’s performance over its whole life. On one hand, the online model with self-learning ability processes samples and updates the model incrementally, which is inherently efficient. On the other hand, the off-line model maintains the knowledge learned from historical data or DOE.


5.2 Future Work

The presented adaptive PHM framework provides a promising direction for PHM model development and operations to shorten the system development life cycle while keeping sustainable performance. Several potential directions for enriching the current framework and investigating more successful applications are listed as follows:

1. The current framework and practices in this research mainly focus on how to capture the new information from the streaming-in sequence in a timely and efficient manner, but pay less attention to how to remove outdated information from the model. Also, the current framework considers all important samples to be ’equally important’, which may make its response to abrupt regime changes or environment dynamics sluggish. Future investigation should address these issues to make the model more comprehensive.

2. Recently, researchers have proposed desirable characteristics of PHM studies with practical applications that had not been brought to the forefront before (Zio, 2009): a. quick prediction; b. robustness; c. confidence estimation; d. adaptability; e. clarity of interpretation. Such desirable characteristics have not been fully addressed during the framework development and model establishment with real applications. Currently, the proposed methodology meets several of these requirements, such as quick prediction, robustness, adaptability, and confidence estimation. Confidence quantification provides more insight into result credibility and interpretability, but the interpretation of data-driven models for root cause analysis and better decision-making support has not been fully discussed. Therefore, besides the enrichment of the overall methodology, these characteristics are considerations for further investigation.


3. The current investigation covers the modeling process, but rarely addresses the data quality issue. As pointed out by Lee (2020), the 3B data issues (bad, broken, background) affect modeling performance significantly. Therefore, it is meaningful to put further effort into addressing data quality within the framework to guarantee the sustainable performance of the proposed adaptive PHM methodology.

4. As reviewed in Chapter 2, transfer learning and deep learning have shown their potential for knowledge transfer in various applications in recent years, but current investigations mainly focus on offline approaches. Further investigation is needed to unleash the potential of these methods in the adaptive PHM area.

5. As discussed in the methodology and case study chapters, there are quite a few analytics tool options that can be fitted into the proposed methodology for various scenarios, so it is necessary to build up a toolbox or platform with rich algorithm choices, which can not only serve as a reconfigurable tool for practical use, but also as a library for algorithm benchmarking and research work.

References

Rosmaini Ahmad and Shahrul Kamaruddin. An overview of time-based and condition-

based maintenance in industrial application. Computers & industrial engineering, 63

(1):135–149, 2012.

Samaneh Aminikhanghahi and Diane J Cook. A survey of methods for time series change

point detection. Knowledge and information systems, 51(2):339–367, 2017.

Jesse A Andrawus, John Watson, and Mohammed Kishk. Wind turbine maintenance

optimisation: principles of quantitative maintenance optimisation. Wind Engineering,

31(2):101–110, 2007.

Victoria Baagøe-Engels and Jan Stentoft. Operations and maintenance issues in the

offshore wind energy sector. International Journal of Energy Sector Management,

2016.

Baidu. Smart data set from Nankai University and Baidu Inc., 2015. http://pan.baidu.

com/share/link?shareid=189977&uk=4278294944.

Gaston Baudat and Fatiha Anouar. Feature vector selection and projection using kernels.

Neurocomputing, 55(1-2):21–38, 2003.

137 References

Steffen Bickel, Michael Brückner, and Tobias Scheffer. Discriminative learning for differing

training and test distributions. In Proceedings of the 24th international conference on

Machine learning, pages 81–88, 2007.

Christopher M Bishop. Pattern recognition and machine learning. springer, 2006.

B Bole, C Kulkarni, and M Daigle. Randomized battery usage data set. NASA AMES

prognostics data repository, 2014a.

Brian Bole, Chetan S Kulkarni, and Matthew Daigle. Adaptation of an electrochemistry-

based li-ion battery model to account for deterioration observed under randomized

use. Technical report, SGT, Inc. Moffett Field , 2014b.

Edwin V Bonilla, Kian M , and Christopher Williams. Multi-task gaussian process

prediction. In Advances in neural information processing systems, pages 153–160, 2008.

Karsten M Borgwardt, Arthur Gretton, Malte J Rasch, Hans-Peter Kriegel, Bernhard

Schölkopf, and Alex J Smola. Integrating structured biological data by kernel maximum

mean discrepancy. Bioinformatics, 22(14):e49–e57, 2006.

Haoshu Cai, Xiaodong Jia, Jianshe Feng, Qibo Yang, Yuan-Ming Hsu, Yudi Chen, and

Jay Lee. A combined filtering strategy for short term and long term wind speed

prediction with improved accuracy. Renewable energy, 136:1082–1090, 2019.

Haoshu Cai, Jianshe Feng, Wenzhe Li, Yuan-Ming Hsu, and Jay Lee. Similarity-based

particle filter for remaining useful life prediction with enhanced performance. Applied

Soft Computing, page 106474, 2020a.

Haoshu Cai, Jianshe Feng, Qibo Yang, Wenzhe Li, Xiang Li, and Jay Lee. A virtual

metrology method with prediction uncertainty based on gaussian process for chemical

mechanical planarization. Computers in Industry, 119:103228, 2020b.

138 References

Haoshu Cai, Xiaodong Jia, Jianshe Feng, Wenzhe Li, Yuan-Ming Hsu, and Jay Lee. Gaus-

sian process regression for numerical wind speed prediction enhancement. Renewable

Energy, 146:2112–2123, 2020c.

Haoshu Cai, Xiaodong Jia, Jianshe Feng, Wenzhe Li, Laura Pahren, and Jay Lee. A

similarity based methodology for machine prognostics by using kernel two sample test.

ISA transactions, 2020d.

Gert Cauwenberghs and Tomaso Poggio. Incremental and decremental support vector

machine learning. In Advances in neural information processing systems, pages 409–415,

2001.

Nicolo Cesa-Bianchi, Alex Conconi, and Claudio Gentile. A second-order perceptron

algorithm. SIAM Journal on Computing, 34(3):640–668, 2005.

Lester Lik Teck Chan, Xiaofei Wu, Junghui Chen, , and Chun-I Chen. Just-in-

time modeling with variable shrinkage based on gaussian processes for semiconductor

manufacturing. IEEE Transactions on Semiconductor Manufacturing, 31(3):335–342,

2018.

Mahalanobis Prasanta Chandra et al. On the generalised distance in statistics. In

Proceedings of the National Institute of Sciences of India, volume 2, pages 49–55, 1936.

Zhiyuan Chen and Bing Liu. Lifelong machine learning. Synthesis Lectures on Artificial

Intelligence and Machine Learning, 12(3):1–207, 2018.

Leo H Chiang, Mark E Kotanchek, and Arthur K Kordon. Fault diagnosis based on fisher

discriminant analysis and support vector machines. Computers & chemical engineering,

28(8):1389–1401, 2004.

139 References

Tat-Jun Chin, Konrad Schindler, and David Suter. Incremental kernel svd for face

recognition with image sets. In 7th International Conference on Automatic Face and

Gesture Recognition (FGR06), pages 461–466. IEEE, 2006.

Mohamed Chookah, Mohammad Nuhi, and Mohammad Modarres. A probabilistic

physics-of-failure model for prognostic health management of structures subject to

pitting and corrosion-fatigue. Reliability Engineering & System Safety, 96(12):1601–

1610, 2011.

Rubana H Chowdhury, Mamun Reaz, Mohd Alauddin Bin Mohd Ali, Ashrif AA

Bakar, Kalaivani Chellappan, and Tae G Chang. Surface electromyography signal

processing and classification techniques. Sensors, 13(9):12431–12466, 2013.

Jamie Coble and J Wesley Hines. Identifying optimal prognostic parameters from data:

a genetic algorithms approach. In Annual conference of the prognostics and health

management society, volume 27, 2009.

Anthony Coppola. Reliability engineering of electronic equipment a historical perspective.

IEEE Transactions on Reliability, 33(1):29–35, 1984.

Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer.

Online passive-aggressive algorithms. Journal of Machine Learning Research, 7(Mar):

551–585, 2006.

Koby Crammer, Alex Kulesza, and Mark Dredze. Adaptive regularization of weight

vectors. In Advances in neural information processing systems, pages 414–422, 2009.

Lehel Csató and Manfred Opper. Sparse on-line gaussian processes. Neural computation,

14(3):641–668, 2002.


Wenyuan Dai, Qiang Yang, Gui-Rong Xue, and Yong Yu. Boosting for transfer learning. In Proceedings of the 24th international conference on Machine learning, pages 193–200, 2007.

Stéphane Dauzère-Pérès, Jean-Loup Rouveyrol, Claude Yugma, and Philippe Vialletelle.

A smart sampling algorithm to minimize risk dynamically. In 2010 IEEE/SEMI

Advanced Semiconductor Manufacturing Conference (ASMC), pages 307–310. IEEE,

2010.

Hossein Davari Ardakani and Jay Lee. A minimal-sensing framework for monitoring

multistage manufacturing processes using product quality measurements. Machines, 6

(1):1, 2018.

Jesse Davis and Pedro Domingos. Deep transfer via second-order markov logic. In

Proceedings of the 26th annual international conference on machine learning, pages

217–224, 2009.

Roy De Maesschalck, Delphine Jouan-Rimbaud, and Désiré L Massart. The mahalanobis

distance. Chemometrics and intelligent laboratory systems, 50(1):1–18, 2000.

Rommert Dekker. Applications of maintenance optimization models: a review and

analysis. Reliability engineering & system safety, 51(3):229–240, 1996.

Yuan Di. Enhanced System Health Assessment using Adaptive Self-Learning Techniques.

PhD thesis, University of Cincinnati, 2018.

Yuan Di, Xiaodong Jia, and Jay Lee. Enhanced virtual metrology on chemical me-

chanical planarization process using an integrated model and data-driven approach.

International Journal of Prognostics & Health Management, 8(2), 2017.


Christopher P Diehl and Gert Cauwenberghs. Svm incremental learning, adaptation and

optimization. In Proceedings of the International Joint Conference on Neural Networks,

2003., volume 4, pages 2685–2690. IEEE, 2003.

Gregory Ditzler, Manuel Roveri, Cesare Alippi, and Robi Polikar. Learning in nonsta-

tionary environments: A survey. IEEE Computational Intelligence Magazine, 10(4):

12–25, 2015.

Mark Dredze, Koby Crammer, and Fernando Pereira. Confidence-weighted linear classifi-

cation. In Proceedings of the 25th international conference on Machine learning, pages

264–271, 2008.

Robert Dürichen, Marco AF Pimentel, Lei Clifton, Achim Schweikard, and David A

Clifton. Multitask gaussian processes for multivariate physiological time-series analysis.

IEEE Transactions on Biomedical Engineering, 62(1):314–322, 2014.

Jianshe Feng. Methodology of adaptive prognostics and health management using

streaming data in big data environment. In Proceedings of the Annual Conference of

the PHM Society, volume 11, 2019.

Jianshe Feng, Xinyu Du, and Mutasim Salman. Wheel bearing fault isolation and

prognosis using acoustic based approach. In Proceedings of the Annual Conference of

the PHM Society, volume 11, 2019a.

Jianshe Feng, Xiaodong Jia, Feng Zhu, James Moyne, Jimmy Iskandar, and Jay Lee.

An online virtual metrology model with sample selection for the tracking of dynamic

manufacturing processes with slow drift. IEEE Transactions on Semiconductor Manu-

facturing, 32(4):574–582, 2019b.


Jianshe Feng, Xiaodong Jia, Feng Zhu, Qibo Yang, Yubin Pan, and Jay Lee. An intelligent

system for offshore wind farm maintenance scheduling optimization considering turbine

production loss. Journal of Intelligent & Fuzzy Systems, 1(Preprint):1–13, 2019c.

Jianshe Feng, Xiaodong Jia, Haoshu Cai, Feng Zhu, Li Xiang, and Jay Lee. A cross

trajectory gaussian process regression model for battery health prediction. Journal of

modern power systems and clean energy, 1:1–1, 2020.

Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum,

and Frank Hutter. Efficient and robust automated machine learning. In Advances in

neural information processing systems, pages 2962–2970, 2015.

Aurélien Froger, Michel Gendreau, Jorge E Mendoza, Eric Pinson, and Louis-Martin

Rousseau. Maintenance scheduling in the electricity industry: A literature review.

European Journal of Operational Research, 251(3):695–706, 2016.

João Gama, Indrė Žliobaitė, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. A survey on concept drift adaptation. ACM computing surveys (CSUR), 46(4):1–37, 2014.

A Garinei and R Marsili. A new diagnostic technique for ball screw actuators. Measure-

ment, 45(5):819–828, 2012.

Zhiqiang Ge. Active probabilistic sample selection for intelligent soft sensing of industrial

processes. Chemometrics and Intelligent Laboratory Systems, 151:181–189, 2016.

Zhiqiang Ge and Zhihuan Song. A comparative study of just-in-time-learning based

methods for online soft sensor modeling. Chemometrics and Intelligent Laboratory

Systems, 104(2):306–317, 2010.


Andrew Gelman, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari, and Donald B

Rubin. Bayesian data analysis. CRC press, 2013.

Claudio Gentile. A new approximate maximal margin classification algorithm. Journal

of Machine Learning Research, 2(Dec):213–242, 2001.

Kai Goebel, Bhaskar Saha, Abhinav Saxena, Jose R Celaya, and Jon P Christophersen.

Prognostics in battery health management. IEEE instrumentation & measurement

magazine, 11(4):33–40, 2008.

Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and

Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research,

13(25):723–773, 2012. URL http://jmlr.org/papers/v13/gretton12a.html.

Richard L Guldi. In-line defect reduction from a historical perspective and its implications

for future integrated circuit manufacturing. IEEE transactions on semiconductor

manufacturing, 17(4):629–640, 2004.

Fatma Gumus, C Okan Sakar, Zeki Erdem, and Olcay Kursun. Online naive bayes

classification for network intrusion detection. In 2014 IEEE/ACM International

Conference on Advances in Social Networks Analysis and Mining (ASONAM 2014),

pages 670–674. IEEE, 2014.

Volkan Gunes, Steffen Peter, Tony Givargis, and Frank Vahid. A survey on concepts,

applications, and challenges in cyber-physical systems. KSII Transactions on Internet

& Information Systems, 8(12), 2014.

Jiawei Han, Jian Pei, and Micheline Kamber. Data mining: concepts and techniques. Elsevier, 2011.


KP Han, J Moyne, and T Edgar. Implementation of virtual metrology by selection of

optimal adaptation method. In AEC/APC Symposium XXI, 2009.

Aiwina Heng, Sheng Zhang, Andy CC Tan, and Joseph Mathew. Rotating machinery prognostics: State of the art, challenges and opportunities. Mechanical systems and signal processing, 23(3):724–739, 2009.

Toshiya Hirai and Manabu Kano. Adaptive virtual metrology design for semiconductor

dry etching process through locally weighted partial least squares. IEEE Transactions

on Semiconductor Manufacturing, 28(2):137–144, 2015.

Steven CH Hoi, Jialei Wang, and Peilin Zhao. Libol: A library for online learning

algorithms. Journal of Machine Learning Research, 15(1):495, 2014.

Steven CH Hoi, Doyen Sahoo, Jing Lu, and Peilin Zhao. Online learning: A comprehensive

survey. arXiv preprint arXiv:1802.02871, 2018.

Derek Hu and Qiang Yang. Transfer learning for activity recognition via sensor

mapping. In Twenty-second international joint conference on artificial intelligence,

2011.

Xiaosong Hu, Jiuchun Jiang, Dongpu Cao, and Bo Egardt. Battery health prognosis for electric vehicles using sample entropy and sparse bayesian predictive modeling. IEEE Transactions on Industrial Electronics, 63(4):2645–2656, 2015.

Min-Hsiung Hung, Tung-Ho Lin, Fan-Tien Cheng, and Rung-Chuan Lin. A novel virtual metrology scheme for predicting cvd thickness in semiconductor manufacturing. IEEE/ASME Transactions on mechatronics, 12(3):308–316, 2007.


Dennis Ippoliti and Xiaobo Zhou. An adaptive growing hierarchical self organizing map

for network intrusion detection. In 2010 Proceedings of 19th International Conference

on Computer Communications and Networks, pages 1–7. IEEE, 2010.

Andrew KS Jardine, Daming Lin, and Dragan Banjevic. A review on machinery diagnos-

tics and prognostics implementing condition-based maintenance. Mechanical systems

and signal processing, 20(7):1483–1510, 2006.

Akbar Javadian Kootanaee, K Nagendra Babu, and Hamid Talari. Just-in-time man-

ufacturing system: from introduction to implement. Available at SSRN 2253243,

2013.

Jay Lee. Industrial AI: Applications with Sustainable Performance. Springer, 2020.

Xiaodong Jia, Ming Zhao, Yuan Di, Qibo Yang, and Jay Lee. Assessment of data

suitability for machine prognosis using maximum mean discrepancy. IEEE transactions

on industrial electronics, 65(7):5872–5881, 2017.

Xiaodong Jia, Yuan Di, Jianshe Feng, Qibo Yang, Honghao Dai, and Jay Lee. Adaptive

virtual metrology for semiconductor chemical mechanical planarization process using

gmdh-type polynomial neural networks. Journal of Process Control, 62:44–54, 2018a.

Xiaodong Jia, Bin Huang, Jianshe Feng, Haoshu Cai, and Jay Lee. A review of phm

data competitions from 2008 to 2017. In Proceedings of the Annual Conference of the

PHM Society, volume 10, 2018b.

Xiaodong Jia, Haoshu Cai, Yuanming Hsu, Wenzhe Li, Jianshe Feng, and Jay Lee. A

novel similarity-based method for remaining useful life prediction using kernel two

sample test. In Proceedings of the Annual Conference of the PHM Society, volume 11,

2019.


Jinyang Jiao, Ming Zhao, Jing Lin, and Kaixuan Liang. Hierarchical discriminating sparse coding for weak fault feature extraction of rolling bearings. Reliability Engineering & System Safety, 184:41–54, 2019.

Wenjing Jin, Yan Chen, and Jay Lee. Methodology for ball screw component health

assessment and failure analysis. In ASME 2013 International Manufacturing Science and Engineering Conference collocated with the 41st North American Manufacturing

Research Conference. American Society of Mechanical Engineers Digital Collection,

2013.

Wenjing Jin, Zongchang Liu, Zhe Shi, Chao Jin, and Jay Lee. Cps-enabled worry-free

industrial applications. In 2017 Prognostics and System Health Management Conference

(PHM-Harbin), pages 1–7. IEEE, 2017.

Xiaoning Jin, Brian A Weiss, David Siegel, and Jay Lee. Present status and future growth

of advanced maintenance technology and strategy in us manufacturing. International

journal of prognostics and health management, 7(Spec Iss on Smart Manufacturing

PHM), 2016.

Pilsung Kang, Dongil Kim, Hyoung-joo Lee, Seungyong Doh, and Sungzoon Cho. Virtual

metrology for run-to-run control in semiconductor manufacturing. Expert Systems with

Applications, 38(3):2508–2522, 2011.

Pilsung Kang, Dongil Kim, and Sungzoon Cho. Semi-supervised support vector regression

based on self-training with label uncertainty: An application to virtual metrology in

semiconductor manufacturing. Expert Systems with Applications, 51:85–106, 2016.

Seokho Kang and Pilsung Kang. An intelligent virtual metrology system with adaptive

update for semiconductor manufacturing. Journal of Process Control, 52:66–74, 2017.


Aftab A Khan, James R Moyne, and Dawn M Tilbury. Virtual metrology and feedback

control for semiconductor manufacturing processes using recursive partial least squares.

Journal of Process Control, 18(10):961–974, 2008.

Dongil Kim, Pilsung Kang, Sungzoon Cho, Hyoung-joo Lee, and Seungyong Doh. Ma-

chine learning-based novelty detection for faulty wafer detection in semiconductor

manufacturing. Expert Systems with Applications, 39(4):4075–4083, 2012.

Nam-Ho Kim, Dawn An, and Joo-Ho Choi. Prognostics and health management of engineering systems: An introduction. Springer, 2016.

James K Kimotho, Christoph Sondermann-Woelke, Tobias Meyer, and Walter Sextro.

Application of event based decision tree and ensemble of data driven methods for

maintenance action recommendation. International Journal of Prognostics and Health

Management, 4, 2013.

Shosuke Kimura, Seiichi Ozawa, and Shigeo Abe. Incremental kernel pca for online

learning of feature space. In International Conference on Computational Intelligence for Modelling, Control and Automation and International Conference on Intelligent

Agents, Web Technologies and Internet Commerce (CIMCA-IAWTIC’06), volume 1,

pages 595–600. IEEE, 2005.

J Zico Kolter and Marcus A Maloof. Dynamic weighted majority: An ensemble method

for drifting concepts. Journal of Machine Learning Research, 8(Dec):2755–2790, 2007.

Georg Krempl, Indrė Žliobaitė, Dariusz Brzeziński, Eyke Hüllermeier, Mark Last, Vincent

Lemaire, Tino Noack, Ammar Shaker, Sonja Sievi, Myra Spiliopoulou, et al. Open

challenges for data stream mining research. ACM SIGKDD explorations newsletter, 16

(1):1–10, 2014.


Sanjeev Kulkarni and Gilbert Harman. An elementary introduction to statistical learning

theory, volume 853. John Wiley & Sons, 2011.

Tze Leung Lai. Sequential changepoint detection in quality control and dynamical

systems. Journal of the Royal Statistical Society: Series B (Methodological), 57(4):

613–644, 1995.

Pradeep Lall, Madhura Hande, Chandan Bhat, and Jay Lee. Prognostics health mon-

itoring (phm) for prior damage assessment in electronics equipment under thermo-

mechanical loads. IEEE Transactions on Components, Packaging and Manufacturing

Technology, 1(11):1774–1789, 2011.

Pavel Laskov, Christian Gehl, Stefan Krüger, and Klaus-Robert Müller. Incremental

support vector learning: Analysis, implementation and applications. Journal of machine

learning research, 7(Sep):1909–1936, 2006.

Hyung Joo Lee. Advanced process control and optimal sampling in semiconductor

manufacturing. 2008.

Jang Hee Lee. Artificial intelligence-based sampling planning system for dynamic

manufacturing process. Expert Systems with Applications, 22(2):117–133, 2002.

Jay Lee, Fangji Wu, Wenyu Zhao, Masoud Ghaffari, Linxia Liao, and David Siegel. Prognostics and health management design for rotary machinery systems—reviews, methodology and applications. Mechanical systems and signal processing, 42(1-2):314–334, 2014.

Jay Lee, Hossein Davari Ardakani, Shanhu Yang, and Behrad Bagheri. Industrial big

data analytics and cyber-physical systems for future maintenance & service innovation.

Procedia Cirp, 38:3–7, 2015.


EJ Lefferts, F Landis Markley, and Malcolm D Shuster. Kalman filtering for spacecraft attitude estimation. Journal of Guidance, Control, and Dynamics, 5(5):417–429, 1982.

Fan Li and Jiuping Xu. A new prognostics method for state of health estimation of

lithium-ion batteries based on a mixture of gaussian process models and particle filter.

Microelectronics Reliability, 55(7):1035–1045, 2015.

Pin Li, Xiaodong Jia, Jianshe Feng, Hossein Davari, Guan , Yihchyun Hwang, and

Jay Lee. Prognosability study of ball screw degradation using systematic methodology.

Mechanical Systems and Signal Processing, 109:45–57, 2018a.

Pin Li, Jianshe Feng, Feng Zhu, Hossein Davari, Liang-Yu Chen, and Jay Lee. A deep

learning based method for cutting parameter optimization for band saw machine. In

Annual Conference of the PHM Society, volume 11, 2019a.

Shuang Li, Yao Xie, Hanjun Dai, and Le Song. M-statistic for kernel change-point detection. In Advances in Neural Information Processing Systems, pages 3366–3374, 2015.

Ya Li, Mingming Gong, Xinmei Tian, Tongliang Liu, and Dacheng Tao. Domain generalization via conditional invariant representations. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018b.

Yi Li and Philip M Long. The relaxed online maximum margin algorithm. In Advances

in neural information processing systems, pages 498–504, 2000.

Zhixiong Li, Dazhong Wu, and Tianyu Yu. Prediction of material removal rate for

chemical mechanical planarization using decision tree-based ensemble learning. Journal

of Manufacturing Science and Engineering, 141(3), 2019b.


Z Chase Lipton, Charles Elkan, and Balakrishnan Narayanaswamy. Thresholding classi-

fiers to maximize f1 score. Machine Learning and Knowledge Discovery in Databases,

8725:225–239, 2014.

Bin Liu, Shaomin Wu, Min Xie, and Way Kuo. A condition-based maintenance policy

for degrading systems with age-and state-dependent operating cost. European Journal

of Operational Research, 263(3):879–887, 2017a.

Datong Liu, Jingyue Pang, Jianbao Zhou, Yu Peng, and Michael Pecht. Prognostics for state of health estimation of lithium-ion batteries based on combination gaussian process functional regression. Microelectronics Reliability, 53(6):832–839, 2013.

Jie Liu, Yan-Fu Li, and Enrico Zio. A svm framework for fault detection of the braking system in a high speed train. Mechanical Systems and Signal Processing, 87:401–409, 2017b.

C Joseph Lu and William Q Meeker. Using degradation measures to estimate a time-to-failure distribution. Technometrics, 35(2):161–174, 1993.

Edwin Lughofer and Mahardhika Pratama. Online active learning in data stream

regression using uncertainty sampling based on evolving generalized fuzzy models.

IEEE Transactions on fuzzy systems, 26(1):292–309, 2017.

Shane A Lynn, John Ringwood, and Niall MacGearailt. Global and local virtual metrology

models for a plasma etch process. IEEE Transactions on Semiconductor Manufacturing,

25(1):94–103, 2011.

Ardeshir Raihanian Mashhadi, Willie Cade, and Sara Behdad. Moving towards real-

time data-driven quality monitoring: A case study of hard disk drives. Procedia

Manufacturing, 26:1107–1115, 2018.


Saad Mohamad, Moamar Sayed-Mouchaweh, and Abdelhamid Bouchachia. Active

learning for classifying data streams with unknown number of classes. Neural Networks,

98:1–15, 2018.

Chandra Mouli and Michael J Scott. Adaptive metrology sampling techniques enabling

higher precision in variability detection and control. In 2007 IEEE/SEMI Advanced

Semiconductor Manufacturing Conference, pages 12–17. IEEE, 2007.

Justin Nduhura Munga, Stéphane Dauzère-Pérès, Philippe Vialletelle, and Claude

Yugma. Dynamic management of controls in semiconductor manufacturing. In 2011

IEEE/SEMI Advanced Semiconductor Manufacturing Conference, pages 1–6. IEEE,

2011.

Kevin P Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.

Fabian Nater, Tatiana Tommasi, Helmut Grabner, Luc Van Gool, and Barbara Caputo.

Transferring activities: Updating human behavior analysis. In 2011 IEEE International

Conference on Computer Vision Workshops (ICCV Workshops), pages 1737–1744. IEEE,

2011.

Thomas Nehrkorn, Ross N Hoffman, Christopher Grassotti, and Jean-François Louis.

Feature calibration and alignment to represent model forecast errors: Empirical

regularization. Quarterly Journal of the Royal Meteorological Society: A journal of the

atmospheric sciences, applied meteorology and physical oceanography, 129(587):195–218,

2003.

Man-Fai Ng, Jin Zhao, Qingyu Yan, Gareth J Conduit, and Zhi Wei Seh. Predicting

the state of charge and health of batteries using data-driven machine learning. Nature

Machine Intelligence, pages 1–10, 2020.


Jianjun Ni, Chuanbiao Zhang, and Simon X Yang. An adaptive approach based on kpca

and svm for real-time fault diagnosis of hvcbs. IEEE Transactions on Power Delivery,

26(3):1960–1971, 2011.

Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on

knowledge and data engineering, 22(10):1345–1359, 2009.

Sinno Jialin Pan, Ivor W Tsang, James T Kwok, and Qiang Yang. Domain adaptation via

transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199–210,

2010.

Sinno Jialin Pan, Qiang Yang, and Wei Fan. Transfer learning with applications. 2012.

Jingyue Pang, Datong Liu, Haitao Liao, Yu Peng, and Xiyuan Peng. Anomaly detection

based on data stream monitoring and prediction with improved gaussian process

regression algorithm. In 2014 International Conference on Prognostics and Health

Management, pages 1–7. IEEE, 2014.

Chanhee Park and Seoung Bum Kim. Virtual metrology modeling of time-dependent

spectroscopic signals by a fused lasso algorithm. Journal of Process Control, 42:51–58,

2016.

Meru A Patil, Piyush Tagade, Krishnan S Hariharan, Subramanya M Kolake, Taewon

Song, Taejung Yeo, and Seokgwang Doo. A novel multistage support vector machine

based approach for li ion battery remaining useful life estimation. Applied energy, 159:

285–297, 2015.

Peter W Tse, Joseph Mathew, King Wong, Rocky Lam, and CN Ko. Engineering Asset Management-Systems, Professional Practices and Certification: Proceedings of the 8th World Congress on Engineering Asset Management (WCEAM 2013) & the 3rd International Conference on Utility Management & Safety (ICUMAS). Springer, 2014.

PHM Society. PHM data challenge dataset, 2016. https://www.phmsociety.org/sites/all/modules/pubdlcnt/pubdlcnt.php?file=https://www.phmsociety.org/sites/phmsociety.org/files/2016%20PHM%20DATA%20CHALLENGE%20CMP%20DATA%20SET.zip&nid=2152.

Pierre Pinson, Henrik Madsen, Henrik Aa Nielsen, George Papaefthymiou, and Bernd

Klöckl. From probabilistic forecasts to statistical scenarios of short-term wind power

production. Wind Energy: An International Journal for Progress and Applications in

Wind Power Conversion Technology, 12(1):51–62, 2009.

Kejun Qian, Chengke Zhou, Yue Yuan, and Malcolm Allan. Temperature effect on electric

vehicle battery cycle life in vehicle-to-grid applications. In CICED 2010 Proceedings,

pages 1–6. IEEE, 2010.

Laura Elena Raileanu and Kilian Stoffel. Theoretical comparison between the gini index

and information gain criteria. Annals of Mathematics and Artificial Intelligence, 41(1):

77–93, 2004.

Carl Edward Rasmussen. Gaussian processes in machine learning. In Summer School on

Machine Learning, pages 63–71. Springer, 2003.

Robert R Richardson, Michael A Osborne, and David A Howey. Gaussian process

regression for forecasting battery state of health. Journal of Power Sources, 357:

209–219, 2017.


Robert R Richardson, Christoph R Birkl, Michael A Osborne, and David A Howey. Gaussian process regression for in situ capacity estimation of lithium-ion batteries. IEEE Transactions on Industrial Informatics, 15(1):127–138, 2018.

Frank Rosenblatt. The perceptron: a probabilistic model for information storage and

organization in the brain. Psychological review, 65(6):386, 1958.

Amir Saffari, Christian Leistner, Jakob Santner, Martin Godec, and Horst Bischof. On-

line random forests. In 2009 ieee 12th international conference on computer vision

workshops, iccv workshops, pages 1393–1400. IEEE, 2009.

Alvin J Salkind, Craig Fennie, Pritpal Singh, Terrill Atwater, and David E Reisner. Deter-

mination of state-of-charge and state-of-health of batteries by fuzzy logic methodology.

Journal of Power sources, 80(1-2):293–300, 1999.

Samuel Butler. The Note-books of Samuel Butler. Fifield, London, E. P. Dutton and Co.,

New York, 1912.

Simo Särkkä. Bayesian filtering and smoothing, volume 3. Cambridge University Press,

2013.

Craig Saunders, Mark O Stitson, Jason Weston, Leon Bottou, A Smola, et al. Support

vector machine-reference manual. 1998.

Bernhard Schölkopf. The kernel trick for distances. In Advances in neural information

processing systems, pages 301–307, 2001.

Alexander G Schwing, Christopher Zach, Yefeng Zheng, and Marc Pollefeys. Adaptive random forest—how many "experts" to ask before making a decision? In CVPR 2011, pages 1377–1384. IEEE, 2011.


Pavel Senin. Dynamic time warping algorithm review. Information and Computer Science

Department University of Hawaii at Manoa Honolulu, USA, 855(1-23):40, 2008.

Burr Settles. Active learning. Synthesis Lectures on Artificial Intelligence and Machine

Learning, 6(1):1–114, 2012.

Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International conference on machine learning, pages 71–79, 2013.

Aaron Shelly. An adaptive recipe compensation approach for enhanced health prediction

in semiconductor manufacturing, 2017.

John W Sheppard, Mark A Kaufman, and Timothy J Wilmering. Ieee standards for

prognostics and health management. In 2008 IEEE AUTOTESTCON, pages 97–103.

IEEE, 2008.

Angelique Shinn and Tammy Williams. A stitch in time: a simulation of cellular

manufacturing. Production and Inventory Management Journal, 39(1):72, 1998.

Daniel L Silver, Qiang Yang, and Lianghao Li. Lifelong machine learning systems:

Beyond learning algorithms. In 2013 AAAI spring symposium series, 2013.

Alex J Smola and Bernhard Schölkopf. Sparse greedy matrix approximation for machine

learning. In Seventeenth International Conference on Machine Learning (ICML 2000).

International Machine Learning Society, 2000.

Chuang Sun, Meng Ma, Zhibin Zhao, Shaohua Tian, Ruqiang Yan, and Xuefeng Chen. Deep transfer learning based on sparse autoencoder for remaining useful life prediction of tool in manufacturing. IEEE Transactions on Industrial Informatics, 15(4):2416–2425, 2018.


Susan Sun and Kari Johnson. Method and system for determining optimal wafer

sampling in real-time inline monitoring and experimental design. In 2008 International

Symposium on Semiconductor Manufacturing (ISSM), pages 44–47. IEEE, 2008.

Thamo Sutharssan, Stoyan Stoyanov, Chris Bailey, and Chunyan Yin. Prognostic and

health management for engineering systems: a review of the data-driven approach and

algorithms. The Journal of engineering, 2015(7):215–222, 2015.

Qifeng Tang, Dewei Li, and Yugeng Xi. A new active learning strategy for soft sensor

modeling based on feature reconstruction and uncertainty evaluation. Chemometrics

and Intelligent Laboratory Systems, 172:43–51, 2018.

Laifa Tao, Chen Lu, and Azadeh Noktehdan. Similarity recognition of online data

curves based on dynamic spatial time warping for the estimation of lithium-ion battery

capacity. Journal of Power Sources, 293:751–759, 2015.

Alexander Tartakovsky, Igor Nikiforov, and Michele Basseville. Sequential analysis:

Hypothesis testing and changepoint detection. CRC Press, 2014.

Matthew E Taylor and Peter Stone. Transfer learning for reinforcement learning domains:

A survey. Journal of Machine Learning Research, 10(Jul):1633–1685, 2009.

Kwok L Tsui, Nan Chen, Qiang Zhou, Yizhen Hai, and Wenbin Wang. Prognostics and

health management: A review on data driven approaches. Mathematical Problems in

Engineering, 2015, 2015.

Pınar Tüfekci. Prediction of full load electrical power output of a base load operated

combined cycle power plant using machine learning methods. International Journal of

Electrical Power & Energy Systems, 60:126–140, 2014.


Taku Uchimaru and Manabu Kano. Sparse sample regression based just-in-time modeling

(ssr-jit): Beyond locally weighted approach. IFAC-PapersOnLine, 49(7):502–507, 2016.

Vladimir Vapnik. Three remarks on the support vector method of function estimation.

Advances in Kernel Methods: Support Vector Learning, pages 25–41, 1998.

Jian Wan, Bahman Honari, and Seán McLoone. A dynamic sampling methodology for

plasma etch processes using gaussian process regression. In 2013 XXIV International

Conference on Information, Communication and Automation Technologies (ICAT),

pages 1–6. IEEE, 2013.

Jian Wan, Simone Pampuri, Paul G O'Hara, Adrian B Johnston, and Seán McLoone. On regression methods for virtual metrology in semiconductor manufacturing. In 25th IET Irish Signals & Systems Conference 2014 and 2014 China-Ireland International Conference on Information and Communications Technologies, pages 380–385. IET, 2014.

Aiping Wang, Guowei Wan, Zhiquan Cheng, and Sikun Li. An incremental extremely

random forest classifier for online learning and tracking. In 2009 16th ieee international

conference on image processing (icip), pages 1449–1452. IEEE, 2009.

Di Wang, Bo Zhang, Peng Zhang, and Hong Qiao. An online core vector machine with

adaptive meb adjustment. Pattern Recognition, 43(10):3468–3482, 2010.

Jialei Wang, Peilin Zhao, and Steven CH Hoi. Cost-sensitive online classification. IEEE

Transactions on Knowledge and Data Engineering, 26(10):2425–2438, 2013.

Peng Wang and Robert X Gao. Adaptive resampling-based particle filtering for tool life

prediction. Journal of Manufacturing Systems, 37:528–534, 2015.


Peng Wang, Robert X Gao, and Ruqiang Yan. A deep learning-based approach to

material removal rate prediction in polishing. CIRP Annals, 66(1):429–432, 2017.

You- Wang, Jin, -dong Sun, and Can-fei Sun. Planetary gearbox fault

feature learning using conditional variational neural networks under noise environment.

Knowledge-Based Systems, 163:438–449, 2019.

Yu Wang, Qiang Miao, and Michael Pecht. Health monitoring of hard disk drive based on mahalanobis distance. In 2011 Prognostics and System Health Management Conference, pages 1–8. IEEE, 2011.

Chien-Wei Wu and Lea Pearn. A variables sampling plan based on cpmk for product

acceptance determination. European Journal of Operational Research, 184(2):549–560,

2008.

Dongrui Wu, Chin-Teng Lin, and Jian Huang. Active learning for regression using greedy sampling. Information Sciences, 474:90–105, 2019.

Shan Xu, Zhang, and Shizhong Liao. New online kernel ridge regression via incre-

mental predictive sampling. In Proceedings of the 28th ACM International Conference

on Information and Knowledge Management, pages 791–800, 2019.

Shanhu Yang, Behrad Bagheri, Hung-An Kao, and Jay Lee. A unified framework and

platform for designing of cloud-based machine health monitoring and manufacturing

systems. Journal of Manufacturing Science and Engineering, 137(4), 2015.

Murat Yildirim, Nagi Z Gebraeel, and Xu Andy Sun. Integrated predictive analytics

and optimization for opportunistic maintenance and operations in wind farms. IEEE

Transactions on power systems, 32(6):4319–4328, 2017.


Joung Taek Yoon, Byeng D Youn, Minji , Yunhan Kim, and Sooho Kim. Life-cycle

maintenance cost analysis framework considering time-dependent false and missed

alarms for fault diagnosis. Reliability Engineering & System Safety, 184:181–192, 2019.

Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features

in deep neural networks? In Advances in neural information processing systems, pages

3320–3328, 2014.

Hwanjo Yu and Sungchul Kim. Passive sampling for regression. In 2010 IEEE Interna-

tional Conference on Data Mining, pages 1151–1156. IEEE, 2010.

Jianbo Yu. Machine health prognostics using the bayesian-inference-based probabilistic

indication and high-order particle filtering framework. Journal of Sound and Vibration,

358:97–110, 2015.

Jianbo Yu. Adaptive hidden markov model-based online learning framework for bearing

faulty detection and performance degradation monitoring. Mechanical Systems and

Signal Processing, 83:149–162, 2017a.

Jianbo Yu. Aircraft engine health prognostics based on logistic regression with penalization regularization and state-space-based degradation framework. Aerospace Science and Technology, 68:345–361, 2017b.

Zhongtang Zhao, Yiqiang Chen, Junfa Liu, and Mingjie Liu. Cross-mobile ELM based activity recognition. International Journal of Engineering and Industries, 1(1):30–38, 2010.

Xiaojun Zhou, Lifeng Xi, and Jay Lee. Opportunistic preventive maintenance scheduling for a multi-unit series system based on dynamic programming. International Journal of Production Economics, 118(2):361–366, 2009.


Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928–936, 2003.

Enrico Zio. Reliability engineering: Old problems and new challenges. Reliability Engineering & System Safety, 94(2):125–141, 2009.

Enrico Zio. Prognostics and health management of industrial equipment. In Diagnostics and Prognostics of Engineering Systems: Methods and Techniques, pages 333–356. IGI Global, 2013.

Indrė Žliobaitė. Learning under concept drift: an overview. arXiv preprint arXiv:1010.4784, 2010.

Appendix A

List of Publications in PhD Study

1. Journal Papers

• Jianshe Feng, Xiaodong Jia, Feng Zhu, James Moyne, Jimmy Iskandar, and Jay Lee. An online virtual metrology model with sample selection for the tracking of dynamic manufacturing processes with slow drift. IEEE Transactions on Semiconductor Manufacturing, 32(4):574–582, 2019.

• Jianshe Feng, Xiaodong Jia, Feng Zhu, Qibo Yang, Yubin Pan, and Jay Lee. An intelligent system for offshore wind farm maintenance scheduling optimization considering turbine production loss. Journal of Intelligent & Fuzzy Systems, 1:1–13, 2019.

• Jianshe Feng, Xiaodong Jia, Haoshu Cai, Feng Zhu, Li Xiang, and Jay Lee. Cross trajectory Gaussian process regression model for battery health prediction. Journal of Modern Power Systems and Clean Energy, 1:1–13, 2020.

• Haoshu Cai, Jianshe Feng, Wenzhe Li, Yuan-Ming Hsu, and Jay Lee. Similarity-based particle filter for remaining useful life prediction with enhanced performance. Applied Soft Computing, 106474, 2020.


• Haoshu Cai, Jianshe Feng, Qibo Yang, Wenzhe Li, Xiang Li, and Jay Lee. A virtual metrology method with prediction uncertainty based on Gaussian process for chemical mechanical planarization. Computers in Industry, 119:103228, 2020.

• Haoshu Cai, Jianshe Feng, Feng Zhu, Qibo Yang, Xiang Li, and Jay Lee. Adaptive virtual metrology method based on just-in-time reference and particle filter for semiconductor manufacturing. IEEE Transactions on Semiconductor Manufacturing (under review).

• Haoshu Cai, Xiaodong Jia, Jianshe Feng, Wenzhe Li, Laura Pahren, and Jay Lee. A similarity-based methodology for machine prognostics by using kernel two sample test. ISA Transactions, 1:1–13, 2020.

• Haoshu Cai, Xiaodong Jia, Jianshe Feng, Qibo Yang, Yuan-Ming Hsu, Yudi Chen, and Jay Lee. A combined filtering strategy for short term and long term wind speed prediction with improved accuracy. Renewable Energy, 136:1082–1090, 2019.

• Haoshu Cai, Xiaodong Jia, Jianshe Feng, Wenzhe Li, Yuan-Ming Hsu, and Jay Lee. Gaussian process regression for numerical wind speed prediction enhancement. Renewable Energy, 146:2112–2123, 2020.

• Xiaodong Jia, Yuan Di, Jianshe Feng, Qibo Yang, Honghao Dai, and Jay Lee. Adaptive virtual metrology for semiconductor chemical mechanical planarization process using GMDH-type polynomial neural networks. Journal of Process Control, 62:44–54, 2018.

• Pin Li, Xiaodong Jia, Jianshe Feng, and Jay Lee. Prognosability study of ball screw degradation using systematic methodology. Mechanical Systems and Signal Processing, 109:45–57, 2018.


2. Conference Papers

• Jianshe Feng, Haoshu Cai, Jimmy Iskandar, James Moyne, Michael Armacost, and Jay Lee. A framework for adaptive multivariate limits setting and visualization in semi-automated pattern-based feature extraction FD system. In 31st Advanced Process Control Conference, Oct. 2019, San Antonio, TX.

• Haoshu Cai, Jianshe Feng, Jimmy Iskandar, James Moyne, Michael Armacost, and Jay Lee. Pattern-based trace data generator for fault detection of fabrication processes. In 31st Advanced Process Control Conference, Oct. 2019, San Antonio, TX.

• Jianshe Feng. Methodology of adaptive prognostics and health management using streaming data in big data environment. In Proceedings of the Annual Conference of the PHM Society, volume 11, 2019.

• Jianshe Feng, Xinyu Du, and Mutasim Salman. Wheel bearing fault isolation and prognosis using acoustic-based approach. In Proceedings of the Annual Conference of the PHM Society, volume 11, 2019.

• Pin Li, Jianshe Feng, Feng Zhu, and Jay Lee. A deep learning-based method for cutting parameter optimization for band saw machine. In Proceedings of the Annual Conference of the PHM Society, volume 11, 2019.

• Jianshe Feng, Pin Li, Qibo Yang, Hossein Davari, and Jay Lee. Development of an integrated framework for Cyber Physical System (CPS)-enabled rehabilitation system. In Proceedings of the Annual Conference of the PHM Society, volume 11, 2019.


• Xiaodong Jia, Haoshu Cai, Yuanming Hsu, Wenzhe Li, Jianshe Feng, and Jay Lee. A novel similarity-based method for remaining useful life prediction using kernel two sample test. In Proceedings of the Annual Conference of the PHM Society, volume 11, 2019.

• Xiaodong Jia, Bin Huang, Jianshe Feng, Haoshu Cai, and Jay Lee. A review of PHM data competitions from 2008 to 2017. In Proceedings of the Annual Conference of the PHM Society, volume 10, 2018.

• Jianshe Feng, Xiaodong Jia, James Moyne, Jimmy Iskandar, Heng Hao, Kommisetti Subrahmanyam, Michael Armacost, and Jay Lee. Pattern-based trace segmentation and feature extraction for semiconductor manufacturing and application to fault detection. In 30th Advanced Process Control Conference, Oct. 8–11, 2018, Austin, TX, US.

• Hongzhi Tan, Wei Lv, Liwei Jin, Zongchang Liu, and Jianshe Feng. Modeling and solution of offshore wind farm maintenance scheduling. DEStech Transactions on Environment, Energy and Earth Sciences (SEEE), 2016.

3. Patents

• Jianshe Feng, Xinyu Du, and Mutasim Salman. Method and apparatus for monitoring a machine bearing on-vehicle, U.S. Patent, P04XXXX, Oct. 2018 (pending).

• Aisha Yousuf and Jianshe Feng. Method and apparatus for arc fault detection and localization in secondary power grid, U.S. Patent, PXXXXXX, Jan. 2020 (pending).
