Machine Learning for Automated Anomaly Detection in Semiconductor Manufacturing

by

Michael Daniel DeLaus

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Engineering at the Massachusetts Institute of Technology

June 2019

© Massachusetts Institute of Technology 2019. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, May 24, 2019

Certified by: Duane S. Boning, Clarence J. LeBel Professor of Electrical Engineering and Computer Science, Thesis Supervisor

Accepted by: Katrina LaCurts, Chairman, Master of Engineering Thesis Committee

Machine Learning for Automated Anomaly Detection in Semiconductor Manufacturing

by

Michael Daniel DeLaus

Submitted to the Department of Electrical Engineering and Computer Science on May 24, 2019, in partial fulfillment of the requirements for the degree of Master of Engineering

Abstract

In the realm of semiconductor manufacturing, detecting anomalies during manufacturing processes is crucial. However, current methods of anomaly detection often rely on simple excursion detection and manual inspection of machine sensor data to determine the cause of a problem. To improve semiconductor production line quality, machine learning tools can be developed for more thorough and accurate anomaly detection. Previous work on applying machine learning to anomaly detection focused on building reference cycles and on using clustering and forecasting to detect anomalous wafer cycles. We seek to improve upon these techniques and apply them to related domains of semiconductor manufacturing. The main focus is to develop a process for automated anomaly detection by combining the previously used methods with time series forecasting and prediction. We also explore detecting anomalies across multiple semiconductor manufacturing machines and recipes.

Thesis Supervisor: Duane S. Boning
Title: Clarence J. LeBel Professor of Electrical Engineering and Computer Science

Acknowledgments

I would like to take this opportunity to express my great appreciation to all of the people who helped me throughout this project. This journey would not have been possible without the love and support of my parents, Mike and Susan, and two brothers, Robert and Dahn. Thank you for encouraging me every step of the way and for always being there for me.

I would like to thank Prof. Duane Boning for his guidance throughout this entire project and all of his valuable advice. Without his support, this project would not have been possible. I would also like to thank Mr. Jack Dillon, Mr. Dennis Murphy, Mr. Alan Bowers, and Mr. Adrian McKiernan for their help in getting access to the resources that I needed from Analog Devices.

Contents

1 Introduction 13
  1.1 Current Anomaly Detection Practices 14
  1.2 Thesis Goals 14
  1.3 Methodology 15
  1.4 Thesis Outline 15

2 Literature Review 17
  2.1 Anomaly Detection 17
  2.2 Previous Work 20
    2.2.1 Reference Cycles 20
    2.2.2 Cluster Analysis 21
    2.2.3 Time Series Forecasting with Neural Networks 22

3 Dataset 27
  3.1 Data characteristics 28
  3.2 Anomalies in Data 30

4 Methods 33
  4.1 Time-Series Averaging for Reference Cycle 33
    4.1.1 Dynamic Time Warping 33
    4.1.2 DTW Barycenter Averaging 34
  4.2 Clustering Algorithms 35
    4.2.1 K-Means 36
    4.2.2 K-Medoids 36
    4.2.3 CLARA 36
    4.2.4 Agglomerative 37
    4.2.5 Divisive 37
  4.3 Neural Networks 37
    4.3.1 Multi-Layer Perceptron 37
    4.3.2 Long Short-Term Memory 38
  4.4 Anomaly Detection Pipeline 38

5 Automated Anomaly Detection Experiments 41
  5.1 Reference Cycle 41
  5.2 Clustering Analysis 42
  5.3 Anomaly Detection 47
    5.3.1 Training the Models 47
    5.3.2 Identification of Anomalous Points 48
  5.4 Distribution Validation 55
    5.4.1 Empirical Probability Density Function 56
    5.4.2 Empirical PDF Experimental Results 58

6 Further Experiments 61
  6.1 Deviation Scores 61
  6.2 Across Different Machines 63
    6.2.1 Clustering different recipes 64
    6.2.2 Time-series forecasting across recipes 65

7 Future Work 67
  7.1 Experiment Recommendations 67
  7.2 Predicting Machine Failures 68

8 Conclusion 69

List of Figures

2-1 Example of Cluster Analysis [17] 21
2-2 Example of Time Series Forecasting for Stock Prices [11] 23
2-3 Architecture of a Long Short-Term Memory Model [5] 24

3-1 Difference between recipe 920 and 945 for Parameter 19, for normal (good) runs 28
3-2 Drift between cycles for parameter 17 [5] 29
3-3 Difference between recipe 920 and 945 for Parameter 19 [5] 30
3-4 Normal (right) vs. Anomalous (left) data [5] 31

4-1 Mapping between Time-Series A and B [16] 34
4-2 Clustering techniques used 35
4-3 Automated Anomaly Detection Model 38
4-4 Reference Cycle Example 39
4-5 Clustering Plasma Etcher Data 39

5-1 Data format [14] 43
5-2 K-Means method applied to 150 wafer cycles 44
5-3 K-Medoids method applied to 70 wafer cycles 44
5-4 CLARA method applied to 150 wafer cycles 45
5-5 Agglomerative clustering method applied to 20 wafer cycles 45
5-6 Divisive clustering method applied to 20 wafer cycles 46
5-7 MLP forecast one cycle (parameter 19) 48
5-8 LSTM forecast one cycle (parameter 19) 49
5-9 MLP-single model 51
5-10 MLP-uni model 52
5-11 LSTM-single model 52
5-12 LSTM-uni model 53
5-13 Forecasting and residual plots of LSTM model trained on recipe 920 and tested on recipe 945 (Parameter 19) 55
5-14 Normal Q-Q plot and histogram of residuals for parameter 5 56
5-15 Normal Q-Q plot and histogram of residuals for parameter 17 57
5-16 Normal Q-Q plot and histogram of residuals for parameter 19 57
5-17 Empirical PDF frequency plots for Parameter 5, Recipe 920 58

6-1 Sum of residual values for normal (green) and anomalous (red) cycle for one cycle of parameter 5 and parameter 19 (Recipe 920) 62
6-2 Deviation plots for ideal (blue), normal (green) and anomalous (red) cycles for parameters 5 and 19 (Recipe 920) 63
6-3 Recipe 920 (Blue) and Recipe 120 (Green), one cycle, parameter 19 64
6-4 K-Means clustering for 30 cycles of recipe 920 and 10 cycles of recipe 120 (parameter 19) 65
6-5 LSTM trained on recipe 920, forecasting recipe 120 (parameter 19). The black line represents the actual values and the green line is the predicted values 66

List of Tables

3.1 Anomalous Parameters in Plasma Etcher Data 27

5.1 Cluster Validation Results 46
5.2 Anomalous Time-steps in Recipe 920 Parameters 49
5.3 Anomalous Time-steps in Recipe 920 Parameters 51
5.4 Anomalous Time-steps in Recipe 945 Parameters 54
5.5 Anomalous Time-steps in Recipe 920 Downsampled Parameters 54
5.6 Anomalous Time-steps Train on 920, Test on 945 54
5.7 Performance comparison of empirical PDF vs. Gaussian for anomaly detection 59

Chapter 1

Introduction

Manufacturing offers a rich environment for the application of machine learning techniques, particularly in the realm of semiconductor manufacturing. Semiconductor manufacturing facilities are equipped with many sensors which monitor the manufacturing process and the semiconductors that are made. In order to reduce costs, companies use the data generated by the sensors on the various machines to further optimize their manufacturing processes.

Currently, much of the data that is generated is only used for troubleshooting when a problem arises. A single manufacturing process has hundreds, if not thousands, of parameters from sensors, so efficiently determining the source of a problem in a process is difficult. The wafers that semiconductors are manufactured on go through multiple process cycles. This process is long, and when a cycle goes wrong, it is hard to detect anomalies in time, so the process continues until it is finished. These wafers are expensive to produce, so a process failure can cause a substantial loss in both cost and time.

This is why machine learning offers great potential for anomaly detection in semiconductor manufacturing. If anomalies in the manufacturing process could be detected, or even predicted, earlier, then a manufacturing facility could halt the process and correct the affected machine. This would increase process yield and

reduce costs, both of which are of great interest to semiconductor manufacturers.

1.1 Current Anomaly Detection Practices

The primary focus of this study is on semiconductor manufacturing data collected by Analog Devices (ADI). Presently, ADI has had limited success in using Statistical Process Control (SPC) and limits monitoring in its fabrication process. These methods are often unable to reliably detect out-of-control processes and temporal anomalies. This is due to the complex nature of the manufacturing process, which involves multiple recipes and parameters, making it difficult to set individual thresholds and limits for each data channel.

A single semiconductor manufacturing process has hundreds of parameters from many different sensors, making it infeasible to monitor each and every parameter effectively. Thus, ADI has depended on a reactive rather than proactive approach to anomalous events, mostly using the data to troubleshoot an issue after it has occurred rather than flagging and analyzing anomalies as they occur. Even then, it can be very difficult to manually identify the specific parameters of a process that were responsible for an anomalous event.

1.2 Thesis Goals

The current anomaly detection protocol at ADI presents a promising opportunity for the application of machine learning based anomaly detection methods. The aim of this thesis is to explore the feasibility of a pipeline for automatically detecting anomalous events as they occur. We want to develop a model that requires only a small amount of domain knowledge, in the form of labelled data, and thus can be applied to many different machines and recipes.

This thesis is a combination and expansion of three previous theses that each focused on a different aspect of improving upon Analog Devices' anomaly detection

practices, which are discussed in depth in Chapter 2 [5, 14, 7]. By building off of these previous works, we hope to implement a real-time analysis system for effectively flagging anomalies. Currently, many issues in a manufacturing process may remain undiscovered until the process is complete, which can take hours, days, or even weeks for a specific wafer depending on the process. Thus, the benefit of real-time analysis is that it would allow for early termination of the process, saving both time and money.

1.3 Methodology

In this thesis, a literature review is conducted first to discuss common techniques of anomaly detection. Time-series averaging, cluster analysis, and time-series forecasting methods are also discussed, as these are integral parts of our approach and the main focus of the previous work conducted at Analog Devices. We then examine the datasets on which we wish to perform anomaly detection and the properties of this data.

Next, an automated anomaly detection model is developed and evaluated on its ability to correctly identify anomalous events. This process starts with training our model to identify normal wafer cycles and performing one-step-ahead prediction on those cycles. The model is then tested on its accuracy in detecting anomalies. Different experiments are carried out across different process recipes, and the model is assessed on its predictive capabilities. This model is then extended to predicting anomalous trends in data over time rather than at individual time-steps.

1.4 Thesis Outline

There are eight chapters in this thesis. The first chapter introduces the background and purpose of this study, while Chapter 2 is a discussion of existing literature on anomaly detection techniques in the industry and the previous work that this thesis

is based on. The third chapter discusses the primary dataset for this thesis, and Chapter 4 outlines the methods and structure of the automated anomaly detection pipeline. Chapter 5 focuses on the experimental results of our model in a number of different scenarios. Chapter 6 discusses alternative anomaly detection methods and experiments on different machines. The seventh chapter discusses further work that can be explored in the future, and Chapter 8 concludes the thesis.

Chapter 2

Literature Review

In this chapter, a literature review is conducted to first understand the anomaly detection problem in more detail before presenting literature on cluster analysis, time series forecasting, and anomaly detection methods. Several predictive models used in literature are also presented.

2.1 Anomaly Detection

An anomaly in time series data is defined as a point or sequence of points that deviates from the normal behavior of the data [3]. Anomaly detection is a problem that arises in a variety of domains, including manufacturing, economics, transportation, and health care [1]. There is no single answer to anomaly detection, as different domains may define anomalies in different ways [9].

Using machine learning for detecting anomalies in time-series data has been approached in both an unsupervised and a semi-supervised manner [1, 6]. The amount of supervision required is determined by the amount of information about the data that is available. For instance, supervised learning can be applied to a dataset that has previously labelled data. On the other hand, data that has few or no labels requires semi-supervised or unsupervised methods in order to classify the data. Supervised learning models, while able to detect previously seen anomalies,

are often unable to detect new anomalous patterns. Additionally, anomalies occur (hopefully) infrequently in data, creating an unbalanced distribution of anomalous versus normal examples [3].

For anomaly detection, clustering methods can be used in the unsupervised setting, while neural network models can be used for supervised learning [10, 21]. Anomalies may also manifest in many different ways. The easiest anomalies to detect are extreme values that exceed the standard operating range of the process. Limits can be placed on each sensor channel to automatically detect these point anomalies when the specified threshold is violated [3]. A bigger challenge are contextual anomalies that occur within the normal operating range but do not conform to the expected temporal pattern [20]. These anomalies occur frequently in manufacturing environments and are difficult to detect reliably using SPC methods or limits monitoring.
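Limit-based point-anomaly detection can be sketched in a few lines. The sensor trace and limit values below are hypothetical, chosen only to illustrate the idea of per-channel thresholds:

```python
import numpy as np

def detect_point_anomalies(series, lower, upper):
    """Return the indices where a signal violates fixed channel limits."""
    series = np.asarray(series, dtype=float)
    return np.where((series < lower) | (series > upper))[0]

# Hypothetical sensor trace: values between 0 and 1 are normal,
# 5.0 is an excursion beyond the channel's operating range.
trace = [0.2, 0.4, 0.3, 5.0, 0.5]
print(detect_point_anomalies(trace, lower=0.0, upper=1.0))  # -> [3]
```

As the text notes, this catches only point anomalies; contextual anomalies stay inside the limits and pass through such a check unnoticed.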

Another challenge is the highly multivariate data from the multiple sensor channels that monitor the manufacturing process. Certain anomalies require taking into account the effects that multiple parameters may have on the output of the machine [3]. Even with advanced knowledge of the machinery, it is still difficult to understand the complex relationships within the multiple channels of data. Hence, monitoring each channel by itself may not reveal these multivariate anomalies. With these challenges in mind, every solution to the anomaly detection problem must address a few questions [1]:

1. What is considered to be normal or the range of normal behavior?

2. What measure is used to differentiate normal and abnormal behavior?

3. At what point will the abnormal behavior be considered an anomaly?

In this thesis, anomalies are detected through the use of both supervised and unsupervised methods. Predictive models address the first question by forecasting the expected normal behavior based on past data. What is considered "normal" is

decided and learned by the predictive models during training. By characterizing the distribution of the model's prediction errors, anomalies that deviate from the norm (high prediction error) can be determined. Each potentially anomalous point has a certain probability of occurring based on the distribution of prediction errors. The deviation from the expected normal behavior provides a measure of anomalous behavior. An anomaly can then be flagged when it falls outside a specified threshold. Thresholds can take the form of standard deviations from the mean for a Gaussian distribution, or a desired level of confidence in the deviation being anomalous for non-parametric distributions.
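The Gaussian variant of this thresholding can be sketched as follows. The residual values are synthetic, and the three-sigma threshold is an illustrative choice, not a value prescribed by the thesis:

```python
import numpy as np

def gaussian_threshold(train_residuals, test_residuals, k=3.0):
    """Flag residuals more than k standard deviations away from the mean
    of the residuals observed on normal (training) cycles."""
    mu = np.mean(train_residuals)
    sigma = np.std(train_residuals)
    return np.abs(np.asarray(test_residuals) - mu) > k * sigma

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=1000)  # residuals on good cycles
test = np.array([0.5, -0.8, 6.0])         # 6.0 lies far outside 3 sigma
print(gaussian_threshold(normal, test))   # -> [False False  True]
```

For non-parametric residual distributions, the same flagging step would instead compare each residual against an empirical quantile of the training residuals.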

There are a variety of methods or measures that can be used to differentiate normal points from anomalous ones, depending on what is being predicted. The standard forecasting output is a one-step-ahead prediction; these have been more widely documented in the literature than multiple-step forecasting. Different variants of multiple-step forecasting are detailed in [22], where they are categorized into either a direct method or a recursive method. However, these multi-step forecasting methods are not used in this thesis.

The methodology of anomaly detection fundamentally compares the expected prediction errors to the actual prediction error. The prediction error, or residual, can be defined as the Euclidean difference between the predicted value and the actual value. In the one-step forecast, the residual of the current time step is compared to the distribution of residuals of the whole sequence. The data point is considered an anomaly if it falls outside the threshold limits set for the given distribution. Another measure for anomalous behavior in time series forecasting is to compute the area bounded between the forecast and the actual data: the bigger the area, the more anomalous the sequence of points is. Moreover, the cumulative sum technique can also be applied to the residuals to identify anomalous trends in the data.
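The cumulative sum idea mentioned above can be sketched as a one-sided CUSUM over the residual stream. The `drift` and `threshold` parameters are illustrative choices, and the residual sequence is synthetic:

```python
def cusum(residuals, drift=0.5, threshold=5.0):
    """One-sided cumulative-sum statistic over prediction residuals.
    Small residuals below `drift` are absorbed; a sustained positive
    shift accumulates until the statistic crosses `threshold`.
    Returns the first index at which that happens, or -1 if never."""
    s = 0.0
    for t, r in enumerate(residuals):
        s = max(0.0, s + r - drift)
        if s > threshold:
            return t
    return -1

# Small residuals are ignored, but a sustained shift starting at
# index 10 accumulates and trips the detector a few steps later.
res = [0.1] * 10 + [2.0] * 10
print(cusum(res))  # -> 13
```

This is exactly the kind of trend detector that a single-point threshold misses: no individual residual of 2.0 need be extreme on its own, yet the run of them is anomalous.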

2.2 Previous Work

As mentioned in Chapter 1, the work in this thesis is based on the work of three MIT students who conducted research at Analog Devices with the goal of applying machine learning techniques to semiconductor manufacturing. The scope of the project was to create a platform to analyze data from the manufacturing process. This included automatically detecting suspicious anomalies in data, with the ultimate goal of using the techniques developed to provide early warning alerts of process anomalies. The major case studied was related to an unconfined plasma excursion that happened during the plasma etching process. The students explored three different sets of machine learning techniques for detecting anomalies in semiconductor manufacturing:

1. Building reference cycles for data comparison [7]

2. Cluster analysis algorithms for anomaly detection [14]

3. Time series forecasting using neural networks [5]

2.2.1 Reference Cycles

Building a reference cycle enables more efficient anomaly detection by creating a standard cycle from data that is considered "good". This reference cycle can then be compared to anomalous cycles to more efficiently determine which process parameters were responsible for a problem. Previously, engineers would manually compare data, relying on their experience to determine which parameter went wrong. By building a reference cycle that represents what good, or non-anomalous, data should look like, anomaly detection accuracy can be improved for both manual and automatic methods. These reference cycles also allow for the application of anomaly classification and detection methods such as clustering and time series forecasting.

2.2.2 Cluster Analysis

Cluster analysis consists of a set of unsupervised learning methods aimed at determining subgroups, or clusters, of observations within a data set. Such an analysis is based on the information found in the data that describes the objects and their relationships. Unlike classification analysis, clustering requires determining the number and composition of the groups. It is used in a variety of fields, including biology, statistics, and pattern recognition. The aim of clustering is to group similar objects while keeping the groups different from one another, such that the clusters present high internal homogeneity in addition to high external heterogeneity. The greater the similarity within a group and the dissimilarity between groups, the higher the quality of the cluster analysis. Figure 2-1 shows an example of clustering data into three distinct clusters.

Figure 2-1: Example of Cluster Analysis [17]

Multiple types of clustering and clusters have been developed, and it is important to make a relevant choice of both when performing cluster analysis. Hierarchical and partitional clustering are the most commonly used. Partitional clustering consists of dividing the set of data objects into mutually disjoint partitions if possible. The clusters are desired to be distinct

from one another. Hierarchical clustering consists of permitting clusters to be nested, meaning that clusters can also be part of subgroups; clusters are thus organized as a tree. Cluster analysis can be a useful exploratory tool for discovering outliers in a dataset. An outlier is defined as an anomalous observation, one that significantly deviates from the rest of the data. Identifying outliers in a dataset is a precious source of information on the untypical behavior of a set of observations.

By using a reference cycle, cluster analysis can be applied to process data to detect and visualize anomalies. In [14], various clustering algorithms were used on data from a plasma etcher, including k-means, k-medoids, CLARA, agglomerative, and divisive approaches, which are discussed further in Chapter 4. With these techniques, good cycles of data will be clustered with the reference cycle, while bad cycles will be clustered separately. This cluster analysis was carried out on both univariate (single parameter) and multivariate (multiple parameter) data. Multivariate cluster analysis achieved high accuracy on both single-recipe clustering (100%) and dual-recipe clustering (93%) [14].

2.2.3 Time Series Forecasting with Neural Networks

Forecasting, in a nutshell, is a way to predict future data points based on past data. The lookback, b, is defined as the number of past points that are considered, while the lookahead, a, is defined as the number of future points that will be predicted. This is summarized by Equation 2.1 for a single time step; the same equation can be applied to each of the lookahead points.

ŷ_t = f(y_{t−1}, y_{t−2}, …, y_{t−b})    (2.1)

Here ŷ_t is the prediction at time t, while the y_i are previous values of the time series at times i. Determining the number of lookback points is essentially finding the optimal number of points that provide sufficient information for the prediction of future time steps. A large lookback value would ensure an accurate forecast of the normal

behavior. However, depending on the model, it may result in inefficient usage of memory and processing power. Too large a lookback value (with equal weights) may also cause the model to be insensitive to recent changes. Intuitively, looking too far ahead would result in a decrease in forecast accuracy, as the point to be predicted gets further away from the past data points. The numbers of lookback and lookahead points depend heavily on the data and model used and are crucial parameters to consider when optimizing the model.
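The windowing described by Equation 2.1 amounts to a simple dataset-construction step before training any forecaster. The helper name `make_windows` is ours, not from the thesis:

```python
import numpy as np

def make_windows(series, lookback, lookahead=1):
    """Turn a 1-D series into supervised (X, y) pairs: each row of X holds
    `lookback` past values and each row of y holds the next `lookahead`
    values, matching y_t = f(y_{t-1}, ..., y_{t-b})."""
    series = np.asarray(series, dtype=float)
    X, y = [], []
    for t in range(lookback, len(series) - lookahead + 1):
        X.append(series[t - lookback:t])
        y.append(series[t:t + lookahead])
    return np.array(X), np.array(y)

# Two training pairs result: [1,2,3] -> 4 and [2,3,4] -> 5.
X, y = make_windows([1, 2, 3, 4, 5], lookback=3)
print(X.shape, y.shape)  # -> (2, 3) (2, 1)
```

Any one-step-ahead model, MLP or LSTM alike, can then be fit on the resulting (X, y) pairs.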

Figure 2-2: Example of Time Series Forecasting for Stock Prices [11]

Machine learning models have been increasingly popular in the forecasting field of research. In particular, neural network models have been found to be robust and highly accurate in applications such as forecasting medical data and stock market prices [20]. Figure 2-2 shows an example of applying time-series forecasting to predict future stock market prices. The Multi-Layer Perceptron (MLP), a simple feed-forward neural network with multiple hidden layers, has been found to be able to learn complex time dependencies and correlations within the data. While not specialized for use in forecasting, MLP models are able to achieve satisfactory results in several applications [20].

Recurrent neural networks (RNNs) are also a topic of interest in time series forecasting, as the inherent structure of an RNN enables it to retain a memory state from each time step to the next. This specialized structure makes it excellent in its application to sequences such as time series data. Basic RNNs, however, have limited performance on long sequences of data due to the vanishing gradient problem [20]. The Long Short-Term Memory (LSTM) model, a variant of the RNN structure, introduced an innovative solution to this problem [8]. LSTM models have been widely researched and have been found to perform exceptionally well in forecasting time series data with long-term temporal dependencies. This increased performance comes with the added complexity of the LSTM model, making it more difficult to implement and train than standard neural networks.

Figure 2-3: Architecture of a Long Short-Term Memory Model [5]

The neural network models discussed can be used to detect anomalies by comparing good, non-anomalous time series to other sets of data during the manufacturing process. In [5], multi-layer perceptron and recurrent neural network models were developed for anomaly detection. These models were first trained on non-anomalous time series data. Then, when they are exposed to other data sets, anomalies are flagged when the current data differs from the expected values at a given time by a specified significance level. The best model developed was able to detect anomalies reliably, with up to 92% accuracy. However, this high accuracy was only achievable

24 under supervised training of the models [5].

Chapter 3

Dataset

The primary dataset for experimentation originates from the Brookside server for the plasma etcher machine (OXLR7_LAMAL1). The Brookside server records a total of 31 parameters, a subset of the many parameters available from the machine. The entire dataset consists of two recipes (920 and 945) that were running on the machine when an unconfined plasma excursion occurred during the period of 2016-07-26 to 2016-08-02. For each of these recipes we have around 500 cycles of data available. This issue resulted in known anomalies identified in three specific parameters, as shown in Table 3.1. All dates are in the format YYYY-MM-DD.

Table 3.1: Anomalous Parameters in Plasma Etcher Data

No.  Parameter                   Period of Anomaly Occurrence
5    BOT_RF_RevPwr_In            2016-07-26 to 2016-08-02
17   ProcChm_Bot_Elec_Temp_Mon   2016-07-19 to 2016-08-02
19   ProcChm_EndPt_ChanC_In      2016-07-26 to 2016-08-02

The dataset consists of both good and bad data, as labelled by engineers at ADI. These labels allow us to test our models’ predictive and classification performances.

Figure 3-1: Difference between recipe 920 and 945 for Parameter 19, for normal (good) runs

3.1 Data characteristics

The data for the plasma etcher is split up by wafer cycles, where each cycle represents one run of the plasma etcher. A cycle is either 600 timesteps (Recipe 920) or 300 timesteps (Recipe 945). Figure 3-1 shows the difference between the two recipes for a given parameter; note that the major difference between the two is the duration.

Out of the 31 total parameters, six (10, 11, 13, 20, 27, and 28) show no variation throughout the whole dataset and thus are excluded from the dataset during our experiments; this also avoids these parameters adding noise to our models and decreasing prediction accuracy. Within the data there are two properties that we must be aware of when creating our models: drift in the data and known anomalies.

Drift

Several of the parameters seem to exhibit a drifting behavior, as illustrated in Figure 3-2. The drift occurs within the wafer cycles of the same lot and resets with each

succeeding lot. Through our experiments, we did not find the drift to seriously inhibit the performance of our model.

Figure 3-2: Drift between cycles for parameter 17 [5]

Known Anomalies and Noise

Within the dataset, it can be observed that several noise spikes occur randomly throughout the wafer cycle, as shown in Figure 3-3. While these spikes would normally be considered outliers or anomalies, they occur regularly and are ignored. These spikes are due to the inability of a mechanical component within the machine to stabilize its position.

While the problem is known, it has no significance for the process output and is hence considered part of the "normal" behavior of the machine; these known anomalies are meant to be ignored as well. While these noise spikes and anomalies could be removed during preprocessing, doing so requires prior knowledge of the process and parameters to craft specialized preprocessing techniques. The predictive models are therefore trained on data that contain this "normal" behavior, to test whether they are able to ignore these known anomalies and treat them as part of the normal behavior of the data.

Figure 3-3: Difference between recipe 920 and 945 for Parameter 19 [5]

3.2 Anomalies in Data

Anomalies in the plasma etcher data can be visually identified when compared to expected normal operating behavior. This can be seen in Figure 3-4, where each colored line in each plot represents one wafer cycle. The anomalies in these parameters present themselves with different levels of significance: Parameters 19 and 17 show fairly obvious differences between normal and anomalous data, while anomalies in parameter 5 are less obvious.

Figure 3-4: Normal (right) vs. Anomalous (left) data [5]

Chapter 4

Methods

This chapter discusses the specific methods used for our anomaly detection model. First, the time-series averaging methods used to create the reference cycle are explained. Next, the specific clustering algorithms used and their differences are discussed. After this, the specific implementations of the MLP and LSTM models are discussed. Finally, an overview of our entire automated anomaly detection pipeline is given.

4.1 Time-Series Averaging for Reference Cycle

4.1.1 Dynamic Time Warping

Dynamic Time Warping (DTW) is a time series alignment algorithm that calculates and compares the dissimilarity between two time series based upon a distance measure. The shorter the DTW distance, the more similar the two series. It iteratively warps the two time series until the DTW distance between them is minimized, mapping one onto the other [4]. For two time series,

A = (a_1, a_2, …, a_n) and B = (b_1, b_2, …, b_m), with lengths n and m respectively, it initially creates an n-by-m distance matrix. The time series A and B can be either univariate or multivariate, but the two must have the same number of parameters. Each element in the matrix is a cumulative distance of a minimum of

33 Figure 4-1: Mapping between Time-Series A and B [16]

the three surrounding neighbors. The (i, j) element Y_{i,j} of the matrix is defined in [7] as:

Y_{i,j} = |a_i − b_j|^p + min{Y_{i−1,j−1}, Y_{i−1,j}, Y_{i,j−1}}    (4.1)

where 1 ≤ i ≤ n, 1 ≤ j ≤ m, Y_{0,0} = 0, and Y_{i,0} = Y_{0,j} = ∞.

Here Y_{i,j} is the sum of the distance between the i-th point in the A series and the j-th point in the B series, |a_i − b_j|^p, and the minimum of the three cumulative distances surrounding the (i, j) element. The variable p is the order of the |a_i − b_j| norm; normally p is chosen to be 2 so that the Euclidean distance is used to measure the distance between two points. The cumulative distance between the two series is

finally determined by 푌푖,푗. An example of the mapping is shown in Figure 4-1, where the query series, 퐴 = {2, 3, 8, 2, 3, 1, 3} is aligned to series, 퐵 = {3, 1, 2, 3, 8, 3, 2} [16].

DTW can find an optimal global alignment between series and thus is probably the most popular measure to quantify the dissimilarity between sequences [12]. The benefit of using DTW is that the two time series do not need to be of equal lengths.
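Equation 4.1 translates directly into a small dynamic program. The following is a minimal pure-Python sketch for univariate series with $p = 1$ (the example series are made up, not from the plasma etcher data):

```python
def dtw_distance(a, b, p=1):
    """Cumulative DTW distance between two univariate series (Eq. 4.1)."""
    n, m = len(a), len(b)
    INF = float("inf")
    # Borders initialized to infinity so every warping path starts at (0, 0).
    Y = [[INF] * (m + 1) for _ in range(n + 1)]
    Y[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1]) ** p
            Y[i][j] = cost + min(Y[i - 1][j - 1],  # diagonal (match)
                                 Y[i - 1][j],      # step in series A
                                 Y[i][j - 1])      # step in series B
    return Y[n][m]

# Identical series align perfectly ...
print(dtw_distance([1, 2, 3], [1, 2, 3]))      # -> 0.0
# ... and DTW tolerates different lengths by warping repeated values.
print(dtw_distance([1, 2], [1, 2, 2, 2]))      # -> 0.0
```

The second call illustrates the point made above: the two series have different lengths, yet DTW still finds a zero-cost alignment.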

4.1.2 DTW Barycenter Averaging

The DTW Barycenter Averaging (DBA) method generates a centroid from a cluster of time series based upon DTW [18]. It is an iterative and global method; "global" means that the order in which the series are input to the function does not affect the result. During DBA, a centroid is initially selected for the cluster, normally by randomly selecting a time series from the cluster. On each iteration, the DTW alignment between each time series in the cluster and the centroid is recalculated and updated. All points in the cluster series that correspond to the same point in the centroid are grouped and then averaged to get the new value of that centroid point. Iterations continue until either the upper limit on the number of iterations is reached or the centroid converges.
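The iteration described above can be sketched in pure Python for univariate series. This is a simplified illustration, not the implementation from [18], which includes convergence checks and efficiency optimizations:

```python
def dtw_path(a, b):
    """DTW alignment path between two univariate series (p = 1)."""
    n, m = len(a), len(b)
    INF = float("inf")
    Y = [[INF] * (m + 1) for _ in range(n + 1)]
    Y[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            Y[i][j] = abs(a[i - 1] - b[j - 1]) + min(
                Y[i - 1][j - 1], Y[i - 1][j], Y[i][j - 1])
    # Backtrack from (n, m) to recover which points were aligned.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        best = min(Y[i - 1][j - 1], Y[i - 1][j], Y[i][j - 1])
        if best == Y[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif best == Y[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def dba(series, n_iter=10):
    """DBA: iteratively refine a centroid from a cluster of series."""
    centroid = list(series[0])          # init: pick one member as centroid
    for _ in range(n_iter):
        buckets = [[] for _ in centroid]
        for s in series:
            # Group every series point aligned to centroid point ci ...
            for ci, sj in dtw_path(centroid, s):
                buckets[ci].append(s[sj])
        # ... and average each group to update that centroid point.
        centroid = [sum(b) / len(b) for b in buckets]
    return centroid

centroid = dba([[0, 0], [2, 2]])   # averages to [1.0, 1.0]
```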

4.2 Clustering Algorithms

For our cluster analysis, we considered five different techniques. Three of these are partitioning methods: K-means, K-Medoids and CLARA, while two of these are hierarchical: Agglomerative and Divisive. Figure 4-2 shows the grouping of the five different clustering methods used.

Figure 4-2: Clustering techniques used

4.2.1 K-Means

The objective of k-means clustering is to minimize the total intra-cluster variance, or the squared error function $J$ [13]. Given a set of $k$ clusters $(C_1, C_2, \ldots, C_k)$, the k-means objective function is determined by the following equation:

$$J = \sum_{i=1}^{k} \sum_{x \in C_i} \|x - \mu_i\|^2 \qquad (4.2)$$

Here $\mu_i$ represents the mean of the $i$-th cluster $C_i$, and $\|x - \mu_i\|^2$ is the distance metric between a data point $x$ and the center of cluster $C_i$.
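A minimal sketch of Lloyd's algorithm, the standard procedure for minimizing this objective, is shown below (pure Python on one-dimensional toy data of our own choosing; this is not the thesis's R implementation):

```python
def kmeans(xs, init_centers, n_iter=20):
    """Lloyd's algorithm on 1-D data: alternately assign each point to the
    nearest center, then move each center to its cluster mean."""
    mus = list(init_centers)
    for _ in range(n_iter):
        clusters = [[] for _ in mus]
        for x in xs:                                         # assignment step
            nearest = min(range(len(mus)), key=lambda k: (x - mus[k]) ** 2)
            clusters[nearest].append(x)
        # update step: empty clusters keep their old center
        mus = [sum(c) / len(c) if c else mus[k] for k, c in enumerate(clusters)]
    # objective J of Equation 4.2 for the final clustering
    J = sum((x - mus[k]) ** 2 for k, c in enumerate(clusters) for x in c)
    return mus, J

mus, J = kmeans([0, 1, 10, 11], [0, 10])
# mus == [0.5, 10.5]; J == 1.0
```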

4.2.2 K-Medoids

The k-medoids algorithm [2] is a clustering method that is related to k-means. While its aim is also to partition the data set, unlike k-means, k-medoids defines a prototype in terms of a "medoid". A medoid is by definition an actual data point, namely the point that best represents most of the points of the cluster. Given a set of $k$ clusters $(C_1, C_2, \ldots, C_k)$, the k-medoids objective function is determined by the following equation:

$$J = \sum_{i=1}^{k} \sum_{x \in C_i} \|x - c_i\|^2 \qquad (4.3)$$

Here $c_i$ represents the medoid of the $i$-th cluster $C_i$, and $\|x - c_i\|^2$ is the distance metric between a data point $x$ and the medoid of cluster $C_i$.
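The contrast with k-means can be illustrated with a minimal alternating sketch (pure Python, 1-D toy data; the full k-medoids algorithm, PAM, performs a more thorough swap-based search than this):

```python
def medoid(cluster):
    """The medoid is the member minimizing total squared distance to the rest."""
    return min(cluster, key=lambda c: sum((x - c) ** 2 for x in cluster))

def k_medoids(xs, init_medoids, n_iter=20):
    """Alternate nearest-medoid assignment with medoid recomputation.
    Unlike k-means, every cluster prototype is an actual data point."""
    meds = list(init_medoids)
    for _ in range(n_iter):
        clusters = [[] for _ in meds]
        for x in xs:
            nearest = min(range(len(meds)), key=lambda k: (x - meds[k]) ** 2)
            clusters[nearest].append(x)
        meds = [medoid(c) if c else meds[k] for k, c in enumerate(clusters)]
    return meds

meds = k_medoids([0, 1, 2, 10, 11, 12], [0, 10])
# meds == [1, 11]: both prototypes are members of the data set
```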

4.2.3 CLARA

Clustering LARge Applications (CLARA) [23] is an extension of the k-medoids algorithm that draws a sample from the large dataset and applies the k-medoids algorithm to determine an optimal set of medoids. For each sample, the k-medoids objective function is minimized. CLARA repeats the sampling and clustering procedure a given number of times and keeps the clustering with the minimal cost across the iterations. The corresponding clustering is then selected as the final clustering result.

4.2.4 Agglomerative

Known as a "bottom-up" approach and sometimes called "Agglomerative Nesting" (AGNES), the agglomerative method is a hierarchical clustering algorithm [15]. Every data point starts in its own cluster of one (a leaf), and new clusters are defined by iteratively merging observations while moving up the hierarchy until there is just one big cluster (the root).
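A toy single-linkage version of this bottom-up procedure can be sketched as follows (pure Python on 1-D points; this is our own simplification, since AGNES supports several linkage criteria):

```python
def agglomerative(xs, n_clusters):
    """Bottom-up clustering: start with singleton clusters (leaves) and
    repeatedly merge the closest pair (single linkage) until
    n_clusters remain."""
    clusters = [[x] for x in xs]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between closest members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge the closest pair
        del clusters[j]
    return clusters

clusters = agglomerative([0, 1, 10, 11], n_clusters=2)
# clusters == [[0, 1], [10, 11]]
```

Running the merges all the way down to one cluster reproduces the full hierarchy described above; stopping early yields a flat clustering.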

4.2.5 Divisive

Known as a "top-down" approach and also referred to as "Divisive Analysis" (DIANA), the divisive method is the opposite of agglomerative clustering [19]. All observations start in the same cluster and are iteratively split into multiple clusters while moving down the hierarchy.

4.3 Neural Networks

4.3.1 Multi-Layer Perceptron

The model architecture of a basic feed forward neural network can be customized to fulfill different functions other than forecasting, such as classification and dimensionality reduction. By customizing the number of nodes in the input layer and the number of neurons in the output layer, the MLP is able to conduct both univariate and multivariate forecasts.

In the univariate forecast, the number of input nodes corresponds to the number of lookback points, while the number of output neurons corresponds to the number of lookahead points. For the multivariate forecast, all the lookback points for each parameter are input into the model. This allows the model to learn the complex relationships between parameters and between lookback points. The output layer can be configured to produce either a single-step forecast for each parameter or a single-step forecast for one parameter.
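The lookback/lookahead windowing described above can be sketched as a simple slicing routine that turns one series into supervised (input, target) pairs (pure Python; the function and variable names are our own):

```python
def make_windows(series, lookback, lookahead=1):
    """Build (input, target) pairs for forecasting: each input is
    `lookback` consecutive points, each target the next `lookahead`
    points, exactly as fed to the univariate MLP."""
    X, y = [], []
    for t in range(len(series) - lookback - lookahead + 1):
        X.append(series[t:t + lookback])
        y.append(series[t + lookback:t + lookback + lookahead])
    return X, y

X, y = make_windows([1, 2, 3, 4, 5], lookback=3)
# X == [[1, 2, 3], [2, 3, 4]], y == [[4], [5]]
```

For the multivariate case, the same windowing is applied per parameter and the windows are concatenated into one input vector.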

4.3.2 Long Short-Term Memory

Even though each time step produces an output in an LSTM model, only the output of the current time step is used as the one step forecast result. As both the input and output are in the form of vectors, it is possible to use LSTM for both univariate and multivariate forecasting. The complex relationships between the multiple parameters are learned by the hierarchical learning process through multiple stacked LSTM layers.

4.4 Anomaly Detection Pipeline

We plan to use the anomaly detection techniques previously discussed and develop techniques for automated, real-time anomaly detection. In order to develop an auto- mated, unsupervised anomaly detection process, we would like to combine all three of the previous techniques into one pipeline (see Figure 4-3).

Figure 4-3: Automated Anomaly Detection Model

First, a reference cycle is developed from non-anomalous data to compare future data against (see Figure 4-4). With this, cluster analysis can be conducted on data from new wafer cycles to separate normal from anomalous cycles. Figure 4-5 shows clustering using the k-means algorithm on data from the plasma etcher using the cluster toolbox that was implemented in R in [14]. Here, the reference cycle is a member of cluster 3 (blue). This means that all data points within cluster 3 are considered non-anomalous cycles, while all other data points are considered to be anomalous.

Now, with the results from cluster analysis, a new dataset can be made from cycles that were identified as non-anomalous by the cluster analysis.

Figure 4-4: Reference Cycle Example

Figure 4-5: Clustering Plasma Etcher Data

This group of non-anomalous cycles can then be used for supervised training of a time series forecasting model. The advantage of combining all of these methods is that it enables an unsupervised model, with automated data labeling (or much improved and simplified expert data labeling) conducted through cluster analysis.
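This labeling step can be sketched as follows (pure Python; the cycle identifiers and cluster assignments are hypothetical, standing in for the output of the clustering stage):

```python
def label_by_cluster(cycle_ids, assignments, reference_id):
    """Mark cycles sharing the reference cycle's cluster as normal;
    everything else is treated as anomalous."""
    ref_cluster = assignments[reference_id]
    return {cid: ("normal" if assignments[cid] == ref_cluster else "anomalous")
            for cid in cycle_ids}

# Hypothetical assignments: the reference cycle sits in cluster 3.
assignments = {"ref": 3, "c1": 3, "c2": 1, "c3": 3, "c4": 2}
labels = label_by_cluster(["c1", "c2", "c3", "c4"], assignments, "ref")

# Cycles labeled normal become the forecasting model's training set.
train_set = [cid for cid, lab in labels.items() if lab == "normal"]
# train_set == ["c1", "c3"]
```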

Chapter 5

Automated Anomaly Detection Experiments

This chapter explains the specific experimental setup and results of the automated anomaly detection pipeline explained previously. First, the setup of each module of the model is discussed, and then the results of running this model on the plasma etcher data are presented.

5.1 Reference Cycle

As explained in Chapter 4, the first step in the automated anomaly detection pipeline is building a reference cycle. For our experiments, we build the reference cycle from 30 wafer cycles that have been previously labeled as "good", i.e., normal operating conditions. This number was chosen as a reasonable number of cycles that could be hand labelled by an engineer, as opposed to hundreds of cycles. Having this many cycles also allows the reference cycle to account for some of the drift present in the data.

We chose to use the DTW Barycenter Averaging (DBA) method for this thesis, as [7] found this method to have the best combination of accuracy and calculation time. Some optimization was done on DBA to further improve its performance; however, most of the parameters for building the reference cycle were taken from the experimental results in [7].

5.2 Clustering Analysis

Once a reference cycle is built from the hand-labelled cycles, we move on to the second step in our model, the cluster analysis. As previously explained, the purpose of this step is to categorize unlabeled cycles as either anomalous or normal, depending on whether they fall into the same cluster as the reference cycle.

Before clustering, the data are pre-processed into a specific data-frame structure, shown in Figure 5-1, as in [14]. Each row of the data-frame represents a wafer cycle and each column represents a certain parameter, so each $(i, j)$ entry holds either 300 or 600 values for a given cycle and parameter.

Using the labels for the data, we can evaluate the performance of each of these clustering methods in order to determine which one we should use in our full pipeline. For our partitioning cluster methods, we set the number of clusters to 3. This value accounts for the varying degrees of anomaly in the dataset, as cycles are observed to be either normal, somewhat anomalous, or highly anomalous. Figures 5-2 through 5-6 depict the five clustering algorithms we have discussed, applied to various sizes of the recipe 920 dataset. PCA is applied to visualize the cluster results for the partitioning methods, where the axes are the two main principal components.

For all of the partitioning cluster methods, the normal cycles were well separated from the anomalous and highly anomalous cycles. Additionally, the hierarchical methods do a good job of separating the normal from the anomalous data. In order to determine the best clustering method to use moving forward, we ran all five of the clustering algorithms on a set of 300 wafer cycles from recipe 920 and measured their accuracy at separating anomalous cycles from normal ones.

Figure 5-1: Data format [14]

The goal of the clustering is to get high similarity within each group and low similarity between groups; consequently, the within-cluster variance needs to be low while the between-cluster variance needs to be high. For this we use cluster validation to determine the most stable clustering method. Connectivity measures the extent to which elements are placed in the same cluster as their nearest neighbors in the data set, and should be minimized. The Dunn index takes the ratio $d_{min}/d_{max}$ of the minimum inter-cluster distance to the maximum intra-cluster distance, with a large index representing a good clustering. The Silhouette coefficient estimates how well each point lies within its cluster and produces a value between -1 and 1, with values near 1 representing points that are well clustered.
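As an illustration, the Dunn index can be computed as follows (a simplified pure-Python sketch for 1-D points; this is not the validation implementation used to produce Table 5.1):

```python
def dunn_index(clusters):
    """Dunn index: minimum inter-cluster distance divided by maximum
    intra-cluster distance (1-D points; larger is better)."""
    inter = min(abs(a - b)
                for i in range(len(clusters))
                for j in range(i + 1, len(clusters))
                for a in clusters[i] for b in clusters[j])
    intra = max(abs(a - b) for c in clusters for a in c for b in c)
    return inter / intra

# Well-separated, compact clusters give a large index.
d = dunn_index([[0, 1], [10, 11]])   # 9.0
```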

Figure 5-2: K-Means method applied to 150 wafer cycles

Figure 5-3: K-Medoids method applied to 70 wafer cycles

Figure 5-4: CLARA method applied to 150 wafer cycles

Figure 5-5: Agglomerative clustering method applied to 20 wafer cycles

Figure 5-6: Divisive clustering method applied to 20 wafer cycles

Table 5.1: Cluster Validation Results

Method         Connectivity  Dunn    Silhouette
k-means        2.9290        1.0797  0.8008
k-medoids      71.2230       1.0219  0.3156
CLARA          4.2099        1.4518  0.7505
Agglomerative  2.9290        1.0797  0.8008
Divisive       2.9290        1.0797  0.8008

Table 5.1 shows that k-means, agglomerative and divisive produce the most stable clusters. However, because k-means has a shorter computation time, we use this method for the rest of our pipeline.

5.3 Anomaly Detection

In addition to identifying anomalous cycles, we would like to be able to identify whether or not each time-step during a cycle is anomalous, using time-series forecasting. We can use cluster analysis to perform unsupervised labelling of cycles in order to enable supervised neural network training. This begins with building a reference cycle from 30 cycles. After this, we perform clustering on a new dataset of 300 cycles where the labels are unknown. After clustering the cycles, we identify the cycles that were grouped into the same cluster as the reference cycle. We take the cycles from this cluster and create a training dataset for our neural network models. By exposing our models to only good data, the models will learn to accurately forecast normal data, and the residual values should be low. However, when a model tries to forecast anomalous data, the residual values should be high, allowing it to flag anomalies. The exact metrics are discussed in the following sections.

5.3.1 Training the Models

For both the MLP and LSTM model, we consider both a multivariate and univariate approach, similar to Chen [5]. For the multivariate (single model) approach, one model is used that takes as input all 25 parameters for each time-step and predicts 25 values at each time-step. The univariate approach uses a separate model for each parameter; thus we will have 25 total models using this approach.

MLP

For the Multi-Layer Perceptron (MLP) model, we use the previous 8 time-steps in a given cycle and perform one-step ahead prediction. We train the model for 30 epochs using the ADAM optimizer and a mean square error loss function. The MLP model takes around 10 minutes to train. Figure 5-7 shows the one step ahead prediction results for the MLP model for one cycle of parameter 19 in recipe 920. Here, the red dashed line represents the predicted values while the black line represents the actual values. This specific cycle is normal, and the MLP model is able to accurately perform one-step ahead prediction.

Figure 5-7: MLP forecast one cycle (parameter 19)

LSTM

For the Long Short-Term Memory (LSTM) model, we use the previous 18 time-steps in a given cycle and perform one-step ahead prediction. Like the MLP model, we train the LSTM model for 30 epochs using the ADAM optimizer and a mean square error loss function. The LSTM model takes much longer to train compared to the MLP model, at around 50 minutes. Figure 5-8 shows the one step ahead prediction results for the LSTM model for one cycle of parameter 19 in recipe 920. Like the MLP model, the LSTM is able to accurately perform one-step ahead prediction.

5.3.2 Identification of Anomalous Points

Now that we have verified that both the MLP and LSTM models are able to accurately perform one-step ahead prediction for normal data, we can move on to testing these models on detecting anomalous time-steps. The labels for our test dataset are generated by a script written in [5].

Figure 5-8: LSTM forecast one cycle (parameter 19)

For each of the three parameters, a specific threshold is used to determine if a given time-step is anomalous. This is only useful for this specific dataset, as the anomalies are very distinct and thus a simple threshold can label individual anomalous time-steps. Table 5.2 shows the number of anomalous points identified in recipe 920 versus the total number of time-steps. We can see that for this recipe, the majority of points are considered anomalous. The number of anomalous points for recipe 945 is comparable to recipe 920.

Table 5.2: Anomalous Time-steps in Recipe 920 Parameters

Parameter  Number of Anomalous Time-steps  Total Time-steps
5          28904                           43224
17         31097                           43224
19         28647                           43224

Residual Distribution

With one-step forecasting, anomalies can be detected through the distribution of residuals. We use the residual differences between the actual and predicted values from forecasting normal cycles in order to determine this distribution. With this probability distribution, we can run our model on new, potentially anomalous, data. At each time-step, the probability of the residual value occurring can be calculated. We then take the log of this probability and compare it to a pre-determined threshold, which for these experiments is two standard deviations from the mean. If the log probability value for a given residual falls below this threshold, our model flags this point as anomalous. [5] assumed that the residual values followed a Gaussian distribution and used Maximum Likelihood Estimation to determine the parameters of this distribution. For initial testing of our model, we want to compare our model's performance to that of Chen's fully supervised model, and thus use the same Gaussian distribution assumption for anomaly detection. Later, we explore alternate thresholds for better modeling the distribution of the residuals.
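Under the Gaussian assumption, the flagging rule described above can be sketched as follows (pure Python; we interpret the two-standard-deviation threshold as the log-density at two standard deviations from the mean, which is one possible reading, and the residual values are invented):

```python
import math

def gaussian_logpdf(x, mu, sigma):
    """Log of the normal density."""
    return (-0.5 * math.log(2 * math.pi * sigma ** 2)
            - (x - mu) ** 2 / (2 * sigma ** 2))

def flag_anomalies(residuals, train_residuals, n_sigma=2):
    """MLE-fit a Gaussian to residuals from normal cycles, then flag
    time-steps whose residual log-probability falls below the
    log-density at n_sigma standard deviations from the mean."""
    n = len(train_residuals)
    mu = sum(train_residuals) / n
    sigma = math.sqrt(sum((r - mu) ** 2 for r in train_residuals) / n)
    threshold = gaussian_logpdf(mu + n_sigma * sigma, mu, sigma)
    return [gaussian_logpdf(r, mu, sigma) < threshold for r in residuals]

flags = flag_anomalies([0.0, 3.0], [-1.0, -1.0, 1.0, 1.0])
# flags == [False, True]: only the large residual is flagged
```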

In order to accurately evaluate the model in its anomaly detection capabilities, it is crucial to use quantitative metrics to provide a proper comparison. While qualitative comparison through visualization is appropriate to see how the forecast deviates from the actual data, it does not provide a consistent and reliable method for anomaly detection. Hence, we use the precision and recall scores to evaluate our models.
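For per-time-step anomaly flags, these scores reduce to simple counts over true/false positives and negatives (a pure-Python sketch; the flag vectors are made up):

```python
def precision_recall(predicted, actual):
    """Precision and recall for per-time-step anomaly flags (booleans)."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

p, r = precision_recall([True, True, False, False],
                        [True, False, True, False])
# p == 0.5 (one of two flags correct), r == 0.5 (one of two anomalies found)
```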

Recipe 920

Table 5.3 shows the results of running our automated anomaly detection model on the recipe 920 dataset with four different time-series forecasting methods: Univariate (25 total models) MLP and LSTM, and Single Model (one model for all 25 parameters) MLP and LSTM. The univariate approach far surpasses the single model approach. This is expected, as a model trained on only one parameter will make better predictions for that parameter than a model trained on all 25 parameters. The LSTM-Uni model had the best overall performance; however, the MLP-Uni model had comparable performance. The plots in Figures 5-9 through 5-12 show each of the four models' forecasts for four cycles of parameter 5 and the resulting log-probability of the residual values at each time step. In the residual plots, the black dashed line represents the threshold; points below this threshold are flagged as anomalies.

Table 5.3: Anomaly Detection Results for Recipe 920 Parameters

             Parameter 5        Parameter 17       Parameter 19
Model        Precision  Recall  Precision  Recall  Precision  Recall
LSTM-Uni     .920       .871    .828       .449    .806       .927
LSTM-Single  .645       .164    .245       .041    .607       .421
MLP-Uni      .904       .191    .864       .164    .791       .813
MLP-Single   .712       .182    .298       .046    .672       .337

Figure 5-9: MLP-single model: (a) forecast, (b) residual

Figure 5-10: MLP-uni model: (a) forecast, (b) residual

Figure 5-11: LSTM-single model: (a) forecast, (b) residual

Figure 5-12: LSTM-uni model: (a) forecast, (b) residual

Recipe 945

Table 5.4 shows the results of running our automated anomaly detection model on the recipe 945 dataset with four different time-series forecasting methods: Univariate MLP and LSTM, and Single Model MLP and LSTM. Similar to recipe 920, the univariate approach shows the best performance. The similarity in results between recipes 920 and 945 is expected, as the only difference between the two is the duration of a cycle. Again, the LSTM-Uni and MLP-Uni models had comparable performance.

Downsampling

Table 5.5 shows the results of running our automated anomaly detection model on the recipe 920 dataset that was downsampled by a factor of 2. This process significantly reduced the training times of all the models by around 30% without a substantial decrease in performance.

Table 5.4: Anomaly Detection Results for Recipe 945 Parameters

             Parameter 5        Parameter 17       Parameter 19
Model        Precision  Recall  Precision  Recall  Precision  Recall
LSTM-Uni     .934       .812    .806       .424    .820       .899
LSTM-Single  .652       .168    .278       .059    .641       .446
MLP-Uni      .881       .213    .853       .187    .795       .835
MLP-Single   .692       .156    .357       .048    .512       .303

Table 5.5: Anomalous Time-steps in Recipe 920 Downsampled Parameters

             Parameter 5        Parameter 17       Parameter 19
Model        Precision  Recall  Precision  Recall  Precision  Recall
LSTM-Uni     .872       .604    .782       .398    .798       .649
LSTM-Single  .553       .124    .264       .052    .597       .495
MLP-Uni      .757       .188    .795       .139    .727       .663
MLP-Single   .588       .141    .291       .036    .484       .295

Train 920, Test 945

Table 5.6 shows the results of training entirely on recipe 920 and then performing anomaly detection on recipe 945. For this experiment, only the LSTM-Uni and MLP-Uni models were used, as these had shown far superior accuracy in the previous experiments. Overall, the models were able to fairly accurately identify anomalies for a recipe that they had yet to see. Figure 5-13 shows the LSTM-Uni model’s forecast and residual distribution plots for three cycles of parameter 19 of recipe 945. Although the models were tested on an unseen recipe, the similarity between 920 and 945 may explain why the performance of the models declined only slightly.

Table 5.6: Anomalous Time-steps Train on 920, Test on 945

             Parameter 5        Parameter 17       Parameter 19
Model        Precision  Recall  Precision  Recall  Precision  Recall
LSTM-Uni     .856       .442    .701       .471    .826       .536
MLP-Uni      .839       .203    .738       .114    .697       .490

Figure 5-13: Forecasting and residual plots of LSTM model trained on recipe 920 and tested on recipe 945 (parameter 19)

5.4 Distribution Validation

As mentioned previously, in [5] the residual distribution was assumed to be Gaussian. Figures 5-14, 5-15, and 5-16 show the normal Q-Q plots and histograms of residual values for each of the three parameters we are investigating. The histograms show some resemblance to a normal distribution; however, parameter 19 shows an almost bimodal shape. Looking at the Q-Q plots for the three parameters, we can see that the middle of each distribution follows a normal distribution; at the tails, however, the residual values diverge from normality. The danger in assuming that the residual distributions are Gaussian is that when we perform Maximum Likelihood Estimation, the resulting distribution's parameters could be too wide. This is illustrated by parameter 19's residual histogram: if we perform MLE, the fitted Gaussian will absorb the values past 0.1 into its estimated parameters. Ultimately, this leads to values far from the mean of the distribution being considered normal, resulting in a high number of false negatives, values assumed to be normal that are, in fact, anomalous.

Figure 5-14: Normal Q-Q plot and histogram of residuals for parameter 5

5.4.1 Empirical Probability Density Function

In order to improve the estimation of the distribution, we consider using the Empirical Probability Density Function, which is a non-parametric density estimator. The Empirical PDF uses Kernel Density Estimation (KDE) to estimate the PDF of a random variable. This takes the form of:

$$\hat{f}_h(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - x_i) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right) \qquad (5.1)$$

Here, $K$ is the kernel, a non-negative function, and $h > 0$ is a smoothing parameter called the bandwidth. To implement this in our model, we used the demp function from the R package EnvStats. This function uses kernel density estimation to produce an empirical PDF of a given sample; for a given point $x$, the function then uses linear interpolation to estimate the density.

Figure 5-15: Normal Q-Q plot and histogram of residuals for parameter 17

Figure 5-16: Normal Q-Q plot and histogram of residuals for parameter 19

Figure 5-17: Empirical PDF frequency plots for (a) an anomalous cycle and (b) a normal cycle (parameter 5, recipe 920)

By using the empirical PDF, we can model the non-Gaussian distribution of the residual values for a given parameter. Figure 5-17 compares the density values for an anomalous cycle and a normal cycle. The empirical PDF is calculated from the residual values of the LSTM model; thus we expect other normal cycles to generally have high probability values when their density is calculated. Figure 5-17 confirms this: the majority of residual values from the normal cycle have density values near 1, whereas for the anomalous cycle, the majority of the residual values have very low probabilities.
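Equation 5.1 with a Gaussian kernel can be sketched directly (pure Python; a simplified stand-in for EnvStats' demp, which additionally interpolates a tabulated density, and the sample residuals here are invented):

```python
import math

def empirical_pdf(sample, h):
    """Kernel density estimate of a residual distribution (Eq. 5.1)
    with a Gaussian kernel K and bandwidth h."""
    n = len(sample)
    def f_hat(x):
        # standard normal kernel K(u)
        k = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
        return sum(k((x - xi) / h) for xi in sample) / (n * h)
    return f_hat

f = empirical_pdf([0.0, 0.1, -0.1, 0.05], h=0.1)
# density is high near the bulk of the sample and vanishingly small far away,
# so anomalous residuals receive very low probability, as in Figure 5-17
```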

5.4.2 Empirical PDF Experimental Results

We trained our automated anomaly detection model the same as before; however, we now substitute the empirical PDF for the Gaussian distribution. Using this method, we evaluate our LSTM-Uni and MLP-Uni models' performance at detecting anomalies for recipe 920, and compare the results to those obtained when using the Gaussian distribution assumption. Table 5.7 shows the performance of our model using an empirical PDF compared to using a Gaussian distribution function. We can see that by using an empirical PDF to model the residual distributions, we obtain overall better performance at anomaly detection.

Table 5.7: Performance comparison of empirical PDF vs. Gaussian for anomaly detection

                       Parameter 5        Parameter 17       Parameter 19
Empirical PDF
Model                  Precision  Recall  Precision  Recall  Precision  Recall
LSTM-Uni               .985       .892    .875       .470    .791       .971
MLP-Uni                .977       .343    .867       .381    .764       .935
Gaussian Distribution
Model                  Precision  Recall  Precision  Recall  Precision  Recall
LSTM-Uni               .920       .871    .828       .449    .806       .927
MLP-Uni                .904       .191    .864       .164    .791       .813

Chapter 6

Further Experiments

This chapter details additional experiments that were conducted. The first section covers the idea of deviation scores, and the second shows experiments on additional datasets.

6.1 Deviation Scores

Thus far we have used clustering to identify anomalous cycles and time-series forecasting to identify anomalous time-steps in a given cycle. We now want to bridge the gap between these two methods. Oftentimes in manufacturing, we want to know the trend of a machine over time; that is, whether the machine is trending in an anomalous direction. We want to introduce the idea of a deviation score, a method for quantifying how well a machine is performing over time. Essentially, we want to see if, over time, the distribution of residual values is becoming more and more anomalous.

One way to do this is to keep a running sum of residual values for a cycle over time and compare them to what the sums for a normal cycle would look like. Figure 6-1 shows the running sum of residual values for normal (green) and anomalous (red) cycles for parameters 5 and 19. As expected, the cumulative sum of residuals for the normal cycle is much less than that of the anomalous cycle, showing that this method could be useful for showing the trend of a cycle over time.
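The running-sum deviation score can be sketched in a few lines (pure Python; the residual values are invented):

```python
from itertools import accumulate

def deviation_curve(residuals):
    """Running sum of absolute residuals over a cycle: a flat, low curve
    suggests normal behavior, a steeply climbing one an anomalous trend."""
    return list(accumulate(abs(r) for r in residuals))

normal = deviation_curve([0.1, 0.0, 0.1, 0.1])      # stays low
anomalous = deviation_curve([0.9, 1.2, 1.5, 2.0])   # climbs quickly
```

Comparing the final values (or the curves themselves) gives a single number summarizing how far a cycle drifted from normal behavior.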


Figure 6-1: Sum of residual values for normal (green) and anomalous (red) cycle for one cycle of parameter 5 and parameter 19 (Recipe 920)

Another method is to use the residual distribution probabilities. Here, a probability of 1 indicates that a time-step is normal, while probabilities toward 0 indicate an anomalous time-step. At each time-step, we take the log of the probability computed from the empirical PDF function and add it to a cumulative sum. Figure 6-2 shows the deviation plots over one cycle for both a normal (green) and an anomalous (red) cycle. The blue line in the plots corresponds to the ideal line, on which every probability value is 1, i.e., log(p) = 0. We can see from both of these plots that the normal cycle is much closer to ideal than the anomalous one, though it is not perfect. These results indicate that the residual probability values could be a good indicator of how well a machine is performing over time.

One thing to note about both of the deviation methods discussed is that the normal and anomalous cycles are very distinct from one another. Thus, it makes sense that the trend for the two cycles would be very different. It would be interesting to see how well these deviation metrics would work for anomalous cycles that are less visually obvious.


Figure 6-2: Deviation plots for ideal (blue), normal (green) and anomalous (red) cycles for parameters 5 and 19 (Recipe 920)

6.2 Across Different Machines

In addition to the original plasma etcher data, we were able to use data from other plasma etcher machines and recipes. The key issue is that we do not have labels for this new data, and thus cannot evaluate our anomaly detection methods on it. The new data that we are looking at is from the OXLR2_LAMAL1 plasma etcher machine. Unfortunately, the data from this plasma etcher does not have the exact same parameters as the data that we had before, though it does share a few similarities. Notably, both plasma etchers record values for ProcChm_EndPT_ChanC_In (Parameter 19). So we can look at this data using recipe 920 from the original plasma etcher and recipe 120 from the new plasma etcher. Figure 6-3 shows one cycle of parameter 19 plotted for recipe 920 (blue) and recipe 120 (green). We can see that compared to recipe 920, recipe 120 has a much shorter duration (125 vs. 600 time-steps) and higher maximum values (10,000 vs. 7,500).

Figure 6-3: Recipe 920 (Blue) and Recipe 120 (Green), one cycle, parameter 19

6.2.1 Clustering different recipes

Clustering the cycles from recipe 120 along with recipe 920 did not provide much insight, as the values for parameter 19 are so distinct between the two recipes that they will always be in different clusters. This can be seen in Figure 6-4, with all of the recipe 920 cycles in cluster 1 and all of the recipe 120 cycles in cluster 2. Thus, if compared to recipe 920, the cycles from recipe 120 would be considered anomalous. Given the differences, this is what we would expect, and our results indicate clustering can separate different recipes from one another. However, in order to identify anomalous cycles, only one recipe can be used at a time.

Figure 6-4: K-Means clustering for 30 cycles of recipe 920 and 10 cycles of recipe 120 (parameter 19)

6.2.2 Time-series forecasting across recipes

We would like to examine how well our models can learn to forecast the new recipe 120 data. Specifically, we would like to see how well our model that is trained on recipe 920 from the original plasma etcher would be able to perform one-step ahead prediction on the new data. This is similar to when we trained our model on recipe 920 and tested it on recipe 945; however, now the recipes are more distinct, thus we would expect to see many anomalies being detected in recipe 120.

As can be seen in Figure 6-5, our model trained on recipe 920 is unable to accurately forecast cycles from recipe 120 whatsoever. This is due to the large difference in cycle length between the two recipes. These results indicate that it is difficult for a simple model to predict across distinct recipes. Training the model on multiple different recipes might be possible; alternatively, augmenting our model architecture could help produce better forecasts.

Figure 6-5: LSTM trained on recipe 920, forecasting recipe 120 (parameter 19). The black line represents the actual values and the green line is the predicted values

Chapter 7

Future Work

7.1 Experiment Recommendations

Unfortunately, we did not have access to the additional data from other plasma etcher machines with enough time to fully experiment with it. Therefore, a future step would be exploring the pipeline presented in this thesis with this data. There are dozens of other recipes and at least six different plasma etcher units that ADI has data for. Additionally, semiconductor manufacturing processes (and manufacturing processes in general) consist of a number of different machines. It would be interesting to see how well this model transfers to other manufacturing machines.

In terms of quantity of data, we trained our models on a relatively small amount of data. In the literature, neural network models are trained on datasets containing potentially millions of data points. We believe that with a larger dataset, our model could achieve higher performance. As seen in our experiments, the univariate approach (one model for each parameter) always outperformed the single model approach. This makes sense, as our evaluation metric was based on performance for individual parameters. However, this is likely not always the case in manufacturing, and we should consider the combination of all parameters together when detecting anomalous behavior.

In the dataset used for this thesis, the anomalies were a factor of just three parameters, which all showed highly anomalous behavior. But there may exist situations in which slightly abnormal behavior in many different parameters results in a process becoming anomalous, which our current model does not account for. For data with less visually distinct anomalies, performing feature extraction and analyzing those features, rather than the raw time-series data itself, may be more successful.

7.2 Predicting Machine Failures

Related to anomaly detection is the problem of predicting when a given piece of manufacturing equipment will become anomalous or fail. In this thesis we only considered one-step ahead prediction, but the same method could be applied to multi-step ahead prediction. In that case, a model could be developed to predict an anomalous event, such as the unconfined plasma excursion analyzed in this thesis, hours or even days ahead of time. If a dataset were assembled from many anomalous events, a model could potentially be developed that recognizes the trends leading up to such events.
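One way to obtain multi-step forecasts from an existing one-step-ahead model is the recursive strategy (in the sense of Taieb and Hyndman [22]): feed each prediction back into the input window. The sketch below uses a hypothetical stand-in model (the window mean) rather than a trained LSTM.

```python
import numpy as np

def recursive_forecast(one_step_model, history, horizon):
    """Roll a one-step-ahead predictor forward `horizon` steps by
    feeding each prediction back into the input window."""
    window = list(history)
    preds = []
    for _ in range(horizon):
        nxt = one_step_model(np.array(window))
        preds.append(nxt)
        window = window[1:] + [nxt]  # slide the window forward by one step
    return np.array(preds)

# Hypothetical stand-in for a trained model: predict the window mean.
mean_model = lambda w: float(w.mean())
forecast = recursive_forecast(mean_model, [1.0, 2.0, 3.0], horizon=4)
```

The trade-off is that prediction errors compound as the horizon grows, which is why direct multi-step models are a common alternative.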

Survival analysis is a field of statistics and machine learning whose goal is to analyze and model data where the outcome is the time until an event of interest occurs. Models beyond neural networks, such as Bayesian methods and random forests, have shown promise in survival analysis, and it would be interesting to develop such models for the semiconductor manufacturing process. However, we would need a much larger dataset of failure events in order to properly train models for survival analysis.
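As a small, self-contained illustration of the survival-analysis framing (not a method used in this thesis), the classical Kaplan–Meier estimator computes the probability that a machine survives past each observed failure time, while correctly handling censored runs that ended before any failure occurred.

```python
import numpy as np

def kaplan_meier(durations, observed):
    """Kaplan-Meier estimate of the survival function S(t).
    `durations` are times until failure or end of observation;
    `observed` is 1 for an actual failure, 0 for a censored run
    (machine still healthy when observation ended)."""
    durations = np.asarray(durations, dtype=float)
    observed = np.asarray(observed)
    times, surv = [], []
    s = 1.0
    for t in np.unique(durations):
        d = np.sum((durations == t) & (observed == 1))  # failures at time t
        n = np.sum(durations >= t)                      # units still at risk
        if d > 0:
            s *= 1.0 - d / n
            times.append(t)
            surv.append(s)
    return np.array(times), np.array(surv)

# Toy example: five machines, two of which were still running (censored).
times, surv = kaplan_meier([2, 3, 3, 5, 7], [1, 1, 0, 1, 0])
```

Richer models (Cox regression, random survival forests) additionally condition these survival curves on machine covariates, which is what would make the approach useful for prediction.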

Chapter 8

Conclusion

In this thesis, we have proposed a pipeline for automated anomaly detection in semiconductor manufacturing. Our pipeline consists of three parts: developing a reference cycle that represents the average time series of good, non-anomalous wafer cycles; performing cluster analysis on unlabelled wafer cycles to separate normal from anomalous cycles; and finally training neural network models to detect anomalous time steps within a single cycle. Through our experiments, we have shown that our pipeline can effectively detect anomalous time steps when given only a small amount of labelled data to begin with. This mostly unsupervised approach allows our model to rely on only a small amount of domain knowledge. Our model has applications not just in semiconductor manufacturing but in manufacturing as a whole, and we hope to apply our method to a variety of domains. There are many extensions of this work that we hope will be explored in future research.
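The three stages above can be sketched end to end in a deliberately simplified form. Note the stand-ins: a plain elementwise average replaces DTW-aligned reference building, and a distance-to-reference outlier rule replaces both the clustering and the LSTM stages; all names and thresholds are illustrative only.

```python
import numpy as np

def build_reference(good_cycles):
    """Stage 1: reference cycle as the elementwise average of known-good
    cycles (equal-length here; the thesis aligns cycles before averaging)."""
    return np.mean(good_cycles, axis=0)

def split_cycles(cycles, reference, k=2.0):
    """Stage 2 stand-in for clustering: treat a cycle as anomalous when its
    distance to the reference is an outlier among all cycles."""
    d = np.linalg.norm(cycles - reference, axis=(1, 2))
    thresh = d.mean() + k * d.std()
    return d <= thresh  # True = treated as normal

def flag_timesteps(cycle, reference, tol):
    """Stage 3 stand-in: flag time steps whose worst per-parameter
    deviation from the reference exceeds a tolerance."""
    return np.abs(cycle - reference).max(axis=1) > tol

# Toy run: 10 good cycles (50 steps, 2 parameters) plus one injected anomaly.
rng = np.random.default_rng(2)
good = rng.normal(0.0, 0.05, size=(10, 50, 2))
bad = good[0].copy()
bad[20:25] += 5.0  # large excursion over five time steps

ref = build_reference(good)
all_cycles = np.concatenate([good, bad[None]], axis=0)
normal_mask = split_cycles(all_cycles, ref)
flags = flag_timesteps(bad, ref, tol=1.0)
```

Even in this stripped-down form, the structure is the same: the reference defines "normal," the cycle-level split needs no labels, and the final stage localizes the anomaly within a cycle.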

Bibliography

[1] Majid S. alDosari. Unsupervised anomaly detection in sequences using long short term memory recurrent neural networks. Master’s thesis, George Mason University, 2016.

[2] Preeti Arora, Deepali, and Shipra Varshney. Analysis of k-means and k-medoids algorithm for big data. Procedia Computer Science, 78:507–512, Dec 2016.

[3] Jan Paul Assendorp. Deep learning for anomaly detection in multivariate time series data. Master’s thesis, Hamburg University, 2017.

[4] Donald J. Berndt and James Clifford. Using dynamic time warping to find patterns in time series. In KDD Workshop, volume 10, pages 359–370. Seattle, WA, 1994.

[5] Tiankai Chen. Anomaly detection in semiconductor manufacturing through time series forecasting using neural networks. Master’s thesis, MIT, 2018.

[6] Nico Görnitz, Marius Kloft, Konrad Rieck, and Ulf Brefeld. Toward supervised anomaly detection. Journal of Artificial Intelligence Research, 46:235–262, Feb 2013.

[7] Han He. Applications of reference cycle building and k-shape clustering for anomaly detection in the semiconductor manufacturing process. Master’s thesis, MIT, 2018.

[8] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, Nov 1997.

[9] Vandana P. Janeja and Revathi Palanisamy. Multi-domain anomaly detection in spatial datasets. Knowledge and information systems, 36(3):749–788, Sep 2013.

[10] Nguyen Huy Kha and Duong Tuan Anh. From cluster-based outlier detection to time series discord discovery. Revised Selected Papers of the PAKDD 2015 Workshops on Trends and Applications in Knowledge Discovery and Data Mining, 2015.

[11] Will Koehrsen. Stock prediction in python. Towards Data Science, Jan 2018.

[12] J.B. Kruskal and Mark Liberman. The symmetric time-warping problem: From continuous to discrete. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, Jan 1983.

[13] Aristidis Likas, Nikos Vlassis, and Jakob J. Verbeek. The global k-means clustering algorithm. Pattern Recognition, 36(2):451–461, 2003.

[14] Ouiaima Maklouk. Time series data analytics: Clustering-based anomaly detection techniques for quality control in semiconductor manufacturing. Master’s thesis, MIT, 2018.

[15] Daniel Müllner. Modern hierarchical, agglomerative clustering algorithms. arXiv preprint arXiv:1109.2378, 2011.

[16] Vit Niennattrakul, Dararat Srisai, and Chotirat Ann Ratanamahatana. Shape-based template matching for time series data. Knowledge-Based Systems, 26:1–8, 2012.

[17] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[18] François Petitjean, Alain Ketterlin, and Pierre Gançarski. A global averaging method for dynamic time warping, with applications to clustering. Pattern Recognition, 44(3):678–693, 2011.

[19] Maurice Roux. A comparative study of divisive and agglomerative hierarchical clustering algorithms. Journal of Classification, 35(2):345–366, 2018.

[20] Akash Singh. Anomaly detection for temporal data using long short-term mem- ory (lstm). Master’s thesis, KTH Royal Institute of Technology, 2017.

[21] Richard G. Stafford and Jacob Beutel. Application of neural networks as an aid in medical diagnosis and general anomaly detection, 1994. US Patent 5,331,550.

[22] Souhaib Ben Taieb and Rob J. Hyndman. Recursive and direct multi-step forecasting: the best of both worlds. 2012.

[23] S. Vijayarani and P. Jothi. An efficient clustering algorithm for outlier detection in data streams. International Journal of Computer Applications, 32:3657–3665, 2011.
