
Journal of Manufacturing and Materials Processing

Article

Pattern Recognition in Multivariate Time Series: Towards an Automated Event Detection Method for Smart Manufacturing Systems

Vadim Kapp 1,2, Marvin Carl May 1, Gisela Lanza 1 and Thorsten Wuest 2,*
1 wbk Institute of Production Science, Karlsruhe Institute of Technology, Kaiserstr. 12, 76131 Karlsruhe, Germany; [email protected] (V.K.); [email protected] (M.C.M.); [email protected] (G.L.)
2 Benjamin M. Statler College of Engineering and Mineral Resources, West Virginia University, Morgantown, WV 26506, USA
* Correspondence: [email protected]; Tel.: +1-304-293-9439

Received: 4 August 2020; Accepted: 2 September 2020; Published: 5 September 2020

J. Manuf. Mater. Process. 2020, 4, 88; doi:10.3390/jmmp4030088; www.mdpi.com/journal/jmmp

Abstract: This paper presents a framework to utilize multivariate time series data to automatically identify reoccurring events, e.g., resembling failure patterns in real-world manufacturing data, by combining selected data mining techniques. The use case revolves around the auxiliary polymer manufacturing process of drying and feeding plastic granulate to extrusion or injection molding machines. The overall framework presented in this paper includes a comparison of two different approaches towards the identification of unique patterns in a real-world industrial data set. The first approach uses a sequential heuristic segmentation and clustering approach; the second branch features a collaborative method with a built-in time dependency structure at its core (TICC). Both alternatives have been facilitated by a standard principal component analysis (PCA) (feature fusion) and a hyperparameter optimization (TPE) approach. The performance of the corresponding approaches was evaluated through established and commonly accepted metrics in the field of (unsupervised) machine learning. The results suggest the identification of several common failure sources (patterns) for the machine. Insights such as these automatically detected events can be harnessed to develop an advanced monitoring method to predict upcoming failures, ultimately reducing unplanned machine downtime in the future.

Keywords: smart manufacturing; Industry 4.0; polymer processing; polymer manufacturing; smart maintenance; unsupervised learning; segmentation; clustering; time series analysis

1. Introduction

The manufacturing industry is currently in the midst of the fourth industrial revolution. This digital transformation towards smart manufacturing systems (SMS) is based on the three pillars of connectivity, virtualization, and data utilization [1]. This development is fueled by the rapid progress in both information technology (IT) and operational technology (OT), which has led to an increasingly connected and automated world: in essence, a merging of both worlds (IT/OT). Technologies such as sensors, cloud computing, and AI/big data analytics lead not only to a dramatic increase in the amount of (manufacturing) data, but also to rapidly developing data processing capabilities, which have raised interest in data mining approaches for automating repetitive processes [2]. Time series data, thereby, are one of the most common information representation variants in a variety of different business areas. Advanced process monitoring, on which we rely regularly in SMS, typically yields multidimensional data to increase its effectiveness. A specific branch of time series analysis deals with the recognition of reoccurring patterns within the data (see Figure 1).


Time series data analysis generally utilizes established pattern recognition methods in order to identify the steady state, anomalies, and characteristic failure patterns. However, the identification of such distinct patterns in multivariate time series represents a highly complex problem due to the interactions (correlation) between variables, the time dependency, and the usually nonlinear nature of the regarded data [3,4]. Furthermore, real-world data are usually noisy and, thus, require pre-processing to become a valuable input for analytical algorithms. Pre-processing enhances the ability of the overall approach to be flexible enough to detect various disturbances, such as missing values, noise, or outliers, within the data set, but also sufficiently restrictive so that not all insignificant fluctuations (e.g., measurement errors) are labeled as irregularities. Standard distance-based similarity metrics, which are often used in related unsupervised learning approaches, can therefore not be applied without adaptation for different types of problems [4,5].

There is a growing interest in the field of pattern recognition, especially for multivariate time series data, due to, on the one hand, the increasing availability of potent algorithms and easy-to-use tools, and on the other hand, the realization of the potentially valuable impact of insights derived from such data sets. Most of the recent work on clustering, however, focuses on grouping different time series into similar batches [6,7]. The paper by [8], for instance, presents a popular approach using fuzzy clustering combined with a dynamic time warping technique, resulting in an enhanced performance when compared to previous methods. There are only a few authors focusing on pattern recognition within a single time series, with which we are regularly confronted in maintenance applications [4].

This paper presents a framework to utilize multivariate data to identify reoccurring patterns in real-world manufacturing data. The objective is to identify failure patterns for new applications in the area of maintenance.

Figure 1. Abstract schematic to illustrate different patterns in a time series.

We analyze the drying process of plastic granulate in an industrial drying hopper equipped with multiple sensors. The number of different failure patterns (sources for defects) is not known beforehand and the sensor readings are subject to natural fluctuations (noise). The overall work presented in this manuscript includes a comparison of two different approaches towards the identification of unique patterns in the data set. One processing path includes the sequential use of common segmentation and clustering algorithms to identify patterns, which might lead to a better respective performance (in both steps of the chain) and, therefore, a better overall performance. The second approach features a collaborative method with a built-in time dependency structure, thereby avoiding a multiplication of losses due to a sequential processing chain. The better performing method is fine-tuned afterwards in terms of its hyperparameters.
The resulting patterns (clusters) that are identified by the framework can then serve as input for advanced monitoring methods predicting upcoming failures and ultimately reducing unplanned machine downtime in the future.

The outline of this paper features a short introduction to the topic of plastic granulate drying and time series clustering in Section 2, followed by the proposed framework in Section 3. Finally, the results are presented in Section 4 and consequently discussed in Section 5.

2. State of the Art

Industry 4.0, machine learning, and artificial intelligence (AI) are popular topics in the current scientific discourse and, therefore, the current state of the art is subject to frequent changes or newly arising theories and topics [1]. Thus, we chose to apply the Grounded Theory Approach, originally developed by Glaser and Strauss in 1967 [9], to gather and subsequently analyze the existing knowledge on the topic of pattern recognition on time series data in the context of smart manufacturing systems (SMS). The approach encourages inductive reasoning for the formulation of the experimental setup and, hence, fosters the development of a practical state-of-the-art framework [10]. The corresponding literature was selected from the results of a search term optimization identifying comparable research approaches and problems. We used established academic research databases, in particular Scopus, Web of Science, IEEE Xplore, and ScienceDirect, to execute the keyword-based literature search process. The most relevant papers were selected according to their degree of relation to the subject of pattern recognition in time series for SMS.

In order to establish an automated event detection method for SMS through an unsupervised machine learning framework, we need to take a closer look at the data and its underlying manufacturing process. Domain knowledge is key to developing successful machine learning applications in a manufacturing context. Data mining process phases, such as feature extraction and the selection of suitable algorithms and machine learning models, benefit from process understanding and the appropriate application of these insights [4]. The following chapter introduces the general drying process of plastic granulates and presents the respective data usable for the analysis.

2.1. Manufacturing Process

Drying hoppers are industrial machines, which are mostly used to dry and remove moisture from non-hygroscopic and slightly hygroscopic plastic pellets. They are especially effective for the process of drying the surface area of non-hygroscopic resin pellets prior to further fabrication [11] to avoid issues further down the process chain, such as voids caused by gasification of moisture in the final parts. The drying process in general is a vital step in the manufacturing chain for diverse types of polymers, since the excess moisture will otherwise react during the melting process. This might result in a loss of molecular weight and subsequently lead to changes in the physical properties, such as reduced tensile and impact strength of the material [12,13]. Most modern drying hoppers are structured similarly, in order to ensure an even temperature distribution while also maintaining an even mass material flow. They feature a vertically cylindric body with a conical hole at the bottom, as depicted in Figure 2. The spreader tubes inside the hopper then inject hot and dry air into the chamber, while the granulate is slowly moving from the top to the bottom valve [11,13]. After the hot air has passed the material, it is cleared, dried, (re-)heated (if necessary) and then again reinjected into the drying chamber.

For a drying process to be successful in achieving the targeted humidity, there are three major factors that need to be considered: drying time, drying temperature, and the dryness of the circulating air [13]. The most intuitive factor might be the drying time, since the circulating air can only take a certain amount of humidity from the material until it is saturated. The specific amount of time that the material needs to stay in the machine in order to achieve a defined humidity can be calculated as a function of air temperature, input dryness of the material, and target humidity. In general, the drying process has a concave form insofar as the percentage humidity reduction is steep at first and flattens during later stages [12,14]. Both air dryness and temperature also play decisive roles for the drying process and are usually coordinated in dependency of one another. By drying the air, the relative humidity is reduced and consequently the moisture holding capacity increases, whereas a higher temperature speeds up the drying process, since it facilitates the diffusion of water from the interior to the surface of the material [11,14]. However, there might be adverse effects of too high temperatures on material quality, depending on the plastic itself, that need to be considered.
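To make the concave drying behavior tangible, the following minimal sketch assumes simple first-order (exponential) drying kinetics; the model form and all parameter values (initial moisture, equilibrium moisture, time constant) are illustrative assumptions, not the actual process model of the hopper.

```python
import numpy as np

# Hypothetical first-order drying model, for intuition only: the residual
# moisture decays exponentially toward the equilibrium value set by the
# dryness of the circulating air. All parameter values are assumptions.
def residual_moisture(t_hours, m0=0.30, m_eq=0.02, tau=2.5):
    """Moisture (% by weight) after t_hours; m0 is the input moisture,
    m_eq the reachable equilibrium, tau the time constant in hours."""
    return m_eq + (m0 - m_eq) * np.exp(-t_hours / tau)

def required_drying_time(target, m0=0.30, m_eq=0.02, tau=2.5):
    """Invert the model: hours needed to reach a target moisture level."""
    return -tau * np.log((target - m_eq) / (m0 - m_eq))

print(round(required_drying_time(0.05), 2))  # steep early, flat later: ~5.6 h
```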

Figure 2. Schematic view of a drying hopper with temperature probe.

Due to the critical implications of a malfunctioning drying hopper system, the corresponding process is usually monitored round the clock. In our use case, a six-zoned temperature probe, which continuously measures the temperature at different locations (heights) of the vertically aligned drying chamber (see Figure 2), is applied for the monitoring process. Doing so facilitates the indirect detection of a variety of different reoccurring disruptions in the drying process, such as over- or under-dried material, heater malfunctions, and many more, since most of those incidents are directly connected to or at least can be derived indirectly from the temperature of the circulating air [13]. This is also the reason why these six temperature zones play a major role in determining the performances of the different pattern recognition approaches presented in this paper.

The data set used for this analysis is based on two separate data sources, two individually controlled, yet identical, drying hopper machines located in the same manufacturing plant. Each drying hopper was monitored over a specific time, 6 months for machine one (M1) and 4 months for machine two (M2), whereby an automatic reading of the sensor output was conducted every 60 s over the entire monitoring period. In addition to the different temperature zones, the data set contains a variety of other operational process related sensor data. After the initial preprocessing of the data set, which involves the removal of missing, damaged, and severely fragmented data objects, the resulting data set consists of 31 (M1) and 37 (M2) different features and 263,479 (M1) and 176,703 (M2) instances. The signals from the sensors are partly continuous, such as the real time temperature measurements of the circulating air or panel, and partly binary for information on the status of different components such as the heater or blower (on/off). Figure 3 illustrates the different events involved by displaying the steady state of the machine (left) in comparison to a typical disruption (right).

Figure 3. Exemplary representation of steady state (left) and failure event (right).
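The pre-processing steps described above (removal of missing, damaged, and severely fragmented data objects on a 60 s sampling grid) could be sketched as follows; the file name and column labels are hypothetical, as the real sensor names of M1/M2 are not published with the paper.

```python
import pandas as pd

# Pre-processing sketch for one machine; "drying_hopper_M1.csv" and the
# "temp_zone_*" labels are assumptions made for illustration only.
df = (pd.read_csv("drying_hopper_M1.csv", parse_dates=["timestamp"])
        .set_index("timestamp")
        .sort_index())

# Enforce the 60 s sampling grid, then drop severely fragmented rows:
df = df.resample("60s").first()
df = df.dropna(thresh=int(0.8 * df.shape[1]))  # rows missing >20% of sensors

# Bridge short, isolated gaps in the continuous temperature channels only:
temp_cols = [c for c in df.columns if c.startswith("temp_zone_")]
df[temp_cols] = df[temp_cols].interpolate(limit=5)
```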

2.2. Analytical Approaches

There is a plethora of scientific work available focused on the topic of time series analysis, in particular within the last decades. Two major themes among the published research that are strongly related to the practical application of those algorithms are (i) fault (anomaly) detection within a time series and (ii) clustering of different time series [4]. Since both of these areas are comparably well developed, evidenced by the large amount of available scientific contributions, some pattern recognition approaches employed this to construct a pipeline-like processing framework to cluster time series data. The works [15–17] are comprehensive reviews and applications focused on this topical area and used segmentation and clustering methods subsequently. The majority of these papers apply correlation analysis, wavelet transformation, and principal component analysis (PCA) based metrics to identify homogeneous segments and to assign the data points to each cluster. Classical distance-based metrics (e.g., Euclidean distance) are still used in various scientific works today, most often to provide a benchmark for the evaluation process [4,5]. A more recent work by [8] presented an approach using fuzzy clustering combined with a dynamic time warping technique to identify the different clusters after the segmentation has been completed.

Another reoccurring theme in the field of pattern recognition in time series is the domain in which those approaches are mostly applied. The majority of data sets that have been used to evaluate theoretical frameworks stem from the medical domain (e.g., ECG and biological signals), the financial sector (e.g., exchange rates and stock trades), and speech recognition applications [3–5]. Significantly fewer works focus on manufacturing applications and the analysis of industrial machines (e.g., gas turbine and water level monitoring), especially in a maintenance context [4]. Most recent contributions to the maintenance discourse focus on the prediction and classification of previously defined failure patterns, thereby solely relying on the preceding groundwork of experts. Hence, the use of time series analysis in SMS itself provides a multitude of opportunities for further research. Additionally, the effectiveness of cutting-edge event detection methods is examined for industrial applications, which later can be used to assist or even replace expert-centered error recognition systems. Most importantly, the adaption and usage of advanced pattern recognition algorithms for SMS serves as a stepping stone into further studies combining domain expertise and novel machine learning algorithms for solving advanced industrial problems, therefore moving towards a more automated data processing chain overall.

To achieve this goal, this paper aims to combine the best of both worlds: the latest findings from the general time series analysis literature as well as complex, real-world industrial data to put those cutting-edge approaches to the test. In doing so, we also use feature fusion [4] and sequential pattern recognition methods [18], which have been successfully applied previously within an industrial context. However, an important distinction to comparable works in this field is that we start without any prior expert knowledge on the characteristics (e.g., type, frequency, and shape) of the failure patterns. Furthermore, we also chose a quantitative method to compare competing time series analysis approaches by implementing various methods in combination with different cost functions, while comparable papers often select promising configurations beforehand without further examination. This rigorous (in-depth) proceeding, however, results in a trade-off so that only a limited number of different algorithms (or rather pathways) could be implemented in the scope of this scientific work, which is admittedly smaller than comparable studies, such as [19,20], produced.

3. Research Methodology

An overview of the novel framework proposed in this paper is illustrated in Figure 4. It is important to note that it includes two separate approaches to pattern recognition that were conducted separately and then compared to identify the superior process given the unique characteristics of the data set. This is intentionally included in the framework, as data sets differ significantly in their nature, and providing options to the user that allow to compare the performance as part of the framework is a promising way of leveraging these differences in data sets today.




Figure 4. Overview of proposed time series processing framework.

3.1. Principal Component Analysis (PCA)

The first step, after generally checking and cleaning the data (pre-processing), involves a PCA to identify irrelevant features to reduce the dimensionality (feature reduction) and therefore the overall complexity of the data set. In doing so, the implemented algorithm successively constructs "artificial variables" (principal components) as linear combinations of the original variables, so that those new vectors are orthogonally aligned to each other. This new representation of the data with linearly independent vectors can subsequently be used, instead of the original variables, as input data for the affiliated analysis processes (segmentation and clustering) [4]. After the data has been cleaned and rearranged, we can start with the first step (of the first pathway) of segmenting the time series.
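As a concrete illustration of this feature fusion step, the following sketch uses scikit-learn's PCA on standardized sensor channels; the 95% explained-variance threshold and the random stand-in data are assumptions, as the paper does not fix these values here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Feature fusion sketch: standardize the sensor channels, then keep enough
# principal components to explain 95% of the variance (assumed threshold).
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 31))            # stand-in for the 31 M1 features

X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)                 # float in (0, 1): variance target
X_reduced = pca.fit_transform(X_std)

print(pca.n_components_, round(pca.explained_variance_ratio_.sum(), 3))
```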

3.2. Heuristic Segmentation

Following the first branch of the pattern recognition pathway, we consider the multivariate signal $X = \{x_t\}_{t=1}^{T}$, with $x_t \in \mathbb{R}^m$ being an $m$-dimensional (number of sensors) observation. $T$ stands for the total number of observations and $t$ is the index of the measurements in time. Consequently, the multivariate time series can also be expressed as the following $T \times m$ matrix:

$$X = \begin{bmatrix} x_{11} & \cdots & x_{1m} \\ \vdots & \ddots & \vdots \\ x_{T1} & \cdots & x_{Tm} \end{bmatrix}$$

In order to identify sections of the time series with similar trajectories (patterns), we first need to partition the signal into internally homogeneous segments. Ideally, the time series will be automatically split into segments representing (i) regular machine behavior (steady state) on the one side and (ii) segments with a variety of events (including failure events) on the other side as the final output. An event or segment in this case can therefore be formulated as $X_i = \{x_t\}_{t=a+1}^{b}$, with $X_i$ being the $i$-th overall segment of the partition and $(b-a)$ with $1 \le a < b \le T$ denoting the length and location of the interval. For a less formal notation, it can also be denoted as $X_{a..b}$. The points $t$ at which the state of the signal changes are usually referred to as change points or break points and are given as $B = \{b_1, b_2, \ldots, b_n\}$, with $b_1$ being the chronologically first break point in the time series. The set of true break points, which was acquired by manually labeling the time series, is denoted as $B^*$ with its corresponding elements $\{b_1^*, b_2^*, \ldots, b_n^*\}$, while $\hat{B} = \{\hat{b}_1, \hat{b}_2, \ldots, \hat{b}_n\}$ is used to record the approximated break points returned by the algorithm.

To solve this problem, we will apply the heuristic changepoint detection method implemented in a package called ruptures [5], which can be described as a combination of three defining elements: cost function, search method, and (penalty) constraint (see Figure 5).

Figure 5. Overview of heuristic segmentation approach by ruptures based on [5].

Formally, the change point detection (segmentation) can be viewed as a model selection problem, which tries to choose the best possible partition of the time series given a quantitative criterion. This criterion is most often expressed in the form of a cost function and is therefore denoted as $C(B)$, which needs to be minimized during this process of optimization [5]. In fact, the criterion function consists of the sum of the costs of all segments that define the segmentation:

$$C(B) := \sum_{i=1}^{n} c(x_{b_i..b_{i+1}}) \quad (1)$$

For this formulation, we also need to define the dummy indices $b_0$ as the first index and $b_{n+1}$ as the end of the time series. Furthermore, we need to introduce a penalty function for our overall minimization problem, since the algorithm would otherwise always converge to an oversegmentation, where all points are ultimately represented in their own individual segment, thereby minimizing the overall sum of costs [5]. The final optimization problem for the segmentation is therefore given by:

$$\min_{B} \; C(B) + \mathrm{pen}(B) \quad (2)$$

The penalty function (constraint) can also be interpreted as a complexity measure for the segmentation and needs to be adjusted for each heuristic, respectively. In this case, we choose a linear penalty function given by $\mathrm{pen}_{l_0}(B) := \beta |B|$ with $\beta > 0$. Choosing a linear penalty function thereby aligns with successful implementations from recent literature [5,21–23] and with well-known model selection criteria such as Akaike's information criterion and the Bayesian information criterion. After our problem has been defined, we need to choose a search method to calculate the break points of the time series. There are a variety of different solving algorithms available, from which we will explore three approaches (also featured in ruptures) in more depth in the following sections.
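To make the interface of ruptures concrete before the three search methods are detailed in the following subsections, the sketch below runs all three on a synthetic stand-in signal with a linear penalty; the cost model ("l2"), window width, and penalty value are illustrative assumptions, not the tuned settings from the paper.

```python
import numpy as np
import ruptures as rpt

rng = np.random.default_rng(42)
signal = np.concatenate([rng.normal(0, 1, (300, 3)),   # steady state
                         rng.normal(4, 1, (200, 3)),   # event
                         rng.normal(1, 1, (300, 3))])  # new regime

pen = 15.0  # linear penalty pen(B) = beta * |B|, beta chosen ad hoc here

for algo in (rpt.Window(width=80, model="l2"),  # sliding window (Section 3.2.1)
             rpt.BinSeg(model="l2"),            # top-down binary split (3.2.2)
             rpt.BottomUp(model="l2")):         # bottom-up merging (3.2.3)
    bkps = algo.fit(signal).predict(pen=pen)    # returned list ends with T
    print(type(algo).__name__, bkps)

# With manually labeled break points B*, the result can be scored, e.g.:
from ruptures.metrics import precision_recall
print(precision_recall([300, 500, 800], bkps, margin=20))
```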

3.2.1. Sliding Window Segmentation

The core of the sliding window algorithm is to approach the partitioning (segmentation) problem by moving a fixed-size window over the time series. The window basically represents the area of vision, or rather the area of focus, of the algorithm for a given iteration [6]. Thereby, the approach implemented by ruptures does not directly divide the time series in each iteration solely based on the information given in the window, but rather calculates a so-called discrepancy measure for each point in the signal. To do so, the window is split into two parts ("left" and "right") along its center point [5]. The discrepancy measure is defined as follows:

$$d(x_{a..t}, x_{t..b}) = c(x_{a..b}) - c(x_{a..t}) - c(x_{t..b}), \quad 1 \le a < t < b \le T \quad (3)$$

Given a cost function $c(\cdot)$ and an interval $[a, b]$, the discrepancy measure essentially computes the difference in cost between treating the sub-signal $x_{a..b}$ as one segment and splitting the interval into two separate segments ($x_{a..t}$ and $x_{t..b}$) at point $t$. Consequently, the discrepancy function reaches high values when the two sub-windows contain dissimilar segments according to the cost function. After the values of the discrepancy function have been computed for all points of the time series, the algorithm analyzes the discrepancy curve to identify the maximum values (peak search). The peaks of the function correspond to the incidents in which the window was holding two highly dissimilar segments in its sub-windows and therefore yield the break points of the signal [5]. Figure 6 provides a schematic overview of the algorithm to clarify its functionality and objective.

Figure 6. Sliding window approach: discrepancy curve based on [5].
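The discrepancy computation of Equation (3) can be sketched directly; an l2 cost (sum of squared deviations from the segment mean) is assumed here, as the paper does not prescribe a specific cost function at this point.

```python
import numpy as np

def l2_cost(sub):
    """Quadratic segment cost: sum of squared deviations from the mean."""
    return ((sub - sub.mean(axis=0)) ** 2).sum()

def discrepancy_curve(signal, width=80):
    """Evaluate Equation (3) for a window of fixed width centered at each
    admissible point t; the l2 cost model is an assumption made here."""
    half = width // 2
    d = np.full(len(signal), np.nan)
    for t in range(half, len(signal) - half):
        a, b = t - half, t + half
        d[t] = l2_cost(signal[a:b]) - l2_cost(signal[a:t]) - l2_cost(signal[t:b])
    return d  # peaks of this curve correspond to candidate break points
```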

3.2.2. Top-Down Segmentation

The top-down method is a greedy-like approach to partition the time series iteratively. To do so, it considers the whole time series and calculates the total cost for each possible binary segmentation of the signal at first. After the most cost-efficient partitioning has been found, the time series is split at the identified change point and the procedure repeats for each sub-signal, respectively [5,6]. The calculation on which the partitioning decision of each stage of the iterative algorithm is based can be displayed with the following formula:

$$\hat{b}^{(1)} := \operatorname*{argmin}_{1 \le t \le T-1} \; c(x_{0..t}) + c(x_{t..T}) \quad (4)$$

It represents the decision mechanism for the first iteration of the algorithm. However, it can be applied to each sub-signal after the initial time series has been divided. The stopping criterion of the approach depends entirely on the previously defined penalty function of the overall problem (2). Thereby, the cost saving of each partitioning is compared to the penalty of introducing another break point to the existing set of change points. The algorithm terminates when there is no partitioning within the iteration that has a higher cost saving compared to the penalty value [5]. The schematic depiction in Figure 7 provides an overview of the whole process.

Figure 7. Top-down algorithm schema based on [5].
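The initial split of Equation (4) reduces to a one-dimensional search over candidate indices, as the following sketch (again assuming an l2 cost) illustrates.

```python
import numpy as np

def first_split(signal, min_size=2):
    """Equation (4): the first break point chosen by the top-down approach
    is the index t minimizing c(x_{0..t}) + c(x_{t..T}); l2 cost assumed."""
    def cost(sub):  # sum of squared deviations from the segment mean
        return ((sub - sub.mean(axis=0)) ** 2).sum()
    T = len(signal)
    return min(range(min_size, T - min_size + 1),
               key=lambda t: cost(signal[:t]) + cost(signal[t:]))
```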

3.2.3. Bottom-Up Segmentation

The bottom-up segmentation can best be described as the counterpart to the previously discussed top-down approach. Hereby, the start of the algorithm is characterized by the creation of a fine, evenly distributed partitioning (all segments have the same length) of the entire time series. Subsequently, the algorithm merges the most cost-efficient pair of adjacent segments in a greedy-like fashion [5,6]. To clarify the overall process, a schematic overview is depicted in Figure 8.


Figure 8. Bottom-up algorithm schema based on [5].

In contrast to other bottom-up methods covered in the current literature on segmentation [6], the approach we used in this case also harnesses the discrepancy measure defined above (see sliding window) in order to determine the most cost-efficient pairs to merge. During each iteration, all temporary change points are ranked according to the discrepancy measure, which takes into account all of the points in both adjacent segments. Consequently, the break point with the lowest discrepancy measure is deleted from the set of potential break points. This process continues until the increase in costs, due to the removal of a break point, can no longer be compensated by the reduction of the penalty, and the algorithm terminates [5].

After completing the heuristic segmentation, standard clustering approaches such as k-means or Gaussian mixture models can be applied (in combination with DTW) to sort the time series snippets into equal groups [8]. These approaches work similarly to parts of the TICC algorithm and are therefore relevant to the depictions in the following chapter, as indicated by the '*' in Figure 4.

3.3. Toeplitz Inverse-Covariance Clustering (TICC)

In contrast to the heuristic approaches that start by partitioning the signal first and then conduct the clustering, the alternative method we explored to tackle the problem includes the algorithm TICC, which approaches the problem by simultaneously segmenting and clustering the multivariate time series through the use of Markov random fields (MRFs). The reason for using Markov networks is that they generally denote relationships between variables much more strongly than simpler correlation-based models. Thereby, each cluster is defined as a dependency network, which takes into account both the interdependencies of all dimensions (features) of an observation $x_t \in \mathbb{R}^m$ and the dependency to the neighboring points $\{\ldots, x_{t-2}, x_{t-1}, x_{t+1}, x_{t+2}, \ldots\}$. Each cluster network can be described as a multilayer undirected graph, where the vertices represent the random variables and the edges indicate a certain dependency among them [3]. The layers of each network correspond to the number of consecutive observations (instances) and therefore to the receptive field of the model (see Figure 9).

Figure 9. Exemplary MRF structures for two distinct clusters based on [3].

Another important differentiating quality of MRFs is that, under certain criteria, they can be used to mimic a multivariate Gaussian model with respect to the graphical representation of the multivariate distribution. Since this applies in our case, we can use the precision matrix (inverse covariance matrix) as a formal representation of the graphical structure, where a zero (conditional independence) in the matrix corresponds to a missing edge between two variables. The benefits of using the precision matrix $\Theta$ over the covariance matrix are manifold and include computational advantages as a result of its tendency to be sparse. Furthermore, the sparsity of the MRFs also prevents overfitting and therefore increases the robustness of the overall approach [3,24].
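Reading the MRF structure off a precision matrix then amounts to checking which off-diagonal entries are (numerically) non-zero; the sensor names and tolerance in this sketch are assumptions for illustration.

```python
import numpy as np

def mrf_edges(precision, names, tol=1e-8):
    """Read the graph off a precision matrix: a (near-)zero off-diagonal
    entry means conditional independence, i.e., no edge. The tolerance is
    an assumption to absorb numerical noise."""
    return [(names[i], names[j])
            for i in range(len(names))
            for j in range(i + 1, len(names))
            if abs(precision[i, j]) > tol]

# Toy example: zones 1 and 3 are conditionally independent given zone 2.
theta = np.array([[ 2.0, -0.8,  0.0],
                  [-0.8,  2.5, -0.4],
                  [ 0.0, -0.4,  1.8]])
print(mrf_edges(theta, ["zone_1", "zone_2", "zone_3"]))
# -> [('zone_1', 'zone_2'), ('zone_2', 'zone_3')]
```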


3.3.1. Problem Formulation

In addition to the previously established notation, we need to introduce a few modifications for the formulation of the TICC problem. First, we are interested in a short subsequence (with $w \ll T$) rather than just one point in time $x_t$. Therefore, to assign the data points to the corresponding clusters, we need to address those points as $nw$-dimensional vectors $X_t = \{x_{t-w+1}, \ldots, x_t\}$. The number of clusters, which has to be defined beforehand, will be denoted as $K \in \mathbb{N}$, and the time horizon of the short subsequence will be set by the window size $w$.

A challenge of the TICC approach is to address the assignment problem of matching the data points to one of the $K$ clusters and thus determining the assignment sets $P = \{P_1, \ldots, P_K\}$ with $P_i \subset \{1, 2, \ldots, T\}$. Furthermore, the algorithm must update the cluster parameters $\Theta = \{\Theta_1, \ldots, \Theta_K\}$ based on the previously calculated assignment mappings [3]. The overall optimization problem can be formulated as:

$$\operatorname*{argmin}_{\Theta \in \mathcal{T},\, P} \; \sum_{i=1}^{K} \left[ \|\lambda \circ \Theta_i\|_1 + \sum_{X_t \in P_i} \left( -\ell\ell(X_t, \Theta_i) + \beta \, \mathbb{1}\{X_{t-1} \notin P_i\} \right) \right] \quad (5)$$

This entire problem is called Toeplitz inverse covariance-based clustering (TICC). In the formula, $\mathcal{T}$ reflects the set of symmetric block Toeplitz $nw \times nw$ matrices, which adds an additional constraint for the construction of the MRFs. The enforced structure ensures the time-invariance of each cluster, so that the cluster assignments do not depend on gate events. This implies that any edge between two layers $l$ and $l+1$ must also exist for the layers $l+1$ and $l+2$. The first expression, $\|\lambda \circ \Theta_i\|_1$, represents an additional sparsity constraint based on the Hadamard product of the inverse covariance matrix with the regularization parameter $\lambda \in \mathbb{R}^{nw \times nw}$. The second part of the formula, $\ell\ell(X_t, \Theta_i)$, states the core optimization problem of cluster parameter fitting, given assignment set $P_i$ (log likelihood). The last part of the overall problem addresses the desired property of temporal consistency. By adding a penalty $\beta$ each time two consecutive observations are assigned to two different clusters ($X_{t-1} \notin P_i$), the overall algorithm is incentivized to keep those instances to a minimum [3].

The method to handle this complex problem can be described as a variation of the expectation maximization (EM) algorithm, which is commonly used to solve related clustering issues by applying Gaussian mixture models [25].
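A usage sketch based on the TICC authors' reference implementation (github.com/davidhallac/TICC) is shown below; the constructor arguments follow that repository's README and may differ between versions, and the input file name is hypothetical.

```python
# Usage sketch for the reference TICC implementation; each row of the
# (hypothetical) CSV file is one m-dimensional observation x_t.
from TICC_solver import TICC

ticc = TICC(window_size=5,          # w: length of the short subsequences X_t
            number_of_clusters=8,   # K: has to be fixed beforehand
            lambda_parameter=0.11,  # sparsity weight lambda in problem (5)
            beta=600,               # temporal consistency penalty in (5)
            maxIters=100)           # cap on the EM iterations

cluster_assignment, cluster_mrfs = ticc.fit(input_file="hopper_signal.csv")
# cluster_assignment: one label per time step; cluster_mrfs: one block
# Toeplitz precision matrix (MRF) per cluster.
```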

3.3.2. Cluster Assignment (Expectation)

Similar to the expectation step in the common EM algorithm, TICC starts with the assignment of observations to the given clusters. However, to begin this process, we need a prior distribution to which the data points can be assigned. Hence, the overall TICC algorithm is initialized by conducting a k-means clustering to calculate the initial cluster parameters $\Theta$. After the initialization phase, the points are assigned to the most suitable cluster by fixing the values of the precision matrices $\Theta$ [3]. This leaves us with the following combinatorial optimization problem for $P$:

$$\operatorname*{argmin}_{P} \; \sum_{i=1}^{K} \sum_{X_t \in P_i} -\ell\ell(X_t, \Theta_i) + \beta \, \mathbb{1}\{X_{t-1} \notin P_i\} \quad (6)$$

The sparsity constraint is not directly relevant for this step, since the shape of the precision matrix $\Theta$ is automatically fixed with its values. Thus, the resulting constant term of the Hadamard product can be neglected. Each of the short subsequences $X_t$ is therefore primarily assigned to a cluster based on its maximum likelihood under the regard of temporal consistency (minimizing the number of "break points"). Problem (6) is subsequently solved by harnessing the dynamic programming approach of the Viterbi algorithm [26]. The problem is translated into a graphical representation (Figure 10), which represents each decision (per instance) the algorithm can make to assign the data points consecutively [3].

Figure 10. Graphical formulation of the clustering assignment problem based on [3].
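This assignment step can be sketched as a standard Viterbi recursion over a T x K cost lattice; the negative log likelihood matrix is assumed to be precomputed from the current (fixed) cluster parameters.

```python
import numpy as np

def assign_clusters(nll, beta):
    """Solve problem (6) exactly with the Viterbi recursion.

    nll[t, i] -- negative log likelihood of subsequence X_t under cluster i
                 (assumed precomputed from the fixed precision matrices)
    beta      -- penalty for assigning consecutive points to different clusters
    Returns one cluster label per time step."""
    T, K = nll.shape
    cost = nll[0].copy()                  # best total cost ending in cluster i
    back = np.zeros((T, K), dtype=int)    # best predecessor cluster per step
    for t in range(1, T):
        # transition[j, i]: cost of being in j at t-1 and moving to i at t
        transition = cost[:, None] + beta * (1.0 - np.eye(K))
        back[t] = transition.argmin(axis=0)
        cost = nll[t] + transition.min(axis=0)
    labels = np.empty(T, dtype=int)
    labels[-1] = int(cost.argmin())
    for t in range(T - 1, 0, -1):         # trace the minimum cost path back
        labels[t - 1] = back[t, labels[t]]
    return labels

# Toy usage: 3 clusters, strongly preferring cluster 1 in the middle section.
rng = np.random.default_rng(0)
nll = rng.random((100, 3))
nll[30:70, 1] -= 5.0
print(assign_clusters(nll, beta=0.5))
```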

Subsequently, the Viterbi algorithm returns the minimum cost path (Viterbi path) by calculating the total cost values of each vertex recursively while remembering the minimum cost path simultaneously [26].

3.3.3. Updating Cluster Parameters (Maximization)

The maximization step of the EM algorithm concerns the updating of the inverse covariance matrices of the clusters once the assignments of the E-step have been given. For this step, the assignment sets $P$ are frozen and the precision matrices $\Theta_i$ are adjusted based on the corresponding observations. Similar to the assignment problem before, the overall TICC problem (5) can be simplified for this step, since the temporal consistency is strongly tied to the assignment of the data points. As seen before, the whole term can therefore be dropped while solving the newly arising subproblem (7), which is expressed in the following formula [3]:

$$\operatorname*{argmin}_{\Theta \in \mathcal{T}} \; \sum_{i=1}^{K} \left[ \|\lambda \circ \Theta_i\|_1 + \sum_{X_t \in P_i} -\ell\ell(X_t, \Theta_i) \right] \quad (7)$$

Another relevant property of this optimization function is that the updating problem for each cluster can be calculated independently, since there is no reliance on previous or cross-connected results of other clusters as in problem (6). Therefore, all updates can be done in parallel rather than sequentially [3]. In order to bring the overall problem into an easier to handle form, we need to rearrange the previously established formula. The negative log likelihood can be expressed as follows:

$$\sum_{X_t \in P_i} -\ell\ell(X_t, \Theta_i) = -|P_i| \left( \log \det \Theta_i - \operatorname{tr}(S_i \Theta_i) \right) + C \tag{8}$$

Thereby, $|P_i|$ stands for the number of observations assigned to each cluster and $\det$ for the determinant of the matrix $\Theta_i$. The term $\operatorname{tr}(\cdot)$ is the abbreviation for the trace of a quadratic matrix. The included value $S_i$ stands for the empirical covariance of the random sample, which is often used to approximate the true covariance matrix $\Sigma$ of a stochastic problem [27]. The remaining term $C$ represents all other constant (variable independent) terms of the problem, which can also be neglected according to the argumentation above. We can now substitute the log likelihood expression of the initial updating problem (7) with the new representation in (8) and ultimately (after a few adjustments) reach the following identical optimization problem for each cluster (9) [3]:

$$\begin{aligned} \text{minimize} \quad & -\log\det\Theta_i + \operatorname{tr}(S_i\Theta_i) + \frac{1}{|P_i|}\,\|\lambda \circ \Theta_i\|_1 \\ \text{subject to} \quad & \Theta_i \in \mathcal{T} \end{aligned} \tag{9}$$

Note that the coefficient of the sparsity constraint is permanently constant and can therefore be incorporated into the regularization parameter λ by scaling the values accordingly. For further simplification we will hence remove the coefficient and furthermore neglect the indices of the variables and parameters to highlight the independence of each subproblem from the others [3]. Many problems of this form can be solved efficiently by using the alternating direction method of multipliers (ADMM). This algorithm, aimed at solving convex problems, has consistently shown promising results and performs especially well on large-scale tasks [28]. However, in order to make this approach applicable, we need to reformulate our problem once more to align it with the ADMM requirements. Therefore, a consensus variable Z is introduced to allow the optimization function to be split into two (variable-)independent target functions [3].

$$\begin{aligned} \text{minimize} \quad & -\log\det\Theta + \operatorname{tr}(S\Theta) + \|\lambda \circ Z\|_1 \\ \text{subject to} \quad & \Theta = Z, \; Z \in \mathcal{T} \end{aligned} \tag{10}$$
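For illustration, a compact sketch of the resulting ADMM iteration is given below. It uses the closed-form eigendecomposition update for Θ known from graphical-lasso-style problems, substitutes a scalar λ for the elementwise matrix penalty, and omits the projection onto the block-Toeplitz set T that TICC performs in the Z-update, so it should be read as a simplified approximation of the actual solver rather than the authors' implementation.

```python
import numpy as np

def admm_update(S: np.ndarray, lam: float, rho: float = 1.0,
                n_iter: int = 200, tol: float = 1e-6) -> np.ndarray:
    """ADMM sketch for problem (10); TICC's Toeplitz projection is omitted.

    S   -- empirical covariance of the observations assigned to one cluster
    lam -- scalar l1 weight (the paper uses an elementwise matrix lambda)
    """
    n = S.shape[0]
    Z = np.eye(n)
    U = np.zeros_like(S)
    for _ in range(n_iter):
        # Theta-update: closed form via the eigendecomposition of rho*(Z - U) - S
        w, Q = np.linalg.eigh(rho * (Z - U) - S)
        theta_diag = (w + np.sqrt(w ** 2 + 4.0 * rho)) / (2.0 * rho)
        Theta = (Q * theta_diag) @ Q.T
        # Z-update: elementwise soft-thresholding; TICC would additionally
        # project Z onto the block-Toeplitz set T here
        A = Theta + U
        Z_new = np.sign(A) * np.maximum(np.abs(A) - lam / rho, 0.0)
        # U-update: dual ascent on the consensus constraint Theta = Z
        U += Theta - Z_new
        # stop once primal and dual residuals are near zero
        if (np.linalg.norm(Theta - Z_new) < tol
                and np.linalg.norm(Z_new - Z) < tol):
            Z = Z_new
            break
        Z = Z_new
    return Z
```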

Similar to the EM algorithm, ADMM approximates the optimal result of the primal optimization problem in an alternating manner, by updating one variable while the others are fixed. For a sufficiently large number of iterations, the solution of the algorithm converges to the optimal solution of the problem (9). In practice, the method is stopped when the residual variables reach values close to zero (the equality constraint is almost met), indicating a nearly feasible solution [3,28]. Those two alternating steps of datapoint (to cluster) assignment and updating the cluster parameters repeat until the cluster assignments become repetitive (stationary) and the overall TICC algorithm terminates [3]. Finally, the high-level outline for the TICC method is provided in the following Algorithm 1.

Algorithm 1: TICC (high-level)
initialization: cluster parameters Θ, cluster assignments P
repeat
    E-step: assign points to clusters → P
    M-step: update cluster parameters → Θ
until stationarity
return Θ, P
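The following Python sketch ties the previous pieces together into the EM-style loop of Algorithm 1. It reuses the illustrative helpers assign_clusters and admm_update from above, assumes standardized (zero-mean) stacked subsequences, and deliberately ignores edge cases such as empty clusters; it is a minimal outline, not the reference implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def negative_log_likelihood(X: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """-ll(X_t, Theta) for zero-mean Gaussians, up to constants (cf. (8))."""
    _, logdet = np.linalg.slogdet(theta)
    quad = np.einsum("ti,ij,tj->t", X, theta, X)   # x_t' Theta x_t per sample
    return 0.5 * (quad - logdet)

def ticc(X: np.ndarray, K: int, beta: float, lam: float, max_iter: int = 100):
    """Minimal EM-style TICC loop following Algorithm 1 (empty clusters unhandled)."""
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(X)   # initialization
    thetas = None
    for _ in range(max_iter):
        # M-step: independent sparse precision estimate per cluster (see admm_update)
        thetas = [admm_update(np.cov(X[labels == i], rowvar=False), lam)
                  for i in range(K)]
        # E-step: Viterbi assignment under the switching penalty beta
        nll = np.column_stack([negative_log_likelihood(X, th) for th in thetas])
        new_labels = assign_clusters(nll, beta)
        if np.array_equal(new_labels, labels):    # stationarity -> terminate
            break
        labels = new_labels
    return thetas, labels
```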

4. Results

In this chapter, the results of the two alternative analytical pathways are presented. For the comparison of the two pathways, an evaluation systematic is introduced that allows for a problem-centric comparison of the results.

The performance evaluation of the individual approaches and their corresponding configuration, consisting of search method and cost function, is implemented by applying a variety of established and commonly accepted evaluation metrics from different fields of machine learning, which are adjusted for the segmentation problem. The first metric is the Hausdorff measure, which measures the greatest temporal distance between a true change point and a predicted change point. It is therefore measuring the accuracy of the approximated break points [5,29]. The second measure is the Rand index, which basically measures the number of agreements between two segmentations. It is commonly used for the evaluation of clustering performances and is also related to the (mathematical) accuracy [5]. The final evaluation metric, which is commonly used in the field of classification, is the F1-Score. It is a combination of the precision (reliability) and recall (completeness) of a classification. For the F1-Score calculation no element is prioritized over another and therefore the total value is given as the harmonic mean of both measure components [5]. The true change points that are required as input to use the evaluation metrics were determined manually leveraging domain experts' input.
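As an illustration of how such an evaluation can be set up, the snippet below computes all three metrics with the ruptures library that accompanies reference [5]; the breakpoint lists are hypothetical placeholders, not values from our dataset.

```python
# pip install ruptures  (library accompanying reference [5])
from ruptures.metrics import hausdorff, randindex, precision_recall

# breakpoints are sorted end indices; by convention the last sample is included
true_bkps = [120, 480, 900, 1000]   # hypothetical expert-annotated change points
pred_bkps = [118, 500, 910, 1000]   # hypothetical detector output

h = hausdorff(true_bkps, pred_bkps)                  # greatest temporal distance
ri = randindex(true_bkps, pred_bkps)                 # agreement of segmentations
p, r = precision_recall(true_bkps, pred_bkps, margin=10)
f1 = 2 * p * r / (p + r)                             # harmonic mean
print(f"Hausdorff={h}, Rand index={ri:.3f}, F1={f1:.3f}")
```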

The first aspect of the experimental results deals with the problem of choosing a proper cost function in order to identify homogeneous segments. Therefore, a variety of different cost functions have been applied to the dataset by using the sliding window approach discussed earlier. The results are displayed in Figure 11.

Figure 11. Comparison of exemplary cost functions.

Since both the Rand index and the F1-score are measures in the range [0, 1], the Hausdorff metric is scaled accordingly. This is done by measuring the pairwise percentage difference of a given cost function in comparison to the worst result of the whole set of cost functions. Therefore, the approach with the worst Hausdorff metric always scores a zero, while the other scores display the percentage superiority of each approach in comparison to that. For the analytics problem used in this paper, we compared four different cost functions in total (a sketch of this comparison follows the list):
• least absolute deviation (L1),
• least squared deviation (L2),
• linear model change based on a piecewise linear regression (Linear),
• function to detect shifts in the means and scale of a Gaussian time series (Gaussian).
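A minimal sketch of this comparison using the ruptures library is shown below; the synthetic signal, the window width, and the fixed number of breakpoints are illustrative assumptions, with the model strings mapping to the library's L1, L2, linear, and Gaussian ("normal") cost functions.

```python
import ruptures as rpt

# synthetic multivariate signal as a stand-in for the dryer data
signal, true_bkps = rpt.pw_constant(2000, 3, n_bkps=4, noise_std=1.0)

# the four cost models of Figure 11, mapped to their ruptures names
models = {"L1": "l1", "L2": "l2", "Linear": "linear", "Gaussian": "normal"}

predictions = {}
for name, model in models.items():
    algo = rpt.Window(width=50, model=model).fit(signal)
    predictions[name] = algo.predict(n_bkps=len(true_bkps) - 1)  # fixed K
```

Each predicted breakpoint list can then be scored against the annotated change points with the metrics shown earlier.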


The findings displayed in Figure 11 show mixed results. While the L2 function scores superior in the field of temporal distance and recall, it is inferior in regard of the precision and ultimately in the total F1-Score. The L1 cost function on the other hand scores poorly in terms of the Hausdorff metric (L2 is relatively more than 60% better) but dominates the other approach in terms of precision and consequently scores best in terms of the F1-Score. The two remaining cost functions remain unremarkable, since they consistently score significantly worse than the leading approaches for each metric category, except for the recall, where all of the approaches are close. In the end, the classical L2 cost function was chosen as benchmark for the comparison of the pathway methods based on its superiority in two out of three measures (Hausdorff and Rand index). It was also used in the following analysis and discussion of the results.

The second area of investigation is the comparison of the different searching methods with each other. The TICC algorithm was also included into this comparison by dropping the cluster labels from the results and therefore turning the multi-cluster output into a binary clustering ("steady mode" and "event"), as sketched below. The results are presented in Figure 12.
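A possible way to perform this binarization is sketched here: the per-sample cluster labels are collapsed into plain change-point indices, so that the TICC output can be scored with the same segmentation metrics as the heuristic methods; the helper name is our own.

```python
import numpy as np

def labels_to_bkps(labels: np.ndarray) -> list:
    """Drop cluster identities: per-sample labels -> binary segmentation breaks."""
    changes = np.flatnonzero(np.diff(labels)) + 1    # indices where the label switches
    return changes.tolist() + [len(labels)]          # include T (ruptures convention)

# e.g. labels_to_bkps(np.array([0, 0, 2, 2, 2, 1, 1])) returns [2, 5, 7]
```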

Figure 12. Comparison of segmentation approaches.

The most striking result of the experiment is the superiority of the TICC approach in comparison to all heuristic methods. Even the best heuristic approach, namely the sliding window, which has constantly outperformed the other two approaches except for the recall, is surpassed by the TICC algorithm remarkably.

The pattern recognition approach provided by TICC identified a total of 32 events (disturbances of steady state), which in turn were broken down into four different clusters (patterns). Furthermore, it revealed the existence of not one but two steady states of the machine, which could not be identified by solely monitoring the temperature zones. The remaining three clusters were identified as actual disturbances of the drying process. An overview over all major events is given in the following Figure 13. The pattern in the top left of the diagram represents a section of the steady state (cluster A) of the machine and depicts the natural fluctuations, especially of the sixth temperature zone (material input).

Figure 13. Graphical depiction of most frequent clusters.



The first disturbance of this natural state can be seen in the bottom left of the diagram and will be referred to as cluster C or "canyon". The disturbance occurred in total 10 times during the whole monitoring period with a mean duration of about 3000 min, which corresponds to approximately 46 h from start to finish. The shortest duration of a cluster C event was 1530 min (25.5 h) and the longest duration was 3210 min (53.5 h). The second recorded disturbance can be seen in the bottom right and is referred to as cluster D or "shark fin". It appeared in total six times and was more fluctuant in terms of its durations, varying from approximately 1500 min (25 h) to 7300 min (121 h). The duration with the highest density (3 of 6) was thereby 1500 min (median). The last common disturbance can be seen in the top right of the diagram and is referred to as cluster B or "random spike". This disturbance appears less systematic than the events presented before since its durations range from 50 min to 400 min, which is notably shorter than the other common disturbance durations identified. The shape of the events of cluster B, which appeared a total of nine times, also can be seen to be less homogeneous in comparison to clusters C and D. However, in addition to the correct assignments the algorithm has made automatically, there have been some events which raised doubt in terms of their affiliation to the assigned failure clusters. Those irregularities are presented in the following Figure 14.

Figure 14. Graphical depiction of rare events (steady state for scale).

The top left of the diagram shows the steady state as a comparison. The other three events each show a characteristic pattern but were still all assigned to the cluster C ("canyon"). However, in a later step of the manual analysis process they have been found to be significantly different from the corresponding events in the cluster C and were therefore declared as related subclusters. Each of the subclusters occurred a total of two times within the entire dataset.

5. Discussion

This paper presents a functional framework to automatically analyze multivariate time series data in order to identify reoccurring patterns within a manufacturing setting. In doing so, we have tested two different pathways to approach the pattern recognition problem in time series applications and thereby displayed their strengths and weaknesses. As the data sets and their requirements vary significantly even within the manufacturing domain, the option to quickly apply and evaluate different approaches is a key feature towards a robust framework. However, to truly reflect the full wealth of different data sets and time series, additional alternatives might have to be included in the future. Nevertheless, the process and evaluation enable the principle scalability of the presented framework.

We have shown that a two-step approach of using a segmentation before clustering the data appears to be inferior to a collaborative method. The obtained results from our manufacturing time series data set have shown that even the first step of the sequential approach (heuristic segmentation) has already returned a worse performance for all possible searching methods when compared to the TICC approach in this case. This can be observed in Figure 12 of the results, since the TICC algorithm scored significantly higher in almost all statistical evaluation metrics. The only indicator where TICC scored below the alternative BU and TD approaches was the recall metric. This can, however, be easily explained in this context by the low precision scores of both heuristic methods, since they have a tendency to oversegment the data. A rather fine partitioning consequently increases the probability that the true change points will be among the predicted set, but also increases the number of false (unnecessary) break points, reducing the precision drastically. The better performance of the TICC approach could also not be explained through an improper cost function, since we have vetted various homogeneity metrics in order to identify the most suitable cost function for the given dataset (see Figure 11). All in all, the findings indicate that the second processing pathway is the superior approach in this application case.

We have also shown that the TICC-featuring pathway is capable of returning strong results, since all of the three major events in the time series have been found and furthermore assigned to different clusters. Even though it was not able to recognize the events displayed in Figure 14, this can be explained through the low frequency of those disturbances and therefore a lack of significance in order to define a new cluster based on only a few available examples. A further dissection of the clusters C and D suggests that those clusters correspond to the real-world activity of shutting the machine down for the weekend. National holidays were also taken into consideration and made it possible to explain a significant number of events that took place over the course of a day or two. Furthermore, the differences between the clusters C and D might be the result of the difference between a rapid machine shutdown, where the material is removed immediately, and a cooling process, where the material remained in the machine. This could also indicate an improper shutdown, where the material was supposed to be removed for the weekend break but was forgotten inside the dryer. The events assigned to cluster B (random spikes) could, however, not be tied to related real-world events at this point and most likely correspond to random disturbances caused, for example, by an improper handling of the machine or related short-term issues. However, the insights from the analysis can now be used to investigate the root cause and then return with a correct label for these events in the future after further consultation with domain experts. Like most analytics projects, the continuous cycle is key for a long-term sustainable solution and providing real value to the application area.

All in all, our results show that the multivariate clustering approach displayed in the second pathway is able to return strong results for the time series data set in this case. It is safe to say that the new framework can identify common events, which in some cases correspond with relevant sources of failure. Therefore, these insights can subsequently be used to implement a near real-time warning system (or as input for a more sophisticated predictive maintenance system) that is capable of not only identifying and correctly clustering disturbances, but also tying them to a real-world activity.

6. Conclusions and Future Work

This paper presents a framework to utilize multivariate time series data to automatically identify reoccurring failure patterns in a real-world smart manufacturing system. To do so, the mixed binary (on/off) and continuous (temperature) data provided by the monitoring system of an industrial drying hopper was analyzed following two different processing pathways. The first approach was built up using a subsequent segmentation and clustering approach, the second branch featured a collaborative method at its core (TICC). Both pathways have been facilitated by a standard PCA analysis (feature fusion) and a hyperparameter optimization (TPE) approach.

The second procedure featuring TICC returned consistently superior results in comparison to all heuristic segmentation methods for the available time series data set. It was therefore expanded and finally enabled the recognition of the three major (most frequent) failure patterns in the time series ("canyon", "shark fin", and "random spike"). Furthermore, it was also able to recognize all disruptions of the steady state but failed to identify all of the less frequent failure patterns as stand-alone clusters. Besides evaluating the statistical accuracy, we furthermore leveraged domain expertise to verify the results and were, e.g., able to identify one event (canyon) as a shutdown process that aligns with weekends and holidays.

Nevertheless, the identified clusters can be used subsequently to enhance the monitoring process of the drying machine even further and to establish a predictive maintenance system in order to foresee potential upcoming machine downtimes. This framework can also be applied to other (related) problems in the field of preventive maintenance and pattern recognition in time series in general. Industrial monitoring and maintenance systems can therefore be extended by adding a failure detection component to enhance the machine restoration process (speed) further. Such algorithms can also be used to automatically analyze large amounts of data generated by the machinery pool, thereby reducing the manual workload necessary for this process.

The presented framework can contribute to reducing unnecessary machine downtimes in the future, improve the troubleshooting process in case a failure occurs, and consequently increase the effective machine running time and overall productivity of the plant. An additional advantage from a managerial point of view is the possibility to automate the process and the improved transparency of the production process.

However, there are also ethical implications that need to be considered aside from the overall impact of the presented research findings on the business process and workflow. On the one hand, the increase in efficiency is not only favorable for the business objectives overall due to the potential cost reduction, but also for society in general given the potential environmental (e.g., reduction in CO2 emissions and overall waste of energy) and working atmosphere (e.g., reduction of unnecessary stress from unplanned machine failure) benefits. On the other hand, those techniques may also present an opportunity for abuse when other types of time series data, such as personal data of workers, are analyzed. For instance, biometric and behavioral data can be monitored and analyzed automatically, potentially leading to issues with regard to privacy and other negative implications for workers. Thus, it is necessary to investigate policy implications hindering inappropriate application of such time series analytics conflicting with individuals' privacy rights.

The presented process and analytical model are influenced by design choices and are thus, to a certain extent, limited. Since the whole performance measuring process (correct number of clusters) is dependent on the manual determination of the true change points and clusters (or subclusters), it reflects a certain subjectivity and dependence on domain knowledge and metadata. Furthermore, the overall clustering approach is initialized through a distance-based algorithm (k-means), which makes it biased in terms of its priors and the subsequent estimation of the Gaussian mixture distributions. The effectiveness of the proposed framework can also not be guaranteed for the entirety of industrial domains since it was only tested on one specific dataset from a particular field. In order to address these challenges, future research should focus on follow-up studies to improve the framework in respect to the algorithmic initialization and study the applicability of the underlying method on similar use cases in the industrial domain. Additionally, comparing the methodology to other approaches in further studies using a comparable data set will increase the transparency and understanding of its value.

Author Contributions: Conceptualization, V.K., T.W.; data curation, V.K.; formal analysis, V.K.; funding acquisition, G.L., T.W.; methodology, M.C.M., T.W.; project administration, T.W.; supervision, M.C.M., G.L., T.W.; writing—original draft, V.K.; writing—review and editing, M.C.M., G.L., T.W. All authors have read and agreed to the published version of the manuscript.

Funding: This work was supported by the J. Wayne & Kathy Richards Faculty Fellowship in Engineering at West Virginia University and the DIGIMAN4.0 project ("DIGItal MANufacturing Technologies for Zero-defect Industry 4.0 Production", http://www.digiman4-0.mek.dtu.dk/). DIGIMAN4.0 is a European Training Network supported by Horizon 2020, the EU Framework Programme for Research and Innovation (Project ID: 814225).

Acknowledgments: The authors express their gratitude to the anonymous reviewers for their valuable comments that helped us to improve the paper significantly.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Sinha, A.; Bernardes, E.; Calderon, R.; Wuest, T. Digital Supply Networks: Transform Your Supply Chain and Gain Competitive Advantage with Disruptive Technology and Reimagined Processes; McGraw-Hill Education: New York, NY, USA, 2020.
2. Özkoç, E.E. Clustering of Time-Series Data. In Data Mining—Methods, Applications and Systems; IntechOpen: London, UK, 2020. [CrossRef]
3. Hallac, D.; Vare, S.; Boyd, S.; Leskovec, J. Toeplitz Inverse Covariance-Based Clustering of Multivariate Time Series Data. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'17), Association for Computing Machinery, New York, NY, USA, 23–27 August 2017; pp. 215–223. [CrossRef]

4. Fontes, C.H.; Pereira, O. Pattern recognition in multivariate time series—A case study applied to fault detection in a gas turbine. Eng. Appl. Artif. Intell. 2016, 49, 10–18. [CrossRef]
5. Truong, C.; Oudre, L.; Vayatis, N. Selective Review of Offline Change Point Detection Methods. Signal Process. 2020, 167, 107299. [CrossRef]
6. Keogh, E.; Chu, S.; Hart, D.; Pazzani, M. An online algorithm for segmenting time series. In Proceedings of the 2001 IEEE International Conference on Data Mining, San Jose, CA, USA, 29 November–2 December 2001; pp. 289–296. [CrossRef]
7. Aghabozorgi, S.; Shirkhorshidi, A.S.; Wah, T.Y. Time-series clustering—A decade review. Inf. Syst. 2015, 53, 16–38. [CrossRef]
8. Izakian, H.; Pedrycz, W.; Jamal, I. Fuzzy clustering of time series data using dynamic time warping distance. Eng. Appl. Artif. Intell. 2015, 39, 235–244. [CrossRef]
9. Glaser, B.; Strauss, A. The Discovery of Grounded Theory: Strategies for Qualitative Research; Aldine: Chicago, IL, USA, 1967.
10. Wolfswinkel, J.; Furtmueller, E.; Wilderom, C. Using Grounded Theory as a Method for Rigorously Reviewing Literature. Eur. J. Inf. Syst. 2013, 22, 45–55. [CrossRef]
11. Mujumdar, A.; Hasan, M. Drying of Polymers; Taylor & Francis Group, LLC: Abingdon, UK, 2016. [CrossRef]
12. Munjal, S.; Kao, C. Mathematical model and experimental investigation of polycarbonate pellet drying. Polym. Eng. Sci. 1990, 30, 1352–1360. [CrossRef]
13. Rosato, D.V. Processing plastic material. In Extruding Plastics; Springer: Boston, MA, USA, 1998. [CrossRef]
14. National Research Council (USA). Committee on Polymer Science and Engineering, Polymer Science and Engineering: The Shifting Research Frontiers; National Academies Press: Washington, DC, USA, 1994. [CrossRef]
15. Keogh, E.J.; Kasetty, S. On the need for time series data mining benchmarks: A survey and empirical demonstration. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (ACM SIGKDD), Edmonton, AB, Canada, 23–26 July 2002; pp. 23–26.
16. Liao, T.W. Clustering of time series data—A survey. Pattern Recognit. 2005, 38, 1857–1874. [CrossRef]
17. Fu, T. A review on time series data mining. Eng. Appl. Artif. Intell. 2011, 24, 164–181. [CrossRef]
18. Spiegel, S.; Gaebler, J.; Lommatzsch, A.; De Luca, E.; Albayrak, S. Pattern recognition and classification for multivariate time series. In Proceedings of the Fifth International Workshop on Knowledge Discovery from Sensor Data (SensorKDD'11), Association for Computing Machinery, San Diego, CA, USA, 21 August 2011; pp. 34–42. [CrossRef]
19. Garg, R.; Aggarwal, H.; Centobelli, P.; Cerchione, R. Extracting Knowledge from Big Data for Sustainability: A Comparison of Machine Learning Techniques. Sustainability 2019, 11, 6669. [CrossRef]
20. Sfetsos, A.; Siriopoulos, C. Time series forecasting with a hybrid clustering scheme and pattern recognition. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 2004, 34, 399–405. [CrossRef]
21. Harchaoui, Z.; Lévy-Leduc, C. Multiple Change-Point Estimation with a Total Variation Penalty. J. Am. Stat. Assoc. 2010, 105, 1480–1493. [CrossRef]
22. Haynes, K.; Fearnhead, P.; Eckley, I.A. A computationally efficient nonparametric approach for changepoint detection. Stat. Comput. 2017, 27, 1293–1305. [CrossRef] [PubMed]
23. Killick, R.; Fearnhead, P.; Eckley, I.A. Optimal Detection of Changepoints with a Linear Computational Cost. J. Am. Stat. Assoc. 2012, 107, 1590–1598. [CrossRef]
24. Idé, T.; Khandelwal, A.; Kalagnanam, J. Sparse Gaussian Markov Random Field Mixtures for Anomaly Detection. In Proceedings of the 2016 IEEE 16th International Conference on Data Mining (ICDM), Barcelona, Spain, 12–15 December 2016; pp. 955–960. [CrossRef]
25. Singh, R.; Pal, B.C.; Jabr, R.A. Statistical Representation of Distribution System Loads Using Gaussian Mixture Model. IEEE Trans. Power Syst. 2010, 25, 29–37. [CrossRef]
26. Forney, G.D. The Viterbi algorithm. Proc. IEEE 1973, 61, 268–278. [CrossRef]
27. Vershynin, R. How Close is the Sample Covariance Matrix to the Actual Covariance Matrix? J. Theor. Probab. 2012, 25, 655–686. [CrossRef]

28. Chang, T.; Hong, M.; Liao, W.; Wang, X. Asynchronous Distributed ADMM for Large-Scale Optimization—Part I: Algorithm and Convergence Analysis. IEEE Trans. Signal Process. 2016, 64, 3118–3130. [CrossRef]
29. Karimi, D.; Salcudean, S.E. Reducing the Hausdorff Distance in Medical Image Segmentation with Convolutional Neural Networks. IEEE Trans. Med. Imaging 2020, 39, 499–513. [CrossRef] [PubMed]

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).