Anomaly Detection in Manufacturing Equipment with Apache Flink : Grand Challenge Yann Busnel, Nicolo Riveei, Avigdor Gal
Total Page:16
File Type:pdf, Size:1020Kb
FlinkMan : Anomaly Detection in Manufacturing Equipment with Apache Flink : Grand Challenge Yann Busnel, Nicolo Riveei, Avigdor Gal To cite this version: Yann Busnel, Nicolo Riveei, Avigdor Gal. FlinkMan : Anomaly Detection in Manufactur- ing Equipment with Apache Flink : Grand Challenge. DEBS ’17 : 11th ACM International Conference on Distributed and Event-based Systems, Jun 2017, Barcelone, Spain. pp.274-279 10.1145/3093742.3095099. hal-01644417 HAL Id: hal-01644417 https://hal.archives-ouvertes.fr/hal-01644417 Submitted on 22 Nov 2017 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Grand Challenge: FlinkMan – Anomaly Detection in Manufacturing Equipment with Apache Flink Nicolo Rivei Yann Busnel Avigdor Gal Technion - Israel Institute of IMT Atlantique / IRISA / UBL Technion - Israel Institute of Technology [email protected] Technology [email protected] [email protected] ABSTRACT analog. ese sensors provide periodic measurements, which are We present a (so) real-time event-based anomaly detection appli- sent to a monitoring base station. e laer receives then a large cation for manufacturing equipment, built on top of the general collection of observations. Analyzing in an ecient and accurate purpose stream processing framework Apache Flink. e anom- way, this very-high-rate – and potentially massive – stream of aly detection involves multiple CPUs and/or memory intensive events is the core of the Grand Challenge. Although, the analysis tasks, such as clustering on large time-based window and parsing of a massive amount of sensor reading requires an on-line analytics input data in RDF-format. e main goal is to reduce end-to-end pipeline that deals with linked-data, clustering as well as a Markov latencies, while handling high input throughput and still provide model training and querying. exact results. Given a truly distributed seing, this challenge also e FlinkMan system proposes a solution to the 2017 Grand entails careful task and/or data parallelization and balancing. We Challenge, making use of a publicly available streaming engine and propose FlinkMan, a system that oers a generic and ecient so- thus oering a generic solution that is not specially tailored for lution, which maximizes the usage of available cores and balances this or that challenge. We oer an ecient solution that maximally the load among them. We illustrates the accuracy and eciency utilizes available cores, balances the load among the cores, and of FlinkMan, over a 3-step pipelined data stream analysis, that avoids to the extent possible tasks such as garbage collection that includes clustering, modeling and querying. are only indirectly related to the task at hand. is rest of the paper is organized as follows. Section 2 presents CCS CONCEPTS the query engine pipeline, the data set and the evaluation platform, that are provided for this challenge. Section 3 introduces the general •Information systems →Stream management; •eory of com- architecture of our solution and its rationale. Finally, Section 4 putation →Distributed algorithms; Unsupervised learning and clus- provides details of the implementation as well as the optimizations tering; included in our solution. KEYWORDS 2 PROBLEM STATEMENT Anomaly Detection, Stream Processing, Clustering, Markov Chains, Linked-Data e overall goal is to detect anomalies in manufacturing machines based on a stream of measurements produced by the sensors embed- ded into the monitored equipments. e events produced by each 1 INTRODUCTION sensor are clustered and the state transitions between the clusters Stream processing management system (SPMS) and/or Complex are used to train a Markov model. In turn, the produced Markov Event Processing (CEP) systems gain momentum In performing model is used to detect anomalies. A sequence of transitions that analytics on continuous data streams. eir ability to achieve sub- follows a low probability path in the Markov chain is considered as second latencies, coupled with their scalability, makes them the abnormal, and is agged as an anomaly. preferred choice for many big data companies. Supporting this trend, since 2011, the ACM International Conference on Distributed 2.1 ery Event-based Systems (DEBS) launched the Grand Challenge series e anomaly detection analysis can be modeled as a pipeline with to increase the focus on these systems as well as provide common three stages: (i) clustering, (ii) Markov model training and (iii) benchmarks to evaluate and compare them. e ACM DEBS 2017 Markov model querying (i.e., output transition sequences with low Grand Challenge focuses on (so) real-time anomaly detection in probability). ese three steps are executed continuously on a manufacturing equipment [4]. To handle continuous monitoring, time-based sliding window and the whole pipeline is performed each machine is ed with a vast array of sensors, either digital or independently for each sensor of each machine. e query has 6 Permission to make digital or hard copies of all or part of this work for personal or parameters: the time-based sliding window size W (in seconds), the classroom use is granted without fee provided that copies are not made or distributed initial number of clusters k (non uniform among sensors), the maxi- for prot or commercial advantage and that copies bear this notice and the full citation mum number of iterations of the clustering algorithm M (if conver- on the rst page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permied. To copy otherwise, or republish, gence has not been reached), the clustering algorithm convergence to post on servers or to redistribute to lists, requires prior specic permission and/or a distance µ, the length of the Markov model path we consider for fee. Request permissions from [email protected]. computing the anomaly probability N , and the probability thresh- DEBS’17, Barcelona, Spain © 2017 ACM. 978-1-4503-5065-5...$15.00 old T below which the path is classied as anomaly. Each event DOI: 10.1145/3093742.3095099 goes through all the mentioned stages so that a single event may DEBS’17, June 19 - 23, 2017, Barcelona, Spain Nicolo Rivei, Yann Busnel, and Avigdor Gal Event Physical Timestamp Logical Timestamp Value 1/3 r0 = ( 1485903716000; 1155; −0:04 ) r = ( 1485903717000; 1165; −0:04 ) r r6 1 0 5 r = ( 1485903718000; 1175; +0:02 ) 2 1/3 r = ( 1485903719000; 1185; −0:0 ) 1/5 r7 3 1/3 r = ( 1485903720000; 1195; −0:01 ) 2 3/5 4 1 r8 r5 = ( 1485903721000; 1205; −0:04 ) 1/5 r6 = ( 1485903722000; 1215; +0:0 ) 1 r9 r7 = ( 1485903723000; 1225; −0:02 ) r8 = ( 1485903724000; 1235; +0:0 ) Last N transitions Probability r9 = ( 1485903725000; 1245; +0:02 ) r5 : 2 ! 0 P1 = P2!0 = 1=5 = 0:2 Table 1: Example of an input window of size W = 10. r6 : 0 ! 2 P2 = P1 × 1=3 = 1=15 ≈ 0:666 r7 : 2 ! 2 P3 = P2 × 3=5 = 1=25 = 0:04 r8 : 2 ! 2 P4 = P3 × 3=5 = 3=125 ≈ 0:024 r9 : 2 ! 1 P5 = P4 × 1=5 = 1=3215 < T = 1=200 C0 C2 C1 r0 r3 ) r5 trigger an anomaly r1 r6 r2 r5 r7 r4 r8 r9 Figure 2: Trained Markov model and probability of the ter- Init : -0.05 -0.04 -0.03 -0.02 -0.01 0 0.01 0.02 0.03 minal path of length N = 5, with a threshold T = 0:005. c0 c2 c1 P1 = 1=5 to occur. Following the 5-step path from r4 to r9, this se- quence has a probability of 1=3215 to happen, which is way below C0 C2 C1 the anomaly threshold (set to 0:005 for this toy example). r0 r3 r1 r6 r2 r r r r r Iter1 : 5 7 4 8 9 2.2 Dataset Iter2 : -0.05 -0.04 -0.03 -0.02 -0.01 0 0.01 0.02 0.03 e molding machines of our dataset are equipped with a large array of sensors, measuring various parameters of their processing c0 c2 c1 including distance, pressure, time, frequency, volume, tempera- ture, time, speed, and force. e dataset is encoded as RDF [20] (Resource Description Framework) triples using Turtle [19] and Figure 1: Clustering (k-means) with k = . 3 consists of two types of inputs, namely a stream of measurements and a meta-data le. e stream measurements contain a sequence change the clustering, modify the Markov model, and trigger an of observation groups, a 120 dimensional vector with the events anomaly detection. It is worth noting that the detected anomalies from all sensors for a single time-tick and machine. It is notewor- must be ordered with respect to the ordering in the input stream. thy that the vector contains a mix of dierent value types, e.g., text and numerical values. Each observation group is marked with a Example 2.1. Table 1 contains an input windows of size W = 10. physical timestamp and has a machine identier. In addition, each Clustering First, the clustering algorithm groups all readings event contains a sensor identier, a sensor reading and a sensor (from r0 to r9) into k = 3 clusters. To do so, a k-means algo- type. Each machine outputs a (complete) observation group once rithm is initialized: the cluster centers are set to the k rst values every second. W is the size in time of the sliding window and, in encountered (represented as c0, c1 and c2 in the Init part of Fig- steady state, the exact count of the sliding window.