Comparative Analysis of Clustering Techniques in Anomaly Detection in Wind Turbine Data

Journal of Xi'an University of Architecture & Technology, ISSN No: 1006-7930

COMPARATIVE ANALYSIS OF CLUSTERING TECHNIQUES IN ANOMALY DETECTION IN WIND TURBINE DATA

R. Sandhya, PG Scholar, Computer Science and Engineering, PSG College of Technology, Coimbatore, India.
J. Prakash, Assistant Professor, Department of Computer Science and Engineering, PSG College of Technology, Coimbatore, India.
B. Vinoth Kumar, Associate Professor, Department of Information Technology, PSG College of Technology, Coimbatore, India.

Abstract - Wind turbines are now widely used as a source of power production. Monitoring the state of every wind turbine is becoming increasingly important, and checking the condition of turbine parameters to identify significant changes is particularly useful. Identifying and removing faulty readings yields more accurate data, and data imputation further improves accuracy. The main goal of this work is to detect anomalies in the dataset, remove them, and impute the resulting missing values to increase accuracy. Different clustering techniques, namely Isolation Forest, Support Vector Machine, and Local Outlier Factor, are used to spot anomalies in wind turbine performance. The Support Vector Machine gave a better result than the other techniques, providing very high accuracy but very low specificity.

Index Terms - Anomaly Detection, Clustering Techniques, Isolation Forest, Local Outlier Factor, Support Vector Machine, Wind Turbine.

I. INTRODUCTION

Anomalies are also referred to as abnormalities, oddities, irregularities, outliers, exceptions, novelties, or deviations. Anomaly detection is the process of recognizing unexpected items or events in a dataset that differ from the normal data.
Anomaly detection is also known as novelty detection or outlier detection. Unsupervised anomaly detection applies to unlabelled data: it detects outliers in an unlabelled test dataset under the assumption that the majority of instances are normal, by looking for the instances that seem to fit least with the rest of the dataset. Outlier detection is useful in a range of domains, such as intrusion detection, fault detection, system health monitoring, and fraud detection. Anomaly detection is also often used in pre-processing to remove irregular data from a dataset. Eliminating the abnormal data usually results in a statistically significant increase in efficiency.

The main objective of this paper is to determine the most suitable clustering technique among Support Vector Machine, Isolation Forest, and Local Outlier Factor for detecting anomalies in a wind turbine dataset, which is useful for wind energy applications. The remainder of the paper is organized as follows. Section II reviews the literature. Section III describes the experimentation, Section IV presents the evaluation results, and Section V gives the conclusion and future enhancements.

II. LITERATURE SURVEY

In reference [1], the authors proposed a method for the maintenance management of wind turbines. The work uses the PAAD (Performance Analysis and Anomaly Detection) algorithm, which is capable of identifying an outlier and also specifying its root cause. PAAD uses a neural network to identify outliers in the wind turbine dataset, and principal component analysis to point out the root cause of each outlier.
In this work, the dataset used to verify the performance is the SCADA dataset. The benefit of the proposed work is reduced time and cost and increased availability.

Volume XII, Issue III, 2020

In reference [2], a neural network model with back propagation is built, and the mean square error is used to measure the efficiency of the network on the dataset. The results of running the ANN as a feed-forward neural network are compared with those of running it on multicore MapReduce.

In reference [3], wind turbine health is continuously monitored using algorithms such as a failure detection algorithm, which improves reliability and decreases maintenance costs by finding failures before they reach a catastrophic stage. Using SCADA data is an economical way to monitor a wind turbine for early warning of failure. To identify faults, the authors use clustering techniques and principal component analysis to build an anomaly detection algorithm. Anomalous values are identified against a set of normal data.

In reference [4], the authors proposed a methodology for anomaly detection in wind turbines using a Normal Behaviour Model (NBM) on the SCADA dataset. The input parameters are selected by GAPLS (a genetic algorithm combined with partial least squares regression), which removes unnecessary parameters for outlier detection in wind turbines. Back-propagation neural networks are used to model the fourteen temperature parameters of the dataset. The proposed methodology is validated on a fault in a 1.5 MW wind turbine. The prediction error serves as an effective indicator for anomaly detection: evaluation results demonstrate that the NBM has a low prediction error under normal conditions and a high prediction error prior to a fault condition.

In reference [5], the authors explain the importance of the imputation process.
The imputation of missing data is a major concern in machine learning and data mining. A fuzzy-neighbourhood density-based clustering technique is used for the imputation process. The proposed architecture uses a density measure to impute the missing data by clustering similar patterns and finding the best data for each incomplete target pattern. The fuzzy neighbourhood scales are tuned using an invasive weed optimization algorithm. The performance of the proposed technique is evaluated on publicly available data and compared with existing techniques such as fuzzy c-means imputation and k-means imputation. The overall results show the effectiveness of the proposed technique for the imputation of missing data.

III. EXPERIMENTATION

2.1 DATASET: Clustering techniques depend heavily on data. Data is the most crucial aspect that makes algorithm training possible, and it explains why clustering techniques have become so popular in recent years. The dataset used here is the wind turbine dataset. The flow of the experimentation is shown in Figure 1: the input dataset is obtained, the clustering techniques Isolation Forest, Support Vector Machine, and Local Outlier Factor are applied, and each is evaluated with the performance measures.

Figure 1: Experimental flow

2.2 CLUSTERING TECHNIQUES: The most common clustering techniques used for identifying anomalies are Isolation Forest, Support Vector Machine, and Local Outlier Factor.

i) Isolation Forest: Isolation Forest identifies anomalies directly, instead of profiling normal data points. It is a tree ensemble method built on decision trees. Anomalies are less frequent than normal observations and differ from them in value, so random partitioning is used: under random partitioning, anomalies are isolated closer to the root of the tree, with fewer splits necessary.
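The random-partitioning idea behind Isolation Forest can be illustrated with a minimal sketch in plain Python. This is not the implementation used in the paper: the two features (wind speed, power), the synthetic readings, the function names, and the parameters `n_trees`, `sample_size`, and `max_depth` are all assumptions made for illustration. The sketch also omits the path-length normalization that the original Isolation Forest algorithm uses to turn path lengths into anomaly scores.

```python
import random


def path_length(point, data, rng, depth=0, max_depth=8):
    """Number of random axis-aligned splits needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    feature = rng.randrange(len(point))          # pick a random feature
    lo = min(row[feature] for row in data)
    hi = max(row[feature] for row in data)
    if lo == hi:                                 # nothing left to split on
        return depth
    split = rng.uniform(lo, hi)                  # pick a random split value
    # Keep only the rows that fall on the same side of the split as `point`.
    same_side = [row for row in data
                 if (row[feature] < split) == (point[feature] < split)]
    return path_length(point, same_side, rng, depth + 1, max_depth)


def mean_path_length(point, data, n_trees=100, sample_size=64, seed=0):
    """Average isolation depth of `point` over many random subsamples of `data`."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_trees):
        sample = rng.sample(data, min(sample_size, len(data)))
        total += path_length(point, sample, rng)
    return total / n_trees


# Hypothetical readings: (wind speed in m/s, power in kW). Not the paper's data.
gen = random.Random(42)
data = [(gen.gauss(8.0, 1.0), gen.gauss(800.0, 50.0)) for _ in range(200)]
outlier = (20.0, 0.0)     # high wind speed but no power: an assumed fault reading
data.append(outlier)
inlier = (8.0, 800.0)     # a reading at the centre of the normal cluster

print("outlier mean path length:", mean_path_length(outlier, data))
print("inlier mean path length: ", mean_path_length(inlier, data))
```

A shorter mean path length indicates that a point is easier to isolate, which is the property Isolation Forest uses to flag anomalies.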
ii) Local Outlier Factor: This technique is an unsupervised method. It computes the local density deviation of a given data point with respect to its neighbors. It treats as anomalies the points that have a substantially lower density than their neighbors.

iii) Support Vector Machine: An SVM is a supervised machine learning model used for classification [6], typically for two-group classification problems. SVMs are associated with learning algorithms that analyze data for classification and regression. SVM exposes a user-specified parameter known as the penalty factor, which lets users trade off the width of the decision boundary margin against the number of misclassified samples.

2.3 EVALUATION METRICS: The experimental results are presented and explained in this section. The performance of the Local Outlier Factor, the Isolation Forest, and the Support Vector Machine on the wind turbine dataset has been evaluated using accuracy, precision, recall, and F1-score. On the wind turbine data, the Support Vector Machine achieves greater accuracy than Isolation Forest and Local Outlier Factor, with only a negligible difference in precision.

IV. RESULT

Measurements that are close to the known value are said to be accurate; for a classifier, accuracy is the fraction of all samples that are labelled correctly. The Support Vector Machine achieves the best accuracy, 89%, compared with the Local Outlier Factor (79%) and the Isolation Forest (78%).

Figure 2. Accuracy of Clustering Techniques

Measurements that are close to each other are said to be precise; for a classifier, precision is the fraction of samples predicted positive that are actually positive. The Support Vector Machine and the Local Outlier Factor achieve the best precision, 89%, compared with the Isolation Forest (88%).

Figure 3.
Precision of Clustering Techniques

Recall is the fraction of the total number of relevant instances that were actually retrieved; it measures the ability of a classifier to find all the positive samples. The Support Vector Machine achieves the best recall, 100%, compared with the Isolation Forest and the Local Outlier Factor (87%).

Figure 4. Recall of Clustering Techniques

The F1-Score is a measure of a test's accuracy that takes both false negatives and false positives into account.
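To make the relationship between these measures concrete, the sketch below computes accuracy, precision, recall, F1-score, and specificity from confusion-matrix counts. It is only an illustration: the label vectors are invented and are not the paper's results, the function names are hypothetical, and the choice of label 1 as the positive class is an assumption, since the text does not state which class is treated as positive.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary labelling."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred, positive=1):
    """Compute the evaluation measures from the confusion-matrix counts."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Invented labels (assumption): 1 = positive class, 0 = negative class.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this invented example every positive sample is found, so recall is perfect, while one negative sample is mislabelled, which lowers precision and specificity. It shows how a detector can combine high recall with reduced specificity.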
