Journal of Xi'an University of Architecture & Technology Issn No : 1006-7930

COMPARATIVE ANALYSIS OF CLUSTERING TECHNIQUES IN WIND TURBINE DATA

R. Sandhya PG Scholar, Computer Science and Engineering, PSG College of Technology, Coimbatore, India.

J. Prakash Assistant Professor, Department of Computer Science and Engineering, PSG College of Technology, Coimbatore, India.

B. Vinoth Kumar Associate Professor, Department of Information Technology, PSG College of Technology, Coimbatore, India.

Abstract - Wind turbines are now used in many places as a source of power production. Monitoring the state of every wind turbine is becoming very important, and checking the condition of the turbine parameters in order to identify a significant change is particularly useful. Providing accurate data by identifying and removing faulty readings is valuable, and data imputation gives the additional advantage of increasing the accuracy of the data. The main goal of this work is to detect the anomalies in the dataset, remove them, and impute the resulting missing data to increase accuracy. Different clustering techniques, namely Isolation Forest, Support Vector Machine, and Local Outlier Factor, are used to spot anomalies in wind turbine performance. The Support Vector Machine provided a better result than the other clustering techniques because it achieves very high accuracy, although with very low specificity.

Index Terms - Anomaly Detection, Clustering Techniques, Isolation Forest, Local Outlier Factor, Support Vector Machine, Wind Turbine.

I. INTRODUCTION

Anomalies are also referred to as abnormalities, oddities, irregularities, exceptions, novelties, or deviations. Anomaly detection is the process of recognizing unexpected items or events in a dataset that differ from the normal data. It is also known as novelty detection and outlier detection. Unsupervised anomaly detection is a type of anomaly detection that is applicable to unlabelled data: it detects outliers in an unlabelled test dataset under the assumption that the majority of the instances are normal, by looking for the instances that seem to fit least with the rest of the dataset. Outlier detection is desirable because it is applicable in a variety of domains, such as intrusion detection, fault detection, system health monitoring, and fraud detection. Anomaly detection is also often used in pre-processing to remove irregular data from the dataset. Eliminating the abnormal data from the dataset usually results in a statistically significant increase in efficiency. The main objective of this paper is to determine the most suitable clustering technique among Support Vector Machine, Isolation Forest, and Local Outlier Factor for detecting anomalies in the wind turbine dataset, which will be useful for wind energy resources. The remainder of the work is organized as follows. Section II presents the literature survey. Section III discusses the experimentation performed, Section IV presents the evaluation results, and Section V gives the conclusion with future enhancements.

II. LITERATURE SURVEY

In reference [1], the authors proposed a method for the maintenance management of wind turbines. The proposed work uses the PAAD (Performance Analysis and Anomaly Detection) algorithm, which is capable of identifying an outlier and also specifying its root cause. The PAAD algorithm uses a neural network to identify outliers in the wind turbine dataset, and principal component analysis to point out the root cause of each outlier. The dataset used to verify the performance is a SCADA dataset. The benefit of the proposed work is that it reduces time and cost and increases availability.

In reference [2], a neural network model with back propagation was built, and the mean square error was used to measure the efficiency of the network on the dataset. The results were compared between the ANN run as a feed-forward neural network and the same network run on multicore MapReduce.

In reference [3], wind turbine health is continuously monitored using algorithms such as a failure detection algorithm, which improves reliability and decreases maintenance costs by finding failures before they reach a catastrophic stage. Using a SCADA dataset is an economical way to monitor the wind turbine for early warning of failure. To isolate faults, the authors use clustering techniques and principal component analysis to build an anomaly detection algorithm. Anomalous values are identified relative to a set of normal data.

In reference [4], the authors proposed a methodology for anomaly detection in wind turbines using a Normal Behaviour Model (NBM) on a SCADA dataset. The input parameters are selected by GAPLS (a genetic algorithm combined with partial least squares regression), which minimizes the unnecessary parameters for outlier detection in wind turbines. Models for the fourteen temperature parameters of the dataset are developed using a back-propagation neural network. The proposed methodology is validated on a fault of a 1.5 MW wind turbine. The prediction error is used as an effective indicator for anomaly detection in the wind turbine dataset. The evaluation results demonstrate that the NBM has a low prediction error under normal conditions and a high prediction error prior to a fault condition.

In reference [5], the authors explain the importance of the imputation process, with the predominant focus on the imputation of missing data. A fuzzy-neighbourhood density-based clustering technique is used for the imputation process. The proposed architecture uses the density measure to impute the missing data by clustering similar patterns, and discovers the best data for each incomplete target pattern. The fuzzy neighbourhood scales are adjusted using an invasive weed optimization algorithm. The performance of the proposed technique for imputing missing data is evaluated on a publicly available dataset and compared with existing techniques such as fuzzy c-means imputation and k-means imputation. The overall results show the effectiveness of the proposed technique for the imputation of missing data.

III. EXPERIMENTATION

2.1 DATASET: Clustering techniques depend heavily on data. Data is the most crucial aspect that makes algorithm training possible, and it explains why clustering techniques have become so popular in recent years. The dataset used here is the wind turbine dataset. The flow of the experimentation is shown in Figure 1: the input dataset is obtained, the clustering techniques Isolation Forest, Support Vector Machine, and Local Outlier Factor are applied, and the results are evaluated with the performance measures.

Figure 1: Experimental flow

2.2 CLUSTERING TECHNIQUES: The most common clustering techniques used for identifying anomalies are Isolation Forest, Support Vector Machine, and Local Outlier Factor.

i) Isolation Forest: Isolation Forest isolates the rare points directly, instead of profiling the normal data points. It is a tree ensemble method built on decision trees. Anomalies are less frequent than normal observations and differ from them in value. For this reason random partitioning is used: under random partitioning, anomalies are isolated closer to the root of the tree, with fewer splits necessary.
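The idea that anomalies need fewer random splits can be illustrated with a minimal sketch. It is a simplified illustration, not the implementation used in this work: it omits the subsampling and score normalization of the full Isolation Forest algorithm, and the function names, parameters, and the synthetic (wind speed, power) readings are all assumptions made for the example.

```python
import random


def path_length(point, data, rng, depth=0, max_depth=12):
    """Count the random splits needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = rng.randrange(len(point))            # pick a random feature
    lo = min(row[dim] for row in data)
    hi = max(row[dim] for row in data)
    if lo == hi:                               # chosen feature is constant: stop early
        return depth
    split = rng.uniform(lo, hi)                # pick a random split value
    # Follow only the branch that contains `point`.
    branch = [row for row in data if (row[dim] < split) == (point[dim] < split)]
    return path_length(point, branch, rng, depth + 1, max_depth)


def average_path_length(point, data, n_trees=200, seed=0):
    """Average isolation depth over many random partitionings."""
    rng = random.Random(seed)
    return sum(path_length(point, data, rng) for _ in range(n_trees)) / n_trees


# Synthetic (wind speed in m/s, power in kW) readings, invented for illustration.
gen = random.Random(42)
normal = [(gen.gauss(8.0, 1.0), gen.gauss(1500.0, 100.0)) for _ in range(60)]
anomaly = (8.0, 0.0)                           # no power at an ordinary wind speed
readings = normal + [anomaly]

anomaly_depth = average_path_length(anomaly, readings)
normal_depth = sum(average_path_length(p, readings) for p in normal) / len(normal)
print(f"anomaly: {anomaly_depth:.2f} splits, normal average: {normal_depth:.2f} splits")
```

A shorter average path for the anomalous reading than for the normal readings is the signal that Isolation Forest turns into an anomaly score.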


ii) Local Outlier Factor: This technique is an unsupervised method. It computes the local density deviation of a given data point with respect to its neighbors, and it treats as anomalies the cases that have a substantially lower density than their neighbors.
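The density comparison described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the function name, the choice k = 2, and the five toy points are hypothetical, ties in the neighbor ranking are broken by input order, and duplicate points are not handled.

```python
import math


def local_outlier_factor(points, k=2):
    """Return one LOF score per point; scores well above 1 suggest an outlier."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Indices of the k nearest neighbours of each point (excluding itself).
    neighbours = [
        sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for i in range(n)
    ]
    # Distance from each point to its k-th nearest neighbour.
    k_distance = [dist[i][neighbours[i][-1]] for i in range(n)]

    def local_reachability_density(i):
        # Reachability distance: never smaller than the neighbour's own k-distance.
        reach = [max(k_distance[j], dist[i][j]) for j in neighbours[i]]
        return len(reach) / sum(reach)

    density = [local_reachability_density(i) for i in range(n)]
    # LOF: average density of the neighbours relative to the point's own density.
    return [
        sum(density[j] for j in neighbours[i]) / (k * density[i])
        for i in range(n)
    ]


# Four tightly grouped toy points and one isolated point (illustrative data only).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
scores = local_outlier_factor(points, k=2)
for p, s in zip(points, scores):
    print(p, round(s, 2))
```

Points inside the dense group score close to 1, while the isolated point scores far higher because its neighbors are much denser than it is.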

iii) Support Vector Machine: An SVM is a supervised machine learning model used for classification [6]. As a classification technique, it is used for two-group classification problems. SVMs are associated with learning algorithms that analyze data for regression and classification. The SVM exposes a user-specified parameter known as the penalty factor, with which users can trade off the width of the decision boundary margin against the number of misclassified samples.
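The trade-off controlled by the penalty factor can be made explicit with the standard soft-margin formulation, given here as a sketch. The symbols are notation assumed for illustration and do not appear in the original text: w and b define the decision boundary, the slack variables ξ_i measure margin violations, y_i ∈ {−1, +1} are the class labels, and C is the penalty factor.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\,(w^{\top}x_i + b)\ \ge\ 1-\xi_i,\qquad \xi_i \ge 0,\quad i=1,\dots,n
```

A larger C penalizes margin violations more heavily, which favors a narrower margin with fewer misclassified training samples; a smaller C tolerates more violations in exchange for a wider margin.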

2.3 EVALUATION METRICS: The performance of the Local Outlier Factor, the Isolation Forest, and the Support Vector Machine on the wind turbine dataset is evaluated using accuracy, precision, recall, and F1-Score. On the wind turbine data, the Support Vector Machine achieves higher accuracy than the Isolation Forest and the Local Outlier Factor, while the precision of the three techniques differs only negligibly.
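As a minimal sketch of how these four metrics follow from the counts of a confusion matrix, consider the function below. The function name and the counts passed to it are hypothetical and chosen only for illustration; they are not the counts obtained in this work.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)   # share of all correct predictions
    precision = tp / (tp + fp)                   # share of predicted positives that are correct
    recall = tp / (tp + fn)                      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, not taken from the experiments reported here.
metrics = classification_metrics(tp=80, fp=10, fn=0, tn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```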

IV. RESULT

Accuracy is the fraction of all instances that are classified correctly. The Support Vector Machine achieves the best accuracy (89%), ahead of the Local Outlier Factor (79%) and the Isolation Forest (78%).

Figure 2. Accuracy of Clustering Techniques

Precision is the fraction of the instances flagged as positive that are actually positive. The Support Vector Machine and the Local Outlier Factor achieve the best precision (89%), marginally ahead of the Isolation Forest (88%).

Figure 3. Precision of Clustering Techniques


Recall is the fraction of all relevant instances that were actually retrieved; it measures the ability of the classifier to find all the positive samples. The Support Vector Machine achieves the best recall (100%), ahead of the Isolation Forest and the Local Outlier Factor (87% each).

Figure 4. Recall of Clustering Techniques

The F1-Score is a measure of a test's accuracy that takes both false negatives and false positives into account. It is the harmonic mean of recall and precision. The Support Vector Machine achieves the best F1-Score (94%), ahead of the Local Outlier Factor (88%) and the Isolation Forest (87%).

Figure 5. F1-Score of Clustering Techniques

The performance of each algorithm has been computed. Figures 2, 3, 4, and 5 show the plots for accuracy, precision, recall, and F1-Score of the Support Vector Machine, Isolation Forest, and Local Outlier Factor. From the results we can infer that the Support Vector Machine performs at least as well as the Isolation Forest and the Local Outlier Factor on every metric, with the Local Outlier Factor matching it only on precision.

Table - I : Performance Analysis of Clustering Techniques for Wind Turbine Dataset

Techniques               Accuracy   Precision   Recall   F1-Score
Support Vector Machine   88.63%     89%         100%     94%
Local Outlier Factor     78.85%     89%         87%      88%
Isolation Forest         77.77%     88%         87%      87%
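Because the F1-Score is the harmonic mean of precision and recall, the F1 column of Table I can be cross-checked against the other two columns. The short sketch below does this with the reported values entered as fractions; the variable and function names are illustrative assumptions.

```python
# (precision, recall, reported F1) for each technique, taken from Table I.
reported = {
    "Support Vector Machine": (0.89, 1.00, 0.94),
    "Local Outlier Factor": (0.89, 0.87, 0.88),
    "Isolation Forest": (0.88, 0.87, 0.87),
}


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


for name, (precision, recall, f1_reported) in reported.items():
    f1_computed = f1_from(precision, recall)
    # Reported values are rounded to whole percentages, so allow half a point.
    consistent = abs(f1_computed - f1_reported) < 0.005
    print(f"{name}: computed {f1_computed:.4f}, reported {f1_reported:.2f}, consistent={consistent}")
```

For all three techniques the recomputed value rounds to the reported F1-Score.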

The performance results of the Local Outlier Factor, the Isolation Forest, and the Support Vector Machine on the wind turbine dataset are shown in Table I. On the wind turbine data, the Support Vector Machine achieves higher accuracy than the Isolation Forest and the Local Outlier Factor, while the precision of the three techniques differs only negligibly.


V. CONCLUSION

Clustering techniques are widely used for anomaly detection. In this work the dataset is split into training, validation, and testing data, and the clustering models Isolation Forest, Support Vector Machine, and Local Outlier Factor are used to detect anomalous data. The results reveal that the accuracy of the SVM is 88.63%, which is better than that of the other two clustering techniques, Isolation Forest and Local Outlier Factor. For the other metrics, namely recall, precision, and F1-Score, the SVM also performs at least as well as the other techniques. From the results, it is observed that overall the Support Vector Machine outperforms the Isolation Forest and the Local Outlier Factor.

REFERENCES

[1] Peyman Mazidi, Miguel A. Sanz-Bobi, Lina Bertling Tjernberg, “Performance analysis and anomaly detection in wind turbines based on neural networks and principal component analysis”, in 2nd Workshop on Industrial Systems and Energy Technologies (JOSITE2017), Madrid, Spain, 2017.

[2] Prakash J, Bharathi A, “Predicting Flight Delay using ANN with Multicore Map Reduce Framework”, Communication and Power Engineering, De Gruyter, 2016, 280–287.

[3] Kyusung Kim, Wendy Foslien, Girija Parthasarathy, Shuangwen Sheng, Onder Uluyol, Paul Fleming, “Use of SCADA Data for Failure Detection in Wind Turbines”, presented at the 2011 Energy Sustainability Conference and Fuel Cell Conference, Washington, D.C., August 2011, pp. 7-10.

[4] Peng Sun, Jian Li, Yonglong Yan, Xiao Lei, Xiameng Zhang, “Wind turbine anomaly detection using normal behavior models based on SCADA data”, in the International Conference on Software Engineering, 2009.

[5] Roozbeh Razavi-Far, Mehrdad Saif, “Imputation of Missing Data Using Fuzzy Neighbourhood Density-Based Clustering”, in IEEE International Conference on Fuzzy Systems (FUZZ), 2016.

[6] I. Devi, G. R. Karpagam and B. Vinoth Kumar, “A Survey of Machine Learning Techniques”, International Journal of Computational Systems Engineering, Inderscience Publishers, Vol. 3, No. 4, 2017, pp. 203-212.

[7] B. Vinoth Kumar, G. R. Karpagam and N. Vijaya Rekha, “Performance Analysis of Deterministic Centroid Initialization Method for Partitional Algorithms in Image Block Clustering”, Indian Journal of Science and Technology, Vol. 8(S7), 63–73, April 2015.

[8] Prakash J, “Enhanced Mass Vehicle Surveillance System”, J4R, Volume 04, Issue 04, 002, 5-9, June 2018.

Volume XII, Issue III, 2020 Page No: 5683