A Simple Data Compression Algorithm for Anomaly Detection in Wireless Sensor Networks
International Journal of Pure and Applied Mathematics, Volume 117, No. 19, 2017, 403-410, Special Issue. ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version). url: http://www.ijpam.eu

1Uthayakumar J, 2Vengattaraman T, 3Dr. J. Amudhavel
1Research Scholar, Department of Computer Science, Pondicherry University, Puducherry, India
2Assistant Professor, Department of Computer Science, Pondicherry University, Puducherry, India
3Associate Professor, Department of CSE, KL University, Andhra Pradesh, India
1*[email protected], [email protected], [email protected]

Abstract: A Wireless Sensor Network (WSN) consists of numerous sensor nodes and is deeply embedded in the real world for environmental monitoring. As the sensor nodes are battery powered, energy efficiency is considered an important design issue in WSN. Since data transmission consumes more energy than sensing and processing of data, much research has been carried out to reduce the number of data transmissions. Data compression (DC) techniques are commonly used to reduce the amount of data transmitted. On the other hand, anomaly detection is also a challenging task in WSN, needed to enhance data integrity. To achieve this data integrity, the sensor nodes append labels to the sensed data to differentiate actual values from abnormal values. The label can be represented as ‘0’ for actual data and ‘1’ for anomaly data. In this paper, we employ the Lempel Ziv Markov-chain Algorithm (LZMA) to compress the labeled data in WSN. LZMA is a lossless data compression algorithm which is well suited for real time applications. The LZMA algorithm compresses the labeled data, which is then transmitted to the Base Station (BS) via single hop and multi hop communication. Extensive experiments were performed using a real world labeled WSN dataset. To assess the effectiveness of the LZMA algorithm, it is compared with 5 well-known compression algorithms, namely Deflate, Lempel Ziv Welch (LZW), Burrows Wheeler Transform (BWT), Huffman coding and Arithmetic coding (AC). Compared with these existing methods, LZMA achieves significantly better compression, with an average compression ratio of 0.0104 at a bit rate of 0.839.

Keywords: Anomaly detection; Data compression; Lempel Ziv Welch; Multi hop communication; Wireless Sensor Networks

1. Introduction

The recent advancement in wireless networks and Micro-Electro-Mechanical-Systems (MEMS) has led to the development of low cost, compact and smart sensor nodes. A WSN is randomly deployed in the sensing field to measure physical parameters such as temperature, humidity, pressure, vibration, etc. [1]. WSN is widely used in tracking and data gathering applications, including surveillance (indoor and outdoor), healthcare, disaster management, habitat monitoring, etc. [2]. A sensor node is built from four components, namely transducer, microcontroller, battery, and transceiver. The sensor nodes are constrained in energy, bandwidth, memory and processing capabilities. As the sensor nodes are battery powered and are usually deployed in harsh environments, it is not easy to recharge or replace batteries [3]. The lifetime of a WSN can be extended in two ways: increasing the battery storage capacity and effectively utilizing the available energy. Increasing the battery capacity is not possible in all situations. So, the effective utilization of available energy is considered an important design issue. Several researchers observed that a large amount of energy is spent on data transmission when compared to sensing and processing operations [4]. This observation reveals that reducing the amount of data transmitted is an effective way to achieve energy efficiency. Data transmission is the most energy consuming task, and because the sensed data is strongly temporally correlated, much of what is transmitted is redundant. DC is considered a useful approach to eliminate the redundancy in the sensed data [5].

A DC technique represents data in a compact form without compromising the data quality beyond a certain extent. It is used to compress text, image, audio, video, etc. [6]. The compact form of any data is achieved by recognizing and exploiting patterns that exist in the data. DC is divided into two types based on the quality of the reconstructed data: lossless compression and lossy compression [7]. Lossy compression involves a loss of quality in the reconstructed data. It achieves better compression and is useful in situations where the loss of data quality is acceptable, for example images, audio and video. In other situations, the loss of information is unacceptable and the reconstructed data should be an exact replica of the original data [8]. The basic idea of compressing data involves two steps: eliminating redundant data and eliminating irrelevant data. The redundancy inherent in real world data makes data compression possible. The removal of data that cannot be perceived by the human eye is termed irrelevancy reduction. The reduction in the amount of data makes it possible to store a larger amount of information in the same storage space and reduces the transmission time significantly. This property is highly useful in WSN for compressing sensed data [9].

Another important challenge in WSN is to handle the integrity of data sensed by the sensors. This requirement leads to a research problem known as anomaly detection [10]. It plays a major role in intrusion detection and fault diagnosis. Any misbehavior or anomaly must be detected for the reliable and secure functioning of the network. Anomaly detection is useful in WSN to identify abnormal variations in the sensing field [11]. It is a process of raising an alert when a significant change occurs. For instance, consider a WSN that monitors environmental conditions such as temperature and humidity level for forest fire detection. When a sensor malfunctions or a fire breaks out, the sensed value will vary drastically from the normal values. These abnormal conditions are identified and notified to the BS for further investigation. Anomaly detection operates in two ways when integrated into WSN: centralized approaches and distributed approaches. In centralized approaches, the sensor node senses the environment and transmits the sensed data to the BS. Only the BS identifies whether the data is actual data or anomaly data. However, this traditional approach makes the sensor node send all the raw or erroneous measurements to the BS. This wastes energy by transmitting a large number of raw sensor measurements. In distributed approaches, the sensor nodes sense the field and identify the anomalies using an anomaly detection algorithm [12]. The sensor node appends a label to the sensed value to represent anomalies. This label is used to differentiate between normal data and anomaly data. In this paper, we employ an LZMA lossless compression algorithm to compress labeled data in WSN.

1.1 Contribution of this paper

The contribution of the paper is summarized as follows: (i) A lossless LZMA compression algorithm is used to compress labeled WSN data. (ii) Two labeled WSN datasets (temperature and humidity) are used in both single hop and multi hop communication. (iii) LZMA results are compared with 5 well-known compression algorithms, namely Huffman coding, AC, LZW, BWT and Deflate, in terms of Compression Ratio (CR), Compression Factor (CF) and Bits per character (BPC).

1.2 Organization of this paper

The rest of the paper is organized as follows: Section 2 explains the different types of classical DC techniques and anomaly detection techniques in WSN. Section 3 presents the LZMA compression algorithm for labeled data in WSN. Section 4 explains the performance evaluation in single hop as well as multi hop scenarios. Section 5 concludes with the highlighted contributions, future work, and recommendations.

2. Related Work

Energy efficiency is the major design issue in WSN. Clustering and routing are the most widely used energy efficient techniques [13]. Numerous clustering and routing techniques have been developed and can be found in the literature [14], [15]. Data compression is an alternative way to achieve energy efficiency. DC techniques have been presented in [16]. The popular coding methods are Huffman coding, Arithmetic coding, Lempel Ziv coding, Burrows-Wheeler transform, RLE, and scalar and vector quantization.

Huffman coding [17] is the most popular coding technique and effectively compresses data in almost all file formats. It is a type of optimal prefix code which is widely employed in lossless data compression. It is based on two observations: (1) In an optimum code, frequently occurring symbols are mapped to shorter code words than symbols that appear less frequently. (2) In an optimum code, the two least frequently occurring symbols have code words of the same length. The basic idea is to assign variable length codes to input characters depending on their frequency of occurrence. The output is the variable length code table for coding a source symbol. It is uniquely decodable and consists of two components: constructing the Huffman tree from the input sequence and traversing the tree to assign codes to characters. Huffman coding is still popular because of its simpler implementation, faster compression and lack of patent coverage. It is commonly used in text compression.

one code. Typically, an LZW code is 12 bits long (4096 codes). The first 256 entries (0-255) represent

AC [18] is another important coding technique for generating variable length codes. It is superior to Huffman coding in various aspects. It is highly useful in situations where the source contains a small alphabet with skewed probabilities. When a string is encoded using arithmetic coding, frequently occurring symbols are coded with fewer bits than rarely occurring symbols. It converts the input data into a floating point number in the range of 0 to 1.
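
As a minimal sketch of how arithmetic coding maps a string to a number in the range of 0 to 1, the Python example below narrows the interval [0, 1) once per symbol and then picks a value inside the final interval. The function names, the fixed probability model for the '0'/'1' labels, and the use of exact fractions instead of floating point are assumptions made for illustration. This is not the implementation evaluated in the paper, and it omits the bit-level output and termination handling of a practical coder.

```python
from fractions import Fraction


def build_ranges(probs):
    """Map each symbol to its cumulative sub-interval [low, high) of [0, 1)."""
    ranges, low = {}, Fraction(0)
    for symbol, p in probs.items():
        ranges[symbol] = (low, low + p)
        low += p
    return ranges


def ac_encode(message, probs):
    """Narrow [0, 1) once per symbol; return the final interval (low, high)."""
    ranges = build_ranges(probs)
    low, high = Fraction(0), Fraction(1)
    for symbol in message:
        width = high - low
        sym_low, sym_high = ranges[symbol]
        low, high = low + width * sym_low, low + width * sym_high
    return low, high


def ac_decode(value, length, probs):
    """Recover `length` symbols from a value inside the encoded interval."""
    ranges = build_ranges(probs)
    out = []
    for _ in range(length):
        for symbol, (sym_low, sym_high) in ranges.items():
            if sym_low <= value < sym_high:
                out.append(symbol)
                # Rescale so the next symbol is read from [0, 1) again.
                value = (value - sym_low) / (sym_high - sym_low)
                break
    return "".join(out)


# Hypothetical skewed model for the labels: '0' = actual, '1' = anomaly.
probs = {"0": Fraction(9, 10), "1": Fraction(1, 10)}
labels = "0000100000"
low, high = ac_encode(labels, probs)
code = (low + high) / 2  # any value in [low, high) identifies the message
print(float(code), ac_decode(code, len(labels), probs) == labels)
```

The width of the final interval equals the product of the symbol probabilities, so each frequent '0' shrinks the interval only slightly while the rare '1' shrinks it sharply. A wider final interval can be identified with fewer bits, which is why frequently occurring symbols cost less than rare ones.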