International Journal of Pure and Applied Mathematics Volume 117 No. 19 2017, 403-410 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu Special Issue ijpam.eu

A SIMPLE ALGORITHM FOR ANOMALY DETECTION IN WIRELESS SENSOR NETWORKS

Uthayakumar J (1*), Vengattaraman T (2), Dr. J. Amudhavel (3)
(1) Research Scholar, Department of Computer Science, Pondicherry University, Puducherry, India
(2) Assistant Professor, Department of Computer Science, Pondicherry University, Puducherry, India
(3) Associate Professor, Department of CSE, KL University, Andhra Pradesh, India
(*) Corresponding author

Abstract: A Wireless Sensor Network (WSN) consists of numerous sensor nodes and is deeply embedded into the real world for environmental monitoring. As the sensor nodes are battery powered, energy efficiency is considered an important design issue in WSN. Since data transmission consumes more energy than sensing and processing of data, much research has been carried out to reduce the number of data transmissions. Data compression (DC) techniques are commonly used to reduce the amount of data transmitted. On the other side, anomaly detection is also a challenging task in WSN to enhance data integrity. To achieve this data integrity, the sensor nodes append labels to the sensed data to differentiate the actual value from the abnormal value. The label is '0' for actual data and '1' for anomaly data. In this paper, we employ the Lempel Ziv Markov-chain Algorithm (LZMA) to compress the labeled data in WSN. LZMA is a lossless data compression algorithm which is well suited for real time applications. The LZMA algorithm compresses the labeled data, which is then transmitted to the Base Station (BS) via single hop and multi hop communication. Extensive experiments were performed using a real world labeled WSN dataset. To ensure the effectiveness of the LZMA algorithm, it is compared with 5 well-known compression algorithms, namely Deflate, Huffman coding, Lempel Ziv Welch (LZW), Burrows Wheeler Transform (BWT), and Arithmetic Coding (AC). Compared with the existing methods, LZMA achieves significantly better compression, with an average compression ratio of 0.0104 at a bit rate of 0.839 BPC.

Keywords: Anomaly detection; Data compression; Lempel Ziv Welch; Multi hop communication; Wireless Sensor Networks

1. Introduction

Recent advances in wireless networks and Micro-Electro-Mechanical Systems (MEMS) have led to the development of low-cost, compact and smart sensor nodes. A WSN is randomly deployed in the sensing field to measure physical parameters such as temperature, humidity, pressure, vibration, etc [1]. WSN is widely used in tracking and data gathering applications, including surveillance (indoor and outdoor), healthcare, disaster management, habitat monitoring, etc [2]. A sensor node is built up of four components, namely transducer, microcontroller, battery, and transceiver. The sensor nodes are constrained in energy, bandwidth, memory and processing capabilities. As the sensor nodes are battery powered and are usually deployed in harsh environments, it is not easy to recharge or replace batteries [3]. The lifetime of a WSN can be extended in two ways: increasing the battery storage capacity and effectively utilizing the available energy. Increasing the battery capacity is not possible in all situations. So, the effective utilization of available energy is considered an important design issue.

Several researchers observed that a large amount of energy is spent on data transmission when compared to sensing and processing operations [4]. This reveals that reducing the amount of data transmitted is an effective way to achieve energy efficiency. Data transmission is the most energy-consuming task, and the sensed data typically exhibit strong temporal correlation. DC is considered a useful approach to eliminate the redundancy in the sensed data [5]. A DC technique represents the data in a compact form without compromising the data quality beyond a certain extent. It is used to compress text, image, audio, video, etc [6]. The compact form of any data can be achieved by recognizing and exploiting the patterns that exist in the data. DC is divided into two types based on the quality of the reconstructed data: lossless compression and lossy compression [7]. Lossy compression involves a loss of quality in the reconstructed data. It achieves better compression and is useful in situations where the loss of data quality is acceptable, for example images, audio, and video. In some situations, the loss of information is unacceptable and the reconstructed data should be the


exact replica of the original data [8]. The basic idea of compressing data involves two steps: eliminating redundant data and eliminating irrelevant data. The redundancy present in real world data makes data compression possible. The removal of data which cannot be perceived by the human eye during compression is termed irrelevancy reduction. Reducing the amount of data makes it possible to store more information in the same storage space and reduces the transmission time significantly. This property is highly useful in WSN to compress sensed data [9].

Another important challenge in WSN is to handle the integrity of data sensed by the sensors. This requirement leads to a research problem known as anomaly detection [10]. It plays a major role in intrusion detection and fault diagnosis. Any misbehavior or anomaly must be detected for the reliable and secure functioning of the network. Anomaly detection is useful in WSN to identify abnormal variations in the sensing field [11]. It is the process of raising an alert when a significant change occurs. For instance, consider a WSN that monitors environmental conditions such as temperature and humidity for forest fire detection. When a sensor malfunctions or a fire breaks out, the sensed value will vary drastically from the actual values. These abnormal conditions are identified and notified to the BS for further investigation. Anomaly detection operates in two ways when integrated into WSN: centralized approaches and distributed approaches. In centralized approaches, the sensor node senses the environment and transmits the sensed data to the BS. Only the BS identifies whether the data is actual data or anomaly data. However, this traditional approach makes the sensor node send all the raw or erroneous measurements to the BS. This wastes energy by transmitting a large number of raw sensor measurements. In distributed approaches, the sensor nodes sense the field and identify the anomalies using an anomaly detection algorithm [12]. The sensor node appends a label to the sensed value to represent anomalies. This label is used to differentiate between normal data and anomaly data. In this paper, we employ the LZMA lossless compression algorithm to compress labeled data in WSN.

1.1 Contribution of this paper

The contribution of the paper is summarized as follows: (i) a lossless LZMA compression algorithm is used to compress labeled WSN data; (ii) two labeled WSN datasets (temperature and humidity) from both single hop and multi hop communication are used; and (iii) the LZMA results are compared with 5 well-known compression algorithms, namely Huffman coding, AC, LZW, BWT and the Deflate algorithm, in terms of Compression Ratio (CR), Compression Factor (CF) and bits per character (BPC).

1.2 Organization of this paper

The rest of the paper is organized as follows: Section 2 explains the different types of classical DC techniques and anomaly detection techniques in WSN. Section 3 presents the LZMA compression algorithm for labeled data in WSN. Section 4 describes the performance evaluation, including the metrics and the datasets from single hop as well as multi hop scenarios. Section 5 discusses the results. Section 6 concludes with the highlighted contributions, future work, and recommendations.

2. Related Work

Energy efficiency is the major design issue in WSN. Clustering and routing are the most widely used energy efficient techniques [13]. Numerous clustering and routing techniques have been developed and can be found in the literature [14], [15]. Data compression is an alternative way to achieve energy efficiency. DC techniques have been presented in [16]. The popular coding methods are Huffman coding, Arithmetic coding, Lempel Ziv coding, Burrows-Wheeler transform, run-length encoding (RLE), and scalar and vector quantization.

Huffman coding [17] is the most popular coding technique and effectively compresses data in almost all file formats. It is a type of optimal prefix code which is widely employed in lossless data compression. It is based on two observations: (1) in an optimum code, frequently occurring symbols are mapped to shorter code words than symbols that appear less frequently; (2) in an optimum code, the two least frequent symbols have code words of the same length. The basic idea is to assign variable length codes to input characters depending upon their frequency of occurrence. The output is the variable length code table for coding a source symbol. The code is uniquely decodable, and the method consists of two components: constructing the Huffman tree from the input sequence, and traversing the tree to assign codes to characters. Huffman coding is still popular because of its simple implementation, fast compression and lack of patent coverage. It is commonly used in text compression.

AC [18] is another important coding technique for generating variable length codes. It is superior to Huffman coding in various aspects. It is highly useful in situations where the source contains small alphabets with skewed probabilities. When a string is encoded using arithmetic


coding, frequently occurring symbols are coded with fewer bits than rarely occurring symbols. It converts the input data into a fractional number in the range between 0 and 1. The algorithm is implemented by dividing the interval from 0 to 1 into segments, where the length of each segment is based on the probability of each symbol. The output is then located in the respective segment based on the symbol. It is harder to implement than other methods. There are two versions of arithmetic coding, namely Adaptive Arithmetic Coding and Binary Arithmetic Coding. A benefit of arithmetic coding over Huffman coding is the ability to separate the modeling and coding parts of the compression approach. It is used in image, audio and video compression.

Dictionary based coding approaches are useful in situations where the original data to be compressed involves repeated patterns. They maintain a dictionary of frequently occurring patterns. When a pattern appears in the input sequence, it is coded with an index into the dictionary. When the pattern is not available in the dictionary, it is coded with a less efficient approach. The Lempel-Ziv algorithm (LZ) is a dictionary-based coding algorithm commonly used in lossless file compression. It is widely used because of its adaptability to various file formats. It looks for frequently occurring patterns and replaces them by a single symbol. It maintains a dictionary of these patterns, and the length of the dictionary is set to a particular value. This method is more effective for larger files and less effective for smaller files. For smaller files, the dictionary may be larger than the original file. The two main versions of LZ were developed by Ziv and Lempel in two individual papers in 1977 and 1978, and they are named LZ77 [19] and LZ78 [20]. These algorithms differ significantly in how they search for and find matches. The LZ77 algorithm uses a sliding window and searches for matches within a predetermined distance back from the present position. ZIP and PNG use LZ77 (through the Deflate algorithm). The LZ78 algorithm follows a more conservative approach of appending strings to the dictionary.

LZW is an enhanced version of LZ78 which was developed by Terry Welch in 1984 [21]. The encoder constructs an adaptive dictionary to represent variable-length strings with no prior knowledge of the input. The decoder dynamically constructs the same dictionary as the encoder based on the received codes. Some symbol sequences occur frequently in text data. The encoder saves these sequences and maps each of them to one code. Typically, an LZW code is 12 bits long (4096 codes). The first 256 entries (0-255) represent ASCII codes for individual characters. The remaining 3840 codes (256-4095) are defined by the encoder to represent variable-length strings. UNIX compress, GIF images, the V.42bis modem standard and other formats use LZW coding.

BWT [22], also known as block sorting compression, rearranges the character string into runs of identical characters. It is followed by two techniques to compress the data: the move-to-front transform and RLE. The transformed data is easy to compress when the string consists of runs of repeated characters. The most important feature of BWT is that it is fully reversible and does not require any extra bits. BWT is a "free" method to improve the efficiency of text compression algorithms, at the cost of some additional computation. It is used in bzip2. A simpler lossless data compression coding technique is RLE [23]. It represents sequences of repeated symbols as runs, and the others are termed non-runs. A run is stored as two parts, the data value and the count, instead of the original run. It is effective for data with high redundancy.

3. The LZMA Algorithm on Anomaly Labeled Data

LZMA is a modified version of the Lempel-Ziv algorithm designed to achieve better compression [24]. It is a lossless data compression algorithm based on the principle of dictionary based encoding. LZMA utilizes a complex data structure to encode one bit at a time. It uses a variable length dictionary (maximum size of 4 GB) and is mainly used to encode an unknown data stream. It is capable of compressing data generated at a rate of 10-20 Mbps in a real-time environment. Though it uses a larger dictionary, it still achieves a decompression speed similar to other compression algorithms. The LZ77 algorithm encodes a sequence by referring to existing contents instead of the original data. When no identical byte sequence is available in the existing contents, the address and sequence length are set to '0' and the new symbol is encoded. LZ77 also employs a dynamic dictionary to compress unknown data with the help of the sliding window. LZMA extends the LZ77 algorithm by adding a Delta Filter and a Range Encoder. The Delta Filter transforms the input data stream so that the sliding window compresses it more effectively. It stores or transmits data in the form of differences between sequential values instead of the complete file. The first byte of the delta-encoded output is the first byte of the data stream itself. Each subsequent value is stored as the difference between the current byte and its


previous byte. For continuously varying real time data, delta encoding makes the sliding dictionary more efficient [18, 19]. For example, consider the sample input sequence 2, 3, 4, 6, 7, 9, 8, 7, 5, 3, 4. Applying delta encoding to the input sequence gives the output sequence 2, 1, 1, 2, 1, 2, -1, -1, -2, -2, 1. So, the number of distinct symbols in the input sequence is 8, whereas the number of distinct symbols in the output sequence is only 4.

Static and adaptive dictionaries are the commonly used dictionaries. A static dictionary uses fixed entries and constants based on the application of the text. Adaptive dictionaries take their entries from the text and are generated at run time. A search buffer is employed as a dictionary, and the buffer size is chosen based on the implementation parameters. Patterns in the text are assumed to occur within range of the search buffer. The offset and length are individually encoded, and a bit-mask is also separately encoded. Using an appropriate data structure for the buffer decreases the search time for longest matches. Sliding dictionary encoding is more laborious than decoding, as it requires identifying the longest match. The range encoder encodes all the symbols of the message into a single number to attain a better CR. It deals efficiently with probabilities which are not exact powers of two. The steps involved in the range encoder are listed below.

• Start with a sufficiently large range of integers and a probability estimate for each symbol.
• Divide the initial range into sub-ranges whose sizes are proportional to the probabilities of the symbols they represent.
• Encode each symbol of the message by narrowing the current range down to just the sub-range which corresponds to the next symbol to be encoded.
• The decoder must use the same probability estimates as the encoder, which can either be sent in advance or derived from already transferred data [20].

The overall operation is shown in Fig. 1. LZMA compression is used to compress real time data that is generated rapidly. Initially, the sensor nodes sense the physical environment. The sensed value is tested for anomalies, and a label value is appended by the sensor to every individual sensed value. The label '1' is appended when the sensed data differs from the actual data, i.e. an abnormal value is found. Likewise, the label '0' is appended when the sensed data has no deviation from the actual data, i.e. a normal value is found. The sensor node appends the label to the sensed data and then performs compression. The sensor node runs the LZMA algorithm, which efficiently compresses the labeled data irrespective of the label value. The LZMA algorithm uses the dictionary, the sliding window and the range encoder to compress the labeled data. The compressed data is then transmitted to the BS. The BS receives the compressed data and decompresses it. As LZMA is a lossless compression technique, the reconstructed data is an exact replica of the original data with no loss of information.

Fig. 1. Workflow of LZMA on anomaly labeled data in WSN

4. Performance Evaluation

To ensure the effectiveness of the LZMA algorithm while compressing labeled data, its lossless compression performance is compared with 5 different, well-known compression algorithms, namely Huffman coding, AC, LZW, BWT coding and the Deflate algorithm.

4.1 Metrics

In this section, the metrics used to analyze the compression performance are discussed. The performance metrics are CR, CF, and BPC.

Compression Ratio (CR):
CR is defined as the ratio of the number of bits in the compressed data to the number of bits in the uncompressed data and is given in Eq. (1). A CR value of 0.62 indicates that the data occupies 62% of its


original size after compression. A CR value greater than 1 indicates negative compression, i.e. the compressed data size is higher than the original data size.

$$\mathrm{CR} = \frac{\text{No. of bits in compressed data}}{\text{No. of bits in original data}} \qquad (1)$$

Compression Factor (CF):
CF is a performance metric which is the inverse of the compression ratio. A higher value of CF indicates more effective compression, and a value of CF below 1 indicates expansion.

$$\mathrm{CF} = \frac{\text{No. of bits in original data}}{\text{No. of bits in compressed data}} \qquad (2)$$

Bits per character (BPC):
BPC is the number of bits required, on average, to represent one character of the input data after compression. It is defined as the ratio of the number of bits in the output sequence to the total number of characters in the input sequence.

$$\mathrm{BPC} = \frac{\text{No. of bits in compressed data}}{\text{No. of characters in original data}} \qquad (3)$$

4.2 Dataset Description

For experimentation, two publicly available labeled WSN datasets are used. The labeled WSN dataset consists of temperature, humidity and label values gathered from single-hop and multi-hop scenarios using TelosB motes [25]. The data was collected for 6 hours at a time interval of 5 seconds. The dataset contains labeled data in which the value '0' indicates an actual value and '1' indicates an abnormal value. The data was collected in both indoor and outdoor environments. The description of the labeled WSN dataset is tabulated in Table 1.

Table 1 Dataset Description

5. Results And Discussion

To highlight the good characteristics of LZMA based labeled data compression, it is compared with 5 state-of-the-art approaches. A direct comparison is made with the results of existing methods using the same set of two datasets. Table 2 summarizes the experimental results of the compression algorithms based on three compression metrics: CR, CF and BPC. As evident from Table 2, the overall compression performance of the LZMA algorithm is significantly better than that of the other algorithms on both datasets. It is observed that the LZMA algorithm achieves almost equal compression in both single and multi-hop scenarios. It is also noted that Huffman coding produces poorer results than the other algorithms. The compression performance on 8 different dataset files reveals that the compression algorithms perform very differently depending on the nature of the applied dataset. Among the existing methods, Deflate and BWT achieve almost similar compression performance.

Likewise, Huffman and Arithmetic coding also produce approximately equal compression performance. This is due to the fact that the efficiency of an Arithmetic code is always better than or at least identical to that of a Huffman code. Similar to Huffman coding, Arithmetic coding also tries to calculate the probability of occurrence of particular symbols and to optimize the length of the necessary code. It achieves an optimum which corresponds exactly to the theoretical specifications of information theory. A minor degradation results from inaccuracies caused by correction operations for the interval division.

On the other side, Huffman coding generates rounding errors because its code length is limited to multiples of a bit. The deviation from the theoretical value is larger than the inaccuracy of arithmetic coding. Though LZW achieves better compression than Huffman and arithmetic coding, it fails to achieve better compression than Deflate and BWT.
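The three metrics of Section 4.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: it uses the `lzma` module from the Python standard library, synthetic readings rather than the TelosB dataset, a hypothetical `temperature,humidity,label` text record layout, a simple range check as a stand-in for a real anomaly detector, and one 8-bit byte per character.

```python
import lzma


def label_readings(readings, low, high):
    """Append label 0 (normal) or 1 (anomaly) to each (temperature, humidity) pair.

    The temperature range check is a hypothetical stand-in for a real detector.
    """
    return [(t, h, 0 if low <= t <= high else 1) for t, h in readings]


def serialize(labeled):
    """Encode records as ASCII text, one 'temperature,humidity,label' line each."""
    lines = (f"{t:.1f},{h:.1f},{label}\n" for t, h, label in labeled)
    return "".join(lines).encode("ascii")


def compression_metrics(original, compressed):
    """Return (CR, CF, BPC) following Eqs. (1)-(3), assuming 8 bits per character."""
    original_bits = 8 * len(original)
    compressed_bits = 8 * len(compressed)
    cr = compressed_bits / original_bits
    cf = original_bits / compressed_bits
    bpc = compressed_bits / len(original)  # one character per byte
    return cr, cf, bpc


# Synthetic, slowly varying readings with an occasional temperature spike.
readings = [
    (25.0 + 0.1 * (i % 5) + (40.0 if i % 97 == 0 else 0.0), 60.0 + 0.5 * (i % 3))
    for i in range(2000)
]
labeled = label_readings(readings, low=0.0, high=50.0)

raw = serialize(labeled)            # what the sensor node would compress
packed = lzma.compress(raw)         # what would be transmitted to the BS
assert lzma.decompress(packed) == raw  # lossless: the BS recovers the exact data

cr, cf, bpc = compression_metrics(raw, packed)
print(f"CR={cr:.4f}  CF={cf:.2f}  BPC={bpc:.3f}")
```

Under the 8-bit assumption, BPC is simply 8 times CR, so the two metrics carry the same information at different scales. The absolute values printed here depend on the synthetic data and on the container overhead of the `lzma` module, so they are not comparable with the figures reported for the real dataset.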


Table 2 Comparison results of LZMA with existing methods

LZW works well in situations where the level of redundancy is high. When the dictionary size is increased, the number of bits required for indexing also increases. This limitation makes the results of LZW lag behind Deflate, BWT, and LZMA.

Overall, LZMA compresses more effectively than the existing methods. Generally, dictionary based coding approaches are useful in situations where the original data to be compressed involves repeated patterns. As LZW is a dictionary based method, it produces better results for the labeled WSN dataset because temperature and humidity values are likely to repeat. LZMA combines dictionary coding with range encoding, which enables it to produce significantly higher compression than LZW. Interestingly, LZMA requires a minimum bit rate of 0.749 BPC for the single hop indoor data and a maximum bit rate of 0.922 BPC for the single hop outdoor 2 data. It is also noted that Huffman coding and Arithmetic coding achieve poor performance, with average bit rates of 3.84 BPC and 3.659 BPC respectively. It is observed that LZMA achieves an average CR of 0.104 at a bit rate of 0.839 BPC.

6. Conclusion

This paper employs the Lempel Ziv Markov chain Algorithm (LZMA) lossless compression technique to compress labeled WSN data. The sensor node uses the LZMA algorithm to compress the labeled data and transmits it to the BS via single-hop or multi-hop communication. The proposed method enhances the network lifetime by reducing the amount of data transmitted. At the same time, anomaly data can also be easily identified. The performance of the LZMA method is compared with state-of-the-art approaches such as Arithmetic coding, Huffman coding, BWT, LZW and the Deflate algorithm. Compared with the existing methods, LZMA achieves significantly better compression, with an average CR of 0.0104 at a bit rate of 0.839 BPC. In future, it can be extended to compress real time data of several applications.

References

[1] D. Estrin, J. Heidemann, S. Kumar, and M. Rey, “Next Century Challenges: Scalable Coordination in Sensor Networks,” in Proceedings of the 5th annual ACM, 1999, pp. 263–270.
[2] K. Sohraby, D. Minoli, and T. Znati, Wireless Sensor Networks. 2007.
[3] F. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci, “Wireless sensor networks: a survey,” Comput. Networks, vol. 38, no. 4, pp. 393–422, 2002.
[4] . S. Raghavendra, K. M. Sivalingam, and T. Znati, Wireless Sensor Networks, 1st ed. US: Springer US, 2004.
[5] C. A. Smith, “A Survey of Various Data Compression Techniques,” Int. J. of Recent Technol. Eng., vol. 2, no. 1, pp. 1–20, 2010.
[6] D. Salomon, Data Compression The Complete Reference, 4th ed. Springer, 2007.
[7] K. Sayood, Introduction to Data Compression. 2006.


[8] S. W. Drost and N. . Bourbakis, “A Hybrid system for real-time lossless ,” Microprocess. Microsyst., vol. 25, no. 1, pp. 19–31, 2001.
[9] N. Kimura and S. Latifi, “A Survey on Data Compression in Wireless Sensor Networks,” Proc. Int. Conf. Inf. Technol. coding Comput., pp. 16–21, 2005.
[10] J. W. Branch, B. K. Szymanski, C. Giannella, R. Wolff, and Kargupta, “In-network outlier detection in wireless sensor networks,” in Proc. of ICDCS, 2006.
[11] S. Rajasegarar, J. C. Bezdek, C. Leckie, and M. Palaniswami, “Elliptical anomalies in wireless sensor networks,” ACM Trans. Sens. Networks, vol. 6, no. 1, 2009.
[12] M. Moshtaghi, S. Rajasegarar, C. Leckie, and S. Karunasekera, “Anomaly detection by clustering ellipsoids in wireless sensor networks,” in Proc. of the ISSNIP, 2009.
[13] W. R. Heinzelman, A. Chandrakasan, and H. Balakrishnan, “Energy-efficient communication protocol for wireless microsensor networks,” Proc. 33rd Annu. Hawaii Int. Conf. Syst. Sci., vol. 0, no. c, pp. 3005–3014, 2000.
[14] Sariga and P. Sujatha, “A survey on unequal clustering protocols in Wireless Sensor Networks,” J. King Saud Univ. - Comput. Inf. Sci., 2017.
[15] M. M. Afsar and N. M. H. Tayarani, “Clustering in sensor networks: a literature survey,” J. Netw. Comput. Appl., vol. 46, pp. 198–226, 2014.
[16] T. Srisooksai, K. Keamarungsi, P. Lamsrichan, and K. Araki, “Practical data compression in wireless sensor networks: A survey,” J. Netw. Comput. Appl., vol. 35, no. 1, pp. 37–59, 2012.
[17] D. A. Huffman, “A Method for the Construction of Minimum-Redundancy Codes,” A Method Constr. Minimum-Redundancy Codes, pp. 1098–1102, 1952.
[18] I. H. Witten, R. M. Neal, and J. G. Cleary, “Arithmetic coding for data compression,” Commun. ACM, vol. 30, no. 6, pp. 520–540, 1987.
[19] J. Ziv and A. Lempel, “A Universal Algorithm for Data Compression,” IEEE Trans. Inf. Theory, vol. 23, no. 3, pp. 337–343, 1977.
[20] J. Ziv and A. Lempel, “lz78..” IEEE, pp. 530–536, 1978.
[21] T. A. Welch, “A technique for high-Performance Data Compression,” IEEE, pp. 8–19, 1984.
[22] M. Burrows and D. Wheeler, “A block-sorting lossless data compression algorithm,” Algorithm, Data Compression, no. 124, p. 18, 1994.
[23] J. Capon, “A probabilistic model for run-length coding of pictures,” IRE Trans. Inf. Theory, vol. 100, pp. 157–163, 1959.
[24] Z. Tu and S. Zhang, “A Novel Implementation of JPEG 2000 Lossless Coding Based on LZMA,” in Proceedings of the Sixth IEEE International Conference Computer and Information Technology, 2006.
[25] S. Suthaharan, M. Alzahrani, S. Rajasegarar, C. Leckie, and M. Palaniswami, “Labelled Data Collection for Anomaly Detection in Wireless Sensor Networks,” pp. 269–274, 2010.
