International Journal of Advances in Electronics and Computer Science, ISSN: 2393-2835, Volume-4, Issue-2, Feb.-2017, http://iraj.in

A COMPARATIVE STUDY OF LOSSLESS COMPRESSION TECHNIQUES

1J P SATI, 2M J NIGAM

1,2 Indian Institute of Technology, Roorkee, India. E-mail: [email protected], [email protected]

Abstract- As we deal with more and more digital data, several compression techniques have been developed to meet the increasing need to store more data in less memory. Compression can save storage capacity, speed up file transfer, and decrease costs for storage hardware and network bandwidth. This paper provides a performance analysis of lossless compression techniques with respect to various parameters such as compression ratio, compression factor, saving percentage, and compression and de-compression time. It presents the relevant data about variations in these parameters and describes the possible causes for them. The simulation results are obtained in MATLAB R2009a. The paper focuses on the de-compression time and the reasons for the differences observed in the comparison.

Keywords- Run Length Encoding (RLE), Huffman, Arithmetic, Lempel-Ziv-Welch (LZW), Compression Ratio.

I. INTRODUCTION

Data compression is a technique that transforms data from one representation to another new (compressed) representation, which contains the same information but with the smallest possible size [13]. The size of the data is reduced by removing excessive or redundant information. The data is then stored or transmitted at reduced storage and/or communication costs. Compressing a file to half of its original size is equivalent to doubling the capacity of the storage medium. It may then become feasible to store the data at a higher level of the storage hierarchy and reduce the load on the input/output channels of the system.

Fig 1: Compression and de-compression process

There are two classes of compression techniques, lossless and lossy compression. In a lossless compression scheme, the reconstructed image is the same as the input image. Lossless image compression techniques first convert the image into a stream of pixel values; processing is then done on each single pixel. The first step includes prediction of the next image pixel value from the neighbourhood pixels. In the second stage, the difference between the predicted value and the actual intensity of the next pixel is coded using different encoding methods.

Lossy compression techniques provide a higher compression ratio than lossless compression. In this method, the compressed image is not the same as the original image; there is some amount of loss in the image. In lossy compression, much information can be simply discarded from image data, audio data or video data, and when the data are uncompressed they will still be of acceptable quality.

II. LOSSLESS COMPRESSION METHODS

Commonly used lossless compression techniques are Run Length Encoding (RLE), Huffman coding, Arithmetic coding and Lempel-Ziv-Welch (LZW) coding.

2.1 Run length coding
Run-length encoding (RLE) is a data compression algorithm that is supported by most file formats, such as TIFF, BMP, and PCX. RLE is suited to compressing any type of data regardless of its information content, but the content of the data will affect the compression ratio achieved by RLE. RLE is both easy to implement and quick to execute, making it a good alternative to either using a complex compression algorithm or leaving the image data uncompressed.

RLE works by reducing the physical size of a repeating string of characters. This repeating string, called a run, is typically encoded into two bytes. The first byte represents the number of characters in the run and is called the run count. In practice, an encoded run may contain 1 to 128 or 256 characters; the run count usually contains the number of characters minus one (a value in the range of 0 to 127 or 255). The second byte is the value of the character in the run, which is in the range of 0 to 255, and is called the run value.

A run of 15 A's would normally require 15 bytes to store:
AAAAAAAAAAAAAAA


The same string after RLE would require only two bytes: 15A. This compression technique is useful for monochrome images or images having the same background pixels. The implementation of run-length encoding was carried out in MATLAB R2009a. The steps for executing the code are as follows, and an illustrative sketch follows the list:
• Convert the image into a greyscale image.
• Read the greyscale image and rearrange the image data as a single row vector.
• Convert all intensity values to binary form and obtain a binary stream representation of the image.
• Count the consecutive 1's and 0's appearing in the sequence and store them as the run-length-encoded sequence.
• Reconstruct the original image.
• Calculate the compression ratio as the ratio of the original image size to the size of the run-length-encoded sequence.
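The paper's own implementation is in MATLAB R2009a and is not reproduced here. As a minimal illustrative sketch of the counting step above (not the authors' code; the function names and the list-of-bits representation are our assumptions), a run-length encoder and decoder over a bit sequence might look like this in Python:

# Illustrative sketch only: the paper's own code is in MATLAB R2009a.
# Encodes a flat sequence of bits as (symbol, run-length) pairs and back.

def rle_encode(bits):
    """Collapse consecutive repeats of 0/1 into (symbol, count) pairs."""
    runs = []
    prev, count = bits[0], 1
    for b in bits[1:]:
        if b == prev:
            count += 1
        else:
            runs.append((prev, count))
            prev, count = b, 1
    runs.append((prev, count))
    return runs

def rle_decode(runs):
    """Expand (symbol, count) pairs back into the original bit sequence."""
    bits = []
    for symbol, count in runs:
        bits.extend([symbol] * count)
    return bits

# Toy usage: a run of fifteen identical symbols collapses to a single
# pair, mirroring the paper's "15 bytes -> 15A" example.
stream = [1] * 15 + [0] * 5
encoded = rle_encode(stream)        # [(1, 15), (0, 5)]
assert rle_decode(encoded) == stream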
2.2 Huffman coding
Huffman coding is a variable-length coding technique in which codes are assigned to symbols based on their probabilities. Symbols are generated from the pixels of an image, and bits are assigned to them on the basis of the frequency of occurrence of the symbols: fewer bits are assigned to the symbols that occur more frequently, while more bits are assigned to the symbols that occur less frequently. In Huffman coding, the generated binary code of any symbol is not the prefix of the code of any other symbol [3] [5]. The implementation of Huffman coding was carried out in MATLAB R2009a. The steps for executing the code are as follows, with a sketch after the list:
• Convert the image into a greyscale image.
• Read the greyscale image and convert the array into a single row vector.
• From the greyscale image, form a Huffman encoding tree using the probabilities of the symbols in the image.
• Encode each symbol independently using the Huffman encoding tree.
• Reconstruct the original image by decompressing it using Huffman decoding.
• Calculate the compression ratio as the ratio of the original image size to the size of the Huffman-coded sequence.
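The paper does not list its MATLAB code, so the following Python sketch (our illustration; the names and data layout are assumptions) shows one standard way to build the Huffman code table from symbol frequencies with a min-heap, so that frequently occurring symbols receive shorter prefix-free codes:

# Illustrative sketch only (the paper uses MATLAB R2009a): building a
# Huffman code table from symbol frequencies with a min-heap.
import heapq
from collections import Counter

def huffman_code(symbols):
    """Return a prefix-free code {symbol: bitstring} from symbol counts."""
    counts = Counter(symbols)
    if len(counts) == 1:                      # degenerate one-symbol input
        return {next(iter(counts)): "0"}
    # Heap entries: (frequency, tie-breaker, {symbol: partial code}).
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)       # two least frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

pixels = [0, 0, 0, 0, 0, 1, 1, 2]             # toy greyscale "image"
table = huffman_code(pixels)                  # frequent symbols get short codes
encoded = "".join(table[p] for p in pixels)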

2.3 Arithmetic encoding
Arithmetic coding is also a variable-length coding technique. In this technique, the entire sequence of symbols generated from the pixels is converted into a single floating-point number, also termed a binary fraction. In arithmetic coding, a tag is generated for the sequence which is to be encoded. This tag signifies the given binary fraction and becomes the unique binary code for the sequence. The unique binary code generated for a given sequence of a given length does not depend on the entire length of the sequence [1] [4] [10]. The implementation of arithmetic coding was carried out in MATLAB R2009a [12]. The steps for executing the code are as follows, with a sketch after the list:
• Convert the image into a greyscale image.
• Read the greyscale image and store all the intensity values as a single row vector.
• Convert the matrix into binary form and arrange all the bits in a binary stream representing the same image.
• Encode the entire stream using the arithmetic encoding algorithm.
• Calculate the compression ratio as the ratio of the original image size to the size of the arithmetic-coded sequence.
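Again, the MATLAB implementation is not shown in the paper. As a hedged illustration of the tag idea described above (our code, using floating-point arithmetic that is only reliable for short sequences; production coders use integer renormalisation), the interval-narrowing process for a binary source might look like:

# Illustrative sketch only: a floating-point "tag" version of arithmetic
# coding for short binary sequences.

def arithmetic_tag(bits, p0):
    """Shrink [low, high) once per input bit; return the interval midpoint."""
    low, high = 0.0, 1.0
    for b in bits:
        split = low + (high - low) * p0       # boundary between '0' and '1'
        if b == 0:
            high = split                      # '0' keeps the lower part
        else:
            low = split                       # '1' keeps the upper part
    return (low + high) / 2                   # any point inside identifies it

def arithmetic_decode(tag, p0, n):
    """Replay the same interval splits to recover n bits from the tag."""
    low, high, bits = 0.0, 1.0, []
    for _ in range(n):
        split = low + (high - low) * p0
        if tag < split:
            high = split
            bits.append(0)
        else:
            low = split
            bits.append(1)
    return bits

msg = [0, 0, 1, 0, 1, 0, 0, 0]
tag = arithmetic_tag(msg, p0=0.75)            # skewed source -> wide intervals
assert arithmetic_decode(tag, 0.75, len(msg)) == msg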

2.4 Lempel-Ziv-Welch (LZW) coding
The LZW compression algorithm is a dictionary-based algorithm. This means that instead of tabulating character counts and building trees (as in Huffman encoding), LZW encodes data by referencing a dictionary. It represents variable-length strings of symbols with fixed-length codes. The original version of this method was created by Lempel and Ziv in 1978 (LZ78) and was further refined by Welch in 1984, hence the LZW acronym.
Dictionary-based coding schemes are of two types, static and adaptive. In static dictionary-based coding, the dictionary size is fixed during the encoding and decoding processes, while in adaptive dictionary-based coding, the dictionary is updated and reset when it is completely filled. Since images are used as data, static coding suits the compression job with minimum delay [3] [6] [7]. The implementation of LZW encoding was carried out in MATLAB R2009a [6] [7]. The steps for executing the code are as follows, with a sketch after the list:
• Convert the image into a greyscale image.
• Read the image and arrange all the intensity values in a single row vector.
• Convert all the values to binary form and obtain a single-row binary representation.
• Initialize the dictionary with the basic symbols 1 and 0.
• Start encoding and decoding based on a search-and-find method: add any new word found to the dictionary and encode the sequence.
• If the dictionary is completely filled, continue using the same dictionary.
• Calculate the compression ratio as the ratio of the original image size to the size of the encoded sequence.
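A minimal Python sketch of the listed steps (our illustration, not the authors' MATLAB code) initialises the dictionary with the basic symbols '0' and '1', extends it with each new phrase, and stops growing it once full, as in the static scheme described above:

# Illustrative sketch only: LZW over a binary string with the dictionary
# initialised to the two basic symbols "0" and "1".

def lzw_encode(data, max_size=4096):
    """Emit dictionary indices for the longest phrases already seen."""
    dictionary = {"0": 0, "1": 1}
    phrase, out = "", []
    for ch in data:
        candidate = phrase + ch
        if candidate in dictionary:
            phrase = candidate                # keep extending the match
        else:
            out.append(dictionary[phrase])
            if len(dictionary) < max_size:    # static: stop growing when full
                dictionary[candidate] = len(dictionary)
            phrase = ch
    out.append(dictionary[phrase])
    return out

codes = lzw_encode("0001110001110000")        # indices into the phrase table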

III. EVALUATION AND COMPARISON

3.1 Performance Parameters
Depending on the nature of the application, there are various criteria to measure the performance of a compression algorithm. The following measurement parameters are used to evaluate the performance of lossless compression algorithms.

Compression ratio is the ratio between the size of the compressed file and the size of the source file:

    compression ratio = size after compression / size before compression

Table 1: Compression ratio

Compression factor is the inverse of the compression ratio, that is, the ratio between the size of the source file and the size of the compressed file:

    compression factor = size before compression / size after compression

Table 2: Compression factor

Saving percentage calculates the shrinkage of the source file as a percentage:

    saving percentage = 1 − (size after compression / size before compression)

Table 3: Saving percentage

Bits per pixel is the number of bits per pixel used in the compressed representation of the image.

    bits per pixel = 8 / compression ratio

Table 4: Bits per pixel
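As a small worked example of the four formulas above (our helper function, not from the paper; note that the experimental sections report the compression ratio the other way up, as original size over compressed size, so the plotted values exceed 1):

# Illustrative helper for the Section 3.1 parameters of an 8-bit image.
def compression_metrics(size_before, size_after):
    ratio = size_after / size_before          # compression ratio (as defined above)
    factor = size_before / size_after         # compression factor = 1 / ratio
    saving = 1.0 - size_after / size_before   # saving percentage, as a fraction
    bpp = 8.0 / factor                        # bits per pixel = 8 / (before/after)
    return ratio, factor, saving, bpp

# A 10000-byte source compressed to 4000 bytes:
print(compression_metrics(10000, 4000))       # (0.4, 2.5, 0.6, 3.2)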

Along with the above parameters, the compression and de-compression times are also used to measure effectiveness.

Compression and De-Compression Time
The times taken for compression and for decompression should be considered separately. For some applications, such as transferring compressed video data, the de-compression time is more important, while for other applications both compression and decompression times are equally important. If the compression and decompression times of an algorithm are small, or at an acceptable level, the algorithm is acceptable with respect to the time factor.

3.1.1 Results with real images
The following images, with sizes given in Table 1, are used for comparison.

Fig 2: Test images

As seen in Tables 1, 2, 3 and 4, the relative compression ratios, compression factors, saving percentages and bits per pixel are displayed, respectively, for each technique used for compression. Among them all, run length encoding shows the maximum compression ratio, but the run length algorithm simply works to reduce inter-pixel redundancy, which exists only when extreme shades are significant. Since most real-world images lack such dominance of shades, RLE is rarely used nowadays for lossless data compression. Considering the available data about the compression ratio, the Huffman encoding scheme is found to be optimum, since it solely works on reducing redundancy in the input data.


Though arithmetic encoding seems to generate results closest to Huffman encoding, it also considers inter-pixel redundancy, which reduces the compression factor. Lempel-Ziv-Welch encoding depends entirely on the dictionary size as the key factor in achieving greater compression ratios; thus, with smaller dictionary sizes, its compression results are inferior to those of the other compression techniques.

3.2 Comparison wrt. Compression ratios

The variations of compression ratio with respect to the probability of zero are obtained by generating random images of the same size, 50x50 pixels, while changing the probability of the symbol zero in steps of 0.1. The images shown in Fig 3 are used for this purpose.

Fig 3: Test images to compare the compression ratios

Fig 4: CR against probability of 0's for Run length coding

Fig 5: CR against probability of 0's for Huffman coding

Fig 6: CR against probability of 0's for Arithmetic coding

Fig 7: CR against probability of 0's for LZW coding

The graphs in Figs 4, 5, 6 and 7 show the variation in compression ratio with respect to the probability of zero in the image to be compressed. Across all the graphs, we find that, irrespective of the technique used for compression, the compression ratio increases as the probability of zero approaches either zero or one. The results show that the minimum value of the compression ratio occurs when the probability of zero is 0.5, irrespective of the method used. This can be understood from the standard entropy of binary data with respect to the probabilities of occurrence of the symbols. When the probabilities of zero and one are equal (each 0.5), the information content is maximum. As we move towards extreme probabilities, the redundancy in the information becomes more and more significant. Thus, with any lossless technique, the compression results are best when the symbol probabilities lie at either extreme, as the entropy sketch below illustrates.

The Huffman coding shows an almost linear increase and decrease in compression ratio as we move away from the centre probability, while the other methods show nonlinearity. This is because Huffman coding is based purely on modifying the information by assigning bits to the respective symbols, whereas the other techniques modify the data by counting repeating symbols, splitting a probability range, or building a dictionary, all of which are nonlinear operations. Therefore, the variations in compression ratio are nonlinear for the other techniques.
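The entropy argument can be made concrete. The following sketch (our illustration, not in the paper) evaluates the binary entropy function H(p), whose peak of 1 bit per symbol at p = 0.5 explains why every method's compression ratio bottoms out there:

# Our illustration: the binary entropy function underlying the U-shaped
# compression-ratio curves in Figs 4-7. H(p) peaks at 1 bit/symbol when
# p = 0.5 and falls to 0 at the extremes, which is why every lossless
# coder does worst on equiprobable data.
import math

def binary_entropy(p0):
    """Shannon entropy (bits/symbol) of a binary source with P(0) = p0."""
    if p0 in (0.0, 1.0):
        return 0.0
    p1 = 1.0 - p0
    return -p0 * math.log2(p0) - p1 * math.log2(p1)

for p0 in (0.1, 0.3, 0.5, 0.7, 0.9):
    # For an i.i.d. binary source, 1/H(p0) bounds the achievable
    # compression factor per input bit.
    print(f"P(0) = {p0:.1f}   H = {binary_entropy(p0):.3f} bits/symbol")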

3.3 Comparison of Compression time
To compare the compression times, random images of different sizes but the same probabilities of zero are taken. Three data sets are generated, with the probability of the symbol zero for images a), b) and c) in Fig 8 being 0.25, 0.5 and 0.75 respectively. Ten samples of each image are used, with the image size varying from 1 kB to 10 kB; a sketch of such a test harness follows.

Fig 8: Test images to compare the compression and de-compression time
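The paper's timing harness is in MATLAB; as a sketch of the procedure just described (our code; rle_encode refers to the earlier RLE sketch, and any of the other encoders could be substituted), one might generate random bit-images with a chosen probability of zero and average the encoding delay over ten samples per size:

# Our sketch of the test procedure (the paper's harness is in MATLAB):
# generate random binary images with a chosen probability of zero and
# time an encoder over sizes from 1 to 10 kilobytes.
import random
import time

def random_bits(n_bytes, p_zero):
    """n_bytes * 8 random bits with P(bit == 0) = p_zero."""
    return [0 if random.random() < p_zero else 1 for _ in range(n_bytes * 8)]

def time_encoder(encode, p_zero, sizes_kb=range(1, 11), samples=10):
    """Mean encoding delay (seconds) per image size, as in Figs 9-12."""
    delays = []
    for kb in sizes_kb:
        total = 0.0
        for _ in range(samples):
            bits = random_bits(kb * 1024, p_zero)
            start = time.perf_counter()
            encode(bits)
            total += time.perf_counter() - start
        delays.append(total / samples)
    return delays

# e.g. time_encoder(rle_encode, p_zero=0.25) using the earlier RLE sketch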


It is obvious that as the size of the image increases, the compression time also increases. However, the compression time profile changes if there is any variation in the probability of zero while the size variations are kept the same.

Fig 9: Compression time variations against size of image for Run length encoding

Fig 9 shows the compression time profile for the run length compression technique, in which the processing delay is independent of the probability of zero. The principle of run length encoding is simply counting the number of identical consecutive symbols (both 0's and 1's) in the sequence. Thus, whatever the probability of zero may be, the encoding process is unaffected by it, and hence no delay variations are observed with variations in the probability of zero for the same image size.

Fig 10: Compression time variations against size of image for Huffman encoding

Fig 10 shows the compression time profile for Huffman coding, where the compression time reduces as the probability of zero in the image is increased. For the Huffman compression technique, compression is basically a rearrangement of bits according to the content of the information. As per the Huffman encoding tree structure in MATLAB, the coding first performs the assignment of 0 and then the assignment of 1. Thus, the more zeroes there are, the more 1-assignments appear in the coding table, and the greater the delay for encoding the entire data.

Fig 11: Compression time variations against size of image for Arithmetic encoding

In the arithmetic coding of Fig 11, the entire probability range is first segmented according to the probability of the symbol to be fetched. This step is repeated until the end of the sequence, and the final value at the centre of the segment is treated as the encoded value. In the case of binary data, the arithmetic encoding process changes the current segment whenever there is a transition from 1 to 0 or from 0 to 1, which is a delaying process. If the probability of either symbol is low, the transitions between the symbols (0 and 1) are also few; therefore, the segment changes are less frequent and the delay is smaller. Hence, when the probability of zero is 0.5, the observed delay is maximum, as the transitions are at their maximum. The delay reduces as we move away from the equiprobable point.

Fig 12: Compression time variations against size of image for LZW encoding

The Lempel-Ziv-Welch (LZW) compression technique is based entirely on the formation of a dictionary, unlike the other, probability-dependent techniques. Thus, in Fig 12, it can be seen that the compression time is almost the same for any given image size, irrespective of the probability of the symbols.

3.4 Comparison of De-compression time
The de-compression time is calculated for the same images which were used for the compression time.


Fig 13: De-compression time variations against size of image for Run length encoding

Fig 13 shows the de-compression time profile for the run length compression technique, in which the processing delay is independent of the probability of zero. In de-compression, the encoded data is simply converted back into runs. Therefore, no delay variations are observed with variations in the probability of zero for the same image size.

Fig 14: De-compression time variations against size of image for Huffman method

Fig 14 shows the de-compression time profile for Huffman coding: as the number of zeros in an image increases, the 1-assignments in the coding table increase, and hence the delay for decoding the entire data is greater.

Fig 15: De-compression time variations against size of image for Arithmetic encoding

In the case of arithmetic coding, as shown in Fig 15, the de-compression time is very small compared to the compression time. When the probability of zero is 0.5, the observed delay is maximum, as the transitions are at their maximum; the delay reduces as we move away from the equiprobable point.

Fig 16: De-compression time variations against size of image for LZW encoding

The de-compression time for the Lempel-Ziv-Welch (LZW) compression technique, as shown in Fig 16, does not vary for any given image size, irrespective of the probability of the symbols.

CONCLUSION

An experimental comparison of different lossless compression algorithms for image data has been carried out. Several existing lossless compression methods were compared for their effectiveness. Considering the compression ratios, compression times and decompression times of all the algorithms, the following conclusions are drawn:
• The Huffman method is found to be better than the other techniques, since it follows an optimal method to remove redundancy from the given data.
• The compression ratio achieved is maximum when one of the symbols (either 0 or 1) has a much greater probability than the other in the data.
• The relative comparison of compression times shows that the RLE and LZW methods do not show any significant change in delay with a change in the probability of the symbols, while for the Huffman and arithmetic methods the symbol probabilities affect the compression time.
• Similar to the compression time analysis, the de-compression time for the Huffman and arithmetic methods varies with the symbol probabilities, whereas it does not show any significant change for the RLE and LZW methods with a change in the probability of the symbols.

REFERENCES

[1]. Dhananjay Patel, Vinayak Bhogan & Alan Janson, "Simulation and Comparison of Various Lossless Data Compression Techniques based on Compression Ratio and Processing Delay", International Journal of Computer Applications (0975-8887), Vol. 81, No. 14, November 2013.
[2]. Mohammed Al-laham & Ibrahiem M. M. El Emary, "Comparative Study Between Various Algorithms of Data Compression Techniques", Proceedings of the World Congress on Engineering and Computer Science 2007 (WCECS 2007), October 24-26, 2007, San Francisco, USA.


[3]. Sonal Dinesh Kumar, "A Study of Various Image Compression Techniques", Proceedings of COIT, RIMT Institute of Engineering and Technology, Pacific, 2000, pp. 799-803.
[4]. Amir Said, "Introduction to Arithmetic Coding - Theory and Practice", Imaging Systems Laboratory, HP Laboratories Palo Alto, HPL-2004-76, April 21, 2004.
[5]. Huffman D.A., "A Method for the Construction of Minimum-Redundancy Codes", Proceedings of the Institute of Radio Engineers, 40(9), pp. 1098-1101, September 1952.
[6]. Ziv J. and Lempel A., "A Universal Algorithm for Sequential Data Compression", IEEE Transactions on Information Theory, 23(3), pp. 337-342, May 1977.
[7]. Ziv J. and Lempel A., "Compression of Individual Sequences via Variable-Rate Coding", IEEE Transactions on Information Theory, 24(5), pp. 530-536, September 1978.
[8]. Subramanya A, "Image Compression Technique," IEEE Potentials, Vol. 20, Issue 1, pp. 19-23, Feb-March 2001.
[9]. David Jeff Jackson & Sidney Joel Hannah, "Comparative Analysis of Image Compression Techniques," System Theory 1993, Proceedings SSST '93, 25th Southeastern Symposium, pp. 513-517, 7-9 March 1993.
[10]. Khalid Sayood, "Introduction to Data Compression", 3rd Edition, San Francisco, CA, Morgan Kaufmann, 2000.
[11]. T. Bhaskara Reddy, Hema Suresh Yaragunti, S. Kiran, T. Anuradha, "A Novel Approach of Lossless Image Compression using Hashing and Huffman Coding", International Journal of Engineering Research & Technology (IJERT), ISSN: 2278-0181, Vol. 2, Issue 3, March 2013.
[12]. Paul G. Howard and Jeffrey Scott Vitter, "Arithmetic Coding for Data Compression", Proceedings of the IEEE, Vol. 82, No. 6, June 1994.
[13]. Amit Jain, Kamaljit I. Lakhtaria, Prateek Srivastava, "A Comparative Study of Lossless Compression Algorithm on Text Data", Proc. of Int. Conf. on Advances in Computer Science, AETACS, Elsevier, 2013.
[14]. S.R. Kodituwakku, U.S. Amarasinghe, "Comparison of Lossless Data Compression Algorithms for Text Data", Indian Journal of Computer Science and Engineering, Vol. 1, No. 4, pp. 416-425.



