A Comparative Study of Text Compression Algorithms

International Journal of Wisdom Based Computing, Vol. 1 (3), December 2011

Senthil Shanmugasundaram
Department of Computer Science, Vidyasagar College of Arts and Science, Udumalpet, Tamilnadu, India
E-mail: [email protected]

Robert Lourdusamy
Computer Science & Info. System Department, Community College in Al-Qwaiya, Shaqra University, KSA (Government Arts College, Coimbatore-641 018)
E-mail: [email protected]

Abstract: Data compression is the science and art of representing information in a compact form. For decades, data compression has been one of the critical enabling technologies for the ongoing digital multimedia revolution. There are a lot of data compression algorithms available to compress files of different formats. This paper provides a survey of different basic lossless data compression algorithms. Experimental results and comparisons of the lossless compression algorithms using statistical compression techniques and dictionary based compression techniques were performed on text data. Among the statistical coding techniques, the algorithms Shannon-Fano Coding, Huffman coding, Adaptive Huffman coding, Run Length Encoding and Arithmetic coding are considered. The Lempel-Ziv scheme, which is a dictionary based technique, is divided into two families: those derived from LZ77 (LZ77, LZSS, LZH and LZB) and those derived from LZ78 (LZ78, LZW and LZFG). A set of interesting conclusions are derived on their basis.

I. INTRODUCTION

Data compression refers to reducing the amount of space needed to store data or reducing the amount of time needed to transmit data. The size of data is reduced by removing redundant information. The goal of data compression is to represent a source in digital form with as few bits as possible while meeting the minimum requirement of reconstruction of the original.

Data compression is lossless only if it is possible to exactly reconstruct the original data from the compressed version. Such a lossless technique is used when the original data of a source are so important that we cannot afford to lose any details. Examples of such source data are medical images, text and images preserved for legal reasons, some computer executable files, etc.

Another family of compression algorithms is called lossy, as these algorithms irreversibly remove some parts of the data, so that only an approximation of the original data can be reconstructed. Approximate reconstruction may be desirable since it may lead to more effective compression. However, it often requires a good balance between the visual quality and the computational complexity. Data such as multimedia images, video and audio are more easily compressed by lossy compression techniques because of the way the human visual and hearing systems work.

Lossy algorithms achieve better compression effectiveness than lossless algorithms, but lossy compression is limited to audio, images and video, where some loss is acceptable.

The question of which of the two techniques is better, "lossless" or "lossy", is pointless, as each has its own uses: lossless techniques are better in some cases and lossy techniques in others.

There are quite a few lossless compression techniques nowadays, and most of them are based on dictionaries or on probability and entropy. In other words, they all try to exploit repeated occurrences of the same character or string in the data to achieve compression. This paper examines the performance of the statistical compression techniques Shannon-Fano Coding, Huffman coding, Adaptive Huffman coding, Run Length Encoding and Arithmetic coding. The dictionary based compression technique, the Lempel-Ziv scheme, is divided into two families: those derived from LZ77 (LZ77, LZSS, LZH and LZB) and those derived from LZ78 (LZ78, LZW and LZFG).

The paper is organized as follows: Section I contains a brief introduction to compression and its types, Section II presents a brief explanation of statistical compression techniques, Section III discusses dictionary based compression techniques, Section IV focuses on comparing the performance of statistical coding techniques and Lempel-Ziv techniques, and the final section contains the conclusion.

II. STATISTICAL COMPRESSION TECHNIQUES

2.1 RUN LENGTH ENCODING TECHNIQUE (RLE)

One of the simplest compression techniques, known as Run-Length Encoding (RLE), was created especially for data with strings of repeated symbols (the length of such a string is called a run). The main idea is to encode repeated symbols as a pair: the length of the run and the symbol. For example, the string ‘abbaaaaabaabbbaa’ of length 16 bytes (characters) is represented as 7 integers plus 7 characters, which can easily be encoded in 14 bytes (for example as ‘1a2b5a1b2a3b2a’). The biggest problem with RLE is that in the worst case the size of the output data can be twice the size of the input data. To eliminate this problem, each pair (the lengths and the strings separately) can later be encoded with an algorithm such as Huffman coding.
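A minimal Python sketch of this pairing scheme is given below (the identifiers are purely illustrative); it encodes a string into the (length, symbol) pairs described above and decodes them again:

    def rle_encode(text):
        # Scan each run of identical symbols and emit a (length, symbol) pair.
        pairs = []
        i = 0
        while i < len(text):
            run = 1
            while i + run < len(text) and text[i + run] == text[i]:
                run += 1
            pairs.append((run, text[i]))
            i += run
        return pairs

    def rle_decode(pairs):
        # Expand each (length, symbol) pair back into a run.
        return "".join(symbol * length for length, symbol in pairs)

    pairs = rle_encode("abbaaaaabaabbbaa")   # the 16-character example above
    # pairs == [(1, 'a'), (2, 'b'), (5, 'a'), (1, 'b'), (2, 'a'), (3, 'b'), (2, 'a')]
    assert rle_decode(pairs) == "abbaaaaabaabbbaa"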
2.2 SHANNON FANO CODING

The Shannon-Fano algorithm was developed simultaneously by Claude Shannon (Bell Labs) and R.M. Fano (MIT) [3,15]. It is used to encode messages depending upon their probabilities. It allots fewer bits to highly probable messages and more bits to rarely occurring messages. The algorithm is as follows:

1. For a given list of symbols, develop a frequency or probability table.
2. Sort the table according to frequency, with the most frequently occurring symbol at the top.
3. Divide the table into two halves, with the total frequency count of the upper half as close as possible to the total frequency count of the bottom half.
4. Assign the upper half of the list the binary digit '0' and the lower half the digit '1'.
5. Recursively apply steps 3 and 4 to each of the two halves, subdividing groups and adding bits to the codes until each symbol has become a corresponding leaf on the tree.

Generally, Shannon-Fano coding does not guarantee that an optimal code is generated. The Shannon-Fano algorithm is more efficient when the probabilities are close to negative powers of 2.
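A compact recursive Python sketch of steps 1 to 5 follows (illustrative only; when several split points are equally balanced, the choice among them is a matter of convention):

    from collections import Counter

    def shannon_fano_codes(table, prefix=""):
        # `table` is a list of (symbol, frequency) pairs sorted by
        # descending frequency (steps 1 and 2).
        if len(table) == 1:
            return {table[0][0]: prefix or "0"}
        # Step 3: find the split where the two halves' totals are closest.
        total = sum(freq for _, freq in table)
        running, split, best = 0, 1, total
        for i in range(1, len(table)):
            running += table[i - 1][1]
            if abs(total - 2 * running) < best:
                best, split = abs(total - 2 * running), i
        # Step 4: '0' for the upper half, '1' for the lower half;
        # step 5: recurse on each half.
        codes = shannon_fano_codes(table[:split], prefix + "0")
        codes.update(shannon_fano_codes(table[split:], prefix + "1"))
        return codes

    text = "this is an example of shannon fano coding"
    table = sorted(Counter(text).items(), key=lambda kv: (-kv[1], kv[0]))
    codes = shannon_fano_codes(table)
    encoded = "".join(codes[ch] for ch in text)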
2.3 HUFFMAN CODING

The Huffman coding algorithm [6] is named after its inventor, David Huffman, who developed this algorithm as a student in a class on information theory at MIT in 1950. It is a widely used and successful method for text compression. Huffman's idea is to replace fixed-length codes (such as ASCII) by variable-length codes, assigning shorter codewords to the more frequently occurring symbols and thus decreasing the overall length of the data. When using variable-length codewords, it is desirable to create a (uniquely decipherable) prefix code, avoiding the need for a separator to determine codeword boundaries. Huffman coding creates such a code.

The Huffman algorithm is not very different from the Shannon-Fano algorithm. Both algorithms employ a variable-length probabilistic coding method. The two algorithms differ significantly in the manner in which the binary tree is built: Huffman uses a bottom-up approach and Shannon-Fano uses a top-down approach.

The Huffman algorithm is simple and can be described in terms of creating a Huffman code tree. The procedure for building this tree is as follows (a code sketch is given after Section 2.4):

1. Start with a list of free nodes, where each node corresponds to a symbol in the alphabet.
2. Select the two free nodes with the lowest weight from the list.
3. Create a parent node for these two selected nodes; its weight is equal to the sum of the weights of the two child nodes.
4. Remove the two child nodes from the list and add the parent node to the list of free nodes.
5. Repeat the process from step 2 until only a single tree remains.

After building the Huffman tree, the algorithm creates a prefix code for each symbol of the alphabet simply by traversing the binary tree from the root to the leaf that corresponds to the symbol, assigning 0 for a left branch and 1 for a right branch.

The algorithm presented above is called semi-adaptive or semi-static Huffman coding, as it requires knowledge of the frequency of each symbol of the alphabet. Along with the compressed output, either the Huffman tree with the Huffman codes for the symbols or just the frequencies of the symbols used to create the Huffman tree must be stored. This information is needed during the decoding process, and it is placed in the header of the compressed file.

2.4 ADAPTIVE HUFFMAN CODING

The basic Huffman algorithm suffers from the drawback that to generate Huffman codes it requires the probability distribution of the input set, which is often not available. Moreover, it is not suitable for cases in which the probabilities of the input symbols change. The Adaptive Huffman coding technique was developed, based on Huffman coding, first by Newton Faller [2] and Robert G. Gallager [5] and then improved by Donald Knuth [8] and Jeffrey S. Vitter [17,18]. In this method, a different approach based on the sibling property is followed to build the Huffman tree (a tree has the sibling property if every node except the root has a sibling and the nodes can be listed in order of non-increasing weight with each node adjacent to its sibling [5]). Here, both sender and receiver maintain dynamically changing Huffman code trees whose leaves represent the characters seen so far. Initially the tree contains only the 0-node, a special node representing messages that have yet to be seen. The Huffman tree includes a counter for each symbol, and the counter is updated every time a corresponding input symbol is coded. The tree under construction remains a valid Huffman tree as long as the sibling property is retained, which is checked after each update; if the sibling property is violated, the tree has to be restructured to restore it. Usually this algorithm generates codes that are more effective than those of static Huffman coding.
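The code sketch referred to in Section 2.3 follows. It builds the tree bottom-up (steps 1 to 5) with a priority queue of free nodes and then assigns codewords by the root-to-leaf traversal; the identifiers are illustrative:

    import heapq
    from collections import Counter

    def huffman_code_table(freq):
        # Step 1: one free node per symbol; the integer tie-breaker keeps
        # heap entries comparable when weights are equal.
        heap = [(weight, i, symbol)
                for i, (symbol, weight) in enumerate(sorted(freq.items()))]
        heapq.heapify(heap)
        next_id = len(heap)
        while len(heap) > 1:
            # Steps 2-4: merge the two lowest-weight free nodes under a
            # parent whose weight is the sum of its children's weights.
            w1, _, left = heapq.heappop(heap)
            w2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (w1 + w2, next_id, (left, right)))
            next_id += 1                      # step 5: repeat until one tree
        # Traverse root to leaf: 0 for a left branch, 1 for a right branch.
        codes = {}
        def walk(node, prefix):
            if isinstance(node, tuple):       # internal node
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
            else:                             # leaf: record the codeword
                codes[node] = prefix or "0"   # single-symbol edge case
        walk(heap[0][2], "")
        return codes

    text = "this is an example of a huffman tree"
    codes = huffman_code_table(Counter(text))
    encoded = "".join(codes[ch] for ch in text)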

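The FGK and Vitter algorithms restructure the tree incrementally using the sibling property; a faithful implementation is too long to reproduce here. The following simplified Python sketch, which reuses huffman_code_table from the previous sketch, captures only the adaptive idea of Section 2.4: encoder and decoder update identical counts after every symbol, so they always derive the same code and no frequency header needs to be transmitted. It naively rebuilds the code table after each symbol and assumes a fixed, pre-agreed alphabet in place of the 0-node mechanism, so it illustrates the principle rather than the actual FGK/Vitter update:

    def adaptive_encode(text, alphabet):
        # Both sides start from the same agreed state: every symbol counted once.
        counts = {symbol: 1 for symbol in alphabet}
        bits = []
        for ch in text:
            bits.append(huffman_code_table(counts)[ch])  # code from current counts
            counts[ch] += 1                              # update after coding
        return "".join(bits)

    def adaptive_decode(bits, alphabet):
        counts = {symbol: 1 for symbol in alphabet}
        inverse = {c: s for s, c in huffman_code_table(counts).items()}
        out, buffer = [], ""
        for bit in bits:
            buffer += bit
            if buffer in inverse:            # a complete codeword (prefix code)
                symbol = inverse[buffer]
                out.append(symbol)
                counts[symbol] += 1          # mirror the encoder's update
                inverse = {c: s for s, c in huffman_code_table(counts).items()}
                buffer = ""
        return "".join(out)

    alphabet = "abcdefghijklmnopqrstuvwxyz "
    message = "adaptive huffman coding"
    assert adaptive_decode(adaptive_encode(message, alphabet), alphabet) == message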