Journal of Computer Science

Original Research Paper Huffman Based Code Generation Algorithms: Perspectives

Ahsan Habib, M. Jahirul Islam and M. Shahidur Rahman

Department of Computer Science and Engineering, Shahjalal University of Science and Technology, Sylhet, Bangladesh

Article history: Received 18-07-2018; Revised 07-09-2018; Accepted 06-12-2018

Corresponding Author: Ahsan Habib, Department of Computer Science and Engineering, Shahjalal University of Science and Technology, Sylhet, Bangladesh. Email: [email protected]

Abstract: This article proposes two dynamic Huffman based code generation algorithms, namely the Octanary and the Hexanary algorithm, for data compression. A fast encoding and decoding process is very important in the data compression area. We propose the tribit-based (Octanary) and quadbit-based (Hexanary) algorithms and compare their performance with the existing, widely used single-bit (Binary) algorithm and the recently introduced dibit (Quaternary) algorithm. The decoding algorithms for the proposed techniques have also been described. After assessing all the results, it is found that the Octanary and the Hexanary techniques perform better than the existing techniques in terms of encoding and decoding speed.

Keywords: Binary Tree, Quaternary Tree, Octanary Tree, Hexanary Tree, Huffman Principle, Decoding Technique, Encoding Technique, Tree Data Structure, Data Compression

© 2018 Ahsan Habib, M. Jahirul Islam and M. Shahidur Rahman. This open access article is distributed under a Creative Commons Attribution (CC-BY) 3.0 license.

Ahsan Habib et al. / Journal of Computer Science 2018, 14 (12): 1599.1610, DOI: 10.3844/jcssp.2018.1599.1610

Introduction

Huffman coding (Huffman, 1952) is very popular in the data compression area. Nowadays it is used in data compression for wireless and sensor networks (Săcăleanu et al., 2011; Renugadevi and Darisini, 2013) and in data mining (Oswald and Sivaselvan, 2018; Oswald et al., 2015). It has also been found efficient for data compression in low-resource systems (Radhakrishnan et al., 2016; Matai et al., 2014; Wang and Lin, 2016). The use of Huffman codes in word-based text compression is also very common (Sinaga, 2015). The Huffman principle produces an optimal code using a Binary tree in which the most frequent symbols receive the shortest codewords. However, the Huffman principle does not produce a balanced tree (Rajput, 2018). For this reason, it requires more memory to store the longer codewords, and thus it also requires more time to decode those codewords from memory. In this paper, we first review the traditional Huffman algorithm and the newly introduced Quaternary Huffman algorithm. Then we introduce the Octanary and Hexanary trees for the construction of Huffman codes. The Octanary and Hexanary structures make the underlying tree more balanced. The tree construction and decoding algorithms for both techniques have been developed. The codeword efficiency of the Binary, Quaternary, Octanary and Hexanary structures has been compared, and the compression ratio and speed are compared for the different methods using these coding systems. It is found that the compression and decompression speed of the proposed techniques is better than that of the others. To summarize, the proposed techniques may be suitable for offline data compression applications where encoding and decoding speed is more important and the constraint on space is weaker.

Related Works

In 1952, David Huffman introduced an algorithm (Huffman, 1952) which produces an optimal code for data compression systems. A Huffman code is produced using the Binary tree technique, where more frequent symbols receive shorter codewords and less frequent symbols receive longer codewords. Later on, many popular algorithms and applications were developed based on the Binary Huffman coding technique. Saradashri et al. (2015) explain in their book that Huffman coding can be either static or dynamic. Chen et al. (1999) introduced a method to speed up the process and reduce the memory of the Huffman tree. A tree clustering algorithm is introduced in (Hashemian, 1995) to avoid high sparsity of the tree; in that research, the author reduced the header size dramatically. Vitter (1987) introduced a new method which requires less memory than the conventional Huffman technique. Chung (1997) also introduced a memory-efficient array structure to represent the Huffman tree. Other studies have investigated the codeword length of Huffman codes. Katona and Nemetz (1978) investigated the connection between the self-information of a source letter and its codeword length. A recursive Huffman algorithm is introduced in (Lin et al., 2012), where a tree is transformed into a recursive Huffman tree that decodes more than one symbol at a time. In all of the above techniques, the decoding process starts by reading a file bit by bit. Recently, we introduced a code generation technique based on a Quaternary (dibit) Huffman tree (Habib and Rahman, 2017) to produce Huffman codes. In that research, better encoding and decoding speed is achieved by sacrificing an insignificant amount of space, and it is also found that searching two bits at a time speeds up the overall processing compared with searching a single bit. This motivated us to search three or four bits at a time. In this connection, the Octanary algorithm is introduced to produce three-bit based Huffman codes, whereas the Hexanary algorithm is introduced to produce four-bit based Huffman codes. The proposed algorithms improve the Huffman decoding time compared with the existing Huffman algorithms.

We organize the paper as follows. In section "Tree Structure", the traditional Binary, Quaternary, Octanary and Hexanary tree structures in data management systems are presented. In section "Implementation", the proposed encoding and decoding algorithms of the Octanary and Hexanary techniques are presented. Section "Results and Discussion" discusses the experimental results. Finally, section "Conclusion" concludes the paper.

Tree Structure

Binary and Quaternary Tree

A rooted tree T is called an m-ary tree if every internal vertex has no more than m children. The tree is called a full m-ary tree if every internal vertex has exactly m children. An m-ary tree with m = 2 is called a Binary tree. In a Binary tree, if an internal vertex has two children, the first child is called the LEFT child and the second child is called the RIGHT child (Adamchik, 2009). The Binary tree structure is thoroughly discussed in (Huffman, 1952). A tree with m = 4 is called a Quaternary tree, which has at most four children: the first child is called the LEFT child, the second the LEFT-MID child, the third the RIGHT-MID child and the fourth the RIGHT child. The details of the Quaternary tree structure are explained in (Habib and Rahman, 2017). The Binary and Quaternary tree structures for Luke 5 (Luke 5, 2018) are shown in Fig. 1 and 2, respectively. Luke 5 is the fifth chapter of the Gospel of Luke in the New Testament of the Christian Bible. The chapter relates the recruitment of Jesus' first disciples and continues to describe Jesus' teaching and healing ministry (Luke 5, 2018). The frequency distribution of Luke 5 is shown in Fig. 3.

Octanary Tree

An Octanary tree or 8-ary tree is a tree in which each node has 0 to 8 children (labeled as LEFT1, LEFT2, LEFT3, LEFT4, RIGHT1, RIGHT2, RIGHT3 and RIGHT4 child). For constructing the codes of an Octanary Huffman tree, we use 000 for a LEFT1 child, 001 for a LEFT2 child, 010 for a LEFT3 child, 011 for a LEFT4 child, 100 for a RIGHT1 child, 101 for a RIGHT2 child, 110 for a RIGHT3 child and 111 for a RIGHT4 child.

The process of constructing an Octanary tree is described below:

• List all possible symbols with their probabilities
• Find the eight symbols with the smallest probabilities
• Replace these by a single set containing all eight symbols, whose probability is the sum of the individual probabilities
• Repeat until the list contains a single member

The Octanary tree structure for the Luke 5 data is shown in Fig. 4.

Hexanary Tree

A Hexanary tree or 16-ary tree is a tree in which each node has 0 to 16 children (labeled as LEFT1 through LEFT8 and RIGHT1 through RIGHT8 child). For constructing the codes of a Hexanary Huffman tree we use 0000 for a LEFT1 child, 0001 for a LEFT2 child, 0010 for a LEFT3 child, 0011 for a LEFT4 child, 0100 for a LEFT5 child, 0101 for a LEFT6 child, 0110 for a LEFT7 child, 0111 for a LEFT8 child, 1000 for a RIGHT1 child, 1001 for a RIGHT2 child, 1010 for a RIGHT3 child, 1011 for a RIGHT4 child, 1100 for a RIGHT5 child, 1101 for a RIGHT6 child, 1110 for a RIGHT7 child and 1111 for a RIGHT8 child.

The process of constructing a Hexanary tree is described below:

• List all possible symbols with their probabilities
• Find the sixteen symbols with the smallest probabilities
• Replace these by a single set containing all sixteen symbols, whose probability is the sum of the individual probabilities
• Repeat until the list contains a single member

The Hexanary tree structure for the Luke 5 data is shown in Fig. 5.
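The four construction steps above are the same for every arity. A minimal Python sketch of the repeated merge, under the assumption (mirroring the paper's algorithms) that up to m nodes are merged per round and fewer on the final round; m = 8 gives the Octanary tree and m = 16 the Hexanary tree (class and function names are ours, not the paper's):

```python
import heapq

class Node:
    """Node of an m-ary Huffman tree; a leaf carries a symbol."""
    def __init__(self, freq, symbol=None, children=()):
        self.freq = freq
        self.symbol = symbol
        self.children = list(children)

    def __lt__(self, other):  # ordering used by the priority queue
        return self.freq < other.freq

def build_mary_tree(freqs, m):
    """Repeat the merge step: replace the m least frequent entries
    by one set whose frequency is their sum; the last round may
    merge fewer than m entries."""
    heap = [Node(f, s) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        group = [heapq.heappop(heap) for _ in range(min(m, len(heap)))]
        heapq.heappush(heap, Node(sum(n.freq for n in group), children=group))
    return heap[0]
```

With m = 8, each edge of the resulting tree consumes one tribit of a codeword; with m = 16, one quadbit; m = 2 and m = 4 give the Binary and Quaternary trees.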


Fig. 1: Binary tree

Fig. 2: Quaternary tree

Fig. 3 data (symbol : frequency), least to most frequent:
q 102, z 153, x 204, v 1071, j 1122, k 1683, p 2601, y 3009, c 3417, g 3723, f 4233, u 4386, m 4896, w 6120, b 6171, l 7038, d 8109, r 8262, i 11067, s 12393, h 13209, n 13464, a 14484, o 14484, t 16830, e 23970

Fig. 3: Frequency distribution of Luke 5
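A distribution such as the one in Fig. 3 is a plain letter count over the chapter text; a small sketch (treating the text as case-folded letters only is our assumption about the preprocessing, which the paper does not spell out):

```python
from collections import Counter

def letter_frequencies(text):
    """Count each letter of the text, case-folded; punctuation,
    digits and whitespace are ignored (assumed preprocessing)."""
    return Counter(ch for ch in text.lower() if ch.isalpha())

counts = letter_frequencies("He taught the people from the boat")
# symbols ordered by increasing frequency, as Fig. 3 lists them
ordered = sorted(counts.items(), key=lambda kv: kv[1])
```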



Fig. 4: Octanary tree
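An Octanary tree such as Fig. 4 maps to codewords by concatenating one tribit per edge (000 for LEFT1 through 111 for RIGHT4), and decoding reverses the walk three bits at a time. A minimal round-trip sketch, using nested tuples as our own lightweight tree representation, not the paper's (a leaf is a symbol string, an internal node a tuple of up to eight children in LEFT1..RIGHT4 order):

```python
def tribit_codes(tree, prefix=""):
    """Return {symbol: codeword}; the i-th child of a node
    contributes the 3-bit binary form of i to the codeword."""
    if isinstance(tree, str):                    # leaf
        return {tree: prefix}
    codes = {}
    for i, child in enumerate(tree):
        codes.update(tribit_codes(child, prefix + format(i, "03b")))
    return codes

def oh_decode(tree, bits):
    """Read one tribit at a time, descend to the selected child,
    and on reaching a leaf emit the symbol and restart at the root."""
    out, node = [], tree
    for i in range(0, len(bits), 3):
        node = node[int(bits[i:i + 3], 2)]
        if isinstance(node, str):
            out.append(node)
            node = tree
    return "".join(out)

# toy tree: leaves 'o', 't' and an internal node holding 'q', 'z'
tree = ("o", "t", ("q", "z"))
codes = tribit_codes(tree)   # 'o'->'000', 't'->'001', 'q'->'010000', 'z'->'010001'
encoded = "".join(codes[s] for s in "oqzt")
assert oh_decode(tree, encoded) == "oqzt"
```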


Fig. 5: Hexanary tree

Table 1: Comparison of different tree structures

Parameter                 Binary    Quaternary    Octanary    Hexanary
Level of tree             10        5             3           2
Number of internal nodes  25        9             4           4
Total number of nodes     51        35            30          28
Weighted path length      784023    497301        327063      236130
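The weighted path lengths of Table 1 can be reproduced from the Luke 5 frequencies (Fig. 3) together with each symbol's depth in the tree, where depth counts tree levels (codeword length in tribits or quadbits rather than bits). A check of the Octanary and Hexanary rows, with depths read off the codeword lengths in Fig. 6:

```python
# Luke 5 letter frequencies (Fig. 3), least to most frequent:
# q z x v j k p y c g f u m w b l d r i s h n a o t e
freqs = [102, 153, 204, 1071, 1122, 1683, 2601, 3009,
         3417, 3723, 4233, 4386, 4896, 6120, 6171, 7038,
         8109, 8262, 11067, 12393, 13209, 13464, 14484,
         14484, 16830, 23970]

def weighted_path_length(freqs, depths):
    """Sum of frequency x leaf depth over all symbols."""
    return sum(f * d for f, d in zip(freqs, depths))

# Depths from Fig. 6: Octanary codes of 9, 6 and 3 bits are 3, 2
# and 1 levels; Hexanary codes of 8 and 4 bits are 2 and 1 levels.
oct_depths = [3] * 8 + [2] * 15 + [1] * 3
hex_depths = [2] * 16 + [1] * 10
assert weighted_path_length(freqs, oct_depths) == 327063  # Table 1
assert weighted_path_length(freqs, hex_depths) == 236130  # Table 1
```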

Implementation

Code Generation (Encoding) Algorithm

To construct a Huffman tree, the distinct symbols and their frequencies are necessary. The tree construction algorithm for the traditional Binary technique is explained in (Cormen et al., 1989). The newly constructed Quaternary technique is explained in (Habib and Rahman, 2017). In this section, the newly constructed Octanary and Hexanary tree generation algorithms are illustrated.

Encoding of the Octanary Huffman Tree

The encoding algorithm for the Octanary Huffman tree is shown in Algorithm 1. In line 1 we assign the unordered nodes C to the queue Q, and then we take the count of nodes in Q and assign it to n. We declare a variable i and assign the value of n to it.


In line 4, we start iterating over all the nodes in the queue to build the Octanary tree while the count i is greater than 1, which means there are still nodes left to be added to a parent. In line 5, a new tree node z is allocated; this node will be the parent node of the least frequent nodes. In line 6, we extract the least frequent node from the queue Q and assign it as the LEFT1 child of the parent node z. The purpose of the EXTRACT-MIN(Q) function is to return the least frequent node from the queue; it also removes that node from the queue. In line 7, we take the next least frequent node from the queue and assign it as the LEFT2 child of the parent z.

From lines 8 to 43, we check the value of i, the number of nodes left in the queue Q. If i is exactly 2, the frequency of the parent node z, f[z], is the sum of the frequency of node r, f[r], and the frequency of node s, f[s]. If i equals 3, we extract another least frequent node, so that the parent has LEFT1, LEFT2 and LEFT3 children, and add its frequency to the parent node. Likewise, if i equals 4 we extract a fourth least frequent node, giving LEFT1 through LEFT4 children; if i equals 5 we additionally assign a RIGHT1 child; if i equals 6, RIGHT1 and RIGHT2 children; if i equals 7, RIGHT1 through RIGHT3 children; and if i equals 8 or more, we extract eight least frequent nodes, giving LEFT1 through LEFT4 and RIGHT1 through RIGHT4 children, in each case adding the extracted frequencies to the parent node. In line 44, we insert the new parent node z into the queue Q. In line 45, we take the count of the queue Q and assign it to i again, and the loop continues until a single node is left in the queue. Finally, the last remaining node of the queue Q is returned as the Octanary Huffman tree.

Algorithm 1. Encoding of the Octanary Huffman tree

O-HUFFMAN (C)
1.  Q ← C
2.  n ← |Q|
3.  i ← n
4.  WHILE i > 1
5.    allocate a new node z
6.    left1[z] ← r ← EXTRACT-MIN(Q)
7.    left2[z] ← s ← EXTRACT-MIN(Q)
8.    IF i = 2
9.      f[z] ← f[r] + f[s]
10.   ELSE IF i = 3
11.     left3[z] ← t ← EXTRACT-MIN(Q)
12.     f[z] ← f[r] + f[s] + f[t]
13.   ELSE IF i = 4
14.     left3[z] ← t ← EXTRACT-MIN(Q)
15.     left4[z] ← u ← EXTRACT-MIN(Q)
16.     f[z] ← f[r] + f[s] + f[t] + f[u]
17.   ELSE IF i = 5
18.     left3[z] ← t ← EXTRACT-MIN(Q)
19.     left4[z] ← u ← EXTRACT-MIN(Q)
20.     right1[z] ← v ← EXTRACT-MIN(Q)
21.     f[z] ← f[r] + f[s] + f[t] + f[u] + f[v]
22.   ELSE IF i = 6
23.     left3[z] ← t ← EXTRACT-MIN(Q)
24.     left4[z] ← u ← EXTRACT-MIN(Q)
25.     right1[z] ← v ← EXTRACT-MIN(Q)
26.     right2[z] ← w ← EXTRACT-MIN(Q)
27.     f[z] ← f[r] + f[s] + f[t] + f[u] + f[v] + f[w]
28.   ELSE IF i = 7
29.     left3[z] ← t ← EXTRACT-MIN(Q)
30.     left4[z] ← u ← EXTRACT-MIN(Q)
31.     right1[z] ← v ← EXTRACT-MIN(Q)
32.     right2[z] ← w ← EXTRACT-MIN(Q)
33.     right3[z] ← x ← EXTRACT-MIN(Q)
34.     f[z] ← f[r] + f[s] + f[t] + f[u] + f[v] + f[w] + f[x]
35.   ELSE
36.     left3[z] ← t ← EXTRACT-MIN(Q)
37.     left4[z] ← u ← EXTRACT-MIN(Q)
38.     right1[z] ← v ← EXTRACT-MIN(Q)
39.     right2[z] ← w ← EXTRACT-MIN(Q)
40.     right3[z] ← x ← EXTRACT-MIN(Q)
41.     right4[z] ← y ← EXTRACT-MIN(Q)
42.     f[z] ← f[r] + f[s] + f[t] + f[u] + f[v] + f[w] + f[x] + f[y]
43.   END IF
44.   INSERT(Q, z)
45.   i ← |Q|
46. END WHILE
47. RETURN EXTRACT-MIN(Q)

Encoding of the Hexanary Huffman Tree

In line 1 we assign the unordered nodes C to the queue Q, and then we take the count of nodes in Q and assign it to n. We declare a variable i and assign the value of n to it. In line 4, we start iterating over all the nodes in the queue to build the Hexanary tree while the count i is greater than 1, which means there are still nodes left to be added to a parent. In line 5, a new tree node z is allocated; this node will be the parent node of the least frequent nodes. In line 6, we extract the least frequent node from the queue Q and assign it as the LEFT1 child of the parent node z. The purpose of the EXTRACT-MIN(Q) function is to return the least frequent node from the queue; it also removes that node from the queue. In line 7, we take the next least frequent node from the queue and assign it as the LEFT2 child of the parent z.

From lines 8 to 121, we check the value of i, the number of nodes left in the queue Q. If i is exactly 2, the frequency of the parent node z, f[z], will

be the sum of the frequency of node j, f[j], and the frequency of node k, f[k]. If i equals 3 we extract another least frequent node, giving LEFT1, LEFT2 and LEFT3 children, and add its frequency to the parent node. Likewise, if i equals 4 we extract a fourth least frequent node, giving LEFT1 through LEFT4 children; for i equal to 5, 6, 7 and 8 we additionally assign the LEFT5, LEFT6, LEFT7 and LEFT8 children, respectively. The process continues in the same way, and for i equal to 16 or more we extract sixteen least frequent nodes and add them as LEFT1 through LEFT8 and RIGHT1 through RIGHT8 children, in each case adding the extracted frequencies to the parent node. In line 122, we insert the new parent node z into the queue Q. In line 123, we take the count of the queue Q and assign it to i again, and the loop continues until a single node is left in the queue. Finally, we return the last remaining node of the queue Q as the Hexanary Huffman tree.

Algorithm 2. Encoding of the Hexanary Huffman tree

H-HUFFMAN (C)
1.  Q ← C
2.  n ← |Q|
3.  i ← n
4.  WHILE i > 1
5.    allocate a new node z
6.    left1[z] ← j ← EXTRACT-MIN(Q)
7.    left2[z] ← k ← EXTRACT-MIN(Q)
8.    IF i = 2
9.      f[z] ← f[j] + f[k]
10.   ELSE IF i = 3
11.     left3[z] ← l ← EXTRACT-MIN(Q)
12.     f[z] ← f[j] + f[k] + f[l]
13.   ELSE IF i = 4
14.     left3[z] ← l ← EXTRACT-MIN(Q)
15.     left4[z] ← m ← EXTRACT-MIN(Q)
16.     f[z] ← f[j] + f[k] + f[l] + f[m]
17.   ELSE IF i = 5
18.     left3[z] ← l ← EXTRACT-MIN(Q)
19.     left4[z] ← m ← EXTRACT-MIN(Q)
20.     left5[z] ← n ← EXTRACT-MIN(Q)
21.     f[z] ← f[j] + f[k] + f[l] + f[m] + f[n]
22.   ELSE IF i = 6
23.     left3[z] ← l ← EXTRACT-MIN(Q)
24.     left4[z] ← m ← EXTRACT-MIN(Q)
25.     left5[z] ← n ← EXTRACT-MIN(Q)
26.     left6[z] ← o ← EXTRACT-MIN(Q)
27.     f[z] ← f[j] + f[k] + f[l] + f[m] + f[n] + f[o]
28.–104. (cases i = 7 through i = 15 follow the same pattern)
105.  ELSE
106.    left3[z] ← l ← EXTRACT-MIN(Q)
107.    left4[z] ← m ← EXTRACT-MIN(Q)
108.    left5[z] ← n ← EXTRACT-MIN(Q)
109.    left6[z] ← o ← EXTRACT-MIN(Q)
110.    left7[z] ← p ← EXTRACT-MIN(Q)
111.    left8[z] ← q ← EXTRACT-MIN(Q)
112.    right1[z] ← r ← EXTRACT-MIN(Q)
113.    right2[z] ← s ← EXTRACT-MIN(Q)
114.    right3[z] ← t ← EXTRACT-MIN(Q)
115.    right4[z] ← u ← EXTRACT-MIN(Q)
116.    right5[z] ← v ← EXTRACT-MIN(Q)
117.    right6[z] ← w ← EXTRACT-MIN(Q)
118.    right7[z] ← x ← EXTRACT-MIN(Q)
119.    right8[z] ← y ← EXTRACT-MIN(Q)
120.    f[z] ← f[j] + f[k] + f[l] + f[m] + f[n] + f[o] + f[p] + ... + f[y]
121.  END IF
122.  INSERT(Q, z)
123.  i ← |Q|
124. END WHILE
125. RETURN EXTRACT-MIN(Q)

Decoding Algorithm

This is a one-pass algorithm. First, open the encoded file and read the frequency data out of it, and create the Octanary or Hexanary Huffman tree based on that information. Then read data out of the file and search the tree to find the correct character to decode (the bits 000 mean go to LEFT1, 001 to LEFT2, 010 to LEFT3, etc. in the case of the Octanary tree; 0000 means go to LEFT1, 0001 to LEFT2, 0010 to LEFT3, etc. in the case of the Hexanary tree). If we know the Octanary or Hexanary Huffman code for some encoded data, decoding may be accomplished by reading the encoded data three or four bits at a time. Once the bits read match a code for a symbol, write out the symbol and start collecting bits again. The newly constructed Octanary and Hexanary tree decoding techniques are explained below.

Decoding of the Octanary Huffman Tree

Algorithm 3. Decoding of the Octanary Huffman tree

OH-DECODE (T, B)
1. ln ← T
2. n ← |B|
3. i ← 0
4. WHILE i < n
5.   b1 ← EXTRACT-BIT(B)
6.   b2 ← EXTRACT-BIT(B)


7.   b3 ← EXTRACT-BIT(B)
8.   IF b1 = 0 AND b2 = 0 AND b3 = 0
9.     ln ← LEFT1(ln)
10.  ELSE IF b1 = 0 AND b2 = 0 AND b3 = 1
11.    ln ← LEFT2(ln)
12.  ELSE IF b1 = 0 AND b2 = 1 AND b3 = 0
13.    ln ← LEFT3(ln)
14.  ELSE IF b1 = 0 AND b2 = 1 AND b3 = 1
15.    ln ← LEFT4(ln)
16.  ELSE IF b1 = 1 AND b2 = 0 AND b3 = 0
17.    ln ← RIGHT1(ln)
18.  ELSE IF b1 = 1 AND b2 = 0 AND b3 = 1
19.    ln ← RIGHT2(ln)
20.  ELSE IF b1 = 1 AND b2 = 1 AND b3 = 0
21.    ln ← RIGHT3(ln)
22.  ELSE
23.    ln ← RIGHT4(ln)
24.  END IF
25.  k ← KEY(ln)
26.  IF k IS NOT NULL
27.    Output(k)
28.    ln ← T
29.  END IF
30.  i ← i + 3
31. END WHILE

In line 1, we assign the Octanary tree T to the local variable ln. After that, the total count of bits in B is taken as n. In line 3, a local variable i is initialized with 0, which will be used as a counter. In line 4, we start iterating over all the bits in B. As this is an Octanary tree, we have at most eight leaves under a parent node, LEFT1, LEFT2, LEFT3, LEFT4, RIGHT1, RIGHT2, RIGHT3 and RIGHT4, and 000, 001, 010, 011, 100, 101, 110 and 111 represent these leaf nodes, respectively. So we take three bits at a time. EXTRACT-BIT(B) returns a bit from the bit array B and removes it from B as well. In lines 5, 6 and 7, the local variables b1, b2 and b3 are assigned three extracted bits from the bit array B.

From lines 8 to 24, we check the extracted bits to traverse the tree from the top. If the bits are 000 we take the LEFT1 child of the parent ln and assign it to ln itself. For 001, we replace the parent ln with its LEFT2 child; for 010 with its LEFT3 child; for 011 with its LEFT4 child; for 100 with its RIGHT1 child; for 101 with its RIGHT2 child; for 110 with its RIGHT3 child; and for 111 with its RIGHT4 child. In line 25, we get the key of the replaced ln and assign it to k. Then we check whether k has any value; if it does, we write the value of k to the output and reset ln to the Octanary tree T itself. In line 30, we increase the value of i by 3 and the loop continues, reading the next three bits.

The search time for finding a source symbol in the Octanary Huffman tree is O(log8 n), whereas for the Binary Huffman based decoding algorithm it is O(log2 n).

Decoding of the Hexanary Huffman Tree

In line 1, we assign the Hexanary tree T to the local variable ln. After that, the total count of bits in B is taken as n. In line 3, a local variable i is initialized with 0, which will be used as a counter. In line 4, we start iterating over all the bits in B. As this is a Hexanary tree, we have at most sixteen leaves under a parent node, LEFT1 through LEFT8 and RIGHT1 through RIGHT8, and 0000, 0001, 0010, 0011, 0100, 0101, 0110, 0111, 1000, 1001, 1010, 1011, 1100, 1101, 1110 and 1111 represent these leaf nodes, respectively. So we take four bits at a time. EXTRACT-BIT(B) returns a bit from the bit array B and removes it from B as well. In lines 5, 6, 7 and 8, the local variables b1, b2, b3 and b4 are assigned four extracted bits from the bit array B.

From lines 9 to 41, we check the extracted bits to traverse the tree from the top. If the bits are 0000 we take the LEFT1 child of the parent ln and assign it to ln itself. For 0001, we replace the parent ln with its LEFT2 child; for 0010 with its LEFT3 child; for 0011 with its LEFT4 child; for 0100 with its LEFT5 child; for 0101 with its LEFT6 child; for 0110 with its LEFT7 child; for 0111 with its LEFT8 child; for 1000 with its RIGHT1 child; for 1001 with its RIGHT2 child; and for 1010 with its RIGHT3 child.

Algorithm 4. Decoding of the Hexanary Huffman tree

HH-DECODE (T, B)
1. ln ← T
2. n ← |B|
3. i ← 0
4. WHILE i < n
5.   b1 ← EXTRACT-BIT(B)
6.   b2 ← EXTRACT-BIT(B)
7.   b3 ← EXTRACT-BIT(B)
8.   b4 ← EXTRACT-BIT(B)
9.   IF b1 = 0 AND b2 = 0 AND b3 = 0 AND b4 = 0
10.    ln ← LEFT1(ln)
11.  ELSE IF b1 = 0 AND b2 = 0 AND b3 = 0 AND b4 = 1
12.    ln ← LEFT2(ln)
13.–24. (cases 0010 through 0111: LEFT3 … LEFT8, same pattern)
25.  ELSE IF b1 = 1 AND b2 = 0 AND b3 = 0 AND b4 = 0
26.    ln ← RIGHT1(ln)
27.  ELSE IF b1 = 1 AND b2 = 0 AND b3 = 0 AND b4 = 1
28.    ln ← RIGHT2(ln)
29.–38. (cases 1010 through 1110: RIGHT3 … RIGHT7, same pattern)
39.  ELSE
40.    ln ← RIGHT8(ln)
41.  END IF
42.  k ← KEY(ln)


43.  IF k IS NOT NULL
44.    Output(k)
45.    ln ← T
46.  END IF
47.  i ← i + 4
48. END WHILE

For 1011, we replace ln with its RIGHT4 child; for 1100 with its RIGHT5 child; for 1101 with its RIGHT6 child; for 1110 with its RIGHT7 child; and for 1111 with its RIGHT8 child. In line 42, we get the key of the replaced ln and assign it to k. Then we check whether k has any value; if it does, we write the value of k to the output and reset ln to the Hexanary tree T itself. In line 47, we increase the value of i by 4 and the loop continues, reading the next four bits.

The encoding and decoding techniques of the Octanary and Hexanary algorithms have been thoroughly discussed in this section. The search time for finding a source symbol using the Octanary and Hexanary Huffman trees is O(log8 n) and O(log16 n), respectively, whereas for the Binary Huffman based decoding algorithm it is O(log2 n). The codewords generated by each technique are shown in Fig. 6.

Symbol  Frequency  Binary      Quaternary  Octanary   Hexanary
q       102        1000001010  0110000100  100010000  10100000
z       153        1000001011  0110000101  100010001  10100001
x       204        100000100   0110000110  100010010  10100010
v       1071       10000011    0110000111  100010011  10100011
j       1122       1000000     01100000    100010100  10100100
k       1683       000110      01100010    100010101  10100101
p       2601       000111      01100011    100010110  10100110
y       3009       100001      000000      100010111  10100111
c       3417       111100      000001      011000     10101000
g       3723       111101      000010      011001     10101001
f       4233       00010       000011      011010     10101010
u       4386       01000       010000      011011     10101011
m       4896       01001       010001      011100     10101100
w       6120       10001       010010      011101     10101101
b       6171       10100       010011      011110     10101110
l       7038       10101       011001      011111     10101111
d       8109       11111       011010      100000     0000
r       8262       0000        011011      100001     0001
i       11067      0101        011100      100011     0010
s       12393      1001        011101      100100     0011
h       13209      1011        011110      100101     0100
n       13464      1100        011111      100110     0101
a       14484      1101        0001        100111     0110
o       14484      1110        0010        000        0111
t       16830      001         0011        001        1000
e       23970      011         0101        010        1001

Fig. 6: Codewords generated by the different algorithms

Results and Discussion

The objective of this experiment is to evaluate the performance of several Huffman based algorithms. We consider Zopfli (Alakuijala and Vandevenne, 2013; Alakuijala et al., 2016) as the traditional (Binary) Huffman algorithm. Zopfli is one of the most successful compression algorithms released by Google Inc., and Google claims that Zopfli has the highest compression ratio. We also compare the performance of the dibit based Quaternary algorithm and the proposed tribit based Octanary and quadbit based Hexanary Huffman algorithms. We run all algorithms on the same computer, with an Intel® Core™ i5-6500 CPU running at 3.20 GHz with 2 cores and 4 additional hyper-threading contexts, under the Ubuntu 14.04 LTS operating system. All codecs were compiled with the same compiler, GCC 4.8.4. The amount of primary memory is 4 GiB of DDR4 type. We execute every query five times and report the average time. The datasets used in this experiment to verify the performance of the different algorithms are described in Table 2.

As shown in Table 3, the compression ratio is highest for Zopfli, but the corresponding compression and decompression speed is very slow: Zopfli requires over 400 sec, whereas all the proposed techniques require less than 200 sec. For the Canterbury corpus, Zopfli requires over 13 sec whereas all the proposed techniques require less than 2 sec, as shown in Table 4.

The performance of the different algorithms is shown in Tables 3 and 4 for the Enwik (Mahoney, 2018) and Canterbury (Bell and Powel, 2000) corpora, respectively. From both tables, it can be seen that the two parameters, space and time, do not behave the same way: in some cases saving space is more important and in other cases speed (time) is more important. To see the time-space relation at a glance, we normalize the data: if we divide every number by the largest number in its range, we get every number in the range between 0 and 1. The data before and after normalization for the Enwik corpus are shown in Table 5 and the time-space graph is shown in Fig. 7.

From Fig. 7, it can be seen that Zopfli requires the maximum time, whereas Quaternary, Octanary and Hexanary require less time. The Quaternary technique achieves almost 60% speed improvement while sacrificing 17% of space, and the Octanary technique achieves almost 59% more speed while sacrificing 29% of space. From Fig. 8, for the Canterbury corpus, almost 90% speed improvement can be achieved by sacrificing 40% of space.
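The normalization step described above (divide every value by the column maximum) reproduces the "after" columns of Table 5 exactly; a small sketch using the Enwik figures:

```python
def normalize(values):
    """Divide every number by the largest in its range, mapping
    the range into (0, 1]."""
    top = max(values)
    return [round(v / top, 2) for v in values]

# Table 5, Enwik corpus: Zopfli, Quaternary, Octanary, Hexanary
space_mb = [33.37, 49.67, 61.06, 59.73]
time_s = [463.26, 186.88, 187.82, 174.58]
assert normalize(space_mb) == [0.55, 0.81, 1.00, 0.98]
assert normalize(time_s) == [1.00, 0.40, 0.41, 0.38]
```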



Fig. 7: Time-space requirement for Enwik corpus


Fig. 8: Normalized time-space requirement for Canterbury corpus


Fig. 9: Octanary performance for “Consultation-en”


Table 2: Data set

S/L  File name                     Description                                                  File size  Distinct symbols
1    Enwik8.txt                    Developed as a large text compression benchmark,             95.3 MB    156
                                   consisting of 100 million bytes of English Wikipedia
2    Canterbury.txt                A corpus designed for lossless data compression;             2.67 MB    72
                                   an improved version of the Calgary corpus
3    consultation-document_en.pdf  Public Consultation on the review of the EU copyright rules  113 KB     92

Table 3: Compression ratio and compression-decompression speed for the Enwik corpus

Algorithm        Space (MB)  Compression enhancement w.r.t.  Time (s)  Time enhancement w.r.t.
                             original file (%)                         Zopfli (%)
Zopfli (Binary)  33.37       64.98                           463.26    -
Quaternary       49.67       47.88                           186.88    59.66
Octanary         61.06       35.93                           187.82    59.46
Hexanary         59.73       37.32                           174.58    62.31
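Both percentage columns of Table 3 are plain relative reductions: space against the 95.3 MB original Enwik8 file (Table 2), time against the Zopfli row. A sketch reproducing four of the entries:

```python
def saving_pct(baseline, value):
    """Percentage reduction of `value` relative to `baseline`."""
    return round(100 * (baseline - value) / baseline, 2)

# Enwik8 is 95.3 MB (Table 2); Zopfli's 463.26 s is the time baseline.
assert saving_pct(95.3, 33.37) == 64.98     # Zopfli, space column
assert saving_pct(95.3, 49.67) == 47.88     # Quaternary, space column
assert saving_pct(463.26, 186.88) == 59.66  # Quaternary, time column
assert saving_pct(463.26, 174.58) == 62.31  # Hexanary, time column
```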

Table 4: The compression ratio and compression-decompression speed for the Canterbury corpus (compression enhancement is with respect to the original file; time enhancement is with respect to Zopfli)
Algorithm        Space (MB)  Compression enhancement (%)  Time (S)  Time enhancement (%)
Zopfli (Binary)  0.64        76.07                        13.36     -
Quaternary       1.71        35.85                        1.37      89.78
Octanary         2.27        15.01                        1.47      89.00
Hexanary         1.79        32.97                        1.04      92.20

Table 5: Time-space data for Enwik corpus
                 Before normalization    After normalization
Algorithm        Space (MB)  Time (S)    Space  Time
Zopfli (Binary)  33.37       463.26      0.55   1.00
Quaternary       49.67       186.88      0.81   0.40
Octanary         61.06       187.82      1.00   0.41
Hexanary         59.73       174.58      0.98   0.38
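The normalized columns of Table 5 can be reproduced by dividing each column by its maximum, so the largest space and the slowest time both map to 1.00. A quick check (variable names are ours):

```python
# Raw measurements from Table 5 (Enwik corpus).
space = {'Zopfli (Binary)': 33.37, 'Quaternary': 49.67,
         'Octanary': 61.06, 'Hexanary': 59.73}
times = {'Zopfli (Binary)': 463.26, 'Quaternary': 186.88,
         'Octanary': 187.82, 'Hexanary': 174.58}

def normalize(column):
    """Scale a column so its maximum becomes 1.00, as in Fig. 7-9."""
    peak = max(column.values())
    return {name: round(value / peak, 2) for name, value in column.items()}

print(normalize(space))  # Octanary is the space peak (1.0), Zopfli 0.55
print(normalize(times))  # Zopfli is the time peak (1.0), Hexanary 0.38
```

The same scaling underlies the normalized bar charts in Fig. 7 to 9.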

It is not always true that the Quaternary technique performs better than the other techniques. For the Consultation-en document (EC, 2013), it has been observed that Octanary performs better than the other techniques: for both time and space, Octanary achieved the best performance, as shown in Fig. 9. When the number of symbols is approximately 8^h, where h is the height of the tree, Octanary outperforms the other techniques.

Conclusion

Two new Huffman based algorithms have been introduced in this article. The time-space trade-off for the different Huffman based algorithms has been thoroughly discussed. The Binary Huffman algorithm performs better for achieving a higher compression ratio. The Quaternary Huffman algorithm is useful when a balance between time and space is required. However, if the tree is balanced, the Octanary and Hexanary Huffman algorithms perform superior to the Binary and Quaternary algorithms due to their smaller tree height. In all cases, the optimal codeword is produced when the tree is balanced. The Binary, Quaternary, Octanary and Hexanary algorithms perform best when the number of symbols is approximately 2^h, 4^h, 8^h and 16^h, respectively, where h is the height of the tree. An adaptive algorithm for finding the most suitable encoding algorithm to balance speed and memory requirements could be an important topic for future research.

Acknowledgment

We are grateful to the ICT Division, Ministry of Posts, Telecommunications and Information Technology, People's Republic of Bangladesh for their grant to conduct this research work.

Funding Information

Funding was provided by the ICT Division, Ministry of Posts, Telecommunications and Information Technology, People's Republic of Bangladesh (Order No: 56.00.0000.028.33.007.14 (part-1)-275, date: 11.05.2014).

Author’s Contributions

Ahsan Habib: Contributed to the original conception and algorithm design of the research work, drafted the article and produced the figures used in the manuscript.


M. Jahirul Islam: Contributed to the conception and design of the research work, reviewed the manuscript and gave final approval of the final version of the manuscript.

M. Shahidur Rahman: Contributed to the conception and design of the research work, reviewed the manuscript critically and gave final approval of the final version of the manuscript.

Ethics

This research manuscript is original and has not been published elsewhere. The corresponding author confirms that all of the other authors have read and approved the manuscript and there are no ethical issues involved.

References

Adamchik, V.S., 2009. Binary trees. https://www.cs.cmu.edu/~adamchik/15-121/lectures/Trees/trees.html

Alakuijala, J., E. Kliuchnikov, Z. Szabadka and L. Vandevenne, 2016. Comparison of brotli, deflate, zopfli, LZMA, LZHAM and bzip2 compression algorithms. Internet Engineering Task Force.

Alakuijala, J. and L. Vandevenne, 2013. Data compression using zopfli. Google Inc.

Bell, T. and M. Powell, 2000. The Canterbury corpus. http://corpus.canterbury.ac.nz/resources/cantrbry.zip

Chen, H.C., Y.L. Wang and Y.F. Lan, 1999. A memory-efficient and fast Huffman decoding algorithm. Inform. Process. Lett., 69: 119-122. DOI: 10.1016/S0020-0190(99)00002-2

Chung, K.L., 1997. Efficient Huffman decoding. Inform. Process. Lett., 61: 97-99. DOI: 10.1016/S0020-0190(96)00204-9

Cormen, T.H., C.E. Leiserson, R.L. Rivest and C. Stein, 1989. Introduction to Algorithms. 2nd Edn., The MIT Press, ISBN-10: 0-262-03293-7.

EC, 2013. Public consultation on the review of the EU copyright rules. European Commission.

Habib, A. and M.S. Rahman, 2017. Balancing decoding speed and memory usage for Huffman codes using quaternary tree. Applied Informat., 4: 1-15. DOI: 10.1186/s40535-016-0032-z

Hashemian, R., 1995. Memory efficient and high-speed search Huffman coding. IEEE Trans. Comm., 43: 2576-2581.

Huffman, D.A., 1952. A method for the construction of minimum-redundancy codes. Proc. IRE, 40: 1090-1101. DOI: 10.1109/JRPROC.1952.273898

Katona, G.O.H. and T.O.H. Nemetz, 1978. Huffman codes and self-information. IEEE Trans. Inform. Theory, 22: 337-340. DOI: 10.1109/TIT.1976.1055554

Lin, Y.K., S.C. Huang and C.H. Yang, 2012. A fast algorithm for Huffman decoding based on a recursion Huffman tree. J. Syst. Software, 85: 974-980. DOI: 10.1016/j.jss.2011.11.1019

Luke 5, 2018. Wikipedia. https://en.wikipedia.org/wiki/Luke_5

Mahoney, M., 2018. The Enwik8 corpus. http://mattmahoney.net/dc/text.html http://mattmahoney.net/dc/enwik8.zip

Matai, J., J.Y. Kim and R. Kastner, 2014. Energy efficient canonical Huffman encoding. Proceedings of the IEEE 25th International Conference on Application-Specific Systems, Architectures and Processors, Jun. 8-20, IEEE Xplore Press, Zurich, Switzerland, pp: 202-209. DOI: 10.1109/ASAP.2014.6868663

Oswald, C. and B. Sivaselvan, 2018. An optimal text compression algorithm based on frequent pattern mining. J. Ambient Intell. Human Comput., 9: 803-822. DOI: 10.1007/s12652-017-0540-2

Oswald, C., A.I. Ghosh and B. Sivaselvan, 2015. An efficient text compression algorithm-data mining perspective. Proceedings of the 3rd International Conference on Mining Intelligence and Knowledge Exploration, Dec. 09-11, Springer, Hyderabad, India, pp: 563-575. DOI: 10.1007/978-3-319-26832-3_53

Radhakrishnan, J., S. Sarayu, K.G. Kurain, D. Alluri and R. Gandhiraj, 2016. Huffman coding and decoding using Android. Proceedings of the International Conference on Communication and Signal Processing, Apr. 6-8, IEEE Xplore Press, Melmaruvathur, India, pp: 0361-0365. DOI: 10.1109/ICCSP.2016.7754156

Rajput, K.K., 2018. Are Huffman trees balanced? https://www.quora.com/Are-Huffman-trees-balanced

Renugadevi, S. and P.S.N. Darisini, 2013. Huffman and Lempel-Ziv based data compression algorithms for wireless sensor networks. Proceedings of the International Conference on Pattern Recognition, Informatics and Mobile Engineering, Feb. 21-22, IEEE Xplore Press, Salem, India, pp: 461-463. DOI: 10.1109/ICPRIME.2013.6496521

Săcăleanu, S.I., R. Stoian and D.M. Ofrim, 2011. An adaptive Huffman algorithm for data compression in wireless sensor networks. Proceedings of the International Symposium on Signals, Circuits and Systems, Jun. 30-Jul. 1, IEEE Xplore Press, Iasi, Romania, pp: 1-4. DOI: 10.1109/ISSCS.2011.5978764

Saradashri, S., A. Arelakis, P. Stenstrom and D.A. Wood, 2015. A Primer on Compression in the Memory Hierarchy. 1st Edn., Morgan and Claypool Publishers, ISBN-10: 1627057048, pp: 86.


Sinaga, A., Adiwijaya and H. Nugroho, 2015. Development of word-based text compression algorithm for Indonesian language document. Proceedings of the 3rd International Conference on Information and Communication Technology, May 27-29, IEEE Xplore Press, Nusa Dua, Bali, pp: 450-454. DOI: 10.1109/ICoICT.2015.7231466

Vitter, J.S., 1987. Design and analysis of dynamic Huffman code. J. ACM, 34: 825-845. DOI: 10.1145/31846.42227

Wang, W.J. and C.H. Lin, 2016. Code compression for embedded systems using separated dictionaries. IEEE Trans. Very Large Scale Integrat. Syst., 24: 266-275. DOI: 10.1109/TVLSI.2015.2394364
