Modification of Adaptive Huffman Coding for Use in Encoding Large Alphabets
Total Page:16
File Type:pdf, Size:1020Kb
ITM Web of Conferences 15, 01004 (2017) DOI: 10.1051/itmconf/20171501004 CMES’17 Modification of Adaptive Huffman Coding for use in encoding large alphabets Mikhail Tokovarov1,* 1Lublin University of Technology, Electrical Engineering and Computer Science Faculty, Institute of Computer Science, Nadbystrzycka 36B, 20-618 Lublin, Poland Abstract. The paper presents the modification of Adaptive Huffman Coding method – lossless data compression technique used in data transmission. The modification was related to the process of adding a new character to the coding tree, namely, the author proposes to introduce two special nodes instead of single NYT (not yet transmitted) node as in the classic method. One of the nodes is responsible for indicating the place in the tree a new node is attached to. The other node is used for sending the signal indicating the appearance of a character which is not presented in the tree. The modified method was compared with existing methods of coding in terms of overall data compression ratio and performance. The proposed method may be used for large alphabets i.e. for encoding the whole words instead of separate characters, when new elements are added to the tree comparatively frequently. Huffman coding is frequently chosen for implementing open source projects [3]. The present paper contains the 1 Introduction description of the modification that may help to improve Efficiency and speed – the two issues that the current the algorithm of adaptive Huffman coding in terms of world of technology is centred at. Information data savings. technology (IT) is no exception in this matter. Such an area of IT as social media has become extremely popular 2 Study of related works and widely used, so that high transmission speed has gained a great importance. One way of obtaining high Huffman coding has been developed in 1952 by David communication performance is developing more Huffman. He developed the method during his Sc. D. study efficient hardware. The other one is to develop the if MIT and published in the 1952 paper "A Method for software that would allow to compress the data in such the Construction of Minimum-Redundancy Codes" [1]. a way that would reduce the size of data but not affect its Huffman coding is the most optimal among methods information content. In other words, to encode the data encoding symbols separately, but it is not always optimal by the method called lossless data compression. This compared to some other compression methods, such as term means that the methods of this type allow the e.g. arithmetic and Lempel-Ziv-Welch coding [4]. original data to be perfectly reconstructed from the However, the last two methods, as it has been said, are encoded message. patent-covered, so developers often tend to use Huffman Binary or entropy encoding are the most popular coding [3]. Comparative simplicity and high speed due branches among lossless data compression methods. The to lack of arithmetic calculations are the advantages of term entropy encoding means that the length of a code is this method as well. Due to these reasons Huffman approximately proportional to the negative logarithm of coding is often used for developing encoding engines for the occurrence of the character encoded with the code. many applications in various areas [5]. Simplifying it may be said: the higher probability of the The fact that billions of people are exchanging character is, the shorter is its code [1]. gigantic amount of data every day stimulated Several methods of entropy encoding exist, these are development of compression technologies. This topic the most frequently used methods of this branch: constantly presents significant interest for researches, the - arithmetic coding, following works may be presented as the examples: - range coding, Jarosław Duda et al. worked out the method called - Huffman coding, Asymmetric Numeral Systems, the method was - asymmetric Numeral Systems. developed on the basis of the two encoding methods: Arithmetic and Range coding are quite similar with arithmetic and Huffman coding. It combined the some differences [2], but arithmetic coding is covered by advantages from the two methods: patent, that is why, due to lack of patent coverage, * Corresponding author: [email protected] © The Authors, published by EDP Sciences. This is an open access article distributed under the terms of the Creative Commons Attribution License 4.0 (http://creativecommons.org/licenses/by/4.0/). ITM Web of Conferences 15, 01004 (2017) DOI: 10.1051/itmconf/20171501004 CMES’17 - near cur symbo probabilit hence etter Bu th compression atio o no show tual pace compression atio of ithmeti coding, nd pability saving esides f coded message h tab with h f encoding d ecoding f ffman ding 6]; code-character airs hould transmitted. r - Facebook standard orithm is ased n Z77 reason, th ther index should b troduced. Th ent- dictionary der d – effectiv entropy coding to-original-bits atio(SOBR) escribed cordance based n Huffman method 7]; with h formu below: - Bro coding gorithm which used mos of the modern nterne Browsers, uch Chrome, era. 2 Similarly, to h Zstandard is ased n e 푛 combination f LZ77 d modified ffman ding 8]. Th only way mak SOBR u to CR use So, may clearly seen hat esp ng history e on coding ee for essag bu this ee wi no b Huffman coding gorithm still resents grea teres optima since h character robabilit may differen for plication. in arious messages. 3 Huffman coding explained 3.1 Static Huffman coding algorithm Th concep of ffman ding s ased n binary tree. h tree nsists f wo inds f nodes: intermediate nodes, . nod having escendant nod and h nod which do o hav descendants. Th nod called leaves. haracter may stored only in leaf node, is ndition nsur th character cod be prefix-free [1]. means, no aracter h th code, at would th in segmen of other Fig. 1. uffman enerated based the phrase th n examp f a Huffman tree". character code. s h examp of prefix des, following t sequences may be used: n . Th cod 110 en to in segmen of h 3.2 Adaptive Huffman coding algorithm code 110101, hese wo odes may not be decoded unambiguously. h tree organized cording h Th method f aptiv Huffman Coding AHC) following rinciples 4]: reviewed h ar w proposed y effrey n a) any nod th ee may no hav ing descendant: his pe blished n 9]. either wo r on Th algorithm working uring eating tree may b) each od in h tree th number igned b observed ur 2. This umber lled weight. ending n th typ of th od weigh may hav h ollowing meanings - if the od is leaf, the weigh value ual h number f mes h character tored h leaf ccurs n th messag ent; - if h nod is termedia nod its weigh value is equ to sum f escendants' weights. c) th weigh f h righ escendan hould no than th weigh f th lef escendant. Fig. 2. The lgorithm AHC, example for encoding he or Th cod for ery aracter defined th path "abb". in h binary ee from h roo to leaf ntaining th character Figur 1), g. blank pac character This method ased n e am principles the which s h mos frequen haracter h tree h the s method lus h following xtensions 9]: cod 111, d e 'p character, which ccurs nly nce a) every od in the ee has s key number, he y has the ode [4]. numbers arranged n following way Th Figur 1 resents th tree nstructed n - maximum alu of th key number n h tree may e accordance with method f ffman ding. h calculated main eatur of is th tree nstructed efor th transmission started n th basis f nalysis f the (3 probabilit of epar characters n h who messag - th roo h퐷 푘larges 푘 key number;푘푘 푘 [1]. - an ncestor larger number an ny f Th d compression atio(DCR) s escribed y he descendants; formu 1): - th righ descendan should ave ger ey number than th lef escendant; 1 b) after pu of ny haracter tree pdated; 퐷 2 ITM Web of Conferences 15, 01004 (2017) DOI: 10.1051/itmconf/20171501004 CMES’17 - near cur symbo probabilit hence etter Bu th compression atio o no show tual pace c) spec od called NYT no y transmitted) compression atio of ithmeti coding, nd pability saving esides f coded message h tab with h used for oth ndicating th place or new aracter Listing 1. Pseudocode presenting dating ree HC. f encoding d ecoding f ffman ding 6]; code-character airs hould transmitted. r and or ignalizing a th new haracter btained, UpdateTree(character ch) - Facebook standard orithm is ased n Z77 reason, th ther index should b troduced. Th ent- key has th valu th ee, its weigh always 1 if ch is not found in the tree dictionary der d – effectiv entropy coding to-original-bits atio(SOBR) escribed cordance equ zero; 2 make a newLeafNode as the right descendant of the based n Huffman method 7]; with h formu below: d) a et of odes wit equal eight lues s alled a ock. oldNYTNode; - Bro coding gorithm which used mos of the Th encoding gorithm s escribed isting . h 3 newLeafNode's keyValue := oldNYTNode's modern nterne Browsers, uch Chrome, era. 2 algorithm ntains nly h process f pdating th tree keyValue-1; Similarly, to h Zstandard is ased n e 푛 and o no consider communication. f he 4 make a newNYTNode as the left descendant combination f LZ77 d modified ffman ding 8].