Efficient Coding of Information: Huffman Coding
Zurich Open Repository and Archive, University of Zurich
Year: 2006

Sridhara, Deepak

Abstract: In his classic paper of 1948, Claude Shannon considered the problem of efficiently describing a source that outputs a sequence of symbols, each associated with a probability of occurrence, and provided the theoretical limits of achievable performance. In 1951, David Huffman presented a technique that attains this performance. This article is a brief overview of some of their results.

Posted at the Zurich Open Repository and Archive, University of Zurich. ZORA URL: https://doi.org/10.5167/uzh-155710

Originally published at: Sridhara, Deepak (2006). Efficient coding of information: Huffman coding. Resonance, 11(2):51-73. DOI: https://doi.org/10.1007/bf02837275

About the author: Deepak Sridhara worked as a research associate in the Department of Mathematics at IISc between January and December 2004. Since January 2005, he has been a post-doctoral research associate in the Institute for Mathematics at the University of Zurich. His research interests include coding theory, codes over graphs and iterative techniques, and information theory.

Keywords: Entropy, source coding, Huffman coding.

1. Introduction

Shannon's landmark paper 'A Mathematical Theory of Communication' [1] laid the foundation for communication and information theory as they are perceived today. In [1], Shannon considers two particular problems: one of efficiently describing a source that outputs a sequence of symbols, each occurring with some probability, and the other of adding redundancy to a stream of equally-likely symbols so as to recover the original stream in the event of errors. The former is known as the source-coding or data-compression problem, and the latter is known as the channel-coding or error-control-coding problem. Shannon provided the theoretical limits of coding achievable in both situations, and in fact, he was also the first to quantify what is meant by the "information" content of a source as described here.

This article presents a brief overview of the source-coding problem and two well-known algorithms, discovered after [1], that come close to solving this problem. Of particular note is Huffman's algorithm, which turns out to be an optimal method to represent a source under certain conditions. To state the problem formally: suppose a source U outputs symbols u1, u2, ..., uK, belonging to some discrete alphabet 𝒰, with probabilities p(u1), p(u2), ..., p(uK), respectively. The source-coding problem is one of finding a mapping from 𝒰 to a set of D-ary strings¹, called codewords, such that this mapping is invertible. For the coding to be efficient, the aim is to find an invertible mapping (or, a code) that has the smallest possible average codeword-length (or, code-length).

¹ By a D-ary string we mean a sequence (y1, y2, ..., yn) where each yj belongs to the set {0, 1, 2, ..., D-1}.
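The requirement that the mapping be invertible can be made concrete with a small sketch. The Python fragment below is only an illustration added here (the three-symbol alphabet and the codewords are hypothetical, not taken from the article): it uses a prefix-free binary code, i.e., one in which no codeword is a prefix of another, which is one simple way of guaranteeing that a concatenation of codewords can be split back into the original symbols. (Prefix-free codes are taken up again in Section 5.)

    # A hypothetical invertible mapping from source symbols to binary (D = 2) codewords.
    # The code is prefix-free: no codeword is a prefix of another codeword.
    code = {'a': '0', 'b': '10', 'c': '11'}

    def encode(symbols):
        # Concatenate the codewords of the symbols into one binary string.
        return ''.join(code[s] for s in symbols)

    def decode(bits):
        # Invert the mapping: read bits until a complete codeword is recognized.
        inverse = {w: s for s, w in code.items()}
        symbols, buffer = [], ''
        for b in bits:
            buffer += b
            if buffer in inverse:
                symbols.append(inverse[buffer])
                buffer = ''
        return symbols

    message = list('abcab')
    assert decode(encode(message)) == message  # the mapping is invertible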
By the way this problem is stated, a source U can also be treated as a discrete random variable U that takes values from 𝒰 = {u1, u2, ..., uK} and that has a probability distribution given by p(U), where p(U = ui) = p(ui), for i = 1, 2, ..., K.

As an example, suppose a random variable U takes on four possible values u1, u2, u3, u4 with probabilities of occurrence p(u1) = 1/2, p(u2) = 1/4, p(u3) = 1/8, p(u4) = 1/8. A naive solution for representing the symbols u1, ..., u4 in terms of binary digits is to represent each symbol using two bits: u1 = 00, u2 = 01, u3 = 10, u4 = 11. The average code-length in this representation is two bits. However, this representation turns out to be a sub-optimal one. An efficient representation uses fewer bits to represent symbols occurring with a higher probability and more bits to represent symbols occurring with a lower probability. An optimal representation for this example is u1 = 0, u2 = 10, u3 = 110, u4 = 111. The average code-length in this representation is 1·(1/2) + 2·(1/4) + 3·(1/8) + 3·(1/8) = 1.75 bits.

Intuitively, the optimal number of bits needed to represent a source symbol ui with probability of occurrence pi is around log2(1/pi) bits. Further, since ui occurs with probability pi, the effective code-length of ui in such a code is pi log2(1/pi). Shannon's idea was to quantify the 'information' content of a source or random variable U in precisely these terms. Shannon defined the entropy of a source U to measure its information content, or equivalently, the entropy of a random variable U to measure its 'uncertainty', as

    H(U) = \sum_i p_i \log_2 (1/p_i).

The coding techniques discussed in this article are evaluated using this measure of 'information' content.
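As a quick check on the example above, the short Python sketch below (an illustration added here, not part of the original article) computes both the average code-length of the code {0, 10, 110, 111} and the entropy of the distribution (1/2, 1/4, 1/8, 1/8); both evaluate to 1.75 bits, so for this particular source the optimal code-length matches the entropy exactly.

    from math import log2

    # Source distribution and the optimal code from the example above.
    probs = {'u1': 0.5, 'u2': 0.25, 'u3': 0.125, 'u4': 0.125}
    codewords = {'u1': '0', 'u2': '10', 'u3': '110', 'u4': '111'}

    # Average code-length: sum of p(u) times the length of u's codeword.
    avg_length = sum(p * len(codewords[u]) for u, p in probs.items())

    # Entropy H(U) = sum of p(u) * log2(1/p(u)).
    entropy = sum(p * log2(1 / p) for p in probs.values())

    print(avg_length, entropy)  # both print 1.75 (bits)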
This article is organized as follows. Section 2 introduces some definitions and notation that will be used in this paper. Section 3 provides the notion of typical sequences that occur when a sequence of independent and identically distributed random variables is considered. This section also states the properties of typical sequences which are used to prove the achievability part of Shannon's source-coding theorem in Section 4. Section 5 introduces the notion of D-ary trees to illustrate the encoding of a source into sequences of D-ary alphabets. Another version of the source-coding theorem for the special case of prefix-free codes is presented here. A coding technique, credited to Shannon and Fano, which is a natural outcome of the proof of the source-coding theorem stated in Section 5, is also presented. Section 6 presents Huffman's optimal coding algorithm that achieves a lower average code-length than the Shannon-Fano algorithm. The paper is concluded in Section 7.

2. Preliminaries

Some basic definitions and notation that will be used in this article are introduced here. Let X be a discrete random variable taking values in a finite alphabet 𝒳. For each x ∈ 𝒳, let p(x) denote the probability that the random variable X takes the value x, written as Pr(X = x).

Definition 2.1. The expectation of a function f(X) defined over the random variable X is defined by

    E[f(X)] = \sum_{x \in \mathcal{X}} f(x) p(x).

Definition 2.2. The entropy H(X) of a discrete random variable X is defined by

    H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x).

To avoid ambiguity, the term p(x) log p(x) is defined to be 0 when p(x) = 0. By the previous definition, H(X) = -E[log p(X)]. For most of the discussion, we will use logarithms to the base 2, and therefore, the entropy is measured in bits. (However, when we use logarithms to the base D for some D > 2, the corresponding entropy of X is denoted as H_D(X) and is measured in terms of D-ary symbols.) Note that in the above expression, the summation is over all possible values x assumed by the random variable X, and p(x) denotes the corresponding probability of X = x.

We will use uppercase letters to indicate random variables, lowercase letters to indicate the actual values assumed by the random variables, and script letters (e.g., 𝒳) to indicate alphabet sets. The probability distribution of a random variable X will be denoted by p(X) or simply p. Extending the above definition to two or more random variables, we have

Definition 2.3. The joint entropy of a vector of random variables (X1, X2, ..., Xn) is defined by

    H(X_1, X_2, \ldots, X_n) = -\sum_{(x_1, \ldots, x_n) \in \mathcal{X}_1 \times \cdots \times \mathcal{X}_n} p(x_1, \ldots, x_n) \log p(x_1, \ldots, x_n),

where p(x1, x2, ..., xn) = Pr(X1 = x1, X2 = x2, ..., Xn = xn).

Definition 2.4. (a) A collection of random variables {X1, X2, ..., Xn}, with Xi taking values from an alphabet 𝒳, is called independent if for any choice {x1, x2, ..., xn} with Xi = xi, i = 1, 2, ..., n, the probability of the joint event {X1 = x1, ..., Xn = xn} equals the product \prod_{i=1}^{n} Pr(X_i = x_i).

(b) A collection of random variables {X1, X2, ..., Xn} is said to be identically distributed if 𝒳i = 𝒳, that is, all the alphabets are the same, and the probability Pr(Xi = x) is the same for all i.
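The following short Python sketch (added here as an illustration; the distributions are hypothetical and not from the article) evaluates Definitions 2.2 and 2.3 numerically. When the variables are independent in the sense of Definition 2.4(a), the joint distribution factors into the product of the marginals, and, as a standard consequence, the joint entropy equals the sum of the individual entropies; the sketch checks this for one small example.

    from math import log2
    from itertools import product

    def entropy(p):
        # H(X) = -sum_x p(x) log2 p(x), with p(x) log p(x) taken as 0 when p(x) = 0.
        return -sum(px * log2(px) for px in p.values() if px > 0)

    # Hypothetical marginal distributions for two random variables X1 and X2.
    p1 = {'a': 0.5, 'b': 0.5}
    p2 = {0: 0.25, 1: 0.75}

    # Independence (Definition 2.4(a)): each joint probability is the product of marginals.
    joint = {(x1, x2): p1[x1] * p2[x2] for x1, x2 in product(p1, p2)}

    # Joint entropy (Definition 2.3), computed directly from the joint distribution.
    h_joint = -sum(p * log2(p) for p in joint.values() if p > 0)

    # For independent variables the joint entropy is the sum of the marginal entropies.
    assert abs(h_joint - (entropy(p1) + entropy(p2))) < 1e-12
    print(h_joint)  # approximately 1.8113 bits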