
Compression Using Huffman Coding

Mamta Sharma

S.L. Bawa D.A.V. College

Abstract
Compression, also called source coding, is the process of encoding information using fewer bits than an uncoded representation would use, by employing specific encoding schemes. Compression is a technology for reducing the quantity of data used to represent any content without excessively reducing the quality of the picture. It also reduces the number of bits required to store and/or transmit digital media, and it makes storing large amounts of data easier. There are various techniques available for compression. In my paper work, I have analyzed the Huffman algorithm and compared it with other common compression techniques like Arithmetic, LZW and Run Length Encoding. JPEG 2000 is a wavelet-based standard; wavelets are functions that satisfy certain mathematical requirements and are used in representing data or other functions. LZW compression replaces strings of characters with single codes; compression occurs when a single code is output instead of a string of characters. The comparison between various compression algorithms helps us to choose a suitable technique for various applications.

Keywords: LZW, Huffman, DCT, RLE, JPEG, MPEG, Compression Formats, Quantization, Wavelets.

1. INTRODUCTION

Compression refers to reducing the quantity of data used to represent a file, image or content without excessively reducing the quality of the original data. It also reduces the number of bits required to store and/or transmit digital media. To compress something means that you have a piece of data and you decrease its size. There are different techniques for doing that, and they all have their own advantages and disadvantages. One trick is to reduce redundant information, meaning saving something once instead of six times. Another one is to find out which parts of the data are not really important and just leave those away. Arithmetic coding is a technique for lossless data compression; it is a form of variable-length entropy encoding. Arithmetic coding encodes the entire message into a single number, a fraction n where 0.0 ≤ n < 1.0, whereas other techniques separate the input message into its component symbols and replace each symbol with a code word. The run length encoding method is frequently applied to images (or pixels in a scan line) and is a small compression component used in JPEG compression. Huffman coding is a lossless data compression technique. Huffman coding is based on the frequency of occurrence of a data item, i.e., a pixel in images; the technique is to use a lower number of bits to encode into binary codes the data that occurs more frequently. It is used in JPEG files.

2. FUNDAMENTALS FOR COMPRESSION

2.1 Types
Lossy compression means that some data is lost when it is decompressed. Lossy compression is based on the assumption that the current data files save more information than human beings can "perceive"; thus the irrelevant data can be removed. Lossless compression means that when the data is decompressed, the result is a bit-for-bit perfect match with the original. The name lossless means "no data is lost": the data is only saved more efficiently in its compressed state, but nothing of it is removed.

2.2 Digital data representation
Digital data consist of a sequence of symbols chosen from a finite alphabet. In order for data compression to be meaningful, there is a standard representation for the uncompressed data that codes each symbol using the same number of bits. Compression is achieved when the data can be represented with an average length per symbol that is less than that of the standard representation. Therefore, for compression to be meaningful, a standard representation should be defined for the data to be compressed.

2.3 Color representation
An image consists of various colors of different intensities and brightness. Red, green and blue light sources form a set of primary colors; this is an additive system, since the presence of all the primary colors, all set to their maximum intensities, results in the perception of the color white. This phenomenon of color perception is caused by the way that the human eye detects and processes light, which makes it possible to represent an image as a set of three intensity signals in two spatial dimensions.

2.4 Digitization
In order to be processed by computers, an image that is captured by a light sensor must first be digitized. Digitization consists of three steps: 1) spatial sampling, 2) temporal sampling, and 3) quantization.

1) Spatial sampling: Spatial sampling consists of taking measurements of the underlying analog signal at a finite set of sampling points in a finite viewing area. The two-dimensional set of sampling points is transformed into a one-dimensional set through a process called raster scanning. The two main ways to perform raster scanning are progressive and interlaced.

2) Temporal sampling: This is performed mainly in motion estimation for video sequences. The human visual system is relatively slow in responding to temporal changes. By taking at least 16 samples per second at each grid point, an illusion of motion is maintained. This observation is the basis for motion picture technology, which typically performs temporal sampling at a rate of 24 frames/sec.

3) Quantization: After spatial and temporal sampling, the image consists of a sequence of continuous intensity values. The continuous intensity values are incompatible with digital processing, and one more step is needed before this information is processed by a digital computer. The continuous intensity values are converted to a discrete set of values in a process called quantization. Quantization can be viewed as a mapping from a continuous domain to a discrete range. A particular quantization mapping is called a quantizer.
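A minimal C sketch of a uniform quantizer, with illustrative names (not from the paper): it maps a continuous intensity value in [min, max] to one of a fixed number of discrete levels.

int quantize(double value, double min, double max, int levels)
{
    double step = (max - min) / levels;    /* width of one quantization bin */
    int q = (int)((value - min) / step);   /* index of the bin */
    if (q >= levels) q = levels - 1;       /* clamp the top edge of the range */
    if (q < 0) q = 0;
    return q;
}

For example, quantize(0.7, 0.0, 1.0, 256) maps the continuous intensity 0.7 to level 179 of an 8-bit range.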

2.5 Redundancy
Redundancy exists in two forms: spatial and temporal. The former, also called intraframe redundancy, refers to the redundancy that exists within a single frame of video or image, while the latter, also called interframe redundancy, refers to the redundancy that exists between consecutive frames within a video sequence.

2.6 Vector Quantization
In vector quantization, an image is segmented into same-sized blocks of pixel values. The blocks are represented by a fixed number of vectors called code words. The code words are chosen from a finite set called a codebook. The size of the codebook affects the rate as well as the distortion: a larger codebook lowers the distortion but raises the rate, while a smaller codebook has the opposite effects.

2.7 Discrete Cosine Transform
The discrete cosine transform (DCT) is a Fourier-related transform[41] similar to the discrete Fourier transform (DFT)[42], but using only real numbers. DCTs are equivalent to DFTs of roughly twice the length, operating on real data with even symmetry (since the Fourier transform of a real and even function is real and even), where in some variants the input and/or output data are shifted by half a sample. The DCT helps separate the image into parts (or spectral sub-bands) of differing importance with respect to the image's visual quality. Like the discrete Fourier transform, it transforms a signal or image from the spatial domain to the frequency domain.
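To make the definition concrete, here is a small C sketch of the direct O(N^2) one-dimensional DCT-II; JPEG applies a fast two-dimensional version of this transform to 8x8 blocks. The sample values are arbitrary.

#include <math.h>
#include <stdio.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Direct DCT-II: X[k] = sum over n of x[n] * cos(pi/N * (n + 0.5) * k) */
void dct_1d(const double *x, double *X, int N)
{
    for (int k = 0; k < N; k++) {
        double sum = 0.0;
        for (int n = 0; n < N; n++)
            sum += x[n] * cos(M_PI / N * (n + 0.5) * k);
        X[k] = sum;
    }
}

int main(void)
{
    double x[8] = {52, 55, 61, 66, 70, 61, 64, 73};  /* one row of pixel values */
    double X[8];
    dct_1d(x, X, 8);
    for (int k = 0; k < 8; k++)
        printf("X[%d] = %8.2f\n", k, X[k]);  /* energy concentrates in the low terms */
    return 0;
}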

3. COMPRESSION TECHNIQUES

3.1 SIMPLE REPETITION

If in a sequence a series of n successive tokens appears, we can replace these with one token and a count of the number of occurrences. We usually need a special flag to denote when the repeated token appears.

For example,

87900000000000000000000000000000000

can be replaced with

879f32

where f is the flag for zero. Compression savings depend on the content of the data.

Applications:
1. Suppression of zeros in a file (Zero Length Suppression).
2. Silence in audio data, or pauses in conversation etc.
3. Bitmaps
4. Blanks in text or program source files
5. Backgrounds in images
6. Other regular image or data tokens

3.2 RLE

This encoding method [34] is frequently applied to images (or pixels in a scan line). In this instance, sequences of image elements (X1, X2, ..., Xn) are mapped to pairs (c1, l1), (c2, l2), ..., (cn, ln), where ci represents an image intensity or color and li the length of the ith run of pixels (not dissimilar to zero length suppression above). The savings are dependent on the data; in the worst case (random noise) the encoding is heavier than the original file.

Applications: it is a small compression component used in JPEG compression.
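A minimal C sketch of this mapping, with illustrative names: it turns a scan line into (value, length) pairs, capping runs at 255 so each length fits in one byte.

/* Returns the number of (value, length) pairs written to vals/lens. */
int rle_encode(const unsigned char *in, int n,
               unsigned char *vals, unsigned char *lens)
{
    int i = 0, pairs = 0;
    while (i < n) {
        unsigned char v = in[i];
        int run = 1;
        while (i + run < n && in[i + run] == v && run < 255)
            run++;                        /* extend the current run */
        vals[pairs] = v;
        lens[pairs] = (unsigned char)run;
        pairs++;
        i += run;
    }
    return pairs;
}

On a line of random noise every run has length 1, so the output (two bytes per input byte) is larger than the input, as noted above.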

3.3 PATTERN SUBSTITUTION

This is a simple form of statistical encoding. Here we substitute a frequently repeating pattern (or patterns) with a code. The code is shorter than the pattern, giving us compression. More typically, tokens are assigned according to the frequency of occurrence of patterns:

• Count occurrences of tokens
• Sort in descending order
• Assign some symbols to the highest-count tokens

A predefined symbol table may be used, i.e., assign code i to token i.
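A brief C sketch of the idea, with an illustrative pattern and code byte (a real scheme would build its token table from the measured frequencies above):

#include <string.h>

/* Replace each occurrence of 'pat' in 'in' with the single byte 'code';
   returns the compressed length written to 'out'. */
int substitute(const char *in, int n, const char *pat, char code, char *out)
{
    int plen = (int)strlen(pat);
    int i = 0, o = 0;
    while (i < n) {
        if (i + plen <= n && memcmp(in + i, pat, plen) == 0) {
            out[o++] = code;      /* one short code replaces the whole pattern */
            i += plen;
        } else {
            out[o++] = in[i++];   /* pass all other bytes through unchanged */
        }
    }
    return o;
}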

3.4 ENTROPY ENCODING

Lossless compression frequently involves some form of entropy encoding and is based on information-theoretic techniques. Shannon is the father of information theory. Some entropy encoding techniques are discussed below.

3.5 THE SHANNON-FANO ALGORITHM

This is a basic information-theoretic algorithm. A simple example will be used to illustrate the algorithm:

Symbol   A    B    C    D    E
------------------------------
Count    15   7    6    6    5

It follows a top-down approach:
1. Sort symbols according to their frequencies/probabilities, e.g., ABCDE.
2. Recursively divide into two parts, each with approximately the same total count.

[Figure: the resulting Shannon-Fano code tree, with 0 labelling one branch and 1 the other at each split.]

Encoding for the Shannon-Fano Algorithm:

Symbol   Count   log2(1/p)   Code   Subtotal (# of bits)
A        15      1.38        00     30
B        7       2.48        01     14
C        6       2.70        10     12
D        6       2.70        110    18
E        5       2.96        111    15
                             TOTAL (# of bits): 89
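A minimal C sketch of this recursion, assuming the symbols are already sorted by decreasing count and splitting where the totals of the two halves are most nearly equal; with the counts above it reproduces the codes in the table:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NSYM 5

static const char sym[NSYM] = {'A', 'B', 'C', 'D', 'E'};
static const int count[NSYM] = {15, 7, 6, 6, 5};
static char code[NSYM][16];   /* code strings built up during the recursion */

static void shannon_fano(int lo, int hi)
{
    if (lo >= hi)
        return;
    int total = 0;
    for (int i = lo; i <= hi; i++)
        total += count[i];
    /* pick the split point that best balances the two halves */
    int acc = 0, split = lo, bestdiff = total;
    for (int i = lo; i < hi; i++) {
        acc += count[i];
        int diff = abs(2 * acc - total);   /* |left total - right total| */
        if (diff < bestdiff) { bestdiff = diff; split = i; }
    }
    for (int i = lo; i <= hi; i++)
        strcat(code[i], i <= split ? "0" : "1");
    shannon_fano(lo, split);
    shannon_fano(split + 1, hi);
}

int main(void)
{
    shannon_fano(0, NSYM - 1);
    for (int i = 0; i < NSYM; i++)
        printf("%c: %s\n", sym[i], code[i]);   /* A:00 B:01 C:10 D:110 E:111 */
    return 0;
}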

3.6 HUFFMAN CODING

Huffman coding[24] is based on the frequency of occurrence of a data item (a pixel in images). The principle is to use a lower number of bits to encode the data that occurs more frequently. Codes are stored in a Code Book, which may be constructed for each image or for a set of images. In all cases the code book plus encoded data must be transmitted to enable decoding.

3.7 ADAPTIVE HUFFMAN CODING

The key is to have both encoder and decoder use exactly the same initialization and update-model routines. The update model does two things: (a) increment the count, (b) update the Huffman tree.[25] During the updates, the Huffman tree maintains its sibling property, i.e., the nodes (internal and leaf) are arranged in order of increasing weight (see figure).


When swapping is necessary, the farthest node with weight W is swapped with the node whose weight has just been increased to W+1. Note: if the node with weight W has a subtree beneath it, then the subtree goes with it. The Huffman tree could look very different after node swapping.

[Figure: an example adaptive Huffman tree whose numbered nodes (leaves A-E and the internal nodes above them) appear in order of increasing weight, illustrating the sibling property.]

3.8 ARITHMETIC CODING

Huffman coding and the like use an integer number (k) of bits for each symbol, hence k is never less than 1. Sometimes, e.g., when sending a 1-bit image, compression becomes impossible.[15] Arithmetic coding, by contrast, encodes the entire message into a single fraction (as described in the Introduction) and can therefore spend a fractional number of bits per symbol on average.

3.9 LZW

LZW compression replaces strings of characters with single codes. It does not do any analysis of the incoming text. Instead, it just adds every new string of characters it sees to a table of strings. Compression occurs when a single code is output instead of a string of characters.

The code that the LZW algorithm outputs can be of any arbitrary length, but it must have more bits in it than a single character. The first 256 codes (when using eight-bit characters) are by default assigned to the standard character set. The remaining codes are assigned to strings as the algorithm proceeds. The sample program runs as shown with 12-bit codes: codes 0-255 refer to individual bytes, while codes 256-4095 refer to substrings.

LZW compression works best for files containing lots of repetitive data. This is often the case with text and monochrome images. Files that are compressed but that do not contain any repetitive information at all can even grow bigger!

Advantages and Disadvantages: LZW compression is fast, but royalties have to be paid to use LZW compression algorithms within applications (see below).

Applications: LZW compression can be used in a variety of file formats[30]:
• TIFF files
• GIF files
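An illustrative C sketch of the encoder just described, using 12-bit codes (256 single-byte codes plus a growing string table). The table is searched linearly for clarity where a real implementation would hash, codes are emitted as plain integers rather than packed 12-bit fields, and all names are made up for the example.

#include <stdio.h>

#define TABLE_SIZE 4096

static int prefix[TABLE_SIZE];            /* code of the string's prefix */
static unsigned char suffix[TABLE_SIZE];  /* byte appended to that prefix */
static int next_code = 256;               /* codes 0-255 are single bytes */

static int find(int pfx, unsigned char c)
{
    for (int i = 256; i < next_code; i++)
        if (prefix[i] == pfx && suffix[i] == c)
            return i;
    return -1;   /* string not yet in the table */
}

void lzw_encode(const unsigned char *in, int n)
{
    if (n <= 0)
        return;
    int cur = in[0];
    for (int i = 1; i < n; i++) {
        int code = find(cur, in[i]);
        if (code >= 0) {
            cur = code;                    /* extend the current string */
        } else {
            printf("%d ", cur);            /* emit code for the current string */
            if (next_code < TABLE_SIZE) {  /* add the extended string */
                prefix[next_code] = cur;
                suffix[next_code] = in[i];
                next_code++;
            }
            cur = in[i];
        }
    }
    printf("%d\n", cur);                   /* flush the final string */
}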

4. HUFFMAN ALGORITHM

Huffman coding is an entropy encoding algorithm used for lossless data compression in computer science and information theory. The term refers to the use of a variable-length code table for encoding a source symbol (such as a character in a file), where the variable-length code table has been derived in a particular way based on the estimated probability of occurrence for each possible value of the source symbol.

Huffman coding uses a specific method for choosing the representation for each symbol, resulting in a prefix-free code (that is, the bit string representing some particular symbol is never a prefix of the bit string representing any other symbol) that expresses the most common characters using shorter strings of bits than are used for less common source symbols. Huffman was able to design the most efficient compression method of this type: no other mapping of individual source symbols to unique strings of bits will produce a smaller average output size when the actual symbol frequencies agree with those used to create the code. A method was later found to do this in linear time if the input probabilities (also known as weights) are sorted.

For a set of symbols with a uniform probability distribution and a number of members which is a power of two, Huffman coding is equivalent to simple binary block encoding[40], e.g., ASCII coding.


BASIC TECHNIQUE

Assume you have a source generating 4 different symbols {a1, a2, a3, a4} with probabilities {0.4, 0.35, 0.2, 0.05}. Generate a binary tree from left to right, taking the two least probable symbols and putting them together to form another equivalent symbol having a probability that equals the sum of the two: here a4 and a3 combine into a node of probability 0.25, that node and a2 combine into one of probability 0.6, and finally that node and a1 combine into the root of probability 1.0. Keep doing this until you have just one symbol. Then read the tree backwards, from right to left, assigning different bits to different branches.

[Figure: the resulting tree for a1:0.4, a2:0.35, a3:0.2, a4:0.05, with internal nodes of weight 0.25 and 0.6 and the code bits labelling the branches.]

The final Huffman code is:

SYMBOL   CODE
a1       0
a2       10
a3       111
a4       110
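As a check, the expected code length is 0.4(1) + 0.35(2) + 0.2(3) + 0.05(3) = 1.85 bits per symbol, against the 2 bits per symbol that a fixed-length code for four symbols would require.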

The technique works by creating a binary tree of nodes. These can be stored in a regular array, the size of which depends on the number of symbols (N). A node can be either a leaf node or an internal node. Initially, all nodes are leaf nodes, which contain the symbol itself, the weight (frequency of appearance) of the symbol and, optionally, a link to a parent node which makes it easy to read the code (in reverse) starting from a leaf node. Internal nodes contain a symbol weight, links to two child nodes and the optional link to a parent node. As a common convention, bit '0' represents following the left child and bit '1' represents following the right child. A finished tree has N leaf nodes and N−1 internal nodes.

A linear-time method to create a Huffman tree is to use two queues, the first one containing the initial weights (along with pointers to the associated leaves), with combined weights (along with pointers to the trees) being put in the back of the second queue. This assures that the lowest weight is always kept at the front of one of the two queues.

Creating the tree:
1. Start with as many leaves as there are symbols.
2. Enqueue all leaf nodes into the first queue (by probability in increasing order, so that the least likely item is at the head of the queue).
3. While there is more than one node in the queues:
   1. Dequeue the two nodes with the lowest weight.
   2. Create a new internal node, with the two just-removed nodes as children (either node can be either child) and the sum of their weights as the new weight.
   3. Enqueue the new node into the rear of the second queue.
4. The remaining node is the root node; the tree has now been generated.

This is implemented using the following function, a cleaned-up version of the paper's C sketch. It assumes <stdio.h> and <stdlib.h> are included and that the following are declared at file scope: the log file fpp; the node pointers ptr1, ptr2 and root; TRUE/FALSE macros; and a struct chardata with fields up, left, right, charnum and frequency, where find_lowest_freqs() sets ptr1 and ptr2 to the two unattached nodes of lowest frequency and the frequencies are normalized to sum to 1.


short create_tree()
{
    void find_lowest_freqs(void);
    short only_one_up_ptr_left(void);
    double maxfreq = 0;
    struct chardata *new_node = NULL;

    fprintf(fpp, "Creating tree from frequencies...");
    while (maxfreq < 0.99999)   /* stop once the root's frequency reaches 1.0 */
    {
        find_lowest_freqs();    /* sets ptr1 and ptr2 */
        if ((new_node = (struct chardata *)malloc(sizeof(struct chardata))) == NULL)
        {
            fprintf(fpp, "Insufficient memory, malloc() failed in create_tree().");
            return FALSE;
        }
        /* make the two lowest-frequency nodes children of the new node */
        new_node->up = NULL;
        new_node->left = ptr2;
        new_node->right = ptr1;
        new_node->charnum = -1;   /* internal node, not a character */
        ptr1->up = new_node;
        ptr2->up = new_node;
        new_node->frequency = ptr1->frequency + ptr2->frequency;
        maxfreq = new_node->frequency;
#ifdef VERBOSE
        fprintf(fpp, "Newly created freq == %f\n", maxfreq);
#endif
        root = new_node;
    }
    if (only_one_up_ptr_left())
    {
        fprintf(fpp, "Done creating tree.");
#ifdef VERBOSE
        fprintf(fpp, "Win: apparently only one remaining up-pointer.");
#endif
    }
    else
    {
        fprintf(fpp, "Lose: apparently more than one remaining up-pointer.");
        return FALSE;
    }
    return TRUE;
}
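As a small companion sketch (assuming the same struct chardata fields), the optional parent links let the encoder recover a symbol's code by walking from its leaf to the root, emitting '0' for a left child and '1' for a right child, and then reversing the bits:

void get_code(const struct chardata *leaf, char *buf)
{
    char tmp[256];   /* ample for any code over a byte-sized alphabet */
    int n = 0;
    const struct chardata *node = leaf;
    while (node->up != NULL) {
        tmp[n++] = (node == node->up->left) ? '0' : '1';
        node = node->up;
    }
    for (int i = 0; i < n; i++)   /* bits were collected leaf-to-root */
        buf[i] = tmp[n - 1 - i];
    buf[n] = '\0';
}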

It is generally beneficial to minimize the variance of codeword length. For example, a communication buffer receiving Huffman-encoded data may need to be larger to deal with especially long symbols if the tree is especially unbalanced. To minimize variance, simply break ties between queues by choosing the item in the first queue. This modification will retain the mathematical optimality of the Huffman coding while both minimizing variance and minimizing the length of the longest character code. The method is linear-time assuming that you already have the leaf nodes sorted by initial weight; if not, sorting them will take O(n log n) time.

4.1 MAIN PROPERTIES

1. Unique prefix property: no code is a prefix of any other code (all symbols are at the leaf nodes), which makes decoding unambiguous.
2. If prior statistics are available and accurate, then Huffman coding is very good.
3. The frequencies used can be generic ones for the application domain that are based on average experience, or they can be the actual frequencies found in the text being compressed.
4. Huffman coding is optimal when the probability of each input symbol is a negative power of two.
5. The worst case for Huffman coding can happen when the probability of a symbol exceeds 2^-1 = 0.5, making the upper limit of inefficiency unbounded.


These situations often respond well to a form of blocking called run-length encoding.

4.2 APPLICATIONS

1. Arithmetic coding can be viewed as a generalization of Huffman coding; indeed, in practice arithmetic coding is often preceded by Huffman coding, as it is easier to find an arithmetic code for a binary input than for a nonbinary input.
2. Huffman coding is in wide use because of its simplicity, high speed and lack of encumbrance by patents.
3. Huffman coding today is often used as a "back-end" to some other compression method: DEFLATE (PKZIP's algorithm) and multimedia codecs such as JPEG and MP3 have a front-end model and quantization followed by Huffman coding.

4.3 ADVANTAGES

• The algorithm is easy to implement.
• It produces a lossless compression of images.

4.4 DISADVANTAGES

• Efficiency depends on the accuracy of the statistical model used and the type of image.
• The algorithm varies with different formats, but few get any better than 8:1 compression.
• Compression of image files that contain long runs of identical pixels is not as efficient with Huffman coding as with RLE.
• The Huffman encoding process is usually done in two passes. During the first pass, a statistical model is built, and then in the second pass the image data is encoded based on the generated model. From here we can see that Huffman encoding is a relatively slow process, as time is required to build the statistical model in order to achieve an efficient compression rate.
• Another disadvantage of Huffman coding is that all codes of the encoded data are of different sizes (not of fixed length). Therefore it is very difficult for the decoder to know that it has reached the last bit of a code, and the only way for it to know is by following the paths of the upside-down tree and coming to the end of a branch. Thus, if the encoded data is corrupted with additional bits added or bits missing, whatever is decoded will be wrong values, and the final image displayed will be garbage.
• It is required to send the Huffman table at the beginning of the compressed file; otherwise the decompressor will not be able to decode it. This causes overhead.

4.5 ADAPTIVE HUFFMAN CODING

Adaptive Huffman coding (Dynamic Huffman coding) is an adaptive coding technique based on Huffman coding: it builds the code as the symbols are being transmitted, with no initial knowledge of the source distribution, which allows one-pass encoding and adaptation to changing conditions in the data. The benefit of the one-pass procedure is that the source can be encoded in real time, though it becomes more sensitive to transmission errors, since just a single loss ruins the whole code.

CONCLUSIONS

I have studied various techniques for compression and compared them on the basis of their use in different applications and their advantages and disadvantages. I have concluded that arithmetic coding is very efficient, representing more frequently occurring sequences of pixels with fewer bits, and reduces the file size dramatically. RLE is simple to implement and fast to execute. The LZW algorithm is better to use for TIFF, GIF and textual files; it is an easy-to-implement, fast and lossless algorithm, whereas the Huffman algorithm is used in JPEG compression: it produces optimal and compact code but is relatively slow. The Huffman algorithm is based on a statistical model, which adds to the overhead.

The above discussed algorithms use lossless compression techniques. The JPEG technique, which is used mostly for image compression, is a lossy compression technique; JPEG 2000 is an advancement of the JPEG standard which uses wavelets.

Future Scope and Work

With the advancements in compression technology, it is now very easy and efficient to compress video files. Various video compression techniques are available. The most common video compression standard is MPEG (Moving Picture Experts Group)[31]. It is a working group of ISO/IEC charged with the development of video and audio encoding standards; MPEG's official designation is ISO/IEC JTC1/SC29 WG11. Many advances are being made for improving video quality. Advancements in the MPEG standard are MPEG-1 (MP3)[33], MPEG-2, MPEG-3, MPEG-4 (MPEG-4 Part 2 or Advanced Simple Profile)


and MPEG-4 Part 10 (or Advanced Video Coding, H.264). MPEG-7 is a formal system for describing multimedia content, and MPEG-21 describes a multimedia framework. The MPEG standard is very efficient for use on DVDs. Various H.261 standards could be used in the future for advancements in video-conferencing technology. Other applied fields that are making use of wavelets in the coming future include astronomy, acoustics, nuclear engineering, sub-band coding, signal and image processing, neurophysiology, music, magnetic resonance imaging, speech discrimination, optics, fractals, turbulence, earthquake prediction, radar, human vision, and pure mathematics applications such as solving partial differential equations.

Comparison of common algorithms with Huffman Coding

HUFFMAN CODING
  Advantages: 1. Easy to implement. 2. Produces optimal and compact code. 3. Lossless technique.
  Disadvantages: 1. Relatively slow. 2. Depends upon the statistical model of the data. 3. Decoding is difficult due to different code lengths. 4. Overhead due to the Huffman tree.
  Applications: Used in JPEG.

ARITHMETIC CODING
  Advantages: 1. Efficiently represents more frequently occurring sequences of pixel values with fewer bits. 2. Reduces file size dramatically. 3. Lossless compression.
  Disadvantages: 1. It is a paid algorithm (protected by patent). 2. Statistical technique.
  Applications: Used mostly for frequently occurring sequences of pixels.

RUN LENGTH CODING
  Advantages: 1. Simple to implement. 2. Fast to execute. 3. Lossless technique.
  Disadvantages: 1. Compression ratio is low as compared to other algorithms.
  Applications: Used mostly for TIFF, BMP and PCX files.

LZW
  Advantages: 1. Easy to implement. 2. Fast compression. 3. Dictionary-based compression.
  Disadvantages: 1. Management of the string table is difficult. 2. Amount of storage needed is indeterminate. 3. Royalties have to be paid to use LZW compression.
  Applications: Used in TIFF and GIF files.

References
[1] Darrel Hankersson, Greg A. Harris, and Peter D. Johnson Jr. Introduction to Information Theory and Data Compression. CRC Press, 1997.
[2] Gilbert Held and Thomas R. Marshall. Data and Image Compression: Tools and Techniques.
[3] http://en.wikipedia.org/wiki/Discrete_Fourier_transform
[4] Terry Welch, "A Technique for High-Performance Data Compression", Computer, June 1984.
[5] http://www.The LZW compression algorithm.htm
[6] Dzung Tien Hoang and Jeffery Scott Vitter. Fast and Efficient Algorithms for Video Compression and Rate Control, June 20, 1998.
[7] http://www.Howstuffworks How File Compression Works.htm
[8] http://www.H_261 - Wikipedia, the free encyclopedia.htm
[9] http://en.wikipedia.org/wiki/Entropy_encoding
[10] http://www.JPEG - Wikipedia, the free encyclopedia.htm
[11] http://www.rz.go.dlr.de:8081/info/faqs/graphics/jpeg1.html
[12] http://www.amara.com/IEEEwave/IEEEwavelet.html
[13] http://cm.bell-labs.com/who/jelena/Wavelet/w_applets.html
[14] http://en.wikipedia.org/wiki/Lossless_JPEG
[15] http://en.wikipedia.org/wiki/Arithmetic_coding
[16] http://fly.cc.fer.hr/~eddie/Sem_DPCM.htm
[17] J. Ziv and A. Lempel, "A Universal Algorithm for Sequential Data Compression", IEEE Transactions on Information Theory, May 1977.
[18] Rudy Rucker, "Mind Tools", Houghton Mifflin Company, 1987.
[19] http://www.dogma.net/markn/index.html
[20] Huffman's original article: D.A. Huffman, "A Method for the Construction of Minimum-Redundancy Codes" (PDF), Proceedings of the I.R.E., Sept. 1952, pp. 1098-1102.
[21] Background story: Profile: David A. Huffman, Scientific American, Sept. 1991, pp. 54-58.
[22] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Second Edition. MIT Press and McGraw-Hill, 2001. ISBN 0-262-03293-7. Section 16.3, pp. 385-392.
[23] http://www.Huffman coding - Wikipedia, the free encyclopedia.htm
[24] http://en.wikipedia.org/wiki/Huffman_coding
[25] http://en.wikipedia.org/wiki/Adaptive_Huffman_coding
[26] Compressed Image File Formats: JPEG, PNG, GIF, XBM, BMP, John Miano, August 1999.
[27] Introduction to Data Compression, Khalid Sayood, Ed Fox (Editor), March 2000.

IJCSNS International Journal of Computer Science and Network Security, VOL.10 No.5, May 2010 141

[28] Managing Gigabytes: Compressing and Indexing Documents and Images, Ian H. Witten, Alistair Moffat, Timothy C. Bell, May 1999.
[29] Digital Image Processing, Rafael C. Gonzalez, Richard E. Woods, November 2001.
[30] http://en.wikipedia.org/wiki/Comparison_of_graphics_file_formats
[31] Dzung Tien Hoang and Jeffery Scott Vitter. Fast and Efficient Algorithms for Video Compression and Rate Control, June 20, 1998.
[32] V. Bhaskaran and K. Konstantinides. Image and Video Compression Standards. Kluwer Academic Publishers, Boston, MA, 1995.
[33] http://mp3.about.com
[34] http://en.wikipedia.org/wiki/Information_theory
[35] http://en.wikipedia.org/wiki/Entropy_encoding
[36] http://en.wikipedia.org/wiki/Lossless_data_compression
[37] http://en.wikipedia.org/wiki/Variable-length_code
[38] http://www.pha.jhu.edu/~sundar/intermediate/history.html
[39] http://myhome.hanafos.com/~soonjp/vchx.html
[40] http://en.wikipedia.org/wiki/Block_code
[41] http://www/wavlets/An Introduction to Wavelets Historical Perspective.htm

Mamta Sharma: B.Sc. (CS), M.Sc. (Maths), M.C.A., M.Phil. (CS), M.Tech. (in process).