INTEGRATED RESEARCH JOURNAL OF MANAGEMENT, SCIENCE AND INNOVATION

www.IRJMSI.com

Published by iSaRa Solutions

IRJMSI Vol 2 Issue 1 [Year 2015]

DATA COMPRESSION TECHNIQUES

Shubham Dixit, IMS Engineering College, Ghaziabad. [email protected]
Akansha Srivastava, IMS Engineering College, Ghaziabad. [email protected]
Aayushi Mishra, IMS Engineering College, Ghaziabad. [email protected]

ABSTRACT

Data compression is a process of minimizing the size of original data so that information can be encoded using fewer bits than in the original representation. In this paper, we consider the basic existing data compression techniques. We recommend an improved dynamic bit reduction algorithm which is based on the number of unique symbol occurrences in the input string. The improved dynamic bit reduction algorithm improves the compression ratio as well as the memory saving percentage for text data compared to existing techniques.

General Terms
Methodology for data compression

Keywords
DSAD, JPEG, MPEG

INTRODUCTION

A process by which a file (text, audio, etc.) may be converted to a compressed file, such that the original file can be fully recovered from the compressed file without any loss of actual information, is known as a data compression technique. It is useful for saving storage space. For example, if a person wants to store a 4 MB file, it is better to compress the file and save it at a smaller size. Compressed files are also more easily exchanged, because they are uploaded and downloaded much faster. It is required that the original file can be reconstructed from the compressed file at any time. This method of encoding, which allows a considerable reduction in the total number of bits needed to store or transmit a file, is called data compression. The more information in a file, the greater its storage space and transmission costs. So we can say data compression is the process of encoding data in fewer bits so that it takes less storage space and less conveyance time while communicating.

Data compression is done because most real-world data is surplus (redundant). It is a technique which reduces the size of data by applying different methods, which can be either lossy or lossless. A compression program can be used to convert data from an easy format to a compact form. In the same way, an uncompressing program returns the information to its original form.

Fig.1 Compression and Decompression (INPUT -> COMPRESSION -> NETWORK -> DECOMPRESSION -> OUTPUT)

2. TYPES OF DATA COMPRESSION

There are basically two classes of data compression, applied in different areas. One is lossy data compression, which is used to compress image data files for communication or archival purposes. The second is lossless data compression, which is used to transmit or archive text or binary files that must keep their information intact at all times.

The two types of data compression are described below.
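The lossless round-trip just described (compress, store or transmit, decompress, recover the original exactly) can be demonstrated with Python's standard-library zlib module. This is only an illustration of the general idea, not one of the techniques surveyed below:

```python
import zlib

# Redundant text compresses well; random data would not.
original = b"AAAA BBBB AAAA BBBB " * 200

compressed = zlib.compress(original, 9)   # level 9 = maximum compression
restored = zlib.decompress(compressed)

assert restored == original               # lossless: exact recovery
print(len(original), "->", len(compressed), "bytes")
```

Note that the same call on incompressible (e.g. random) input can produce output slightly larger than the input, which is why redundancy in the data is essential.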

INTEGRATED RESEARCH JOURNAL OF MANAGEMENT, SCIENCE AND INNOVATION ( IRJMSI) Page 11 www.irjmsi.com IRJMSI Vol 2 Issue 1 [Year 2015]

2.1 Lossy data compression

This method is used where the data retrieved after decompression may not be the same as the original data, but can still be useful for a specific purpose. After lossy data compression is applied to a message, the message can never be recovered exactly as it was before compression. When the compressed message is decoded, it will not give back the original message: data is lost, since lossy compression cannot be decoded to yield the original message. Thus this method is not a good method of compression for analytical data. It can be used for Digitally Sampled Analog Data (DSAD), which consists of sound, video, graphics, or picture files.

Fig.2 Lossy Data Compression (ORIGINAL TEXT DATA -> ENCODER -> COMPRESSED DATA -> DECODER -> RESTORED TEXT DATA)

Examples of uses of lossy data compression are streaming media and telephony applications. Further examples of lossy formats are JPEG, MPEG, and MP3. Most lossy data compression techniques suffer from generation loss, i.e., decreasing quality caused by repeatedly compressing and decompressing a file. Lossy compression is widely used in digital cameras to increase storage capacity with minimal degradation of picture quality.

2.2 Lossless data compression

This is a technique which allows data compression algorithms to compress text data and also to recover the exact original data from the compressed data. The popular ZIP file format, which is used for compressing data files, is an application of lossless data compression. Lossless compression is used when the original data and the decompressed data have to be identical. Lossless text data compression usually exploits statistical redundancy so as to represent the sender's data more concisely, without any loss of the information contained in the input. Since most real-world data has statistical redundancy, lossless data compression is possible. For example, in English text the letter 'a' is much more common than the letter 'z'; this type of redundancy can be removed by lossless compression. These methods can be classified according to the type of data they compress. Compression algorithms are basically used for the compression of text, images, and sound. Most lossless compression programs use two different kinds of algorithms: one that generates a statistical model for the input data, and another which maps the input data to bit strings using this model, in such a way that more frequently encountered data produces shorter output than less frequent data.

Fig.3 Lossless Data Compression (ORIGINAL TEXT DATA -> ENCODER -> COMPRESSED DATA -> DECODER -> DECOMPRESSED TEXT DATA, SAME AS ORIGINAL)

An advantage of lossless over lossy data compression is that lossless compression results are identical to the original data. The performance of algorithms can be compared using parameters like compression ratio and saving percentage. In lossless data compression the original message can be exactly decoded: repeated patterns in a message are found and encoded systematically. For this reason lossless data compression is also referred to as redundancy reduction; because it depends on patterns in the message, it does not work well on random messages.

3. EXISTING LOSSLESS DATA COMPRESSION TECHNIQUES

The data compression techniques are as follows:

3.1 Bit Reduction algorithm

The main idea is to reduce the standard 7-bit encoding to some application-specific 5-bit encoding system, packing the codes into a byte array. This method reduces the size of a string when it is lengthy, and the compression ratio is not affected by the content of the string. The steps of the Bit Reduction algorithm are:

 Select the frequently occurring characters of the text file which are to be encoded, and obtain their ASCII codes.
 Obtain the corresponding binary code of these ASCII codes for each character.
 Put these binary numbers into an 8-bit array.
 Remove the extra bits from each binary number, i.e., the extra 3 bits from the front.
 Re-arrange these into an array of bytes and


maintain the array.
 Then the final text is encoded, and compression is achieved.
 Decompression is performed in the reverse order at the client side.

3.2 Huffman coding

Huffman coding deals with data compression of ASCII characters. It follows a top-down approach: the binary tree is built from the top down to generate an optimal result. In Huffman coding the characters in a data file are converted to binary codes, where the most common characters in the file have the shortest binary codes and the least common characters have the longest. The Huffman code is determined by constructing a binary tree whose leaves represent the characters to be encoded. Every node contains the relative probability of appearance of the characters belonging to the subtree beneath the node. The edges are labeled with the bits 0 and 1. The algorithm is:

1. Parse the input and count the appearance of each symbol.
2. Determine the probability of appearance of each symbol using the symbol count.
3. Sort the symbols according to their probability of appearance.
4. Generate a leaf node for each symbol.
5. Take the two least frequent nodes, group them together, and obtain their combined frequency; this leads to the construction of a binary tree.
6. Repeat step 5 until all elements are combined into a single tree.
7. Label the edge from each parent to its left child with the digit 0 and the edge to its right child with 1. Tracing down the tree gives the "Huffman codes", in which the shortest codes are assigned to the characters with the largest frequency.

3.3 Run-length Encoding

Data often contains sequences of identical bytes. If we replace these repeated byte sequences with the number of occurrences, a data reduction occurs; this is known as Run-length Encoding (RLE). It is a simple data compression algorithm, supported most effectively by bitmap file formats such as BMP, in which the physical size of a repeating string of characters is reduced. This repeating string, typically encoded into two bytes, is called a run. The first byte stands for the total number of characters in the run, popularly called the run count. RLE substitutes runs of two or more of the same character with a number representing the length of the run, followed by the original character; single characters are coded as runs of 1.

RLE is a generalization of zero suppression, which assumes that just one symbol appears particularly often in sequences. The blank (space) character in text is such a symbol: single blanks or pairs of blanks are simply ignored, while, starting with sequences of three bytes, the run is replaced by an M-byte and a byte identifying the number of blanks in the sequence. RLE is useful where the redundancy of the data is high, and it can also be used in combination with other compression techniques.

Here is an example of RLE:

Input: SSSSAAAIIIIIIPPEEEEEERRRRRRRRRRRRRR
Output: 4S3A6I2P6E14R

The drawback of RLE is that it cannot attain the high compression ratios of more advanced compression methods, but it is easy to implement and quick to execute, which makes it a good alternative to a complex compression algorithm.

3.4 Shannon-Fano coding

This is one of the earliest techniques for data compression, invented by Claude Shannon and Robert Fano in 1949. In this technique, a binary tree is generated which represents the probability of each symbol occurring. The symbols are ordered so that the most frequent symbols appear at the top of the tree and the least frequent symbols appear at the bottom. The algorithm for Shannon-Fano coding is:

1. Parse the input and count the occurrence of each symbol.
2. Determine the probability of appearance of each symbol using the symbol count.
3. Sort the symbols according to their probability of appearance, with the most probable at the top.
4. Generate a leaf node for each symbol.
5. Divide the list in two, keeping the total probability of the left branch almost equal to that of the right branch.
6. Prepend 0 to the codes in the left half and 1 to the codes in the right half.
7. Recursively apply steps 5 and 6 to the left and right halves until each node in the tree is a leaf.

Shannon-Fano coding does not guarantee the generation of an optimal code. The Shannon-Fano algorithm is more efficient when the probabilities are closer to inverses of powers of 2.
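The 5-bit packing idea of section 3.1 can be sketched as follows. The restriction to lowercase letters plus space is an assumption made here for illustration; the paper leaves the application-specific alphabet open:

```python
# Sketch of the Bit Reduction idea: drop 3 of the 8 bits of each character
# by mapping a small alphabet to 5-bit codes (2**5 = 32 possible symbols).
# ALPHABET is an assumed, application-specific choice.
ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def pack5(text: str) -> bytes:
    bits = ""
    for ch in text:
        bits += format(ALPHABET.index(ch), "05b")  # 5-bit code per symbol
    bits += "0" * (-len(bits) % 8)                 # pad to a whole byte
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

def unpack5(data: bytes, n_symbols: int) -> str:
    # The symbol count must be transmitted alongside the packed bytes.
    bits = "".join(format(b, "08b") for b in data)
    return "".join(ALPHABET[int(bits[i * 5:i * 5 + 5], 2)]
                   for i in range(n_symbols))

packed = pack5("hello world")
assert unpack5(packed, 11) == "hello world"
assert len(packed) < len("hello world")            # 7 bytes instead of 11
```

As the paper notes, the saving comes purely from the shorter per-symbol code, so the ratio does not depend on the content of the string, only on its length.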

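A minimal sketch of the Huffman procedure of section 3.2, using a heap to repeatedly merge the two least frequent subtrees; details such as tie-breaking are choices made here, not prescribed by the text:

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict:
    freq = Counter(text)                          # step 1: count symbols
    # Heap entries: (frequency, tie-breaker, tree); a tree is a symbol
    # or a (left, right) pair.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:                          # step 5/6: merge least frequent
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next_id, (t1, t2)))
        next_id += 1
    codes = {}
    def walk(tree, prefix):                       # step 7: label edges 0/1
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"           # single-symbol edge case
    walk(heap[0][2], "")
    return codes

codes = huffman_codes("abracadabra")
# 'a' is the most frequent symbol, so it receives the shortest code.
assert len(codes["a"]) == min(len(c) for c in codes.values())
```

Because the codes form the leaves of one tree, no code is a prefix of another, which is what makes the encoded bit stream decodable without separators.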

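The Shannon-Fano procedure of section 3.4 can be sketched as follows; the balanced-split heuristic (choosing the cut that minimizes the weight difference between the halves) is one reasonable reading of step 5:

```python
from collections import Counter

def shannon_fano(text: str) -> dict:
    # Steps 1-4: count, sort by probability (most frequent first).
    symbols = sorted(Counter(text).items(), key=lambda kv: -kv[1])
    codes = {}
    def split(group, prefix):
        if len(group) == 1:
            codes[group[0][0]] = prefix or "0"
            return
        # Step 5: cut so the two halves have nearly equal total weight.
        total = sum(f for _, f in group)
        acc, best_cut, best_diff = 0, 1, float("inf")
        for i in range(len(group) - 1):
            acc += group[i][1]
            diff = abs(total - 2 * acc)
            if diff < best_diff:
                best_cut, best_diff = i + 1, diff
        # Steps 6-7: 0 for the left half, 1 for the right, then recurse.
        split(group[:best_cut], prefix + "0")
        split(group[best_cut:], prefix + "1")
    split(symbols, "")
    return codes

# Probabilities that are inverse powers of 2 give an exactly balanced split.
assert shannon_fano("aaaabbcc") == {"a": "0", "b": "10", "c": "11"}
```

On this input the result coincides with a Huffman code; in general Shannon-Fano can produce a slightly longer expected code length, matching the caveat in the text.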
3.5 Arithmetic coding

Arithmetic coding is a coding technique which provides the best compression ratio and gives better results than Huffman coding, though it is quite complicated in comparison to other techniques. When a string is converted using arithmetic encoding, the characters with the highest probability of occurrence are stored with fewer bits, and the characters that occur less frequently are stored with more bits, resulting in fewer bits used overall. Arithmetic coding converts the stream of input symbols into a single floating-point number as output. Unlike Huffman coding, arithmetic coding does not code each symbol separately; each symbol is instead coded by considering all of the data. Thus a data stream encoded in this manner must always be read from the beginning: random access is not possible.

Algorithm to generate the arithmetic code:

1. Calculate the number of unique symbols in the input. This represents the base b of the arithmetic code.
2. Assign values from 0 to b-1 to each unique symbol in the order in which they appear.
3. Using the values from step 2, replace the symbols with their codes in the input.
4. Convert the result from step 3 from base b to a sufficiently long fixed-point binary number to maintain precision.
5. Record the length of the input string somewhere in the result, as it is needed for decoding.

3.6 Lempel-Ziv-Welch (LZW) Algorithm

Terry Welch created the Lempel-Ziv-Welch algorithm in 1984. This type of compression algorithm works on any type of data, as it creates a table of strings commonly occurring in the data being compressed and replaces the actual data with references into the table. The table is formed while encoding the data during compression, and is rebuilt in the same way during decompression as the data is decoded.

CONCLUSION

In this review we have presented various techniques for compressing text data in a lossless manner. The techniques, with their algorithms and disadvantages, are described above. This review shows that no single algorithm provides results promising enough to be used in all practical applications for the compression of data. Therefore, there is a need to develop a lossless text compression algorithm that can compress text data better and that can be used in the various practical applications where compression of text data is required.





INTEGRATED RESEARCH JOURNAL OF MANAGEMENT, SCIENCE AND INNOVATION ISSN 2582-5445 WWW.IRJMSI.COM