
Journal of Applied Sciences Research, 6(12): 2123-2132, 2010 © 2010, INSInet Publication

Application of Run-length Encoding Technique in Compression

1D.E. Asuquo, 2E.O. Nwachukwu and 3N.J. Ekong

1,3Department of Computer Science, University of Uyo, Uyo, Nigeria.

2Department of Computer Science, University of Port Harcourt, Port Harcourt, Nigeria.

Abstract: With the growing size of databases and applications in this age of technology, involving the storage, processing and transfer of large volumes of information, there is an increasing need for more storage space for data. This need affects the system in the area of information transfer and downloading time, apart from the cost implication it has on the system's storage resources. The rate of data transfer can be improved either by improving the medium through which the data are transferred, or by changing the data themselves so that the same information can be transferred within a shorter interval of time. Data compression, with its various methods, makes this second kind of improvement possible. This work explores the benefits of the run-length encoding technique for data compression and implements a compression application using the technique in Java, such that the original data can be recovered exactly.

Key words: data compression, redundancy, run-length encoding technique

INTRODUCTION

The growing size of the databases used by most organizations for information storage and retrieval is becoming a difficult task to cope with, especially in today's world of information technology, requiring several additional gigabytes of hard disk space to keep up with. This no doubt has some cost implications [11,20]. However, it is worthy of note that storage and transfer of information are essential elements for the proper functioning of an organization of any type at any level. The faster the mode of information transfer without distortion in the data or message sent, the more smoothly the organization's structures function. Data compression provides a solution in this direction. It finds important application in the areas of data storage and data transmission. According to Lelewer and Hirschberg [13], Held [7] and Nelson [15], the aim of data compression is to reduce redundancy in stored or communicated data, thus increasing effective data density.

Information needs to be represented in a form that eliminates some form of redundancy. For example, it is often adequate to describe sex by using the first letters rather than writing it in full, say M for male and F for female, or by using some shorter defined codes, such as 1 assigned to male and 2 assigned to female. Also, changes in numerical representation from one numbering or base system to another may help in reducing redundancy, as in the case of converting a binary number to octal or to decimal, or even hexadecimal, and so forth. This may at times help to reduce the size of the data with regard to the number of bytes used for the information stored.

In this work, concepts from information theory as they relate to the goals and evaluation of data compression techniques are briefly discussed. This work applies both computer science and engineering principles and practices to the creation, operation, and maintenance of typical software with qualities such as correctness, reliability, robustness, user friendliness, maintainability, interoperability, and portability. The focus of this work is to present an analysis of data compression and develop a high-speed data compression application using the Run-Length Encoding (RLE) technique, implemented in the Java programming language. The application is capable of preventing loss of data during compression, displaying the compression ratio statistics in order to assess the degree of compression achieved, and decoding the compressed output to obtain the exact original data. Compression is useful because it helps reduce the consumption of expensive resources such as hard disk space or transmission bandwidth.

Information Theory: Historically, information theory was developed by Claude E. Shannon in 1948 to find fundamental limits on operations such as compressing data and on reliably storing and

Corresponding Author: D.E. Asuquo, Department of Computer Science, University of Uyo, Uyo, Nigeria. E-mail: [email protected]

communicating data. Applications fundamental to the theory include lossless data compression (e.g. ZIP files), lossy data compression (e.g. JPEG) and channel coding (e.g. for DSL lines). Information theory is defined to be the study of efficient coding and its consequences, in the form of speed of transmission and probabilities of errors [8]. A key measure of information in the theory is known as entropy, which is usually expressed by the average number of bits needed for storage or communication [19,5,12,24].

Ralph Hartley's 1928 paper, Transmission of Information, uses the word information as a measurable quantity, reflecting the receiver's ability to distinguish one sequence of symbols from any other, thus quantifying information as:

H = log S^n = n log S,    (1)

where S is the number of possible symbols and n is the number of symbols in a transmission. The natural unit of information was therefore the decimal digit. Other units include the nat, which is based on the natural logarithm, and the hartley, named in his honour. Today, based on the binary logarithm, the bit is the unit of information.

Information theory is based on probability theory and statistics [1]. Using a statistical description for data, information theory quantifies the number of bits needed to describe the data, which is the information entropy of the source message. The most important quantities of information are entropy, the information in a random variable, and mutual information, the amount of information in common between two random variables. The former quantity indicates how easily message data can be compressed, while the latter can be used to find the communication rate across a channel. This work concentrates on the former.

The entropy, H, of a discrete random variable X is a measure of the amount of uncertainty associated with the value of X. For example, a fair coin flip (2 equally likely outcomes) will have less entropy than a roll of a die (6 equally likely outcomes). Suppose one transmits 1000 bits (0s and 1s). If these bits are known ahead of transmission (to be a certain value with absolute probability), logic dictates that no information has been transmitted. If, however, each bit is equally and independently likely to be 0 or 1, 1000 bits (in the information-theoretic sense) have been transmitted. Between these two extremes, Reza [17] quantifies information as follows. If X is the set of all messages {x1,…,xn} that X could be, and ρ(x) is the probability of X given some x, then the entropy of X is defined as:

H(X) = Ex[I(x)] = -Σ ρ(x) log ρ(x).    (2)

Here, I(x) is the self-information, which is the entropy contribution of an individual message, and Ex is the expected value. Entropy is maximized when all the messages in the message space are equiprobable, ρ(x) = 1/n, i.e. most unpredictable, in which case

H(X) = log n.    (3)

Equation (3) is similar to Hartley's equation (1). There is a special case of information entropy for a random variable with two outcomes, known as the binary entropy function, usually taken to logarithmic base 2:

Hb(ρ) = -ρ log2 ρ - (1-ρ) log2 (1-ρ).    (4)

Data Compression: Considering the storage (or memory) organization, we can see that storing data on a disk through the write operation involving even a single character of information of a file requires one sector, which is an array of 512 bytes. Also, a file whose size exceeds the capacity of a 1.44 MB (megabytes) diskette, say a file of 2,500,000 bytes or 2.5 MB, would require 2 such diskettes, provided the file can be logically separated. Today, diskettes have become almost obsolete due to the availability of flash disks and optical disks, but the problem still abounds. Another major problem faced by the system user is transferring data from one medium to another, e.g. RAM to floppy disk, or even downloading data over a network. The transmission of data between computers and terminals entails several delays that have cumulative effects on the rate of information transfer. Data transmission over a channel must be in an acceptable format for the channel, and this conversion constitutes part of the delay factors. When data is transmitted over an analogue line, a modem (modulator/demodulator) must be employed to convert the digital pulses of the business machine into a modulated signal acceptable for transmission on the analogue telephone circuit. Two modems are required for a point-to-point circuit, contributing to the total internal delay between the first bit entering the modems and the first modulated signal produced by the device, with some further delays which might occur subsequently. These pose storage problems during data transfer.

Data compression is often referred to as coding, where coding is a very general term encompassing any
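The entropy definitions above can be checked numerically. The following sketch is illustrative only and is not part of the paper's application; the class and method names are ours. It evaluates equation (2) in Java and confirms that the fair coin and fair die reach the equiprobable maximum of equation (3), H(X) = log2 n.

```java
public class EntropyDemo {
    // H(X) = -sum p(x) * log2 p(x), per equation (2); terms with p = 0 contribute nothing
    public static double entropy(double[] probs) {
        double h = 0.0;
        for (double p : probs) {
            if (p > 0) {
                h -= p * (Math.log(p) / Math.log(2));
            }
        }
        return h;
    }

    public static void main(String[] args) {
        double[] coin = {0.5, 0.5};  // fair coin: 2 equiprobable outcomes
        double[] die = {1/6.0, 1/6.0, 1/6.0, 1/6.0, 1/6.0, 1/6.0};  // fair die
        System.out.println(entropy(coin));  // 1.0 bit, i.e. log2 2
        System.out.println(entropy(die));   // about 2.585 bits, i.e. log2 6
    }
}
```

As the text notes, the fair coin carries less entropy per symbol than the fair die, which is why a source with few equiprobable outcomes is cheaper to store.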


special representation of data which satisfies a given method. A code is a mapping of source messages into codewords [13]. Coding theory is one of the important and direct applications of information theory. It is subdivided into source coding theory and channel coding theory. While source coding (data compression) removes as much redundancy as possible, channel coding (error-correction coding) adds just the right kind of redundancy needed to transmit the data efficiently and faithfully across a noisy channel. Data compression, viewed as a branch of information theory, has as its primary objective to minimize the amount of data to be transmitted [10,2,17,14]. There are two formulations of the compression problem: lossless data compression and lossy data compression. In the former, the data must be reconstructed exactly, while the latter allocates the bits needed to reconstruct the data within a specified fidelity level measured by a distortion function.

Data compression software is application software, built as a form of control system drawing on several engineering disciplines and on the computer and communication fields. Some equipment and practices are manufacturer-specific and thus may not encourage equipment compatibility and interoperability. A simple characterization of data compression is that it involves transforming a string of characters in some representation (such as ASCII or EBCDIC) into a new string (of bits, for example) which contains the same information but whose length is as small as possible. It involves some form of encoding and decoding.

In fact, the savings achieved by data compression can be very dramatic: reductions as high as 80% are not uncommon [16,23]. Cormack reports that data compression programs reduced the size of a large student-record database by 42.1% when only some of the information was compressed. As a consequence of this size reduction, the number of disk operations required to load the database was reduced by 32.7% [3]. Data compression routines developed with specific applications in mind have achieved compression factors as high as 98% [18,22,8].

Methodology: Data compression methods or techniques are widely classified into lossless and lossy compression techniques. Our interest is in the former, which involves no loss of information and includes the Shannon-Fano algorithm, Huffman coding, the Lempel-Ziv-Welch (LZW) algorithm and run-length encoding (RLE). The Shannon-Fano algorithm is not guaranteed to produce an optimal code. The Huffman compression algorithm is suitable for compressing text or program files. The basic idea in Huffman coding is to assign short codewords to those input blocks with high probabilities and long codewords to those with low probabilities. This concept is similar to that of the Morse code. Run-length encoding performs lossless data compression and is well suited to palette-based iconic images. It is a form of data compression in which runs of data (that is, sequences in which the same data value occurs in many consecutive data elements) are stored as a single data value and count, rather than as the original run [7,21,15]. Its name accurately describes the process because it encodes a run of bytes into the following 2-byte form: {byte, length}, with length representing the number of repetitions of a single byte. This is most useful on data that contains many such runs: for example, relatively simple graphic images such as icons, line drawings, and animations.

As an example, consider the following data source, a string of 24 letters:

Input String: aabbbbbbbbbbeeeffggghhij

To encode the above string, the output would be {'a', 2}, {'b', 10}, {'e', 3}, {'f', 2}, {'g', 3}, {'h', 2}, {'i', 1}, and {'j', 1}. The total compressed form for this source is just 16 bytes. This saves us exactly 8 bytes, a compression ratio of 33%. In Held [7], a screen containing plain black text on a solid white background, with many long runs of white pixels in the blank space and many short runs of black pixels within the text, was considered. If we take a hypothetical single scan line, with B representing a black pixel and W representing a white one, we have:

WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW

If we apply run-length encoding to the above hypothetical scan line, we get the following:

12W1B12W3B24W1B14W

Interpret this as twelve W's, one B, twelve W's, three B's, etc. The run-length code represents the original 67 characters in only 18. Of course, the actual format used for the storage of images is generally binary rather than ASCII characters like this, but the principle remains the same. Even binary data files can be compressed with this method; file format specifications often dictate repeated bytes in files as padding space. Run-length encoding can be efficiently used in fax machines (combined with other techniques into Modified Huffman coding) because most faxed documents are white space, with occasional interruptions of black.
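The {symbol, count} scheme described above is straightforward to implement. The sketch below is our own illustration, not code from the paper's application: it encodes a run of identical symbols as the symbol followed by its run length, and decodes the result back to the original string. It assumes the input contains no digit characters, since digits are reserved for the counts.

```java
public class RleDemo {
    // Encode each run as symbol + run length, e.g. "aab" -> "a2b1"
    public static String encode(String input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            int run = 1;
            while (i + run < input.length() && input.charAt(i + run) == c) {
                run++;
            }
            out.append(c).append(run);
            i += run;
        }
        return out.toString();
    }

    // Invert the encoding: read a symbol, then its decimal count, and expand the run
    public static String decode(String encoded) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < encoded.length()) {
            char c = encoded.charAt(i++);
            int count = 0;
            while (i < encoded.length() && Character.isDigit(encoded.charAt(i))) {
                count = count * 10 + (encoded.charAt(i++) - '0');
            }
            for (int k = 0; k < count; k++) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The paper's 24-letter example yields 8 symbol/count pairs
        String src = "aabbbbbbbbbbeeeffggghhij";
        System.out.println(encode(src));  // a2b10e3f2g3h2i1j1
        System.out.println(decode(encode(src)).equals(src));  // true: lossless round trip
    }
}
```

The round trip through decode(encode(...)) illustrates the lossless property the paper emphasises: the exact original data is recovered.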


From an algorithmic point of view, run-length encoding stands out from other methods of compression. It does not try to reduce the average symbol size like Huffman coding or arithmetic coding, and it does not replace strings with dictionary references like Lempel-Ziv and LZW style coding. RLE replaces a string of repeated symbols with a single symbol and a count (run length) indicating the number of times the symbol is repeated. This justifies our adoption of the technique in this work. In Dipperstein [4], the string:

"aaaabbcdeeeeefghhhij"

was replaced, using RLE, with:

"a4b2c1d1e5f1g1h3i1j1".

Here the digits are values (run lengths), not symbols. Counting the before and after string sizes reveals that RLE did not make this string any smaller, contrary to expectation. Part of the reason is that it now takes a symbol and a value to represent what used to take just a single symbol. One way of preventing single symbols from expanding is to apply coding only to runs of more than one, and to use the run length to indicate the number of additional copies of the encoded symbol that are required. Ideally that would give a shorter string like this: "a3b1cde4fgh2ij".

RESULTS AND DISCUSSIONS

System Design: The diagrams below give a flow of our system architectural design. Java applications do not interact directly with a computer's central processing unit (CPU) or operating system, and are therefore platform independent, meaning that they can run on any type of personal computer, workstation, or mainframe computer. This cross-platform capability, referred to as "write once, run everywhere," caught our attention, and could be doing the same for many software developers and users. With Java, software developers can write applications that will run on otherwise incompatible operating systems. The ease of manipulating bits, language flexibility, and program extensibility and maintenance make it possible to write very compact and effective code with it.

The system is composed of a graphical user interface. The title bar shows the title of the program, "BOBY FILE COMPRESSION", while the dialogue box contains a welcoming message with an instruction to click the 'OK' button and wait for the next message. The program is compiled to an object file (being the executable file) with the name COMPRESSION EXECUTABLE FILE and a capital reddish letter 'A' as its logo (fig. 6). The executable file runs when it is double-clicked.

When the compression module is clicked, the compression screen (fig. 2) appears with a dialogue box carrying the message: "Welcome to Bobby File Compression, Click on the OK button, and wait for the next step".

i. Click on the 'OK' button. You will have a new dialogue box (fig. 3) with the instruction: "Enter filename with its extension for first step of compression".
ii. Type in the filename in the text box provided, e.g. implementation.doc.
iii. Click on the "OK" button. Compression module 2 opens with a compression-in-progress screen (fig. 4) carrying the message: "Welcome to Bobby File Compression, Click on the OK button, and wait for the next step".
iv. Click on the "OK" button. The second step of compression begins with an opened dialogue box (fig. 5) carrying the message: "Enter the compressed filename and its new extension '.zip' and click on 'OK' button".
v. Type in the filename, e.g. implementation.zip, in the text box provided and click on the 'OK' button.

The output (fig. 6) appears. This output is a folder-like compressed file, usually yellowish in colour, having as filename 'implementation', with a size of 1 KB as compared to the original file size of 30 KB. Extracting back the original file (implementation.doc) involves double-clicking on the compressed file.

Conclusion: Today, many data processing applications require storage of large volumes of data, and the number of such applications is constantly increasing as the use of computers extends to new disciplines. At the same time, the widespread use of computer communication networks is resulting in massive transfer of data over communication links. Compressing data to be stored or transmitted reduces storage and/or communication costs. When the amount of data to be transmitted is reduced, the effect is that of increasing the capacity of the communication channel. Similarly, compressing a file to half its original size is equivalent to doubling the capacity of the storage medium. It then becomes feasible to store the data at a higher, faster level of the storage hierarchy and reduce the load on the input/output channels of the computer system. Our compression application indicates that lossless data compression using the run-length encoding technique can be efficiently implemented in the Java programming language. However, there is room for improvement in compression ratios, memory usage, and speed.
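The variant discussed in the Dipperstein [4] example above, in which only runs longer than one are coded and the digit counts the additional copies of the symbol, can be sketched as follows. This is our own illustrative implementation, not the paper's; like the example, it assumes the input contains no digits (otherwise symbols and counts would be ambiguous).

```java
public class RleVariantDemo {
    // Encode a run of length r as the symbol alone when r == 1,
    // or as symbol + (r - 1) additional copies when r > 1
    public static String encode(String input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            int run = 1;
            while (i + run < input.length() && input.charAt(i + run) == c) {
                run++;
            }
            out.append(c);
            if (run > 1) {
                out.append(run - 1);  // additional copies only; single symbols stay as-is
            }
            i += run;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Matches the shorter string given in the text for the Dipperstein example
        System.out.println(encode("aaaabbcdeeeeefghhhij"));  // a3b1cde4fgh2ij
    }
}
```

Because single symbols are emitted unchanged, this variant never expands the input, which is exactly the weakness of the naive scheme that the text identifies.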


Fig. 1: Flow of the system architectural design

Fig. 2: Main menu

Fig. 3: First compression step


Fig. 4: Second compression module

Fig. 5: Confirmation of a successful compression

Fig. 6: Compression screen


Fig. 7: First step of compression

Fig. 8: Compression in progress


Fig. 9: Second step of compression

Fig. 10: Confirmation of a successful compression

REFERENCES

1. Arndt, C., 2004. Information Measures: Information and its Description in Science and Engineering. Springer Series: Signals and Communication Technology.
2. Ash, R.B., 1990. Information Theory. Dover Publications, Inc., New York.
3. Cormack, G.V., 1985. Data Compression on a Database System. Commun. ACM, 28(12): 1336-1342.


4. Dipperstein, M., 2004. Run Length Encoding (RLE) Discussion and Implementation. http://michael.dipperstein.com/rle/index.html.
5. Galleger, R., 1968. Information Theory and Reliable Communication. John Wiley and Sons, New York.
6. Held, G., 1987. Data Compression: Techniques and Applications, Hardware and Software Considerations (2nd ed.). John Wiley and Sons, New York.
7. Held, G., 1991. Data Compression. John Wiley and Sons Ltd., New York.
8. Horspool, R.N. and Cormack, G.V., 1987. A Locally Adaptive Data Compression Scheme.
9. Ingels, F.M., 1971. Information and Coding Theory.
10. Jaynes, E.T., 1957. Information Theory and Statistical Mechanics.
11. Korfhage, R.R., 1997. Information Storage and Retrieval. John Wiley and Sons, New York.
12. Leff, H.S. and Rex, A.F., 1990. Maxwell's Demon: Entropy, Information, Computing. Princeton University Press, New York.
13. Lelewer, D.A. and D.S. Hirschberg, 1987. Data Compression. ACM Computing Surveys, 19(3).
14. Mackay, D., 2003. Information Theory, Inference and Learning Algorithms. Cambridge University Press, Cambridge.
15. Nelson, M.R., 1991. The Data Compression Book. M&T Books, Redwood City.
16. Reghbati, H.K., 1981. An Overview of Data Compression Techniques.
17. Reza, F.M., 1994. An Introduction to Information Theory. Dover Publications, Inc., New York.
18. Severance, D.G., 1983. A Practitioner's Guide to Database Compression. Inform. Systems, 8(1): 51-62.
19. Shannon, C.E. and W. Weaver, 1949. The Mathematical Theory of Communication. University of Illinois Press, Urbana, Ill.
20. Singhal, A., 2001. Modern Information Retrieval: A Brief Overview. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 24(4): 35-43.
21. Storer, J.A., 1988. Data Compression: Methods and Theory. Computer Science Press.
22. Welch, T.A., 1984. A Technique for High-Performance Data Compression. Computer, 17(6): 8-19.
23. Witten, I.H., R.M. Neal and J.G. Cleary, 1987. Arithmetic Coding for Data Compression. Commun. ACM, 30(6): 520-540.
24. Yeung, R.W., 2008. Information Theory and Network Coding. Springer.

Appendix

package compressions;

import java.io.*;
import java.util.zip.*;
import javax.swing.*;
import java.awt.event.*;
import java.awt.*;

public class Compression extends JFrame {
    public static void main(String[] args) throws IOException {
        String word1 = "Welcome to Bobby file compressions\n" +
                "click the Ok button\n" +
                "and wait for the next step";
        JOptionPane.showMessageDialog(null, word1, "Bobby File Compressions",
                JOptionPane.INFORMATION_MESSAGE);
        BufferedReader input = new BufferedReader(new InputStreamReader(System.in));
        String filesToZip = JOptionPane.showInputDialog(null,
                "Enter filename and its extension\nfor first step of compression",
                "Bobby File Compressions", JOptionPane.PLAIN_MESSAGE);
        File f = new File(filesToZip);
        if (!f.exists()) {
            String word2 = "File not found.";
            JOptionPane.showMessageDialog(null, word2, "Bobby File Compressions",
                    JOptionPane.WARNING_MESSAGE);
            System.exit(0);
        }
        String word = "Bingo! You just completed the first step\n" +
                "click Ok and follow the next instruction";
        JOptionPane.showMessageDialog(null, word, "Bobby File Compressions",
                JOptionPane.INFORMATION_MESSAGE);
        String zipFileName = JOptionPane.showInputDialog(null,
                "Enter the compressed filename and its new extension '.zip' and click Ok",
                "Bobby File Compressions", JOptionPane.PLAIN_MESSAGE);
        byte[] buffer = new byte[18024];
        try {
            ZipOutputStream out = new ZipOutputStream(new FileOutputStream(zipFileName));
            out.setLevel(Deflater.DEFAULT_COMPRESSION);
            FileInputStream in = new FileInputStream(filesToZip);
            out.putNextEntry(new ZipEntry(filesToZip));
            int len;
            while ((len = in.read(buffer)) > 0) {
                out.write(buffer, 0, len);
            }
            out.closeEntry();
            in.close();
            out.close();
            String word4 = "Project work compiled and Executed by Ekong, Ndifreke Jacob\n" +
                    "Department of Computer Science\n" +
                    "Faculty of Science\n" +
                    "University of Uyo\n" +
                    "Check the directory for the compressed file";
            JOptionPane.showMessageDialog(null, word4, "Bobby File Compressions",
                    JOptionPane.PLAIN_MESSAGE);
        }
        catch (IllegalArgumentException iae) {
            iae.printStackTrace();
            System.exit(0);
        }
        catch (FileNotFoundException fnfe) {
            fnfe.printStackTrace();
        }
        catch (IOException ioe) {
            ioe.printStackTrace();
        }
        Compression e = new Compression();
    }
}
