A Comparison of Two Programs

by Walter Scheper

Table of Contents

Executive Summary
Methods
Raw Data
Analysis of Data
Conclusion
Appendix

Executive Summary

The purpose of this study is to determine which compression program, WinZip v8.1 or WinRAR v2.90, is the better general-purpose tool in terms of efficiency for compressing data. WinZip is the de facto standard consumer compression program. WinRAR uses a less well-known format and is the predominant method of compressing files for distribution via Usenet newsgroups. Deciding between the two is important, because an incorrect decision could waste time and space. Through the course of the study, I found that WinZip v8.1 significantly outperformed WinRAR v2.90 for compressing both ASCII text and multimedia WAV files. When compressing ASCII text files, WinRAR typically produced output about 7 percentage points smaller (measured against the original size) than WinZip, but at a cost of taking 10 times as long. The comparison is still worse for WinRAR when dealing with the hard-to-compress multimedia files. The difference in compression for multimedia files is reduced to less than 1%, with some cases actually in WinZip’s favor, while WinRAR now takes 13 times as long.

These results are completely different from what I expected to see. Data compression is primarily achieved by encoding information so that fewer bits are used to represent those parts which occur more frequently. The tradeoff is that less frequently used items must be encoded with more bits. As a result, what you count and the relative frequency of those parts will determine how well your data is compressed. The ASCII text I used was mostly written in the English language, which provides a particularly skewed frequency count and therefore something easily compressed. However, WAV files are much more random, meaning there is less skewing of the frequency count towards certain values. This makes compression difficult.

Knowing this, what I expected to see was that WinZip and WinRAR would perform about the same, with WinRAR achieving slightly better performance, while the type of file would primarily affect overall efficiency. I was quite surprised to see just how much WinZip outperformed WinRAR and how little difference the file type made in overall efficiency.
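The link between a skewed frequency count and compressibility can be illustrated with a rough sketch. The inputs below are illustrative stand-ins, not the files used in the study, and the function name byte_entropy is hypothetical. It measures only single-byte frequencies, whereas real compressors also exploit repeated strings, so it should be read as an indication rather than a prediction of either program's output.

```python
import math
import random
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Return the Shannon entropy of the byte frequencies, in bits per byte."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    # Sum of p * log2(1/p) over every byte value that occurs.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Illustrative inputs (assumptions, not the study's data): repeated English
# text versus pseudo-random bytes standing in for hard-to-compress audio.
english = b"the quick brown fox jumps over the lazy dog " * 1000
rng = random.Random(0)
noisy = bytes(rng.randrange(256) for _ in range(len(english)))

print(f"English-like text: {byte_entropy(english):.2f} bits/byte")
print(f"Random bytes:      {byte_entropy(noisy):.2f} bits/byte")
```

A lower figure means the byte frequencies are more skewed, so a frequency-based coder can spend fewer bits per byte on average.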

Methods

To collect the data, I first created the input files. This was done by searching the internet for a large number of files containing only ASCII text, preferably consisting of standard English. I then concatenated the files together until I had six files of ASCII text, each approximately 12,000,000 bytes in size. The six WAV files were created by 'ripping' music tracks from CDs; ripping each track generated a WAV file of roughly 26,000,000 to 35,000,000 bytes. Each file was then compressed using the latest versions of both WinRAR and WinZip. The order in which I compressed the files was randomized to break up any linking characteristics, and the computer was freshly rebooted to free up as many system resources for the tests as possible. A stopwatch was used to time how long it took each program to compress a file, and the original and compressed file sizes were recorded. Finally, I calculated the ratio of compressed size to original size.
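One way such a randomized run order could be produced is sketched below. The report does not say how the order was actually randomized, so the approach, the file numbering, and the seed are all assumptions made only for illustration.

```python
import itertools
import random

programs = ["zip", "rar"]
file_types = ["txt", "wav"]
file_ids = range(1, 7)  # six input files of each type (numbering is hypothetical)

# Every program/file combination is one trial: 2 * 2 * 6 = 24 runs.
runs = list(itertools.product(programs, file_types, file_ids))

rng = random.Random(2002)  # hypothetical seed, used only so the sketch is repeatable
rng.shuffle(runs)

for trial, (prog, ftype, file_id) in enumerate(runs, start=1):
    print(f"trial {trial:2d}: {prog} {ftype} file {file_id}")
```

Shuffling the full list of combinations, rather than shuffling within each program, keeps any drift in machine state from lining up with one program or one file type.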

The experiment resulted in two measures of efficiency that are important when dealing with computers: time and space. Since time and space are two key elements of computing efficiency, I wanted to explore these two measures together rather than apart. Using the gathered information, I determined how much time it took WinRAR and WinZip to compress a file by a certain percentage. To do this, I simply divided the compression ratio (compressed size as a percentage of original size) by the time to compress, giving the Compression per Second (Cr/s) statistic. Doing this also eliminates the effect of the input WAV files having much larger sizes than the input text files and should give us a good measurement of the total efficiency of each compression program. This value was then used for determining whether program type (WinZip or WinRAR) or file type (WAV or text) was the more important factor in determining overall efficiency.
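The Cr/s values tabulated in the Raw Data section are consistent with dividing the Ratio column by the Time column. A minimal sketch of that calculation, using trial 7 as the worked example, follows; the function names are hypothetical.

```python
def compression_ratio(original_bytes: int, compressed_bytes: int) -> float:
    """Compressed size as a percentage of the original size."""
    return 100.0 * compressed_bytes / original_bytes


def cr_per_second(original_bytes: int, compressed_bytes: int, seconds: float) -> float:
    """Ratio divided by elapsed time, matching the Cr/s column of the raw data."""
    return compression_ratio(original_bytes, compressed_bytes) / seconds


# Trial 7 from the raw data: WinZip, text file, 3.97 seconds.
ratio = compression_ratio(12_007_756, 4_290_472)
score = cr_per_second(12_007_756, 4_290_472, 3.97)
print(f"ratio = {ratio:.3f}%, Cr/s = {score:.3f}")
```

Because the ratio is already a percentage of the original size, the statistic does not grow with the size of the input file, which is what lets text and WAV trials be compared on one scale.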

Raw Data

Trial  Prog  Type  Time (s)  Orig. (bytes)  Comp. (bytes)  Ratio (%)    Cr/s
    6  zip   txt       3.09     12,003,805      4,533,235     37.765   12.22
   18  zip   txt       2.42     12,061,980      4,624,590     38.340   15.84
    7  zip   txt       3.97     12,007,756      4,290,472     35.731   9.000
   24  zip   txt       2.38     12,036,161      4,594,080     38.169   16.04
   11  zip   txt       3.38     12,017,031      4,546,661     37.835   11.19
    5  zip   txt       2.76     12,021,495      4,433,054     36.876   13.36
   17  zip   wav       6.31     27,694,844     26,274,644     94.872   15.04
   22  zip   wav       6.35     26,041,388     25,007,422     96.030   15.12
   14  zip   wav       8.15     34,873,148     32,339,260     92.734   11.38
   10  zip   wav       7.31     30,853,580     28,462,609     92.251   12.62
    9  zip   wav       5.97     26,471,804     25,163,007     95.056   15.92
    3  zip   wav       7.02     28,482,764     24,572,574     86.272   12.29
    8  rar   txt      24.90     12,003,805      3,696,517     30.795   1.237
    4  rar   txt      25.74     12,061,980      3,772,400     31.275   1.215
   19  rar   txt      23.27     12,007,756      3,479,978     28.981   1.245
   20  rar   txt      24.74     12,036,161      3,758,207     31.224   1.262
   23  rar   txt      24.99     12,017,031      3,720,525     30.960   1.241
    1  rar   txt      25.49     12,021,495      3,601,750     29.961   1.175
   15  rar   wav      81.82     27,694,844     26,335,278     95.091   1.162
   16  rar   wav      76.81     26,041,388     25,065,804     96.254   1.253
    2  rar   wav     102.59     34,873,148     32,397,711     92.902  0.9056
   12  rar   wav      90.30     30,853,580     28,525,337     92.454   1.024
   21  rar   wav      78.58     26,471,804     25,191,745     95.164   1.211
   13  rar   wav      83.04     28,482,764     24,522,980     86.098   1.037

Analysis of Data

Even the raw data alone shows that WinZip has a significant performance edge over WinRAR: it averages 13.3 in the Compression per Second (Cr/s) statistic against WinRAR’s 1.16, and it is about 10 times faster. However, the raw data fails to tell us if there is any interaction between the compression program types and the file types. For that we need to look at the output from Matlab in the Appendix.
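The per-program averages can be checked directly from the raw data. The sketch below copies the Time and Cr/s columns from the table and is only an illustrative recomputation, not the Matlab procedure used for the Appendix.

```python
from statistics import mean

# (time in seconds, Cr/s) for each trial, copied from the raw data table.
zip_trials = [
    (3.09, 12.22), (2.42, 15.84), (3.97, 9.000), (2.38, 16.04),
    (3.38, 11.19), (2.76, 13.36), (6.31, 15.04), (6.35, 15.12),
    (8.15, 11.38), (7.31, 12.62), (5.97, 15.92), (7.02, 12.29),
]
rar_trials = [
    (24.90, 1.237), (25.74, 1.215), (23.27, 1.245), (24.74, 1.262),
    (24.99, 1.241), (25.49, 1.175), (81.82, 1.162), (76.81, 1.253),
    (102.59, 0.9056), (90.30, 1.024), (78.58, 1.211), (83.04, 1.037),
]

zip_crs = mean(crs for _, crs in zip_trials)
rar_crs = mean(crs for _, crs in rar_trials)

# Ratio of total elapsed time across all twelve files for each program.
time_ratio = sum(t for t, _ in rar_trials) / sum(t for t, _ in zip_trials)

print(f"mean Cr/s: zip = {zip_crs:.3f}, rar = {rar_crs:.3f}")
print(f"total WinRAR time / total WinZip time = {time_ratio:.1f}")
```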

Going through the output from the means(), mfit() and ANOVA functions in Matlab quickly demonstrates the primacy of program type in determining the overall efficiency. First we look at the output of the means() function for each factor, program type and file type. Looking at this, we begin to see that program type is the more important factor, since the mean differences between programs are 12.6295 and 11.7125, versus 0.7866 and -0.1304 between file types. This is reinforced by the Fitted Main Effects in Table III. Finally, looking at the ANOVA table, we see that program type is the only factor with an F-value greater than one, 325.18140, and is also the only factor with a p-value that indicates statistical significance.

These results strongly indicate that there is no significant interaction between program and file type, and that program type is the most important factor in overall efficiency.
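The sums of squares and F-values in the ANOVA table can be reproduced by hand for a balanced two-factor design. The following is a sketch in plain Python rather than the Matlab functions named above; it takes the Cr/s values from the raw data (already rounded to four significant figures), so small differences in the last digits are to be expected. p-values are omitted because they need an F-distribution routine outside the standard library.

```python
from statistics import mean

# Cr/s values from the raw data, grouped by (program, file type).
cells = {
    ("zip", "txt"): [12.22, 15.84, 9.000, 16.04, 11.19, 13.36],
    ("zip", "wav"): [15.04, 15.12, 11.38, 12.62, 15.92, 12.29],
    ("rar", "txt"): [1.237, 1.215, 1.245, 1.262, 1.241, 1.175],
    ("rar", "wav"): [1.162, 1.253, 0.9056, 1.024, 1.211, 1.037],
}
progs, types = ["zip", "rar"], ["txt", "wav"]
n = 6  # replicates per cell (balanced design)

grand = mean(v for vals in cells.values() for v in vals)
cell_mean = {k: mean(v) for k, v in cells.items()}
prog_mean = {p: mean(cell_mean[p, t] for t in types) for p in progs}
type_mean = {t: mean(cell_mean[p, t] for p in progs) for t in types}

# Sums of squares for a balanced two-factor design with interaction.
ss_prog = n * len(types) * sum((prog_mean[p] - grand) ** 2 for p in progs)
ss_type = n * len(progs) * sum((type_mean[t] - grand) ** 2 for t in types)
ss_inter = n * sum(
    (cell_mean[p, t] - prog_mean[p] - type_mean[t] + grand) ** 2
    for p in progs for t in types
)
ss_error = sum((v - cell_mean[k]) ** 2 for k, vals in cells.items() for v in vals)

df_error = len(progs) * len(types) * (n - 1)  # 2 * 2 * 5 = 20
ms_error = ss_error / df_error

# Each effect has 1 degree of freedom, so its mean square equals its SS.
for name, ss in [("prog", ss_prog), ("type", ss_type), ("prog*type", ss_inter)]:
    print(f"{name:10s} SS = {ss:10.5f}  F = {ss / ms_error:9.4f}")
print(f"{'Error':10s} SS = {ss_error:10.5f}  MS = {ms_error:.5f}")
```

Because the design is balanced, the order in which the terms enter the model does not change their sums of squares, so this direct calculation can be compared line by line with the sequential table in the Appendix.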

Conclusion

The importance of this study is that it demonstrates that there is a major difference between these two compression programs in terms of efficiency. While WinRAR does achieve somewhat better compression, it has a cost in terms of time that outweighs the compression advantage.

However, the study does not answer the question of the underlying cause of the results. It may be that the data compression algorithm used by WinRAR is simply slower, but it could be an implementation issue. To determine this, I would need more specific information about the algorithms used by WinZip and WinRAR. Still, it is interesting to see that the WAV files, which are hard to compress, don’t affect the efficiency as I had assumed they would. This fact tends to point towards an implementation issue in WinRAR.

Appendix

Mean Plot of Cr/s vs Program Type and File Type

I. means() output for Program Type

Means of Compression/second, by Program Type

Source   N     Mean
zip     12   13.335
rar     12    1.164

II. means() output for File Type

Means of Compression/second, by File Type

Source   N     Mean
wav     12   7.4136
txt     12   7.0854

Table of means of Compression/second by Program Type and File Type; with Mean Differences

              x1
           zip       rar  |
x2  wav  13.7283   1.0988 |  12.6295
    txt  12.9417   1.2292 |  11.7125
         -----------------
          0.7866  -0.1304

III. mfit(comp.rps,comp.prog,comp.type)

Overall Mean 7.2495

Fitted Main Effect of Compression/second, by Program Type

Source   N   Main Effect
zip     12        6.0855
rar     12       -6.0855

Fitted Main Effect of Compression/second, by File Type

Source   N   Main Effect
wav     12       0.16407
txt     12      -0.16407

Table of 2-way Program Type by File Type Interaction Effects

              x1
           zip        rar
x2  wav   0.22927   -0.22927
    txt  -0.22927    0.22927

IV. Lm output

Sequential Sums of Squares ANOVA Table:

Models Compression/second = Program Type + File Type + (Program Type * File Type)

Source      df         SS         MS          F       P-val
prog         1  888.80430  888.80430  325.18140  7.7161e-14
type         1    0.64603    0.64603    0.23636  0.63213000
prog*type    1    1.26150    1.26150    0.46154  0.50469000
Error       20   54.66510    2.73330

R-square 0.94218

Standard Error 1.6533
