Performance Analysis of Generic Compression Algorithm Tools
Andrew Keating, Jonathon Marolf
Embry-Riddle Aeronautical University
Introduction

Compression serves a very important purpose in conserving data backup space and improving the speed of digital data communications. To take full advantage of compression, the best compression utility available for the given situation should be used. Because of the inherent tradeoffs between compression ratio and compression/decompression time, identifying the 'best' utility can be troublesome. Each compression utility is meant to compress data, but exhibits different performance because of the different algorithms it uses. The major algorithms used by each tool are described below.

7zip uses the Lempel-Ziv-Markov chain algorithm (LZMA), which is itself an extension of the Lempel-Ziv algorithm (specifically LZ77 [1]), also known as the sliding-window algorithm. The sliding-window algorithm compresses data by keeping track of a set of data and how far away the next occurrence of that set is.

Bzip2 uses widely different methods to compress data compared to 7zip. Bzip2 begins by using the Burrows-Wheeler transform [2], a reversible algorithm that moves repeating patterns of data next to each other (BANANA becomes BNNAAA). This is a useful step because the next algorithm, the move-to-front transform [3], works best when repeating patterns are next to each other, as any repetitions are then represented as zeroes (BNNAAA becomes 1, 13, 0, 2, 0, 0). The final step of the Bzip2 algorithm is to use Huffman coding [4] to translate the integer values into the minimal set of bits necessary to express them.

Gzip and Info-zip use the DEFLATE [5] algorithm, which is a combination of LZ77 and Huffman coding. Essentially, repeated sequences are expressed as (length, offset) pairs pointing back into previously seen data, and the resulting symbols are encoded with Huffman coding.

RAR uses the most exotic compression methods compared to the other tools discussed. It begins by using a system called Prediction by Partial Matching [6]. The goal of this algorithm is to create a mathematical model that can accurately describe the sequence of bytes. To accomplish this, the algorithm examines each byte in sequence and determines the probability of one symbol (a unique byte) coming after another.

Methods

Environment
• Gentoo Linux workstation with kernel version 2.6.31-r6. The system contains an Intel Pentium 4 CPU rated at 2.5 GHz and 1 GB of main memory rated at DDR 226.

Tools
• 7zip: version 9.04 of 'p7zip'
• Bzip2: version 1.0.5-r1 of 'bzip2' and version 1.0.5-r1 of 'bunzip2'
• Gzip: version 1.4 of 'gzip' and version 1.4 of 'gunzip'
• Info-zip: version 3.0 of 'zip' and version 6.0-r1 of 'unzip'
• RAR: version 3.8.0 of the official utility 'rar'

Metrics
• Compression Time: the amount of time it took the tool to compress the test file
• Decompression Time: the amount of time it took the tool to decompress the produced compressed file
• Compression Ratio: the amount of compression performed by the tool

Factors and Levels
• Size of input file: 5 MB, 25 MB, and 125 MB
• Type of input file (uncompressed data): ASCII text, audio, video, and random

Measurements
Measurements were made by running each utility on the input data and using version 1.7-r1 of the GNU time utility to time its execution.

Run Count
Based on the results of an initial 10-run test, the selected 90% confidence interval, and +/- 5% error, it was determined that 100 runs for each combination of factor and level would be sufficient and feasible. The full results of this preliminary analysis can be seen in Table 1.

Analysis
• For each file type and size (i.e. text, 5 MB) an Excel workbook was created that would calculate the mean, median, mode, and standard deviation for compression and decompression times, as well as the compression ratio.
• Excel charts were compared at multiple levels to determine winners in each of the 16 categories.
• Single-factor ANOVA analysis was done on the data to ensure the results were not affected by error. (See Figure 1)
• Graph summaries of compression time and compression ratio were produced. (See Figures 2 and 3)

Results

Figure 1: ANOVA

Archival Tool     F          F crit
RAR: 5mb          1719488    2.627441
RAR: 25mb         119782.1   2.627441
RAR: 125mb        184026.4   2.627441
Info-zip: 5mb     3525.32    2.627441
Info-zip: 25mb    3460.035   2.627441
Info-zip: 125mb   4588.768   2.627441
Gzip: 5mb         1767.889   2.627441
Gzip: 25mb        7210.017   2.627441
Gzip: 125mb       312.9928   2.627441
Bzip2: 5mb        174024.9   2.627441
Bzip2: 25mb       157437.2   2.627441
Bzip2: 125mb      122540.7   2.627441
7zip: 5mb         83832.75   2.627441
7zip: 25mb        158563.1   2.627441
7zip: 125mb       184026.4   2.627441

Figure 2: Compression Time (graph)
Figure 3: Compression Ratios (graph)

Conclusions

Best Compression Ratio - When looking at just compression ratio, 7zip wins for best audio compression overall. Info-zip and gzip win for best compression of random data because of their small header files. 7zip wins for best compression of text data and video data, just barely beating RAR within our confidence interval. Overall, 7zip delivers the best compression ratio.

Best Compression Time - Gzip wins for fastest audio compression time, while info-zip wins for fastest compression of random data. Text and video data results were too close to call within our confidence interval of 90%, so info-zip and gzip tie.

Best Decompression Time - Info-zip and gzip tie for fastest decompression of audio data per our confidence interval. Gzip wins for fastest decompression of random data. Surprisingly, RAR wins for best decompression of text data, while info-zip wins for best decompression of video data. Info-zip and gzip win for best overall decompression time.

Best Ratio/Time - Gzip and info-zip have the best compression ratio/time for all categories, and are too close to differentiate within our confidence interval.

A summary of the results is below.

Figure 4: Summary of Conclusions

          Compression Ratio   Compression Time   Decompression Time   Compression/Time
Audio     7zip                Gzip               Info-zip             Info-zip / gzip
Random    Info-zip / gzip     Info-zip           Gzip                 Info-zip / gzip
Text      7zip                Info-zip / gzip    RAR                  Info-zip / gzip
Video     7zip                Info-zip / gzip    Info-zip             Info-zip / gzip
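The sliding-window mechanism described above for LZ77 can be sketched with a toy encoder that emits (offset, length, next-byte) triples. This is an illustrative sketch only, not the encoder any of the tested tools actually uses:

```python
def lz77_compress(data: bytes, window: int = 4096) -> list[tuple[int, int, int]]:
    """Toy LZ77: emit (offset, length, next_byte) triples, where offset
    says how far back in the sliding window a repeat of the upcoming
    bytes starts and length says how many bytes repeat."""
    i, out = 0, []
    while i < len(data):
        best_len, best_off = 0, 0
        for j in range(max(0, i - window), i):
            k = 0
            # Matches may run past position i (overlapping copies are allowed).
            while i + k < len(data) - 1 and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_len, best_off = k, i - j
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out


def lz77_decompress(triples: list[tuple[int, int, int]]) -> bytes:
    out = bytearray()
    for off, length, nxt in triples:
        for _ in range(length):
            out.append(out[-off])  # copy byte-by-byte so overlapping runs work
        out.append(nxt)
    return bytes(out)
```

Real LZ77-family coders additionally entropy-code the triples; the point here is only the back-reference mechanism.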
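The move-to-front step in the Bzip2 pipeline can be reproduced exactly on the BNNAAA example above. A minimal sketch, assuming a plain A-Z working alphabet:

```python
import string

def move_to_front(data: str) -> list[int]:
    """Move-to-front transform: each symbol is replaced by its current
    index in a working alphabet, then moved to the front, so runs of
    repeated symbols encode as zeroes."""
    alphabet = list(string.ascii_uppercase)  # assumed A-Z alphabet
    out = []
    for ch in data:
        idx = alphabet.index(ch)
        out.append(idx)
        alphabet.insert(0, alphabet.pop(idx))
    return out

print(move_to_front("BNNAAA"))  # -> [1, 13, 0, 2, 0, 0]
```

The output matches the worked example in the text: the repeated Ns and As collapse into zeroes, which is exactly what makes the subsequent Huffman-coding step effective.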
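Since gzip and Info-zip both use DEFLATE, Python's zlib module (also DEFLATE) gives a quick feel for the compression-ratio metric. The poster does not state its ratio formula, so original-size over compressed-size is an assumption here:

```python
import os
import zlib

def compression_ratio(data: bytes, level: int = 9) -> float:
    """Original size over DEFLATE-compressed size (assumed definition)."""
    return len(data) / len(zlib.compress(data, level))

# Repetitive text compresses well; random data barely compresses at all,
# which is consistent with only header overhead separating tools on random input.
text_like = b"the quick brown fox jumps over the lazy dog " * 200
random_like = os.urandom(len(text_like))
print(compression_ratio(text_like), compression_ratio(random_like))
```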
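The run-count decision (100 runs chosen from a 10-run pilot at a 90% confidence interval and +/- 5% error) follows the usual sample-size calculation. The poster does not show its formula, so this standard one is an assumption:

```python
from math import ceil

def runs_needed(pilot_mean: float, pilot_stdev: float,
                z: float = 1.645, rel_err: float = 0.05) -> int:
    """Smallest n whose confidence-interval half-width z*s/sqrt(n) is
    no larger than rel_err * mean:  n = (z*s / (rel_err*mean))^2.
    z = 1.645 corresponds to a 90% two-sided normal interval."""
    return ceil((z * pilot_stdev / (rel_err * pilot_mean)) ** 2)
```

For example, a hypothetical pilot with mean 2.0 s and standard deviation 0.3 s would call for runs_needed(2.0, 0.3) = 25 runs; noisier tool/file combinations push the count up quickly, which is why some cells of Table 1 are so large.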
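The single-factor ANOVA behind Figure 1 reduces to one F statistic per comparison: the between-group mean square over the within-group mean square. A self-contained sketch of that computation (illustrative only; the study itself used Excel):

```python
def anova_f(groups: list[list[float]]) -> float:
    """Single-factor ANOVA F statistic for a list of sample groups."""
    k = len(groups)                              # number of groups
    n = sum(len(g) for g in groups)              # total observations
    grand = sum(sum(g) for g in groups) / n      # grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

An F value far above the critical value (2.627441 throughout Figure 1) indicates the factor, not measurement noise, explains the differences between groups.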
This analysis is meant to provide multiple results and conclusions in an effort to aid a reader in deciding which compression utility is proper for their given situation.

Table 1: Initial Runs Results (measurements needed)

          7zip    bzip2   gzip   info-zip   rar    Mean
Audio     15451   188     1      1          1      47.75
Random    51      227     1      1          83     72.6
Text      27364   129     16     14         101    65
Video     9       410     5      6          30     92
Mean      30      238.5   5.75   5.5        53.75  69.3375

Bibliography

1. Description of LZ77 algorithm, <http://www.cs.duke.edu/courses/spring03/cps296.5/papers/ziv_lempel_1977_universal_algorithm.pdf>
2. Description of Burrows-Wheeler transform algorithm, <http://www.eecs.harvard.edu/~michaelm/CS222/burrows-wheeler.pdf>
3. Description of move-to-front transform algorithm, <http://www.cs.cmu.edu/~sleator/papers/adaptive-data-compression.pdf>
4. Description of Huffman Coding, <http://compression.ru/download/articles/huff/huffman_1952_minimum-redundancy-codes.pdf>
5. Description of DEFLATE Algorithm, <http://tools.ietf.org/html/rfc1951>
6. Description of Prediction by Partial Matching algorithm, <http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.4305>