
R&D Centre for Mobile Applications (RDC)
FEE, Dept of Telecommunications Engineering
Czech Technical University in Prague
RDC Technical Report TR-13-4

Internship report
Evaluation of Compressibility of the Output of the Information-Concealing Algorithm

Julien Mamelli, [email protected]
2nd year student at the École des Mines d'Alès (Nîmes, France)

Internship supervisor: Lukáš Kencl, [email protected]

August 2013

Abstract

Compression is a key element to exchange files over the Internet. By generating redundancies, the concealing algorithm proposed by Kencl and Loebl [?] appears at first glance to be particularly well suited to being combined with a compression scheme [?]. Is the output of the concealing algorithm actually compressible? We have tried 16 compression techniques on 1 120 files, and the result is that we have not found a solution which could advantageously use the repetitions generated by the concealing method.

Acknowledgments

I would like to express my gratitude to my supervisor, Dr Lukáš Kencl, for his guidance and expertise throughout the course of this work.

I would like to thank Prof. Robert Bešták and Mr Pierre Runtz for giving me the opportunity to carry out my internship at the Czech Technical University in Prague.

I would also like to thank all the members of the Research and Development Center for Mobile Applications as well as my colleagues for the assistance they have given me during this period.

Contents

1 Introduction 3

2 Related Work 4
  2.1 Information concealing method 4
  2.2 Archive formats 5
  2.3 Compression algorithms 5
    2.3.1 Lempel-Ziv algorithm 5
    2.3.2 Huffman coding 6
    2.3.3 Burrows-Wheeler transform 6
    2.3.4 Dynamic Markov compression 6
    2.3.5 Prediction by partial matching 6

3 Solution 7
  3.1 Archive formats 10
  3.2 Compression algorithms 10
    3.2.1 Huffman coding 10
    3.2.2 Burrows-Wheeler transform 11
    3.2.3 Dynamic Markov compression 11
    3.2.4 Prediction by partial matching 11

4 Performance Evaluation 12
  4.1 Without dust 12
    4.1.1 Archive formats 13
    4.1.2 Compression algorithms 18
  4.2 With dust 20

5 Conclusion 21

Bibliography 23

Chapter 1

Introduction

Over the past decade, the proliferation of large-capacity storage media has brought security issues to the forefront. In the context of cloud computing, a very large amount of private information travels over the Internet. In order to protect messages against unwanted observers, different techniques have been created. The repeats-based information concealing algorithm is one of them.

This method has been designed to hide messages while preserving the information content for further analysis. Certain features, such as word frequency, are therefore kept from the initial message. The main idea is to first generate several substrings from the input sequence, which are then repeated, shuffled and reassembled into the output sequence.

It is important to stress that no symbols are added to the input.

Following this idea, researchers have made the assumption that compression methods could advantageously use repeated strings to reduce the size of the output file [?].

Consequently, the aim of the internship was to evaluate the compressibility of the output of the information concealing algorithm.

The work was divided into two main parts. At the beginning, we focused on testing a large number of archive formats and comparing them with one another. In this first phase, particular attention was given to evaluating them in the context of the concealing algorithm, independently of their overall efficiency.

Then, we investigated the archive formats more deeply by identifying the compression algorithms they use. In this second part, each compression algorithm has been tested separately, to evaluate its own performance.

Chapter 2

Related Work

2.1 Information concealing method

Inspired by the structure of DNA, the information concealing method is intended to hide messages before exchanging them over the Internet. The main concept is to use repetitions as a key element to prevent the reconstruction of the original sequence; the reconstruction is believed to be computationally hard. Consequently, even if the attacker knows the algorithm itself, he will not be able to access the input message [?].

The k parameter represents the size of the blocks. For example, if we choose k = 4, the input text file is read in successive blocks of 4 characters.
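As a rough illustration of this idea only, the following Python sketch splits an input string into blocks of k characters, repeats each block and shuffles all the copies. It is a toy model of the repeat-and-shuffle principle, not the actual algorithm of Kencl and Loebl; all names and parameters in it are ours.

```python
import random

def conceal_toy(text: str, k: int, copies: int = 3, seed: int = 0) -> str:
    """Toy repeat-and-shuffle sketch: split into k-character blocks, repeat
    each block `copies` times, shuffle all copies and concatenate them.
    Only an illustration, not the Kencl-Loebl concealing algorithm."""
    blocks = [text[i:i + k] for i in range(0, len(text), k)]
    repeated = [block for block in blocks for _ in range(copies)]
    rng = random.Random(seed)
    rng.shuffle(repeated)
    return "".join(repeated)

# With k = 4 the input is read in successive blocks of 4 characters.
print(conceal_toy("the quick brown fox", k=4))
```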

2.2 Archive formats

Several archive formats have been tested with the aim of drawing a comparison between them.

Archive format   Compression algorithms
.gz              LZ77, Huffman coding
.7z              LZMA
.xz              LZMA2
.zip             Multiple methods
.rar             Multiple methods
.bz2             BWT, Huffman coding

BWT is the Burrows-Wheeler transform.

We have completed this study by compressing data with some other, less commonly used formats (.arc, .bh, .cab, .sqx, .yz1, .zpaq).

In order to further investigate compression techniques, and to practically determine which one could be particularly effective for the concealing method, we have looked at the algorithms used by these archive formats. We have then tested four of them separately on the same panel of data, to evaluate their own performance.

2.3 Compression algorithms

As we can see in the previous table, some algorithms are frequently used:

• Lempel-Ziv algorithm and its variants (LZ77, LZMA, LZMA2)

• Huffman coding

• Burrows-Wheeler transform

2.3.1 Lempel-Ziv algorithm

Described in 1977 by Jacob Ziv and Abraham Lempel, it is the most commonly known algorithm for dictionary-based compression. Using a particular dictionary for each file, it compresses the input by creating a pointer to the dictionary for each repetition of a given sequence. In text files, every repeated word is linked to its reference in the dictionary.
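To make this concrete, here is a minimal, illustrative LZ77-style encoder and decoder written for this report; it is a sketch of the principle, not the exact scheme used by gzip or any other tool tested here. Each step emits an (offset, length, next character) triple pointing back into the already-seen text.

```python
def lz77_compress(data: str, window: int = 4096, max_len: int = 32):
    """Toy LZ77 encoder: each step emits (offset, length, next_char), where
    offset/length describe the longest match found in the already-seen text."""
    i, out = 0, []
    while i < len(data):
        best_off, best_len = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            while (length < max_len and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        next_char = data[i + best_len] if i + best_len < len(data) else ""
        out.append((best_off, best_len, next_char))
        i += best_len + 1
    return out

def lz77_decompress(triples) -> str:
    """Rebuild the text by copying back-references and appending literals."""
    out = []
    for off, length, ch in triples:
        for _ in range(length):
            out.append(out[-off])
        out.append(ch)
    return "".join(out)

text = "abcabcabcabc"
assert lz77_decompress(lz77_compress(text)) == text
```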

Consequently, we have supposed that the Lempel-Ziv algorithm could advantageously use the repetitions generated by the concealing algorithm to reach good compression performance.

2.3.2 Huffman coding

Designed in 1952 by David Huffman, Huffman coding starts by constructing a list of the symbols used in the input file (in descending order of frequency). Then, at each iteration, the two symbols which have the smallest probabilities are merged into a node of the tree and removed from the list. Finally the method assigns a unique binary code to each leaf of the tree. As a consequence, the output file is a binary string.
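A minimal sketch of the construction (our own illustrative code, not a production encoder): the two least frequent nodes are repeatedly merged, and the code of every symbol in a merged subtree is extended by one bit at each merge.

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict:
    """Build a Huffman code table by repeatedly merging the two least
    frequent nodes; each merge prepends one bit to the codes of a subtree."""
    heap = [[freq, i, {sym: ""}] for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:                          # degenerate one-symbol input
        return {sym: "0" for sym in heap[0][2]}
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in lo[2].items()}
        merged.update({s: "1" + c for s, c in hi[2].items()})
        heapq.heappush(heap, [lo[0] + hi[0], lo[1], merged])
    return heap[0][2]

codes = huffman_codes("abracadabra")
encoded = "".join(codes[ch] for ch in "abracadabra")
print(codes, len(encoded), "bits")
```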

Huffman coding is often combined with other techniques to have better results.

2.3.3 Burrows-Wheeler transform

Invented by Michael Burrows and David Wheeler in 1994, this method converts a string S into another string L which satisfies two conditions (a short sketch follows the list below):

• Any region of L will tend to have a concentration of just a few symbols.

• It is possible to reconstruct the original string S.
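A naive sketch of both directions of the transform, written for illustration only; real implementations use suffix arrays rather than sorting all rotations.

```python
def bwt(s: str, eos: str = "\x03") -> str:
    """Burrows-Wheeler transform: sort all rotations of the string and keep
    the last column.  Equal symbols tend to cluster together in the output."""
    s = s + eos                                   # unique end-of-string marker
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def inverse_bwt(last: str, eos: str = "\x03") -> str:
    """Reconstruct the original string from the last column (naive O(n^2))."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    row = next(r for r in table if r.endswith(eos))
    return row.rstrip(eos)

print(bwt("banana"))                  # equal symbols of the input get grouped
print(inverse_bwt(bwt("banana")))     # -> 'banana'
```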

We have also looked for other compression algorithms, and have therefore tested Dynamic Markov compression and Prediction by partial matching on the same data set.

2.3.4 Dynamic Markov compression

Created by Gordon Cormack and Nigel Horspool in 1987, it is a statistical compression technique which compresses a file by predicting the next bit using the previously seen bits.
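Real DMC grows and clones a finite-state machine over the bit stream, which is beyond the scope of a short listing. As a much-simplified illustration of the underlying idea (predict each bit from what followed the same context before, and charge -log2 of the predicted probability, i.e. the ideal coded size), one can sketch:

```python
import math
from collections import defaultdict

def predictive_cost_bits(bits: str, order: int = 8) -> float:
    """Simplified stand-in for the DMC idea: predict each bit from the counts
    observed after the preceding `order` bits and sum -log2(p).  Real DMC
    builds and clones a state machine instead of using a fixed-order context."""
    counts = defaultdict(lambda: [1, 1])          # Laplace-smoothed 0/1 counts
    total = 0.0
    for i, b in enumerate(bits):
        ctx = bits[max(0, i - order):i]
        c0, c1 = counts[ctx]
        p = (c1 if b == "1" else c0) / (c0 + c1)
        total += -math.log2(p)
        counts[ctx][int(b)] += 1
    return total

data = "10" * 200                                  # a highly repetitive bit string
print(predictive_cost_bits(data), "bits for", len(data), "input bits")
```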

2.3.5 Prediction by partial matching

Developed by John Cleary and Ian Witten in 1984, it estimates probability distributions of characters and adds states to an existing model as it reads the input. This algorithm is known to be particularly effective on text files.
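The following sketch only conveys the flavour of PPM: each character is predicted from the longest previously seen context, backing off to shorter contexts and finally to a uniform model; real PPM variants handle escape probabilities much more carefully.

```python
import math
from collections import defaultdict

def ppm_cost_bits(text: str, max_order: int = 3) -> float:
    """Rough PPM-style cost estimate: predict each character from the longest
    matching context, back off to shorter ones, and sum -log2(p)."""
    tables = [defaultdict(lambda: defaultdict(int)) for _ in range(max_order + 1)]
    alphabet = 256                                     # crude uniform fallback
    total = 0.0
    for i, ch in enumerate(text):
        p = None
        for order in range(min(max_order, i), -1, -1):
            seen = tables[order][text[i - order:i]]
            if ch in seen:
                p = seen[ch] / (sum(seen.values()) + 1)  # crude escape mass of 1
                break
        if p is None:
            p = 1.0 / alphabet
        total += -math.log2(p)
        for order in range(min(max_order, i) + 1):       # update all contexts
            tables[order][text[i - order:i]][ch] += 1
    return total

print(ppm_cost_bits("abcabcabcabc"), "bits")
```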

Chapter 3

Solution

We have chosen a set of data files and a set of compression techniques in order to evaluate them. Is the output of the concealing algorithm compressible? The main purpose of the evaluation process was to answer this question.

Compression methods have been applied to different types of data. The properties of the input files are presented in the following tables.

Text files (.txt):

                              File 01   File 02   File 03   File 04   File 05
Characters (without spaces)       430     9 156    26 554    18 280     1 239
Characters (with spaces)          505    11 119    32 397    22 219     1 508
Size (kB)                       0,529      10,9      31,6      21,8       1,5

                              File 06   File 07   File 08   File 09   File 10
Characters (without spaces)    28 981     3 178     3 147     4 188    15 371
Characters (with spaces)       35 546     3 941     3 823     5 137    18 642
Size (kB)                          35      3,87      3,79      5,04      18,3

Audio files (.wav):

                 File 01   File 02   File 03   File 04   File 05
Length (mm:ss)     00:09     00:02     00:02     00:02     00:04
Size (kB)             78      27,8      26,9      27,5      46,2

                 File 06   File 07   File 08   File 09   File 10
Length (mm:ss)     00:02     00:01     00:01     00:01     00:01
Size (kB)           27,8      29,5      13,7      19,3      7,19

We have first tested each compression method without using the concealing algorithm, to ensure that it can use repetitions to improve compression performance in a general context.

To do so, we have made each text file three times longer by copying its content. Then we have compressed the tripled files and compared their sizes with the sizes of the compressed simple text files. By computing the following ratio, we have evaluated the overall compression performance of each compression technique.

compression ratio = size of the compressed simple file / size of the compressed triple file
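As an illustration of how this ratio can be measured, the sketch below uses Python's gzip module as one example compressor; the file name is hypothetical.

```python
import gzip

def compressed_size(data: bytes) -> int:
    # gzip stands in here for any of the tested compression techniques
    return len(gzip.compress(data))

def tripling_ratio(path: str) -> float:
    """Compressed size of the simple file divided by the compressed size of
    the same content repeated three times; a value close to 1 means the
    compressor exploits the repetition almost perfectly."""
    data = open(path, "rb").read()
    return compressed_size(data) / compressed_size(data * 3)

# Hypothetical file name:
# print(tripling_ratio("file01.txt"))
```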

In the two following tables, every value is the average compression ratio over all files for the given compression method.

Archive format   Compression ratio
.arc             0,998
.yz1             0,997
.7z              0,986
.xz              0,985
.rar             0,978
.cab             0,977
.sqx             0,915
.zip             0,892
.bh              0,890
.gz              0,891
.zpaq            0,732
.bz2             0,727

Compression algorithm            Compression ratio
Dynamic Markov compression       0,994
Prediction by partial matching   0,859
Burrows-Wheeler transform        0,676
Huffman coding                   0,333

As we can see, most of the compression techniques have a very high compression ratio (greater than 0,8). So these compression methods can use repetitions to improve compression in a general context. In the sections that follow, these compression techniques will be used in the particular context of the concealing algorithm.

The general process used to evaluate the compressibility of the output of the concealing algorithm can be split into two parts. Files are compressed before and after being concealed by the algorithm.

Figure 3.1: Evaluation process

In order to evaluate the compression techniques, two metrics have been used.

We defined α as the uncompressed size ratio, established before compression:

α = size of the input file / size of the output file

We also defined β as the compressed size ratio; it is the indicator of performance:

β = size of the compressed input file / size of the compressed output file

The closer β is to 1, the better the compression technique is ranked. According to the conjecture [?], this metric should differ from α.

It is important to note that the compressed size ratio is different from the compression ratio of the technique itself, because the numerator and the denominator do not refer to the same file: we compare the input file with its concealed output, each after compression. In this way we estimate the performance of the compression technique in the context of the concealing algorithm.

Tables of results are presented with the average compressed size ratio. For a given compression method and a given value of the k parameter, the average compressed size ratio over all files equals the sum of the compressed size ratio of each file, divided by the number of files.
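As a sketch of how α, β and their averages can be computed, gzip again stands in for any of the tested compressors, and conceal stands for the external concealing implementation, which is not reproduced here.

```python
import gzip
from statistics import mean

def compressed_size(data: bytes) -> int:
    # gzip stands in here for any of the compression techniques under test
    return len(gzip.compress(data))

def alpha(original: bytes, concealed: bytes) -> float:
    """Uncompressed size ratio: input size over concealed-output size."""
    return len(original) / len(concealed)

def beta(original: bytes, concealed: bytes) -> float:
    """Compressed size ratio: compressed input over compressed concealed output."""
    return compressed_size(original) / compressed_size(concealed)

# Hypothetical usage; `conceal` is the external concealing implementation.
# pairs = [(data, conceal(data, k=4)) for data in original_files]
# average_beta = mean(beta(orig, conc) for orig, conc in pairs)
```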

We have used a 2004 Matlab implementation of the method, which conceals text files (.txt) and audio files (.wav).

3.1 Archive formats

The panel considered is composed of ten text files and ten audio files. We have chosen four values of the k parameter (k = 3, k = 4, k = 5, k = 10) and we have tested 12 archive formats (.gz, .7z, .xz, .zip, .rar, .bz2, .arc, .bh, .cab, .sqx, .yz1, .zpaq). As a consequence, the set was composed of 960 files.

As initially supposed, the Lempel-Ziv algorithm seems to be a key element to reach good compression performance with the concealing algorithm. Indeed, 7-ZIP and XZ, which implement the Lempel-Ziv algorithm, are among the best archive formats in the context of the concealing algorithm, while BZIP2, which does not use it, has the lowest compressed size ratio in every case.

GZIP has the best compressed size ratio. As an implementation of both the Lempel-Ziv algorithm (LZ77) and Huffman coding, this could suggest that combining a dictionary-based compression algorithm with an entropic one is a good choice for the concealing method.

BZIP2 combines Huffman coding and the Burrows-Wheeler transform. This solution shows a lower result; we can deduce from it that Huffman coding is not sufficient to ensure a good compressed size ratio.

3.2 Compression algorithms

In this second part, we have only used text files. We have chosen the same values of the k parameter (k=3, k=4, k=5, k=10) and we have tested 4 compression algorithms (Huffman coding, Burrows-Wheeler transform, Dynamic Markov compression, Prediction by partial matching). So the set was composed of 160 files.

3.2.1 Huffman coding

The output file of Huffman coding being a string of binary characters, we have to divide its size by 8 to find the actual compressed size, because each binary character is itself encoded on one byte.
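For instance, with a small helper of our own:

```python
import math

def packed_size_bytes(bitstring: str) -> int:
    """Size the Huffman output would occupy once the '0'/'1' characters are
    actually packed eight bits per byte."""
    return math.ceil(len(bitstring) / 8)

print(packed_size_bytes("0110100101"))   # 10 bits -> 2 bytes
```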

It should be underlined that Huffman coding depends only on the k parameter. Indeed, the results of this algorithm are independent of the size of the input text files.

To compress a file with a small value of the k parameter, Huffman coding represents the best compression algorithm.

3.2.2 Burrows-Wheeler transform

The higher the k parameter, the higher the compressed size ratio.

There is a peak with the first and the fifth text files because they are the shortest ones.

3.2.3 Dynamic Markov compression

We have chosen a memsize of 256 MBytes.

The higher the k parameter, the higher the compressed size ratio.

3.2.4 Prediction by partial matching

The higher the k parameter, the higher the compressed size ratio.

To compress a file with a high value of the k parameter, Prediction by partial matching represents the best compression algorithm.

Chapter 4

Performance Evaluation

We have collected files and compressed them, in order to evaluate whether β is close to 1, as stated in the conjecture, or close to α.

This chapter presents the results obtained on the data set.

The concealing method can be used with or without dust. Dust is a short string added to the end of each segment in order to make the reconstruction process even more complicated. We have first tested the compression techniques without dust, and then added it to the text files to compare with the previous results.

4.1 Without dust

During this first part, we have used the concealing method without dust.

The first two figures (figure 4.1 and figure 4.2) present the uncompressed size ratio for the input files.

Figure 4.1: Text files
Figure 4.2: Audio files

For all input files, α is between 0.3 and 0.4.

4.1.1 Archive formats

The following tables present the average compressed size ratio over all files.

Text files (.txt):

Archive format   .gz     .7z     .xz     .zip    .rar    .bz2
For k = 3        0,331   0,345   0,340   0,341   0,340   0,310
For k = 4        0,355   0,368   0,362   0,361   0,361   0,320
For k = 5        0,390   0,409   0,402   0,400   0,401   0,346
For k = 10       0,506   0,528   0,525   0,517   0,520   0,430

Archive format   .arc    .bh     .cab    .sqx    .yz1    .zpaq
For k = 3        0,343   0,332   0,349   0,341   0,337   0,326
For k = 4        0,354   0,351   0,371   0,360   0,350   0,333
For k = 5        0,379   0,391   0,408   0,397   0,378   0,355
For k = 10       0,465   0,508   0,523   0,512   0,480   0,429

As we can see, the results are not close to 1. They are close to α.

Audio files (.wav):

Archive format   .gz     .7z     .xz     .zip    .rar    .bz2
For k = 3        0,434   0,415   0,414   0,436   0,410   0,380
For k = 4        0,475   0,454   0,453   0,477   0,446   0,393
For k = 5        0,522   0,504   0,504   0,525   0,489   0,425
For k = 10       0,646   0,637   0,636   0,645   0,598   0,513

Results are slightly better with audio files on average. But these same results also vary more than those achieved with text files, as we can see in the figures below.

These graphs display results from the evaluation of the compressed size ratio for each archive format and for each value of the k parameter.

The first four figures (from figure 4.3 to figure 4.6) show results with text files.

Figure 4.3: Text files with k=3

Figure 4.4: Text files with k=4

Figure 4.5: Text files with k=5

Figure 4.6: Text files with k=10

The four following figures (from figure 4.7 to figure 4.10) show the compressed size ratio for each audio file.

Figure 4.7: Audio files with k=3

Figure 4.8: Audio files with k=4

Figure 4.9: Audio files with k=5

Figure 4.10: Audio files with k=10

4.1.2 Compression algorithms

This second section focuses on the evaluation of the compression algorithms.

In all cases below, the given value is the average compressed size ratio over all files.

Compression algorithm            k = 3   k = 4   k = 5   k = 10
Huffman coding                   0,366   0,349   0,345   0,328
BWT                              0,315   0,326   0,352   0,427
Dynamic Markov compression       0,321   0,326   0,343   0,393
Prediction by partial matching   0,330   0,350   0,389   0,505

BWT is the Burrows-Wheeler transform.

Once again, we see that the compressed size ratio is close to the uncompressed size ratio.

The following figures (from figure 4.11 to figure 4.14) show how the text files were compressed by each compression algorithm.

Figure 4.11: Huffman coding

Huffman coding preserves the order of the data. And the higher the k parameter, the lower the compressed size ratio, unlike other compression algorithms.

Figure 4.12: Burrows-Wheeler transform

Figure 4.13: Dynamic Markov compression

Figure 4.14: Prediction by partial matching

4.2 With dust

In this second part, dust has been added at the end of each substring. We have retested all sixteen compression methods with k=3. In this new case, α = 0,037. The following table presents the compressed size ratio over all text files. Adding dust makes a big difference in terms of growth of the output, but the compressed size ratio remains close to the uncompressed size ratio.

Compression method               β
.gz                              0,029
.7z                              0,033
.xz                              0,032
.zip                             0,031
.rar                             0,031
.bz2                             0,028
.arc                             0,032
.bh                              0,030
.cab                             0,032
.sqx                             0,030
.yz1                             0,031
.zpaq                            0,030
Huffman coding                   0,037
Burrows-Wheeler transform        0,029
Dynamic Markov compression       0,028
Prediction by partial matching   0,029

Chapter 5

Conclusion

As we have seen at the beginning, every compression method tested here can efficiently compress files when the whole text is repeated. Thus, generally speaking, these compression techniques can advantageously use repetitions to improve their performance.

However, the results achieved during this internship show that none of these methods provides a satisfactory compressed size ratio with the concealing algorithm.

Maybe the reason is that these compression techniques process the input one character at a time, instead of working on blocks.

Consequently, to continue this work, it would be useful to create a new compression method, perhaps based on Dynamic Markov compression (because it is the one which reached the best compression ratio). This algorithm, specifically designed for the concealing algorithm, would take into account the value of the k parameter.

List of Figures

3.1 Evaluation process 9

4.1 Text files 12
4.2 Audio files 12
4.3 Text files with k=3 14
4.4 Text files with k=4 14
4.5 Text files with k=5 15
4.6 Text files with k=10 15
4.7 Audio files with k=3 16
4.8 Audio files with k=4 16
4.9 Audio files with k=5 17
4.10 Audio files with k=10 17
4.11 Huffman coding 18
4.12 Burrows-Wheeler transform 19
4.13 Dynamic Markov compression 19
4.14 Prediction by partial matching 20

Bibliography
