R&D Centre for Mobile Applications (RDC)
FEE, Dept of Telecommunications Engineering
Czech Technical University in Prague

RDC Technical Report TR-13-4

Internship report

Evaluation of Compressibility of the Output of the Information-Concealing Algorithm

Julien Mamelli, [email protected]
2nd year student at the École des Mines d'Alès (Nîmes, France)

Internship supervisor: Lukáš Kencl, [email protected]

August 2013

Abstract

Compression is a key element of file exchange over the Internet. By generating redundancies, the concealing algorithm proposed by Kencl and Loebl [?] appears at first glance to be particularly well suited to combination with a compression scheme [?]. Is the output of the concealing algorithm actually compressible? We tried 16 compression techniques on 1 120 files, and found no solution that could advantageously use the repetitions of the concealing method.

Acknowledgments

I would like to express my gratitude to my supervisor, Dr Lukáš Kencl, for his guidance and expertise throughout the course of this work. I would like to thank Prof. Robert Bešták and Mr Pierre Runtz for giving me the opportunity to carry out my internship at the Czech Technical University in Prague. I would also like to thank all the members of the Research and Development Centre for Mobile Applications, as well as my colleagues, for the assistance they have given me during this period.

Contents

1 Introduction
2 Related Work
  2.1 Information concealing method
  2.2 Archive formats
  2.3 Compression algorithms
    2.3.1 Lempel-Ziv algorithm
    2.3.2 Huffman coding
    2.3.3 Burrows-Wheeler transform
    2.3.4 Dynamic Markov compression
    2.3.5 Prediction by partial matching
3 Solution
  3.1 Archive formats
  3.2 Compression algorithms
    3.2.1 Huffman coding
    3.2.2 Burrows-Wheeler transform
    3.2.3 Dynamic Markov compression
    3.2.4 Prediction by partial matching
4 Performance Evaluation
  4.1 Without dust
    4.1.1 Archive formats
    4.1.2 Compression algorithms
  4.2 With dust
5 Conclusion
Bibliography

Chapter 1
Introduction

Over the past decade, the proliferation of large-capacity storage media has brought security issues to the forefront. In the context of cloud computing in particular, a very large amount of private information circulates on the Internet. Several techniques have been created to protect messages against unwanted observers; the repeats-based information concealing algorithm is one of them. This method has been designed to hide messages while preserving the information content for further analysis, so certain features of the initial message, such as word frequency, are kept. The main idea is to first generate several substrings from the input sequence, which are then repeated, shuffled and reassembled into the output sequence. It is important to stress that no symbols are added to the input. Following this idea, researchers have made the assumption that compression methods could advantageously use the repeated strings to reduce the size of the output file [?]. Consequently, the aim of the internship was to evaluate the compressibility of the output of the information concealing algorithm.

The work was divided into two main parts. We first focused on testing a large number of archive formats and comparing them with one another. Particular attention was given to evaluating them in the context of the concealing algorithm, independently of their overall efficiency. We then investigated these archive formats more deeply by identifying which compression algorithms they use. In this second part, each compression algorithm was tested separately, to measure its own performance.

Chapter 2
Related Work

2.1 Information concealing method

Inspired by the structure of DNA, the information concealing method is intended to hide messages before exchanging them over the Internet.
The main concept is to use repetitions as a key element to prevent reconstruction of the original sequence; the reconstruction is believed to be computationally hard. Consequently, even if an attacker knows the algorithm itself, he will not be able to access the input message [?].

The parameter k represents the size of the blocks. As we can see in the following figure, if we choose k = 4, the input text file is read in successive blocks of 4 characters.

2.2 Archive formats

Several archive formats have been tested with the aim of drawing a comparison between them.

  Format   Compression algorithm(s)
  .gz      LZ77 + Huffman coding
  .7z      LZMA
  .xz      LZMA2
  .zip     multiple methods
  .rar     multiple methods
  .bz2     BWT + Huffman coding

BWT is the Burrows-Wheeler transform. We have completed this study by compressing data with some other, less commonly used formats (.arc, .bh, .cab, .sqx, .yz1, .zpaq).

In order to further investigate compression techniques, and to determine in practice which one could be particularly effective for the concealing method, we have looked up the algorithms used by these archive formats. We have then tested four of them separately on the same panel of data, to evaluate their individual performance.

2.3 Compression algorithms

As we can see in the previous table, some algorithms are frequently used:

• Lempel-Ziv algorithm and its variants (LZ77, LZMA, LZMA2)
• Huffman coding
• Burrows-Wheeler transform

2.3.1 Lempel-Ziv algorithm

Described in 1977 by Jacob Ziv and Abraham Lempel, it is the best-known algorithm for dictionary-based compression. Building a dictionary for each file, it compresses the input by replacing each repetition of a given sequence with a pointer into the dictionary. In text files, every repeated word is linked to its reference in the dictionary. Consequently, we have supposed that the Lempel-Ziv algorithm could advantageously use the repetitions generated by the concealing algorithm to reach good compression performance.
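This supposition can be illustrated with a quick experiment (a minimal sketch, not part of the report's own test suite, using Python's standard zlib module, which implements DEFLATE, i.e. LZ77 followed by Huffman coding — the combination behind the .gz format):

```python
import os
import zlib

# A 10 000-byte input made of one short phrase repeated many times:
# an LZ77-style compressor replaces each repeat with a back-reference.
repetitive = b"the quick brown fox " * 500

# A 10 000-byte incompressible input for comparison.
random_data = os.urandom(10_000)

comp_rep = zlib.compress(repetitive, level=9)
comp_rnd = zlib.compress(random_data, level=9)

print(len(comp_rep))  # tiny: the repeats collapse into back-references
print(len(comp_rnd))  # close to 10 000: nothing to exploit
```

The repetitive input shrinks by orders of magnitude while the random input does not, which is exactly the behaviour one would hope the concealing algorithm's repeats could trigger.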
2.3.2 Huffman coding

Designed in 1952 by David Huffman, Huffman coding starts by constructing a list of the symbols used in the input file, in descending order of frequency. Then, at each iteration, the two symbols with the smallest probabilities are placed in the tree and removed from the list. Finally, the method assigns a unique binary code to each leaf of the tree; as a consequence, the output file is a binary string. Huffman coding is often combined with other techniques to obtain better results.

2.3.3 Burrows-Wheeler transform

Invented by Michael Burrows and David Wheeler in 1994, this method converts a string S into another string L which satisfies two conditions:

• Any region of L will tend to have a concentration of just a few symbols.
• It is possible to reconstruct the original string S.

We have also looked for other compression algorithms, and have tested Dynamic Markov compression and Prediction by partial matching on the same data set.

2.3.4 Dynamic Markov compression

Created by Gordon Cormack and Nigel Horspool in 1987, it is a statistical compression technique which compresses a file by predicting the next bit from the previously seen bits.

2.3.5 Prediction by partial matching

Developed by John Cleary and Ian Witten in 1984, it first calculates the probability distribution of characters and adds states to an existing finite-state machine. This algorithm is known to be particularly effective on texts.

Chapter 3
Solution

We have chosen several data sets and a set of compression techniques in order to evaluate them. Is the output of the concealing algorithm compressible? The main purpose of the evaluation process was to answer this question. The compression methods have been applied to different types of data; the properties of the input files are presented in the following tables.
Text files (.txt):

                               File 01  File 02  File 03  File 04  File 05
  Characters (without spaces)      430    9 156   26 554   18 280    1 239
  Characters (with spaces)         505   11 119   32 397   22 219    1 508
  Size (kB)                      0,529     10,9     31,6     21,8      1,5

                               File 06  File 07  File 08  File 09  File 10
  Characters (without spaces)   28 981    3 178    3 147    4 188   15 371
  Characters (with spaces)      35 546    3 941    3 823    5 137   18 642
  Size (kB)                         35     3,87     3,79     5,04     18,3

Audio files (.wav):

                   File 01  File 02  File 03  File 04  File 05
  Length (mm:ss)     00:09    00:02    00:02    00:02    00:04
  Size (kB)             78     27,8     26,9     27,5     46,2

                   File 06  File 07  File 08  File 09  File 10
  Length (mm:ss)     00:02    00:01    00:01    00:01    00:01
  Size (kB)           27,8     29,5     13,7     19,3     7,19

We have first tested each compression method without using the concealing algorithm, to verify that it can use repetitions to improve compression performance in a general context. To do so, we made each text file three times longer by copying its content, compressed it, and compared its size with the size of the compressed simple text file. The overall compression performance of each technique was evaluated with the following ratio:

  compression ratio = size of the compressed simple file / size of the compressed triple file

In the two following tables, every value is the average compression ratio over all files for the given compression method.

  Archive format   Compression ratio
  .arc             0,998
  .yz1             0,997
  .7z              0,986
  .xz              0,985
  .rar             0,978
  .cab             0,977
  .sqx             0,915
  .zip             0,892
  .gz              0,891
  .bh              0,890
  .zpaq            0,732
  .bz2             0,727

  Compression algorithm            Compression ratio
  Dynamic Markov compression       0,994
  Prediction by partial matching   0,859
  Burrows-Wheeler transform        0,676
  Huffman coding                   0,333

As we can see, most of the compression techniques have a very high compression ratio (greater than 0,8).
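The tripling experiment can be reproduced with stock compressors (a hedged sketch using Python's zlib and bz2 modules as stand-ins for the .gz and .bz2 formats; the sample text below is illustrative, not the report's actual file panel):

```python
import bz2
import zlib


def triple_ratio(compress, data: bytes) -> float:
    """Size of the compressed simple file divided by the size of the
    compressed triple file -- the ratio used in this chapter."""
    return len(compress(data)) / len(compress(data * 3))


# Stand-in for one of the text files (a few kB, well inside zlib's
# 32 kB window, so the tripled copies are visible to the compressor).
sample = ("Over the past decade, the proliferation of large-capacity "
          "storage media has brought security issues to the forefront. "
          * 40).encode()

# DEFLATE (LZ77 + Huffman): the second and third copies become
# back-references, so the triple file barely grows and the ratio
# stays close to 1.
r_deflate = triple_ratio(zlib.compress, sample)

# bzip2 (BWT-based), for comparison with the .bz2 row of the table.
r_bzip2 = triple_ratio(bz2.compress, sample)

print(round(r_deflate, 3), round(r_bzip2, 3))
```

A ratio near 1 means the tripled file compresses to almost the same size as the simple file, i.e. the method fully exploits the added repetitions; lower values mean the repetitions are only partially used.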