Tobiesen Ole.Pdf (2.195Mb)

Faculty of Science and Technology MASTER’S THESIS Study program/ Specialization: Spring semester, 2016 Master of Science in Computer Science Open Writer: Ole Tobiesen ………………………………………… (Writer’s signature) Faculty supervisor: Reggie Davidrajuh Thesis title: Data Fingerprinting -- Identifying Files and Tables with Hashing Schemes Credits (ECTS): 30 Key words: Pages: 145, including table of contents Data Fingerprinting and appendix Fuzzy Hashing Machine Learning Enclosure: 4 7z archives Merkle Trees Finite Fields Mersenne Primes Stavanger, 15 June 2016 Front page for master thesis Faculty of Science and Technology Decision made by the Dean October 30th 2009 Abstract INTRODUCTION: Although hash functions are nothing new, these are not limited to cryptographic purposes. One important field is data fingerprinting. Here, the purpose is to generate a digest which serves as a fingerprint (or a license plate) that uniquely identifies a file. More recently, fuzzy fingerprinting schemes — which will scrap the avalanche effect in favour of detecting local changes — has hit the spotlight. The main purpose of this project is to find ways to classify text tables, and discover where potential changes or inconsitencies have happened. METHODS: Large parts of this report can be considered applied discrete math- ematics — and finite fields and combinatorics have played an important part. Ra- bin’s fingerprinting scheme was tested extensively and compared against existing cryptographic algorithms, CRC and FNV. Moreover, a self-designed fuzzy hashing algorithm with the preliminary name No-Frills Hash has been created and tested against Nilsimsa and Spamsum. NFHash is based on Mersenne primes, and uses a sliding window to create a fuzzy hash. Futhermore, the usefullness of lookup tables (with partial seeds) were also explored. The fuzzy hashing algorithm has also been combined with a k-NN classifier to get an overview over it’s ability to classify files. In addition to NFHash, Bloom filters combined with Merkle Trees have been the most important part of this report. This combination will allow a user to see where a change was made, despite the fact that hash functions are one-way. Large parts of this project has dealt with the study of other open-source libraries and applications, such as Cassandra and SSDeep — as well as how bitcoins work. Optimizations have played a crucial role as well; different approaches to a problem might lead to the same solution, but resource consumption can be very different. RESULTS: The results have shown that the Merkle Tree-based approach can track changes to a table very quickly and efficiently, due to it being conservative when it comes to CPU resources. Moreover, the self-designed algorithm NFHash also does well in terms of file classification when it is coupled with a k-NN classifyer. CONCLUSION: Hash functions refers to a very diverse set of algorithms, and not just algorithms that serve a limited purpose. Fuzzy Fingerprinting Schemes can still be considered to be at their infant stage, but a lot has still happened the last ten years. This project has introduced two new ways to create and compare hashes that can be compared to similar, yet not necessarily identical files — or to detect if (and to what extent) a file was changed. Note that the algorithms presented here should be considered prototypes, and still might need some large scale testing to sort out potential flaws. KEYWORDS: Fingerprinting, k-NN, Supervised Learning, Sliding Window, Rolling Hash, Classification, fuzzy fingerprinting, locality-sensitive hashing, one-way function, Merkle Trees, collision resistance, hash functions, Galois fields, Mersenne primes, Rabin’s Fingerprinting Scheme, CRC, Bloom Filters, Jaccard, Damerau- Levenshtein Page i For my fiancee, Dongjing Page ii LIST OF FIGURES List of Figures 1 Just like it is difficult and tedious to reassemble the shredded paper into a complete document, it is also computationally infeasible to try to decipher a hashed output . 15 2 A hash functions scrambles an input and then checks for a match among other scrambled outputs . 16 3 How the Merkle-Damgård construction operates . 18 4 Chosen-prefix attacks explained (this picture is public domain) . 23 5 Two identical images, with two completely different SHA-256 digests, due to a few bytes differing in the file header . 24 6 A naive chunk-wise hashing approach might sometimes give good results on two equally long inputs . 25 7 A naive chunkwise hashing approach will not give good results if there is a length disparity (even by one character) . 26 8 A rolling hash will utilize a sliding window to hash each byte individually. If one of these digests matches or exceeds the predefined block boundariy, a block boundary is defined. 26 9 Nilsimsa makes use of accumulators instead of taking the rolling hash approach. 28 10 Example of a Merkle Tree with a depth of 3 . 34 11 Example of a Merkle Tree with an Odd Number of Terminal Nodes . 35 12 Example of a Bloom Filter with m entries, 2 digests per entry, 3 bits per digest, and n bits in the bitset . 36 13 Bloom Filters will tell if an element is probably in the dataset or data structure being referenced . 37 14 Example of a two-class, two-dimensional approach with MLE . 40 15 How a hypersphere in a two-dimensional k-NN approach with n = 9 and various kn values operates . 41 16 Class ω1 is represented by the blue line, while class ω2 is represented by the orange line. The purple line is where the two probability of the two classes intersect . 44 17 Example of a two-dimensional dataset with three classes using the k-NN approach . 46 18 Example of a two-dimensional dataset with three classes where overfitting is avoided . 47 19 Left: Errors due to high variance. Right: Errors due to high bias (this picture is public domain) . 48 20 NFHash explained in three steps . 54 21 How a change in node C will affect the traversal . 60 22 Initial Approach: Comparing two Merkle Trees with a different number of nodes (a depth of 4 and 2 for each tree respectively) . 62 23 Final Approach when comparing two Trees of different depth . 63 Page iii LIST OF FIGURES 24 An example of how it can be applied in a table with 5 columns and 16 rows 64 25 Inserting a row and a column into a table (blue color) . 66 26 Worst-case insertion compared to best-case insertion of a row or a column when represented by Merkle Trees . 67 27 UML class diagram of the entire project . 70 28 How the different fuzzy hashing schemes handle 50, 100, and 200 consec- utive 30 KB tables. The times are given in ms . 83 29 Time used by each algorithm to hash a single file of the respective sizes. The times are given in ms . 85 30 Output when comparing two different SAAB 900 models . 96 31 Comparing two Different-Length Tables . 97 32 Two similar png images that have two different NFHash digests because the latter is an inverted version of the former in terms of colors . 104 Page iv LIST OF TABLES List of Tables 1 Important Acronmyms . viii 2 Arithmetic operations compared to their Boolean equivalents . 8 3 Example of how a binary Galois field with a degree of 6 works . 10 4 Addition and subtraction in GF (2) (top) can be considered identical to an XOR operation. Multiplication (bottom) can be considered identical to an AND operation . 11 5 Examples of irreducible polynomials of various degrees over GF(2) . 12 6 Security requirements for a secure hashing scheme . 17 7 Collisions risks and fingerprint sizes . 20 8 How a Cyclic Redundancy Check Operates (for illustrative purposes, the four padded bits are not included) . 23 9 A Simplified Example Showcasing the Jaccard Coefficient Between Java and C++ . 31 10 Examples of How to Calculate Levenshtein Distance (the bottom table is the most efficient approach). Here, s means substitution, i means insertion, and d means deletion . 31 11 Matrix representation of the most optimal solution in Table 10 . 32 12 Levenshtein and how it stacks up against Damerau-Levenshtein . 32 13 A Two-Class Dataset . 43 14 Damerau-Levenshtein distance between test sample and samples in dataset 43 15 A confusion matrix explained . 45 16 Base64 Encoding Table . 57 17 NFHash compared to Nilsimsa and Spamsum . 59 18 Different settings for each subtest in Test II . 90 Page v LISTINGS Listings 1 FNV-1a (pseudocode) . 21 2 Damerau-Levenshtein Distance . 70 3 Terniary conditions no longer offer many advantages in terms of optimizations . 72 4 How an x86-64 CPU Does Division . 73 6 How CRC-64 is implemented in this library (the lookup table can be found in the appendix). This is mostly based on the existing CRC-32 class in the java.util.zip package . 74 7 How the table generator for Rabin’s Fingerprinting Scheme works . 75 8 k-NN discriminant function . 76 9 Merkle Tree constructor . 77 10 Sorting one list according to another . 77 11 How the class representing nodes appears . 78 12 Comparing Two Trees . 79 13 A small part of a CSV from the first class . 90 14 A small part of a CSV from the second class . 90 15 bloomFilter.java . .g 16 CRC64.java . .h 17 DamerauLevenshtein.java . .j 18 kNN.java . .k 19 NFHash.java . .n 20 tableGenerator.java . .o 21 Rabin.java . .p 22 merkle.java . .s 23 treeComparator.java . .w Page vi LISTINGS Acknowledgements This project would not have been possible without the help and support from several peo- ple. I would like to thank the supervisors in this project, Derek Göbel (Avito LOOPS), Paolo Predonzani (Avito LOOPS) and Reggie Davidrajuh (UiS) for helping me with this thesis, for providing me with feedback and tips, for helping me better test and debug the library — and for offering good advice.

Tobiesen Ole.Pdf (2.195Mb)

Fundamental Data Structures Contents

Remote Synchronization Using Differential Compression

Data Structures

String Hashing for Collection-Based Compression