Faculty of Science and Technology

MASTER'S THESIS

Study program/specialization: Master of Science in Computer Science
Semester: Spring 2016
Access: Open
Writer: Ole Tobiesen
Faculty supervisor: Reggie Davidrajuh
Thesis title: Data Fingerprinting: Identifying Files and Tables with Hashing Schemes
Credits (ECTS): 30
Key words: data fingerprinting, fuzzy hashing, machine learning, Merkle trees, finite fields, Mersenne primes
Pages: 145, including table of contents and appendix
Enclosure: 4 7z archives
Stavanger, 15 June 2016

Abstract

INTRODUCTION: Although hash functions are nothing new, they are not limited to cryptographic purposes. One important field is data fingerprinting. Here, the purpose is to generate a digest that serves as a fingerprint (or a license plate) uniquely identifying a file. More recently, fuzzy fingerprinting schemes, which scrap the avalanche effect in favour of detecting local changes, have hit the spotlight. The main purpose of this project is to find ways to classify text tables, and to discover where potential changes or inconsistencies have happened.

METHODS: Large parts of this report can be considered applied discrete mathematics, and finite fields and combinatorics have played an important part. Rabin's fingerprinting scheme was tested extensively and compared against existing cryptographic algorithms, CRC, and FNV. Moreover, a self-designed fuzzy hashing algorithm with the preliminary name No-Frills Hash (NFHash) has been created and tested against Nilsimsa and Spamsum. NFHash is based on Mersenne primes and uses a sliding window to create a fuzzy hash. Furthermore, the usefulness of lookup tables (with partial seeds) was also explored. The fuzzy hashing algorithm has also been combined with a k-NN classifier to get an overview of its ability to classify files. In addition to NFHash, Bloom filters combined with Merkle trees have been the most important part of this report. This combination allows a user to see where a change was made, despite the fact that hash functions are one-way. Large parts of this project have dealt with the study of other open-source libraries and applications, such as Cassandra and SSDeep, as well as how bitcoins work. Optimizations have played a crucial role as well: different approaches to a problem might lead to the same solution, but resource consumption can be very different.

RESULTS: The results have shown that the Merkle tree-based approach can track changes to a table very quickly and efficiently, because it is conservative with CPU resources. Moreover, the self-designed algorithm NFHash also does well in terms of file classification when it is coupled with a k-NN classifier.

CONCLUSION: "Hash function" refers to a very diverse set of algorithms, not just algorithms that serve a limited purpose. Fuzzy fingerprinting schemes can still be considered to be in their infancy, but a lot has happened in the last ten years. This project has introduced two new ways to create and compare hashes of similar, yet not necessarily identical, files, and to detect whether (and to what extent) a file was changed. Note that the algorithms presented here should be considered prototypes, and they might still need some large-scale testing to sort out potential flaws.
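The METHODS paragraph names the two ingredients of NFHash: arithmetic modulo a Mersenne prime and a sliding window. The thesis's own implementation (NFHash.java, listed later in this front matter) is not reproduced here; what follows is only a minimal Java sketch of a sliding-window rolling hash built from those two ingredients. The window size, the radix, and the choice of p = 2^31 - 1 are illustrative assumptions, not the parameters NFHash actually uses.

// Minimal sliding-window rolling hash sketch (illustrative; NOT the actual NFHash).
// Hashes every WINDOW-byte substring in O(1) per step using arithmetic modulo
// the Mersenne prime p = 2^31 - 1; the window size and radix are assumed values.
public class RollingHashSketch {
    private static final long P = (1L << 31) - 1;  // Mersenne prime 2^31 - 1
    private static final long BASE = 257;          // assumed radix
    private static final int WINDOW = 7;           // assumed window size

    public static void main(String[] args) {
        byte[] data = "example input for a rolling hash".getBytes();
        if (data.length < WINDOW) return;

        // Precompute BASE^(WINDOW-1) mod P, needed to remove the outgoing byte.
        long pow = 1;
        for (int i = 0; i < WINDOW - 1; i++) pow = (pow * BASE) % P;

        // Hash of the first window.
        long h = 0;
        for (int i = 0; i < WINDOW; i++) h = (h * BASE + (data[i] & 0xFF)) % P;
        System.out.printf("window at %d: %d%n", 0, h);

        // Slide: drop the leftmost byte, append the next one.
        for (int i = WINDOW; i < data.length; i++) {
            h = (h - ((data[i - WINDOW] & 0xFF) * pow) % P + P) % P; // remove outgoing byte
            h = (h * BASE + (data[i] & 0xFF)) % P;                   // add incoming byte
            System.out.printf("window at %d: %d%n", i - WINDOW + 1, h);
        }
    }
}

In fuzzy schemes of the kind compared in this report (Spamsum, and NFHash itself), per-window digests like these typically decide where block boundaries fall, so that a local edit perturbs only the blocks around it rather than the entire fingerprint.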
KEYWORDS: Fingerprinting, k-NN, supervised learning, sliding window, rolling hash, classification, fuzzy fingerprinting, locality-sensitive hashing, one-way function, Merkle trees, collision resistance, hash functions, Galois fields, Mersenne primes, Rabin's fingerprinting scheme, CRC, Bloom filters, Jaccard, Damerau-Levenshtein

For my fiancée, Dongjing

List of Figures

1. Just like it is difficult and tedious to reassemble shredded paper into a complete document, it is computationally infeasible to try to decipher a hashed output
2. A hash function scrambles an input and then checks for a match among other scrambled outputs
3. How the Merkle-Damgård construction operates
4. Chosen-prefix attacks explained (this picture is public domain)
5. Two identical images with two completely different SHA-256 digests, due to a few bytes differing in the file header
6. A naive chunk-wise hashing approach might sometimes give good results on two equally long inputs
7. A naive chunk-wise hashing approach will not give good results if there is a length disparity (even by one character)
8. A rolling hash utilizes a sliding window to hash each byte individually; if one of these digests matches or exceeds the predefined boundary value, a block boundary is set
9. Nilsimsa makes use of accumulators instead of taking the rolling-hash approach
10. Example of a Merkle tree with a depth of 3
11. Example of a Merkle tree with an odd number of terminal nodes
12. Example of a Bloom filter with m entries, 2 digests per entry, 3 bits per digest, and n bits in the bitset
13. Bloom filters tell whether an element is probably in the dataset or data structure being referenced
14. Example of a two-class, two-dimensional approach with MLE
15. How a hypersphere in a two-dimensional k-NN approach with n = 9 and various kn values operates
16. Class ω1 is represented by the blue line, and class ω2 by the orange line; the purple line is where the probabilities of the two classes intersect
17. Example of a two-dimensional dataset with three classes using the k-NN approach
18. Example of a two-dimensional dataset with three classes where overfitting is avoided
19. Left: errors due to high variance; right: errors due to high bias (this picture is public domain)
20. NFHash explained in three steps
21. How a change in node C affects the traversal
22. Initial approach: comparing two Merkle trees with a different number of nodes (depths of 4 and 2, respectively)
23. Final approach when comparing two trees of different depth
24. An example of how it can be applied to a table with 5 columns and 16 rows
25. Inserting a row and a column into a table (blue color)
26. Worst-case insertion of a row or a column compared to best-case insertion, when represented by Merkle trees
27. UML class diagram of the entire project
28. How the different fuzzy hashing schemes handle 50, 100, and 200 consecutive 30 KB tables (times given in ms)
29. Time used by each algorithm to hash a single file of the respective sizes (times given in ms)
30. Output when comparing two different SAAB 900 models
31. Comparing two different-length tables
32. Two similar PNG images that have two different NFHash digests, because the latter is a color-inverted version of the former

List of Tables

1. Important acronyms
2. Arithmetic operations compared to their Boolean equivalents
3. Example of how a binary Galois field with a degree of 6 works
4. Addition and subtraction in GF(2) (top) can be considered identical to an XOR operation; multiplication (bottom) can be considered identical to an AND operation
5. Examples of irreducible polynomials of various degrees over GF(2)
6. Security requirements for a secure hashing scheme
7. Collision risks and fingerprint sizes
8. How a cyclic redundancy check operates (for illustrative purposes, the four padded bits are not included)
9. A simplified example showcasing the Jaccard coefficient between Java and C++
10. Examples of how to calculate the Levenshtein distance (the bottom table is the most efficient approach); here, s means substitution, i means insertion, and d means deletion
11. Matrix representation of the most optimal solution in Table 10
12. Levenshtein and how it stacks up against Damerau-Levenshtein
13. A two-class dataset
14. Damerau-Levenshtein distance between the test sample and the samples in the dataset
15. A confusion matrix explained
16. Base64 encoding table
17. NFHash compared to Nilsimsa and Spamsum
18. Different settings for each subtest in Test II

Listings

1. FNV-1a (pseudocode)
2. Damerau-Levenshtein distance
3. Ternary conditions no longer offer many advantages in terms of optimization
4. How an x86-64 CPU does division
6. How CRC-64 is implemented in this library (the lookup table can be found in the appendix); this is mostly based on the existing CRC-32 class in the java.util.zip package
7. How the table generator for Rabin's fingerprinting scheme works
8. k-NN discriminant function
9. Merkle tree constructor
10. Sorting one list according to another
11. How the class representing nodes appears
12. Comparing two trees
13. A small part of a CSV from the first class
14. A small part of a CSV from the second class
15. bloomFilter.java
16. CRC64.java
17. DamerauLevenshtein.java
18. kNN.java
19. NFHash.java
20. tableGenerator.java
21. Rabin.java
22. merkle.java
23. treeComparator.java

Acknowledgements

This project would not have been possible without the help and support of several people. I would like to thank the supervisors of this project, Derek Göbel (Avito LOOPS), Paolo Predonzani (Avito LOOPS), and Reggie Davidrajuh (UiS), for helping me with this thesis, for providing me with feedback and tips, for helping me better test and debug the library, and for offering good advice.
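Listings 9 and 12 above point to the Merkle tree constructor and the tree comparator; their full sources are in the appendix and are not reproduced in this front matter. As a rough sketch of the idea they implement (hash the rows, hash pairs of digests upward to a root, then localize a change by descending only into subtrees whose digests disagree), the following Java example assumes SHA-256 leaves, two tables with the same number of rows, and an odd node promoted by pairing it with itself; the thesis's final approach additionally handles trees of different depth (figure 23).

import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;

// Rough sketch of change localization with a Merkle tree (NOT the thesis's
// merkle.java/treeComparator.java): build one tree per table, then report the
// leaf indices where the two trees' digests disagree.
public class MerkleSketch {
    static byte[] sha256(byte[] in) throws Exception {
        return MessageDigest.getInstance("SHA-256").digest(in);
    }

    // Build all levels bottom-up; levels.get(0) holds the leaf digests.
    static List<byte[][]> build(String[] rows) throws Exception {
        List<byte[][]> levels = new ArrayList<>();
        byte[][] level = new byte[rows.length][];
        for (int i = 0; i < rows.length; i++) level[i] = sha256(rows[i].getBytes());
        levels.add(level);
        while (level.length > 1) {
            byte[][] next = new byte[(level.length + 1) / 2][];
            for (int i = 0; i < next.length; i++) {
                byte[] left = level[2 * i];
                // An odd terminal node is promoted by pairing it with itself.
                byte[] right = (2 * i + 1 < level.length) ? level[2 * i + 1] : left;
                byte[] cat = new byte[left.length + right.length];
                System.arraycopy(left, 0, cat, 0, left.length);
                System.arraycopy(right, 0, cat, left.length, right.length);
                next[i] = sha256(cat);
            }
            levels.add(next);
            level = next;
        }
        return levels;
    }

    // Descend from the roots, but only into subtrees whose digests differ.
    static void diff(List<byte[][]> a, List<byte[][]> b, int lvl, int idx, List<Integer> out) {
        if (java.util.Arrays.equals(a.get(lvl)[idx], b.get(lvl)[idx])) return;
        if (lvl == 0) { out.add(idx); return; }
        int kids = a.get(lvl - 1).length;
        if (2 * idx < kids) diff(a, b, lvl - 1, 2 * idx, out);
        if (2 * idx + 1 < kids) diff(a, b, lvl - 1, 2 * idx + 1, out);
    }

    public static void main(String[] args) throws Exception {
        String[] t1 = {"row0", "row1", "row2", "row3"};
        String[] t2 = {"row0", "row1", "row2 edited", "row3"};
        List<byte[][]> a = build(t1), b = build(t2);
        List<Integer> changed = new ArrayList<>();
        diff(a, b, a.size() - 1, 0, changed);
        System.out.println("changed rows: " + changed);  // prints [2]
    }
}

Because identical subtrees are pruned at the first matching digest, an edit to one row costs only one root-to-leaf path of comparisons, which is the property behind the abstract's claim that the Merkle tree-based approach is conservative with CPU resources.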
Recommended publications
  • Fundamental Data Structures

  • Remote Synchronization Using Differential Compression

  • Data Structures

  • String Hashing for Collection-Based Compression
