Ben-Gurion University of the Negev
The Faculty of Natural Sciences
Department of Computer Science

The Relation Between Jaccard Similarity and Edit Distance in LSH Based Malware Detection

Thesis submitted in partial fulfillment of the requirements for the Master of Sciences Degree

Mohammad Ghanayim

Under the supervision of:

Prof. Shlomi Dolev

December 2018

Ben-Gurion University of the Negev

The Faculty of Natural Sciences

Department of Computer Science

The Relation Between Jaccard Similarity and Edit Distance in LSH Based Malware Detection

Mohammad Ghanayim

Under the supervision of:

Prof. Shlomi Dolev

Mohammad Ghanayim: Date: 20.12.2018
Prof. Shlomi Dolev: Date: 20.12.2018

Committee Chairperson: Date: 23.12.2018

December 2018

Abstract

In this work, we employ textual methods that are usually used for finding similar items in large datasets, namely n-grams, MinHashing, and Locality Sensitive Hashing, for behavioral analysis and detection of malware. Following the misuse approach of intrusion detection, we train a classifier by using the above techniques to efficiently cluster a dataset of malicious Windows API call traces with respect to Jaccard similarity. The obtained clustering is used with great success in classifying new query traces as either malicious or benign.

The computation associated with extracting n-grams and calculating Jaccard similarity is much more efficient than the computation of edit distance (linear versus quadratic time complexity). Thus, we examine the possibility of utilizing the Jaccard similarity in an estimation of edit distance. We formulate inequalities defining the relationship between Jaccard similarity and edit distance, which impose upper and lower bounds on the edit distance values in terms of the Jaccard values. The scope of our analytical results is limited to representing strings (strings derived from sorted sets of n-grams) rather than the original (raw) textual data. Yet in practice, we obtained an indication of solid correspondence between the edit distance of original strings and the edit distance of their representing strings.

This thesis is based on a paper presented at IEEE NCA 2017, the International Symposium on Network Computing and Applications, held in Cambridge, MA, USA.

Acknowledgments

Gratitudes. I would like to express my sincere gratitude to my advisor Prof. Shlomi Dolev and to Prof. Sergey Frenkel (Russian Academy of Sciences) for providing me with guidance and assistance during my first research experience and for their patience and motivation. I would also like to thank our Taiwanese research partners, Prof. Yeali S. Sun and Prof. Shun-Wen Hsiao, from whom I learned a lot and with whom I enjoyed our joint work. I would like to acknowledge Amit Elran, Shaked Sagi, and Yoav Beeri, whom I instructed in their B.Sc. final project, for their contributions to the code used in this work. Likewise, I want to thank all my colleagues in Prof. Dolev's lab and its administrative coordinator, Ms. Timi Budai, for the enjoyable times I spent with them. Last but not least, I would like to thank my parents, Zaki and Maysoon, for their unfailing support and continuous encouragement.

Grant assistance. This research was partially supported by the Council of Higher Education Scholarship for Master Students, a grant from the Israeli Ministry of Science, Technology and Space and the National Science Council of Taiwan, and the Lynne and William Frankel Center for Computer Science.

Contents

1 Introduction

2 Preliminaries
  2.1 Malware Analysis And Detection
    2.1.1 Detection Approaches: Misuse vs. Anomaly
    2.1.2 Behavioral Analysis of Malware
  2.2 Machine Learning
    2.2.1 Supervised And Unsupervised Machine Learning
    2.2.2 Clustering
  2.3 Jaccard Similarity And Edit Distance
    2.3.1 Jaccard Similarity
    2.3.2 Edit Distance
  2.4 Shingling, n-grams And Representing Strings
    2.4.1 Shingling Strings And n-grams
    2.4.2 Representing Strings
  2.5 Finding Similar Items
    2.5.1 MinHashing
    2.5.2 Locality Sensitive Hashing (LSH)

3 Behavioral Analysis and Detection of Malware Using Locality Sensitive Hashing
  3.1 Traces of API Calls
  3.2 Misuse Detection: Clustering-Backed Classifier
  3.3 Experiments and Results
    3.3.1 Do malware and benign traces have low similarity, despite the disregard for semantics?
    3.3.2 Does the learning algorithm comply with the Empirical Risk Minimization principle?
    3.3.3 Classifier's Performance

4 Edit Distance Approximation in Terms of Jaccard Similarity
  4.1 Jaccard Similarity Versus Edit Distance
  4.2 Normalized Edit Distance
  4.3 The Middle Ground: Sets of n-grams & Representing Strings
  4.4 Bounds on Normalized Edit Distance
  4.5 Normalized Edit Distance Approximation

List of Figures

1 Illustration of intersection and symmetric difference of sets
2 Example API Call Trace
3 The classifier training procedure
4 Jaccard similarity rates to the nearest medoid, benign vs. malicious
5 Classifier's error rate versus similarity threshold
6 Classifier's ROC curve
7 Classifier's error rate versus n-gram size
8 Classifier's error versus training set size
9 The average n-gram frequency versus n-gram size
10 The average difference between the NED on original documents and NED on representing strings versus n-gram size
11 NED and its approximation, $\widetilde{NED}$, measured on a sample set

1 Introduction

Machine learning and data mining are becoming prevalent in cybersecurity applications, due to their increased effectiveness compared to conventional methods. Learning algorithms are particularly good at analyzing large datasets and identifying underlying trends and patterns, and therefore their ability to detect abnormalities and threats is far higher than that of manually defined conventional detectors. An essential practice of cybersecurity is the study and analysis of malware, which is employed in other cybersecurity practices such as the development of countermeasures and protection against malware. In principle, behavioral - or dynamic - analysis of malware is done by monitoring and recording the behavior of malware and threats during their operation in controlled environments. One of the strategies for capturing malware's behavior is to trace the malware program's calls to the operating system's APIs. API call behavioral analysis has been proposed in the literature: Santos et al. [1] and Islam et al. [2] proposed malware detectors based on API calls in combination with other static features. Ki et al. [3] analyzed API calls using DNA sequence alignment algorithms. Gupta et al. [4] clustered hash signatures of API calls to detect the type of malware.

In Section 3, we propose our method for behavioral analysis of malware based on their traces of API calls. We analyze the API calls of malicious programs by clustering the traces, with respect to their Jaccard similarity, using efficient techniques, namely MinHashing and Locality Sensitive Hashing.

The analysis of the API calls is done while disregarding the semantics of API names and arguments, making our solution platform-independent. On top of the latter clustering of malicious API traces, we develop a benign/malicious classifier, which is used for API call misuse detection. The efficiency of the clustering (the training), the monitoring, and the detection is critical; otherwise, the performance of the entire monitored system might be degraded, or alternatively, the ability to detect threats in real time and eliminate their damage may be lost. Similar approaches, i.e., applying clustering algorithms for anomaly detection or misuse detection, can be found, for example, in [5], [6] and [7].

When dealing with traces of API calls, in which the ordering carries important information about the malware behavior, it seems that edit distance is more adequate than Jaccard similarity for comparing and clustering these traces, since Jaccard similarity does not capture differences in ordering. However, the computation of edit distance is more complex than the computation of Jaccard. In Section 4, we show that there is a relationship, under certain constraints, between edit distance and Jaccard similarity, which can serve as a theoretical basis for an estimation of edit distance by means of Jaccard. Jaccard is a measure of similarity between two sets, while edit distance is a measure of dissimilarity between two strings, such as traces of API calls. Moreover, the values of Jaccard similarity lie in the unit interval, i.e., [0, 1], while edit distances are natural numbers. Thus, for unifying the ranges and domains of both metrics, we define a normalized form of edit distance and use sets of n-grams and their representing strings as the middle ground for the estimation.

2 Preliminaries

2.1 Malware Analysis And Detection

2.1.1 Detection Approaches: Misuse vs. Anomaly

Intrusion detection systems are based on three main approaches: misuse detection, anomaly detection, or hybrid combinations of both. The anomaly detection approach defines and models the normal accepted behavior and detects anomalies as deviations from that normal behavior. Conversely, the misuse detection approach defines and models the abnormal behavior and detects intrusions that resemble the abnormal model. In misuse detection, anything not known is considered as normal behavior. According to the misuse detection approach, it is assumed that it is easier to define abnormal behavior, and its model can be easily extended to newly discovered types of attacks. A disadvantage of misuse detection is the inability to discover unknown attacks, dubbed zero-day attacks, which are detectable by anomaly detection systems.

2.1.2 Behavioral Analysis of Malware

While static analysis examines the binary code of malware without running it, dynamic or behavioral analysis executes malware in sandboxed environments for the purpose of observing its behavior. It is more difficult to obscure the runtime behavior in order to evade detection systems than it is to obfuscate the binary code for the same purpose, which is the main advantage of behavioral analysis over the traditional methods of static analysis.

2.2 Machine Learning

Arthur Samuel, who coined the term Machine Learning, defined it as "a field of study that gives computers the ability to learn without being explicitly programmed". Machine learning algorithms analyze and model sample datasets in order to identify underlying patterns and form a generalization that enables classification and prediction of new unseen data.

2.2.1 Supervised And Unsupervised Machine Learning

Supervised learning algorithms train on labeled datasets, in which each example is a pair of an input object and its desired output label. An optimal outcome of such an algorithm is a hypothesis, obtained by learning and generalizing from the training data, that is able to map new unseen inputs to their correct labels. Unsupervised learning algorithms train on unlabeled datasets, and the task is to infer from the data and find hidden structures and patterns in the data. It is useful for cases where labeled data is not readily available and the task of labeling is hard.

2.2.2 Clustering

Clustering is an approach of unsupervised learning, aimed at finding patterns in high-dimensional unlabeled data. Clustering of a dataset is the task of partitioning it into different clusters, such that objects in the same cluster are more similar (with respect to a given metric) to each other than to objects in other clusters.

A medoid of a set of items, or a cluster, is defined as the item which has the minimal average distance to all other items. Formally, let $x_1, x_2, \ldots, x_k$ be a set of $k$ items with distance function $d$; then the medoid of the set is:

$$x_{\mathrm{medoid}} = \underset{y \in \{x_1, x_2, \ldots, x_k\}}{\arg\min} \sum_{i=1}^{k} d(y, x_i)$$

Medoids are used on datasets for which a mean or centroid cannot be defined, as the medoid is always an item within the group, as opposed to a centroid.

A centroid of a cluster is the mean of all its items, $x_{\mathrm{centroid}} = \frac{x_1 + x_2 + \cdots + x_k}{k}$, and thus it is not restricted to be among the cluster's items.
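As a concrete illustration (a minimal sketch, not the code used in this thesis), a medoid can be computed directly from its definition, given any distance function:

def medoid(items, dist):
    """Return the item with the minimal total (and hence average) distance to all items."""
    return min(items, key=lambda y: sum(dist(y, x) for x in items))

# Toy usage with one-dimensional points and absolute difference as the distance:
points = [1, 2, 3, 10]
print(medoid(points, lambda a, b: abs(a - b)))  # -> 2 (total distance 10, tied with 3)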

2.3 Jaccard Similarity And Edit Distance

2.3.1 Jaccard Similarity

Jaccard similarity, or the Jaccard index, is a measure of similarity between finite sets. It is defined as the size of the intersection divided by the size of the union of the sets:

$$J_S(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$$

Jaccard distance, which is a measure of dissimilarity between finite sets, is the complement of Jaccard similarity, $J_D(X, Y) = 1 - J_S(X, Y)$. It is defined as the size of the symmetric difference of the two sets (denoted by $X \triangle Y$), which is the set of elements in either of the sets but not in both (Figure 1):

$$J_D(X, Y) = \frac{|X \triangle Y|}{|X \cup Y|}$$

Example 1 Let $A = \{a, b, c, d\}$ and $B = \{c, d, e, f\}$ be two sets. Then
$|A \cup B| = |\{a, b, c, d, e, f\}| = 6$,
$|A \cap B| = |\{c, d\}| = 2$,
$|A \triangle B| = |\{a, b, e, f\}| = 4$.
And therefore, $J_S(A, B) = \frac{2}{6} = \frac{1}{3}$ and $J_D(A, B) = \frac{4}{6} = \frac{2}{3}$.

Figure 1: The intersection (left) and the symmetric difference (right) of sets.

Jaccard similarity or distance between two sets can be computed in time linear in the sets' sizes, using suitable implementations of the set data structures. MinHash and Locality Sensitive Hashing may be combined in order to achieve accurate and efficient estimations of Jaccard similarities between pairs of sets in large datasets.
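For illustration, a direct sketch over Python sets (not the thesis implementation) reproduces the values of Example 1:

def jaccard_similarity(x, y):
    """|X ∩ Y| / |X ∪ Y| for finite sets X, Y."""
    return len(x & y) / len(x | y)

def jaccard_distance(x, y):
    """|X △ Y| / |X ∪ Y|, the complement of Jaccard similarity."""
    return len(x ^ y) / len(x | y)

A, B = {"a", "b", "c", "d"}, {"c", "d", "e", "f"}
print(jaccard_similarity(A, B))  # 0.333... (= 1/3)
print(jaccard_distance(A, B))    # 0.666... (= 2/3)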

2.3.2 Edit Distance

Edit distance, or Levenshtein distance, is a metric that measures the differences between strings. The edit distance between any two strings x and y is defined as the minimal number of edit operations (delete, insert, and substitute) on (single) symbols applied to one string x in order to obtain the other string y.

Edit distance is computed by dynamic programming in quadratic time. It has the advantage of being able to identify differences in the ordering inside the strings, which cannot be achieved using Jaccard on shingled strings.
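A textbook dynamic-programming sketch of this computation is shown below, only to make the quadratic time and space behavior concrete; it is not the thesis implementation:

def edit_distance(x, y):
    """Minimal number of insertions, deletions and substitutions turning x into y."""
    m, n = len(x), len(y)
    # dp[i][j] = edit distance between x[:i] and y[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all of x[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution (or match)
    return dp[m][n]

print(edit_distance("kitten", "sitting"))  # 3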

2.4 Shingling, n-grams And Representing Strings

2.4.1 Shingling Strings And n-grams

Given an integer n and a string s, the n-grams set of s is the set of all contiguous substrings of length n in s. The n-grams are collected by a process of shingling - passing a sliding window of length n over the string.

2.4.2 Representing Strings

Representing strings of n-gram sets [8] are obtained by sorting the n-grams in the set according to, say, lexicographic ordering, and concatenating them, as described in Section 3.9.2 of [8]. The outcome is strings of n-grams (strings over the alphabet of the n-grams) that are sorted and have no repetitions.

Example 2 Let s denote the string "Hello-World!"; then the 2-grams set of s is:
{'el', 'o-', 'Wo', 'lo', 'll', 'ld', 'd!', '-W', 'rl', 'or', 'He'}
and its representing string is:
"-W·He·Wo·d!·el·ld·ll·lo·o-·or·rl"
(the dots mark the boundaries between the concatenated 2-grams).
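The shingling and the representing-string construction can be sketched as follows (illustrative code, not the thesis implementation); on the string of Example 2 it yields the set and ordering shown above:

def ngrams(s, n):
    """The set of all contiguous substrings of length n in s (shingling)."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def representing_string(grams):
    """Sort the n-grams lexicographically and concatenate them."""
    return "".join(sorted(grams))

s = "Hello-World!"
grams = ngrams(s, 2)
print(sorted(grams))
# ['-W', 'He', 'Wo', 'd!', 'el', 'ld', 'll', 'lo', 'o-', 'or', 'rl']
# representing_string(grams) concatenates these in the same order,
# i.e. the string -W·He·Wo·d!·el·ld·ll·lo·o-·or·rl of Example 2.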

2.5 Finding Similar Items

Finding similar items in datasets is a major data mining problem. Next, we list efficient techniques used for finding similar textual documents (strings, sequences, or traces) in datasets.

2.5.1 MinHashing

MinHashing [8, 9] is a compression method for sets of items which preserves the Jaccard similarity of the sets after the compression. MinHashing makes it possible to replace long text documents by much shorter, unified-length MinHash signatures; it also allows faster estimation of the Jaccard similarity between two sets without explicit computation of their union and intersection.

First, the text documents are shingled into n-grams and stored as binary bags of words, or bit vectors, which have one coordinate for every possible string of length n over the alphabet $\Sigma$, i.e., the length of these vectors is $|\Sigma|^n$. A coordinate gets value 1 if and only if its corresponding string appears in the text document at least once. For added efficiency, each n-gram is hashed into a shorter integer, say 32-bit, that replaces the n-gram and makes it easier to manipulate. Additionally, an n-gram that hashes to integer x is placed in the x-th coordinate of the bit vector.

Second, the bit vectors are compressed into much shorter signatures using MinHashing. MinHash signatures are integer vectors that have the following property: the Jaccard similarity of the bit vectors is the same as the expected similarity of their signatures.

The length of a MinHash signature is determined by the number of MinHash functions used in the process.

A MinHash function is defined by a random permutation of the bit vectors' coordinates (the same permutation on all vectors), and the MinHash value of a MinHash function is defined as the index of the first bit, in the permuted order, to have value 1. Formally, let S be a set of n-grams represented by a bit vector, and let $\pi_1, \pi_2, \pi_3, \ldots, \pi_k$ be random permutations of the bit vector's coordinates and $h_{\pi_1}, h_{\pi_2}, h_{\pi_3}, \ldots, h_{\pi_k}$ their matching MinHash functions. Then, the MinHash signature of the set S is:

$$[\, h_{\pi_1}(S),\; h_{\pi_2}(S),\; h_{\pi_3}(S),\; \ldots,\; h_{\pi_k}(S) \,]$$

Implementation Trick. Permuting such bit vectors is prohibitive, due to their size and the fact that most of their bits are zeros, corresponding to n-grams that do not appear in the original document. Yet, with a simple and efficient workaround [10] we can avoid any calculations on the zero bits. The key idea is to use hash functions to simulate the permutations and apply them only to existing n-grams, i.e., only to the bits with ones. In particular, we need a hash function h(x) that takes a 32-bit integer in the range $[0, 2^{32} - 1]$ and maps it to another integer, ideally without collisions. Put another way, let x be an n-gram initially positioned in the x-th bit of the vector; it is going to be mapped to position h(x) in the permuted order.

Hash functions of the form $h(x) = (ax + b) \bmod p$ will do the job, where x is an n-gram (integer), the coefficients a and b are randomly chosen integers less than the maximum value of x, and p is a constant prime number slightly larger than the maximum value of x. Different values of a and b imply different permutations. Thus, the MinHash value can be obtained by iterating with h(x) over the set of n-grams present in the text document and selecting the minimal resulting hash value.

With that said, let D be a text document and let S be its set of n-grams stored as a bit vector (a bit for every possible n-gram). Let $\hat{S}$ denote the set of n-grams actually appearing in D ($\hat{S}$'s items are exactly the bits with ones in the bit vector S; in other words, $\hat{S}$ is much smaller than S). Let the following functions

$$h_1(x) = (a_1 x + b_1) \bmod p$$
$$h_2(x) = (a_2 x + b_2) \bmod p$$
$$h_3(x) = (a_3 x + b_3) \bmod p$$
$$\vdots$$
$$h_k(x) = (a_k x + b_k) \bmod p$$

be hash functions simulating the permutations $\pi_1, \pi_2, \pi_3, \ldots, \pi_k$, respectively. Then, the MinHash signature of length k of the n-gram set S is:

$$\Bigl[\, \min_{x \in \hat{S}} h_1(x),\; \min_{x \in \hat{S}} h_2(x),\; \min_{x \in \hat{S}} h_3(x),\; \ldots,\; \min_{x \in \hat{S}} h_k(x) \,\Bigr]$$
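The following compact sketch illustrates the signature computation just described, using hash functions of the form h(x) = (ax + b) mod p to simulate permutations; the constants and helper names are illustrative assumptions and are not taken from the thesis code:

import random

P = 4294967311  # a prime slightly larger than 2**32 - 1

def make_hash_funcs(k, seed=0):
    """k random (a, b) pairs defining h(x) = (a*x + b) mod P, simulating k permutations."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(k)]

def minhash_signature(gram_ids, hash_funcs):
    """MinHash signature of a set of n-grams already hashed to (32-bit) integers."""
    return [min((a * x + b) % P for x in gram_ids) for (a, b) in hash_funcs]

# Toy usage: n-grams are first mapped to integers (e.g. via a 32-bit hash).
doc_a = {17, 42, 99, 1234}
doc_b = {17, 42, 99, 5678}
funcs = make_hash_funcs(k=20)
sig_a = minhash_signature(doc_a, funcs)
sig_b = minhash_signature(doc_b, funcs)
# The fraction of agreeing signature positions estimates the Jaccard similarity.
estimate = sum(u == v for u, v in zip(sig_a, sig_b)) / len(funcs)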

2.5.2 Locality Sensitive Hashing (LSH)

A trivial method to find groups of similar items in a dataset is to calculate the similarity between each and every pair of items, find all pairs whose similarity is above some threshold, $\mathrm{sim}(x_i, x_j) \geq t$, and group them into one cluster. This method is time consuming, since in a dataset of size N there are $\binom{N}{2} = O(N^2)$ pairs of items.

Alternatively, similar traces can be grouped together using LSH in linear time, with only a small increase in false negative results. Locality sensitive hashing, see e.g. [8, 11], hashes items into buckets several times, such that:

• Similar items are hashed into the same bucket with high probability.

• Items that are not similar enough are hashed into a common bucket with low probability.

Items that are mapped to the same bucket are considered candidates for being similar. In case more accuracy is desired, the candidate items may be further investigated by explicitly computing the similarity measure among pairs of items in a bucket. Note that buckets usually consist of significantly fewer items compared to the total number of items; typically, the similarity measure is explicitly computed only for a limited number of item pairs within the same bucket.

LSH and MinHash may be combined to form a solution to the problem of finding similar items in a dataset of textual documents. For improved efficiency, the text items are MinHashed into signatures, then LSH is performed on these signatures (integer vectors) using the banding technique [8]. According to this technique, a signature is cut into b consecutive portions, or sub-vectors, which are hashed separately into the same hash table, giving each pair of items b opportunities to be declared similar. The fewer bands we use, the fewer candidates, of higher similarity, we will get, and vice versa. The outcome of this process is a hash table with similar items (with respect to Jaccard) in its buckets; from another perspective, it is a clustering of the dataset.
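A sketch of the banding step (illustrative names and structure, not the thesis code): each signature is split into b bands, each band is hashed separately, and items sharing a bucket in any band become candidate similar pairs.

from collections import defaultdict

def lsh_buckets(signatures, b):
    """Group items whose MinHash signatures collide in at least one of b bands.

    signatures: dict mapping item id -> signature (list of ints, length divisible by b).
    Returns a dict mapping (band index, band content) -> list of item ids.
    """
    buckets = defaultdict(list)
    for item_id, sig in signatures.items():
        r = len(sig) // b                          # rows per band
        for band in range(b):
            key = (band, tuple(sig[band * r:(band + 1) * r]))
            buckets[key].append(item_id)
    return buckets

# Items that share any bucket are candidate similar pairs; their exact Jaccard
# similarity can then be computed only within these (much smaller) groups.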

3 Behavioral Analysis and Detection of Malware Using Locality Sensitive Hashing

In this section, we describe how machine learning and data mining techniques for finding similar textual items, namely clustering, MinHashing, and Locality Sensitive Hashing, are employed for behavioral analysis and detection of malware. In the training phase, traces of programs' OS API calls pass through MinHashing and Locality Sensitive Hashing (LSH) procedures to efficiently obtain a clustering of the traces; then, during the prediction phase, a query trace is compared to the clusters and labeled accordingly.

3.1 Traces of API Calls

The traces in this work were obtained, by Prof. Yeali S. Sun's lab, from the hypervisor-based Runtime Execution Introspection and Profiling (REIP) system, which uses Virtual Machine Introspection (VMI) techniques to profile hooked Windows API calls. The hooking was limited, as prior knowledge, to a set of 55 critical Windows APIs listed in Table 1. A collection of known malware and benign programs was executed in a VM environment and monitored, producing a labeled dataset of malicious API call traces and a few benign traces for testing.

An example raw trace can be found in Figure 2a. The meaningful parts of the trace, such as the API names and some of the argument values (the argument names are left out), were extracted from each such trace, concatenated and delimited by '@', resulting in a long string as in Figure 2b.

1376 malware.exe
#308810000 RegQueryValue hKey=HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\LDAP\Ldap... Return=SUCCESS type=REG_DWORD data=1
#308120000 LoadLibrary lpFileName=adsldpc.dll Return=SUCCESS
#313460000 LoadLibrary lpFileName=adsldpc Return=SUCCESS
....

(a) The beginning of a malware trace, which depicts the first 3 hooked Windows API calls (RegQueryValue, LoadLibrary, LoadLibrary) of the program malware.exe with process ID 1376.

RegQueryValue@HKEY_LOCAL_MACHINE\System\ CurrentControlSet\Services\LDAP\Ldap@SUCCESS# [email protected]@SUCCESS# LoadLibrary@adsldpc@SUCCESS....

(b) Processed Trace Example

Figure 2: Example API Call Trace

Table 1: The 55 hooked Windows APIs.

During the clustering and the detection of the API call traces, the semantics of the API names and arguments were ignored and the traces were regarded as plain text, simplifying the solution and making it platform-independent.

3.2 Misuse Detection: Clustering-Backed Classifier

Adopting the misuse detection approach, and equipped with a decent similarity measure for textual data, namely Jaccard, we clustered a dataset comprised only of malicious traces as the training phase of the classifier. The procedures of shingling, MinHashing, and Locality Sensitive Hashing are applied successively in order to efficiently achieve a high-quality clustering of the malicious dataset, avoiding pairwise comparisons between the traces in the dataset. Each trace is shingled into n-grams and hashed by MinHash, producing compressed signatures, which are in turn hashed again using LSH with the banding technique. The outcome of the above process is a hash table, where each of its buckets contains a subset of similar traces grouped together.

Figure 3: The classifier training procedure.

A label of a trace solely specifies whether the trace belongs to a malicious or a benign program. Such binary labels do not convey any information about particular properties of the programs, nor the type of the malware or its nature. Clustering the malicious traces enables a better dissection of the dataset beyond its collective malicious nature, by implicitly capturing the underlying properties of the different classes of malware and categorizing them into clusters of somehow related malicious traces. In other words, the clustering learns from the malicious dataset without any explicit description of each malware or its class. The medoid of each cluster, i.e. of each hash table bucket, is selected and used as its representative.

Under the assumption of low similarity between malware and benign traces, the binary (malicious/benign) classification of new query traces is done by checking whether the query trace fits one of the malicious clusters; if so, it is classified as malicious. Otherwise, if the new trace lies outside the malicious clusters, it is classified as benign. Technically, this is achieved by comparing each query trace q against all the medoids $m_1, m_2, \ldots, m_k$; if its maximal Jaccard similarity to one of the medoids exceeds a predefined threshold t, that is, if $\max_{i=1 \ldots k} J_S(q, m_i) > t$, then it is classified as malicious; otherwise it is classified as benign.
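As a minimal sketch (illustrative only, not the thesis implementation), the prediction rule can be written directly over n-gram sets:

def classify(query_ngrams, medoid_ngram_sets, t):
    """Label a query trace malicious iff its maximal Jaccard similarity
    to any (malicious) cluster medoid exceeds the threshold t."""
    best = max(len(query_ngrams & m) / len(query_ngrams | m)
               for m in medoid_ngram_sets)
    return "malicious" if best > t else "benign"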

The computation time of the training phase - shingling, MinHash, and LSH - is linear in the size of the training dataset. The computation required for finding a cluster's medoid is quadratic in the cluster size. The prediction of a query trace involves calculating the Jaccard similarity between the query trace and each of the medoids (as sets of n-grams); the number of medoids is bounded by the size of the hash table.

3.3 Experiments and Results

3.3.1 Do malware and benign traces have low similarity, despite the disregard for semantics?

In order to assess our assumption of low similarity rates between malicious and benign traces, we conducted an experiment aimed at testing whether there are tangible differences - in terms of Jaccard similarity - between malicious and benign traces, even when the API calls are treated as plain text, without taking into account any semantics of API names, arguments, or return values. We clustered 1652 malicious traces using MinHash with 20 hash functions (signature length 20) and LSH with 4 bands and a 517-bucket hash table. The traces were mapped into 501 buckets, with the medoid of each bucket as its representative. Two test datasets, of 48 traces each, were used: the first consists of only malicious traces and the second consists of only benign traces. We compared each test trace against all the medoids, and recorded its Jaccard similarity to the most similar medoid (the one with the maximal Jaccard similarity value). The measurement values are depicted in Figure 4.

Figure 4: Jaccard similarity rates to the nearest medoid. A point on the innermost circle has Jaccard similarity of 0, while a point on the outermost circle has Jaccard similarity of 1.

The distribution of the measurement values demonstrates the significant distinction between the high similarity rates of the malicious traces to the medoids (all the medoids are malicious) on one hand, and the lower similarity rates of the benign traces to the medoids on the other hand. This result substantiates our assumption of low similarity between malware and benign traces, even without any considerations of semantics. Hence, after clustering only malicious traces, benign traces are expected to be outliers of the clusters.

17 3.3.2 Does the learning algorithm comply with the Empirical Risk Minimization principle?

A model's empirical risk is defined as its error rate on the same dataset used in its training phase. Accordingly, an Empirical Risk Minimization (ERM) algorithm is one that finds a model with the minimal empirical risk. To determine whether our learning algorithm minimizes the empirical risk, we conducted an experiment that measures the empirical risk of the classifier by training it on a set of malicious traces and then testing the classifier on the same set.

In the context of LSH clustering and medoids, the minimal empirical risk is 0%, which is achievable by assigning each trace of the training set its own "unit cluster", so that each trace is its own medoid. This way, when testing the classifier on the same training set (which consists of malicious traces only), it is guaranteed that any query trace will have an identical medoid trace, i.e. with a Jaccard similarity rate of 1, which exceeds any possible threshold. Thus, all the traces in the training set will be labeled as malicious, yielding a 0% error rate. Alternatively, 0% empirical risk is achievable by setting the similarity threshold to 0.

In our experiment, we trained and then tested the classifier on 1700 malicious traces using a shingle size of 13 characters, MinHash with 20 hash functions, a similarity threshold of 0.7, LSH with 4 bands, and various sizes of hash table. When using hash tables of 500 to 1000 buckets, the classifier erred on 2% to 7% of the training set. Even when using hash tables as large as 50,000 buckets in order to accommodate a larger number of sparse clusters (up to 1500), we were not able to get error rates below 0.1%. Hence, the learning algorithm used in this work is not an ERM algorithm.

3.3.3 Classifier’s Performance

In order to tune the classifier and assess its performance, we conducted a few experiments measuring the detection error rate and how it is affected by different values of the learning algorithm's parameters. A detection error is either a false positive - a benign trace that was classified as malicious - or a false negative - a malicious trace that was classified as benign.

• Similarity Threshold

Figure 5 shows the classifier's overall error rates for threshold values ranging from 0 to 1, next to the error rates on the benign set and the malicious set separately (in this experiment we used an n-gram size of 13 characters, MinHash with 20 hash functions, LSH with 4 bands, and a 517-bucket hash table). The lowest error rates are achieved using similarity thresholds of 0.75-0.80, while higher threshold values yield overfitted models and a consequent degradation in performance. The receiver operating characteristic curve of the classifier is shown in Figure 6.

Another noteworthy observation, seen at the extremities of the threshold axis, is the classifier's contrasting behavior on malicious traces versus benign traces. In the classification process, query traces are labeled according to their similarity to a clustering of malicious traces; technically, a similarity threshold is used to binarize the labels.

Figure 5: Classifier's error rate on a test set of 48 malicious traces and 48 benign traces versus similarity threshold value.

Inherently, for thresholds closer to 0, the classifier is prone to label any query trace as malicious, and vice versa, for thresholds closer to 1, it is prone to label any query trace as benign. Thus, at the left end of the threshold axis in Figure 5, all the test traces were labeled as malicious, leading to a 100% error rate on the benign set and a 0% error rate on the malicious set. Conversely, at the right end of the threshold axis, all the test traces were labeled as benign, leading to a 0% error rate on the benign set and a 100% error rate on the malicious set.

• N-gram Size

Figure 7 shows the detection error rate of the classifier, trained on around 1600 malicious traces, for various n-gram sizes, using MinHash with 30 functions, LSH with 4 bands, a 1019-bucket hash table, and a similarity threshold of 0.75. The best performance is achieved with n-grams of 16 characters.

Figure 6: Classifier's ROC curve. The area under the curve is 0.945.

Larger n-grams lead to overfitting of the model, expressed by an increase in the detection error rate.

• Training dataset size

We trained the classifier several times while varying the size of the malicious training set, from 100 to 1652 traces, using an n-gram size of 16, MinHash with 30 functions, LSH with 4 bands, a 1019-bucket hash table, and a similarity threshold of 0.75. Figure 8 shows how the detection error rate begins at around 20% and decreases with the training dataset size, eventually reaching 8.3%.

Figure 7: Classifier's error rate on a test set of 48 malicious traces and 48 benign traces versus n-gram size.

Figure 8: Classifier’s error rate on a test set of 48 malicious traces and 48 benign traces versus malicious training set size.

4 Edit Distance Approximation in Terms of Jaccard Similarity

4.1 Jaccard Similarity Versus Edit Distance

While Jaccard similarity has the advantage of computational efficiency over edit distance, it lacks sensitivity to the ordering inside the compared strings (as a by-product of comparing sets of n-grams, obtained by the shingling process applied beforehand). Edit distance is sensitive to the ordering of characters inside the strings, and therefore it better fits scenarios where the ordering encapsulates important information, for instance in traces of API calls. Two strings with low edit distance share many alike substrings in similar locations in the respective strings. It is fair to say that edit distance signifies the structural differences between two strings, while Jaccard reflects the content differences only.

However, the computation of edit distance is time-consuming (quadratic time by classical dynamic programming, or $O(n^2 \log\log n / \log^2 n)$ by [12]), and accordingly, edit distance approximation has been studied extensively [13, 14, 15]. In this section, we describe our take on the question of edit distance approximation, proposing a method for estimating edit distance by means of other easily calculated measures of similarity, e.g. Jaccard, for which there exist efficient hashing-based linear-time algorithms for large datasets [8].

23 4.2 Normalized Edit Distance

A necessary step on the path to finding the relation between Jaccard and edit distance, or ED for short, is to unify their ranges to the unit interval, i.e. to [0, 1]. To this end, we use a common normalized form of edit distance, obtained by normalizing the edit distance of two strings by the longer string's length, called Normalized Edit Distance and denoted by NED:

$$NED(x, y) = \frac{ED(x, y)}{\max\{|x|, |y|\}}$$

Correspondingly, let NED’s complement be the “Normalized Edit Similarity” denoted by SimNED:

$$SimNED(x, y) = 1 - \frac{ED(x, y)}{\max\{|x|, |y|\}}$$
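For concreteness, a direct (unoptimized) sketch of both measures, assuming an edit_distance routine such as the dynamic-programming one sketched in Section 2.3.2:

def ned(x, y):
    """Normalized edit distance ED(x, y) / max(|x|, |y|), a value in [0, 1]."""
    return edit_distance(x, y) / max(len(x), len(y))

def sim_ned(x, y):
    """Normalized edit similarity, the complement of NED."""
    return 1.0 - ned(x, y)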

The following properties of ED may assist us in utilizing Jaccard for estimating SimNED:

• In accordance with its definition, the edit distance of two strings x and y is non-negative and bounded by the length of the longer string:

$$0 \leq ED(x, y) \leq \max\{|x|, |y|\}$$

which implies that:

$$0 \leq \frac{ED(x, y)}{\max\{|x|, |y|\}} \leq 1$$

Therefore the range of both normalized forms of edit distance and similarity, NED and SimNED, is the unit interval, as desired:

$$NED(x, y),\; SimNED(x, y) \in [0, 1]$$

• SimNED is bounded:

$$SimNED(x, y) \leq 1 - \frac{\bigl||x| - |y|\bigr|}{\max\{|x|, |y|\}}$$

as shown by the next lemma.

Lemma 1

$$SimNED(x, y) \leq 1 - \frac{\bigl||x| - |y|\bigr|}{\max\{|x|, |y|\}} \qquad (1)$$

Proof 1 By the definition of edit distance it holds that:

$$ED(x, y) \geq \bigl||x| - |y|\bigr|$$

since inserting $\bigl||x| - |y|\bigr|$ symbols (the difference in the strings' lengths) is the minimum number of edit operations required to bring the shorter string to the longer one; or, symmetrically, deleting $\bigl||x| - |y|\bigr|$ symbols from the longer string brings it to the shorter one. Hence,

$$\frac{ED(x, y)}{\max\{|x|, |y|\}} \geq \frac{\bigl||x| - |y|\bigr|}{\max\{|x|, |y|\}}$$

and therefore:

$$1 - \frac{ED(x, y)}{\max\{|x|, |y|\}} \leq 1 - \frac{\bigl||x| - |y|\bigr|}{\max\{|x|, |y|\}}$$

Since $SimNED(x, y) = 1 - NED(x, y) = 1 - \frac{ED(x, y)}{\max\{|x|, |y|\}}$, we obtain

$$SimNED(x, y) \leq 1 - \frac{\bigl||x| - |y|\bigr|}{\max\{|x|, |y|\}}$$

4.3 The Middle Ground: Sets of n-grams & Representing Strings

An obvious hurdle in utilizing Jaccard as the basis for the estimation of edit distance is that these metrics are defined on different domains. Jaccard is defined over unordered sets, in which each distinct element appears only once, although it may occur many times in different parts of the document. Edit distance, in contrast, is defined over strings and depends on the order of the symbols in the strings. In order to overcome the difficulties associated with this discrepancy, we confine the argument to the scope of certain types of sets and strings, both derived from the original documents: sets of n-grams and their representing strings.

In this work's application, and in similar applications in general, the documents in question pass through a shingling process (collecting all the substrings of a certain length - n-grams - appearing in the document), which is a prevalent first stage in most modern methods of textual similarity estimation [8]. The outcome of the shingling process is sets of n-grams, i.e. without repetitions of n-grams, one set per document, or per API trace in particular. These sets are used for computing Jaccard similarities.

The representing strings of the aforementioned sets are obtained by sorting and concatenating the elements of the sets, as described in Section 3.9.2 of [8]. As a result, we get strings of n-grams (strings over the alphabet of the n-grams) that are sorted and have no repetitions. These representing strings are used for the estimation of edit distance. We denote sets of n-grams with uppercase letters, e.g. X, and their representing strings with the same letter in lowercase, e.g. x.

Representing strings do differ from the original documents. However, roughly speaking, the lower the frequencies of n-grams in the original document (the number of occurrences of the same n-gram in the document), the higher the correspondence between the document and its representing string. The latter claim is supported by the measurements depicted in Figures 9 and 10.

Figure 9: The average n-gram frequency (average number of repeated occurrences of the same n-gram in each document) versus n-gram size, as measured in a sample of 25 API call traces with average length of 1458.

Figure 10: The average difference (delta) between the NED on pairs of original documents and the NED on their representing strings, versus n-gram size, as measured in a sample of 25 API call traces.

The experiment result summarized in Figure 10 indicates that it is possible to choose n-gram lengths that yield less than 0.1 average difference between the NED of the original documents and the NED of their representing strings. This justifies our choice to concentrate on analyzing the representing strings, as we do in the sequel. We note that in general, one may sample a given dataset and tune the length of the n-grams for that specific dataset, taking into account the correspondence between original documents and their representing strings, prior to proceeding with the clustering of the representing strings.

In the next lemma, we show a direct relation connecting the Jaccard distance between sets of n-grams on one hand, and the longest common subsequence and the insertion/deletion edit distance (without substitution operations) of the corresponding representing strings on the other hand.

Lemma 2

$$J_D(X, Y) = \frac{E(x, y)}{E(x, y) + C(x, y)} \qquad (2)$$

where $J_D(X, Y)$ denotes the Jaccard distance, $E(x, y)$ denotes the insertion/deletion edit distance, and $C(x, y)$ denotes the length of the longest common subsequence (LCS) of x and y [8].

Proof 2 The symmetric difference of two sets X and Y consists of the elements in Y but not in X and vice versa, i.e. $X \triangle Y = (X \setminus Y) \cup (Y \setminus X)$. The fact that x and y are the representing strings of the n-gram sets X and Y, respectively, implies that $E(x, y) = |X \triangle Y|$: the edit operations (insertion or deletion) required for transforming string x into y are deleting the elements that are in x but not in y, and inserting the elements that are in y but not in x, which is exactly the size of the symmetric difference. Moreover, since the elements in the representing strings are sorted and have no repetitions, the length of the strings' LCS is exactly the size of their corresponding sets' intersection, i.e. $C(x, y) = |X \cap Y|$. Therefore,

$$\frac{E(x, y)}{E(x, y) + C(x, y)} = \frac{|X \triangle Y|}{|X \triangle Y| + |X \cap Y|} = \frac{|X \triangle Y|}{|X \cup Y|} = J_D(X, Y)$$

4.4 Bounds on Normalized Edit Distance

In the next theorem, we show upper and lower bounds on the normalized edit distance of any representing strings in terms of the Jaccard distance of their corresponding n-gram sets.

Theorem 1 The following inequality holds for the Jaccard distance and NED, for any n-gram sets X and Y and their representing strings x and y:

$$1 - \alpha \;\leq\; NED(x, y) \;\leq\; (1 + \alpha)\,\frac{J_D(X, Y)}{2 - J_D(X, Y)} \qquad (3)$$

where $\alpha = \frac{\min\{|x|, |y|\}}{\max\{|x|, |y|\}}$.

Proof 3 It holds that,

$$E(x, y) = |X \triangle Y| = |x| + |y| - 2\,C(x, y) \qquad (4)$$

and by substituting $C(x, y) = \frac{|x| + |y| - E(x, y)}{2}$ in Lemma 2, we obtain:

$$J_D(X, Y) = \frac{2\,E(x, y)}{E(x, y) + |x| + |y|} \qquad (5)$$

Note that the expression on the right side of equality (5) is increasing in $E(x, y)$, and it holds that $ED(x, y) \leq E(x, y)$, as one substitution operation is equivalent to a sequence of two operations: insert and then delete, or vice versa. Hence,

$$J_D(X, Y) \geq \frac{2\,ED(x, y)}{ED(x, y) + |x| + |y|} \qquad (6)$$

Isolating $ED(x, y)$:

$$ED(x, y) \leq J_D(X, Y) \cdot \frac{|x| + |y|}{2 - J_D(X, Y)} \qquad (7)$$

Converting to the normalized edit distance, we obtain:

$$NED(x, y) = \frac{ED(x, y)}{\max\{|x|, |y|\}} \;\leq\; J_D(X, Y) \cdot \frac{|x| + |y|}{(2 - J_D(X, Y))\,\max\{|x|, |y|\}}$$

Taking into account that $|x| + |y| = \max\{|x|, |y|\} + \min\{|x|, |y|\}$, we obtain:

$$NED(x, y) \;\leq\; (1 + \alpha)\,\frac{J_D(X, Y)}{2 - J_D(X, Y)}, \qquad \text{where } \alpha = \frac{\min\{|x|, |y|\}}{\max\{|x|, |y|\}}$$

And by Lemma 1,

$$\frac{\bigl||x| - |y|\bigr|}{\max\{|x|, |y|\}} \;\leq\; NED(x, y) \;\leq\; (1 + \alpha)\,\frac{J_D(X, Y)}{2 - J_D(X, Y)}$$

Notice that

$$\frac{\bigl||x| - |y|\bigr|}{\max\{|x|, |y|\}} = \frac{\max\{|x|, |y|\} - \min\{|x|, |y|\}}{\max\{|x|, |y|\}} = 1 - \alpha$$

Hence,

$$1 - \alpha \;\leq\; NED(x, y) \;\leq\; (1 + \alpha)\,\frac{J_D(X, Y)}{2 - J_D(X, Y)}$$

4.5 Normalized Edit Distance Approximation

Equipped with the bounds on the normalized edit distance found in the previous section, we define an approximation of the normalized edit distance as the arithmetic mean of its upper and lower bounds, denoted by $\widetilde{NED}$:

$$\widetilde{NED}(x, y) = \frac{(1 - \alpha) + (1 + \alpha)\,\frac{J_D(X, Y)}{2 - J_D(X, Y)}}{2}$$

Figure 11: NED and its approximation, $\widetilde{NED}$, as measured on 105 pairs of traces' representing strings. Each radial line, or spoke, corresponds to a single pair of traces. The red X mark on a spoke is the NED of the corresponding pair, and the black bullet is the $\widetilde{NED}$ of the same pair.

Figure 11 depicts the actual NED alongside its approximation $\widetilde{NED}$; the measurements are done over 105 pairs of traces' representing strings, arbitrarily chosen from the KVM dataset. A near-perfect approximation of the NED values is observed in the majority of the cases.

The use of such a NED approximation allows us to supplement the clustering technique, outlined in the previous section, by a scheme where LSH provides Jaccard-similar documents (traces) grouped in clusters, in a way that facilitates a decent estimation of edit distance for any pair of documents with low overhead: without explicit computation of edit distance, but rather by computing $\widetilde{NED}$ of their representing strings, while utilizing the less demanding Jaccard. In case the pair of documents are part of the same cluster, their Jaccard distance was already calculated beforehand during the process of choosing the medoid of that cluster; thus we may consider their $\widetilde{NED}$ calculation a free ride.
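A small sketch of the approximation (illustrative only): computing the bounds of Theorem 1 and their mean requires only the Jaccard distance of the n-gram sets and the lengths of the representing strings.

def ned_tilde(jaccard_distance, len_x, len_y):
    """Approximate NED(x, y) from the Jaccard distance of the n-gram sets X, Y
    and the lengths of their representing strings x, y (mean of the Theorem 1 bounds)."""
    alpha = min(len_x, len_y) / max(len_x, len_y)
    lower = 1.0 - alpha
    upper = (1.0 + alpha) * jaccard_distance / (2.0 - jaccard_distance)
    return (lower + upper) / 2.0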

Moreover, in future work, the $\widetilde{NED}$ approximation may serve as the basis for a refinement mechanism for the clustering. The LSH clustering method described in the previous section provides a clustering of documents with respect to Jaccard similarity; we may refine the obtained clustering with respect to edit distance. The refinement will take place alongside the process of choosing medoids; in particular, for each cluster we will exclude those traces for which the value of $\widetilde{NED}$ is significantly higher than $J_D$, when measured from the cluster's medoid, and possibly assign them to another cluster with a closer medoid.

References

[1] Igor Santos, Jaime Devesa, Felix Brezo, Javier Nieves, and Pablo Garcia Bringas. OPEM: A static-dynamic approach for machine-learning-based malware detection. In International Joint Conference CISIS'12-ICEUTE'12-SOCO'12 Special Sessions, pages 271-280. Springer, 2013.

[2] Rafiqul Islam, Ronghua Tian, Lynn M. Batten, and Steve Versteeg. Classification of malware based on integrated static and dynamic features. Journal of Network and Computer Applications, 36(2):646-656, 2013.

[3] Youngjoon Ki, Eunjin Kim, and Huy Kang Kim. A novel approach to detect malware based on API call sequence analysis. International Journal of Distributed Sensor Networks, 11(6):659101, 2015.

[4] Sanchit Gupta, Harshit Sharma, and Sarvjeet Kaur. Malware characterization using Windows API call sequences. In International Conference on Security, Privacy, and Applied Cryptography Engineering, pages 271-280. Springer, 2016.

[5] Kingsly Leung and Christopher Leckie. Unsupervised anomaly detection in network intrusion detection using clusters. In Proceedings of the Twenty-Eighth Australasian Conference on Computer Science - Volume 38, pages 333-342. Australian Computer Society, Inc., 2005.

[6] Misty Blowers and Jonathan Williams. Machine learning applied to cyber operations. In Network Science and Cybersecurity, pages 155-175. Springer, 2014.

[7] Gilbert R. Hendry and Shanchieh J. Yang. Intrusion signature creation via clustering anomalies. In Data Mining, Intrusion Detection, Information Assurance, and Data Networks Security 2008, volume 6973, page 69730C. International Society for Optics and Photonics, 2008.

[8] Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. Mining of Massive Datasets. Cambridge University Press, 2014.

[9] Andrei Z. Broder. On the resemblance and containment of documents. In Compression and Complexity of Sequences 1997. Proceedings, pages 21-29. IEEE, 1997.

[10] Chris McCormick. MinHash tutorial with Python code. http://www.mccormickml.com, June 2015.

[11] Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Similarity search in high dimensions via hashing. In VLDB, volume 99, pages 518-529, 1999.

[12] Szymon Grabowski. New tabulation and sparse dynamic programming based techniques for sequence similarity problems. Discrete Applied Mathematics, 212:96-103, 2016.

[13] Gad M. Landau, Eugene W. Myers, and Jeanette P. Schmidt. Incremental string comparison. SIAM Journal on Computing, 27(2):557-582, 1998.

[14] Diptarka Chakraborty, Debarati Das, Elazar Goldenberg, Michal Koucky, and Michael Saks. Approximating edit distance within constant factor in truly sub-quadratic time. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 979-990. IEEE, 2018.

[15] Amir Abboud and Arturs Backurs. Towards hardness of approximation for polynomial time problems. In LIPIcs-Leibniz International Proceedings in Informatics, volume 67. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017.


The Relation Between Jaccard Similarity and Edit Distance in LSH Based Malware Detection

Thesis submitted in partial fulfillment of the requirements for the Master of Sciences degree

Mohammad Ghanayim

Ben-Gurion University of the Negev, 2018.

Abstract

In this work, we rely on methods from the field of data mining whose main purpose is finding similar items in large collections of textual data, in particular MinHashing and Locality Sensitive Hashing, in order to perform behavioral analysis and detection of malware. To this end, and following the misuse detection approach, we clustered a collection of API call traces of known malware; the clustering was performed with respect to the Jaccard similarity between pairs of traces. The partition obtained from the cluster analysis serves as the basis for a classifier that detects malicious/benign behavior of programs at runtime.

The computation required for finding the Jaccard similarity between a pair of documents is far more efficient than the computation required for finding their edit distance, also known as the Levenshtein distance (linear versus quadratic time). Therefore, in this work we examine the possibility of using Jaccard similarity as the basis for an approximate computation of edit distance. To this end, we present formulas that define, under certain conditions, the relation between Jaccard similarity and edit distance, and in particular define upper and lower bounds on the values of edit distance in terms of the values of Jaccard similarity.

A paper on this subject was presented at IEEE NCA 2017, the International Symposium on Network Computing and Applications, held in Cambridge, USA.

