Ben-Gurion University of the Negev
The Faculty of Natural Sciences
Department of Computer Science

The Relation Between Jaccard Similarity and Edit Distance in LSH Based Malware Detection

Thesis submitted in partial fulfillment of the requirements for the Master of Sciences Degree

Mohammad Ghanayim

Under the supervision of:

Prof. Shlomi Dolev

December 2018

Ben-Gurion University of the Negev

The Faculty of Natural Sciences

Department of Computer Science

The Relation Between Jaccard Similarity and Edit Distance in LSH Based Malware Detection

Mohammad Ghanayim

Under the supervision of:

Prof. Shlomi Dolev

Mohammad Ghanayim: Date: 20.12.2018
Prof. Shlomi Dolev: Date: 20.12.2018

Committee Chairperson: Date: 23.12.2018

December 2018

Abstract

In this work, we employ textual methods that are usually used for finding similar items in large datasets, namely n-grams, MinHashing, and Locality Sensitive Hashing, for behavioral analysis and detection of malware. Following the misuse approach of intrusion detection, we train a classifier by using the above techniques to efficiently cluster a dataset of malicious Windows API call traces with respect to Jaccard similarity. The obtained clustering is used with great success in classifying new query traces as either malicious or benign.

The computation associated with extracting n-grams and calculating Jaccard similarity is much more efficient than the computation of edit distance (linear versus quadratic time complexity). Thus, we examine the possibility of utilizing the Jaccard similarity in an estimation of edit distance. We formulate inequalities defining the relationship between Jaccard similarity and edit distance, which impose upper and lower bounds on the edit distance values in terms of the Jaccard values. The scope of our analytical results is limited to representing strings (strings derived from sorted sets of n-grams) rather than the original (raw) textual data. Yet in practice, we obtained an indication of solid correspondence between the edit distance of original strings and the edit distance of their representing strings.

This thesis is based on a paper presented at IEEE NCA 2017, the International Symposium on Network Computing and Applications, held in Cambridge, MA, USA.

Acknowledgments

Gratitudes. I would like to express my sincere gratitude to my advisor Prof. Shlomi Dolev and to Prof. Sergey Frenkel (Russian Academy of Sciences) for providing me with guidance and assistance during my first research experience and for their patience and motivation. I would also like to thank our Taiwanese research partners, Prof. Yeali S. Sun and Prof. Shun-Wen Hsiao, from whom I learned a lot and with whom I enjoyed our joint work. I would like to acknowledge Amit Elran, Shaked Sagi, and Yoav Beeri, whom I instructed in their B.Sc. final project, for their contributions to the code used in this work. Likewise, I want to thank all my colleagues in Prof. Dolev's lab and its administrative coordinator, Ms. Timi Budai, for the enjoyable times I spent with them. Last but not least, I would like to thank my parents, Zaki and Maysoon, for their unfailing support and continuous encouragement.

Grant assistance. This research was partially supported by the Council of Higher Education Scholarship for Master Students, a grant from the Israeli Ministry of Science, Technology and Space and the National Science Council of Taiwan, and the Lynne and William Frankel Center for Computer Science.

Contents

1 Introduction

2 Preliminaries
  2.1 Malware Analysis And Detection
    2.1.1 Detection Approaches: Misuse vs. Anomaly
    2.1.2 Behavioral Analysis of Malware
  2.2 Machine Learning
    2.2.1 Supervised And Unsupervised Machine Learning
    2.2.2 Clustering
  2.3 Jaccard Similarity And Edit Distance
    2.3.1 Jaccard Similarity
    2.3.2 Edit Distance
  2.4 Shingling, n-grams And Representing Strings
    2.4.1 Shingling Strings And n-grams
    2.4.2 Representing Strings
  2.5 Finding Similar Items
    2.5.1 MinHashing
    2.5.2 Locality Sensitive Hashing (LSH)

3 Behavioral Analysis and Detection of Malware Using Locality Sensitive Hashing
  3.1 Traces of API Calls
  3.2 Misuse Detection: Clustering-Backed Classifier
  3.3 Experiments and Results
    3.3.1 Do malware and benign traces have low similarity, despite the disregard for semantics?
    3.3.2 Does the learning algorithm comply with the Empirical Risk Minimization principle?
    3.3.3 Classifier's Performance

4 Edit Distance Approximation in Terms of Jaccard Similarity
  4.1 Jaccard Similarity Versus Edit Distance
  4.2 Normalized Edit Distance
  4.3 The Middle Ground: Sets of n-grams & Representing Strings
  4.4 Bounds on Normalized Edit Distance
  4.5 Normalized Edit Distance Approximation

List of Figures

1 Illustration of intersection and symmetric difference of sets
2 Example API Call Trace
3 The classifier training procedure
4 Jaccard similarity rates to the nearest medoid, benign vs. malicious
5 Classifier's error rate versus similarity threshold
6 Classifier's ROC curve
7 Classifier's error rate versus n-gram size
8 Classifier's error versus training set size
9 The average n-gram frequency versus n-gram size
10 The average difference between the NED on original documents and NED on representing strings versus n-gram size
11 NED and its approximation, $\widetilde{NED}$, measured on a sample set

1 Introduction

Machine learning and data mining are becoming prevalent in cybersecurity applications, due to their increased effectiveness compared to conventional methods. Learning algorithms are particularly good at analyzing large datasets and identifying underlying trends and patterns, and therefore their ability to detect abnormalities and threats is far higher than that of manually defined conventional detectors. An essential practice of cybersecurity is the study and analysis of malware, which is employed in other cybersecurity practices such as the development of countermeasures and protection against malware. In principle, behavioral - or dynamic - analysis of malware is done by monitoring and recording the behavior of malware and threats during their operation in controlled environments. One of the strategies for capturing malware's behavior is to trace the malware program's calls to the operating system's APIs. API call behavioral analysis has been proposed in the literature: Santos et al. [1] and Islam et al. [2] proposed malware detectors based on API calls in combination with other static features. Ki et al. [3] analyzed API calls using DNA sequence alignment algorithms. Gupta et al. [4] clustered hash signatures of API calls to detect the type of malware.

In Section 3, we propose our method for behavioral analysis of malware based on their traces of API calls. We analyze the API calls of malicious programs by clustering the traces, with respect to their Jaccard similarity, using efficient techniques, namely MinHashing and Locality Sensitive Hashing.

The analysis of the API calls is done while disregarding the semantics of API names and arguments, making our solution platform-independent. On top of the latter clustering of malicious API traces, we develop a benign/malicious classifier, which is used for API call misuse detection. The efficiency of the clustering (the training), the monitoring, and the detection is critical; otherwise, the performance of the entire monitored system might be degraded, or alternatively, the ability to detect threats in real time and eliminate their damage may be lost. Similar approaches, i.e., applying clustering algorithms for anomaly detection or misuse detection, can be found, for example, in [5], [6] and [7].

When dealing with traces of API calls, in which the ordering carries important information about the malware behavior, it seems that edit distance is more adequate than Jaccard similarity for comparing and clustering these traces, since Jaccard similarity does not capture differences in ordering. However, the computation of edit distance is more complex than the computation of Jaccard. In Section 4, we show that there is a relationship, under certain constraints, between edit distance and Jaccard similarity, which can serve as a theoretical basis for an estimation of edit distance by means of Jaccard. Jaccard is a measure of similarity between two sets, while edit distance is a measure of dissimilarity between two strings, such as traces of API calls. Moreover, the values of Jaccard similarity lie in the unit interval, i.e., [0, 1], while edit distances are natural numbers. Thus, for unifying the ranges and domains of both metrics, we define a normalized form of edit distance and use sets of n-grams and their representing strings as the middle ground for the estimation.

2 Preliminaries

2.1 Malware Analysis And Detection

2.1.1 Detection Approaches: Misuse vs. Anomaly

Intrusion detection systems are based on three main approaches: misuse detection, anomaly detection, or hybrid combinations of both. The anomaly detection approach defines and models the normal accepted behavior and detects anomalies as deviations from that normal behavior. Conversely, the misuse detection approach defines and models the abnormal behavior and detects intrusions that resemble the abnormal model. In misuse detection, anything not known is considered as normal behavior. According to the misuse detection approach, it is assumed that it is easier to define abnormal behavior, and its model can be easily extended to newly discovered types of attacks. A disadvantage of misuse detection is the inability to discover unknown attacks, dubbed zero-day attacks, which are detectable by anomaly detection systems.

2.1.2 Behavioral Analysis of Malware

While static analysis examines the binary code of malware without running it, dynamic or behavioral analysis executes malware in sandboxed environments for the purpose of observing its behavior. It is more difficult to obscure the runtime behavior in order to evade detection systems than it is to obfuscate the binary code for the same purpose, which is the main advantage of behavioral analysis over the traditional methods of static analysis.

2.2 Machine Learning

Arthur Samuel, who coined the term Machine Learning, defined it as "a field of study that gives computers the ability to learn without being explicitly programmed". Machine learning algorithms analyze and model sample datasets in order to identify underlying patterns and form a generalization that enables classification and prediction of new unseen data.

2.2.1 Supervised And Unsupervised Machine Learning

Supervised learning algorithms train on labeled datasets, in which each example is a pair of an input object and its desired output label. An optimal outcome of such an algorithm is a hypothesis, obtained by learning and generalizing from the training data, that is able to map new unseen inputs to their correct labels. Unsupervised learning algorithms train on unlabeled datasets, and the task is to infer from the data and find hidden structures and patterns in the data. It is useful for cases where labeled data is not readily available and the task of labeling is hard.

2.2.2 Clustering

Clustering is an approach of unsupervised learning, aimed at finding patterns in high-dimensional unlabeled data. Clustering of a dataset is the task of partitioning it into different clusters, such that objects in the same cluster are more similar (with respect to a given metric) to each other than to objects in other clusters.

A medoid of a set of items, or a cluster, is defined as the item which has the minimal average distance to all other items. Formally, let $x_1, x_2, \ldots, x_k$ be a set of $k$ items with distance function $d$; then the medoid of the set is:

$$x_{\mathrm{medoid}} = \underset{y \in \{x_1, x_2, \ldots, x_k\}}{\arg\min} \sum_{i=1}^{k} d(y, x_i)$$

Medoids are used on datasets for which a mean or centroid cannot be defined, as the medoid is always an item within the group, as opposed to a centroid.

A centroid of a cluster is the mean of all its items, $x_{\mathrm{centroid}} = \frac{x_1 + x_2 + \cdots + x_k}{k}$, and thus it is not restricted to be among the cluster's items.
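As a concrete illustration (a minimal sketch, not the code used in this thesis), a medoid can be computed directly from its definition, given any distance function:

def medoid(items, dist):
    """Return the item with the minimal total (and hence average) distance to all items."""
    return min(items, key=lambda y: sum(dist(y, x) for x in items))

# Toy usage with one-dimensional points and absolute difference as the distance:
points = [1, 2, 3, 10]
print(medoid(points, lambda a, b: abs(a - b)))  # -> 2 (total distance 10, tied with 3)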

2.3 Jaccard Similarity And Edit Distance

2.3.1 Jaccard Similarity

Jaccard similarity, or the Jaccard index, is a measure of similarity between finite sets. It is defined as the size of the intersection divided by the size of the union of the sets:

$$J_S(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$$

Jaccard distance, which is a measure of dissimilarity between finite sets, is the complement of Jaccard similarity, $J_D(X, Y) = 1 - J_S(X, Y)$. It is defined as the size of the symmetric difference of the two sets (denoted by $X \triangle Y$), which is the set of elements in either of the sets but not in both (Figure 1):

$$J_D(X, Y) = \frac{|X \triangle Y|}{|X \cup Y|}$$

Example 1 Let $A = \{a, b, c, d\}$ and $B = \{c, d, e, f\}$ be two sets. Then
$|A \cup B| = |\{a, b, c, d, e, f\}| = 6$,
$|A \cap B| = |\{c, d\}| = 2$,
$|A \triangle B| = |\{a, b, e, f\}| = 4$.
And therefore, $J_S(A, B) = \frac{2}{6} = \frac{1}{3}$ and $J_D(A, B) = \frac{4}{6} = \frac{2}{3}$.

Figure 1: The intersection (left) and the symmetric difference (right) of sets.

Jaccard similarity or distance between two sets can be computed in time linear in the sets' sizes, using suitable implementations of the set data structures. MinHash and Locality Sensitive Hashing may be combined in order to achieve accurate and efficient estimations of Jaccard similarities between pairs of sets in large datasets.
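For illustration, a direct sketch over Python sets (not the thesis implementation) reproduces the values of Example 1:

def jaccard_similarity(x, y):
    """|X ∩ Y| / |X ∪ Y| for finite sets X, Y."""
    return len(x & y) / len(x | y)

def jaccard_distance(x, y):
    """|X △ Y| / |X ∪ Y|, the complement of Jaccard similarity."""
    return len(x ^ y) / len(x | y)

A, B = {"a", "b", "c", "d"}, {"c", "d", "e", "f"}
print(jaccard_similarity(A, B))  # 0.333... (= 1/3)
print(jaccard_distance(A, B))    # 0.666... (= 2/3)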

2.3.2 Edit Distance

Edit distance, or Levenshtein distance, is a metric that measures the differences between strings. The edit distance between any two strings x and y is defined as the minimal number of edit operations (delete, insert, and substitute) on (single) symbols applied to one string x in order to obtain the other string y.

Edit distance is computed by dynamic programming in quadratic time. It has the advantage of being able to identify differences in the ordering inside the strings, which cannot be achieved using Jaccard on shingled strings.
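A textbook dynamic-programming sketch of this computation is shown below, only to make the quadratic time and space behavior concrete; it is not the thesis implementation:

def edit_distance(x, y):
    """Minimal number of insertions, deletions and substitutions turning x into y."""
    m, n = len(x), len(y)
    # dp[i][j] = edit distance between x[:i] and y[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all of x[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution (or match)
    return dp[m][n]

print(edit_distance("kitten", "sitting"))  # 3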

2.4 Shingling, n-grams And Representing Strings

2.4.1 Shingling Strings And n-grams

Given an integer n and a string s, the n-grams set of s is the set of all contiguous substrings of length n in s. The n-grams are collected by a process of shingling - passing a sliding window of length n over the string.

2.4.2 Representing Strings

Representing strings of n-gram sets [8] are obtained by sorting the n-grams in the set according to, say, lexicographic ordering, and concatenating them, as described in Section 3.9.2 of [8]. The outcome is strings of n-grams (strings over the alphabet of the n-grams) that are sorted and have no repetitions.

Example 2 Let s denote the string "Hello-World!"; then the 2-grams set of s is:
{'el', 'o-', 'Wo', 'lo', 'll', 'ld', 'd!', '-W', 'rl', 'or', 'He'}
and its representing string is:
"-W·He·Wo·d!·el·ld·ll·lo·o-·or·rl"
(the dots mark the boundaries between the concatenated 2-grams).
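The shingling and the representing-string construction can be sketched as follows (illustrative code, not the thesis implementation); on the string of Example 2 it yields the set and ordering shown above:

def ngrams(s, n):
    """The set of all contiguous substrings of length n in s (shingling)."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def representing_string(grams):
    """Sort the n-grams lexicographically and concatenate them."""
    return "".join(sorted(grams))

s = "Hello-World!"
grams = ngrams(s, 2)
print(sorted(grams))
# ['-W', 'He', 'Wo', 'd!', 'el', 'ld', 'll', 'lo', 'o-', 'or', 'rl']
# representing_string(grams) concatenates these in the same order,
# i.e. the string -W·He·Wo·d!·el·ld·ll·lo·o-·or·rl of Example 2.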

2.5 Finding Similar Items

Finding similar items in datasets is a major data mining problem. Next, we list efficient techniques used for finding similar textual documents (strings, sequences, or traces) in datasets.

2.5.1 MinHashing

MinHashing [8, 9] is a compression method for sets of items which preserves the Jaccard similarity of the sets after the compression. MinHashing makes it possible to replace long text documents by much shorter, unified-length MinHash signatures; it also allows faster estimation of the Jaccard similarity between two sets without explicit computation of their union and intersection.

First, the text documents are shingled into n-grams and stored as binary bags of words, or bit vectors, which have one coordinate for every possible string of length n over the alphabet $\Sigma$, i.e., the length of these vectors is $|\Sigma|^n$. A coordinate gets value 1 if and only if its corresponding string appears in the text document at least once. For added efficiency, each n-gram is hashed into a shorter integer, say 32-bit, that replaces the n-gram and makes it easier to manipulate. Additionally, an n-gram that hashes to integer x is placed in the x-th coordinate of the bit vector.

Second, the bit vectors are compressed into much shorter signatures using MinHashing. MinHash signatures are integer vectors that have the following property: the Jaccard similarity of the bit vectors is the same as the expected similarity of their signatures.

The length of a MinHash signature is determined by the number of MinHash functions used in the process.

A MinHash function is defined by a random permutation of the bit vectors' coordinates (the same permutation on all vectors), and the MinHash value of a MinHash function is defined as the index of the first bit, in the permuted order, to have value 1. Formally, let S be a set of n-grams represented by a bit vector, and let $\pi_1, \pi_2, \pi_3, \ldots, \pi_k$ be random permutations of the bit vector's coordinates and $h_{\pi_1}, h_{\pi_2}, h_{\pi_3}, \ldots, h_{\pi_k}$ their matching MinHash functions. Then, the MinHash signature of the set S is:

$$[\, h_{\pi_1}(S),\; h_{\pi_2}(S),\; h_{\pi_3}(S),\; \ldots,\; h_{\pi_k}(S) \,]$$

Implementation Trick. Permuting such bit vectors is prohibitive, due to their size and the fact that most of their bits are zeros, corresponding to n-grams that do not appear in the original document. Yet, with a simple and efficient workaround [10] we can avoid any calculations on the zero bits. The key idea is to use hash functions to simulate the permutations and apply them only to existing n-grams, i.e., only to the bits with ones. In particular, we need a hash function h(x) that takes a 32-bit integer in the range $[0, 2^{32} - 1]$ and maps it to another integer, ideally without collisions. Put another way, let x be an n-gram initially positioned in the x-th bit of the vector; it is going to be mapped to position h(x) in the permuted order.

Hash functions of the form $h(x) = (ax + b) \bmod p$ will do the job, where x is an n-gram (integer), the coefficients a and b are randomly chosen integers less than the maximum value of x, and p is a constant prime number slightly larger than the maximum value of x. Different values of a and b imply different permutations. Thus, the MinHash value can be obtained by iterating with h(x) over the set of n-grams present in the text document and selecting the minimal resulting hash value.

With that said, let D be a text document and let S be its set of n-grams stored as a bit vector (a bit for every possible n-gram). Let $\hat{S}$ denote the set of n-grams actually appearing in D ($\hat{S}$'s items are exactly the bits with ones in the bit vector S; in other words, $\hat{S}$ is much smaller than S). Let the following functions

$$h_1(x) = (a_1 x + b_1) \bmod p$$
$$h_2(x) = (a_2 x + b_2) \bmod p$$
$$h_3(x) = (a_3 x + b_3) \bmod p$$
$$\vdots$$
$$h_k(x) = (a_k x + b_k) \bmod p$$

be hash functions simulating the permutations $\pi_1, \pi_2, \pi_3, \ldots, \pi_k$, respectively. Then, the MinHash signature of length k of the n-gram set S is:

$$\Bigl[\, \min_{x \in \hat{S}} h_1(x),\; \min_{x \in \hat{S}} h_2(x),\; \min_{x \in \hat{S}} h_3(x),\; \ldots,\; \min_{x \in \hat{S}} h_k(x) \,\Bigr]$$
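The following compact sketch illustrates the signature computation just described, using hash functions of the form h(x) = (ax + b) mod p to simulate permutations; the constants and helper names are illustrative assumptions and are not taken from the thesis code:

import random

P = 4294967311  # a prime slightly larger than 2**32 - 1

def make_hash_funcs(k, seed=0):
    """k random (a, b) pairs defining h(x) = (a*x + b) mod P, simulating k permutations."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(k)]

def minhash_signature(gram_ids, hash_funcs):
    """MinHash signature of a set of n-grams already hashed to (32-bit) integers."""
    return [min((a * x + b) % P for x in gram_ids) for (a, b) in hash_funcs]

# Toy usage: n-grams are first mapped to integers (e.g. via a 32-bit hash).
doc_a = {17, 42, 99, 1234}
doc_b = {17, 42, 99, 5678}
funcs = make_hash_funcs(k=20)
sig_a = minhash_signature(doc_a, funcs)
sig_b = minhash_signature(doc_b, funcs)
# The fraction of agreeing signature positions estimates the Jaccard similarity.
estimate = sum(u == v for u, v in zip(sig_a, sig_b)) / len(funcs)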

2.5.2 Locality Sensitive Hashing (LSH)

A trivial method to find groups of similar items in a dataset is to calculate the similarity between each and every pair of items, find all pairs whose similarity is above some threshold, $\mathrm{sim}(x_i, x_j) \geq t$, and group them into one cluster. This method is time consuming, since in a dataset of size N there are $\binom{N}{2} = O(N^2)$ pairs of items.

Alternatively, similar traces can be grouped together using LSH in linear time, with only a small increase in false negative results. Locality sensitive hashing, see e.g. [8, 11], hashes items into buckets several times, such that:

• Similar items are hashed into the same bucket with high probability.

• Items that are not similar enough are hashed into a common bucket with low probability.

Items that are mapped to the same bucket are considered candidates for being similar. In case more accuracy is desired, the candidate items may be further investigated by explicitly computing the similarity measure among pairs of items in a bucket. Note that buckets usually consist of significantly fewer items compared to the total number of items; typically, the similarity measure is explicitly computed only for a limited number of item pairs within the same bucket.

LSH and MinHash may be combined to form a solution to the problem of finding similar items in a dataset of textual documents. For improved efficiency, the text items are MinHashed into signatures, then LSH is performed on these signatures (integer vectors) using the banding technique [8]. According to this technique, a signature is cut into b consecutive portions, or sub-vectors, which are hashed separately into the same hash table, giving each pair of items b opportunities to be declared similar. The fewer bands we use, the fewer candidates, of higher similarity, we will get, and vice versa. The outcome of this process is a hash table with similar items (with respect to Jaccard) in its buckets; from another perspective, it is a clustering of the dataset.
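A sketch of the banding step (illustrative names and structure, not the thesis code): each signature is split into b bands, each band is hashed separately, and items sharing a bucket in any band become candidate similar pairs.

from collections import defaultdict

def lsh_buckets(signatures, b):
    """Group items whose MinHash signatures collide in at least one of b bands.

    signatures: dict mapping item id -> signature (list of ints, length divisible by b).
    Returns a dict mapping (band index, band content) -> list of item ids.
    """
    buckets = defaultdict(list)
    for item_id, sig in signatures.items():
        r = len(sig) // b                          # rows per band
        for band in range(b):
            key = (band, tuple(sig[band * r:(band + 1) * r]))
            buckets[key].append(item_id)
    return buckets

# Items that share any bucket are candidate similar pairs; their exact Jaccard
# similarity can then be computed only within these (much smaller) groups.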

3 Behavioral Analysis and Detection of Malware Using Locality Sensitive Hashing

In this section, we describe how machine learning and data mining techniques for finding similar textual items, namely clustering, MinHashing, and Locality Sensitive Hashing, are employed for behavioral analysis and detection of malware. In the training phase, traces of programs' OS API calls pass through MinHashing and Locality Sensitive Hashing (LSH) procedures to efficiently obtain a clustering of the traces; then, during the prediction phase, a query trace is compared to the clusters and labeled accordingly.

3.1 Traces of API Calls

The traces in this work were obtained, by Prof. Yeali S. Sun's lab, from the hypervisor-based Runtime Execution Introspection and Profiling (REIP) system, which uses Virtual Machine Introspection (VMI) techniques to profile hooked Windows API calls. The hooking was limited, as prior knowledge, to a set of 55 critical Windows APIs listed in Table 1. A collection of known malware and benign programs was executed in a VM environment and monitored, producing a labeled dataset of malicious API call traces and a few benign traces for testing.

An example raw trace can be found in Figure 2a. The meaningful parts of the trace, such as the API names and some of the argument values (the argument names are left out), were extracted from each such trace, concatenated and delimited by '@', resulting in a long string as in Figure 2b.

1376 malware.exe
#308810000 RegQueryValue hKey=HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\LDAP\Ldap... Return=SUCCESS type=REG_DWORD data=1
#308120000 LoadLibrary lpFileName=adsldpc.dll Return=SUCCESS
#313460000 LoadLibrary lpFileName=adsldpc Return=SUCCESS
....

(a) The beginning of a malware trace, which depicts the first 3 hooked Windows API calls (RegQueryValue, LoadLibrary, LoadLibrary) of the program malware.exe with process ID 1376.

RegQueryValue@HKEY_LOCAL_MACHINE\System\ CurrentControlSet\Services\LDAP\Ldap@SUCCESS# [email protected]@SUCCESS# LoadLibrary@adsldpc@SUCCESS....

(b) Processed Trace Example

Figure 2: Example API Call Trace

Table 1: The 55 hooked Windows APIs.

During the clustering and the detection of the API call traces, the semantics of the API names and arguments were ignored and the traces were regarded as plain text, simplifying the solution and making it platform-independent.

3.2 Misuse Detection: Clustering-Backed Classifier

Adopting the misuse detection approach, and equipped with a decent similarity measure for textual data, namely Jaccard, we clustered a dataset comprised only of malicious traces as the training phase of the classifier. The procedures of shingling, MinHashing, and Locality Sensitive Hashing are applied successively in order to efficiently achieve a high-quality clustering of the malicious dataset, avoiding pairwise comparisons between the traces in the dataset. Each trace is shingled into n-grams and hashed by MinHash, producing compressed signatures, which are in turn hashed again using LSH with the banding technique. The outcome of the above process is a hash table, where each of its buckets contains a subset of similar traces grouped together.

Figure 3: The classifier training procedure.

A label of a trace solely specifies whether the trace belongs to a malicious or a benign program. Such binary labels do not convey any information about particular properties of the programs, nor the type of the malware or its nature. Clustering the malicious traces enables a better dissection of the dataset beyond its collective malicious nature, by implicitly capturing the underlying properties of the different classes of malware and categorizing them into clusters of somehow related malicious traces. In other words, the clustering learns from the malicious dataset without any explicit description of each malware or its class. The medoid of each cluster, i.e. of each hash table bucket, is selected and used as its representative.

Under the assumption of low similarity between malware and benign traces, the binary (malicious/benign) classification of new query traces is done by checking whether the query trace fits one of the malicious clusters; if so, it is classified as malicious. Otherwise, if the new trace lies outside the malicious clusters, it is classified as benign. Technically, this is achieved by comparing each query trace q against all the medoids $m_1, m_2, \ldots, m_k$; if its maximal Jaccard similarity to one of the medoids exceeds a predefined threshold t, that is, if $\max_{i=1 \ldots k} J_S(q, m_i) > t$, then it is classified as malicious; otherwise it is classified as benign.
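As a minimal sketch (illustrative only, not the thesis implementation), the prediction rule can be written directly over n-gram sets:

def classify(query_ngrams, medoid_ngram_sets, t):
    """Label a query trace malicious iff its maximal Jaccard similarity
    to any (malicious) cluster medoid exceeds the threshold t."""
    best = max(len(query_ngrams & m) / len(query_ngrams | m)
               for m in medoid_ngram_sets)
    return "malicious" if best > t else "benign"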

The computation time of the training phase - shingling, MinHash, and LSH - is linear in the size of the training dataset. The computation required for finding a cluster's medoid is quadratic in the cluster size. The prediction of a query trace involves calculating the Jaccard similarity between the query trace and each of the medoids (as sets of n-grams); the number of medoids is bounded by the size of the hash table.

3.3 Experiments and Results

3.3.1 Do malware and benign traces have low similarity, despite the disregard for semantics?

In order to assess our assumption of low similarity rates between malicious and benign traces, we conducted an experiment aimed at testing whether there are tangible differences - in terms of Jaccard similarity - between malicious and benign traces, even when the API calls are treated as plain text, without taking into account any semantics of API names, arguments, or return values. We clustered 1652 malicious traces using MinHash with 20 hash functions (signature length 20) and LSH with 4 bands and a 517-bucket hash table. The traces were mapped into 501 buckets, with the medoid of each bucket as its representative. Two test datasets, of 48 traces each, were used: the first consists of only malicious traces and the second consists of only benign traces. We compared each test trace against all the medoids, and recorded its Jaccard similarity to the most similar medoid (the one with the maximal Jaccard similarity value). The measurement values are depicted in Figure 4.

Figure 4: Jaccard similarity rates to the nearest medoid. A point on the innermost circle has Jaccard similarity of 0, while a point on the outermost circle has Jaccard similarity of 1.

The distribution of the measurement values demonstrates the significant distinction between the high similarity rates of the malicious traces to the medoids (all the medoids are malicious) on one hand, and the lower similarity rates of the benign traces to the medoids on the other hand. This result substantiates our assumption of low similarity between malware and benign traces, even without any considerations of semantics. Hence, after clustering only malicious traces, benign traces are expected to be outliers of the clusters.

17 3.3.2 Does the learning algorithm comply with the Empirical Risk Minimization principle?

A model's empirical risk is defined as its error rate on the same dataset used in its training phase. Accordingly, an Empirical Risk Minimization (ERM) algorithm is one that finds a model with the minimal empirical risk. To determine whether our learning algorithm minimizes the empirical risk, we conducted an experiment that measures the empirical risk of the classifier by training it on a set of malicious traces and then testing the classifier on the same set.

In the context of LSH clustering and medoids, the minimal empirical risk is 0%, which is achievable by assigning each trace of the training set its own "unit cluster", so that each trace is its own medoid. This way, when testing the classifier on the same training set (which consists of malicious traces only), it is guaranteed that any query trace will have an identical medoid trace, i.e. with a Jaccard similarity rate of 1, which exceeds any possible threshold. Thus, all the traces in the training set will be labeled as malicious, yielding a 0% error rate. Alternatively, 0% empirical risk is achievable by setting the similarity threshold to 0.

In our experiment, we trained and then tested the classifier on 1700 malicious traces using a shingle size of 13 characters, MinHash with 20 hash functions, a similarity threshold of 0.7, LSH with 4 bands, and various sizes of hash table. When using hash tables of 500 to 1000 buckets, the classifier erred on 2% to 7% of the training set. Even when using hash tables as large as 50,000 buckets in order to accommodate a larger number of sparse clusters (up to 1500), we were not able to get error rates below 0.1%. Hence, the learning algorithm used in this work is not an ERM algorithm.

3.3.3 Classifier’s Performance

In order to tune the classifier and assess its performance, we conducted a few experiments measuring the detection error rate and how it is affected by different values of the learning algorithm's parameters. A detection error is either a false positive - a benign trace that was classified as malicious - or a false negative - a malicious trace that was classified as benign.

• Similarity Threshold

Figure 5 shows the classifier's overall error rates for threshold values ranging from 0 to 1, next to the error rates on the benign set and the malicious set separately (in this experiment we used an n-gram size of 13 characters, MinHash with 20 hash functions, LSH with 4 bands, and a 517-bucket hash table). The lowest error rates are achieved using similarity thresholds of 0.75-0.80, while higher threshold values yield overfitted models and a consequent degradation in performance. The receiver operating characteristic curve of the classifier is shown in Figure 6.

Another noteworthy observation, seen at the extremities of the threshold axis, is the classifier's contrasting behavior on malicious traces versus benign traces. In the classification process, query traces are labeled according to their similarity to a clustering of malicious traces; technically, a similarity threshold is used to binarize the labels.

Figure 5: Classifier's error rate on a test set of 48 malicious traces and 48 benign traces versus similarity threshold value.

Inherently, for thresholds closer to 0, the classifier is prone to label any query trace as malicious, and vice versa, for thresholds closer to 1, it is prone to label any query trace as benign. Thus, at the left end of the threshold axis in Figure 5, all the test traces were labeled as malicious, leading to a 100% error rate on the benign set and a 0% error rate on the malicious set. Conversely, at the right end of the threshold axis, all the test traces were labeled as benign, leading to a 0% error rate on the benign set and a 100% error rate on the malicious set.

• N-gram Size

Figure 7 shows the detection error rate of the classifier, trained on around 1600 malicious traces, for various n-gram sizes, using MinHash with 30 functions, LSH with 4 bands, a 1019-bucket hash table, and a similarity threshold of 0.75. The best performance is achieved with n-grams of 16 characters.

Figure 6: Classifier's ROC curve. The area under the curve is 0.945.

Larger n-grams lead to overfitting of the model, expressed by an increase in the detection error rate.

• Training dataset size

We trained the classifier several times while varying the size of the malicious training set, from 100 to 1652 traces, using an n-gram size of 16, MinHash with 30 functions, LSH with 4 bands, a 1019-bucket hash table, and a similarity threshold of 0.75. Figure 8 shows how the detection error rate begins at around 20% and decreases with the training dataset size, eventually reaching 8.3%.

Figure 7: Classifier's error rate on a test set of 48 malicious traces and 48 benign traces versus n-gram size.

Figure 8: Classifier’s error rate on a test set of 48 malicious traces and 48 benign traces versus malicious training set size.

4 Edit Distance Approximation in Terms of Jaccard Similarity

4.1 Jaccard Similarity Versus Edit Distance

While Jaccard similarity has the advantage of computational efficiency over edit distance, it lacks sensitivity to the ordering inside the compared strings (as a by-product of comparing sets of n-grams, obtained by the shingling process applied beforehand). Edit distance is sensitive to the ordering of characters inside the strings, and therefore it better fits scenarios where the ordering encapsulates important information, for instance in traces of API calls. Two strings with low edit distance share many alike substrings in similar locations in the respective strings. It is fair to say that edit distance signifies the structural differences between two strings, while Jaccard reflects the content differences only.

However, the computation of edit distance is time-consuming (quadratic time by classical dynamic programming, or $O(n^2 \log\log n / \log^2 n)$ by [12]), and accordingly, edit distance approximation has been studied extensively [13, 14, 15]. In this section, we describe our take on the question of edit distance approximation, proposing a method for estimating edit distance by means of other easily calculated measures of similarity, e.g. Jaccard, for which there exist efficient hashing-based linear-time algorithms for large datasets [8].

23 4.2 Normalized Edit Distance

A necessary step on the path to finding the relation between Jaccard and edit distance, or ED for short, is to unify their ranges to the unit interval, i.e. to [0, 1]. To this end, we use a common normalized form of edit distance, obtained by normalizing the edit distance of two strings by the longer string's length, called Normalized Edit Distance and denoted by NED:

$$NED(x, y) = \frac{ED(x, y)}{\max\{|x|, |y|\}}$$

Correspondingly, let NED’s complement be the “Normalized Edit Similarity” denoted by SimNED:

$$SimNED(x, y) = 1 - \frac{ED(x, y)}{\max\{|x|, |y|\}}$$
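For concreteness, a direct (unoptimized) sketch of both measures, assuming an edit_distance routine such as the dynamic-programming one sketched in Section 2.3.2:

def ned(x, y):
    """Normalized edit distance ED(x, y) / max(|x|, |y|), a value in [0, 1]."""
    return edit_distance(x, y) / max(len(x), len(y))

def sim_ned(x, y):
    """Normalized edit similarity, the complement of NED."""
    return 1.0 - ned(x, y)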

The following properties of ED may assist us in utilizing Jaccard for estimating SimNED:

• In accordance with its definition, the edit distance of two strings x and y is non-negative and bounded by the length of the longer string:

$$0 \leq ED(x, y) \leq \max\{|x|, |y|\}$$

which implies that:

$$0 \leq \frac{ED(x, y)}{\max\{|x|, |y|\}} \leq 1$$

Therefore the range of both normalized forms of edit distance and similarity, NED and SimNED, is the unit interval, as desired:

$$NED(x, y),\; SimNED(x, y) \in [0, 1]$$

• SimNED is bounded:

$$SimNED(x, y) \leq 1 - \frac{\bigl||x| - |y|\bigr|}{\max\{|x|, |y|\}}$$

as shown by the next lemma.

Lemma 1

$$SimNED(x, y) \leq 1 - \frac{\bigl||x| - |y|\bigr|}{\max\{|x|, |y|\}} \qquad (1)$$

Proof 1 By the definition of edit distance it holds that:

$$ED(x, y) \geq \bigl||x| - |y|\bigr|$$

since inserting $\bigl||x| - |y|\bigr|$ symbols (the difference in the strings' lengths) is the minimum number of edit operations required to bring the shorter string to the longer one; or, symmetrically, deleting $\bigl||x| - |y|\bigr|$ symbols from the longer string brings it to the shorter one. Hence,

$$\frac{ED(x, y)}{\max\{|x|, |y|\}} \geq \frac{\bigl||x| - |y|\bigr|}{\max\{|x|, |y|\}}$$

and therefore:

$$1 - \frac{ED(x, y)}{\max\{|x|, |y|\}} \leq 1 - \frac{\bigl||x| - |y|\bigr|}{\max\{|x|, |y|\}}$$

Since $SimNED(x, y) = 1 - NED(x, y) = 1 - \frac{ED(x, y)}{\max\{|x|, |y|\}}$, we obtain

$$SimNED(x, y) \leq 1 - \frac{\bigl||x| - |y|\bigr|}{\max\{|x|, |y|\}}$$

4.3 The Middle Ground: Sets of n-grams & Representing Strings

An obvious hurdle in utilizing Jaccard as the basis for the estimation of edit distance is that these metrics are defined on different domains. Jaccard is defined over unordered sets, in which each distinct element appears only once, although it may occur many times in different parts of the document. Edit distance, in contrast, is defined over strings and depends on the order of the symbols in the strings. In order to overcome the difficulties associated with this discrepancy, we confine the argument to the scope of certain types of sets and strings, both derived from the original documents: sets of n-grams and their representing strings.

In this work's application, and in similar applications in general, the documents in question pass through a shingling process (collecting all the substrings of a certain length - n-grams - appearing in the document), which is a prevalent first stage in most modern methods of textual similarity estimation [8]. The outcome of the shingling process is sets of n-grams, i.e. without repetitions of n-grams, one set per document, or per API trace in particular. These sets are used for computing Jaccard similarities.

The representing strings of the aforementioned sets are obtained by sorting and concatenating the elements of the sets, as described in Section 3.9.2 of [8]. As a result, we get strings of n-grams (strings over the alphabet of the n-grams) that are sorted and have no repetitions. These representing strings are used for the estimation of edit distance. We denote sets of n-grams with uppercase letters, e.g. X, and their representing strings with the same letter in lowercase, e.g. x.

Representing strings do differ from the original documents. However, roughly speaking, the lower the frequencies of n-grams in the original document (the number of occurrences of the same n-gram in the document), the higher the correspondence between the document and its representing string. The latter claim is supported by the measurements depicted in Figures 9 and 10.

Figure 9: The average n-gram frequency (average number of repeated occurrences of the same n-gram in each document) versus n-gram size, as measured in a sample of 25 API call traces with average length of 1458.

Figure 10: The average difference (delta) between the NED on pairs of original documents and the NED on their representing strings, versus n-gram size, as measured in a sample of 25 API call traces.

The experiment result summarized in Figure 10 indicates that it is possible to choose n-gram lengths that yield less than 0.1 average difference between the NED of the original documents and the NED of their representing strings. This justifies our choice to concentrate on analyzing the representing strings, as we do in the sequel. We note that in general, one may sample a given dataset and tune the length of the n-grams for that specific dataset, taking into account the correspondence between original documents and their representing strings, prior to proceeding with the clustering of the representing strings.

In the next lemma, we show a direct relation connecting the Jaccard distance between sets of n-grams on one hand, and the longest common subsequence and the insertion/deletion edit distance (without substitution operations) of the corresponding representing strings on the other hand.

Lemma 2

$$J_D(X, Y) = \frac{E(x, y)}{E(x, y) + C(x, y)} \qquad (2)$$

where $J_D(X, Y)$ denotes the Jaccard distance, $E(x, y)$ denotes the insertion/deletion edit distance, and $C(x, y)$ denotes the length of the longest common subsequence (LCS) of x and y [8].

Proof 2 The symmetric difference of two sets X and Y consists of the elements in Y but not in X and vice versa, i.e. $X \triangle Y = (X \setminus Y) \cup (Y \setminus X)$. The fact that x and y are the representing strings of the n-gram sets X and Y, respectively, implies that $E(x, y) = |X \triangle Y|$: the edit operations (insertion or deletion) required for transforming string x into y are deleting the elements that are in x but not in y, and inserting the elements that are in y but not in x, which is exactly the size of the symmetric difference. Moreover, since the elements in the representing strings are sorted and have no repetitions, the length of the strings' LCS is exactly the size of their corresponding sets' intersection, i.e. $C(x, y) = |X \cap Y|$. Therefore,

$$\frac{E(x, y)}{E(x, y) + C(x, y)} = \frac{|X \triangle Y|}{|X \triangle Y| + |X \cap Y|} = \frac{|X \triangle Y|}{|X \cup Y|} = J_D(X, Y)$$

4.4 Bounds on Normalized Edit Distance

In the next theorem, we show upper and lower bounds on the normalized edit distance of any representing strings in terms of the Jaccard distance of their corresponding n-gram sets.

Theorem 1 The following inequality holds for the Jaccard distance and NED, for any n-gram sets X and Y and their representing strings x and y:

$$1 - \alpha \;\leq\; NED(x, y) \;\leq\; (1 + \alpha)\,\frac{J_D(X, Y)}{2 - J_D(X, Y)} \qquad (3)$$

where $\alpha = \frac{\min\{|x|, |y|\}}{\max\{|x|, |y|\}}$.

Proof 3 It holds that,

$$E(x, y) = |X \triangle Y| = |x| + |y| - 2\,C(x, y) \qquad (4)$$

and by substituting $C(x, y) = \frac{|x| + |y| - E(x, y)}{2}$ in Lemma 2, we obtain:

$$J_D(X, Y) = \frac{2\,E(x, y)}{E(x, y) + |x| + |y|} \qquad (5)$$

Note that the expression on the right side of equality (5) is increasing in $E(x, y)$, and it holds that $ED(x, y) \leq E(x, y)$, as one substitution operation is equivalent to a sequence of two operations: insert and then delete, or vice versa. Hence,

$$J_D(X, Y) \geq \frac{2\,ED(x, y)}{ED(x, y) + |x| + |y|} \qquad (6)$$

Isolating $ED(x, y)$:

$$ED(x, y) \leq J_D(X, Y) \cdot \frac{|x| + |y|}{2 - J_D(X, Y)} \qquad (7)$$

Converting to the normalized edit distance, we obtain:

$$NED(x, y) = \frac{ED(x, y)}{\max\{|x|, |y|\}} \;\leq\; J_D(X, Y) \cdot \frac{|x| + |y|}{(2 - J_D(X, Y))\,\max\{|x|, |y|\}}$$

Taking into account that $|x| + |y| = \max\{|x|, |y|\} + \min\{|x|, |y|\}$, we obtain:

$$NED(x, y) \;\leq\; (1 + \alpha)\,\frac{J_D(X, Y)}{2 - J_D(X, Y)}, \qquad \text{where } \alpha = \frac{\min\{|x|, |y|\}}{\max\{|x|, |y|\}}$$

And by Lemma 1,

$$\frac{\bigl||x| - |y|\bigr|}{\max\{|x|, |y|\}} \;\leq\; NED(x, y) \;\leq\; (1 + \alpha)\,\frac{J_D(X, Y)}{2 - J_D(X, Y)}$$

Notice that

$$\frac{\bigl||x| - |y|\bigr|}{\max\{|x|, |y|\}} = \frac{\max\{|x|, |y|\} - \min\{|x|, |y|\}}{\max\{|x|, |y|\}} = 1 - \alpha$$

Hence,

$$1 - \alpha \;\leq\; NED(x, y) \;\leq\; (1 + \alpha)\,\frac{J_D(X, Y)}{2 - J_D(X, Y)}$$

4.5 Normalized Edit Distance Approximation

Equipped with the bounds on the normalized edit distance found in the previous section, we define an approximation of the normalized edit distance as the arithmetic mean of its upper and lower bounds, denoted by $\widetilde{NED}$:

$$\widetilde{NED}(x, y) = \frac{(1 - \alpha) + (1 + \alpha)\,\frac{J_D(X, Y)}{2 - J_D(X, Y)}}{2}$$

Figure 11: NED and its approximation, $\widetilde{NED}$, as measured on 105 pairs of traces' representing strings. Each radial line, or spoke, corresponds to a single pair of traces. The red X mark on a spoke is the NED of the corresponding pair, and the black bullet is the $\widetilde{NED}$ of the same pair.

Figure 11 depicts the actual NED alongside its approximation $\widetilde{NED}$; the measurements are done over 105 pairs of traces' representing strings, arbitrarily chosen from the KVM dataset. A near-perfect approximation of the NED values is observed in the majority of the cases.

The use of such a NED approximation allows us to supplement the clustering technique, outlined in the previous section, by a scheme where LSH provides Jaccard-similar documents (traces) grouped in clusters, in a way that facilitates a decent estimation of edit distance for any pair of documents with low overhead: without explicit computation of edit distance, but rather by computing $\widetilde{NED}$ of their representing strings, while utilizing the less demanding Jaccard. In case the pair of documents are part of the same cluster, their Jaccard distance was already calculated beforehand during the process of choosing the medoid of that cluster; thus we may consider their $\widetilde{NED}$ calculation a free ride.
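A small sketch of the approximation (illustrative only): computing the bounds of Theorem 1 and their mean requires only the Jaccard distance of the n-gram sets and the lengths of the representing strings.

def ned_tilde(jaccard_distance, len_x, len_y):
    """Approximate NED(x, y) from the Jaccard distance of the n-gram sets X, Y
    and the lengths of their representing strings x, y (mean of the Theorem 1 bounds)."""
    alpha = min(len_x, len_y) / max(len_x, len_y)
    lower = 1.0 - alpha
    upper = (1.0 + alpha) * jaccard_distance / (2.0 - jaccard_distance)
    return (lower + upper) / 2.0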

Moreover, in future work, the $\widetilde{NED}$ approximation may serve as the basis for a refinement mechanism for the clustering. The LSH clustering method described in the previous section provides a clustering of documents with respect to Jaccard similarity; we may refine the obtained clustering with respect to edit distance. The refinement will take place alongside the process of choosing medoids; in particular, for each cluster we will exclude those traces for which the value of $\widetilde{NED}$ is significantly higher than $J_D$, when measured from the cluster's medoid, and possibly assign them to another cluster with a closer medoid.

References

[1] Igor Santos, Jaime Devesa, Felix Brezo, Javier Nieves, and Pablo Garcia Bringas. OPEM: A static-dynamic approach for machine-learning-based malware detection. In International Joint Conference CISIS'12-ICEUTE'12-SOCO'12 Special Sessions, pages 271-280. Springer, 2013.

[2] Rafiqul Islam, Ronghua Tian, Lynn M. Batten, and Steve Versteeg. Classification of malware based on integrated static and dynamic features. Journal of Network and Computer Applications, 36(2):646-656, 2013.

[3] Youngjoon Ki, Eunjin Kim, and Huy Kang Kim. A novel approach to detect malware based on API call sequence analysis. International Journal of Distributed Sensor Networks, 11(6):659101, 2015.

[4] Sanchit Gupta, Harshit Sharma, and Sarvjeet Kaur. Malware characterization using Windows API call sequences. In International Conference on Security, Privacy, and Applied Cryptography Engineering, pages 271-280. Springer, 2016.

[5] Kingsly Leung and Christopher Leckie. Unsupervised anomaly detection in network intrusion detection using clusters. In Proceedings of the Twenty-Eighth Australasian Conference on Computer Science - Volume 38, pages 333-342. Australian Computer Society, Inc., 2005.

[6] Misty Blowers and Jonathan Williams. Machine learning applied to cyber operations. In Network Science and Cybersecurity, pages 155-175. Springer, 2014.

[7] Gilbert R. Hendry and Shanchieh J. Yang. Intrusion signature creation via clustering anomalies. In Data Mining, Intrusion Detection, Information Assurance, and Data Networks Security 2008, volume 6973, page 69730C. International Society for Optics and Photonics, 2008.

[8] Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. Mining of Massive Datasets. Cambridge University Press, 2014.

[9] Andrei Z. Broder. On the resemblance and containment of documents. In Compression and Complexity of Sequences 1997. Proceedings, pages 21-29. IEEE, 1997.

[10] Chris McCormick. MinHash tutorial with Python code. http://www.mccormickml.com, June 2015.

[11] Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Similarity search in high dimensions via hashing. In VLDB, volume 99, pages 518-529, 1999.

[12] Szymon Grabowski. New tabulation and sparse dynamic programming based techniques for sequence similarity problems. Discrete Applied Mathematics, 212:96-103, 2016.

[13] Gad M. Landau, Eugene W. Myers, and Jeanette P. Schmidt. Incremental string comparison. SIAM Journal on Computing, 27(2):557-582, 1998.

[14] Diptarka Chakraborty, Debarati Das, Elazar Goldenberg, Michal Koucky, and Michael Saks. Approximating edit distance within constant factor in truly sub-quadratic time. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 979-990. IEEE, 2018.

[15] Amir Abboud and Arturs Backurs. Towards hardness of approximation for polynomial time problems. In LIPIcs-Leibniz International Proceedings in Informatics, volume 67. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017.


The Relation Between Jaccard Similarity and Edit Distance in LSH Based Malware Detection

Thesis submitted in partial fulfillment of the requirements for the Master of Sciences degree

Mohammad Ghanayim

Ben-Gurion University of the Negev, 2018.

Abstract

In this work, we rely on methods from the field of data mining whose main purpose is finding similar items in large collections of textual data, in particular MinHashing and Locality Sensitive Hashing, in order to perform behavioral analysis and detection of malware. To this end, and following the misuse detection approach, we clustered a collection of API call traces of known malware; the clustering was performed with respect to the Jaccard similarity between pairs of traces. The partition obtained from the cluster analysis serves as the basis for a classifier that detects malicious/benign behavior of programs at runtime.

The computation required for finding the Jaccard similarity between a pair of documents is far more efficient than the computation required for finding their edit distance, also known as the Levenshtein distance (linear versus quadratic time). Therefore, in this work we examine the possibility of using Jaccard similarity as the basis for an approximate computation of edit distance. To this end, we present formulas that define, under certain conditions, the relation between Jaccard similarity and edit distance, and in particular define upper and lower bounds on the values of edit distance in terms of the values of Jaccard similarity.

A paper on this subject was presented at IEEE NCA 2017, the International Symposium on Network Computing and Applications, held in Cambridge, USA.

