
IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 13, NO. 10, OCTOBER 2018

Sparse Coding for N-Gram Feature Extraction and Training for File Fragment Classification

Felix Wang, Tu-Thach Quach, Jason Wheeler, James B. Aimone, and Conrad D. James

Abstract— File fragment classification is an important step in the task of file carving in digital forensics. In file carving, files must be reconstructed based on their content as a result of their fragmented storage on disk or in memory. Existing methods for classification of file fragments typically use hand-engineered features, such as byte histograms or entropy measures. In this paper, we propose an approach using sparse coding that enables automated feature extraction. Sparse coding, or sparse dictionary learning, is an unsupervised learning algorithm, and is capable of extracting features based simply on how well those features can be used to reconstruct the original data. With respect to file fragments, we learn sparse dictionaries for n-grams, continuous sequences of bytes, of different sizes. These dictionaries may then be used to estimate n-gram frequencies for a given file fragment, but for significantly larger n-gram sizes than are typically found in existing methods, which suffer from combinatorial explosion. To demonstrate the capability of our sparse coding approach, we used the resulting features to train standard classifiers, such as support vector machines, over multiple file types. Experimentally, we achieved significantly better classification results with respect to existing methods, especially when the features were used in supplement to existing hand-engineered features.

Index Terms— File fragment classification, file carving, automated feature extraction, sparse coding, dictionary learning, unsupervised learning, n-gram, support vector machine.

Manuscript received August 25, 2017; revised January 8, 2018 and February 20, 2018; accepted March 25, 2018. Date of publication April 5, 2018; date of current version May 14, 2018. This work was supported by Sandia National Laboratories' Laboratory Directed Research and Development (LDRD) Program and in part by the Hardware Acceleration of Adaptive Neural Algorithms Grand Challenge Project. Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy's National Nuclear Security Administration under Contract DE-NA0003525. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Xinpeng Zhang. (Corresponding author: Felix Wang.) The authors are with Sandia National Laboratories, Albuquerque, NM 87123 USA (e-mail: [email protected]). Digital Object Identifier 10.1109/TIFS.2018.2823697

I. INTRODUCTION

IN DIGITAL forensics, content-based analytics are often required to classify file fragments in the absence of other identifying information. During data recovery, for example, the fragmentation of files on damaged media or memory dumps results in files that appear corrupted or missing even though the data is present [1]. For file carving, in order to reconstruct the complete file from these fragments, analysts must determine which fragment might go with which file. Because the search space of where each fragment may potentially belong is so large, automated tools are indispensable in making this time-consuming process practical.

The main goal of file fragment classification is to determine which type of file (e.g. doc, html, jpg, pdf) a fragment belongs to from its content. In the literature, a variety of algorithms have been used for the classification of file fragments into their respective file types. These include support vector machines (SVMs), k-nearest neighbor (kNN), linear discriminant analysis (LDA), and artificial neural networks (ANNs) [2]-[8]. A common trend across these existing methods is the development of a large number of hand-engineered features over the file fragments, which are then used for classification. Example features are histograms of the bytes (unigrams) within a file fragment, as well as histograms of pairs of bytes (bigrams). More global features include the Shannon entropy over these unigrams and bigrams, or the compressed length of the file fragment [4]-[6]. Techniques originally from other areas of machine learning such as natural language processing (NLP) have also been adapted to file fragments, producing features such as the contiguity between bytes and the longest continuous streak of repeating bytes [3]. For distinguishing between higher entropy file types, features derived from statistical tests for randomness have been applied [7].

In contrast to hand-engineering features, we propose an approach for automated feature extraction from file fragments using sparse coding, also known as sparse dictionary learning (details in Section II-A). This approach has a number of advantages over hand-engineered features. Primarily, the features that are extracted through sparse coding are found in an unsupervised manner, as opposed to features that might need to be laboriously constructed to be suitable for a particular domain [9]. Furthermore, because the features extracted by this approach minimize reconstruction error, they capture a significant amount of information about the domain without needing prior domain knowledge. Another benefit of this approach, due to the sparsity constraint, is the extraction of specialized features targeted to each specific file type, or even within file types containing more complex internal structure (e.g. doc, pdf, zip) [10].

The remaining sections of the paper are organized as follows: in Section II we provide background on the sparse coding algorithm as well as the SVM classifier; in Section III we describe our experiments on how we use our sparse coding approach to extract features from file fragment data and train our classifier to differentiate among multiple file types; in Section IV we provide and discuss the results from our experiments; and in Section V we make concluding remarks.

II. BACKGROUND

A. Sparse Coding

Sparse coding, or sparse dictionary learning, is a way of modeling data by decomposing it into sparse linear combinations of elements of a given basis set [9], [11]. That is, a data vector x ∈ R^m may be approximated as multiplying a dictionary matrix D ∈ R^(m×k) with a sparse representation vector r ∈ R^k: x ≈ Dr. Here, a vector is said to be sparse when only a small fraction of the entries are nonzero. Although it is called sparse dictionary learning, the dictionary D is not necessarily sparse.

When dictionary D is known (e.g. a wavelet basis), a common method for finding an associated sparse vector is through regression analysis, which may be formulated as the L1-regularized optimization problem,

    l(x, D) = min_{r ∈ R^k} (1/2)||x − Dr||_2^2 + λ||r||_1        (1)

where the cost may be understood as the contributions of the reconstruction error (1/2)||x − Dr||_2^2 and a sparsity penalty λ||r||_1, and λ is a regularization parameter. This particular formulation is also known as the lasso [12].

When dictionary D is not known, or it is more desirable to learn a dictionary more representative of the data, an extended formulation to the above optimization problem gives the sparse coding cost function,

    f_n(X) = min_{D ∈ R^(m×k), R ∈ R^(k×n)} (1/n) Σ_{i=1}^{n} [ (1/2)||x_i − Dr_i||_2^2 + λ||r_i||_1 ]        (2)

where X = {x_1, ..., x_n} are the data vectors from which we want to learn a dictionary. Because the optimization occurs over both the dictionary D and the set of sparse representation vectors R = {r_1, ..., r_n}, most approaches to sparse coding iteratively fix one variable while minimizing the other. For learning dictionaries from large input data sets, online or streaming methods have been developed [13]. Regarding initialization, a common method for initializing the dictionary D is seeding it with random elements from the training set, and the sparse representation vectors R are initialized as zero vectors.

Sparse coding is typically used in domains where the information being coded contains a great deal of redundancy, such as representing images and other natural signals [14]. In the digital domain, where information is stored as byte data, one caveat is that the "signal" we wish to analyze is often represented in a relatively concise format for efficiency. We find that in spite of this, sparse coding is still capable of providing sufficient decompositions. That is, even in the digital domain, the sparse coding approach is able to extract representative byte patterns to provide a rich basis set with respect to the original byte data.

A comparison of the application of sparse coding of natural images to that of byte data is shown in Fig. 1. In contrast to image patches, where pixels are representative of the 2-dimensional space, the "patch" equivalent for byte data would be n-grams, which are 1-dimensional sequences of continuous bytes (resized in Fig. 1 to 2D for ease of visualization). Once the patches are extracted from the original data, the algorithm for sparse coding is the same in either domain. Differences are made apparent in the entries of the dictionaries that are learned, corresponding to the differences in how data is structured between the two different domains.

Fig. 1. Comparison of representative dictionaries learned from patches of natural images to that of byte data.

While the features that are extracted through sparse coding may potentially span relatively large spatial scales, the total number of elements in the dictionary may be relatively small in comparison. For example, due to redundancies in the signal, even for highly overcomplete dictionaries in image reconstruction applications, the number of elements in the dictionary is only greater by a moderate ratio to the number of pixels in the image patch (e.g. 2000 elements for a patch size of 20 × 20 = 400 pixels gives a ratio of 5 : 1) [9].

We may leverage this property to extend features such as n-gram frequencies beyond standard n-gram sizes without suffering from a combinatorial explosion in the feature space. To illustrate, if we take unigram frequencies as an example, it requires 256 elements for each possible byte. In order to extend this to pairs of bytes, or bigram frequencies, we may naively expand the feature space to 256^2 = 65,536 elements, and for trigram frequencies, this is expanded even more (256^3 elements). At the scale of a modest image patch size of 8 × 8 = 64 pixels that might be used in the sparse coding of images, expanding the feature space using the same technique for bytes (resulting in 256^64 elements) is highly impractical. Using sparse coding, however, where we instead model an n-gram as sparse linear combinations of representative byte patterns, we may drastically reduce the size of the feature space (i.e. to the size of the dictionary) while retaining relevant information about the byte composition at larger spatial scales.

B. Support Vector Machine

Support vector machines (SVMs) are a standard supervised machine learning method used to perform classification of labeled inputs [15]. To be consistent with the standard formulation, we reuse a bit of notation from the previous section on sparse coding. Here, input vectors x_i ∈ R^m for i = 1, ..., n are represented as transformed feature vectors y_i = φ(x_i) ∈ R^(m′) corresponding to points in a high dimensional space, where typically m′ ≫ m. For clarification, although the input vectors used by SVMs could potentially be the same as the data vectors for sparse coding, this is generally not the case. Rather, they are often composed of preprocessed features from the raw data.

For the separable two-class case, with class labels l_i = ±1, training an SVM equates to computing a hyperplane in the transformed high dimensional space,

    g(y_i) = b + w^T φ(x_i)        (3)

where b ∈ R and w ∈ R^(m′) give the offset and weight parameters, respectively, that satisfy g(y_i) > 0 for l_i = 1 and g(y_i) < 0 for l_i = −1. Additionally, we choose the parameters such that the hyperplane has the largest margin, or distance, from the nearest training points, and normalize ||w|| = 1 for uniqueness of solution.

Regarding the transformation function, φ(·), common candidates include polynomial, Gaussian, and various radial basis functions (RBFs). The reason a transformation is used is that the inputs are typically not linearly separable in the input space, and a projection onto a higher dimensional space is needed. For file fragment classification, however, existing work has shown that training an SVM directly in the input space (equivalent to using an identity transformation function) performs the best in practice, at least with respect to n-gram-like inputs [2].

For treatment of SVMs where the labeled points are not perfectly separable, as well as generalization to multiple classes, we encourage the reader to consult [15].

III. EXPERIMENTAL SETUP

A. File Fragment Dataset

We used our sparse coding approach to perform feature extraction and classification on a dataset derived from the govdocs1 file corpus, which aims to provide a more standardized dataset for digital forensics research [16]. Our dataset was derived in that we did not use the entirety of the govdocs1 file corpus, even though it contained a much larger number of file types, 63 total. This was because the distribution of files in the original corpus is rather unbalanced, with some file types (e.g. pdf) accounting for a substantial portion of the dataset, whereas other file types (e.g. py (python script)) only consisted of a handful of files. Rather, we randomly sampled from 15 common file types from the original corpus that had at least 1,000 files per type. These 15 file types were: csv, doc, gif, gz, html, jpg, pdf, png, ppt, ps, rtf, swf, txt, xls, and xml. Although certain OOXML container file types fell below this 1,000 files per type cutoff, due to their increasing relevance, we also included them in our experiments, increasing the total to 18 file types. These file types were: docx, pptx, and xlsx, at 163, 215, and 37 files per type, respectively. This resulted in a dataset containing a total of 15,415 files consisting of 18 different file types.

From this dataset, we obtained a balanced set of file fragments with respect to file type and used a training to testing ratio of 60 : 40. For each common file type, we first took a random subset of 750 files and then sampled 20 file fragments per file. For the OOXML file types, due to their reduced number, we sampled over multiple randomly ordered passes through the available files to reach the same count of 750 files. All together, we obtained a total of 270,000 file fragments across the 18 file types. Sampling from each file was performed by randomly choosing continuous subsequences of 512 bytes from the file without replacement, but allowing overlap between different subsequences. A fragment size of 512 bytes was selected in order to compare our approach with similar studies in the literature [3]. During sampling, subsequences containing bytes from the first 512 bytes of the file were omitted, as they frequently contain header information that identifies the file type. To prevent bias during classification, the file fragments used for training were drawn from different files altogether than those file fragments that were used for testing. Per the 60 : 40 ratio, training was performed on the file fragments of 450 files per file type, for a total of 162,000 file fragments, with the remaining 108,000 file fragments set aside for testing.

B. Learning Dictionaries of N-Grams

In sparse coding, or sparse dictionary learning, the goal is often to obtain an overcomplete basis set from the data vectors [14]. From the file fragment datasets described above, we learn a dictionary over n-grams similar to that of learning a dictionary from image patches or slices of other natural signals. Unlike natural signals, which contain a great deal of redundancy, byte data is often represented in a relatively concise format for efficiency. As a result, the ratio of the elements in our learned dictionaries to the size of the n-grams is significantly higher than what is more commonly found in the sparse coding literature [9]. For example, given an n-gram size of 64 bytes, a dictionary size of 1024 would give a ratio of 16 : 1.

For learning the dictionary, only n-grams from file fragments set aside for training were used. The dictionaries were learned using the spams sparse coding library, which provides a scalable online algorithm for dictionary learning based on stochastic gradient descent [13]. Learning occurred over two epochs (i.e. passes over the training set) and was performed by iterating over randomly shuffled and balanced batches of n-grams using a sparsity penalty of λ = 0.15 (a default value in spams). For each batch, 25 files per file type were randomly sampled, for a total of 500 file fragments per file type. From each file fragment, 64 randomly sampled n-grams were used for dictionary learning, giving a total of 32,000 n-grams per file type, or 576,000 n-grams per batch. The sampling of n-grams from file fragments was performed without replacement, but allowing for overlap between different n-grams. The choice of randomly sampling 64 n-grams, as opposed to using all possible (overlapping) n-grams from each fragment, was done more as a practical matter to save time during training.

In total, 15 dictionaries were learned, differentiated by the number of elements in the dictionary (1,024, 1,296, and 1,444 elements) as well as the size of the n-gram (4, 8, 16, 32, and 64 bytes). A sample learned dictionary is shown in Fig. 2.

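To make the dictionary-learning stage concrete, the sketch below learns an overcomplete dictionary from synthetic byte n-grams. It is a minimal illustration only, not the paper's implementation: it uses scikit-learn's MiniBatchDictionaryLearning (an online solver in the same spirit as the spams library used here), toy data in place of govdocs1 fragments, and a small dictionary (64 elements for 16-byte n-grams) rather than the sizes reported in Section III-B.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)

# Toy "file fragment" data standing in for govdocs1 n-grams.
# Low-entropy samples repeat short byte motifs; high-entropy ones are random.
low = np.tile(rng.integers(0, 256, size=(50, 4)), (1, 4))   # repeating 4-byte motifs
high = rng.integers(0, 256, size=(50, 16))                  # near-uniform bytes
ngrams = np.vstack([low, high]).astype(float) / 255.0       # 16-byte n-grams in [0, 1]

# Learn an overcomplete dictionary (64 elements for 16-byte n-grams, a 4:1 ratio),
# minimizing reconstruction error plus an L1 sparsity penalty as in Eq. (2).
dico = MiniBatchDictionaryLearning(
    n_components=64,               # dictionary size k
    alpha=0.15,                    # sparsity penalty lambda (the spams default noted above)
    transform_algorithm="lasso_lars",
    random_state=0,
)
codes = dico.fit(ngrams).transform(ngrams)  # sparse representation vectors r_i

print(dico.components_.shape)  # (64, 16): k dictionary elements of n-gram length m
print(codes.shape)             # (100, 64): one sparse code per n-gram
```

In the paper's setup, the analogous fit would run over balanced batches of n-grams drawn from the training fragments, for each of the five n-gram sizes and three dictionary sizes.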
Fig. 3. Sparse coding analysis pipeline for constructing n-gram feature sets.

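The coding-and-averaging step of the pipeline in Fig. 3 can be sketched as follows. This is a hedged illustration rather than the paper's implementation: it uses scikit-learn's SparseCoder with random unit-norm stand-in dictionaries in place of the learned ones, and the tier (maximum n-gram size of 16 bytes) and dictionary size (32 elements) are chosen small for readability.

```python
import numpy as np
from sklearn.decomposition import SparseCoder

rng = np.random.default_rng(1)

def ngram_frequencies(fragment, dictionary, n, alpha=0.15):
    """Estimate n-gram frequencies for one fragment: lasso-code every
    length-n window against the dictionary, then average the sparse codes."""
    grams = np.lib.stride_tricks.sliding_window_view(fragment, n) / 255.0
    coder = SparseCoder(dictionary=dictionary,
                        transform_algorithm="lasso_lars",
                        transform_alpha=alpha)
    codes = coder.transform(grams)      # one sparse vector per n-gram
    return np.abs(codes).mean(axis=0)   # per-dictionary-element "frequencies"

fragment = rng.integers(0, 256, size=512)  # a 512-byte file fragment

# Stand-in dictionaries with unit-norm rows; the real ones are learned (Sec. III-B).
feature_tiers = []
for n, k in [(4, 32), (8, 32), (16, 32)]:  # tier with maximum n-gram size 16
    D = rng.normal(size=(k, n))
    D /= np.linalg.norm(D, axis=1, keepdims=True)
    feature_tiers.append(ngram_frequencies(fragment, D, n))

features = np.concatenate(feature_tiers)   # concatenated tiered feature set
print(features.shape)                      # (96,)
```

Note that, unlike during dictionary learning, every overlapping n-gram in the fragment is coded, matching the description in Section III-C.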
Fig. 2. Sample learned dictionary of 1,024 elements by 16-byte n-grams (reshaped as 4 × 4 grids for visual clarity).

Although there are a few elements with identifiable patterns, the vast majority of the elements appear to be random. This result is not surprising, as the dictionary is learned from byte patches from files that have both low entropy (e.g. html) and high entropy (e.g. gz) with respect to their byte distributions.

C. N-Gram Frequencies

Once the dictionaries have been learned, we can use the standard lasso method (see Section II-A) to transform the file fragments from a collection of n-grams to a collection of sparse feature vectors. Here, unlike when sampling for dictionary learning, all n-grams from the file fragment are used. The transformed sparse feature vectors may then be averaged together to provide an estimate of n-gram frequencies based on the dictionary elements, allowing for significantly larger n-gram sizes than are typically found in existing methods, which suffer from combinatorial explosion. Here, instead of representing each possible n-gram combination as a "one-hot" entry in a 256^n feature vector, the n-gram frequencies are approximated according to the dictionary elements. This is because the sparse feature vectors provide a measure of how much each dictionary element contributes to the reconstruction of the n-grams in the file fragment.

To construct the feature sets to be used in classification, the n-gram frequencies for different n-gram sizes (4, 8, 16, 32, and 64 bytes) were concatenated. Instead of examining each possible combination, we constructed feature sets according to multiple tiers ordered by maximum n-gram size. This was done such that each progressively larger tier contained a superset of features with respect to the tiers before it (e.g. a feature set with n-gram frequencies up to 32 bytes also contained the frequencies for n-grams of 4, 8, and 16 bytes). Following this construction, we used three different tiers with maximum n-gram sizes of 16, 32, and 64 bytes. Taking into consideration each of the three dictionary sizes (1,024, 1,296, and 1,444 elements), this gives a total of 9 different feature sets. Although the number of dictionary elements could be adjusted arbitrarily with respect to the n-gram sizes, we chose to use the same size dictionaries for each of the concatenated feature vectors for simplicity. We found that, in spite of forgoing fine tuning at this level, we were able to achieve significantly better file fragment classification results compared to existing methods. An illustration of this analysis pipeline from learning the dictionaries to constructing the different feature sets is presented in Fig. 3.

We did not forgo existing methods altogether, however. Namely, we explored concatenating features such as unigram and bigram frequencies, as they have been shown to be very effective for file fragment classification [2]. Unlike the n-gram frequencies that were obtained through the proposed sparse coding approach, these features simply counted and normalized the number of occurrences of each unigram or bigram found in the file fragment, out of 256 and 65,536 possibilities, respectively. By concatenating the unigram and bigram frequencies, as a whole, to the feature sets above, we obtained 9 additional feature sets, for a total of 18 feature sets for comparison during classification.

D. Multiclass Classification

Classification was performed using a support vector machine (SVM) with a linear kernel using the liblinear library [17]. That is, the feature vectors that were obtained above were used to train the classifier directly, as opposed to first projecting them onto a higher dimensional space as typically performed during SVM training (see Section II-B). It has been shown that for features consisting of n-gram frequencies, training an SVM directly using a linear kernel generally outperforms using the default radial basis function kernels [2], [18]. Preliminary studies with our feature sets were consistent with this trend.

In order to attribute the differences in performance to the feature set being used during training as opposed to classifier parameterization, we compared the results across a few different parameterizations. Fortunately, there are very few parameters relevant for training linear kernel SVMs. In particular, a regularization parameter, C, is used to manage classification error for non-separable training points. Incidentally, solvers in liblinear are not very sensitive to C, especially when C is above a certain value [17]. The other parameters, such as the choice of loss function and ε, a stopping criterion for the solver, are typically left at their default values. For our parameterization, we varied C between values of 1, 32, and 256, where C = 1 is the default parameterization for the solver and is what was used in a previous study by Fitzgerald et al. [3], and C = 256 is consistent with the "winning" classifier identified in a previous study by Beebe et al. [2]. The value C = 32, also used in the previous study by Beebe et al., was chosen as an intermediate value. For the other parameters, we simply used the default ε = 0.01 and the L2-regularized L2 loss function.

E. Replication Studies

In order to attribute differences in performance across studies to the methods being used as opposed to the ease of classification between datasets, we also replicated a couple of existing studies with respect to our dataset. Perhaps the most straightforward is a study by Beebe et al., which found that simply training a linear kernel SVM on a feature set consisting of only unigram and bigram frequencies produced good classification results on a variety of file and data types [2]. For this, we effectively replicate their "winning" classifier by training according to the consistent parameterization with C = 256. Although we do not use their software tool, Sceadan, which enables systematic comparison among several additional features, we use the same underlying library, liblinear, as well as identical normalization of the unigram and bigram frequencies. In doing so, we may provide a more direct comparison to our approach, because although the original study also used files from the govdocs1 file corpus, it was supplemented by additional file and data types for a total of 30 different file types as well as 8 different data types. Furthermore, some file types such as rtf and swf were excluded from their dataset.

We also replicated the method outlined in a study by Fitzgerald et al. [3], which also used files drawn from the govdocs1 file corpus, with a total of 24 different file types. In addition to unigram and bigram frequencies, this study also used features inspired by natural language processing (NLP). To briefly summarize, these additional features included the Shannon entropy of the bigram counts, the Hamming weight (i.e. the percentage of 'one' bits), the mean byte value, the compressed length of the file fragment (using the bzip2 algorithm), the average contiguity between bytes, and the longest continuous streak of repeating bytes. Here, the most direct comparison to the original study comes from using the default parameterization of C = 1 for the solver.

IV. EXPERIMENTAL RESULTS

A. Classification Results

The entire experimental setup (detailed in Section III), from sampling file fragments from the dataset to classification, was performed over 10 independent runs. From the classification results, we evaluated the performance of our approach using a number of metrics. In addition to the raw prediction accuracy, we also computed the F1 score, which provides a weighted average of the precision and recall of the classifier [19]. Because our dataset was balanced, both the micro-averaged (computed globally over all classes) and macro-averaged (computed individually for each class and then weighted) F1 scores were identical.

The classification results averaged over 10 runs for each of the different feature sets and parameterizations are presented in TABLE I. The different feature sets used are as follows. The first row (Uni/Bigram Only) gives results for the classifier trained solely on the unigram and bigram frequencies, replicating the study by Beebe et al. for our dataset [2]. The second row (Uni/Bigram + NLP) gives results for the classifier trained on unigram and bigram frequencies along with various NLP-inspired features, replicating the study by Fitzgerald et al. for our dataset [3]. The remaining rows give results for feature sets containing concatenated n-gram frequencies obtained using our proposed sparse coding approach, both standalone (Sparse N-Gram Only) and alongside unigram and bigram frequencies (Uni/Bigram + Sparse N-Gram). The sizes of the dictionaries used are given (1,024, 1,296, or 1,444 elements), as are the maximum n-gram sizes (16, 32, or 64 bytes) according to the tiered composition outlined in Section III-C.

The best performing feature set was the concatenation of the unigram and bigram frequencies and the full set of n-gram frequencies (4 through 64 bytes) with a dictionary size of 1,444 elements. With respect to the parameterization of the linear kernel SVM, performance was split between C = 32 and C = 256, with the highest average prediction accuracy of 61.31% from C = 256, and the highest average F1 score of 60.74% from C = 32. Looking at the performance on the other feature sets, and given that the F1 score provides a rough measure of how balanced a classifier is, we may consider the parameterization of C = 32 to be the best overall. A sample confusion matrix (with a prediction accuracy of 60.99%) from one of the runs of this classifier is given in TABLE II. When omitting the unigram and bigram frequencies and only using the features obtained through our proposed sparse coding approach, we achieved an average prediction accuracy of 56.59% and an average F1 score of 55.86%.

Compared to existing work on file fragment classification, as well as found during our replication studies, we found that the features obtained using our sparse coding approach performed significantly better, especially when the features were used in supplement to existing hand-engineered features. From the literature, an early study by Veenman using Fisher's linear discriminant on features such as byte histograms and information theoretic measures such as entropy achieved an average prediction accuracy of 45% on 11 file types [6]. Introducing the normalized compression distance as a feature, a study by Axelsson using k-nearest neighbor achieved an average prediction accuracy of 34% on 34 file types [4]. Although we do not compare against these studies directly, an indirect comparison may be considered by way of the replicated studies [2], [3].

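For reference, the hand-engineered baselines from the replication studies can be sketched as below. The unigram/bigram counts follow the construction described for [2], and the remaining scalars approximate the NLP-inspired features of [3]; as noted in Section III-E, the exact normalization used in the original studies is not fully specified, so the normalizations here are assumptions.

```python
import bz2
import numpy as np

def baseline_features(fragment):
    """Hand-engineered features: unigram/bigram frequencies plus
    NLP-inspired scalars (entropy, Hamming weight, mean byte,
    bzip2 length, contiguity, longest repeating-byte streak)."""
    b = np.frombuffer(fragment, dtype=np.uint8)
    n = len(b)

    unigrams = np.bincount(b, minlength=256) / n
    pairs = b[:-1].astype(np.int32) * 256 + b[1:]
    bigrams = np.bincount(pairs, minlength=65536) / (n - 1)

    p = bigrams[bigrams > 0]
    entropy = -np.sum(p * np.log2(p))          # Shannon entropy of bigram counts
    hamming = np.unpackbits(b).mean()          # fraction of 'one' bits
    mean_byte = b.mean()
    comp_len = len(bz2.compress(fragment))     # bzip2 compressed length
    contiguity = np.abs(np.diff(b.astype(np.int32))).mean()
    # longest continuous streak of a repeating byte
    change = np.flatnonzero(np.diff(b) != 0)
    runs = np.diff(np.concatenate(([-1], change, [n - 1])))
    longest_streak = runs.max()

    scalars = [entropy, hamming, mean_byte, comp_len, contiguity, longest_streak]
    return np.concatenate([unigrams, bigrams, scalars])

frag = bytes(range(256)) * 2                   # a toy 512-byte fragment
feats = baseline_features(frag)
print(feats.shape)                             # (65798,): 256 + 65,536 + 6
```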
TABLE I AVERAGED CLASSIFICATION METRICS FOR DIFFERENT FEATURE SETS AND SVM TRAINING PARAMETERIZATION

TABLE II CONFUSION MATRIX OF CLASSIFIER USING THE BEST PERFORMING FEATURE SET (CONCATENATION OF THE UNIGRAM AND BIGRAM FREQUENCIES AND THE FULL SET OF N-GRAM FREQUENCIES (4 THROUGH 64 BYTES) WITH A DICTIONARY SIZE OF 1,444 ELEMENTS) AND THE BEST OVERALL PARAMETERIZATION OF C = 32 FOR THE LINEAR KERNEL SVM

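The linear-kernel SVM comparison of Section III-D can be sketched with scikit-learn's LinearSVC, which wraps the same liblinear library. The balanced synthetic data here is a stand-in for the real fragment features (an assumption for illustration); the C values, tolerance ε = 0.01, and the L2-regularized squared-hinge loss mirror the parameterization described above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, f1_score

# Balanced multiclass stand-in for the 18-class fragment feature sets.
X, y = make_classification(n_samples=1800, n_features=64, n_informative=32,
                           n_classes=6, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.6,  # 60:40 split
                                          stratify=y, random_state=0)

for C in (1, 32, 256):  # the three parameterizations compared in Sec. III-D
    # LinearSVC defaults to L2-regularized squared-hinge loss (liblinear backend).
    clf = LinearSVC(C=C, tol=0.01, max_iter=20000, random_state=0).fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    acc = accuracy_score(y_te, pred)
    # On a balanced test set, micro- and macro-averaged F1 nearly coincide.
    f1_micro = f1_score(y_te, pred, average="micro")
    f1_macro = f1_score(y_te, pred, average="macro")
    print(f"C={C}: acc={acc:.3f} f1_micro={f1_micro:.3f} f1_macro={f1_macro:.3f}")
```

Note that for single-label multiclass problems, micro-averaged F1 is identical to raw accuracy, which is why the paper reports the two F1 averages separately.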
With respect to the replicated studies, we found that the "winning" approach found by Beebe et al., which simply concatenated unigram and bigram frequencies to train a linear kernel SVM, achieved an average prediction accuracy of 53.31% for C = 256 over the 18 file types of our dataset. Notably, this result is significantly lower for our dataset than what was found in the original study of 73.4%, which consisted of a dataset containing 30 file types and 8 data types [2]. Due to the identical construction with respect to the unigram and bigram features from file fragments, as well as the identical parameterization of the linear kernel SVM used for training and testing, we posit that the main difference in performance was due to the differences in the composition of the datasets used more than anything else. Another possibility, mentioned in the limitations of the original study, is biasing due to the lack of separation of files used during the extraction of fragments between training and test sets, whereas for our experiments, file fragments between training and test sets were chosen from different files altogether.

For the approach proposed by Fitzgerald et al., which introduced features inspired from NLP, we achieved an average prediction accuracy of 53.59% for C = 1. Also notably, this result is higher for our dataset than what was found in the original study of 49.1%, which consisted of a dataset containing 24 file types [3]. Although the algorithms used to construct the NLP type features were provided in the original study, we cannot quite claim the replicated features were identical, as specifics on how the normalization was performed were not made explicit. In spite of this, as with the replicated study above, we posit that the main difference in performance was simply due to the differences between the datasets used. Primarily, we found that when performing a cross comparison with respect to our dataset, the performance when using this latter feature set containing NLP-inspired features was similar to, or even slightly better than, the performance of just the unigram and bigram frequencies on their own.

B. Timing Analysis

We also performed a rough timing analysis to provide a measure of the scalability and practicality of the sparse coding approach. All of our experiments were run on a multi-core machine with 16 physical processing cores and a clock rate of 1.2 GHz per core. Because timing is heavily dependent on the computational resources, these values are only capable of providing a rough measure. Here, we give values for the most computationally expensive feature set, which consisted of all five levels of n-gram frequencies (4 through 64 bytes) with a size of 1,444 elements per dictionary.

Given a file fragment, there are two main computational phases needed to produce its classification. First, transforming file fragments into a collection of sparse feature vectors, the lasso method provided by the spams sparse coding library averaged roughly 840 min to process all 270,000 file fragments from both the training and testing sets, which corresponds to roughly 187 ms per file fragment [13]. Second, training the linear kernel SVM using liblinear averaged roughly 130 min for the 162,000 file fragments in the training set, which corresponds to roughly 48 ms per file fragment [17]. Subsequently, the classification from the feature set of a file fragment to its file type is very quick.

In practice, the biggest drawback currently to this approach is the amount of time it takes to extract the feature set from a given file fragment, especially considering the size of storage available (in the terabyte range). Fortunately, this phase is highly data parallel, and although this means that computation time scales linearly with the number of file fragments, it will also scale inversely with additional computational resources (e.g. additional CPU cores). The development of next-generation computational hardware accelerators such as resistive memory crossbar arrays may further mitigate this issue due to their ability to accelerate linear algebra operations used by sparse coding [20], [21].

C. Discussion

From these results, we see that fragments more likely to contain identifying patterns, such as the tags in xml, the commas in csv, or commands in ps, were those most easily classified. With respect to the sparse coding approach, the fragments of these file types were most likely to be reconstructed using recurring dictionary elements. At the other end, the fragments that gave the classifier the most difficulty were those that possessed relatively higher entropy with respect to the byte distribution, act as containers for other files, or both (e.g. gz, png, pdf, ppt, swf). In particular, we note that file types such as gz and png employ the common DEFLATE compression algorithm for their encoding. As a result, these fragments were more often confused with each other than confused with those more easily classified file types. This general behavior when confronted with high entropy fragments is similar to what has been noted in previous studies [3], [7].

Drawing comparisons on a per file type basis, we contrast confusion matrices (for the same experimental run) between the best performing feature set in TABLE II with the replicated method from Beebe et al. (with a prediction accuracy of 53.10%) in TABLE III, which only used the unigram and bigram frequencies [2]. The differences between these two classifiers are given in TABLE IV, where better prediction accuracy relative to the replicated method is indicated by positive values along the diagonal and negative values elsewhere. For brevity during the comparison, we will refer to the feature set containing both the extended n-gram frequencies and the unigram and bigram frequencies as "full", and the feature set only containing the unigram and bigram frequencies as "reduced". At a glance, we see that performance per file type is better for the classifier trained on the full feature set almost across the board. In fact, performance along the diagonal ranged from comparable to significantly better for every single file type except for jpg. In the case of jpg, we note that the classifier trained on the reduced feature set traded error on the off-diagonal, resulting in an overall worse performance. To contrast, when training on the full feature set, we see that these errors are much better balanced, with classification biased toward the diagonal.

To expand a bit on the issue of classification balance, especially in the case of higher entropy file types (e.g. gif, gz) where discernible features are more difficult to identify, a noticeable difference in performance between the two feature sets was the misclassification biasing toward one or two of the high entropy file types when otherwise uncertain. For the reduced feature set, this bias is most pronounced in the gif file type, with other file types such as png misclassifying to it even more than classifying correctly to itself. This bias toward file types is consistent with the results of the replicated

TABLE III
CONFUSION MATRIX OF CLASSIFIER USING A FEATURE SET CONTAINING ONLY UNIGRAM AND BIGRAM FREQUENCIES

TABLE IV
DIFFERENCE OF CONFUSION MATRICES COMPARING THE CLASSIFICATION PERFORMANCE BETWEEN TRAINING USING FULL AND REDUCED FEATURE SETS (TABLES II AND III, RESPECTIVELY)

study [2]. For the full feature set, although the errors are spread around more evenly than with the reduced feature set, there is still noticeable biasing toward the gz file type, with file types such as png misclassifying to it at a greater rate than to other high entropy file types. Overall, the results here indicate that training the classifier using the extended n-gram frequencies results in more favorable classification performance.

Regarding the container file types (e.g. pdf, pptx), we find that misclassification errors generally bias toward the more “fundamental” file types which may be embedded in them (e.g. images such as jpg or png), potentially as compressed data. This can be especially noted in the case of the presentation file types of ppt and pptx, which commonly include images. Here, we see misclassification to potentially embedded file types at much higher rates than with other file types.
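The per-type comparison of TABLE IV is the elementwise difference of two confusion matrices, so per-type improvement can be read off its diagonal. A minimal sketch of this comparison, using small hypothetical matrices rather than the paper's 18-type results:

```python
import numpy as np

# Hypothetical confusion matrices over three file types (rows: true type,
# columns: predicted type), standing in for the "full" and "reduced"
# feature sets of TABLES II and III; the numbers are illustrative only.
full = np.array([[80, 15,  5],
                 [10, 85,  5],
                 [ 5, 10, 85]])
reduced = np.array([[70, 20, 10],
                    [15, 75, 10],
                    [10, 20, 70]])

# Difference matrix in the style of TABLE IV: positive values along the
# diagonal (and negative values elsewhere) indicate better per-type
# prediction accuracy for the full feature set.
diff = full - reduced

# Per-type accuracy gain, normalized by the number of true fragments of
# each type (the row sums).
gain = np.diag(diff) / full.sum(axis=1)
print(diff)
print(gain)  # → [0.1  0.1  0.15]
```

Because both matrices count the same fragments, each row of the difference sums to zero: any gain on the diagonal is exactly offset by reduced off-diagonal misclassification.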

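The two computational phases described in the timing analysis (sparse coding of n-grams via lasso, then a linear kernel SVM) can be sketched as follows. This is a minimal, hypothetical illustration: scikit-learn's SparseCoder and LinearSVC (which is backed by liblinear) stand in for the spams and liblinear usage in the paper, and the dictionary, fragments, and parameters below are toy assumptions rather than the paper's learned dictionaries and dataset.

```python
import numpy as np
from sklearn.decomposition import SparseCoder
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

n = 4          # n-gram size in bytes (the paper spans 4 through 64)
n_atoms = 32   # dictionary size (the paper uses 1,444 elements per dictionary)

# Stand-in for a learned n-gram dictionary: in the paper it is trained with
# online dictionary learning; here it is just random unit-norm atoms.
dictionary = rng.standard_normal((n_atoms, n))
dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)

coder = SparseCoder(dictionary=dictionary,
                    transform_algorithm='lasso_cd', transform_alpha=0.1)

def fragment_features(fragment: bytes) -> np.ndarray:
    """Phase 1: sparse-code the fragment's n-grams via lasso, then pool the
    absolute coefficients to estimate n-gram frequencies."""
    grams = np.array([list(fragment[i:i + n])
                      for i in range(0, len(fragment) - n + 1, n)], dtype=float)
    codes = coder.transform(grams / 255.0)  # one sparse vector per n-gram
    return np.abs(codes).mean(axis=0)

# Toy "fragments" of two synthetic types (low-valued vs. high-valued bytes).
frags = [bytes(rng.integers(0, 32, 64).tolist()) for _ in range(20)] + \
        [bytes(rng.integers(224, 256, 64).tolist()) for _ in range(20)]
labels = [0] * 20 + [1] * 20

X = np.stack([fragment_features(f) for f in frags])

# Phase 2: train a linear kernel SVM on the pooled sparse codes.
clf = LinearSVC(C=256).fit(X, labels)
print(clf.score(X, labels))
```

In the paper's setting, a dictionary would instead be learned for each n-gram size from 4 through 64 bytes, and the resulting frequency estimates would be concatenated (optionally with the hand-engineered unigram and bigram frequencies) before training.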
To revisit the misclassification errors biasing toward jpg from earlier, we see that when using the reduced feature set, the ppt file type is misclassified to jpg at about the same rate as classifying correctly to itself, and the pptx file type is overwhelmingly misclassified to jpg. Another significant biasing is with the xlsx file type being overwhelmingly misclassified to gif, likely due to both resembling compressed data. Although we see a similar trend when using the full feature set to train the classifier, we note that the effects are significantly reduced, with the ppt file type being classified correctly to itself more than to jpg, and the pptx file type slightly above an equal classification to misclassification rate to jpg. From this, we can conclude that the additional information (potentially features related to metadata) made available by the extended n-gram frequencies for these container file types is helpful in differentiating them from the file types they may contain.

Overall, we see that the most significant improvements when using the extended n-gram frequencies over just the unigram and bigram frequencies are for file types with more complex internal structure, such as the container file types. Regarding other file types with internal structure such as html or xml, where common byte sequences (e.g. tags) may extend to more than just two bytes, we find that our approach yields moderate improvement, as well as reducing misclassification error between them. Taken together, these results indicate that our approach is able to capture information on a broader scale due to the estimation of larger n-grams, and, furthermore, that this information is useful for improving classification performance.

V. CONCLUSIONS

We have proposed an approach for file fragment classification using sparse dictionary learning to estimate n-gram frequencies for a given file fragment. Notably, the dictionaries learned using this approach can be used to extend the feature set of n-gram frequencies without suffering from combinatorial explosion, because n-grams are represented as the reconstruction of sparse vectors as opposed to being expressed exactly. Although the features extracted by this approach are obtained in an unsupervised manner, they are able to capture a significant amount of the information present in the byte patterns without needing prior domain knowledge. In fact, due to the sparsity constraint, the approach tends toward the extraction of redundant byte patterns specific to individual file types. Experimentally, we found that these features yielded significantly better classification results with respect to existing methods, especially when the features were used in supplement to existing hand-engineered features.

Although the proposed sparse coding approach achieves competitive file fragment classification on its own, leveraging domain expertise may yield still more accurate classifiers. For example, domain knowledge may inform the fine tuning of various model parameters, such as the ratio between the size of the n-grams to estimate and the size of the dictionaries to learn. With respect to algorithmic design choices, one of the limitations of the approach presented in the paper is that even though we may extend the size of n-grams beyond traditional methods, simply using features such as n-gram frequencies ignores the relationships between consecutive n-grams. To this end, an area to explore is methods that may capture more hierarchical or global information about the file fragment. Here, algorithms that are recurrent in nature and efficient in classifying sequences, such as long short-term memory networks, may prove useful [22], [23]. Alternative methods for the automated extraction of n-gram features should also be explored. For example, the use of sparse autoencoders may result in reduced computation time, but perhaps at the cost of prediction accuracy [24].

REFERENCES

[1] S. L. Garfinkel, “Carving contiguous and fragmented files with fast object validation,” Digit. Invest., vol. 4, pp. 2–12, Sep. 2007. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1742287607000369
[2] N. L. Beebe, L. A. Maddox, L. Liu, and M. Sun, “Sceadan: Using concatenated n-gram vectors for improved file and data type classification,” IEEE Trans. Inf. Forensics Security, vol. 8, no. 9, pp. 1519–1530, Sep. 2013.
[3] S. Fitzgerald, G. Mathews, C. Morris, and O. Zhulyn, “Using NLP techniques for file fragment classification,” Digit. Invest., vol. 9, pp. S44–S49, Aug. 2012. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1742287612000333
[4] S. Axelsson, “The normalised compression distance as a file fragment classifier,” Digit. Invest., vol. 7, pp. S24–S31, Aug. 2010. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1742287610000319
[5] G. Conti et al., “Automated mapping of large binary objects using primitive fragment type classification,” Digit. Invest., vol. 7, pp. S3–S12, Aug. 2010. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1742287610000290
[6] C. J. Veenman, “Statistical disk cluster classification for file carving,” in Proc. 3rd Int. Symp. Inf. Assurance Secur., Aug. 2007, pp. 393–398.
[7] P. Penrose, R. Macfarlane, and W. J. Buchanan, “Approaches to the classification of high entropy file fragments,” Digit. Invest., vol. 10, no. 4, pp. 372–384, Aug. 2013. [Online]. Available: https://doi.org/10.1016%2Fj.diin.2013.08.004
[8] J. A. Cox, C. D. James, and J. B. Aimone, “A signal processing approach for cyber data classification with deep neural networks,” Procedia Comput. Sci., vol. 61, pp. 349–354, Nov. 2015. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1877050915029865
[9] H. Lee, A. Battle, R. Raina, and A. Y. Ng, “Efficient sparse coding algorithms,” in Advances in Neural Information Processing Systems 19, B. Schölkopf, J. Platt, and T. Hoffman, Eds. Cambridge, MA, USA: MIT Press, 2006, pp. 801–808. [Online]. Available: http://books.nips.cc/papers/files/nips19/NIPS2006_0878.pdf
[10] V. Roussev and S. L. Garfinkel, “File fragment classification: The case for specialized approaches,” in Proc. 4th Int. IEEE Workshop Syst. Approaches Digit. Forensic Eng. (SADFE), Washington, DC, USA, May 2009, pp. 3–14. [Online]. Available: http://dx.doi.org/10.1109/SADFE.2009.21
[11] V. M. Patel and R. Chellappa, “Sparse representations, compressive sensing and dictionaries for pattern recognition,” in Proc. Asian Conf. Pattern Recognit. (ACPR), Nov. 2011, pp. 325–329.
[12] R. Tibshirani, “Regression shrinkage and selection via the lasso,” J. Roy. Statist. Soc. B (Methodological), vol. 58, no. 1, pp. 267–288, 1996.
[13] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online learning for matrix factorization and sparse coding,” J. Mach. Learn. Res., vol. 11, pp. 19–60, Mar. 2010.
[14] B. A. Olshausen and D. J. Field, “Sparse coding with an overcomplete basis set: A strategy employed by V1?” Vis. Res., vol. 37, no. 23, pp. 3311–3325, 1997. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0042698997001697
[15] V. N. Vapnik, Statistical Learning Theory. Hoboken, NJ, USA: Wiley, 1998.
[16] S. Garfinkel, P. Farrell, V. Roussev, and G. Dinolt, “Bringing science to digital forensics with standardized forensic corpora,” Digit. Invest., vol. 6, pp. S2–S11, Sep. 2009. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1742287609000346
[17] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, “LIBLINEAR: A library for large linear classification,” J. Mach. Learn. Res., vol. 9, pp. 1871–1874, Jun. 2008.
[18] Q. Li, A. Y. Ong, P. N. Suganthan, and V. L. L. Thing, “A novel support vector machine approach to high entropy data fragment classification,” in Proc. South African Inf. Security Multi-Conf. (SAISMC), 2010, pp. 236–247.
[19] C. J. V. Rijsbergen, Information Retrieval, 2nd ed. Newton, MA, USA: Butterworth-Heinemann, 1979.
[20] P. M. Sheridan, C. Du, and W. D. Lu, “Feature extraction using memristor networks,” IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 11, pp. 2327–2336, Nov. 2016.
[21] S. Agarwal et al., “Energy scaling advantages of resistive memory crossbar based computation and its application to sparse coding,” Frontiers Neurosci., vol. 9, p. 484, Jan. 2016. [Online]. Available: http://journal.frontiersin.org/article/10.3389/fnins.2015.00484
[22] Z. C. Lipton, J. Berkowitz, and C. Elkan. (2015). “A critical review of recurrent neural networks for sequence learning.” [Online]. Available: https://arxiv.org/abs/1506.00019
[23] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997. [Online]. Available: http://dx.doi.org/10.1162/neco.1997.9.8.1735
[24] A. Ng, “Sparse autoencoder,” Lect. Notes, vol. 72, pp. 1–9, Jul. 2011.

Felix Wang, photograph and biography not available at the time of publication.

Tu-Thach Quach, photograph and biography not available at the time of publication.

Jason Wheeler, photograph and biography not available at the time of publication.

James B. Aimone, photograph and biography not available at the time of publication.

Conrad D. James, photograph and biography not available at the time of publication.