
IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 13, NO. 10, OCTOBER 2018

Sparse Coding for N-Gram Feature Extraction and Training for File Fragment Classification

Felix Wang, Tu-Thach Quach, Jason Wheeler, James B. Aimone, and Conrad D. James

Abstract— File fragment classification is an important step in the task of file carving in digital forensics. In file carving, files must be reconstructed based on their content as a result of their fragmented storage on disk or in memory. Existing methods for classification of file fragments typically use hand-engineered features, such as byte histograms or entropy measures. In this paper, we propose an approach using sparse coding that enables automated feature extraction. Sparse coding, or sparse dictionary learning, is an unsupervised learning algorithm, and is capable of extracting features based simply on how well those features can be used to reconstruct the original data. With respect to file fragments, we learn sparse dictionaries for n-grams, continuous sequences of bytes, of different sizes. These dictionaries may then be used to estimate n-gram frequencies for a given file fragment, but for significantly larger n-gram sizes than are typically found in existing methods, which suffer from combinatorial explosion. To demonstrate the capability of our sparse coding approach, we used the resulting features to train standard classifiers, such as support vector machines, over multiple file types. Experimentally, we achieved significantly better classification results with respect to existing methods, especially when the features were used in supplement to existing hand-engineered features.

Index Terms— File fragment classification, file carving, automated feature extraction, sparse coding, dictionary learning, unsupervised learning, n-gram, support vector machine.

Manuscript received August 25, 2017; revised January 8, 2018 and February 20, 2018; accepted March 25, 2018. Date of publication April 5, 2018; date of current version May 14, 2018. This work was supported by Sandia National Laboratories' Laboratory Directed Research and Development (LDRD) Program and in part by the Hardware Acceleration of Adaptive Neural Algorithms Grand Challenge Project. Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy's National Nuclear Security Administration under Contract DE-NA0003525. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Xinpeng Zhang. (Corresponding author: Felix Wang.) The authors are with Sandia National Laboratories, Albuquerque, NM 87123 USA (e-mail: [email protected]). Digital Object Identifier 10.1109/TIFS.2018.2823697

I. INTRODUCTION

IN DIGITAL forensics, content-based analytics are often required to classify file fragments in the absence of other identifying information. During data recovery, for example, the fragmentation of files on damaged media or memory dumps results in files that appear corrupted or missing even though the data is present [1]. For file carving, in order to reconstruct the complete file from these fragments, analysts must determine which fragment might go with which file. Because the search space of where each fragment may potentially belong is so large, automated tools are indispensable in making this time-consuming process practical.

The main goal of file fragment classification is to determine which type of file (e.g. doc, html, jpg, pdf) a fragment belongs to from its content. In the literature, a variety of algorithms have been used for the classification of file fragments into their respective file types. These include support vector machines (SVMs), k-nearest neighbor (kNN), linear discriminant analysis (LDA), and artificial neural networks (ANNs) [2]-[8]. A common trend across these existing methods is the development of a large number of hand-engineered features over the file fragments, which are then used for classification. Example features are histograms of the bytes (unigrams) within a file fragment, as well as histograms of pairs of bytes (bigrams). More global features include the Shannon entropy over these unigrams and bigrams, or the compressed length of the file fragment [4]-[6]. Techniques originally from other areas of machine learning such as natural language processing (NLP) have also been adapted to file fragments, producing features such as the contiguity between bytes and the longest continuous streak of repeating bytes [3]. For distinguishing between higher entropy file types, features derived from statistical tests for randomness have been applied [7].

In contrast to hand-engineering features, we propose an approach for automated feature extraction from file fragments using sparse coding, also known as sparse dictionary learning (details in Section II-A). This approach has a number of advantages over hand-engineered features. Primarily, the features that are extracted through sparse coding are found in an unsupervised manner, as opposed to features that might need to be laboriously constructed to be suitable for a particular domain [9]. Furthermore, because the features extracted by this approach minimize reconstruction error, they capture a significant amount of information about the domain without needing prior domain knowledge. Another benefit of this approach, due to the sparsity constraint, is the extraction of specialized features targeted to each specific file type, or even within file types containing more complex internal structure (e.g. doc, pdf, zip) [10].

The remaining sections of the paper are organized as follows: in Section II we provide background on the sparse coding algorithm as well as the SVM classifier; in Section III we describe our experiments on how we use our sparse coding approach to extract features from file fragment data and train our classifier to differentiate among multiple file types; in Section IV we provide and discuss the results from our experiments; and in Section V we make concluding remarks.

II. BACKGROUND

A. Sparse Coding

Sparse coding, or sparse dictionary learning, is a way of modeling data by decomposing it into sparse linear combinations of elements of a given basis set [9], [11]. That is, a data vector x ∈ R^m may be approximated as multiplying a dictionary matrix D ∈ R^(m×k) with a sparse representation vector r ∈ R^k: x ≈ Dr. Here, a vector is said to be sparse when only a small fraction of the entries are nonzero. Although it is called sparse dictionary learning, the dictionary D is not necessarily sparse.

When dictionary D is known (e.g. a wavelet basis), a common method for finding an associated sparse vector is through regression analysis, which may be formulated as the L1-regularized optimization problem,

    l(x, D) = min_{r ∈ R^k} (1/2)||x − Dr||_2^2 + λ||r||_1        (1)

where the cost may be understood as the contributions of the reconstruction error (1/2)||x − Dr||_2^2 and a sparsity penalty λ||r||_1, and λ is a regularization parameter. This particular formulation is also known as the lasso [12].

When dictionary D is not known, or it is more desirable to learn a dictionary more representative of the data, an extended formulation to the above optimization problem gives the sparse coding cost function,

    f_n(X) = min_{D ∈ R^(m×k), R ∈ R^(k×n)} (1/n) Σ_{i=1}^{n} [ (1/2)||x_i − Dr_i||_2^2 + λ||r_i||_1 ]        (2)

where X = {x_1, ..., x_n} are the data vectors from which we want to learn a dictionary. Because the optimization occurs over both the dictionary D and the set of sparse representation vectors R = {r_1, ..., r_n}, most approaches to sparse coding iteratively fix one variable while minimizing the other. For learning dictionaries from large input data sets, online or streaming methods have been developed [13]. Regarding initialization, a common method for initializing the dictionary D is seeding it with random elements from the training set, and the sparse representation vectors R are initialized as zero vectors.

Sparse coding is typically used in domains where the information being coded contains a great deal of redundancy, such as representing images and other natural signals [14]. In the digital domain, where information is stored as byte data, one caveat is that the "signal" we wish to analyze is often represented in a relatively concise format for efficiency. We find that in spite of this, sparse coding is still capable of providing sufficient decompositions. That is, even in the digital domain, the sparse coding approach is able to extract representative byte patterns to provide a rich basis set with respect to the original byte data.

A comparison of the application of sparse coding of natural images to that of byte data is shown in Fig. 1. In contrast to image patches, where pixels are representative of the 2-dimensional space, the "patch" equivalent for byte data would be n-grams, which are 1-dimensional sequences of continuous bytes (resized in Fig. 1 to 2D for ease of visualization). Once the patches are extracted from the original data, the algorithm for sparse coding is the same in either domain. Differences are made apparent in the entries of the dictionaries that are learned, corresponding to the differences in how data is structured between the two different domains.

Fig. 1. Comparison of representative dictionaries learned from patches of natural images to that of byte data.

While the features that are extracted through sparse coding may potentially span relatively large spatial scales, the total number of elements in the dictionary may be relatively small in comparison. For example, due to redundancies in the signal, even for highly overcomplete dictionaries in image reconstruction applications, the number of elements in the dictionary is only greater by a moderate ratio to the number of pixels in the image patch (e.g. 2000 elements for a patch size of 20 × 20 = 400 pixels gives a ratio of 5 : 1) [9].

We may leverage this property to extend features such as n-gram frequencies beyond standard n-gram sizes without suffering from a combinatorial explosion in the feature space. To illustrate, if we take unigram frequencies as an example, it requires 256 elements for each possible byte. In order to extend this to pairs of bytes, or bigram frequencies, we may naively expand the feature space to 256^2 = 65,536 elements, and for trigram frequencies, this is expanded even more (256^3 elements). At the scale of a modest image patch size of 8 × 8 = 64 pixels that might be used in the sparse coding of images, expanding the feature space using the same technique for bytes (resulting in 256^64 elements) is highly impractical. Using sparse coding, however, where we instead model an n-gram as sparse linear combinations of representative byte patterns, we may drastically reduce the size of the feature space (i.e. to the size of the dictionary) while retaining relevant information about the byte composition at larger spatial scales.

B. Support Vector Machine

Support vector machines (SVMs) are a standard supervised machine learning method used to perform classification of labeled inputs [15]. To be consistent with the standard formulation, we reuse a bit of notation from the previous section on sparse coding. Here, input vectors x_i ∈ R^m for i = 1, ..., n are represented as transformed feature vectors y_i = φ(x_i) ∈ R^(m′) corresponding to points in a high dimensional space, where typically m′ ≫ m. For clarification, although the input vectors used by SVMs could potentially be the same as the data vectors for sparse coding, this is generally not the case. Rather, they are often composed of preprocessed features from the raw data.

For the separable two-class case, with class labels l_i = ±1, training an SVM equates to computing a hyperplane in the transformed high dimensional space,

    g(y_i) = b + w^T φ(x_i)        (3)

where b ∈ R and w ∈ R^(m′) give the offset and weight parameters, respectively, that satisfy g(y_i) > 0 for l_i = 1 and g(y_i) < 0 for l_i = −1. Additionally, we choose the parameters such that the hyperplane has the largest margin, or distance, from the nearest training points, and normalize ||w|| = 1 for uniqueness of solution.

Regarding the transformation function, φ(·), common candidates include polynomial, Gaussian, and various radial basis functions (RBFs). The reason a transformation is used is that the inputs are typically not linearly separable in the input space, and a projection onto a higher dimensional space is needed. For file fragment classification, however, existing work has shown that training an SVM directly in the input space (equivalent to using an identity transformation function) performs the best in practice, at least with respect to n-gram-like inputs [2].

For treatment of SVMs where the labeled points are not perfectly separable, as well as generalization to multiple classes, we encourage the reader to consult [15].

III. EXPERIMENTAL SETUP

A. File Fragment Dataset

We used our sparse coding approach to perform feature extraction and classification on a dataset derived from the govdocs1 file corpus, which aims to provide a more standardized dataset for digital forensics research [16]. Our dataset was derived in that we did not use the entirety of the govdocs1 file corpus, even though it contained a much larger number of file types, 63 total. This was because the distribution of files in the original corpus is rather unbalanced, with some file types (e.g. pdf) accounting for a substantial portion of the dataset, whereas other file types (e.g. py (python script)) only consisted of a handful of files. Rather, we randomly sampled from 15 common file types from the original corpus that had at least 1,000 files per type. These 15 file types were: csv, doc, gif, gz, html, jpg, pdf, png, ppt, ps, rtf, swf, txt, xls, and xml. Although certain OOXML container file types fell below this 1,000 files per type cutoff, due to their increasing relevance, we also included them in our experiments, increasing the total to 18 file types. These file types were: docx, pptx, and xlsx, at 163, 215, and 37 files per type, respectively. This resulted in a dataset containing a total of 15,415 files consisting of 18 different file types.

From this dataset, we obtained a balanced set of file fragments with respect to file type and used a training to testing ratio of 60 : 40. For each common file type, we first took a random subset of 750 files and then sampled 20 file fragments per file. For the OOXML file types, due to their reduced number, we sampled over multiple randomly ordered passes through the available files to reach the same count of 750 files. All together, we obtained a total of 270,000 file fragments across the 18 file types. Sampling from each file was performed by randomly choosing continuous subsequences of 512 bytes from the file without replacement, but allowing overlap between different subsequences. A fragment size of 512 bytes was selected in order to compare our approach with similar studies in the literature [3]. During sampling, subsequences containing bytes from the first 512 bytes of the file were omitted, as they frequently contain header information that identifies the file type. To prevent bias during classification, the file fragments used for training were drawn from different files altogether than those file fragments that were used for testing. Per the 60 : 40 ratio, training was performed on the file fragments of 450 files per file type, for a total of 162,000 file fragments, with the remaining 108,000 file fragments set aside for testing.

B. Learning Dictionaries of N-Grams

In sparse coding, or sparse dictionary learning, the goal is often to obtain an overcomplete basis set from the data vectors [14]. From the file fragment datasets described above, we learn a dictionary over n-grams similar to that of learning a dictionary from image patches or slices of other natural signals. Unlike natural signals, which contain a great deal of redundancy, byte data is often represented in a relatively concise format for efficiency. As a result, the ratio of the elements in our learned dictionaries to the size of the n-grams is significantly higher than what is more commonly found in the sparse coding literature [9]. For example, given an n-gram size of 64 bytes, a dictionary size of 1024 would give a ratio of 16 : 1.

For learning the dictionary, only n-grams from file fragments set aside for training were used. The dictionaries were learned using the spams sparse coding library, which provides a scalable online algorithm for dictionary learning based on stochastic gradient descent [13]. Learning occurred over two epochs (i.e. passes over the training set) and was performed by iterating over randomly shuffled and balanced batches of n-grams using a sparsity penalty of λ = 0.15 (a default value in spams). For each batch, 25 files per file type were randomly sampled, for a total of 500 file fragments per file type. From each file fragment, 64 randomly sampled n-grams were used for dictionary learning, giving a total of 32,000 n-grams per file type, or 576,000 n-grams per batch. The sampling of n-grams from file fragments was performed without replacement, but allowing for overlap between different n-grams. The choice of randomly sampling 64 n-grams, as opposed to using all possible (overlapping) n-grams from each fragment, was done more as a practical matter to save time during training.

In total, 15 dictionaries were learned, differentiated by the number of elements in the dictionary (1,024, 1,296, and 1,444 elements) as well as the size of the n-gram (4, 8, 16, 32, and 64 bytes). A sample learned dictionary is shown in Fig. 2.

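To make the dictionary-learning stage concrete, the sketch below learns an overcomplete dictionary from synthetic byte n-grams. It is a minimal illustration only, not the paper's implementation: it uses scikit-learn's MiniBatchDictionaryLearning (an online solver in the same spirit as the spams library used here), toy data in place of govdocs1 fragments, and a small dictionary (64 elements for 16-byte n-grams) rather than the sizes reported in Section III-B.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)

# Toy "file fragment" data standing in for govdocs1 n-grams.
# Low-entropy samples repeat short byte motifs; high-entropy ones are random.
low = np.tile(rng.integers(0, 256, size=(50, 4)), (1, 4))   # repeating 4-byte motifs
high = rng.integers(0, 256, size=(50, 16))                  # near-uniform bytes
ngrams = np.vstack([low, high]).astype(float) / 255.0       # 16-byte n-grams in [0, 1]

# Learn an overcomplete dictionary (64 elements for 16-byte n-grams, a 4:1 ratio),
# minimizing reconstruction error plus an L1 sparsity penalty as in Eq. (2).
dico = MiniBatchDictionaryLearning(
    n_components=64,               # dictionary size k
    alpha=0.15,                    # sparsity penalty lambda (the spams default noted above)
    transform_algorithm="lasso_lars",
    random_state=0,
)
codes = dico.fit(ngrams).transform(ngrams)  # sparse representation vectors r_i

print(dico.components_.shape)  # (64, 16): k dictionary elements of n-gram length m
print(codes.shape)             # (100, 64): one sparse code per n-gram
```

In the paper's setup, the analogous fit would run over balanced batches of n-grams drawn from the training fragments, for each of the five n-gram sizes and three dictionary sizes.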
Fig. 3. Sparse coding analysis pipeline for constructing n-gram feature sets.

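The coding-and-averaging step of the pipeline in Fig. 3 can be sketched as follows. This is a hedged illustration rather than the paper's implementation: it uses scikit-learn's SparseCoder with random unit-norm stand-in dictionaries in place of the learned ones, and the tier (maximum n-gram size of 16 bytes) and dictionary size (32 elements) are chosen small for readability.

```python
import numpy as np
from sklearn.decomposition import SparseCoder

rng = np.random.default_rng(1)

def ngram_frequencies(fragment, dictionary, n, alpha=0.15):
    """Estimate n-gram frequencies for one fragment: lasso-code every
    length-n window against the dictionary, then average the sparse codes."""
    grams = np.lib.stride_tricks.sliding_window_view(fragment, n) / 255.0
    coder = SparseCoder(dictionary=dictionary,
                        transform_algorithm="lasso_lars",
                        transform_alpha=alpha)
    codes = coder.transform(grams)      # one sparse vector per n-gram
    return np.abs(codes).mean(axis=0)   # per-dictionary-element "frequencies"

fragment = rng.integers(0, 256, size=512)  # a 512-byte file fragment

# Stand-in dictionaries with unit-norm rows; the real ones are learned (Sec. III-B).
feature_tiers = []
for n, k in [(4, 32), (8, 32), (16, 32)]:  # tier with maximum n-gram size 16
    D = rng.normal(size=(k, n))
    D /= np.linalg.norm(D, axis=1, keepdims=True)
    feature_tiers.append(ngram_frequencies(fragment, D, n))

features = np.concatenate(feature_tiers)   # concatenated tiered feature set
print(features.shape)                      # (96,)
```

Note that, unlike during dictionary learning, every overlapping n-gram in the fragment is coded, matching the description in Section III-C.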
Fig. 2. Sample learned dictionary of 1,024 elements by 16-byte n-grams (reshaped as 4 × 4 grids for visual clarity).

Although there are a few elements with identifiable patterns, the vast majority of the elements appear to be random. This result is not surprising, as the dictionary is learned from byte patches from files that have both low entropy (e.g. html) and high entropy (e.g. gz) with respect to their byte distributions.

C. N-Gram Frequencies

Once the dictionaries have been learned, we can use the standard lasso method (see Section II-A) to transform the file fragments from a collection of n-grams to a collection of sparse feature vectors. Here, unlike when sampling for dictionary learning, all n-grams from the file fragment are used. The transformed sparse feature vectors may then be averaged together to provide an estimate of n-gram frequencies based on the dictionary elements, allowing for significantly larger n-gram sizes than are typically found in existing methods, which suffer from combinatorial explosion. Here, instead of representing each possible n-gram combination as a "one-hot" entry in a 256^n feature vector, the n-gram frequencies are approximated according to the dictionary elements. This is because the sparse feature vectors provide a measure of how much each dictionary element contributes to the reconstruction of the n-grams in the file fragment.

To construct the feature sets to be used in classification, the n-gram frequencies for different n-gram sizes (4, 8, 16, 32, and 64 bytes) were concatenated. Instead of examining each possible combination, we constructed feature sets according to multiple tiers ordered by maximum n-gram size. This was done such that each progressively larger tier contained a superset of features with respect to the tiers before it (e.g. a feature set with n-gram frequencies up to 32 bytes also contained the frequencies for n-grams of 4, 8, and 16 bytes). Following this construction, we used three different tiers with maximum n-gram sizes of 16, 32, and 64 bytes. Taking into consideration each of the three dictionary sizes (1,024, 1,296, and 1,444 elements), this gives a total of 9 different feature sets. Although the number of dictionary elements could be adjusted arbitrarily with respect to the n-gram sizes, we chose to use the same size dictionaries for each of the concatenated feature vectors for simplicity. We found that, in spite of forgoing fine tuning at this level, we were able to achieve significantly better file fragment classification results compared to existing methods. An illustration of this analysis pipeline from learning the dictionaries to constructing the different feature sets is presented in Fig. 3.

We did not forgo existing methods altogether, however. Namely, we explored concatenating features such as unigram and bigram frequencies, as they have been shown to be very effective for file fragment classification [2]. Unlike the n-gram frequencies that were obtained through the proposed sparse coding approach, these features simply counted and normalized the number of occurrences of each unigram or bigram found in the file fragment, out of 256 and 65,536 possibilities, respectively. By concatenating the unigram and bigram frequencies, as a whole, to the feature sets above, we obtained 9 additional feature sets, for a total of 18 feature sets for comparison during classification.

D. Multiclass Classification

Classification was performed using a support vector machine (SVM) with a linear kernel using the liblinear library [17]. That is, the feature vectors that were obtained above were used to train the classifier directly, as opposed to first projecting them onto a higher dimensional space as typically performed during SVM training (see Section II-B). It has been shown that for features consisting of n-gram frequencies, training an SVM directly using a linear kernel generally outperforms using the default radial basis function kernels [2], [18]. Preliminary studies with our feature sets were consistent with this trend.

In order to attribute the differences in performance to the feature set being used during training as opposed to classifier parameterization, we compared the results across a few different parameterizations. Fortunately, there are very few parameters relevant for training linear kernel SVMs. In particular, a regularization parameter, C, is used to manage classification error for non-separable training points. Incidentally, solvers in liblinear are not very sensitive to C, especially when C is above a certain value [17]. The other parameters, such as the choice of loss function and ε, a stopping criterion for the solver, are typically left at their default values. For our parameterization, we varied C between values of 1, 32, and 256, where C = 1 is the default parameterization for the solver and is what was used in a previous study by Fitzgerald et al. [3], and C = 256 is consistent with the "winning" classifier identified in a previous study by Beebe et al. [2]. The value C = 32, also used in the previous study by Beebe et al., was chosen as an intermediate value. For the other parameters, we simply used the default ε = 0.01 and the L2-regularized L2 loss function.

E. Replication Studies

In order to attribute differences in performance across studies to the methods being used as opposed to the ease of classification between datasets, we also replicated a couple of existing studies with respect to our dataset. Perhaps the most straightforward is a study by Beebe et al., which found that simply training a linear kernel SVM on a feature set consisting of only unigram and bigram frequencies produced good classification results on a variety of file and data types [2]. For this, we effectively replicate their "winning" classifier by training according to the consistent parameterization with C = 256. Although we do not use their software tool, Sceadan, which enables systematic comparison among several additional features, we use the same underlying library, liblinear, as well as identical normalization of the unigram and bigram frequencies. In doing so, we may provide a more direct comparison to our approach, because although the original study also used files from the govdocs1 file corpus, it was supplemented by additional file and data types for a total of 30 different file types as well as 8 different data types. Furthermore, some file types such as rtf and swf were excluded from their dataset.

We also replicated the method outlined in a study by Fitzgerald et al. [3], which also used files drawn from the govdocs1 file corpus, with a total of 24 different file types. In addition to unigram and bigram frequencies, this study also used features inspired by natural language processing (NLP). To briefly summarize, these additional features included the Shannon entropy of the bigram counts, the Hamming weight (i.e. the percentage of 'one' bits), the mean byte value, the compressed length of the file fragment (using the bzip2 algorithm), the average contiguity between bytes, and the longest continuous streak of repeating bytes. Here, the most direct comparison to the original study comes from using the default parameterization of C = 1 for the solver.

IV. EXPERIMENTAL RESULTS

A. Classification Results

The entire experimental setup (detailed in Section III), from sampling file fragments from the dataset to classification, was performed over 10 independent runs. From the classification results, we evaluated the performance of our approach using a number of metrics. In addition to the raw prediction accuracy, we also computed the F1 score, which provides a weighted average of the precision and recall of the classifier [19]. Because our dataset was balanced, both the micro-averaged (computed globally over all classes) and macro-averaged (computed individually for each class and then weighted) F1 scores were identical.

The classification results averaged over 10 runs for each of the different feature sets and parameterizations are presented in TABLE I. The different feature sets used are as follows. The first row (Uni/Bigram Only) gives results for the classifier trained solely on the unigram and bigram frequencies, replicating the study by Beebe et al. for our dataset [2]. The second row (Uni/Bigram + NLP) gives results for the classifier trained on unigram and bigram frequencies along with various NLP-inspired features, replicating the study by Fitzgerald et al. for our dataset [3]. The remaining rows give results for feature sets containing concatenated n-gram frequencies obtained using our proposed sparse coding approach, both standalone (Sparse N-Gram Only) and alongside unigram and bigram frequencies (Uni/Bigram + Sparse N-Gram). The sizes of the dictionaries used are given (1,024, 1,296, or 1,444 elements), as are the maximum n-gram sizes (16, 32, or 64 bytes) according to the tiered composition outlined in Section III-C.

The best performing feature set was the concatenation of the unigram and bigram frequencies and the full set of n-gram frequencies (4 through 64 bytes) with a dictionary size of 1,444 elements. With respect to the parameterization of the linear kernel SVM, performance was split between C = 32 and C = 256, with the highest average prediction accuracy of 61.31% from C = 256, and the highest average F1 score of 60.74% from C = 32. Looking at the performance on the other feature sets, and given that the F1 score provides a rough measure of how balanced a classifier is, we may consider the parameterization of C = 32 to be the best overall. A sample confusion matrix (with a prediction accuracy of 60.99%) from one of the runs of this classifier is given in TABLE II. When omitting the unigram and bigram frequencies and only using the features obtained through our proposed sparse coding approach, we achieved an average prediction accuracy of 56.59% and an average F1 score of 55.86%.

Compared to existing work on file fragment classification, as well as found during our replication studies, we found that the features obtained using our sparse coding approach performed significantly better, especially when the features were used in supplement to existing hand-engineered features. From the literature, an early study by Veenman using Fisher's linear discriminant on features such as byte histograms and information theoretic measures such as entropy achieved an average prediction accuracy of 45% on 11 file types [6]. Introducing the normalized compression distance as a feature, a study by Axelsson using k-nearest neighbor achieved an average prediction accuracy of 34% on 34 file types [4]. Although we do not compare against these studies directly, an indirect comparison may be considered by way of the replicated studies [2], [3].

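For reference, the hand-engineered baselines from the replication studies can be sketched as below. The unigram/bigram counts follow the construction described for [2], and the remaining scalars approximate the NLP-inspired features of [3]; as noted in Section III-E, the exact normalization used in the original studies is not fully specified, so the normalizations here are assumptions.

```python
import bz2
import numpy as np

def baseline_features(fragment):
    """Hand-engineered features: unigram/bigram frequencies plus
    NLP-inspired scalars (entropy, Hamming weight, mean byte,
    bzip2 length, contiguity, longest repeating-byte streak)."""
    b = np.frombuffer(fragment, dtype=np.uint8)
    n = len(b)

    unigrams = np.bincount(b, minlength=256) / n
    pairs = b[:-1].astype(np.int32) * 256 + b[1:]
    bigrams = np.bincount(pairs, minlength=65536) / (n - 1)

    p = bigrams[bigrams > 0]
    entropy = -np.sum(p * np.log2(p))          # Shannon entropy of bigram counts
    hamming = np.unpackbits(b).mean()          # fraction of 'one' bits
    mean_byte = b.mean()
    comp_len = len(bz2.compress(fragment))     # bzip2 compressed length
    contiguity = np.abs(np.diff(b.astype(np.int32))).mean()
    # longest continuous streak of a repeating byte
    change = np.flatnonzero(np.diff(b) != 0)
    runs = np.diff(np.concatenate(([-1], change, [n - 1])))
    longest_streak = runs.max()

    scalars = [entropy, hamming, mean_byte, comp_len, contiguity, longest_streak]
    return np.concatenate([unigrams, bigrams, scalars])

frag = bytes(range(256)) * 2                   # a toy 512-byte fragment
feats = baseline_features(frag)
print(feats.shape)                             # (65798,): 256 + 65,536 + 6
```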
TABLE I AVERAGED CLASSIFICATION METRICS FOR DIFFERENT FEATURE SETS AND SVM TRAINING PARAMETERIZATION

TABLE II CONFUSION MATRIX OF CLASSIFIER USING THE BEST PERFORMING FEATURE SET (CONCATENATION OF THE UNIGRAM AND BIGRAM FREQUENCIES AND THE FULL SET OF N-GRAM FREQUENCIES (4 THROUGH 64 BYTES) WITH A DICTIONARY SIZE OF 1,444 ELEMENTS) AND THE BEST OVERALL PARAMETERIZATION OF C = 32 FOR THE LINEAR KERNEL SVM

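The linear-kernel SVM comparison of Section III-D can be sketched with scikit-learn's LinearSVC, which wraps the same liblinear library. The balanced synthetic data here is a stand-in for the real fragment features (an assumption for illustration); the C values, tolerance ε = 0.01, and the L2-regularized squared-hinge loss mirror the parameterization described above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, f1_score

# Balanced multiclass stand-in for the 18-class fragment feature sets.
X, y = make_classification(n_samples=1800, n_features=64, n_informative=32,
                           n_classes=6, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.6,  # 60:40 split
                                          stratify=y, random_state=0)

for C in (1, 32, 256):  # the three parameterizations compared in Sec. III-D
    # LinearSVC defaults to L2-regularized squared-hinge loss (liblinear backend).
    clf = LinearSVC(C=C, tol=0.01, max_iter=20000, random_state=0).fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    acc = accuracy_score(y_te, pred)
    # On a balanced test set, micro- and macro-averaged F1 nearly coincide.
    f1_micro = f1_score(y_te, pred, average="micro")
    f1_macro = f1_score(y_te, pred, average="macro")
    print(f"C={C}: acc={acc:.3f} f1_micro={f1_micro:.3f} f1_macro={f1_macro:.3f}")
```

Note that for single-label multiclass problems, micro-averaged F1 is identical to raw accuracy, which is why the paper reports the two F1 averages separately.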
With respect to the replicated studies, we found that the "winning" approach found by Beebe et al., which simply concatenated unigram and bigram frequencies to train a linear kernel SVM, achieved an average prediction accuracy of 53.31% for C = 256 over the 18 file types of our dataset. Notably, this result is significantly lower for our dataset than what was found in the original study of 73.4%, which consisted of a dataset containing 30 file types and 8 data types [2]. Due to the identical construction with respect to the unigram and bigram features from file fragments, as well as the identical parameterization of the linear kernel SVM used for training and testing, we posit that the main difference in performance was due to the differences in the composition of the datasets used more than anything else. Another possibility, mentioned in the limitations of the original study, is biasing due to the lack of separation of files used during the extraction of fragments between training and test sets, whereas for our experiments, file fragments between training and test sets were chosen from different files altogether.

For the approach proposed by Fitzgerald et al., which introduced features inspired from NLP, we achieved an average prediction accuracy of 53.59% for C = 1. Also notably, this result is higher for our dataset than what was found in the original study of 49.1%, which consisted of a dataset containing 24 file types [3]. Although the algorithms used to construct the NLP type features were provided in the original study, we cannot quite claim the replicated features were identical, as specifics on how the normalization was performed were not made explicit. In spite of this, as with the replicated study above, we posit that the main difference in performance was simply due to the differences between the datasets used. Primarily, we found that when performing a cross comparison with respect to our dataset, the performance when using this latter feature set containing NLP-inspired features was similar to, or even slightly better than, the performance of just the unigram and bigram frequencies on their own.

B. Timing Analysis

We also performed a rough timing analysis to provide a measure of the scalability and practicality of the sparse coding approach. All of our experiments were run on a multi-core machine with 16 physical processing cores and a clock rate of 1.2 GHz per core. Because timing is heavily dependent on the computational resources, these values are only capable of providing a rough measure. Here, we give values for the most computationally expensive feature set, which consisted of all five levels of n-gram frequencies (4 through 64 bytes) with a size of 1,444 elements per dictionary.

Given a file fragment, there are two main computational phases needed to produce its classification. First, transforming file fragments into a collection of sparse feature vectors, the lasso method provided by the spams sparse coding library averaged roughly 840 min to process all 270,000 file fragments from both the training and testing sets, which corresponds to roughly 187 ms per file fragment [13]. Second, training the linear kernel SVM using liblinear averaged roughly 130 min for the 162,000 file fragments in the training set, which corresponds to roughly 48 ms per file fragment [17]. Subsequently, the classification from the feature set of a file fragment to its file type is very quick.

In practice, the biggest drawback currently to this approach is the amount of time it takes to extract the feature set from a given file fragment, especially considering the size of storage available (in the terabyte range). Fortunately, this phase is highly data parallel, and although this means that computation time scales linearly with the number of file fragments, it will also scale inversely with additional computational resources (e.g. additional CPU cores). The development of next-generation computational hardware accelerators such as resistive memory crossbar arrays may further mitigate this issue due to their ability to accelerate linear algebra operations used by sparse coding [20], [21].

C. Discussion

From these results, we see that fragments more likely to contain identifying patterns, such as the tags in xml, the commas in csv, or commands in ps, were those most easily classified. With respect to the sparse coding approach, the fragments of these file types were most likely to be reconstructed using recurring dictionary elements. At the other end, the fragments that gave the classifier the most difficulty were those that possessed relatively higher entropy with respect to the byte distribution, act as containers for other files, or both (e.g. gz, png, pdf, ppt, swf). In particular, we note that file types such as gz and png employ the common DEFLATE compression algorithm for their encoding. As a result, these fragments were more often confused with each other than confused with those more easily classified file types. This general behavior when confronted with high entropy fragments is similar to what has been noted in previous studies [3], [7].

Drawing comparisons on a per file type basis, we contrast confusion matrices (for the same experimental run) between the best performing feature set in TABLE II with the replicated method from Beebe et al. (with a prediction accuracy of 53.10%) in TABLE III, which only used the unigram and bigram frequencies [2]. The differences between these two classifiers are given in TABLE IV, where better prediction accuracy relative to the replicated method is indicated by positive values along the diagonal and negative values elsewhere. For brevity during the comparison, we will refer to the feature set containing both the extended n-gram frequencies and the unigram and bigram frequencies as "full", and the feature set only containing the unigram and bigram frequencies as "reduced". At a glance, we see that performance per file type is better for the classifier trained on the full feature set almost across the board. In fact, performance along the diagonal ranged from comparable to significantly better for every single file type except for jpg. In the case of jpg, we note that the classifier trained on the reduced feature set traded error on the off-diagonal, resulting in an overall worse performance. To contrast, when training on the full feature set, we see that these errors are much better balanced, with classification biased toward the diagonal.

To expand a bit on the issue of classification balance, especially in the case of higher entropy file types (e.g. gif, gz) where discernible features are more difficult to identify, a noticeable difference in performance between the two feature sets was the misclassification biasing toward one or two of the high entropy file types when otherwise uncertain. For the reduced feature set, this bias is most pronounced in the gif file type, with other file types such as png misclassifying to it even more than classifying correctly to itself. This bias toward file types is consistent with the results of the replicated

TABLE III
CONFUSION MATRIX OF CLASSIFIER USING A FEATURE SET CONTAINING ONLY UNIGRAM AND BIGRAM FREQUENCIES

TABLE IV
DIFFERENCE OF CONFUSION MATRICES COMPARING THE CLASSIFICATION PERFORMANCE BETWEEN TRAINING USING FULL AND REDUCED FEATURE SETS (TABLES II AND III, RESPECTIVELY)

study [2]. For the full feature set, although the errors are spread around more evenly than with the reduced feature set, there is still noticeable biasing toward the gz file type, with file types such as png misclassifying to it at a greater rate than to other high entropy file types. Overall, the results here indicate that training the classifier using the extended n-gram frequencies results in more favorable classification performance.

Regarding the container file types (e.g. pdf, pptx), we find that misclassification errors generally bias toward the more “fundamental” file types which may be embedded in them (e.g. images such as jpg or png), potentially as compressed data. This can be especially noted in the case of the presentation file types of ppt and pptx, which commonly include images. Here, we see misclassification to potentially embedded file types at much higher rates than with other file types.
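The per-type comparison of TABLE IV is the elementwise difference of two confusion matrices, so per-type improvement can be read off its diagonal. A minimal sketch of this comparison, using small hypothetical matrices rather than the paper's 18-type results:

```python
import numpy as np

# Hypothetical confusion matrices over three file types (rows: true type,
# columns: predicted type), standing in for the "full" and "reduced"
# feature sets of TABLES II and III; the numbers are illustrative only.
full = np.array([[80, 15,  5],
                 [10, 85,  5],
                 [ 5, 10, 85]])
reduced = np.array([[70, 20, 10],
                    [15, 75, 10],
                    [10, 20, 70]])

# Difference matrix in the style of TABLE IV: positive values along the
# diagonal (and negative values elsewhere) indicate better per-type
# prediction accuracy for the full feature set.
diff = full - reduced

# Per-type accuracy gain, normalized by the number of true fragments of
# each type (the row sums).
gain = np.diag(diff) / full.sum(axis=1)
print(diff)
print(gain)  # → [0.1  0.1  0.15]
```

Because both matrices count the same fragments, each row of the difference sums to zero: any gain on the diagonal is exactly offset by reduced off-diagonal misclassification.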

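The two computational phases described in the timing analysis (sparse coding of n-grams via lasso, then a linear kernel SVM) can be sketched as follows. This is a minimal, hypothetical illustration: scikit-learn's SparseCoder and LinearSVC (which is backed by liblinear) stand in for the spams and liblinear usage in the paper, and the dictionary, fragments, and parameters below are toy assumptions rather than the paper's learned dictionaries and dataset.

```python
import numpy as np
from sklearn.decomposition import SparseCoder
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

n = 4          # n-gram size in bytes (the paper spans 4 through 64)
n_atoms = 32   # dictionary size (the paper uses 1,444 elements per dictionary)

# Stand-in for a learned n-gram dictionary: in the paper it is trained with
# online dictionary learning; here it is just random unit-norm atoms.
dictionary = rng.standard_normal((n_atoms, n))
dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)

coder = SparseCoder(dictionary=dictionary,
                    transform_algorithm='lasso_cd', transform_alpha=0.1)

def fragment_features(fragment: bytes) -> np.ndarray:
    """Phase 1: sparse-code the fragment's n-grams via lasso, then pool the
    absolute coefficients to estimate n-gram frequencies."""
    grams = np.array([list(fragment[i:i + n])
                      for i in range(0, len(fragment) - n + 1, n)], dtype=float)
    codes = coder.transform(grams / 255.0)  # one sparse vector per n-gram
    return np.abs(codes).mean(axis=0)

# Toy "fragments" of two synthetic types (low-valued vs. high-valued bytes).
frags = [bytes(rng.integers(0, 32, 64).tolist()) for _ in range(20)] + \
        [bytes(rng.integers(224, 256, 64).tolist()) for _ in range(20)]
labels = [0] * 20 + [1] * 20

X = np.stack([fragment_features(f) for f in frags])

# Phase 2: train a linear kernel SVM on the pooled sparse codes.
clf = LinearSVC(C=256).fit(X, labels)
print(clf.score(X, labels))
```

In the paper's setting, a dictionary would instead be learned for each n-gram size from 4 through 64 bytes, and the resulting frequency estimates would be concatenated (optionally with the hand-engineered unigram and bigram frequencies) before training.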
To revisit the misclassification errors biasing toward jpg from earlier, we see that when using the reduced feature set, the ppt file type is misclassified to jpg at about the same rate as classifying correctly to itself, and the pptx file type is overwhelmingly misclassified to jpg. Another significant biasing is with the xlsx file type being overwhelmingly misclassified to gif, likely due to both resembling compressed data. Although we see a similar trend when using the full feature set to train the classifier, we note that the effects are significantly reduced, with the ppt file type being classified correctly to itself more than to jpg, and the pptx file type slightly above an equal classification to misclassification rate to jpg. From this, we can conclude that the additional information (potentially features related to metadata) made available by the extended n-gram frequencies for these container file types is helpful in differentiating them from the file types they may contain.

Overall, we see that the most significant improvements when using the extended n-gram frequencies over just the unigram and bigram frequencies are for file types with more complex internal structure, such as the container file types. Regarding other file types with internal structure such as html or xml, where common byte sequences (e.g. tags) may extend to more than just two bytes, we find that our approach yields moderate improvement, as well as reducing misclassification error between them. Taken together, these results indicate that our approach is able to capture information on a broader scale due to the estimation of larger n-grams, and, furthermore, that this information is useful for improving classification performance.

V. CONCLUSIONS

We have proposed an approach for file fragment classification using sparse dictionary learning to estimate n-gram frequencies for a given file fragment. Notably, the dictionaries learned using this approach can be used to extend the feature set of n-gram frequencies without suffering from combinatorial explosion, because n-grams are represented as the reconstruction of sparse vectors as opposed to being expressed exactly. Although the features extracted by this approach are obtained in an unsupervised manner, they are able to capture a significant amount of the information present in the byte patterns without needing prior domain knowledge. In fact, due to the sparsity constraint, the approach tends toward the extraction of redundant byte patterns specific to individual file types. Experimentally, we found that these features yielded significantly better classification results with respect to existing methods, especially when the features were used in supplement to existing hand-engineered features.

Although the proposed sparse coding approach achieves competitive file fragment classification on its own, leveraging domain expertise may yield still more accurate classifiers. For example, domain knowledge may inform the fine tuning of various model parameters, such as the ratio between the size of the n-grams to estimate and the size of the dictionaries to learn. With respect to algorithmic design choices, one of the limitations of the approach presented in the paper is that even though we may extend the size of n-grams beyond traditional methods, simply using features such as n-gram frequencies ignores the relationships between consecutive n-grams. To this end, an area to explore is methods that may capture more hierarchical or global information about the file fragment. Here, algorithms that are recurrent in nature and efficient in classifying sequences, such as long short-term memory networks, may prove useful [22], [23]. Alternative methods for the automated extraction of n-gram features should also be explored. For example, the use of sparse autoencoders may result in reduced computation time, but perhaps at the cost of prediction accuracy [24].

REFERENCES

[1] S. L. Garfinkel, “Carving contiguous and fragmented files with fast object validation,” Digit. Invest., vol. 4, pp. 2–12, Sep. 2007. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1742287607000369
[2] N. L. Beebe, L. A. Maddox, L. Liu, and M. Sun, “Sceadan: Using concatenated n-gram vectors for improved file and data type classification,” IEEE Trans. Inf. Forensics Security, vol. 8, no. 9, pp. 1519–1530, Sep. 2013.
[3] S. Fitzgerald, G. Mathews, C. Morris, and O. Zhulyn, “Using NLP techniques for file fragment classification,” Digit. Invest., vol. 9, pp. S44–S49, Aug. 2012. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1742287612000333
[4] S. Axelsson, “The normalised compression distance as a file fragment classifier,” Digit. Invest., vol. 7, pp. S24–S31, Aug. 2010. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1742287610000319
[5] G. Conti et al., “Automated mapping of large binary objects using primitive fragment type classification,” Digit. Invest., vol. 7, pp. S3–S12, Aug. 2010. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1742287610000290
[6] C. J. Veenman, “Statistical disk cluster classification for file carving,” in Proc. 3rd Int. Symp. Inf. Assurance Secur., Aug. 2007, pp. 393–398.
[7] P. Penrose, R. Macfarlane, and W. J. Buchanan, “Approaches to the classification of high entropy file fragments,” Digit. Invest., vol. 10, no. 4, pp. 372–384, Aug. 2013. [Online]. Available: https://doi.org/10.1016%2Fj.diin.2013.08.004
[8] J. A. Cox, C. D. James, and J. B. Aimone, “A signal processing approach for cyber data classification with deep neural networks,” Procedia Comput. Sci., vol. 61, pp. 349–354, Nov. 2015. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1877050915029865
[9] H. Lee, A. Battle, R. Raina, and A. Y. Ng, “Efficient sparse coding algorithms,” in Advances in Neural Information Processing Systems 19, B. Schölkopf, J. Platt, and T. Hoffman, Eds. Cambridge, MA, USA: MIT Press, 2006, pp. 801–808. [Online]. Available: http://books.nips.cc/papers/files/nips19/NIPS2006_0878.pdf
[10] V. Roussev and S. L. Garfinkel, “File fragment classification: The case for specialized approaches,” in Proc. 4th Int. IEEE Workshop Syst. Approaches Digit. Forensic Eng. (SADFE), Washington, DC, USA, May 2009, pp. 3–14. [Online]. Available: http://dx.doi.org/10.1109/SADFE.2009.21
[11] V. M. Patel and R. Chellappa, “Sparse representations, compressive sensing and dictionaries for pattern recognition,” in Proc. Asian Conf. Pattern Recognit. (ACPR), Nov. 2011, pp. 325–329.
[12] R. Tibshirani, “Regression shrinkage and selection via the lasso,” J. Roy. Statist. Soc. B (Methodological), vol. 58, no. 1, pp. 267–288, 1996.
[13] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online learning for matrix factorization and sparse coding,” J. Mach. Learn. Res., vol. 11, pp. 19–60, Mar. 2010.
[14] B. A. Olshausen and D. J. Field, “Sparse coding with an overcomplete basis set: A strategy employed by V1?” Vis. Res., vol. 37, no. 23, pp. 3311–3325, 1997. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0042698997001697
[15] V. N. Vapnik, Statistical Learning Theory. Hoboken, NJ, USA: Wiley, 1998.
[16] S. Garfinkel, P. Farrell, V. Roussev, and G. Dinolt, “Bringing science to digital forensics with standardized forensic corpora,” Digit. Invest., vol. 6, pp. S2–S11, Sep. 2009. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1742287609000346
[17] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, “LIBLINEAR: A library for large linear classification,” J. Mach. Learn. Res., vol. 9, pp. 1871–1874, Jun. 2008.
[18] Q. Li, A. Y. Ong, P. N. Suganthan, and V. L. L. Thing, “A novel support vector machine approach to high entropy data fragment classification,” in Proc. South African Inf. Security Multi-Conf. (SAISMC), 2010, pp. 236–247.
[19] C. J. V. Rijsbergen, Information Retrieval, 2nd ed. Newton, MA, USA: Butterworth-Heinemann, 1979.
[20] P. M. Sheridan, C. Du, and W. D. Lu, “Feature extraction using memristor networks,” IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 11, pp. 2327–2336, Nov. 2016.
[21] S. Agarwal et al., “Energy scaling advantages of resistive memory crossbar based computation and its application to sparse coding,” Frontiers Neurosci., vol. 9, p. 484, Jan. 2016. [Online]. Available: http://journal.frontiersin.org/article/10.3389/fnins.2015.00484
[22] Z. C. Lipton, J. Berkowitz, and C. Elkan. (2015). “A critical review of recurrent neural networks for sequence learning.” [Online]. Available: https://arxiv.org/abs/1506.00019
[23] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997. [Online]. Available: http://dx.doi.org/10.1162/neco.1997.9.8.1735
[24] A. Ng, “Sparse autoencoder,” Lect. Notes, vol. 72, pp. 1–9, Jul. 2011.

Felix Wang, photograph and biography not available at the time of publication.

Tu-Thach Quach, photograph and biography not available at the time of publication.

Jason Wheeler, photograph and biography not available at the time of publication.

James B. Aimone, photograph and biography not available at the time of publication.

Conrad D. James, photograph and biography not available at the time of publication.