Project Report

Fast Processing of Large (Big)

Forensics Data

Indian Academy of Sciences Summer Research Fellowship Program - 2014

By Pritam Dash

School of Computing Science and Engineering, Vellore Institute of Technology, Chennai Campus

Guide: Dr. B.M. Mehtre, Associate Professor, Institute for Development and Research in Banking Technology, Hyderabad, India


Declaration

I hereby declare that the project entitled Fast Processing of Large (Big) Forensics Data submitted for the Indian Academy of Sciences Summer Research Fellowship Program 2014 is my original work and the project has not formed the basis for the award of any degree, associateship, fellowship or any other similar work.

Signature of the Student:

Place: Hyderabad

Date: 25th July 2014


Certificate

This is to certify that this project report entitled Fast Processing of Large (Big) Forensics Data, submitted to the Indian Academy of Sciences, Bangalore, is a bona fide record of work done by Mr. Pritam Dash, undergraduate student at Vellore Institute of Technology, Chennai Campus, under my supervision from 29th May 2014 to 25th July 2014.

Signature of Guide

Place: Hyderabad

Date: 25th July 2014


Acknowledgement

I take this opportunity to express my profound gratitude and deep regards to my guide, Dr. B.M. Mehtre, for his exemplary guidance, monitoring and constant encouragement throughout the course of this work. The help and guidance given by him from time to time shall carry me a long way in the journey of life on which I am about to embark.

I also take this opportunity to express a deep sense of gratitude to Mr. Sandeep K, Research Associate, IDRBT for his support, valuable information and guidance, which helped me in completing this task through various stages.

I am obliged to the staff members of IDRBT for the valuable information provided by them in their respective fields. I am grateful for their cooperation during the period of my summer research.

Pritam Dash

Hyderabad, 25th July 2014


Abstract

The magnitude of potential digital evidence sources has grown exponentially. The increasing amount of data, especially unstructured data, is becoming a major challenge for forensic investigators. Hence, in cyber forensics the ability to sift through vast amounts of data quickly is now paramount. To speed up processing, it is essential in the triage process to first eliminate those files that are clearly unrelated to the investigation. An effective method for supporting this work is matching files against black and white lists. We compare five different methods of finding additional uninteresting files: frequent hash values, frequent paths, frequent sizes, clustered creation times, and uninteresting extensions [1]. Tests were run on data sources of different volumes collected from Windows and Linux systems. In this work we propose a new strategy for faster processing of large forensics data, and we provide a comparison between the total number of uninteresting files found by hash set matching and by the five methods mentioned above. In our initial tests we could eliminate an additional 2.37% and 3.40% of uninteresting files in the Windows and Linux data sources respectively.

Keywords: data source, hash set, metadata, black-list, white-list, uninteresting files.


Table of Contents

Declaration
Certificate
Acknowledgement
Abstract
Table of Contents
1 Digital Forensics Overview
  1.1 Introduction
  1.2 Phases in Digital Forensics Analysis
  1.3 Branches of Digital Forensics
    1.3.1 Computer forensics
    1.3.2 Mobile device forensics
    1.3.3 Network forensics
    1.3.4 Forensic data analysis
    1.3.5 Database forensics
  1.4 Challenges in Digital Forensics
    1.4.1 Large (big) data as a challenge
2 Literature survey
  2.1 Introduction
  2.2 Strategies and tools for fast processing
    2.2.1 Metadata analysis
    2.2.2 Super clustering using Dirim
    2.2.3 Using Jumplist to identify fraudulent documents
    2.2.4 OpenLV
    2.2.5 Hash set matching
    2.2.6 Indexing through piecewise hash signatures
    2.2.7 Indexing image hashes
    2.2.8 Methods for finding additional uninteresting files
3 Proposed Method
  3.1 Step 1: Finding drives of interest
  3.2 Step 2: Eliminating uninteresting files
  3.3 Experimental setup
  3.4 Results and discussions
4 Conclusion and Future Work
  4.1 Conclusion
  4.2 Future work
References


1 Digital Forensics Overview

1.1 Introduction

Digital forensics is the use of scientifically derived and proven methods for the preservation, collection, validation, identification, analysis, interpretation, documentation and presentation of digital evidence derived from digital sources, for the purpose of facilitating the reconstruction of events found to be criminal. The technical aspect of an investigation is divided into several sub-branches relating to the type of digital devices involved: computer forensics, network forensics, forensic data analysis and mobile device forensics. The typical forensic process encompasses the seizure, forensic imaging (acquisition) and analysis of digital media, and the production of a report on the collected evidence. As well as identifying direct evidence of a crime, digital forensics can be used to attribute evidence to specific suspects, confirm alibis or statements, determine intent, identify sources (for example, in cases), or authenticate documents. Investigations are much broader in scope than other areas of forensic analysis (where the usual aim is to provide answers to a series of simpler questions), often involving complex time-lines or hypotheses.

1.2 Phases in Digital Forensics Analysis

Fig- 1 Phases In Digital Forensics

- Identification – This process includes the search, recognition and documentation of the physical devices on the scene potentially containing digital evidence.
- Collection – Devices identified in the previous phase are collected and transferred to an analysis facility.


- Acquisition – This process involves producing an image of a source of potential evidence, ideally identical to the original.
- Preservation – Evidence integrity, both physical and logical, must be ensured at all times.
- Analysis – Interpretation of the data from the acquired evidence. It usually depends on the context, aims or focus of the investigation and can range from malware analysis to image forensics and many other application-specific areas. At a higher level, analysis could include content analysis via, for instance, forensic linguistics or sentiment analysis techniques.
- Reporting – Communication and/or dissemination of the results of the digital investigation to the parties concerned.

1.3 Branches of Digital Forensics Digital forensics includes several sub-branches relating to the investigation of various types of devices, media or artifacts.

Fig- 2. Digital forensics - branches

1.3.1 Computer forensics

The goal of computer forensics is to explain the current state of a digital artifact, such as a computer system, storage medium or electronic document. The discipline usually covers computers, embedded systems (digital devices with rudimentary computing power and onboard memory) and static memory (such as USB pen drives).


Computer forensics can deal with a broad range of information, from logs (such as history) to the actual files on the drive.

1.3.2 Mobile device forensics

Mobile device forensics is a sub-branch of digital forensics relating to the recovery of digital evidence or data from a mobile device. Investigations usually focus on simple data such as call data and communications (SMS/email) rather than in-depth recovery of deleted data. Mobile devices are also useful for providing location information, either from inbuilt GPS/location tracking or via cell site logs, which track the devices within their range.

1.3.3 Network forensics

Network forensics is concerned with the monitoring and analysis of network traffic, both local and WAN/internet, for the purposes of information gathering, evidence collection, or intrusion detection. Traffic is usually intercepted at the packet level, and either stored for later analysis or filtered in real time. Unlike other areas of digital forensics, network data is often volatile and rarely logged, making the discipline often reactionary.

1.3.4 Forensic data analysis

Forensic data analysis is a branch of digital forensics. It examines structured data with the aim of discovering and analysing patterns of fraudulent activity resulting from financial crime.

1.3.5 Database forensics

Database forensics is a branch of digital forensics relating to the forensic study of databases and their metadata. Investigations use database contents, log files and in-RAM data to build a timeline or recover relevant information.

1.4 Challenges in Digital Forensics

In recent years, forensic computing has evolved into a recognized discipline, with certified practitioners and guidelines pertaining to the conduct of their activities. With the ubiquity of computer-based devices in everyday use, forensic techniques are increasingly being applied to a broad range of digital media and equipment, thus posing many challenges for experts as well as for those who make use of their skills. Some common challenges which are now a major concern in forensic investigations are:
- Increase in the number of devices per person
- Larger storage devices
- Increased turnaround time
- Case backlogs
- Pressure to accelerate reporting time
- Technology changes

- New apps
- Cloud forensics

1.4.1 Large (big) data as a challenge

Digital forensics investigation is facing new challenges because of the dramatic increase in storage size per computer or device and the substantial increase in the usage of solid-state removable media. The worldwide use of personal mobile devices such as smartphones and tablets, together with the increasing adoption of cloud services by individuals and businesses, has added more complexity to the forensic investigation process. The number of potential digital evidence sources has grown exponentially. Studies have shown that a forensic investigation may involve dealing with a wide range of distinct drive images and files collected from various computers and users. The increasing amount of such unstructured data is becoming a major challenge for forensic investigators. Detailed forensic analysis of all the drives and files is impractical, as most of the files in the corpus will not provide any forensically interesting information. Hence, in cyber forensics, fast processing of this vast amount of data has become an important concern.

1. Increasing amount of data per PC
2. Increasing amount of data per user

3. Increase in volume of data in servers
4. Increase in usage of smartphones and removable USB devices

Fig- 3. Large (big) data as a challenge for digital forensics


2 Literature survey

2.1 Introduction

Inspecting all the files on a drive takes a long time, so inspecting the directory metadata is a good, time-saving alternative. Metadata provides a sufficient description of, and statistics about, the files on a drive and is usually 1000 times smaller than the drive's contents [2]. Clues that a drive is of interest include signs of anti-forensic techniques used to hide data on the computer: erasing the contents of a drive, steganographically hidden messages, and rootkits used to hide certain processes.

To speed up subsequent processing, eliminating forensically uninteresting files is wiser than examining each and every file on the drive. This elimination process can significantly reduce the size of the corpus. Files whose contents do not provide useful information about the users of a drive can be labelled "uninteresting" [1]. These are usually operating-system files and application-software files that do not contain user-created information, together with common internet-document downloads that do not provide user-discriminating information [3].

Forensic investigators often use software tools that match files against white-lists and black-lists based on cryptographic hashes of types like MD5 and SHA-1. The National Software Reference Library (NSRL) provides a huge collection of such hashes. Forensic tools like SleuthKit can extract the directory metadata of a given drive image. We can eliminate files whose hash values match those in published sets [4]. This also has the benefit of detecting modified files, since their hash values differ [3]. However, the coverage of the published hash values is limited to known software and operating systems [5]; they do not provide hash values for files created dynamically. To confirm that a file is uninteresting we can open it and inspect it for user-created and user-discriminating data.
Moreover, uninteresting files occupy most of a drive's space, so eliminating them in the initial phase of the investigation significantly reduces the processing time. Unfortunately, a disk contains some software directories that do hold interesting user files, so finding the uninteresting ones is not always straightforward. This report discusses a comparison between existing approaches to detecting uninteresting files, proposes a method for improving performance, in particular by correlating files on drives and across drives in a corpus, and provides a baseline comparison for establishing that files are uninteresting.

2.2 Strategies and tools for fast processing

Forensic investigators suffer from an ever-increasing amount of unstructured data. Therefore, techniques have been developed to support quick identification of suspicious information. Some techniques supporting this work are presented below:


- Certain techniques like clustering related data, metadata analysis and hash set matching help in eliminating uninteresting files, thereby reducing the corpus size and the processing time.
- Since the volume of data involved in such investigations is so huge, it is essential to automate as much work as possible. For this purpose specialized automated tools were developed, e.g. EnCase, Autopsy, SleuthKit, Fiwalk and OpenLV.

A brief description of these tools and techniques is presented in the following sections.

2.2.1 Metadata analysis

Inspecting the drive metadata is a better alternative to inspecting all the files on a drive. The directory metadata of a computer drive contains the listing of the stored files and directories and their properties, which provides significant information about the user of the drive. Importantly, examining it requires much less time than examining file contents. The following clues can be sought in the metadata of a file system to detect possible attempts to hide something or prevent it from being known, and other anti-forensic activities: encrypted files and directories, suspicious file extensions, malicious software, and deletions of files [2].
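As an illustration, a minimal metadata sweep of a mounted drive might look like the following sketch. This is not the interface of any specific tool; the record fields are illustrative choices:

```python
# Sketch: collect directory metadata (paths, extensions, sizes, timestamps)
# for triage, instead of reading file contents. Field names are illustrative.
import os

def collect_metadata(root):
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # unreadable entries are skipped, never opened
            records.append({
                "path": path,
                "extension": os.path.splitext(name)[1].lower(),
                "size": st.st_size,
                "mtime": st.st_mtime,
            })
    return records
```

Each record is a small dictionary, so a whole drive's listing fits comfortably in memory even when the file contents would not.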

2.2.2 Super clustering using Dirim

Anomalies are found both by comparing overall drive statistics and by comparing clusters of related files, using a novel approach of "superclustering" of clusters. It is desirable to have ways to summarize drives quickly to determine whether a drive, and any particular files on it, are worth investigating. One aspect of interestingness is the degree to which files on a drive are "suspicious": they appear to be out of character with similar files on other drives, or appear to be concealing information. The contents of each file can be examined individually to find clues, but this takes time. Suspiciousness can be determined automatically from directory metadata (file names, extensions, paths, sizes, times, fragmentation, status flags and hash codes) that can suggest which file contents might be worth looking at.
- Metadata analysis won't catch sophisticated methods like steganography or putting data into slack space, but such things are rare in criminal investigations: drives are seized unexpectedly, and criminals don't have time to hide things in this manner.
- In a digital forensic investigation, anomalousness can be measured by comparing statistics over a set of drives and within a drive.
- Deceptiveness can be found in specific clues to concealment or misdirection.


The Dirim tool analyzes the file metadata for a set of disks and reports anomalies and suspiciousness. It is built in Python; it takes about a day to analyze 1467 drives on a modern 64-bit machine and about 10 minutes to analyze a new drive [6].
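The cross-drive comparison idea can be sketched as follows. This is a simplified stand-in for Dirim's statistics, not its actual algorithm: one per-drive statistic (say, the fraction of executable files) is compared against the corpus mean, and drives far from it are flagged. The 2-standard-deviation threshold is an illustrative choice:

```python
# Sketch of cross-drive anomaly scoring: a drive statistic is compared
# with the same statistic on other drives; values far from the corpus
# mean are flagged as anomalous.
from statistics import mean, stdev

def anomalous_drives(stats_by_drive, threshold=2.0):
    """stats_by_drive: {drive_id: numeric statistic}; returns flagged ids."""
    values = list(stats_by_drive.values())
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []  # all drives identical on this statistic
    return [drive for drive, v in stats_by_drive.items()
            if abs(v - mu) / sigma > threshold]
```

The same scoring can be repeated per file cluster rather than per drive, which is closer to the superclustering idea described above.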

2.2.3 Using Jumplist to identify fraudulent documents

The collection of electronic evidence involves its acquisition, investigation, analysis, and event reporting. The services of a computer forensic analyst may cost about $350 per hour; depending on the nature of the case, total costs can quickly reach $10,000 or more. Therefore, if a reliable forensic method is available with the potential for decreasing investigative costs, it should be a consideration in a digital investigation. JumpLister is a free software program that has the potential to reduce such investigative expenses [7].

2.2.4 OpenLV

OpenLV is a free, GPL-licensed, easy-to-use forensics application. It is a tool developed in Java that meets the need for faster initial triage, and it also provides an option to review the digital evidence at later stages of the investigation. The user of OpenLV loads the digital forensic images into a virtual platform and configures certain settings, such as the amount of memory, the system time, the operating system, and the locations of the input evidence and output virtualization files. To support this work OpenLV provides a straightforward GUI. Some settings, such as the size of the virtual hard disk, are inferred from the input evidence. Once configured, the user invokes the virtualization software and interacts with the evidence system. Since OpenLV works without modifying the evidence, its use in triage does not preclude subsequent in-depth forensic analysis. Unlike many popular forensics tools, OpenLV requires little training and facilitates an unprecedented level of interaction with the evidence [9].

2.2.5 Hash set matching

Matching files against black-lists and white-lists using software tools, based on cryptographic hashes of types like MD5 and SHA-1, can remove a significant number of uninteresting files. The NIST Information Technology Laboratory provides a huge reference collection of such hashes through the National Software Reference Library (NSRL). Commercial vendors like hashsets.com and Bit9.com also provide hashes. Forensic tools like SleuthKit (Autopsy) extract directory metadata from drive images. We can eliminate files whose hash values match those in published sets [1]. However, published hash values miss many kinds of software files [5], especially files created dynamically; the published hash sets provide hash values only for known software and operating-system files. The content distribution of files by extension for five hash sets (NSRL-RDS, Hashsets.com, Bit9.com, the Real Data Corpus, and RDC filtering) is presented in Table 1 below.

Table 1: Distribution of files by extension type for five hash sets. [1]

Types of extension | Real Data Corpus | NSRL RDS | Hashsets.com | Bit9.com | RDC filtering
None               | 10.56% | 13.78% | 9.62%  | 21.85% | 10.21%
Oper. system       | 3.74%  | 4.55%  | 1.53%  | 6.89%  | 0.00%
Graphics           | 16.23% | 13.64% | 13.86% | 13.03% | 13.14%
Camera images      | 3.14%  | 0.80%  | 2.36%  | 6.13%  | 22.11%
Temporaries        | 0.08%  | 0.02%  | 0.06%  | 2.20%  | 4.25%
Web                | 8.25%  | 8.83%  | 17.56% | 4.45%  | 6.82%
Misc. documents    | 1.71%  | 1.74%  | 1.46%  | 2.00%  | 4.69%
MS Word            | 0.17%  | 0.03%  | 0.16%  | 0.71%  | 2.98%
Presentations      | 0.26%  | 0.02%  | 0.07%  | 0.13%  | 0.51%
Database           | 0.29%  | 0.18%  | 0.21%  | 0.73%  | 1.04%
Other MS Office    | 0.09%  | 0.11%  | 0.05%  | 0.21%  | 0.15%
Spreadsheets       | 0.43%  | 0.38%  | 0.14%  | 0.46%  | 1.60%
Email              | 0.11%  | 0.03%  | 0.09%  | 0.12%  | 0.33%
Links              | 0.01%  | 0.04%  | 0.05%  | 1.08%  | 2.00%
Compressed         | 1.33%  | 7.05%  | 2.22%  | 0.65%  | 1.23%
Help               | 0.94%  | 0.28%  | 0.51%  | 1.01%  | 0.00%
Audio              | 1.47%  | 0.38%  | 0.71%  | 3.21%  | 4.42%
Video              | 0.20%  | 0.04%  | 0.16%  | 0.35%  | 0.79%
Program source     | 7.16%  | 11.44% | 8.98%  | 2.20%  | 4.11%
Executables        | 18.70% | 14.51% | 18.59% | 12.90% | 0.00%
Disk images        | 0.78%  | 1.87%  | 1.40%  | 1.15%  | 0.52%
XML                | 0.94%  | 2.17%  | 1.24%  | 1.00%  | 0.61%
Logs               | 0.04%  | 0.05%  | 0.06%  | 0.76%  | 2.29%
Copies             | 0.09%  | 0.04%  | 0.28%  | 0.40%  | 0.33%
Integers           | 1.03%  | 1.80%  | 0.83%  | 2.17%  | 4.59%
Configuration      | 5.32%  | 5.10%  | 3.66%  | 5.14%  | 2.35%
Update             | 0.06%  | 0.01%  | 0.07%  | 0.16%  | 0.00%
Virtual machine    | 0.12%  | 0.04%  | 0.09%  | 0.08%  | 0.08%
Multipurpose       | 2.79%  | 3.76%  | 3.48%  | 2.46%  | 5.64%
Miscellaneous      | 2.39%  | 5.32%  | 7.55%  | 2.41%  | 3.14%
2.2.6 Indexing through piecewise hash signatures

The general strategy is to divide a file into pieces and to produce a piece hash for each piece. The piecewise hash signature (PHS) of a file is the concatenation of all its piece hashes. In this case a small change in the file affects only a small portion of the PHS. Piecewise hashing is block hashing in which all pieces are consecutive blocks of a fixed size. A variation of piecewise hashing uses pieces of fixed size but triggers the starting points of pieces based on the content of the file; hence, pieces may overlap and there may be gaps between pieces.

Indexing strategy: the key idea is to build an index over the n-grams contained in the PHSs instead of an index of the PHSs themselves. For each n-gram in a query, the index provides a list of all hashes which contain the same n-gram. Such hashes are suitable candidates for neighbours, because having common n-grams correlates with being neighbours [8].
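A toy version of this idea, assuming fixed-size blocks and a truncated MD5 as the piece hash (both illustrative choices, not the parameters used by F2S2), can be sketched as:

```python
# Sketch of PHS n-gram indexing: the PHS is the sequence of per-block
# hashes, and an inverted index maps each n-gram of piece hashes to the
# files containing it.
import hashlib
from collections import defaultdict

def piecewise_signature(data, block_size=64):
    """PHS as a list of truncated per-block MD5 digests."""
    return [hashlib.md5(data[i:i + block_size]).hexdigest()[:8]
            for i in range(0, len(data), block_size)]

def build_index(signatures, n=2):
    """signatures: {file_id: PHS list}; returns {n-gram tuple: set of ids}."""
    index = defaultdict(set)
    for file_id, sig in signatures.items():
        for i in range(len(sig) - n + 1):
            index[tuple(sig[i:i + n])].add(file_id)
    return index

def candidate_neighbours(index, query_sig, n=2):
    """All indexed files sharing at least one PHS n-gram with the query."""
    hits = set()
    for i in range(len(query_sig) - n + 1):
        hits |= index.get(tuple(query_sig[i:i + n]), set())
    return hits
```

A file with one modified block still shares most of its PHS n-grams with the original, so the original is returned as a candidate neighbour.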

2.2.7 Indexing image hashes

Similarity-preserving hashes can provide a means to recognize known content and modified versions of known content. However, to support similarity search, efficient indexing strategies are essential. [10] provides two indexing strategies for robust image hashes created by a tool called ForBild.

The first strategy uses a vantage point tree (VP-tree), constructed in a top-down manner. After choosing a vantage point for the root node, the distance between each data point and the vantage point is calculated. Each child of the root node receives a subset of the data points corresponding to a certain range of distance values. This procedure is applied recursively to the children until a cancel criterion is reached, e.g. a desired depth or at most one data point per node. The processing of similarity queries against the tree exploits the triangle inequality: if the query point has distance d1 from a vantage point and a data point has distance d2 from the same vantage point, then the distance between the query point and the data point is at least |d1 - d2|.

The second strategy uses locality-sensitive hashing (LSH). An LSH scheme can be used to estimate the similarity of two items by randomly picking a certain number of hash functions from the LSH family and evaluating these functions on the two items. The approximate similarity score is the relative frequency of hash matches. This is useful if the evaluation of the actual similarity function is significantly more expensive than the evaluation of several hash functions. Moreover, a similarity hash can be constructed by concatenating the hash values produced by the selected hash functions.
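For bit-vector hashes compared under Hamming distance, a classic LSH family is "read one randomly chosen bit position". The following sketch (not ForBild's actual implementation) estimates similarity as the relative frequency of matches over a fixed random sample of positions:

```python
# Sketch of the LSH similarity estimate: each randomly chosen bit
# position is one LSH function for Hamming distance; the fraction of
# sampled positions on which two hashes agree approximates similarity.
import random

def lsh_similarity(hash_a, hash_b, num_functions=64, seed=42):
    """hash_a, hash_b: equal-length bit strings such as '010110...'."""
    rng = random.Random(seed)  # fixed seed makes the scheme repeatable
    positions = [rng.randrange(len(hash_a)) for _ in range(num_functions)]
    matches = sum(hash_a[p] == hash_b[p] for p in positions)
    return matches / num_functions
```

Concatenating the sampled bits of one hash yields the "similarity hash" mentioned above, since comparing two such concatenations position by position gives the same match count.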

2.2.8 Methods for finding additional uninteresting files

This work is very close to ours. Published black and white lists do not provide hash values for dynamically created files. [1] provides methods to find additional uninteresting files beyond the coverage of published hash sets:
- Frequent hash values: files in different directories having the same hash value computed on their contents. Such files could be uninteresting because the same file may have been copied into different directories.
- Frequent paths: files with the same full path occurring on different drives. Frequently occurring paths may be due to mass distribution and are unlikely to be forensically interesting.
- Clustered creation times: files created at the same time or within a short period. This suggests automated copying, so such files are unlikely to be forensically interesting.
- Unusually frequent file sizes: files with the same size and extension. This suggests a certain kind of log record, which again is not forensically interesting.
- Uninteresting extensions: files in white-listed categories such as operating-system files, application-software files, database files, executables, disk images, XML, etc. Files with no extension or with more than one extension are excluded from this category, as they may hold potentially suspicious content.

Limitations: running the above tests eliminates uninteresting files beyond published hash sets, but since these methods are executed on the entire corpus, time is spent examining files on drives that do not record any suspicious activity. Hence, eliminating non-suspicious drives in the first phase will significantly reduce the size of the corpus and save computation time.
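The frequency-based methods above all reduce to counting one metadata field across the corpus and flagging values that recur. A minimal sketch, assuming each file is a dictionary with illustrative `md5` and `rel_path` fields and using an illustrative threshold of 3 occurrences:

```python
# Sketch of the "frequent hash values" and "frequent paths" methods:
# files whose content hash or whose path recurs across the corpus are
# marked uninteresting.
from collections import Counter

def frequent_values(records, key, min_count=3):
    """Values of `key` occurring at least min_count times in the corpus."""
    counts = Counter(r[key] for r in records)
    return {v for v, c in counts.items() if c >= min_count}

def mark_uninteresting(records, min_count=3):
    freq_hashes = frequent_values(records, "md5", min_count)
    freq_paths = frequent_values(records, "rel_path", min_count)
    return [r for r in records
            if r["md5"] in freq_hashes or r["rel_path"] in freq_paths]
```

The "unusually frequent file sizes" method is the same counting pattern applied to the (size, extension) pair instead of the hash or path.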


3 Proposed Method

Simply eliminating files by hash set matching against published sets won't give the best results, since both the NSRL hashes and commercial hashes like those from Bit9.com and hashsets.com do not cover dynamically created files. Hence, we use five additional methods to eliminate further uninteresting files. We propose the following method for faster processing of large (big) forensics data: first eliminating non-suspicious drives, then eliminating additional uninteresting files beyond the coverage of published hash databases.

3.1 Step 1: Finding drives of interest

Searching for interesting drives in phase 1 saves much time, since the corpus contains thousands of drives but millions of files. This can be done by examining the drive metadata: examining the metadata requires roughly 1000 times less effort than examining all the files on the drive, and it tells us whether changes have been observed in any file on the drive. If we find any user-created activity on a drive, we can label that drive a drive of interest. In our experiment we found that drive C had no forensically interesting files, which was detected by hash set matching. In such cases we can save time by checking for suspicious activity on a drive through analysis of the drive metadata: once we discover that the drive has not been part of any anti-forensic activity, we can skip analyzing its files.

3.2 Step 2: Eliminating uninteresting files

After examining the corpus to find drives with anti-forensic content, the next step is to analyze the files on the interesting drives. Again, examining each and every file makes little sense, as most of the files on a drive will not provide any user-created or user-discriminating information. To eliminate such uninteresting files, [1] suggested a protocol:
- Run the frequent hashes, frequent paths, and frequent sizes methods to eliminate uninteresting files in the corpus.
- Eliminate the files whose hash values match the available uninteresting lists, e.g. the NSRL-RDS, hashsets.com and Bit9.com hash databases.
- On the remaining files, run the clustered creation times, contextually uninteresting files and unknown extensions methods to further eliminate uninteresting files.
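Of the methods in this protocol, clustered creation times is the only one that is not a simple frequency count. A minimal sketch, with an illustrative 60-second window and 5-file minimum cluster size:

```python
# Sketch of the clustered-creation-time check: files whose creation
# timestamps fall into the same short window were probably installed by
# automated copying and can be treated as uninteresting.
def clustered_files(timestamps, window=60, min_cluster=5):
    """timestamps: {path: creation time in seconds}; returns clustered paths."""
    ordered = sorted(timestamps.items(), key=lambda kv: kv[1])
    clustered, current = [], []
    for path, t in ordered:
        # a gap larger than `window` closes the current cluster
        if current and t - current[-1][1] > window:
            if len(current) >= min_cluster:
                clustered.extend(p for p, _ in current)
            current = []
        current.append((path, t))
    if len(current) >= min_cluster:
        clustered.extend(p for p, _ in current)
    return clustered
```

Small clusters are left alone, since a user saving a handful of documents in quick succession should not be discarded as automated copying.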

3.3 Experimental setup

Our experiment consists of two parts: first, the extraction of metadata and hash values using Autopsy, and the elimination of uninteresting files by matching the hash values against NSRL-RDS and hashsets.com hash values; second, the execution of our proposed methods to eliminate additional uninteresting files. To perform these tests we collected data from Windows and Linux systems. We used the NSRL-RDS 2.44 hash database released in March 2014, the November 2012 version of the known-malware hash datasets from hashsets.com, and known hash sets representing more than 253,00 notable hash values derived from internet file-sharing activities (executables, binaries, documents, compressed files, etc.) from the January 2011 version of the hash sets released at hashsets.com. Autopsy was used to extract metadata and MD5 hash values for the files in the given input image file (data source).

Table 1: Hash set sources used in our experiment

                               | NSRL-RDS, March 2014 | Hashsets.com, November 2012 | Bit9.com, April 2013
Number of entries              | 95,909,483           | 17,774,612                  | 321,847
Number of distinct hash values | 29,311,204           | 6,464,209                   | 321,847
Fraction distinct              | 0.306                | 0.364                       | 1.0

3.4 Results and discussions

We collected data sets of 7 GB, 14 GB and 50 GB each from both Windows and Linux systems, and the tests were conducted on each data set separately to identify uninteresting files. We used the Autopsy tool to generate metadata and MD5 hash values on a Windows Server 2012 R2 64-bit operating system with Intel Xeon processors (2.60 GHz, 2 processors) and 128 GB RAM. The properties and descriptions of the data sources collected from different PCs are presented in the tables below.

Table 2. Description of the data sets given as input to Autopsy (Test-1: columns A, B; Test-2: columns C, D; Test-3: columns E, F)

                                                        | Windows (A) | Linux (B) | Windows (C) | Linux (D) | Windows (E) | Linux (F)
Data source size                                        | 7 GB        | 7 GB      | 14 GB       | 14 GB     | 50 GB       | 50 GB
Total no. of folders                                    | 14          | 15        | 5           | 18        | 25          | 31
Total no. of subfolders                                 | 111         | 601       | 352         | 834       | 967         | 1210
Total no. of files analyzed                             | 70497       | 372719    | 133584      | 489099    | 441313      | 1017254
Time taken for analysis                                 | 3 min       | 6 min     | 4 min       | 6 min     | 34 min      | 48 min
No. of suspicious files identified by hash set matching | 303         | 4980      | 0           | 18733     | 12750       | 54128

Table 3. After applying the methods of eliminating additional uninteresting files to the 7 GB data sources (files before elimination: Windows 70497, Linux 372719)

Method                    | Windows: files removed | Windows: files left | Linux: files removed | Linux: files left
Frequent hash values      | 22557                  | 47922               | 59173                | 313454
Files with same full path | 89                     | 47833               | 47                   | 313498
Frequent sizes            | 37675                  | 10158               | 269955               | 43590
Clustered creation times  | 9231                   | 927                 | 39638                | 3952
Uninteresting extensions  | 927                    | 0                   | 2752                 | 1200

Table 4. After applying the methods of eliminating additional uninteresting files to the 14 GB data sources (files before elimination: Windows 133584, Linux 489099)

Method                    | Windows: files removed | Windows: files left | Linux: files removed | Linux: files left
Frequent hash values      | 14024                  | 119559              | 97900                | 391198
Files with same full path | 4                      | 119555              | 0                    | 391198
Frequent sizes            | 99261                  | 20294               | 339637               | 51561
Clustered creation times  | 20294                  | 0                   | 47362                | 4199
Uninteresting extensions  | -                      | -                   | 3822                 | 337

Table 5. After applying the methods of eliminating additional uninteresting files to the 50 GB data sources (files before elimination: Windows 441313, Linux 1017254)

Method                    | Windows: files removed | Windows: files left | Linux: files removed | Linux: files left
Frequent hash values      | 14024                  | 119559              | 317564               | 699690
Files with same full path | 4                      | 119555              | 321                  | 699369
Frequent sizes            | 99261                  | 20294               | 456210               | 243159
Clustered creation times  | 17239                  | 3055                | 213425               | 29734
Uninteresting extensions  | 2344                   | 711                 | 24232                | 5502

Hence, the above results show that, using the methods frequent hash values, frequent paths, frequent sizes, clustered creation times and uninteresting extensions, we can eliminate files beyond the NSRL-RDS and hashsets.com coverage. We combined the final results obtained from the different data sets to summarize the efficiency of our method on Windows and Linux systems. Using the above methods we could eliminate an additional 2.37% of the files in the data sets from Windows systems and 3.40% of the files in the data sets from Linux systems. The comparative results obtained in our experiment are summarized in the table below.

Table 6. Combined report of the percentage of files eliminated by hash set matching against the additional five methods

File system      | Total no. of files analyzed | % of files eliminated by Autopsy | % of files eliminated by additional methods | Difference (in percentage)
Windows (71 GB)  | 645394                      | 97.41                            | 99.78                                       | 2.37
Linux (71 GB)    | 1879072                     | 96.23                            | 99.63                                       | 3.40


4 Conclusion and Future Work

4.1 Conclusion

Our experiment has shown that we can eliminate files beyond the NSRL hash database and the hashsets.com hash sets using certain additional methods. In the forensic investigation of a corpus containing thousands of drives, certain clues can distinguish a suspicious drive from an uninteresting drive by examination of its metadata alone. We can also look for general evidence of concealment representing targets of interest. The uninterestingness of a file is usually based on whether the file contains user-created or user-discriminating information. We can use relatively simple methods to eliminate considerable numbers of uninteresting files beyond the coverage of published black-list and white-list hash databases. The proposed strategy will save a significant amount of time and attention in the investigation process by avoiding detailed analysis of forensically uninteresting drives and files.

4.2 Future work

Future work will involve executing this approach on an international corpus with a larger number of drives collected from different users, which will test the validity of this approach in real-world forensic investigations. We also envisage publishing the hash values of the additional uninteresting files, which will be useful to forensic investigators, as they will then be able to eliminate these files using hash set matching with the support of available forensic software tools.


References

1. Rowe, N. C.: Identifying forensically uninteresting files in a large corpus. 5th International Conference on Digital Forensics and Computer Crime, Moscow, Russia (2013)
2. Rowe, N., Garfinkel, S.: Finding anomalous and suspicious files from directory metadata on a large corpus. 3rd International ICST Conference on Digital Forensics and Cyber Crime, Dublin, Ireland, October 2011
3. Pennington, A., Linwood, J., Bucy, J., Strunk, J., Ganger, G.: Storage-Based Intrusion Detection. ACM Transactions on Information and System Security, Vol. 13, No. 4, 30 (2010)
4. Kornblum, J.: Auditing Hash Sets: Lessons Learned from Jurassic Park. Journal of Digital Forensic Practice, Vol. 2, No. 3, 108-112 (2008)
5. Rowe, N.: Testing the National Software Reference Library. Digital Investigation, Vol. 9S (Proc. Digital Forensics Research Workshop 2012, Washington, DC, August), pp. S131-S138 (2012)
6. Rowe, N. C., Garfinkel, S. L.: Finding suspicious activity on computer systems. 11th European Conference on Information Warfare and Security, Laval, France (2012)
7. Smith, G. S.: Using jump lists to identify fraudulent documents. Digital Investigation 9 (2013)
8. Winter, C., Schneider, M., Yannikos, Y.: F2S2: Fast forensic similarity search through indexing piecewise hash signatures. Digital Investigation 10 (2013)
9. Vidas, T., Kaplan, B., Geiger, M.: OpenLV: Empowering investigators and first-responders in the digital forensics process. Digital Investigation 11 (2014)
10. Winter, C., Steinebach, M., Yannikos, Y.: Fast indexing strategies for robust image hashes. Digital Investigation 11 (2014)
