SADFE 2015

Proceedings of the 10th International Conference on Systematic Approaches to Digital Forensic Engineering


Editors:
Carsten Rudolph, Monash University, Melbourne, Victoria, Australia
Nicolai Kuntze, Huawei European Research Center, Frankfurt am Main, Germany

Barbara Endicott-Popovsky, University of Washington, Seattle, WA, USA
Antonio Maña, University of Malaga, Malaga, Spain

Proceedings of the 10th International Conference on Systematic Approaches to Digital Forensic Engineering (SADFE 2015)

ISBN: 978-84-608-2068-0 Safe Society Labs (Spain)

© Copyright remains with the authors of each publication. Authors retain the right to reproduce, distribute, display, adapt and perform their own work for any purpose. The proceedings of the SADFE 2015 conference are published by Safe Society Labs as open access, and licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/).

Typeset & Cover Design: Hristo Koshutanski (Safe Society Labs)



Preface

This volume constitutes the proceedings of the 10th International Conference on Systematic Approaches to Digital Forensic Engineering (SADFE 2015). Over the years, SADFE has been a venue that established new interdisciplinary relations and connections, and it has been the source of new initiatives and collaborations. One example of such an activity was the 2014 Dagstuhl Seminar "Digital Evidence and Forensic Readiness", with participants from 4 continents.

This year, the SADFE steering committee took two risks. Most importantly, it is the first SADFE since 2007 that is not co-located with another event. Second, it is the first SADFE in Europe, highlighting the necessity of international co-operation in the area of digital forensics. Nevertheless, SADFE will continue to have the character of a workshop: a single track, so that all participants share the same information, with sufficient time and space for interaction and discussions.

In response to the 2015 SADFE call for papers, 39 submissions from 16 different countries on 5 continents were received and reviewed. Of the papers submitted, 18 were accepted for presentation at the conference; of those, 12 were selected for publication in the Journal of Digital Forensics, Security and Law (http://www.jdfsl.org). The program also included keynote talks by Michael M. Losavio on "Smart Cities, Digital Forensics and Issues of Foundation and Ethics" and by Klaus Walker on "The careless application of digital evidence in German criminal proceedings". In addition, a panel on the topic of "Digital Forensics: Future Challenges for Security Forces and Government Agencies" was held with the participation of representatives from law enforcement agencies from around the world, including The Netherlands, the UK, the United Arab Emirates and Spain.

Many people contributed to the organisation and preparation of this conference, including the program committee and the SADFE steering committee. A special thanks goes to the host and General Chair Antonio Maña. He took care of countless tasks, including the overall organisation of the conference, the SADFE 2015 website, publication and proceedings, the venue, social events, the final program, and many others. SADFE 2015 would have been impossible without his commitment and experience. Last, but certainly not least, thanks go to all the authors who submitted papers and all the attendees. We hope this year's program will once again stimulate exchange and discussions beyond the conference, and we look forward to the next 10 years of SADFE.

September 2015 Carsten Rudolph, Nicolai Kuntze, Barbara Endicott-Popovsky

Program Co-chairs SADFE 2015


Organization

Steering Committee:
Deborah Frincke (Co-Chair), Department of Defense, USA
Ming-Yuh Huang (Co-Chair), Northwest Security Institute, USA
Michael Losavio, University of Louisville, USA
Alec Yasinsac, University of South Alabama, USA
Robert F. Erbacher, Army Research Laboratory, USA
Wenke Lee, Georgia Institute of Technology, USA
Barbara Endicott-Popovsky, University of Washington, USA
Roy Campbell, University of Illinois, Urbana-Champaign, USA
Yong Guan, Iowa State University, USA

General Chair: Antonio Maña, University of Malaga, Spain

Program Committee Co-Chairs:
Carsten Rudolph, Huawei European Research Center, Germany
Nicolai Kuntze, Huawei European Research Center, Germany
Barbara Endicott-Popovsky, University of Washington, USA

Publication Chair: Ibrahim Baggili, University of New Haven, USA

Publicity Chair Europe: Joe Cannataci, University of Malta, Malta
Publicity Chair North America: Dave Dampier, Mississippi State University, USA
Publicity Chair Asia: Ricci Ieong, University of Hong Kong, Hong Kong


Program Committee

Sudhir Aggarwal, Florida State University, USA
Galina Borisevitch, Perm State University, Russia
Frank Breitinger, University of New Haven, USA
Joseph Cannataci, University of Groningen, Netherlands
Long Chen, Chongqing University of Posts and Telecommunications, China
Raymond Choo, University of South Australia, Australia
K.P. Chow, University of Hong Kong, Hong Kong
David Dampier, Mississippi State University, USA
Hervé Debar, France Telecom R&D, France
Barbara Endicott-Popovsky, University of Washington, USA
Robert Erbacher, Northwest Security Institute, USA
Xinwen Fu, UMass Lowell, USA
Simson Garfinkel, Naval Postgraduate School, USA
Brad Glisson, University of Glasgow, UK
Lambert Großkopf, Universität Bremen, Germany
Yong Guan, Iowa State University, USA
Barbara Guttman, National Institute of Standards and Technology, USA
Brian Hay, University of Alaska Fairbanks, USA
Jeremy John, British Library, UK
Ping Ji, John Jay College of Criminal Justice, USA
Andrina Y.L. Lin, Ministry of Justice Investigation Bureau, Taiwan
Pinxin Liu, Renmin University of China Law School, China
Michael Losavio, University of Louisville, USA
David Manz, Pacific Northwest National Laboratory, USA
Nasir Memon, Polytechnic Institute of New York University, USA
Mariofanna Milanova, University of Arkansas at Little Rock, USA
Carsten Momsen, Leibniz Universität Hannover, Germany
Kara Nance, University of Alaska Fairbanks, USA
Ming Ouyang, University of Louisville, USA
Gilbert Peterson, Air Force Institute of Technology, USA
Slim Rekhis, University of Carthage, Tunisia
Golden Richard, University of New Orleans, USA
Corinne Rogers, University of British Columbia, Canada
Ahmed Salem, Hood College, USA
Viola Schmid, Technische Universität Darmstadt, Germany
Clay Shields, Georgetown University, USA
Vrizlynn Thing, Institute for Infocomm Research, Singapore
Sean Thorpe, Faculty of Engineering and Computing, University of Technology, Jamaica
William (Bill) Underwood, Georgia Institute of Technology, USA


Wietse Venema, IBM T.J. Watson Research Center, USA
Hein Venter, University of Pretoria, South Africa
Xinyuan (Frank) Wang, George Mason University, USA
Kam Woods, University of North Carolina, USA
Yang Xiang, Deakin University, Australia
Fei Xu, Institute of Information Engineering, Chinese Academy of Sciences, China
Alec Yasinsac, University of South Alabama, USA
SM Yiu, Hong Kong University, Hong Kong
Wei Yu, Towson University, USA
Nan Zhang, George Washington University, USA


Sponsoring Institutions

Safe Society Labs, S.L. http://www.safesocietylabs.com/

The University of Malaga http://www.uma.es

Journal of Digital Forensics, Security and Law http://www.jdfsl.org


Table of Contents

UFORIA - A Flexible Visualisation Platform for Digital Forensics and E-Discovery ... 8
    Arnim Eijkhoudt, Sijmen Vos, Adrie Stander
Dynamic Extraction of Data Types in Android's Dalvik Virtual Machine ... 13
    Paulo R. Nunes de Souza, Pavel Gladyshev
Chip-off by Matter Subtraction: Frigida Via ... 19
    David Billard, Paul Vidonne
The EVIDENCE Project: Bridging the Gap in the Exchange of Digital Evidence Across Europe ... 25
    Maria Angela Biasiotti, Mattia Epifani, Fabrizio Turchi
A Collision Attack on Sdhash Similarity Hashing ... 36
    Donghoon Chang, Somitra Kr. Sanadhya, Monika Singh, Robin Verma
An empirical study on current models for reasoning about digital evidence ... 47
    Stefan Nagy, Imani Palmer, Sathya Chandran Sundaramurthy, Xinming Ou, Roy Campbell
Data Extraction on MTK-based Android Mobile Phone Forensics ... 54
    Joe Kong
Open Forensic Devices ... 55
    Lee Tobin, Pavel Gladyshev
A study on Adjacency Measures for Reassembling Text Files ... 56
    Alperen Şahin, Hüsrev T. Sencar
An integrated Audio Forensic Framework for Instant Message Investigation ... 57
    Yanbin Tang, Zheng Tan, K.P. Chow, S.M. Yiu
Project Maelstrom: Forensic Analysis of the Bittorrent-powered Browser ... 58
    Jason Farina, M-Tahar Kechadi, Mark Scanlon
Factors Influencing Digital Forensic Investigations: Empirical Evaluation of 12 Years of Dubai Police Cases ... 59
    Ibtesam Al Awadhi, Janet C. Read, Andrew Marrington, Virginia N. L. Franqueira
PLC Forensics based on Control Program Logic Change Detection ... 60
    Ken Yau, Kam-Pui Chow
Forensic Acquisition of IMVU: A Case Study ... 61
    Robert van Voorst, M-Tahar Kechadi, Nhien-An Le-Khac
Cyber Black Box/Event Data Recorder: Legal and Ethical Perspectives and Challenges with Digital Forensics ... 62
    Michael Losavio, Pavel Pastukov, Svetlana Polyakova
Tracking and Taxonomy of Cyberlocker Link Sharers based on Behavior Analysis ... 63
    Xiao-Xi Fan, Kam-Pui Chow
Exploring the Use of PLC Debugging Tools for Digital Forensic Investigations on SCADA Systems ... 64
    Tina Wu, Jason R.C. Nurse
The Use of Ontologies in Forensic Analysis of Smartphone Content ... 65
    Mohammed Alzaabi, Thomas Martin, Kamal Taha, Andy Jones


UFORIA - A FLEXIBLE VISUALISATION PLATFORM FOR DIGITAL FORENSICS AND E-DISCOVERY

Arnim Eijkhoudt & Sijmen Vos
Amsterdam University of Applied Sciences, Amsterdam, The Netherlands
[email protected], [email protected]

Adrie Stander
University of Cape Town, Cape Town, South Africa
[email protected]

ABSTRACT

With the current growth of data in digital investigations, one solution for forensic investigators is to visualise the data for the detection of suspicious activity. However, this process can be complex and difficult to achieve, as there are few tools available that are simple and can handle a wide variety of data types. This paper describes the development of a flexible platform, capable of visualising many different types of related data. The platform's back and front end can efficiently deal with large datasets, and support a wide range of MIME types that can be easily extended. The paper also describes the development of the visualisation front end, which offers flexible, easily understandable visualisations of many different kinds of data and data relationships.

Keywords: cyber-forensics, e-discovery, visualisation, cyber-security, digital forensics, big data, data mining

1. INTRODUCTION

With the growth of data that can be encountered in digital investigations, it has become difficult for investigators to analyse the data in the time available for an investigation. As stated by Teerlink & Erbacher (2006), "A great deal of time is wasted by analysts trying to interpret massive amounts of data that isn't correlated or meaningful without high levels of patience and tolerance for error". Data visualisation might help to solve this problem, as the human brain is much faster at interpreting images than textual descriptions. The brain can also examine graphics in parallel, where it can only process text serially (Teerlink & Erbacher, 2006).

According to Garfinkel (2010), existing tools use the standard WIMP model (Window, Icon, Menu, Pointing device). This model is poorly suited to representing large amounts of forensic data in an efficient and intuitive way. Research must improve forensic tools to integrate visualisation with automated analysis, allowing investigators to interactively guide their investigations (Garfinkel, 2010).

Many computer forensic tools are not ideally suited for identifying correlations among data, or for finding and visually presenting groups of facts that were previously unknown or unnoticed. Similar limitations hold for the forensic analysis of logs: for example, logs residing in routers, webservers and web proxies are often manually examined, which is a time-consuming and error-prone process (Fei, 2007). Similar considerations apply to E-mail analysis as well.

Another issue with current tools is that they do not always scale well and will likely have problems dealing with the growth of data in digital investigations (Osborne, Turnbull, & Slay, 2010). Currently, there are few affordable tools suited to and available for these use-cases or situations.


Additionally, the available tools tend to be complex, requiring extensive training and configuration in order to be used efficiently.

Investigative data visualisation is used to assist viewers with little to no understanding of the subject matter, in order to reconstruct a crime or item and to understand what is being presented, for example an investigator who is not familiar with a particular scenario. On the other hand, analysis visualisations can be used to review data and to assess competing scenario hypotheses for investigators that do have an understanding of the subject matter (Schofield & Fowle, 2013).

A timeline is a valuable form of visualisation, as it greatly assists a digital forensic investigator in proving or disproving a hypothetical model proposed for the investigation. A timeline can also provide support for the mandate the digital forensic investigator received prior to commencing the investigation (Ieong, 2006). Interaction between role players can normally also be shown in network diagrams, so the combination of a timeline and a network diagram can generally answer many who and when questions. The aspects of what and where can often be answered by examining the contents of evidence items, such as E-mails or the positional data of mobile phone calls. It is therefore important to be able to display the details of data with ease as well.

This paper describes the development of a flexible platform, Uforia (Universal Forensic Indexer and Analyser), that can be used to visualise many different types of data and data relations in an easy and fast way. The platform consists of two sections, a back end and a front end, and is based on readily available open source technologies. The back end is used to pre-process the data in order to speed up the indexing and visualisation process handled by the front end. The resulting product is a simple and extremely flexible tool, which can be used for many types of data with little or no configuration. Very little training is needed to use Uforia, making it accessible and usable for forensic investigators without a background in digital investigations or systems, such as auditors.

2. ADVANTAGES

Uforia offers many advantages, of which the first is very low cost.

A second advantage is that the system scales well due to its use of multiprocessing and distributed technologies such as ElasticSearch, so that extremely large numbers of artefacts can be handled in a very short time. The processing of the Enron set, consisting of roughly 500,000 E-mails without attachments, typically takes less than ten minutes to complete on contemporary consumer-grade hardware. This pre-processing step also ensures that little to no processing needs to be done at the time of visualisation.

Thirdly, Uforia's development heavily focused on making it as user- and developer-friendly as possible. Many forensic tools need a substantial amount of training and configuration to accomplish meaningful tasks. As this makes such systems difficult and expensive to use and develop for, it was considered paramount during Uforia's continued development to address these issues. Although a full UX study has not been conducted yet, the UI and feature set were developed using mock-ups and feedback from UX and graphical designers, as well as potential users from several fields of expertise, such as process, compliance and risk auditors, forensic investigators and law enforcement officers, where none of the participants were given prior usage instructions.

Another advantage is the extreme flexibility of the system. It is very easy to add new modules, e.g. for handling new MIME types, as the programming of such a module can normally be accomplished in a very short time using simple Python programming. Additionally, the front end is completely web based, and no special software needs to be installed to use it. This, combined with adherence to common web design and UX standards, suggests that even novice users can achieve meaningful results with little to no training.

3. BACK END

3.1 START-UP PHASE

Uforia's back end is used to process the files containing the data that will eventually be indexed and used in the visualisation process.


The back end's first step is to create a MySQL table for the files. This table contains all metadata common to any digital file, as well as calculated metadata (such as NIST hashes).

A second database table is then generated, containing information about the supported MIME types. This table is built by looking at a configurable directory containing the modules for the MIME types that can be handled by the system. Every module that can handle a specific MIME type is identified and added to this table. The table eventually contains zero, one or more 1:n key/value pairs for each of the supported MIME types and their respective module handlers. The module handlers are themselves stored as key/value pairs, with their original names as keys to the matching unique table names. These tables are then created for each module, so that Uforia can store the returned, processed data from each particular module in its unique table.

Modules are self-contained files and extremely easy to develop. They only require the structure of their database table to be stored as a simple Python comment line in the particular module, starting with # TABLE: …, and a predefined process function which should return the array of the data to be stored (a minimal sketch is given below).

3.2 PROCESSING

Once all tables are created, the processing of the files that need to be analysed can start. The first step is to build a list of the files involved; this is read from the config file. Once this list is completed, every file in the list is processed. The MIME type of each file is determined, and then the relevant processing modules (0, 1 ... n) are called to process the file. The results returned by each module are then stored in the database table that was generated earlier for that particular module.
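The paper does not reproduce a module listing, but the conventions described above (a # TABLE: comment declaring the table layout, plus a process function returning the data to store) suggest a shape like the following minimal sketch. The column names, function signature and parsing logic are illustrative assumptions, not Uforia's actual API:

```python
# Minimal sketch of a Uforia-style MIME-type module, assuming the
# conventions described above: a "# TABLE:" comment declaring the
# module's table structure, and a process() function returning the
# data to be stored. Names and signature are illustrative only.

# TABLE: sender:LONGTEXT, recipients:LONGTEXT, subject:LONGTEXT, sent:LONGTEXT

import email

def process(fullpath):
    """Parse one RFC 822 e-mail file and return a single row of data,
    in the column order declared in the # TABLE: comment above."""
    with open(fullpath, 'rb') as handle:
        message = email.message_from_binary_file(handle)
    return [
        message.get('From', ''),
        message.get('To', ''),
        message.get('Subject', ''),
        message.get('Date', ''),
    ]

# A module for a flat-file format (such as the CSV call-record module
# mentioned later in this section) would instead return a list of such
# rows, which Uforia's database engine turns into a multiple-row insert.
```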
When Uforia encounters a container format, it can deal with it efficiently by recursively calling itself. For instance, the Outlook PST module will unpack encountered PST files to a temporary directory and then call Uforia recursively for that temporary location. The unpacked individual E-mails are then automatically picked up by the normal E-mail module and processed accordingly.

Uforia can also deal efficiently with flat-file database(-like) formats by having modules return their results as a multi-dimensional array. Uforia's database engine turns these into multiple-row inserts into the appropriate modules' tables. Examples of modules that deal with flat files in this fashion are the modules that handle mobile phone data (CSV format) and the simple PCAP-file parser.

Due to its highly threaded operation, the back end can pre-process large volumes of data efficiently in relatively little time. Once the processing steps are completed, the stored data needs to be transferred from the back-end storage in JSON format to the ElasticSearch engine for use by the visualisation front end.

4. FRONT END

The front end uses ElasticSearch, AngularJS and D3.js for the visualisation and administration interface.

The first step during the visualisation process is to select, in the admin interface, the modules or file types that need to be visualised. The next step is to select (and possibly group any identical) fields that need to be indexed by the ElasticSearch engine. The administration interface will hint at similar field names in other supported data types to allow for the merging of data types into one searchable set. This makes it possible to correlate the timing of e.g. cell phone calls and E-mails.

During or after the indexing and storing in ElasticSearch, one or more visualisations must then be assigned to the mapping in the admin interface. This also includes specifying the fields that should be laid out on the visualisation's axes.

The data in ElasticSearch can then be searched and visualised, even if the index process has not been completed yet. Because the front end uses ElasticSearch, searches are fast and highly scalable. Only when full detail views of selected evidence items are necessary does the underlying back-end database need to be accessed.
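As an illustration of this search path, the 'contain(s)' filter described in section 5 maps naturally onto an ElasticSearch full-text query. The sketch below uses the official Python client; the index name ("uforia") and field name ("content") are assumptions for illustration, not Uforia's actual schema:

```python
# Sketch of the kind of full-text search the front end delegates to
# ElasticSearch; index and field names are illustrative assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Find evidence items whose indexed content mentions any of the
# search terms (cf. the PST example in section 6).
response = es.search(index="uforia", body={
    "query": {
        "match": {
            "content": "investigate books suspect trading"
        }
    }
})

for hit in response["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```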


5. USER INTERFACE

The interface is designed with the goal of optimizing user-friendliness and ease of understanding. The user interface sports a 'responsive design', with UI elements automatically resizing and repositioning themselves for different screen sizes, such as those of laptops, tablets and mobile phones, as can be seen in Figure 1.

Figure 1: Mobile Interface

1) The user selects an 'evidence type', which is the name used for the collection as it was generated in the admin interface.
2) Uforia then loads the module fields that have been indexed for that evidence type, e.g. 'Content' for E-mails or documents.
3) The user selects whether the field should 'contain' or 'omit' the information in the last field.

4) Finally, the user selects one of the visualisations that have been assigned to the evidence type.
5) Uforia will now render the requested information using the selected visualisation, with some of the visualisations offering additional manipulation (such as a network graph).

Lastly, all visualisations have one or more 'hot zones' where the user can 'click through' to bring up a detailed view of the selected evidence item(s).

6. EXAMPLES

In this section, an example can be seen of how Uforia can be used to quickly determine the E-mail contacts of suspects. Despite the limited available space in this paper, it is nevertheless possible to recreate similar scenarios for other data types.

Figure 2 shows an example of a network graph derived from a sample set of PST files, where the E-mail content was searched for the words 'investigate', 'books', 'suspect' or 'trading'. The graph indicates which individuals communicated about these words, with the size of each node indicating the amount of communication received. This immediately indicates the links between several possible suspects, including one whose PST mailbox was not included in the dataset processed by Uforia.

Figure 2: Network Graph

Another example is creating a timeline, as seen in Figure 3, to determine when messages were sent and which were sent around the time of the possible transgression. It is easy to determine the times of the E-mail messages by hovering over the intersections on the timeline, and to investigate the original E-mails by clicking on the intersections (see Figure 4).


Figure 3: Timeline

The timeline visualisation can handle multiple items, like calls from a large number of mobile phones. Figure 4 shows anonymised data from a real case, illustrating how contacts and times can easily be determined. The horizontal axis indicates the flow of time, while the graph nodes and coloured lines indicate the moments of contact between two phone numbers. By clicking on the intersections, the original data can once again be displayed.


Figure 4: Mobile Phone Timeline

7. CONCLUSION

Uforia shows that it is possible to create a simple, user-friendly product that is nevertheless powerful enough to use in the most demanding investigations. It is easy to extend if any new MIME types are encountered or new features are needed. Uforia was tested on a number of real-life scenarios, and in all cases it was able to produce real results in a fast and efficient way, requiring hardly any operator training. In conclusion, Uforia is a fast, flexible and low-cost solution for investigating large volumes of data.

REFERENCES

Fei, B. K. (2007). Data Visualisation in Digital Forensics. Master's dissertation, University of Pretoria, Pretoria, South Africa.

Garfinkel, S. L. (2010). Digital forensics research: The next 10 years. Digital Investigation, 64-73.

Ieong, R. S. (2006). FORZA - Digital forensics investigation framework that incorporate legal issues. Digital Investigation, (3), 29-34.

Osborne, G., Turnbull, B., & Slay, J. (2010). The 'Explore, Investigate and Correlate' (EIC) conceptual framework for digital forensics information visualisation. International Conference on Availability, Reliability and Security, pp. 630-634.

Schofield, D., & Fowle, K. (2013). Visualising Forensic Data: Evidence (Part 1). Journal of Digital Forensics, Security and Law, 8(1), 73-90.

Teerlink, S., & Erbacher, R. F. (2006). Foundations for visual forensic analysis. 7th IEEE Workshop on Information Assurance. West Point, NY: IEEE.


DYNAMIC EXTRACTION OF DATA TYPES IN ANDROID'S DALVIK VIRTUAL MACHINE

Paulo R. Nunes de Souza, Pavel Gladyshev

Digital Forensics Investigation Research Laboratory, University College Dublin, Ireland

ABSTRACT

This paper describes a technique to acquire statistical information on the types of data objects that go into volatile memory. The technique was designed to run on Android devices and was tested in an emulated Android environment. It consists in inserting code into the Dalvik interpreter, forcing that, at execution time, every datum that goes into memory is logged along with its type. At the end of our tests we produced probability distributions that allowed us to collect important statistical information with which we can distinguish memory values between references (Class, Exception, Object, String), Float and Integer types. The results showed this technique could be used to identify data objects of interest, in an emulated environment, assisting in the interpretation of volatile memory evidence extracted from real devices.

Keywords: Android, Dalvik, memory analysis.

1. INTRODUCTION

In digital forensic investigations, it is sometimes necessary to analyse and interpret raw binary data fragments extracted from the system memory, pagefile, or unallocated disk space. Even if the precise data format is not known, the expert can often find useful information by looking for human-readable ASCII strings, URLs, and easily identifiable binary data values such as Windows FILETIME timestamps and SIDs. Figure 1 shows an example of a memory dump, where a FILETIME timestamp can be easily seen (a sequence of 8 random binary values ending in 01).

Figure 1: Hexadecimal view of a memory dump

To date, the bulk of digital forensic research has focused on the Microsoft Windows platform; this paper describes a systematic experimental study to find (classes of) easily identifiable binary data values on the Android platform.

2. BACKGROUND

Traditional digital forensics relies on evidence found in persistent storage. This is mainly due to the need for both sides of the litigation to reproduce and verify every forensic finding. The persistent storage can be forensically copied, providing a controllable way to repeat the analysis, arriving at the same results.

An alternative way is to combine traditional forensics with so-called live forensics. Live forensics relies on evidence found in volatile memory to draw conclusions. This type of evidence features a lesser level of control and repeatability compared with traditional evidence. On the other hand, live evidence may unravel key information for the progress of a case. However, the question regarding the reliability of live evidence remains in place, mainly at two moments: the memory acquisition and the memory analysis.

On the memory acquisition front, law enforcement and researchers are working to establish standard procedures to be used. These procedures can be based on physical or logical extraction. The physical extraction may need disassembling of the device or the use of JTAG, as done by Breeuwsma [2006].


The logical extraction can be more diverse: it can interact with the system with user privileges, as done by Yen et al. [2009]; it can gain system privileges through a kernel module, as done by Sylve et al. [2012]; it can even use a virtual machine layer to have free access to the memory, as done by Guangqi et al. [2014], among others. Regardless of the extraction method, there will be the need to analyse the extracted data.

One challenge faced when analysing a memory dump is that application data is stored in memory following the algorithms of the program owning that memory space. Given the variety of software running on today's devices, the task of interpreting the device's extracted memory is complex. Some researchers are tackling this challenge taking different approaches. Volatility [2015] provides a customizable way to identify kernel data structures from memory dumps; Lin et al. [2011] used graph-based signatures to identify kernel data structures; Hilgers et al. [2014] use the Volatility framework to identify structures beyond the kernel ones, identifying static classes in the Android system.

A deeper memory analysis tool that would consistently interpret data structures from application software has not yet been developed. In-depth memory analysis is normally done on an ad-hoc basis, interpreting the memory dump in the light of the reverse-engineered application's source code, as done by Lin [2011]. A broader approach, one that does not depend on the application's source code, could be powerful for deep memory analysis.

This approach, not based on the application source code, would have advantages and disadvantages. As an advantage, it could be used in situations where the source code is unknown, unavailable, or legally disallowed to be reverse engineered. On the other hand, without the source code to deterministically assert the meaning of each memory cell, this method would need to take a probabilistic approach. The foundation for such an approach is a probabilistic understanding of the memory data associated with their respective types. This paper uses the Android OS as the environment to present a technique to gather the memory information associated with its type, making it possible to have a probabilistic understanding of that data.

3. ANDROID STRUCTURE

The Android OS is an operating system based on Linux, with extensions and modifications, maintained by Google. The OS was designed to run on a large variety of devices sharing the same common characteristics [Ehringer, 2010]: (1) limited RAM; (2) little processing power; (3) no swap space; (4) powered by battery; (5) diverse hardware; (6) sandboxed application runtime.

Figure 2: Architecture of Android OS

To provide a system that could run on such diverse and resource-limited devices, they decided to build a multi-layered OS (Figure 2). The 5 layers are: (1) Linux kernel; (2) Hardware Abstraction Layer (HAL); (3) Android runtime and native libraries; (4) Android framework; (5) Applications.

The Android OS is a hybrid of a compiled and an interpreted system. The boundary between compiled and interpreted execution is the Android runtime. The versions of Android used in our experiments (android-2.3.6_r1 and android-4.3_r2.1) feature the Dalvik Virtual Machine (Dalvik VM) in the runtime package. All the programs running in the layers underneath the Dalvik VM are compiled, and all programs running in the layers above the Dalvik VM are interpreted. The Dalvik VM hosts programs that were written in a Java syntax, compiled to an intermediary code level called bytecode and then packed to be loaded into Dalvik. When the software is launched inside the Dalvik VM, each line of bytecode is interpreted into machine code, normally for the ARM architecture.


The Dalvik VM is implemented as a register-based virtual machine. This means that the instructions operate on virtual registers, those virtual registers being memory positions in the host device. The instruction set provided by the Dalvik VM consists of a maximum of 256 instructions, some of them currently unused. Part of the used instructions is type-specific, and those are the ones chosen for collecting data and type information.

The Dalvik VM instruction set is grouped into categories: binop/lit8 is the set of binary operations receiving an 8-bit literal as one of the arguments; binop/lit16 is the set of binary operations receiving a 16-bit literal as one of the arguments; binop/2addr is the set of binary operations with only two registers as arguments, the result being stored in the first register provided; binop is the set of binary operations with three registers as arguments, two source registers and one destination register; unop is the set of unary operations with two registers as arguments, one source register and one destination register; staticop is the set of operations on static object fields; instanceop is the set of operations on instance object fields; arrayop is the set of operations on array fields; cmpkind is the set of operations that compare two floating point or long values; const is the set of operations that move a given literal to a register; move is the set of operations that move the content of one register to another register.

Each of those categories has a number of instructions specifically designed to operate on some data type. The whole instruction set distinguishes 12 data types, namely: (1) Boolean; (2) Byte; (3) Char; (4) Class; (5) Double; (6) Exception; (7) Float; (8) Integer; (9) Long; (10) Object; (11) Short; (12) String.
4. MODULAR INTERPRETER (MTERP)

As the Android OS is open source, the source code of the OS [Google, 2015], including the Dalvik VM, is available to be downloaded and modified. By inspecting the Dalvik VM source code in detail, it was possible to identify the interpreter (located at /android/dalvik/vm/mterp in the Android source tree) as a strong candidate to host the data-collecting code. The features that best suit our needs are: (1) there is a different entry for each bytecode instruction, called an opcode; (2) several of the opcodes of the Dalvik VM are type-related. It is therefore a good place to put the code designed to collect the data, relating the values and types that go to memory.

Even though the Dalvik interpreter is conceptually the central point through which every line of Dalvik bytecode should pass, there is one exception. The Android OS features an optimization element called Just In Time (JIT) compilation that can bypass the Dalvik interpreter [Google, 2010]. The JIT compiler is designed to identify the most demanded tracks of code that run over the Dalvik VM. Once identified, those tracks are compiled and, the next time they are demanded, the JIT calls the compiled track instead of the interpreter. This way, the code we use to collect our data would not be executed and the collected data would not be accurate.

JIT configuration    # of instructions logged
WITH_JIT = true      2,676,540
WITH_JIT = false     3,643,739

Table 1: Number of instructions logged during the Android booting process

In our tests, the JIT compiler would skip, on average, 26.5% of the type-bearing instructions during the Android booting process (Table 1). To avoid this source of error, it was necessary to deactivate the JIT compiler on our test Android OS. The Android system contains an environment variable WITH_JIT that is used to deploy an Android system with or without JIT. In order to deactivate Just In Time compilation, we edited the makefile Android.mk (located at /android/dalvik/vm in the Android source tree) and forced WITH_JIT to be set to false.
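As a quick consistency check (not given in the paper), the 26.5% skip rate follows directly from the two counts in Table 1:

$$\frac{3\,643\,739 - 2\,676\,540}{3\,643\,739} = \frac{967\,199}{3\,643\,739} \approx 0.265$$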


Having deactivated the JIT, it is necessary to insert the logging code into the interpreter. The interpreter source code is put together in a modular fashion; for this reason it is called the modular interpreter (mterp). For each target architecture variant there is a configuration file in the mterp folder. The configuration defines, for each Dalvik VM instruction, which version of the ARM architecture will be used and where the corresponding source code is located. In order to log all the designated instructions, several ARM source code files, scattered in the mterp folder, need to be edited accordingly, and any extra subroutine can be inserted in the file footer.S. After all the code is edited, it is required to run a script called rebuild.sh, located in the mterp folder, that deploys the interpreter (to /android/dalvik/vm/mterp/out in the Android source tree). Finally, the Android system, which will contain the modified interpreter, needs to be built.

When executing the deployed Android OS, the data extraction takes place. The extracted data is stored in a single file with one entry per line, as shown in Listing 1. The key information in each entry is in the last two columns, containing the type and the hexadecimal value stored in memory.

Listing 1: Unprocessed log sample

D(285:298) Object = <0x41a1fc68>
D(285:298) Int = <0x00034769>
D(285:298) Object = <0x41a1fc68>
D(285:298) Int = <0x00011db5>
D(285:298) Byte = <0x2f>
D(285:298) Int = <0x00000000>
D(285:298) Int = <0x0000002f>
D(285:298) Char = <0x2f>

Having this file, we process it to separate one data type into each file and exclude any extra information apart from the hexadecimal value, as depicted in Figure 3 (a sketch of this step follows below).

Figure 3: Log processing (Android emulator → extraction → mterp.log → log processing → Boolean.log, Byte.log, ..., String.log)
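The paper does not list the processing script, but the Listing 1 format makes the splitting step straightforward. The following is a minimal sketch, with the entry pattern inferred from Listing 1 and output file names following Figure 3:

```python
# Minimal sketch of the log-processing step of Figure 3: split the
# raw mterp.log into one file per data type, keeping only the
# hexadecimal values. The entry pattern is inferred from Listing 1.
import re
from collections import defaultdict

ENTRY = re.compile(r'^D\(\d+:\d+\)\s+(\w+)\s+=\s+<(0x[0-9a-fA-F]+)>')

values_by_type = defaultdict(list)
with open('mterp.log') as log:
    for line in log:
        match = ENTRY.match(line)
        if match:
            type_name, value = match.groups()
            values_by_type[type_name].append(value)

# One output file per type, e.g. Object.log, Int.log, Byte.log ...
for type_name, values in values_by_type.items():
    with open(type_name + '.log', 'w') as out:
        out.write('\n'.join(values) + '\n')
```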

Summing up, to extract the memory values associated with their respective types, we needed to do the following:

• deactivate the JIT compiler of the Android OS;
• inject code into the Dalvik interpreter to log types and values on each interpreted type-bearing instruction;
• run the adjusted Android OS to collect data in the logs;
• process the logged data.

The deactivation of the JIT compiler and the modification of the Dalvik interpreter code expectedly generated an execution overhead. Considering the average booting time, the logging procedure seems to have affected the response time more than the JIT deactivation. Table 2 shows the average booting times with and without JIT, as well as with and without the logging code.

                     Log = off    Log = on
WITH_JIT = true      62s          2176s
WITH_JIT = false     62s          3026s

Table 2: Average booting time in seconds

5. RESULTS

Having all the processed logs, it was possible to extract some statistical information from them. Table 3 shows in what proportion each type appears in the logs.
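Proportions like those in Table 3 can be derived directly from the per-type files produced in the previous step; a sketch, assuming the file layout from Figure 3:

```python
# Sketch of deriving the per-type proportions of Table 3 from the
# per-type value files produced by the log-processing step above.
import glob
import os

counts = {}
for path in glob.glob('*.log'):
    if path == 'mterp.log':
        continue                      # skip the raw log itself
    with open(path) as f:
        counts[os.path.splitext(path)[0]] = sum(1 for _ in f)

total = sum(counts.values())
for type_name, count in sorted(counts.items()):
    print('%-10s %10d %9.4f%%' % (type_name, count, 100.0 * count / total))
```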


The table makes clear that the Int type prevails over the other types, with 54.3% of the appearances. Other types with a rather common rate of occurrence are Byte (8.17%), Char (12.19%) and Object (24.00%). The remaining types each account for less than 1%.

Type        # of occurrences    % of total
Bool        6,512               0.1787%
Byte        297,578             8.1668%
Char        444,163             12.1898%
Class       1,454               0.0399%
Double      836                 0.0229%
Exception   168                 0.0046%
Float       6,374               0.1749%
Int         1,978,652           54.3028%
Long        7,837               0.2151%
Object      874,196             23.9917%
Short       3,034               0.0833%
String      22,935              0.6294%
Total       3,643,739           100.0000%

Table 3: Number of occurrences per type

At this point, the 32-bit types are highlighted. They are: (1) Class; (2) Exception; (3) Float; (4) Integer; (5) Object; (6) String. Each of those 6 types has its own probability distribution of values, plotted in Figure 4.

Figure 4: Probability distribution of values by 32-bit type (log scale)

From the distributions it is possible to spot the similarity among the types: (1) Class; (2) Exception; (3) Object; (4) String. All 4 of them have a predominant peak a little after the value 0x40000000. This similarity can be explained by the fact that those 4 types are indeed references, therefore pointers to a memory address. Focusing only on the values around 0x40000000, the Float type could be confused with the reference types, because it also displays a peak around 0x40000000, although a much broader one; moreover, it has a second, lower peak around 0xc0000000. The Int type displays occurrences along the whole spectrum of values, featuring two more relevant peaks: one around 0x00000000 and the other around 0xffffffff. Those two peaks can be explained by a greater occurrence of integers with small absolute values, positive and negative respectively.

6. CONCLUSION

This paper explained a technique to capture memory data along with its corresponding data type in an emulated Android OS. This technique required deactivation of the optimization process called Just In Time compilation and modification of the interpreter ARM code. The technique creates an expected overhead on the Android execution time. As this technique was only designed to run in emulated Android, this overhead is not an issue. The technique allowed us to collect important statistical information that let us distinguish memory values between references (Class, Exception, Object, String), Float and Integer


types. Beyond this specific test case, this technique could be used to build a statistical data corpus of Android memory content. This data corpus may become a tile in the work of paving the ground for the development of a consistent deep memory analysis tool.

7. ACKNOWLEDGEMENTS

This work was supported by research grants (BEX 9072/13-6) from Science Without Borders, implemented by the CAPES Foundation, an agency under the Ministry of Education of Brazil.

REFERENCES

Ing. M.F. Breeuwsma. Forensic imaging of embedded systems using JTAG (boundary-scan). Digital Investigation, 3(1):32-42, 2006. ISSN 1742-2876. doi: 10.1016/j.diin.2006.01.003.

David Ehringer. The Dalvik virtual machine architecture, 2010.

Google. Google I/O 2010 - A JIT compiler for Android's Dalvik VM. Google Developers, May 2010. URL www.youtube.com/watch?v=Ls0tM-c4Vfo. Accessed 6th March 2015.

Google. Android source code repository, 2015. URL https://android.googlesource.com/platform/manifest. Accessed 11th February 2015.

Liu Guangqi, Wang Lianhai, Zhang Shuhui, Xu Shujiang, and Zhang Lei. Memory dump and forensic analysis based on virtual machine. In Mechatronics and Automation (ICMA), 2014 IEEE International Conference on, pages 1773-1777, Aug 2014. doi: 10.1109/ICMA.2014.6885969.

C. Hilgers, H. Macht, T. Muller, and M. Spreitzenbarth. Post-mortem memory analysis of cold-booted Android devices. In IT Security Incident Management & IT Forensics (IMF), 2014 Eighth International Conference on, pages 62-75, May 2014. doi: 10.1109/IMF.2014.8.

Zhiqiang Lin. Reverse Engineering of Data Structures from Binary. PhD thesis, CERIAS, Purdue University, West Lafayette, Indiana, August 2011.

Zhiqiang Lin, Junghwan Rhee, Xiangyu Zhang, Dongyan Xu, and Xuxian Jiang. SigGraph: brute force scanning of kernel data structure instances using graph-based signatures. In 18th Annual Network & Distributed System Security Symposium Proceedings, 2011.

Joe Sylve, Andrew Case, Lodovico Marziale, and Golden G. Richard. Acquisition and analysis of volatile memory from Android devices. Digital Investigation, 8(3-4):175-184, 2012. ISSN 1742-2876. doi: 10.1016/j.diin.2011.10.003.

Volatility. The Volatility framework, 2015. URL http://www.volatilityfoundation.org/. Accessed 18th March 2015.

Pei-Hua Yen, Chung-Huang Yang, and TaeNam Ahn. Design and implementation of a live-analysis digital forensic system. In Proceedings of the 2009 International Conference on Hybrid Information Technology, ICHIT '09, pages 239-243, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-662-5. doi: 10.1145/1644993.1645038.


CHIP-OFF BY MATTER SUBTRACTION: FRIGIDA VIA

David Billard (1), Paul Vidonne (2)

(1) University of Applied Sciences in Geneva, Switzerland
[email protected]

(2) LERTI, France
[email protected]

ABSTRACT

This work introduces a previously unpublished technique for extracting data from flash memory chips, especially from Ball Grid Array (BGA) components. This technique does not need any heating of the chip component, as opposed to infrared or hot-air de-soldering. In addition, it avoids the need to re-ball the BGA in case of missing balls. Thus it enhances the quality and integrity of the data extraction. However, this technique is destructive for the device motherboard and has limitations when the memory chip content is encrypted. The technique works by subtracting matter by micro-milling, without heating. It has been extensively used in about fifty real cases for more than one year. It is named frigida via, in contrast to the calda via of infrared heating.

Keywords: Chip-off forensics, data extraction, BGA, data integrity preservation, micro-milling, infrared heating.

1. INTRODUCTION

Forensics laboratories are daily facing the challenge of extracting data from embedded or small-scale digital devices. In the better case, the devices are already known to commercial vendors of extraction tools and a proven method is available to the practitioner. In most cases, the devices are unknown, or broken, and then begins the fastidious search for a method to extract data from the device without jeopardizing the judicial value of the (hypothetical) concealed evidence.

When no software-based method exists, the de-soldering of the chip holding the data is accomplished. The chip is often a flash memory component, more and more often of Ball Grid Array (BGA) technology. The de-soldering, even when routinely executed, is not error-free and induces a heavy stress on the component. Furthermore, the control of the heating is based on temperature probes which are not always accurate enough. This leads to chips being heated too much or chips being torn off. In the first case, the data content may be altered, even destroyed on some occasions. In the second case, some balls of the BGA will stay on the motherboard and the practitioner will have to re-ball the chip in order to extract data using a BGA reader.

As an example, the BGA component shown in figure 1 comes from a cell phone motherboard. The labeling on the chip is very clear: it is a NAND chip and the edges of the chip are sharp.

Figure 1: BGA from a cell phone motherboard


The same chip, heated using infrared, is shown in figure 2. The component changed color (no labeling is visible any more) and the edges are blurred. The ball grid is also a bit wavy: the heating has a dramatic effect on the component. However, the component is still readable and data can be extracted. The ruler (in millimeters) has been added to give the reader a better idea of the component's size.

Figure 2: Heated BGA recto and verso

In this paper we propose a new method for taking BGA chips off the motherboard without heating them. In fact, instead of taking the chip off, we remove the motherboard from under the chip. We use micro-milling technology and we subtract matter from the motherboard on the other side of the chip, until we reach the ball grid. The process is constantly monitored and controlled, and it stops when reaching the balls. A result of this process is shown below.

The Micron chip presented in figure 3 is still attached to the motherboard. The labeling is clear, and the edges of the chip are sharp.

Figure 3: Micron BGA on the motherboard

Once the milling process is done, the chip labeling is still as clear on the recto, and the grid balls are all present at the verso, as shown in figure 4.

Figure 4: Milled Micron BGA recto and verso

Since no heating has been applied, the chip content has been spared any stress and is intact. We have been using and refining this technique for about one year on fifty real cases. We had an issue with only one particular case, which is presented later in this work.

The paper is organized as follows: section 2 is a review of the literature about data extraction from flash components; section 3 presents the principle of the milling process, the machine and the interaction with precision bar turning; section 4 lists some lessons learned in using this technique compared to infrared heating and presents a comparative table of pros and cons.

2. RELATED WORKS

An extensive literature exists about extracting data from flash (or EEPROM) memory chips. Most of this literature assumes that the device is in working order. For instance, (Breeuwsma, 2006) addresses the use of JTAG (boundary-scan) in order to bypass or trick the processor or the memory controller. In (Sansurooah, 2009), the author addresses the use of flasher tools in order to load a bootloader into the device memory; this bootloader is designed to gain access to low-level memory management, thus enabling the reading of all memory blocks. Some papers, like (Fiorillo, 2009), use hot-air de-soldering to compare the content of flash memory chips before / after some writing of data. In (Willassen, 2005), several ways of de-soldering chips are mentioned, all based on heating the component (hot air, infrared, ...). In a remarkable presentation, (van der Knijff, 2007) presents an overview of most techniques for chip-off and JTAG access.

Commercial products like (Cellebrite, 2015) or (Microsystemation, 2015) are based on several techniques in order to gain access to the low-level memory. Although these tools are not suited for chip-off, they provide the ability to decode memory dumps extracted from flash memory chips.

To our knowledge, the memory reading of broken / dismantled digital devices is done either by heating


and chip-off, or sometimes by entirely reconstructing the device around the flash memory. Our paper brings a previously unpublished approach, requiring no heating, thus enhancing the integrity and quality of the data extraction. It is especially designed for broken devices, but it also works for running devices, with some limitations, discussed in Sec. 4.

3. SUBTRACTING MATTER

3.1 PRINCIPLE

The aim of the technique is to subtract matter around the component. For a BGA component, this amounts to obliterating the motherboard and its other components, leaving the BGA component alone. The technique can be summarized in the following steps:

1. Localization step: since the motherboard is milled at its verso, just under the memory chip, the cutting tool has to be directed to the location of the chip while the chip is hidden by the motherboard. Thus it is necessary to locate the chip on the verso side of the motherboard by measuring distances from the board sides to the chip sides on the recto side, then using the measures to draw the shape of the chip on the verso of the motherboard. Figure 5 presents a photograph of the drawing of the shape of the chip on the verso of the motherboard.

Figure 5: Localization step (drawing the chip shape at the verso)

2. Revolving step: turning the BGA component, still attached to its part of the motherboard, on itself, in order to have the motherboard facing up (thus the component facing down).

3. Peeling step: using a milling cutter to cut the motherboard, layer by layer, until shortly before reaching the grid balls. Sometimes it also means cutting layers of the BGA component, when the grid balls are lightly encased into the chip. Figure 6 presents a photograph of the milling cutter sawing through the motherboard until the grid balls are exposed.

Figure 6: Peeling step (milling to the grid balls)

For this milling step, it is of utmost importance that the milling cutter head and the motherboard be perfectly aligned at 90°. Even a very small angle deviation may lead to a catastrophic bite of the milling cutter into the BGA component. In that case, the component may be utterly destroyed.

4. Cleansing step: removing the last bits of motherboard layer and epoxy that may still adhere to the grid balls.

Once those steps are finished, there is no need to re-ball the component, since no ball has been lost. The component can be used straight away in a flash reader, provided that the practitioner has the right pinout module.

The upper image in figure 7 represents a sectional view of a BGA, taken from (Guenin, 2002). The lower image represents the working of the milling cutter, subtracting the motherboard and leaving the grid balls exposed.


Figure 7: Process illustrated (sectional view of the BGA soldered to the motherboard, and detached by milling)

3.2 VARIANT

In some cases, in particular when processor and memory are piled one on top of the other, before the localization step the motherboard has to be cut all around the component, either by drilling holes close to the four sides (like old-fashioned stamps) or by drilling one hole and using a fretsaw all around the BGA component. This operation is called the punching step, and figure 8 presents a photograph of this step.

Figure 8: Punching step (separating the component from the others)

3.3 MACHINE

The machine used for the milling is a standard precision micro-milling machine from Proxxon (Proxxon, 2015). It must be capable of 0.05 millimeter steps (0.002 inch), with a rotating speed varying from 5,000 to 20,000 rpm (revolutions per minute). The milling cutters usually have a diameter between 1 and 3 millimeters (0.04 to 0.12 inch). A watchmaker-grade magnifier, or a digital magnifier, is needed to control and verify the peeling step.

3.4 PRECISION BAR TURNING

The idea to implement this frigida via technique comes from interaction with specialists in precision bar turning. These people are specialized in manufacturing tiny pieces of hardware, like the gear wheels one can find in mechanical watches, or complex components with special alloys used in space satellites.

We were facing more and more devices locked to investigation due to their poor condition: a cell phone with a bullet hole, a GPS retrieved from a sunken boat, or a tablet barely surviving a plane crash. Using commercial tools or flash boxes was not an option, and infrared heating was adding additional stress on components already submitted to heavy stress. Therefore, instead of thinking like repairing firms, whose job is to detach an object in order to repair or analyze the failure of the whole device, we thought about isolating the memory from its external surroundings. In other words: obliterating the surrounding area, in order to leave the component exposed.

One of the first cases prompting us to use the milling was the investigation of a cell phone retrieved after a car chase between the police and three drug dealers. The motherboard was badly damaged and we feared that using infrared on the memory chip might inadvertently damage the chip further. After extensive testing on spare devices, the milling process was applied to the device remnants and information was successfully extracted.

4. LESSONS LEARNED AND METHOD COMPARISON

4.1 ENCRYPTION

The technique explained in this paper has to be used with prudence when dealing with encrypted devices. In a real case about narcotics, a BlackBerry 9720 was seized. It had a keyboard lock that the owner was not willing to part with. The frigida via was successfully used, and figure 9 presents the recto and verso images of the SKhynix chip.

Figure 8: Punching step (separating the component from the others)


Figure 9: Milled SKhynix BGA, recto and verso

But after reading the chip, it appeared that all the component content was encrypted. Finally, after some weeks, the password was supplied. Unfortunately, this password alone is not sufficient to decrypt the content: it must be used in conjunction with some hardware information contained in other components of the motherboard. Thus, even with the password, the memory remains encrypted.

4.2 PROCESS DURATION & COMPARISON

The milling technique takes between thirty minutes and one hour, depending on the quality of the motherboard. Namely, if the motherboard is flat, without any deformation, it takes less than thirty minutes; if the motherboard has been retrieved after a helicopter crash, it takes about one hour. Once the chip is off the motherboard, it is immediately available for reading, and the first contact in the reader socket is usually the good one.

The infrared (or hot air) method is usually shorter in time for the chip-off, thirty minutes being the upper limit of the process. However, the process can be impeded in many ways. First, the chip can lose grid balls during the process, some of them staying attached to the motherboard. After cooling the chip, many tries are needed to find which grid balls are missing, and additional time is needed to re-ball the chip, even if not all the grid balls need to be present, only the "useful" ones. The heating process also leaves residues of matter that have to be scraped off using toothbrushes or special treatment. Then several tries are also needed to place the chip correctly into the reader socket, since the edges of the chip are no longer rectilinear. Furthermore, the epoxy layer between the chip and the motherboard can glue the chip to the motherboard, even if the grid balls are melted. We did not find out whether the epoxy glues the chip and the motherboard together at heating time, or whether it happens during the assembly of the motherboard. In that case, even a heavy heating cannot desolder the chip, and will more likely destroy the content of the component.

In table 1 we summarize the main differences between calda via and frigida via.

Table 1: Comparison Infrared vs Milling

    Calda via: Infrared      Frigida via: Milling
    Heat damage              No heat applied
    Re-balling necessary     No need of re-balling
    Extensive cleansing      Light cleansing
    Resoldering possible     No resoldering
    Same process duration    Same process duration

Table 1 shows the most obvious differences between infrared and milling. But even if milling seems superior in many aspects with respect to infrared, we are still using the two techniques on cases. The choice of the technique to apply is dictated by several factors, among which:

1. the availability of the machines;
2. the risk of finding encrypted data linked to hardware components;
3. the risk of damaging the chip by heating;
4. the likelihood of epoxy presence gluing the memory chip and the motherboard;
5. the training of the practitioner.

When facing a chip-off, we apply a risk-based decision matrix in order to decide between calda and frigida via; a hypothetical sketch of such a matrix is given below.
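The following sketch illustrates how such a decision matrix could be expressed; the factor names mirror the list above, but the weights, the threshold and the function are invented for the example and are not the authors' actual tool.

    # Hypothetical sketch of a risk-based decision matrix for choosing between
    # calda via (infrared) and frigida via (milling). Positive weights favor
    # milling; the weights and threshold are invented for illustration.
    FACTOR_WEIGHTS = {
        "milling_machine_available":   +2,  # frigida via requires the micro mill
        "encryption_tied_to_hardware": -2,  # milling forbids resoldering the chip
        "chip_fragile_or_damaged":     +2,  # heating risks destroying the content
        "epoxy_gluing_chip_to_board":  +1,  # epoxy defeats the infrared chip-off
        "practitioner_trained_on_mill": +1,
    }

    def choose_technique(case_facts: dict) -> str:
        """Return the suggested chip-off technique for a given case."""
        score = sum(w for factor, w in FACTOR_WEIGHTS.items()
                    if case_facts.get(factor))
        return "frigida via (milling)" if score > 0 else "calda via (infrared)"

    # Example: a damaged chip, epoxy present and a mill at hand favor milling.
    print(choose_technique({"chip_fragile_or_damaged": True,
                            "epoxy_gluing_chip_to_board": True,
                            "milling_machine_available": True}))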


5. CONCLUSION

In this paper, we present a new technique for extracting data from flash memory chips, especially from Ball Grid Array (BGA) components. This technique, called frigida via (or milling), is complementary to infrared or hot air chip-off processes and offers many new possibilities. Instead of relying on the heating of the solder of the BGA component, in the hope that the component will detach cleanly from the motherboard, the technique presented in this paper relies on subtracting the motherboard from the component. The motherboard is milled under the chip until the grid balls are exposed. At the end of the process, the chip is freed from the motherboard and can be placed on a reader socket for further analysis.

Since this technique does not require any heating of the chip component, as opposed to infrared or hot air de-soldering, it avoids the inadvertent degradation of the memory. As a matter of fact, the component may already be weakened by external causes, or simply of fragile design, and using heating, even with careful control of the temperature, may lead to the destruction of the memory content. Therefore, the frigida via is more respectful of the data integrity, since it does not impose additional stress on the memory chip, and the quality of the data extraction is enhanced.

In addition, the frigida via avoids the need of re-balling the BGA in case of balls missing at the wrong place. It also eliminates the problem of the epoxy gluing the memory chip and the motherboard in some devices.

However, this technique is destructive for the device motherboard, and re-soldering of the chip component is impossible. That impossibility is a severe limitation when the memory content is encrypted by a combination of password and hardware-related information. The technique works and has been used in about fifty real cases, for more than one year.

REFERENCES

Breeuwsma, I. M. (2006). Forensic imaging of embedded systems using JTAG (boundary-scan). Digital Investigation, 3(1), 32-42. doi:10.1016/j.diin.2006.01.003

Cellebrite. (2015). UFED mobile forensics. Retrieved from http://www.cellebrite.com

Fiorillo, S. (2009, December). Theory and practice of flash memory mobile forensics. Proceedings of the 7th Australian Digital Forensics Conference, 52-84.

Guenin, B. (2002, February). The many flavors of ball grid array packages. Electronics Cooling. Retrieved from http://www.electronics-cooling.com/2002/02/the-many-flavors-of-ball-grid-array-packages/

Microsystemation. (2015). XRY mobile forensics. Retrieved from https://www.msab.com/

Proxxon. (2015). Precision lathe and milling systems. Retrieved from http://www.proxxon.com

Sansurooah, K. (2009, December). A forensics overview and analysis of USB flash memory devices. Proceedings of the 7th Australian Digital Forensics Conference, 99-108.

van der Knijff, R. (2007). 10 good reasons why you should shift focus to small scale digital device forensics. Retrieved from http://www.dfrws.org/2007/proceedings/vanderknijff_pres.pdf

Willassen, S. Y. (2005). Forensic analysis of mobile phone internal memory. Retrieved from http://digitalcorpora.org/corpora/files/Mobile-Memory-Forensics.pdf


THE EVIDENCE PROJECT: BRIDGING THE GAP IN THE EXCHANGE OF DIGITAL EVIDENCE ACROSS EUROPE

Maria Angela Biasiotti, Mattia Epifani, Fabrizio Turchi

Institute of Legal Information Theory and Techniques of the Italian National Research Council

Florence, Italy, 50127

[email protected], [email protected], [email protected]

ABSTRACT The very nature of data and information held in electronic form makes it easier to manipulate than traditional forms of data. All legal proceedings rely on the production of evidence in order to take place, and electronic evidence is no different from traditional evidence in that the party introducing it into legal proceedings must be able to demonstrate that it is no more and no less than it was when it came into their possession. Based upon these assumptions, the EVIDENCE Project aims at providing a road map (guidelines, recommendations, technical standards) for realising the missing Common European Framework for the systematic and uniform application of new technologies in the collection, use and exchange of evidence. This road map, incorporating standardized solutions, aims at enabling all involved stakeholders to rely on an efficient regulation, treatment and exchange of digital evidence, having at their disposal as legal/technological background a Common European Framework allowing them to gather, use and exchange digital evidence according to common standards, rules, practises and guidelines. EVIDENCE activities will also aim at enabling the implementation of a stable network of experts in digital forensics, communicating and exchanging their opinions, and contributing as well to the building up of a stable communication channel between the public and the private sectors dealing with electronic evidence.

Keywords: digital evidence, digital evidence exchange, metadata, formal languages.

1. THE CONTEXT

All legal proceedings rely on the production of evidence in order to take place. Electronic evidence is no different from traditional evidence in that it is necessary for the party introducing it into legal proceedings to be able to demonstrate that it is no more and no less than it was when it came into their possession. In other words, no changes, deletions, additions or other alterations have taken place. The very nature of data and information held in electronic form makes it easier to manipulate than traditional forms of data. When acquired and exchanged, the integrity of the information must be maintained and proved.

Legislations on criminal procedures in many European countries were enacted before these technologies appeared, thus taking no account of them and creating a scenario where criteria are different and uncertain, regulations are not harmonized and aligned, and therefore exchange among EU Member States jurisdictions and at transnational level is very hard to realize. What is missing is a Common European Framework to guide policy makers, law enforcement agencies and judges when dealing with digital evidence treatment and exchange. The EVIDENCE project interpreted this request by defining it as:

• the need for a common background for all actors involved in the Electronic Evidence life-cycle: policy makers, LEAs, judges and lawyers;
• the need for a common legal layer devoted to the regulation of Electronic Evidence in Courts;
• the need for standardized procedures in the use, collection and exchange of Electronic Evidence (across EU Member States).


In response to the above needs and gaps, the EVIDENCE project aims at providing a Road Map (guidelines, recommendations, technical standards) for realizing the missing Common European Framework for the systematic and uniform application of new technologies in the collection, use and exchange of evidence. This Road Map, incorporating standardized solutions, would enable policy makers to realize an efficient regulation, treatment and exchange of digital evidence, and LEAs as well as judges/magistrates, prosecutors and lawyers practising in the criminal field to have at their disposal, as legal/technological background, a Common European Framework allowing them to gather, use and exchange digital evidence according to common standards and rules.

In order to produce this common, unique European approach to the treatment and exchange of electronic evidence, the EVIDENCE project has identified the following steps as relevant:

• Developing a common and shared understanding of what electronic evidence is and which are the relevant concepts of electronic evidence in involved domains and related fields (digital forensics, criminal law, criminal procedure, criminal international cooperation);
• Detecting which rules and criteria are utilized for processing electronic evidence in EU Member States, and eventually how the exchange of evidence is regulated;
• Detecting the existence of criteria and standards for guaranteeing reliability, integrity and chain of custody requirements of electronic evidence in the EU Member States, and eventually in the exchange of it;
• Defining operational and ethical implications for Law Enforcement Agencies all over Europe;
• Defining implications on data privacy issues;
• Identifying and developing technological functionalities for a Common European Framework in gathering and exchanging electronic evidence;
• Seizing the EVIDENCE market.

The project is now at its halfway mark: steps 1, 5 and 7 are completed, whilst steps 2, 3, 4 and 6 are on the way to produce their final assessment.

2. PRELIMINARY REMARKS ON THE CONCEPT OF ELECTRONIC EVIDENCE

Before going for any kind of classification, the very first issue at stake has been to set the right scenario and to fix the range and scope of the categorization task with respect to the Project aims and goals. In this sense, our aim is to develop a framework for the application of new technologies in the collection, use and exchange of evidence between Courts of the EU Member States. So, the main keywords to be considered are: Source of Evidence, Authenticity, Evidence, ICT and Exchange.

The use of ICT associated with evidence is often described utilizing two main expressions: Electronic Evidence and Digital Evidence. Is the first one different from the second, or are they just synonyms? We know for sure that both electronic and digital evidence originate from the so-called sources of evidence, and that there is a specific need to carry on a forensics analysis in order to identify the evidence itself. We are also aware of the fact that these sources might be electronic or non electronic, and that in the latter case they can acquire the status of "digital/electronic evidence" if digitized.

The analysis of the most significant sources of information demonstrated that there is no uniform use of the terms that identify this domain. Indeed, both digital evidence and electronic evidence are accepted terms in the scientific community. For instance, the International Standard Document ISO/IEC 27037, "Guidelines for identification, collection, acquisition, and preservation of digital evidence", prefers the term digital evidence, because it refers to data that is already in a digital format and does not cover the conversion from analogical data into digital form. On the other hand, authoritative sources such as the Council of Europe have opted for the term Electronic Evidence in the recently published "Electronic Evidence Guide" (Council of Europe, 2013).


Moreover, there are many different definitions of "Electronic/Digital Evidence", each of them highlighting some, but not all, essential features. The following are the main definitions we have collected/analysed so far (Mason, 2012):

• any data stored or transmitted using a computer that support or refute a theory of how an offense occurred or that address critical elements of the offense such as intent or alibi (Carrier, 2006);
• digital evidence is any data stored or transmitted using a computer that support or refute a theory of how an offense occurred or that address critical elements of the offense such as intent or alibi (Casey, 2011).

None of the above cited definitions of digital evidence or electronic evidence matched our needs, therefore we finally decided to adopt the following original definition:

Electronic Evidence is any data resulting from the output of an analogue device and/or a digital device of potential probative value that are generated by, processed by, stored on or transmitted by any electronic device. Digital evidence is that electronic evidence which is generated or converted to a numerical format.

Therefore, the EVIDENCE Project activities are based upon its own core definition, capable, in our opinion, of catching all the various sides and challenges of Electronic Evidence, relying on its very general abstraction level. Based upon this definition, our statement is that within the Electronic Evidence category both those evidence that are "born digital" and those "not born digital", but that may have become such during their life-cycle, are to be included.

As a matter of fact, electronic evidence and digital evidence in our conceptualization do coincide (see Figure 1). Therefore, we will assume that, semantically speaking, Electronic Evidence is the broader class including both those records "born digital" as well as those "not born digital" but digitized afterwards. Once the digitization process has been carried out, the Evidence becomes "electronic" even if it was originally "non electronic" or analogical. Figure 1 depicts the relationship between the Electronic Evidence and the other forms in which it may appear, with a specific focus on:

Figure 1: From Sources of Evidence to Electronic Evidence


• Not Electronic items that should be digitized, and therefore are afterwards treated as if they were born-digital, once the authenticity is assured as related to the original one;
• Electronic items in some sort of analogical form, which, as in the case of the Not Electronic items, should be digitized.

In the same Figure 1 it is to be noted that:

• Arrows represent the process of transformation needed to generate the transition from "Non Electronic" or from "Analogical" to Digital items.
• Lines show that no process is needed and that the evidence is per se electronic.

Of course, the transition from Analogical or Not Electronic to Electronic is not an essential step; it may happen but is not mandatory. In this way we can include every type of evidence present in paper documents, objects, court hearings with witnesses and other, that, due to the increasing use of ICT, are frequently objects of digitization. Therefore, we prefer to use the term Electronic Evidence, which in our opinion comprises a larger range of items/potential evidence.

3. ELECTRONIC EVIDENCE LIFE CYCLE

Starting from the relevant concepts extracted both manually and semi-automatically, this step of the project was focused on the identification and classification of the building blocks of the conceptual model oriented to the description of the Electronic Evidence domain. The structuring is mainly based upon the electronic evidence life-cycle as described in Figure 2. Having clarified the starting point of the conceptualization and the choice of the term preferred for the categorization, it is worthwhile to describe the flow to which actions are referred in the digital forensics domain. Therefore, a brief description of the digital forensics procedures will outline the process used to manage electronic evidence. The very first milestone starts with an incident, an unlawful criminal, civil or commercial act, and sets the scene for the electronic evidence life-cycle scenario. Indeed, an artefact or a record enters into the forensic process only if an incident forces it to do so. Otherwise, for all of its natural lifespan the artefact or record will remain outside the forensic process and thus forensically irrelevant – though it may continue to be very relevant to its user or owner.

Figure 2: Electronic Evidence management timeline/life cycle


The phases we have taken into consideration are chiefly based on already existing investigative process models and on ISO 27043, which represents a reference point aiming at creating a harmonized model on the basis of the other existing models. The digital evidence management timeline/life-cycle consists of the following main phases regarding the handling of electronic evidence, starting from the incident event (a minimal data-structure sketch of these phases is given at the end of this section):

• Case Preparation: this is the first step of the digital evidence management timeline, and it comprises organizational, technical and investigative aspects.
• Evidence Identification: this is the step consisting of examining/studying the scene in order to preserve, as much as possible, the original state of the digital/electronic devices that are going to be acquired.
• Evidence Handling: this is the step where it is defined which specific standard procedures are to be followed, based on the kind of device being handled.
• Evidence Classification: this is the step consisting of identifying the main features and the status of the device, taking notes about Case ID, Evidence ID, Seizure place/date/made by, Evidence type, picture, status, etc.
• Evidence Acquisition: this is one of the most critical phases within the digital evidence handling processes: the forensics specialist must take care of the potential digital evidence in order to preserve its integrity during the following processes, until the presentation before a Court.
• Evidence Analysis: this is a process heavily affected by the kind of case under investigation, the type of evidence to be handled, and the features related to each piece of evidence to be examined (e.g. installed operating system, type of file system, etc.).
• Evidence Reporting: this is one of the most critical steps. After the completion of the identification, acquisition and analysis activities, digital evidence specialists have to complete their job by producing a report with all the activities carried out and the outcome achieved. The report must contain all the details needed to allow the specialists to testify before a Court relying only on that document.

The investigation process model depicted in Figure 2 represents a simplified view of the whole process, because some concurrent processes have not been represented, such as Obtaining authorization, Documentation, Managing information flow, Preserving chain of custody, and Preserving digital evidence. Furthermore, it is not a sequential flow: it may be circular in some points and it might go back to certain steps. Examples could be:

• The analysis can reveal that some references to data sources have not been acquired.
• During the acquisition phase it might be possible to reconsider the acquisition plan to include more data sources.
• During presentation some questions may arise requiring further analysis in order to provide satisfactory answers.

More and more evidence may be generated in the course of most court hearings, with witnesses being recorded and their testimony entered into the official court record, irrespective of whether a case is criminal or civil. Furthermore, in our specific view, once the reporting phase is accomplished, the electronic evidence may open to the scenario of Electronic Evidence Exchange. In this case, the further step dedicated to the Presentation may take place before a National Court or before another EU Member State.
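As an illustration only — the class and field names below are ours, not an EVIDENCE project artifact — the phases and the classification fields above can be modelled as a small record structure that also keeps the back-and-forth history the text describes:

    from dataclasses import dataclass, field
    from enum import Enum

    class Phase(Enum):
        CASE_PREPARATION = "Case Preparation"
        IDENTIFICATION = "Evidence Identification"
        HANDLING = "Evidence Handling"
        CLASSIFICATION = "Evidence Classification"
        ACQUISITION = "Evidence Acquisition"
        ANALYSIS = "Evidence Analysis"
        REPORTING = "Evidence Reporting"

    @dataclass
    class EvidenceRecord:
        case_id: str            # fields taken from the Evidence
        evidence_id: str        # Classification step above
        seizure_place: str
        seizure_date: str
        seized_by: str
        evidence_type: str
        phase: Phase = Phase.CASE_PREPARATION
        history: list = field(default_factory=list)  # chain-of-custody notes

        def move_to(self, new_phase: Phase, note: str = "") -> None:
            # The life-cycle is not strictly sequential: moving back (e.g.
            # from Analysis to Acquisition to add a data source) is recorded,
            # not barred.
            self.history.append((self.phase, note))
            self.phase = new_phase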


Figure 3: Overview of exchange data between Legal Authorities

Figure 3 outlines, at a high level of description, the exchange process that takes place between the Requesting and Requested Legal Authorities involved in the case after the analysis or interpretation is completed.

4. MID-TERM RESULTS

In order to produce the Road Map, a specific set of objectives have been considered essential and a group of mid-term results have been achieved.

4.1 ELECTRONIC EVIDENCE DOMAIN CATEGORIZATION

Within the activities carried out in the Categorization work package6, a common and shared understanding has been developed on what electronic evidence is and which are the relevant concepts of electronic evidence in involved domains and related fields such as digital forensics, criminal law, criminal procedure and criminal international cooperation. A mind map representation of the whole categorization is visible via the following address: http://www.evidenceproject.eu/categorization.

4.2 LEGAL ISSUES PRELIMINARY RESULTS

One of the main goals of the project, addressed by the Legal Issues work package7, is the identification of a legal framework in the EU Member States governing the implementation of new technologies in processing evidence, including trans-border exchange. Some general considerations have been achieved on the basis of a pilot comparative study:

• There is no comprehensive international or European legal framework relating to e-evidence, only a few relevant legal instruments (e.g. the Cybercrime Convention);
• Although some regulation exists at national level, rules vary considerably even among countries with similar legal traditions (e.g. on admissibility issues);
• An interpretative evolution of the national criminal laws has been gradually developing, so as to apply (also) to e-evidence (amendments to existing norms);
• There has been an increase in knowledge and expertise of actors involved in the handling of e-evidence, but specific standards are still lacking;
• Several national data protection laws have been modified as a consequence of the introduction of antiterrorism measures;
• Different laws and practices of Member States contribute to create a situation of legal and practical uncertainty.

6 The activities have been developed by the CNR-ITTIG (Italy) and CNR-IRPPS (Italy), partners of the EVIDENCE project.
7 The activities have been developed by the University of Groningen (The Netherlands), partner of the EVIDENCE project.


4.3 DATA PROTECTION ISSUES

Another crucial goal of the project, addressed by the Data Protection Issues work package8, is the identification of data protection issues and remedies regarding the process of gathering and using electronic evidence. The following general considerations have been determined:

Secondary law: there are no valid regulations addressing data protection issues related to the collection of electronic evidence.

Conventions: the Cybercrime Convention contains procedural regulations on the collection of electronic evidence and data protection safeguards. The European Convention on Mutual Assistance in Criminal Matters addresses the exchange of evidence.

Art. 82 (2) TFEU: the EU has a legal competence to harmonise particular aspects of criminal procedure law such as:

• admissibility, which includes rules on means of collecting electronic evidence.

This competence could be used to set up a minimum standard of privacy safeguards to be established in relation to the use of certain means of collecting electronic evidence. Moreover, in most domestic legal frameworks rather few and not necessarily sufficient and/or congruent privacy safeguards related to electronic evidence exist. Examples could be:

• Procedural Law: Structure and Rules – very few definitions of electronic evidence exist;
• Cross-Border Scenarios & International Law – in Cloud computing environments legal issues are not sufficiently or not at all addressed by law;
• Investigative Measures – existing rules often apply both to physical and electronic evidence;
• Admissibility – not regulated specifically.

4.4 DIGITAL FORENSICS TOOLS CATALOGUE

Starting from the Digital Evidence life-cycle shown in Figure 2, there are already standards for many of the phases depicted. In particular, for the acquisition and investigative processes, ISO 27043, ISO 27037 and ISO 27042 represent points of reference. In composing the overview of existing standards for the handling of electronic evidence, within the activities related to the Standard Issues work package9, a huge number of digital forensics tools have been gathered, and a Digital Forensics Tools Catalogue has been created, concerning tools for the Acquisition and Analysis phases as described at different levels of detail by the above-mentioned ISO/IEC standards.

The Catalogue represents the overview of forensics tools for handling digital evidence generally accepted in the EU Member States. The Catalogue, in its current version 1.0 dated February 2015, comprises over 1,200 tools divided into two main branches: Acquisition and Analysis. The Digital Forensics Tools Catalogue is visible via the following URL: http://wp4.evidenceproject.eu
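Purely as an illustration of what one record in such a catalogue could carry — the field names and the tool name below are invented, not the Catalogue's actual schema — an entry might record the branch and the life-cycle phases a tool covers:

    # Invented example entry; not the actual schema of the Digital Forensics
    # Tools Catalogue at http://wp4.evidenceproject.eu.
    catalogue_entry = {
        "name": "ExampleImager 2.1",         # hypothetical tool
        "branch": "Acquisition",             # one of the two main branches
        "phases": ["Evidence Acquisition"],  # life-cycle phases covered (Fig. 2)
        "standards": ["ISO/IEC 27037"],      # reference standards it addresses
        "platforms": ["Windows", "Linux"],
    }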

8 The activities have been developed by the Leibniz Universität Hannover (Germany), partner of the EVIDENCE project.
9 The activities have been developed by the CNR-ITTIG (Italy), partner of the EVIDENCE project.
10 The activities have been developed by the Laboratory of Citizenship Sciences (Italy), partner of the EVIDENCE project.


4.5 MARKET SIZE MAP OF ACTORS

Another relevant goal of the project, addressed by the Market Size work package10, is the identification and classification of the main types of actors involved in the "social arena" of electronic evidence. There are two types of actors having a direct interest in electronic evidence:

• Process Actors: public and private actors involved in handling the electronic evidence;
• Context Actors: actors providing technical solutions and assistance in this field.

Furthermore, there are nine typological areas of Process Actors, in turn comprising a total of 40 types of actors:

• Public law enforcement and Intelligence agencies (e.g. Law enforcement officers, Detectives, Intelligence agencies);
• Actors of the legal criminal trial (e.g. Judges, Prosecutors, Lawyers, etc.);
• Notaries;
• Public register actors (e.g. Business register actors, Civil acts register actors, Land register actors, etc.);
• Forensic examiners (e.g. Fraud examiner, Forensic laboratory staff member, Digital Evidence First Responder, etc.);
• Private investigators;
• Hardware producers (e.g. Hardware producers for Computer Forensics, for Mobile Forensics, etc.);
• Technology/software producers (e.g. Software houses that produce complete commercial toolkits for forensic analyses, that make software for specific commercial analyses, etc.);
• Service providers (e.g. Major consulting firms, Associated professional studios, etc.).

Finally, ten typological areas of Context Actors, in turn containing twenty-six types of actors, can be enumerated:

• Specialized International Organizations (e.g. UN agencies concerned with justice and technological innovation, etc.);
• Law making bodies (e.g. European organizations, National governments);
• Technological innovation actors linked to the Internet (e.g. Internet service providers, Cloud technology providers);
• Legal and forensic associations and networks (e.g. General legal and forensic associations and networks, Associations and networks concerned with issues linked to new technologies);
• Research bodies, associations and networks (e.g. Organizations and associations concerned with Internet and ICT, Academic institutions concerned with ICT, etc.);
• Actors involved in the field of human rights (e.g. Civil rights organizations, Privacy protection organizations, etc.);
• The media (e.g. Traditional and Social media, etc.);
• Enterprises interested in the proper functioning of justice (e.g. Individual firms, Business associations);
• Transnational projects (e.g. Digital forensics research projects and training);
• Other actors collecting evidence (e.g. Public and Private actors that collect data / potential evidence).

5. ELECTRONIC EVIDENCE EXCHANGE: STATUS QUO OVERVIEW

As far as the Exchange process (see Figure 3) is concerned, there is no standard published or proposed; furthermore, it represents one of the essential points of the EVIDENCE Project, which aims to facilitate and foster the exchange between different authorities and across the EU Member States. The project aims at defining functional specifications for exchanging digital evidence in such a way that, no matter what forensic tool is being used by an examiner, the results of his or her examination must be verifiable by another examiner, independently of the tool being used, as long as the tools are comparable in specification and function.

On the basis of the information gathered so far, it seems that, at the moment, in cross-border criminal cases, cooperation is mostly based upon international agreements or letters rogatory to the foreign Court. Independently from the legal framework identified by the EU Member States, the cooperation is mostly human based: the electronic evidence exchange is carried out between judicial stakeholders, from a source EU authority to another judicial authority in the target EU Member State. This approach is similar across countries and, at first glance, the Exchange does not appear based on any electronic means at all.

In most cases the forensics copy of the original source of evidence is exchanged: a judicial/police authority from an EU Member State A (requesting authority) requests an EU Member State B (requested authority) to generate a forensics copy, based on mutual trust between the two competent authorities. Later, the exchange of the forensic copy is attained on a human basis: the authority from country A instructs someone to take the copy, or the copy is delivered by a secure courier to the requesting authority. In any case, it has to be emphasized that no electronic means is involved in the exchange process.


To facilitate human cooperation, institutions such as Eurojust, Europol and Interpol have put in place systems or platforms in order to communicate/share relevant information. There are two different cross-border cooperation levels:

• the judicial cooperation, based almost exclusively on the regular international procedures for mutual assistance in criminal matters, regulated by strict procedures, time-consuming and unpredictable, but the only way for an evidence exchange;
• the investigation cooperation, simpler and quicker, but only for operational or technical information or coordination activities. During investigations there may be an information exchange that cannot be used during the trial beyond the pleading stage.

In many cases judicial authorities act relying on international agreements established through Eurojust to coordinate investigations and prosecutions between the EU Member States when dealing with cross-border crime.

The exchange of the electronic evidence should take place in a secure environment, relying on a service for exchanging the evidence in a secure manner. In order to achieve this goal, such a service will rely on digital certificates in order to certify the ownership of a public key. This would allow judicial authorities (relying parties) to rely upon signatures or assertions made by the private key that corresponds to the certified public key.

6. ELECTRONIC EVIDENCE EXCHANGE: EXISTING PLATFORMS

There are already existing platforms for the information exchange but, for confidentiality reasons, it has been almost impossible to collect detailed information about their architecture and the kind of information exchanged. The most important system in the evidence exchange is SIENA, which stands for Secure Information Exchange Network Application. It is a secure communication system managed by Europol, dedicated to the EU law enforcement community and based on the Universal Message Format (UMF) standard. SIENA is used for exchanging personal information related to the crime areas within the mandate of Europol, including EU restricted information.

7. ELECTRONIC EVIDENCE EXCHANGE: PROPOSED STANDARDS

The requirement for a standard language to represent a broad range of forensics information and processing results has become an increasing need within the forensics community. For the electronic evidence exchange a similar need has to be addressed, even though the aim of the exchange may address different issues, for example malware analysis, relevant artifacts exchange, or tools result comparison. Research activities conducted in this field have been used to develop and propose many languages.

CybOX (Cyber Observable eXpression) is one of the most important languages that have been recently proposed. It has been devised, along with other related languages, by Mitre.org, such as CAPEC (Common Attack Pattern Enumeration and Classification), STIX (Structured Threat Information eXpression) and TAXII (Threat Automated eXchange of Indicator Information). The use of standard languages for the information exchange has been dealt with in recent scientific contributions, published in 2014 by the European Union Agency for Network and Information Security (ENISA), in particular Actionable information for Security Incident Response and Standards and tools for exchange and processing of actionable information. Another relevant resource is a recent document (Casey, 2014) that proposed DFAX (Digital Forensic Analysis eXpression), which leverages CybOX for representing the technical information.
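To make the idea of a standard exchange language concrete, the snippet below sketches the kind of structured record such a language might carry; every key is an assumption made for this illustration and is not taken from the actual CybOX or DFAX schemas.

    # Illustrative only: a structured evidence-exchange record of the kind a
    # standard language such as CybOX or DFAX could express. All field names
    # below are invented for this sketch.
    evidence_record = {
        "case_id": "EU-2015-0042",                 # invented identifier
        "source_of_evidence": "mobile phone",
        "artifact_type": "forensic disk image",
        "image_sha256": "<hex digest>",            # lets any examiner re-verify integrity
        "acquisition_tool": "<tool name/version>", # so results are tool-independent
        "requesting_authority": "Member State A",
        "requested_authority": "Member State B",
        "chain_of_custody": [                      # who handled the copy, and when
            {"actor": "police authority A", "action": "acquired",
             "utc": "2015-02-01T10:00Z"},
        ],
    }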


8. ELECTRONIC EVIDENCE EXCHANGE CHALLENGES

The regular international procedures for mutual assistance in criminal matters are time-consuming and unpredictable, but they represent, at the moment, the only way for the evidence exchange. Nevertheless, the current situation may pose obstacles to fighting serious cross-border and organized crime, especially in investigative cases where time is crucial.

Furthermore, when it comes to Electronic Evidence Exchange, a group of questions are to be borne in mind:

• What information should be exchanged?
• When may the exchange take place?
• How could the information be exchanged, even taking into consideration security issues?
• Which kinds of stakeholders are involved?

The present situation raises three main issues:

• exchange evidence procedures may be slow. This aspect must be especially borne in mind in investigative cases where time is crucial for fighting serious cross-border and organized crime;
• exchange evidence procedures may involve big expenses, such as in the case of travelling abroad to take the original/copy source of evidence to be handled;
• Judicial and Police authorities must invest a lot of money to keep up with the development of forensics technology.

In order to address these issues, a possible solution could be using a cloud environment, centralized or distributed, for exchanging/sharing evidence, where the users could be competent authorities (e.g. judicial, police, etc.) but private subjects as well. This platform could speed up the exchange procedures, and it could avoid, except for special cases, travelling abroad to take the original source of evidence. Moreover, through a digital platform, a wider cooperation could be put in place and, for example, specific technical support could be requested through the same digital platform, from a police authority to another located in a different EU Member State. A more developed technological cooperation among the involved authorities could optimize costs and better distribute resources.

9. CONCLUSIONS

At the moment, there is no standard for the exchange, and it is mostly human based. Only in the case of data held by third parties is there a well-established cooperation between judicial authorities and Internet Service Providers (ISPs). In this context the exchange is managed through platforms provided by ISPs via the web. This scenario may pose serious issues:

• exchange evidence procedures may be slow: this must be especially borne in mind in investigative cases where time is crucial for fighting serious cross-border and organized crime;
• exchange evidence procedures may involve big expenses, such as in the case of travelling abroad to take the original/copy source of evidence to be handled;
• Judicial and Police authorities must invest a lot of money to keep up with the development of forensics technology: expenses related to software updating and to keeping up personnel competencies;
• the exchange desperately needs trusted procedures and environments between the involved stakeholders.

So the way forward for the electronic evidence exchange would be introducing a cloud environment to be used by judicial and police authorities and by private stakeholders, in order to speed up the process, optimize costs and foster a more developed cooperation and trust among the involved competent authorities. Moreover, using this platform it could be possible to carry out an electronic evidence exchange using specific metadata along with the data related to the source of evidence. This metadata, expressed in an open standard language, could describe the digital evidence in a unique way and be used by software companies/producers to represent the widest range of forensic information and forensic processing results, in order to share structured information between independent tools and organizations.
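The trusted procedures called for above echo the certificate-based service described in Section 5, where relying parties verify signatures made with the private key matching a certified public key. As a minimal sketch only — assuming the Python cryptography package, and omitting certificate issuance and key management entirely — signing and verifying a forensic copy could look like this:

    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import padding, rsa

    # In practice the key pair would be bound to an authority by a certificate;
    # here one is simply generated for the sketch.
    private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    forensic_copy = b"...bytes of the exchanged evidence package..."

    pss = padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                      salt_length=padding.PSS.MAX_LENGTH)
    signature = private_key.sign(forensic_copy, pss, hashes.SHA256())

    # The relying party verifies with the certified public key; verify()
    # raises InvalidSignature if the copy was altered in transit.
    private_key.public_key().verify(signature, forensic_copy, pss,
                                    hashes.SHA256())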

REFERENCES

Carrier, B. (2006). Hypothesis-Based Approach to Digital Forensic Investigations. Center for Education and Research in Information Assurance and Security, Purdue University.

Casey, E. (2011). Digital Evidence and Computer Crime: Forensic Science, Computers, and the Internet. Elsevier, Third Edition.

Casey, E., Back, G., Barnum, S. (2015). Leveraging CybOX to standardize representation and exchange of digital forensic information. Digital Investigation, 12S, 102-110. Elsevier.



Council of Europe. (2013). Electronic Evidence Guide. Retrieved on February 2015 from http://www.coe.int/t/dghl/cooperation/economiccrime/cybercrime/Documents/Electronic%20Evidence%20Guide/default_en.asp

Daniel, L., Daniel, L. (2011). Digital Forensics for Legal Professionals. Syngress Media Inc.

ISO/IEC 27037. (2012). Guidelines for identification, collection, acquisition and preservation of digital evidence. Retrieved on March 2015 from http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=44381

ISO/IEC 27043. (2015). Incident investigation principles and processes. Retrieved on March 2015 from http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=44407

Garfinkel, S. L. (2012). Digital forensics XML and the DFXML toolset. Digital Investigation. Elsevier.

Mason, S. (2012). Electronic Evidence, third edition. LexisNexis Butterworths.

Peterson, G., Shenoi, S. (Eds.). (2012). Advances in Digital Forensics VIII. Springer.


A COLLISION ATTACK ON SDHASH SIMILARITY HASHING

Donghoon Chang, Somitra Kr. Sanadhya, Monika Singh, Robin Verma

Indraprastha Institute of Information Technology Delhi (IIIT-D), India.

{donghoon,somitra,monikas,robinv}@iiitd.ac.in

ABSTRACT Digital forensic investigators can take advantage of tools and techniques that have the capability of finding similar files out of thousands of files up for investigation in a particular case. Finding similar files could significantly reduce the volume of data that needs to be investigated. Sdhash is a well-known fuzzy hashing scheme used for finding similarity among files. This digest produces a 'score of similarity' on a scale of 0 to 100. In a prior analysis of sdhash, Breitinger et al. claimed that 20% of the contents of a file can be modified without influencing the final sdhash digest of that file. They suggested that the file can be modified in certain regions, termed 'gaps', and yet the sdhash digest will remain unchanged. In this work, we show that their claim is not entirely correct. In particular, we show that even if 2% of the file contents in the gaps are changed randomly, then the sdhash gets changed with probability close to 1. We then provide an algorithm to modify the file contents within the gaps such that the sdhash remains unchanged even when the modifications are about 12% of the gap size. On the attack side, the proposed algorithm can deterministically produce collisions by generating many different files corresponding to a given file with maximal similarity score of 100.

Keywords: Fuzzy hashing, similarity digest, collision, anti-forensics.

1. INTRODUCTION

The modern world has been turning increasingly digital: conventional books have been replaced by ebooks, letters have been replaced by emails, paper photographs have been replaced by digital images, and compact audio and video cassettes have been replaced by mp3 and mp4 CDs/DVDs. Due to the reducing costs of storage devices and their ever increasing size, people tend to store several (maybe slightly different) versions of a file. In case a person is suspected of some illegal activity, security agencies typically seize their digital devices for investigation. Manual forensic investigation of an enormous volume of data is hard to complete in a reasonable amount of time. Therefore, it may be helpful for an investigator to reduce the data under investigation by eliminating similar files from the suspect's hard disk. On the other hand, in some situations, the investigator might be interested in looking only at files similar to a given file, in order to investigate modifications to that file.

Most forensic software packages contain tools which check for 'similarity' between files. Automatic filtering is normally done by measuring the amount of correlation between files. However, the correlation method does not work well if the adversary deliberately modifies the file in such a manner that the correlation value becomes very low. For example, a C program can be modified by changing the names of variables, writing looping constructs in a different way, adding comments, etc. Ideally, an investigator would like to efficiently know the percentage change in two versions of a file, so that he can concentrate on files which are slightly different from a desired file. Using a Cryptographic Hash Function (CHF) as a digest of the file does not work in this situation, as even a single bit change in the file content is expected to modify the entire digest randomly by the application of a CHF.

'Approximate Matching' is a technique for finding similarity among given files, typically by assigning a 'similarity score'. An approximate matching technique can be characterized into one of the following categories: Bytewise Matching, Syntactic Matching and Semantic Matching (Breitinger, Guttman, McCarrin, & Roussev, 2014). Bytewise Matching relies on the byte sequence of the digital object without considering the internal structure of the data object. These techniques are known as fuzzy hashing or similarity hashing. Syntactic Matching relies on the internal structure of the data object. It is also called Perceptual Hashing or Robust Hashing. Semantic Matching relies on the contextual attributes of the digital objects. Sdhash, proposed by Roussev (Roussev, 2010a) in 2010, is one of the most widely used fuzzy hashing schemes. It is used as a third party module in the popular forensic toolkit 'Autopsy/Sleuth-kit'1 and in another toolkit 'BitCurator'2.

Breitinger et al. analyzed sdhash in (Breitinger, Baier, & Beckingham, 2012; Breitinger & Baier, 2012) and commented that "approximately 20% of the input bytes do not influence the similarity digest. Thus it is possible to do undiscovered modifications within gaps". In this work, we show that this claim is not entirely correct. We show that if data between the 'gaps' is randomly modified, then the digest changes even when the modifications are only about 2% of the 'gap size'. After that, we propose an algorithm which can generate multiple files having sdhash similarity score of 100 corresponding to a given file, by modifying up to 12% of the 'gap size'. The proposed algorithm can also be used to carry out an anti-forensic mechanism that defeats the purpose of a digital forensic investigation by filtering out similar files from a given storage media. An attacker could generate multiple dissimilar files corresponding to a particular file with 100% matching sdhash digest using our technique.

The rest of the paper is organized as follows: We discuss related literature in §2. Notations and definitions used in the paper are provided in §3. The sdhash scheme is explained in §4, and existing analysis of the scheme is presented in §5. §6 contains our analysis and attack on sdhash, followed by our proposed algorithm. Finally, we conclude the paper in §7 and §8 by proposing solutions to mitigate our attack on sdhash.

2. RELATED WORK

The first fuzzy hashing technique, Context Triggered Piecewise Hashing (CTPH), was proposed by Kornblum (Kornblum, 2006) in his tool named ssdeep. The CTPH scheme is based on the spamsum algorithm proposed by Tridgell (Tridgell, 2002) for spam email detection. The ssdeep tool computes a digest of the given file by first dividing the file into several chunks and then concatenating the least significant 6 bits of the hash value of each chunk. A hash function named FNV is used to compute the hash of each chunk.

Chen et al. (Chen & Wang, 2008) and Seo et al. (Seo, Lim, Choi, Chang, & Lee, 2009) proposed some modifications to ssdeep to improve its efficiency and security. Baier et al. (Baier & Breitinger, 2011) presented a thorough security analysis of ssdeep and showed that it does not withstand an active adversary for blacklisting and whitelisting.

Roussev et al. (Roussev, 2009, 2010a) proposed a new fuzzy hashing scheme called sdhash. The basic idea of the sdhash scheme is to identify statistically improbable features, based on the entropy of consecutive 64 byte sequences of file data (each such sequence is called a 'feature'), in order to generate the final hash digest of the file. Breitinger et al. (Breitinger & Baier, 2012) showed some weaknesses in sdhash and presented improvements to the scheme. Detailed security and implementation analysis of sdhash was done in (Breitinger et al., 2012) by the same authors. This work uncovered several implementation bugs and showed that it is possible to beat the similarity score by tampering a given file without changing the perceptual behavior of this file (e.g. image files look almost the same despite the tampering).

1 http://wiki.sleuthkit.org/index.php?title=Autopsy_3rd_Party_Modules
2 http://wiki.bitcurator.net/?title=Software

3. NOTATIONS

The following notations are used throughout this work:

• D denotes the input data object of N bytes, D = B0 B1 B2 ..... BN-1, where Bi is the i-th byte of D.
• fk denotes a feature of D: a subsequence of L (= 64) consecutive bytes of D starting at byte Bk, that is fk : Bk Bk+1 Bk+2 ... Bk+63, where 0 ≤ k ≤ N-64.
• #bf represents the number of features within a bloom filter bf.
• |bf| denotes the number of bits set to one within the bloom filter bf.
• t denotes some threshold (sdhash uses t = 16).
• SFscore(bf1, bf2) represents the similarity score of bloom filters bf1 and bf2.

4. DESCRIPTION OF SDHASH

We now describe the working of sdhash using the notation defined in §3. Given a data object D of length N bytes (B0 B1 B2 ..... BN-1), a feature fk is a subsequence of L (= 64) consecutive bytes of D. For each feature fk, let nbfk denote the distribution of byte values within the feature, over the alphabet α of 256 possible byte values. The entropy of nbfk is

    H(nbfk) = - Σ_{x ∈ α} P[nbfk = x] · log2 P[nbfk = x]

so that Hmax(nbfk) = log2 |α| = 8 and Hmin(nbfk) = 0. The normalized entropy of nbfk is H(nbfk) / Hmax(nbfk) = H(nbfk) / 8; its range is 0 to 1. It is scaled up to the range 0 to 1000 and represented by Hnorm(nbfk):

    Hnorm(nbfk) = ⌊1000 · H(nbfk) / 8⌋
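As a small illustration of the formulas above — a sketch under our reading of nbfk as the byte-value distribution of a feature, not the authors' code — the scaled normalized entropy of one 64-byte feature can be computed as follows:

    import math
    from collections import Counter

    L = 64  # feature length in bytes, as in sdhash

    def h_norm(feature: bytes) -> int:
        """Scaled normalized entropy H_norm(nbf_k) = floor(1000 * H / 8)."""
        assert len(feature) == L
        counts = Counter(feature)  # empirical byte-value distribution
        h = -sum((c / L) * math.log2(c / L) for c in counts.values())
        return int(1000 * h / 8)   # H_max = log2(256) = 8

    # A constant feature has entropy 0; a feature with 64 distinct byte
    # values has entropy log2(64) = 6, i.e. h_norm = 750.
    assert h_norm(bytes(64)) == 0
    assert h_norm(bytes(range(64))) == 750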

After calculating the normalized entropy of each feature, a precedence rank is assigned to the respective feature of the data object D, based on the empirical observation of the probability density function for the normalized entropy over an experimental data set.

Let Q be the experimental data set of q data objects D1 D2 D3 ..... Dq of the same type and the same size. Here the random variable is the normalized entropy of a next data object's nbfk over the set Q, represented as nenfd_Q. Let A be the set of integers from 0 to 1000, i.e. {0, 1, 2, ...., 1000}. For each i ∈ A, let ti = Pr[nenfd_Q = i]; for example, t1000 = Pr[nenfd_Q = 1000].

We assign a rank ri to each ti as follows: ri = 1000 if ti is the largest, and ri = 0 if ti is the smallest. Now each feature fk of D is assigned a precedence rank Rprec,D(fk) as follows:

    ∀ fk of D, Rprec,D(fk) = ri, where Pr[nenfd_Q = Hnorm(nbfk)] = ti

where D is the given data object, n is the number of features of the data object D, and 0 ≤ k ≤ n-1.

W-neighboring features of feature fk: a feature fnb is called a W-neighboring feature of feature fk if k-W < nb < k+W and nb ≠ k.

Popularity rank of feature fk: fk attains popularity rank Rpop,D(fk) when its precedence rank is the leftmost lowest among its W-neighboring features fi with i < k and fj with j > k, where the number of such fi plus the number of such fj equals Rpop,D(fk) + W - 1.
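To make the window-based selection concrete, here is a simplified sketch of how popularity scores accumulate and how features reaching the threshold t are selected; this is our illustration, not the sdhash source, and it assumes the intended (bug-free) "leftmost lowest" behavior:

    def select_features(ranks, W=64, t=16):
        """Slide a window of W consecutive features; in each window the
        leftmost feature with the lowest non-zero precedence rank earns one
        point. Features whose score reaches t become the statistically
        improbable (selected) features."""
        scores = [0] * len(ranks)
        for i in range(len(ranks) - W + 1):
            window = ranks[i:i + W]
            nonzero = [r for r in window if r > 0]
            if nonzero:
                lowest = min(nonzero)
                scores[i + window.index(lowest)] += 1  # index() is leftmost
        return [k for k, score in enumerate(scores) if score >= t]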

Figure 2: Popularity rank calculation, from (Roussev, 2009, 2010a)

Now the features with Rpop,D(fk) ≥ t (threshold) are selected (in the sdhash implementation t = 16). The selected features are the least likely features to occur in any data object. These features are called "Statistically Improbable Features", and they will be used to generate the fingerprint of the data object D. Let {fs0, fs1, ....., fsx} be the selected features, where 0 ≤ s0 < s1 < ..... < sx ≤ n-1.

5. EXISTING ANALYSIS OF SDHASH

Detailed security and implementation analysis of sdhash was presented in (Breitinger et al., 2012). Two of the implementation bugs, the 'Window size bug' and the 'Left most bug' mentioned in (Breitinger et al., 2012), still exist in the latest version 3.4 of the sdhash implementation. Listing 1 shows the implementation containing the above stated bugs. At line number 13, there is an error in the first condition that causes incorrect identification of the minimum precedence rank (Rprec,D(fk)), referred to as the 'Window size bug'. This error can be removed by replacing the first condition of the while loop with 'chunk_ranks[i+pop_win-1] >= min_rank'. There is another error in the if conditions at line numbers 14-15 and 26-27, which has been referred to as the 'Left most bug'. If two features (fi, fj) have equal precedence rank (Rprec,D(fi) = Rprec,D(fj)) and are lowest within a popularity window, then this condition will cause the selection of the rightmost feature, which contradicts the proposed sdhash scheme.

Listing 1: The popularity score computation (gen_chunk_scores) in the sdhash 3.4 implementation.

     1  void sdbf::gen_chunk_scores( const uint16_t *chunk_ranks,
     2                               const uint64_t chunk_size,
     3                               uint16_t *chunk_scores, int32_t *score_histo) {
     4      uint64_t i, j, pop_win = config->pop_win_size;
     5      uint64_t min_pos = 0;
     6      uint16_t min_rank = chunk_ranks[min_pos];
     7
     8      memset( chunk_scores, 0, chunk_size*sizeof( uint16_t));
     9      if( chunk_size > pop_win) {
    10          for( i=0; i<chunk_size-pop_win; i++) {
    11              // try sliding on the cheap
    12              if( i>0 && min_rank>0) {
    13                  while( chunk_ranks[i+pop_win] >= min_rank && i<min_pos && i<chunk_size-pop_win+1) {
    14                      if( chunk_ranks[i+pop_win] == min_rank)
    15                          min_pos = i+pop_win;
    16                      chunk_scores[min_pos]++;
    17                      i++;
    18                  }
    19              }
    20              min_pos = i;
    21              min_rank = chunk_ranks[min_pos];
    22              for( j=i+1; j<i+pop_win; j++) {
    23                  if( chunk_ranks[j] < min_rank && chunk_ranks[j]) {
    24                      min_rank = chunk_ranks[j];
    25                      min_pos = j;
    26                  } else if( min_pos == j-1 && chunk_ranks[j] == min_rank) {
    27                      min_pos = j;
    28                  }
    29              }
    30              if( chunk_ranks[min_pos] > 0) {
    31                  chunk_scores[min_pos]++;
    32              }
    33          }
    34          // Generate score histogram (for b-sdbf signatures)
    35          if( score_histo)
    36              for( i=0; i<chunk_size-pop_win; i++)
                        score_histo[chunk_scores[i]]++;
        }
    }

Table 1: Different statistics on sdhash, from (Breitinger & Baier, 2012)

         Average                improved    original
    1    filesize*              428,912     428,912
    2    gaps count             2888        2889
    3    min gap*               1.09        1.076
    4    max gap*               1834        1834
    5    avg gap*               33.46       34.27
    6    ratio to file size     20.65%      21.21%

6. OUR ANALYSIS AND ATTACK ON SDHASH

The main purpose of similarity hashing schemes is to filter similar or correlated files corresponding to a given file that an investigator needs to examine. These schemes reduce the search space and the corresponding manual effort of analysis for the investigator. The process of filtering the files by matching them with a set of reference files can, however, be defeated. We propose a scheme that can generate multiple similar files corresponding to a given file with maximal similarity score of 100.

Consider, for instance, an adversary X who sets up a website 'B' hosting popular content that might attract web advertisers to put their ads on the website. A consistent viewership over a period would result in high chances of advertisement hits and consequently monetary returns for X. She would recover the membership cost gradually, while the rest of the revenue is profit. The original website 'A' eventually comes to know about the existence of website 'B', which is hosting their proprietary content. Since the owner of the domain name is registered as anonymous on records, the only way to track her is her IP address. Fortunately, the country where the website is hosted follows anti-piracy and intellectual property protection laws. The physical location of the systems on which the data of website 'B' is stored can be determined. X uploads content downloaded from the original website after putting a watermark of her own website on each image. The use of cryptographic hash functions is ruled out in that case, and investigators would need a similarity digest algorithm, possibly sdhash, to find the files.

Here, in this condition, if X has any time to prepare herself for such an investigation, she could use our tool to generate multiple similar files, with the same metadata, corresponding to each file. The approach is definitely heavy on storage, but it can help X in increasing the effort of the investigation by forcing the investigators to analyze the files manually. Secondly, the investigation process could also be confused by X's claim that she is innocent and that it is the work of someone else who has access to her system, or even of a malware. In both cases, the investigation effort is increased many fold. Moreover, the primary purpose of a similarity digest, to help investigators quickly filter out files of interest, is defeated.

Breitinger et al. in (Breitinger et al., 2012) mentioned that 20% of the input data can be modified without influencing the final sdhash digest. We used two approaches to verify the number of undiscovered modifications within gaps. These are (1) Random modification and (2) Deliberate modification. In the random modification approach, gap bytes are filled with randomly chosen ASCII characters. Our experiments on text files show that random modification of only 2% of the gap bytes influences the sdhash digest with probability close to 1. In the second approach of 'Deliberate modification', we propose an algorithm for careful modifications in order to increase the available bytes for modification within gaps. Experimental analysis of the proposed algorithm shows that, by using this algorithm, around 12% of the gap bytes can be modified with maximal similarity score of 100.

6.1 Random Modification

We randomly choose several byte positions within the gap and modify each with a randomly chosen ASCII character, to find the maximum number of random modifications within the gap that do not influence the sdhash digest of the entire document. We performed experiments on a data set of 50 text files of variable size from the T5-corpus dataset. We found that even one byte of random modification within the gap would influence the sdhash digest with an average probability of 0.22, and the modification of all bytes in the 20% gap will impact the final hash digest with probability 1. So, we focused on finding the minimum number of modifications that would influence the final sdhash digest with probability 1.

We started with single byte modifications and generated more than 5000 files with only one byte tampered, and evaluated the influence on the hash digest. We gradually increased the number of modifications until the hashes for all 5000 files got influenced. It was found that with a random modification of only around 2% of the bytes of the gap, there is an influence on the sdhash digest of each of the randomly generated files, which is on an average 0.42% of the respective file size. Experimental results for a small sample of 8 files are given in Table 2.

Table 2: Minimum number of random modifications that modify the final sdhash digest with probability 1.

    S.No.   File size   Gap          Random Modification
            (In KBs)    (In Bytes)   Bytes    Gap%      File%
    1†      1.5         354          45       12.70%    3%
    2       22.9        3948         50       1.26%     0.21%
    3       50          8917         70       0.78%     0.14%
    4       81          14084        60       0.42%     0.07%
    5       307         46296        80       0.17%     0.03%
    6       841         215894      20       0.01%     0.00%
    7       1095        139038      50       0.04%     0.00%
    8       1554        378636      35       0.01%     0.00%
    On an avg.                       51.25    1.92%     0.42%
    † This file is not from the T5-corpus database.
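The random-modification experiment can be sketched as follows; this is our illustrative reading of the procedure, where the helper below is hypothetical and computing and comparing the sdhash digests of the original and tampered files is left to the external sdhash tool:

    import random

    def randomly_modify_gaps(data: bytes, gaps, fraction: float) -> bytes:
        """Overwrite a `fraction` of the gap bytes (positions not covered by
        any selected feature, given as (start, end) ranges) with random
        printable ASCII characters, and return the tampered file content."""
        positions = [p for start, end in gaps for p in range(start, end)]
        k = max(1, int(fraction * len(positions)))
        tampered = bytearray(data)
        for p in random.sample(positions, k):
            tampered[p] = random.randint(0x20, 0x7E)  # random printable ASCII
        return bytes(tampered)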
modify without influencing the final sdhash Experimental results for a small sample of 8 files digest. We used two approaches to verify the is given in table 2. number of undiscovered modification within gaps. These are (1) Random modification and As described in 3, only the selected features § (2) Deliberate Modification. (statistically improbable features) participate in the generation of final similarity digest. In the random modification approach, gap Therefore gaps (the data bytes which are not bytes are filled with randomly chosen ASCII part of any selected feature) are expected not characters. Our experiments on text files show to influence the final hash digest. However, 42 Proceedings of 10th Intl. Conference on Systematic Approaches to Digital Forensic Engineering
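The random-modification step itself is simple to reproduce. The following C sketch is illustrative rather than our exact test harness: it assumes the gap byte offsets have already been extracted from the file's sdbf signature (a hard-coded placeholder list stands in for them here), overwrites k randomly chosen gap positions with random printable ASCII characters, and writes the variant to disk so that the sdhash tool can be run on it separately.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Overwrite 'count' randomly chosen gap offsets of 'buf' with random
 * printable ASCII characters (char 32 to char 126). */
static void modify_random_gap_bytes(unsigned char *buf, const long *gaps,
                                    long n_gaps, long count) {
    for (long i = 0; i < count; i++) {
        long pos = gaps[rand() % n_gaps];             /* random gap offset */
        buf[pos] = (unsigned char)(32 + rand() % 95); /* random printable  */
    }
}

int main(int argc, char **argv) {
    if (argc != 3) {
        fprintf(stderr, "usage: %s <file> <k>\n", argv[0]);
        return 1;
    }
    srand((unsigned)time(NULL));

    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    rewind(f);
    unsigned char *buf = malloc((size_t)size);
    if (!buf || fread(buf, 1, (size_t)size, f) != (size_t)size) return 1;
    fclose(f);

    /* Placeholder only: a real run derives these offsets from the gaps of
     * the file's sdbf signature; all offsets must be smaller than 'size'. */
    long gaps[] = { 10, 11, 12, 200, 201, 350 };
    long n_gaps = (long)(sizeof gaps / sizeof gaps[0]);

    modify_random_gap_bytes(buf, gaps, n_gaps, atol(argv[2]));

    FILE *out = fopen("variant.bin", "wb");  /* compare this variant with  */
    if (!out) { perror("fopen"); return 1; } /* the original using sdhash  */
    fwrite(buf, 1, (size_t)size, out);
    fclose(out);
    free(buf);
    return 0;
}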

As described in Section 3, only the selected features (statistically improbable features) participate in the generation of the final similarity digest. Therefore gaps (the data bytes which are not part of any selected feature) are expected not to influence the final hash digest. However, as we showed in the experiments, these bytes do influence the sdhash digest. This happens because each feature in the sdhash construction is highly correlated with its neighbors: each feature differs from its left and right neighbor by only one byte. For example, let D be a data object under investigation with the following byte sequence and features:

D:  B0 B1 B2 B3 B4 B5 ... B63 B64 B65 B66 B67 ... BN
f0: B0 B1 B2 B3 B4 B5 ... B63
f1: B1 B2 B3 B4 B5 B6 ... B64
f2: B2 B3 B4 B5 B6 B7 ... B65
...
fn: B(N-63) B(N-62) ... B(N-2) B(N-1) BN

where N is the number of bytes in the data object D and n is the number of features in D (n = N - L + 1). Each byte is part of at least one and at most L (i.e., 64) features. Each byte Bk, except the first L-1 and the last L-1 bytes (L <= k <= N-L+1), is part of exactly L features. A change in any byte Bk is reflected in features f(k-L+1) to fk, which may lead to a change in the precedence ranks Rprec,D(f(k-L+1)) to Rprec,D(fk). A change in the rank of any feature, Rprec,D(fk), is in turn reflected in the popularity scores of the features of D, which may affect the list of selected features. Any modification in the list of selected features will lead to changes in the final hash digest.

Table 2: Minimum number of random modifications that modify the final sdhash digest with probability 1

  S.No.  File size  Gap         Random Modification
         (in KBs)   (in bytes)  Bytes    Gap%      File%
  1†     1.5        354         45       12.70%    3%
  2      22.9       3,948       50       1.26%     0.21%
  3      50         8,917       70       0.78%     0.14%
  4      81         14,084      60       0.42%     0.07%
  5      307        46,296      80       0.17%     0.03%
  6      841        215,894     20       0.01%     0.00%
  7      1095       139,038     50       0.04%     0.00%
  8      1554       378,636     35       0.01%     0.00%
  On an avg.                    51.25    1.92%     0.42%
  † This file is not from the T5-corpus database.

6.2 Deliberate Modification

The experimental results from Section 6.1 show that the entire 20% gap of any file cannot be modified by random modification. We now propose an algorithm that performs careful modifications in order to increase the number of changes within the gaps while still ensuring no change in the similarity digest.

6.2.1 Algorithm Description

As discussed in Section 6.1, a modification to any byte Bk will influence the rank of all features containing Bk. This might cause changes in the list of selected statistically improbable features. In the sdhash construction, the feature with the leftmost lowest rank gets selected in a popularity window. If the rank of a feature is the leftmost lowest in t or more (t being the threshold) popularity windows, then it gets selected as a statistically improbable feature. These selected statistically improbable features participate in the computation of the final sdhash digest. Let D be a data object with fS1 and fS2 as two consecutive statistically improbable features:

f0 f1 f2 ... fS1 fS1+1 ... fS1+63 fS1+64 ... fS2-1 fS2 ... fn
B0 B1 B2 ... BS1 BS1+1 ... BS1+63 BS1+64 ... BS2-1 BS2 ... Bn ... BN

where
fS1: BS1 BS1+1 BS1+2 ... BS1+L-1
fS2: BS2 BS2+1 BS2+2 ... BS2+L-1

Data bytes BS1+64 to BS2-1 are not a part of any selected feature. The aim is to modify these bytes in such a way that the modified features never get selected over fS1 and fS2. For every data byte Bk, where S1+L <= k <= S2-1, a specific value among all possible ASCII characters satisfying the following two conditions is chosen:

1. Rprec,D(f'j) > Rprec,D(fS2) AND Rprec,D(f'j) >= Rprec,D(fS1)

2. Rprec,D(f'j) >= Rprec,D(fj)

where k-L+1 <= j <= k and S1+L <= k <= S2-1, and f'j is the modified feature fj obtained as the result of the modification of byte Bk. The above two conditions ensure that every modified feature f'j has rank Rprec,D(f'j) greater than the rank of the right selected statistically improbable feature (j < S2), and greater than or equal to the rank of the left selected (j > S1) statistically improbable feature, i.e., Rprec,D(fS1). It can be equal to this value because even if two features have equal rank, the leftmost feature always gets selected. Ultimately, no other feature gets selected over either of the two statistically improbable features.
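The locality that these conditions exploit can be demonstrated with a toy program. In the C sketch below, the rank function is a deliberately simple stand-in (a byte sum) rather than sdhash's entropy-based precedence rank, but the effect is the same: modifying a single byte Bk changes the rank of exactly the L features f(k-L+1) to fk that contain it, and no others.

#include <stdio.h>
#include <string.h>

#define L 64  /* feature length used by sdhash */

/* Stand-in rank function: NOT sdhash's entropy-based precedence rank.
 * Any function of a feature's bytes exhibits the same locality. */
static unsigned rank_of(const unsigned char *feature) {
    unsigned r = 0;
    for (int i = 0; i < L; i++) r += feature[i];
    return r;
}

int main(void) {
    unsigned char orig[256], data[256];
    for (int i = 0; i < 256; i++) orig[i] = (unsigned char)(i * 7 + 3);
    memcpy(data, orig, sizeof orig);

    data[100] ^= 0xFF;                /* modify a single byte B_k, k = 100 */

    int n = 256 - L + 1;              /* number of features: n = N - L + 1 */
    int changed = 0;
    for (int j = 0; j < n; j++)       /* feature f_j covers bytes j..j+L-1 */
        if (rank_of(orig + j) != rank_of(data + j))
            changed++;

    /* Exactly the L features f_{k-L+1} .. f_k contain byte B_k. */
    printf("features with changed rank: %d (expected %d)\n", changed, L);
    return 0;
}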

The above-mentioned conditions are not enough if (S2 - 1) - (S1 + L) >= t, where L is the feature length and t is the threshold. Even if each modification satisfies both conditions, new features may still get selected. The reason this happens is that if the distance between two selected features is more than L + t, then after modification the rank of some modified feature may become a local minimum among its t or more neighbors. Since t is the threshold for a feature to get selected, it may get selected as a statistically improbable feature and hence may influence the final sdhash digest. In the case mentioned above, it needs to be verified that no modification causes any change in the list of selected statistically improbable features. To mitigate this problem, after modification of the gap bytes being considered, the popularity score Rpop,D of all the features of D is calculated. If any new feature f'j attains a popularity score Rpop,D(f'j) > t, then all the previous modifications are discarded. Similarly, the gaps between each adjacent pair of selected improbable features are modified.

Algorithms 1 and 2 generate the multiple colliding files corresponding to a given data object with maximal similarity score. Each execution of Algorithm 1 produces a different file with dissimilar modifications and a different number of modifications. Therefore, we can generate up to 256^G different files with maximal similarity corresponding to a given file, where G denotes the total number of gap bytes in the data object. The attacker can easily confuse the investigator by generating a huge number of files corresponding to a malicious or desired file. Since our current implementation is focused on text files, we have chosen the characters only from the set of 95 printable ASCII characters, from char 32 to char 126. The maximum number of files that can be generated is then 95^G, which is sufficiently large even for G = 2.

We ran the proposed algorithms on the same data set of 50 text files which was used for our earlier random experiment. We found that around 12% of the gap bytes can be modified with a maximal similarity score of 100 using the proposed algorithm. This is a huge improvement over the random modification case, where even 2% of the gap bytes cannot be modified without changing the final sdhash digest. Experimental results for a small sample of 8 files are presented in Table 3.

Table 3: Number of modifications with maximal similarity score through the proposed algorithm

  S.No.  File size  Gap         Modification
         (in KBs)   (in bytes)  Bytes     Gap%      File%
  1†     1.5        354         89        25.14%    5.99%
  2      22.9       3,948       552       13.98%    2.41%
  3      50         8,917       1,065     11.94%    2.13%
  4      81         14,084      1,273     9.03%     1.50%
  5      307        46,296      3,357     7%        1.09%
  6      841        215,894     31,371    14%       3.73%
  7      1095       139,038     6,211     4%        0.56%
  8      1554       378,636     13,787    3.60%     0.88%
  On an avg.                    7,185.25  11.08%    2.28%
  † This file is not from the T5-corpus database.

7. COUNTER MEASURES

In order to reduce the amount of undiscovered modifications, we propose the following two mitigations.

7.1 Minimization of popularity score threshold

Decreasing the popularity score threshold in the selection of statistically improbable features will increase the number of selected features. This, in turn, will reduce the number of gap bytes that can be modified without affecting the final sdhash digest.

7.2 Bit level feature formation

In the sdhash scheme, each feature differs from its neighboring features by one byte. Therefore, the attacker has 2^8 possible choices for modifying a feature without influencing its neighboring features. If each neighboring feature instead differed by only 1 bit (in place of the original one byte), the number of possible choices available to the attacker would be reduced from 256 to 2.

Algorithm 2 Byte Modification algorithm
1:  buffer ▷ Input data object
2:  indx ▷ Index of the selected feature
3:  lst_indx ▷ Index of the last selected feature
4:  RANK(buffer, i) ▷ Function that returns the rank of the i-th feature of data object buffer
5:  SCORE(buffer, i, j) ▷ Function that calculates the popularity scores of the i-th to j-th features of data object buffer and returns an array containing them
6:  flag ▷ A boolean variable
7:  rank_indx ▷ Unsigned int variable for the rank of the selected feature
8:  rank_lst_indx ▷ Unsigned int variable for the rank of the last selected feature
9:  rank_k, rank_i ▷ Unsigned int; temporary variables
10: procedure MODIFY_BYTES(buffer, indx, lst_indx, pop_win_size)
11:     buffer_copy ← buffer ▷ Create one copy of the data object
12:     rank_indx ← RANK(buffer, indx) ▷ Rank of the selected feature of buffer
13:     rank_lst_indx ← RANK(buffer, lst_indx) ▷ Rank of the last selected feature of buffer
14:     for i ← indx - 1 to lst_indx + 1 do ▷ Run byte by byte through all intermediate bytes between the two selected features
15:         ch ← buffer[i] ▷ ch is a char variable
16:         rank_i ← RANK(buffer_copy, i) ▷ Rank of the i-th feature of the unmodified buffer
17:         for j ← 0 to 255 do ▷ Run through ASCII values 0 to 255 until all conditions are satisfied
18:             temp ← rand() mod 256
19:             buffer[i] ← temp ▷ The i-th byte is replaced by the randomly chosen ASCII char temp
20:             flag ← true
21:             if RANK(buffer, i) > rank_indx AND RANK(buffer, i) > rank_lst_indx AND RANK(buffer, i) >= rank_i then ▷ The rank of a modified feature must be greater than the ranks of the selected neighboring features
22:                 for k ← i - (w - 1) to lst_indx do ▷ Run through the features that contain the i-th byte
23:                     rank_k ← RANK(buffer_copy, k) ▷ Rank of the k-th feature of the unmodified buffer
24:                     if RANK(buffer, k) <= rank_indx OR RANK(buffer, k) ...
    ...
41:         if ... > 16 then
42:             high ← true; break
43:         end if
44:     end for
45:     if high == true then
46:         for z ← indx - 1 to lst_indx do
47:             buffer[z] ← buffer_copy[z] ▷ Revert all the changes
48:         end for
49:     end if
50: end procedure


Hence the probability of modifying each bit without affecting the final hash would also be reduced substantially. However, this would increase the number of features, and hence of selected features, thereby causing some loss in efficiency. An increase in the number of selected improbable features will not only increase the computation time, it will also cause an increase in the size of the final sdhash digest.

Algorithm 1
1:  buffer ▷ Data object
2:  chunk_size ▷ Size of the data object
3:  chunk_scores ▷ Array of scores of each feature of the data object
4:  pop_win_size ▷ Window size: default is 64
5:  t ▷ Threshold: default is 16
6:  indx ▷ Index of the selected feature
7:  lst_indx ▷ Index of the last selected feature: initialize with 0
8:  for i ← 0 to chunk_size - pop_win_size do ▷ Run through the input byte by byte
9:      if chunk_scores[i] > t then ▷ Selected features
10:         MODIFY_BYTES(buffer, indx, lst_indx, pop_win_size)
11:         ▷ Processing is in the next algorithm
12:         lst_indx ← indx
13:
14:     end if
15: end for

8. CONCLUSION

Currently sdhash is one of the most widely used byte-wise similarity hashing schemes. It is possible to make undiscovered modifications to a file and yet obtain exactly the same sdhash digest. We have proposed a novel approach that performs the maximum number of byte modifications while retaining a maximal similarity score of 100. We also provided a method for mounting an anti-forensic attack in order to confuse or delay the investigation process.

REFERENCES

Baier, H., & Breitinger, F. (2011). Security aspects of piecewise hashing in computer forensics. In IT Security Incident Management and IT Forensics (IMF), 2011 Sixth Intl. Conference on (pp. 21-36).

Breitinger, F., & Baier, H. (2012). Properties of a similarity preserving hash function and their realization in sdhash. In 2012 Information Security for South Africa, Johannesburg, 2012 (pp. 1-8).

Breitinger, F., Baier, H., & Beckingham, J. (2012). Security and implementation analysis of the similarity digest sdhash. In First International Baltic Conference on Network Security & Forensics (NeSeFo).

Breitinger, F., Guttman, B., McCarrin, M., & Roussev, V. (2014). Approximate matching: definition and terminology. URL http://csrc.nist.gov/publications/drafts/800-168/sp800_168_draft.pdf

Chen, L., & Wang, G. (2008). An efficient piecewise hashing method for computer forensics. In Knowledge Discovery and Data Mining, 2008. First Intl. Workshop on (pp. 635-638).

Kornblum, J. (2006). Identifying almost identical files using context triggered piecewise hashing. Digital Investigation, 3, 91-97.

Roussev, V. (2009). Building a better similarity trap with statistically improbable features. In System Sciences, 2009. 42nd Hawaii Intl. Conference on (pp. 1-10).

Roussev, V. (2010a). Data fingerprinting with similarity digests. In Advances in Digital Forensics VI (pp. 207-226).

Roussev, V. (2010b). Data fingerprinting with similarity digests. In Advances in Digital Forensics VI (pp. 207-226).

Seo, K., Lim, K., Choi, J., Chang, K., & Lee, S. (2009). Detecting similar files based on hash and statistical analysis for digital forensic investigation. In 2009 2nd International Conference on Computer Science and its Applications.

Tridgell, A. (2002). Spamsum README.


AN EMPIRICAL STUDY ON CURRENT MODELS FOR REASONING ABOUT DIGITAL EVIDENCE

Stefan Nagy1, Imani Palmer1, Sathya Chandran Sundaramurthy2, Xinming Ou2, Roy Campbell1

1 Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana-Champaign, IL 61801, USA
2 Department of Computing and Information Sciences, Kansas State University, 234 Nichols Hall, Manhattan, KS 66506, USA

ABSTRACT The forensic process relies on the scientific method to scrutinize recovered evidence that either supports or negates an investigative hypothesis. Currently, the analysis of digital evidence remains highly subjective, depending on the individual forensic practitioner. Digital forensics is in need of a deterministic approach to obtain the most judicious conclusions from evidence. The objective of this paper is to examine current methods of digital evidence analysis. It describes the mechanisms by which these processes may be carried out, and discusses the key obstacles presented by each. Lastly, it concludes with suggestions for further improvement of the digital forensic process as a whole.

Keywords: digital evidence, forensic reasoning, evidence reliability, digital forensics models.

1. INTRODUCTION

As the use and complexity of digital devices continue to rise, the field of digital forensics remains in its infancy. The investigative process is currently faced with a variety of problems, ranging from the limited number of skilled practitioners to the difficulty of interpreting different forms of evidence. Investigators are challenged with leveraging recovered evidence to find a deterministic cause and effect. Without reliable scientific analysis, judgments made by investigators can easily be biased, inaccurate and/or unprovable. Conclusions drawn from digital evidence can vary largely due to differences in their respective forensic systems, models, and terminology. This persistent incompatibility severely impacts the reliability of investigative findings as well as the credibility of the forensic analysts. Evidence reasoning is a fundamental part of investigative efficacy; however, the digital forensic process currently lacks the scientific rigor necessary to function in this capacity. This paper presents an overview of several recent methods that propose a deterministic approach to reasoning about digital evidence. Section 2 examines past discussion on the digital forensic process. Section 3 discusses the application of differential analysis. In section 4, we review several popular probabilistic reasoning models. Section 5 discusses the formalization of event reconstruction. In section 6, we consider a model that combines probabilistic reasoning with event reconstruction. Lastly, section 7 holds our conclusions and suggestions for additions to the field.

2. BACKGROUND

The standard for the admissibility of evidence stems from the Daubert trilogy, which establishes the requirements of relevancy and reliability [25]. NIST describes the general phases of the forensic process as: collection, examination, analysis and reporting [23]. Formalization is necessary to ensure consistent repeatability for all investigative scenarios. In recent years, literature has addressed the need for formalization of the digital forensic process, but has primarily focused on evidence collection and preservation [2]. Ieong [24] highlights the need for an explicit, unambiguous representation of knowledge and observations. While a pedagogical investigative framework exists, there is yet to be a congruous system for digital evidence reasoning within the examination and analysis phases. Currently, digital forensic analysts use a variety of methods to develop conclusions about recovered evidence, yet the results are often marred with conflicting bias or are shrouded in a veil of uncertainty.


There have been numerous proposed reasoning frameworks, typically relying on applied mathematics, statistics and probabilities, as well as logic. However, before we can employ any particular methodology, there is a need to examine, review and explore all options in order to carry out the investigative process with the utmost precision.

3. DIFFERENTIAL ANALYSIS

Differential analysis is described as a method of data comparison used for reporting differences between two digital objects. Historically, it has been part of computer science for quite some time: Unix's diff command was implemented in the early 1970s, and is commonly used for fast comparison of binary and text files [3]. Continued advancements in hashing and metadata have since paved the way for more thorough differential analysis. It is flexible and adaptable to nearly all types of digital objects; Windows Registry hives, binary files, and disk images can all be compared for evidence of modification or tampering [4]. Non-forensic applications include security procedures of operating systems, such as Windows' use of file signatures to verify the integrity of downloaded driver packages [5].

Modern investigative tools such as EnCase [6], FTK [7] and SleuthKit [8] have incorporated modules for streamlining differential analysis of collected evidence, although each requires significant training to become competent with the software features. Garfinkel et al. [3] formalize a model for differential analysis in the context of digital evidence: two collected objects, a baseline object and a final object, are compared for evidence of modification both before and after events of interest. Ideally, the process will highlight the most significant changes made from baseline to final, assuming those transformations resulted from actions taken by the suspect in question. In this context, differential analysis is often used to detect malware, file and registry modifications [3].

While the strategy of differential analysis is fundamentally the same regardless of which system level is being examined, each level possesses a certain degree of noise. In discussing differential analysis, we will define "noise" as information resulting from the comparison between baseline and final that is wholly irrelevant to the investigation. As the context of investigation is expanded, so does the difficulty of identifying noise [9].

Figure 1. Knowledge management understanding hierarchy [9].

A potential form of noise presents itself as benign modifications made to digital objects resulting from normal operation of a system. For example, an investigator may wish to examine the presence of a suspicious binary on a particular system that is part of an enterprise network. The investigator selects a disk image of an identical, unmodified system from the same enterprise network to serve as the baseline for comparison. Differential analysis may reveal that the image of the system in question is incredibly anomalous compared to the baseline. This could potentially lead to the injudicious assumption that "the most anomalous system is the most malicious" [4], when in reality it might have only been the result of benign modifications arising from differences in installed software. While files at the kernel level are generally protected from tampering, files in user directories are much more vulnerable to modification.

Although noise is often assumed to be unintentional, it is very possible that it could be inserted on purpose. When dealing with instances of steganography, differential analysis compares objects that are known to be hiding information with those that are not. Fiore [10] describes a framework by which "selective redundancy removal" can be used to prepare HTML files for carrying out linguistic steganography.


Since the information is being hidden through the otherwise normal process of HTML file optimization, differential analysis will only appear to reveal benign occurrences, such as differences in HTML tag styling.

Future research is needed to expand metrics for identifying and accounting for different forms of noise in digital evidence. Mead [1] explains the National Software Reference Library's effort to create a library of hashes of commercial software packages. By combining hashing with differential analysis, investigators can narrow the scope of inquiry by cross-referencing evidence against a database of known hash values. Eliminating evidence matching existing hashes reduces the amount of noise arising from benign objects, which is commonly problematic when dealing with larger systems, and better isolates the few remaining questionable objects. Further improvement of such databases, robust hashing algorithms, and perhaps a formal technique would be of benefit to investigators. A minimal sketch of this hash-based filtering is given below.
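The following C sketch shows only the skeleton of such filtering: it hashes a baseline object and a final object and reports whether the object can be set aside or must be examined. A 64-bit FNV-1a hash is used purely for brevity; an operational workflow would use a cryptographic hash and cross-reference a reference database such as the NSRL.

#include <stdio.h>
#include <stdint.h>

/* 64-bit FNV-1a hash of a file's contents. */
static uint64_t fnv1a_file(const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) { perror(path); return 0; }
    uint64_t h = 14695981039346656037ULL;  /* FNV offset basis */
    int c;
    while ((c = fgetc(f)) != EOF) {
        h ^= (uint64_t)(unsigned char)c;
        h *= 1099511628211ULL;             /* FNV prime */
    }
    fclose(f);
    return h;
}

int main(int argc, char **argv) {
    if (argc != 3) {
        fprintf(stderr, "usage: %s <baseline> <final>\n", argv[0]);
        return 1;
    }
    uint64_t hb = fnv1a_file(argv[1]);
    uint64_t hf = fnv1a_file(argv[2]);
    /* Identical hashes: the object can be filtered out as noise.
     * Differing hashes: the object is flagged for manual examination. */
    printf("%s\n", hb == hf ? "unchanged (filter out)" : "modified (examine)");
    return 0;
}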
4. PROBABILISTIC MODELS

Conventional forensic analysis has long included models of statistical inference to assess the degree of certainty with which hypotheses and corresponding evidence can be causally linked [11]. This causal linkage is expressed as follows: if a cause C is responsible for effect E, and E has been observed, then C must have occurred [12]. For example, researchers know that the probability of two identical DNA profiles belonging to two different individuals is close to one in one billion [13]. If holding an item leaves fingerprints on it, and fingerprints found on the weapon at a murder scene match the suspect's own, then investigators can conclude with over 99% certainty that the suspect held that weapon. Because criminal investigations are ultimately abductive, probabilistic techniques have become widely accepted in the forensic reasoning process [14][12].

4.1 CLASSICAL PROBABILITY

Several recent criminal investigations have seen classical probability used to reason about contradicting scenarios regarding the presence of incriminating digital evidence. Examining two cases originating in Hong Kong, Overill et al. [15] reasoned about the likelihood that the respective defendants intentionally downloaded various forms of child pornography versus accidentally downloading it among other benign content. In each case, the amount of child pornography seized was very small compared to the total amount of miscellaneous benign content, and in both instances it was found to have been downloaded over a long period of time. In each case, it was determined that the probability of unintentionally downloading a small amount of child pornography is significantly below 10% [15].

While this method can indeed provide a quantitative assessment of the likelihood of guilt, it is limited to investigations where only a few characteristics of the evidential traces are known. In both examples above, the defendants pleaded guilty, and thus metadata was disregarded [15]. It was assumed that the incriminating files had been downloaded over long periods of time, but had metadata been collected, the original hypothesis may have changed entirely. An example would be the offending content being timestamped to a one-hour browsing period, thus invalidating the original hypothesis of accidental download. The growing importance of preserving metadata creates the need for probabilistic models that can integrate it into reasoning.

4.2 BAYESIAN NETWORKS

In the last decade, Bayesian inference has gained popularity in the scientific community. Unlike frequentist inference, which reasons with frequencies of past events, Bayesian inference reasons with "subjective belief estimations", and allows room for new evidence to revise these beliefs [12]. Kwan et al. [14] introduced the idea of reasoning about digital evidence in the form of Bayesian networks: directed acyclic graphs whose leaf nodes represent observed evidence and whose interior nodes represent unobserved causes. The root node represents the central hypothesis to which all unobserved causes serve as sub-hypotheses. The model uses Bayes' theorem to determine the conditional probability of hypothesis H given the observed evidence E:

P(H|E) = P(H) P(E|H) / P(E)

where P(E) is the prior probability of evidence E; P(H) is the prior probability of H when no evidence exists; and P(H|E) is the posterior probability that H has occurred when E is detected.
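Numerically, the update is straightforward. All figures in the C sketch below are invented for illustration: a prior P(H) = 0.5 for the root hypothesis, a likelihood P(E|H) = 0.8 for some observed evidence E, and P(E|not H) = 0.1, with P(E) obtained by the law of total probability.

#include <stdio.h>

int main(void) {
    /* Illustrative numbers only; real priors and likelihoods must come
     * from the investigation itself. */
    double p_h      = 0.5;  /* P(H): prior of the hypothesis           */
    double p_e_h    = 0.8;  /* P(E|H): likelihood of evidence under H  */
    double p_e_noth = 0.1;  /* P(E|not H)                              */

    /* Total probability: P(E) = P(E|H)P(H) + P(E|not H)P(not H) */
    double p_e = p_e_h * p_h + p_e_noth * (1.0 - p_h);

    /* Bayes' theorem: P(H|E) = P(H) * P(E|H) / P(E) */
    double p_h_e = p_h * p_e_h / p_e;

    printf("P(E)   = %.3f\n", p_e);    /* prints 0.450 */
    printf("P(H|E) = %.3f\n", p_h_e);  /* prints 0.889 */
    return 0;
}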


Figure 2. Bayesian network connections: (a) Serial; (b) Diverging; (c) Converging [14].

The construction of a Bayesian model begins with the definition of a root hypothesis. An example would be "The seized computer was used to send this malicious file." The possible states of the hypothesis (Yes, No, and Uncertain) are assigned equal probabilities. As more evidence is discovered, sub-hypotheses and their corresponding probabilities are added beneath the root hypothesis. The process is repeated until refinement produces a most likely hypothesis.

However, Bayesian networks are dependent on the assignment of prior probabilities to posterior evidence [14]. In scenarios where uncertainty is present, fuzzy logic methodology is incorporated to quantify likelihood as a value between 1 (absolute truth) and 0 (false) [16]. The case study presented in [14] based its prior probabilities on results from questionnaires sent to several law enforcement agencies. Since human-computer interactions are non-deterministic, there is no systematic way to reason about posterior evidential probabilities with complete certainty; conditional probabilities inferred from the demonstrably normal behavior of one network might differ from those of another. Discrepancies in prior evidential probabilities can significantly impact the overall outcome of the Bayesian network, and thus there is difficulty in soundly applying this method to digital forensic investigations.

4.3 DEMPSTER-SHAFER THEORY

One of the limiting factors of using Bayesian analysis in security is that it requires the assignment of prior and conditional probabilities for the nodes in the reasoning model. Oftentimes, the numbers are very hard to obtain. For example, how does one compute the prior probability of a particular registry key being modified? As another example, how does one compute the conditional probability of a particular registry key being modified given that the malware did not gain privileged access? Bayesian analysis works very well when the reasoning structure is well known and the probabilities are easy to obtain. In the real world, it is very hard to obtain those numbers, and there is a high degree of uncertainty in the obtained evidence.

Dempster-Shafer theory (DST) is a reasoning technique that provides a way to encode uncertainty more naturally [17]. Contrasting with Bayesian analysis, DST does not require one to provide a prior probability for the hypothesis of interest. DST also does not require the use of conditional probabilities, thus addressing the other major limitation of Bayesian analysis techniques. The presence of certain evidence during forensic analysis does not necessarily indicate malicious activity. For example, a change in a registry key could be due either to malware or to a benign application. There is always a degree of uncertainty in the obtained evidence at any given stage of the forensic analysis process. DST enables one to account for this uncertainty by assigning a number to a special state of the evidence, "don't know". For example, a sequence of registry key modifications might indicate that malware of a specific family might have been downloaded. Based on empirical evidence, let us assume one believes that with 10% confidence. A probabilistic interpretation would then mean that one believes there is a 90% chance that the malware was not downloaded, which is not intuitive. When using DST, one would assign 10% to the hypothesis that the malware was downloaded and 90% to the hypothesis "I am not sure".
One can explain the difference between DST and probability theory using a coin toss example. When tossing a coin with unknown bias, probability theory will assign a probability value of 0.5 to both of the outcomes Head and Tail. This representation does not capture the inherent uncertainty in the outcome. DST, on the other hand, will assign 0 to the outcomes {Head} and {Tail} while assigning a value of 1 to the set {Head, Tail}. This exactly captures the reasoning process of a human: when you toss a coin with unknown bias, the only thing you are sure about the outcome is that it could be either Head or Tail.
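This mass assignment is easy to state in code. The C sketch below is an illustration, not a full DST implementation: the frame {Head, Tail} and its subsets are encoded as two-bit masks, and with all mass on the "don't know" set {Head, Tail}, the derived belief in Head is 0 while its plausibility is 1, which is exactly the separation described above.

#include <stdio.h>

/* Frame of discernment {Head, Tail}; subsets encoded as 2-bit masks:
 * 01 = {Head}, 10 = {Tail}, 11 = {Head, Tail}. m[0] (empty set) stays 0. */
#define HEAD 1u
#define TAIL 2u
#define BOTH (HEAD | TAIL)

/* Belief: sum of the masses of all non-empty subsets of 'set'. */
static double bel(const double m[4], unsigned set) {
    double s = 0.0;
    for (unsigned a = 1; a < 4; a++)
        if ((a & set) == a) s += m[a];
    return s;
}

/* Plausibility: sum of the masses of all subsets intersecting 'set'. */
static double pl(const double m[4], unsigned set) {
    double s = 0.0;
    for (unsigned a = 1; a < 4; a++)
        if (a & set) s += m[a];
    return s;
}

int main(void) {
    /* Coin with unknown bias: all mass on the "don't know" set. */
    double m[4] = { 0.0, 0.0, 0.0, 0.0 };
    m[BOTH] = 1.0;

    printf("Bel(Head) = %.2f, Pl(Head) = %.2f\n", bel(m, HEAD), pl(m, HEAD));
    /* Prints Bel(Head) = 0.00, Pl(Head) = 1.00: nothing supports Head
     * specifically, yet nothing rules it out either. */
    return 0;
}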


In general, when calculating the likelihood of a hypothesis, DST allows the admittance of ignorance about the confidence of evidence. DST provides rules for combining multiple pieces of evidence to calculate the overall belief in the hypothesis. The challenge of using DST is analogous to that of Bayesian analysis, though much reduced, in that no prior values have to be assigned to evidence.

5. EVENT RECONSTRUCTION MODELS

The ability to reconstruct events is of great importance to the digital forensic process. Al-Kuwari and Wolthusen [18] proposed a general framework to reconstruct missing parts of a target trace, which can be used for various areas of an investigation. Their algorithm graphs a multi-modal scenario, determining all possible routes connecting the gaps of a specific trace. Additional information may be included in the graph and marked appropriately. The broadcast algorithm used to determine all possible routes may require exponential time, suggesting that the search area should be bounded [18]. This approach relies on a specific target and would best be used to determine whether an attack on a system occurred. However, the approach poses problems for the algorithm if a specific target is not identified. Event reconstruction is not unique to digital forensics, and the ability to apply existing techniques could yield effective results.

5.1 FINITE STATE MACHINES

Modern computer systems are often modeled as a series of finite states, graphically presented as a Finite State Machine (FSM). It is expressed as the quintuple M = (Q, Σ, δ, s0, F), where:
• Q is the finite, non-empty set of machine states
• Σ is the finite, non-empty alphabet of event symbols
• δ: Q × Σ → Q is the transition function mapping events between machine states in Q for each event symbol in Σ
• s0 ∈ Q is the starting state of the machine
• F ⊆ Q is the set of final machine states
• Nodes represent possible system states
• Arrows represent transitions between states [19]
A toy code rendering of this quintuple is given at the end of this subsection.

Gladyshev and Patel [20] introduced a formalization of this model into digital forensics. By back-tracing event states, investigators are presented with a reconstruction of events and can thus select the timeline most relevant to the available evidence.

For finite state machine models to perform accurate, comprehensive event reconstruction, investigators must be able to account for all possible system states. Complex events, such as those resulting from advanced persistent threats, are incredibly difficult to analyze. In addition, changing factors such as software updates may affect the resulting machine states. Carrier [19] proposes the development of a central repository for hosting information about machine events. Likening it to existing forensic databases on gun cartridges, an exhaustive, continuously updated library of system events would be of invaluable aid to investigators performing event reconstruction. However, an investigator may wish to explore other characteristics of events, such as the odds of a particular investigative hypothesis, or the real-time distributions of reconstructed events. To compute answers to such questions, the formalization of event reconstruction must be extended with additional attributes that describe the statistical and real-time properties of the system and incident [20].
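As promised above, here is a toy rendering of the quintuple in C (our illustration, not code from [19] or [20]): Q = {S0, S1, S2}, a two-symbol event alphabet, starting state s0 = S0 and F = {S2}, with a hypothetical recorded event sequence replayed through the transition table.

#include <stdio.h>

enum state { S0, S1, S2, N_STATES };
enum event { EV_LOGIN, EV_DELETE, N_EVENTS };

/* Transition function delta: Q x Sigma -> Q, as a lookup table. */
static const enum state delta[N_STATES][N_EVENTS] = {
    /* S0 */ { S1, S0 },
    /* S1 */ { S1, S2 },
    /* S2 */ { S2, S2 },
};

int main(void) {
    /* Hypothetical recorded event sequence. */
    const enum event trace[] = { EV_LOGIN, EV_DELETE };
    enum state q = S0;                       /* s0: starting state */

    for (size_t i = 0; i < sizeof trace / sizeof trace[0]; i++) {
        enum state next = delta[q][trace[i]];
        printf("state %d --event %d--> state %d\n", q, trace[i], next);
        q = next;
    }
    /* F = {S2}: the machine accepts iff it ends in a final state. */
    printf("final state %s a member of F\n", q == S2 ? "is" : "is not");
    return 0;
}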
6. COMBINING PROBABILITY WITH EVENT RECONSTRUCTION

Attack graphs are typically used for intrusion analysis, where each path represents a unique method of intrusion by a malicious actor. It is possible to use attack graph techniques in the event reconstruction process. Attack graphs are directed graphs where nodes represent the pre- and post-conditions of machine events, and directed edges are conditions met between these nodes; the root node represents the singular event of interest to which all other nodes serve as precursors [21]. While attack graphs are helpful in identifying mechanisms of intrusion, their lack of any probabilistic inference hinders their usefulness in quantitative evidential reasoning. Investigators presented with attack graphs must select the most probable attack scenarios, but there are currently no clear metrics for assessing likelihood. To address this, Xie et al. [22] combined attack graphs with Bayesian networks.


By transferring attack graphs into acyclic Bayesian networks, this method utilizes conditional probability tables for nodes with parents, and prior probabilities for nodes without parents. As in regular Bayesian networks, this approach relies on the investigator supplying accurate conditional and prior probabilities for each event. Estimating prior probabilities has traditionally relied on feedback from the community in the form of surveys. This becomes incredibly difficult as scale increases; a large attack graph would require that the investigator survey and obtain probability information for every unique event, making analysis costly.
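The transfer can be illustrated with a single converging node. In the C sketch below, two precondition nodes A and B feed one consequence node C through a conditional probability table; every probability is invented for illustration, and the parents are treated as independent, which a real attack graph need not satisfy.

#include <stdio.h>

int main(void) {
    /* Prior probabilities of the two precondition nodes (illustrative). */
    double p_a = 0.3;   /* e.g., "attacker obtained valid credentials" */
    double p_b = 0.6;   /* e.g., "vulnerable service was reachable"    */

    /* Conditional probability table P(C = 1 | A, B), indexed [a][b]. */
    double cpt[2][2] = {
        { 0.01, 0.10 },   /* A = 0: C nearly impossible without A      */
        { 0.20, 0.90 },   /* A = 1: very likely once both hold         */
    };

    /* Marginalize out the parents:
     * P(C) = sum over a,b of P(C|a,b) P(a) P(b). */
    double p_c = 0.0;
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++) {
            double pa = a ? p_a : 1.0 - p_a;
            double pb = b ? p_b : 1.0 - p_b;
            p_c += cpt[a][b] * pa * pb;
        }

    printf("P(C) = %.4f\n", p_c);   /* prints 0.2308 for these numbers */
    return 0;
}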
7. FUTURE DIRECTION AND CONCLUSIONS

Evidence reasoning models are an important part of the forensic process. Unlike traditional forensic sciences, digital forensics deals almost exclusively with objects of a nondeterministic nature; there is great difficulty in analyzing and scrutinizing digital evidence. Fundamental flaws hinder current evidence analysis models in their ability to accurately assess the likelihood of crime occurrence. Furthermore, conclusions based on probabilities complicate explanations in the courtroom, as demonstrated in the legal arguments surrounding Shonubi I-V [26]. These flaws must be identified and understood to avoid the possibility of injudicious assumptions resulting from the forensic process.

Differential analysis of digital evidence becomes difficult when the scope of investigation is widened; unintentional noise in the form of benign modifications may lead to dubious conclusions about system integrity. Furthermore, recent obfuscation techniques have successfully averted detection by traditional methods. Event reconstruction models are limited in their ability to provide investigators with clear attack scenarios, because they rely on the exhaustive identification of possible machine states; there is yet to be a resource providing such information. Probabilistic reasoning models rely on prior probabilities known to the investigator, which have so far mainly been determined by surveying others in the field. Besides the obvious expenditure of time and effort in conducting such surveys, it is reckless to underestimate the potential for entropy and to reason that small samples of observed probabilities hold true for all investigations. It can be concluded that each of these techniques is only applicable to a small niche of forensic scenarios.

The increasing rate of software development places a burden on forensic examiners to keep up with the latest software packages, both commercial and free. Each of the models discussed in this paper lacks a comprehensive database of information with which to conduct analysis with the highest accuracy. We highlight the need for a community-driven, updated catalogue of file hashes, machine states, and probability metrics for use in forensic analysis. The changing nature of technology and software necessitates that researchers and law enforcement collaborate to ensure the digital forensic process is as reliable as possible.

8. ACKNOWLEDGEMENT

This research is partially supported by the National Science Foundation under Grants No. 1314925 and 1241773. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. This material is based on research sponsored by the Air Force Research Laboratory and the Air Force Office of Scientific Research, under agreement number FA8750-11-2-0084. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon.

REFERENCES

[1] Mead, S. Unique File Identification in the National Software Reference Library. Digit. Investig. 3, 3 (September 2006), 138-150.
[2] Stallard, T., Levitt, K. 2003. Automated Analysis for Digital Forensic Science: Semantic Integrity Checking. in Proceedings of the 19th Annual Computer Security Applications Conference (ACSAC '03). IEEE Computer Society, Washington, DC, USA, 160.


[3] Garfinkel, S., Nelson, A., Young, J. A General Strategy for Differential Forensic Analysis. in Digital Forensics Research Workshop 2012, August 2012, pages S50-S59.
[4] Gielen, M.W. 2014. Prioritizing Computer Forensics Using Triage Techniques. University of Twente.
[5] 2015. Microsoft Windows. Microsoft.
[6] 2015. EnCase. Guidance Software.
[7] 2015. Forensic Toolkit (FTK). Access Data.
[8] 2012. The Sleuth Kit. Carrier, D.
[9] Nunamaker, N.J.J., Romano, J., Briggs, R. A Framework for Collaboration and Knowledge Management. in Proceedings of the 34th Annual Hawaii International Conference on System Sciences, January 2001.
[10] Fiore, U. Selective Redundancy Removal: A Framework for Data Hiding. in Proceedings of Etude de la notion de pile application à l'analyse syntaxique. 2010, 30-40.
[11] Overill, R.E., Silomon, J.A.M. Digital Meta-Forensics: Quantifying the Investigation. in Proceedings of the Fourth International Conference on Cybercrime Forensics Education and Training, 2010.
[12] Huygen, P.E.M. Use of Bayesian Belief Networks in Legal Reasoning. in 17th BILETA Annual Conference, Amsterdam, 2002.
[13] Overill, R.E. Quantifying Likelihood in Digital Forensics Investigations. Journal of Harbin Institute of Technology, Vol. 21, No. 6, 2014.
[14] Kwan, M., Kam-Pui Chow, Law, F., Lai, P. Reasoning About Evidence Using Bayesian Networks. in Advances in Digital Forensics IV, Fourth Annual IFIP WG 11.9 Conference on Digital Forensics, Kyoto University, Kyoto, Japan, January 28-30, 2008.
[15] Overill, R.E., Silomon, J.A.M., Kam-Pui Chow, Tse, H. Quantification of Digital Forensic Hypotheses Using Probability Theory. in Systematic Approaches to Digital Forensic Engineering (SADFE), 2013 Eighth International Workshop on, pp. 1-5, 21-22 Nov. 2013.
[16] Stoffel, K., Cotofrei, P., Han, D. Fuzzy Methods for Forensic Data Analysis. in Soft Computing and Pattern Recognition (SoCPaR), 2010 International Conference of, pp. 23-28, 7-10 Dec. 2010.
[17] Shafer, G. Probability Judgment in Artificial Intelligence and Expert Systems. Statistical Science, Vol. 2, No. 1 (Feb. 1987), pp. 3-16.
[18] Al-Kuwari, S., Wolthusen, S.D. Fuzzy Trace Validation: Toward an Offline Forensic Tracking Framework. in Systematic Approaches to Digital Forensic Engineering (SADFE), 2011 IEEE Sixth International Workshop on, pages 1-4, IEEE, 2011.
[19] Carrier, D. 2006. A Hypothesis-based Approach to Digital Forensic Investigations. Purdue University.
[20] Gladyshev, P., Patel, A. Finite State Machine Approach to Digital Event Reconstruction. Digit. Investig., 1(2):130-149, June 2004.
[21] Liu, C., Singhal, A., Wijesekera, D. Using Attack Graphs in Forensic Examinations. in Availability, Reliability and Security (ARES), 2012 Seventh International Conference on, pp. 596-603, 20-24 Aug. 2012.
[22] Xie, P., Li, J.H., Ou, X., Liu, P., Levy, R. Using Bayesian Networks for Cyber Security Analysis. in Dependable Systems and Networks (DSN), 2010 IEEE/IFIP International Conference on, pp. 211-220, June 28 2010 - July 1 2010.
[23] NIST. Guide to Integrating Forensic Techniques into Incident Response. http://csrc.nist.gov/publications/nistpubs/800-86/SP800-86.pdf
[24] Ricci S. C. Ieong. 2006. FORZA - Digital forensics investigation framework that incorporate legal issues. Digit. Investig. 3 (September 2006), 29-36.
[25] Vickers, A. Leah. "Daubert, Critique and Interpretation: What Empirical Studies Tell Us About the Application of Daubert." USFL Rev. 40 (2005): 109.
[26] Izenman, J.A. Introduction to Two Views on the Shonubi Case. Temple University.


DATA EXTRACTION ON MTK-BASED ANDROID MOBILE PHONE FORENSICS

Joe Kong
MPhil Student in Computer Science
The University of Hong Kong
[email protected]

ABSTRACT In conducting criminal investigations it is quite common that forensic examiners need to recover evidentiary data from smartphones used by offenders. However, examiners have encountered difficulties in acquiring complete memory dumps from MTK Android phones, a popular family of smartphones, due to a lack of technical knowledge of the phone architecture and because system manuals are not always available. This research performs tests to capture data from MTK Android phones by applying selected forensic tools and compares their effectiveness by analyzing the extracted results. It is anticipated that a generic extraction tool, once identified, can be used on different brands of smartphones equipped with the same CPU chipset.

Keywords: Mobile forensics, MTK Android phones, Android forensics, physical extraction, flash memory, MT6582.

Full version of this paper is published in the Journal of Digital Forensics, Security and Law (http://www.jdfsl.org)


OPEN FORENSIC DEVICES

Lee Tobin, Pavel Gladyshev

Digital Forensics Investigation Research Laboratory, University College Dublin, Ireland

ABSTRACT Cybercrime has been a growing concern for the past two decades. What used to be the responsibility of specialist national police has become routine work for regional and district police. Unfortunately, funding for law enforcement agencies is not growing as fast as the amount of digital evidence. In this paper, we present a forensic platform that is tailored for cost effectiveness, extensibility, and ease of use. The software for this platform is open source and can be deployed on practically all commercially available hardware devices such as standard desktop motherboards or embedded systems such as Raspberry Pi and Gizmosphere's Gizmo board. A novel user interface was designed and implemented, based on Morphological Analysis.

Keywords: Forensic device, open source, write-blocker, forensic imaging, morphological analysis, user interface design.

Full version of this paper is published in the Journal of Digital Forensics, Security and Law (http://www.jdfsl.org)


A STUDY ON ADJACENCY MEASURES FOR REASSEMBLING TEXT FILES

Alperen Şahin, Hüsrev T. Sencar

TOBB University of Economics and Technology, Ankara, Turkey [email protected], [email protected]

ABSTRACT Recovery of fragmented files relies on the ability to accurately evaluate the adjacency of two fragments. Text-based files typically organize data in a very weakly structured manner; therefore, fragment reassembly remains a challenging task. In this work, we evaluate existing adjacency measures that can be used for assembling fragmented text files. Our results show that the individual performances of existing measures are far from adequately addressing this need. We then introduce a new approach that attempts to exploit the limited structural characteristics of text files, which utilize constructs for the description, presentation, and processing of file data. Our approach builds a statistical model of the ordering of file-type-specific constructs and incorporates this information into adjacency measures for more reliable fragment reassembly. Results show that reassembly accuracy increases significantly with this approach.

Keywords: File carving, text files, fragmentation, file reassembly.

Full version of this paper is published in the Journal of Digital Forensics, Security and Law (http://www.jdfsl.org)


AN INTEGRATED AUDIO FORENSIC FRAMEWORK FOR INSTANT MESSAGE INVESTIGATION

Yanbin Tang, Zheng Tan, K.P. Chow, S.M. Yiu

Department of Computer Science, The University of Hong Kong, China {ybtang, ztan, chow, smyiu}@cs.hku.hk

ABSTRACT Voice chat in instant message (IM) apps is getting popular. A huge amount of manpower is required to listen to, analyze, and identify relevant chat files of IM apps in a forensic investigation. This paper proposes a semi-automatic integrated framework to deal with audio forensic investigation for IM apps by applying modern technologies. The main objective is to reduce the amount of manpower needed in the investigation. This is the first work that applies speech-to-text technology to the voice chat of IM apps in forensics. Both text and audio features are extracted to reconstruct the dialog conversation. Experiments with real case data show that the framework is promising. The framework is able to translate dialog into readable text and improve efficiency during the investigation with the reconstructed conversation.

Keywords: Audio, voice chat, instant message, smartphone, digital forensics.

Full version of this paper is published in the Journal of Digital Forensics, Security and Law (http://www.jdfsl.org)


PROJECT MAELSTROM: FORENSIC ANALYSIS OF THE BITTORRENT-POWERED BROWSER

Jason Farina, M-Tahar Kechadi, Mark Scanlon

School of Computer Science, University College Dublin, Ireland [email protected], [email protected], [email protected]

ABSTRACT In April 2015, BitTorrent Inc. released their distributed peer-to-peer powered browser Project Maelstrom into public beta. The browser facilitates a new alternative website distribution paradigm to the traditional HTTP based, client-server model. This decentralised web is powered by each of the users accessing each Maelstrom hosted website. Each user shares their copy of the website with other new visitors to the website. As a result, a Maelstrom hosted website cannot be taken offline by law enforcement or any other parties. Due to this open distribution model, a number of interesting censorship, security and privacy considerations are raised. This paper explores the application, its protocol, sharing Maelstrom content and its new visitor powered “web-hosting” paradigm.

Keywords: Project Maelstrom, BitTorrent, Decentralised Web, Alternative Web, Browser Forensics.

Full version of this paper is published in the Journal of Digital Forensics, Security and Law (http://www.jdfsl.org)


FACTORS INFLUENCING DIGITAL FORENSIC INVESTIGATIONS: EMPIRICAL EVALUATION OF 12 YEARS OF DUBAI POLICE CASES

Ibtesam Al Awadhi, Janet C Read
University of Central Lancashire
School of Computing, Engineering and Physical Sciences, Preston, UK
{IAlawadhi, JCRead}@uclan.ac.uk

Andrew Marrington
Zayed University
College of Technological Innovation, Dubai, UAE
[email protected]

Virginia N. L. Franqueira
University of Derby
College of Engineering and Technology, Derby, UK
[email protected]

ABSTRACT In digital forensics, the person-hours spent on investigation is a key factor which needs to be kept to a minimum whilst also paying close attention to the authenticity of the evidence. The literature describes the challenges behind increasing person-hours and identifies several factors which contribute to this phenomenon. This paper reviews these factors and demonstrates that they do not wholly account for increases in investigation time. Using real case records from the Dubai Police, an extensive study explains the contribution of other factors to the increase in person-hours. We conclude this work by emphasizing several factors affecting person-hours, in contrast to what most of the literature in this area proposes.

Keywords: Cyber forensics, Digital forensics, Empirical data, Forensic investigation, Dubai Police.

Full version of this paper is published in the Journal of Digital Forensics, Security and Law (http://www.jdfsl.org)


PLC FORENSICS BASED ON CONTROL PROGRAM LOGIC CHANGE DETECTION

Ken Yau, Kam-Pui Chow

University of Hong Kong, Hong Kong, China [email protected], [email protected]

ABSTRACT A Supervisory Control and Data Acquisition (SCADA) system is an automated industrial control system. It is built with multiple Programmable Logic Controllers (PLCs). A PLC is a special form of microprocessor-based controller with a proprietary operating system. Due to the unique architecture of PLCs, traditional digital forensic tools are difficult to apply. In this paper, we propose a program called Control Program Logic Change Detector (CPLCD); it works with a set of Detection Rules (DRs) to detect and record undesired incidents that interfere with the normal operation of a PLC. In order to prove the feasibility of our solution, we set up two experiments for detecting two common PLC attacks. Moreover, we illustrate how CPLCD and the network analyzer Wireshark can work together in performing a digital forensic investigation on a PLC.

Keywords: PLC Forensics, SCADA Security, Ladder Logic Programming

Full version of this paper is published in the Journal of Digital Forensics, Security and Law (http://www.jdfsl.org)


FORENSIC ACQUISITION OF IMVU: A CASE STUDY

Robert van Voorst
National Police of the Netherlands
Rotterdam, Netherlands
[email protected]

M-Tahar Kechadi, Nhien-An Le-Khac
University College Dublin
Dublin 4, Ireland
{tahar.kechadi,an.lekhac}@ucd.ie

ABSTRACT There are many applications available for personal computers and mobile devices that facilitate users in meeting potential partners. There is, however, a risk associated with the level of anonymity in using instant message applications, because there exists the potential for predators to attract and lure vulnerable users. Today, Instant Messaging within a Virtual Universe (IMVU) combines custom avatars, chat or instant message (IM), community, content creation, commerce, and anonymity. IMVU is also being exploited by criminals to commit a wide variety of offenses. However, there is very little research on the digital forensic acquisition of IMVU applications. In this paper, we first discuss the challenges of IMVU forensics. We present the forensic acquisition of an IMVU 3D application as a case study. We also describe and analyse our experiments with this application.

Keywords: Instant Messaging, forensic acquisition, Virtual Universe 3D, forensic process, forensic case study

Full version of this paper is published in the Journal of Digital Forensics, Security and Law (http://www.jdfsl.org)


CYBER BLACK BOX/EVENT DATA RECORDER: LEGAL AND ETHICAL PERSPECTIVES AND CHALLENGES WITH DIGITAL FORENSICS

Michael Losavio University of Louisville Department of Criminal Justice Louisville, Kentucky 40292 U.S.A. [email protected]

Pavel Pastukov Perm State National Research University Department of Criminal Procedure and Criminalistics Perm, Russian Federation [email protected]

Svetlana Polyakova Perm State National Research University Department of English Language and Intercultural Communication Perm, Russian Federation [email protected]

ABSTRACT With ubiquitous computing and the growth of the Internet of Things, there is vast expansion in the deployment and use of event data recording systems in a variety of environments. From the ships' logs of antiquity through the evolution of personal devices for recording personal and environmental activities, these devices offer rich forensic and evidentiary opportunities that smash against rights of privacy and personality. The technical configurations of these devices provide for a greater scope of sensing, interconnection options for local, near, and cloud storage of data, and the possibility of powerful analytics. This creates the unique situation of near-total data profiles on the lives of others. We examine the legal and ethical issues of such systems in the American and transnational environment.

Keywords: event, data, recorder, legal, ethical, privacy

Full version of this paper is published in the Journal of Digital Forensics, Security and Law (http://www.jdfsl.org)


TRACKING AND TAXONOMY OF CYBERLOCKER LINK SHARERS BASED ON BEHAVIOR ANALYSIS

Xiao-Xi Fan and Kam-Pui Chow

Department of Computer Science
The University of Hong Kong
Hong Kong, 999077
{xxfan, chow}@cs.hku.hk

ABSTRACT The growing popularity of cyberlocker services has had a significant impact on the Internet: they are considered one of the biggest contributors to global Internet traffic, and a large share of the content they carry is estimated to be illegally traded. Due to the anonymity property of cyberlockers, it is difficult for investigators to track user identity directly on a cyberlocker site. In order to find the potential relationships between cyberlocker users, we propose a framework to collect cyberlocker-related data from public forums, where cyberlocker users usually distribute cyberlocker links for others to download and where identity information can be gathered easily. Different kinds of sharing behaviors of forum users are extracted to build profiles, which are then analyzed with statistical techniques. The experiment results demonstrate that the framework can effectively detect profiles with similar behaviors for identity tracking and produce a taxonomy of forum users to provide insights for investigating cyberlocker-based piracy.

Keywords: identity tracking, taxonomy, user profiling, behavior analysis, cyberlocker, piracy

Full version of this paper is published in the Journal of Digital Forensics, Security and Law (http://www.jdfsl.org)


EXPLORING THE USE OF PLC DEBUGGING TOOLS FOR DIGITAL FORENSIC INVESTIGATIONS ON SCADA SYSTEMS

Tina Wu, Jason R.C. Nurse

Cyber Security Centre, Department of Computer Science
University of Oxford
Oxford, United Kingdom
{tina.wu,jason.nurse}@cs.ox.ac.uk

ABSTRACT
The Stuxnet malware attack provided strong evidence of the need for a forensic capability to aid thorough post-incident investigations. Current live forensic tools are typically used to acquire and examine memory from computers running either Windows or Unix. This makes them incompatible with the embedded devices found on SCADA systems, which have their own bespoke operating systems. Currently, only a limited number of forensic tools have been developed for SCADA systems, and none has been developed to acquire the program code from PLCs. In this paper, we explore this problem with two main hypotheses in mind. Our first hypothesis was that the program code is an important forensic artefact that can be used to determine an attacker's intentions. Our second hypothesis was that PLC debugging tools can be used for forensics to facilitate the acquisition and analysis of the program code from PLCs. With direct access to the memory addresses of the PLC, PLC debugging tools offer promising forensic functionality, such as a "Snapshot" function that reads values directly from the PLC's memory addresses without vendor-specific software. As a case example, we focus on PLC Logger as a forensic tool to acquire and analyse the program code on a PLC. To test these two hypotheses, we developed two experiments. The results of Experiment 1 provided evidence that it is possible to acquire the program code using PLC Logger and to identify the attacker's intention, so the first hypothesis was accepted. In Experiment 2, we used NIST's existing Computer Forensics Tool Testing (CFTT) framework to test PLC Logger's suitability as a forensic tool to acquire and analyse the program code. Based on the results, the second hypothesis was rejected, as PLC Logger failed half of the tests. This suggests that PLC Logger in its current state has limited suitability as a forensic tool unless its shortcomings are addressed.
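As an illustration of what "Snapshot"-style acquisition involves, the sketch below reads a block of holding registers from a PLC over Modbus/TCP and hashes the acquired values for integrity. It is not PLC Logger itself; the host address, register range, and the use of the pymodbus library (3.x-style API shown; parameter names differ across versions) are assumptions.

```python
# A sketch of snapshot-style acquisition from a PLC over Modbus/TCP,
# hashing the acquired register values so the snapshot can later be
# shown to be unaltered. Host and register range are hypothetical.
import hashlib

from pymodbus.client import ModbusTcpClient

PLC_HOST = "192.0.2.10"  # hypothetical PLC address
START, COUNT = 0, 32     # hypothetical register range

client = ModbusTcpClient(PLC_HOST, port=502)
client.connect()
try:
    result = client.read_holding_registers(START, count=COUNT, slave=1)
    if result.isError():
        raise RuntimeError(f"Modbus read failed: {result}")
    snapshot = result.registers  # raw 16-bit register values
finally:
    client.close()

# Serialize the 16-bit values and record a digest for the case file.
raw = bytes(b for r in snapshot for b in r.to_bytes(2, "big"))
digest = hashlib.sha256(raw).hexdigest()
print(f"{COUNT} registers from {PLC_HOST}: sha256={digest}")
```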

Keywords: PLC debugging, program code, SCADA, digital forensics, NIST, PLCs, attackers

The full version of this paper is published in the Journal of Digital Forensics, Security and Law (http://www.jdfsl.org).



THE USE OF ONTOLOGIES IN FORENSIC ANALYSIS OF SMARTPHONE CONTENT

Mohammed Alzaabi, Thomas Martin, Kamal Taha, Andy Jones

Khalifa University of Science, Technology and Research
Sharjah, UAE
{mohammed.alzaabi,thomas.martin,kamal.taha,andy.jones}@kustar.ac.ae

ABSTRACT
Digital forensics investigators face a constant challenge in keeping pace with evolving technologies such as smartphones. Analyzing the contents of these devices to infer useful information is becoming more time consuming as the volume and complexity of data increase. Typically, such analysis is undertaken by a human, which makes it dependent on the experience of the investigator. To overcome such impediments, automated techniques can be used to help the investigator analyze the data quickly and efficiently. In this paper, we propose F-DOS, a set of ontologies that model smartphone content for the purpose of forensic analysis. F-DOS can form the knowledge-management component of a forensic analysis system. Its importance lies in its ability to encode the semantics of smartphone content using concepts and the relationships between them.
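As a toy illustration of the ontology-based approach (not the F-DOS ontologies themselves), the sketch below models a contact and an SMS message as RDF triples with the rdflib library and queries them with SPARQL; the namespace, classes, and properties are hypothetical.

```python
# Illustrative only: encode smartphone content as RDF concepts and
# relations, then recover semantics with a SPARQL query. The ex:
# namespace and its class/property names are invented for this sketch.
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/fdos#")  # hypothetical namespace
g = Graph()
g.bind("ex", EX)

# Instances: a contact and an SMS message linked to that contact.
g.add((EX.alice, RDF.type, EX.Contact))
g.add((EX.alice, EX.phoneNumber, Literal("+15555550100")))
g.add((EX.msg1, RDF.type, EX.SmsMessage))
g.add((EX.msg1, EX.sentTo, EX.alice))
g.add((EX.msg1, EX.body, Literal("meet at 9pm")))

# Query: which contacts received messages, and with what content?
q = """
PREFIX ex: <http://example.org/fdos#>
SELECT ?contact ?body WHERE {
    ?msg a ex:SmsMessage ;
         ex:sentTo ?contact ;
         ex:body ?body .
    ?contact a ex:Contact .
}
"""
for contact, body in g.query(q):
    print(contact, body)
```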

Keywords: digital forensics, forensic analysis, ontology

The full version of this paper is published in the Journal of Digital Forensics, Security and Law (http://www.jdfsl.org).


Editors: Carsten Rudolph, Nicolai Kuntze, Barbara Endicott-Popovsky, Antonio Maña

Proceedings of the 10th International Conference on Systematic Approaches to Digital Forensic Engineering

The SADFE series comprises the editions of the International Conference on Systematic Approaches to Digital Forensic Engineering. Now in its tenth edition, SADFE has established itself as the premier conference for researchers and practitioners working on systematic approaches to digital forensic engineering.

SADFE 2015, the tenth International Conference on Systematic Approaches to Digital Forensic Engineering, was held in Malaga, Spain, September 30 – October 2, 2015.

Digital forensics engineering and the curation of digital collections in cultural institutions face pressing and overlapping challenges related to provenance, chain of custody, authenticity, integrity, and identity. The generation, analysis, and sustainability of digital evidence require innovative methods, systems, and practices, grounded in solid research and an understanding of user needs. The term digital forensic readiness describes systems that are built to satisfy the need for secure digital evidence.

SADFE 2015 investigates requirements for digital forensic readiness as well as methods, technologies, and building blocks for digital forensic engineering. Digital forensics at SADFE focuses on a variety of goals, including criminal and corporate investigations, data records produced by calibrated devices, and the documentation of individual and organizational activities. Another focus is on the challenges posed by globalization and by digital applications that span legal jurisdictions. We believe digital forensic engineering is vital to security, the administration of justice, and the evolution of culture.

ISBN: 978-84-608-2068-0
