Digital Investigation 29 (2019) 180–197


Ten years of critical review on database forensics research

Rupali Chopade*, V.K. Pachghare

Department of Computer Engineering & IT, College of Engineering, Pune, India

Article history:
Received 22 January 2019
Received in revised form 30 March 2019
Accepted 7 April 2019
Available online 11 April 2019

Keywords: Database forensics; Digital forensics; Artifact; Recovery; SQL; NoSQL

Abstract: The database is at the heart of any digital application. With the increased use of high-tech applications, the database is used to store important and sensitive information. Sensitive information storage leads to crimes related to computer activities. Database forensics is an investigation process to discover any untrusted or malicious movement, which can be presented as testimony in a court of law. Database forensics is a subfield of digital forensics which focuses on detailed analysis of a database, including its contents, log files, metadata, and data files, depending on the type of database used. Database forensics research is in its mid age and has not received as much attention as digital forensics research. The reason behind this is the internal complexity of the database as well as the different dimensions to be considered for analysis. This review paper focuses on the last ten years of research related to forensic analysis of relational and NoSQL databases, along with the study of artifacts to be considered for database forensics. This review of the current state of database forensics research will serve as a resource to move forward as far as research and investigation are concerned.

© 2019 Elsevier Ltd. All rights reserved.

Introduction

Databases play an important role in any organization when storage and computing components come into the picture (Son et al., 2011). Nowadays all activities are performed online, through which a lot of sensitive and personal information gets stored in the database. Though securing the database is not a new practice, attackers still try to tamper with the database to take away or alter such information. Database forensics is a field to answer the W5H questions, i.e. what, when, why, where and how database tampering has happened, and by whom. Database forensics is a branch of digital forensics (Guimaraes et al., 2010) which involves forensic phases such as identification, collection, preservation, analysis, and presentation. Database forensics is an upcoming domain to find the evidence of any tampering of the database by authorized or unauthorized users (Pavlou and Snodgrass, 2008). Forensic accounting is an analysis of accounting activity to solve financial fraud cases (Choi et al., 2009) (Pavlou, 2012), and it is gaining importance due to the HIPAA ("Summary of the HIPAA Privacy Rule | HHS.gov," n.d.) and SOX ("The Sarbanes-Oxley Act 2002," n.d.) Acts. As per information security acts like the Sarbanes-Oxley Act (SOX) and the Health Insurance Portability and Accountability Act (HIPAA), it is essential to find the evidence and pass on to customers what was compromised. For database forensics research, understanding the internal architecture of the database is very important. Most of the research work has already been done on relational database forensics, and now NoSQL database forensics is also being considered. In their research papers, many authors have shown their concern (W. K. Hauger and Olivier, 2015a) about the limited research in the field of database forensics, which is due to the multidimensional nature of databases. Research on database forensics from Elsevier, Springer, IEEE, and ACM from the year 2000 onwards, including journals and conferences, is shown in Fig. 1; all IEEE publications are conference publications. Research papers containing database forensics either in the title or in the keyword list are taken into consideration (through the Google search engine). Throughout this review paper, we have tried to present the research done in various aspects of databases which are useful for forensic purposes. As far as forensic study is concerned, the data recovery aspect is also considered. Various methods proposed by researchers for data recovery, like metadata, the use of triggers, the carving approach etc., are reviewed. Forensic study on various relational databases like MySQL, Oracle, SQLite etc. and NoSQL databases like MongoDB and Redis is presented here. The work fleshes out the field of database forensics to explore the various elements of the investigation process. These elements are process models, detection algorithms, forensic methods, artifacts, tools and so on.

* Corresponding author. E-mail addresses: [email protected] (R. Chopade), [email protected] (V.K. Pachghare).

Fig. 1. Number of publications on database forensics, the year 2000 onwards.

Outline of paper

Section 2 reviews the database forensic investigation models. These models are helpful for the database investigation process and give an idea about the work to be done. There are various kinds of database artifacts for forensic study; their explanation is given in section 3. Sections 4 and 5 explore the forensic work done in the field of relational and NoSQL databases respectively. Finally, there is a conclusion.

Database forensic investigation models

Several forensic investigative frameworks have been published on database forensics examination. Al-Dhaqm et al. proposed (Al-Dhaqm et al., 2017) a forensic model which consists of 4 phases, namely identification, artifact collection & preservation, artifact analysis, and documentation & presentation. The model was proposed by reviewing 54 investigation processes from 18 forensic models of databases. The proposed model is validated against 9 existing database forensics investigation models, to check its completeness (Al-Dhaqm et al., 2016). The model is named Common Database Forensic Investigation Processes for Internet of Things (CDBFIP). The model is named with IoT because a number of applications are now available with a wide range of sensors, CCTV cameras etc., which need a database for storage purposes.

Al-Dhaqm et al. developed (Al-Dhaqm et al., 2017) a database forensic meta-model which can be used to solve forensic problems. They used 8 steps to create this model.
These steps and the work done in each step are given in Table 1. A limitation of this model is that it does not find the missing concept.

A generalized database forensic model is proposed by Pavlou et al. (Pavlou and Snodgrass, 2013). Here the term generalized means stressing the 'where' dimension. This model is applicable to find physically where, and which, values of the database are corrupted. The solution is given by page based partitioning along with attribute based partitioning and a corruption diagram (partitioning based on pages or attributes). This is the first paper which discusses a forensic cost model for various forensic analysis algorithms. The lowest cost is associated with the Tiled Bitmap and a3D analysis algorithms, while the highest cost is for the RGBY algorithm. Forensic analysis algorithms make use of concepts like total and partial chain computation. In total chaining, hash values of transactions are calculated and then these values are digitally notarized. In partial chaining, hash values are recalculated and compared with previously notarized values. The term chain refers to the order of hash calculation for transactions as they are committed (a detailed explanation of hash and notary ID calculation is available in section Audit log). The working of these algorithms is summarized in Table 2 (Pavlou and Snodgrass, 2010), and a small illustration of chaining follows Table 2. The dissimilarity among these algorithms is the number of hash chains used and their overall structure.

Al-Dhaqm et al. (Al-Dhaqm et al., 2014) discussed issues and challenges in the forensic investigation of databases. These issues include lack of training and knowledge, the differing frameworks of databases, lack of a process model etc. Due to these issues and challenges, there is no straightforward process for investigation. They presented a meta-model based on a software engineering approach. The term meta-model refers to a special form of model, which is based on concepts collected from existing models. It categorizes different features and related concepts. For concept collection, the databases used are Oracle, MySQL, and SQL Server. Examples of concepts are live response, data collection, incident reporting phase, artifact analysis etc. The proposed meta-model consists of phases covering preparation, data collection & analysis, investigation team & methods, network etc.

Managing Database Forensic Investigation knowledge (DBFI) based on conceptual investigation is presented by Razak et al. (2016). The model is developed by keeping the focus on investigation algorithms, artifacts (including volatile and non-volatile), and tools like Log Miner and SQL Profiler.

Table 1 Steps involved in model creation.

Step 1 (Preparing knowledge sources): Knowledge sources for the models are prepared using the Google search engine, books, journals etc. Sixty models were discovered by using keywords like database forensic process, investigation, tampering etc. on the above sources. These are useful for knowledge gathering. The models are categorized into: i. database forensic dimensions-based, ii. database forensic technology-based, iii. database forensic investigation process-based.

Step 2 (Recognize & extract general concepts): Concepts are extracted from the models discovered in step 1. Recognized concepts include terms like database server, investigation team and forensic technique (nouns), and volatile and non-volatile artefacts (adjective + noun). Other concept examples are redo log, log etc. In this way 246 general concepts are recognized.

Step 3 (Nominate & propose common concepts): These 246 concepts are grouped into 44 clusters by considering their similarity in function and meaning.

Step 4 (Short-listing of candidate definitions): For common definitions, the concept definitions shortlisted are reconstruction, transaction logs, log file and timeline.

Step 5 (Reconciliation of definitions): This step is useful to reconcile definitions among the many definitions developed by different people, generated through multiple sources.

Step 6 (Designation of proposed common concepts into database forensic processes): Concepts are mapped into the processes identification, artifact collection, artifact analysis, and documentation & presentation.

Step 7 (Identifying relationships between concepts & the resultant Database Forensic Meta-Model (DBFM)): Relationships are determined by denoting association, specialization and aggregation.

Step 8 (Validation of the database forensic meta-model): (a) Comparison against other models: the model is compared with existing models and a specific forensic case; it has not been evaluated by a specific forensic examiner. (b) Frequency based: aims at evaluating the importance of concepts used in the model; the frequency of selected features is categorized as very strong, strong, moderate, mild and very mild.

Table 2 Forensic analysis algorithms comparison.

Monochromatic. Explanation: works on the simple concept of total chain calculation, by sequentially hashing all data. It is based on a single hash calculation, so it is termed monochromatic (black), and it is cumulative in nature. Recommendation: does not suffer from false positives, but can detect only single-event corruption.

RGB. Explanation: based on partial chain computation over the entire database, so it is non-cumulative in nature. In RGB, colour-coded chains Red, Green and Blue are added. These coloured chains consist of individual transaction order, which is maintained as a linked sequence. Binary search can be performed using the cumulative chain to detect old corruption; the colour chains are used for the rest of the database. Recommendation: does not suffer from false positives and can detect up to two corruption events.

RGBY. Explanation: along with the total chain, the enhancement of this algorithm is the addition of a yellow chain, which is useful for partial hash calculation. Recommendation: recommended for a large number of corruptions, but the overlap between hash chains may introduce false positives.

a3D. Explanation: the most advanced algorithm is a3D, which maintains a binary tree of hash chains as transaction time increases. Partial chains slowly grow for each validation. It also maintains several small hash chains for specific parts of the data during the validation scan. By dynamically generating small chains, the binary tree is formed; this tree can be searched to trace the corruption events. Recommendation: for a small number of corruptions the cost is low; attribute-based a3D is recommended for additional semantic information.

Tiled Bitmap. Explanation: based on the concept of a candidate set (Bhagwani et al., 2015), the set of all possible corruption events that need to be traced. Corruption events can be found using the hash and notarization system. The algorithm uses partial hash chain computation as well as a refined group of hash chains, which forms a bitmap over the data. A tile is a repeating pattern of partial hash chains. It can detect multiple corruption events and is useful to find whether tampering is post-dated or back-dated. Recommendation: for a small number of corruptions the cost is low; it is preferred considering the long-term cost of the algorithm; may introduce false positives.
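To make the chaining concepts concrete, the following minimal Python sketch (our illustration, not the implementation of Pavlou and Snodgrass; the notarized value here is simply a stored hash standing in for the external notarization service) shows a total chain over committed transactions and the later validation scan that detects tampering.

```python
# Minimal sketch of total-chain hashing with notarization. Partial
# chaining applies the same idea to subsets of the data instead of
# the whole transaction history.
import hashlib

def chain_hash(transactions, seed=b""):
    """Cumulatively hash transactions in commit order (a total chain)."""
    h = seed
    for t in transactions:
        h = hashlib.sha256(h + t.encode()).digest()
    return h

committed = ["INSERT r1", "UPDATE r1", "INSERT r2"]
notarized = chain_hash(committed)          # value sent to the notary

# Later validation scan: recompute the chain and compare with the
# previously notarized value; a mismatch reveals corruption.
committed[1] = "UPDATE r1 (tampered)"
assert chain_hash(committed) != notarized  # corruption detected
```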

Various forensic investigation methods are also taken into consideration, like investigation, collection, analysis, detection & preservation. A few examples of each method are given in Table 3. To develop this model, they reviewed fourteen digital forensic investigation process models.

Artifacts of database forensics study

For a database forensic investigation study, various artifacts need to be focused on. These artifacts are shown in Fig. 2.

Metadata

Martin Olivier focused on (Olivier, 2009) the different views of file system forensics and database forensics, with their differences. File systems are represented using hierarchical structures. Databases are multidimensional in nature, so forensic examination should receive attention accordingly. File metadata is associated with a file, like the file name, date & time of creation etc. Files are represented using file systems like FAT, NTFS, and ext3. As compared to files, databases have a more complex structure. They are represented using two dimensions. One dimension shows the schema structure, namely the external, conceptual and internal levels. The second dimension, called the orthogonal intention-extension dimension, consists of the data model, data dictionary, application schema, and application data. Basically, the forensic process consists of different phases like preparation, acquisition, analysis, and reporting. In this process, an image is generated which is a copy of the disk.

Table 3 Methods and examples used in model development.

Database forensic investigation process models: MySQL: identification, collection of artifacts, analysis of artifacts and report.
Database forensic investigation algorithms: Monochromatic, RGB, RGBY, a3D, Tiled Bitmap.
Database forensic investigation artifacts: Memory cache, data files, registry values, log files, event logs etc.
Database forensic investigation tools: Log Miner (Oracle); SQL Trace & SQL Profiler (MS SQL Server); ProfilerEventHandler (MySQL).
Database forensic investigation methods: Policies, authorization, authentication resources, OS, database, network resources etc.
Database forensic detection methods: Detecting the crime, type of database, type of attack, timing and resources of the attack etc.
Database forensic collection methods: Live acquisition: data collection from volatile artifacts like SQL cache, data cache etc. Dead acquisition: data collection from non-volatile artifacts like transaction logs, data files, log files etc. Hybrid acquisition: combination of live and dead, to collect the best possible data.
Database forensic preserving methods: Hash functions.
Database forensic analysis methods: Reconstructing and recovering database activities.

Fig. 2. Database forensic investigation artifacts.

Fig. 3. DBMS: four abstract layers.

Sometimes file reassembling is needed if the file structure is damaged; this is known as file carving. Table 4 shows the differences between file and database forensics with respect to different features.

Hector Beyers et al. (2011) described a forensic investigation method which transforms the database into the required state. The database is explained with four abstract layers, as shown in Fig. 3, and the experiment is then tested on these layers using the PostgreSQL database. They tested the database forensic method using the following four scenarios. The scenarios are represented using a binary code, with one bit per abstract layer: binary value 0 means that the abstract layer is clean, and value 1 indicates that the respective layer is compromised. These scenarios are shown in Table 5 (a small sketch of interpreting such codes follows Table 5). For scenario 0001 the test was successful: it properly showed the same state of application data as the compromised state, and it also showed the correct application schema.

Table 4 Difference between file forensics and database forensics.

Integrity. File system forensics: in imaging it is verified whether the copy created is an exact copy of the original; hashing techniques can be used to check the integrity. Database forensics: is the result returned by a query a trusted result or not?

Searchability. File system forensics: locating the required information from the whole body of information analyzed. Database forensics: the database is the best example of searching for particular information.

Restoration. File system forensics: locating information that is already deleted (file carving); in the file system it is generally in slack space or in partitions that do not use a known file format. Database forensics: restoration may refer to restoring deleted records or deleted tables, known as record carving and table carving; data may be restored using carving or by reconstruction (deriving the schema using the regular structure of the carved table).

Metadata extraction. File system forensics: extraction of information like date and time of creation, last modification or last access of the file. Database forensics: describes exactly what type of data it is.

Attribution. File system forensics: usually not seen in the file system; it is based on the style of the content of the file or on metadata associated with the file. Database forensics: two approaches can be used; one is to use logs to extract the information (but the log may be modified) and the second is to use metadata to find authorization details; a combination of the two approaches can also be used.
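As a small illustration of the Integrity row in Table 4, the hash comparison that verifies a forensic image can be sketched as follows (our sketch; the file names are hypothetical).

```python
# Verify that a forensic copy is bit-identical to the original by
# comparing cryptographic hashes (the Integrity check in Table 4).
import hashlib

def sha256_of(path, block=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(block):
            h.update(chunk)
    return h.hexdigest()

# 'original.db' and 'image.db' are hypothetical file names.
if sha256_of("original.db") == sha256_of("image.db"):
    print("image verified: exact copy")
```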

Table 5 Scenario working.

Scenario 1111 (all layers are compromised). Working: the compromised database is copied to a second virtual machine. Requirement: while copying, the database should be kept as it is, without any change.

Scenario 0000 (all layers are clean; rare case). Working: create a data model and data dictionary. Requirement: the installation should be clean.

Scenario 0011 (data model & data dictionary clean, while application schema and application data are compromised). Working: the focus is on the application schema and data; done using a script. Requirement: copying application schema and data to a clean DBMS.

Scenario 0001 (data model, data dictionary & application schema are clean, while application data is compromised). Working: the focus is on application data. Requirement: the case is similar to scenario 0011; requires three virtual machines for evidence analysis.
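The four-bit scenario codes in Table 5 map one bit to each abstract layer; a tiny sketch (ours) of how such a code can be interpreted:

```python
# Interpret Beyers-style scenario codes: one bit per abstract layer,
# 0 = clean, 1 = compromised (e.g. "0011").
LAYERS = ["data model", "data dictionary", "application schema", "application data"]

def describe(scenario):
    return {layer: ("compromised" if bit == "1" else "clean")
            for layer, bit in zip(LAYERS, scenario)}

print(describe("0001"))  # only the application data layer is compromised
```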

A forensic framework was developed by Khanuja et al. (Khanuja and Suratkar, 2014) for finding various kinds of attacks like brute force, buffer overflow etc. This framework works in two stages. In the first stage, by scanning the required log files of the database, a metadata file is generated. The required log files may contain transaction logs, cache files, binary logs, server error logs, system event logs etc. This file is useful to find the relevant traces of malicious activities performed by the user. Based on inference rules, decisions are taken to get the required information from the metadata file. In stage two, based on the artifacts collected, reconstruction of activities is done to generate a final forensic report. The evidence consists of how, what, where, who and why the malicious transactions were carried out.

Oluwasola Fasan et al. presented their views about the very little research done in the area of database forensics (Fasan and Olivier, 2012b). In database forensics, focus on all dimensions of research is very important. Database forensics is an upcoming research domain of digital forensics which deals with database identification, collection, preservation, analysis, and presentation.

Joshua Sablatura et al. represented three dimensions of reconstruction in database forensics, as given in Table 6 (Sablatura and Zhou, 2017). File carving is a process to reconstruct files without the help of the metadata of the file system that created the original file. It is a challenging task to find whether the database is damaged, modified or compromised and to start research in more than one dimension.

Fasan et al. (Fasan and Olivier, 2012a) presented an algorithm for the inverse operation to build the prior state of the database. Relational algebra is a formal query language used to show an internal form of SQL queries. They showed the inverse of basic relational algebra operations like project, select, join, intersection, divide, union and difference. The initial query log is converted into a relational algebra (RA) query log. This RA log is then grouped into value blocks, which are blocks of queries. A reconstruction algorithm by the name 'solve' is used, which takes as input the name of the relation which needs to be reconstructed, the value block, and a new relation in which the result is to be stored. Here the inverse function is used. An extension of this work is shown as a proof of correctness (Fasan and Olivier, 2012c). Partial and total correctness is proved by using lemmas for the 'inverse' function. In partial correctness, the rename, selection, intersection, union and difference operations show the correct result for the inverse. Cartesian product, join and projection cannot be parameters of the inverse function, but the inverses of these operations can be calculated correctly. In total correctness, the lemma verifies that the number of log entries, value blocks, possible inverses and value block tuples is finite. Finally, the theorem that the database reconstruction algorithm always terminates is verified.
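The flavour of these inverse operations can be illustrated with a deliberately simplified Python sketch (ours, not Fasan and Olivier's 'solve' algorithm): relations are modelled as sets of tuples, and a logged value block is undone in reverse order, removing inserted tuples and restoring deleted ones.

```python
# Simplified illustration of inverting logged operations to rebuild a
# prior database state. The value block is a list of (operation, tuple)
# log entries applied to the relation.
def invert(relation, value_block):
    state = set(relation)
    for op, row in reversed(value_block):    # undo in reverse log order
        if op == "insert":
            state.discard(row)               # inverse of insert = remove
        elif op == "delete":
            state.add(row)                   # inverse of delete = restore
    return state

current = {("alice", 30), ("bob", 25)}
log = [("delete", ("carol", 41)), ("insert", ("bob", 25))]
print(invert(current, log))  # prior state: {("alice", 30), ("carol", 41)}
```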
Then based on the envi- inverse operation to build the prior state of the ronment selected, the specific method can be used to find the database. Relational algebra is a used to show an evidence. internal form of SQL queries. They showed inverse of basic rela- tional algebra operations like the project, select, join, intersection, divide, union and difference. The initial query log is converted into Triggers relational algebra (RA) query log. Then this RA log is grouped into value blocks, which are a block of queries. Reconstruction algorithm Database triggers are the activities executed on data when by the name ‘solve’ is used, which takes input as the name of modifications are done with respect to data. Generally, database relation which needs to be reconstructed, the value block and new triggers are available on user created tables and views only (Still relation in which result to be stored. Here inverse function is used. few Relational databases do not support both).

Table 6 Dimensions in database forensics.

Compromised database. Meaning: metadata or the database management system has been modified, but the database is still in an operational state. Concern: the decision to be taken is whether to use the metadata as it is or to use a clean copy of the database.

Modified database. Meaning: data files of the database are modified. Concern: reconstruction algorithms are available.

Damaged database. Meaning: data files of the database are either deleted, modified or copied to another location. Concern: most of the research is under this category; even data hiding and database tampering fall under this category; file carving may be useful for retrieving data.

Fig. 4. Changes to database.

Table 7 Application schema damage types.

Application schema damage. Examples: SQL Drop command (to remove a table or keys/indexes); Alter (to change access privileges); Corrupt table (which changes metadata).

Application schema alterations to deliver wrong results. Examples: Column swap (e.g. interchanging the columns product quantity and price); Operator swap (changing an operator's meaning, e.g. addition performing multiplication); Creating a view to replace a table (creating a copy of the table to hide the data); Aggregation damage (changing the functionality of operations, e.g. sum performing average).

Database behaviour alterations. Examples: Drop index (to slow down search operations); Black hole storage engine (changing the default storage engine to one which accepts data but does not store it); Custom storage engine (replacing the default storage engine to perform malicious activities).

Impact of trigger

Werner Hauger et al. (Hauger and Olivier, 2014) presented the impact on the forensic process if triggers are present. They suggested that the forensic process should be improved to handle the presence of database triggers. Trigger operations are defined by making changes to rows, like inserting, updating or deleting them. Triggers are defined either before these operations or immediately after them. There are two kinds of triggers: a row level trigger, which is activated for every row to be affected, and a statement level trigger, which activates only once. Table 8 shows the different types of trigger and the DBMSs which support them.

In an extension of the work on the role of triggers in database forensics (Hauger and Olivier, 2014), Werner Hauger et al. (W. K. Hauger and Olivier, 2015b) examined the impact of the trigger based on the acquisition method employed. During a forensic investigation, one of the steps is data collection and preservation, which is called acquisition. There are two kinds of acquisition, live and dead. Whenever a crime happens, if the data collection process is carried out with the system in the on state, it is called live acquisition. When the system is disconnected from power and other networking media, the data collection process is called dead acquisition.

One serious example of a trigger's effect is discussed by Werner Hauger et al. For medical treatment, the patient's information is stored in the database: basic information is stored in the main table, and additional information, like organ donor assent or allergies, is stored in additional_info. Whenever a patient comes for treatment, he or she is asked to fill in a form, and as per the details of the form a clerk enters the information in the database. One patient was allergic to penicillin and filled in that information in his form; the clerk entered the same in the database. An attacker had written an Instead Of trigger which changed the 'allergic to penicillin' value before the actual insertion of the information was executed (a minimal sketch of such a trigger is given after this subsection). After reading all the details, the doctor advised him penicillin as part of the treatment. This treatment led to the patient's death due to an allergic reaction. During the investigation, when the cause of death was discovered, the form was checked; the patient had correctly written about the allergy, but the system did not have that record. This points towards carelessness of the clerk in making the entry in the database, which was not the reality.

Objects on which trigger can be applied

Triggers can be used on database objects like user-created tables and user-created views. Oracle, MySQL, SQL Server and DB2 all support triggers created on tables as well as on views, except MySQL, which does not allow triggers on views. No DBMS allows triggers to be created on system-defined tables or views.
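The medical example above can be reproduced almost literally with an INSTEAD OF trigger. The following sqlite3 sketch (our illustration, with a hypothetical schema; the cited work discusses other DBMSs) shows how a planted trigger silently rewrites data between the clerk's insert and what is actually stored.

```python
# A malicious INSTEAD OF trigger on a view, rewriting the allergy
# value before the real insert happens (hypothetical schema).
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE additional_info (patient_id INTEGER, allergic_to TEXT);
CREATE VIEW patient_entry AS SELECT patient_id, allergic_to FROM additional_info;

-- Planted by the attacker: fires instead of the INSERT on the view.
CREATE TRIGGER tamper INSTEAD OF INSERT ON patient_entry
BEGIN
    INSERT INTO additional_info VALUES (NEW.patient_id,
        CASE WHEN NEW.allergic_to = 'penicillin' THEN NULL
             ELSE NEW.allergic_to END);
END;
""")

# The clerk faithfully enters what the patient wrote on the form ...
con.execute("INSERT INTO patient_entry VALUES (1, 'penicillin')")
# ... but the stored record no longer carries the allergy.
print(con.execute("SELECT * FROM additional_info").fetchall())  # [(1, None)]
```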

Table 8 Trigger types.

Standard triggers:
Insert, update, delete: activated whenever these manipulation operations are performed on rows. Supported by Oracle, MySQL, SQL Server, DB2.
After: activation of the trigger after the operation. Supported by Oracle, MySQL, SQL Server, DB2.
Before: activation of the trigger before the operation. Supported by Oracle, MySQL.
Instead Of: to modify a view (which cannot be modified using Data Manipulation Language (DML) statements). Supported by Oracle, SQL Server.
Row level: activates for every row (depending on the number of rows affected). Supported by Oracle, MySQL, SQL Server, DB2.
Statement level: activates only once. Supported by Oracle, SQL Server, DB2.

Non-standard triggers:
Data Definition Language (DDL) triggers: fired on changes made to the data dictionary, like create, alter, drop. Supported by Oracle, SQL Server.
Non-data triggers: activate when an event occurs during normal or routine use of the database. Supported by Oracle, SQL Server.
Login (can perform either conditional login or completely block all logins): activates either before login or after login (SQL Server); can perform an action only when login is successful (Oracle).
Server error: activates when a non-critical server error occurs, like an Oracle internal error, not found, out of process memory etc.; used to send a notification or to perform a specific operation. Supported by Oracle.
Role-change: fired during database start-up (to perform initialization steps) and database shutdown (when successfully closed, to perform clean-up activities if required). Supported by Oracle.

The presence of database triggers may lead to incorrect or incomplete operations. For any malicious activity, it is important for a forensic investigator to check for the presence of triggers. Some databases under investigation may contain hundreds or thousands of triggers, and it is not feasible to analyze each and every trigger's code affecting the data. A more suitable technique is suggested by Werner Hauger et al. (W. Hauger and Olivier, 2015c). They suggested two algorithms which are advantageous for finding a specific object affected through the triggering concept. The first algorithm is based on a top-down approach, as shown in Fig. 5. When it starts searching for a specific object name, it produces a list of procedures, functions, tables or views which may contain that name. The limitation of this algorithm is that searching for the specific object in this huge list is not reliable. This limitation is overcome by the second algorithm, which uses a bottom-up approach, as shown in Fig. 6. This algorithm starts by checking triggers, procedures and functions and then narrows the approach down towards the specific object. The given algorithms were tested on SQL Server and Oracle. There are some factors which may affect the working of the algorithms, like encrypted data, use of different data types, recursion used in functions/procedures, low performance due to string matching, and case sensitivity of object names. Currently the algorithm checks only one object for the presence of a trigger; repeating the same process for every object is not feasible. Future scope is to improve the algorithm's efficiency.

Data structure

A data structure provides a format through which data can be stored in a proper way for easy retrieval. In databases, for faster data search, the concept of indexing is used. Indexing is based on a tree format. Researchers' work on the effect of different trees on the database is reviewed here.

B+ tree

Peter Kieseberg et al. presented that, as far as the forensic process is concerned, B+ trees play an important role in the internal storage engines of the database (Kieseberg et al., 2011). For faster I/O operations in databases, an indexing mechanism is used; in database storage engines, indexes are built as B+ trees. They showed a B+ tree structure for the insert operation only. The main limitation of this approach is that the insert operation is not bijective.

Peter Kieseberg et al. defined B+ trees (Kieseberg et al., 2013), a widely used data structure for database systems, with a signature, and developed an algorithm for the same. This signature concept supports a signature logging mechanism which is used for forensic purposes. Whenever any changes are made to the B+ tree, the B+ tree signature also changes, and these changes can be verified through the signature log.

Peter Kieseberg et al. (2018) proved that B+ trees can be used to find changes in structure caused by insert statements, while finding a previous deletion of data using B+ trees is not possible. They focused only on the leaf nodes of the trees and not the entire structure.
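A toy version of the signature-logging idea can be sketched as follows (our simplification; Kieseberg et al. sign the actual tree structure rather than a sorted key list). It shows why inserts are verifiable against the log while the final signature says nothing about earlier deletions.

```python
# Toy signature logging for an index: after each modification the
# (sorted) leaf keys are hashed and the signature appended to a log.
import hashlib

def signature(keys):
    data = ",".join(map(str, sorted(keys))).encode()
    return hashlib.sha256(data).hexdigest()[:16]

keys, sig_log = [], []
for k in (17, 4, 42):                 # inserts change the structure ...
    keys.append(k)
    sig_log.append(signature(keys))

# ... so an insert is verifiable against the signature log,
print(signature([4, 17, 42]) == sig_log[-1])   # True
# but nothing in the final signature reveals a prior deletion,
# matching the limitation noted above.
```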
Indexing for data hiding & recovery

The MySQL database includes the InnoDB storage engine, which is the most widely used. It uses B+ tree indexing for locating pages. Indexing is valuable for searching data more rapidly. There are two types of indexing used, primary and secondary. The primary index is used on the table; generally, the secondary index is used to retrieve specific data. Through indexing, data is managed in a singly linked list format, where an index is a pointer to a specific record. Leaf nodes consist of the actual data stored in sorted order, and the root node contains the index page. Data hiding is a term used for hiding sensitive or secret data (Fruhwirt et al., 2015). Peter Fruhwirt et al. suggested different techniques for data hiding in MySQL, as shown in Table 9. Future work in these efforts is extending the approach to other databases.

The ESE database is an Extensible Storage Engine used mostly for storing web browser or Windows data. Its internal structure consists of a database header and pages; pages are managed in a B tree structure format. Jeonghyeon Kim et al. presented a technique and tool (Kim et al., 2016) to recover deleted records and tables from this database. Deleted pages and deleted tables can be recovered from the MsysObject table; MsysObject is a catalog table which manages table information. The proposed tool was compared with an existing tool, which proved its accuracy. A limitation of this tool is that it cannot recover data if the record header is damaged.

Fig. 5. Top-down approach for the algorithm.

Storage engine

The storage engine of a database is an important component for forensic study. It describes the format in which data is stored internally. For databases like Oracle and SQL Server the storage engine is integrated with the database, but a few databases like MySQL (Frühwirt et al., 2013) and MongoDB (Yoon and Lee, 2018) support the feature of a custom storage engine (Beyers et al., 2014). For forensic analysis of any database, it is important to get acquainted with its underlying file structure. Peter Fruhwirt et al. proposed how to recover deleted data from the MySQL database file system (Frühwirt et al., 2013, 2010). InnoDB is a commonly used storage engine of MySQL. All information is stored in one file, created as a TableName.frm file. The InnoDB storage format consists of seven parts, as shown in Table 10. Every data type of MySQL also has a different interpretation, as shown in Table 11 (a small decoding sketch follows Table 11). For any type of data interpretation, it is very important to understand the internal file structure. Peter Fruhwirt et al. showed how to convert the commonly used data types into readable string format.

Logs

For database forensics, log files play an important role. Any change in the database can be traced using log files. The log files available in different databases are reviewed here.

Fig. 6. Bottom-up approach for the algorithm.

Redo logs

Fruhwirt et al. discussed the importance of the redo log in InnoDB forensic analysis (Frühwirt et al., 2013). Log files are crucial as far as data recovery is concerned. Whenever the database state is changed, that information is recorded in the redo log file.

Table 9. Data hiding techniques in MySQL InnoDB B+ tree.

Manipulating search results. Working: hiding of data can be achieved using secondary indexing; the index entries are unlinked without being removed. Benefit: data will not be retrieved in normal operations. Drawback: data can be found using a Select statement which does not use any indexing, or by using the primary index (e.g. select * from table_name;).

Reorganizing the index. Working: the links among the primary index are managed; the deleted record is unlinked. Benefit: data cannot be found if the primary index search is used. Drawback: data can be found using a Select statement which uses the secondary index.

Hiding data inside the index. Working: a record containing sensitive data is removed using a modified delete operation (it is not linked to the list of deleted records) and the garbage space is retrieved back using a file carving method. Benefit: used to generate slack space (areas which are not accessible by the storage engine of the database); this space is not allocated by the database. Drawback: the log entry contains traces of the insert and delete operations.

Hiding data inside the free space of an index page. Working: the free space generated between user records and the page directory is used to hide the data. Benefit: this space is not used by the DBMS. Drawback: not safe against overwriting, as the DBMS treats this as free space.

Removing a page from the index. Working: pointers from non-leaf nodes are modified to unlink a page, and it is used to hide the data. Drawback: not feasible, as the page may contain lots of data; it also creates an overhead in rearranging the B+ tree.

Table 10 InnoDB storage parts.

1. Fil Header: defines data like offsets and checksums.
2. Page Header: the outer casing for user records; it defines page positions in the heap and B trees, and is used to find which pages are deleted and how many bytes are free.
3. Infimum and supremum records: these records are automatically created and never deleted; the infimum is the lowest possible value for an existing entry and the supremum is the maximum possible value.
4. User Records: one row is created for every record.
5. Free Space: contains only NULL values, for further processing.
6. Page Directory: contains the pointers among the records; the logical order of the records is maintained by pointers.
7. Fil Trailer: this information is found in the last 8 bytes of a page.

Table 11 MySQL data type interpretation.

1. Tiny int, small int, int, big int: if the value is signed, subtract as shown: Tiny int: 0x80; Small int: 0x80 00; Int: 0x80 00 00 00; Big int: 0x80 00 00 00 00 00.
2. Float and Decimal: the last seven bits stand for the exponent; the 8th bit from the end is the sign flag (if it is 1 then the real number is negative); the other bits represent the mantissa.
3. Date and Time: Date: convert the number to decimal and subtract 0x80 00 00; if the value is before the date 01.01.0000, convert the resulting value to a date object. Datetime: convert the number to decimal and subtract 0x80 00 00 00 00 00 00. Timestamp: convert the hexadecimal string to decimal. Time: convert the hexadecimal string to decimal, take the mod of 1.000.000 and subtract 388.608 from it. Year: MySQL counts the year starting from 1900, i.e. add 1900 to the decimal value (years before 1900 cannot be saved).
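Following the rules in Table 11, the sign-bit convention for stored integers can be decoded as in the sketch below (ours, assuming big-endian storage of the biased value).

```python
# Decode signed integers stored by InnoDB, which biases values by the
# sign bit: per Table 11, subtract 0x80... of the appropriate width.
def decode_signed(raw: bytes) -> int:
    bias = 1 << (8 * len(raw) - 1)          # 0x80, 0x8000, ... per width
    return int.from_bytes(raw, "big") - bias

print(decode_signed(bytes.fromhex("80000005")))  # INT value 5
print(decode_signed(bytes.fromhex("7ffffffb")))  # INT value -5
```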

Redo, as per the name, means re-processing: whenever an error occurs, the database can be restored to an earlier state using the redo log file. They showed the reconstruction of the data manipulation operations insert, update and delete, as these are the operations which alter the database state. Recovery of Data Definition Language statements like Alter, Truncate or Drop table is not considered here. For reconstruction of an Insert statement, entries from mlog_comp_rec_insert need to be checked; different types of entries are available in the log, like tablespace ID, page ID, number of fields etc. For reconstruction of an update operation, mlog_undo_insert is required to find the overwritten data type and a mlog_comp_rec_insert log entry is required to find the new data inserted. Reconstruction of the delete operation is similar to the update; in the case of delete, they considered the execution of a delete query which marks the record as deleted, while actual or physical deletion is not considered. Similarly, reconstruction of a table using redo log files is shown.

Database logs and settings

Oluwasola Adedayo et al. presented (Adedayo and Olivier, 2015) (Khanji et al., 2015) the availability of log files in six different relational databases and suggested log parameter value settings. Table 12 gives an idea about the same.

Traces of malicious activities can be investigated from audit logs. Wagner et al. proposed an architecture named 'DBDetective' (Wagner et al., 2017a), which works by capturing snapshots from RAM and the hard drive. These snapshots are then compared with the audit logs, to check for entries missing from the log file. A suitable example given by Wagner et al. makes this idea clear.
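Before that example, the core comparison DBDetective performs can be pictured as a simple set difference between what the storage snapshots still contain and what the audit log admits to (our sketch, not the tool's implementation).

```python
# DBDetective-style check (simplified): records carved from disk/RAM
# snapshots that are marked deleted but have no matching audit-log
# delete entry point to log tampering.
carved_deleted = {("employee", 101), ("employee", 102)}   # from snapshots
logged_deletes = {("employee", 101)}                      # from audit log

unexplained = carved_deleted - logged_deletes
print(unexplained)   # {('employee', 102)}: deleted with no log entry
```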

Table 12 Database log files.

MySQL. Logs available: Error log (problems related to start/stop of the MySQL server; enabled by default); General query log (client connection information; disabled); Binary log (statements which change MySQL data; disabled); Slow query log (queries that take more execution time than specified; disabled); Relay log (data change statements received from the replication server; disabled). Suggested settings: the general query log, binary log and relay log should be enabled by default.

SQL Server. Logs available: Windows event log, comprising the application log (stores events which occur), the security log (authentication information) and the system log (service start-up & shutdown information), disabled by default; SQL Server agent log (warning/error messages concerning the server agent; disabled); SQL Server error log (error messages; disabled); Transaction log (the most important, and useful for recovery; enabled). Suggested settings: size should be managed to avoid performance issues.

PostgreSQL. Logs available: Write-ahead logging (transaction log) with levels Minimal (how much information is to be written to the log; enabled), Archive (disabled) and Hot Standby (disabled); Server log with parameters log_min_messages (error messages; default NOTICE), log_min_error_statement (default ERROR) and log_statement (default NONE). Suggested settings: set to either Archive or Hot Standby to restore enough information from the log.

Oracle. Logs available: Redo log (changes made to data; enabled); Archived log, with No-archive (does not allow recovery; enabled) and Archive (enables archiving of logs; disabled); Alert log/trace files (record of errors occurring in the database; enabled, trace level 0). Suggested settings: the archive should be enabled and the archive trace level should be set from 0 to 1.

DB2. Logs available: DB recovery log (transaction log) with Circular logging (changes made to data; the log file is overwritten when it is full) and Infinite logging (no overwriting, but logs are closed & archived); Diagnostic information parameters: log_ddl_statements (whether extra information related to DDL statements is stored; default No Log), Logarchmeth1 (log archiving method: OFF enabled, LOGRETAIN disabled), Logfilsiz (size of file, 4–1048572; default 1000 KB). Suggested settings: change log_ddl_statements to Yes and set the file size as large as possible to restore enough information.

Sybase. Logs available: Transaction log (changes made to data; handled automatically, stops when the disk is full); Message log (error messages; created automatically for every new database). Suggested settings: avoid letting log files grow indefinitely / archive shorter log files.

Suppose a database administrator who is declared guilty of cyber fraud wants to delete his entry from the database. If he directly executes a delete statement, that entry will be recorded in the log file. So he first deactivates the log, performs the deletion and then activates the log again. Though this entry is not recorded in the log file, it can be traced from the disk, which contains the evidence of the deleted record (unless some record is overwritten there). They also used the existing tool 'DICE' to reconstruct database storage. They conducted an experiment on the suggested architecture with three relational databases, Oracle, PostgreSQL, and MySQL. A number of experiments were carried out to evaluate the performance of DBDetective. The performance is analyzed in three parts, namely (a) cost estimate to perform memory snapshots, (b) carving process performance against the database, and (c) carving speed against memory snapshots. This experimentation was done on Oracle. Deleted record detection was then done on MySQL with the help of DICE. Then updated and modified record detection, full table scan, and index access detection experiments were carried out. Updated and modified record detection was done on MySQL, while for full table scan and index access detection PostgreSQL was used.

Audit log

An innovative approach using a cryptographic hash function is proposed by Richard Snodgrass et al. for transaction-safe operations. Due to the hash mechanism, it prevents an attacker from disturbing the audit log files (Khanuja and Adane, 2011; Snodgrass et al., 2004). Whenever data is available to the system for operations, its hash value is calculated. The proposed module consists of a notarizer which sends the calculated hash value to an external digital notarization system. This notarization service calculates a notary ID. This ID and the initial hash value are stored in a separate database; the master database is available at a different location. A hash value is then calculated for the database, which undergoes various transactions. The validation application sends this hash value and the previous notary ID to the service. The notarization service uses the notary ID to retrieve the corresponding hash. The old and new hash values are then compared to check for consistency. If there is any change in the hash value, it means that the corresponding database has been compromised.

An efficient model has been developed by Jasmin Azemovic et al. to find database tampering (Azemovic and Music, 2009). Whenever any authorized or unauthorized user makes changes to the database, that information is stored in a data warehouse. Audit logging is performed to answer the access control elements of who, when and what has tampered with data. The developed model is based on an SQL trigger concept to collect audit logs. Integrity is achieved by using a hashing mechanism: row level and column level hash values are calculated, and a change in any row or column can be checked by comparing the new and old hash values.
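A minimal sketch (ours) of such row- and column-level hashing shows how comparing stored and recomputed hashes localizes a modification to a single cell.

```python
# Row- and column-level hashes over a table: comparing stored and
# recomputed hashes narrows a modification down to one cell.
import hashlib

def h(values):
    return hashlib.sha256("|".join(values).encode()).hexdigest()

table = [["alice", "30"], ["bob", "25"]]
row_hashes = [h(r) for r in table]
col_hashes = [h(c) for c in zip(*table)]

table[1][0] = "eve"                            # tampering
bad_row = [i for i, r in enumerate(table) if h(r) != row_hashes[i]]
bad_col = [j for j, c in enumerate(zip(*table)) if h(c) != col_hashes[j]]
print(bad_row, bad_col)   # [1] [0] -> cell (1, 0) was changed
```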

Relational database forensics study

Database forensic research has been done on most of the relational databases. The relational databases considered by researchers are MySQL, Oracle, SQLite, PostgreSQL, DB2 and SQL Server.

MySQL

Khanuja et al. prepared a framework (Khanuja and Adane, 2012) to monitor any suspicious activity of MySQL 5.5. The default storage engine for MySQL 5.5 is InnoDB. For any database forensic analysis, it is important to know the internal details of that database, so they first presented the internal structure of MySQL. Forensic analysis is done in two stages. In the first stage, data is collected from various log files, including text and binary log files. Meaningful information is extracted from these log files using a script. For further analysis and decision making, the extracted information is filtered using inference rules. In the second stage, activities are reconstructed with the help of various caches like the query cache, key cache etc. This framework is useful to check whether a transaction is suspicious or not. This type of framework can be applied to other databases by studying their internal file structure.

Frühwirt et al. (2014) presented a novel approach for database forensic investigation using two internal data structures, namely transaction and replication. Transaction management supports the ACID (Atomicity, Consistency, Isolation, and Durability) properties; this is helpful to roll back the database state to a former version. Replication performs duplication of the data; these redundant copies are helpful as they contain identical data. The suggested prototype is implemented using the MySQL database.

Oracle

Various factors like logs, system change number, authentication attacks etc. that need consideration for Oracle forensic study are shown in Table 13 (Litchfield, 2008, 2007a, 2007b, 2007c, 2007d, 2007e, 2007f).

Shweta Tripathi et al. (Tripathi and Meshram, 2012) conducted a detailed study of the Oracle database to find the evidence when the database is tampered with. Whenever any database is compromised, the forensic examiner may get a clue of tampering by observing different locations. They focused on the various locations of the Oracle database like the redo log, data blocks etc., which are already discussed in Table 13.

SQLite3

SQLite3 is a widely used database on mobile phones. Liu et al. suggested a method (Liu et al., 2016; Nemetz et al., 2018) to recover deleted as well as partially overwritten data. Three tools, namely SQLite Expert, the Android 5.0 ADB tool, and the WinHex software, are used. The SQLite database file system consists of 4 page types, namely free page, overflow page, B+ tree page, or B tree page; the database file consists of multiple B trees. The experimental test is carried out on deleted messages. Logical and physical acquisitions are carried out to find the initial data. Whenever a record delete is invoked, the data area of the deleted record changes to a free block and the database file size remains unchanged; if all data from an entire page is deleted, the page will no longer exist and the size of the database will reduce. In the recovery method, the first step is to find the internal page, which contains child page references, and then to locate leaf pages to find records. The cell is a unit to store data; a leaf cell consists of three parts: cell header, payload header, and a payload data area. In the estimation process, variable integers are read from the payload header and the output is an array of table fields.

Sebastian Nemetz et al. presented a standardized corpus for SQLite database forensics (Nemetz et al., 2018). SQLite is a widely used database to store application data, and many forensic tools are available for it. The corpus consists of 77 databases, which are divided into 14 folders among 5 categories, and is specific to the SQLite database file format. The five categories of the corpus are Keywords & identifiers, Encodings, Database elements, Tree & page structures, and Deleted & overwritten contents. Each category contains different folders like deleted records, overwritten records, weird table names, deleted tables etc. They evaluated six different SQLite forensic tools against this corpus. The evaluation shows the strengths and weaknesses of these tools, as shown in Table 14. The final conclusion is that none of the tools can handle all the corner cases of the SQLite database.

Li et al. proposed recovery of data from Android based mobile phones (Li et al., 2014). The database used is SQLite. The recovery method is based on the deleted WAL (Write Ahead Logging) log from the ext4 file system. The WAL consists of a page header and frames; values from the page header and frames are checked to find the deleted data. There are some limitations of this method: no existing method is available for comparison, the recovery algorithm is not very efficient, and the data set used for experimentation is very small and simple.

SQL server

Chain of custody (CoC) is a document maintained in chronological order, which is helpful in solving criminal cases (Flores and Jhumka, 2017). There are two kinds of database forensics approaches, proactive and reactive. Proactive is a top-down approach for auditing inside activities like triggers. Reactive is a bottom-up approach to find the evidence in scattered pieces of the database, like the traditional forensic approach. The focus of Flores et al. is on the proactive approach. The chain of custody requirements for proactive database forensics are shown in Table 15.

The above requirements were tested on MS SQL Server 2014. Experimentation shows that timeline and causality are mutually related CoC requirements. They used a vector clock timestamp in the audit table and then proved that the vector clock timestamp is the same as the hardware clock timestamp.

Khanuja et al. developed a methodology for SQL Server forensic analysis. The methodology is based on Dempster–Shafer theory (Khanuja and Adane, 2013, 2012). SQL Server consists of various artifacts, resident and non-resident; for the forensic investigation, evidence may be available in these artifacts. The Dempster–Shafer combination rule is based on a probabilistic approach, which collects multiple pieces of evidence from artifacts and then finds out whether a given transaction is suspicious or not. Transactions including DDL/DML statements can be traced out.

PostgreSQL

Data hiding is a different concept, used to hide sensitive data so that it cannot be tampered with by an attacker. The data hiding concept does not remove the data; instead it only puts the data out of sight. Data hiding must follow specified requirements: the hidden data should be recoverable and the integrity of the data should be maintained. Fig. 7 shows the different techniques used by Pieterse et al. on PostgreSQL to hide the data (Pieterse and Olivier, 2012). PostgreSQL is a widely used object-relational database. Though here this concept is applied to PostgreSQL, it can be applied to other databases too. The same techniques may not be applicable to other databases, but a file system concept like slack space on the hard disk may be useful on other databases with additional effort. (Data hiding in MySQL using B+ trees is explained in section Indexing for data hiding & recovery.)

Table 13 Oracle forensics.

Redo logs (Litchfield, 2007a): Changes in database state are recorded in the redo logs (mostly through DML operations). The forensic examiner needs to check whether the archive is enabled or disabled, and may find evidence by scanning the binary-format redo log file. It is also useful to check whether entries have been changed by an attacker.

Locating dropped objects (Litchfield, 2007d): When an object is dropped, its row in OBJ$ is marked as deleted or changes from 0x2C to 0x3C, or after deletion the 5th bit of the byte is set. By scanning data files, dropped objects can be traced (an attacker may drop objects after malicious activity).

Attacks against authentication (Litchfield, 2007b): Evidence can be found in the TNS listener log file and the audit trail. During login, if an error occurs, such as a wrong username or password, the attacker learns of the existence of a specific user account. A password guessing (brute force) attack is also possible.

Live response (Litchfield, 2007e): Collecting evidence without shutting down the system. Very useful because volatile memory contents can be captured, by scanning running processes and the list of DLLs, and by scanning memory dumps/registry information.

Data theft evidence (Litchfield, 2007c): Evidence can be found even in the absence of auditing. Useful data can be found in the cost based optimizer: when a query is parsed, the plan, including cost related details, is stored in a separate table. Other sources are fixed views and automatic workload repositories.

Undo, flashback & recycle bin (Litchfield, 2007f): Undo stores information related to insert, delete and update operations (update: data before and after the update; delete: the old information; insert: file number, row and slot). Entries are stored for up to 5 days and later get overwritten. Backups useful for recovery are also useful for forensic purposes.

System change number (SCN) (Litchfield, 2008): As the database state changes, the SCN increments. Orablock and Oratime are also useful to find the last block changes and when they may have been changed.

þ MySQL using B tree is explained in section Indexing for Data Wagner et al. presented a forensic tool to reconstruct data hiding & Recovery). structure and database contents using the carving process (Wagner et al., 2015). This tool supports eight different relational databases. analysis of oracle, MySQL, SQLite & MS SQL server They extracted the contents deleted from hard disk and RAM using a table, indexing and materialized views. Cankaya et al. have performed an experiment on ten different Wagner et al. devised a tool (Wagner et al., 2016) to recover the forensic tools available for data recovery (Cankaya and Kupka, data that is deleted by users, marked as deleted but not actually 2016). This experimentation is carried out on the same database deleted and data which is de-allocated without knowledge of users. on windows operating system. From the list, few tools are available They also explained similarities and differences among different for mobile data recovery. All tools cannot be compared with each relational databases about how deletion can be handled. They used other as database format may vary depending on tool functionality, Database image content explorer (DICE) tool for recovery purpose. as well as operating system support. There are three tools namely Digital detective blade v1.13, Kernel database recovery and Foren- NoSQL database forensics study sics toolkit, which fall under the same group. As per execution time fi these tools are ranked as forensics toolkit rst, Digital detective Now a day NoSQL database is in demand due to its huge storage fi blade v1.13 s and nally Kernel database recovery. These tools are capability, called as big data. Relational databases are not capable of fi supporting other le formats also, but here angle consideration is handling big data so there is need of NoSQL databases. Relational from database perceptive. A detailed explanation of these tools is databases can handle only structured data while NoSQL databases presented in Table 16. can handle unstructured as well as semi-structured data. There are four main categories of NoSQL database namely Document store, Oracle, PostgreSQL, MySQL and SQL server database reconstruction Key Value, Column based and Graph based, as shown in Fig. 8. using carving approach Qi showed the performance evaluation of two widely used NoSQL databases MongoDB and Riak (Qi, 2014). MongoDB is a James Wagner et al. developed a tool (Wagner et al., 2017b) document based database in which data is stored in the form of called DBCarver to reconstruct the database from the database collection. Each collection consists of one or many documents. This image. Carving is an image of data residing on different media. In database uses a master-slave application model, called as page carving, the contents are restored without using any log or primaryesecondary nodes which are running as a mongod process. metadata information. The framework works in three stages This model is useful for replication purpose. Election algorithm is namely evidence acquisition, reconstruction, and analysis. For used to elect a primary. A node which receives maximum decision carving individual page three parameters are considered namely votes is marked as primary. MongoDB can be scaled horizontally the page header, the row directory, and the row data. Table 17 using sharding (Hauger and Olivier, 2018, 2017; Qi, 2014). shows the details. Riak is a key-value database. 
NoSQL database forensics study

Nowadays the NoSQL database is in demand due to its huge storage capability, referred to as big data. Relational databases are not capable of handling big data, hence the need for NoSQL databases. Relational databases can handle only structured data, while NoSQL databases can handle unstructured as well as semi-structured data. There are four main categories of NoSQL database, namely document store, key-value, column based and graph based, as shown in Fig. 8.

Qi presented a performance evaluation of two widely used NoSQL databases, MongoDB and Riak (Qi, 2014). MongoDB is a document-based database in which data is stored in the form of collections; each collection consists of one or more documents. This database uses a master-slave replication model, called primary-secondary nodes, running as mongod processes. An election algorithm is used to elect the primary: the node that receives the maximum number of votes is marked as primary. MongoDB can be scaled horizontally using sharding (Hauger and Olivier, 2018, 2017; Qi, 2014). Riak is a key-value database: data is stored in buckets, and the actual data is stored as key-value pairs inside the buckets. The replication strategy of this database is similar to DynamoDB, where a peer distribution model is used. Bitcask is the default backend storage engine for Riak.

Table 14 SQLite forensic tools comparison.

Tools compared (licence; version used; working):
- Undark: Open Source; v0.6. Used as a recovery tool for corrupted and deleted data.
- SQLite Parser: Open Source; v1.3. Works similarly to a carver; the focal point is the deleted-record area.
- SQLite Doctor: Proprietary; v1.3.0. Repairs and restores corrupted data.
- Phoenix Repair: Proprietary; v1.0.0.0. Recovers and exports corrupt databases; easily recovers deleted records.
- DB Recovery: Proprietary; v1.2. Repairs corrupted data and SQLite files; does not store deleted records.
- Forensic Browser: Proprietary; v3.1.6a. Displays present records and restores deleted records.

Keywords & identifiers (01: weird table names; 02: encapsulated type definitions; 03: SQL keywords & constraints):
- Undark: not supported. SQLite Parser: not supported. SQLite Doctor: good for all databases. Phoenix Repair: totally failed. DB Recovery: 28% data recovery. Forensic Browser: problems with encapsulated type definitions; fails with strange column names; column definitions handled incorrectly for the rest of the databases.

Encodings (04: UTF character sets):
- Undark: successful at printing ASCII characters only. SQLite Parser: not supported. SQLite Doctor: always converts the database into UTF-8 encoding. Phoenix Repair: fails for databases encoded in UTF-16 and non-Latin character formats. DB Recovery: regular database contents are recovered properly. Forensic Browser: fails for databases encoded in UTF-16be format.

Database elements (05: triggers, views & indices; 06: virtual & temporary tables):
- Undark: not supported. SQLite Parser: not supported. SQLite Doctor: can handle the R-tree module. Phoenix Repair: most of the regular contents are recovered properly. DB Recovery: fails to display index contents. Forensic Browser: regular contents are recovered properly.

Tree & page structures (07: fragmented contents; 08: reserved bytes per page; 09: pointer-map page from database file):
- Undark: pointer-map page from the database file is correctly handled. SQLite Parser: able to show hidden messages at the end of the page; pointer-map page correctly handled. SQLite Doctor: most of the traces are destroyed by this tool; pointer-map page correctly handled. Phoenix Repair: out of 4 scenarios, the record was displayed only once; pointer-map page correctly handled. DB Recovery: fails at handling the pointer-map page. Forensic Browser: pointer-map page from the database file is correctly handled.

Deleted & overwritten contents (0A: deleted tables; 0B: overwritten tables; 0C: deleted records; 0D: overwritten records; 0E: deleted overflow pages):
- Undark: without deleting any record, stores every record; after deletion, stores half of the records. SQLite Parser: able to recover text from a dropped table with deleted records, but does not recover non-deleted records. SQLite Doctor: able to display logically existing records. Phoenix Repair: not able to restore a single record. DB Recovery: able to display logically existing records. Forensic Browser: sometimes recovers every record and sometimes only a few records.
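The deleted-content scenarios above (0A-0E) all hinge on SQLite leaving removed rows on freelist pages. A minimal sketch of that mechanism is shown below; it walks the freelist chain defined by the public SQLite file format (100-byte header, big-endian fields) and dumps printable fragments, which is essentially what carvers like Undark automate. It assumes a well-formed, non-vacuumed database file.

```python
# Minimal sketch: walk the SQLite freelist and dump printable fragments,
# the raw material that recovery tools such as Undark carve.
import re
import struct

def freelist_fragments(path):
    with open(path, "rb") as f:
        data = f.read()
    if data[:16] != b"SQLite format 3\x00":
        raise ValueError("not an SQLite 3 database")
    page_size = struct.unpack_from(">H", data, 16)[0]
    page_size = 65536 if page_size == 1 else page_size
    trunk = struct.unpack_from(">I", data, 32)[0]  # first freelist trunk page
    while trunk:
        base = (trunk - 1) * page_size
        next_trunk, nleaves = struct.unpack_from(">II", data, base)
        nleaves = min(nleaves, (page_size - 8) // 4)  # guard against garbage
        for i in range(nleaves):
            leaf = struct.unpack_from(">I", data, base + 8 + 4 * i)[0]
            if not 0 < leaf <= len(data) // page_size:
                continue
            page = data[(leaf - 1) * page_size:leaf * page_size]
            # Crude carving: report runs of printable bytes as candidates.
            for m in re.finditer(rb"[ -~]{6,}", page):
                yield leaf, m.group().decode("ascii")
        trunk = next_trunk

# for page_no, fragment in freelist_fragments("evidence.db"):
#     print(page_no, fragment)
```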

Table 15 Functionality of CoC requirements.

Chain of Custody Requirement: Functionality

Role Segregation: Segregation does not allow the administrator to hold forensic permissions at the same time.
Provenance: Records the answers to What, When, Why, Who, Which and Where.
Evidence Sources: Records the details of the evidence sources.
Evidential Events: Logical order of the DML events generated in the audit table, recorded with the timeline.
Event Timestamps: If an audit record e is generated, the vector clock component is incremented in the audit table; if an audit record e is received, the vector clock component is set to the maximum from the audit table.
Causal Audit Record: Records the occurrence of DML events.
Event Timeline: Records the occurrence of DML events from the corresponding audit records, with a timestamp.
Event Sequentiality Property: For DML events ea (sending) and eb (receiving), the timestamp of ea must be less than the timestamp of eb.
Event Transitive Property: Similarly for three events ea, eb and ec: if ea happens before eb, and eb before ec, then the timestamp of ea is less than the timestamp of ec.
Event Concurrency Property: Concurrent events are not restricted by timestamps.
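The Event Timestamps row is the standard vector clock construction, and the sequentiality, transitive and concurrency properties follow from its partial order. A minimal sketch under that reading is given below; the function and variable names are ours, not Flores and Jhumka's.

```python
# Minimal sketch of the vector-clock rules behind Table 15: generating
# an audit record increments the generator's own component; receiving
# one merges component-wise maxima before incrementing. Event ea then
# "happens before" eb iff ea's vector is component-wise <= eb's and the
# vectors differ; incomparable vectors are the concurrent events.
def generated(clock, me):
    clock = dict(clock)
    clock[me] = clock.get(me, 0) + 1          # audit record is generated
    return clock

def received(clock, other, me):
    merged = {k: max(clock.get(k, 0), other.get(k, 0))
              for k in clock.keys() | other.keys()}  # set to maximum
    return generated(merged, me)

def happens_before(a, b):
    keys = a.keys() | b.keys()
    return all(a.get(k, 0) <= b.get(k, 0) for k in keys) and a != b

ea = generated({}, "node1")       # sending event ea
eb = received({}, ea, "node2")    # corresponding receiving event eb
assert happens_before(ea, eb)     # Event Sequentiality Property
```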

The performance of these two databases was evaluated on the Elastic Compute Cloud of the Amazon cloud computing platform, as shown in Table 18. MongoDB performs well for smaller datasets due to its in-memory processing; as the dataset grows, the performance of MongoDB degrades, while Riak also performs better with larger datasets.

Forensic attribution in NoSQL databases

NoSQL databases handle huge data that needs to be stored securely. Based on the database engine ranking survey ("DB-Engines Ranking - popularity ranking of database management systems," n.d.), Hauger et al. selected top databases for their forensic attribution study, namely MongoDB (document based), Redis (key-value), Cassandra (column based) and Neo4j (graph based).

Forensic attribution is a term used to specify an action of the investigation carried out by a person or tool using the scientific method. Forensic attribution was carried out on these four NoSQL databases (Hauger and Olivier, 2018). Table 19 shows the security features, as well as the access control and logging features that are set by default for these databases (Hauger and Olivier, 2017).

Access control: through this, the forensic examiner can trace authentication and authorization activities, according to the policies applicable to the different databases.

Logging: there are three different types of log files, namely audit logs, system logs and storage logs. The audit log gives information about the various operations performed in the database. System logs contain error messages and other information. The storage log can be used for database reconstruction: the write operations of the database are maintained here and can be used later, after a database failure.

MongoDB

As of now, most database forensics research addresses relational databases; very little attention has been given to NoSQL databases from a forensics point of view. Jongseong Yoon et al. presented work on the widely used NoSQL database MongoDB to recover deleted data (Yoon and Lee, 2018). As per the database engine ranking ("DB-Engines Ranking - popularity ranking of database management systems," n.d.), MongoDB is ranked 5th among all databases and 1st in the NoSQL category. This work covers two storage engines, MMAPv1 and WiredTiger. The MMAPv1 storage engine maintains two data files, namely the namespace file and the data file: the namespace file stores the metadata of the collections created, and the data file stores the actual data. The WiredTiger storage engine stores data in pages, which are maintained by B+-trees.

Fig. 7. Data hiding techniques in PostgreSQL.

Table 16 Forensic tools for database extraction.

Oxygen Forensic Detective: Freeware/Proprietary. Useful for extraction of evidence from mobile phones. Supported databases: .SQLite, .sqlite3, SQLite DB, .db & .db3. Supported OS: Android, iOS, Windows, BlackBerry & Symbian.
Xplico: Open Source. Network forensic tool used to capture packets over the network. Supported databases: MySQL & SQLite. Supported OS: Linux.
Digital Detective Blade: Proprietary. Data carving and recovery tool. Supported databases: SQLite. Supported OS: Windows.
Kernel Data Recovery: Proprietary. Data recovery tool to repair damaged and inaccessible data. Supported databases: MySQL, MS Access, DBF, SharePoint Server. Supported OS: Windows.
SysTools SQL Log Analyzer: Proprietary. Data recovery tool; analyzes .ldf and .mdf files. Supported databases: SQL Server 2017, 2016, 2014, 2012, 2008 and SQL Server 2005 LDF files. Supported OS: Windows.
WinHex: Proprietary (free for 45 days). Hex editor which displays file contents in hexadecimal format. Supported databases: MS Access. Supported OS: Windows.
NetCat: Open Source. Monitors network traffic and extracts contents from a suspicious computer. Supported databases: any database that can be accessed from the remote network. Supported OS: Windows, Linux, Mac.
Windows Forensic Toolchest: Proprietary. Extracts volatile data such as error logs and recent activities; also provides an MD5 checksum for logged activities to verify integrity. Supported databases: SQL Server. Supported OS: Windows.
SQLCMD: Proprietary. A command-line utility from Microsoft for executing SQL commands and scripts. Supported databases: SQL Server. Supported OS: Windows.
Forensic Toolkit (FTK): Proprietary (free for a few days). One of the popular tools, used to review forensic memory dumps or images. Supported databases: Oracle, MS SQL. Supported OS: Windows.

The changes in these structures when data is deleted, with respect to the two storage engines, are shown in Table 20, together with the suggested recovery methods.

Jongseong Yoon et al. also developed a forensic framework for the MongoDB database (Yoon et al., 2016). This framework consists of five phases, namely phase 1 - preparation; phase 2 - logical evidence acquisition & preservation; phase 3 - distributed evidence identification, acquisition & preservation; phase 4 - examination & analysis; and phase 5 - reporting & presentation. They created a crime scenario and showed the steps involved in each of the above phases. The crime case is solved using the three deployment modes of MongoDB shown in Fig. 9.

GridFS is a special functionality provided by MongoDB for easy storage and retrieval of large files. When a file is stored using this mechanism, two collections are created, fs.files and fs.chunks. The fs.files collection holds metadata such as the file name, chunk size, file size, upload time, etc., while the fs.chunks collection holds the actual data stored in the form of chunks. Data recovery from MongoDB files is also demonstrated using this file format.
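Because fs.files and fs.chunks are ordinary collections, a stored file can be reassembled directly from them. The following is a minimal sketch using the pymongo driver; the connection string, the database name "evidence" and the filename are illustrative assumptions.

```python
# Minimal sketch: reassemble a GridFS-stored file straight from the
# fs.files / fs.chunks collections described above.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["evidence"]

meta = db["fs.files"].find_one({"filename": "dump.bin"})
if meta is None:
    raise SystemExit("no such file in fs.files")

# fs.chunks documents carry (files_id, n, data); sort by chunk index n.
chunks = db["fs.chunks"].find({"files_id": meta["_id"]}).sort("n", 1)
payload = b"".join(chunk["data"] for chunk in chunks)

assert len(payload) == meta["length"]  # cross-check against fs.files metadata
with open("recovered_dump.bin", "wb") as out:
    out.write(payload)
```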

Redis

Redis is a key-value NoSQL database. It is also called an in-memory database, as most of the contents of the database are stored in memory; some contents are also stored in the file system. Ming Xu et al. developed a tool (Xu et al., 2014) to extract the contents of deleted elements, in terms of data structures, from the database. Extracting contents from memory is tedious; instead, data can be extracted from the Append Only File (AOF) and the Redis Database File (RDB), so it is important to study the internal storage structures of these two files. Deleted data can be found in the RDB file, because an image of the in-memory data is stored in this file. The AOF file records the execution of write operations; these operations are stored as a short sequence of statements, so the file does not grow excessively yet contains sufficient detail to restore the database. The experiment was carried out on high-level and low-level data structures to check the performance of the developed tool. The tool successfully extracted deleted contents from high-level data structures such as string, list, set, sorted set and hash; Table 21 shows the percentage of data successfully extracted from the RDB file. The tool fails to extract data stored using LZF compression. LZF is a light data compressor used to compress strings at dump time; it is applied when the length of an element in the structure goes beyond a pre-specified threshold.
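The property Xu et al. exploit, that the AOF suffices to restore state, can be illustrated with a small replayer. This is a hedged sketch assuming a classic single-file AOF (pre-Redis-7 layout), which is a plain sequence of RESP arrays such as *3 $3 SET $3 key $5 value; only SET and DEL are handled, whereas a real tool would cover the full command set.

```python
# Minimal sketch: replay write commands from a classic Redis AOF file
# to rebuild the final key-value state.
def replay_aof(path):
    state = {}
    with open(path, "rb") as f:
        while True:
            line = f.readline()
            if not line:
                break
            if not line.startswith(b"*"):   # skip anything that is not an array
                continue
            nargs = int(line[1:].strip())
            args = []
            for _ in range(nargs):
                header = f.readline()       # bulk-string header, e.g. b"$3\r\n"
                length = int(header[1:].strip())
                args.append(f.read(length))
                f.read(2)                   # consume the trailing \r\n
            command = args[0].upper() if args else b""
            if command == b"SET":
                state[args[1]] = args[2]
            elif command == b"DEL":
                for key in args[1:]:
                    state.pop(key, None)
    return state

# print(replay_aof("appendonly.aof"))
```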

Table 17 Parameters of page.

Page Header: Contains general page information, with four types of metadata: a general page identifier, a unique page identifier, an object identifier, and a record count.
Row Directory: Stores addresses for referencing records within a page; it tracks the insertion of new rows and the deletion of existing rows.
Row Data: Stores the actual raw data with metadata, e.g. Row 1: 45, Mumbai, 20.2; Row 2: 34, Pune, 12.13.

Fig. 8. NoSQL types.

Table 18 Performance comparison of MongoDB & Riak.

MongoDB (document store): Read operations - good for smaller datasets. Balanced read and write operations - good for smaller datasets.
Riak (key-value): Read operations - performance is better compared to MongoDB. Balanced read and write operations - performance is better compared to MongoDB.

Table 19 NoSQL security and access control features.

Database: Authentication / Authorization / Logging (supported); Access Control / Logging (enabled by default)

MongoDB: Yes / Yes / Yes; No / No
Cassandra: Yes / Yes / Yes; No / Yes
Redis: Yes / No / Yes; No / Yes
Neo4j: Yes / No / Yes; Yes / No

The following points are considered as far as the forensic examination is concerned.

Table 20 Storage engine data recovery.

MMAPv1:
- Delete document: The first 4 bytes of the record in the data file change to 0xEEEEEEEE. Suggested method: for recovery, the namespace file is searched with the database name and the deleted-record array; the documents are read sequentially, and the BSON documents are recovered via JSON. If the deleted-record information is not available in the namespace file, a known signature is used to find records in the data file.
- Drop collection: The metadata of the related collection is deleted from the namespace file, but the actual data remains in the data file. Suggested method: since only the metadata is deleted from the namespace file, recovery uses the extent information from the namespace file and the extent signature from the data file.
- Drop database: The namespace file and the data file related to the database are deleted from the file system.

WiredTiger:
- Delete document: The page containing the deleted document is removed from the B-tree, and a new page is created and linked into the B-tree. Suggested method: recovery is difficult because no traces are left in the file system. The root offset is checked from WiredTiger.turtle and WiredTiger.wt, and pages not included in the B-tree are searched, as they will contain the deleted data; the cells of such a page are traced and compared with normal data, and if a cell differs, it is marked as deleted and can be recovered.
- Drop collection / Drop database: The data file is deleted from the file system; the same out-of-tree page search applies.

For the recovery implementation, C# is used, with Windows Presentation Foundation (WPF) for the GUI implementation.
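Independent of Yoon and Lee's C#/WPF tool, the 0xEEEEEEEE cue in Table 20 can be illustrated in a few lines of Python. This is a hedged sketch: the 16-byte gap between the marker and the BSON body is an assumed, simplified record-header size, not the documented MMAPv1 layout, and a real tool would also consult the namespace file and record signatures.

```python
# Hedged sketch: scan an MMAPv1 data file for the 0xEEEEEEEE marker
# written over the first 4 bytes of a deleted record, then try to read
# the BSON document that follows.
import struct

MARKER = b"\xee\xee\xee\xee"

def deleted_candidates(path, header_size=16):
    # header_size is an assumed gap between the marker and the BSON body.
    with open(path, "rb") as f:
        data = f.read()
    pos = data.find(MARKER)
    while pos != -1:
        body = pos + header_size
        if body + 4 <= len(data):
            doclen, = struct.unpack_from("<i", data, body)  # BSON length prefix
            if 5 <= doclen <= len(data) - body:
                yield body, data[body:body + doclen]        # raw BSON bytes
        pos = data.find(MARKER, pos + 4)

# Candidates can then be decoded with the bson package for inspection.
```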

Fig. 9. MongoDB Deployment modes.

Table 21 Contents retrieved from Redis in Percentage (%).

High-level data structures - String: 100; List: 90; Set: 100; Sorted Set: 90; Hash: 95.
Low-level data structures - String: 100; Ziplist: 100; Linkedlist: 60; Intset: 100; Skiplist: 60; Hashtable: 93.33.

Conclusion

When we think of any application, databases come into the picture. At present, big data has captured the world, so we are moving from structured to unstructured databases. When we use a database for storing data, the security of that data is imperative. Digital forensics is an established term for the investigation of digital devices, to catch what has gone wrong while solving criminal cases; database forensics is an upcoming field of research within it. As per the interpretation of the graph in Fig. 1, database forensic research hardly existed before the year 2007, and the pace of research has increased from 2009 onwards. A literature review of the last 10 years of research papers is presented here. Given the current state of research in database forensics, NoSQL databases need consideration, as most of the work on relational databases is already done. Researchers have performed forensic analysis using log files, metadata, data files, etc. File carving is a renowned method in digital forensics, but analysis of the database using the carving method is a new approach. Database forensics is a broad area which covers research challenges like data recovery, finding data tampering artifacts, and different ways to avoid data tampering.

Acknowledgment

The authors wish to acknowledge the Information Security Education and Awareness Project, Department of Electronics and Information Technology, Ministry of Communications and Information Technology, Government of India, which has made it possible to undertake this research. The authors would also like to thank the anonymous reviewers for their valuable suggestions.

References

Al-Dhaqm, A., Razak, S., Othman, S.H., Ngadi, A., Ahmed, M.N., Mohammed, A.A., 2017a. Development and validation of a database forensic metamodel (DBFM). PLoS One 12, e0170793.
Adedayo, O.M., Olivier, M., 2014. Schema reconstruction in database forensics. In: IFIP International Conference on Digital Forensics, pp. 101-116.
Adedayo, O.M., Olivier, M.S., 2015. Ideal log setting for database forensics reconstruction. Digit. Invest. 12, 27-40.
Al-Dhaqm, A.M.R., Othman, S.H., Razak, S.A., Ngadi, A., 2014. Towards adapting metamodelling technique for database forensics investigation domain. In: Biometrics and Security Technologies (ISBAST), 2014 International Symposium on, pp. 322-327.
Al-Dhaqm, A., Razak, S.A., Othman, S.H., Ngadi, A., Ali, A., 2016. A generic database forensic investigation process model. J. Teknol. 78, 45-57.
Al-Dhaqm, A., Razak, S., Othman, S.H., Choo, K.-K.R., Glisson, W.B., Ali, A., Abrar, M., 2017b. CDBFIP: common database forensic investigation processes for Internet of Things. IEEE Access 5, 24401-24416.
Azemovic, J., Music, D., 2009. Efficient model for detection data and data scheme tempering with purpose of valid forensic analysis. In: 2009 International Conference on Computer Engineering and Applications (ICCEA 2009).
Beyers, H., Olivier, M., Hancke, G., 2011. Assembling metadata for database forensics. In: IFIP International Conference on Digital Forensics, pp. 89-99.
Beyers, H.Q., Hancke, G.P., Olivier, M.S., 2014. Database application schema forensics. S. Afr. Comput. J. 55, 1-11.
Bhagwani, M.H., Dharaskar, R.V., Thakare, V.M., 2015. Comparative analysis of database forensic algorithms. In: Proceedings of the National Conference on Knowledge, Innovation in Technology and Engineering (NCKITE 2015), pp. 33-36.
Cankaya, E.C., Kupka, B., 2016. A survey of digital forensics tools for database extraction. In: Future Technologies Conference (FTC), pp. 1014-1019.
Choi, J., Choi, K., Lee, S., 2009. Evidence investigation methodologies for detecting financial fraud based on forensic accounting. In: 2009 2nd International Conference on Computer Science and its Applications (CSA 2009).
DB-Engines Homepage. https://db-engines.com/en/ranking.
Fasan, O.M., Olivier, M., 2012a. Reconstruction in database forensics. In: IFIP International Conference on Digital Forensics, pp. 273-287.
Fasan, O.M., Olivier, M.S., 2012b. On dimensions of reconstruction in database forensics. In: WDFIA, pp. 97-106.
Fasan, O.M., Olivier, M.S., 2012c. Correctness proof for database reconstruction algorithm. Digit. Invest. 9, 138-150.
Flores, D.A., Jhumka, A., 2017. Implementing chain of custody requirements in database audit records for forensic purposes. In: Trustcom/BigDataSE/ICESS, 2017 IEEE, pp. 675-682.
Frühwirt, P., Huber, M., Mulazzani, M., Weippl, E.R., 2010. InnoDB database forensics. In: 2010 24th IEEE International Conference on Advanced Information Networking and Applications, pp. 1028-1036.
Frühwirt, P., Kieseberg, P., Schrittwieser, S., Huber, M., Weippl, E., 2013. InnoDB database forensics: enhanced reconstruction of data manipulation queries from redo logs. Inf. Secur. Tech. Rep. 17, 227-238.
Frühwirt, P., Kieseberg, P., Krombholz, K., Weippl, E., 2014. Towards a forensic-aware database solution: using a secured database replication protocol and transaction management for digital investigations. Digit. Invest. 11, 336-348.
Frühwirt, P., Kieseberg, P., Weippl, E., 2015. Using internal MySQL/InnoDB B-tree index navigation for data hiding. In: IFIP International Conference on Digital Forensics, pp. 179-194.
Guimaraes, M.A.M., Austin, R., Said, H., 2010. Database forensics. In: 2010 Information Security Curriculum Development Conference, pp. 62-65.
Hauger, W.K., Olivier, M.S., 2014. The role of triggers in database forensics. In: Information Security for South Africa (ISSA), 2014, pp. 1-7.
Hauger, W.K., Olivier, M.S., 2015a. The state of database forensic research. In: 2015 Information Security for South Africa (ISSA), pp. 1-8.
Hauger, W.K., Olivier, M.S., 2015b. The impact of triggers on forensic acquisition and analysis of databases.
Hauger, W., Olivier, M., 2015c. Determining trigger involvement during forensic attribution in databases. In: IFIP International Conference on Digital Forensics, pp. 163-177.
Hauger, W.K., Olivier, M.S., 2017. Forensic attribution in NoSQL databases. In: Information Security for South Africa (ISSA), 2017, pp. 74-82.
Hauger, W.K., Olivier, M.S., 2018. NoSQL databases: forensic attribution implications. SAIEE Africa Res. J. 109, 119-132.
HIPAA Privacy Rule. https://www.hhs.gov/hipaa/for-professionals/privacy/laws-regulations/index.html.
Khanji, S.I.R., Khattak, A.M., Hacid, H., 2015. Database auditing and forensics: exploration and evaluation. In: Computer Systems and Applications (AICCSA), 2015 IEEE/ACS 12th International Conference of, pp. 1-6.
Khanuja, H.K., Adane, D.S., 2011. Database security threats and challenges in database forensic: a survey. In: Proceedings of 2011 International Conference on Advancements in Information Technology (AIT 2011). http://www.ipcsit.com/vol20/33-ICAIT2011-A4072.
Khanuja, H.K., Adane, D.S., 2012. A framework for database forensic analysis. Comput. Sci. Eng. 2, 27.
Khanuja, H.K., Adane, D.S., 2013. Forensic analysis of databases by combining multiple evidences. Int. J. Comput. Technol. 7, 654-663.
Khanuja, H., Suratkar, S.S., 2014. Role of metadata in forensic analysis of database attacks. In: Advance Computing Conference (IACC), 2014 IEEE International, pp. 457-462.
Kieseberg, P., Schrittwieser, S., Mulazzani, M., Huber, M., Weippl, E., 2011. Trees cannot lie: using data structures for forensics purposes. In: Intelligence and Security Informatics Conference (EISIC), 2011 European, pp. 282-285.
Kieseberg, P., Schrittwieser, S., Morgan, L., Mulazzani, M., Huber, M., Weippl, E., 2013. Using the structure of B+-trees for enhancing logging mechanisms of databases. Int. J. Web Inf. Syst. 9, 53-68.
Kieseberg, P., Schrittwieser, S., Weippl, E., 2018. Structural limitations of B+-tree forensics. In: Proceedings of the Central European Cybersecurity Conference 2018, p. 9.
Kim, J., Park, A., Lee, S., 2016. Recovery method of deleted records and tables from ESE database. Digit. Invest. 18, S118-S124.
Li, Q., Hu, X., Wu, H., 2014. Database management strategy and recovery methods of Android. In: Software Engineering and Service Science (ICSESS), 2014 5th IEEE International Conference on, pp. 727-730.
Litchfield, D., 2007a. Oracle Forensics Part 1: Dissecting the Redo Logs. NGSSoftware Insight Security Research (NISR), Next Generation Security Software Ltd.
Litchfield, D., 2007b. Oracle Forensics Part 3: Isolating Evidence of Attacks Against the Authentication Mechanism. NGSSoftware Insight Security Research (NISR), Next Generation Security Software Ltd.
Litchfield, D., 2007c. Oracle Forensics Part 5: Finding Evidence of Data Theft in the Absence of Auditing. NGSSoftware Insight Security Research (NISR), Next Generation Security Software Ltd.
Litchfield, D., 2007d. Oracle Forensics Part 2: Locating Dropped Objects. NGSSoftware Insight Security Research (NISR), Next Generation Security Software Ltd.
Litchfield, D., 2007e. Oracle Forensics Part 4: Live Response. NGSSoftware Insight Security Research (NISR), Next Generation Security Software Ltd.
Litchfield, D., 2007f. Oracle Forensics Part 6: Examining Undo Segments, Flashback and the Oracle Recycle Bin. NGSSoftware Insight Security Research (NISR), Next Generation Security Software Ltd.
Litchfield, D., 2008. Oracle Forensics Part 7: Using the Oracle System Change Number in Forensic Investigations. NGSSoftware Insight Security Research (NISR), Next Generation Security Software Ltd.
Liu, X., Fu, X., Sun, G., 2016. Recovery of deleted record for SQLite3 database. In: Intelligent Human-Machine Systems and Cybernetics (IHMSC), 2016 8th International Conference on, pp. 183-187.
Nemetz, S., Schmitt, S., Freiling, F., 2018. A standardized corpus for SQLite database forensics. Digit. Invest. 24, S121-S130.
Olivier, M.S., 2009. On metadata context in database forensics. Digit. Invest. 5, 115-123.
Pavlou, K.E., 2012. Database Forensics in the Service of Information Accountability.
Pavlou, K.E., Snodgrass, R.T., 2008. Forensic analysis of database tampering. ACM Trans. Database Syst. 33, 30.
Pavlou, K.E., Snodgrass, R.T., 2010. The tiled bitmap forensic analysis algorithm. IEEE Trans. Knowl. Data Eng. 22, 590-601.
Pavlou, K.E., Snodgrass, R.T., 2013. Generalizing database forensics. ACM Trans. Database Syst. 38, 12.
Pieterse, H., Olivier, M., 2012. Data hiding techniques for database environments. In: Advances in Digital Forensics VIII. Springer, pp. 289-301.
Qi, M., 2014. Digital forensics and NoSQL databases. In: Fuzzy Systems and Knowledge Discovery (FSKD), 2014 11th International Conference on, pp. 734-739.
Razak, S.A., Othman, S.H., Aldolah, A.A., Ngadi, M.A., 2016. Conceptual investigation process model for managing database forensic investigation knowledge. Res. J. Appl. Sci. Eng. Technol. 12, 386-394.
Sablatura, J., Zhou, B., 2017. Forensic database reconstruction. In: Big Data (Big Data), 2017 IEEE International Conference on, pp. 3700-3704.
Sarbanes Oxley Act. http://www.soxlaw.com/.
Snodgrass, R.T., Yao, S.S., Collberg, C., 2004. Tamper detection in audit logs. In: Proceedings of the Thirtieth International Conference on Very Large Data Bases, vol. 30, pp. 504-515.
Son, N., Lee, K., Jeon, S., Chung, H., Lee, S., Lee, C., 2011. The method of database server detection and investigation in the enterprise environment. In: FTRA International Conference on Secure and Trust Computing, Data Management, and Application, pp. 164-171.
Tripathi, S., Meshram, B.B., 2012. Digital evidence for database tamper detection. J. Inf. Secur. 3, 113.
Wagner, J., Rasin, A., Grier, J., 2015. Database forensic analysis through internal structure carving. Digit. Invest. 14, S106-S115.
Wagner, J., Rasin, A., Grier, J., 2016. Database image content explorer: carving data that does not officially exist. Digit. Invest. 18, S97-S107.
Wagner, J., Rasin, A., Glavic, B., Heart, K., Furst, J., Bressan, L., Grier, J., 2017a. Carving database storage to detect and trace security breaches. Digit. Invest. 22, S127-S136.
Wagner, J., Rasin, A., Malik, T., Heart, K., Jehle, H., Grier, J., 2017b. Database forensic analysis with DBCarver. In: CIDR 2017, 8th Biennial Conference on Innovative Data Systems Research.
Xu, M., Xu, X., Xu, J., Ren, Y., Zhang, H., Zheng, N., 2014. A forensic analysis method for Redis database based on RDB and AOF file. J. Comput. 9, 2538-2544.
Yoon, J., Lee, S., 2018. A method and tool to recover data deleted from a MongoDB. Digit. Invest. 24, 106-120.
Yoon, J., Jeong, D., Kang, C., Lee, S., 2016. Forensic investigation framework for the document store NoSQL DBMS: MongoDB as a case study. Digit. Invest. 17, 53-65.
In: Proceedings of the Central European Cybersecurity Conference Son, N., Lee, K., Jeon, S., Chung, H., Lee, S., Lee, C., 2011. The method of database 2018, p. 9. server detection and investigation in the enterprise environment. In: FTRA Kim, J., Park, A., Lee, S., 2016. Recovery method of deleted records and tables from International Conference on Secure and Trust Computing, Data Management, ESE database. Digit. Invest. 18, S118eS124. and Application, pp. 164e171. Li, Q., Hu, X., Wu, H., 2014. Database management strategy and recovery methods of Tripathi, S., Meshram, B.B., 2012. Digital evidence for database tamper detection. Android. In: Software Engineering and Service Science (ICSESS), 2014 5th IEEE J. Inf. Secur. 3, 113. International Conference on, pp. 727e730. Wagner, J., Rasin, A., Grier, J., 2015. Database forensic analysis through internal Litchfield, D., 2007a. Oracle Forensics Part 1: Dissecting the Redo Logs. NGSSoftware structure carving. Digit. Invest. 14, S106eS115. Insight Secur. Res. (NISR), Next Gener. Secur. Softw. Ltd, Sutt. Wagner, J., Rasin, A., Grier, J., 2016. Database image content explorer: carving data Litchfield, D., 2007b. Oracle forensics: Part 3 isolating evidence of attacks against that does not officially exist. Digit. Invest. 18, S97eS107. the authentication mechanism. NGSSoftware Insight Secur. Res. Wagner, J., Rasin, A., Glavic, B., Heart, K., Furst, J., Bressan, L., Grier, J., 2017a. Carving Litchfield, D., 2007c. Oracle Forensics Part 5: Finding Evidence of Data Theft in the database storage to detect and trace security breaches. Digit. Invest. 22, Absence of Auditing. NGSSoftware Insight Secur. Res. (NISR), Next Gener. Secur. S127eS136. Softw. Ltd, Sutt. Wagner, J., Rasin, A., Malik, T., Heart, K., Jehle, H., Grier, J., 2017b. Database forensic Litchfield, D., 2007d. Oracle forensics part 2: locating dropped objects. NGSSoftware analysis with DBCarver. In: CIDR 2017, 8th Biennial Conference on Innovative Insight Secur. Res. Publ. Next Gener. Secur. Softw. Data Systems Research. Litchfield, D., 2007e. Oracle Forensics Part 4: Live Response. NGSSoftware Insight Xu, M., Xu, X., Xu, J., Ren, Y., Zhang, H., Zheng, N., 2014. A forensic analysis method Secur. Res. (NISR), Next Gener. Secur. Softw. Ltd., Sutt. for Redis database based on RDB and AOF file. J. Comput. 9, 2538e2544. Litchfield, D., 2007f. Oracle forensics part 6: examining undo segments, flashback Yoon, J., Lee, S., 2018. A method and tool to recover data deleted from a MongoDB. and the oracle recycle bin. NGSSoftware Insight Secur. Res. Publ. Next Gener. Digit. Invest. 24, 106e120. Secur. Softw. Yoon, J., Jeong, D., Kang, C., Lee, S., 2016. Forensic investigation framework for the Litchfield, D., 2008. Oracle forensics part 7: using the Oracle system change number document store NoSQL DBMS: MongoDB as a case study. Digit. Invest. 17, in forensic investigations. Insight Secur. Res. Publ. NGSSoftware. 53e65.