Journal of , Security and Law

Volume 13 Article 6

October 2018

A New Framework for Securing, Extracting and Analyzing Big Forensic Data

Hitesh Sachdev Georgia Southern University hayden wimmer Georgia Southern University, [email protected]

Lei Chen Georgia Southern University

Carl Rebman University of San Diego, [email protected]

Follow this and additional works at: https://commons.erau.edu/jdfsl

Part of the Computer and Systems Architecture Commons, Computer Law Commons, Data Storage Systems Commons, Information Security Commons, Management Information Systems Commons, Systems Architecture Commons, and the Technology and Innovation Commons

Recommended Citation Sachdev, Hitesh; wimmer, hayden; Chen, Lei; and Rebman, Carl (2018) "A New Framework for Securing, Extracting and Analyzing Big Forensic Data," Journal of Digital Forensics, Security and Law: Vol. 13 , Article 6. DOI: https://doi.org/10.15394/jdfsl.2018.1419 Available at: https://commons.erau.edu/jdfsl/vol13/iss2/6

This Article is brought to you for free and open access by the Journals at Scholarly Commons. It has been accepted for inclusion in Journal of Digital Forensics, Security and Law by an authorized administrator of (c)ADFSL Scholarly Commons. For more information, please contact [email protected]. A NEW FRAMEWORK FOR SECURING, EXTRACTING AND ANALYZING BIG FORENSIC DATA

Hitesh Sachdev1, Hayden Wimmer1, Lei Chen1, Carl Rebman2 1Georgia Southern University Dept. of Information Technology P.O. Box 8150 Statesboro, GA 30415, USA [email protected], [email protected]*, [email protected]

2University of San Diego School of Business Administration 5998 Alcala Park Coronado 212 San Diego, CA 92110 [email protected]

ABSTRACT

Finding new methods to investigate criminal activities, behaviors, and responsibilities has always been a challenge for forensic research. Advances in big data, technology, and increased capabilities of smartphones has contributed to the demand for modern techniques of examination. Smartphones are ubiquitous, transformative, and have become a goldmine for forensics research. Given the right tools and research methods investigating agencies can help crack almost any illegal activity using smartphones. This paper focuses on conducting forensic analysis in exposing a terrorist or criminal network and introduces a new Big Forensic Data Framework model where different technologies of Hadoop and EnCase software are combined in an effort to promote more effective and efficient processing of the massive Big Forensic Data. The research propositions this model postulates could lead the investigating agencies to the head of the terrorist networks. Results indicate the Big Forensic Data Framework model is capable of processing Big Forensic Data. Keywords: Big forensic data, Forensic analysis, Big Data, EnCase, Hadoop, Map and Reduce, NodeXL, Social Network analysis example, there is no shortage of multi- 1. INTRODUCTION billion dollar scams associated with money laundering, bribery, corruption, In recent years, advancement in mobile embezzling, as well as other internet based technologies made them an attractive scams1 While some of the fraud and attacks medium for illegal activities. The number were identified and prevented many others of fraud, criminal use and identity theft by were successfully executed. Consequently, mobile and smartphones has dramatically traditional analytical techniques could be increased (Pascual, Marchini, & Miller, improved by applying big data techniques 2018). This demonstrates the need for for digital forensic analysis. mobile forensic analysis (Curran, Robinson, Peacocke, & Cassidy, 2012). Big Forensic Data Analysis can help Smartphones are no longer used to simply criminal agencies to process information at connect us with our friends, family and a higher rate. This paper proposes a colleagues, but has transformed into a framework for the analysis of Big Forensic repository which stores all our activities, Data by combining digital forensics important dates, numbers, and experiences techniques with the Hadoop and map (audio, video etc.). With internet access reduce framework. The Hadoop framework commonplace on smartphones, one can allows distributed processing of large data send anything at any time. Terrorist and sets across clusters of computers using other criminal networks have adopted this simple programming models. It is designed powerful leap in technology as a new means to scale up to thousands of machines, each of advancing their objectives. From a offering local computation and storage (The digital forensics perspective, new methods Apache Software Foundation, 2004) and frameworks are necessary to process Industry leading forensics software was this plethora of gathered information. One employed for collecting the such possibility includes combining big information/data from the mobile devices, data with forensic analysis. This study passing the data through the Hadoop refers to this combination as Big Forensic framework to aggregate it, and then Data. Without big data techniques, applying Social Network Analysis (SNA) analyzing the voluminous amount of data on the aggregated data. With the processing stored on smartphones would be nearly power of the Hadoop framework, thousands impossible or require a massive amount of of devices could be analyzed efficiently labor hours. thereby reducing the number of labor hours and time to detect criminal activities. Big As the volume, veracity, variety, and Forensic Data Analysis can help criminal velocity of forensic data increases it agencies in the efficient analysis of digital becomes essential to find methods and forensic data leading to more timely techniques to manage and analyze the data decision making capabilities. The and use it for benefit, such as revealing remainder of this paper is structured as criminal networks. Traditional data follows: section II presents a review of warehousing techniques have been relevant literature regarding big data and inefficient in proactive fraud management data forensics, section III discusses the and threat detection in large datasets. For Digital Forensic Process, section IV details

1 http://www.fraud.org the conceptual framework and forensic main fields. First, healthcare which analysis tools, Section V presents the includes, clinical decision support systems, proposed Big Forensic Data Framework individual analytics applied for a patient and illustration, section VI presents of the profile, personalized medicine etc. Second, Case study and Hadoop setup and section the public sector which includes creating VII presents the conclusion and the need for transparency by accessible related data, additional research. discover needs and improve performance. Third, retail which includes product 2. LITERATURE SURVEY placement design, performance improvement, labor inputs optimization etc. 2.1 BIG DATA Fourth, manufacturing which includes Data is being produced with increasing improved demand forecasting, supply chain volume, velocity, and veracity. Five planning, sales support etc. Fifth, personal exabytes (1018) of data, which was location data which includes smart routing, collectively created by humans till 2003 is urban planning, new business models etc. now produced in just two days. In 2012, the Sagiroglu and Sinanc (2013) defines digital world of data expanded to 2.72 21 big data as massive data sets having a large zettabytes (10 ) and is predicted to double and complex structure which are difficult to every two years (Sagiroglu & Sinanc, store and analyze. They further describe big 2013). By the year 2020, it is predicted that data analytics as analysis of the hidden around 50 billion devices will be connected patterns and secret correlations within to the network and the internet (Gerhardt, massive data sets. Labrinidis and Jagadish Griffin, & Klemann, 2012). In 2012, The (2012) developed a panel to discuss the Human Face of Big Data documentary was controversies and debunk the myths accomplished as a global project, which surrounding big data. The goals of the panel centers on real time collection, were, to identify how big data is different visualization and analysis of large amount from the past very large databases and, how of data. Many statistics were derived from data management industries and academia this project. For example, Facebook has 955 come together to solve big data challenges. million monthly active accounts using 70 Additionally, the panel validated many languages, 140 billion photos uploaded, claims such as Big Data is the same as 125 billion friend connections. On scalable analytics and that Big Data YouTube, every minute 48 hours of video problems are primarily at the application are uploaded and 4 billion people view side. videos every day. Google provides many services like monitoring 7.2 billion pages Big data is gaining popularity and per day, processes 20 petabytes (1015) of companies like GE and Verizon Wireless data daily, and translates the pages into 66 are making products that use big data languages. 140 million users send about 1 analytics (Davenport, 2014). Big data also billion Tweets every 72 hours. 571 new promises benefits in healthcare industries in websites are created every minute of the clinical operations, research & day (Sagiroglu & Sinanc, 2013). development, public health, evidence-based medicine, genomic analytics, pre- According to the McKinsey Global adjudication fraud analysis, device/remote Institute, the potential of big data lies in five monitoring, and patient profile analytics (Raghupathi & Raghupathi, 2014). The use of scientifically derived Raghupathi and Raghupathi (2014) also and proven methods towards the discussed the promise and potential of big preservation, collection, validation, data analytics in healthcare. Their study identification, analysis, interpretation, describes how big data analytics (BDA) in documentation and presentation of healthcare is in its initial stage and how the digital evidence derived from digital BDA has the potential to transform the way source for the purpose of facilitating or healthcare providers make decisions. furthering the reconstruction of events Similarly, Patil and Seshadri (2014) discuss to be criminal, or helping to anticipate how big data is being used in healthcare to unauthorized actions shown to be lower cost while improving the care disruptive to planned operations. process, delivery, and management. They Digital data can be obtained from present the state-of-the-art security and several sources and there are multiple privacy issues in big data when applied to methods and tools to analyze this data. the healthcare system. Tassone, Martini, Choo, and Slay (2013) Marchal, Jiang, State, and Engel (2014) discussed the capabilities of forensic tools introduced a new architecture for security for three different platforms: Apple, monitoring of local enterprise networks. Android, and RIM’s BlackBerry. Tassone The application of this system is to detect et al. (2013) identified where each tool can and prevent network intrusions. The be best used and provides limitations of architecture consists of two systems, one tools in accessing different information like which is dedicated to scalable distributed call history, message, contacts, media files, data storage and management and the and other data. Curran et al. (2012) second that is dedicated to data explored guidelines for maximum data exploitation. Katal, Wazid, and Goudar recovery as well as the different types of (2013) discussed the importance of Big subscriber identity modules (SIM) Data for modern analysis and various issues available in the market and the different like Privacy and Security, Data Access, and types of analysis that are carried out on the sharing of information. Guarino (2013) SIM cards for retrieving information. explored the challenges of Big Forensic Curran et al, also highlighted the different Data and provides techniques and types of tools available for forensic research algorithms that are used in big data analysis at http://www.mobiledit.com/forensic- that can also be used in digital forensics and solutions. mentions techniques like MapReduce and There are several aspects that are machine learning for the analysis of the considered when aiming to assess the forensic disk image as a model to integrate privacy level of an application. The data can the new paradigm into the established exist in various forms: data at rest, data in forensic standards. use, and data in transit. All the different 2.2 DIGITAL FORENSICS aspects use different methodologies and technologies. Stirparo and Kounelis (2012) According to the Digital Forensic demonstrated how different mobile Research Workshop in 2001, Digital forensics methodologies and tools can be is defined by Carrier used to assess the privacy level of mobile (2003) as : applications. Catanese, Ferrara, and Fiumara (2013) presented the LogAnalysis results show that nothing could be tool that represents and filters visual data, recovered from BlackBerry devices. On the and also statistical analysis features and the other hand, iPhone and Android store a possibility of a temporal analysis of mobile significant amount of data which can be phone activities. The tool helps in used for forensic research. understanding the structure of the network Digital forensics framework can also and the hierarchies of the criminal help law enforcement agencies. Quick and organization. Ferrara, De Meo, Catanese, Choo (2016) developed an approach for and Fiumara (2014), through their work, data volume reduction which focuses on the have tried to achieve two goals: first to registry, documents, spreadsheets, email, provide a conceptual framework for internet history, communications, logs, detecting and characterizing criminal pictures, videos, and other relevant file organizations from a phone call network. types. When their approach was applied to Next, to provide an expert system which the Australian Law Enforcement Agency, helps in unveiling the underlying structure the data volume was reduced leaving only of the criminal network. Grispos, Storer, the main evidential files and data. and Glisson (2011) discussed the different methods and tools to view information held on a Windows mobile device. Their paper 3. DIGITAL FORENSIC PROCESS seeks to find what information is held on the is a relatively mobile device. They mainly focus on the young discipline as compared to other use of Celebrite’s Universal Forensic forensic sciences. Because of this, Extraction Device (CUFED) as a tool for oftentimes the term computer forensics is acquiring data from the mobile phones. misunderstood. To address this issue, the They demonstrated that no technique can Cybercrime Lab in the Computer Crime and recover all the information of forensic Intellectual Property Section (CCIPS) interest from a mobile device. developed a flowchart, which is shown in Digital forensic tools could be used for Figure 2.10, describing the digital forensic analyzing the data from social websites like analysis methodology. The Cybercrime Lab Facebook, Twitter etc. Al Mutawa, Baggili, developed this flowchart after consulting and Marrington (2012) use different tools with various computer forensic experts for forensic analysis of this data from from several federal agencies. It helps in mobile devices. Their forensic tests were understanding the different elements of the aimed to recover data from the internal process (Carroll, Stephen K. Brannon, & memory of popular smartphones like Song, 2008). iPhone, Android, BlackBerry. All There are three main steps for the smartphones used different tools for analysis of computer forensic data is Data forensic analysis: BlackBerry used Extraction, Identifications and Analysis BlackBerry Desktop Software, iPhone used represented in Figure 3.10 an outlined in a iTunes application, and unlike the other two box. devices, Android does not have a management and backup solution. Therefore, the Android phone was rooted using Odin3 which gave access to the protected directories on the system. The 3.3 ANALYSIS In this phase, the examiners try to answers all the questions like who/what, where, when, how. For each relevant item, examiners try to explain when it was created, accessed, modified, received, sent, viewed, deleted, and launched. After the examiner has analyzed the relevant items, they move to the reporting phase in which the examiner tries to document all the finding so that the layman can understand

and use them in the case. Figure 3.10: Process Overview adopted from (Carroll et al., 2008) 4. CONCEPTUAL FRAMEWORK ATA XTRACTION 3.1 D E 4.1 DIGITAL FORENSICS AND BIG Examiners start by asking whether they DATA have sufficient information to start the Big Forensic Data and its analysis are process. If sufficient data is available, they an emerging topic and not much work is move to the next step where the system is done in this field. Researchers have tried to setup and the forensic data is duplicated. combine these two fields but no framework Once the forensic data is duplicated, the has been designed. Tahir and Iqbal (2015) integrity of the data is analyzed. A plan is studied different techniques for the analysis developed to extract the data. Examiners of forensic data from large datasets like should know the data for which they are Map Reduce, phylogenetic trees, blind searching. They add the data into a list source separation, and image culling. known as “Extracted Data List”. Finally, they concluded that Big Forensic 3.2 IDENTIFICATIONS Data is heavily dependent on the data source. They also mentioned the factors that Once the data is in the extracted data should be kept under consideration during list, the examiners analyze the type of data. the selection of a forensic technique. If the data is relevant, then examiners Zawoad and Hasan (2015) analyzed Big document it into a different document list, Forensic Data using a map and reduce the Relevant Data List. If an examiner framework to explore the challenges and comes across an item that is incriminating, issues in forensic paradigm. They but outside the scope of the original search concluded that the data is growing very fast, warrant, then it is recommended that the and there is a need for providing support for examiner immediately stops all activity, digital forensic in big data application notify the appropriate individuals, domain. including the requester, and wait for further instructions. If the data is not relevant the examiners would simply mark it as processed and move on to other relevant elements. 4.2 DIGITAL FORENSICS ANALYSIS forensic TOOLS analysis. Smartphones are a repository DEFT Linux It is used for containing huge amounts of information (Digital live computer about a user. There are many forensic tools Evidence forensic available in the market for processing this & systems. data. Table 1 illustrates the live forensic Forensic analysis tools which are adapted from Toolkit) Bashir and Khan (2013). In recent years, the market added new and advanced tools BinDiff Windows This tool which are listed in Table 2 to augment the identifies and work by Bashir and Khan (2013) in Table highlights the 1. changes and compares the Table 1. Live Forensic Analysis Tools code of adopted from (Bashir & Khan, 2013) program Tool Platform Description before and Name after running. AIR Windows A Graphical SIFT Ubuntu This tool to (Automat User Interface (SAN perform live ed Image used to create Investigat digital and live image ion forensic in Restore) dump of Forensics the operating memory. Toolkit) system. Autopsy Windows/ This tool is IEF Windows It is used to Linux used for disk (Internet collect analysis. Evidence evidence Finder) from the PDUMP Windows This tool is Internet. used to take a live memory FTK Windows This tool is dump. (Forensic used to the Toolkit) indexing of WFT Windows Used to digital (Window automate the evidence. s Forensic incident Toolchest response and DFF Windows/ This tool is ) perform a live (Digital Linux/ used for digital Forensic Mac digital analysis. Framewo evidence rk) collection and SMART Linux It is used to analysis perform a live toolkit. digital OS Windows This tool is Hash Windows used to Keeper application perform for storing forensic file hash analysis on signatures. web Table 2. New Forensic Analysis Tools browsers, emails, Files, Tool Platform Description and Images. Name CAINE Linux For live EnCase Windows Collect from a (Compute computer wide variety of r Aided forensics on operating and Investigat Linux. file systems, ive including mobile Environ devices(Beneish, ment) Lee, & Tarpley, 2001). COFEE Windows It is a toolkit (Compute for live ISafe Windows ISafe is a r Online digital network and Forensic forensic system Evidence analysis. monitoring Extractor digital forensic ) tool CMAT Windows Memory FTK Windows Allow to acquire (Compile analyzer. Imager images from Memory systems. Analysis ProDisc Windows Find data on the Tool) over computer disk. Wire Windows/ Captures and

Shark Linux/ analyzes Mac packets on 4.3 ENCASE the network. Out of the most commonly used Network Windows/ Extracts files, computer forensic tools in police Miner Linux images and departments in the United States is EnCase other which is used by about 2000 law- metadata enforcement agencies around the world. It from PCAP is a 1-mb program written in C++. EnCase files on the can analyze a disk drive from a Windows, network. Macintosh, Linux, or a DOS machine and make its bit-stream mirror image. To check the authenticity of mirror-image data, EnCase calculates cyclical redundancy checksums and MD5 hashes (Garber, Hadoop can also be used for unstructured 2001). data. Additionally, the open-source framework is free, and scalable. It can grow EnCase forensics gives you the ability by simply adding more nodes (Richardson to quickly search potential evidence to et al., 2010). determine whether further investigation is needed. Data could be collected from a There a few challenges associated with wide range of operating and file systems, Hadoop. MapReduce programming of including mobile devices with EnCase Hadoop is not a good match for all Forensic. EnCase Forensic provides a problems. There’s a widely-acknowledged flexible reporting framework that talent gap. It is hard to find the entry level empowers one to tailor case reports programmers who have enough java according to specific needs (Encase, 2017; programming skill to use MapReduce. Garber, 2001). William D. Taylor, Lastly, Data Security is another challenge president and CEO of the International associated with big data (Richardson et al., Association of Computer Investigative 2010). Specialists says, "It's a great program. I use it every day. It does the work for you if you know what you're doing." (Garber, 2001). 4.5 NODEXL 4.4 HADOOP NodeXL, Network Overview, Discovery, and Exploration add-in for From an open source perspective, big Excel 2007, 2010, 2013, 2016, adds data technologies, like Hadoop, provide network analysis and visualization features cost advantages which is an important to spreadsheets. NodeXL takes the data factor in any business. Hadoop was created from spreadsheets and converts it into a in 2008 by data scientists Doug Cutting and graph (Smith et al., 2009). NodeXL has Mike Cafarella. Their aim was to return many features which make it easy to use web search results faster by distributing represented in Table 3 (MarcSmith, 2016): data and calculations across different computers. Hadoop is an open-source Table 3. Features of NodeXL software framework for running very large data sets on clustered environments built NodeXL Description from commodity hardware. It provides Graph Metric NodeXL can calculate massive storage, enormous processing Calculations network metrics like power, and also supports hardware failure degree, closeness (Richardson, Tuna, & Wysocki, 2010). centrality, eigenvector Hadoop plays an important role because of centrality, PageRank, its advantages like computing power, clustering coefficient, distributed storage and distributed graph density etc. processing, which help in faster processing of data. Other features of Hadoop include fault tolerance. Data and application Flexible NodeXL has a processing are protected against hardware Import and predefined format in failure. If node fails, Hadoop automatically Export which it stores the data. assigns the job to another node. Most of the Feature NodeXL can import and databases store the structured data but export graphs from different input files illustrated in Figure 3.9 GraphML, Pajek, are taken from the digital forensic tool, UCINet, and matrix Encase. These files are passed to the formats. Hadoop framework for processing. The processed data is then fed NodeXL to

design a social network graph. This graph Direct NodeXL allows import helps in exposing a criminal network such Connections of network data from as an organized crime or terrorist network. to Social Twitter, Facebook, Networks Exchange, Wikis, and Input Encase File 1 YouTube etc. F I Encase Input · File 2 L Zoom and NodeXL allows to E Scale: zoom into areas of · · M NODEXL interest, and scale the E Encase Input File 3 R graph's vertices to Graphical Hadoop G Representation reduce clutter. Processor E R Easily NodeXL allows you to Adjusted set the color, shape, Encase Input Appearance size, label and much File n more. Figure 3.9: System Architecture Task Using NodeXL can The Big Forensic Data Framework has Automation repeat the set of tasks different stages: with a single click. Stage 1 represents the smartphones of NodeXL->Graph- the convicted criminals that are seized and >Automate->Run this taken into custody. A smartphone is like a step executes all that is repository in today’s modern era and it needed to process a contains all the information from meeting network data set from schedules to personal information that can start to finished, be obtained about the criminal. This published report smartphone is passed to the EnCase digital forensic tool for data extraction. Stage 2 represents EnCase digital 5. BIG FORENSIC DATA forensic tool which takes the smartphone as FRAMEWORK an input. This tool makes an image of the device into the local system which can be Our proposed Big Forensic Data used by the investigating agencies. In our Framework combines all the above- use case, we extracted the criminal’s text mentioned components, namely digital message. We can also find all the other forensics, Hadoop, and NodeXL. Figure 3.9 information which is available in the represents our system architecture. The smartphone. Encase generates a report based on the version 5.1 (Lollipop). For extracting the requirement of the investigating data, a new case was created with the name agency/officer. This generated report can be of “Tablet”. Our process follows the seen in Fig 13. It gives all the information standard digital forensic process (Carroll et like what, when, and where the text al., 2008). For step 1, the data extraction message was sent or delivered. This report phase, we extracted smartphone data from is unstructured in nature which requires an aforementioned device. Using EnCase manual human intervention to analyze the we duplicated the smartphone data. For the report. A separate report is generated for identification phase, we used the text each smartphone. As the report is messages extracted by EnCase. In the final unstructured in nature and requires a human phase, using the SNA graph, we graphed to perform analysis. We aim to improve this the connections between the text messages. process via employing the Hadoop For our experiment, 3 different networks framework to reduce the amount of human were used. We took one data set from the effort required. The Hadoop framework is ZTE phone that represents Hitesh’s designed to handle unstructured files. By network. The other two, Max’s and this process data can be converted into Chinmay’s networks, were fabricated to valuable information about the criminal. support the experiment. In Max’s and Chinmay’s phones different numbers were The output files from the Hadoop used to send text. One common connection, framework are given to file merger process Matt, was used in the phone numbers because the NodeXL takes all output files between all the three different networks. in a specific format. Before passing each file separately to NodeXL, the files are EnCase has a user friendly graphical merged and formatted to directly input into user interface (GUI). Clicking on the NodeXL. Once the output files from the “acquire smartphone” link as in Figure Hadoop framework are changed to the 6.10, a window will open that will ask about NodeXL format by the file merger it is then the details of the smartphone. The EnCase passed NodeXL. tool gives you options for all the different devices like Apple, Blackberry, Nokia, NodeXL uses the file generated by file Palm etc. Figure 6.11, on the right side, merger to generate the social network shows the option of choosing the elements analysis (SNA) graph. SNA graph necessary for forensic analysis. generated by NodeXL can possibly give you the connection of the kingpin with other criminal groups. The graph facilitates examining the network to determine how the network operates and who is connected to whom.

6. CASE STUDY We employed the EnCase forensic tool for extracting the forensic data from mobile devices. The data was extracted from a ZTE phone, model number Z812, android Figure 6.10: Acquire Smartphone

Figure 6.11: Type of Smartphone Figure 6.12 has all the previous cases and is the starting point in the EnCase tool. Figure 6.13 shows when a specific case is read, data is extracted from a device. Figure 6.14 represents the hierarchy of the element of the device. Last, Figure 6.15 shows the extracted report from the EnCase tool. Figure 6.15 gives the option of selecting the Figure 6.13: Screen with a specific case elements to include in the final report.

Figure 6.14: Forensic data is duplicated

Figure 6.12: Home Screen with all the cases Name Node

Client

Read Data Nodes Data Nodes

Replications

Rack 1 Rack 2

Write Client Figure 6.16: HDFS Architecture adopted from (Borthakur) The command “Hadoop fs –put” was used to put the digital forensic report into the HDFS. In our project, we used the command “Hadoop fs –put reportSMS.txt /user/dfProject/BFDFramework/input.” To Figure 6.15: Report from EnCase check the text file in HDFS “Hadoop fs –ls” 6.1 HADOOP SETUP or in our case “Hadoop fs –ls /user/dfProject/BFDFramework/input” was The Hadoop Architecture has two main used. To remove a file from the HDFS we components, the Hadoop Distributed File use the command “Hadoop fs –rm System (HDFS) and MapReduce (Alam & /user/dfproject/BFDFramework/input/ Ahmed, 2014). HDFS contains three major reportSMS”. components Name Node, Data Node, and Secondary Name Node. Name node is also MapReduce is a software framework known as the master node. It contains all the for processing a large amount of data in- information about the data blocks against parallel on multi-node clustered each data node. Data node is the node which environment, in a reliable, fault tolerant contains the actual data. Upon any request, manner. The task of MapReduce is to divide the data can be read and written to and from the data into smaller blocks which are this node. The Secondary Name Node is the processed individually by the mapper in a helper of the Name Node. When name node parallel manner. Then this data is fed to the performs some action on any of the data reducer. Typically, both the input and the nodes, it creates a checkpoint and that output of the job are stored in a file-system. checkpoint saved on the secondary name The framework takes care of scheduling node. The system architecture of the HDFS tasks, monitoring them and re-executes the is illustrated in Figure 6.16. failed tasks. Figure 3.17 shows the operation of the MapReduce framework. The command “Hadoop jar BFDFramework.jar org.myorg”. BFDFramework “user/dfProject/project1/input /user/dfProject/BFDFramework/output” is used to execute our MapReduce job. Once criminal network, we analyzed the text the program is successfully executed, the messages sent within the network. After results can be viewed with “Hadoop fs –cat analyzing text messages, we observed that /user/dfProject/BFDFramework/output/*” there are 3 different organizations which are command. illustrated in Figure 6.18, 6.19 and 6.20:

Input Splitting Mapping Shuffling Reducing Output

ISIS, 1 HAMAS, 1 Taliban, 1 HAMAS, 1 HAMAS, 1 HAMAS, 1 HAMAS, 1 ISIS Taliban HAMAS HAMAS, 4

ISIS Taliban HAMAS HAMAS, 4 HAMAS, 1 HAMAS HAMAS Taliban HAMAS Taliban HAMAS ISIS, 1 ISIS, 2 ISIS, 2 HAMAS, 1 HAMAS ISIS Taliban ISIS, 1 Taliban, 3 Taliban, 1

HAMAS ISIS Taliban Taliban, 3 HAMAS, 1 Taliban, 1 ISIS, 1 Taliban, 1 Taliban, 1 Taliban, 1

Figure 6.17: Map Reduce Architecture

In this work, we created a Hadoop cluster using 3 virtualized machines, one master node and two name nodes that will Figure 6.18 Max’s Network process our MapReduce job. We created this distributed computing environment for storing and analyzing the huge amount of unstructured data. In our case, the data is reports extracted from the EnCase forensic software. This data is stored on HDFS, separate from a machine’s file system. HDFS helps facilitate the distribution of data across the cluster for Hadoop operations. 6.2 NODEXL SOCIAL NETWORK ANALYSIS FOR BIG FORENSIC DATA The output of the Big Forensic Data Framework acts as an input for NodeXL. NodeXL is used for network analysis and visualization. In this paper, we built a case Figure 6.19: Chinmay’s Network for exposing criminal/terrorist network. After analyzing the data from Hadoop framework, a part of Big Forensic Data, this data is fed to NodeXL. In exposing a 7. CONCLUSION In today’s world terrorism and crime are a major concern. In this work, we focused on conducting forensic analysis to expose a terrorist or criminal network. Criminal and terrorist activities frequently employ mobile technology such as smart phones. Advances in technology facilitate data to grow at exponential rates coining the Figure 6.20: Hitesh’s Network term Big Data. Similarly, Big Data appears when dealing with forensic analysis of When we looked at the bigger picture mobile devices coining the term Big of the network we observed that Matt was a Forensic Data. While having this vast single link between the heads of all three amount of data can be beneficial in digital criminal organizations. This is illustrated in forensic analysis, this data is not useful Figure 6.21. Based on this case, an unless it can be explored appropriately. investigator could determine there are 3 connected criminal or terrorist networks. The existing analysis tools prove to be Each network has someone coordinating inadequate for investigating Big Forensic communications and there is yet another Data, hence, new methodologies need to be person coordinating between the networks. introduced to perform forensic analysis As is the case with most organized criminal with an expected minimum time response. networks, each of the networks, or cells, are By combining a different components from unaware of other cells and the larger part of both digital forensics, big data, and social the network and are only concerned with network analysis, we designed a new executing their specific tasks. This makes it framework to explore Big Forensic Data. difficult to extract information from actors We demonstrated how this new Big in the network; however, employing the Big Forensic Data Framework can be used to Forensic Data Framework on seized expedite and support the investigation devices may yield additional information process. We also demonstrated that text and insight for investigators. messages can be used to identify, analyze, and expose terrorist and criminal social networks. Our paper identifies the challenges of forensic analysis and provides a new approach to digital forensics; however, additional research is needed to solve more challenges in digital forensic analysis and provide opportunities for new insights. We propose a new branch of science with Big Forensic Data and our Big Forensic Data Framework. Figure 6.21. Analysis of Criminal Network 8. REFERENCES: Davenport, T. (2014). Three big benefits of big data analytics. Retrieved from Al Mutawa, N., Baggili, I., & Marrington, https://www.sas.com/en_ca/news/s A. (2012). Forensic analysis of ascom/2014q3/Big-data- social networking applications on davenport.html mobile devices. digital De Jong, K. A. (2006). Evolutionary investigation, 9, S24-S33. computation : a unified approach. Alam, A., & Ahmed, J. (2014). Hadoop Cambridge, Mass.: MIT Press. Architecture and its issues. Paper Encase. (2017). EnCase Forensic Software. presented at the Computational Retrieved from Science and Computational https://www.guidancesoftware.co Intelligence (CSCI), 2014 m/-forensic International Conference on. Ferrara, E., De Meo, P., Catanese, S., & Bashir, M. S., & Khan, M. (2013). Triage in Fiumara, G. (2014). Detecting Live Digital Forensic Analysis. criminal organizations in mobile International journal of Forensic phone networks. Expert Systems Computer Science, 1, 35-44. with Applications, 41(13), 5733- Beneish, M. D., Lee, C. M. C., & Tarpley, 5750. R. L. (2001). Contextual Garber, L. (2001). Encase: A case study in Fundamental Analysis through the computer-forensic technology. Prediction of Extreme Returns. IEEE Computer Magazine Review of Accounting Studies, 6, January. 165-189. Gerhardt, B., Griffin, K., & Klemann, R. Borthakur, D. HDFS Architecture Guide. (2012). Unlocking value in the Retrieved from fragmented world of big data https://hadoop.apache.org/docs/r1. analytics. Cisco Internet Business 2.1/hdfs_design.html Solutions Group, June. Carrier, B. (2003). Defining digital forensic Grispos, G., Storer, T., & Glisson, W. B. examination and analysis tools (2011). A comparison of forensic using abstraction layers. evidence recovery techniques for a International Journal of Digital windows mobile smart phone. Evidence, 1(4), 1-12. digital investigation, 8(1), 23-36. Carroll, O. L., Stephen K. Brannon, & Guarino, A. (2013). Digital forensics as a Song, T. (2008). Computer big data challenge ISSE 2013 Forensics. 56. Securing Electronic Business Catanese, S., Ferrara, E., & Fiumara, G. Processes (pp. 197-203): Springer. (2013). Forensic analysis of phone Katal, A., Wazid, M., & Goudar, R. (2013). call networks. Social Network Big data: issues, challenges, tools Analysis and Mining, 3(1), 15-33. and good practices. Paper Curran, K., Robinson, A., Peacocke, S., & presented at the Contemporary Cassidy, S. (2012). Mobile phone Computing (IC3), 2013 Sixth forensic analysis. Crime International Conference on. Prevention Technologies and Labrinidis, A., & Jagadish, H. V. (2012). Applications for Advancing Challenges and opportunities with Criminal Investigation, 250. big data. Proceedings of the VLDB (social media) networks with Endowment, 5(12), 2032-2033. NodeXL. Paper presented at the Marchal, S., Jiang, X., State, R., & Engel, Proceedings of the fourth T. (2014). A Big Data Architecture international conference on for Large Scale Security Communities and technologies. Monitoring. Paper presented at the Stirparo, P., & Kounelis, I. (2012). The 2014 IEEE International Congress mobileak project: Forensics on Big Data. methodology for mobile MarcSmith. (2016, 9/27/2016). NodeXL: application privacy assessment. Network Overview, Discovery and Paper presented at the Internet Exploration for Excel. Retrieved Technology And Secured from http://nodexl.codeplex.com/ Transactions, 2012 International Pascual, A., Marchini, K., & Miller, S. Conference for. (2018). Al Pascual, Kyle Marchini, Tahir, S., & Iqbal, W. (2015). Big Data??? Sarah Miller. Retrieved from An evolving concern for forensic Patil, H. K., & Seshadri, R. (2014). Big data investigators. Paper presented at security and privacy issues in the Anti-Cybercrime (ICACC), healthcare. Paper presented at the 2015 First International 2014 IEEE international congress Conference on. on big data. Tassone, C., Martini, B., Choo, K.-K. R., & Quick, D., & Choo, K.-K. R. (2016). Big Slay, J. (2013). Mobile device forensic data reduction: digital forensics: A snapshot. Trends and forensic images and electronic Issues in Crime and Criminal evidence. Cluster Computing, 1- Justice(460), 1. 18. The Apache Software Foundation. (2004). Raghupathi, W., & Raghupathi, V. (2014). What Is Apache Hadoop? Big data analytics in healthcare: Retrieved from promise and potential. Health http://hadoop.apache.org/ Information Science and Systems, Zawoad, S., & Hasan, R. (2015). Digital 2(1), 1. Forensics in the Age of Big Data: Richardson, S., Tuna, I., & Wysocki, P. Challenges, Approaches, and (2010). Accountinganomalies and Opportunities. Paper presented at fundamental analysis: A review of the High Performance Computing recent research advances. Journal and Communications (HPCC), of Accounting and Economics, 2015 IEEE 7th International 50(2-3), 410-454. Symposium on Cyberspace Safety Sagiroglu, S., & Sinanc, D. (2013). Big and Security (CSS), 2015 IEEE data: A review. Paper presented at 12th International Conferen on the Collaboration Technologies Embedded Software and Systems and Systems (CTS), 2013 (ICESS), 2015 IEEE 17th International Conference on. International Conference on. Smith, M. A., Shneiderman, B., Milic-

Frayling, N., Mendes Rodrigues, E., Barash, V., Dunne, C., . . . Gleave, E. (2009). Analyzing