
TITLE Image Visualization and Content Identification Using Mathematica

/Introduction In this article, we investigate image visualization (IV) and image content identification (ICI) using Mathematica. It is not the only framework to provide such capability; popular web services, cloud APIs and programs are readily available. These include online services such as Google's Vision API and Image Recognition service and Microsoft's CaptionBot, as well as desktop software such as Picasa (an image organizer) and many others. Each is capable in its own right, and some have capabilities that others do not. Nevertheless, in some ways Mathematica is superior to these programs because it gives the user, through its rich functionality, the ability to explore data and imagery in ways the others do not. Those already using online APIs should have little difficulty adapting to the Wolfram Data Language (WDL). Of course, uploading imagery from an investigation to the Internet or cloud is ill advised.

Many of today's programming languages have image processing (IP) capabilities, built in or available from third-party libraries, which have helped advance the spread of computer vision. These languages and libraries typically require many lines of programming, and some require extensive use and manipulation of data structures. Using WDL, we will show how little code is actually required to perform complex IP. Our Mathematica code, concentrated in the program body, should be understandable to anyone with prior programming experience. As with our first Mathematica article, we expand on data processing using file I/O, generating potentially actionable information by transforming images into more meaningful information.

Throughout this article, three different datasets are used to provide useful examples and establish performance metrics. Important points for building WDL programs are introduced, enabling us to pick up where we left off in our first Mathematica article. WDL programming is straightforward but, like any language, the more complex the program the more difficult the programming. WDL reduces many of these hurdles by providing an intuitive way of building up program functionality. We hope to show, in little time, programs suitable for digital forensics (DF) that are substantially more complex than those in our first article.

Finally, to save space in our code we have reduced the various snippets by condensing the previously examined elements from Article 1 into familiar nuggets, denoted by "(* TITLE... *)". If a code heading is followed by an equals sign (=), this signifies that we are adding to or modifying that subsection of code; if "…" is used, then we are not making any changes to that subsection of code.

/About the datasets The first dataset is our primary set, based on a collection of photos taken by the authors using various camera models. They come from personal collections and so are free of copyright and attribution concerns. Consisting of a wide variety of places, things, times of day (e.g., daytime, blue skies, sunsets), pets, vehicles and stellar objects, it makes for a realistic set of images. It is a small collection of 121 images representing 961 MB of data. It includes large panoramas, the largest of which is 14,807 x 11,631 pixels. From this set, we hope to provide useful information concerning Mathematica's overall ICI accuracy.

The second set is used to more closely examine Mathematica's ability to identify and differentiate between persons in images, their backgrounds and what they are wearing, issues that tend to confuse ICI software. The images are medium sized and occupy several MB of disk space. To make the analysis more realistic we used images from the web consisting of various "persons". Anyone working in DF knows all too well the prevalence of images of "persons" in a typical investigation.

The final set, which we use for establishing meaningful performance metrics, consists of 8,620 image files consuming about 22.0 GB of disk space. This set represents the full collection from which the first dataset was drawn; we visualize and perform ICI on it as well.

Finally, performance and processing issues for all datasets will be discussed.

/Test system specs


To implement and test the functionality we describe, we need a capable system for experimentation to ensure that our code is both accurate and functional.

For our testing purposes, we use a customized workstation equipped with two Xeon 2630v3 processors running at 2.40 GHz and 128 GiB RAM atop a SuperMicro X10-DAL-i motherboard. Swap, enabled for this article, is a 128 GiB raw partition on a Kingston SV300S3 SSD. An NVidia GeForce 720 2 GiB video card provides system graphics. Five internal 1 TB hard drives are used for data storage, with no RAID configured. The system runs a heavily customized installation of Fedora 23 x64 using kernel 4.4.9-300 SMP. The Mathematica version used is 10.4.1.0.

The concepts shown in this article are applicable to Windows, Linux and Mac OS X.

/Concept I – Clearing memory of previously stored results Mathematica is indeed a strange creature. In Article 1 we recommended using ClearSystemCache[] and ClearAll["Global`*"] to clear both the Wolfram system cache and all values or information associated with existing symbols. It turns out that this is insufficient to truly clear the cache of past results or evaluations. To definitively clear out this memory we set $HistoryLength=0 (zero). This conclusively clears all previous results from memory, returning unused memory to the system pool.

We only learned of this memory clearing issue after rerunning our ICI program many times over, until it crashed Mathematica because of insufficient system memory. At that time, Wolfram technical support informed us of this feature, discussed in Wolfram’s Documentation Center.

Specifics for correctly implementing a memory clearing routine are shown in Code Snippet 1. From this point forward, we will use this routine in all WDL programs.

(* PROGRAM INITIALIZATION & MEMORY MANAGEMENT (Begin) = *)
(* Clear everything *)
ClearSystemCache[]
ClearAll["Global`*"]
$HistoryLength = 0

Code Snippet 1: Procedure to correctly clear memory of unused symbols, history and cache.

Some blogs and non-Wolfram support sites state that Remove["Global`*"] will clear a notebook's (NB) memory of all symbols, cache and history; this is incorrect. The only way to do this correctly is to use the approach shown in Code Snippet 1.

/Concept II – Keeping track of memory usage Key to running any program that consumes large amounts of memory is keeping track of it. Mathematica makes this simple. We create a variable, say startmemory, to store the value returned by MemoryInUse[], which we call shortly after a program begins. MemoryInUse[] reports the amount of memory in use by the Wolfram system at the moment it is called. To identify the memory in use by Mathematica's front end, use MemoryInUse[$FrontEnd].

Equally important is determining how much memory is in use at the end of the program, just before closing files and cleaning up. To determine this we again run MemoryInUse[] and store the value in a new variable, say finalmemory. Of course, we will also want to know the program's peak memory usage; this value is obtained from MaxMemoryUsed[].

We can use Share[] to reduce duplication among the in-memory data structures used by a program's functions, expressions and variables. Extensive testing reveals memory savings of several hundred KB to several MB with no noticeable overhead. To maximize memory savings, use it shortly after a program begins. If the need arises to share specific program elements, we can use Share[x], where x is some variable, expression or function.

To implement these memory management capabilities, we need look no further than Code Snippet 2.


(* PROGRAM INITIALIZATION & MEMORY MANAGEMENT (Begin) = *)
(* Clear everything, share memory and start memory tracking *)
…
Share[];
startmemory = MemoryInUse[];

(* PROGRAM TIME KEEPING (Begin)... *)

(* PROGRAM BODY... *)

(* PROGRAM TIME KEEPING (End)... *)

(* MEMORY MANAGEMENT (End) = *)
(* Print out program memory usage information *)
maxmem = MaxMemoryUsed[];
finalmemory = MemoryInUse[];
StringForm["Memory in use when program started: ``\n", startmemory]
StringForm["Max. memory consumed by program: ``\n", maxmem - startmemory]
StringForm["Memory in use when program ended: ``\n", finalmemory]

(* CLOSE UP PROGRAM & STREAMS... *)

Code Snippet 2: Memory management code.

/Concept III (Optional) – System report writing and running external commands As with any investigation, report writing is central to detailing the tools and systems used in an analysis. Mathematica provides "self-documenting" functionality that quickly helps us gather considerable information about a Wolfram deployment. Fortunately, the names of these functions are straightforward and self-explanatory. For the curious, Mathematica's documentation has in-depth details about these functions.

Ideally, information obtained from these functions should be stored in an external file. While highly specialised output-generating code can be implemented in a NB, it is much simpler to export this information directly to PDF or to a text file. We will use the latter approach, shown in Code Snippet 3, where we save information about the current deployment to the file systeminfo.txt.

It may be necessary when generating a Mathematica report to obtain specific operating system (OS) information that Mathematica's built-in functionality does not provide. To run a command or program and save its output from within Mathematica, we use variable = ReadList["!command -parameter(s)", String]. For example, to get the running Linux kernel version we run the command uname -a. Implementing this specific command in our code is shown in bold in Code Snippet 3.

(* PROGRAM INITIALIZATION & MEMORY MANAGEMENT (Begin) ... *)

(* PROGRAM TIME KEEPING (Begin)... *)

(* PROGRAM BODY = *)

(* Define files *)
sysinfofile = "systeminfo.txt"
...

(* Define input, output and error streams *)
...
sysinfostream = OpenWrite[sysinfofile];

...

(* WOLFRAM SYSTEM INFORMATION REPORTING (OPTIONAL) = *)
(* Write Wolfram Deployment Information to Report File *)
WriteString[sysinfostream, "Operating System: ", $OperatingSystem, "\n"]
uname = ReadList["!uname -a", String];
WriteString[sysinfostream, "Linux Kernel Information:", uname, "\n"]
WriteString[sysinfostream, "Base System Information: ", $System, "\n"]
WriteString[sysinfostream, "System Shell: ", $SystemShell, "\n"]
WriteString[sysinfostream, "Physical RAM (in bytes): ", $SystemMemory, "\n"]
WriteString[sysinfostream, "Char Encoding Type: ", $SystemCharacterEncoding, "\n"]
WriteString[sysinfostream, "Mathematica Version: ", $Version, "\n"]
WriteString[sysinfostream, "Hardware Type: ", $MachineType, "\n"]
WriteString[sysinfostream, "Default Num. Kernels: ", $ConfiguredKernels, "\n"]
WriteString[sysinfostream, "Home Directory: ", $HomeDirectory, "\n"]
WriteString[sysinfostream, "Installation Directory: ", $InstallationDirectory, "\n"]
WriteString[sysinfostream, "Machine Precision: ", $MachinePrecision, "\n"]
WriteString[sysinfostream, "Avail. CPU Cores (No HT):", $ProcessorCount, "\n"]
WriteString[sysinfostream, "Architecture Type: ", $ProcessorType, "\n"]
WriteString[sysinfostream, "DNS Information: ", $MachineDomains, "\n"]
WriteString[sysinfostream, "System IP Address(es): ", $MachineAddresses, "\n"]
WriteString[sysinfostream, "Network License Server: ", $LicenseServer, "\n"]
WriteString[sysinfostream, "License ID: ", $LicenseID, "\n"]
WriteString[sysinfostream, "Activation Key: ", $ActivationKey, "\n"]
WriteString[sysinfostream, "Packages: ", $Packages, "\n"]

(* PROGRAM TIME KEEPING (End)... *)

(* MEMORY MANAGEMENT (End)... *)

(* CLOSE UP PROGRAM & STREAMS = *)
...
Close[sysinfostream];

Code Snippet 3: Wolfram system information reporting code.

Example output from running Code Snippet 3 is shown in Output 1.

Operating System: Linux
Linux Kernel Information: {Linux localhost.localdomain 4.4.9-300.fc23.x86_64 #1 SMP Wed May 4 23:56:27 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux}
Base System Information: Linux x86 (64-bit)
System Shell: /bin/sh
Physical RAM (in bytes): 135077982208
Char Encoding Type: UTF-8
Mathematica Version: 10.4.1 for Linux x86 (64-bit) (April 11, 2016)
Hardware Type: PC
Default Num. Kernels: {<<8 local kernels>>}
Home Directory: /home/username
Installation Directory: /usr/local/Wolfram/Mathematica/10.4
Machine Precision: 15.9546
Avail. CPU Cores (No HT): 16
Architecture Type: x86-64
DNS Information: {}
System IP Address(es): {fd10::bc1:8dce:ae49:3bcf, 192.168.102.11, 192.168.10.10}
Network License Server: localhost.localdomain
License ID: L1122-3344
Activation Key: 3347-0104-LVUL6P
Packages: {Parallel`Protected`, Parallel`Developer`, Parallel`Evaluate`, Parallel`Combine`, Parallel`Queue`Priority`, Parallel`Queue`FIFO`, Parallel`Queue`Interface`, Parallel`Concurrency`, Parallel`VirtualShared`, Parallel`Status`, SubKernels`LocalKernels`, Parallel`Palette`, Parallel`Parallel`, SubKernels`Protected`, SubKernels`, Parallel`Kernels`, Parallel`, Parallel`Debug`Perfmon`, Parallel`Debug`, Parallel`Preferences`, ProcessLink`, QuantityUnits`, TypeSystem`, WolframAlphaClient`, Macros`, GeneralUtilities`, Developer`, JLink`, GetFEKernelInit`, CloudObjectLoader`, StreamingLoader`, IconizeLoader`, ResourceLocator`, PacletManager`, System`, Global`}

Output 1: System information obtained from text file "systeminfo.txt." (Note: HT = Hyper-Threading)

/Concept IV – Importing graphics in Mathematica Before we examine IV and ICI, we need to consider how graphics are brought into Mathematica. Rather than simply opening an image, we must import it. Through importation, the raster data of the image and all its metadata become available to the user. We will briefly look at metadata manipulation later on.

Once imported, a myriad of IP and other computations can be applied to an image. We can import one, two or many images with a single line of code. While many parameters can be specified when importing an image or other file format, in its most general sense, for immediate use, we use Import[]. Its functionality is straightforward; if uninterested in metadata or specialized capabilities, we simply specify Import["filename"]. Putting a semicolon (;) after the statement prevents the image from being displayed; this is sometimes useful, especially when processing multiple images.
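A minimal sketch illustrates these points; the file names below are hypothetical examples, not files from our datasets.

(* Import one image; the trailing semicolon suppresses its display *)
img = Import["IMG_0001.JPG"];
(* Import several images with one line of code *)
imgs = Import /@ {"IMG_0001.JPG", "IMG_0002.JPG"};
(* Evaluating the symbol on its own displays the image *)
img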

In Figure 1, we show what a large image looks like after importation. It measures 7,668 x 2,433 pixels and is about 74 MB in size.

Although versatile and highly capable, Mathematica is not designed to display multiple very large images and allow their real-time manipulation, as we would do in a program like Photoshop. It is important to remember that Mathematica is not Photoshop, GIMP or another IP program. Its power lies in the operations and manipulations the framework lets us perform against one or many images.

Once an image is imported, right-clicking once on the image will display an "image processing" toolbar. On the toolbar, clicking once on the orange i presents the user with information about the image, as shown in Figure 2. The various options and capabilities of the toolbar, which provide various IP shortcuts, can then be explored. Looking at Figure 2, we see just how much information about an image is presented to the user.

Figure 1: Importing a graphics image directly into Mathematica.


Figure 2: Checking out an image file’s information in Mathematica.

/Image visualization – The specifics Mathematica is not an image previewing application; when viewing or scrolling through many hundreds or thousands of images is required, a more suitable program should be used. Visualization in Mathematica should be used on its own to explore distinct datasets, or in combination with more complex programs.

By default, Mathematica automatically manages the sizing or "scaling" of all images it displays. This is often beneficial when working with several images at the same time because it relieves the user of having to juggle the position of the images on the screen and manage their size.

For example, if we just wanted to open one image, as shown in Figure 1, rescaling it in Mathematica is easy. We can adjust the magnification from the image toolbar, as seen in Figure 2, or we can right-click the image to bring up its Magnification feature and select the appropriate scaling, as shown in Figure 3.

Left to its defaults, even very large images are readily displayed in the current NB window. The trouble begins when trying to explore many images, especially when several of them are large; then the lack of interface responsiveness becomes obvious.


Figure 3: Setting an appropriate magnification level.

A simple solution is to use a for loop and resize the images as we cycle through them. We started with this approach but quickly learned that, because I/O is fed in through the loop, we could not adjust an image's scaling or magnification afterwards, losing significant control over visualization. Thus, another method was required, one that takes immediate advantage of Mathematica's default behaviour of placing images next to one another, left to right, line by line.
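For reference, a loop of the kind we abandoned might look like the following sketch (it assumes the list of filenames is stored in list, as in Code Snippet 4; our actual discarded code is not reproduced here). Each image is imported, resized and printed in turn, which illustrates the loss of post-hoc control over scaling and magnification described above.

(* Loop-based visualization: import, resize and print each image one at a time *)
For[i = 1, i <= Length[list], i++,
  Print[ImageResize[Import[list[[i]]], {200}]]
]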

By resizing images to something more reasonable in size, we are able to visualize many more images from Mathematica's interface, which makes working with the interface far more manageable. The visualization code shown in Code Snippet 4 is straightforward to understand, but there are several features from Mathematica's functional programming paradigm requiring explanation.

(* PROGRAM INITIALIZATION & MEMORY MANAGEMENT (Begin) ... *)

(* PROGRAM TIME KEEPING (Begin)... *)

(* PROGRAM BODY = *)
...
inputfile = "dataset1_list.txt";
length = Length@Import[inputfile, "Lines"];
list = Import[inputfile, "Lines"];
StringForm["Number of files to be processed from input file: ``\n", length]

images = Parallelize[Map[Import[#] &, list]];                      (* PART 1 *)
resizedImages = Parallelize[Map[ImageResize[#, {200}] &, images]]; (* PART 2 *)
Parallelize[Image[#, Magnification -> 1] & /@ resizedImages]       (* PART 3 *)

(* WOLFRAM SYSTEM INFORMATION REPORTING (OPTIONAL) ... *)

(* PROGRAM TIME KEEPING (End)... *)

(* MEMORY MANAGEMENT (End)... *)

(* CLOSE UP PROGRAM & STREAMS ... *)

Code Snippet 4: Image visualization, resizing and magnification code, suitable for parallelization (important variables and Parallelize[] have been colour highlighted).


Code Snippet 4 has been colour coded to make it easier to read. Just three lines of code perform the actual data processing; we have done away with looping entirely by loading all images into memory. In our opinion, if control over the on-screen size of images displayed in a NB is important, then this is the best way to do it, though other methods exist.

The program body begins by importing the contents of the text file dataset1_list.txt, which contains the filenames and locations of our primary image dataset. We built this list using the commands shown in Article 1. Upon importation, this list is stored in the variable list.

Examining the code tagged "(* PART 1 *)", we proceed to import the entire set of images whose names are stored in list. The images are imported in a single sweep and their graphics data is stored in the variable images; the more images, the more RAM is required. To do this, Import[] is called by Map[], a function that applies some action or function, f(x), to every element in list. In this case, f(x) = Import[]. The symbols # and & ensure that Import[] is run against every item in list. The output from this line of code is suppressed because of the semicolon. Finally, we wrap the expression in Parallelize[] to increase performance, which yields a significant speedup.
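The same Map and pure-function idiom can be seen on a trivial, non-image example, which may help readers new to WDL follow PARTs 1 to 3:

(* Map applies the pure function #^2 & to every element of the list *)
Map[#^2 &, {1, 2, 3}]    (* returns {1, 4, 9} *)
(* /@ is shorthand notation for Map *)
#^2 & /@ {1, 2, 3}       (* also returns {1, 4, 9} *)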

In "(* PART 2 *)", parallelized image resizing is carried out using ImageResize[]; again the symbols # and & ensure that the function g(x) = ImageResize[] is correctly applied to the imported graphics raster data currently stored in images. The size used for resizing is set to {200} and can be readily modified at any time. This value indicates that resized images are to be made 200 pixels wide; Mathematica then automatically scales the height of each image according to the original aspect ratio. So an image that is 1000 x 2000 rescaled by {200} becomes 200 x 400. All reprocessed images are then stored in the variable resizedImages.

Finally, in "(* PART 3 *)" we replace Mathematica's autoscaling or magnification with a specific value; here, we want a ratio of 1:1. This means that the image presented on screen is the same size as the resized image. Of course, this assumes that the NB's default magnification remains 100%. Using Image[] and its options, we display the graphics data stored in resizedImages at 1:1 using Magnification -> 1. Again, parallelization increases the code's performance. In-depth information concerning the symbols #, & and @ can be found in the documentation.

An example of the output produced from the first dataset, after setting the NB magnification to 50%, is shown in Figure 4. A reduced magnification enables more images to be displayed per line, which can be advantageous. We also visualized our dataset of "persons" without issue, as well as the third dataset, although for the latter various problems were encountered, which we discuss in the next section.

The code presented in Code Snippet 4, though in some ways slower, is more versatile than a loop-based implementation would have been.

/Image visualization – Performance Remember that not all Mathematica functions can be parallelized; we experimented with variations of this code to find those that could be, and confirmed this through testing. The extent to which parallelization increases performance depends on the number of images to be processed, their size, and the number of available CPU cores and Mathematica kernels.

For performance-related results, we ran the program three times per dataset and averaged the results. Individual runs did not vary by more than 15%.

For our primary dataset, the average runtime for the visualization program was 17 seconds with parallelization and 42 seconds without. Even when Parallelize[] was not used, Mathematica automatically parallelized some of the code. The average peak memory used by the program was around 3,004,000,000 bytes, while the average consumed memory was roughly 3,003,000,000 bytes. Saving the NB took only a few seconds.


Figure 4: Partial output for first dataset visualization after resizing, at 50% magnification, resizing={200}.

Our second dataset was visualized in almost no time at all; it took 1 and 2 seconds for parallelized and non-parallelized code, respectively. Memory usage was also negligible, and saving the NB took one to two seconds.

However, various issues were encountered when visualizing the third dataset. At first, the program crashed before any actual visualization was done. It was eventually determined, using the thread-based debugging code discussed in the Bonus Topic section, that three images were causing Mathematica's kernel to crash. The culprits were identified and removed from the input file, leaving us with 8,617 images to process. We then succeeded in resizing and displaying the entire set in the NB. However, for each of the three trials, following the visualization code the system ran out of memory (RAM and swap). At that point one of Mathematica's kernels crashed, ending all subsequent processing, including the time keeping and memory management code. Thus, we were never able to determine exactly how long the program ran or how much memory it consumed. From system monitoring tools left running on the system, we determined that processing took an average of about 2 hrs and that the program consumed all available memory resources. Saving the NB took about 10 min and saving it as PDF took about 25 min. Visualizing the resized dataset via scrolling or PgDn, after closing and restarting Mathematica and reloading the NB, was unreasonably slow. Thus, visualization had to be conducted on the PDF file. The NB and PDF files were approximately 2.3 and 3.6 GB, respectively. PDF visualization proceeded without issue using Acrobat Reader 9.5.5.

Finally, testing and retesting the program on our datasets confirmed that $HistoryLength=0 (zero) clears memory and cache. Full file logging and debugging (see the Bonus Topic section) was used only for the third dataset.


/Image content identification – background Identifying image contents is more commonplace today. Just go to Google Images (https://images.google.com), upload or drag and drop an image into the browser, and voila. Google makes ICI easy. Many other web sites exist too, some of which offer APIs, for a price. There are also various software frameworks and tools, some of which also offer APIs. With many to choose from, each has its benefits and shortcomings. Yet, at this time there does not appear to be a DF framework with this capability.

Some will ask, why Mathematica? Others, why not instead use a dedicated, locally hosted API solution? We prefer Mathematica because it lets us apply a wide array of operations against images in various ways, with capabilities beyond those of advanced IP programs and tools. Other mathematical frameworks provide capabilities similar to Mathematica's; which is better is open to debate.

In the interest of full disclosure, we have observed that when conducting ICI, Mathematica will sometimes communicate with Wolfram's servers. When it does, it is obtaining training data. We verified this with Wolfram; their response was that Mathematica occasionally updates its ICI training data, but that no data concerning actual images, thumbnails or other identifying data is transmitted to Wolfram at any time. In our tests, with a network packet analyser in place, we observed no image data being transmitted from our system to Wolfram. Nevertheless, such a situation is not acceptable for those working in highly secure, classified or air-gapped environments. In such circumstances, it is possible to use Wolfram Cloud Services installed on a local LAN, where training and other data can be securely stored and updates installed from outside media. Integration specifics are carried out in collaboration with a Wolfram account representative.

Mathematica's ICI uses neural networks based on the downloaded training data. In the event an expert witness is required to testify about how it works, Wolfram may be willing to provide details to the expert. The level of detail will depend on many factors outside the scope of this article. Fortunately, much research has already been done on image identification algorithms, leaving the expert with a good starting point. For more information about where to start, please email the article's author directly (email at end of article).

/Image content identification – the specifics Our ICI code is straightforward and based on previously discussed concepts. Before looking at the code, it is important to understand the true purpose of ICI – anyone can identify the contents of a few images. However, identifying contents from hundreds, thousands or more images quickly becomes an overwhelming task. Thus, automation aids us in this undertaking.

With so much information comes the need to store results. We recommend using a flat file consisting of a simple record format. This keeps things simple and permits text-based data processing using various utilities and tools. We propose the simple record format shown in Record Format below.

File Name: /home/username/Desktop/Article 2/DSC_0398.JPG
SHA256 Hash: fd5d4cfc6f63fbf1036aeb5b78cb911eaf1b4825910750ef049859773766d043
File Format: JPEG
Image Size: {3008, 2000}
Disk Size: 1470487
Identification:
======
General: domestic dog
Good Guess: Tibetan mastiff
Specific: Bouvier des Flandres
End of Record

Record Format: Proposed data record format with example data.

When conducting ICI we instruct Mathematica to identify the contents with varying levels of specificity. In our program, we chose to work with three levels of precision ranging from general, to good estimate, to very specific. Regardless of the specificity used, ICI is rarely exact. Many factors can influence results, and no program or API is foolproof. Consider how humans perceptually differentiate between multiple subjects in an image; how can we expect software to do it as reliably? What about backgrounds and objects at a distance? Any service, program or tool claiming a very high level of accuracy is to be treated with suspicion. Every image is different, and humans, subject to bias and personal preferences, are often the ones putting these datasets together.

So why try in the first place? Because doing it manually against tens or hundreds of thousands of images is a daunting task, even for a team. Consider also that some images are just too offensive to risk chronic exposure. We are not using Mathematica to sift out pornographic images, as high-quality software already exists for that, although we could if we trained it. Content classification training can be used to identify images corresponding to a set of criteria; we will examine this in an upcoming article.

Our ICI program, shown in Code Snippet 5, is longer than the code seen thus far, but it is straightforward and based on concepts we have already seen. The most important part of this code to examine in detail is ImageIdentify[], highlighted in red. To use it, we specify the variable containing the image; in our program, images are stored in tmp. We then indicate the level of required specificity using PerformanceGoal and SpecificityGoal. Together, these parameters determine the precision used. The former specifies the speed with which the determination is made; typical values are "Speed" and "Quality". The latter signifies the actual level of precision to be used; "Low" and "High" are common values, but a numerical value between 0 and 1 is also acceptable.
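Before reading the full program, a single-image sketch of these calls may help; the file name here is only an illustration and the commented results echo the example in Record Format above.

(* Identify one image at two levels of specificity *)
tmp = Import["DSC_0398.JPG"];
lowres = ImageIdentify[tmp, PerformanceGoal -> "Speed", SpecificityGoal -> "Low"];
highres = ImageIdentify[tmp, PerformanceGoal -> "Quality", SpecificityGoal -> "High"];
CommonName[lowres]    (* general category, e.g. "domestic dog" *)
CommonName[highres]   (* a more specific guess *)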

(* PROGRAM INITIALIZATION & MEMORY MANAGEMENT (Begin) ... *)

(* PROGRAM TIME KEEPING (Begin)... *)

(* PROGRAM BODY = *)
...
inputfile = "datasetlist.txt";
outputfile = "output.txt";
errorfile = "errors.txt";
sysinfofile = "systeminfo.txt";

length = Length@Import[inputfile, "Lines"];
StringForm["Number of files to be processed from input file: ``\n", length]

inputstream = OpenRead[inputfile];
outputstream = OpenWrite[outputfile];
errorstream = OpenWrite[errorfile];
sysinfostream = OpenWrite[sysinfofile];
$Messages = Append[$Messages, errorstream];

Timing[For[i = 0, i < length, i++;
   filename = Read[inputstream, String];
   tmp = Import[filename];
   WriteString[outputstream, "File Name:\t", filename, "\n"]
   WriteString[outputstream, "SHA256 Hash:\t", IntegerString[FileHash[filename, "SHA256"], 16], "\n",
    "File Format:\t", FileFormat[filename], "\n",
    "Image Size:\t", ImageDimensions[tmp], "\n",
    "Disk Size:\t", FileByteCount[filename], "\n",
    lowres = ImageIdentify[tmp, PerformanceGoal -> "Speed", SpecificityGoal -> "Low"];
    midres = ImageIdentify[tmp, PerformanceGoal -> "Speed", SpecificityGoal -> "High"];
    highres = ImageIdentify[tmp, PerformanceGoal -> "Quality", SpecificityGoal -> "High"];
    "Identification:\n======\n",
    "General:\t", CommonName[lowres], "\n",
    "Good Guess:\t", CommonName[midres], "\n",
    "Specific:\t", CommonName[highres], "\n",
    "End of Record\n\n"]]]

(* WOLFRAM SYSTEM INFORMATION REPORTING (OPTIONAL) ... *)

(* PROGRAM TIME KEEPING (End)... *)


(* MEMORY MANAGEMENT (End)... *)

(* CLOSE UP PROGRAM & STREAMS ... *)

Code Snippet 5: Multi-image ICI program.

Various techniques can be used to optimize Code Snippet 5, including calling ImageIdentify[] fewer times per image. Loop optimization and additional code parallelization may also help, though these are more advanced topics.

/Image identification results We recognize that ICI is very hard, and producing consistently accurate results is not possible given the wide array of images investigators will encounter. After running our ICI program against the first two datasets, we validated the results. Our methodology was simple: if the primary object/subject of an image is correctly identified, we give it a score of 1; if the contents are partially identified, we give it a ½ point; finally, we score 0 (zero) if the contents are altogether incorrectly identified.

Examples of scoring may be helpful. Consider the second resized image in Figure 4; it is an image of the Sun captured using a hydrogen-alpha filter, in black and white. It was identified as Uranus – ImageIdentify[] was sort of right, so it gets a ½ point. Identifying a pigeon as a turtledove also earns a ½ point. Incorrectly recognizing the breed of a dog is of little concern where generalized ICI is concerned; so long as the image is identified as a dog, it earns a full point. Recognizing a tree as a flower scores 0 (zero). Identifying a cityscape as "architecture" is also a ½ point.

Based on this logic, our primary dataset scored 70/121, or 57.9% correct. However, the devil is in the details. Of the 121 images in the set, 50 were correctly identified and 40 were sort of right, with partially identified contents. Only 31, or 25.6%, of the images were incorrectly identified. In our opinion, the results are comparable to other frameworks; we compared them against Google's Image Recognition service, as it is readily available and free to use. Recognition of specific types or groupings of images could have been improved had we trained Mathematica using a classifier. Overall, we were pleased with the results. Some of the images contained many objects and subjects, and identifying such contents is open to interpretation even by humans.

Overall, results from the second dataset were very good. The "persons" were all recognized as persons or as the clothes they wore, with the exception of model #10, who was oddly recognized as a bathtub. Improving the results for all images is possible – this will be examined in a follow-up article.

We did not validate the results for the third dataset, given its size. However, ICI completed without issue.

Average runtime results for the three datasets were 273 sec, 16 sec and 16,840 sec (or 4.68 hrs), respectively. Average peak memory usage was 2,306,530,696 bytes, 606,035,336 bytes and 4,230,674,640 bytes, respectively. These results are in stark contrast to those from image visualization.

Finally, testing and retesting of the programs confirms that setting $HistoryLength=0 (zero) clears memory and cache. Full file logging and debugging (see Bonus Topic section) was used only for the third dataset and added negligible overhead.

/Bonus Topic: Debugging The time will come when program debugging becomes necessary. Basic debugging in Mathematica is similar to other languages: by printing processing status information, we can determine what is working, what is failing and what is taking too long. Such information can be printed to the screen, but if file-based error logging is used then, to maintain error-logging consistency, it should be sent to errorstream.

Debugging information can go anywhere in a program, but it is typically placed where the bulk of processing occurs. For non-parallelized loops, this is straightforward. All we need to do is replace "$Messages=Append[$Messages, errorstream];" with "$Messages=errorstream;" and reset errorstream at the end of the program. Then, in our loop we add a WriteString[] statement. For example, "WriteString[errorstream, "Processing file: "<>filename<>"\n"];" will print the name and location of the file currently being processed to errorstream, maintaining debugging consistency.

Unless error messages are explicitly left on, they will by default stop displaying after a minimum threshold is reached. Switching messages on or off requires specifying the message type or group. The error messages we encountered while running the ICI program (Code Snippet 5) against our third dataset were:

Divide::infy
Clip::nord
Infinity::indet
ImageIdentify::imginv
CommonName::noent
General::stop
ImageDimensions::imginv

Adding the appropriate On[] statements for these messages and rerunning the program provided us with consistent debugging output, and so we were able to readily isolate the problematic data files and remove them from the input file.
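As a sketch, re-enabling a few of the messages listed above so that they continue to be reported might look like this:

(* Keep these messages switched on so every occurrence is logged *)
On[ImageIdentify::imginv];
On[ImageDimensions::imginv];
On[General::stop];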

When working with parallelized, non-looping code (i.e., Code Snippet 4), we insert debugging information using a different approach, because concurrent file-based logging is more difficult to implement. The easy way around this problem is to log output directly to the NB. To keep track of which data file is currently being processed by any given thread, we make use of Range[] and MapThread[], as shown in Code Snippet 6, where the concurrent output code is highlighted in red.

images = Parallelize[Map[Import[#] &, list]];
resizedImages = Parallelize[Map[ImageResize[#, {200}] &, images]];
indicator = Range[Length[resizedImages]]
Parallelize[Image[#, Magnification -> 1] & /@ resizedImages]
Parallelize[MapThread[(Print[#1]; Image[#2, Magnification -> 1]) &, {indicator, resizedImages}]]

Code Snippet 6 boils down to this: create a new variable, indicator, to store the number associated with the image being processed, based on its position in resizedImages. If one of the kernels crashes, the last number printed identifies the culprit. This is the most direct technique for identifying the culprit in parallelized, non-looping code, as it keeps track of I/O concurrency.

We may eventually look at debugging parallelized loop-based code.

/Conclusion In this article, we have examined important issues with respect to computer forensics and computer vision technology. Mathematica has shown itself to be quite capable in this field but, like any software, it is not without faults. Nevertheless, the resources required to conduct in-depth image analysis can be monumental; it is for this reason that we have proposed a software-based aid.

There remains little else for us to explore with respect to image visualization. It is straightforward for small to medium-sized datasets, and up to several thousand images can be visualized in a single pass, given sufficient memory resources.

While it is possible to use open source platforms and math frameworks to provide the functionality we have explored thus far, there is no one-stop shop. Many things need to be strung together to get it done just right, something that will take a great deal of time and patience for the uninitiated. That said, those looking for alternatives to Mathematica can get started with Python and SciPy, which provide similar functionality.

Much remains to be done for image content identification; we have only scratched the surface in this article. In follow-up articles, we'll look at other ways to improve its accuracy and extend its reach to additional forms of digital evidence, including video. While other frameworks and tools exist to perform what we have examined, none provides the same extent of combinatory functionality, which we'll be exploring soon.

/Acknowledgments We sincerely express our gratitude to Wolfram Premier Service technical support, specifically Danny Finn, Kyle Martin, Wang Zhang, Abrita Chakravarty and Lin Guo.

/AUTHOR BIO R. Carbone has been working for Defence R&D Canada since 2001. He has been working as an infosec analyst and researcher for the last seven years and has published numerous articles and case studies. He is also an open source software expert and was the co-author for a Government of Canada study that influenced federal government policy on the adoption of open source. Finally, he is a certified digital forensic investigator, incident handler, and malware reverse engineer. Defence R&D Canada [email protected] Head Shot – See image DRDC-Graphicbar-3inch-300dpi.jpg

/FEATURE IN BRIEF

/Subject Matter Image analysis; symbolic and mathematical programming (procedural and functional)

/Skill Level 4-5

/Requirements
- Very powerful workstation or desktop computer (i7 (6-core / 12 with Hyper-Threading) or Xeon (8-core / 16 with Hyper-Threading) or greater);
- Lots of memory (64 GiB or greater);
- Lots of disk space for storing the very large notebook files that Mathematica can generate;
- Fast disk(s) for reading and writing Mathematica's notebooks;
- Very stable 64-bit computer operating system, preferably one that supports 64+ GiB RAM;
- Hardware RAID storage if there will be lots of data to be stored and processed;
- Familiarity with Mathematica or another symbolic computer-assisted mathematics framework, such as, but not limited to, Maple, Matlab, GNU Octave, Scilab, R, etc.;
- Familiarity with batch processing;
- Knowledge of procedural or functional programming paradigms;
- A basic knowledge of "big data" and how it can be processed and analysed will be helpful.

/Additional Reading Mathematica's documentation is very extensive and up to date. Many books have been written on Mathematica programming and interactive program development. For a full listing of known books relating to Mathematica, covering a variety of levels, see http://www.wolfram.com/books/?source=nav.

As for digital forensics using Mathematica, no publicly available books or treatises are known on the subject.

/BOXOUTS

/Boxout 1 – Parallelizing the For loop and common gotcha Non-parallelized loops can be parallelized using ParallelDo[]. However, this requires modifying the loop structure. It may be of considerable help when replacing a for loop, but may require more thought prior to implementation.
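A minimal sketch of such a restructuring follows; it assumes the filenames are stored in list, as in the earlier snippets, and the resizing body is only an example. Note that variables assigned inside the parallel body (such as tmp here) live on the subkernels, so results must be gathered explicitly (for example with ParallelTable) if they are needed afterwards.

(* Replace a serial loop with ParallelDo; per-image work happens on the subkernels *)
ParallelDo[
  tmp = Import[list[[i]]];
  ImageResize[tmp, {200}],
  {i, Length[list]}
]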

/Boxout 2 – Thumbnail[]


Through experimentation, we have determined that large dataset visualization requires significant amounts of memory, including swap, and may take a lot of time. To save on processing and memory, we can substitute ImageResize[] with Thumbnail[]. This saves us from having to resize images and store the results in memory; instead, we immediately resize and display, saving a step. However, with this function we cannot readily modify the on-screen size.
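A one-line sketch of the substitution, again assuming list holds the filenames as in Code Snippet 4:

(* Display thumbnails directly instead of resizing and storing full-size copies *)
Parallelize[Thumbnail[Import[#]] & /@ list]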

/Boxout 3 – Image preprocessing to improve identification accuracy Image content identification accuracy can be improved by preprocessing images. Their backgrounds can often be removed, if only partially, and large images can be split into smaller chunks. This can be done using RemoveBackground[] and ImagePartition[], respectively. When running these functions with Parallelize[], some speed-up may be experienced.
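A brief sketch of both preprocessing steps applied to one imported image tmp; the 500-pixel tile size is an arbitrary example, not a recommended value.

(* Strip the background, then tile the image and identify each tile *)
foreground = RemoveBackground[tmp];
tiles = Flatten[ImagePartition[tmp, 500]];   (* flat list of 500 x 500-pixel tiles *)
ImageIdentify /@ tiles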

/Boxout 4 – Saving very large notebooks Saving (or opening) very large notebooks in Mathematica can crash it or take what feels like an eternity. Sometimes saving a notebook to PDF can also fail. There appears to be an underlying issue when saving very large notebook-based output. Our recommendation is to consider saving very large notebooks first as PDF and then in the native notebook format.

/SIDEBARS

/Sidebar 1 – Stopping runaway calculations To stop a runaway calculation, because it is taking too long or has caused the system to become unresponsive, press Alt+, (comma) within Mathematica; a popup will be displayed asking whether to continue or abort. This is a form of graceful shutdown that should prevent changes made to the running notebook from being lost. Killing Mathematica from the Task Manager, on the other hand, will lose these changes.

/Sidebar 2 – The front and back ends Mathematica consists of two main components: the front end and the back end (or kernel). The front end performs all I/O while the kernel(s) perform calculations. The front end runs on the user's computer, while the kernel(s) can run on the local system, on a remote system, or on both. This division of labour enables compute nodes to be added, or computations to be run entirely on more powerful remote systems.
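A minimal sketch of adding compute capacity from within a session; the kernel count of 4 is an arbitrary example.

(* Launch extra local kernels and list what is currently running *)
LaunchKernels[4];
Kernels[]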

/Sidebar 3 – Slow, interrupted or frozen scrolling or typing in a notebook Sometimes when typing or scrolling in a notebook, responsiveness may become very slow or appear frozen for many seconds. To fix this, delete all "*.m" files located in $UserBaseDirectory/.Mathematica/SystemFiles/FrontEnd/SystemResources/FunctionalFrequency. It may be necessary to restart the framework for this to take effect.