Making Sense of Email Addresses on Drives
Total Page:16
File Type:pdf, Size:1020Kb
Journal of Digital Forensics, Security and Law Volume 11 Number 2 Article 10 2016 Making Sense of Email Addresses on Drives Neil C. Rowe U.S. Naval Postgraduate School Riqui Schwamm U.S. Naval Postgraduate School Michael R. McCarrin U.S. Naval Postgraduate School Ralucca Gera U.S. Naval Postgraduate School Follow this and additional works at: https://commons.erau.edu/jdfsl Part of the Computer Engineering Commons, Computer Law Commons, Electrical and Computer Engineering Commons, Forensic Science and Technology Commons, and the Information Security Commons Recommended Citation Rowe, Neil C.; Schwamm, Riqui; McCarrin, Michael R.; and Gera, Ralucca (2016) "Making Sense of Email Addresses on Drives," Journal of Digital Forensics, Security and Law: Vol. 11 : No. 2 , Article 10. DOI: https://doi.org/10.15394/jdfsl.2016.1385 Available at: https://commons.erau.edu/jdfsl/vol11/iss2/10 This Article is brought to you for free and open access by the Journals at Scholarly Commons. It has been accepted for inclusion in Journal of Digital Forensics, Security and Law by an authorized administrator of (c)ADFSL Scholarly Commons. For more information, please contact [email protected]. Making Sense of Email Addresses on Drives JDFSL V11N2 MAKING SENSE OF EMAIL ADDRESSES ON DRIVES Neil C. Rowe1, Riqui Schwamm1, Michael R. McCarrin1 and Ralucca Gera2 U.S. Naval Postgraduate School 1Computer Science Department 2Applied Mathematics Department Monterey, California 93943 USA 1-831-656-2462, fax 1-831-656-2814, [email protected] ABSTRACT Drives found during investigations often have useful information in the form of email addresses, which can be acquired by search in the raw drive data independent of the file system. Using these data, we can build a picture of the social networks in which a drive owner participated, even perhaps better than investigating their online profiles maintained by social-networking services, because drives contain much data that users have not approved for public display. However, many addresses found on drives are not forensically interesting, such as sales and support links. We developed a program to filter these out using a Naïve Bayes classifier and eliminated 73.3% of the addresses from a representative corpus. We show that the byte-offset proximity of the remaining addresses found on a drive, their word similarity, and their number of co-occurrences over a corpus are good measures of association of addresses, and we built graphs using this data of the interconnections both between addresses and between drives. Results provided several new insights into our test data. Keywords: digital forensics, electronic mail, email, addresses, users, filtering, networks, visualization INTRODUCTION service or page owner. A better source could be the contacts users store on their computers Finding social networks is important in and devices in the form of electronic-mail investigating organized crime and terrorism. (email) addresses, telephone numbers, street Social networks can be learned from addresses, and personal names. Some may be connections recorded by social-networking information of which users are not aware services and discussion-forum Web pages. because it was stored by software or left in However, this is public information, and is unallocated space after deletions. often of limited value in a criminal investigation in which subjects conceal key This work focuses on email-address data. information about their contacts. Such data can indicate personal connections by Furthermore, license agreements often prohibit the nearby placement of pairs of addresses on a automated analysis of such data without drive, similarity of words in address pairs permission, and much user data may not be suggesting aliases, or repeated association of accessible without the cooperation of the address pairs on a set of drives. We may thus © 2016 ADFSL Page 153 JDFSL V11N2 Making Sense of Email Addresses on Drives be able to infer networks of contacts of varying extract all email addresses, their offsets on the connection strengths from the data on one drive, and their preceding and following drive alone. We can then use several methods characters. Email addresses consist of a to build visualizations of the networks to make username of up to 64 characters, a “@”, and a them easier to understand. A key obstacle is set of domain and subdomain names delimited that many addresses on a drive are by periods. Bulk Extractor bypasses the file uninteresting because businesses and system and searches the raw drive bytes for organizations put contact addresses into their patterns appearing to be email addresses, so it Web pages and software, so we must first can find addresses in deleted files and slack exclude these. Note that seeking addresses on space. On our corpus, this resulted in drives has advantages over inspecting the 292,347,920 addresses having an average of records of a mail server because it can find 28.4 characters per address, of which there older addresses removed from a server or were 17,544,550 distinct addresses. The potential contacts never used to send mail. number of files on these drives was 61,365,153, so addresses were relatively infrequent, though LITERATURE REVIEW they were more common for mobile devices. Most work on visualizing social networks has Sample output of Bulk Extractor: focused on social-networking sites where links between people are made explicit through a 65484431478 [email protected] variety of friend and contact links (Holzer, vy.mil>\x0A> To: "'ttfaculty@cs Malin, and Sweeney, 2005; Zhou et al, 2006; 69257847997 [email protected] Polakis et al, 2010). They can also be 0\x1E\x06\x09*\x86H\x86\xF7\x0D\x01\ extracted from text by nearby mentions of x09\x01\x16\[email protected]\x00\x0 names of people (Laclavik et al, 2012), though 0\x00\x00\x00\x00\x00\x00\x00\x00\x00\ nearness may be coincidental. So far there has x00\x00\x00\x00\x00 been little attention to the mining of addresses Bulk Extractor’s email-address scanner from drives, their classification, or their seeks domain/subdomain lists containing at connection to social networks. There has been least one period and punctuating delimiters in work on the classification of email messages front and behind the address. Currently, Bulk from their message headers (Lee, Shishibori, Extractor handles common compressed formats and Ando, 2007), but headers provide but cannot recognize non-ASCII addresses. significantly richer contextual information than Recent standards for international email in the lists of email addresses found scattered over a IETF’s RFC 6530 allow the full Unicode drive. character set in both usernames and domains TEST SETUP (Klensin and Ko, 2012) so that must be considered in the future. Most of our corpus This work primarily used the Real Data predates the standard. Corpus (Garfinkel et al, 2009), a collection of currently 2401 drives from 36 countries after This work primarily used an “email excluding 961 empty or unreadable drives. stoplist” provided by the U.S. National These drives were purchased as used Institute of Standards and Technology (NIST) equipment and represent a range of business, by running Bulk Extractor on a portion of government, and home users. We ran the their software collection. Email addresses Bulk Extractor tool (Bulk Extractor, 2013) to found inside software are likely software contact information and unlikely to be Page 154 © 2016 ADFSL Making Sense of Email Addresses on Drives JDFSL V11N2 interesting, but this is not guaranteed because 102,426 personal names including both software developers may have inadvertently given names and surnames. We left personal addresses, or deliberately put obtained international coverage by them in to enable unauthorized data leakage. using a variety of resources including We supplemented this list with three other proposed baby names. Another sources: known “blackhole” addresses used for approximately 15,000 were found by forwarding spam inspecting the training set and stop list. (gist.github.adamloving/4401361, 482 items), 3,515 names of email and messaging known scam email addresses servers. Some had words like “mail” (www.sistersofspam.co.uk, 2030 items), and and “chat” in their domains, but others addresses found by us using Bulk Extractor on required Web research to determine fresh installs of standard Windows and their function. Macintosh operating systems (332 items). 2,813 “generic usernames” representing After eliminating duplicates, the stoplist had generic addresses such as business 496,301 addresses total. contacts. These we identified by We also created four important word lists manually inspecting the training set for use in interpreting the words within and stoplist. Table 1 gives examples. addresses: Table 1 Example generic usernames 809,858 words of the online dictionaries auctions backup beauty used in (Rowe, Schwamm, and callcenter conference download Garfinkel 2013). The English word list editor feedback group was 223,525 words from two dictionary homepage ideas jobsindia sources and included morphological linux marketinfo nobody variants. 587,258 additional words outlet passwords registration were from Croatian, Czech, Dutch, response save testing Finnish, French, German, Greek, tickets update yourmatches Hungarian, Italian, Norwegian, Polish, Portuguese, Romanian, Serbian, ELIMINATING Spanish, Swedish, and Turkish dictionaries and transliterated to UNINTERESTING ASCII. We did not consider languages ADDRESSES like Arabic and Chinese whose A first step is to remove addresses unlikely to transliteration to ASCII was be interesting in a forensic investigation. We problematic. We converted words to have clues in both the username and the lower case, eliminated those that were domain/subdomain names. For instance, also personal names and generic [email protected] uses two human usernames, and eliminated non-English one-letter and two-letter words since names in the username, and a domain name that includes “mail” to suggest a mail server; they caused too many false matches to these clues suggest it is an interesting address.