The Eighth International Conference on Data Analytics September 22, 2019 to September 26, 2019 - Porto, Portugal

Tutorial: Analyzing the Enron Corpus with the Shell
Presenter: Andreas Schmidt

Exercise I

Task 0: Preparation step

Go to the tutorial page and download the 1% extract from the Enron dataset. Unzip and unpack it using the following command:

tar xvfz enron_mail_extract_0_01.tgz

Task 1: Who sent the most emails?

Approach: Look for the header line starting with From: … in each email, extract the email address, and count the number of times each address appears. Then sort the addresses by the number of times they appear and output the last record.

Needed tools:

• Extract email address: grep
• Count the number of occurrences: uniq
• Output last record: tail
• Also needed: cut, sort

First step: Extract the email addresses from the files:

grep '^From: ' maildir/*/*/*.

Problems:

1. The output also contains the name of the file in which the pattern was found. We don't need this information, so we suppress it with grep's -h option (please try …).
2. Some emails seem to have multiple From: fields. Why? How can we solve this? One possibility is to look only in the header section of an email. In this case we have to split the email into header and body using the csplit command. Alternatively, we look only for the first occurrence of the specified pattern in each file (using the -m option).

So, our next try is:

grep -m1 -h '^From: ' maildir/*/*/*.
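The effect of both options can be checked on a tiny stand-in for the corpus. The following is a minimal sketch; the directory layout, file names and addresses are made up for illustration and are not taken from the Enron data.

```shell
# Toy data (hypothetical): mail 1 quotes another mail in its body,
# so "From: " appears at the start of two lines in that file.
demo=$(mktemp -d)
mkdir -p "$demo/maildir/user-a/inbox"
printf '%s\n' \
    'From: alice@example.com' \
    'Subject: Re: meeting' \
    '' \
    'Forwarded text follows:' \
    'From: bob@example.com' \
    > "$demo/maildir/user-a/inbox/1."
printf '%s\n' 'From: carol@example.com' 'Subject: hello' '' 'Hi!' \
    > "$demo/maildir/user-a/inbox/2."
cd "$demo" || exit 1

# Default behaviour: every match is prefixed with the file name,
# and both matching lines of mail 1 are printed.
grep '^From: ' maildir/*/*/*.

# -h drops the file name, -m1 stops after the first match in each file.
first_from=$(grep -m1 -h '^From: ' maildir/*/*/*.)
printf '%s\n' "$first_from"
```

The second command prints exactly one From: line per file, which is what the counting steps need.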

Looks better. Now extract the email address. This can be done with grep, using the -o option and specifying the pattern of an email address. Alternatively, we can treat the current output as two columns separated by a space. So if we only want the second column, we can do:

grep -m1 -h '^From: ' maildir/*/*/*. | cut -d' ' -f2

Version: 21/09/2019
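For comparison, the grep -o alternative mentioned above could look roughly like this. It is only a sketch: the sample lines are invented, and the regular expression is a deliberately simple approximation of an email address, not a complete one.

```shell
# Hypothetical header lines standing in for the output of the first grep.
headers='From: alice@example.com
From: bob.smith@example.org'

# -o prints only the part of the line that matches the pattern;
# -E switches to extended regular expressions.
addresses=$(printf '%s\n' "$headers" |
    grep -oE '[[:alnum:]._%+-]+@[[:alnum:].-]+')
printf '%s\n' "$addresses"
```

On these two lines the result is the same as with cut -d' ' -f2; the two approaches differ only when a From: line contains more than the bare address.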

With the uniq command, we can now count how often identical email addresses appear in our output. One restriction of uniq is that the input must be sorted, so we first sort it with the sort command (no options are needed, because we have only one column and want to sort alphabetically).
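Why the sort step matters can be seen on three invented lines: uniq only merges identical lines that are adjacent. A small sketch with made-up addresses:

```shell
# Hypothetical sender list; the two alice lines are not adjacent.
senders='alice@example.com
bob@example.com
alice@example.com'

# Without sorting, uniq -c reports three groups, each with count 1.
unsorted=$(printf '%s\n' "$senders" | uniq -c)
# After sorting, the identical lines are adjacent and are counted together.
sorted=$(printf '%s\n' "$senders" | sort | uniq -c)

printf 'unsorted:\n%s\nsorted:\n%s\n' "$unsorted" "$sorted"
```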

grep -m1 -h '^From: ' maildir/*/*/*. | cut -d' ' -f2 | sort | uniq -c

OK, now we want to sort the output by the first column, so that we find the most active email writer at the end of our output. In this case we want a numerical sort order, so we have to specify the -n option (we don't have to specify the column, because by default the sort key starts at the beginning of the line, which is the count).

grep -m1 -h '^From: ' maildir/*/*/*. | cut -d' ' -f2 | sort | \
uniq -c | sort -n

And finally, output only the last line …

grep -m1 -h '^From: ' maildir/*/*/*. | cut -d' ' -f2 | sort | \
uniq -c | sort -n | tail -n1

Fantastic!!!

Now, we want to answer another query …

Task 2: How many emails were sent in each year?

Use the previously developed solution as a draft …
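One possible direction, shown on invented data rather than on the corpus: if the Date: header has the form used in the sample lines below, the year is the fifth space-separated field. That layout is an assumption and should be checked against a few real mails before relying on it.

```shell
# Hypothetical Date: lines, as `grep -m1 -h '^Date: '` might print them.
dates='Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
Date: Mon, 4 Dec 2000 02:09:00 -0800 (PST)
Date: Tue, 30 Oct 2001 09:12:00 -0800 (PST)'

# Assumed field layout: "Date:" / weekday / day / month / year / ...
per_year=$(printf '%s\n' "$dates" | cut -d' ' -f5 | sort | uniq -c)
printf '%s\n' "$per_year"
```

Compared with Task 1, only the header name and the field number change; the final sort -n and tail are not needed because every year should be listed.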

Task 3: Define a shell function to avoid unnecessary work?

The first filter command (grep) differs only slightly between many header fields, so we can write a shell function and use that instead. Please type the following code into your shell (or copy & paste it) [1].

function single_line_header() {
  grep -m1 -h "^$1: " maildir/*/*/*.
}

Now enter in your shell:

single_line_header Subject

Nice, isn't it?
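Because the function writes to standard output, it can be combined with the filters from Task 1. The sketch below runs on a made-up two-mail directory so that it works without the corpus; the function body is the one from this task, written in the portable name() form.

```shell
# Toy mail directory (hypothetical data).
demo=$(mktemp -d)
mkdir -p "$demo/maildir/user-a/sent"
printf '%s\n' 'From: alice@example.com' 'Subject: Budget' '' 'text' \
    > "$demo/maildir/user-a/sent/1."
printf '%s\n' 'From: alice@example.com' 'Subject: Lunch' '' 'text' \
    > "$demo/maildir/user-a/sent/2."
cd "$demo" || exit 1

# Same body as the function defined in this task.
single_line_header() {
    grep -m1 -h "^$1: " maildir/*/*/*.
}

# Task 1 pipeline, with the grep step replaced by the function.
top_sender=$(single_line_header From | cut -d' ' -f2 |
    sort | uniq -c | sort -n | tail -n1)
printf '%s\n' "$top_sender"

# Subjects may contain spaces, so keep everything from field 2 onwards.
subjects=$(single_line_header Subject | cut -d' ' -f2-)
printf '%s\n' "$subjects"
```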

[1] The function is defined only in your current shell. If you want a function to be available in all shell instances, define it in your ~/.bashrc or ~/.profile file.

Task 4 (optional/homework): Extract multi-line headers from an email

We can't use the previous approach to extract the header information for the recipients (To:) or the (blind) carbon copies (Cc:, Bcc:), because these fields can span multiple lines. With the help of csplit, we can split the email into different parts, which are then written into separate files. The most important thing is to define the regular expressions that split the original file.

As a starting point, take a look at the file maildir/ybarbo-p/inbox/244. and think about how to define the pattern for the Cc: header field:

1. We want to start at the line starting with Cc:, and
2. we want to end before the line starting with 'Mime-Version:'.

… and how will this pattern differ for the To: and Bcc: header fields?

So a possible solution would be the following patterns: %^Cc:% (skip to, but not including the matching line), followed by /^[A-Z]/ (copy up to but not including the matching line).

csplit maildir/ybarbo-p/inbox/244. %^Cc:% /^[A-Z]/
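What the two patterns do can be tried on a small invented mail before running them on the corpus. This is a sketch under assumptions: the file mail.txt and its content are hypothetical, and the -s option (which suppresses the byte counts) is used here only to keep the output short.

```shell
demo=$(mktemp -d)
cd "$demo" || exit 1

# Hypothetical mail; the continuation line of the Cc: field starts with a tab.
printf 'To: alice@example.com\nSubject: test\nCc: bob@example.com, carol@example.com,\n\tdave@example.com\nMime-Version: 1.0\n\nBody text\n' > mail.txt

# %^Cc:%   skips everything before the Cc: line (nothing is written for it).
# /^[A-Z]/ ends the first piece before the next line starting with a capital.
csplit -s mail.txt '%^Cc:%' '/^[A-Z]/'

echo '--- xx00 ---'; cat xx00
echo '--- xx01 ---'; cat xx01
```

The continuation line does not match /^[A-Z]/ because it begins with whitespace, so it stays in the same piece as the Cc: line.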

Take a look at the files xx00 and xx01 in the current directory. Which one do we need? As a result, we want each email address on a separate line. Hints: use a filter such as tr (perhaps multiple times) to put each email address on its own line. Afterwards you can easily eliminate the Cc: string with tail. To get rid of the output from csplit (the byte counts), take a look at the manpage.

As in the previous task, create a function named multi_line_header that takes as input a file and a header field (like To, Cc, Bcc) and returns the content of the field, with one entry per line.

Test the defined function with the following call:

multi_line_header maildir/ybarbo-p/inbox/244. Cc

The desired output should look like this:

paul.y'[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
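One way such a function might be put together is sketched below. It is not the reference solution of the tutorial: the temporary directory, the part prefix and the sample mail are assumptions made for illustration, and the sketch expects that the requested field exists and is followed by another header line.

```shell
# Arguments: file first, then header field, as in the test call above.
multi_line_header() {
    local file=$1 field=$2 tmp
    tmp=$(mktemp -d) || return 1
    # Cut out the field together with its continuation lines.
    # -s suppresses the byte counts, -f sets the prefix of the output files.
    if csplit -s -f "$tmp/part" "$file" "%^$field:%" '/^[A-Z]/'; then
        # Turn commas, blanks and tabs into newlines and squeeze repeats,
        # then drop the first line, which holds only the field name.
        tr -s ', \t' '\n' < "$tmp/part00" | tail -n +2
    fi
    rm -rf "$tmp"
}

# Hypothetical mail to try the function without the corpus.
demo=$(mktemp -d)
printf 'To: alice@example.com\nCc: bob@example.com, carol@example.com,\n\tdave@example.com\nMime-Version: 1.0\n\nBody text\n' > "$demo/mail.txt"

cc_list=$(multi_line_header "$demo/mail.txt" Cc)
printf '%s\n' "$cc_list"
```

Writing the pieces into a temporary directory keeps xx00-style files out of the current directory when the function is called repeatedly.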

Possible error messages using cygwin:

Syntaxfehler beim unerwarteten Wort `$'\r'':

syntax error near unexpected token `$'\r'

Solution: run dos2unix.exe on the file

(dos2unix.exe must be installed separately using the Cygwin installer)
