Enron-Exercise-1.Pdf


The Eighth International Conference on Data Analytics
September 22, 2019 to September 26, 2019 - Porto, Portugal

Tutorial: Analyzing the Enron Corpus with the Shell
Presenter: Andreas Schmidt
Version: 21/09/2019

Exercise I

Task 0: Preparation step

Go to the tutorial page and download the 1% extract from the Enron dataset. Unzip and unpack it using the command:

    tar xvfz enron_mail_extract_0_01.tgz

Task 1: Who sent the most emails?

Approach: Look for the header line starting with From: … in each file, extract the email address, and count the number of times each address appears. Then sort the addresses by the number of times they appear and output the last record (line).

Needed tools:
• Extract email address: grep
• Count the number of occurrences: uniq
• Output last record: tail
• Also needed: cut, sort

First step: Extract the email addresses from the files:

    grep '^From: ' maildir/*/*/*.

Problems:
1. The output also contains the name of the file in which the pattern was found. We don't need this information, so we omit it using the -h option of grep (please try …).
2. Some emails seem to have multiple From fields. Why? How can we solve this? One possibility is to look only at the header section of an email. In this case we have to split the email into a header and a body part using the command csplit. Alternatively, we only look for the first occurrence of the specified pattern (using the -m option).

So, our next try is:

    grep -m1 -h '^From: ' maildir/*/*/*.

Looks better. Now extract the email address. This can be done with grep, using the -o option and specifying the regular expression of an email address. Alternatively, we can treat the current output as two columns separated by a space. So if we only want the second column, we can do:

    grep -m1 -h '^From: ' maildir/*/*/*. | cut -d' ' -f2

With the uniq command, we can now count the number of times identical email addresses appear in our output.
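To get a feeling for what uniq -c reports, the following minimal sketch runs it on a toy list of addresses. The addresses and the variable names are made up for this illustration and are not part of the Enron data. It compares the counts for the raw list with the counts for the sorted list.

```shell
# Toy input: made-up addresses, deliberately not in sorted order.
addresses='alice@example.com
bob@example.com
alice@example.com'

# uniq -c only merges identical lines that directly follow each other,
# so the two alice lines are counted separately here.
unsorted_counts=$(printf '%s\n' "$addresses" | uniq -c)

# After sorting, identical lines are neighbours and are counted together.
sorted_counts=$(printf '%s\n' "$addresses" | sort | uniq -c)

printf 'without sort:\n%s\n\nwith sort:\n%s\n' "$unsorted_counts" "$sorted_counts"
```

The first listing shows three lines, the second only two, with a count of 2 for the repeated address.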
One restriction of uniq is that the input must be sorted, so we first sort it using the command sort (no options are needed, because we have only one column and want to sort alphabetically).

    grep -m1 -h '^From: ' maildir/*/*/*. | cut -d' ' -f2 | sort | uniq -c

OK, now we want to sort the output by the first column, so that we find the most active email writer at the end of our output. In this case we want a numerical sort order, so we have to specify the -n option (we don't have to specify a column, because sort uses the whole line as the key by default and -n compares the number at its beginning, which is our count).

    grep -m1 -h '^From: ' maildir/*/*/*. | cut -d' ' -f2 | sort | \
        uniq -c | sort -n

And finally, output only the last line …

    grep -m1 -h '^From: ' maildir/*/*/*. | cut -d' ' -f2 | sort | \
        uniq -c | sort -n | tail -n1

Fantastic!!! Now, we want to answer another query …

Task 2: How many emails were sent in each year?

Use the previously developed solution as a draft …

Task 3: Define a shell function to avoid unnecessary typing

The first filter command (grep) differs only slightly between header fields, so we can write a shell function and use it instead. Please type the following code into your shell (or copy & paste it).¹

    function single_line_header() {
        grep -m1 -h "^$1: " maildir/*/*/*.
    }

Now enter in your shell:

    single_line_header Subject

Nice, isn't it?

¹ The function is defined only in your current shell. If you would like to define a function that is available in all shell instances, define it in your ~/.bashrc or ~/.profile file.

Task 4 (optional/homework): Extract multi-line headers from an email

We can't use the previous approach to extract the header information for the recipients (To:) or the (blind) carbon copies (Cc:, Bcc:), because these fields can extend over multiple lines. With the help of csplit, we can split the email into different parts, which are then written into separate files. The most important thing is to find the regular expressions that split the original file.
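Before working on a real message, it may help to try csplit on a tiny example. The following is a minimal sketch that splits a made-up email into header and body at the first empty line. The file name toy_mail.txt, the output prefix part_, the variable workdir, and the message content are assumptions for this illustration; -f is the csplit option that sets the prefix of the output files.

```shell
# Use a scratch directory so that no existing files are overwritten.
workdir=$(mktemp -d)

# A made-up message: header lines, one empty line, then the body.
printf '%s\n' \
  'From: alice@example.com' \
  'To: bob@example.com' \
  'Subject: Test' \
  '' \
  'From: this line belongs to the body.' > "$workdir/toy_mail.txt"

# /^$/ copies everything up to, but not including, the first empty line
# into part_00; the rest (empty line plus body) goes into part_01.
# csplit prints the size of each piece in bytes.
csplit -f "$workdir/part_" "$workdir/toy_mail.txt" '/^$/'

# Searching only the header part finds just the real From: header.
grep '^From: ' "$workdir/part_00"
```

The same idea addresses the problem of multiple From: lines in Task 1: a line in the body that happens to start with From: ends up in the second piece and is not searched.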
As a starting point, take a look at the file maildir/ybarbo-p/inbox/244. and think about how to define the pattern for the Cc: header field:
1. We want to start at the line starting with Cc:, and
2. end before the line starting with 'Mime-Version:'

… and how will this pattern differ for the To: and Bcc: header fields?

So a possible solution would be the following patterns: %^Cc:% (skip to, but not including, the matching line), followed by /^[A-Z]/ (copy up to, but not including, the matching line).

    csplit maildir/ybarbo-p/inbox/244. '%^Cc:%' '/^[A-Z]/'

Take a look at the files xx00 and xx01 in the current directory. Which one do we need?

As a result, we want each email address on a separate line.

Hints: Use tr (perhaps multiple times) to put each email address on its own line. Afterwards you can easily eliminate the Cc: string with tail. To get rid of the output from csplit (byte count), take a look at the man page.

As in the previous task, create a function named multi_line_header that takes as input a file name and a header field (like To, Cc, Bcc) and returns the content of the field, with one entry per line. Test the defined function with the following call:

    multi_line_header maildir/ybarbo-p/inbox/244.mail Cc

The desired output should look like this:

    paul.y'[email protected]
    [email protected]
    [email protected]
    [email protected]
    [email protected]

Possible error message when using Cygwin (German locale, followed by the English equivalent):

    Syntaxfehler beim unerwarteten Wort `$'\r''
    syntax error near unexpected token `$'\r''

Solution: Run dos2unix.exe on the file (dos2unix.exe must be installed separately using the Cygwin installer).
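The error above is caused by carriage return characters (\r) at the line ends of files saved with Windows line endings. If dos2unix is not at hand, one possible workaround is to delete these characters with tr. The following is only a sketch on a made-up script file; the names demo_dir, demo_crlf.sh and demo_unix.sh are assumptions for this illustration.

```shell
# Scratch directory for the demonstration files.
demo_dir=$(mktemp -d)

# Create a made-up one-line script with a Windows (CRLF) line ending.
printf 'echo hello\r\n' > "$demo_dir/demo_crlf.sh"

# tr -d '\r' deletes every carriage return. Write the result to a new
# file; redirecting into the file that is being read would empty it.
tr -d '\r' < "$demo_dir/demo_crlf.sh" > "$demo_dir/demo_unix.sh"

# od -c shows the characters of the cleaned copy: \r no longer appears.
od -c "$demo_dir/demo_unix.sh"
```

Note that tr removes every carriage return in the file, not only those at line ends, which is usually fine for shell scripts.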