No black magic: Text processing at the UNIX command line
Barbara Plank (1994)
Motivation

What is Linux?
• Linux is an operating system (OS) based on UNIX
• OS: software layer between computer hardware and other software on top
• UNIX: developed in the US in 1969 at AT&T / Bell Labs
• In 1991 Linus Torvalds developed the first Linux version: “I'm doing a (free) operating system (just a hobby, won't be big …”
• Unix philosophy: Build functionality out of small programs that do one thing and do it well
Slide inspired by: https://software.rc.fas.harvard.edu/training/intro_unix/

Data Storage
• All Linux machines are connected to a central storage
• Data is stored in two types of structures:
• folders/directories
• files

File system structure
• Directories (folders) contain files and subdirectories (subfolders)
• Organized in a tree structure
• Each folder is itself part of another folder
• The only exception is the root folder (/), which is not part of any other folder

File system
• root of the file system hierarchy is always: /
• paths can be absolute or relative, e.g. /home/bplank/data vs data/ or ../data
• Commonly used directories:
• . current working directory
• .. parent directory
• ~ home directory of user (for me: /home/bplank == ~bplank)

Navigating the file system
• cd change directory, e.g. cd data/001/
• mkdir project creates a directory called “project”
• ls list content of directory, e.g. ls /home/bplank
• pwd print working directory

Referencing Files
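The navigation commands can be tried in a short session; the folder names here are made up for illustration:

```shell
mkdir -p project/data   # -p also creates the parent folder (an extra option, not shown above)
cd project/data         # move into the subfolder
pwd                     # prints the absolute path of the current directory
cd ../..                # back up two levels using the relative path ..
ls project              # lists: data
```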
• A file can be referenced by the folders that lead to it plus the filename itself
• For example, the file example.txt in folder barbara which is under the folder users which is under the root directory can be referenced as:
• /users/barbara/example.txt
• This is called an absolute path

What is the command line?
$ (the command prompt)
• this window is called the terminal, which is the program that allows us to interact with the shell
• the shell is an environment that executes the commands we type in at the command prompt
• REPL: read-eval-print loop, very different from the well-known graphical user interface

Input Output process model
• Shell programs do I/O (input/output) with the terminal, using three streams:
[Diagram: Keyboard → stdin → shell program → stdout → Display, with a separate stderr stream, all inside the shell environment (e.g. Bash shell)]
• Interactively, you rarely notice there are separate stdout and stderr (today we won’t worry about stderr)

Unix philosophy
• combine many small programs for more advanced functionality
[Diagram: stdin → shell program → shell program → shell program → stdout, chained inside the shell environment (e.g. Bash shell)]

Why (still) the command line?
• Advantages:
• allows you to be agile (REPL vs edit-compile-run-debug cycle)
• the command line is extensible and complementary
• automation and reproducibility
• to run jobs on big clusters of computers (HPC computing)
• Disadvantage:
• takes some time to get acquainted

First shell commands
• Type text (your command) after the prompt ($), followed by ENTER:
• pwd: print working directory (shows the current location in the filesystem)
Shell command: Structure
• A shell command (or shell program) usually takes parameters: arguments (required) and options (optional)
• Shell program with argument(s)
• cat text.txt
• cat text1.txt text2.txt text3.txt
• With argument and option:
• cat -n text.txt (prefix every line by its line number)

Note
• shell commands are CaSE SeNsItVe
pwd PWD Pwd pwD
• spaces have special meanings (do not use them for file names or folder names)

Where to find help
• To know what options and arguments a command takes consult the man (manual) pages:
man whoami man cat
• Use q to exit

Tips
• use the arrow up key to reload a command from your command history (or, more advanced, search the history of commands)
• Copy the text file from my home directory to yours:
cp /home/bplank/text.txt .
(cp: command name; arg1: what to copy; arg2: where to copy it)
• Check if the file is in your directory with ls

Inspect files
• head text.txt
prints out the first ten lines of the file
• Try out the following commands - what do they do?
• tail text.txt
• cat text.txt
• less text.txt (continue with SPACE or arrow UP/DOWN; quit by typing q)

line-based processing
• head text.txt
prints out the first (by default) ten lines of the file
• head -4 text.txt
prints out the first 4 lines of the file

I/O redirection to files
• Shell commands can be redirected to write to files instead of to the screen, or read from files instead of the keyboard
• Append to any command:
> myfile send stdout to file called myfile
< myfile send content of myfile as input to some program
2> myfile send stderr to file called myfile

line-based processing and I/O redirection
• head text.txt
equivalent to
head < text.txt
• head -1 text.txt > tmp
prints out the first line of the file and stores it in file tmp
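A minimal sketch of both redirections, using a small stand-in file rather than the course's text.txt:

```shell
printf 'first\nsecond\nthird\n' > demo.txt   # create a small sample file
head -1 demo.txt > tmp    # stdout goes to the file tmp instead of the screen
cat tmp                   # shows: first
head -1 < demo.txt        # same output, input redirected from the file
```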
• Exercise: store the last 4 lines of the file text.txt in a file called footer.txt

Recipe for counting words
• An algorithm:
a. split text into one word per line (tokenize)
b. sort words
c. count how often each word appears

a) split text: word per line
• tr translates characters from set A into B (A = a set of characters, B = a single character, e.g. \n newline); -s squeezes repeated characters, -c takes the complement of the set
tr -sc '[a-zA-Z]' '\n' < text.txt
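To see the effect without a file, the same tr command can be fed a line directly (the sample sentence is made up):

```shell
# Every non-letter becomes a newline; -s squeezes runs of newlines into one
printf 'Hello, world! Hello UNIX.\n' | tr -sc '[a-zA-Z]' '\n'
# output: Hello / world / Hello / UNIX (one word per line)
```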
• More examples:
• tr -sc '[a-zA-Z0-9]' '\n' < text.txt
• tr -sc '[:alnum:]' '\n' < text.txt
• tr -sc '[:alnum:]@#' '\n' < tweets.txt

b) sorting lines of text: sort
• sort FILE
• sort -r (reverse sort)
• sort -n (numeric)
• sort -nr (reverse numeric sort)
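The difference between the options shows up already on a three-line stand-in for /home/bplank/numbers:

```shell
printf '10\n2\n1\n' > nums
sort nums      # lexicographic: 1, 10, 2
sort -n nums   # numeric: 1, 2, 10
sort -nr nums  # reverse numeric: 10, 2, 1
```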
• Exercise:
• try out the sort command with the different options above on the file: /home/bplank/numbers

c) count words = count duplicate lines in a sorted text file: uniq -c
• uniq assumes a SORTED file as input!
uniq -c SORTEDFILE
• Exercise: frequency list of numbers in file
• sort the numbers file and save it (> redirect to file) in a new file called numSorted
• now use uniq -c to count how often each number appears
• Solution:
sort -n /home/bplank/numbers > numSorted
uniq -c numSorted
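On a tiny stand-in file the solution's two steps look like this:

```shell
printf '3\n1\n3\n2\n3\n1\n' > numbers   # a small stand-in for /home/bplank/numbers
sort -n numbers > numSorted             # uniq needs sorted input
uniq -c numSorted                       # each number prefixed by its count: 2x 1, 1x 2, 3x 3
```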
Now we have seen all necessary “ingredients” for our recipe on counting words
• An algorithm:
a. split text into one word per line (tokenize)
b. sort words
c. count how often each word appears

The UNIX game
• commands ~ bricks
• building more powerful tools by combining bricks using the pipe: |

The Pipe |
• Unix philosophy: combine many small programs
[Diagram: Keyboard → stdin → tr -sc '[:alnum:]' '\n' → sort → uniq -c → stdout → Display; the pipe | is the “glue” between shell programs, all inside the shell environment (e.g. Bash shell)]

Word frequency list
• combining the three single commands (tr,sort,uniq):
tr -sc '[:alnum:]' '\n' < text.txt | sort | uniq -c
specify the input for the first program with <; combine commands using the pipe (the | symbol), i.e., the stdout of the previous command is the stdin of the next command

The Pipe: | tr -sc '[:alnum:]' '\n' < text.txt | sort | uniq -c
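The whole pipeline can be tried on an inline sentence instead of text.txt (sample text made up):

```shell
printf 'the cat and the hat\n' | tr -sc '[:alnum:]' '\n' | sort | uniq -c
# counts: 1 and, 1 cat, 1 hat, 2 the
```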
Using pipe to avoid extra files
• without pipe (2 commands = 2 REPLs): an intermediate file is needed between the steps
• with pipe (no intermediate file necessary! 1 REPL)

Alternative to split text: sed
• sed (replace) command: sed 's/WHAT/WITH/g' FILE
sed 's/ /\n/g' text.txt
• What happens if you leave out g?
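A quick check of the difference (the input line is made up):

```shell
printf 'a a a\n' | sed 's/a/X/g'   # all occurrences: X X X
printf 'a a a\n' | sed 's/a/X/'    # only the first:  X a a
```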
• Try the following (with and without g): sed 's/I/**YOU**/g' /home/bplank/short.txt

tr
• Another use of tr:
tr '[:upper:]' '[:lower:]' < text.txt
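A one-line check (the input is made up); downcasing merges The/THE/the before counting:

```shell
printf 'The THE the\n' | tr '[:upper:]' '[:lower:]'   # the the the
```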
• Extra exercise: merge upper and lower case by downcasing everything

Exercise
• Extract the 10 most frequent hashtags from the file /home/bplank/tweets.txt (hint: create a word frequency list first and then use sort and head)
• Also, use the command grep "^#" (or grep "#") in your pipeline to extract words that start with a hashtag; we will see grep again later

I. What we have seen so far
• What is UNIX, what is the command line, why
• Inspecting a file on the command line
• Creating word frequency lists (sed, sort, uniq, tr, and the pipe), extracting the most frequent words
• File system and navigation

Overview
• Bigrams, working with tabular data
• Searching files with grep
• A final tiny story

Bigram = word pairs
• Algorithm:
• tokenize by word
• print word_i and word_i+1 next to each other
• count

Print words next to each other
• paste command
• paste FILE1 FILE2
• if your two files contain lists of words, prints them next to each other

get next word
• create a file with one word per line
• create a second file from the first, but which starts at the second line:
tail -n +2 file > next [start with the second line and output all until the end]

Bigrams
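Putting the steps together on a tiny made-up text (text.txt from the exercise plays the same role):

```shell
printf 'a b a b a c\n' > tiny.txt
tr -sc '[:alnum:]' '\n' < tiny.txt > words   # one word per line
tail -n +2 words > next                      # the same list, shifted by one line
paste words next | sort | uniq -c | sort -nr | head -5   # most frequent bigrams
```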
Exercise: find the 5 most frequent bigrams of text.txt

Solution:
• Find the 5 most common bigrams
• Extra: Find the 5 most common trigrams

Tabular data
• paste FILES (joins files side by side; in contrast, cat concatenates them one after another)
• cut -f1 FILE (cut out first column from FILE)
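cut on a two-column tab-separated stand-in file:

```shell
printf 'a\t1\nb\t2\n' > table.tsv
cut -f1 table.tsv    # first column: a, b
cut -f2 table.tsv    # second column: 1, 2
```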
• Exercise: create a frequency list from column 4 in file parses.conll
• cut -f 4 parses.conll | sed '/^$/d' | sort | uniq -c | sort -nr

grep
• grep finds lines that match a given pattern
• grep “star” text.txt

grep
• grep finds patterns specified as regular expressions (the name: globally search for a regular expression and print)
• grep is a filter: you only keep certain lines of the input
• e.g., words that end with -ing:
• grep -w "[a-z]*ing" text.txt
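The effect of -w and -o can be seen on a small made-up file:

```shell
printf 'going home\nan ingot\nnothing\n' > sample.txt
grep -w "[a-z]*ing" sample.txt    # whole words only: keeps "going home" and "nothing"
grep "[a-z]*ing" sample.txt       # also keeps "an ingot" ("ing" inside a word)
grep -ow "[a-z]*ing" sample.txt   # prints only the matching words: going, nothing
```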
• Exercises: try the above command:
• without -w option
• with the -o and -w options (or -ow as shorthand)
• what do the -v and -i options do? Use man grep to find out

grep
• grep gh keep lines containing gh
• grep -i gh keep lines containing gh independent of casing (gH GH..)
• grep “^ch” keep lines beginning with ch
• grep “ing$” keep lines ending with ing
• grep -v “gh” do NOT keep lines containing gh

Counting: wc
• Counting lines (-l), words and characters in a file:
• wc FILE
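wc on a small stand-in file:

```shell
printf 'one two\nthree\n' > sample.txt
wc sample.txt        # 2 lines, 3 words, 14 characters
wc -l sample.txt     # lines only
```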
• Why is the number of words different?

Exercises with grep & wc
• How many uppercase words are in text.txt?
• How many 4-letter words?
• How many “1 syllable” words are there (with exactly one vowel)?

stop words

a tiny story (real-world example)
in the end.. I never seem to remember when the New York Fashion Week takes place…

New York Fashion Week
• we’ll consult the New York Times (web API) to find out….
• Step 1: get the data
• Step 2: combine the results

Extract year-month
• Extract year and month and sort by frequency to get a first impression
References
[1] Nikolaj Lindberg. egrep for Linguists. http://stts.se/egrep_for_linguists/egrep_for_linguists.pdf
[2] Ken W. Church (1994). Unix for Poets. http://cst.dk/bplank/refs/UnixforPoets.pdf
[3] Jeroen Janssens (2014). Data Science at the Command Line. O’Reilly.
[4] Jurafsky & Martin (2009). Speech and Language Processing. 2nd edition.