
No black magic: Text processing on the command line

Barbara Plank (1994)

Motivation

What is Linux?

• Linux is an operating system (OS) based on UNIX

• OS: the software layer between the computer hardware and the other software running on it

• UNIX: developed in the US in 1969 at AT&T / Bell Labs

• In 1991 Linus Torvalds developed the first Linux version: “I'm doing a (free) operating system (just a hobby, won't be big …”

• Unix philosophy: Build functionality out of small programs that do one thing and do it well

Slide inspired by: https://software.rc.fas.harvard.edu/training/intro_unix/

Data Storage

• All Linux machines are connected to a central storage

• Data is stored in two types of structures:

• folders/directories

• files

File system structure

• Directories (folders) contain files and subdirectories (subfolders)

• Organized in a hierarchical (tree) structure

• Each folder is itself part of another folder

• The only exception is the root folder (/), which is not part of any other folder

• root of the file system hierarchy is always: /

• paths can be absolute or relative, e.g. /home/bplank/data vs data/ or ../data

• Commonly used directories:

• . current working directory

• .. parent directory

• ~ home directory of the user (for me: /home/bplank == ~bplank)

Navigating the file system

cd data/001/ (change directory)

mkdir project (creates a directory called “project”)

ls /home/bplank (list content of directory)
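A short example session putting these three commands together (the folder names follow the data/001 and project examples above; the exact contents of your own directories will differ):

$ cd data/001/ (go into the folder data/001, relative to the current directory)
$ mkdir project (create a new subfolder called “project”)
$ ls (project should now show up in the listing)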

Referencing Files

• A file can be referenced by the folders that lead to it plus the filename itself

• For example, the file example.txt in folder barbara which is under the folder users which is under the root directory can be referenced as:

• /users/barbara/example.txt

• This is called an absolute path

What is the command line?

$ command prompt

this window is called the terminal

which is the program that allows us to interact with the shell

the shell is an environment that executes the commands we type in at the command prompt

What is the command line?


REPL: read-eval-print loop, very different from the well-known graphical user interface

Input/Output process model

• Shell programs do I/O (input/output) with the terminal, using three streams:

(Diagram: inside the shell environment (e.g. the Bash shell), the shell program reads stdin from the keyboard (INPUT) and writes stdout and stderr to the display (OUTPUT))

• Interactively, you rarely notice that there are separate stdout and stderr streams (today we won’t worry about stderr)

Unix philosophy

• combine many small programs for advanced functionality

(Diagram: inside the shell environment (e.g. the Bash shell), several shell programs are chained: the stdout of one program becomes the stdin of the next, and the final stdout is printed to the display)

Why (still) the command line?

• Advantages:

• allows you to be agile (REPL vs. edit-compile-run-debug cycle)

• the command line is extensible and complementary

• automation and reproducibility

• to run jobs on big clusters of computers (HPC computing)

• Disadvantage:

• takes some time to get acquainted

First shell commands

• Type text (your command) after the prompt ($), followed by ENTER:

• pwd: print working directory (shows the current location in the filesystem)
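For example (the path shown is my home directory; you will see your own location instead):

$ pwd
/home/bplank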

Shell command: Structure

• A shell command (or shell program) usually takes parameters: arguments (required) and options (optional)

• Shell program with argument(s)

• cat text.txt

• cat text1.txt text2.txt text3.txt

• With argument and option:

• cat -n text.txt (prefix every line by line number)

Note

• shell commands are CaSE SeNsItVe

pwd PWD Pwd pwD

• spaces have special meanings (do not use them for file names or folder names)

Where to get help

• To know what options and arguments a command takes consult the man (manual) pages:

man cat (or even man man, the manual of the man command itself)

• Use q to quit

Tips

• TAB (use auto-completion)

• use the arrow up key to reload a command from your command history (or, more advanced, search the history of commands with CTRL+R)

• CTRL+D or CTRL+C or just ‘q’ to quit

Word frequency list

Prerequisite: Copy file

• Copy the text file from my home directory to yours:

cp /home/bplank/text.txt .   (command name: cp (copy); arg1: what?; arg2: where to?)

• Check if the file is in your directory with ls

Inspect files

• head text.txt

prints out the first ten lines of the file

• Try out the following commands - what do they do?

• tail text.txt

• cat text.txt

• less text.txt (continue with SPACE or arrow UP/DOWN; quit by typing q)

line-based processing

• head text.txt

prints out the first (by default) ten lines of the file

• head -4 text.txt

prints out the first 4 lines of the file

I/O redirection to files

• The output of shell commands can be redirected to files instead of the screen, and their input can be read from files instead of the keyboard

• Append to any command:

> myfile    send stdout to a file called myfile

< myfile    send the content of myfile as input (stdin) to a program

2> myfile   send stderr to a file called myfile

line-based processing and I/O redirection

• head text.txt

equivalent to

head < text.txt

• head -1 text.txt > tmp

prints out the first line of the file and stores it in file tmp

• Exercise: store the last 4 lines of the file text.txt in a file called footer.txt
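One possible solution (a sketch; tail is the counterpart of head and prints the last lines of a file):

tail -n 4 text.txt > footer.txt

Recipe for counting words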

• An algorithm:

a. split text into one word per line (tokenize)

b. sort words

c. count how often each word appears

a) split text: one word per line

• translate characters in set A into B; A = set of characters, B = single character (here \n, the newline); -s squeezes repeats (e.g. multiple blanks), -c takes the complement of A

tr -sc '[a-zA-Z]' '\n' < text.txt

• More examples:

• tr -sc '[a-zA-Z0-9]' '\n' < text.txt

• tr -sc '[:alnum:]' '\n' < text.txt

• tr -sc '[:alnum:]@#' '\n' < tweets.txt

b) sorting lines of text: sort

• sort FILE

• sort -r (reverse sort)

• sort -n (numeric)

• sort -nr (reverse numeric sort)

• Exercise:

• try out the sort command with the different options above on the file /home/bplank/numbers

c) count words = count duplicate lines in a sorted text file: uniq -c

• uniq assumes a SORTED file as input!

uniq -c SORTEDFILE

• Exercise: frequency list of numbers in file

• sort the numbers file and save it (> redirect to file) in a new file called numSorted

• now use uniq -c to count how often each number appears

• Solution:

sort -n /home/bplank/numbers > numSorted

uniq -c numSorted
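The same two-step pattern, together with tr from step a), already yields a word frequency list for text.txt (a sketch; words and wordsSorted are hypothetical intermediate file names):

tr -sc '[:alnum:]' '\n' < text.txt > words (a. one word per line)
sort words > wordsSorted (b. sort the words)
uniq -c wordsSorted (c. count how often each word appears)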

Now we have seen all necessary “ingredients” for our recipe on counting words

• An algorithm:

a. split text into one word per line (tokenize)

b. sort words

c. count how often each word appears

The UNIX game

commands ~ bricks

building more powerful tools by combining bricks using the pipe: |

The Pipe |

• Unix philosophy: combine many small programs

(Diagram: inside the shell environment (e.g. the Bash shell), the stdout of tr -sc '[:alnum:]' '\n' is piped into sort, whose stdout is piped into uniq -c; the pipe | is used as “glue” between shell programs)

Word frequency list

• combining the three single commands (tr, sort, uniq):

tr -sc '[:alnum:]' '\n' < text.txt | sort | uniq -c

• < text.txt specifies the input for the first program

• the commands are combined using the pipe (the | symbol), i.e., the stdout of the previous command is the stdin of the next command

The Pipe: |

tr -sc '[:alnum:]' '\n' < text.txt | sort | uniq -c

(Diagram: the stdout of tr -sc '[:alnum:]' '\n' becomes the stdin of sort, whose stdout becomes the stdin of uniq -c; the final stdout is printed to the display)

Using pipe to avoid extra files

• without pipe (2 commands = 2 REPL steps), e.g.: sort -n numbers > numSorted, then uniq -c numSorted

• with pipe (no intermediate file necessary! 1 REPL step), e.g.: sort -n numbers | uniq -c

Alternative to split: sed

• sed (replace) command: sed 's/WHAT/WITH/g' FILE

sed 's/ /\n/g' text.txt

• What happens if you leave out g?

• Try the following (with and without g): sed 's/I/**YOU**/g' /home/bplank/short.txt

tr

• Another use of tr:

tr '[:upper:]' '[:lower:]' < text.txt

• Extra exercise: Merge upper and lower case by downcasing everything
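One way to do this (a sketch combining the commands seen so far; downcase first, then build the frequency list):

tr '[:upper:]' '[:lower:]' < text.txt | tr -sc '[:alnum:]' '\n' | sort | uniq -c

Exercise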

• Extract the 10 most frequent hashtags from the file /home/bplank/tweets.txt (hint: create a word frequency list first and then use sort and head)

Also, use the command grep "^#" in your pipeline to extract words that start with a hashtag (we will see grep again later)
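One possible pipeline (a sketch; it assumes hashtags consist of alphanumeric characters after the # sign):

tr -sc '[:alnum:]@#' '\n' < /home/bplank/tweets.txt | grep "^#" | sort | uniq -c | sort -nr | head -10

I. What we have seen so far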

• What is UNIX, what is the command line, and why (still) use it

• Inspecting a file on the command line

• Creating a word frequency list (sed, sort, uniq, tr, and the pipe), extracting the most frequent words

• File system and navigation

Overview

• Bigrams, working with tabular data

• Searching files with grep

• A final tiny story

Bigram = word pairs

• Algorithm:

• tokenize by word

• print word_i and word_i+1 next to each other

• count

Print words next to each other: the paste command

• paste FILE1 FILE2

• if your two files contain lists of words, prints them next to each other

get next word

• create a file with one word per line

• create a second file from the first, but which starts at the second line:

tail -n +2 file > next [start at the second line and output everything until the end]

Bigrams

Exercise: find the 5 most frequent bigrams of text.txt

Solution:

• Find the 5 most common bigrams

• Extra: Find the 5 most common trigrams
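One possible solution (a sketch; words and next are hypothetical intermediate file names):

tr -sc '[:alnum:]' '\n' < text.txt > words
tail -n +2 words > next
paste words next | sort | uniq -c | sort -nr | head -5

For trigrams, create a third file that starts at the third word (tail -n +3 words > next2) and paste all three files before counting.

Tabular data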

• paste FILES (in contrast to cat)

• cut -f1 FILE (cut out the first column from FILE)

• Exercise: create a frequency list from column 4 in file parses.conll

• cut -f 4 parses.conll | sed '/^$/d' | sort | uniq -c | sort -nr   (sed '/^$/d' deletes empty lines)

grep

• grep finds lines that match a given pattern

• grep “star” text.txt

grep

• grep finds patterns specified as regular expressions; the name stands for “globally search for a regular expression and print”

• grep is a filter: you only keep certain lines of the input

• e.g., words that end with -ing:

• grep -w "[a-z]*ing" text.txt

• Exercises: try the above command:

• without -w option

• with the -o and -w option (or -ow for shorthand)

• what do the -v and -i options do? use man grep to find out

grep

• grep gh keep lines containing gh

• grep -i gh keep lines containing gh independent of casing (gH GH..)

• grep “^ch” keep lines beginning with ch

• grep “ing$” keep lines ending with ing

• grep -v “gh” do NOT keep lines containing gh

Counting: wc

• Counting lines (-l), words and characters in a file:

• wc FILE (prints the number of lines, words and characters, followed by the file name)

• Why is the number of words different?

Exercises with grep & wc

• How many uppercase words are in text.txt?

• How many 4-letter words?

• How many “1 syllable” words are there (with exactly one vowel)?

stop words

a tiny story (real-world example)

in the end.. I never seem to remember when the New York Fashion Week takes place…

New York Fashion week

• we’ll consult the New York Times (web API) to find out….

• Step 1: get the data

New York Fashion week

• Step 2: combine the results

Extract year-month

• Extract year and month and sort by frequency to get a first impression
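A sketch of what this could look like, assuming the combined results have been reduced to one publication date in YYYY-MM-DD format per line in a hypothetical file dates.txt (cut -c selects characters instead of tab-separated fields):

cut -c 1-7 dates.txt | sort | uniq -c | sort -nr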

References

[1] Nikolaj Lindberg. egrep for Linguists. http://stts.se/egrep_for_linguists/egrep_for_linguists.pdf

[2] Ken W. Church (1994). Unix for Poets. http://cst.dk/bplank/refs/UnixforPoets.pdf

[3] Jeroen Janssens (2014). Data Science at the Command Line. O’Reilly.

[4] Jurafsky & Martin. Speech and Language Processing. 2nd edition (2009).