
No black magic: Text processing on the command line

Barbara Plank (1994)

Motivation

What is Linux?

• Linux is an operating system (OS) based on UNIX

• OS: the software layer between the computer hardware and the other software running on it

• UNIX: developed in the US in 1969 at AT&T / Bell Labs

• In 1991 Linus Torvalds developed the first Linux version: “I'm doing a (free) operating system (just a hobby, won't be big …”

• Unix philosophy: Build functionality out of small programs that do one thing and do it well

Slide inspired by: https://software.rc.fas.harvard.edu/training/intro_unix/

Data Storage

• All Linux machines are connected to a central storage

• Data is stored in two types of structures:

• folders/directories

• files

File system structure

• Directories (folders) contain files and subdirectories (subfolders)

• Organized in a hierarchical (tree) structure

• Each folder is itself part of another folder

• The only exception is the root folder (/), which is not part of any other folder

• root of the file system hierarchy is always: /

• paths can be absolute or relative, e.g. /home/bplank/data vs data/ or ../data

• Commonly used directories:

• . current working directory

• .. parent directory

• ~ home directory of the user (for me: /home/bplank == ~bplank)

Navigating the file system

cd data/001/ (change directory)

mkdir project (creates a directory called “project”)

ls /home/bplank (list content of directory)
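A short example session putting these three commands together (the folder names follow the data/001 and project examples above; the exact contents of your own directories will differ):

$ cd data/001/ (go into the folder data/001, relative to the current directory)
$ mkdir project (create a new subfolder called “project”)
$ ls (project should now show up in the listing)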

Referencing Files

• A file can be referenced by the folders that lead to it plus the filename itself

• For example, the file example.txt in folder barbara which is under the folder users which is under the root directory can be referenced as:

• /users/barbara/example.txt

• This is called an absolute path

What is the command line?

$ command prompt

this window is called the terminal

which is the program that allows us to interact with the shell

the shell is an environment that executes the commands we type in at the command prompt

What is the command line?


REPL: read-eval-print loop, very different from the well-known graphical user interface

Input/Output process model

• Shell programs do I/O (input/output) with the terminal, using three streams:

(Diagram: inside the shell environment (e.g. the Bash shell), the shell program reads stdin from the keyboard (INPUT) and writes stdout and stderr to the display (OUTPUT))

• Interactively, you rarely notice that there are separate stdout and stderr streams (today we won’t worry about stderr)

Unix philosophy

• combine many small programs for advanced functionality

(Diagram: inside the shell environment (e.g. the Bash shell), several shell programs are chained: the stdout of one program becomes the stdin of the next, and the final stdout is printed to the display)

Why (still) the command line?

• Advantages:

• allows you to be agile (REPL vs. edit-compile-run-debug cycle)

• the command line is extensible and complementary

• automation and reproducibility

• to run jobs on big clusters of computers (HPC computing)

• Disadvantage:

• takes some time to get acquainted

First shell commands

• Type text (your command) after the prompt ($), followed by ENTER:

• pwd: print working directory (shows the current location in the filesystem)
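For example (the path shown is my home directory; you will see your own location instead):

$ pwd
/home/bplank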

Shell command: Structure

• A shell command (or shell program) usually takes parameters: arguments (required) and options (optional)

• Shell program with argument(s)

• cat text.txt

• cat text1.txt text2.txt text3.txt

• With argument and option:

• cat -n text.txt (prefix every line by line number)

Note

• shell commands are CaSE SeNsItVe

pwd PWD Pwd pwD

• spaces have special meanings (do not use them for file names or folder names)

Where to get help

• To know what options and arguments a command takes consult the man (manual) pages:

man cat (or even man man, the manual of the man command itself)

• Use q to quit

Tips

• TAB (use auto-completion)

• use the arrow up key to reload a command from your command history (or, more advanced, search the history of commands with CTRL+R)

• CTRL+D or CTRL+C or just ‘q’ to quit

Word frequency list

Prerequisite: Copy file

• Copy the text file from my home directory to yours:

cp /home/bplank/text.txt .   (command name: cp (copy); arg1: what?; arg2: where to?)

• Check if the file is in your directory with ls

Inspect files

• head text.txt

prints out the first ten lines of the file

• Try out the following commands - what do they do?

• tail text.txt

• cat text.txt

• less text.txt (continue with SPACE or arrow UP/DOWN; quit by typing q)

line-based processing

• head text.txt

prints out the first (by default) ten lines of the file

• head -4 text.txt

prints out the first 4 lines of the file

I/O redirection to files

• The output of shell commands can be redirected to files instead of the screen, and their input can be read from files instead of the keyboard

• Append to any command:

> myfile    send stdout to a file called myfile

< myfile    send the content of myfile as input (stdin) to a program

2> myfile   send stderr to a file called myfile

line-based processing and I/O redirection

• head text.txt

equivalent to

head < text.txt

• head -1 text.txt > tmp

prints out the first line of the file and stores it in file tmp

• Exercise: store the last 4 lines of the file text.txt in a file called footer.txt
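One possible solution (a sketch; tail is the counterpart of head and prints the last lines of a file):

tail -n 4 text.txt > footer.txt

Recipe for counting words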

• An algorithm:

a. split text into one word per line (tokenize)

b. sort words

c. count how often each word appears

a) split text: one word per line

• translate characters in set A into B; A = set of characters, B = single character (here \n, the newline); -s squeezes repeats (e.g. multiple blanks), -c takes the complement of A

tr -sc '[a-zA-Z]' '\n' < text.txt

• More examples:

• tr -sc '[a-zA-Z0-9]' '\n' < text.txt

• tr -sc '[:alnum:]' '\n' < text.txt

• tr -sc '[:alnum:]@#' '\n' < tweets.txt

b) sorting lines of text: sort

• sort FILE

• sort -r (reverse sort)

• sort -n (numeric)

• sort -nr (reverse numeric sort)

• Exercise:

• try out the sort command with the different options above on the file /home/bplank/numbers

c) count words = count duplicate lines in a sorted text file: uniq -c

• uniq assumes a SORTED file as input!

uniq -c SORTEDFILE

• Exercise: frequency list of numbers in file

• sort the numbers file and save it (> redirect to file) in a new file called numSorted

• now use uniq -c to count how often each number appears

• Solution:

sort -n /home/bplank/numbers > numSorted

uniq -c numSorted
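The same two-step pattern, together with tr from step a), already yields a word frequency list for text.txt (a sketch; words and wordsSorted are hypothetical intermediate file names):

tr -sc '[:alnum:]' '\n' < text.txt > words (a. one word per line)
sort words > wordsSorted (b. sort the words)
uniq -c wordsSorted (c. count how often each word appears)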

Now we have seen all necessary “ingredients” for our recipe on counting words

• An algorithm:

a. split text into one word per line (tokenize)

b. sort words

c. count how often each word appears

The UNIX game

commands ~ bricks

building more powerful tools by combining bricks using the pipe: |

The Pipe |

• Unix philosophy: combine many small programs

(Diagram: inside the shell environment (e.g. the Bash shell), the stdout of tr -sc '[:alnum:]' '\n' is piped into sort, whose stdout is piped into uniq -c; the pipe | is used as “glue” between shell programs)

Word frequency list

• combining the three single commands (tr, sort, uniq):

tr -sc '[:alnum:]' '\n' < text.txt | sort | uniq -c

• < text.txt specifies the input for the first program

• the commands are combined using the pipe (the | symbol), i.e., the stdout of the previous command is the stdin of the next command

The Pipe: |

tr -sc '[:alnum:]' '\n' < text.txt | sort | uniq -c

(Diagram: the stdout of tr -sc '[:alnum:]' '\n' becomes the stdin of sort, whose stdout becomes the stdin of uniq -c; the final stdout is printed to the display)

Using pipe to avoid extra files

• without pipe (2 commands = 2 REPL steps), e.g.: sort -n numbers > numSorted, then uniq -c numSorted

• with pipe (no intermediate file necessary! 1 REPL step), e.g.: sort -n numbers | uniq -c

Alternative to split: sed

• sed (replace) command: sed 's/WHAT/WITH/g' FILE

sed 's/ /\n/g' text.txt

• What happens if you leave out g?

• Try the following (with and without g): sed 's/I/**YOU**/g' /home/bplank/short.txt

tr

• Another use of tr:

tr '[:upper:]' '[:lower:]' < text.txt

• Extra exercise: Merge upper and lower case by downcasing everything
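One way to do this (a sketch combining the commands seen so far; downcase first, then build the frequency list):

tr '[:upper:]' '[:lower:]' < text.txt | tr -sc '[:alnum:]' '\n' | sort | uniq -c

Exercise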

• Extract the 10 most frequent hashtags from the file /home/bplank/tweets.txt (hint: create a word frequency list first and then use sort and head)

Also, use the command grep "^#" in your pipeline to extract words that start with a hashtag (we will see grep again later)
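One possible pipeline (a sketch; it assumes hashtags consist of alphanumeric characters after the # sign):

tr -sc '[:alnum:]@#' '\n' < /home/bplank/tweets.txt | grep "^#" | sort | uniq -c | sort -nr | head -10

I. What we have seen so far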

• What is UNIX, what is the command line, and why (still) use it

• Inspecting a file on the command line

• Creating a word frequency list (sed, sort, uniq, tr, and the pipe), extracting the most frequent words

• File system and navigation

Overview

• Bigrams, working with tabular data

• Searching files with grep

• A final tiny story

Bigram = word pairs

• Algorithm:

• tokenize by word

• print word_i and word_i+1 next to each other

• count

Print words next to each other: the paste command

• paste FILE1 FILE2

• if your two files contain lists of words, prints them next to each other

get next word

• create a file with one word per line

• create a second file from the first, but which starts at the second line:

tail -n +2 file > next [start at the second line and output everything until the end]

Bigrams

Exercise: find the 5 most frequent bigrams of text.txt

Solution:

• Find the 5 most common bigrams

• Extra: Find the 5 most common trigrams
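One possible solution (a sketch; words and next are hypothetical intermediate file names):

tr -sc '[:alnum:]' '\n' < text.txt > words
tail -n +2 words > next
paste words next | sort | uniq -c | sort -nr | head -5

For trigrams, create a third file that starts at the third word (tail -n +3 words > next2) and paste all three files before counting.

Tabular data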

• paste FILES (in contrast to cat)

• cut -f1 FILE (cut out the first column from FILE)

• Exercise: create a frequency list from column 4 in file parses.conll

• cut -f 4 parses.conll | sed '/^$/d' | sort | uniq -c | sort -nr   (sed '/^$/d' deletes empty lines)

grep

• grep finds lines that match a given pattern

• grep “star” text.txt

grep

• grep finds patterns specified as regular expressions; the name stands for “globally search for a regular expression and print”

• grep is a filter: you only keep certain lines of the input

• e.g., words that end with -ing:

• grep -w "[a-z]*ing" text.txt

• Exercises: try the above command:

• without -w option

• with the -o and -w option (or -ow for shorthand)

• what do the -v and -i options do? use man grep to find out

grep

• grep gh keep lines containing gh

• grep -i gh keep lines containing gh independent of casing (gH GH..)

• grep “^ch” keep lines beginning with ch

• grep “ing$” keep lines ending with ing

• grep -v “gh” do NOT keep lines containing gh

Counting: wc

• Counting lines (-l), words and characters in a file:

• wc FILE (prints the number of lines, words and characters, followed by the file name)

• Why is the number of words different?

Exercises with grep & wc

• How many uppercase words are in text.txt?

• How many 4-letter words?

• How many “1 syllable” words are there (with exactly one vowel)?

stop words

a tiny story (real-world example)

in the end.. I never seem to remember when the New York Fashion Week takes place…

New York Fashion week

• we’ll consult the New York Times (web API) to find out….

• Step 1: get the data

New York Fashion week

• Step 2: combine the results

Extract year-month

• Extract year and month and sort by frequency to get a first impression
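A sketch of what this could look like, assuming the combined results have been reduced to one publication date in YYYY-MM-DD format per line in a hypothetical file dates.txt (cut -c selects characters instead of tab-separated fields):

cut -c 1-7 dates.txt | sort | uniq -c | sort -nr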

References

[1] Nikolaj Lindberg. egrep for Linguists. http://stts.se/egrep_for_linguists/egrep_for_linguists.pdf

[2] Ken W. Church (1994). Unix for Poets. http://cst.dk/bplank/refs/UnixforPoets.pdf

[3] Jeroen Janssens (2014). Data Science at the Command Line. O’Reilly.

[4] Jurafsky & Martin. Speech and Language Processing. 2nd edition (2009).