Introduction to Unix Robert Kofler
Introduction
What is Unix?
• Unix: Uniplexed Information and Computing Service
• an operating system that is very stable and well suited for multiple users and multiple tasks
• foundation of many modern operating systems such as Linux, Android and Mac OS X
• basic ideas: plain text for storing data, a hierarchical file system, many small tools
• an important innovation is the modular design, the “Unix philosophy”: the OS provides a limited set of simple tools that each perform a single, well-defined task
• complex tasks and sophisticated workflows are executed by chaining these commands (piping and shell scripting) instead of using a single, large monolithic tool
• Rob Pike, one of the early gurus, said: “power of the system comes from the relationship among the programs rather than from the programs themselves”
• the centerpiece is the kernel, a master control program that provides services to start and stop programs, handle file systems and schedule tasks
Short history of Unix
• MIT and Bell Labs developed an OS called Multics in the 1960s
• Ken Thompson, Dennis Ritchie and others got very frustrated with Multics, as it had many problems and became too large for efficient maintenance
• they decided to build a new OS on a much smaller scale
• the name Unix is actually a joke, making fun of Multics; Unix is pronounced like “eunuchs”, i.e. an emasculated Multics (deprived of power and virility)
• 1972-1973: Unix rewritten in the C programming language; before that it was written in assembly language. Why is this important?
• 1970-1980: great popularity in academia; increasing acceptance by commercial start-ups as well
• 1990s: popularity increased even more with the release of Linux (1991)
• 2000: Apple released the first Mac OS X, based on Unix
Evolution of Unix
Figure 1:
Unix in bioinformatics?
• lets you handle large data: try opening a 100 GB text file in a word processor
• powerful command line tools (Unix philosophy) let you perform even complex tasks with a single line of code
• allows automating workflows: try changing the content of 1000 files with GUI-based software; the result will be tons of errors (e.g. when attention briefly drifts to the evening beer) and tendinitis (that's why I went into bioinformatics)
• it will probably still be in use decades from now; an investment in learning Unix will likely pay off for your entire career (command line magic)
• repeatability of the analysis: workflows can be stored as text files; repeatability is important for you (in case you discover an error) and for others (they may want to redo an analysis)
• widely used: much support and many people developing tools
Recommended Readings
• Unix command cheat sheet: http://cheatsheetworld.com/programming/unix-linux-cheat-sheet/
• Brian W. Kernighan, Rob Pike: The Unix Programming Environment
• Love, Merlino, Reed..: Beginning Unix
• Powers and Peek: Unix Power Tools
In medias res
Start the command line
Finder => Applications => Utilities => Terminal
Drag the Terminal icon into the quick start bar
The three major questions when you find yourself in a novel environment
• Who am I? (whoami)
• Where am I? (pwd)
• Who is the person in my bed? (ls)
Figure 2:
    whoami
    pwd
    ls
Structure of a unix command: command [parameters] [files]
The strange person in detail
    ls -l
    ls -l -h
    ls -lh # the same as -l -h
• -l long format: list details for each file
• -h human-readable file sizes
Figure 3:
• col1: file type (file or directory) and permissions
• col2: number of hard links (for a directory, the count depends on its contents)
• col3: owner
• col4: group owner
• col5: size
• col6, 7, 8: last modified: month, day, and time of day (or the year for older files)
• last col: file name / directory name
Walk of shame: planning the escape
Change directory (cd)
    cd Desktop # move into the folder Desktop; relative path
    cd /Users/robertkofler/Desktop # move into the folder Desktop; absolute path
    cd . # go to the current directory; when is this useful?
    cd / # go to the root directory; absolute path
    cd .. # go to the parent directory
    cd ~ # go to the home folder
Relative path, absolute path
Example of a file tree
What is root?
Unix has a hierarchical file system with root (“/”) as the top folder
Figure 4:
Moving around
Figure 5:
You are at student and want to move to Desktop?
    cd Desktop # move into the folder Desktop; relative path
    cd /Users/student/Desktop # move into the folder Desktop; absolute path
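The difference between the two kinds of path can be tried out safely in a scratch directory. This is only a minimal sketch: the folder tree is created with mktemp and mkdir -p (both are assumptions of this example and are not part of the course files), and pwd -P prints the current location with symbolic links resolved.

```shell
# Build a throwaway tree (hypothetical names) so nothing real is touched
sandbox=$(mktemp -d)
mkdir -p "$sandbox/Users/student/Desktop"

cd "$sandbox/Users/student"            # absolute path: spelled out from the root "/"
cd Desktop                             # relative path: resolved from where we are now
via_relative=$(pwd -P)

cd "$sandbox/Users/student/Desktop"    # the absolute path works from anywhere
via_absolute=$(pwd -P)

cd ..                                  # parent directory, i.e. back to "student"
parent=$(basename "$(pwd)")

echo "$via_relative"
echo "$via_absolute"
echo "$parent"
```

Both routes should end in the same folder; the relative one is shorter but only works from the right starting point.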
Task:
Figure 6:
• You are at student and want to move to tasks? (relative and absolute)
• You are at student and want to move to Users? (relative and absolute)
• You are at student and want to move to THE FOLDER? (what's the name again? relative and absolute)
Figure 7:
• You are at bin and want to move to Temp1? (relative and absolute)
• You are at bin and want to move to python? (relative and absolute)
Task:
• Download the files human-genes.gtf and illumina-reads.fastq.gz from https://drrobertkofler.wikispaces.com/Unix_Alignments
• Navigate to the folder where you stored them
• Tell me their sizes in kB
Info about a command
Omg, I forgot all the options of ls, what should I do??
    man ls # manual for ls
    # exit with q
Most important keys in the command line
1.) Tab: autocomplete
Task
• Navigate to the folder where you stored human-genes.gtf using the absolute path and autocomplete
• What could be the advantages of using autocomplete?
2.) Arrow up
Previous command
Furnishing your new home
Create directories
    mkdir mydata
    mkdir mydata/test/raw # does this work?
    mkdir -p mydata/test/raw
Create empty files
    touch mydata/text.txt
    touch mydata/test/moretxt.txt
Remove directories
    rmdir mydata/test/raw
    rmdir mydata/test # does this work?
Note: rmdir only removes empty directories
Remove files
    rm mydata/test/moretxt.txt
    rm -rf mydata # killer command; use with utmost care (gone is gone)
    # -r recursive (subdirectories) -f force (just delete, no questions asked)
Move files
    touch test.txt # does this work?
    mv test.txt oklahoma.fastq.txt # mv can also be used to rename
    mkdir usa
    mv oklahoma.fastq.txt usa/ # move the file into a new folder
    rm -rf usa
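Because rm -rf is so unforgiving, it can help to rehearse these commands in a scratch directory first. The following is a rough sketch under that assumption; the directory comes from mktemp -d and the name renamed.txt is made up for illustration.

```shell
sandbox=$(mktemp -d)      # hypothetical scratch directory, safe to destroy
cd "$sandbox"

mkdir -p mydata/test/raw                 # -p also creates the missing parent folders
touch mydata/test/moretxt.txt

# rmdir refuses to delete a directory that still has content
if rmdir mydata/test 2>/dev/null; then
    rmdir_result="removed"
else
    rmdir_result="refused"
fi

mv mydata/test/moretxt.txt mydata/renamed.txt   # mv moves and renames in one step
rm -r mydata/test                               # -r removes the remaining subtree
ls mydata
echo "$rmdir_result"
```

Only once a command sequence behaves as expected in the sandbox is it worth running on real data.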
Task:
• Go to your home folder
• create the directory analysis
• create the directory analysis/coursework
• create the directory analysis/coursework/rawdata
• move the two files (human genes and illumina reads) into the directory analysis/coursework/rawdata
Inspecting files
    wc myfile.txt # word count; display the number of lines, words and characters (in this order)
    head myfile.txt # output the first 10 lines of the file
    head -50 myfile.txt # output the first 50 lines of the file
    tail myfile.txt # output the last 10 lines of the file
    tail -2 myfile.txt # output the last 2 lines of the file
    cat myfile.txt # output the entire file
    cat myfile1.txt myfile2.txt # output the entire first file, then the entire second file (no separator between the files)
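To see what these commands report, it helps to use a file whose content is known in advance. This sketch assumes a scratch file filled by seq (not part of the course data); it uses the > redirection covered in the section "Redirecting the output", and the -n spelling of the line count, which is the portable form of head -50.

```shell
demo=$(mktemp)              # hypothetical scratch file
seq 1 100 > "$demo"         # one number per line, 1 to 100

# wc -l prints only the line count; tr strips the padding some systems add
lines=$(wc -l < "$demo" | tr -d ' ')
first=$(head -n 3 "$demo")  # first three lines
last=$(tail -n 2 "$demo")   # last two lines

echo "$lines"
echo "$first"
echo "$last"
```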
Task
• Display the first 20 lines of the human genes and the last 5 lines
• How many genes do we have?
• Display the first 10 lines of the illumina reads
Piping basics
The symbol “|” (the pipe) is a magic character in Unix: it connects commands. THIS IS THE SINGLE MOST IMPORTANT FEATURE OF UNIX. The entire power of Unix rests on the pipe. Remember the Unix philosophy: small commands that each have a very limited function. With the pipe, these small commands can be chained to solve complex tasks.
Meaning of the pipe
Use the output of the left command as input for the right command
    cat myfile.txt | head -10
• left: cat myfile.txt
• right: head -10
• Note that most Unix commands can be used with a file and within a pipe
    head -10 myfile.txt # head with a file
    cat myfile.txt | head -10 # head with a pipe
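That the two forms really give the same result can be checked directly. A small sketch, assuming a scratch file created with mktemp and seq (both are stand-ins for myfile.txt):

```shell
demo=$(mktemp)                # hypothetical scratch file
seq 1 50 > "$demo"            # one number per line, 1 to 50

with_file=$(head -n 10 "$demo")          # head opens the file itself
with_pipe=$(cat "$demo" | head -n 10)    # head reads what cat writes into the pipe

if [ "$with_file" = "$with_pipe" ]; then
    echo "same output"
fi
```

The command on the right of the pipe never needs to know where its input came from, which is what makes chaining possible.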
Complex pipe examples
    cat myfile.txt | tail -10 | wc # What's going on here?
Question: Is there an upper limit to this piping?
Thinking Task:
• Relying only on the commands we already know, display the middle 10 lines of human genes :)
Editing files: vi
Vi is a powerful text editor that runs entirely in the command line. Unix offers two such text editors: vi and emacs. There is a battle raging between the religion of vi and the religion of emacs; I’m of the religion of vi. To be productive in the command line you have to know at least one of the two. I’m going to teach vi. However, if you already know emacs, then just relax and sleep.
    vi myfile.txt
Vi has two modes: the command mode and the text entry mode. For example, typing “yy” in the command mode yanks the current line (like copy), but typing “yy” in the text mode simply writes yy.
• press “i” (insert) to enter the text mode
• press Esc to exit the text mode and return to the command mode
• type “:w” to save the current file (command mode)
• type “:wq” to save and exit (back to the command line)
• type “:q!” to exit vi without saving
• press “A” to append text at the end of the current line (changes from command to text mode)
• press “dd” to delete the current line (command mode)
• press “2dd” to delete two lines (the current line and the one below it)
• press “yy” to yank the current line (put it into memory); what does “2yy” do?
• press “p” to paste the line from memory into the text
• here is a list of commands: http://www.lagmonster.org/docs/vi.html I urge you to study them in some detail; knowing vi well helps to make you quite powerful in the command line
Task
• create a file that contains a one-sentence description of your master thesis with vi
• create another file that lists your hobbies (one hobby per line)
• from your hobbies delete some lines, copy and paste some others and edit some other lines
• undo the damage and create a nice hobby file; use your name as the file name (no spaces); note that you will share your hobbies with the others later on
• display the first two hobbies using head
Copy vs Link
Copy and link files
    touch original.txt # create a file
    cp original.txt copy.txt # create a copy of the file
    ln copy.txt link.txt # create a hard link to copy.txt
Task 4
• Use vi and enter the text “1” into original.txt, then “2” into copy.txt and “3” into link.txt
• Display the content of the three files
• Notice anything interesting? How can you explain this?
• Think of some examples: in which situations would you use a copy and when would you use a link?
Addendum copy
    cp -r folder otherfolder # recursively (including subfolders) copy the content of one folder to another one
Remote access to computers
ssh (secure shell)
ssh allows you to work on a remote computer
    ssh user_name@computer_name
scp (secure copy) allows you to copy files between computers:
    # scp from to
    scp myfile user_name@computer_name:/absolute/path/of/destination
Task
• ssh as user vetgrid03 to computer i122mc146.vu-wien.ac.at
• Explore the files in the root folder
• move to the folder /Volumes/Temp/coursework/hobbies2017
• scp your hobbies into the folder /Volumes/Temp/coursework/hobbies2017 (you may want to open a second shell window)
• explore the hobbies of your colleagues
Managing processes
Running a process in the background
    sleep 10 # the terminal sleeps for 10 seconds
    sleep 30 & # the terminal sleeps, but you can continue working
    # in other words, the command is executed in the background
Question: where is this useful?
Move a command from the background to the foreground
    sleep 30 &
    fg # foreground
Move an active command into the background
    sleep 30
    ctrl+z # suspend the command
    bg # continue in the background
Thus any command can be moved into the background
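Inside a script there is no keyboard for ctrl+z, so background jobs are usually handled through their process ID instead. The sketch below is one possible way to do that; the special variable $!, kill -0 and wait are standard shell features, but they are additions of this example and not part of the commands introduced above.

```shell
sleep 1 &                 # start in the background; the shell returns at once
pid=$!                    # $! holds the PID of the most recent background job

# kill -0 sends no signal; it only checks that the process still exists
if kill -0 "$pid" 2>/dev/null; then
    state_before="running"
else
    state_before="gone"
fi

wait "$pid"               # block until the background job has finished
status=$?                 # exit status of the job; 0 means success
echo "$state_before, exit status $status"
```

This pattern is handy when a long computation should run while the script prepares the next step.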
Display running processes
    ps -e
NOTE: the left column is the process ID (PID)
Monitor running processes in real time
    top # exit with q (quit)
Sort running processes by CPU usage, displaying the most computationally demanding processes first
    top -o cpu # sort order by cpu (macOS syntax; Linux top uses -o %CPU)
Task
• go to vetgrid03
• which are the most demanding processes?
• what is the PID of the most demanding process?
Kill a task
    kill 39534 # 39534 is a PID
Bioinformatics with the command line
grep
Only retain rows that match the pattern
    grep 'blabla' file.txt # get rows that contain 'blabla' from file.txt
    cat rawdata/human-genes.gtf | grep 'BRCA'
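A tiny made-up table makes it easy to predict what grep will keep. The lines below are invented for illustration and are not taken from human-genes.gtf; the -c option, which prints only the number of matching lines, is an addition of this sketch.

```shell
demo=$(mktemp)            # hypothetical scratch file
# Made-up, tab-separated toy lines
printf '17\tgene\t100\tBRCA1\n13\tgene\t200\tBRCA2\n1\tgene\t300\tTP73\n' > "$demo"

grep 'BRCA' "$demo"                 # print every line that contains BRCA
hits=$(grep -c 'BRCA' "$demo")      # -c: print only the number of matching lines
echo "$hits"
```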
Task
• How many lines contain ‘MA’?
cut
Display only column 1 of the file
    cat rawdata/human-genes.gtf | cut -f 1
Display several columns of the file
    cat rawdata/human-genes.gtf | cut -f 1,9
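The column selection can be checked on a small table where the answer is obvious. A minimal sketch with invented values (cut splits on tabs unless told otherwise):

```shell
demo=$(mktemp)            # hypothetical scratch file
# Made-up toy table: chromosome, gene name, start position
printf 'chr1\tgeneA\t100\nchr2\tgeneB\t200\n' > "$demo"

first_col=$(cut -f 1 "$demo")       # column 1 only
two_cols=$(cut -f 1,3 "$demo")      # columns 1 and 3, still tab-separated

echo "$first_col"
echo "$two_cols"
```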
Use a different field delimiter
    cat rawdata/human-genes.gtf | cut -f 1 -d" "
What's going on here??
Task
• How many genes are on chromosome 19?
• How many genes are on chromosome 1? Did you notice any problem? How could we overcome it?
    grep '^blabla$' # ^ beginning of line
    # $ end of line
uniq and sort
Use vi to create a file (numbers.txt) with the following content:
    1
    1
    2
    2
    2
    3
    1
    1
The uniq command eliminates all consecutive redundant entries. We want to get only the unique entries in the file. What do we expect?
    cat numbers.txt | uniq
What is going on? Any ideas? We may also count the number of occurrences of each number
    cat numbers.txt | uniq -c # count the unique entries
How could we get the total count for each number? Basically, we need to sort the entries before using uniq.
    cat numbers.txt | sort | uniq -c
NOTE: for this reason sort and uniq frequently come as a pair
NOTE: sort is extremely powerful; it can sort by a specific column (-k3), by several columns (-k3 -k1), numerically or alphanumerically, etc.
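The difference between alphanumerical and numerical sorting is easiest to see on a few made-up lines. In this sketch -k2,2 restricts the sort key to column 2 and the n suffix requests numeric comparison; the data are invented for illustration.

```shell
demo=$(mktemp)            # hypothetical scratch file
printf 'b 10\na 9\nc 100\n' > "$demo"   # made-up: label, number

by_text=$(sort -k2,2 "$demo")       # compares column 2 character by character
by_number=$(sort -k2,2n "$demo")    # compares column 2 as numbers

echo "$by_text"
echo "$by_number"
```

Sorted as text, 100 comes before 9 because the comparison stops at the first differing character; sorted as numbers, 9 comes first.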
Task
• How many genes are on each chromosome?
• Sort the human genes, first by chromosome and then by start position (use the help of man and Google).
Problem: We want to obtain a list of genes located on chromosome 3; how could we do this?
As the first part of the problem we need to keep only genes located on chromosome 3. Introducing awk :)
    cat rawdata/human-genes.gtf|awk '$1=="3"' # column 1 needs to match 3
    # $1 is column 1
    # $3 is column 3 etc
    # in case column 1 matches (==), the default behaviour of awk is to print the line
AWK is actually a powerful mini programming language; learning awk in more detail can be very useful. However, I only have time for a quick introduction. In case we still have some time, we will revisit awk at the end. Now it's easy to get a list of the genes on chromosome 3
    cat rawdata/human-genes.gtf|awk '$1=="3"'|cut -f 4 -d" " # Who can explain this code to me?
Awk also allows testing for pattern matches, e.g. whether the start position of a gene ends with a zero.
    cat rawdata/human-genes.gtf|awk '$4~/0$/' # $4 column 4
    # ~ tilde: match the following pattern within //
    # $ end of the string (here: the end of $4)
    # 0$ ends with a zero
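Both awk tests can be tried on a handful of invented rows where the expected result is easy to work out by hand. The table below is made up (chromosome, source, feature, start) and only mimics the first columns of a GTF file; the {print $1} action, which prints a single column instead of the whole line, is an addition of this sketch.

```shell
demo=$(mktemp)            # hypothetical scratch file
# Made-up rows: chromosome, source, feature, start position
printf '1\ttoy\tgene\t150\n19\ttoy\tgene\t200\n3\ttoy\tgene\t310\n3\ttoy\tgene\t425\n13\ttoy\tgene\t501\n' > "$demo"

# Exact comparison: chromosome 13 is not counted as chromosome 3
on_chr3=$(awk '$1=="3"' "$demo" | wc -l | tr -d ' ')

# Pattern match: print the chromosome of every row whose start ends in 0
ends_in_zero=$(awk '$4~/0$/ {print $1}' "$demo")

echo "$on_chr3"
echo "$ends_in_zero"
```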
Redirecting the output
    cat rawdata/human-genes.gtf|awk '$1=="3"'|cut -f 4 -d" " > chr3-genest.txt # store the output in a file;
    # if the file already exists, overwrite it
    cat rawdata/human-genes.gtf|awk '$1=="3"'|cut -f 4 -d" " >> chr3-genest.txt # append the output to a file
    # if the file does not exist, create it
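The difference between overwriting and appending shows up immediately with three echo commands. A minimal sketch, assuming a scratch file from mktemp:

```shell
out=$(mktemp)               # hypothetical scratch file

echo "first"  > "$out"      # > creates the file or overwrites its content
echo "second" > "$out"      # overwrites again: "first" is gone
echo "third" >> "$out"      # >> appends to what is already there

content=$(cat "$out")
echo "$content"
```

Mixing up > and >> is a classic way to lose results, so it is worth checking which one a command line uses before pressing Enter.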
Think Task
• Get a list of all genes having exactly 3 exons
Working with zipped files
Zip a file
Zip your hobbies
    gzip my-hobbies.txt
Unzip a file
    gzip -d my-hobbies.txt
Tasks
• What happened to the file extension of the zipped file?
• What happened to the size of the zipped file? By which factor have your hobbies shrunk?
• What happens when you read a zipped file (e.g. with head, tail)?
Use zipped files within a pipe
A fastq file is the standard raw output of next-generation sequencing (e.g. Illumina) and may also be obtained for PacBio or Oxford Nanopore.
    gzip -cd rawdata/illumina-reads.fastq.gz | head
Figure 8:
Every fastq entry has four lines
• line 1: read name (starts with an @)
• line 2: sequence of the entry
• line 3: separator line (starts with a +, optionally followed by the read name)
• line 4: quality of the sequence
Details about this file will be presented in another lecture.
Task
• How many entries does this fastq file have?
Question: obtain all reads that contain the character ‘N’
We want to obtain all reads that contain the character ‘N’ and store them in a separate fastq file. The key problem here is that a single fastq entry has four lines, whereas Unix operates on a line-by-line basis. It would thus be nice to merge the content of several lines into a single one.
paste
Use vi to create the following file (paste-test.txt)
    1
    2
    3
    4
    5
    6
    7
    8
Now we can try
    cat paste-test.txt | paste - -
    cat paste-test.txt | paste - - - -
What is the difference?
Back to the fastq
    gzip -cd rawdata/illumina-reads.fastq.gz|paste - - - -|head
Next we need to obtain only entries where the sequence contains the character N. Note that column 2 contains the sequence
    gzip -cd rawdata/illumina-reads.fastq.gz|paste - - - -|awk '$2~/N/'|head
Repackaging fastq entries to 4 lines per entry using the translate command
The translate command replaces some characters with others
    # tr 'replace_this' 'with_this_character'
    echo "Hallo"|tr 'l' 'r'
Another info: special characters
• \t tabulator
• \n new line
    gzip -cd rawdata/illumina-reads.fastq.gz|paste - - - -|awk '$2~/N/'|tr '\t' '\n'|head
Finally zip the output and write it into a new fastq file
    gzip -cd rawdata/illumina-reads.fastq.gz|paste - - - -|awk '$2~/N/'|tr '\t' '\n'|gzip -c > rawdata/fastq-withN.fastq.gz
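The whole chain can be rehearsed on a toy file before running it on real reads. The two reads below are invented, and the file names are hypothetical. One deliberate deviation from the command above: the sketch passes -F '\t' to awk so that columns are split on tabs only; this is an assumption about the data, made in case read names contain spaces, which would otherwise shift the column numbers.

```shell
sandbox=$(mktemp -d)      # hypothetical scratch directory
# Two made-up reads; only the second sequence contains an N
printf '@read1\nACGT\n+\nIIII\n@read2\nACNT\n+\nIIII\n' | gzip -c > "$sandbox/toy.fastq.gz"

gzip -cd "$sandbox/toy.fastq.gz" \
  | paste - - - - \
  | awk -F '\t' '$2~/N/' \
  | tr '\t' '\n' \
  | gzip -c > "$sandbox/toy-withN.fastq.gz"
# paste: four lines become one tab-separated row
# awk:   keep rows whose sequence (column 2) contains an N
# tr:    turn the tabs back into line breaks, restoring four lines per entry

result=$(gzip -cd "$sandbox/toy-withN.fastq.gz")
echo "$result"
```

If the toy output contains exactly the entries expected, the same pipeline can be pointed at the real file.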
Task
• Display the content of rawdata/fastq-withN.fastq.gz
• How many entries have an N in the sequence?
Think task
• Write all fastq entries that are shorter than 20 bp into a separate zipped fastq file. You need the awk function length(somecolumn)
Send me this file by email: rokofler at gmail.com
Advanced topics: iterating over many files
The Unix command line also allows iterating over many files. For example, we may have forgotten to add the extension .txt to several text files
    # let's first create a folder containing our files
    mkdir iterate
    touch iterate/1
    ...
    touch iterate/7
Now let's change the file names
    for i in iterate/*; do mv "$i" "$i.txt"; done
Note: $i is the variable that contains the file name. You could use any variable name. When you are able to wield for-loops and Unix commands, you can perform powerful analyses in little time :) Note: you can also destroy your entire data set in little time... so be careful when using for-loops and test them before using them.
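One cheap way to test a loop before it touches anything is a dry run: put echo in front of the command so that each call is printed instead of executed. The sketch below assumes a scratch directory with three empty files (names made up for illustration).

```shell
sandbox=$(mktemp -d)      # hypothetical scratch directory
touch "$sandbox/1" "$sandbox/2" "$sandbox/3"

# Dry run: echo prints each mv command instead of executing it
for i in "$sandbox"/*; do echo mv "$i" "$i.txt"; done

# Real run; the quotes keep file names with spaces in one piece
for i in "$sandbox"/*; do mv "$i" "$i.txt"; done

renamed=$(ls "$sandbox")
echo "$renamed"
```

Note that running the real loop a second time would add a second .txt to every name, which is one more reason to do the dry run first.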
Summary
I hope this helps to illustrate the Unix philosophy: with a few simple commands, chained in creative ways, powerful analyses can be achieved. You don’t need programming for this; you only need a bit of creativity. The solutions are also very short: many tasks can be solved with a single line of code. Note that some tools, like awk, for-loops and shell scripting, are very powerful; spending more time learning them will certainly pay off during your further career.