
Introduction to Unix

Robert Kofler

Introduction

What is Unix?

• Unix: Uniplexed Information and Computing Service
• an OS that is very stable and well suited for multiple users and multiple tasks
• foundation of many modern OSs such as Linux, Android and Mac OS X
• basic ideas: plain text for storing data, hierarchical file system, many small tools
• important innovation is the modular design, the “Unix philosophy”: the OS provides a limited set of simple tools that each perform a single, well-defined task
• complex tasks and sophisticated workflows can be executed by concatenating these commands (piping and shell scripting) instead of using a single, large monolithic tool
• Rob Pike, one of the early gurus, said: “the power of a system comes more from the relationships among programs than from the programs themselves”
• centerpiece is the kernel: a master control program that provides services to start and stop programs, handle file systems and schedule tasks

Short history of Unix

• MIT and Bell Labs started developing an OS called Multics in the mid-1960s
• Ken Thompson, Dennis Ritchie and others got very frustrated with Multics, as it had many problems and had become too large for efficient maintenance
• they decided to redo an OS on a much smaller scale
• the name Unix is actually a joke making fun of Multics; Unix is pronounced like “eunuchs”, i.e. an emasculated Multics (deprived of power and virility)
• 1973: Unix rewritten in the C programming language; before that it was written in assembler. Why is this important?
• 1970-1980: large popularity in academia; increasing acceptance also by commercial start-ups
• 1990s: popularity increased even further due to the release of Linux (1991)
• 2001: Apple released the first Mac OS X, based on Unix

Evolution of Unix

Figure 1:

Why Unix in bioinformatics?

• allows you to deal with large data: try to open a 100 GB text file in a word processor
• powerful command line tools (Unix philosophy) allow you to do even complex tasks with a single line of code
• allows automating workflows: try to change the content of 1000 files with GUI-based software; the result will be tons of errors (e.g. when attention briefly drifts to the evening beer) and tendinitis (that's why I went into bioinformatics)
• it will probably still be in use decades from now; an investment in learning Unix will likely pay off for your entire career (command line magic)
• repeatability of the analysis: workflows may be stored as text files; repeatability is important for you (in case you discover an error) and for others (they may want to redo an analysis)
• widely used: much support and many people developing tools

Recommended Readings

• Unix command cheat sheet: http://cheatsheetworld.com/programming/unix-linux-cheat-sheet/
• Brian W. Kernighan, Rob Pike: The Unix Programming Environment
• Love, Merlino, Reed et al.: Beginning Unix
• Powers and Peek: Unix Power Tools

In medias res

Start the command line

Finder => Applications => Utilities => Terminal
Drag the Terminal icon into the quick start bar

The three major questions when you find yourself in a novel environment

• Who am I (whoami)
• Where am I (pwd)
• Who is the person in my bed (ls)

Figure 2:

    whoami
    pwd
    ls

Structure of a unix command: command [parameters] [files]

The strange person in detail

    ls -l
    ls -l -h
    ls -lh # the same as -l -h

• -l list the directory content in detail (long format)
• -h human readable file sizes

Figure 3:

• col1: permissions, and whether it is a file or a directory
• col2: number of links; entries in a directory or number of hard links
• col3: owner
• col4: group owner
• col5: size
• col6,7,8: last modified; month, day, year (sometimes the time of day instead of the year)
• last col: file name / directory name

Walk of shame: planning the escape

Change directory (cd)

    cd Desktop # move into the folder Desktop; relative path
    cd /Users/robertkofler/Desktop # move into the folder Desktop; absolute path
    cd . # go to the current directory; when is this useful?
    cd / # go to the root directory; absolute path
    cd .. # go to the parent directory
    cd ~ # go to the home folder
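
The difference between relative and absolute paths can be tried out safely in a throw-away folder. The following is only a minimal sketch: the temporary directory and the folder names student and Desktop are stand-ins created by the example itself, not folders from the course machine.

```shell
# Build a small, hypothetical directory tree in a temporary location
base=$(mktemp -d)
mkdir -p "$base/student/Desktop"

cd "$base/student"            # start in the "student" folder
cd Desktop                    # relative path: resolved from where we are
via_relative=$(pwd -P)

cd /                          # go somewhere else entirely
cd "$base/student/Desktop"    # absolute path: works from anywhere
via_absolute=$(pwd -P)

cd ..                         # parent directory, i.e. back in "student"
parent=$(pwd -P)

echo "relative: $via_relative"
echo "absolute: $via_absolute"
echo "parent:   $parent"
```

Both routes end in the same folder; the relative one only works if you start from the right place.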

Relative path, absolute path

Example of a file

What is root?

Unix has a hierarchical file system with root (“/”) as the top-level folder

Figure 4:

Moving around

Figure 5:

You are at student and want to move to Desktop?

    cd Desktop # move into the folder Desktop; relative path
    cd /Users/student/Desktop # move into the folder Desktop; absolute path

Task:

Figure 6:

• You are at student and want to move to tasks? (relative and absolute)
• You are at student and want to move to Users? (relative and absolute)
• You are at student and want to move to THE FOLDER? (what's the name again? relative and absolute)

Figure 7:

• You are at bin and want to move to Temp1? (relative and absolute)
• You are at bin and want to move to python? (relative and absolute)

Task:
• Download the files human-genes.gtf and illumina-reads.fastq.gz from https://drrobertkofler.wikispaces.com/Unix_ Alignments
• Navigate to the folder where you stored them
• Tell me the size in kb

Info about a command

Omg, I forgot all the options of ls, what should I do??

    man ls # manual for ls
    # exit with q

Most important buttons in the command line

1.) Tab: autocomplete

Task
• Navigate to the folder where you stored human-genes.gtf using the absolute path and autocomplete
• What could be the advantages of using autocomplete?

2.) Arrow up key

Recalls the previous command

Furnishing your new home

Create directories

    mkdir mydata
    mkdir mydata/test/raw # does this work?
    mkdir -p mydata/test/raw # -p also creates missing parent directories

Create empty files

    touch mydata/text.txt
    touch mydata/test/moretxt.txt

Remove directories

    rmdir mydata/test/raw
    rmdir mydata/test # does this work?

Note: rmdir only removes empty directories

Remove files

    rm mydata/test/moretxt.txt
    rm -rf mydata # killer command; use with utmost care (gone is gone)
    # -r recursive (subdirectories)
    # -f force (just delete, no questions asked)

Move files

    touch test.txt # does this work?
    mv test.txt oklahoma.fastq.txt # move can be used to rename
    mkdir usa
    mv oklahoma.fastq.txt usa/ # move files into a new folder
    rm -rf usa
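
The commands for creating, moving and removing can be rehearsed in a temporary folder, where nothing valuable can be lost. This is a minimal sketch; the temporary directory and the file name renamed.txt are assumptions made for the example.

```shell
work=$(mktemp -d)             # throw-away playground
cd "$work"

mkdir -p mydata/test/raw      # -p creates mydata and mydata/test on the way
touch mydata/test/moretxt.txt # empty file inside mydata/test

# rmdir refuses to delete a directory that still has content
if rmdir mydata/test 2>/dev/null; then
    refused=no
else
    refused=yes
fi
echo "rmdir refused to remove the non-empty folder: $refused"

rmdir mydata/test/raw         # this one is empty, so it works
mv mydata/test/moretxt.txt mydata/test/renamed.txt   # rename via mv
ls mydata/test                # shows renamed.txt

rm -r mydata/test             # recursive delete; no -f, so problems are reported
```

Leaving out -f while practising is a deliberate choice here: rm then complains instead of silently ignoring problems.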

Task:
• Go to your home folder
• create the directory analysis
• create the directory analysis/coursework
• create the directory analysis/coursework/rawdata
• move the two files (human genes and illumina reads) into the directory analysis/coursework/rawdata

Inspecting files

    wc myfile.txt # word count; display the number of lines, words and characters (in this order)
    head myfile.txt # output the first ten lines of the file
    head -50 myfile.txt # output the first 50 lines of the file
    tail myfile.txt # output the last 10 lines of the file
    tail -2 myfile.txt # output the last 2 lines of the file
    cat myfile.txt # output the entire file
    cat myfile1.txt myfile2.txt # output the entire first file, then the entire second file (no separation between the files)
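
To see these commands in action without the course data, a test file with known content helps. The sketch below assumes the seq command is available (it is on macOS and Linux) and uses a hypothetical file of 100 numbered lines; it uses the -n spelling of the line count, which is the portable form of head -50.

```shell
work=$(mktemp -d)
seq 1 100 > "$work/myfile.txt"                      # 100 lines: 1, 2, ..., 100

n_lines=$(wc -l < "$work/myfile.txt" | tr -d ' ')   # only the number of lines
first3=$(head -n 3 "$work/myfile.txt")              # first 3 lines
last2=$(tail -n 2 "$work/myfile.txt")               # last 2 lines

echo "lines: $n_lines"
echo "first three:"; echo "$first3"
echo "last two:";    echo "$last2"
```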

Task
• Display the first 20 lines of the human genes and the last 5 lines
• How many genes do we have?
• Display the first 10 lines of the illumina reads

Piping basics

The symbol “|” (the pipe) is a magic character in Unix: it allows you to concatenate commands. THIS IS THE SINGLE MOST IMPORTANT COMMAND OF UNIX. The entire power of Unix rests with the pipe. Remember the Unix philosophy: small commands that each have only a very limited function. With the pipe, these small commands can be concatenated to solve complex tasks.

Meaning of the pipe

Use the output of the left command as input for the right command

    cat myfile.txt | head -10

• left: cat myfile.txt
• right: head -10
• Note that most Unix commands can be used with a file and within a pipe

    head -10 myfile.txt # head with a file
    cat myfile.txt | head -10 # head with a pipe
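
That a command behaves the same with a file and within a pipe can be checked directly. A minimal sketch, assuming a hypothetical test file of 50 numbered lines created with seq:

```shell
work=$(mktemp -d)
seq 1 50 > "$work/myfile.txt"

from_file=$(head -n 10 "$work/myfile.txt")          # head reads the file itself
from_pipe=$(cat "$work/myfile.txt" | head -n 10)    # head reads what cat sends

if [ "$from_file" = "$from_pipe" ]; then
    echo "same output either way"
fi

# Pipes can be chained: each command only sees the output of the one before it
count=$(cat "$work/myfile.txt" | tail -n 10 | wc -l | tr -d ' ')
echo "lines arriving at wc: $count"
```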

Complex pipes examples

    cat myfile.txt | tail -10 | wc # What's going on here?

Question: Is there an upper limit to this piping?

Thinking Task:
• Relying only on the commands we already know, display the middle 10 lines of human genes :)

Editing files

Vi is a powerful text editor that operates solely in the command line. Unix offers two such text editors: vi and emacs. There is a battle raging between the religion of vi and the religion of emacs; I'm of the religion of vi. To be useful in the command line you have to know at least one of these two. I'm going to teach vi. However, if you already know emacs, then just relax.

    vi myfile.txt

Vi has two modes: the command mode and the text entry mode. For example, typing “yy” in the command mode yanks the current line (like copy), whereas typing “yy” in the text mode simply writes yy.
• press “i” (insert) to enter the text mode
• press Esc to exit the text mode and enter the command mode
• type “:w” to save the current file (command mode)
• type “:wq” to save and exit (back to the command line)
• type “:q!” to exit vi without saving
• press “A” to append text at the end of the current line (changes from command to text mode)
• type “dd” to delete the current line (command mode)
• type “2dd” to delete two lines (the current line and the one below)
• type “yy” to yank the current line (put it into memory); what is “2yy” doing?
• press “p” to paste the line from memory into the text
• here is a list of commands: http://www.lagmonster.org/docs/vi.html I urge you to study them in some detail; knowing vi well helps you to become quite powerful in the command line

Task
• create a file that contains a one-sentence description of your master thesis with vi
• create another file that lists your hobbies (one hobby per line)
• from your hobbies delete some lines, copy and paste some others and edit some other lines
• undo the damage and create a hobby file; use your name as file name (no spaces); note that you will share your hobbies later on with the others
• display the first two hobbies using head

Copy vs link

    touch original.txt # create a file
    cp original.txt copy.txt # create a copy of a file
    ln copy.txt link.txt # create a hard link of a file

Task 4
• Use vi and enter the text “1” into original.txt, then “2” into copy.txt and “3” into link.txt
• Display the content of the three files
• Notice anything interesting? How can you explain this?
• Think of some examples: in which situations would you use copy and when would you use link?

Addendum copy

    cp -r folder otherfolder # recursively (including subfolders) copy the content of one folder to another one

Remote access to computers

ssh (secure shell)

ssh allows you to operate on a remote computer

    ssh user_name@computer_name

scp (secure copy)

scp allows you to copy files between computers

    # scp from to
    scp myfile user_name@computer_name:/absolute/path/of/destination

Task
• ssh as user vetgrid03 to computer i122mc146.vu-wien.ac.at
• Explore the files in the root folder
• move to the folder /Volumes/Temp/coursework/hobbies2017
• scp your hobbies into the folder /Volumes/Temp/coursework/hobbies2017 (you may want to open a second shell window)
• explore the hobbies of your colleagues

Managing processes

Running process in background

    sleep 10 # the terminal sleeps for 10 seconds
    sleep 30 & # the terminal sleeps, but you can continue working
    # in other words, the command is executed in the background

Question: where is this useful?

Move command from background to foreground

    sleep 30 &
    fg # foreground

Move active command into the background

    sleep 30
    ctrl+z # suspend command
    bg # continue in background

Thus any command can be moved into the background
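
fg, bg and ctrl+z are meant for interactive use at the terminal. Inside a script the same idea can be expressed with $! (the PID of the last background command) and wait. The following is a minimal sketch of that variant, not something shown in the course itself:

```shell
sleep 2 &                     # start sleep in the background
bg_pid=$!                     # remember its process id (PID)

# kill -0 sends no signal; it only checks that the process exists
if kill -0 "$bg_pid" 2>/dev/null; then
    running=yes
else
    running=no
fi
echo "background job $bg_pid is running: $running"

echo "the shell is free to do other work in the meantime"

wait "$bg_pid"                # block until the background job has finished
exit_status=$?
echo "background job finished with status $exit_status"
```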

Display running processes

    ps -e

NOTE: the left column is the process id (PID)

Monitor running processes in real time

    top # exit with q (quit)

Sort running processes by CPU usage, display the most computationally demanding processes first

    top -o cpu # order by cpu (macOS syntax; Linux top uses -o %CPU)

Task
• go to vetgrid03
• which are the most demanding processes?
• what is the PID of this process?

Kill a task

    kill 39534 # 39534 is a PID

Bioinformatics with the command line

Only retain rows that fit the pattern

    grep 'blabla' file.txt # get rows that contain 'blabla' from file.txt
    cat rawdata/human-genes.gtf | grep 'BRCA'

Task
• How many lines contain ‘MA’?

Display only column 1 of the file

    cat rawdata/human-genes.gtf | cut -f 1

Display several columns of the file

    cat rawdata/human-genes.gtf | cut -f 1,9

Use a different field delimiter

    cat rawdata/human-genes.gtf | cut -f 1 -d" "

What's going on here??
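
The effect of the delimiter is easiest to see on a tiny table. The sketch below uses a made-up three-line, tab-separated file (toy.tsv, with hypothetical gene names and a GTF-like attribute column); it is not the real human-genes.gtf.

```shell
work=$(mktemp -d)
# Hypothetical toy table: chromosome <TAB> start <TAB> attributes
printf '1\t100\tgene_id "BRCA2"; type "x"\n'  > "$work/toy.tsv"
printf '19\t200\tgene_id "ABC1"; type "y"\n' >> "$work/toy.tsv"
printf '1\t300\tgene_id "BRCA1"; type "z"\n' >> "$work/toy.tsv"

brca_rows=$(grep -c 'BRCA' "$work/toy.tsv")      # -c counts the matching rows
chroms=$(cut -f 1 "$work/toy.tsv")               # tab is the default delimiter
first_word=$(cut -f 1 -d ' ' "$work/toy.tsv")    # now split on spaces instead

echo "rows with BRCA: $brca_rows"
echo "column 1 (tab-delimited):";   echo "$chroms"
echo "field 1 (space-delimited):";  echo "$first_word"
```

With a space as delimiter, everything up to the first space counts as field 1, so the tabs and the first three columns end up inside it.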

Task
• How many genes are on chromosome 19?
• How many genes are on chromosome 1? Did you notice any problem? How could we overcome it?

    grep '^blabla$'
    # ^ beginning of line
    # $ end of line

uniq and sort

Use vi and create a file (numbers.txt) with the following content:

    1
    1
    2
    2
    2
    3
    1
    1

The uniq command eliminates all consecutive redundant entries. We want to get only the unique entries in the file. What do we expect?

    cat numbers.txt | uniq

What is going on? Any ideas? We may also count the number of occurrences of each number

    cat numbers.txt | uniq -c # count the uniq entries

How could we get the counts of occurrences of each number? Basically, we need to sort the entries before using uniq.

    cat numbers.txt | sort | uniq -c

NOTE: for this reason sort and uniq frequently come as a pair
NOTE: sort is extremely powerful; it can sort by a specific column (-k3), by several columns (-k3 -k1), numerically or alphanumerically, etc.
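
The interplay of sort and uniq can be reproduced without vi by writing numbers.txt with printf. This is a minimal sketch; the small two-column table at the end is a made-up example for sorting by column, and the awk step only tidies the output of uniq -c into value:count pairs.

```shell
work=$(mktemp -d)
printf '%s\n' 1 1 2 2 2 3 1 1 > "$work/numbers.txt"

runs=$(uniq "$work/numbers.txt")                  # only consecutive repeats collapse
distinct=$(sort "$work/numbers.txt" | uniq)       # sorted first: truly unique values
counts=$(sort "$work/numbers.txt" | uniq -c | awk '{print $2 ":" $1}')

echo "uniq alone:";     echo "$runs"
echo "sort | uniq:";    echo "$distinct"
echo "value:count:";    echo "$counts"

# Sorting by column: first by column 1, then by column 2, both numerically
by_column=$(printf '2 10\n1 9\n1 10\n' | sort -k1,1n -k2,2n)
echo "sorted table:";   echo "$by_column"
```

The 1 shows up twice in the output of uniq alone because its two runs are not adjacent; sorting brings them together.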

Task
• How many genes are on each chromosome?
• Sort the human genes, first by chromosome and then by start position (use the help of man and Google).

Problem: We want to obtain a list of genes located on chromosome 3, how could we do this?

As the first part of the problem we need to keep only genes located on chromosome 3. Introducing awk :)

    cat rawdata/human-genes.gtf | awk '$1=="3"'
    # column 1 needs to match 3
    # $1 is column 1
    # $3 is column 3 etc
    # in case column 1 matches (==), the default behaviour of awk is to print the line

AWK is actually a powerful mini programming language; learning awk in more detail can be very useful. However, I only have time for a quick introduction. In case we still have some time, we will revisit awk at the end. Now it's easy to get a list of the genes on chromosome 3

    cat rawdata/human-genes.gtf | awk '$1=="3"' | cut -f 4 -d" " # Who can explain this code to me?

Awk also allows testing for pattern matches, e.g. whether the start position of a gene ends with a zero.

    cat rawdata/human-genes.gtf | awk '$4~/0$/'
    # $4 column 4
    # ~ tilde: match the following pattern within //
    # $ end of the string (here: the end of $4)
    # 0$ ends with a zero
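
Why awk is the better tool for “column 1 equals 3” becomes clear on a toy example. The file below (toy.gtf) is a made-up, four-column stand-in for the real annotation; the column layout is an assumption for illustration only.

```shell
work=$(mktemp -d)
# Hypothetical columns: chromosome, source, feature, start
printf '3\tdemo\tgene\t1250\n'  > "$work/toy.gtf"
printf '13\tdemo\tgene\t400\n' >> "$work/toy.gtf"
printf '3\tdemo\tgene\t777\n'  >> "$work/toy.gtf"

grep3=$(grep -c '3' "$work/toy.gtf")                      # any "3" anywhere in the row
chr3=$(awk '$1=="3"' "$work/toy.gtf" | wc -l | tr -d ' ') # column 1 is exactly "3"
ends0=$(awk '$4~/0$/ {print $4}' "$work/toy.gtf")         # start positions ending in 0

echo "rows matched by grep '3':   $grep3"
echo "rows with column 1 == 3:    $chr3"
echo "start positions ending in 0:"; echo "$ends0"
```

grep also reports the row for chromosome 13, whereas the awk test on $1 only keeps chromosome 3.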

Redirecting the output

    cat rawdata/human-genes.gtf | awk '$1=="3"' | cut -f 4 -d" " > chr3-genest.txt
    # store output into file;
    # if file already exists, overwrite it

    cat rawdata/human-genes.gtf | awk '$1=="3"' | cut -f 4 -d" " >> chr3-genest.txt
    # append output to file
    # if file does not exist, create it
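
The difference between > and >> is quickly verified with echo. A minimal sketch using a hypothetical file out.txt in a temporary folder:

```shell
work=$(mktemp -d)

echo "first"  >  "$work/out.txt"   # creates the file
echo "second" >  "$work/out.txt"   # > overwrites: "first" is gone
echo "third"  >> "$work/out.txt"   # >> appends to the existing content

content=$(cat "$work/out.txt")
echo "$content"
```

Only “second” and “third” remain in the file, which is why > deserves the same care as rm.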

Think Task
• Get a list of all genes having exactly 3 exons

Working with zipped files

Zip a file

Zip your hobbies

    gzip my-hobbies.txt

Unzip a file

    gzip -d my-hobbies.txt

Tasks
• What happened to the file extension of a zipped file?
• What happened to the size of the zipped file? By which factor have your hobbies shrunk?
• What happens when you read a zipped file (e.g. with head, tail)?

Use zipped files within a pipe

A fastq file is the standard raw output of Next Generation Sequencing (e.g. Illumina) and may also be obtained from PacBio or Oxford Nanopore.

    gzip -cd rawdata/illumina-reads.fastq.gz | head

Figure 8:
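
Reading a zipped file within a pipe leaves the compressed file on disk untouched. The following sketch illustrates this with a hypothetical file demo.txt of 1000 numbered lines instead of the real fastq file:

```shell
work=$(mktemp -d)
seq 1 1000 > "$work/demo.txt"

gzip "$work/demo.txt"          # replaces demo.txt by the compressed demo.txt.gz

# -d decompress, -c write to standard output instead of changing the file
first=$(gzip -cd "$work/demo.txt.gz" | head -n 1)
n=$(gzip -cd "$work/demo.txt.gz" | wc -l | tr -d ' ')

echo "first line: $first"
echo "number of lines: $n"
ls "$work"                     # still only demo.txt.gz
```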

Every fastq entry has four lines
• line 1: read name (starts with an @)
• line 2: sequence of the entry
• line 3: separator (starts with a +), optionally followed by the read name again
• line 4: quality of the sequence

Details about this file will be presented in another lecture.

Task • How many entries does this fastq file have?

Question: obtain all reads that contain the character ‘N’

We want to obtain all reads that contain the character ‘N’ and store them in a separate fastq file. The key problem here is that a single fastq entry has four lines, whereas Unix operates on a line-by-line basis. It would thus be nice to merge the content of several lines into a single one.

paste

Use vi to create the following file (paste-test.txt)

    1
    2
    3
    4
    5
    6
    7
    8

Now we can try

    cat paste-test.txt | paste - -
    cat paste-test.txt | paste - - - -

What is the difference?
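
The two paste calls can be compared side by side. In this sketch printf stands in for the vi-made paste-test.txt; each “-” tells paste to take one line from standard input, so the number of dashes is the number of columns per output row.

```shell
# Two dashes: two input lines are merged into one tab-separated row
two=$(printf '%s\n' 1 2 3 4 5 6 7 8 | paste - -)
# Four dashes: four input lines per row
four=$(printf '%s\n' 1 2 3 4 5 6 7 8 | paste - - - -)

echo "paste - -";      echo "$two"
echo "paste - - - -";  echo "$four"

rows_two=$(printf '%s\n' "$two" | wc -l | tr -d ' ')
rows_four=$(printf '%s\n' "$four" | wc -l | tr -d ' ')
echo "rows: $rows_two versus $rows_four"
```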

Back to the fastq

    gzip -cd rawdata/illumina-reads.fastq.gz | paste - - - - | head

Next we need to obtain only entries where the sequence contains the character N. Note that column 2 contains the sequence

    gzip -cd rawdata/illumina-reads.fastq.gz | paste - - - - | awk '$2~/N/' | head

Repackaging fastq entries to 4-lines per entry using the translate command

The translate command replaces some characters with some others

    # tr 'replace_this' 'with_this_character'
    echo "Hallo" | tr 'l' 'r'

Another thing: special characters
• \t tabulator
• \n new line

    gzip -cd rawdata/illumina-reads.fastq.gz | paste - - - - | awk '$2~/N/' | tr '\t' '\n' | head

Finally zip the output and store it in a new fastq file

    gzip -cd rawdata/illumina-reads.fastq.gz | paste - - - - | awk '$2~/N/' | tr '\t' '\n' | gzip -c > rawdata/fastq-withN.fastq.gz
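
The whole chain can be tested on a toy fastq file where the answer is known. The three reads below are invented for this sketch. One detail goes beyond the commands above: the sketch passes -F '\t' to awk, an addition of this example, because awk otherwise splits on any whitespace and read names that contain a space would shift the sequence out of column 2.

```shell
work=$(mktemp -d)
# Toy fastq with three hypothetical reads; the name of r1 contains a space
printf '@r1 1:N:0\nACGT\n+\nIIII\n@r2\nACNT\n+\nIIII\n@r3\nNNGT\n+\nIIII\n' |
    gzip -c > "$work/toy.fastq.gz"

gzip -cd "$work/toy.fastq.gz" |
    paste - - - - |               # four lines -> one tab-separated row
    awk -F '\t' '$2 ~ /N/' |      # keep rows whose sequence contains N
    tr '\t' '\n' |                # back to four lines per entry
    gzip -c > "$work/withN.fastq.gz"

n_lines=$(gzip -cd "$work/withN.fastq.gz" | wc -l | tr -d ' ')
names=$(gzip -cd "$work/withN.fastq.gz" | paste - - - - | cut -f 1)
echo "lines in the output: $n_lines"
echo "reads kept:"; echo "$names"

# Without -F '\t' the N in the name of r1 is mistaken for sequence
naive=$(gzip -cd "$work/toy.fastq.gz" | paste - - - - | awk '$2~/N/' | wc -l | tr -d ' ')
echo "reads kept with whitespace splitting: $naive"
```

Whether this matters for illumina-reads.fastq.gz depends on how its read names are formatted, so it is worth checking the first few lines of the file.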

Task
• Display the content of rawdata/fastq-withN.fastq.gz
• How many entries have an N in the sequence?

Think task
• Write all fastq entries that are shorter than 20 bp into a separate zipped fastq file. You need the awk command length(somecolumn). Send me this file by email: rokofler at gmail.com

Advanced topics: iterating over many files

The Unix command line also allows you to iterate over many files. For example, we may have forgotten to add the extension .txt to several text files

    # let's first create a folder containing our files
    mkdir iterate
    touch iterate/1
    ...
    touch iterate/7

Now let's change the file names

    for i in iterate/*; do mv "$i" "$i.txt"; done

Note: $i is the variable that contains the file name. You could use any variable name. When you are able to wield for-loops and Unix commands, you can perform powerful analyses in little time :) Note: you can also destroy your entire data set in little time, so be careful when using for-loops and test them before using them.
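
One way to test a for-loop before using it is a dry run: put echo in front of the command, so the loop only prints what it would do. A minimal sketch in a temporary folder, with three hypothetical files named 1, 2 and 3:

```shell
work=$(mktemp -d)
mkdir "$work/iterate"
for n in 1 2 3; do touch "$work/iterate/$n"; done

# Dry run: echo prints the mv commands instead of executing them
for i in "$work"/iterate/*; do echo mv "$i" "$i.txt"; done

# The printed commands look right, so now run the loop for real
for i in "$work"/iterate/*; do mv "$i" "$i.txt"; done

renamed=$(ls "$work/iterate")
echo "$renamed"
```

The double quotes around $i keep file names with spaces in one piece.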

Summary

I hope this helps to illustrate the Unix philosophy: with a few simple commands, concatenated in creative ways, powerful analyses may be achieved. You don't need programming for this; you only require a bit of creativity. It's also very concise: many tasks can be solved with a single line of code. Note that some tools, like awk, for-loops and shell scripting, are very powerful; spending more time on learning them will certainly pay off during your further career.
