
Unix: Beyond the Basics

BaRC Hot Topics – Oct 19, 2015
Bioinformatics and Research Computing, Whitehead Institute

http://barc.wi.mit.edu/hot_topics/

Login

• Requesting a tak account: http://iona.wi.mit.edu/bio/software/unix/bioinfoaccount.php
• Windows → PuTTY → need to set up X-windows for graphical display
• Macs → access the Terminal: Go → Utilities → Terminal; XQuartz is needed for X-windows on newer OS X.

Login using Secure Shell

ssh -Y user@tak

PuTTY on Windows

Terminal on Macs

Command Prompt: user@tak ~$

Hot Topics website: http://jura.wi.mit.edu/bio/education/hot_topics/
• Create a directory for the exercises and use it as your working directory
$ cd /nfs/BaRC_training
$ mkdir john_doe
$ cd john_doe
• Copy all files into your working directory
$ cp -r /nfs/BaRC_training/UnixII/* .
• You should have the files below in your working directory:
– foo.txt, sample1.txt, exercise.txt, datasets folder
– you can check with the ls command

Unix Review: Commands

command [arg1 arg2 … ] [input1 input2 … ]

$ sort -k2,3nr foo.tab
-k start,end: sort key (here fields 2 through 3)
-n or -g: numeric sort; -n is recommended, except for scientific notation or a leading '+'
-r: reverse order

$ cut -f1,5 foo.tab
$ cut -f1-5 foo.tab
-f: select only these fields
-f1,5: select 1st and 5th fields
-f1-5: select 1st, 2nd, 3rd, 4th, and 5th fields

$ wc -l foo.txt
How many lines are in this file?
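To make the sort flags concrete, here is a minimal sketch on a hypothetical three-column, tab-delimited foo.tab (name, sample, score); a numeric reverse sort on the 3rd field puts the largest score first:

$ cat foo.tab
geneA   s1   2.5
geneB   s2   10.1
geneC   s1   3.7
$ sort -k3,3nr foo.tab
geneB   s2   10.1
geneC   s1   3.7
geneA   s1   2.5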

Unix Review: Common Mistakes

• Case sensitive
cd /nfs/Barc_Public vs cd /nfs/BaRC_Public
-bash: cd: /nfs/Barc_Public: No such file or directory

• Spaces may matter!
rm -f myFiles* vs rm -f myFiles *
(the second form also deletes every file in the current directory)

Unix Review: Pipes

• Use the output of one command/program as input for another
– Avoid intermediate file(s)

$ cut -f 1 myFile.txt | sort | uniq -c > uniqCounts.txt
|: pipe
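As a sketch of what this pipeline produces (assuming a hypothetical myFile.txt whose first column holds gene names), uniq -c prints each distinct value with its count; sort must come before uniq because uniq only collapses adjacent duplicate lines:

$ cut -f 1 myFile.txt | sort | uniq -c
      2 geneA
      1 geneB
      3 geneC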

What you will learn…

• sed
• awk / bioawk
• groupBy (bedtools)
• for loops
• joining files together
• shell scripting

sed: stream editor for filtering and transforming text

• Print lines 10 - 15:
$ sed -n '10,15p' bigFile > selectedLines.txt
• Delete 5 header lines at the beginning of a file:

$ sed '1,5d' file > fileNoHeader
• Remove all version numbers (e.g., '.1') from the end of a list of sequence accessions (e.g., NM_000035.2):

$ sed 's/\.[0-9]\+//g' accsWithVersion > accsOnly

s: substitute
g: global modifier
d: delete line
p: print the current pattern space
-n: only print those lines matching our pattern
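The sed commands above can also be combined in a single pass with -e; a sketch, assuming a hypothetical accsWithVersion file that additionally carries 3 header lines:

# delete the 3 header lines, then strip the version numbers, in one sed call
$ sed -e '1,3d' -e 's/\.[0-9]\+//g' accsWithVersion > accsOnly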

Regular Expressions

• A more flexible and easier way to search
• Commonly used regular expressions:

Regular Expression   Matches
.                    Any single character
*                    Zero or more; wildcard
+                    One or more
^                    Beginning of a line
$                    End of a line

• Examples
– list all txt files: ls *.txt
– replace CHR with Chr at the beginning of each line: sed 's/^CHR/Chr/' myFile.txt

Note: syntax may slightly differ between sed, awk, and the Unix shell, e.g., \+ in sed
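One more sketch using the ^ and $ anchors above: an empty line is one whose beginning and end meet, so blank lines can be deleted with sed (assuming a hypothetical myFile.txt):

$ sed '/^$/d' myFile.txt > myFile.noBlankLines.txt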

awk

• Name comes from the original authors: Alfred V. Aho, Peter J. Weinberger, Brian W. Kernighan
• A simple programming language
• Good for short programs, including parsing files

awk

#: comment, ignored by awk
By default, awk splits each line by spaces

• Print the 2nd and 1st fields of the file: $ awk ' { print $2"\t"$1 } ' foo.tab

• Convert sequences from tab-delimited format to fasta format:
$ awk ' { print ">"$1"\n"$2 } ' foo.tab > foo.fa

$ head -1 foo.tab
Seq1    ACTGCATCAC
$ head -2 foo.fa
>Seq1
ACTGCATCAC

Character   Description
\n          newline
\r          carriage return
\t          horizontal tab

awk: field separator

• Issues with the default separator (white space):
– one field is a gene description with multiple words
– consecutive empty cells
• To use tab as the separator:
$ awk -F "\t" '{ print NF }' foo.txt
or
$ awk 'BEGIN {FS="\t"} { print NF }' foo.txt

BEGIN: action before reading input
NF: number of fields in the current record
FS: input field separator
OFS: output field separator
END: action after reading input
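A short sketch tying these variables together, assuming a hypothetical tab-delimited foo.txt: the first command swaps the first two columns while keeping tab as the output separator; the second uses an END block to count the records, like wc -l.

$ awk 'BEGIN { FS=OFS="\t" } { print $2, $1 }' foo.txt
$ awk 'END { print NR }' foo.txt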

awk: arithmetic operations

Add the average of the 4th and 5th fields to each line of the file:
$ awk '{ print $0"\t"($4+$5)/2 }' foo.tab

$0: all fields

Operator   Description
+          Addition
-          Subtraction
*          Multiplication
/          Division
%          Modulo
^ or **    Exponentiation
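A small sketch using the modulo operator, assuming a hypothetical foo.tab: keep only the even-numbered lines (NR is the current line number; a pattern with no action prints the matching line).

$ awk 'NR % 2 == 0' foo.tab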

awk: making comparisons

Print out records if the value in the 4th or 5th field is above 4:
$ awk '{ if( $4>4 || $5>4 ) print $0 } ' foo.tab

Operator   Description
>          Greater than
<          Less than
<=         Less than or equal to
>=         Greater than or equal to
==         Equal to
!=         Not equal to
~          Matches
!~         Does not match
||         Logical OR
&&         Logical AND
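A sketch combining a regular-expression match (~) with a numeric test (&&), assuming a hypothetical foo.txt whose 1st field is a RefSeq accession and whose 4th field is an expression value:

$ awk '$1 ~ /^NM_/ && $4 >= 4' foo.txt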

awk

• Conditional statements:
if (condition1) action1
else if (condition2) action2
else action3

sub function for substitution:
sub( regexp, replacement, target )

Display expression levels for the gene NANOG:
$ awk '{ if(/NANOG/) print $0 }' foo.txt
or
$ awk '/NANOG/ { print $0 } ' foo.txt
or
$ awk '/NANOG/' foo.txt

Add line number to the above output:
$ awk '/NANOG/ { print NR"\t"$0 }' foo.txt
NR: line number of the current row

• Looping:
for ( i = 1; i <= NF; i++ )
    print $i

Calculate the average expression (4th, 5th and 6th fields in this case) for each transcript:
$ awk '{ total= $4 + $5 + $6; avg=total/3; print $0"\t"avg}' foo.txt
or
$ awk '{ total=0; for (i=4; i<=6; i++) total=total+$i; avg=total/3; print $0"\t"avg }' foo.txt
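The sub function listed above is not shown in action; a minimal sketch, assuming the 1st field of a hypothetical foo.txt contains accession.version identifiers (this mirrors the earlier sed example):

# sub() edits $1 in place, removing the trailing version number
$ awk '{ sub(/\.[0-9]+$/, "", $1); print $1 }' foo.txt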

bioawk*

• Extension of awk for commonly used file formats in bioinformatics

$ bioawk -c help
bed:
    1:chrom 2:start 3:end 4:name 5:score 6:strand 7:thickstart 8:thickend 9:rgb 10:blockcount 11:blocksizes 12:blockstarts
sam:
    1:qname 2:flag 3:rname 4:pos 5:mapq 6:cigar 7:rnext 8:pnext 9:tlen 10:seq 11:qual
vcf:
    1:chrom 2:pos 3:id 4:ref 5:alt 6:qual 7:filter 8:info
gff:
    1:seqname 2:source 3:feature 4:start 5:end 6:score 7:filter 8:strand 9:group 10:attribute
fastx:
    1:name 2:seq 3:qual 4:comment

*https://github.com/lh3/bioawk

bioawk: Examples

• Print start coordinates in a gff/gtf file:
$ bioawk -c gff '{print $start}' Homo_sapiens.GRCh37.75.canonical.gtf
or
$ bioawk -c gff '{print $4}' Homo_sapiens.GRCh37.75.canonical.gtf
11869
12613
13221

• Print only the sequences in a fastq file:
$ bioawk -c fastx '{print $seq}' myFastQ.txt
or
$ bioawk -c fastx '{print $2}' myFastQ.txt
GGAGTTACATTGATGGTGGAGAGTTGGCTGGGTATTCCCAATGTCCCAAAGGTCATAAAGGAGGGAAAAGAAAAAAAGAAAAAAATATTCAAAACAAATC
NTAACGGATCACGTTGGCCAGGTTACCCCTCCAGAAGGAGAGGAAGCCCTGCTCCTTAGGGATTCTCACCACACAATCAATGATCCCTTTGTACTGCTTC
CGAAGGTGGTGATGGTGCCCGTCTTCACCAGGAACTGGTCCACGCCCACGAGGCCCACAATGTTCCCACAAGGCACATCCTCGATGGGCTCCACGTAGCG
CTTGAATTTTCTAAATGCAACTTCATCATTCTGCAAATCAGCAAGACTCACTTCAAACACACGACCCTTGAGACCATCAGATGCAATTTTGGTTCCTTGG
TGAGCGAGAGGTTGTGGCGGATTGTGTTTTGCCCGGCGGGGGACCTTTCCCCGGAGGAGGGGGAGCGGCCCCCGATGAACTCACAGAAACTCCTCCGCCT
ATTATACTCATCGCGAATATCCTTAAGAGGGCGTTCAGCAGCCAGCTTGCGGCAAAACTGCGTAACCGTCTTCTCGTTCTCTAAAAACCATTTTTCGTCC
TCCTGCAGCTTGATGGAGATACCTCTTACTGGGCCTCTCTGAATTCGCTTCATCAGATGCGTGACATAACCTGCTATCTTGTTGCGGAGCTTTTTGCTGG
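Another fastx sketch (assuming the same myFastQ.txt): print each read name with its length, using awk's built-in length() function on the named $seq column.

$ bioawk -c fastx '{ print $name"\t"length($seq) }' myFastQ.txt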

Summarize by Columns: groupBy (from bedtools)

-g grpCols: column(s) for grouping
-c opCols: column(s) to be summarized
-o: operation(s) applied to opCols:

sum, count, min, max, mean, median, stdev, collapse, distinct,
freqdesc (i.e., print desc. list of values:freq),
freqasc (i.e., print asc. list of values:freq)

Print the maximum value (3rd field) for each gene (2nd field) in the file:
$ groupBy -g 2 -c 3 -o max foo.tab
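groupBy expects its input to be pre-sorted on the grouping column; a sketch that sorts on the 2nd field first and then reports both the mean and the max of the 3rd field for each gene (same hypothetical foo.tab):

$ sort -k2,2 foo.tab | groupBy -g 2 -c 3,3 -o mean,max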

Join files together

• Unix join
$ join -1 2 -2 3 FILE1 FILE2
Join files on the 2nd field of FILE1 with the 3rd field of FILE2, only showing the common lines. FILE1 and FILE2 must be sorted on the join fields before running join (see the sketch after this slide).

• BaRC scripts: /nfs/BaRC_Public/BaRC_code// – sorting not required
$ join2filesByFirstColumn.pl file1 file2
$ submatrix_selector.pl matrixFile rowIDfile or 0 [columnIDfile]
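A sketch of the sort-then-join step for Unix join, using process substitution (introduced on the Advanced Topics slide below) so that no intermediate sorted files are written; FILE1 and FILE2 stand for the hypothetical inputs above:

$ join -1 2 -2 3 <(sort -k2,2 FILE1) <(sort -k3,3 FILE2) > joined.txt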

Shell Flavors

• Syntax (for scripting) depends on the shell; check your current shell with echo $SHELL
• Several shells are available; bash is the default on tak and is commonly used.
• Some of the shells (incomplete listing):

Shell   Name
sh      Bourne shell
bash    Bourne-Again shell
ksh     Korn shell
csh     C shell


Shell Script

• Automation: avoid having to retype the same commands many times
• Ease of use and more efficient
• Outline of a script:
  #!/bin/bash   interprets how to run the script
  commands…     set of commands used in the script
  #comments     comments using “#”
• Commonly used extension for a script is .sh (e.g., foo.sh); the file must have executable permission
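A minimal sketch of such a script (hypothetical name count_lines.sh, counting the lines of a foo.txt assumed to be in the current directory), followed by making it executable and running it:

#!/bin/bash
# count_lines.sh: report how many lines foo.txt contains
wc -l foo.txt

$ chmod +x count_lines.sh
$ ./count_lines.sh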

Bash Shell: for loop

• Process multiple files with one command
• Reduce computational time by using many cluster nodes

for myfile in `ls dataDir`
do
   bsub wc -l dataDir/$myfile
done

When referring to a variable, $ is needed before the variable name ($myfile), but $ is not needed when defining it (myfile=...). Note: this syntax is different from awk.

Example

#!/bin/sh
# 1. Take two arguments: the first one is a directory with all the datasets,
#    the second one is an output directory
# 2. For each dataset, calculate the average gene expression and save the results
#    in a file in the output directory

inDir=$1    # 1st argument
outDir=$2   # 2nd argument; outDir must already exist
# Defining variables: no spaces on either side of the equal sign

for i in `/bin/ls $inDir`    # refer to a variable with $
do
    # output file name
    name="${i}_avg.txt"    # {} prevents the variable name from being misread as $i_avg
    # calculate average gene expression; input lines look like:
    # NM_001039201 Hdhd2 5.0306 5.3309 5.4998
    sort -k2,2 $inDir/$i | groupBy -g 2 -c 3,4,5 -o mean,mean,mean >| $outDir/$name
done
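A hypothetical invocation of the script above, assuming it was saved as calc_avg.sh, that datasets/ holds the input files, and that results/ already exists:

$ chmod +x calc_avg.sh
$ ./calc_avg.sh datasets results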

# You can use graphical editors such as nedit, gedit, or xemacs to create shell scripts

Advanced Topics

Create multiple inputs to a command without creating intermediate files

• Named Pipes: stream output from two different commands into another program
$ mkfifo temp              # named pipe
$ sort myFile > temp &     # data is NOT written to a file!
$ mkfifo temp2             # named pipe
$ sort myFile2 > temp2 &
$ ./someProgram -in temp -in2 temp2 > out
$ rm temp temp2            # remove the temp and temp2 named pipes

• Process Substitution
# diff two files in one command
$ diff <(sort myFile1) <(sort myFile2)
# run fastqc using the fastq file from the GTC, where it's a tar.gz file
$ cat <(tar xvzfo /lab/solexa_public/…/ACTTGA_sequence.txt.tar.gz /lab/solexa_public/…/ACTTGA_sequence.txt) | fastqc stdin
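One more process-substitution sketch: compare two hypothetical gzipped files without first decompressing them to disk.

$ diff <(zcat file1.txt.gz) <(zcat file2.txt.gz)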

Further Reading

• BaRC one liners: – http://iona.wi.mit.edu/bio/bioinfo/scripts/#unix • Unix Info for Bioinfo: – http://iona.wi.mit.edu/bio/barc_site_map.php • Online books via MIT: – http://proquest.safaribooksonline.com/ – Bioinformatics Data Skills by V.Buffalo (2015) • MIT: Turning Biologists into Bioinformaticists – http://rous.mit.edu/index.php/Teaching • Bash Guide for Beginners: – http://tldp.org/LDP/Bash-Beginners-Guide/html/
