Unix: Beyond the Basics
George W Bell, Ph.D.
BaRC Hot Topics – October 2016
Bioinformatics and Research Computing, Whitehead Institute
http://barc.wi.mit.edu/hot_topics/

Hot Topics website: http://jura.wi.mit.edu/bio/education/hot_topics/
• Our main server is called tak
• Request a tak account: http://iona.wi.mit.edu/bio/software/unix/bioinfoaccount.php
• Logging in from Windows
  – PuTTY for ssh
  – Xming for graphical display [optional]
• Logging in from Mac
  – Access the Terminal: Go → Utilities → Terminal
  – XQuartz is needed for X-windows on newer versions of OS X

Logging in to our Unix server
• Log in using secure shell: PuTTY on Windows, Terminal on Macs
$ ssh -Y user@tak
• Command prompt: user@tak ~$
• Create a directory for the exercises and use it as your working directory:
$ cd /nfs/BaRC_training
$ mkdir john_doe
$ cd john_doe
• Copy all the files into your working directory:
$ cp -r /nfs/BaRC_training/UnixII/* .
• You should now have these files in your working directory:
  – foo.txt, sample1.txt, exercise.txt, and the datasets folder
  – You can check that they are there with the 'ls' command

Unix Review: Commands
• command [arg1 arg2 …] [input1 input2 …]
$ sort -k2,3nr foo.tab
  – -k2,3: the sort key starts at field 2 and ends at field 3
  – -n or -g: sort numerically; -n is recommended, except for scientific notation or a leading '+'
  – -r: reverse order
$ cut -f1,5 foo.tab
$ cut -f1-5 foo.tab
  – -f: select only these fields
  – -f1,5: select the 1st and 5th fields
  – -f1-5: select the 1st through 5th fields
$ wc -l foo.txt
  – How many lines are in this file?

Unix Review: Pipes
• Stream the output of one command/program as the input for another
  – Avoids intermediate file(s)
$ cut -f 1 myFile.txt | sort | uniq -c > uniqCounts.txt

Unix Review: Common Mistakes
• Unix is case sensitive
  cd /nfs/Barc_Public   vs.   cd /nfs/BaRC_Public
  -bash: cd: /nfs/Barc_Public: No such file or directory
• Spaces may matter!
  rm -f myFiles*   vs.   rm -f myFiles *
• Office applications can convert text to special characters that Unix won't understand
  – Ex: smart quotes, dashes

What we will discuss today
• Aliases (to reduce typing)
• sed (for file manipulation)
• awk/bioawk (to filter by column)
• groupBy (from bedtools; not typical Unix)
• join (to merge files)
• loops (one-line and with shell scripts)
• scripting (to streamline commands)

Aliases
• Add a one-word link to a longer command
• To list your current aliases (typically set in ~/.bashrc):
$ alias
• Create a new alias (two examples):
alias sp='cd /lab/solexa_public/Reddien'
alias CollectRnaSeqMetrics='java -jar /usr/local/share/picard-tools/CollectRnaSeqMetrics.jar'
• Make an alias permanent
  – Paste the command(s) into ~/.bashrc

sed: stream editor for filtering and transforming text
• Print lines 10 - 15:
$ sed -n '10,15p' bigFile > selectedLines.txt
• Delete 5 header lines at the beginning of a file:
$ sed '1,5d' file > fileNoHeader
• Remove all version numbers (eg: '.1') from the end of a list of sequence accessions (eg. NM_000035.2):
$ sed 's/\.[0-9]\+//g' accsWithVersion > accsOnly
  – s: substitute; g: global modifier (change all occurrences)

Regular Expressions
• Pattern matching makes searching easier
• Commonly used regular expressions (pattern / what it matches):
  .     any single character
  *     zero or more of the preceding character; wildcard
  +     one or more of the preceding character
  ?     zero or one of the preceding character
  ^     beginning of a line
  $     end of a line
  [ab]  any character in the brackets
• Examples
  – List all txt files:
$ ls *.txt
  – Replace CHR with Chr at the beginning of each line:
$ sed 's/^CHR/Chr/' myFile.txt
  – Delete a dot followed by one or more numbers:
$ sed 's/\.[0-9]\+//g' myFile.txt
• Note: regular expression syntax may differ slightly between sed, awk, the Unix shell, and Perl
  – Ex: \+ in sed is equivalent to + in Perl
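To make that last note concrete, here is a small hedged example: the same substitution written for sed and then as a Perl one-liner. The file accsWithVersion is the hypothetical accession list from the sed slide, and the perl -pe form is offered as an assumed equivalent rather than something shown on the original slides.

$ sed 's/\.[0-9]\+//g' accsWithVersion > accsOnly
$ perl -pe 's/\.[0-9]+//g' accsWithVersion > accsOnly

Both commands delete a dot followed by one or more digits from every line; the only difference is that sed's basic regular expressions require the + to be escaped as \+, while Perl uses a bare +.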
awk
• The name comes from the original authors: Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan
• A simple programming language
• Good for filtering and manipulating multiple-column files

awk
• By default, awk splits each line by whitespace
• Print the 2nd and 1st fields of the file:
$ awk '{ print $2 "\t" $1 }' foo.tab
• Convert sequences from tab-delimited format to fasta format:
$ head -1 foo.tab
Seq1    ACTGCATCAC
$ awk '{ print ">" $1 "\n" $2 }' foo.tab > foo.fa
$ head -2 foo.fa
>Seq1
ACTGCATCAC

awk: field separator
• Issues with the default separator (whitespace):
  – one field is a gene description with multiple words (e.g. a "Sequence Description" column)
  – consecutive empty cells
• To use tab as the separator (two equivalent ways):
$ awk -F "\t" '{ print NF }' foo.txt
$ awk 'BEGIN {FS="\t"} { print NF }' foo.txt
  – BEGIN: action taken before reading the input
  – END: action taken after reading the input
  – NF: number of fields in the current record
  – FS: input field separator
  – OFS: output field separator
• Escape characters:
  \n  newline
  \r  carriage return
  \t  horizontal tab

awk: arithmetic operations
• Add the average of the 4th and 5th fields to the file:
$ awk '{ print $0 "\t" ($4+$5)/2 }' foo.tab
  – $0: the entire record (all fields)
• Arithmetic operators:
  +   addition
  -   subtraction
  *   multiplication
  /   division
  %   modulo
  ^   exponentiation
  **  exponentiation

awk: making comparisons
• Print records whose 4th or 5th field is above 4:
$ awk '{ if( $4>4 || $5>4 ) print $0 }' foo.tab
• Comparison and logical operators:
  >   greater than
  <   less than
  <=  less than or equal to
  >=  greater than or equal to
  ==  equal to
  !=  not equal to
  ~   matches
  !~  does not match
  ||  logical OR
  &&  logical AND

awk
• Conditional statements
  – Display expression levels for the gene NANOG (three equivalent ways):
$ awk '{ if(/NANOG/) print $0 }' foo.txt
$ awk '/NANOG/ { print $0 }' foo.txt
$ awk '/NANOG/' foo.txt
  – Add the line number to the above output:
$ awk '/NANOG/ { print NR "\t" $0 }' foo.txt
  – NR: line (record) number of the current row
• Looping
  – Calculate the average expression (the 4th, 5th, and 6th fields in this case) for each transcript (two equivalent ways):
$ awk '{ total = $4 + $5 + $6; avg = total/3; print $0 "\t" avg }' foo.txt
$ awk '{ total = 0; for (i = 4; i <= 6; i++) total = total + $i; avg = total/3; print $0 "\t" avg }' foo.txt
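The pieces above can be combined in one command. As a hedged illustration only, assuming foo.txt is tab-delimited with expression values in fields 4-6 and using an arbitrary cutoff of 10: print the first field and the average of fields 4-6, but only for rows where that average is above 10, keeping the output tab-delimited.

$ awk 'BEGIN { FS = OFS = "\t" } { avg = ($4 + $5 + $6)/3; if (avg > 10) print $1, avg }' foo.txt

Setting OFS in the BEGIN block makes the comma in print emit a tab between the two output columns.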
bioawk*
• An extension of awk for commonly used file formats in bioinformatics
• To list the named columns for each supported format:
$ bioawk -c help
bed:
  1:chrom 2:start 3:end 4:name 5:score 6:strand 7:thickstart 8:thickend 9:rgb 10:blockcount 11:blocksizes 12:blockstarts
sam:
  1:qname 2:flag 3:rname 4:pos 5:mapq 6:cigar 7:rnext 8:pnext 9:tlen 10:seq 11:qual
vcf:
  1:chrom 2:pos 3:id 4:ref 5:alt 6:qual 7:filter 8:info
gff:
  1:seqname 2:source 3:feature 4:start 5:end 6:score 7:filter 8:strand 9:group 10:attribute
fastx:
  1:name 2:seq 3:qual 4:comment
*https://github.com/lh3/bioawk

bioawk: Examples
• Print transcript info and chromosome from a gff/gtf file (2 ways):
$ bioawk -c gff '{print $group "\t" $seqname}' Homo_sapiens.GRCh37.75.canonical.gtf
$ bioawk -c gff '{print $9 "\t" $1}' Homo_sapiens.GRCh37.75.canonical.gtf
  Sample output:
  gene_id "ENSG00000223972"; transcript_id "ENST00000518655";  chr1
  gene_id "ENSG00000223972"; transcript_id "ENST00000515242";  chr1
• Convert a fastq file into fasta (2 ways):
$ bioawk -c fastx '{print ">" $name "\n" $seq}' sequences.fastq
$ bioawk -c fastx '{print ">" $1 "\n" $2}' sequences.fastq

Summarize by Columns: groupBy (from bedtools)
• The input file must be pre-sorted by the grouping column(s)!
• Options:
  -g grpCols  column(s) used for grouping
  -c opCols   column(s) to be summarized
  -o          operation(s) applied to each opCol: sum, count, min, max, mean, median, stdev, collapse (comma-separated list), distinct (non-redundant comma-separated list)
• Input (Ensembl Gene ID, Ensembl Transcript ID, Symbol):
  ENSG00000281518  ENST00000627423  FOXO6
  ENSG00000281518  ENST00000630406  FOXO6
  ENSG00000280680  ENST00000625523  HHAT
  ENSG00000280680  ENST00000627903  HHAT
  ENSG00000280680  ENST00000626327  HHAT
  ENSG00000281614  ENST00000629761  INPP5D
  ENSG00000281614  ENST00000630338  INPP5D
• Print the gene ID (1st column), the gene symbol, and a list of transcript IDs (2nd field):
$ sort -k1,1 Ensembl_info.txt | groupBy -g 1 -c 3,2 -o distinct,collapse
  Partial output (Ensembl Gene ID, Symbol, Ensembl Transcript ID):
  ENSG00000281518  FOXO6  ENST00000627423,ENST00000630406
  ENSG00000280680  HHAT   ENST00000625523,ENST00000626327,ENST00000627903

Join files together
• With Unix join:
$ join -1 1 -2 2 -t $'\t' FILE1 FILE2
  – Joins files on the 1st field of FILE1 and the 2nd field of FILE2, only showing the common lines
  – FILE1 and FILE2 must be sorted on the join fields before running join
• With BaRC scripts (sorting not required):
  – Code is in /nfs/BaRC_Public/BaRC_code/Perl/
$ join2filesByFirstColumn.pl file1 file2
• Sample tables to join:
  Symbol   Heart  Skeletal Muscle  Skin   Smooth Muscle  Spinal cord
  HHAT     8.15   7.7              5      6.55           6.4
  INPP5D   19.65  5.95             4.55   5.25           14.5
  NDUFA10  441.8  160.2            24.9   188.85         158.75
  RPS6KA1  85.2   47.75            46.45  35.85          44.55
  RYBP     20.45  13.05            11.95  20.7           17.75
  SLC16A1  15.45  20.45            12.2   248.35         27.15

  Ensembl Gene ID  Symbol
  ENSG00000252303  RNU6-280P
  ENSG00000280584  OBP2B
  ENSG00000280680  HHAT
  ENSG00000280775  RNA5SP136
  ENSG00000280820  LCN1P1
  ENSG00000280963  SERTAD4-AS1

Shell Flavors
• Syntax (for scripting) depends on the shell
$ echo $SHELL    # /bin/bash (on tak)
• bash is common and is the default on tak
• Some Unix shells (an incomplete listing):
  Shell  Name
  sh     Bourne
  bash   Bourne-Again
  ksh    Korn shell
  csh    C shell

Shell script advantages
• Automation: avoid having to retype the same commands many times
• Ease of use and greater efficiency
• Outline of a script:
  #!/bin/bash    shebang: tells the system how to interpret and run the script
  commands…      the set of commands used in the script
  #comments      write comments using "#"
• The commonly used extension for a script is .sh (eg. foo.sh), and the file must have executable permission

Bash Shell: 'for' loop
• Process multiple files with one command
• Reduce computational time by spreading jobs over many cluster nodes
  for mySam in `/bin/ls *.sam`
  do
    bsub wc -l $mySam
  done
• When referring to a variable, $ is needed before the variable name ($mySam), but $ is not needed when defining it (mySam)
• The identical command on one line:
  for samFile in `/bin/ls *.sam`; do bsub wc -l $samFile; done

Shell script example
#!/bin/sh
# 1.
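A minimal sketch of a complete script in this format, assuming a directory of SAM files like the ones used on the 'for' loop slide; the file names, the numbered steps, and the output file are illustrative assumptions, not the original example.

#!/bin/bash
# 1. Loop over every SAM file in the current directory
# 2. Count the lines in each file and append the result to a summary file
for mySam in *.sam
do
    # wc -l prints the line count followed by the file name
    wc -l "$mySam" >> sam_line_counts.txt
done

Save a script like this as, for example, countSamLines.sh, make it executable with chmod +x countSamLines.sh, and run it with ./countSamLines.sh.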