Essential Skills for Bioinformatics: Unix/Linux SHELL SCRIPTING Overview

• Bash, the shell we have used interactively in this course, is a full-fledged scripting language. Unlike Python, Bash is not a general-purpose language.
• Bash is explicitly designed to make running and interfacing command-line programs as simple as possible. For these reasons, Bash often takes the role of the glue language of bioinformatics, as it’s used to glue many commands together into a cohesive workflow.
• Note that Python is a more suitable language for commonly reused or advanced pipelines. Python is a more modern, fully featured scripting language than Bash.
• Compared to Python, Bash lacks several nice features useful for data-processing scripts: good numeric type support, useful data structures, good string processing, refined option parsing, availability of a large number of libraries, and powerful functions that help with structuring your programs.
• However, there’s more overhead when calling command-line programs from a Python script compared to Bash. Bash is often the best and quickest “glue” solution.

Writing and running bash scripts
• Most Bash scripts in bioinformatics are simply commands organized into a re-runnable script, with some features to check that files exist and to ensure that any error causes the script to abort.
• We will learn the basics of writing and executing Bash scripts, paying particular attention to how to create robust Bash scripts.

A robust Bash header
• By convention, Bash scripts have the extension .sh. You can create them in your favorite text editor (e.g. emacs or vi).
• Anytime you write a Bash script, you should use the following Bash script header, which sets some Bash options that lead to more robust scripts.

    #!/bin/bash
    set -e
    set -u
    set -o pipefail

• #!/bin/bash
  This is called the shebang, and it indicates the path to the interpreter used to execute this script.
• set -e
  By default, a shell script containing a command that fails will not cause the entire shell script to exit: the shell script will just continue on to the next line. We always want errors to be loud and noticeable. This option changes the default by terminating the script if any command exits with a nonzero exit status. Note that this option ignores nonzero statuses in if conditionals. Also, it ignores all exit statuses in Unix pipes except the last one.
• set -u
  This option fixes another default behavior of Bash scripts: any command containing a reference to an unset variable name will still run. It prevents this type of error by aborting the script if a variable’s value is unset.
• set -o pipefail
  set -e will cause a script to abort if a nonzero exit status is encountered, with some exceptions. One such exception is when a program run in a Unix pipe exits unsuccessfully. Including set -o pipefail prevents this undesirable behavior: any program that returns a nonzero exit status in the pipe will cause the entire pipe to return a nonzero status. With set -e enabled, this will lead the script to abort.

Running bash scripts
• Running Bash scripts can be done in one of two ways:
  1. bash script.sh
  2. ./script.sh
• While we can run any script with bash, calling the script as an executable requires that it has executable permissions. We can set these using:

    chmod u+x script.sh

• This adds executable permissions for the user who owns the file. Then, the script can be run with ./script.sh.

Variables
• Processing pipelines have numerous settings that should be stored in variables. Storing these settings in variables defined at the top of the file makes adjusting settings and rerunning your pipelines much easier.
• Rather than having to change numerous hardcoded values in your scripts, using variables to store settings means you only have to change one value.
• Bash also reads command-line arguments into variables.
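To see what the three options of the robust Bash header change in practice, the following minimal sketch runs each case in a child bash process, so that a deliberate failure does not abort the demonstration itself. The function name demo_header_options and the variable name undefined_setting are made up for illustration.

```shell
#!/bin/bash
set -e
set -u
set -o pipefail

# demo_header_options is a hypothetical helper. Each experiment runs in a
# child bash process so that its failure cannot abort this script.
demo_header_options() {
    # Without set -e, execution continues past the failing command.
    bash -c 'false; echo "no set -e: still running"'

    # With set -e, the child stops at the failing command.
    bash -c 'set -e; false; echo "not reached"' || echo "set -e: aborted"

    # With set -u, referencing an unset variable is an error.
    bash -c 'set -u; echo "value: $undefined_setting"' 2> /dev/null \
        || echo "set -u: aborted"

    # A pipe normally reports only the exit status of its last command.
    bash -c 'false | cat; echo "pipe status without pipefail: $?"'
    bash -c 'set -o pipefail; false | cat; echo "pipe status with pipefail: $?"'
}

demo_header_options
```

The guarded calls use || so that the expected failures are reported instead of triggering set -e in the outer script.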
• Bash’s variables don’t have data types. It’s helpful to think of Bash’s variables as strings.
• We can create a variable and assign it a value with:

    results_dir="results/"

• Note that spaces matter when setting Bash variables. Do not use spaces around the equal sign.
• To access a variable’s value, we use a dollar sign in front of the variable’s name.
• Suppose we want to create a directory for a sample’s alignment data, called <sample>_aln/, where <sample> is replaced by the sample’s name.

    sample="CNTRL01A"
    mkdir "${sample}_aln/"

Command-line arguments
• The variable $0 stores the name of the script, and command-line arguments are assigned to the variables $1, $2, $3, etc. Bash assigns the number of command-line arguments to $#.
• If you find your script requires numerous or complicated options, it might be easier to use Python instead of Bash. Python’s argparse module is much easier to use.
• Variables created in your Bash script will only be available for the duration of the Bash process running that script.

if statement
• Bash supports the standard if conditional statement. The basic syntax is:

    if [commands]
    then
        [if-statements]
    else
        [else-statements]
    fi

• A command’s exit status provides the true and false. Remember that 0 represents true/success and anything else is false/failure.
• if [commands]
  [commands] could be any command, set of commands, pipeline, or test condition. If the exit status of these commands is 0, execution continues to the block after then. Otherwise, execution continues to the block after else.
• [if-statements] is a placeholder for all statements executed if [commands] evaluates to true (0).
• [else-statements] is a placeholder for all statements executed if [commands] evaluates to false. The else block is optional.
• Bash is primarily designed to stitch together other commands. This is an advantage Bash has over Python when writing pipelines.
  Bash allows your scripts to work directly with command-line programs without requiring any overhead to call programs.
• Although it can be unpleasant to write complicated programs in Bash, writing simple programs is exceedingly easy because Unix tools and Bash harmonize well.
• Suppose we wanted to run a set of commands only if a file contains a certain string. grep returns 0 only if it matches a pattern in a file and a nonzero status otherwise, so it can be used directly as the condition of an if statement. Redirecting grep’s output to /dev/null tidies the output of the script: the matching lines are discarded rather than sent to the script’s standard out.

test
• Like other programs, test exits with either 0 or 1. However, test’s exit status indicates the return value of the test specified through its arguments, rather than exit success or error. test supports numerous standard comparison operators.

    String/integer        Description
    -z str                String str is null
    str1 = str2           str1 and str2 are identical
    str1 != str2          str1 and str2 are different
    int1 -eq int2         Integers int1 and int2 are equal
    int1 -ne int2         int1 and int2 are not equal
    int1 -lt int2         int1 is less than int2
    int1 -gt int2         int1 is greater than int2
    int1 -le int2         int1 is less than or equal to int2
    int1 -ge int2         int1 is greater than or equal to int2

• In practice, the most common conditions you’ll be checking are whether files or directories exist and whether you can write to them. test supports numerous file- and directory-related test operations.

    File/directory expression    Description
    -d dir                       dir is a directory
    -f file                      file is a file
    -e file                      file exists
    -r file                      file is readable
    -w file                      file is writable
    -x file                      file is executable

• Combining test with if statements is simple:

    if test -f some_file.txt
    then
        […]
    fi

• Bash provides a simpler syntactic alternative:

    if [ -f some_file.txt ]
    then
        […]
    fi

• Note the spaces around and within the brackets: these are required.
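The two kinds of condition described above, a command’s exit status and a test expression, can be sketched together as follows. This is an illustration under assumptions, not the course’s own example: the helper name contains_pattern, the temporary file, and its contents are all made up.

```shell
#!/bin/bash
set -e
set -u
set -o pipefail

# contains_pattern is a hypothetical helper: it reports whether a file
# contains a fixed string, using grep's exit status as the if condition.
contains_pattern() {
    local pattern="$1"
    local file="$2"

    # test, written as [ ... ], checks the file before grep reads it.
    if [ ! -f "$file" -o ! -r "$file" ]
    then
        echo "missing"
        return 0
    fi

    # grep exits with 0 on a match; its output is discarded in /dev/null.
    if grep -F -- "$pattern" "$file" > /dev/null
    then
        echo "found"
    else
        echo "absent"
    fi
}

# Demonstration on a throwaway file (name and contents are made up).
demo_file="$(mktemp)"
printf 'sample\tstatus\nCNTRL01A\tpassed\n' > "$demo_file"
contains_pattern "CNTRL01A" "$demo_file"    # prints "found"
contains_pattern "TREAT02B" "$demo_file"    # prints "absent"
rm -f "$demo_file"
contains_pattern "CNTRL01A" "$demo_file"    # prints "missing"
```

Because grep runs as an if condition, its nonzero exit status on a non-match does not trigger set -e.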
• When using this syntax, we can chain test expressions with -a as logical AND, -o as logical OR, and ! as negation. Our familiar && and || operators won’t work inside the brackets, because these are shell operators.

    if [ "$#" -ne 1 -o ! -r "$1" ]
    then
        echo "usage: script.sh file_in.txt"
    fi

for loop
• In bioinformatics, most of our data is split across multiple files. At the heart of any processing pipeline is some way to apply the same workflow to each of these files, taking care to keep track of sample names. Looping over files with Bash’s for loop is the simplest way to accomplish this.
• There are three essential parts to creating a pipeline to process a set of files:
  1. Selecting which files to apply the commands to
  2. Looping over the data and applying the commands
  3. Keeping track of the names of any output files created
• Suppose we have a file called samples.txt that gives basic information about our raw data: sample name, read pair, and where the file is.
• Suppose we want to loop over every file, gather quality statistics on each and every file, and save this information to an output file.
• First, we load our filenames into a Bash array, which we can then loop over. Bash arrays can be created manually by listing the elements in parentheses.
• But creating Bash arrays by hand is tedious and error prone. The beauty of Bash is that we can use a command substitution to construct Bash arrays.
• We can strip the path and extension from each filename using basename.
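The three parts listed above (selecting files, looping, and naming outputs) can be sketched as below. Everything specific here is an assumption for illustration: the scratch directory, the sample names, the three-column tab-separated layout of samples.txt, the helper stats_name, and the use of wc -l as a stand-in for a real quality-statistics program.

```shell
#!/bin/bash
set -e
set -u
set -o pipefail

# Hypothetical scratch data: three empty FASTQ files and a samples.txt
# whose assumed columns are sample name, read pair, and file path.
work_dir="$(mktemp -d)"
for name in sampleA_R1 sampleA_R2 sampleB_R1
do
    touch "$work_dir/$name.fastq"
done
printf '%s\t%s\t%s\n' \
    sampleA R1 "$work_dir/sampleA_R1.fastq" \
    sampleA R2 "$work_dir/sampleA_R2.fastq" \
    sampleB R1 "$work_dir/sampleB_R1.fastq" > "$work_dir/samples.txt"

# Build the array with command substitution instead of typing it by hand.
# This simple form assumes that the paths contain no whitespace.
sample_files=($(cut -f 3 "$work_dir/samples.txt"))

# stats_name is a hypothetical helper that derives an output filename:
# basename strips the directory and, given a suffix, the extension too.
stats_name() {
    echo "$(basename "$1" .fastq)-stats.txt"
}

for fastq_file in "${sample_files[@]}"
do
    results_file="$work_dir/$(stats_name "$fastq_file")"
    # wc -l stands in for a real quality-statistics program.
    wc -l < "$fastq_file" > "$results_file"
    echo "$(basename "$fastq_file") -> $(basename "$results_file")"
done

rm -r "$work_dir"
```

Deriving each output name from its input name keeps results tied to their samples without maintaining a second list of filenames.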
Learning Unix
• https://www.codecademy.com/learn/learn-the-command-line
• http://swcarpentry.github.io/shell-novice/
• http://korflab.ucdavis.edu/bootcamp.html
• http://korflab.ucdavis.edu/Unix_and_Perl/current.html
• https://www.learnenough.com/command-line-tutorial
• http://cli.learncodethehardway.org/book/
• https://learnxinyminutes.com/docs/bash/
• http://explainshell.com/

Sequence Alignments DATA FORMATS Overview
• Nucleotide (and protein) sequences are stored in two plain-text formats widespread in bioinformatics: FASTA and FASTQ.
