Essential Skills for Bioinformatics: Unix/Linux

BASH SCRIPTING

Overview

• Bash, the shell we have used interactively in this course, is a full-fledged scripting language. Unlike Python, Bash is not a general-purpose language.

• Bash is explicitly designed to make running and interfacing command-line programs as simple as possible. For this reason, Bash often takes the role of the glue language of bioinformatics, as it’s used to glue many commands together into a cohesive workflow.

• Note that Python is a more suitable language for commonly reused or advanced pipelines. Python is a more modern, fully featured scripting language than Bash.

• Compared to Python, Bash lacks several features useful for data-processing scripts: good numeric type support, useful data structures, good string processing, refined option parsing, a large number of available libraries, and powerful functions that help with structuring your programs.

• However, there’s more overhead when calling command-line programs from Python than from Bash. Bash is often the best and quickest “glue” solution.

Writing and running bash scripts

• Most Bash scripts in bioinformatics are simply commands organized into a re-runnable script, with some features to check that files exist and to ensure that any error causes the script to abort.

• We will learn the basics of writing and executing Bash scripts, paying particular attention to how to create robust Bash scripts.

A robust Bash header

• By convention, Bash scripts have the extension .sh. You can create them in your favorite text editor (e.g. emacs or vi).

• Anytime you write a Bash script, you should use the following Bash script header, which sets some Bash options that lead to more robust scripts.

    #!/bin/bash
    set -e
    set -u
    set -o pipefail

• #!/bin/bash: This is called the shebang, and it indicates the path to the interpreter used to execute this script.

• set -e: By default, a command that fails does not cause the entire shell script to exit: the script just continues on to the next line. We always want errors to be loud and noticeable. This option ensures that by terminating the script if any command exits with a nonzero exit status.

Note that this option ignores nonzero statuses in if conditionals. Also, it ignores all exit statuses in Unix pipes except the last one.
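The behavior of set -e described above can be checked with a minimal sketch. Each snippet runs in a child Bash process (via bash -c) so that an abort does not end the demonstration itself; the variable names are made up for this example.

```shell
# A failing command aborts the child script: "after" is never printed.
out_fail=$(bash -c 'set -e; false; echo "after"' || true)

# A nonzero status in an if condition is ignored by set -e.
out_if=$(bash -c 'set -e; if false; then :; fi; echo "after"')

# Only the last command of a pipe counts: false fails, but true succeeds.
out_pipe=$(bash -c 'set -e; false | true; echo "after"')

echo "failing command: '${out_fail}'"
echo "if conditional:  '${out_if}'"
echo "pipe:            '${out_pipe}'"
```

Only the first snippet stops early; the other two reach the final echo, which is exactly the gap in set -e that the note describes.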

• set -u: This option fixes another default behavior of Bash scripts: any command containing a reference to an unset variable name will still run. set -u prevents this type of error by aborting the script if a variable’s value is unset.
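As a small sketch of what this option changes, the snippets below reference a variable that is assumed not to exist (no_such_var is a made-up name). They run in child Bash processes so that the abort does not stop the demonstration.

```shell
# Without set -u, the unset variable silently expands to an empty string.
out_default=$(bash -c 'echo "dir: ${no_such_var}/tmp"')

# With set -u, the same reference aborts the child script with an error.
if bash -c 'set -u; echo "dir: ${no_such_var}/tmp"' 2>/dev/null
then
    status_u="ran"
else
    status_u="aborted"
fi

echo "default: ${out_default}"
echo "set -u:  ${status_u}"
```

The first case shows why the default is dangerous: a path built from an unset variable quietly becomes a different path.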

• set -o pipefail: set -e will cause a script to abort if a nonzero exit status is encountered, with some exceptions. One such exception is when a program run in a Unix pipe exits unsuccessfully. Including set -o pipefail will prevent this undesirable behavior: any program that returns a nonzero exit status in the pipe will cause the entire pipe to return a nonzero status. With set -e enabled, this will lead the script to abort.

Running bash scripts

• Running Bash scripts can be done in one of two ways:
1. bash script.sh
2. ./script.sh
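Both invocations can be tried with a minimal sketch. The script name hello.sh and the temporary directory are assumptions made for this example; the direct form needs the executable bit, which is set here with chmod u+x.

```shell
# Write a tiny script into a temporary directory (hypothetical file name).
workdir=$(mktemp -d)
cat > "${workdir}/hello.sh" <<'EOF'
#!/bin/bash
echo "hello from a script"
EOF

# Way 1: pass the script to bash; no executable permission is needed.
out1=$(bash "${workdir}/hello.sh")

# Way 2: make it executable for the owner, then call it directly.
chmod u+x "${workdir}/hello.sh"
out2=$("${workdir}/hello.sh")

echo "$out1"
echo "$out2"
rm -r "${workdir}"
```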

• While we can run any script with bash, calling the script as an executable requires that it has executable permissions. We can set these using: chmod u+x script.sh

• This adds executable permissions for the user who owns the file. Then, the script can be run with ./script.sh.

Variables

• Processing pipelines have numerous settings that should be stored in variables. Storing these settings in variables defined at the top of the file makes adjusting settings and rerunning your pipelines much easier.

• Rather than having to change numerous hardcoded values in your scripts, using variables to store settings means you only have to change one value.

• Bash also reads command-line arguments into variables.

• Bash’s variables don’t have data types. It’s helpful to think of Bash’s variables as strings.

• We can create a variable and assign it a value with: results_dir="results/"

• Note that spaces matter when setting Bash variables. Do not use spaces around the equal sign.
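A short sketch of why the spaces matter, assuming no command named results_dir exists on the system. The incorrect form runs in a child Bash process so that its failure can be observed safely.

```shell
# Correct: no spaces around the equal sign.
results_dir="results/"

# Incorrect: with spaces, Bash treats results_dir as a command name and
# "=" and "results/" as its arguments, so the line fails.
if bash -c 'results_dir = "results/"' 2>/dev/null
then
    spaced="assigned"
else
    spaced="failed"
fi
echo "without spaces: ${results_dir}"
echo "with spaces:    ${spaced}"
```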

• To access a variable’s value, we use a dollar sign in front of the variable’s name.
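For instance, a minimal sketch (stats_file and stats.txt are hypothetical names used only for illustration):

```shell
results_dir="results/"
echo "$results_dir"

# Braces mark where the variable name ends when other text follows;
# without them Bash would look for a variable named results_dirstats.
stats_file="${results_dir}stats.txt"
echo "$stats_file"
```

Quoting the expansion, as done here, keeps a value containing spaces from being split into several words.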

• Suppose we want to create a directory for a sample’s alignment data, called <sample>_aln/, where <sample> is replaced by the sample’s name.

    sample="CNTRL01A"
    mkdir "${sample}_aln/"

Command-line arguments

• The variable $0 stores the name of the script, and command-line arguments are assigned to the variables $1, $2, $3, etc. Bash assigns the number of command-line arguments to $#.
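These variables can be inspected with a small sketch. The script is written to a temporary file, and the arguments reads.fastq and results/ are hypothetical values chosen for the example.

```shell
# Create a throwaway script that reports its arguments.
script=$(mktemp)
cat > "$script" <<'EOF'
#!/bin/bash
echo "script name: $0"
echo "first argument: $1"
echo "second argument: $2"
echo "number of arguments: $#"
EOF

# Run it with two arguments and capture what it prints.
args_out=$(bash "$script" reads.fastq results/)
echo "$args_out"
rm "$script"
```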

• If you find your script requires numerous or complicated options, it might be easier to use Python instead of Bash. Python’s argparse module is much easier to use.

• Variables created in your Bash script will only be available for the duration of the Bash process running that script.

if statement

• Bash supports the standard if conditional statement. The basic syntax is:

    if [commands]
    then
        [if-statements]
    else
        [else-statements]
    fi

• A command’s exit status provides the true and false. Remember that 0 represents true/success and anything else is false/failure.

• if [commands]: [commands] could be any command, set of commands, or test condition. If the exit status of these commands is 0, execution continues to the block after then. Otherwise execution continues to the block after else.

• [if-statements] is a placeholder for all statements executed if [commands] evaluates to true (0).

• [else-statements] is a placeholder for all statements executed if [commands] evaluates to false. The else block is optional.
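A minimal sketch of both branches, using the standard commands true and false, which do nothing except exit with status 0 and 1 respectively. The variable names are made up for this example.

```shell
# Exit status 0: the block after then runs.
if true
then
    first="then-block"
else
    first="else-block"
fi

# Nonzero exit status: the block after else runs.
if false
then
    second="then-block"
else
    second="else-block"
fi

echo "true  -> ${first}"
echo "false -> ${second}"
```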

• Bash is primarily designed to stitch together other commands. This is an advantage Bash has over Python when writing pipelines. Bash allows your scripts to directly work with command-line programs without requiring any overhead to call programs.

• Although it can be unpleasant to write complicated programs in Bash, writing simple programs is exceedingly easy because Unix tools and Bash harmonize well.

• Suppose we wanted to run a set of commands only if a file contains a certain string. grep returns 0 only if it matches a pattern in a file and 1 otherwise, so it can be used directly as the condition of an if statement.
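A minimal sketch of this pattern, assuming a hypothetical input file that is created here only so the example is runnable:

```shell
# Hypothetical input file with one tab-separated line.
input=$(mktemp)
printf 'sample\tCNTRL01A\n' > "$input"

# grep's exit status drives the branch; its matches go to /dev/null.
if grep "CNTRL01A" "$input" > /dev/null
then
    found="yes"
else
    found="no"
fi

echo "pattern found: ${found}"
rm "$input"
```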

Redirecting grep’s output to /dev/null tidies the output of this script: grep’s matches are discarded rather than written to the script’s standard out.

test

• Like other programs, test exits with either 0 or 1. However, test’s exit status indicates the return value of the test specified through its arguments, rather than exit success or error. test supports numerous standard comparison operators.

String/integer expression    Description
-z str                       String str is null
str1 = str2                  str1 and str2 are identical
str1 != str2                 str1 and str2 are different
int1 -eq int2                Integers int1 and int2 are equal
int1 -ne int2                int1 and int2 are not equal
int1 -lt int2                int1 is less than int2
int1 -gt int2                int1 is greater than int2
int1 -le int2                int1 is less than or equal to int2
int1 -ge int2                int1 is greater than or equal to int2
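A few of these operators can be tried directly. In this sketch each status variable starts at 0 and is overwritten with test's exit status only when the test is false; the variable names are made up for illustration.

```shell
# String comparison: identical strings, so test exits with 0.
same=0
test "abc" = "abc" || same=$?

# Integer comparison: 3 is not less than 2, so test exits with 1.
less=0
test 3 -lt 2 || less=$?

# -z is true for an empty (null) string, so test exits with 0.
empty=0
test -z "" || empty=$?

echo "abc = abc: ${same}"
echo "3 -lt 2:   ${less}"
echo "-z '':     ${empty}"
```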

• In practice, the most common conditions you’ll be checking are to see if files or directories exist and whether you can write to them. test supports numerous file- and directory-related test operations.

File/directory expression    Description
-d dir                       dir is a directory
-f file                      file is a regular file
-e file                      file exists
-r file                      file is readable
-w file                      file is writable
-x file                      file is executable

• Combining test with if statements is simple:

    if test -f some_file.txt
    then
        […]
    fi

• Bash provides a simpler syntactic alternative:

    if [ -f some_file.txt ]
    then
        […]
    fi

• Note the spaces around and within the brackets: these are required.
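A runnable sketch of the bracket syntax, using a temporary file as the existing file and a made-up path that is assumed not to exist:

```shell
some_file=$(mktemp)            # an existing regular file
missing="${some_file}.gone"    # a path assumed not to exist

if [ -f "$some_file" ]
then
    exists_msg="found"
else
    exists_msg="not found"
fi

if [ -f "$missing" ]
then
    missing_msg="found"
else
    missing_msg="not found"
fi

echo "existing file: ${exists_msg}"
echo "missing file:  ${missing_msg}"
rm "$some_file"
```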

• When using this syntax, we can chain test expressions with -a as logical AND, -o as logical OR, and ! as negation. Our familiar && and || operators won’t work inside test, because these are shell operators.

    if [ "$#" -ne 1 -o ! -r "$1" ]
    then
        echo "usage: script.sh file_in.txt"
    fi

for loop

• In bioinformatics, most of our data is split across multiple files. At the heart of any processing pipeline is some way to apply the same workflow to each of these files, taking care to keep track of sample names. Looping over files with Bash’s for loop is the simplest way to accomplish this.

• There are three essential parts to creating a pipeline to process a set of files:
1. Selecting which files to apply the commands to
2. Looping over the data and applying the commands
3. Keeping track of the names of any output files created

• Suppose we have a file called samples.txt that tells you basic information about your raw data: sample name, read pair, and where the file is.

• Suppose we want to loop over every file, gather quality statistics on each and every file, and save this information to an output file.

• First, we load our filenames into a Bash array, which we can then loop over. Bash arrays can be created manually by listing the elements inside parentheses.
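A minimal sketch of a hand-written array; the sample names are hypothetical and chosen only for illustration.

```shell
# Elements are separated by spaces inside the parentheses.
sample_names=(zmaysA zmaysB zmaysC)

first_sample="${sample_names[0]}"    # indexing starts at 0
n_samples="${#sample_names[@]}"      # number of elements

echo "first sample: ${first_sample}"
echo "number of samples: ${n_samples}"
echo "all samples: ${sample_names[@]}"
```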

• But creating Bash arrays by hand is tedious and error prone. The beauty of Bash is that we can use a command substitution to construct Bash arrays.
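The sketch below builds the array from a command's output and then loops over it. The contents and layout of samples.txt (tab-separated sample name, read pair, file path), the file names, and the -stats.txt output suffix are all assumptions made for this example. Output names are derived with basename, which drops the directory part and, if one is given, the suffix.

```shell
# Hypothetical samples.txt, created in a temporary directory.
workdir=$(mktemp -d)
printf 'zmaysA\tR1\tseq/zmaysA_R1.fastq\n'  > "${workdir}/samples.txt"
printf 'zmaysA\tR2\tseq/zmaysA_R2.fastq\n' >> "${workdir}/samples.txt"
printf 'zmaysB\tR1\tseq/zmaysB_R1.fastq\n' >> "${workdir}/samples.txt"

# The third column becomes the array's elements. This relies on word
# splitting, so the paths must not contain whitespace.
sample_files=($(cut -f 3 "${workdir}/samples.txt"))

outputs=()
for fastq_file in "${sample_files[@]}"
do
    # Strip the directory and the .fastq suffix to get a clean name.
    name=$(basename "$fastq_file" .fastq)
    outputs+=("${name}-stats.txt")
    echo "${fastq_file} -> ${name}-stats.txt"
done
rm -r "${workdir}"
```

In a real pipeline the echo would be replaced by the command that gathers the statistics and writes them to the derived output name.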

• We can strip the path and extension from each filename using basename.

Learning Unix

• https://www.codecademy.com/learn/learn-the-command-line
• http://swcarpentry.github.io/shell-novice/
• http://korflab.ucdavis.edu/bootcamp.html
• http://korflab.ucdavis.edu/Unix_and_Perl/current.html
• https://www.learnenough.com/command-line-tutorial
• http://cli.learncodethehardway.org/book/
• https://learnxinyminutes.com/docs/bash/
• http://explainshell.com/

Sequence Alignments

DATA FORMATS

Overview

• Nucleotide (and protein) sequences are stored in two plain-text formats widespread in bioinformatics: FASTA and FASTQ.

• We will discuss each format and its limitations, and then see some tools for working with data in these formats.

FASTA

• The FASTA format originates from the FASTA alignment suite, created by William R. Pearson and David J. Lipman. The FASTA format is used to store any kind of sequence data not requiring per-base pair quality scores.

• This includes: reference genome files, protein sequences, coding DNA sequences (CDS), transcript sequences, and so on.

• FASTA files are composed of sequence entries, each containing two parts: a description and the sequence data.

• The description line begins with a greater-than symbol (>) and contains the sequence identifier and other optional information.

• The sequence data begins on the next line after the description, and continues until there’s another description line.

• An example FASTA file:
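As a minimal illustration of this layout, the sketch below writes two made-up records (the identifiers and sequences are hypothetical, not real data) to a temporary file and counts them.

```shell
# Two made-up records; the first spreads its sequence over two lines.
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>seq1 hypothetical example record
ATGCATGCATGCATGCATGC
ATGCATGC
>seq2 another hypothetical record
GGGCCCTTTAAA
EOF

# Each record starts with one description line, so counting the lines
# that begin with ">" counts the records.
n_records=$(grep -c '^>' "$fasta")
echo "records: ${n_records}"
rm "$fasta"
```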

• The FASTA format’s simplicity and flexibility come with an unfortunate downside: the FASTA format is a loosely defined ad hoc format.

• In general, the following rules should be observed:

1. Sequence lines should not be too long. While a FASTA file that contains the sequence of the entire human chromosome 1 on a single line is a valid FASTA file, most tools that run on such a file would fail.

2. Some tools may accept data containing alphabets beyond those that they know how to deal with. For example, the standard alphabet for nucleotides would contain ATGC. An extended alphabet may also contain:
   N: A, T, G, or C
   W: A or T
   Search the web for “IUPAC nucleotides” to get a list of all such symbols.

3. The sequence lines should always wrap at the same width, with the exception of the last line. Some tools will fail to operate correctly and may not even warn the users if this condition is not satisfied. The following is technically a valid FASTA but it may cause various problems:

It should be reformatted to:
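One way to do such a reformatting is sketched below with awk. The record, the width of 10, and the helper name emit_seq are assumptions made for this example; dedicated sequence tools offer the same operation and may be preferable in practice.

```shell
# A made-up record whose sequence lines have uneven widths.
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>seq1 hypothetical record with uneven line widths
ATGCATGCAT
GCA
TGCATGCATGC
EOF

# Join each record's sequence lines, then re-wrap them at a fixed width.
width=10
rewrapped=$(awk -v w="$width" '
    function emit_seq(   i) {
        for (i = 1; i <= length(seq); i += w) print substr(seq, i, w)
        seq = ""
    }
    /^>/ { emit_seq(); print; next }   # description line: flush, then print
    { seq = seq $0 }                   # sequence line: append to the record
    END { emit_seq() }                 # flush the last record
' "$fasta")

echo "$rewrapped"
rm "$fasta"
```

After rewrapping, every sequence line has 10 characters except the last one, which holds the remaining 4.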

4. Use upper-case letters. Whereas both lower-case and upper-case letters are allowed by the specification, the different capitalization may carry additional meaning, and some tools and methods will operate differently when encountering upper- or lower-case letters. Some communities (e.g. Ensembl) chose to use lower-case letters to mark repeats and low-complexity regions.

FASTQ

• The FASTQ format extends FASTA by including a numeric quality score for each base in the sequence. The FASTQ format is widely used to store high-throughput sequencing data, which is reported with a per-base quality score indicating the confidence of each base call.

• It is the de facto standard by which all sequencing instruments represent data.

• The FASTQ format looks like:

• Line 1: The description line beginning with @. This contains the record identifier and other information.
• Line 2: Sequence data, which can be on one or many lines.
• Line 3: The line beginning with + indicates the end of the sequence.
• Line 4: Quality data, which can also be on one or many lines, but must be the same length as the sequence. Each numeric base quality is encoded with ASCII characters.
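To make the ASCII encoding concrete, the sketch below decodes one quality character. It assumes the Phred+33 convention (quality = ASCII code minus 33), which the text above does not specify; other offsets exist, so treat the offset as an assumption.

```shell
# Assumption: Phred+33 encoding (quality = ASCII code - 33).
qual_char="I"

# printf with a leading single quote prints the character's numeric code.
ascii_code=$(printf '%d' "'${qual_char}")
phred=$((ascii_code - 33))

echo "${qual_char} -> ASCII ${ascii_code} -> Q${phred}"
```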

• The FASTQ format is a multi-line format just as the FASTA format is. In the early days of high-throughput sequencing, instruments always produced the entire FASTQ sequence on a single line.

• The FASTQ format suffers from the unexpected flaw that the @ sign is both a FASTQ record separator and a valid value of the quality string. For that reason it is a little more difficult to design a correct FASTQ parsing program.
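The pitfall can be seen in a small sketch with two made-up records, one of which has a quality string starting with @. The alternative count assumes that every record occupies exactly four lines, which does not hold for multi-line FASTQ files.

```shell
# Two hypothetical records; the first quality line begins with "@".
fastq=$(mktemp)
cat > "$fastq" <<'EOF'
@read1 hypothetical record
ACGT
+
@III
@read2 hypothetical record
TTGA
+
IIII
EOF

# Naive count: also matches the quality line that starts with "@".
naive_count=$(grep -c '^@' "$fastq")

# Assuming four lines per record, count the lines and divide by four.
total_lines=$(wc -l < "$fastq")
record_count=$((total_lines / 4))

echo "grep '^@': ${naive_count}"
echo "lines / 4: ${record_count}"
rm "$fastq"
```

The naive count reports three records where there are only two, which is why a correct parser has to track its position within a record rather than rely on the @ sign alone.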