
Essential Skills for Bioinformatics

PIPES

A simple program with grep and pipes

• Suppose we are working with a FASTA file and a program warns that it contains non-nucleotide characters in its sequences. We can check for non-nucleotide characters easily with a Unix one-liner using pipes and grep.

• The grep Unix tool searches files or standard input for lines that match patterns. These patterns can be simple strings or regular expressions. See man grep for details.

• tb1-protein.fasta

• Approach:
 - remove all header lines (those that begin with >)
 - pipe the remaining sequence lines to grep
 - print lines containing non-nucleotide characters
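
• Combining these pieces (with the tb1-protein.fasta file mentioned above), the one-liner looks like the following sketch; the options used are explained below:

    $ grep -v "^>" tb1-protein.fasta | grep --color -i "[^ACGT]"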

• "^>": this regular expression pattern matches all lines that start with a > character.
• -v: invert the matching lines, because we want to exclude lines starting with >.
• |: pipe the standard output to the next command with the pipe character |.
• "[^ACGT]": when used in brackets, a caret symbol (^) matches anything that is not one of the characters in the brackets. [^ACGT] matches any character that is not A, C, G, or T.
• -i: ignore case.
• --color: color the matching non-nucleotide character.
• Both regular expressions are in quotes, which is a good habit to get into.

Combining pipes and redirection

• Large bioinformatics programs will often use multiple streams simultaneously. Results are output via the standard output stream while diagnostic messages, warnings, or errors are output to the standard error stream. In such cases, we need to combine pipes and redirection to manage all streams from a running program.

• For example, suppose we have two imaginary programs: program1 and program2.
 - The first program, program1, does some processing on an input file called input.txt and outputs results to the standard output stream and diagnostic messages to the standard error stream.

 - The second program, program2, takes standard output from program1 as input and processes it. It also outputs its own diagnostic messages to standard error, and results to standard output.
 - The tricky part is that we now have two processes outputting to both standard error and standard output.

• program1 processes the input.txt input file and then outputs its results to standard output. program1’s standard error stream is redirected to the program1.stderr logfile.
• program2 uses the standard output from program1 as its standard input. The shell redirects program2’s standard error stream to the program2.stderr logfile, and program2’s standard output to results.txt.
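
• A sketch of the full command described above, using the imaginary program1 and program2:

    $ program1 input.txt 2> program1.stderr | program2 2> program2.stderr > results.txt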

A tee in your pipe

• We do occasionally need to write intermediate files to disk in Unix pipelines. These intermediate files can be useful when debugging a pipeline or when you wish to store intermediate files for steps that take a long time to complete.

• The Unix program tee diverts a copy of your pipeline’s standard output stream to an intermediate file while still passing it through its standard output.

• program1’s standard output is both written to intermediate-file.txt and piped directly into program2’s standard input.
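
• A sketch of such a tee pipeline (the final redirect to results.txt is added here for completeness):

    $ program1 input.txt | tee intermediate-file.txt | program2 > results.txt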

MANAGING AND INTERACTING WITH PROCESSES

Overview

• When we run programs through the shell, they become processes until they successfully finish or terminate with an error.

• There are multiple processes running on your machine simultaneously. In bioinformatics, we often work with processes that run for a long time, so it’s important we know how to work with and manage processes from the Unix shell.

• We will learn the basics of manipulating processes: running and managing processes in the background, killing errant processes, and checking their status.

Background processes

• When we run a command in the shell and press Enter, we lose access to that shell prompt. This is fine for short tasks, but waiting for a long-running program to complete before continuing work in the shell would hurt our productivity.

• Rather than running your program in the foreground, the shell also gives you the option to run programs in the background. Running a process in the background frees the prompt so you can continue working.

• We can tell the Unix shell to run a program in the background by appending an & to the end of our command.
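
• For example, a sketch using the imaginary program1 from earlier (the job number and process ID printed by the shell are illustrative and will vary):

    $ program1 input.txt > results.txt &
    [1] 26577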

• The number returned by the shell is the process ID (or PID). This is a unique ID that allows you to identify and check the status of your program later on. We can check what processes we have running in the background with jobs.

• To bring a background process into the foreground again, we can use fg. It will bring the most recently backgrounded process to the foreground.

• To return a specific background job to the foreground, use fg %<num>, where <num> is its number in the job list (returned by jobs).
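
• A sketch of listing background jobs and foregrounding a specific one (the job number and output line are illustrative):

    $ jobs
    [1]+  Running    program1 input.txt > results.txt &
    $ fg %1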

• It is also possible to place a process already running in the foreground into the background. To do this, we first need to suspend the process, and then use the bg command to run it in the background.

• We can suspend processes by sending a stop signal through the key combination Control-z.
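
• A sketch of the suspend-then-background workflow:

    $ program1 input.txt > results.txt
    # press Control-z to suspend the running process
    $ bg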

• Although background processes run in the background and seem disconnected from our terminal, closing our terminal window would cause these processes to be killed.

• Whenever our terminal window closes, it sends a hangup signal. Nearly all Unix command-line programs stop running as soon as they receive this signal.

• nohup is a simple command that executes a command and catches hangup signals sent from the terminal. Because the nohup command is catching and ignoring these hangup signals, the program you are running will not be interrupted.
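
• A sketch of running the imaginary program1 with nohup (the process ID printed by the shell will vary):

    $ nohup program1 input.txt > results.txt &
    [1] 10900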

• nohup returns the process ID, which is how you can monitor or terminate this process. Because we lose access to this process when we run it through nohup, our only way of terminating it is by referring to it by its process ID.

Killing processes

• Occasionally we need to kill a process. It is not uncommon for a process to demand too many of our computer’s resources or become nonresponsive, requiring that we send a special signal to kill the process.

• If the process is currently running in the foreground, you can kill it by entering Control-c. If it is in the background, you will first have to bring it into the foreground with fg.
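
• For a background job, a sketch of this would be (the job number is illustrative):

    $ fg %1
    # then press Control-c to terminate the process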

Exit status

• One concern with long-running processes is that you are not going to be around to monitor them. Unix programs exit with an exit status, which indicates whether a program terminated without a problem or with an error.

• An exit status of 0 indicates the process ran successfully, and any nonzero status indicates some sort of error has occurred.

• The shell stores the exit status value in a variable named $?.

• We can use echo to look up this variable’s value after running a command: echo $?
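
• For example, a sketch with the imaginary program1:

    $ program1 input.txt > results.txt
    $ echo $?    # prints 0 if program1 ran successfully, a nonzero value otherwise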

• The exit status allows us to programmatically chain commands together in the shell. A subsequent command in a chain is run conditionally on the last command’s exit status.

• The shell provides two operators that implement this:
 - &&: run the next command only if the previous command completed successfully
 - ||: run the next command only if the previous command completed unsuccessfully

• Suppose we want to run program1, have it write its output to a file, and then have program2 read this output. To avoid the problem of program2 reading a file that’s not complete because program1 terminated with an error, we want to start program2 only after program1 returns a zero exit status.
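
• A sketch of this pattern (the intermediate filename is arbitrary):

    $ program1 input.txt > intermediate-results.txt && program2 intermediate-results.txt > results.txt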

• Using the || operator, we can have the shell execute a command only if the previous command has failed. This is useful for warning messages.
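
• A sketch using || to print a warning if the first command fails:

    $ program1 input.txt > intermediate-results.txt || echo "warning: program1 failed"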

• Additionally, if you don’t care about the exit status and you just wish to execute two commands sequentially, you can use a single semicolon (;)
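
• A sketch of sequential execution with a semicolon (the second command runs regardless of the first command's exit status):

    $ program1 input.txt > results.txt ; echo "program1 finished"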

• Another type of useful shell expansion is command substitution. It allows a command to be run and its output to be pasted back on the command line as arguments to another command.

• The syntax is $(…).
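
• A minimal sketch of command substitution (the directory name and the reuse of tb1-protein.fasta are illustrative):

    $ mkdir results-$(date +%F)     # creates a dated directory, e.g. results-2024-05-01
    $ echo "There are $(grep -c '^>' tb1-protein.fasta) entries."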

RETRIEVING DATA

Overview

• Whether downloading large sequencing datasets or accessing a web application hundreds of times to download specific files, retrieving data in bioinformatics can require special tools and skills.

• Downloading large-scale sequencing data through your web browser is not feasible. Web browsers are not designed to download such large datasets. Additionally, you’d need to download the sequencing data to your server, not your local workstation.

• To do this, you’d need to SSH into your data-crunching server and download your data directly to this machine using command-line tools.

• We will take a look at some of these tools.

wget

• wget is useful for quickly downloading a file from the command line.

• For example, to download the human gene annotation GTF file from Ensembl:
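
• A sketch of such a download; the URL below is a placeholder (the release number and exact path change between Ensembl releases), so check the Ensembl FTP site for the current link:

    $ wget ftp://ftp.ensembl.org/pub/release-XX/gtf/homo_sapiens/Homo_sapiens.GRCh38.XX.gtf.gz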

• wget downloads this file to your current directory and provides a useful progress indicator.

• It can handle both http and ftp links. In general, FTP is preferable to HTTP for large files.

• For simple HTTP or FTP authentication with a user name and password, you can authenticate using the --user= and --ask-password options.

wget: useful options

• -A, --accept (values: a suffix like ".gtf", or a pattern with *, ?, or [ and ]): only download files matching this criteria
• -R, --reject (values: same as with --accept): don't download files matching this criteria
• -nd, --no-directories (no value): don't place locally downloaded files in the same directory hierarchy as remote files
• -r, --recursive (no value): follow and download links on a page
• -np, --no-parent (no value): don't move above the parent directory
• --limit-rate (value: number of bytes per second): maximum download bandwidth
• --user=user (value: user name): user name for authentication
• --ask-password (no value): prompt for password
• -O (value: output file name): download the file to the filename specified

wget: --recursive

• With this recursive option, wget will retrieve all the data from the given directory on the remote server up to the specified depth.

• By default, wget limits itself to only follow five links deep (this can be customized using the --level or -l option) to avoid downloading the entire Internet.

• Recursive downloading can be useful for downloading all files of a certain type from a page.

• Example:
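
• A sketch of such a command, using a placeholder URL (the constraints used here are explained below):

    $ wget --recursive --no-parent --accept "*.gtf" http://genomics.example.edu/annotations/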

• This recursive downloading can be quite aggressive. If not properly constrained, it will download everything it can reach. In the example above, we limited wget in two ways:
 - with --no-parent, to prevent wget from downloading pages higher in the directory structure
 - with --accept "*.gtf", which only allows wget to download filenames matching this pattern

• Exercise caution when using this recursive option. It can use a lot of network resources and overload the remote server. In some cases, the remote host may block your IP address if you are downloading too much.

DATA INTEGRITY

Overview

• Data we download is the starting point of all future analyses. The risk of data corruption during transfers is a concern when transferring large datasets. In addition to verifying your transfer finished without error, it’s also important to explicitly check the transferred data’s integrity with checksums.

• Checksums are very compressed summaries of data, computed in a way that even if just one bit of the data is changed, the checksum will be different.

SHA and MD5 checksums

• The two most common checksum algorithms for ensuring data integrity are MD5 and SHA-1. MD5 is an older checksum algorithm, but one that is still commonly used. Both MD5 and SHA-1 behave similarly, but SHA-1 is newer and generally preferred.

• The long string of numbers and letters (hexadecimal format, 0-9 and a-f) is the checksum.
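
• For example, a sketch with a hypothetical FASTQ file (shasum computes SHA-1 checksums by default; md5sum is used the same way):

    $ shasum data/reads_1.fastq
    # prints a 40-character hexadecimal checksum followed by the filename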

• Checksums are completely deterministic, meaning that regardless of the time the checksum is calculated or the system used, checksum values will only differ if the input differs (even one bit of data).

• When downloading many files, it can get rather tedious to check each checksum individually. The program shasum can create a file containing the checksums of many files and later validate files against it.
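
• A sketch of creating such a checksum file for a set of hypothetical FASTQ files:

    $ shasum data/*.fastq > fastq_checksums.sha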

• Then, we can use shasum’s check option (-c) to validate that these files match the original versions:
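
• A sketch of the validation step (filenames as above):

    $ shasum -c fastq_checksums.sha
    # reports "OK" for each file whose checksum matches and an error for any file that differs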

• There are only 16^40 possible different SHA-1 checksums. However, 16^40 is a huge number, so the probability of a checksum collision is very small. For the purpose of checking data integrity, the risk of a collision is negligible.

• Some servers use an antiquated checksum implementation such as sum or cksum. The way we use these older command-line checksum programs is similar to using shasum and md5sum.