
Unix Activities – Week 1 Assignment

1. Examine the contents of my .bash_profile

~mgribsko/.bash_profile_public

2. Edit your .bash_profile using nano (or other editor of your choice, see the Unix manual) and add the lines

module load bioinfo
alias scratch="cd $RCAC_SCRATCH"

3. If you do not have a file called .bash_profile, you will need to create it.

4. Copy the following text into your .bash_profile

# prevent deleting files with *
rmstar() {
    for i
    do
        if [ "$i" = '*' ] ;then
            echo -n "Remove all files? Are you sure [y/n]? "
            read j
            if [ "$j" != y ] ;then continue ;fi
        fi
        set +o noglob
        eval command rm $i
        set -o noglob
    done
    set +o noglob
}
alias rm='set -o noglob ; rmstar'

5. Save your .bash_profile. Test whether it works by typing

source .bash_profile
mkdir test
cd test
rm *

6. Move to your scratch directory using cd. Your scratch directory on any of the clusters can always be referenced with the symbol $RCAC_SCRATCH.

7. Return to your home directory by typing “cd” with no directory.

8. Move to your scratch directory using the new command you added to your .bash_profile in activity 2, above. Just type scratch.

9. You should now be in your scratch directory. Confirm this with the pwd command. You should see something like /scratch/scholar/mgribsko but with your account name, not mine.

10. Create a new working directory in your scratch directory using the mkdir command. Name the directory “raw”. Confirm that this directory exists using the ls command.

11. Move into the “raw” directory using cd.

12. Copy the course data from the course directory

cp /depot/hpcls/week1/*.gz .

The period at the end of this command stands for the current directory, so the copied files are placed there with the same filenames as they originally had in the course directory. If this was a real project, it would be a good idea to back up the raw data to Fortress at this point. You should always make an archival copy of your data before you start to work on it in case you ever need to start over, or in case your files get purged or deleted. Files stored on Fortress are secure essentially forever.

13. Uncompress the data using

gunzip *.gz

14. Use ls -lh to see details about the uncompressed files, including their size.

15. Use less to examine some of the data in the files.

16. Use the wc command to count the number of lines in each file. Since these are FASTQ sequence data files, each sequence read entry has four lines. Based on wc, are all the entries complete, and how many total sequence reads are there?

17. The wc command is briefly covered in the Unix manual, but you can use the man command to get more information. Try typing man wc – what is the command option used to get the length of the longest line in a file?
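For activity 16, the read count follows directly from the line count: a complete FASTQ file has a line count that is a multiple of four, and the number of reads is that count divided by four. The following is a minimal sketch of that arithmetic. The helper name count_reads and the tiny two-read demo file are illustrative assumptions, not part of the course materials.

```shell
# count_reads FILE: print the number of complete FASTQ records in FILE and
# warn on stderr if the line count is not a multiple of four.
# (count_reads is a hypothetical helper name, not a standard command.)
count_reads() {
    # Reading from stdin makes wc print only the number, without the filename.
    lines=$(( $(wc -l < "$1") ))
    if [ $((lines % 4)) -ne 0 ]; then
        echo "warning: $1 has $lines lines, which is not a multiple of 4" >&2
    fi
    echo $((lines / 4))
}

# Made-up FASTQ file containing two reads, used only to demonstrate the helper.
demo=$(mktemp)
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n' > "$demo"
count_reads "$demo"
```

On the course data the same helper could be called as count_reads week1.r1.fq, once the files have been uncompressed.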

Data cleaning is the first step in every big data project. Many customized programs are available for cleaning FASTQ sequence data files (such as the ones you just copied). We can also use general Unix commands such as grep to quickly look at aspects of our data. One of the common artefacts that must be removed from sequence data is sequence corresponding to the adapters that were used to construct the sequencing library. Because sequencing approaches use random fragmentation (AKA shotgun sequencing), it is possible to sequence completely through a small insert and into the adapter sequence on the other side. Such sequences need to be detected and removed before further analyses can be performed. The Illumina TruSeq system is one of the most commonly used systems for library construction. One of the adapters used in this system has the sequence

GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT

We will use grep to see if this adapter is present in our files. For activities 17a - 17d, examples of the code can be found at the end of this section.

a. grep is used to detect every occurrence of a simple string (or more complex patterns called regular expressions) in files. The syntax is

grep <pattern> <filename>

where the angle brackets, < >, indicate information that must be filled in. First we will use grep to see if there are any matches to the first six bases of the adapter sequence.

b. grep shows you just the lines in the file that match the pattern. Since there are many matches, the output scrolls by very quickly. Try piping the output of grep into the less program so that you can more easily see it (this is described in the Unix manual). You should see that there are, indeed, quite a few matching lines, but it is hard to see exactly where the matching part of the sequence is.

c. To highlight the matching parts of the file, add the option --color=always. The matching parts will now appear in red.

d. Since DNA has only four bases, it is fairly common for random sequences to match six-base query patterns (there are only 4^6 = 4096 DNA hexamers). Our search, however, put no constraints on the bases following the six-letter query pattern. Hence, when the sequence following the first six bases also matches the adapter, the read is highly likely to contain adapter sequence. Based on this reasoning, see if you can estimate what fraction of the matches represent true adapters, and what fraction of the sequences probably contain adapters. An acceptable level of adapter contamination would be below about 0.5%. You may want to use the option -c to get a count of the matching lines.

e. The longer the query pattern, the fewer random matches will be seen. At the same time, Illumina sequences are only 100-150 bases long, and longer queries are likely to have fewer matches because they are truncated by the end of the sequence read, or because of sequencing errors (~0.1% for Illumina technology). Using grep, try to determine how long a query sequence must be used to ensure that almost all (let’s say more than 95%) of the matches represent actual adapter sequences. To do this, repeat the approach of 17d using longer query patterns drawn from the beginning of the adapter sequence.
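The reasoning in 17d and 17e can be made concrete with a little arithmetic. There are 4^k distinct DNA sequences of length k, and a read of length L offers L - k + 1 positions where a k-base query could match, so roughly (L - k + 1) / 4^k chance matches are expected per read. The sketch below tabulates this under two stated assumptions: bases are uniformly random and independent, and the read length is 100. The function name chance_table is hypothetical, and real reads only approximate the random model.

```shell
# chance_table READ_LEN: for several query lengths k, print the number of
# distinct DNA k-mers (4^k) and the expected number of chance matches per read.
# Assumes uniformly random, independent bases (an idealization).
chance_table() {
    for k in 6 8 10 12; do
        awk -v k="$k" -v len="$1" 'BEGIN {
            kmers = 4 ^ k              # number of distinct k-mers
            windows = len - k + 1      # start positions where a k-mer fits in the read
            printf "k=%d kmers=%d expected_per_read=%.6f\n", k, kmers, windows / kmers
        }'
    done
}

chance_table 100    # 100 is an assumed read length
```

Under these assumptions a six-base query is expected to match by chance in roughly 2 of every 100 reads, which is why short queries alone cannot show whether contamination is below 0.5%.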

grep GATCGG week1.r1.fq # 17a

grep GATCGG week1.r1.fq | less # 17b

grep --color=always GATCGG week1.r1.fq | less # 17c

grep -c GATCGG week1.r1.fq # 17d
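For 17e, the count in 17d can be repeated for progressively longer prefixes of the adapter. The following is one possible sketch, not the only way to do it. The function name prefix_counts and the particular prefix lengths tried are assumptions chosen for illustration; the adapter sequence and the filename in the usage note are the ones given above.

```shell
adapter=GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT

# prefix_counts FILE: for several prefix lengths, print the length and the
# number of lines in FILE that contain that prefix of the adapter.
# (prefix_counts is a hypothetical helper; the lengths are arbitrary choices.)
prefix_counts() {
    for n in 6 8 10 12 16 20; do
        # Take the first n bases of the adapter as the query pattern.
        query=$(printf '%s' "$adapter" | cut -c1-"$n")
        # grep -c prints the number of matching lines (0 if there are none).
        printf '%d\t%s\n' "$n" "$(grep -c "$query" "$1")"
    done
}
```

Calling prefix_counts week1.r1.fq prints one line per prefix length. The point at which the counts stop dropping sharply suggests a query length at which most remaining matches are real adapter sequence rather than chance hits.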