Exercise I: Basic Unix for Manipulating NGS Data
C. Hahn, July 2014

The purpose of this exercise is to practice manipulating NGS (Next Generation Sequencing) data using simple but incredibly powerful Unix commands. Try to solve the tasks below using your Unix skills. Do not hesitate to consult Google for help! Hints for every task can be found in the Hints section below, and possible solutions are suggested in the Solutions section.

1. Download the test data from https://www.dropbox.com/s/wcanmej6z03yfmt/testdata_1.tar?dl=0. Create a directory (exercise_1) in your home directory and copy the archive testdata_1.tar from ~/Downloads/ to this directory.

2. Decompress the archive.

3. Examine the content of the data files. For basic information on the fastq format please visit http://en.wikipedia.org/wiki/FASTQ_format. With this information, try to interpret the content of your data files. Do you know what all the lines represent?
a) Which quality encoding (which offset) do these files use?
b) What is the header of the third read in the file? Is this a single-end (forward) or a paired-end (reverse) read?
c) What is the header of the last read in the file? Is this a forward or a reverse read?

4. How many lines does the file have?

5. How many reads does the file contain?

6. How many single-end reads ("se", also referred to as forward reads) and how many paired-end reads ("pe", also reverse reads) does the file contain?

7. Extract all se/pe reads and write them to the separate files testdata_1.fastq and testdata_2.fastq.

8. a) Count the number of reads that contain the sequence TGCACTAC in testdata_1.fastq.
b) Count the number of reads that start with TGCACTAC (referred to as an in-line barcode) in testdata_1.fastq.

9. Modify all headers in the file testdata_1.fastq: replace the part of the header which identifies the read as se with "/1" and write the data to testdata_1_newheader.fastq.

10. Extract the first 1000 reads from testdata_1_newheader.fastq and save them to a file called testdata_1_sub1000.fastq. Gzip this file.

11. Perform the tasks of 9. (except change to "/2") and 10. above in one single command using pipes for testdata_2.fastq, and write the data into a compressed file called testdata_2_sub1000.fastq.gz.

12. Identify all reads with the in-line barcode TGCACTAC in the file testdata_1_sub1000.fastq.gz and write them to the file sample_TGCACTAC_sub.1.fastq.

Advanced:

13. Which are the 24 most common barcodes (of length 8 bp) in the file testdata_1.fastq and how often do they occur?

14. Extract the pe reads corresponding to the reads in sample_TGCACTAC_sub.1.fastq from testdata_2_sub1000.fastq.gz and write them to sample_TGCACTAC_sub.2.fastq.

15. The file "barcodes" contains a list of the in-line barcodes used during the preparation of the current library. Count the number of reads for every barcode in testdata_1.fastq.

Hints:

The following hints suggest commands that might be useful for solving the above problems. Not all commands will be needed in every case; different combinations and subsets of the commands can be applied.

1. cd; mkdir; cp;
2. tar;
3. gunzip; cat; zcat; less; more; head; tail; a) Is the offset of the quality encoding Phred+33 or Phred+64? Visit http://en.wikipedia.org/wiki/FASTQ_format for help.
4. gunzip; cat; wc; zcat;
5. cat; zcat; grep; wc; |;
6. cat; zcat; grep; wc; |;
7. cat; zcat; grep (-A, -v); >; |;
8. cat; zcat; grep; wc; |; For many applications it can be useful to sequence DNA from several DNA extracts (different individuals, species, etc.) during the same run of an NGS instrument (often termed multiplexing). Specific short DNA sequences (barcodes) are ligated to the DNA fragments of the different samples during NGS library preparation. After sequencing, reads can be assigned back to individual DNA samples based on this barcode. In our case individuals are identified by an 8 bp in-line barcode, i.e. the first 8 bp of the se reads.
9. cat; sed; >; |; (an example would be: @DHK1:324:C2:4:23:19:41 1:N:0:TGCAACTGG -> @DHK1:324:C2:4:23:19:41/1)
10. head; >; gzip;
11. head; |; gzip; >;
12. cat; zcat; grep (-A, -B); >;

Advanced:

13. cat; zcat; sed -n; cut -c; sort; uniq; head; |;
14. for loop; cat; grep; |; >;
15. for loop; cat; sed; grep; cut; uniq; sort; |;

Solutions:

Note that there are usually several ways to solve these problems; the ones stated below are just examples. If you found another way to do it: congratulations! Lines starting with "$" represent commands to be executed in the terminal window. Text after the "#" gives some extra information on what the command is doing.

1.
$ cd ~/your_directory
$ mkdir exercise_1
$ cd exercise_1
$ cp ~/Downloads/testdata_1.tar .

2.
$ tar xvf testdata_1.tar #this will produce the directory "testdata_1", which contains the gzipped file testdata_interleaved.fastq.gz
$ ls -hlrt testdata_1/ #look what's in the directory and have the content listed with some information on file size in human-readable format

3.
$ cd testdata_1
$ gunzip testdata_interleaved.fastq.gz #decompresses the "gzipped" file. Note that by default a new decompressed file is created while the gzipped version disappears. This behavior can be modified in various ways; see the manual.
$ less testdata_interleaved.fastq #less is a useful program for looking at large text files. It does not have to read the entire file before starting, so it opens large files much faster than your standard text editor. Navigate with the up/down keys. It has many, many functions, including pattern search; look at the manual for details. Quit with "q".
$ more testdata_interleaved.fastq #similar to less but with slightly different functionality
$ head testdata_interleaved.fastq #writes the first 10 lines of the file to your screen (usually called standard output or STDOUT). The number of lines to be written out can be controlled (see man head).
$ tail testdata_interleaved.fastq #writes the last 10 lines of the file to STDOUT
$ cat testdata_interleaved.fastq #writes the entire content of the file to STDOUT; not very helpful at first glance (stop the process by pressing CTRL-c), but you can use a "pipe" ("|") to forward the STDOUT directly to another program without displaying it -> very powerful! See below.

You may also perform all these actions directly on compressed files:
$ gzip testdata_interleaved.fastq
$ gunzip -c testdata_interleaved.fastq.gz #gunzips the file, but instead of creating a new file this command writes the content of the file to STDOUT
$ zcat testdata_interleaved.fastq.gz #writes the content of a compressed file to STDOUT
$ zcat testdata_interleaved.fastq.gz | head #head cannot directly display the content of a gzipped file in human-readable form (try it). It can, however, be combined with other commands using a pipe.
$ gunzip -c testdata_interleaved.fastq.gz | tail

a) The quality scores in the file are encoded in Phred+33 (Sanger) format.
b) The header of the third read (forward read) is: @DHKW5DQ1:324:C2G0EACXX:4:2308:19447:41921 1:N:0:TGAACTGG
c) The header of the last read (reverse read) is: @DHKW5DQ1:324:C2G0EACXX:4:2316:21327:100822 2:N:0:TGAACTGG

Note: This data file contains both se and pe reads in what is sometimes referred to as interleaved format. That means that the se and pe reads from a given DNA fragment occur in the file in consecutive order. Quite often se and pe reads are instead provided in separate files; in that case the first read in the se file (often named something_1.fastq) corresponds to the first read in the pe file (e.g. something_2.fastq), and so on.
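A quick way to see this interleaved layout for yourself is to print just the first two records, i.e. the first 8 lines, of the compressed file (a minimal check; the exact reads you see depend on the file):
$ zcat testdata_interleaved.fastq.gz | head -n 8 #the first header should contain " 1:" (the se/forward read) and the second header " 2:" (its pe/reverse mate)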
4.
$ wc -l testdata_interleaved.fastq
$ cat testdata_interleaved.fastq | wc -l
$ zcat testdata_interleaved.fastq.gz | wc -l
The file contains 6705040 lines.

5.
You already know that the standard fastq format per definition has 4 lines per read, so the number of reads in the file is simply the number of lines divided by 4, i.e. 6705040/4 = 1676260. But you could also simply count all header lines in the file. Can you find a pattern to search for that is true for every header and does not occur in any other line? Per definition the header of a fastq record starts with "@", so you could count all lines in the file that start with "@":
$ grep "^@" testdata_interleaved.fastq | wc -l #grep is a very powerful command for pattern search. The pattern in this case is "@". The "^" is a special character that marks the start of the line, so "^@" searches for lines starting with @.
The result of this line count is 2159438, i.e. not the expected 1676260. What's wrong? By chance, some of the quality lines also start with @. Have a look:
$ grep "^@" testdata_interleaved.fastq | less
So our pattern has to be more specific. We know that a fastq header usually starts with an id specific to the machine that was used to generate the data. In our case "@DHKW5DQ1" should thus be a pattern that occurs only in headers.
$ grep "^@DHKW5DQ1" testdata_interleaved.fastq | wc -l
or
$ grep -c "^@DHKW5DQ1" testdata_interleaved.fastq
Result: 1676260
You can try to find other patterns that are specific to headers.

6.
Se/pe reads can be identified immediately by looking at their headers. In our case a typical se read is identified by a header containing " 1:", while a pe read contains " 2:". There are different conventions; sometimes se/pe reads are identified by headers that end in "/1" and "/2", respectively. Use grep to count the number of se/pe reads, e.g.:
$ grep -c " 1:" testdata_interleaved.fastq
$ zcat testdata_interleaved.fastq.gz | grep -c " 1:"
$ grep " 2:" testdata_interleaved.fastq | wc -l
The file contains 838130 se and 838130 pe reads, respectively.
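As an illustration of the grep options suggested in hint 7 (-A and -v), here is one possible sketch for task 7, reusing the header pattern from solution 5. This is just one way to do it; your own solution may look quite different:
$ grep -A 3 "^@DHKW5DQ1.* 1:" testdata_interleaved.fastq | grep -v "^--$" > testdata_1.fastq #-A 3 prints each matching se header plus the 3 lines that follow it (sequence, "+", quality); the second grep removes the "--" separator lines that grep inserts between non-adjacent matches
$ grep -A 3 "^@DHKW5DQ1.* 2:" testdata_interleaved.fastq | grep -v "^--$" > testdata_2.fastq #the same idea for the pe reads
You can sanity-check the result against the counts from solution 6, e.g. $ grep -c "^@DHKW5DQ1" testdata_1.fastq should give 838130.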
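Hint 9 gives the intended header transformation for task 9 but no command. One possible minimal sed sketch, assuming the separator lines are a bare "+" and do not repeat the header, would be:
$ sed 's/ 1:.*/\/1/' testdata_1.fastq > testdata_1_newheader.fastq #on header lines this replaces everything from " 1:" onwards with "/1"; sequence and quality lines are untouched because Phred+33 quality strings never contain a space
A quick check: $ head -n 1 testdata_1_newheader.fastq should now show a header ending in "/1".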