
Introduction to high-throughput file formats – advanced exercises

Daniel Vodák (Core Facility, The Norwegian Radium Hospital)

All example files are located in the directory “/data/file_formats/data”. You can make it your current directory with “cd /data/file_formats/data”.

Exercises:

a) FASTA format

File “Uniprot_Mammalia.fasta” contains a collection of mammalian protein sequences obtained from the UniProt database ( http://www.uniprot.org/uniprot ).

1) Count the number of lines in the file.
2) Count the number of header lines (i.e. the number of sequences) in the file.
3) Count the number of header lines with “Homo” in the organism/species name.
4) Count the number of header lines with “Homo sapiens” in the organism/species name.
5) Display the header lines that have “Homo” in the organism/species name but not “Homo sapiens”.

b) FASTQ format

File NKTCL_31T_1M_R1.fq contains a reduced collection of tumor read sequences obtained from the Sequence Read Archive ( http://www.ncbi.nlm.nih.gov/sra ).

1) Count the number of lines in the file.
2) Count the number of lines beginning with the “at” symbol (“@”).
3) How many reads are there in the file?

c) SAM format

File NKTCL_31T_1M. contains alignments of paired-end reads stored in files NKTCL_31T_1M_R1.fq and NKTCL_31T_1M_R2.fq.

1) Which program was used for the alignment?
2) How many header lines are there in the file?
3) How many non-header lines are there in the file?
4) What do the non-header lines represent?
5) Reads from how many template sequences have been used in the alignment process?

d) BED format

File NKTCL_31T_1M. holds information about alignment locations stored in file NKTCL_31T_1M.sam.

1) How many regions are listed in the BED file?
2) Why doesn’t the number of regions match the number of alignment lines in file NKTCL_31T_1M.sam?
3) How many regions are there for the individual chromosomes?

Solutions (there might be multiple ways to get the results):

a) FASTA format

1) wc -l Uniprot_Mammalia.fasta # 11327260 lines
2) grep "^>" Uniprot_Mammalia.fasta | wc -l # 1277029 lines
3) grep "^>" Uniprot_Mammalia.fasta | grep "OS=Homo" | wc -l # 138576 lines
4) grep "^>" Uniprot_Mammalia.fasta | grep "OS=Homo sapiens" | wc -l # 138560 lines
5) grep "^>" Uniprot_Mammalia.fasta | grep "OS=Homo" | grep -v "OS=Homo sapiens" # check for yourself!

b) FASTQ format

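1) wc -l NKTCL_31T_1M_R1.fq # 4000000 lines
2) grep "^@" NKTCL_31T_1M_R1.fq | wc -l # 1013052 lines
3) A well-formed FASTQ file has four lines per read, so the number of reads is the number of lines divided by 4 (4000000/4 = 1 million reads in this case). The count from 2) is higher because “@” is also a valid quality character, so some quality lines begin with it as well.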
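
To see why counting “@” lines can overestimate the number of reads, here is a minimal sketch on a made-up three-read FASTQ file (read names, bases and qualities are invented). One quality line deliberately starts with “@”. The `awk` variant, which only looks at every fourth line starting from the first, is just one possible way of counting headers by position; it assumes the strict four-lines-per-read layout.

```shell
# Build a tiny, hypothetical FASTQ file; the quality line of read2 starts with "@".
tmp_fq=$(mktemp)
cat > "$tmp_fq" <<'EOF'
@read1
ACGTACGT
+
IIIIIIII
@read2
TTGACCAA
+
@IIIIHHH
@read3
GGCATGCA
+
IIII@III
EOF

lines=$(wc -l < "$tmp_fq")
# Naive count: every line beginning with "@" (headers and some quality lines).
at_lines=$(grep -c '^@' "$tmp_fq")
# Reads as lines divided by 4 (valid for a well-formed, unwrapped FASTQ file).
reads_by_division=$((lines / 4))
# Reads counted by position: only lines 1, 5, 9, ... are checked for "@".
reads_by_position=$(awk 'NR % 4 == 1 && /^@/ {n++} END {print n + 0}' "$tmp_fq")

echo "lines=$lines at_lines=$at_lines reads=$reads_by_division header_lines=$reads_by_position"
rm -f "$tmp_fq"
```

Here the naive `grep` reports 4 lines for only 3 reads, while both the division and the position-based count give 3.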

c) SAM format

1) One can find out by looking at the header line marked with tag “PG”: grep "^@PG" NKTCL_31T_1M.sam # novoalign, version 2.07.13
2) grep "^@" NKTCL_31T_1M.sam | wc -l # 27 lines
3) grep -v "^@" NKTCL_31T_1M.sam | wc -l # 2363216 lines
4) Alignments of individual reads, as well as records for unmapped reads.
5) grep -v "^@" NKTCL_31T_1M.sam | cut -f 1 | sort | uniq -c | wc -l # reads from 1000000 template sequences
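
The header/non-header split and the template count can be rehearsed on a toy SAM file. The sketch below writes three header lines and four records for two hypothetical read pairs, one mapped and one unmapped (the program name `example_aligner`, the read names and the coordinates are invented). It also counts unmapped records by testing flag bit 4 in the second column, which is an addition to the solutions above rather than part of them.

```shell
# Build a tiny, hypothetical SAM file (tab-separated) with printf.
tmp_sam=$(mktemp)
{
  printf '@HD\tVN:1.0\tSO:unsorted\n'
  printf '@SQ\tSN:chr1\tLN:1000\n'
  printf '@PG\tID:example_aligner\tVN:0.0\n'
  printf 'pairA\t99\tchr1\t10\t60\t8M\t=\t50\t48\tACGTACGT\tIIIIIIII\n'
  printf 'pairA\t147\tchr1\t50\t60\t8M\t=\t10\t-48\tTTGACCAA\tIIIIIIII\n'
  printf 'pairB\t77\t*\t0\t0\t*\t*\t0\t0\tGGCATGCA\tIIIIIIII\n'
  printf 'pairB\t141\t*\t0\t0\t*\t*\t0\t0\tCCATTGGA\tIIIIIIII\n'
} > "$tmp_sam"

# Header lines start with "@"; everything else is a read record.
header_lines=$(grep -c '^@' "$tmp_sam")
record_lines=$(grep -vc '^@' "$tmp_sam")
# Distinct read names (column 1) correspond to distinct templates.
templates=$(grep -v '^@' "$tmp_sam" | cut -f 1 | sort -u | wc -l)
# Records with flag bit 4 set are unmapped; int($2 / 4) % 2 extracts that bit.
unmapped=$(awk -F '\t' '!/^@/ && int($2 / 4) % 2 == 1 {n++} END {print n + 0}' "$tmp_sam")

echo "headers=$header_lines records=$record_lines templates=$templates unmapped=$unmapped"
rm -f "$tmp_sam"
```

`sort -u` is used here as a shorter equivalent of `sort | uniq` when only the number of distinct names is needed.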

d) BED format

1) wc -l NKTCL_31T_1M.bed # 2009579 regions
2) Alignment lines representing unmapped reads do not provide any regions.
3) cut -f 1 NKTCL_31T_1M.bed | sort | uniq -c # check for yourself!
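
As a small illustration of the per-chromosome count, the sketch below builds a made-up three-region BED file (chromosome names, coordinates and region names are invented) and tallies the regions with an `awk` associative array. This is only one alternative to the `cut | sort | uniq -c` pipeline above; it prints the chromosome first and the count second.

```shell
# Build a tiny, hypothetical six-column BED file (tab-separated).
tmp_bed=$(mktemp)
{
  printf 'chr1\t9\t17\tpairA/1\t60\t+\n'
  printf 'chr1\t49\t57\tpairA/2\t60\t-\n'
  printf 'chr2\t99\t107\tpairC/1\t60\t+\n'
} > "$tmp_bed"

# One line per region, so the line count is the region count.
regions=$(wc -l < "$tmp_bed")
# Count regions per chromosome (column 1), sorted by chromosome name.
per_chrom=$(awk -F '\t' '{n[$1]++} END {for (c in n) print c, n[c]}' "$tmp_bed" | sort)

echo "regions=$regions"
echo "$per_chrom"
rm -f "$tmp_bed"
```

On this toy file the result is 3 regions in total: 2 on chr1 and 1 on chr2.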