File Formats Exercises

Introduction to high-throughput sequencing file formats – advanced exercises

Daniel Vodák (Bioinformatics Core Facility, The Norwegian Radium Hospital) ( [email protected] )

All example files are located in the directory “/data/file_formats/data”. You can set your current directory to that path (“cd /data/file_formats/data”).

Exercises:

a) FASTA format
File “Uniprot_Mammalia.fasta” contains a collection of mammalian protein sequences obtained from the UniProt database ( http://www.uniprot.org/uniprot ).
1) Count the number of lines in the file.
2) Count the number of header lines (i.e. the number of sequences) in the file.
3) Count the number of header lines with “Homo” in the organism/species name.
4) Count the number of header lines with “Homo sapiens” in the organism/species name.
5) Display the header lines that have “Homo” in the organism/species name but not “Homo sapiens”.

b) FASTQ format
File NKTCL_31T_1M_R1.fq contains a reduced collection of tumor read sequences obtained from The Sequence Read Archive ( http://www.ncbi.nlm.nih.gov/sra ).
1) Count the number of lines in the file.
2) Count the number of lines beginning with the “at” symbol (“@”).
3) How many reads are there in the file?

c) SAM format
File NKTCL_31T_1M.sam contains alignments of the paired-end reads stored in files NKTCL_31T_1M_R1.fq and NKTCL_31T_1M_R2.fq.
1) Which program was used for the alignment?
2) How many header lines are there in the file?
3) How many non-header lines are there in the file?
4) What do the non-header lines represent?
5) How many template sequences do the reads used in the alignment process come from?

d) BED format
File NKTCL_31T_1M.bed holds information about the alignment locations stored in file NKTCL_31T_1M.sam.
1) How many regions are listed in the BED file?
2) Why does the number of regions not match the number of alignment lines in file NKTCL_31T_1M.sam?
3) How many regions are there for the individual chromosomes?
Solutions (there might be multiple ways to get the results):

a) FASTA format
1) wc -l Uniprot_Mammalia.fasta # 11327260 lines
2) grep "^>" Uniprot_Mammalia.fasta | wc -l # 1277029 lines
3) grep "^>" Uniprot_Mammalia.fasta | grep "OS=Homo" | wc -l # 138576 lines
4) grep "^>" Uniprot_Mammalia.fasta | grep "OS=Homo sapiens" | wc -l # 138560 lines
5) grep "^>" Uniprot_Mammalia.fasta | grep "OS=Homo" | grep -v "OS=Homo sapiens" # check for yourself!

b) FASTQ format
1) wc -l NKTCL_31T_1M_R1.fq # 4000000 lines
2) grep "^@" NKTCL_31T_1M_R1.fq | wc -l # 1013052 lines (more than the number of reads, because quality lines can also begin with “@”)
3) A well-formed FASTQ file has four lines per read, so the number of reads is the number of lines divided by 4 (4000000 / 4 = 1 million reads in this case).

c) SAM format
1) One can find out by looking at the header line marked with the tag “PG”:
grep "^@PG" NKTCL_31T_1M.sam # novoalign, version 2.07.13
2) grep "^@" NKTCL_31T_1M.sam | wc -l # 27 lines
3) grep "^@" -v NKTCL_31T_1M.sam | wc -l # 2363216 lines
4) Alignments of individual reads as well as unmapped reads.
5) grep "^@" -v NKTCL_31T_1M.sam | cut -f 1 | sort | uniq -c | wc -l # reads from 1000000 template sequences

d) BED format
1) wc -l NKTCL_31T_1M.bed # 2009579 regions
2) Alignment lines representing unmapped reads do not provide any regions.
3) cut -f 1 NKTCL_31T_1M.bed | sort | uniq -c # check for yourself!
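
The FASTQ solutions report more lines beginning with “@” (1013052) than reads (1 million). The sketch below shows one way this can happen and how to count headers by position instead. It assumes four lines per record (no line-wrapped sequences); the function names and the two-read demo file are made up for illustration and are not part of the exercise data.

```shell
#!/bin/sh
# count_fastq_reads FILE: number of records, assuming exactly 4 lines per record.
# A fractional result signals a file that is not a multiple of 4 lines long.
count_fastq_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# count_fastq_headers FILE: count "@" lines only at record position 1
# (lines 1, 5, 9, ...), so quality lines starting with "@" are ignored.
count_fastq_headers() {
    awk 'NR % 4 == 1 && /^@/ { n++ } END { print n + 0 }' "$1"
}

# Hypothetical demo file: the quality line of the second read begins with "@".
demo=$(mktemp)
cat > "$demo" <<'EOF'
@read1
ACGTACGT
+
IIIIIIII
@read2
TTGGCCAA
+
@IIIIIII
EOF

grep -c "^@" "$demo"          # prints 3 (two headers plus one quality line)
count_fastq_headers "$demo"   # prints 2
count_fastq_reads "$demo"     # prints 2
rm -f "$demo"
```

On the exercise file, count_fastq_reads NKTCL_31T_1M_R1.fq should agree with the line count divided by 4, provided the file is well formed.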

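The SAM and BED solutions note that some alignment lines represent unmapped reads, which contribute no regions. In the SAM format, an unmapped read has bit 0x4 set in the FLAG field (column 2). The following is a minimal sketch of counting such lines with portable awk arithmetic; the function names and the three-line demo file are hypothetical, and whether the counts fully account for the difference between the SAM and BED files depends on how the BED file was produced.

```shell
#!/bin/sh
# count_sam_unmapped FILE: non-header lines whose FLAG has bit 0x4 set.
# int(FLAG / 4) % 2 extracts that bit without gawk-specific functions.
count_sam_unmapped() {
    awk -F '\t' '!/^@/ && int($2 / 4) % 2 == 1 { n++ } END { print n + 0 }' "$1"
}

# count_sam_mapped FILE: non-header lines whose FLAG has bit 0x4 clear.
count_sam_mapped() {
    awk -F '\t' '!/^@/ && int($2 / 4) % 2 == 0 { n++ } END { print n + 0 }' "$1"
}

# Hypothetical demo file: one header line, a mapped pair (flags 99 and 147)
# and one unmapped read (flag 77).
demo=$(mktemp)
printf '@HD\tVN:1.0\n' > "$demo"
printf 'r1\t99\tchr1\t100\t60\t8M\t=\t200\t108\tACGTACGT\tIIIIIIII\n' >> "$demo"
printf 'r1\t147\tchr1\t200\t60\t8M\t=\t100\t-108\tTTGGCCAA\tIIIIIIII\n' >> "$demo"
printf 'r2\t77\t*\t0\t0\t*\t*\t0\t0\tACGTACGT\tIIIIIIII\n' >> "$demo"

count_sam_mapped "$demo"     # prints 2
count_sam_unmapped "$demo"   # prints 1
rm -f "$demo"
```

Running count_sam_unmapped NKTCL_31T_1M.sam would show how many of the 2363216 alignment lines are unmapped reads.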