SAM and BAM Formats

Amel Ghouila, Claudia Chica, Emna Achouri & Fatma Guerfali
C3BI Hands-on NGS course – IPP – 23rd Nov 2016

SAM, BAM formats

Workflow: raw sequence data (FASTQ files) → mapping (Bowtie, BWA or others) → BAM/SAM files

  • After mapping the FASTQ file to the reference genome, you end up with a SAM or BAM alignment file.
  • SAM stands for Sequence Alignment/Map format.
  • A single SAM/BAM file can store mapped, unmapped, and even QC-failed reads from a sequencing run and, when sorted and stored as BAM, can be indexed to allow rapid access. This means that the raw sequencing data can be fully recapitulated from the SAM/BAM file.

SAM Format (Li Shen, 2014)

SAM, BAM formats

  • Plain SAM is rarely helpful to work with directly and takes up a lot of space, which is why, in principle, we use only BAM.
  • A BAM file (.bam) is the binary version of a SAM file; it saves storage and is faster to manipulate.
  • A SAM file (.sam) is a tab-delimited text file that contains sequence alignment data.
  • SAM files can be opened in a text editor or viewed with the UNIX "more" command.
  • Most alignment programs supply:
      - a header: describes the format version, the sorting order of the reads, and the genomic sequences to which the reads were mapped
      - an alignment section: contains, for each sequence, the information about where and how it aligns to the reference genome
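As a rough illustration of the two-part layout described above (header plus alignment section), the following Python sketch separates the two. It assumes that header lines start with '@', as the SAM specification prescribes. The function name split_sam and the sample content are made up for illustration and do not come from a real sequencing run.

```python
def split_sam(text):
    """Split SAM text into header lines and tab-separated alignment records."""
    header, records = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("@"):
            header.append(line)  # header lines begin with '@'
        else:
            records.append(line.split("\t"))  # one list of fields per read
    return header, records


# Illustrative SAM content (hypothetical data, not from a real run).
SAMPLE_SAM = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:ref\tLN:45\n"
    "r001\t163\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*\n"
)

header, records = split_sam(SAMPLE_SAM)
print(len(header), "header lines;", len(records), "alignment record(s)")
print("fields in first record:", len(records[0]))
```

For real data, a dedicated library or samtools is the safer choice; this sketch only shows why a SAM file can be inspected with ordinary text tools.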
SAM, BAM formats

A SAM file consists of a header and an alignment section. The alignment section has 11 columns (tab-separated):

  QNAME  FLAG  RNAME  POS  MAPQ  CIGAR  RNEXT  PNEXT  TLEN  SEQ  QUAL

Sources: http://samtools.sourceforge.net/SAM1.pdf and http://genome.sph.umich.edu/wiki/SAM

SAM format
(http://samtools.github.io/hts-specs/SAMv1.pdf)

QNAME: Query template NAME. Reads/segments having identical QNAME are regarded to come from the same template. A QNAME '*' indicates the information is unavailable.

SAM format (2)
(http://samtools.github.io/hts-specs/SAMv1.pdf)

FLAG: bitwise FLAG (ideal for compression). A set of boolean flags (11 in the original specification, 12 in current versions), all stored in a single column.

SAM flag: example

A read mapped to position 7 with FLAG 163 (= 1 + 2 + 32 + 128):
  • the read is paired and mapped in a proper pair (1 + 2)
  • its mate is mapped to position 37 on the reverse strand (32)
  • the read is the second read in the pair (128)

Decoding SAM flags
Explain flag tool: https://broadinstitute.github.io/picard/explain-flags.html

SAM format (3)
(http://samtools.github.io/hts-specs/SAMv1.pdf)

MAPQ: MAPping Quality. It equals −10 log10 Pr{mapping position is wrong}, rounded to the nearest integer. The MAPQ value can be used to figure out how unique an alignment is in the genome.
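The FLAG and MAPQ fields described above can be decoded with a few lines of Python. This is a minimal sketch, not a replacement for the Picard tool: the bit values follow the SAM specification, the descriptions are paraphrased, and the function names decode_flag and mapq_to_error_probability are hypothetical. The second function simply inverts the MAPQ formula, so it ignores the rounding step.

```python
# Bit values from the SAM specification; descriptions are paraphrased.
FLAG_BITS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read on reverse strand",
    0x20: "mate on reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag):
    """Return the descriptions of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def mapq_to_error_probability(mapq):
    """Invert MAPQ = -10 * log10(P); return None when MAPQ is 255 (unavailable)."""
    if mapq == 255:
        return None
    return 10 ** (-mapq / 10)


print(decode_flag(163))
# ['read paired', 'read mapped in proper pair', 'mate on reverse strand', 'second in pair']
print(mapq_to_error_probability(30))
```

Under this formula, a MAPQ of 30 corresponds to roughly a 1 in 1000 chance that the mapping position is wrong, and a MAPQ of 10 to roughly 1 in 10.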
Interpreting MAPQ values:
  • A large number (>10) indicates that the alignment is likely unique.
  • 255 indicates that the mapping quality is not available.

SAM format: CIGAR string

  • The CIGAR string is a sequence of numbers and letters describing how the read bases align to the reference: which bases align (as either a match or a mismatch), which reference bases are deleted from the read, and which read bases are insertions absent from the reference.

More information about these formats is available here:
http://samtools.sourceforge.net
https://samtools.github.io/hts-specs/SAMv1.pdf

SAM format: CIGAR string (example)

Mapped and unmapped reads are imported into SAM/BAM format. The standard CIGAR description of pairwise alignment defines three operations: 'M' for alignment match, 'I' for insertion compared with the reference, and 'D' for deletion.

Example (http://genome.sph.umich.edu/wiki/SAM):

  POS: 5
  CIGAR: 3M1I3M1D2M

(NB: the POS indicates that the read aligns starting at position 5 on the reference.)

The CIGAR reads:
  • 3M = 3 bases in the read sequence align with the reference.
  • 1I = the next base in the read does not exist in the reference.
  • 1D = the reference base does not exist in the read sequence.

Examples: alignments and their CIGAR strings for different types of alignments (Li et al., 2009)

SAM format (5)
(http://samtools.github.io/hts-specs/SAMv1.pdf)

  • RNEXT: reference sequence name of the mate's alignment (mate-pair information for paired-end sequencing)
  • PNEXT: position of the mate (mate-pair information)

The chromosome and position are obviously important. The CIGAR string is also important for knowing where gaps such as insertions, deletions, or introns (in spliced reads) occur in the alignment of your read.
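The CIGAR example above (POS 5, CIGAR 3M1I3M1D2M) can be worked through in Python. This is a minimal sketch under stated assumptions: the slides define only the M, I and D operations, while the remaining operation letters and the rule for which operations consume read or reference bases are taken from the SAM specification. The names parse_cigar and alignment_span are hypothetical.

```python
import re

# Operation letters defined by the SAM specification.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

READ_CONSUMING = "MIS=X"  # operations that use up bases of the read
REF_CONSUMING = "MDN=X"   # operations that use up bases of the reference


def parse_cigar(cigar):
    """Turn a CIGAR string such as '3M1I3M' into [(3, 'M'), (1, 'I'), (3, 'M')]."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Reject strings containing anything the pattern did not account for.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def alignment_span(pos, cigar):
    """Return (read length, last reference position covered), POS being 1-based."""
    read_len = ref_len = 0
    for n, op in parse_cigar(cigar):
        if op in READ_CONSUMING:
            read_len += n
        if op in REF_CONSUMING:
            ref_len += n
    return read_len, pos + ref_len - 1


print(parse_cigar("3M1I3M1D2M"))
print(alignment_span(5, "3M1I3M1D2M"))  # (9, 13)
```

In this example the read is 9 bases long (3 + 1 + 3 + 2) and covers reference positions 5 to 13: the inserted base adds to the read length only, and the deleted base adds to the reference span only.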
