SAM and BAM Formats

Amel Ghouila, Claudia Chica, Emna Achouri & Fatma Guerfali
C3BI Hands-on NGS course – IPP – 23rd Nov 2016

SAM, BAM formats

[Workflow diagram: Raw sequence data (FASTQ files) → Mapping (Bowtie, BWA or others) → BAM/SAM files]

• After mapping the FASTQ file to the reference genome, you end up with a SAM or BAM alignment file.
• SAM stands for Sequence Alignment/Map format.
• A single SAM file can store mapped, unmapped, and even QC-failed reads from a sequencing run; in its sorted BAM form it can also be indexed to allow rapid access. This means that the raw sequencing data can be fully recapitulated from the SAM/BAM file.

SAM Format

[Figure: overview of the SAM format (Li Shen, 2014)]

SAM, BAM formats

• Plain-text SAM is rarely convenient to work with directly and takes up a lot of space, which is why, in principle, we use only BAM.
• A BAM file (.bam) is the binary version of a SAM file (it saves storage and is faster to manipulate).

SAM, BAM formats

• A SAM file (.sam) is a tab-delimited text file that contains sequence alignment data.
• SAM files can be opened using a text editor or viewed using the UNIX "more" command.
• Most alignment programs will supply:
  - a header: describes the format version, the sorting order of the reads, and the genomic sequences to which the reads were mapped
  - an alignment section: contains, for each sequence, the information about where and how it aligns to the reference genome
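The two-part layout described above (header, then alignment section) can be illustrated with a minimal Python sketch. It relies on the fact that SAM header lines start with "@"; the function name split_sam and the tiny example record are hypothetical and made up for illustration, not taken from the course data.

```python
def split_sam(text):
    """Split SAM text into header lines and tab-split alignment records."""
    header, alignments = [], []
    for line in text.splitlines():
        if not line:
            continue  # ignore blank lines
        if line.startswith("@"):
            header.append(line)  # header lines begin with '@'
        else:
            alignments.append(line.split("\t"))  # alignment lines are tab-delimited
    return header, alignments


# Hypothetical example: two header lines and one made-up alignment record.
EXAMPLE_SAM = "\n".join([
    "@HD\tVN:1.6\tSO:coordinate",  # format version and sorting order
    "@SQ\tSN:ref\tLN:45",          # a reference sequence and its length
    "r001\t163\tref\t7\t30\t8M\t=\t37\t39\tTTAGATAA\t*",
])

header, alignments = split_sam(EXAMPLE_SAM)
print(len(header), "header lines;", len(alignments), "alignment record(s)")
print("columns in first record:", len(alignments[0]))
```

For real files, a dedicated tool or library is a safer choice than hand-written parsing; the sketch only shows how the text format is organised.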
SAM, BAM formats

[Figure: example SAM file with its two parts labelled: the header, and the alignment section with 11 tab-separated columns]

SAM Format

The 11 columns of the alignment section, in order:

QNAME  FLAG  RNAME  POS  MAPQ  CIGAR  RNEXT  PNEXT  TLEN  SEQ  QUAL

http://samtools.sourceforge.net/SAM1.pdf
http://genome.sph.umich.edu/wiki/SAM

SAM format
(http://samtools.github.io/hts-specs/SAMv1.pdf)

QNAME: Query template NAME. Reads/segments having identical QNAME are regarded as coming from the same template. A QNAME '*' indicates the information is unavailable.

SAM format (2)
(http://samtools.github.io/hts-specs/SAMv1.pdf)

FLAG: bitwise FLAG (ideal for compression). A set of Boolean flags (12 bits in the current specification), all stored in a single column.

SAM flag: example

[Figure: example SAM file]

Read mapped to position 7, FLAG 163 (= 1 + 2 + 32 + 128):
- the read is the second read in the pair (128)
- the read is paired and properly paired (1 + 2)
- its mate is mapped to position 37 on the reverse strand (32)

Decoding SAM flags

Explain flag tool: https://broadinstitute.github.io/picard/explain-flags.html

SAM format (3)
(http://samtools.github.io/hts-specs/SAMv1.pdf)

MAPQ: mapping quality. It equals −10 log10 Pr{mapping position is wrong}, rounded to the nearest integer.
The MAPQ value can be used to figure out how unique an alignment is in the genome.
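As a rough sketch of how the fields described so far can be interpreted, the Python below splits one alignment line into its 11 columns, decodes the bitwise FLAG, and inverts the MAPQ formula to recover the error probability. The function names and the example record are hypothetical; the bit descriptions are paraphrased, not quoted from the specification.

```python
FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
          "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

# Bit value -> meaning (wording paraphrased for illustration).
FLAG_BITS = {
    1: "read paired",
    2: "read mapped in proper pair",
    4: "read unmapped",
    8: "mate unmapped",
    16: "read on reverse strand",
    32: "mate on reverse strand",
    64: "first in pair",
    128: "second in pair",
    256: "not primary alignment",
    512: "read fails quality checks",
    1024: "PCR or optical duplicate",
    2048: "supplementary alignment",
}


def parse_alignment(line):
    """Map the first 11 tab-separated columns to their field names."""
    values = line.rstrip("\n").split("\t")
    record = dict(zip(FIELDS, values[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])  # these columns are integers
    return record


def decode_flag(flag):
    """List the meaning of every bit that is set in FLAG."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def mapq_to_error_probability(mapq):
    """Invert MAPQ = -10 * log10(p); 255 means 'not available'."""
    if mapq == 255:
        return None
    return 10 ** (-mapq / 10)


# Hypothetical record reusing the FLAG 163 example (read at 7, mate at 37).
record = parse_alignment("r001\t163\tref\t7\t30\t8M\t=\t37\t39\tTTAGATAA\t*")
print(decode_flag(record["FLAG"]))
print(mapq_to_error_probability(record["MAPQ"]))
```

Testing each bit with `flag & bit` is the same decomposition as 163 = 1 + 2 + 32 + 128, done mechanically.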
- A large MAPQ (>10) indicates that the alignment is likely unique.
- A MAPQ of 255 indicates that the mapping quality is not available.

SAM format: CIGAR string

• The CIGAR string is a sequence of numbers and letters describing how the bases of the read align to the reference: which bases align with the reference (as either a match or a mismatch), which are deleted from the reference, and where the read contains insertions that are not in the reference.

More information about these formats is available here:
http://samtools.sourceforge.net
https://samtools.github.io/hts-specs/SAMv1.pdf

SAM format: CIGAR string

Mapped and unmapped reads are imported into SAM/BAM format.

The standard CIGAR description of pairwise alignment defines three operations: 'M' for alignment match, 'I' for insertion compared with the reference, and 'D' for deletion.
(NB: The POS indicates that the read aligns starting at position 5 on the reference.)

The CIGAR:
3M = 3 bases in the read sequence align with the reference.
1I = The next base in the read does not exist in the reference.
1D = The next reference base does not exist in the read sequence.

POS: 5
CIGAR: 3M1I3M1D2M
http://genome.sph.umich.edu/wiki/SAM

SAM format: CIGAR string

[Figure: example alignments and the corresponding SAM file, showing CIGAR strings for different types of alignments (Li et al., 2009)]

SAM format (5)
(http://samtools.github.io/hts-specs/SAMv1.pdf)

RNEXT: name of the reference sequence to which the mate is aligned (mate information for paired-end sequencing)
PNEXT: position of the mate (mate information)

The chromosome and position are obviously important. The CIGAR string is also important for knowing where insertions, deletions, or skipped reference regions (e.g. introns in spliced alignments) occur in your read.
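The CIGAR example above (POS 5, CIGAR 3M1I3M1D2M) can be walked through with a short Python sketch. It handles only the three operations named on the slide (M, I, D) and rejects anything else; the function name walk_cigar is hypothetical.

```python
import re


def walk_cigar(pos, cigar):
    """Return (read_length, last_reference_position) for an M/I/D CIGAR.

    pos is the 1-based leftmost reference position of the alignment.
    """
    read_length = 0
    ref_pos = pos  # next reference position still to be consumed
    for count, op in re.findall(r"(\d+)([A-Z=])", cigar):
        n = int(count)
        if op == "M":    # match/mismatch: consumes read and reference
            read_length += n
            ref_pos += n
        elif op == "I":  # insertion: bases present only in the read
            read_length += n
        elif op == "D":  # deletion: reference bases absent from the read
            ref_pos += n
        else:
            raise ValueError(f"operation {op!r} is not handled in this sketch")
    return read_length, ref_pos - 1


print(walk_cigar(5, "3M1I3M1D2M"))  # → (9, 13)
```

The read is 9 bases long (3 + 1 + 3 + 2) and covers reference positions 5 to 13, because M and D advance along the reference while I does not.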
