SAM and BAM formats

Amel Ghouila, Claudia Chica, Emna Achouri & Fatma Guerfali
C3BI Hands-on NGS course – IPP – 23rd Nov 2016

SAM, BAM formats

Raw sequence data: Fastq files → Mapping (Bowtie, BWA or others) → BAM/SAM files

• After mapping the FASTQ file to the reference, you will end up with a SAM/BAM alignment file.
• SAM stands for Sequence Alignment/Map.
• A single SAM or BAM file can store mapped, unmapped, and even QC-failed reads from a sequencing run, and can be indexed to allow rapid access. This means that the raw sequencing data can be fully recapitulated from the SAM/BAM file.

SAM Format

[Figure: overview of the SAM format (Li Shen, 2014)]

SAM, BAM formats

• A BAM file (.bam) is the binary version of a SAM file (saving storage and allowing faster manipulation).
• In principle, the SAM file itself is rarely helpful and takes up too much space, which is why we use only the BAM file.

SAM, BAM formats

• A SAM file (.sam) is a tab-delimited text file that contains sequence alignment data.
• SAM files can be opened using a text editor or viewed using the UNIX "more" command.
• Most alignment programs will supply:
  - a header: describing the format version, the sorting order of the reads, and the genomic sequences to which the reads were mapped
  - an alignment section: contains the information for each sequence about where/how it aligns to the reference genome

SAM, BAM formats

• Header
• Alignment section: 11 columns (tab-separated)

SAM Format

QNAME  FLAG  RNAME  POS  MAPQ  CIGAR

RNEXT  PNEXT  TLEN  SEQ  QUAL

http://samtools.sourceforge.net/SAM1.pdf
http://genome.sph.umich.edu/wiki/SAM

SAM format

QNAME (http://samtools.github.io/hts-specs/SAMv1.pdf): Query template NAME. Reads/segments having identical QNAME are regarded as coming from the same template. A QNAME '*' indicates the information is unavailable.

SAM format (2)

FLAG (http://samtools.github.io/hts-specs/SAMv1.pdf): bitwise FLAG. 12 boolean flags, all stored in a single column (ideal for compression).

SAM flag: example

Example read from the SAM file: mapped to position 7, with its mate mapped to position 37.

FLAG 163 (= 1 + 2 + 32 + 128):
- Read is properly paired (1 + 2)
- Its mate is mapped to the reverse strand (32)
- Read is the second read in the pair (128)

Decoding SAM flags

Explain flag tool: https://broadinstitute.github.io/picard/explain-flags.html

SAM format (3)

MAPQ (http://samtools.github.io/hts-specs/SAMv1.pdf): MAPping Quality. It equals −10 log10 Pr{mapping position is wrong}, rounded to the nearest integer. A value of 255 indicates that the mapping quality is not available.

✓ The MAPQ value can be used to figure out how unique an alignment is in the genome.
✓ A large number (>10) indicates that the alignment is likely to be unique.
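The fields described above can be explored with a few lines of Python. The block below is a minimal sketch, not part of the original course material: the function names and the example alignment line are made up for illustration. The flag table follows the current SAMv1 specification, which defines 12 bits, and the MAPQ conversion simply inverts the −10 log10 formula.

```python
import math

# The 11 mandatory columns of the alignment section, in order.
SAM_COLUMNS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
               "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

# Bit values and meanings of the FLAG field (SAMv1 specification).
SAM_FLAG_BITS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read on reverse strand",
    0x20: "mate on reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def split_alignment_line(line):
    """Split one tab-delimited alignment line into its 11 named columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(SAM_COLUMNS):
        raise ValueError("an alignment line needs at least 11 columns")
    record = dict(zip(SAM_COLUMNS, fields))  # optional tags are ignored
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    return record


def decode_flag(flag):
    """Return the meaning of every bit that is set in a FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


def mapq_to_error_probability(mapq):
    """Invert MAPQ = -10 * log10(Pr{mapping position is wrong})."""
    if mapq == 255:  # 255 means the mapping quality is not available
        return None
    return 10 ** (-mapq / 10)


def error_probability_to_mapq(p_wrong):
    """Apply the MAPQ formula and round to the nearest integer."""
    return round(-10 * math.log10(p_wrong))


# Hypothetical alignment line, invented for this example.
example = "read1\t163\tchr1\t7\t30\t8M\t=\t37\t39\tTTAGATAA\t*"
record = split_alignment_line(example)
print(record["QNAME"], decode_flag(record["FLAG"]))
print(mapq_to_error_probability(record["MAPQ"]))
```

For FLAG 163, `decode_flag` returns the same four properties as the worked example: paired, properly paired, mate on the reverse strand, second in pair. A MAPQ of 10 corresponds to a 1 in 10 chance that the mapping position is wrong, which is why values above 10 are read as "probably unique".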
More information about these formats is available here:
• http://samtools.sourceforge.net
• https://samtools.github.io/hts-specs/SAMv1.pdf

SAM format: CIGAR string

The CIGAR string is a sequence of numbers and letters representing the alignment. It is used to indicate things like which bases align with the reference (either a match or a mismatch), which bases are deleted from the reference, and whether there are insertions that are not in the reference.

SAM format: CIGAR string (http://genome.sph.umich.edu/wiki/SAM)

The standard CIGAR description of pairwise alignment defines three operations: 'M' for match (or mismatch), 'I' for insertion and 'D' for deletion.

Example: POS: 5, CIGAR: 3M1I3M1D2M
• The POS indicates that the read aligns starting at position 5 on the reference.
• 3M = the first 3 bases in the read sequence align with the reference.
• 1I = the next base in the read does not exist in the reference.
• 3M = the next 3 bases align with the reference.
• 1D = the next reference base does not exist in the read sequence.
• 2M = 2 more bases align with the reference.

(NB: the reads are compared with the reference genome; both mapped and unmapped reads are imported into the SAM/BAM format.)

SAM format: CIGAR string

Examples of CIGAR strings for different types of alignments (Li et al., 2009).
[Figure: alignments and the corresponding SAM file]

SAM format (5)

(http://samtools.github.io/hts-specs/SAMv1.pdf)
• RNEXT: name of mate (mate pair information for paired-end sequencing)
• PNEXT: position of mate (mate pair information)

Obviously, the chromosome and position are important. The CIGAR string is also important to know where insertions (i.e. ) might exist in your read.
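The CIGAR example above (POS 5, CIGAR 3M1I3M1D2M) can be checked programmatically. The following is a minimal sketch with illustrative function names of my own choosing. It assumes the operation letters listed in the SAMv1 specification (M, I, D, N, S, H, P, =, X) and does not handle the '*' placeholder used when no CIGAR is available.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Turn a CIGAR string such as '3M1I3M1D2M' into (length, op) pairs."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Rebuild the string to make sure nothing was silently skipped.
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    return ops


def reference_end(pos, cigar):
    """Last reference position covered by an alignment starting at pos.

    Only operations that consume reference bases (M, D, N, =, X) count.
    """
    consumed = sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")
    return pos + consumed - 1


def read_length(cigar):
    """Number of read bases implied by the CIGAR (M, I, S, =, X)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")


print(parse_cigar("3M1I3M1D2M"))
print(reference_end(5, "3M1I3M1D2M"))
print(read_length("3M1I3M1D2M"))
```

In the example, the inserted base adds to the read length but not to the reference span, while the deleted base does the opposite: the read is 9 bases long and covers reference positions 5 to 13.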