The Structure of the SAM File

2014 - BMMB 852: Applied Bioinformatics
Week 7, Lecture 14
István Albert
Bioinformatics Consulting Center, Penn State
10/9/14

The majority of data analyses split off after generating a SAM file

Sequencing Reads → Alignment Software → SAM file → Chip-Seq, RNA-Seq, SNPs

The structure of the SAM file

• SAM Headers
• Alignments

Create test queries: exact match, one mismatch, 1 deletion etc.

Resource: SAM specification

Published as "The Sequence Alignment/Map format and SAMtools" by Heng Li et al., Bioinformatics, Volume 25, Issue 16, 2009.

"A TAB-delimited text format consisting of a header section, which is optional, and an alignment section."

SAM format: tabular text format

11 required columns + optional fields. An alignment consists of 11 tab-delimited columns.

Columns 1 and 2: QNAME and FLAG (query name and bitwise flags)

QNAME: the name of the query sequence

Column 2: FLAG, the bitwise representation

 1 = 00000001 → paired end read
 2 = 00000010 → mapped as proper pair
 4 = 00000100 → read unmapped
 8 = 00001000 → read mate unmapped
16 = 00010000 → read mapped on reverse strand

The flag 11 → 1 + 2 + 8 = 00001011 (conditions 1, 2 and 8 satisfied)

The bitwise encoding is used to save space, but it does make things a bit more difficult. Usually very few flags are needed in practice: 0, 4 and 16 are the most generic ones.

If you need to construct a more complex flag, search for "explain SAM flags": http://picard.sourceforge.net/explain-flags.html

Columns 3, 4: RNAME and POS (reference and position)

Column 4 POS: 1-based leftmost mapping POSition of the first matching base. This is very important to remember later, when we need to find the 5' end (the actual start) of a read.

Column 5: MAPQ - Mapping Quality

A Phred score, defined the same way as the quality measure in the fastq file. For quality Q and probability P:

P = 10 ^ (-Q / 10.0)

If Q=30, P=1/1000 → on average, one out of 1000 alignments will be wrong.

As good as this sounds, it is not easy to compute such a quality.

Details of the mapping quality computation: hard to find good answers

• Tool specific: there is no standard of what it should be.
• The repeat structure of the reference. Reads falling in repetitive regions usually get very low mapping quality.
• The base quality of the read. Low quality means the observed read sequence is possibly wrong, and wrong sequence may lead to a wrong alignment.
• The sensitivity of the alignment algorithm. The true hit is more likely to be missed by an algorithm with low sensitivity, which also causes mapping errors.
• Paired end or not. Reads mapped in pairs are more likely to be correct.

(from the MAQ manual)

BWA specific high scores

A read alignment with a mapping quality of 30 or above usually implies:

• The overall base quality of the read is good.
• The best alignment has few mismatches.
• The read has few or just one `good' hit on the reference, which means the current alignment is still the best even if one or two bases are actually mutations or sequencing errors.

BWA specific low scores

Surprisingly difficult to track down the exact behavior.

• Q=0 → if a read can be aligned equally well to multiple positions, BWA will randomly pick one position and give it a mapping quality of zero.
• Q=25 → the edit distance equals the number of mismatches and is greater than zero.

Column 6: CIGAR

• CIGAR = Compact Idiosyncratic Gapped Alignment Report

Columns 7, 8, 9: RNEXT, PNEXT, TLEN (used in paired end read sequencing)

• RNEXT: the reference name of the mate's alignment
• PNEXT: the position of the mate
• TLEN: the signed observed template length, that is, the span from the leftmost to the rightmost mapped base of the pair

We will discuss these in more detail later. These columns can reveal reads whose mates map far apart, which allows us to infer genomic variations.

Columns 10, 11: SEQ and QUAL

• SEQ: the query sequence
• QUAL: the Phred encoded quality sequence

SEQ may contain the original sequence or the segment it was aligned to. Not all tools do the same thing.

Column 12 and beyond

Specific information about the alignment process that the tool was able to establish.
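The column layout, the bitwise FLAG and the Phred formula described above can be tried out with a few lines of Python. This is only a minimal sketch: the function names (`decode_flag`, `mapq_to_error_prob`, `parse_alignment`) and the example alignment line are made up for illustration, and the flag table covers only the five bits listed in the FLAG section, not every bit a real aligner may set.

```python
# Bit values and meanings as listed in the FLAG section (five bits only).
FLAG_BITS = {
    1: "paired end read",
    2: "mapped as proper pair",
    4: "read unmapped",
    8: "read mate unmapped",
    16: "read mapped on reverse strand",
}

# The 11 required columns, in order.
COLUMNS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
           "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]


def decode_flag(flag):
    """Return the meanings of the listed bits that are set in flag."""
    return [meaning for bit, meaning in FLAG_BITS.items() if flag & bit]


def mapq_to_error_prob(q):
    """Phred conversion: P = 10 ^ (-Q / 10.0)."""
    return 10 ** (-q / 10.0)


def parse_alignment(line):
    """Split one alignment line into the 11 required columns plus optional fields."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(COLUMNS, fields[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["OPTIONAL"] = fields[11:]  # column 12 and beyond, left unparsed
    return record


# A made-up alignment line, for illustration only.
example = "read1\t16\tchr1\t100\t30\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNM:i:0"
rec = parse_alignment(example)

print(rec["QNAME"], rec["POS"], decode_flag(rec["FLAG"]))
print(decode_flag(11))  # bits 1, 2 and 8
print(mapq_to_error_prob(rec["MAPQ"]))
```

Testing a bit with `flag & bit` is the same check that the binary representation makes visible: flag 11 has the bits for 1, 2 and 8 set, so those three conditions are reported.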
More details on columns 12 and beyond follow in later lectures.

Homework 14

Download data-14.tar.gz. Align the data in paired end mode with bwa. How many distinct alignment flags do you get? What does each flag mean?
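One possible way to tally the flags in a SAM file is sketched below. It assumes only what the format guarantees: header lines start with `@` and FLAG is the second tab-delimited column. The function name `count_flags` and the tiny in-memory SAM text are made up for illustration and are not the homework data.

```python
import io
from collections import Counter


def count_flags(lines):
    """Count how many alignments carry each FLAG value."""
    counts = Counter()
    for line in lines:
        # Skip header lines (they start with '@') and blank lines.
        if line.startswith("@") or not line.strip():
            continue
        flag = int(line.split("\t")[1])  # FLAG is column 2
        counts[flag] += 1
    return counts


# A made-up SAM fragment: one header line and four alignments.
sam_text = (
    "@HD\tVN:1.0\n"
    "r1\t0\tchr1\t10\t30\t4M\t*\t0\t0\tACGT\tIIII\n"
    "r2\t16\tchr1\t20\t30\t4M\t*\t0\t0\tACGT\tIIII\n"
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n"
    "r4\t0\tchr1\t40\t25\t4M\t*\t0\t0\tACGT\tIIII\n"
)

counts = count_flags(io.StringIO(sam_text))
for flag, n in sorted(counts.items()):
    print(flag, n)
print("distinct flags:", len(counts))
```

For a file on disk the same function can be given an open file handle, for example `count_flags(open("aligned.sam"))`, where `aligned.sam` is a hypothetical name for the bwa output.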
