Sequence Alignment and Comparison

10/10/2016

Sequence Comparison
Abhishek Niroula
Protein Structure and Bioinformatics
Department of Experimental Medical Science
Lund University
2016-10-11

Learning goals
• What is a sequence alignment?
• What approaches are used for aligning sequences?
• How to choose the best alignment?
• What are substitution matrices?
• Which tools are available for aligning two or more sequences?
• How to use the alignment tools?
• How to interpret results obtained from the tools?

What is sequence alignment?
• A way of arranging two or more sequences to identify regions of similarity
• Shows the locations of similarities and differences between the sequences
• Aligned residues are assumed to derive from the same residue in the common ancestor
• Insertions and deletions are represented by gaps in the alignment
• An 'optimal' alignment exhibits the most similarities and the fewest differences
• Examples (* marks columns with identical residues)

Protein sequence alignment
MSTGAVLIY--TSILIKECHAMPAGNE-----
---GGILLFHRTHELIKESHAMANDEGGSNNS
   *  *    *  **** ***

Nucleotide sequence alignment
attcgttggcaaatcgcccctatccggccttaa
att---tggcggatcg-cctctacgggcc----
***   ****  **** **    * ****

Why align sequences?
• Reveal structural, functional and evolutionary relationships between biological sequences
• Similar sequences may have similar structure and function
• Similar sequences are likely to share a common ancestral sequence
• Annotation of new sequences
• Modelling of protein structures

Sequence alignment: Types
• Global alignment
  – Aligns every residue in every sequence, introducing gaps where needed
  – Example: Needleman-Wunsch algorithm

L G P S S K Q T G K G S - S R I W D N
L N - I T K S A G K G A I M R L G D A

Sequence alignment: Types
• Local alignment
  – Finds the regions with the highest density of matches
  – Example: Smith-Waterman algorithm

F T F T A L I L L - A V A V
- - F T A L - L L A A V - -

- - - - - - - - - T G K G H R R K S P R S D E L K A A G K G - - - - - -

How to find the best alignment?
• Seq1: TACGGGCAG
• Seq2: ACGGCG

Option 1     Option 2     Option 3
TACGGGCAG    TACGGGCAG    TACGGGCAG
-AC-GGC-G    -ACGG-C-G    -ACG-GC-G

Find the alignment score!

How to find the alignment score?
• Scoring matrices assign a score to each pair of aligned characters
• Identities and substitutions by similar amino acids are assigned positive scores
• Mismatches, i.e. substitutions that are unlikely to have arisen and been retained during evolution, are given negative scores

Example (match = +5, mismatch = -5)
 A  C  D  E  F  G  H  I  K
 A  C  Y  E  F  G  R  I  K
+5 +5 -5 +5 +5 +5 -5 +5 +5
Alignment score = sum of the column scores = 7 × (+5) + 2 × (-5) = +25

[Figure: PAM-1 substitution matrix]

What is a PAM matrix?
• PAM matrices
  – PAM: Point Accepted Mutation (also expanded as Percent Accepted Mutation)
  – A PAM matrix gives the probability that a given amino acid is replaced by each other amino acid
  – An accepted point mutation in a protein is a replacement of one amino acid by another that has been accepted by natural selection
  – Derived from global alignments of closely related sequences
  – The number in the matrix name (PAM40, PAM100) refers to the evolutionary distance; greater numbers mean greater distances
  – The 1-PAM matrix corresponds to the amount of evolution that changes, on average, 1% of the residues (or bases)
  – The 2-PAM matrix does NOT correspond to a change in 2% of the residues
    • It is the 1-PAM matrix applied twice
    • Some variations may change back to the original residue

[Figure: BLOSUM62 substitution matrix]

What is BLOSUM?
• BLOSUM matrices
  – BLOSUM: Blocks Substitution Matrix
  – Each score reflects the observed frequency of a substitution in blocks of local alignments of protein sequences
  – For example, BLOSUM62 is derived from blocks in which sequences sharing more than 62% identity are clustered together, so the counted substitutions come from sequences that are no more than 62% identical

Which scoring matrix to use?
For global alignments use PAM matrices.
• Lower PAM matrices tend to find short alignments of highly similar regions
• Higher PAM matrices find weaker, longer alignments

For local alignments use BLOSUM matrices.
• BLOSUM matrices with a HIGH number are better for similar sequences
• BLOSUM matrices with a LOW number are better for distant sequences

Sequence alignment: Methods
• Pairwise alignment
  – Finding the best alignment of two sequences
  – Often used for searching sequence databases for the most similar sequences
    • Dot Matrix Analysis
    • Dynamic Programming (DP)
    • Short word matching
• Multiple Sequence Alignment (MSA)
  – Alignment of more than two sequences
  – Often used to find conserved domains, regions or sites among many sequences
    • Dynamic programming
    • Progressive methods
    • Iterative methods
• Structural alignments
  – Alignments based on structure

What is a Dot Matrix?
• Method for comparing two sequences (amino acid or nucleotide)
• Let's align two sequences using a dot matrix
  A: A G C T A G G A
  B: G A C T A G G C
  – Sequence A is placed along the X-axis and sequence B along the Y-axis

Find the matching nucleotides
  – Starting from the first nucleotide in B, move along the first row, placing a dot in each column with a matching nucleotide

Continue to fill the table
  – Repeat the procedure for all the nucleotides in B

Completed dot matrix (sequence A across the top, sequence B down the side; "." is an empty cell)
    A G C T A G G A
G   . ● . . . ● ● .
A   ● . . . ● . . ●
C   . . ● . . . . .
T   . . . ● . . . .
A   ● . . . ● . . ●
G   . ● . . . ● ● .
G   . ● . . . ● ● .
C   . . ● . . . . .

Why are some cells empty in a dot matrix?
  – Cells corresponding to mismatching nucleotides are empty

Is there something interesting in the matrix?
  – A region of similarity is revealed by a diagonal row of dots
  – Other isolated dots represent random matches

How to interpret dot plots?
• Two similar, but not identical, sequences
• An insertion or deletion
• A tandem duplication

How to interpret dot plots?
• An inversion
• Joining sequences

Limitations of dot matrix
• Sequences with low-complexity regions (regions with little diversity) give false diagonals
• Noisy and space inefficient
• Limited to 2 sequences

Dotplot exercise
• Use the following three tools to generate dot plots for two sequences
• YASS: genomic similarity search tool
  – http://bioinfo.lifl.fr/yass/yass.php
• Lalign/Palign
  – http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=lalign
• multi-zPicture
  – http://zpicture.dcode.org/

Dynamic programming
• Breaks down the alignment problem into smaller problems
• Examples
  – Needleman-Wunsch algorithm: global alignment
  – Smith-Waterman algorithm: local alignment
• Three steps
  – Initialization
  – Scoring
  – Traceback

Where to place gaps in the alignment?
• Gaps are inserted into the alignment
• Gaps should be penalized
• Gap opening should be penalized more heavily than gap extension (or at least equally)
• Typical penalties used with BLOSUM62 (for example, the BLAST protein-search defaults)
  – Gap opening score = -11
  – Gap extension score = -1

[Figure: three alternative gapped alignments against A A A G A G A A A, with gap initiation and gap extension labelled]

Local and global pairwise alignment
• Needleman-Wunsch (global)
  – Match = +2
  – Mismatch = -1
  – Gap = -1
• Smith-Waterman (local)
  – Match = +2
  – Mismatch = -1
  – Gap = -1
  – All negative values are replaced by 0
  – Traceback starts at the highest value and ends at 0

Needleman-Wunsch matrix after initialization and the first scored cell
       -    A    G    T    T    A
  -    0   -1   -2   -3   -4   -5
  A   -1    2
  G   -2
  T   -3
  G   -4
  C   -5
  A   -6

Smith-Waterman matrix after initialization and the first scored cell
       -    A    G    T    T    A
  -    0    0    0    0    0    0
  A    0    2
  G    0
  T    0
  G    0
  C    0
  A    0

Needleman-Wunsch vs Smith-Waterman
• Sequence alignment teacher (http://melolab.org/websoftware/web/?sid=3)

Dynamic programming: example
• http://www.avatar.se/molbioinfo2001/dynprog/dynamic.html
• Scoring
  – Match = +2
  – Mismatch = -2
  – Gap = -1

Dynamic programming exercise
• Generate a scoring matrix
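
The three dynamic programming steps (initialization, scoring, traceback) can be sketched for global alignment as follows. This is a minimal illustration, not the code behind any of the tools listed in the slides. It assumes the linear scoring from the Needleman-Wunsch slide (match = +2, mismatch = -1, gap = -1) and uses that slide's two sequences; the function name `needleman_wunsch` and its parameters are hypothetical.

```python
def needleman_wunsch(a, b, match=2, mismatch=-1, gap=-1):
    """Global alignment of a (rows) and b (columns) with a linear gap penalty.

    Returns the filled score matrix and the two aligned strings.
    """
    rows, cols = len(a) + 1, len(b) + 1

    # Initialization: the first row and column hold cumulative gap penalties.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Scoring: each cell is the best of diagonal, up (gap in b), left (gap in a).
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Traceback: walk from the bottom-right cell back to the top-left cell.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score, "".join(reversed(top)), "".join(reversed(bottom))


matrix, aligned_a, aligned_b = needleman_wunsch("AGTGCA", "AGTTA")
print(matrix[-1][-1])  # score of the global alignment (bottom-right cell)
print(aligned_a)
print(aligned_b)
```

With these sequences and scores the bottom-right cell holds 6, and the traceback gives AGTGCA over AGT-TA (four matches, one mismatch, one gap). When several paths tie, this sketch prefers the diagonal move, so other equally scoring alignments may exist.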

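
For comparison with the global case, the local (Smith-Waterman) variant described in the slides differs in two ways: negative cell values are replaced by 0, and the traceback starts at the highest value and stops at 0. The sketch below assumes the same linear scoring (match = +2, mismatch = -1, gap = -1) and the same two sequences as the slide; the function name `smith_waterman` is hypothetical, and ties for the highest cell are resolved by taking the first one found.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Local alignment of a (rows) and b (columns) with a linear gap penalty.

    Returns the best local score and the two aligned fragments.
    """
    rows, cols = len(a) + 1, len(b) + 1

    # Initialization: the first row and column are all zeros.
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def pair(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Scoring: as in the global case, but negative values are replaced by 0.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + pair(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:  # keep the first highest-scoring cell
                best, best_pos = score[i][j], (i, j)

    # Traceback: start at the highest value and stop when a 0 is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + pair(i, j):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


best_score, frag_a, frag_b = smith_waterman("AGTGCA", "AGTTA")
print(best_score, frag_a, frag_b)
```

Unlike the global alignment, the result covers only the best-matching region, so the unaligned ends of both sequences are simply left out rather than padded with gaps.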