Sequence Alignment and Comparison

10/10/2016

Sequence Comparison
Abhishek Niroula
Protein Structure and Bioinformatics
Department of Experimental Medical Science
Lund University
2016-10-11

Learning goals
• What is a sequence alignment?
• What approaches are used for aligning sequences?
• How to choose the best alignment?
• What are substitution matrices?
• Which tools are available for aligning two or more sequences?
• How to use the alignment tools?
• How to interpret results obtained from the tools?

What is sequence alignment?
• A way of arranging two or more sequences to identify regions of similarity
• Shows locations of similarities and differences between the sequences
• Aligned residues are assumed to derive from the same residue in the common ancestor
• Insertions and deletions are represented by gaps in the alignment
• An 'optimal' alignment exhibits the most similarities and the fewest differences
• Examples

Protein sequence alignment
    MSTGAVLIY--TSILIKECHAMPAGNE-----
    ---GGILLFHRTHELIKESHAMANDEGGSNNS
       *  *    *  **** ***

Nucleotide sequence alignment
    attcgttggcaaatcgcccctatccggccttaa
    att---tggcggatcg-cctctacgggcc----
    ***   ****  **** **    * ****

Why align sequences?
• Reveal structural, functional and evolutionary relationships between biological sequences
• Similar sequences may have similar structure and function
• Similar sequences are likely to share a common ancestral sequence
• Annotation of new sequences
• Modelling of protein structures

Sequence alignment: Types
• Global alignment
  – Aligns every residue in every sequence, introducing gaps where needed
  – Example: Needleman-Wunsch algorithm

    L G P S S K Q T G K G S - S R I W D N
    L N - I T K S A G K G A I M R L G D A

• Local alignment
  – Finds the regions with the highest density of matches
  – Example: Smith-Waterman algorithm

    F T F T A L I L L - A V A V
    - - F T A L - L L A A V - -

    - - - - - - - T G K G H R R K S P
    R S D E L K A A G K G - - - - - -

How to find the best alignment?
• Seq1: TACGGGCAG
• Seq2: ACGGCG

    Option 1             Option 2             Option 3
    T A C G G G C A G    T A C G G G C A G    T A C G G G C A G
    - A C - G G C - G    - A C G G - C - G    - A C G - G C - G

Find the alignment score!

How to find the alignment score?
• Scoring matrices are used to assign a score to each comparison of a pair of characters
• Identities and substitutions by similar amino acids are assigned positive scores
• Mismatches, or matches that are unlikely to have been a result of evolution, are given negative scores
• Example with matches = +5 and mismatches = -5

     A  C  D  E  F  G  H  I  K
     A  C  Y  E  F  G  R  I  K
    +5 +5 -5 +5 +5 +5 -5 +5 +5

  – Total: 7 × (+5) + 2 × (-5) = +25

PAM-1 substitution matrix
[Figure: PAM-1 substitution matrix]

What is a PAM matrix?
• PAM matrices
  – PAM: Point Accepted Mutation (also expanded as Percent Accepted Mutation)
  – A PAM matrix gives the probability that a given amino acid will be replaced by each other amino acid
  – An accepted point mutation in a protein is a replacement of one amino acid by another that has been accepted by natural selection
  – Derived from global alignments of closely related sequences
  – The number after PAM (PAM40, PAM100) refers to the evolutionary distance (greater numbers mean greater distances)
  – The PAM1 matrix corresponds to the amount of evolution that changes, on average, 1% of the residues
  – The PAM2 matrix does NOT correspond to a change in 2% of the residues
    • It corresponds to applying PAM1 twice
    • Some substitutions may change back to the original residue

BLOSUM62
[Figure: BLOSUM62 substitution matrix]

What is BLOSUM?
• BLOSUM matrices
  – BLOSUM: Blocks Substitution Matrix
  – The score for each pair of residues is derived from the observed frequencies of substitutions in blocks of local alignments of protein sequences
  – For example, BLOSUM62 is derived from sequence alignments with no more than 62% identity

Which scoring matrix to use?
For global alignments use PAM matrices
• Lower PAM matrices tend to find short alignments of highly similar regions
• Higher PAM matrices will find weaker, longer alignments

For local alignments use BLOSUM matrices
• BLOSUM matrices with a HIGH number are better for similar sequences
• BLOSUM matrices with a LOW number are better for distant sequences

Sequence alignment: Methods
• Pairwise alignment
  – Finding the best alignment of two sequences
  – Often used to search sequence databases for the most similar sequences
    • Dot matrix analysis
    • Dynamic programming (DP)
    • Short word matching
• Multiple sequence alignment (MSA)
  – Alignment of more than two sequences
  – Often used to find conserved domains, regions or sites among many sequences
    • Dynamic programming
    • Progressive methods
    • Iterative methods
• Structural alignments
  – Alignments based on structure

What is a Dot Matrix?
• Method for comparing two sequences (amino acid or nucleotide)
• Let's align two sequences using a dot matrix
    A: A G C T A G G A
    B: G A C T A G G C
  – Sequence A is placed along the X-axis and sequence B along the Y-axis

Find the matching nucleotides
  – Starting from the first nucleotide in B, move along the first row, placing a dot in each column with a matching nucleotide

      A G C T A G G A     (Sequence A)
    G   ●       ● ●

Continue to fill the table
  – Repeat the procedure for all the nucleotides in B

      A G C T A G G A     (Sequence A)
    G   ●       ● ●
    A ●       ●     ●
    C     ●
    T       ●
    A ●       ●     ●
    G   ●       ● ●
    G   ●       ● ●
    C     ●
    (rows: Sequence B)

Why are some cells empty in a dot matrix?
  – Cells corresponding to mismatching nucleotides are empty

Is there something interesting in the matrix?
  – A region of similarity is revealed by a diagonal row of dots
  – Other isolated dots represent random matches

How to interpret dot plots?
[Figure panels: Two similar, but not identical, sequences; An insertion or deletion; A tandem duplication]

How to interpret dot plots?
[Figure panels: An inversion; Joining sequences]

Limitations of dot matrix
• Sequences with low-complexity regions give false diagonals
  – Low-complexity regions are sequence regions with little diversity
• Noisy and space inefficient
• Limited to 2 sequences

Dotplot exercise
• Use the following three tools to generate dot plots for two sequences
• YASS: genomic similarity search tool
  – http://bioinfo.lifl.fr/yass/yass.php
• Lalign/Palign
  – http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=lalign
• multi-zPicture
  – http://zpicture.dcode.org/

Dynamic programming
• Breaks down the alignment problem into smaller problems
• Examples
  – Needleman-Wunsch algorithm: global alignment
  – Smith-Waterman algorithm: local alignment
• Three steps
  – Initialization
  – Scoring
  – Traceback

Where to place gaps in the alignment?
• Gaps are inserted into the alignment to represent insertions and deletions
• Gaps should be penalized
• Gap opening should be penalized more heavily than gap extension (or at least equally)
• Typical gap scores used with BLOSUM62
  – Gap opening score = -11
  – Gap extension score = -1

    A A A G A G A A A    A A A G A G A A A    A A A G A G A A A -
    A A A - A - A A A    A A A - - A A A A    A A A - - - A A A A

  – Figure labels: gap initiation (first position of a gap), gap extension (each further position of the same gap)

Local and global pairwise alignment
• Needleman-Wunsch (global)
  – Match = +2
  – Mismatch = -1
  – Gap = -1
• Smith-Waterman (local)
  – Match = +2
  – Mismatch = -1
  – Gap = -1
  – All negative values are replaced by 0
  – Traceback starts at the highest value and ends at 0

Needleman-Wunsch matrix after initialization, with the first cell scored

         -   A   G   T   T   A
    -    0  -1  -2  -3  -4  -5
    A   -1   2
    G   -2
    T   -3
    G   -4
    C   -5
    A   -6

Smith-Waterman matrix after initialization, with the first cell scored

         -   A   G   T   T   A
    -    0   0   0   0   0   0
    A    0   2
    G    0
    T    0
    G    0
    C    0
    A    0

Needleman-Wunsch vs Smith-Waterman
• Sequence alignment teacher (http://melolab.org/websoftware/web/?sid=3)

Dynamic programming: example
• http://www.avatar.se/molbioinfo2001/dynprog/dynamic.html
• Scoring
  – Match = +2
  – Mismatch = -2
  – Gap = -1

Dynamic programming exercise
• Generate a scoring matrix
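
The initialization and scoring steps described for Needleman-Wunsch and Smith-Waterman can be sketched in Python as below. This is only an illustrative sketch, not the lecture's own code: the function name `fill_matrix` and its parameters are hypothetical, a single linear gap score is assumed (no separate opening and extension scores), and traceback is left out. The sequences and scores (match +2, mismatch -1, gap -1) are the ones from the matrices on the slide.

```python
def fill_matrix(seq_a, seq_b, match=2, mismatch=-1, gap=-1, local=False):
    """Fill a dynamic-programming score matrix (hypothetical helper).

    Columns follow seq_a and rows follow seq_b. With local=False the
    borders are initialised with cumulative gap scores (Needleman-Wunsch);
    with local=True the borders stay 0 and negative cells are replaced
    by 0 (Smith-Waterman). A linear gap score is assumed.
    """
    rows, cols = len(seq_b) + 1, len(seq_a) + 1
    score = [[0] * cols for _ in range(rows)]

    # Initialization: only the global variant penalises leading gaps.
    if not local:
        for j in range(1, cols):
            score[0][j] = j * gap
        for i in range(1, rows):
            score[i][0] = i * gap

    # Scoring: each cell takes the best of diagonal, up and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq_b[i - 1] == seq_a[j - 1] else mismatch
            best = max(
                score[i - 1][j - 1] + sub,  # align the two residues
                score[i - 1][j] + gap,      # gap in seq_a
                score[i][j - 1] + gap,      # gap in seq_b
            )
            score[i][j] = max(best, 0) if local else best
    return score


nw = fill_matrix("AGTTA", "AGTGCA")              # global matrix
sw = fill_matrix("AGTTA", "AGTGCA", local=True)  # local matrix

print(nw[-1][-1])                     # global score, bottom-right cell: 6
print(max(max(row) for row in sw))    # best local score: 6
```

In the global matrix the alignment score is read from the bottom-right cell. In the local matrix it is the highest value anywhere in the matrix, which is where traceback would start.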

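Returning to the dot-matrix slides, the fill procedure (one row per nucleotide of sequence B, a dot wherever it matches a nucleotide of sequence A) can also be sketched in Python. The helper names `dot_matrix`, `render` and `longest_diagonal` are hypothetical. The sketch uses no window or threshold filtering, and it prints empty cells as "." only so that the rows stay aligned.

```python
def dot_matrix(seq_a, seq_b):
    """Cell [i][j] is True when seq_b[i] matches seq_a[j]."""
    return [[b == a for a in seq_a] for b in seq_b]


def render(seq_a, seq_b):
    """Return the dot matrix as text: seq_a as columns, seq_b as rows."""
    lines = ["  " + " ".join(seq_a)]
    for b, row in zip(seq_b, dot_matrix(seq_a, seq_b)):
        cells = " ".join("●" if hit else "." for hit in row)
        lines.append(b + " " + cells)
    return "\n".join(lines)


def longest_diagonal(seq_a, seq_b):
    """Return (length, start_in_a, start_in_b) of the longest diagonal
    run of dots, using 0-based start positions."""
    best = (0, 0, 0)
    # run[i][j] = length of the diagonal run of dots ending at cell (i, j)
    run = [[0] * (len(seq_a) + 1) for _ in range(len(seq_b) + 1)]
    for i, b in enumerate(seq_b, 1):
        for j, a in enumerate(seq_a, 1):
            if a == b:
                run[i][j] = run[i - 1][j - 1] + 1
                if run[i][j] > best[0]:
                    length = run[i][j]
                    best = (length, j - length, i - length)
    return best


seq_a = "AGCTAGGA"
seq_b = "GACTAGGC"
print(render(seq_a, seq_b))
print(longest_diagonal(seq_a, seq_b))  # (5, 2, 2): the shared CTAGG
```

The longest diagonal corresponds to the region of similarity highlighted on the slides. Shorter runs and isolated dots are the random matches.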