10/10/2016
Sequence Comparison
Abhishek Niroula
Protein Structure and Bioinformatics
Department of Experimental Medical Science
Lund University
2016-10-11 1
Learning goals
• What is a sequence alignment?
• What approaches are used for aligning sequences?
• How to choose the best alignment?
• What are substitution matrices?
• Which tools are available for aligning two or more sequences?
• How to use the alignment tools?
• How to interpret results obtained from the tools?
What is sequence alignment?
• A way of arranging two or more sequences to identify regions of similarity
• Shows locations of similarities and differences between the sequences
• Aligned residues are assumed to correspond to the same residue in the sequences' common ancestor
• Insertions and deletions are represented by gaps in the alignment
• An 'optimal' alignment exhibits the most similarities and the least differences
• Examples
Protein sequence alignment
MSTGAVLIY--TSILIKECHAMPAGNE------GGILLFHRTHELIKESHAMANDEGGSNNS
* * * **** ***
Nucleotide sequence alignment
attcgttggcaaatcgcccctatccggccttaa
att---tggcggatcg-cctctacgggcc----
*** **** **** ** ******
Why align sequences?
• Reveal structural, functional and evolutionary relationships between biological sequences
• Similar sequences may have similar structure and function
• Similar sequences are likely to have a common ancestral sequence
• Annotation of new sequences
• Modelling of protein structures
Sequence alignment: Types
• Global alignment
– Aligns each residue in each sequence by introducing gaps
– Example: Needleman-Wunsch algorithm
L G P S S K Q T G K G S - S R I W D N
L N - I T K S A G K G A I M R L G D A
Sequence alignment: Types
• Local alignment
– Finds regions with the highest density of matches locally
– Example: Smith-Waterman algorithm
F T F T A L I L L - A V A V
- - F T A L - L L A A V - -
------T G K G H R R K S P R S D E L K A A G K G ------
How to find the best alignment?
• Seq1: TACGGGCAG
• Seq2: ACGGCG
Option 1
T A C G G G C A G
- A C - G G C - G
Option 2
T A C G G G C A G
- A C G G - C - G
Option 3
T A C G G G C A G
- A C G - G C - G
Find the alignment score!!!
How to find the alignment score?
• Scoring matrices are used to assign scores to each comparison of a pair of characters
• Identities and substitutions by similar amino acids are assigned positive scores
• Mismatches, or matches that are unlikely to have been a result of evolution, are given negative scores
• Example: matches +5, mismatches -5
A C D E F G H I K
A C Y E F G R I K
+5 +5 -5 +5 +5 +5 -5 +5 +5
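The per-column scoring in this example can be sketched in a few lines of Python. This is a minimal illustration that assumes an ungapped alignment and the flat +5 / -5 scheme shown on the slide; the function name `alignment_score` is made up for the example, and a real scoring matrix would give a different value for each residue pair.

```python
def alignment_score(seq_a, seq_b, match=5, mismatch=-5):
    """Score two aligned, equal-length sequences column by column (no gaps)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    # One score per column: +5 for identical residues, -5 otherwise.
    per_column = [match if a == b else mismatch for a, b in zip(seq_a, seq_b)]
    return per_column, sum(per_column)


columns, total = alignment_score("ACDEFGHIK", "ACYEFGRIK")
print(columns)  # per-column scores, in the same order as on the slide
print(total)    # alignment score = sum of the column scores
```

The alignment score is simply the sum of the column scores, which is why the choice of scoring scheme decides which alignment counts as the best one.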
PAM-1 substitution matrix
What is a PAM matrix?
• PAM matrices
– PAM: Point Accepted Mutation (also expanded as Percent Accepted Mutation)
– PAM gives the probability that a given amino acid will be replaced by any other amino acid
– An accepted point mutation in a protein is a replacement of one amino acid by another, accepted by natural selection
– Derived from global alignments of closely related sequences
– The numbers with the matrix (PAM40, PAM100) refer to the evolutionary distance (greater numbers mean greater distances)
– A 1-PAM matrix refers to the amount of evolution that would change 1% of the residues/bases (on average)
– A 2-PAM matrix does NOT refer to a change in 2% of the residues
  • It refers to 1-PAM applied twice
  • Some variations may change back to the original residue
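Why two PAM steps change slightly less than 2% of the residues can be checked with a toy calculation. The sketch below assumes a deliberately simplified model in which every amino acid is replaced with probability 1% and all 19 alternatives are equally likely; the real PAM1 matrix is not uniform, so the numbers are only illustrative, and the names `toy_pam1` and `matmul` are invented for the example.

```python
def toy_pam1(n_states=20, change=0.01):
    """Toy transition matrix: 1% chance of change, spread evenly over the other states."""
    stay = 1.0 - change
    move = change / (n_states - 1)
    return [[stay if i == j else move for j in range(n_states)]
            for i in range(n_states)]


def matmul(a, b):
    """Plain matrix product for square lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


pam1 = toy_pam1()
pam2 = matmul(pam1, pam1)            # "1-PAM applied twice"
changed_after_two_steps = 1.0 - pam2[0][0]
print(f"{changed_after_two_steps:.6f}")  # slightly below 0.02
```

The shortfall from 2% comes from residues that mutate in the first step and mutate back in the second, which is the back-mutation effect named in the last bullet.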
BLOSUM62
What is BLOSUM?
• BLOSUM matrices
– BLOSUM: Blocks Substitution Matrix
– Each score is derived from the observed frequencies of substitutions in blocks of local alignments of protein sequences
– For example, BLOSUM62 is derived from sequence alignments with no more than 62% identity
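Scores of this kind are usually expressed as log-odds values: the observed frequency of a residue pair is compared with the frequency expected by chance. The sketch below assumes made-up frequencies and a scale factor of 2 (half-bit units, as conventionally used for BLOSUM62); it illustrates the principle only and does not reproduce any entry of the real matrix.

```python
import math


def log_odds(observed_pair_freq, freq_a, freq_b, scale=2.0):
    """Rounded log-odds score: positive if the pair is seen more often than chance."""
    expected = freq_a * freq_b          # chance of pairing the two residues at random
    return round(scale * math.log2(observed_pair_freq / expected))


# Hypothetical frequencies, chosen only to give round numbers.
frequent_pair = log_odds(0.02, 0.05, 0.05)      # observed 8x more often than expected
rare_pair = log_odds(0.000625, 0.05, 0.05)      # observed 4x less often than expected
print(frequent_pair, rare_pair)
```

Pairs seen more often than expected get positive scores and pairs seen less often get negative ones, which matches the sign convention described for scoring matrices earlier in the lecture.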
Which scoring matrix to use?
For global alignments use PAM matrices
• Lower PAM matrices tend to find short alignments of highly similar regions
• Higher PAM matrices will find weaker, longer alignments
For local alignments use BLOSUM matrices
• BLOSUM matrices with a HIGH number are better for similar sequences
• BLOSUM matrices with a LOW number are better for distant sequences
Sequence alignment: Methods
• Pairwise alignment
– Finding best alignment of two sequences
– Often used for searching sequences with highest similarity in the sequence databases
  • Dot Matrix Analysis
  • Dynamic Programming (DP)
  • Short word matching
• Multiple Sequence Alignment (MSA)
– Alignment of more than two sequences
– Often used to find conserved domains, regions or sites among many sequences
  • Dynamic programming
  • Progressive methods
  • Iterative methods
• Structural alignments
– Alignments based on structure
What is a Dot Matrix?
• Method for comparing two sequences (amino acid or nucleotide)
• Let's align two sequences using a dot matrix
A: A G C T A G G A
B: G A C T A G G C
– Sequence A is organized on the X-axis and sequence B on the Y-axis
Find the matching nucleotides
– Starting from the first nucleotide in B, move along the first row placing a dot in columns with matching nucleotide
Sequence A in columns, sequence B in rows (● = match, . = empty cell)
  A G C T A G G A
G . ● . . . ● ● .
A
C
T
A
G
G
C
Continue to fill the table
– Repeat the procedure for all the nucleotides in B
Sequence A in columns, sequence B in rows (● = match, . = empty cell)
  A G C T A G G A
G . ● . . . ● ● .
A ● . . . ● . . ●
C
T
A
G
G
C
Why are some cells empty in a dot matrix?
– Cells corresponding to mismatching nucleotides are empty
Sequence A in columns, sequence B in rows (● = match, . = empty cell)
  A G C T A G G A
G . ● . . . ● ● .
A ● . . . ● . . ●
C . . ● . . . . .
T . . . ● . . . .
A ● . . . ● . . ●
G . ● . . . ● ● .
G . ● . . . ● ● .
C . . ● . . . . .
Is there something interesting in the matrix?
– A region of similarity is revealed by a diagonal row of dots
– Other isolated dots represent random matches
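The dot matrix for the two example sequences can be generated and searched for its longest diagonal with a short script. This is a minimal sketch; the function names `dot_matrix` and `longest_diagonal` are invented for the illustration, and real dot-plot tools usually add a sliding window and a threshold to suppress random matches.

```python
def dot_matrix(seq_a, seq_b):
    """Rows follow seq_b (Y-axis), columns follow seq_a (X-axis); True marks a match."""
    return [[a == b for a in seq_a] for b in seq_b]


def longest_diagonal(seq_a, seq_b):
    """Return the longest run of consecutive matches along any diagonal."""
    matrix = dot_matrix(seq_a, seq_b)
    best = ""
    for row in range(len(seq_b)):
        for col in range(len(seq_a)):
            length = 0
            # Follow the diagonal down and to the right while the cells match.
            while (row + length < len(seq_b) and col + length < len(seq_a)
                   and matrix[row + length][col + length]):
                length += 1
            if length > len(best):
                best = seq_a[col:col + length]
    return best


seq_a = "AGCTAGGA"
seq_b = "GACTAGGC"
for base, row in zip(seq_b, dot_matrix(seq_a, seq_b)):
    print(base, " ".join("●" if hit else "." for hit in row))
print(longest_diagonal(seq_a, seq_b))
```

The longest diagonal corresponds to the shared stretch of the two sequences; the remaining dots are the isolated random matches.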
How to interpret dot plots?
• Two similar, but not identical, sequences
• An insertion or deletion
• A tandem duplication
How to interpret dot plots?
• An inversion
• Joining sequences
Limitations of dot matrix
• Sequences with low-complexity regions give false diagonals
– Low-complexity regions are sequence regions with little diversity
• Noisy and space inefficient
• Limited to 2 sequences
Dotplot exercise
• Use the following three tools to generate dot plots for two sequences
• YASS: genomic similarity search tool
– http://bioinfo.lifl.fr/yass/yass.php
• Lalign/Palign
– http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=lalign
• multi-zPicture
– http://zpicture.dcode.org/
Dynamic programming
• Breaks down the alignment problem into smaller problems
• Examples
– Needleman-Wunsch algorithm: global alignment
– Smith-Waterman algorithm: local alignment
• Three steps
– Initialization
– Scoring
– Traceback
Where to place gaps in the alignment?
• Insertion of gaps in the alignment
• Gaps should be penalized
• Gap opening should be penalized more than gap extension (or at least equally)
• Typical penalties used with BLOSUM62
– Gap opening score = -11
– Gap extension score = -1
[Figure: three alignments of AAAGAGAAA with gaps placed differently, labelled gap initiation and gap extension]
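The effect of penalizing gap opening more than gap extension can be seen with the two scores from this slide. The sketch below assumes one common convention, in which the opening score is charged for the first gap position and the extension score for each further position; some programs instead charge opening plus extension for every position, so absolute values differ between tools. The function name is invented for the example.

```python
def affine_gap_score(gap_length, gap_open=-11, gap_extend=-1):
    """Score of one contiguous gap: opening for the first position, extension for the rest."""
    if gap_length <= 0:
        return 0
    return gap_open + (gap_length - 1) * gap_extend


one_long_gap = affine_gap_score(3)          # a single gap spanning 3 positions
three_short_gaps = 3 * affine_gap_score(1)  # three separate gaps of 1 position each
print(one_long_gap, three_short_gaps)
```

With these scores one long gap is far cheaper than several scattered gaps of the same total length, so the aligner prefers to explain a difference by a single insertion or deletion event.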
Local and global pairwise alignment
• Needleman-Wunsch (global)
– Match = +2
– Mismatch = -1
– Gap = -1
• Smith-Waterman (local)
– Match = +2
– Mismatch = -1
– Gap = -1
– All negative values are replaced by 0
– Traceback starts at the highest value and ends at 0

Needleman-Wunsch matrix after initialization (first cell scored)
    -   A   G   T   T   A
-   0  -1  -2  -3  -4  -5
A  -1   2
G  -2
T  -3
G  -4
C  -5
A  -6

Smith-Waterman matrix after initialization (first cell scored)
    -   A   G   T   T   A
-   0   0   0   0   0   0
A   0   2
G   0
T   0
G   0
C   0
A   0
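The three steps (initialization, scoring, traceback) can be sketched for the two sequences and the scoring scheme on this slide. This is an illustrative implementation with a linear gap penalty, not the code of any particular tool; the function names are invented, and the traceback shown is for the global case only and breaks ties in the order diagonal, up, left.

```python
MATCH, MISMATCH, GAP = 2, -1, -1


def score(a, b):
    return MATCH if a == b else MISMATCH


def fill_matrix(cols, rows, local=False):
    """Initialization and scoring. local=False: Needleman-Wunsch; local=True: Smith-Waterman."""
    m = [[0] * (len(cols) + 1) for _ in range(len(rows) + 1)]
    if not local:  # global alignment: leading gaps are penalized
        for j in range(1, len(cols) + 1):
            m[0][j] = j * GAP
        for i in range(1, len(rows) + 1):
            m[i][0] = i * GAP
    for i in range(1, len(rows) + 1):
        for j in range(1, len(cols) + 1):
            best = max(m[i - 1][j - 1] + score(rows[i - 1], cols[j - 1]),  # diagonal
                       m[i - 1][j] + GAP,                                  # from above
                       m[i][j - 1] + GAP)                                  # from the left
            m[i][j] = max(best, 0) if local else best  # local: negatives become 0
    return m


def traceback_global(m, cols, rows):
    """Walk back from the bottom-right cell to the top-left cell."""
    i, j = len(rows), len(cols)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and m[i][j] == m[i - 1][j - 1] + score(rows[i - 1], cols[j - 1]):
            top.append(cols[j - 1])
            bottom.append(rows[i - 1])
            i, j = i - 1, j - 1
        elif i > 0 and m[i][j] == m[i - 1][j] + GAP:
            top.append("-")
            bottom.append(rows[i - 1])
            i -= 1
        else:
            top.append(cols[j - 1])
            bottom.append("-")
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


cols, rows = "AGTTA", "AGTGCA"
nw = fill_matrix(cols, rows)
sw = fill_matrix(cols, rows, local=True)
print(nw[-1][-1])                         # global alignment score
print(traceback_global(nw, cols, rows))   # one optimal global alignment
print(max(max(r) for r in sw))            # best local alignment score
```

The only differences between the two fills are the first row and column and the clamping of negative values to 0; a local traceback would start at the highest cell of the second matrix and stop at the first 0.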
Needleman-Wunsch vs Smith-Waterman
Sequence alignment teacher (http://melolab.org/websoftware/web/?sid=3)
Dynamic programming: example
• http://www.avatar.se/molbioinfo2001/dynprog/dynamic.html
• Scoring
– Match = +2
– Mismatch = -2
– Gap = -1
Dynamic programming exercise
• Generate a scoring matrix for nucleotides (A, C, G, and T)
• Align two sequences using dynamic programming and the scoring matrix you just generated
• Align two sequences using the following tools
– EMBOSS Needle
  • http://www.ebi.ac.uk/Tools/psa/emboss_needle/
– EMBOSS Water
  • http://www.ebi.ac.uk/Tools/psa/emboss_water/
DP exercise: Needle vs Water
MSA: What and Why?
• A multiple sequence alignment (MSA) is an alignment of three or more sequences
• Why MSA?
– To identify patterns of conservation across more than 2 sequences
– To characterize protein families and generate profiles of protein families
– To infer relationships within and among gene families
– To predict secondary and tertiary structures of new sequences
– To perform phylogenetic studies
MSA application in variation interpretation
• Interpreting the impacts of genetic variants
• All tools for variation interpretation use MSAs of protein and nucleotide sequences
• http://structure.bmc.lu.se/services.php
• PON-P2
• PON-BTK
• PON-MMR2
• PON-mt-tRNA
• PON-Diso
• PON-Sol
• PON-PS
• PPSC
Are there tools to perform MSA?
Do you remember dynamic programming?
2 sequences 3 sequences
http://ai.stanford.edu/~serafim/CS262_2005/LectureNotes/Lecture17.pdf
How to align three sequences using DP?
• Dynamic programming
– Align each pair of sequences
– Sum scores for each pair at each position
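Summing the pairwise scores at each position is often called the sum-of-pairs score. The sketch below scores an already aligned toy set of three sequences; the sequences are hypothetical, the match, mismatch and gap values are borrowed from the pairwise example earlier in the lecture, and scoring two facing gaps as 0 is an assumption of this illustration.

```python
from itertools import combinations


def pair_score(a, b, match=2, mismatch=-1, gap=-1):
    if a == "-" and b == "-":
        return 0  # assumption: two gaps facing each other are not scored
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(aligned):
    """Score an alignment column by column, summing over every pair of sequences."""
    column_scores = [sum(pair_score(a, b) for a, b in combinations(column, 2))
                     for column in zip(*aligned)]
    return column_scores, sum(column_scores)


msa = ["AGT-A", "AGTGA", "A-TGC"]  # hypothetical toy alignment
column_scores, total = sum_of_pairs(msa)
print(column_scores)
print(total)
```

Finding the alignment that maximizes this score by dynamic programming requires a matrix with one dimension per sequence, which is why exact DP becomes impractical beyond a few sequences.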
MSA: Progressive alignment
• Progressive sequence alignment
– Hierarchical or tree based method
– E.g. ClustalW, T-Coffee, Clustal Omega
• Basic steps
– Calculate pairwise distances based on pairwise alignments between the sequences
– Build a guide tree, which is an inferred phylogeny for the sequences
– Align the sequences progressively, following the guide tree
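The first two steps can be sketched on toy data. The example below assumes four hypothetical sequences of equal length, so that the fraction of differing positions can stand in for a distance derived from pairwise alignments, and it builds the guide tree by average-linkage clustering (UPGMA-style), which is only one of several ways a tool might do it.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of positions that differ between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def cluster_distance(clusters, seqs, c1, c2):
    """Average distance over all member pairs of two clusters."""
    pairs = [(m, n) for m in clusters[c1] for n in clusters[c2]]
    return sum(p_distance(seqs[m], seqs[n]) for m, n in pairs) / len(pairs)


def guide_tree_order(seqs):
    """Return the order in which clusters are joined, closest pair first."""
    clusters = {name: [name] for name in seqs}
    joins = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(clusters, seqs, *pair))
        joins.append((c1, c2))
        clusters[f"({c1},{c2})"] = clusters.pop(c1) + clusters.pop(c2)
    return joins


# Hypothetical toy sequences, already of equal length.
seqs = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "ACGAACTA", "s4": "TCGAACTT"}
joins = guide_tree_order(seqs)
print(joins)
```

The join order is the guide tree read from the leaves upward: the progressive step would align the first joined pair, then the next, and finally align the resulting groups with each other.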
Progressive MSA
[Figure: progressive alignment of sequences 1-5 following the guide tree]
MSA: Iterative alignment
• Iterative sequence alignment
– Improved progressive alignment
– Aligns the sequences repeatedly
– E.g. MUSCLE
Iterative MSA
• Follows 3 steps
Progressive alignment
Second progressive alignment
Refinement
Phylogenetic tree
• A phylogenetic tree shows evolutionary relationships between the sequences
• Types:
– Rooted
  • Nodes represent the most recent common ancestor
  • Edge lengths represent time estimates
– Unrooted
  • No ancestry and time estimates
• Algorithms to generate a phylogenetic tree
– Neighbor-joining
– Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
– Maximum parsimony
Neighbor joining method
http://en.wikipedia.org/wiki/Neighbor_joining
MSA exercise
• Align the protein sequences SET 1 and SET 2 using MSA tools and compare the alignments
• Clustal Omega
– http://www.ebi.ac.uk/Tools/msa/clustalo/
• MUSCLE
– http://www.ebi.ac.uk/Tools/msa/muscle/
What to align: DNA or protein sequence?
If ORF exists, then always align at protein level
• Many mismatches in DNA sequences are synonymous
• DNA sequences contain non-coding regions, which should be avoided in homology searching
• Matches are more reliable in protein sequence
– Probability to occur randomly at any position in a sequence
  • Amino acids: 1/20 = 0.05
  • Nucleotides: 1/4 = 0.25
• Searching at protein level: in case of frameshifts, the alignment score for the protein sequence may be very low even though the DNA sequences are similar
Original reading frame:
ACT TTT CAT GGG ...
Thr Phe His Gly ...
After a single-base insertion:
ACT TTT TCA TGG G..
Thr Phe Ser Trp
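The frameshift example can be reproduced with a few lines of Python. The codon table below is deliberately partial and contains only the six codons that occur in the slide's example; the helper name `translate` is invented for the illustration.

```python
# Partial codon table: only the codons needed for this example.
CODONS = {"ACT": "Thr", "TTT": "Phe", "CAT": "His",
          "GGG": "Gly", "TCA": "Ser", "TGG": "Trp"}


def translate(dna):
    """Translate complete codons from position 0; a trailing partial codon is ignored."""
    usable = len(dna) - len(dna) % 3
    return [CODONS[dna[i:i + 3]] for i in range(0, usable, 3)]


original = "ACTTTTCATGGG"
shifted = original[:6] + "T" + original[6:]   # insert one T after the second codon

print(translate(original))
print(translate(shifted))
```

The two DNA sequences differ by a single inserted base, yet every amino acid downstream of the insertion changes, which is why a protein-level search can miss a match that is obvious at the DNA level.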
Searching bioinformatics databases
Learning goals
• How to find information using Keywords?
• How to find information using Sequences?
• How to find correct gene names?
Search strategy
• Keyword search
– Find information related to specific keywords
– Each bioinformatics database has its own search tool
– Some search tools have a wide spectrum: they access multiple databases and gather the results together
– E.g. Gquery, EBI search
• Sequence search
– Use a sequence of interest to find more information about the sequence
– E.g. BLAST, FASTA
Keyword search
• Find information related to specific keywords
• Gquery
– A central search tool to find information in NCBI databases
– Searches a large number of NCBI databases and shows the results on one page
– http://www.ncbi.nlm.nih.gov/gquery
• EBI search
– Search tool to find information from databases developed, managed and hosted by EMBL-EBI
– http://www.ebi.ac.uk/services
Gquery
EBI search
Limitations
• Synonyms
• Misspellings
• Old and new names/terms
[Figure: numbers of search results for the terms HIV 1 / HIV-1 and ELA2 / ELANE in PubMed and ClinVar; values shown: 110, 8, 64, 59, 20]
• NOTES:
– Use different synonyms and read literature to find more appropriate keywords
– Use boolean operators to combine different keywords
– Do not expect to find all the information using keyword search alone
– Note the database version or the version of entries in the databases you used
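Combining synonyms with boolean operators can be sketched as simple string building. The helper names and the second keyword "variant" are hypothetical, and the exact query syntax accepted differs between databases, so the resulting string should be checked against the help pages of the search tool being used.

```python
def any_of(terms):
    """Combine synonyms with OR, quoting each term so multi-word names stay together."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


def all_of(parts):
    """Require every part of the query with AND."""
    return " AND ".join(parts)


hiv_query = any_of(["HIV 1", "HIV-1"])
gene_query = all_of([any_of(["ELA2", "ELANE"]), '"variant"'])  # "variant" is a made-up keyword
print(hiv_query)
print(gene_query)
```

Grouping old and new gene symbols with OR is one way to reduce the risk of missing records that were annotated under a previous name.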
Gene nomenclature
• HUGO Gene Nomenclature Committee (HGNC)
– Assigns standardized nomenclature to human genes
– Each symbol is unique and each gene is given only one name
• Species specific nomenclature committees
– Mouse Genome Informatics Database
  • http://www.informatics.jax.org/mgihome/nomen/
– Rat Genome Database
  • http://rgd.mcw.edu/nomen/nomen.shtml
HGNC symbol report
• Approved symbol
• Approved name
• Synonyms
– Terms used in literature to indicate the gene
– HGNC, Ensembl, Entrez Gene, OMIM
• Previous symbols and names
– Previous HGNC approved symbol
• NOTE: HGNC does not approve protein names. Usually genes and proteins have the same name and gene names are written in italics.
HGNC search
Keyword search
• Exercise