10/10/2016

Sequence Comparison

Abhishek Niroula Protein Structure and Department of Experimental Medical Science Lund University

2016-10-11 1

Learning goals

• What is a sequence alignment?

• What approaches are used for aligning sequences?

• How to choose the best alignment?

• What are substitution matrices?

• Which tools are available for aligning two or more sequences?

• How to use the alignment tools?

• How to interpret results obtained from the tools?


What is sequence alignment?

• A way of arranging two or more sequences to identify regions of similarity

• Shows locations of similarities and differences between the sequences

• Aligned residues are assumed to derive from the same residue in the common ancestor of the sequences

• Insertions and deletions are represented by gaps in the alignment

• An 'optimal' alignment exhibits the most similarities and the least differences

• Examples

Protein sequence alignment

MSTGAVLIY--TSILIKECHAMPAGNE-----
---GGILLFHRTHELIKESHAMANDEGGSNNS
   *  *    *  **** ***

Nucleotide sequence alignment

attcgttggcaaatcgcccctatccggccttaa
att---tggcggatcg-cctctacgggcc----
***   ****  **** **    * ****


Why align sequences?

• Reveal structural, functional and evolutionary relationships between biological sequences

• Similar sequences may have similar structure and function

• Similar sequences are likely to share a common ancestral sequence

• Annotation of new sequences

• Modelling of protein structures


Sequence alignment: Types

• Global alignment

– Aligns every residue in each sequence over the full length, introducing gaps where needed
– Example: Needleman-Wunsch algorithm

L G P S S K Q T G K G S - S R I W D N
L N - I T K S A G K G A I M R L G D A


• Local alignment

– Finds regions with the highest density of matches locally
– Example: Smith-Waterman algorithm

F T F T A L I L L - A V A V

- - F T A L - L L A A V - -

------T G K G H R R K S P R S D E L K A A G K G ------


How to find the best alignment?

• Seq1: TACGGGCAG
• Seq2: ACGGCG

Option 1:
T A C G G G C A G
- A C - G G C - G

Option 2:
T A C G G G C A G
- A C G G - C - G

Option 3:
T A C G G G C A G
- A C G - G C - G

Find the alignment score!


How to find the alignment score?

• Scoring matrices are used to assign scores to each comparison of a pair of characters
• Identities and substitutions by similar amino acids are assigned positive scores
• Mismatches, or matches that are unlikely to have been a result of evolution, are given negative scores

Matches: +5
Mismatches: -5

A  C  D  E  F  G  H  I  K
A  C  Y  E  F  G  R  I  K
+5 +5 -5 +5 +5 +5 -5 +5 +5
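The per-position scoring in this example can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `simple_score` is made up for illustration, and it assumes two ungapped sequences of equal length with a flat +5 / -5 scheme rather than a real substitution matrix.

```python
def simple_score(seq_a, seq_b, match=5, mismatch=-5):
    """Score two equal-length, ungapped sequences position by position."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    # One score per aligned pair of characters
    return [match if a == b else mismatch for a, b in zip(seq_a, seq_b)]


per_position = simple_score("ACDEFGHIK", "ACYEFGRIK")
total = sum(per_position)
print(per_position)  # one entry per column of the alignment
print(total)         # the alignment score is the sum over all columns
```

Seven matches and two mismatches give 7 x 5 - 2 x 5 = 25 for the pair of sequences on the slide.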


PAM-1 substitution matrix


What is a PAM matrix?

• PAM matrices
– PAM - Point Accepted Mutation
– A PAM matrix gives the probability that a given amino acid will be replaced by any other amino acid
– An accepted point mutation in a protein is a replacement of one amino acid by another, accepted by natural selection
– Derived from global alignments of closely related sequences
– The number attached to the matrix (PAM40, PAM100) refers to the evolutionary distance (greater numbers mean greater distances)
– The 1-PAM matrix refers to the amount of evolution that would change 1% of the residues/bases (on average)
– The 2-PAM matrix does NOT refer to a change in 2% of the residues
  • It refers to applying 1-PAM twice
  • Some variations may change back to the original residue
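Why two PAM units change slightly less than 2% of the residues can be seen in a toy calculation. The sketch below assumes a hypothetical two-letter alphabet (not the real 20 x 20 PAM1 matrix) in which each residue changes with probability 0.01 per PAM unit; applying the matrix twice corresponds to multiplying it by itself.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Hypothetical two-letter alphabet: a residue stays the same with
# probability 0.99 and changes with probability 0.01 per PAM unit.
pam1 = [[0.99, 0.01],
        [0.01, 0.99]]

# Two PAM units = applying the 1-PAM matrix twice
pam2 = mat_mul(pam1, pam1)

# Fraction of residues that differ from the original after two units
changed_after_2 = 1 - pam2[0][0]
print(changed_after_2)
```

The result is about 0.0198 rather than 0.02, because a small fraction of residues changes and then changes back.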


BLOSUM62


What is BLOSUM?

• BLOSUM matrices
– BLOSUM - Blocks Substitution Matrix

– Each score is derived from the observed frequencies of substitutions in blocks of local alignments of protein sequences.

– For example, BLOSUM62 is derived from sequence alignments with no more than 62% identity.


Which scoring matrix to use?

For global alignments use PAM matrices.
• Lower PAM matrices tend to find short alignments of highly similar regions
• Higher PAM matrices will find weaker, longer alignments

For local alignments use BLOSUM matrices.
• BLOSUM matrices with a HIGH number are better for similar sequences
• BLOSUM matrices with a LOW number are better for distant sequences


Sequence alignment: Methods

• Pairwise alignment
– Finding the best alignment of two sequences
– Often used for searching sequence databases for the sequences with the highest similarity
  • Dot Matrix Analysis
  • Dynamic Programming (DP)
  • Short word matching

• Multiple Sequence Alignment (MSA)
– Alignment of more than two sequences
– Often used to find conserved domains, regions or sites among many sequences
  • Dynamic programming
  • Progressive methods
  • Iterative methods

• Structural alignments
– Alignments based on structure


What is a Dot Matrix?

• Method for comparing two sequences (amino acid or nucleotide)

• Let's align two sequences using a dot matrix

A: A G C T A G G A

B: G A C T A G G C

– Sequence A is organized along the X-axis and sequence B along the Y-axis


Find the matching nucleotides

– Starting from the first nucleotide in B, move along the first row, placing a dot in each column with a matching nucleotide

    A G C T A G G A
G     ●       ● ●
A
C
T
A
G
G
C

(columns: Sequence A, rows: Sequence B)

Continue to fill the table

– Repeat the procedure for all the nucleotides in B

    A G C T A G G A
G     ●       ● ●
A   ●       ●     ●
C       ●
T         ●
A   ●       ●     ●
G     ●       ● ●
G     ●       ● ●
C       ●

Why are some cells empty in a dot matrix?

– Cells corresponding to mismatching nucleotides are empty

Is there something interesting in the matrix?

– A region of similarity is revealed by a diagonal row of dots
– Other isolated dots represent random matches
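The procedure for filling a dot matrix can be sketched in Python as follows. The sequences are the two from the slides as read from the row and column labels of the matrix (A = AGCTAGGA, B = GACTAGGC); the function names `dot_matrix` and `longest_diagonal` are illustrative and do not come from any particular tool.

```python
def dot_matrix(seq_a, seq_b):
    """One row per residue of seq_b; True where it matches a residue of seq_a."""
    return [[a == b for a in seq_a] for b in seq_b]


def longest_diagonal(matrix):
    """Length of the longest unbroken run of dots along any diagonal."""
    rows, cols = len(matrix), len(matrix[0])
    # run[i + 1][j + 1] = length of the diagonal run ending at cell (i, j)
    run = [[0] * (cols + 1) for _ in range(rows + 1)]
    best = 0
    for i in range(rows):
        for j in range(cols):
            if matrix[i][j]:
                run[i + 1][j + 1] = run[i][j] + 1
                best = max(best, run[i + 1][j + 1])
    return best


seq_a = "AGCTAGGA"  # X-axis
seq_b = "GACTAGGC"  # Y-axis
matrix = dot_matrix(seq_a, seq_b)

print("  " + " ".join(seq_a))
for residue, row in zip(seq_b, matrix):
    print(residue + " " + " ".join("●" if hit else " " for hit in row))

print(longest_diagonal(matrix))
```

The longest diagonal corresponds to the shared segment CTAGG; the remaining dots are the isolated random matches mentioned above.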


How to interpret dot plots?

[Figure: three example dot plots. Panels: two similar, but not identical, sequences; an insertion or deletion; a tandem duplication]


[Figure: two example dot plots. Panels: an inversion; joining sequences]


Limitations of dot matrix

• Sequences with low-complexity regions give false diagonals
– Low-complexity regions are sequence regions with little diversity

• Noisy and space inefficient

• Limited to 2 sequences


Dotplot exercise

• Use the following three tools to generate dot plots for two sequences

• YASS: genomic similarity search tool
– http://bioinfo.lifl.fr/yass/yass.php
• Lalign/Palign
– http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=lalign
• multi-zPicture
– http://zpicture.dcode.org/


Dynamic programming

• Breaks down the alignment problem into smaller problems
• Examples
– Needleman-Wunsch algorithm: global alignment
– Smith-Waterman algorithm: local alignment

• Three steps
– Initialization
– Scoring
– Traceback


Where to place gaps in the alignment?

• Gaps are inserted into the alignment to represent insertions and deletions
• Gaps should be penalized
• Gap opening should be penalized more heavily than gap extension (or at least equally)
• Penalties commonly used together with BLOSUM62
– Gap opening score = -11
– Gap extension score = -1

[Figure: three alternative gapped alignments against the sequence AAAGAGAAA, with gap initiation and gap extension positions labelled]
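The effect of penalizing gap opening more heavily than gap extension can be illustrated with a small calculation. The sketch below assumes one common affine convention, in which the first position of a gap receives the opening score and every further position receives the extension score; tools differ in exactly how they combine the two values, so treat the function `gap_score` as illustrative only.

```python
def gap_score(length, gap_open=-11, gap_extend=-1):
    """Score of one contiguous gap under the assumed affine convention:
    the first gap position costs gap_open, each further one gap_extend."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


one_long_gap = gap_score(3)          # a single gap of three positions
three_short_gaps = 3 * gap_score(1)  # three separate gaps of one position
print(one_long_gap, three_short_gaps)
```

With these values one gap of length three scores -13, while three separate single-position gaps score -33, so the aligner prefers to group gap positions together.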


Local and global pairwise alignment

• Needleman-Wunsch (global)
– Match = +2
– Mismatch = -1
– Gap = -1

     -   A   G   T   T   A
-    0  -1  -2  -3  -4  -5
A   -1   2
G   -2
T   -3
G   -4
C   -5
A   -6

• Smith-Waterman (local)
– Match = +2
– Mismatch = -1
– Gap = -1
– All negative values are replaced by 0
– Traceback starts at the highest value and ends at 0

     -   A   G   T   T   A
-    0   0   0   0   0   0
A    0   2
G    0
T    0
G    0
C    0
A    0
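The initialization and scoring steps for both matrices can be sketched in Python with the parameters given above (match +2, mismatch -1, gap -1) and the two sequences from the matrix labels (AGTTA along the columns, AGTGCA along the rows). The function name `fill_matrix` is made up for this sketch, a single linear gap score is assumed, and the traceback step is left out for brevity.

```python
def fill_matrix(seq_a, seq_b, match=2, mismatch=-1, gap=-1, local=False):
    """Fill a DP matrix: seq_a along the columns, seq_b along the rows.
    local=False gives Needleman-Wunsch, local=True gives Smith-Waterman."""
    rows, cols = len(seq_b) + 1, len(seq_a) + 1
    H = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalized in the first row/column
        for j in range(1, cols):
            H[0][j] = j * gap
        for i in range(1, rows):
            H[i][0] = i * gap
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq_b[i - 1] == seq_a[j - 1] else mismatch
            best = max(H[i - 1][j - 1] + s,  # align the two residues
                       H[i - 1][j] + gap,    # gap in seq_a
                       H[i][j - 1] + gap)    # gap in seq_b
            # Local alignment: negative values are replaced by 0
            H[i][j] = max(best, 0) if local else best
    return H


seq_a, seq_b = "AGTTA", "AGTGCA"
nw = fill_matrix(seq_a, seq_b)
sw = fill_matrix(seq_a, seq_b, local=True)

nw_score = nw[-1][-1]                    # global score: bottom-right cell
sw_score = max(max(row) for row in sw)   # local score: highest cell anywhere

for row in nw:
    print(row)
print(nw_score, sw_score)
```

The global score is read from the bottom-right cell, whereas the local score is the highest value anywhere in the matrix, which is also where the Smith-Waterman traceback would start.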


Needleman-Wunsch vs Smith-Waterman

Sequence alignment teacher (http://melolab.org/websoftware/web/?sid=3)


Dynamic programming: example

• http://www.avatar.se/molbioinfo2001/dynprog/dynamic.html
• Scoring
– Match = +2
– Mismatch = -2
– Gap = -1


Dynamic programming exercise

• Generate a scoring matrix for nucleotides (A, C, G, and T)
• Align two sequences using dynamic programming and the scoring matrix you just generated
• Align two sequences using the following tools
– EMBOSS Needle
  • http://www.ebi.ac.uk/Tools/psa/emboss_needle/
– EMBOSS Water
  • http://www.ebi.ac.uk/Tools/psa/emboss_water/


DP exercise: Needle vs Water


MSA: What and Why?

• A multiple sequence alignment (MSA) is an alignment of three or more sequences

• Why MSA?
– To identify patterns of conservation across more than 2 sequences
– To characterize protein families and generate profiles of protein families
– To infer relationships within and among gene families
– To predict secondary and tertiary structures of new sequences
– To perform phylogenetic studies


MSA application in variation interpretation

• Interpreting the impacts of genetic variants

• All tools for variation interpretation use MSAs of protein and nucleotide sequences

• http://structure.bmc.lu.se/services.php

• PON-P2
• PON-BTK
• PON-MMR2
• PON-mt-tRNA
• PON-Diso
• PON-Sol
• PON-PS
• PPSC


Are there tools to perform MSA?


Do you remember dynamic programming?

[Figure: dynamic programming for 2 sequences and for 3 sequences]

Source: http://ai.stanford.edu/~serafim/CS262_2005/LectureNotes/Lecture17.pdf


How to align three sequences using DP?

• Dynamic programming
– Align each pair of sequences
– Sum scores for each pair at each position
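Summing the pairwise scores at each position is often called sum-of-pairs scoring. A minimal sketch is shown below; the three aligned sequences are hypothetical, the scoring values (match +2, mismatch -1, gap -1) are borrowed from the pairwise example, and scoring a gap against a gap as 0 is an assumption of this sketch.

```python
from itertools import combinations


def pair_score(a, b, match=2, mismatch=-1, gap=-1):
    """Score one pair of characters from the same alignment column."""
    if a == "-" and b == "-":
        return 0  # assumption: gap against gap is neutral
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(alignment):
    """For every column, sum the scores of all pairs of sequences."""
    return [sum(pair_score(a, b) for a, b in combinations(column, 2))
            for column in zip(*alignment)]


# Hypothetical alignment of three sequences (all rows have equal length)
alignment = ["AGT-A",
             "AGTTA",
             "A-TTA"]

column_scores = sum_of_pairs(alignment)
print(column_scores, sum(column_scores))
```

Fully conserved columns receive the highest score, while columns containing a gap are pulled down by the two pairs that involve the gap.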


MSA: Progressive alignment

• Progressive sequence alignment
– Hierarchical or tree-based method
– E.g. ClustalW, T-Coffee, Clustal Omega

• Basic steps
– Calculate pairwise distances based on pairwise alignments between the sequences
– Build a guide tree, which is an inferred phylogeny for the sequences
– Align the sequences progressively, following the order given by the guide tree
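The first two steps can be sketched as follows. To keep the example short it assumes hypothetical toy sequences of equal length, so that the distance is simply the fraction of differing positions; real tools derive the distances from pairwise alignments. The pair with the smallest distance is the one a guide tree would join first.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of positions that differ between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y) / len(a)


# Hypothetical toy sequences, already of equal length
sequences = {
    "seq1": "ACGTACGT",
    "seq2": "ACCTTCGA",
    "seq3": "ACGTACGA",
    "seq4": "TTGTTCAA",
}

# Step 1: distance for every pair of sequences
distances = {(m, n): p_distance(sequences[m], sequences[n])
             for m, n in combinations(sequences, 2)}

# Step 2 (first join only): the closest pair is merged first in the guide tree
first_join = min(distances, key=distances.get)

for pair, d in sorted(distances.items(), key=lambda item: item[1]):
    print(pair, d)
print(first_join)
```

A full guide tree repeats the join step, recomputing distances between the merged cluster and the remaining sequences, for example with UPGMA or neighbor-joining.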


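Progressive MSA

[Figure: stepwise construction of the alignment along the guide tree for five sequences, starting with sequences 1 and 3 and then adding 2, 5 and 4]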


MSA: Iterative alignment

• Iterative sequence alignment
– Improved progressive alignment
– Aligns the sequences repeatedly
– E.g. MUSCLE


Iterative MSA

• Follows 3 steps
– Progressive alignment
– Second progressive alignment
– Refinement


Phylogenetic tree

• A phylogenetic tree shows evolutionary relationships between the sequences
• Types:
– Rooted
  • Nodes represent the most recent common ancestors
  • Edge lengths represent time estimates
– Unrooted
  • No ancestry and time estimates
• Algorithms to generate phylogenetic trees
– Neighbor-joining
– Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
– Maximum parsimony


Neighbor joining method

http://en.wikipedia.org/wiki/Neighbor_joining

MSA exercise

• Align the protein sequences SET 1 and SET 2 using MSA tools and compare the alignments

• Clustal Omega – http://www.ebi.ac.uk/Tools/msa/clustalo/

• MUSCLE – http://www.ebi.ac.uk/Tools/msa/muscle/


What to align: DNA or protein sequence?

If an ORF exists, then always align at the protein level

• Many mismatches in DNA sequences are synonymous

• DNA sequences contain non-coding regions, which should be avoided in homology searching

• Matches are more reliable in protein sequences
– Probability of a match occurring randomly at any position in a sequence
  • Amino acids: 1/20 = 0.05
  • Nucleotides: 1/4 = 0.25

• Searching at the protein level: in case of frameshifts, the alignment score for the protein sequences may be very low even though the DNA sequences are similar

ACT TTT CAT GGG ...
Thr Phe His Gly ...

ACT TTT TCA TGG G..
Thr Phe Ser Trp
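The frameshift example can be reproduced with a short translation function. This sketch assumes the standard genetic code, built from the usual compact 64-letter encoding with the bases in the order T, C, A, G, and it ignores any trailing bases that do not form a complete codon. The output uses one-letter amino acid codes (T = Thr, F = Phe, H = His, G = Gly, S = Ser, W = Trp).

```python
from itertools import product

# Standard genetic code in compact form: codons enumerated with the bases
# in the order T, C, A, G at each position ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate a DNA string in reading frame 1, dropping incomplete codons."""
    dna = dna.upper().replace(" ", "")
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


original = "ACT TTT CAT GGG"
shifted = "ACT TTT TCA TGG G"  # one extra T inserted after the second codon

print(translate(original))
print(translate(shifted))
```

A single inserted base leaves the DNA sequences almost identical, but every amino acid downstream of the insertion changes.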

Searching bioinformatics databases


Learning goals

• How to find information using Keywords?

• How to find information using Sequences?

• How to find correct gene names?


Search strategy

• Keyword search
– Find information related to specific keywords
– Each bioinformatics database has its own search tool
– Some search tools have a wide scope: they access multiple databases and gather the results together
– Gquery, EBI search

• Sequence search
– Use a sequence of interest to find more information about the sequence
– BLAST, FASTA


Keyword search

• Find information related to specific keywords
• Gquery
– A central search tool to find information in NCBI databases
– Searches a large number of NCBI databases and shows the results on one page
– http://www.ncbi.nlm.nih.gov/gquery

• EBI search
– Search tool to find information from databases developed, managed and hosted by EMBL-EBI
– http://www.ebi.ac.uk/services


Gquery


EBI search


Limitations

• Synonyms
• Misspellings
• Old and new names/terms

[Figure: numbers of hits in PubMed and ClinVar for alternative terms, e.g. ELA2 vs ELANE and HIV 1 vs HIV-1]

• NOTES:
– Use different synonyms and read the literature to find more appropriate keywords
– Use Boolean operators to combine different keywords
– Do not expect to find all the information using keyword search alone
– Note the database version or the version of entries in the databases you used
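Combining synonyms with Boolean operators can also be scripted. The sketch below only builds query strings and a search URL and does not contact any server; the helper `any_of` is made up for illustration, and the NCBI E-utilities `esearch` address with its `db` and `term` parameters is not mentioned in the slides and is used here as an assumption about how such a query could be submitted programmatically.

```python
from urllib.parse import urlencode


def any_of(terms):
    """Join synonyms with OR; quote terms containing a space or a hyphen."""
    quoted = ['"%s"' % t if (" " in t or "-" in t) else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


# Old and new gene symbols from the slide
gene_query = any_of(["ELA2", "ELANE"])
# Two spellings of the same term
hiv_query = any_of(["HIV 1", "HIV-1"])

# Assumed endpoint: NCBI E-utilities esearch (no request is sent here)
base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
url = base + "?" + urlencode({"db": "pubmed", "term": gene_query})

print(gene_query)
print(hiv_query)
print(url)
```

Searching with all known synonyms in one query reduces the risk of missing records that were indexed under an older name.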


Gene nomenclature

• HUGO Gene Nomenclature Committee (HGNC)
– Assigns standardized nomenclature to human genes
– Each symbol is unique and each gene is given only one name

• Species-specific nomenclature committees
– Mouse Genome Informatics Database
  • http://www.informatics.jax.org/mgihome/nomen/
– Rat Genome Database
  • http://rgd.mcw.edu/nomen/nomen.shtml


HGNC symbol report

• Approved symbol
• Approved name
• Synonyms
– Terms used in literature to indicate the gene
– HGNC, Ensembl, Entrez Gene, OMIM
• Previous symbols and names
– Previous HGNC approved symbol

• NOTE: HGNC does not approve protein names. Usually genes and proteins have the same name, and gene names are written in italics.


HGNC search


Keyword search

• Exercise
