Basics on bioinformatics Lecture 4
Nunzio D’Agostino [email protected]; [email protected]

Why compare sequences
Sequence comparison is a way of arranging DNA, RNA or protein sequences to identify regions of similarity that may be a consequence of functional, structural or evolutionary relationships between the sequences.
The best way to compare two sequences (protein or nucleic acid) is to align them. This is a basic problem in bioinformatics that recurs in different forms in many contexts.
Conserved residues
Perfect Match
ATGCGTGTGTGCATGCAATGCGTGA
       ************
       TGTGCATGCAAT
Sequence variations
Mismatch
ATGCGTGTGTGCATGCAATGCGTGA
       *** **** ***
       TGTCCATGTAAT
Sequence variations
Deletion
ATGCGTGTGTGCATGCAATGCGTGA
       ******   *
       TGTGCAGCAAT

ATGCGTGTGTGCATGCAATGCGTGA
       ****** /////
       TGTGCAGCAAT

Sequence variations
Gap
ATGCGTGTGTGCATGCAATGCGTGA
       ****** *****
       TGTGCA-GCAAT

Complexity of pairwise alignment
• simple approach: compute and score all possible alignments
• but the number of possible global alignments for 2 sequences of length n grows exponentially with n
• e.g. two sequences of length 100 have about 10^77 possible alignments

Dot plot
A dot plot is a graphical method for comparing two biosequences and identifying regions of close similarity between them.
The dot plot is a table or matrix. In its simplest form, a position is filled when the corresponding residues match and left blank when they differ.
The dot plot captures in a single image the overall similarity between two sequences as well as the complete set of possible alignments.
The biggest asset of dot matrix analysis is that it lets you visualize the entire comparison at once, not concentrating on any one ‘optimal’ region, but rather giving you the ‘Gestalt’ of the whole thing.
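The construction described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `dot_matrix` and `render` are hypothetical, and the two sequences are the ones used in the alignment examples of this lecture.

```python
def dot_matrix(horizontal, vertical):
    """Return one text row per residue of `vertical`.

    A '*' marks a position where the two residues are identical,
    a blank marks a position where they differ (word length 1).
    """
    return [
        "".join("*" if v == h else " " for h in horizontal)
        for v in vertical
    ]


def render(horizontal, vertical):
    """Lay the matrix out as text, with the sequences as row/column labels."""
    lines = ["  " + horizontal]
    for v, row in zip(vertical, dot_matrix(horizontal, vertical)):
        lines.append(f"{v} {row}".rstrip())
    return "\n".join(lines)


# Sequences taken from the lecture examples (assumed orientation:
# long sequence horizontal, short sequence vertical).
print(render("ATGCGTGTGTGCATGCAATGCGTGA", "TGTGCAGCAAT"))
```

Every cell is compared once, so the cost grows with the product of the two sequence lengths.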
Dot-plot
  ATGCGTGTGTGCATGCAATGCGTGA
T  *   * * *   *    *   *
G   * * * * *   *    * * *
T  *   * * *   *    *   *
G   * * * * *   *    * * *
C    *       *   *    *
A *           *   **      *
G   * * * * *   *    * * *
C    *       *   *    *
A *           *   **      *
A *           *   **      *
T  *   * * *   *    *   *
Word length=1
The dot plots of very closely related sequences will appear as a single line along the matrix's main diagonal.

Interpretation of dot plot
A problem with dot matrices for long sequences is that they can be very noisy due to many insignificant matches.
Solution: use a window and a threshold
o compare character by character within a window (the window size has to be chosen)
o require a certain fraction of matches within the window in order to display it with a “dot”
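The window-and-threshold filter can be sketched as follows. The function name `windowed_dots` and its parameters `window` and `min_matches` are hypothetical names chosen for this illustration; the threshold is expressed here as a minimum number of matches rather than a fraction, and a dot is recorded at the start position of each window.

```python
def windowed_dots(horizontal, vertical, window=3, min_matches=3):
    """Return (row, column) start positions of windows that pass the threshold.

    A window of `window` residues is compared character by character;
    a dot is kept only if at least `min_matches` positions are identical.
    """
    dots = []
    for i in range(len(vertical) - window + 1):
        for j in range(len(horizontal) - window + 1):
            matches = sum(
                vertical[i + k] == horizontal[j + k] for k in range(window)
            )
            if matches >= min_matches:
                dots.append((i, j))
    return dots


h = "ATGCGTGTGTGCATGCAATGCGTGA"
v = "TGTGCAGCAAT"
# The number of dots drops as the word (window) size grows.
for w in (1, 2, 3):
    print(w, len(windowed_dots(h, v, window=w, min_matches=w)))
```

Setting `min_matches` equal to `window` requires exact word matches; a lower value tolerates some mismatches inside the window, at the price of more background dots.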
Dot-plot
  ATGCGTGTGTGCATGCAATGCGTGA
T  *   * * *   *    *   *
G     * * *            *
T  *   * * *   *    *   *
G   *       *   *    *
C            *   *
A
G   *       *   *    *
C            *   *
A                 *
A *           *    *
T
Word length=2
Noise reduction can be obtained by increasing the word size.

Dot-plot
  ATGCGTGTGTGCATGCAATGCGTGA
T      * *
G     * * *            *
T  *       *   *    *
G           *   *
C
A
G           *   *
C                *
A                 *
A
T
Word length=3
Beyond a certain word size, the diagonal length may also be reduced.

Dot-plot
[Figure: dot plot of the same two sequences, word length=4]

Dot plot - one path
A--CTGACTG-TCGACTGCCTG
* ** *** ** ****
ATGCTG-CTGCTCCACTG

Dot plot - one path
ACTGACTG-T-C-GACTGC-CTG
* ** * * * ** * ***
A-TGCTGCTG-CTCCACTG

Dot plot - more paths
[Figure: dot plot with several alternative paths]

Dottup
[Figure: Dottup plots with W=1, W=2, W=3, W=5, W=6]

Dot plot - gap
[Figure: dot plot showing a gap]

Dot plot - duplication
[Figure: dot plot showing a duplication]

Penalties
✓ The scoring scheme consists of character substitution scores (i.e. a score for each possible character replacement) plus penalties for gaps.
✓ The alignment score is the sum of substitution scores and gap penalties. The alignment score reflects the goodness of the alignment.

Dot plot – example 1
Match 1, Mismatch 0, Gap 0
A--CTGACTG-TCGACTGCCTG
100110011101101111
ATGCTG-CTGCTCCACTG
Path scores: 6, 4, 13, 12

Dot plot – example 2
Match 1, Mismatch 0, Gap -1
Path scores: 4, 4, 10, 6

Dot plot – example 3
Match 1, Mismatch -1, Gap -1
Path scores: 4, 4, 9, 5

Scoring scheme
match  mismatch  gap     A    B    C    D
  1       0       0      4   13   12    6
  1       0      -1      4   10    6    4
  1      -1      -1      4    9    5    4
  1      -1      -2      4    6   -1    2
  2       0       0      8   26   24   12

Linear vs affine gap penalties
o Linear gap penalty: the same penalty is subtracted for each space in the gap
o Affine gap penalty: the first space in the gap carries a larger penalty than subsequent spaces in the gap; i.e., it is easier to lose/gain more subunits within an existing gap than to start a new gap/insertion (this makes sense evolutionarily)

Match = +1, mismatch = -1, gap = -2:
CCTGGGCTATGC
CC-GG-TT-TGC
Score = 1

Same as above but with an affine penalty (first space = -2, each further space = -1):
CCTGGGCTATGC
CC--GGTT-TGC
Score = 2
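The two scores above can be checked with a small scoring routine. This is a minimal sketch: the function name `score_alignment` and the parameter names `gap_open` and `gap_extend` are hypothetical, and the affine model assumed here charges `gap_open` for the first space of a gap and `gap_extend` for each further space.

```python
def score_alignment(row1, row2, match=1, mismatch=-1, gap_open=-2, gap_extend=None):
    """Score a gapped pairwise alignment given as two equal-length rows.

    With gap_extend=None every space costs gap_open (linear penalty).
    Otherwise the first space of a gap costs gap_open and each further
    space of the same gap costs gap_extend (affine penalty).
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    if gap_extend is None:
        gap_extend = gap_open
    score = 0
    previous_gap = None  # row that carried the gap in the previous column
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            current_gap = 1 if a == "-" else 2
            # Continuing a gap in the same row is an extension, not a new gap.
            score += gap_extend if current_gap == previous_gap else gap_open
            previous_gap = current_gap
        else:
            score += match if a == b else mismatch
            previous_gap = None
    return score


print(score_alignment("CCTGGGCTATGC", "CC-GG-TT-TGC"))                 # linear
print(score_alignment("CCTGGGCTATGC", "CC--GGTT-TGC", gap_extend=-1))  # affine
```

With the linear penalty both alignments score the same, so only the affine scheme prefers the alignment with fewer, longer gaps.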
Distance between two sequences
Since we are dealing with biological sequences, the problem can be approached from two different points of view, which lead to the same result.
We search for either:
1. the minimum distance between the two sequences
2. the maximum similarity between the two sequences
In the first case we refer to the evolutionary process: if two orthologous sequences, for example from a mouse and a frog, have evolved separately from a certain point in time onwards, the differences between the two sequences are expected to give an indication of their divergence.
In the second case, reference is made more directly to the search for similar substrings, from which structural and functional relationships can be derived.
In the scientific literature, the minimum distance and the maximum similarity between two sequences are often used interchangeably.
Similarity vs homology
Homologous sequences are descended from a common ancestral sequence.
Homology is either true or false. It can never be partial! Saying two sequences are 45% homologous is a misuse of the term.
Sequence identity and similarity can be described as a percentage and are used as evidence of homology.
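As an illustration of identity expressed as a percentage, here is a minimal sketch. The function name `percent_identity` is hypothetical, and the denominator used here (the number of alignment columns) is an assumption; other conventions, such as dividing by the length of the shorter sequence, are also in use.

```python
def percent_identity(row1, row2):
    """Percentage of alignment columns holding the same residue in both rows.

    Gap characters ('-') never count as identical.
    """
    if len(row1) != len(row2) or not row1:
        raise ValueError("aligned rows must be non-empty and of equal length")
    identical = sum(a == b and a != "-" for a, b in zip(row1, row2))
    return 100.0 * identical / len(row1)


# Mismatch example from this lecture: 10 identical columns out of 12.
print(round(percent_identity("TGTGCATGCAAT", "TGTCCATGTAAT"), 1))
```

A percentage of this kind quantifies identity only; whether the two sequences are homologous remains a yes/no conclusion drawn from such evidence.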
Distance between two strings
In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different.
TONED and ROSES: HD = 3
The Levenshtein distance between two sequences is the minimum number of single-character edits (insertion, deletion, substitution) required to change one sequence into the other.
KITTEN and SITTING: LD = 3
kitten → sitten (substitution of "s" for "k")
sitten → sittin (substitution of "i" for "e")
sittin → sitting (insertion of "g" at the end)
The phrase edit distance is often used to refer specifically to Levenshtein distance.
Edit distance is a standard dynamic programming problem. Given two strings s1 and s2, the edit distance between s1 and s2 is the minimum number of operations required to convert string s1 to s2.
The following operations are typically used:
• Replacing one character of the string by another character
• Deleting a character from the string
• Adding a character to the string
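Both distances can be sketched in Python as follows. The function names `hamming` and `levenshtein` are hypothetical; the Levenshtein routine is one common dynamic programming formulation that keeps only two rows of the table, with unit cost assumed for every operation.

```python
def hamming(s1, s2):
    """Number of positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(s1, s2))


def levenshtein(s1, s2):
    """Minimum number of substitutions, deletions and insertions turning s1 into s2."""
    previous = list(range(len(s2) + 1))  # distances from the empty prefix of s1
    for i, a in enumerate(s1, start=1):
        current = [i]
        for j, b in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,             # delete a from s1
                current[j - 1] + 1,          # insert b into s1
                previous[j - 1] + (a != b),  # substitute a by b, free if equal
            ))
        previous = current
    return previous[-1]


print(hamming("TONED", "ROSES"))
print(levenshtein("KITTEN", "SITTING"))
```

Keeping only two rows is enough when just the distance is needed; recovering the actual edits requires the full table.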
Edit distance
Assess the Hamming distance between DECLENSION and RECREATION
DECLENSION
RECREATION
HD = 4

Evaluate the Levenshtein distance between BIOINFORMATICS and CONFORMATION
BIOINFORMATICS
-CO-NFORMATION
LD = 5
Edit distance calculation
Transform S1=“winter” into S2=“writers”

        w  r  i  t  e  r  s
     0  1  2  3  4  5  6  7
w    1  0  1  2  3  4  5  6
i    2  1  1  1  2  3  4  5
n    3  2  2  2  2  3  4  5
t    4  3  3  3  2  3  4  5
e    5  4  4  4  3  2  3  4
r    6  5  4  5  4  3  2  3

Edits:
i→r substitution
n→i substitution
s insertion
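The table above can be reproduced with a short routine that fills the whole dynamic programming matrix. This is a sketch under the usual assumption of unit cost for substitution, insertion and deletion; the function name `edit_matrix` is hypothetical.

```python
def edit_matrix(s1, s2):
    """Return the full edit-distance table for s1 (rows) against s2 (columns)."""
    rows, cols = len(s1) + 1, len(s2) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete the first i characters of s1
    for j in range(cols):
        d[0][j] = j  # insert the first j characters of s2
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d


table = edit_matrix("winter", "writers")
for label, row in zip(" winter", table):
    print(label, *row)
print("edit distance:", table[-1][-1])
```

The bottom-right cell holds the edit distance; following the minimal choices back from that cell to the top-left corner yields one optimal series of edits.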