Hamming Distance Between Two Strings of Equal Length Is the Number of Positions at Which the Corresponding Symbols Are Different

Basics on bioinformatics
Lecture 4
Nunzio D'Agostino

Why compare sequences
Sequence comparison is a way of arranging DNA, RNA or protein sequences to identify regions of similarity that may be a consequence of functional, structural or evolutionary relationships between the sequences. The best way to compare two sequences (protein or nucleic acid) is to align them. This is a basic problem in bioinformatics that recurs, in different forms, in many contexts.

Conserved residues: perfect match
ATGCGTGTGTGCATGCAATGCGTGA
       ************
       TGTGCATGCAAT

Sequence variations: mismatch
ATGCGTGTGTGCATGCAATGCGTGA
       *** **** ***
       TGTCCATGTAAT

Sequence variations: deletion
ATGCGTGTGTGCATGCAATGCGTGA
       ******   *
       TGTGCAGCAAT

Shifting the residues after the deletion by one position restores the matches (marked /):
ATGCGTGTGTGCATGCAATGCGTGA
       ****** /////
       TGTGCAGCAAT

Sequence variations: gap
ATGCGTGTGTGCATGCAATGCGTGA
       ****** *****
       TGTGCA-GCAAT

Complexity of pairwise alignment
• simple approach: compute and score all possible alignments
• but the number of possible global alignments of two sequences of length n grows exponentially with n
• e.g. two sequences of length 100 have about 10^77 possible alignments

Dot plot
The dot plot is a graphical method that allows the comparison of two biological sequences and the identification of regions of close similarity between them. The dot plot is a table or matrix. In its simplest form, the positions where the residues differ are left blank, while the positions where the residues match are filled. The dot plot captures in a single image the overall similarity between two sequences as well as the complete set of possible alignments. The biggest asset of dot matrix analysis is that it lets you visualize the entire comparison at once, not concentrating on any one 'optimal' region, but rather giving you the 'Gestalt' of the whole thing.
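The simplest form of dot plot described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name dot_plot and its word parameter are assumptions introduced here, not part of the lecture. With word=1 every matching pair of residues gets a dot; a larger word requires that many consecutive identical residues before a dot is drawn.

```python
def dot_plot(horizontal, vertical, word=1):
    """Build a dot matrix as a list of strings, one string per row.

    Cell (i, j) is filled with '*' when the `word`-long substrings starting
    at vertical[i] and horizontal[j] are identical, and left blank otherwise.
    With word=1 this is a plain residue-by-residue comparison.
    (Hypothetical helper written for illustration only.)
    """
    rows = []
    for i in range(len(vertical)):
        w_v = vertical[i:i + word]
        cells = []
        for j in range(len(horizontal)):
            w_h = horizontal[j:j + word]
            # Words truncated by the end of a sequence never produce a dot.
            filled = len(w_v) == word and w_v == w_h
            cells.append("*" if filled else " ")
        rows.append("".join(cells))
    return rows


if __name__ == "__main__":
    seq_h = "ATGCGTGTGTGCATGCAATGCGTGA"  # horizontal axis
    seq_v = "TGTGCAGCAAT"                # vertical axis
    print("  " + seq_h)
    for residue, row in zip(seq_v, dot_plot(seq_h, seq_v)):
        print(residue + " " + row)
```

Comparing a sequence with itself, for example dot_plot("ACGT", "ACGT"), fills only the main diagonal.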
Dot plot
[Figure: dot plot of ATGCGTGTGTGCATGCAATGCGTGA (horizontal) against TGTGCAGCAAT (vertical), word length = 1]
The dot plots of very closely related sequences will appear as a single line along the matrix's main diagonal.

Interpretation of dot plot
A problem with dot matrices for long sequences is that they can be very noisy due to lots of insignificant matches.
Solution: use a window and a threshold
o compare character by character within a window (the window size has to be chosen)
o require a certain fraction of matches within the window in order to display it with a "dot"

[Figure: the same dot plot, word length = 2]
Noise reduction can be obtained by increasing the word size.

[Figure: the same dot plot, word length = 3]
Beyond a certain word size, the length of the diagonal may also be reduced.

[Figure: the same dot plot, word length = 4]

Dot plot - one path
A--CTGACTG-TCGACTGCCTG
ATGCTG-CTGCTCCACTG

ACTGACTG-T-C-GACTGC-CTG
A-TGCTGCTG-CTCCACTG

Dot plot - more paths
[Figure: Dottup dot plots at W=1, W=2, W=3, W=5 and W=6]
[Figure: dot plot - gap]
[Figure: dot plot - duplication (1, 2)]

Penalties
• The scoring scheme consists of character substitution scores (i.e. a score for each possible character replacement) plus penalties for gaps.
• The alignment score is the sum of the substitution scores and the gap penalties.
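As a rough illustration of the scoring scheme just described, the sketch below sums a substitution score for every aligned pair and a penalty for every gap position. The function name alignment_score and its default values are assumptions made for this example, and "-" is assumed to be the gap symbol.

```python
def alignment_score(top, bottom, match=1, mismatch=0, gap=0):
    """Score a gapped pairwise alignment column by column.

    Each column contributes `match`, `mismatch`, or `gap`; the same
    penalty is applied to every gap position.
    (Hypothetical helper written for illustration only.)
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap        # one penalty per gap position
        elif a == b:
            score += match      # identical residues
        else:
            score += mismatch   # substitution
    return score


if __name__ == "__main__":
    # 11 matches and 1 gap position
    print(alignment_score("TGTGCATGCAAT", "TGTGCA-GCAAT", 1, 0, 0))   # 11
    print(alignment_score("TGTGCATGCAAT", "TGTGCA-GCAAT", 1, 0, -1))  # 10
```

Changing only the gap penalty changes the score of the same alignment, which is why the scoring scheme has to be fixed before alignments are compared.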
The alignment score reflects the goodness of the alignment.

Dot plot - example 1
Match 1, Mismatch 0, Gap 0
A--CTGACTG-TCGACTGCCTG
100110011101101111
ATGCTG-CTGCTCCACTG
Scores: 6, 4, 13, 12

Dot plot - example 2
Match 1, Mismatch 0, Gap -1
Scores: 4, 4, 10, 6

Dot plot - example 3
Match 1, Mismatch -1, Gap -1
Scores: 4, 4, 9, 5

Scoring scheme
match  mismatch  gap     A    B    C    D
    1         0    0     4   13   12    6
    1         0   -1     4   10    6    4
    1        -1   -1     4    9    5    4
    1        -1   -2     4    6   -1    2
    2         0    0     2   26   24   12

Linear vs affine gap penalties
o Linear gap penalty: the same penalty is subtracted for each space in the gap
o Affine gap penalty: the first space in a gap carries a larger penalty than the subsequent spaces in the same gap; i.e., it is easier to lose/gain more subunits within an existing gap than to start a new gap/insertion (this makes sense evolutionarily)

Match = +1, mismatch = -1, gap = -2
CCTGGGCTATGC
CC-GG-TT-TGC
Score = 1 (8 matches, 1 mismatch, 3 gap positions)

Same as above but with affine penalty = -1 (each space after the first in a gap costs -1 instead of -2)
CCTGGGCTATGC
CC--GGTT-TGC
Score = 2 (8 matches, 1 mismatch, one gap of two spaces costing -3, one gap of one space costing -2)

Distance between two sequences
Since we are dealing with biological sequences, the problem can be approached from two different points of view, which lead to the same result. One searches for:
1. the minimum distance between the two sequences
2. the maximum similarity between the two sequences

In the first case we refer to the evolutionary process: if two orthologous sequences, for example from a mouse and a frog, have evolved separately from a certain point in time onwards, the differences between the two sequences are expected to give an indication of their divergence. In the second case, reference is made more directly to the search for similar substrings, in order to derive structural and functional relationships.
In the scientific literature, the minimum distance and the maximum similarity between two sequences are often used interchangeably.

Similarity vs homology
Homologous sequences are descended from a common ancestral sequence. Homology is either true or false. It can never be partial! Saying two sequences are 45% homologous is a misuse of the term. Sequence identity and similarity can be described as a percentage and are used as evidence of homology.

Distance between two strings
In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different.
TONED and ROSES: HD = 3
The Levenshtein distance between two sequences is the minimum number of single-character edits (insertion, deletion, substitution) required to change one sequence into the other.
KITTEN and SITTING: LD = 3
kitten → sitten (substitution of "s" for "k")
sitten → sittin (substitution of "i" for "e")
sittin → sitting (insertion of "g" at the end)

Edit distance
The phrase edit distance is often used to refer specifically to Levenshtein distance. Edit distance is a standard dynamic programming problem. Given two strings s1 and s2, the edit distance between s1 and s2 is the minimum number of operations required to convert s1 into s2. The following operations are typically used:
• replacing one character of the string with another character
• deleting a character from the string
• adding a character to the string

Assess the Hamming distance between DECLENSION and RECREATION
DECLENSION
RECREATION
Hamming distance = 4

Evaluate the Levenshtein distance between BIOINFORMATICS and CONFORMATION
BIOINFORMATICS
-CO-NFORMATION
Levenshtein distance = 5

Edit distance calculation
Transform S1 = "winter" into S2 = "writers"

      w  r  i  t  e  r  s
   0  1  2  3  4  5  6  7
w  1  0  1  2  3  4  5  6
i  2  1  1  1  2  3  4  5
n  3  2  2  2  2  3  4  5
t  4  3  3  3  2  3  4  5
e  5  4  4  4  3  2  3  4
r  6  5  4  5  4  3  2  3

Edits:
i → r substitution
n → i substitution
s insertion
The edit distance is the value in the bottom-right cell: 3.
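The two distances defined above can be checked with a short Python sketch. The function names hamming_distance and levenshtein_distance are assumptions chosen for this example. The Levenshtein function fills the same kind of table as the "winter"/"writers" calculation, but keeps only one row in memory at a time.

```python
def hamming_distance(s1, s2):
    """Count the positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(a != b for a, b in zip(s1, s2))


def levenshtein_distance(s1, s2):
    """Minimum number of substitutions, insertions and deletions
    needed to turn s1 into s2 (row-by-row dynamic programming)."""
    # previous[j] = distance between the processed prefix of s1 and s2[:j];
    # the first row is the cost of building s2[:j] from the empty string.
    previous = list(range(len(s2) + 1))
    for i, a in enumerate(s1, start=1):
        current = [i]  # cost of deleting the first i characters of s1
        for j, b in enumerate(s2, start=1):
            cost = 0 if a == b else 1
            current.append(min(
                previous[j] + 1,         # deletion of a
                current[j - 1] + 1,      # insertion of b
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(hamming_distance("TONED", "ROSES"))            # 3
    print(hamming_distance("DECLENSION", "RECREATION"))  # 4
    print(levenshtein_distance("KITTEN", "SITTING"))     # 3
    print(levenshtein_distance("winter", "writers"))     # 3
```

Keeping only the previous row is a space-saving design choice; the full table is only needed when the sequence of edits, and not just their number, has to be traced back.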
