Basics on bioinformatics Lecture 4

Nunzio D’Agostino [email protected]; [email protected]

Why compare sequences

Sequence comparison is a way of arranging the sequences of DNA, RNA or protein to identify regions of similarity that may be a consequence of functional, structural or evolutionary relationships between the sequences.

The best way to compare two sequences (protein or nucleic acid) is to align them. This is a basic problem in bioinformatics that recurs in different forms, in many cases.

Conserved residues

Perfect Match

ATGCGTGTGTGCATGCAATGCGTGA
       ************
       TGTGCATGCAAT

Sequence variations

Mismatch

ATGCGTGTGTGCATGCAATGCGTGA
       *** **** ***
       TGTCCATGTAAT

Deletion

ATGCGTGTGTGCATGCAATGCGTGA
       ******   *
       TGTGCAGCAAT

ATGCGTGTGTGCATGCAATGCGTGA
       ****** /////
       TGTGCAGCAAT

(/ marks residues that match after a one-position shift)

Gap

ATGCGTGTGTGCATGCAATGCGTGA
       ****** *****
       TGTGCA-GCAAT

Complexity of pairwise alignment

• simple approach: compute & score all possible alignments

• but the number of possible global alignments for 2 sequences of length n grows exponentially with n

• e.g. two sequences of length 100 have about 10^77 possible alignments

Dot plot

A dot plot is a graphical method that allows the comparison of two biosequences and the identification of regions of close similarity between them.

The dot plot is a table or matrix. In the simplest form of dot plot, a position is left blank when the two residues are different and is filled when the residues correspond.

The dot plot captures in a single image the overall similarity between two sequences as well as the complete set of the different possible alignments.

The biggest asset of dot matrix analysis is that it allows you to visualize the entire comparison at once, not concentrating on any one ‘optimal’ region, but rather giving you the ‘Gestalt’ of the whole thing.
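The construction described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `dot_plot` and `print_dot_plot` are made up for this example, and non-matching cells are printed as `.` instead of being left blank so that the grid stays readable in plain text.

```python
def dot_plot(seq_x, seq_y):
    """Return the dot matrix as a list of strings, one row per residue of seq_y.

    A cell holds '*' when the two residues are identical and '.' otherwise
    ('.' stands in for a blank cell; this is a display choice of this sketch).
    """
    rows = []
    for y in seq_y:
        rows.append("".join("*" if x == y else "." for x in seq_x))
    return rows


def print_dot_plot(seq_x, seq_y):
    """Print the matrix with seq_x along the top and seq_y down the side."""
    print("  " + seq_x)
    for y, row in zip(seq_y, dot_plot(seq_x, seq_y)):
        print(y + " " + row)


# The two sequences used in the slides of this lecture.
print_dot_plot("ATGCGTGTGTGCATGCAATGCGTGA", "TGTGCAGCAAT")
```

Runs of `*` along a diagonal correspond to stretches where the two sequences can be aligned without gaps.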

Dot-plot

  ATGCGTGTGTGCATGCAATGCGTGA
T .*...*.*.*...*....*...*..
G ..*.*.*.*.*...*....*.*.*.
T .*...*.*.*...*....*...*..
G ..*.*.*.*.*...*....*.*.*.
C ...*.......*...*....*....
A *...........*...**......*
G ..*.*.*.*.*...*....*.*.*.
C ...*.......*...*....*....
A *...........*...**......*
A *...........*...**......*
T .*...*.*.*...*....*...*..

(* = matching residues, . = blank cell)

Word length=1

The dot plots of very closely related sequences will appear as a single line along the matrix's main diagonal.

Interpretation of dot plot

A problem with dot matrices for long sequences is that they can be very noisy due to lots of insignificant matches.

Solution: use a window and a threshold
o compare character by character within a window (the window size has to be chosen)
o require a certain fraction of matches within the window in order to display it with a “dot”
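The window-and-threshold filter can be sketched as follows. The function name `windowed_dot_plot` and the parameter names `window` and `threshold` are assumptions made for this example, and placing the dot at the first position of the window is just one possible convention. With `threshold` equal to `window`, the filter reduces to exact word matching.

```python
def windowed_dot_plot(seq_x, seq_y, window=3, threshold=3):
    """Dot matrix filtered with a sliding window.

    Cell (i, j) gets a '*' when at least `threshold` of the `window` residue
    pairs seq_x[j + k], seq_y[i + k] (k = 0 .. window - 1) are identical.
    Other cells are shown as '.' (a stand-in for a blank cell).
    """
    rows = []
    for i in range(len(seq_y) - window + 1):
        row = []
        for j in range(len(seq_x) - window + 1):
            matches = sum(seq_x[j + k] == seq_y[i + k] for k in range(window))
            row.append("*" if matches >= threshold else ".")
        rows.append("".join(row))
    return rows


reference = "ATGCGTGTGTGCATGCAATGCGTGA"
query = "TGTGCAGCAAT"

# window = threshold = 1 is the unfiltered dot plot; larger values remove noise.
for w in (1, 2, 3):
    dots = sum(row.count("*") for row in windowed_dot_plot(reference, query, w, w))
    print("window", w, "dots", dots)
```

Lowering `threshold` below `window` tolerates mismatches inside the window, which keeps diagonals visible across point substitutions at the cost of more background dots.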

Dot-plot

[Figure: dot plots of ATGCGTGTGTGCATGCAATGCGTGA against TGTGCAGCAAT with word length = 2, 3 and 4]

Noise reduction can be obtained by increasing the word size. Above a certain word size, the length of the diagonal may also be reduced.

Dot plot - one path

[Figure: two examples of a single path through the dot plot of ACTGACTGTCGACTGCCTG against ATGCTGCTGCTCCACTG, each with the corresponding gapped alignment]

Dot plot - more paths

[Figure: dot plot with several paths]

Dottup

[Figure: dottup plots with W=1, W=2, W=3, W=5 and W=6]

Dot plot - gap

[Figure: dot plot showing a gap]

Dot plot - duplication

[Figure: dot plot showing a duplication]

Penalties

• The scoring scheme consists of character substitution scores (i.e. a score for each possible character replacement) plus penalties for gaps.

• The alignment score is the sum of substitution scores and gap penalties. The alignment score reflects the goodness of the alignment.

Dot plot – example 1

Match 1, Mismatch 0, Gap 0

[Figure: dot plot with four candidate paths; path scores A = 4, B = 13, C = 12, D = 6. For one path, the per-column scores 100110011101101111 sum to 12]

Dot plot – example 2

Match 1, Mismatch 0, Gap -1

[Figure: the four candidate paths; path scores A = 4, B = 10, C = 6, D = 4]

Dot plot – example 3

Match 1, Mismatch -1, Gap -1

[Figure: the four candidate paths; path scores A = 4, B = 9, C = 5, D = 4]

Scoring scheme

match  mismatch  gap     A    B    C    D
  1       0       0      4   13   12    6
  1       0      -1      4   10    6    4
  1      -1      -1      4    9    5    4
  1      -1      -2      4    6   -1    2
  2       0       0      8   26   24   12

Linear vs affine gap penalties

o Linear gap penalty: the same penalty is subtracted for each space in the gap
o Affine gap penalty: the first space in a gap is penalized more heavily than subsequent spaces in the gap; i.e., it is easier to lose/gain more subunits within an existing gap than to start a new gap/insertion (this makes sense evolutionarily)

Match = +1, mismatch = -1, gap = -2:

CCTGGGCTATGC
CC-GG-TT-TGC    Score = 1

Same as above but with an affine penalty (first space of a gap = -2, each further space of the same gap = -1):

CCTGGGCTATGC
CC--GGTT-TGC    Score = 2
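The two scores above can be checked with a short column-by-column scorer. This is a sketch under stated assumptions: the function name `alignment_score` and the parameters `gap_open` and `gap_extend` are names chosen for this example, with `gap_open` charged for the first space of a gap and `gap_extend` for every further space of the same gap.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap_open=-2, gap_extend=None):
    """Score two gapped strings of equal length, one column at a time.

    gap_extend=None means a linear penalty: every space costs gap_open.
    Otherwise the first space of a gap costs gap_open and each further
    space of the same gap costs gap_extend (affine penalty).
    """
    if len(a) != len(b):
        raise ValueError("aligned strings must have the same length")
    if gap_extend is None:
        gap_extend = gap_open
    score = 0
    previous_gap = None  # which string carried a gap in the previous column
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            current_gap = "a" if x == "-" else "b"
            score += gap_extend if current_gap == previous_gap else gap_open
            previous_gap = current_gap
        else:
            score += match if x == y else mismatch
            previous_gap = None
    return score


# Linear gap penalty of -2 per space.
print(alignment_score("CCTGGGCTATGC", "CC-GG-TT-TGC"))                  # 1
# Affine penalty: -2 to open a gap, -1 for each further space.
print(alignment_score("CCTGGGCTATGC", "CC--GGTT-TGC", gap_extend=-1))   # 2
```

Under the linear scheme both alignments score 1; the affine scheme favours the second one because two of its three spaces share a single gap.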

Distance and similarity between two sequences

Since we are dealing with biological sequences, the problem can be approached using two different points of view, which lead to the same result.

We look for either:
1. the minimum distance between the two sequences
2. the maximum similarity between the two sequences

Distance between two sequences

In the first case we refer to the evolutionary process: if two orthologous sequences, for example from a mouse and a frog, have evolved separately from a certain point in time onwards, the differences between the two sequences are expected to give us an indication of their divergence.

In the second case, reference is made more directly to the search for similar substrings, in order to derive structural and functional relationships.

In the scientific literature, the minimum distance and the maximum similarity between two sequences are often used interchangeably.

Similarity vs homology

Homologous sequences are descended from a common ancestral sequence.

Homology is either true or false. It can never be partial! Saying two sequences are 45% homologous is a misuse of the term.

Sequence identity and similarity can be described as a percentage and are used as evidence of homology.
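As a small illustration of identity as a percentage, the sketch below counts identical columns in a given alignment. The function name `percent_identity` is made up for this example, and dividing by the full alignment length (gap columns included) is an assumption: other denominators, such as the length of the shorter sequence, are also in use.

```python
def percent_identity(a, b):
    """Percentage of alignment columns that hold two identical residues.

    Columns containing a gap ('-') never count as identical. The denominator
    is the alignment length, which is one convention among several.
    """
    if len(a) != len(b):
        raise ValueError("aligned strings must have the same length")
    identical = sum(x == y and x != "-" for x, y in zip(a, b))
    return 100.0 * identical / len(a)


# The mismatch example from earlier in this lecture: 10 identical columns out of 12.
print(round(percent_identity("TGTGCATGCAAT", "TGTCCATGTAAT"), 1))
```

The resulting number describes the alignment only; whether the sequences are homologous remains a yes-or-no inference drawn from such evidence.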

Distance between two strings

In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different.

TONED and ROSES: HD = 3.

The Levenshtein distance between two sequences is the minimum number of single-character edits (insertion, deletion, substitution) required to change one sequence into the other.

KITTEN and SITTING: LD = 3
kitten → sitten (substitution of "s" for "k")
sitten → sittin (substitution of "i" for "e")
sittin → sitting (insertion of "g" at the end)


The phrase edit distance is often used to refer specifically to Levenshtein distance.

Edit distance is a standard dynamic programming problem. Given two strings s1 and s2, the edit distance between s1 and s2 is the minimum number of operations required to convert string s1 to s2.

The following operations are typically used:
• Replacing one character of the string by another character
• Deleting a character from the string
• Adding a character to the string
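A minimal dynamic programming sketch of this definition is given below, with each of the three operations costing 1. The function name `levenshtein` is chosen for this example; it keeps only two rows of the table in memory, which is enough when only the distance is needed.

```python
def levenshtein(s1, s2):
    """Minimum number of substitutions, deletions and insertions turning s1 into s2."""
    # previous[j] = distance between the processed prefix of s1 and s2[:j]
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # turning s1[:i] into the empty string takes i deletions
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # delete c1
                current[j - 1] + 1,            # insert c2
                previous[j - 1] + (c1 != c2),  # substitute, or free if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
print(levenshtein("BIOINFORMATICS", "CONFORMATION"))
```

The running time is proportional to the product of the two string lengths, in contrast with the exponential number of alignments that a brute-force comparison would have to score.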

Edit distance

Assess the Hamming distance between DECLENSION and RECREATION

DECLENSION
RECREATION
Hamming distance = 4
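The Hamming distance is simple enough to check directly. A minimal sketch follows; the function name `hamming` is chosen for this example, and unequal lengths are rejected because the distance is defined only for strings of equal length.

```python
def hamming(s1, s2):
    """Number of positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(a != b for a, b in zip(s1, s2))


print(hamming("TONED", "ROSES"))
print(hamming("DECLENSION", "RECREATION"))
```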

Evaluate the Levenshtein distance between BIOINFORMATICS and CONFORMATION

BIOINFORMATICS
-CO-NFORMATION
Levenshtein distance = 5

Edit distance calculation

Transform S1=“winter” into S2=“writers”

         w  r  i  t  e  r  s
      0  1  2  3  4  5  6  7
   w  1  0  1  2  3  4  5  6
   i  2  1  1  1  2  3  4  5
   n  3  2  2  2  2  3  4  5
   t  4  3  3  3  2  3  4  5
   e  5  4  4  4  3  2  3  4
   r  6  5  4  5  4  3  2  3

Edits:
i → r substitution
n → i substitution
s insertion
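The table for “winter” and “writers” can be filled and traced back with the sketch below. The function names `edit_matrix` and `traceback` are chosen for this example. The traceback prefers diagonal moves when several moves are equally good, which is an assumption of this sketch: other tie-breaking rules can report a different but equally short list of edits.

```python
def edit_matrix(s1, s2):
    """Full edit distance table: d[i][j] = distance between s1[:i] and s2[:j]."""
    rows, cols = len(s1) + 1, len(s2) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # i deletions
    for j in range(cols):
        d[0][j] = j  # j insertions
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
    return d


def traceback(s1, s2, d):
    """Walk from the bottom-right cell back to the top-left and list the edits."""
    i, j = len(s1), len(s2)
    edits = []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]):
            if s1[i - 1] != s2[j - 1]:
                edits.append(f"{s1[i - 1]} -> {s2[j - 1]} substitution")
            i, j = i - 1, j - 1
        elif j > 0 and d[i][j] == d[i][j - 1] + 1:
            edits.append(f"{s2[j - 1]} insertion")
            j -= 1
        else:
            edits.append(f"{s1[i - 1]} deletion")
            i -= 1
    return edits[::-1]  # report edits from left to right


table = edit_matrix("winter", "writers")
for row in table:
    print(" ".join(f"{value:2d}" for value in row))
print(traceback("winter", "writers", table))
```

The bottom-right cell of the table holds the edit distance, and the path followed by the traceback corresponds to one optimal alignment of the two strings.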