Sequence Alignment Algorithms

2/19/17
Sequence alignment algorithms
Bas E. Dutilh
Systems Biology: Bioinformatic Data Analysis
Utrecht University, February 23rd 2017

After this lecture, you can…
… decide when to use local and global sequence alignments
… use dynamic programming to align two sequences
… explain the difference between fixed/linear/affine gap penalties
… derive substitution scores and gap penalties from an alignment matrix
… explain the progressive multiple alignment algorithm and the difference between a guide tree and a phylogenetic tree
… recognize and validate alignment FASTA files
… list and evaluate the assumptions on which sequence alignment depends

Pairwise sequence alignments
• Definition of sequence alignment
  – “Given two sequences seqX = X1X2…XM and seqY = Y1Y2…YN, an alignment is an assignment of gaps to positions 0, …, M in seqX, and to positions 0, …, N in seqY, so as to line up each letter in one sequence with either a letter or a gap in the other sequence”

    Unaligned:
    AGGCTATCACCTGACCTCCAGGCCGATGCCC
    TAGCTATCACGACCGCGGTCGATTTGCCCGAC

    Aligned:
    -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
    TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

• The optimal alignment is the alignment that is most consistent with a model of evolution
• It is not trivial to make sequence alignments
  – The alignment should be reliable
  – The method of obtaining the alignment should be reproducible
  – Thus, we use an algorithm to make sequence alignments

Global and local sequence alignments
• Alignment: adding gaps in one and/or the other sequence until they are both equally long
• Are sequences completely or partially homologous?
• Local alignment
  – Finds the optimal sub-alignment within two sequences
  – Partial homologs, e.g. resulting from domain rearrangement
• Global alignment
  – Aligns two sequences from end to end
  – If you know two sequences are full homologs, e.g. resulting from gene duplication

How to detect identical (sub-)sequences?
GAACTGCACTC
GTGCACTCT
[Figure: Alignment matrix]

How to identify identical (sub-)sequences?
GTCGTTGCAGTGTATTGCGAACTGCACTCTGA
GTCGTTGCAGTGTGCACTCTGA
[Figure: Alignment matrix]

Towards an algorithm
• The challenge is to find an algorithm that finds the best alignment between two sequences
• The first thing we need is a scoring system
  – Substitution matrix
    • How many points for a match?
    • How much penalty for a mismatch?
  – Gap penalty
• These scores are based on a model of evolution:
  – How often do we think these events occur?
  – More likely events are given higher scores
  – Less likely events are given lower scores (higher penalties)

Substitution matrix:
      A   C   G   T
  A   1
  C  -1   1
  G  -1  -1   1
  T  -1  -1  -1   1
Gap penalty: -2

Towards an algorithm
• Then we go through the alignment matrix, cell by cell, and score it:
  – If the residue at this position is the same in both sequences, the cell gets a +1 score
  – If the residue at this position differs between the two sequences, the cell gets a -1 penalty
  – Relative to what?
    → Relative to one of the possible previous cells
    → The one that maximizes the alignment score
[Figure: Alignment matrix]

Towards an algorithm
• From a given cell (X) the alignment can go in three directions:

          a   b
      c   U   V
      d   W   X

  – Diagonally from the upper left: this indicates an aligned residue
    • Score of X = score of U + substitution score (b, d)
  – From the cell directly above: this indicates a gap in the horizontal sequence
    • Score of X = score of V + gap penalty (the gap penalty is a negative number, e.g. -2)
  – From the cell directly to the left: this indicates a gap in the vertical sequence
    • Score of X = score of W + gap penalty
• Every time, we choose the option that leads to the highest alignment score in X
• To identify local alignments, we set the alignment score to zero if it becomes negative, and restart the alignment
[Figure: Alignment matrix]

Re-thinking the model of evolution
• An indel is an insertion/deletion of a sequence segment
  – Indels are usually single evolutionary events
  – So you do not want to penalize every residue aligned to a gap
• The model of evolution can account for this by differentiating two gap penalties:
  – Gap open penalty: high penalty indicating the (un-)likelihood of an indel event in evolution
  – Gap extension penalty: zero score or low penalty indicating that the likelihood of the evolutionary event does not depend on the length of the indel
[Figure: Alignment matrix scored with gap open penalty -2 and gap extension penalty 0]

Gaps
• Gaps are the result of insertions or deletions in the sequence
• A given insertion or deletion is probably just one evolutionary event, regardless of its size
• Adding a gap penalty for each gap position may decrease the alignment score too much
• This can be solved by using a high penalty for “Gap opening” and a low penalty for “Gap extension”
[Figure: gap example labeled “Gap open”, “Gap extension” and “Total gap penalty”; values shown: 6 x -3, 5 x, -2, -1]

Dynamic programming
• The algorithm we have described is called dynamic programming

Align GCCCTAGCG to GCGCAATG.
• Many solutions are possible
  – The optimal alignment maximizes the alignment score
    • Depends on substitution matrix and gap penalty
  – You could calculate alignment scores for all possible alignments:
      1 + 1 - 1 + 1 - 1 + 1 - 1 - 1 - 2 = -2
      -2 - 1 + 1 - 1 - 1 + 1 - 1 - 1 + 1 = -4
      1 + 1 - 1 + 1 - 1 + 1 - 2 - 1 + 1 = 0
      1 + 1 - 1 + 1 - 2 - 2 + 1 - 2 - 1 + 1 = -3
      Etcetera…
• An alignment matrix holds all possible alignments

Substitution matrix:
      A   C   G   T
  A   1
  C  -1   1
  G  -1  -1   1
  T  -1  -1  -1   1
Gap penalty: -2

Global alignment
• Needleman-Wunsch algorithm
  – Negative alignment matrix cells are allowed
    • So that the alignment score can be calculated from start to end of the sequence
  – Backtrack from the last cell
    • Proceed until the start of the sequence
    • Identifies the highest scoring global alignment

Alignment matrix:
             G    C    C    C    T    A    G    C    G
        0   -2   -4   -6   -8  -10  -12  -14  -16  -18
   G   -2    1   -1   -3   -5   -7   -9  -11  -13  -15
   C   -4   -1    2    0   -2   -4   -6   -8  -10  -12
   G   -6   -3    0    1   -1   -3   -5   -5   -7   -9
   C   -8   -5   -2    1    2    0   -2   -4   -4   -6
   A  -10   -7   -4   -1    0    1    1   -1   -3   -5
   A  -12   -9   -6   -3   -2   -1    2    0   -2   -4
   T  -14  -11   -8   -5   -4   -1    0    1   -1   -3
   G  -16  -13  -10   -7   -6   -3   -2    1    0    0

Example, first cell (row G, column G):
  diagonal:      0 + 1 = 1
  from above:    -2 - 2 = -4
  from the left: -2 - 2 = -4
  → the cell gets the highest value, 1
Example, next cell (row G, column C):
  diagonal:      -2 - 1 = -3
  from above:    -4 - 2 = -6
  from the left: 1 - 2 = -1
  → the cell gets the highest value, -1

Possible alignments
• Three global alignments are possible
  – All three alignments are valid!
• The alignment scores are identical:
    1 + 1 - 1 + 1 - 1 + 1 - 2 - 1 + 1 = 0
    1 + 1 - 1 + 1 - 1 + 1 - 1 - 2 + 1 = 0
    1 + 1 - 1 + 1 - 2 + 1 - 1 - 1 + 1 = 0
• Alignments strongly depend on the substitution matrix!
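
The cell-by-cell procedure described above (take the best of diagonal, above and left; for local alignments, reset negative scores to zero) can be sketched in Python. This is a minimal sketch and not code from the lecture: the function name align and its parameters are assumptions made for illustration. It uses the +1/-1 substitution scores and the linear gap penalty of -2 from the slides, and on ties the backtrack prefers the diagonal, so it reports only one of several equally good alignments.

```python
def align(seq_x, seq_y, match=1, mismatch=-1, gap=-2, local=False):
    """Align seq_x (horizontal) to seq_y (vertical) by dynamic programming.

    Hypothetical helper for illustration. Returns (score, aligned_x, aligned_y).
    local=False fills the matrix as in a global alignment,
    local=True floors every cell at zero as in a local alignment.
    """
    rows, cols = len(seq_y) + 1, len(seq_x) + 1
    score = [[0] * cols for _ in range(rows)]

    def sub(i, j):
        # Substitution score for the residues belonging to cell (i, j).
        return match if seq_x[j - 1] == seq_y[i - 1] else mismatch

    if not local:
        # Global alignment: leading gaps are penalized in the first row/column.
        for j in range(1, cols):
            score[0][j] = j * gap
        for i in range(1, rows):
            score[i][0] = i * gap

    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            cell = max(score[i - 1][j - 1] + sub(i, j),  # aligned residues
                       score[i - 1][j] + gap,            # gap in horizontal seq
                       score[i][j - 1] + gap)            # gap in vertical seq
            if local:
                cell = max(cell, 0)                      # restart the alignment
            score[i][j] = cell
            if cell > best:
                best, best_pos = cell, (i, j)

    # Backtrack: from the last cell (global) or the highest cell (local).
    i, j = best_pos if local else (rows - 1, cols - 1)
    final = score[i][j]
    out_x, out_y = [], []
    while (i > 0 or j > 0) and not (local and score[i][j] == 0):
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_x.append(seq_x[j - 1])
            out_y.append(seq_y[i - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_x.append("-")
            out_y.append(seq_y[i - 1])
            i -= 1
        else:
            out_x.append(seq_x[j - 1])
            out_y.append("-")
            j -= 1
    return final, "".join(reversed(out_x)), "".join(reversed(out_y))


if __name__ == "__main__":
    for mode in (False, True):
        s, ax, ay = align("GCCCTAGCG", "GCGCAATG", local=mode)
        print("local" if mode else "global", s)
        print(ax)
        print(ay)
```

For GCCCTAGCG and GCGCAATG the global score found in the last cell is 0, in agreement with the alignment scores listed above. Changing match, mismatch or gap is a quick way to see how strongly the result depends on the scoring system.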

Local alignment
• Smith-Waterman algorithm
  – Negative alignment matrix cells are set to zero
    • So that local alignments can be identified as positive values
  – Backtrack from the highest cell
    • Proceed until the first zero
    • Identifies the highest scoring local alignment

Substitution matrix:
      A   C   G   T
  A   1
  C  -1   1
  G  -1  -1   1
  T  -1  -1  -1   1
Gap penalty: -2

Alignment matrix:
           G   C   C   C   T   A   G   C   G
       0   0   0   0   0   0   0   0   0   0
   G   0   1   0   0   0   0   0   1   0   1
   C   0   0   2   1   1   0   0   0   2   0
   G   0   1   0   1   0   0   0   1   0   3
   C   0   0   2   1   2   0   0   0   2   1
   A   0   0   0   1   0   1   1   0   0   1
   A   0   0   0   0   0   0   2   0   0   0
   T   0   0   0   0   0   1   0   1   0   0
   G   0   1   0   0   0   0   0   1   0   1

Example, first cell (row G, column G):
  diagonal:      0 + 1 = 1
  from above:    0 - 2 = -2, set to 0
  from the left: 0 - 2 = -2, set to 0
  → the cell gets the highest value, 1

Exercise
a. Is this a global or a local alignment?
b. What is the name of the algorithm used?
c. What is the gap penalty?
d. Give the substitution matrix.
e. What is the score of the optimal alignment?
f. What is the optimal alignment?

Protein alignments
• Make a global alignment of these two sequences using the BLOSUM62 substitution matrix
  – CAPT
  – CFT
Gap penalty: -11

Alignment matrix:
             C    A    P    T
        0  -11  -22  -33  -44
   C  -11    9   -2  -13  -24
   F  -22   -2    7   -4  -15
   T  -33  -13   -2    6    1

Visualizing identity and similarity in an alignment
Retinol-binding protein (RBP) aligned to β-lactoglobulin:

    1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG  50 RBP
    1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD.  44 lactoglobulin

   51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE  97 RBP
   45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK  93 lactoglobulin

   98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP
   94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin

  137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP
  136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 lactoglobulin

Symbols on the match line between each pair of rows:
  |        Identical (bar)
  :        Very similar (two dots)
  .        Somewhat similar (one dot)
  (space)  Not similar

Try it on BABA!
• Basic-Algorithms-of-Bioinformatics Applet
  – http://baba.sourceforge.net/
  – If your computer does not run the Java Applet, use the standalone runnable version

Multiple sequence alignment
• What if we want to align many sequences, for example a homologous gene in several animals?
• Option: dynamic programming in multiple dimensions
[Figure: three-dimensional alignment matrix with the sequence SEQUENCE along each of the axes 1, 2 and 3]
• This algorithm is inefficient, because the size of the matrix (and thus the number of computational steps) scales exponentially with the number of sequences
  – A matrix for 10 proteins of 100 residues is 100^10 = 10^20 cells in size
  – Storing this in RAM would require about 100 million computers

Progressive multiple sequence alignment
• Algorithm goes through a series of pairwise alignments
• You first need a guide tree that indicates how similar/different the sequences are to each other
  – A guide tree is not a phylogenetic tree
    • Phylogenetic trees show evolutionary history, guide trees only show similarity
• You need
