2/19/17

Sequence alignment algorithms

Bas E. Dutilh
Systems Biology: Bioinformatic Data Analysis
Utrecht University, February 23rd 2017

After this lecture, you can…
• decide when to use local and global sequence alignments
• use dynamic programming to align two sequences
• explain the difference between fixed/linear/affine gap penalties
• derive substitution scores and gap penalties from an alignment matrix
• explain the progressive multiple alignment algorithm and the difference between a guide tree and a phylogenetic tree
• recognize and validate alignment Fasta files
• list and evaluate the assumptions on which sequence alignment depends

Pairwise sequence alignments

• Definition of sequence alignment

– “Given two sequences seqX = X1X2…XM and seqY = Y1Y2…YN, an alignment is an assignment of gaps to positions 0, …, M in seqX, and to positions 0, …, N in seqY, so as to line up each letter in one sequence with either a letter or a gap in the other sequence”

Unaligned:
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

Aligned:
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

• The optimal alignment is the alignment that is most consistent with a model of evolution
• It is not trivial to make sequence alignments
– The alignment should be reliable
– The method of obtaining the alignment should be reproducible
– Thus, we use an algorithm to make sequence alignments

Global and local sequence alignments

• Alignment: adding gaps in one and/or the other sequence until they are both equally long
• Are sequences completely or partially homologous?
• Local alignment
– Finds the optimal sub-alignment within two sequences
– Partial homologs, e.g. resulting from domain rearrangement
• Global alignment
– Aligns two sequences from end to end
– If you know two sequences are full homologs, e.g. resulting from gene duplication

How to detect identical (sub-)sequences? GAACTGCACTC GTGCACTCT

Alignment matrix
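A minimal sketch of how an alignment matrix reveals identical sub-sequences: mark every cell where the two residues are identical, then look for the longest diagonal run of marks. The function names `dot_matrix` and `longest_shared_run` are made up for this illustration, and gaps are not considered yet.

```python
def dot_matrix(seq_x, seq_y):
    """Return a grid with True wherever the two residues are identical."""
    return [[x == y for x in seq_x] for y in seq_y]


def longest_shared_run(seq_x, seq_y):
    """Return the longest gap-free identical stretch (longest diagonal run)."""
    best = ""
    # run[j] = length of the identical run ending in the previous row at column j
    run = [0] * (len(seq_x) + 1)
    for y in seq_y:
        new_run = [0] * (len(seq_x) + 1)
        for j, x in enumerate(seq_x, start=1):
            if x == y:
                new_run[j] = run[j - 1] + 1  # extend the diagonal
                if new_run[j] > len(best):
                    best = seq_x[j - new_run[j]:j]
        run = new_run
    return best


seq_x = "GAACTGCACTC"  # horizontal sequence
seq_y = "GTGCACTCT"    # vertical sequence
print("  " + seq_x)
for y, row in zip(seq_y, dot_matrix(seq_x, seq_y)):
    print(y, "".join("*" if hit else "." for hit in row))
print(longest_shared_run(seq_x, seq_y))
```

For the two sequences on the slide, the longest diagonal corresponds to the shared stretch TGCACTC.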

How to identify identical (sub-)sequences? GTCGTTGCAGTGTATTGCGAACTGCACTCTGA GTCGTTGCAGTGTGCACTCTGA

Alignment matrix

Towards an algorithm

• The challenge is to find an algorithm that finds the best alignment between two sequences
• The first thing we need is a scoring system
– Substitution matrix
  • How many points for a match?
  • How much penalty for a mismatch?
– Gap penalty
• These scores are based on a model of evolution:
– How often do we think these events occur?
– More likely events are given higher scores
– Less likely events are given lower scores (higher penalties)

Substitution matrix:
    A   C   G   T
A   1
C  -1   1
G  -1  -1   1
T  -1  -1  -1   1

Gap penalty: -2
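The scoring system can be written down as a small lookup table. This is only a sketch: the names `SUBSTITUTION`, `GAP_PENALTY` and `column_score` are chosen for this example, and the gap penalty is stored as a negative number that is added to the score.

```python
NUCLEOTIDES = "ACGT"

# Substitution matrix from the slide: +1 for a match, -1 for a mismatch.
SUBSTITUTION = {
    (a, b): (1 if a == b else -1)
    for a in NUCLEOTIDES
    for b in NUCLEOTIDES
}

GAP_PENALTY = -2  # added once for every residue aligned to a gap


def column_score(a, b):
    """Score one alignment column; '-' denotes a gap."""
    if a == "-" or b == "-":
        return GAP_PENALTY
    return SUBSTITUTION[(a, b)]


print(column_score("A", "A"), column_score("A", "G"), column_score("A", "-"))
```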

Towards an algorithm

• Then we go through the alignment matrix, cell by cell, and score it:
– If the residue at this position is the same in both sequences, the cell gets a +1 score
– If the residue at this position differs between the sequences, the cell gets a -1 penalty
– Relative to what?
→ Relative to one of the possible previous cells
→ The one that maximizes the alignment score

Substitution matrix:
    A   C   G   T
A   1
C  -1   1
G  -1  -1   1
T  -1  -1  -1   1

Alignment matrix

Towards an algorithm

Alignment matrix:
    a   b
c   U   V
d   W   X

• A given cell (X) can be reached from three directions:
– Diagonally from the upper left: this indicates an aligned residue
  • Score of X = score of U + substitution score (b, d)
– From the cell directly above: this indicates a gap in the horizontal sequence
  • Score of X = score of V – gap penalty
– From the cell directly to the left: this indicates a gap in the vertical sequence
  • Score of X = score of W – gap penalty
• With a gap penalty of -2, a gap step therefore subtracts 2 points
• Every time, we choose the option that leads to the highest alignment score in X
• To identify local alignments, we set the alignment score to zero if it becomes negative, and restart the alignment
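The three options for one cell can be sketched as a single function. The name `fill_cell` and its parameters are hypothetical; the gap penalty is passed as a positive magnitude so that it can be subtracted, as in the formulas on the slide.

```python
def fill_cell(score_u, score_v, score_w, residue_b, residue_d,
              match=1, mismatch=-1, gap=2, local=False):
    """Score of cell X given its neighbours U (diagonal), V (above), W (left).

    residue_b is the residue of the horizontal sequence at X,
    residue_d the residue of the vertical sequence at X.
    """
    diagonal = score_u + (match if residue_b == residue_d else mismatch)
    from_above = score_v - gap  # gap in the horizontal sequence
    from_left = score_w - gap   # gap in the vertical sequence
    best = max(diagonal, from_above, from_left)
    # For local alignment a negative score is reset to zero.
    return max(best, 0) if local else best


print(fill_cell(0, -2, -2, "G", "G"))            # global, matching residues
print(fill_cell(0, 0, 0, "C", "G", local=True))  # local, mismatch
```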

Alignment matrix

Re-thinking the model of evolution

• An indel is an insertion/deletion of a sequence segment
– Indels are usually single evolutionary events
– So you do not want to penalize every residue aligned to a gap

Gap open penalty: -2
Gap extension penalty: 0

• The model of evolution can account for this by differentiating two gap penalties:
– Gap open penalty: high penalty indicating the (un-)likelihood of an indel event in evolution
– Gap extension penalty: zero score or low penalty indicating that the likelihood of the evolutionary event is the same regardless of the length of the indel

Gaps

• Gaps are the result of insertions or deletions in the sequence

Gap open: Total gap penalty: 6 x -3 Gap extension: 5 x -2 -1

• A given insertion or deletion is probably just one evolutionary event, regardless of its size
• Adding a gap penalty for each gap position may decrease the alignment score too much
• This can be solved by using a high penalty for “Gap opening” and a low penalty for “Gap extension”
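A small sketch comparing the two ways of scoring one gap. It assumes the convention that the first gap position costs the gap open penalty and every further position costs the gap extension penalty; other programs define this slightly differently. The default values (-2 per position, or -2 to open and 0 to extend) and the function names are illustrative choices.

```python
def linear_gap_cost(length, gap_penalty=-2):
    """Every gap position gets the same penalty."""
    return length * gap_penalty


def affine_gap_cost(length, gap_open=-2, gap_extension=0):
    """First gap position costs gap_open, each further one gap_extension."""
    if length == 0:
        return 0
    return gap_open + (length - 1) * gap_extension


for length in (1, 3, 6):
    print(length, linear_gap_cost(length), affine_gap_cost(length))
```

With these values a gap of six positions costs -12 under the linear scheme but only -2 under the affine scheme, so one long indel is no longer punished as six separate events.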

Dynamic programming

• The algorithm we have described is called dynamic programming

Align GCCCTAGCG to GCGCAATG.

Substitution matrix:
    A   C   G   T
A   1
C  -1   1
G  -1  -1   1
T  -1  -1  -1   1

Gap penalty: -2

• Many solutions are possible
– The optimal alignment maximizes the alignment score
• Depends on substitution matrix and gap penalty
– You could calculate alignment scores for all possible alignments:

1 + 1 – 1 + 1 – 1 + 1 – 1 – 1 – 2 = -2

– 2 – 1 + 1 – 1 – 1 + 1 – 1 – 1 + 1 = -4

1 + 1 – 1 + 1 – 1 + 1 – 2 – 1 + 1 = 0

1 + 1 – 1 + 1 – 2 – 2 + 1 – 2 – 1 + 1 = -3

Etcetera…
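Scoring one candidate alignment is a simple sum over its columns. The sketch below assumes match +1, mismatch -1 and gap -2; the function name `alignment_score` is made up, and the three gapped rows are example alignments of GCCCTAGCG and GCGCAATG chosen so that their column scores add up to -2, -4 and 0.

```python
def alignment_score(row_x, row_y, match=1, mismatch=-1, gap=-2):
    """Sum the column scores of two aligned rows ('-' marks a gap)."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_x, row_y):
        if a == "-" and b == "-":
            raise ValueError("a column cannot consist of two gaps")
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


print(alignment_score("GCCCTAGCG", "GCGCAATG-"))  # gap at the end
print(alignment_score("GCCCTAGCG", "-GCGCAATG"))  # gap at the start
print(alignment_score("GCCCTAGCG", "GCGCAA-TG"))  # gap inside
```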

An alignment matrix holds all possible alignments

Global alignment

• Needleman-Wunsch algorithm
– Negative alignment matrix cells are allowed
  • So that alignment score can be calculated from start to end of sequence
– Backtrack from last cell
  • Proceed until the start of the sequence
  • Identifies the highest scoring global alignment

Substitution matrix:
    A   C   G   T
A   1
C  -1   1
G  -1  -1   1
T  -1  -1  -1   1

Gap penalty: -2

Alignment matrix:
          G    C    C    C    T    A    G    C    G
     0   -2   -4   -6   -8  -10  -12  -14  -16  -18
G   -2    1   -1   -3   -5   -7   -9  -11  -13  -15
C   -4   -1    2    0   -2   -4   -6   -8  -10  -12
G   -6   -3    0    1   -1   -3   -5   -5   -7   -9
C   -8   -5   -2    1    2    0   -2   -4   -4   -6
A  -10   -7   -4   -1    0    1    1   -1   -3   -5
A  -12   -9   -6   -3   -2   -1    2    0   -2   -4
T  -14  -11   -8   -5   -4   -1    0    1   -1   -3
G  -16  -13  -10   -7   -6   -3   -2    1    0    0

Worked cells:
– First cell (G, G): diagonal 0 + 1 = 1, from above -2 – 2 = -4, from the left -2 – 2 = -4, so the cell gets 1
– Next cell in the same row (C in the horizontal sequence, G in the vertical sequence): diagonal -2 – 1 = -3, from above -4 – 2 = -6, from the left 1 – 2 = -1, so the cell gets -1

Possible alignments

Substitution matrix:
    A   C   G   T
A   1
C  -1   1
G  -1  -1   1
T  -1  -1  -1   1

• Three global alignments are possible
– All three alignments are valid!
• The alignment scores are identical:
1+1-1+1-1+1-2-1+1=0
1+1-1+1-1+1-1-2+1=0
1+1-1+1-2+1-1-1+1=0
• Alignments strongly depend on the substitution matrix!
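A compact sketch of global alignment by dynamic programming in the style of Needleman-Wunsch, assuming match +1, mismatch -1 and a linear gap penalty of -2. The function name and the tie-breaking order in the backtrack (diagonal first, then above, then left) are choices made for this example; a different order would return another of the equally good alignments.

```python
def needleman_wunsch(seq_x, seq_y, match=1, mismatch=-1, gap=-2):
    """Return (score matrix, aligned seq_x, aligned seq_y)."""
    rows, cols = len(seq_y) + 1, len(seq_x) + 1
    score = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        score[0][j] = j * gap  # leading gaps in the vertical sequence
    for i in range(1, rows):
        score[i][0] = i * gap  # leading gaps in the horizontal sequence

    def sub(i, j):
        return match if seq_x[j - 1] == seq_y[i - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Backtrack from the last cell to the start of the sequences.
    row_x, row_y = [], []
    i, j = rows - 1, cols - 1
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            row_x.append(seq_x[j - 1])
            row_y.append(seq_y[i - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_x.append("-")
            row_y.append(seq_y[i - 1])
            i -= 1
        else:
            row_x.append(seq_x[j - 1])
            row_y.append("-")
            j -= 1
    return score, "".join(reversed(row_x)), "".join(reversed(row_y))


matrix, aligned_x, aligned_y = needleman_wunsch("GCCCTAGCG", "GCGCAATG")
print(matrix[-1][-1])
print(aligned_x)
print(aligned_y)
```

For GCCCTAGCG and GCGCAATG the last cell holds the optimal global score of 0, in line with the three equally scoring alignments.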

Local alignment

• Smith-Waterman algorithm
– Negative alignment matrix cells are set to zero
  • So that local alignments can be identified as positive values
– Backtrack from highest cell
  • Proceed until the first zero
  • Identifies the highest scoring local alignment

Substitution matrix:
    A   C   G   T
A   1
C  -1   1
G  -1  -1   1
T  -1  -1  -1   1

Gap penalty: -2

Alignment matrix:
          G    C    C    C    T    A    G    C    G
     0    0    0    0    0    0    0    0    0    0
G    0    1    0    0    0    0    0    1    0    1
C    0    0    2    1    1    0    0    0    2    0
G    0    1    0    1    0    0    0    1    0    3
C    0    0    2    1    2    0    0    0    2    1
A    0    0    0    1    0    1    1    0    0    1
A    0    0    0    0    0    0    2    0    0    0
T    0    0    0    0    0    1    0    1    0    0
G    0    1    0    0    0    0    0    1    0    1

Worked cell:
– First cell (G, G): diagonal 0 + 1 = 1, from above 0 – 2 = -2 (set to 0), from the left 0 – 2 = -2 (set to 0), so the cell gets 1
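A sketch of local alignment in the style of Smith-Waterman, again assuming match +1, mismatch -1 and gap -2. The function name is made up for this example; if several cells share the highest score, this version simply keeps the first one it encounters.

```python
def smith_waterman(seq_x, seq_y, match=1, mismatch=-1, gap=-2):
    """Return (best score, aligned part of seq_x, aligned part of seq_y)."""
    rows, cols = len(seq_y) + 1, len(seq_x) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        return match if seq_x[j - 1] == seq_y[i - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(0,  # negative cells are set to zero
                              score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Backtrack from the highest cell until the first zero.
    row_x, row_y = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(i, j):
            row_x.append(seq_x[j - 1])
            row_y.append(seq_y[i - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            row_x.append("-")
            row_y.append(seq_y[i - 1])
            i -= 1
        else:
            row_x.append(seq_x[j - 1])
            row_y.append("-")
            j -= 1
    return best, "".join(reversed(row_x)), "".join(reversed(row_y))


print(smith_waterman("GCCCTAGCG", "GCGCAATG"))
```

For GCCCTAGCG and GCGCAATG this sketch reports a best local score of 3, with GCG aligned to GCG.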

Exercise

a. Is this a global or a local alignment?
b. What is the name of the algorithm used?
c. What is the gap penalty?
d. Give the substitution matrix.
e. What is the score of the optimal alignment?
f. What is the optimal alignment?

Protein alignments

• Make a global alignment of these two sequences using the BLOSUM62 substitution matrix
– CAPT
– CFT

Gap penalty: -11

Alignment matrix:
          C    A    P    T
     0  -11  -22  -33  -44
C  -11    9   -2  -13  -24
F  -22   -2    7   -4  -15
T  -33  -13   -2    6    1
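The same dynamic programming works for proteins once the match/mismatch rule is replaced by a substitution matrix lookup. The sketch below hard-codes only the BLOSUM62 entries needed for CAPT versus CFT (keyed as vertical residue, horizontal residue) and uses a linear gap penalty of -11; the names `BLOSUM62_SUBSET` and `global_matrix` are made up for this example.

```python
# BLOSUM62 scores for the residue pairs that occur in this exercise only.
BLOSUM62_SUBSET = {
    ("C", "C"): 9, ("C", "A"): 0, ("C", "P"): -3, ("C", "T"): -1,
    ("F", "C"): -2, ("F", "A"): -2, ("F", "P"): -4, ("F", "T"): -2,
    ("T", "C"): -1, ("T", "A"): 0, ("T", "P"): -1, ("T", "T"): 5,
}


def global_matrix(seq_x, seq_y, substitution, gap=-11):
    """Fill a global alignment matrix; seq_x is horizontal, seq_y vertical."""
    rows, cols = len(seq_y) + 1, len(seq_x) + 1
    score = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        score[i][0] = i * gap
        for j in range(1, cols):
            pair = (seq_y[i - 1], seq_x[j - 1])
            score[i][j] = max(score[i - 1][j - 1] + substitution[pair],
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    return score


for row in global_matrix("CAPT", "CFT", BLOSUM62_SUBSET):
    print(row)
```

The last cell gives a global score of 1, which corresponds to aligning CAPT with CF-T (9 - 2 - 11 + 5).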

Visualizing identity and similarity in an alignment

Retinol-binding protein aligned to β-lactoglobulin:

  1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50 RBP
    . ||| | . |. . . | : .||||.:| :
  1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin

 51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97 RBP
    : | | | | :: | .| . || |: || |.
 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93 lactoglobulin

 98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...... QYSC 136 RBP
    || ||. | :.|||| | . .|
 94 IPAVFKIDALNENKVL...... VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin

137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP
    . | | | : || . | || |
136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI...... 178 lactoglobulin

Legend:
– Identical (bar)
– Very similar (two dots)
– Somewhat similar (one dot)
– Not similar (space)

Try it on BABA!

• Basic-Algorithms-of-Bioinformatics Applet
– http://baba.sourceforge.net/
– If your computer does not run the Java Applet, use the standalone runnable version

Multiple sequence alignment

• What if we want to align many sequences, for example a homologous gene in several animals?
• Option: dynamic programming in multiple dimensions

[Figure: three-dimensional alignment matrix with one sequence (S E Q U E N C E) along each of the three axes, numbered 1, 2 and 3]

• This algorithm is inefficient, because the size of the matrix (and thus the number of computational steps) scales exponentially with the number of sequences
– A matrix for 10 proteins of 100 residues is 100^10 = 10^20 cells in size
– Storing this in RAM would require about 100 million computers
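The size of a multi-dimensional alignment matrix is easy to check with a few lines of arithmetic. The memory figures below are assumptions for illustration only (one byte per cell, 1 TB of RAM per computer); with them the estimate of roughly 100 million computers is reproduced. The extra row and column for leading gaps are ignored.

```python
def matrix_cells(n_sequences, length):
    """Cells in a full dynamic programming matrix with one axis per sequence."""
    return length ** n_sequences


BYTES_PER_CELL = 1           # assumption: one byte per stored score
RAM_PER_COMPUTER = 10 ** 12  # assumption: 1 TB of RAM per computer

cells = matrix_cells(10, 100)
computers = cells * BYTES_PER_CELL // RAM_PER_COMPUTER

print(cells == 10 ** 20)
print(computers)

# Pairwise alignment of the same proteins needs only 100 x 100 cells.
print(matrix_cells(2, 100))
```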

Progressive multiple sequence alignment

• Algorithm goes through a series of pairwise alignments
• You first need a guide tree that indicates how similar/different the sequences are to each other
– A guide tree is not a phylogenetic tree
  • Phylogenetic trees show evolutionary history, guide trees only show similarity
  • You need an alignment first before you can create a phylogenetic tree
– Align the most similar pair of sequences first, and then progressively align more divergent sequence pairs
– Create a sequence profile to summarize the already-aligned sequences
– Iterate these two steps until all sequences are aligned
• This algorithm is efficient, because the number of alignment steps scales linearly with the number of sequences (N – 1 pairwise alignments for N sequences)
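One simple way to summarize already-aligned sequences is a profile that stores, for every alignment column, the fraction of each residue or gap. This is only a sketch of that idea: the function name `build_profile` and the toy rows are made up, and real alignment programs use more elaborate, weighted profiles.

```python
from collections import Counter


def build_profile(aligned_rows):
    """Return one {symbol: fraction} dict per alignment column."""
    if len({len(row) for row in aligned_rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    profile = []
    for column in zip(*aligned_rows):
        counts = Counter(column)
        profile.append({symbol: n / len(column) for symbol, n in counts.items()})
    return profile


# Toy example: two sequences that have already been aligned to each other.
for position, column in enumerate(build_profile(["GCG-", "GCGA"]), start=1):
    print(position, column)
```

The next, more divergent sequence would then be aligned against this profile instead of against a single sequence.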

Some useful programs

• Using existing bioinformatic programs is recommended because it makes your analysis reproducible
• Programs to align sequences
– Clustal Omega
– T-Coffee
– MAFFT
– Muscle
• Programs to view alignments
– Clustal
– Jalview
– Seaview

Warning!

Input: unaligned sequences → Alignment program → Output: the optimal alignment

• Most computer programs will always output a result
• If sequences are not homologous then it does not make any biological sense to align them: this is WRONG!
– …even though an optimal alignment exists
– An optimal alignment can always be calculated, even when sequences are not homologous
• We have to use sequence alignment in different ways:
1. First, we use alignment to discover if two sequences are likely homologous
2. Only if they are homologous, then we use alignment:
a) To identify how they evolved (which mutations occurred?)
b) To quantify evolutionary relationships in terms of sequence similarity/divergence

Alignment files

• Alignments can be stored in Fasta format
– Other formats are also possible, check files in a plain text editor
• Alignment files can easily be spotted when opened in a plain text editor:
– Some of the sequences contain gap characters: "-" representing absent residues
– So that all sequences have exactly the same length

>protein_sequence_A
MTQSHHHVAA FDLGSSIRQE GLTET----- --DPNRAEIG TFGI
>protein_sequence_B
MTQSSHHVAA FDLGAALHQE GLTETDYSEV QRDPNRAEVG TFGV
>protein_sequence_C
------AVAA FDLGAALRQE GLTETDYAEI QRDPNHAELG TF--

As in Fasta files, spaces and newlines just make sequences easier to read; they do not have any meaning
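Recognizing an alignment file can be automated with a small check: read the Fasta records, drop all whitespace, and test whether every sequence has the same length. The function names and the two toy inputs are made up for this sketch; it checks the format only and says nothing about whether the alignment is biologically meaningful.

```python
def read_fasta(text):
    """Parse Fasta text into {name: sequence}, ignoring spaces and newlines."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:  # sequence lines before any header are skipped
            records[name] += "".join(line.split())
    return records


def is_alignment(records):
    """True if there are at least two sequences and all have equal length."""
    lengths = {len(seq) for seq in records.values()}
    return len(records) > 1 and len(lengths) == 1


ALIGNED = """>seq_A
MTQSHH HVAA
>seq_B
MTQS-- HVAA
"""
UNALIGNED = """>seq_A
MTQSHHHVAA
>seq_B
MTQSHVAA
"""
print(is_alignment(read_fasta(ALIGNED)))
print(is_alignment(read_fasta(UNALIGNED)))
```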

Bioinformatic considerations

• Optimal alignment
– This is just the alignment with the highest possible score
– … which strongly depends on the substitution matrix and gap penalties
→ This means it depends on a specific model of evolution
• Optimal alignment is not necessarily the most meaningful
– Substitutions and gaps are not equally frequent at all positions
– Gap penalties do not model insertion/deletion events well
• Sometimes manual curation is necessary
– Inspecting and adjusting the alignment by hand
– This is not reproducible, so use manual curation only in special cases if no automated option is available

Assumptions of sequence alignment

• “Positions in the sequence mutate independently”
• “The mutation rate is identical for all positions in the sequence”
• “The mutation rate is constant in time and in different species and lineages”
• “The nucleotide/amino acid composition is stable”

THE ASSUMPTIONS ARE NOT ALWAYS TRUE!
• …because the residues of a gene/protein interact to perform function
• …because the effect of a mutation on fitness (and thus on the rate of evolution) differs per position in the sequence and per species
– … and even per moment in time and location in space
→ it depends on the interaction of an organism and its proteins with the environment
