Sequence Similarity Methods

Gloria Rendon

SC11 – Education

June, 2011

Sequence Similarity Methods - caveats

• Assumption 1: Genes of closely related species are more similar than genes of distantly related species.
• Assumption 2: Similar genes have similar sequences.
• These methods predict the amount of evolution among species solely in terms of mutation events observed in the sequences of their genes.

The General Algorithm...

Step 1. COLLECT. Sequences are gathered

Step 2. COMPARE. Sequences are compared for similarity

Step 3. SCORE. A score is computed to assess significance of results

Step 4. CLUSTER. A matrix of sequence similarity is computed

Step 5 (Opt). A phylogenetic tree is reconstructed from the matrix

Types of Similarity-Based Methods

•Alignment-free Methods:

o Based on k-word frequency
o Based on Structural alignment
o Based on Hidden Markov models
o Others

•Based on Sequence alignment

Alignment-based Methods

A sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.

A sequence alignment is a scheme of writing one sequence on top of another where the residues in one position are deemed to have a common evolutionary origin. If the same letter occurs in both sequences, then this position has been conserved in evolution. If the letters differ, it is assumed that the two derive from an ancestral letter (which could be one of the two or neither).

Alignment Representation

[Figure labels: Sequence Name, Sequence, Alignment Length]

Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix.

Point Mutations

•ONLY these types of point mutation events are considered by alignment-based methods: insertion, deletion, substitution.
•Homologous sequences may have different lengths, though, which is generally explained through insertions or deletions in sequences.
•Thus, a letter or a stretch of letters may be paired up with dashes in the other sequence to signify such an insertion or deletion.
•The term given to those dashes is indel or gap.

Gaps in Alignments

One gap opening and two gap extensions

Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns.

Gaps represent
a) deletion or insertion events
b) sites with missing information

There are two types of gaps (from the point of view of the aligning algorithm): gap opening and gap extension. They are weighted differently by the algorithm.

SNPs are a special case of point mutations

SNPs (single nucleotide polymorphisms)
•Copying errors during cell division result in variations in the DNA at a particular location.
•These copying errors are point mutations called single nucleotide polymorphisms, or SNPs.
•SNPs are passed on to the next generation through inheritance.

Role of SNPs
•In humans, SNPs account for much of the genetic diversity.
•Certain genetic diseases have been linked to SNPs.
•However, many SNPs do not result in observable differences.

Point Mutation Analysis

The reason for aligning sequences when trying to elucidate their evolutionary relationship is that algorithms can calculate an estimate of their evolutionary distance from the alignment.

These methods are based on Levenshtein’s notion of edit distance between strings:

“Edit distance is the minimum number of edit operations needed to transform one string into another.”
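The definition above can be sketched in a few lines of Python. This is a minimal illustration of the classic dynamic-programming computation, assuming every insertion, deletion, and substitution costs 1; the function name `edit_distance` is chosen here for illustration and is not part of any tool used in these slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to transform string a into string b (unit cost each)."""
    # prev[j] = distance between the letters of a processed so far
    # (before the current one) and the first j letters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning i letters of a into an empty string: i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("GCATCGGC", "GCATCGGA"))
```

Identical strings give a distance of 0, and each additional point mutation can raise the distance by at most 1.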

“The more similar the sequences are, the smaller their edit distance is”

Types of Alignment-based Methods

•Global alignment is when matching is attempted on the entire length of the sequences. This is usually the choice when aligning very similar sequences.
•Local alignment is when matching is done for specific segments of the sequences. This is usually the choice when it is believed that the sequences contain conserved regions.

•Earlier we used BLAST to search for a sequence given a partial segment of it.
•BLAST will try both global as well as local alignments and will report the best matches of them all.
•Re-examine the results page and find out which type of alignment performed best in this case.

Let us re-examine the portion of this page that displays the alignment --marked with 3

There are three rows.

The numbers in the left column specify the starting position.
The numbers on the right specify the ending position.

The first row is the partial sequence you typed, named Query.
The third row is the sequence it is being matched against; in this case P46098.

The second row is the result of the alignment between the top and bottom sequences.
The match is exact at every position.

Types of Alignment-based Methods

•Pair-wise alignment. Two sequences are aligned together.
•Multiple sequence alignment. Three or more sequences are aligned together.

Pairwise Alignment

Illustrated with BLAST and 18S ribosomal RNA sequence

Pair-wise Alignment

1. Collect the two sequences

2. Align the sequences

3. Count the mutations in the alignment

4. Score the alignments

Pair-wise Alignment

>seq2|LemnaMinor_18S_rRNA
CTCCTACCGATTGAATGGTCCGGTGAAGCGCTCGGATCGCGG
CGACGAGGGCGGTCCCCCGCCCGCGACGTCGCGAGAAGTCCG
TTGAACCTTATCATTTAGAGGAAGGAG

The first sequence is displayed above.

To get the second sequence and perform the alignment, we simply use BLAST.

Go to the BLAST page at NCBI .ncbi.nlm.nih.gov

Then click on nucleotide blast

Pair-wise Alignment

This is the nucleotide blast page at NCBI

Paste the sequence in the box

Select a database from the drop-down list; in this case, choose Nucleotide collection

Scroll to the bottom of the page and click on the Blast button

Pair-wise Alignment

This is the results page of the Blast search.

The top hit is our original sequence.

It is listed in the table along with some statistics.

Let’s look under the hood to understand what happened and how the stats were calculated.

Pair-wise Alignment

If you scroll down the same results page, you will see the results of all the pairwise alignments that BLAST included in the report.

They are sorted from best alignment (first one in the report) to worst alignment (last one in the report).

This is the first one; therefore, it is the best match.

Pair-wise Alignment

Steps 3 and 4 are performed after the alignment, in order to assess how good a match it is.

First, we need to count mismatches in the alignment.

Counting Mismatches (mutations)

Cell (T, T) = number of unchanged T residues = 1
Cell (T, G) = number of substitutions from T to G
Cell (T, C) = number of substitutions from T to C
Cell (T, A) = number of substitutions from T to A
Cell (T, -) = number of deletions of T
...
Cell (-, T) = number of insertions of T
Cell (-, G) = number of insertions of G
Cell (-, C) = number of insertions of C
Cell (-, A) = number of insertions of A = 0

Pair-wise Alignment

Not all mismatches are created equal. Some substitutions are more likely than others; therefore, we must use weight values, such as those in substitution matrices.

Scoring the alignments
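Steps 3 and 4 (count the mutations, then weight them) can be sketched in Python as follows. This is only an illustration: the two aligned rows are made up, and the weights in `pair_weight` (+1 for a match, -1 for a substitution, -2 for a gap) are hypothetical stand-ins for a real substitution matrix, which would supply a separate weight for every cell.

```python
from collections import Counter


def count_pairs(top: str, bottom: str) -> Counter:
    """Count how often each (top letter, bottom letter) pair occurs
    in the columns of a pairwise alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    return Counter(zip(top, bottom))


def pair_weight(x: str, y: str) -> int:
    """Hypothetical weights: +1 match, -1 substitution, -2 gap."""
    if x == "-" or y == "-":
        return -2
    return 1 if x == y else -1


def score_alignment(top: str, bottom: str) -> int:
    """Multiply each cell count by its weight and add everything up."""
    counts = count_pairs(top, bottom)
    return sum(n * pair_weight(x, y) for (x, y), n in counts.items())


if __name__ == "__main__":
    counts = count_pairs("TGCA-T", "TGAATT")
    print(counts[("T", "T")])   # Cell (T, T): unchanged T residues
    print(counts[("-", "T")])   # Cell (-, T): insertions of T
    print(score_alignment("TGCA-T", "TGAATT"))
```

The `Counter` plays the role of the table of cells, and the final sum is the single score reported for the alignment.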

Note that the result is a single value, a score, obtained by multiplying the alignment matrix and the substitution matrix cell by cell and adding up the values of the resulting matrix, as shown here. So, now you have a clearer idea of what goes on under the hood of pairwise-alignment tools like BLAST.

Exercise2: Using BLAST to transfer annotation

Sometimes we have a gene (or protein) for which an annotation (the description line in FASTA format) is unknown; for example, when a new genome is being sequenced.

The general ‘in-silico’ procedure for assigning an annotation to that newly sequenced gene (or protein) calls for using BLAST to find a similar gene (or protein) for which the annotation is known.

If the match is close enough, we can then transfer the annotation from the known gene (or protein) to the new one.

•Open a web browser and go to the UNIPROT url www..org
1. Click on the Blast tab
2. In the box type the identifier: A7JKN7_FRANO
3. Then click on the BLAST button

Notice how the UniProt-Blast program fetches the corresponding sequence before launching the BLAST search. Also notice that the annotation (description line) is unknown.

This is the BLAST result page. The first and second hits do not have annotations either. The third hit is annotated as Neurotransmitter-gated ion-channel. So, at first blush, we could transfer that annotation to the protein A7JKN7_FRANO.

Exercise3: GLOBAL Pairwise alignment program

• Open a web browser and go to the MOBYLE portal: mobyle.pasteur.fr/
• Choose Programs/Alignment/pairwise/global/needle from the Programs box (left)
• Copy-paste any two sequences from the file woese.seqs.fasta
• Select the parameters: gap penalty=5, gap extension=0.2
• Click on Run
• A job will be created to run this program with your data
• Once the job is done we can view the results
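The needle program computes a global alignment in the style of Needleman and Wunsch. The sketch below shows the idea in Python under a simplifying assumption: it uses one linear gap penalty per gap position instead of the separate gap opening and gap extension penalties set in the exercise. The function name and the scoring values (+1, -1, -2) are illustrative and are not needle's defaults.

```python
def needleman_wunsch(a: str, b: str, match: int = 1,
                     mismatch: int = -1, gap: int = -2):
    """Global alignment of a and b with a linear gap penalty.
    Returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def sub(i: int, j: int) -> int:
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one best alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    s, row_a, row_b = needleman_wunsch("GATTACA", "GATACA")
    print(s)
    print(row_a)
    print(row_b)
```

Because the whole length of both sequences must be covered, any difference in length has to be paid for with gaps.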

• Q: how many gaps were inserted?
• Q: what is the score of the alignment?
• Q: what is the percent identity of the alignment?

Exercise4: LOCAL Pairwise alignment program

• Open a web browser and go to the MOBYLE portal: mobyle.pasteur.fr/
• Choose Programs/Alignment/pairwise/local/water from the Programs box (left)
• Copy-paste THE SAME two sequences from the example we just finished
• Select the parameters: gap penalty=5, gap extension=0.2
• Click on Run
• A job will be created to run this program with your data
• Once the job is done we can view the results
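The water program computes a local alignment in the style of Smith and Waterman. A minimal Python sketch of the idea follows, again assuming a single linear gap penalty rather than the opening and extension penalties of the exercise; the function name and scoring values are illustrative only. The key difference from the global case is that a cell's score is never allowed to drop below zero, so the alignment can start and end anywhere.

```python
def smith_waterman(a: str, b: str, match: int = 1,
                   mismatch: int = -1, gap: int = -2):
    """Best local alignment of a and b with a linear gap penalty.
    Returns (score, aligned_segment_of_a, aligned_segment_of_b)."""
    n, m = len(a), len(b)

    def sub(i: int, j: int) -> int:
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score falls to zero.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(smith_waterman("TTTGATTACATTT", "CCCGATTACACCC"))
```

In the usage example only the shared core of the two made-up sequences is reported; the unrelated flanks are left out of the alignment instead of being padded with gaps or mismatches.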

• Q: how many gaps were inserted?
• Q: what is the score of the alignment?
• Q: is the local alignment identical to the global alignment of the previous exercise? Explain.

Multiple Sequence Alignment

Multiple Sequence Alignment (MSA)

•Multiple sequence alignment methods try to align a group of three or more related sequences at once.

•MSAs are often used in identifying conserved regions across a group of sequences hypothesized to be evolutionarily related.

•MSAs are also used to aid in establishing evolutionary relationships by constructing phylogenetic trees.

•MSA algorithms are more complex (in both time and space) than pairwise alignment algorithms.

•A multiple alignment arranges a set of sequences in a scheme where positions believed to be homologous are written in a common column.
•Like in a pairwise alignment, when a sequence does not possess an amino acid in a particular position, this is denoted by a dash (indel, gap).
•The scoring function calculates ‘the similarity’ of a sequence in relationship to the entire group.

Conservation in a MSA

•In addition to the alignment itself, a line is added at the end with information about the degree of conservation for each position (i.e. column)

No symbol: there is no conservation in the column
*  exact match of the residue for all sequences
:  high degree of conservation; the mutations were for residues of similar biochemical properties, with letters of the same color
.  there is conservation among the majority of the sequences; however, there were mutations for residues of a different group

Exercise 4: Using a MSA with the Eight Species Solar System

• Open a web browser and go to MOBYLE portal: mobyle.pasteur.fr/
• Choose Alignment/multiple/clustalw-multialign from the Programs box (left)
• Select Upload to copy the sequences from the file 8species.seqs.fasta on your computer to the portal
• Click on Run
• A job will be created to run this program with your data
• Once the job is done we can view the results: alignment, tree, output

Clustalw Results: aln file

• The first segment of the ClustalW results page shows the alignment itself as shown below
• To see additional information about conservation, please click on ‘view with ’

Clustalw Results: aln file and consensus

The consensus sequence refers to the most common residue (nucleotide or amino acid) at a particular position after a MSA has been calculated.
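The definition above translates directly into a column-by-column count. The Python sketch below is an illustration only: the four aligned rows are made up (they are not the eight imaginary species), gap characters are counted like any other symbol, and ties are broken in favor of the residue seen first in the column.

```python
from collections import Counter


def consensus(aligned):
    """For each column of a multiple alignment, return the most common
    residue and the fraction of sequences that carry it."""
    width = len(aligned[0])
    if any(len(row) != width for row in aligned):
        raise ValueError("all aligned rows must have the same length")
    result = []
    for column in zip(*aligned):
        residue, count = Counter(column).most_common(1)[0]
        result.append((residue, count / len(aligned)))
    return result


if __name__ == "__main__":
    rows = ["ATGA", "ATGC", "TTGA", "ATCA"]  # hypothetical alignment
    profile = consensus(rows)
    print("".join(residue for residue, _ in profile))
    print([fraction for _, fraction in profile])
```

The fractions are the per-column frequencies of the most common residue, which is the quantity a conservation histogram displays.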

The consensus sequence for the eight imaginary species is:

A – T – A G A G

The most conserved positions are
T in the third position (6/8)
A in the fifth position (6/8)

Hence the height of the bars in the histogram denotes frequencies of the most common residues at that location.

Clustalw Results: tree file and output file

•We will skip the tree for now.

•Let us examine the output file. Click here

ClustalW Results: Output file
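The output file reports one line per pair of sequences, of the form `Sequences (1:2) Aligned. Score: 85`. A small Python sketch for collecting such lines into the symmetric similarity matrix of Step 4 is given below. The regular expression and the function name are assumptions made for this illustration, and the diagonal is filled with 100 on the assumption that a sequence is scored as fully similar to itself.

```python
import re

# Matches lines such as "Sequences (1:2) Aligned. Score: 85"
PAIR_LINE = re.compile(r"Sequences \((\d+):(\d+)\) Aligned\. Score:\s*(\d+)")


def score_matrix(output_text: str, n: int):
    """Build an n x n symmetric matrix from ClustalW pairwise score lines.
    Pairs not mentioned in the text keep a score of 0."""
    matrix = [[100 if i == j else 0 for j in range(n)] for i in range(n)]
    for first, second, value in PAIR_LINE.findall(output_text):
        i, j = int(first) - 1, int(second) - 1  # ClustalW numbers from 1
        matrix[i][j] = matrix[j][i] = int(value)
    return matrix


if __name__ == "__main__":
    sample = """Sequences (1:2) Aligned. Score: 85
Sequences (1:3) Aligned. Score: 42
Sequences (2:3) Aligned. Score: 42"""
    for row in score_matrix(sample, 3):
        print(row)
```

Pairs with high scores in the matrix are the natural candidates to be grouped first when clustering.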

Input

CLUSTAL 2.0.12 Multiple Sequence Alignments
Sequence format is Pearson
Sequence 1: 1 7 bp
Sequence 2: 2 7 bp
Sequence 3: 3 7 bp
Sequence 4: 4 7 bp
Sequence 5: 5 7 bp
Sequence 6: 6 7 bp
Sequence 7: 7 7 bp
Sequence 8: 8 7 bp
Start of Pairwise alignments
Aligning...

Pair-wise alignment scores

Sequences (1:2) Aligned. Score: 85
Sequences (1:3) Aligned. Score: 42
Sequences (1:4) Aligned. Score: 28
Sequences (1:5) Aligned. Score: 57
Sequences (1:6) Aligned. Score: 71
Sequences (1:7) Aligned. Score: 28
Sequences (1:8) Aligned. Score: 42
Sequences (2:3) Aligned. Score: 42
Sequences (2:4) Aligned. Score: 28
Sequences (2:5) Aligned. Score: 42
Sequences (2:6) Aligned. Score: 42
Sequences (2:7) Aligned. Score: 28
Sequences (2:8) Aligned. Score: 28
Sequences (3:4) Aligned. Score: 85
Sequences (3:5) Aligned. Score: 42
Sequences (3:6) Aligned. Score: 42
Sequences (3:7) Aligned. Score: 57
Sequences (3:8) Aligned. Score: 57
Sequences (4:5) Aligned. Score: 57
Sequences (4:6) Aligned. Score: 57
Sequences (4:7) Aligned. Score: 71
Sequences (4:8) Aligned. Score: 71
Sequences (5:6) Aligned. Score: 85
Sequences (5:7) Aligned. Score: 57
Sequences (5:8) Aligned. Score: 57
Sequences (6:7) Aligned. Score: 71
Sequences (6:8) Aligned. Score: 57
Sequences (7:8) Aligned. Score: 85
Guide tree file created: [8planets.dnd]

Clustering

Start of Multiple Alignment
Aligning...
Group 1: Sequences: 2 Score:114
Group 2: Sequences: 2 Score:114
Group 3: Sequences: 4 Score:69
Group 4: Sequences: 2 Score:114
Group 5: Sequences: 2 Score:123
Group 6: Sequences: 4 Score:83
Group 7: Sequences: 8 Score:59
Alignment Score 209
-Alignment file created [8planets.aln]

Exercise 5: Using a MSA program to re-discover the Three Kingdoms of C. Woese

• Open a web browser and go to MOBYLE portal: mobyle.pasteur.fr/
• Choose Alignment/multiple/clustalw-multialign from the Programs box (left)
• Select Upload to copy the sequences from the file woese.seqs.fasta on your computer to the portal
• Click on Run
• A job will be created to run this program with your data
• Once the job is done we can view the results: alignment, tree, output

Clustalw Output Results

Sequence 1: Methanosarcina_barkeri 1262 bp
Sequence 2: Methanothermobacter_thermau 1494 bp
Sequence 3: Methanobrevibacter_ruminant 1260 bp
Sequence 4: Methanococcus_maripaludis_C6 1465 bp
Sequence 5: Lemna_minor_chloroplast 1487 bp
Sequence 6: Aphanocapsa_sp._HBC6 1441 bp
Sequence 7: Corynebacterium_diphtheriae 712 bp
Sequence 8: Bacillus_firmus_strain_QJGY2 746 bp
Sequence 9: Chloribium_vibrioforme__Pros 1243 bp
Sequence 10: Escherichia_coli_HS 1542 bp
Sequence 11: Mus_musculus_L_cell 918 bp
Sequence 12: Lemna_minor_18S_rRNA 111 bp
Sequence 13: Saccharomyces_cerevisiae_str 1730 bp

The table was constructed with the file shown in the window “Standard Output”.

Q: Cluster these results. Do they fall into three groups?

Types of Similarity-Based Methods

•Alignment-free Methods:

o Based on k-word frequency
o Based on Structural alignment
o Based on Hidden Markov models
o Others

•Based on Sequence alignment

K-word

A k-word (or k-tuple) is a sequence of length k.

A sequence of length n has n – k + 1 k-words.
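The count n - k + 1 follows from sliding a window of width k along the sequence one position at a time. A minimal Python sketch, with function names chosen here purely for illustration; the second helper also records the 1-based start positions of every k-word.

```python
def kwords(seq: str, k: int):
    """All overlapping k-words of seq, in order of appearance.
    A sequence of length n yields n - k + 1 of them (none if k > n)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kword_positions(seq: str, k: int):
    """Map each distinct k-word to the 1-based positions where it starts."""
    positions = {}
    for start, word in enumerate(kwords(seq, k), start=1):
        positions.setdefault(word, []).append(start)
    return positions


if __name__ == "__main__":
    words = kwords("ACGTAC", 2)
    print(len(words), words)
    print(kword_positions("ACGTAC", 2))
```

A k-word that occurs more than once simply gets several start positions in its list.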

Example query string L: TGATGATGAAGACATCAG

For k = 8, the set of k-tuples of L is

TGATGATG
GATGATGA
ATGATGAA
TGATGAAG
…
GACATCAG

K-word Lists

Consider the k-words when k=2 and L=GCATCGGC:

GC, CA, AT, TC, CG, GG, GC

AT: 3 → means the k-word AT in sequence L starts at position 3
CA: 2
CG: 5
GC: 1, 7
GG: 6
TC: 4

K-word Frequency Methods

Goal: Find common k-words in a group of sequences that have statistical significance.

Let us illustrate this statement with an example in natural language.

One English scholar tries to determine if a newly found manuscript was written by Shakespeare. He compares one page from the new manuscript against a page from one of Shakespeare’s works.

The top k-word of length 4 in common between the two pages is THOU.

Is this k-word statistically meaningful?

K-word Frequency Methods

• Based on Euclidean Distance
• Based on Weighted Euclidean Distance
• Based on Correlation
• Based on Covariance
• Based on Information Content
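As an illustration of the first item in the list, the Python sketch below compares two sequences without aligning them: each sequence is reduced to a table of k-word frequencies, and the Euclidean distance between the two tables is computed. Normalizing counts to relative frequencies is an assumption made here so that sequences of different lengths can be compared; the function names are illustrative.

```python
from collections import Counter
from math import sqrt


def kword_frequencies(seq: str, k: int):
    """Relative frequency of every k-word that occurs in seq."""
    total = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(total))
    return {word: n / total for word, n in counts.items()}


def euclidean_distance(seq_a: str, seq_b: str, k: int) -> float:
    """Euclidean distance between the k-word frequency tables of two
    sequences; a word missing from one sequence counts as frequency 0."""
    fa = kword_frequencies(seq_a, k)
    fb = kword_frequencies(seq_b, k)
    words = set(fa) | set(fb)
    return sqrt(sum((fa.get(w, 0.0) - fb.get(w, 0.0)) ** 2 for w in words))


if __name__ == "__main__":
    print(euclidean_distance("GCATCGGC", "GCATCGGC", 2))
    print(euclidean_distance("GCATCGGC", "GCATGGGC", 2))
```

Identical sequences are at distance 0, and the distance grows as the k-word compositions diverge. The weighted variant in the list would multiply each squared difference by a per-word weight.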

Algorithm using K-word Frequency to determine sequence similarity

• Collect sequences
• Calculate meaningful k-words
• Identify k-words in sequences
• Catalog k-words
• Score significance
• Cluster sequences into similar groups

The same algorithm in graphical form

1. Collect seqs
2. Calc k-words
3. Search k-words
4. Catalog k-words
5. Score
6. Cluster

Step 2: Alternatives for K-words

•Example 1: Use Interpro, a database of already calculated k-words, called protein functional domains. Results may overlap

•Example 2: Use tools such as MEME to calculate ‘de-novo’ k-words from a training set the user specifies. Results do not overlap.

Exercise 1: Using INTERPRO

With the following exercise, we will go to INTERPRO, a site whose entire database is made up of k-words called protein functional domains.

A functional domain is a segment of the protein that generally has a very specific function and/or structure.

INTERPRO’s search engine takes as input a single sequence (of a protein) and tries to match it against its catalog of functional domains. The k-words may overlap.

1. Open a web browser and retrieve the sequence of the protein that we found in Scenario 3 of the previous section by pasting this link in the browser: http://www.uniprot.org/uniprot/P46098.fasta
2. Copy this sequence to the clipboard (Ctrl-A Ctrl-C)
3. Open another tab in the browser and type this link to go to the url of INTERPRO: http://www.ebi.ac.uk/Tools/pfa/iprscan/
4. Paste the sequence from the clipboard into the box provided in this page for the query sequence (Ctrl-V)
5. Scroll down to the bottom of the page, leaving all parameters unchanged with default values, and click on the Submit button
6. Examine the results page; it should look similar to the figure in the next page.
7. How many k-words were found in this protein?
8. What score did each k-word receive?
9. Are there any k-words in the specific segment of the sequence that the paper discussed and that we used in Scenario 3 to look for the entire sequence?

[Figure: results page, labeled with k-word identifiers, k-word alias(es), and k-word location and length]

Some k-words DO overlap

The INTERPRO k-words found were:

IPR006029
IPR006201
IPR006202
IPR008132
IPR008133
IPR018000

Now, we can search for proteins with those k-words.

Go to the uniprot.org page and type in the query box:

(IPR006029 AND IPR006201 AND IPR006202 AND IPR008132 AND IPR008133 AND IPR018000)

Then click on the Search button.

It should look like the figure here with 19 hits.

Uniprot DOES NOT calculate a score for us.

However, we can go ahead and cluster the 19 hits into a single group since they all have the same k-words and are similarly annotated as serotonin receptors.

Exercise 2: Using MEME

With the following exercise, we will use MEME to discover motifs in a set of proteins in a de-novo way.

MEME will start with a training set that you provide and will identify meaningful k-words called MEME motifs.

The training set is a group of scorpion neurotoxin sequences. From prior knowledge of this group of sequences, we know which residues are conserved and play a key role.

Therefore, the exercise will focus on adjusting the parameters of the MEME tool so that the resulting motifs will include all those key residues.

Key residues are marked in this figure: the cysteines that form the disulfide bridges; the residues R, G, K in positions 28, 29, 30; and Y in position 39.

1. Open a web browser and go to the MEME web server at http://meme.nbcr.net/

2. Scroll down to the programs and click on the MEME icon; it will take you to the MEME Data Submission Form

3. In the segment of the page marked as 1, you need to include the training set. It is the file called toxins.fasta

4. In the segment of the page marked as 2, you need to specify occurrences, or repetitions, of the k-word PER sequence. Do not change the default value.

5. In the segment of the page marked as 3, you need to specify the width of the k-word. For a fixed width you type the same value in Minimum and Maximum. For variable width you type the limits of the window. Try these values: 5-10

6. In the segment of the page marked as 4, you need to specify the maximum number of motifs. Type 10

7. Leave the other parameters unchanged. Scroll to the bottom of the page and click on Start search.

8. Results will be emailed to you.

Exercise 2: MEME 5-10-10

Let us examine the k-words found by MEME.

They are ordered by statistical significance; hence, motif-1 is the most significantly conserved segment in the training set and motif-n is the least significantly conserved segment.

This is the WEBLOGO representation of motif-1 that consists of 10 residues. The height of the residue is proportional to its occurrence frequency in the training set.

In position 1, a K is completely conserved
In position 2, a C is completely conserved
In position 3, an M is completely conserved
In position 4, it could be N or G; however, N is more likely than G
Etc.

Now, let us scroll to the bottom of the page to see the diagram of motifs.

For each sequence we observe k-words and gaps. The fewer the gaps, the better the coverage. So, there is good coverage in this figure.

From prior knowledge of this group, we know it to be a set of relatively conserved motifs.

The more sequences with the same diagram of motifs, the more similar the sequences are.

The trouble with this diagram is that it gives the impression that the sequences are not that similar to each other.

Repeat the same steps with these other parameter values

For width: minimum 5, maximum 20
For number of motifs: 5

Examine the resulting motifs and choose which one is better.

Exercise 2: MEME 5-20-5

This is the WEBLOGO on motif-1. Compared to motif-1 in the previous run, we can see that:

•This one has twice as many residues
•This one includes more key residues (all the tall Cs and G)
•There is more variability in other parts of the motif (more letters per column)

Now, let us scroll to the bottom of the page to see the diagram of motifs.

This diagram looks much better than in the previous run:

•There are few gaps in the diagram for this run too.
•The diagram of motifs looks similar for many more sequences in this run than in the previous run.
•Conservation or similarity among sequences by virtue of having almost identical motif diagrams can be appreciated better in this run than in the previous one.

So, we have FIVE k-words found by MEME.

Now, we need to search for similar sequences.

The email you received from MEME specifies a number of links to the results. Click on the link MEME output as html.

On the MEME results page, scroll down until you see the section called Further Analysis.

Click on the MAST button.

MAST is the search engine that looks for sequences that match any of the k-words discovered with MEME.

K-words from MEME will be used as input here.

Select database to search

Then click on Search button.

Exercise 2: MAST results

Among the 135 hits, those that have a motif diagram similar to any diagram in the training set would be the sequences most similar to the neurotoxins.

There exist formulas that use the motif diagram and e-value to perform scoring and clustering of the results.

Additional Readings

• Online lecture notes on Bioinformatics: Lectures.molgen.mpg.de/online_lectures.html
• Vinga S, Almeida J. Alignment-free sequence comparison--a review. Bioinformatics 2003; 19:513-23.
• Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 1970; 48:443-53.
• Smith TF, Waterman MS. Identification of common molecular subsequences. Journal of Molecular Biology 1981; 147:195-7.
• Gotoh O. An improved algorithm for matching biological sequences. Journal of Molecular Biology 1982; 162:705-8.
• Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucl Acids Res 1994; 22:4673-80.
• Mulder N, Apweiler R. InterPro and InterProScan: tools for protein sequence classification and comparison. Methods Mol Biol 2007; 396:59-70.