Sequence Comparison Methods

Sequence Similarity Methods
Gloria Rendon
SC11 – Education, June 2011

Sequence Similarity Methods - caveats
•Assumption 1: Genes of closely related species are more similar than genes of distantly related species.
•Assumption 2: Similar genes have similar sequences.
•These methods predict the amount of evolution among species solely in terms of the mutation events observed in the sequences of their genes.

The General Algorithm
Step 1. COLLECT. Sequences are gathered.
Step 2. COMPARE. Sequences are compared for similarity.
Step 3. SCORE. A score is computed to assess the significance of the results.
Step 4. CLUSTER. A matrix of sequence similarity is computed.
Step 5 (optional). A phylogenetic tree is reconstructed from the matrix.

Types of Similarity-Based Methods
•Alignment-free methods:
oBased on k-word frequency
oBased on structural alignment
oBased on hidden Markov models
oOthers
•Based on sequence alignment

Alignment-based Methods
A sequence alignment is a way of arranging the sequences of DNA, RNA, or proteins to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.
A sequence alignment is a scheme of writing one sequence on top of another where the residues in one position are deemed to have a common evolutionary origin. If the same letter occurs in both sequences, then this position has been conserved in evolution. If the letters differ, it is assumed that the two derive from an ancestral letter (which could be one of the two or neither).

Alignment Representation
(Figure labels: Sequence Name, Sequence Alignment, Length)
Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix.
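
The row-per-sequence representation can be sketched in a few lines of Python. This is only an illustration: the two aligned rows are made-up toy data, the function names are hypothetical, and computing identity over all columns (gapped ones included) is just one of several possible conventions.

```python
# Toy alignment: each row is one sequence, each column one aligned position.
# The sequences are invented for illustration only.
alignment = [
    ("seq1", "ACGT-ACGTT"),
    ("seq2", "ACGTTAC-TA"),
]

def column_states(row_a, row_b):
    """Mark each column: '|' conserved, '.' substituted, ' ' gapped."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    marks = []
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            marks.append(" ")   # a dash paired with a residue (indel)
        elif a == b:
            marks.append("|")   # same letter in both rows: conserved
        else:
            marks.append(".")   # different letters: substitution
    return "".join(marks)

def percent_identity(row_a, row_b):
    """Share of columns that are conserved (assumed convention: all columns count)."""
    marks = column_states(row_a, row_b)
    return 100.0 * marks.count("|") / len(marks)

(name_a, row_a), (name_b, row_b) = alignment
print(name_a, row_a)
print(" " * len(name_a), column_states(row_a, row_b))
print(name_b, row_b)
print(percent_identity(row_a, row_b))
```

The middle line printed between the two rows plays the same role as the middle row of a BLAST alignment display: it shows at a glance which positions are conserved.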
Point Mutations
•ONLY these types of point mutation events are considered by alignment-based methods: insertion, deletion, and substitution.
•Homologous sequences may have different lengths, though, which is generally explained by insertions or deletions in the sequences.
•Thus, a letter or a stretch of letters may be paired up with dashes in the other sequence to signify such an insertion or deletion.
•The term given to those dashes is indel or gap.

Gaps in Alignments
(Figure: one gap opening and two gap extensions)
Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns.
Gaps represent:
a) deletion or insertion events
b) sites with missing information
From the point of view of the aligning algorithm, there are two types of gaps: gap opening and gap extension. The algorithm weights them differently.

SNPs are a special case of point mutations
SNPs (single nucleotide polymorphisms)
•Copying errors during cell division result in variations in the DNA at a particular location.
•These copying errors are point mutations called single nucleotide polymorphisms, or SNPs.
•SNPs are passed on to the next generation through inheritance.

Role of SNPs
•In humans, SNPs account for much of the genetic diversity.
•Certain genetic diseases have been linked to SNPs.
•However, many SNPs do not result in observable differences.

Point Mutation Analysis
The reason for aligning sequences when trying to elucidate their evolutionary relationship is that algorithms can calculate an estimate of their evolutionary distance from the alignment.
These methods are based on Levenshtein’s notion of edit distance between strings:
“Edit distance is the minimum number of edit operations needed to transform one string into another.”
“The more similar the sequences are, the smaller their edit distance is.”

Types of Alignment-based Methods
•Global alignment is when matching is attempted on the entire length of the sequences.
This is usually the choice when aligning very similar sequences.
•Local alignment is when matching is done for specific segments of the sequences. This is usually the choice when the sequences are believed to contain conserved regions.

Types of Alignment-based Methods
•Earlier we used BLAST to search for a sequence given a partial segment of it.
•BLAST (Basic Local Alignment Search Tool) performs local alignments and reports the best-scoring matches.
•Re-examine the results page and check whether the best local match spans the entire query in this case.

Let us re-examine the portion of this page that displays the alignment, marked with 3.
There are three rows.
The numbers in the left column specify the starting position.
The numbers on the right specify the ending position.
The first row is the partial sequence you typed, named Query.
The third row is the sequence it is being matched against; in this case, P46098.
The second row is the result of the alignment between the top and bottom sequences.
The match is exact at every position.

Types of Alignment-based Methods
•Pair-wise alignment. Two sequences are aligned together.
•Multiple sequence alignment. Three or more sequences are aligned together.

Pairwise Alignment
Illustrated with BLAST and an 18S ribosomal RNA sequence

Pair-wise Alignment
1. Collect the two sequences
2. Align the sequences
3. Count the mutations in the alignment
4. Score the alignments

>seq2|LemnaMinor_18S_rRNA
CTCCTACCGATTGAATGGTCCGGTGAAGCGCTCGGATCGCGG
CGACGAGGGCGGTCCCCCGCCCGCGACGTCGCGAGAAGTCCG
TTGAACCTTATCATTTAGAGGAAGGAG
The first sequence is displayed above.
To get the second sequence and perform the alignment, we simply use BLAST.
Go to the BLAST page at NCBI
blast.ncbi.nlm.nih.gov
Then click on nucleotide blast.

Pair-wise Alignment
This is the nucleotide blast page at NCBI.
Paste the sequence in the box.
Select a database from the drop-down list; in this case, choose Nucleotide collection.
Scroll to the bottom of the page and click on the BLAST button.

This is the results page of the BLAST search. The top hit is our original sequence. It is listed in the table along with some statistics. Let’s look under the hood to understand what happened and how the statistics were calculated.

If you scroll down the same results page, you will see the results of all the pairwise alignments that BLAST included in the report.
They are sorted from best alignment (first one in the report) to worst alignment (last one in the report). This is the first one; therefore, it is the best match.

Steps 3 and 4 are performed after the alignment in order to assess how good a match it is. First, we need to count the mismatches in the alignment.

Counting Mismatches (mutations)
Cell (T, T) = number of unchanged T residues = 1
Cell (T, G) = number of substitutions from T to G
Cell (T, C) = number of substitutions from T to C
Cell (T, A) = number of substitutions from T to A
Cell (T, -) = number of deletions of T
...
Cell (-, T) = number of insertions of T
Cell (-, G) = number of insertions of G
Cell (-, C) = number of insertions of C
Cell (-, A) = number of insertions of A = 0

Not all mismatches are created equal.
Some substitutions are more likely than others; therefore, we must use weight values, such as those in substitution matrices.
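
A minimal Python sketch of the count table and of weighting its cells follows. The aligned rows are invented toy data, and the weights (+1 for a match, -1 for a substitution, -2 for any gap column) are hypothetical placeholders, not values from a real substitution matrix. Gap opening and gap extension are not distinguished here.

```python
from collections import Counter

def count_pairs(row_a, row_b):
    """Count how often each (top, bottom) character pair occurs in the alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return Counter(zip(row_a, row_b))

def weighted_score(counts, weights):
    """Multiply each cell count by its weight and add everything up."""
    return sum(n * weights[pair] for pair, n in counts.items())

# Hypothetical toy weights, NOT a real substitution matrix:
# +1 match, -1 substitution, -2 for any column that contains a gap.
ALPHABET = "ACGT-"
toy_weights = {}
for a in ALPHABET:
    for b in ALPHABET:
        if a == "-" or b == "-":
            toy_weights[(a, b)] = -2
        elif a == b:
            toy_weights[(a, b)] = 1
        else:
            toy_weights[(a, b)] = -1

# Made-up aligned rows used only to exercise the functions.
counts = count_pairs("ACGT-ACGTT", "ACGTTAC-TA")
print(counts[("T", "T")])   # Cell (T, T): unchanged T residues
print(counts[("-", "T")])   # Cell (-, T): insertions of T
print(counts[("T", "A")])   # Cell (T, A): substitutions from T to A
print(weighted_score(counts, toy_weights))
```

Swapping `toy_weights` for values taken from a published substitution matrix changes the score but not the procedure.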
Scoring the alignments
Note that the result is a single value, a score, obtained by multiplying the alignment matrix and the substitution matrix cell by cell and adding the values of the resulting matrix, as shown here.
So, now you have a clearer idea of what goes on under the hood of pairwise-alignment tools like BLAST.

Exercise 2: Using BLAST to transfer annotation
Sometimes we have a gene (or protein) for which the annotation (the description line in FASTA format) is unknown; for example, when a new genome is being sequenced.
The general ‘in-silico’ procedure for assigning an annotation to that newly sequenced gene (or protein) calls for using BLAST to find a similar gene (or protein) for which the annotation is known.
If the match is close enough, we can then transfer the annotation from the known gene (or protein) to the new one.

•Open a web browser and go to the UniProt URL www.uniprot.org
1. Click on the Blast tab
2. In the box, type the identifier: A7JKN7_FRANO
3. Then click on the BLAST button

Notice how the UniProt BLAST program fetches the corresponding sequence before launching the BLAST search. Also notice that the annotation (description line) is unknown.

This is the BLAST result page. The first and second hits do not have annotations either. The third hit is annotated as Neurotransmitter-gated ion-channel. So, at first blush, we could transfer that annotation to the protein A7JKN7_FRANO.

Exercise 3: GLOBAL pairwise alignment program
•Open a web browser and go to the MOBYLE portal: mobyle.pasteur.fr/
•Choose Programs/ Alignment /pairwise/global/needle from the Programs box (left)
•Copy-paste any two sequences
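
For readers who want to see what a global pairwise aligner does with two pasted sequences, here is a small Needleman-Wunsch sketch in Python. The needle program is generally described as an implementation of that algorithm, but this is not its code: the scoring values (+1 match, -1 mismatch, -2 per gap character) are assumptions chosen for illustration, and a single linear gap penalty is used instead of separate gap opening and gap extension weights.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def pair(i, j):
        """Weight for pairing a[i-1] with b[j-1]."""
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # align two residues
                score[i - 1][j] + gap,             # gap in b
                score[i][j - 1] + gap,             # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

# Toy input sequences, invented for illustration.
total, top, bottom = needleman_wunsch("ACGT", "AGT")
print(total)
print(top)
print(bottom)
```

Because the alignment is global, every residue of both sequences appears in the output, with dashes filling the positions where one sequence has no counterpart.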
