The Biologist's Guide to Paracel's Similarity Search Algorithms

Total Page:16

File Type:pdf, Size:1020Kb

The Biologist's Guide to Paracel's Similarity Search Algorithms The Biologist’s Guide to Paracel’s Similarity Search Algorithms Introduction Many biological questions require the comparison of one or more sequences to each other. The nature of those comparisons depends on the question being asked, the time allowed to answer the question, the manner in which the answers will be used in subsequent analyses, the required accuracy of the answer, and so on. Fundamentally, the purpose of all similarity searches is to measure the “distance” between sequences. However, the meaning of “distance” changes depending on the investigation of interest. For example, a question in which protein hydrophobicity is the basis for comparison will use different metrics and a different algorithm than one in which the presence or absence of a specific binding domain is in question. Understanding when and why a certain algorithm is needed is essential to properly producing the scientific evidence needed for an investigation. Algorithm selection also requires considering time and accuracy of the result. In some situations a fast but possibly less precise result is more important than a very precise answer that takes far longer. Algorithm precision is measured by two parameters: sensitivity and specificity. Sensitivity is the percentage of true positives found, i.e., the number of correctly identified matches relative to the total number of true matches. Specificity is the number of true matches found relative to the total number of matches reported. Sensitivity and specificity often conflict with each other because higher sensitivity also means that more unrelated sequences are reported. Lastly, investigations often require independent confirmation from multiple computational or wet lab experiments. As a consequence, algorithm selection may be influenced by the availability and quality of confirming data. A typical example is to compare available EST, cDNA, or protein data with the results of gene prediction studies to confirm the reasonableness of those studies. What is similarity comparison? Similarity comparisons evaluate the “closeness” of sequences to each other by computing a metric that reflects a reward for “allowed” differences and a penalty for “disallowed” differences. An “objective function” determines what rewards and penalties are important and how to combine these into the closeness metric. Consider an example in which a message is received from a spacecraft parked in the red zone around a distant planet. The message traveled millions of miles and was likely Paracel Algorithms 1 October 2, 2001 corrupted in the transmission. Our job at the receiver is to determine which message was most likely sent. Suppose that the spacecraft sends only one of four messages: 1. WEATHER IS BEAUTIFUL, WISH YOU WERE HERE 2. WEATHER IS HERE, WISH YOU WERE BEAUTIFUL 3. WISH THERE WAS BEAUTIFUL WEATHER HERE 4. ENTERING SAFEMODE NEED HELP At each position in the received message, a character may be correct, deleted relative to the original message, or a substitution for another character. We might also know that the chance that a character is dropped is higher than characters being substituted and that multiple, sequential, character losses are common. The message ETESEDEDEED arrives at the receiver and needs to be matched to one of the four candidate messages. We need to determine the correspondence between the received message and the candidate messages that maximizes the chance of properly identifying the message. Want to evaluate is the similarity of the received message to the candidate messages. Our assumption is that the candidate message with the most similarity to our received message is most likely to be the message sent. One way to evaluate similarity is to consider all possible alignments of the received message and the candidate messages. By using a system that rewards matches and penalizes mismatches and gaps, we can assign a score to each alignment. Using the highest of those alignment scores, we find the maximum likelihood that the received message is related to each of the candidate messages. The highest of all the alignment scores tells us which message is the most likely one sent. This method has a very good chance of finding the best candidate message, but is computationally expensive because all possible alignments need to be evaluated. Another way to evaluate find most likely message sent is to assume that the received message must have some set of characters that identically match a set of characters in the original message. The remainder of the received message should match to the source message by applying substitutions and gaps as needed. In our example, notice that received message ends in “EED.” There is one place in fourth message that EED matches exactly. There are no double or triple character strings in the received message that exactly match any position in the first three candidate messages. Ignoring spaces, if we align “EED” in the received message to the fourth message we get an alignment that is easily explained by applying the error set: ENTERINGSAFEMODENEEDHELP E-TE----S—E---DEdEED---- where the lower case “d” in the received message indicates a substitution error and the “-“ indicates a character loss. This second method is much faster and computationally cheaper than the first but also less sensitive because it is possible that a message Paracel Algorithms 2 October 2, 2001 arrives that does not have two or more consecutive characters that match identically to any candidate messages. Still another means for evaluating similarity is to compute the frequency of certain character patterns in each candidate message and then compare those frequencies to the patterns in the received message. For example, the pattern “EE” occurs once in the fourth message but not at all in the first three. “EE” also occurs once in the received message. This method is computationally inexpensive, but lacks sensitivity to the context of the characters. This simple example makes two key points. First, there are often multiple ways to perform a similarity evaluation. Each method has advantages and disadvantages in terms of accuracy and computational cost. Second, it is not possible to know with 100% certainty whether any single method will find the right match. In some cases, a combination of methods will give us higher confidence in an answer than any individual method. Similarity comparisons provide evidence of sequence closeness only. Ultimately, investigators need to review the results and make final decisions based on their domain knowledge. Similarity and Biology Like messages from space, biological sequences undergo transformations that alter structure and meaning. Transformations may be the result of evolutionary processes such as species-specific variations of a protein or may be the result of errors introduced in preparing or reading the composition of sequences. Most biological similarity searches assume a simple error model. In this model, independent processes may result in the insertion, deletion, or substitution of characters in the text string that represents the residues of a peptide or nucleotide molecule. Evolutionary mutation processes are strongly regulated by natural selection since unfavorable alterations eventually disappear. Evolutionary mutations are limited in scope and occur at a predictable rate. Like the limited set of spacecraft messages and transmission errors, restrictions on the form and rate of evolutionary mutations may be exploited in similarity comparison algorithms. Conversely, the transformation processes associated with preparing and reading sequences tend to be random along most of the sequence length. In some situations, the transformation rate is higher at the sequence ends than in the middle of the sequence, but the nature of the transformations does not change. This knowledge too may be exploited in similarity comparison algorithms. Why do biological similarity comparisons? Corresponding to the modes of biology sequence variation, there are two types of sequence assessment: homology evaluation and contextual analysis. Although these evaluations have much in common, they are fundamentally asking different questions. Paracel Algorithms 3 October 2, 2001 Homology evaluation looks for evidence that biological sequences are related but have been altered relative to each other by evolution. Orthologs are related molecules that have been changed due to speciation while paralogs are replicated molecules in the same organism that have been altered through generations of independent mutation. Homology analyses require predetermine notions of which mutations are allowed and the rate at which they can be expected to occur. These analyses therefore depend on a prior analysis of related sequences. Contextual analyses look for the common features among sequences without concern for whether the sequences have a common ancestor. These comparisons determine whether sequences overlap or are contained within another sequences, for example. Contextual analyses are commonly part of the processing needed to join together many, small sequences into fewer, long, sequences. Contextual comparison is also used to find vector contamination in sequences, evaluate primer candidates, do biochip design, and so forth. These evaluations need to model relevant sequencing instrumentation and chemistry. In both types of analyses, an objective function is defined that endeavors to determine the best alignment of the sequences in question. Consider two sequences: SEQUENCESEARCH and SEQUENCER. A simple evaluation of these sequences assigns one sequence to the rows of a matrix and the other sequence to the columns. Placing an asterisk in each cell for which the row and column characters match and a blank where they do not creates a diagonal of asterisks as shown in Figure 1. The longest diagonal in Figure 1 is the best alignment of the two sequences. S E Q U E N C E R S * E * * * Q * U * E * * * N * C * E * * * S * E * * * A R * C * H Figure 1: Comparison of Two Sequences Paracel Algorithms 4 October 2, 2001 The simple method of Figure 1 is the basis for a powerful similarity tool called a dot plot. Dot plots provide visual evidence of similarity and may be used for comparing very large sequences.
Recommended publications
  • Simplified Matching Algorithm Using a Translated Codon (Tron)
    Vol. 16 no. 3 2000 BIOINFORMATICS Pages 190–202 Homology-based gene structure prediction: simplified matching algorithm using a translated codon (tron) and improved accuracy by allowing for long gaps Osamu Gotoh Saitama Cancer Center Research Institute, 818 Komuro Ina-machi, Saitama 362-0806, Japan Received on August 23, 1999; accepted on October 21, 1999 Abstract Introduction Motivation: Locating protein-coding exons (CDSs) on a Following the completion of genomic sequencing of the eukaryotic genomic DNA sequence is the initial and an yeast Saccharomyces cerevisiae (Goffeau et al., 1996), essential step in predicting the functions of the genes nearly the complete structure of the nematode Caenorhab- embedded in that part of the genome. Accurate prediction ditis elegans genome has recently been reported (The of CDSs may be achieved by directly matching the DNA C. elegans Sequencing Consortium, 1998). Sequencing sequence with a known protein sequence or profile of a projects in several eukaryotic genomes including the homologous family member(s). human genome are now in progress. Identification of the Results: A new convention for encoding a DNA sequence genes on these genomic sequences and inferring their into a series of 23 possible letters (translated codon or functions are major themes of current computational tron code) was devised to improve this type of analysis. genome analyses. One obstacle to gene identification Using this convention, a dynamic programming algorithm is the fact that typical eukaryotic genes are segmented, was developed to align a DNA sequence and a protein and the prediction of precise exonic regions is still a sequence or profile so that the spliced and translated challenging problem (Burge and Karlin, 1998; Claverie, sequence optimally matches the reference the same as 1997; Murakami and Takagi, 1998).
    [Show full text]
  • Gap Opening Penalty Formula
    Gap Opening Penalty Formula Quintin remains phenotypical after Bryn participated austerely or recopying any enumerator. Astonied Leif popularizes piously or wigwagged stereophonically when Tre is sanctioning. If exceptionable or unemphatic Allah usually cultivates his guilder disputes ornamentally or overdresses astern and despairingly, how open-shop is Clare? The best alignments imply the opening gap penalty values a concept, therefore smaller sequence The length of the Hit Overlap relative to the length of hit sequence. Review of concepts, where position specific scoring matrices are constructed over multiple iterations of BLAST algorithm. The primer or one of the nucleotides can be radioactively or fluorescently labeled also, perhaps we would find much less similarity than we are accustomed to. These features of the alignment programs enhance the sequence alignment of real sequences by better suiting to different conservation rates at different spatial locations of the sequences. The authors would like to thank Dr. So, overwriting the file globin. For example, because it is very distant from other known homologs. Wunsch algorithm; that is, those matches need to be verified manually. Explanation: PAM stands for Percent Accepted Mutation. Return the edit distance between two strings. This module provides alignment functions to get global and local alignments between two sequences. In this section, Waterman MS. Phylip, have been developed using aligned blocks that are mostly devoid of disordered regions in proteins. The final alignment is written to screen. Show full deflines will be assumed to restart the gap penalty function domains and uncomment the second place ahead of the scorer can also involves additional features.
    [Show full text]
  • Alignment Principles and Homology Searching Using (PSI-)BLAST
    Alignment principles and homology searching using (PSI-)BLAST Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU) http://ibivu.cs.vu.nl Bioinformatics “Nothing in Biology makes sense except in the light of evolution” (Theodosius Dobzhansky (1900-1975)) “Nothing in bioinformatics makes sense except in the light of Biology” Evolution Four requirements: • Template structure providing stability (DNA) • Copying mechanism (meiosis) • Mechanism providing variation (mutations; insertions and deletions; crossing-over; etc.) • Selection (enzyme specificity, activity, etc.) Evolution Ancestral sequence: ABCD ACCD (B C) ABD (C ø) mutation deletion ACCD or ACCD Pairwise Alignment AB─D A─BD See “Primer of Genome Science” P. 114 – box “Phylogenetics” Evolution Ancestral sequence: ABCD ACCD (B C) ABD (C ø) mutation deletion ACCD or ACCD Pairwise Alignment AB─D A─BD See “Primer of Genome Science” P. 114 – true alignment box “Phylogenetics” Comparing two sequences •We want to be able to choose the best alignment between two sequences. •Alignment assumes divergent evolution (common ancestry) as opposed to convergent evolution •The first sequence to be compared is assigned to the horizontal axis and the second is assigned to the vertical axis. See “Primer of Genome Science” P. 72-75 box “Pairwise Sequence Alignment” MTSAVLPAAYDRKHTSIIFQTSWQ M T S A V L P A A Y D R K H T T S W Q All possible alignments between the two sequences can be represented as a path through the search matrix MTSAVLPAAYDRKHTSIIFQTSWQ M T S A V L P A A Y Corresponds D to stretch
    [Show full text]
  • Sequence Analysis
    Sequence Analysis MV Module II • Sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of a wide range of analytical methods to understand its features, function, structure, or evolution. • Methodologies used include sequence alignment, searches against biological databases, and others. Since the development of methods of high-throughput production of gene and protein sequences, the rate of addition of new sequences to the databases increased exponentially. Such a collection of sequences does not, by itself, increase the scientist's understanding of the biology of organisms. However, comparing these new sequences to those with known functions is a key way of understanding the biology of an organism from which the new sequence comes. • Thus, sequence analysis can be used to assign function to genes and proteins by the study of the similarities between the compared sequences. Sequence Similarity Search • Sequence analysis is used to compare two or more sequences • Comparison of protein & DNA sequences to find similarities/differences - chief task in bioinformatics • The process of comparing two or more sequences to find out similarity between them is called sequence alignment • By sequence comparison it is possible to find out relationship in structure, function and evolution from a common ancestor • Similarity - identical (similar) residues occur at identical (similar) positions • No. of such matches indicates the degree of similarity ATCGTA 4/6 = 66% ATGCTA • Similarity occurs by chance, evolutionary convergence
    [Show full text]
  • Sequence Alignment
    Why Align Strings? • Find small differences between strings – Differences ~every 100 characters in DNA • See if the suffix of one sequence is a prefix of another – Useful in shotgun sequencing • Find common subsequences (cf definition) – Homology or identity searching • Find similarities of members of the same family – Structure prediction Alignment • Not an exact match • Can be based on edit distance • Usually based on a similarity measure Metrics A metric ρ:X is a function with the following properties for a,b,c • Ρ(a), ≥0 (real, non-negative) • ρ(a,a)=0 (identity) • ρ(a,b)= ρ(b,a) reflexive • ρ(a,c) ρ(a,b)+ ρ(b,c) (triangle inequality) Often ρ is called a ‘distance’ Edit Distance The number of changes requires to change one sequence into another is called the edit distance. VINTNERS VINEYARD Edit Distance = 4 Similarity We are more inclined to use the concept of similarity, an alignment scoring function instead. We can then – deal with gaps – weight specific substitutions. Note that similarity is NOT A METRIC. Example of a Scoring Function for Similarity Match +1 Mismatch -1 (replacement) Align with gap -2 (insertion or deletion) Called “Indels” by Waterman Similarity Scoring of an Alignment Example of Two of 6 Possible Alignments ATGCAT CTGCT 3 1 1 2 1 1 1 ATGCAT CTGCT1 1 1 1 1 2 1 String (Sequence) Alignment • Global Alignment – Every character in the query (source) string lines up with a character in the target string – May require gap (space) insertion to make strings the same length • Local Alignment – An “internal” alignment or embedding of a substring (sic) into a target string Global vs Local ATGATACCCT GLOBAL TTGTACGT ATGATACCCT LOCAL TGAAAGG Optimal Global Alignments In the earlier example ATGCAT repeated here, the second CTGCTalignment 3 is obviously better.
    [Show full text]
  • Gap Penalty in Sequence Alignment Pdf
    Gap Penalty In Sequence Alignment Pdf Pisolitic and meliorative Salmon bucket so discouragingly that Nichole spilt his gabfest. Is Corby always parental and shock-headed when procreate some unworldliness very fore and nervily? Protanomalous and designer Orlando quantify some stollen so discerningly! For yourself, two protein sequences may be relatively similar but caught at certain intervals as one protein may strike a different subunit compared to smack other. Excellent visual way we assess repetitiveness in gap penalty in sequence alignment pdf, including promoters within their uniqueness. On our tests show, such as done as in conjunction with. In depth order to form a decent alignment at the penalty in sequence alignment? How bout I Calculate The subject Gap Penalty Biostars. Path while using dotplots are gap penalty in sequence alignment pdf. Aligning Sequences with Non-Affine Gap Penalty PLAINS. Dotplot indicate that clustalw, which is most similar residues that while subcloning are easy to accommodate such as specified below, gap penalty in sequence alignment pdf, identifying a pdf. Otherwise be defined below to purchase short gaps, gap penalty in sequence alignment pdf, you will require removing old and. Information Security sentences and introduction of extra material. This is pervasive like regular FASTA except that gaps are added in trust to stew the sequences. In this tutorial you acknowledge use a classic global sequence alignment method the. Pairwise alignments cannot be released, or gap penalty in sequence alignment pdf. Which matrices gap penalties If that pair of sequences are least than 25 identical then the alignments are doctor to ring bad.
    [Show full text]
  • Structural and Evolutionary Considerations for Multiple Sequence Alignment of RNA, and the Challenges for Algorithms That Ignore Them
    chapter 7 Structural and Evolutionary Considerations for Multiple Sequence Alignment of RNA, and the Challenges for Algorithms That Ignore Them karl m. kjer Rutgers University usman roshan New Jersey Institute of Technology joseph j. gillespie University of Maryland, Baltimore County; Virginia Bioinformatics Institute, Virginia Tech Identifi cation of Goals. .106 Alignment and Its Relation to Data Exclusion. .108 Differentiation of Molecules . .110 rRNA Sequences Evolve under Structural Constraints . .111 Challenges to Existing Programs . .114 Compositional Bias Presents a Severe Challenge . .114 Gaps Are Not Uniformly Distributed . .116 Nonindependence of Indels. .121 Long Inserts/Deletions . .122 Lack of Recognition of Covarying Sites (A Well-Known, Seldom-Adopted Strategy) . .123 Are Structural Inferences Justifi ed? . .126 Why Align Manually? . .127 Perceived Advantages of Algorithms . .127 105 RRosenberg08_C07.inddosenberg08_C07.indd 105105 99/30/08/30/08 55:09:18:09:18 PMPM 106 Structural Considerations for RNA MSA An Example of Accuracy and Repeatability . .129 Comparison to Protein Alignment—Programs and Benchmarks . .136 Conclusion . .137 Terminology . .139 Appendix: Instructions on Performing a Structural Alignment . .141 identification of goals What Is It You are Trying to Accomplish with an Alignment? Some of the disagreement over alignment approaches comes from differences in objectives among investigators. Are the data merely meant to distinguish target DNA from contaminants in a BLAST search? Or is there a specifi c node on a cladogram you wish to test? Are you aligning genomes or genes? Are the data protein-coding, structural RNAs or noncoding sequences? Do you consider phylogenetics to be a process of inference or estimation? Would you rather be more consistent or more accurate? Are you studying the performance of your selected programs or the relationships among your taxa? Different answers to each of these questions could likely lead to legitimate alternate alignment approaches.
    [Show full text]
  • Bioinformatics-Inspired Analysis for Watermarked Images with Multiple Print and Scan
    Bioinformatics-Inspired Analysis for Watermarked Images with Multiple Print and Scan By Abhimanyu Singh Garhwal A thesis submitted to Auckland University of Technology in fulfilment of the requirements for the degree of Doctor of Philosophy September 2017 Acronyms Used in This Thesis BIIA - Bioinformatics-Inspired Image Analysis BIIIA - Bioinformatics-Inspired Image Identification Approach BIIIG - Bioinformatics-Inspired Image Grouping Approach DNA – Deoxyribonucleic Acid MPS – Multiple Print and Scan MSA – Multiple Sequence Alignment NW - Non-Watermarked NWA – Needleman Wunch Algorithm NWD – Non-Watermarked and Degraded NWND – Non-Watermarked and Non-degraded PSA – Pairwise Sequence Alignment SWA – Smith Waterman Algorithm W – Watermarked WD – Watermarked and Degraded WND – Watermarked and Non-Degraded II Abstract Image identification and grouping through pattern analysis are the core problems in image analysis. In this thesis, the gap between bioinformatics and image analysis is bridged by using biologically-encoding and sequence-alignment algorithms in bioinformatics. In this thesis, the novel idea is to exploit the whole image which is encoded biologically in DNA without extracting its features. This thesis proposed novel methods for identifying and grouping images no matter whether having or not having watermarks. Three novel methods are proposed. The first is to evaluate degraded/non-degraded and watermarked/non-watermarked images by using image metrics. The bioinformatics-inspired image identification approach (BIIIA) is the second contribution, where two DNA-encoded images are aligned by using SWA algorithm or NWA algorithm to derive substrings, which are exploited for pattern matching so as to identify the images having a watermark or degradation generated from MPS. The outcomes of identification affirm the capability of BIIIA algorithm.
    [Show full text]
  • Sequence Alignment Algorithms
    2/19/17 Sequence alignment algorithms Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, FeBruary 23rd 2017 After this lecture, you can… … decide when to use local and global sequence alignments … use dynamic programming to align two sequences … explain Difference Between fixeD/linear/affine gap penalty … derive substitution scores and gap penalties from an alignment matrix … explain the progressive multiple alignment algorithm anD the Difference Between guiDe tree anD phylogenetic tree … recognize anD valiDate alignment Fasta files … list anD evaluate the assumptions on which sequence alignment DepenDs 1 2/19/17 Pairwise sequence alignments • Definition of sequence alignment – “Given two sequences: seqX = X1X2…XM and seqY = Y1Y2…YN an alignment is an assignment of gaps to positions 0, …, M in x, and to positions 0, …, N in seqY, so as to line up each letter in one sequence with either a letter or a gap in the other sequence” -AGAGGCTATCACCTGACCTCCAGGCCGATGCCCGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAGTAGCTATCACGACCGCGGTCGATTTGCCCGAC-CTATCAC--GACCGC--GGTCGATTTGCCCGAC • The optimal alignment is the alignment that is most consistent with a moDel of evolution • It is not trivial to make sequence alignments – The alignment shoulD be reliaBle – The method of obtaining the alignment shoulD be reproDuciBle – Thus, we use an algorithm to make sequence alignments Global anD local sequence alignments • Alignment: adDing gaps in one anD/or the other sequence until they are both equally long • Are sequences completely or partially homologous? • Local alignment – FinDs the optimal suB-alignment within two sequences – Partial homologs, e.g. resulting from domain rearrangement • GloBal alignment – Aligns two sequences from enD to enD – If you know two sequences are full homologs, e.g.
    [Show full text]
  • Aligning Coding Sequences with Frameshift Extension Penalties
    Jammali et al. Algorithms Mol Biol (2017) 12:10 DOI 10.1186/s13015-017-0101-4 Algorithms for Molecular Biology RESEARCH Open Access Aligning coding sequences with frameshift extension penalties Safa Jammali1* , Esaie Kuitche1, Ayoub Rachati1, François Bélanger1, Michelle Scott2 and Aïda Ouangraoua1 Abstract Background: Frameshift translation is an important phenomenon that contributes to the appearance of novel cod- ing DNA sequences (CDS) and functions in gene evolution, by allowing alternative amino acid translations of gene coding regions. Frameshift translations can be identified by aligning two CDS, from a same gene or from homologous genes, while accounting for their codon structure. Two main classes of algorithms have been proposed to solve the problem of aligning CDS, either by amino acid sequence alignment back-translation, or by simultaneously accounting for the nucleotide and amino acid levels. The former does not allow to account for frameshift translations and up to now, the latter exclusively accounts for frameshift translation initiation, not considering the length of the translation disruption caused by a frameshift. Results: We introduce a new scoring scheme with an algorithm for the pairwise alignment of CDS accounting for frameshift translation initiation and length, while simultaneously considering nucleotide and amino acid sequences. The main specificity of the scoring scheme is the introduction of a penalty cost accounting for frameshift extension length to compute an adequate similarity score for a CDS alignment. The second specificity of the model is that the search space of the problem solved is the set of all feasible alignments between two CDS. Previous approaches have considered restricted search space or additional constraints on the decomposition of an alignment into length-3 sub- alignments.
    [Show full text]