The Biologist's Guide to Paracel's Similarity Search Algorithms

The Biologist’s Guide to Paracel’s Similarity Search Algorithms Introduction Many biological questions require the comparison of one or more sequences to each other. The nature of those comparisons depends on the question being asked, the time allowed to answer the question, the manner in which the answers will be used in subsequent analyses, the required accuracy of the answer, and so on. Fundamentally, the purpose of all similarity searches is to measure the “distance” between sequences. However, the meaning of “distance” changes depending on the investigation of interest. For example, a question in which protein hydrophobicity is the basis for comparison will use different metrics and a different algorithm than one in which the presence or absence of a specific binding domain is in question. Understanding when and why a certain algorithm is needed is essential to properly producing the scientific evidence needed for an investigation. Algorithm selection also requires considering time and accuracy of the result. In some situations a fast but possibly less precise result is more important than a very precise answer that takes far longer. Algorithm precision is measured by two parameters: sensitivity and specificity. Sensitivity is the percentage of true positives found, i.e., the number of correctly identified matches relative to the total number of true matches. Specificity is the number of true matches found relative to the total number of matches reported. Sensitivity and specificity often conflict with each other because higher sensitivity also means that more unrelated sequences are reported. Lastly, investigations often require independent confirmation from multiple computational or wet lab experiments. As a consequence, algorithm selection may be influenced by the availability and quality of confirming data. A typical example is to compare available EST, cDNA, or protein data with the results of gene prediction studies to confirm the reasonableness of those studies. What is similarity comparison? Similarity comparisons evaluate the “closeness” of sequences to each other by computing a metric that reflects a reward for “allowed” differences and a penalty for “disallowed” differences. An “objective function” determines what rewards and penalties are important and how to combine these into the closeness metric. Consider an example in which a message is received from a spacecraft parked in the red zone around a distant planet. The message traveled millions of miles and was likely Paracel Algorithms 1 October 2, 2001 corrupted in the transmission. Our job at the receiver is to determine which message was most likely sent. Suppose that the spacecraft sends only one of four messages: 1. WEATHER IS BEAUTIFUL, WISH YOU WERE HERE 2. WEATHER IS HERE, WISH YOU WERE BEAUTIFUL 3. WISH THERE WAS BEAUTIFUL WEATHER HERE 4. ENTERING SAFEMODE NEED HELP At each position in the received message, a character may be correct, deleted relative to the original message, or a substitution for another character. We might also know that the chance that a character is dropped is higher than characters being substituted and that multiple, sequential, character losses are common. The message ETESEDEDEED arrives at the receiver and needs to be matched to one of the four candidate messages. We need to determine the correspondence between the received message and the candidate messages that maximizes the chance of properly identifying the message. Want to evaluate is the similarity of the received message to the candidate messages. Our assumption is that the candidate message with the most similarity to our received message is most likely to be the message sent. One way to evaluate similarity is to consider all possible alignments of the received message and the candidate messages. By using a system that rewards matches and penalizes mismatches and gaps, we can assign a score to each alignment. Using the highest of those alignment scores, we find the maximum likelihood that the received message is related to each of the candidate messages. The highest of all the alignment scores tells us which message is the most likely one sent. This method has a very good chance of finding the best candidate message, but is computationally expensive because all possible alignments need to be evaluated. Another way to evaluate find most likely message sent is to assume that the received message must have some set of characters that identically match a set of characters in the original message. The remainder of the received message should match to the source message by applying substitutions and gaps as needed. In our example, notice that received message ends in “EED.” There is one place in fourth message that EED matches exactly. There are no double or triple character strings in the received message that exactly match any position in the first three candidate messages. Ignoring spaces, if we align “EED” in the received message to the fourth message we get an alignment that is easily explained by applying the error set: ENTERINGSAFEMODENEEDHELP E-TE----S—E---DEdEED---- where the lower case “d” in the received message indicates a substitution error and the “-“ indicates a character loss. This second method is much faster and computationally cheaper than the first but also less sensitive because it is possible that a message Paracel Algorithms 2 October 2, 2001 arrives that does not have two or more consecutive characters that match identically to any candidate messages. Still another means for evaluating similarity is to compute the frequency of certain character patterns in each candidate message and then compare those frequencies to the patterns in the received message. For example, the pattern “EE” occurs once in the fourth message but not at all in the first three. “EE” also occurs once in the received message. This method is computationally inexpensive, but lacks sensitivity to the context of the characters. This simple example makes two key points. First, there are often multiple ways to perform a similarity evaluation. Each method has advantages and disadvantages in terms of accuracy and computational cost. Second, it is not possible to know with 100% certainty whether any single method will find the right match. In some cases, a combination of methods will give us higher confidence in an answer than any individual method. Similarity comparisons provide evidence of sequence closeness only. Ultimately, investigators need to review the results and make final decisions based on their domain knowledge. Similarity and Biology Like messages from space, biological sequences undergo transformations that alter structure and meaning. Transformations may be the result of evolutionary processes such as species-specific variations of a protein or may be the result of errors introduced in preparing or reading the composition of sequences. Most biological similarity searches assume a simple error model. In this model, independent processes may result in the insertion, deletion, or substitution of characters in the text string that represents the residues of a peptide or nucleotide molecule. Evolutionary mutation processes are strongly regulated by natural selection since unfavorable alterations eventually disappear. Evolutionary mutations are limited in scope and occur at a predictable rate. Like the limited set of spacecraft messages and transmission errors, restrictions on the form and rate of evolutionary mutations may be exploited in similarity comparison algorithms. Conversely, the transformation processes associated with preparing and reading sequences tend to be random along most of the sequence length. In some situations, the transformation rate is higher at the sequence ends than in the middle of the sequence, but the nature of the transformations does not change. This knowledge too may be exploited in similarity comparison algorithms. Why do biological similarity comparisons? Corresponding to the modes of biology sequence variation, there are two types of sequence assessment: homology evaluation and contextual analysis. Although these evaluations have much in common, they are fundamentally asking different questions. Paracel Algorithms 3 October 2, 2001 Homology evaluation looks for evidence that biological sequences are related but have been altered relative to each other by evolution. Orthologs are related molecules that have been changed due to speciation while paralogs are replicated molecules in the same organism that have been altered through generations of independent mutation. Homology analyses require predetermine notions of which mutations are allowed and the rate at which they can be expected to occur. These analyses therefore depend on a prior analysis of related sequences. Contextual analyses look for the common features among sequences without concern for whether the sequences have a common ancestor. These comparisons determine whether sequences overlap or are contained within another sequences, for example. Contextual analyses are commonly part of the processing needed to join together many, small sequences into fewer, long, sequences. Contextual comparison is also used to find vector contamination in sequences, evaluate primer candidates, do biochip design, and so forth. These evaluations need to model relevant sequencing instrumentation and chemistry. In both types of analyses, an objective function is defined that endeavors to determine the best alignment of the sequences in question. Consider two sequences: SEQUENCESEARCH and SEQUENCER. A simple evaluation of these sequences assigns one sequence to the rows of a matrix and the other sequence to the columns. Placing an asterisk in each cell for which the row and column characters match and a blank where they do not creates a diagonal of asterisks as shown in Figure 1. The longest diagonal in Figure 1 is the best alignment of the two sequences. S E Q U E N C E R S * E * * * Q * U * E * * * N * C * E * * * S * E * * * A R * C * H Figure 1: Comparison of Two Sequences Paracel Algorithms 4 October 2, 2001 The simple method of Figure 1 is the basis for a powerful similarity tool called a dot plot. Dot plots provide visual evidence of similarity and may be used for comparing very large sequences.

The Biologist's Guide to Paracel's Similarity Search Algorithms

Simplified Matching Algorithm Using a Translated Codon (Tron)

Gap Opening Penalty Formula

Alignment Principles and Homology Searching Using (PSI-)BLAST

Sequence Analysis

Sequence Alignment

Gap Penalty in Sequence Alignment Pdf

Structural and Evolutionary Considerations for Multiple Sequence Alignment of RNA, and the Challenges for Algorithms That Ignore Them

Bioinformatics-Inspired Analysis for Watermarked Images with Multiple Print and Scan

Sequence Alignment Algorithms

Aligning Coding Sequences with Frameshift Extension Penalties