Simplified Matching Algorithm Using a Translated Codon (Tron)

BIOINFORMATICS, Vol. 16 no. 3 2000, Pages 190–202
© Oxford University Press 2000

Homology-based gene structure prediction: simplified matching algorithm using a translated codon (tron) and improved accuracy by allowing for long gaps

Osamu Gotoh
Saitama Cancer Center Research Institute, 818 Komuro Ina-machi, Saitama 362-0806, Japan

Received on August 23, 1999; accepted on October 21, 1999

Abstract

Motivation: Locating protein-coding exons (CDSs) on a eukaryotic genomic DNA sequence is the initial and an essential step in predicting the functions of the genes embedded in that part of the genome. Accurate prediction of CDSs may be achieved by directly matching the DNA sequence with a known protein sequence or profile of a homologous family member(s).

Results: A new convention for encoding a DNA sequence into a series of 23 possible letters (translated codon or tron code) was devised to improve this type of analysis. Using this convention, a dynamic programming algorithm was developed to align a DNA sequence and a protein sequence or profile so that the spliced and translated sequence optimally matches the reference the same as the standard protein sequence alignment allowing for long gaps. The objective function also takes account of frameshift errors, coding potentials, and translational initiation, termination and splicing signals. This method was tested on Caenorhabditis elegans genes of known structures. The accuracy of prediction measured in terms of a correlation coefficient (CC) was about 95% at the nucleotide level for the 288 genes tested, and 97.0% for the 170 genes whose product and closest homologue share more than 30% identical amino acids. We also propose a strategy to improve the accuracy of prediction for a set of paralogous genes by means of iterative gene prediction and reconstruction of the reference profile derived from the predicted sequences.

Availability: The source codes for the program ‘aln’ written in ANSI-C and the test data will be available via anonymous FTP at ftp.genome.ad.jp/pub/genomenet/saitama-cc.

Contact: [email protected]

Introduction

Following the completion of genomic sequencing of the yeast Saccharomyces cerevisiae (Goffeau et al., 1996), nearly the complete structure of the nematode Caenorhabditis elegans genome has recently been reported (The C. elegans Sequencing Consortium, 1998). Sequencing projects in several eukaryotic genomes including the human genome are now in progress. Identification of the genes on these genomic sequences and inferring their functions are major themes of current computational genome analyses. One obstacle to gene identification is the fact that typical eukaryotic genes are segmented, and the prediction of precise exonic regions is still a challenging problem (Burge and Karlin, 1998; Claverie, 1997; Murakami and Takagi, 1998).

Most gene-identification methods rely on statistical features of coding and non-coding sequences, and signals around the beginnings and ends of transcription, translation, and splicing. Various algorithms including artificial neural networks (Uberbacher and Mural, 1991), hidden Markov models (Burge and Karlin, 1997), and discriminant analyses (Zhang, 1997) coupled with dynamic programming algorithms (Snyder and Stormo, 1995; Xu et al., 1994) have been used to capture specific signals and derive a final prediction by combining several lines of information. The best-performing programs currently available correctly predict 70–80% of exons (Burge and Karlin, 1998; Claverie, 1997; Murakami and Takagi, 1998). This level of accuracy might be sufficient for classification of the gene into a certain gene family, but is insufficient for some other purposes, such as predicting the structure of the encoded protein and evolutionary studies, since the chance of perfect prediction of the entire coding sequence declines exponentially with the number of exons in the gene.

Several gene-finding methods incorporate the results of sequence similarity searches (Altschul et al., 1990) to significantly improve the overall prediction accuracy (Burset and Guigó, 1996; Cai and Bork, 1998). To achieve still better predictions, however, more direct involvement of homology information appears to be necessary (Mironov et al., 1998). Homology information has been shown to be useful even for better identification of prokaryotic genes (Lolkema and Slotboom, 1998; Pearson et al., 1997). A handful of methods have been proposed to predict eukaryotic gene structures by direct matching of the genomic DNA sequence and a reference protein (Birney and Durbin, 1997; Gelfand et al., 1996; Huang and Zhang, 1996), or cDNA sequence (Florea et al., 1998; Mott, 1997). ‘Genewise’ from Birney and Durbin (1997) appears to be the most general of these methods, since it considers frameshift errors and accepts a protein profile as the reference, while other methods lack either of the features. The use of a protein profile or a profile hidden Markov model, rather than a single sequence, may improve the alignment accuracy, as has been repeatedly experienced for homology searches (Altschul et al., 1997; Park et al., 1998) and multiple sequence alignment (Gotoh, 1996). However, genewise does not necessarily predict the complete gene structure, and sometimes reports only fragmental accounts.

Upon investigating the members of some multi-gene families throughout the genome of C. elegans, I became aware that prediction-based CDSs annotated in public sequence databases, such as GenBank, might be wrong for up to nearly half of the genes (Gotoh, 1998). I reached this notion because multiple protein sequence alignments derived from such CDSs contained several structurally implausible gaps (insertions and deletions). I was able to show that the reassignment of exons greatly improved the quality of alignments (Gotoh, 1998). The principle of the prediction algorithm was straightforward; when the genomic sequence is spliced and translated, the conceptual protein sequence should optimally match the reference sequence or a ‘generalized profile’ as in the usual protein–protein or protein–profile alignment with an affine gap-penalty function. Our profile is more general than the usual (Gribskov et al., 1987) in terms of rigorous treatment of internal gaps of various lengths and positions (Gotoh, 1994). Translational initiation, termination, and splicing signals as well as exonic coding potentials are also taken into account for the objective function to be optimized.

In this paper, I will show the details of the algorithm, which involves several novel features. First, a genomic sequence is encoded into translated codon (or tron) codes. The 23-letter codes can compactly encode the potential translated amino acid sequence without losing any information of the original nucleotide sequence. Second, a matching score for a codon interrupted by a phase-1 or phase-2 intron is rigorously evaluated, where phase-1 and phase-2 introns imply those located between the first and second, and the second and third nucleotides in a codon, respectively. Third, a special form of gap-penalty function that takes account of frameshift errors and long gaps is employed. Finally, an iterative strategy is developed to coherently improve the prediction accuracy for a set of paralogous genes within a genome. Empirical examinations with 291 C. elegans genes of experimentally identified exon–intron organizations indicated that our method accurately predicts exonic sequences with a correlation coefficient of about 97% at the nucleotide level, when the amino acid identities between the reference sequence and the objective gene product exceed the generally recognized range (twilight zone) of reliable global protein sequence alignment (Doolittle, 1981; Sander and Schneider, 1991).

Methods

Tron code

A remarkable feature of the universal genetic code is that the second nucleotide in a codon greatly affects its specificity (Crick, 1968). In fact, all codons for an amino acid have a unique nucleotide at the second position, except for Ser (TCN and AGY) and termination (TAR and TGA) codons. Thus, only 23 letters are necessary and sufficient to unambiguously encode both the original nucleotide sequence and conceptually translated amino acid sequences in the three frames. We propose to call each translated codon a ‘tron’ $\in \mathcal{T}$, and express them by the standard one-letter amino acid codes $\in \mathcal{A}$, except for ‘J’, ‘O’, and ‘U’, which are used to represent translated AGY, TAR, and TGA codons, respectively (Figure 1). Thus, $\mathcal{T} = \{\mathcal{A}, \text{‘J’}, \text{‘O’}, \text{‘U’}\}$. In a ‘tron sequence’, $\mathbf{b} = b_1 b_2 \ldots b_J$, the tron code substitutes for the second nucleotide of a triplet. The tron codes at the first and last sites in a sequence may be determined based on the arbitrary assumption that a fixed nucleotide, ‘A’, occupies the sites immediately before the first and immediately after the last sites of the original nucleotide sequence.

A usual 20 × 20 amino acid exchange matrix, $M(a, b)$ $(a, b \in \mathcal{A})$, such as PAM250 (Dayhoff et al., 1978) and JTT250 (Jones et al., 1992), is easily expanded to a 20 × 23 amino acid versus tron similarity matrix, $S(a, b)$ $(a \in \mathcal{A}, b \in \mathcal{T})$, such that $S(a, b) = M(a, b)$ if $b \in \mathcal{A}$, $S(a, \text{‘J’}) = M(a, \text{‘S’})$, and $S(a, \text{‘O’}) = S(a, \text{‘U’}) = \mathrm{MIN}\, M(a, b)$. The matrix $S(a, b)$ facilitates immediate comparison of an amino acid or a profile vector with a translated codon. On the other hand, a single reference to a 23-element table immediately recovers the original nucleotide at a specific site.

Fig. 1. The tron codes. The proposed tron (translated codon) codes are shown in boldface letters together with arbitrarily chosen numeric codes (1–23).

Boundary signals and coding potential

We used the frequency of ‘ditrons’, i.e. neighboring trons every three sites, in three frames to estimate coding potential. All frequency data were normalized with the corresponding reference frequencies obtained from the 27 Mb of C. elegans genomic sequence that was publicly […] window size of 20 derived from first-order Markov models at the nucleotide sequence level (Salzberg, 1997; Zhang and Marr, 1993).