Generalised Implementation for Fixed-Length Approximate String Matching Under Hamming Distance & Applications

2015 IEEE International Parallel and Distributed Processing Symposium Workshops Generalised Implementation for Fixed-Length Approximate String Matching under Hamming Distance & Applications Solon P. Pissis and Ahmad Retha Department of Informatics, King’s College London, London, UK fsolon.pissis,[email protected] Abstract—Approximate string matching is the problem of One of the most commonly used error models is the so finding all factors of a text t of length n with a distance at most called edit distance model, which uses substitution, deletion k from a pattern x of length m ≤ n. Fixed-length approximate and insertion of letters in both strings. Each of these different string matching is the problem of finding all factors of a text t of length n with a distance at most k from any factor of operations can have a different edit cost or the cost may length h of a pattern x of length m ≤ n, where h ≤ m. It depend on the letters involved; in this case we speak of the is thus a generalisation of approximate string matching and general edit distance model. Otherwise, if all the operations has numerous direct applications in molecular biology—motif have a unit cost, we speak of simple edit distance or just edit extraction and circular sequence alignment, to name a few. distance. In the latter case we simply seek for the minimum MaxShiftM is a bit-parallel algorithm for fixed-length ap- number of operations to transform the one string into the proximate string matching under the Hamming distance model with time complexity O(mdh=wen), where w is the size of the other. A further simplification of this model is the Hamming computer word (Crochemore et al., 2010). An implementation distance model and this variant only allows for substitutions. of this algorithm is straightforward as long as the maximal DNA errors occur at a steady rate with the average single length of alignments is less than or equal to w. In this article, nucleotide substitution mutation rate in humans estimated our contribution is twofold: first, we propose a generalised to be 1:20 × 10−8 per nucleotide per generation [13]. implementation of MaxShiftM, that is, for any given h ≤ m, under the Hamming distance model; and second, we show how Apart from single-nucleotide substitution, there is also nu- our implementation can be used to improve the accuracy and cleotide deletion and insertion rates but these are signif- efficiency of multiple circular sequence alignment in terms of icantly lower. Although more specialised—but computa- the inferred likelihood-based phylogenies. tionally more expensive—error models exist for molecular Keywords-algorithms on strings; approximate string match- sequences [11], the edit and Hamming distance models are ing; dynamic programming. still extensively used as a reasonable accuracy-efficiency trade-off. I. INTRODUCTION Two approaches to devising search algorithms can be found in the literature, termed off-line and on-line. Off-line Various algorithms exist to tackle the problem of finding algorithms have a preprocessing stage where they take the an exact or similar string pattern in a given string text. input datasets and generate exhaustive intermediate indexes The topic contains many competing algorithms for solv- which can then be searched. Off-line algorithms presuppose ing specific problems in particular branches of scientific the search datasets remain mostly static and index-storage research [17]. In the field of molecular biology, string space is abounding. Often, indexing time is of secondary matching algorithms are used for extracting motifs or regions importance to search time and this makes it impractical for of interest [18] or for the alignment of short reads to longer our purposes. On-line search algorithms work on changing DNA or protein sequences [24]. This is important for solv- and expanding datasets and run on-the-fly with little or no ing problems such as gene classification, protein function preprocessing. On-line search algorithms are better suited for identification, mutation discovery, and genetic sequencing. immediate and on-demand searching of variable datasets. In The current state of art in the field present a number this work we consider only on-line approaches. of applications with algorithms that are general-purpose With edit distance, the searching problem is known as or directly applicable in molecular biology. The general approximate string matching with k-errors. The first al- computing problem these algorithms attempt to solve is gorithm for solving this problem has been rediscovered known as exact string matching and aims to determine many times in the past in different areas [17]. However this whether or not a string of letters is present inside another algorithm computed the edit distance, and it was converted string. Furthermore, this problem expands into the realms of into a search algorithm only in 1980 by Sellers [22]. Sellers approximate string matching; approximate matching allows algorithm requires time O(mn), where m is the length of for a limited number of errors or, dually, edit operations the pattern and n is the length of the text. The major theoret- needed for the searched pattern to have a match in the text. ical and practical achievements are O(kn)-time algorithms, /15 $31.00 © 2015 IEEE 367 DOI 10.1109/IPDPSW.2015.106 presented in [8], [9], [26], where k is the maximum edit sequences of circular structure may be cut at arbitrary distance allowed. An intensive study on the question is by positions to be sequenced so one copy of a genetic Navarro [17]. code will not be aligned with another. BEAR uses A thread of practice-oriented results exploits the hardware MaxShiftM to search for a sufficiently good rotation word-level parallelism of operations. It takes advantage of for each of the sequences against one another, so that the intrinsic parallelism of the word-level operations inside when they are passed onto a conventional multiple a computer word. By using this fact cleverly, the number sequence alignment tool, it will be able to provide a of operations that a bit-parallel algorithm performs can be better multiple alignment. cut down by a factor of at most w, where w is the size of MaxShiftM operates by maintaining a variant of the the computer word. In order to relate the behavior of bit- traditional dynamic programming matrix D, denoted by D0, parallel algorithms to other work, it is normally assumed through iteratively comparing the letters in text t against the that w = Θ(log n), as dictated by the RAM model. Wu letters in the pattern x to provide a representation for solv- and Manber, in [29], gave an O(kdm=wen)-time algorithm ing the fixed-length approximate string matching problem. for the k-errors problem by simulating the non-deterministic MaxShiftM maintains the alignments information in matrix automaton [25] using word-level parallelism. Baeza-Yates D0 via word-level shift-or operations. Edit operations and Navarro, in [1], gave an O(dkm=wen)-time variation are stored in individual bits of a computer word and at the of the Wu-Manber algorithm, implying O(n) performance end of this computation, the sum of these operations denote when km = O(w). In 1994, Wright [28] presented an the alignment cost; this cost can be determined by obtaining O(m log(jΣj)n=w)-time algorithm, where jΣj is the size the sum of bits set on in the computer word via a word-level of the input alphabet Σ. This was the first work using popcount operation. The resulting matrix D0 holds the word-level parallelism on the dynamic programming matrix. binary encoding for obtaining the optimal alignment between Myers, in [16], gave an O(dm=wen)-time algorithm using any factor of length h of x and some factor of t. word-level parallelism for computing the dynamic program- MaxShiftM requires time O(mn) under the assumption ming matrix more efficiently. Another general solution based that the maximal length of the stored alignments is less than on existing algorithms can be found in [5]. or equal to w. This ensures that each cell of the dynamic A generalisation of this problem—instead of searching programming matrix can be computed in constant time. For for all matches of the pattern—is to search for all matches instance, a na¨ıve implementation running on a regular 64-bit of any factor of some fixed length h ≤ m of the pat- computer, cannot handle alignments for h > 64. Hence it is tern. This problem is known as fixed-length approximate well understood that, while this may be sufficient for short string matching and it was introduced by Iliopoulos et alignments, it is not general enough for certain applications. al. in [12]. Crochemore et al. devised MaxShiftM [6], a bit-parallel algorithm for fixed-length approximate string OUR CONTRIBUTION matching under the Hamming distance model with time In this article, we present MW-MaxShiftM, a robust and complexity O(mdh=wen). MaxShiftM is currently the core generalised implementation of MaxShiftM which overcomes computational task of at least two applications in molecular the limitations of a na¨ıve implementation of the algorithm, biology: so that it no longer requires that the maximal length of the • MoTeX: a tool for single and structured motif extraction alignments is less than or equal to w (h ≤ w). With MW- from large-scale datasets [18], [19]. With MoTeX, MaxShiftM, the fixed length h can be set as long as length MaxShiftM is used to detect overrepresented motifs m of pattern x (h ≤ m). in the same or related genetic sequences. Regulatory MW-MaxShiftM carries out a similar process as regions such as promoter and enhancer binding sites MaxShiftM but stores the alignments across dh=we can be found surrounding genes in DNA.

Load more