External Inverse Pattern Matching

Leszek Gąsieniec*        Piotr Indyk†        Piotr Krysta‡

Abstract

We consider the external inverse pattern matching problem. Given a text T of length n over an ordered alphabet Σ, such that |Σ| = σ, and a number m ≤ n, the problem is to find a pattern P'_MAX ∈ Σ^m which is not a subword of T and which maximizes the sum of Hamming distances between P'_MAX and all subwords of T of length m. We present an optimal O(n log σ)-time algorithm for the external inverse pattern matching problem, which substantially improves the only known polynomial O(nm log σ)-time algorithm, due to Amir, Apostolico and Lewenstein. Moreover, we discuss a fast parallel implementation of our algorithm on the CREW PRAM model.

Topics: algorithms and data structures, string algorithms, parallel algorithms.

* Max-Planck-Institut für Informatik, Im Stadtwald, Saarbrücken, Germany. Email: leszek@mpi-sb.mpg.de.

† Computer Science Department, Stanford University, Gates Building, Stanford, CA, USA. Email: indyk@cs.stanford.edu.

‡ Institute of Computer Science, University of Wrocław, Przesmyckiego, Wrocław, Poland. Email: pkrysta@ii.uni.wroc.pl. Research of this author was partially supported by a KBN grant.

Introduction

Given a string T, later called a text, of length n over an alphabet Σ, the inverse pattern matching problem is to find a word P_MIN (or P_MAX) which minimizes (respectively, maximizes) the sum of Hamming distances between P_MIN (P_MAX) and all subwords of length m in the text T. One can also consider two variations of the problem, in which the optimal word is supposed to occur in the text T or, oppositely, its occurrence in T is forbidden. The two variations are called, respectively, the internal and the external inverse pattern matching problem. It is assumed that in the internal inverse pattern matching the desired internal pattern P_MIN must minimize the sum of distances, whereas in the external case the optimal external pattern maximizes the entire sum. As reported by Amir, Apostolico and Lewenstein, inverse pattern matching appears naturally and finds applications in several fields, like information retrieval, data compression, computer security and molecular biology. For example, the external inverse pattern matching can be used in the context of intrusion or plagiarism detection, or in the synthesis of molecular probes in genome sequencing by hybridization.

It was shown by Amir, Apostolico and Lewenstein that the inverse pattern matching problem can be solved in time O(n log σ) when no additional restriction on P_MAX or P_MIN is assumed. However, it turned out that the internal inverse pattern matching problem appears to be significantly harder. Amir et al. presented two algorithms for this problem. The first algorithm, which is reasonably simple, has running time O(nm log σ). The second one uses more sophisticated techniques, like convolutions for Hamming distance computation, and runs in time O(n √(m log m)). Amir et al. have also shown a reduction from the all mismatches problem to the internal inverse pattern matching. Any improvement for the all mismatches problem is a long-standing open problem, thus it looks quite unlikely that one can get a faster algorithm for the internal inverse pattern matching. However, Amir et al. show that, using approximate mismatch counting techniques, one can get a faster superlinear solution for the internal case when approximate answers are allowed.

The best known (to our knowledge) O(nm log σ)-time algorithm for the external inverse pattern matching was also given by Amir et al. They introduced the notion of m-stems for the text T, i.e., all possible words of length at most m which do not belong to T but whose all proper prefixes form subwords of T. It was shown that the optimal external pattern P'_MAX can be composed of some m-stem of T extended by a proper-size suffix of the maximal word P_MAX. Unfortunately, the straightforward application of the m-stem approach leads to an O(nm log σ)-time solution, because one has to look for m-stems by testing all text subwords of length at most m. In this paper we show how to perform the tests of text subwords more efficiently. We present a new and optimal O(n log σ)-time algorithm for the external inverse pattern matching problem, showing that the internal case is the bottleneck in inverse pattern matching. The optimality comes from the complexity of the element distinctness problem, to which the external inverse pattern matching can be reduced. The new efficient solution is a consequence of a deeper analysis of the relation between the maximal words P_MAX, P'_MAX and the text T. Our main algorithm uses efficient algorithmic techniques, like compact suffix trees, range minimum queries and lowest common ancestor queries in trees, supported by an online computation of symbol weights (defined later).

The rest of the paper is organized as follows. In the next section we introduce the notation and the basic techniques used in our algorithm. The section that follows contains the main algorithm, together with its complexity analysis and a proof of correctness; there we also discuss a parallel implementation of our algorithm. The last section contains final remarks and states some open problems in the related areas.

Preliminaries

Let Σ be an ordered alphabet containing σ symbols, i.e., |Σ| = σ. Any sequence of concatenated symbols from Σ is called a word or a string. We use the symbol · to denote the operation of concatenation, but the symbol is omitted in cases where the use of concatenation is natural. We write w[i] for the i-th symbol of the word w, and w[i..j] for the substring of w which starts at position i and ends at position j; w' stands for the string w without its first symbol, while the symbol ε stands for the empty string. For example, let w be a string of length n, i.e., |w| = n. Then w = w[1..n], w' = w[2..n], w[i..i] = w[i], and w[i..j] = ε when i > j. Any subword of w of the form w[1..i], for i ∈ {1, ..., n}, is called a prefix of w, and any subword of the form w[j..n], for j ∈ {1, ..., n}, is called a suffix of w. We write u ⊄ w when u is not a subword of the string w. In case u ⊄ w we say that the word u is an external string for w. A tree which stores a given set of words S, and whose edges are labeled by symbols drawn from the alphabet Σ, is called a trie for S. Any sequence v_1, ..., v_k of neighboring nodes (with respect to the parent-child relation) in a tree, such that all v_i's are pairwise distinct, is called a path. Each word w ∈ S is represented in the trie as a path from the root to some leaf. Recall that a trie is a prefix tree, i.e., two words share a common path from the root as long as they share the same prefix. A path v_1, ..., v_k in which every node but the last has degree one (i.e., exactly one child) is called a chain.

Problem Definition

Let T be a text over Σ such that |T| = n, and let P be a string over Σ such that |P| = m ≤ n. The Hamming distance between the word P and the subword of T of length |P| starting at position i is defined as follows:

    H(P, T[i..i+|P|-1]) = Σ_{j=1..|P|} h(P[j], T[i+j-1]),

where, for any two symbols a, b ∈ Σ,

    h(a, b) = 1 if a ≠ b,  and  h(a, b) = 0 if a = b.

In other words, the Hamming distance H(P, T[i..i+|P|-1]) gives the number of mismatches between the symbols of the aligned words T[i..i+|P|-1] and P. In this paper we are primarily interested in the total Hamming distance between the word P and all its alignments in the text T, which is defined as

    H(P, T) = Σ_{i=1..|T|-|P|+1} H(P, T[i..i+|P|-1]).
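As a small illustration of these definitions, the following sketch (our own naming, not part of the original algorithm) computes both quantities directly from the formulas above; it is the naive O(nm)-time reference computation, not the algorithm developed in this paper.

def hamming(P, W):
    # H(P, W): number of mismatching positions between the equal-length words P and W
    assert len(P) == len(W)
    return sum(1 for a, b in zip(P, W) if a != b)

def total_hamming(P, T):
    # H(P, T): sum of Hamming distances between P and all full alignments of P in T
    m = len(P)
    return sum(hamming(P, T[i:i + m]) for i in range(len(T) - m + 1))

For example, total_hamming("ab", "aabb") returns 2: the three alignments contribute distances 1, 0 and 1.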

Now we are ready to introduce the entire problem.

Problem (External Inverse Pattern Matching). Given a text string T ∈ Σ^n, where |Σ| = σ, and a positive integer m such that m ≤ n, the entire problem is to find a pattern P'_MAX ∈ Σ^m such that P'_MAX ⊄ T and H(P'_MAX, T) ≥ H(P, T) for all strings P ∈ Σ^m which do not belong to T.

Notice that if the text T contains all possible strings of length m, then the external inverse pattern matching has no solution.

According to the definition of the entire problem, the i-th symbol of the desired pattern P'_MAX can be aligned only with positions i to n−m+i in the text T, since we are interested only in full alignments of the pattern P'_MAX in T. The latter observation naturally defines m different ranges in the text T, such that the i-th range is associated with the i-th position in the pattern P'_MAX. We call these m consecutive ranges windows Win_1, ..., Win_m; simply, Win_i = T[i..n−m+i]. Let π_i be the frequency function defined in the i-th window Win_i, i.e., π_i(a) equals the number of all occurrences of the symbol a in Win_i, for all a ∈ Σ and i ∈ {1, ..., m}. The weight w_i of symbols in the i-th window is defined as follows:

    w_i(a) = (n − m + 1) − π_i(a),  for all i ∈ {1, ..., m} and a ∈ Σ.

In terms of the weights, the entire problem can be viewed as looking for a string P ⊄ T of length m which maximizes the sum

    Σ_{i=1..m} w_i(P[i]).

Notice that the sum is maximized when the i-th position of the pattern P is occupied by the least frequent (or, equivalently, the heaviest) symbol in the window Win_i, which corresponds to the definition of the maximal word P_MAX, the maximal solution of the general inverse pattern matching. However, in the external inverse pattern matching the optimal pattern P'_MAX maximizes the sum among all external strings for the text T.
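The weighted reformulation can be made concrete with the following sketch (our own naming and interface, for illustration only): it computes the window frequencies by sliding from Win_1 to Win_m and evaluates the sum above, which for any pattern P of length m equals the total Hamming distance H(P, T).

def window_weights(T, m):
    # weights[i][a] = (n-m+1) - (number of occurrences of a in Win_{i+1} = T[i..n-m+i]),
    # computed by sliding the window one position at a time (0-based Python indices).
    n = len(T)
    alignments = n - m + 1
    freq = {}
    for a in T[:alignments]:                     # Win_1
        freq[a] = freq.get(a, 0) + 1
    weights = []
    for i in range(m):
        weights.append({a: alignments - f for a, f in freq.items()})
        if i + 1 < m:                            # slide: T[i] leaves, T[alignments+i] enters
            freq[T[i]] -= 1
            freq[T[alignments + i]] = freq.get(T[alignments + i], 0) + 1
    return weights

def total_weight(P, T):
    # Evaluates sum_{i=1..m} w_i(P[i]); a symbol absent from Win_i has weight n-m+1.
    m = len(P)
    alignments = len(T) - m + 1
    weights = window_weights(T, m)
    return sum(weights[i].get(P[i], alignments) for i in range(m))

Note that the table of all window weights occupies Θ(mσ) space; the sequential algorithm below never builds it explicitly and instead maintains a single window online.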

Throughout the rest of the paper we use this weighted version of the external inverse pattern matching. Moreover, before the main algorithm starts, we transform the input string into one which consists of numbers from the range 1, ..., σ, such that every symbol from the alphabet Σ is substituted by a unique number. Thus from now on we assume that symbols can be treated as small numbers. Since the alphabet Σ is ordered, the transformation can be performed simply in time O(n log σ), which does not violate the complexity of the entire algorithm.

Basic Techniques

In the following subsections we recall some basic techniques used in our algorithm.

Compact suffix tree and compact trie

A suffix tree of a word w is a trie which represents all suffixes of w. Notice that in the worst case the size of the suffix tree can be quadratic in the size of the input string. However, since the suffix tree has exactly |w| = n leaves, corresponding to all the suffixes, it can be stored in linear space as follows. Every chain in the suffix tree is represented by a pair of integers (i, j) which refers to the subword w[i..j]. There are exactly n leaves in the suffix tree, thus the number of internal nodes of degree greater than one and the number of chains are both not greater than n. The linear representation of a suffix tree is called a compact suffix tree, and it is a known fact that for a word w with |w| = n it can be constructed in time O(n log σ). In this paper we also work with tries with a compact description of chains. Recall that a chain is a path v_1, ..., v_k whose all nodes but the last have degree one. For our purposes only subwords of a single text are stored in the trie, which means that all the chains in the trie represent substrings of the same text. All the chains are exchanged by edges labeled by pairs of indices describing a position of the corresponding subword in the text. It is important that our definition of a chain implies that each node of the trie of degree greater than one has all outgoing edges of length one, which means that these edges are labeled by single symbols.

Range minimum search

Given a vector V[1..n] of n numbers, a range query for a pair (i, j), where 1 ≤ i ≤ j ≤ n, asks for the minimum among all numbers in the range V[i..j]. The main goal in the range minimum search problem is to preprocess the vector V efficiently, such that the range queries can be answered as fast as possible. Gabow, Bentley and Tarjan gave a linear-time preprocessing algorithm for the range minima that results in constant-time query retrieval.
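For illustration, here is a simple sparse-table range minimum structure (our own code); it preprocesses in O(n log n) time rather than the linear time of the scheme cited above, but it answers queries in constant time, which is all the algorithm below relies on.

class RangeMin:
    # Sparse table: table[k][i] = min of V[i..i+2^k-1]; queries combine two blocks.
    def __init__(self, V):
        n = len(V)
        self.log = [0] * (n + 1)
        for i in range(2, n + 1):
            self.log[i] = self.log[i // 2] + 1
        self.table = [list(V)]
        k = 1
        while (1 << k) <= n:
            prev = self.table[k - 1]
            row = [min(prev[i], prev[i + (1 << (k - 1))]) for i in range(n - (1 << k) + 1)]
            self.table.append(row)
            k += 1

    def query(self, i, j):
        # minimum of V[i..j], 0-based and inclusive
        k = self.log[j - i + 1]
        return min(self.table[k][i], self.table[k][j - (1 << k) + 1])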

Lowest common ancestor in a tree

Let T be a rooted tree with a root r. For any node x ∈ T let branch(x) denote the path from the node x to the root r, and let depth(x) denote the distance (the length of the path) from the node x to the root r. Given two nodes v, w ∈ T, the node v is an ancestor of the node w iff v ∈ branch(w). For example, the root r is an ancestor of all nodes in T. A lowest common ancestor of any pair of nodes v, w ∈ T is a node u ∈ T with the greatest possible depth(u), such that u ∈ branch(v) and u ∈ branch(w). Harel and Tarjan showed that after a linear-time preprocessing of the tree T all lowest common ancestor queries can be answered in constant time. Moreover, Schieber and Vishkin introduced a simpler algorithm with the same sequential time bounds and its parallel counterpart with linear-work, O(log n)-time preprocessing and constant-time queries.
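The following binary-lifting sketch (ours) illustrates the branch/depth/ancestor notions above; it preprocesses in O(n log n) time and answers queries in O(log n) time, a simpler stand-in for the constant-time-query schemes of Harel-Tarjan and Schieber-Vishkin.

from collections import deque

class LCA:
    # Binary-lifting lowest common ancestor structure.
    def __init__(self, children, root):
        # children: dict mapping a node to the list of its children in a rooted tree
        self.depth = {root: 0}
        parent = {root: root}
        order = [root]
        dq = deque([root])
        while dq:                                  # BFS computes depths and parents
            v = dq.popleft()
            for c in children.get(v, []):
                self.depth[c] = self.depth[v] + 1
                parent[c] = v
                order.append(c)
                dq.append(c)
        self.LOG = max(1, max(self.depth.values()).bit_length())
        self.up = [parent]                         # up[k][v] = 2^k-th ancestor of v
        for k in range(1, self.LOG + 1):
            prev = self.up[k - 1]
            self.up.append({v: prev[prev[v]] for v in order})

    def query(self, v, w):
        # lift the deeper node, then lift both nodes until their branches meet
        if self.depth[v] < self.depth[w]:
            v, w = w, v
        diff = self.depth[v] - self.depth[w]
        for k in range(self.LOG + 1):
            if diff & (1 << k):
                v = self.up[k][v]
        if v == w:
            return v
        for k in range(self.LOG, -1, -1):
            if self.up[k][v] != self.up[k][w]:
                v, w = self.up[k][v], self.up[k][w]
        return self.up[0][v]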

External Inverse Pattern Matching Algorithm

We start this section by recalling known facts about the external inverse pattern matching introduced by Amir et al. The following notion of m-stems plays a crucial role in the approach of Amir et al. as well as in our algorithm.

Definition 1. Any string R = R[1..l] over the alphabet Σ, for l ∈ {1, ..., m}, is called an m-stem for the text T = T[1..n] iff the word R[1..l−1] is a subword of T[1..n−m+l−1] but R[1..l] is not a subword of T[1..n−m+l].

Assume that we have already computed the maximal word P_MAX using the techniques of Amir et al. Recall that the cell P_MAX[i] contains the heaviest symbol in the window Win_i. Roughly speaking, the construction of the word P_MAX can be done as follows. First one computes the frequencies of all symbols in the window Win_1, which can be done in linear time. Since the difference between the symbol frequencies in any two neighboring windows is small (only one symbol comes in and only one goes out when we change the window), we can compute the heaviest symbols in the consecutive windows Win_2, ..., Win_m spending constant time per window. Additionally, we compute two arrays F_1[1..m] and F_2[1..m], where F_1[i] contains the second heaviest symbol in the window Win_i and F_2[i] contains the difference w_i(P_MAX[i]) − w_i(F_1[i]). The arrays are called the tables of flips for the pattern P_MAX. A more detailed description of the data structure which provides the weights of symbols in the consecutive windows is given in the preprocessing phase below.
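A direct sketch of this construction is given below (our own naming; it assumes the alphabet has at least two symbols and covers every symbol of T). It re-ranks the symbols of every window by sorting, so it runs in O(m σ log σ) time; the online weight structure described in the preprocessing phase obtains the same tables with only constant extra time per window.

def maximal_word_and_flips(T, m, alphabet):
    # P_MAX[i] = heaviest symbol of window Win_{i+1} (0-based i),
    # F1[i]    = second heaviest symbol of Win_{i+1},
    # F2[i]    = w_{i+1}(P_MAX[i]) - w_{i+1}(F1[i])   (weight lost by a flip).
    n = len(T)
    alignments = n - m + 1
    freq = {a: 0 for a in alphabet}
    for a in T[:alignments]:                       # frequencies in Win_1
        freq[a] += 1
    p_max, F1, F2 = [], [], []
    for i in range(m):
        ranked = sorted(alphabet, key=lambda a: freq[a])   # least frequent = heaviest
        best, second = ranked[0], ranked[1]
        p_max.append(best)
        F1.append(second)
        F2.append(freq[second] - freq[best])       # equals w_i(best) - w_i(second)
        if i + 1 < m:                              # slide the window: T[i] out, T[alignments+i] in
            freq[T[i]] -= 1
            freq[T[alignments + i]] += 1
    return "".join(p_max), F1, F2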

If P_MAX ⊄ T, then we take P_MAX as the desired pattern P'_MAX and the external inverse pattern matching is solved. Otherwise, if P_MAX is a subword of T, then the following fact holds.

Fact 1. If P'_MAX is the solution of the external inverse pattern matching problem, then P'_MAX = α · P_MAX[|α|+1..m], where α is some m-stem for the text T with |α| ≤ m.

According to Fact 1, the search for the optimal pattern P'_MAX has been reduced to testing the weights of all possible words of the form α · P_MAX[|α|+1..m], one for each m-stem α. In fact, Amir et al. reduced the number of words to be tested to O(nm), skipping over non-reasonable solutions. More precisely, they build a trie for all the substrings of T of length m and traverse it node by node in BFS order, testing the maximal external string leaving the trie at the current node. Since the size of the trie is O(nm), their approach gives an algorithm with running time O(nm log σ). In this paper we show how to search the nodes of the trie more efficiently.

Let v be a node of the trie at depth k. Let C(v) = {c_1, ..., c_l} be the set of children of the node v, and let X(v) = {x_1, ..., x_l} be the set of symbols appearing at the first positions of the edges e_1, ..., e_l connecting the node v with its respective children. Let s be the string represented by the path from the root of the trie to v (see Figure 1). Moreover, let y be the heaviest symbol in the window Win_{k+1} which is not in X(v), i.e., w_{k+1}(y) ≥ w_{k+1}(z) for all z ∈ Σ \ X(v). The following lemma shows the advantage of the m-stem approach.

Lemma 1. The string s · y · P_MAX[k+2..m] is the heaviest possible external string leaving the trie at the node v.

Proof. The weight of the string s is fixed and the suffix P_MAX[k+2..m] is the heaviest possible extension (see the definition of P_MAX); thus the string s · y · P_MAX[k+2..m] is the heaviest possible external string leaving the trie at the node v. □

Let X'(v) = {x'_1, ..., x'_{l'}} ⊆ X(v) be the set of symbols appearing at the first positions of the edges e'_1, ..., e'_{l'} (where e'_p = e_q iff x'_p = x_q), such that w_{k+1}(x'_i) < w_{k+1}(y) for all i ∈ {1, ..., l'}.

Figure 1: The heaviest external string s · y · P_MAX[k+2..m] leaving the trie at the node v.

Lemma 2. The maximal external string s · y · P_MAX[k+2..m] leaving the trie at the node v is heavier than all strings of the form s · x'_i · P_MAX[k+2..m], for all i ∈ {1, ..., l'}, and there is no need to test external strings going out of the trie at and below the edges e'_1, ..., e'_{l'}.

Proof. Notice that all external strings passing through the node v have the same prefix s. Since the suffix y · P_MAX[k+2..m] is heavier than all possible suffixes starting with symbols of the set X'(v), the result follows. □

Lemma 2 has an interesting consequence when the maximal external symbol y equals P_MAX[k+1].

Corollary 1. If the maximal external symbol y = P_MAX[k+1], then the string s · P_MAX[k+1..m] is the heaviest possible external string passing through the node v, and there is no need to traverse the trie below the node v.

The advantage of Corollary 1 becomes clearer when it is used in the context of the nodes of some chain in the trie. The following lemma plays a crucial role in our efficient searching of chains in the trie of text subwords.

Lemma 3. Let ρ_u = u[1..r] be the string represented by the only chain going out from a node v (i.e., v has degree one). If u[1..r] ≠ P_MAX[k+1..k+r] then

(A) the string s · P_MAX[k+1..m] is the heaviest possible external string passing through the node v, and there is no need to search the trie below v; otherwise

(B) let j be the position in the word u[1..r] for which the corresponding difference in the flip table, F_2[k+j], is the smallest in the range [k+1..k+r]; then the word s · P_MAX[k+1..k+j−1] · F_1[k+j] · P_MAX[k+j+1..m] is the heaviest possible external string among all external strings leaving the trie at the nodes of the chain ρ_u. In this case the part of the trie below the chain ρ_u is a subject of further search.

Figure 2: The heaviest external string leaving the trie at the chain ρ_u; the flipped symbol is y = F_1[k+j].

Proof.

Ad (A): The string s · P_MAX[k+1..m] is an external string for T, and it is the heaviest possible external string which passes through the node v in the trie.

Ad (B): It is enough to change only one symbol in the word u to create an external string which leaves the trie at some node of the chain ρ_u (see Figure 2 and Fact 1). According to the definition of the tables of flips, the position j in u gives the minimal loss of weight among all possible swaps of one symbol in u. Since it is still possible that the maximal external string leaves the trie below the chain ρ_u, the part of the trie hanging below ρ_u is a subject of further search. □

Now we are ready to present our main algorithm

Algorithm

The algorithm consists of two stages. The first one, called the preprocessing, contains the construction and initialization of all data structures used later during the actual search. The second stage, called the searching phase, consists of the actual construction of the desired optimal external solution P'_MAX.

Preprocessing

First of all we find the maximal pattern P_MAX using the techniques of Amir et al. If P_MAX is an external string for the text T, which can be checked by any string matching algorithm (e.g. the Knuth-Morris-Pratt algorithm), then we are done. Otherwise, instead of the full trie of text subwords, we build a compact trie T_T. It is obtained from a compact suffix tree for the text T by cutting all deep paths from the root at depth m and skipping all shallow paths shorter than m. At every node v of T_T we keep information about the string s (a subword of the text T) which is represented by the path from the root of the trie to the node v. Additionally, we build a common suffix tree T' for the text T and the maximal word P_MAX (i.e., a suffix tree for the word T · P_MAX), and we preprocess it for LCA queries. The construction of both trees, as well as the preprocessing for LCA queries, can be done in time O(n log σ).

An online computation of the weights of symbols in the consecutive windows plays a crucial role both in the preprocessing and in the searching phase. According to the needs of the algorithm, the data structure which represents the weights of symbols must also maintain the current order between the weighted symbols. The data structure is represented by an array M, with one cell for every possible weight value, such that the i-th cell of the array contains a pointer to a doubly-linked (horizontal) list of all symbols having weight i. Moreover, all nonempty cells of the array are connected into a doubly-linked (vertical) list. The nonempty cell of the array M with the largest index, which contains the list of the heaviest symbols, is accessible directly via a variable max. Symbols in the lists are also accessible directly through a symbol index. The construction (initialization) of the data structure corresponding to Win_1 can be done simply in linear O(n) time, since we assumed that the symbols of the alphabet Σ have been substituted by unique numbers from the range 1, ..., σ. The data structure supports the following three operations.

The first operation returns the weight of any symbol a in the current window. The weight of the symbol a corresponds to the position (in the array M) of the horizontal list containing a. Since the symbols in the horizontal lists are accessible through the symbol index, this operation works in constant time.

The second operation is needed when the algorithm changes the window from Win_i to Win_{i+1}, for i ∈ {1, ..., m−1}. When the window is changed, only two symbols change their weights slightly: T[i] goes out of the window and the symbol T[n−m+i+1] comes into the window. We find the weight w_i(T[i]) using the symbol index, we exclude the symbol T[i] from the list linked at M[w_i(T[i])], and then the symbol T[i] is inserted at the beginning of the list linked at M[w_i(T[i]) + 1] (its frequency drops by one, so its weight grows by one). In the meantime, the pointers to the neighbors of M[w_i(T[i])] and M[w_i(T[i]) + 1] in the vertical list are modified, if necessary. Finally, the entry of T[i] in the symbol index is updated. A symmetric operation is performed for the symbol T[n−m+i+1], which moves one level down in the array M. Thus the whole step can be implemented in constant time.

The third operation is performed when we look for the heaviest symbol in a window Win_i which does not belong to a given set of symbols X(v) = {x_1, ..., x_l}. After a sequence of l deletions in the horizontal lists and at most l deletions in the vertical list, the desired symbol is accessible at M[max]. Finally, the current structure of symbol weights is restored by the reverse sequence of insertions into the horizontal lists and the vertical list. The whole step can be implemented in time O(l).
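The sketch below (our own code and naming) mimics this structure with weight buckets: weight_of plays the role of the symbol index, bucket plays the role of the horizontal lists, and max_w plays the role of the variable max. For brevity it scans weight levels directly where the paper's doubly-linked vertical list would give genuinely constant-time steps, so it illustrates the interface of the three operations rather than their exact time bounds; it also assumes that the given alphabet covers every symbol of T.

class WindowWeights:
    def __init__(self, T, m, alphabet):
        self.alignments = len(T) - m + 1
        freq = {a: 0 for a in alphabet}
        for a in T[:self.alignments]:              # frequencies of symbols in Win_1
            freq[a] += 1
        self.weight_of = {a: self.alignments - freq[a] for a in alphabet}
        self.bucket = {}
        for a, w in self.weight_of.items():
            self.bucket.setdefault(w, set()).add(a)
        self.max_w = max(self.bucket)

    def weight(self, a):
        # operation 1: the current weight of the symbol a
        return self.weight_of[a]

    def _move(self, a, delta):
        w = self.weight_of[a]
        self.bucket[w].discard(a)
        if not self.bucket[w]:
            del self.bucket[w]
        self.weight_of[a] = w + delta
        self.bucket.setdefault(w + delta, set()).add(a)
        self.max_w = max(self.bucket)              # the vertical list of the paper keeps this O(1)

    def shift(self, out_sym, in_sym):
        # operation 2: move from Win_i to Win_{i+1};
        # out_sym leaves the window (weight grows), in_sym enters it (weight shrinks)
        self._move(out_sym, +1)
        self._move(in_sym, -1)

    def heaviest_excluding(self, X):
        # operation 3: the heaviest symbol not in X; the paper achieves O(|X|) by
        # temporarily deleting the symbols of X, here we simply scan downwards
        for w in sorted(self.bucket, reverse=True):
            for a in self.bucket[w]:
                if a not in X:
                    return a
        return None

For a text T and a window length m, shifting from Win_1 to Win_2 corresponds to ws.shift(T[0], T[len(T)-m+1]) with 0-based string indices.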

Using the online computation of weights, we compute in time O(m) the flip tables: F_1[1..m], which contains the second heaviest symbols of the consecutive windows, and F_2[1..m], whose i-th cell contains the difference w_i(P_MAX[i]) − w_i(F_1[i]). The second table, F_2, is preprocessed in linear time for minimum range queries. Finally, we compute the weights of all suffixes of the maximal pattern P_MAX and store them in a table S[1..m], also in time O(m).

Searching phase

The searching phase consists of two rounds. During the first search of T_T we compute, for every node v, the heaviest external string (called a candidate) which leaves the trie at the node v or at a chain placed under the node v. Finally, if there is any candidate, the trie is searched again to find the maximal external pattern P'_MAX. Otherwise the entire problem has no solution.

During the first round the algorithm traverses the tree T_T in a BFS-like order, such that children are inserted into a waiting list according to their depth in the tree. Assume that the algorithm has just taken from the waiting list a node v of depth k in T_T. It is assumed, recursively, that the weight of the string s represented by the path from the root of T_T to the node v has already been computed, and that the online weight data structure is currently set to answer queries in the window Win_{k+1}.

If the node v has degree greater than one, then all edges going out of v are labeled by single symbols (by the definition of the compact trie, see the Preliminaries). We find the maximal external symbol y using the online weight data structure (the third operation). The weight of the word s · y · P_MAX[k+2..m] is composed of the weight of the string s (stored at the node v), the weight of the symbol y (given by the weight function in the current window), and the weight of the suffix of the pattern P_MAX (stored in the table S). The weight of the word s · y · P_MAX[k+2..m] is stored at the node v. According to Lemma 2, all light edges (and the corresponding subtrees hanging under them), i.e., edges with symbols lighter than y, can be ignored. For the remaining edges we update, at their ending nodes, the information about the weight of the string represented by the path coming from the root, in order to fulfill the recursive assumption; this weight is composed of the weight of the string s stored at v and the weight of the symbol placed on the edge, given by the weight function. All nodes below the heavy edges are inserted into the waiting list at level k + 1.

If the node v is the first node of a chain (its degree is one), we check whether the string ρ_u = u[1..r] represented by the chain symbols is a subword of the pattern P_MAX, i.e., whether u[1..r] = P_MAX[k+1..k+r]. This can be done by asking for the lowest common ancestor, in the common suffix tree T', of the suffix of T which extends the string s (i.e., the suffix of T beginning with ρ_u) and the pattern suffix P_MAX[k+1..m]. If the lowest common ancestor of both suffixes is placed at a depth smaller than r in T', then we know that u[1..r] ≠ P_MAX[k+1..k+r] and, according to part (A) of Lemma 3, we have only one candidate, s · P_MAX[k+1..m], and we do not search the trie below the chain ρ_u. Otherwise, when the equality holds, we recover the candidate from the flip tables F_1 and F_2. First we ask a minimum range query in F_2[k+1..k+r], obtaining the index of the position j whose change gives the smallest loss of weight, and getting the candidate s · P_MAX[k+1..k+j−1] · F_1[k+j] · P_MAX[k+j+1..m]. The information about the weight of the string represented by the path from the root to the node below the chain ρ_u is updated with the help of the table S.

In both cases the time spent at the node v is proportional to the degree of the node v, thus searching the whole trie T_T can be done in time proportional to the size of the trie, i.e., in time O(n). At last, the trie T_T is searched again to find the maximal weight, which is the weight of the maximal external pattern P'_MAX.

Theorem 1. The external inverse pattern matching problem can be solved in optimal time O(n log σ). □

Parallel Approach

In this section we briefly discuss a parallel implementation of our external inverse pattern matching algorithm on the CREW PRAM model. Most of the steps of our algorithm can be easily parallelized when superlinear work and space are allowed.

Theorem 2. The external inverse pattern matching algorithm can be implemented in time O(log n) and work O(n log n + mσ log σ) on the CREW PRAM.

Proof. Both trees T_T and T' can be computed in O(log n) time and O(n log n) work when subquadratic space is available (see Apostolico et al.). Moreover, the tree T' can be preprocessed for LCA queries in time O(log n) and linear work (see Schieber and Vishkin). Since we cannot use the online computation of weights in all windows at the same time, we have to compute the whole table of weights, which is of size O(mσ). The table of weights can easily be computed in time O(log m) and work O(mσ), but since we still need to keep the order between the weights of the symbols, we have to sort all m columns by parallel mergesort (Cole), which gives total work O(mσ log σ). When the table is ready, we compute the pattern P_MAX, the flip tables F_1, F_2, and the table S in time O(log m) and linear work. Then the table F_2 is preprocessed for minimum range queries in logarithmic time and work O(m log m), by computing the minimum in every block of size 2^i, for i = 1, ..., log m. When all data structures are ready, we start the searching algorithm. We assume that every node v of the trie T_T is assigned a number of processors linear in its degree. We compute in constant time the weights of all feasible edges, i.e., the edges of length one placed under nodes of degree greater than one (with the help of the table of weights) and the edges representing chains which are subwords of P_MAX (using LCA queries and the table S). If the path from the root of the trie to the node v is composed only of feasible edges, then the node v is called a feasible node. We use the Euler tour technique (see e.g. Gibbons and Rytter) to compute all feasible nodes and the weights of the strings represented by the paths from the root. The computation of the feasible nodes is done in time O(log n) and linear work. Now, if a feasible node v, placed at depth k and under a string s, is the first node of a chain, the weight of its candidate is composed of the weight of the string s, the table S and the flip tables. If the node v has degree l > 1, then the O(l) processors associated with the node find in logarithmic time the heaviest symbol y in column k+1 after the deletion of the l symbols placed on the edges going out of v; this can be done by testing only the l+1 heaviest symbols of column k+1. In this case the weight of the candidate is composed of the weight of the string s, the symbol y and the table S. When the weights of the candidates are ready, we apply any tree contraction algorithm (see e.g. Gibbons and Rytter) looking for the maximum in the tree, which represents the optimal pattern P'_MAX. □

Conclusion

We have presented a new and optimal O(n log σ)-time algorithm for the sequential external inverse pattern matching, showing that the internal case is the hardest part of inverse pattern matching. It is an interesting question whether there exists a faster algorithm solving the internal inverse pattern matching, but it seems that this question has no simple answer. Another interesting task for further research is to improve the bounds for the external inverse pattern matching in the parallel setting. Notice that if the product of m and σ is small, i.e., mσ = O(n), our parallel implementation is fast and efficient. But if we want to keep linear complexity for all feasible values of n, m and σ, one has to pass the bottleneck hidden in the computation of the weights of symbols in every window Win_i. Another interesting question appears when we ask about the complexity of sequential and parallel inverse pattern matching for other measures of distance, e.g., the edit distance.

References

K. Abrahamson. Generalized string matching. SIAM Journal on Computing.

B. Alberts, D. Bray, J. Lewis, M. Raff, K. Roberts and J.D. Watson. Molecular Biology of the Cell. Garland Publishing, NY.

Amihood Amir, Alberto Apostolico and Moshe Lewenstein. Inverse pattern matching. Manuscript, to appear in Journal of Algorithms.

A. Apostolico, C. Iliopoulos, G.M. Landau, B. Schieber and U. Vishkin. Parallel construction of a suffix tree with applications. Algorithmica.

R. Cole. Parallel merge sort. SIAM Journal on Computing.

M.J. Fischer and M.S. Paterson. String matching and other products. In: Complexity of Computation (R.M. Karp, editor), SIAM-AMS Proceedings.

E. Fredkin. Trie memory. Communications of the ACM.

H.N. Gabow, J.L. Bentley and R.E. Tarjan. Scaling and related techniques for geometry problems. In Proceedings of the ACM Symposium on Theory of Computing (STOC).

A. Gibbons and W. Rytter. Efficient Parallel Algorithms. Cambridge University Press.

R.W. Hamming. Error detecting and error correcting codes. Bell System Technical Journal.

D. Harel and R.E. Tarjan. Fast algorithms for finding nearest common ancestors. SIAM Journal on Computing.

H. Karloff. Fast algorithms for approximately counting mismatches. Information Processing Letters.

D.E. Knuth, J.H. Morris and V.R. Pratt. Fast pattern matching in strings. SIAM Journal on Computing.

D. Russell and G.T. Gangemi Sr. Computer Security Basics. O'Reilly and Associates, Inc., Sebastopol, California.

B. Schieber and U. Vishkin. On finding lowest common ancestors: simplification and parallelization. In Proceedings of the 3rd Aegean Workshop on VLSI Algorithms and Architecture, LNCS.

P. Weiner. Linear pattern matching algorithms. In Proceedings of the IEEE Symposium on Foundations of Computer Science (FOCS).