Pattern Matching in Strings Maxime Crochemore, Christophe Hancart

Pattern matching in strings Maxime Crochemore, Christophe Hancart To cite this version: Maxime Crochemore, Christophe Hancart. Pattern matching in strings. J. Atallah Mikhail. Algo- rithms and Theory of Computation Handbook, CRC Press, pp.11.1-11.28, 1998. hal-00620790 HAL Id: hal-00620790 https://hal-upec-upem.archives-ouvertes.fr/hal-00620790 Submitted on 13 Feb 2013 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Pattern Matching in Strings Maxime Cro chemore Institut Gaspard Monge Universitede MarnelaVallee Christophe Hancart Lab oratoire dInformatique de Rouen Universitede Rouen Introduction The present chapter describ es a few standard algorithms used for pro cessing texts They apply for example to the manipulation of texts text editors to the storage of textual data text compression and to data retrieval systems The algorithms of the chapter are interesting in dierent resp ects First they are basic comp onents used in the implementations of practical software Second they introduce programming metho ds that serve as paradigms in other elds of computer science system or software design Third they play an imp ortant role in theoretical computer science by providing challenging problems Although data are stored variously text remains the main form of exchanging information This is particularly evident in literature or linguistics where data are comp osed of huge corp ora and dictionaries This applies as well to computer science where a large amount of data are stored in linear les And this is also the case in molecular biology where biological molecules can often b e approximated as sequences of nucleotides or aminoacids Moreover the quantity of available data in these elds tends to double every eighteen months This is the reason why algorithms should b e ecient even if the sp eed of computers increases regularly The manipulation of texts involves several problems among which are pattern matching approximate pattern matching comparing strings and text compression The rst problem is partially treated in the present chapter in that we consider only onedimensional ob jects Extensions of the metho ds to higher dimensional ob jects and solutions to the second problem app ear in the chapter headed Generalized Pattern Matching The third problem includes the comparison of molecular sequences and is developed in the corresp onding chapter Finally an entire chapter is devoted to text compression Pattern matching is the problem of lo cating a collection of ob jects the pattern inside raw text In this chapter texts and elements of patterns are strings which are nite sequences of symbols over a nite alphab et Metho ds for searching patterns describ ed by general regular expressions derive from standard parsing techniques see the chapter on formal grammars and languages We fo cus our attention to the case where the pattern represents a nite set of strings Although the latter case is a sp ecialization of the former case it can b e solved with more ecient algorithms Solutions to pattern matching in strings divide in two families In the rst one the pattern is xed This situation o ccurs for example in text editors for the search and substitute commands and in telecommunications for checking tokens In the second family of solutions the text is considered as xed while the pattern is variable This applies to dictionaries and to fulltext data bases for example The eciency of algorithms is evaluated by their worstcase running times and the amount of memory space they require The alphab et the nite set of symbols is denoted by and the whole set of strings over by The length of a string u is denoted by juj it is the length of the underlying nite sequence of symbols The concatenation of two strings u and v is denoted by uv A string v 0 00 is said to b e a factor or a segment of a string u if u can b e written in the from u v u where 0 00 0 0 u u if i ju j and j ju v j we say that the factor v starts at p osition i and ends at p osition j in u the factor v is also denoted by ui j The symbol at p osition i in u that is the i th symbol of u is denoted by ui Matching Fixed Patterns We consider in this section the two cases where the pattern represents a xed string or a xed dictionary a nite set of strings Algorithms search for and lo cate all the o ccurrences of the pattern in any text In the stringmatching problem the rst case it is convenient to consider that the text is examined through a window The window delimits a factor of the text and has usually the length of the pattern It slides along the text from left to right During the search it is p erio dically shifted according to rules that are sp ecic to each algorithm When the window is at a certain p osition on the text the algorithm checks whether the pattern o ccurs there or not by comparing some symbols in the window with the corresp onding aligned symbols of the pattern if there is a whole match the p osition is rep orted During this scan op eration the algorithm acquires from the text information which are often used to determine the length of the next shift of the window Some part of the gathered information can also b e memorized in order to save time during the next scan op eration In the dictionarymatching problem the second case metho ds are based on the use of automata or related data structures The Brute Force Algorithm The simplest implementation of the sliding window mechanism is the brute force algorithm The strategy consists here in sliding uniformly the window one p osition to the right after each scan op eration As far as scans are correctly implemented this obviously leads to a correct algorithm We give b elow the pseudo co de of the corresp onding pro cedure The inputs are a nonempty string x its length m thus m > a string y and its length n The variable p in the pro cedure corresp onds to the current left p osition of the window on the text It is understo o d that the stringtostring comparison in line has to b e pro cessed symbol p er symbol according to a given order BruteForceMatcherx m y n for p from up to n m lo op if y p p m x then rep ort p The time complexity of the brute force algorithm is O m n in the worst case for instance m1 n when a b is searched in a for any two symbol a b 2 satisfying a 6 b if we assume that the rightmost symbol in the window is compared last But its b ehavior is linear in n when searching in random texts The KarpRabin Algorithm Hashing provides a simple metho d for avoiding a quadratic number of symbol comparisons in most practical situations Instead of checking at each p osition p of the window on the text whether the pattern o ccurs here or not it seems to b e more ecient to check only if the factor of the text delimited by the window namely y p p m lo oks like x In order to check the resemblance b etween the two strings a hash function is used But to b e helpful for the stringmatching problem the hash function should b e highly discriminating for strings According to the running times of the algorithms the function should also have the following prop erties to b e eciently computable to provide an easy computation of the value asso ciated with the next factor from the value asso ciated with the current factor The last p oint is met when symbols of alphab et are assimilated with integers and when the hash function say h is dened for each string u 2 by juj1 X juj1i A ui d mo d q hu i=0 where q and d are two constants Then for each string v 2 for each symbols a a 2 hv a is computed from ha v by the formula jv j hv a ha v a d d a mo d q During the search for pattern x it is enough to compare the value hx with the hash value asso ciated with each factor of length m of text y If the two values are equal that is in case of collision it is still necessary to check whether the factor is equal to x or not by symbol comparisons The underlying stringmatching algorithm which is denoted as the KarpRabin algorithm is implemented b elow as the pro cedure KarpRabinMatcher In the pro cedure the values m1 d mo d q hx and hy m are rst precomputed and stored resp ectively in the variables r s and t lines The value of t is then recomputed at each step of the search phase lines It is assumed that the value of symbols ranges from to c the quantity c q is added in line to provide correct computations on p ositive integers KarpRabinMatcherx m y n r s x mo d q t for i from up to m lo op r r d mo d q s s d xi mo d q t t d y i mo d q for p from up to n m lo op t t d y p m mo d q if t s and y p p m x then rep ort p t c q t y p r mo d q Convenient values for d are p owers of in this case all the pro ducts by d can b e computed as shifts on integers The value of q is generally a large prime such that the quantities q d c and c q do not cause overows but it can also b e the value of the implicit mo dulus supp orted by integer op erations An illustration of the b ehavior of the algorithm is given in Figure p

Load more