Pattern matching in strings
Maxime Crochemore, Christophe Hancart

To cite this version:
Maxime Crochemore, Christophe Hancart. Pattern matching in strings. In J. Atallah Mikhail (ed.), Algorithms and Theory of Computation Handbook, CRC Press, pp. 11.1-11.28, 1998. hal-00620790

HAL Id: hal-00620790
https://hal-upec-upem.archives-ouvertes.fr/hal-00620790
Submitted on 13 Feb 2013

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Pattern Matching in Strings

Maxime Crochemore, Institut Gaspard-Monge, Université de Marne-la-Vallée

Christophe Hancart, Laboratoire d'Informatique de Rouen, Université de Rouen

Introduction

The present chapter describes a few standard algorithms used for processing texts. They apply, for example, to the manipulation of texts (text editors), to the storage of textual data (text compression), and to data retrieval systems. The algorithms of the chapter are interesting in different respects. First, they are basic components used in the implementations of practical software. Second, they introduce programming methods that serve as paradigms in other fields of computer science (system or software design). Third, they play an important role in theoretical computer science by providing challenging problems.

Although data are stored variously, text remains the main form of exchanging information. This is particularly evident in literature or linguistics, where data are composed of huge corpora and dictionaries. This applies as well to computer science, where a large amount of data are stored in linear files. And this is also the case in molecular biology, where biological molecules can often be approximated as sequences of nucleotides or amino acids. Moreover, the quantity of available data in these fields tends to double every eighteen months. This is the reason why algorithms should be efficient even if the speed of computers increases regularly.

The manipulation of texts involves several problems, among which are pattern matching, approximate pattern matching, comparing strings, and text compression. The first problem is partially treated in the present chapter, in that we consider only one-dimensional objects. Extensions of the methods to higher-dimensional objects and solutions to the second problem appear in the chapter headed "Generalized Pattern Matching". The third problem includes the comparison of molecular sequences, and is developed in the corresponding chapter. Finally, an entire chapter is devoted to text compression.

Pattern matching is the problem of locating a collection of objects (the pattern) inside raw text. In this chapter, texts and elements of patterns are strings, which are finite sequences of symbols over a finite alphabet. Methods for searching patterns described by general regular expressions derive from standard techniques (see the chapter on formal grammars and languages). We focus our attention on the case where the pattern represents a finite set of strings. Although the latter case is a specialization of the former, it can be solved with more efficient algorithms.

Solutions to pattern matching in strings divide into two families. In the first one, the pattern is fixed. This situation occurs, for example, in text editors for the "search" and "substitute" commands, and in telecommunications for checking tokens. In the second family of solutions, the text is considered as fixed, while the pattern is variable. This applies to dictionaries and to full-text databases, for example.

The efficiency of algorithms is evaluated by their worst-case running times and the amount of memory space they require.

The alphabet, the finite set of symbols, is denoted by Σ, and the whole set of strings over Σ by Σ*. The length of a string u is denoted by |u|; it is the length of the underlying finite sequence of symbols. The concatenation of two strings u and v is denoted by uv. A string v is said to be a factor (or a segment) of a string u if u can be written in the form u′vu″, where u′, u″ ∈ Σ*; if i = |u′| and j = |u′v| − 1, we say that the factor v starts at position i and ends at position j in u; the factor v is also denoted by u[i .. j]. The symbol at position i in u, that is, the (i + 1)-th symbol of u, is denoted by u[i].
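As a concrete aside (ours, not the chapter's), the inclusive factor notation u[i .. j] maps directly onto Python slicing as u[i:j+1]; the helper name `factor` is hypothetical:

```python
def factor(u: str, i: int, j: int) -> str:
    """Return the factor of u starting at position i and ending at
    position j (inclusive bounds, as in the chapter's u[i .. j])."""
    return u[i:j + 1]

u = "abaab"
assert len(u) == 5                  # |u| = 5
assert "aba" + "ab" == u            # concatenation uv
assert factor(u, 1, 3) == "baa"     # factor starting at 1, ending at 3
assert u[2] == "a"                  # symbol at position 2
```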

Matching Fixed Patterns

We consider in this section the two cases where the pattern represents a fixed string or a fixed dictionary (a finite set of strings). Algorithms search for and locate all the occurrences of the pattern in any text.

In the string-matching problem (the first case), it is convenient to consider that the text is examined through a window. The window delimits a factor of the text, and usually has the length of the pattern. It slides along the text from left to right. During the search, it is periodically shifted according to rules that are specific to each algorithm. When the window is at a certain position on the text, the algorithm checks whether the pattern occurs there or not, by comparing some symbols in the window with the corresponding aligned symbols of the pattern; if there is a whole match, the position is reported. During this scan operation, the algorithm acquires from the text information that is often used to determine the length of the next shift of the window. Some part of the gathered information can also be memorized in order to save time during the next scan operation.

In the dictionary-matching problem (the second case), methods are based on the use of automata or related data structures.

The Brute Force Algorithm

The simplest implementation of the sliding window mechanism is the brute force algorithm. The strategy consists here in sliding the window uniformly one position to the right after each scan operation. As long as scans are correctly implemented, this obviously leads to a correct algorithm.

We give below the pseudocode of the corresponding procedure. The inputs are a nonempty string x, its length m (thus m ≥ 1), a string y, and its length n. The variable p in the procedure corresponds to the current left position of the window on the text. It is understood that the string-to-string comparison in the loop has to be processed symbol per symbol according to a given order.

Brute-Force-Matcher(x, m, y, n)
  for p from 0 up to n − m
    loop if y[p .. p + m − 1] = x
           then report p

The time complexity of the brute force algorithm is O(m × n) in the worst case, for instance when a^{m−1}b is searched in a^n for any two symbols a, b ∈ Σ satisfying a ≠ b, if we assume that the rightmost symbol in the window is compared last. But its behavior is linear in n when searching in random texts.
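For readers who want to run the procedure, here is a direct Python transcription (a sketch of the same strategy; the function name is ours):

```python
def brute_force_matcher(x: str, y: str):
    """Report every left position p of the window at which x occurs in y."""
    m, n = len(x), len(y)
    positions = []
    for p in range(n - m + 1):           # p from 0 up to n - m
        if y[p:p + m] == x:              # string-to-string comparison
            positions.append(p)
    return positions

# Example: the pattern "sense" occurs once in the text below.
print(brute_force_matcher("sense", "no defense for sense"))  # [15]
```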

The Karp-Rabin Algorithm

Hashing provides a simple method for avoiding a quadratic number of symbol comparisons in most practical situations. Instead of checking at each position p of the window on the text whether the pattern occurs here or not, it seems to be more efficient to check only if the factor of the text delimited by the window, namely y[p .. p + m − 1], "looks like" x. In order to check the resemblance between the two strings, a hash function is used. But, to be helpful for the string-matching problem, the hash function should be highly discriminating for strings. According to the running times of the algorithms, the function should also have the following properties:

- to be efficiently computable;
- to provide an easy computation of the value associated with the next factor from the value associated with the current factor.

The last point is met when symbols of alphabet Σ are assimilated with integers, and when the hash function, say h, is defined for each string u ∈ Σ* by

h(u) = ( Σ_{i=0}^{|u|−1} u[i] × d^{|u|−1−i} ) mod q

where q and d are two constants. Then, for each string v ∈ Σ* and for any symbols a, a′ ∈ Σ, h(va′) is computed from h(av) by the formula

h(va′) = ((h(av) − a × d^{|v|}) × d + a′) mod q

During the search for pattern x, it is enough to compare the value h(x) with the hash value associated with each factor of length m of text y. If the two values are equal, that is, in case of collision, it is still necessary to check whether the factor is equal to x or not by symbol comparisons.

The underlying string-matching algorithm, which is denoted as the Karp-Rabin algorithm, is implemented below as the procedure Karp-Rabin-Matcher. In the procedure, the values d^{m−1} mod q, h(x), and h(y[0 .. m − 2]) are first precomputed, and stored respectively in the variables r, s, and t (first loop). The value of t is then recomputed at each step of the search phase (second loop). It is assumed that the value of symbols ranges from 0 to c; the quantity (c + 1) × q is added in the last statement of the search loop to provide correct computations on positive integers.

Karp-Rabin-Matcher(x, m, y, n)
  r ← 1
  s ← x[0] mod q
  t ← 0
  for i from 1 up to m − 1
    loop r ← (r × d) mod q
         s ← (s × d + x[i]) mod q
         t ← (t × d + y[i − 1]) mod q
  for p from 0 up to n − m
    loop t ← (t × d + y[p + m − 1]) mod q
         if t = s and y[p .. p + m − 1] = x
           then report p
         t ← ((c + 1) × q + t − y[p] × r) mod q

Convenient values for d are powers of 2; in this case, all the products by d can be computed as shifts on integers. The value of q is generally a large prime, chosen so that the quantities handled during the computations do not cause overflows, but it can also be the value of the implicit modulus supported by integer operations. An illustration of the behavior of the algorithm is given in the figure below.

Figure: An illustration of the behavior of the Karp-Rabin algorithm when searching for the pattern x = sense in the text y = "no defense for sense". Symbols are assimilated with their ASCII codes, and suitable values are chosen for q and d. Among the defined values of h(y[p .. p + m − 1]), only two are equal to h(x), so only two string-to-string comparisons against x are performed.

The worst-case complexity of the above string-matching algorithm is quadratic, as it is for the brute force algorithm, but its expected running time is O(m + n) if parameters q and d are adequate.
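The procedure can be transcribed in Python as follows (a sketch; the choices d = 256 and the particular prime q are illustrative, not prescribed by the chapter):

```python
def karp_rabin_matcher(x: str, y: str, d: int = 256, q: int = 2_147_483_647):
    """Karp-Rabin sketch: compare rolling hashes, then verify every hash
    equality (a possible collision) by a symbol-by-symbol comparison."""
    m, n = len(x), len(y)
    if m == 0 or n < m:
        return []
    r = pow(d, m - 1, q)                    # d^(m-1) mod q
    s = t = 0
    for i in range(m):                      # s = h(x), t = hash of first window
        s = (s * d + ord(x[i])) % q
        t = (t * d + ord(y[i])) % q
    positions = []
    for p in range(n - m + 1):
        if t == s and y[p:p + m] == x:      # collision check by comparison
            positions.append(p)
        if p + m < n:                       # slide: drop y[p], append y[p+m]
            t = ((t - ord(y[p]) * r) * d + ord(y[p + m])) % q
    return positions
```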

The Knuth-Morris-Pratt Algorithm

This section presents the first discovered linear-time string-matching algorithm. Its design follows a tight analysis of a version of the brute force algorithm in which the string-to-string comparison proceeds from left to right. The brute force algorithm wastes the information gathered during the scan of the text. On the contrary, the Knuth-Morris-Pratt algorithm stores the information with two purposes. First, it is used to improve the length of shifts. Second, there is no backward scan of the text.

Consider a given position p of the window on the text. Assume that a mismatch occurs between symbols y[p + i] and x[i] for some i, 0 ≤ i < m (an illustration is given in the figure below). Thus, we have y[p .. p + i − 1] = x[0 .. i − 1] and y[p + i] ≠ x[i]. With regard to the information given by x[0 .. i − 1], interesting shifts are necessarily connected with the borders of x[0 .. i − 1]. A border of a string u is a factor of u that is both a prefix and a suffix of u. Among the borders of x[0 .. i − 1], the longest proper border followed by a symbol different from x[i] is the best possible candidate, subject to the existence of such a border. (A factor v of a string u is said to be a proper factor of u if u and v are not identical, that is, if |v| < |u|.) This introduces the function γ defined for each i ∈ {0, 1, …, m − 1} by

γ[i] = max{k | −1 ≤ k < i, (x[i − k .. i − 1] = x[0 .. k − 1] and x[k] ≠ x[i]) or k = −1}

Then, after a shift of length i − γ[i], the symbol comparisons can resume with y[p + i] against x[γ[i]] in the case where γ[i] ≥ 0, and with y[p + i + 1] against x[0] otherwise. Doing so, we miss no occurrence of x in y, and avoid a backtrack on the text. The previous statement is still valid when no mismatch occurs, that is, when i = m, if we consider for a moment the string x$ instead of x, where $ is a symbol of alphabet Σ occurring nowhere in x. This amounts to completing the definition of function γ by setting

γ[m] = max{k | 0 ≤ k < m and x[m − k .. m − 1] = x[0 .. k − 1]}

The Knuth-Morris-Pratt string-matching algorithm is given in pseudocode below as the procedure Knuth-Morris-Pratt-Matcher. The values of function γ are first computed by the function Better-Prefix-Function given after. In the remainder of the code (the search phase of the algorithm strictly speaking), the value of the variable j is equal to p + i; this simplifies the code, and points out the sequential processing of the text. Observe that the preprocessing phase applies a similar method to the pattern itself, as if y = x[1 .. m − 1].

Figure: An illustration of the shift in the Knuth-Morris-Pratt algorithm when searching for the pattern x = abcababcababa. (a) A mismatch occurs during the scan of the window on the text y; the matching symbols are shown darkly shaded, and the currently analyzed symbols lightly shaded. Avoiding both a backtrack on the text and an immediate mismatch leads to shifting the window to the right, and the string-to-string comparison resumes inside the pattern. (b) The shift is the consequence of an analysis of the list of the proper borders of the matched prefix of x and of the symbols that follow them in x; the prefixes of x that are borders of the matched prefix are right-aligned along the discontinuous vertical line. The border abcab is followed by symbol a, identical to the mismatched pattern symbol, while the expected border ab is followed by symbol c. (c) The values of the function γ for the pattern x.

Knuth-Morris-Pratt-Matcher(x, m, y, n)
  γ ← Better-Prefix-Function(x, m)
  i ← 0
  for j from 0 up to n − 1
    loop while i ≥ 0 and y[j] ≠ x[i]
           loop i ← γ[i]
         i ← i + 1
         if i = m
           then report j − m + 1
                i ← γ[m]

Better-Prefix-Function(x, m)
  γ[0] ← −1
  i ← 0
  for j from 1 up to m − 1
    loop if x[j] = x[i]
           then γ[j] ← γ[i]
           else γ[j] ← i
                loop i ← γ[i]
                while i ≥ 0 and x[j] ≠ x[i]
         i ← i + 1
  γ[m] ← i
  return γ

The algorithm has a worst-case running time in O(m + n), and requires O(m) extra space to store function γ. The linear running time results from the fact that the number of symbol comparisons performed during the preprocessing phase and the search phase is less than 2 × m and 2 × n, respectively. All the previous bounds are independent of the size of the alphabet.
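A Python transcription of both procedures may help (a sketch; the names gamma, better_prefix_function, and kmp_matcher are ours):

```python
def better_prefix_function(x: str):
    """gamma[i] = length of the longest proper border of x[0..i-1]
    followed by a symbol different from x[i] (-1 if none);
    gamma[m] = length of the longest proper border of x."""
    m = len(x)
    gamma = [0] * (m + 1)
    gamma[0] = -1
    i = 0
    for j in range(1, m):
        gamma[j] = gamma[i] if x[j] == x[i] else i
        while i >= 0 and x[j] != x[i]:
            i = gamma[i]
        i += 1
    gamma[m] = i
    return gamma

def kmp_matcher(x: str, y: str):
    """Report all occurrences of x in y, scanning y strictly left to right."""
    gamma = better_prefix_function(x)
    m, i, positions = len(x), 0, []
    for j in range(len(y)):
        while i >= 0 and y[j] != x[i]:
            i = gamma[i]                 # defer to the next candidate border
        i += 1
        if i == m:                       # whole match ending at position j
            positions.append(j - m + 1)
            i = gamma[m]
    return positions
```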

The Boyer-Moore Algorithm

The Boyer-Moore algorithm is considered as the most efficient string-matching algorithm in usual applications. A simplified version of it, or the entire algorithm, is often implemented in text editors for the "search" and "substitute" commands.

The scan operation proceeds from right to left in the window on the text, instead of left to right as in the Knuth-Morris-Pratt algorithm. In case of a mismatch, the algorithm uses two functions to shift the window. These two shift functions are called the better-factor shift function and the bad-symbol shift function. In the next two paragraphs, we explain the goal of the two functions, and we give procedures to precompute their values.

We first explain the aim of the better-factor shift function. Let p be the current left position of the window on the text. Assume that a mismatch occurs between symbols y[p + i] and x[i] for some i, 0 ≤ i < m (an illustration is given in the figure below). Then, we have y[p + i] ≠ x[i] and y[p + i + 1 .. p + m − 1] = x[i + 1 .. m − 1]. The better-factor shift consists in aligning the factor y[p + i + 1 .. p + m − 1] with its rightmost occurrence x[k + 1 .. m − 1 − i + k] in x preceded by a symbol x[k] different from x[i], to avoid an immediate mismatch. If no such factor exists, the shift consists in aligning the longest suffix of y[p + i + 1 .. p + m − 1] with a matching prefix of x. The better-factor shift function β is defined by

β[i] = min({i − k | 0 ≤ k < i, x[k + 1 .. m − 1 − i + k] = x[i + 1 .. m − 1] and x[k] ≠ x[i]}
          ∪ {i − k | i − m ≤ k < 0 and x[i − k .. m − 1] = x[0 .. m − 1 − i + k]})

Figure: An illustration of the better-factor shift in the Boyer-Moore algorithm when searching for the pattern x = babacbababa. (a) The string-to-string comparison, which proceeds from right to left, stops with a mismatch inside the window; the window is shifted to the right so as to avoid an immediate mismatch. (b) Indeed, the string aba is repeated three times in x, but is preceded each time by symbol b; the expected matching factor in x is then the prefix ba of x. The factors of x identical with aba, and the prefixes of x ending with a suffix of aba, are right-aligned along the rightmost discontinuous vertical line. (c) The values of the shift function β for the pattern x.

Figure: An illustration of the bad-symbol shift in the Boyer-Moore algorithm when searching for the pattern x = babacbababa. (a) The string-to-string comparison stops with a mismatch inside the window. Considering only this position and the unexpected symbol occurring at this position, namely symbol c, leads to shifting the window to the right; had the unexpected symbol been a or d, the applied shift would have been different. (b) The values of the table λ for pattern x, when alphabet Σ is reduced to {a, b, c, d}.

for each i ∈ {0, 1, …, m − 1}. The value β[i] is then exactly the length of the shift induced by the better-factor shift. The values of function β are computed by the function given below as the function Better-Factor-Function. An auxiliary table, namely f, is used; it is an analogue of the function γ used in the Knuth-Morris-Pratt algorithm, but defined this time for the reverse pattern; it is indexed from 0 to m − 1. The running time of the function Better-Factor-Function is O(m).

Better-Factor-Function(x, m)
  for j from 0 up to m − 1
    loop β[j] ← 0
  i ← m
  for j from m − 1 down to 0
    loop f[j] ← i + 1
         while i < m and x[j] ≠ x[i]
           loop if β[i] = 0
                  then β[i] ← i − j
                i ← f[i] − 1
         i ← i − 1
  for j from 0 up to m − 1
    loop if β[j] = 0
           then β[j] ← i + 1
         if j = i
           then i ← f[i] − 1
  return β

We now come to the aim of the bad-symbol shift function (an illustration is given in the figure above). Consider again the text symbol y[p + i] that causes a mismatch. Assume first that this symbol occurs in x[0 .. m − 2]. Then, let k be the position of the rightmost occurrence of y[p + i] in x[0 .. m − 2]. The window can be shifted i − k positions to the right if k < i, and only one position otherwise, without missing an occurrence of x in y. Assume now that symbol y[p + i] does not occur in x. Then, no occurrence of x in y can overlap the position p + i on the text, and thus the window can be shifted i + 1 positions to the right. Let λ be the table indexed on alphabet Σ, and defined for each symbol a by

λ[a] = min({m} ∪ {m − 1 − j | 0 ≤ j ≤ m − 2 and x[j] = a})

According to the above discussion, the bad-symbol shift for the unexpected text symbol a aligned with the symbol at position i on the pattern is the value

δ[a, i] = max{λ[a] + i − m + 1, 1}

which defines the bad-symbol shift function δ on Σ × {0, 1, …, m − 1}. We now give the code of the function Last-Occurrence-Function that computes table λ. Its running time is O(m + card Σ).

Last-Occurrence-Function(x, m)
  for each a ∈ Σ
    loop λ[a] ← m
  for j from 0 up to m − 2
    loop λ[x[j]] ← m − 1 − j
  return λ

The shift applied in the Boyer-Moore algorithm in case of a mismatch is the maximum between the better-factor shift and the bad-symbol shift. In case of a whole match, the shift applied to the window is m minus the length of the longest proper border of x, that is, the value β[0]; this value is indeed what is called the period of the pattern. The code of the entire algorithm is given below.

Boyer-Moore-Matcher(x, m, y, n)
  β ← Better-Factor-Function(x, m)
  λ ← Last-Occurrence-Function(x, m)
  p ← 0
  while p ≤ n − m
    loop i ← m − 1
         while i ≥ 0 and y[p + i] = x[i]
           loop i ← i − 1
         if i ≥ 0
           then p ← p + max{β[i], λ[y[p + i]] + i − m + 1}
           else report p
                p ← p + β[0]

The worst-case running time of the algorithm is quadratic. It is surprising, however, that, when used to search only for the first occurrence of the pattern, the algorithm runs in linear time. Slight modifications of the strategy yield linear-time algorithms. When searching for a^{m−1}b in a^n (with a, b ∈ Σ and a ≠ b), the algorithm considers only ⌊n/m⌋ symbols of the text. This bound is the absolute minimum for any string-matching algorithm in the model where only the pattern is preprocessed. Indeed, the algorithm is expected to be extremely fast on large alphabets (relative to the length of the pattern).
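The following Python sketch assembles the two shift tables and the matcher. For clarity, the better-factor table is computed naively, straight from its definition, instead of with the linear-time function above; all names are ours:

```python
def better_factor_table(x: str):
    """beta[i]: better-factor shift for a mismatch at position i, computed
    directly from the definition (cubic in the worst case, kept simple;
    the chapter's Better-Factor-Function achieves O(m))."""
    m = len(x)
    beta = [0] * m
    for i in range(m):
        for k in range(i - 1, i - m - 1, -1):   # candidate shifts i - k = 1 .. m
            if k >= 0:                          # reoccurrence inside the pattern
                ok = x[k + 1:m - i + k] == x[i + 1:m] and x[k] != x[i]
            else:                               # suffix aligned with a prefix of x
                ok = x[i - k:] == x[:m - i + k]
            if ok:
                beta[i] = i - k
                break
    return beta

def last_occurrence_table(x: str):
    """lam[a] = m - 1 - (rightmost position of a in x[0 .. m - 2]); m if absent."""
    m, lam = len(x), {}
    for j in range(m - 1):                      # rightmost occurrence wins
        lam[x[j]] = m - 1 - j
    return lam

def boyer_moore_matcher(x: str, y: str):
    m, n = len(x), len(y)
    beta, lam = better_factor_table(x), last_occurrence_table(x)
    positions, p = [], 0
    while p <= n - m:
        i = m - 1
        while i >= 0 and y[p + i] == x[i]:      # scan from right to left
            i -= 1
        if i >= 0:                              # mismatch: take the larger shift
            p += max(beta[i], lam.get(y[p + i], m) + i - m + 1)
        else:                                   # whole match: shift by the period
            positions.append(p)
            p += beta[0]
    return positions
```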

Practical String-Matching Algorithms

The bad-symbol shift function introduced in the Boyer-Moore algorithm is not very efficient for small alphabets, but when the alphabet is large compared with the length of the pattern, as is often the case with the ASCII table and ordinary searches made under a text editor, it becomes very useful. Using only the corresponding table produces some efficient algorithms for practical searches. We describe one of these algorithms below.

Consider a position p of the window on the text, and assume that the symbols y[p + m − 1] and x[m − 1] are identical. If x[m − 1] does not occur in the prefix x[0 .. m − 2] of x, the window can be shifted m positions to the right after the string-to-string comparison between y[p .. p + m − 2] and x[0 .. m − 2] is performed. Otherwise, let k be the position of the rightmost occurrence of x[m − 1] in x[0 .. m − 2]; the window can be shifted m − 1 − k positions to the right. This shows that λ[y[p + m − 1]] is also a valid shift in the case where y[p + m − 1] = x[m − 1]. The underlying algorithm is the Horspool algorithm.

The pseudocode of the Horspool algorithm is given below. To prevent two references to the rightmost symbol in the window at each scan-and-shift operation, table λ is slightly modified: λ[x[m − 1]] contains the sentinel value 0, after its previous value is saved in variable t. The value of the variable j is the value of the expression p + m − 1 in the discussion above.

Horspool-Matcher(x, m, y, n)
  λ ← Last-Occurrence-Function(x, m)
  t ← λ[x[m − 1]]
  λ[x[m − 1]] ← 0
  j ← m − 1
  while j < n
    loop s ← λ[y[j]]
         if s ≠ 0
           then j ← j + s
           else if y[j − m + 1 .. j − 1] = x[0 .. m − 2]
                  then report j − m + 1
                j ← j + t

Just like the brute force algorithm, the Horspool algorithm has a quadratic worst-case time complexity. But its behavior in practice is at least as good as the behavior of the Boyer-Moore algorithm. An example showing the behavior of both algorithms is given in the figure below.
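A Python transcription of the same idea (a sketch; here the sentinel trick is replaced by a plain dictionary lookup, so the code differs slightly from the pseudocode above):

```python
def horspool_matcher(x: str, y: str):
    """Horspool sketch: always shift by the last-occurrence value of the
    symbol aligned with the right end of the window."""
    m, n = len(x), len(y)
    shift = {}
    for j in range(m - 1):               # rightmost occurrence in x[0 .. m-2] wins
        shift[x[j]] = m - 1 - j
    positions = []
    j = m - 1                            # j plays the role of p + m - 1
    while j < n:
        if y[j] == x[m - 1] and y[j - m + 1:j] == x[:m - 1]:
            positions.append(j - m + 1)
        j += shift.get(y[j], m)          # symbols absent from x[0..m-2] give m
    return positions
```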

The Aho-Corasick Algorithm

The UNIX operating system provides standard text-file facilities. Among them is the series of grep commands that locate patterns in files. We describe in this section the Aho-Corasick algorithm underlying an implementation of the fgrep command of UNIX. It searches files for a finite and fixed set of strings (the dictionary), and can, for instance, output lines containing at least one of the strings.

If we are interested in searching for all occurrences of all strings of a dictionary, a first solution consists in repeating some string-matching algorithm for each string. Considering a dictionary X containing k strings and a text y, the search runs in that case in time O(m + n × k), where m is the sum of the lengths of the strings in X, and n the length of y. But this solution is not efficient, since text y has to be read k times. The solution described in this section provides both a sequential read of the text and a total running time which is O(m + n) on a fixed alphabet. The algorithm can be viewed as a direct extension of a weaker version of the Knuth-Morris-Pratt algorithm.

The search is done with the help of an automaton that stores the situations encountered during the process. At a given position on the text, the current state is identified with the set of pattern prefixes ending here. The state represents all the factors of the pattern that can possibly lead to occurrences. Among the factors, the longest contains all the information necessary to continue the search. So, the search is realized with an automaton, denoted by

Figure: An illustration of the behavior of two fast string-matching algorithms when searching for the pattern x = sense in the text y = "no defense for sense". The successive positions of the window on the text are suggested by the alignments of x with the corresponding factors of y; the symbols of x considered during each scan operation are shown hachured. (a) Behavior of the Boyer-Moore algorithm: the first and second shifts result from the better-factor function, the third and fourth from the bad-symbol function, and the fifth from a shift of the length of x minus the length of its longest proper border (the period of x). (b) Behavior of the Horspool algorithm; we assume here that the four leftmost symbols in the window are compared with the symbols of x from left to right.

D(X), whose states are in one-to-one correspondence with the prefixes of X. Implementing completely the transition function of D(X) would require a size O(m × card Σ). Instead of that, the Aho-Corasick algorithm requires only O(m) space. To get this space complexity, a part of the transition function is made explicit in the data, and the other part is computed with the help of a failure function. For the first part, we assume that for any input (p, a), the function denoted by Target returns some state q if the triple (p, a, q) is an edge in the data, and the value nil otherwise. The second part uses the failure function fail, which is an analogue of the function γ used in the Knuth-Morris-Pratt algorithm. But this time, the function is defined on the set of states, and, for each state p different from the initial state,

fail[p] = the state identified with the longest proper suffix of the prefix identified with p that is also a prefix of a string of X.

The aim of the failure function is to defer the computation of a transition from the current state, say p, to the computation of the transition from the state fail[p] with the same input symbol, say a, when no edge from p labeled by symbol a is in the data; the initial state, which is identified with the empty string, is the default state for the statement. We give below the pseudocode of the function Next-State that computes the transitions in the representation. The initial state is denoted by i.

Figure: The trie-like automaton of the pattern X = {ace, as, ease}. The initial state is distinguished by a thick ingoing arrow, and each terminal state by a thick outgoing arrow. The states are numbered from 0 to 8 according to the order in which they are created by the construction statement described in the present section. State 0 is identified with the empty string, state 1 with a, state 2 with ac, state 3 with ace, and so on. The automaton accepts the language X.

Next-State(p, a, i)
  while p ≠ nil and Target(p, a) = nil
    loop p ← fail[p]
  if p ≠ nil
    then q ← Target(p, a)
    else q ← i
  return q

The preprocessing phase of the Aho-Corasick algorithm builds the explicit part of D(X), including function fail. It is itself divided into two steps.

The first step of the preprocessing phase consists in building a sub-automaton of D(X). It is the trie of X (the digital tree in which branches spell the strings of X and edges are labeled by symbols), having as initial state the root of the trie, and as terminal states the nodes corresponding to strings of X (an example is given in the figure above). It differs from D(X) in two points:

- it contains only the forward edges;
- it accepts only the set X.

An edge (p, a, q) in the automaton is said to be forward if the prefix identified with q is of the form ua, where u is the prefix corresponding to p. The function given below as the function Trie-Like-Automaton computes the automaton corresponding to the trie of X, returning its initial state. The terminal mark of each state r is managed through the attribute terminal[r]; the mark is either true or false, depending on whether state r is terminal or not. We assume that the function New-State creates and returns a new state, and that the procedure Make-Edge adds a given new edge to the data.

Trie-Like-Automaton(X)
  i ← New-State()
  terminal[i] ← false
  for string x from first to last string of X
    loop p ← i
         for symbol a from first to last symbol of x
           loop q ← Target(p, a)
                if q = nil
                  then q ← New-State()
                       terminal[q] ← false
                       Make-Edge(p, a, q)
                p ← q
         terminal[p] ← true
  return i

The second step of the preprocessing phase consists mainly in precomputing the failure function. This is done by a breadth-first traversal of the trie-like automaton. The corresponding pseudocode is given below as the procedure Make-Failure-Function.

Make-Failure-Function(i)
  fail[i] ← nil
  Empty-Queue()
  Enqueue(i)
  while not Queue-Is-Empty()
    loop p ← Dequeue()
         for each symbol a such that Target(p, a) ≠ nil
           loop q ← Target(p, a)
                fail[q] ← Next-State(fail[p], a, i)
                if terminal[fail[q]]
                  then terminal[q] ← true
                Enqueue(q)

During the computation, some states can be made terminal. This occurs when the state is identified with a prefix that ends with a string of X (an illustration is given in the figure below).

The complete dictionary-matching algorithm, implemented in the pseudocode below as the procedure Aho-Corasick-Matcher, starts with the two steps of the preprocessing; the search follows, which simulates automaton D(X). It is understood that the empty string does not belong to X.

Aho-Corasick-Matcher(X, y)
  i ← Trie-Like-Automaton(X)
  Make-Failure-Function(i)
  p ← i
  for symbol a from first to last symbol of y
    loop p ← Next-State(p, a, i)
         if terminal[p]
           then report an occurrence

The total number of tests "Target(p, a) = nil" performed by function Next-State, during its calls by procedure Make-Failure-Function and during its calls by the search phase of

Figure: The explicit part of the automaton D(X) of the pattern X = {ace, as, ease}. Compared to the trie-like automaton of X displayed in the previous figure, the state identified with the prefix eas has been made terminal; this is because this prefix ends with the string as, which is in X. The failure function fail is depicted with discontinuous non-labeled directed edges.

the algorithm, is bounded by 2 × m and 2 × n respectively, similarly as the bounds of comparisons in the Knuth-Morris-Pratt algorithm. Using a total order on the alphabet, the running time of function Target is both O(log k) and O(log card Σ), since the maximum number of edges outgoing a state in the data representing automaton D(X) is bounded both by k and by card Σ. Thus, the entire algorithm runs in time O(m + n) on a fixed alphabet, and in time O((m + n) × log min{k, card Σ}) in the general case. The algorithm requires O(m) extra space to store the data and to implement the queue used during the breadth-first traversal executed in procedure Make-Failure-Function.

Let us discuss the question of reporting occurrences of pattern X (last line of procedure Aho-Corasick-Matcher). The simplest way of doing it is to report the ending positions of occurrences; this amounts to outputting the value of the position of the current symbol in the text. A second possibility is to report the whole set of strings in X ending at the current position. To do so, the attribute terminal has to be transformed. First, for a state r, terminal[r] is now the set of the strings of X that are suffixes of the string corresponding to r. Second, to avoid a quadratic behavior, sets are manipulated by their identifiers only.
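The whole construction can be sketched in Python as follows (our transcription: dictionaries play the role of the explicit edges, list indexes the role of states, and terminal sets directly store the matched strings):

```python
from collections import deque

def aho_corasick(patterns, y):
    """Aho-Corasick sketch: build the trie of the dictionary, compute the
    failure links by a breadth-first traversal, then scan y once.
    Returns sorted (position, string) pairs for every occurrence."""
    # --- first step: trie-like automaton (state 0 is the initial state) ---
    target = [{}]                  # target[p][a] = q for the explicit edges
    output = [set()]               # strings of the dictionary ending at a state
    fail = [0]
    for word in patterns:
        p = 0
        for a in word:
            if a not in target[p]:
                target.append({}); output.append(set()); fail.append(0)
                target[p][a] = len(target) - 1
            p = target[p][a]
        output[p].add(word)
    # --- second step: failure function, by breadth-first traversal ---
    queue = deque(target[0].values())
    while queue:
        p = queue.popleft()
        for a, q in target[p].items():
            queue.append(q)
            f = fail[p]
            while f and a not in target[f]:
                f = fail[f]
            fail[q] = target[f][a] if a in target[f] and target[f][a] != q else 0
            output[q] |= output[fail[q]]   # inherit terminal strings
    # --- search phase: sequential read of the text ---
    hits, p = [], 0
    for j, a in enumerate(y):
        while p and a not in target[p]:
            p = fail[p]
        p = target[p].get(a, 0)
        for word in output[p]:
            hits.append((j - len(word) + 1, word))
    return sorted(hits)
```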

Indexing Texts

This section deals with the pattern-matching problem applied to fixed texts. Solutions consist in building an index on the text that speeds up further searches. The indexes that we consider here are data structures that contain all the suffixes, and therefore all the factors, of the text. Two types of structures are presented: suffix trees and suffix automata. They are both compact representations of suffixes, in the sense that their sizes are linear in the length of the text, although the sum of lengths of the suffixes of a string is quadratic. Moreover, their constructions take linear time on fixed alphabets. On an arbitrary finite alphabet, assumed to be ordered, a log card Σ factor has to be added to almost all running times given in the following; this corresponds to the branching operation involved in the respective data structures.

Indexes are powerful tools that have many applications. Here is a non-exhaustive list of them, assuming an index on the text y.

Membership testing if a string x o ccurs in y

Occurrence number pro ducing the number of o ccurrences of a string x in y

Figure: The suffix tree T(y) of the string y = aabbabb$. The nodes are numbered according to the order in which they are created by the construction algorithm described in the present section. Each of the eight external nodes of the trie is marked by the position of the occurrence of the corresponding suffix in y. Hence, the branch running from the root to an external node spells, for example, the string bbabb$, which is the suffix of y starting at position 2.

List of positions: analogue of the string-matching problem considered in the previous sections.

Longest repeated factor: locating the longest factor of y occurring at least twice in y.

Longest common factor: finding a longest string that occurs both in a string x and in y.

Solutions to some of these problems are first considered with suffix trees, then with suffix automata.
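As a concrete baseline, the sketches below (our own naive Python, not the chapter's methods) answer the same queries by scanning y directly; each takes time proportional to |y| or worse, whereas the indexes of this section answer membership, counting, and listing in time depending mainly on |x|.

```python
def membership(x, y):
    # does x occur in y?
    return x in y

def occurrence_number(x, y):
    # count (possibly overlapping) occurrences of x in y
    return sum(1 for i in range(len(y) - len(x) + 1) if y.startswith(x, i))

def positions(x, y):
    # list of starting positions of the occurrences of x in y
    return [i for i in range(len(y) - len(x) + 1) if y.startswith(x, i)]

def longest_repeated_factor(y):
    # longest factor occurring at least twice in y (overlaps allowed)
    for k in range(len(y) - 1, 0, -1):
        seen = set()
        for i in range(len(y) - k + 1):
            f = y[i:i + k]
            if f in seen:
                return f
            seen.add(f)
    return ""
```

On the running example y = aabbabb, the factor abb occurs at positions 1 and 4, and it is also the longest repeated factor.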

Suffix Trees

The suffix tree T(y) of a nonempty string y of length n is a trie containing all the suffixes of y. In order to simplify the statement, it is assumed that y ends with a special symbol of the alphabet occurring nowhere else in y; this special symbol is denoted by $ in the examples. The suffix tree of y is a trie which satisfies the following properties:

the branches from the root to the external nodes spell the nonempty suffixes of y, and each external node is marked by the position of the occurrence of the corresponding suffix in y;

the internal nodes have at least two successors, except if y is a one-length string;

the edges outgoing an internal node are labeled by factors starting with different symbols;

any string that labels an edge is represented by the couple of integers corresponding to its position in y and its length.

An example of suffix tree is displayed in the figure above. The special symbol at the end of y avoids marking nodes, and implies that T(y) has exactly n external nodes. The other properties then imply that the total size of T(y) is O(n), which makes it possible to design a linear-time construction of the data structure. The algorithm described in the following, and implemented by the procedure SuffixTree given further, has this time complexity.

The construction algorithm works as follows. It inserts the nonempty suffixes y[i..n−1], 0 ≤ i ≤ n−1, of y in the data structure, from the longest to the shortest suffix. In order to explain how this is performed, we introduce the two notations

h_i = the longest prefix of y[i..n−1] that is a prefix of some strictly longer suffix of y,

and

t_i = the string w such that y[i..n−1] is identical with h_i w,

defined for each i ∈ {1, ..., n−1}. The strategy to insert the suffixes is precisely based on these definitions. Initially, the data structure contains only the string y. Then the insertion of the string y[i..n−1], 1 ≤ i ≤ n−1, proceeds in two steps:

first, the "head" in the data structure, that is, the node h corresponding to the string h_i, is located, possibly breaking an edge;

second, a node called the "tail", say t, is created, added as successor of node h, and the edge from h to t is labeled with the string t_i.

The second step of the insertion is clearly performed in constant time. Thus, finding the head is critical for the overall performance of the construction algorithm. A brute-force method to find the head consists in spelling the current suffix y[i..n−1] from the root of the trie, giving an O(|h_i|) time complexity for the insertion at step i, and an O(n²) running time to build the suffix tree T(y). Adding "short-circuit" links leads to an overall O(n) time complexity, although there is no guarantee that the insertion at any step i is realized in constant time.
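The brute-force method can be sketched as follows in Python (our own names and representation; edge labels are stored here as explicit strings rather than (position, length) pairs, and y is assumed to end with a unique terminator such as $):

```python
class Node:
    def __init__(self):
        self.edges = {}       # first symbol of the label -> (label, child node)
        self.position = None  # position of the suffix, for external nodes

def naive_suffix_tree(y):
    """Insert each suffix by spelling it from the root: O(n^2) overall."""
    assert y[-1] not in y[:-1], "y must end with a unique terminator"
    root = Node()
    for i in range(len(y)):
        p, s = root, y[i:]
        while True:
            if s[0] not in p.edges:          # head found at node p: attach the tail
                t = Node(); t.position = i
                p.edges[s[0]] = (s, t)
                break
            label, q = p.edges[s[0]]
            k = 0                            # common prefix of s and the edge label
            while k < len(label) and k < len(s) and label[k] == s[k]:
                k += 1
            if k == len(label):              # edge fully spelled: go down
                p, s = q, s[k:]
            else:                            # head inside the edge: break it
                r = Node()
                p.edges[label[0]] = (label[:k], r)
                r.edges[label[k]] = (label[k:], q)
                t = Node(); t.position = i
                r.edges[s[k]] = (s[k:], t)
                break
    return root

def locate(root, x):
    """Node reached after spelling x from the root, or None."""
    p, s = root, x
    while s:
        if s[0] not in p.edges:
            return None
        label, q = p.edges[s[0]]
        m = min(len(label), len(s))
        if label[:m] != s[:m]:
            return None
        if len(s) <= len(label):
            return q
        p, s = q, s[len(label):]
    return p

def external_positions(p):
    """Sorted positions marked on the external nodes below p."""
    if p.position is not None:
        return [p.position]
    return sorted(x for _, q in p.edges.values() for x in external_positions(q))
```

On y = aabbabb$, locating abb and collecting the external positions below the reached node yields its occurrence positions 1 and 4.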

Observe that, in any suffix tree, if the string corresponding to a given internal node p in the data structure is in the form au, with a ∈ Σ and u ∈ Σ*, then there exists a unique internal node corresponding to the string u. From this remark are defined the suffix links, by

link[p] = the node q corresponding to the string u, when p corresponds to the string au for some symbol a,

for each internal node p that is different from the root. The links are useful when computing h_i from h_{i−1}, because of the property: if h_{i−1} is in the form aw for some symbol a and some string w, then w is a prefix of h_i.

We explain in the three following paragraphs how the suffix links help to find the successive heads efficiently. We consider a step i in the algorithm, assuming that i > 1. We denote by g the node that corresponds to the string h_{i−1}. The aim is both to insert y[i..n−1] and to find the node h corresponding to the string h_i. We first study the most general case of the insertion of the suffix y[i..n−1]; particular cases are studied after.

We assume in the present case that the predecessor of g in the data structure, say g′, is both defined and different from the root. Then h_{i−1} is in the form auv, where a ∈ Σ and u, v ∈ Σ*, au corresponds to the node g′, and v labels the edge from g′ to g. Since the string uv is a prefix of h_i, it can be fully spelled from the root. Moreover, the spelling operation of uv from the root can be short-circuited by spelling only the string v from the node link[g′]. The node q reached at the end of the spelling operation (possibly breaking the last partially taken-down edge) is then exactly the node link[g]. It remains to spell the string t_{i−1} from q for completely inserting the string y[i..n−1]. The spelling stops on the expected node h (possibly breaking again an edge), which becomes the new head in the data structure. The suffix of t_{i−1} that has not been spelled so far is exactly the string t_i. An example for the whole previous statement is given in the figure below.

Figure: During the construction of the suffix tree T(y) of the string y = aabbabb$, the step that is the insertion of the suffix bb$. The defined suffix links are depicted with discontinuous unlabeled directed edges. (a) Initially, the head in the data structure is the node corresponding to h_4 = abb, and its suffix link is not yet defined. The predecessor of this node is different from the root, and the factor h_4 = abb spelled from the root to it is in the form auv, where a ∈ Σ, u ∈ Σ*, and v is the string labeling the edge reaching the head. Here a = a, u is the empty string, and v = bb. Then the string uv = bb is spelled from the node linked with the head's predecessor; the spelling operation stops inside an edge, this edge is broken, which creates a new node, and the head is linked to this node. The string t_4 is spelled from it; the spelling operation stops on the node that becomes the new head in the data structure. (b) The tail node is created, added as successor of the new head, and the edge reaching it is labeled by the string remainder of the last spelling operation.

The second case is when g is a direct successor of the root. The string h_{i−1} is then in the form au, where a ∈ Σ and u ∈ Σ*. Similarly to the above case, the string u can be fully spelled from the root. The spelling of u gives a node q, which is then linked with g. Afterwards, the string t_{i−1} is spelled from q.

The last case is when g is the root itself. The string t_{i−1} minus its first symbol has to be spelled from the root. This ends the study of all the possible cases that can arise.

The important aspect of the algorithm is the use of two different implementations for the two spelling operations pointed out above. The first one, given in the pseudocode below as the function FastFind, deals with the situation where we know in advance that a given factor y[j..j+k−1] of y can be fully spelled from a given node p of the trie. It is then sufficient to scan only the first symbols of the labels of the encountered edges, which justifies the name of the function. The second implementation of the spelling operation spells a factor y[j..j+k−1] of y from a given node p too, but this time the spelling is performed symbol by symbol. The corresponding function is implemented after as the function SlowFind. Before giving the pseudocode of the functions, we specify the notation used in the following.

• For any input (y, p, j), the function SuccessorByOneSymbol returns the node q such that q is a successor of the node p and the first symbol of the label of the edge from p to q is y[j]; if such a node q does not exist, it returns nil.

• For any input (p, q), the function Label returns the two integers that represent the label of the edge from the node p to the node q.

• The function NewNode creates and returns a new node.

• For any input (p, j, k, q, ℓ), the function NewBreakingNode creates and returns the node r breaking the edge (p, y[j..j+k−1], q) at position ℓ in the label y[j..j+k−1], which gives the two edges (p, y[j..j+ℓ−1], r) and (r, y[j+ℓ..j+k−1], q).

Function FastFind returns a couple of nodes such that the second one is the node reached by the spelling, and the first one is its predecessor.

FastFind(y, p, j, k)
  p′ ← nil
  while k ≠ 0
  loop  p′ ← p
        q ← SuccessorByOneSymbol(y, p, j)
        (r, s) ← Label(p, q)
        if s ≤ k
          then  p ← q
                j ← j + s
                k ← k − s
          else  p ← NewBreakingNode(p, r, s, q, k)
                k ← 0
  return (p′, p)

Compared to function FastFind, function SlowFind considers an extra input, the predecessor of node p, denoted by p′. It considers in addition two extra outputs, the position and the length of the factor that remains to be spelled.

SlowFind(y, p′, p, j, k)
  b ← false
  loop  q ← SuccessorByOneSymbol(y, p, j)
        if q = nil
          then  b ← true
          else  (r, s) ← Label(p, q)
                ℓ ← 0
                while ℓ ≠ s and y[j] = y[r + ℓ]
                loop  ℓ ← ℓ + 1
                      j ← j + 1
                      k ← k − 1
                p′ ← p
                if ℓ = s
                  then  p ← q
                  else  p ← NewBreakingNode(p, r, s, q, ℓ)
                        b ← true
  while b = false
  return (p′, p, j, k)

The complete construction algorithm is implemented as the function SuffixTree given below. The function returns the root of the constructed suffix tree. Memorizing systematically the predecessors h′ and q′ of the nodes h and q avoids maintaining doubly-linked trees. The name of the attribute position, which marks the positions of the external nodes, is made explicit.

SuffixTree(y, n)
  p ← NewNode()
  h′ ← nil
  h ← p
  r ← −1
  s ← n + 1
  for i from 0 up to n − 1
  loop  if h′ = nil
          then  (h′, h, r, s) ← SlowFind(y, nil, p, r + 1, s − 1)
          else  (j, k) ← Label(h′, h)
                if h′ = p
                  then  (q′, q) ← FastFind(y, p, j + 1, k − 1)
                  else  (q′, q) ← FastFind(y, link[h′], j, k)
                link[h] ← q
                (h′, h, r, s) ← SlowFind(y, q′, q, r, s)
        t ← NewNode()
        MakeEdge(h, r, s, t)
        position[t] ← i
  return p

The algorithm runs in time O(n), or, more precisely, in time O(n log card Σ) if we take into account the branching in the data structure. Indeed, the node-descending instruction p ← q in function FastFind is performed less than n times overall, and the number of symbol comparisons done in function SlowFind is less than n.

Once the suffix tree of y is built, some operations can be performed rapidly. We describe four applications in the following. Let x be a string of length m.

Testing whether x occurs in y or not can be solved in time O(m), by spelling x from the root of the trie symbol by symbol. If the operation succeeds, x occurs in y. Otherwise we get the longest prefix of x occurring in y.

Producing the number of occurrences of x in y starts identically, by spelling x. Assume that x actually occurs in y. Let p be the node at the extremity of the last taken-down edge, or the root itself if x is empty. The expected number, say k, is then exactly the number of external nodes of the subtrie of root p. This number can be computed by traversing the subtrie. Since each internal node of the subtrie has at least two successors, the total size of the subtrie is O(k), and the traversal of the subtrie is performed in time O(k), independently of the alphabet. The method can be improved by precomputing in time O(n), independently of the alphabet, all the values associated with each internal node; the whole operation is then performed in time O(m), whatever the number of occurrences of x.

The method for reporting the list of positions of x in y proceeds in the same way. The running time needed by the operation is O(m) to locate x in the trie, plus O(k) to report each of the positions associated with the k external nodes.

Finding the longest repeated factor of y amounts to computing the deepest internal node of the trie, that is, the internal node corresponding to a longest possible factor in y. This is performed in time O(n).
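The quantity computed by that last application can be cross-checked without any tree: the longest repeated factor is also the longest common prefix of two lexicographically consecutive suffixes. A naive sketch (our own, O(n² log n) because the suffixes are sorted explicitly, whereas the suffix-tree method is linear):

```python
def lrf_by_suffix_sorting(y):
    """Longest repeated factor of y, via the longest common prefix
    of two lexicographically consecutive suffixes (naive sketch)."""
    suffixes = sorted(y[i:] for i in range(len(y)))
    best = ""
    for u, v in zip(suffixes, suffixes[1:]):
        k = 0
        while k < min(len(u), len(v)) and u[k] == v[k]:
            k += 1
        if k > len(best):
            best = u[:k]
    return best
```

On y = aabbabb the answer is abb, which occurs at positions 1 and 4.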

Suffix Automata

The suffix automaton S(y) of a string y is the minimal deterministic automaton recognizing Suff(y), that is, the set of suffixes of y. This automaton is minimal among all the deterministic

Figure: The suffix automaton S(y) of the string y = aabbabb. The states are numbered according to the order in which they are created by the construction algorithm described in the present section; the initial state is state 0. This automaton is the minimal deterministic automaton accepting the language of the suffixes of y.

automata recognizing the same language, which implies that it is not necessarily complete. An example is given in the figure above.

The main point about suffix automata is that their size is asymptotically linear in the length of the string. More precisely, given a string y of length n, the number of states of S(y) is equal to n + 1 when n ≤ 2, and is bounded by n + 1 and 2n − 1 otherwise; as to the number of edges, it is equal to n when n ≤ 1, it is 2 or 3 when n = 2, and it is bounded by n and 3n − 4 otherwise.

The construction of the suffix automaton of a string y of length n can be performed in time O(n), or, more precisely, in time O(n log card Σ) on an arbitrary alphabet. It makes use of a failure function fail defined on the states of S(y). The set of states of S(y) identifies with the quotient sets

u⁻¹ Suff(y) = {v ∈ Σ* | uv ∈ Suff(y)}

for the strings u in the whole set of factors of y. One may observe that two sets of the form u⁻¹ Suff(y) are either disjoint or comparable. This allows one to set

fail[p] = the smallest quotient set strictly containing the quotient set identified with p,

for each state p of the automaton different from the initial state of the automaton. The function given below as the function SuffixAutomaton builds the suffix automaton of y, and returns the initial state, say i, of the automaton. The construction is on-line, which means that at each step of the construction, just after processing a prefix y′ of y, the suffix automaton S(y′) is built. Denoting by t the state without outgoing edge in the automaton S(y′), the terminal states of S(y′) are implicitly known by the "suffix path" of t, that is, the list of the states

t, fail[t], fail[fail[t]], ..., i.

The algorithm uses the function length, defined for each state p of S(y) by

length[p] = the length of the longest string spelled from i to p.

SuffixAutomaton(y)
  i ← NewState()
  terminal[i] ← false
  length[i] ← 0
  fail[i] ← nil
  t ← i
  for symbol a from first to last symbol of y
  loop  t ← SuffixAutomatonExtension(i, t, a)
  p ← t
  loop  terminal[p] ← true
        p ← fail[p]
  while p ≠ nil
  return i

The on-line construction is based on the function SuffixAutomatonExtension that is implemented below. The latter function processes the next symbol, say a, of the string y. If y′ is the prefix of y preceding a, it transforms the suffix automaton S(y′) already built into the suffix automaton S(y′a).

SuffixAutomatonExtension(i, t, a)
  t′ ← t
  t ← NewState()
  terminal[t] ← false
  length[t] ← length[t′] + 1
  p ← t′
  loop  MakeEdge(p, a, t)
        p ← fail[p]
  while p ≠ nil and Target(p, a) = nil
  if p = nil
    then  fail[t] ← i
    else  q ← Target(p, a)
          if length[q] = length[p] + 1
            then  fail[t] ← q
            else  r ← NewState()
                  terminal[r] ← false
                  for each letter b such that Target(q, b) ≠ nil
                  loop  MakeEdge(r, b, Target(q, b))
                  length[r] ← length[p] + 1
                  fail[r] ← fail[q]
                  fail[q] ← r
                  fail[t] ← r
                  loop  CancelEdge(p, a, Target(p, a))
                        MakeEdge(p, a, r)
                        p ← fail[p]
                  while p ≠ nil and Target(p, a) = q
  return t
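The pseudocode above translates almost line for line into Python. The sketch below (our own naming; dictionaries hold the edges) builds S(y) on-line and then marks the terminal states by walking the suffix path of the last state:

```python
def suffix_automaton(y):
    """On-line construction of the suffix automaton S(y).
    Returns (target, fail, length, terminal): per-state edge maps,
    failure links, maximal spelled lengths, and the set of terminal states."""
    target, fail, length = [{}], [None], [0]
    t = 0
    for a in y:
        t_prev, t = t, len(target)
        target.append({}); fail.append(None); length.append(length[t_prev] + 1)
        p = t_prev
        while p is not None and a not in target[p]:
            target[p][a] = t             # create the edges labeled by a
            p = fail[p]
        if p is None:
            fail[t] = 0
        else:
            q = target[p][a]
            if length[q] == length[p] + 1:
                fail[t] = q
            else:                        # the edge (p, a, q) is a short-circuit:
                r = len(target)          # duplicate q into a new state r
                target.append(dict(target[q]))
                fail.append(fail[q]); length.append(length[p] + 1)
                fail[q] = fail[t] = r
                while p is not None and target[p].get(a) == q:
                    target[p][a] = r     # redirect the short-circuit edges to r
                    p = fail[p]
    terminal = set()                     # the suffix path of the last state
    while t is not None:
        terminal.add(t)
        t = fail[t]
    return target, fail, length, terminal

def spell(target, x):
    """State reached by spelling x from the initial state, or None."""
    p = 0
    for a in x:
        if a not in target[p]:
            return None
        p = target[p][a]
    return p
```

On y = aabbabb, a string is a factor of y exactly when it can be fully spelled, and it is a suffix exactly when the reached state is terminal; the number of states respects the 2n − 1 bound stated above.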

We illustrate the behavior of function SuffixAutomatonExtension in the figure below.

With the suffix automaton S(y) of y, several operations can be solved efficiently. We describe three of them. Let x be a string of length m.

Figure: An illustration of the behavior of function SuffixAutomatonExtension. The function transforms the suffix automaton S(y′) of a string y′ into the suffix automaton S(y′a) for any given symbol a, the terminal states being implicitly known. Let us consider that y′ = bbbbaabbb, and let us examine three possible cases according to a, namely a = c, a = b, and a = a. (a) The automaton S(bbbbaabbb). The state denoted by t′ is the last created state, and its suffix path lists the states corresponding to the suffixes of y′. During the execution of the first loop of the function, state p runs through a part of the suffix path of t′. At the same time, edges labeled by a are created from p to the newly created state t, unless such an edge already exists, in which case the loop stops. (b) If a = c, the execution stops with an undefined value for p. The edges labeled by c start at terminal states, and the failure of t is the initial state. (c) If a = b, the loop stops on a state p because an edge labeled by b is defined on it. The condition length[q] = length[p] + 1 of function SuffixAutomatonExtension is satisfied, which means that the edge labeled by a from p is not a short-circuit. In this case, the state ending the previous edge is the failure of t. (d) Finally, when a = a, the loop stops on a state p for the same reason, but the edge labeled by a from p is a short-circuit. The string bbba is a suffix of the newly considered string bbbbaabbba, but bbbba is not. Since these two strings reach the same state, this state is duplicated into a new state r that becomes terminal. Suffixes bba and ba are redirected to this new state. The failure of t is r.

The membership test is solved in time O(m), by spelling x from the initial state of the automaton. If the entire string is spelled, x occurs in y. Otherwise we get the longest prefix of x occurring in y.

Computing the number k of occurrences of x in y, assuming that x is a factor of y, starts similarly. Let p be the state reached after the spelling of x from the initial state. Then k is exactly the number of terminal states accessible from p. The number associated with each state p can be precomputed in time O(n), independently of the alphabet, by a depth-first traversal of the graph underlying the automaton. The query for x is then performed in time O(m), whatever k is.

The basis of an algorithm for computing a longest factor common to x and y is implemented in the procedure EndingFactorsMatcher given below. This procedure reports, at each position in x, the length of the longest factor of y ending at that position. It can obviously be used for string matching. It works as the procedure AhoCorasickMatcher in the use of the failure function. The running time of the search phase of the procedure is O(m).

EndingFactorsMatcher(y, x)
  i ← SuffixAutomaton(y)
  ℓ ← 0
  p ← i
  for symbol a from first to last symbol of x
  loop  if Target(p, a) ≠ nil
          then  ℓ ← ℓ + 1
                p ← Target(p, a)
          else  loop  p ← fail[p]
                while p ≠ nil and Target(p, a) = nil
                if p = nil
                  then  ℓ ← 0
                        p ← i
                  else  ℓ ← length[p] + 1
                        p ← Target(p, a)
        report ℓ

Retaining a largest value of the variable ℓ in the procedure, instead of reporting all values, solves the longest common factor problem.
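A self-contained Python sketch of the same scheme (our own naming): the suffix automaton of y is built on-line as in the pseudocode above, then x is scanned with the failure function, maintaining the length ℓ of the longest factor of y ending at the current position; keeping the maximum of ℓ yields a longest common factor.

```python
def ending_factors(y, x):
    """For each position j in x, the length of the longest factor of y
    ending at j (sketch of EndingFactorsMatcher)."""
    # on-line suffix automaton of y: edge maps, failure links, lengths
    target, fail, length = [{}], [None], [0]
    t = 0
    for a in y:
        t_prev, t = t, len(target)
        target.append({}); fail.append(None); length.append(length[t_prev] + 1)
        p = t_prev
        while p is not None and a not in target[p]:
            target[p][a] = t
            p = fail[p]
        if p is None:
            fail[t] = 0
        else:
            q = target[p][a]
            if length[q] == length[p] + 1:
                fail[t] = q
            else:                        # duplicate q into a clone r
                r = len(target)
                target.append(dict(target[q]))
                fail.append(fail[q]); length.append(length[p] + 1)
                fail[q] = fail[t] = r
                while p is not None and target[p].get(a) == q:
                    target[p][a] = r
                    p = fail[p]
    # scan x with the failure function, as in EndingFactorsMatcher
    out, ell, p = [], 0, 0
    for a in x:
        if a in target[p]:
            ell += 1
            p = target[p][a]
        else:
            while p is not None and a not in target[p]:
                p = fail[p]
            if p is None:
                ell, p = 0, 0
            else:
                ell = length[p] + 1
                p = target[p][a]
        out.append(ell)                  # report the current value of ell
    return out

def longest_common_factor(y, x):
    """A longest string occurring both in y and in x."""
    lengths = ending_factors(y, x)
    if not lengths or max(lengths) == 0:
        return ""
    j = max(range(len(lengths)), key=lengths.__getitem__)
    return x[j - lengths[j] + 1 : j + 1]
```

For instance, with y = aabbabb and x = cabbac, the reported lengths are 0, 1, 2, 3, 4, 0, and the longest common factor is abba.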

Research Issues and Summary

String searching by hashing was introduced by Harrison [1971], and later fully analyzed in Karp and Rabin [1987].

The first linear-time string-matching algorithm is due to Knuth, Morris, and Pratt [1977]. It can be proved that, during the search, the delay, that is, the number of times a symbol of the text is compared to symbols of the pattern, is less than ⌊log_Φ(m + 1)⌋, where Φ is the golden ratio (1 + √5)/2. Simon [1993] gives a similar algorithm, but with a delay bounded by the size of the alphabet of the pattern. Hancart [1993] proves that the delay of Simon's algorithm is less than 1 + ⌊log₂ m⌋. This paper also proves that this is optimal among algorithms processing the text with a one-symbol buffer. The bound becomes O(log min{1 + ⌊log₂ m⌋, card Σ}) using an ordering on the alphabet, which is not a restriction in practice.

Galil [1981] gives a general criterion to transform string-matching algorithms that work sequentially on the text into real-time algorithms.

The Boyer-Moore algorithm was designed in Boyer and Moore [1977]. The version given in this chapter follows Knuth, Morris, and Pratt [1977]. This paper contains the first proof of the linearity of the algorithm when restricted to the search of the first occurrence of the pattern. Cole [1994] proves that the maximum number of symbol comparisons is bounded by 3n for nonperiodic patterns, and that this bound is tight.

Knuth, Morris, and Pratt [1977] consider a variant of the Boyer-Moore algorithm in which all previous matches inside the current window are memorized. Each window configuration becomes the state of what is called the Boyer-Moore automaton. It is still unknown whether the maximum number of states of the automaton is polynomial or not.

Several variants of the Boyer-Moore algorithm avoid the quadratic behavior when searching for all occurrences of the pattern. Among the most efficient in terms of the number of symbol comparisons are the algorithm of Apostolico and Giancarlo, the Turbo-BM algorithm by Crochemore et alii (the two previous algorithms are analyzed in Lecroq [1995]), and the algorithm of Colussi [1994].

The Horspool algorithm is from Horspool [1980]. The paper contains practical aspects of string matching that are developed in Hume and Sunday [1991].

The optimal bound on the expected time complexity of string matching is O((n log m)/m); see Knuth, Morris, and Pratt [1977] and the paper of Yao [1979].

String matching can be solved by linear-time algorithms requiring only a constant amount of memory in addition to the pattern and the window on the text. This can be proved by different techniques presented in Crochemore and Rytter [1994]. The most recent solution is by Gąsieniec, Plandowski, and Rytter.

Cole et alii [1995] show that, in the worst case, any string-matching algorithm working with symbol comparisons makes at least n + c(n − m)/m comparisons during its search phase, for some positive constant c. Some string-matching algorithms make less than 2n comparisons. The presently known upper bound on the problem is n + O((n − m)/m), but with a quadratic-time preprocessing phase; see Cole et alii [1995]. With a linear-time preprocessing phase, the current upper bounds are (4/3)n and n + O((n − m)(log m)/m); see respectively Galil and Giancarlo [1992], and Breslauer and Galil [1993]. Except in a few cases (very short patterns, for example), lower and upper bounds do not meet. So the problem of the exact complexity of string matching is open.

The Aho-Corasick algorithm is from Aho and Corasick [1975]. Commentz-Walter has designed an extension of the Boyer-Moore algorithm that solves the dictionary-matching problem. It is fully described in Aho [1990].

The suffix-tree construction of this section is from McCreight [1976]. An on-line version is by Ukkonen [1995]. A previous algorithm by Weiner [1973] relates suffix trees to a data structure close to suffix automata.

The construction of suffix automata, also described as directed acyclic word graphs and often denoted by the acronym DAWG, is from Blumer et alii [1985] and from Crochemore [1986].

An alternative data structure that efficiently implements indexes is the suffix array, introduced in Manber and Myers [1993].

Defining Terms

Border: A string v is a border of a string u if v is both a prefix and a suffix of u. String v is said to be the border of u if it is the longest proper border of u.

Factor: A string v is a factor of a string u if u = u′vu″ for some strings u′ and u″.

Occurrence: A string v occurs in a string u if v is a factor of u.

Pattern: A finite number of strings that are searched for in texts.

Prefix: A string v is a prefix of a string u if u = vu″ for some string u″.

Proper: Qualifies a factor of a string that is not equal to the string itself.

Segment: Equivalent to factor.

Suffix: A string v is a suffix of a string u if u = u′v for some string u′.

Suffix tree: Trie containing all the suffixes of a string.

Suffix automaton: Smallest automaton accepting the suffixes of a string.

Text: A stream of symbols that is searched for occurrences of patterns.

Trie: Digital tree; tree in which edges are labeled by symbols or strings.

Window: Factor of the text that is aligned with the pattern.

References

Aho, A.V. 1990. Algorithms for finding patterns in strings. In Handbook of Theoretical Computer Science, ed. J. van Leeuwen, vol. A, pp. 255–300. Elsevier, Amsterdam.

Aho, A.V. and Corasick, M.J. 1975. Efficient string matching: an aid to bibliographic search. Comm. ACM, 18(6):333–340.

Baase, S. 1988. Computer Algorithms: Introduction to Design and Analysis. Addison-Wesley.

Blumer, A., Blumer, J., Ehrenfeucht, A., Haussler, D., Chen, M.T., and Seiferas, J. 1985. The smallest automaton recognizing the subwords of a text. Theoret. Comput. Sci., 40:31–55.

Boyer, R.S. and Moore, J.S. 1977. A fast string searching algorithm. Comm. ACM, 20(10):762–772.

Breslauer, D. and Galil, Z. 1993. Efficient comparison based string matching. J. Complexity.

Cole, R. 1994. Tight bounds on the complexity of the Boyer-Moore pattern matching algorithm. SIAM J. Comput., 23(5):1075–1091.

Cole, R., Hariharan, R., Zwick, U., and Paterson, M.S. 1995. Tighter lower bounds on the exact complexity of string matching. SIAM J. Comput., 24(1):30–45.

Colussi, L. 1994. Fastest pattern matching in strings. J. Algorithms.

Cormen, T.H., Leiserson, C.E., and Rivest, R.L. 1990. Introduction to Algorithms. MIT Press.

Crochemore, M. 1986. Transducers and repetitions. Theoret. Comput. Sci., 45:63–86.

Crochemore, M. and Rytter, W. 1994. Text Algorithms. Oxford University Press.

Galil, Z. 1981. String matching in real time. J. ACM, 28(1):134–149.

Galil, Z. and Giancarlo, R. 1992. On the exact complexity of string matching: upper bounds. SIAM J. Comput.

Gonnet, G.H. and Baeza-Yates, R.A. 1991. Handbook of Algorithms and Data Structures. Addison-Wesley.

Hancart, C. 1993. On Simon's string searching algorithm. Inf. Process. Lett., 47:95–99.

Horspool, R.N. 1980. Practical fast searching in strings. Software Practice and Experience, 10:501–506.

Hume, A. and Sunday, D.M. 1991. Fast string searching. Software Practice and Experience, 21(11):1221–1248.

Karp, R.M. and Rabin, M.O. 1987. Efficient randomized pattern-matching algorithms. IBM J. Res. Dev., 31(2):249–260.

Knuth, D.E., Morris Jr., J.H., and Pratt, V.R. 1977. Fast pattern matching in strings. SIAM J. Comput., 6(2):323–350.

Lecroq, T. 1995. Experimental results on string-matching algorithms. Software Practice and Experience, 25(7):727–765.

McCreight, E.M. 1976. A space-economical suffix tree construction algorithm. J. ACM, 23(2):262–272.

Manber, U. and Myers, G. 1993. Suffix arrays: a new method for on-line string searches. SIAM J. Comput., 22(5):935–948.

Sedgewick, R. 1988. Algorithms. Addison-Wesley.

Simon, I. 1993. String matching algorithms and automata. In First South American Workshop on String Processing, ed. R. Baeza-Yates and N. Ziviani. Universidade Federal de Minas Gerais.

Stephen, G.A. 1994. String Searching Algorithms. World Scientific Press.

Further Information

Problems and algorithms presented in the chapter are just a sample of questions related to pattern matching. They share the formal methods used to design efficient algorithms. A wider panorama of algorithms on texts may be found in a few books, such as Crochemore and Rytter [1994] and Stephen [1994].

Research papers in pattern matching are disseminated in a few journals, among which are: Communications of the ACM, Journal of the ACM, Theoretical Computer Science, Journal of Algorithms, SIAM Journal on Computing, and Algorithmica.

Two main annual conferences present the latest advances of this field of research:

• Combinatorial Pattern Matching, which started in 1990 in Paris (France), and was held since in London (England), Tucson (Arizona), Padova (Italy), Asilomar (California), Helsinki (Finland), and Laguna Beach (California).

• Workshop on String Processing, which started in 1993 in Belo Horizonte (Brazil), and was held since in Valparaíso (Chile) and Recife (Brazil).

But general conferences in computer science often have sessions devoted to pattern matching.

Several books on the design and analysis of general algorithms contain a chapter devoted to algorithms on texts. Here is a sample of these books: Baase [1988], Cormen, Leiserson, and Rivest [1990], Gonnet and Baeza-Yates [1991], and Sedgewick [1988].