
NR-grep: A Fast and Flexible Pattern Matching Tool

Gonzalo Navarro

Abstract

We present nrgrep ("nondeterministic reverse grep"), a new pattern matching tool designed for efficient search of complex patterns. Unlike previous tools of the grep family, such as agrep and Gnu grep, nrgrep is based on a single and uniform concept: the bit-parallel simulation of a nondeterministic suffix automaton. As a result, nrgrep can find from simple patterns to regular expressions, exactly or allowing errors in the matches, with an efficiency that degrades smoothly as the complexity of the searched pattern increases. Another concept fully integrated into nrgrep, which contributes to this smoothness, is the selection of adequate subpatterns for fast scanning, which is also absent in many current tools. We show that the efficiency of nrgrep is similar to that of the fastest existing string matching tools for the simplest patterns, and by far unmatched for more complex patterns.

Key words: online string matching, approximate string matching, grep, agrep, BNDM.

Introduction

The purpose of this paper is to present a new pattern matching tool, which we have coined nrgrep, for "nondeterministic reverse grep". Nrgrep is aimed at efficient searching for complex patterns inside natural language texts, but it can be used in many other scenarios.

The pattern matching problem can be stated as follows: given a text T_{1...n} of n characters and a pattern P, find all the positions of T where P occurs. The problem is basic in almost every area of computer science and appears in many different forms. The pattern P can be just a simple string, but it can also be, for example, a regular expression. An occurrence can be defined as exactly or approximately matching the pattern.

In this paper we concentrate on online string matching, that is, the text cannot be indexed. Online string matching is useful for casual searching (i.e. users looking for strings in their files and unwilling to maintain an index for that purpose), dynamic text collections (where the cost of keeping an up-to-date index is prohibitive), including the searchers inside text editors and Web interfaces¹, for not very large texts (up to a few hundred megabytes), and even as internal tools of indexed schemes (as agrep is used inside glimpse, or cgrep is used inside compressed indexes).



Dept. of Computer Science, University of Chile. Blanco Encalada, Santiago, Chile. gnavarro@dcc.uchile.cl. Work developed while the author was on a postdoctoral stay at the Institut Gaspard Monge, Univ. de Marne-la-Vallée, France, partially supported by Fundación Andes and ECOS/Conicyt.

¹ We refer to the "search in page" facility, not to be confused with searching the Web.

There is a large class of string matching algorithms in the literature, but not all of them are practical. There is also a wide variety of fast online string matching tools in the public domain, most prominently the grep family. Among these, Gnu grep and Wu and Manber's agrep are widely known and currently considered the fastest string matching tools in practice. Another distinguishing feature of these software systems is their flexibility: they can search not only for simple strings, but they also permit classes of characters (that is, a pattern position matches a set of characters), wild cards (a pattern position that matches an arbitrary string), regular expression searching, multipattern searching, etc. Agrep also permits approximate searching: the pattern matches the text after performing a limited number of alterations on it.

The algorithmic principles behind agrep are diverse. Exact string matching is done with the Horspool algorithm, a variant of the Boyer-Moore family. The speed of the Boyer-Moore string matching algorithms comes from their ability to skip (i.e. not inspect) some text characters. Agrep deals with more complex patterns using a variant of Shift-Or, an algorithm exploiting bit-parallelism (a concept that we explain later) to simulate nondeterministic automata (NFAs) efficiently. Shift-Or, however, cannot skip text characters. Multipattern searching is treated with bit-parallelism or with a different algorithm, depending on the case. As a result, the search performance of agrep varies sharply depending on the type of search pattern, and even slight modifications to the pattern yield widely different search times. For example, the search for the string "algorithm" is several times faster than for "[Aa]lgorithm", where [Aa] is a class of characters that matches "A" and "a" (which is useful to detect the word either starting a sentence or not). In the first case agrep uses Horspool's algorithm, and in the second case it uses Shift-Or. Intuitively, there should exist a more uniform approach where both strings could be efficiently searched for without a significant difference in the search time.

An answer to this challenge is the BNDM algorithm. BNDM is based on a previous algorithm, BDM (for "backward DAWG matching"). The BDM algorithm, to be explained later, uses a suffix automaton to detect substrings of the pattern inside a text window (the Boyer-Moore family detects only suffixes of the pattern). Like the Boyer-Moore algorithms, BDM can also skip text characters. In the original BDM algorithm the suffix automaton is made deterministic. BNDM is a recent version of BDM that keeps the suffix automaton in nondeterministic form by using bit-parallelism. As a result, BNDM can search for complex patterns and still keep a search efficiency close to that of simple patterns. It has been shown experimentally that the BNDM algorithm is by far the fastest one to search for complex patterns. BNDM has been later extended to handle regular expressions.

Nrgrep is a pattern matching tool built over the BNDM algorithm (hence the name "nondeterministic reverse grep", since BNDM scans windows of the text in reverse direction). However, there is a gap between a pattern matching algorithm and a real software. The purpose of this work is to fill that gap. We have classified the allowed search patterns in three levels:

Simple patterns: a simple pattern is a sequence of m classes of characters (note that a single character is a particular case of a class). Its distinguishing feature is that an occurrence of a simple pattern has length m as well, so each pattern position matches one text position.

Extended patterns: an extended pattern adds to simple patterns the ability to characterize individual classes as optional (i.e. they can be skipped when matching the text) or repeatable (i.e. they can appear consecutively a number of times in the text). The purpose of extended patterns is to capture the most commonly used extensions of the normal search patterns, so as to develop specialized pattern matching algorithms for them.

Regular expressions: a regular expression is formed by simple classes, the empty string, or the concatenation, union or repetition of other regular expressions. This is the most general type of pattern we can search for.

We develop a different pattern matching algorithm, of increasing complexity, for each type of pattern, so simpler patterns are searched for with simpler and faster algorithms. The classification has been made having in mind the typical search needs on natural language, and it would be different, say, for DNA searching. We have also had this in mind when designing the error model for approximate searching. Approximate searching, or "searching allowing errors", means that the user gives an error threshold k and the system is able to find the pattern in the text even if it is necessary to perform k or less operations in the pattern to match its occurrence in the text. The operations typically permitted are the insertion, deletion and substitution of single characters. However, transposition of adjacent characters is an important typing error that is normally disregarded because it is difficult to deal with. We allow the four operations in nrgrep, although the user can specify a subset of them.

An important aspect that deserves attention in order to obtain the desired smoothness in the search time is the selection of an optimal subpattern to scan the text. A typical case is an extended pattern with a large and repeatable class of characters close to one end. For technical reasons that will be made clear later, it may be very expensive to search for the pattern as is, while pruning the extreme of the pattern that contains the class and verifying the potential occurrences found leads to much faster searching. Some tools, such as Gnu grep for regular expressions, try to apply some heuristics of this type, but we provide a general and uniform subpattern optimization method that works well in all cases and, under a simplified probabilistic model, yields the optimal search performance for that pattern. Moreover, the selected subpattern may be of a simpler type than the whole pattern, and a faster search algorithm may be possible. Detecting the exact type of pattern given by the user, despite the syntax used, is an important issue that is solved by the pattern parser.

We have followed the philosophy of agrep in some aspects, such as the record-oriented way to present the results and most of the pattern syntax features and search options. The main advantages of nrgrep over the grep family are: uniformity in design, smoothness in search time, speed when searching for complex patterns, powerful extended patterns, improved error model for approximate searching, and subpattern optimization.

In this paper we start by explaining the concepts of bit-parallelism and searching with suffix automata. Then we explain how these are combined to search for simple patterns, extended patterns and regular expressions. We later consider the approximate search of these patterns. Finally, we present the nrgrep software and show some experimental results on it. Despite that the algorithmic aspects of the paper borrow from our previous work in some cases, the paper has some novel and nontrivial algorithmic contributions, such as:

- searching for extended patterns, which implies the bit-parallel simulation of new types of restricted automata;

- approximate searching allowing transpositions for the three types of patterns, which has never been considered under the bit-parallel approach; and

- algorithms to select optimal search subpatterns in the three types of patterns.

The nrgrep tool is freely available under a Gnu license from
http://www.dcc.uchile.cl/~gnavarro/pubcode

Basic Concepts

We define in this section the basic concepts and notation needed throughout the paper.

Notation

We consider that the text is a sequence of n characters, T = t_1 ... t_n, where t_i ∈ Σ, Σ being the alphabet of the text, whose size is denoted |Σ|. In the simplest case, the pattern is denoted as P = p_1 ... p_m, a sequence of m characters p_i ∈ Σ, in which case it specifies the single string p_1 ... p_m. More general patterns specify a finite or infinite set of strings.

We say that P matches T at position i whenever there exists a j ≥ 0 such that t_i ... t_{i+j} belongs to the set of strings specified by the pattern. The substring t_i ... t_{i+j} is called an occurrence of P in T. Our goal is to find all the text positions that start a pattern occurrence.

The following notation is used for strings: S_{i...j} denotes the string s_i s_{i+1} ... s_j (in particular, S_{i...j} = ε, the empty string, if i > j). A string X is said to be a prefix, suffix, and factor (or substring), respectively, of XY, YX, and YXZ, for any Y and Z.

We use some notation to describe bit-parallel algorithms. We use exponentiation to denote bit repetition, e.g. 0^3 1 = 0001. We denote as b_ℓ ... b_1 the bits of a mask of length ℓ, which is stored somewhere inside the computer word, of fixed length w (in bits). We use C-like syntax for operations on the bits of computer words: "|" is the bitwise-or, "&" is the bitwise-and, "^" is the bitwise-xor, "~" complements all the bits, and "<<" moves the bits to the left and enters zeros from the right, e.g. b_4 b_3 b_2 b_1 << 1 = b_3 b_2 b_1 0. We can also perform arithmetic operations on the bits, such as addition and subtraction, which operate the bits as the binary representation of a number; for instance, b_x ... b_1 0 1^k + 1 = b_x ... b_1 1 0^k.
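As a concrete illustration of this notation, the following Python sketch (using Python's unbounded integers in place of a fixed-width computer word) exercises each operation:

```python
# Bit-mask operations from the notation above, on a mask b4 b3 b2 b1.
x = 0b1010          # b4 b3 b2 b1 = 1 0 1 0
y = 0b0110

print(bin(x | y))   # bitwise-or  -> 0b1110
print(bin(x & y))   # bitwise-and -> 0b10
print(bin(x ^ y))   # bitwise-xor -> 0b1100

# "<<" moves the bits left and enters zeros from the right:
print(bin(x << 1))  # -> 0b10100

# Arithmetic treats the mask as a binary number: adding 1 to ...0 1^k
# turns the trailing run of ones into zeros and sets the next bit.
print(bin(0b0111 + 1))  # -> 0b1000
```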

In the following we show that the pattern can be a more complex entity, matching in fact a set of different text substrings.

Simple Patterns

We call a simple pattern a sequence of characters or classes of characters. Let m be the number of elements in the sequence; then a simple pattern P is written as P = p_1 ... p_m, where p_i ⊆ Σ. We say that P matches at text position i whenever t_{i+j-1} ∈ p_j for all 1 ≤ j ≤ m. The most important feature of simple patterns is that they match a substring of the same length m in the text.

We use the following notation to describe simple patterns: we concatenate the elements of the sequence together. Simple characters (i.e. classes of size 1) are written down directly, while other classes of characters are written in square brackets. The first character inside the square brackets can be "^", which means that the class is exactly the complement of what is specified. The rest is a simple enumeration of the characters of the class, except that we allow ranges: x-y means all the characters between x and y, inclusive (we assume a total order in Σ, which is in practice the ASCII code). Finally, the character "." represents a class equal to the whole alphabet Σ, and another special character represents the class of all separators (i.e. non-alphanumeric characters). Most of these conventions are the same used in other software. Some examples are:

- [Aa]merican, which matches "American" and "american".

- [^\n]Begin, which finds "Begin" if it is not preceded by a line break.

- 199[0-9], which matches any date in the 1990's.

- .e[^a-zA-Z_]t followed by the separator class, which permits any character in the first position, then "e", then anything except a letter or underscore in the third position, then "t", and finishes with a separator.
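These semantics can be sketched directly (a naive Python illustration, not nrgrep's bit-parallel machinery; classes are represented as Python sets and the class-syntax parser is omitted):

```python
def match_simple(pattern, text):
    """Starting positions where each class matches the aligned text character."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if all(text[i + j] in pattern[j] for j in range(m))]

# [Aa]merican as an explicit sequence of classes:
p = [set("Aa")] + [set(c) for c in "merican"]
print(match_simple(p, "the American and american ways"))   # -> [4, 17]
```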

Note that we have used \n to denote the newline. We also use \t for the tab, \xHH for the character with hex ASCII code HH, and in general \C to interpret any character C literally (e.g. the backslash character itself, as well as the special characters that follow).

It is possible to specify that the pattern has to appear at the beginning of a line (by preceding it with "^") or at the end of the line (by following it with "$"). Note that this is not the same as adding the newline preceding or following the pattern, because the beginning/end of the file signals also the beginning/end of the line, and the same happens with records when the record delimiter is not the end of line.

Extended Patterns

In general, an extended pattern adds some extra specification capabilities to the simple pattern mechanism. In this work we have chosen some features which we believe are the most interesting for typical text searching. The reason to introduce this intermediate-level pattern, between simple patterns and regular expressions, is that it is possible to devise specialized search algorithms for them, which can be faster than those for general regular expressions. The operations we permit for extended patterns are: specifying optional classes or characters, and permitting the repetition of a class or character. The notation we use is to add a symbol after the affected character or class: "?" means an optional class, "*" means that the class can appear zero or more times, and "+" means that it can appear one or more times. Some examples are:

- colou?r, which matches "color" and "colour".

- [a-zA-Z][a-zA-Z0-9]*, which matches valid variable names in most programming languages (a letter followed by letters or digits).

- "Latin" followed by one or more separator-class characters (e.g. spaces, tabs, etc.) and then "America", which matches the two words however they are separated.
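The "?", "*" and "+" operators coincide with their standard regular-expression meanings, so Python's re module can illustrate the examples above (this is only an illustration; nrgrep implements these operators with specialized bit-parallel algorithms):

```python
import re

# colou?r: the 'u' is optional.
assert re.fullmatch(r"colou?r", "color")
assert re.fullmatch(r"colou?r", "colour")

# [a-zA-Z][a-zA-Z0-9]*: a letter followed by letters or digits.
assert re.fullmatch(r"[a-zA-Z][a-zA-Z0-9]*", "var2")
assert not re.fullmatch(r"[a-zA-Z][a-zA-Z0-9]*", "2var")
print("ok")
```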

Regular Expressions

A regular expression is the most sophisticated pattern that we allow to search for, and it is in general considered powerful enough for most applications. A regular expression is defined as follows.

Basic elements: any character and the empty string ε are regular expressions matching themselves.

Parenthesis: if e is a regular expression then so is (e), which matches the same strings. This is used to change precedence.

Concatenation: if e_1 and e_2 are regular expressions, then e_1 e_2 is a regular expression that matches a string x iff x can be written as x = y z, where e_1 matches y and e_2 matches z.

Union: if e_1 and e_2 are regular expressions, then e_1 | e_2 is a regular expression that matches a string x iff e_1 or e_2 match x.

Kleene closure: if e is a regular expression, then e* is a regular expression that matches a string x iff, for some n ≥ 0, x can be written as x = x_1 x_2 ... x_n and e matches each string x_i.

We follow the same syntax, with the precedence order "*" > concatenation > "|", except that: we use square brackets to abbreviate x_1 | x_2 | ... | x_n ≡ [x_1 x_2 ... x_n], where the x_i are characters; we omit the concatenation operator; we add the two operators e+ ≡ e e* and e? ≡ e | ε; and we use the empty string to denote ε, e.g. (ab|)c denotes (ab|ε)c. These arrangements make the extended patterns a particular case of regular expressions. Some examples are:

- dog|cat, which matches "dog" and "cat".

- (Dr|Prof|Mr)*Knuth, which matches "Knuth" preceded by a sequence of titles.

Pattern Matching Algorithms

We explain in this section the basic string and regular expression search algorithms our software builds on.

Bit-Parallelism and the Shift-Or Algorithm

A new approach to text searching, based on bit-parallelism, was proposed in the early 1990s. This technique consists in taking advantage of the intrinsic parallelism of the bit operations inside a computer word. By using this fact cleverly, the number of operations that an algorithm performs can be cut down by a factor of at most w, where w is the number of bits in the computer word. Since in current architectures w is 32 or 64, the speedup is very significant in practice.

Figure 1 shows a nondeterministic automaton that searches for a pattern in a text. Classical pattern matching algorithms, such as KMP, convert this automaton to a deterministic form and achieve O(n) worst-case search time. The Shift-Or algorithm, on the other hand, uses bit-parallelism to simulate the automaton in its nondeterministic form. It achieves O(⌈m/w⌉ n) worst-case time, i.e. an optimal speedup over a classical O(mn) simulation. For m ≤ w, Shift-Or is twice as fast as KMP because of better use of the computer registers. Moreover, it is easily extended to handle classes of characters.

[Diagram: states 0 to 7 in a line, with a Σ-labeled self-loop on state 0 and arrows labeled a, b, c, d, e, f, g between consecutive states.]

Figure 1: A nondeterministic automaton (NFA) to search for the pattern P = "abcdefg" in a text.

Text Scanning

We present now the Shift-And algorithm, which is an easier-to-explain (though a little less efficient) variant of Shift-Or. Given a pattern P = p_1 p_2 ... p_m, p_i ∈ Σ, and a text T = t_1 t_2 ... t_n, t_i ∈ Σ, the algorithm first builds a table B which, for each character, stores a bit mask b_m ... b_1. The mask in B[c] has the i-th bit set if and only if p_i = c. The state of the search is kept in a machine word D = d_m ... d_1, where d_i is set whenever p_1 p_2 ... p_i matches the end of the text read up to now (another way to see it is to consider that d_i tells whether the state numbered i in Figure 1 is active). Therefore, we report a match whenever d_m is set.

We set D = 0^m originally and, for each new text character t_j, we update D using the formula

    D ← ((D << 1) | 0^{m-1} 1) & B[t_j]

The formula is correct because the i-th bit is set if and only if the (i-1)-th bit was set for the previous text character and the new text character matches the pattern at position i. In other words, t_{j-i+1} ... t_j = p_1 ... p_i if and only if t_{j-i+1} ... t_{j-1} = p_1 ... p_{i-1} and t_j = p_i. Again, it is possible to relate this formula to the movement that occurs in the NFA for each new text character: each state gets the value of the previous state, but this happens only if the text character matches the corresponding arrow. Finally, the "| 0^{m-1} 1" after the shift allows a match to begin at the current text position (this operation is saved in Shift-Or, where all the bits are complemented). This corresponds to the self-loop at the initial state of the automaton.

The cost of this algorithm is O(n). For patterns longer than the computer word (i.e. m > w), the algorithm uses ⌈m/w⌉ computer words for the simulation (not all of them are active all the time), with a worst-case cost of O(mn/w) and still an average-case cost of O(n).
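The algorithm above can be sketched as follows (a minimal Python rendering for a single-word pattern, m ≤ w; Python's unbounded integers stand in for the machine word):

```python
# Shift-And sketch: report all occurrences of an exact pattern in a text.

def shift_and(pattern, text):
    """Return the starting positions of all exact occurrences."""
    m = len(pattern)
    # B[c] has bit i-1 set iff pattern position i holds character c.
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << i)
    D = 0
    out = []
    for j, c in enumerate(text):
        # D <- ((D << 1) | 1) & B[t_j]
        D = ((D << 1) | 1) & B.get(c, 0)
        if D & (1 << (m - 1)):        # bit d_m set: a full match ends here
            out.append(j - m + 1)     # report the starting position
    return out

print(shift_and("abc", "xabcabc"))    # -> [1, 4]
```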

Classes of Characters and Extended Patterns

The Shift-Or algorithm is not only very simple, but it also has some further advantages. The most immediate one is that it is very easy to extend to handle classes of characters, where each pattern position may match not only a single character but a set of characters. If p_i is the set of characters that match position i in the pattern, we set the i-th bit of B[c] for all c ∈ p_i. No other change is necessary to the algorithm. It has also been shown how to allow a limited number k of mismatches in the occurrences, at O(nm log(k)/w) cost.

This paradigm was later enhanced to support wild cards, regular expressions, approximate search with nonuniform costs, and combinations of them. Further development of the bit-parallelism approach for approximate string matching yielded some of the fastest algorithms for short patterns. In most cases the key idea was to simulate an NFA.

Bit-parallelism has become a general way to simulate simple NFAs, instead of converting them to deterministic automata. This is how we use it in nrgrep.

The BDM Algorithm

The main disadvantage of Shift-Or is its inability to skip characters, which makes it slower than the algorithms of the Boyer-Moore or the BDM families. We describe in this section the BDM pattern matching algorithm, which is able to skip some text characters.

BDM is based on a suffix automaton. A suffix automaton on a pattern P = p_1 p_2 ... p_m is an automaton that recognizes all the suffixes of P. A nondeterministic version of this automaton has a very regular structure and is shown in Figure 2. In the BDM algorithm this automaton is made deterministic.

[Diagram: an initial state I with ε-transitions to every state 0 to 7; arrows labeled a, b, c, d, e, f, g connect consecutive states.]

Figure 2: A nondeterministic suffix automaton for the pattern P = "abcdefg". Dashed lines represent ε-transitions, i.e. they occur without consuming any input.

A very important fact is that this automaton can be used to recognize not only the suffixes of P, but also the factors of P. Note that there is a path labeled by x from the initial state if and only if x is a factor of P. That is, the NFA will not run out of active states as long as it has read a factor of P.

The suffix automaton is used to design a simple pattern matching algorithm. This algorithm runs in O(mn) time in the worst case, but it is optimal on average (O(n log_σ(m)/m) time). Other, more complex variations, such as TurboBDM and MultiBDM, achieve linear time in the worst case.

To search for a pattern P = p_1 p_2 ... p_m in a text T = t_1 t_2 ... t_n, the suffix automaton of P^r = p_m p_{m-1} ... p_1 (i.e. the pattern read backwards) is built. A window of length m is slid along the text, from left to right. The algorithm searches backward inside the window for a factor of the pattern P using the suffix automaton, i.e. the suffix automaton of the reverse pattern is fed with the characters in the text window read backward. This backward search ends in two possible forms:

1. We fail to recognize a factor, i.e. we reach a window character that makes the automaton run out of active states. This means that the suffix of the window we have read is not anymore a factor of P. Figure 3 illustrates this case. We then shift the window to the right, its starting position corresponding to the position following that character (we cannot miss an occurrence because in that case the suffix automaton would have found a factor of it in the window).

[Diagram: a text window is scanned backward with the suffix automaton, searching for a factor; when we fail to recognize a factor at some window position u, the window can be safely shifted so that it starts just past that position, and a new search starts in the new window.]

Figure 3: Suffix automaton search.

2. We reach the beginning of the window, therefore recognizing the pattern P, since the length-m window is a factor of P (indeed, it is equal to P). We report the occurrence and shift the window by one position.

Combining Shift-Or and BDM: the BNDM Algorithm

We describe in this section the BNDM pattern matching algorithm. This algorithm, a combination of Shift-Or and BDM, has all the advantages of the bit-parallel forward scan algorithm, and in addition it is able to skip some text characters, like BDM.

Instead of making the automaton of Figure 2 deterministic, BNDM simulates it using bit-parallelism. The bit-parallel simulation works as follows. Just as for Shift-And, we keep the state of the search using m bits of a computer word D = d_m ... d_1. Each time we position the window in the text we initialize D = 1^m (this corresponds to the ε-transitions) and scan the window backward. For each new text character read in the window we update D. If we run out of 1's in D then there cannot be a match and we suspend the scanning and shift the window. If, on the other hand, we can perform m iterations, then we report the match.

We use a table B which, for each character c, stores a bit mask. This mask sets the bits corresponding to the positions where the reversed pattern has the character c (just as in the Shift-And algorithm). The formula to update D upon reading the window character t_j is

    D ← D & B[t_j]

after which D is shifted left (D ← D << 1) before the next (previous in the text) window character is read.

BNDM is not only faster than Shift-Or and BDM for moderate pattern lengths m, but it can accommodate all the extensions mentioned earlier. In particular, it can easily deal with classes of characters by just altering the preprocessing, and it is by far the fastest algorithm to search for this type of pattern.
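The whole BNDM scan can be sketched as follows (Python, for a simple single-word pattern; the pattern-prefix positions remembered in `last` determine the safe window shift):

```python
# BNDM sketch: backward window scan with a bit-parallel suffix automaton.

def bndm(pattern, text):
    """Return the starting positions of all occurrences, skipping characters."""
    m, n = len(pattern), len(text)
    # B[c]: bit i set iff the reversed pattern has c at position i+1,
    # i.e. pattern[m - 1 - i] == c.
    B = {}
    for i in range(m):
        c = pattern[m - 1 - i]
        B[c] = B.get(c, 0) | (1 << i)
    out = []
    pos = 0                             # the window is text[pos : pos + m]
    while pos <= n - m:
        j, last = m, m
        D = (1 << m) - 1                # all states active (epsilon transitions)
        while D != 0:
            D &= B.get(text[pos + j - 1], 0)
            j -= 1
            if D & (1 << (m - 1)):      # the suffix read is a pattern prefix
                if j > 0:
                    last = j            # remember it: it bounds the safe shift
                else:
                    out.append(pos)     # whole window read: an occurrence
                    break
            D <<= 1
        pos += last
    return out

print(bndm("abc", "xxabcxabc"))         # -> [2, 6]
```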

Note that this type of search is called backward scanning because the text characters inside the window are read backwards. However, the search progresses from left to right in the text, as the window is shifted. There have been a few other attempts to skip characters under a Shift-Or approach.

Regular Expression Searching

Bit-parallelism has been successfully used to deal with regular expressions. Shift-Or was extended in two ways to deal with this case: first using Thompson's and later Glushkov's constructions of NFAs from the regular expression. Figure 4 shows both constructions for the pattern abcd(d|ε)(e|f)de.

[Diagram: Thompson's construction (top) uses states 0 to 16, with many ε-transitions linking the subexpressions; Glushkov's construction (bottom) uses states 0 to 9, one per pattern position plus an initial state, with no ε-transitions.]

Figure 4: Thompson's (top) and Glushkov's (bottom) resulting NFAs for the regular expression abcd(d|ε)(e|f)de.

Given a regular expression with m positions (each character/class defines a new position), Thompson's construction produces an automaton of up to 2m states. Its advantage is that the states of the resulting automaton can be arranged in a bit mask so that all the transitions move forward, except the ε-transitions. This is used for a bit-parallel simulation which moves the bits forward as for the simple Shift-Or, and then applies all the moves corresponding to ε-transitions. For this sake, a table E mapping from bit masks to bit masks is precomputed, so that E(x) yields a bit mask where x has been expanded with all the ε-moves. The code for a transition is therefore

    D ← ((D << 1) | 0^{s-1} 1) & B[t_j]
    D ← E(D)

We have used s as the number of states in the Thompson automaton, where m ≤ s ≤ 2m.

The main problem is that the E table has 2^s entries. This is handled by splitting the argument horizontally, so that, for example, two tables E_1 and E_2 can be created which receive half masks and deliver full masks with the expansion of only the bits of their half (of course the ε-transitions can go to the other half; this is why they deliver full masks). In this way the amount of memory required is largely reduced, at the cost of two operations to build the real E value. This takes advantage of the fact that, if a bit mask x is split in two halves, x = y : z, then

    E(y : z) = E(y 0^{|z|}) | E(0^{|y|} z)
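The halving trick rests on the fact that the ε-expansion is bitwise linear: E of a mask is the or of E over its individual bits. A toy Python sketch (the closure table below is hypothetical, not derived from any particular regular expression) verifies the identity:

```python
s = 8                                   # number of states (bits)
h = s // 2

# Hypothetical per-state expansions (bit masks over the s states).
closure = [0b00000011, 0b00000010, 0b00001100, 0b00001000,
           0b00110000, 0b00100000, 0b11000000, 0b10000000]

def E(x):
    """Full expansion: or of the closures of the bits set in x (2^s entries if tabulated)."""
    r = 0
    for i in range(s):
        if (x >> i) & 1:
            r |= closure[i]
    return r

# Two tables of 2^(s/2) entries each instead of one table of 2^s entries.
E_low  = [E(x) for x in range(1 << h)]           # expands bits 0..h-1
E_high = [E(x << h) for x in range(1 << h)]      # expands bits h..s-1

def E_split(x):
    return E_high[x >> h] | E_low[x & ((1 << h) - 1)]

# E(y : z) = E(y 0^h) | E(0^h z), for every mask:
assert all(E_split(x) == E(x) for x in range(1 << s))
print("ok")
```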

Glushkov's construction has the advantage of producing an automaton with exactly m + 1 states, which can be as low as half the states generated by Thompson's construction. On the other hand, the structure of arrows is not regular and the trick of forward shift plus ε-moves cannot be used. Instead, it has another property: all the arrows leading to a given state are labeled by the same character or class. This property has been used recently to provide a space-economical bit-parallel simulation where the code for a transition is

    D ← T[D] & B[t_j]

where T is a table that receives a bit map D of states and delivers another bit map of states reachable from states in D (no matter by which characters). The ε-transitions do not have to be dealt with because Glushkov's construction does not produce them. The T table can be horizontally partitioned as well.

It has been shown that a bit-parallel implementation of Glushkov's construction is faster than one of Thompson's, which should be clear since in general the tables obtained are much smaller. An interesting improvement, possible thanks to bit-parallelism, is that classes of characters can be dealt with using the normal mechanism of simple patterns, without generating one state for each alternative.

On the other hand, a deterministic automaton (DFA) can be built from the nondeterministic one. It is not hard to see that indeed the previous constructions simulate a DFA, since each state of the DFA can be seen as a set of states of the NFA, and each possible set of states of the NFA is represented by a bit mask. Normally the DFA takes less space because only the reachable combinations are generated and stored, while for direct access to the tables we need to store, in the bit-parallel simulations, all the possible combinations of active and inactive states. On the other hand, bit-parallelism permits extending regular expressions with classes of characters and other features (e.g. approximate searching), which is difficult otherwise. Furthermore, Glushkov's construction permits not storing a table of states × characters, of worst-case size O(2^m |Σ|) in the case of a DFA, but just the table T, of size O(2^m). Finally, in case of space problems, the technique of splitting the bit masks can be applied.

Therefore, we use the bit-parallel simulation of Glushkov's automaton for nrgrep. After the update operation we check whether a final state of D is reached (this means just an and operation with the mask of final states). Describing Glushkov's NFA construction algorithm is outside the scope of this paper, but it takes O(m^2) time. The result of the construction can be represented as a table B[c], which yields the states reached by character c (no matter from where), and a table Follow(i), which yields the bit mask of states activated from state i (no matter by which character). From Follow, the deterministic version T can be built in O(2^m) worst-case time with the following procedure:

    T[0] ← 0
    for i ∈ 0 ... m
        for j ∈ 0 ... 2^i − 1
            T[2^i + j] ← Follow(i) | T[j]
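In Python, with a hypothetical 3-state Follow table, the procedure reads:

```python
m = 3                                  # hypothetical number of states
follow = [0b010, 0b100, 0b001]         # Follow(i): states activated from state i

# Build T incrementally: T over masks with highest set bit i reuses the
# already-computed entries for the lower bits.
T = [0] * (1 << m)
for i in range(m):
    for j in range(1 << i):
        T[(1 << i) + j] = follow[i] | T[j]

# T[D] is the union of Follow(i) over all states i active in D:
assert T[0b101] == (follow[0] | follow[2])
print(bin(T[0b111]))                   # -> 0b111
```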

A backward search algorithm for regular expressions is also possible, and in some cases the search is much faster than a forward search. The idea is as follows. First, we compute the length of the shortest path from the initial state to a final state, using a simple graph algorithm. This will be the length of the window, in order not to lose any occurrence. Second, we reverse all the arrows of the automaton, make all the states initial, and take as the only final state the original initial state. The resulting automaton will have active states as long as we have read a reverse factor of a string matching the regular expression, and will reach its final state when we read, in particular, a reverse prefix. Figure 5 illustrates the result.

[Diagram: the automaton of Figure 4 (bottom) with all arrows reversed, every state initial, and the original initial state as the only final state.]

Figure 5: An automaton recognizing reverse prefixes of abcd(d|ε)(e|f)de, based on the Glushkov construction of Figure 4.
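The window length, i.e. the length of the shortest string matching the regular expression, can be computed with a plain breadth-first search (a Python sketch over a hypothetical adjacency list where every edge consumes one character; ε-transitions would be followed at cost zero):

```python
from collections import deque

def shortest_accepted_length(adj, initial, finals):
    """Length of the shortest path from `initial` to any state in `finals`.

    adj[q] lists the states reachable from q by consuming one character.
    """
    dist = {initial: 0}
    queue = deque([initial])
    while queue:
        q = queue.popleft()
        if q in finals:
            return dist[q]
        for r in adj[q]:
            if r not in dist:
                dist[r] = dist[q] + 1
                queue.append(r)
    return None                      # no accepting path at all

# Hypothetical automaton: 0 -> 1 -> 3 (final), and 0 -> 2 -> 2 -> 3.
adj = {0: [1, 2], 1: [3], 2: [2, 3], 3: []}
print(shortest_accepted_length(adj, 0, {3}))   # -> 2
```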

We apply the same BNDM technique of reading the text window backwards. If the automaton runs out of active states, then no factor of an occurrence of the pattern is present in the window, and we can shift the window, aligning its beginning to one position after the one that caused the mismatch. If, on the other hand, we reach the beginning of the window in the backward scan, we cannot guarantee that an occurrence has been found. When searching for a simple string, the only way to reach the window beginning is to have read the whole pattern. Regular expressions, on the other hand, can have occurrences of different lengths, and all we know is that we have matched a factor. There are in fact two choices:

1. The final state of the automaton is not active, which means that we have not read a prefix of an occurrence. In this case we shift the window by one position and resume the scanning.

2. The final state of the automaton is active. Since we have found a pattern prefix, we have to perform a forward verification starting at the window initial position, until either we find an occurrence or the automaton runs out of active states.

So we need apart from the reversed automaton also the normal automaton without initial

selflo op as in Figure for the verication of p otential o ccurrences

An extra complication comes from the fact that the NFA with reverse arrows does not have the property that all the arrows leading to a state are labeled by the same character. Rather, all the arrows leaving a state are labeled by the same character. Hence the simulation can be done as follows:

    D <- T^R[D] & B[t_j]

where T^R corresponds to the reverse arrows, but B is that of the forward automaton.

Approximate String Matching

Approximate searching means finding the text substrings that can be converted into the pattern by performing at most k operations on them. Permitting a limited number k of such differences (also called errors) is an essential tool to recover from typing, spelling and OCR (optical character recognition) errors. Despite that approximate pattern matching can be reduced to a problem of regular expression searching, the regular expression grows exponentially with the number of allowed errors or differences.

A first design decision is what should be taken as an error. Based on existing surveys, we have chosen the following four types of errors: insertion of characters, deletion of characters, replacement of a character by another character, and exchange of adjacent characters (transposition). These errors are symmetric, in the sense that one can consider that they occur in the pattern or in the text, and the result is the same. Traditionally, only the first three errors have been permitted, because transposition, despite being recognized as a very important source of errors, is harder to handle. However, the problem is known to grow very fast in complexity as k increases, and since a transposition can only be simulated with two errors of the other kind (i.e. an insertion and a deletion), we would need to double k in order to obtain a similar result. One of the algorithmic contributions of nrgrep is a bit-parallel algorithm permitting the transpositions together with the other types of errors. This permits us searching with smaller k values, and hence obtaining faster searching with similar or better retrieval results.

Approximate searching is characterized by the fact that no known search algorithm is the best in all cases. From the wealth of existing solutions, we have selected those that adapt best to our goal of flexibility and uniformity. Three main ideas can be used.

Forward Searching

The most basic idea, well suited to bit-parallelism, is to have k + 1 similar automata representing the state of the search when zero to k errors have occurred. Apart from the normal arrows inside each automaton, there are arrows going from automaton i to automaton i + 1, corresponding to the different errors. The original approach did not consider transpositions, which have been dealt with later.

The figure below shows an example with k = 2. Let us first focus on the big nodes and solid/dashed lines. Apart from the normal forward arrows, we have three types of arrows that lead from each row to the next one (i.e. increment the number of errors): vertical arrows, which represent the insertion of characters in the pattern, since they advance in the text but not in the pattern; diagonal arrows, which represent replacement of the current text character by a pattern character, since they advance in the text and in the pattern; and dashed diagonal arrows (epsilon-transitions), which represent deletion of characters in the pattern, since they advance in the pattern without consuming any text input. The remaining arrows (the dotted ones) represent transpositions, which permit reading the next two pattern characters in the wrong order and move to the next row. This is achieved by means of temporary states, which we have drawn as smaller circles.

If we disregard the transpositions, there are different ways to simulate this automaton in O(1) time per text character when it fits in a computer word, but no bit-parallel solution has been presented to account for the transpositions. This is one of our contributions, and is explained later in the paper. We extend a simpler bit-parallel simulation, which takes O(k) time per text character as long as m = O(w). The technique stores each row in a machine word R_0 ... R_k, just as we did for Shift-And. The bitmask R_i is initialized to 0^{m-i} 1^i, to account for the i possible initial deletions in the pattern. The update procedure to produce R'_i upon reading text character t_j is as

[Figure: A nondeterministic automaton accepting the pattern "abcd" with at most k = 2 errors. Unlabeled solid arrows match any character, while the dashed (not dotted) lines are epsilon-transitions.]

follows:

    R'_0 <- ((R_0 << 1) | 0^{m-1}1) & B[t_j]
    for i in 1 ... k do
        R'_i <- ((R_i << 1) & B[t_j]) | R_{i-1} | (R_{i-1} << 1) | (R'_{i-1} << 1) | 0^{m-1}1

where of course many coding optimizations are possible (and are done in nrgrep), but they make the code less clear. In particular, using the complemented version of the representation, as in Shift-Or, is a bit faster.

The rationale of the procedure is as follows. R_0 has the same update formula as for the Shift-And algorithm. For the others, the update formula is the or of four possible facts. The first one corresponds to the normal forward arrows (note that there is no initial self-loop for them, only for R_0). The second one brings 1s (state activations) from the upper row at the same position, which corresponds to a vertical arrow, i.e. an insertion. The third one brings 1s from the upper row at the previous positions; this is obtained with the left shift, corresponding to a diagonal arrow, i.e. a replacement. The fourth one is similar, but it works on the newer value of the previous row, R'_{i-1} instead of R_{i-1}, and hence it corresponds to an epsilon-transition, i.e. a deletion.

A match is detected when R_k & 10^{m-1} is not zero. It is not hard to show that whenever the final state of R_i is active, the final state of R_k is active too, so it suffices to consider the final state of R_k as the only final state.
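The update formulas above can be transcribed as follows. This is an illustrative Python sketch covering insertions, deletions and replacements only (the transposition extension is omitted); the function name and the dictionary-based B table are our own choices, not nrgrep's code:

```python
def approx_search(pattern, text, k):
    """Bit-parallel approximate search allowing k insertions, deletions and
    replacements (transpositions omitted in this sketch).
    Yields every text position where an occurrence ends."""
    m = len(pattern)
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << i)
    # row i starts with its i lowest bits set: i initial deletions
    R = [(1 << i) - 1 for i in range(k + 1)]
    final = 1 << (m - 1)
    for j, c in enumerate(text):
        Bc = B.get(c, 0)
        old = R[0]                      # value of R_{i-1} before its update
        R[0] = ((R[0] << 1) | 1) & Bc
        for i in range(1, k + 1):
            cur = R[i]
            R[i] = (((cur << 1) & Bc)   # forward arrow: match a character
                    | old               # insertion: same position, upper row
                    | (old << 1)        # replacement: prev. position, upper row
                    | (R[i - 1] << 1)   # deletion: new value of the upper row
                    | 1)                # ever-active initial state
            old = cur
        if R[k] & final:
            yield j
```

With k = 0 the loop body degenerates to plain Shift-And scanning.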

Backward Searching

Backward searching can be easily adapted from the forward searching automaton, following the same techniques used for exact searching. That is, we build the automaton of the previous figure on the reverse pattern, consider all the states as initial ones, and consider as the only final state the first node of the last row. This will recognize all the reverse prefixes of P allowing at most k errors, and will have active states as long as some factor of P has been seen with at most k errors.

Some observations are of interest. First, note that we will never shift the window before examining at least k + 1 characters, since we cannot make k + 1 errors before that. Second, the length of the window has to be that of the shortest possible match, which because of deletions in the pattern is of length m - k. Third, just as it happens with regular expressions, the fact that we arrive at the beginning of the window with some active states in the automaton does not immediately mean that we have an occurrence, so we have to check the text for a complete occurrence starting at the window beginning.

It has been shown that this algorithm takes time O(n k (k + log_sigma m) / (m - k)) for m <= w.

Splitting into k + 1 Subpatterns

A well known property establishes that, under the model of insertions, deletions and replacements, if the pattern is cut into k + 1 contiguous pieces, then at least one of the pieces occurs unchanged inside any occurrence with k errors or less. This is easily verified, because each operation can alter at most one piece. So the technique consists of performing a multipattern search for the pieces, without errors, and checking the text surrounding the occurrences of each piece for a complete approximate occurrence of the whole pattern. This leads to the fastest algorithms for low error levels.

The property is not true if we add the transposition, because this operation can alter two contiguous pieces at the same time. Much better than splitting the pattern into 2k + 1 pieces is to split it into k + 1 pieces and leave one unused character between each pair of pieces. Under this partition, a transposition can alter at most one piece.
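The partition just described can be sketched as follows (a hypothetical helper, not nrgrep's code): the pattern is cut into k + 1 equal-length pieces, skipping one character between consecutive pieces:

```python
def split_with_gaps(pattern, k):
    """Cut the pattern into k+1 equal-length pieces, leaving one unused
    character between consecutive pieces, so that a single transposition
    can alter at most one piece."""
    piece_len = (len(pattern) - k) // (k + 1)  # the k gap characters are unused
    pieces, pos = [], 0
    for _ in range(k + 1):
        pieces.append(pattern[pos:pos + piece_len])
        pos += piece_len + 1                   # skip the gap character
    return pieces
```

On the example below it reproduces the pieces used in the text.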

We are now confronted with a multipattern search problem. This can be solved with a very simple modification of the single pattern backward search algorithm. Consider the pattern "abracadabra" searched with two errors. We split it into "abr", "cad" and "bra". The figure below depicts the automaton used for backward searching of the three pieces. This is implemented with the same bit-parallel mechanism as for a single pattern, except that there are more final states, and an extra bit mask is necessary to avoid propagating 1s by the missing arrows. This extra bit mask is implemented at no cost, by removing the corresponding 1s from the B masks during the preprocessing. Note that for this to work we need that all the pieces have the same length.

[Figure: An automaton recognizing reverse prefixes of selected pieces of "abracadabra".]

Searching for Simple Patterns

Nrgrep directly uses the BNDM algorithm when it searches for simple patterns. However, some modifications are necessary to convert the pattern matching algorithm into a search software.
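For reference, the plain BNDM scan that nrgrep builds upon can be sketched as follows (an illustrative Python transcription; nrgrep itself is written in C):

```python
def bndm_search(pattern, text):
    """Classic BNDM: each text window of length m is scanned backwards with
    a bit-parallel nondeterministic suffix automaton of the reversed pattern.
    Returns the starting positions of all occurrences."""
    m, n = len(pattern), len(text)
    mask = (1 << m) - 1
    # B[c]: bit i set iff the reversed pattern has character c at position i
    B = {}
    for i, c in enumerate(reversed(pattern)):
        B[c] = B.get(c, 0) | (1 << i)
    out, pos = [], 0
    while pos <= n - m:
        j, last = m, m
        D = mask                         # every factor is still alive
        while D:
            D &= B.get(text[pos + j - 1], 0)
            j -= 1
            if D & (1 << (m - 1)):       # a pattern prefix was recognized
                if j > 0:
                    last = j             # remember it: it gives the shift
                else:
                    out.append(pos)      # the whole pattern was read
                    break
            D = (D << 1) & mask
        pos += last
    return out
```

When the automaton dies, the window is shifted past the last recognized pattern prefix.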

Record Oriented Output

The first issue to consider is what we will report from the results of the search. Printing the text positions i where a match occurs is normally of little help for the user. Printing the text portion that matched (i.e. the occurrence) does not help much either, because this is equal to the pattern, at least if no classes of characters are used. We have followed agrep's philosophy: the most useful way to present the result is to print a context of the text portion that matched the pattern.

This context is defined as follows. The text is considered to be a sequence of records. A user-defined record delimiter determines the text positions where a new record starts. The text areas until the first record separator and after the last record separator are considered records as well. When a pattern is found in the text, the whole record where the occurrence lies is printed. If the occurrence overlaps with or contains a record delimiter, then it is not considered a pattern occurrence.

The record delimiter is by default the newline character, but the user can specify any other simple pattern as a record delimiter. For example, the string "From " can be used to delimit email messages in a mail archive, therefore being able to retrieve complete emails that contain a given string. The system permits specifying whether the record delimiter should be contained in the next or in the previous record. Moreover, nrgrep permits specifying an extra record separator when the records are printed (e.g. add another newline).

It should be clear by now that searching for longer strings is faster than for shorter ones. Since record delimiters tend to be short strings, it is not a good idea to delimit the text records first and then search for the pattern inside each record. Rather, we prefer to search for the pattern in the text without any consideration for record separators and, when the pattern is found, search for the next and previous record delimiters. At this time we may determine that the occurrence overlaps a record delimiter, and discard it. Note that we have to be able to search for record delimiters forward and backward. We use the same BNDM algorithm to search for record delimiters.

There are some nrgrep options, however, that make a record-wise traversal necessary: printing record numbers, or printing records that do not contain the pattern. In this case we advance in the file by delimiting the records first and then searching for the pattern inside each record.

Text Buffering

Text buffering is necessary to cope with large files and to achieve optimum performance. For example, in nrgrep the buffer size is set to 64 Kb, because it fits well the cache size of many machines, but this default can be overridden by the user. To avoid complex interactions between record limits and buffer limits, we discard the last incomplete record each time we read a new buffer from disk. The discarded partial record is moved to the beginning of the buffer before reading more text at the next buffer loading. If a record is larger than the buffer size, then an artificial record delimiter is inserted to correct the situation, and a warning message is printed. Note that this also requires the ability to search for the record delimiter in backward direction. This technique works well unless the record size is large compared to the buffer size, in which case the user should enlarge the buffer size using the appropriate option.

Contexts

Another capability of nrgrep is context specification. This means that the pattern occurrence has to be surrounded by certain characters in order to be considered as such. For example, one may specify that the pattern should match a whole word (i.e. surrounded by separators) or a whole record (i.e. surrounded by record delimiters). However, it is not just a matter of adding the context strings at the ends of the pattern because, for example, a word may be at the beginning of a record, and hence the separator may be absent. We solve this by checking each occurrence found, to determine that the required string is present before/after the occurrence, or that we reached the beginning/end of the record.

This seems trivial for a simple pattern, because its length is fixed, but for more complex patterns, such as regular expressions, where there may be many different occurrences in the same text area, we need a way to discard possible occurrences and still check for other ones that are basically in the same place. For example, the search for "a*ba*" as a whole word should match in the text "aaa aabaa aaa". Despite that "b" alone is an occurrence of the pattern that does not fit the whole word criterion, the occurrence can be extended to another one that does. We return to this issue later.

Subpattern Filter

The BNDM algorithm is designed for the case m <= w. Otherwise, we have in principle to simulate the algorithm using many computer words. However, it is much faster to prune the pattern to its first w characters, search for that subpattern, and try to extend its occurrences to an occurrence of the complete pattern. This is because, for reasonably large w, the probability of finding a pattern of length w is low enough to make the cost of unnecessary verifications negligible. On the other hand, the benefit of a possible shift of length larger than w would be cancelled by the need to update ceil(m/w) computer words per text character read.

Hence we select a contiguous subpattern of w characters or classes (remember that a class also needs one bit, the same as a character) and search for it. Its occurrences are verified with the complete pattern, prior to checking records and contexts.

The main point is which part of the pattern to search for. In the abstract algorithms, any part is equally good or bad, because a uniformly distributed model is assumed. In practice, different characters have different probabilities, and some pattern positions may have classes of characters, whose probability is the sum of those of the individual characters. This, in fact, is farther reaching than the problem of the limit w on the length of the pattern: even in a short pattern we may prefer not to include a part of the pattern in the fast scanning part. This is discussed in detail in the next subsection.

Selecting the Optimal Scanning Subpattern

Let us consider a pattern formed by the string "hello" followed by four classes matching the whole alphabet and a final letter. The class positions match every character, which means that the search with the nondeterministic automaton inside the window will examine at least four window positions even in a text window like "xxxxxxxxx", and will shift the window only by a small amount, so the average number of comparisons per character is high at the very best. If we take just the subpattern "hello", then we obtain a much lower average number of operations per text character.

We have designed a general algorithm that, under the assumption that the text characters are independent, finds the best search subpattern in O(m^3) worst case time, although in practice it is closer to O(m^2 log m). This is a modest overhead in most practical text search scenarios. The algorithm is tailored to the BDM/BNDM search technique, and works as follows.

First, we build an array prob[1 ... m], which stores the sum of the probabilities of the characters participating in the class of each pattern position. Nrgrep stores an array of English letter probabilities, but this can be tailored to other purposes, and the final scheme is robust with respect to changes in those probabilities from one language to another. The construction of prob takes O(m sigma) time in the worst case.

Second, we build an array pprob[1 ... m, 0 ... m], where pprob[i, l] stores the probability of matching the subpattern P_{i ... i+l-1}. This is computed in O(m^2) time by dynamic programming, as follows:

    pprob[i, 0] <- 1,    pprob[i, l] <- prob[i] * pprob[i + 1, l - 1]

for increasing l values.

Third, we build an array mprob[1 ... m, 1 ... m, 1 ... m], where mprob[i, j, l] gives the probability of matching any pattern substring of length l inside P_{i ... j}. This is computed in O(m^3) time by dynamic programming, using the formulas

    mprob[i, j, l] <- 0,  if j - i + 1 < l
    mprob[i, i + l - 1, l] <- pprob[i, l]
    j >= i + l:  mprob[i, j, l] <- pprob[i, l] + mprob[i + 1, j, l] - pprob[i, l] * mprob[i + 1, j, l]

for decreasing i values. Note that we have used the formula for the union of independent events, Pr(A or B) = Pr(A) + Pr(B) - Pr(A) Pr(B).

Finally, the average cost per character associated to a subpattern P_{i ... j} is computed with the following rationale. With probability 1 we inspect one window character. If any pattern position in the range [i, j] matches the window character read, then we read a second character (recall the backward scanning technique). If any consecutive pair of pattern positions in [i, j] matches the two window characters read, then we read a third character, and so on. This is what we have computed in mprob. The expected number of window characters to read is therefore

    pcost(i, j) <- 1 + mprob[i, j, 1] + mprob[i, j, 2] + ... + mprob[i, j, j - i + 1]

In the BDM/BNDM algorithm, as soon as the window suffix read ceases to be found in the pattern, we shift the window to the position following the character that caused the mismatch. A simplified computation considers that the above pcost(i, j), which is an average, can be used as a fixed value, and therefore we approximate the real average number of operations per text character as

    ops(i, j) <- pcost(i, j) / (j - i + 2 - pcost(i, j))

which is the average number of characters inspected divided by the shift obtained. Once this is obtained, we select the (i, j) pair that minimizes the work to do. We also avoid considering cases where j - i + 1 > w. The total time needed to obtain this is O(m^3).
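The whole selection scheme can be sketched as follows, under the stated independence assumption. This is a direct, unoptimized Python transcription (not nrgrep's code); `classes` and `char_prob` are hypothetical inputs giving one character class per pattern position and the per-character probabilities:

```python
def best_subpattern(classes, char_prob, w=64):
    """Choose the subpattern P[i..j] minimizing the expected operations
    per text character, following the pcost/ops model described above."""
    m = len(classes)
    prob = [sum(char_prob.get(c, 0.0) for c in cl) for cl in classes]
    # pprob[i][l]: probability of matching P[i..i+l-1]
    pprob = [[1.0] * (m + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for l in range(1, m - i + 1):
            pprob[i][l] = prob[i] * pprob[i + 1][l - 1]
    best, best_ops = None, float("inf")
    for i in range(m):
        for j in range(i, min(i + w, m)):
            pcost = 1.0
            for l in range(1, j - i + 2):
                # mp: probability that some factor of length l of P[i..j]
                # matches (union of events assumed independent)
                mp = 0.0
                for s in range(i, j - l + 2):
                    mp = pprob[s][l] + mp - pprob[s][l] * mp
                pcost += mp
            shift = (j - i + 2) - pcost   # expected window shift
            if shift > 0:
                ops = pcost / shift
                if ops < best_ops:
                    best, best_ops = (i, j), ops
    return best, best_ops
```

On a three-position pattern whose last position matches the whole alphabet, the procedure correctly drops that position from the scanning subpattern.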

The amount of work is reduced by noting that we can check the ranges in increasing order of i values, and therefore we do not need the first coordinate of mprob, which can be independently computed for each i. Moreover, we start by considering the maximum j since, in practice, longer subpatterns tend to be better than shorter ones. We keep the best value found up to now, and avoid considering ranges [i, j] which cannot be better than the current best solution even for the ideal pcost(i, j) = 1 (note that, since i is tried in ascending order and j in descending order, the whole pattern is tried first). This reduces the cost to O(m^2 log m) in practice.

[Figure: A nondeterministic automaton accepting the extended pattern abc?d?efg?h.]

As a result of this procedure, we not only obtain the best subpattern to search for under a simplified cost model, but also a hint of how many operations per character we will perform. If this number is larger than 1, then it is faster and safer to switch to plain Shift-Or. This is precisely what nrgrep does.

Searching for Extended Patterns

As explained earlier, we have considered optional and repeatable classes of characters as the features allowed in our extended patterns. Each of these features is treated in a different way, and all are integrated in an automaton which is more general than that of a simple pattern. Over this automaton we later apply the general forward and backward search machinery.

Optional Characters

Let us consider the pattern "abc?d?efg?h". A nondeterministic automaton accepting that pattern appears in the figure above.

The figure is chosen so as to show that multiple consecutive optional characters could exist. This rules out the simplest solution, which works when that does not happen: one could set up a bit mask O with ones in the optional positions (in our example, O = 01001100) and let the ones in previous states of D propagate to them. Hence, after the normal update to D, we could perform

    D <- D | ((D << 1) & O)

For example, this works if we have read "abcdef" (D = 00100000) and the next text character is "h", since the above operation would convert D to 01100000 before operating it against B[h]. However, it does not work if the text is "abefgh", where both consecutive optional characters have been omitted.

A general solution needs to propagate each active state in D so as to flood all the states ahead of it that correspond to optional characters. In our example, we would like that when D is 00000010 (and in general whenever its second bit is active) it becomes 00001110 after the flooding.

This is achieved with three masks, A, I and F, marking different aspects of the states related to optional characters. More specifically, the i-th bit of A is set if this position of P is optional; that of I is set if this is the position preceding the first optional character of a block of consecutive optional

characters; and that of F is set if this is the position after the last optional character of a block.

[Figure: A nondeterministic automaton accepting the extended pattern abc+def*gh.]

In our example, A = 01001100, I = 00100010, and F = 10010000. After performing the normal transition on D, we do as follows:

    Df <- D | F
    D <- D | (A & ((~(Df - I)) xor Df))

whose rationale is as follows. The first line adds a 1 at the positions following the optional blocks in D. In the second line, we add some active states to D. Since the states to add are and-ed with A, let us just consider what happens inside a specific optional block. The effect that we want is that the first 1 (counting from the right) floods all the block bits to the left of it. We subtract I from Df, which is equivalent to subtracting 1 at each block. This subtraction cannot propagate its effect outside the block, because there is a 1, coming from the |F, in Df after the highest bit of the block. The effect of the subtraction is that all the bits until the first 1 (counting from the right) are reversed (e.g. 1100100 - 1 = 1100011), and the rest are unchanged. In general,

    b_x b_{x-1} ... b_{x-y} 1 0^z  -  1  =  b_x b_{x-1} ... b_{x-y} 0 1^z

When this is reversed by the ~ operation, we get the same string with the b's complemented: ~b_x ~b_{x-1} ... ~b_{x-y} 1 0^z. Finally, when this is xored with the same Df = b_x b_{x-1} ... b_{x-y} 1 0^z, we get 1^{y+1} 0^{z+1}.

This is precisely the effect we wanted: the last 1 flooded all the bits to the left of it. That 1 itself has been converted to zero, however, but it is restored when the result is or-ed with the original D. This works even if the last active state in the optional block is the leftmost bit of the block. Note that it is necessary to and with A at the end, to filter out the bits of F that survive the process whenever the block is not all zeros. On the other hand, it is necessary to or Df with F, because a block of all zeros would propagate the subtraction outside its limits.
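A minimal demonstration of the flooding step, using the mask values derived above for abc?d?efg?h (the masks follow our reconstruction of the definitions; Python's unbounded integers make the ~ harmless, because the and with A discards the high bits):

```python
# Pattern abc?d?efg?h; bit i-1 of a mask refers to pattern position i.
A = 0b01001100  # optional positions: c, d, g
I = 0b00100010  # position preceding the first optional of each block
F = 0b10010000  # position following the last optional of each block

def flood(D):
    """Propagate active states through blocks of optional characters."""
    Df = D | F
    return D | (A & (~(Df - I) ^ Df))
```

After reading "ab", skipping c (or c and d) becomes possible, and after reading "abcdef", skipping g becomes possible; with no active states nothing is flooded.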

Note that we could have a border problem if there are optional characters at the beginning of the pattern. As seen later, however, this cannot happen when we select the best subpattern for fast scanning, but it has to be dealt with when verifying the whole pattern.

Repeatable Characters

There are two kinds of repeatable characters, marked in the syntax by "*" (zero or more repetitions) and "+" (one or more repetitions). Each of them can be simulated using the other, since a+ = aa* and a* = (a+)?. For involved technical reasons, related for example to the ability to easily build the masks for the reversed patterns, and to border conditions for the verification, we preferred the second choice, despite that it uses one more bit than necessary for the "*" operation. The figure above shows the automaton for "abc+def*gh".

The bit-parallel simulation of this automaton is more straightforward than for the optional characters. We just need a mask S[c] that, for each character c, tells which pattern positions can remain active when we read character c. In the above example, S[c] = 00000100 and S[f] = 00100000. The epsilon-transition is handled with the mechanism for optional characters. Therefore, a complete simulation step, permitting optional and repeatable characters, is as follows:

    D <- (((D << 1) | 0^{m-1}1) & B[t_j]) | (D & S[t_j])
    Df <- D | F
    D <- D | (A & ((~(Df - I)) xor Df))
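For a pattern with a repeatable character only (say ab+c, a hypothetical example), the simulation step reduces to the first line, since A, I and F are empty:

```python
def search_repeat(text):
    """Forward Shift-And scan for the extended pattern ab+c: the S mask
    lets the state of 'b' survive while further b's are read."""
    B = {'a': 0b001, 'b': 0b010, 'c': 0b100}
    S = {'b': 0b010}                    # position 2 may repeat
    D, out = 0, []
    for j, c in enumerate(text):
        D = (((D << 1) | 1) & B.get(c, 0)) | (D & S.get(c, 0))
        if D & 0b100:                   # final state: an occurrence ends here
            out.append(j)
    return out
```

The D & S[t_j] term implements the self-loop on the repeatable position.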

Forward and Backward Search

Our aim is to extend the approach used for simple patterns to patterns containing optional and repeatable symbols. Forward scanning is immediate, once we learn how to simulate the different automata using bit-parallelism. We just have to add an initial self-loop to enable text scanning (this is already done in the last formula). We detect the final positions of occurrences, and then check the surrounding record and context conditions.

Backward searching needs, in principle, just an automaton that recognizes the reverse factors of the pattern. This is obtained by building exactly the same automaton of the forward scan (without initial self-loop) on the reverse pattern, and letting all the states be initial (i.e. initializing D with all the states active). However, there are some problems to deal with, all of them deriving from the fact that the occurrences now have variable length.

Since the occurrences do not have a fixed length, we have to compute the minimum length of a possible match of the pattern (e.g. 7 in the example "abc+def*gh"), and use this value as the width of the search window, in order not to lose any potential occurrence. As before, we set up an automaton that recognizes all the reverse factors of the pattern, and use it to traverse the window backward. Because the occurrences are not all of the same length, the fact that we arrive at the beginning of the window does not immediately imply that the pattern is present. For example, a length-7 text window could be "cdefffg". Despite that this is a factor of the pattern, and therefore we would reach the window beginning, no pattern occurrence starts at the beginning of the window.

Therefore, each time we arrive at the beginning of the window, we have to check that the initial state is active, and then run a forward verification from the window beginning on, until either we find a match (i.e. the last automaton state is activated) or we determine that no match can start at the window position under consideration (i.e. the automaton runs out of active states). The automaton used for this forward verification is the same as for forward scanning, except that the initial self-loop is absent. However, as we see next, verification is in fact a little more complicated, and we mix it with the rest of the verifications that are needed on every occurrence: surrounding record, context, etc.

Verifying Occurrences

In fact, the verification is a little different since, as we see in the next subsection, we select a subpattern for the scanning (as before, this is necessary if the pattern has more than w characters/classes, but it can also be convenient on shorter patterns). Say that P = P_1 S P_2, where S is the subpattern that has been selected for fast text scanning. Each time the backward search determines that a given text position is a potential start point for an occurrence of S, we obtain the surrounding record and check, from the candidate text position, the occurrence of S P_2 in forward direction and P_1 in backward direction (do not confuse forward/backward scanning with forward/backward verification).

When we had a simple pattern, this just needed to check that each text character belonged to the corresponding pattern class. Since the occurrences have variable length, we need to use prebuilt automata for S P_2 and for the reversed version of P_1. These automata do not have the initial self-loop. However, this time we need to use a multiword bit-parallel simulation, since the patterns could be longer than the computer word.

Note also that under this scenario the forward scanning also needs verification. In this case we find a guaranteed end position of S in the text (not a candidate one, as for backward searching). Hence we check from that final position P_2 in forward direction, and P_1 S in backward direction. Note that we need to check S again, because the automaton cannot tell where the beginning of S is, and there could be various beginning positions.

A final complication is introduced by record limits and context conditions. Since we require that a valid occurrence lies totally inside a record, we should submit for verification the smallest possible occurrence. For example, the pattern "b*[ab]cde*" has many occurrences in the text record "bbbcdee", and we should report "bcd" in order to guarantee that no valid occurrence will be missed. However, it is possible that the context conditions require the presence of certain strings immediately preceding or following the occurrence. For example, if the context condition tells that the occurrence should begin a record, then the minimal occurrence "bcd" would not qualify, while "bbbcde" would do.

Fortunately, context conditions about the initial and final positions of occurrences are independent, and hence we can check them separately. So we treat the forward and backward parts of the verification separately. For each one, we traverse the text and find all the positions that represent the beginning or ending of occurrences, and submit them to the context checking mechanism, until one is accepted, the record ends, or the automaton runs out of active states.

Selecting a Good Search Subpattern

As before, we would like to select the best subpattern for text scanning, since we have anyway to check the potential occurrences. We want to apply an algorithm similar to that for simple patterns. However, this time the problem is more complicated, because there are more arrows in the automaton.

We compute prob as before, adding another array sprob[1 ... m], which is the probability of staying at state i via the S[c] mask. The major complication is that a factor of length l starting at pattern position i does not necessarily finish at pattern position i + l - 1. Therefore, we fill an array pprob[1 ... m, 1 ... m, 0 ... L], where L is the minimum length of an occurrence of P (at most m), and pprob[i, j, l] is the sum of the probabilities of all the factors of length l that start at position i and do not reach after pattern position j. In our example "abc+def*gh", pprob[3, 6, 5] should account for "ccccc", "ccccd", "cccde", "ccdef" and "cdeff".

This is computed as

    pprob[i, j, 0] <- 1
    (i > j)   pprob[i, j, l] <- 0
    (i <= j)  pprob[i, j, l] <- prob[i] * pprob[i + 1, j, l - 1]
                                + sprob[i] * pprob[i, j, l - 1]
                                + [if the i-th bit of A is set]  pprob[i + 1, j, l]

which is filled for decreasing i and increasing l in O(m^3) time. Note that we are simplifying the computation of probabilities, since we are computing the probability of a set of factors as the sum of the individual probabilities, which is only true if the factors are disjoint.

Similarly to pprob[i, j, l], we have mprob[i, j, l], the probability of any factor of length l starting at position i or later and not surpassing position j. This is trivially computed from pprob, as for simple patterns.

Finally, we compute the average cost per character as before. We consider subpatterns from length min(m, w) down to a length that is so short that we will always prefer the best solution found up to now. For each subpattern considered, we compute the expected cost per window as one plus the sum of the probabilities of the subpatterns of each length, i.e.

    pcost(i, j) <- 1 + mprob[i, j, 1] + mprob[i, j, 2] + ... + mprob[i, j, j - i + 1]

and later obtain the cost per character as pcost(i, j) / (l + 1 - pcost(i, j)), where l is the minimum length of an occurrence of the interval [i, j].

As for simple patterns, we fill pprob and mprob in lazy form, together with the computation of
the best factor. This makes the expected cost of the algorithm closer to O(m² log m) than to the
worst case O(m³) (really, O(m² min(m, w))).

Note that it is not possible (because it is not optimal) to select a subpattern that starts or
ends with optional characters ('?'s or '*'s). If it happens that the best subpattern gives one or
more operations per text character, we switch to forward searching and select the first min(m, w)
characters of the pattern, excluding initial and final '?'s or '*'s.

In particular, note that it is possible that the selected scanning subpattern has no optional
or repeatable character at all. In this case we use the scanning algorithm for simple patterns, even
though at verification time we use the checking algorithm for extended patterns.

Searching for Regular Expressions

The most complex patterns that nrgrep can search for are regular expressions. For this sake we use
the technique explained earlier, both for forward and backward searching. However,
some aspects need to be dealt with in a real software.

Space Problems

The first problem is space, since the T table needs O(2^m) entries, and this can be unmanageable for
long patterns. Although we expect the patterns not to be very long in typical text searching,
some reasonable solution has to be provided for when this is not the case.

We permit the user to specify the amount of memory that can be used for the table, and we split
the bitmasks in as many parts as needed to meet the space requirements. Since we do not search
for masks longer than w bits, it is in fact unlikely that the text scanning part needs more than
one or two table accesses per text character. Attending to the most common alternatives, we developed
separate code for the cases of one and two tables, which permits much faster scanning since register
usage is enabled.
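The table-splitting idea can be sketched as follows. This is an illustration of the technique, not nrgrep's code: for a toy NFA we build the full one-step table T, indexed by the 2^m possible state sets, and then replace it by two half-tables whose lookups are OR-ed together, halving the exponent of the space at the price of one extra access per character.

```python
# Sketch of transition-table splitting (illustrative; the NFA is arbitrary).
m = 6
# succ[s]: bitmask of NFA states reachable from state s in one step.
succ = [(1 << ((s + 1) % m)) | (1 << ((s + 2) % m)) for s in range(m)]

def reach(mask):
    """States reachable in one step from the state set `mask`."""
    r = 0
    for s in range(m):
        if (mask >> s) & 1:
            r |= succ[s]
    return r

# Full table: one access per text character, but 2^m entries.
T_full = [reach(D) for D in range(1 << m)]

# Split tables: 2 * 2^(m/2) entries, two accesses OR-ed per character.
h = m // 2
T_lo = [reach(D) for D in range(1 << h)]             # low h bits of the mask
T_hi = [reach(D << h) for D in range(1 << (m - h))]  # high bits of the mask

for D in range(1 << m):
    assert T_full[D] == T_lo[D & ((1 << h) - 1)] | T_hi[D >> h]
```

Here the space drops from 2^6 = 64 to 2 · 2^3 = 16 entries; the same idea extends to as many parts as needed to meet a given memory bound, because reachability is a union over the active bits.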

Subpattern Filtering

As for simple and extended patterns, we select the best subpattern for the scanning phase and
check all the potential occurrences for complete occurrences and for record and context conditions.
This means that we need a forward and a backward automaton for the
selected subpattern, and that we also need forward and backward verification automata to check
for the complete occurrence. An exception is when, given the result of selecting the best factor,
we prefer to use forward scanning, in which case only that automaton is needed, but the two
verification automata are still necessary. All the complications addressed for extended patterns
are present in regular expressions as well, namely those derived from the fact that the occurrences
may have different lengths.

It may also happen that, after selecting the search subpattern, it turns out to be just a simple
or an extended pattern, in which case the search is handled by the appropriate search algorithm,
which should be faster. The verification is handled as a general regular expression. Note that
heuristics like those of Gnu Grep, which tries to find a literal string inside the regular expression
in order to use it as a filter, are no more than particular cases of our general optimization method.
This makes our approach much smoother than others. For example, Gnu Grep will be much faster
to search for "c(aaaaa|bbbbb)c" (where the strings "caaaaac" and/or "cbbbbbc" can be used
to filter the search) than for "c(a|b)(a|b)(a|b)(a|b)(a|b)c", while the difference should not be as
large.

Regarding optimal use of space, and also because accessing smaller tables is faster, we use a
state remapping function. When a subset of the states of the automaton is selected for scanning,
we build a new automaton with only the necessary states. This reduces the size of the tables.

What is left is to explain how we select the best factor to search for. This is much more complex
for regular expressions than for extended patterns.

Selecting an Optimal Necessary Factor

The first nontrivial task on a regular expression is to determine what is a necessary factor, which
is defined as a subexpression that has to match inside every occurrence of the whole expression.
For example, "fgh" is not a necessary factor of "abc(de|fgh)ijk", but "(cd|fgh)" is. Note that
any range of states is a necessary factor in a simple or extended pattern.

Determining necessary subexpressions is complicated if we try to do it using just the automaton
graph. We rather make use of the syntax tree of the regular expression as well. The main trouble
is caused by the "|" operator, which forces us to choose one necessary factor from each branch.
We simplify the problem by fixing the minimum occurrence length ℓ and searching for a necessary
factor of that length on each side. In our previous example we could select "(cd|fg)" or "(de|hi)",
for example, but not "(cd|fgh)". However, we are able to take the whole construction, namely
"c(de|fgh)i".

The procedure is recursive and aims at finding the best necessary factor of minimum length ℓ,
provided we already know (we consider later the computation of these data):

wlens(0..m), where wlens(i) is the minimum length of a path from i to a final state;

mark, markf(0..m, 0..L), where mark(i, ℓ) is the bitmask of all the states reachable in ℓ
steps from state i, and markf(i, ℓ) the corresponding final states;

cost(0..m, 0..L), where cost(i, ℓ) is the average cost per character if we search for all the
paths of length ℓ leaving from state i.

The method starts by analyzing the root operator of the syntax tree. Depending on the type
of operand, we do as follows:

Concatenation: choose the best factor (minimum cost) among both sides of the expression. Note
that factors starting in the left side can continue into the right side, but we are deciding about
where the initial positions are.

Union: choose the best factor from each side and take the union of the states. The average cost is
the maximum over the two sides, not the sum, since the cost is related to the average number
of characters to inspect.

One or more repetitions ("+"): the minimum is one repetition, so we ignore the node and treat the
subtree.

Zero or more repetitions ("*"): the subtree cannot contain a necessary factor, since it does not
need to appear in the occurrences. Nothing can be found starting inside it. We assign a high
cost to the factors starting inside the subexpression, to make sure that it is not chosen in a
concatenation and that a union containing it will be equally undesirable.

Simple character or class (tree leaf): there is only one possible initial position, so we choose it,
unless its wlens value is smaller than ℓ.

The procedure delivers a set of selected states and the proper initial and final states for the
selected subautomaton. It also delivers a reasonable approximation of the average cost per character.

The rest of the work is to obtain the input for this procedure. The ideas are similar to those for
extended patterns, but the techniques need to make heavier use of graph traversal and are more
expensive. For example, just to compute wlens, and L = wlens(initial state) (i.e., shortest paths from any
state to a final state), we need a Floyd-like algorithm that takes O(Lm³/w) = O(min(m³, m⁴/w))
time.

Arrow probabilities prob are loaded as before, but pprob(i, ℓ) now gives the total probability of
all the paths of length ℓ leaving state i, and it includes the probability of reaching i from a previous
state. In Glushkov's construction all the arrows reaching a state have the same label, so pprob is
computed for increasing ℓ using the formulas

    pprob(i, 0) ← 1
    pprob(i, 1) ← prob(i)
    pprob(i, ℓ) ← prob(i) × Σ_{j, (i,j) ∈ NFA} pprob(j, ℓ − 1)    (ℓ > 1)

This takes O(Lm²) = O(min(m³, m²w)) time. At the same time we compute the mark and markf
masks, at a total cost of O(min(m³, m⁴/w)):

    mark(i, 0) ← 0^(m−i) 1 0^i
    markf(i, 0) ← mark(i, 0) if i is a final state, else 0^(m+1)
    mark(i, ℓ) ← OR_{j, (i,j) ∈ NFA} mark(j, ℓ − 1)
    markf(i, ℓ) ← OR_{j, (i,j) ∈ NFA} markf(j, ℓ − 1)

for increasing ℓ values. Once pprob is computed, we build mprob(0..m, 0..L, 0..L), so that
mprob(i, ℓ, j) is the probability of any path of length ℓ inside the area mark(i, j). This is computed
in O(L²m²) = O(min(m⁴, m²w²)) time, with the formulas

    mprob(i, 0, j) ← 1
    mprob(i, ℓ, 0) ← 0
    mprob(i, ℓ, j) ← 1 − (1 − pprob(i, ℓ)) × Π_{j′, (i,j′) ∈ NFA} (1 − mprob(j′, ℓ, j − 1))

Finally, the search cost is computed as always, using mprob, in O(L²m) = O(min(m³, mw²))
time.

Again, it is possible that the selected scanning subpattern is in fact a simpler type of pattern,
such as a simple or extended one. In this case we use the appropriate scanning procedure, which
should be faster.

Another nontrivial possibility is that even the best necessary factor is too bad, i.e., it has a
high predicted cost per character. In this case we select, starting from the initial state of the automaton, a
subset of it that fits in the computer word (i.e., at most w states), intended for forward scanning.
Since all the occurrences found with this automaton will have to be checked, we would like to
select the subautomaton that finds the fewest possible spurious matches. If m ≤ w then we use
the whole automaton; otherwise we try to minimize the probability of getting outside the set of
selected states. To compute this, we consider that the probability of reaching a state is inversely
proportional to the length of its shortest path from the initial state. Hence we add states farther and farther
from the root until we have w states. All this takes O(m²) time.

This probabilistic assumption is of course simplistic, but an optimal solution is quite hard, and
at this point we know that the search will be costly anyway.

Verification

A final nontrivial problem is how to determine which states should be present in the forward and
backward verification automata once the best scanning subautomaton has been selected. Since we have
chosen a necessary factor, we know for sure that the automaton is cut in two parts by the
necessary factor. If we have chosen backward scanning, then the verifications start from the
initial states of the scanning automaton; otherwise they start from its final states.

The next figure illustrates these automata for the pattern abcd(d|e?f)de, where we have
selected d(d|e?f) as our scanning subexpression. This corresponds to the states {4, 5, 6, 7}
of the original automaton. The selected arrows and states are in boldface in the figure; note that
arrows leaving the selected subautomaton are not selected. Depending on whether the selected
subautomaton is searched with backward or forward scanning, we know the text position where
its initial or final states, respectively, were reached. Hence, in backward scanning the verification
starts from the initial states of the subautomaton, while it starts from the final states in the case of
forward scanning. We need two full automata: the original one and one with the arrows reversed.

[Figure omitted: two diagrams (backward and forward scanning) showing the scanning
subautomaton together with the backward verification automaton (reverse arrows) and the
forward verification automaton, over states 0 to 9.]

Figure: Forward and backward verification automata corresponding to forward and backward
scanning of the automaton for abcd(d|e?f)de. The shaded states are the initial ones for the
verification. In practice we use the complete automaton, because cutting off the exact verification
subautomaton is too complicated.

There is, however, a complication when verifying a regular expression in this scheme. Assume
the search pattern is AXB|CYD, for strings A, B, C, D, X and Y. The algorithm could select
X|Y as the scanning subpattern. Now, in a text AXD, the backward verification for A|C would
find A and the forward verification for B|D would find D, and therefore an occurrence would be
incorrectly triggered. Worse than that, it could be the case that X = Y, so there is no hope of
distinguishing what to check based on the initial states of the scanning automaton.

The problem, which does not appear in simpler types of patterns, is that there is a set of initial
states for the verification. The backward and forward verifications tell us that some state of the
set can be extended to a complete occurrence, but there is no guarantee that there is a single state
in that set that can be extended in both directions. To overcome this problem, we check the initial
states one by one, instead of using a set of initial states and doing just one verification.
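The danger can be made concrete with a small brute-force sketch (the strings A, B, C, D, X, Y below are arbitrary placeholders, and plain string tests stand in for the verification automata). Checking "some backward extension exists and some forward extension exists" accepts text that mixes the two branches, while checking each initial state (branch) separately does not:

```python
# Pattern (AXB|CYD); the scanning subpattern (X|Y) was found at text[i:i+2].
A, X, B = "aa", "xx", "bb"
C, Y, D = "cc", "yy", "dd"

def set_based(text, i):
    # Wrong: pools the verification results of both branches.
    back = text.endswith((A, C), 0, i)       # some left context matches
    fwd = text.startswith((B, D), i + 2)     # some right context matches
    return back and fwd

def one_by_one(text, i):
    # Right: each branch (initial state) is verified independently.
    return (text.endswith(A, 0, i) and text.startswith(B, i + 2)) or \
           (text.endswith(C, 0, i) and text.startswith(D, i + 2))

bad = "aaxxdd"    # A, then X, then D: not an occurrence of (AXB|CYD)
good = "aaxxbb"
assert set_based(bad, 2) and not one_by_one(bad, 2)   # false positive avoided
assert one_by_one(good, 2)
```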

Approximate Pattern Matching

All the previous algorithms permit specifying a flexible search pattern, but they do not allow any
difference between the pattern specified and its occurrence in the text. We now consider the problem
of permitting at most k insertions, deletions, replacements or transpositions in the pattern.

We let the user specify that only a subset of the four allowed error types is permitted. However,
designing one different search algorithm for each subset was impractical and against the spirit of
a uniform software. So we have considered that our criterion of permitting these four types of
errors would be the most commonly preferred in practice, and we have a unique scanning phase under
this model (as in all the previous algorithms, we have a scanning and a verification phase). Only at
the verification phase, which is hopefully executed few times, do we take care of applying only the
permitted operations. Note that we cannot miss any potential match, because we scan with the
most permissive error model. We also wrote in nrgrep specialized code for k = 1 and k = 2, which
are the most common cases, since using a fixed k permits better register usage.

Simple Patterns

We make use of the three techniques described before, choosing the one that promises to
be the best.

Forward and Backward Searching

In forward searching, k + 1 similar rows corresponding to the pattern are used. There exists a
bit-parallel algorithm to simulate the automaton in the case of k insertions, deletions and replacements
but, despite that the automaton that incorporates transpositions has been depicted,
no bit-parallel formula for the complete operation has been shown. We do that now.

We store each row in a machine word R₀ … R_k, just as for the base technique. The temporary
states are stored as T₁ … T_k. The bitmasks R_i are initialized to 0^(m−i) 1^i as before, while all the T_i
are initialized to 0^m. The update procedure to produce R′ and T′ upon reading text character t_j
is as follows:

    R′₀ ← ((R₀ << 1) | 0^(m−1) 1) & B[t_j]
    for i ∈ 1 … k do
        R′_i ← ((R_i << 1) & B[t_j]) | R_{i−1} | ((R_{i−1} | R′_{i−1}) << 1)
               | (T_i & (B[t_j] << 1))
        T′_i ← (((R_{i−1} << 1) | 0^(m−1) 1) << 1) & B[t_j]

The rationale of the procedure is as follows. The first lines are as for the base technique
without transpositions. The last term of the formula for R′_i corresponds to transpositions. Note
that the new T value is computed according to the old R value, so it is one text position behind.
Once we compute the new bitmasks R′ for text character t_j, we take those old R_{i−1} masks that were
not updated with t_j, shift them by two positions, and kill the states
that do not match t_j. This is equivalent to processing two characters, with t_j at the second position. At the next
iteration (t_{j+1}) we shift left the mask B[t_{j+1}] and kill the states of T that do not match. The net
effect is that at iteration j + 1 we are or-ing R′_i with T_i & (B[t_{j+1}] << 1), whose main term is

    (R_{i−1} << 2) & (B[t_{j+1}] << 1) & B[t_j]

which corresponds to two forward transitions with the characters t_{j+1} t_j, in that order. If those characters match
the text in reversed order, then we permit the activation of R′_i.
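The update can be exercised with the following self-contained sketch. It is our own transcription of the formulas as reconstructed above (bit a of R_i meaning "the pattern prefix of length a + 1 matches with at most i errors"), so the function name and low-level details are ours, not nrgrep's:

```python
from collections import defaultdict

def search_tr(pattern, text, k):
    """Ending positions of occurrences of `pattern` with at most k
    insertions, deletions, replacements or transpositions."""
    m = len(pattern)
    B = defaultdict(int)                      # B[c]: positions of c in pattern
    for a, c in enumerate(pattern):
        B[c] |= 1 << a
    R = [(1 << i) - 1 for i in range(k + 1)]  # rows 0..k, 0^(m-i) 1^i
    T = [0] * (k + 1)                         # temporary transposition states
    ends = []
    for j, c in enumerate(text):
        Bc = B[c]
        nR = [((R[0] << 1) | 1) & Bc]         # row 0: exact matching
        nT = [0]
        for i in range(1, k + 1):
            nR.append((((R[i] << 1) | 1) & Bc)          # match
                      | R[i - 1]                        # insertion
                      | ((R[i - 1] | nR[i - 1]) << 1)   # replacement, deletion
                      | (T[i] & (Bc << 1))              # finish a transposition
                      | ((1 << i) - 1))   # first i prefixes always match row i
            nT.append((((R[i - 1] << 1) | 1) << 1) & Bc)  # start transposition
        R, T = nR, nT
        if R[k] & (1 << (m - 1)):
            ends.append(j)
    return ends

assert search_tr("abcd", "abcd", 0) == [3]
assert 4 in search_tr("abcd", "xbacdx", 1)    # "bacd": one transposition
assert search_tr("abcd", "xbacdx", 0) == []
```

As in the text, every reported position is only a candidate: the verification phase still has to apply the context conditions and the error operations that the user actually permitted.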

A match is detected, as before, when R_k & 1 0^(m−1) is not zero. Since we may have cut the pattern
to its first w characters, and because of the context conditions and the possibility that the user really wanted to permit
only some types of errors, we have to check each match found by the automaton before reporting
it. We detail the verification procedure later.

Backward searching is adapted from this technique exactly as it is done from the basic algorithm
without transpositions. A subtle point is that we cannot consider the automaton dead until both
the R and the T masks run out of active states, since T can awake a dead R.

As before, we select the best subpattern to search for, which is of length at most w. The
algorithm to select the best subpattern has to account for the fact that we are allowing errors. A
good approximation is obtained by considering that any factor will be alive for the first k window characters, and then
adding the normal expected number of window characters to read until the factor does not match.
So we add k to cost(i, j). Now it is more likely than before (for large k/m) that the
final cost per character is greater than one, in which case we prefer forward scanning.

Splitting into k + 1 Subpatterns

This technique can be directly adapted from previous work, taking care of leaving a hole between
each pair of pieces. For the checking of complete occurrences in candidate text areas we use the
general verification engine; we have a different preprocessing for the case of each of the k + 1 pieces
matching. Note that it makes no sense to design a forward search algorithm for this case: if the
average cost per character is more than one, this means that the probability of finding a subpattern
is so high that we would pay too much time verifying spurious matches, in which case the whole
method does not work.
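The overall filtering idea, including the one-character hole that transpositions require between pieces, can be sketched in a few lines. The exact search and the verification below are deliberately naive stand-ins (a Python scan and a dynamic-programming Damerau distance) for nrgrep's bit-parallel machinery:

```python
def damerau(a, b):
    """Edit distance with insertions, deletions, replacements and
    transpositions of adjacent characters (restricted Damerau distance)."""
    d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
            if i > 1 < j and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

def split_search(pattern, text, k):
    m = len(pattern)
    L = (m - k) // (k + 1)          # piece length; one hole between pieces
    pieces = [(p * (L + 1), pattern[p * (L + 1):p * (L + 1) + L])
              for p in range(k + 1)]
    hits = set()
    for off, piece in pieces:        # some piece must appear unaltered
        p = text.find(piece)
        while p != -1:               # verify the candidate area around it
            for st in range(max(0, p - off - k), p - off + k + 1):
                for ln in range(m - k, m + k + 1):
                    if damerau(pattern, text[st:st + ln]) <= k:
                        hits.add(st)
            p = text.find(piece, p + 1)
    return sorted(hits)

assert 2 in split_search("abcdefgh", "zzabcdefhgzz", 1)   # "hg" transposed
```

Note how the transposed occurrence is caught through the piece "abc", which remains unaltered: with k errors, at least one of the k + 1 pieces must appear exactly.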

The main difficulty that remains is how to find the best set of k + 1 equal-length pieces to search
for. We start by computing prob and pprob as before. The only difference is that now we
are interested only in lengths up to L′ = ⌊(m − k)/(k + 1)⌋, which is the maximum length of a piece²
(the m − k comes out because there are k unused characters in the partition). This cuts down the
cost of computing these vectors to O(m²/k).

²The numerator is in fact converted into m if no transpositions are permitted. Despite that the scanning phase will
allow transpositions anyway, the possible matches missed by converting the numerator into m all include transpositions,
which by hypothesis are not permitted.

Now we compute pcost(1..m), where pcost(i) gives the average cost per window
when searching for the factor P_{i…i+ℓ−1}. For this sake we compute, for each i value, the matrix
mprob(0..L′, 0..L′), where mprob(ℓ, r) is the probability of any factor of length r in P_{i…i+ℓ−1}.
This is computed, for each i, in O(m²/k²) time as

    ∀r > 0 :  mprob(0, r) ← 0
    mprob(ℓ, 0) ← 1
    ∀ℓ > 0, r > 0 :  mprob(ℓ, r) ← pprob(i + ℓ − r, r) + (1 − pprob(i + ℓ − r, r)) × mprob(ℓ − 1, r)

    pcost(i) ← Σ_{r=0}^{L′} mprob(ℓ, r)

All this is computed, for every i, for increasing ℓ, in O(m³/k²) total time. The justification
for the previous formulas is similar to that for simple patterns. Now the most interesting part is
mbest(0..m, 0..k + 1), where mbest(i, s) is the expected cost per character of the best s pattern
pieces starting at pattern position i or later. The strings selected have the same length. Together with
mbest we have ibest(i, s), which tells where the first string must start in order to obtain mbest(i, s).

The maximum length for the pieces is L′. We try all the lengths ℓ, from L′ down to 1, until we determine
that even in the best case we cannot improve our current best cost. For each possible piece length ℓ
we compute the whole mbest and ibest as follows:

    mbest(i, 0) ← 0
    ∀i > m − s(ℓ + 1) + 1, s > 0 :  mbest(i, s) ← +∞
    ∀i ≤ m − s(ℓ + 1) + 1, s > 0 :  mbest(i, s) ← min(mbest(i + 1, s),
                                                      cost ⊕ mbest(i + ℓ + 1, s − 1))

    where cost ← pcost(i) and x ⊕ y = x + y − xy

which is computed for decreasing i. The rationale is that mbest(i, s) can choose whether to start
the first piece at i or not. The second case is easy, since mbest(i + 1, s) is already computed,
and ibest(i, s) is made equal to ibest(i + 1, s). In the first case, the cost of the first
piece is cost and the other s − 1 pieces are chosen in the best way from P_{i+ℓ+1…m}, according to
mbest(i + ℓ + 1, s − 1). In this case ibest(i, s) ← i. Note that, as the total cost to search for the k + 1
pieces, we could take the maximum cost, but we obtained better results by using the model of the
probability of the union of independent events.

All this costs O(m²) in the worst case, but normally the longest pieces are the best and the
cost becomes O(mk). At the end we have the best length ℓ for the pieces, the expected cost per
character in mbest(1, k + 1), and the optimal initial positions for the pieces: i₀ ← ibest(1, k + 1),
i₁ ← ibest(i₀ + ℓ + 1, k), i₂ ← ibest(i₁ + ℓ + 1, k − 1), and so on. Finally, if mbest(1, k + 1) is
larger than one (actually, larger than a smaller threshold), we know that we reach the window beginning with
too high a probability, and therefore the whole scheme will not work well. In this case we choose between
forward and backward searching.

Summarizing the costs, we pay O(m³/k² + m²) in the worst case and O(m³/k³ + m²) in practice.
Normally k is small, so the cost is close to O(m³) in all cases.

Verification

Finally, we explain the verification procedure. This is necessary because of the context conditions,
because of the possibly restricted set of edit operations, and because in some cases we are not sure that
there is actually a match. Verification can be called from the forward scanning, in which case in
general we have partitioned the pattern as P = P₁ S P₂ and know that at a given text position
a match of S ends; from the backward scanning, where we have partitioned the pattern in the
same way and know that at a given text position the match of a factor of S begins; or from the
partitioning into k + 1 pieces, in which case we have partitioned P = P₀ S₁ P₁ … S_{k+1} P_{k+1} and
know that at a given text position the exact occurrence of a given S_i begins.

In general, all we know about the possible match of P around text position j is that if we
partition P = P_L P_R (a partition that we know), then there should be a match of P_L ending at T_j
and a match of P_R starting at T_{j+1}, and that the total number of errors should not exceed k. In
all the cases we have precomputed forward automata corresponding to P_L and P_R (the first one is
reversed, because its verification goes backward). In forward and backward scanning there is just
one choice for P_L and P_R, while for the partitioning into k + 1 pieces there are k + 1 choices; all of
them are precomputed.

We run the forward and backward verifications from text character j, in both directions. Fortunately,
the context conditions that make a match valid or invalid can be checked at each extreme
separately. So we go backward and, among all the positions where a legal match of P_L can begin,
we choose the one with minimum error level k_L. Then we go forward looking for legal occurrences
of P_R with k − k_L errors. If we find one, then an occurrence is detected and the whole record is
reported. In fact, we take advantage of each match found during the traversal: if we are looking for
the pattern with k errors and find a legal endpoint with k′ ≤ k errors, we still continue searching for
better occurrences, but now allowing just k′ − 1 errors. This saves time, because the verification
automaton then needs to use just k′ rows.
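A brute-force rendition of this two-sided check may clarify the flow. Ordinary dynamic programming and plain edit distance replace the bit-parallel automata and the restricted-operation logic, so this is only a sketch of the control structure:

```python
def edit(a, b):
    """Classic edit distance (insertions, deletions, replacements)."""
    d = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, cb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (ca != cb))
    return d[-1]

def verify(PL, PR, text, j, k):
    """Is there an occurrence of PL.PR with <= k errors whose PL part ends
    exactly at text position j (0-based, inclusive)?"""
    # Backward part: minimum error level among all start positions of PL.
    kL = min(edit(PL, text[s:j + 1]) for s in range(j + 2))
    if kL > k:
        return False
    # Forward part: PR starts at j + 1, with the k - kL remaining errors.
    return any(edit(PR, text[j + 1:e]) <= k - kL
               for e in range(j + 1, len(text) + 1))

# P = "abcdef" split as PL = "abc", PR = "def"; a 'z' slips into the right half.
assert verify("abc", "def", "zzabczdefzz", 4, 1)
assert not verify("abc", "def", "zzabczdefzz", 4, 0)
```

The early choice of the minimal k_L is what lets the forward pass run with fewer rows, as described in the text.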

A complicated condition can arise because of transpositions: the optimal solution may involve
transposing T_j with T_{j+1}, an operation that we are not permitting because we chose to split the
verification there. We solve this in a rather simple but effective way: if we cannot find an occurrence
in a candidate area, we give it a second chance after transposing both characters in the text.

Extended Patterns

The treatment of extended patterns is largely a combination of the extension from simple to extended
patterns without errors and the use of k + 1 rows, or the partition into k + 1 pieces, in order to
permit k errors. However, there are a few complications that deserve mention.

A first one is that the propagation of active states due to optional and repeatable characters
does not mix well with transpositions (for example, try to draw an automaton that finds "abc?de"
with one transposition in the text "adbe"). We solved the problem in a rather practical way.

Instead of trying to simulate a complex automaton, we have two parallel automata. The R masks
are used in the normal way, without permitting transpositions but permitting the other errors and
the optional and repeatable characters, and they are always one character behind the current text
position. Each T mask is obtained from the R mask of the previous row by processing the last two
text characters in reverse order, and its result is used for the R mask of the same row in the next
iteration. The code to process text position j follows (the initial values are as always):

    R′₀ ← ((R₀ << 1) | 0^(m−1) 1) & B[t_{j−1}]
    for i ∈ 1 … k do
        R′_i ← ((R_i << 1) & B[t_{j−1}]) | (R_i & S[t_{j−1}])
               | R_{i−1} | ((R_{i−1} | R′_{i−1}) << 1) | T_i
        Df ← R′_i | F
        R′_i ← R′_i | (A & ((∼(Df − I)) ∧ Df))
        T_i ← (((R_{i−1} << 1) | 0^(m−1) 1) & B[t_j]) | (R_{i−1} & S[t_j])
        Df ← T_i | F
        T_i ← T_i | (A & ((∼(Df − I)) ∧ Df))
        T_i ← ((T_i << 1) & B[t_{j−1}]) | (T_i & S[t_{j−1}])

where the A, I and F masks are defined as for extended patterns without errors.

A second complication is that the subpattern optimization changes. For the forward and backward
automata we use the same technique as for searching without errors, but we add k to the number of
processed window positions for any subpattern. So it remains to be explained how the optimization
for the partitioning into k + 1 subpatterns works.

We have prob and sprob, computed in O(m) time as for normal extended patterns. A first
condition is that the k + 1 pieces must be disjoint, so we compute reach(0..m, 0..L′), where
L′ = ⌊(L − k)/(k + 1)⌋ is the maximum length of a piece as before (L is the minimum length of
a pattern occurrence), and reach(i, ℓ) gives the last pattern position reachable in ℓ
text characters from i. This is computed in O(m²) time as reach(i, 0) ← i and, for increasing ℓ,

    t ← reach(i, ℓ − 1) + 1
    while t < m ∧ (A & 0^(m−t) 1 0^(t−1) ≠ 0^m) do t ← t + 1
    reach(i, ℓ) ← t
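The reach table can be transcribed directly from the pseudocode just given. The tiny pattern and the optional-position mask A below are our own illustration (bit t of A set iff pattern position t is optional; bit 0 unused):

```python
def compute_reach(m, A, Lp):
    """reach[i][l]: last pattern position reachable from position i after
    consuming l text characters, skipping optional positions (bits of A)."""
    reach = [[i] + [0] * Lp for i in range(m + 1)]
    for i in range(m + 1):
        for l in range(1, Lp + 1):
            t = reach[i][l - 1] + 1
            while t < m and (A >> t) & 1:   # position t is optional: go past it
                t += 1
            reach[i][l] = min(t, m)
    return reach

# Pattern "ab?c": position 2 (the 'b') is optional, so bit 2 of A is set.
m, A = 3, 0b100
r = compute_reach(m, A, 2)
assert r[1][1] == 3   # from position 1, one character may skip 'b?' and hit 'c'
assert r[0][1] == 1   # from position 0, one character reaches position 1 ('a')
assert r[0][2] == 3
```

Because a consumed character may match past a run of optional positions, reach grows faster than one position per character, which is exactly why disjointness of the k + 1 pieces has to be checked through this table.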

3

We then compute pprob, mprob and pcost exactly as before, in O(m³) time. Finally, the
selection of the best set of k + 1 subpatterns is done with mbest and ibest just as before,
the only difference being that reach is used to determine the first unused position if pattern
position i is chosen and a minimum piece length has to be reserved for it. The total process takes
O(m³) time.

Regular Expressions

Finally, nrgrep permits searching for regular expressions allowing errors. The same mechanisms used
for simple and extended patterns are applied, namely using k + 1 replicas of the search automaton or
splitting the pattern into k + 1 disjoint pieces. Both adaptations present important complications
with respect to their simpler counterparts.

Forward and Backward Search

One problem with regular expressions is that the concept of "forward" does not immediately mean
"one shift to the right". Approximate searching with k + 1 copies of the NFA of the regular expression
implies being able to move "forward" from row i to row i + 1, as the following figure shows.

[Figure omitted: three copies of the Glushkov NFA, for 0, 1 and 2 errors, connected by the
error arrows.]

Figure: Glushkov's resulting NFAs for the search of the regular expression abcd(d|e?f)de
with two insertions, deletions or replacements. To simplify the plot, the dashed lines represent
deletions and replacements (i.e., they move by ε or Σ), while the vertical lines represent insertions
(i.e., they move by Σ).

For this sake we use our table T³, which for each state D gives the bit mask of all the states
reachable from D in one step. The update formula upon reading a new text character t_j is therefore

    R′₀ ← T(R₀) & B[t_j]
    oldR₀ ← R₀
    for i ∈ 1 … k do
        R′_i ← (T(R_i) & B[t_j]) | R_{i−1} | T(R_{i−1} | R′_{i−1})
               | (T(T(oldR_{i−1}) & B[t_j]) & B[t_{j−1}])
        oldR_{i−1} ← R_{i−1}

The rationale is as follows. R′₀ is computed according to the simple formula for regular expression
searching without errors. For i > 0, R′_i permits arrows coming from matching the current
character (T(R_i) & B[t_j]), vertical arrows representing insertions in the pattern (R_{i−1}), and
diagonal arrows representing replacements (from R_{i−1}) and deletions (from R′_{i−1}), which are joined
in T(R_{i−1} | R′_{i−1}). Finally, transpositions are arranged in a rather simple way. In oldR we store the
value of R two positions in the past (note the way we update it, to avoid having two arrays oldR
and oldoldR), and each time we permit, from that state, the processing of the last two text characters
in reverse order.

Forward scanning, backward scanning and the verification of occurrences are carried out in
the normal way using this update technique. The technique to select the best necessary factor is
unaltered, except that we add k to the number of characters that are scanned inside every text
window.

³Recall that, in the deterministic simulation, each bit mask D of active NFA states is identified as a state.

Splitting into k + 1 Subexpressions

If we can select k + 1 necessary factors of the regular expression, as done before for selecting
one necessary factor, then we can be sure that at least one of them will appear unaltered in any
occurrence with k errors or less. As before, in order to account for transpositions, we need to ensure
that one character is left between consecutive necessary factors.

We first consider how, once the subexpressions have been selected, we can perform the multipattern
search. Each subexpression has a set of initial and final states. We reverse all the arrows
and convert the formerly initial states of all the subexpressions into final states. Since we search
for any reverse factor of the regular expression, all the states are made initial.

We set the window length to the shortest path from an initial to a final state. At each window
position we read the text characters backward and feed the transformed automaton. Each time
we arrive at a final state, we know that the prefix of a necessary subexpression has appeared. If we
happen to be at the beginning of the window, then we check for the whole pattern as before. We
keep masks with the final states of each necessary factor, in order to determine, from the current
mask of active states, which of them matched (recall that a different verification is triggered for
each one).

The next figure illustrates a possible selection of two necessary factors, corresponding to k = 1, for
our running example abcd(d|e?f)de. These are "abc" and "(d|e?f)de", where the first
"d" has been left out in order to separate both necessary factors. Their minimum length is 3, so this is the
window length to use.

[Figure omitted: the two selected subautomata with their arrows reversed and all states made
initial.]

Figure: An automaton recognizing reverse prefixes of two necessary factors of
abcd(d|e?f)de, based on the Glushkov construction of the previous figure. Note that there are no
transitions among the subexpressions.

The difficult part is how to select k + 1 necessary and disjoint factors from a regular expression.
Disjoint means that the subautomata do not share states; this ensures that there is a character
separating them, which is necessary for transpositions. Moreover, we want the best set, and we
want all its members to have the same minimum path length from an initial to a final state.

We believe that an algorithm finding the optimal choice has a time cost that grows exponentially
with k. We have therefore made the following simplification. Each node s is assigned a
number I_s, which corresponds to the length of the shortest path reaching it from the initial state.
We do not permit picking arbitrary necessary factors, but only those subautomata formed by all
the states s whose numbers satisfy i ≤ I_s < i + ℓ, for some i and ℓ. This ℓ is the minimum window
length to search using that subautomaton, and it should be the same for all the necessary factors
chosen. Moreover, every arrow leaving the chosen subautomaton is dropped. Finally, note
that if the i values corresponding to any pair of chosen subautomata differ by more than ℓ, then we
are sure that they are disjoint.

The figure below illustrates this numbering for our running example. The partition obtained in the previous
figure corresponds to choosing the ranges [1, 3] and [5, 7] of I_s values as the sets of states; this corresponds to
the sets {1, 2, 3} and {5, 6, 7, 8, 9}. Of course, the method does not permit us to choose all the
legal sets of subautomata. In this example, a necessary subautomaton that includes some, but not
all, of the states numbered 5 cannot be picked.

[Figure omitted: the Glushkov NFA with each state annotated with its I_s value.]

Figure: Numbering of the Glushkov NFA states for the regular expression abcd(d|e?f)de.
States 0 to 9 receive the I_s values 0, 1, 2, 3, 4, 5, 5, 5, 6, 7, respectively.

Numbering the states is easily done by a closure process starting from the initial state, in
O(Lm²/w) time, where L is the minimum length of a string matching the whole regular expression.
As before, L′ = ⌊(L − k)/(k + 1)⌋ is the maximum possible length of a string matching a necessary
factor.
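The numbering is a single-source shortest-path computation on the Glushkov NFA, so a breadth-first closure suffices. Below we rebuild the automaton of the running example as we reconstructed it (regular expression abcd(d|e?f)de, states 0 to 9; the adjacency list is our own transcription of the figure) and recompute I_s:

```python
from collections import deque

# Glushkov NFA of abcd(d|e?f)de: state s is entered by the s-th symbol.
# Symbols: 1:a 2:b 3:c 4:d 5:d 6:e 7:f 8:d 9:e.
nfa = {0: [1], 1: [2], 2: [3], 3: [4],
       4: [5, 6, 7],   # d, e, or f (the optional e may be skipped)
       5: [8], 6: [7], 7: [8], 8: [9], 9: []}

def numbering(nfa, initial=0):
    """I_s for every state: shortest path length from the initial state."""
    I = {initial: 0}
    q = deque([initial])
    while q:
        s = q.popleft()
        for t in nfa[s]:
            if t not in I:            # first visit = shortest distance (BFS)
                I[t] = I[s] + 1
                q.append(t)
    return [I[s] for s in sorted(nfa)]

assert numbering(nfa) == [0, 1, 2, 3, 4, 5, 5, 5, 6, 7]
```

Ranges of equal I_s values then delimit the candidate subautomata: here [1, 3] yields {1, 2, 3} and [5, 7] yields {5, 6, 7, 8, 9}, the two necessary factors used above.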

To determine the best factors, we first compute prob, pprob, mprob and cost exactly as
before. The only difference is that now we are interested only in starting from sets of initial
states with the same I_s number (there are L possible choices), and we are interested in analyzing
factors of length no more than L′ = O(L/k). This lowers the cost of computing the previous arrays
to O(L · L′ · m²) = O(L²m²/k). We also make sure that we do not select subautomata that need
more than w bits to be represented.

Once we have the exp ected cost of cho osing any p ossible I value and any p ossible minimum

s

factor length we can apply the optimization algorithm of Section since we have in fact

linearized the problem thanks to our simplication the optimization problem is similar to when

we had a single pattern and needed to extract the b est set of k disjoint factors from it of a

2

single length This takes O L in the worst case but should in practice b e closer to O k L

4

Hence the total optimization algorithm needs O m time at worst

A Pattern Matching Software

Finally, we are in position to describe the software nrgrep. Nrgrep has been developed in ANSI C and tested on Linux and Solaris platforms. Its source code is publicly available under a Gnu license from http://www.dcc.uchile.cl/~gnavarro/pubcode. We discuss now its main aspects.

Usage and Options

Nrgrep receives on the command line a pattern and a list of files to search. The syntax of the pattern was given earlier. If no options are given, nrgrep searches the files in the order given and prints all the lines where it finds the pattern. If more than one file is given, then the file name is printed prior to the lines that have matched inside it, if there are any. If no file names are given, it receives the text from the standard input.

The default behavior of nrgrep can be modified with a set of possible options which, prefixed by the minus sign, can precede the pattern specification. Most of them are inspired by agrep:

-i : the search is case insensitive.

-w : only whole words matching the pattern are accepted.

-x : only whole records (e.g., lines) matching the pattern are accepted.

-c : just counts the matches, does not print them.

-l : outputs the names of the files containing matches, but not the matches themselves.

-G : outputs the whole contents of the files containing matches.

-h : does not output file names.

-n : prints records preceded by their record number.

-v : reverse matching, reports the records that do not match.

-d <delim> : sets the record delimiter to delim, which is "\n" by default. A special terminator at the end of the delimiter makes it appear as a part of the previous record (the default is to attach it to the next one); by default the end of line is the record delimiter and is considered a part of the previous record.

-b <bufsize> : sets the buffer size. This affects the efficiency and the possibility of cutting very long records that do not fit in a buffer.

-s <sep> : prints the string sep between each pair of records output.

-k <err>[idst] : allows up to err errors in the matches. If "idst" is not present, the errors permitted are insertions, deletions, substitutions and transpositions; otherwise a subset of them can be specified (e.g., "ids").

-L : takes the pattern literally (no special characters).

-H : explains the usage and exits.

We now discuss briefly how each of these options is implemented. -i is easily carried out by using classes of characters; -w and -x are handled using context conditions; -c, -l, -G, -h, -b and -s are easily handled, but some changes are necessary, such as stopping the search when the first match is found for -l and -G; -n and -v are more complicated, because they force all the records to be processed, so we first search for the record delimiters and then search for the pattern inside the records (normally we search for the pattern and find record delimiters around occurrences only); -d is arranged by just changing the record delimiter (it has to be a simple pattern, but classes of characters are permitted); -k switches to approximate searching, and the "idst" flags are considered at verification time only; and -L avoids the normal parsing process and considers the pattern as a simple string.
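The record-first processing order used for -n and -v can be sketched as follows. This is an illustrative fragment with our own names, and a toy matcher standing in for nrgrep's scanning function:

```c
/* Sketch of the -n / -v processing order: split the buffer into records
   at the delimiter first, then run the search inside each record
   (instead of searching first and locating delimiters only around
   occurrences, as in the normal mode). */
typedef int (*matcher)(const char *rec, long len);

static int has_a(const char *rec, long len) {      /* toy matcher */
    long i;
    for (i = 0; i < len; i++) if (rec[i] == 'a') return 1;
    return 0;
}

/* Counts the records that match (or, if invert is nonzero, as with -v,
   the records that do not match). */
long count_records(const char *buf, long n, char delim,
                   matcher match, int invert) {
    long count = 0, start = 0, i;
    for (i = 0; i <= n; i++)
        if (i == n || buf[i] == delim) {           /* a record ends here */
            int m = match(buf + start, i - start);
            if (invert ? !m : m) count++;
            start = i + 1;
        }
    return count;
}
```

With invert set, the very same loop implements reverse matching, which is why -v forces every record to be processed.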

Of course, it is permitted to combine the options in a single string preceded by the minus sign, or as a sequence of strings, each preceded by the minus sign. Some combinations, however, make no sense and are automatically overridden (a warning message is issued): filenames are not printed if the text comes from the standard input; -c and -G are not compatible, and -c dominates; -n and -l are not compatible, and -l dominates; -l and -G are not compatible, and -G dominates; and -G is ignored when working on the standard input (since -G works by printing the whole file from the shell).

Parsing the Pattern

One important aspect that we have not discussed is how the parsing is done. In principle, we just have to parse a regular expression. However, our parser module carries out some other tasks, such as:

Parsing our extended syntax: some of our operations, such as the extra operators and classes of characters, are not part of the classical regular expression syntax. The result of the parsing is a syntax tree, which is not discarded, since as we have seen it is useful later for preprocessing regular expressions.

Mapping the pattern to bit mask positions: the parser determines the number of states required by the pattern and assigns the bit positions corresponding to each part of the pattern specification.

Determining the type of subexpression: given the whole pattern, or a subset of its states, the parser is able to determine which type of expression is involved: simple, extended, or regular expression.

Algebraic simplification: the parser performs some algebraic simplification on the pattern, to optimize the search and reduce the number of bits needed for it.

The parsing phase operates in a top-down way. It first tries to parse an "or" of subexpressions. Each of them is parsed as a concatenation of subexpressions, each of which is a single expression finished with a sequence of "*", "+" or "?" terminators; and the single expression is either a single symbol, a class of characters, or a top-level expression in parentheses. Apart from the syntax tree, the parser produces a mapping from positions of the pattern string to leaves of the syntax tree, meaning that the character or class described at that pattern position must be loaded at that tree leaf.

The second step is the algebraic simplification, which proceeds bottom-up and is able to enforce the following rules at any level:

If C1 and C2 are classes of characters, then C1 | C2 → C1 ∪ C2.

If C is a class, then C | ε → C?.

If E is any subexpression, then (E?)? → E?.

If E is any subexpression, then (E*)* → E*, (E?)* → E*, (E*)? → E*, (E+)* → E*, (E*)+ → E*, and (E+)? → E*.

A subexpression that can match the empty string and appears at the beginning or at the end of the pattern is replaced by ε, except when we need to match whole words or records.

To collapse classes of characters into a single node (first rule), we traverse the obtained mapping from pattern positions to tree leaves, find those mapping to the leaves that store C1 and C2, and make all of them point to the new leaf that represents C1 ∪ C2.

More simplifications are possible, but they are more complicated, and we chose to stop here for the current version of nrgrep. Some interesting simplifications that may reduce the number of bits required to represent the pattern are the factoring of common subexpressions, as in xwz | uwz → (x|u)wz, as well as EE* → E+ and E* | E → E*, for arbitrarily complex E (right now we do this just for leaves).
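As an illustration of the class-collapsing rule, representing a class of characters as a 256-bit membership set makes the union C1 ∪ C2 a word-wise "or". The types and names below are our own, not nrgrep's:

```c
/* Illustration of the first simplification rule: two class-of-characters
   leaves joined by "|" collapse into one leaf holding the set union.
   A class is a 256-bit membership table (8 x 32 bits). */
typedef struct { unsigned bits[8]; } charclass;

void class_add(charclass *c, unsigned char ch) {
    c->bits[ch >> 5] |= 1u << (ch & 31);           /* set bit ch */
}

int class_has(const charclass *c, unsigned char ch) {
    return (c->bits[ch >> 5] >> (ch & 31)) & 1;    /* test bit ch */
}

charclass class_union(charclass c1, charclass c2) {
    charclass u;
    int i;
    for (i = 0; i < 8; i++)
        u.bits[i] = c1.bits[i] | c2.bits[i];       /* C1 | C2 -> C1 u C2 */
    return u;
}
```
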

After the simplification is done, we assign the positions in the bit mask corresponding to each character/class in the pattern. This is easily done by traversing the leaves left to right and assigning one bit to each leaf of the tree, except those storing ε instead of a class of characters. Note that this works as expected on simple and extended patterns as well. For several purposes, including Glushkov's NFA construction algorithm, we need to store which are the bit positions used inside every subtree, which of these correspond to initial and final states, and whether the empty string matches the subexpression. All this is easily done in the same recursive traversal over the tree.

Once we have determined the bits that correspond to each tree leaf and the mapping from pattern positions to tree leaves, we build a table of bit masks B, such that B[c] tells which pattern positions are activated by the character c. This B table can be directly used for simple and extended patterns, and it is the basis for building the DFA transitions of regular expressions.
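For a simple pattern, building B reduces to a few lines. The sketch below handles only plain characters; a class leaf would simply set the same bit in B[c] for every character c of the class:

```c
#include <string.h>

/* Sketch of the B table for a simple pattern: B[c] has bit i set iff
   pattern position i matches character c. */
void build_B(const char *pat, unsigned B[256]) {
    unsigned i;
    memset(B, 0, 256 * sizeof(unsigned));
    for (i = 0; pat[i] != '\0'; i++)
        B[(unsigned char)pat[i]] |= 1u << i;   /* one bit per position */
}
```
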

Finally, the parser is able to tell the type of the pattern that it has processed. This can be different from what the syntax used to express the pattern may suggest: for example, a pattern written with redundant operators and parentheses may turn out to be just the simple pattern "abcdefg", which is discovered by the simplification procedure. This is in our general spirit of not being misled by simple problems presented in a complicated way, which motivated us to select optimal subpatterns to search for. As a secondary effect of the simplifications, the number of bits required to represent the pattern may be reduced.

A second reason for determining the real type of a pattern is that the scanning subpattern selected can be simpler than the whole pattern, and hence it can admit a faster search algorithm. For this sake, we permit determining the type not only of the whole pattern, but also of a selected set of positions.

The algorithm for determining the type of a pattern or subpattern starts by assuming a simple pattern, until it finds evidence of an extended pattern or a regular expression. The algorithm enters recursively into the syntax tree of the pattern, avoiding entering subexpressions where no selected state is involved. Among those that have to be entered in order to reach the selected states, it answers "simple" for the leaves of the tree (characters, classes and ε), and in concatenations it takes the most complex type among the two sides. When reaching an internal "*", "+" or "?" node, it assumes "extended" if the subtree was just a single character or class; otherwise it assumes "regular expression". The latter is always assumed when an "or" that could not be simplified is found.
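The traversal just described can be sketched over a deliberately simplified model of the syntax tree (the node types and names are our own; CHAR stands for any character or class leaf, and STAR for any of the "*", "+" or "?" operators):

```c
/* Sketch of the type-detection rules: SIMPLE until evidence of EXTENDED
   (an operator over a single character/class) or REGULAR (anything more
   complex, e.g. an "|" that could not be simplified away). */
enum ptype { SIMPLE = 0, EXTENDED = 1, REGULAR = 2 };
enum nodeop { CHAR, CONCAT, STAR, OR };

struct node {
    enum nodeop op;
    struct node *left, *right;   /* right unused for CHAR and STAR */
};

enum ptype pattern_type(const struct node *n) {
    enum ptype l, r;
    switch (n->op) {
    case CHAR:
        return SIMPLE;                   /* leaf: character, class, epsilon */
    case CONCAT:
        l = pattern_type(n->left);
        r = pattern_type(n->right);
        return l > r ? l : r;            /* most complex of the two sides */
    case STAR:
        return n->left->op == CHAR       /* operator over a single leaf */
               ? EXTENDED : REGULAR;
    default:
        return REGULAR;                  /* an "or" forces a regular expr. */
    }
}
```
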

Software Structure

Nrgrep is implemented as a set of modules, which permits easy enrichment with new types of patterns. Each type of pattern (simple, extended and regular expression) has two modules to deal with the exact and the approximate case, which makes six modules: simple, esimple, extended, eextended, regular, eregular (the prefix "e" stands for "allowing errors"). The other modules are:

basics, options, except, memio : basic definitions, exception handling, and memory and I/O management functions.

bitmasks : handles operations on simple and multiword bit masks.

buffer : implements the buffering mechanism to read files.

record : takes care of context conditions and record management.

parser : performs the parsing.

search : template that exports the preprocessing and search functions, which are implemented in the six modules described (simple, etc.).

shell : interacts with the user and provides the main program.

The template search facilitates the management of different search algorithms for different pattern types. Moreover, we recall that the scanning procedure for a pattern of a given type can use a subpattern whose type is simpler, and hence a different scanning function is used. This is easily handled with the template.

Some Experimental Results

We present now some experiments comparing the performance of nrgrep against that of its best known competitors: Gnu grep and agrep.

Agrep uses, for simple strings, a very efficient modification of the Boyer-Moore-Horspool algorithm. For classes of characters and wild cards, it uses an extension of the Shift-Or algorithm explained earlier, in all cases based on forward scanning. For regular expressions, it uses forward scanning with the bit-parallel simulation of Thompson's automaton. Finally, for approximate searching of simple strings, it tries to use partitioning into k + 1 pieces and a multipattern search algorithm based on Boyer-Moore, but if this does not promise to yield good results, it prefers a bit-parallel forward scanning with the automaton of k + 1 rows. For approximate searching of more complex patterns, agrep uses only this last technique.
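For reference, the plain Shift-Or scanner that this forward-scanning machinery generalizes can be written in a few lines of C; this is the textbook version, not agrep's actual code:

```c
#include <string.h>

/* Textbook Shift-Or scanning: state bit i is 0 iff pattern[0..i] matches
   a suffix of the text read so far.  Returns the position right after
   the first match, or -1.  The pattern must be nonempty and fit in one
   machine word. */
long shift_or(const char *text, long n, const char *pat) {
    unsigned long B[256], D = ~0ul, m = strlen(pat), i;
    long j;
    if (m == 0 || m > 8 * sizeof(unsigned long)) return -1;
    for (i = 0; i < 256; i++) B[i] = ~0ul;
    for (i = 0; i < m; i++)
        B[(unsigned char)pat[i]] &= ~(1ul << i);   /* clear matching bits */
    for (j = 0; j < n; j++) {
        D = (D << 1) | B[(unsigned char)text[j]];  /* one shift + one or */
        if ((D & (1ul << (m - 1))) == 0)           /* bit m-1 active? */
            return j + 1;                          /* match ends at j */
    }
    return -1;
}
```

A zero bit i after reading text position j means that the pattern prefix of length i + 1 matches the text ending at j; classes of characters only change the preprocessing of B, not the scanning loop, which is why agrep's time is constant for them.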

Gnu grep cannot search for approximate patterns, but it permits all the extended patterns and regular expressions. Simple strings are searched for with a Boyer-Moore-Gosper search algorithm (similar to Horspool's). All the other, complex, patterns are searched for with a lazy deterministic automaton (i.e., a DFA whose states are built as needed), using forward scanning. To speed up the search for complex patterns, grep tries to extract their longest necessary string, which is used as a filter and searched for as a simple string. In fact, grep is able to extract a necessary set of strings, i.e., such that one of the strings in the set has to appear in every match. This set is searched for, as a filter, using a Commentz-Walter-like algorithm, which is a kind of multipattern Boyer-Moore. As we will see, this extension makes grep very powerful, and closer to our goal of a smooth degradation in efficiency as the pattern gets more complex.
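For reference, a textbook Horspool search, the family to which grep's simple-string scanner belongs (this is an illustration, not grep's actual code):

```c
#include <string.h>

/* Textbook Horspool search: on a mismatch, shift the window by the
   precomputed displacement of the last text character in the window.
   Returns the index of the first occurrence, or -1. */
long horspool(const char *text, long n, const char *pat) {
    long m = strlen(pat), shift[256], i, j;
    if (m == 0 || m > n) return -1;
    for (i = 0; i < 256; i++) shift[i] = m;          /* default: full jump */
    for (i = 0; i < m - 1; i++)
        shift[(unsigned char)pat[i]] = m - 1 - i;    /* distance to the end */
    i = 0;
    while (i <= n - m) {
        for (j = m - 1; j >= 0 && text[i + j] == pat[j]; j--)
            ;                                        /* compare backwards */
        if (j < 0) return i;                         /* full match at i */
        i += shift[(unsigned char)text[i + m - 1]];  /* skip ahead */
    }
    return -1;
}
```

The skip loop is what makes these scanners fast on simple strings: most text characters are examined only once, and many are skipped altogether.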

The experiments were carried out over English text extracted from Wall Street Journal articles, which are part of the trec collection. Two different machines were used: "sun", a Sun UltraSparc running Solaris, and "intel", an Intel PC running Linux (Red Hat). Both are 32-bit machines (w = 32).

To illustrate the complexity of the code, we show the sizes of the sources and executables in the table below. The source size is obtained by summing up the sizes of all the .c and .h files. The size of the executables is computed after running Unix's strip command on them. As can be seen, nrgrep is in general a simpler software.

Software    Source size    Executable size (sun)    Executable size (intel)
agrep       … Kb           … Kb                     … Kb
grep        … Kb           … Kb                     … Kb
nrgrep      … Kb           … Kb                     … Kb

Table: Sizes of the different softwares under comparison.

The experiments were repeated for several different patterns of each kind. To minimize the interference of I/O times in our measures, we had the text on a local disk, we considered only user times, and we asked the programs to show only the count of matching records. The records were the lines of the file. The same patterns were searched for, on the same text, by grep, agrep and nrgrep.

Our first experiment shows the search times for simple strings of lengths 5 to 30. Those strings were randomly selected from the text, starting at word beginnings and taking care of including only letters, digits and spaces. We also tested how the -i (case insensitive search) and -w (match whole words) options affected the performance, but the differences were negligible. We also ran this experiment allowing 10% and 20% of errors, i.e., k = ⌊0.1 m⌋ or k = ⌊0.2 m⌋. In the case of errors, grep is excluded from the comparison because it does not permit approximate searching.

As the figure shows, nrgrep is competitive against the others, more or less depending on the machine and the pattern length. It works better on the intel architecture, and on moderate-length patterns rather than on very short ones. It is interesting to notice that, in our experiments, grep performed better than agrep.

When searching allowing errors, nrgrep is slower or faster than agrep depending on the case. With low error levels they are quite close, except for one pattern length for which, for some reason, agrep performs consistently badly on sun. With moderate error levels the picture is more complicated, and each of them is better for different pattern lengths. This is a point of very volatile behavior, because 20% happens to be very close to the limit where splitting into k + 1 pieces ceases to be a good choice, and a small error in estimating the probability of a match produces dramatic changes in the performance. Depending on each pattern, a different search technique is used.

On the other hand, the search permitting transpositions is consistently worse than the one not permitting them. This is not only because, when splitting into k + 1 pieces, the pieces may have to be shorter to allow one free space among them, but also because, even when they can have the same length, the subpattern selection process has more options to find the best pieces if no space has to be left among them.

The behavior of nrgrep, however, is not as erratic as it seems. The non-monotonic behavior can be explained in terms of splitting into k + 1 pieces. As m grows, the length of the pieces that can be searched for grows, tending to the real m/(k + 1) value from below. For example, for k = 20% of m, there is a range of pattern lengths for which the pieces stay equally short in the case of transpositions; at some intermediate m those short pieces are estimated to produce too many verifications, and then another method is used, while for a larger m the pieces become longer, have a much lower probability of occurrence, and splitting into k + 1 pieces is recommended again. The differences between searching with and without transpositions are explained in the same way: for some lengths m both searches must use pieces of the same length, while for others longer pieces can be used if no transpositions are permitted, which yields much better search times in those cases. We remark that permitting transpositions has the potential of yielding approximate searching of better quality, and hence obtaining the same or better results using a smaller k.
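The piece-length arithmetic used in this explanation can be made explicit as follows; the formulas are our reading of the scheme, for illustration only, and not nrgrep's literal computation:

```c
/* Illustrative arithmetic (an assumption, not nrgrep's code): when a
   pattern of length m is split into k+1 pieces so that some piece must
   appear unaltered, each piece can have length floor(m/(k+1)).  If
   transpositions are permitted, one free character must separate
   consecutive pieces, leaving only m-k characters to distribute. */
long piece_len(long m, long k, int transpositions) {
    long usable = transpositions ? m - k : m;   /* k one-char gaps */
    return usable / (k + 1);                    /* floor division */
}
```

Longer pieces are rarer in the text, so a jump in piece length translates directly into fewer verifications and a visible drop in search time.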

Our second experiment aims at evaluating the performance when searching for simple patterns that include classes of characters. We have selected strings, as before, from the text, and have replaced some random positions with a class of characters. The classes have been: upper and lower case versions of the letter replaced (called "case" in the experiments), all the letters [a-zA-Z] (called "letters"), and all the characters (called "all"). We have considered pattern lengths of 10 and 20, and an increasing number of positions converted into classes. We show also the case of length 15 with k = 2 errors, excluding grep.

As the figure shows, nrgrep deals much more efficiently with classes of characters than its competitors, worsening slowly as there are more, or bigger, classes to consider. Agrep always yields the same search time, coming from a Shift-Or like algorithm. Grep, on the other hand, worsens progressively as the number of classes grows (but is independent of how big the classes are), and it worsens much faster than nrgrep. We have included the case of zero classes basically to show the effect of the change of agrep's search algorithm.

Our third experiment deals with extended patterns. We have selected strings as before from the text, and have added an increasing number of "*", "+" or "?" operators at random positions (avoiding the first and last ones). For this sake we had to use the -E option of grep, which deals with regular expressions, and convert "E+" into "EE*" for agrep. Moreover, we found no way to express "E?" in agrep, since even allowing regular expressions it did not permit specifying the empty string in an alternation. We also show how agrep and nrgrep perform when searching for extended patterns allowing errors. Because of agrep's limitations, we chose m = 8 and k = 1 error (no transpositions permitted) for this test.

As the figure shows, agrep has a constant search time (coming again from Shift-Or like searching, plus some size limitations on the pattern). Both grep and nrgrep improve as the problem becomes

Figure: Search times for simple strings, using exact and approximate searching (10% and 20% errors, with and without transpositions), on sun and intel.

Figure: Exact and approximate search times for simple patterns with a varying number of classes of characters.

simpler, but nrgrep is consistently faster. Note that the problem is necessarily easier with the "+" operator, because with it the minimum length of the pattern is not reduced as the number of operators grows. We have again included the case of zero operators, to show how agrep's time jumps.

With respect to approximate searching, nrgrep is consistently faster than agrep, whose cost is a kind of worst case, reached when the minimum pattern length becomes too small. This is indeed a very difficult case for approximate searching, and nrgrep wisely chooses to do forward scanning in that case.

Our fourth experiment considers regular expressions. It is not easy to define what a random regular expression is, so we have tested complex patterns that we considered interesting to illustrate the different alternatives for the efficiency. These have been searched for with zero, one and two errors (no transpositions permitted). The table below shows the patterns selected, and some aspects that explain the efficiency problems of searching for them.

No.  Pattern                        Size (chars)  Min. length ℓ  Lines matching (exact / 1 error / 2 errors)
1    American|Canadian                  …             …              …
2    American|Canadian|Mexican          …             …              …
3    Amer[a-z]can                       …             …              …
4    Amer[a-z]can|Can[a-z]ian           …             …              …
5    Ameirican                          …             …              …
6    Am[a-z]ri[a-z]an                   …             …              …
7    AmCaernaicdian                     …             …              …
8    Americanpolicy                     …             …              …
9    Americanpolicy                     …             …              …

Table: The regular expressions searched for, written with the syntax of nrgrep. Note that some are indeed extended patterns.

The table below shows the results. These are more complex to interpret than in the previous cases, and there are important differences in the behavior of the same code depending on the machine.

For example, grep performed consistently well on intel, while it showed wide differences on sun. We believe that this comes from the fact that in some cases grep cannot find a suitable set of filtering strings; in those cases the time corresponds to that of a forward DFA. On the intel machine, however, the search times are always good.

Another example is agrep, which has basically two different times on sun, and always the same time on intel. Despite the fact that agrep always uses forward scanning with a bit-parallel automaton, it takes on the sun half the time when the pattern is very short, while it is slower for longer patterns. This difference seems to come from the number of computer words needed for the simulation, but this seems not to be important on the intel machine. The approximate search using agrep, which works only on the shorter patterns, scales in time according to the number of computer words used, although there is a large constant factor added on the intel machine.

Nrgrep performs well on exact searching. All the patterns yield fast search times, in general better than those of grep on intel and worse on sun. The exception is the pattern for which grep does not use filtering on the sun machine, where nrgrep becomes much faster. When errors are permitted, the

search times of nrgrep vary depending on its ability to filter the search. The main aspects that

Figure: Exact and approximate search times for extended patterns with a varying number of operators. Where agrep is out of bounds on sun, its time is far above the scale of the plot.

affect the search time are the frequency of the pattern and its size. When compared to agrep, nrgrep always performs better on exact searching, and better or similarly on approximate searching.

sun:

No.   grep: 0 errors   agrep: 0 errors / 1 error / 2 errors   nrgrep: 0 errors / 1 error / 2 errors
…

intel:

No.   grep: 0 errors   agrep: 0 errors / 1 error / 2 errors   nrgrep: 0 errors / 1 error / 2 errors
…

Table: Search times for the selected regular expressions. There are empty cells because agrep severely restricts the lengths of the complex patterns that can be approximately searched for.

The general conclusions from the experiments are that, for exact searching, nrgrep is competitive against agrep and grep, while it is in general superior, sometimes by far, when searching for classes of characters and extended patterns, exactly or allowing errors. When it comes to searching for regular expressions, nrgrep is in general, but not always, faster than grep and agrep.

One final word about nrgrep's smoothness is worthwhile. The reader may get the impression that nrgrep's behavior is not as smooth as promised, because it takes very different times for different patterns (for example, on regular expressions), while agrep's behavior is much more predictable. The point is that some of these patterns are indeed much simpler than others, and nrgrep is much faster on the simpler ones. Agrep, on the other hand, does not distinguish between simple and complicated cases. Grep does a much better job, but it does not deal with the complex area of approximate searching. Something similar happens in other cases: nrgrep had the highest variance when searching for patterns where classes were inserted at random positions. This is because this random process does produce a high variance in the complexity of the patterns: it is much simpler to search for the pattern when all the classes are at one extreme (then cutting them out of the scanning subpattern) than when they are uniformly spread. Nrgrep's behavior simply mimics the variance in its input, precisely because it takes time proportional to the real complexity of the search problem.

Conclusions

We have presented nrgrep, a fast online pattern matching tool especially well suited for complex pattern searching on natural language text. Nrgrep is publicly available under a Gnu license. Our belief is that it can be a successful new member of the grep family. The Free Software Foundation has shown interest in making nrgrep an important part of a new release of Gnu grep, and we are currently defining the details.

The most important improvements of nrgrep over the other members of the grep family are:

Efficiency: nrgrep is similar to the others when searching for simple strings (a sequence of single characters) and some regular expressions, and generally much faster for all the other types of patterns.

Uniformity: our search model is uniform, based on a single concept. This translates into smooth variations in the search time as the pattern gets more complex, and into an absence of the obscure restrictions present in agrep.

Extended patterns: we introduce a class of patterns intermediate between simple patterns and regular expressions, and develop efficient search algorithms for it.

Error model: we include character transpositions in the error model.

Optimization: we find the optimal subpattern to scan the text, and check the potential occurrences for the complete pattern.

Some possible extensions and improvements have been left for future work. The first one is a better computation of the matching probability, which has a direct impact on nrgrep's ability to choose the right search method (currently it normally succeeds, but not always). A second one is a better algebraic optimization of the regular expressions, which also has an impact on the ability to correctly compute the matching probabilities. Finally, we would also like to be able to combine exact and approximate searching, as agrep does, where parts of the pattern accept errors and others do not. It is not hard to do this by using bit masks that control the propagation of errors.

Acknowledgements

Most of the algorithmic work preceding nrgrep was done in collaboration with Mathieu Raffinot, who unfortunately had no time to participate in this software.
