
<p>NR<a href="/tags/Grep/" rel="tag">grep</a> A Fast and Flexible Pattern Matching To ol</p><p></p><p>Gonzalo Navarro</p><p>Abstract</p><p>We present nrgrep nondeterministic reverse grep a new pattern matching to ol designed</p><p> for ecient search of complex patterns Unlike previous to ols of the grep family such as agrep</p><p> and Gnu grep nrgrep is based on a single and uniform concept the bitparallel simulation</p><p> of a nondeterministic sux automaton As a result nrgrep can nd from simple patterns to</p><p> regular expressions exactly or allowing errors in the matches with an eciency that degrades</p><p> smo othly as the complexity of the searched pattern increases Another concept fully integrated</p><p> into nrgrep and that contributes to this smo othness is the selection of adequate subpatterns for</p><p> fast scanning which is also absent in many current to ols We show that the eciency of nrgrep</p><p> is similar to that of the fastest existing string matching to ols for the simplest patterns and by</p><p> far unpaired for more complex patterns</p><p>Key words Online string matching <a href="/tags/Regular_expression/" rel="tag">regular expression</a> searching approximate string match</p><p> ing grep agrep BNDM</p><p> Intro duction</p><p>The purp ose of this pap er is to present a new pattern matching to ol which we have coined nrgrep</p><p> for nondeterministic reverse grep Nrgrep is aimed at ecient searching for complex patterns</p><p> inside natural language texts but it can b e used in many other scenarios</p><p>The pattern matching problem can b e stated as follows given a text T of n characters and a</p><p>1n</p><p> pattern P nd all the p ositions of T where P o ccurs The problem is basic in almost every area of</p><p> computer science and app ears in many dierent forms The pattern P can b e just a simple string</p><p> but it can also b e for example a regular expression An o ccurrence can b e dened as exactly or</p><p>approximately matching the pattern</p><p>In this pap er we concentrate on online string matching that is the text cannot b e indexed</p><p>Online string matching is useful for casual searching ie users lo oking for strings in their les</p><p> and unwilling to maintain an index for that purp ose dynamic text collections where the cost</p><p> of keeping an uptodate index is prohibitive including the searchers inside text editors and Web</p><p>1</p><p> interfaces for not very large texts up to a few hundred megabytes and even as internal to ols</p><p> of indexed schemes as agrep is used inside glimpse or cgrep is used inside compressed</p><p> indexes </p><p></p><p>Dept of Computer Science University of Chile Blanco Encalada Santiago Chile</p><p> gnavarrodccuchilecl Work develop ed while the author was at p ostdo ctoral stay at the Institut Gaspard Monge</p><p>Univ de MarnelaVallee France partially supp orted by Fundacion Andes and ECOSConicyt</p><p>1</p><p>We refer to the search in page facility not to confuse with searching the Web </p><p>There is a large class of string matching algorithms in the literature see for example </p><p> but not all of them are practical There is also a wide variety of fast online string matching to ols</p><p> in the public domain most prominently the grep family Among these Gnu grep and Wu and</p><p>Manb ers agrep are widely known and currently considered as the fastest string matching to ols</p><p> in practice Another distinguishing feature of these software systems is their exibility they can</p><p> search not only for simple strings but they also p ermit classes of characters that is a pattern</p><p> p osition matches a set of characters wild cards a pattern p osition that matches an arbitrary</p><p> string regular expression searching multipattern searching etc Agrep also p ermits approximate</p><p> searching the pattern matches the text after p erforming a limited numb er of alterations on it</p><p>The algorithmic principles b ehind agrep are diverse Exact string matching is done with</p><p> the Horsp o ol algorithm a variant of the BoyerMo ore family The sp eed of the Boyer</p><p>Mo ore string matching algorithms comes from their ability to skip ie not insp ect some text</p><p> characters Agrep deals with more complex patterns using a variant of ShiftOr an algorithm</p><p> exploiting bitparallelism a concept that we explain later to simulate nondeterministic automata</p><p>NFA eciently ShiftOr however cannot skip text characters Multipattern searching is treated</p><p> with bitparallelism or with a dierent algorithm dep ending on the case As a result the search</p><p> p erformance of agrep varies sharply dep ending on the typ e of search pattern and even slight</p><p> mo dications to the pattern yield widely dierent search times For example the search for the</p><p> string algorithm is times faster than for Aalgorithm where Aa is a class of characters</p><p> that matches A and a which is useful to detect the word either starting a sentence or not In</p><p> the rst case agrep uses Horsp o ols algorithm and in the second case it uses ShiftOr Intuitively</p><p> there should exist a more uniform approach where b oth strings could b e eciently searched for</p><p> without a signicant dierence in the search time</p><p>An answer to this challenge is the BNDM algorithm BNDM is based on a previous</p><p> algorithm BDM for backward DAWG matching The BDM algorithm to b e explained</p><p> later uses a sux automaton to detect substrings of the pattern inside a text window the Boyer</p><p>Mo ore family detects only suxes of the pattern As the BoyerMo ore algorithms BDM can also</p><p> skip text characters In the original BDM algorithm the sux automaton is made deterministic</p><p>BNDM is a recent version of BDM that keeps the sux automaton in nondeterministic form by</p><p> using bitparallelism As a result BNDM can search for complex patterns and still keep a search</p><p> eciency close to that of simple patterns It has b een shown exp erimentally that the</p><p>BNDM algorithm is by far the fastest one to search for complex patterns BNDM has b een later</p><p> extended to handle regular expressions </p><p>Nrgrep is a pattern matching to ol built over the BNDM algorithm hence the name nondeter</p><p> ministic reverse grep since BNDM scans windows of the text in reverse direction However there</p><p> is a gap b etween a pattern matching algorithm and a real software The purp ose of this work is to</p><p>ll that gap We have classied the allowed search patterns in three levels</p><p>Simple patterns a simple pattern is a sequence of m classes of characters note that a single</p><p> character is a particular case of a class Its distinguishing feature is that an o ccurrence of a</p><p> simple pattern has length m as well as each pattern p osition matches one text p osition</p><p>Extended patterns an extended pattern adds to simple patterns the ability to characterize indi</p><p> vidual classes as optional ie they can b e skipp ed when matching the text or rep eatable </p><p>ie they can app ear consecutively a numb er of times in the text The purp ose of extended</p><p> patterns is to capture the most commonly used extensions of the normal search patterns so</p><p> as to develop sp ecialized pattern matching algorithms for them</p><p>Regular expressions a regular expression is formed by simple classes the empty string or the</p><p>concatenation union or rep etition of other regular expressions This is the most general</p><p> typ e of pattern we can search for</p><p>We develop a dierent pattern matching algorithm with increasing complexity for each typ e of</p><p> pattern so simpler patterns are searched for with simpler and faster algorithms The classication</p><p> has b een made having in mind the typical search needs on natural language and it would b e</p><p> dierent say for DNA searching We have also this in mind when we design the error mo del for</p><p> approximate searching Approximate searching or searching allowing errors means that the</p><p> user gives an error threshold k and the system is able to nd the pattern in the text even if it</p><p> is necessary to p erform k or less op erations in the pattern to match its o ccurrence in the text</p><p>The op erations typically p ermitted are the insertion deletion and substitution of single characters</p><p>However transp osition of adjacent characters is an imp ortant typing error that is normally</p><p> disregarded b ecause it is dicult to deal with We allow the four op erations in nrgrep although</p><p> the user can sp ecify a subset of them</p><p>An imp ortant asp ect that deserves attention in order to obtain the desired smo othness in</p><p> the search time is the selection of an optimal subpattern to scan the text A typical case is an</p><p> extended pattern with a large and rep eatable class of characters close to one end For technical</p><p> reasons that will b e made clear later it may b e very exp ensive to search for the pattern as is while</p><p> pruning the extreme of the pattern that contains the class and verifying the p otential o ccurrences</p><p> found leads to much faster searching Some to ols such as Gnu grep for regular expressions try to</p><p> apply some heuristics of this typ e but we provide a general and uniform subpattern optimization</p><p> metho d that works well in all cases and under a simplied probabilistic mo del yields the optimal</p><p> search p erformance for that pattern Moreover the selected subpattern may b e of a simpler typ e</p><p> than the whole pattern and a faster search algorithm may b e p ossible Detecting the exact typ e</p><p> of pattern given by the user despite the syntax used is an imp ortant issue that is solved by the</p><p> pattern parser</p><p>We have followed the philosophy of agrep in some asp ects such as the recordoriented way</p><p> to present the results and most of the pattern syntax features and search options The main</p><p> advantages of nrgrep over the grep family are uniformity in design smo othness in search time</p><p> sp eed when searching for complex patterns p owerful extended patterns improved error mo del for</p><p> approximate searching and subpattern optimization</p><p>In this pap er we start by explaining the concepts of bit parallelism and searching with sux</p><p> automata Then we explain how these are combined to search for simple patterns extended patterns</p><p> and regular expressions We later consider the approximate search of these patterns Finally we</p><p> present the nrgrep software and show some exp erimental results on it Despite that the algorithmic</p><p> asp ects of the pap er b orrow from our previous work in some cases the pap er has</p><p> some novel and nontrivial algorithmic contributions such as</p><p> searching for extended patterns which implies the bit parallel simulation of new typ es of</p><p> restricted automata </p><p> approximate searching allowing transp ositions for the three typ es of patterns which has never</p><p> b een considered under the bitparallel approach and</p><p> algorithms to select optimal search subpatterns in the three typ es of patterns</p><p>The nrgrep to ol is freely available under a Gnu license from</p><p> httpwwwdccuchileclgnavarropubcode</p><p> Basic Concepts</p><p>We dene in this section the basic concepts and notation needed throughout the pap er</p><p> Notation</p><p>We consider that the text is a sequence of n characters T t t where t is the</p><p>1 n i</p><p> alphab et of the text and its size is denoted jj In the simplest case the pattern is denoted</p><p> as P p p a sequence of m characters p in which case it sp ecies the single string</p><p>1 m i</p><p> p p More general patterns sp ecify a nite or innite set of strings</p><p>1 m</p><p>We say that P matches T at p osition i whenever there exists a j such that t t b elongs</p><p> i i+j</p><p> to the set of strings sp ecied by the pattern The substring t t is called an o ccurrence of</p><p> i i+j</p><p>P in T Our goal is to nd all the text p ositions that start a pattern o ccurrence</p><p>The following notation is used for strings S denotes the string s s s In particular</p><p> ij i i+1 j</p><p>S the empty string if i j A string X is said to b e a prex sux and factor or</p><p> ij</p><p> substring resp ectively of XY YX and YXZ for any Y and Z </p><p>We use some notation to describ e bitparallel algorithms We use exp onentiation to denote</p><p>3</p><p> bit rep etition eg We denote as b b the bits of a mask of length which is</p><p> 1</p><p> stored somewhere inside the computer word of xed length w in bits We use Clike syntax for</p><p> op erations on the bits of computer words ie j is the bitwiseor is the bitwiseand b </p><p> is the bitwisexor complements all the bits and moves the bits to the left and enters</p><p> zeros from the right eg b b b b b b b We can also p erform arithmetic</p><p> 1 2 1 3 2 1</p><p> op erations on the bits such as addition and subtraction which op erate the bits as the binary</p><p> representation of a numb er for instance b b b b </p><p> x x</p><p>In the following we show that the pattern can b e a more complex entity matching in fact a set</p><p> of dierent text substrings</p><p> Simple Patterns</p><p>We call a simple pattern a sequence of characters or classes of characters Let m b e the numb er</p><p> of elements in the sequence then a simple pattern P is written as P p p where p </p><p>1 m i</p><p>We say that P matches at text p osition i whenever t p for i m The most imp ortant</p><p> i i</p><p> feature of simple patterns is that they match a substring of the same length m in the text</p><p>We use the following notation to describ e simple patterns we concatenate the elements of the</p><p> sequence together Simple characters ie classes of size are written down directly while other</p><p> classes of characters are written in square brackets The rst character inside the square brackets </p><p> can b e which means that the class is exactly the complement of what is sp ecied The rest</p><p> is a simple enumeration of the characters of the class except that we allow ranges xy means</p><p> all the characters b etween x and y inclusive we assume a total order in which is in practice the</p><p>ASCI I co de Finally the character represents a class equal to the whole alphab et and </p><p> represents the class of all separators ie non alphanumeric characters Most of these conventions</p><p> are the same used in <a href="/tags/Unix/" rel="tag">Unix</a> software Some examples are</p><p> Aamerican which matches American and american</p><p> nBegin which nds Begin if it is not preceded by a line break</p><p> which matches any date in the s</p><p> eazAZt which p ermits any character in the rst p osition then e then anything</p><p> except a letter or underscore in the third p osition then t and nishes with a separator</p><p>Note that we have used n to denote the newline We also use t for the tab xHH for</p><p> the character with hex ASCI I co de HH and in general C to interpret any character C literally</p><p>eg the backslash character itself as well as the sp ecial characters that follow</p><p>It is p ossible to sp ecify that the pattern has to app ear at the b eginning of a line by preceding</p><p> it with or at the end of the line by following it with Note that this is not the same as</p><p> adding the newline preceding or following the pattern b ecause the b eginningend of the le signals</p><p> also the b eginningend of the line and the same happ ens with records when the record delimiter</p><p> is not the end of line</p><p> Extended Patterns</p><p>In general an extended pattern adds some extra sp ecication capabilities to the simple pattern</p><p> mechanism In this work we have chosen some features which we b elieve are the most interesting</p><p> for typical text searching The reason to intro duce this intermediatelevel pattern b etween simple</p><p> patterns and regular expressions is that it is p ossible to devise sp ecialized search algorithms for</p><p> them which can b e faster than those for general regular expressions The op erations we p ermit for</p><p> extended patterns are sp ecify optional classes or characters and p ermit the rep etition of a class</p><p>or character The notation we use is to add a symb ol after the aected character or class </p><p> means an optional class means that the class can app ear zero or more times and means</p><p> that it can app ear one or more times Some examples are</p><p> colour which matches color and colour</p><p> azAZazAZ which matches valid variable names in most programming</p><p> languages a letter followed by letters or digits</p><p> LatinAmerica which matches Latin and America separated by one or more sepa</p><p> rator characters eg spaces tabs etc </p><p> Regular Expressions</p><p>A regular expression is the most sophisticated pattern that we allow to search for and it is in general</p><p> considered p owerful enough for most applications A regular expression is dened as follows</p><p> Basic elements any character and the empty string are regular expressions matching</p><p> themselves</p><p> Parenthesis if e is a regular expression then so is e which matches the same strings This</p><p> is used to change precedence</p><p> Concatenation if e and e are regular expressions then e e is a regular expression that</p><p>1 2 1 2</p><p> matches a string x i x can b e written as x y z where e matches y and e matches z </p><p>1 2</p><p> Union if e and e are regular expressions then e je is a regular expression that matches a</p><p>1 2 1 2</p><p> string x i e or e match x</p><p>1 2</p><p> Kleene closure if e is a regular expression then e is a regular expression that matches a</p><p> string x i for some n x can b e written as x x x and e matches each string x </p><p>1 n i</p><p>We follow the same syntax with the precedence order j except that we use square</p><p> brackets to abbreviate x jx j jx x x x where the x are characters we omit the</p><p>1 2 n 1 2 n i</p><p> concatenation op erator we add the two op erators e ee and e ej and we use the</p><p> empty string to denote eg abc denotes abjc These arrangements make the extended</p><p> patterns to b e a particular case of regular expressions Some examples are</p><p> dogcat which matches dog and cat</p><p> DrProfMrKnuth which matches Knuth preceded by a sequence of titles</p><p> Pattern Matching Algorithms</p><p>We explain in this section the basic string and regular expression search algorithms our software</p><p> builds on</p><p> Bit Parallelism and the ShiftOr Algorithm</p><p>In a new approach to text searching was prop osed It is based on bitparal lelism This</p><p> technique consists in taking advantage of the intrinsic parallelism of the bit op erations inside a</p><p> computer word By using cleverly this fact the numb er of op erations that an algorithm p erforms</p><p> can b e cut down by a factor of at most w where w is the numb er of bits in the computer word</p><p>Since in current architectures w is or the sp eedup is very signicant in practice</p><p>Figure shows a nondeterministic automaton that searches for a pattern in a text Classical</p><p> pattern matching algorithms such as KMP convert this automaton to a deterministic form</p><p> and achieve O n worst case search time The ShiftOr algorithm on the other hand uses</p><p> bitparallelism to simulate the automaton in its nondeterministic form It achieves O mnw </p><p> worstcase time ie an optimal sp eedup over a classical O mn simulation For m w ShiftOr</p><p> is twice as fast as KMP b ecause of b etter use of the computer registers Moreover it is easily</p><p> extended to handle classes of characters</p><p>Σ</p><p> a b c d e f g</p><p>0 1 23 4 5 6 7</p><p>Figure A nondeterministic automaton NFA to search for the pattern P abcdefg in a text</p><p> Text Scanning</p><p>We present now the ShiftAnd algorithm which is an easier to explain though a little less</p><p> ecient variant of ShiftOr Given a pattern P p p p p and a text T t t t </p><p>1 2 m i 1 2 n</p><p> t the algorithm builds rst a table B which for each character stores a bit mask b b The</p><p> i m 1</p><p> mask in B c has the ith bit set if and only if p c The state of the search is kept in a machine</p><p> i</p><p> word D d d where d is set whenever p p p matches the end of the text read up to</p><p> m 1 i 1 2 i</p><p> now another way to see it is to consider that d tells whether the state numb ered i in Figure is</p><p> i</p><p> active Therefore we rep ort a match whenever d is set</p><p> m</p><p> m</p><p>We set D originally and for each new text character t we up date D using the formula</p><p> j</p><p> m1</p><p>D D j B t </p><p> j</p><p>The formula is correct b ecause the ith bit is set if and only if the i th bit was set for</p><p> the previous text character and the new text character matches the pattern at p osition i In other</p><p> words t t p p if and only if t t p p and t p Again it is</p><p> j i+1 j 1 i j i+1 j 1 1 i1 j i</p><p> p ossible to relate this formula to the movement that o ccurs in the NFA for each new text character</p><p> each state gets the value of the previous state but this happ ens only if the text character matches</p><p> m1</p><p> the corresp onding arrow Finally the j after the shift allows a match to b egin at the</p><p> current text p osition this op eration is saved in the ShiftOr where all the bits are complemented</p><p>This corresp onds to the selflo op at the initial state of the automaton</p><p>The cost of this algorithm is O n For patterns longer than the computer word ie m w </p><p> the algorithm uses dmw e computer words for the simulation not all them are active all the time</p><p> with a worstcase cost of O mnw and still an average case cost of O n</p><p> Classes of Characters and Extended Patterns</p><p>The ShiftOr algorithm is not only very simple but it also has some further advantages The most</p><p> immediate one is that it is very easy to extend to handle classes of characters where each pattern</p><p> p osition may not only match a single character but a set of characters If p is the set of characters</p><p> i</p><p> that match the p osition i in the pattern we set the ith bit of B c for all c p No other change</p><p> i</p><p> is necessary to the algorithm In they show also how to allow a limited numb er k of mismatches</p><p> in the o ccurrences at O nm logk w cost </p><p>This paradigm was later enhanced to supp ort allow wild cards regular expressions ap</p><p> proximate search with nonuniform costs and combinations of them Further development of the</p><p> bitparallelism approach for approximate string matching yielded some of the fastest algorithms for</p><p> short patterns In most cases the key idea was to simulate an NFA</p><p>Bitparallelism has b ecome a general way to simulate simple NFAs instead of converting them</p><p> to deterministic automata This is how we use it in nrgrep</p><p> The BDM Algorithm</p><p>The main disadvantage of ShiftOr is its inability to skip characters which makes it slower than</p><p> the algorithms of the BoyerMo ore or the BDM families We describ e in this section the</p><p>BDM pattern matching algorithm which is able to skip some text characters</p><p>BDM is based on a sux automaton A sux automaton on a pattern P p p p is an</p><p>1 2 m</p><p> automaton that recognizes all the suxes of P A nondeterministic version of this automaton has</p><p> a very regular structure and is shown in Figure In the BDM algorithm this automaton is made deterministic</p><p>I ε ε ε ε ε ε ε ε a b c d e f g</p><p>0 1 2 3 4 5 6 7</p><p>Figure A nondeterministic sux automaton for the pattern P abcdefg Dashed lines</p><p> represent transitions ie they o ccur without consuming any input</p><p>A very imp ortant fact is that this automaton can b e used not only to recognize the suxes of</p><p>P but also factors of P Note that there is a path lab eled by x from the initial state if and only if</p><p> x is a factor of P That is the NFA will not run out of active states as long as it has read a factor</p><p> of P </p><p>The sux automaton is used to design a simple pattern matching algorithm This algorithm</p><p> runs in O mn time in the worst case but it is optimal on average O n log mm time Other</p><p></p><p> more complex variations such as Turb oBDM and MultiBDM achieve linear time in the</p><p> worst case</p><p>To search for a pattern P p p p in a text T t t t the sux automaton of</p><p>1 2 m 1 2 n</p><p> r</p><p>P p p p ie the pattern read backwards is built A window of length m is slid along</p><p> m m1 1</p><p> the text from left to right The algorithm searches backward inside the window for a factor of</p><p> the pattern P using the sux automaton ie the sux automaton of the reverse pattern is fed</p><p> with the characters in the text window read backward This backward search ends in two p ossible</p><p> forms</p><p> We fail to recognize a factor ie we reach a window character that makes the automaton run</p><p> out of active states This means that the sux of the window we have read is not anymore</p><p> a factor of P Figure illustrates this case We then shift the window to the right its</p><p> starting p osition corresp onding to the p osition following the character we cannot miss an </p><p> o ccurrence b ecause in that case the sux automaton would have found a factor of it in the</p><p> window</p><p>Window</p><p>Search for a factor with the sux automaton</p><p> u</p><p></p><p>Fail to recognize a factor at </p><p>New search</p><p></p><p>Safe shift</p><p>New window</p><p>Figure Sux automaton search</p><p> We reach the b eginning of the window therefore recognizing the pattern P since the lengthm</p><p> window is a factor of P indeed it is equal to P We rep ort the o ccurrence and shift the</p><p> window by one p osition</p><p> Combining ShiftOr and BDM the BNDM Algorithm</p><p>We describ e in this section the BNDM pattern matching algorithm This algorithm a com</p><p> bination of ShiftOr and BDM has all the advantages of the bitparallel forward scan algorithm</p><p> and in addition it is able to skip some text characters like BDM</p><p>Instead of making the automaton of Figure deterministic BNDM simulates it using bit</p><p> parallelism The bitparallel simulation works as follows Just as for ShiftAnd we keep the state</p><p> of the search using m bits of a computer word D d d Each time we p osition the window in</p><p> m 1</p><p> m</p><p> the text we initialize D this corresp onds to the transitions and scan the window backward</p><p>For each new text character read in the window we up date D If we run out of s in D then there</p><p> cannot b e a match and we susp end the scanning and shift the window If on the other hand we</p><p> can p erform m iterations then we rep ort the match</p><p>We use a table B which for each character c stores a bit mask This mask sets the bits</p><p> corresp onding to the p ositions where the reversed pattern has the character c just as in the Shift</p><p>And algorithm The formula to up date D is</p><p></p><p>D D B t </p><p> j</p><p>BNDM is not only faster than ShiftOr and BDM for ab out m but it can accom</p><p> mo date all the extensions mentioned in Section In particular it can easily deal with classes of</p><p> characters by just altering the prepro cessing and it is by far the fastest algorithm to search for</p><p> this typ e of patterns </p><p>Note that this typ e of search is called backward scanning b ecause the text characters inside</p><p> the window are read backwards However the search progresses from left to right in the text as</p><p> the window is shifted There have b een other few attempts to skip characters under a ShiftOr</p><p> approach for example </p><p> Regular Expression Searching</p><p>Bitparallelism has b een successfully used to deal with regular expressions ShiftOr was extended</p><p> in two ways to deal with this case rst using the Thompson and later Glushkovs</p><p> constructions of NFAs from the regular expression Figure shows b oth constructions for the</p><p> pattern abcddefde</p><p>ε f 5 7 10 12 ε ε ε ε a b c d d e 0 1 2 3 4 9 14 15 16 ε ε ε ε 6 8 11 13 d e</p><p> e d</p><p> a b c d d e d e 0 1 2 3 4 5 6 7 8 9</p><p> f</p><p> f</p><p>Figure Thompsons top and Glushkovs b ottom resulting NFAs for the regular expression</p><p>abcddefde</p><p>Given a regular expression with m positions each characterclass denes a new p osition</p><p>Thompsons construction pro duces an automaton of up to m states Its advantage is that the</p><p> states of the resulting automaton can b e arranged in a bit mask so that all the transitions move</p><p> forward except the transitions This is used for a bitparallel simulation which moves the bits</p><p> forward as for the simple ShiftOr and then applies all the moves corresp onding to transitions</p><p>For this sake a table E mapping from bit masks to bit masks is precomputed so that E x yields</p><p> a bit mask where x has b een expanded with all the moves The co de for a transition is therefore</p><p> s1</p><p>D D j B t </p><p> j</p><p>D E D </p><p>We have used s as the numb er of states in the Thompson automaton where m s m</p><p> s</p><p>The main problem is that the E table has entries This is handled by splitting the argument</p><p> horizontally so for example if s then two tables E and E can b e created which receive</p><p>1 2</p><p> half masks and deliver full masks with the expansion of only the bits of their half of course</p><p> the transitions can go to the other half this is why they deliver full masks In this way the</p><p> amount of memory required is largely reduced at the cost of two op erations to build the real E</p><p> value This takes advantage of the fact that if a bit mask x is split in two halves x y z then</p><p> jz j jy j</p><p>E y z E y j E z </p><p>Glushkovs construction has the advantage of pro ducing an automaton with exactly m states</p><p> which can as low as half the states generated by Thompsons construction On the other hand</p><p> the structure of arrows is not regular and the trick of forward shift plus moves cannot b e used</p><p>Instead it has another prop erty all the arrows leading to a given state are lab eled by the same</p><p> character or class This prop erty has b een used recently to provide a spaceeconomical bit</p><p> parallel simulation where the co de for a transition is</p><p>D T D B t </p><p> j</p><p> where T is a table that receives a bit map D of states and delivers another bit map of states</p><p> reachable from states in D no matter by which characters The transitions do not have to b e</p><p> dealt with b ecause Glushkovs construction do es not pro duce them The T table can b e horizontally</p><p> partitioned as well</p><p>It has b een shown that a bitparallel implementation of Glushkovs construction is faster</p><p> than one of Thompsons which should b e clear since in general the tables obtained are much smaller</p><p>An interesting improvement p ossible thanks to bit parallelism is that classes of characters can b e</p><p> dealt with the normal mechanism used for simple patterns without generating one state for each</p><p> alternative</p><p>On the other hand a deterministic automaton DFA can b e built from the nondeterministic</p><p> one It is not hard to see that indeed the previous constructions simulate a DFA since each</p><p> state of the DFA can b e seen as a set of states of the NFA and each p ossible set of states of the</p><p>NFA is represented by a bitmask Normally the DFA takes less space b ecause only the reachable</p><p> combinations are generated and stored while for direct access to the tables we need to store in</p><p> the bitparallel simulations all the p ossible combinations of active and inactive states On the</p><p> other hand bitparallelism p ermits extending regular expressions with classes of characters and</p><p> other features eg approximate searching which is dicult otherwise Furthermore Glushkovs</p><p> m</p><p> construction p ermits not storing a table of states char acter s of worst case size O in the</p><p> m</p><p> case of a DFA but just the table T of size O Finally in case of space problems the technique</p><p> of splitting the bitmasks can b e applied</p><p>Therefore we use the bitparallel simulation of Glushkovs automaton for nrgrep After the</p><p> up date op eration and we check whether a nal state of D is reached this means just an and</p><p> op eration with the mask of nal states Describing Glushkovs NFA construction algorithm is</p><p>2</p><p> outside the scop e of this pap er but it takes O m time The result of the construction can b e</p><p> represented as a table B c which yields the states reached by character c no matter from where</p><p> and a table F ol l ow i which yields the bitmask of states activated from state i no matter by which</p><p> m</p><p> character From F ol l ow the deterministic version T can b e built in O worst case time with</p><p> the following pro cedure</p><p>T </p><p> for i m</p><p> i</p><p> for j </p><p> i</p><p>T j F ol l ow i j T j </p><p>A backward search algorithm for regular expressions is also p ossible and in some cases</p><p> the search is much faster than a forward search The idea is as follows First we compute the </p><p> length of the shortest path from the initial to a nal state using a simple graph algorithm This</p><p> will b e the length of the window in order not to lose any o ccurrence Second we reverse all the</p><p> arrows of the automaton make all the states initial and take as the only nal state the original</p><p> initial state The resulting automaton will have active states as long as we have read a reverse</p><p> factor of a string matching the regular expression and will reach its nal state when we read in</p><p> particular a reverse prex Figure illustrates the result</p><p>ε</p><p> e d</p><p> a b cd d e d e</p><p> f</p><p> f</p><p>Figure An automaton recognizing reverse prexes of abcddefde based on the</p><p>Glushkov construction of Figure </p><p>We apply the same BNDM technique of reading backwards the text window If the automaton</p><p> runs out of active states then no factor of an o ccurrence of the pattern is present in the window</p><p> and we can shift the window aligning its b eginning to one p osition after the one that caused the</p><p> mismatch If on the other hand we reach the b eginning of the window in the backward scan we</p><p> cannot guarantee that an o ccurrence has b een found When searching for a simple string the only</p><p> way to reach the window b eginning is to have read the whole pattern Regular expressions on the</p><p> other hand can have o ccurrences of dierent length and all we know is that we have matched a</p><p> factor There are in fact two choices</p><p> The nal state of the automaton is not active which means that we have not read a prex</p><p> of an o ccurrence In this case we shift the window by one p osition and resume the scanning</p><p> The nal state of the automaton is active Since we have found a pattern prex we have to</p><p> p erform a forward verication starting at the window initial p osition until either we nd an</p><p> o ccurrence or the automaton runs out of active states</p><p>So we need apart from the reversed automaton also the normal automaton without initial</p><p> selflo op as in Figure for the verication of p otential o ccurrences</p><p>An extra complication comes from the fact that the NFA with reverse arrows do es not have the</p><p> prop erty that all arrows leading to a state are lab eled by the same character Rather all the arrows</p><p> leaving a state are lab eled by the same character Hence the simulation can b e done as follows</p><p>R</p><p>D T D B t </p><p> j</p><p>R</p><p> where T corresp onds to the reverse arrows but B is that of the forward automaton </p><p> Approximate String Matching</p><p>Approximate searching means nding the text substrings that can b e converted into the pattern</p><p> by p erforming at most k op erations on them Permitting a limited numb er k of such dierences </p><p>also called errors is an essential to ol to recover from typing sp elling and OCR optical character</p><p> recognition errors Despite that approximate pattern matching can b e reduced to a problem of</p><p> regular expression searching the regular expression grows exp onentially with the numb er of allowed</p><p> errors or dierences</p><p>A rst design decision is what should b e taken as an error Based on existing surveys </p><p> we have chosen the following four typ es of errors insertion of characters deletion of characters re</p><p> placement of a character by another character and exchange of adjacent characters transp osition</p><p>These errors are symmetric in the sense that one can consider that they o ccur in the pattern or in</p><p> the text and the result is the same Traditionally only the rst three errors have b een p ermitted</p><p> b ecause transp osition despite b eing recognized as a very imp ortant source of errors is harder to</p><p> handle However the problem is known to grow very fast in complexity as k increases and since</p><p> a transp osition can only b e simulated with two errors of the other kind ie an insertion and a</p><p> deletion we would need to double k in order to obtain a similar result One of the algorithmic</p><p> contributions of nrgrep is a bitparallel algorithm for p ermitting the transp ositions together with</p><p> the other typ es of errors This p ermits us searching with smaller k values and hence obtain faster</p><p> searching with similar or b etter retrieval results</p><p>Approximate searching is characterized by the fact that no known search algorithm is the b est</p><p> in all cases From the wealth of existing solutions we have selected those that adapt b est to</p><p> our goal of exibility and uniformity Three main ideas can b e used</p><p> Forward Searching</p><p>The most basic idea that is well suited to bit parallelism is to have k similar automata</p><p> representing the state of the search when zero to k errors have o ccurred Apart from the normal</p><p> arrows inside each automaton there are arrows going from automaton i to i corresp onding to</p><p> the dierent errors The original approach did not consider transp ositions which have b een</p><p> dealt with later </p><p>Figure shows an example with k Let us rst fo cus on the big no des and soliddashed</p><p> lines Apart from the normal forward arrows we have three typ es of arrows that lead from each</p><p> row to the next one ie increment the numb er of errors vertical arrows which represent the</p><p> insertion of characters in the pattern since they advance in the text but not in the pattern</p><p> diagonal arrows which represent replacement of the current text character by a pattern character</p><p>since they advance in the text and in the pattern and dashed diagonal arrows transitions</p><p> which represent deletion of characters in the pattern since they advance in the pattern without</p><p> consuming any text input The remaining arrows the dotted ones represent transp ositions which</p><p> p ermit reading the next two pattern characters in the wrong order and move to the next row This</p><p> is achieved by means of temp orary states which we have drawn as smaller circles</p><p>If we disregard the transp ositions there are dierent ways to simulate this automaton in O </p><p> time when it ts in a computer word but no bit parallel solution has b een presented to</p><p> account for the transp ositions This is one of our contributions and is explained later in the pap er</p><p>We extend a simpler bit parallel simulation which takes O k time p er text character as long</p><p> as m O w The technique stores each row in a machine word R R just as we did for</p><p>0 k</p><p> mi i</p><p>ShiftAnd in Section The bitmasks R are initialized to to account for i p ossible initial</p><p> i</p><p></p><p> deletions in the pattern The up date pro cedure to pro duce R up on reading text character t is as</p><p> j a b c d b c d 0 errors</p><p> a b c</p><p> a b c d b c d 1 error</p><p> a b c</p><p> a b c d</p><p>2 errors</p><p>Figure A nondeterministic automaton accepting the pattern abcd with at most errors</p><p>Unlab eled solid arrows match any character while the dashed not dotted lines are transitions</p><p> follows</p><p> m1</p><p>R R j B t </p><p>0 j</p><p>0</p><p> for i k do</p><p> </p><p>R R B t j R j R j R </p><p> i j i1 i1</p><p> i i1</p><p> where of course many co ding optimizations are p ossible and are done in nrgrep but make the co de</p><p> less clear In particular using the complemented version of the representation as in ShiftOr is a</p><p> bit faster</p><p>The rationale of the pro cedure is as follows R has the same up date formula as for the Shift</p><p>0</p><p>And algorithm For the others the up date formula is the or of four p ossible facts The rst one</p><p> corresp onds to the normal forward arrows note that there is no initial selflo op for them only for</p><p>R The second one brings s state activations from the upp er row at the same p osition which</p><p>0</p><p> corresp onds to a vertical arrow ie an insertion The third one brings s from the upp er row at</p><p> the previous p ositions this is obtained with the left shift corresp onding to a diagonal arrow ie</p><p></p><p> a replacement The fourth one is similar but it works on the newer value of the previous row R</p><p> i1</p><p> instead of R and hence it corresp onds to an transition ie a deletion</p><p> i1</p><p> m1</p><p>A match is detected when R is not zero It is not hard to show that whenever the</p><p> k</p><p>nal state of R is active the nal state of R is active to o so it suces to consider R as the only</p><p> i k k</p><p>nal state</p><p> Backward Searching</p><p>Backward searching can b e easily adapted from the forward searching automaton following the</p><p> same techniques used for exact searching That is we build the automaton of Figure on</p><p> the reverse pattern consider all the states as initial ones and consider as the only nal state the</p><p>rst no de of the last row This will recognize all the reverse prexes of P allowing at most k errors</p><p> and will have active states as long as some factor of P has b een seen with at most k errors</p><p>Some observations are of interest First note that we will never shift the window b efore exam</p><p> ining at least k characters since we cannot make k errors b efore that Second the length of the </p><p> window has to b e that of the shortest p ossible match which b ecause of deletions in the pattern</p><p> is of length m k Third just as it happ ens with regular expressions the fact that we arrive to</p><p> the b eginning of the window with some active states in the automaton do es not immediately mean</p><p> that we have an o ccurrence so we have to check the text for a complete o ccurrence starting at the</p><p> window b eginning</p><p>It has b een shown that this algorithm takes time O k k log mm k for m w </p><p></p><p> Splitting into k Subpatterns</p><p>A well known prop erty establishes that under the mo del of insertions deletions and</p><p> replacements if the pattern is cut in k contiguous pieces then at least one of the pieces o ccurs</p><p> unchanged inside any o ccurrences with k errors or less This is easily veried b ecause each op eration</p><p> can alter at most one piece So the technique consists of p erforming a multipattern searching for</p><p> the pieces without errors and checking the text surrounding the o ccurrences of each piece for a</p><p> complete approximate o ccurrence of the whole pattern This leads to the fastest algorithms for low</p><p> error levels </p><p>The prop erty is not true if we add the transp osition b ecause this op eration can alter two</p><p> contiguous pieces at the same time Much b etter than splitting the pattern in k pieces is to</p><p> split it in k pieces and leave one unused character b etween each pair of pieces Under this</p><p> partition a transp osition can alter at most one piece</p><p>We are now confronted with a multipattern search problem This can b e solved with a very</p><p> simple mo dication of the single pattern backward search algorithm Consider the pattern</p><p>abracadabra searched with two errors We split it in abr cad and bra Figure depicts</p><p> the automaton used for backward searching of the three pieces This is implemented with the same</p><p> bit parallel mechanism as for a single pattern except that there are more nal states and</p><p> an extra bit mask is necessary to avoid propagating s by the missing arrows This extra bit</p><p> mask is implemented at no cost by removing the corresp onding s from the B mask during the</p><p> prepro cessing Note that for this to work we need that all the pieces have the same length</p><p> a br a ca d a b r a</p><p>ε</p><p>Figure An automaton recognizing reverse prexes of selected pieces of abracadabra</p><p> Searching for Simple Patterns</p><p>Nrgrep directly uses the BNDM algorithm when it searches for simple patterns However some</p><p> mo dications are necessary to convert the pattern matching algorithm into a search software</p><p> Record Oriented Output</p><p>The rst issue to consider is what will we rep ort from the results of the search Printing the text</p><p> p ositions i where a match o ccurs is normally of little help for the user Printing the text p ortion </p><p> that matched ie the o ccurrence do es not help much either b ecause this is equal to the pattern</p><p>at least if no classes of characters are used We have followed agreps philosophy the most useful</p><p> way to present the result is to print a context of the text p ortion that matched the pattern</p><p>This context is dened as follows The text is considered to b e a sequence of records A user</p><p> dened record delimiter determines the text p ositions where a new record starts The text areas</p><p> until the rst record separator and after the last record separators are considered records as well</p><p>When a pattern is found in the text the whole record where the o ccurrence lies is printed If</p><p> the o ccurrence overlaps with or contains a record delimiter then it is not considered a pattern</p><p> o ccurrence</p><p>The record delimiter is by default the newline character but the user can sp ecify any other</p><p> simple pattern as a record delimiter For example the string From can b e used to delimit</p><p> email messages in a mail archive therefore b eing able to retrieve complete emails that contain a</p><p> given string The system p ermits to sp ecify whether the record delimiter should b e contained in</p><p> the next or in the previous record Moreover nrgrep p ermits to sp ecify an extra record separator</p><p> when the records are printed eg add another newline</p><p>It should b e clear by now that searching for longer strings is faster than for shorter ones Since</p><p> record delimiters tend to b e short strings it is not a go o d idea to delimit the text records rst and</p><p> then search for the pattern inside each record Rather we prefer to search for the pattern in the</p><p> text without any consideration for record separators and when the pattern is found search for the</p><p> next and previous record delimiters At this time we may determine that the o ccurrence overlaps</p><p> a record delimiter and discard it Note that we have to b e able to search for record delimiters</p><p> forward and backward We use the same BNDM algorithm to search for record delimiters</p><p>There are some nrgrep options however that make it necessary a recordwise traversal printing</p><p> record numb ers or printing records that do not contain the pattern In this case we advance in the</p><p>le by delimiting the records rst and then searching for the pattern inside each record</p><p> Text Buering</p><p>Text buering is necessary to cop e with large les and to achieve optimum p erformance For</p><p> example in nrgrep the buer size is set to Kb b ecause it ts well the cache size of many</p><p> machines but this default can b e overriden by the user To avoid complex interactions b etween</p><p> record limits and buer limits we discard the last incomplete record each time we read a new</p><p> buer from disk The discarded partial record is moved to the b eginning of the buer b efore</p><p> reading more text at the next buer loading If a record is larger than the buer size then an</p><p> articial record delimiter is inserted to correct the situation and a warning message is printed</p><p>Note that this also requires the ability to search for the record delimiter in backward direction</p><p>This technique works well unless the record size is large compared to the buer size in which case</p><p> the user should enlarge the buer size using the appropriate option</p><p> Contexts</p><p>Another capability of nrgrep is context sp ecication This means that the pattern o ccurrence has</p><p> to b e surrounded by certain characters in order to b e considered as such For example one may</p><p> sp ecify that the pattern should match as a whole word ie surrounded by separators or a whole </p><p> record ie surrounded by record delimiters However it is not just a matter of adding the</p><p> context strings at the ends of the pattern b ecause for example a word may b e in the b eginning of</p><p> a record and hence the separator may b e absent We solve this by checking each o ccurrence found</p><p> to determine that the required string is present b eforeafter the o ccurrence or that we reached the</p><p> b eginningend of the record</p><p>This seems trivial for a simple pattern b ecause its length is xed but for more complex patterns</p><p>such as regular expressions where there may b e many dierent o ccurrences in the same text area</p><p> we need a way to discard p ossible o ccurrences and still check for other ones that are basically in</p><p> the same place For example the search for aba as a whole word should match in the text</p><p>aaa aabaa aaa Despite that b alone is an o ccurrence of the pattern that do es not t the</p><p> whole word criterion the o ccurrence can b e extended to another one that do es We return to this</p><p> issue later</p><p> Subpattern Filter</p><p>The BNDM algorithm is designed for the case m w Otherwise we have in principle to simulate</p><p> the algorithm using many computer words However as shown in it is much faster to prune</p><p> the pattern to its rst w characters search for that subpattern and try to extend its o ccurrences to</p><p> an o ccurrence of the complete pattern This is b ecause for reasonably large w the probability of</p><p>nding a pattern of length w is low enough to make the cost of unnecessary verications negligible</p><p>On the other hand the b enet of a p ossible shift of length m w would b e cancelled by the need</p><p> to up date dmw e computer words p er text character read</p><p>Hence we select a contiguous subpattern of w characters or classes rememb er that a class</p><p> needs also one bit the same as a character and search for it Its o ccurrences are veried with the</p><p> complete pattern prior to checking records and contexts</p><p>The main p oint is which part of the pattern to search for In the abstract algorithms of </p><p> any part is equally go o d or bad b ecause a uniformly distributed mo del is assumed In practice</p><p> dierent characters have dierent probabilities and some pattern p ositions may have classes of</p><p> characters whose probability is the sum of those of the individual characters This in fact is</p><p> farther reaching than the problem of the limit w in the length of the pattern even in a short</p><p> pattern we may prefer not to include a part of the pattern in the fast scanning part This is</p><p> discussed in detail in the next subsection</p><p> Selecting the Optimal Scanning Subpattern</p><p>Let us consider the pattern helloa Pattern p ositions to match all the alphab et which</p><p> means that the search with the nondeterministic automaton inside the window will examine at least</p><p> four window p ositions even in a text window like xxxxxxxxx and will shift at most by so the</p><p> average numb er of comparisons p er character is at the very b est If we take the subpattern</p><p>hello then we can have an average much closer to op erations p er text character</p><p>We have designed a general algorithm that under the assumption that the text characters are</p><p>3</p><p> indep endent nds the b est search subpattern in O m worst case time although in practice it</p><p>2</p><p> is closer to O m log m This is a mo dest overhead in most practical text search scenarios The</p><p> algorithm is tailored to the BDMBNDM search technique and works as follows </p><p>First we build an array pr ob m which stores the sum of the probabilities of the characters</p><p> participating in the class of each pattern p osition N r g r ep stores an array of English letter prob</p><p> abilities but this can b e tailored to other purp oses and the nal scheme is robust with resp ect to</p><p> changes in those probabilities from one language to another The construction of pr ob takes O m </p><p> time in the worst case</p><p>Second we build an array ppr ob m m where ppr obi stores the probability of</p><p>2</p><p> matching the subpattern P This is computed in O m time by dynamic programming as</p><p> ii+1</p><p> follows</p><p> ppr obi ppr obi pr obi ppr obi </p><p> for increasing values</p><p>Third we build an array mpr ob m m m where mpr obi j gives the probability</p><p>3</p><p> of matching any pattern substring of length in P This is computed in O m time by</p><p> ij 1</p><p> dynamic programming using the formulas</p><p> mpr obi j </p><p> mpr obi i </p><p> j i mpr obi j ppr obi mpr obi j </p><p> for decreasing i values Note that we have used the formula for the union of indep endent events</p><p>P r A B P r A P r B </p><p>Finally the average cost p er character asso ciated to a subpattern P is computed with the</p><p> ij</p><p> following rationale With probability we insp ect one window character If any pattern p osition</p><p> in the range i j matches the window character read then we read a second character recall</p><p>Section If any consecutive pair of pattern p ositions in i j matches the two window characters</p><p> read then we read a third character and so on This is what we have computed in mpr ob The</p><p> exp ected numb er of window characters to read is therefore</p><p> pcosti j mpr obi j mpr obi j mpr obi j j i </p><p>In the BDMBNDM algorithm as so on as the window sux read ceases to b e found in the</p><p> pattern we shift the window to the p osition following the character that caused the mismatch A</p><p> simplied computation considers that the ab ove pcosti j which is an average can b e used as a</p><p>xed value and therefore we approximate the real average numb er of op erations p er text character</p><p> as</p><p> pcosti j </p><p> opsi j </p><p> j i pcosti j </p><p> which is the average numb er of characters insp ected divided by the shift obtained Once this is</p><p> obtained we select the i j pair that minimizes the work to do We also avoid considering cases</p><p>3</p><p> where j i w The total time needed to obtain this has b een O m </p><p>The amount of work is reduced by noting that we can check the ranges in increasing order of</p><p> i values and therefore we do not need the rst co ordinate of mpr ob which can b e indep endently</p><p> computed for each i Moreover we start by considering maximum j since in practice longer a b c d e f g h</p><p>ε ε ε</p><p>Figure A nondeterministic automaton accepting the pattern abcdefgh</p><p> subpatterns tend to b e b etter than shorter ones We keep the b est value found up to now and</p><p> avoid considering ranges i j which cannot b e b etter than the current b est solution even for</p><p> pcosti j note that since i is tried in ascending order and j in descending order the whole</p><p>2</p><p> pattern is tried rst This reduces the cost to O m log m in practice</p><p>As a result of this pro cedure we not only obtain the b est subpattern to search for under a</p><p> simplied cost mo del but also a hint of how many op erations p er character will we p erform If</p><p> this numb er is larger than then it is faster and safer to switch to plain ShiftOr This is precisely</p><p> what nrgrep do es</p><p> Searching for Extended Patterns</p><p>As explained in Section we have considered optional and rep eatable classes of characters as</p><p> the features allowed in our extended patterns Each of these features is treated in a dierent way</p><p> and all are integrated in an automaton which is more general than that of Figure Over this</p><p> automaton we later apply the general forward and backward search machinery</p><p> Optional Characters</p><p>Let us consider the pattern abcdefgh A nondeterministic automaton accepting that pattern</p><p> is drawn in Figure </p><p>The gure is chosen so as to show that multiple consecutive optional characters could exist</p><p>This outrules the simplest solution which works when that do es not happ en one could set up a</p><p> bit mask O with ones in the optional p ositions in our example O and let the ones</p><p> in previous states of D propagate to them Hence after the normal up date to D we could p erform</p><p>D D j D O </p><p>For example this works if we have read abcdef D and the next text character is</p><p>h since the ab ove op eration would convert D to b efore op erating it against B h </p><p> However it do es not work if the text is abefgh where b oth consecutive optional</p><p> characters have b een omitted</p><p>A general solution needs to propagate each active state in D so as to o o d all the states ahead</p><p> it that corresp ond to optional characters In our example we would like that when D is </p><p>and in general whenever its second bit is active it b ecomes after the o o ding</p><p>This is achieved with three masks A I and F marking dierent asp ects of the states related to</p><p> optional characters More sp ecically the ith bit of A is set if this p osition in P is optional that</p><p> of I is set if this is the p osition of the rst optional character of a blo ck of consecutive optional</p><p> characters and that of F is set if this is the p osition after the last optional character of a blo ck c f</p><p> a b c d e f g h</p><p>ε</p><p>Figure A nondeterministic automaton accepting the pattern abcdefgh</p><p>In our example A I and F After p erforming the normal</p><p> transition on D we do as follows</p><p>Df D j F</p><p>D D j A Df I b Df </p><p> whose rationale is as follows The rst line adds a at the p ositions following optional blo cks in</p><p>D In the second line we add some active states to D Since the states to add are anded with</p><p>A let us just consider what happ ens inside a sp ecic optional blo ck The eect that we want is</p><p> that the rst counting from the right o o ds all the blo ck bits to the left of it We subtract I</p><p> from Df which is equivalent to subtracting at each blo ck This subtraction cannot propagate</p><p> its eect outside the blo ck b ecause there is a coming from jF in Df after the highest bit</p><p> of the blo ck The eect of the subtraction is that all the bits until the rst counting from</p><p> the right are reversed eg and the rest are unchanged In general</p><p> z z</p><p> b b b b b b When this is reversed by the op eration we get</p><p> x x1 xy x x1 xy</p><p> z z</p><p> b b b Finally when this is xored with the same Df b b b we</p><p> x x1 xy x x1 xy</p><p> xy +1 z +1</p><p> get </p><p>This is precisely the eect we wanted the last o o ded all the bits to the left That itself</p><p> has b een converted to zero however but it is restored when the result is or ed with the original D </p><p>This works even if the last active state in the optional blo ck is the leftmost bit of the blo ck Note</p><p> that it is necessary to and with A at the end to lter out the bits of F that survive the pro cess</p><p> whenever the blo ck is not all zeros On the other hand it is necessary to or Df with F b ecause a</p><p> blo ck of all zeros would propagate the op eration outside its limits</p><p>Note that we could have a b order problem if there are optional characters at the b eginning of</p><p> the pattern As seen later however this cannot happ en when we select the b est subpattern for</p><p> fast scanning but it has to b e dealt with when verifying the whole pattern</p><p> Rep eatable Characters</p><p>There are two kinds of rep eatable characters marked in the syntax by zero or more rep etitions</p><p> and one or more rep etitions Each of them can b e simulated using the other since a aa</p><p> and a a For involved technical reasons that are related for example to the ability to build</p><p> easily the masks for the reversed patterns and b order conditions for the verication we preferred</p><p> the second choice despite that it uses one more bit than necessary for the op eration Figure </p><p> shows the automaton for abcdefgh</p><p>The bitparallel simulation of this automaton is more straightforward than for the optional</p><p> characters We just need to have a mask S c that for each character c tells which pattern p ositions </p><p> can remain active when we read character c In the ab ove example S c and S f </p><p> The transition is handled with the mechanism for optional characters Therefore a</p><p> complete simulation step p ermitting optional and rep eatable characters is as follows</p><p> m1</p><p>D D j B t j D S t </p><p> j j</p><p>Df D j F</p><p>D D j A Df I b Df </p><p> Forward and Backward Search</p><p>Our aim is to extend the approach used for simple patterns to patterns containing optional symb ols</p><p>Forward scanning is immediate once we learn how to simulate the dierent automata using bit</p><p> parallelism We just have to add an initial selflo op to enable text scanning this is already done</p><p> in the last formula We detect the nal p ositions of o ccurrences and then check the surrounding</p><p> record and context conditions</p><p>Backward searching needs in principle just to obtain an automaton that recognizes reverse</p><p> factors of the pattern This is obtained by building exactly the same automaton of the forward</p><p> scan without initial selflo op on the reverse pattern and letting all the states b e initial ie</p><p> initializing D with all active states However there are some problems to deal with all of them</p><p> deriving from the fact that the o ccurrences have variable length now</p><p>Since the o ccurrences do not have a xed length we have to compute the minimum length</p><p> of a p ossible match of the pattern eg in the example abcdefgh and use this value as</p><p> the width of the search window in order not to lose any p otential o ccurrence As b efore we set</p><p> up an automaton that recognizes all the reverse factors of the automaton and use it to traverse</p><p> the window backward Because the o ccurrences are not all of the same length the fact that we</p><p> arrive to the b eginning of the window do es not immediately imply that the pattern is present For</p><p> example a length text window could b e cdefffg Despite that this is a factor of the pattern</p><p> and therefore we would reach the window b eginning no pattern o ccurrence starts in the b eginning</p><p> of the window</p><p>Therefore each time we arrive to the b eginning of the window we have to check that the initial</p><p> state is active and then run a forward verication from the window b eginning on until either we</p><p>nd a match ie the last automaton state is activated or we determine that no match can start</p><p> at the window p osition under consideration ie the automaton runs out of active states The</p><p> automaton used for this forward verication is the same as for forward scanning except that the</p><p> initial selflo op is absent However as we see next verication is in fact a little more complicated</p><p> and we mix it with the rest of verications that are needed on every o ccurrence surrounding record</p><p> context etc</p><p> Verifying Occurrences</p><p>In fact the verication is a little dierent since as we see in the next subsection we select a subpat</p><p> tern for the scanning as b efore this is necessary if the pattern has more than w charactersclasses</p><p> but can also b e convenient on shorter patterns Say that P P SP where S is the subpattern</p><p>1 2</p><p> that has b een selected for fast text scanning Each time the backward search determines that a </p><p> given text p osition is a p otential start p oint for an o ccurrence of S we obtain the surrounding</p><p> record and check from the candidate text p osition the o ccurrence of SP in forward direction</p><p>2</p><p> and P in backward direction do not confuse forwardbackward scanning with forwardbackward</p><p>1</p><p> verication</p><p>When we had a simple pattern this just needed to check that each text character b elonged</p><p> to the corresp onding pattern class Since the o ccurrences have variable length we need to use</p><p> prebuilt automata for SP and for the reversed version of P These automata do not have the</p><p>2 1</p><p> initial selflo op However this time we need to use a multiword bitparallel simulation since the</p><p> patterns could b e longer than the computer word</p><p>Note also that under this scenario the forward scanning also needs verication In this case</p><p> we nd a guaranteed end p osition of S in the text not a candidate one as for backward searching</p><p>Hence we check from that nal p osition P in forward direction and P S in backward direction</p><p>2 1</p><p>Note that we need to check S again b ecause the automaton cannot tell where is the b eginning of</p><p>S and there could b e various b eginning p ositions</p><p>A nal complication is intro duced by record limits and context conditions Since we require</p><p> that a valid o ccurrence lies totally inside a record we should submit for verication the smallest</p><p> p ossible o ccurrence For example the pattern babcde has many o ccurrences in the text</p><p> record bbbcdee and we should rep ort bcd in order to guarantee that no valid o ccurrence will</p><p> b e missed However it is p ossible that the context conditions require the presence of certain strings</p><p> immediately preceding or following the o ccurrence For example if the context condition tells that</p><p> the o ccurrence should b egin a record then the minimal o ccurrence bcd would not qualify while</p><p>bbbcde would do</p><p>Fortunately context conditions ab out the initial and nal p ositions of o ccurrences are indep en</p><p> dent and hence we can check them separately So we treat the forward and backward parts of the</p><p> verication separately For each one we traverse the text and nd all the p ositions that represent</p><p> the b eginning or ending of o ccurrences and submit them to the context checking mechanism until</p><p> one is accepted the record ends or the automaton runs out of active states</p><p> Selecting a Go o d Search Subpattern</p><p>As b efore we would like to select the b est subpattern for text scanning since we have anyway</p><p> to check the p otential o ccurrences We want to apply an algorithm similar to that for simple</p><p> patterns However this time the problem is more complicated b ecause there are more arrows in</p><p> the automaton</p><p>We compute pr ob as b efore adding another array spr ob m which is the probability of</p><p> staying at state i via the S c array The ma jor complication is that a factor of length starting</p><p> at pattern p osition i do es not necessarily nish at pattern p osition i Therefore we ll</p><p> an array ppr ob m m L where L is the minimum length of an o ccurrence of P at</p><p> most m and ppr obi j is the sum of the probabilities of all the factors of length that start at</p><p> p osition i and do not reach after pattern p osition j In our example abcdefgh ppr ob </p><p> should account for ccccc ccccd cccde ccdef and cdeff </p><p>This is computed as</p><p> ppr obi i </p><p>i j ppr obi j </p><p>i j ppr obi j pr obi ppr obi j </p><p> spr obi ppr obi j </p><p> if ith bit of A is set ppr obi j </p><p>3</p><p> which is lled for decreasing i and increasing in O m time Note that we are simplifying the</p><p> computation of probabilities since we are computing the probability of a set of factors as the sum</p><p> of the individual probabilities which is only true if the factors are disjoint</p><p>Similarly to ppr obi j we have mpr obi j as the probability of any factor of length starting</p><p> at p osition i or later and not surpassing p osition j This is trivially computed from ppr ob as for</p><p> simple patterns</p><p>Finally we compute the average cost p er character as b efore We consider subpatterns from</p><p> length minm w until a length that is so short that we will always prefer the b est solution found</p><p> up to now For each subpattern considered we compute the exp ected cost p er window as the sum</p><p> of the probabilities of the subpatterns of each length ie</p><p> pcosti j mpr obi j mpr obi j mpr obi j j i</p><p> and later obtain the cost p er character as pcosti j pcosti j where is the minimum</p><p> length of an o currence of the interval i j </p><p>As for simple patterns we ll ppr ob and mpr ob in lazy form together with the computation of</p><p>2</p><p> the b est factor This makes the exp ected cost of the algorithm closer to O m log m than to the</p><p>3 2</p><p> worst case O m really O m minm w </p><p>Note that it is not p ossible b ecause it is not optimal to select a subpattern that starts or</p><p> ends with optional characters or with If it happ ens that the b est subpattern gives one or</p><p> more op erations p er text character we switch to forward searching and select the rst minm w </p><p> characters of the pattern excluding initial and nal s or s</p><p>In particular note that it is p ossible that the scanning subpattern selected has not any optional</p><p> or rep eatable character In this case we use the scanning algorithm for simple patterns despite</p><p> that at verication time we use the checking algorithm of extended patterns</p><p> Searching for Regular Expressions</p><p>The most complex patterns that nrgrep can search for are regular expressions For this sake we use</p><p> the technique explained in Section b oth for forward and backward searching However</p><p> some asp ects need to b e dealt with in a real software</p><p> Space Problems</p><p> m</p><p>The rst problem is space since the T table needs O entries and this can b e unmanageable for</p><p> long patterns Despite that we exp ect that the patterns are not very long in typical text searching</p><p> some reasonable solution has to b e provided when this is not the case </p><p>We p ermit the user to sp ecify the amount of memory that can b e used for the table and split</p><p> the bitmasks in as many parts as needed to meet the space requirements Since we do not search</p><p> for masks longer than w bits it is in fact unlikely that the text scanning part needs more than </p><p> or table accesses p er text character Attending to the most common alternatives we develop ed</p><p> separate co de for the cases of and tables which p ermits much faster scanning since register</p><p> usage is enabled</p><p> Subpattern Filtering</p><p>As for simple and extended patterns we select the b est subpattern for the scanning phase and</p><p> check all the p otential o ccurrences for complete o ccurrences and for record and context conditions</p><p>This means according to Section that we need a forward and a backward automaton for the</p><p> selected subpattern and that we also need forward and backward verication automata to check</p><p> for the complete o ccurrence An exception is when given the result of selecting the b est factor</p><p> we prefer to use forward scanning in which case only that automaton is needed but the two</p><p> verication automata are still necessary All the complications addressed for extended patterns</p><p> are present on regular expressions as well namely those derived from the fact that the o ccurrences</p><p> may have dierent lengths</p><p>It may also happ en that after selecting the search subpattern it turns out to b e just a simple</p><p> or an extended pattern in which case the search is handled by the appropriate search algorithm</p><p> which should b e faster The verication is handled as a general regular expression Note that</p><p> heuristics like those of Gnu Grep which tries to nd a literal string inside the regular expression</p><p> in order to use it as a lter are no more than particular cases of our general optimization metho d</p><p>This makes our approach much smo other than others For example Gnu Grep will b e much faster</p><p> to search for caaaaabbbbbc where the strings caaaaac andor cbbbbbc can b e used</p><p> to lter the search than for cabababababc while the dierence should not b e as</p><p> large</p><p>Regarding optimal use of space and also b ecause accessing smaller tables is faster we use a</p><p> state remapping function When a subset of the states of the automaton is selected for scanning</p><p> we build a new automaton with only the necessary states This reduces the size of the tables</p><p>What is left is to explain how we select the b est factor to search for This is much more complex</p><p> on regular expressions than on extended patterns</p><p> Selecting an Optimal Necessary Factor</p><p>The rst nontrivial task on a regular expression is to determine what is a necessary factor which</p><p> is dened as a sub expression that has to match inside every o ccurrence of the whole expression</p><p>For example fgh is not a necessary factor of abcdefghijk but cdfgh is Note that</p><p> any range of states is a necessary factor in a simple or extended pattern</p><p>Determining necessary sub expressions is complicated if we try to do it using just the automaton</p><p> graph We rather make use of the syntax tree of the regular expression as well The main trouble</p><p> is caused by the j op erator which forces us to cho ose one necessary factor from each branch</p><p>We simplify the problem by xing the minimum o ccurrence length and searching for a necessary</p><p> factor of that length on each side In our previous example we could select cdfg or dehi </p><p> for example but not cdfgh However we are able to take the whole construction namely</p><p>cdefghi</p><p>The pro cedure is recursive and aims at nding the b est necessary factor of minimum length </p><p> provided we already know we consider later the computation of these data</p><p> w l ens m where w l ensi is the minimum length of a path from i to a nal state</p><p> mar k mar k f m L where mar k i is the bitmask of all the states reachable in </p><p> steps from state i and mar k f sets the corresp onding nal states</p><p> cost m L where costi is the average cost p er character if we search for all the</p><p> paths of length leaving from state i</p><p>The metho d starts by analyzing the ro ot op erator of the syntax tree Dep ending on the typ e</p><p> of op erand we do as follows</p><p>Concatenation cho ose the b est factor minimum cost among b oth sides of the expression Note</p><p> that factors starting in the left side can continue to the right side but we are deciding ab out</p><p> where the initial p ositions are</p><p>Union cho ose the b est factor from each side and take the union of the states The average cost is</p><p> the maximum over the two sides not the sum since the cost is related to the average numb er</p><p> of characters to insp ect</p><p>One or more rep etition the minimum is one rep etition so ignore the no de and treat the</p><p> subtree</p><p>Zero or more rep etitions the subtree cannot contain a necessary factor since it do es not</p><p> need to app ear in the o ccurrences Nothing can b e found starting inside it We assign a high</p><p> cost to the factors starting inside the sub expression to make sure that it is not chosen in a</p><p> concatenation and that a union containing it will b e equally undesirable</p><p>Simple character or class tree leaf there is only one p ossible initial p osition so cho ose it</p><p> unless its w l ens value is smaller than </p><p>The pro cedure delivers a set of selected states and the prop er initial and nal states for the se</p><p> lected subautomaton It also delivers a reasonable approximation of the average cost p er character</p><p>The rest of the work is to obtain the input for this pro cedure The ideas are similar as for ex</p><p> tended patterns but the techniques need to make heavier use of graph traversal and are more exp en</p><p> sive For example just computing w l ens and L w l ensinitial state ie shortest paths from any</p><p>3 3 4</p><p> state to a nal state we need a Floydlike algorithm that takes O Lm w O minm m w </p><p> time</p><p>Arrow probabilities pr ob are loaded as b efore but ppr obi now gives the total probability of</p><p> all the paths of length leaving state i and it includes the probability of reaching i from a previous </p><p> state In Glushkovs construction all the arrows reaching a state have the same lab el so ppr ob is</p><p> computed for increasing using the formulas</p><p> ppr obi </p><p> ppr obi pr obi</p><p>X</p><p> ppr obi pr obi ppr obj </p><p> j(ij )NFA</p><p>2 3 2</p><p>This takes O Lm O minm m w time At the same time we compute the mar k mar k f</p><p>3 4</p><p> masks at a total cost of O minm m w </p><p> m+1</p><p> mar k i mark f i </p><p> mi i1</p><p> mar k i mark f i </p><p></p><p> mar k j mar k i mar k i </p><p> j(ij )NFA</p><p></p><p> mar k f i mar k f j </p><p> j(ij )NFA</p><p> for increasing values Once ppr ob is computed we build mpr ob m L L so that</p><p> </p><p> mpr obi is the probability of any path of length inside the area mar k i This is computed</p><p>2 2 4 2 2</p><p> in O L m O minm m w time with the formulas</p><p></p><p></p><p> mpr obi </p><p> mpr obi </p><p>Y</p><p> </p><p> </p><p></p><p> mpr obj mpr obi ppr obi </p><p> j(ij )NFA</p><p>2 3 2</p><p>Finally the search cost is computed as always using mpr ob in O L m O minm mw </p><p> time</p><p>Again it is p ossible that the scanning subpattern selected is in fact a simpler typ e of pattern</p><p> such as a simple or extended one In this case we use the appropriate scanning pro cedure which</p><p> should b e faster</p><p>Another nontrivial p ossibility is that even the b est necessary factor is to o bad ie it has a</p><p> high predicted cost p er character In this case we select from the initial state of the automaton a</p><p> subset of it that ts in the computer word ie at most w states intended for forward scanning</p><p>Since all the o ccurrences found with this automaton will have to b e checked we would like to</p><p> select the subautomaton that nds the least p ossible spurious matches If m w then we use</p><p> the whole automaton otherwise we try to minimize the probability of getting outside the set of</p><p> selected states To compute this we consider that the probability of reaching a state is inversely</p><p> prop ortional to its shortest path from the initial state Hence we add states farther and farther</p><p>2</p><p> from the ro ot until we have w states All this takes O m time</p><p>This probabilistic assumption is of course simplistic but an optimal solution is quite hard and</p><p> at this p oint we know that the search will b e costly anyway </p><p> Verication</p><p>A nal nontrivial problem is how to determine which states should b e present in the forward and</p><p> backward verication automata once the b est scanning subautomaton is selected Since we have</p><p> chosen a necessary factor we know for sure that the automaton is cut in two parts by the</p><p> necessary factor If we have chosen a backward scanning then the verications start from the</p><p> initial states of the scanning automaton otherwise they start from its nal states</p><p>Figure illustrates these automata for the pattern abcddefde where we have se</p><p> lected ddef as our scanning sub expression This corresp onds to the states f g</p><p> of the original automaton The selected arrows and states are in b oldface in the gure note that ar</p><p> rows leaving from the selected subautomaton are not selected Dep ending on whether the selected</p><p> subautomaton is searched with backward or forward scanning we know the text p osition where</p><p> its initial or nal states resp ectively were reached Hence in backward scanning the verication</p><p> starts from the initial states of the subautomaton while it starts from the nal states in case of</p><p> forward scanning We need two full automata the original one and one with the arrows reversed Backward scanning Backward verification automaton (reverse arrows) Forward verification automaton e d</p><p> a b c d d e d e 0 1 2 3 4 5 6 7 8 9</p><p> f f</p><p>Forward scanning Forward verification automaton Backward verification automaton e (reverse arrows) d a b c dd e d e 0 1 2 3 4 5 6 7 8 9</p><p> f</p><p> f</p><p>Figure Forward and backward verication automata corresp onding to forward and backward</p><p> scanning of the automaton for abcddefde The shaded states are the initial ones for</p><p> verication In practice we use the complete automaton b ecause cuttingo the exact verication</p><p> subautomaton is to o complicated</p><p>There is however a complication when verifying a regular expression in this scheme Assume</p><p> the search pattern is AX B jCYD for strings A B C D X and Y The algorithm could select</p><p>X jY as the scanning subpattern Now in the text AX D the backward verication for AjC would</p><p>nd A and the forward verication for B jD would nd D and therefore an o ccurrence would b e </p><p> incorrectly triggered Worse than that it could b e the case that X Y so there is no hop e in</p><p> distinguishing what to check based on the initial states of the scanning automaton</p><p>The problem which do es not app ear in simpler typ es of patterns is that there is a set of initial</p><p> states for the verication The backward and forward verications tell us that some state of the</p><p> set can b e extended to a complete o ccurrence but there is no guarantee that there is a single state</p><p> in that set that can b e extended in b oth directions To overcome this problem we check the initial</p><p> states one by one instead of using a set of initial states and doing just one verication</p><p> Approximate Pattern Matching</p><p>All the previous algorithms p ermit sp ecifying a exible search pattern but they do not allow any</p><p> dierence b etween the pattern sp ecied and its o ccurrence in the text We now consider the problem</p><p> of p ermitting at most k insertions deletions replacements or transp ositions in the pattern</p><p>We let the user to sp ecify that only a subset of the four allowed errors are p ermitted However</p><p> designing one dierent search algorithm for each subset was impractical and against the spirit of</p><p> a uniform software So we have considered that our criterion of p ermitting these four typ es of</p><p> errors would b e the most commonly preferred in practice and have a unique scanning phase under</p><p> this mo del as all the previous algorithms we have a scanning and a verication phase Only at</p><p> the verication phase which is hop efully executed a few times we take care of only applying the</p><p> p ermitted op erations Note that we cannot miss any p otential match b ecause we scan with the</p><p> most p ermissive error mo del We also wrote in nrgrep sp ecialized co de for k and k which</p><p> are the most common cases and using a xed k p ermits b etter register usage</p><p> Simple Patterns</p><p>We make use of the three techniques describ ed in Section cho osing the one that promises to</p><p> b e the b est</p><p> Forward and Backward Searching</p><p>In forward searching k similar rows corresp onding to the pattern are used There exists a</p><p> bitparallel algorithm to simulate the automaton in case of k insertions deletions and replacements</p><p> but despite that the automaton that incorp orates transp ositions has b een depicted see</p><p>Figure no bit parallel formula for the complete op eration has b een shown We do that now</p><p>We store each row in a machine word R R just as for the base technique The temp orary</p><p>0 k</p><p> mi i</p><p> states are stored as T T The bitmasks R are initialized to as b efore while all the T</p><p>1 k i i</p><p> m </p><p> are initialized to The up date pro cedure to pro duce R and T up on reading text character t</p><p> j</p><p> is as follows</p><p> m1 </p><p> R j B t R</p><p>0 j</p><p>0</p><p> for i k do</p><p> </p><p>R R B t j R j R j R </p><p> i j i1 i1</p><p> i i1</p><p> j T B t </p><p> i j</p><p></p><p>T R B t </p><p> i1 j</p><p> i </p><p>The rationale of the pro cedure is as follows The rst three lines are as for the base technique</p><p></p><p> without transp ositions The second line of the formula for R corresp onds to transp ositions Note</p><p> i</p><p></p><p> that the new T value is computed accordingly to the old R value so it is text p ositions b ehind</p><p></p><p>Once we compute the new bitmasks R for text character t we take those old R masks that were</p><p> j</p><p> not up dated with t shift them in two p ositions aligning them for the p osition t and killing the</p><p> j j +1</p><p> states that do not match t This is equivalent to pro cessing two characters t At the next</p><p> j j</p><p> iteration t we shift left the mask B t and kill the states of T that do not match The net</p><p> j +1 j +1</p><p></p><p> eect is that at iteration j we are oring R with</p><p> i</p><p>T j B t R j B t B t </p><p> i j +1 i1 j j +1</p><p> R j B t B t </p><p> i1 j +1 j</p><p> which corresp onds to two forward transitions with the characters t t If those characters matched</p><p> j +1 j</p><p></p><p> the text then we p ermit the activation of R </p><p> i</p><p> m1</p><p>A match is detected as b efore when R is not zero Since we may cut w characters</p><p> k</p><p> from the pattern the context conditions and the p ossibility that the user really wanted to p ermit</p><p> only some typ es of errors we have to check each match found by the automaton b efore rep orting</p><p> it We detail later the verication pro cedure</p><p>Backward searching is adapted from this technique exactly as it is done from the basic algorithm</p><p> without transp ositions A subtle p oint is that we cannot consider the automaton dead until b oth</p><p> the R and the T masks run out of active states since T can awake a dead R</p><p>As b efore we select the b est subpattern to search for which is of length at most w The</p><p> algorithm to select the b est subpattern has to account for the fact that we are allowing errors A</p><p> go o d approximation is obtained by considering that any factor will b e alive for k turns and then</p><p> adding the normal exp ected numb er of window characters to read until the factor do es not match</p><p>So we add k to costi j in Eq Now it is more p ossible than b efore for large k m that the</p><p>nal cost p er character is greater than one in which case we prefer forward scanning</p><p> Splitting into k Subpatterns</p><p>This technique can b e directly adapted from previous work taking care of leaving a hole b etween</p><p> each pair of pieces For the checking of complete o ccurrences in candidate text areas we use the</p><p> general verication engine we have a dierent prepro cessing for the case of each of the k pieces</p><p> matching Note that it makes no sense to design a forward search algorithm for this case if the</p><p> average cost p er character is more than one this means that the probability of nding a subpattern</p><p> is so high that we will pay to o much time verifying spurious matches in which case the whole</p><p> metho d do es not work</p><p>The main diculty that remains is how to nd the b est set of k equal length pieces to search</p><p> for We start by computing pr ob and ppr ob as in Section The only dierence is that now we</p><p> are interested only in lengths up to L bm k k c which is the maximum length of a piece</p><p>2</p><p>the m k comes out b ecause there are k unused characters in the partition This cuts down the</p><p>2</p><p> cost to compute these vectors to O m k </p><p>2</p><p>The numerator is in fact converted into m if no transp ositions are p ermitted Despite that the scanning phase will</p><p> allow transp ositions anyway the p ossible matches missed by converting the numerator to m all include transp ositions</p><p> which by hyp othesis are not p ermitted </p><p>Now we compute pcost m L where pcosti gives the average cost p er window</p><p> when searching for the factor P For this sake we compute for each i value the matrix</p><p> ii+1</p><p> mpr ob L L where mpr ob r is the probability of any factor of length r in P </p><p> ii+1</p><p>2 2</p><p>This is computed for each i in O m k time as</p><p> r mpr ob r </p><p> mpr ob </p><p> r mpr ob r ppr obi r r mpr ob r </p><p>L</p><p>X</p><p> mpr ob r pcosti </p><p> r =0</p><p>3 2</p><p>All this is computed for every i for increasing in O m k total time The justication</p><p> for the previous formula is similar to that in Section Now the most interesting part is</p><p> mbest m k where mbesti s is the exp ected cost p er character of the b est s pat</p><p> tern pieces starting at pattern p osition i The strings selected have the same length Together with</p><p> mbest we have ibesti s which tells where must the rst string start in order to obtain mbest</p><p>The maximum length for the pieces is L We try all the lengths from L to until we determine</p><p> that even in the b est case we cannot improve our current b est cost For each p ossible piece length</p><p> we compute the whole mbest and ibest as follows</p><p> mbesti </p><p>i m s s s mbesti s </p><p>i m s s s mbesti s minmbesti s</p><p> cost mbesti s </p><p> where cost min pcosti pcosti </p><p> which is computed for decreasing i The rationale is that mbesti s can cho ose whether to start</p><p> the rst piece immediately or not The second case is easy since mbesti s is already computed</p><p> and ibesti s is made equal to ibesti s In the rst case we have that the cost of the rst</p><p> piece is cost and the other s pieces are chosen in the b est way from P according to</p><p> i++1m</p><p> mbesti s In this case ibesti s i Note that as the total cost to search for the k</p><p> pieces we could take the maximum cost but we obtained b etter results by using the mo del of the</p><p> probability of the union of indep endent events</p><p>2</p><p>All this costs in the worst case O m but normally the longest pieces are the b est and the</p><p> cost b ecomes O mk At the end we have the b est length for the pieces the exp ected cost p er</p><p> character in mbest k and the optimal initial p ositions for the pieces in i ibest k </p><p>0</p><p> i ibesti k i ibesti k and so on Finally if mbest k actually</p><p>1 0 2 1</p><p> larger than a smaller numb er we know that we reach the window b eginning with probability high</p><p> enough and therefore the whole scheme will not work well In this case we have to cho ose among</p><p> forward or backward searching</p><p>3 2 2 3 2</p><p>Summarizing the costs we pay O m k m in the worst case and O m k k m in practice</p><p>3</p><p>Normally k is small so the cost is close to O m in all cases </p><p> Verication</p><p>Finally we explain the verication pro cedure This is necessary b ecause of the context conditions</p><p> of the p ossibly restricted set of edit op erations and b ecause in some cases we are not sure that</p><p> there is actually a match Verication can b e called from the forward scanning in which case in</p><p> general we have partitioned the pattern in P P SP and know that at a given text p osition</p><p>1 2</p><p> a match of S ends from the backward scanning where we have partitioned the pattern in the</p><p> same way and know that at a given text p osition the match of a factor of S b egins or from the</p><p> partitioning into k pieces in which case we have partitioned P P S P S P and</p><p>0 1 1 k +1 k +1</p><p> know that at a given text p osition the exact o ccurrence of a given S b egins</p><p> i</p><p>In general all we know ab out the p ossible match of P around text p osition j is that if we</p><p> partition P P P a partition that we know then there should b e a match of P ending at T</p><p>L R L j</p><p> and a match of P starting at T and that the total numb er of errors should not exceed k In</p><p>R j +1</p><p> all the cases we have precomputed forward automata corresp onding to P and P the rst one is</p><p>L R</p><p> reversed b ecause the verication go es backward In forward and backward scanning there is just</p><p> one choice for P and P while for partitioning into k pieces there are k choices all of</p><p>L R</p><p> them are precomputed</p><p>We run the forward and backward verication from text character j in b oth directions Fortu</p><p> nately the context conditions that make a match valid or invalid can b e checked at each extreme</p><p> separately So we go backward and among all the p ositions where a legal match of P can b egin</p><p>L</p><p> we cho ose the one with minimum error level k Then we go forward lo oking for legal o ccurrences</p><p>L</p><p> of P with k k errors If we nd one then we rep ort an o ccurrence and the whole record is</p><p>R L</p><p> rep orted In fact we take advantage of each match found during the traversal if we are lo oking</p><p></p><p> the pattern with k errors and nd a legal endp oint with k k we still continue searching for</p><p></p><p> b etter o ccurrences but now allowing just k errors This saves time b ecause the verication</p><p></p><p> automaton needs just to use k rows</p><p>A complicated condition can arise b ecause of transp ositions the optimal solution may involve</p><p> transp osing T with T an op eration that we are not p ermitting b ecause we chose to split the</p><p> j j +1</p><p> verication there We solve this in a rather simple but eective way if we cannot nd an o ccurrence</p><p> in a candidate area we give it a second chance after transp osing b oth characters in the text</p><p> Extended Patterns</p><p>The treatment for extended patterns is quite a combination of the extension from simple to extended</p><p> patterns without errors and the use of k rows or the partition into k pieces in order to</p><p> p ermit k errors However there are a few complications that deserve mention</p><p>A rst one is that the propagation of active states due to optional and rep eatable characters</p><p> do es not mix well with transp ositions for example try to draw an automaton that nds abcde</p><p> with one transp osition in the text adbe We solved the problem in a rather practical way</p><p>Instead of trying to simulate a complex automaton we have two parallel automata The R masks</p><p> are used in the normal way without p ermitting transp ositions but p ermitting the other errors and</p><p> the optional and rep eatable characters and they are always one character b ehind the current text</p><p> p osition Each T mask is obtained from the R mask of the previous row by pro cessing the last two</p><p> text characters in reverse order and its result is used for the R mask of the same row in the next </p><p> iteration The co de to pro cess text p osition j is follows the initial values are as always</p><p> m1</p><p>R R j B t </p><p>0 j 1</p><p>0</p><p> for i k do</p><p></p><p>R R B t j R S t </p><p> i j 1 i j 1</p><p> i</p><p></p><p> j R j R j R j T</p><p> i1 i1 i</p><p> i1</p><p></p><p>Df R j F</p><p> i</p><p> </p><p>R R j A Df I b Df </p><p> i i</p><p> m1</p><p>T R j B t j R S t </p><p> i1 j i1 j</p><p> i</p><p></p><p>Df T j F</p><p> i</p><p> </p><p>T T j A Df I b Df </p><p> i i</p><p> m1 </p><p>T T j B t j T S t </p><p> j 1 j 1</p><p> i i i</p><p> where the A I and F masks are dened in Section </p><p>A second complication is that subpattern optimization changes For the forward and backward</p><p> automata we use the same technique for searching without errors but we add k to the numb er of</p><p> pro cessed window p ositions for any subpattern So it remains to b e explained how is the optimiza</p><p> tion for the partitioning into k subpatterns</p><p>We have pr ob and spr ob computed in O m time as for normal extended patterns A rst</p><p></p><p> condition is that the k pieces must b e disjoint so we compute r each m L where</p><p></p><p>L bL k k c is the maximum length of a piece as b efore L is the minimum length of</p><p> a pattern o ccurrence as in Section and r eachi gives the last pattern p osition reachable in </p><p>2</p><p> text characters from i This is computed in O m time as r eachi i and for increasing </p><p> t r eachi </p><p> mt t1 m</p><p> while t m A t t </p><p> r eachi t</p><p>3</p><p>We then compute ppr ob mpr ob and pcost exactly as in Section in O m time Finally the</p><p> selection of the b est set of k subpatterns is done with mbest and ibest just as in Section </p><p> the only dierence b eing that r each is used to determine which is the rst unused p osition if pattern</p><p> p osition i is chosen and a minimum piece length has to b e reserved for it The total pro cess takes</p><p>3</p><p>O m time</p><p> Regular Expressions</p><p>Finally nrgrep p ermits searching for regular expressions allowing errors The same mechanisms used</p><p> for simple and extended patterns are used namely using k replicas of the search automaton and</p><p> splitting the pattern into k disjoint pieces Both adaptations present imp ortant complications</p><p> with resp ect to their simpler counterparts</p><p> Forward and Backward Search</p><p>One problem in regular expressions is that the concept of forward do es not immediately mean</p><p> one shift to the right Approximate searching with k copies of the NFA of the regular expression</p><p> implies b eing able to move forward from row i to row i as Figure shows e d a b c d d e d e 0 errors f f e d</p><p> a b c d d e d e 1 error f f e d a b c d d e d e 2 errors f</p><p> f</p><p>Figure Glushkovs resulting NFAs for the search of the regular expression abcddefde</p><p> with two insertions deletions or replacements To simplify the plot the dashed lines represent</p><p> deletions and replacements ie they move by fg while the vertical lines represent insertions</p><p>ie they move by </p><p>3</p><p>For this sake we use our table T which for each state D gives the bit mask of all the states</p><p> reachable from D in one step The up date formula up on reading a new text character t is therefore</p><p> j</p><p></p><p>R T R B t </p><p>0 j</p><p>0</p><p> ol dR R</p><p>0 0</p><p> for i k do</p><p> </p><p>R T R B t j R j T R j R j T T ol dR B t B t </p><p> i j i1 i1 i1 j j 1</p><p> i i1</p><p> ol dR R</p><p> i1 i1</p><p></p><p>The rationale is as follows R is computed according to the simple formula for regular expres</p><p>0</p><p></p><p> sion searching without errors For i R p ermits arrows coming from matching the current</p><p> i</p><p> character T R B t vertical arrows representing insertions in the pattern R and di</p><p> i j i1</p><p></p><p> agonal arrows representing replacements from R and deletions from R which are joined</p><p> i1</p><p> i1</p><p></p><p> in T R jR Finally transp ositions are arranged in a rather simple way In ol dR we store the</p><p> i1</p><p> i1</p><p> value of R two p ositions in the past note the way we up date it to avoid having two arrays ol dR</p><p> and ol dol dR and each time we p ermit from that state the pro cessing of the last two text character</p><p> in reverse order</p><p>Forward scanning backward scanning and the verication of o ccurrences are carried out in</p><p> the normal way using this up date technique The technique to select the b est necessary factor is</p><p> unaltered except that we add k to the numb er of characters that are scanned inside every text</p><p> window</p><p>3</p><p>Recall that in the deterministic simulation each bit mask of active NFA states D is identied as a state </p><p> Splitting into k Sub expressions</p><p>If we can select k necessary factors of the regular expression as done in Section for selecting</p><p> one necessary factor then we can b e sure that at least one of them will app ear unaltered in any</p><p> o ccurrence with k errors or less As b efore in order to include the transp ositions we need to ensure</p><p> that one character is left b etween consecutive necessary factors</p><p>We rst consider how once the sub expressions have b een selected can we p erform the multi</p><p> pattern search Each sub expression has a set of initial and nal states We reverse all the arrows</p><p> and convert the formerly initial states of all the sub expressions into nal states Since we search</p><p> for any reverse factor of the regular expression all the states are made initial</p><p>We set the window length to the shortest path from an initial to a nal state At each window</p><p> p osition we read the text characters backward and feed the transformed automaton Each time</p><p> we arrive to a nal state we know that the prex of a necessary sub expression has app eared If we</p><p> happ en to b e at the b eginning of the window then we check for the whole pattern as b efore We</p><p> keep masks with the nal states of each necessary factor in order to determine from the current</p><p> mask of active states which of them matched recall that a dierent verication is triggered for</p><p> each</p><p>Figure illustrates a p ossible selection of two necessary factors corresp onding to k for</p><p> our running example abcddefde These are abc and defde where the rst</p><p>d has b een left to separate b oth necessary factors Their minimum length is so this is the</p><p> window length to use</p><p>ε</p><p> e d</p><p> a b c d e d e</p><p> f</p><p> f</p><p>Figure An automaton recognizing reverse prexes of two necessary factors of</p><p>abcddefde based on the Glushkov construction of Figure Note that there are no</p><p> transitions among sub expressions</p><p>The dicult part is how to select k necessary and disjoint factors from a regular expression</p><p>Disjoint means that the subautomata do not share states this ensures that there is a character</p><p> separating them which is necessary for transp ositions Moreover we want the b est set and we</p><p> want that all them have the same minimum path length from an initial to a nal state</p><p>We b elieve that an algorithm nding the optimal choice has a time cost which grows exp o</p><p> nentially with k We have therefore made the following simplication Each no de s is assigned a</p><p> numb er I which corresp onds to the length of the shortest path reaching it from the initial state</p><p> s</p><p>We do not p ermit picking arbitrary necessary factors but only those subautomata formed by all</p><p> the states s that have numb ers i I i for some i and This is the minimum window</p><p> s</p><p> length to search using that subautomata and should b e the same for all the necessary factors</p><p> chosen Moreover every arrow leaving out of the chosen subautomaton is dropp ed Finally note </p><p> that if the i values corresp onding to any pair of subautomata chosen dier by more than then we</p><p> are sure that they are disjoint</p><p>Figure illustrates this numb ering for our running example The partition obtained in Fig</p><p> ure corresp onds to cho osing the ranges and as the sets of states this corresp onds to</p><p> the sets f g and f g Of course the metho d do es not p ermit us to cho ose all the</p><p> legal sets of subautomata In this example the necessary subautomaton f g cannot b e picked</p><p> b ecause it includes some but not all states numb ered </p><p> e d</p><p> a b c d d e d e 0 1 2 3 4 5 6 7 8 9</p><p> f f</p><p>I_s 0 1 2 3 4 5 5 5 6 7</p><p>Figure Numb ering Glushkovs NFA states for the regular expression abcddefde</p><p>Numb ering the states is easily done by a closure pro cess starting from the initial state in</p><p>2</p><p>O Lm w time where L is the minimum length of a string matching the whole regular expression</p><p></p><p>As b efore L bL k k c is the maximum p ossible length of a string matching a necessary</p><p> factor</p><p>To determine the b est factors we rst compute pr ob ppr ob mpr ob and cost exactly as in</p><p>Section The only dierence is that now we are only interested in starting from sets of initial</p><p> states of the same I numb er there are L p ossible choices and we are interested in analyzing</p><p> s</p><p></p><p> factors of length no more than L O Lk This lowers the costs to compute the previous arrays</p><p> 2 2</p><p> to O L m O Lmk We also make sure that we do not select subautomata that need</p><p> more than w bits to b e represented</p><p>Once we have the exp ected cost of cho osing any p ossible I value and any p ossible minimum</p><p> s</p><p> factor length we can apply the optimization algorithm of Section since we have in fact</p><p>linearized the problem thanks to our simplication the optimization problem is similar to when</p><p> we had a single pattern and needed to extract the b est set of k disjoint factors from it of a</p><p>2</p><p> single length This takes O L in the worst case but should in practice b e closer to O k L</p><p>4</p><p>Hence the total optimization algorithm needs O m time at worst</p><p> A Pattern Matching Software</p><p>Finally we are in p osition to describ e the software nrgrep Nrgrep has b een develop ed in ANSI C</p><p> and tested on Linux and Solaris platforms Its source co de is publicly available under a Gnu license</p><p>4</p><p> We discuss now its main asp ects</p><p>4</p><p>From httpwwwdccuchileclgnavarropubcode </p><p> Usage and Options</p><p>Nrgrep receives in the command line a pattern and a list of les to search The syntax of the pattern</p><p> is given in Section If no options are given nrgrep searches the les in the order given and prints</p><p> all the lines where it nds the pattern If more than one le is given then the le name is printed</p><p> prior to the lines that have matched inside it if there is one If no le names are given it receives</p><p> the text from the standard input</p><p>The default b ehavior of nrgrep can b e mo died with a set of p ossible options which prexed</p><p> by the minus sign can precede the pattern sp ecication Most of them are inspired in agrep</p><p> i the search is case insensitive</p><p> w only whole words matching the pattern are accepted</p><p> x only whole records eg lines matching the pattern are accepted</p><p> c just counts the matches do es not print them</p><p> l outputs the names of the les containing matches but not the matches themselves</p><p>G outputs the whole contents of the les containing matches</p><p> h do es not output le names</p><p> n prints records preceded by their record numb er</p><p> v reverse matching rep orts the records that do not match</p><p> d del im sets the record delimiter to del im which is n by default A at the</p><p> end of the delimiter makes it app ear as a part of the previous record default is next so</p><p> by default the end of line is the record delimiter and is considered as a part of the previous</p><p> record</p><p> b buf siz e sets the buer size default is Kb This aects the eciency and the p ossibility</p><p> of cutting very long records that do not t in a buer</p><p> s sep prints the string sep b etween each pair of records output</p><p> k er r idst allows up to er r errors in the matches If idst is not present the errors</p><p> p ermitted are insertions deletions substitutions and transp ositions otherwise a subset</p><p> of them can b e sp ecied eg k ids</p><p>L takes the pattern literally no sp ecial characters</p><p>H explains the usage and exits</p><p>We now discuss briey how each of these options are implemented i is easily carried out by</p><p> using classes of characters w and x are handled using context conditions c l G h b and</p><p>s are easily handled but some changes are necessary such as stopping the search when the rst</p><p> match is found for l and G n and v are more complicated b ecause they force all the records to</p><p> b e pro cessed so we rst search for the record delimiters and then search for the pattern inside the</p><p> records normally we search for the pattern and nd record delimiters around o ccurrences only</p><p>d is arranged by just changing the record delimiter it has to b e a simple pattern but classes or</p><p> characters are p ermitted k switches to approximate searching and the idst ags are considered</p><p> at verication time only and L avoids the normal parsing pro cess and considers the pattern as a</p><p> simple string </p><p>Of course it is p ermitted to combine the options in a single string preceded by the minus sign</p><p> or as a sequence of strings each preceded by the minus sign Some combinations however make</p><p> no sense and are automatically overriden a warning message is issued lenames are not printed</p><p> if the text comes by the standard input c and G are not compatible and c dominates n and</p><p>l are not compatible and l dominates l and G are not compatible and G dominates and G</p><p> is ignored when working on the standard input since G works by printing the whole le from the</p><p> shell</p><p> Parsing the Pattern</p><p>One imp ortant asp ect that we have not discussed is how is the parsing done In principle we just</p><p> have to parse a regular expression However our parser mo dule carries out some other tasks such</p><p> as</p><p>Parsing our extended syntax some of our op erations such as and classes of characters</p><p> are not part of the classical regular expression syntax The result of the parsing is a syntax</p><p> tree which is not discarded since as we have seen it is useful later for prepro cessing regular</p><p> expressions</p><p>Map the pattern to bit mask p ositions the parser determines the numb er of states required</p><p> by the pattern and assigns the bit p ositions corresp onding to each part of the pattern sp eci</p><p>cation</p><p>Determine typ e of sub expression given the whole pattern or a subset of its states the parser</p><p> is able to determine which typ e of expression is involved simple extended or regular expres</p><p> sion</p><p>Algebraic simplication the parser p erforms some algebraic simplication on the pattern to</p><p> optimize the search and reduce the numb er of bits needed for it</p><p>The parsing phase op erates in a topdown way It rst tries to parse an or of sub ex</p><p> pressions Each of them is parsed as a concatenation of sub expressions each of which is a single</p><p> expression nished with a sequence of or terminators and the single expression is</p><p> either a single symb ol a class of characters or a toplevel expression in parentheses Apart from</p><p> the syntax tree the parser pro duces a mapping from p ositions of the pattern strings to leaves of</p><p> the syntax tree meaning that the character or class describ ed at that pattern p osition must b e</p><p> loaded at that tree leaf</p><p>The second step is the algebraic simplication which pro ceeds b ottomup and is able to enforce</p><p> the following rules at any level</p><p> If C and C are classes of characters then C j C C C </p><p>1 2 1 2 1 2</p><p> If C is a class then C j j C C </p><p> If E is any sub expression then E E E </p><p> If E is any sub expression then E E E E E E E E and</p><p>E E E E </p><p> j </p><p> A sub expression which can match the empty string and app ears at the b eginning or at the</p><p> end of the pattern is replaced by except when we need to match whole words or records</p><p>To collapse classes of characters in a single no de rst rule we traverse the obtained mapping</p><p> from pattern p ositions to tree leaves nd those mapping to the leaves that store C and C and</p><p>1 2</p><p> make all them p oint to the new leaf that represents C C </p><p>1 2</p><p>More simplications are p ossible but they are more complicated and we chose to stop here for</p><p> the current version of nrgrep Some interesting simplications that may reduce the numb er of bits</p><p> required to represent the pattern are xw z juw v xjuw z jv EE E and E jE E</p><p> for arbitrarily complex E right now we do that just for leaves</p><p>After the simplication is done we assign p ositions in the bit mask corresp onding to each</p><p> characterclass in the pattern This is easily done by traversing the leaves left to right and assigning</p><p> one bit to each leaf of the tree except to those storing instead of a class of characters Note that</p><p> this works as exp ected on simple and extended patterns as well For several purp oses including</p><p>Glushkovs NFA construction algorithm we need to store which are the bit p ositions used inside</p><p> every subtree which of these corresp ond to initial and nal states and whether the empty string</p><p> matches the sub expression All this is easily done in the same recursive traversal over the tree</p><p>Once we have determined the bits that corresp ond to each tree leaf and the mapping from</p><p> pattern p ositions to tree leaves we build a table of bit masks B such that B c tells which pattern</p><p> p ositions are activated by the character c This B table can b e directly used for simple and extended</p><p> patterns and it is the basis to build the DFA transitions of regular expressions</p><p>Finally the parser is able to tell which is the typ e of the pattern that it has pro cessed This</p><p> can b e dierent from what the syntax used to express the pattern may suggest for example</p><p>abcdefg is the simple pattern abcdefg which is discovered by the simpli</p><p> cation pro cedure This is in our general spirit of not b eing mislead by simple problems presented in</p><p> a complicated way which motivated us to select optimum subpatterns to search for As a secondary</p><p> eect of simplications the numb er of bits required to represent the pattern may b e reduced</p><p>A second reason to b e able to determine the real typ e of pattern is that the scanning subpattern</p><p> selected can b e simpler than the whole pattern and hence it can admit a faster search algorithm</p><p>For this sake we p ermit determining the typ e not only of the whole pattern but also of a selected</p><p> set of p ositions</p><p>The algorithm for determining the typ e of pattern or subpattern starts by assuming a simple</p><p> pattern until it nds evidence of an extended pattern or a regular expression The algorithm enters</p><p> recursively into the syntax tree of the pattern avo ding entering sub expressions where no selected</p><p> state is involved Among those that have to b e entered in order to reach the selected states it</p><p> answers simple for the leaves of the tree characters classes and and in concatenations it</p><p> takes the most complex typ e among the two sides When reaching an internal no de or</p><p> it assumes extended if the subtree was just a single character otherwise it assumes regular</p><p> expression The latter is always assumed when an or that could not b e simplied is found </p><p> Software Structure</p><p>Nrgrep is implemented as a set of mo dules which p ermits easy enrichment with new typ es of</p><p> patterns Each typ e of pattern simple extended and regular expression has two mo dules to deal</p><p> with the exact and the approximate case which makes six mo dules simple esimple extended</p><p> eextended regular eregular the prex e stands for allowing errors The other mo dules</p><p> are</p><p> basicsoptionsexceptmemio basic denitions exception handling and memory and IO man</p><p> agement functions</p><p> bitmasks handles op erations on simple and multiword bit masks</p><p> buffer implements the buering mechanism to read les</p><p> record takes care of context conditions and record management</p><p> parser p erforms the parsing</p><p> search template that exp orts the prepro cessing and search functions which are implemented in</p><p> the six mo dules describ ed simple etc</p><p> shell interacts with the user and provides the main program</p><p>The template search facilitates the management of dierent search algorithms for dierent</p><p> pattern typ es Moreover we remind that the scanning pro cedure for a pattern of a given typ e can</p><p> use a subpattern whose typ e is simpler and hence a dierent scanning function is used This is</p><p> easily handled with the template</p><p> Some Exp erimental Results</p><p>We present now some exp eriments comparing the p erformance of nrgrep version against that of</p><p> its b est known comp etitors Gnu grep version and agrep version </p><p>Agrep uses for simple strings a very ecient mo dication of the BoyerMo oreHorsp o ol</p><p> algorithm For classes of characters and wild cards denoted and equivalent to it</p><p> uses an extension of the ShiftOr algorithm explained in Section in all cases based on forward</p><p> scanning For regular expressions it uses a forward scanning with the bit parallel simulation of</p><p>Thompsons automaton as explained in Section Finally for approximate searching of simple</p><p> strings it tries to use partitioning into k pieces and a multipattern search algorithm based on</p><p>BoyerMo ore but if this do es not promise to yield go o d results it prefers a bitparallel forward</p><p> scanning with the automaton of k rows these techniques were explained in Section For</p><p> approximate searching of more complex patterns agrep uses only this last technique</p><p>Gnu Grep cannot search for approximate patterns but it p ermits all the extended patterns and</p><p> regular expressions Simple strings are searched for with a BoyerMo oreGosp er search algorithm</p><p>similar to Horsp o ol All the other complex patterns are searched for with a lazy deterministic</p><p> automaton ie a DFA whose states are built as needed using forward scanning To sp eed up the </p><p> search for complex patterns grep tries to extract their longest necessary string which is used as a</p><p>lter and searched for as a simple string In fact grep is able to extract a necessary set of strings</p><p> ie such that one of the strings in the set has to app ear in every match This set is searched for as a</p><p>lter using a CommentzWalter like algorithm which is a kind of multipattern BoyerMo ore As</p><p> we will see this extension makes grep very p owerful and closer to our goal of a smo oth degradation</p><p> in eciency as the pattern gets more complex</p><p>The exp eriments were carried out over Mb of English text extracted from Wall Street</p><p>Journal articles of which are part of the trec collection Two dierent machines were</p><p> used sun is a Sun UltraSparc of MHz with Mb RAM running Solaris and intel is</p><p> an i of MHz with Mb RAM running Linux Red Hat kernel Both are</p><p>bit machines w </p><p>To illustrate the complexity of the co de we show the sizes of the sources and executables in</p><p>Table The source size is obtained by summing up the sizes of all the c and h les The</p><p> size of the executables is computed after running Unixs strip command on them As can b e</p><p> seen nrgrep is in general a simpler software</p><p>Software Source size Executable size Executable size</p><p>sun intel</p><p> agrep v Kb Kb Kb</p><p> grep v Kb Kb Kb</p><p> nrgrep v Kb Kb Kb</p><p>Table Sizes of the dierent softwares under comparison</p><p>The exp eriments were rep eated for dierent patterns of each kind To minimize the inter</p><p> ference of io times in our measures we had the text in a lo cal disk we considered only user times</p><p> and we asked the programs to show the count of matching records only The records were the lines</p><p> of the le The same patterns were searched for on the same text for grep agrep and nrgrep</p><p>Our rst exp eriment shows the search times for simple strings of lengths to Those strings</p><p> were randomly selected from the same text starting at word b eginnings and taking care of including</p><p> only letters digits and spaces We also tested how the i case insensitive search and w</p><p>match whole words aected the p erformance but the dierences were negligible We also made</p><p> this exp eriment allowing and of errors ie k bmc or k bmc In the case of</p><p> errors grep is excluded from the comparison b ecause it do es not p ermit approximate searching</p><p>As Figure shows nrgrep is comp etitive against the others more or less dep ending on the</p><p> machine and the pattern length It works b etter on the intel architecture and on mo derate length</p><p> patterns rather than on very short ones It is interesting to notice that in our exp eriments grep</p><p> p erformed b etter than agrep</p><p>When searching allowing errors nrgrep is slower or faster than agrep dep ending on the case</p><p>With low error levels they are quite close except for m where for some reason agrep</p><p> p erforms consistently bad on sun With mo derate error levels the picture is more complicated</p><p> and each of them is b etter for dierent pattern lengths This is a p oint of very volatile b ehavior</p><p> b ecause happ ens to b e very close to the limit where splitting into k pieces ceases to b e a </p><p> go o d choice and a small error estimating the probability of a match pro duces dramatic changes in</p><p> the p erformance Dep ending on each pattern a dierent search technique is used</p><p>On the other hand the search p ermitting transp ositions is consistently worse than the one not</p><p> p ermitting them This is not only b ecause when splitting into k pieces they may have to b e</p><p> shorter to allow one free space among them but also b ecause even when they can have the same</p><p> length the subpattern selection pro cess has more options to nd the b est pieces if no space has to</p><p> b e left among them</p><p>The b ehavior of nrgrep however is not as erratic as it seems to b e The nonmonotonic b ehavior</p><p> can b e explained in terms of splitting into k pieces As m grows the length of the pieces that</p><p> can b e searched for grows tending to the real mk value from b elow For example for k </p><p> of m we have in the case of transp ositions to search for pieces of length when m </p><p>For m we have to search for pieces of length which is estimated to pro duce to o many</p><p> verications and then another metho d is used For m however we have to search for pieces</p><p> of length which has much lower probability of o ccurrence and recommends again splitting into</p><p> k pieces The dierences b etween searching using or not transp ositions is explained in the same</p><p> way For m we have to search for pieces of length no matter whether transp ositions are</p><p> p ermitted But for m we can search for pieces of length if no transp ositions are p ermitted</p><p> which yields much b etter search time for that case We remark that p ermitting transp ositions has</p><p> the p otential of yielding approximate searching of b etter quality and hence obtain the same or</p><p> b etter results using a smaller k </p><p>Our second exp eriment aims at evaluating the p erformance when searching for simple patterns</p><p> that include classes of characters We have selected strings as b efore from the text and have</p><p> replaced some random p ositions with a class of characters The classes have b een upp er and lower</p><p> case versions of the letter replaced called case in the exp eriments all the letters azAZ</p><p> called letters in the exp eriment and all the characters called all in the exp eriments We</p><p> have considered pattern lengths of and and an increasing numb er of p ositions converted into</p><p> classes We show also the case of length and k errors excluding grep</p><p>As Figure shows nrgrep deals much more eciently with classes of characters than its</p><p> comp etitors worsening slowly as there are more or bigger classes to consider Agrep yields always</p><p> the same search time coming from a ShiftOr like algorithm Grep on the other hand worsens</p><p> progressively as the numb er of classes grows but is indep endent on how big the classes are and it</p><p> worsens much faster than nrgrep We have included the case of zero classes basically to show the</p><p> eect of the change of agreps search algorithm</p><p>Our third exp eriment deals with extended patterns We have selected strings as b efore from the</p><p> text and have added an increasing numb er of op erators or to it at random p ositions</p><p>avoiding the rst and last ones For this sake we had to use the E option of grep which</p><p> deals with regular expressions and convert E into EE for agrep Moreover we found no way to</p><p> express the in agrep since even allowing regular expressions it did not p ermit sp ecifying the</p><p> string or the empty set a aj aj We also show how agrep and nrgrep p erform to</p><p> search for extended patterns allowing errors Because of agreps limitations we chose m and</p><p> k errors no transp ositions p ermitted for this test</p><p>As Figure shows agrep has a constant search time coming again from ShiftOr like searching</p><p>plus some size limitations on the pattern Both grep and nrgrep improve as the problem b ecomes Exact searching on SUN Exact searching on INTEL 2 34 Agrep Agrep Grep Grep 1.8 Nrgrep 32 Nrgrep 30 1.6 28 1.4 26 1.2 24 1 22</p><p>0.8 20</p><p>0.6 18 5 10 15 20 25 30 5 10 15 20 25 30 m m</p><p>Approximate searching on SUN Approximate searching on INTEL 45 60 Agrep 10% Agrep 10% 40 Nrgrep 10% Nrgrep 10% Nrgrep 10% (transp) 55 Nrgrep 10% (transp) 35 Agrep 20% Agrep 20% Nrgrep 20% Nrgrep 20% Nrgrep 20% (transp) 30 50 Nrgrep 20% (transp)</p><p>25 45 20</p><p>15 40</p><p>10 35 5</p><p>0 30 5 10 15 20 25 30 5 10 15 20 25 30</p><p> m m</p><p>Figure Search times on Mb for simple strings using exact and approximate searching m = 10 on SUN m = 10 on INTEL 10 42</p><p>9 40 Agrep - case Agrep - letters 8 Agrep - case 38 Agrep - all Agrep - letters Grep - case 7 Agrep - all 36 Grep - letters Grep - case Grep - all 6 Grep - letters 34 Nrgrep - case Grep - all Nrgrep - letters Nrgrep - all 5 Nrgrep - case 32 Nrgrep - letters Nrgrep - all 4 30</p><p>3 28</p><p>2 26</p><p>1 24 0 1 2 3 4 5 0 1 2 3 4 5 # of classes # of classes</p><p> m = 10 on SUN m = 20 on INTEL 10 42 9 40 Agrep - case Agrep - letters 8 Agrep - case 38 Agrep - all Agrep - letters Grep - case 7 Agrep - all 36 Grep - letters Grep - case Grep - all 6 Grep - letters 34 Nrgrep - case Grep - all Nrgrep - letters 5 Nrgrep - case 32 Nrgrep - all Nrgrep - letters 4 Nrgrep - all 30 3 28 2 26 1 24 0 22 0 2 4 6 8 10 0 2 4 6 8 10 # of classes # of classes</p><p> m = 15, allowing 2 errors on SUN m = 15, allowing 2 errors on INTEL 20 48</p><p>18 Agrep - case 46 Agrep - letters Agrep - case 16 Agrep - all 44 Agrep - letters Nrgrep - case Agrep - all 14 Nrgrep - letters Nrgrep - case Nrgrep - all 42 Nrgrep - letters 12 Nrgrep - all 40 10 38 8 36 6</p><p>4 34</p><p>2 32 1 2 3 4 5 6 7 1 2 3 4 5 6 7</p><p># of classes # of classes</p><p>Figure Exact and approximate search times on Mb for simple patterns with a varying</p><p> numb er of classes of characters </p><p> simpler but nrgrep is consistently faster Note that the problem is necessarily easier with the </p><p> op erator b ecause the minimum length of the pattern is not reduced as the numb er of op erators</p><p> grows We have again included the case of zero op erators to show how agrep jumps</p><p>With resp ect to approximate searching nrgrep is consistently faster than agrep whose cost is a</p><p> kind of worst case which is reached when the minimum pattern length b ecomes This is indeed a</p><p> very dicult case for approximate searching and nrgrep wisely cho oses to do forward scanning on</p><p> that case</p><p>Our fourth exp eriment considers regular expressions It is not easy to dene what is a random</p><p> regular expression so we have tested complex patterns that we considered interesting to illustrate</p><p> the dierent alternatives for the eciency These have b een searched for with zero one and two</p><p> errors no transp ositions p ermitted Table shows the patterns selected and some asp ects that</p><p> explain the eciency problems to search for them</p><p>No Pattern Size Minimum of lines that match</p><p> chars length ` exactly error errors</p><p> AmericanCanadian </p><p> AmericanCanadianMexican </p><p> Amerazcan </p><p> AmerazcanCanazian </p><p> Ameirican </p><p> Amazriazan </p><p> AmCaernaicdian </p><p> Americanpolicy </p><p> Americanpolicy </p><p>Table The regular expressions searched for written with the syntax of nrgrep Note that some</p><p> are indeed extended patterns</p><p>Table shows the results These are more complex to interpret than in the previous cases and</p><p> there are imp ortant dierences in the b ehavior of the same co de dep ending on the machines</p><p>For example grep p erformed consistently well on intel while it showed wide dierences on</p><p> sun We b elieve that this comes from the fact that in some cases grep cannot nd a suitable set of</p><p>ltering strings patterns and In those cases the time corresp onds to that of a forward</p><p>DFA On the intel machine however the search times are always go o d</p><p>Another example is agrep which has basically two dierent times on sun and always the same</p><p> time on intel Despite that agrep uses always forward scanning with a bitparallel automaton it</p><p> takes on the sun half the time when the pattern is very short up to p ositions while it is slower</p><p> for longer patterns This dierence seems to come from the numb er of computer words needed for</p><p> the simulation but this seems not to b e imp ortant on the intel machine The approximate search</p><p> using agrep which works only on the shorter patterns scales in time accordingly to the numb er of</p><p> computer words used although there is a large constant factor added in the intel machine</p><p>Nrgrep p erforms well on exact searching All the patterns yield fast search times in general</p><p> b etter than those of grep on intel and worse on sun The exception is where grep do es not use</p><p>ltering in the sun machine and nrgrep b ecomes much faster When errors are p ermitted the</p><p> search times of nrgrep vary dep ending on its ability to lter the search The main asp ects that m = 10 on SUN m = 10 on INTEL 10 44 42 Agrep - *'s 8 Agrep - +'s 40 Agrep - *'s Grep - ?'s 38 Agrep - +'s Grep - *'s Grep - ?'s Grep - +'s 6 36 Grep - *'s Nrgrep - ?'s Grep - +'s Nrgrep - *'s 34 Nrgrep - ?'s Nrgrep - +'s Nrgrep - *'s 4 32 Nrgrep - +'s 30 2 28 26 0 24 0 1 2 3 4 5 0 1 2 3 4 5 # of special symbols # of special symbols</p><p> m = 20 on SUN m = 20 on INTEL 3.5 44 42 3 Agrep - *'s 40 Agrep - +'s Agrep - *'s Grep - ?'s 38 Agrep - +'s 2.5 Grep - *'s Grep - ?'s Grep - +'s 36 Grep - *'s Nrgrep - ?'s 34 Grep - +'s 2 Nrgrep - *'s Nrgrep - ?'s Nrgrep - +'s 32 Nrgrep - *'s Nrgrep - +'s 30 1.5 28</p><p>1 26 24 0.5 22 2 4 6 8 10 0 2 4 6 8 10 # of special symbols # of special symbols</p><p> m = 8, allowing 1 error on SUN m = 8, allowing 1 error on INTEL 30 44</p><p>43 25 42 Agrep - *'s Agrep - *'s Agrep - +'s 20 Agrep - +'s 41 Nrgrep - ?'s Nrgrep - ?'s Nrgrep - *'s Nrgrep - *'s Nrgrep - +'s 15 Nrgrep - +'s 40</p><p>39 10 38 5 37</p><p>0 36 1 1.5 2 2.5 3 3.5 4 1 1.5 2 2.5 3 3.5 4</p><p># of special symbols # of special symbols</p><p>Figure Exact and approximate search times on Mb for extended patterns with a varying</p><p> numb er of op erators Where on sun Agrep is out of b ounds it takes nearly seconds </p><p> aect the search time are the frequency of the pattern and its size When compared to agrep nrgrep</p><p> p erforms always b etter on exact searching and b etter or similarly on approximate searching</p><p> sun</p><p>No grep agrep nrgrep</p><p> errors errors error errors errors error errors</p><p> </p><p> </p><p> </p><p> </p><p> </p><p> </p><p> </p><p> </p><p> </p><p> intel</p><p>No grep agrep nrgrep</p><p> errors errors error errors errors error errors</p><p> </p><p> </p><p> </p><p> </p><p> </p><p> </p><p> </p><p> </p><p> </p><p>Table Search times for the selected regular expressions There are empty cells b ecause agrep</p><p> severely restricts the lengths of the complex patterns that can b e approximately searched for</p><p>The general conclusions from the exp eriments are that nrgrep is for exact searching comp etitive</p><p> against agrep and grep while it is in general sup erior sometimes by far when searching for classes</p><p> of characters and extended patterns exactly or allowing errors When it comes to search for regular</p><p> expressions nrgrep is in general but not always faster than grep and agrep</p><p>One nal word ab out nrgreps smo othness is worthwhile The reader may get the impression</p><p> that nrgreps b ehavior is not as smo oth as promised b ecause it takes very dierent times for dierent</p><p> patterns for example on regular expressions while agreps b ehavior is much more predictable The</p><p> p oint is that some of these patterns are indeed much simpler than others and nrgrep is much faster</p><p> to search for the simpler ones Agrep on the other hand do es not distinguish b etween simple and</p><p> complicated cases Grep do es a much b etter job but it do es not deal with the complex area of</p><p> approximate searching Something similar happ ens on other cases nrgrep had the highest variance </p><p> when searching for patterns where classes were inserted at random p oints This is b ecause this</p><p> random pro cess do es pro duce a high variance in the complexity of the patterns it is much simpler</p><p> to search for the pattern when all the classes are in one extreme then cutting it out from the</p><p> scanning subpattern than when they are uniformly spread Nrgreps b ehavior simply mimics the</p><p> variance in its input precisely b ecause it takes time prop ortional to the real complexity of the</p><p> search problem</p><p> Conclusions</p><p>We have presented nrgrep a fast online pattern matching to ol esp ecially well suited for complex</p><p> pattern searching on natural language text Nrgrep is now at version and publicly available</p><p> under a Gnu license Our b elief is that it can b e a successful new memb er of the grep family The</p><p>Free Software Foundation has shown interest in making nrgrep an imp ortant part of a new release</p><p> of Gnu grep and we are currently dening de details</p><p>The most imp ortant improvements of nrgrep over the other memb ers of the grep family are</p><p>Eciency nrgrep is similar to the others when searching for simple strings a sequence of single</p><p> characters and some regular expressions and generally much faster for all the other typ es of</p><p> patterns</p><p>Uniformity our search mo del is uniform based on a single concept This translates into smo oth</p><p> variations in the search time as the pattern gets more complex and into an absence of obscure</p><p> restrictions present in agrep</p><p>Extended patterns we intro duce a class of patterns which is intermediate b etween simple pat</p><p> terns and regular expressions and develop ecient search algorithms for it</p><p>Error mo del we include character transp ositions in the error mo del</p><p>Optimization we nd the optimal subpattern to scan the text and check the p otential o ccurrences</p><p> for the complete pattern</p><p>Some p ossible extensions and improvements have b een left for future work The rst one is a</p><p> b etter computation of the matching probability which has a direct impact on the ability of nrgrep</p><p> for cho osing the right search metho d currently it normally succeeds but not always A second</p><p> one is a b etter algebraic optimization of the regular expressions which has also an impact on the</p><p> ability to correctly compute the matching probabilities Finally we would also like to b e able to</p><p> combine exact and approximate searching as agrep do es where parts of the pattern accept errors</p><p> and others do not It is not hard to do this by using bit masks that control the propagation of</p><p> errors</p><p>Acknowledgements</p><p>Most of the algorithmic work preceding nrgrep was done in collab oration with Mathieu Ranot</p><p> who unfortunately had no time to participate in this software </p><p>References</p><p> R BaezaYates Text retrieval Theory and practice In th IFIP World Computer Congress</p><p> volume I pages Septemb er </p><p> R BaezaYates and G Gonnet A new approach to text searching Communications of the</p><p>ACM </p><p> R BaezaYates and G Navarro Faster approximate string matching Algorithmica </p><p> </p><p> R BaezaYates and B Rib eiroNeto Modern Information Retrieval AddisonWesley </p><p> G Berry and R Sethi From regular expression to deterministic automata Theoretical Com</p><p> puter Science </p><p> R Boyer and J Mo ore A fast string searching algorithm Communications of the ACM</p><p> </p><p> B CommentzWalter A string matching algorithm fast on the average In Proc ICALP</p><p>LNCS v pages </p><p> M Cro chemore and W Rytter Text algorithms Oxford University Press </p><p> A Czuma j M Cro chemore L Gasieniec S Jarominek T Lecro q W Plandowski and</p><p>W Rytter Sp eeding up two stringmatching algorithms Algorithmica </p><p> N ElMabrouk and M Cro chemore Boyermo ore strategy to ecient approximate string</p><p> matching In th International Symposium on Combinatorial Pattern Matching CPM</p><p> pages </p><p> D Harman Overview of the Third Text REtrieval Conference In Proc Third Text REtrieval</p><p>Conference TREC pages NIST Sp ecial Publication </p><p> R Horsp o ol Practical fast searching in strings Software Practice and Experience </p><p></p><p> D Knuth J Morris and V Pratt Fast pattern matching in strings Siam Journal on</p><p>Computing </p><p> K Kukich Techniques for automatically correcting words in text ACM Computing Surveys</p><p> </p><p> U Manb er and S Wu Glimpse A to ol to search through entire le systems In Proc</p><p>USENIX Technical Conference pages Winter </p><p> B Melichar String matching with k dierences by nite automata In Proc International</p><p>Congress on Pattern Recognition ICPR pages IEEE CS Press </p><p> E Moura G Navarro N Ziviani and R BaezaYates Fast and exible word searching on</p><p> compressed text ACM Transactions on Information Systems TOIS To app ear Earlier</p><p> versions in SIGIR and SPIRE</p><p> G Myers A fast bitvector algorithm for approximate string matching based on dynamic</p><p> progamming Journal of the ACM </p><p> G Navarro A guided tour to approximate string matching ACM Computing Surveys </p><p>To app ear</p><p> G Navarro and R BaezaYates Very fast and simple approximate string matching Informa</p><p> tion Processing Letters </p><p> G Navarro E Moura M Neub ert N Ziviani and R BaezaYates Adding compression to</p><p> blo ck addressing inverted indexes Kluwer Information Retrieval Journal </p><p> G Navarro and M Ranot A bitparallel approach to sux automata Fast extended string</p><p> matching In th International Symposium on Combinatorial Pattern Matching CPM</p><p>LNCS pages </p><p> G Navarro and M Ranot Fast and exible string matching by combining bitparallelism</p><p> and sux automata Technical Rep ort TRDCC Dept of Computer Science Univ of</p><p>Chile </p><p> G Navarro and M Ranot Fast regular expression searching In rd International Workshop</p><p> on Algorithm Engineering WAE LNCS pages </p><p> G Navarro and M Ranot Compact DFA representation for fast regular expression search</p><p>In th International Workshop on Algorithm Engineering WAE LNCS To app ear</p><p> G Navarro and M Ranot Flexible Pattern Matching in Strings Cambridge University</p><p>Press To app ear</p><p> M Ranot On the multi backward dawg matching algorithm MultiBDM In Proc th</p><p>South American Workshop on String Processing WSP pages Carleton University</p><p>Press </p><p> K Thompson Regular expression search algorithm Communications of the ACM </p><p> </p><p> S Wu and U Manb er Agrep a fast approximate patternmatching to ol In Proc USENIX</p><p>Technical Conference pages </p><p> S Wu and U Manb er Fast text searching allowing errors Communications of the ACM</p><p> </p>
Details
-
File Typepdf
-
Upload Time-
-
Content LanguagesEnglish
-
Upload UserAnonymous/Not logged-in
-
File Pages0 Page
-
File Size-