NR-grep: A Fast and Flexible Pattern Matching Tool
Gonzalo Navarro (*)

Abstract

We present nrgrep ("nondeterministic reverse grep"), a new pattern matching tool designed for efficient search of complex patterns. Unlike previous tools of the grep family, such as agrep and Gnu grep, nrgrep is based on a single and uniform concept: the bit-parallel simulation of a nondeterministic suffix automaton. As a result, nrgrep can find everything from simple patterns to regular expressions, exactly or allowing errors in the matches, with an efficiency that degrades smoothly as the complexity of the searched pattern increases. Another concept fully integrated into nrgrep, which contributes to this smoothness, is the selection of adequate subpatterns for fast scanning, which is absent in many current tools. We show that the efficiency of nrgrep is similar to that of the fastest existing string matching tools for the simplest patterns, and unmatched for more complex patterns.

Key words: Online string matching, regular expression searching, approximate string matching, grep, agrep, BNDM.

1 Introduction

The purpose of this paper is to present a new pattern matching tool which we have coined nrgrep, for "nondeterministic reverse grep". Nrgrep is aimed at efficient searching for complex patterns inside natural language texts, but it can be used in many other scenarios.

The pattern matching problem can be stated as follows: given a text T_{1..n} of n characters and a pattern P, find all the positions of T where P occurs. The problem is basic in almost every area of computer science and appears in many different forms. The pattern P can be just a simple string, but it can also be, for example, a regular expression. An "occurrence" can be defined as exactly or "approximately" matching the pattern.

In this paper we concentrate on online string matching, that is, the text cannot be indexed. Online string matching is useful for casual searching (i.e. users looking for strings in their files and unwilling to maintain an index for that purpose), for dynamic text collections (where the cost of keeping an up-to-date index is prohibitive), including the searchers inside text editors and Web interfaces(1), for not very large texts (up to a few hundred megabytes), and even as an internal tool of indexed schemes (as agrep [?] is used inside glimpse [?], or cgrep [?] is used inside compressed indexes [?]).

(*) Dept. of Computer Science, University of Chile. Blanco Encalada, Santiago, Chile. gnavarro@dcc.uchile.cl. Work developed while the author was on a postdoctoral stay at the Institut Gaspard Monge, Univ. de Marne-la-Vallée, France, partially supported by Fundación Andes and ECOS/Conicyt.

(1) We refer to the "search in page" facility, not to be confused with searching the Web.

There is a large class of string matching algorithms in the literature (see, for example, [?, ?, ?]), but not all of them are practical. There is also a wide variety of fast online string matching tools in the public domain, most prominently the grep family. Among these, Gnu grep and Wu and Manber's agrep [?] are widely known and currently considered the fastest string matching tools in practice. Another distinguishing feature of these software systems is their flexibility: they can search not only for simple strings, but they also permit classes of characters (that is, a pattern position that matches a set of characters), wild cards (a pattern position that matches an arbitrary string), regular expression searching, multipattern searching, etc. Agrep also permits
approximate searching: the pattern matches the text after performing a limited number of alterations on it.

The algorithmic principles behind agrep are diverse [?]. Exact string matching is done with the Horspool algorithm [?], a variant of the Boyer-Moore family [?]. The speed of the Boyer-Moore string matching algorithms comes from their ability to "skip" (i.e. not inspect) some text characters. Agrep deals with more complex patterns using a variant of Shift-Or [?], an algorithm exploiting "bit-parallelism" (a concept that we explain later) to simulate nondeterministic automata (NFAs) efficiently. Shift-Or, however, cannot skip text characters. Multipattern searching is treated with bit-parallelism or with a different algorithm, depending on the case. As a result, the search performance of agrep varies sharply depending on the type of search pattern, and even slight modifications to the pattern yield widely different search times. For example, the search for the string "algorithm" is several times faster than for "[Aa]lgorithm" (where "[Aa]" is a class of characters that matches "A" and "a", which is useful to detect the word whether it starts a sentence or not). In the first case agrep uses Horspool's algorithm and in the second case it uses Shift-Or. Intuitively, there should exist a more uniform approach where both strings could be efficiently searched for without a significant difference in the search time.

An answer to this challenge is the BNDM algorithm [?, ?]. BNDM is based on a previous algorithm, BDM (for "backward DAWG matching") [?, ?]. The BDM algorithm (to be explained later) uses a "suffix automaton" to detect substrings of the pattern inside a text window (the Boyer-Moore family detects only suffixes of the pattern). Like the Boyer-Moore algorithms, BDM can also skip text characters. In the original BDM algorithm the suffix automaton is made deterministic. BNDM is a recent version of BDM that keeps the suffix automaton in nondeterministic form by using bit-parallelism. As a result, BNDM can search for complex patterns and still keep a search efficiency close to that of simple patterns. It has been shown experimentally [?, ?] that the BNDM algorithm is by far the fastest one to search for complex patterns. BNDM has later been extended to handle regular expressions [?].

Nrgrep is a pattern matching tool built over the BNDM algorithm (hence the name "nondeterministic reverse grep", since BNDM scans windows of the text in reverse direction). However, there is a gap between a pattern matching algorithm and a real piece of software. The purpose of this work is to fill that gap. We have classified the allowed search patterns in three levels:

Simple patterns: a simple pattern is a sequence of m classes of characters (note that a single character is a particular case of a class). Its distinguishing feature is that an occurrence of a simple pattern has length m as well, as each pattern position matches one text position.

Extended patterns: an extended pattern adds to simple patterns the ability to characterize individual classes as "optional" (i.e. they can be skipped when matching the text) or "repeatable" (i.e. they can appear consecutively a number of times in the text). The purpose of extended patterns is to capture the most commonly used extensions of the normal search patterns, so as to develop specialized pattern matching algorithms for them.

Regular expressions: a regular expression is formed by simple classes, the empty string, or the "concatenation", "union" or "repetition" of other regular expressions. This is the most general type of pattern we can search for.

We develop a different pattern matching algorithm (with increasing
complexity) for each type of pattern, so simpler patterns are searched for with simpler and faster algorithms. The classification has been made with the typical search needs on natural language in mind, and it would be different, say, for DNA searching. We also keep this in mind when we design the error model for approximate searching. "Approximate searching" or "searching allowing errors" means that the user gives an error threshold k and the system is able to find the pattern in the text even if up to k "operations" must be performed on the pattern to match its occurrence in the text. The operations typically permitted are the insertion, deletion and substitution of single characters. However, transposition of adjacent characters is an important typing error [?] that is normally disregarded because it is difficult to deal with. We allow the four operations in nrgrep, although the user can specify a subset of them.

An important aspect that deserves attention in order to obtain the desired "smoothness" in the search time is the selection of an optimal subpattern to scan the text. A typical case is an extended pattern with a large and repeatable class of characters close to one end. For technical reasons that will be made clear later, it may be very expensive to search for the pattern as is, while pruning the extreme of the pattern that contains the class (and verifying the potential occurrences found) leads to much faster searching. Some tools (such as Gnu grep for regular expressions) try to apply some heuristics of this type, but we provide a general and uniform subpattern optimization method that works well in all cases and, under a simplified probabilistic model, yields the optimal search performance for that pattern. Moreover, the selected subpattern may be of a simpler type than the whole pattern, and a faster search algorithm may be possible. Detecting the exact type of pattern given by the user (whatever the syntax used) is an important issue that is solved by the pattern parser.

We have followed the philosophy of agrep in some aspects, such as the record-oriented way to present the results and most of the pattern syntax features and search options. The main advantages of nrgrep over the grep family are uniformity in design, smoothness in search time, speed when searching for complex patterns, powerful extended patterns, an improved error model for approximate searching, and subpattern optimization.

In this paper we start by explaining the concepts of bit-parallelism and searching with suffix automata. Then we explain how these are combined to search for simple patterns, extended patterns and regular expressions. We later consider the approximate search of these patterns. Finally, we present the nrgrep software and show some experimental results on it. Although the algorithmic aspects of the paper borrow from our previous work in some cases [?, ?, ?, ?], the paper has some novel and nontrivial algorithmic contributions, such as:

- searching for extended patterns, which implies the bit-parallel simulation of new types of restricted automata;

- approximate searching allowing transpositions for the three types of patterns, which has never been considered under the bit-parallel approach; and

- algorithms to select optimal search subpatterns in the three types of patterns.

The nrgrep tool is freely available under a Gnu license from http://www.dcc.uchile.cl/~gnavarro/pubcode/.

2 Basic Concepts

We define in this section the basic concepts and notation needed throughout the paper.

2.1 Notation

We consider that the text is a sequence of n characters, T = t_1 ... t_n, where
t_i ∈ Σ, where Σ is the alphabet of the text and its size is denoted |Σ| = σ. In the simplest case the pattern is denoted as P = p_1 ... p_m, a sequence of m characters p_i ∈ Σ, in which case it specifies the single string p_1 ... p_m. More general patterns specify a finite or infinite set of strings.

We say that P matches T at position i whenever there exists a j ≥ 0 such that t_i ... t_{i+j} belongs to the set of strings specified by the pattern. The substring t_i ... t_{i+j} is called an "occurrence" of P in T. Our goal is to find all the text positions that start a pattern occurrence.

The following notation is used for strings: S_{i..j} denotes the string s_i s_{i+1} ... s_j. In particular, S_{i..j} = ε (the empty string) if i > j. A string X is said to be a prefix, suffix and factor (or substring), respectively, of XY, YX and YXZ, for any Y and Z.

We use some notation to describe bit-parallel algorithms. We use exponentiation to denote bit repetition, e.g. 0^3 1 = 0001. We denote as b_ℓ ... b_1 the bits of a mask of length ℓ, which is stored somewhere inside a computer word of fixed length w (in bits). We use C-like syntax for operations on the bits of computer words: "|" is the bitwise-or, "&" is the bitwise-and, "^" is the bitwise-xor, "~" complements all the bits, and "<<" moves the bits to the left and enters zeros from the right, e.g. b_ℓ b_{ℓ-1} ... b_2 b_1 << 3 = b_{ℓ-3} ... b_2 b_1 000. We can also perform arithmetic operations on the bits, such as addition and subtraction, which operate on the bits as the binary representation of a number; for instance, b_ℓ ... b_x 10000 - 1 = b_ℓ ... b_x 01111.

In the following we show that the pattern can be a more complex entity, matching in fact a set of different text substrings.
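These conventions map directly onto C's unsigned integer operations. A minimal self-contained check of the two identities above (the particular 8-bit masks are our own illustrative choices):

```c
#include <stdint.h>

/* Check the bit-notation identities on concrete 8-bit masks:
   b8...b1 << 3 drops the top three bits and appends 000, and
   b...b 10000 - 1 = b...b 01111 (the borrow runs through the low zeros). */
int notation_holds(void) {
    uint8_t m = 0xd5;                        /* 1101 0101 */
    if ((uint8_t)(m << 3) != 0xa8) return 0; /* 1010 1000 */
    uint8_t x = 0xb0;                        /* 1011 0000 = b b b 1 0000 */
    if ((uint8_t)(x - 1) != 0xaf) return 0;  /* 1010 1111 = b b b 0 1111 */
    return 1;
}
```

The cast back to `uint8_t` models the fixed mask length ℓ = 8; on a real machine the masks simply live in a w-bit word.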
2.2 Simple Patterns

We call a "simple" pattern a sequence of characters or classes of characters. Let m be the number of elements in the sequence; then a simple pattern P is written as P = p_1 ... p_m, where p_i ⊆ Σ. We say that P matches at text position i whenever t_{i+j-1} ∈ p_j for j ∈ 1 ... m. The most important feature of simple patterns is that they match a substring of the same length m in the text.

We use the following notation to describe simple patterns: we concatenate the elements of the sequence together. Simple characters (i.e. classes of size 1) are written down directly, while other classes of characters are written in square brackets. The first character inside the square brackets can be "^", which means that the class is exactly the complement of what is specified. The rest is a simple enumeration of the characters of the class, except that we allow ranges: "x-y" means all the characters between x and y inclusive (we assume a total order in Σ, which is in practice the ASCII code). Finally, the character "." represents a class equal to the whole alphabet, and "#" represents the class of all separators (i.e. non-alphanumeric characters). Most of these conventions are the same used in Unix software. Some examples are:

- "[Aa]merican", which matches "American" and "american".

- "[^\n]Begin", which finds "Begin" if it is not preceded by a line break.

- "[0-3][0-9]/[01][0-9]/9[0-9]", which matches any date in the 90's.

- ".e[^a-zA-Z_]t#", which permits any character in the first position, then "e", then anything except a letter or underscore in the third position, then "t", and finishes with a separator.

Note that we have used "\n" to denote the newline. We also use "\t" for the tab, "\xHH" for the character with hex ASCII code HH, and in general "\C" to interpret any character C literally (e.g. the backslash character itself, as well as the special characters that follow).

It is possible to specify that the pattern has to appear at the beginning of a line by preceding it with "^", or at the end of the line by following it with "$". Note that this is not the same as adding the newline preceding or following the pattern, because the beginning/end of the file also signals the beginning/end of a line, and the same happens with records when the record delimiter is not the end of line.

2.3 Extended Patterns

In general, an extended pattern adds some extra specification capabilities to the simple pattern mechanism. In this work we have chosen some features which we believe are the most interesting for typical text searching. The reason to introduce this intermediate-level pattern (between simple patterns and regular expressions) is that it is possible to devise specialized search algorithms for them, which can be faster than those for general regular expressions. The operations we permit for extended patterns are: specifying optional classes (or characters), and permitting the repetition of a class (or character). The notation we use is to add a symbol after the affected character or class: "?" means an optional class, "*" means that the class can appear zero or more times, and "+" means that it can appear one or more times. Some examples are:

- "colou?r", which matches "color" and "colour".

- "[a-zA-Z][a-zA-Z0-9]*", which matches valid variable names in most programming languages (a letter followed by letters or digits).

- "Latin#+America", which matches "Latin" and "America" separated by one or more separator characters (e.g. spaces, tabs, etc.).
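To make the semantics of "?" and "+" concrete, here is a minimal backtracking matcher for an already-parsed extended pattern. The `Elem` representation (classes as byte-membership tables) and the function names are our own illustration; nrgrep's actual engines are bit-parallel, as described later:

```c
#include <string.h>

/* One parsed element of an extended pattern: a character class plus a
   modifier ('1' = exactly once, '?' = optional, '+' = one or more). */
typedef struct { unsigned char in[256]; char mod; } Elem;

/* Does the element sequence pat[0..np-1] match a prefix of text?
   Plain backtracking, for illustration only. */
static int match_here(const Elem *pat, int np, const char *text) {
    if (np == 0) return 1;
    unsigned char c = (unsigned char)*text;
    if (pat->mod == '?') {
        if (c && pat->in[c] && match_here(pat + 1, np - 1, text + 1)) return 1;
        return match_here(pat + 1, np - 1, text);    /* skip the optional class */
    }
    if (!c || !pat->in[c]) return 0;                 /* '1' and '+' need one char */
    if (pat->mod == '+' && match_here(pat, np, text + 1)) return 1;  /* repeat */
    return match_here(pat + 1, np - 1, text + 1);    /* consume and advance */
}

/* Build "colou?r" by hand and check whether it matches a prefix of s. */
int matches_colour(const char *s) {
    Elem p[6];
    const char *lits = "colour";
    for (int i = 0; i < 6; i++) {
        memset(&p[i], 0, sizeof p[i]);
        p[i].in[(unsigned char)lits[i]] = 1;
        p[i].mod = (lits[i] == 'u') ? '?' : '1';
    }
    return match_here(p, 6, s);
}
```

Note that an occurrence of an extended pattern, unlike that of a simple pattern, no longer has a fixed length: here both the 5-character "color" and the 6-character "colour" match.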
2.4 Regular Expressions

A regular expression is the most sophisticated pattern that we allow to search for, and it is in general considered powerful enough for most applications. A regular expression is defined as follows:

- Basic elements: any character and the empty string (ε) are regular expressions matching themselves.

- Parentheses: if e is a regular expression then so is (e), which matches the same strings. This is used to change precedence.

- Concatenation: if e_1 and e_2 are regular expressions, then e_1 · e_2 is a regular expression that matches a string x iff x can be written as x = yz, where e_1 matches y and e_2 matches z.

- Union: if e_1 and e_2 are regular expressions, then e_1|e_2 is a regular expression that matches a string x iff e_1 or e_2 matches x.

- Kleene closure: if e is a regular expression then e* is a regular expression that matches a string x iff, for some n, x can be written as x = x_1 ... x_n and e matches each string x_i.

We follow the same syntax (with the precedence order "(", "*", "·", "|"), except that we use square brackets to abbreviate x_1|x_2|...|x_n as [x_1 x_2 ... x_n] (where the x_i are characters), we omit the concatenation operator "·", we add the two operators e+ = ee* and e? = (e|ε), and we use the empty string to denote ε, e.g. "a(b|)c" denotes a(b|ε)c. These arrangements make extended patterns a particular case of regular expressions. Some examples are:

- "dog|cat", which matches "dog" and "cat".

- "((Dr.|Prof.|Mr.)#)*Knuth", which matches "Knuth" preceded by a sequence of titles.

3 Pattern Matching Algorithms

We explain in this section the basic string and regular expression search algorithms our software builds on.

3.1 Bit-Parallelism and the Shift-Or Algorithm

In [?] a new approach to text searching was proposed. It is based on bit-parallelism [?]. This technique consists in taking advantage of the intrinsic parallelism of the bit operations inside a computer word. By using this fact cleverly, the number of operations that an algorithm performs can be cut down by a factor of at most w, where w is the number of bits in the computer word. Since in current architectures w is 32 or 64, the speedup is very significant in practice.

Figure 1 shows a nondeterministic automaton that searches for a pattern in a text. Classical pattern matching algorithms, such as KMP [?], convert this automaton to a deterministic form and achieve O(n) worst-case search time. The Shift-Or algorithm [?], on the other hand, uses bit-parallelism to simulate the automaton in its nondeterministic form. It achieves O(mn/w) worst-case time, i.e. an optimal speedup over a classical O(mn) simulation. For m ≤ w, Shift-Or is twice as fast as KMP because of better use of the computer registers. Moreover, it is easily extended to handle classes of characters.

[Figure 1: A nondeterministic automaton (NFA) to search for the pattern P = "abcdefg" in a text.]

3.1.1 Text Scanning

We present now the Shift-And algorithm [?], which is an easier to explain (though slightly less efficient) variant of Shift-Or. Given a pattern P = p_1 p_2 ... p_m, p_i ∈ Σ, and a text T = t_1 t_2 ... t_n, t_i ∈ Σ, the algorithm first builds a table B which stores a bit mask b_m ... b_1 for each character. The mask in B[c] has the i-th bit set if and only if p_i = c. The state of the search is kept in a machine word D = d_m ... d_1, where d_i is set whenever p_1 p_2 ... p_i matches the end of the text read up to now (another way to see it is that d_i tells whether the state numbered i in Figure 1 is active). Therefore, we report a match whenever d_m is set.

We set D = 0^m
originally and, for each new text character t_j, update D using the formula

  D' ← ((D << 1) | 0^(m-1) 1) & B[t_j]

The formula is correct because the i-th bit is set if and only if the (i-1)-th bit was set for the previous text character and the new text character matches the pattern at position i. In other words, t_{j-i+1} ... t_j = p_1 ... p_i if and only if t_{j-i+1} ... t_{j-1} = p_1 ... p_{i-1} and t_j = p_i. Again, it is possible to relate this formula to the movement that occurs in the NFA for each new text character: each state gets the value of the previous state, but this happens only if the text character matches the corresponding arrow. Finally, the "| 0^(m-1) 1" after the shift allows a match to begin at the current text position (this operation is saved in Shift-Or, where all the bits are complemented). This corresponds to the self-loop at the initial state of the automaton.

The cost of this algorithm is O(n). For patterns longer than the computer word (i.e. m > w), the algorithm uses ⌈m/w⌉ computer words for the simulation (not all of them are active all the time), with a worst-case cost of O(mn/w) and still an average-case cost of O(n).

3.1.2 Classes of Characters and Extended Patterns

The Shift-Or algorithm is not only very simple, but it also has some further advantages. The most immediate one is that it is very easy to extend to handle classes of characters, where each pattern position may match not only a single character but a set of characters. If p_i is the set of characters that match position i in the pattern, we set the i-th bit of B[c] for all c ∈ p_i. No other change to the algorithm is necessary. In [?] it is also shown how to allow a limited number k of mismatches in the occurrences, at O(nm log(k)/w) cost.
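The scanning loop above is short enough to show in full. A hedged sketch in C follows (the function name and the restriction m ≤ 64 = w are ours; supporting classes of characters would only change how B is filled):

```c
#include <stdint.h>
#include <string.h>

/* Shift-And scan: returns the number of text positions where pat ends.
   Assumes m <= 64; setting more bits per B[c] would handle classes. */
int shift_and(const char *pat, const char *text) {
    uint64_t B[256] = {0}, D = 0;
    size_t m = strlen(pat);
    for (size_t i = 0; i < m; i++)
        B[(unsigned char)pat[i]] |= 1ULL << i;   /* bit i set iff p_{i+1} = c */
    int count = 0;
    for (size_t j = 0; text[j]; j++) {
        D = ((D << 1) | 1) & B[(unsigned char)text[j]];
        if (D & (1ULL << (m - 1))) count++;      /* d_m set: occurrence ends here */
    }
    return count;
}
```

The body of the loop is exactly the update formula; everything else is preprocessing of B.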
This paradigm was later enhanced [?] to support wild cards, regular expressions, approximate search with nonuniform costs, and combinations of them. Further development of the bit-parallelism approach for approximate string matching yielded some of the fastest algorithms for short patterns [?, ?]. In most cases, the key idea was to simulate an NFA.

Bit-parallelism has become a general way to simulate simple NFAs instead of converting them to deterministic automata. This is how we use it in nrgrep.

3.2 The BDM Algorithm

The main disadvantage of Shift-Or is its inability to skip characters, which makes it slower than the algorithms of the Boyer-Moore [?] or BDM [?, ?] families. We describe in this section the BDM pattern matching algorithm, which is able to skip some text characters.

BDM is based on a suffix automaton. A suffix automaton on a pattern P = p_1 p_2 ... p_m is an automaton that recognizes all the suffixes of P. A nondeterministic version of this automaton has a very regular structure and is shown in Figure 2. In the BDM algorithm [?, ?] this automaton is made deterministic.

[Figure 2: A nondeterministic suffix automaton for the pattern P = "abcdefg". Dashed lines represent ε-transitions (i.e. they occur without consuming any input).]

A very important fact is that this automaton can be used not only to recognize the suffixes of P, but also factors of P. Note that there is a path labeled by x from the initial state if and only if x is a factor of P. That is, the NFA will not run out of active states as long as it has read a factor of P.

The suffix automaton is used to design a simple pattern matching algorithm. This algorithm runs in O(mn) time in the worst case, but it is optimal on average (O(n log_σ(m)/m) time). Other more complex
variations such as TurboBDM [?] and MultiBDM [?, ?] achieve linear time in the worst case.

To search for a pattern P = p_1 p_2 ... p_m in a text T = t_1 t_2 ... t_n, the suffix automaton of P^r = p_m p_{m-1} ... p_1 (i.e. the pattern read backwards) is built. A window of length m is slid along the text, from left to right. The algorithm searches backward inside the window for a factor of the pattern P using the suffix automaton, i.e. the suffix automaton of the reverse pattern is fed with the characters in the text window read backward. This backward search ends in two possible forms:

1. We fail to recognize a factor, i.e. we reach a window character σ that makes the automaton run out of active states. This means that the suffix of the window we have read is no longer a factor of P. Figure 3 illustrates this case. We then shift the window to the right, its starting position corresponding to the position following the character σ (we cannot miss an occurrence because in that case the suffix automaton would have found a factor of it in the window).

[Figure 3: Suffix automaton search. After failing to recognize a factor at character σ, the window can safely be shifted to start just past σ.]

2. We reach the beginning of the window, therefore recognizing the pattern P, since the length-m window is a factor of P (indeed, it is equal to P). We report the occurrence and shift the window by one position.

3.3 Combining Shift-Or and BDM: the BNDM Algorithm

We describe in this section the BNDM pattern matching algorithm [?]. This algorithm, a combination of Shift-Or and BDM, has all the advantages of the bit-parallel forward scan algorithm, and in addition it is able to skip some text characters like BDM.

Instead of making the automaton of Figure 2 deterministic, BNDM simulates it using bit-parallelism. The bit-parallel simulation works as follows. Just as for Shift-And, we keep the state of the search using m bits of a computer word D = d_m ... d_1. Each time we position the window in the text we initialize D = 1^m (this corresponds to the ε-transitions) and scan the window backward. For each new text character read in the window we update D. If we run out of 1's in D then there cannot be a match and we suspend the scanning and shift the window. If, on the other hand, we can perform m iterations, then we report the match.

We use a table B which stores a bit mask for each character c. This mask sets the bits corresponding to the positions where the reversed pattern has the character c (just as in the Shift-And algorithm). The formula to update D is

  D' ← (D & B[t_j]) << 1

BNDM is not only faster than Shift-Or and BDM for moderate pattern lengths, but it can also accommodate all the extensions mentioned in Section 2. In particular, it can easily deal with classes of characters by just altering the preprocessing, and it is by far the fastest algorithm to search for this type of patterns [?, ?].

Note that this type of search is called "backward" scanning because the text characters inside the window are read backwards. However, the search progresses from left to right in the text as the window is shifted. There have been other (few) attempts to skip characters under a Shift-Or approach, for example [?].
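A hedged C sketch of the whole BNDM scan follows. The function name and the restriction m ≤ 64 are ours; the window is advanced by the distance to the last recognized pattern prefix (the variable `last`), which is the standard BNDM shift:

```c
#include <stdint.h>
#include <string.h>

/* BNDM: backward bit-parallel suffix-automaton scan.
   Returns the number of occurrences of pat in text; assumes 1 <= m <= 64. */
int bndm(const char *pat, const char *text) {
    size_t m = strlen(pat), n = strlen(text);
    uint64_t B[256] = {0};
    for (size_t i = 0; i < m; i++)                /* masks of the reversed pattern */
        B[(unsigned char)pat[m - 1 - i]] |= 1ULL << i;
    int count = 0;
    size_t pos = 0;
    while (pos + m <= n) {
        uint64_t D = (m == 64) ? ~0ULL : (1ULL << m) - 1;  /* D = 1^m */
        size_t j = m, last = m;
        while (D) {
            D &= B[(unsigned char)text[pos + j - 1]];      /* read window backward */
            j--;
            if (D & (1ULL << (m - 1))) {          /* the suffix read is a prefix of P */
                if (j == 0) { count++; break; }   /* whole window = pattern */
                last = j;                         /* remember it for the shift */
            }
            D <<= 1;
        }
        pos += last;                              /* skip: never less than 1 */
    }
    return count;
}
```

When the first window character read is not in the pattern at all, D dies immediately and the window jumps by m characters; this is where the speedup over forward scanning comes from.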
3.4 Regular Expression Searching

Bit-parallelism has been successfully used to deal with regular expressions. Shift-Or was extended in two ways [?, ?, ?] to deal with this case, first using Thompson's [?] and later Glushkov's [?] construction of NFAs from the regular expression. Figure 4 shows both constructions for the pattern "abcd(d|)(e|f)de".

[Figure 4: Thompson's (top) and Glushkov's (bottom) resulting NFAs for the regular expression "abcd(d|)(e|f)de".]

Given a regular expression with m positions (each character/class defines a new position), Thompson's construction produces an automaton of up to 2m states. Its advantage is that the states of the resulting automaton can be arranged in a bit mask so that all the transitions move forward except the ε-transitions. This is used [?] for a bit-parallel simulation which moves the bits forward (as for the simple Shift-Or) and then applies all the moves corresponding to ε-transitions. For this sake, a table E mapping bit masks to bit masks is precomputed, so that E[x] yields a bit mask where x has been expanded with all the ε-moves. The code for a transition is therefore

  D' ← ((D << 1) | 0^(s-1) 1) & B[t_j]
  D' ← E[D']

We have used s as the number of states in the Thompson automaton, where m ≤ s ≤ 2m. The main problem is that the E table has 2^s entries. This is handled by splitting the argument horizontally, so for example if s = 32 then two tables E_1 and E_2 can be created which receive half masks and deliver full masks with the ε-expansion of only the bits of their half (of course the ε-transitions can go to the other half; this is why they deliver full masks). In this way the amount of memory required is largely reduced, at the cost of two operations to build the real E value. This takes advantage of the fact that, if a bit mask x is split into two halves x = yz, then E[yz] = E_1[y 0^|z|] | E_2[0^|y| z].
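The split works because ε-expansion distributes over the union of state sets. A small self-contained check on an 8-state toy automaton (the closure masks `eps` are invented for the demonstration; any values would do, since the identity is purely set-algebraic):

```c
#include <stdint.h>

enum { S = 8 };                          /* states of a toy Thompson automaton */

/* eps[i]: ε-closure of state i as a bit mask (made up for the demo). */
static const uint8_t eps[S] = {0x03, 0x06, 0x0c, 0x18, 0x30, 0x60, 0xc0, 0x81};

/* Full table: E[x] ORs the closures of all set bits of x (2^S entries). */
uint8_t E(uint8_t x) {
    uint8_t r = 0;
    for (int i = 0; i < S; i++)
        if (x & (1u << i)) r |= eps[i];
    return r;
}

/* The horizontal split: E restricted to each 4-bit half can be tabulated
   with 16 entries per half instead of 256, then recombined with one "|". */
uint8_t E_split(uint8_t x) {
    return E(x & 0x0f) | E(x & 0xf0);
}
```

Here `E_split(x) == E(x)` for every mask x, which is exactly the identity E[yz] = E_1[y 0^|z|] | E_2[0^|y| z] stated above.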
an automaton with exactly m + 1 states, which can be as low as half the number of states generated by Thompson's construction. On the other hand, the structure of the arrows is not regular, and the trick of forward shift plus ε-moves cannot be used. Instead, it has another property: all the arrows leading to a given state are labeled by the same character or class. This property has been used recently […] to provide a space-economical bit-parallel simulation where the code for a transition is</p><p>D ← T[D] & B[t_j]</p><p>where T is a table that receives a bitmask D of states and delivers another bitmask of the states reachable from the states in D, no matter by which characters. The ε-transitions do not have to be dealt with, because Glushkov's construction does not produce them. The T table can be horizontally partitioned as well.</p><p>It has been shown […] that a bit-parallel implementation of Glushkov's construction is faster than one of Thompson's, which should be clear since in general the tables obtained are much smaller. An interesting improvement, possible thanks to bit-parallelism, is that classes of characters can be dealt with by the normal mechanism used for simple patterns, without generating one state for each alternative.</p><p>On the other hand, a deterministic automaton (DFA) can be built from the nondeterministic one. It is not hard to see that indeed the previous constructions simulate a DFA, since each state of the DFA can be seen as a set of states of the NFA, and each possible set of states of the NFA is represented by a bitmask. Normally the DFA takes less space, because only the reachable combinations are generated and stored, while for direct access to the tables we need to store in the bit-parallel simulations all the possible combinations of active and inactive states. On the other hand, bit-parallelism permits extending regular expressions with classes of characters and other features (e.g. approximate searching), which is difficult otherwise. Furthermore, Glushkov's construction permits not storing a table of states × characters, of worst-case size O(2^m × σ) in the case of a DFA, but just the table T, of size O(2^m). Finally, in case of space problems the technique of splitting the bitmasks can be applied.</p><p>Therefore, we use the bit-parallel simulation of Glushkov's automaton for nrgrep. After the update operation we check whether a final state of D has been reached (this means just an and operation with the mask of final states). Describing Glushkov's NFA construction algorithm […] is outside the scope of this paper, but it takes O(m^2) time. The result of the construction can be represented as a table B[c], which yields the states reached by character c (no matter from where), and a table Follow(i), which yields the bitmask of the states activated from state i, no matter by which character. From Follow, the deterministic version T can be built in O(2^m) worst-case time with the following procedure:</p><p>T[0] ← 0</p><p>for i ∈ 0 ... m</p><p>for j ∈ 0 ... 2^i − 1</p><p>T[2^i + j] ← Follow(i) | T[j]</p><p>A backward search algorithm for regular expressions is also possible […], and in some cases the search is much faster than a forward search. The idea is as follows. First, we compute the length of the shortest path from the initial state to a final state (using a simple graph algorithm). This will be the length of the window, in order not to lose any occurrence. Second, we reverse all the arrows of the automaton, make all the states initial, and take as the only final state the original initial state. The resulting automaton will have active states as long as we have read
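(Before moving on to the backward version, here is a small Python sketch of the forward Glushkov simulation just described. The Follow and B tables are hand-built for the tiny regex a(b|c)d — hand-built because the construction algorithm itself is outside the scope here — while T is built with exactly the doubling procedure given above; the `| 1` in the scan is the initial self-loop that enables text scanning. All names are mine.)

```python
def build_T(follow):
    """T[D] = union of Follow(i) over the states i active in bitmask D,
    built with the doubling procedure given in the text."""
    s = len(follow)                    # number of states (m + 1)
    T = [0] * (1 << s)
    for i in range(s):
        bit = 1 << i
        for j in range(bit):
            T[bit + j] = follow[i] | T[j]
    return T

def scan(text, follow, B, final_mask):
    """Forward scan D <- (T[D] & B[c]) | 1; returns end positions of matches."""
    T = build_T(follow)
    D = 1                              # only the initial state 0 is active
    ends = []
    for pos, c in enumerate(text):
        D = (T[D] & B.get(c, 0)) | 1   # | 1 keeps the initial self-loop active
        if D & final_mask:
            ends.append(pos)
    return ends

# Hand-built Glushkov tables for the regex a(b|c)d: positions 1=a, 2=b, 3=c,
# 4=d; states 0..4, state 0 initial, state 4 final.
follow = [0b00010,   # Follow(0) = {1}
          0b01100,   # Follow(1) = {2, 3}
          0b10000,   # Follow(2) = {4}
          0b10000,   # Follow(3) = {4}
          0b00000]   # Follow(4) = {}
B = {'a': 0b00010, 'b': 0b00100, 'c': 0b01000, 'd': 0b10000}
final_mask = 0b10000
```

Running `scan("xacdabdz", follow, B, final_mask)` reports the end positions 3 and 6, for the occurrences "acd" and "abd".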
a reverse factor of a string matching the regular expression, and will reach its final state when we read, in particular, a reverse prefix. Figure 4 illustrates the result.</p><p>Figure 4: An automaton recognizing reverse prefixes of abcd(d|ε)(e|f)de, based on the Glushkov construction of Figure 3.</p><p>We apply the same BNDM technique of reading the text window backwards. If the automaton runs out of active states, then no factor of an occurrence of the pattern is present in the window, and we can shift the window, aligning its beginning to one position after the one that caused the mismatch. If, on the other hand, we reach the beginning of the window in the backward scan, we cannot guarantee that an occurrence has been found. When searching for a simple string, the only way to reach the window beginning is to have read the whole pattern. Regular expressions, on the other hand, can have occurrences of different lengths, and all we know is that we have matched a factor. There are in fact two choices:</p><p>- The final state of the automaton is not active, which means that we have not read a prefix of an occurrence. In this case we shift the window by one position and resume the scanning.</p><p>- The final state of the automaton is active. Since we have found a pattern prefix, we have to perform a forward verification starting at the window initial position, until either we find an occurrence or the automaton runs out of active states.</p><p>So we need, apart from the reversed automaton, also the normal automaton (without the initial self-loop, as in Figure 3) for the verification of potential occurrences.</p><p>An extra complication comes from the fact that the NFA with reverse arrows does not have the property that all the arrows leading to a state are labeled by the same character. Rather, all the
arrows leaving a state are labeled by the same character. Hence the simulation can be done as follows:</p><p>D ← T^R[D & B[t_j]]</p><p>where T^R corresponds to the reverse arrows, but B is that of the forward automaton […].</p><p>3.4 Approximate String Matching</p><p>Approximate searching means finding the text substrings that can be converted into the pattern by performing at most k "operations" on them. Permitting a limited number k of such differences (also called errors) is an essential tool to recover from typing, spelling and OCR (optical character recognition) errors. Although approximate pattern matching can be reduced to a problem of regular expression searching, the regular expression grows exponentially with the number of allowed errors (or differences).</p><p>A first design decision is what should be taken as an error. Based on existing surveys […], we have chosen the following four types of errors: insertion of characters, deletion of characters, replacement of a character by another character, and exchange of adjacent characters (transposition). These errors are symmetric, in the sense that one can consider that they occur in the pattern or in the text, and the result is the same. Traditionally, only the first three errors have been permitted, because transposition, despite being recognized as a very important source of errors, is harder to handle. However, the problem is known to grow very fast in complexity as k increases, and since a transposition can only be simulated with two errors of the other kinds (i.e. an insertion plus a deletion), we would need to double k in order to obtain a similar result. One of the algorithmic contributions of nrgrep is a bit-parallel algorithm permitting transpositions together with the other types of errors. This permits us to search with smaller
k values and hence obtain faster searching with similar (or better) retrieval results.</p><p>Approximate searching is characterized by the fact that no known search algorithm is the best in all cases […]. From the wealth of existing solutions, we have selected those that adapt best to our goals of flexibility and uniformity. Three main ideas can be used.</p><p>3.4.1 Forward Searching</p><p>The most basic idea that is well suited to bit-parallelism […] is to have k + 1 similar automata, representing the state of the search when zero to k errors have occurred. Apart from the normal arrows inside each automaton, there are arrows going from automaton i to i + 1, corresponding to the different errors. The original approach […] did not consider transpositions, which have been dealt with later […].</p><p>Figure 5 shows an example with k = 2. Let us first focus on the big nodes and the solid and dashed lines. Apart from the normal forward arrows, we have three types of arrows that lead from each row to the next one (i.e. increment the number of errors): vertical arrows, which represent the insertion of characters in the pattern (since they advance in the text but not in the pattern); diagonal arrows, which represent replacement of the current text character by a pattern character (since they advance in the text and in the pattern); and dashed diagonal arrows (ε-transitions), which represent deletion of characters in the pattern (since they advance in the pattern without consuming any text input). The remaining arrows (the dotted ones) represent transpositions, which permit reading the next two pattern characters in the wrong order and moving to the next row. This is achieved by means of "temporary" states, which we have drawn as smaller circles.</p><p>If we disregard the transpositions, there are different ways to simulate this automaton in O(1) time
when it fits in a computer word […], but no bit-parallel solution has been presented to account for the transpositions. This is one of our contributions, and it is explained later in the paper.</p><p>Figure 5: A nondeterministic automaton accepting the pattern abcd with at most 2 errors. Unlabeled solid arrows match any character, while the dashed (not dotted) lines are ε-transitions.</p><p>We extend a simpler bit-parallel simulation […], which takes O(k) time per text character as long as m = O(w). The technique stores each row in a machine word R_0 ... R_k, just as we did for Shift-And in Section 3.1. The bitmasks R_i are initialized to 0^{m−i} 1^i, to account for i possible initial deletions in the pattern. The update procedure to produce R'_i upon reading text character t_j is as follows:</p><p>R'_0 ← ((R_0 << 1) | 0^{m−1} 1) & B[t_j]</p><p>for i ∈ 1 ... k do</p><p>R'_i ← ((R_i << 1) & B[t_j]) | R_{i−1} | (R_{i−1} << 1) | (R'_{i−1} << 1)</p><p>where of course many coding optimizations are possible (and are done in nrgrep), but they make the code less clear. In particular, using the complemented version of the representation (as in Shift-Or) is a bit faster.</p><p>The rationale of the procedure is as follows. R_0 has the same update formula as in the Shift-And algorithm. For the others, the update formula is the or of four possible facts. The first one corresponds to the normal forward arrows (note that there is no initial self-loop for them, only for R_0). The second one brings 1's (state activations) from the upper row at the same position, which corresponds to a vertical arrow, i.e. an insertion. The third one brings 1's from the upper row at the
previous positions (this is obtained with the left shift), corresponding to a diagonal arrow, i.e. a replacement. The fourth one is similar, but it works on the newer value of the previous row (R'_{i−1} instead of R_{i−1}), and hence it corresponds to an ε-transition, i.e. a deletion.</p><p>A match is detected when R_k & 1 0^{m−1} is not zero. It is not hard to show that whenever the final state of R_i is active, the final state of R_k is active too, so it suffices to consider R_k as holding the only final state.</p><p>3.4.2 Backward Searching</p><p>Backward searching can be easily adapted from the forward-searching automaton, following the same techniques used for exact searching […]. That is, we build the automaton of Figure 5 on the reverse pattern, consider all the states as initial ones, and consider as the only final state the first node of the last row. This will recognize all the reverse prefixes of P allowing at most k errors, and it will have active states as long as some factor of P has been seen (with at most k errors).</p><p>Some observations are of interest. First, note that we will never shift the window before examining at least k + 1 characters (since we cannot make k errors before that). Second, the length of the window has to be that of the shortest possible match, which, because of deletions in the pattern, is m − k. Third, just as it happens with regular expressions, the fact that we arrive at the beginning of the window with some active states in the automaton does not immediately mean that we have an occurrence, so we have to check the text for a complete occurrence starting at the window beginning.</p><p>It has been shown […] that this algorithm takes time O(k (k + log m) / (m − k)) per text character, for m ≤ w.</p><p>3.4.3 Splitting into k + 1 Subpatterns</p><p>A well known property […]
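(Before turning to pattern splitting, the forward row-wise simulation of Section 3.4.1 can be sketched in Python as follows. This covers insertions, deletions and replacements only — the transposition extension is omitted — and it adds one detail beyond the formulas above: the lowest bit of every row R_i, i ≥ 1, is kept set after each step, so that occurrences may begin at any text position. That seeding is a choice of this sketch for self-containedness, in the spirit of the coding details the text folds into its optimizations, not necessarily how nrgrep codes it.)

```python
def approx_search(pattern, text, k):
    """End positions of substrings matching `pattern` with at most k
    insertions, deletions or replacements (no transpositions)."""
    m = len(pattern)
    mask = (1 << m) - 1
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << i)
    R = [(1 << i) - 1 for i in range(k + 1)]   # R[i] = 0^{m-i} 1^i
    ends = []
    for pos, c in enumerate(text):
        Bc = B.get(c, 0)
        old = R[0]
        R[0] = ((R[0] << 1) | 1) & Bc
        for i in range(1, k + 1):
            new = (((R[i] << 1) & Bc)   # forward arrow: one more char matched
                   | old                # insertion: upper row, same position
                   | (old << 1)         # replacement: upper row, previous position
                   | (R[i - 1] << 1)    # deletion: *new* upper row, shifted
                   | 1)                 # seed: occurrences may start anywhere
            old, R[i] = R[i], new & mask
        if R[k] & (1 << (m - 1)):       # final state of the last row
            ends.append(pos)
    return ends
```

For pattern "abcd" with k = 1 on the text "xabcdx", the reported end positions are 3 ("abc", one deletion), 4 (exact) and 5 ("abcdx", one insertion).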
establishes that, under the model of insertions, deletions and replacements, if the pattern is cut into k + 1 contiguous pieces, then at least one of the pieces occurs unchanged inside any occurrence with k errors or less. This is easily verified, because each operation can alter at most one piece. So the technique consists of performing a multipattern search for the pieces without errors, and checking the text surrounding the occurrences of each piece for a complete approximate occurrence of the whole pattern. This leads to the fastest algorithms for low error levels […].</p><p>The property is not true if we add the transposition, because this operation can alter two contiguous pieces at the same time. Much better than splitting the pattern into 2k + 1 pieces is to split it into k + 1 pieces and leave one unused character between each pair of pieces […]. Under this partition a transposition can alter at most one piece.</p><p>We are now confronted with a multipattern search problem. This can be solved with a very simple modification of the single-pattern backward search algorithm […]. Consider the pattern abracadabra searched with two errors. We split it into "abr", "cad" and "bra". Figure 6 depicts the automaton used for backward searching of the three pieces. This is implemented with the same bit-parallel mechanism as for a single pattern, except that (1) there are more final states, and (2) an extra bit mask is necessary to avoid propagating 1's through the missing arrows. This extra bit mask is implemented at no cost, by removing the corresponding 1's from the B masks during the preprocessing. Note that for this to work we need all the pieces to have the same length.</p><p>Figure 6: An automaton recognizing reverse prefixes of selected pieces of abracadabra.</p><p>4 Searching for Simple
Patterns</p><p>Nrgrep directly uses the BNDM algorithm when it searches for simple patterns. However, some modifications are necessary to convert the pattern matching algorithm into a complete search tool.</p><p>4.1 Record Oriented Output</p><p>The first issue to consider is what we will report from the results of the search. Printing the text positions i where a match occurs is normally of little help for the user. Printing the text portion that matched (i.e. the occurrence) does not help much either, because this is equal to the pattern (at least if no classes of characters are used). We have followed agrep's philosophy: the most useful way to present the results is to print a context of the text portion that matched the pattern.</p><p>This context is defined as follows. The text is considered to be a sequence of records. A user-defined record delimiter determines the text positions where a new record starts. The text areas before the first record separator and after the last record separator are considered records as well. When a pattern is found in the text, the whole record where the occurrence lies is printed. If the occurrence overlaps with or contains a record delimiter, then it is not considered a pattern occurrence.</p><p>The record delimiter is by default the newline character, but the user can specify any other simple pattern as a record delimiter. For example, the string "From " can be used to delimit e-mail messages in a mail archive, thereby making it possible to retrieve the complete e-mails that contain a given string. The system permits specifying whether the record delimiter should be contained in the next or in the previous record. Moreover, nrgrep permits specifying an extra record separator to be emitted when the records are printed (e.g. adding another newline).</p><p>It should be clear by now that searching for longer strings is faster than searching for
shorter ones. Since record delimiters tend to be short strings, it is not a good idea to delimit the text records first and then search for the pattern inside each record. Rather, we prefer to search for the pattern in the text without any consideration of record separators and, when the pattern is found, search for the next and previous record delimiters. At this point we may determine that the occurrence overlaps a record delimiter and discard it. Note that we have to be able to search for record delimiters both forwards and backwards. We use the same BNDM algorithm to search for record delimiters.</p><p>There are some nrgrep options, however, that make a record-wise traversal necessary: printing record numbers, or printing the records that do not contain the pattern. In this case we advance in the file by delimiting the records first and then searching for the pattern inside each record.</p><p>4.2 Text Buffering</p><p>Text buffering is necessary to cope with large files and to achieve optimum performance. In nrgrep the buffer size is set by default to 64 Kb, because it fits well the cache size of many machines, but this default can be overridden by the user. To avoid complex interactions between record limits and buffer limits, we discard the last incomplete record each time we read a new buffer from disk. The "discarded" partial record is moved to the beginning of the buffer before more text is read at the next buffer load. If a record is larger than the buffer size, then an artificial record delimiter is inserted to correct the situation (and a warning message is printed). Note that this also requires the ability to search for the record delimiter in the backward direction. This technique works well unless the record size is large compared to the buffer size, in which case the user should enlarge the buffer using the appropriate option.</p><p>4.3
Contexts</p><p>Another capability of nrgrep is context specification. This means that the pattern occurrence has to be surrounded by certain characters in order to be considered as such. For example, one may specify that the pattern should match a whole word (i.e. be surrounded by separators) or a whole record (i.e. be surrounded by record delimiters). However, it is not just a matter of adding the context strings at the ends of the pattern because, for example, a word may be at the beginning of a record and hence the separator may be absent. We solve this by checking each occurrence found, to determine that the required string is present before/after the occurrence, or that we reached the beginning/end of the record.</p><p>This seems trivial for a simple pattern, because its length is fixed, but for more complex patterns (such as regular expressions), where there may be many different occurrences in the same text area, we need a way to discard possible occurrences and still check for other ones that are basically in the same place. For example, the search for "a*ba*" as a whole word should match in the text "aaa aabaa aaa". Despite the fact that "b" alone is an occurrence of the pattern that does not fit the whole-word criterion, the occurrence can be extended to another one that does. We return to this issue later.</p><p>4.4 Subpattern Filter</p><p>The BNDM algorithm is designed for the case m ≤ w. Otherwise, we have in principle to simulate the algorithm using many computer words. However, as shown in […], it is much faster to prune the pattern to its first w characters, search for that subpattern, and try to extend its occurrences to occurrences of the complete pattern. This is because, for reasonably large w, the probability of finding a pattern of length w is low enough to make the cost of the unnecessary verifications
negligible. On the other hand, the benefit of a possible shift of length m > w would be cancelled by the need to update ⌈m/w⌉ computer words per text character read.</p><p>Hence we select a contiguous subpattern of w characters (or classes; remember that a class also needs just one bit, the same as a character) and search for it. Its occurrences are verified against the complete pattern prior to checking records and contexts.</p><p>The main point is which part of the pattern to search for. In the abstract algorithms of […], any part is equally good or bad, because a uniformly distributed model is assumed. In practice, different characters have different probabilities, and some pattern positions may hold classes of characters, whose probability is the sum of those of the individual characters. This is in fact farther-reaching than the problem of the limit w on the length of the pattern: even in a short pattern we may prefer not to include some part of the pattern in the fast scanning. This is discussed in detail in the next subsection.</p><p>4.5 Selecting the Optimal Scanning Subpattern</p><p>Let us consider the pattern "hello...a". Pattern positions 6 to 8 match the whole alphabet, which means that the search with the nondeterministic automaton inside the window will examine at least four window positions (even in a text window like "xxxxxxxxx") and will shift at most by 5, so the average number of comparisons per character is at the very best 4/5. If we take the subpattern "hello", then we can obtain an average much closer to 1/5 operations per text character.</p><p>We have designed a general algorithm that, under the assumption that the text characters are independent, finds the best search subpattern in O(m^3) worst-case time (although in practice it is closer to O(m^2 log m)). This is a modest overhead in most practical text
search scenarios. The algorithm is tailored to the BDM/BNDM search technique and works as follows.</p><p>First, we build an array prob[1 ... m], which stores the sum of the probabilities of the characters participating in the class at each pattern position. Nrgrep stores an array of English letter probabilities, but this can be tailored to other purposes, and the final scheme is robust with respect to changes in those probabilities from one language to another. The construction of prob takes O(m) time in the worst case.</p><p>Second, we build an array pprob[1 ... m, 1 ... m], where pprob[i, ℓ] stores the probability of matching the subpattern P_{i+1 ... i+ℓ}. This is computed in O(m^2) time by dynamic programming, as follows:</p><p>pprob[i, 0] ← 1, pprob[i, ℓ] ← prob[i + 1] × pprob[i + 1, ℓ − 1]</p><p>for increasing ℓ values.</p><p>Third, we build an array mprob[1 ... m, 1 ... m, 1 ... m], where mprob[i, j, ℓ] gives the probability of matching some pattern substring of length ℓ inside P_{i+1 ... j}. This is computed in O(m^3) time by dynamic programming, using the formulas</p><p>mprob[i, j, 0] ← 1</p><p>mprob[i, j, ℓ] ← 0, if ℓ > j − i</p><p>mprob[i, j, ℓ] ← 1 − (1 − pprob[i, ℓ]) × (1 − mprob[i + 1, j, ℓ]), otherwise</p><p>for decreasing i values. Note that we have used the formula for the union of independent events, Pr(A ∪ B) = 1 − (1 − Pr(A)) × (1 − Pr(B)).</p><p>Finally, the average cost per character associated to a subpattern P_{i+1 ... j} is computed with the following rationale. With probability 1 we inspect one window character. If any pattern position in the range i + 1 ... j matches the window character read, then we read a second character (recall Section 3.2). If any consecutive pair of pattern positions in i + 1 ... j matches the two window characters read, then we read a third
character, and so on. This is what we have computed in mprob. The expected number of window characters to read is therefore</p><p>pcost(i, j) ← 1 + mprob[i, j, 1] + mprob[i, j, 2] + ... + mprob[i, j, j − i − 1]</p><p>In the BDM/BNDM algorithm, as soon as the window suffix read ceases to be a factor of the pattern, we shift the window to the position following the character that caused the mismatch. A simplified computation considers that the above pcost(i, j) (which is an average) can be used as a fixed value, and therefore we approximate the real average number of operations per text character as</p><p>ops(i, j) ← pcost(i, j) / (j − i − pcost(i, j) + 1)</p><p>which is the average number of characters inspected divided by the shift obtained. Once this is computed, we select the (i, j) pair that minimizes the work to do. We also avoid considering cases where j − i > w. The total time needed to obtain this is O(m^3).</p><p>The amount of work is reduced by noting that we can check the ranges in increasing order of i values, and therefore we do not need the first coordinate of mprob (which can be independently computed for each i). Moreover, we start by considering the maximum j, since in practice longer subpatterns tend to be better than shorter ones. We keep the best value found up to now and avoid considering ranges (i, j) which cannot be better than the current best solution even for pcost(i, j) = 1 (note that, since i is tried in ascending order and j in descending order, the whole pattern is tried first). This reduces the cost to O(m^2 log m) in practice.</p><p>Figure 7: A nondeterministic automaton accepting the pattern abc?d?efg?h.</p><p>As a result of this procedure we not only obtain the best subpattern to search for (under a simplified cost model), but also a hint of how
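(The selection procedure just described can be transcribed directly into Python. This sketch is deliberately unoptimized — O(m^4), since it recomputes the union probabilities per window instead of sharing mprob as the O(m^3) dynamic programming above does — and all names are mine; positions are 0-based here.)

```python
def best_subpattern(prob, w=32):
    """Search window [i, j) of the pattern minimizing ops(i, j) =
    pcost / (window - pcost + 1), the expected inspections per shifted
    text character. prob[p] = matching probability of pattern position p
    (for a class, the sum over its characters)."""
    m = len(prob)
    # pp[i][l]: probability of matching the l consecutive positions starting at i.
    pp = [[1.0] * (m + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for l in range(1, m + 1):
            pp[i][l] = prob[i] * pp[i + 1][l - 1] if i + l <= m else 0.0
    best, best_ops = None, float("inf")
    for i in range(m):
        for j in range(i + 1, min(i + w, m) + 1):
            pcost = 1.0                       # the first character is always read
            for l in range(1, j - i):
                # mprob(i, j, l): some length-l substring of [i, j) matches
                # (union of independent events, as in the text).
                none_match = 1.0
                for s in range(i, j - l + 1):
                    none_match *= 1.0 - pp[s][l]
                pcost += 1.0 - none_match
            ops = pcost / ((j - i) - pcost + 1.0)
            if ops < best_ops:
                best, best_ops = (i, j), ops
    return best, best_ops
```

With probabilities modeling "hello" followed by three all-matching positions and a final rare letter, the procedure picks the "hello" part, as in the discussion above.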
many operations per character we will perform. If this number is larger than 1, then it is faster and safer to switch to plain Shift-Or. This is precisely what nrgrep does.</p><p>5 Searching for Extended Patterns</p><p>As explained in Section 2, we have considered optional and repeatable (classes of) characters as the features allowed in our extended patterns. Each of these features is treated in a different way, and all are integrated in an automaton which is more general than that of Figure 1. Over this automaton we later apply the general forward and backward search machinery.</p><p>5.1 Optional Characters</p><p>Let us consider the pattern "abc?d?efg?h". A nondeterministic automaton accepting that pattern is drawn in Figure 7.</p><p>The figure is chosen so as to show that multiple consecutive optional characters may exist. This rules out the simplest solution (which works when that does not happen): one could set up a bit mask O with ones at the optional positions (in our example, O = 01001100) and let the ones in previous states of D propagate to them. Hence, after the normal update to D, we could perform</p><p>D ← D | ((D << 1) & O)</p><p>For example, this works if we have read "abcdef" (D = 00100000) and the next text character is "h", since the above operation would convert D into 01100000 before operating it against B["h"] = 10000000. However, it does not work if the text is "abefgh", where both consecutive optional characters have been omitted.</p><p>A general solution needs to propagate each active state in D so as to flood all the states ahead of it that correspond to optional characters. In our example, we would like that when D is 00000010 (and in general whenever its second bit is active), it becomes 00001110 after the flooding.</p><p>This is achieved with three masks, A, I and F, marking different aspects of the states related to
optional characters. More specifically, the i-th bit of A is set if the i-th pattern position is optional; the bit of I is set at the position just preceding the first optional character of each block (of consecutive optional characters); and the bit of F is set at the position after the last optional character of a block.</p><p>Figure 8: A nondeterministic automaton accepting the pattern abc*def+gh.</p><p>In our example, A = 01001100, I = 00100010 and F = 10010000. After performing the normal transition on D, we do as follows:</p><p>Df ← D | F</p><p>D ← D | (A & ((~(Df − I)) ^ Df))</p><p>whose rationale is as follows. The first line adds a 1 at the position following each optional block in D. In the second line we add some active states to D. Since the states to add are and-ed with A, let us just consider what happens inside a specific optional block. The effect that we want is that the first 1 (counting from the right) floods all the block bits to the left of it. Subtracting I from Df is equivalent to subtracting 1 at (the entrance of) each block. This subtraction cannot propagate its effect outside the block, because there is a 1 (coming from "| F") in Df just after the highest bit of the block. The effect of the subtraction is that all the bits up to the first 1 (counting from the right) are reversed (e.g. 0110100 − 1 = 0110011) and the rest are unchanged. In general, b_x b_{x−1} ... b_{x−y} 1 0^z − 1 = b_x b_{x−1} ... b_{x−y} 0 1^z. When this is complemented by the "~" operation we get ~b_x ~b_{x−1} ... ~b_{x−y} 1 0^z. Finally, when this is xor-ed with the same Df = b_x b_{x−1} ... b_{x−y} 1 0^z, we get 1^{y+1} 0^{z+1}.</p><p>This is precisely the effect we wanted: the last 1 has flooded all the bits to its left. That 1 itself has been converted to zero in the process, but it is restored when the result is or-ed with the
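(The complete flooding step can be checked in a few lines of Python. The masks below are hand-built for the running example abc?d?efg?h; note that in this sketch I places its bit on the state just before each optional block, which is what lets the borrowed subtraction start the flood there — with these masks, the worked example D = 00000010 → 00001110 comes out exactly. Function names are mine.)

```python
# Masks for "abc?d?efg?h"; bit i of D = "pattern positions 1..i+1 matched".
A = 0b01001100    # optional positions: c, d, g
I = 0b00100010    # states just before each optional block: b, f
F = 0b10010000    # states just after each optional block: e, (end)

def flood(D):
    """Propagate active states through blocks of optional characters:
    Df <- D | F ; D <- D | (A & (~(Df - I) ^ Df))."""
    Df = D | F                       # guard bit above each block stops the borrow
    return D | (A & (~(Df - I) ^ Df))

def step(D, Bc):
    """One forward-scan step: normal Shift-And transition, then flooding."""
    D = ((D << 1) | 1) & Bc
    return flood(D) & 0xFF

# A tiny scanner over this fixed pattern (final state: h, bit 7).
Bmask = {c: 1 << i for i, c in enumerate("abcdefgh")}

def match(text):
    D = 0
    for c in text:
        D = step(D, Bmask[c])
    return bool(D & 0b10000000)
```

`match("abefgh")` is True — precisely the case where both consecutive optional characters are omitted and the simple O-mask trick fails.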
original D. This works even if the last active state in the optional block is the leftmost bit of the block. Note that it is necessary to and with A at the end, to filter out the bits of F that survive the process whenever the block is not all zeros. On the other hand, it is necessary to or Df with F, because a block of all zeros would propagate the subtraction outside its limits.</p><p>Note that we could have a border problem if there are optional characters at the beginning of the pattern. As seen later, however, this cannot happen when we select the best subpattern for fast scanning, but it has to be dealt with when verifying the whole pattern.</p><p>5.2 Repeatable Characters</p><p>There are two kinds of repeatable characters, marked in the syntax by "*" (zero or more repetitions) and "+" (one or more repetitions). Each of them can be simulated using the other, since a+ ≡ aa* and a* ≡ (a+)?. For involved technical reasons (which are related, for example, to the ability to easily build the masks for the reversed pattern, and to the border conditions for the verification), we preferred the second choice, despite the fact that it uses one more bit than necessary for the "*" operation. Figure 8 shows the automaton for "abc*def+gh".</p><p>The bit-parallel simulation of this automaton is more straightforward than for the optional characters. We just need a mask S[c] that for each character c tells which pattern positions can remain active when we read character c. In the above example, S["c"] = 00000100 and S["f"] = 00100000. The ε-transition is handled with the mechanism for optional characters. Therefore, a complete simulation step permitting optional and repeatable characters is as follows:</p><p>D ← (((D << 1) | 0^{m−1} 1) & B[t_j]) | (D & S[t_j])</p><p>Df ← D | F</p><p>D ← D | (A & ((~(Df − I)) ^ Df))</p><p>5.3 Forward and
Backward Search</p><p>Our aim is to extend the approach used for simple patterns to patterns containing optional symbols. Forward scanning is immediate once we learn how to simulate the different automata using bit-parallelism: we just have to add an initial self-loop to enable text scanning (this is already done in the last formula). We detect the final positions of occurrences and then check the surrounding record and context conditions.</p><p>Backward searching needs, in principle, just an automaton that recognizes the reverse factors of the pattern. This is obtained by building exactly the same automaton of the forward scan (without the initial self-loop) on the reverse pattern, and letting all the states be initial (i.e., initializing D with all the states active). However, there are some problems to deal with, all of them deriving from the fact that the occurrences now have variable length.</p><p>Since the occurrences do not have a fixed length, we have to compute the minimum length of a possible match of the pattern (6 in the example "abc*def*gh") and use this value as the width of the search window, in order not to lose any potential occurrence. As before, we set up an automaton that recognizes all the reverse factors of the pattern and use it to traverse the window backward. Because the occurrences are not all of the same length, the fact that we arrive at the beginning of the window does not immediately imply that the pattern is present. For example, the text window could contain "cdefffg": despite that this is a factor of the pattern (and therefore we would reach the window beginning), no pattern occurrence starts at the beginning of the window.</p><p>Therefore, each time we arrive at the beginning of the window, we have to check that the initial state is active and then run a forward verification from the window beginning on, until either we find a match (i.e., the last automaton state is activated) or we determine that no match can start at the window position under consideration (i.e., the automaton runs out of active states). The automaton used for this forward verification is the same as for forward scanning, except that the initial self-loop is absent. However, as we see next, verification is in fact a little more complicated, and we mix it with the rest of the verifications that are needed on every occurrence (surrounding record, context, etc.).</p><p>4.4 Verifying Occurrences</p><p>In fact, the verification is a little different since, as we see in the next subsection, we select a subpattern for the scanning (as before, this is necessary if the pattern has more than w characters/classes, but it can also be convenient on shorter patterns). Say that P = P_1 S P_2, where S is the subpattern that has been selected for fast text scanning. Each time the backward search determines that a given text position is a potential start point for an occurrence of S, we obtain the surrounding record and check, from the candidate text position, the occurrence of S P_2 in forward direction and of P_1 in backward direction (do not confuse forward/backward scanning with forward/backward verification).</p><p>When we had a simple pattern, this just needed to check that each text character belonged to the corresponding pattern class. Since the occurrences have variable length, we need to use pre-built automata for S P_2 and for the reversed version of P_1. These automata do not have the initial self-loop. However, this time we need to use a multi-word bit-parallel simulation, since the patterns could be longer than the computer word.</p><p>Note also that, under this scenario, the forward scanning also needs verification.
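The backward window scan is easiest to see on a plain string, where it reduces to the classic BNDM algorithm that nrgrep generalizes. The following sketch (names are illustrative) shows the window/shift mechanics described above:

```python
def bndm(pattern, text):
    """Classic BNDM scan for a plain string: slide a window of length m,
    read it backward with a bit-parallel nondeterministic suffix automaton,
    and shift the window to the last recognized pattern prefix."""
    m, n = len(pattern), len(text)
    full = (1 << m) - 1
    B = {}
    for i, c in enumerate(pattern):            # masks of the reversed pattern
        B[c] = B.get(c, 0) | (1 << (m - 1 - i))
    occ, pos = [], 0
    while pos <= n - m:
        j, last, D = m, m, full
        while D:
            D &= B.get(text[pos + j - 1], 0)   # backward step on window char
            j -= 1
            if D & (1 << (m - 1)):             # a pattern prefix was seen
                if j > 0:
                    last = j                   # remember it for the shift
                else:
                    occ.append(pos)            # the whole window matched
                    break
            D = (D << 1) & full
        pos += last
    return occ
```

For extended patterns, as explained above, reaching the window beginning only makes the position a candidate, which must then be verified forward.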
In this case we find a guaranteed end position of S in the text (not a candidate one as for backward searching). Hence we check, from that final position, P_2 in forward direction and P_1 S in backward direction. Note that we need to check S again, because the automaton cannot tell where the beginning of S is, and there could be various beginning positions.</p><p>A final complication is introduced by record limits and context conditions. Since we require that a valid occurrence lies totally inside a record, we should submit for verification the smallest possible occurrence. For example, the pattern "b+a?b*cde*" has many occurrences in the text record "bbbcdee", and we should report "bcd" in order to guarantee that no valid occurrence will be missed. However, it is possible that the context conditions require the presence of certain strings immediately preceding or following the occurrence. For example, if the context condition tells that the occurrence should begin a record, then the minimal occurrence "bcd" would not qualify, while "bbbcde" would do.</p><p>Fortunately, context conditions about the initial and final positions of occurrences are independent, and hence we can check them separately. So we treat the forward and backward parts of the verification separately. For each one, we traverse the text, find all the positions that represent the beginning (or ending) of occurrences, and submit them to the context checking mechanism until one is accepted, the record ends, or the automaton runs out of active states.</p><p>4.5 Selecting a Good Search Subpattern</p><p>As before, we would like to select the best subpattern for text scanning, since we have to check the potential occurrences anyway. We want to apply an algorithm similar to that for simple patterns. However, this time the problem is more complicated, because there are more arrows 
in the automaton.</p><p>We compute prob as before, adding another array sprob[1 ... m], where sprob(i) is the probability of staying at state i via the S(c) masks. The major complication is that a factor of length ℓ starting at pattern position i does not necessarily finish at pattern position i + ℓ − 1. Therefore, we fill an array pprob[1 ... m, 1 ... m, 0 ... L], where L is the minimum length of an occurrence of P (at most m), and pprob(i, j, ℓ) is the sum of the probabilities of all the factors of length ℓ that start at position i and do not reach after pattern position j. In our example "abc*def*gh", pprob(3, 6, 5) should account for "ccccc", "ccccd", "cccde", "ccdef" and "cdeff".</p><p>This is computed as</p><p>pprob(i, i − 1, ℓ) ← 0</p><p>(i ≤ j) pprob(i, j, 0) ← 1</p><p>(i ≤ j, ℓ > 0) pprob(i, j, ℓ) ← prob(i) × pprob(i + 1, j, ℓ − 1) + sprob(i) × pprob(i, j, ℓ − 1) + (if the i-th bit of A is set) pprob(i + 1, j, ℓ)</p><p>which is filled for decreasing i and increasing ℓ in O(m^3) time. Note that we are simplifying the computation of probabilities, since we are computing the probability of a set of factors as the sum of the individual probabilities, which is only true if the factors are disjoint.</p><p>Similarly to pprob(i, j, ℓ), we have mprob(i, j, ℓ), the probability of any factor of length ℓ starting at position i or later and not surpassing position j. This is trivially computed from pprob, as for simple patterns.</p><p>Finally, we compute the average cost per character as before. We consider subpatterns from length min(m, w) down to a length that is so short that we will always prefer the best solution found up to then. For each subpattern considered we compute the expected cost per window as the sum of the probabilities of the subpatterns of each length, i.e.,</p><p>pcost(i, j) ← mprob(i, j, 1) + mprob(i, j, 2) + ... + mprob(i, j, j − i + 1)</p><p>and later obtain the cost per character as pcost(i, j)/(ℓ + 1), where ℓ is the minimum length of an occurrence of the interval [i ... j].</p><p>As for simple patterns, we fill pprob and mprob in lazy form, together with the computation of the best factor. This makes the expected cost of the algorithm closer to O(m^2 log m) than to the worst case O(m^3) (really O(m^2 min(m, w))).</p><p>Note that it is not possible (because it is not optimal) to select a subpattern that starts or ends with optional characters or with "*". If it happens that the best subpattern gives one or more operations per text character, we switch to forward searching and select the first min(m, w) characters of the pattern (excluding initial and final "*"s or "?"s).</p><p>In particular, note that it is possible that the selected scanning subpattern has no optional or repeatable character at all. In this case we use the scanning algorithm for simple patterns, despite that at verification time we use the checking algorithm for extended patterns.</p><p>5 Searching for Regular Expressions</p><p>The most complex patterns that nrgrep can search for are regular expressions. For this sake, we use the bit-parallel technique described earlier, both for forward and backward searching. However, some aspects need to be dealt with in real software.</p><p>5.1 Space Problems</p><p>The first problem is space, since the T table needs O(2^m) entries, and this can be unmanageable for long patterns. Despite that we expect the patterns not to be very long in typical text searching, some reasonable solution has to be provided for when this is not the case.</p><p>We permit the user to specify the amount of memory that can be used for the table, and split the bitmasks in as many parts as needed to meet the space requirements.
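The splitting relies on the fact that NFA transitions distribute over unions of state sets, so a table over the whole mask can be replaced by several tables over slices of it, combined with or. A minimal sketch (toy automaton and names are illustrative, not nrgrep's code):

```python
def split_transition_table(delta, m, parts=2):
    """Tabulate a set-to-set transition function over m-bit state sets with
    `parts` smaller tables instead of one table of 2^m entries.  Valid
    because delta distributes over union: delta(X | Y) == delta(X) | delta(Y)."""
    width = (m + parts - 1) // parts
    pieces = []
    for p in range(parts):
        shift = p * width
        bits = max(0, min(width, m - shift))
        table = [delta(x << shift) for x in range(1 << bits)]
        pieces.append((table, shift, (1 << bits) - 1))

    def lookup(D):
        out = 0
        for table, shift, mask in pieces:
            out |= table[(D >> shift) & mask]
        return out

    return lookup

# Toy 4-state chain automaton: state i moves to state i+1 (3 is a sink).
succ = [0b0010, 0b0100, 0b1000, 0b0000]

def delta(D):
    out = 0
    for i in range(4):
        if (D >> i) & 1:
            out |= succ[i]
    return out

lookup = split_transition_table(delta, 4)
```

Two tables of 2^{m/2} entries each replace the single 2^m-entry table, at the price of one extra lookup and or per text character.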
Since we do not search for masks longer than w bits, it is in fact unlikely that the text scanning part needs more than 2 or 3 table accesses per text character. Attending to the most common alternatives, we developed separate code for the cases of 1 and 2 tables, which permits much faster scanning since register usage is enabled.</p><p>5.2 Subpattern Filtering</p><p>As for simple and extended patterns, we select the best subpattern for the scanning phase, and check all the potential occurrences for complete occurrences and for record and context conditions. This means, according to Section 4.4, that we need a forward and a backward automaton for the selected subpattern, and that we also need forward and backward verification automata to check for the complete occurrence. An exception is when, given the result of selecting the best factor, we prefer to use forward scanning, in which case only that automaton is needed (but the two verification automata are still necessary). All the complications addressed for extended patterns are present on regular expressions as well, namely those derived from the fact that the occurrences may have different lengths.</p><p>It may also happen that, after selecting the search subpattern, it turns out to be just a simple or an extended pattern, in which case the search is handled by the appropriate search algorithm, which should be faster. The verification is handled as a general regular expression. Note that heuristics like those of Gnu Grep, which tries to find a literal string inside the regular expression in order to use it as a filter, are no more than particular cases of our general optimization method. This makes our approach much smoother than others. For example, Gnu Grep will be much faster to search for "c(aaaaa|bbbbb)c" (where the strings "caaaaac" and/or "cbbbbbc" can be 
used to filter the search) than for "c[ab][ab][ab][ab][ab]c", while the difference should not be as large.</p><p>Regarding optimal use of space (and also because accessing smaller tables is faster), we use a state remapping function: when a subset of the states of the automaton is selected for scanning, we build a new automaton with only the necessary states. This reduces the size of the tables.</p><p>What is left is to explain how we select the best factor to search for. This is much more complex on regular expressions than on extended patterns.</p><p>5.3 Selecting an Optimal Necessary Factor</p><p>The first nontrivial task on a regular expression is to determine what a necessary factor is; it is defined as a subexpression that has to match inside every occurrence of the whole expression. For example, "fgh" is not a necessary factor of "ab(cde|fghi)jk", but "cd|fgh" is. Note that any range of states is a necessary factor in a simple or extended pattern.</p><p>Determining necessary subexpressions is complicated if we try to do it using just the automaton graph. We rather make use of the syntax tree of the regular expression as well. The main trouble is caused by the "|" operator, which forces us to choose one necessary factor from each branch. We simplify the problem by fixing the minimum occurrence length and searching for a necessary factor of that length on each side. In our previous example, we could select "cd|fg" or "de|hi", for example, but not "cd|fgh". However, we are able to take the whole construction, namely "cde|fghi".</p><p>The procedure is recursive and aims at finding the best necessary factor of minimum length ℓ, provided we already know (we consider later the computation of these data):</p><p>• wlens[1 ... m], where wlens(i) is the minimum length of a path from i to a final state;</p><p>• mark, markf[1 ... m, 0 ... L], where 
mark(i, ℓ) is the bitmask of all the states reachable in ℓ steps from state i, and markf is the same restricted to the corresponding final states;</p><p>• cost[1 ... m, 0 ... L], where cost(i, ℓ) is the average cost per character if we search for all the paths of length ℓ leaving from state i.</p><p>The method starts by analyzing the root operator of the syntax tree. Depending on the type of operand, we do as follows.</p><p>Concatenation: choose the best factor (minimum cost) among both sides of the expression. Note that factors starting in the left side can continue into the right side, but we are deciding about where the initial positions are.</p><p>Union: choose the best factor from each side and take the union of the states. The average cost is the maximum over the two sides (not the sum), since the cost is related to the average number of characters to inspect.</p><p>One or more repetitions ("+"): the minimum is one repetition, so ignore the node and treat the subtree.</p><p>Zero or more repetitions ("*", "?"): the subtree cannot contain a necessary factor, since it does not need to appear in the occurrences. Nothing can be found starting inside it. We assign a high cost to the factors starting inside the subexpression, to make sure that it is not chosen in a concatenation and that a union containing it will be equally undesirable.</p><p>Simple character or class (tree leaf): there is only one possible initial position, so choose it unless its wlens value is smaller than ℓ.</p><p>The procedure delivers a set of selected states and the proper initial and final states for the selected subautomaton. It also delivers a reasonable approximation of the average cost per character.</p><p>The rest of the work is to obtain the input for this procedure. The ideas are similar to those for extended patterns, but the techniques need to make heavier use of graph traversal and are more expensive. For example, just to compute wlens and L = wlens(initial state), i.e., shortest paths from any state to a final state, we need a Floyd-like algorithm that takes O(Lm^3/w) = O(min(m^3, m^4/w)) time.</p><p>Arrow probabilities prob are loaded as before, but pprob(i, ℓ) now gives the total probability of all the paths of length ℓ leaving state i, and it includes the probability of reaching i from a previous state. In Glushkov's construction all the arrows reaching a state have the same label, so pprob is computed for increasing ℓ using the formulas</p><p>pprob(i, 0) ← 1</p><p>pprob(i, 1) ← prob(i)</p><p>(ℓ > 1) pprob(i, ℓ) ← prob(i) × Σ_{j, (i,j) ∈ NFA} pprob(j, ℓ − 1)</p><p>This takes O(Lm^2) = O(min(m^3, m^2 w)) time. At the same time we compute the mark and markf masks, at a total cost of O(min(m^3, m^4/w)):</p><p>mark(i, 0) ← 0^{m−i} 1 0^i;  markf(i, 0) ← mark(i, 0) if i is final, else 0^{m+1}</p><p>(ℓ ≥ 1) mark(i, ℓ) ← mark(i, ℓ − 1) | ∪_{j, (i,j) ∈ NFA} mark(j, ℓ − 1)</p><p>(ℓ ≥ 1) markf(i, ℓ) ← ∪_{j, (i,j) ∈ NFA} markf(j, ℓ − 1)</p><p>for increasing ℓ values. Once pprob is computed, we build mprob[0 ... m, 0 ... L, 0 ... L], so that mprob(i, ℓ, ℓ′) is the probability of any path of length ℓ′ inside the area mark(i, ℓ). This is computed in O(L^2 m^2) = O(min(m^4, m^2 w^2)) time with the formulas</p><p>(ℓ′ > ℓ) mprob(i, ℓ, ℓ′) ← 0</p><p>mprob(i, ℓ, 0) ← 1</p><p>(ℓ ≥ ℓ′ ≥ 1) mprob(i, ℓ, ℓ′) ← 1 − (1 − pprob(i, ℓ′)) × Π_{j, (i,j) ∈ NFA} (1 − mprob(j, ℓ − 1, ℓ′))</p><p>Finally, the search cost is computed as always using mprob, in O(L^2 m) = O(min(m^3, m w^2)) time.
Again, it is possible that the selected scanning subpattern is in fact a simpler type of pattern, such as a simple or an extended one. In this case we use the appropriate scanning procedure, which should be faster.</p><p>Another nontrivial possibility is that even the best necessary factor is too bad (i.e., it has a high predicted cost per character). In this case we select, from the initial state of the automaton, a subset of it that fits in the computer word (i.e., at most w states), intended for forward scanning. Since all the occurrences found with this automaton will have to be checked, we would like to select the subautomaton that finds the fewest possible spurious matches. If m ≤ w then we use the whole automaton; otherwise we try to minimize the probability of getting outside the set of selected states. To compute this, we consider that the probability of reaching a state is inversely proportional to the length of its shortest path from the initial state. Hence we add states farther and farther from the root until we have w states. All this takes O(m^2) time.</p><p>This probabilistic assumption is of course simplistic, but an optimal solution is quite hard, and at this point we know that the search will be costly anyway.</p><p>5.4 Verification</p><p>A final nontrivial problem is how to determine which states should be present in the forward and backward verification automata once the best scanning subautomaton is selected. Since we have chosen a necessary factor, we know for sure that the automaton is "cut" in two parts by the necessary factor. If we have chosen backward scanning, then the verifications start from the initial states of the scanning automaton; otherwise they start from its final states.</p><p>The figure below illustrates these automata for the pattern "abcd(d(e|f)*)?de", where we have selected "d(d(e|f)*)?" as our scanning subexpression. This corresponds to the states {3, 4, 5, 6, 7} of the original automaton.
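The greedy state selection described above (adding states by increasing distance from the initial state) amounts to a breadth-first traversal cut off at w states. A minimal sketch, with illustrative names and an adjacency-list automaton:

```python
from collections import deque

def select_subautomaton(succ, initial, w):
    """Greedily keep at most w states, taken in increasing breadth-first
    distance from the initial state (nearer states are assumed to be the
    most likely to be reached, per the simplistic model in the text)."""
    selected, seen, q = [], {initial}, deque([initial])
    while q and len(selected) < w:
        s = q.popleft()
        selected.append(s)
        for t in succ[s]:
            if t not in seen:
                seen.add(t)
                q.append(t)
    return selected
```

For example, with w = 3 on an automaton branching at the initial state, the initial state and its two direct successors are kept, and deeper states are dropped.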
The selected arrows and states are in boldface in the figure (note that the arrows leaving the selected subautomaton are not selected). Depending on whether the selected subautomaton is searched with backward or forward scanning, we know the text position where its initial or final states, respectively, were reached. Hence, in backward scanning the verification starts from the initial states of the subautomaton, while it starts from the final states in the case of forward scanning. We need two full automata: the original one and one with the arrows reversed.</p><p>Figure: forward and backward verification automata corresponding to forward and backward scanning of the automaton for "abcd(d(e|f)*)?de". The shaded states are the initial ones for the verification. In practice we use the complete automaton, because "cutting off" the exact verification subautomaton is too complicated.</p><p>There is, however, a complication when verifying a regular expression in this scheme. Assume the search pattern is AXB|CYD, for strings A, B, C, D, X and Y. The algorithm could select X|Y as the scanning subpattern. Now, in the text AXD, the backward verification for A|C would find A and the forward verification for B|D would find D, and therefore an occurrence would be incorrectly triggered. Worse than that, it could be the case that X = Y, so there is no hope of distinguishing what to check based on the initial states of the scanning automaton.</p><p>The problem, which does not appear in simpler types of patterns, is that there is a set of initial 
states for the verification. The backward and forward verifications tell us that some state of the set can be extended to a complete occurrence, but there is no guarantee that a single state in that set can be extended in both directions. To overcome this problem, we check the initial states one by one, instead of using a set of initial states and doing just one verification.</p><p>6 Approximate Pattern Matching</p><p>All the previous algorithms permit specifying a flexible search pattern, but they do not allow any difference between the pattern specified and its occurrence in the text. We now consider the problem of permitting at most k insertions, deletions, replacements or transpositions in the pattern.</p><p>We let the user specify that only a subset of the four allowed error types is permitted. However, designing a different search algorithm for each subset was impractical and against the spirit of a uniform software. So we have considered that our criterion of permitting these four types of errors would be the most commonly preferred in practice, and we have a unique scanning phase under this model (as in all the previous algorithms, we have a scanning and a verification phase). Only at the verification phase, which is hopefully executed few times, do we take care of applying only the permitted operations. Note that we cannot miss any potential match, because we scan with the most permissive error model. We also wrote in nrgrep specialized code for k = 1 and k = 2, which are the most common cases; using a fixed k permits better register usage.</p><p>6.1 Simple Patterns</p><p>We make use of the three techniques described for simple patterns, choosing the one that promises to be the best.</p><p>6.1.1 Forward and Backward Searching</p><p>In forward searching, k + 1 similar rows corresponding to the pattern are used. There exists a 
bit-parallel algorithm to simulate the automaton in the case of k insertions, deletions and replacements, but although the automaton that incorporates transpositions has been depicted in previous work, no bit-parallel formula for the complete operation had been shown. We do that now.</p><p>We store each row in a machine word, R_0 ... R_k, just as for the base technique. The temporary states are stored as T_1 ... T_k. The bitmasks R_i are initialized to 0^{m−i} 1^i as before, while all the T_i are initialized to 0^m. The update procedure to produce R′ and T′ upon reading text character t_j is as follows:</p><p>R′_0 ← ((R_0 << 1) | 0^{m−1} 1) & B[t_j]</p><p>for i ∈ 1 ... k do</p><p>R′_i ← ((R_i << 1) & B[t_j]) | R_{i−1} | (R_{i−1} << 1) | (R′_{i−1} << 1) | (T_i & (B[t_j] << 1))</p><p>T′_i ← (R_{i−1} << 2) & B[t_j]</p><p>The rationale of the procedure is as follows. The first three lines are as for the base technique without transpositions. The second line of the formula for R′_i corresponds to transpositions. Note that the new T value is computed from the old R value, so it is one text position behind. Once we compute the new bitmasks R′ for text character t_j, we take those old R masks that were not updated with t_j, and shift them by two positions (aligning them for position t_{j+1} and killing the states that do not match t_j). At the next iteration (t_{j+1}), we shift left the mask B[t_{j+1}] and kill the states of T that do not match. The net effect is that, at iteration j + 1, we are or-ing R′_i with</p><p>T_i(j + 1) & (B[t_{j+1}] << 1) = ((R_{i−1}(j − 1) << 2) & B[t_j]) & (B[t_{j+1}] << 1) = (((R_{i−1}(j − 1) << 1) & B[t_{j+1}]) << 1) & B[t_j]</p><p>which corresponds to two forward transitions with the characters t_{j+1} t_j. If those characters, read in transposed order, matched the text, then we permit the activation of R′_i.</p><p>A match is detected as before, when R_k & 1 0^{m−1} is not zero. Since we may have cut the pattern to w characters, and because of the context conditions and the possibility that the user really wanted to permit only some types of errors, we have to check each match found by the automaton before reporting it. We detail the verification procedure later.</p><p>Backward searching is adapted from this technique exactly as it is done from the basic algorithm without transpositions. A subtle point is that we cannot consider the automaton dead until both the R and the T masks run out of active states, since T can awake a dead R.</p><p>As before, we select the best subpattern to search for, which is of length at most w. The algorithm to select the best subpattern has to account for the fact that we are allowing errors. A good approximation is obtained by considering that any factor will be alive for k turns, and then adding the normal expected number of window characters to read until the factor does not match. So we add k to the cost cost(i, j) used for exact searching. Now it is more likely than before (for large k/m) that the final cost per character is greater than one, in which case we prefer forward scanning.</p><p>6.1.2 Splitting into k + 1 Subpatterns</p><p>This technique can be directly adapted from previous work, taking care of leaving a hole between each pair of pieces. For the checking of complete occurrences in candidate text areas we use the general verification engine (we have a different preprocessing for the case of each of the k + 1 pieces matching).
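The row update with transpositions described above can be turned into working code. The sketch below keeps the initial state as an explicit low bit instead of leaving it implicit, so the formulas differ slightly (the "| 1" seeds); it is an illustration consistent with the construction, not nrgrep's actual implementation:

```python
def search_errors(pattern, text, k):
    """Bit-parallel scan allowing k insertions, deletions, replacements and
    transpositions.  The initial state is an explicit bit 0, so masks are
    m+1 bits wide and pattern position i is bit i.  Returns the text
    positions (0-based) where an occurrence with at most k errors ends."""
    m = len(pattern)
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << (i + 1))
    R = [(1 << (i + 1)) - 1 for i in range(k + 1)]  # low bits via deletions
    T = [0] * (k + 1)
    occ = []
    for j, c in enumerate(text):
        Bc = B.get(c, 0)
        nR = [((R[0] << 1) & Bc) | 1]               # exact row, with self-loop
        nT = [0]
        for i in range(1, k + 1):
            r = (R[i] << 1) & Bc                    # match
            r |= R[i - 1]                           # insertion
            r |= R[i - 1] << 1                      # replacement
            r |= (nR[i - 1] << 1) | 1               # deletion (+ initial state)
            r |= T[i] & (Bc << 1)                   # finish a transposition
            nR.append(r)
            nT.append((R[i - 1] << 2) & Bc)         # start a transposition
        R, T = nR, nT
        if R[k] & (1 << m):
            occ.append(j)
    return occ
```

For instance, "bacd" matches "abcd" with a single transposition, and "abd" with a single deletion, while "xyzw" never matches.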
Note that it makes no sense to design a forward search algorithm for this case: if the average cost per character is more than one, this means that the probability of finding a subpattern is so high that we will pay too much time verifying spurious matches, in which case the whole method does not work.</p><p>The main difficulty that remains is how to find the best set of k + 1 equal-length pieces to search for. We start by computing prob and pprob as in the subpattern selection for simple patterns. The only difference is that now we are interested only in lengths up to L = ⌊(m − k)/(k + 1)⌋, which is the maximum length of a piece (the m − k comes out because there are k unused characters in the partition). This cuts down the cost of computing these vectors to O(m^3/k^2).</p><p>(Footnote: the numerator is in fact converted into m if no transpositions are permitted. Despite that the scanning phase will allow transpositions anyway, the possible matches missed by converting the numerator to m all include transpositions, which by hypothesis are not permitted.)</p><p>Now we compute pcost[1 ... m, 0 ... L], where pcost(i, ℓ) gives the average cost per window when searching for the factor P_{i ... i+ℓ−1}. For this sake, we compute for each i value the matrix mprob[0 ... L, 0 ... L], where mprob(ℓ, r) is the probability of any factor of length r in P_{i ... i+ℓ−1}. This is computed for each i in O(m^2/k^2) time as</p><p>(ℓ < r) mprob(ℓ, r) ← 0</p><p>mprob(ℓ, 0) ← 1</p><p>(ℓ ≥ r ≥ 1) mprob(ℓ, r) ← 1 − (1 − pprob(i + ℓ − r, r)) × (1 − mprob(ℓ − 1, r))</p><p>pcost(i, ℓ) ← Σ_{r=0}^{L} mprob(ℓ, r)</p><p>All this is computed, for every i and for increasing ℓ, in O(m^3/k^2) total time. The justification for the previous formula is similar to that given for extended patterns. Now, the most interesting part is mbest[1 ... m, 1 ... k + 1], where mbest(i, s) is the expected cost per character of 
the best s pattern pieces, all of the same length, starting at pattern position i or later. Together with mbest we have ibest(i, s), which tells where the first string must start in order to obtain mbest.</p><p>The maximum length for the pieces is L. We try all the lengths from L down to 1, until we determine that even in the best case we cannot improve the best cost found so far. For each possible piece length ℓ we compute the whole mbest and ibest as follows:</p><p>mbest(i, 0) ← 0</p><p>(i > m − s ℓ − (s − 1) + 1, s ≥ 1) mbest(i, s) ← ∞</p><p>(i ≤ m − s ℓ − (s − 1) + 1, s ≥ 1) mbest(i, s) ← min(mbest(i + 1, s), 1 − (1 − cost) × (1 − mbest(i + ℓ + 1, s − 1))), where cost = min(1, pcost(i, ℓ)/(ℓ + 1))</p><p>which is computed for decreasing i. The rationale is that mbest(i, s) can choose whether to start the first piece immediately or not. The second case is easy, since mbest(i + 1, s) is already computed, and ibest(i, s) is made equal to ibest(i + 1, s). In the first case, the cost of the first piece is cost and the other s − 1 pieces are chosen in the best way from P_{i+ℓ+1 ... m}, according to mbest(i + ℓ + 1, s − 1). In this case ibest(i, s) = i. Note that as the total cost to search for the k + 1 pieces we could take the maximum cost, but we obtained better results by using the model of the probability of the union of independent events.</p><p>All this costs O(m^2) in the worst case, but normally the longest pieces are the best and the cost becomes O(mk). At the end we have the best length ℓ for the pieces, the expected cost per character in mbest(1, k + 1), and the optimal initial positions for the pieces in i_0 = ibest(1, k + 1), i_1 = ibest(i_0 + ℓ + 1, k), i_2 = ibest(i_1 + ℓ + 1, k − 1), and so on.
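The mbest/ibest recurrence can be sketched as follows, assuming the per-piece window costs pcost(i, ℓ)/(ℓ + 1) have already been collapsed into a table of per-start costs (all names are illustrative, and 1.0 stands in for the "∞" of the recurrence, since costs are capped probabilities):

```python
def best_pieces(costs, m, k, ell):
    """Choose k+1 pieces of length ell, with at least a one-position hole
    between consecutive pieces, minimizing the union-model combined cost
    1 - prod(1 - cost_piece).  `costs` maps a 1-based starting position to
    the cost of the piece starting there; 1.0 is the worst value."""
    s_max = k + 1
    hi = m + ell + 2                       # padded index range
    mbest = [[1.0] * (s_max + 1) for _ in range(hi + 2)]
    ibest = [[None] * (s_max + 1) for _ in range(hi + 2)]
    for i in range(hi + 1, 0, -1):
        mbest[i][0] = 0.0
        for s in range(1, s_max + 1):
            if i > m - s * ell - (s - 1) + 1:
                continue                    # s pieces no longer fit after i
            skip = mbest[i + 1][s]
            c = min(1.0, costs.get(i, 1.0))
            take = 1 - (1 - c) * (1 - mbest[i + ell + 1][s - 1])
            if take < skip:
                mbest[i][s], ibest[i][s] = take, i
            else:
                mbest[i][s], ibest[i][s] = skip, ibest[i + 1][s]
    starts, i = [], 1                       # recover the optimal positions
    for s in range(s_max, 0, -1):
        i = ibest[i][s]
        starts.append(i)
        i += ell + 1
    return mbest[1][s_max], starts
```

On a toy cost table, the procedure picks the two cheapest non-overlapping pieces and reports the combined union-model cost.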
Finally, if mbest(1, k + 1) ≥ 1 (actually, if it is larger than a somewhat smaller threshold), we know that we reach the window beginning with probability high enough that the whole scheme will not work well. In this case we have to choose between forward and backward searching.</p><p>Summarizing the costs, we pay O(m^3/k^2 + m^2) in the worst case and O(m^3/k^2 + k^2 m) in practice. Normally k is small, so the cost is close to O(m^3) in all cases.</p><p>6.1.3 Verification</p><p>Finally, we explain the verification procedure. This is necessary because of the context conditions, because of the possibly restricted set of edit operations, and because in some cases we are not sure that there is actually a match. Verification can be called from the forward scanning (in which case, in general, we have partitioned the pattern as P = P_1 S P_2 and know that at a given text position a match of S ends), from the backward scanning (where we have partitioned the pattern in the same way and know that at a given text position the match of a factor of S begins), or from the partitioning into k + 1 pieces (in which case we have partitioned P = P_0 S_1 P_1 ... S_{k+1} P_{k+1} and know that at a given text position the exact occurrence of a given S_i begins).</p><p>In general, all we know about the possible match of P around text position j is that, if we partition P = P_L P_R (a partition that we know), then there should be a match of P_L ending at T_j and a match of P_R starting at T_{j+1}, and the total number of errors should not exceed k. In all the cases, we have precomputed forward automata corresponding to P_L and P_R (the first one reversed, because its verification goes backward). In forward and backward scanning there is just one choice of P_L and P_R, while for the partitioning into k + 1 pieces there are k + 1 choices (all of them precomputed).</p><p>We run the forward and backward verification 
We run the forward and backward verification from text character j, in both directions. Fortunately, the context conditions that make a match valid or invalid can be checked at each extreme separately. So we go backward and, among all the positions where a legal match of P_L can begin, we choose the one with minimum error level k_L. Then we go forward looking for legal occurrences of P_R with k_R = k - k_L errors. If we find one, then we report an occurrence (and the whole record is reported). In fact we take advantage of each match found during the traversal: if we are looking for the pattern with k errors and find a legal endpoint with k' &lt;= k, we still continue searching for better occurrences, but now allowing just k' - 1 errors. This saves time because the verification automaton needs to use just (k' - 1) + 1 rows.</p><p>A complicated condition can arise because of transpositions: the optimal solution may involve transposing T_j with T_{j+1}, an operation that we are not permitting because we chose to split the verification there. We solve this in a rather simple but effective way: if we cannot find an occurrence in a candidate area, we give it a second chance after transposing both characters in the text.</p><p>6.2 Extended Patterns</p><p>The treatment for extended patterns is quite a combination of the extension from simple to extended patterns without errors and the use of k + 1 rows, or the partition into k + 1 pieces, in order to permit k errors. However, there are a few complications that deserve mention.</p><p>A first one is that the propagation of active states due to optional and repeatable characters does not mix well with transpositions (for example, try to draw an automaton that finds "abc?de" with one transposition in the text "...adbe..."). We solved the problem in a rather practical way. Instead of trying to simulate a
complex automaton, we have two parallel automata. The R masks are used in the normal way without permitting transpositions (but permitting the other errors and the optional and repeatable characters), and they are always one character behind the current text position. Each T mask is obtained from the R mask of the previous row by processing the last two text characters in reverse order, and its result is used for the R mask of the same row in the next iteration. The code to process text position j is as follows (the initial values are as always):</p><p>R'_0 &lt;- ((R_0 &lt;&lt; 1) | 0^{m-1}1) &amp; B[t_{j-1}]</p><p>for i in 1 ... k do</p><p>R'_i &lt;- ((R_i &lt;&lt; 1) &amp; B[t_{j-1}]) | (R_i &amp; S[t_{j-1}]) | R_{i-1} | (R_{i-1} &lt;&lt; 1) | (R'_{i-1} &lt;&lt; 1) | T_{i-1}</p><p>Df &lt;- R'_i | F</p><p>R'_i &lt;- R'_i | (A &amp; ((~(Df - I)) ^ Df))</p><p>T'_i &lt;- (((R_{i-1} &lt;&lt; 1) | 0^{m-1}1) &amp; B[t_j]) | (R_{i-1} &amp; S[t_j])</p><p>Df &lt;- T'_i | F</p><p>T'_i &lt;- T'_i | (A &amp; ((~(Df - I)) ^ Df))</p><p>T'_i &lt;- (((T'_i &lt;&lt; 1) | 0^{m-1}1) &amp; B[t_{j-1}]) | (T'_i &amp; S[t_{j-1}])</p><p>where the A, I and F masks are defined as before.</p><p>A second complication is that the subpattern optimization changes. For the forward and backward automata we use the same technique as for searching without errors, but we add k to the number of processed window positions for any subpattern. So it remains to be explained how the optimization works for the partitioning into k + 1 subpatterns.</p><p>We have prob and sprob computed in O(m) time as for normal extended patterns. A first condition is that the k + 1 pieces must be disjoint, so we compute reach(0 ... m, 0 ... L'), where L' = floor((L - k)/(k + 1)) is the maximum length of a piece as before (L is the minimum length of a
pattern occurrence, as before), and reach(i, ell) gives the last pattern position reachable in ell text characters from i. This is computed in O(m^2) time as reach(i, 0) &lt;- i and, for increasing ell &gt; 0:</p><p>t &lt;- reach(i, ell - 1) + 1</p><p>while (t &lt; m) and (A &amp; 0^{m-t} 1 0^{t-1} != 0^m) do t &lt;- t + 1</p><p>reach(i, ell) &lt;- t</p><p>We then compute pprob, mprob and pcost exactly as before, in O(m^3) time. Finally, the selection of the best set of k + 1 subpatterns is done with mbest and ibest just as in Section 6.1.3, the only difference being that reach is used to determine which is the first unused position if pattern position i is chosen and a minimum piece length ell has to be reserved for it. The total process takes O(m^3) time.</p><p>6.3 Regular Expressions</p><p>Finally, nrgrep permits searching for regular expressions allowing errors. The same mechanisms used for simple and extended patterns are applied, namely using k + 1 replicas of the search automaton and splitting the pattern into k + 1 disjoint pieces. Both adaptations present important complications with respect to their simpler counterparts.</p><p>6.3.1 Forward and Backward Search</p><p>One problem in regular expressions is that the concept of "forward" does not immediately mean one shift to the right. Approximate searching with k + 1 copies of the NFA of the regular expression implies being able to move "forward" from row i to row i + 1, as Figure 12 shows.</p><p>Figure 12: Glushkov's resulting NFAs for the search of the regular expression abcd(d|eps)(e|f)de with two insertions, deletions or replacements; the three rows correspond to 0, 1 and 2 errors. To simplify the plot, the dashed lines represent deletions and replacements (i.e. they move by Sigma or eps), while the vertical lines represent insertions (i.e. they move by Sigma).
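For intuition, the k + 1 rows can be simulated bit-parallel even for a plain string pattern, which the NFA replicas above generalize. The sketch below is a standard Wu–Manber-style formulation (ours, not nrgrep's actual code; it assumes m &lt;= 31 and k &lt;= 31):

```c
#include <assert.h>
#include <string.h>

/* Counts the text positions where pat matches with at most k errors
 * (insertions, deletions, substitutions), using k+1 bit-mask rows. */
int count_approx(const char *pat, const char *txt, int k)
{
    int m = (int)strlen(pat), n = (int)strlen(txt);
    unsigned B[256] = {0}, R[32], high = 1u << (m - 1);
    for (int i = 0; i < m; i++) B[(unsigned char)pat[i]] |= 1u << i;
    for (int i = 0; i <= k; i++) R[i] = (1u << i) - 1;  /* i free deletions */
    int count = 0;
    for (int j = 0; j < n; j++) {
        unsigned b = B[(unsigned char)txt[j]];
        unsigned old = R[0];
        R[0] = ((R[0] << 1) | 1) & b;       /* exact Shift-And on row 0 */
        for (int i = 1; i <= k; i++) {
            unsigned cur = R[i];
            /* match | insertion | substitution+deletion | start state */
            R[i] = (((R[i] << 1) | 1) & b) | old | ((old | R[i - 1]) << 1) | 1;
            old = cur;
        }
        if (R[k] & high) count++;
    }
    return count;
}
```

Row i receives "vertical" arrows (insertions) from the old row i - 1, and "diagonal" arrows (substitutions from the old value, deletions from the already-updated value) via the shifted OR; the regular-expression version replaces the shifts by Glushkov transitions, as explained next.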
For this sake, we use our table T, which for each state D gives the bit mask of all the states reachable from D in one step (recall that in the deterministic simulation, each bit mask D of active NFA states is identified as a state). The update formula upon reading a new text character t_j is therefore</p><p>R'_0 &lt;- T[R_0] &amp; B[t_j]</p><p>oldR_0 &lt;- R_0</p><p>for i in 1 ... k do</p><p>R'_i &lt;- (T[R_i] &amp; B[t_j]) | R_{i-1} | T[R_{i-1} | R'_{i-1}] | (T[T[oldR_{i-1}] &amp; B[t_j]] &amp; B[t_{j-1}])</p><p>oldR_{i-1} &lt;- R_{i-1}</p><p>The rationale is as follows. R'_0 is computed according to the simple formula for regular expression searching without errors. For i &gt; 0, R'_i permits arrows coming from matching the current character (T[R_i] &amp; B[t_j]), "vertical" arrows representing insertions in the pattern (R_{i-1}), and "diagonal" arrows representing replacements (from R_{i-1}) and deletions (from R'_{i-1}), which are joined in T[R_{i-1} | R'_{i-1}]. Finally, transpositions are arranged in a rather simple way: in oldR we store the value of R two positions in the past (note the way we update it to avoid having two arrays, oldR and oldoldR), and each time we permit from that state the processing of the last two text characters in reverse order.</p><p>Forward scanning, backward scanning, and the verification of occurrences are carried out in the normal way using this update technique. The technique to select the best necessary factor is unaltered, except that we add k to the number of characters that are scanned inside every text window.</p><p>6.3.2 Splitting into k + 1 Subexpressions</p><p>If we can select k + 1 necessary factors of the regular expression, as done before for selecting one
necessary factor, then we can be sure that at least one of them will appear unaltered in any occurrence with k errors or less. As before, in order to include the transpositions we need to ensure that one character is left between consecutive necessary factors.</p><p>We first consider how, once the subexpressions have been selected, we can perform the multipattern search. Each subexpression has a set of initial and final states. We reverse all the arrows and convert the formerly initial states of all the subexpressions into final states. Since we search for any reverse factor of the regular expression, all the states are made initial.</p><p>We set the window length to the shortest path from an initial to a final state. At each window position, we read the text characters backward and feed the transformed automaton. Each time we arrive at a final state we know that the prefix of a necessary subexpression has appeared. If we happen to be at the beginning of the window, then we check for the whole pattern as before. We keep masks with the final states of each necessary factor in order to determine, from the current mask of active states, which of them matched (recall that a different verification is triggered for each).</p><p>Figure 13 illustrates a possible selection of two necessary factors (corresponding to k = 1) for our running example abcd(d|eps)(e|f)de. These are abc and (d|eps)(e|f)de, where the first d has been left out to separate both necessary factors. Their minimum length is 3, so this is the window length to use.</p><p>Figure 13: An automaton recognizing reverse prefixes of two necessary factors of abcd(d|eps)(e|f)de, based on the Glushkov construction shown earlier. Note that there are no transitions among the subexpressions.
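The backward window scan just described follows the BNDM idea. For a single plain string (rather than nrgrep's multipattern automaton) it can be sketched as below — a textbook BNDM, shown only to make the window mechanics concrete (assumes m &lt;= 31):

```c
#include <string.h>

/* Counts occurrences of pat in txt with the backward nondeterministic
 * DAWG matching scan: bits of D track which pattern factors are still
 * alive while the window is read backward; a recognized pattern prefix
 * at offset j gives the next safe shift. */
long bndm_count(const char *pat, const char *txt)
{
    size_t m = strlen(pat), n = strlen(txt);
    unsigned B[256] = {0};
    for (size_t i = 0; i < m; i++)            /* reversed-pattern mask */
        B[(unsigned char)pat[i]] |= 1u << (m - 1 - i);
    long count = 0;
    size_t pos = 0;
    while (pos + m <= n) {
        unsigned D = (1u << m) - 1;           /* all factors alive */
        size_t j = m, last = m;
        while (D != 0) {
            D &= B[(unsigned char)txt[pos + j - 1]];
            j--;
            if (D & (1u << (m - 1))) {        /* a pattern prefix seen */
                if (j > 0) last = j;          /* remember safe shift */
                else { count++; break; }      /* whole window = pattern */
            }
            D <<= 1;
        }
        pos += last;                          /* shift the window */
    }
    return count;
}
```

In nrgrep the mask update is replaced by the reversed Glushkov automaton of the necessary factors, but the window/shift logic is the same.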
and disjoint factors from a regular expression. Disjoint means that the subautomata do not share states (this ensures that there is a character separating them, which is necessary for transpositions). Moreover, we want the best set, and we want all of them to have the same minimum path length from an initial to a final state.</p><p>We believe that an algorithm finding the optimal choice has a time cost which grows exponentially with k. We have therefore made the following simplification. Each node s is assigned a number I_s which corresponds to the length of the shortest path reaching it from the initial state. We do not permit picking arbitrary necessary factors, but only those subautomata formed by all the states s that have numbers i &lt;= I_s &lt;= i + ell, for some i and ell. This ell is the minimum window length to search using that subautomaton, and should be the same for all the necessary factors chosen. Moreover, every arrow leaving the chosen subautomaton is dropped. Finally, note that if the i values corresponding to any pair of subautomata chosen differ by more than ell, then we are sure that they are disjoint.</p><p>Figure 14 illustrates this numbering for our running example. The partition obtained in Figure 13 corresponds to choosing the ranges [0, 3] and [4, 7] as the sets of states (this corresponds to the sets {0, 1, 2, 3} and {4, 5, 6, 7, 8, 9}). Of course the method does not permit us to choose all the legal sets of subautomata. In this example, the necessary subautomaton {4, 5, 6} cannot be picked because it includes some, but not all, states numbered 5.</p><p>Figure 14: Numbering Glushkov's NFA states for the regular expression abcd(d|eps)(e|f)de. The states 0 ... 9 receive the numbers I_s = 0, 1, 2, 3, 4, 5, 5, 5, 6, 7.</p><p>Numbering the states is easily done by a closure process starting
from the initial state, in O(Lm^2/w) time, where L is the minimum length of a string matching the whole regular expression. As before, L' = floor((L - k)/(k + 1)) is the maximum possible length of a string matching a necessary factor.</p><p>To determine the best factors, we first compute prob, pprob, mprob and cost exactly as before. The only difference is that now we are only interested in starting from sets of initial states of the same I_s number (there are L possible choices) and in analyzing factors of length no more than L' = O(L/k), which considerably lowers the cost of computing the previous arrays. We also make sure that we do not select subautomata that need more than w bits to be represented.</p><p>Once we have the expected cost of choosing any possible I_s value and any possible minimum factor length ell, we can apply the optimization algorithm of Section 6.1.3, since we have in fact "linearized" the problem: thanks to our simplification, the optimization problem is similar to when we had a single pattern and needed to extract from it the best set of k + 1 disjoint factors of a single length ell. This takes O(L^2) in the worst case but should in practice be closer to O(kL). Hence the total optimization algorithm needs O(m^4) time at worst.</p><p>7 A Pattern Matching Software</p><p>Finally, we are in position to describe the software nrgrep. Nrgrep has been developed in ANSI C and tested on Linux and Solaris platforms. Its source code is publicly available under a Gnu license (from http://www.dcc.uchile.cl/~gnavarro/pubcode/). We discuss now its main aspects.</p><p>7.1 Usage and Options</p><p>Nrgrep receives in the command line a pattern and a list of files to search. The syntax of the pattern is given in
the previous sections. If no options are given, nrgrep searches the files in the order given and prints all the lines where it finds the pattern. If more than one file is given, then the file name is printed prior to the lines that match inside it, if there are any. If no file names are given, it reads the text from the standard input.</p><p>The default behavior of nrgrep can be modified with a set of options, which, prefixed by the minus sign, can precede the pattern specification. Most of them are inspired by agrep:</p><p>-i: the search is case insensitive.</p><p>-w: only whole words matching the pattern are accepted.</p><p>-x: only whole records (e.g. lines) matching the pattern are accepted.</p><p>-c: just counts the matches, does not print them.</p><p>-l: outputs the names of the files containing matches, but not the matches themselves.</p><p>-G: outputs the whole contents of the files containing matches.</p><p>-h: does not output file names.</p><p>-n: prints records preceded by their record number.</p><p>-v: reverse matching, reports the records that do not match.</p><p>-d (delim): sets the record delimiter to (delim), which is "\n" by default. A marker at the end of the delimiter makes it count as a part of the previous record (the default is the next one), so by default the end of line is the record delimiter and is considered a part of the previous record.</p><p>-b (bufsize): sets the buffer size, 64 Kb by default. This affects the efficiency and the possibility of cutting very long records that do not fit in a buffer.</p><p>-s (sep): prints the string (sep) between each pair of records output.</p><p>-k (err)[idst]: allows up to (err) errors in the matches. If "idst" is not present, the errors permitted are (i)nsertions, (d)eletions, (s)ubstitutions and (t)ranspositions; otherwise a subset of them can be specified, e.g. "-k 2ids".</p><p>-L: takes the pattern literally (no special
characters).</p><p>-H: explains the usage and exits.</p><p>We now discuss briefly how each of these options is implemented. -i is easily carried out by using classes of characters; -w and -x are handled using context conditions; -c, -l, -G, -h, -b and -s are easily handled, although some changes are necessary, such as stopping the search when the first match is found for -l and -G. -n and -v are more complicated because they force all the records to be processed, so we first search for the record delimiters and then search for the pattern inside the records (normally we search for the pattern and find record delimiters around occurrences only). -d is arranged by just changing the record delimiter (it has to be a simple pattern, but classes of characters are permitted). -k switches to approximate searching, and the "idst" flags are considered at verification time only. Finally, -L avoids the normal parsing process and considers the pattern as a simple string.</p><p>Of course it is permitted to combine the options in a single string preceded by the minus sign, or as a sequence of strings, each preceded by the minus sign. Some combinations, however, make no sense and are automatically overridden (a warning message is issued): filenames are not printed if the text comes from the standard input; -c and -G are not compatible and -c dominates; -n and -l are not compatible and -l dominates; -l and -G are not compatible and -G dominates; and -G is ignored when working on the standard input (since -G works by printing the whole file from the shell).</p><p>7.2 Parsing the Pattern</p><p>One important aspect that we have not discussed is how the parsing is done. In principle we just have to parse a regular expression. However, our parser module carries out some other tasks, such as
characters�</p><p> are not part of the classical regular expression syntax� The result of the parsing is a syntax</p><p> tree which is not discarded� since as we have seen it is useful later for prepro cessing regular</p><p> expressions�</p><p>Map the pattern to bit mask p ositions� the parser determines the numb er of states required</p><p> by the pattern and assigns the bit p ositions corresp onding to each part of the pattern sp eci�</p><p>�cation�</p><p>Determine typ e of sub expression� given the whole pattern or a subset of its states� the parser</p><p> is able to determine which typ e of expression is involved �simple� extended� or regular expres�</p><p> sion��</p><p>Algebraic simpli�cation� the parser p erforms some algebraic simpli�cation on the pattern to</p><p> optimize the search and reduce the numb er of bits needed for it�</p><p>The parsing phase op erates in a top�down way� It �rst tries to parse an or ����� of sub ex�</p><p> pressions� Each of them is parsed as a concatenation of sub expressions� each of which is a �single</p><p> expression� �nished with a sequence of ���� ��� or ��� terminators� and the single expression is</p><p> either a single symb ol� a class of characters or a top�level expression in parentheses� Apart from</p><p> the syntax tree� the parser pro duces a mapping from p ositions of the pattern strings to leaves of</p><p> the syntax tree� meaning that the character or class describ ed at that pattern p osition must b e</p><p> loaded at that tree leaf�</p><p>The second step is the algebraic simpli�cation� which pro ceeds b ottom�up and is able to enforce</p><p> the following rules at any level</p><p>�� If �C � and �C � are classes of characters then �C � j �C � �� �C C ��</p><p>1 2 1 2 1 2</p><p>�� If �C � is a class� then �C � j �� � j �C � �� �C ���</p><p>�� If E is any sub expression� then E � �� � � E �� E � ��</p><p>�� If E is any sub expression� then E � �� E ��� E � �� E ��� E ��� E � �� E �� �� E �� and</p><p>E �� �� E �� 
E+* becomes E*.</p><p>5. eps | eps, eps?, eps* and eps+ are all rewritten as eps.</p><p>6. A subexpression which can match the empty string and appears at the beginning or at the end of the pattern is replaced by eps, except when we need to match whole words or records.</p><p>To collapse classes of characters into a single node (first rule) we traverse the obtained mapping from pattern positions to tree leaves, find those mapping to the leaves that store [C1] and [C2], and make all of them point to the new leaf that represents [C1 C2].</p><p>More simplifications are possible, but they are more complicated and we chose to stop here for the current version of nrgrep. Some interesting simplifications that may reduce the number of bits required to represent the pattern are xwz | uwv rewritten as (x|u)w(z|v), EE* as E+, and E|E as E, for arbitrarily complex E (right now we do that just for leaves).</p><p>After the simplification is done, we assign positions in the bit mask corresponding to each character/class in the pattern. This is easily done by traversing the leaves left to right and assigning one bit to each leaf of the tree, except to those storing eps instead of a class of characters. Note that this works as expected on simple and extended patterns as well. For several purposes (including Glushkov's NFA construction algorithm) we need to store which are the bit positions used inside every subtree, which of these correspond to initial and final states, and whether the empty string matches the subexpression. All this is easily done in the same recursive traversal over the tree.</p><p>Once we have determined the bits that correspond to each tree leaf and the mapping from pattern positions to tree leaves, we build a table of bit masks B, such that B[c] tells which pattern positions are activated by the character c. This B table can be directly used for simple and extended patterns, and it is the basis to build the DFA transitions of regular expressions.
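A toy illustration of such a B table for a simple pattern with classes, together with the exact Shift-And scan it drives. The mini-parser below (classes written [xyz]) is ours and far cruder than nrgrep's, and assumes at most 32 pattern positions:

```c
#include <string.h>

/* Builds B[c] = bit mask of the pattern positions activated by c.
 * Returns the number of pattern positions, or -1 on a malformed class. */
int build_B(const char *pat, unsigned B[256])
{
    int m = 0;
    memset(B, 0, 256 * sizeof(unsigned));
    for (const char *p = pat; *p; p++) {
        if (*p == '[') {                      /* a class: one position */
            for (p++; *p && *p != ']'; p++)
                B[(unsigned char)*p] |= 1u << m;
            if (*p != ']') return -1;         /* unterminated class */
        } else {
            B[(unsigned char)*p] |= 1u << m;  /* a plain character */
        }
        m++;
    }
    return m;
}

/* Exact Shift-And driven by the B table: counts matching end positions. */
long shift_and_count(const char *pat, const char *txt)
{
    unsigned B[256];
    int m = build_B(pat, B);
    if (m <= 0) return -1;
    unsigned R = 0, high = 1u << (m - 1);
    long count = 0;
    for (const char *t = txt; *t; t++) {
        R = ((R << 1) | 1) & B[(unsigned char)*t];
        if (R & high) count++;
    }
    return count;
}
```

Because classes only change the preprocessing of B, the scan itself is identical to the plain-string case, which is why nrgrep handles classes at no extra search cost.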
Finally, the parser is able to tell which is the type of the pattern that it has processed. This can be different from what the syntax used to express the pattern may suggest: for example, a pattern written with superfluous parentheses and operators may really be a simple pattern, which is discovered by the simplification procedure. This is in our general spirit of not being misled by simple problems presented in a complicated way, which motivated us to select optimal subpatterns to search for. As a secondary effect of the simplifications, the number of bits required to represent the pattern may be reduced.</p><p>A second reason to determine the real type of the pattern is that the scanning subpattern selected can be simpler than the whole pattern, and hence it can admit a faster search algorithm. For this sake we permit determining the type not only of the whole pattern but also of a selected set of positions.</p><p>The algorithm for determining the type of a pattern or subpattern starts by assuming a simple pattern until it finds evidence of an extended pattern or a regular expression. The algorithm enters recursively into the syntax tree of the pattern, avoiding entering subexpressions where no selected state is involved. Among those that have to be entered in order to reach the selected states, it answers "simple" for the leaves of the tree (characters, classes and eps), and in concatenations it takes the most complex type among the two sides. When reaching an internal node "?", "*" or "+", it assumes "extended" if the subtree was just a single character; otherwise it assumes "regular expression". The latter is always assumed when an or ("|") that could not be simplified is found.</p><p>7.3 Software Structure</p><p>Nrgrep is implemented as a set of modules, which permits easy enrichment with new types
of</p><p> patterns� Each typ e of pattern �simple� extended and regular expression� has two mo dules to deal</p><p> with the exact and the approximate case� which makes six mo dules� simple� esimple� extended�</p><p> eextended� regular� eregular �the pre�x �e� stands for allowing �e�rrors�� The other mo dules</p><p> are�</p><p> basics�options�except�memio� basic de�nitions� exception handling and memory and I�O man�</p><p> agement functions�</p><p> bitmasks� handles op erations on simple and multi�word bit masks�</p><p> buffer� implements the bu�ering mechanism to read �les�</p><p> record� takes care of context conditions and record management�</p><p> parser� p erforms the parsing�</p><p> search� template that exp orts the prepro cessing and search functions which are implemented in</p><p> the six mo dules describ ed �simple� etc��</p><p> shell� interacts with the user and provides the main program�</p><p>The template search facilitates the management of di�erent search algorithms for di�erent</p><p> pattern typ es� Moreover� we remind that the scanning pro cedure for a pattern of a given typ e can</p><p> use a subpattern whose typ e is simpler� and hence a di�erent scanning function is used� This is</p><p> easily handled with the template�</p><p>� Some Exp erimental Results</p><p>We present now some exp eriments comparing the p erformance of nrgrep version ��� against that of</p><p> its b est known comp etitors� Gnu grep version ��� and agrep version ����</p><p>Agrep ���� uses for simple strings a very e�cient mo di�cation of the Boyer�Mo ore�Horsp o ol</p><p> algorithm ����� For classes of characters and wild cards �denoted ��� and equivalent to ����� it</p><p> uses an extension of the Shift�Or algorithm explained in Section ���� in all cases based on forward</p><p> scanning� For regular expressions it uses a forward scanning with the bit parallel simulation of</p><p>Thompson�s automaton� as explained in Section ���� Finally� for approximate searching of 
simple strings it tries to use partitioning into k + 1 pieces and a multipattern search algorithm based on Boyer-Moore, but if this does not promise to yield good results it prefers a bit-parallel forward scanning with the automaton of k + 1 rows (these techniques were explained before). For approximate searching of more complex patterns, agrep uses only this last technique.</p><p>Gnu grep cannot search for approximate patterns, but it permits all the extended patterns and regular expressions. Simple strings are searched for with a Boyer-Moore-Gosper search algorithm (similar to Horspool's). All the other complex patterns are searched for with a lazy deterministic automaton, i.e. a DFA whose states are built as needed (using forward scanning). To speed up the search for complex patterns, grep tries to extract their longest necessary string, which is used as a filter and searched for as a simple string. In fact, grep is able to extract a necessary set of strings, i.e. such that one of the strings in the set has to appear in every match. This set is searched for as a filter using a Commentz-Walter-like algorithm, which is a kind of multipattern Boyer-Moore. As we will see, this extension makes grep very powerful and closer to our goal of a smooth degradation in efficiency as the pattern gets more complex.</p><p>The experiments were carried out over 100 Mb of English text extracted from Wall Street Journal articles, which are part of the trec collection. Two different machines were used: sun is a Sun UltraSparc running Solaris, and intel is an Intel PC running Linux Red Hat. Both are 32-bit machines (w = 32).</p><p>To illustrate the complexity of the code, we show the sizes of the sources and executables in Table 3. The source size is
obtained by summing up the sizes of all the ".c" and ".h" files. The size of the executables is computed after running Unix's "strip" command on them. As can be seen, nrgrep is in general a simpler software.</p><p>Table 3: Sizes of the different softwares under comparison (source size, and executable size on sun and on intel, for agrep, grep and nrgrep).</p><p>The experiments were repeated for 100 different patterns of each kind. To minimize the interference of I/O times in our measures, we had the text in a local disk, we considered only user times, and we asked the programs to show the count of matching records only. The records were the lines of the file. The same patterns were searched for on the same text for grep, agrep and nrgrep.</p><p>Our first experiment shows the search times for simple strings of lengths 5 to 30. Those strings were randomly selected from the same text, starting at word beginnings and taking care of including only letters, digits and spaces. We also tested how "-i" (case insensitive search) and "-w" (match whole words) affected the performance, but the differences were negligible. We also repeated this experiment allowing 10% and 20% of errors (i.e. k = floor(0.1m) or k = floor(0.2m)). In the case of errors, grep is excluded from the comparison because it does not permit approximate searching.</p><p>As Figure 15 shows, nrgrep is competitive against the others, more or less depending on the machine and the pattern length. It works better on the intel architecture and on moderate-length patterns rather than on very short ones. It is interesting to notice that in our experiments grep performed better than agrep.</p><p>When searching allowing errors, nrgrep is slower or faster than agrep depending on the
case. With low error levels (10%) they are quite close, except for one pattern length where, for some reason, agrep performs consistently badly on sun. With moderate error levels (20%) the picture is more complicated and each of them is better for different pattern lengths. This is a point of very volatile behavior, because 20% happens to be very close to the limit where splitting into k + 1 pieces ceases to be a good choice, and a small error in estimating the probability of a match produces dramatic changes in the performance. Depending on each pattern, a different search technique is used.</p><p>On the other hand, the search permitting transpositions is consistently worse than the one not permitting them. This is not only because when splitting into k + 1 pieces they may have to be shorter to allow one free space among them, but also because, even when they can have the same length, the subpattern selection process has more options to find the best pieces if no space has to be left among them.</p><p>The behavior of nrgrep, however, is not as erratic as it seems to be. The non-monotonic behavior can be explained in terms of splitting into k + 1 pieces. As m grows, the length of the pieces that can be searched for grows, tending to the real m/(k + 1) value from below. For example, for k = 20% of m, some lengths force pieces so short that too many verifications are estimated and another method is used, while a slightly larger m permits pieces whose probability of occurrence is much lower, which recommends again splitting into k + 1 pieces. The differences between searching with and without transpositions are explained in the same way: for some lengths we obtain pieces of the same length no
matter whether transpositions are permitted, but for other lengths we can search for longer pieces if no transpositions are permitted, which yields much better search times for that case. We remark that permitting transpositions has the potential of yielding approximate searching of better quality, and hence of obtaining the same (or better) results using a smaller k.</p><p>Our second experiment aims at evaluating the performance when searching for simple patterns that include classes of characters. We have selected strings as before from the text and have replaced some random positions with a class of characters. The classes have been: upper- and lower-case versions of the letter replaced (called "case" in the experiments), all the letters ("[a-zA-Z]", called "letters" in the experiments), and all the characters (".", called "all" in the experiments). We have considered pattern lengths of 10 and 20 and an increasing number of positions converted into classes. We also show the case of length 10 allowing errors (excluding grep).</p><p>As Figure 16 shows, nrgrep deals much more efficiently with classes of characters than its competitors, worsening slowly as there are more or bigger classes to consider. Agrep always yields the same search time (coming from a Shift-Or like algorithm). Grep, on the other hand, worsens progressively as the number of classes grows, but is independent of how big the classes are, and it worsens much faster than nrgrep. We have included the case of zero classes basically to show the effect of the change of agrep's search algorithm.</p><p>Our third experiment deals with extended patterns. We have selected strings as before from the text and have added an increasing number of operators ("?", "*" or "+") to them, at random positions (avoiding the first and last ones). For this sake we had to use the "-E" option of
For this sake we had to use the "-E" option of grep, which deals with regular expressions, and convert E+ into EE* for agrep. Moreover, we found no way to express the "?" in agrep, since even allowing regular expressions, it did not permit specifying the empty string or the empty set (a? = (a|ε) = (a|∅*)). We also show how agrep and nrgrep perform when searching for extended patterns allowing errors. Because of agrep's limitations, we chose m = 8 and k = 1 error (no transpositions permitted) for this test.</p><p>As Figure �� shows, agrep has a constant search time, coming again from Shift-Or like searching (plus some size limitations on the pattern). Both grep and nrgrep improve as the problem becomes simpler, but nrgrep is consistently faster. Note that the problem is necessarily easier with the "+" operator, because the minimum length of the pattern is not reduced as the number of operators grows. We have again included the case of zero operators to show how agrep jumps.</p><p>[Figure ��: Search times on ��� Mb for simple strings using exact and approximate searching. Panels: exact searching and approximate searching with 10% and 20% errors, on sun and intel; time as a function of the pattern length m, for agrep, grep and nrgrep.]</p><p>[Figure ��: Exact and approximate search times on ��� Mb for simple patterns with a varying number of classes of characters. Panels: m = 10 and m = 20 exact, and m = 15 with 2 errors, on sun and intel, for the classes "case", "letters" and "all".]</p><p>With respect to approximate searching, nrgrep is consistently faster than agrep, whose cost is a kind of worst case, reached when the minimum pattern length becomes �. This is indeed a very difficult case for approximate searching, and nrgrep wisely chooses to do forward scanning in that case.</p><p>Our fourth experiment considers regular expressions. It is not easy to define what a "random" regular expression is, so we have tested 9 complex patterns that we considered interesting to illustrate the different alternatives for the efficiency. These have been searched for with zero, one and two errors (no transpositions permitted). Table � shows the patterns selected and some aspects that explain the efficiency problems in searching for them.</p><p>Table �: The regular expressions searched for (written with the syntax of nrgrep); note that some are indeed extended patterns. For each pattern the original table also lists its size in characters, its minimum length ℓ, and the percentage of lines that match it exactly, with 1 error and with 2 errors; those numeric values are illegible in this copy.</p><p>1. American|Canadian</p><p>2. American|Canadian|Mexican</p><p>3. Amer[a-z]*can</p><p>4. Amer[a-z]*can|Can[a-z]*ian</p><p>5. Ame�i��r�i���can</p><p>6. Am[a-z]*ri[a-z]*an</p><p>7. (Am|Ca)(er|na)(ic|di)an</p><p>8. American��policy</p><p>9. A�mer�i��can��p�oli�cy�</p><p>Table � shows the results. These are more complex to interpret than in the previous cases, and there are important differences in the behavior of the same code depending on the machine.</p><p>For example, grep performed consistently well on intel, while it showed wide differences on sun. We believe that this comes from the fact that in some cases grep cannot find a suitable set of filtering strings (patterns �, �, � and �). In those cases the time corresponds to that of a forward DFA. On the intel machine, however, the search times are always good.</p><p>Another example is agrep, which has basically two different times on sun and always the same time on intel. Although agrep always uses forward scanning with a bit-parallel automaton, on sun it takes half the time when the pattern is very short (up to �� positions), while it is slower for longer patterns. This difference seems to come from the number of computer words needed for the simulation, which seems not to be important on the intel machine. The approximate search using agrep (which works only on the shorter patterns) scales in time according to the number of computer words used, although there is a large constant factor added on the intel machine.</p><p>Nrgrep performs well on exact searching. All the patterns yield fast search times, in general better than those of grep on intel and worse on sun.
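</p><p>The scanning concept that distinguishes nrgrep, as stated in the introduction, is the bit-parallel simulation of a nondeterministic suffix automaton (the BNDM approach). A minimal sketch for a single string, assuming the pattern fits in one machine word (illustrative only; nrgrep generalizes this to classes, extended patterns and regular expressions):</p>

```python
# BNDM sketch: scan each text window backwards with a nondeterministic
# suffix automaton simulated by bit-parallelism. Illustrative only.

def bndm(text, pattern):
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << (m - 1 - i))   # masks of the reversed pattern
    occ = []
    pos = 0
    while pos <= n - m:
        j, last = m, m
        D = (1 << m) - 1                  # every suffix is initially alive
        while j > 0 and D != 0:
            D &= B.get(text[pos + j - 1], 0)
            j -= 1
            if D & (1 << (m - 1)):        # a prefix of the pattern was recognized
                if j > 0:
                    last = j              # longest prefix seen: next safe shift
                else:
                    occ.append(pos)       # the whole pattern matched
            D = (D << 1) & ((1 << m) - 1)
        pos += last                       # skip characters that cannot match
    return occ

print(bndm("abracadabra", "abra"))        # -> [0, 7]
```

<p>Reading the window backwards lets the search skip blocks of characters: when no factor of the pattern survives, the window jumps past the last recognized prefix, which is what makes the scan faster than forward character-by-character algorithms on average.</p><p>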
The exception is where grep does not use filtering on the sun machine, and nrgrep becomes much faster. When errors are permitted, the search times of nrgrep vary depending on its ability to filter the search. The main aspects that affect the search time are the frequency of the pattern and its size. When compared to agrep, nrgrep always performs better on exact searching, and better or similarly on approximate searching.</p><p>[Figure ��: Exact and approximate search times on ��� Mb for extended patterns with a varying number of operators. Panels: search time as a function of the number of special symbols ("?", "*", "+"), for m = 10 and m = 20 exact and for m = 8 with 1 error, on sun and intel. Where, on sun, agrep is out of bounds, it takes nearly �� seconds.]</p><p>[Table �: Search times for the selected regular expressions, on sun and on intel: grep, agrep and nrgrep with 0 errors, and agrep and nrgrep with 1 and 2 errors. There are empty cells because agrep severely restricts the lengths of the complex patterns that can be approximately searched for. The numeric values are illegible in this copy.]</p><p>The general conclusions from the experiments are that nrgrep is, for exact searching, competitive against agrep and grep, while it is in general superior (sometimes by far) when searching for classes of characters and extended patterns, exactly or allowing errors. When it comes to searching for regular expressions, nrgrep is in general, but not always, faster than grep and agrep.</p><p>One final word about nrgrep's smoothness is worthwhile. The reader may get the impression that nrgrep's behavior is not as smooth as promised, because it takes very different times for different patterns (for example on regular expressions), while agrep's behavior is much more predictable. The point is that some of these patterns are indeed much simpler than others, and nrgrep is much faster on the simpler ones. Agrep, on the other hand, does not distinguish between simple and complicated cases. Grep does a much better job, but it does not deal with the complex area of approximate searching.
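</p><p>The filtering that underlies these approximate searches relies on splitting the pattern into k+1 pieces: if P occurs with at most k errors, at least one piece occurs unchanged, so exact hits on the pieces select candidate areas that are then verified. A minimal sketch of this filter with a classical dynamic-programming verification (illustrative only; nrgrep scans the pieces with bit-parallel automata and estimates match probabilities when choosing a strategy, and the helper names here are ours):</p>

```python
# Split-into-k+1-pieces filter for approximate search, with a classical
# DP verification. Illustrative sketch, not nrgrep's code.

def min_semiglobal(pattern, window):
    """Smallest edit distance between pattern and any substring of window."""
    prev = [0] * (len(window) + 1)             # free starting position
    for i, ca in enumerate(pattern, 1):
        cur = [i]
        for j, cb in enumerate(window, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[-1] + 1,                 # insertion
                           prev[j - 1] + (ca != cb)))   # substitution/match
        prev = cur
    return min(prev)                           # free ending position

def approx_search(text, pattern, k):
    m = len(pattern)
    size = m // (k + 1)                        # piece length, m/(k+1) from below
    pieces = [(i * size,
               pattern[i * size:(i + 1) * size if i < k else m])
              for i in range(k + 1)]
    hits = set()
    for off, piece in pieces:
        start = text.find(piece)
        while start != -1:                     # each exact hit spawns a check
            lo = max(0, start - off - k)
            hi = min(len(text), start - off + m + k)
            if min_semiglobal(pattern, text[lo:hi]) <= k:
                hits.add((lo, hi))             # verified candidate window
            start = text.find(piece, start + 1)
    return sorted(hits)

print(approx_search("the colourful colors here", "colour", 1))  # -> [(3, 11), (13, 21)]
```

<p>The piece length is roughly m/(k+1), taken from below, so small m or large k yields short pieces that match often and trigger many verifications; this trade-off is exactly what produces the non-monotonic behavior discussed for the first experiment.</p><p>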
Something similar happens in other cases: nrgrep had the highest variance when searching for patterns where classes were inserted at random positions. This is because this random process does produce a high variance in the complexity of the patterns: it is much simpler to search for the pattern when all the classes are at one extreme (they can then be cut out of the scanning subpattern) than when they are uniformly spread. Nrgrep's behavior simply mimics the variance in its input, precisely because it takes time proportional to the real complexity of the search problem.</p><p>� Conclusions</p><p>We have presented nrgrep, a fast online pattern matching tool especially well suited for complex pattern searching on natural language text. Nrgrep is now at version ��� and publicly available under a Gnu license. Our belief is that it can be a successful new member of the grep family. The Free Software Foundation has shown interest in making nrgrep an important part of a new release of Gnu grep, and we are currently defining the details.</p><p>The most important improvements of nrgrep over the other members of the grep family are:</p><p>Efficiency: nrgrep is similar to the others when searching for simple strings (a sequence of single characters) and some regular expressions, and generally much faster for all the other types of patterns.</p><p>Uniformity: our search model is uniform, based on a single concept. This translates into smooth variations in the search time as the pattern gets more complex, and into an absence of the obscure restrictions present in agrep.</p><p>Extended patterns: we introduce a class of patterns which is intermediate between simple patterns and regular expressions, and develop efficient search algorithms for it.</p><p>Error model: we include character transpositions in the error model.</p><p>Optimization: we find the optimal subpattern with which to scan the text, and check the potential occurrences for the complete pattern.</p><p>Some possible extensions and improvements have been left for future work. The first is a better computation of the matching probability, which has a direct impact on nrgrep's ability to choose the right search method (currently it normally succeeds, but not always). A second is a better algebraic optimization of the regular expressions, which also has an impact on the ability to correctly compute the matching probabilities. Finally, we would like to be able to combine exact and approximate searching as agrep does, where parts of the pattern accept errors and others do not. It is not hard to do this by using bit masks that control the propagation of errors.</p><p>Acknowledgements</p><p>Most of the algorithmic work preceding nrgrep was done in collaboration with Mathieu Raffinot, who unfortunately had no time to participate in this software.</p><p>References</p><p>[1] R. Baeza-Yates. Text retrieval: Theory and practice. In ��th IFIP World Computer Congress, volume I, pages ���–���, September ����.</p><p>[2] R. Baeza-Yates and G. Gonnet. A new approach to text searching. Communications of the ACM, 35(10):74–82, 1992.</p><p>[3] R. Baeza-Yates and G. Navarro. Faster approximate string matching. Algorithmica, ��(�):���–���, ����.</p><p>[4] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999.</p><p>[5] G. Berry and R. Sethi. From regular expressions to deterministic automata. Theoretical Computer Science, 48:117–126, 1986.</p><p>[6] R. Boyer and J. Moore. A fast string searching algorithm. Communications of the ACM, 20(10):762–772, 1977.</p><p>[7] B. Commentz-Walter. A string matching algorithm fast on the average. In Proc. ICALP'79, LNCS v. ��, pages ���–���, 1979.</p><p>[8] M. Crochemore and W. Rytter. Text Algorithms. Oxford University Press, 1994.</p><p>[9] A. Czumaj, M. Crochemore, L. Gasieniec, S.
Jarominek, T. Lecroq, W. Plandowski, and W. Rytter. Speeding up two string-matching algorithms. Algorithmica, ��:���–���, ����.</p><p>[10] N. El-Mabrouk and M. Crochemore. Boyer-Moore strategy to efficient approximate string matching. In �th International Symposium on Combinatorial Pattern Matching (CPM'��), pages ��–��, ����.</p><p>[11] D. Harman. Overview of the Third Text REtrieval Conference. In Proc. Third Text REtrieval Conference (TREC-3), pages �–��, ����. NIST Special Publication ���–���.</p><p>[12] R. Horspool. Practical fast searching in strings. Software Practice and Experience, 10:501–506, 1980.</p><p>[13] D. Knuth, J. Morris, and V. Pratt. Fast pattern matching in strings. SIAM Journal on Computing, 6(2):323–350, 1977.</p><p>[14] K. Kukich. Techniques for automatically correcting words in text. ACM Computing Surveys, 24(4):377–439, 1992.</p><p>[15] U. Manber and S. Wu. Glimpse: A tool to search through entire file systems. In Proc. USENIX Technical Conference, pages ��–��, Winter 1994.</p><p>[16] B. Melichar. String matching with k differences by finite automata. In Proc. International Congress on Pattern Recognition (ICPR'��), pages ���–���. IEEE CS Press, ����.</p><p>[17] E. Moura, G. Navarro, N. Ziviani, and R. Baeza-Yates. Fast and flexible word searching on compressed text. ACM Transactions on Information Systems (TOIS), ����. To appear. Earlier versions in SIGIR'�� and SPIRE'��.</p><p>[18] G. Myers. A fast bit-vector algorithm for approximate string matching based on dynamic programming. Journal of the ACM, 46(3):395–415, 1999.</p><p>[19] G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys, ����. To appear.</p><p>[20] G. Navarro and R. Baeza-Yates. Very fast and simple approximate string matching. Information Processing Letters, ��:��–��, 1999.</p><p>[21] G. Navarro, E. Moura, M. Neubert, N. Ziviani, and R. Baeza-Yates. Adding compression to block addressing inverted indexes. Kluwer Information Retrieval Journal, �(�):��–��, ����.</p><p>[22] G. Navarro and M. Raffinot. A bit-parallel approach to suffix automata: Fast extended string matching. In 9th International Symposium on Combinatorial Pattern Matching (CPM'98), LNCS 1448, pages ��–��, 1998.</p><p>[23] G. Navarro and M. Raffinot. Fast and flexible string matching by combining bit-parallelism and suffix automata. Technical Report TR/DCC-��-�, Dept. of Computer Science, Univ. of Chile, ����.</p><p>[24] G. Navarro and M. Raffinot. Fast regular expression searching. In 3rd International Workshop on Algorithm Engineering (WAE'99), LNCS ����, pages ���–���, 1999.</p><p>[25] G. Navarro and M. Raffinot. Compact DFA representation for fast regular expression search. In 4th International Workshop on Algorithm Engineering (WAE'��), LNCS, ����. To appear.</p><p>[26] G. Navarro and M. Raffinot. Flexible Pattern Matching in Strings. Cambridge University Press, ����. To appear.</p><p>[27] M. Raffinot. On the multi backward dawg matching algorithm (MultiBDM). In Proc. �th South American Workshop on String Processing (WSP'��), pages ���–���. Carleton University Press, ����.</p><p>[28] K. Thompson. Regular expression search algorithm. Communications of the ACM, 11(6):419–422, 1968.</p><p>[29] S. Wu and U. Manber. Agrep - a fast approximate pattern-matching tool. In Proc. USENIX Technical Conference, pages ���–���, 1992.</p><p>[30] S. Wu and U. Manber. Fast text searching allowing errors. Communications of the ACM, 35(10):83–91, 1992.</p>
