
Improving Regular-Expression Matching on Strings Using Negative Factors

Xiaochun Yang, Bin Wang, Tao Qiu
College of Information Science and Engineering, Northeastern University, China
[email protected], [email protected], [email protected]

Yaoshu Wang
College of Information Science and Engineering, Northeastern University, China
[email protected]

Chen Li
Dept. of Computer Science, UC Irvine, USA
[email protected]

ABSTRACT

The problem of finding matches of a regular expression (RE) on a string exists in many applications such as text editing, biosequence search, and shell commands. Existing techniques first identify candidates using substrings in the RE, then verify each of them using an automaton. These techniques become inefficient when there are many candidate occurrences that need to be verified. In this paper we propose a novel technique that prunes candidates by utilizing negative factors, which are substrings that cannot appear in an answer. A main advantage of the technique is that it can be integrated with many existing algorithms to improve their efficiency significantly. We give a full specification of this technique. We develop an efficient algorithm that utilizes negative factors to prune candidates, then improve it by using bit operations to process negative factors in parallel. We show that negative factors, when used together with necessary factors (substrings that must appear in each answer), can achieve much better pruning power. We analyze the large number of negative factors, and develop an algorithm for finding a small number of high-quality negative factors. We conducted a thorough experimental study of this technique on real data sets, including DNA sequences, proteins, and text documents, and show the significant performance improvement when applying the technique in existing algorithms. For instance, it improved the search speed of the popular Gnu Grep tool by 11 to 74 times for text documents.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: Content Analysis and Indexing; D.2.8 [Software Engineering]: Metrics (performance measures)

General Terms

Performance

Keywords

Regular expression, long sequence, performance

1. INTRODUCTION

This paper studies the problem of efficiently finding matchings of a regular expression (RE) in a long string. The problem exists in many application domains. In the domain of bioinformatics, users specify a string such as TAC(T|G)AGA and want to find its matchings in proteins or genomes [6, 13]. Modern text editors such as eclipse provide the functionality of searching for a given pattern in a text document being edited. The shell command grep is widely used to search plain-text files for lines matching a regular expression. For instance, the command "grep ^a.ple fruits.txt" finds lines in the file called fruits.txt that begin with the letter a, followed by one arbitrary character, followed by the letter sequence ple. The ability of processing regular expressions has been integrated into the syntax of languages (e.g., Ruby and AWK) and is provided by other languages through standard libraries, e.g., in .NET and Python.

A simple method to do RE search on a string is to construct an automaton for the RE. For each position in the string, we run the automaton to verify whether a substring starting from that position can be accepted by the automaton. Notice that the verification step can be computationally expensive. The main limitation of this approach is that we need to do the expensive verification step many times. Various algorithms have been developed to speed up the matching process by first identifying candidate occurrences in the string, and then verifying them one by one [3, 4, 14]. These algorithms identify candidate places based on certain substrings derived from the RE that have to appear in matching answers, such as a prefix and/or a suffix. For instance, each substring matching the RE xy(a|b)*zw* should start with xy (a prefix condition) and end with zw or z (suffix conditions). Such substrings, called positive factors throughout the paper, can be used to locate candidate occurrences. Each of them will be further verified using one or multiple automata. Although these algorithms can eliminate many starting positions in the string, their efficiency can still be low when the positive factors generate too many candidate occurrences, especially when the text is long.

Contributions: In this paper we study how to improve the efficiency of existing algorithms for searching regular expressions in strings. We propose a novel technique that provides significant pruning power by utilizing the fact that we can derive substrings from the RE that cannot appear in an answer. Such substrings are called negative factors. We give an analysis to show that many candidates can be pruned by negative factors (Section 3).

In Section 4, we study how to use negative factors to speed up the matching process of existing algorithms. We show that negative factors can be easily integrated into these methods, and propose two bit-operation algorithms to do efficient search. In Section 4.3 we study the benefits of using negative factors. One interesting result is that considering necessary factors (substrings that have to appear in every answer) alone does not provide much filtering power in prefix-based approaches. However, when combining both negative factors and necessary factors, we can gain much more pruning power. In Section 5 we study the problem of deciding a set of high-quality negative factors. There can be many negative factors, and the selected ones can greatly affect the performance of the matching process. To achieve a high efficiency, we propose an algorithm that only considers partial strings, and present an algorithm that can do early termination in the process of deciding negative factors. In Section 6 we present experimental results on real data sets including DNA sequences, protein sequences, and text documents, and demonstrate the space and time efficiency of the proposed technique. For instance, it improved the search speed of the popular Gnu Grep tool by 11 to 74 times for texts.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGMOD'13, June 22-27, 2013, New York, New York, USA.
Copyright 2013 ACM 978-1-4503-2037-5/13/06 ...$15.00.

2. PRELIMINARIES

2.1 Regular Expression Matching

Let Σ be a finite alphabet. A regular expression (RE for short) is a string over Σ ∪ {ε, |, ·, *, (, )}, which can be defined recursively as follows:

• The symbol ε is a regular expression. It denotes the empty string (i.e., the string of length zero).
• Each string w ∈ Σ* is a regular expression, which denotes the string set {w}.
• If e1 and e2 are regular expressions that denote the sets R1 and R2, respectively, then
  • (e1) is a regular expression that represents the same set denoted by e1.
  • (e1 · e2) is a regular expression that denotes the set of strings x that can be written as x = yz, where e1 matches y and e2 matches z.
  • (e1|e2) is a regular expression that denotes the set of strings x such that x matches e1 or e2.
  • (e1+) is a regular expression that denotes the set of strings x such that, for a positive integer k, x can be written as x = x1 ... xk and e1 matches each string xi (1 ≤ i ≤ k). We use ε|e+ to express a Kleene closure e*. In this paper, we consider the general case e*.

Given an RE Q, we use R(Q) to represent the set of strings that can be accepted by the automaton of Q. We use |Q| to express the number of characters that Q contains. We use lmin to represent the length of the shortest string(s) in R(Q). For example, for the RE Q = (G|T)A*GA*T*, we have |Q| = 6 since it has six characters: G, T, A, G, A, and T. The set of strings R(Q) = {GG, TG, GAG, TAG, GGA, TGA, GGT, TGT, GAGT, ...}. We have lmin = 2, since its shortest strings GG and TG have the length 2.

For a text (sequence) T of the characters in Σ, we use |T| to denote its length, T[i] to denote its i-th character (starting from 0), and T[i, j] to denote the substring ranging from its i-th character to its j-th character.

For simplicity, in our examples we focus on the domain of genome sequences, where Σ = {A, C, G, T}. We run experiments in Section 6 on other domains such as proteins and English texts, where the alphabet Σ has more characters.

Matchings of an RE. Consider a text T of length n and an RE Q of length m. We say Q matches a substring T[a, b] of T at position a if T[a, b] belongs to R(Q). The substring T[a, b] is called an occurrence of Q in T. The problem of pattern matching for an RE is to find all occurrences of Q in T. Figure 1 shows an example text. Suppose Q = (G|T)A*GA*T*. Then Q matches T at position 3, and the substring T[3, 6] is an occurrence of Q in T.

T = TACTAGACGTTAATTTACGTA  (positions 0 to 20)

Figure 1: An example text.

2.2 Prefix-Based Approaches

One naive approach for finding occurrences of an RE Q is to build an automaton for Q and run it from the beginning of the text T. An occurrence is reported whenever a final state of the automaton is reached. The verification fails once a newly fed character cannot be accepted by the automaton. In this case, we move to the next position and repeat the verification process using the same automaton. We repeat the process for all starting positions in the text. Since the approach has to run the automaton for each position, it is very inefficient, especially when T is very long (e.g., a chromosome of a human can contain 3 billion characters).

Recent techniques have been developed for identifying sequences that contain at least one matching of an RE among multiple sequences. They utilize certain features of the RE Q to improve the performance of the automaton-based methods. Their main idea is to use positive factors, which are substrings of Q that can be used to identify candidate occurrences of Q in T. For instance, Watson in [14] used prefixes of strings in R(Q) to find maximal safe shift distances to avoid checking every position in T. A prefix w.r.t. an RE Q is defined as a prefix with length lmin of a string in R(Q). For example, for the RE Q = (G|T)A*GA*T*, the prefixes w.r.t. Q are GA, TA, GG, and TG. There are other kinds of positive factors, which are discussed in Section 4.3.

Figure 2(a) shows the main idea of this approach using the running example. The substrings T[0, 1], T[3, 4], T[5, 6], T[10, 11], T[15, 16], and T[19, 20] are six matching prefixes for the RE Q. The approach only examines matching suffixes of the text T starting from these matching prefixes using the automaton of Q. The automaton keeps examining each matching suffix until it fails, and it reports an occurrence whenever a final state is reached. We call this kind of approach "algorithm PFilter," where "P" stands for "Prefix."
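As an illustration, the prefix-filtering idea can be sketched in Python. This is a hedged sketch, not the paper's implementation: the helper name `pfilter_matches` and the hard-coded prefix set are ours, and Python's `re` engine (which returns one, leftmost-greedy match per position) stands in for the automaton of Q.

```python
import re

Q = re.compile(r"(G|T)A*GA*T*")          # the running example RE
PREFIXES = {"GA", "TA", "GG", "TG"}      # prefixes w.r.t. Q (l_min = 2)

def pfilter_matches(text):
    """Start verifications only at positions where a matching prefix occurs."""
    results = []
    for i in range(len(text) - 1):
        if text[i:i + 2] in PREFIXES:            # candidate starting position
            m = Q.match(text, i)                 # verification step
            if m:
                results.append((i, m.end() - 1)) # an occurrence at position i
    return results

print(pfilter_matches("TACTAGACGTTAATTTACGTA"))  # [(3, 6)]
```

On the example text, six positions (0, 3, 5, 10, 15, and 19) pass the prefix filter, but only the verification at position 3 succeeds, producing the occurrence T[3, 6].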

Figure 2: Checking candidate occurrences of the RE Q = (G|T)A*GA*T*. (a) Using prefixes. (b) Combination of prefixes and the last matching suffix.

An alternative approach is adopted in NR-grep [9, 11], which uses a sliding window of size lmin on the text T and recognizes reversed matching prefixes in the sliding window using a reversed automaton. Similar to the definition of prefix, a suffix w.r.t. an RE Q is defined as a suffix with length lmin of a string in R(Q). For example, for the RE Q = (G|T)A*GA*T*, the suffixes w.r.t. Q are TT, AT, AA, GA, GT, AG, GG, and TG, and the matching suffixes are T[4, 5], T[5, 6], T[8, 9], T[9, 10], T[11, 12], T[12, 13], T[13, 14], T[14, 15], and T[18, 19]. We call the suffix-based approach "algorithm SFilter," which is similar to the PFilter algorithm. It runs a reversed automaton from the end position of each suffix to the beginning of the text.

2.3 Improving Prefix-Based Approaches Using the Last Matching Suffix

In Figure 2(a), the matching prefix T[19, 20] = TA could not be used to produce an answer string in R(Q), since it is not followed by any of the suffixes identified above from the RE. Therefore, we could use the last matching suffix to do an early termination in each verification step. Figure 2(b) shows the example of improving the algorithm PFilter. As we can see, by using the last matching suffix T[18, 19] = GT in the text T, a verification can terminate early at position 19. We call this approach "algorithm PS." It only verifies those substrings starting from every matching prefix Sp to the last matching suffix Ss such that the starting position of Sp is less than or equal to the starting position of Ss. We call Ss a valid matching suffix and each Sp a valid matching prefix w.r.t. its valid matching suffix Ss. For example, the substring T[18, 19] is a valid matching suffix and the substrings T[0, 1], T[3, 4], T[5, 6], T[10, 11], T[15, 16] are valid matching prefixes, whereas T[19, 20] is an invalid matching prefix.

The algorithm PS requires O(m·n·np) time to do verifications for np valid matching prefixes of T, assuming m is |Q| and the verification for each valid matching prefix requires O(m·n) time.

3. NEGATIVE FACTORS

In this section, we develop the concept of negative factor, which can be used to improve the performance of matching algorithms. Contrary to positive factors, a negative factor is a substring that must not appear in an occurrence. We show that a negative factor can not only prune unnecessary verifications, but also terminate verifications early. We first present a formal definition of negative factors, then show a good pattern to prune candidates.

Definition 1. (Negative factor, or N-factor for short) Given a regular expression Q, a string w is called a negative factor with respect to Q, or simply a negative factor when Q is clear in the context, if no string in Σ*wΣ* is in R(Q).

For a text T, an N-factor w.r.t. an RE Q must not appear in an answer to Q in T. For example, consider the RE Q = (G|T)A*GA*T*; the strings C, AGG, and TTA are N-factors, since they cannot appear in an answer as a substring.

Lemma 1. An N-factor w.r.t. an RE Q cannot be a substring of a prefix or a suffix w.r.t. Q.

Theorem 1. Given a text with length n, the number of N-factors w.r.t. an RE Q cannot be greater than Σ_{i=1}^{n} |Σ|^i.

A PNS Pattern: Intuitively, we say a substring of T has a PNS pattern if it starts with a prefix of Q, has an N-factor in the middle, and ends with a suffix of Q. Formally, let πp, πn, πs be the starting positions of a matching prefix P, a matching N-factor N, and a matching suffix S in a text T, respectively. The substring T[πp, πs + lmin - 1] conforms to a PNS pattern if N is a substring of T[πp, πs + lmin - 1]. Figure 3 shows that a substring conforms to a PNS pattern if and only if πp ≤ πn < πs and πn + |N| ≤ πs + lmin.

Figure 3: A substring conforming to a PNS pattern iff πp ≤ πn < πs and πn + |N| ≤ πs + lmin. (a) A PNS pattern. (b) Not a PNS pattern.

Obviously, a substring of T conforming to a PNS pattern cannot be an occurrence of Q. Based on this observation, we can prune unnecessary verifications using N-factors. Figure 4 shows an example of the benefit of using PNS patterns.

Figure 4: Using N-factors T[2, 2] and T[17, 17] to prune candidates of Q = (G|T)A*GA*T*. Compared with Figure 2(b), the candidates T[0, 19] and T[15, 19] are pruned, and the verifications starting from positions 3, 5, and 10 can be terminated early by using the N-factors T[7, 7] and T[14, 16], respectively.

Although the number of N-factors w.r.t. Q = (G|T)A*GA*T* is large, we can still generate a small number of high-quality N-factors, such as {C, AGG, ATA, ATG, GGG, GTA, GTG, TAT, TGG, TTA, TTG}. (We will provide details in Section 5.) For the example in Figure 2(b), all five candidate substrings conform to a PNS pattern, and two of them can be pruned using the N-factors T[2, 2] and T[17, 17]. For instance, the substring T[15, 19] is such a substring: it consists of a matching prefix T[15, 16] = TA, a matching suffix T[18, 19] = GT, and an N-factor C at position 17.

The N-factors can help us avoid the verifications of T[0, 19] and T[15, 19] (see Figure 2(b)). Furthermore, for the matching prefix T[3, 4] = TA, PNS can terminate the verification early when it meets the suffix T[5, 6] = GA, resulting in a better performance. Similarly, the verifications starting from the matching prefixes T[5, 6] = GA and T[10, 11] = TA can be terminated early at positions 6 and 15, respectively.

4. PRUNING UNNECESSARY CHECKS USING N-FACTORS

In this section we discuss how to combine N-factors with the algorithm PS to check candidate substrings that do not conform to a PNS pattern. In Section 4.1 we propose a merge algorithm to identify candidates, and we present two improved algorithms using bit-parallel operations in Section 4.2. In Section 4.3 we show that negative factors, when used together with necessary factors (substrings that must appear in answers), can achieve better pruning power.

4.1 A Merging Algorithm

N-factors can be used to divide a text into a set of disjoint substrings, and we want to run the algorithm PS for each of them. Recall that the algorithm PS verifies substrings from every valid matching prefix to its corresponding valid matching suffix (see Section 2.3). For example, Figure 4 shows that the N-factors divide the text into the disjoint substrings T[0, 1], T[3, 6], T[8, 10], T[10, 15], T[15, 16], and T[18, 20]. Only T[3, 6] and T[10, 15] contain valid matching suffixes, T[5, 6] and T[14, 15], respectively. The substrings T[3, 4] and T[5, 6] are valid matching prefixes w.r.t. the valid matching suffix T[5, 6], and T[10, 11] is a valid matching prefix w.r.t. the valid matching suffix T[14, 15], whereas T[0, 1], T[15, 16], and T[19, 20] are invalid matching prefixes.

In order to use N-factors in the PS approach, we sort the starting positions of matches of prefixes, suffixes, and N-factors in the ascending order. A suffix tree of the text can support this operation, but with a large space overhead when T is long. To reduce the space cost, we can store the text T in a BWT format [7] and get an inverted list of starting positions for a substring α in O(|α|) time by simulating searches using the BWT. For example, Figure 4 shows that the inverted list of starting positions for the prefix TA is {0, 3, 10, 15, 19} and the inverted list of starting positions for the N-factor C is {2, 7, 17}.

The basic idea of utilizing N-factors in the PS approach is to find, for each N-factor, a valid matching suffix to its left and the corresponding valid matching prefixes in the text. The formal algorithm, called PNS-Merge, is described in Algorithm 1. It builds a min-heap HN for all the N-factor lists, each of which is sorted in the ascending order. Let πn be one element on the list of an N-factor N. The algorithm pops πn from the heap HN, conducts a binary search on the suffix lists to find the largest suffix position smax bounded by πn + |N| - lmin, and repeats the above steps until the heap HN is empty.

Algorithm 1: PNS-Merge
Input: An RE Q, prefix lists Pset = {LP1, ..., LPu}, suffix lists Sset = {LS1, ..., LSv}, negative factor lists LN1, ..., LNw;
1  Calculate lmin given Q;
2  Insert the frontier records of LN1, ..., LNw into a heap HN;
3  while HN is not empty do
4    Let πn be the top element on HN, associated with an N-factor N;
5    smax ← FindMaxSuffix(Sset, πn + |N| - lmin);
6    if smax is found then
7      Remove all elements that are less than smax in the suffix lists;
8      foreach LPi (∈ Pset) do
9        foreach element πp (≤ smax) in the list LPi do
10         Verify(T[πp, smax]);
11         Remove πp from LPi;
12   else
13     Remove all elements that are not greater than πn in the prefix lists;
14   Pop πn from HN;
15   Push the next record (if any) of each popped list to HN;
16 if there exists a non-empty list in Sset then
17   smax ← FindMaxSuffix(Sset, |T| - lmin + 1);
18   Verify(T[πp, smax]) for each πp (≤ smax) in the prefix lists;

If there exist non-empty suffix lists after the heap becomes empty, the algorithm finds the largest element smax among them and verifies the candidate occurrence T[πp, smax] for each remaining element πp (≤ smax) on the prefix lists (lines 16-18).

The algorithm requires O(np + mn·log ln + ms·log ls) time to generate candidates, where np is the number of prefixes in T, mn and ms are the numbers of N-factors and suffixes, respectively, and ln and ls are the average lengths of the inverted list of each N-factor and suffix, respectively. The average length of each verification is reduced to n/(mn·ln), since the N-factors could divide the text into mn·ln disjoint substrings. Therefore, the algorithm PNS-Merge only requires O((m·n/(mn·ln))·nc) time to do verifications, where nc is the number of candidates.

4.2 Bit-Parallel Algorithms

In algorithm PNS-Merge, we need to find valid matching suffixes and their corresponding valid matching prefixes to construct candidate occurrences for verification. We can accelerate this process by introducing bit-parallel operations. We propose two such algorithms in Sections 4.2.1 and 4.2.2.

4.2.1 The PNS-BitC Algorithm under the Constraint |N| ≤ lmin

We use a bit vector Ns = d0 ... dn to represent all the occurrences of N-factors in the text T. We set Ns[di] = 1 (d0 ≤ di ≤ dn) if an occurrence of an N-factor starts at position di in T, and Ns[di] = 0 otherwise.
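Occurrence lists can be packed into such bit vectors directly. The sketch below is ours (the helper name `to_bitvector` is hypothetical), and it uses Python integers in place of machine words, with bit i standing for text position i; note that in the paper's figures d0 is drawn leftmost, so a printed vector reads in the opposite direction.

```python
def to_bitvector(positions):
    """Encode a sorted inverted list of positions as an integer bit vector."""
    v = 0
    for p in positions:
        v |= 1 << p
    return v

# Inverted lists from the running example (Figure 4):
Ps = to_bitvector([0, 3, 5, 10, 15, 19])   # prefix starting positions
Ns = to_bitvector([2, 7, 17])              # starting positions of N-factor C

# Example of a bit-parallel test: prefixes that start one position
# after an N-factor occurrence, for the whole text at once.
after_n = Ps & (Ns << 1)
print(bin(after_n))   # 0b1000 -> only the prefix at position 3
```

A single shift-and-AND answers the question for every text position simultaneously, which is the appeal of the bit-parallel formulation.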

Similarly, we use vectors Ps and Ss for the occurrences of prefixes and suffixes in T, respectively, in which Ps[di] = 1 (or Ss[di] = 1) indicates a prefix (or suffix) starting at position di in T.

We first consider a simple case where the length of each N-factor is no greater than lmin. Under this constraint, the case shown in Figure 3(b) cannot happen, i.e., if positions πp, πn, and πs satisfy the condition πp < πn < πs, there exists a PNS pattern. Under this constraint, a valid matching suffix means πs < πn, since a matching suffix and a matching N-factor cannot start at the same position. Equation 1 (analogous to Equation 2 below) computes an intermediate vector A0 from Ns and Ss, and then the vector Ssm of valid matching suffixes as (∼A0) & Ss.

Correspondingly, we use another intermediate vector B0 to represent the starting positions of all the possible valid matching prefixes. Figure 5 shows an example of partial bits in the vector B0; the "1" bits correspond to the starting positions of valid matching prefixes. The vector Pc = Ps & B0 then denotes the starting positions of the valid matching prefixes.

Figure 5: Explanation of the intermediate vector B0.

Algorithm 2: PNS-BitC
Input: Bit vectors Ps, Ss, Ns; the length lmin;
1  Calculate the vectors Ssm and Pc using Equation 1;
2  t ← the position of the last suffix using Ssm;
3  Change Ssm[t] from 1 to 0; πs ← t; πp ← 0;
4  repeat
5    t ← the position of the last suffix using Ssm;
6    repeat
7      πp ← the position of the last prefix using Pc;
8      Change Pc[πp] from 1 to 0;
9      Verify(T[πp, πs + lmin - 1]);
10   until πp < t;
11   Change Ssm[t] from 1 to 0; πs ← t;
12 until there is no 1 in Ssm or Pc;

The algorithm PNS-BitC gets each position pair πp and πs using bit operations² (lines 2-12), and verifies each candidate T[πp, πs + lmin - 1] using Pc and Ssm (line 9). For example, let lmin = 2. Table 1 shows the generated Pc and Ssm. The candidate occurrences are T[3, 6], T[5, 6], and T[10, 15], which are consistent with the candidate occurrences shown in Figure 4.

Table 1: Generating candidate regions using bit operations under the constraint |N| ≤ lmin (lmin = 2).
Vector  0....5....10...15...20
Ps      1001010000100001000100
Ss      0000110011011110001000
Ns      0010000100000000010001
A0      1110101111011101110111
Ssm     0000010000000010001000
B0      0001110011111110001001
Pc      0001010000100000000000

In general, the text string T can occupy the memory space of multiple words. The algorithm proceeds in the same fashion by computing Ssm and Pc for all ⌈n/w⌉ words, where w is the length of a word (e.g., 4 bytes). We only need to pay special attention to the cases of processing (Ns << 1) and (Ns − Ssm). For the current word Ns^(i−1), when processing (Ns << 1), we need to take the first bit of its next word Ns^i and put it into the last bit of Ns^(i−1). Similarly, the operation Ns − Ssm on the (i−1)-st word becomes Ns^(i−1) − Ssm^(i−1) − b, where b is the borrow propagated from the neighboring word.

4.2.2 The PNS-BitG Algorithm for the General Case

In the general case, the lengths of N-factors could differ and could exceed lmin, so we have to use two more vectors, Ne and Se, to represent the end positions of the occurrences of N-factors and suffixes, respectively. Equation 2 shows the bit operations to calculate candidate occurrences:

A1  = (Ne | Se) − (Ne << 1),
Sem = (∼A1) & Se & (∼Ne),
B1  = (Ne − Sem) & (∼Ne),
B2  = ∼((Sem << (lmin − 1)) − Sem),      (2)
B3  = (Ne & A1) | (Ne & Se),
B4  = (Ns − B3) & (∼Ns),
B5  = (B1 & B2) | B4,
Pc  = Ps & B5.

² http://graphics.stanford.edu/~seander/bithacks.html
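The driving primitive of these algorithms, "the position of the last suffix," is a highest-set-bit lookup followed by a bit clear. The sketch below is a hedged illustration with hypothetical helper names (`last_one`, `clear`); with Python integers as bit vectors and bit i as position i, the "last" (rightmost in the text) position is simply the highest-order set bit.

```python
def last_one(v):
    """Position of the highest set bit, i.e., the last marked text position."""
    return v.bit_length() - 1

def clear(v, pos):
    """Change bit `pos` from 1 to 0."""
    return v & ~(1 << pos)

Ssm = sum(1 << p for p in [5, 14])   # valid matching suffix positions
visited = []
while Ssm:
    t = last_one(Ssm)                # "position of the last suffix using Ssm"
    Ssm = clear(Ssm, t)              # "change Ssm[t] from 1 to 0"
    visited.append(t)
print(visited)                        # [14, 5]
```

This is the skeleton of the outer loop of PNS-BitC: suffix positions are consumed from the end of the text toward the beginning, each in O(1) word operations.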

365 input vectors and intermediate results of the bit operations in Equation 2. Table 2: Generate candidate regions using bit oper- ations without any constraint (lmin =2). A0 Similar to the bit vector in Equation 1, in the inter- Vector mediate bit vector A1 in Equation 2, A1[di]=0ifdi is 0 5 10 15 20 Ps 1001010000100001000100 the end position of a valid matching suffix. Then we can Se 0000011001101111000100 N 0010000101000010011001 Se s get the vector m , in which each bit 1 means an end posi- Ne 0010000100010000110011 tion of a valid matching suffix. Notice that, different from A1 1110010101011110010001 Sem 0000001000100001000100 Ssm in Equation 1, the vector Sem needs to do an AND B1 0001111011101111001100 ∼ A S ∼ N B2 1111110111011110111011 operation between ( 1)& e and ( e). Before we give B3 0010000100010000010001 the reason, let us consider the case where a matching suf- B4 0000000000110000000000 B5 0001110011111110001000 fix S has the same end position t with a matching N-factor Pc 0001010000100000000000 N1. According to the analysis in Section 4.1, this suffix S should not be used to generate candidate occurrences (see Figure 3(a)). In addition, if there is no suffix between the ification. The algorithm PNS-BitG requires the same time n N1 N2 O   N-factor and its right neighbor N-factor , then the bit complexity as the algorithm PNS-BitC, but needs (6 w ) A1[t] is 0. Therefore, the algorithm PNS-BitG may choose to store one more vector Ne than the algorithm PNS-BitC. an invalid matching suffix setting Sem [t] = 1. In order to avoid choosing such invalid matching suffix S, we do the bit 4.3 Improving Pruning Power by Combining AND operation between (∼ A1)&Se and (∼ Ne). Negative Factors with Necessary Factors Besides prefixes and suffixes, there is another type of pos- ʌne1 ʌse ʌne2 ʌne1 ʌse ʌne2 N1 SN2 N1 SN2 itive factors, called necessary factor. A necessary factor T T Q ... 0100 111 ...... 1111 1101 ... w.r.t. 
an RE is a substring that must appear in every matching substring in the text T . For instance, G is a nec- B B ∗ ∗ ∗ (a) Vector 1. (b) Vector 2. essary factor w.r.t. the RE Q =(G|T)A GA T . GNU grep 2.0 [4] employs a different heuristic approach for finding nec- ʌne1 ʌne2 ʌn ʌs ʌn e1= e e2 Q N1 N2 N1 SN2 essary factors w.r.t. an RE . The neighborhoods of these T T necessary factors are then verified using a lazy deterministic ... 0000 0000 ...... 0001 000000 ... automaton. (c) Vector Ne& A1. (d) Vector Ne& Se. Generally, a necessary factor (substring) divides Q into a left part and a right part. Two automata are constructed for Figure 6: Explanation of vectors. verification in both directions. Figure 7(a) shows an example of this approach, call M algorithm. Since G is a necessary For the purpose of calculating valid matching prefixes cor- factor, the algorithm M builds an automaton Ar for the responding to each valid matching suffix, we use five inter- right part of the RE Q, i.e., GA∗T∗, and another automaton ∗ mediate vectors B1,...,B5 to explain how the calculation Al for the left part of Q, i.e., (G|T)A G.ItthenrunsAr on works. Let S be the rightmost suffix between two N-factors the suffixes of T starting at positions 5, 8, and 18, and runs

N1 and N2. B1[di]=1ifπne1

B2[di]=0(πss

= 0 to mark the useless N-factor N1 and keep B3[πne2 ]=1 be pruned without verification. Now we show an interesting at the same time. We use Ne&A1 to mark all the useless N- observation that using both negative factors and necessary factors (see Figure 6(c)). However, a useless N-factor needs factors together (i.e., algorithm PMNS) can enhance the to be maintained if there is a suffix S happens ending at filtering power significantly. the same position with this N-factor. The reason is that a Figures 7(b) and 7(c) show the examples of introducing matching suffix T [πss ,πse ] could be valid if there exists a necessary factors into the algorithms P and PS, respectively. prefix starting at position πps and πns <πps ≤ πss ,where In Figures 7(b), the algorithm PM can prune the candi-

πns is the starting position of N1 (see Figure 6(d)). We use date T [19, 20] since it does not contain a necessary factor. Ne&Se to specify all the same end positions of matching However, in Figure 7(c), the algorithm PMS cannot prune N-factors and valid matching suffixes. Furthermore we set any more candidates because the necessary factor G at po- each bit in B4[πns +1,πse ] to 1. Then we use the vector sition 18 appears in the last matching suffix T [18, 19] = GT. B5 to combine all cases of possible valid matching prefixes. Generally, it has a high probability that a necessary factor Finally, the vector Pc marks all valid matching prefixes. appears in the late part of T , that is, each candidate occur-

Table 2 shows the generated Sem and Pc, and the candi- rence might have a high probability to contain this necessary date occurrences are consistent with those shown in Figure 4. factor and could not be pruned. We call the corresponding algorithm PNS-BitG.Afterus- However, the algorithm PNS generates candidates in a ing Equation 2 to get the two vectors Sem and Pc, the algo- relatively short interval, in which at least one necessary fac- rithm gets each position pair πs and πp from Sem and Pc,re- tor is expected to be found, otherwise it must not be an spectively. It then generates candidates T [πp,πs]todover- occurrence. For example, the substring T [10, 15] shown in
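The necessary-factor test on PNS intervals can be sketched as follows. This is a hedged illustration with hypothetical helper names (`find_all`, `prune_by_necessary`); it checks only the containment condition, not the full PMNS algorithm.

```python
def find_all(text, factor):
    """Starting positions of all occurrences of `factor` in `text`."""
    return [i for i in range(len(text) - len(factor) + 1)
            if text.startswith(factor, i)]

def prune_by_necessary(candidates, text, factor):
    """Keep only candidate intervals [a, b] that fully contain an
    occurrence of the necessary factor."""
    occ = find_all(text, factor)
    return [(a, b) for (a, b) in candidates
            if any(a <= i and i + len(factor) - 1 <= b for i in occ)]

T = "TACTAGACGTTAATTTACGTA"
# G is a necessary factor w.r.t. Q = (G|T)A*GA*T*; it occurs at 5, 8, and 18.
print(prune_by_necessary([(3, 6), (10, 15)], T, "G"))   # [(3, 6)]
```

The candidate T[10, 15] contains no G and is discarded without verification, matching the discussion of Figure 7(d); T[3, 6] survives and is then verified.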

366 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Q ∗ TACTAGACGTTAATTTACGTA For example, for the RE = C AAA,thesetofcoreN- factors w.r.t. Q is {G, T, AC, AAAA}. The substring GA is not

C andidate occurrences a core N-factor since its subsequence G is an N-factor. In order to compute an upper bound on the number of (a) The algorithm M. core N-factors w.r.t. an RE Q,weuseafactor automa- 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ton [2] to check whether a string is an N-factor. A factor TACTAGACGTTAATTTACGTA automaton is an automaton representing the set of all pos- itive factors w.r.t. Q. For a given RE Q, we first construct C andidate occurrences a nondeterministic factor automaton Af .ThenAf can be (b) The algorithm PM. further transformed to a unique minimal deterministic fac-

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 tor automaton Afm , which accepts exactly the same set of TACTAGACGTTAATTTACGTA strings [5]. Using Afm we can prove an upper bound on the number of core N-factors. C andidate occurrences Theorem Q A (c) The algorithm PMS. 2. Let be an RE and fm be its minimized deterministic factor automaton. The length of a core N- 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 factor w.r.t. Q cannot be greater than the number of states TACTAGACGTTAATTTACGTA in the longest acyclic of Afm .

C andidate occurrences Lemma 2. The number of core N-factors w.r.t. an RE Q 2 (d) The algorithm PMNS. is upper bounded by |Q| . Proof. An upper bound on the number of states in the Figure 7: Checking candidate occurrences of the RE longest acyclic path of the minimized deterministic factor Q =(G|T)A∗GA∗T∗ using necessary factors, which are automaton is the square of the number of characters that shown in bold font. Q contains, i.e., |Q|2. Therefore, the length of any core N- factor should be no greater than |Q|2. Figure 7(d) is unnecessary to be verified using the algorithm 5.2 Constructing Core N-Factors Online PMNS since it does not contain any necessary factor. One naive way to construct core N-factors is to enumerate PNS-Merge We can design algorithms using the ideas of strings with a length ≤|Q|2 and check if they are core N- PNS and bit operations to combine necessary factors with factors one by one (see Algorithm 3). The check process of patterns. Due to space limitation we do not give the detail a string s includes two phases: (i) the function CheckSub- of these algorithms. In Section 6 we show our experimental Sequence checks if a subsequence of s is an N-factor (line PMNS results of the algorithm by using bit operations with 4), and (ii) otherwise, check if s itself is an N-factor, which constraints |N|≤lmin, which is called PMNS-BitC. can be determined by running the factor automaton Af (Q) (line 6). 5. CHOOSING GOOD N-FACTORS The number of N-factors w.r.t. an RE could be large, and Algorithm 3: NaiveCore different N-factors could have different improvements on the Input: Alphabet Σ, an RE Q, a factor automaton Af (Q); PNS matching performance of algorithms based on patterns. Output CN ∗ : A set of core N-factors ; For example, given an RE Q = C AAA,bothG and CGA are 1 CN ←∅; N-factors w.r.t. Q. 
It is obvious that any occurrence of CGA 2 for length l ← 1; l ≤|Q|2; l ++do in a text T also contains an occurrence of G in T .Thusthe 3 for each string s ∈ Σl do N-factor CGA does not provide more filtering power than G. 4 FOUND ← CheckSubSequence(CN , s); if then So a natural question is how to choose a small number of 5 FOUND is false 6 if s A Q then high-quality N-factors to maximize search performance. In cannot be accepted by f ( ) 7 CN ← CN ∪{s}; this section we develop techniques to solve this problem. In Section 5.1 we show how to define a finite set of high-quality 8 C N-factors, called “core N-factors,” and show an upper bound return N ; on the number of core N-factors w.r.t. an RE Q.Thenin Section 5.2 we describe the challenge of efficiently construct- It can be time consuming to enumerate all the strings with ing the core N-factors, and propose an efficient algorithm alength≤|Q|2 and check each of them, especially when we for constructing core N-factors. In Section 5.3 we develop have to do it for each search query. For instance, we need to a technique to speed up the generation of core N-factors by enumerate 340 strings within 16 iterations for the RE Q = doing early termination. C∗AAA. Next we present a more efficient method to construct core N-factors to reduce the computational cost. 5.1 Core N-Factors Instead of enumerating all the strings (see line 3 in Al- gorithm 3), we can use two properties of core N-factors to Definition 2. (Core N-factor). An N-factor w.r.t. an RE generate them with a length l using a smaller set of strings Q is called a core N-factor if each of its proper subsequences l − 3 with length 1. Based on Definition 2, we show the prop- is not an N-factor w.r.t. Q. erties and how to utilize them to improve the performance. 3We distinguish substring and subsequence in this paper. A the characters in a subsequence may not be consecutive in substring of a string s has consecutive characters of s, while s.
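To make the naive construction concrete, the following is a minimal Python sketch of Algorithm 3 (NaiveCore). It is a brute-force model: instead of a real factor automaton Af(Q), it assumes the set of positive factors has been precomputed (here from a bounded enumeration of matches, which is sufficient for toy REs such as C∗AAA). The names `positive_factors`, `proper_subsequences`, and `naive_core` are ours, not the paper's.

```python
from itertools import combinations, product

def positive_factors(matches):
    """All substrings of any matching string: the language the factor
    automaton Af(Q) would accept."""
    factors = set()
    for m in matches:
        for i in range(len(m)):
            for j in range(i + 1, len(m) + 1):
                factors.add(m[i:j])
    return factors

def proper_subsequences(s):
    """All proper (strictly shorter) subsequences of s."""
    for k in range(1, len(s)):
        for idx in combinations(range(len(s)), k):
            yield "".join(s[i] for i in idx)

def naive_core(alphabet, factors, max_len):
    """Algorithm 3, brute force: a string s is a core N-factor if no proper
    subsequence of s is an N-factor (the CheckSubSequence test) and s
    itself is not a positive factor."""
    core = set()
    for l in range(1, max_len + 1):
        for tup in product(alphabet, repeat=l):
            s = "".join(tup)
            found = any(t not in factors for t in proper_subsequences(s))
            if not found and s not in factors:
                core.add(s)
    return core

# Q = C*AAA: matches are C^k AAA; bound k for this toy model.
factors = positive_factors("C" * k + "AAA" for k in range(8))
print(sorted(naive_core("ACGT", factors, 4)))  # ['AAAA', 'AC', 'G', 'T']
```

The output reproduces the paper's running example: the core N-factors of C∗AAA are {G, T, AC, AAAA}, while GA is excluded because its subsequence G is already an N-factor.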

(1) Generate core N-factors using string joins.

Property 1. If a string x is a core N-factor, then its prefix x[0, |x|−2] and suffix x[1, |x|−1] can be accepted by the factor automaton Af(Q).

This property helps us improve the performance of computing N-factors by "joining" a set of strings with length l − 1 to generate strings with length l. The formal definition of string-join is given below.

Definition 3. (String-join) Given two strings s1 and s2, the string-join of s1 and s2, denoted by s1 ⋈ s2, is computed as follows. If |s1| = |s2| = l − 1 and s1[1, l−2] = s2[0, l−3], then (s1 ⋈ s2)[0, l−2] = s1 and (s1 ⋈ s2)[l−1] = s2[l−2]; otherwise, s1 ⋈ s2 = ∅.

Definition 4. (String set self-join) Let S = {s1, ..., sk} be a set of strings of length l − 1. The string set self-join of S, denoted by ⋈S, is the set of non-empty strings si ⋈ sj (1 ≤ i, j ≤ k).

For example, let S = {s1, s2} with s1 = ACG and s2 = CGT; then ⋈S = {ACGT}.

Compared with Algorithm 3, we can get the same set of core N-factors with length l by self-joining the set of strings with length l − 1, each of which can be accepted by the factor automaton Af(Q). Let S be the set of strings that can be accepted by Af(Q), where each string in S has a length l − 1, and let S' = ⋈S. We use SQ(Σ^l) ⊆ Σ^l and SQ(S') ⊆ S' to represent the strings that can be accepted by Af(Q), and use SC(Σ^l) ⊆ Σ^l and SC(S') ⊆ S' to represent the core N-factors.

Theorem 3. The two sets SQ(Σ^l) and SQ(S') are equivalent, and the two sets SC(Σ^l) and SC(S') are also equivalent.

(2) Avoid invoking CheckSubSequence for every generated string.

Property 2. Given a core N-factor x with a length greater than 1, let s be a string whose prefix s[0, |s|−2] and suffix s[1, |s|−1] can be accepted by Af(Q). If x is a subsequence of s, then x[0] = s[0] and x[|x|−1] = s[|s|−1].

Let |x| = m (1 < m ≤ …); when … > k, we cannot find any string s to make s ⋈ s an N-factor. Furthermore, for a core N-factor x that does not satisfy x[0] = s[0] and x[|x|−1] = s[l−1], we do not need to invoke the function CheckSubSequence based on Property 2 (lines 7 – 9). In the above example for Q = C∗AAA, among the 65 strings, only 8 strings need to be further checked using the function CheckSubSequence.

Algorithm 4: QuickCore
Input: Alphabet Σ, an RE Q, a factor automaton Af(Q)
Output: A set of core N-factors CN
 1  CN ← ∅; S ← Σ;
 2  Create a hash table HT;
 3  for length l ← 1; l ≤ |Q|^2; l++ do
 4      for each string s ∈ S do
 5          FOUND ← false;
 6          if l > 2 then
 7              Create a string y with y[0] ← s[0] and y[1] ← s[l−1];
 8              if y is in HT then
 9                  FOUND ← CheckSubSequence(CN, s);
10          if FOUND is false then
11              if s cannot be accepted by Af(Q) then
12                  CN ← CN ∪ {s}; S ← S − {s};
13                  Insert y into HT, where y[0] ← s[0] and y[1] ← s[l−1];
14          else
15              S ← S − {s};
16      S ← ⋈S;
17  return CN;

By self-joining the retained strings with length l − 1, the algorithm generates the strings with one more character (lines 3 – 16). As we can see in the second column in Table 3, Algorithm 4 keeps generating strings and checks if some of them are N-factors.

5.3 Early Termination of Constructing Core N-Factors

We observe that the upper bound |Q|^2 on the length of an N-factor in Algorithm 4 is loose, which can result in many iterations in the algorithm (see line 3). As we can see in Table 3, all core N-factors have been generated before the fifth iteration. However, the absence of incremental core N-factors at an iteration cannot guarantee that there will not be any more core N-factors generated in the following iterations, i.e., a new core N-factor can still be generated even though nothing is produced in the previous iterations. So the question is whether there exists a tighter upper bound on the length, using which we can terminate the iterations early. Next we show such a tighter bound.
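The string-join and self-join of Definitions 3 and 4 are mechanical enough to sketch directly. The following hedged Python version (function names are ours) reproduces the ⋈ operator:

```python
def string_join(s1, s2):
    """String-join (Definition 3): defined only when s1 and s2 have the
    same length and overlap on all but one character; the result extends
    s1 by the last character of s2."""
    if len(s1) == len(s2) and s1[1:] == s2[:-1]:
        return s1 + s2[-1]
    return None  # the join is empty

def self_join(strings):
    """String set self-join (Definition 4): all non-empty pairwise joins."""
    return {j for s1 in strings for s2 in strings
            if (j := string_join(s1, s2)) is not None}

print(self_join({"ACG", "CGT"}))  # {'ACGT'}
```

The printed result matches the paper's example: joining S = {ACG, CGT} yields ⋈S = {ACGT}, since ACG and CGT overlap on CG.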

In the second iteration, the string CC is used to generate strings CCA and CCC in the third iteration, and to generate strings CCAA, CCCA, and CCCC in later iterations. As we know, CC is not an N-factor because C∗ exists in the RE Q. Any generated string based on CC cannot be an N-factor, since multiple Cs can always be accepted by the factor automaton Af(Q). Therefore, it is not necessary to keep CC in the second iteration.

Based on this observation, we propose the algorithm EarlyCore (Algorithm 5) to do early termination in the construction of core N-factors without any false dismissal. The algorithm first counts the frequency of each character in each Kleene closure e∗ in Q (lines 2 – 7) and then uses it to check whether a generated string needs to be retained for self-joining in the next iteration (line 12).

Algorithm 5: EarlyCore
Input: Alphabet Σ, an RE Q, a factor automaton Af(Q)
Output: A set of core N-factors CN
 1  CN ← ∅; S ← Σ; CS ← ∅;
 2  for each expression e∗ in Q do
 3      Let C ← the set of characters in e;
 4      Let freq(C) ← 0;
 5      for each character c in e do
 6          freq(C) ← freq(C) + number of c in Q;
 7      Insert C into CS;
 8  l ← 1;
 9  while S is not empty do
10      for each string s ∈ S do
11          if ∃C ∈ CS and s ∈ C^l then
12              if l = freq(C) + 1 then  // bound of C
13                  S ← S − {s};
14          ...  // see lines 5 – 15 in Algorithm 4
15      S ← ⋈S; l ← l + 1;
16  return CN;

Table 3 shows the number of iterations using Algorithm 4 and Algorithm 5, respectively. Algorithm 5 only needs 4 iterations, compared with the 16 iterations of Algorithm 4.

Theorem 4. The set S in Algorithm 5 always converges to an empty set.

5.4 Pruning Power of N-Factors

In this section, we analyze the pruning power of N-factors. We first present a theoretical analysis, then give an experimental result on a real data set.

Consider a substring T[a, d] conforming to a PNS pattern, in which the substring T[a, b] is a matching prefix and the substring T[c, d] is a matching suffix (a ≤ b ≤ c ≤ d). Let p1(n) denote the probability that the length of T[b, c] is equal to n, and p2(n) the probability that there exists at least one N-factor matching in T[b, c]. Then the probability of filtering any prefix matching in T using N-factors can be calculated as follows:

    pf = Σ_{n=0}^{|T|−2·lmin} p1(n) × p2(n).    (3)

We first calculate p1(n). Let S = {S1, ..., Sh} be the set of suffixes w.r.t. the RE Q. As defined in [14], each suffix in S has the same length lmin. Let B_n(h, lmin) denote the number of substrings T[b, c] with a length n such that any suffix in S is not a substring of T[b, c]. Then we could get the following recurrence function for n ≥ 0:

    B_n(h, lmin) = |Σ|^n                                                  if n < lmin,
    B_n(h, lmin) = |Σ|·B_{n−1}(h, lmin) − h·B_{n−lmin}(h, lmin)           otherwise.

Then we have

    p1(n) = (B_n(h, lmin) / |Σ|^n) × (h / |Σ|^lmin).    (4)

Similarly, let N = {N1, ..., Nk} be the set of N-factors w.r.t. the RE Q, where each N-factor in N has a length l'. Then we have

    p2(n) = 1 − B_n(k, l') / |Σ|^n.    (5)

6. EXPERIMENTS

In this section, we present experimental results of the N-factor technique on multiple real data sets.

Experiment Setup. We conducted the experiments on three public data sets, including Human Genome, Protein sequences, and English texts.

• Human Genome: The genomic sequence (GRCh37) was assembled from a collection of DNA sequences, which consisted of 24 chromosomes with a length varying from 48 million to 249 million (http://hgdownload.cse.ucsc.edu/goldenPath/hg18).

• Protein sequences: We adopted the database Pfam 26.0, which contains a large number of protein families and is composed of Pfam-A and Pfam-B (ftp://ftp.sanger.ac.uk/pub/databases/Pfam/releases). The symbol set consists of all the capital English letters, excluding "O" and "J". We randomly picked text with a length varying from 101 to 9143 from Pfam-B.

• English texts: We used DBLP-Citation-network (http://arnetminer.org/DBLP Citation), which included 1,632,442 blocks, each of which corresponds to one paper. Each block contains several attributes of a paper, e.g., title, authors, abstract, etc. We extracted the abstract from every block. The symbol set consisted of 52 English letters, plus 10 digits.

We extracted several subsequences of length ranging from 10 million to 100 million from the Human Genome. For the other two data sets, the size of each data set varied from 10MB to 100MB. Since there is no "random" regular expression [12], we manually synthesized REs to cover different properties of regular expressions (see Table 4). The size of these REs was from 6 to 21, and lmin varied from 2 to 5. We ran each algorithm using the corresponding REs on different data sets. These REs were used as queries to compare the performance of different algorithms.

All the algorithms were implemented using GNU C++. The experiments were run on a PC with an Intel 3.10GHz Quad Core i5 CPU and 8GB memory with a 500GB disk, running a Ubuntu (Linux) 64-bit operating system. All index structures were in memory.

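The probability model of Section 5.4 is easy to evaluate numerically. Below is a hedged Python sketch of the recurrence B_n(h, lmin) and of Equations 3 – 5; it assumes, as the analysis does, uniformly distributed characters, and it assumes the same recurrence can be reused with parameters (k, l') for p2. The function names `make_B` and `pruning_probability` are ours.

```python
from functools import lru_cache

def make_B(sigma):
    """Recurrence for B_n(h, l): the number of length-n strings over an
    alphabet of size sigma that avoid all h forbidden substrings of
    length l, under the paper's uniform, non-overlapping model."""
    @lru_cache(maxsize=None)
    def B(n, h, l):
        if n < l:
            return sigma ** n
        return sigma * B(n - 1, h, l) - h * B(n - l, h, l)
    return B

def pruning_probability(text_len, sigma, h, l_min, k, l_neg):
    """p_f = sum over n of p1(n) * p2(n) (Equation 3), with p1 from
    Equation 4 (h suffixes of length l_min) and p2 from Equation 5
    (k N-factors of length l_neg)."""
    B = make_B(sigma)
    pf = 0.0
    for n in range(text_len - 2 * l_min + 1):
        p1 = (B(n, h, l_min) / sigma ** n) * (h / sigma ** l_min)
        p2 = 1.0 - B(n, k, l_neg) / sigma ** n
        pf += p1 * p2
    return pf
```

As a sanity check, B_3(1, 3) over a 4-letter alphabet is 63: of the 64 length-3 strings, exactly the one forbidden string is excluded. Note that the recursion depth grows with the text length, so very long texts would need an iterative (bottom-up) version of B.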
[Figure 8: Performance comparison of different algorithms (running time in ms). Panels: (a) DNA sequences; (b) protein sequences; (c) English text. Each panel compares Gnu Grep, Agrep, NR-grep, and PNS-BitC on the queries Q1 – Q5 of the corresponding data set.]

Table 4: Synthesized regular expressions.

  DNA:
    Q1d = C(TA)∗(G)(TT)∗(A|G)
    Q2d = (CT|GT|AT)(CT)(A|T)∗A
    Q3d = ((TA)|(GG))∗(GA)(AT|GC)
    Q4d = (TG)∗(C|A)(C)(GA)∗(T)
    Q5d = (TG)∗(C|A)(C)GA∗T
  Protein:
    Q1p = (VYL|VAP|DD|LR)(PST)∗(TT|FA|EMLA)
    Q2p = (EV|NV|SS|PQ)(LSI)∗(VR|SR|SV|VI)
    Q3p = (EL)∗(VR)(L|E)∗G
    Q4p = (LQ|LA)A∗(V|L)
    Q5p = (SL|LA)(A|L)∗(EL)(S|L)∗
  English text:
    Q1e = (e|a|i)∗(re|to)s
    Q2e = (this|This|That|that)∗(is)(c|d|e|f|g)
    Q3e = (Auto|auto)∗(ma|ta)
    Q4e = S(e|a|i)∗st
    Q5e = (pa)t∗(er|n)

Comparison of RE Matching Algorithms. Recall that the existing algorithms Agrep, Gnu Grep, and NR-grep were developed for matching REs on a set of short sequences. When a sequence contains more than one occurrence of a query, only the first occurrence of the query is returned. For the purpose of comparability, we modified their source code so that they can find all the occurrences, as our algorithms do.

Figure 8 shows the performance comparison of Agrep, Gnu Grep, NR-grep, and our algorithms on the three data sets. Each data set contained sequences with length 50 million. The algorithm PNS-BitC achieved the best time performance. For instance, when querying Q3d on DNA, PNS-BitC took only 27ms, compared to the 171ms, 241ms, and 219ms using Agrep, Gnu Grep, and NR-grep, respectively. The superiority was even more evident on English texts, where PNS-BitC was 74 times faster than the popular Gnu Grep tool, e.g., 3ms versus 242ms, 223ms, and 119ms for query Q2e. The difference was due to the fact that the pruning power of N-factors increased as the size of Σ increased.

[Figure 9: Improving existing approaches by using N-factors (running time in ms). Panels: (a) DNA sequences; (b) protein sequences; (c) English texts. Each panel compares Agrep, Gnu Grep, and NR-grep with their N-factor variants Agrep^N, Gnu Grep^N, and NR-grep^N.]

Improving Existing Algorithms Using N-Factors. To evaluate the benefits of N-factors, we modified the three existing algorithms to utilize N-factors. Figure 9 shows the benefits of N-factors. Each superscript N means that the algorithm was using N-factors. It can be seen that the modified algorithms achieved a better performance than the original algorithms. For instance, for the DNA data set, the N-factors improved the performance of the algorithms by about 3 times. Figure 9(a) shows that for the query Q1d, Agrep^N reduced the time of Agrep from 171ms to 42ms, Gnu Grep^N reduced the time of Gnu Grep from 222ms to 76ms, and NR-grep^N reduced the time of NR-grep from 456ms to 154ms. Figure 9(b) shows that Agrep and Gnu Grep were improved by more than 10 times when using N-factors for queries Q1p and Q2p.

Scalability of Using N-Factors. Figure 10 shows the slowly increasing running time when we increased the length of the sequence for the different algorithms using N-factors. We can see that the bit-parallel algorithms performed much better than the algorithm PNS-Merge, and that the algorithm PMNS-BitC was the most efficient one. For instance, when the length of the sequence was 100 million, the running times were 149ms, 168ms, and 215ms, respectively, for DNA sequences; 16ms, 17ms, and 25ms, respectively, for protein sequences; and 43ms, 44ms, and 62ms, respectively, for English texts.
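The "find all occurrences" behavior we patched into the existing tools can be imitated in Python's re module by wrapping the pattern in a zero-width lookahead, so scanning resumes after every match start and overlapping occurrences are reported. This is an illustrative workaround of ours, not the mechanism used in the modified tools:

```python
import re

def all_occurrences(pattern, text):
    """Return (start, matched_text) for every occurrence, including
    overlapping ones; a plain finditer would skip matches that start
    inside a previously reported match. With a greedy group, each start
    position reports only its longest match."""
    wrapped = re.compile(r"(?=(" + pattern + r"))")
    return [(m.start(1), m.group(1)) for m in wrapped.finditer(text)]

print(all_occurrences("AA", "AAAA"))  # [(0, 'AA'), (1, 'AA'), (2, 'AA')]
```

For example, running it with the RE (G|T)A∗GA∗T∗ of Figure 7 on the text TACTAGACGT reports the single occurrence TAGA starting at position 3.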

[Figure 10: Scalability of approaches using N-factors (running time in ms versus sequence length, 0 – 10 × 10^7). Panels: (a) DNA sequences; (b) protein sequences; (c) English texts. Each panel compares PNS-Merge, PNS-BitC, PNS-BitG, and PMNS-BitC.]

Pruning Power of N-Factors. In Section 5.4 we analyzed the pruning power of N-factors. In order to calculate the probability value pf in Equation 3 and see the pruning power of using N-factors for DNA sequences, we used three regular-expression workloads with query lengths 8, 12, and 15, respectively, each of which contained 100 REs, with lmin = 5. Each RE in a workload generated different numbers of suffixes and N-factors with different lengths. We ran the REs in each workload on DNA sequences with 50 million characters and calculated the average probability pf. Figure 11 shows the pruning power of using N-factors. The dotted lines represent the computed values pf, which ranged from 70.071% to 73.781% in the three workloads.

[Figure 11: The ability of pruning false negatives using N-factors on DNA sequences. Box plots of the percentage of pruned candidates for the three regular-expression workloads (|Q| = 8, 12, and 15; lmin = 5).]

Each box region in the figure represents the pruning power in the range between the first quartile (the top 25% REs) and the third quartile (the top 75% REs). The line in the box is the median value of the pruning power, and the plus sign and the circles indicate the mean value of the pruning power and the outliers, respectively. For the workload of |Q| = 8, the top 25% REs in the workload could prune 98.331% of the false negatives (i.e., substrings that do not need to be verified), and the top 75% REs could prune 92.008% of the false negatives. The other two workloads also provided significantly high pruning power. As we can see, the value pf for each workload was lower than the experimental mean value. The reason is that our analysis is based on the assumption that data is evenly distributed, which may not be true in real data sets.

We got similar results for proteins and English texts (for space reasons, we do not show the details). The average probability value pf for protein sequences was 72.115% when |Q| = 19 and lmin = 4, and the average probability value pf for English texts was 61.870% when |Q| = 23 and lmin = 4.

We then tested the number of verifications when using different N-factor-based algorithms on the three data sets of size 100MB. As we can see from Figure 12, the algorithm PMNS-BitC, which considers the necessary factors, required fewer verifications in most cases. For example, in Figure 12(a), when the query was Q4d, the number of verifications using PMNS-BitC was 475,926, about half of the number of the other algorithms. However, in some cases the filtering advantage of PMNS-BitC was not evident. For example, in Figure 12(c), the number of verifications of the algorithm PMNS-BitC was 63,861, which was similar to the numbers of the algorithms PNS-BitC and PNS-BitG, and was even higher than that of PNS-Merge.

Construction of N-Factors. We compared the construction time of N-factors using different construction algorithms and give the experimental results in Table 5. We did not use the algorithm NaiveCore in the experiments due to its very poor performance. The time of the algorithm QuickCore was not stable, varying from 0.05ms to 648.08ms. The reason is that the number of sequences to be processed can grow exponentially if there exists a Kleene closure of size one. The algorithm EarlyCore was much more efficient than QuickCore, and it was stable since it avoids generating N-factors due to Kleene closures. In the best case, it took only 0.05ms to construct the N-factors, and the worst construction time was just 1.66ms.

7. RELATED WORK

The traditional technique of finding occurrences of an RE Q of length m in a text T of length n is to convert Q to an automaton and run it from the beginning of T. An occurrence is reported whenever a final state of the automaton is reached [1, 4, 8].

Recent techniques utilize positive factors of the RE Q to improve the traditional automata-based techniques. For instance, MultiStringRE [14] uses prefixes, Gnu Grep [4] uses necessary factors, and NR-grep [9, 11] uses reversed prefixes to identify initial matchings and then verify them using the automata (see Section 2.2 for details). All these approaches support finding initial matchings of positive factors. MultiStringRE and Gnu Grep use the Shift-And algorithm [15], and NR-grep uses BNDM [10], where BNDM is a bit-parallel implementation of the reverse automaton. The above approaches cannot be directly used to find all occurrences of an RE in a long text (sequence), since they are mainly developed for identifying the sequences that contain at least one matching of the RE Q among a set of sequences. Agrep [15] is a different approach that supports approximate matching of regular expressions within a specified search region and returns matching occurrences exactly.
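For reference, the Shift-And idea used by MultiStringRE and Gnu Grep can be sketched for a plain string pattern in a few lines of Python. This is a minimal sketch of the classic bit-parallel scheme [15], not the tools' actual implementation; variable names are ours.

```python
def shift_and(text, pattern):
    """Bit-parallel Shift-And matching: bit i of the state is set iff
    pattern[0..i] matches the text suffix ending at the current position.
    Returns the end positions of all occurrences of pattern in text."""
    m = len(pattern)
    mask = {}
    for i, c in enumerate(pattern):
        mask[c] = mask.get(c, 0) | (1 << i)  # bit i set where pattern[i] == c
    state, ends = 0, []
    for pos, c in enumerate(text):
        state = ((state << 1) | 1) & mask.get(c, 0)
        if state & (1 << (m - 1)):  # highest bit set: full pattern matched
            ends.append(pos)
    return ends

print(shift_and("TACTAGACGT", "TA"))  # [1, 4]
```

Because Python integers are unbounded, this version works for patterns longer than a machine word; C implementations such as Agrep keep the state in one word, which is what makes the scan so fast in practice.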

[Figure 12: Comparison of verification numbers (log scale). Panels: (a) DNA sequences; (b) protein sequences; (c) English texts. Each panel compares PNS-Merge, PNS-BitC, PNS-BitG, and PMNS-BitC on the queries Q1 – Q5 of the corresponding data set.]

Table 5: Time for constructing core N-factors (QuickCore / EarlyCore).

  DNA:     Q1d 18.74ms / 0.15ms;  Q2d 648.08ms / 0.09ms;  Q3d 0.17ms / 0.11ms;  Q4d 0.05ms / 0.05ms;  Q5d 14.89ms / 0.19ms
  Protein: Q1p 2.02ms / 1.99ms;   Q2p 0.87ms / 0.86ms;    Q3p 13.37ms / 0.16ms; Q4p 0.17ms / 0.12ms;  Q5p 429.48ms / 1.01ms
  English: Q1e 528.99ms / 0.63ms; Q2e 1.71ms / 1.66ms;    Q3e 0.45ms / 0.43ms;  Q4e 215.80ms / 0.37ms; Q5e 0.65ms / 0.44ms

Compared to these existing approaches, our main contribution is to use negative factors to improve the matching performance.

8. CONCLUSION

In this paper, we proposed a novel technique called the N-factor and developed algorithms to improve the performance of matching a regular expression against a sequence. We gave a full specification of this technique, and conducted experiments to compare the performance of our algorithms with existing algorithms such as Agrep, Gnu Grep, and NR-grep. The experimental results demonstrated the superiority of our algorithms. We also extended Agrep, Gnu Grep, and NR-grep with the N-factor technique, and showed great performance improvements.

9. ACKNOWLEDGMENTS

The work is partially supported by the National Basic Research Program of China (973 Program) (No. 2012CB316201), the National NSF of China (Nos. 60973018, 61272178), the Joint Research Fund for Overseas Natural Science of China (No. 61129002), the Doctoral Fund of Ministry of Education of China (No. 20110042110028), the National Natural Science of China Key Program (No. 60933001), the National Natural Science Foundation for Distinguished Young Scholars (No. 61025007), and the Fundamental Research Funds for the Central Universities (No. N110804002).

10. REFERENCES

[1] R. A. Baeza-Yates and G. H. Gonnet. Fast text searching for regular expressions or automaton searching on tries. J. ACM, 43(6):915 – 936, 1996.
[2] M. Šimánek. The factor automaton. Kybernetika, 38(1):105 – 111, 2002.
[3] M. Crochemore, A. Czumaj, L. Gasieniec, S. Jarominek, T. Lecroq, W. Plandowski, and W. Rytter. Speeding up two strings matching algorithms. Algorithmica, 12(4/5):247 – 267, 1994.
[4] GNU grep. ftp://reality.sgiweb.org/freeware/relnotes/fw-5.3/fw_gnugrep/gnugrep.html.
[5] J. E. Hopcroft and J. D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley Publishing Company, Reading, Massachusetts, 1979.
[6] L. F. Kolakowski, J. Leunissen, and J. E. Smith. ProSearch: Fast searching of protein sequences with regular expression patterns related to protein structure and function. Biotechniques, 13:919 – 921, 1992.
[7] T. W. Lam, W. K. Sung, S. L. Tam, C. K. Wong, and S. M. Yiu. Compressed indexing and local alignment of DNA. Bioinformatics, 24(6):791 – 797, 2008.
[8] M. Mohri. String matching with automata. Nordic Journal of Computing, 4(2):217 – 231, 1997.
[9] G. Navarro. NR-grep: a fast and flexible pattern-matching tool. Software Practice and Experience (SPE), 31:1265 – 1312, 2001.
[10] G. Navarro and M. Raffinot. Fast and flexible string matching by combining bit-parallelism and suffix automata. ACM Journal of Experimental Algorithmics (JEA), 5:4, 2000.
[11] G. Navarro and M. Raffinot. Compact DFA representation for fast regular expression search. In Proceedings of WAE'01, Lecture Notes in Computer Science 2141, pages 1 – 12, 2001.
[12] G. Navarro and M. Raffinot. New techniques for regular expression searching. Algorithmica, 41(2):89 – 116, 2004.
[13] R. Staden. Screening protein and nucleic acid sequences against libraries of patterns. J. DNA Sequencing and Mapping, 1:369 – 374, 1991.
[14] B. W. Watson. A new regular grammar pattern matching algorithm. In Proceedings of the 4th Annual European Symposium on Algorithms, Lecture Notes in Computer Science 1136, pages 364 – 377. Springer-Verlag, 1996.
[15] S. Wu and U. Manber. Fast text searching allowing errors. Comm. of the ACM, 35(10):83 – 91, 1992.
