Merging String Sequences by Longest Common Prefixes
Information and Media Technologies 3(2): 236-245 (2008) reprinted from: IPSJ Digital Courier 4: 69-78 (2008) © Information Processing Society of Japan

Regular Paper

Waihong Ng †1 and Katsuhiko Kakehi †2

†1 Graduate School of Fundamental Science and Engineering, Waseda University
†2 Faculty of Science and Engineering, Waseda University

We present LCP Merge, a novel merging algorithm for merging two ordered sequences of strings. LCP Merge substitutes string comparisons with integer comparisons whenever possible to reduce the number of character-wise comparisons as well as the number of key accesses by utilizing the longest common prefixes (LCP) between the strings. As one of the applications of LCP Merge, we built a string merge sort based on recursive merge sort by replacing the merging algorithm with LCP Merge, and we call it LCP Merge sort. When sorting strings, the computational complexity of recursive merge sort tends to be greater than O(n lg n) because string comparisons are generally not constant time and depend on the properties of the strings. However, LCP Merge sort improves recursive merge sort to the extent that its computational complexity remains O(n lg n) on average. We performed a number of experiments to compare LCP Merge sort with other string sorting algorithms to evaluate its practical performance, and the experimental results showed that LCP Merge sort is efficient even in real-world settings.

1. Introduction

Merging is a fundamental computational process which combines two or more ordered sequences of objects into a single ordered sequence of the objects, where each object has a key which governs the ordering of the object. Merging has many applications, for instance, merge sort and external sorting. Classical merging algorithms (e.g., Section 5.2.4 Algorithm M, two-way merge, in Ref. 8)) assume the comparison operation runs in constant time. However, when the keys are strings, which are widely used in practice, the computational complexity of the comparison operation is not constant but depends on the properties of the strings. In this paper, we present Longest Common Prefix Merge, or LCP Merge for short, a novel merging algorithm for merging two ordered sequences of strings in which the keys are the strings themselves. LCP Merge requires fewer key accesses and character-wise comparisons than classical merging algorithms. The reductions in the number of key accesses and the number of character-wise comparisons are achieved by utilizing the LCPs between the strings. Approaches utilizing LCP information are well-known for constructing suffix trees or suffix arrays 3),7),9). But to the best of our knowledge, there is no published LCP based merging algorithm which solves the problem discussed in this paper. We applied LCP Merge to build a string merge sort (we call it LCP Merge sort) and compared it with recursive merge sort to evaluate its improvements. In addition, LCP Merge sort is compared with four other string sorting algorithms: Multikey Quicksort 1), Burstsort 13), MSD radix sort 10) and CRadix sort 11), in a number of experiments to evaluate its performance.

The primal form of LCP Merge first appeared in our previous work 4) and was called Length of Longest Common Prefix Merge (LLCP Merge). But we renamed LLCP Merge to LCP Merge, taking into account that the word LCP is very common and it connotes 'length'.

2. Preliminaries

We define some terminology for the discussions in this paper.

Definition 2.1 An alphabet is a finite totally ordered set of symbols, and the elements of the alphabet are called characters.

For example, {a, b, ..., z} is the alphabet of all lower-case English letters.

Without ambiguity, an alphabet is assumed where appropriate in our discussion without being mentioned.

Definition 2.2 A string is a list of characters. Given a string s = s[1]s[2]...s[a], s[i] (1 ≤ i ≤ a) is the i-th character of s. The number of characters in s, denoted |s|, is the length of s. The empty or null string is the string which has zero length and is denoted ε.

Definition 2.3 Concatenation of two strings is defined as the juxtaposition of them.

For example, given two strings s = s[1]s[2]...s[a] and t = t[1]t[2]...t[b], the concatenation of s and t is st = s[1]s[2]...s[a]t[1]t[2]...t[b].

Definition 2.4 A string p is a prefix of string s if s = ps′ for some string s′.

Note that any string is a prefix of the string itself and ε is a prefix of any string.

Definition 2.5 A string p is a common prefix of the strings s and t iff s = ps′ and t = pt′ for some strings s′ and t′. When p is the longest one among the common prefixes of s and t, i.e., |p| > |p′| for any common prefix p′ ≠ p of s and t, then p is the longest common prefix (LCP) of s and t. The length of the longest common prefix (LLCP) of s and t is denoted by the notation λ(s, t).

For example, if we let s and t be strings over the alphabet containing English letters and suppose s = AACDCDAB and t = AACDACEF, then the strings ε, A, AA, AAC and AACD are the common prefixes, AACD is the LCP of s and t, and λ(s, t) = 4.

Definition 2.6 Let S be a set of strings containing at least two strings. If s ∈ S with s ≠ ε, and l is the maximum of λ(s, t) for all t ∈ S with t ≠ s, then the string

    p = s,                  if l = |s|;
    p = s[1]...s[l+1],      if l < |s|,

is the distinguishing prefix 12) of s in S.

In other words, the distinguishing prefix of a string is the string itself if the string is a prefix of some other string in the set; otherwise it is the shortest prefix of the string that is not a prefix of other strings in the set.

For example, if we consider the set S = {AACDCDAB, AACDACEF, AACEBA, AAC}, then the distinguishing prefix of AACDCDAB is AACDC and that of AAC is AAC itself.

Definition 2.7 The ordering relation < on strings is defined as follows (s = ps′ and t = pt′ are strings and p is the LCP of s and t).

    s > t if s′[1] > t′[1], or when s′ ≠ t′ ∧ t′ = ε   (s is greater than t)
    s < t if s′[1] < t′[1], or when s′ ≠ t′ ∧ s′ = ε   (s is smaller than t)
    s = t if s′ = t′ = ε                               (s is equal to t)

Definition 2.8 A string sequence, denoted by s_1, s_2, ..., s_n, is a list of a finite number of strings where s_i (1 ≤ i ≤ n) are strings. A string sequence is said to be ascending or monotonically increasing if s_i ≤ s_{i+1} for i: 1 ≤ i < n.

3. LCP Merging Algorithm

The classical merging algorithm (Section 5.2.4 Algorithm M (two-way merge) in Ref. 8)) is based on a comparison operation which is assumed to run in constant time. When the keys are strings, a general algorithm for the comparison operation will be similar to

    int strcmp(const char *s1, const char *s2)
    {
        char c1, c2;
        int r;
        do {
            c1 = *s1++; c2 = *s2++;
            r = (unsigned char)c1 - (unsigned char)c2;
        } while ((r == 0) && (c1 != '\0'));
        return r;
    }

The above algorithm (program) has the same functionality as the standard C library function strcmp. It may be the simplest one, as it repeatedly compares the strings given in the arguments character by character until a different one is found or a null character is encountered, and there is no way to reduce those character-wise comparisons without prior knowledge of the input strings 10). However, it is obvious that the running time of the algorithm is not constant, since the number of character-wise comparisons required depends on the strings being compared. Further, although this algorithm explicitly gives us the relative order and implicitly the LLCP of the strings (the number of iterations of the while loop minus one), known merging algorithms do not use their LLCP. Therefore, there arises the question whether we can utilize the LLCP information to ease our computational process in some context. In the next section, we describe our merging algorithm, Longest Common Prefix Merge or LCP Merge for short, which utilizes LLCP information between the strings, for merging two ordered string sequences.

3.1 LCP Merge

LCP Merge is built on the concept of annotated strings. It utilizes the LCP information between the strings to substitute string comparisons with integer comparisons whenever applicable, which in turn reduces the running time of the comparison operation in such cases. As a result, a significant number of key accesses and character-wise comparisons are saved when compared with those of classical merging algorithms.

3.1.1 Annotated String

[...]ing merges X and Y to form a single ascending string sequence Z = z_1, z_2, ..., z_{a+b} so that Z is a permutation of X and Y. Instead of merging X and Y directly, we investigate merging them in their annotated string sequence forms.
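
The observation in Section 3 that a character-wise comparison implicitly yields the LLCP can be made explicit. The following is a minimal sketch in C, not the authors' implementation: the helper name compare_with_llcp and its output parameter llcp are assumptions introduced here for illustration. It returns the relative order as strcmp does and additionally stores λ(s1, s2).

```c
#include <stddef.h>

/* Hypothetical helper (not from the paper or the C library):
 * compares s1 and s2 like strcmp and also stores the length of
 * their longest common prefix, lambda(s1, s2), in *llcp. */
int compare_with_llcp(const char *s1, const char *s2, size_t *llcp)
{
    size_t i = 0;

    /* Advance while the characters match and s1 has not ended.
     * If s2 ends first, its '\0' differs from s1[i], so the loop stops. */
    while (s1[i] != '\0' && s1[i] == s2[i]) {
        i++;
    }

    *llcp = i; /* number of leading characters shared by s1 and s2 */

    /* Negative, zero, or positive, with the same meaning as strcmp. */
    return (unsigned char)s1[i] - (unsigned char)s2[i];
}
```

For s = AACDCDAB and t = AACDACEF from the example in Section 2, this helper returns a positive value (s > t) and stores 4 in *llcp, which agrees with λ(s, t) = 4.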