Information and Media Technologies 3(2): 236-245 (2008) reprinted from: IPSJ Digital Courier 4: 69-78 (2008) © Information Processing Society of Japan

Regular Paper Merging String Sequences by Longest Common Prefixes

Waihong Ng†1 and Katsuhiko Kakehi†2

We present LCP Merge, a novel algorithm for merging two ordered sequences of strings. By utilizing the longest common prefixes (LCP) between the strings, LCP Merge substitutes integer comparisons for string comparisons whenever possible, which reduces both the number of character-wise comparisons and the number of key accesses. As one application of LCP Merge, we built a string sorting algorithm based on recursive merge sort by replacing its merging algorithm with LCP Merge; we call it LCP Merge sort. When sorting strings, the computational complexity of recursive merge sort tends to be greater than O(n lg n) because string comparisons are generally not constant time and depend on the properties of the strings. LCP Merge sort improves on recursive merge sort to the extent that its computational complexity remains O(n lg n) on average. We performed a number of experiments comparing LCP Merge sort with other string sorting algorithms to evaluate its practical performance, and the results showed that LCP Merge sort is efficient in practice.

†1 Graduate School of Fundamental Science and Engineering, Waseda University
†2 Faculty of Science and Engineering, Waseda University

1. Introduction

Merging is a fundamental computational process which combines two or more ordered sequences of objects into a single ordered sequence of the objects, where each object has a key which governs its ordering. Merging has many applications, for instance merge sort and external sorting. Classical merging algorithms (e.g., Section 5.2.4 Algorithm M, two-way merge, in Ref. 8)) assume the comparison operation runs in constant time. However, when the keys are strings, which are widely used in practice, the computational complexity of the comparison operation is not constant but depends on the properties of the strings. In this paper, we present Longest Common Prefix Merge, or LCP Merge for short, a novel algorithm for merging two ordered sequences of strings in which the keys are the strings themselves. LCP Merge requires fewer key accesses and character-wise comparisons than classical merging algorithms. These reductions are achieved by utilizing the LCPs between the strings. Approaches utilizing LCP information are well known for constructing suffix trees or suffix arrays 3),7),9). But to the best of our knowledge, there is no published LCP based merging algorithm which solves the problem discussed in this paper. We applied LCP Merge to build a string merge sort (we call it LCP Merge sort) and compared it with recursive merge sort to evaluate its improvements. In addition, LCP Merge sort is compared with four other string sorting algorithms, Multikey Quicksort 1), Burstsort 13), MSD radix sort 10) and CRadix sort 11), in a number of experiments to evaluate its performance.

The primal form of LCP Merge first appeared in our previous work 4) and was called Length of Longest Common Prefix Merge (LLCP Merge). We renamed LLCP Merge to LCP Merge, taking into account that the word LCP is very common and that it connotes 'length'.

2. Preliminaries

We define some terminology for the discussions in this paper.

Definition 2.1 An alphabet is a finite totally ordered set of symbols and the elements of the alphabet are called characters.

For example, {a, b, ..., z} is the alphabet of all lower-case English letters. Where no ambiguity arises, an alphabet is assumed in our discussion without being mentioned.

Definition 2.2 A string is a list of characters. Given a string $s = s[1]s[2] \ldots s[a]$, $s[i]$ ($1 \le i \le a$) is the $i$-th character of $s$. The number of characters in $s$, denoted $|s|$, is the length of $s$. The empty or null string is the string which has zero length.

Definition 2.3 Concatenation of two strings is defined as the juxtaposition of them.


For example, given two strings $s = s[1]s[2] \ldots s[a]$ and $t = t[1]t[2] \ldots t[b]$, then [...]

[...] $|p| > |p'|$ for any common prefix $p' \neq p$ of $s$ and $t$, then $p$ is the longest common prefix (LCP) of $s$ and $t$. The length of the longest common prefix (LLCP) of $s$ and $t$ is denoted by $\lambda(s, t)$.

For example, if we let $s$ and $t$ be strings over the alphabet containing English letters and suppose $s$ = AACDCDAB and $t$ = AACDACEF, then the empty string, A, AA, AAC and AACD are the common prefixes, AACD is the LCP of $s$ and $t$, and $\lambda(s, t) = 4$.

Definition 2.6 Let $S$ be a set of strings containing at least two strings. If $s \in S$ is a non-empty string and $l$ is the maximum of $\lambda(s, t)$ for all $t \in S$ with $t \neq s$, then the string

$$p = \begin{cases} s, & l = |s|; \\ s[1] \ldots s[l+1], & l < |s|, \end{cases}$$

is the distinguishing prefix 12) of $s$ in $S$.

In other words, the distinguishing prefix of a string is the string itself if the string is a prefix of some other string in the set; otherwise it is the shortest prefix of the string that is not a prefix of any other string in the set.

For example, if we consider the set $S$ = {AACDCDAB, AACDACEF, AACEBA, AAC}, then the distinguishing prefix of AACDCDAB is AACDC and that of AAC is AAC itself.

Definition 2.7 The ordering relation < on strings is defined as follows ($s = ps'$ and $t = pt'$ are strings and $p$ is the LCP of $s$ and $t$).

$s > t$ if $s'[1] > t'[1]$, or when $s'$ is non-empty and $t'$ is empty ($s$ is greater than $t$) [...]

[...] or monotonically increasing if $s_i \le s_{i+1}$ for $i : 1 \le i$ [...]

    /* The function header is missing from this copy; the parameter
       types follow from the body and the name string_compare is assumed. */
    int string_compare(const char *s1, const char *s2)
    {
        char c1, c2;
        int r;
        do {
            c1 = *s1++;
            c2 = *s2++;
            r = (unsigned char)c1 - (unsigned char)c2;
        } while ((r == 0) && (c1 != '\0'));
        return r;
    }

The above algorithm (program) has the same functionality as the standard C library function strcmp. It may be the simplest one, as it repeatedly compares the strings given in the arguments character by character until a different character is found or a null character is encountered, and there is no way to reduce those character-wise comparisons without prior knowledge of the input strings 10). However, the running time of the algorithm is clearly not constant, since the number of character-wise comparisons required depends on the strings being compared. Further, although this algorithm explicitly gives us the relative order and implicitly the LLCP of the strings (the number of iterations of the while loop minus one), known merging algorithms do not use the LLCP. The question therefore arises whether we can utilize the LLCP information to ease our computational process in some context. In the next section, we describe our merging algorithm, Longest Common Prefix Merge or LCP Merge for short, which utilizes LLCP information between the strings, for merging two [...]
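The observation that a comparison yields the LLCP as a by-product, and the distinguishing prefix of Definition 2.6, can be illustrated with a minimal C sketch. The function names `llcp`, `compare_from` and `distinguishing_prefix_len` are hypothetical and do not come from the paper; indices are 0-based as usual in C, and the set is assumed to hold distinct strings.

```c
#include <stddef.h>
#include <string.h>

/* Length of the longest common prefix, lambda(s, t), of two C strings. */
size_t llcp(const char *s, const char *t)
{
    size_t k = 0;
    while (s[k] != '\0' && s[k] == t[k])
        k++;
    return k;
}

/* Compare s and t starting at offset skip, which the caller guarantees
 * is not larger than lambda(s, t). Returns <0, 0 or >0 like strcmp and
 * stores lambda(s, t) in *lcp_out as a by-product. */
int compare_from(const char *s, const char *t, size_t skip, size_t *lcp_out)
{
    size_t k = skip;
    while (s[k] != '\0' && s[k] == t[k])
        k++;
    *lcp_out = k;
    return (int)(unsigned char)s[k] - (int)(unsigned char)t[k];
}

/* Length of the distinguishing prefix of set[idx] within set[0..n-1]
 * (Definition 2.6): l + 1 if l < |s|, otherwise |s|, where l is the
 * largest LLCP between set[idx] and any other string of the set. */
size_t distinguishing_prefix_len(const char *const set[], size_t n, size_t idx)
{
    size_t l = 0, len = strlen(set[idx]);
    for (size_t i = 0; i < n; i++) {
        if (i == idx)
            continue;
        size_t k = llcp(set[idx], set[i]);
        if (k > l)
            l = k;
    }
    return l < len ? l + 1 : len;
}
```

With the example set {AACDCDAB, AACDACEF, AACEBA, AAC}, the last function gives 5 for AACDCDAB (the prefix AACDC) and 3 for AAC, matching the example of Definition 2.6.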

[...] result, a significant amount of key accesses and character-wise comparisons are saved compared with classical merging algorithms.

3.1.1 Annotated String

For a string $s$ with a reference string $r$, we annotate $\lambda(r, s)$ to $s$ to form a pair $\langle \lambda(r, s), s \rangle$, which we call an annotated string. We interchangeably write $r \cdot s$ to denote $\langle \lambda(r, s), s \rangle$. For a sequence of annotated strings $A = r_1 \cdot s_1, r_2 \cdot s_2, \ldots, r_a \cdot s_a$ and a reference string $r$, if $r_1 = r$ and $r_i = s_{i-1}$ for all $i : 1$ [...]

[...]ing merges $X$ and $Y$ to form a single ascending string sequence $Z = z_1, z_2, \ldots, z_{a+b}$ so that $Z$ is a permutation of $X$ and $Y$. Instead of merging $X$ and $Y$ directly, we investigate merging them in their annotated string sequence forms. In addition, we use the symbol $\leftarrow$ to denote the assignment operation. For example, $u \leftarrow v$ means the value of variable $u$ is replaced by the value of variable $v$.

Let $\alpha_i$ and $\beta_j$ be strings. Set $\alpha_1$ and $\beta_1$ to the empty string, $\alpha_i \leftarrow x_{i-1}$ for $i : 2 \le i \le a$ and $\beta_j \leftarrow y_{j-1}$ for $j : 2 \le j \le b$ so that $X =$ [...]

[...] scanned (when $i > a$ or $j > b$). In step 2, we apply Theorem 3.1 to select the smaller annotated string from $\alpha_i \cdot x_i$ and $\beta_j \cdot y_j$ and obtain $\lambda(x_i, y_j)$ without comparing $x_i$ and $y_j$ if $\lambda(\alpha_i, x_i) \neq \lambda(\beta_j, y_j)$; otherwise a breakdown occurs that forces us to issue a string comparison on $x_i$ and $y_j$ (key accesses occur) to make the selection. The comparison skips the first $\lambda(\alpha_i, x_i)$ character-wise identical characters of $x_i$ and $y_j$ and then compares $x_i$ and $y_j$ character by character, starting at their $(\lambda(\alpha_i, x_i) + 1)$-th characters. It terminates when unmatched characters of $x_i$ and $y_j$ are found or when $x_i$ or (non-exclusive) $y_j$ is exhausted, and $\lambda(x_i, y_j)$ is obtained as a by-product. In step 3, if $\alpha_i \cdot x_i \le \beta_j \cdot y_j$, we set $\gamma_k \cdot z_k \leftarrow \alpha_i \cdot x_i$ and $\beta_j \leftarrow z_k$ to reannotate $y_j$ by $\lambda(x_i, y_j)$, and then increase $i$ by 1. Since $X$ is an annotated string sequence, $\alpha_i = x_{i-1}$, so that after the operations of step 3, $\alpha_i = \beta_j = z_k = x_{i-1}$. In step 4, $k$ is increased by 1. Thus the invariant $\alpha_i = \beta_j = z_{k-1}$ holds again and is ready for the next iteration after the bound checks in step 5. Similar arguments hold for the case $\alpha_i \cdot x_i > \beta_j \cdot y_j$. In step 6, the remaining elements of either $X$ or $Y$ are concatenated to $Z$.

It is clear that $z_k \le z_{k+1}$ for $k : 1 \le k$ [...]

[...] number of comparison operations remains the same as in classical merging. However, the number of character-wise comparisons is reduced, since no character-wise comparison occurs in phase A and fewer character-wise comparisons are required in phase B because the LCP is not compared.

4. Application of LCP Merge to String Sorting

As one application of the merging algorithm, we applied LCP Merge to build a merge sort for sorting strings, LCP Merge sort. LCP Merge sort is basically a recursive merge sort (henceforth, we call recursive merge sort Merge sort) but differs internally in that it recursively applies LCP Merge to perform merging on annotated string sequences. This requires LCP Merge sort to transform the input strings to annotated strings, but this can be achieved effortlessly by annotating a zero to every string obtained at the end of the recursion of dividing the input string sequence. In other words, the strings are transformed to annotated string sequences with the empty string as their external references.

In the following Sections 4.1, 4.2 and 4.3, [...]
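The merging steps and the sort built on them can be sketched in C as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the type `Annotated` and the helpers `lcp_merge` and `lcp_sort_rec` are hypothetical names, and since the statement of Theorem 3.1 is not available here, phase A relies on the standard property that, when two strings are both not smaller than a common reference, the one sharing the longer prefix with the reference is the smaller one and the smaller annotation equals their mutual LLCP. Only the interface `LCPMergesort(char **strs, size_t n)` is taken from the paper.

```c
#include <stddef.h>
#include <stdlib.h>

/* A key plus its LLCP with the reference string, i.e. the string that
 * precedes it in the same sorted run (the empty string for the first). */
typedef struct {
    char  *str;
    size_t lcp;
} Annotated;

/* Merge sorted annotated runs x[0..a-1] and y[0..b-1] into z[0..a+b-1].
 * Invariant: cx.lcp and cy.lcp are both measured against the string most
 * recently written to z (the empty string before the first write). */
static void lcp_merge(const Annotated *x, size_t a,
                      const Annotated *y, size_t b, Annotated *z)
{
    size_t i = 0, j = 0, k = 0;
    Annotated cx = {0}, cy = {0};
    if (a > 0) cx = x[0];
    if (b > 0) cy = y[0];
    while (i < a && j < b) {
        int take_x;
        if (cx.lcp != cy.lcp) {
            /* Phase A: integers only. The string sharing the longer
             * prefix with the last output is the smaller one, and the
             * other annotation is already its LLCP with that string. */
            take_x = cx.lcp > cy.lcp;
        } else {
            /* Phase B (breakdown): access the keys, skipping the
             * cx.lcp characters known to be identical. */
            size_t p = cx.lcp;
            while (cx.str[p] != '\0' && cx.str[p] == cy.str[p]) p++;
            take_x = (unsigned char)cx.str[p] <= (unsigned char)cy.str[p];
            if (take_x) cy.lcp = p; else cx.lcp = p;  /* re-annotate loser */
        }
        if (take_x) { z[k++] = cx; if (++i < a) cx = x[i]; }
        else        { z[k++] = cy; if (++j < b) cy = y[j]; }
    }
    /* Append the rest; the current head keeps its updated annotation. */
    if (i < a) { z[k++] = cx; for (++i; i < a; i++) z[k++] = x[i]; }
    if (j < b) { z[k++] = cy; for (++j; j < b; j++) z[k++] = y[j]; }
}

/* Recursive merge sort on annotated strings; tmp has room for n items. */
static void lcp_sort_rec(Annotated *v, Annotated *tmp, size_t n)
{
    if (n < 2) return;            /* single strings keep annotation 0 */
    size_t h = n / 2;
    lcp_sort_rec(v, tmp, h);
    lcp_sort_rec(v + h, tmp + h, n - h);
    lcp_merge(v, h, v + h, n - h, tmp);
    for (size_t i = 0; i < n; i++) v[i] = tmp[i];
}

/* Same interface as the one declared for the experiments. */
void LCPMergesort(char **strs, size_t n)
{
    if (n < 2) return;
    Annotated *v = malloc(2 * n * sizeof *v);
    if (v == NULL) return;        /* sketch only: no error reporting */
    for (size_t i = 0; i < n; i++) { v[i].str = strs[i]; v[i].lcp = 0; }
    lcp_sort_rec(v, v + n, n);
    for (size_t i = 0; i < n; i++) strs[i] = v[i].str;
    free(v);
}
```

The sketch stores the annotations in a separate array of pairs, which makes the extra space for the LLCPs explicit; a tuned implementation would likely avoid the copy back from the temporary buffer after each merge.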

[...] of the strings being compared. Thus, we focus on the analysis of phase B.

Let $s_i$ ($1 \le i \le n$) be one of the $n$ strings to be sorted, $l_i$ be the LLCP annotated to $s_i$ and $\mu_i$ be the length of the distinguishing prefix of $s_i$. In the sorting process, comparisons of $s_i$ only occur in phase B (i.e., when the comparisons of $\langle l_i, s_i \rangle$ incur breakdowns). Each comparison of $s_i$ starts at $s_i[l_i + 1]$ and may cause an increment in the value of $l_i$ at the end. The value of $l_i$ is 0 initially and may increase or remain unchanged between successive comparisons of $s_i$, but eventually reaches $\mu_i - 1$ because $s_i$ has to be inspected up to $s_i[\mu_i]$ to be distinguished from the other $n - 1$ strings. Suppose at the end of an execution of phase B, $s_i$ is compared up to $s_i[k+1]$ ($0 \le k < \mu_i$) so that the value of $l_i$ is increased to $k$. The next time the comparison of $\langle l_i, s_i \rangle$ incurs a breakdown, $s_i$ is compared starting at $s_i[k+1]$. Thus $s_i[k+1]$ is compared twice. The value of $l_i$ is only changed by comparisons of $s_i$; therefore, if $\omega_i$ is the number of breakdowns encountered in sorting $s_i$, the number of character-wise comparisons required to sort $s_i$ is

$$\mu_i + \omega_i - 1. \tag{1}$$

Hence, the total number of character-wise comparisons required to sort $n$ strings by LCP Merge sort is

$$\sum_{i=1}^{n} (\mu_i + \omega_i - 1) = \sum_{i=1}^{n} \mu_i + \sum_{i=1}^{n} \omega_i - n. \tag{2}$$

Since we have to compare the annotated strings about $n \lg n$ times to sort them, if we let $P_\omega$ be the probability of breakdown and $\mu_a$ be the average length of distinguishing prefixes (ALDP) of the $n$ strings, then by Eq. (2) the number of character-wise comparisons of LCP Merge sort for sorting $n$ strings is about

$$n\mu_a + P_\omega n \lg n - n = n(\mu_a - 1) + P_\omega n \lg n. \tag{3}$$

Adding the number of comparisons of phase A to Eq. (3), the total number of comparisons of LCP Merge sort is almost

$$n \lg n + n(\mu_a - 1) + P_\omega n \lg n. \tag{4}$$

Assuming the input has alphabet size $m$ and is uniformly random, about $\log_m n = \lg n / \lg m$ characters are required to differentiate the strings from each other (Section 6.3 in Ref. 8)). Hence, $\mu_a = \lg n / \lg m$. Thus, Eq. (4) can be rewritten as

$$n \lg n \left( 1 + P_\omega + \frac{1}{\lg m} \right) - n. \tag{5}$$

Therefore, provided $m > 1$, which is very common in practice, the computational complexity of LCP Merge sort on average is O(n lg n).

4.2 Computational Complexity of Merge Sort

We analyze the average computational complexity of Merge sort in this section.

Clearly the total number of comparisons is equal to the number of character-wise comparisons in the case of Merge sort. Let the input to Merge sort contain $n$ strings and have an alphabet size of $m$. Assuming both $n$ and $m$ are powers of 2 for simplicity, when sorting the $n$ strings, the recursion of Merge sort starts at recursion level 1 and ends at recursion level $\lg n$. At recursion level $\lg n$, no merging occurs because the inputs at that recursion level contain only one string each. Thus, if $\mu_d$ is the ALDP of the inputs at recursion level $d$ ($1 \le d \le \lg n - 1$), the total number of character-wise comparisons of Merge sort is

$$C_M = n(\mu_1 + \mu_2 + \cdots + \mu_{\lg n - 1}). \tag{6}$$

Assuming the inputs at recursion level $d$ contain $n_d$ strings each and are uniformly random, we can express $\mu_d$ as $\lg n_d / \lg m$ and rewrite Eq. (6) to get

$$C_M = n \left( \frac{\lg n_1}{\lg m} + \frac{\lg n_2}{\lg m} + \cdots + \frac{\lg n_{\lg n - 1}}{\lg m} \right). \tag{7}$$

However, $\mu_d$ has to be at least one character by Definition 2.6, but the value of $\lg n_d / \lg m$ is smaller than 1 when $m > n_d > 1$, so we replace the value of $\lg n_d / \lg m$ with 1 in those cases. Then substituting $n_d = n / 2^d$ into Eq. (7) yields

$$C_M = n \left( \frac{\lg n}{\lg m} + \frac{\lg \frac{n}{2}}{\lg m} + \cdots + \frac{\lg m}{\lg m} + 1 + \cdots + 1 \right). \tag{8}$$

Replacing $\lg n / \lg m$ with $\mu_a$, and because there are $\lg m - 1$ levels where $m > n_d > 1$, we have

$$\begin{aligned}
C_M &= n \left( \mu_a + \left( \mu_a - \frac{1}{\lg m} \right) + \left( \mu_a - \frac{2}{\lg m} \right) + \cdots + \left( \mu_a - \frac{\lg n - \lg m}{\lg m} \right) + \lg m - 1 \right) \\
    &= n \left( \mu_a (\lg n - \lg m + 1) - \left( \frac{1}{\lg m} + \frac{2}{\lg m} + \cdots + \frac{\lg n - \lg m}{\lg m} \right) + \lg m - 1 \right) \\
    &= n \lg n \left( \frac{\mu_a^2 + 1}{2\mu_a} + \frac{\mu_a - 1}{2\mu_a \lg m} \right).
\end{aligned} \tag{9}$$

From Eq. (9), we observe that the number of character-wise comparisons of Merge sort is proportional to the ALDP of the input, and we also notice that Merge sort performs worse with a small alphabet size. In fact, to ensure theoretically that LCP Merge sort performs better than Merge sort, the input should have an ALDP of about 4 characters if we anticipate $P_\omega = 1$, i.e., the worst case of LCP Merge sort.

4.3 Computational Complexity of CRadix Sort

In this section, the computational complexity of CRadix sort in the average case is discussed. We chose CRadix sort simply because it is the fastest among the other sorting algorithms in the experiments conducted in Section 5 and is theoretically efficient, since it has the same computational complexity as MSD radix sort.

CRadix sort does not compare but groups the strings to sort the input, so we roughly compare LCP Merge sort and CRadix sort by comparing the total number of comparisons of LCP Merge sort and the number of characters inspected by CRadix sort. In theory, CRadix sort scans the strings two times (once for calculating the number of strings in each group and once for calculating the starting memory address of each group) to group them in each phase. Assuming the input has $n$ strings and their ALDP is $\mu_a$, the average number of character-wise comparisons (average number of characters inspected) is

$$C_R = 2n\mu_a. \tag{10}$$

Thus, by Eq. (4), the difference in the total number of comparisons between LCP Merge sort and CRadix sort is

$$|n(\mu_a - (P_\omega + 1) \lg n + 1)|. \tag{11}$$

Hence, if $(\mu_a + 1) > (P_\omega + 1) \lg n$, LCP Merge sort has a smaller total number of comparisons. Also, by Eq. (11), LCP Merge sort can be said to be more suitable than CRadix sort when the input has a long ALDP.

5. Experiments

We discuss in this section the results of our experiments on the real-world performance of LCP Merge sort and other string sorting algorithms. The interface of LCP Merge sort is declared as

    void LCPMergesort(char **strs, size_t n)

so that, for fairness, it has the same interface as the other sorting algorithms compared in the experiments performed in this paper. Thus, LCP Merge sort takes an array of $n$ string pointers, sorts the strings and then returns the pointers to the sorted strings through the same array.

5.1 Test Bed and Test Data

Our experiments were conducted in the following environment.

    Model:            IBM PC Compatible
    CPU:              AMD Athlon XP 2500+
    Main memory:      1.5 GB
    1st level cache:  128 kb
    2nd level cache:  512 kb
    OS:               Windows XP Pro SP2
    Compiler:         Visual C++ 6.0 sp6

All sorting programs were compiled from the sources obtained from their authors with the 'maximize speed' option, and we used the four sets of test data listed below in our experiments.

• Dataset 1 (Random String): Strings randomly generated from characters drawn from the uniform distribution in the range of ASCII code 33 to 126. The length of the strings is random and varies uniformly from 0 (empty string) to 19 characters.
• Dataset 2 (URL): Web page addresses (URL) extracted in order of occurrence from the documents of the large web track in the TREC project 5),6). Each 'http://' at the start of the URLs was stripped. The average length of the URLs is 32 characters and there are large numbers of duplicates.
• Dataset 3 (Web Page Word): Distinct alphabetic strings separated by non-alphabetic characters, extracted in order of first occurrence from web pages excluding tags, images, and other non-textual information. The web pages are from the same source as dataset 2.
• Dataset 4 (Genome): DNA sequence fragments. Each fragment is 9 characters in length. There are lots of duplicates.
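As an illustration of how strings with the stated properties of dataset 1 could be produced, here is a minimal C sketch. It is not the generator used for the actual test data, which was taken from Ref. 13); the function name `random_string` and the constants are hypothetical, and `rand() % k` is only approximately uniform.

```c
#include <stdlib.h>

enum { MIN_CODE = 33, MAX_CODE = 126, MAX_LEN = 19 };

/* Fill buf (at least MAX_LEN + 1 bytes) with one null-terminated random
 * string: length drawn from 0..MAX_LEN, characters drawn from ASCII
 * codes MIN_CODE..MAX_CODE. Returns the length without the terminator. */
size_t random_string(char *buf)
{
    size_t len = (size_t)(rand() % (MAX_LEN + 1));
    for (size_t i = 0; i < len; i++)
        buf[i] = (char)(MIN_CODE + rand() % (MAX_CODE - MIN_CODE + 1));
    buf[len] = '\0';
    return len;
}
```

Calling `srand` with a fixed seed before generating makes a run reproducible, which is convenient when comparing several sorting programs on the same input.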


Table 1 Average length of the distinguishing prefixes (ALDP) of the datasets.

    Dataset      ALDP
    dataset 1    3.911
    dataset 2    31.89
    dataset 3    9.279
    dataset 4    10.00

The test datasets are taken from Ref. 13) because they represent real world applications well. The strings in the datasets are null terminated strings, thus the actual length of the strings is 1 character longer than stated. For example, the length of the strings in dataset 4 is actually 10 characters if the null character is considered, while an empty string has a length of 1 character. In addition, there were only 10 M strings in dataset 2 originally, but we tripled it by concatenating the original data with itself to get 30 M strings in order to be consistent in size with the other datasets. This is not as harmful as it sounds because there were already a large number of duplicates inside the original dataset 2. Table 1 shows the ALDP of the datasets.

5.2 Comparing LCP Merge sort with Merge Sort

We compared LCP Merge sort and Merge sort on the four datasets in four categories: number of character-wise comparisons, number of key accesses, total number of comparisons (for LCP Merge sort, the sum of the number of LLCP comparisons in phase A and the number of character-wise comparisons in phase B; for Merge sort, the number of character-wise comparisons) and running times.

Figure 1 shows the numbers of character-wise comparisons of LCP Merge sort and Merge sort on the four datasets. We observed that even though the ALDPs of the datasets vary within a factor of about 8.15, the numbers of character-wise comparisons of LCP Merge sort on datasets 1, 3 and 4 are almost the same (their lines overlap in Fig. 1) and that on dataset 2 is just about 2 times the others, while the numbers of character-wise comparisons of Merge sort vary by a factor of about 7.23 among the datasets. The results agree with Eq. (3) and demonstrate that LCP Merge sort is not sensitive to the ALDP of its input.

Fig. 1 Number of character-wise comparisons of LCP Merge sort (marked with filled circles) and of Merge sort (marked with crosses) on four datasets: random string (dashed line), URL (solid line), web page word (dotted line) and genome (dash-dotted line). The numbers of character-wise comparisons of LCP Merge sort on random string, web page word and genome are so close that their lines overlap.

In the experiments on the total number of comparisons (Fig. 2), LCP Merge sort was slightly better than Merge sort on dataset 1 but had greater improvements on the other datasets. These experimental results are consistent with our theoretical results discussed in Section 4.1 and Section 4.2. For instance, in the case of dataset 1 (when $n$ = 30 M), the number of comparison operations of LCP Merge sort estimated by Eq. (4) is almost $1.893\, n \lg n$ ($P_\omega \approx 0.776$) and that of Merge sort estimated by Eq. (9) is about $2.140\, n \lg n$ ($m = 95$). These two estimations agree fairly well with our experimental results shown in Fig. 2 and indicate that LCP Merge sort makes only a small (about 13%) improvement over Merge sort in the number of character-wise comparisons on dataset 1.

Fig. 2 Total number of comparisons of LCP Merge sort (marked with filled circles) and of Merge sort (marked with crosses) on four datasets: random string (dashed line), URL (solid line), web page word (dotted line) and genome (dash-dotted line). The total numbers of comparisons of LCP Merge sort on random string, web page word and genome are so close that their lines overlap.
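The two estimates quoted for dataset 1 can be reproduced by dividing Eq. (4) and Eq. (9) by $n \lg n$. The worked figures below are a sketch that assumes $\mu_a$ is the measured ALDP of dataset 1 from Table 1 (3.911), which the text does not state explicitly; $n = 3 \times 10^7$ gives $\lg n \approx 24.84$ and $m = 95$ gives $\lg m \approx 6.570$.

```latex
% Eq. (4) divided by n lg n, with P_omega = 0.776 and mu_a = 3.911
\[
\frac{n \lg n + n(\mu_a - 1) + P_\omega n \lg n}{n \lg n}
  = 1 + P_\omega + \frac{\mu_a - 1}{\lg n}
  \approx 1 + 0.776 + \frac{2.911}{24.84}
  \approx 1.893
\]
% Eq. (9) divided by n lg n, with mu_a = 3.911 and m = 95
\[
\frac{\mu_a^2 + 1}{2\mu_a} + \frac{\mu_a - 1}{2\mu_a \lg m}
  \approx \frac{16.296}{7.822} + \frac{2.911}{7.822 \times 6.570}
  \approx 2.083 + 0.057
  = 2.140
\]
```

The ratio of the two coefficients, $2.140 / 1.893 \approx 1.13$, corresponds to the improvement of about 13% mentioned for dataset 1.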


Fig. 3 Running time of LCP Merge sort (marked with filled circles) and running time of Merge sort (marked with crosses) on four datasets: random string (dashed line), URL (solid line), web page word (dotted line) and genome (dash-dotted line).

Fig. 4 Number of key accesses of LCP Merge sort (marked with filled circles) and number of key accesses of Merge sort (marked with crosses) on four datasets: random string (dashed line), URL (solid line), web page word (dotted line) and genome (dash-dotted line).

From Fig. 3, we notice that the running times of LCP Merge sort are shorter than those of Merge sort on all four datasets, but not to the degree of improvement that the total number of comparisons implies, even though the total number of comparisons is a common measure of performance for sorting algorithms. For instance, LCP Merge sort is only about 2 times faster than Merge sort on dataset 2 although the total number of comparisons of LCP Merge sort is roughly 1/6 that of Merge sort, and similar behavior is observed in the experiments on the other datasets as well. A close look at the running times on dataset 3 and dataset 4 reveals that even though the total numbers of comparisons on both datasets are nearly the same, LCP Merge sort ran about 13% faster (when $n$ = 30 M) on dataset 4 than on dataset 3. Moreover, we also observed that the number of key accesses (Fig. 4) of LCP Merge sort on dataset 4 is smaller than that on dataset 3. These observations suggest that the number of key accesses has a larger impact than the total number of comparisons on the running time of LCP Merge sort. The main reasons are considered to be that when the input is large, key accesses may frequently incur cache misses, which have a great impact on real-world performance as we reported in Ref. 11), and that the cost of LLCP and character-wise comparisons is presumably lower than the cost of cache miss penalties.

5.3 Comparing LCP Merge sort with Other Sorts

In this section, we compare the running times of LCP Merge sort and four other fast string sorting algorithms: Multikey Quicksort 1), Burstsort 13), MSD radix sort 10) and CRadix sort 11). Multikey Quicksort is a well known string sorting algorithm and is sometimes deployed as a minor sort of the main sorting algorithms for sorting some kinds of input 13). Burstsort is a cache efficient string sorting algorithm based on the burst trie which is reported to be generally two times faster than Multikey Quicksort and has a computational complexity of O(n). MSD radix sort is a fast radix sort for sorting strings and CRadix sort is a cache efficient variant of MSD radix sort. There are several versions of Burstsort and we chose the generally faster burstsortA for our experiments. In addition, CRadix sort needs a runtime parameter (key buffer size) to be tuned for the input data to reach maximum performance, and we adjusted the parameter according to Ref. 11) to give the best performance on each dataset.

Figures 5a to 5d show the comparisons in running time of LCP Merge sort with CRadix sort, Burstsort, MSD radix sort and Multikey Quicksort on the four datasets respectively. It can be observed that LCP Merge sort ran faster than Multikey Quicksort on all datasets (Fig. 5d). We also notice that LCP Merge sort is the winner on dataset 2. However, LCP Merge sort did not in fact run faster on dataset 2; rather, the other sorts ran much slower, as the figures show that the running times of the other sorts fluctuated by large factors among the four datasets. For example, when $n$ = 30 M, the running times of LCP Merge sort on the four datasets are within a factor of not more

than 1.315 but those of CRadix sort, Burstsort, MSD radix sort and Multikey Quicksort varied by factors of about 6.326, 5.622, 8.390 and 3.093 respectively. This agrees with the theoretical analysis in Section 4.1 and is also consistent with the experimental observations in Section 5.2, which imply that LCP Merge sort is not sensitive to the ALDP of its input. Moreover, we observed that LCP Merge sort and CRadix sort are the two sorts closest to linear in performance. It is surprising that Burstsort and MSD radix sort, which are supposed to be O(n) algorithms, apparently behaved non-linearly, and in particular that MSD radix sort could not match CRadix sort in terms of linearity even though CRadix sort is merely a cache efficient version of MSD radix sort. This is again consistent with our hypothesis that cache misses immensely affect real-world performance.

Fig. 5a Running time of LCP Merge sort (marked with filled circles) and running time of CRadix sort (marked with filled triangles) on four datasets: random string (dashed line), URL (solid line), web page word (dotted line) and genome (dash-dotted line).

Fig. 5b Running time of LCP Merge sort (marked with filled circles) and running time of Burstsort (marked with filled squares) on four datasets: random string (dashed line), URL (solid line), web page word (dotted line) and genome (dash-dotted line).

Fig. 5c Running time of LCP Merge sort (marked with filled circles) and running time of MSD radix sort (marked with filled diamonds) on four datasets: random string (dashed line), URL (solid line), web page word (dotted line) and genome (dash-dotted line).

Fig. 5d Running time of LCP Merge sort (marked with filled circles) and running time of Multikey Quicksort (marked with pluses) on four datasets: random string (dashed line), URL (solid line), web page word (dotted line) and genome (dash-dotted line).

6. Conclusions

We introduced the concept of the annotated string and built LCP Merge based on it. LCP Merge requires considerably fewer key accesses and character-wise comparisons than merging the string sequences by classical merging algorithms. However, extra space is required to store the LLCPs. For an environment that is not memory-rich, one can apply the concept of the annotated string to a less memory intensive merging algorithm such as Ref. 2), although the trade-off is execution speed.

We built LCP Merge sort by using LCP Merge and demonstrated its effectiveness by conducting a number of experiments. The experimental results are consistent with the theoretical analyses and showed that LCP Merge sort is practical and robust on various kinds of test data. In addition, it is observed that the number of key accesses immensely affects the real-world performance of LCP Merge sort.

Moreover, LCP Merge sort is expected to be effective in suffix sorting since many typical texts have long average LLCPs as reported in Ref. 9). Application of LCP Merge sort to multifield sorting is considered to be advantageous as well. We will investigate such new applications of LCP Merge and deepen our theoretical analysis of the behavior of LCP Merge sort.

References

1) Bentley, J.L. and Sedgewick, R.: Fast Algorithms for Sorting and Searching Strings, Proc. 8th Annual ACM-SIAM Symp. Discrete Algorithms, pp.360–369 (1997).
2) Dvorak, S. and Durian, B.: Stable Linear Time Sublinear Space Merging, Computer Journal, Vol.30, No.4, pp.372–375 (1987).
3) Farach, M.: Optimal Suffix Tree Construction with Large Alphabets, Proc. 38th Symp. on Foundations of Comp. Sci. '97, pp.137–143 (1997).
4) Futamura, Y., Futamura, N. and Ng, W.H.: Leaves Optimal Adaptive Sort and LLCP Merge, JSSST Conference, Vol.1D-2 (2004).
5) Harman, D.: Overview of the Second Text Retrieval Conference (TREC-2), Inform. Process. Mgmt., Vol.31, No.3, pp.271–289 (1995).
6) Hawking, D., Craswell, N., Thistlewaite, P. and Harman, D.: Results and Challenges in Web Search Evaluation, Proc. 8th Intl. Conf. WWW, Toronto, Canada, pp.1321–1330 (1999).
7) Kasai, T., Lee, G., Arimura, H., Arikawa, S. and Park, K.: Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications, Proc. 12th Symp. on CPM, LNCS 2089, pp.181–192 (2001).
8) Knuth, D.E.: The Art of Computer Programming, Vol.3: Sorting and Searching, Addison Wesley, 2nd edition (1998).
9) Manzini, G. and Ferragina, P.: Engineering a Lightweight Suffix Array Construction Algorithm, Proc. 10th Euro. Symp. on Algorithms, LNCS 2461, pp.698–710 (2002).
10) McIlroy, P.M., Bostic, K. and McIlroy, M.D.: Engineering Radix Sort, Comput. Syst., Vol.6, No.1, pp.5–27 (1993).
11) Ng, W.H. and Kakehi, K.: Cache Efficient Radix Sort for String Sorting, IEICE Trans. Fundamentals of Electronics, Communications and Computer Sciences, Vol.E90-A, No.2, pp.457–646 (2007).
12) Nilsson, S.: Radix Sorting and Searching, Ph.D. Thesis, Dept. Comput. Sci., Lund University, Lund, Sweden (1996).
13) Sinha, R. and Zobel, J.: Efficient Trie-based Sorting of Large Sets of Strings, 26th Australasian Comput. Sci. Conf. (ACSC), pp.11–18 (2003).

(Received July 3, 2007)
(Accepted November 6, 2007)
(Released February 6, 2008)

Waihong Ng is a doctoral student of the Department of Computer Science, Waseda University and obtained his B.Eng. and MICSc degrees both from Waseda University in 1998 and 2000 respectively. He has been researching sorting algorithms since he was an undergraduate student. Student member of IEICE, IPSJ and JSSST. Research interest: algorithms and data structures, algorithm engineering and system evaluation.

Katsuhiko Kakehi has been a Professor in the Department of Computer Science, Waseda University since 1991. 1968 Bs.Eng. the University of Tokyo, 1970 Ms.Eng. the University of Tokyo in Applied Mathematics. Assistant Prof. (1974), then Associate Prof. (1976) of Rikkyo University Math. Dept, Prof. of Waseda University Math. Dept (1986). IPSJ fellow, member of JSSST, ACM and MSJ. Research area: programming languages, formalization and implementation.
