Information and Media Technologies 3(2): 236-245 (2008) reprinted from: IPSJ Digital Courier 4: 69-78 (2008) © Information Processing Society of Japan
Regular Paper
Merging String Sequences by Longest Common Prefixes
Waihong Ng†1 and Katsuhiko Kakehi†2
We present LCP Merge, a novel algorithm for merging two ordered sequences of strings. By utilizing the longest common prefixes (LCPs) between the strings, LCP Merge substitutes integer comparisons for string comparisons whenever possible, which reduces both the number of character-wise comparisons and the number of key accesses. As one application of LCP Merge, we built a string merge sort by replacing the merging algorithm of recursive merge sort with LCP Merge; we call it LCP Merge sort. When sorting strings, the computational complexity of recursive merge sort tends to be greater than O(n lg n) because string comparisons are generally not constant time and depend on the properties of the strings. LCP Merge sort, however, improves on recursive merge sort to the extent that its computational complexity remains O(n lg n) on average. We performed a number of experiments comparing LCP Merge sort with other string sorting algorithms to evaluate its practical performance, and the experimental results showed that LCP Merge sort is efficient in practice as well.
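The idea summarized above can be illustrated with a minimal Python sketch. It is not the authors' pseudocode, only one plausible way to realize LCP-based merging, under the following assumptions: each input sequence is sorted and is accompanied by a list giving, for every string, the length of its LCP with its predecessor (0 for the first string). The names `lcp_from`, `lcp_merge` and `lcp_merge_sort` are hypothetical. While merging, the sketch keeps the LCP length of each head string with the most recently output string; when the two lengths differ, an integer comparison alone decides the order, and characters are inspected only when they are equal.

```python
def lcp_from(s, t, start):
    """LCP length of s and t, given that their first `start` characters match."""
    i, n = start, min(len(s), len(t))
    while i < n and s[i] == t[i]:
        i += 1
    return i


def lcp_merge(a, lcp_a, b, lcp_b):
    """Merge sorted lists a and b; lcp_x[k] is the LCP length of x[k-1] and x[k].

    Returns the merged list together with its own LCP list.
    """
    out, out_lcp = [], []
    i = j = 0
    ha = hb = 0  # LCP lengths of the heads a[i], b[j] with the last output string
    while i < len(a) and j < len(b):
        if ha == hb:
            # Only here are characters read, and only from position ha onward.
            h = lcp_from(a[i], b[j], ha)
            take_a = h == len(a[i]) or (h < len(b[j]) and a[i][h] < b[j][h])
            emitted = ha
            if take_a:
                hb = h  # b[j] now shares h characters with the new last output
            else:
                ha = h
        else:
            # Integer comparison only: the head sharing the longer prefix
            # with the last output string is the smaller one.
            take_a = ha > hb
            emitted = max(ha, hb)
        if take_a:
            out.append(a[i])
            i += 1
            ha = lcp_a[i] if i < len(a) else 0
        else:
            out.append(b[j])
            j += 1
            hb = lcp_b[j] if j < len(b) else 0
        out_lcp.append(emitted)
    # Copy the remainder of whichever input is not yet exhausted.
    for seq, lcps, k, h in ((a, lcp_a, i, ha), (b, lcp_b, j, hb)):
        if k < len(seq):
            out.extend(seq[k:])
            out_lcp.append(h)
            out_lcp.extend(lcps[k + 1:])
    return out, out_lcp


def lcp_merge_sort(strings):
    """Recursive merge sort whose merge step is lcp_merge."""
    if len(strings) <= 1:
        return list(strings), [0] * len(strings)
    mid = len(strings) // 2
    return lcp_merge(*lcp_merge_sort(strings[:mid]),
                     *lcp_merge_sort(strings[mid:]))
```

Calling `lcp_merge_sort(words)` on a list of strings returns the sorted list together with its LCP list, so the output of one merge can serve directly as the input of the next. Ties are resolved in favor of the first sequence, which keeps the sketch stable.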
†1 Graduate School of Fundamental Science and Engineering, Waseda University
†2 Faculty of Science and Engineering, Waseda University

1. Introduction

Merging is a fundamental computational process which combines two or more ordered sequences of objects into a single ordered sequence of the objects, where each object has a key which governs its ordering. Merging has many applications, for instance, merge sort and external sorting. Classical merging algorithms (e.g., Section 5.2.4 Algorithm M, two-way merge, in Ref. 8)) assume the comparison operation runs in constant time. However, when the keys are strings, which are widely used in practice, the computational complexity of the comparison operation is not constant but depends on the properties of the strings. In this paper, we present Longest Common Prefix Merge, or LCP Merge for short, a novel algorithm for merging two ordered sequences of strings whose keys are the strings themselves. LCP Merge performs fewer key accesses and character-wise comparisons than classical merging algorithms. These reductions are achieved by utilizing the LCPs between the strings. Approaches utilizing LCP information are well known for constructing suffix trees or suffix arrays 3),7),9). But to the best of our knowledge, there is no published LCP-based merging algorithm which solves the problem discussed in this paper. We applied LCP Merge to build a string merge sort (we call it LCP Merge sort) and compared it with recursive merge sort to evaluate its improvements. In addition, LCP Merge sort is compared with four other string sorting algorithms, Multikey Quicksort 1), Burstsort 13), MSD radix sort 10) and CRadix sort 11), in a number of experiments to evaluate its performance.

The original form of LCP Merge first appeared in our previous work 4) and was called Length of Longest Common Prefix Merge (LLCP Merge). We renamed LLCP Merge to LCP Merge because the term LCP is very common and already connotes 'length'.

2. Preliminaries

We define some terminology for the discussions in this paper.

Definition 2.1 An alphabet is a finite totally ordered set of symbols, and the elements of the alphabet are called characters.

For example, {a, b, ..., z} is the alphabet of all lower-case English letters.

Where no ambiguity arises, an alphabet is assumed in our discussion without being mentioned explicitly.

Definition 2.2 A string is a list of characters. Given a string s = s[1]s[2]...s[a], s[i] (1 ≤ i ≤ a) is the i-th character of s. The number of characters in s, denoted |s|, is the length of s. The empty or null string is the string which has zero length and is denoted ε.

Definition 2.3 Concatenation of two strings is defined as the juxtaposition of them.
For example, given that two strings s = s[1]s[2]...s[a] and t = t[1]t[2]...t[b], then
or monotonically increasing if s_i ≤ s_{i+1} for i : 1 ≤ i