
An Adaptive Algorithm for Splitting Large Sets of Strings and Its Application to Efficient External Sorting

Tatsuya Asai† Seishi Okamoto† Hiroki Arimura‡

† Knowledge Research Center, Fujitsu Laboratories Ltd. 4–1–1 Kamikodanaka, Nakahara-ku, Kawasaki 211–8588, Japan Email: [email protected] Email: [email protected]

‡ Graduate School of Information Science and Technology, Hokkaido University Kita 14–jo, Nishi 9–chome, Sapporo 060–0814, Japan Email: [email protected]

Abstract

In this paper, we study the problem of sorting a large collection of strings in external memory. Based on the adaptive construction of a summary data structure, called an adaptive synopsis trie, we present a practical string sorting algorithm DistStrSort, which is suitable for sorting string collections of large size in external memory, and also for complex string processing problems in text and semi-structured databases such as counting, aggregation, and statistics. Case analyses of the algorithm and experiments on real datasets show the efficiency of our algorithm in realistic settings.

Keywords: external string sorting, trie, semi-structured database, text database, data splitting

[Figure 1: The architecture of the external sorting algorithm DistStrSort using adaptive construction of a synopsis trie. STEP 1 (BuildDietTrie) builds the synopsis trie Tα,S in internal memory under the growth parameter α; STEP 2 (SplitDietTrie) splits the input string list into buckets in external memory; STEP 3 (SortBucket) sorts each bucket; STEP 4 (Merge) concatenates the sorted buckets into the output list.]

1 Introduction

Sorting strings is a fundamental task of string processing: it is not only the arrangement of objects in an associated ordering, but also the basis of more sophisticated processing such as grouping, counting, and aggregation. With the recent emergence of massive amounts of semi-structured data (Abiteboul et al. 2000), such as plain texts, Web pages, or XML documents, it is often the case that we would like to sort a large collection of strings that does not fit into main memory. Moreover, there are potential demands for flexible methods tailored to semi-structured data beyond basic sorting facilities, such as statistics computation, counting, and aggregation over various objects including texts, data, codes, and streams (Abiteboul et al. 2005, Stonebraker et al. 2005).

In this paper, we study efficient methods for sorting strings in external memory. We present a new approach to sorting a large collection of strings in external memory. The main idea of our method is to generalize distribution sort algorithms to collections of strings by the extensive use of the trie data structure. To implement this idea, we introduce a synopsis data structure for strings, called a synopsis trie, for approximating the string distribution, and develop an adaptive construction algorithm for synopsis tries for a large collection of strings based on the idea of adaptive growth. Based on this technique, we present an external string sorting algorithm DistStrSort. In Fig. 1, we show the architecture of our external sorting. The algorithm first adaptively constructs a small synopsis trie as a model of the string distribution by a single scan of the input data, splits the input strings into a collection of buckets using the synopsis trie as a splitter, then sorts the strings in each bucket in main memory, and finally merges the sorted buckets into an output collection.

By an analysis, we show that if both the size of the synopsis trie and the maximum size of the buckets are small enough to fit into main memory of size M, then our algorithm DistStrSort correctly solves the external string sorting problem in O(N) time within main memory M, using 3N sequential reads, N random writes, and N sequential writes in the external memory, where N is the total size of the input string list S. Thus, this algorithm of O(N) time complexity is attractive compared with a popular merge-sort-based external string sorting algorithm of O(N log N) time complexity. Experiments on real datasets show the effectiveness of our algorithm.

As a further advantage, this algorithm is suitable for implementing complex data manipulation operations (Ramakrishnan and Gehrke 2000) such as group-by aggregation, statistics calculation, and functional operations for texts.

1.1 Related works

In what follows, we denote by K = |S| and N = ||S|| the cardinality and the total size of an input string list S ∈ (Σ*)*.

A straightforward extension of the quick sort algorithm solves the string sorting problem in O(LK log K) time in main memory, where L is the maximum length of the input strings and N = O(LK). This algorithm is efficient if L = O(1), but inefficient if L is large.

Arge et al. presented an algorithm that solves the string sorting problem in O(K log K + N) time in main memory (Arge et al. 1997). The String B-tree is an efficient dynamic indexing data structure for strings (Ferragina and Grossi 1999). This data structure also has almost optimal performance for external string sorting in theory, while its performance degenerates in practice because of many random accesses to external memory.

Sinha and Zobel presented a trie-based string sorting algorithm, called the burst sort, and showed that it outperforms many previous string sorting algorithms on real datasets (Sinha and Zobel 2003). However, their algorithm is an internal sorting algorithm, whereas the main aim of this paper is the proposal of an adaptive strategy for realizing efficient external string sorting.

Although there are a number of studies on sorting the suffixes of a single input string (Bentley and Sedgewick 1997, Manber and Myers 1993, Manzini and Ferragina 2002, Sadakane 1999), they are rather irrelevant to this work, since they are mainly in-core algorithms and utilize the repetition structure of the suffixes of a string.

The adaptive construction of a trie for a stream of letters has been studied since the 90's in time-series modeling (Laird and Saul 1994), data compression (Moffat 1990), and bioinformatics (Ron, Singer and Tishby 1996). However, the application of adaptive trie construction to external sorting seems new, and an interesting direction in text and semi-structured data processing. Recently, Stonebraker et al. (2005) showed the usefulness of distributed stream processing as a massive data application.

1.2 Organization

This paper is organized as follows. In Section 2, we prepare basic notions and definitions on string sorting and related string processing problems. In Section 3, we present our external string sorting algorithm based on adaptive construction of a synopsis trie. In Section 4, we report experiments on real datasets, and finally, we conclude in Section 5.

2 Preliminaries

In this section, we give basic notions and definitions on external string sorting.

2.1 Strings

Let Σ = {A, B, A1, A2, ...} be an alphabet of letters, with which a total order ≤Σ on Σ is associated. In particular, we assume that our alphabet is a set of integers in a small range, i.e., Σ = {0,...,C−1} for some integer C ≥ 0 that is a constant or O(N) in the total input size N. This is the case for the ASCII alphabet {0,...,255} or the DNA alphabet {A, T, C, G}.

We denote by Σ* the set of all finite strings on Σ and by ε the empty string. Let s = a1···an ∈ Σ* be a string on Σ. For 1 ≤ i ≤ j ≤ n, we denote by |s| = n the length of s, by s[i] = ai the i-th letter, and by s[i..j] = ai···aj the substring from the i-th to the j-th letter. If s = pq for some p, q ∈ Σ*, then we say that p is a prefix of s and write p ⊑ s.

We define the lexicographic order ≤lex on the strings of Σ* by extending the total order ≤Σ on letters in the standard way.

2.2 String sorting problem

For strings s1,...,sm ∈ Σ*, we distinguish an ordered list (s1,...,sm), an unordered list (or multiset) {{s1,...,sm}}, and a set {s1,...,sm} (of unique strings) from each other. The notations ∈ and = are defined accordingly. We use the intensional notations (s | P(s)) and {s | P(s)}. For lists of strings S1, S2, we denote by S1S2 (S1 ⊕ S2, resp.) the concatenation of S1 and S2 as ordered lists (as unordered lists, resp.).

An input to our algorithm is a string list of size K ≥ 0, that is, an ordered list S = (s1,...,sK) ∈ (Σ*)* of possibly identical strings, where si ∈ Σ* is a string on Σ for every 1 ≤ i ≤ K. We denote by |S| = K the cardinality of S and by ||S|| = N = ∑_{i=1}^{K} |si| the total size of S. We denote the set of all unique strings in S by uniq(S) = {si | i = 1,...,K}. A string list S = (s1,...,sn) is sorted if si ≤lex si+1 holds for every i = 1,...,n−1.

Definition 1 The string sorting problem (STR-SORT) is, given a string list S = (s1,...,sK) ∈ (Σ*)*, to compute a sorted list π(S) = (sπ(1),...,sπ(K)) ∈ (Σ*)* for some permutation π : {1,...,K} → {1,...,K} of its indices.
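As a concrete illustration of Definition 1 (ours, not from the paper), Python's built-in string comparison plays the role of ≤lex, so a small instance of STR-SORT looks as follows; note that the input is a multiset, so duplicates survive the sorting permutation:

```python
# STR-SORT on a toy multiset: the input is an ordered list of possibly
# identical strings; the output is a permutation of it in <=lex order.
S = ["BAT", "AN", "BEE", "AN", "ART"]      # K = |S| = 5

sorted_S = sorted(S)                       # a sorted permutation pi(S)
print(sorted_S)                            # ['AN', 'AN', 'ART', 'BAT', 'BEE']
print(len(S), len(set(S)))                 # 5 4  -- K versus |uniq(S)|
```

The distinction between |S| and |uniq(S)| matters later for the counting and aggregation problems, which are stated over the unique strings only.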

We can extend our framework to more complicated string processing problems as follows.

Definition 2 The string counting problem (STR-COUNT) is, given a string list S = (s1,...,sK) ∈ (Σ*)*, to compute the histogram h : uniq(S) → N such that h(s) is the number of occurrences of the unique string s in S.

Let (∆, ◦) be a pair of a collection ∆ of values and an associative operation ◦ : ∆² → ∆. For example, (N, +) and (R, max) are examples of (∆, ◦). A string database of size K ≥ 0 over Σ and ∆ is a list D = ((si, vi) | i = 1,...,K) ∈ (Σ* × ∆)* of pairs of a string and a value. We denote by uniq(D) = {si | (si, vi) ∈ D} the set of unique strings in D. For (∆, ◦), the aggregate of a unique string s w.r.t. ◦ is denoted by aggr◦(s).

Definition 3 The string aggregation problem w.r.t. an operation (∆, ◦) (STR-AGGREGATE(∆, ◦)) is, given a string database D = ((si, vi) | i = 1,...,K) ∈ (Σ* × ∆)*, to compute the pair (s, aggr◦(s)) for every unique string s ∈ uniq(D).

The above string counting and string aggregation problems, as well as the string sorting problem, can be efficiently solved by a modification of a trie-based sorter. In this paper, we extend such a trie-based sorter to an external memory environment.

2.3 Model of computation

In what follows, we denote by K = |S| and N = ||S|| the cardinality and the total size of the input string list S. We assume a naive model of computation such that a CPU has random access to a main memory of size M ≥ 0, and sequential and random access to an external memory of unbounded size, where memory space is measured in the number of integers to be stored. Though the external memory can be partitioned into blocks of size 0 ≤ B ≤ M, we do not study a detailed analysis of the I/O complexity. In this paper, we are particularly interested in the case that the input does not fit into the main memory but is not too large, namely, N = O(M²).

3 Our Algorithm

3.1 Outline of the main algorithm

In Fig. 2, we show the outline of our external string sorting algorithm DistStrSort, which uses adaptive construction of a synopsis trie for splitting and is a variant of distribution sort for integer lists. Fig. 1 shows the architecture of DistStrSort.

Algorithm DistStrSort:
Input: A string list S = (s1,...,sK) ∈ (Σ*)* and a maximal bucket size 0 ≤ B ≤ M.
Output: The sorted string list T ∈ (Σ*)*.
1. Determine a growth parameter α ≥ 0. Build an adaptive synopsis trie Tα,S for S in main memory with the growth parameter α and the bucket size B. (BuildDietTrie in Fig. 4)
2. Split the input list S in external memory into a partition of buckets S1,...,Sd of size at most B (d ≥ 0) by using Tα,S, where Si ≺ Si+1 for every i = 1,...,d−1. (SplitDietTrie in Fig. 5)
3. For every i = 1,...,d, sort the bucket Si by an arbitrary internal sorting algorithm and write it back to external memory. (SortBucket in Fig. 6)
4. Merge the sorted buckets S1,...,Sd into the output list T.

Figure 2: The algorithm DistStrSort

The key of the algorithm is Step 2, which splits the input list S into buckets S1,...,Sd of size at most B by computing an ordered partition, defined as follows.

For string lists S1, S2 ∈ (Σ*)*, S1 precedes S2, denoted by S1 ≺ S2, if s1 ≤lex s2 holds for every s1 ∈ S1 and s2 ∈ S2. An ordered partition of a string list S with maximum bucket size B is a sequence of buckets S1 ⊕ ··· ⊕ Sd (d ≥ 0) such that (i) S1 ⊕ ··· ⊕ Sd = S as unordered lists, (ii) Si ≺ Si+1 holds for every i = 1,...,d−1, and (iii) |Si| ≤ B holds for every i = 1,...,d.
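The three conditions of an ordered partition can be checked mechanically. A small sketch follows; the helper name and the use of Python's string order as ≤lex are ours, and bucket size is measured here as total length ||Si||:

```python
from collections import Counter

def is_ordered_partition(buckets, S, B):
    """Check that buckets = S1 (+) ... (+) Sd is an ordered partition of S:
    (i) the buckets together form the same multiset as S,
    (ii) every string of Si is <=lex every string of Si+1,
    (iii) every bucket has total size ||Si|| at most B."""
    flat = [s for bucket in buckets for s in bucket]
    if Counter(flat) != Counter(S):                            # (i)
        return False
    for b1, b2 in zip(buckets, buckets[1:]):                   # (ii)
        if b1 and b2 and max(b1) > min(b2):
            return False
    return all(sum(len(s) for s in b) <= B for b in buckets)   # (iii)

S = ["BAT", "AN", "BEE", "AN", "ART", "CAP"]
print(is_ordered_partition([["AN", "AN", "ART"], ["BAT", "BEE"], ["CAP"]], S, 8))  # True
print(is_ordered_partition([["BAT", "BEE"], ["AN", "ART", "AN"], ["CAP"]], S, 8))  # False
```

The second call fails condition (ii): the first bucket contains "BEE", which is lexicographically larger than "AN" in the second bucket.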

It is easy to see that if we can compute an ordered partition of an input string list S with maximum bucket size B ≤ M in main memory of size M, and if we can sort a bucket of size B in O(B) main memory, then the algorithm DistStrSort in Fig. 2 correctly solves the string sorting problem in main memory M (Theorem 7).

However, it is not easy to compute a good partition with maximum bucket size B using only a limited main memory for a large input S that does not fit into main memory. Hence, we take the approach of computing an approximate answer using adaptive computation in the following sections.

3.2 A Synopsis Trie

A synopsis trie for a string list S = (s1,...,sK) ∈ (Σ*)* is a trie data structure T for a subset of the prefixes of the strings in uniq(S), defined as follows.

A trie for a set S of strings on Σ (Fredkin 1960) is a rooted tree T whose edges are labeled with letters in Σ. In Fig. 3, we show an example of a trie for a set S = {AN, ART, BAG, BAT, BEAT, BEE, …, CAP, CUP} of strings.

[Figure 3: A trie]

We denote by T the set of vertices of T and by L(T) ⊆ T the set of its leaves. Each vertex v ∈ T represents the string str(v) ∈ Σ*, called the path string, obtained by concatenating the labels on the unique path from the root to v in T. Let Str(T) = {str(ℓ) | ℓ ∈ L(T)} be the set of strings represented by the leaves of T. The vertices of a trie T are often labeled with some additional information.

If a trie T satisfies Str(T) = S, then T is said to be a trie for S. The trie T for a string list S can be constructed in O(N) time and O(N) space in the total input size N = ||S|| for constant Σ (Fredkin 1960). Then, a synopsis trie for S is a trie T for S which satisfies that for every string s ∈ uniq(S), some leaf ℓ ∈ T represents a prefix of s, i.e., str(ℓ) ⊑ s. Note that a synopsis trie for S is not unique, and that the empty trie T = {root} is a trivial one for any set S.

Given a string list S, a synopsis trie T can store the strings in S. For every leaf ℓ ∈ L(T), we define the sublist of the strings in S belonging to ℓ by list(ℓ, S) = (s ∈ S | str(ℓ) ⊑ s), and the count of the vertex by count(ℓ, S) = |list(ℓ, S)|. The total space required to implement a trie for S is O(||S||), even if we include the information on count and list. On the in-core computation by a trie, we have the following lemma.

Lemma 1 We can solve the string sorting, the string counting, and the string aggregation problems for an input string list S in O(N) time using main memory of size O(N), where N = ||S||.

We will extensively use this trie data structure in the following subsections to implement the computation at Step 1, Step 2, and Step 3 of the main sorting algorithm in Fig. 2.

3.3 STEP1: Adaptive Construction of a Synopsis Trie

In Fig. 4, we show the algorithm BuildDietTrie for constructing an adaptive synopsis trie.

Although the empty trie T = {root} obviously satisfies the above condition, it is useless for computing a good ordered partition. Instead, the goal is the adaptive construction of a synopsis trie T for S satisfying the following conditions:
(i) T fits into the main memory of size M, i.e., size(T) ≤ M.
(ii) For the input string list S, count(ℓ, S) ≤ B holds for every leaf ℓ, for a parameter B ≤ M.

The adaptive construction is done as follows. For an alphabet Σ = {c0,...,cσ−1}, each state v of the automaton is implemented as the list (goto(v, 0),...,goto(v, σ−1)) of σ pointers. For an alphabet of constant size, this is implemented as an array of pointers. At the creation of a new state v, these pointers are set to NULL and a counter count(v) is set to 1.

When we construct a synopsis trie T, the algorithm uses a positive integer α > 0, called the threshold for growing. This threshold represents the minimum frequency required for extending a new state from the current state.

The algorithm first initializes the trie T as the trie T = {root} consisting of the root node only. When the algorithm inserts a string s ∈ S into T, it traces the corresponding edges starting from the root by goto pointers. Each state v in the synopsis trie T has an integer count(v), incremented whenever a string reaches v. These counters represent approximate occurrence counts of prefixes of the input strings.

When there is no state to which the current state v transfers, the synopsis trie tries to generate a new state. However, the algorithm does not permit the trie to extend a new edge unless the counter count(v) of v has reached the given threshold α. If count(v) reaches α, the trie generates a new state and attaches it to the current state with a new edge.

Algorithm BuildDietTrie:
Input: A string list S = (s1,...,sK), a maximum bucket size M ≥ B ≥ 0, and a growth parameter α > 0.
Output: A synopsis trie Tα,S for S.
1. T := {root};
2. For each i := 1,...,K, do the following:
   (a) Read the next string s = si from the external disk; x := root;
   (b) For each j := 1,...,|s|, do:
       • y := goto(x, s[j]);  (Trace the edge labeled s[j].)
       • If y = ⊥ holds, then:
         − If count(x) ≥ α holds, then: create a new state y; T := T ∪ {y};
           count(y) := 1; goto(x, s[j]) := y; x := y;
         − Else: break the inner for-loop and go to step 2(a);
       • Else: x := y and count(y) := count(y) + 1;
3. Return T;

Figure 4: An algorithm for constructing a synopsis trie

Lemma 2 Let S be an input string list of total size N. The algorithm BuildDietTrie in Fig. 4 computes a synopsis trie for S in O(N) time and O(|T|) space.

At present, we have no theoretical upper bounds on |T| and the maximum value of count(ℓ) according to the value of the growth parameter α ≥ 0. Tentatively, setting α = cB for some constant 0 ≤ c ≤ 1 works well in practice.

3.4 STEP2: Splitting a String List into Buckets

Let BID = {1,...,b} be a set of bucket-ids for some b ≥ 0. A bucket-id assignment for T is a mapping β : L(T) → BID that assigns bucket-ids to the leaves. Let (T, β) be a synopsis trie T whose leaves are labeled with the bucket-id assignment β. A bucket-id assignment β for T is said to be order-preserving if it satisfies the following condition: for every pair of leaves ℓ1, ℓ2, if str(ℓ1) ≤lex str(ℓ2), then β(ℓ1) ≤ β(ℓ2).

By the definition of a synopsis trie above, we know that for every string s ∈ uniq(S), there is a unique leaf ℓ ∈ L(T) such that str(ℓ) ⊑ s. We denote this leaf by vertex(s, T) = ℓ. Then, the bucket-id assigned to s is defined by bucket(T, β, s) = β(vertex(s, T)). Given an input string list S ∈ (Σ*)*, (T, β) defines the partition

  PART(S, T, β) = S1 ⊕ ··· ⊕ Sb,

where Sk = (s ∈ S | bucket(T, β, s) = k) holds for every bucket-id k ∈ BID. Then, the maximum bucket size of (T, β) on input S is defined by max{|Sk| : k = 1,...,b}. Clearly, S1 ⊕ ··· ⊕ Sb = S holds.

Lemma 3 If β is order-preserving, then PART(S, T, β) is an ordered partition.

In Fig. 5, we show the algorithm SplitDietTrie for splitting an input string list S into the buckets PART(S, T, β) = B1 ⊕ ··· ⊕ Bb. We can easily see that the bucket-id assignment computed by the algorithm SplitDietTrie is always order-preserving. Furthermore, we have the following lemma.

Algorithm SplitDietTrie:
Input: A string list S = (s1,...,sK) (in external memory), a synopsis trie Tα,S for S (in internal memory), and a positive integer M ≥ B ≥ 0.
Output: An ordered partition B1 ⊕ ··· ⊕ Bb for S.
1. // Compute a bucket-id assignment β
   • Let N′ := ∑_{ℓ∈L(Tα,S)} count(ℓ); B′ := B·N′/N; k := 1; A := 0;
   • For each leaf ℓ ∈ L(Tα,S) in the lexicographic order of str(ℓ), do:
     − If A + count(ℓ) ≤ B′, then β(ℓ) := k and A := A + count(ℓ);
     − Else: k := k + 1; β(ℓ) := k; A := count(ℓ);
   • b := k;
2. // Distribute all strings in S into the corresponding buckets
   • For each k = 1,...,b, do: Bk := ∅;
   • For each string s ∈ S, do:
     − Find the leaf ℓ = vertex(s, Tα,S) reachable from the root by s;
     − Put s into the k-th bucket Bk in external memory, for k = β(ℓ).

Figure 5: An algorithm SplitDietTrie for splitting an input string list

Lemma 4 Let S be an input list and T a synopsis trie with a bucket-id assignment for S. Suppose that |T| ≤ M. Then, the algorithm SplitDietTrie of Fig. 5, given S and T, computes an ordered partition of S in O(N) time and O(|T|) space in main memory, where N = ||S||.

We have the following lemma on the accuracy of the approximation.

Lemma 5 Let B′ ≥ 0 (0 ≤ B′ ≤ M) and k ≥ 1 be any integers, and let T be the synopsis trie for S. If T on S satisfies 0 ≤ count(ℓ) ≤ (1/k)·B′ for every leaf ℓ ∈ L(T), then the resulting ordered partition S = B1 ⊕ ··· ⊕ Bb (b ≥ 0) satisfies ((k−1)/k)·B′ ≤ |Bi| ≤ B′ for every bucket Bi (1 ≤ i ≤ b).

3.5 STEP3: Internal Sorting of Buckets

Once an input string list S of total size N has been split into an ordered partition S1 ⊕ ··· ⊕ Sb = S of maximum bucket size B ≤ M, we can sort each bucket Si in main memory using any internal-memory sorting algorithm. For this purpose, we use the internal trie sort described in Fig. 6.

Algorithm SortBucket:
Input: A bucket B ∈ (Σ*)*.
Output: The bucket C ∈ (Σ*)* obtained from B by sorting in ≤lex.
1. Build a trie TB for the bucket B in main memory;
2. Traverse all vertices v of TB in the lexicographic order of their path strings str(v), and output str(v).

Figure 6: An algorithm SortBucket for sorting each bucket

Lemma 6 The algorithm SortBucket of Fig. 6 sorts a bucket B of strings in O(N) time using main memory of size O(N), where N = ||B|| and Σ is a constant alphabet.

3.6 STEP4: Final concatenation

Step 4 is just a concatenation of the already sorted buckets B1,...,Bb. Therefore, this step requires no extra cost.
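The construction and splitting steps above (Figs. 4 and 5) can be sketched compactly in Python. This is our own simplification, not the paper's C implementation: the trie is represented by dicts keyed by path strings, and bucket lookup routes each string by binary search over explicit splitter strings derived from the leaf path strings, instead of by goto-transitions:

```python
import bisect
from collections import Counter

def build_diet_trie(S, alpha):
    """BuildDietTrie (Fig. 4), simplified: a node grows a new child only
    once its own counter has reached alpha, so rare prefixes stay small."""
    count = Counter({"": 0})      # path string -> number of strings reaching it
    children = {"": set()}        # path string -> letters of existing children
    for s in S:
        x = ""                    # current state, starting at the root
        count[""] += 1
        for c in s:
            if c not in children[x]:
                if count[x] < alpha:
                    break         # not frequent enough: stop tracing s
                children[x].add(c)            # grow a new state
                children[x + c] = set()
                count[x + c] = 0
            x = x + c
            count[x] += 1
    return children, count

def make_splitters(children, count, num_buckets):
    """SplitDietTrie step 1, simplified: sweep the leaves in lexicographic
    order of their path strings, close a bucket whenever its estimated
    share is full, and keep the first leaf path of each new bucket."""
    leaves = sorted(p for p in children if not children[p])
    total = sum(count[p] for p in leaves) or 1
    target = total / max(1, num_buckets)
    splitters, acc = [], 0.0
    for p in leaves:
        if acc > 0 and acc + count[p] > target:
            splitters.append(p)   # p opens the next bucket
            acc = 0.0
        acc += count[p]
    return splitters

def split(S, splitters):
    """SplitDietTrie step 2, simplified: route each string to the bucket
    of its lexicographic range by binary search over the splitters."""
    buckets = [[] for _ in range(len(splitters) + 1)]
    for s in S:
        buckets[bisect.bisect_right(splitters, s)].append(s)
    return buckets

# Steps 1-4 end to end on a toy list (duplicates included):
S = ["banana", "band", "apple", "ant", "bat", "car", "cat", "ape", "bad", "art"] * 5
children, count = build_diet_trie(S, alpha=3)
buckets = split(S, make_splitters(children, count, num_buckets=3))
merged = [s for b in buckets for s in sorted(b)]   # sort each bucket, concatenate
print(merged == sorted(S))                         # True
```

Because the splitters are sorted and every bucket is a contiguous lexicographic range, sorting each bucket and concatenating always yields the globally sorted list, regardless of how well-balanced the buckets are; only the balance, not correctness, depends on the quality of the synopsis trie.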
3.7 Analysis of the algorithm

In this subsection, we give a case analysis of the practical performance of the proposed external sorting algorithm, assuming a practical situation for such an algorithm. For this purpose, we suppose the following condition:

Condition 1 The synopsis trie computed by BuildDietTrie and SplitDietTrie on the input string list S with the choice of a growth parameter α satisfies the following conditions:
• The synopsis trie fits into main memory, that is, size(T) ≤ M.
• The maximum bucket size of the ordered partition computed by SplitDietTrie at Step 2 does not exceed M.

For the above condition to hold, we know that at least the input S must satisfy N = O(M²). Note that at this time, we only have a heuristic algorithm for computing a good synopsis trie, without any theoretical guarantee of its performance. Finally, we have the following result.

Theorem 7 Suppose that the algorithms BuildDietTrie and SplitDietTrie satisfy Condition 1 on the input S and α ≥ 0. The algorithm DistStrSort correctly solves the external string sorting problem in O(N) time within main memory M, using 3N sequential reads (Steps 1–3), N random writes (Step 2), and N sequential writes (Step 3) in the external memory.

From Theorem 7 and Lemma 1, we can show the following corollary using a modified version of the DistStrSort algorithm.

Corollary 8 Let (∆, ◦) be any associative binary operator, where the operation ◦ can be computed in O(1) time. Suppose that the algorithms BuildDietTrie and SplitDietTrie satisfy Condition 1 on the input S and α ≥ 0. There exists an algorithm that solves the string counting (STR-COUNT) and the string aggregation (STR-AGGREGATE(∆, ◦)) problems for an input string list S in O(N) time using main memory of size M in external memory.
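The internal trie sort of Fig. 6 (Lemma 6), which underlies Lemma 1 and the counting variant of Corollary 8, can be rendered in a few lines. The names below are ours: each node is a plain dict, and the duplicate counter stored at an end-of-string node doubles as the STR-COUNT histogram entry for that string:

```python
def sort_bucket(bucket):
    """SortBucket (Fig. 6): insert every string into a trie with
    multiplicity counters, then emit the strings by a depth-first
    traversal that visits edges in letter order; the output is the
    bucket sorted under <=lex."""
    END = ""                                # marker key: a string ends here
    root = {}
    for s in bucket:
        node = root
        for c in s:
            node = node.setdefault(c, {})
        node[END] = node.get(END, 0) + 1    # count duplicates at this node

    out = []
    def dfs(node, path):
        if END in node:                     # a string ending here sorts
            out.extend([path] * node[END])  # before any proper extension
        for c in sorted(k for k in node if k != END):
            dfs(node[c], path + c)
    dfs(root, "")
    return out

print(sort_bucket(["bat", "an", "bee", "an", "art", "ba"]))
# ['an', 'an', 'art', 'ba', 'bat', 'bee']
```

Each string is inserted and emitted in time proportional to its length, so for a constant alphabet the whole bucket is sorted in O(||B||) time, matching Lemma 6.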

4 Experimental Results

In this section, we present experimental results on real datasets to evaluate the performance of our algorithms. We implemented our algorithms in C. The experiments were run on a PC (Xeon 3.6GHz, Windows XP) with 3.25 gigabytes of main memory.

We prepared two datasets, data A and data B, consisting of cookie values from Web access logs. The parameters of the datasets are shown in Table 1.

Table 1: Parameters of the datasets

  Dataset                                 data A     data B
  Size                                    67MB       1.1GB
  Number of records (with duplications)   140×10^4   2290×10^4
  Number of records (no duplications)     22×10^4    109×10^4
  Average record length                   45.0       44.9
  Variance of record length               45.1       19.7
  Maximum record length                   58         58
  Minimum record length                   15         15

4.1 Size of Synopsis Trie

First, we studied the sizes of the synopsis tries constructed by the algorithm BuildDietTrie, varying the growth parameter α = 100, 1,000, or 10,000. Table 2 shows the number of states of the constructed synopsis trie for each α on data A. We can see that the bigger α is, the smaller the size of the constructed synopsis trie becomes.

Table 2: Sizes of the constructed synopsis tries

  Growth parameter α   100       1,000    10,000
  Number of states     426,209   92,768   5,030

4.2 Running Time

4.2.1 BuildDietTrie and SplitDietTrie

Next, we measured the running times of the subprocedures BuildDietTrie and SplitDietTrie on data A, varying the number of buckets and the growth parameter. Fig. 7 shows the results. The horizontal axis in the figure represents pairs of the number of buckets and the growth threshold; the vertical axis represents the running time.

[Figure 7: Running time of BuildDietTrie and SplitDietTrie]

The results indicate that the running time is shorter for larger growth parameters. Thus, a smaller synopsis trie can be constructed in a reasonable computational time.

4.2.2 DistStrSort

Finally, we studied the running time of the algorithm DistStrSort. We increased the data to 1GB (about 2,270×10^4 records) and, in the same experimental environment as before, compared the computation time with GNU sort. Fig. 8 shows the running times on data B of DistStrSort and GNU sort 5.97. The running time of our algorithm scales linearly with the size of the data, and is faster than GNU sort for larger data sizes.

[Figure 8: Running times of DistStrSort and GNU sort]

5 Conclusion

In this paper, we presented a new algorithm for sorting large collections of strings in external memory. The main idea of our method is to first split a set of strings into buckets by adaptive construction of a synopsis trie for the input strings, then sort the strings in each bucket, and finally merge the sorted buckets into an output collection. Experiments on real datasets show the effectiveness of our algorithm.

In this paper, we only made empirical studies on the performance of the adaptive strategy for constructing a synopsis trie for data splitting. A probabilistic analysis of the upper bounds of |T| and of the maximum value of count(ℓ) with respect to the value of the growth parameter α ≥ 0 is an interesting direction for future research.

We considered only the case that the input size is at most the square of the memory size, and accordingly presented a two-level external sorting algorithm. Developing a hierarchical version of our algorithm for inputs of unbounded size is future research. A stream-based distributed version of distribution string sort algorithms is another possible direction.

References

S. Abiteboul, R. Agrawal, P. A. Bernstein, M. J. Carey, S. Ceri, W. B. Croft, D. J. DeWitt, et al., The Lowell database research self-assessment, C. ACM, 48(5), 111–118, 2005.

S. Abiteboul, P. Buneman, D. Suciu, Data on the Web, Morgan Kaufmann, 2000.

L. Arge, P. Ferragina, R. Grossi, J. S. Vitter, On Sorting Strings in External Memory, Proc. the 29th Annual ACM Symposium on Theory of Computing (STOC'97), 540–548, 1997.

J. Bentley, R. Sedgewick, Fast Algorithms for Sorting and Searching Strings, Proc. the 8th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'97), 360–369, 1997.

P. Ferragina, R. Grossi, The String B-tree: A New Data Structure for String Search in External Memory and Its Applications, J. ACM, 46(2), 236–280, 1999.

E. Fredkin, Trie Memory, C. ACM, 3(9), 490–499, 1960.

P. Laird, R. Saul, Discrete sequence prediction and its applications, Machine Learning, 15(1), 43–68, 1994.

U. Manber, E. W. Myers, Suffix Arrays: A New Method for On-Line String Searches, SIAM J. Comput., 22(5), 935–948, 1993.

G. Manzini, P. Ferragina, Engineering a Lightweight Suffix Array Construction Algorithm, Proc. the 10th European Symposium on Algorithms (ESA'02), 698–710, 2002.

A. Moffat, Implementing the PPM data compression scheme, IEEE Trans. Communications, COM-38(11), 1917–1921, 1990.

R. Ramakrishnan, J. Gehrke, Database Management Systems, McGraw-Hill Professional, 2000.

D. Ron, Y. Singer, N. Tishby, The power of amnesia: learning probabilistic automata with variable memory length, Machine Learning, 25(2-3), 117–149, 1996.

K. Sadakane, A Fast Algorithm for Making Suffix Arrays and for Burrows-Wheeler Transformation, Proc. the 8th Data Compression Conference (DCC'98), 129–138, 1999.

R. Sinha, J. Zobel, Efficient Trie-Based Sorting of Large Sets of Strings, Proc. the 26th Australasian Computer Science Conference (ACSC'03), 2003.

M. Stonebraker, U. Cetintemel, "One Size Fits All": An Idea Whose Time Has Come and Gone, Proc. the IEEE 21st International Conference on Data Engineering (ICDE'05), keynote, 2–11, 2005.