<<

Algorithms ROBERT SEDGEWICK | KEVIN WAYNE

5.1 STRING SORTS 5.1 STRING SORTS

‣ strings in Java ‣ strings in Java ‣ -indexed counting ‣ key-indexed counting Algorithms ‣ LSD radix sort Algorithms ‣ LSD radix sort FOURTH EDITION ‣ MSD radix sort ‣ MSD radix sort ‣ 3-way radix quicksort ‣ 3-way radix quicksort ROBERT SEDGEWICK | KEVIN WAYNE ROBERT SEDGEWICK | KEVIN WAYNE http://algs4.cs.princeton.edu ‣ suffix arrays http://algs4.cs.princeton.edu ‣ suffix arrays

String processing The char data type

String. Sequence of characters. C char data type. Typically an 8-bit integer. Supports 7-bit ASCII. ・ 6.5 Data Compression 667 Important fundamental abstraction. ・Can represent at most 256 characters. ・Genomic sequences. ・Information processing. ASCII encoding. When you HexDump a bit- 0 1 2 3 4 5 6 7 8 9 A B C D E F stream that contains ASCII-encoded charac- NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI Communication systems (e.g., email). 0 ・ ters, the table at right is useful for reference. 1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US ・Programming systems (e.g., Java programs). Given a 2-digit hex number, use the first hex 2 SP ! “ # $ % & ‘ ( ) * + , - . / digit as a row index and the second hex digit … 3 0 1 2 3 4 5 6 7 8 9 : ; < = > ? ・ as a column reference to find the character 4 @ A B C D E F G H I J K L M N O that it encodes. For example, 31 encodes the 5 P Q R S T U V W X Y Z [ \ ] ^ _ 1 4A J U+0041 U+00E1 U+2202 U+1D50A digit , encodes the letter , and so forth. 6 ` a b c d e f g h i j k l m n o This table is for 7-bit ASCII, so the first hex 7 p q r s t u v w x y z { | } ~ DEL digit must be 7 or less. Hex numbers starting some Unicode characters Hexadecimal to ASCII conversion table “ The digital information that underlies biochemistry, cell with 0 and 1 (and the numbers 20 and 7F) biology, and development can be represented by a simple correspond to non-printing control charac- ters. Many of the control characters are left over from the days when physical devices

string of G's, A's, T's and C's. This string is the root data like0 typewriters were controlled by ASCII input; the table highlights a few that you Java char data type. A 16-bit unsigned integer. structure of an organism's biology. ” — M. V. Olson might see in dumps. For example SP is the space character, NUL is the null character, LF is line-feed, and CR is carriage-return. ・Supports original 16-bit Unicode. Supports 21-bit Unicode 3.0 (awkwardly). In summary, working with data compression・ requires us to reorient our thinking about standard input and standard3 output to include binary encoding of data. BinaryStdIn 4 and BinaryStdOut provide the methods that we need. They provide a way for you to make a clear distinction in your client programs between writing out information in- tended for file storage and data transmission (that will be read by programs) and print- ing information (that is likely to be read by humans). I ♥ ︎ Unicode The String data type

String data type in Java. Immutable sequence of characters.

Length. Number of characters. Indexing. Get the ith character. Concatenation. Concatenate one string to the end of another.

s.length()

0 1 2 3 4 5 6 7 8 9 10 11 12 s A T T A C K A T D A W N

U+0041 s.charAt(3) s.substring(7, 11)

Fundamental constant-time String operations

5 6

The String data type: immutability The String data type: representation

Q. Why immutable? Representation (Java 7). Immutable char[] array + cache of hash.

A. All the usual reasons. ・Can use as keys in symbol table. ・Don't need to defensively copy. operation Java running time ・Ensures consistent state. length s.length() 1 ・Supports concurrency. indexing s.charAt(i) 1 Improves security. public class FileInputStream ・ { private String filename; concatenation s + t M + N public FileInputStream(String filename) { ⋮ ⋮ if (!allowedToReadFile(filename)) throw new SecurityException(); this.filename = filename; } ... }

attacker could bypass security if string type were mutable

7 8 String performance trap Comparing two strings

Q. How to build a long string, one character at a time? Q. How many character compares to compare two strings of length W ?

public String reverse(String s) p r e f e t c h { String rev = ""; 0 1 2 3 4 5 6 7 for (int i = s.length() - 1; i >= 0; i--) p r e f i x e s rev += s.charAt(i); quadratic time return rev; } Running time. Proportional to length of longest common prefix. ・Proportional to W in the worst case. A. Use StringBuilder data type (mutable char[] array). ・But, often sublinear in W.

public static String reverse(String s) { StringBuilder rev = new StringBuilder(); for (int i = s.length() - 1; i >= 0; i--) rev.append(s.charAt(i)); linear time return rev.toString(); }

9 10

Alphabets

Digital key. Sequence of digits over fixed alphabet. 604 CHAPTER 6 ! Strings Radix. Number of digits R in alphabet.

name R() lgR() characters BINARY 2 1 01 5.1 STRING SORTS OCTAL 8 3 01234567 DECIMAL 10 4 0123456789 ‣ strings in Java HEXADECIMAL 16 4 0123456789ABCDEF

DNA 4 2 ACTG ‣ key-indexed counting LOWERCASE 26 5 abcdefghijklmnopqrstuvwxyz Algorithms ‣ LSD radix sort UPPERCASE 26 5 ABCDEFGHIJKLMNOPQRSTUVWXYZ ‣ MSD radix sort PROTEIN 20 5 ACDEFGHIKLMNPQRSTVWY

ABCDEFGHIJKLMNOPQRSTUVWXYZabcdef ‣ 3-way radix quicksort BASE64 64 6 ROBERT SEDGEWICK | KEVIN WAYNE ghijklmnopqrstuvwxyz0123456789+/ http://algs4.cs.princeton.edu ‣ suffix arrays ASCII 128 7 ASCII characters EXTENDED_ASCII 256 8 extended ASCII characters UNICODE16 65536 16 Unicode characters Standard alphabets

11 holds the frequencies in Count is an example of a character-indexed array. With a Java String, we have to use an array of size 256; with Alphabet, we just need an array with one entry for each alphabet character. This savings might seem modest, but, as you will see, our algorithms can produce huge numbers of such arrays, and the space for arrays of size 256 can be prohibitive.

Numbers. As you can see from our several of the standard Alphabet examples, we of- ten represent numbers as strings. The methods toIndices() coverts any String over a given Alphabet into a base-R number represented as an int[] array with all values between 0 and RϪ1. In some situations, doing this conversion at the start leads to com- pact code, because any digit can be used as an index in a character-indexed array. For example, if we know that the input consists only of characters from the alphabet, we could replace the inner loop in Count with the more compact code int[] a = alpha.toIndices(s); for (int i = 0; i < N; i++) count[a[i]]++; Review: summary of the performance of sorting algorithms Key-indexed counting: assumptions about keys

Frequency of operations. Assumption. Keys are integers between 0 and R - 1. Implication. Can use key as an array index. input sorted result algorithm guarantee random extra space stable? operations on keys name section (by section) Anderson 2 Harris 1 Applications. Brown 3 Martin 1 2 2 ✔ compareTo() insertion sort ½ N ¼ N 1 Davis 3 Moore 1 ・Sort string by first letter. Garcia 4 Anderson 2 Harris 1 Martinez 2 mergesort ✔ compareTo() Sort class roster by section. N lg N N lg N N ・ Jackson 3 Miller 2 Sort phone numbers by area code. Johnson 4 Robinson 2 ・ Jones 3 White 2 quicksort 1.39 N lg N * 1.39 N lg N c lg N compareTo() ・Subroutine in a sorting algorithm. [stay tuned] Martin 1 Brown 3 Martinez 2 Davis 3 heapsort 2 N lg N 2 N lg N 1 compareTo() Miller 2 Jackson 3 Moore 1 Jones 3 Remark. Keys may have associated data ⇒ Robinson 2 Taylor 3 * probabilistic can't just count up number of keys of each value. Smith 4 Williams 3 Taylor 3 Garcia 4 Thomas 4 Johnson 4 Thompson 4 Smith 4 Lower bound. ~ N lg N compares required by any compare-based algorithm. White 2 Thomas 4 Williams 3 Thompson 4 Wilson 4 Wilson 4

Q. Can we do better (despite the lower bound)? keys are use array accesses small integers A. Yes, if we don't depend on key compares. to make R-way decisions Typical candidate for key-indexed counting (instead of binary decisions) 13 14

Key-indexed counting demo Key-indexed counting demo

Goal. Sort an array a[] of N integers between 0 and R - 1. Goal. Sort an array a[] of N integers between 0 and R - 1. ・Count frequencies of each letter using key as index. R = 6 ・Count frequencies of each letter using key as index. ・Compute frequency cumulates which specify destinations. ・Compute frequency cumulates which specify destinations. ・Access cumulates using key as index to move items. ・Access cumulates using key as index to move items. Copy back into original array. Copy back into original array. offset by 1 ・ i a[i] ・ i a[i] [stay tuned] 0 d 0 d int N = a.length; int N = a.length; int[] count = new int[R+1]; 1 a int[] count = new int[R+1]; 1 a 2 c use a for 0 2 c r count[r] b for 1 for (int i = 0; i < N; i++) 3 f for (int i = 0; i < N; i++) 3 f a 0 c for 2 count frequencies count[a[i]+1]++; 4 f d for 3 count[a[i]+1]++; 4 f b 2 5 b e for 4 5 b c 3 f for 5 for (int r = 0; r < R; r++) 6 d for (int r = 0; r < R; r++) 6 d d 1 count[r+1] += count[r]; 7 b count[r+1] += count[r]; 7 b e 2 8 f 8 f f 1 for (int i = 0; i < N; i++) for (int i = 0; i < N; i++) 9 b 9 b - 3 aux[count[a[i]]++] = a[i]; aux[count[a[i]]++] = a[i]; 10 e 10 e 11 a 11 a for (int i = 0; i < N; i++) for (int i = 0; i < N; i++) a[i] = aux[i]; a[i] = aux[i];

15 16 Key-indexed counting demo Key-indexed counting demo

Goal. Sort an array a[] of N integers between 0 and R - 1. Goal. Sort an array a[] of N integers between 0 and R - 1. ・Count frequencies of each letter using key as index. ・Count frequencies of each letter using key as index. ・Compute frequency cumulates which specify destinations. ・Compute frequency cumulates which specify destinations. ・Access cumulates using key as index to move items. ・Access cumulates using key as index to move items. Copy back into original array. Copy back into original array. ・ i a[i] ・ i a[i] i aux[i]

0 d 0 d 0 a int N = a.length; int N = a.length; int[] count = new int[R+1]; 1 a int[] count = new int[R+1]; 1 a 1 a 2 c r count[r] 2 c r count[r] 2 b for (int i = 0; i < N; i++) 3 f a 0 for (int i = 0; i < N; i++) 3 f a 2 3 b count[a[i]+1]++; 4 f b 2 count[a[i]+1]++; 4 f b 5 4 b 5 b c 5 5 b c 6 5 c

compute for (int r = 0; r < R; r++) 6 d d 6 for (int r = 0; r < R; r++) 6 d d 8 6 d cumulates count[r+1] += count[r]; 7 b e 8 count[r+1] += count[r]; 7 b e 9 7 d 8 f f 9 8 f f 12 8 e for (int i = 0; i < N; i++) for (int i = 0; i < N; i++) 9 b - 12 move 9 b - 12 9 f aux[count[a[i]]++] = a[i]; items aux[count[a[i]]++] = a[i]; 10 e 10 e 10 f 11 a 11 a 11 f for (int i = 0; i < N; i++) for (int i = 0; i < N; i++) 6 keys < d, 8 keys < e a[i] = aux[i]; so d’s go in a[6] and a[7] a[i] = aux[i]; 17 18

Key-indexed counting demo Key-indexed counting: analysis

Goal. Sort an array a[] of N integers between 0 and R - 1. Proposition. Key-indexed takes time proportional to N + R. ・Count frequencies of each letter using key as index. for (int i = 0; i < N; i++) aux[count[a[i].key(d)]++] = a[i]; ・Compute frequency cumulates which specify destinations. Proposition. Key-indexed counting uses extra space proportional to N + R. count[] ・Access cumulates using key as index to move items. 1 2 3 4 0 3 8 14 Copy back into original array. Stable? ✔ 0 4 8 14 a[0] Anderson 2 Harris 1 aux[0] ・ i a[i] i aux[i] 0 4 9 14 a[1] Brown 3 Martin 1 aux[1] a[2] Davis 3 Moore 1 aux[2] 0 a 0 a 0 4 10 14 int N = a.length; 0 4 10 15 a[3] Garcia 4 Anderson 2 aux[3] int[] count = new int[R+1]; 1 a 1 a 1 4 10 15 a[4] Harris 1 Martinez 2 aux[4] 2 b r count[r] 2 b 1 4 11 15 a[5] Jackson 3 Miller 2 aux[5] 1 4 11 16 a[6] Johnson 4 Robinson 2 aux[6] for (int i = 0; i < N; i++) 3 b a 2 3 b 1 4 12 16 a[7] Jones 3 White 2 aux[7] a[8] Martin 1 Brown 3 aux[8] count[a[i]+1]++; 4 b b 5 4 b 2 4 12 16 2 5 12 16 a[9] Martinez 2 Davis 3 aux[9] 5 c 5 c c 6 2 6 12 16 a[10] Miller 2 Jackson 3 aux[10] for (int r = 0; r < R; r++) 6 d d 8 6 d 3 6 12 16 a[11] Moore 1 Jones 3 aux[11] 3 7 12 16 a[12] Robinson 2 Taylor 3 aux[12] count[r+1] += count[r]; 7 d 7 d e 9 3 7 12 17 a[13] Smith 4 Williams 3 aux[13] 8 e f 12 8 e 3 7 13 17 a[14] Taylor 3 Garcia 4 aux[14] for (int i = 0; i < N; i++) 3 7 13 18 a[15] Thomas 4 Johnson 4 aux[15] 9 f - 12 9 f aux[count[a[i]]++] = a[i]; 3 7 13 19 a[16] Thompson 4 Smith 4 aux[16] 10 f 10 f 3 8 13 19 a[17] White 2 Thomas 4 aux[17] 11 f 11 f 3 8 14 19 a[18] Williams 3 Thompson 4 aux[18] copy for (int i = 0; i < N; i++) 3 8 14 20 a[19] Wilson 4 Wilson 4 aux[19] back a[i] = aux[i]; Distributing the data (records with key 3 highlighted) 19 20 Least-significant-digit-first string sort

LSD string (radix) sort. ・Consider characters from right to left. ・Stably sort using dth character as the key (using key-indexed counting).

sort key (d = 2) sort key (d = 1) sort key (d = 0) 5.1 STRING SORTS 0 d a b 0 d a b 0 d a b 0 a c e ‣ strings in Java 1 a d d 1 c a b 1 c a b 1 a d d 2 c a b 2 e b b 2 f a d 2 b a d ‣ key-indexed counting 3 f a d 3 a d d 3 b a d 3 b e d ‣ LSD radix sort 4 f e e 4 f a d 4 d a d 4 b e e Algorithms 5 b a d 5 b a d 5 e b b 5 c a b ‣ MSD radix sort 6 d a d 6 d a d 6 a c e 6 d a b 7 b e e 7 f e d 7 a d d 7 d a d ‣ 3-way radix quicksort ROBERT SEDGEWICK | KEVIN WAYNE 8 f e d 8 b e d 8 f e d 8 e b b http://algs4.cs.princeton.edu ‣ suffix arrays 9 b e d 9 f e e 9 b e d 9 f a d 10 e b b 10 b e e 10 f e e 10 f e d 11 a c e 11 a c e 11 b e e 11 f e e

sort is stable (arrows do not cross) 22

LSD string sort: correctness proof LSD string sort: Java implementation

Proposition. LSD sorts fixed-length strings in ascending order. public class LSD { sort key public static sort(String[] a, int W) fixed-length W strings Pf. [ by induction on i ] { After pass i, strings are sorted by last i characters. 0 d a b 0 a c e int R = 256; radix R ・If two strings differ on sort key, 1 c a b 1 a d d int N = a.length; 2 f a d 2 b a d String[] aux = new String[N]; key-indexed sort puts them in proper 3 b a d 3 b e d do key-indexed counting for (int d = W-1; d >= 0; d--) relative order. 4 d a d 4 b e e for each digit from right to left { If two strings agree on sort key, 5 e b b 5 c a b ・ int[] count = new int[R+1]; 6 a c e 6 d a b stability keeps them in proper relative order. for (int i = 0; i < N; i++) 7 a d d 7 d a d count[a[i].charAt(d) + 1]++; 8 f e d 8 e b b key-indexed counting for (int r = 0; r < R; r++) 9 b e d 9 f a d count[r+1] += count[r]; 10 f e e 10 f e d for (int i = 0; i < N; i++) 11 b e e 11 f e e aux[count[a[i].charAt(d)]++] = a[i]; for (int i = 0; i < N; i++) Proposition. LSD sort is stable. a[i] = aux[i]; sorted from Pf. Key-indexed counting is stable. previous passes } (by induction) } } 23 24 Summary of the performance of sorting algorithms String sorting interview question

Frequency of operations. Problem. Sort one million 32-bit integers. Ex. Google (or presidential) interview. algorithm guarantee random extra space stable? operations on keys

Which sorting method to use? insertion sort ½ N 2 ¼ N 2 1 ✔ compareTo() ・Insertion sort. mergesort N lg N N lg N N ✔ compareTo() ・Mergesort. ・Quicksort. * compareTo() quicksort 1.39 N lg N 1.39 N lg N c lg N ・Heapsort. LSD string sort. heapsort 2 N lg N 2 N lg N 1 compareTo() ・

LSD sort † 2 W (N + R) 2 W (N + R) N + R ✔ charAt()

* probabilistic † fixed-length W keys

Q. What if strings are not all of same length?

25 26

String sorting interview question How to take a census in 1900s?

1880 Census. Took 1500 people 7 years to manually process data.

Herman Hollerith. Developed counting and sorting machine to automate. ・Use punch cards to record data (e.g., gender, age). ・Machine sorts one column at a time (into one of 12 bins). ・Typical question: how many women of age 20 to 30?

Hollerith tabulating machine and sorter punch card (12 holes per column)

1890 Census. Finished in 1 year (and under budget)! Google CEO Eric Schmidt interviews Barack Obama 27 28 How to get rich sorting in 1900s? LSD string sort: a moment in history (1960s)

Punch cards. [1900s to 1950s] ・Also useful for accounting, inventory, and business processes. ・Primary medium for data entry, storage, and processing.

Hollerith's company later merged with 3 others to form Computing Tabulating Recording Corporation (CTRC); company renamed in 1924. card punch punched cards card reader mainframe line printer

To sort a card deck not directly related to sorting - start on right column - put cards into hopper - machine distributes into bins - pick up cards (stable) - move left one column - continue until sorted

card sorter

Lysergic Acid Diethylamide IBM 80 Series Card Sorter (650 cards per minute) (Lucy in the Sky with Diamonds)

29 30

Reverse LSD

・Consider characters from left to right. ・Stably sort using dth character as the key (using key-indexed counting).

sort key (d = 0) sort key (d = 1) sort key (d = 2) 5.1 STRING SORTS 0 d a b 0 a d d 0 b a d 0 c a b ‣ strings in Java 1 a d d 1 a c e 1 c a b 1 d a b 2 c a b 2 b a d 2 d a b 2 e b b ‣ key-indexed counting 3 f a d 3 b e e 3 d a d 3 b a d ‣ LSD radix sort 4 f e e 4 b e d 4 f a d 4 d a d Algorithms 5 b a d 5 c a b 5 e b b 5 f a d ‣ MSD radix sort 6 d a d 6 d a b 6 a c e 6 a d d 7 b e e 7 d a d 7 a d d 7 b e d ‣ 3-way radix quicksort ROBERT SEDGEWICK | KEVIN WAYNE 8 f e d 8 e b b 8 b e e 8 f e d http://algs4.cs.princeton.edu ‣ suffix arrays 9 b e d 9 f a d 9 b e d 9 a c e 10 e b b 10 f e e 10 f e e 10 b e e 11 a c e 11 f e d 11 f e d 11 f e e

not sorted!

32 Most-significant-digit-first string sort MSD string sort: example

MSD string (radix) sort. input d she are are are are are are are are Partition array into R pieces according to first character sells by lo by by by by by by by ・ seashells she sells seashells sea sea sea seas sea (use key-indexed counting). by sells seashells sea seashells seashells seashells seashells seashells the seashells sea seashells seashells seashells seashells seashells seashells Recursively sort all strings that start with each character sea sea sells sells sells sells sells sells sells ・ shore shore seashells sells sells sells sells sells sells the shells she she she she she she she (key-indexed counts delineate subarrays to sort). shells she shore shore shore shore shore shore shore she sells shells shells shells shells shells shore shells 0 a d d sells surely she she she she she she she 0 d a b 0 a d d 1 a c e are seashells surely surely surely surely surely surely surely count[] surely the hi the the the the the the the 1 a d d 1 a c e seashells the the the the the the the the 2 b a d 2 c a b 2 b a d a 0 3 b e e need to examine end of string 3 f a d 3 b e e b 2 every character goes before any 4 b e d in equal keys char value output 4 f e e 4 b e d c 5 are are are are are are are are sort subarrays by by by by by by by by 5 b a d 5 c a b d 6 5 c a b recursively sea sea sea sea sea sea sea sea seashells seashells seashells seashells seashells seashells seashells seashells 6 6 e 8 d a d d a b seashells seashells seashells seashells seashells seashells seashells seashells 7 b e e 7 d a d f 9 6 d a b sells sells sells sells sells sells sells sells sells sells sells sells sells sells sells sells 7 d a d 8 f e d 8 e b b - 12 she she she she she she she she shore sshore shore shells she she she she 9 b e d 9 f a d 8 e b b shells hells shells she shells shells shells shells she she she shore shore shore shore shore 10 e b b 10 f e e surely surely surely surely surely surely surely surely 11 a c e 11 f e d 9 f a d the the the the the the the the the the the the the the the the 10 f e e Trace of recursive calls for MSD string sort (no cutof for small subarrays, subarrays of size 0 and 1 omitted) sort key 11 f e d 33 34

Variable-length strings MSD string sort: Java implementation

Treat strings as if they had an extra char at end (smaller than any char). public static void sort(String[] a) {

why smaller? aux = new String[a.length]; 0 s e a -1 sort(a, aux, 0, a.length - 1, 0); recycles aux[] array but not count[] array 1 s e a s h e l l s -1 } 2 s e l l s -1

3 s h e -1 private static void sort(String[] a, String[] aux, int lo, int hi, int d) she before shells { 4 s h e -1 if (hi <= lo) return; 5 s h e l l s -1 int[] count = new int[R+2]; key-indexed counting 6 s h o r e -1 for (int i = lo; i <= hi; i++) 7 s u r e l y -1 count[charAt(a[i], d) + 2]++; for (int r = 0; r < R+1; r++) count[r+1] += count[r]; private static int charAt(String s, int d) for (int i = lo; i <= hi; i++) { aux[count[charAt(a[i], d) + 1]++] = a[i]; if (d < s.length()) return s.charAt(d); for (int i = lo; i <= hi; i++) else return -1; a[i] = aux[i - lo]; }

for (int r = 0; r < R; r++) sort R subarrays recursively sort(a, aux, lo + count[r], lo + count[r+1] - 1, d+1); '\0' C strings. Have extra char at end ⇒ no extra work needed. }

35 36 MSD string sort: potential for disastrous performance Cutoff to insertion sort

Observation 1. Much too slow for small subarrays. count[] Solution. Cutoff to insertion sort for small subarrays. ・Each function call needs its own count[] array. ・Insertion sort, but start at dth character. ・ASCII (256 counts): 100x slower than copy pass for N = 2. private static void sort(String[] a, int lo, int hi, int d) ・Unicode (65,536 counts): 32,000x slower for N = 2. { for (int i = lo; i <= hi; i++) for (int j = i; j > lo && less(a[j], a[j-1], d); j--) Observation 2. Huge number of small subarrays exch(a, j, j-1); because of recursion. }

・Implement less() so that it compares starting at dth character.

private static boolean less(String v, String w, int d) { for (int i = d; i < Math.min(v.length(), w.length()); i++) { if (v.charAt(i) < w.charAt(i)) return true; if (v.charAt(i) > w.charAt(i)) return false; a[] aux[] } 0 b 0 a return v.length() < w.length(); 1 a 1 b }

37 38

MSD string sort: performance Summary of the performance of sorting algorithms

Number of characters examined. Frequency of operations. ・MSD examines just enough characters to sort the keys. ・Number of characters examined depends on keys. algorithm guarantee random extra space stable? operations on keys Can be sublinear in input size! ・ insertion sort ½ N 2 ¼ N 2 1 ✔ compareTo()

Non-random compareTo() based sorts Random Worst case mergesort ✔ compareTo() with duplicates N lg N N lg N N (sublinear) (linear) can also be sublinear! (nearly linear) 1EIO402 are 1DNB377 quicksort 1.39 N lg N * 1.39 N lg N c lg N compareTo() 1HYL490 by 1DNB377 1ROZ572 sea 1DNB377 2HXE734 seashells 1DNB377 heapsort 2 N lg N 2 N lg N 1 compareTo() 2IYE230 seashells 1DNB377 2XOR846 sells 1DNB377 3CDB573 sells 1DNB377 LSD sort † 2 W (N + R) 2 W (N + R) N + R ✔ charAt() 3CVP720 she 1DNB377 3IGJ319 she 1DNB377 ‡ charAt() 3KNA382 shells 1DNB377 MSD sort 2 W (N + R) N log R N N + D R ✔ 3TAV879 shore 1DNB377 4CQP781 surely 1DNB377 * probabilistic 4QGI284 the 1DNB377 D = function-call stack depth † fixed-length W keys 4YHV229 the 1DNB377 (length of longest prefix match) ‡ average-length W keys Characters examined by MSD string sort

39 40 MSD string sort vs. quicksort for strings Engineering a radix sort (American flag sort)

Disadvantages of MSD string sort. Optimization 0. Cutoff to insertion sort. ・Extra space for aux[]. ・Extra space for count[]. Optimization 1. Replace recursion with explicit stack. ・Inner loop has a lot of instructions. ・Push subarrays to be sorted onto stack. ・Accesses memory "randomly" (cache inefficient). ・Now, one count[] array suffices.

Disadvantage of quicksort. Optimization 2. Do R-way partitioning in place. Engineering Radix Sort ・Linearithmic number of string compares (not linear). ・Eliminates aux[] array. Peter M. Mcllroy and Keith Bostic University of California at Berkeley; Has to rescan many characters in keys with long prefix matches. Sacrifices stability. and M. Douglas Mcllroy ・ ・ AT&T Bell Laboratories

ABSTRACT Radix sorting methods have excellent asymptotic performance on string data, for which com- parison is not a unit-time operation. Attractive for use in large byte-addressable memories, these methods have nevertheless long been eclipsed by more easily prograÍrmed algorithms. Three ways to sort strings by bytes left to right-a stable list sort, a stable two-array doesn't rescan tight inner loop, sort, and an in-place "American flag" sor¿-are illus- trated with practical C programs. For heavy-duty sort- characters cache friendly ing, all three perform comparably, usually running at American national flag problem Dutch national flag problem least twice as fast as a good quicksort. We recommend American flag sort for general use.

Goal. Combine advantages of MSD and quicksort.

41 42

3-way string quicksort (Bentley and Sedgewick, 1997) @ Computing Systems, Vol. 6 . No. 1 . Winter 1993

Overview. Do 3-way partitioning on the dth character. ・Less overhead than R-way partitioning in MSD string sort. ・Does not re-examine characters equal to the partitioning char. (but does re-examine characters not equal to the partitioning char)

5.1 STRING SORTS partitioning item she by sells are use first character to ‣ strings in Java partition into seashells seashells "less", "equal", and "greater" by she ‣ key-indexed counting subarrays the seashells recursively sort subarrays, ‣ LSD radix sort sea sea excluding first character Algorithms for middle subarray ‣ MSD radix sort shore shore the surely ‣ 3-way radix quicksort shells shells ROBERT SEDGEWICK | KEVIN WAYNE she she http://algs4.cs.princeton.edu ‣ suffix arrays sells sells are sells surely the seashells the

44 3-way string quicksort: trace of recursive calls 3-way string quicksort: Java implementation

partitioning item

private static void sort(String[] a) { sort(a, 0, a.length - 1, 0); } she by are are are sells are by by by private static void sort(String[] a, int lo, int hi, int d) seashells seashells seashells seashells seashells { by she she sells sea if (hi <= lo) return; 3-way partitioning the seashells seashells seashells seashells int lt = lo, gt = hi; (using dth character) sea sea sea sea sells int v = charAt(a[lo], d); shore shore shore sells sells int i = lo + 1; the surely surely shells shells while (i <= gt) to handle variable-length strings { shells shells shells she she int t = charAt(a[i], d); she she she surely surely if (t < v) exch(a, lt++, i++); sells sells sells shore shore else if (t > v) exch(a, i, gt--); are sells sells she she else i++; surely the the the the } seashells the the the the sort(a, lo, lt-1, d); if (v >= 0) sort(a, lt, gt, d+1); sort 3 subarrays recursively Trace of first few recursive calls for 3-way string quicksort (subarrays of size 1 not shown) sort(a, gt+1, hi, d); }

45 46

3-way string quicksort vs. standard quicksort 3-way string quicksort vs. MSD string sort

Standard quicksort. MSD string sort. ・Uses ~ 2 N ln N string compares on average. ・Is cache-inefficient. ・Costly for keys with long common prefixes (and this is a common case!) ・Too much memory storing count[]. ・Too much overhead reinitializing count[] and aux[]. 3-way string (radix) quicksort. ・Uses ~ 2 N ln N character compares on average for random strings. ・Avoids re-comparing long common prefixes. 3-way string quicksort. ・Is cache-friendly. ・Is in-place. Fast Algorithms for Sorting and Searching Strings ・Has a short inner loop.

Jon L. Bentley* Robert Sedgewick# library of Congress call numbers

Abstract that is competitive with the most efficient string sorting We present theoretical algorithms for sorting and programs known. The second program is a symbol table searching multikey data, and derive from them practical C implementation that is faster than hashing, which is com- implementations for applications in which keys are charac- monly regarded as the fastest symbol table implementa- ter strings. The sorting algorithm blends Quicksort and tion. The symbol table implementation is much more radix sort; it is competitive with the best known C sort space-efficient than multiway trees, and supports more codes. The searching algorithm blends tries and binary advanced searches. search trees; it is faster than hashing and other commonly In many application programs, sorts use a Quicksort used search methods. The basic ideas behind the algo- implementation based on an abstract compare operation, rithms date back at least to the 1960s but their practical and searches use hashing or binary search trees. These do Bottom line. 3-way string quicksort is method of choice for sorting strings. utility has been overlooked. We also present extensions to not take advantage of the properties of string keys, which more complex string problems, such as partial-match are widely used in practice. Our algorithms provide a nat- 47 48 searching. ural and elegant way to adapt classical algorithms to this important class of applications. 1. Introduction Section 6 turns to more difficult string-searching prob- Section 2 briefly reviews Hoare’s [9] Quicksort and lems. Partial-match queries allow “don’t care” characters binary search trees. We emphasize a well-known isomor- (the pattern “so.a”, for instance, matches soda and sofa). phism relating the two, and summarize other basic facts. The primary result in this section is a ternary search tree The multikey algorithms and data structures are pre- implementation of Rivest’s partial-match searching algo- sented in Section 3. Multikey Quicksort orders a set of II rithm, and experiments on its performance. “Near neigh- vectors with k components each. Like regular Quicksort, it bor” queries locate all words within a given Hamming dis- partitions its input into sets less than and greater than a tance of a query word (for instance, code is distance 2 given value; like radix sort, it moves on to the next field from soda). We give a new algorithm for near neighbor once the current input is known to be equal in the given searching in strings, present a simple C implementation, field. A node in a ternary search tree represents a subset of and describe experiments on its efficiency. vectors with a partitioning value and three pointers: one to Conclusions are offered in Section 7. lesser elements and one to greater elements (as in a binary search tree) and one to equal elements, which are then pro- 2. Background cessed on later fields (as in tries). Many of the structures Quicksort is a textbook divide-and-conquer algorithm. and analyses have appeared in previous work, but typically To sort an array, choose a partitioning element, permute as complex theoretical constructions, far removed from the elements such that lesser elements are on one side and practical applications. Our simple framework opens the greater elements are on the other, and then recursively sort door for later implementations. the two subarrays. But what happens to elements equal to The algorithms are analyzed in Section 4. Many of the the partitioning value? Hoare’s partitioning method is analyses are simple derivations of old results. binary: it places lesser elements on the left and greater ele- Section 5 describes efficient C programs derived from ments on the right, but equal elements may appear on either side. the algorithms. The first program is a sorting algorithm Algorithm designers have long recognized the desir- * Bell Labs, Lucent Technologies, 700 Mountam Avenue, Murray Hill. irbility and difficulty of a ternary partitioning method. NJ 07974; [email protected]. Sedgewick [22] observes on page 244: “Ideally, we would # Princeton University. Princeron. NJ. 08514: [email protected]. llke to get all [equal keys1 into position in the file, with all

360 Summary of the performance of sorting algorithms

Frequency of operations.

algorithm guarantee random extra space stable? operations on keys

insertion sort ½ N 2 ¼ N 2 1 ✔ compareTo() 5.1 STRING SORTS mergesort N lg N N lg N N ✔ compareTo() ‣ strings in Java quicksort 1.39 N lg N * 1.39 N lg N c lg N compareTo() ‣ key-indexed counting heapsort 2 N lg N 2 N lg N 1 compareTo() Algorithms ‣ LSD radix sort LSD sort † 2 W (N + R) 2 W (N + R) N + R ✔ charAt() ‣ MSD radix sort ‣ 3-way radix quicksort ‡ ROBERT SEDGEWICK | KEVIN WAYNE MSD sort 2 W (N + R) N log R N N + D R ✔ charAt() http://algs4.cs.princeton.edu ‣ suffix arrays 3-way string 1.39 W N lg R * 1.39 N lg N log N + W charAt() quicksort

* probabilistic † fixed-length W keys

‡ average-length W keys 49

Keyword-in-context search Keyword-in-context search

Given a text of N characters, preprocess it to enable fast substring search Given a text of N characters, preprocess it to enable fast substring search (find all occurrences of query string context). (find all occurrences of query string context).

% more tale.txt % java KWIC tale.txt 15 characters of surrounding context it was the best of times search it was the worst of times o st giless to search for contraband it was the age of wisdom her unavailing search for your fathe it was the age of foolishness le and gone in search of her husband it was the epoch of belief t provinces in search of impoverishe it was the epoch of incredulity dispersing in search of other carri it was the season of light n that bed and search the straw hold it was the season of darkness it was the spring of hope better thing it was the winter of despair t is a far far better thing that i do than ⋮ some sense of better things else forgotte was capable of better things mr carton ent

Applications. Linguistics, databases, web search, word processing, …. Applications. Linguistics, databases, web search, word processing, ….

51 52 Suffix sort Keyword-in-context search: suffix-sorting solution

input string i t w a s b e s t i t w a s w ・Preprocess: suffix sort the text. 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ・Query: binary search for query; scan until mismatch.

form sufxes sort sufxes to bring query strings together

0 i t w a s b e s t i t w a s w 3 a s b e s t KWIC search for "search" in Tale of Two Cities 1 12 t w a s b e s t i t w a s w a s w ⋮ 2 5 w a s b e s t i t w a s w b e s t i t w a s w 632698 s e a l e d _ m y _ l e t t e r _ a n d _ … 3 6 a s b e s t i t w a s w e s t i t w a s w 713727 s e a m s t r e s s _ i s _ l i f t e d _ … 4 0 s b e s t i t w a s w i t w a s b e s t i t w a s w 660598 s e a m s t r e s s _ o f _ t w e n t y _ … 5 9 b e s t i t w a s w i t w a s w 67610 s e a m s t r e s s _ w h o _ w a s _ w i … 6 4 e s t i t w a s w s b e s t i t w a s w 4430 s e a r c h _ f o r _ c o n t r a b a n d … 7 7 s t i t w a s w s t i t w a s w 42705 s e a r c h _ f o r _ y o u r _ f a t h e … 8 13 t i t w a s w s w 499797 s e a r c h _ o f _ h e r _ h u s b a n d … 9 8 i t w a s w t i t w a s w 182045 s e a r c h _ o f _ i m p o v e r i s h e … 10 1 t w a s w t w a s b e s t i t w a s w 143399 s e a r c h _ o f _ o t h e r _ c a r r i … 11 10 w a s w t w a s w 411801 s e a r c h _ t h e _ s t r a w _ h o l d … 12 14 a s w w 158410 s e a r e d _ m a r k i n g _ a b o u t _ … 13 2 s w w a s b e s t i t w a s w 691536 s e a s _ a n d _ m a d a m e _ d e f a r … 14 11 w w a s w 536569 s e a s e _ a _ t e r r i b l e _ p a s s … 484763 s e a s e _ t h a t _ h a d _ b r o u g h … array of suffix indices ⋮ in sorted order 53 54

War story The String data type: Java 7u5 implementation

Q. How to efficiently form (and sort) suffixes? public final class String implements Comparable { private char[] value; // characters String[] suffixes = new String[N]; private int offset; // index of first char in array for (int i = 0; i < N; i++) private int length; // length of string suffixes[i] = s.substring(i, N); private int hash; // cache of hashCode() Algorithms … FOURTH EDITION Arrays.sort(suffixes); ROBERT SEDGEWICK | KEVIN WAYNE

rd 3 printing String s = "Hello, World" length = 12

value[] H E L L O , W O R L D

0 1 2 3 4 5 6 7 8 9 10 11 input file characters Java 7u4 Java 7u5

amendments.txt 18 thousand 0.25 sec 2.0 sec offset = 0

aesop.txt 192 thousand 1.0 sec out of memory String t = s.substring(7, 12); length = 5

mobydick.txt 1.2 million 7.6 sec out of memory value[] H E L L O , W O R L D 0 1 2 3 4 5 6 7 8 9 10 11 chromosome11.txt 7.1 million 61 sec out of memory

55 offset = 7 56 The String data type: Java 7u6 implementation The String data type: performance

String data type (in Java). Sequence of characters (immutable). public final class String implements Comparable { Java 7u5. Immutable char[] array, offset, length, hash cache. private char[] value; // characters Java 7u6. Immutable char[] array, hash cache. private int hash; // cache of hashCode() …

operation Java 7u5 Java 7u6

length 1 1 String s = "Hello, World"

indexing 1 1 value[] H E L L O , W O R L D

0 1 2 3 4 5 6 7 8 9 10 11 substring extraction 1 N

concatenation M + N M + N

String t = s.substring(7, 12); immutable? ✔ ✔

value[] W O R L D memory 64 + 2N 56 + 2N

0 1 2 3 4

57 58

A Reddit exchange Suffix sort

Q. How to efficiently form (and sort) suffixes in Java 7u6? I'm the author of the substring() change. As has A. Define Suffix class ala Java 7u5 String. been suggested in the analysis here there were two motivations for the change • Reduce the size of String instances. Strings public class Suffix implements Comparable { are typically 20-40% of common apps footprint. bondolo private final String text; • Avoid memory leakage caused by retained private final int offset; substrings holding the entire character array. public Suffix(String s, int offset) { this.text = text; this.offset = offset; } Changing this function, in a bugfix release no public int length() { return text.length() - offset; } less, was totally irresponsible. It broke backwards public char charAt(int i) { return text.charAt(offset + i); } compatibility for numerous applications with errors public int compareTo(Suffix that) { /* see textbook */ } that didn't even produce a message, just freezing } and timeouts... All pain, no gain. Your work was

not just vain, it was thoroughly destructive, even cypherpunks beyond its immediate effect. text[] H E L L O , W O R L D 0 1 2 3 4 5 6 7 8 9 10 11

http://www.reddit.com/r/programming/comments/1qw73v/til_oracle_changed_the_internal_string offset 59 60 Suffix sort Lessons learned

Q. How to efficiently form (and sort) suffixes in Java 7u6? Lesson 1. Put performance guarantees in API. A. Define Suffix class ala Java 7u5 String. Lesson 2. If API has no performance guarantees, don't rely upon any!

Corollary. May want to avoid String data type for huge strings. String[] suffixes = new String[N]; for (int i = 0; i < N; i++) ・Are you sure charAt() and length() take constant time? suffixes[i] = new Suffix(s, i); ・If lots of calls to charAt(), overhead for function calls is large. Algorithms FOURTH EDITION If lots of small strings, memory overhead of String is large. Arrays.sort(suffixes); ・ ROBERT SEDGEWICK | KEVIN WAYNE

4th printing

Ex. Our optimized algorithm for suffix arrays is 5x faster and uses 32x less memory than our original solution in Java 7u5!

61 62

Suffix Arrays: theory Suffix Arrays: theory

Q. What is worst-case running time of our suffix arrays algorithm? Q. What is complexity of suffix arrays? ・Quadratic. ・Quadratic. ・Linearithmic. ・Linearithmic. Manber-Myers algorithm (see video) ・Linear. ✓ ・Linear. suffix trees (beyond our scope) ・None of the above. N2 log N ・Nobody knows.

Suffix arrays: A new method for on-line string searches sufxes Udi Manber1 0 a a a a a a a a a a Gene Myers2 LINEAR PATTERN MATCHING ALGORITHMS Department of Computer Science 1 a a a a a a a a a University of Arizona Peter Weiner Tucson, AZ 85721 2 a a a a a a a a The Rand Corporation, Santa Monica, California* 3 a a a a a a a May 1989 Revised August 1991 Abstract 4 a a a a a a In 1970, Knuth, Pratt, and Morris [1] showed how to do basic pattern matching Abstract 5 in linear time. Related problems, such as those discussed in [4], have pre- a a a a a viously been solved by efficient but sub-optimal algorithms. In this paper, we A new and conceptually simple data structure, called a suffix array, for on-line string searches is intro- introduce an interesting data structure called a bi-tree. A linear time algo- duced in this paper. Constructing and querying suffix arrays is reduced to a sort and search paradigm that 6 rithm "for obtaining a compacted version of a bi-tree associated with a given a a a a string is presented. With this construction as the basic tool, we indicate how employs novel algorithms. The main advantage of suffix arrays over suffix trees is that, in practice, they to solve several pattern matching problems, including some from [4], in linear use three to five times less space. From a complexity standpoint, suffix arrays permit on-line string 7 a a a time. searches of the type, ‘‘Is W a substring of A?’’ to be answered in time O(P + log N), where P is the length of W and N is the length of A, which is competitive with (and in some cases slightly better than) 8 a a I. Introduction Before giving a formal definition of a bi-tree, we re- suffix trees. The only drawback is that in those instances where the underlying alphabet is finite and small, view basic definitions and terminology concerning t-arysuffix trees can be constructed in O(N) time in the worst case, versus O(N log N) time for suffix arrays. In 1970, Knuth, Morris, and Pratt [1-2] showed how to trees. (See Knuth [7] for further details.) However, we give an augmented algorithm that, regardless of the alphabet size, constructs suffix arrays in 9 a A t-ary tpee T over [ = {al, ... ,a } is a set of match a given pattern into another given string in time t O(N) expected time, albeit with lesser space efficiency. We believe that suffix arrays will prove to be proportional to the sum of the lengths of the pattern nodes N which is either empty or consists of a poot, better in practice than suffix trees for many applications. and string. Their algorithm was derived from a result nO E N, and t ordered, disjoint t-arY trees. of Cook [3] that the 2-way deterministic pushdown lan- Clearly, every node n N is the root of some i E guages are recognizable on a random access machine in i time O(n). Since 1970, attention has been given to t-ary tree T which itself consists of n1 and t ordered,1. Introduction 63 iii 64 several related problems in pattern matching [4-6], but disjoint t-ary trees, say T , T , T • We call the l 2 t Finding all instances of a string W in a large text A is an important pattern matching problem. There are the algorithms developed in these investigations us- iii tree T a sub-tpee of T;also, .all sub-trees of T are many applications in which a fixed text is queried many times. In these cases, it is worthwhile to construct ually run in time which is slightly worse than linear, j j 1 for example O(n log n). It is of considerable interest considered to be sub-trees of T • It is natural to a data structure to allow fast queries. The Suffix tree is a data structure that admits efficient on-line string to either establish that there exists a non-linear associate with a tree Tasuccessor function searches. A suffix tree for a text A of length N over an alphabet can be built in O(N log | | ) time and lower bound on the run time of all algorithms which O(N) space [Wei73, McC76]. Suffix trees permit on-line string searches of the type, ‘‘Is W a substring of solve a given pattern matching problem, or to exhibit S: NX[ (N-{n }) U {NIL} an algorithm whose run time is of O(n). O A?’’ to be answered in O(P log | | ) time, where P is the length of W. We explicitly consider the In the following sections, we introduce an inter- defined for n E Nand a E L by esting data structure, called a bi-tree, and show how i j 1 Supported in part by an NSF Presidential Young Investigator Award (grant DCR-8451397), with matching funds from AT&T, and an efficient calculation of a bi-tree can be applied to i by an NSF grant CCR-9002351. n , the root of if is non-empty the linear-time (and linear-space) solution of several 2 Supported in part by the NIH (grant R01 LM04960-01) , and by an NSF grant CCR-9002351. pattern matching problems. s(ni'Oj) = {NIL if is empty. II. Strings, Trees, and Bi-Trees It is easily seen that this function completely deter- In this paper, both patterns and strings are finite mines a t-ary tree and we write T = (N, nO'S). length, fully specified sequences of symbols over a If n' = S(n,a), we say that nand n' are connected finite alphabet [ = {a ,a , ... ,a }. Such a pattern of f l 2 t by a bpanah from n to n which has a label of o. wet length m will be denoted as call n' a son of n, and n the father of n'. The degree of a node n is the number of sons of that node, that is, P = P (1) P (2) ... P (m ), the number of distinct a for which S(n,a) NIL. A node of degree 0 is a leaf of the tree. th where P(i), an element of [, is the i symbol in the It is useful to extend the domain of S from Nx[ th to (N U {NIL}) x [* (and extend the range to include sequence, and is said to be located in the i position. nO) by the inductive definition To represent the substring of characters which begins at position i of P and ends at position j, we write (Sl) S(NIL,w) NIL for all w E [* P (i:j). That is, when i j, P (i: j) = P (i) ... P (j ), and P(i:j) = A, the null string, for i > j. (S2) S(n,A) = n for all nEN Let [* denote the set of all finite length strings (S3) S(n,u.xJ) = S(S(n,w),a) for all n EN, w E L*, over [. strings WI and w in [* may be combined by 2 and a E L:. the operation of concatenation to form a new string Not every S: Nx[ (N-{n }) U {NIL} is the successor W = WI w2 . The reverse of a string P = A (1) ... A (m) O is the string pr = A (m) ... A (1 ). function of a t-ary tree. But a necessary and suffi- The length of a string or pattern, denoted by 19(w) cient condition for S to be a successor function of for W E [*, is the number of symbols in the sequence. some (unique, if it exists) t-ary tree can be expressed For example, 19(P(i:j» = j-i+l if i j and is 0 if in terms of the extended S. Namely, that there exists i > j. exactly one choice of w such that S(nO'w} n for every Informally, a bi-tree over [ can be thought of as n E N. there exists a T such that T = (N,nO'S), two related t-ary trees sharing a common node set. we say that S is We may also associate with T a father function *This work was partially supported by grants from F: N N defined by F(nO) = nO and for n' E N-{nO}' the Alfred P. Sloan Foundation and the Exxon Education

Foundation. P. Weiner was at Yale University when this F (n ') = n ¢) S (n , a) = n' for s orne a E [. work was done. Suffix Arrays: practice String sorting summary

Applications. Bioinformatics, information retrieval, data compression, … We can develop linear-time sorts. ・Key compares not necessary for string keys. Many ingenious algorithms. ・Use characters as index in an array. ・Memory footprint very important. ・State-of-the art still changing. We can develop sublinear-time sorts. ・Input size is amount of data in keys (not number of keys). Not all of the data has to be examined. year algorithm worst case memory ・

1990 Manber-Myers N log N 8 N 3-way string quicksort is asymptotically optimal. 1999 Larsson-Sadakane N log N 8 N ・1.39 N lg N chars for random data.

2003 Kärkkäinen-Sanders N 13 N Long strings are rarely random in practice. 2003 Ko-Aluru N 10 N ・Goal is often to learn the structure! 2008 divsufsort2 N log N 5 N May need specialized algorithms. good choices ・ (Yuta Mori) 2010 sais N 6 N

65 66