String Sorts 5.1 String Sorts

Algorithms ROBERT SEDGEWICK | KEVIN WAYNE 5.1 STRING SORTS 5.1 STRING SORTS ‣ strings in Java ‣ strings in Java ‣ key-indexed counting ‣ key-indexed counting Algorithms ‣ LSD radix sort Algorithms ‣ LSD radix sort FOURTH EDITION ‣ MSD radix sort ‣ MSD radix sort ‣ 3-way radix quicksort ‣ 3-way radix quicksort ROBERT SEDGEWICK | KEVIN WAYNE ROBERT SEDGEWICK | KEVIN WAYNE http://algs4.cs.princeton.edu ‣ suffix arrays http://algs4.cs.princeton.edu ‣ suffix arrays String processing The char data type String. Sequence of characters. C char data type. Typically an 8-bit integer. Supports 7-bit ASCII. ・ 6.5 Data Compression 667 Important fundamental abstraction. ・Can represent at most 256 characters. ・Genomic sequences. ・Information processing. ASCII encoding. When you HexDump a bit- 0 1 2 3 4 5 6 7 8 9 A B C D E F stream that contains ASCII-encoded charac- NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI Communication systems (e.g., email). 0 ・ ters, the table at right is useful for reference. 1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US ・Programming systems (e.g., Java programs). Given a 2-digit hex number, use the first hex 2 SP ! “ # $ % & ‘ ( ) * + , - . / digit as a row index and the second hex digit … 3 0 1 2 3 4 5 6 7 8 9 : ; < = > ? ・ as a column reference to find the character 4 @ A B C D E F G H I J K L M N O that it encodes. For example, 31 encodes the 5 P Q R S T U V W X Y Z [ \ ] ^ _ 1 4A J U+0041 U+00E1 U+2202 U+1D50A digit , encodes the letter , and so forth. 6 ` a b c d e f g h i j k l m n o This table is for 7-bit ASCII, so the first hex 7 p q r s t u v w x y z { | } ~ DEL digit must be 7 or less. Hex numbers starting some Unicode characters Hexadecimal to ASCII conversion table “ The digital information that underlies biochemistry, cell with 0 and 1 (and the numbers 20 and 7F) biology, and development can be represented by a simple correspond to non-printing control characters. Many of the control characters are left over from the days when physical devices string of G's, A's, T's and C's. This string is the root data like0 typewriters were controlled by ASCII input; the table highlights a few that you Java char data type. A 16-bit unsigned integer. structure of an organism's biology. ” — M. V. Olson might see in dumps. For example SP is the space character, NUL is the null character, LF is line-feed, and CR is carriage-return. ・Supports original 16-bit Unicode. Supports 21-bit Unicode 3.0 (awkwardly). In summary, working with data compression・ requires us to reorient our thinking about standard input and standard3 output to include binary encoding of data. BinaryStdIn 4 and BinaryStdOut provide the methods that we need. They provide a way for you to make a clear distinction in your client programs between writing out information in- tended for file storage and data transmission (that will be read by programs) and printing information (that is likely to be read by humans). I ♥ ︎ Unicode The String data type String data type in Java. Immutable sequence of characters. Length. Number of characters. Indexing. Get the ith character. Concatenation. Concatenate one string to the end of another. s.length() 0 1 2 3 4 5 6 7 8 9 10 11 12 s A T T A C K A T D A W N U+0041 s.charAt(3) s.substring(7, 11) Fundamental constant-time String operations 5 6 The String data type: immutability The String data type: representation Q. Why immutable? Representation (Java 7). Immutable char[] array + cache of hash. A. All the usual reasons. ・Can use as keys in symbol table. ・Don't need to defensively copy. operation Java running time ・Ensures consistent state. length s.length() 1 ・Supports concurrency. indexing s.charAt(i) 1 Improves security. public class FileInputStream ・ { private String filename; concatenation s + t M + N public FileInputStream(String filename) { ⋮ ⋮ if (!allowedToReadFile(filename)) throw new SecurityException(); this.filename = filename; } ... } attacker could bypass security if string type were mutable 7 8 String performance trap Comparing two strings Q. How to build a long string, one character at a time? Q. How many character compares to compare two strings of length W ? public static String reverse(String s) p r e f e t c h { String rev = ""; 0 1 2 3 4 5 6 7 for (int i = s.length() - 1; i >= 0; i--) p r e f i x e s rev += s.charAt(i); quadratic time return rev; } Running time. Proportional to length of longest common prefix. ・Proportional to W in the worst case. A. Use StringBuilder data type (mutable char[] array). ・But, often sublinear in W. public static String reverse(String s) { StringBuilder rev = new StringBuilder(); for (int i = s.length() - 1; i >= 0; i--) rev.append(s.charAt(i)); linear time return rev.toString(); } 9 10 Alphabets Digital key. Sequence of digits over fixed alphabet. 604 CHAPTER 6 ! Strings Radix. Number of digits R in alphabet. name R() lgR() characters BINARY 2 1 01 5.1 STRING SORTS OCTAL 8 3 01234567 DECIMAL 10 4 0123456789 ‣ strings in Java HEXADECIMAL 16 4 0123456789ABCDEF DNA 4 2 ACTG ‣ key-indexed counting LOWERCASE 26 5 abcdefghijklmnopqrstuvwxyz Algorithms ‣ LSD radix sort UPPERCASE 26 5 ABCDEFGHIJKLMNOPQRSTUVWXYZ ‣ MSD radix sort PROTEIN 20 5 ACDEFGHIKLMNPQRSTVWY ABCDEFGHIJKLMNOPQRSTUVWXYZabcdef ‣ 3-way radix quicksort BASE64 64 6 ROBERT SEDGEWICK | KEVIN WAYNE ghijklmnopqrstuvwxyz0123456789+/ http://algs4.cs.princeton.edu ‣ suffix arrays ASCII 128 7 ASCII characters EXTENDED_ASCII 256 8 extended ASCII characters UNICODE16 65536 16 Unicode characters Standard alphabets 11 holds the frequencies in Count is an example of a character-indexed array. With a Java String, we have to use an array of size 256; with Alphabet, we just need an array with one entry for each alphabet character. This savings might seem modest, but, as you will see, our algorithms can produce huge numbers of such arrays, and the space for arrays of size 256 can be prohibitive. Numbers. As you can see from our several of the standard Alphabet examples, we often represent numbers as strings. The methods toIndices() coverts any String over a given Alphabet into a base-R number represented as an int[] array with all values between 0 and RϪ1. In some situations, doing this conversion at the start leads to compact code, because any digit can be used as an index in a character-indexed array. For example, if we know that the input consists only of characters from the alphabet, we could replace the inner loop in Count with the more compact code int[] a = alpha.toIndices(s); for (int i = 0; i < N; i++) count[a[i]]++; Review: summary of the performance of sorting algorithms Key-indexed counting: assumptions about keys Frequency of operations. Assumption. Keys are integers between 0 and R - 1. Implication. Can use key as an array index. input sorted result algorithm guarantee random extra space stable? operations on keys name section (by section) Anderson 2 Harris 1 Applications. Brown 3 Martin 1 2 2 ✔ compareTo() insertion sort ½ N ¼ N 1 Davis 3 Moore 1 ・Sort string by first letter. Garcia 4 Anderson 2 Harris 1 Martinez 2 mergesort ✔ compareTo() Sort class roster by section. N lg N N lg N N ・ Jackson 3 Miller 2 Sort phone numbers by area code. Johnson 4 Robinson 2 ・ Jones 3 White 2 quicksort 1.39 N lg N * 1.39 N lg N c lg N compareTo() ・Subroutine in a sorting algorithm. [stay tuned] Martin 1 Brown 3 Martinez 2 Davis 3 heapsort 2 N lg N 2 N lg N 1 compareTo() Miller 2 Jackson 3 Moore 1 Jones 3 Remark. Keys may have associated data ⇒ Robinson 2 Taylor 3 * probabilistic can't just count up number of keys of each value. Smith 4 Williams 3 Taylor 3 Garcia 4 Thomas 4 Johnson 4 Thompson 4 Smith 4 Lower bound. ~ N lg N compares required by any compare-based algorithm. White 2 Thomas 4 Williams 3 Thompson 4 Wilson 4 Wilson 4 Q. Can we do better (despite the lower bound)? keys are use array accesses small integers A. Yes, if we don't depend on key compares. to make R-way decisions Typical candidate for key-indexed counting (instead of binary decisions) 13 14 Key-indexed counting demo Key-indexed counting demo Goal. Sort an array a[] of N integers between 0 and R - 1. Goal. Sort an array a[] of N integers between 0 and R - 1. ・Count frequencies of each letter using key as index. R = 6 ・Count frequencies of each letter using key as index. ・Compute frequency cumulates which specify destinations. ・Compute frequency cumulates which specify destinations. ・Access cumulates using key as index to move items. ・Access cumulates using key as index to move items. Copy back into original array. Copy back into original array. offset by 1 ・ i a[i] ・ i a[i] [stay tuned] 0 d 0 d int N = a.length; int N = a.length; int[] count = new int[R+1]; 1 a int[] count = new int[R+1]; 1 a 2 c use a for 0 2 c r count[r] b for 1 for (int i = 0; i < N; i++) 3 f for (int i = 0; i < N; i++) 3 f a 0 c for 2 count frequencies count[a[i]+1]++; 4 f d for 3 count[a[i]+1]++; 4 f b 2 5 b e for 4 5 b c 3 f for 5 for (int r = 0; r < R; r++) 6 d for (int r = 0; r < R; r++) 6 d d 1 count[r+1] += count[r]; 7 b count[r+1] += count[r]; 7 b e 2 8 f 8 f f 1 for (int i = 0; i < N; i++) for (int i = 0; i < N; i++) 9 b 9 b - 3 aux[count[a[i]]++] = a[i]; aux[count[a[i]]++] = a[i]; 10 e 10 e 11 a 11 a for (int i = 0; i < N; i++) for (int i = 0; i < N; i++) a[i] = aux[i]; a[i] = aux[i]; 15 16 Key-indexed counting demo Key-indexed counting demo Goal.

Load more