Text Processing Tools
Tools for processing text
David Morgan

Tools of interest here
  sort    paste
  uniq    join
  xxd     comm
  tr      fmt
  sed     fold
  head    file
  tail    dd
  cut     strings

sort
  sorts lines by default
  can delimit fields in lines (-t)
  can sort by field(s) as key(s) (-k)
  can sort fields of numerals numerically (-n)

Sort by fields as keys
  default sort
  sort on shell (7th :-delimited) field
  UID as secondary (tie-breaker) field

Do it numerically
  numeric sort (-n) versus the default sort

How sort defines text’s “fields” by default
  (٠ = a space character, ASCII 32 decimal, 20h)
  foo٠٠bar   an 8-character string
  “By default, fields are separated by the empty string between a non-blank character and a blank character.”
  the separator is the empty string between the non-blank “o” and the space
  and the string has these 2 fields, by default:
    1: foo
    2: ٠٠bar

How sort defines text’s “fields” by -t specification (not default)
  (٠ = a space character, ASCII 32 decimal, 20h)
  foo٠٠bar   an 8-character string
  “`-t SEPARATOR' Use character SEPARATOR as the field separator... The field separator is not considered to be part of either the field preceding or the field following”
  with sort -t ' ' the separators are the blanks themselves, and fields are whatever they separate
  and the string has these 3 fields:
    1: foo
    2: (empty)
    3: bar

data field versus sort field
  fields delimited by vertical bars
  data field ("1941:japan") versus sort field ("1941")

sort efficiency
  bubble sort of n items: processing grows as n^2
  shell sort: as n^(3/2)
  heapsort/mergesort/quicksort: as n log n
  technique matters
  the sort command is highly evolved and optimized – better than you could do it yourself
  Big-O: "bogdown propensity", how much growth requires how much time

sort stability
  stable if the input order of key-equal records is preserved in the output
  unstable if not
  sort is not stable by default
  GNU sort has the --stable (-s) option

sort stability
  2 outputs from the same input (all keys identical): not stable, stable

uniq
  operates on sorted input
  omits repeated lines
  counts them (-c)

xxd
  make hexdump of file or input
  your friend for testing
  intermediate pipeline data
  cf. “octal dump”, older and more widespread: od -Ad -tx1z

tr and sed
  both usable to replace text with other text
  tr replaces individual characters
  sed replaces whole strings

tr
  reads standard input (not a file)
  for each input character:
    • maps it to an alternate character,
    • deletes it, or
    • leaves it alone

tr replace
  “these” with “those”
  more of “these” than “those”
  more of “those” than “these”
  brackets unspecial, literal (usually)

tr
  delete characters
  control characters
    replace tab with space, hyphen with tab
    delete trailing newline
    convert line termination from Microsoft to Unix style (i.e., delete carriage returns)

tr
  squeeze repeated characters

make arbitrary-length file of ASCII
  /dev/urandom serves as a bottomless source of binary characters
  [A-Za-z0-9] means the set of all characters that are letters or numerals
  tr is the translate command
  -c means "the complement of"; in this case -c [A-Za-z0-9] means all characters that are not letters or numerals
  -d means delete the specified characters from tr's input
  tr's output will contain whatever characters remain, namely those that are letters or numerals (pure ASCII)
  that goes to head as input
  head -c takes the specified number of characters from the input

binary character specification
  syntax per program
  ASCII – “man ascii” or ASCII charts abounding on the internet

sorting text blocks
  imaginative example in Robbins ch. 4
  markup
  sort block as a unit

strategy, using tr/sed instead of awk
  Variation of Robbins sec. 4.1.3, Sorting Text Blocks
  avoids use of awk (confining the work to sed and tr)
  replace all \n's with a first control character (^X, octal 030, hex 18)
    use tr (not sed, because it won't do \n's)
  replace pairs of the first with a second one (^Y, octal 031, hex 19)
    use sed (not tr, because it doesn't do pairs/strings, only individual characters)
  replace remaining firsts (those that were single) with a third (^Z, octal 032, hex 1A)
    use tr or sed
  replace seconds (that's where there was a double \n) with \n
    use tr or sed
  sort (by lines; now each whole block is reduced to its own single line)
  double space
  replace thirds with \n (to turn lines back into the blocks from which they came)

head and tail
  top five
  bottom five
  middle ten: 1) bottom of the top, 2) top of the bottom
  Or, could employ “process substitution”: tail -10 <(head -30 states)

cut

paste

cut'ing with gawk
  a special-purpose use of gawk, just for field cutting
  ninth field, third field, fifth field
  more closely than cut, gawk identifies fields as we intuitively do
  uses "white space" (multiple characters) to separate fields, instead of single characters only
  gawk is a full-fledged, powerful text processing language
  this is merely a particular convenient usage

comm

fmt
  10x10 …to width 33
  Wrap…
  16x16 …to width 84

fold
  fifty characters wide
  twenty-five characters
  two characters
  one character

fold – top 10 characters
  puts characters in right form (1 per line)
  like characters come all consecutive
  mr. tallyman, tally me banana
  popularity lineup, most-to-least
  top ten

file – internal file format analyzer

dd – device-to-device (files are devices too!)

strings
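
The five stages annotated under "fold – top 10 characters" can be sketched as a single pipeline. The slide's actual command line is not reproduced in this text, so the commands and options below are an assumption about one reasonable way to realize each stage, and the function name top_chars is hypothetical. It also assumes single-byte (ASCII) input.

```shell
#!/bin/sh
# Sketch only: one plausible command per annotated stage (assumed, not taken from the slide).
# top_chars is a hypothetical helper name; the optional argument is how many rows to keep.
top_chars() {
    fold -w 1 |          # puts characters in right form (1 per line)
    sort |               # like characters come all consecutive
    uniq -c |            # tally: prefix each distinct character with its count
    sort -rn |           # popularity lineup, most-to-least
    head -n "${1:-10}"   # top ten by default
}

# Example: tally the characters of a short string and keep the top three.
printf 'banana\n' | top_chars 3
```

Each output row is a count followed by the character it counts. The width of the count column differs between uniq implementations, so scripts that consume the output are safer splitting on whitespace than on fixed columns.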