Tools for processing text
David Morgan
Tools of interest here
sort paste uniq join xxd comm tr fmt sed fold head file tail dd cut strings
sort
sorts lines by default
can delimit fields in lines (-t)
can sort by field(s) as key(s) (-k)
can sort fields of numerals numerically (-n)
Sort by fields as keys
default sort
sort on shell (7th, colon-delimited) field
UID as secondary (tie-breaker) field
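A minimal sketch of the three sorts named above. The passwd-style records and the helper name sample_passwd are made up for illustration; LC_ALL=C is set only to make the text comparison predictable.

```shell
# Hypothetical passwd-style records (invented for this example):
# name:password:UID:GID:comment:home:shell
sample_passwd() {
    printf '%s\n' \
        'carol:x:1002:1002::/home/carol:/bin/zsh' \
        'alice:x:1000:1000::/home/alice:/bin/bash' \
        'daemon:x:2:2::/sbin:/sbin/nologin' \
        'bob:x:1001:1001::/home/bob:/bin/bash'
}

# Default sort: whole lines, compared as text.
sample_passwd | LC_ALL=C sort

# Sort on the shell (7th colon-delimited) field only.
sample_passwd | LC_ALL=C sort -t: -k7,7

# Same, with the UID (3rd field) as secondary, tie-breaking key.
sample_passwd | LC_ALL=C sort -t: -k7,7 -k3,3
```

Writing -k7,7 rather than -k7 limits the key to field 7 alone; -k7 would mean "from field 7 to end of line".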
Do it numerically
versus
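A small sketch of the contrast, using four made-up numbers rather than the slide's data:

```shell
# Text comparison: "10" and "100" sort before "2" because "1" < "2".
printf '%s\n' 9 10 100 2 | LC_ALL=C sort       # 10 100 2 9

# Numeric comparison (-n).
printf '%s\n' 9 10 100 2 | LC_ALL=C sort -n    # 2 9 10 100
```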
How sort defines text’s “fields” by default
(٠ = a space character, ASCII decimal 32, hex 20)
٠foo٠bar: an 8-character string
“By default, fields are separated by the empty string between a non-blank character and a blank character.”
separator is the empty string between non-blank “o” and the space that follows it
so the string has these 2 fields, by default: field 1 = ٠foo, field 2 = ٠bar
How sort defines text’s “fields” by -t specification (not default)
(٠ = a space character, ASCII decimal 32, hex 20)
٠foo٠bar: an 8-character string
“`-t SEPARATOR' Use character SEPARATOR as the field separator... The field separator is not considered to be part of either the field preceding or the field following”
with `sort -t "٠"` separators are the blanks themselves, and fields are whatever they separate
so the string has these 3 fields: field 1 = empty (before the leading space), field 2 = foo, field 3 = bar
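The two field rules can be seen side by side in a sketch like the following. The two input lines are made up; each begins with a space, so the default rule and the -t rule number the fields differently.

```shell
# Default rule: field 1 is " x" / " y", field 2 is " b" / " a" (leading blank included).
# Sorting on field 2 puts the "a" line first.
printf ' x b\n y a\n' | LC_ALL=C sort -k2,2

# With -t ' ': field 1 is empty, field 2 is "x" / "y", field 3 is "b" / "a".
# Sorting on field 2 now puts the "x" line first.
printf ' x b\n y a\n' | LC_ALL=C sort -t ' ' -k2,2
```

The same -k2,2 therefore selects different text depending on whether -t is given.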
data
sort fields delimited by vertical bars
field ("1941:japan") versus sort field ("1941")
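One way to sort on part of a field is a key with character offsets. This is a sketch under assumptions: the records and the helper name events are invented, and the slide's actual command is not shown in the text.

```shell
# Hypothetical records: label|year:country
events() {
    printf '%s\n' \
        'item3|1941:japan' \
        'item1|1941:usa' \
        'item2|1939:poland'
}

# Key = the whole 2nd field, e.g. "1941:japan".
events | LC_ALL=C sort -t'|' -k2,2       # item2, item3, item1

# Key = characters 1 through 4 of the 2nd field only, e.g. "1941".
# The two 1941 records now tie on the key and fall back to whole-line order.
events | LC_ALL=C sort -t'|' -k2.1,2.4   # item2, item1, item3
```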
sort efficiency
bubble sort of n items: processing grows as n^2
shell sort: as n^(3/2)
heapsort/mergesort/quicksort: as n log n
technique matters
sort command highly evolved and optimized: better than you could do it yourself
Big-O: "bogdown propensity"; how much growth requires how much time
sort stability
stable if input order of key-equal records is preserved in output
unstable if not
sort is not stable by default
GNU sort has a --stable option (-s)
sort stability
2 outputs, from same input (all keys identical)
not stable
stable
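A sketch of the two outputs, assuming GNU sort and three made-up records whose key (field 1) is identical:

```shell
# Not stable: when keys tie, sort falls back to comparing whole lines,
# so the records are reordered.
printf 'k b\nk a\nk c\n' | LC_ALL=C sort -k1,1       # k a, k b, k c

# Stable: -s (--stable) disables that fallback, so ties keep input order.
printf 'k b\nk a\nk c\n' | LC_ALL=C sort -s -k1,1    # k b, k a, k c
```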
uniq
operates on sorted input
omits repeated lines
counts them (-c)
uniq
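A short sketch with a made-up word list. uniq only collapses adjacent repeats, which is why the input is sorted first.

```shell
# Omit repeated lines.
printf '%s\n' pear apple pear fig apple pear | LC_ALL=C sort | uniq
# apple
# fig
# pear

# Count them: each output line is prefixed with its number of occurrences.
printf '%s\n' pear apple pear fig apple pear | LC_ALL=C sort | uniq -c
```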
xxd
makes a hexdump of a file or of standard input
your friend for testing intermediate pipeline data
cf. “octal dump” older, more widespread: od -Ad -tx1z
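A sketch of using a dump to inspect pipeline data. The sample bytes are made up; the -z option of od is assumed to be the GNU version, and xxd is run only if it is installed.

```shell
# A line containing a tab and ending in CR LF: invisible on a terminal,
# obvious in a dump (09 = tab, 0d = carriage return, 0a = newline).
printf 'a\tb\r\n' | od -Ad -tx1z

# The same bytes through xxd, if available on this system.
if command -v xxd >/dev/null 2>&1; then
    printf 'a\tb\r\n' | xxd
fi
```

Inserting such a dump between two stages of a pipeline shows exactly which bytes the next stage will receive.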
tr and sed
both usable to replace text with other text
tr replaces individual characters
sed replaces whole strings
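A sketch of the difference on one made-up sentence:

```shell
# tr maps single characters: every "e" becomes "o", wherever it occurs.
echo 'see these trees' | tr 'e' 'o'              # soo thoso troos

# sed replaces the whole string "these" with "those".
echo 'see these trees' | sed 's/these/those/'    # see those trees
```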
tr
reads standard input (not a file)
for each input character
• maps it to an alternate character
• deletes it, or
• leaves it alone
tr
replace “these” with “those”
more of “these” than “those”
more of “those” than “these”
brackets unspecial, literal (usually)
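The four cases above might look like this. The sample strings are invented, and the behavior when the second set is shorter is that of GNU tr (POSIX leaves it unspecified).

```shell
# Replace "these" (first set) with "those" (second set), one for one.
echo 'abcde' | tr 'abc' 'xyz'        # xyzde

# More of "these" than "those": GNU tr repeats the last character of the second set.
echo 'abcde' | tr 'abcd' 'xy'        # xyyye

# More of "those" than "these": the extra "z" is never used.
echo 'abcde' | tr 'ab' 'xyz'         # xycde

# Brackets are literal characters here: "[" maps to "[" and "]" to "]".
echo '[abc]' | tr '[a-c]' '[A-C]'    # [ABC]
```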
tr
delete characters
control characters
replace tab with space, hyphen with tab
delete trailing newline
convert line termination from Microsoft to Unix style (i.e., delete carriage returns)
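A sketch of these uses, with made-up input strings. The escapes \t, \n and \r are tr's own notation for tab, newline and carriage return.

```shell
# Delete characters: every "l" and "o" is removed.
printf 'hello world\n' | tr -d 'lo'        # he wrd

# Replace tab with space, hyphen with tab (a trailing "-" in a set is literal).
printf 'a\tb-c\n' | tr '\t-' ' \t'

# Delete the trailing newline.
printf 'no newline\n' | tr -d '\n'; echo

# Convert CR LF line endings to LF by deleting carriage returns.
printf 'dos line\r\n' | tr -d '\r'
```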
tr
squeeze repeated characters
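A sketch of squeezing with -s, on made-up input:

```shell
# Squeeze runs of spaces down to a single space.
echo 'too    many     spaces' | tr -s ' '    # too many spaces

# Squeeze runs of any of the listed characters.
echo 'bookkeeper' | tr -s 'oke'              # bokeper
```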
make arbitrary length file of ascii
/dev/urandom serves as a bottomless source of binary characters
[A-Za-z0-9] means the set of all characters that are letters or numerals
tr is the translate command
-c means "the complement of"; in this case -c [A-Za-z0-9] means all characters that are not letters or numerals
-d means delete the specified characters from tr's input
tr's output will contain whatever characters remain, namely those that are letters or numerals (pure ASCII)
that goes to head as input
head -c takes the specified number of characters from the input
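The description above corresponds to a pipeline along these lines. This is a reconstruction under assumptions: the length 16 is arbitrary, the set is written without brackets so that "[" and "]" are not kept as literal members, and LC_ALL=C is added so tr treats the input as plain bytes.

```shell
# Keep only letters and numerals from the random byte stream, then take 16 of them.
LC_ALL=C tr -dc 'A-Za-z0-9' < /dev/urandom | head -c 16
echo    # finish the line, since the output has no trailing newline
```

Changing the number given to head -c changes the length of the result.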
binary character specification syntax per program
ascii – “man ascii ” or ascii charts abounding on the internet
sorting text blocks
imaginative example in Robbins ch 4
markup sort block as a unit
strategy, using tr/sed instead of awk
variation of Robbins sec 4.1.3, Sorting Text Blocks
avoids use of awk (confining the work to sed and tr)
1) replace all \n's with a first control character (^X, octal 030, hex 18): use tr (not sed, because it won't do \n's)
2) replace pairs of the first with a second one (^Y, octal 031, hex 19): use sed (not tr, because it doesn't do pairs/strings, only individual characters)
3) replace remaining firsts (those that were single) with a third (^Z, octal 032, hex 1A): use tr or sed
4) replace seconds (that's where there was a double \n) with \n: use tr or sed
5) sort (by lines; now each whole block is reduced to its own single line)
6) double space
7) replace thirds with \n (to turn lines back into the blocks from which they came)
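The steps above can be sketched as one pipeline. Assumptions: GNU sed and tr; the function name sort_blocks and the sample input are invented; the input consists of blocks that are each followed by one blank line (including the last block); and none of ^X, ^Y, ^Z occurs in the data.

```shell
# Control characters used as temporary markers (octal escapes via printf).
X=$(printf '\030')   # ^X: stands in for every newline
Y=$(printf '\031')   # ^Y: stands in for a double newline (block boundary)

sort_blocks() {
    tr '\n' '\030' |        # all newlines -> ^X
    sed "s/$X$X/$Y/g" |     # pairs of ^X -> ^Y
    tr '\030' '\032' |      # remaining single ^X -> ^Z
    tr '\031' '\n' |        # ^Y -> newline: one block per line
    LC_ALL=C sort |         # sort the one-line blocks
    sed G |                 # double space
    tr '\032' '\n'          # ^Z -> newline: lines back into blocks
}

# Hypothetical input: three two-line blocks, each followed by a blank line.
printf 'pear\nline 2\n\napple\nline 2\n\nfig\nline 2\n\n' | sort_blocks
```

The blocks come out in the order apple, fig, pear, each still followed by a blank line.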
head and tail
top five
bottom five
middle ten:
1) bottom of the top
2) top of the bottom
Or, could employ “process substitution” tail -10 <(head -30 states)
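A sketch of these selections, with seq 50 standing in for a 50-line file such as states (the stand-in is an assumption made so the example is self-contained):

```shell
seq 50 | head -n 5                  # top five: lines 1 to 5
seq 50 | tail -n 5                  # bottom five: lines 46 to 50

# Middle ten, two ways:
seq 50 | head -n 30 | tail -n 10    # bottom of the top: lines 21 to 30
seq 50 | tail -n 30 | head -n 10    # top of the bottom: lines 21 to 30
```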
cut
paste
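A brief sketch of cut and paste as a pair: cut takes columns out of lines, paste puts lines from several files side by side. The record and the file contents are made up.

```shell
# cut: pick fields 1 and 7 of a colon-delimited record.
echo 'alice:x:1000:1000::/home/alice:/bin/bash' | cut -d: -f1,7    # alice:/bin/bash

# cut: pick character positions 1 to 3.
echo 'abcdefgh' | cut -c1-3                                        # abc

# paste: join corresponding lines of two files, separated by a tab.
tmp=$(mktemp -d)
printf 'a\nb\n' > "$tmp/left"
printf '1\n2\n' > "$tmp/right"
paste "$tmp/left" "$tmp/right"
rm -r "$tmp"
```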
cut'ing with gawk
a special-purpose use of gawk, just for field cutting
ninth field, third field, fifth field
more closely than cut, gawk identifies fields as we intuitively do
uses "white space" (multiple characters) instead of single characters only to separate fields
gawk is a full-fledged, powerful text processing language
this is merely a particular convenient usage
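A sketch of the difference on a made-up line with runs of spaces. The AWK variable is an assumption added so the example also runs where only another awk is installed; with gawk present it uses gawk.

```shell
AWK=$(command -v gawk || command -v awk)   # prefer gawk, fall back to any awk
line='x   y     z'

# cut treats every single space as a separator, so field 2 is empty.
echo "$line" | cut -d' ' -f2

# awk treats each run of white space as one separator.
echo "$line" | "$AWK" '{print $2}'         # y
echo "$line" | "$AWK" '{print $3, $1}'     # z x
```

Printing, say, $9, $3 and $5 in the same way picks the ninth, third and fifth fields of each input line.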
comm
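A minimal sketch of comm, which compares two sorted files line by line. The file names and contents are made up.

```shell
tmp=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$tmp/a"
printf 'banana\ncherry\ndate\n' > "$tmp/b"

comm "$tmp/a" "$tmp/b"        # three columns: only in a, only in b, in both
comm -12 "$tmp/a" "$tmp/b"    # suppress columns 1 and 2: lines common to both
comm -23 "$tmp/a" "$tmp/b"    # suppress columns 2 and 3: lines only in a

rm -r "$tmp"
```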
fmt
10x10 …to width 33 Wrap… …to width 84
16x16
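A sketch of wrapping to the two widths mentioned for fmt. The sample sentence is made up, and exactly where fmt breaks the lines depends on the implementation.

```shell
text='the quick brown fox jumps over the lazy dog and then trots back home again before anyone notices'

echo "$text" | fmt -w 33    # no output line longer than 33 characters
echo "$text" | fmt -w 84    # no output line longer than 84 characters
```

Unlike fold, fmt breaks lines only at white space, so words stay whole.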
fold
fifty characters wide
twenty-five characters
two characters
one character
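A sketch of fold at small widths, on a made-up ten-character line; widths such as 50 or 25 work the same way on longer text.

```shell
echo 'abcdefghij' | fold -w 5    # abcde / fghij
echo 'abcdefghij' | fold -w 2    # ab / cd / ef / gh / ij
echo 'abcdefghij' | fold -w 1    # one character per line
```

fold breaks at exactly the given width, even in the middle of a word.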
fold – top 10 characters
puts characters in right form (1 per line)
like characters come all consecutive
tallyman, tally me banana
mr. popularity lineup, most-to-least
top ten
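The captions above suggest a pipeline like the following. This is a reconstruction, not the slide's own command; the function name top_chars and the sample word are assumptions.

```shell
top_chars() {
    fold -w 1 |          # one character per line
    LC_ALL=C sort |      # like characters become consecutive
    uniq -c |            # tally each run
    sort -rn |           # most-to-least
    head -n 10           # top ten
}

echo 'banana' | top_chars    # a: 3, n: 2, b: 1
```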
file – internal file format analyzer
dd – device-to-device (files are devices too!)
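A small sketch of dd copying from one "device" to another, here a pipe, /dev/zero and a temporary file. The sizes and offsets are arbitrary choices for illustration.

```shell
# Copy 5 one-byte blocks, skipping the first 6 bytes of the input.
printf 'hello world' | dd bs=1 skip=6 count=5 2>/dev/null; echo    # world

# Create a 4096-byte file of zero bytes from the /dev/zero device.
tmp=$(mktemp)
dd if=/dev/zero of="$tmp" bs=1024 count=4 2>/dev/null
wc -c < "$tmp"
rm "$tmp"
```

The 2>/dev/null hides dd's transfer statistics, which it writes to standard error.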
strings