Tools for processing text

David Morgan

Tools of interest here

     xxd    fmt  sed    file     strings

sort

• sorts lines by default
• can delimit fields in lines ( -t )
• can sort by field(s) as key(s) ( -k )
• can sort fields of numerals numerically ( -n )

Sort by fields as keys

default sort

sort on shell (7th, colon-delimited) field

UID as secondary (tie-breaker) field

Do it numerically

versus
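
The keyed sorts captioned above can be sketched as follows. This is a minimal sketch assuming GNU sort; the slide's own data is not reproduced, so `passwd_sample` and its four accounts are made up for illustration.

```shell
# Hypothetical sample in /etc/passwd format (not the slide's data).
passwd_sample() {
  cat <<'EOF'
root:x:0:0:root:/root:/bin/bash
alice:x:1000:1000::/home/alice:/bin/bash
bob:x:999:999::/home/bob:/bin/bash
daemon:x:2:2::/sbin:/usr/sbin/nologin
EOF
}

# Primary key: field 7 (shell). Tie-breaker: field 3 (UID), compared as text.
by_shell_then_uid_text() { LC_ALL=C sort -t: -k7,7 -k3,3; }

# Same keys, but the UID key is compared numerically (n suffix on that key).
by_shell_then_uid_num() { LC_ALL=C sort -t: -k7,7 -k3,3n; }

passwd_sample | by_shell_then_uid_text
echo
passwd_sample | by_shell_then_uid_num
```

Compared as text, UID 1000 sorts before 999 because "1" precedes "9"; with the n suffix, 999 comes first.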

How sort defines text's "fields" by default

٠foo٠bar  an 8-character string ( ٠ = a space character, ASCII 32 decimal, 20h )

"By default, fields are separated by the empty string between a non-blank character and a blank character."

٠foo٠bar  separator is the empty string between the non-blank "o" and the space that follows it

٠foo٠bar  and the string has these 2 fields, by default: field 1 is "٠foo", field 2 is "٠bar"

How sort defines text's "fields" by -t specification (not default)

٠foo٠bar  an 8-character string ( ٠ = a space character, ASCII 32 decimal, 20h )

"`-t SEPARATOR' Use character SEPARATOR as the field separator... The field separator is not considered to be part of either the field preceding or the field following"

٠foo٠bar  with `sort -t "٠"' separators are the blanks themselves, and fields are whatever they separate

٠foo٠bar  and the string has these 3 fields: field 1 is empty, field 2 is "foo", field 3 is "bar"
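
The two field definitions can be observed indirectly through the order sort produces. This is a small sketch assuming GNU sort in the C locale; the two-line input from `sample` is invented for the demonstration.

```shell
# Two lines whose second word is preceded by one blank vs. two blanks.
sample() { printf 'a b\na  z\n'; }

# Default fields: field 2 keeps its leading blanks, so "  z" sorts before " b".
default_fields() { LC_ALL=C sort -k2; }

# -b skips leading blanks when locating the key, so "b" sorts before "z".
ignore_blanks() { LC_ALL=C sort -b -k2; }

# -t ' ': every blank is a separator, so "a  z" has an empty field 2,
# and the empty key sorts before "b".
blank_separator() { LC_ALL=C sort -t ' ' -k2,2; }

sample | default_fields
sample | ignore_blanks
sample | blank_separator
```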

data

sort fields delimited by vertical bars

field ("1941:japan") versus sort field ("1941")
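
One way to make the sort key narrower than the field is a character offset inside -k. The sketch below assumes GNU sort; the `events` data is hypothetical and only mimics the "1941:japan" shape shown on the slide.

```shell
# Made-up records: label | year:country
events() { printf 'a|1941:japan\nb|1941:germany\nc|1939:poland\n'; }

# Key is the whole second field, e.g. "1941:japan".
by_field() { LC_ALL=C sort -t '|' -k2,2; }

# Key is only characters 1-4 of the second field, e.g. "1941".
# -s keeps input order among records whose keys are equal.
by_sort_field() { LC_ALL=C sort -s -t '|' -k2.1,2.4; }

events | by_field
events | by_sort_field
```

With the whole field as key, "1941:germany" precedes "1941:japan"; with the four-character key the two 1941 records compare equal and keep their input order.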

sort efficiency

• bubble sort of n items, processing grows as n^2
• shell sort as n^(3/2)
• heapsort/mergesort/quicksort as n log n
• technique matters
• sort highly evolved and optimized, better than you could do it yourself

Big-O: "bogdown propensity", i.e. how much more time a given growth in input requires
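
To get a feel for the growth rates named above, the following sketch prints n^2, n^(3/2) and n log2 n for a given n. It only evaluates the formulas with awk; it does not time any sorting algorithm, and the function name `growth` is made up.

```shell
# Print the three growth functions for one value of n.
growth() {
  awk -v n="$1" 'BEGIN {
    printf "n=%.0f n^2=%.0f n^1.5=%.0f n*log2(n)=%.0f\n",
           n, n * n, n ^ 1.5, n * log(n) / log(2)
  }'
}

growth 1024
growth 1048576
```

Going from about a thousand to about a million items multiplies n^2 by roughly a million, but n log2 n by only about two thousand.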

sort stability

• stable if input order of key-equal records preserved in output
• unstable if not
• sort is not stable by default
• GNU sort has --stable ( -s ) option

sort stability

2 outputs, from same input (all keys identical)

not stable

stable
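
The two outputs can be reproduced with a sketch like this one, assuming GNU sort. The three records from `records` are invented; all share the key "k" in field 1.

```shell
# Three records with identical keys (field 1), in a deliberate input order.
records() { printf 'k 3\nk 1\nk 2\n'; }

# Default: equal keys are resolved by a last-resort comparison of whole lines,
# so the input order is not preserved.
sort_default() { LC_ALL=C sort -k1,1; }

# --stable (-s): the last-resort comparison is disabled, input order survives.
sort_stable() { LC_ALL=C sort -s -k1,1; }

records | sort_default
records | sort_stable
```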

uniq

• operates on sorted input (only adjacent repeated lines are detected)
• omits repeated lines
• counts them ( -c )

uniq
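
A minimal sketch of why uniq wants sorted input, using an invented word list from `words`:

```shell
# Hypothetical input with repeats that are not adjacent.
words() { printf 'pear\napple\npear\nfig\napple\npear\n'; }

# Unsorted: no two equal lines are adjacent, so nothing is merged.
words | uniq -c

# Sorted first: equal lines become adjacent and are counted together.
count_words() { LC_ALL=C sort | uniq -c; }
words | count_words
```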

xxd

• make hexdump of file or input
• your friend when testing intermediate pipeline data

cf. od ("octal dump"), older and more widespread: od -Ad -tx1z
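
As a sketch of inspecting intermediate pipeline data, the example below dumps the output of a tr stage to confirm that a tab really became a space. It assumes GNU od (the z suffix is a GNU extension); xxd usually ships with vim and may be absent, so it is only called when found. The function names are made up.

```shell
# Hex dump with od, using the options quoted above (decimal offsets, hex bytes).
hexdump_od() { od -Ad -tx1z; }

# Same idea with xxd, if it is installed.
hexdump_xxd() { xxd; }

# Check a pipeline stage: did tr turn the tab (09) into a space (20)?
printf 'A\tB\n' | tr '\t' ' ' | hexdump_od

if command -v xxd >/dev/null 2>&1; then
  printf 'A\tB\n' | tr '\t' ' ' | hexdump_xxd
fi
```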

tr and sed

• both usable to replace text with other text
• tr replaces individual characters
• sed replaces whole strings

tr

• reads standard input (not a file)
• for each input character
  - maps it to an alternate character,
  - deletes it, or
  - leaves it alone

tr

replace “these” with “those”

more of “these” than “those”

more of “those” than “these”

brackets unspecial, literal (usually)
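
The slide's own commands are not reproduced in the text, so the following is only a sketch of the behaviours the captions name, assuming GNU tr and GNU sed. All input strings are invented.

```shell
# sed replaces a whole string; tr maps single characters.
echo 'these are these' | sed 's/these/those/g'   # those are those
echo 'these are these' | tr 'e' 'o'              # thoso aro thoso

# More characters in the first set than the second:
# GNU tr repeats the last character of the second set.
echo 'abcde' | tr 'abcde' 'xy'                   # xyyyy

# More characters in the second set than the first: the excess is ignored.
echo 'abcde' | tr 'ab' 'xyz'                     # xycde

# Brackets are ordinary characters here: "[" maps to "[" and "]" to "]".
echo 'a[b]' | tr '[a-z]' '[A-Z]'                 # A[B]
```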

tr

delete characters

control characters

replace tab with space, hyphen with tab

delete trailing newline

convert line termination from Microsoft to Unix style (i.e., delete carriage returns)
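
The operations captioned above might look like this. It is a sketch assuming GNU tr; the function names and sample strings are made up.

```shell
# Delete carriage returns: Microsoft (CR LF) to Unix (LF) line endings.
dos2unix_tr() { tr -d '\r'; }

# Replace tab with space and hyphen with tab
# (a hyphen at the end of the set is taken literally).
tab_to_space_hyphen_to_tab() { tr '\t-' ' \t'; }

# Delete newlines, including the trailing one.
strip_newlines() { tr -d '\n'; }

printf 'a\r\nb\r\n' | dos2unix_tr | od -An -tx1
printf 'a\tb-c\n' | tab_to_space_hyphen_to_tab | od -An -tx1
printf 'one line\n' | strip_newlines | od -An -tx1
```

Piping each result through od makes the otherwise invisible characters checkable.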

tr

squeeze repeated characters
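
A short sketch of squeezing with tr -s, on an invented input: each run of spaces or of newlines collapses to a single character.

```shell
# Squeeze runs of spaces and runs of newlines down to one of each.
squeeze_blanks() { tr -s ' \n'; }

printf 'a    b\n\n\nc\n' | squeeze_blanks
```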

make arbitrary length file of ascii

• /dev/urandom serves as a bottomless source of binary characters
• tr is the translate command
• [A-Za-z0-9] means the set of all characters that are letters or numerals
• -c means "the complement of"; in this case -c [A-Za-z0-9] means all characters that are not letters or numerals
• -d means delete the specified characters from tr's input
• tr's output will contain whatever characters remain, namely those that are letters or numerals (pure ascii)
• that goes to head as input
• head -c takes the specified number of characters from the input
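
Put together, the description above corresponds to a pipeline along these lines. The exact command on the slide is not in the text, so this is a reconstruction; `random_ascii` is a made-up name, and LC_ALL=C is an added assumption so that tr treats the input as single bytes.

```shell
# Emit $1 random letters and numerals.
random_ascii() {
  LC_ALL=C tr -dc 'A-Za-z0-9' < /dev/urandom | head -c "$1"
}

random_ascii 64
echo
```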

binary character specification syntax per program

ascii – "man ascii" or ascii charts abounding on the internet

sorting text blocks

• imaginative example in Robbins ch 4

markup sort block as a unit

strategy, using tr/sed instead of awk

Variation of Robbins sec 4.1.3, Sorting Text Blocks; avoids use of awk (confining the work to sed and tr)

1) replace all \n's with a first control character (^X, octal 030, hex 18); use tr (not sed, because it won't do \n's)

2) replace pairs of the first with a second one (^Y, octal 031, hex 19); use sed (not tr, because it doesn't do pairs/strings, only individual characters)

3) replace remaining firsts (those that were single) with a third (^Z, octal 032, hex 1A); use tr or sed

4) replace seconds (that's where there was a double \n) with \n; use tr or sed

5) sort (by lines; now each whole block is reduced to its own single line)

6) double space

7) replace thirds with \n (to turn lines back into the blocks from which they came)
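
The strategy can be sketched as a single pipeline. This is a sketch rather than the slide's own script: it assumes blocks are separated by exactly one blank line, that the data contains none of the three control characters, and it adds one step not in the list (dropping the ^X left by the file's final newline). The name `sort_blocks` and the sample input are made up.

```shell
# Control characters used as temporary stand-ins (assumed absent from the data).
X=$(printf '\030')   # ^X: stands for every newline
Y=$(printf '\031')   # ^Y: stands for a pair of newlines, i.e. a block boundary

sort_blocks() {
  tr '\n' '\030' |               # every \n becomes ^X
  sed "s/$X$X/$Y/g; s/$X\$//" |  # ^X^X becomes ^Y; drop the final ^X (added step)
  tr '\030' '\032' |             # remaining single ^X becomes ^Z
  tr '\031' '\n' |               # ^Y becomes \n: one block per line
  LC_ALL=C sort |                # sort the one-line blocks
  sed G |                        # double space
  tr '\032' '\n'                 # ^Z becomes \n: lines back into blocks
}

# Two blocks, out of order, separated by one blank line.
printf 'b2\nb1\n\na2\na1\n' | sort_blocks
```

The lines inside each block keep their original order; only the blocks as wholes are reordered.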

head and tail

top five

bottom five

middle ten:

1) bottom of the top

2) top of the bottom

Or, could employ "process substitution": tail -n 10 <(head -n 30 states)
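
A sketch of the slices named above, using seq 1 50 as a stand-in for the 50-line states file (which is not reproduced here). The function names are made up.

```shell
top_five()    { head -n 5; }
bottom_five() { tail -n 5; }

# Middle ten of fifty, two ways.
bottom_of_the_top() { head -n 30 | tail -n 10; }   # last 10 of the first 30
top_of_the_bottom() { tail -n 30 | head -n 10; }   # first 10 of the last 30

seq 1 50 | top_five
seq 1 50 | bottom_five
seq 1 50 | bottom_of_the_top
seq 1 50 | top_of_the_bottom
```

For a 50-line input both middle-ten variants select lines 21 through 30.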

cut

paste

cut'ing with gawk

• a special-purpose use of gawk, just for field cutting

ninth field, third field, fifth field

• more closely than cut, gawk identifies fields as we intuitively do
• uses "white space" (multiple characters) instead of single characters only to separate fields
• gawk is a full-fledged, powerful text processing language; this is merely a particular convenient usage
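
The difference in field handling can be seen on a line with irregular spacing. The sketch calls awk, which behaves the same as gawk for this usage; the input line from `row` is invented.

```shell
# Words separated by several spaces and a tab.
row() { printf 'alpha   beta\tgamma  delta\n'; }

# awk: any run of white space separates fields.
row | awk '{ print $3, $1 }'      # gamma alpha

# cut: every single space is a separator, so field 2 is empty here.
row | cut -d ' ' -f 2
```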

comm

fmt

10x10 …to width 33 Wrap… …to width 84

16x16

fold

fifty characters wide

twenty-five characters

two characters

one character

fold – top 10 characters

puts characters in right form (1 per line)

like characters come all consecutive

mr. tallyman, tally me banana

popularity lineup, most-to-least

top ten
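
The annotations suggest a five-stage pipeline. The slide's actual command is not in the text, so the following is a plausible reconstruction, assuming GNU coreutils and single-byte characters; `char_tally` is a made-up name.

```shell
char_tally() {
  fold -w 1 |            # one character per line
  LC_ALL=C sort |        # like characters become consecutive
  uniq -c |              # tally each run
  LC_ALL=C sort -nr |    # most frequent first
  head -n 10             # top ten
}

printf 'banana\n' | char_tally
```

For the input "banana" the tally is 3 for a, 2 for n and 1 for b.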

file – internal file format analyzer

dd – device-to-device (files are devices too!)

strings