Programming Pearls: a Literate Program
by Jon Bentley, with Special Guest Oysters Don Knuth and Doug McIlroy

Communications of the ACM, June 1986, Volume 29, Number 6. © 1986 ACM 0001-0782/86/0600-0471 75¢

Last month’s column introduced Don Knuth’s style of “Literate Programming” and his WEB system for building programs that are works of literature. This column presents a literate program by Knuth (its origins are sketched in last month’s column) and, as befits literature, a review. So without further ado, here is Knuth’s program, retypeset in Communications style. -Jon Bentley

Common Words                    Section
Introduction                          1
Strategic considerations              8
Basic input routines                  9
Dictionary lookup                    17
The frequency counts                 32
Sorting a trie                       36
The endgame                          41
Index                                42

1. Introduction. The purpose of this program is to solve the following problem posed by Jon Bentley:

    Given a text file and an integer k, print the k most common words in the file (and the number of their occurrences) in decreasing frequency.

Jon intentionally left the problem somewhat vague, but he stated that “a user should be able to find the 100 most frequent words in a twenty-page technical paper (roughly a 50K byte file) without undue emotional trauma.”

Let us agree that a word is a sequence of one or more contiguous letters; “Bentley” is a word, but “ain’t” isn’t. The sequence of letters should be maximal, in the sense that it cannot be lengthened without including a nonletter. Uppercase letters are considered equivalent to their lowercase counterparts, so that the words “Bentley” and “BENTLEY” and “bentley” are essentially identical.

The given problem still isn’t well defined, for the file might contain more than k words, all of the same frequency; or there might not even be as many as k words. Let’s be more precise: The most common words are to be printed in order of decreasing frequency, with words of equal frequency listed in alphabetic order. Printing should stop after k words have been output, if more than k words are present.

2. The input file is assumed to contain the given text. If it begins with a positive decimal number (preceded by optional blanks), that number will be the value of k; otherwise we shall assume that k = 100. Answers will be sent to the output file.

    define default_k = 100 {use this value if k isn’t otherwise specified}

3. Besides solving the given problem, this program is supposed to be an example of the WEB system, for people who know some Pascal but who have never seen WEB before. Here is an outline of the program to be constructed:

    program common_words(input, output);
      type ⟨Type declarations 17⟩
      var ⟨Global variables 4⟩
      ⟨Procedures for initialization 5⟩
      ⟨Procedures for input and output 9⟩
      ⟨Procedures for data manipulation 20⟩
      begin ⟨The main program 8⟩;
      end.

4. The main idea of the WEB approach is to let the program grow in natural stages, with its parts presented in roughly the order that they might have been written by a programmer who isn’t especially clairvoyant.

For example, each global variable will be introduced when we first know that it is necessary or desirable; the WEB system will take care of collecting these declarations into the proper place. We already know about one global variable, namely the number that Bentley called k. Let us give it the more descriptive name max_words_to_print.

    ⟨Global variables 4⟩ ≡
      max_words_to_print: integer; {at most this many words will be printed}
    See also sections 11, 13, 18, 22, 32, and 36.
    This code is used in section 3.

5. As we introduce new global variables, we’ll often want to give them certain starting values. This will be done by the initialize procedure, whose body will consist of various pieces of code to be specified when we think of particular kinds of initialization.

    ⟨Procedures for initialization 5⟩ ≡
      procedure initialize;
        var i: integer; {all-purpose index for initializations}
        begin ⟨Set initial values 12⟩
        end;
    This code is used in section 3.

6. The WEB system, which may be thought of as a preprocessor for Pascal, includes a macro definition facility so that portable programs are easier to write. For example, we have already defined ‘default_k’ to be 100. Here are two more examples of WEB macros; they allow us to write, e.g., ‘incr(count[p])’ as a convenient abbreviation for the statement ‘count[p] ← count[p] + 1’.

    define incr(#) ≡ # ← # + 1 {increment a variable}
    define decr(#) ≡ # ← # − 1 {decrement a variable}

7. Some of the procedures we shall be writing come to abrupt conclusions; hence it will be convenient to introduce a ‘return’ macro for the operation of jumping to the end of the procedure. A symbolic label ‘exit’ will be declared in all such procedures, and ‘exit:’ will be placed just before the final end. (No other labels or goto statements are used in the present program, but the author would find it painful to eliminate these particular ones.)

    define exit = 30 {the end of a procedure}
    define return ≡ goto exit {quick termination}
    format return ≡ nil {typeset ‘return’ in boldface}

8. Strategic considerations. What algorithms and data structures should be used for Bentley’s problem? Clearly we need to be able to recognize different occurrences of the same word, so some sort of internal dictionary is necessary. There’s no obvious way to decide that a particular word of the input cannot possibly be in the final set, until we’ve gotten very near the end of the file; so we might as well remember every word that appears.

There should be a frequency count associated with each word, and we will eventually want to run through the words in order of decreasing frequency. But there’s no need to keep these counts in order as we read through the input, since the order matters only at the end.

Therefore it makes sense to structure our program as follows:

    ⟨The main program 8⟩ ≡
      initialize;
      ⟨Establish the value of max_words_to_print 10⟩;
      ⟨Input the text, maintaining a dictionary with frequency counts 34⟩;
      ⟨Sort the dictionary by frequency 39⟩;
      ⟨Output the results 41⟩
    This code is used in section 3.

9. Basic input routines. Let’s switch to a bottom-up approach now, by writing some of the procedures that we know will be necessary sooner or later. Then we’ll have some confidence that our program is taking shape, even though we haven’t decided yet how to handle the searching or the sorting. It will be nice to get the messy details of Pascal input out of the way and off our minds.

Here’s a function that reads an optional positive integer, returning zero if none is present at the beginning of the current line.

    ⟨Procedures for input and output 9⟩ ≡
      function read_int: integer;
        var n: integer; {the accumulated value}
        begin n ← 0;
        if ¬eof then
          begin while (¬eoln) ∧ (input↑ = ‘␣’) do get(input);
          while (input↑ ≥ ‘0’) ∧ (input↑ ≤ ‘9’) do
            begin n ← 10*n + ord(input↑) − ord(‘0’);
            get(input);
            end;
          end;
        read_int ← n;
        end;
    See also sections 15, 35, and 40.
    This code is used in section 3.

10. We invoke read_int only once.

    ⟨Establish the value of max_words_to_print 10⟩ ≡
      max_words_to_print ← read_int;
      if max_words_to_print = 0 then max_words_to_print ← default_k
    This code is used in section 8.

11. To find words in the input file, we want a quick way to distinguish letters from nonletters. Pascal has conspired to make this problem somewhat tricky, because it leaves many details of the character set undefined. We shall define two arrays, lowercase and uppercase, that specify the letters of the alphabet. A third array, lettercode, maps arbitrary characters into the integers 0 .. 26.

If c is a value of type char that represents the kth letter of the alphabet, then lettercode[ord(c)] = k; but if c is a nonletter, lettercode[ord(c)] = 0. We assume that 0 ≤ ord(c) ≤ 255 whenever c is of type char.

    ⟨Global variables 4⟩ +≡
      lowercase, uppercase: array [1 .. 26] of char; {the letters}
      lettercode: array [0 .. 255] of 0 .. 26; {the input conversion table}

12. A somewhat tedious set of assignments is necessary for the definition of lowercase and uppercase, because letters need not be consecutive in Pascal’s character set.

    ⟨Set initial values 12⟩ ≡
      lowercase[1] ← ‘a’; uppercase[1] ← ‘A’;
      lowercase[2] ← ‘b’; uppercase[2] ← ‘B’;

13. Each new word found in the input will be placed into a buffer array. We shall assume that no words are more than 60 letters long; if a longer word appears, it will be truncated to 60 characters, and a warning message will be printed at the end of the run.

    define max_word_length = 60 {words shouldn’t be longer than this}

    ⟨Global variables 4⟩ +≡
      buffer: array [1 .. max_word_length] of 1 .. 26; {the current word}
      word_length: 0 .. max_word_length; {the number of active letters currently in buffer}
      word_truncated: boolean; {was some word longer than max_word_length?}

14. ⟨Set initial values 12⟩ +≡
      word_truncated ← false;

15. We’re ready now for the main input routine, which puts the next word into the buffer. If no more words remain, word_length is set to zero; otherwise word_length is set to the length of the new word.
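As a rough illustration of the conventions described in sections 11 through 15, here is a small Python sketch. It is not Knuth's Pascal, and it is not taken from the article: the names LETTERCODE, build_lettercode, words, and decode are hypothetical, and the per-word truncation flag stands in for the single global word_truncated flag only for convenience. The sketch assumes an ASCII-compatible character set, which the article deliberately does not assume.

```python
import string

MAX_WORD_LENGTH = 60  # mirrors max_word_length from section 13


def build_lettercode():
    """Build a 256-entry table: k for the kth letter in either case, 0 for nonletters."""
    table = [0] * 256
    pairs = zip(string.ascii_lowercase, string.ascii_uppercase)
    for k, (lower, upper) in enumerate(pairs, start=1):
        table[ord(lower)] = k
        table[ord(upper)] = k
    return table


LETTERCODE = build_lettercode()  # hypothetical name for the conversion table


def words(text):
    """Yield (codes, truncated) for each maximal run of letters in text.

    codes holds letter codes 1..26 and is at most MAX_WORD_LENGTH long;
    truncated is True when the run had more letters than that.
    """
    buffer = []
    truncated = False
    for ch in text:
        # Characters outside 0..255 are treated as nonletters (an assumption).
        code = LETTERCODE[ord(ch)] if ord(ch) < 256 else 0
        if code:
            if len(buffer) < MAX_WORD_LENGTH:
                buffer.append(code)
            else:
                truncated = True  # extra letters are dropped
        elif buffer:
            yield buffer, truncated
            buffer, truncated = [], False
    if buffer:  # a word may end at end of input
        yield buffer, truncated


def decode(codes):
    """Turn a list of letter codes back into a lowercase string."""
    return "".join(string.ascii_lowercase[c - 1] for c in codes)
```

For example, `[decode(c) for c, _ in words("Bentley ain't BENTLEY")]` gives `['bentley', 'ain', 't', 'bentley']`: the apostrophe is a nonletter, so “ain’t” splits into two maximal letter runs, and case is folded by the table rather than by a separate conversion step.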