AN EXPERIMENTAL APPLICATIVE FOR LINGUISTICS AND STRING PROCESSING P.A.C. Bailes and L.H. Reeker DeparTment of Computer Science, University of Queensland, St. Lucia, Queensland, Auslralia 4067 Summar~ been retained, and in fact, improved. In an applicative framework, the pattern match The Post-X language is designed to must return a value that can be acted upon provide facilities for pattern-directed by other functions. The pattern itself has processing of strings, sequences and trees in been generalized to a much more powerful an integrated applicative format. data object, called the FORM.

Post-X is an experimental language A FORM consists of a series of designed for string processing, and for the alternative PATTERNS and related ACTIONS. other types of operations that one often Each pattern is very much like a pattern in undertakes in computational linguistics and SNOBOL4 (with some slight variations). FORMS language data processing. may be passed parameters (by value), which are then used in the pattern or action portion. In the design of Post-X, the following four goals have been foremost: A PATTERN determines the structure of the string to which it is matched. The pattern (I) To modernize the Markov algorithm based contains a sequence of concatenated elements, paradigm, as embodied in which are themselves PATTERNS, PRIMITIVE such languages as COMIT 13 and SNOBOL 8 ; PATTERNS (utilizing most of the SNOBOL4 primitives) or STRINGS. The value returned by (2) To provide a language useful in the pattern is either FALSE (if it fails to computational linguistics and language data match) or a "parse tree" designating the processing, in particular, but hopefully with structure of the string that corresponds to wider applicability; portions of the pattern. As an example, suppose that a pattern is P:=p1^P2^...^pn" (3) To provide a vehicle for the study of It may be matched to a string S=SoSl...s n by applicative programming, as advocated by the use of the operator '~in", and if each of Backus , among others; the Pi match a successive letter s j, one can conceptualize the "tree" returned as (4) To provide a vehicle to study the applic- ation of natural language devices in programming languages, as advocated by Hsu 9 and Reekerl~ s O s I • .. s n Sn+ I

The "X" in "Post-X" stands for where s O represents the unmatched portion to "experimental", and is a warning that features the left of the matched portion and Sn+. the of the language and its implementation may portion to the right of the matched portion. change from one day to the next. The eventual goal is to produce a language designed for The numbers I ..... n and the characters wide use, to be called "Post" (after the < and > in the example are SELECTORS, used logician Emil Post). In this paper, we shall in the ACTION portion to refer to the present some of the language's facilities for appropriate substring. The tree returned is string and tree processing. A more detailed denoted by $$, and ! is used for selection, statement of th~ rationale behind the language but $$! can be condensed to $ in this can be found in , and more~details of the context, so the expression $ < returns s O in language are to be found in- . the example above, while $ 2 returns s 2. The selectors give the effect of the short Pattern Matching persistence variables that were found in COMIT, where they were denoted by numerals. These The basic idea of using pattern matching variables had the advantage of having their to direct a computation is found in the normal scope limited to a single line of the program, algorithms of Markov, and was embodied in the thus minimizing the number of variables early string processing language COMIT. The defined at any one time. In Post-X, the series of SNOBOL languages developed at Bell selectors are local to a particular FORM. Laboratories, culminating in SNOBOL4, improved a number of awkward features of Each of the s may be a subtree, rather COMIT and added some features of their own. than a string. In the example above, if Pl Among these latter was the idea of patterns were defined as P11be12the^P ^ .^Pin,subtree then in as data objects. place of s I would

Post-X incorporates patterns into an applicative framework, which will be illus- tFated below. In doing so, the powerful pattern matching features of SNOBOL4 have Sll s12 Sln

--520-- where is the portion matched by P11 etc. free parsing creates no difficulties. Then s11Sllwould be referenced by $ I . I. Thus we may define Composition of functions is often necess- ary. For the composition of F and G, we write E:=E^"+"^T I T F:G. For example T:=T^"*"^F I F head:sort F:=,,(,,^E^,,),, I ,,x,, where "head" gives the first element of a Then the pattern match sequence and "sort" sorts a sequence of strings into alphabetical order, defines a function E = "x+x*x" which operates on a sequence of strings and gives the first in alphabetical order. will return the tree

Certain natural variations in the syntax (E) are permitted. For example (E) +~T expression Iop expression 2

(E) + (T) is defined to be the same as (T) (T) * (~) (op):[expression I, expression2].

(F) (F) x The existence of a conditional function X X cond [a,b,c] The default action is to produce an unlabelled on producing "b v' or "c" depending "a" is tree "compressed" by the elimination of single vital; Post-X allows for a multi-way branch of offspring nodes. The example above is given by the form the tree (pictorially). if condition I then expression I elif condition 2 then expression 2 elif X

else expression fi n X ~ X the form or the sequence do sequence of names expression [ ["x","+",E"x","x","x"7 ],"+","x"] od If a labelled phrase marker is desired, then is an expression whose value is the function appropriate actions need to be attached. For which may be applied to a sequence of instance, if a parenthesized representation epxressions, the value of each of which is of the tree above with node labels added is given in the expression (in the d_o_o ... o d) desired, the forms would be" by the corresponding name (i.e. call-by-value). The value of the application is the value of E :=E^,,+,,^T {,,E[,,^$i^,,, +, ,,^$3^,, ],,} the expression (with "substituted" parameters). 1 T {"E["^$ I^"]"}; User-defined functions are named by declarations, T: =T^,,.,,^F {,,T[,,^$i^,,, *, ,,^$3^,, ],,} examples of which are given later, and defined I F {"T["^$1 ^"]"}; In • F:=,, (,, ^E^, ),, {,,F[(,,^$2^,,) ],,} We have already discussed the bas,ic I "x" {"FEM]"}; pattern matching operation and the definition of the FORMS used in that operation. As may In the example above, the application of E be apparent from that discussion, context would return "E[E[E[T[ F[x] ] ],+,T[T[ F[x] ], *, FEx] ]],+,T[x] ]"

--521-- Translation to a prefix representation of E:=E^"+"^T {$I + $3} the arithmetic expression could, incidentally, IT; be accomplished by a slight change in the T:=T^"*"^F {$I * $3} actions: rF; F:='('^E^') f {$2} E:=E^"+"^T {"+"^$I ^$3} r span (0..9) {$I}; r T; T:=T^"*"^F {"*"^$I ^$3} Certain predefined patterns and pattern-returning I F; functions are available, being closely modelled F:=" (" ^E^") '' {$2} on those of SNOBOL4 e.g. 3 "x"; any string This time, E="x+x*x" would yield arb break string ++X*XXX. span string arbno string Context sensitive - and even more complex etc... - parsing can be effected by building programs into the actions to give the effect of augmented transition networks. Two Examples It should also be noted that the actions need not merely pass back a single value. I. Random Generation of Sentences Several values may be associated with a node, as in attribute grammars I0. For example Given a context-free grammar as a sequence of rules in BNF notation (i.e. left E'.--E^"+''^T{ and right hand sides separated by "::=", value := $1.value + $3.value; nonterminals surrounded by angle brackets), code := $1.code^$3.code ^" add" we wish to randomly generate sentences from the grammar. We shall assume for simplicity } that a pseudo-random number generator RANDOM I T; is available which generates a number in an appropriate range each time that it is called. T: =T ^''*''^Fa We assume also that the grammar is a string value := $1.value + $3.value; of productions, separated by ";" and is called code := $1.code^$3.code ^" times" GRAM. } I F; The program will utilize a form LHS_FIND to find a production with a particular left F:=,,("^E ,,),, hand side and return its right hand side. {value := $2.value; code := $2.code LHS FIND LHS := LHS^"::="^BREAK";"^";"{$3}; } I span ("0'~.."9'') The alternatives on the right hand side are then converted to a sequence by the {value := $I; pattern ALT_LIST: code := " "^$I}; ALT_LIST := BREAK "I ''^'ir'' {[$1]^(ALT_ LIST <$)>} (As In SNOBOL4, a numerical string in a I REM {[$I]}; numerical context is treated as a number.) The particular alternative on the right Reference to the attributes of a node hand side is chosen by the procedure may be made in several ways. For example, in the last grammar given, (E = "1+2*3").value SELECT RHS LIST := LIST ! ((RANDOM MOD would have the value 7, as would value SIZE(LIST))+I); (E="I+2"3") or (E="1+2*3")~"value"; and (E+"1+2*3").code would be evaluated as The replacement in the evolving sentential "I 2 3 times add". form is accomplished by

If it were considered desirable REPLACE GRAM := "<"^BREAK">"^"> '' immediately to evaluate the expression {$<^((REPLACE GRAM)< (returning 7 as the value of the match in SELECT RHS the example above) we can write this in the (ALT LTST< action portion: (LHS--FIND $2 } INULL{$$};

--522-- This form finds an occurrence of a nonterminal Line (2) applies the form PARTITION (it will find the leftmost as it is applied to each string in the sequence. PARTITION below). It uses this nonterminal as a will "tag" each word in each string, by parameter to LHS_FIND, which is applied to producing a copy of the string with angle the grammar GRAM to return the right hand brackets around the word, creating alternatives. Then SELECT RHS selects an [" An ...", "An ..." alternative, which is plac~d in the context "An analysis ..." ... etc.] A sequence of the nonterminal matched by the pattern is produced for each string in the original portion of the form. Finally, REPLACE is sequence, and these are merged to form a single matched (recursively) to the result. If sequence by the UNION function. the first alternative fails, it means that there is no nonterminal. In that case, the Line (3) adds an occurrence of the "tagged" second alternative will be matched, and will word to the beginning of each string. Line (4) return the entire string, which will be a sorts the sequence obtained (SORT is a built-in string in the language generated by GRAM. function), and Line (5) removes the word added to the beginning of the string in line (3). The entire program will be invoked as The forms PARTITION, TAGFRONT, and RAND STRING GRAM := (REPLACE GRAM) <"". REMOVE are defined as follows:

This assumes that the root symbol is . PARTITION := SPAN(,A,..,Z,) ^, , It should be noted that the parentheses {E'<'^$I^'>'^$2^$>] ^, can be reduced by using the transformation ALPHA:CAT[S1] (PARTITION)} available in Post-X into postfix notation, ISPAN('A'..'Z') {'<'^$1^'> '} with the postfix composition operator ".". Using this, one version of REPLACEGRAM The SPAN matches the first word, and the action would be: part of the form places that word ($I) between angle brackets. The PARTITION<$ portion of the REPLACE GRAM := "<"^BREAK">"^"> '' action returns a sequence of PARTITIONS of {$<^GRAM. the rest of the string, and the first word is (<(LHS FIND $2). concatenated onto the beginning of each of (<)ALT LIST. these. When only one word remains, the second SELECT RHS. alternative is used. (<)REPLACE GRAM ^$>} TAG FRONT:= INULL{$$}; '<'^BREAK'>' {$2 ^, ,^@$$} The freedom that this alternative notation provides is one of the refreshing aspects The @$$ is the whole string (the "flattening" of Post-X. (The particular transformation of tree $$ to a string), the $2 is the portion applied here to REPLACE_GRAM is, incidentally, in angle brackets. A copy of this is moved to analogous to extraposition in English.) the beginning of the string.

2. A KWIC Index of Titles REMOVE TAG:= ~PAN('A'..'Z') ' ' It is assumed that the input consists {$>} of a sequence of titles. It is desired to provide a primitive alphabetized KWIC index. This merely removes the string added at the An input of ["An analysis of the English beginning of the sentence by removing the present perfect" "The role of the word in first word. phonological development"] will produce [" analysis ..." "An ..." Tree Processin 9 in Post-X "... in phonological " ... etc.] Tree processing facilities are pattern- The top-level program to do this is directed and similar to string processing facilities. As pointed out earlier, labelled (I) KWIC := trees may be represented as strings containing (2) UNION ALPHA [(<)[PARTITION]]. brackets. The pattern BAL, carried over from (3) ALPHA [(in) TAGFRONT]]. SNOBOL4, is used to match a full, well-formed (4) SORT. subtree, but is extended to allow a (5) ALPHA [(<)[REMOVETAG]]; specification of the value of that tree. Some tree functions are added for use in forms. The function FUSE fuses a sequence of subtrees together at their top node, leaving the label unchanged. In a tree form, as illustrated

--523-- below, the use of a text string refers to a subtree with that label, but the function LABEL will return the label itself. The function sequences; in fact, sequences form the RELABEL (tree, name) changes the label on a unifying notion within Post-X data structures. subtree to that named. There remains work to be done to determine how effectively one can generalize pattern directed Post-X tree processing facilities are processing to sequences without adding too much currently undergoing a careful study, and may be complexity to the language. changed, but their present capabilities can The LISP programmer will tend to identify be illustrated by the specification of forms for sequences with lists, in the sense of that language. There are differences, however. The two linguistic transformations in English, facilities in LISP are oriented toward lists EQUI NP DELETION and THERE INSERTION: as right-branching binary trees, and though EQUI NP DELETION:=TREE("NP"^BAL^"S"^BAL("NP ''^ one can build LISP functions to overcome this, BAL) the programmer generally must manage the lists {If $I=$4~I then with their links in mind. In Post-X, the programmer is encouraged to deal with the $I^$2^$3^$412} sequence directly, as one becomes accustomed THERE INSERTION:=TREE("S"^BAL("NP"^BAL^"V ''^ to dealing with strings directly in a string BAL^ARB) processing language. {$1^FUSE("there"^$2~3 ^ $2!2^$2~4)) Discussion

These can be compared to a "conventional" The SNOBOL4 language has had few competit- formulation : ors over the years as a general string-process- ing language. Its development and distribution EQUI NP DELETION: X NP y [NP Y] Z were undertaken by Bell Laboratories, and thus S S it has been widely available for more than a decade. Yet SNOBOL4 has never been as widely Structural Index I, 2, 3, 4, 5, 6 used for large applications, including those in computational linguistics, as one might Structural Change I, 2, 3, 0, 5, 6 expect, and in the light of a decade's where 2=4 experience, it is possible to identify various difficulties that account for this. THERE INSERTION: X [NP V 4] Z 2 S The basic string processing operation of Structural Index I, 2, 3, 4, 5 SNOBOL4, pattern matching, is not the source Structural Change I, THERE+3, 2, 4, 5 of these difficulties. In fact, SNOBOL4 pattern matching provides a high-level-data- Illustrating the application of these, directed facility that should be a standard for other languages. The major problems were THERE INSERTION in [S[NP[Det[A]N[boy]] in the fact that the pattern-matching was never sufficiently eeneralized, leading to a VP[V[is]PP[Prep[in] II lit NP[Det[the] linguistic schism , that the syntactic N[garden]]]]] conventions of SNOBOL4 led to difficulties [S[there V[is]NP[Det[A]N[boy]] and poor structuring 5 , and that the PP[Prep[in]NP[Det[the] necessity of constantly assigning values to N[garden]]]]]]. string variables was clumsy and tended to obscure the semantics of all but the simplest Tree forms are not as natural as they might programs be, and changes can be expected, with the major goal being to make their use as closely Recently, various attempts have been analogous to that of strings as possible. made to remedy the problems mentioned above. But to give up pattern matching, as in Icon 7 or to resort to a less rich vocabulary Sequences of patterns, as in Poplar11(despite which, the latter language has a good deal of merit), Post-X has a number of facilities that is to "discard the baby with the bathwater". apply generally to sequences of items. These Post-X is the result of attempts to extend have been illustrated in the examples by, for pattern matching and to improve it, at the instance, the operator ALPHA, which applies same time providing a more natural, flexible a function to each element of a sequence, and and comprehensible linguistic vehicle. UNION, which takes a sequence of sequences and creates a single sequence. Less obvious, perhaps, is the fact that strings and unlabelled trees are themselves

--524-- References

I. Backus, John [78], Can programming be 13. Yngve, V. [58], A programming language liberated from the Van Neumann style?. for mechanical translation, Mechanical A functional style and its algebra of Translation, vol 5, no.l, 25-41. programs, CACM voi.21, no.8, August, 1978, 613-641.

2. Bailes, P.A.C., and Reeker, L.H.[80], Post- X: An Experiment in Language Design for String Processing, Australian Computer Sciehce Communications. Vol.2, no.2, March, 1980, 252-267.

3. Bailes, P.A.C., and Reeker, L.H. [80], The Revised Post-X programming language, Technical Report no.17, Computer Science Department, University of Queensland.

4. Galler, B.A., and Perlis, A.J. [70], A View of Programming Languages, Addison- Wesley, Reading, Massachusetts.

5. Gimpel, J.F. [76], Algorithms in SNOBOL4, John Wiley, New York.

6. Griswold, R.E. [79], the ICON Programming Language, Proceedings,ACM National Conference 1979, 8-13.

7. Griswold, R.E., and Hanson, D.R.[80], An Alternative to the Use of Patterns in String Processing, ACM Transactions on Programming Languages and Systems,Vol 2, no.2, April, 1980, 153-172.

8. Griswold, R.E., Poage,J.F., and Polonsky, I.P. [71], The SNOBOL4 Programming Language, Prentice-Hall, Englewood'Cliffs, New Jersey.

9. Hsu, R. [78] • Grammatical Function and Readability: the syntax of loop constructs, Proceedingsp Hawaii International Conference on Systems.

10. Knuth, D.E. [65], On the translation of Languages from left to right, Information and Control, vol.8, no.6, 607-639.

11. Morris, J.H., Schmidt, E., and Walker, P. [80], Experience with an Applicative String Processing Language, P rOc.Sixth ACM Symp. on Principles of Programming Languages.

12. Reeker, L.H.[79], Natural language devices for programming readability: embedding and identifier load, Proceedings of the 2nd Australian Computer Science Conferen'ce, University of Tasmania, 160-167.

, --525--