AN EXPERIMENTAL APPLICATIVE PROGRAMMING LANGUAGE FOR LINGUISTICS AND STRING PROCESSING

P.A.C. Bailes and L.H. Reeker
Department of Computer Science, University of Queensland, St. Lucia, Queensland, Australia 4067

Summary

The Post-X language is designed to provide facilities for pattern-directed processing of strings, sequences and trees in an integrated applicative format.

Post-X is an experimental language designed for string processing, and for the other types of operations that one often undertakes in computational linguistics and language data processing.

In the design of Post-X, the following four goals have been foremost:

(1) To modernize the Markov algorithm based pattern matching paradigm, as embodied in such languages as COMIT [13] and SNOBOL [8];

(2) To provide a language useful in computational linguistics and language data processing, in particular, but hopefully with wider applicability;

(3) To provide a vehicle for the study of applicative programming, as advocated by Backus, among others;

(4) To provide a vehicle to study the application of natural language devices in programming languages, as advocated by Hsu [9] and Reeker.

The "X" in "Post-X" stands for "experimental", and is a warning that features of the language and its implementation may change from one day to the next. The eventual goal is to produce a language designed for wide use, to be called "Post" (after the logician Emil Post). In this paper, we shall present some of the language's facilities for string and tree processing. A more detailed statement of the rationale behind the language, and more details of the language itself, are to be found elsewhere.

Pattern Matching

The basic idea of using pattern matching to direct a computation is found in the normal algorithms of Markov, and was embodied in the early string processing language COMIT. The series of SNOBOL languages developed at Bell Laboratories, culminating in SNOBOL4, improved a number of awkward features of COMIT and added some features of their own. Among these latter was the idea of patterns as data objects.

Post-X incorporates patterns into an applicative framework, which will be illustrated below. In doing so, the powerful pattern matching features of SNOBOL4 have been retained and, in fact, improved. In an applicative framework, the pattern match must return a value that can be acted upon by other functions. The pattern itself has been generalized to a much more powerful data object, called the FORM.

A FORM consists of a series of alternative PATTERNS and related ACTIONS. Each pattern is very much like a pattern in SNOBOL4 (with some slight variations). FORMS may be passed parameters (by value), which are then used in the pattern or action portion.

A PATTERN determines the structure of the string to which it is matched. The pattern contains a sequence of concatenated elements, which are themselves PATTERNS, PRIMITIVE PATTERNS (utilizing most of the SNOBOL4 primitives) or STRINGS. The value returned by the pattern is either FALSE (if it fails to match) or a "parse tree" designating the structure of the string that corresponds to portions of the pattern.

As an example, suppose that a pattern is

    P := p1^p2^...^pn

It may be matched to a string S = s0 s1 ... sn sn+1 by the use of the operator "in", and if each of the pi matches a successive letter sj, one can conceptualize the "tree" returned as

    [Tree diagram: a root whose branches, labelled <, 1, ..., n, >, lead to the leaves s0, s1, ..., sn, sn+1]

where s0 represents the unmatched portion to the left of the matched portion and sn+1 the portion to the right of the matched portion.

The numbers 1, ..., n and the characters < and > in the example are SELECTORS, used in the ACTION portion to refer to the appropriate substring. The tree returned is denoted by $$, and ! is used for selection, but $$! can be condensed to $ in this context, so the expression $< returns s0 in the example above, while $2 returns s2. The selectors give the effect of the short persistence variables that were found in COMIT, where they were denoted by numerals. These variables had the advantage of having their scope limited to a single line of the program, thus minimizing the number of variables defined at any one time. In Post-X, the selectors are local to a particular FORM.

Each of the si may be a subtree, rather than a string. In the example above, if p1 were defined as

    p1 := p11^p12^...^p1n

then in place of s1 would be the subtree

    [Tree diagram: a subtree with leaves s11, s12, ..., s1n]

where s11 is the portion matched by p11, etc. Then s11 would be referenced by $1.1.

Composition of functions is often necessary. For the composition of F and G, we write F:G. For example

    head:sort

where "head" gives the first element of a sequence and "sort" sorts a sequence of strings into alphabetical order, defines a function which operates on a sequence of strings and gives the first in alphabetical order.

Certain natural variations in the syntax are permitted. For example

    expression1 op expression2

is defined to be the same as

    (op):[expression1, expression2].

The existence of a conditional function

    cond [a,b,c]

producing "b" or "c" depending on "a" is vital; Post-X allows for a multi-way branch of the form

    if condition1 then expression1
    elif condition2 then expression2
    elif ...
    else expressionn
    fi

The form

    do sequence of names
       expression
    od

is an expression whose value is a function which may be applied to a sequence of expressions, the value of each of which is given in the expression (in the do ... od) by the corresponding name (i.e. call-by-value). The value of the application is the value of the expression (with "substituted" parameters). User-defined functions are named by declarations, examples of which are given later, and defined elsewhere.

We have already discussed the basic pattern matching operation and the definition of the FORMS used in that operation. As may be apparent from that discussion, context free parsing creates no difficulties. Thus we may define

    E := E^"+"^T | T
    T := T^"*"^F | F
    F := "("^E^")" | "x"

Then the pattern match

    E = "x+x*x"

will return the tree

    [Tree diagram: parse tree with interior nodes labelled (E), (T) and (F) over the terminals x, + and *]

The default action is to produce an unlabelled tree "compressed" by the elimination of single offspring nodes. The example above is given by the tree (pictorially)

    [Tree diagram: the compressed, unlabelled tree]

or the sequence

    [["x","+",["x","*","x"]],"+","x"]

If a labelled phrase marker is desired, then appropriate actions need to be attached. For instance, if a parenthesized representation of the tree above with node labels added is desired, the forms would be:

    E := E^"+"^T {"E["^$1^",+,"^$3^"]"}
       | T {"E["^$1^"]"};
    T := T^"*"^F {"T["^$1^",*,"^$3^"]"}
       | F {"T["^$1^"]"};
    F := "("^E^")" {"F[("^$2^")]"}
       | "x" {"F[x]"};

In the example above, the application of E would return

    "E[E[E[T[F[x]]],+,T[T[F[x]],*,F[x]]],+,T[x]]"

Translation to a prefix representation of the arithmetic expression could, incidentally, be accomplished by a slight change in the actions:

    E := E^"+"^T {"+"^$1^$3}
       | T;
    T := T^"*"^F {"*"^$1^$3}
       | F;
    F := "("^E^")" {$2}
       | "x";

This time, E = "x+x*x" would yield

    ++x*xxx

Context sensitive (and even more complex) parsing can be effected by building programs into the actions to give the effect of augmented transition networks.

    E := E^"+"^T {$1 + $3}
       | T;
    T := T^"*"^F {$1 * $3}
       | F;
    F := '('^E^')' {$2}
       | span (0..9) {$1};

Certain predefined patterns and pattern-returning functions are available, being closely modelled on those of SNOBOL4 [3], e.g.

    any string
    arb
    break string
    span string
    arbno string
    etc.
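The effect of attaching different actions to the same E, T and F patterns can be sketched outside Post-X. The Python below is only an illustration under stated assumptions and is not how Post-X is implemented: the names `span`, `parse`, `EVALUATE` and `PREFIX` are hypothetical, the left-recursive alternatives are realized by loops rather than by a general pattern matcher, and each node passes back a single value.

```python
def span(chars):
    """Hypothetical analogue of SPAN: match the longest non-empty run of `chars`."""
    def match(text, pos):
        end = pos
        while end < len(text) and text[end] in chars:
            end += 1
        # Fail (None) on an empty run; otherwise return the lexeme and new position.
        return (text[pos:end], end) if end > pos else None
    return match


def parse(text, actions, operand=span("0123456789")):
    """Parse E := E "+" T | T;  T := T "*" F | F;  F := "(" E ")" | operand.

    Left recursion is replaced by loops (an assumption of this sketch);
    `actions` supplies the value passed back at each node.
    """
    def expr(pos):
        left, pos = term(pos)
        while text.startswith("+", pos):
            right, pos = term(pos + 1)
            left = actions["+"](left, right)   # plays the role of the action on E^"+"^T
        return left, pos

    def term(pos):
        left, pos = factor(pos)
        while text.startswith("*", pos):
            right, pos = factor(pos + 1)
            left = actions["*"](left, right)   # plays the role of the action on T^"*"^F
        return left, pos

    def factor(pos):
        if text.startswith("(", pos):
            inner, pos = expr(pos + 1)
            if not text.startswith(")", pos):
                raise ValueError("expected ')' at position %d" % pos)
            return inner, pos + 1              # plays the role of {$2}
        hit = operand(text, pos)
        if hit is None:
            raise ValueError("expected an operand at position %d" % pos)
        lexeme, pos = hit
        return actions["atom"](lexeme), pos

    result, pos = expr(0)
    if pos != len(text):
        raise ValueError("unmatched input at position %d" % pos)
    return result


# Two interchangeable sets of actions over the same patterns.
EVALUATE = {"+": lambda a, b: a + b, "*": lambda a, b: a * b, "atom": int}
PREFIX = {"+": lambda a, b: "+" + a + b, "*": lambda a, b: "*" + a + b, "atom": str}

print(parse("1+2*3", EVALUATE))                      # 7
print(parse("x+x*x+x", PREFIX, operand=span("x")))   # ++x*xxx
```

Only the `actions` table changes between the two calls; the patterns themselves are untouched, which is the point the forms above make by rewriting the text inside the braces. In this sketch a failed match raises an exception, whereas a Post-X pattern returns FALSE.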
It should also be noted that the actions need not merely pass back a single value. Several values may be associated with a node, as in attribute grammars [10]. For example

    E := E^"+"^T {
             value := $1.value + $3.value;
             code := $1.code^$3.code^" add"
         }
       | T;
    T := T^"*"^F {
             value := $1.value * $3.value;
             code := $1.code^$3.code^" times"
         }
       | F;
    F := "("^E^")" {
             value := $2.value;
             code := $2.code
         }
       | span ("0".."9") {
             value := $1;
             code := " "^$1
         };

(As in SNOBOL4, a numerical string in a numerical context is treated as a number.)

Reference to the attributes of a node may be made in several ways. For example, in the last grammar given, (E = "1+2*3").value would have the value 7, as would value (E = "1+2*3") or (E = "1+2*3")!"value"; and (E = "1+2*3").code would be evaluated as "1 2 3 times add".

Two Examples

1. Random Generation of Sentences

Given a context-free grammar as a sequence of rules in BNF notation (i.e. left and right hand sides separated by "::=", nonterminals surrounded by angle brackets), we wish to randomly generate sentences from the grammar. We shall assume for simplicity that a pseudo-random number generator RANDOM is available which generates a number in an appropriate range each time that it is called. We assume also that the grammar is a string of productions, separated by ";", and is called GRAM.

The program will utilize a form LHS_FIND to find a production with a particular left hand side and return its right hand side.

    LHS_FIND LHS := LHS^"::="^BREAK ";"^";" {$3};

The alternatives on the right hand side are then converted to a sequence by the pattern ALT_LIST:

    ALT_LIST := BREAK "|"^"|" {[$1]^(ALT_LIST <$)>}
              | REM {[$1]};

The particular alternative on the right hand side is chosen by the procedure

    SELECT_RHS LIST := LIST ! ((RANDOM MOD SIZE(LIST))+1);

The replacement in the evolving sentential