<<

I Model and Sentence Structure Manipulations for / Natural Language Application Systems

Zenshiro Kawasaki, Keiji Takida, and Masato Tajima Department of Intellectual Information Systems Toyama University 3190 Gofuku, Toyama 930-0887, Japan / {kawasaki, zakida, and Zajima}@ecs.Zoyama-u. ac. jp

Abstract the CC representation format lead to a correspond- ing simplicity and uniformity in the CC structure This paper presents a language model and operation scheme, i.e., CC Manipulation Language its application to sentence structure manip- (CCML). / ulations for various natural language ap- plications including human-computer com- Another advantage of the CCM formalism is that munications. Building a working natural it allows inferential facilitiesto provide flexiblehu- language dialog systems requires the inte- man computer interactions for various natural lan- l gration of solutions to many of the impor- guage applications. For this purpose conceptual re- tant subproblems of natural language pro- lationships including synonymous and implicational cessing. In order to materialize any of these relations established among CFs are employed. Such / subproblems, handling of natural language knowledge-based operations axe under development expressions plays a central role; natural and will not be discussed in this paper. language manipulation facilities axe indis- In Section 2 we present the basic components of / pensable for any natural language dialog CCM, i.e,,the frame and the concept com- systems. Concept Compound Manipula- pound. Section 3 introduces the CC manipulation tion Language (CCML) proposed in this language; the major features of each manipulation paper is intended to provide a practical are explained with illustrative examples. l means to manipulate sentences by means Concluding observations axe drawn in Section 4. of formal uniform operations. 2 Concept Coupling Model / 1 Introduction Sentence structure manipulation facilitiessuch as 2.1 Concept Compound and Concept transformation, substitution, , etc., axe Frame indispensable for developing and maintaining natu- ral language application systems in which language It is assumed that each linguistic expression such structure operation plays an essential role. For this as a sentence or a is mapped onto an ab- / reason structural manipulability is one of the most stract data structure called a concept compound important factors to be considered for designing a (CC) which encodes the syntactic and semantic in- sentence structure representation scheme, i.e.,a lan- formation corresponding to the linguisticexpression guage model. The situation can be compared to in . The CC is realized as an instance of a database management systems; each system is based data structure called the concept frame (CF) which on a specific data model, and a data manipulation is defined for each conceptual unit, such as an entity, sublanguage designed for the data model is provided a , a relation, or a , and serves / to handle the data structure (Date, 1990). as a template for permissible CC structures. CFs In Concept Coupling Model (CCM) proposed in axe distinguished from one another by the syntactic this paper, the primitive building block is a Con- and semantic properties of the they repre- / cept Frame (CF), which is defined for each phrasal sent, and axe assigned unique identifiers. CFs axe or sentential conceptual unit. The sentence analysis classified according to their syntactic categories as is carried out as a CF instantiation process, in which sentential, nominal, adjectival, and adverbial. The several CFs axe combined to form a Concept Com- CCM lexicon is a set of CFs; each entry of the lex- / pound (CC), a nested relational structure in which icon defines a CF. It should be noted that in this the syntactic and semantic properties of the sen- paper inflectional information attached to each CF definition is left out for simplicity. l tence are encoded. The simplicity and uniformity of

Kawasaki, Takida and Tajima 281 Language Model and Sentence Structure Manipulations

Zenshiro Kawasaki, Keiji Takida and Masato Tajima (1998) Language Model and Sentence Structure Manipulations for Natural Language Applications Systems. In D.M.W. Powers (ed.) NeMIazP3/CoNLL98 Workshop on Human Computer l Conversation, ACL, pp 281-286. I 2.2 The CC is an instantiated CF and is defined as shown in (3): In this section we define the syntax of the formal (3) C(H,R,A, description scheme for the CF and the CC, and ex- where the concept identifier C is used to indicate plain how it is interpreted. A CF is comprised of the root node of the CC and represents the ,whole four types of tokens. The first is the concept identi- CC structure (3), and H, R, and A are the head, I fier which is used to indicate the relation name of the role, and attribute slot respectively. The head slot CF structure. The second token is the key-phrase, H is occupied by the identifier of the C's head, i.e., which establishes the links between the CF and the C itself or the identifier of the C's component which I actual linguistic expressions. The third is a list of determines the essential properties of the C. The attribute values which characterize the syntactic and role slot R, which is absent in the corresponding CF semantic properties of the CF. Control codes for the definition, is filled in by a list of syntactic and se- CCM processing system may also be included in the mantic role names which are to be assigned to C list. The last token is the concept pattern which is in the concept coupling process described in Sec- a syntactic template to be matched to linguistic ex- tion 2.3. The last slot represents the C's structure, pressions. The overall structure of the CF is defined an instance of the concept pattern P of the corre- I as follows: sponding CF, and is occupied by the constituent list. (I) c(g, A, P), The members of the list, X1,X2,..., and Xn, are the where C and K are the concept identifier and the CCs corresponding to the immediate constituents key-phrase respectively, A represents a list of at- of C. The tail element M of the constituent list, l tribute values of the concept, and P is the concept which is absent in non-sentential CCs, has the form pattern which is a sequence of several terms: vari- md_(H,R,A, [M1, ..., Mini), where M1,...,Mm rep- ables, constants, and the * which represents resent CCs which are associated with optional ad- I the key-phrase itself or one of its derivative expres- verbial modifiers. sions. The constant term is a string. The vari- By way of example, the concept st'~ruCture corre- able term is accompanied by a set of codes which sponding to the sentence in (4a) is shown in (4b), represent the syntactic and semantic properties im- l which is an instance of the CF in (2). posed on a CF to be substituted for it. These codes, (4a) John broke the box yesterday. each of which is preceded by the symbol +, are classi- (4b) break010 fied into three categories: (a) constraints, (b) roles, (break010, I and (c) instruction codes to be used by the CCM 0, processing system. No reference is made to the se- [sent, dyn, fntcls, past, agr3s], quencing of these codes, i.e., the code names are [johnO060 I uniquely defined in the whole CCM code system. (johnO060, The CF associated with the word break in the [subj, ], sense meant by John broke the box yesterday is [nomphr s, prop, hum, agr 3s, mascIn] , shown in (2): B), I (2) breakOl O('break', [sent, dyn, base], boxO0010 '$1(+nomphrs + hum + subj + agent) • (box00010, $2( +nomphrs + Chert + obj + patnt)'). [obj, patnt], In this example the identifier and the key-phrase are [the_, encrt, nomphr s, agr3s], breakOlO and break respectively. The attribute list 0), indicates that the syntactic category of this CF is md_ sertt(ential) and the semantic feature is dyn(amic). ([yeste010], The attribute base is a control code for the CCM [modyr], processing system, which will not be discussed fur- [advphr s, mo~, ther in this paper. The concept pattern of this CF [yeste010 corresponds to a subcategorization fraxae of the (yeste010, break. Besides the symbol • which represents the 0, key-phrase break or one of its derivatives, the pat- [advphrs, timeAdv], I tern includes two variable terms ($1 and $2), which I) 3) 3). are called the immediate constituents of the con- In (4b) three additional attributes, i.e., f(i)n(i)t(e- cept breakOlO. The appended attributes to these )cl(au)s(e), past, and agr(eement-)3(rd-person- variables impose conditions on the CFs substituted )s(ingular), which are absent in the CF definition, for them. For example, the first variable should be enter the attribute list of break010. Also note that matched to a CF which is a nom(inal-)phr(a)s(e) the constituent list of break010 contains an optional with the semantic feature hum(an), and the syntac- modifier component with the identifier rod_, which tic role subj(ect) and the semantic role agent are to does not have its counterpart in the corresponding be assigned to the instance of this variable. CF definition given in (2). l

Kawasaki, Takida and Tajima 282 Language Model and Sentence Structure Manipulations I II 2.3 Concept Coupling Another important feature of the CC representa- tion is that structural transformation relations can As sketched in the last section, each linguistic ex- easily be established between CCs with different II pression such as a sentence or a phrase is mapped syntactic and semantic properties in tense, voice, onto an instance of a CF. For example, the sentence modality, and so forth. Accordingly, if a conve- (4a) is mapped onto the CC given in (4b) which is nient CC structure manipulation tool is available, an instance of the sentential CF defined in (2). In II sentence-to-sentence transformations can be realized this instantiation process, three other CFs given in through CC-to-CC transformations. The simplicity (5) are identified and coupled with the CF in (2) to and uniformity of the CC data structure allows us to generate the compound given in (4b). devise such a tool. We call the tool Concept Com- II (5a) johnOO60('john', [nomphrs, prop, hum, agr3s, mascln], I . ,). pound Manipulation Language (CCML). Suppose a set of sentences are collected for a spe- (Sb) boxOOOlO('box', [nomphrs, cncrt, base_n], ' * '). cific natural language application such as second lan- (5c) yesteOl O(lyesterday I, [advphrs, timeAdv], ' * i). guage learning or human computer communication. All three CFs in (5) are primitive CFs, i.e., their The sentences are first transformed into the corre- concept patterns do not contain variable terms and sponding CCs and stored in a CC-base, a file of their instances constitute ultimate constituents in stored CCs. The CC-base is then made to be avail- II the CC structure. For example (Sb) defines a CF able to retrieval and update operations. corresponding to the word box. The identifier and the key-phrase are box00010 and box respectively. The CCML operations are classified into three cat- egories: (a) Sentence-CC conversion operations, (b) II The attribute list indicates that the syntactic cate- gory, the semantic feature, and the control attribute CC internal structure operations, (c) CC-base oper- are noro(inal-)phr(o)s(e), c(o)~c~(e)t(e), and base(- ations. The sentence-CC conversion operations con- sists of two operators: the sentence-to-CC conver- }n(oun} respectively. The concept pattern consists sion which invokes the sentence analysis program of the symbol *, for which box or boxes is to'be sub- and parses the input sentence to obtain the corre- stituted. sponding CC as the output, and the CC-to-sentence In the current implementation of concept cou- conversion which generates a sentence corresponding pling, a Definite (DCG) (Pereira and Warren, 1980) rule generator has been devel- to the indicated CC. The CC internal structure oper- ations are concerned with operations such as mod- oped. The generator converts the run-time dictio- ifying values in a specific slot of a CC, and trans- nary entries, which are retrieved from the base dic- II forming a CC to its derivative CC structures. The tionary as the relevant CFs for the input sentence CC-base operations include such operations as cre- analysis, to the corresponding DCG rules. We shall ating and destroying CC-bases, and retrieving and not, however, go into details here about the algo- updating entries in a CC-base. The current imple- II rithm of this rule generation process. The input mentation of these facilities are realized in a Prolog sentence is then analyzed using the generated DCG environment, in which these operations are provided rules, and finally the source sentence structure is ob- II tained as a CC, i.e., an instantiated CF. In this way as Prolog predicates. the sentence analysis can be regarded as a process of In the following sections, the operations men- identifying and combining the CFs which frame the tioned above are explained in terms of their effects on CCs and CC-bases, and are illustrated by means source sentence. II of a series of examples. All examples will be based on a small collection of sample sentences shown in 3 Concept Compound (7), which are assumed to be stored in a file named il Manipulations sophie.text. The significance of the CC representation format is (7a) Sophie opened the big envelope apprehensively. it's simplicity and uniformity; the relational struc- (To) Hilde began to describe her plan. !1 ture has the fixed argument configuration, and ev- (7c) Sophie saw that the philosopher was right. ery constituent of the structure has the same data (7d) A thought had suddenly struck her. structure. Sentence-to-CC conversion corresponds to sentence analysis, and the obtained CC encodes 3.1 Sentence-CC Conversions syntactic and semantic information of the sentence; the CC representation can be used as a model for Two operations, $get_cc and $get_sent, are provided sentence analysis. Since CC, together with the rele- to inter-convert sentences and CCs. vant CFs, contains sufficient information to generate $get_cc~ $get_sent a syntactically and semantically equivalent sentence The conversion of a sentence to its CC can be re- to the original, the CC representation can also be alized by the operation $get_cc as a process of con- employed as a model for sentence generation. In cept coupling described in Section 2.3. The reverse this way, the CC representation can be used as a process, i.e., CC-to-sentence conversion, is carried language model for sentence analysis and generation. out by the operation $get_sent, which invokes the

AI Kawasaki, Takida and Tajima 283 Language Model and Sentence Structure Manipulations mm sentence generator to transform the CC to a corre- perf(e)ct to the slot attribute. sponding linguistic expression. The formats of these (12b) $add( C C , attribute, ~verf ct] , C C New ) . mm operations are: In (12b) the first argument CC is occupied by the (8) Sget_cc( Sent, CC). CC shown in (10c). The last argument .CCNew is (9) $get_sent( CC, Sent). then instantiated as the CC corresponding to the The arguments Sent and CC represent a sentence sentence Sophie had opened the envelope apprehen- !1 or a list of sentences, and a CC or a list of CCs, sively. Note that imperf(e)ct is a default attribute respectively. For the $get_cc operation, the input value assumed in (10c). sentence (list) occupies the Sent position and CC $delete mm is a variable in which the resultant CC (list) is ob- In contrast to add, this operation removes the in- tained. For the $get_sent operation, the roles of the dicated values from the specified slot. The format arguments are reversed, i.e., CC is taken up by an is: input CC (list) and Sent an output sentence (list). (13) Sdelete( CC, Slot, ValueList, CCNew). Example: $substitute (10a) Get CC for the sentence Sophie opened the big This operation is used to replace a value in a slot envelope apprehensively. with another value. The format is: The query (10a) is translated into a CCML state- (14) Ssubstitute( C C, Slot, OldV alue, N ewV alue, ment as: CCNew). (10b) $get_cc('Sophie opened the big envelope Example: II apprehensively', CC). (15a) For the CC in (10c), replace the attribute value Note that the input sentence must be enclosed in sin- past by pres( e )nt. gle quotes. The CC of the input sentence is obtained (15b) $substitute( CC, attribute, past, in the second argument CC, as shown in (10c): presnt, C C New ). II (10c) CC = By this operation CCNew is instantiated as a CC openOOlO(openO010,D, [sent, f ntels, past, agr 3s] , corresponding to the sentence Sophie opens the en- [sophiOlO( sophiOlO, [subj], velope apprehensively. mm [nomphrs, prop, hum, agr3s, femnn], D), $restructure bigOOO20(envelO01, [obj], This operation changes the listing order of imme- [the_, det_modf d, adj_mod, cncrt, diate constituents, i.e., the component CCs in the nomphrs, agr3s], structure slot of the specified CC. The format is: II [envelOOl ( envelO01, ~, (16) Srestrueture( CC, Order, CC New), [cncrt, nomphrs, agr3s, f _n], ~)]), where the first argument CC represents the CC to mm md_([ar e010], [modyr], [advphrs, be restructured and the second argument Order de- [app,'eO10(azo,'e010,D, [ dvphrs], U)])])- fines the new ordering of the constituents. The gen- eral format for this entry is: (17) [Pl,P2,Ps, ..,pn], mm 3.2 CC Internal Structure Operations where the integer p~ (i = 1, 2, ..., n) represents the Since the CC is designed to represent an abstract old position for the pi-th constituent of the CC in sentence structure in a uniform format, well-defined question, and the current position of Pi in the list attributive and structural correspondences can be (17) indicates the new position for that constituent. II established between CCs of syntactically and seman- For example, [2,1] means that the constituent CCs tically related sentences. Transformations between in the first and second positions in the old structure these derivative expressions can therefore be realized are to be swapped. The remaining constituents are by modifying relevant portions of the CC in ques- not affected by this rearrangement. tion. $transform For manipulating the CC's internal structure, The above mentioned basic operations yield CCs CCML provides four basic operations ($add, Sdelete, which do not necessarily correspond to actual (gram- Ssubstitute, Srestructure) and one comprehensive op- matical) linguistic expressions. The higher level eration ( $trans form ). structural transformation operation, $transform, is $add a comprehensive operation to change a CC into one II This operation is used to axid values to a slot. The of its derivative CCs which directly correspond to format is: actual linguistic expressions. Tense, aspect, voice, (11) Sadd(CC, Slot, ValueList, CCNew). sentence types (statement, question, etc), and nega- For the CC given in the first argument CC, the el- tion are among the currently implemented transfor- ements in ValueList are appended to the slot indi- mation types. The format is: cated by the second argument Slot to get the mod- (18) Strans form( C C, TransTypeList, C C New). mm ified CC in the last argument CCNew. The second argument TransTypeList defines the Example: target transformation types which can be selected (12a) For the CC given in (10c), add the value from the following codes: II

Kawasaki, Takida and Tafima 284 Language Model and Sentence Structure Manipulations Voice: act( i )v( e ), pas( si )v( e ). the result in the CC-base named sophie.ccb. Negation: a f firm(a)t(i)v(e), neg(a)t(i)v(e). (22b) $create_ccb(' sophie.text',' sophie.ccb'). II Tense: pres(e)nt, past,/ut(u)r(e). The first line of the file sophie.ccb is taken by the CC Perfective: per f ( e )ct, imper f ( e )ct. given in (10c) which corresponds to the first sentence Progressive: cont(inuous), n(on-)cont(inuous). in the file sophie.text shown in (7). Sentence Type~ stat(e)m(e)nt, quest(io)n, $activate_ccb dir(e)ct(i)v(e), excl(a)m(a)t(io)n. The format of $activate_ccb is: Note that the $transform operation does not re- (23) $activate_ccb( CC BaseFileName). quire explicit indication of the attribute type for This operation copies the CC-base indicated by each transformation code in the above list. This CCBaseFileName to the "current" CC-base which is possible because the code names are uniquely de- can be accessed by retrieval and update operations fined in the whole CCM code system. explained in the following sections. If another CC- II Examples: base file is subsequently "activated", CCs from this (19a) Get the form of the sentence new CC-base file are simply appended to the current Hilde began to describe her plan. CC-base. II The above query is expressed as a series of CCML Example: statements: (24a) Activate sophie.ccb. (19b) $get_cc(~Hilde began to describe her plan ~, (24b) $activate(' sophie.ccb'). II CC), $destroy_ccb Stransf orm( CC, [questn], CCNew), There are two formats for $destro~_ccb: Sget_sent( CC New, SentNew). (25a) $destroy_ccb(CCBaseFileName). The result is obtained in SentNew as: (25b) $destroy_ccb. (19c) SentNew='Did Hilde begin to describe her CCBaseFileName is taken up by the name of a CC- plan?' base file to be removed. If current is substituted for Note that the same values are substituted for like- CCBaseFileName in (25a) or the operation is used !1 named variables appearing in the same query in all in the (25b) format, the current CC-base is removed. of their occurrences, e.g., CC and CCNew in (19b). $save_eeb Another example of the use of $transfarm is given The formats are: in (20): (26a) $save_ccb(C C BaseF ileN ame ). II (20a) Get the present perfect passive form of the sen- (26b) $save_ccb. tence Sophie opened the big envelope apprehensively. The $save_ccb operation is used to store the cur- (20b) $get_cc(tSophie opened the big envelope rent CC-base into the file indicated by CCBaseFile- II apprehensively ~, C C ) , Name. The current CC-base is destroyed by this $trans form( C C, ~resnt, per f ct, pasv], operation. If CCBaseFileName is current in (26a) CCNew), or the operation is used in the (26b) format, the cur- $get_sent( CC New, SentNew ). rent CC-base is stored temporarily in the system's (20c) SentNew='The big envelope has been opened work space. Note that the Ssave_ccb operation dis- by Sophie apprehensively.' cards already existing CC-base in the work space when it is executed. $restore_ccb II 3.3 CC-base Operations This operation takes no arguments. The existing 3.3.1 Storage operations current CC-base is destroyed and the saved CC-base II The CC-base storage operations are: $create_ccb, in the work space is activated. The format is: Sactivate_ccb, Sdestroy_ccb, Ssave_ccb, and Sre- (27) $restore_ccb. store_ccb. $create_ecb II The general format of $create_ccb is: 3.3.2 Retrieval operations (21) $create_ccb( SentenceFileName, Retrieval operations are applied to the current CC BaseFileName). CC-base. Relevant CC-bases should therefore be ac- The file indicated by the first argument Sentence- tivated prior to issuing retrieval operations. FileName contains a set of sentences (one sentence $retrieve_ec per line) to be converted to their CC structures. The The $retrieve_cc operation searches the current $create_ccb operation invokes the sentence analyzer CC-base for CCs which satisfy the specified condi- II to transform each sentence in the file into the cor- tions. Note that if a CC contains component CCs responding CC and store the results in the CC-base which satisfy the imposed conditions, they are also indicated by CCBaseFileName (one CC per line). fetched by this operation; CCs within a CC are all II Example: searched for..The general format of $retrieve_cc is (22a) Convert all sentences in the text file as fallows: II sophie.text shown in (7) into their CCs and store (28) $retrieve_cc( S electionaIC onditionList,

II Kawasaki, Takida and Tajima 285 Language Model and Sentence Structure Manipulations I

RetrievedC C List), The formats axe: where SelectionalConditionList is a list of con- (32a) $append_cc( C C , C C B File ). stralnts imposed on CCs to be retrieved. Each el- (325) $append.cc( C C ). ement of the list consists of either of the following The first argument CC indicates a CC or a list of terms: CCs to be appended to the CC-base specified in (29a) SlotName = ValueList. the second argument CCBFile. If the named CC- (29b) RoleN ame : C ondition List. base is current or the operation is used in the for- SlotName is occupied by a slot name of the CC mat (32b), the $append_cc operation makes the ap- structure, i.e., identifier, head, role, attribute, or pended CC(s) indicated in the first argument be di- B structure. ValueList is a list of values correspond- rectly accessible by retrieval operations. ing to the value category indicated by SlotName. Example: The (29b) format is used when conditions are to (33a) Append the CC for the sentence Sophie saw be imposed on the immediate constituent CC with that the philosopher was right to the current CC- the role value indicated by RoleName. The con- base. ditions are entered in ConditionList, which is a (335) Sget_cc( 'Sophie saw that the philosopher list of terms of the format (29a). Each element of was right', CC), m SelectionalConditionList represents an obligatory Sappend_cc(CC). condition, i.e., all the conditions in the list should be Sdelete_cc satisfied simultaneously. More general logical con- (34a) $delete_cc( C C , C C B File ). II nectives such as negation and disjunction are not (34b) $delete_cc( C C ). available in the current implementation. The re- Removal of the indicated CC(s) from the current trieval result is obtained in the second argument CC-base is carried out by this operation. The in- RetrievedCCList as a list. terpretation of the arguments and their uses are the m Examples: same as those of $append_cc. (30a) Get all finite subordinate in the file sophie.text. | (30b) Sdestroy_ccb, 4 Conclusions $cr eate_ccb( tsophie.text',' sophie.ccb') , $activate_ccb(' sophie.ccU ), A sentence structure manipulation language CCML $retrieve_cc([attribute = fntcls, based on the language model CCM was proposed. attribute = sub_cls], CCL ), In CCM each sentence is transformed into a CC, $get_sent( CC L, SL ). a nested relational structure in which the syntac- (30c) SL=['That the philosopher was right. ']. tic and semantic properties of the sentence are en- m (31a) Assuming that the current CC-base is the one coded in a uniform data structure. This uniformity activated in (30), get sentences/clauses whose sub- in CC's data structure leads to a corresponding uni- jects have the semantic feature hum(an). formity in the CCML operations. The CCML op- (315) $retrieve_cc([ : [attribute = hum]], erations implemented so fax cover a wide range of m areas in sentence structure manipulations including $get.sent( CC L, SentL ). CC L ), sentence-CC inter-conversion operations, CC inter- (31c) Senti,= nal structure operations, and CC-base operations. m ['Sophie opened the big envelope The manipulation language CCML proposed in this apprehensively. ', paper is expected to be used in various natural lan- 'Hilde began to describe her plan." guage application systems such as second-language 'To describe her plan. ', learning systems and human computer communica- 'Sophie saw that the philosopher was right. ', tion systems, in which sentence structure manipula- 'That the philosopher was right. ']. tion plays an essential role. Note that all the embedded sentences are included in the retrieved CC list. Since the non-overt subject of describe in the sentence (7b) is analyzed as Hilde References in the CC generation process, the infinitival clause C. J. Date. 1990. An Introduction to Database m To describe her plan is also retrieved. Systems, Volume 1, Fifth Edition. Addison- Wesley Publishing Company, Inc., Reading, Mas- sachusetts. 3.3.3 Update operations F. Pereira and D. H. D. Warren. 1980. Definite CCML provides two update operations, i.e., clause for language analysis - A sur- Sappend_cc and $delete_cc. These operations are vey of the formalism and a comparison with aug- used to add or delete the specified CCs from the mented transition networks. Artificial Intelligence CC-base indicated. 13:231-278. Sappend_cc

Kawasaki, Takida and Tafima 286 Language Model and Sentence Structure Manipulations m