
project note with software demonstration

A parser without a dictionary as a tool for research into French syntax

Jacques VERGNE
LIMSI
29 rue Titon
F-75011 Paris
France

A natural language is not a formal language

That is why the syntax of a natural language cannot be described by the rules of the syntax of a formal language. And that is why a natural language cannot be parsed the same way as a formal language. To parse a natural language, it would be necessary to know its syntax. Thus, major difficulties in parsing a natural language are not algorithmic but linguistic.

Here, I assume that natural language has a very high formal redundancy. In other words, morpho-syntactic clues are numerous enough to make deductions upon categories and relations in many converging ways.

Research into French syntax then consists in discovering these clues: morphology of words, agreements, relative positions of elements, segmentation in NPs, or in larger segments, typology of relations, structure of the relations net, formal and quantitative constraints upon this net.

The parser presented here is an experimental device used to test and confirm the linguistic hypotheses upon large text corpora. The formal redundancy is high enough to parse without a dictionary with very few data (300 ending rules and a lexicon of grammatical words of 4 kB) and a NP stack grammar.

Linguistic hypotheses

A sentence is considered as a stack of NPs

In our Western grammar traditions, action is the main interest and we place verbs in a central position. But, in scientific texts, NPs are the main carrier of meaning, as they name concepts. They are also used as terms for indexation. From the statistical point of view, too, nouns are by far the most numerous category.

That is the reason why the basis for this parser is that the sentence is considered as a stack of NPs, with relations of determination between them. An initial NP is laid (it is often the focus of the sentence), and the following phrases specify and determine it. Here determination is considered as adding more data. These different NPs are connected more or less tightly by "links":
• "non verbal links": prepositions, co-ordinations, subordinating conjunctions, relative pronouns,
• verbs in all their forms: conjugated, present and past participles, infinitives.

Thus the processing of the verb is unified in such a manner that every verb, conjugated or not, of a main or subordinated clause, acts as an adjective (is "translated" to adjective, according to Tesnière's concept in [Tesnière 59]), determines a NP and is a link in the "chain" of NPs.

Therefore, the NP is considered as the basic and constitutive element of the sentence (as the cell is the basic and constitutive element of the living tissue). The inside and the outside of the NP are processed separately and differently, so we have:
• a grammar of the NP, centered on a nominalized word (like the inside structure of the cell, centered on its nucleus),
• and a grammar of the sentence, outside the NP, which is expressed in terms of NPs, each of them considered as a closed entity (like the structure of the tissue expressed as an architecture of cells). Let us name this grammar the "NP stack grammar" or "NPSG".

Inside the noun phrase

The internal grammar of the NP is centered on a nominalized word: a noun, a nominalized adjective or a verb. It reigns over its determinant and adjectives at the root of a dependency-determination tree.

The branches of the tree are made of partitive, determinant, indefinite adjective, anteposed adjective (very rare in scientific texts) before the nominalized word, and contiguous or co-ordinated postposed adjectives after the nominalized word. The nominalized word is determined and qualified by the other words of the NP. This grammar has been presented for instance in [Vergne 86] and developed in [Vergne 89].

Outside the noun phrase

The NP stack grammar, or NPSG, is used to:
• validate the sentence structure as a stack of NPs,
• confirm the function of words external to NPs,
• compute some tight relations external to NPs, such as verb-object.

Valuation functions are used to compute other, looser relations external to NPs, such as prepositional phrase attachment or co-ordinations.

Relations typology

I propose to distinguish three types of relations:

-1- relations internal to the NP (mainly the determination noun ← adjective). These relations are computed during the internal analysis of a NP.

-2- relations external to the NP, but internal to a recognition pattern, as for instance the relations subject ← verb and verb ← object in a SVO sentence. These relations are computed at recognition time, during the validation of the sentence at the NP level. This computation is algorithmic.

Nota bene: to avoid Chomsky's term "governor", I propose the term "reigner" for Tesnière's concept "régissant" (in [Tesnière 59]). They have the same etymology, and both are verbal derivatives: the reigner reigns over its dependents.

-3- relations external to the NP and external to the recognition patterns, as for instance the relations "reigner" ← prepositional phrase (PP): the PPs are recognized by a pattern of the form preposition-NP, a pattern which does not contain the reigner, which is either a verbal "link" or a determined NP. These relations are computed by valuation functions. The computation then is heuristic.

The three types of relations are determination relations. They proceed from the tighter ones, inside the NP (-1-), to the looser ones, outside the NP and outside the recognition patterns (-3-).

The NP stack grammar

Validating the NP stack pattern

At the beginning of this step of the parsing, the sentence is represented by a pattern made of a sequence of letters, in which each letter represents either a NP or a word external to the NP (preposition, verb, for instance).

It is possible to imagine this pattern as an artichoke, made of leaves around the heart. Validating the pattern then consists in plucking off parts of it progressively:
• the leaves are replaced or removed in a precise order, until the heart is reached:
  - leaves are replaced by context sensitive rules which erase negations, adverbs, auxiliaries;
  - leaves are removed by context free rules which remove everything else but the heart;
• simultaneously, each time a leaf is replaced or removed, the relations internal to this leaf are computed (relations of type -2- in the typology exposed above).

That is the reason why I name this parser "plucking off parser" or "POP".

The final state of the pattern once plucked off must be one of the different possible hearts. They are:
• a NP alone,
• a NP determined by other NPs through a conjugated verb,
• or a NP determined by an attribute, a past participle or a NP through an auxiliary "être".

Transposing relations internal to a leaf by simulated reclothing

Relations which are internal to a leaf pattern must be transposed into the entire NP level pattern. From the positions in a leaf pattern, we are able to compute the positions in the entire NP level pattern.

To retrieve these absolute positions, we have only to simulate the reclothing of the heart, by using the historical account of the plucking off. After simulated reclothing (by applying the rules in the reverse order), we obtain the absolute positions in the entire NP level pattern.

In a later step of the parsing, after the internal analysis of NPs, these relations will be transposed into the word level pattern.

In such a way, all relations internal to a leaf pattern are computed inside the leaf pattern, then transposed by simulated reclothing into the entire NP level pattern, then at last transposed into the word level pattern.

These two transpositions may be seen as reference point changes, from a relative position in the leaf pattern (NP level) to an absolute position in the entire pattern (word level).

Valuation functions: a heuristic way to choose

Valuation functions have been described in detail in [Vergne 89].

Principle

A valuation function is a clear and fine way to express a heuristic, when criteria are too fuzzy to make a choice with an algorithm (a binary tree of "if then else" for instance).

The objective is to make an automatic choice without an algorithm. The principle is the following:
• Determine the objects to valuate: the candidates distinguished from the non candidates.
• Quantify the valuation of the candidates, using criteria to discriminate them. The criteria represent the knowledge we have about the phenomenon.
• The candidate who obtained the highest valuation is chosen.

When to use valuation functions

Valuation functions are used to compute the relations of type -3-:
• to search for the "reigner" of a nominal or infinitive PP, of a present participle or of a gerundive introduced by "en", of a past participle, of a subordinated clause;
• to search for the left co-ordinated of a NP, of a nominal or infinitive PP, of an infinitive, of an attribute, of a main or subordinated clause;
• to search for a referent which agrees with an anaphoric.

Using valuation functions with formal criteria is based on the hypothesis of the high formal redundancy of natural language.
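The plucking off of the NP level pattern and the simulated reclothing described above can be illustrated by a minimal sketch. It is not the author's Turbo Pascal implementation: the letter alphabet, the three rules, the two hearts and the names `RULES`, `HEARTS`, `reclothe` and `validate` are all assumptions made for the illustration. Each removal is recorded in a history, and a relation computed at some step is mapped back to the original pattern by undoing the earlier removals in reverse order.

```python
import re

# Hypothetical alphabet for the NP level pattern (not taken from the paper):
# N = noun phrase, V = conjugated verb, P = preposition, A = adverb, G = negation.
# Hypothetical rules, tried in this order. The group "leaf" is plucked off;
# an optional group "anchor" marks the symbol the leaf is related to.
RULES = [
    ("negation", re.compile(r"(?P<anchor>V)(?P<leaf>G)")),  # context sensitive
    ("adverb", re.compile(r"(?P<anchor>V)(?P<leaf>A)")),    # context sensitive
    ("prepositional phrase", re.compile(r"(?P<leaf>PN)")),  # context free
]
# Hypothetical hearts with their internal relations (name, position, position).
HEARTS = {
    "N": [],
    "NVN": [("verb-subject", 1, 0), ("verb-object", 1, 2)],
}


def reclothe(position, step, history):
    """Map a position valid just before `step` back to the original pattern
    by undoing the earlier removals in reverse order."""
    for start, length in reversed(history[:step]):
        if position >= start:
            position += length
    return position


def validate(pattern):
    """Pluck the pattern down to a heart.

    Returns (heart, relations), with relations as (name, position, position)
    in the coordinates of the original pattern, or (None, []) on failure.
    """
    history = []    # (start, length) of each removed leaf, in plucking order
    relations = []  # (step, name, position, position), relative to that step
    while pattern not in HEARTS:
        for name, rule in RULES:
            match = rule.search(pattern)
            if match:
                start, end = match.span("leaf")
                if "anchor" in rule.groupindex:
                    relations.append(
                        (len(history), name, match.start("anchor"), start)
                    )
                history.append((start, end - start))
                pattern = pattern[:start] + pattern[end:]
                break
        else:
            return None, []  # no rule applies and no heart was reached
    for name, a, b in HEARTS[pattern]:
        relations.append((len(history), name, a, b))
    absolute = [
        (name, reclothe(a, step, history), reclothe(b, step, history))
        for step, name, a, b in relations
    ]
    return pattern, absolute


if __name__ == "__main__":
    print(validate("NVGANPN"))
```

In this sketch the pattern `NVGANPN` is reduced to the heart `NVN`; the verb-object relation, found at positions 1 and 2 of the heart, is transposed to positions 1 and 4 of the original pattern. The prepositional phrase is removed without computing a relation, since its attachment is left to valuation functions.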
Attaching prepositional phrases

Principle: At the beginning of the computation, a "power to reign" is assigned to each word according to its category and its possible verbal derivational origin. This computation is thought of as the simulation of the conflicts that words have between them to reign over other words.

Other features of the parser

Technical realization
• programming language: Turbo Pascal
• machine: Macintosh
• source size: ≈ 16 000 lines
• code size: ≈ 340 kB
• research and development: ≈ 3 man-years
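The valuation principle (separate the candidates, quantify them, choose the highest) and the "power to reign" used for prepositional phrase attachment can be illustrated by a minimal sketch. The paper gives neither the criteria nor their weights, so the table `POWER_TO_REIGN`, the constant `DISTANCE_PENALTY`, the scoring formula and the function names below are all hypothetical; only the three-step shape follows the text.

```python
# Hypothetical "power to reign" per category; the real values are not given
# in the paper.
POWER_TO_REIGN = {"verb": 3.0, "NP": 1.0}
# Hypothetical cost per element separating a candidate from the PP.
DISTANCE_PENALTY = 1.0


def valuation(elements, candidate, pp_index):
    """Score one candidate reigner for the PP starting at pp_index."""
    category = elements[candidate][1]
    distance = pp_index - candidate
    return POWER_TO_REIGN[category] - DISTANCE_PENALTY * distance


def choose_reigner(elements, pp_index):
    """Return the index of the best-valued candidate left of the PP, or None."""
    # Step 1: separate the candidates from the non candidates.
    candidates = [
        i for i in range(pp_index) if elements[i][1] in POWER_TO_REIGN
    ]
    if not candidates:
        return None
    # Steps 2 and 3: quantify each candidate, keep the highest valuation.
    return max(candidates, key=lambda i: valuation(elements, i, pp_index))


if __name__ == "__main__":
    sentence = [
        ("the parser", "NP"),
        ("uses", "verb"),
        ("rules", "NP"),
        ("in", "preposition"),
        ("the text", "NP"),
    ]
    print(choose_reigner(sentence, 3))
```

With these assumed weights the verb wins over the nearer NP because its higher power to reign outweighs the distance penalty. Changing the table or the penalty changes the choice without touching the selection logic, which is the practical appeal of expressing the heuristic as a valuation rather than as a tree of "if then else" tests.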