A Framework for Representing Lexical Resources
Fabrice Issac
LDI, Université Paris 13

Coling 2010: Poster Volume, pages 490–497, Beijing, August 2010

Abstract

Our goal is to propose a description model for the lexicon. We describe a software framework, called Proteus, for representing the lexicon and its variations. Various examples show the different possibilities offered by this tool. We conclude with a demonstration of the use of lexical resources in complex, real examples.

1 Introduction

Natural language processing relies as much on linguistic resources as on methods, algorithms and formal models. Processing textual data involves a classic sequence of steps: it begins with the normalisation of the data and ends with their lexical, syntactic and semantic analysis. Lexical resources are the first to be used and must be of excellent quality.

Traditionally, a lexical resource is represented as a list of inflected forms that are projected onto a text. However, this type of resource cannot take linguistic phenomena into account, as each unit of information is independent. This results in a number of problems when the resource has to be improved or reviewed. Moreover, some languages, such as Arabic, lend themselves less easily to this kind of manipulation because of the potentially large size of their lexicon.

Our goal is to propose a model for the description of the lexicon. After presenting the existing theory and software tools, we introduce a software framework called Proteus, capable of representing the lexicon and its variations. The different possibilities offered by this tool are illustrated through various examples. We conclude with a demonstration of the use of lexical resources in different languages.

2 Context

Whatever the writing system of a language (logographic, syllabic or alphabetic), it seems that the word is a central concept. Nonetheless, the very definition of a word is subject to variation depending on the language studied. For some Asian languages such as Mandarin or Vietnamese, the notion of word delimiter does not exist; for others, such as French or English, the space is a good indicator. Likewise, for some languages, prepositions are included in words, while for others they form a separate unit.

Languages can be classified according to their morphological mechanisms and their complexity; for instance, the morphological systems of French or English are relatively simple compared to those of Arabic or Greek.

There are two main branches of morphology: inflectional (or grammatical) morphology and lexical morphology. The first deals with context-related variations, such as the rules of agreement in gender and number or the conjugation of verbs. The second concerns word formation, generally involving the association of a lexeme with prefixes or suffixes.

3 Tagging

Text tagging consists in adding one or more information units to a group of characters: the token. This association is first performed in a context-free way, that is to say considering only the token, and then by increasing the context size: the tagging process is subsequently repeated in order to merge multiple tokens. Token merging applies to polylexical units and to syntactic or para-textual structures.

We distinguish two main types of resources that can be projected onto raw texts in order to enrich them. The first is a set of inflected forms, each associated with a number of information units (in the example below, the lemma and the morphosyntactic annotation of each form):

    abyssal  abyssal  A--ms
    abysses  abysse   N-mp

Projecting this type of resource onto textual corpora is quite simple. After identifying a token, the program only needs to check whether the token is included in the resource and add the information units associated with it.

The second type of resource contains a set of rules and a set of canonical forms (usually, but not necessarily, the lemma). These sets are used jointly, either to produce all the inflected forms or to analyse the tokens. Analysis consists in determining, for a given inflected form, which rule was applied to which canonical form in order to generate it. The information to be associated with the inflected form is then the information attached to the rule found.

Figure 1 shows where the different resources fit in the tagging process.

Figure 1: tagging schema

4 Tools and resources

Several concepts are related to the use of lexical resources; here we provide some examples of tools, theoretical as well as computational.

• resources in the form of a frozen list: Morphalou (Romary et al., 2004), Morfetik (Buvet et al., 2007), Lexique3 (New, 2006);
• lexical representation formalisms: DATR (Evans and Gazdar, 1996);
• inflection parsers: Flemm (Namer, 2000);
• complete software platforms: NooJ (Silberztein, 2005), Unitex (Paumier, 2002);
• lexicon acquisition: lefff (Sagot et al., 2006).

4.1 Frozen resources

If this kind of resource is used directly in the tagging process, it raises many maintenance issues. Moreover, in the case of languages with rich morphology, the number of elements becomes too large. These lists are most often the output of inflection engines that use canonical forms and inflection rules to generate the inflected forms.

4.2 Hierarchical lexical representation formalisms

The goal of this type of formalism is to represent the most general information at the top level of a hierarchy. An inheritance mechanism transmits, specifies and, if necessary, deletes information along the structure. It is possible to group under one tag a set of morphological phenomena and their exceptions. Multiple inheritance allows a node to inherit from several different hierarchies.

4.3 Inflection parsers

These tools propose a morphological analysis for a given inflected form: they try to apply the derivation rules backwards and test whether the result corresponds to an attested canonical form. Checking against canonical forms is optional; skipping this check makes it possible to analyse lexical neologisms but can produce incorrect results (hundred is not the past participle of hundre).

4.4 Software platforms

Unitex / Intex / NooJ are complete environments for performing linguistic analysis on documents. They are able to project dictionaries onto texts and to query them. They offer a set of tools for managing dictionaries, both lexical and inflectional.

NooJ is the successor of Intex; among its new features is a redesign of the architecture of the dictionaries. It handles simple and compound words in a uniform way. The method combines character manipulation with the use of morphological grammars.

The inflection mechanism is based on classic character-handling operators as well as on word-manipulation operators. Here is a list of some of the operators:

    <B>  delete last character
    <D>  duplicate last character
    <L>  go left
    <R>  go right
    <N>  go to the end of next word form
    <P>  go to the end of previous word form
    <S>  delete next character

5 Representation and structuring of inflections: the Proteus model

We introduce a framework capable of representing and structuring inflections efficiently, not only in terms of resource creation but also in terms of linguistic consistency. At the inflection level we propose a simple multilingual mechanism for simple and compound forms. It is used both to generate simple forms and to analyse them. Regarding the lexicon, the model allows for clusters.

We distinguish three levels:

• the inflection level: determine how to produce a derived form from a base form; the atomic processing unit here is the character (i.e. local transformation).
• the conjugation level: determine how to organise rule families effectively in order to avoid redundancy; the atomic processing unit here is the transformation rule.
• the word level: once the derived form is produced, determine which operation is required to validate the form against non-morphological rules; the processing unit here is the token (i.e. global transformation).

The model was developed to meet the following objectives:

1. A verbatim description of a language does not allow for the analysis of unknown words, even if their inflection is regular. We must therefore develop a mechanism that can be used for both analysis and generation. We will then be able to analyse not only known words but also neologisms.

2. In a lexical database where, for French, the number of elements reaches one million, errors are always possible, indeed inevitable. We must therefore provide an effective maintenance procedure: a dictionary of lemmas linked to a dictionary of inflections, and not a read-only resource containing all inflected forms.

3. The concept of word is so complex that we cannot limit a resource to simple words. The model must integrate the management of both simple and compound words. The only limit we set is syntax: the management of idioms, even if it is fundamental, requires the implementation of other tools.

4. The concept of inflection varies depending on the language. We must build a system capable of dealing with all types of affixation (prefixation, suffixation or infixation). The treatment of Arabic is a good indicator from this point of view, since it uses all three types of affixation.

5. An inflection rule, applied to a canonical form, is never completely autonomous; it is part of a group. For instance, we group together all the inflections of a verb type, for all tenses.

6. The transformation is not limited to morphological changes. For instance, phonological phenomena can occur too. More generally, there are treatments that cannot be modelled with simple rules.

7. The proposed model is based on a set of simple tools; it can easily integrate third-party applications and allows the use of dictionaries built in another environment.

    Step  Word   Stack  Remaining code
    1     céder         3PE/è/3D/ais/
    2     cé     der    E/è/3D/ais/
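To make the distinction drawn in Sections 3 and 4.3 concrete, here is a minimal Python sketch contrasting the two types of resource: a frozen list that is simply looked up, and a set of suffix rules applied forwards (generation) or backwards (analysis). It is not Proteus code. The names FROZEN, LEMMAS, RULES, project, generate and analyse, the rule classes and the V-pp tag are all hypothetical; only the abyssal/abysses entries and the hundred/hundre example come from the text.

```python
# Frozen list: inflected form -> list of (lemma, tag), as in the paper's example.
FROZEN = {
    "abyssal": [("abyssal", "A--ms")],
    "abysses": [("abysse", "N-mp")],
}

# Rule-based resource (hypothetical toy data).
# Each canonical form points to a rule class ...
LEMMAS = {"abysse": "noun_s", "bake": "verb_e"}
# ... and each class lists (suffix removed from the canonical form,
# suffix added to build the inflected form, tag).
RULES = {
    "noun_s": [("", "s", "N-mp")],
    "verb_e": [("e", "ed", "V-pp")],  # V-pp is an invented tag
}


def project(token):
    """Frozen-list projection: a plain dictionary lookup."""
    return FROZEN.get(token, [])


def generate(lemma):
    """Apply every rule of the lemma's class to produce inflected forms."""
    forms = []
    for base_suffix, infl_suffix, tag in RULES[LEMMAS[lemma]]:
        if lemma.endswith(base_suffix):
            stem = lemma[: len(lemma) - len(base_suffix)]
            forms.append((stem + infl_suffix, tag))
    return forms


def analyse(token, require_attested=True):
    """Apply the rules backwards to recover candidate canonical forms.

    With require_attested=False the lemma check is skipped, which lets
    unknown words through but also accepts spurious analyses.
    """
    results = []
    for rule_class, rules in RULES.items():
        for base_suffix, infl_suffix, tag in rules:
            if not token.endswith(infl_suffix):
                continue
            candidate = token[: len(token) - len(infl_suffix)] + base_suffix
            if require_attested and LEMMAS.get(candidate) != rule_class:
                continue
            results.append((candidate, tag))
    return results
```

With the lemma check enabled, analyse("hundred") returns no analysis; with require_attested=False it proposes the unattested form hundre, which is the kind of incorrect result mentioned in Section 4.3.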