<<

A framework for representing lexical resources

Fabrice Issac LDI Universite´ Paris 13

Abstract demonstration of the use of lexical resources in different languages. Our goal is to propose a description model for the lexicon. We describe a software 2 Context framework for representing the lexicon and its variations called Proteus. Various Whatever the writing system of a language (logo- examples show the different possibilities graphic, syllabic or alphabetic), it seems that the offered by this tool. We conclude with is a central concept. Nonetheless, the very a demonstration of the use of lexical re- definition of a word is subject to variation depend- sources in complex, real examples. ing on the language studied. For some Asian lan- guages such as Mandarin or Vietnamese, the no- 1 Introduction tion of word does not exist ; for others, such as French or English, the space is a good in- Natural language processing relies as well on dicator. Likewise, for some languages, preposi- methods, algorithms, or formal models as on lin- tions are included in , while for others they guistic resources. Processing textual data involves form a separate unit. a classic sequence of steps : it begins with the nor- malisation of data and ends with their lexical, syn- Languages can be classified according to their tactic and semantic analysis. Lexical resources are morphological mechanisms and their complexity ; the first to be used and must be of excellent qual- for instance, the morphological systems of French ity. or English are relatively simple compared to that Traditionally, a is represented of Arabic or Greek. as a list of inflected forms that are projected on a There are two main branches of morphology, text. However, this type of resource can not take inflectional or grammatical morphology and lexi- into account linguistic phenomena, as each unit cal morphology. The first one deals with context- of information is independent. This results in a related variations, as the rules of agreement in number of problems regarding the improvement gender and number or the conjugation of verbs. or review of the resource. On the other hand some The second one concerns word formation, gener- languages such as Arabic, because of the potential ally involving the association of a to pre- large lexicon, lends itself less easily to this kind of fixes or suffixes. manipulation. 3 Tagging Our goal is to propose a model for the descrip- tion of the lexicon. After presenting the existing Text tagging consists in adding one or more infor- theory and software tools, we introduce a software mation units to a group of characters : the token. framework called Proteus, capable of represent- This association is firstly performed in a context- ing the lexicon and its variations. The different free way, that is to say considering only the to- possibilities offered by this tool will be illustrated ken, and secondly by increasing the context size : through various examples. We conclude with a the tagging process is subsequently repeated in or-

490 Coling 2010: Poster Volume, pages 490–497, Beijing, August 2010 4 Tools and resources Several concepts are related to the use of lexical resources ; here we provide some examples of tools, theoretical as well as computational.

resources in the form of a frozen list: Mor- • phalou (Romary et al., 2004), Morfetik (Bu- vet et al., 2007), Lexique3 (New, 2006) ;

lexical representation formalisms: • DATR (Evans and Gazdar, 1996) ;

inflections’ parsers: Flemm (Namer, 2000) ; • Figure 1: tagging schema complete software platforms: Nooj (Sil- • berztein, 2005), Unitex (Paumier, 2002) ;

lexicon acquisition: lefff (Sagot et al., 2006). der to merge multiple tokens. Token merging ap- • plies to polylexical units, syntactic or para-textual 4.1 Frozen resources structures. If this kind of resource is directly used in the We distinguish two main types of resources tagging process, it raises many maintenance is- that can be projected on raw texts in order to sues. Moreover, in the case of languages with rich enrich them. The first of these resources is a morphology, the number of elements becomes too set of inflected forms associated with a num- large. These lists are most often the result of in- ber of information units (in the example below, flection engines that use canonical forms and in- the lemma and the morphosyntactic annotation of flection rules to generate the inflected forms. each form) : 4.2 Hierarchical lexical representation abyssal abyssal A--ms formalisms abysses abysse N-mp The goal of this type of formalisms is to represent the most general information at the top-level of The projection of this type of resources in tex- a hierarchy. There is an inheritance mechanism to tual corpora is quite simple. After identifying a transmit, specify, and if necessary, delete informa- token, the program only needs to check if the to- tion along the structure. It is possible to group un- ken is included in the resource and add the infor- der one tag a set of morphological phenomena and mation units associated with it. their exceptions. Multiple inheritance allows a The second type of resources contains a set of node to inherit from several different hierarchies. rules and a set of canonical forms (usually the lemma but not necessarily). These sets are used 4.3 Inflections’ parsers jointly, to produce all the inflected forms or to They propose a morphological analysis for a given analyse the tokens. Analysis consists in determin- inflected form : they try to apply the derivation ing, for a given inflected form, which rule was rules backwards and test whether the result ob- used, on which canonical form, in order to gener- tained corresponds to an attested canonical form. ate it. Then the information to be associated with The use of canonical forms is optional ; it provides the inflected form is related to the rule found. an analysis for lexical neologisms but can cause Diagram 1 presents the place of different re- incorrect results (Hundred is not the past partici- sources in the tagging process. ple of hundre).

491 4.4 Software platforms quired to validate the form against non- Unitex / Intex / NooJ are complete environments morphological rules ; the processing unit for performing linguistic analysis on documents. here is the token (i.e. global transformation). They are able to project dictionaries on texts and The model was developed to meet the following to query them. They offer a set of tools for man- objectives: aging dictionaries, both lexical and inflectional. NooJ is the successor of Intex ; among the new 1. A verbatim description of a language does features, is the redesign of the architecture of not allow for the analysis of unknown words the dictionaries. It proposes handling simple and even if their inflection is regular. We must words in a single way. The method is a therefore develop a mechanism that we can mix of manipulation of characters and use of mor- use for both analysis and generation. Then phological grammars. we will be able to analyse not only known The inflexion mechanism is based on classic words but also neologisms. handling operators as well as on word 2. In a lexical database, where, for French, the manipulation operators. Here is the list of some number of elements reaches one million, the operators: presence of an error is always possible, even delete last character inevitable. We must therefore consider an ef- duplicate last character fective maintenance procedure : a dictionary go left of lemmas linked to a dictionary of inflec- go right tions and not a read-only resource containing go to the end of next word form all inflected forms.

go to the end of previous word form delete next character 3. The concept of word is so complex that we cannot limit a resource to simple words. The model must integrate the management of 5 Representation and structuring of both simple and compound words. The only inflections: the Proteus model limit we set is syntax: the management of id- We introduce a framework capable to represent ioms, even if it is fundamental, requires the and structure inflections efficiently, not only in implementation of other tools. terms of resource creation, but in terms of linguis- 4. The concept of inflection varies depending tic consistency too. At the inflection level we pro- on the language. We must build a system ca- pose a simple multilingual mechanism for simple pable of dealing with all types of affixation and compound forms. It is used both to generate (prefixation, suffixation or infixation). The simple forms and to analyse them. Regarding the treatment of Arabic is from this point of view lexicon, the model allows for clusters. a good indicator since it uses all three types We distinguish three levels: of affixation. the inflection level: determine how to pro- • 5. An inflection rule, applied to a canonical duce a derived form from a base form ; the form, is never completely autonomous ; it is atomic processing unit here is the character part of a group. For instance, we group to- (i.e. local transformation). gether all the inflections of a verb type, for the conjugation level: determine how to or- all tenses. • ganise family rules effectively in order to avoid redundancy ; the atomic processing 6. The transformation is not limited to morpho- unit here is the transformation rule. logical changes. For instance, phonological phenomena can occur too. More generally the word level: once the derived form is there are treatments that cannot be modelled • produced, determine which operation is re- on simple rules.

492 7. The proposed model is based on a set of sim- Step Mot Pile Code restant ple tools, it is able to easily integrate third- 1 ceder´ 3PE/e/3D/ais/` party applications and allows use of dictio- 2 ce´ der E/e/3D/ais/` naries built in another environment. 3 der /e/3D/ais/` 4 ce` der 3D/ais/ 5.1 Inflection description 5 ceder` /ais/ 1 Let f be the inflexion function and f − its inverse 6 cederais` function, then Steps: f(canonicalform, code) = inflectedform 1 2 : stack of three characters → 1 f − (inflectedform, code) = canonicalform 2 3 : deleting of a character → By simplifying the model to the extreme, we 3 4 : adding the character e` use of a rule that generates an inflected form from → its lemma. The form is represented as a list of car- 4 5 : dumping three characters from the stack acters : (i) it shifts characters from the list to a → stack or vice versa (ii) and deletes or inserts char- 5 6 : addition of three characters ais acters in the list. By default, the operations apply → to characters placed at the end of the list of char- This code can inflect french verbs like ceder´ (as acters or, depending on the operator, at the top of rev´ eler´ , esperer´ , . . . ). However, this kind of code the stack. Most operators allow for the applica- does not allow reversing the operation, i.e. find the tion of the inverse function for the analysis of an lemma from the inflection : the E (erase) operator, inflection. Due to the operator based construction, unlike other operators, is not reversible. Therefore the function has the following property: we add another operator to erase this function or let c1, c2 be valid code and x a character string, remove the characters to delete. then f(f(x, c ), c ) = f(x, c c ) \x\ : erase given character from the list (this 1 2 1 • 2 rule cannot be applied if the character is not We now present the different operators. They present) must be sufficiently numerous, to offer the nec- essary expressive power to represent any kind of The code of the previous example becomes inflection, but small enough, not to make the task 3P\e\/´ e/3D/ais/` . of creating the rule too difficult. Since consonant duplication is a common phe- nomenon, we introduce a specific operator : P (Push) : move a character from the list to the stack C (Clone) : duplicates the last character of the D (Dump) : moves the character of the stack to list. the list The code C/ing/ generates the present par- E (Erase) : deletes a character from the list ticiple of words such as run, sit or stop The management of prefixes requires the addi- /x/ : adds the character x at the end of the list tion of operators:

To simplify code writing, it is possible to ] (fill stack) : transfers all the characters from indicate the number of repetitions of an operator. the list to the stack Here is an example of code that generates an inflected form from its lemma. [ (empty stack) : transfers all the characters from the stack to the list

493 Operator ] can prepare an addition at the begin- a name (optional element ) which • ning of a word since all characters are put in the corresponds to the tag associated to the in- stack. We are now able to describe the inflexion flected form ; of the form: move unmovable. The transfor- → information (optional element ) mation ”remove the character ’e’ at the end of a • word, add ’un’ at the beginning and ’able’ end of a about the inflection ; word” is coded \e\]/un/[/able/. The same a Proteus inflection code (element ). code can analyse verb constructions that end with • the character e. However it is often necessary to combine sev- Processing compound words requires the addi- eral transformations : masculine/feminine and tion, or rather the transformation, of an operator. singular/plural for nouns and adjectives, persons The difficulty here is to distinguish the different and tenses for verbs. Take for example the con- components of an expression with respect to one jugation of a French verb in the first group in the or more separators (traditionally in French space, present tense. The prototypical inflection may be hyphen or apostrophe). given as follows: P x (Push): moves the character of the list to the | | Vp stack to meet the character x verbes indicatif present´ Changing the stacking operator allows us to ac- 1s cess directly an element of an expression or a /e/ compound word. Please note that access to dif- ferent elements of an expression is achieved by 2s stacking and unstacking successively. The code /es/ 2P|-|/s/[ allows to form the plural of expres- sions such as brother-in-law : only the third word 3s from the end is pluralized (brothers-in-law). To /e/ preserve the analysis function of the model, it would be necessary to add, symmetrically, a con- 1p ditional popping operator (e.g. D|x|). However, /ons/ compound words analysis is far more complex, and such an operator could not bring the solution. 2p /ez/ 5.2 Management of inflexion We have defined an XML DTD to manage the in- 3p flections expressed in code. /ent/ Np Noun plural with In this structure we regroup all the inflections a terminal y \y\/ies/ of a given tense. Each inflection is : associated <\flex> to its own identifier, prefixed with the main iden- tifier, separated with a point, and associated to a The above definition associates n-y-p identifier name which is also a concatenation. Note that it with the code \y\/ies/. A typical inflexion is is the identifier that must be unique and not the characterised by: name. This mechanism allows for the expression an identifier (attribute id) is used by the de- of variants in a paradigm (see below). The pre- • scription language ; vious definition states that, for the first group of the present tense, French verbs require suffixes at a status (optional attribute type); the end of the canonical form. Note that this is •

494 a generic definition that can take into account ex- ceptions, and can be apllied to any tense or mood. add e after a g identifier name code ]5D/e/[ v1ip.p1ns Vip1s /e/ v1ip.p2ns Vip2s /es/ ]5D/e/[ v1ip.p3ns Vip3s /e/ v1ip.p1np Vip1p /ons/ ]5D/e/[ v1ip.p2np Vip2p /ez/ v1ip.p3np Vip3p /ent/ The previous definition transforms code as It is also possible to group inflections with a \er\/ons/ in \er\/eons/, \re\/ais/ in new element ( with the attribute type). \re\/eais/, . . . The mask is used in combina- tion with the attribute mask in a inflection defini- first group tion, as in the conc attribute. indicative verbes en "er" To the previous definitions we need to modify You can build a complex inflection by using a the code in order to add a prefix operation: remove base and applying masks successively. The inflec- the ’er’ at the end of the lemma. So we added the tion of the French verb neiger (to snow) can be ex- possibility of code concatenation to a previously pressed using two masks. First a mask to take into defined group. In the example bellow the pos account the pronunciation of the [ÿ] and a second attribute determines if the code to be added is a one, the weather verb mask, which is only used in prefix (p) or a suffix (s). The value attribute in- the third singular person. dicates the identifier of the structure upon which the operation is applied. 5.3 Applications The examples bellow show the different capabili- "er" verb ties of the model. \re\ 5.3.1 Neologisms in French The formation of French inflections should not create significant problems. For the most com- In some cases, modification has to be per- mon languages, simple forms are less than 1 mil- formed on a particular inflection. This is done lion. Therefore, most systems use this set of in- via the application of a mask which operates on a flected forms and supply modules to guess un- group of inflections and changes, possibly selec- known words, only when they arise. This experi- tively, codes of inflexion. A mask is a set of rules ment is used to validate the model and to analyse applied on code. A on the iden- unknown forms. tifier (ervalue attribute) performs the selection. The example below shows how we analyse an We use Proteus code to modify Proteus code. This unknown form. mise en abyme seems inconsistent, since Proteus anticonservationnistes =(/s/)=> has been designed to apply on a language element. anticonservationniste =(]/anti/[)=> But it seemed inappropriate to introduce a new conservationniste =(/niste/)=> syntax. conservation =(\e\/ation/)=> The definition below allows to add the letter e conserve to a form in order to maintain its pronunciation The algorithm tries to apply a code and reiter- [ÿ]. ates the process on the result until we obtain an

495 attested form. The set of rules provides a poten- 5.3.3 Old French tial analysis of the unknown word. Note that the We are developping (author reference) an Old rules used allow to determine the part of speech. French resource, as exhaustive as possible. One In the example, the analysed word can be a plural difficulty is to consider the various alterna- noun or an adjective. tives, dialectal or chronological. This proto- morphological problem complicates the develop- 5.3.2 Arabic verbs ment of the dictionary nomenclature. We solved Arabic is a Semitic language ; it uses a semitic this problem by introducing an arbitrary ”lan- root to derive all the words used. For example guage Phantom” and by adding one level to the from the root  (which refers to the writing) it I. J» composition of the nomenclature, in the form of is possible to produce verb (write), noun (desk) or a label named hyperlemma. All derivations are adjective (written). from this entity using Proteus rules. All variants With these lemmas it is possible to agglutinate are generated from this item by application of suc- prefixes and suffixes. The rules are very regular cessive masks. in morphology but also very productive. We build The example below shows the successive masks all inflexion from the semitic root. So we have the applied on the inflection rules to account for the schema: root radical inflected form. We → → variations of the imperfect tense. Each mask is then define inflexion for prefix/suffix (identifier named modifxxx and corresponds to the modifica- pass1term), a mask for the radical (identifier tion of the Proteus rule for each century xxx. pass1radical ) and a definition which com- bine both. Vii first group imparfait add radical past ]/2P\/´\/D\/´\/D/[+

6 Implementation Vis passe This framework is not only a theoretical tool ; it is designed to be implemented in a tagging software as an autonomous module. Based on abstract de- scriptions (Proteus code and XML language), it The problem encounter with arab text is the allows the resource creator to focus on linguistic possible non use of all vowels. In fact they are aspects. It is simple enough to be easily expressed rarely used, generally in pedagogical or religious in any computational language. text. This mean that the context is fundamental The platform described here is developed in to interpret a text, a vowel is added only to re- Python, which allows a very compact coding and move an ambiguity. However we decide to de- can be used for both generation and analysis. scribe language fully vowelled and to manage this specificity in an earlier stage. 7 Conclusion The objective is to provide a resource used dur- Our work is part of a set of tools and resources ing an lexical analysis. This can be done in two dedicated to the analysis of natural language. We 1 ways : have presented a model for the representation of 1We used here a transliteration version of arab writing to inflections coupled with a language to structure be more clear. the transformation rules. Compound words are

496 handled in the same way as simple words. The Paumier, Sebastien,´ 2002. Manuel d’utilisation du proposed model also allows simple word identi- logiciel Unitex. Universite´ de Marne-la-Vallee.´ fication in both analysis and resource generation Romary, Laurent, Susanne Salmon-Alt, and Gil Fran- functionality. We have presented three examples copoulo. 2004. Standards going concrete : from of the use of the model, each introducing a speci- lmf to morphalou. In Zock, Michael, editor, COL- ficity: French, Old French and Arabic. In the near ING 2004 Enhancing and using electronic dictio- naries, pages 22–28, Geneva, Switzerland, August future we expect to begin work on the Korean, 29th. COLING. Polish and Greek. The best way to improve the framework is to Sagot, Benoˆıt, Lionel Clement,´ Eric´ Villemonte de la Clergerie, and Pierre Boullier. 2006. The lefff 2 create real, i.e. exhaustive linguistic resources. syntactic lexicon for french: architecture, acquisi- The development of the framework can be con- tion, use. In LREC’06,Genes.ˆ sidered from several ways. The Proteus code and the XML language de- Silberztein, Max. 2005. NooJ’s dictionaries. In Ve- tulani, Zygmunt, editor, LTC’05, pages 291–295, scription need stability. In our opinion, addition Poznan,´ Poland, April. of operations to take into account some language specificities would complicate the model without adding any significant improvement. These modi- fications will take place during the third stage, the word level, where post-treatments are applied. For instance, the tonic accent in Greek can move along the last three syllables of a word and affects the use of the diaeresis mark in diphthongs. As far as the analysis functionnality is con- cerned, we are considering to develop specific heuristics for each language in order to guide the choice of rules.

References Buvet, Pierre-Andre,´ Emmanuel Cartier, Fabrice Issac, and Salah Mejri. 2007. Dictionnaires electroniques´ et etiquetage´ syntactico-semantique.´ In Hathout, Nabil and Philippe Muller, editors, Actes des 14e journees´ sur le Traitement Automatique des Langues Naturelles, pages 239–248, Toulouse. IRIT Press.

Evans, Roger and Gerald Gazdar. 1996. Datr: A lan- guage for lexical knowledge representation. Com- putational Linguistics, 22(2):167–216.

Namer, F. 2000. Flemm : Un analyseur flexionnel du franc¸ais a` base de regles.` Revue Traitement Au- tomatique des Langues, 41(2).

New, Boris. 2006. Lexique 3 : Une nouvelle base de donnees´ lexicales. In Mertens, P., C. Fairon, A. Dis- ter, and P. Watrin, editors, Verbum ex machina. Actes de la 13e conference´ sur le Traitement au- tomatique des langues naturelles, Cahiers du Cen- tal 2,2, Louvain-la-Neuve. Presses universitaires de Louvain.

497