Lexical markup framework (LMF) for NLP multilingual resources Gil Francopoulo, Nuria Bel, Monte George, Nicoletta Calzolari, Monica Monachini, Mandy Pet, Claudia Soria

To cite this version:

Gil Francopoulo, Nuria Bel, Monte George, Nicoletta Calzolari, Monica Monachini, et al. Lexical markup framework (LMF) for NLP multilingual resources. International Committee on Computational Linguistics and the Association for Computational Linguistics - COLING / ACL 2006, 2006, Sydney, Australia. inria-00121483

HAL Id: inria-00121483 https://hal.inria.fr/inria-00121483 Submitted on 21 Dec 2006

LEXICAL MARKUP FRAMEWORK (LMF)

FOR NLP MULTILINGUAL RESOURCES

Gil Francopoulo¹, Nuria Bel², Monte George³, Nicoletta Calzolari⁴, Monica Monachini⁵, Mandy Pet⁶, Claudia Soria⁷

¹ INRIA-Loria, ² UPF, ³ ANSI, ⁴ CNR-ILC, ⁵ CNR-ILC, ⁶ MITRE, ⁷ CNR-ILC

Abstract

Optimizing the production, maintenance and extension of lexical resources is one of the crucial aspects impacting Natural Language Processing (NLP). A second aspect involves optimizing the process leading to their integration in applications. In this respect, we believe that the production of a consensual specification on multilingual lexicons can be a useful aid for the various NLP actors. Within ISO, one purpose of LMF (ISO-24613) is to define a standard for lexicons that covers multilingual data.

1 Introduction

Lexical Markup Framework (LMF) is a model that provides a common standardized framework for the construction of Natural Language Processing (NLP) lexicons. The goals of LMF are to provide a common model for the creation and use of lexical resources, to manage the exchange of data between and among these resources, and to enable the merging of a large number of individual electronic resources to form extensive global electronic resources.

Types of individual instantiations of LMF can include monolingual, bilingual or multilingual lexical resources. The same specifications are to be used for both small and large lexicons. The descriptions range from morphology, syntax and semantics to translation information, organized as different extensions of an obligatory core package. The model is being developed to cover all natural languages. The range of targeted NLP applications is not restricted. LMF is also used to model machine readable dictionaries (MRD), which are not within the scope of this paper.

2 History and current context

In the past, this subject has been studied and developed by a series of projects like GENELEX [Antoni-Lay], EAGLES, MULTEXT, PAROLE, SIMPLE, ISLE and MILE [Bertagna]. More recently within ISO¹, the standard for terminology management has been successfully elaborated by sub-committee three of ISO-TC37 and published under the name "Terminology Markup Framework" (TMF) with the ISO-16642 reference. Afterwards, the ISO-TC37 national delegations decided to address standards dedicated to NLP. These standards are currently elaborated as high level specifications and deal with word segmentation (ISO 24614), annotations (ISO 24611, 24612 and 24615), feature structures (ISO 24610), and lexicons (ISO 24613), the last of which is the focus of the current paper. These standards are based on low level specifications dedicated to constants, namely data categories (revision of ISO 12620), language codes (ISO 639), script codes (ISO 15924), country codes (ISO 3166), dates (ISO 8601) and Unicode (ISO 10646).

This work is in progress. The two level organization will form a coherent family of standards with the following simple rules:
1) the low level specifications provide standardized constants;
2) the high level specifications provide structural elements that are adorned by the standardized constants.

¹ www.iso.org

3 Scope and challenges

Designing a lexicon model that satisfies every user is not an easy task, but all efforts are directed at elaborating a proposal that fits the major needs of most existing models. To summarise the objectives, let us see what is in scope and what is not.

LMF addresses the following difficult challenges:
• Represent words in languages where multiple orthographies (native scripts or transliterations) are possible, e.g. some Asian languages.
• Represent explicitly (i.e. in extension) the morphology of languages where a description of all inflected forms (from a list of lemmatised forms) is manageable, e.g. English.
• Represent the morphology of languages where a description in extension of all inflected forms is not manageable (e.g. Hungarian). In this case, representation in intension is the only manageable option.
• Easily associate written forms and spoken forms for all languages.
• Represent complex agglutinating compound words, as in German.
• Represent fixed, semi-fixed and flexible multiword expressions.
• Represent specific syntactic behaviors, as in the Eagles recommendations.
• Allow complex argument mapping between syntax and semantic descriptions, as in the Eagles recommendations.
• Allow a semantic organisation based on SynSets (as in WordNet) or on semantic predicates (as in FrameNet).
• Represent large scale multilingual resources based on interlingual pivots or on transfer linking.

LMF does not address the following topics:
• General sentence grammar of a language
• World knowledge representation

In other words, LMF is mainly focused on the linguistic representation of lexical information.

4 Key standards used by LMF

LMF utilizes Unicode in order to represent the orthographies used in lexical entries regardless of language.

Linguistic constants, like /feminine/ or /transitive/, are not defined within LMF but are specified in the Data Category Registry (DCR) that is maintained as a global resource by ISO TC37 in compliance with ISO/IEC 11179-3:2003.

The LMF specification complies with the modeling principles of the Unified Modeling Language (UML) as defined by OMG² [Rumbaugh 2004]. A model is specified by a UML class diagram within a UML package: the class name is not underlined in the diagrams. The various examples of word description are represented by UML instance diagrams: the class name is underlined.

² www.omg.org

5 Structure and core package

LMF comprises two components:
1) The core package consists of a structural skeleton that describes the basic hierarchy of information in a lexical entry.
2) Extensions to the core package are expressed in a framework that describes the reuse of the core components in conjunction with additional components required for the description of the contents of a specific lexical resource.

In the core package, the class called Database represents the entire resource and is a container for one or more lexicons. The Lexicon class is the container for all the lexical entries of the same language within the database. The Lexicon Information class contains administrative information and other general attributes. The Lexical Entry class is a container for managing the top level language components. As a consequence, the number of representatives of single words, multi-word expressions and affixes of the lexicon is equal to the number of lexical entries in a given lexicon. The Form and Sense classes are parts of the Lexical Entry. Form consists of a text string that represents the word. Sense specifies or identifies the meaning and context of the related form. Therefore, the Lexical Entry manages the relationship between sets of related forms and their senses. If there is more than one orthography for the word form (e.g. transliteration), the Form class may be associated with one to many Representation Frames, each of which contains a specific orthography and one to many data categories that describe the attributes of that orthography.

The core package classes are linked by the relations defined in the following UML class diagram:

[UML class diagram of the core package. Classes shown: Database, Lexicon, Lexicon Information, Lexical Entry, Entry Relation, Form, Sense, Sense Relation, Representation Frame.]
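As an illustration of the containment hierarchy described above, the following is a minimal sketch in Python, not part of the LMF specification. The field names (`data_categories`, `written_form`, `identifier`) and the choice to store the language as a data category of Lexicon Information are assumptions made for the example; the multiplicities in the comments follow the prose description of the core package.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RepresentationFrame:
    """One specific orthography plus data categories describing it."""
    written_form: str
    data_categories: Dict[str, str] = field(default_factory=dict)


@dataclass
class Form:
    """Text string that represents the word."""
    text: str
    # Only needed when the form has more than one orthography.
    representation_frames: List[RepresentationFrame] = field(default_factory=list)


@dataclass
class Sense:
    """Identifies one meaning of the related form(s)."""
    identifier: str  # hypothetical attribute name


@dataclass
class LexicalEntry:
    """A single word, multi-word expression or affix."""
    forms: List[Form]                                   # at least one
    senses: List[Sense] = field(default_factory=list)   # possibly none

    def __post_init__(self) -> None:
        if not self.forms:
            raise ValueError("a Lexical Entry needs at least one Form")


@dataclass
class LexiconInformation:
    """Administrative information and other general attributes."""
    data_categories: Dict[str, str] = field(default_factory=dict)


@dataclass
class Lexicon:
    """Container for all lexical entries of one language."""
    information: LexiconInformation
    entries: List[LexicalEntry] = field(default_factory=list)


@dataclass
class Database:
    """The entire resource: a container for one or more lexicons."""
    lexicons: List[Lexicon] = field(default_factory=list)


# Illustrative content only.
river = LexicalEntry(forms=[Form("river")], senses=[Sense("eng.river1")])
english = Lexicon(LexiconInformation({"language": "eng"}), [river])
database = Database([english])
```

Because every Lexical Entry sits inside exactly one Lexicon, counting `len(english.entries)` gives the number of words, multi-word expressions and affixes recorded for that language, as stated in the text.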

The Form class can be sub-classed into Lemmatised Form and Inflected Form classes as follows:

[UML class diagram: Lemmatised Form and Inflected Form are subclasses of Form.]

A subset of the core package classes is extended to cover different kinds of linguistic data. All extensions conform to the LMF core package and cannot be used to represent lexical data independently of the core package. From the point of view of UML, an extension is a UML package. Current extensions for NLP dictionaries are: NLP Morphology³, NLP inflectional paradigm, NLP Multiword Expression pattern, NLP Syntax, NLP Semantic and Multilingual notations, the last of which is the focus of this paper.

³ Morphology, Syntax and Semantic packages are described in [Francopoulo].

6 NLP Multilingual Extension

The NLP multilingual notation extension is dedicated to the description of the mapping between two or more languages in an LMF database. The model is based on the notion of Axis that links Senses, Syntactic Behavior and examples pertaining to different languages. "Axis" is a term taken from the Papillon⁴ project [Sérasset 2001]⁵. Axis instances can be organized at the lexicon manager's convenience in order to link objects of different languages directly or indirectly.

⁴ www.papillon-.org
⁵ To be more precise, Papillon uses the term "axie" from "axis" and "lexie". In the beginning of the LMF project, we used the term "axie" but after some bad comments about using a non-English term in a standard, we decided to use the term "axis".

6.1 Considerations for standardizing multilingual data

The simplest configuration of multilingual data is a bilingual lexicon where a single link is used to represent the translation of a given form/sense pair from one language into another. But a survey of actual practices clearly reveals other requirements that make the model more complex. Consequently, LMF has focused on the following ones:

(i) Cases where a 1-to-1 relation is impossible because of lexical differences among languages. An example is the English word “river”, which relates to the French words “rivière” and “fleuve”, where the latter is used for specifying that the referent is a river that flows into the sea. The bilingual lexicon should specify how these units relate.

(ii) The bilingual lexicon approach should be optimized to allow the easiest management of large databases for real multilingual scenarios. In order to reduce the explosion of links in a multi-bilingual scenario, translation equivalence can be managed through an intermediate "Axis". This object can be shared in order to keep the number of links within manageable proportions.

(iii) The model should cover both transfer and pivot approaches to translation, also taking into account hybrid approaches. In LMF, the pivot approach is implemented by a “Sense Axis”. The transfer approach is implemented by a “Transfer Axis”.

(iv) A situation that is not very easy to deal with is how to represent translations to languages that are similar or variants. The problem arises, for instance, when the task is to represent translations from English to both European Portuguese and Brazilian Portuguese. It is difficult to consider them as two separate languages. In fact, one is a variant of the other. The differences are minor: a certain number of words are different and some limited phenomena in syntax are different. Instead of managing two distinct copies, it is more effective to manage one lexicon with some objects that are marked with a dialectal attribute. Concerning the translation from English to Portuguese: a limited number of specific Axis instances record this variation and the vast majority of Axis instances are shared.

(v) The model should allow for representing the information that restricts or conditions the translations. The representation of tests that combine logical operations upon syntactic and semantic features must be covered.

6.2 Structure

The model is based on the notion of Axis that links Senses, Syntactic Behavior and examples pertaining to different languages. Axis instances can be organized at the lexicon manager's convenience in order to link objects of different languages directly or indirectly. A direct link is implemented by a single axis. An indirect link is implemented by several axes and one or several relations.

The model is based on three main classes: Sense Axis, Transfer Axis, Example Axis.

6.3 Sense Axis

Sense Axis is used to link closely related senses in different languages, under the same assumptions as the interlingual pivot approach. Optionally, it can also be used to refer to one or several external knowledge representation systems.

The use of the Sense Axis facilitates the representation of the translation of words that do not necessarily have the same valence or morphological form in one language as in another. For example, a single word in one language may be translated by a compound word in another language: English “wheelchair” to Spanish “silla de ruedas”. Sense Axis may have the following attributes: a label, the name of an external descriptive system, a reference to a specific node inside an external description.

6.4 Sense Axis Relation

Sense Axis Relation describes the linking between two different Sense Axis instances. The element may have attributes like label, view, etc.

The label enables the coding of simple interlingual relations like the specialization of “fleuve” compared to “rivière” and “river”. It is not, however, the goal of this strategy to code a complex system for knowledge representation, which ideally should be structured as a complete coherent system designed specifically for that purpose.

6.5 Transfer Axis

Transfer Axis is designed to represent the multilingual transfer approach. Here, linkage refers to information contained in syntax. For example, this approach enables the representation of syntactic actants involving inversion, such as (1):

(1) fra:“elle me manque” => eng:“I miss her”

Because a lexical entry can be a support verb, it is possible to represent translations that go from a plain verb to a support verb, like (2), which means "Mary dreams":

(2) fra:“Marie rêve” => jpn:"Marie wa yume wo miru"

6.6 Transfer Axis Relation

Transfer Axis Relation links two Transfer Axis instances. The element may have attributes like: label, variation.

6.7 Source Test and Target Test

Source Test expresses a condition on the translation on the source language side, while Target Test does so on the target language side. Both elements may have attributes like: text and comment.

6.8 Example Axis

Example Axis supplies documentation for sample translations. The purpose is not to record large scale multilingual corpora. The goal is to link a Lexical Entry with a typical example of translation. The element may have attributes like: comment, source.

6.9 Class Model Diagram

The UML class model is a UML package. The diagram for multilingual notations is as follows:

[UML class diagram of the multilingual notations package. Classes shown: Sense, SynSet, Sense Axis, Sense Axis Relation, Syntactic Behavior, Transfer Axis, Transfer Axis Relation, Source Test, Target Test, SenseExample, Example Axis.]

7 Three examples

7.1 First example

The first example is about the interlingual approach with two axis instances to represent a near match between "fleuve" in French and "river" in English. In the diagram, French is located on the left side and English on the right side. The axis on the top is not linked directly to any English sense because this notion does not exist in English.

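[UML instance diagram. Top: a Sense (label = fra:fleuve) linked to a Sense Axis. Middle: a Sense Axis Relation (label = more precise, comment = flows into the sea) connecting the two Sense Axis instances. Bottom: two Senses (label = fra:rivière and label = eng:river) linked to a second Sense Axis.]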
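The difference between a direct link (a single axis) and an indirect link (several axes plus a relation) in this example can be sketched in Python as follows. This is an illustration under stated assumptions rather than an LMF implementation: the class and field names are hypothetical, the "lang:word" label convention is copied from the diagram labels, and the relation is assumed to be owned by the axis of "fleuve" and to point at the shared "rivière"/"river" axis.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Sense:
    label: str  # "lang:word", as in the instance diagram


@dataclass
class SenseAxisRelation:
    target: "SenseAxis"
    label: str = ""
    comment: str = ""


@dataclass
class SenseAxis:
    senses: List[Sense] = field(default_factory=list)
    relations: List[SenseAxisRelation] = field(default_factory=list)


def language_of(sense: Sense) -> str:
    return sense.label.split(":", 1)[0]


def holds(axis: SenseAxis, label: str) -> bool:
    return any(sense.label == label for sense in axis.senses)


def translations(source: str, target_lang: str,
                 axes: List[SenseAxis]) -> List[Tuple[str, Optional[str]]]:
    """Return (target sense label, relation label) pairs.

    The relation label is None for a direct link through a single axis.
    """
    reachable = []
    for axis in axes:
        if holds(axis, source):
            reachable.append((axis, None))  # direct link
            reachable += [(r.target, r.label) for r in axis.relations]
        else:
            # Indirect link: this axis points at an axis holding the source.
            reachable += [(axis, r.label) for r in axis.relations
                          if holds(r.target, source)]
    return [(sense.label, relation)
            for axis, relation in reachable
            for sense in axis.senses
            if language_of(sense) == target_lang]


# The instances of the first example.
shared_axis = SenseAxis([Sense("fra:rivière"), Sense("eng:river")])
fleuve_axis = SenseAxis(
    [Sense("fra:fleuve")],
    [SenseAxisRelation(shared_axis, "more precise", "flows into the sea")],
)
axes = [fleuve_axis, shared_axis]

french = translations("eng:river", "fra", axes)
english = translations("fra:fleuve", "eng", axes)
```

With these instances, "eng:river" reaches "fra:rivière" directly and "fra:fleuve" only through the relation labelled "more precise", while "fra:fleuve" reaches "eng:river" only indirectly, which mirrors the statement that the top axis has no direct English sense.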

7.2 Second example

The second example illustrates the transfer approach applied to slight variations between variants. It involves English on one side and European Portuguese and Brazilian Portuguese on the other side. Because these two variants have a very similar syntax, with only some local exceptions, the goal is to avoid a full and needless duplication. For instance, the nominative forms of the third person clitics are largely preferred in Brazilian Portuguese, whereas European Portuguese uses the oblique form. The transfer axis relations hold a label to distinguish which axis to use depending on the target object.

[UML instance diagram. A Transfer Axis linked to the Syntactic Behavior (label = let me see) is connected through a Transfer Axis Relation (label = European Portuguese) to a Transfer Axis linked to the Syntactic Behavior (label = Deixa-me ver), and through a Transfer Axis Relation (label = Brazilian) to a Transfer Axis linked to the Syntactic Behavior (label = Deixa eu ver).]
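A minimal Python sketch of how the relation label can select the variant-specific axis is given below. The class and function names are hypothetical, and the sketch assumes that the English axis owns both relations; only the labels are taken from the instance diagram.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SyntacticBehavior:
    label: str


@dataclass
class TransferAxisRelation:
    label: str            # names the variant, e.g. "Brazilian"
    target: "TransferAxis"


@dataclass
class TransferAxis:
    behaviors: List[SyntacticBehavior] = field(default_factory=list)
    relations: List[TransferAxisRelation] = field(default_factory=list)


def translate(axis: TransferAxis, variant: str) -> List[str]:
    """Follow the relation whose label names the requested variant."""
    for relation in axis.relations:
        if relation.label == variant:
            return [b.label for b in relation.target.behaviors]
    return []  # no variant-specific axis recorded


european = TransferAxis([SyntacticBehavior("Deixa-me ver")])
brazilian = TransferAxis([SyntacticBehavior("Deixa eu ver")])
english = TransferAxis(
    [SyntacticBehavior("let me see")],
    [
        TransferAxisRelation("European Portuguese", european),
        TransferAxisRelation("Brazilian", brazilian),
    ],
)

print(translate(english, "Brazilian"))
```

Only the constructions that differ between the two variants need such labelled relations; every other Transfer Axis can be shared, which is the point of avoiding two full copies of the lexicon.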

7.3 Third example

A third example shows how to use the Transfer Axis relation to relate different information in a multilingual transfer lexicon. It represents the translation of the English “develop” into Italian and Spanish. Recall that the more general sense links “eng:develop” and “esp:desarrollar”. Both Spanish and Italian have restrictions that should be tested in the source language: if the second argument of the construction refers to certain elements (picture, mentalCreation, building), it should be translated into specific verbs.

[UML instance diagram with four Transfer Axis instances connected by three Transfer Axis Relations. Instances shown, in reading order: a Source Test (semanticRestriction = eng:picture, syntacticArgument = 2) with a Syntactic Behavior (label = esp:revelar); a Source Test (semanticRestriction = eng:mentalCreation, syntacticArgument = 2) with a Syntactic Behavior (label = ita:sviluppare); a Syntactic Behavior (label = eng:develop) with a Syntactic Behavior (label = esp:desarrollar); a Source Test (semanticRestriction = eng:building, syntacticArgument = 2) with a Syntactic Behavior (label = esp:construir) and a Syntactic Behavior (label = ita:costruire).]
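The way a Source Test conditions the choice of translation can be sketched in Python as follows. Several points are assumptions made for the illustration: the pairing of each test with its target verbs follows the reading order of the instance diagram, the Transfer Axis Relations are flattened into a plain list of axes, all tests of an axis are combined by conjunction, and the class, field and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class SourceTest:
    semantic_restriction: str   # e.g. "eng:picture"
    syntactic_argument: int     # position of the tested argument

    def matches(self, arguments: Dict[int, str]) -> bool:
        """True if the argument at the tested position has the required class."""
        return arguments.get(self.syntactic_argument) == self.semantic_restriction


@dataclass
class TransferAxis:
    targets: List[str]                                   # "lang:verb" labels
    source_tests: List[SourceTest] = field(default_factory=list)


def translate(axes: List[TransferAxis], arguments: Dict[int, str],
              target_lang: str) -> List[str]:
    """Prefer axes whose source tests all hold; otherwise use untested axes."""
    specific: List[str] = []
    general: List[str] = []
    for axis in axes:
        labels = [t for t in axis.targets if t.startswith(target_lang + ":")]
        if not labels:
            continue
        if axis.source_tests:
            if all(test.matches(arguments) for test in axis.source_tests):
                specific += labels
        else:
            general += labels
    return specific or general


# Axes for eng:develop, following the assumed reading of the diagram.
develop_axes = [
    TransferAxis(["esp:desarrollar"]),  # general sense, no test
    TransferAxis(["esp:revelar"], [SourceTest("eng:picture", 2)]),
    TransferAxis(["ita:sviluppare"], [SourceTest("eng:mentalCreation", 2)]),
    TransferAxis(["esp:construir", "ita:costruire"],
                 [SourceTest("eng:building", 2)]),
]

print(translate(develop_axes, {2: "eng:picture"}, "esp"))
print(translate(develop_axes, {2: "eng:building"}, "ita"))
```

When no test matches, the sketch falls back to the untested, more general axis, so a second argument outside the three listed classes yields "esp:desarrollar" for Spanish.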

8 LMF in XML

In the version of the LMF document cited as [LMF 2006], a DTD has been provided as an informative annex. The following conventions are adopted:
• each UML attribute is transcoded as a DC (for Data Category) element
• each UML class is transcoded as an XML element
• UML aggregations are transcoded as content inclusion
• UML shared associations (i.e. associations that are not aggregations) are transcoded as IDREF(S)

The first example (i.e. "river") can be represented in XML by applying these conventions.

9 Comparison

A serious comparison with previously existing models is not possible in the current paper due to lack of space. We advise the interested colleague to consult the technical report "Extended examples of lexicons using LMF" located at "http://lirics.loria.fr" in the document area. The report explains how to use LMF in order to represent OLIF-2, Parole/Clips, LC-Star, WordNet, FrameNet and BDéf.

10 Conclusion

In this paper we presented the results of the ongoing research activity on the LMF ISO standard. The design of a common and standardized framework for multilingual lexical databases will contribute to the optimization of the use of lexical resources, especially their reusability for different applications and tasks. Interoperability is the condition for an effective deployment of usable lexical resources.

In order to reach a consensus, the work done has paid attention to the similarities and differences of existing lexicons and the models behind them.

Acknowledgements

The work presented here is partially funded by the EU eContent-22236 LIRICS project⁶, and partially by the TECHNOLANGUE⁷ and OUTILEX⁸ programs.

References

Antoni-Lay M-H., Francopoulo G., Zaysser L. 1994 A generic model for reusable lexicons: the GENELEX project. Literary and Linguistic Computing 9(1) 47-54

Bertagna F., Lenci A., Monachini M., Calzolari N. open issues and MILE perspectives LREC Lisbon

Francopoulo G., George M., Calzolari N., Monachini M., Bel N., Pet M., Soria C. 2006 Lexical Markup Framework (LMF) LREC Genoa

LMF 2006 Lexical Markup Framework ISO-CD24613-revision-9, ISO Geneva

Rumbaugh J., Jacobson I., Booch G. 2004 The unified modeling language reference manual, second edition, Addison Wesley

Sérasset G., Mangeot-Lerebours M. 2001 Papillon Lexical Database project: monolingual dictionaries & interlingual links NLPRS Tokyo

⁶ http://lirics.loria.fr
⁷ www.technolangue.net
⁸ www.at-lci.com/outilex/outilex.html