Lexical Markup Framework (LMF) for NLP Multilingual Resources

Gil Francopoulo (INRIA-Loria), Nuria Bel (UPF), Monte George (ANSI), Nicoletta Calzolari (CNR-ILC), Monica Monachini (CNR-ILC), Mandy Pet (MITRE), Claudia Soria (CNR-ILC)

To cite this version: Gil Francopoulo, Nuria Bel, Monte George, Nicoletta Calzolari, Monica Monachini, et al. Lexical markup framework (LMF) for NLP multilingual resources. International Committee on Computational Linguistic and the Association for Computational Linguistics - COLING / ACL 2006, 2006, Sydney, Australia. inria-00121483

HAL Id: inria-00121483
https://hal.inria.fr/inria-00121483
Submitted on 21 Dec 2006

Abstract

Optimizing the production, maintenance and extension of lexical resources is one of the crucial aspects impacting Natural Language Processing (NLP). A second aspect involves optimizing the process leading to their integration in applications. In this respect, we believe that the production of a consensual specification on multilingual lexicons can be a useful aid for the various NLP actors. Within ISO, one purpose of LMF (ISO-24613) is to define a standard for lexicons that covers multilingual data.

1 Introduction

Lexical Markup Framework (LMF) is a model that provides a common standardized framework for the construction of Natural Language Processing (NLP) lexicons. The goals of LMF are to provide a common model for the creation and use of lexical resources, to manage the exchange of data between and among these resources, and to enable the merging of a large number of individual electronic resources to form extensive global electronic resources.

Types of individual instantiations of LMF can include monolingual, bilingual or multilingual lexical resources. The same specifications are to be used for both small and large lexicons. The descriptions range from morphology, syntax and semantics to translation information, organized as different extensions of an obligatory core package. The model is being developed to cover all natural languages. The range of targeted NLP applications is not restricted. LMF is also used to model machine readable dictionaries (MRD), which are not within the scope of this paper.

2 History and current context

In the past, this subject has been studied and developed by a series of projects like GENELEX [Antoni-Lay], EAGLES, MULTEXT, PAROLE, SIMPLE, ISLE and MILE [Bertagna]. More recently, within ISO¹, the standard for terminology management has been successfully elaborated by sub-committee three of ISO-TC37 and published under the name "Terminology Markup Framework" (TMF) with the ISO-16642 reference. Afterwards, the ISO-TC37 national delegations decided to address standards dedicated to NLP. These standards are currently being elaborated as high level specifications and deal with word segmentation (ISO 24614), annotations (ISO 24611, 24612 and 24615), feature structures (ISO 24610), and lexicons (ISO 24613), the last being the focus of the current paper. These standards are based on low level specifications dedicated to constants, namely data categories (revision of ISO 12620), language codes (ISO 639), script codes (ISO 15924), country codes (ISO 3166), dates (ISO 8601) and Unicode (ISO 10646).

This work is in progress. The two level organization will form a coherent family of standards with the following simple rules:
1) the low level specifications provide standardized constants;
2) the high level specifications provide structural elements that are adorned by the standardized constants.

¹ www.iso.org

3 Scope and challenges

Designing a lexicon model that satisfies every user is not an easy task, but all efforts are directed at elaborating a proposal that fits the major needs of most existing models.

In order to summarise the objectives, let's see what is in the scope and what is not.

LMF addresses the following difficult challenges:
• Represent words in languages where multiple orthographies (native scripts or transliterations) are possible, e.g. some Asian languages.
• Represent explicitly (i.e. in extension) the morphology of languages where a description of all inflected forms (from a list of lemmatised forms) is manageable, e.g. English.
• Represent the morphology of languages where a description in extension of all inflected forms is not manageable (e.g. Hungarian). In this case, representation in intension is the only manageable option.
• Easily associate written forms and spoken forms for all languages.
• Represent complex agglutinating compound words, as in German.
• Represent fixed, semi-fixed and flexible multiword expressions.
• Represent specific syntactic behaviors, as in the Eagles recommendations.
• Allow complex argument mapping between syntax and semantic descriptions, as in the Eagles recommendations.
• Allow a semantic organisation based on SynSets (as in WordNet) or on semantic predicates (as in FrameNet).
• Represent large scale multilingual resources based on interlingual pivots or on transfer linking.

LMF does not address the following topics:
• General sentence grammar of a language
• World knowledge representation

In other words, LMF is mainly focused on the linguistic representation of lexical information.

4 Key standards used by LMF

LMF utilizes Unicode in order to represent the orthographies used in lexical entries regardless of language.

Linguistic constants, like /feminine/ or /transitive/, are not defined within LMF but are specified in the Data Category Registry (DCR) that is maintained as a global resource by ISO TC37 in compliance with ISO/IEC 11179-3:2003.

The LMF specification complies with the modeling principles of Unified Modeling Language (UML) as defined by OMG² [Rumbaugh 2004]. A model is specified by a UML class diagram within a UML package: the class name is not underlined in the diagrams. The various examples of word description are represented by UML instance diagrams: the class name is underlined.

² www.omg.org

5 Structure and core package

LMF comprises two components:
1) The core package consists of a structural skeleton that describes the basic hierarchy of information in a lexical entry.
2) Extensions to the core package are expressed in a framework that describes the reuse of the core components in conjunction with additional components required for the description of the contents of a specific lexical resource.

In the core package, the class called Database represents the entire resource and is a container for one or more lexicons. The Lexicon class is the container for all the lexical entries of the same language within the database. The Lexicon Information class contains administrative information and other general attributes. The Lexical Entry class is a container for managing the top level language components. As a consequence, the number of representatives of single words, multi-word expressions and affixes of the lexicon is equal to the number of lexical entries in a given lexicon. The Form and Sense classes are parts of the Lexical Entry. Form consists of a text string that represents the word. Sense specifies or identifies the meaning and context of the related form. Therefore, the Lexical Entry manages the relationship between sets of related forms and their senses. If there is more than one orthography for the word form (e.g. transliteration), the Form class may be associated with one to many Representation Frames, each of which contains a specific orthography and one to many data categories that describe the attributes of that orthography.

The core package classes are linked by the relations as defined in the following UML class diagram:

[UML class diagram of the core package. Classes: Database, Lexicon, Lexicon Information, Lexical Entry, Entry Relation, Form, Sense, Sense Relation, Representation Frame]

The Form class can be sub-classed into Lemmatised Form and Inflected Form classes as follows:

[UML class diagram: Lemmatised Form and Inflected Form as subclasses of Form]

A subset of the core package classes is extended to cover different kinds of linguistic data. All extensions conform to the LMF core package and cannot be used to represent lexical data independently of the core package. From the point of view of UML, an extension is a UML package. Current extensions for NLP dictionaries are: NLP Morphology³, NLP inflectional paradigm, NLP Multiword Expression pattern, NLP Syntax, NLP Semantic and Multilingual notations, the last of which is the focus of this paper.

³ Morphology, Syntax and Semantic packages are described in [Francopoulo].

6 NLP Multilingual Extension

The NLP multilingual notation extension is dedicated to the description of the mapping between two or more languages in an LMF database. The model is based on the notion of Axis that links Senses, Syntactic Behavior and examples pertaining to different languages. "Axis" is a term taken from the Papillon⁴ project [Sérasset 2001]⁵. Axes can be organized at the lexicon manager's convenience in order to link, directly or indirectly, objects of different languages.

…sider them as two separate languages. In fact, one is a variant of the other. The differences are minor: a certain number of words are different and…
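
As an illustration of the core package described in Section 5, the containment hierarchy can be sketched as plain Python data classes. This is only a minimal sketch under stated assumptions, not the normative LMF model: the attribute names (`text`, `orthography`, `data_categories`, `sense_id`, `gloss`, `language`), the `"script"` key and the sample entry are hypothetical and chosen for readability. Only the class names and the "one or more lexicons" constraint come from the text above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RepresentationFrame:
    """One orthography of a form, plus data categories describing it."""
    orthography: str
    data_categories: Dict[str, str] = field(default_factory=dict)  # hypothetical keys


@dataclass
class Form:
    """Text string representing the word, with optional alternative orthographies."""
    text: str
    representation_frames: List[RepresentationFrame] = field(default_factory=list)


@dataclass
class Sense:
    """Identifies a meaning of the related form (identifier scheme is hypothetical)."""
    sense_id: str
    gloss: str = ""


@dataclass
class LexicalEntry:
    """Manages the relationship between related forms and their senses."""
    forms: List[Form] = field(default_factory=list)
    senses: List[Sense] = field(default_factory=list)


@dataclass
class LexiconInformation:
    """Administrative information and other general attributes."""
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class Lexicon:
    """Container for all lexical entries of one language."""
    language: str  # assumed here to be an ISO 639 language code
    information: LexiconInformation = field(default_factory=LexiconInformation)
    entries: List[LexicalEntry] = field(default_factory=list)


@dataclass
class Database:
    """The entire resource: a container for one or more lexicons."""
    lexicons: List[Lexicon] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.lexicons:
            raise ValueError("a Database must contain at least one Lexicon")


# Illustrative entry with two orthographies (native script and transliteration).
entry = LexicalEntry(
    forms=[Form(
        text="東京",
        representation_frames=[
            RepresentationFrame("東京", {"script": "Jpan"}),
            RepresentationFrame("Tōkyō", {"script": "Latn"}),
        ],
    )],
    senses=[Sense(sense_id="tokyo_1", gloss="capital city of Japan")],
)
db = Database(lexicons=[Lexicon(language="ja", entries=[entry])])
```

In this sketch the number of lexical entries of a lexicon is simply `len(lexicon.entries)`, which mirrors the remark that each single word, multi-word expression or affix is represented by exactly one Lexical Entry. Entry Relation, Sense Relation and the Form subclasses are left out for brevity.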
