LMF for Multilingual, Specialized Lexicons Gil Francopoulo, Monte George, Nicoletta Calzolari, Monica Monachini, Nuria Bel, Mandy Pet, Claudia Soria
Total Page:16
File Type:pdf, Size:1020Kb
LMF for multilingual, specialized lexicons Gil Francopoulo, Monte George, Nicoletta Calzolari, Monica Monachini, Nuria Bel, Mandy Pet, Claudia Soria To cite this version: Gil Francopoulo, Monte George, Nicoletta Calzolari, Monica Monachini, Nuria Bel, et al.. LMF for multilingual, specialized lexicons. International Conference on Language Resources and Evaluation - LREC 2006, elra, 2006, Gênes/Italie. inria-00121478 HAL Id: inria-00121478 https://hal.inria.fr/inria-00121478 Submitted on 21 Dec 2006 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. LMF for multilingual, specialized lexicons Gil Francopoulo1, Monte George2, Nicoletta Calzolari3, Monica Monachini4, Nuria Bel5, Mandy Pet6, Claudia Soria7 1INRIA-Loria: [email protected] 2ANSI: [email protected] 3CNR-ILC: [email protected] 4CNR-ILC: [email protected] 5UPF: [email protected] 6MITRE: [email protected] 7CNR-ILC: [email protected] Abstract Optimizing the production, maintenance and extension of lexical resources is one the crucial aspects impacting Natural Language Processing (NLP). A second aspect involves optimizing the process leading to their integration in applications. With this respect, we believe that the production of a consensual specification on lexicons can be a useful aid for the various NLP actors. Within ISO, the purpose of LMF (ISO-24613) is to define a standard for lexicons that covers multilingual and specialized data. codes (ISO 639), scripts codes (ISO 15924), country 1. Introduction codes (ISO 3166), dates (ISO 8601) and Unicode (ISO Lexical Markup Framework (LMF) is a model that 10646). provides a common standardized framework for the This work is in progress. The two level organization construction of Natural Language Processing (NLP) will form a coherent family of standards with the lexicons. The goals of LMF are to provide a common following simple rules: model for the creation and use of lexical resources, to 1) low level specifications provide standardized manage the exchange of data between and among these constants; resources, and to enable the merging of a large number of 2) high level specifications provide structural individual electronic resources to form extensive global elements that are adorned by the standardized constants. electronic resources. Types of individual instantiations of LMF can include 3. Scope and challenges monolingual, bilingual or multilingual lexical resources. The task of designing a lexicon model that satisfies The same specifications are to be used for both small and every user is not an easy task. But all the efforts are large lexicons. The description range from morphology, directed to elaborate a proposal that fits the major needs of syntax, semantic to translation information organized as most existing models. different extensions of an obligatory core package. The In order to summarise the objectives, let's see what is model is being developed to cover all natural languages. in the scope and what is not. The range of targeted NLP applications is not restricted. LMF addresses the following difficult challenges: LMF is also used to model machine readable dictionaries 1. Represent words in languages where multiple (MRD), which are not within the scope of this paper. orthographies (native or transliterations) are possible, e.g. some Asian languages. 2. History and current context 2. Represent the morphology of languages where a In the past, this subject has been studied and de- description in extension of all inflected forms is veloped by a series of projects like GENELEX [Antoni- not manageable (e.g. Hungarian). In this case, Lay], EAGLES, MULTEXT, PAROLE, SIMPLE , ISLE representation in intension is the only manageable and MILE [Bertagna]. More recently within ISO1 the issue. standard for terminology management has been 3. Easily associate written forms and spoken forms successfully elaborated by the sub-committee ISO-TC37 for all languages. and published under the name "Terminology Markup 4. Represent complex compound words (like in Framework" (TMF) with the ISO-16642 reference. German, Dutch among other languages) Afterwards, the ISO-TC37 National delegations decided 5. Represent fixed, semi-fixed and flexible to address standards dedicated to NLP. These standards multiword expressions. are currently elaborated as high level specifications and 6. Represent specific syntactic behaviors (as deal with word segmentation (ISO 24614), annotations recommended in Eagles). (ISO 24611, 24612 and 24615), feature structures (ISO 7. Allow complex argument mapping between 24610), and lexicons (ISO 24613) with this latest one syntactic and semantic descriptions (as being the focus of the current paper. These standards are recommended in Eagles). based on low level specifications dedicated to constants, 8. Allow a semantic organization based on SynSets namely data categories (revision of ISO 12620), language (like in WordNet) or on semantic predicates (like in FrameNet). 1 www.iso.org 9. Represent large scale multilingual resources based on interlingual pivots or on transfer linking. Dat abas e LMF does not address the following topics: 1. General sentence grammar of a language 1 2. World knowledge representation 1..* 1 Lexicon In other terms, LMF is mainly focused on lexical 1 Lexicon Information linguistic information representation. 1 1..* 4. Key standards used by LMF Lexical Entry 1 0..* Entry Relation 0..* 0..* LMF utilizes Unicode in order to represent the scripts 1 and orthographies used in lexical entries regardless of 1 language. 1..* 0..* 1 0..* Form 1 Sense Sense Relation Linguistic constants, like /feminine/ or /transitive/, are 0..* 0..* not defined within LMF but are specified in the Data 0..* Category Registry (DCR) that is maintained as a global 1 resource by ISO TC37 in compliance with ISO/IEC 0..* 11179-3:2003. Representation Frame The LMF specification complies with the modeling principles of Unified Modeling Language (UML) as defined by OMG2 [Rumbaugh]. A model is specified by a UML class diagram within a UML package: the class Form class can be subclassed into Lemmatised Form name is not underlined. The various examples of word and Inflected Form class as follows: description are represented by UML instance diagrams: the class name is underlined. For m 5. Structure and core package LMF is comprised of two components: 1) The core package which is the structural skeleton Lemmatised Form Inflected Form which describes the basic hierarchy of information in a lexical entry. 2) Extensions to the core package, which are expressed in a framework that describes the re-use of the A subset of the core package classes are extended to core components in conjunction with these additional cover different kinds of linguistic data. All extensions components required for the description of the contents of conform to the LMF core package and cannot be used to a specific lexical resource. represent lexical data independently of the core package. In the core package, one class called Database From the point of view of UML, an extension is a UML represents the entire resource and is a container for one or package. Current extensions for NLP dictionaries are: more lexicons. The Lexicon class is the container for all NLP Morphology, NLP inflectional paradigm, NLP the lexical entries of the same language within the Multiword Expression pattern, NLP Syntax, NLP database. The Lexicon Information class contains Semantic and Multilingual notations, which is the focus of administrative information and other general attributes. this paper. Extensions for Morphology, Syntax and The Lexical Entry class is a container for managing the Semantic extensions are described in [Francopoulo]. All top level language components. As a consequence, the extensions are described in [LMF 2006]. number of representatives of single words, multiword expressions and affixes of the lexicon is equal to the 6. NLP Multilingual extension number of lexical entries in a given lexicon. The Form and Sense classes are parts of the Lexical Entry. Form The NLP multilingual notation extension is dedicated consists of a text string that represents the word. Sense to the description of the mapping between two or more specifies or identifies the meaning and context of the languages in a LMF database. The model is based on the related form. Therefore, the Lexical Entry manages the notion of Axis that links the notions of Sense, Syntactic Behavior and Example pertaining to different languages. relationship between sets of related forms and their senses. 3 If there is more than one orthography for the word form "Axis" is a term taken from the Papillon project (e.g. transliteration) the Form class may be associated [Sérasset]. Axis can be organized at the lexicon manager with one to many Representation Frames, each of which convenience in order to link directly or indirectly objects contains a specific orthography and one to many data of different languages. categories that describe the attributes of that orthography. The core package classes are