Language Resource Management—Lexical Markup Framework (LMF)


Reference number of working document: ISO/TC 37/SC 4 N130 Rev.5

Date: 2005-03-19

Working draft of ISO WD 24613:2005

Committee identification: ISO/TC 37/SC 4

Secretariat: KATS

Language resource management—Lexical markup framework (LMF)

Warning

This document is not an ISO International Standard. It is distributed for review and comment. It is subject to change without notice and may not be referred to as an International Standard.

Recipients of this document are invited to submit, with their comments, notification of any relevant patent rights of which they are aware and to provide supporting documentation.

Document type: International standard

Document subtype: if applicable

Document stage: 00.20

Document language: en

ISO 24613:2005

This ISO document is a draft revision and is copyright-protected by ISO. While the reproduction of draft revisions in any form for use by participants in the ISO standards development process is permitted without prior permission from ISO, neither this document nor any extract from it may be reproduced, stored or transmitted in any form for any other purpose without prior written permission from ISO.

Requests for permission to reproduce this document for the purpose of selling it should be addressed as shown below or to ISO’s member body in the country of the requester:

[Indicate: the full address, telephone number, fax number, telex number and electronic mail address, as appropriate, of the Copyright Manager of the ISO member body responsible for the secretariat of the TC or SC within the framework of which the draft has been prepared]

Reproduction for sales purposes may be subject to royalty payments or a licensing agreement.

Violators may be prosecuted.


Table of contents

1 Scope
2 Normative references
3 Definitions
4 General principles
4.1 Fundamental principles
4.1.1 Principle of simplicity and clarity
4.1.2 Principle of universal expressiveness
4.1.3 Principle of scalability
4.2 Overall architecture
4.3 LMF Essential Structural Elements
4.3.1 Introduction
4.3.2 LMF Class Diagram
4.4 LMF Attributes
4.5 Introduction
4.6 Data Category Registry
4.7 User-defined Data Categories
4.8 Data Category Selection
4.9 Diagram
5 The LMF Framework
5.1 High Level Abstraction
5.1.1 Lexical database
5.1.2 Lexical entry
5.2 Form and Sense
5.2.1 Form Analysis Class
5.2.2 The Form Analysis Process
5.2.3 Sense Analysis Class
5.3 The Linguistic Frame
5.4 Global Information
5.5 Overview of Core LMF Classes
5.6 LMF lexical extensions
5.7 Selected LMF lexical extensions
5.8 LMF conformant lexicon
Annex A (normative): Extension for machine readable lexicons
Annex B (normative): Natural Language Processing extension
B.1 Purpose of the extension
B.2 Morphology
B.2.1 Objectives
B.2.2 Absence vs presence of morphology in a lexicon
B.2.3 Options
B.2.4 Description of morphological model
B.2.5 Class diagram
B.2.6 Examples
B.2.7 Summary
B.3 Syntax
B.3.1 Objectives
B.3.2 Absence vs presence of syntax in a lexicon
B.3.3 Options
B.3.4 Description of syntactic model
B.3.5 Class diagram
B.3.6 Example of the Italian Construction « amare »
B.3.7 Summary
B.4 Semantics
B.4.1 Objectives
B.4.2 Absence vs presence of semantics in a lexicon
B.4.3 Options
B.4.4 Description of semantic model
B.4.5 Class diagram
B.4.6 Example of the French verb “aider”
B.5 Multilingual notations
B.5.1 Objectives
B.5.2 Absence vs presence of multilingual notations in a lexicon
B.5.3 Options
B.5.4 Description of multilingual model
B.5.5 Class diagram
B.5.6 Example of translation of “fleuve” and “rivière”
B.5.7 Summary
B.6 Extended examples of NLP lexicons
B.6.1 Foreword
B.6.2 Getting started lexicon
B.6.3 LMF and Open Lexicon Interchange Format (OLIF)
B.6.4 LMF and CLIPS
B.6.5 LMF and LC-Star
B.6.6 LMF and WordNet
B.6.7 LMF and FrameNet
B.6.8 LMF and BDéf
B.7 Open questions
B.7.1 Have a core model able to represent multilingual databases
B.7.2 Improve attribute specification in NLP extension
B.7.3 ListOfComponents and ListOfStems
B.7.4 SynSet and Senses
B.7.5 Improve UML specification
B.7.6 Collocation (comments from LexiquePourLeTAL network)
B.7.7 Surface/deep morphology (comments from LexiquePourLeTAL network)
B.7.8 Syntax (comments from LexiquePourLeTAL network)
B.7.9 Multilingual notations
B.7.10 Derivation in morphology


Foreword

ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies (ISO member bodies). The work of preparing International Standards is normally carried out through ISO technical committees. Each member body interested in a subject for which a technical committee has been established has the right to be represented on that committee. International organizations, governmental and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.

International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 3.

Draft International Standards adopted by the technical committees are circulated to the member bodies for voting. Publication as an International Standard requires approval by at least 75 % of the member bodies casting a vote.

International Standard 24613 was prepared by Technical Committee ISO/TC 37, Terminology and other language resources, Subcommittee SC 4, Language resource management.

ISO 24613 is designed to coordinate closely with ISO Draft Revision 12620, Computer applications in terminology – Data categories – Part 3: Data category registry, and ISO DIS 16642, Computer applications in terminology – TMF (Terminological Markup Framework).

Annexes A-B form an integral part of this International Standard.


Introduction

One of the crucial aspects impacting human language technologies (HLT) in general and natural language processing (NLP) in particular, with special consideration for human-oriented translation technologies, is the need to optimize the production, maintenance, and extension of lexical resources. A second crucial aspect involves optimizing the process leading to their integration in applications. The Lexical Markup Framework (LMF) is an abstract metamodel that provides a common, standardized framework for the construction of computational lexicons. The LMF ensures the encoding of linguistic information in a way that enables reusability in different applications and for different tasks. The LMF provides a common, shared representation of lexical objects, including morphological, syntactic, and semantic aspects.

The goals of LMF are to provide a common model for the creation and use of very large scale lexical resources, to manage the exchange of data between and among these resources, and to enable the merging of large numbers of different individual electronic resources to form large global electronic resources. LMF provides a mechanism for describing lexical entries and all corresponding translations by depicting the scripts and orthographies used in entries and their related descriptions. The ultimate goal of LMF is to create a modular structure that will enable true content interoperability across all aspects of lexical resources.

LMF consists of the following components:

The core model comprises a metamodel, i.e., the structural skeleton of the LMF, which describes the basic hierarchy of information included in a lexical entry. The core model is supplemented by various knowledge sources that are part of the definition of LMF:

• Specific data categories used by the variety of resource types associated with the LMF, both those data categories relevant to the metamodel itself, and those associated with the extensions to the core model;

• The constraints governing the relationship of these data categories to the metamodel and to its extensions;

• Standard procedures for expressing these categories in XML and thus for anchoring them on the structural skeleton of the LMF and relating them to the respective extension models;

• The vocabularies used by the LMF to express related informational objects as XML elements and attributes according to the corresponding styles; and

• Methods for describing how to extend the LMF through linkage to a variety of specific lexical resources (extensions) and methods for analyzing and designing such linked systems.

Extensions of the core model, which are documented in this standard in Annexes A-B, include:

- Machine readable lexicons

- Natural Language Processing lexicons

The LMF extensions are expressed in a framework that describes the reuse of the LMF core components (such as structures, data categories, and vocabularies) in conjunction with the additional components required for a specific lexical resource.

Types of individual instantiations of LMF can include such lexical resources as fairly simple lexical and terminological databases, NLP and machine-translation lexicons, as well as electronic monolingual, bilingual and multilingual lexical resources. LMF provides general structures and mechanisms for analyzing and designing new lexical resources, but the LMF does not specify the structures, data constraints, and vocabularies to be used in the design of specific lexical resources. LMF also provides mechanisms for analyzing and describing existing lexical resources using a common descriptive framework. For the purpose of both designing new lexical resources and describing existing lexical resources, LMF defines the conditions that allow the data expressed in any one lexical resource to be mapped to the LMF framework, and thus provides an intermediate format for lexical data exchange.


1 Scope

The Lexical Markup Framework (LMF) standard describes a high-level model for representing data in lexical resources used with multilingual computer applications. The scope of these lexical resources includes both human-oriented translation tools, such as machine-readable lexicons, and automated human language technologies (NLP), such as machine translation, information extraction, information retrieval, summarisation, sentence generation, word clustering, recognition and extraction of multiword expressions, word sense disambiguation, proper noun recognition, parsing, and coreference resolution. Lexical resources employed in this context include multilingual natural language processing (NLP) lexicons, speech and text corpora, multimodal resources, terminology databases (termbases), and electronic monolingual, bilingual and multilingual lexicons. This standard is designed to be used in close conjunction with the metamodel presented in ISO 16642:2003, Terminological Markup Framework, and with the revision of ISO 12620, Terminology and other language resources – Data categories for electronic lexical resources (ELR). It supports specific linguistic processing environments such as the NLP model defined in AFNOR/TC37/SC4/N090 Proposition de Norme des Lexiques pour le traitement automatique du langage and existing lexical resource models such as the EAGLES International Standards for Language Engineering (ISLE) and Multilingual ISLE Lexical Entry (MILE) model.

2 Normative references

The following normative documents contain provisions that, through reference in this text, constitute provisions of this part of ISO 24613. For dated references, subsequent amendments to, or revisions of, any of these publications do not apply. However, parties to agreements based on ISO 24613 are encouraged to investigate the possibility of applying the most recent editions of the normative documents indicated below. For undated references, the latest edition of the normative document referred to applies. Members of ISO and IEC maintain registers of currently valid International Standards.

ISO 639-1:2002, Codes for the representation of names of languages – Part 1: Alpha-2 Code.

ISO 639-2:1998, Codes for the representation of names of languages – Part 2: Alpha-3 Code.

ISO 639-3:200?, Codes for the representation of names of languages – Part 3: Alpha-3 Code for comprehensive coverage of languages

ISO 1087-1:2000, Terminology – Vocabulary – Part 1: Theory and application

ISO 1087-2:1999, Terminology – Vocabulary – Part 2: Computer application.

ISO/IEC 10646-1:2003, Information technology – Universal Multiple-Octet Coded Character Set (UCS)

ISO/IEC 11179-3:2003, Information Technology – Data management and interchange – Metadata Registries (MDR) – Part 3: Registry Metamodel (MDR3)

ISO 15924:200?, Code for the representation of names of scripts

ISO 16642:2003, Computer applications in terminology – TMF (Terminological Markup Framework)


3 Definitions

For the purposes of this International Standard, the terms and definitions given in ISO 1087-1, ISO 1087-2 and the following apply:

abbreviated form
representation formed by omitting words or letters from a longer form

automaton
abstract mechanism characterized by a certain number of states and whose behaviour is governed by a finite number of rules

autonomous word
word that can appear as a single word or as a component of a multiword expression

Example: “Father” in the multiword expression “father-in-law”

Note: opposed to non-autonomous word

circonstant
non-essential element associated with a verb when viewed from a theoretical perspective as opposed to syntactic actants

Example: Alfred (syntactic actant) read a book (syntactic actant) today (circonstant)

Note: Adverbs are possible circonstants for a sentence.

citation
quotation from a book, article or document [ISO 1951]

closed data category
data category whose content is constrained by a list of permissible instances which comprise its conceptual domain

NOTE: A typical closed data category might be /grammatical number/, which can have as its content the values: /singular/, /plural/ or /dual/.

combination of morphological features
unlimited number of associations of distinct morphological features

NOTE: An example of a combination of morphological features would be the pair: /grammatical number/ and /grammatical gender/.

compiled program
set of instructions written in a high-level symbolic language that has been translated into machine language before the instructions can be executed

complex data category
data category that can have content values

NOTE: Complex data categories include both closed data categories and open data categories.

conceptual domain
set of permissible values associated with a closed data category

Note: The conceptual domain of the data category /grammatical number/ can be defined as /singular/, /plural/ and /dual/.

data category
result of the specification of a given data field or the content of a closed data field


NOTE: A data category is to be used as an elementary descriptor in a linguistic structure or an annotation scheme. Examples are: /term/, /definition/, /part of speech/ and /grammatical gender/. Data categories for the management of lexical resources and terminology are comparable to data element concepts in ISO/IEC 11179-3:200?.

derivation
change in the form of a word to create a new word, usually by modifying the base/root or by affixation

NOTE: Sometimes derivation signals a change in part of speech, such as "nation" to "nationalize". Sometimes the part of speech remains the same as in “nationalization” vs. “denationalization”.

elision
the leaving out of a sound or sounds in speech

Example: In rapid speech in English, “factory” is often pronounced as [ˈfæktri].

Note: In certain languages, elision is reflected in writing, as in French: “le” + “enfant” yields “l’enfant”.

ergative verb
a verb that allows the object to function as the subject.

Example: “boil” in “He boils the kettle of water” and “the kettle boils”.

Note: Ergative verbs are a class of verbs in which the object of the transitive form can be used as the subject of the intransitive form with an equivalent meaning. “Open” is an example of an ergative verb, in that “I opened the door” and “The door opened” have equivalent meanings.

electronic lexical resource
ELR
lexical database
lexical resource
database collection consisting of individual data entries each of which documents a word and provides data pertinent to the senses associated with that word, as well as in some cases equivalent words in one or more languages [adapted from ISO 1087]

NOTE: Lexical databases and lexical resources are generally corpus-driven and provide more predictable, machine-processable information than machine readable dictionaries, which only mirror the layout of print dictionaries, although many today provide some degree of knowledge organization. Lexical resources can include features for spellchecking and grammar checking, parsing, concordancing, speech recognition and generation, semantic taxonomies and disambiguation, text segmentation, knowledge management, and other NLP functions.

electronic terminological resource
ETR
database collection consisting of individual data entries each of which documents a concept and provides data pertinent to the terms associated with that concept in one or more languages

etymology
information on the origin of a word and the development of its meaning [ISO 12620]

form
any sequence of letters, numerals and pictograms used to write or pronounce a word

form operation
any modification of the form


full form
the complete representation of a word for which there is an abbreviated form [ISO 12620].

grammatical category
See part of speech.

homograph
word that is written like another word, but that has a different pronunciation, meaning, and/or origin [adapted from ISO 12620]

NOTE: An example of difference in meaning for the same spelling of a word is bark: 1) the sound made by a dog; 2) outside covering of the trunk or branches of woody plants; 3) a sailing vessel.

homonym
word that sounds the same and is written the same as another word, but is different in meaning

NOTE: An example is “bear” as a /noun/ and “bear” as a /verb/.

homophone
word that sounds like another word but is different in writing or meaning

NOTE: An example of difference in spelling is “pair” compared to “pear” or “pare” in “The cook used a knife to pare the pair of pears”.

human language technology
HLT
technology as applied to natural languages

NOTE: At the broadest level, these technologies cover: applying language knowledge to human machine interaction; providing automated multi-linguality in systems; managing information recorded as human language. These technologies include: speech recognition, spoken language understanding (i.e. speech interpretation), and speech generation; speaker identification and verification; dialogue design and analysis; controlled language design and processing; document image analysis, optical character recognition, and handwriting recognition; recognition and understanding of multi-modal human communication; computer assisted text creation and editing; language analysis and understanding; information extraction and summarization; language generation; (synthetic) speech generation; language identification, machine translation and computer aided translation; production of language resources and the tools to support it. [http://www.hnk.ffzg.hr/jthj/glosar.htm]

idiom
See multiword expression

inflected form
form that a word can take when used in a sentence or a phrase

NOTE: An inflected form of a word is associated with a combination of morphological features, such as grammatical number or case.

inflectional paradigm
set of form operations that builds the various inflected forms of a lemmatised form

NOTE: An inflectional paradigm is not the explicit list of inflected forms.

interlingua
an abstract intermediary language used in the machine translation of human languages

intransitive verb
a verb which does not take a direct object

Example: Some verbs never take objects: “I arrive. She complains”. Some verbs can be intransitive or transitive: “The cook watched while his apprentice made the sauce” (“watched” is intransitive) “The cook watched his apprentice” (“watched” is transitive).

lemmatised form
lemma
conventional form chosen to represent the word

NOTE: In European languages, the lemmatised form is the /singular/ if there is a variation in /number/, the /masculine/ form if there is a variation in /gender/ and the /infinitive/ for all verbs. In some languages, certain nouns are defective in the singular form, in which case, the /plural/ is chosen. Certain words are also defective in the /masculine/ in which case, the /feminine/ is chosen. The lemmatised form can be graphical or phonetic.

lexical database
lexical resource
See electronic lexical resource

machine translation lexicon
lexical resource in which the individual entries contain equivalents in two or more languages together with semantic information to facilitate automatic or semi-automatic processing of lexical units during machine translation [ISO 1087]

morphological feature
category induced from the inflected form of a word

NOTE: ISO 12620 provides a comprehensive list of values for European languages. An example of a morphological feature is: /grammatical gender/.

morphology of a word
morpho-syntax of a word
description comprising the lemmatised form or forms of a word, its /part of speech/ data categories, possibly its inflectional paradigm or paradigms, possibly its explicitly listed inflected forms.

NOTE: Despite the reference to syntax, morpho-syntactic information does not include syntactic information.

multiword expression
MWE
idiom
a word composed of an ordered group of words that has properties that are not predictable from the properties of the individual words or their normal mode of combination.

Note: The group of words making up an MWE can be continuous or discontinuous.

Example: to take an exam. He took the exam. (continuous). What did you do with the exam that he supposedly took last Thursday? (discontinuous).

natural language processing
NLP
field covering knowledge and techniques involved in the processing of linguistic data

non-autonomous word
a word that appears in multiword expressions but cannot appear alone

Example: In French “au fur et à mesure”, the component “fur” cannot appear alone. In English, “to take an umbrage”, the component “umbrage” cannot appear alone.

See also autonomous word

object language
language of the lexical object being described [ISO 16642 definition 3.10]

open data category
data category whose content is completely optional

Example: Typical open data categories might include /term/, /lemma/, /definition/.

orthography
a way of spelling or writing words that conforms to a specified standard

Note: Aside from standardized spellings of alphabetical languages, such as standard UK or US English, or reformed German spelling, there can be variations such as transliterations or romanizations of languages in non-native scripts, stenographic renderings, or representations in the International Phonetic Alphabet. In this regard, orthographic information in a lexical entry can describe a kind of transformation applied to the form that is the object of the entry. The specific value /native/ is dedicated to represent the absence of transformation.

Examples: /transliteration/ and /romanization/ are possible values for a transformation.

part of speech
grammatical category
word class
category assigned to a word based on its grammatical and semantic properties

NOTE: ISO 12620 provides a comprehensive list of values for European languages. Examples of such values are: /noun/ and /verb/.

romanization
transliteration using Latin script

script
set of graphic characters used for the written form of one or more languages (ISO/IEC 10646-1, 4.14)

Note: The description of scripts ranges from a high level classification such as hieroglyphic or syllabary writing systems vs alphabets to a more precise classification like Roman vs Cyrillic. Scripts are defined by a list of values taken from ISO 15924. Examples are: Hiragana, Katakana, Latin, Cyrillic.

semantics of a word
description of the meanings of the word

simple data category
data category that is itself the possible content of a closed data category, but that cannot itself have content

Example: /masculine/, /feminine/, and /neuter/ are possible simple data categories associated with the conceptual domain of the closed data category /grammatical gender/ as it is associated with the German language.

single word
word that does not contain any other word

splitting conditions
the criteria determining why a linguistic phenomenon is described by one element or by several elements


Example: the criteria specifying why a word is described by one entry or two entries in a lexicon

support verb
verb that has a generic semantic contribution and that combines with a noun to form a lexicalised unit.

Note: Generally, the subject of the verb is a participant in an event most closely identified with the noun.

Examples: "take an exam" or "give an exam". In these examples, "take" and "give" are not neutral, they are generic.

syntactic actant
one of the essential and functional elements in a clause that identifies the participants in the process referred to by a verb

Example: Alfred (syntactic actant) read a book (syntactic actant) today (circonstant)

Note: The subject, indirect object and direct object are possible syntactic actants for a sentence.

See also circonstant

syntax of a word
description of the behaviour of the word with respect to other words in a sentence or a phrase

transcription
form resulting from a coherent method of writing down speech sounds

transitive verb
a verb which takes a direct object

Example: “He saw the car”

transliteration
form resulting from the conversion of one writing system into another

usage note
note explaining the correct and/or incorrect usage of a word

word
in the context of a given language, a description composed of at least a /part of speech/ and a lemmatised form

NOTE: The description can be more complete with more morphological information and/or syntactic and semantic information. A word is either a single word or a multiword expression.

word class
See part of speech.

word frequency
number of occurrences of a particular word in a certain corpus, divided by the number of words in this corpus

working language
language used to describe objects in a lexical resource [ISO 16642 definition 3.21]
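The definition of word frequency given above (occurrences of a word divided by the number of words in the corpus) can be illustrated with a minimal sketch. The function name word_frequencies and the whitespace tokenisation are assumptions made for illustration only; this draft does not prescribe how a corpus is tokenised.

```python
from collections import Counter


def word_frequencies(tokens):
    """Return, for each word, its occurrence count divided by the corpus size."""
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {word: count / total for word, count in counts.items()}


# Hypothetical five-word corpus, split on whitespace (an assumption).
corpus = "the cook watched the apprentice".split()
freqs = word_frequencies(corpus)
print(freqs["the"])  # 2 occurrences out of 5 words -> 0.4
```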


4 General principles

4.1 Fundamental principles

4.1.1 Principle of simplicity and clarity

The model shall be clearly represented, readily understandable, and easy to apply in a variety of individual resources.

4.1.2 Principle of universal expressiveness

The model shall be able to represent existing lexicons as far as possible. If this is impossible, problematic information must be identified and isolated.

4.1.3 Principle of scalability

The same model shall be useable for both small and large lexical databases.

4.2 Overall architecture

LMF shall provide a highly modular architecture that allows the development and integration of a variety of lexical resource types through the extension of core elements.

LMF shall be used according to the following steps:

Step 1: Define an LMF conformant lexicon

Step 2: Populate this lexicon

An LMF conformant lexicon is defined as the combination of an LMF core model (see section 4.3), zero, one or more lexical extensions (see annexes) and a set of data categories. The combination of all these elements is described in section 5.
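As a rough illustration of the two steps, the sketch below models the definition of a lexicon (core model, zero or more extensions, a set of data categories) separately from its population with entries. The class names LexiconDefinition and Lexicon, the populate method, and the sample values are hypothetical and are not defined by this draft.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class LexiconDefinition:
    """Step 1: a core model, zero or more extensions, and a set of data categories."""
    core_model: str
    extensions: List[str] = field(default_factory=list)
    data_categories: Set[str] = field(default_factory=set)


@dataclass
class Lexicon:
    """A lexicon built on a definition; entries are added in Step 2."""
    definition: LexiconDefinition
    entries: List[Dict[str, str]] = field(default_factory=list)

    def populate(self, new_entries: List[Dict[str, str]]) -> None:
        self.entries.extend(new_entries)


# Step 1: define the lexicon (all values are illustrative assumptions).
definition = LexiconDefinition(
    core_model="LMF core model",
    extensions=["NLP extension"],
    data_categories={"lemma", "part of speech"},
)

# Step 2: populate it.
lexicon = Lexicon(definition)
lexicon.populate([{"lemma": "aider", "part of speech": "verb"}])
print(len(lexicon.entries))  # 1
```

Keeping the definition apart from the entries mirrors the order of the steps: the definition can be fixed and reviewed before any data is loaded.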

4.3 LMF Essential Structural Elements

4.3.1 Introduction

The LMF consists of a network of structural elements that are relevant for linguistic description. All LMF conformant lexicons share this structure, which acts as a skeleton to enable interoperability and interchangeability among a wide variety of lexical resources. This introductory section provides an overview of the essential elements of the UML diagrams that will be used to describe the LMF structure. This section does not describe all the features of the UML diagrams, but focuses on several key elements that will be most useful for describing the framework.

4.3.2 LMF Class Diagram

The class diagram:

• Gives an overview of the LMF structure


• Uses associations and generalizations to describe the structural relationships among the classes

• Shows the possible states of the LMF system

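[Figure 1: UML class diagram of the classes Lexical DB (attribute: language), Global Information (attribute: administration), Lexical Entry (attribute: partOfSpeech), Linguistic Frame, Form (attribute: orthography) and Sense, with the multiplicities of their associations. The diagram itself is not reproduced here.]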

Figure 1 shows a sample of a UML class diagram that contains a number of the elements provided by the LMF standard. The diagram consists of six classes (i.e. Lexical Database, Global Information, Lexical Entry, Form, Sense, and Linguistic Frame). The sample attributes illustrate some of the allowable class-attribute mappings.

The Lexical Database class represents the entire resource and contains attributes that are inherited by subordinate classes.

The Global Information class contains administrative information and other general attributes. There is a composition aggregation relationship between the Lexical Database class and the Global Information class.

The Lexical Entry class is a subclass of the Lexical Database class that describes a specific lexical unit, such as a single word, or a multiword expression. The Form and Sense classes are subclasses of Lexical Entry. The Form class contains attributes that describe the orthographic and phonetic values of the lexical unit. The Sense class contains attributes that describe one of the meanings of the word. In this example, there is an aggregation relationship between the Lexical Entry class and the subordinate Form and Sense classes.
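The aggregation relationships described above can be sketched as plain data classes. This is an illustrative reading, not a normative binding: the attribute names follow Figure 1 (language, administration, partOfSpeech, orthography), while the definition attribute of Sense and the requirement of at least one Form and one Sense per entry are assumptions. Linguistic Frame is omitted because its associations are not yet described in this draft.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Form:
    orthography: str


@dataclass
class Sense:
    definition: str  # hypothetical attribute, not taken from Figure 1


@dataclass
class LexicalEntry:
    part_of_speech: str
    forms: List[Form]
    senses: List[Sense]

    def __post_init__(self):
        # Assumed multiplicity: at least one Form and one Sense per entry.
        if not self.forms or not self.senses:
            raise ValueError("a lexical entry needs at least one form and one sense")


@dataclass
class GlobalInformation:
    administration: str


@dataclass
class LexicalDatabase:
    language: str
    global_information: GlobalInformation
    entries: List[LexicalEntry] = field(default_factory=list)


entry = LexicalEntry(
    part_of_speech="noun",
    forms=[Form(orthography="bark")],
    senses=[Sense("sound made by a dog"), Sense("outer covering of a tree trunk")],
)
database = LexicalDatabase(
    language="en",
    global_information=GlobalInformation(administration="sample resource"),
    entries=[entry],
)
print(len(database.entries[0].senses))  # 2
```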

The UML standard allows classes to inherit from as many super-classes as desired. For example, the Form class can inherit both the partOfSpeech attribute from the Lexical Entry class and the language attribute from the Lexical Database class.


LMF classes can have abstract or concrete aspects. For example, a Lexical Entry object in a Machine Readable Dictionary can be composed of objects from the Form and Sense classes. However, a developer may decide to use an abstract Form class and implement the form-related attributes directly in the Lexical Entry object rather than implementing a separate Form object.

The Linguistic Frame is a general class that can be used to implement a range of linguistic objects applicable to both MRD and NLP lexicons.

Comment: discuss associations and multiplicity using the example of the Linguistic Frame.

4.4 LMF Attributes

4.5 Introduction

Comment: Discuss implementation of attributes through Data Categories.

4.6 Data Category Registry

The Data Category Registry (DCR) is a set of data category specifications defined by ISO 12620 and maintained as a global resource by ISO TC 37. The designers of any specific LMF lexicon shall rely on the DCR when creating their own data category selection.

4.7 User-defined Data Categories

Lexicon creators can define a set of new data categories to cover data category concepts that are needed and that are not available in the DCR. This supplemental set of data categories shall be represented and managed in conformance with ISO 12620.

4.8 Data Category Selection

The data category selection (DCS) describes the set of data categories that can be used within a given LMF lexicon. Data categories are the basic building blocks for creating lexical resources, where each data category functions as an information unit that describes a specific linguistic or administrative concept.

The kind of data categories that will be needed depends on the following decisions:

- The precision chosen in the lexical description. For instance, a lexicon without any syntactic description will not be associated with a DCS containing syntactic data categories.

- The type of extensions selected.

- The languages selected. Not all languages have the same characteristics. For instance, a French monolingual lexicon will not include the data category /dual/ in its DCS, but a Hebrew monolingual lexicon will include this value because it is a feature of the Hebrew language. In the case of a multilingual lexicon, the DCS shall contain the aggregate of all the data categories required for all the languages treated in the resource.

When two LMF conformant lexicons are based upon two different DCSs, comparison of the data categories in each provides a framework for identifying what information can be exchanged between one format and another, or what will be lost during a conversion.
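The comparison of two DCSs described above can be pictured as a set operation. The following Python sketch is illustrative only and is not part of the framework: the function name `compare_selections` and the data category names are hypothetical placeholders, not entries taken from the DCR.

```python
def compare_selections(source_dcs: set, target_dcs: set) -> dict:
    """Compare the data category selections of a source and a target lexicon."""
    return {
        # categories present in both selections can be exchanged
        "exchangeable": source_dcs & target_dcs,
        # categories only in the source have no counterpart after conversion
        "lost_in_conversion": source_dcs - target_dcs,
        # categories only in the target stay unused by the converted data
        "unused_in_target": target_dcs - source_dcs,
    }


# Hypothetical selections for two monolingual lexicons
french_dcs = {"partOfSpeech", "singular", "plural", "masculine", "feminine"}
hebrew_dcs = {"partOfSpeech", "singular", "plural", "dual", "masculine", "feminine"}

# Converting from the Hebrew lexicon format to the French lexicon format
report = compare_selections(hebrew_dcs, french_dcs)
```

Under these assumed selections, the only data category reported as lost is /dual/, which matches the French and Hebrew example given earlier in this clause.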


When LMF is used to describe an existing lexical resource, it will be necessary to map the existing lexical resource to corresponding data categories in the DCR.

4.9 Diagram

Comment: We need to modify the following diagram so we can present: (1) an overview of how the structural design and the data category selection work together; and (2) a more complete description of the analysis and design methodology for use later in the standard.

An LMF conformant lexicon is defined by the following operations, described by means of the following UML activity diagram:

[UML activity diagram. Objects: LMF Core Model; Data Category Registry; User-defined Data Categories; LMF Lexical Extensions; Data Category Selection; Selected LMF Lexical Extensions; LMF conformant lexicon. Activities: Build a Data Category Selection; Select; Compose.]

Figure 2: Process to use LMF

All class instances are defined in the following paragraphs.


5 The LMF Framework

5.1 High Level Abstraction

At this level, the LMF class structure is an abstract analysis structure. In practical terms, the structure would allow the design of only the simplest lexica. However, this abstract structure provides the fundamental conceptual basis for the Lexical Markup Framework and can be extended to support the practical analysis and design of a wide range of MRD and NLP lexica.

The core model is defined by the following UML class model:

[UML class diagram: Lexical DB (-global information); Lexical Entry (-form, -sense). Multiplicities shown: 1..*, 0..*.]

Figure 3

5.1.1 Lexical database

The Lexical Database class represents the entire resource and contains attributes that are shared by subordinate classes. The cardinality between Lexical database and Lexical Entry shall be 1..*, that is, a lexicon must contain at least one Lexical Entry. Attributes mapped to the Lexical Entry are inherited by all subclasses (unless overridden). The specific attributes associated with the Lexical Database class are dependent on design considerations. The Lexical Database is accompanied by Global Information including metadata, creation description, analytic mapping, and administrative information.
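As an informal illustration of the 1..* cardinality, the following Python sketch models the two core classes with form and sense as plain attributes. The class layout, the `global_information` dictionary and the sample sense text are assumptions made for this example; the framework does not prescribe this representation.

```python
from dataclasses import dataclass, field


@dataclass
class LexicalEntry:
    form: str   # text string representing the word or multiword expression
    sense: str  # comment that specifies or disambiguates the meaning


@dataclass
class LexicalDatabase:
    entries: list
    global_information: dict = field(default_factory=dict)

    def __post_init__(self):
        # Enforce the 1..* cardinality between Lexical Database and Lexical Entry
        if not self.entries:
            raise ValueError("a lexicon must contain at least one Lexical Entry")


# The sense text below is invented for the example
lexicon = LexicalDatabase(
    entries=[LexicalEntry(form="airplane", sense="a powered aircraft with fixed wings")],
    global_information={"language": "en"},
)
```

In this sketch, constructing a `LexicalDatabase` with an empty list of entries raises a `ValueError`, which is one possible way to make the cardinality constraint explicit.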

5.1.2 Lexical entry

The Lexical Entry describes a single word or multiword expression; no word can be described without treating it in a lexical entry. As a consequence, the number of words or multiword expressions of the lexicon is equal to the number of lexical entries in a given lexicon. The Lexical Entry class contains form and sense attributes. The form consists of a text string that represents a spoken word or a multi-word phrase. The sense is a comment that specifies or disambiguates the meaning and context of the form. As a high level abstraction, the form and sense attributes function as a form-sense pair, i.e. the signifier and the signified.


Each lexical object is atomic. That is, for every form there is one and only one specific meaning. However, each lexical object can be associated with zero to many other lexical objects. The sum of associations provides further context for the lexical object. For example, a polysemous term can be thought of as a set of atomic lexical objects linked by typed associations.

5.2 Form and Sense

At the next lower level of abstraction, a Lexical Entry can be modeled as a composition class with Form and Sense subclasses.

[UML class diagram: Lexical DB; Lexical Entry; Form; Sense. Multiplicities shown: 0..*, 1.]

Figure 4

At this level, the Form and Sense classes have both concrete and abstract aspects. The Form class provides a framework for analyzing the attributes that specify the lexical type (e.g. word, sentence, phrase) and the orthographical and phonetic values of the lexical unit. As part of the analysis and design process, a developer may decide to implement the Form as a concrete object. A developer may also decide to use the Form as an analysis class and implement the “form” object as an attribute of the Lexical Entry class. For example, in an MRD, the lexical form is frequently tagged as a “Headword” that identifies the lexical unit as a citation, or key, form of the lexical entry. Here, the Form is treated in an abstract (or implicit) manner through metadata tags that describe attributes of the form, rather than tags that explicitly describe the text string as a form or type of form.


5.2.1 Form Analysis Class

The attributes mapped to the form can be organized in related sets and mapped to abstract superclasses.

In the following example, from TBX (Term Base Exchange), a LISA standard that is compliant with ISO 16642, a lexical form is labeled with the Data Category name “Term”, from ISO 12620.

state transition table

We can analyze the term as a text string (form) that functions linguistically as a term and consists of a multiword expression.

[Object diagram: Form “:state transition table” (1) associated with Attributes (0..n) “:term” and “:multiword expression”]

The LMF provides formal procedures for analyzing the Form class.

An example of the use of analysis classes follows. The example does not attempt to show a complete set of analysis classes and related attributes.

Lexical Type: -lemma, -headword, -mwe, -word

Grammar: -pos

Language: -language, -script, -orthography, -phonology

Form

Figure 5

Examples of some of the elements of these abstract superclasses include:


Lexical Type indicates the linguistic nature of the form, and thus the nature of the entry in the lexicon. Possible values are:

Word MWE Lemma Phrase Word form

Grammar indicates the grammatical functions of the form. Possible values are:

PartofSpeech

Language indicates the language, script, orthographical, and phonological features of the form. Possible values include:

Language Script Standard Orthography Transliteration Transcription Intonation

5.2.2 The Form Analysis Process

5.2.2.1 Process description

1. In the first stage of analysis, identify all the attributes directly related to the form and create analysis objects with the Form as the superclass.

2. In the second stage of analysis, attach the form to the core LMF structure and normalize the structure using good design principles in the context of the design objective.

3. Capture the language mappings identified during the analysis and design phase in the language ontology description in the Global Information section.

In the following examples, the sample process is normative. The specific analysis and design decisions are informative.

The following examples illustrate the design process using the Form class. The examples show the development of data for an MRD format, but this process could also represent an intermediate step in the design of an NLP resource.


Example entries:

English “airplane”

Chinese “feiji”

The abstract Form analysis classes are:

5.2.2.2 First use case

The first use case is the development of a monolingual English lexicon. For the sake of simplicity, an MRD example is used.

The designer wants to include these features:

A Headword (search key) in the standard orthography:

A Syllabification with Stress Markings

A grammatical category

Form: -form type, -lexical type, -pos, -language, -script

1..*

Orthography Frame: -name, -type, -orthography

Figure 6


: Form (script = latin; language = english; form type = keyForm; pos = noun; lexical type = word)

: Orthography Frame (type = standard; name = American English; orthography = airplane)

: Orthography Frame (type = syllabification with stress indicator; name = XYZ proprietary; orthography = air' plane)

Figure 7

During the initial analysis phases, the LMF class structures and attributes are purely analytical. During later analysis phases and the design phases, the developer can retain analytic structural features and attributes, or implement a mix of analytic and synthetic structural features and attributes. For example, a common best practice in lexical design has been to synthesize the Lexical Type and Orthography features into a single feature, such as “Headword: airplane”; “Syllabification: air’ plane”.
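The step from analytic to synthetic features can be sketched in Python as follows. This is only an illustration under stated assumptions: the class layout mirrors the object diagram in Figure 7, while the mapping table `SYNTHESIZED_NAMES` and the function `synthesize` are hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class OrthographyFrame:
    name: str
    frame_type: str
    orthography: str


@dataclass
class Form:
    script: str
    language: str
    form_type: str
    pos: str
    lexical_type: str
    frames: list


# Hypothetical mapping from an analytic orthography type to a synthesized feature name
SYNTHESIZED_NAMES = {
    "standard": "Headword",
    "syllabification with stress indicator": "Syllabification",
}


def synthesize(form: Form) -> dict:
    """Collapse each Orthography Frame into one named feature of the entry."""
    return {SYNTHESIZED_NAMES[f.frame_type]: f.orthography for f in form.frames}


airplane = Form(
    script="latin", language="english", form_type="keyForm",
    pos="noun", lexical_type="word",
    frames=[
        OrthographyFrame("American English", "standard", "airplane"),
        OrthographyFrame("XYZ proprietary", "syllabification with stress indicator", "air' plane"),
    ],
)

features = synthesize(airplane)
```

Here `features` holds a Headword and a Syllabification value, and the analytic Form object is no longer needed once the synthesized features exist.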


: Lexical DB

: Global Information

: Lexical Entry

Analytic mapping Synthesis

: Form

Figure 8

In this example, the developer has made the following design decisions:

1. In a monolingual database, the working language and script attributes can be at the highest level in the structural hierarchy, since all subclasses can inherit the attribute. (As an alternate design approach, these attributes could possibly be mapped to the Global Information class).

2. The variant forms of the lexical unit can be synthesized with the respective lexical and/or orthographical type descriptors to provide the Headword and Syllabification attributes. At this point, the abstract Form class is empty and this class is not retained during the design phase.

3. The analytic information identified during the analysis phase is described in the Global Information section. This information helps specifically identify the attributes and the structural relationship of the attributes used in the database design. One goal is to support the interchange and reuse of lexical data.

Example of the markup:

en ... ISO12620 keyForm American English


ISO12620 keyForm XYZ Sylabification Style Syllabification StressMarking [GlobalInformation]

airplane air’ plane noun

5.2.2.3 Second use case

The second example shows how the LMF analysis and design process can be used for a database that contains lexical data in two or more bilingual resource sets where at least one of the resources is multi-orthographic in nature. The example is a simple MRD in English and Chinese.

The English resource contains the same data sets described in the first example.

The Chinese resource has variant forms in two types of Chinese native script (Simplified and Traditional) and one type of transcription.


: Lexical DB

: Global Information

: Lexical Entry

: Form Analytic Map

: Orthographic Frame

Figure 9

This example skips the first several analysis phases.

The developer creates a Global Information class that has a 1 to 1 cardinality with the Lexical Database class. Making the Lexical Database class a composition class allows the developer to manage the complex attribute sets found in the Global Information class.

Any of the variant forms for the Chinese language resource could be a valid search key, so the designer decides not to synthesize the forms on a single element, e.g. Headword. Synthesizing the forms to handle the English language data alone would be acceptable, but there is a requirement for consistency in the database.

The database contains two source working languages, English and Chinese, so the language attribute is at the Lexical Entry level.

Because the designer wants to include multiple orthographical variants as the search key, the Form class is instantiated as a Form object and the Lexical Entry is implemented as a composition that includes a Form object as an integral part with a one-to-one cardinality.

Because the Chinese language resource contains two subsets of the Chinese script (Simplified and Traditional), and Latin script for transcription forms, the Script attribute is mapped to the Orthography Frame class.


In this example, the deep analytic structure is carried over into the design and implementation phase. This example illustrates the LMF analysis and design process. However, the specific design decisions, as the two divergent examples show, depend on the goals of the developer.

Example of the markup:

.... en noun keyForm American English airplane XYZ proprietary Syllabification StressMarking air’ plane ch keyForm ha Simplified Chinese [ChineseCharacters] Traditional Chinese [ChineseCharacters] la Non-Spaced Pinyin Transcription Romanization feiji Pinyin and Tone Transcription Romanization HanToneIndicator Syllabification FEI1 JI1
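The design decision that any variant form can serve as a search key can be sketched in Python as follows. The class layout and the function `build_search_index` are assumptions made for illustration. The Chinese character forms are omitted because they are not reproduced in the example above, and the script code given to the English frames is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OrthographyFrame:
    script: str
    name: str
    orthography: str


@dataclass
class Form:
    frames: list


@dataclass
class LexicalEntry:
    language: str       # language is held at the Lexical Entry level
    form: Form          # one-to-one composition with a Form object
    pos: Optional[str] = None


def build_search_index(entries: list) -> dict:
    """Index every orthographic variant so that each one is a valid search key."""
    index = {}
    for entry in entries:
        for frame in entry.form.frames:
            index.setdefault(frame.orthography, []).append(entry)
    return index


english = LexicalEntry("en", Form([
    OrthographyFrame("la", "American English", "airplane"),   # script code assumed
    OrthographyFrame("la", "XYZ proprietary", "air' plane"),  # script code assumed
]), pos="noun")

chinese = LexicalEntry("ch", Form([
    OrthographyFrame("la", "Non-Spaced Pinyin", "feiji"),
    OrthographyFrame("la", "Pinyin and Tone", "FEI1 JI1"),
]))

index = build_search_index([english, chinese])
```

Looking up either "feiji" or "FEI1 JI1" in `index` returns the same Chinese entry, which is the behaviour that synthesizing the forms into a single Headword element would not provide.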


5.2.3 Sense Analysis Class

Lexical resources can also treat the Sense in an abstract or concrete manner. For example, Machine Readable Dictionaries almost always treat the Sense as a concrete class (an object) containing a selection of Definition, Example, Equivalent and other subclasses to comment on or explain the form.

The following example shows high level analysis classes with several sample attributes.

Sense: -definition, -context, -example

Form

Figure 10

The sense example shows the analysis of a simple Chinese-English MRD. The designer wants to include these features:

A Chinese Headword (search key) in the standard orthography:

A Romanized Headword (search key) in a proprietary Pinyin orthography.

A grammatical category

A simple English equivalent or equivalents


Lexical DB

Global Information (1 to 1 with Lexical DB)

Lexical Entry (0..*): -pos

Form: -keyForm, -language

Sense (0..*): -language, -equivalent

Orthography Frame (1..*): -script, -orthography name, -orthography type

Figure 11

The Chinese language data elements contain multiple scripts and orthographies, so the developer uses both Form and Orthography Frame analysis classes. The database contains a source working language, Chinese, and a target language, English. The decision to map the language attribute to the Form and Sense classes is an arbitrary decision. For example, the Chinese language attribute could have been mapped to the Lexical Entry level and overridden in the Sense class.

5.3 The Linguistic Frame

The Lexical Entry can also contain zero to many Linguistic Frames. Each Linguistic Frame can be assigned a type attribute, enabling the creation of new frame types for constructing extensions for both Machine Readable Dictionaries and NLP Lexica. Each Reference Frame can contain zero to many Form classes and Form-related subclasses, zero to many internal cross references, and zero to many external cross references. The internal cross reference attribute can be used to map a Reference Frame to the KeyForm, to an associated Sense (if present), or to other Reference Frames in the Lexical Entry.

NLP lexica can be represented by typing Reference Frames (e.g. morphological unit, syntactic unit) and defining the links between the frames using the internal cross reference features. Machine Readable Lexica can be represented by typing the Reference Frame as a related linguistic entity (e.g. derivative, synonym, antonym) and linking the related linguistic entity, i.e. the Linguistic Frame, to the keyForm. The related form can also be mapped to a specific Sense frame in order to specify a relevant Sense, or Senses, out of a range of Senses.

The external cross reference feature enables greater flexibility in modeling both Machine Readable and NLP Lexical extensions by allowing developers to build associations among lexical objects at the level of the Lexical Entry and at the level of the Lexical Database. The Linguistic Frame provides a mechanism for reducing complex lexical associations to a simple and extensible conceptual model. As part of the overall Core Reference Model, the Linguistic Frame provides greater flexibility for modeling LMF extensions.

5.4 Global Information

The following example shows how Global Information could be expressed in XML.

PR.xml NLP

Comment: add an example of language mapping

Pierre Richelet 13 Sept 2004 Dictionnaire françois augmenté 1.1

Figure 2: Example of Global Information

5.5 Overview of Core LMF Classes

The following diagram provides an overview of key LMF core classes and some of the allowable associations. The diagram is not meant to be an example of a required structure for specific implementations.


[UML class diagram showing the classes Lexical DB, Global Information (-metaData, -language mapping, -creation, -administration), Lexical Entry, Reference Frame, Form and Sense, with their associations and multiplicities]

Figure 12

5.6 LMF lexical extensions

Comment: This section comes at the end.

The LMF lexical extensions are described in the annexes of this current standard.

They include the following options:

• Machine readable lexicons

• Natural Language Processing lexicons

Creators of lexicons should select the subsets of these possible extensions that are relevant to their needs.

5.7 Selected LMF lexical extensions

These extensions are a subset of the available LMF lexical extensions that are selected by the lexicon creator.


5.8 LMF conformant lexicon

An LMF conformant lexicon is defined by:

- a class model combining the LMF core model and the selected extensions;

- a data category selection in order to define constants and basic types.


Annex A (normative): Extension for machine readable lexicons

The Machine Readable Lexical Extension can be modeled using the LMF Core Reference Model, a metamodel with additional feature structures, and a DCS attached to the metamodel.

The following UML diagram shows a simple bilingual dictionary model that was designed using the Core Reference Model and a feature structure based metamodel. An example of a dictionary entry in XML notation is also given.

Diagram To Be Added

Example

Example of an Albanian-English Bilingual Machine Readable Lexicon based on the Core Reference Model using the metamodel and a DCS.

noun masculine acronym alb la Albanian Native ABD unidentified proprietary Pronunciation-type Hyphenation-type >a-bë-dë' full form Albanian Native artileria bregdetare military


unevaluated eng la U.S. English Native coastal artillery


Annex B (normative) : Natural Language Processing extension

B.1 Purpose of the extension

The general purpose of this extension is to describe the lexicons dedicated to NLP.

The aim of the NLP extension is to reflect best practices in the NLP field. Numerous lexicons exist, and they express a wide variety of linguistic data, so the model allows great freedom in the cardinalities of associations. However, to reach a good level of interoperability, each class is intended for a specific kind of linguistic data and for nothing else. For instance, a lexicographer who uses the Form class to represent meaning does not respect the standard because i) this class is not designed for this purpose and ii) another class (named Sense) is specifically designed for it. Such an action breaks interoperability with other LMF conformant lexicons.

This annex is divided into four sections, each of which describes a model specialized for a certain aspect of the description of a word. These models can be combined. The sections are: morphology, syntax, semantics and multilingual notations. Because syntax and semantics are intertwined in most languages, the section on semantics also covers the connection to syntax.

B.2 Morphology

B.2.1 Objectives

The purpose of this sub-chapter is to describe the morphology of a lexical entry.

The goal is to describe all the pairs: 1) combination of morphological features (see definition) 2) inflected form

For instance, for the word “go” one of these pairs will be: 1) (/third person/ + /singular/ + /present/) 2) “goes”.

The morphological model must fulfil the following criteria:

• represent single and multiple words;

• represent graphical and phonetic information;

• accommodate simple and complex languages;

• accommodate mono and multi-orthographic languages (e.g. some Asian languages).

This model can be used for two basic functions: generation and recognition. Generation involves the production of inflected forms based on a lexical entry. Recognition involves parsing an inflected form in order to determine the corresponding lexical entry (or several entries in case of ambiguity).
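The two functions can be sketched in Python over a table of explicit pairs. This is a minimal illustration, not a prescribed implementation: the table `PAIRS` and the function names are hypothetical, and only the “goes” pair comes from the text above (the “went” pair is added so that the table has more than one row).

```python
# (combination of morphological features) -> inflected form, per lexical entry
PAIRS = {
    "go": {
        frozenset({"third person", "singular", "present"}): "goes",
        frozenset({"past"}): "went",  # added for illustration
    },
}


def generate(lemma: str, features: set) -> str:
    """Generation: produce the inflected form for a lexical entry and a feature combination."""
    return PAIRS[lemma][frozenset(features)]


def recognize(inflected: str) -> list:
    """Recognition: find every (lexical entry, features) pair that yields the inflected form."""
    return [
        (lemma, features)
        for lemma, table in PAIRS.items()
        for features, form in table.items()
        if form == inflected
    ]


generated = generate("go", {"third person", "singular", "present"})
recognized = recognize("goes")
```

Recognition returns a list because, as noted above, an inflected form may correspond to several lexical entries in case of ambiguity.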


B.2.2 Absence vs presence of morphology in a lexicon

Some system designers do not want to include the morphological computation associated with each entry on the grounds that it would be too complex or that there is no consensus. The argument against this position is that a lot of NLP lexica include morphological information. This is true for single words and multi-word expressions.

For instance, a user could have software and lexicons that operate in N languages and want to add a language by importing normalized data. For a language with relatively simple morphology like English, the problem is not very serious, but for a language like Hungarian or French, whose morphology is rather complex, the burden of creating the inflectional categories is avoided if the imported lexicon includes morphology.

B.2.3 Options

There appears to be no consensus on the approach for representing morphology, but the situation can be summarized by listing four different options:

• Option-1: Inflected forms are explicitly represented;

• Option-2: The inflectional paradigm is fully and analytically described within the lexicon;

• Option-3: The inflectional paradigm refers to an external automaton;

• Option-4: The inflectional paradigm refers to an opaque compiled program.

When option-4 is selected, it is impossible to modularize or exchange data. For option-3, exchange is very difficult.

B.2.4 Description of morphological model

B.2.4.1 Introduction

The following description is based on the assumption that:

• It is possible to explicitly represent all the inflected forms (i.e. option-1);

• It is possible to fully describe the inflectional paradigm (i.e. option-2) by means of a symbolic description based on other elements.

Option-1 is feasible when the model is used to manage data for a language with simple morphology. It is not advised for a language with complex morphology because maintenance will be difficult. However, when the model is used for exporting data, option-1 can be used to deliver an inflected forms lexicon that is immediately useable. If needed, option-2 can be used as a source in order to automatically derive an automaton (i.e. option-3) or a compiled program (i.e. option-4).

For single word morphology, the description is based on a list of stems that are modified by a set of operations.

For multiple word morphology, the description is based on the combination of morphological features. This idea is taken from the Genelex-Eagles model with some improvements.

In the next three chapters, the following elements will be defined: Inflectional Paradigm, Composer, Inflected Form Calculator and Form.


B.2.4.2 Single words

The inflectional paradigm is the set of pairs that connects a combination of morphological features with a mechanism capable of computing an inflected form. The inflectional paradigm is shared by all the forms that have the same morphological behavior. Examples of large scale lexicons show that for languages with complex morphology like French, 350 paradigms are sufficient to describe 80 000 lemmatised forms that give 500 000 inflected forms. In a more complex language like German, 800 paradigms are needed.

The mechanism for producing an inflected form is the following:

• The computation refers to the lemmatised form or a list of stems. A zero reference indicates the lemmatised form. A positive integer indicates the rank assigned to the stem.

• Operations are specified in order to indicate that the string obtained by the previous point needs to be modified. Operations are ordered. Each operation is defined by an operator and an ordered list of arguments. An operator is a reference to an ISO 12620 data category. An operator could be a very simple mechanism like “add a string” or a more linguistically motivated mechanism like “move the stress to the penultimate syllable”. An argument is a string. Examples of operations are:

⎯ /add before/ that means “add a string to the left” e.g. in German “lesen” => “gelesen”

⎯ /remove after/ that means “remove N characters from the right” e.g. in French “chanter” => “chante”

⎯ /add after/ that means “add a string to the right” e.g. in French “chanter” => “chantera”

⎯ /substitute/ that means “replace N characters at position X by a string” e.g. for the English internal plural “foot” => “feet”

⎯ /copy/ that means “duplicate N characters from position X at position Y” e.g. the plural formed by means of reduplication, as in Indonesian [25] “mata” (eye) => “mata-mata” (eyes)

Each operation is applied once. Positions are defined by positive integers when starting from left and by negative integers when starting from right.
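The operations listed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a normative definition: positions are taken to be 1-based from the left and -1 for the last character from the right, arguments are passed as strings and converted where a number is expected, and the function and table names are hypothetical.

```python
def _index(s: str, position: str) -> int:
    """Convert a position (1-based from the left, negative from the right) to a 0-based index."""
    p = int(position)
    if p == 0:
        raise ValueError("positions are non-zero integers")
    return p - 1 if p > 0 else len(s) + p


def add_before(s, string):
    return string + s


def add_after(s, string):
    return s + string


def remove_after(s, n):
    return s[:len(s) - int(n)]


def substitute(s, n, position, string):
    i = _index(s, position)
    return s[:i] + string + s[i + int(n):]


def copy(s, n, source, target):
    i, j = _index(s, source), _index(s, target)
    return s[:j] + s[i:i + int(n)] + s[j:]


OPERATORS = {
    "add before": add_before,
    "add after": add_after,
    "remove after": remove_after,
    "substitute": substitute,
    "copy": copy,
}


def apply_operations(base: str, operations: list) -> str:
    """Apply an ordered list of (operator, arguments) pairs, each exactly once."""
    result = base
    for operator, arguments in operations:
        result = OPERATORS[operator](result, *arguments)
    return result


chante = apply_operations("chanter", [("remove after", ["1"])])
chantera = apply_operations("chanter", [("add after", ["a"])])
feet = apply_operations("foot", [("substitute", ["2", "2", "ee"])])
mata_mata = apply_operations("mata", [("add after", ["-"]), ("copy", ["4", "1", "6"])])
```

In this sketch the Indonesian plural needs two ordered operations, one to add the hyphen and one to copy the four characters of the base to the end of the string; whether the hyphen belongs to the /copy/ operator itself is not specified above.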

The inflectional paradigm is an element that contains the following attributes:

• /identifier/

• Example. This is a character string of a typical word that uses the given inflectional paradigm. For instance, in French, the commonly chosen example for the regular verbal inflection (first group) is “aimer” (to love).

• /part of speech/. This is a reference to an ISO 12620 data category;

• A Boolean value indicating whether the inflectional paradigm applies to written forms;

• A Boolean value indicating whether the inflectional paradigm applies to phonetic forms.


The inflectional paradigm is defined using morphological features combiners (see class model below). These combiners do not possess any attributes. They are nonlinguistic elements that only serve to link a composer with a morphological feature (see definition).

An inflected form calculator contains a double set of operators: one for graphical computation and one for phonetic computation. Its attributes are the following:

• A reference to a stem, which is an integer that refers to the stems attached to the lexical entry. The zero value indicates the use of the lemmatised form;

• The operators applied to the graphical form;

• The operators applied to the phonetic form;

• A string to describe contextual variation. It could be used for instance in French in order to mark elision.

Underspecification for an inflectional paradigm is possible. The paradigm can be declared without any morphological feature combiner. For instance, the lexical entry “drinken” can be described as being ruled by the paradigm “asSingen” and this paradigm can exist as an independent instance without being fully described.

B.2.4.3 Multiword expressions

A multiword expression is [20]:

• A continuous compound word like fra:“pomme de terre” or eng:“coefficient of friction”. Multiword expressions can be any part of speech, for instance fra:“d’après” (preposition), fra:“après-vente” (adjective), eng:“in the course of” (preposition).

• An agglutinative word that is defined as a word composed of various words without any graphical separator. Agglutinative words can also be any part of speech, e.g. “aftermarket”, used either as a noun or an adjective.

• A discontinuous and flexible expression like in French “passer en revue”. A discontinuous expression accepts an insertion. We could have “passer en revue”, “passer N1 en revue” or “passer en revue N1”.

A multiword expression could involve a support verb like: “make progress”.

A multiword expression is composed of components that are either autonomous or non- autonomous. For instance, in French “au fur et à mesure”, the component “fur” cannot appear alone. Such a word is called non-autonomous.

The mechanism for creating multiword expressions can be applied to agglutinative words. In fact, an agglutinative word is considered to be a compound word without any graphical separator.

The mechanism can also be applied recursively: a multiword expression can be composed of components that are themselves multiword expressions.

The system for representing multiword expressions combines three types of descriptions:

• Description of the order of the components;

• Description of possible insertions, if any;

• Information attached to each of the components indicating:

⎯ The combination of all morphological features, which is the most important indication required to compute inflected forms based on the components.

⎯ Reference to the component itself.

⎯ An optional graphical separator, for instance a space or a hyphen. In some languages, the graphical separator may result from a phonological insertion, as in the German word “Gesellschaftszimmer” = “Gesellschaft” + “s” + “Zimmer”. The “s” is purely phonological. Note that “Gesellschaft” is a feminine noun that does not take an “s” in any of its morphological forms. The only reason for the “s” is that a German speaker cannot move the tongue from the “t” to the “z” without sliding through the “s”.

⎯ A phonetic separator.

⎯ An open code to specify a transformation, e.g. in certain agglutinative languages where the component usually begins with an uppercase letter, but when used in a multiword the initial letter becomes a lowercase letter.
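The information attached to each component can be sketched in Python as follows. This is an illustration only: the `Component` class, its field names and the `compose` function are hypothetical, the component reference is reduced to a plain string, and the morphological features and phonetic separator are left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Component:
    form: str                                         # stands in for the reference to the component
    separator: str = ""                               # graphical separator written after the component
    transform: Optional[Callable[[str], str]] = None  # open code for a transformation


def compose(components: list) -> str:
    """Concatenate the ordered components, applying transformations and separators."""
    parts = []
    for component in components:
        text = component.transform(component.form) if component.transform else component.form
        parts.append(text + component.separator)
    return "".join(parts)


# Agglutinative word: linking "s" as separator, initial uppercase lowered by the transformation
zimmer = compose([
    Component("Gesellschaft", separator="s"),
    Component("Zimmer", transform=str.lower),
])

# Continuous compound word with a space as graphical separator
pomme = compose([
    Component("pomme", separator=" "),
    Component("de", separator=" "),
    Component("terre"),
])
```

The same function handles both the agglutinative word and the continuous compound, which reflects the statement above that an agglutinative word is treated as a compound word without a visible graphical separator.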

A composer contains the following attributes:

• A Boolean value indicating whether the word is single or multiple;

• The rank of the stem or of the component.

B.2.4.4 Form

Instead of referring to a Lexical Entry, the various descriptive mechanisms in Morphology refer to the form. One lexical entry can have various forms. The 1 to many cardinality between lexical entry and form makes it possible to represent orthographic variants of the same word. Because all these forms are connected to the same lexical entry, there is no distinction in meaning between these orthographic variants.

The various forms of a lexical entry may not hold any distinctive attribute (the lemmatised form excepted) that would mark one orthographic variant as the primary form and the others as secondary variants. For instance, in French, concerning the two forms “clé” and “clef” (“key” in English), a native speaker cannot decide which of these two variants is the primary orthographic variant and which is not.

The basic notion to retain is that in LMF there is no bijection between a Lexical Entry and a Form. A lexical entry may have several forms, for many different linguistic reasons, even if in languages like English (compared to Chinese) most of the time there is only one form for a lexical entry.

Thus the Form element holds the following information:

• Written lemmatised form expressed as a Unicode character string.

• Phonetic lemmatised form expressed as a Unicode character string.

• Transliteration and transcription coding. As said in the beginning of the current chapter, one lexical entry can have several forms. For instance, in Chinese, a

ISO 24613:2005

lexical entry can have one or several native coded forms (one in traditional ideographs and one in simplified ideographs) and one or several transliteration coded forms (one in non-spaced pinyin and one in spaced pinyin and tone). Examples could be: /non-spaced pinyin/ for “BEIYASHI” in Chinese /spaced pinyin and tone/ for “BEI4 YA1 SHI4” in Chinese

• Expansion variation. This is a data category that describes the kind of variation between various forms of the same lexical entry. The values could be for instance:
  /full form/ for “Global Positioning System”
  /abbreviated form/ for “GPS”

• Geographic variation. This is a data category that indicates that a specific form is used in a certain region as opposed to another form used in another region. The values could be for instance:
  /us/ for “analyze”
  /gb/ for “analyse”
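As a minimal sketch of the one-to-many relation between a lexical entry and its forms, the following Python fragment models the information listed above. The class names, field names and the helper forms_for_region are illustrative assumptions, not names mandated by LMF.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Form:
    # Written lemmatised form as a Unicode string.
    written: str
    # Optional descriptive information (hypothetical field names).
    phonetic: Optional[str] = None
    transliteration_coding: Optional[str] = None  # e.g. "spaced pinyin and tone"
    expansion_variation: Optional[str] = None     # e.g. "full form"
    geographic_variation: Optional[str] = None    # e.g. "us", "gb"


@dataclass
class LexicalEntry:
    part_of_speech: str
    # One entry, one to many forms: all forms share the same meaning.
    forms: List[Form] = field(default_factory=list)


analyse = LexicalEntry("verb", [
    Form("analyze", geographic_variation="us"),
    Form("analyse", geographic_variation="gb"),
])
gps = LexicalEntry("noun", [
    Form("Global Positioning System", expansion_variation="full form"),
    Form("GPS", expansion_variation="abbreviated form"),
])


def forms_for_region(entry: LexicalEntry, region: str) -> List[str]:
    """Return the written forms marked for the given region."""
    return [f.written for f in entry.forms if f.geographic_variation == region]
```

No form is flagged as primary in this sketch, which mirrors the remark about “clé” and “clef”: the variants are simply siblings under one entry.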

B.2.4.5 Varia

It should be recalled that the /part of speech/ information is not held by the Form but by the Lexical Entry element.

The morphological features are not held by the lexical entry because in many languages the combination of features cannot be factorised at this level in a generic manner. For instance, in French some words are /masculine/ when /singular/ and /feminine/ when /plural/. The French word “amour” (“love” in English) is an example of such a word (see [27] for explanations). When describing the morphology of a language, recording the /gender/ at the lexical entry level implies a non-generic strategy, in other words an ad hoc strategy for certain words and another strategy for the others. Such situations give very poor conditions for interoperability.

B.2.5 Class diagram

By convention, classes defined by the morphological model are colored. Classes taken from the core model or from DCS are in white. The following UML diagram shows the classes of the morphological model.

[UML class diagram; classes shown: LexicalEntry, Form, ListOfComponents, ListOfStems, Stem, InflectionalParadigm, InflectedForm, MorphologicalFeaturesCombiner, DataCategory12620, Composer, InflectedFormCalculator, Operation, GraphicalArgument, PhoneticArgument.]

Figure B-1: Morphological model

B.2.6 Examples

B.2.6.1 Morphology for the English word “clergyman” without any Inflectional Paradigm

Figure B-2 illustrates a very simple description. The form is “clergyman” and two inflected forms are connected to this instance. The first inflected form is “clergyman” for the singular and the second one is “clergymen” for the plural. The word is masculine, so both inflected forms are connected to the /masculine/ data category.

: Form
    lemmatisedForm = clergyman

: InflectedForm
    writtenForm = clergyman
    spokenForm

: InflectedForm
    writtenForm = clergymen
    spokenForm

: DataCategory12620
    att = gender
    val = masculine

: DataCategory12620
    att = number
    val = singular

: DataCategory12620
    att = number
    val = plural

Figure B-2: Clergyman with inflected forms

B.2.6.2 Morphology for the English word “clergyman” with an InflectionalParadigm

As an alternative to the previous diagram, an inflectional paradigm can be used. Figure B-3 illustrates an example of an inflectional paradigm for single words. The singular of this word is “clergyman” and the plural is “clergymen”: two letters are removed and two letters are added. The inflectional paradigm is called “asMan”. English morphology is relatively simple, so the representation is simple: it is not necessary to manage any stem, and a reference to the lemmatised form can be used. The value of the stem attribute is therefore zero.

Compared with the previous diagram, this one seems more complex. But since the inflectional paradigm is factorised over the whole lexicon, the only information particular to this entry is the link between the form and the inflectional paradigm.

: Form
    lemmatisedForm = clergyman

: InflectionalParadigm
    id = asMan
    graphy = true

: DataCategory12620
    att = gender
    val = masculine

: MorphologicalFeaturesCombiner (for “clergyman”)
  : DataCategory12620
      att = number
      val = singular
  : InflectedFormCalculator
      stem = 0

: MorphologicalFeaturesCombiner (for “clergymen”)
  : DataCategory12620
      att = number
      val = plural
  : InflectedFormCalculator
      stem = 0
    : Operation
        GraphicalOperator = removeAfter
      : GraphicalArgument
          val = 2
          comment
    : Operation
        GraphicalOperator = addAfter
      : GraphicalArgument
          val = en
          comment

Figure B-3: Clergyman with an inflectional paradigm
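The factorisation described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the operator names come from Figure B-3, but their exact semantics (removeAfter drops N trailing characters, addAfter appends a string), the dictionary layout of the paradigm and the second test word “fireman” are assumptions made for the example.

```python
from typing import Dict, List, Tuple

# An operation is (operator, argument); operator names follow Figure B-3.
Operation = Tuple[str, object]
Paradigm = Dict[Tuple[str, str], List[Operation]]


def apply_operations(lemma: str, operations: List[Operation]) -> str:
    """Apply graphical operations to the lemmatised form (stem = 0)."""
    word = lemma
    for operator, argument in operations:
        if operator == "removeAfter":    # assumed: drop N trailing characters
            word = word[:len(word) - int(argument)]
        elif operator == "addAfter":     # assumed: append a string
            word = word + str(argument)
        else:
            raise ValueError(f"unsupported operator: {operator}")
    return word


# Paradigm "asMan": one operation list per combination of features.
AS_MAN: Paradigm = {
    ("masculine", "singular"): [],
    ("masculine", "plural"): [("removeAfter", 2), ("addAfter", "en")],
}


def inflect(lemma: str, paradigm: Paradigm) -> Dict[Tuple[str, str], str]:
    """Compute every inflected form of `lemma` described by `paradigm`."""
    return {features: apply_operations(lemma, ops)
            for features, ops in paradigm.items()}
```

The paradigm is defined once; each entry only needs a reference to it, so `inflect("clergyman", AS_MAN)` and `inflect("fireman", AS_MAN)` reuse the same operation lists.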

B.2.6.3 Morphology for the German word “trinken”

Figure B-4 illustrates another example of an inflectional paradigm for single words, but this example is slightly more complex. The example is the German verb “trinken”; for lack of space, only the past participle is drawn. The paradigm is called “asSingen”. Two operations are applied. The first operation adds the prefix “ge” at the beginning of the word. The second operation substitutes one letter in the middle of the word. The position is specified from the end of the word, and for that purpose it is given as a negative integer. The letter “u” takes the place of the letter “i”.

The infinitive is “trinken” and the past participle is “getrunken”.

: Form
    lemmatisedForm = trinken

: InflectionalParadigm
    id = asSingen
    graphy = true

: MorphologicalFeaturesCombiner (for “getrunken”)
  : DataCategory12620
      att = tense
      val = past
  : DataCategory12620
      att = mood
      val = participle
  : InflectedFormCalculator
      stem = 0
    : Operation
        GraphicalOperator = addBefore
      : GraphicalArgument
          val = ge
          comment
    : Operation
        GraphicalOperator = substitute
      : GraphicalArgument
          val = -5
          comment = where
      : GraphicalArgument
          val = u
          comment = what

Figure B-4: Trinken
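The two operations of Figure B-4 can be illustrated with a short Python sketch. The function names are hypothetical, and the reading of the negative position (-5 designates the fifth character counted from the end of the word) is an assumption that happens to match the figure.

```python
def add_before(word: str, prefix: str) -> str:
    """addBefore: put `prefix` at the beginning of the word."""
    return prefix + word


def substitute(word: str, position: int, replacement: str) -> str:
    """substitute: replace the single character at `position`.

    Negative positions count from the end of the word (assumed reading
    of the 'where' argument of Figure B-4).
    """
    index = position if position >= 0 else len(word) + position
    if not 0 <= index < len(word):
        raise IndexError("position outside the word")
    return word[:index] + replacement + word[index + 1:]


def past_participle_as_singen(lemma: str) -> str:
    """Apply the two operations drawn for the paradigm 'asSingen'."""
    word = add_before(lemma, "ge")
    return substitute(word, -5, "u")
```

Because the position is counted from the end, the same pair of operations works for “trinken” and for “singen”, the verb that gives the paradigm its name, even though the vowel sits at different distances from the start of the two words.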

B.2.6.4 Morphology for the French word “pomme de terre”

The inflected forms are computed from the components of the multiword by means of a reference to the combination of morphological features for each component. The singular of “pomme de terre” is “pomme de terre”. The plural is “pommes de terre”. In French it is common for an NdeN pattern to exhibit this kind of variation on the sole head of the compound noun, the modifier being frozen. The morphological feature combiners on the left side represent the number and gender of the compound. The preposition “de” is not bound to any data category because it has no morphological feature.

: Form
    lemmatisedForm = pomme de terre

: InflectionalParadigm
    id = NVariableDeFixe
    graphy = true

: ListOfComponents
  : Form lemmatisedForm = pomme
  : Form lemmatisedForm = de
  : Form lemmatisedForm = terre

: MorphologicalFeaturesCombiner (for “pomme de terre”)
  : DataCategory12620 att = gender, val = feminine
  : DataCategory12620 att = number, val = singular
  : Composer separator = NIL, rank = 0
    : DataCategory12620 att = gender, val = feminine
    : DataCategory12620 att = number, val = singular
  : Composer separator = SPACE, rank = 1
  : Composer separator = SPACE, rank = 2
    : DataCategory12620 att = gender, val = feminine
    : DataCategory12620 att = number, val = singular

: MorphologicalFeaturesCombiner (for “pommes de terre”)
  : DataCategory12620 att = gender, val = feminine
  : DataCategory12620 att = number, val = plural
  : Composer separator = NIL, rank = 0
    : DataCategory12620 att = gender, val = feminine
    : DataCategory12620 att = number, val = plural
  : Composer separator = SPACE, rank = 1
  : Composer separator = SPACE, rank = 2
    : DataCategory12620 att = gender, val = feminine
    : DataCategory12620 att = number, val = singular

(In the diagram, each Composer carries an association labelled “head”.)

Figure B-5: Pomme de terre
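A rough Python sketch of how the composers of Figure B-5 could drive the computation of the inflected multiword is given below. The separator and rank attributes are taken from the figure; the table of component forms, the convention that the separator is written before its component, and all function names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Inflected forms of the variable components, keyed by number (toy data).
COMPONENT_FORMS: Dict[str, Dict[str, str]] = {
    "pomme": {"singular": "pomme", "plural": "pommes"},
    "terre": {"singular": "terre", "plural": "terres"},
}


@dataclass
class Composer:
    rank: int               # position of the component in the multiword
    separator: str          # written before the component ("" stands for NIL)
    number: Optional[str]   # None when no data category is bound (e.g. "de")


def compose(components: List[str], composers: List[Composer]) -> str:
    """Build one inflected form of a multiword from its components."""
    parts = []
    for composer in sorted(composers, key=lambda c: c.rank):
        lemma = components[composer.rank]
        if composer.number is None:
            form = lemma    # invariable component, used as is
        else:
            form = COMPONENT_FORMS[lemma][composer.number]
        parts.append(composer.separator + form)
    return "".join(parts)


COMPONENTS = ["pomme", "de", "terre"]
SINGULAR = [Composer(0, "", "singular"), Composer(1, " ", None),
            Composer(2, " ", "singular")]
PLURAL = [Composer(0, "", "plural"), Composer(1, " ", None),
          Composer(2, " ", "singular")]
```

Only the head component changes between the two composer lists: the modifier “terre” stays singular in the plural form, which is the NdeN behavior described in the text.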

B.2.6.5 Morphology for the French abbreviation ZEP

The morphological description is not connected directly to the Lexical Entry but to the Form. A lexical entry can have various forms and each form can have its own morphology. In certain languages, like French, different orthographic variants can have different morphological behaviors. This is the case for abbreviations like “ZEP”. The singular abbreviated form is “ZEP”. Its plural is “ZEPs”. The singular extended form is “Zone d’éducation prioritaire” (“priority educational area” in English). The plural extended form is “Zones d’éducation prioritaire”. In French, the headword of a multiword expression (MWE) is at the beginning of the expression, so the plural “s” is set on the first word. For the abbreviated word, the “s” is at the end of the word. The two orthographic variants therefore have different morphologies. This behavior is not an exception: it is the rule in French. The linguistic description can be illustrated as follows, without any detail concerning the two inflectional paradigms:

: LexicalEntry
    language = fre
    partOfSpeech = noun

  : Form
      lemmatisedForm = ZEP
      expansion variation = abbreviated
    : InflectionalParadigm
        id = asTable
        graphy = true

  : Form
      lemmatisedForm = zone d'éducation prioritaire
      expansion variation = full
    : InflectionalParadigm
        id = asNdeNA
        graphy = true

B.2.7 Summary

The model presented here permits the description of inflectional morphology. The model is the same for languages with simple morphology and for languages with complex morphology.

The model is based on the Genelex-Eagles model [23][24] with three improvements:

• Possibility to represent explicitly inflected forms;

• Improved support for multiword expressions;

• Support for Asian and Semitic languages.

Note that, because derivation in most languages relies on the distinction of semantic units, derivation shall be described in the semantic extension.

B.3 Syntax

B.3.1 Objectives

The purpose of this sub-chapter is to describe the properties that govern how a word combines with other words in a sentence.

It is possible to split the syntactic properties of a word into two separate groups: those that characterize the word as syntactically dependent, and those that govern the behavior of its dependent words. The properties of the first group deal with the capacity of the word to enter into certain syntactic frames, which are more grammatical than lexical, and thus are not discussed here. The properties of the second group concern syntactic actants and are bound to the lexicon, so these properties are described here. Consequently, the syntactic model describes specific syntactic properties of words and does not express grammar.

B.3.2 Absence vs presence of syntax in a lexicon

The syntactic description is attached to the lexical entry unit and to the sense unit.

Syntactic description is optional, so it is possible to describe morphology and semantics without any syntactic description. Note that this possibility is a major improvement over Eagles, where syntax is mandatory.

Conversely, what is possible in Eagles is still possible in LMF/NLP: syntax can be described without semantics.

In other words, and schematically, instead of a layered structure as in Eagles, with three layers (Morphology, Syntax and Semantics), the backbone of LMF/NLP is a triangle comprising three sub-parts: Morphology, Syntax and Semantics. Each sub-part holds a central object, respectively the Lexical Entry, the Syntactic Behavior and the Sense. Only the Lexical Entry is mandatory; the others are optional.

Such a structure is modelled as follows:

LexicalEntry

SyntacticBehavior Sense

The syntactic behavior of a word is also connected to another sub-part of the model: the multilingual model, in order to represent transfer between two languages. This means that if users want to describe transfer, they need to include the syntactic descriptions.

B.3.3 Options

The syntactic model shown here is based on the Eagles model.

Splitting conditions are not defined in the model; they are only constrained by cardinalities [32][19]. In particular, several senses of a given Lexical Entry can share a common syntactic behavior. For instance, in Italian, the lexical entry “limone” (“lemon” in English) is polysemous: one sense is lemon as a fruit and another is lemon as a tree. Since both senses are rather standard count nouns, they can share a common Syntactic Behavior element in order to simplify coding.

B.3.4 Description of syntactic model

Syntactic Behavior

The syntactic behavior is an element that represents one of the possible behaviors of one or several senses. Syntactic behavior contains no attribute.

Syntactic Frame

Syntactic Frame is the element that describes one syntactic construction.

Syntactic Frame contains four attributes: id, function, example and comment.

Self

Self is the element that describes the central node of the Frame. The syntactic peculiarities of the word can be described by attaching as many Syntactic Features as needed.

Syntactic Feature

Syntactic Feature is a pair of data categories. In Italian, for instance, such a pair could be used to describe the property of a verb like “amare” (to love) concerning the type of auxiliary. The pair will be composed of /auxiliary/ and “avere” (to have).

Position

A Position is an element that connects a syntactic actant and a semantic actant.

Syntactic Actant (see definition)

The syntactic actant contains the following attributes:

• definition

• comment

• introducer, for instance the preposition located at the beginning of the syntactic group

• syntactic group label

• restriction

Frame Set

The Frame Set element describes a set of syntactic Frames and possibly the relation that holds between these Frames.

Certain languages have simple syntax and others have complex syntax. In the latter, describing every behavior precisely is a huge (if not impossible) task. The mapping from a representation in which a predicate-argument structure describes “deep syntactic” relations to a representation of surface grammatical relations or functions is subject to certain morpho-syntactic rules (active/passive voice) and to lexically determined features of the predicates (transitive, ergative, pronominal verbs). When these mappings are regular, they can be described by means of types, or sets of frames that a verb can enter into, which helps to reduce redundant information in the lexicon. Frame Set is used for this purpose. A Frame Set is an element that groups together various syntactic Frames that appear frequently for certain sets of words, the objective being to factorize syntactic descriptions and to keep the number of syntactic behavior elements in the lexicon to a minimum. For instance, in English, it is possible to have one Frame Set for ergative verbs. The verb “boil”, as in “he boils a kettle of water” and “the kettle boils”, will then have only one syntactic behavior (referring to a single Frame Set) instead of two syntactic behaviors (one for “he boils a kettle of water” and one for “the kettle boils”).

Frame Set contains two attributes: example and comment.
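The factorisation offered by Frame Set can be sketched in Python as follows. All class layouts, the frame identifiers “transitive” and “intransitive”, and the additional verbs “melt” and “break” are assumptions chosen for illustration; only the “boil” examples come from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class SyntacticFrame:
    id: str
    example: str = ""


@dataclass
class FrameSet:
    frames: List[SyntacticFrame]
    example: str = ""
    comment: str = ""


@dataclass
class SyntacticBehavior:
    # Frame sets referenced by this behavior.
    frame_sets: List[FrameSet] = field(default_factory=list)


TRANSITIVE = SyntacticFrame("transitive", "he boils a kettle of water")
INTRANSITIVE = SyntacticFrame("intransitive", "the kettle boils")

# One Frame Set shared by all ergative verbs of the lexicon.
ERGATIVE = FrameSet([TRANSITIVE, INTRANSITIVE], comment="ergative verbs")

# Each verb needs a single behavior that points to the shared Frame Set.
behaviors: Dict[str, SyntacticBehavior] = {
    verb: SyntacticBehavior([ERGATIVE]) for verb in ("boil", "melt", "break")
}


def frames_of(verb: str) -> List[str]:
    """List the frame identifiers reachable from the verb's behavior."""
    return [frame.id
            for frame_set in behaviors[verb].frame_sets
            for frame in frame_set.frames]
```

With this layout, adding a new ergative verb costs one behavior and one reference, whatever the number of frames in the set.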

B.3.5 Class diagram

[UML class diagram; syntactic classes shown: SyntacticBehavior, FrameSet, SyntacticFrame, Self, SyntacticFeature, Position, RelatedSlot, SyntacticActant. Also shown: LexicalEntry, and Sense and Argument, which are described in the Semantic section.]

Figure B-6: Syntactic model

B.3.6 Example of the Italian Construction « amare »

This example is taken from the CLIPS lexicon (www.ilc.cnr.it). In this example, only syntactic structures are used; no semantic information is described.

This is a rather simple construction in Italian where both the subject and the direct object are noun phrases. The Self object describes a verb that takes the auxiliary “avere”. A typical example of such a construction is “Gianni ama Maria”.

: SyntacticFrame
    id = amare-SyntFrame

  : Self
      id = amare-self
    : SyntacticFeature
        name = aux
        value = avere

  : Position
      id = NPsubj
      function = subject
    : SyntacticActant
        syntacticGroup = NP

  : Position
      id = NPobj
      function = object
    : SyntacticActant
        syntacticGroup = NP

Figure B-8: Amare

B.3.7 Summary

The model adopted here derives from an extensive study of various syntactic theoretical orientations. In the spirit of Genelex-Eagles, the model is general enough to accommodate a variety of theoretical positions. Moreover, various implementations of this system have already been realized.

B.4 Semantics

B.4.1 Objectives

The purpose of this section is to describe one sense and its relations with other senses belonging to the same language. The linked senses belonging to different languages are to be described by using the multilingual section.

B.4.2 Absence vs presence of semantics in a lexicon

Semantics is not mandatory. Some lexicons contain only morphological entries without any syntactic and semantic descriptions. Nevertheless most lexicons that contain syntactic descriptions have at least very simplistic semantic features like +human or +location.

B.4.3 Options

LMF does not impose any depth on the semantic description, and different degrees of richness can coexist within the same lexicon. For instance, a technical and specialized lexicon can have a rich description for its technical words and very shallow (or non-existent) semantic information attached to general words or senses.

Various descriptive mechanisms are proposed, such as synsets, predicates, linkage with syntax, relations and features. LMF does not impose any exclusive usage of these mechanisms. For instance, a user can manage a multilingual database with synsets in English and predicates in another language. Of course, these mechanisms can be combined within the same language, temporarily or permanently.

B.4.4 Description of semantic model

Several types of important information are distinguished:

• The sense with semantic feature and relation

• The example

• The definition with proposition

• The synset and synset relation

Sense

The sense element is used to represent one sense of a word. A word that has more than one sense will be represented by one lexical entry and more than one sense.

A sense has a label.

A sense can be refined into more specific senses, as in some publishing lexicons. For instance, in [31] the entry “Regret” has two first-level senses, each of which has two sub-level senses, and so on; certain branches of this hierarchy have four levels. This mechanism is not described through a direct link, as in the following diagram, but with a Relation, so that auxiliary information such as meta-data can be attached to the relation if needed.

Sense

So, in order to represent this structure:

#1 : Sense

#2 : Sense

The following relation is used:

#1 : Sense

: Relation type = specificToGeneric

#2 : Sense
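The use of a Relation object instead of a direct link between senses can be sketched in Python as follows. The direction of the relation (from the more specific sense to the more general one), the metadata field, the function name and the sense labels are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Sense:
    label: str


@dataclass
class Relation:
    source: Sense   # assumed: the more specific sense
    target: Sense   # assumed: the more general sense
    type: str
    # Auxiliary information (meta-data) attached to the relation itself.
    metadata: Dict[str, str] = field(default_factory=dict)


def broader_senses(sense: Sense, relations: List[Relation]) -> List[Sense]:
    """Follow specificToGeneric relations upwards from `sense`."""
    chain: List[Sense] = []
    current = sense
    for _ in range(len(relations)):   # bounded walk, safe against cycles
        parents = [r.target for r in relations
                   if r.source is current and r.type == "specificToGeneric"]
        if not parents:
            break
        current = parents[0]
        chain.append(current)
    return chain


# Hypothetical labels for a two-level sense hierarchy.
general = Sense("regret-1")
specific = Sense("regret-1a")
relations = [Relation(specific, general, "specificToGeneric",
                      {"note": "illustrative"})]
```

A direct parent pointer on Sense would carry the same hierarchy, but it would leave no place for the meta-data that the Relation object can hold.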

Semantic Feature

The semantic feature is a pair of data categories. The first one is called the attribute and the second one the value. The attribute could be for instance:

• /dating/ This is a data category that marks the sense in order to indicate whether the sense is considered as old or modern. The values could be for instance:
  “old” for “???” in English
  “modern” for “???” in English

• /style/ This is a data category that marks the lexical entry in order to indicate the usage level. The values could be for instance:
  “standard” for “voiture” in French
  “slang” for “bagnole” in French

• /frequency/ This is an open attribute that could be filled by a number or a data category like “frequent” or “rare”.

• /geography/ This is a data category that marks the sense in order to indicate that the sense is typically used in a certain geographical region. The values could be for instance:
  “canada” for “caribou” in French
  “metropolitan” for “renne” in French

• /human/ The value could be /plus/ or /minus/.

Relation

A relation is used to represent semantic derivation or synonymy. The relation mechanism can be used to represent lexical functions in the spirit of DEC [22]. The relation contains the following attributes:

• name

• type

• comment

• example

Example

Example is an element that contains the attributes text, source and language.

A sense can have zero to many examples. The language is the same as that of the lexical entry, but the form can be expressed in a more or less explicit way. For instance, a lexicon of Bambara can hold examples written in the usual orthography and examples with tones added, so that beginners can understand the example.

Definition

The textual definition is not intended to be used by programs. It is provided to ease maintenance by human beings and can be displayed to the final user.

Definition is an element that contains the attributes text, source, language and view.

A sense can have zero to many definitions. It should be noted that the definition can be expressed in a language other than that of the lexical entry.

Proposition

Optionally, a definition can be defined by several propositions.

The proposition holds: name, type and text.

Predicative Representation

Predicative Representation describes the link between the sense and the predicate. For instance, a sense of a noun that is semantically derived from a verb can be linked to the predicate. In such a situation, the type of the predicative representation can be qualified as “VerbNominalization”.

The class holds:

• type

Predicate

The predicate holds:

• name

• type

• definition

• view

Argument

The Argument represents the semantic actant and provides the link with syntax. Argument contains the following attributes:

• semantic role

• type

• definition

• restriction.

Predicate Relation

Predicate Relation links two predicates and contains the attributes:

• name

• type

B.4.5 Class diagram

[UML class diagram; semantic classes shown: Sense, SemanticFeature, Relation, PredicativeRepresentation, Predicate, PredicateRelation, Argument, SynSet, SynSetRelation, Example, Definition, Proposition. Also shown: LexicalEntry, SyntacticBehavior, SyntacticFrame and Position.]

Figure B-9: Semantic model

B.4.6 Example of the French verb “aider”

The following example presents the syntax of the sense “Aider1” in the style of “Dictionnaire Explicatif et Combinatoire” (DEC) as taken from the chapter: “zone de combinatoire” [17].

“Aider1” is linked to the semantic actants “X aide Y à Z-er par W”, as in “il vous aidera par son intervention à surmonter cette épreuve”.

This example yields eight frames:

Frame-1: La Grande-Bretagne aide ses voisins.
Frame-2: La Grande-Bretagne a aidé à la création de l’ONU.
Frame-3: La Grande-Bretagne a aidé dans toutes les activités de l’ONU.
Frame-4: La Grande-Bretagne a aidé l’ONU à réussir ce projet.
Frame-5: La Grande-Bretagne a aidé l’ONU pour la réussite de ce projet.
Frame-6: La Grande-Bretagne a aidé l’ONU avec ses activités.
Frame-7: La Grande-Bretagne a aidé à la création de l’ONU.
Frame-8: La Grande-Bretagne a aidé les autres pays à créer l’ONU.

The following instance diagram presents Frame-1 : ‘La Grande-Bretagne aide ses voisins’. The decision has been taken to combine syntactic and semantic structures.

: LexicalEntry

  : SyntacticBehavior
    : SyntacticFrame
      : Position
          id
          function = subject
        : SyntacticActant
            syntacticGroup = N
        : Argument
      : Position
          id
          function = directObject
        : SyntacticActant
            syntacticGroup = N
        : Argument

  : Sense
    : PredicativeRepresentation
      : Predicate

Figure B-7: Aider

B.5 Multilingual notations

B.5.1 Objectives

The purpose is to describe the translation of a sense or a syntactic behavior from one language into another language.

B.5.2 Absence vs presence of multilingual notations in a lexicon

The multilingual model can be used for a lexical database describing two or more languages. There is no need to use the multilingual notations in a monolingual lexicon.

B.5.3 Options

In a bilingual lexicon, a link is needed to represent the translation of a given sense from one language into another. For true one-to-one equivalents, a simple link can be sufficient, but actual practice reveals at least three types of problems:

Problem 1: diversification and neutralization

In certain circumstances, a simple bijection from one language to the next does not work well because the precision of the source language is not the same as that of the target language. Thus, when translating the French word “fleuve” (a river that flows into the sea) into English, the solution cannot be as precise as in French because no single word with this precise meaning exists in the target language. The French word “rivière” can be translated exactly by the English “river”; “fleuve” can also be translated by “river”, albeit with some loss of precision. The problem arises when one must translate the English “river” into French. A solution is to create an intermediate object and to specify “fleuve” as a specialization relative to the translation link between “rivière” and the English “river”.

Problem 2: number of links

Although the strategy of one-to-one equivalence is viable for two languages, it becomes untenable for a larger number of languages: even with only four or five languages, the number of links explodes to unmanageable proportions. For example, the European environment has to contend with a mosaic of languages, where most environments operate in a minimum of five languages with lexicons containing hundreds of thousands of senses. The number of language pairs to be linked grows with the square of the number of languages: for N languages, the exact number of pairs is (N * (N-1))/2. For just five languages this gives ten language pairs, so a lexicon with 100,000 senses implies the management of at least one million links. And five languages is a minimum: some systems manage more than ten languages, and many localization environments manage between twenty and forty.

In order to avoid these problems, translation equivalents can be managed through an intermediate object, which is called an “axis” in the Papillon project and serves as a pivot structure that links lexical elements from lexicons in multiple languages.
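The arithmetic behind Problem 2 can be checked with a few lines of Python. The sketch reads N in the formula (N * (N-1))/2 as the number of languages, which is the reading consistent with the one-million figure quoted for five languages. The pivot count (one link from each sense in each language to an axis) is an assumption used only for comparison, and the function names are hypothetical.

```python
def language_pairs(n_languages: int) -> int:
    """Number of unordered language pairs: N * (N - 1) / 2."""
    return n_languages * (n_languages - 1) // 2


def pairwise_links(n_languages: int, n_senses: int) -> int:
    """Direct links when every sense is linked in every language pair."""
    return language_pairs(n_languages) * n_senses


def pivot_links(n_languages: int, n_senses: int) -> int:
    """Links when each sense in each language points to one axis (assumption)."""
    return n_languages * n_senses


for n in (2, 5, 10, 20, 40):
    print(n, pairwise_links(n, 100_000), pivot_links(n, 100_000))
```

Under these assumptions the pairwise count grows quadratically with the number of languages while the pivot count grows linearly, which is the motivation for the intermediate axis object.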

Problem 3: transfer or pivot

There are two approaches to multilingual translation in NLP: transfer and pivot. Transfer operates on the basis of syntax and pivot on the basis of semantics. The majority of translation (or computer-aided translation) tools use the transfer approach, but it is important not to ignore advocates of the pivot approach who focus on technical niches or start from an interlingual representation layer. Furthermore, translation is not the only application of a multilingual lexicon: cross-lingual (or so-called trans-lingual) information retrieval is an important and increasingly critical domain, even if it is less familiar. In this area, the division between pivot and transfer approaches is not always clearly defined. Indeed, hybrid systems may prove to be most effective in the future. As a consequence, the model presented here must allow for both approaches.

B.5.4 Description of multilingual model

The model is inspired by the Papillon model [21].

The model provides the following three mechanisms:

The Sense Axis links closely related senses in different languages. This axis is used to implement the pivot approach. It can also be used for the transfer approach in situations where a translation equivalent is available but there is no information to describe syntax. The axis facilitates the translation of words that do not necessarily have the same conceptual valence or morphological form from one language to another. Thus, a single word in the source language can be translated by a compound word in the target language. The axis can be described by an axis relation. The AxisRelation holds the following optional attributes:

• An identifier

• A label that is a character string. It could be a free textual definition

• The name of an external descriptive system

• A reference to the external description

The label enables the coding of simple interlingual relations like the specialization of “fleuve” compared to “rivière” and “river”. It is not, however, the goal of this strategy to code a complex system for knowledge representation, which ideally should be structured as a complete, coherent system designed specifically for that purpose.

The Transfer Axis is designed to represent multilingual transfer. Here linkage is at the level of syntax. This approach enables the representation of syntactic actants involving inversion, such as:

fra:“elle me manque” => eng:“I miss her”

Due to the fact that a lexical entry can be a support verb, it is possible to represent translations that start from a plain verb (in the source language) to a support verb (in the target language) like:

fra:“Marie rêve” => jpn:“Marie wa yume wo miru”.

This axis contains no attribute.

The Example Axis provides documentation for sample translations. The goal is not to record large scale multilingual corpora. The goal is to link a Lexical Entry with a typical example of translation.

B.5.5 Class diagram

It should be noted that the system is applicable to bilingual and multilingual lexica. In the following diagram, the fact that a language is located on the left side or on the right side has no particular meaning. The structure can be traversed in either direction.

[UML class diagram; classes shown: SenseAxis, linked to Sense in language-1, language-2 and language-3; AxisRelation (attribute: label), linked to SenseAxis; TransferAxis, linked to SyntacticBehaviour in language-1, language-2 and language-3; ExampleAxis, linked to Example in language-1, language-2 and language-3.]

Figure B-11: Multilingual model

B.5.6 Example of translation of “fleuve” and “rivière”

In the diagram French is located on the left side and English on the right side.

: Sense (label = fra:fleuve) linked to : SenseAxis

: AxisRelation (label = flows into the sea) linking the two SenseAxis instances

: Sense (label = fra:rivière) linked to : SenseAxis linked to : Sense (label = eng:river)

Figure B-12: Fleuve
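The configuration of Figure B-12 can be sketched in Python as follows. The class layouts, the direction of the AxisRelation (from the more specific axis to the more general one), the language prefix convention in labels and the function translate are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Sense:
    label: str   # language-prefixed label, e.g. "fra:fleuve"


@dataclass
class SenseAxis:
    senses: List[Sense] = field(default_factory=list)


@dataclass
class AxisRelation:
    source: SenseAxis   # assumed: the more specific axis
    target: SenseAxis   # assumed: the more general axis
    label: Optional[str] = None


def translate(label: str, language: str, axes: List[SenseAxis],
              relations: List[AxisRelation]) -> Tuple[List[str], bool]:
    """Return (translations, exact) for a sense label and a target language."""
    prefix = language + ":"
    for axis in axes:
        if any(s.label == label for s in axis.senses):
            direct = [s.label for s in axis.senses if s.label.startswith(prefix)]
            if direct:
                return direct, True
            # No equivalent on the same axis: fall back on a related axis.
            for rel in relations:
                if rel.source is axis:
                    found = [s.label for s in rel.target.senses
                             if s.label.startswith(prefix)]
                    if found:
                        return found, False
    return [], False


fleuve_axis = SenseAxis([Sense("fra:fleuve")])
river_axis = SenseAxis([Sense("fra:rivière"), Sense("eng:river")])
AXES = [fleuve_axis, river_axis]
RELATIONS = [AxisRelation(fleuve_axis, river_axis, "flows into the sea")]
```

In this sketch “rivière” yields “river” as an exact equivalent, whereas “fleuve” yields “river” only through the axis relation, flagged as inexact; the label of the relation records what is lost.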

B.5.7 Summary

This section of the model is not new. All ideas come from Papillon (www.papillon-dictionary.org) and MILE. The model makes it possible to represent both the pivot and the transfer approaches and is suited to both small and large scale databases.

B.6 Extended examples of NLP lexicons

B.6.1 Foreword

The purpose of this section is to study lexicons in order to describe how they fit or map to LMF/NLP. The first example is an invented one and is called “getting started lexicon”. The other examples are real, existing lexicons.

When available, raw data are presented.

Concerning UML to XML conversion, the following conventions are adopted:

• each UML attribute is transcoded as an XML attribute;

• UML aggregations are transcoded as content inclusion;

• UML shared associations (i.e. associations that are not aggregations) are transcoded as XML attributes;

• each UML class is transcoded as an XML element.
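The four conventions above can be illustrated with Python's standard xml.etree.ElementTree module. This is only a sketch: the element and attribute names reuse the class names of this annex, but the nesting of LexicalEntry inside LexicalDB, the attribute name inflectionalParadigm used for the shared association, and the function build_example are assumptions, not a schema defined by LMF.

```python
import xml.etree.ElementTree as ET


def build_example() -> ET.Element:
    # Each UML class becomes an XML element.
    db = ET.Element("LexicalDB")
    ET.SubElement(db, "InflectionalParadigm", {"id": "asMan"})

    # Each UML attribute becomes an XML attribute;
    # a UML aggregation becomes content inclusion (nesting).
    entry = ET.SubElement(db, "LexicalEntry",
                          {"language": "eng", "partOfSpeech": "noun"})

    # A shared association (not an aggregation) becomes an XML attribute
    # that holds the identifier of the target (assumed attribute name).
    ET.SubElement(entry, "Form", {"lemmatisedForm": "clergyman",
                                  "inflectionalParadigm": "asMan"})
    return db


if __name__ == "__main__":
    print(ET.tostring(build_example(), encoding="unicode"))
```

The paradigm is serialised once and referenced by identifier, so the XML keeps the factorisation of the UML model instead of copying the paradigm into every entry.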

B.6.2 Getting started lexicon

This example is not an existing NLP lexicon but an invented one, taken from a publishing dictionary [26].

This lexicon is named “Getting started lexicon” and ISO-639-2 will be used for language coding, in order to specify languages with three lower-case letters. As a consequence, these values will be recorded inside the Global Information instance.

We will describe the English word “balcony” and the Spanish words “balcón” and “anfiteatro”. According to [26], the word “balcony” has two senses 1) a structure built onto the outside of a high window, so that you can stand or sit outside 2) the seats upstairs in a theater. On the English side, the morphology comprises graphical and phonetic inflected forms. Nothing is said in syntax. We have two senses and two textual definitions. Each sense is connected to a Sense Axis. On the Spanish side, we have two senses connected to two different lexical entries. The UML instance diagram is as follows:

: LexicalDB

: GlobalInformation
    name = Getting Started lexicon
    languageCode = ISO-639-2

: LexicalEntry
    language = eng
    partOfSpeech = noun
  : Form
      lemmatisedForm = balcony
    : InflectedForm
        writtenForm = balcony
        spokenForm = baelkEni
    : InflectedForm
        writtenForm = balconies
        spokenForm = baelkEniz
  : Sense (linked through a SenseAxis to the Sense of “balcon”)
    : TextualDefinition
        text = A structure built onto the outside of a high window, so that you can stand or sit outside.
        source = Longman DAE
        language = eng
  : Sense (linked through a SenseAxis to the Sense of “anfiteatro”)
    : TextualDefinition
        text = The seats upstairs in a theater.
        source = Longman DAE
        language = eng

: LexicalEntry
    language = spa
    partOfSpeech = noun
  : Form
      lemmatisedForm = balcon
  : Sense

: LexicalEntry
    language = spa
    partOfSpeech = noun
  : Form
      lemmatisedForm = anfiteatro
  : Sense

The data could be expressed by the following XML file:


B.6.3 LMF and Open Lexicon Interchange Format (OLIF)

B.6.3.1 Presentation

OLIF has its origin in the Open Translation Environment for Localization (OTELO) project, which worked on a multi-vendor machine translation environment and was funded by the EC in the 4th Framework Programme. The first version of OLIF (OLIF-1) was a lean and flat format for lexicon exchange. OLIF-2 was defined later and is XML compliant. The consortium included Microsoft, SAP, Basis, Trados, Systran and Xerox (www.olif.net).

The basic idea of OLIF is to facilitate the exchange of, primarily, the pivotal information in entries. This information should be easy to compile into the information needed by other formalisms. OLIF also provides the option of a deeper lexical representation: the format includes general coverage of inflection paradigms, verb argument structure, semantic types and selectional restrictions.

Each entry is uniquely defined by a set of key data: canonical form, part of speech, language code, subject area and, in the case of homonyms, a semantic reading. Entries represent independent semantic units (e.g. bank/river and bank/economy are two different entries). In addition to these obligatory key data, several groups of optional attributes can be used:

a) a detailed monolingual description (e.g. grammatical gender);

b) cross-reference information, indicating related entries in the language of the entry itself (e.g. abbreviation);

c) transfer information, indicating entries in languages different from the language of the entry itself that may serve as translations if certain conditions hold.

With the conventions that “?” means optional, “*” means zero or more and “+” means one or more, such a structure is illustrated in the following figure:

monolingual section
    key description
        canonical form
        language
        part of speech
        subject field
        semantic reading ?
    monolingual description ?
        monolingual administration ?
        monolingual morphology ?
        monolingual syntax ?
        monolingual semantics ?
    general description ?
cross reference *
    key descriptor
    (link type, general data, data category) +
transfer *
    key descriptor
    transfer restriction
        contextual expression
        test expression
        action *

B.6.3.2 Application

The most important characteristic of OLIF is that a unique key is mandatory, in order to ease interoperability. The structure attached to the key is largely left underspecified. OLIF is targeted at European languages. Inflectional paradigms are explicitly given for five languages (i.e. fra, eng, ger, por, spa). OLIF comprises recommendations for defining the canonical form of simple words and multi-word expressions on a per-language basis, with six languages covered [29]. OLIF has not been designed with Semitic and Asian languages in mind [30]. OLIF is well suited to localization processes and simple translations.

B.6.3.2.1 Data within OLIF

Briefkurs de noun gac-fi b brief-kurs cmp FISHERF ver sapterm sap brief:kurs like Tisch kurs m cnt meas HANSENPOU 1999-28-01 online online-A bank selling rate en noun gac-fi b full


B.6.3.2.2 Migration into LMF

OLIF entries fit into the LMF structure. The key description splits into three LMF classes: Form, Lexical Entry and Sense. For the subject field, a set of 37 categories (taken from Eurodicautom) is predefined in OLIF. In LMF, the corresponding values must be taken from the ISO 12620 DCR. So, prior to import, 37 Semantic Features must be created in order to be connected to the senses. OLIF also allows a value outside the predefined set; in this case, the value must be managed on the fly.

Import of an OLIF entry implies the following operations:

within OLIF        type of operation                                        concerned LMF class
canonical form     creation of an instance                                  Form
language           set a data category                                      Lexical Entry
part of speech     set a data category                                      Lexical Entry
subject field      creation of an instance and connect a semantic feature   Sense
semantic reading   set the label                                            Sense

Of course, this kind of insertion tends to produce small atomic instances, and LMF does not impose any particular grouping. So, in a second phase and in order to minimize the number of instances, the user can group together the Lexical Entries and Forms that share the same values for form, language and part of speech.

The OLIF transfer descriptors imply the creation of a Sense Axis within the LMF context.

The other values do not seem to lend themselves to generic automatic processing (i.e. processing that works whatever the OLIF source). But if a coherent strategy has been respected within the OLIF data, automatic processing can be applied.
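The import of the key data and the grouping step described in this subsection can be sketched as follows. This is a minimal sketch under assumptions: the dictionary keys, the class layout and the subject field value `general` are hypothetical (only `gac-fi` appears in the OLIF sample), and the two phases are merged into a single pass for brevity.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Sense:
    label: Optional[str] = None                            # from OLIF "semantic reading"
    semantic_features: list = field(default_factory=list)  # from OLIF "subject field"


@dataclass
class LexicalEntry:
    lemmatised_form: str   # stands in for the Form instance
    language: str          # data category set on the Lexical Entry
    part_of_speech: str    # data category set on the Lexical Entry
    senses: list = field(default_factory=list)


def import_olif(keys):
    """Create one Sense per OLIF key; share entries with equal form, language and POS."""
    entries = {}
    for key in keys:
        group = (key["canonical_form"], key["language"], key["part_of_speech"])
        entry = entries.get(group)
        if entry is None:
            entry = LexicalEntry(*group)
            entries[group] = entry
        entry.senses.append(
            Sense(label=key.get("semantic_reading"),
                  semantic_features=[key["subject_field"]])
        )
    return list(entries.values())


# Two OLIF entries for the homonym "bank"; the readings follow the bank/river
# and bank/economy example, the subject field "general" is invented.
olif_keys = [
    {"canonical_form": "bank", "language": "en", "part_of_speech": "noun",
     "subject_field": "general", "semantic_reading": "river"},
    {"canonical_form": "bank", "language": "en", "part_of_speech": "noun",
     "subject_field": "gac-fi", "semantic_reading": "economy"},
]
lmf_entries = import_olif(olif_keys)
print(len(lmf_entries), [s.label for s in lmf_entries[0].senses])
```

Two OLIF entries that differ only in subject field and semantic reading end up as two senses of a single lexical entry, which is the grouping the text suggests.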

B.6.3.3 Data within LMF



B.6.4 LMF and CLIPS

B.6.4.1 Presentation

“Corpora e Lessici dell’Italiano Parlato e Scritto” (CLIPS) was a three-year Italian national project headed by the “Consiglio Nazionale delle Ricerche, Istituto di Linguistica Computazionale” (CNR-ILC) that started in 2000 (www.ilc.cnr.it/clips/CLIPS_ENGLISH.htm). One of its main objectives was to create a wide-coverage and multipurpose computational semantic lexical database for Italian, by extending the PAROLE-SIMPLE lexicon, which shares a common conceptual model, representation language and lexicon building methodology with eleven other European lexica. The underlying theoretical model is grounded on the EAGLES project recommendations and, at the semantic level, it implements and extends major aspects of Generative Lexicon (GL) theory; nevertheless, the lexicon is not strictly theory-dependent. The model enables a very fine-grained description, but also allows a shallower one, in so far as the information provided meets the model requirements.

To date, in 2005, the lexicon is the largest Italian computational lexical resource. It consists of 53 000 lemmas encoded at the morphological and phonological levels (for a total of about 390 000 word-forms), 51 000 lemmas encoded at the syntactic level, and 57 000 semantically encoded word senses. Conformity of the data to the model is ensured by an XML DTD, whereas internal formal validation is performed by an XML parser.

B.6.4.2 Data within CLIPS

As in EAGLES, a word is a chain of small elements starting from morphology, crossing syntax and ending in semantics. The chain begins with a simple morphological unit (i.e. a MuS) and continues with a graphical morphological unit (i.e. a Gmu), a syntactic unit (i.e. a SynU), an intermediate object between syntax and semantics called a CorrespSynUSemU, a semantic unit (i.e. a SemU) and a predicate. CLIPS entries are not easy to present because many objects are involved. The structure is not a tree but a graph of interconnected objects. In the following entries, only some specific characteristics are presented, and the cut branches of the network are signaled with an XML comment.

Two features are presented here:

1. The linkage between syntax and semantics for the entry “costruire” (“to build” in English).

2. The semantic derivation around “costruire”. One entry is the verb “costruire” and the second entry is the noun “costruzione”. These two entries are bound at the semantic level by two sorts of links: a) they share the same predicate; b) there is a relation stating that “costruzione” is the resulting state of “costruire”.


costruire

<!-- The verb has three meanings but only the following is presented -->

<!-- Subcategorization frame -->

<!-- Surface syntactic realizations (not expanded) -->


example=”soggetto di pensare, fare, sapere" comment="Usually 'translate' as SUBJECTS in surface. They can occur in the following contexts: Maria fa, sa, ecc. compare with non-ProtoAgent subjects"/> costruzione

The XML data is easier to read in the following UML diagram:


: MuS : SynU : CorrespSynUSemU : Correspondance gramcat = V

: SimpleCorrespArgPos : Gmu spelling = costruire : WayToPosition : Construction : SimpleCorrespArgPos : Description : WayToPosition : Self : InstantiatedPositionC : InstantiatedPositionC

: SyntagmaT : PositionC : PositionC : Argument featurel = TAUXavere

: PredicativeRepresentation : Predicate typeoflink = Master : Argument type = LEXICAL : SemU id = USemD585costruire : PredicativeRepresentation typeoflink = VerbNominalization

: RWeightVamSemU

: RSemU id = SRResultingState : SemanticRole id = RoleProtoAgent

: InformArg : SemU weightvalsemfeaturel = HUMAN id = USem4174costruzione : SemanticRole : MuS : SynU : CorrespSynUSemU id = RoleProtoPatient gramcat = N : Correspondance : InformArg : Gmu weightvalsemfeaturel = BuildingPROT : SimpleCorrespArgPos : SimpleCorrespArgPos spelling = costruzione : WayToPosition : WayToPosition

B.6.4.3 Migration into LMF

The structure of the LMF NLP extension is very similar to EAGLES. There are three main differences:

• The class names are taken from “ordinary” usage instead of being coined. For instance, in LMF the name for SemU is Sense. The motivation for this choice is to make the model easier to understand.

• The structure is a little lighter because several small intermediate classes have been removed. In EAGLES, these classes were of little use but were mandatory and, as a consequence, they made the whole model very complex. CorrespSynUSemU is one of these classes.

• There is a little more flexibility, in order to lighten descriptions. For instance, it is possible to describe a sense without recording a syntactic description when there is nothing to be said in syntax.


B.6.4.4 Data within LMF

: LexicalEntry : SyntacticBehavior : Sense partOfSpeech = verb id = USemD585costruire label

: Form lemmatisedForm = costruire

: PredicativeRepresentation type = Master : SyntacticFrame : Argument semanticRole = RoleProtoAgent : Position : Self restriction = HUMAN type : SyntacticFeature : SyntacticActant

: Relation : Predicate type = SRResultingState type = LEXICAL : Position : Argument view semanticRole = RoleProtoPatient name : SyntacticActant restriction = BuildingPROT type

: PredicativeRepresentation type = VerbNominalization : LexicalEntry partOfSpeech = noun : Sense id = USem4174costruzione : Form label lemmatisedForm = costruzione

B.6.5 LMF and LC-Star

B.6.5.1 Presentation

LC-STAR is an IST project focusing on creating language resources for speech-to-speech translation components and thus improving human-to-human and man-machine communication in multilingual environments (www.lc-star.com). The objectives are flexible vocabulary speech recognition, high-quality text-to-speech synthesis and speech-centered translation into selected languages. The components are designed to be embedded in mobile appliances and network servers. Lexicons for 13 languages have been created: Catalan, Finnish, German, Greek, Hebrew, Italian, Mandarin Chinese, Russian, Slovenian, Spanish, Arabic, Turkish and US-English. The consortium includes companies such as Nokia, Siemens and IBM.

B.6.5.2 Data within LC-STAR

For each entry, orthography is defined as the correct way of writing a given inflected form. When more than one spelling is acceptable, the most common spelling for an inflected form is taken. Orthography is the key to all of the phonetic, morphological and POS information for that entry. This structuring is coded as entry groups. Therefore, words that can be associated with multiple POS have a unique entry. Multi-token entries are allowed, with blanks replaced by underscores (e.g. New_York). Phonetic transcription is coded for each entry in the SAMPA phonetic alphabet with stress marker, syllable boundary marker and tone markers. Multiple pronunciations are coded if they are in common use. For each group, the lemmatised form is specified with the POS(s) it belongs to. For each POS, its attributes (e.g. number, person …) are described where applicable, depending on the language. Additional rules are given for each of the 13 described languages, covering orthography, syllabification, lemmatization and the pairing of POS with morphological features.

Two statements can be made:

• Data is organized toward recognition: each entry goes from the inflected form to the lemmatised form.

• For highly inflected languages like German or Hungarian, such a format produces a great deal of duplicated information. The lexicon is huge and hard to manage. For these languages, this format should be considered a delivery format rather than a management format. Inflectional paradigms are better suited to management.

An example in Arabic is as follows:

<"xml:lang="ar "اﻟﺤﻤﺮاء"=ENTRYGROUP orthography> ﺣﻤﺮاء a l – " X\ a m r a: ? ﺣﻤﺮاء a l – " X\ a m r a: ?


An example in English is as follows:

read " r i: d read " r e d read " r e d

Within LC-Star, what is called “orthography” is the inflected form and not the lemmatised form. So this entry is valid for “read” and not for “reads”. Another entry must describe the inflected form “reads” as follows.

read " r i: d z

B.6.5.3 Migration into LMF

Data must be mapped. In LC-Star, the headword is the inflected form and everything is organized around it. In the NLP extension, the headword is the lemmatised form.
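The change of headword can be sketched as a regrouping of records by lemmatised form. This is a minimal sketch under assumptions: the record layout, the field names and the POS label `verb` are hypothetical (the original LC-Star markup is not reproduced here); only the written forms and SAMPA transcriptions come from the “read”/“reads” examples.

```python
from collections import defaultdict

# One record per (inflected form, analysis), as in an LC-Star entry group.
lcstar_records = [
    {"orthography": "read", "lemma": "read", "pos": "verb", "phonetic": '" r i: d'},
    {"orthography": "read", "lemma": "read", "pos": "verb", "phonetic": '" r e d'},
    {"orthography": "reads", "lemma": "read", "pos": "verb", "phonetic": '" r i: d z'},
]


def regroup_by_lemma(records):
    """Turn inflected-form-centred records into lemma-centred entries."""
    entries = defaultdict(list)
    for rec in records:
        inflected = {"writtenForm": rec["orthography"],
                     "spokenForm": rec["phonetic"]}
        forms = entries[(rec["lemma"], rec["pos"])]
        if inflected not in forms:   # avoid duplicated inflected forms
            forms.append(inflected)
    return dict(entries)


lmf_like = regroup_by_lemma(lcstar_records)
for (lemma, pos), forms in lmf_like.items():
    print(lemma, pos, [f["writtenForm"] for f in forms])
```

Morphological features (person, tense, mood) would be attached to each inflected form in the same pass; they are left out here to keep the sketch short.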

B.6.5.4 Data within LMF

These two LC-Star entries give the following structuring:


: LexicalEntry (language = eng, partOfSpeech = verb)
: Form (lemmatisedForm = read)

Association roles shown: expansion, variation

: InflectedForm (writtenForm = read, spokenForm = " r e d)
: InflectedForm (writtenForm = read, spokenForm = " r e d)
: InflectedForm (writtenForm = read, spokenForm = " r i: d)
: InflectedForm (writtenForm = reads, spokenForm = " r i: d z)

: DataCategory12620 (att = person, val = not_3)
: DataCategory12620 (att = tense, val = present)
: DataCategory12620 (att = tense, val = past)
: DataCategory12620 (att = mood, val = participle)
: DataCategory12620 (att = person, val = 3)
: DataCategory12620 (att = person, val = invariant)
: DataCategory12620 (att = mood, val = finite)

Another option could be to automatically compute inflectional paradigms during import.

B.6.6 LMF and WordNet

B.6.6.1 Presentation

WordNet is an online English lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory [33]. Because the data can be downloaded free of charge, WordNet is one of the most popular lexical databases. Compared to WordNet-1.X, version 2.0 offers significant additions such as morphology and derivation. WordNet is developed at Princeton University (http://wordnet.princeton.edu).

Information in WordNet is organized around logical groupings called synsets. English nouns, verbs, adjectives and adverbs are described. Each synset consists of a list of synonymous words or collocations (e.g. “fountain pen” or “take in”) and pointers that describe the relations between this synset and other synsets. A word or collocation may appear in more than one synset, and in more than one part of speech. The words in a synset are logically grouped such that they are interchangeable in some context. Two kinds of relations are described: lexical and semantic ones. Lexical relations hold between word forms; semantic relations hold between word meanings. The semantic relations include hypernymy/hyponymy, antonymy, entailment and meronymy/holonymy. They link the synonym sets in an ontological structure. But, contrary to what is often said too simplistically, WordNet is not an ontology: WordNet contains an ontology of English meanings.

WordNet has two file formats: the lexicographer format (specified in the WNINPUT documentation) and the data file format (specified in the WNDB documentation). The first one is intended for human editing and is governed by some default behavior in order to avoid painful manual operations [34]. These source files are then processed by a data compiler called the “grinder”, which expands them into the data file format. The data file format is an ASCII (not XML) format that is readable but not editable. It reflects the real structure of WordNet more accurately than the lexicographer format.

B.6.6.2 Data within WordNet

Three synsets taken from WordNet-2.0 will be presented: “oak” in the sense of the tree, “oak” in the sense of the wooden material and “tree” in the sense of the plant. All forms are linked to one or several synsets. The synset oak/tree holds a hyponymy relation with the synset “tree”. In other words: “an oak tree is a tree”.

Raw data is as follows:

11520081 20 n 02 oak 0 oak_tree 0 029 @ 12352501 n 0000 #m 11519931 …

11520753 20 n 01 oak 2 004 @ 14243413 n 0000 #s 11520081 n 0000 …

12352501 20 n 01 tree 0 189 @ 12351578 n 0000 #m 07926765 n 0000 …
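The leading fields of such records can be read with a small parser. The sketch below follows the field order given in the WNDB documentation (synset offset, lexicographer file number, synset type, hexadecimal word count, word and lex_id pairs, decimal pointer count, then pointers of four fields each). The function name and the returned dictionary layout are assumptions, and because the records above are truncated with “…”, only pointers that are completely present are read.

```python
def parse_wndb_prefix(line):
    """Parse the leading fields of a WordNet data-file record (hypothetical helper)."""
    fields = line.split()
    offset, lex_filenum, ss_type = fields[0], fields[1], fields[2]
    w_cnt = int(fields[3], 16)            # word count: two-digit hexadecimal
    pos = 4
    words = []
    for _ in range(w_cnt):
        word = fields[pos].replace("_", " ")   # underscores encode blanks
        lex_id = int(fields[pos + 1], 16)      # one-digit hexadecimal
        words.append((word, lex_id))
        pos += 2
    p_cnt = int(fields[pos])              # pointer count: three-digit decimal
    pos += 1
    pointers = []
    # Each pointer is: symbol, target offset, target POS, source/target.
    while len(pointers) < p_cnt and pos + 3 < len(fields):
        pointers.append(tuple(fields[pos:pos + 4]))
        pos += 4
    return {"offset": offset, "lex_filenum": lex_filenum, "ss_type": ss_type,
            "words": words, "p_cnt": p_cnt, "pointers": pointers}


oak_tree = parse_wndb_prefix(
    "11520081 20 n 02 oak 0 oak_tree 0 029 @ 12352501 n 0000 #m 11519931 …"
)
print(oak_tree["words"], oak_tree["pointers"])
```

Offsets are kept as strings so that leading zeros, as in 07926765, are preserved. The “@” pointer of the first record targets offset 12352501, the “tree” synset.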

In order to ease reading, the structure is illustrated as follows:

: WNForm (lemmatisedForm = oak, partOfSpeech = n)
: WNForm (lemmatisedForm = oak tree, partOfSpeech = n)
: WNForm (lemmatisedForm = tree, partOfSpeech = n)

: WNSynSet (offset = 11520081, textualDefinition = "a deciduous tree of the genus Quercus; has acorn ...")
: WNSynSet (offset = 11520753, textualDefinition = "the hard durable wood of any oak")
: WNSynSet (offset = 12352501, textualDefinition = "a tall perennial wood plant ...")

: WNSemanticRelation (type = hyponymy)

B.6.6.3 Data within LMF

The data fits within LMF as illustrated:


: LexicalEntry (partOfSpeech = noun)
: LexicalEntry (partOfSpeech = noun)
: LexicalEntry (partOfSpeech = noun)

: Form (lemmatisedForm = oak)
: Form (lemmatisedForm = oak tree)
: Form (lemmatisedForm = tree)

: Sense
: Sense
: Sense

: SynSet (id = 11520753)
: SynSet (id = 11520081)
: SynSet (id = 12352501)

: SynSetRelation (type = hyponymy)

: Definition (text = "the hard durable wood of any oak", language = eng, view)
: Definition (text = "a deciduous tree of the genus Quercus; has acorn ...", language = eng, view)
: Definition (text = "a tall perennial wood plant ...", language = eng, view)

B.6.7 LMF and FrameNet

B.6.7.1 Presentation

The Berkeley FrameNet is an on-line lexical resource for English based on frame semantics and supported by corpus evidence [35]. The aim is to document the range of semantic and syntactic combinatory valences of each word in each of its senses, through manual annotation of example sentences and automatic capture and organization of the annotation results. At the time of the Fall 2003 release (version 1.1), the database included 7,500 lexical units, for which there were about 130,000 annotated sentences.

A lexical unit is a pairing of a word with a meaning. Typically, each sense of a polysemous word belongs to a different semantic frame, a script-like structure of inferences that characterizes a type of situation, object or event. In the case of predicates or governors, each annotation takes one word in the sentence as its target, and the phrases that fill in information about a given instance of the frame are identified with frame elements (FEs). These FEs are classified in terms of how central they are to a particular frame, distinguishing four levels: core, peripheral, extra-thematic and core-unexpressed. A core FE is one that instantiates a conceptually necessary particular or prop of a frame, while making the frame unique and different from other frames. Peripheral FEs mark notions such as Time or Place. Extra-thematic FEs situate an event against a backdrop of another event, as in iteration: “Lee called the office [again]”. Core-unexpressed is used to avoid blind inheritance.

A frame semantic description of a predicative word derives from such annotations, identifies the frames which underlie a given meaning and specifies the ways in which FEs are realized in structures headed by the word.

A predicate is a constellation of triples that make up the FE realization for each annotated sentence, each triple consisting of an FE (say, PATIENT), a grammatical function (say, OBJECT) and a phrase type (say, NP). Valence descriptions of predicating words are generalizations over such structures.

The FrameNet database consists of:

• Lexical units for individual word senses

• Descriptions of frames and FE with links from lexical units

• Annotation subcorpora.

B.6.7.2 Data within FrameNet

Morphological descriptions are very simple. They are organized in a two-level structure: the lexical unit level, with a lemmatised form and a definition, and the lexeme level, with component descriptions when the entry is a MWE. In this case, each component is described by specifying whether it is a head and, with the “breakBefore” attribute, whether the MWE permits an insertion.

For instance, compare:

They called on the man.

*They called the man on.

with:

They called up the man.

They called the man up.

Thus, the “up” in “call up” is marked BreakBefore = 'Y', while the “on” in “call on” is marked BreakBefore = 'N'.

To ease reading, XML identifiers, inherited tags and book-keeping attributes (like creation date) have been removed.

<!-- The frame inherits from intentionally_act and is a subframe of Activity -->

An Agent finishes an Activity, which can no longer logically continue.

This FE identifies the Depictive phrase describing an actor or an undergoer of an action.

This FE identifies the Result of finished Activity.

This FE identifies the last Subevent of the finished Activity. Ex: The peace march ENDED ["Sub" with a prayer].

This FE identifies the Agent who has finished an Activity.

This FE identifies the Activity that the Agent has finished.


FN: bring something to a satisfactory conclusion FN: finish FN: bringing or coming to an end

The FrameNet class model is illustrated by the following UML class diagram:

LexicalUnit: -name, -pos, -definition

Frame: -name, -definition

FrameToFrameRelation: -type

Lexeme: -name, -pos, -headWord, -breakBefore

FrameElement: -name, -definition, -type

Multiplicities shown on the associations: 1, 0..* and 1..*

The instances shown in the example are illustrated by the following UML object diagram:


: LexicalUnit
: Lexeme (name = tie, headWord = true, pos = V, breakBefore = false)
: Lexeme (name = up, headWord = false, pos = ADV, breakBefore = true)

: LexicalUnit
: Lexeme (name = wrap, headWord = true, pos = V, breakBefore = false)
: Lexeme (name = up, headWord = false, pos = PREP, breakBefore = true)

: LexicalUnit
: Lexeme (name = finish, headWord = false, pos = V, breakBefore = false)

: Frame (name = Activity_finish, definition)
: FrameToFrameRelation (type = inherit from)
: Frame (name = Intentionally_act, definition)
: FrameToFrameRelation (type = subframe of)
: Frame (name = Activity, definition)

: FrameElement (name = Agent, definition, type = Core)
: FrameElement (name = Depictive, definition, type = Extra-Thematic)
: FrameElement (name = Result, definition, type = Extra-Thematic)
: FrameElement (name = Activity, definition, type = Core)
: FrameElement (name = Subevent, definition, type = Extra-Thematic)

B.6.7.3 Migration into LMF

The mapping of FrameNet onto other models has already been studied, for instance in [36] for MILE.

In FrameNet, the lexical unit is a complex notion that comprises the lemmatised form, the part of speech, the components for a MWE and a semantic decomposition by means of a textual definition. Thus, within the LMF NLP extension, the FrameNet lexical unit is a sense, and this sense is connected to a lexical entry.

The notion of Frame within FrameNet corresponds to a Predicate within LMF. The FrameToFrameRelation within FrameNet is a PredicateRelation within LMF.

The notion of FrameElement within FrameNet is an Argument within LMF. The attribute “name” corresponds to the attribute “semantic role”.
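The correspondences stated in this subsection (Frame to Predicate, FrameToFrameRelation to PredicateRelation, FrameElement to Argument, with “name” becoming the semantic role) can be sketched as a small conversion. The dictionary layout of the input and the way relations are attached to the Predicate are assumptions made for the illustration; the frame data itself is taken from the Activity_finish example.

```python
from dataclasses import dataclass, field


@dataclass
class Argument:
    semantic_role: str   # from FrameElement "name"
    type: str            # from FrameElement "type"


@dataclass
class PredicateRelation:
    type: str            # from FrameToFrameRelation "type"
    target: str          # name of the related Predicate (assumed layout)


@dataclass
class Predicate:
    name: str            # from Frame "name"
    arguments: list = field(default_factory=list)
    relations: list = field(default_factory=list)


def frame_to_predicate(frame):
    """Map one FrameNet frame (given as a dict) to an LMF-style Predicate."""
    return Predicate(
        name=frame["name"],
        arguments=[Argument(fe["name"], fe["type"])
                   for fe in frame["frame_elements"]],
        relations=[PredicateRelation(rel["type"], rel["target"])
                   for rel in frame["relations"]],
    )


activity_finish = {
    "name": "Activity_finish",
    "frame_elements": [
        {"name": "Agent", "type": "Core"},
        {"name": "Activity", "type": "Core"},
        {"name": "Depictive", "type": "Extra-Thematic"},
        {"name": "Result", "type": "Extra-Thematic"},
        {"name": "Subevent", "type": "Extra-Thematic"},
    ],
    "relations": [
        {"type": "inherit from", "target": "Intentionally_act"},
        {"type": "subframe of", "target": "Activity"},
    ],
}
predicate = frame_to_predicate(activity_finish)
print(predicate.name, [a.semantic_role for a in predicate.arguments])
```

Lexical units are not covered by this sketch; following the text, each of them would become a Sense connected to a lexical entry and to the Predicate.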

B.6.7.4 Data within LMF

The example is illustrated by the following diagram, in which only one of the three lexical units is drawn, due to lack of space.


: LexicalEntry (partOfSpeech = verb)
: Form (lemmatisedForm = tie up)
: InflectionalParadigm
: MorphologicalFeatureCombiner
: ListOfComponents
: Composer (head = yes, rank = 0)
: Form (lemmatisedForm = tie)
: Composer (head = no, rank = 1)
: Form (lemmatisedForm = up)

: Sense
: PredicativeRepresentation (type = Master)

: Predicate (name = Activity_finish, type, view)
: PredicateRelation (type = inherit from)
: Predicate (name = Intentionally_act, type, view)
: PredicateRelation (type = subframe of)
: Predicate (name = Activity, type, view)

: Argument (semanticRole = Depictive, type = Extra-Thematic, restriction)
: Argument (semanticRole = Agent, type = Core, restriction)
: Argument (semanticRole = Result, type = Extra-Thematic, restriction)
: Argument (semanticRole = Activity, type = Core, restriction)
: Argument (semanticRole = Subevent, type = Extra-Thematic, restriction)

B.6.8 LMF and BDéf

B.6.8.1 Presentation

BDéf is a formal database of lexicographic definitions fully derived from the four volumes of the Explanatory Combinatorial Dictionary of Contemporary French (ECD) [22]. The BDéf has two aims. Firstly, it will make a representative subset of formal lexicographic definitions available for research in computational semantics. Secondly, building the BDéf makes it possible to conduct much-needed research on the internal structuring of lexical meanings and on how it should be modelled [18].

B.6.8.2 Data within BDéf

The following French example presents the sense of “défierI.1” as described in [18]. This sense is taken from DEC-4 and corresponds to “Marcel a défié ce pédant en duel à l’épée”. Each definition is a structured set of elementary propositions. An elementary proposition is the smallest unit to be managed. These propositions are not based on smaller semantic primitives, but instead contain a formalized text. BDéf is chosen in order to show how such textual definitions could fit inside the LMF classes Sense, Definition and Proposition.

Propositional form
    X défier Y en W à Z

Central component
    1 : X communiquer à Y que *2
    2 : X vouloir *3
    3 : Y prendre_part à Y avec Z

Specific differences
    /*objective*/
    4 : *1 dans_le_but *5
    5 : X punir Y pour *9
    /*means*/
    6 : façon de X de *5 être *7
    7.1 : X blesser#il Y
    7.2 : X tuer Y
    /*situation*/
    8 : X croire *9
    9.1 : Y insulter X
    9.2 : Y porter_atteinte à honneur de X
    10 : *9 passé

Variable typing
    X : individu
    Y : individu
    Z : arme
    W : combat

Relation between actants
    W[X,Y]
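The cross-references between elementary propositions (the “*n” marks) can be extracted mechanically. The sketch below is only an illustration: the function names are hypothetical, and it assumes that a reference such as “*9” designates the proposition named 9 or, when that proposition is split, its alternatives 9.1 and 9.2. That reading is an interpretation of the example, not something stated by BDéf itself.

```python
import re

# Numbered propositions of the definition, copied from the example.
propositions = {
    "1": "X communiquer à Y que *2",
    "2": "X vouloir *3",
    "3": "Y prendre_part à Y avec Z",
    "4": "*1 dans_le_but *5",
    "5": "X punir Y pour *9",
    "6": "façon de X de *5 être *7",
    "7.1": "X blesser#il Y",
    "7.2": "X tuer Y",
    "8": "X croire *9",
    "9.1": "Y insulter X",
    "9.2": "Y porter_atteinte à honneur de X",
    "10": "*9 passé",
}

REFERENCE = re.compile(r"\*(\d+(?:\.\d+)?)")


def references(text):
    """Return the proposition numbers referenced in a proposition text."""
    return REFERENCE.findall(text)


def resolve(ref, props):
    """Return the proposition names designated by a reference (assumed rule)."""
    return [name for name in props
            if name == ref or name.startswith(ref + ".")]


def targets(name, props):
    """Names of all propositions that the given proposition refers to."""
    found = []
    for ref in references(props[name]):
        found.extend(resolve(ref, props))
    return found


print(targets("6", propositions))
```

Such a resolution step could be used when importing the definition, so that each Proposition instance can point to the instances it embeds.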


B.6.8.3 Data within LMF

: Sense (label = défier.1)

: Definition (name = Forme propositionnelle)
: Proposition (name, text = "X défier Y en W à Z", type)

: Definition (name = Composante centrale)
: Proposition (name = 1, text = "X communiquer à Y que *2", type)
: Proposition (name = 2, text = "X vouloir *3", type)
: Proposition (name = 3, text = "X prendre_part à W avec Z", type)

: Definition (name = Différences spécifiques)
: Proposition (name = 4, text = "*1 dans_le_but *5", type = but)
: Proposition (name = 5, text = "X punir Y pour *9", type = but)
: Proposition (name = 6, text = "façon de X de *5 être *7", type = moyen)
: Proposition (name = 7, text = "X blesser#il Y", type = moyen)
: Proposition (name = 7.2, text = "X tuer Y", type = moyen)
: Proposition (name = 8, text = "X croire *9", type = situation)
: Proposition (name = 9.1, text = "Y insulter X", type = situation)
: Proposition (name = 9.2, text = "Y porter_atteinte à l'honneur de X", type = situation)
: Proposition (name = 10, text = "9 passé", type = situation)

: Definition (name = Relations sémantiques entre variables actancielles)
: Proposition (name, text = "W[X,Y]", type)

: Definition (name = Relations entre actants)
: Proposition (name = W, text = "combat", type = restriction)
: Proposition (name = X, text = "individu", type = restriction)
: Proposition (name = Y, text = "individu", type = restriction)
: Proposition (name = Z, text = "arme", type = restriction)

Figure B-10: Défier


B.7 Open questions

These problems are not resolved in the current document.

B.7.1 Have a core model able to represent multilingual databases

The monolingual vision presented in chapter 5 reflects the way monolingual lexicons have been managed and does not match the current situation of multilingual lexicons. LexicalDB is presented as a singleton (in the sense of the design pattern described by Gamma et al.). It is not possible, for instance, to attach both meta-data specific to one monolingual lexicon (like a copyright mention) and meta-data valid for the whole database. We must propose a class for the whole database (as Global Information does right now) and a different class for the monolingual lexicon. And we must propose something directly understandable and usable, not a mapping.

B.7.2 Improve attribute specification in NLP extension

The classes in the NLP extension may have attributes with respect to the two following criteria: mandatory/optional and localized/non-localized.

This gives three sorts of attributes:

a) a mandatory attribute. The number of such attributes is very small. The /part of speech/ of a lexical entry is one of them;

b) a localized attribute. This kind of attribute is to be used in a specific class but is optional. For instance, /syllabification/ is to be used on a Form, not on a Sense or a Lexical database;

c) a non-localized attribute. For instance, /modification date/ is an attribute that could be used everywhere.

We must improve the specification with regard to these criteria. For each class, we must specify the full set of type-a and type-b attributes. How to do this is still open.

Perhaps by means of generalization, if we state that all classes inherit from a common class that can be decorated with non-localized attributes.
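One way to picture the generalization idea is a common base class that carries the non-localized attributes, while each subclass declares its own mandatory and localized ones. This is only a sketch of the open proposal, not a decision of this document: the class `LmfObject`, the set names and the validation rule are all hypothetical.

```python
class LmfObject:
    """Hypothetical common ancestor of all LMF classes."""
    NON_LOCALIZED = frozenset({"modificationDate"})  # usable everywhere
    MANDATORY = frozenset()                          # type-a attributes
    LOCALIZED = frozenset()                          # type-b attributes

    def __init__(self, **attributes):
        allowed = self.MANDATORY | self.LOCALIZED | self.NON_LOCALIZED
        unknown = set(attributes) - allowed
        missing = self.MANDATORY - set(attributes)
        if unknown:
            raise ValueError(f"not allowed on {type(self).__name__}: {sorted(unknown)}")
        if missing:
            raise ValueError(f"missing on {type(self).__name__}: {sorted(missing)}")
        self.attributes = attributes


class LexicalEntry(LmfObject):
    MANDATORY = frozenset({"partOfSpeech"})


class Form(LmfObject):
    LOCALIZED = frozenset({"syllabification"})


class Sense(LmfObject):
    pass


entry = LexicalEntry(partOfSpeech="noun", modificationDate="2005-03-19")
form = Form(syllabification="bal-co-ny")
print(entry.attributes, form.attributes)
```

With this layout, a /syllabification/ on a Sense or a Lexical Entry without /part of speech/ is rejected, while /modification date/ is accepted on every class through inheritance.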

B.7.3 ListOfComponents and ListOfStems

Is it worthwhile to keep these two classes? They do not hold any attribute. We could instead have a direct association of cardinality one-to-many to the components (respectively to the stems). Pro: it is easier to explain. Con: it is heavier.

B.7.4 SynSet and Senses

As in MILE, it is odd to have both synsets and senses in the same semantic layer. Each of these classes is accompanied by its own relation class: SynsetRelation for Synset and Relation for Sense.

Perhaps, if we state that a sense can have multiple lexical entries, we can merge the notions of sense and synset?

B.7.5 Collocation (comments from LexiquePourLeTAL network)

Show how to represent preference in collocation. For instance, in French, the adjective “aîné” is to be used with “frère”, “soeur”.


B.7.6 Surface/deep morphology (comments from LexiquePourLeTAL network)

For some words in French, and some in Spanish and German, there is a problem concerning the agglutination of two POS when the result is not really a POS.

“au” is the contraction of a Preposition and a Determiner.

In surface morphology, it is a single word.

In deep morphology, it is a two-word expression: “à” + “le”.

We must state as a general rule, or at least as a recommendation, which of the following applies:

Strategy-1: we allow for a multiple value on Lexical Entry

Strategy-2: we define combined data categories in the DCR

Note that this kind of word is particularly tricky because, very often, for one given lemmatised form, the word is a MWE for a certain number of inflected forms (“à” + “la”, “à” + “les”) and a single word for certain other forms (“au”).

B.7.7 Syntax (comments from LexiquePourLeTAL network)

The syntactic section needs an explanation and an example of how to represent a MWE in syntax, with an example like “la moutarde me monte au nez”, which varies regularly in syntax.

Add “subcategorization frame” (common in the US) as a synonym of “valency” (common in Europe) in the chapter dedicated to definitions. It is strange to define syntactic terms without using such a commonly used term.

B.7.8 Multilingual notations

What is missing is the possibility of representing something like “transfer only valid for medicine”. In other words, we need a way to express constraints.

The transfer axis should have an attribute that specifies whether the transfer is bidirectional or unidirectional.
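The two requests above (a constraint and a direction attribute) can be sketched in Python as follows. This is an illustration under stated assumptions: the class TransferAxis, its fields, the method applies and the sense identifiers fr:sense_A and en:sense_B are all hypothetical placeholders, not names defined by this document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TransferAxis:
    """Hypothetical transfer link between two senses of different languages."""
    source: str
    target: str
    bidirectional: bool = True    # False: valid from source to target only
    domain: Optional[str] = None  # None: no domain constraint

    def applies(self, src: str, tgt: str, domain: Optional[str] = None) -> bool:
        """Check whether this transfer may be used from src to tgt in a domain."""
        if self.domain is not None and domain != self.domain:
            return False  # the constraint is not satisfied
        if (src, tgt) == (self.source, self.target):
            return True
        # The reverse direction is allowed only for bidirectional transfers.
        return self.bidirectional and (src, tgt) == (self.target, self.source)


# "Transfer only valid for medicine", usable in one direction only.
axis = TransferAxis("fr:sense_A", "en:sense_B",
                    bidirectional=False, domain="medicine")
```

A single domain string is the simplest form of constraint; a fuller treatment would probably allow arbitrary data categories as conditions.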

B.7.9 Derivation in morphology

Derivation in morphology needs to be covered for lexicons without semantics, possibly with a means of recording rules specific to a given language. This is especially important for languages where derivation is highly productive, such as Arabic.


Bibliography

[1] Isle references.

[2] ISO 9:1995, Information and documentation – Transliteration of Cyrillic characters into Latin characters – Slavic and non-Slavic languages

[3] ISO 233:1984, Documentation – Transliteration of Arabic characters into Latin characters

[4] ISO 233-2:1993, Information and documentation – Transliteration of Arabic characters into Latin characters – Part 2: Arabic language – Simplified transliteration

[5] ISO 233-3:1999, Information and documentation – Transliteration of Arabic characters into Latin characters – Part 3: Persian language – Simplified transliteration (available in English only)

[6] ISO 259:1984, Documentation – Transliteration of Hebrew characters into Latin characters

[7] ISO 259-2:1994, Information and documentation – Transliteration of Hebrew characters into Latin characters – Part 2: Simplified transliteration

[8] ISO 3166-1:1997, Codes for the representation of names of countries and their subdivisions – Part 1: Country codes

[9] ISO 3166-2:1998, Codes for the representation of names of countries and their subdivisions – Part 2: Country subdivision code

[10] ISO 3166-3:1999, Codes for the representation of names of countries and their subdivisions – Part 3: Code for formerly used names of countries

[11] ISO 8879:1986, (SGML) as extended by TC2 (ISO/IEC JTC 1/SC 34 N 029:1998-12-06) to allow for XML

[12] ISO 9984:1996, Information and documentation – Transliteration of Georgian characters into Latin characters

[13] ISO 9985:1996, Information and documentation – Transliteration of Armenian characters into Latin characters

[14] ISO 11940:1998, Information and documentation – Transliteration of Thai

[15] ISO/DIS 11940-2, Information and documentation – Transliteration of Thai characters into Latin characters – Part 2: Simplified transcription of Thai language

[16] TEI references

[17] Mel’cuk I., Clas A., Polguère A. 1995 Introduction à la lexicologie explicative et combinatoire. Duculot. Bruxelles.

[18] Altman J., Polguère A. 2003 La Bdéf : base de définitions dérivée du dictionnaire explicatif et combinatoire. MTT Paris.


[19] Jacobson I., Booch G., Rumbaugh J. 1999 The Unified Software Development Process: UML. Addison-Wesley. Reading.

[20] Calzolari N., Fillmore C., Grishman R., Ide N., Lenci A., MacLeod C., Zampolli A. 2002 Towards best practices for Multiword expressions in Computational lexicons. LREC.

[21] Mangeot M., Sérasset G. 2001 Papillon lexical databases project: monolingual dictionaries and interlingual links. NLPRS Tokyo.

[22] Mel'cuk I. et al., Dictionnaire explicatif et combinatoire du français contemporain: recherches lexico-sémantiques 1, [DEC] Presses de l'Université de Montréal, 1984.

[23] GENELEX consortium : Report on the morphological layer.

[24] Antoni-Lay M.H., Francopoulo G., Zaysser L. 1994 A generic model for reusable lexicons: The GENELEX project, Literary and Linguistic Computing 9(1): 47-54.

[25] Lombard Denys Introduction à l’indonésien Cahier d’Archipel, Paris, 1991

[26] Longman Dictionary of American English, Pearson Education 2002

[27] Rey A. Dictionnaire historique de la langue française, Le Robert 1998.

[28] Lieske C., McCormick S., Thurmair G. The Open Lexicon Interchange Format (OLIF) comes of age. OLIF consortium 2002 www.olif.net

[29] McCormick S. OLIF guidelines for formulating canonical forms. OLIF consortium 2001 www.olif.net

[30] Emerson T. East Asian language support in OLIF-2. OLIF consortium 2001 www.olif.net

[31] Imbs P., Gorcy G. Trésor de la langue française, dictionnaire de la langue du 19ième et du 20ième siècle. CNRS Gallimard 1971-1994

[32] Rumbaugh J., Jacobson I., Booch G. The unified modeling language reference manual, second edition Addison Wesley 2004

[33] Miller G., Beckwith R., Fellbaum C. Introduction to WordNet: an on-line lexical database 1993 http://wordnet.princeton.edu

[34] Beckwith R., Miller A., Tengi R. Design and implementation of the WordNet lexical database and searching software http://wordnet.princeton.edu


[35] Johnson C., Petruck M., Baker C., Ellsworth M., Ruppenhofer J., Fillmore C. FrameNet: Theory and Practice, version 1.1 2003 www.icsi.berkeley.edu/~framenet

[36] Bertagna F., Lenci A., Monachini M., Calzolari N. Content interoperability of lexical resources : open issues and MILE perspectives. LREC-2004 Lisboa.