HG8003 Technologically Speaking: The intersection of language and technology.

Words, and Ontologies

Francis Bond Division of Linguistics and Multilingual Studies http://www3.ntu.edu.sg/home/fcbond/ [email protected]

Lecture 4 Location: LT8

HG8003 (2014) Schedule

Lec. Date Topic 1 01-16 Introduction, Organization: Overview of NLP; Main Issues 2 01-23 Representing Language 3 02-06 Representing Meaning 4 02-13 , Lexicons and Ontologies 5 02-20 Text Mining and Knowledge Acquisition Quiz 6 02-27 Structured Text and the Semantic Web Recess 7 03-13 Citation, Reputation and PageRank 8 03-20 Introduction to MT, Empirical NLP 9 03-27 Analysis, Tagging, Parsing and Generation Quiz 10 Video Statistical and Example-based MT 11 04-03 Transfer and Sense Disambiguation 12 04-10 Review and Conclusions Exam 05-06 17:00

➣ Video week 10

Words, Lexicons and Ontologies 1 Review of Meaning

Words, Lexicons and Ontologies 2 Review of Representing Meaning

➣ Three ways of defining meaning ➢ Attributional (Compositional) ➢ Relational ➢ Distributional

➣ the Syntax-Semantic Interface ➢ Usage ↽⇀ Meaning

Words, Lexicons and Ontologies 3 Attributional Meaning

➣ Give a semantic description of word use in isolation of the categorisation of other lexical items ➢ definitions ➢ decompositional (break down into primitives)

➣ Easy for humans to understand

➣ Hard to decide on sense boundaries (granularity: splitters vs. lumpers)

➣ Definitions are circular (the grounding problem)

➣ Hard to be consistent

Words, Lexicons and Ontologies 4 Relational Meaning

➣ Capture correspondences between lexical items by way of a finite set of pre-defined semantic relations

➣ Methodologies: ➢ lexical relations ➢ constructional relations

➣ Captures many generalizations usefully

➣ Hard to make complete

➣ Leads to large, complex graphs

Words, Lexicons and Ontologies 5 Distributional Meaning

➣ Capture word meanings as collections of contexts in which words appear ➢ n-grams ➢ syntactic relations ➢ sentences ➢ documents

➣ Good for synonymy, not so good for antonymy

➣ Computationally tractable

Words, Lexicons and Ontologies 6 Why are important?

➣ For humans ➢ find meaning of unknown words ➢ find more information about known words ➢ codify knowledge about word usage ()

➣ For machines ➢ store information about words ➢ link between text and knowledge

Words, Lexicons and Ontologies 7 Words

Words, Lexicons and Ontologies 8 Introduction to Words, Lexicons and Ontologies

➣ Design and implementation ➢ Machine Readable Dictionaries ➢ Morphological lexicons ➢ Syntactic lexicons ➢ Semantic lexicons ➢ Ontologies

➣ Construction and Maintenance ➢ Construction from scratch ➢ Boot-strapping from existing resources ➢ Ensuring consistency

Words, Lexicons and Ontologies 9 Machine Readable Dictionaries (MRDs)

➣ Human dictionaries made available on machine ➢ Electronic Dictionaries ➢ Applications ∗ often with automatic word lookup ➢ On-line dictionaries ∗ Sometimes with glosses

Words, Lexicons and Ontologies 10 A typical entry

definition (n) a concise explanation of the meaning of a word or phrase or symbol

: definition

➣ Part of Speech: n (noun)

➣ Definition: ➢ genus: explanation ➢ differentia: concise; of the meaning of a word or phrase or symbol

? Implied: countable (a), regular plural

Words, Lexicons and Ontologies 11 Parts-of-Speech (POS)

➣ Traditional Grammar has eight: Noun, Verb, Adjective, Adverb (open class) Conjunction, Preposition, Pronoun, Interjection (closed class)

➣ In the US, the Penn Treebank POS set is de-facto standard: ➢ http://www.comp.leeds.ac.uk/ccalas/tagsets/upenn.html ➢ 45 tags (including punctuation)

➣ In Europe, CLAWS tagset is popular ➢ http://ucrel.lancs.ac.uk/claws7tags.html ➢ 137 tags (without punctuation)

Words, Lexicons and Ontologies 12 Penn Treebank Examples (14/45)

Tag Description Tag Description NN Noun,singularormass VB Verb,baseform NNS Noun,plural VBD Verb,pasttense NNP Propernoun,singular VBG Verb,gerundorpresentparticiple NNPS Propernoun,plural VBN Verb,pastparticiple PRP Personalpronoun VBP Verb,non-3rdpersonsingularpresent IN Preposition VBZ Verb,3rdpersonsingularpresent TO to . SentenceFinalpunct(.,?,!)

➣ The tags include inflectional information ➢ If you know the tag, you can generally find the

➣ Some tags are very specialized: I/PRP wanted/VBD to/TO go/VB ./.

Words, Lexicons and Ontologies 13 Good Definitions

➣ a definition should be simpler than the word being explained

➣ the definition should match the part of speech definition (n) a concise explanation of the meaning of a word or phrasel define (v) – give a definition for the meaning of a word; “Define ‘sadness”’

➣ the definition should not be circular

➣ all words in the definition should be defined (somewhere) ➢ prefer small defining ➢ only use metalanguage (NSM: Natural Semantic Metalanguage)

Words, Lexicons and Ontologies 14 Circular definitions beauty the state of being beautiful beautiful full of beauty bobcat a lynx lynx a bobcat

Words, Lexicons and Ontologies 15 Other useful information

http://en.wiktionary.org/wiki/lynx

➣ Pronunciation

➣ Usage Examples

➣ Illustrations

➣ Etymology (history of the word)

➣ Links to other resources

Easier to do without the space restrictions of a paper dictionary.

Words, Lexicons and Ontologies 16 Dictionaries for NLP

Minimize content in order to minimize acquisition problem.

Declarativity and human readability with compilation into a machine- friendly representation.

Modularity so components are reusable: e.g. distinct monolingual and transfer lexicons in an MT system.

Capture generalizations with inheritance (and lexical rules etc). Avoids errors, easier to maintain and expand.

Underspecification to reduce disambiguation for a particular application.

(Copestake, 1992) 17 Morphological Analysis

森 永 前 日 銀 総 裁 rin ei zen hi gin sou sai mori mae nichi morinaga zennichi gin sousai

morinaga zen nichigin sousai

➣ 森永 前 日銀 総裁 Morinaga former Bank of Japan President

Words, Lexicons and Ontologies 18 Morphological Lexicons

➣ Stem

➣ Inflectional Class

➣ Part of Speech (often 1-200)

➣ Arguments (?)

➣ For example ➢ Relations: 前 - 総裁 ➢ Arguments: 前(総裁) ➢ Abstraction: 前(title); 総裁 ⊂ title

Words, Lexicons and Ontologies 19 Morphological

➣ I fabricate for a living

➣ I make things for a living

➣ I fabricated yesterday

➣ I made things for a living

➣ These are differences in the inflectional class

Words, Lexicons and Ontologies 20 Inflection

➣ Inflection: In many languages, words appear in different forms to show small differences in meaning: for example number (dog/dogs; child/children) or tense/aspect (make/made/making/made; take/took/taking/taken)

➣ Many words pattern the same way, this is called an inflectional class (or paradigm). For example, one class of plurals in English is words that end in y: fly/flies; sky/skies.

➣ The inflectional class is normally not predictable from the meaning or syntax; and so must be stored for each word

➣ The root form (lemma) and the inflected form have the same meaning modulo the number/tense/...and the same basic part-of-speech

➣ Normally a word only undergoes one inflection

Words, Lexicons and Ontologies 21 Derivation

➣ New words can also be created by changing the form. If the part of speech or meaning changes, we call it derivation: (happy/happiness; happy/unhappy; happy/happily)

➣ You can also get zero derivation, where the meaning changes without a change in form (I butter the bread/I like butter.

➣ The root form in derivation is called the stem and the process of stripping off derivational affixes is stemming

➣ You can have multiple derivations: anti-dis-establish-ment-arian-ism: the stem is establish

➣ Derivation is largely but not entirely productive: employer, teacher, *studier, actor, contractor

Words, Lexicons and Ontologies 22 Syntactic Lexicon

➣ I fabricated the results

➣ I made up the results

➣ = I made the results up

➣ I walked down the road

➣ 6= I walked the road down

➣ These are differences in the syntactic lexical type

Words, Lexicons and Ontologies 23 Differences in Argument Structure

➣ These are also differences in the syntactic lexical type ➢ I gave the book to him ➢ I gave him the book ➢ Cats eat mice ➢ Cats eat ➢ Cats devour mice ➢ *Cats devour mice

➣ The information about what arguments a verb can take is also called subcategorization, valence or argument frame

Words, Lexicons and Ontologies 24 Semantic Lexicon

➣ I deposited the money in the bank (financial)

➣ The river overflowed its bank (riverside)

➣ I had lunch by the bank (???)

➣ These are differences in the semantic class

Words, Lexicons and Ontologies 25 All the possibilities combine

➣ I saw her duck ➢ see/saw, saw/sawed ➢ duckN , duckV ➢ duckN:cloth, duckN:bird

➣ Still useful to keep separate ➢ inflectional paradigm ➢ arguments (subcategorization) ➢ semantics (selectional preferences)

Words, Lexicons and Ontologies 26 Transfer Lexicons

➣ bank ↔ 銀行 ginkou

➣ bank ↔ 土手 dote

➣ 鼻 hana ↔ nose ➢ trunk [of elephant] ➢ muzzle [of horse] ➢ snout [of boar]

Words, Lexicons and Ontologies 27 Dictionaries in Processing

➣ Lexical lookup is slow (disk-based) ➢ Compile dictionaries into compressed format ➢ Index ➢ Cache the index cache = load it into memory ➢ Cache already accessed entries ➢ Keep a list of frequent entries and cache them the most frequent words are very frequent

➣ Batch check for consistency off-line

Words, Lexicons and Ontologies 28 Dictionaries and Intellectual Property Rights (IPR)

has along tradition of extending other’s work ➢ Johnson, Murray, . . . ➢ Language itself should not be restricted ➢ Dictionaries describe language

➣ Restricted resources lose to open resources ➢ Work on restricted resources is wasted

➣ It is hard to fund maintenance ➢ Users of the lexicons are the best developers

Words, Lexicons and Ontologies 29 Online Dictionaries

➣ Mouse-over lookup (http://www.polarcloud.com/rikaichan/)

➣ No space restrictions (JMDIct)

➣ Collaborative Construction (Wiktionary)

➣ Easy cross-referencing (WordNet)

➣ Easy to link dictionaries (Open Multilingual WordNet) http://compling.hss.ntu.edu.sg/omw/

Words, Lexicons and Ontologies 30 Ontologies

Words, Lexicons and Ontologies 31 What is an Ontology

➣ A set of statements in a formal language that describes/conceptualizes knowledge in a given domain ➢ What kinds of entities exist (in that domain) ➢ What kinds of relationships hold among them

➣ Ontologies usually assume a particular level of granularity ➢ doesn’t capture all details

32 In Other Words

➣ In theory “An ontology is a formal, explicit specification of a shared conceptualisation.” (Gruber, 1993)

➣ More generally An ontology provides a shared vocabulary, which can be used to model a domain, that is, the type of objects and/or concepts that exist, and their properties and relations

Words, Lexicons and Ontologies 33 Why use Ontologies?

➣ To make different markup terminologies transparent

➣ Placing the focus on the meaning of the markup (not its form)

➣ To make implicit knowledge explicit

➣ To ensure that the knowledge is consistent or at least consistently formatted;

Words, Lexicons and Ontologies 34 What does an Ontology consist of?

Classes abstract groups, sets, or collections of objects country, language

Individuals Actual things or concepts Japan, Japanese

Attributes Properties of classes or individuals∗ Japan AREA 377,873 km2

Relations Relations between classes or individuals Japanese SPOKEN-IN Japan

Words, Lexicons and Ontologies 35 Objects can be complex

➣ relation LOCATED-IN has the attribute TRANSITIVE ADD3 LOCATED-IN Bangkok Bangkok LOCATED-IN Thailand ⇒ ADD3 LOCATED-IN Thailand

Words, Lexicons and Ontologies 36 Some Examples

➣ Disease Ontology http://diseaseontology.sourceforge.net

➣ Dublin Core http://dublincore.org/

➣ GOLD (General Ontology for Linguistic Description) http://www.linguistics-ontology.org/gold.html

➣ WordNet (Fellbaum, 1998) http://wordnet.princeton.edu/

Words, Lexicons and Ontologies 37 Disease Ontology

➣ A controlled medical vocabulary

➣ developed at the Bioinformatics Core Facility

➣ designed to map diseases to medical codes such as ICD9CM, SNOMED and others

➣ an early version of the Disease Ontology ➢ doubled concept coverage ➢ reduced the overall misclassification error percentage

Words, Lexicons and Ontologies 38 Ontology linked to Terminology

Concept Term Strings C316301 T657210 bovine spongiform encephalopathy, BSE T657211 mad cow disease, Mad Cow Disease, MCD T734567 encephalopathy spongiforme bovine, ESB T734566 maladie de la vache folle, MVF T700345 encefalopatia espongiforma bovina, EEB T700346 enfermedad de la vaca loca, EVL

➣ Disease Ontology V2.1 2005 ➢ diseases and injuries ∗ Unspecified infectious and parasitic diseases · poliomyelitis and other non-arthropod-borne viral diseases + Unspecified slow virus infection of central nervous system

Words, Lexicons and Ontologies 39 Disease Ontology

➣ A lightweight ontology with a direct application

➣ Maintained by cooperative editing

➣ One of several linked medical ontologies

➣ Successful Application

Words, Lexicons and Ontologies 40 Dublin Core

➣ Goals ➢ Provides a semantic vocabulary for describing the “core” information properties of resources (electronic and “real” physical objects) ➢ Provide enough information to enable intelligent resource discovery systems

➣ History ➢ A collaborative effort started in 1995 ➢ Initiated by people from computer science, librarianship, on-line information services, abstracting and indexing, imaging and geospatial data, museum and archive control.

http://www.tutorialsonline.info/Common/DublinCore.html 41 Dublin Core - 15 Elements

➣ Content (7) ➢ Title, Subject, Description, Type, Source, Relation and Coverage

➣ Intellectual property (4) ➢ Creator, Publisher, Contributor, Rights

➣ Instantiation (4) ➢ Date, Language, Format, Identifier

Words, Lexicons and Ontologies 42 Dublin Core – discussion

➣ Widely used to catalog web data ➢ OLAC: Open Language Archives Community ➢ The MusicBrainz Project: http://www.musicbrainz.org The musicbrainz project is run by volunteers that are defining a metadata standard for music recordings. This metadata standard is an extension of the Dublin Core. The goal of the project is to define the metadata standard for music and to create a metadata catalog of all music recordings around the world. ➢ Australian Government Locator Service (ALGS) http://www.agls. gov.au/ AGLS was developed in late 1997 as the resource discovery metadata standard for Australian governments and was endorsed for use by all levels of government in Australia in November 1998. ➢ ...

Words, Lexicons and Ontologies 43 GOLD

➣ an upper ontology for descriptive linguistics

➣ providing a set of linguistic notions

➣ designed to link different grammatical descriptions

➣ PART-OF E-MELD ‘Electronic Metastructure for Endangered Languages Data’

➣ USED-IN ODIN http://www.csufresno.edu/odin/

Words, Lexicons and Ontologies 44 Singular Number (concept)

Definition:

A value of numberFeature. Singular quantifies the denotation of the nominal element so that:

1. it specifies that there is exactly one. In this English example below, singularNumber is shown by both the noun and the verb in (1): See example (Corbett2000 : 5 )

2. additionally, but not necessarily, this value may be assigned on the basis of formal properties (e.g. singularia tantum), or (health / *healths).

3. if singularNumber functions as generalNumber, it may specify a lack of commitment with regard to quantification. In this Japanese (jpn) example below, ’inu’ (dog) is not specified for number: See example

Words, Lexicons and Ontologies 45 Sg: Usage note

➣ On terminology: the term ’singulative’ is sometimes used for the concept singularNumber, especially when singularNumber is overtly expressed. ’Singulative’ has been used sometimes for singularNumber in systems where singularNumber is distinct from generalNumber.

➣ It is worth bearing in mind that the expression of number can differ cross-linguistically according to the animacy hierarchy. See numberAssignmentSystem.

➣ A note on minimal/augmented systems (and also minimal/unit-augmented/augmented). In some languages which have an inclusive/exclusive distinction in the first person, the firstPersonInclusive may use the morphology which otherwise expresses singularNumber, even though the semantics of firstPersonInclusive entail that it cannot be singular. There is an analysis of this in which the morphology is seen as representing the minimal number associated with the particular person value. Under such a system, the label ’minimal’ can be mapped onto the concept singularNumber, except if one is dealing with firstPersonInclusive minimal, which would be mapped onto the concept dualNumber. (Corbett2000 : 166-169 ) (Conklin1962 )

Words, Lexicons and Ontologies 46 ➣ There is an important theoretical question about whether minimal/unit-augmented/augmented should be considered separate concepts in the GOLD ontology. The main argument for this is that under such systems, the number values dual and trial are expressed only on the firstPersonInclusive by using the morphology otherwise associated with singular and dual respectively. However, as it is possible to specify a mapping from one system onto the other, we allow for a COPE to deal with this substantive issue while ensuring interoperability.

Words, Lexicons and Ontologies 47 Sg: Cross references

Parent Number Feature

Siblings Dual Number, General Number, Greater Paucal Number, Greater Plural Number, Paucal Number, Plural Number, Trial Number,

Words, Lexicons and Ontologies 48 GOLD — Discussion

➣ Main emphasis on definitions and references

➣ Not yet widely used

➣ Linked to by some grammars (e.g. JACY, ERG)

➣ Ontology framework makes it easy to ➢ validate ➢ refer-to

Words, Lexicons and Ontologies 49 WordNet

➣ Princeton WordNet R : is a large lexical database of English.

➣ Nouns, verbs, adjectives and adverbs grouped into sets of cognitive (synsets), each expressing a distinct concept.

➣ Synsets interlinked ➢ hypernym/hyponym/instance (is-a) ➢ meronym/part (has-a) ➢ domain

➣ Free license

Words, Lexicons and Ontologies 50 Multilingual

➣ Over twenty languages (Polish, Serbian, Croatian, Hindi, Telugu, Tail, Malayalam to come)

➣ Linking many different projects with a common core

➣ Created at NTU

http://casta-net.jp/ kuribayashi/multi/ 51 Noun Relations (WordNet) hypernyms: Y is a hypernym of X if every X is a (kind of) Y hyponyms: Y is a hyponym of X if every Y is a (kind of) X coordinate terms: Y is a coordinate term of X if X and Y share a hypernym holonym: Y is a holonym of X if X is a part of Y meronym: Y is a meronym of X if Y is a part of X

Words, Lexicons and Ontologies 52 Example: driver

1. (17) driver – (the operator of a motor vehicle)

2. driver – (someone who drives animals that pull a vehicle)

3. driver – (a golfer who hits the golf ball with a driver)

4. driver, device driver – ((computer science) a program that determines how a computer will communicate with a peripheral device)

5. driver, number one wood – (a golf club (a wood) with a near vertical face that is used for hitting long shots from the tee)

Words, Lexicons and Ontologies 53 Verb Relations (WordNet) hypernym: the verb Y is a hypernym of the verb X if the activity X is a (kind of) Y (travel to movement) troponym: the verb Y is a troponym of the verb X if the activity Y is doing X in some manner (lisp to talk) entailment: the verb Y is entailed by X if by doing X you must be doing Y (sleeping by snoring) coordinate terms: those verbs sharing a common hypernym

Words, Lexicons and Ontologies 54 Adjective and Adverb Relations (WN)

➣ Adjectives antonymy related nouns similar to participle of verb

➣ Adverbs root adjectives

Words, Lexicons and Ontologies 55 Other Relations domain: driver#n#3 “device driver” DOMAIN computing#n#1 derivationally related form: driver#n#3 “device driver” RELATED TO drive#n#20 “cause to function by supplying the force or power for or by controlling”

Words, Lexicons and Ontologies 56 Usability and Accessibility

Usability : ➣ Originally designed for psycholinguistic experiments ➣ Widely used in NLP ➢ PP attachment ➢ WSD - senseval

Accessibility : ➣ downloadable ➣ redistributable ➣ actively maintained

Words, Lexicons and Ontologies 57 There are many wordnets

➣ Because WordNet is both usable and accessible it has inspired the creation of wordnets in many languages. ➢ There are over 60 projects http://globalwordnet.org/wordnets-in-the-world/ ➢ Many have released open data (22 languages in 2013) http://compling.hss.ntu.edu.sg/omw/

➣ Here at NTU we are building several ➢ Japanese Wordnet (Isahara et al., 2008) http://nlpwww.nict.go.jp/wn-ja/index.en.html ➢ Wordnet Bahasa (Nurril Hirfana at al, 2011) http://wn-msa.sourceforge.net/ ➢ Chinese Open Wordnet (Wang & Bond, 2013) http://compling.hss.ntu.edu.sg/cow/

Words, Lexicons and Ontologies 58 The Japanese WordNet

http://nlpwww.nict.go.jp/wn-ja/index.en.html

(Isahara et al., 2008) 59 SUMO

➣ Suggested Upper Merged Ontology (Niles & Pease, 2001; Pease, 2006) http://www.ontologyportal.org/

➣ IEEE sponsored free ontology (with Domain Ontologies) ➢ 20,000 terms ➢ 70,000 axioms

➣ All WordNet synsets are mapped to the ontology

➣ Used as an upper ontology for many projects

Words, Lexicons and Ontologies 60 Curating Knowledge

Words, Lexicons and Ontologies 61 Lexicon Construction and Maintenance

➣ Hand construction

➣ Reuse of existing resources

➣ Machine readable dictionaries

➣ Corpus-based approaches

Words, Lexicons and Ontologies 62 Hand-coding

➣ Some one adds all the information about a word by hand

➣ Expensive (estimates from my time at NTT) ➢ 2,000 yen/verb ➢ 200 yen/noun

➣ The traditional technique for full-blown, linguistically-motivated systems.

➣ For applications, hand-coding should be corpus-driven.

➣ Techniques exist to help the grammar engineer construct the lexicon, or to (partially) allow non-expert user to build lexicons.

Words, Lexicons and Ontologies 63 Reusing resources

The primary problem is building usable lexicons: if it’s usable then it’s reusable! But there can be many problems:

➣ Lack of documentation, especially for semantics

➣ Cost/benefit - not worth reusing a 100 word lexicon

➣ Domain specificity

➣ Legal issues

There are many existing resources: wordlists (e.g. software companies), COMLEX, MRDs (machine readable dictionaries), Wordnet, EDR, ...

Words, Lexicons and Ontologies 64 Machine-readable dictionaries

➣ Long history of research

➣ Limited explicit syntactic information, except in English learners’ dictionaries (OALD (Oxford), LDOCE (Longman), Cobuild, CIDE (Collaborative International Dictionary of English))

➣ Noun definitions can be analysed to derive taxonomies: inheritance hierarchies for semantic information.

➣ Less published work on verbs

➣ Aquilex, MindNet, Lexeed

Words, Lexicons and Ontologies 65 Deriving semantic information

Categorise lexical entries using predefined linguistically motivated classes. E.g. automatically derived taxonomies.

Sauternes a type of sweet gold-coloured French wine

1. find genus term (wine)

2. disambiguate genus term wrt MRD senses: wine1

3. (possibly) refine classification using differentia

Words, Lexicons and Ontologies 66 MRDs: pros and cons

➣ Advantages: ➢ dictionaries are used for manual lexicon construction ➢ broad coverage ➢ a labour-saving resource to augment a manually constructed core lexicon

➣ Disadvantages: ➢ time-consuming to process initially ➢ highly variable quality ➢ different dictionaries require different strategies ➢ combining senses is non-trivial ➢ dictionaries inadequate even for human learners (ambiguities, little frequency information, obsolete or offensive terms, poor coverage of )

Words, Lexicons and Ontologies 67 ➢ publishers’ IPR (intellectual property rights) ➢ restricted by printed medium

➣ Many problems partially solved by wiktionary ➢ Collaborative Lexical Construction

Words, Lexicons and Ontologies 68 Corpus-based acquisition techniques

➣ Test corpus essential for hand-coded lexicons

➣ Semi-automatic acquisition techniques: ➢ show examples in context (concordances) ➢ filter automatically built entries

➣ You can learn monolingual information (typically not 100 ➢ part of speech of unknown words ➢ syntactic class (subcategorisation frames)

➣ Bilexicon acquisition from aligned corpora ➢ learn translations

More next week 69 Corpora: pros and cons

➣ Advantages: ➢ essential tool for human lexicographer ➢ domain-specific terminology and translations ➢ frequency information (some approaches) ➢ possibility of addressing ambiguity problem

Words, Lexicons and Ontologies 70 ➣ Disadvantages: ➢ automatically processing corpora is a research challenge in itself ➢ large scale corpus may not exist for particular domain or language – parallel corpora especially difficult ➢ For MT: large scale results only demonstrated with one-to-one mappings, so far, automatic extraction unproven in full systems ➢ Also unclear how well results transfer to different classes of text

Words, Lexicons and Ontologies 71 Lexicons: Ensuring Consistency

➣ Documentation is essential ➢ Lexicons/ontologies are built by groups of people ➢ Combine documentation with examples ∗ Automatically test examples (e.g COMLEX: Thursday) ∗ Link documentation to annotated corpora

➣ Exploit redundancy rules if there is a correlation between two different classes ➢ uncountable ⇒ singular (test it or enforce it)

Words, Lexicons and Ontologies 72 ➣ Allow different views ➢ All words with a certain property or properties ∗ POS ∗ Semantic Class ∗ Countability

➣ Add words class-by-class so that the meta-data is constant (POS, syntactic/semantic class) ➢ add all soccer player’s names (Diego, Best, ... ) ➢ add all quotative verbs (say, tell, think, know, . . . ) ➢ add all time expressions (today, this morning, yesterday morning, last night, tonight, tomorrow night, ... ) ➢ add all classifiers (匹, 人, 台, 個, 枚, 本,...)

Words, Lexicons and Ontologies 73 Some very concrete Examples

➣ Redefining the dictionary (by Erin McKean; TED Talk 2007) (http:// blog.ted.com/2007/08/30/redefining_the/)

➣ Building a simple NLP Lexicon from an MRD

➣ Building a new

Words, Lexicons and Ontologies 74 Case Study: NLP Lexicon from MRD

➣ Build an ontology of relations between word senses ➢ Information comes from machine readable dictionary

driver2: somebody who drives a car

➣ Extract genus term by parsing the definition sentence

➢ headword wh ⊂ genus term wg hHYPERNYM, somebody, driver2i

➣ Evaluate by comparing to Goi-Taikei

Nichols et al. (2005) 75 A Sample Entry: Driver1

Index ドライバー doraiba- POS noun  Familiarity 6.5 [1–7]       Lexical-type noun-lex        S1 ねじ/まわし/。      screw turn (screwdriver)        ′   Definition S1 ねじ/を/差し入れ/たり/ 、      Sense 1       /抜き取っ/た/する/道具/。         a tool for inserting and removing screws .          Hypernym 道具 equipment “tool”    1      Sem. Class h942:tooli (⊂ 893:equipment)          

Words, Lexicons and Ontologies 76 A Sample Entry: Driver2,3

S1 自動車/を/運転/する/人/。 Definition   " Someone who drives a car#  Sense 2  人   Hypernym 1 hito “person”        Sem. Class h292:driveri (⊂ 5:person)                S1 ゴルフ/で/ 、/遠/距離/用/の/クラブ/。         In golf, a long-distance club.    Definition    S 一番/ウッド/。/     2      Sense 3   A number one wood .           Hypernym クラブ2 kurabu “club”         Sem. Class (h921:leisure equipmenti (⊂ 921))      Domain ゴルフ gorufu “golf”    1       

Words, Lexicons and Ontologies 77 Parse Results for Driver2 (MRS)

hh, x1{h : prpstn rel(h1) hh, x1{h : prpstn rel(h0) h1 : hito(x1) h0 : person(x1) h2 : jidosha(x2) h1 : some(x1, h0, h4) h3 : unten(u1, x1, x2)}i h2 : car(x2) h3 : drive(u1, x1, x2)}i 「自動車を運転する人」 somebody who drives a car Minimal Recursion Semantics (simplified)

➣ Generally language independent

➣ Genus term is normally the highest scoping word (x1): doraiba¯ 2 ⊂ hito(x1) or driver 2 ⊂ person(x1)

Words, Lexicons and Ontologies 78 Extracting more from the MRS

ア: アルプス 、 または 日本アルプス の 略 a: arupusu , matawa nihon-arupusu no ryaku a: alps , or japan alps ADN abbreviation

a: an abbreviation for the Alps or the Japanese Alps

➣ Sometimes highest scoping word is an explicit relation e.g. habbreviation, kind, name, general termi

➣ Sometimes there is coordination

➢ hABBREVIATION, ア “a”, アルプス “Alps”i ➢ hABBREVIATION, ア “a”, 日本アルプス “Japanese alps”i

Words, Lexicons and Ontologies 79 Example of class condiment

トマトケチャップ1 tomato ketchup ホワイトソース1 white sauce ミートソース1 ソース2 meat sauce sauce トマトソース1 tomato sauce ケチャップ1 調味料1 ketchup condiment 塩1 salt カレー粉1 curry powder カレー1 香辛料1 curry spice スパイス1 spice

You can only extract what is in the dictionary definitions

Words, Lexicons and Ontologies 80 Case Study: Transfer Lexicons

➣ I (or my system) speaks language S

➣ I (or my system) want to understand language T

Q: What do I do if I have no S ⇔ T lexicon?

A: Look up S ⇔ I ⇔ T ➢ How can I do this accurately, especially if I don’t understand T ?

Bond & Ogura (2007) 81 Example

SI T mark anjing laut 印 in seal mohor stamp tera Figure 1: Matching through I

Words, Lexicons and Ontologies 82 Our Specific Problem

➣ Make a bilingual lexicon Japanese → Malay lexicon by crossing J → E and M → E ➢ Largest existing lexicons ≈ 7, 000 words

➣ The resulting lexicon will be used by a Ja-Ms MT system The lexicon should have: ➢ Appropriate Translation Equivalents ∗ Especially the first one ➢ Parts of Speech ➢ Semantic Classes (Semantic Transfer System)

Words, Lexicons and Ontologies 83 Two Kinds of Sense Distinctions

➣ Homonyms (Must disambiguate) ➢ Clearly different meanings (different Semantic Classes) seal ⇔ あざらし azarashi hanimali vs seal ⇔ 印 in htooli ➢ Distinguish using semantic classes

➣ Variants (near synonyms) (Should disambiguate) ➢ Finer grained differences (same Semantic Classes) 鳩 hato → doves or pigeons ➢ Distinguish using domains, collocations, n-grams, . . . ➢ As a fall-back, use ranked preferences 鳩 hato → (1) pigeon; (2) dove

Words, Lexicons and Ontologies 84 Shared Translations

SIT mark 印 in seal mohor stamp imprint tera gauge anjing laut 2 mohor 0.4 = 3+2 2 tera 0.286 = 3+4 1 anjing laut 0.25 = 3+1

Words, Lexicons and Ontologies 85 Our Approach

➣ Lexicons ➢ Japanese-English Lexicon (Goi-Taikei) ➢ Malay-English-Chinese Lexicon (KAMI) ➢ Japanese-Chinese Lexicon (Ri-Zhong Cidian)

➣ Scoring ➢ Syntactic matching (POS) ➢ Shared Translations ➢ Semantic matching (Semantic Classes) ➢ Second-language matching (Chinese)

➣ Finally hand checking

Words, Lexicons and Ontologies 86 Japanese-English lexicon

➣ 380,000 Japanese-English word pairs

➣ 3,000 semantic categories (human, inanimate, etc.) ➢ 2,710 common-noun classes ➢ 200 proper-noun classes ➢ 108 verb, event, and state classes

➣ Subcat frames and selectional restrictions for 15,000 verb senses

➣ Available as a book (five volumes) or CD-ROM ➢ Goi-Taikei — A Japanese Lexicon

Words, Lexicons and Ontologies 87 Semantic Transfer Dictionary

Japanese あざらし (azarashi) English seal  ➣ POS noun    Sem Classes hanimali      Rank 1      ➣ In the noun dictionary: ➢ 63,926 Japanese index words ➢ 71,818 Japanese-English pairs ➢ 49,205 different English entries ∗ 90% with 1 translation, 6.5% with 2, 2% with 3 ➢ average number of translations is 1.12

Words, Lexicons and Ontologies 88 The Malay-English lexicon: KAMI

Malay anjing laut  POS hnouni   Classifier ekor (27%) ➣     Sem Classes hanimali (30%)   English seal    Chinese 海豹 (hai bao ) (25%)  3 4  ➢ 67,658 Malay index words ➢ 91,426 Malay-English word pairs ∗ 79% with 1 translation, 14% with 2 4% with 3 ➢ average number of translations is 1.35

Words, Lexicons and Ontologies 89 Adding Semantic Classes to KAMI

1. Original syntactic-semantic codes → Goi-Taikei semantic classes (10,000) e.g. noun.city → city

2. CICC Indonesian dictionary classes (found for 14,784 entries) mapped to Goi-Taikei semantic classes (hand-made partial mapping)

Words, Lexicons and Ontologies 90 Adding Semantic Classes to KAMI

3. Malay numeral classifiers (found for 18,303 nouns) mapped onto Goi-Taikei semantic classes (hand-made partial mapping) e.g. ekor → animal; orang → human

4. Known word lists (ISO 639 languages; ISO 4217 currencies) ISO 4217 entry → currency

5. Manual addition

Words, Lexicons and Ontologies 91 Ja-Cn lexicon: Ri-Zhong Cidian

Japanese あざらし (azarashi)   Japanese Kanji 海豹 ➣ Example:   Chinese Hanzi 海豹      Pronunciation hai` baoˇ  ➢ 83,000 Japanese-Chinese word pairs

Words, Lexicons and Ontologies 92 Crossing

➣ For each pair in the Japanese-English lexicon ➢ Look up the Malay equivalent of the English if an entry with the same coarse POS exists ∗ create a Japanese-Malay pair ∗ store the intermediate English ∗ Calculate scores · shared translations · semantic matching · second-language matching ➢ else mark the Japanese-English pair

➣ For each Japanese index: rank Ja-Ms then Ja-En

Words, Lexicons and Ontologies 93 Example

SIT mark 印 in seal mohor stationary stamp tera tool imprint gauge anjing laut Figure 2: Matching through I and Sem

Words, Lexicons and Ontologies 94 Calculating the Scores

➣ Shared Translations for pair J and M, where E(W ) is the set of English translations of W :

|E(J) ∩ E(M)| shared translation score = (1) |E(J)| + |E(M)|

➣ Semantic matching score for word pair J and M, number of semantic classes of J which subsume or are subsumed by a semantic class of M

➣ Total score = 10 × semantic score + shared translation score − rank

Words, Lexicons and Ontologies 95 Results

➣ Crossed the Japanese-English common-nouns with the Malay-English nouns ➢ 22,658 out of 63,926 Japanese words linked ➢ 16,974 out of 67,658 Malay words ➢ 75,872 Japanese-Malay pairs ➢ Average number of translations was 3.4

➣ Tested 65 randomly selected linked Japanese index words (232 translations)

Words, Lexicons and Ontologies 96 Results (2)

50 50 46.2

40.1 40 40

33.8

30 30 25 22.8 20 20

12.1 10.8 10 10 9.2

0 0 Good OK Error Bad Good OK Error Bad

All Pairs First Ranked Pair

Words, Lexicons and Ontologies 97 Matching through two languages

Japanese English Chinese Malay mark 印章 印 seal tera stamp imprint mohor gauge anjing laut Figure 3: Matching through two languages

Words, Lexicons and Ontologies 98 Results (two pivots)

➣ 5,238 pairs matched both English and Chinese ➢ 97% good translations ➢ 8.1% of the original Japanese index words ➢ 1 in 4 matched Japanese index words

➣ High Precision/Low Recall

Words, Lexicons and Ontologies 99 Further Work

➣ Use a British/American filter ➢ Japanese-English dictionary uses American spelling ➢ Malay-English dictionary uses British spelling ➢ armor/armour don’t match

➣ Lemmatize more before matching ➢ expecially singular/plural

➣ Use an English to increase matches Sanfilippo and Steinberger (1997)

Words, Lexicons and Ontologies 100 Conclusions - building a bilingual lexicon

➣ Number of shared translations tells you something

➣ Semantic classes are even more useful in linking bilingual dictionaries ➢ word pairs with matching semantic classes are better translations

➣ Using two (or more) pivot languages gives even higher accuracy

➣ More information gives higher precision ➢ Link through a pivot language (≈ 65% precision) ➢ Add in semantic links (≈ 80% precision) ➢ Link through two pivot languages (≈ 97% precision)

Words, Lexicons and Ontologies 101 The Secret of Lexical Acquistion

➣ For a given word wu find the most similar known word wk and describe it in the same way

➣ Similarity can be ➢ Distributional ➢ Translation Equivalence ➢ Semantic Class ➢ Burstiness (appears on the same date) ➢ Sub- (same character)

Words, Lexicons and Ontologies 102 How to build Resources?

➣ Bootstrap ontologies from MRDs 1. Parse definitions to find the genus 2. Take it as hypernym or parse further if it is relational abbreviation [of x], nickname [for x], kind [of x], polite form [of x], ...

➣ Bootstrap bilingual dictionaries from other bilingual dictionaries ➢ Link through a pivot language (≈ 65% precision) ➢ Add in semantic links (≈ 80% precision) ➢ Link through two pivot languages (≈ 97% precision)

➣ Find people to build it ➢ Wiktionary, lexicographers, fans, . . .

bootstrap – help oneself, often through improvised means 103 * References

Bond, Francis & Kentaro Ogura. 2007. Combining linguistic resources to create a machine- tractable Japanese-Malay dictionary. Language Resources and Evaluation 42(2). 127– 136. URL http://dx.doi.org/10.1007/s10579-007-9038-4. (Special issue on Asian language technology).

Copestake, Ann. 1992. The representation of lexical semantic information. Brighton: University of Sussex dissertation.

Fellbaum, Christine (ed.). 1998. WordNet: An electronic lexical database. MIT Press.

Gruber, Thomas R. 1993. A translation approach to portable ontology specifications. Knowledge Acquisition 5(2). 199–200.

Words, Lexicons and Ontologies 104 Isahara, Hitoshi, Francis Bond, Kiyotaka Uchimoto, Masao Utiyama & Kyoko Kanzaki. 2008. Development of the Japanese WordNet. In Sixth international conference on language resources and evaluation (lrec 2008), Marrakech.

Mohamed Noor, Nurril Hirfana, Suerya Sapuan & Francis Bond. 2011. Creating the open Wordnet Bahasa. In Proceedings of the 25th pacific asia conference on language, information and computation (paclic 25), 258–267. Singapore.

Nichols, Eric, Francis Bond & Daniel Flickinger. 2005. Robust ontology acquisition from machine-readable dictionaries. In Proceedings of the international joint conference on artificial intelligence ijcai-2005, 1111–1116. Edinburgh.

Niles, Ian & Adam Pease. 2001. Towards a standard upper ontology. In Chris Welty & Barry Smith (eds.), Proceedings of the 2nd international conference on formal ontology in information systems (fois-2001), Maine.

Pease, Adam. 2006. Formal representation of concepts: The suggested upper merged

Words, Lexicons and Ontologies 105 ontology and its use in linguistics. In Andrea C Schalley & D. Zaefferer (eds.), Ontolinguistics. how ontological status shapes the linguistic coding of concepts, Mouton de Gruyter. URL http://www.adampease.org/Articulate/publications/ Ontolinguist%ics04.pdf.

Wang, Shan & Francis Bond. 2013. Building the Chinese Open Wordnet (COW): Starting from core synsets. In Proceedings of the 11th workshop on asian language resources, a workshop at ijcnlp-2013, 10–18. Nagoya.

Words, Lexicons and Ontologies 106