Words, Lexicons and Ontologies
HG8003 Technologically Speaking: The intersection of language and technology

Words, Lexicons and Ontologies

Francis Bond
Division of Linguistics and Multilingual Studies
http://www3.ntu.edu.sg/home/fcbond/

Lecture 4
Location: LT8
HG8003 (2014)

Schedule

Lec.  Date   Topic
1     01-16  Introduction, Organization: Overview of NLP; Main Issues
2     01-23  Representing Language
3     02-06  Representing Meaning
4     02-13  Words, Lexicons and Ontologies
5     02-20  Text Mining and Knowledge Acquisition                      Quiz
6     02-27  Structured Text and the Semantic Web
             Recess
7     03-13  Citation, Reputation and PageRank
8     03-20  Introduction to MT, Empirical NLP
9     03-27  Analysis, Tagging, Parsing and Generation                  Quiz
10    Video  Statistical and Example-based MT
11    04-03  Transfer and Word Sense Disambiguation
12    04-10  Review and Conclusions
Exam  05-06  17:00

➣ Video week 10

Review of Meaning

Review of Representing Meaning

➣ Three ways of defining meaning
  ➢ Attributional (Compositional)
  ➢ Relational
  ➢ Distributional
➣ the Syntax-Semantic Interface
  ➢ Usage ↽⇀ Meaning

Attributional Meaning

➣ Give a semantic description of word use in isolation of the categorisation of other lexical items
  ➢ definitions
  ➢ decompositional semantics (break down into primitives)
➣ Easy for humans to understand
➣ Hard to decide on sense boundaries (granularity: splitters vs. lumpers)
➣ Definitions are circular (the grounding problem)
➣ Hard to be consistent

Relational Meaning

➣ Capture correspondences between lexical items by way of a finite set of pre-defined semantic relations
➣ Methodologies:
  ➢ lexical relations
  ➢ constructional relations
➣ Captures many generalizations usefully
➣ Hard to make complete
➣ Leads to large, complex graphs

Distributional Meaning

➣ Capture word meanings as collections of contexts in which words appear
  ➢ n-grams
  ➢ syntactic relations
  ➢ sentences
  ➢ documents
➣ Good for synonymy, not so good for antonymy
➣ Computationally tractable

Why are dictionaries important?

➣ For humans
  ➢ find meaning of unknown words
  ➢ find more information about known words
  ➢ codify knowledge about word usage (glossaries)
➣ For machines
  ➢ store information about words
  ➢ link between text and knowledge

Words

Introduction to Words, Lexicons and Ontologies

➣ Design and implementation
  ➢ Machine Readable Dictionaries
  ➢ Morphological lexicons
  ➢ Syntactic lexicons
  ➢ Semantic lexicons
  ➢ Ontologies
➣ Construction and Maintenance
  ➢ Construction from scratch
  ➢ Boot-strapping from existing resources
  ➢ Ensuring consistency

Machine Readable Dictionaries (MRDs)

➣ Human dictionaries made available on machine
  ➢ Electronic Dictionaries
  ➢ Dictionary Applications
    ∗ often with automatic word lookup
  ➢ On-line dictionaries
    ∗ Sometimes with glosses

A typical entry

definition (n) a concise explanation of the meaning of a word or phrase or symbol

➣ Headword: definition
➣ Part of Speech: n (noun)
➣ Definition:
  ➢ genus: explanation
  ➢ differentia: concise; of the meaning of a word or phrase or symbol
? Implied: countable (a), regular plural

Parts-of-Speech (POS)

➣ Traditional Grammar has eight:
  Noun, Verb, Adjective, Adverb (open class)
  Conjunction, Preposition, Pronoun, Interjection (closed class)
➣ In the US, the Penn Treebank POS set is de-facto standard:
  ➢ http://www.comp.leeds.ac.uk/ccalas/tagsets/upenn.html
  ➢ 45 tags (including punctuation)
➣ In Europe, CLAWS tagset is popular
  ➢ http://ucrel.lancs.ac.uk/claws7tags.html
  ➢ 137 tags (without punctuation)

Penn Treebank Examples (14/45)

Tag   Description              Tag  Description
NN    Noun, singular or mass   VB   Verb, base form
NNS   Noun, plural             VBD  Verb, past tense
NNP   Proper noun, singular    VBG  Verb, gerund or present participle
NNPS  Proper noun, plural      VBN  Verb, past participle
PRP   Personal pronoun         VBP  Verb, non-3rd person singular present
IN    Preposition              VBZ  Verb, 3rd person singular present
TO    to                       .    Sentence-final punctuation (. ? !)

➣ The tags include inflectional information
  ➢ If you know the tag, you can generally find the lemma
➣ Some tags are very specialized:
  I/PRP wanted/VBD to/TO go/VB ./.
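
The claim that the tag generally gives you the lemma can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `parse_tagged` and `lemma`, the tiny `IRREGULAR` table and the suffix-stripping rules are all hypothetical and deliberately crude (they would get, for example, liked/VBD wrong); a real system would consult a morphological lexicon instead.

```python
# Toy illustration: read word/TAG tokens and guess the lemma from the tag.
# All names and rules here are assumptions made for this sketch.

# Tags from the Penn Treebank table above that mark an inflected form.
INFLECTED = {"NNS", "NNPS", "VBD", "VBG", "VBN", "VBP", "VBZ"}

# A tiny hand-made table of irregular forms (not a real lexicon).
IRREGULAR = {"went": "go", "made": "make", "children": "child"}


def parse_tagged(text):
    """Split 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits on the LAST slash, so './.' gives ('.', '.').
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def lemma(word, tag):
    """Guess the lemma of a word from its tag using crude suffix rules."""
    if tag in INFLECTED and word.lower() in IRREGULAR:
        return IRREGULAR[word.lower()]
    if tag in ("NNS", "VBZ"):
        if word.endswith("ies"):
            return word[:-3] + "y"   # flies -> fly
        if word.endswith("s"):
            return word[:-1]         # dogs -> dog
    if tag in ("VBD", "VBN") and word.endswith("ed"):
        return word[:-2]             # wanted -> want
    if tag == "VBG" and word.endswith("ing"):
        return word[:-3]             # walking -> walk
    return word                      # uninflected tags: the word is the lemma


if __name__ == "__main__":
    for word, tag in parse_tagged("I/PRP wanted/VBD to/TO go/VB ./."):
        print(word, tag, lemma(word, tag))
```

The tag does the real work here: the suffix -s is only stripped when the tag says the word is a plural noun or a 3rd person singular verb, so a word such as bus/NN is left alone.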

Good Definitions

➣ a definition should be simpler than the word being explained
➣ the definition should match the part of speech
  definition (n) a concise explanation of the meaning of a word or phrase
  define (v) give a definition for the meaning of a word; “Define ‘sadness’”
➣ the definition should not be circular
➣ all words in the definition should be defined (somewhere)
  ➢ prefer small defining vocabulary
  ➢ only use metalanguage (NSM: Natural Semantic Metalanguage)

Circular definitions

beauty     the state of being beautiful
beautiful  full of beauty
bobcat     a lynx
lynx       a bobcat

Other useful information

http://en.wiktionary.org/wiki/lynx

➣ Pronunciation
➣ Usage Examples
➣ Illustrations
➣ Etymology (history of the word)
➣ Links to other resources

Easier to do without the space restrictions of a paper dictionary.

Dictionaries for NLP

Minimize content in order to minimize acquisition problem.
Declarativity and human readability with compilation into a machine-friendly representation.
Modularity so components are reusable: e.g. distinct monolingual and transfer lexicons in an MT system.
Capture generalizations with inheritance (and lexical rules etc). Avoids errors, easier to maintain and expand.
Underspecification to reduce disambiguation for a particular application.

(Copestake, 1992)

Morphological Analysis

森    永    前    日     銀    総    裁
rin   ei    zen   hi     gin   sou   sai
mori        mae   nichi
morinaga    zennichi     gin   sousai
morinaga    zen   nichigin     sousai

➣ 森永 前 日銀 総裁
  Morinaga former Bank of Japan President

Morphological Lexicons

➣ Stem
➣ Inflectional Class
➣ Part of Speech (often 1-200)
➣ Arguments (?)
➣ For example
  ➢ Relations: 前 - 総裁
  ➢ Arguments: 前(総裁)
  ➢ Abstraction: 前(title); 総裁 ⊂ title

Morphological Lexicon

➣ I fabricate for a living
➣ I make things for a living
➣ I fabricated yesterday
➣ I made things for a living
➣ These are differences in the inflectional class

Inflection

➣ Inflection: In many languages, words appear in different forms to show small differences in meaning: for example number (dog/dogs; child/children) or tense/aspect (make/made/making/made; take/took/taking/taken)
➣ Many words pattern the same way; such a pattern is called an inflectional class (or paradigm). For example, one class of plurals in English is words that end in y: fly/flies; sky/skies.
➣ The inflectional class is normally not predictable from the meaning or syntax, and so must be stored for each word
➣ The root form (lemma) and the inflected form have the same meaning (modulo the number/tense/...) and the same basic part-of-speech
➣ Normally a word only undergoes one inflection

Derivation

➣ New words can also be created by changing the form. If the part of speech or meaning changes, we call it derivation: (happy/happiness; happy/unhappy; happy/happily)
➣ You can also get zero derivation, where the part of speech or meaning changes without a change in form (I butter the bread / I like butter)
➣ The root form in derivation is called the stem and the process of stripping off derivational affixes is stemming
➣ You can have multiple derivations: anti-dis-establish-ment-arian-ism: the stem is establish
➣ Derivation is largely but not entirely productive: employer, teacher, *studier, actor, contractor

Syntactic Lexicon

➣ I fabricated the results
➣ I made up the results
➣ = I made the results up
➣ I walked down the road
➣ ≠ I walked the road down
➣ These are differences in the syntactic lexical type

Differences in Argument Structure

➣ These are also differences in the syntactic lexical type
  ➢ I gave the book to him
  ➢ I gave him the book
  ➢ Cats eat mice
  ➢ Cats eat
  ➢ Cats devour mice
  ➢ *Cats devour
➣ The information about what arguments a verb can take is also called subcategorization, valence or argument frame

Semantic Lexicon

➣ I deposited the money in the bank (financial)
➣ The river overflowed its bank (riverside)
➣ I had lunch by the bank (???)
➣ These are differences in the semantic class

All the possibilities combine

➣ I saw her duck
  ➢ see/saw, saw/sawed
  ➢ duck_N, duck_V
  ➢ duck_N:cloth, duck_N:bird
➣ Still useful to keep separate
  ➢ inflectional paradigm
  ➢ arguments (subcategorization)
  ➢ semantics (selectional preferences)

Transfer Lexicons

➣ bank ↔ 銀行 ginkou
➣ bank ↔ 土手 dote
➣ 鼻 hana ↔ nose
  ➢ trunk [of elephant]
  ➢ muzzle [of horse]
  ➢ snout [of boar]

Dictionaries in Processing

➣ Lexical lookup is slow (disk-based)
  ➢ Compile dictionaries into compressed format
  ➢ Index
  ➢ Cache the index (cache = load it into memory)
  ➢ Cache already accessed entries
  ➢ Keep a list of frequent entries and cache them (the most frequent words are very frequent)
➣ Batch check for consistency off-line

Dictionaries and Intellectual Property Rights (IPR)

➣ Lexicography has a long tradition of extending others’ work
  ➢ Johnson, Murray, ...
  ➢ Language itself should not be restricted