
CS460/626 : Natural Language Processing/Speech, NLP and the Web

Lecture 24, 25, 26 Wordnet

Pushpak Bhattacharyya
CSE Dept., IIT Bombay
17th and 19th (morning and night), 2013

NLP Trinity

[Figure: the NLP Trinity, a diagram with three axes]
• Problem: Morph Analysis, Part of Speech Tagging, Parsing, Semantics
• Language: Hindi, Marathi, English, French
• Algorithm: HMM, CRF, MEMM

NLP Layer

[Figure: stack of processing layers with an arrow labelled "Increased Complexity Of Processing"; layers listed top to bottom]
• Discourse and Coreference
• Extraction
• Parsing
• Chunking
• POS tagging
• Morphology

Background

Classification of Words

Word
• Content Word: Verb, Noun, Adjective, Adverb
• Function Word: Preposition, Conjunction, Pronoun, Interjection

NLP: Thy Name is Disambiguation

• A word can have multiple meanings

and

• A meaning can have multiple words

Word with multiple meanings

Where there is a will, there is a way
Where there is a will, there are hundreds of relatives

A meaning can have multiple words

Proverb: “A cheat never prospers”
Proverb: “A cheat never prospers but can get rich faster”

WSD should be distinguished from structural ambiguity

• Correct groupings are a must
• …
• Iran quake kills 87, 400 injured
• When it rains cats and dogs run for cover
   • When it rains, cats and dogs run for cover
   • When it rains cats and dogs, run for cover

Groups of words (Multiwords) and names can be ambiguous

 Broken guitar for sale, no strings attached (Pun)

 Washington voted Washington to power

 pujaa ne pujaa ke liye phul todaa

 (Pujaa plucked flowers for worship)

• (deep world knowledge) The use of a shin bone is to locate furniture in a dark room

Stages of processing

and

 Lexical Analysis

 Syntactic Analysis

 Semantic Analysis

• Discourse

Example of WSD

• Operation, surgery, surgical operation, surgical procedure, surgical process -- (a medical procedure involving an incision with instruments; performed to repair damage or arrest disease in a living body; "they will schedule the operation as soon as an operating room is available"; "he died while undergoing surgery")
   TOPIC->(noun) surgery#1
• Operation, military operation -- (activity by a military or naval force (as a maneuver or campaign); "it was a joint operation of the navy and air force")
   TOPIC->(noun) military#1, armed forces#1, armed services#1, military machine#1, war machine#1
• Operation -- ((computer science) data processing in which the result is completely specified by a rule (especially the processing that results from a single instruction); "it can perform millions of operations per second")
   TOPIC->(noun) computer science#1, computing#1
• mathematical process, mathematical operation, operation -- ((mathematics) calculation by mathematical methods; "the problems at the end of the chapter demonstrated the mathematical processes involved in the derivation"; "they were learning the basic operations of arithmetic")
   TOPIC->(noun) mathematics#1, math#1, maths#1
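One way to see how the glosses above could drive disambiguation is a gloss-overlap heuristic in the spirit of simplified Lesk. This is only a sketch and not necessarily the method used in the lecture: the sense labels, the abridged glosses, the tiny stop list and the function names are all assumptions made for illustration.

```python
import re

# Glosses abridged from the four senses of "operation" listed above.
# The short sense labels are hypothetical names, not WordNet identifiers.
SENSES = {
    "surgery": "a medical procedure involving an incision with instruments "
               "performed to repair damage or arrest disease in a living body",
    "military": "activity by a military or naval force as a maneuver or campaign",
    "computing": "data processing in which the result is completely specified by a rule",
    "mathematics": "calculation by mathematical methods",
}

# Tiny illustrative stop list (an assumption, not from the lecture).
STOP = {"a", "an", "the", "in", "is", "of", "by", "or", "to", "as", "with", "which"}


def tokens(text):
    """Return the set of lowercase words in text, minus stop words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP}


def simplified_lesk(context, senses=SENSES):
    """Pick the sense whose gloss shares the most words with the context."""
    ctx = tokens(context)
    return max(senses, key=lambda label: len(ctx & tokens(senses[label])))


print(simplified_lesk("The navy and the military force planned a joint operation"))
print(simplified_lesk("The doctor used sterile instruments to repair the damage during the operation"))
```

Plain word overlap is brittle (it ignores morphology and ties), which is one reason the relations in a wordnet are useful as additional evidence.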

IS WSD NEEDED IN LARGE APPLICATIONS?

Word ambiguity → topic drift in IR

Query word: “Madrid bomb blast case”

• {case, suit, lawsuit}: the applicable sense
• {case, container}: drifted topic due to inapplicable sense!!!
• {suit, apparel}: drifted topic due to expanded term!!!
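The drift shown in the diagram can be reproduced with a toy query-expansion routine. This is a minimal sketch: the three synsets come from the diagram, while the sense labels, the `expand` function and its `hops` parameter are hypothetical names introduced here.

```python
# Toy synsets taken from the diagram above; the sense labels are hypothetical.
SYNSETS = {
    "case": {
        "legal": {"case", "suit", "lawsuit"},
        "box": {"case", "container"},
    },
    "suit": {
        "legal": {"case", "suit", "lawsuit"},
        "clothing": {"suit", "apparel"},
    },
}


def expand(term, sense=None, hops=1):
    """Return expansion terms for a query word.

    sense=None mimics expansion without WSD: every sense of the word is used.
    With hops=2 the newly added terms are expanded again, without WSD.
    """
    senses = SYNSETS.get(term, {})
    chosen = list(senses.values()) if sense is None else [senses[sense]]
    terms = {term}.union(*chosen)
    if hops > 1:
        for t in list(terms):
            if t != term:
                terms |= expand(t, None, hops - 1)
    return terms


print(sorted(expand("case")))                         # no WSD: inapplicable sense leaks in
print(sorted(expand("case", sense="legal")))          # with WSD on the query word
print(sorted(expand("case", sense="legal", hops=2)))  # expanded term "suit" drifts further
```

The first call pulls in "container" (inapplicable sense); the third call pulls in "apparel" because the expanded term "suit" is itself ambiguous.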

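[Figure: bar chart, “Our observations on error percentages due to various factors, CLEF 2007”. Y-axis: error percentage (0 to 50). X-axis: Hindi-English, Marathi-English. Series: Transliteration, Translation Disambiguation, Stemmer, Ranking. Bar labels surviving in the extracted text include 43.75, 43.75, 31.25, 18.75, 12.5, 6.25 and 0; which bar each value belongs to is not recoverable.]

How about WSD and MT?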

English source sentences, each followed by the Hindi MT output (questionable word choices are marked):

1. Zaheer Khan, the India fast bowler, has been ruled out of the remainder of the series against England.
   भारत के तेज गेंदबाज, जहीर खान, इंग्लैंड के खिलाफ श्रृंखला के शेष के बाहर शासन किया गया है. (ruled in the administrative sense??)
2. He will return to India and will be replaced by left-arm seamer RP Singh.
   वह भारत लौटने और बाएँ हाथ के तेज गेंदबाज आरपी सिंह द्वारा प्रतिस्थापित किया जाएगा.
3. Zaheer picked up a hamstring injury during the first Test at Lord's.
   जहीर लॉर्ड्स में पहले टेस्ट के दौरान हैमस्ट्रिंग चोट उठाया. (lifted??)
4. He had been withdrawn from the squad for India's recent Test series in the West Indies due to a right ankle injury.
   वह भारत की वेस्ट इंडीज में हाल ही में एक सही (correct??) टखने की चोट के कारण टेस्ट श्रृंखला के लिए टीम से वापस ले लिया गया था.

Wordnet

Psycholinguistic Theory

• Human lexical memory for nouns as a hierarchy.
• Can a canary sing? - Pretty fast response.
• Can a canary fly? - Slower response.
• Does a canary have skin? - Slowest response.

Animal (can move, has skin)

Bird (can fly)

canary (can sing)
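The response-time pattern above can be read as a count of how many IS-A links must be climbed before the property is found. A minimal sketch, with the hierarchy and properties copied from the diagram; the dictionary layout and the function name are assumptions made for illustration.

```python
# Hierarchy and properties exactly as in the diagram above.
PARENT = {"canary": "bird", "bird": "animal", "animal": None}
PROPS = {
    "canary": {"can sing"},
    "bird": {"can fly"},
    "animal": {"can move", "has skin"},
}


def levels_to_property(concept, prop):
    """Number of IS-A links climbed before prop is found, or None if absent."""
    steps = 0
    node = concept
    while node is not None:
        if prop in PROPS[node]:
            return steps
        node = PARENT[node]
        steps += 1
    return None


for prop in ("can sing", "can fly", "has skin"):
    print(prop, levels_to_property("canary", prop))
```

More links climbed corresponds to the slower responses reported for "Can a canary fly?" and "Does a canary have skin?".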

Wordnet - a lexical reference system based on psycholinguistic theories of human lexical memory.

Essential Resource for WSD: Wordnet

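Lexical matrix (rows: word meanings; columns: word forms)

Word Meanings | Word Forms
              | F1            | F2          | F3                | … | Fn
M1            | E1,1 (depend) | E1,2 (bank) | E1,3 (rely)       |   |
M2            |               | E2,2 (bank) | E2,… (embankment) |   |
M3            |               | E3,2 (bank) | E3,3              |   |
…             |               |             |                   | … |
Mm            |               |             |                   |   | Em,n

Wordnet: History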

• The first wordnet in the world was for English, developed at Princeton over 15 years.
• EuroWordNet, a linked structure of European language wordnets, was built in 1998 over 3 years with funding from the EC as a mission mode project.
• Wordnets for Hindi and Marathi being built at IIT Bombay are amongst the first IL (Indian language) wordnets.
• All these are proposed to be linked into the IndoWordnet, which eventually will be linked to the English and the Euro wordnets.

Basic Principle

• Words in natural languages are polysemous.
• However, when synonymous words are put together, a unique meaning often emerges.
• Use is made of Relational Semantics.

Lexical and Semantic relations in wordnet

1. Synonymy
2. Hypernymy / Hyponymy
3. Antonymy
4. Meronymy / Holonymy
5. Gradation
6. Entailment
7. Troponymy

1, 3 and 5 are lexical (word to word), rest are semantic (synset to synset).

WordNet Sub-Graph

[Figure: WordNet sub-graph centred on the synset {house, home}]
• Gloss: a place that serves as the living quarters of one or more families
• Hypernymy: {dwelling, abode}
• Hyponymy: hermitage, cottage
• Meronymy: kitchen, backyard, bedroom, veranda, study, guestroom

Fundamental Design Question

• Syntagmatic vs. Paradigmatic relations?
• Psycholinguistics is the basis of the design.
• When we hear a word, many words come to our mind by association.
• For English, about half of the associated words are syntagmatically related and half are paradigmatically related.
• For cat
   • animal, mammal - paradigmatic
   • mew, purr, furry - syntagmatic

Stated Fundamental Application of Wordnet: Sense Disambiguation

Determination of the correct sense of the word

The crane ate the fish vs. The crane was used to lift the load
bird vs. machine

The problem of Sense tagging

• Given a corpus, assign the correct sense to each word.
• This is sense tagging. It needs Word Sense Disambiguation (WSD).
• Highly important for Question Answering, Machine Translation and Text Mining tasks.

Classification of Words

Word
• Content Word: Verb, Noun, Adjective, Adverb
• Function Word: Preposition, Conjunction, Pronoun, Interjection

Example of sense marking: its need

एक_4187 नए शोध_1138 के अनुसार_3123 जिन लोगों_1189 का सामाजिक_43540 जीवन_125623 व्यस्त_48029 होता है उनके दिमाग_16168 के एक_4187 हिस्से_120425 में अधिक_42403 जगह_113368 होती है।

(According to a new research, those people who have a busy social life, have larger space in a part of their brain).
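In the sense-marked text, each tagged word carries its synset ID after an underscore. A small sketch of how such text could be read back into (word, synset ID) pairs; the regular expression and the function name are assumptions, and the sample tokens are taken from the sentence above.

```python
import re

# A tagged token is "<word>_<digits>"; the word part may itself contain
# underscores (multiword units), so the ID is taken from the last underscore.
TAGGED = re.compile(r"^(?P<word>.+)_(?P<sense>\d+)$")


def parse_sense_tagged(text):
    """Split text into (word, synset_id) pairs; untagged tokens get None."""
    pairs = []
    for token in text.split():
        match = TAGGED.match(token)
        if match:
            pairs.append((match.group("word"), int(match.group("sense"))))
        else:
            pairs.append((token, None))
    return pairs


sample = "एक_4187 नए शोध_1138 के अनुसार_3123"
for word, sense_id in parse_sense_tagged(sample):
    print(word, sense_id)
```

Function words such as के stay untagged, which matches the marked text where only content words carry IDs.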

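नेचर न्यूरोसाइंस में छपे एक_4187 शोध_1138 के अनुसार_3123 कई_4118 लोगों_1189 के दिमाग_16168 के स्कैन से पता_11431 चला कि दिमाग_16168 का एक_4187 हिस्सा_120425 एमिगडाला सामाजिक_43540 व्यस्तताओं_1438 के साथ_328602 सामंजस्य_166 के लिए थोड़ा_38861 बढ़_25368 जाता है। यह शोध_1138 58 लोगों_1189 पर किया गया जिसमें उनकी उम्र_13159 और दिमाग_16168 की साइज़ के आँकड़े_128065 लिए गए। अमरीकी_413405 टीम_14077 ने पाया_227806 कि जिन लोगों_1189 की सोशल नेटवर्किंग अधिक_42403 है उनके दिमाग_16168 का एमिगडाला वाला हिस्सा_120425 बाकी_130137 लोगों_1189 की तुलना_में_38220 अधिक_42403 बड़ा_426602 है। दिमाग_16168 का एमिगडाला वाला हिस्सा_120425 भावनाओं_1912 और मानसिक_42151 स्थिति_1652 से जुड़ा हुआ माना_212436 जाता है।

(According to research published in Nature Neuroscience, scans of many people's brains showed that one part of the brain, the amygdala, grows slightly to cope with social engagements. The study covered 58 people, recording their age and brain size. The American team found that people who do more social networking have a larger amygdala than others. The amygdala is believed to be linked to emotions and mental state.)

Ambiguity of लोग (People)

• लोग, जन, लोक, जनमानस, पब्लिक - एक से अधिक व्यक्ति (more than one person) "लोगों के हित में काम करना चाहिए" (one should work in the interest of the people)
• (English synset) multitude, masses, mass, hoi_polloi, people, the_great_unwashed - the common people generally "separate the warriors from the mass" "power to the people"
• दुनिया, दुनियाँ, संसार, विश्व, जगत, जहाँ, जहान, ज़माना, जमाना, लोक, दुनियावाले, दुनियाँवाले, लोग - संसार में रहने वाले लोग (the people living in the world) "महात्मा गाँधी का सम्मान पूरी दुनिया करती है / मैं इस दुनिया की परवाह नहीं करता / आज की दुनिया पैसे के पीछे भाग रही है" (the whole world respects Mahatma Gandhi / I do not care about this world / today's world runs after money)
• (English synset) populace, public, world - people in general considered as a whole "he is a hero in the eyes of the public"

Basic Principle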

• Words in natural languages are polysemous.
• However, when synonymous words are put together, a unique meaning often emerges.
• Use is made of Relational Semantics.
• Componential Semantics, where each word is a bundle of semantic features (as in the Schankian Conceptual Dependency system or Lexical Componential Semantics), is to be examined as a viable alternative.

Componential Semantics

• Consider cat and tiger. Decide on componential attributes.

          Furry   Carnivorous   Heavy   Domesticable
cat       Y       Y             N       Y
tiger     Y       Y             Y       N

Complete and correct attributes are difficult to design.

Semantic relations in wordnet

1. Synonymy
2. Hypernymy / Hyponymy
3. Antonymy
4. Meronymy / Holonymy
5. Gradation
6. Entailment
7. Troponymy

1, 3 and 5 are lexical (word to word), rest are semantic (synset to synset).

Synset: the foundation (house)

1. house -- (a dwelling that serves as living quarters for one or more families; "he has a house on Cape Cod"; "she felt she had to get out of the house")
2. house -- (an official assembly having legislative powers; "the legislature has two houses")
3. house -- (a building in which something is sheltered or located; "they had a large carriage house")
4. family, household, house, home, menage -- (a social unit living together; "he moved his family to Virginia"; "It was a good Christian household"; "I waited until the whole house was asleep"; "the teacher asked how many people made up his home")
5. theater, theatre, house -- (a building where theatrical performances or motion-picture shows can be presented; "the house was full")
6. firm, house, business firm -- (members of a business organization that owns or operates one or more establishments; "he worked for a brokerage house")
7. house -- (aristocratic family line; "the House of York")
8. house -- (the members of a religious community living together)
9. house -- (the audience gathered together in a theatre or cinema; "the house applauded"; "he counted the house")
10. house -- (play in which children take the roles of father or mother or children and pretend to interact like adults; "the children were playing house")
11. sign of the zodiac, star sign, sign, mansion, house, planetary house -- ((astrology) one of 12 equal areas into which the zodiac is divided)
12. house -- (the management of a gambling house or casino; "the house gets a percentage of every bet")

Creation of Synsets

Three principles:

• Minimality
• Coverage
• Replaceability

Synset creation (continued)

Home
• John’s home was decorated with lights on the occasion of Christmas.
• Having worked for many years abroad, John returned home.

House
• John’s house was decorated with lights on the occasion of Christmas.
• Mercury is situated in the eighth house of John’s horoscope.

Synsets (continued)

{house} is ambiguous. {house, home} has the sense of a social unit living together; Is this the minimal unit? {family, house , home} will make the unit completely unambiguous.

For coverage: {family, household, house, home} ordered according to frequency.
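The minimality argument above can be checked mechanically: a candidate synset is unambiguous when its words share exactly one sense. A minimal sketch under a toy sense inventory; the inventory, the sense labels and the function names are illustrative assumptions, only loosely following the senses of "house" listed earlier.

```python
# Toy sense inventory (an illustrative assumption): each word maps to the
# set of senses it can express. Labels are hypothetical.
SENSES = {
    "house": {"dwelling", "legislature", "family", "theatre", "firm"},
    "home": {"dwelling", "family"},
    "family": {"family", "lineage"},
    "household": {"family"},
}


def shared_senses(words):
    """Senses that every word in the candidate synset can express."""
    return set.intersection(*(SENSES[w] for w in words))


def is_unambiguous(words):
    """True when the words together pin down exactly one sense."""
    return len(shared_senses(words)) == 1


for candidate in (["house"], ["house", "home"], ["family", "house", "home"]):
    print(candidate, sorted(shared_senses(candidate)), is_unambiguous(candidate))
```

Under this inventory {house} and {house, home} remain ambiguous, while adding "family" leaves a single shared sense; further members such as "household" are then added for coverage, not for disambiguation.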

Replaceability of the most frequent words is a requirement.

Synset creation

From first principles

• Pick all the senses from good standard dictionaries.
• Obtain synonyms for each sense.
• Needs hard and long hours of work.

Synset creation (continued)

From the wordnet of another language in the same family

• Pick the synset and obtain the sense from the gloss.
• Get the words of the target language.
• Often the same words can be used, especially for tatsama words.
• Translation, insertion and deletion.

Synset+Gloss+Example

Crucially needed for concept explication, wordnet building using another wordnet and wordnet linking.

English Synset: {earthquake, quake, temblor, seism} -- (shaking and vibration at the surface of the earth resulting from underground movement along a fault plane or from volcanic activity)

Hindi Synset: {भूकंप, भूचाल, भूडोल, जलजला, भूकम्प, भू-कंप, भू-कम्प, ज़लज़ला, भूमिकंप, भूमिकम्प} - प्राकृतिक कारणों से पृथ्वी के भीतरी भाग में कुछ उथल-पुथल होने से ऊपरी भाग के सहसा हिलने की क्रिया "२००१ में गुज़रात में आये भूकंप में काफ़ी लोग मारे गये थे" (shaking of the surface of earth; many were killed in the earthquake in Gujarat)

Marathi Synset: {धरणीकंप, भूकंप} - पृथ्वीच्या पोटात क्षोभ होऊन पृष्ठभाग हालण्याची क्रिया "२००१ साली गुजरातमध्ये झालेल्या धरणीकंपात अनेक लोक मृत्युमुखी पडले" (shaking of the surface caused by a disturbance inside the earth; many people died in the 2001 earthquake in Gujarat)

Semantic Relations

• Hypernymy and Hyponymy
   • Relation between word senses (synsets)
   • X is a hyponym of Y if X is a kind of Y
   • Hyponymy is transitive and asymmetrical
   • Hypernymy is the inverse of Hyponymy (lion -> animal -> animate entity -> entity)

Semantic Relations (continued)
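The hypernymy chain given on the preceding slide (lion -> animal -> animate entity -> entity) is enough to illustrate transitivity and asymmetry. A minimal sketch; the dictionary representation and the function names are assumptions, and each synset is reduced to a single word for brevity.

```python
# Direct hypernym links from the chain on the slide (hyponym -> hypernym).
HYPERNYM = {
    "lion": "animal",
    "animal": "animate entity",
    "animate entity": "entity",
}


def hypernym_chain(synset):
    """All hypernyms of synset, nearest first."""
    chain = []
    while synset in HYPERNYM:
        synset = HYPERNYM[synset]
        chain.append(synset)
    return chain


def is_hyponym_of(x, y):
    """True if x is a (possibly indirect) kind of y, i.e. the transitive closure."""
    return y in hypernym_chain(x)


print(hypernym_chain("lion"))
print(is_hyponym_of("lion", "entity"))   # transitive: follows from three direct links
print(is_hyponym_of("entity", "lion"))   # asymmetric: the reverse does not hold
```

A real wordnet allows a synset to have more than one hypernym, so the chain would become a graph traversal; the single-parent dictionary is a simplification.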

• Meronymy and Holonymy
   • Part-whole relation: branch is a part of tree
   • X is a meronym of Y if X is a part of Y
   • Holonymy is the inverse relation of Meronymy
   • {kitchen} (meronym) ………………………. {house} (holonym)

Lexical Relation

• Antonymy
   • Oppositeness in meaning
   • Relation between word forms
   • Often determined by phonetics, word length etc. ({rise, ascend} vs. {fall, descend})

WordNet Sub-Graph

[Figure: WordNet sub-graph centred on {house, home}, repeated from the earlier slide of the same title]

Troponym and Entailment

• Entailment {snoring – sleeping}
• Troponym {limp, strut – walk} {whisper – talk}

Entailment

Snoring entails sleeping. Buying entails paying.

• Proper Temporal Inclusion. Inclusion can be in any way.
   Sleeping temporally includes snoring. Buying temporally includes paying.
• Co-extensiveness (Troponymy).
   Limping is a manner of walking.

Opposition among verbs

• {rise, ascend} vs. {fall, descend}
   • Tie-untie (do-undo)
   • Walk-run (slow, fast)
   • Teach-learn (same activity, different perspective)
   • Rise-fall (motion upward or downward)
• Opposition and Entailment.
   • Hit or miss (entail aim). Backward presupposition.
   • Succeed or fail (entail try).

The causal relationship

Show-see. Give-have.

Causation and Entailment. Giving entails having. Feeding entails eating.

Kinds of Antonymy

Size          Small - Big
Quality       Good - Bad
State         Warm - Cool
Personality   Dr. Jekyll - Mr. Hyde
Direction     East - West
Action        Buy - Sell
Amount        Little - A lot
Place         Far - Near
Time          Day - Night
Gender        Boy - Girl

Kinds of Meronymy

Component-object     Head - Body
Stuff-object         Wood - Table
Member-collection    Tree - Forest
Feature-activity     Speech - Conference
Place-area           Palo Alto - California
Phase-state          Youth - Life
Resource-process     Pen - Writing
Actor-act            Physician - Treatment

Gradation

State         Childhood, Youth, Old age
Temperature   Hot, Warm, Cold
Action        Sleep, Doze, Wake

Metonymy

• Associated with Metaphors, which are epitomes of semantics
• Oxford Advanced Learners Dictionary definition: “The use of a word or phrase to mean something different from the literal meaning”
• Does it mean Careless Usage?!

Insight from Sanskritic Tradition

• Power of a word
   • Abhidha, Lakshana, Vyanjana
• Meaning of Hall:
   • The hall is packed (abhidha)
   • The hall burst into laughter (lakshana)
   • The hall is full (unsaid: and so we cannot enter) (vyanjana)

Metaphors in Indian Tradition

• upamana and upameya
   • Former: object being compared with
   • Latter: object being compared
   • Puru was like a lion in the battle with Alexander (Puru: upameya; Lion: upamana)

Upamana, rupak, atishayokti

• upamana: Explicit comparison
   • Puru was like a lion in the battle with Alexander
• rupak: Implicit comparison
   • Puru was a lion in the battle with Alexander
• Atishayokti (exaggeration): upamana and upameya dropped
   • Puru’s army fled. But the lion fought on.

Modern study (1956 onwards, Richards et al.)

• Three constituents of metaphor
   • Vehicle (items used metaphorically)
   • Tenor (the metaphorical meaning of the former)
   • Ground (the basis for metaphorical extension)
• “The foot of the mountain”
   • Vehicle: “foot”
   • Tenor: “lower portion”
   • Ground: “spatial parallel between the relationship of the foot to the human body and that of the lower portion of the mountain to the rest of the mountain”

Interaction of semantic fields (Haas)

• Core vs. peripheral semantic fields
• Interaction of two words in metonymic relation brings in new semantic fields with selective inclusion of features
• Leg of a table
   • Does not stretch or move
   • Does stand and support

Lakoff’s (1987) contribution

• Source Domain
• Target Domain
• Mapping Relations

Mapping Relations: ontological correspondences

Anger is heat of fluid in container

        Heat                   Anger
(i)     Container              Body
(ii)    Agitation of fluid     Agitation of mind
(iii)   Limit of resistance    Limit of ability to suppress
(iv)    Explosion              Loss of control

Image Schemas

• Categories: Container, Contained
• Quantity
   • More is up, less is down: Outputs rose dramatically; accident rates were lower
   • Linear scales and paths: Ram is by far the best performer
• Time
   • Stationary event: we are coming to exam time
   • Stationary observer: weeks rush by
• Causation: desperation drove her to extreme steps

Patterns of Metonymy

• Container for contained
   • The kettle boiled (water)
• Possessor for possessed/attribute
   • Where are you parked? (car)
• Represented entity for representative
   • The government will announce new targets
• Whole for part
   • I am going to fill up the car with petrol

Patterns of Metonymy (contd)

• Part for whole
   • I noticed several new faces in the class
• Place for institution
   • Lalbaug witnessed the largest Ganapati

Question: Can you have part-part metonymy?

Purpose of Metonymy

• More idiomatic/natural way of expression
   • More natural to say "the kettle is boiling" as opposed to "the water in the kettle is boiling"
• Economy
   • Room 23 is answering (but not *is asleep)
• Ease of access to referent
   • He is in the phone book (but not *on the back of my hand)
• Highlighting of associated relation
   • The car in the front decided to turn right (but not *to smoke a cigarette)

Feature sharing not necessary

• In a restaurant:
   • Jalebii ko abhi dudh chaiye (“The jalebi now wants milk”; no feature sharing)
   • The elephant now wants some coffee (feature sharing)

Proverbs

• Describe a specific event or state of affairs which is applicable metaphorically to a range of events or states of affairs, provided they have the same or sufficiently similar image-schematic structure

IndoWordNet

Linked Indian Language Wordnets

[Figure: linguistic map of India]

INDOWORDNET

[Figure: linked wordnets. Nodes: Hindi Wordnet, Urdu Wordnet, Bengali Wordnet, Dravidian Language Wordnet, Kashmiri Wordnet, Sanskrit Wordnet, Punjabi Wordnet, Oriya Wordnet, North East Language Wordnet, Marathi Wordnet, Gujarati Wordnet, Konkani Wordnet, English Wordnet]

Size of Indian Language wordnets (June, 2012)

Language     Size     Institution
Assamese     14958    Gauhati University, Guwahati, Assam
Bengali      23765    Indian Statistical Institute, Kolkata, West Bengal
Bodo         15785    Gauhati University, Guwahati, Assam
Gujarati     26580    Dharmsinh Desai University, Nadiad, Gujarat
Kannada      4408     Mysore University, Mysore, Karnataka
Kashmiri     23982    Kashmir University, Srinagar, Jammu and Kashmir
Konkani      25065    Goa University, Panaji, Goa
Malayalam    8557     Amrita University, Coimbatore, Tamilnadu
Manipuri     16351    Manipur University, Imphal, Manipur
Marathi      24954    IIT Bombay, Mumbai, Maharashtra
Nepali       11713    Assam University, Silchar, Assam
Oriya        31454    Hyderabad Central University, Hyderabad, Andhra Pradesh
Punjabi      22332    Thapar University and Punjabi University, Patiala, Punjab
Sanskrit     18980    IIT Bombay, Mumbai
Tamil        8607     Tamil University, Thanjavur, Tamilnadu
Telugu       14246    Dravidian University, Kuppam, Andhra Pradesh
Urdu         23071    Jawaharlal Nehru University, New Delhi

Categories of Synsets (1/2)

•Universal: Synsets which have an indigenous lexeme in all the languages (e.g. Sun, Earth).

•Pan Indian: Synsets which have an indigenous lexeme in all the Indian languages but no English equivalent (e.g. Paapad).

•In-Family: Synsets which have an indigenous lexeme in the particular language family (e.g. the term for Bhatija in Dravidian languages).

Categories of Synsets (2/2)

•Language specific: Synsets which are unique to a language (e.g. Bihu in Assamese language).

•Rare: Synsets which express technical terms (e.g. ngram).

•Synthesized: Synsets created in the language due to influence of another language (e.g. Pizza).

Expansion approach: linking is a subtle and difficult process

• To link or not to link
• While linking:
   • face lexical and semantic chasms
   • Syntactic divergences in the example sentences
   • Change of POS
   • Copula drop (Hindi → Bangla)

Recap: Synset creation by Expansion approach

From the wordnet of another language, preferably in the same family

• Pick the synset and obtain the sense from the gloss.
• Get the words of the target language.
• Often the same words can be used, especially for tatsama words, i.e. words borrowed unchanged from the parent language in the typology.
• Translation, insertion and deletion.

Illustration of expansion approach with noun1

English:
• bank (sloping land (especially the slope beside a body of water)) "they pulled the canoe up on the bank"; "he sat on the bank of the river and watched the currents"

French (wrong!):
• banque (les terrains en pente (en particulier la pente à côté d'un plan d'eau)) «ils ont tiré le canot sur la rive», «il était assis sur le bord de la rivière et j'ai vu les courants»

Illustration of expansion approach with noun2

English:
• bank (sloping land (especially the slope beside a body of water)) "they pulled the canoe up on the bank"; "he sat on the bank of the river and watched the currents"

French:
• {rive, rivage, bord} (les terrains en pente (en particulier la pente à côté d'un plan d'eau)) «ils ont tiré le canot sur la rive», «il était assis sur le bord de la rivière et j'ai vu les courants»

No hypernymy in the synset

[Figure: English Wordnet: edge above bank. French Wordnet: bord, cote above rive, rivage, marked with a question mark.]

Illustration of expansion approach with verb3

English:
• trust, swear, rely, bank (have confidence or faith in) "We can trust in God"; "Rely on your friends"; "bank on your good education"

French:
• compter_sur, avoir_confiance_en, se_fier_à, faire_confiance_à (avoir confiance ou foi en) «Nous pouvons faire confiance en Dieu», «Fiez-vous à vos amis»

Ordered by frequency

Linking kinship relations and fine grained concepts

[Figure: kinship hierarchy: Relative -> Uncle -> {Chacha, Mama}]

Important decision

 TWO kinds of linkages

 Direct

 Hypernymy
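The two kinds of linkage listed above can be kept apart with an explicit link type on every cross-lingual link. This is only a sketch, not the IndoWordNet data format: the class and function names are hypothetical, single words stand in for whole synsets, and the kinship links reflect one reading of the Relative -> Uncle -> {Chacha, Mama} figure on the kinship slide.

```python
from dataclasses import dataclass
from enum import Enum


class LinkKind(Enum):
    DIRECT = "direct"        # target synset expresses the same concept
    HYPERNYMY = "hypernymy"  # target only has a more general concept


@dataclass(frozen=True)
class CrossLink:
    source: str   # word standing in for the source-language synset
    target: str   # word standing in for the target-language synset
    kind: LinkKind


LINKS = [
    # Fine-grained kinship terms link to the more general English concept.
    CrossLink("chacha", "uncle", LinkKind.HYPERNYMY),
    CrossLink("mama", "uncle", LinkKind.HYPERNYMY),
    # Hindi to Kashmiri, as in the Kashmiri case on this slide.
    CrossLink("पानी", "आब", LinkKind.DIRECT),
]


def targets(word, kind=None):
    """Targets linked from word, optionally restricted to one link kind."""
    return [
        link.target
        for link in LINKS
        if link.source == word and (kind is None or link.kind == kind)
    ]


print(targets("chacha"))
print(targets("पानी", LinkKind.DIRECT))
print(targets("mama", LinkKind.DIRECT))
```

Keeping the kind on the link lets a downstream application (for example MT) decide whether a hypernymy link is precise enough to use.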

Case of Kashmiri:
• पानी --direct--> आब
• पानी --hypernym--> त्रेश

How to express a concept not present in the language?

Transliteration: often employed

• Synset ID: 39
• POS: adjective
• Synonyms: सनाथ (sanaatha)
• Gloss: जिसका कोई पालन-पोषण या देखभाल करने वाला हो (one who has someone to nurture or look after them; the opposite of an orphan)
• Example statement: "सनाथ बालकों को अनाथ बालकों की मदद करनी चाहिए (children who are looked after should help the ones who are orphans) / साधक प्रभु का हो जाने पर अनाथ नहीं रहता, सनाथ हो जाता है" (a seeker who becomes the Lord's is no longer an orphan but becomes sanaatha)
• Transliterated and adopted by Bangla and Gujarati

Short phrase: often employed

[Figure: examples in Bangla and Urdu (meaning "inauspicious")]

Linking synsets across languages: Influence on Hindi Wordnet

Hindi wordnet has to add new synsets to accommodate language specific concepts, e.g., in Gujarati ભૈરવજપ (bhairav jap):

ID :: 103040
CAT :: NOUN
CONCEPT :: मोक्ष के लिए जप करते हुए पर्वत पर से अपने आप को गिराना (Taking God’s name and throwing oneself from atop a mountain to attain liberation)
EXAMPLE :: गिरनार के शिखर पर से यात्रिक भैरवजप करते थे एसा माना जाता है। (it is thought that pilgrims used to do bhairav jap atop Girnar)
SYNSET-HINDI :: भैरवजप

Tools

Underlying Architecture

Basis Tools Developed

• Synset making and linking tool
• Quantitative analysis tool
• Sense marking tool
• Indowordnet browser tools
• Morphology Analyser

[Screenshots of the tools, one slide each: Word Search; Word completion and Multi lingual search; Synset Comparison; View synset in other languages; Relationship Comparison (in Gujarati, in Hindi); Ontology Browsing; Synsets for Ontology Node]

Achievements so far

 Significant progress in the number and quality of synsets of all languages

 Project on track

 Deep insights obtained in the expansion approach

 Platform created for multilingual WSD, Multilingual sense based dictionary

 Paving way for CLIR, MT

 Students graduated/graduating (PhD, masters, bachelors)

 Significant publications (LREC, GWC)