The
the more
the less
The
Corpus Linguistics The
The
1 Today
1. WordNet
2. FrameNet
3. Course Recap
2 Example Question
Question • When was Barack Obama born?
Text available to the machine • Barack Obama was born on August 4, 1961
This is easy. • – just phrase a Google query properly: "Barack Obama was born on *" – syntactic rules that convert questions into statements are straight-forward
Lexical Semantics with Dictionaries 1 Example Question (2)
Question • What plants are native to Scotland?
Text available to the machine • A new chemical plant was opened in Scotland.
What is hard? • – words may have di↵erent meanings (senses) – we need to be able to disambiguate between them
Lexical Semantics with Dictionaries 2 Example Question (3)
Question • Where did David Cameron go on vacation?
Text available to the machine • David Cameron spent his holiday in Cornwall
What is hard? • – words may have the same meaning (synonyms) – we need to be able to match them
Lexical Semantics with Dictionaries 3 Example Question (4)
Question • Which animals love to swim?
Text available to the machine • Polar bears love to swim in the freezing waters of the Arctic.
What is hard? • – words can refer to a subset (hyponym)orsuperset(hypernym) of the concept referred to by another word – we need to have database of such Ais-aBrelationships, called an ontology
Lexical Semantics with Dictionaries 4 Example Question (5)
Question • What is a good way to remove wine stains?
Text available to the machine • Salt is a great way to eliminate wine stains
What is hard? • – words may be related in other ways, including similarity and gradation – we need to be able to recognize these to give appropriate responses
Lexical Semantics with Dictionaries 5 Example Question (6)
Question • Did Poland reduce its carbon emissions since 1989? Text available to the machine • Due to the collapse of the industrial sector after the end of communism in 1989, all countries in Central Europe saw a fall in carbon emissions. Poland is a country in Central Europe. What is hard? • – we need to do inference – a problem for sentential, not lexical, semantics
Lexical Semantics with Dictionaries 6 WordNet
Some of these problems can be solved with a good ontology, e.g., WordNet • WordNet (English) is a hand-built resource containing 117,000 synsets:sets • of synonymous words (See http://wordnet.princeton.edu/)
Synsets are connected by relations such as • – hyponym/hypernym (IS-A: chair-furniture) – meronym (PART-WHOLE: leg-chair) – antonym (OPPOSITES: good-bad)
globalwordnet.org now lists wordnets in over 50 languages (but variable • size/quality/licensing)
Lexical Semantics with Dictionaries 7 Word Sense Ambiguity
Not all problems can be solved by WordNet alone. • Two completely di↵erent words can be spelled the same (homonyms): • I put my money in the bank. vs. He rested at the bank of the river. You can do it! vs. She bought a can of soda.
More generally, words can have multiple (related or unrelated) senses • (polysemes)
Lexical Semantics with Dictionaries 8 How many senses?
5min.exercise:Howmanysensesdoesthewordinterest have? •
Lexical Semantics with Dictionaries 9 How many senses?
Lexical Semantics with Dictionaries 10 How many senses?
How many senses does the word interest have? • – She pays 3% interest on the loan. – He showed a lot of interest in the painting. – Microsoft purchased a controlling interest in Google. – It is in the national interest to invade the Bahamas. – I only have your best interest in mind. – Playing chess is one of my interests. – Business interests lobbied for the legislation.
Are these seven di↵erent senses? Four? Three? • Also note: distinction between polysemy and homonymy not always clear! •
Lexical Semantics with Dictionaries 11 Lexicography requires data
Lexical Semantics with Dictionaries 12 Lumping vs. Splitting
For any given word, lexicographer faces the choice: • – Lump usages into a small number of senses? or – Split senses to reflect fine-grained distinctions?
Lexical Semantics with Dictionaries 13 WordNet senses for interest
S1: a sense of concern with and curiosity about someone or something, • Synonym: involvement S2: the power of attracting or holding one’s interest (because it is unusual • or exciting etc.), Synonym: interestingness S3: a reason for wanting something done, Synonym: sake • S4: a fixed charge for borrowing money; usually a percentage of the amount • borrowed S5: a diversion that occupies one’s time and thoughts (usually pleasantly), • Synonyms: pastime, pursuit S6: a right or legal share of something; a financial involvement with • something, Synonym: stake S7: (usually plural) a social group whose members control some field of • activity and who have common aims, Synonym: interest group
Lexical Semantics with Dictionaries 14 Synsets and Relations in WordNet
Synsets (“synonym sets”, e↵ectively senses) are the basic unit of organization • in WordNet. – Each synset is specific to nouns (.n), verbs (.v), adjectives (.a, .s), or adverbs (.r). – Synonymous words belong to the same synset: car1 (car.n.01)= car,auto,automobile . – Polysemous{ words belong} to multiple synsets: car1 vs. car4 = car,elevator car . Numbered roughly in descending order of frequency. { } Synsets are organized into a network by several kinds of relations, including: • – Hypernymy (Is-A): hyponym ambulance is a kind of hypernym car1 – Meronymy (Part-Whole): meronym{ air bag} is a part of holonym car1 { }
Lexical Semantics with Dictionaries 15 Visualizing WordNet
Lexical Semantics with Dictionaries 16 Computing with WordNet
NLTK provides an excellent API for looking things up in WordNet: • >>> from nltk.corpus import wordnet as wn >>> wn.synsets(’car’) [Synset(’car.n.01’), Synset(’car.n.02’), Synset(’car.n.03’), Synset(’car.n.04’), Synset(’cable_car.n.01’)] >>> wn.synset(’car.n.01’).definition() u’a motor vehicle with four wheels; usually propelled by an internal combustion engine’ >>> wn.synset(’car.n.01’).hypernyms() [Synset(’motor_vehicle.n.01’)]
(WordNet uses an obscure custom file format, so reading the files directly is not recommended!) •
Lexical Semantics with Dictionaries 17 Polysemy and Coverage in WordNet
Online stats: • – 155k unique strings, 118k unique synsets, 207k pairs – nouns have an average 1.24 senses (2.79 if exluding monosemous words) – verbs have an average 2.17 senses (3.57 if exluding monosemous words)
Too fine-grained? • WordNet is a snapshot of the English lexicon, but by no means complete. • – E.g., consider multiword expressions (including noncompositional expressions, idioms): hot dog, take place, carry out, kick the bucket are in WordNet, but not take a break, stress out, pay attention – Neologisms: hoodie, facepalm – Names: Microsoft
Lexical Semantics with Dictionaries 18 Di↵erent sense = di↵erent translation
Another way to define senses: if occurrences of the word have di↵erent • translations, these indicate di↵erent sense
Example interest translated into German • – Zins: financial charge paid for load (WordNet sense 4) – Anteil: stake in a company (WordNet sense 6) – Interesse: all other senses
Other examples might have distinct words in English but a polysemous word • in German.
Lexical Semantics with Dictionaries 19 Word sense disambiguation (WSD)
For many applications, we would like to disambiguate senses • – we may be only interested in one sense – searching for chemical plant on the web, we do not want to know about chemicals in bananas
Task: Given a polysemous word, find the sense in a given context • Popular topic, data driven methods perform well •
Lexical Semantics with Dictionaries 20 WSD as classification
Given a word token in context, which sense (class) does it belong to? • Machine learning solution: We can train a supervised classifier (such as an • SVM), assuming sense-labeled training data: – She pays 3% interest/INTEREST-MONEY on the loan. – He showed a lot of interest/INTEREST-CURIOSITY in the painting. – Playing chess is one of my interests/INTEREST-HOBBY.
SensEval and later SemEval competitions provide such data • – held every 1-3 years since 1998 – provide annotated corpora in many languages for WSD and other semantic tasks
SemCor: 820K word corpus annotated with WordNet synsets •
Lexical Semantics with Dictionaries 21 Simple features
Directly neighboring words • – interest paid – rising interest – lifelong interest – interest rate – interest piqued
Any content words in a 50 word window • – pastime – financial – lobbied – pursued
What other features might be useful? •
Lexical Semantics with Dictionaries 22 Issues with WSD
Not always clear how fine-grained the gold-standard should be • Di cult/expensive to annotate corpora with fine-grained senses • Classifiers must be trained separately for each word • – Hard to learn anything for infrequent or unseen words – Requires new annotations for each new word
Lexical Semantics with Dictionaries 23 Summary: WordNet
Words can be ambiguous (polysemy), sometimes with related meanings, and • other times with unrelated meanings (homonymy).
Di↵erent words can mean the same thing (synonymy). • Computational lexical databases, notably WordNet, organize words in terms of • their meanings. – Synsets and relations between them such as hypernymy and meronymy.
Word sense disambiguation is the task of choosing the right sense for the • context.
Lexical Semantics with Dictionaries 24 Lexical Semantics: Beyond Vectors, Classes, Senses
• So far, we’ve seen approaches that concern the choice of individual words:
• sense disambiguation
• semantic relations in a lexicon or similarity space
• Next: words that are fully understood by “plugging in” information from elsewhere in the sentence.
• Specifically, understanding words that are (semantic) predicates, in relation to their arguments.
• Especially verbs.
• Who did what to whom?
4 Argument Structure Alternations • Mary opened the door. The door opened.
• John slices the bread with a knife. The bread slices easily. The knife slices easily.
• Mary loaded the truck with hay. Mary loaded hay onto the truck. The truck was loaded with hay (by Mary). Hay was loaded onto the truck (by Mary).
• John got Mary a present. John got a present for Mary. Mary got a present from John.
• Mary loaded the truck with hay.
nsubj obj obl
• Hay was loaded onto the truck by Mary.
nsubj:pass obl obl:agent✗ Syntax is not enough!
6 Syntax-Semantics Relationship
7 Outline • Syntax ≠ semantics
• The semantic roles played by different participants in the sentence are not trivially inferable from syntactic relations
• …though there are patterns!
• Frame semantics is a theory of semantic organization that is encapsulated in the FrameNet lexicon + corpus.
• (Other resources, such as PropBank and VerbNet, also define roles, but in a shallower and less knowledge- intensive fashion.)
8 Charles (“Chuck”) Fillmore
9 Thematic Roles
• In “The Case for Case” (1968), Fillmore posited underlying categories that have come to be known as thematic or theta roles. E.g.: ‣ Agent = animate entity who is volitionally acting ‣ Theme = participant that is undergoing motion, for example ‣ Patient = participant that undergoes some internal change of state (e.g., breaking) ‣ Destination = intended endpoint of motion ‣ Recipient = party to which something is transferred
• The VerbNet resource uses these and a couple dozen other roles.
• But it is hard to come up with a small list of these roles that will suffice for all verbs.
• And there are correspondences that these roles do not expose: e.g., that someone who buys is on the receiving end of selling.
10 Berkeley FrameNet https://framenet.icsi.berkeley.edu/
11 Paraphrase
• James snapped a photo of me with Sheila.
• Sheila and I had our picture taken by James.
12 What’s in common
• James snapped a photo of me with Sheila.
• Sheila and I had our picture taken by James.
13 What’s in common
• James snapped a photo of me with Sheila.
• Sheila and I had our picture taken by James.
14 Frame Semantics
“MEANINGS ARE RELATIVIZED TO SCENES”
(Fillmore 1977)
15 Frame Semantics
16 http://www.swisslark.com/2012/08/jana-manja-photography-special-offer_29.html 1. Photographer identifies Subject to be depicted in a Captured_imageFrame Semantics 2. Photographer puts the Subject in view of the Camera 3. Photographer operates the Camera to create the Captured_image
Photographer
Subject Camera
Captured_image
17 1. Photographer identifies Subject to be depicted in a Captured_image 2. Photographer puts the Subject in view of the Camera 3. Photographer operates the Camera to create the Captured_image
Photographer time manner Subject duration location Camera frequency reason Captured_image
photograph.v take ((picture)).v snap picture.v
18 frame name textual definition explaining the scene and how the frame elements relate to one another
Core non-core Frame FEs Elements
predicate1.v predicate2.n predicate3.a
19 20 Event
Create_representation Intentionally_act
Intentionally_create
Uses Create_physical_artwork
Uses Physical_artworks FrameNet 21 FrameNet: Lexicon
• ~1000 frames represent scenarios. Most are associated with lexical units (a.k.a. predicates). Berkeley FrameNet currently has 13k LUs (5k nouns, 5k verbs, 2k adjectives).
• Frame elements (a.k.a. roles) represent participants/ components of those scenarios. Core vs. non-core.
• Frames and their corresponding roles are linked together in the lexicon.
• Frames are explained with textual descriptions.
22
FrameNet Annotations
• Sheila and I had our picture taken by James.
Create_physical_artwork
Creator “James” Physical_artworks Representation “our picture” Creator ∅ Uses Artifact “picture”
Represented “our” 25 Languages with FrameNets
26 Frame-Semantic Analysis
• Given a predicate in a sentence, which frame does it evoke? Frame identification (a kind of word sense disambiguation!)
• Given a sentence and a predicate that evokes a particular frame, what phrases fill which roles in the frame? SRL
• Given a sentence with some predicates, Frame ID + SRL = Frame-semantic parsing
27 FrameNet Parsers
SEMAFOR system from CMU
http://www.ark.cs.cmu.edu/SEMAFOR/ http://demo.ark.cs.cmu.edu/parse
Framat system from Edinburgh
https://github.com/microth/mateplus
Ongoing research at CMU, Google, Edinburgh, University of Washington, …
28 Summary: FrameNet
• Fillmore’s theory of frame semantics claims that understanding linguistic meaning requires encyclopedic knowledge of scenes, or frames.
• Each frame specifies frame elements or roles, and crucial relations among them that characterize a kind of scene.
• Berkeley FrameNet documents a networked inventory of frames and the English words that can evoke them.
• Frames are created based on careful lexicographic work using corpora. FrameNet contains hundreds of thousands of sentences illustrating frame-evoking lexical items and their arguments.
• FrameNet data (or data from other predicate-argument resources) can be used to build automatic semantic role labelers that explain “who did what to whom” in a sentence.
29 Recap of CorpLing: Theory
• Why Corpora?
‣ Issues in assembling corpora
• Markup, tokenization, metadata
• Searching, frequency, collocations
• Syntactic treebanks
• Lexical semantics with and without dictionaries: vectors, classes, senses/synsets, frames
30 Recap of CorpLing: Hands-on
• Annotating a Reddit text for
‣ Sentence boundaries & tokenization
‣ Sentence type
‣ Part of speech (extended PTB)
‣ Syntactic Dependencies (UD)
31 Recap of CorpLing: Tricks of the Trade • Datasets
‣ PTB WSJ treebank
‣ WordNet/SemCor, FrameNet
• Tools
‣ CQP/regular expressions
‣ GitDox (XML markup, spreadsheet)
‣ Arborator
32 Recap of CorpLing: NLP tasks/tools • Tokenization
• POS tagging
• Syntactic parsing
• Word similarity
• Named entity recognition/ semantic class tagging
• Word sense disambiguation
• Semantic role labeling 33 Questions? Feedback?
in rented accommodation forever. Hi Bruno.
Using the CQPWeb interface
Armchair of Charles Dickens (Wikimedia)
34
Linguistic Institute 2017 - Corpus Linguistics 41