Lexical Semantics with Dictionaries

Nathan Schneider ([email protected])
Amir Zeldes ([email protected])
(some slides from Philipp Koehn & Sharon Goldwater)

Linguistic Institute 2017 - Corpus Linguistics

Today

1. WordNet

2. FrameNet

3. Course Recap

Example Question

Question
• When was Barack Obama born?

Text available to the machine
• Barack Obama was born on August 4, 1961

This is easy:
• just phrase a Google query properly: "Barack Obama was born on *"
• syntactic rules that convert questions into statements are straightforward

Example Question (2)

Question
• What plants are native to Scotland?

Text available to the machine
• A new chemical plant was opened in Scotland.

What is hard?
• words may have different meanings (senses)
• we need to be able to disambiguate between them

Example Question (3)

Question
• Where did David Cameron go on vacation?

Text available to the machine
• David Cameron spent his holiday in Cornwall

What is hard?
• words may have the same meaning (synonyms)
• we need to be able to match them

Example Question (4)

Question
• Which animals love to swim?

Text available to the machine
• Polar bears love to swim in the freezing waters of the Arctic.

What is hard?
• words can refer to a subset (hyponym) or superset (hypernym) of the concept referred to by another
• we need to have a database of such A is-a B relationships, called an ontology

Example Question (5)

Question
• What is a good way to remove wine stains?

Text available to the machine
• Salt is a great way to eliminate wine stains

What is hard?
• words may be related in other ways, including similarity and gradation
• we need to be able to recognize these to give appropriate responses

Example Question (6)

Question
• Did Poland reduce its carbon emissions since 1989?

Text available to the machine
• Due to the collapse of the industrial sector after the end of communism in 1989, all countries in Central Europe saw a fall in carbon emissions.
• Poland is a country in Central Europe.

What is hard?
• we need to do inference
• a problem for sentential, not lexical, semantics

WordNet

• Some of these problems can be solved with a good ontology, e.g., WordNet
• WordNet (English) is a hand-built resource containing 117,000 synsets: sets of synonymous words (see http://wordnet.princeton.edu/)

• Synsets are connected by relations such as
  – hyponym/hypernym (IS-A: chair-furniture)
  – meronym (PART-WHOLE: leg-chair)
  – antonym (OPPOSITES: good-bad)

• globalwordnet.org now lists wordnets in over 50 languages (but of variable size/quality/licensing)

Word Sense Ambiguity

• Not all problems can be solved by WordNet alone.
• Two completely different words can be spelled the same (homonyms):
  – I put my money in the bank. vs. He rested at the bank of the river.
  – You can do it! vs. She bought a can of soda.

• More generally, words can have multiple (related or unrelated) senses (polysemes)

How many senses?

• 5 min. exercise: How many senses does the word interest have?


How many senses?

• How many senses does the word interest have?
  – She pays 3% interest on the loan.
  – He showed a lot of interest in the painting.
  – Microsoft purchased a controlling interest in Google.
  – It is in the national interest to invade the Bahamas.
  – I only have your best interest in mind.
  – Playing chess is one of my interests.
  – Business interests lobbied for the legislation.

• Are these seven different senses? Four? Three?
• Also note: distinction between polysemy and homonymy not always clear!

… requires data

Lumping vs. Splitting

• For any given word, the lexicographer faces a choice:
  – Lump usages into a small number of senses? or
  – Split senses to reflect fine-grained distinctions?

WordNet senses for interest

• S1: a sense of concern with and curiosity about someone or something, Synonym: involvement
• S2: the power of attracting or holding one’s interest (because it is unusual or exciting etc.), Synonym: interestingness
• S3: a reason for wanting something done, Synonym: sake
• S4: a fixed charge for borrowing money; usually a percentage of the amount borrowed
• S5: a diversion that occupies one’s time and thoughts (usually pleasantly), Synonyms: pastime, pursuit
• S6: a right or legal share of something; a financial involvement with something, Synonym: stake
• S7: (usually plural) a social group whose members control some field of activity and who have common aims, Synonym: interest group
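• These senses can be listed with NLTK's WordNet interface (a quick sketch; the API is introduced properly a few slides below, and the exact glosses depend on the WordNet version):

from nltk.corpus import wordnet as wn

# Print each noun sense of 'interest' with its gloss
for synset in wn.synsets('interest', pos=wn.NOUN):
    print(synset.name(), '-', synset.definition())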

Synsets and Relations in WordNet

• Synsets ("synonym sets", effectively senses) are the basic unit of organization in WordNet.
  – Each synset is specific to nouns (.n), verbs (.v), adjectives (.a, .s), or adverbs (.r).
  – Synonymous words belong to the same synset: car1 (car.n.01) = {car, auto, automobile}.
  – Polysemous words belong to multiple synsets: car1 vs. car4 = {car, elevator car}. Senses are numbered roughly in descending order of frequency.
• Synsets are organized into a network by several kinds of relations, including:
  – Hypernymy (Is-A): hyponym {ambulance} is a kind of hypernym car1
  – Meronymy (Part-Whole): meronym {air bag} is a part of holonym car1

Visualizing WordNet

Computing with WordNet

• NLTK provides an excellent API for looking things up in WordNet:

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('car')
[Synset('car.n.01'), Synset('car.n.02'), Synset('car.n.03'), Synset('car.n.04'), Synset('cable_car.n.01')]
>>> wn.synset('car.n.01').definition()
u'a motor vehicle with four wheels; usually propelled by an internal combustion engine'
>>> wn.synset('car.n.01').hypernyms()
[Synset('motor_vehicle.n.01')]

• (WordNet uses an obscure custom file format, so reading the files directly is not recommended!)
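• A few more lookups in the same style, covering the relations discussed above (a rough sketch; exact outputs depend on the WordNet version bundled with NLTK):

from nltk.corpus import wordnet as wn

# Synonymous lemmas in the first 'car' synset
print(wn.synset('car.n.01').lemma_names())          # ['car', 'auto', 'automobile', ...]

# Part-whole (meronymy) relations
print(wn.synset('car.n.01').part_meronyms())        # includes Synset('air_bag.n.01')

# Antonymy is a relation between lemmas, not synsets
good = wn.synset('good.a.01').lemmas()[0]
print(good.antonyms())                              # [Lemma('bad.a.01.bad')]

# Is a polar bear a kind of animal?  Check the transitive closure of hypernymy.
polar_bear = wn.synsets('polar_bear')[0]
print(wn.synset('animal.n.01') in polar_bear.closure(lambda s: s.hypernyms()))  # True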

Polysemy and Coverage in WordNet

• Online stats:
  – 155k unique strings, 118k unique synsets, 207k word-sense pairs
  – nouns have an average of 1.24 senses (2.79 if excluding monosemous words)
  – verbs have an average of 2.17 senses (3.57 if excluding monosemous words)
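• These averages can be recomputed from NLTK's copy of WordNet (a sketch; the figures will vary slightly with the WordNet version):

from nltk.corpus import wordnet as wn

# Average number of senses per noun lemma (cf. the stats above)
noun_lemmas = set(wn.all_lemma_names(pos='n'))
sense_counts = [len(wn.synsets(lemma, pos='n')) for lemma in noun_lemmas]
print(sum(sense_counts) / len(sense_counts))            # ~1.2 including monosemous lemmas

polysemous = [c for c in sense_counts if c > 1]
print(sum(polysemous) / len(polysemous))                # ~2.8 excluding monosemous lemmas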

• Too fine-grained?
• WordNet is a snapshot of the English lexicon, but by no means complete. E.g., consider:
  – multiword expressions (including noncompositional expressions, idioms): hot dog, take place, carry out, kick the bucket are in WordNet, but not take a break, stress out, pay attention
  – neologisms: hoodie, facepalm
  – …: Microsoft

Different sense = different translation

• Another way to define senses: if occurrences of the word have different translations, these indicate different senses

• Example: interest translated into German
  – Zins: financial charge paid for a loan (WordNet sense 4)
  – Anteil: stake in a company (WordNet sense 6)
  – Interesse: all other senses

• Other examples might have distinct words in English but a polysemous word in German.

Word sense disambiguation (WSD)

• For many applications, we would like to disambiguate senses
  – we may only be interested in one sense
  – searching for chemical plant on the web, we do not want to know about chemicals in bananas

• Task: Given a polysemous word, find the sense in a given context
• Popular topic; data-driven methods perform well

WSD as classification

• Given a word token in context, which sense (class) does it belong to?
• Machine learning solution: we can train a supervised classifier (such as an SVM), assuming sense-labeled training data:
  – She pays 3% interest/INTEREST-MONEY on the loan.
  – He showed a lot of interest/INTEREST-CURIOSITY in the painting.
  – Playing chess is one of my interests/INTEREST-HOBBY.

• SensEval and later SemEval competitions provide such data
  – held every 1-3 years since 1998
  – provide annotated corpora in many languages for WSD and other semantic tasks

• SemCor: 820K word corpus annotated with WordNet synsets

Simple features

• Directly neighboring words
  – interest paid
  – rising interest
  – lifelong interest
  – interest rate
  – interest piqued

• Any content words in a 50-word window
  – pastime
  – financial
  – lobbied
  – pursued

• What other features might be useful?
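• A minimal sketch of how such features could feed a supervised WSD classifier (toy, hand-labeled examples here; a real system would train on SemCor or SemEval data, and a linear SVM is just one reasonable choice):

from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def wsd_features(tokens, i, window=50):
    """Features for the target word tokens[i]: neighboring words + context words within a window."""
    feats = {'prev=' + tokens[i - 1]: 1, 'next=' + tokens[i + 1]: 1}
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    for w in tokens[lo:i] + tokens[i + 1:hi]:
        feats['ctx=' + w.lower()] = 1
    return feats

# Toy sense-labeled training data (hypothetical labels)
train = [
    ('She pays 3% interest on the loan .'.split(), 3, 'INTEREST-MONEY'),
    ('He showed a lot of interest in the painting .'.split(), 5, 'INTEREST-CURIOSITY'),
    ('Playing chess is one of my interests .'.split(), 6, 'INTEREST-HOBBY'),
]

vec = DictVectorizer()
X = vec.fit_transform([wsd_features(toks, i) for toks, i, _ in train])
y = [sense for _, _, sense in train]
clf = LinearSVC().fit(X, y)

test = 'Business interests lobbied for the legislation .'.split()
print(clf.predict(vec.transform([wsd_features(test, 1)])))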

Issues with WSD

• Not always clear how fine-grained the gold standard should be
• Difficult/expensive to annotate corpora with fine-grained senses
• Classifiers must be trained separately for each word
  – Hard to learn anything for infrequent or unseen words
  – Requires new annotations for each new word

Summary: WordNet

• Words can be ambiguous (polysemy), sometimes with related meanings, and other times with unrelated meanings (homonymy).

• Different words can mean the same thing (synonymy).
• Computational lexical databases, notably WordNet, organize words in terms of their meanings.
  – Synsets, and relations between them such as hypernymy and meronymy.

• Word sense disambiguation is the task of choosing the right sense for the context.

Lexical Semantics: Beyond Vectors, Classes, Senses

• So far, we’ve seen approaches that concern the choice of individual words:

• sense disambiguation

• semantic relations in a lexicon or similarity space

• Next: words that are fully understood by “plugging in” information from elsewhere in the sentence.

• Specifically, understanding words that are (semantic) predicates, in relation to their arguments.

• Especially verbs.

• Who did what to whom?

Argument Structure Alternations

• Mary opened the door. The door opened.

• John slices the bread with a knife. The bread slices easily. The knife slices easily.

• Mary loaded the truck with hay. Mary loaded hay onto the truck. The truck was loaded with hay (by Mary). Hay was loaded onto the truck (by Mary).

• John got Mary a present. John got a present for Mary. Mary got a present from John.


• Mary loaded the truck with hay.
  (UD relations: Mary = nsubj, the truck = obj, with hay = obl)

• Hay was loaded onto the truck by Mary.
  (UD relations: Hay = nsubj:pass, onto the truck = obl, by Mary = obl:agent)

✗ Syntax alone is not enough!

Syntax-Semantics Relationship

Outline

• Syntax ≠ semantics

• The semantic roles played by different participants in the sentence are not trivially inferable from syntactic relations

• …though there are patterns!

• Frame semantics is a theory of semantic organization that is encapsulated in the FrameNet lexicon + corpus.

• (Other resources, such as PropBank and VerbNet, also define roles, but in a shallower and less knowledge-intensive fashion.)

Charles (“Chuck”) Fillmore

Thematic Roles

• In “The Case for Case” (1968), Fillmore posited underlying categories that have come to be known as thematic or theta roles. E.g.:
  ‣ Agent = animate entity who is volitionally acting
  ‣ Theme = participant that is undergoing motion, for example
  ‣ Patient = participant that undergoes some internal change of state (e.g., breaking)
  ‣ Destination = intended endpoint of motion
  ‣ Recipient = party to which something is transferred

• The VerbNet resource uses these and a couple dozen other roles.

• But it is hard to come up with a small list of these roles that will suffice for all verbs.

• And there are correspondences that these roles do not expose: e.g., that someone who buys is on the receiving end of selling.

Berkeley FrameNet: https://framenet.icsi.berkeley.edu/

Paraphrase

• James snapped a photo of me with Sheila.

• Sheila and I had our picture taken by James.

What’s in common

• James snapped a photo of me with Sheila.

• Sheila and I had our picture taken by James.


Frame Semantics

“MEANINGS ARE RELATIVIZED TO SCENES”

(Fillmore 1977)

Frame Semantics

(Image: http://www.swisslark.com/2012/08/jana-manja-photography-special-offer_29.html)

1. Photographer identifies Subject to be depicted in a Captured_image
2. Photographer puts the Subject in view of the Camera
3. Photographer operates the Camera to create the Captured_image

Core frame elements: Photographer, Subject, Camera, Captured_image
Non-core frame elements: time, manner, duration, location, frequency, reason

Lexical units evoking the frame: photograph.v, take ((picture)).v, snap picture.v

Each frame in the lexicon has: a textual definition explaining the scene and how the frame elements relate to one another; frame elements (FEs), divided into core and non-core; and the lexical units that evoke it (e.g., predicate1.v, predicate2.n, predicate3.a).

(Diagram: Create_physical_artwork in the network of frame-to-frame relations, alongside Event, Intentionally_act, Intentionally_create, Create_representation, and Physical_artworks, connected by relations such as Uses.)

FrameNet: Lexicon

• ~1000 frames represent scenarios. Most are associated with lexical units (a.k.a. predicates). Berkeley FrameNet currently has 13k LUs (5k nouns, 5k verbs, 2k adjectives).

• Frame elements (a.k.a. roles) represent participants/ components of those scenarios. Core vs. non-core.

• Frames and their corresponding roles are linked together in the lexicon.

• Frames are explained with textual descriptions.
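• A rough sketch of browsing the lexicon programmatically with NLTK's FrameNet interface (requires the framenet_v17 NLTK data package; exact contents depend on the FrameNet release):

from nltk.corpus import framenet as fn

# Look up a frame by name and inspect its definition, frame elements, and lexical units
frame = fn.frame('Create_physical_artwork')
print(frame.definition)
print(sorted(frame.FE))        # frame elements, e.g. 'Creator', 'Representation', ...
print(sorted(frame.lexUnit))   # lexical units evoking the frame, e.g. 'photograph.v', ...

# Which frames can a given word evoke?
print([lu.frame.name for lu in fn.lus(r'photograph\.v')])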


FrameNet Annotations

• Sheila and I had our picture taken by James.

Create_physical_artwork:
  Creator = “James”
  Representation = “our picture”

Physical_artworks (related via Uses):
  Creator = ∅
  Artifact = “picture”
  Represented = “our”

Languages with FrameNets

Frame-Semantic Analysis

• Given a predicate in a sentence, which frame does it evoke? Frame identification (a kind of word sense disambiguation!)

• Given a sentence and a predicate that evokes a particular frame, what phrases fill which roles in the frame? Semantic role labeling (SRL)

• Given a sentence with some predicates: Frame ID + SRL = frame-semantic parsing

FrameNet Parsers

SEMAFOR system from CMU

http://www.ark.cs.cmu.edu/SEMAFOR/ http://demo.ark.cs.cmu.edu/parse

Framat system from Edinburgh

https://github.com/microth/mateplus

Ongoing research at CMU, Google, Edinburgh, University of Washington, …

Summary: FrameNet

• Fillmore’s theory of frame semantics claims that understanding linguistic meaning requires encyclopedic knowledge of scenes, or frames.

• Each frame specifies frame elements or roles, and crucial relations among them that characterize a kind of scene.

• Berkeley FrameNet documents a networked inventory of frames and the English words that can evoke them.

• Frames are created based on careful lexicographic work using corpora. FrameNet contains hundreds of thousands of sentences illustrating frame-evoking lexical items and their arguments.

• FrameNet data (or data from other predicate-argument resources) can be used to build automatic semantic role labelers that explain “who did what to whom” in a sentence.

Recap of CorpLing: Theory

• Why Corpora?

‣ Issues in assembling corpora

• Markup, tokenization, metadata

• Searching, frequency,

• Syntactic

• Lexical semantics with and without dictionaries: vectors, classes, senses/synsets, frames

Recap of CorpLing: Hands-on

• Annotating a Reddit text for

‣ Sentence boundaries & tokenization

‣ Sentence type

‣ Part of speech (extended PTB)

‣ Syntactic Dependencies (UD)

Recap of CorpLing: Tricks of the Trade

• Datasets

‣ Reddit

‣ PTB WSJ

‣ WordNet/SemCor, FrameNet

• Tools

‣ CQP/regular expressions

‣ GitDox (XML markup, spreadsheet)

‣ Arborator

Recap of CorpLing: NLP tasks/tools

• Tokenization

• POS tagging

• Syntactic parsing

• Word similarity

• Named entity recognition / semantic class tagging

• Word sense disambiguation

Questions? Feedback?


Using the CQPWeb interface

Armchair of Charles Dickens (Wikimedia)

