
LECTURE 8: LEXICAL, DISTRIBUTIONAL SEMANTICS AND EMBEDDINGS
Meaning representation
Mark Granroth-Wilding

EXERCISE FROM BEFORE

We did this exercise. Assume you:
• are a computer
• have a database of logical/factual world knowledge
• have lots of rules/statistics about English

What steps are involved in:
• analysing this textual input?
• extracting & encoding relevant information?
• answering the question?

Text: A robotic co-pilot developed under DARPA's ALIAS programme has already flown a light aircraft.
Question: What agency has created a computer that can pilot a plane?


EXERCISE In small groups (3-4)

Question: What is a good way to remove wine stains?
Text available: Salt is a great way to eliminate wine stains.

Question: What agency has created a computer that can pilot a plane?
Text available: A robotic co-pilot developed under DARPA's ALIAS programme has already flown a light aircraft.

• What types of meaning / meaning representation that we've seen are useful here?
• What other aspects of meaning do we need?

• Different words may have similar/related meanings
• Relationships between words can be highly complex
• Essential for using information from text
• Difficult for broad-domain systems


EXERCISE In small groups (3-4)

Text: A robotic co-pilot developed under DARPA's ALIAS programme has already flown a light aircraft.
Question: What agency has created a computer that can pilot a plane?

• What aspects of meaning of these words must we capture?
• How might these be represented computationally?
• Think about difficulties in:
  • formalizing meaning
  • specifying/acquiring/learning representations

LEXICAL SEMANTICS

Representing word meaning
• Some difficult issues:
  • Representation
  • Acquiring broad-domain knowledge
  • Polysemy: same word, multiple meanings
  • Multiword expressions


THESAURUS

• Dictionary of synonyms and closely related words
• WordNet: large machine-readable thesaurus (and more)
• Complex network of relationships between words
• Type of network MR

SEMANTIC RELATIONS: HYPONYMY

• Relationship between meanings of words
• Hyponymy: Is-A relation
  • dog is-a animal
  • dog hyponym of animal
  • animal hypernym of dog
• Forms a taxonomy
• Mainly works for concrete nouns
  • democracy, thought, sticky?
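A minimal sketch of how this Is-A hierarchy can be explored programmatically, assuming NLTK's WordNet interface (NLTK is not part of the slides; it needs `pip install nltk` and `nltk.download('wordnet')` first):

```python
# Sketch: exploring the is-a (hyponymy/hypernymy) hierarchy with NLTK's
# WordNet interface.
from nltk.corpus import wordnet as wn

dog = wn.synset('dog.n.01')

# Direct hypernyms (more general concepts) of the first 'dog' sense.
print(dog.hypernyms())

# Walk the taxonomy up to the root: dog -> ... -> animal -> ... -> entity.
for path in dog.hypernym_paths():
    print(' -> '.join(s.name() for s in path))
```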

WORDNET: SYNSETS

• Synset: 'synonym set' ≈ sense
• Unit of organization in WordNet
• Synonyms grouped into synset:
  • car.n.01 = {car, auto, automobile}
• Polysemous words belong to multiple synsets:
  • car.n.04 = {car, elevator car}
• Connected by semantic relations: hyponymy, meronymy, ...

SEMANTIC RELATIONS: MERONYMY

• Another relationship: part-whole
  • air bag part-of car
  • air bag meronym of car
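Synsets and part-whole relations can be inspected with the same (assumed) NLTK interface; the exact synsets returned depend on the WordNet version installed:

```python
# Sketch: synsets and part-whole (meronymy) relations in NLTK's WordNet.
from nltk.corpus import wordnet as wn

# A polysemous word belongs to several synsets, one per sense.
for s in wn.synsets('car'):
    print(s.name(), s.lemma_names())   # e.g. car.n.01 = [car, auto, automobile, ...]

# Part-whole relations: parts of a car, and what an air bag is part of.
print(wn.synset('car.n.01').part_meronyms())      # includes air_bag.n.01
print(wn.synset('air_bag.n.01').part_holonyms())  # car.n.01
```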


WORDNET

https://wordnet.princeton.edu/
• Hand-built ontology
• 117k synsets
• Primarily for English
• Global WordNet
  • Lists WN-like resources in different languages
  • Currently 77
  • Varying quality, size, licenses
• Open Multilingual WordNet
  • Linked/merged WNs for different languages
  • Currently 34

PROBLEMS WITH WORDNET(S)

• Mainly English: limited other languages
• Hand-built: large, but still limited coverage
• Many multiword expressions, but far from complete:
  • take a break, pay attention
• Language changes: neologisms
  • greenwashing, Brexiteer, barista
• Names
  • Mikkeli, Microsoft
• Fine-grained senses


EXERCISE: WORD SENSES In small groups

1  thing. She was talking at a party thrown at Daphne's restaurant in
2  turned it into the hot dinner-party topic. The comedy is the
3  the 1983 general election for a party which was uncertain of its policies
4  to attack the Scottish National Party, who look set to seize Perth and
5  had been passed to a second party who made a financial decision
6  by-pass there will be a street party. "Then," he says, "we are going
7  political tradition and the same party. They are both Anglophilic
8  he told Tony Blair's modernised party they must not retreat into "warm
9  "Oh no, I'm just here for the party," they said. "I think it's terrible
10 A future obliges each party to the contract to fulfil it by
11 signed by or on behalf of each party to the contract." Mr David N

• How many senses here?
• Can you define them?


WORD SENSES

(The party concordance from the previous exercise.)
• How many senses? Difficult decision
• WordNet is quite fine-grained

TRANSLATION SENSES

• WordNet senses defined by hand
• Another way to define senses:
  • different translations → evidence for two senses
• E.g. interest ⇒ German:
  • Zins: charge for a loan
  • Anteil: stake in a company
  • Interesse: other senses
• Senses of German word may be distinguished in English
• Or in other languages

WORD SENSE DISAMBIGUATION

thing. She was talking at a party0 thrown at Daphne's restaurant in
the 1983 general election for a party1 which was uncertain of its policies

• Which senses are used here?
• Often useful to distinguish
• But, useful/important:
  • Search: only interested in one sense
  • Different senses appear in different topics
  • Etc. . .

CLUES TO WORD SENSE

thing. She was talking at a party0 thrown at Daphne's restaurant in    (beer, invite, host)
the 1983 general election for a party1 which was uncertain of its policies    (political, minister, manifesto)

Some clues that might help:
• Surrounding words
• Connected words (grammatically, same doc, . . . )
• Document topic
• (Disambiguated) sense of same word used in same doc

Most also clues to word meaning: see later!


WORD SENSE DISAMBIGUATION (WSD)

• Have predefined senses for words
  • E.g. from WordNet
• Word sense disambiguation (WSD) becomes a supervised classification task (see the sketch below)
  • Features: context words, other clues seen
  • Classes: known senses for word

WORD MEANING (senses)

• WordNet: attempt to encode meaning of words in machine-usable form
• Can discover/represent word meaning in other ways
• Statistical approach: distributional semantics
  • Learn meaning from word's use in text
  • No reliance on annotated resources
  • Broad coverage
  • Apply to any language, domain, . . .
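A minimal sketch of WSD as supervised classification, assuming scikit-learn; the tiny training set of context words and sense labels below is made up purely for illustration:

```python
# Sketch: WSD as supervised classification over bag-of-context-word features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Context words around an ambiguous target ("party"), with annotated senses.
contexts = [
    "talking thrown restaurant dinner host beer invite",  # sense: celebration
    "general election policies political minister",       # sense: political
    "signed behalf contract fulfil obliges",               # sense: legal
]
senses = ["celebration", "political", "legal"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(contexts, senses)

# Classify a new occurrence by its surrounding words.
print(clf.predict(["the minister left the party after the election"]))
```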


REMINDER: MEANING REPRESENTATIONS

• Formal MRs: logical expressions over entities
  • Focus on logical relations/properties
  • Mostly assume representations of words, entities, . . . given
• Network-based representations: semantic nets
  • Network of semantic relations between concepts
  • E.g. WordNet, ConceptNet
• Feature-based representations
  • Attribute-value pairs for objects or concepts
  • Words, objects, facts, events, . . .
• Latent feature-based representations
  • Vector representations
  • E.g. distributional word vectors, word embeddings
  • E.g. deep learning-based reprs

MEANING FROM CONTEXT

A bottle of tezgino is on the table
Everybody likes tezgino
Tezgino makes you drunk
We make tezgino out of corn

• What is tezgino?
• You can probably work it out


DISTRIBUTIONAL HYPOTHESIS

• Can infer meaning of word from contexts it typically occurs in
• Measure word similarity by context similarity
• Is word meaning anything more than contexts it occurs in?
• Distributional hypothesis:
  words that tend to appear in similar contexts have similar meanings

You shall know a word by the company it keeps! – Firth (1957)

WORD CONTEXT

• What is a word's context?
• How to measure typical contexts?
• How to compare typical contexts?
• Lots of possible answers
• Some capture different aspects of word meaning
• A couple of examples here:
  (1) Context = nearby words
  (2) Context = syntactically related words

COUNTING CONTEXTS

Context = nearby words

Example contexts (Reuters text):
  53 billion in 1985 the company said it had been able
  acquisition by bond of the company s existing bank loans and said
  gold be repaid by the mining company in gold
  exploration and development of the company s gold properties in the
  masbate mln vs 6 9 mln company name is bowater industries plc

Most frequent context words of company:
  Word    Count
  said     2306
  dlrs      573
  mln       539
  pct       397
  stock     306
  year      289
  1986      276
  would     272
  shares    261
  share     257

COUNTING CONTEXTS

Context = nearby words, ignoring stopwords

Example contexts with stopwords removed:
  celanese canada inc hoechst celanese corporation trade mountain pipe line company
  wheat flour iraq commodity credit corporation accepted bid export bonus cover
  comments question adjusting commodity credit corporation ccc discount premium schedules improve
  bonus wheat jordan usda commodity credit corporation ccc accepted eight bonus offers
  state flour iraq usda commodity credit corporation ccc accepted bid export bonus

Most frequent context words of corporation:
  Word       Count
  credit        59
  commodity     57
  ccc           49
  tax           30
  mln           28
  said          28
  accepted      21
  bonus         20
  state         20
  pct           19
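A minimal sketch of collecting such window-based context counts; the toy sentence and the stopword list are illustrative only:

```python
# Sketch: counting context words in a +/- 5-word window around a target,
# optionally skipping stopwords.
from collections import Counter

stopwords = {"the", "of", "in", "a", "it", "had", "been", "to", "by", "and", "s"}

def context_counts(tokens, target, window=5, ignore_stopwords=False):
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            if ignore_stopwords and tokens[j] in stopwords:
                continue
            counts[tokens[j]] += 1
    return counts

tokens = "53 billion in 1985 the company said it had been able to cut losses".split()
print(context_counts(tokens, "company", ignore_stopwords=True).most_common(5))
```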


WORD VECTORS

Rows: words; columns: contexts (features)

  89  0 202  0  0 45   0  0
  52  0 110 64  0  0 178  0
   0 87   0 71 84 51   0 20
  77 67   0  0 73  0   0  0

• Context word counts → features to characterize word
• Construct vector for word
• Can compare words by comparing vectors
• Word embedded in vector space: also called word embeddings

COLLECTING CONTEXT COUNTS

• Window:
  • Small number of words (5, 10?)
  • Large number (100)
  • Whole document
• Words:
  • Exclude stopwords?
  • Weight words? Closer → higher?
  • Normalize by word?
• More than counts?
  • said was common for company
  • But also for most other words
  • Frequency relative to expectation
  • One way: mutual information (e.g. PPMI; see the sketch below)
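A minimal sketch of re-weighting raw counts with PPMI, using NumPy and the toy count matrix from the slides:

```python
# Sketch: turning raw context counts into PPMI weights with NumPy.
import numpy as np

counts = np.array([
    [89, 0, 202, 0, 0, 45, 0, 0],
    [52, 0, 110, 64, 0, 10, 178, 0],
    [0, 87, 0, 71, 84, 0, 0, 20],
    [77, 67, 0, 0, 73, 0, 0, 0],
], dtype=float)

total = counts.sum()
p_wc = counts / total                              # joint prob. of (word, context)
p_w = counts.sum(axis=1, keepdims=True) / total    # word marginals
p_c = counts.sum(axis=0, keepdims=True) / total    # context marginals

with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log2(p_wc / (p_w * p_c))
ppmi = np.maximum(pmi, 0)        # keep only positive associations
ppmi[np.isnan(ppmi)] = 0         # guard against 0/0 cells
print(ppmi.round(2))
```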


WORD MATRIX

Feature (context) counts:

            sea  melody  sail  shanty  lyric  oar  roger  chorus
boat         89       0   202       0      0   45      0       0
pirate       52       0   110      64      0   10    178       0
song          0      87     0      71     84    0      0      20
serenade     77      67     0       0     73    0      0       0

• Rows: words; columns: context features
• Very high-dimensional
• Sparse: mostly 0s


VECTOR SPACES

[Plot: data points in an N-dimensional space with axes dim0, dim1, . . . ]

Words × Features (N):

  89  0 202  0  0 45   0  0
  52  0 110 64  0  0 178  0
   0 87   0 71 84 51   0 20
  77 67   0  0 73  0   0  0

• Method treats words as vectors (in a matrix)
• Vector space model (VSM)
• Representation method extremely general
• Used to formulate many representation tasks
• Commonly applied to represent many things: next lecture

WORD SIMILARITY

[Plot: boat plotted twice (boat0, boat1) together with pirate and song in the vector space]

• Compare two words to boat
• Imagine we just look at all boat's contexts twice
  • Knowledge of meaning not changed
  • Euclidean distance increased
• Compare angles instead
  • Angle compares relative values on dimensions

[Plot: word0, word1, word2 as vectors in dim0/dim1 space; similarity to word0: 0.97 (13°) for word2, 0.87 (29°) for word1]

• Cosine similarity: cos of angle between vectors

  $\cos(\vec{word}_0, \vec{word}_1) = \frac{\vec{word}_0 \cdot \vec{word}_1}{|\vec{word}_0|\,|\vec{word}_1|} = \frac{\sum_{i=1}^{N} word_{0,i}\, word_{1,i}}{\sqrt{\sum_{i=1}^{N} word_{0,i}^2}\,\sqrt{\sum_{i=1}^{N} word_{1,i}^2}}$

• Commonly used measure of vector similarity

EXERCISE: WORD ASSOCIATION Individually

https://presemo.helsinki.fi/nlp2020

1 Go to Presemo
• For each word, enter associated words that come into your mind
• No specific relation
• ~10 words
• NB: Words invisible when you've entered them!

(1) Duck
(2) Submarine
(3) Aeroplane
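A minimal NumPy sketch of the cosine measure applied to the toy count vectors from the earlier slides:

```python
# Sketch: cosine similarity between word vectors.
import numpy as np

def cosine(u, v):
    # cos(u, v) = u . v / (|u| |v|)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

boat   = np.array([89, 0, 202, 0, 0, 45, 0, 0], dtype=float)
pirate = np.array([52, 0, 110, 64, 0, 10, 178, 0], dtype=float)
song   = np.array([0, 87, 0, 71, 84, 0, 0, 20], dtype=float)

print(cosine(boat, pirate))  # relatively high: shared contexts (sea, sail, oar)
print(cosine(boat, song))    # 0 here: no contexts in common in the toy counts
```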


VECTORS IN NEURAL NETWORKS Aside

• Feed-forward NNs: all representations are vectors
• Node contains numeric value
• "Layer" of nodes: vector
• Latent representations are vectors
• Can sometimes inspect and compare in this way
• Sometimes capture these intuitions about e.g. similarity
• Example later: word embeddings trained with NNs

SPARSE VECTORS

Features, N: very large

  0.8   0   2.0   0    0   0.4  0     0   0
  0.5   0     0   0    0     0  0  10.2   0
    0   0     0  0.7   0   0.5  0     0   0
    0  0.6    0   0   0.7    0  0     0   0

• Mostly zeroes
• Vectors seen up to now: sparse
• Problem:
  • similar words may appear in related contexts
  • but have few exact contexts in common
  • friend, friends, buddy
• Want to capture more abstract features
  • Single dimension for these


SPARSE VECTORS

(Two of the context columns are chair and table.)

word0   0.8    0   2.0    0    0   1.4   0    0   0
word1     0    0     0    0    0     0   0  1.2   0
word2     0    0   1.6  0.7    0   1.5   0    0   0
word3     0  0.6   3.1    0  0.7     0   0    0   0

• One possibility: exploit correlations between features
  • chair and table often contexts for same word
  • Then word seen with chair, but not table
  • Probably still quite table-chairy?
  • Maybe table unobserved by chance?
• Dimensionality reduction: removes redundancy in columns
  • Exploit these correlations
  • Correlated dims collapsed

DIMENSIONALITY REDUCTION

• Sparse matrix → dense matrix
• Fewer columns (features)
• Mixtures of original dims mapped to new dims
• Apply to word-context matrix
• Standard methods: e.g. SVD (see the sketch below)
• Each word now has dense vector
• Might lose some important distinctions (chair vs table features)
• Often worth it for generalization
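A minimal sketch of the sparse-to-dense step with truncated SVD, assuming scikit-learn; any word-context count matrix can stand in for the toy one below:

```python
# Sketch: dimensionality reduction of a word-context matrix with truncated SVD.
import numpy as np
from sklearn.decomposition import TruncatedSVD

counts = np.array([
    [89, 0, 202, 0, 0, 45, 0, 0],
    [52, 0, 110, 64, 0, 10, 178, 0],
    [0, 87, 0, 71, 84, 0, 0, 20],
    [77, 67, 0, 0, 73, 0, 0, 0],
], dtype=float)

svd = TruncatedSVD(n_components=2)   # 8 sparse context dims -> 2 dense dims
dense = svd.fit_transform(counts)    # one short dense vector per word
print(dense.round(2))
```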


DIMENSIONALITY REDUCTION ON COUNTS

Most frequent context words:

  company          corporation
  Word    Count    Word       Count
  said     2306    credit        59
  dlrs      573    commodity     57
  mln       539    ccc           49
  pct       397    tax           30
  stock     306    mln           28
  year      289    said          28
  1986      276    accepted      21
  would     272    bonus         20
  shares    261    state         20
  share     257    pct           19

• Count vectors are sparse
• Top words seem somewhat characteristic
• But, not much actual overlap
• Apply dimensionality reduction
• Generalize sparse dimensions with dimensionality reduction

COMPARING WORD VECTORS

[Plot: word0 and word1 as vectors in dim0/dim1 space]

• Compare vectors using cosine similarity
• Similar contexts → similar features → similar vectors →? similar meaning

VECTOR ≠ MEANING

• Distributional word vectors capture some aspects of meaning
• Exactly what depends on:
  • features
  • weighting
  • dimensionality reduction
  • comparison metric, . . .
• The meaning of a word is given by its usage?
• Vectors also depend on factors in corpus:
  • culture
  • linguistic register
  • document types

PROBLEMS WITH COUNT-BASED VECTORS

• Want very large vocabulary
  • Oxford English Dictionary: 600k words
  • More if separate vectors for eat, eating, ate, . . .

[Matrix: Words (V) × context features (many)]

  0.89    0  2.02     0     0  0.45     0
  0.52    0  1.10  0.64     0     0  1.78
     0 0.87     0  0.71  0.84  0.51     0
  0.77 0.67     0     0  0.73     0     0

• Dimensionality reduction slow for many words
• Memory inefficient
• Hard to add new words


DON'T COUNT, PREDICT!

• Random embedding initialization
• Iterate over huge corpus
• Update vectors iteratively to predict context

  South Korea is leaning towards introducing . . .

• Prediction with neural network: embedding = layer
• Update using gradient descent:
  • adjust embeddings for each observation
• E.g. word2vec, and many others since

WORD2VEC

Learn: word vectors and word-as-context vectors

[Figure: words plotted in a 2D embedding space (sleep, morning, work, gun, rain, . . . )]

• Positive example:
  • pulls word vector and context vector closer

  South Korea is leaning towards introducing . . .

• Use word vectors, throw away word-as-context vectors


WORD2VEC

Learn: word vectors and word-as-context vectors

[Figure: words plotted in a 2D embedding space (sleep, morning, work, gun, rain, . . . )]

• Random example:
  • pushes word vector and context vector apart

  South Korea is leaning towards introducing . . .

• Use word vectors, throw away word-as-context vectors

WORD2VEC

• Training is fast: simple network
• Train on many examples:
  • 100B words news text [1]
• Vectors for many words:
  • 3M vectors

[1] Pre-trained vectors from authors
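A minimal sketch of training skip-gram embeddings, assuming the third-party gensim library (not named on the slides); the two-sentence corpus is purely illustrative, a real run would use millions of tokenized sentences:

```python
# Sketch: training skip-gram word2vec embeddings with gensim.
from gensim.models import Word2Vec

sentences = [
    ["south", "korea", "is", "leaning", "towards", "introducing", "a", "tax"],
    ["the", "company", "said", "it", "had", "been", "able", "to", "cut", "losses"],
    # ... in practice: a huge corpus of tokenized sentences
]

model = Word2Vec(
    sentences,
    vector_size=100,  # N: dimensionality of the dense embeddings
    window=5,         # context window size
    sg=1,             # 1 = skip-gram ("predict context"), 0 = CBOW
    negative=5,       # negative sampling: random examples pushed apart
    min_count=1,
)

print(model.wv["company"][:5])           # the learned word vector
print(model.wv.most_similar("company"))  # nearest neighbours by cosine
```

Here sg=1 selects the skip-gram objective sketched on the slides; on a toy corpus like this the nearest neighbours are of course noise.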


USING WORD EMBEDDINGS

• Exploit word similarity measures, e.g.:
  • Query expansion with related words: nearest vector neighbours
  • Matching similar words: good ≈ great
• Use word representation directly
  • Use vectors instead of words as input
  • Benefit: related words look similar to system

USING EMBEDDINGS: EXAMPLE

• Classifier for named-entity recognition
  • Choose NER label for each word
  • Classes: O, I-PERS, I-LOC, . . .
  • Features: word, previous word, previous label, . . .
• Word: boat, pirate, ship, . . .
  • Represented as V independent binary features
  • But, boat ≈ ship
• Use embedding as input: N real-valued features
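A minimal sketch contrasting one-hot word features (V binary dimensions) with embedding features (N real values); the embedding matrix below is a random stand-in for pre-trained vectors, and the vocabulary is hypothetical:

```python
# Sketch: one-hot vs embedding features for a word-level (e.g. NER) classifier.
import numpy as np

vocab = ["boat", "pirate", "ship", "london", "mary"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # V independent binary features: boat and ship look totally unrelated.
    vec = np.zeros(len(vocab))
    vec[word_to_idx[word]] = 1.0
    return vec

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), 50))  # stand-in for trained vectors

def embed(word):
    # N real-valued features: related words get similar vectors,
    # so the classifier can generalize from "boat" to "ship".
    return embeddings[word_to_idx[word]]

# One input row for the classifier: current word + previous word.
features = np.concatenate([embed("ship"), embed("pirate")])
print(features.shape)
```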

MORE WORD EMBEDDINGS

• Many methods for training:
  • word2vec (skipgram), GloVe, fastText, . . .
• Unsupervised: can train on any language (with raw text)
  • Pre-trained embeddings for 157 languages from fastText online
• Multi-sense embeddings: bank (money) ≠ bank (river)
  • WSD, then train embeddings
  • Or try to learn multiple senses during training

MORE WORD EMBEDDINGS Context-sensitive embeddings

  . . . the Reserve Bank of New Zealand . . .
    ≠
  . . . Hazardous Substance Data Bank (HSDB) 25 . . .
    ≠
  . . . occupied the West Bank and Gaza . . .

• BERT, ELMo


MORE WORD EMBEDDINGS Multilingual embeddings

• Different languages → different words
• Different languages → different word contexts
  • en:bank ≉ fi:pankki
• Dictionary for some words → align whole vector space
• Recent work:
  • unsupervised cross-lingual alignment without dictionary
  • MUSE, Vecalign, . . .

SUMMARY

• Lexical semantics: word meaning
• WordNet: network of meanings
• Word senses & disambiguation (WSD)
• Distributional semantics: word vectors
• Context counts → word embeddings
• Vector-space model (VSM), vector similarity
• Sparse → dense: dimensionality reduction
• NN-based word embeddings and recent advances


READING MATERIAL

• Vector Semantics: J&M3 ch 6
  • Lexical semantics: 6.1
  • Embeddings: 6.2
  • Distributional word vectors: 6.3.2
  • Similarity, cosine: 6.4
  • Word2vec, dense vectors: 6.8
• Distributional semantics, word embeddings: Eisenstein ch 14
• Nice blog on embeddings, word2vec: The Illustrated Word2vec
