
3/6/19

Admin

- Assignment 4
- Quiz #2 Wednesday
  - Same rules as quiz #1
  - First 30 minutes of class
  - Open book and notes

WORD SIMILARITY
David Kauchak
CS159 Spring 2019

Assignment 5 out soon

Quiz #2 Topics

- Linguistics 101
- Parsing
  - Grammars, CFGs, PCFGs
  - Top-down vs. bottom-up
  - CKY algorithm
  - Grammar learning
  - Evaluation
  - Improved models
- Text similarity
  - Will also be covered on Quiz #3, though

Text Similarity

A common question in NLP is how similar are two texts:

  score: sim(text1, text2) = ?
  rank: given a text, which others are most similar?


Bag of Words Representation

For now, let's ignore word order:

  "Obama said banana repeatedly last week on tv, 'banana, banana, banana'"

"Bag of words representation": (4, 1, 1, 0, 0, 1, 0, 0, ...), a multi-dimensional vector, one dimension per word in our vocabulary.

Vector-Based Word Representation

Multi-dimensional vectors, one dimension per word in our vocabulary; each entry is the frequency of word occurrence:

  A                   B
  a1: When       1    b1: When       1
  a2: the        2    b2: the        2
  a3: defendant  1    b3: defendant  1
  a4: and        1    b4: and        0
  a5: courthouse 0    b5: courthouse 1
  ...                 ...
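The bag-of-words idea above can be sketched in a few lines of Python. This is a minimal illustration: the five-word vocabulary matches the slide's dimensions, but the example sentences and function name are my own.

```python
from collections import Counter

def bag_of_words(text, vocab):
    """Map a text to a frequency vector: one dimension per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

# Vocabulary dimensions a1..a5 from the slide
vocab = ["when", "the", "defendant", "and", "courthouse"]

a = bag_of_words("When the defendant and his lawyer walked into the court", vocab)
b = bag_of_words("When the defendant walked into the courthouse", vocab)
# a counts "and" but not "courthouse"; b is the reverse
```

Word order is discarded entirely: any permutation of a sentence produces the same vector.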

TF-IDF

One of the most common weighting schemes:
- TF = term frequency
- IDF = inverse document frequency

  a'_i = a_i × log(N / df_i)
         (TF)   (IDF: word importance weight)

We can then use this with any of our similarity measures!

Normalized Distance Measures

Cosine:

  sim_cos(A, B) = A · B = (Σ_{i=1..n} a_i b_i) / (sqrt(Σ_{i=1..n} a_i²) · sqrt(Σ_{i=1..n} b_i²)) = Σ_{i=1..n} a'_i b'_i

L2:

  dist_L2(A, B) = sqrt( Σ_{i=1..n} (a'_i − b'_i)² )

L1:

  dist_L1(A, B) = Σ_{i=1..n} |a'_i − b'_i|

a' and b' are length-normalized versions of the vectors.
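A minimal sketch of TF-IDF reweighting followed by cosine similarity, assuming raw count vectors over a shared vocabulary (the function names are mine, not from the slides):

```python
import math

def tfidf(vectors):
    """Reweight each term-frequency vector: a'_i = a_i * log(N / df_i),
    where df_i counts how many of the N vectors contain term i."""
    n = len(vectors)
    dims = len(vectors[0])
    df = [sum(1 for v in vectors if v[i] > 0) for i in range(dims)]
    return [[v[i] * math.log(n / df[i]) if df[i] > 0 else 0.0
             for i in range(dims)] for v in vectors]

def cosine_sim(a, b):
    """Cosine similarity: dot product of the length-normalized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Note that a term appearing in every document gets weight log(N/N) = 0, which is how IDF suppresses uninformative words like "the".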


Our Problems

Which of these have we addressed?
- word order
- length
- synonym
- spelling mistakes
- word importance
- word frequency

Word Overlap Problems

A: When the defendant and his lawyer walked into the court, some of the victim supporters turned their backs to him.

B: When the defendant walked into the courthouse with his attorney, the crowd truned their backs on him.

A model of word similarity!

Word Similarity

How similar are two words?

  score: sim(w1, w2) = ?
  rank: given w, order w1, w2, w3 by similarity to w
  list: w1 and w2 are synonyms

applications?

Word Similarity Applications

- General text similarity
- Thesaurus generation
- Automatic evaluation
- Text-to-text
  - paraphrasing
  - summarization
  - machine translation
- Information retrieval (search)


Word Similarity

How similar are two words?

  score: sim(w1, w2) = ?
  rank: given w, order w1, w2, w3 by similarity to w
  list: w1 and w2 are synonyms

ideas? useful resources?

Word Similarity

Four categories of approaches (maybe more):
- Character-based
  - turned vs. truned
  - cognates (night, nacht, nicht, natt, nat, noc, noch)
- Semantic web-based (e.g. WordNet)
- Dictionary-based
- Distributional similarity-based
  - similar words occur in similar contexts

Character-Based Similarity

sim(turned, truned) = ?

How might we do this using only the words (i.e. no outside resources)?

Edit Distance (Levenshtein Distance)

The edit distance between w1 and w2 is the minimum number of operations to transform w1 into w2.

Operations:
- insertion
- deletion
- substitution

EDIT(turned, truned) = ?
EDIT(computer, commuter) = ?
EDIT(banana, apple) = ?
EDIT(wombat, worcester) = ?


Edit Distance

EDIT(turned, truned) = 2
- delete u
- insert u

EDIT(computer, commuter) = 1
- replace p with m

EDIT(banana, apple) = 5
- delete b
- replace n with p
- replace a with p
- replace n with l
- replace a with e

EDIT(wombat, worcester) = 6

Better Edit Distance

Are all operations equally likely?
- No

Improvement: give different weights to different operations
- replacing a for e is more likely than z for y

Ideas for weightings?
- Learn from actual data (known typos, known similar words)
- Intuitions: phonetics
- Intuitions: keyboard configuration
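The unweighted edit distance can be computed with the standard dynamic-programming recurrence. This sketch keeps only two rows of the DP table; a weighted variant would replace the three `+ 1` costs with learned per-operation weights, as the slide suggests.

```python
def edit_distance(w1, w2):
    """Minimum number of insertions, deletions, and substitutions
    needed to transform w1 into w2 (Levenshtein distance)."""
    m, n = len(w1), len(w2)
    prev = list(range(n + 1))  # distances from "" to prefixes of w2
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if w1[i - 1] == w2[j - 1] else 1
            cur[j] = min(prev[j] + 1,       # deletion
                         cur[j - 1] + 1,    # insertion
                         prev[j - 1] + cost)  # substitution (or match)
        prev = cur
    return prev[n]
```

Running it on the slide's examples reproduces the answers: EDIT(turned, truned) = 2 and EDIT(computer, commuter) = 1.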

Vector Character-Based Word Similarity

sim(turned, truned) = ?

Any way to leverage our vector-based similarity approaches from last time?

Generate a feature vector based on the characters (or could also use the set-based measures at the character level):

  turned   truned
  a: 0     a: 0
  b: 0     b: 0
  c: 0     c: 0
  d: 1     d: 1
  e: 1     e: 1
  f: 0     f: 0
  g: 0     g: 0
  ...      ...

problems?


Vector Character-Based Word Similarity

sim(restful, fluster) = ?

The character level loses a lot of information. ideas?

Use character bigrams or even trigrams:

  restful   fluster
  aa: 0     aa: 0
  ab: 0     ab: 0
  ac: 0     ac: 0
  ...       ...
  es: 1     er: 1
  ...       ...
  fu: 1     fl: 1
  ...       ...
  re: 1     lu: 1
  ...       ...
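A sketch of the bigram idea: represent each word by its set of character bigrams and compare the sets with a simple set-based measure (Jaccard overlap here; cosine over bigram counts would work just as well).

```python
def char_ngrams(word, n=2):
    """All character n-grams of a word, e.g. 'rest' -> ['re', 'es', 'st']."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def bigram_sim(w1, w2):
    """Jaccard overlap of the two words' character-bigram sets."""
    a, b = set(char_ngrams(w1)), set(char_ngrams(w2))
    return len(a & b) / len(a | b) if a | b else 0.0
```

Unlike single characters, bigrams distinguish anagrams: "restful" and "fluster" share every letter but only the bigram "st", so their similarity is low.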

Word Similarity

Four general categories:
- Character-based
  - turned vs. truned
  - cognates (night, nacht, nicht, natt, nat, noc, noch)
- Semantic web-based (e.g. WordNet)
- Dictionary-based
- Distributional similarity-based
  - similar words occur in similar contexts

WordNet

Lexical database for English:
- 155,287 words
- 206,941 word senses
- 117,659 synsets (synonym sets)
- ~400K relations between senses
- Parts of speech: nouns, verbs, adjectives, adverbs

Word graph, with word senses as nodes and edges as relationships.

Psycholinguistics:
- WN attempts to model human lexical memory
- Design based on psychological testing

Created by researchers at Princeton: http://wordnet.princeton.edu/

Lots of programmatic interfaces.


WordNet Relations

- synonym – X and Y have similar meaning
- antonym – X and Y have opposite meanings
- hypernym – superclass
  - dog is a hypernym of beagle
- hyponym – subclass
  - beagle is a hyponym of dog
- holonym – contains part
  - car is a holonym of wheel
- meronym – part of
  - wheel is a meronym of car
- troponym – for verbs, a more specific way of doing an action
  - run is a troponym of move
  - dice is a troponym of cut
- entailment – for verbs, one activity leads to the next
  - sleep is entailed by snore
- (and a few others)

WordNet

Graph, where nodes are words and edges are relationships.

There is some hierarchical information, for example with hyper-/hypo-nymy.


WordNet: dog

[screenshots of the WordNet entries for "dog"]

WordNet-like Hierarchy

  animal
    fish
    mammal
      wolf
      horse
        mare
        stallion
      dog
        hunting dog
          dachshund
          terrier
      cat
    reptile
    amphibian

To utilize WordNet, we often want to think about some graph-based measure.

Rank the following based on similarity:
- SIM(wolf, dog)
- SIM(wolf, amphibian)
- SIM(terrier, wolf)
- SIM(dachshund, terrier)


WordNet-like Hierarchy

Ranked from most to least similar:
1. SIM(dachshund, terrier)
2. SIM(wolf, dog)
3. SIM(terrier, wolf)
4. SIM(wolf, amphibian)

What information/heuristics did you use to rank these?
- path length is important (but not the only thing)
- words that share the same ancestor are related
- words lower down in the hierarchy are finer grained and therefore closer
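One way to make the path-length heuristic concrete is to walk parent pointers in the toy hierarchy from the slides. This is an illustrative sketch only, not how WordNet itself is implemented; the parent table encodes the animal tree shown above.

```python
# Parent links for the toy "WordNet-like" hierarchy from the slides.
parent = {
    "fish": "animal", "mammal": "animal",
    "reptile": "animal", "amphibian": "animal",
    "wolf": "mammal", "horse": "mammal", "dog": "mammal", "cat": "mammal",
    "mare": "horse", "stallion": "horse",
    "hunting dog": "dog",
    "dachshund": "hunting dog", "terrier": "hunting dog",
}

def ancestors(word):
    """The word itself followed by its chain of ancestors up to the root."""
    chain = [word]
    while word in parent:
        word = parent[word]
        chain.append(word)
    return chain

def path_length(w1, w2):
    """Edges between w1 and w2 through their lowest common ancestor."""
    chain1, chain2 = ancestors(w1), ancestors(w2)
    lca = next(a for a in chain1 if a in chain2)
    return chain1.index(lca) + chain2.index(lca)

def path_sim(w1, w2):
    """Simple similarity: closer in the graph means more similar."""
    return 1.0 / (1 + path_length(w1, w2))
```

Interestingly, raw path length puts SIM(wolf, amphibian) (path 3) above SIM(terrier, wolf) (path 4), the opposite of the intuitive ranking: exactly the "path length is not the only thing" caveat from the slide.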

WordNet Similarity Measures

Path length doesn't work very well.

Some ideas:
- path length scaled by the depth (Leacock and Chodorow, 1998)

With a little cheating:
- Measure the "information content" of a word using a corpus: how specific is a word?
  - words higher up tend to have less information content
  - more frequent words (and ancestors of more frequent words) tend to have less information content

Utilizing information content:
- information content of the lowest common parent (Resnik, 1995)
- information content of the words minus information content of the lowest common parent (Jiang and Conrath, 1997)
- information content of the lowest common parent divided by the information content of the words (Lin, 1998)
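The information-content measures can be sketched with hypothetical corpus probabilities (the numbers below are invented for illustration). Lin (1998) compares the information content of the lowest common parent to that of the two words:

```python
import math

# Hypothetical corpus probabilities for the toy hierarchy -- illustrative only.
p = {"animal": 0.5, "mammal": 0.2, "dog": 0.05,
     "hunting dog": 0.01, "dachshund": 0.002, "terrier": 0.002}

def ic(word):
    """Information content: rarer, more specific words carry more information."""
    return -math.log(p[word])

def lin_sim(w1, w2, lcs):
    """Lin (1998): 2 * IC(lowest common parent) / (IC(w1) + IC(w2))."""
    return 2 * ic(lcs) / (ic(w1) + ic(w2))
```

A more specific (higher-IC) common parent yields a higher similarity, which is why dachshund/terrier (meeting at "hunting dog") scores above word pairs that only meet at "mammal".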


Word Similarity

Four general categories:
- Character-based
  - turned vs. truned
  - cognates (night, nacht, nicht, natt, nat, noc, noch)
- Semantic web-based (e.g. WordNet)
- Dictionary-based
- Distributional similarity-based
  - similar words occur in similar contexts

Dictionary-Based Similarity

Word: aardvark
Dictionary blurb: a large, nocturnal, burrowing mammal, Orycteropus afer, of central and southern Africa, feeding on ants and termites and having a long, extensile tongue, strong claws, and long ears.

Word: beagle
Dictionary blurb: one of a breed of small hounds having long ears, short legs, and a usually black, tan, and white coat.

Word: dog
Dictionary blurb: any carnivore of the family Canidae, having prominent canine teeth and, in the wild state, a long and slender muzzle, a deep-chested muscular body, a bushy tail, and large, erect ears. Compare canid.

Dictionary-Based Similarity

Utilize our text similarity measures:

sim(dog, beagle) =
  sim("One of a breed of small hounds having long ears, short legs, and a usually black, tan, and white coat.",
      "Any carnivore of the family Canidae, having prominent canine teeth and, in the wild state, a long and slender muzzle, a deep-chested muscular body, a bushy tail, and large, erect ears. Compare canid.")

What about words that have multiple senses/parts of speech?


Dictionary-Based Similarity

Options for handling multiple senses:
1. part of speech tagging
2. word sense disambiguation
3. most frequent sense
4. average similarity between all senses
5. max similarity between all senses
6. sum of similarity between all senses

Dictionary + WordNet

WordNet also includes a "gloss" similar to a dictionary definition.

Other variants include the overlap of the word senses as well as those word senses that are related (e.g. hypernym, hyponym, etc.):
- incorporates some of the path information as well
- Banerjee and Pedersen, 2003
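A gloss-overlap score in the spirit of Banerjee and Pedersen (2003) can be sketched as counting the content words two definitions share. The tiny stop-word list here is a stand-in of my own, not theirs, and real Lesk-style measures also weight multi-word overlaps more heavily.

```python
# A small stand-in stop-word list (illustrative, not from the original work).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "with", "having", "usually"}

def gloss_overlap(gloss1, gloss2):
    """Count the content words shared by two dictionary glosses."""
    tokens1 = {w.strip(".,") for w in gloss1.lower().split()} - STOPWORDS
    tokens2 = {w.strip(".,") for w in gloss2.lower().split()} - STOPWORDS
    return len(tokens1 & tokens2)
```

Any of the earlier text-similarity measures (cosine over TF-IDF vectors, for instance) could replace the raw overlap count here.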

Word Similarity

Four general categories:
- Character-based
  - turned vs. truned
  - cognates (night, nacht, nicht, natt, nat, noc, noch)
- Semantic web-based (e.g. WordNet)
- Dictionary-based
- Distributional similarity-based
  - similar words occur in similar contexts

Corpus-Based Approaches

Instead of a dictionary blurb, use ANY blurb that contains the word (aardvark, beagle, dog, ...).

Ideas?


Corpus-Based

The Beagle is a breed of small to medium-sized dog. A member of the Hound Group, it is similar in appearance to the Foxhound but smaller, with shorter legs ...

Beagles are intelligent, and are popular as pets because of their size, even temper, and lack of inherited health problems.

Dogs of similar size and purpose to the modern Beagle can be traced in Ancient Greece[2] back to around the 5th century BC.

From medieval times, beagle was used as a generic description for the smaller hounds, though these differed considerably from the modern breed.

In the 1840s, a standard Beagle type was beginning to develop: the distinction between the North Country Beagle and Southern ...

Corpus-Based: Feature Extraction

We'd like to utilize our vector-based approach.

How could we create a vector from these occurrences?
- collect word counts from all documents with the word in it
- collect word counts from all sentences with the word in it
- collect word counts from all words within X words of the word
- collect word counts from words in a specific relationship: subject-object, etc.

Word-Context Co-occurrence Vectors

  The Beagle is a breed ...
  Beagles are intelligent, and ...
  ... to the modern Beagle can be traced ...
  From medieval times, beagle was used as ...
  ... 1840s, a standard Beagle type was beginning ...

  the: 2
  is: 1
  a: 2
  breed: 1
  are: 1
  intelligent: 1
  and: 1
  to: 1
  modern: 1
  ...

Often do some preprocessing like lowercasing and removing stop words.
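The window-based variant (count words within X words of the target) might look like the following sketch; the two example sentences are invented, and real pipelines would add the lowercasing and stop-word removal mentioned above.

```python
from collections import Counter

def context_vector(target, sentences, window=2):
    """Count words occurring within `window` positions of `target`."""
    counts = Counter()
    for sent in sentences:
        words = sent.lower().split()
        for i, w in enumerate(words):
            if w == target:
                lo = max(0, i - window)
                context = words[lo:i] + words[i + 1:i + 1 + window]
                counts.update(context)
    return counts

# Invented example occurrences of "beagle"
sents = ["The beagle is a breed of small hound",
         "A beagle is a popular pet"]
v = context_vector("beagle", sents)
```

The resulting `Counter` is exactly the sparse word-context co-occurrence vector from the slide, summed over all occurrences of the target word.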


Corpus-Based Similarity

sim(dog, beagle) =
  sim(context_vector(dog), context_vector(beagle))

  dog              beagle
  the: 5           the: 2
  is: 1            is: 1
  a: 4             a: 2
  breeds: 2        breed: 1
  are: 1           are: 1
  intelligent: 5   intelligent: 1
  ...              and: 1
                   to: 1
                   modern: 1
                   ...

Web-Based Similarity

Ideas?
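Since context vectors are sparse, a dictionary-based cosine is convenient. The toy dog/beagle counts below mirror the ones on the slide; the function name is my own.

```python
import math

def cosine_sparse(v1, v2):
    """Cosine similarity between two sparse word-count dictionaries."""
    dot = sum(c * v2.get(w, 0) for w, c in v1.items())
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Context counts from the slide (abbreviated)
dog    = {"the": 5, "is": 1, "a": 4, "breeds": 2, "are": 1, "intelligent": 5}
beagle = {"the": 2, "is": 1, "a": 2, "breed": 1, "are": 1, "intelligent": 1}

sim = cosine_sparse(dog, beagle)
```

Note that "breeds" and "breed" contribute nothing to the dot product, an example of the morphology problem that stemming or lemmatization in preprocessing would address.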

Web-Based Similarity

[search engine results for "beagle"]

- Concatenate the web page text for the top N results, or
- Concatenate the snippets for the top N results


Another Feature Weighting

TF-IDF weighting takes into account the general importance of a feature.

For distributional similarity, we have the feature (fi), but we also have the word itself (w) that we can use for information:

  sim(context_vector(dog), context_vector(beagle))

  dog              beagle
  the: 5           the: 2
  is: 1            is: 1
  a: 4             a: 2
  breeds: 2        breed: 1
  are: 1           are: 1
  intelligent: 5   intelligent: 1
  ...              and: 1
                   to: 1
                   modern: 1
                   ...

Feature weighting ideas given this additional information?

Another Feature Weighting

Count how likely feature fi and word w are to occur together:
- incorporates co-occurrence
- but also incorporates how often w and fi occur in other instances

Does IDF capture this?

Not really. IDF only accounts for fi regardless of w.

Mutual Information

A bit more probability :)

  I(X,Y) = Σ_x Σ_y p(x,y) log [ p(x,y) / (p(x) p(y)) ]

When will this be high and when will this be low?


Mutual Information

  I(X,Y) = Σ_x Σ_y p(x,y) log [ p(x,y) / (p(x) p(y)) ]

If x and y are independent (i.e. one occurring doesn't impact the other occurring), then:

  p(x,y) = p(x) p(y)

What does this do to the sum? Every log term becomes log 1 = 0, so I(X,Y) = 0.

Mutual Information

  I(X,Y) = Σ_x Σ_y p(x,y) log [ p(x,y) / (p(x) p(y)) ]

If x and y are dependent, then:

  p(x,y) = p(x) p(y|x) = p(y) p(x|y)

so the sum can be rewritten as:

  I(X,Y) = Σ_x Σ_y p(x,y) log [ p(y|x) / p(y) ]

What is this asking? When is this high?

How much more likely are we to see y given x has a particular value!


Point-wise Mutual Information

Mutual information is often used for feature selection in many problem areas:

  I(X,Y) = Σ_x Σ_y p(x,y) log [ p(x,y) / (p(x) p(y)) ]

How related are two variables (i.e. over all possible values/events)?

Point-wise mutual information:

  PMI(x,y) = log [ p(x,y) / (p(x) p(y)) ]

How related are two particular events/values?

PMI Weighting

PMI weighting weights co-occurrences based on their correlation (i.e. PMI):

  context_vector(beagle)
  the: 2          →  log [ p(beagle, the) / (p(beagle) p(the)) ]
  is: 1
  a: 2
  breed: 1        →  log [ p(beagle, breed) / (p(beagle) p(breed)) ]
  are: 1
  intelligent: 1
  and: 1
  to: 1
  modern: 1
  ...

How do we calculate these?
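PMI weighting from raw (word, feature) co-occurrence counts could be sketched as follows, estimating p(w), p(f), and p(w, f) by maximum likelihood from the counts (an assumption of this sketch; in practice smoothing or positive-PMI clipping is usually added):

```python
import math
from collections import Counter

def pmi_weight(pair_counts):
    """PMI(w, f) = log [ p(w, f) / (p(w) p(f)) ] from co-occurrence counts.

    pair_counts maps (word, feature) pairs to how often they co-occur.
    """
    total = sum(pair_counts.values())
    w_counts, f_counts = Counter(), Counter()
    for (w, f), c in pair_counts.items():
        w_counts[w] += c
        f_counts[f] += c
    pmi = {}
    for (w, f), c in pair_counts.items():
        p_wf = c / total               # joint probability
        p_w = w_counts[w] / total      # marginal of the word
        p_f = f_counts[f] / total      # marginal of the feature
        pmi[(w, f)] = math.log(p_wf / (p_w * p_f))
    return pmi
```

Independent pairs get weight near 0, while pairs that co-occur more often than chance get positive weight, which is exactly the correlation the slide describes.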
