DAY 7: MODELS: INFORMATION RETRIEVAL AND LEXICAL SEMANTICS

Mark Granroth-Wilding

INFORMATION RETRIEVAL (IR)

• Find documents in large collection
• Match information need
• Relies on NLP methods already seen
• Big field: small taster here
• Link to today's general topic: vector representations

INFORMATION RETRIEVAL SYSTEM

• Large document collection: digital text, possibly structured (chapters, paragraphs); e.g. web pages, news articles, ...
• Information need → search query → IR system → relevant documents

IR TERMS

• Information need: user's desire to locate information to satisfy a need; not (explicitly) stated; may be unconscious
• Query: attempt to convey information need to system
• Relevant document: considered by user to have value with regard to information need


IR SYSTEM

• Diagram: information need → search query (input string) → IR system, over a document collection → result (documents); are the results relevant?

IR: WEB SEARCH

• Document collection: the World Wide Web
• Information need → search query → result list


IR: SMALLER SCALE

• Document collection: personal document collection
• Information need → search query → IR system → result list

NAIVE SEARCH

• Naive method: return all docs containing query terms
• Build index of terms in docs

BUILDING AN INDEX

• Example document terms: Ladies, gentlemen, captain, speaking, words, spoken, human, pilot
• Index each term
• Depends on: tokenization


BUILDING AN INDEX

• Lots of NLP can help: text normalization (saw methods on day 3)
• Truecasing
• Lemmatization
• Punctuation
• Diacritics
• Compound splitting
• Multiword expressions
• Named entity recognition

NORMALIZATION: LEMMATIZATION

• Map variant forms to base form
• am, are, is → be
• robot, robots, robot's → robot
• "Ladies and gentlemen, this is your captain speaking." → lady and gentleman this be you captain speak
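A minimal sketch of this kind of lemmatization, assuming the spaCy library and its small English model are installed (NLTK's WordNetLemmatizer is a common alternative); the exact lemmas produced depend on the tool:

```python
# Lemmatize a sentence for index normalization using spaCy (assumed installed,
# with the "en_core_web_sm" model downloaded).
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("Ladies and gentlemen, this is your captain speaking.")
lemmas = [token.lemma_.lower() for token in doc if not token.is_punct]
print(lemmas)
# Roughly: ['lady', 'and', 'gentleman', 'this', 'be', 'your', 'captain', 'speak']
```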


DOCUMENT-TERM MATRIX

          Ladies  gentlemen  captain  speaking  words  spoken  human  pilot | query match
doc0         1        0         1        0        0      1       0      1  |    2
doc1         1        0         0        1        0      0       1      0  |    0
doc2         0        1         0        1        1      1       0      0  |    0
doc3         1        1         0        0        1      0       0      1  |    1
Query:       0        0         1        0        0      0       0      1    (captain pilot)

• Very high-dimensional
• Sparse: mostly 0s
• Proximity of query vector to each document row: use to rank docs
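A minimal sketch of the toy matrix above in code, using only numpy; the documents, terms and query are the slide's example, and the dot product gives the ranking scores:

```python
# Build the toy document-term matrix and score a query by dot product.
import numpy as np

terms = ["ladies", "gentlemen", "captain", "speaking",
         "words", "spoken", "human", "pilot"]
docterm = np.array([
    [1, 0, 1, 0, 0, 1, 0, 1],   # doc0
    [1, 0, 0, 1, 0, 0, 1, 0],   # doc1
    [0, 1, 0, 1, 1, 1, 0, 0],   # doc2
    [1, 1, 0, 0, 1, 0, 0, 1],   # doc3
])
query = np.array([0, 0, 1, 0, 0, 0, 0, 1])   # "captain pilot"

scores = docterm @ query        # dot product per document
ranking = np.argsort(-scores)   # best-matching documents first
print(scores)    # [2 0 0 1]
print(ranking)   # doc0 first
```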


RANKING DOCUMENTS

Query: captain pilot → 0 0 1 0 0 0 0 1

          Ladies  gentlemen  captain  speaking  words  spoken  human  pilot | score
doc0         5        0         7        0        0      2       0      6  |   13
doc1         1        0         1        4        0      0       5      0  |    1
doc2         8        0         6        7        3      0       1      1  |    7
doc3         3        3         0        0        4      0       0      9  |    9

• Term counts in matrix
• Doc with lots about captains and pilots more relevant (and not much about other things)

BAG OF WORDS

Bag (= multiset): set with repeated elements

• Implicitly made an assumption: order of words unimportant
• Also, absolute position in document unimportant
• Identical representations:
  Alice, who saw the man, pushed Bob
  Bob pushed the man, who saw Alice
• Bag of words model
• Obvious limitations: word order is important! Document position might be too
• Used for many things: we'll see this again

RANKING DOCUMENTS

Query: captain pilot

• Dot product sums counts of relevant terms
• Result: relevance per doc
• Normalize per doc: account for doc length
• (Slide shows the doc-term matrix with rows normalized to per-doc proportions, multiplied by the query vector to give a relevance score per doc)

RELATIVE IMPORTANCE OF WORDS

• Some words more important within doc: e.g. repeated often
• Some less important even if frequent: e.g. frequent across corpus
• Idea: weight words using stats

WEIGHTING WORDS: TF

• Term frequency, tf_t,d: count of term t in doc d
• Search for robot: doc with higher tf_robot,d more relevant
• But relevance is not proportional to frequency:
  tf_robot,0 = 10, tf_robot,1 = 1: doc 0 is not 10x more relevant than doc 1
• Instead use: 1 + log(tf_t,d)


WEIGHTING WORDS: IDF

• Rare terms typically more informative: autopilot > technology > good
• good tells us little about doc (but still something)
• autopilot tells us more
• Document frequency, df_t: number of documents term t occurs in
• Inverse document frequency: idf_t = log(N / df_t), where N is the size of the collection
• Higher idf_t → term t more informative

WEIGHTING WORDS: TF-IDF

• Combine tf and idf to weight terms
• Frequent in doc → higher
• But frequent in other docs → lower
• Commonly used method: tf-idf
• tf-idf_t,d = (1 + log(tf_t,d)) × log(N / df_t)
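A minimal sketch of this weighting scheme, computed with numpy on a small toy count matrix (the counts are illustrative, not the slide's exact collection, and the log base is just a convention):

```python
# tf-idf weighting: (1 + log tf) * log(N / df) applied to a count matrix.
import numpy as np

counts = np.array([            # raw term counts, docs x terms
    [5, 0, 7, 0, 0, 2, 0, 6],
    [1, 0, 1, 4, 0, 0, 5, 0],
    [8, 0, 6, 7, 3, 0, 1, 1],
    [3, 3, 0, 0, 4, 0, 0, 9],
], dtype=float)

N = counts.shape[0]                  # collection size
df = (counts > 0).sum(axis=0)        # document frequency per term
idf = np.log(N / df)                 # inverse document frequency

tf = np.zeros_like(counts)           # sublinear term frequency
nonzero = counts > 0
tf[nonzero] = 1 + np.log(counts[nonzero])

tfidf = tf * idf                     # weight each cell
print(np.round(tfidf, 2))
```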


TF-IDF WEIGHTING

Query: captain pilot → 0 0 1 0 0 0 0 1

          Ladies  gentlemen  captain  speaking  words  spoken  human  pilot | score
doc0       0.89      0         2.02     0         0     0.45     0     2.31 | 4.34
doc1       0.52      0         1.10    0.64       0      0      1.78    0   | 1.10
doc2        0       0.87        0      0.71      0.84   0.51     0     1.30 | 1.30
doc3       0.77     0.67        0       0        0.73    0       0     2.54 | 2.54

• Use tf-idf scores instead of word counts
• Compare query vector to document vectors as before

DOCUMENTS AS VECTORS

• Method treats documents as vectors (in a matrix)
• Query also a vector
• Vector space model (VSM)
• Useful way to formulate search
• Generic representation: apply different normalizations, comparisons, weightings, ...

VECTOR SPACES

• Documents are data points in an N-dimensional feature space (one dimension per term); the query is a point in the same space
• Representation method extremely general
• Commonly applied to represent many things
• We'll see several more today

COMPARING VECTORS

• Compare two docs to query
• Imagine we append doc0 to itself: doc0 doc0
• Content hasn't changed: should match same query
• But Euclidean distance to the query has increased
• Instead compare angles: the angle compares relative values on dimensions


COSINE SIMILARITY

• cos of angle between vectors, e.g. cos(doc0, query) = 0.97 (13°), cos(doc1, query) = 0.87 (29°)
• cos(doc0, doc1) = (doc0 · doc1) / (|doc0| |doc1|)
                  = Σ_i doc0_i doc1_i / ( sqrt(Σ_i doc0_i²) · sqrt(Σ_i doc1_i²) )
• Commonly used measure of vector similarity

VECTORS IN NEURAL NETWORKS (aside)

• Feed-forward NNs: all representations are vectors
• Node contains numeric value
• 'Layer' of nodes: vector
• Latent representations are vectors
• Can sometimes inspect and compare in this way
• Sometimes capture these intuitions about e.g. similarity
• Example later: word embeddings
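A minimal sketch of the cosine formula above in numpy, also showing that repeating a document's content (scaling its vector) leaves the cosine unchanged:

```python
# Cosine similarity: dot product divided by the product of vector norms.
import numpy as np

def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

doc0 = np.array([0.89, 0, 2.02, 0, 0, 0.45, 0, 2.31])
query = np.array([0.0, 0, 1, 0, 0, 0, 0, 1])

print(cosine(doc0, query))       # similarity of doc0 to the query
print(cosine(2 * doc0, query))   # "doc0 appended to itself": same cosine
```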


COMPARING DOCUMENTS

• So far: compared query to doc (tf-idf vectors)
• Doc vector captures aspects of doc's meaning or content
• Similar vectors → similar content
• Compare docs using same methods
• E.g. nearest neighbours to find related docs

COMPARING VECTORS

• Same idea applied to comparing other entities
• Represent entity as vector: encode features numerically
• Compare vectors to find similar entities
• E.g. product similarity on Amazon
• Collaborative filtering: uses techniques seen today, and those coming up
• Find similar entities: products, films, ...


SPARSE VECTORS

• Vectors seen up to now: sparse (features N very large, mostly zeroes)
• Problem: similar docs may contain related words but have few exact words in common
• friend, friends, buddy
• Want to capture more abstract features: a single dimension for these

SPARSE VECTORS

• One possibility: exploit correlations between features
• chair and table often in same doc
• Doc uses chair, but not table: probably still quite table-chairy?
• Dimensionality reduction removes redundancy in columns
• Exploits these correlations: correlated dims collapsed

DIMENSIONALITY REDUCTION

• Sparse matrix → dense matrix
• Fewer columns (features)
• Original dims mapped → mixture of new dims
• Apply to doc-term matrix
• Standard methods: e.g. SVD
• Each doc now has dense vector
• Might lose some important distinctions (chair vs table)
• Often worth it for generalization

TOPIC MODELLING

• Change of subject for a bit... but it will lead back to VSMs soon!
• Aim: find themes ('topics') in large collection of documents
• Use for searching / exploring collection
• Large corpus: don't know beforehand what topics covered
• Take unsupervised approach using a type of clustering: topic modelling
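A minimal sketch of the SVD-based reduction mentioned in the dimensionality-reduction slide above, using scikit-learn's TruncatedSVD on a small tf-idf matrix (this is essentially LSA); the matrix values are the earlier toy example:

```python
# Reduce a sparse docs x terms matrix to a dense docs x N' matrix with SVD.
import numpy as np
from sklearn.decomposition import TruncatedSVD

tfidf = np.array([
    [0.89, 0.00, 2.02, 0.00, 0.00, 0.45, 0.00, 2.31],
    [0.52, 0.00, 1.10, 0.64, 0.00, 0.00, 1.78, 0.00],
    [0.00, 0.87, 0.00, 0.71, 0.84, 0.51, 0.00, 1.30],
    [0.77, 0.67, 0.00, 0.00, 0.73, 0.00, 0.00, 2.54],
])

svd = TruncatedSVD(n_components=2)      # N' = 2 dense dimensions
dense_docs = svd.fit_transform(tfidf)   # docs x N' matrix
print(dense_docs.shape)                 # (4, 2)
```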


TOPIC MODELLING

• One model dominates the field: Latent Dirichlet Allocation (LDA)
• Bayesian modelling: based on generative probability distribution
• Simplifying assumptions:
• Each document generated by simple process
• Fixed number of topics in corpus
• Bag of words model
• Document covers a few topics
• Topic associated (mainly) with small number of words
• Words that often appear in same doc belong to same topic

Reminder: bag of words

• Order of words unimportant
• Absolute position in document unimportant
• Identical representations:
  Alice, who saw the man, pushed Bob
  Bob pushed the man, who saw Alice
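A minimal sketch of fitting LDA, assuming scikit-learn and a small list of toy documents (gensim's LdaModel is the other common choice); real use needs a large corpus:

```python
# Fit LDA on bag-of-words counts and get each document's topic mixture.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the pilot spoke to the passengers",
        "the captain and crew welcomed us",
        "gold mining company reports profits",
        "the company acquired a gold mine"]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)            # bag-of-words counts

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)             # per-doc topic mixture

print(doc_topics)   # each row is a distribution over the 2 topics
```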


LDA INTUITION

• Examine large corpus
• Find set of 'topics' that efficiently covers all docs
• Words that co-occur often in same topic
• Each doc described by small number of distinct topics

LDA: ANALYSING DOCUMENTS

• New doc: find small set of topics that accounts for words
• Variety of Bayesian estimation techniques to find coherent set
• Look at which topics get used
• Infer doc's topic mixture (distribution)
• Vector representation of document
• One doc can have multiple topics

EXERCISE: TOPIC MODELLING (in small groups, 2-4)

• Download zip containing news articles
• Look quickly at all: don't read everything; glance at headline, keywords, etc.
• Come up with:
• a small set of broad topics (~3) to categorize
• a larger set of specific topics covered


VECTOR SPACES

• Documents as points in a reduced space of N' dimensions, N' << N
• Earlier: doc vectors made up of raw features (word counts, tf-idf)
• Dimensionality reduction: dense vecs, generalization

DOCUMENT TOPIC VECTORS

• Now: document-topic vectors (documents × N' topics)
• Topic modelling: a type of dimensionality reduction
• Topic dims capture slightly different generalization


TOPIC VECTORS

• As before, use vectors to compare docs
• Similar techniques again: doc-doc similarity, nearest neighbours, search
• Vector is a probability dist
• More appropriate metric than cosine: Jensen-Shannon divergence (JSD), or other metrics for prob dists
• More details of LDA in Advanced stat models, tomorrow...

LEXICAL SEMANTICS

• Another change of subject: word meanings
• But it will lead back to VSMs again!
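A minimal sketch of comparing document-topic distributions with the Jensen-Shannon measure, using scipy; note that scipy's jensenshannon returns the JS distance (the square root of the divergence), and the topic mixtures here are made up for illustration:

```python
# Compare topic distributions with the Jensen-Shannon distance.
import numpy as np
from scipy.spatial.distance import jensenshannon

doc0_topics = np.array([0.70, 0.20, 0.05, 0.05])   # illustrative mixtures
doc1_topics = np.array([0.60, 0.25, 0.10, 0.05])
doc2_topics = np.array([0.05, 0.05, 0.30, 0.60])

print(jensenshannon(doc0_topics, doc1_topics))  # small: similar topic mix
print(jensenshannon(doc0_topics, doc2_topics))  # larger: different topics
```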


LEXICAL SEMANTICS

• Question: What is a good way to remove wine stains?
• Text available: "Salt is a great way to eliminate wine stains"
• Different words may have similar/related meanings
• Relationships between words can be highly complex
• Essential for using information from text
• Difficult for broad-domain systems

LEXICAL SEMANTICS

• Representing word meaning
• Some difficult issues:
• Representation
• Acquiring broad-domain knowledge
• Polysemy: same word, multiple meanings
• Multiword expressions

THESAURUS

• Dictionary of synonyms and closely related words
• WordNet: large machine-readable thesaurus (and more)
• Complex network of relationships between words

SEMANTIC RELATIONS: HYPONYMY

• Relationship between meanings of words
• Hyponymy: Is-A relation
• dog is-a animal
• dog hyponym of animal; animal hypernym of dog
• Forms a taxonomy
• Mainly works for concrete nouns: democracy, thought, sticky?


SEMANTIC RELATIONS: MERONYMY

• Another relationship: part-whole
• air bag part-of car
• air bag meronym of car

WORDNET: SYNSETS

• Synset: 'synonym set' ≈ sense
• Unit of organization in WordNet
• Synonyms grouped into synset: car.n.01 = {car, auto, automobile}
• Polysemous words belong to multiple synsets: car.n.04 = {car, elevator car}
• Connected by semantic relations: hyponymy, meronymy, ...
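A minimal sketch of querying WordNet through NLTK (requires the wordnet data to be downloaded once); the printed synsets and relations are examples of what the lookup returns:

```python
# Look up synsets, hypernyms (is-a) and part meronyms (part-whole) in WordNet.
from nltk.corpus import wordnet as wn

for synset in wn.synsets("car"):
    print(synset.name(), synset.lemma_names())
# e.g. car.n.01 ['car', 'auto', 'automobile', ...]

car = wn.synset("car.n.01")
print(car.hypernyms())       # is-a parents, e.g. motor_vehicle.n.01
print(car.part_meronyms())   # parts, e.g. air_bag.n.01
```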


WORDNET

• https://wordnet.princeton.edu/
• Hand-built ontology
• 117k synsets
• Primarily for English
• Global WordNet: lists WN-like resources in different languages; currently 77; varying quality, size, licenses
• Open Multilingual WordNet: linked/merged WNs for different languages; currently 34

PROBLEMS WITH WORDNET(S)

• Mainly English: limited other languages
• Hand-built: large, but still limited coverage
• Many multiword expressions, but far from complete: take a break, pay attention
• Language changes: neologisms: greenwashing, Brexiteer, barista
• Names: Mikkeli, Microsoft
• Fine-grained senses


WORD SENSES

Corpus examples of "party":
  thing. She was talking at a party thrown at Daphne's restaurant in
  turned it into the hot dinner-party topic. The comedy is the
  the 1983 general election for a party which was uncertain of its policies
  to attack the Scottish National Party, who look set to seize Perth and
  had been passed to a second party who made a financial decision
  by-pass there will be a street party. "Then," he says, "we are going
  political tradition and the same party. They are both Anglophilic
  he told Tony Blair's modernised party they must not retreat into "warm
  "Oh no, I'm just here for the party," they said. "I think it's terrible
  A future obliges each party to the contract to fulfil it by
  signed by or on behalf of each party to the contract." Mr David N

• How many senses? Difficult decision
• WordNet is quite fine-grained

TRANSLATION SENSES

• WordNet senses defined by hand
• Another way to define senses: different translations → evidence for two senses
• E.g. interest ⇒ German:
• Zins: charge for a loan
• Anteil: stake in a company
• Interesse: other senses
• Senses of German word may be distinguished in English, or in other languages

WORD SENSE DISAMBIGUATION

  thing. She was talking at a party_0 thrown at Daphne's restaurant in
  the 1983 general election for a party_1 which was uncertain of its policies

• Which senses are used here?
• Often useful to distinguish
• Search: only interested in one sense
• Topic model: different senses in different topics
• Etc...


CLUES TO WORD SENSE

  thing. She was talking at a party thrown at Daphne's restaurant in   (clues: beer, invite, host)
  the 1983 general election for a party which was uncertain of its policies   (clues: political, minister, manifesto)

• Some clues that might help:
• Surrounding words
• Connected words (grammatically, same doc, ...)
• Document topic
• (Disambiguated) sense of same word used in same doc
• Most are also clues to word meaning: see later!

WORD SENSE DISAMBIGUATION (WSD)

• Have predefined senses for words, e.g. from WordNet
• Word sense disambiguation (WSD) becomes a supervised classification task
• Features: context words, other clues seen
• Classes: known senses for word
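A minimal, knowledge-based alternative to the supervised setup above: NLTK's simplified Lesk algorithm picks the WordNet sense whose gloss overlaps most with the context words; the example sentence here is made up:

```python
# Simplified Lesk WSD: choose the sense whose definition best matches context.
from nltk.wsd import lesk

context = "the party won seats in the general election".split()
sense = lesk(context, "party", pos="n")
print(sense, "-", sense.definition() if sense else "no sense found")
```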


WORD MEANING

• WordNet: attempt to encode meaning of words (senses) in machine-usable form
• Can discover/represent word meaning in other ways
• Statistical approach: distributional semantics
• Learn meaning from word's use in text
• No reliance on annotated resources
• Broad coverage
• Apply to any language, domain, ...

MEANING FROM CONTEXT

• A bottle of tezgüino is on the table
• Everybody likes tezgüino
• Tezgüino makes you drunk
• We make tezgüino out of corn
• What is tezgüino? You can probably work it out


WORD CONTEXT

• Can infer meaning of word from contexts it typically occurs in
• Measure word similarity by context similarity
• What is a word's context?
• How to measure typical contexts?
• How to compare typical contexts?
• Lots of possible answers; some capture different aspects of word meaning
• A couple of examples here: (1) context = nearby words; (2) context = syntactically related words

DISTRIBUTIONAL HYPOTHESIS

• Is word meaning anything more than the contexts it occurs in?
• Distributional hypothesis: words that tend to appear in similar contexts have similar meanings
• "You shall know a word by the company it keeps!" – Firth (1957)

COUNTING CONTEXTS

• Context = nearby words
• Example (news text), contexts of company: "...53 billion in 1985 the company said it had been able...", "...be repaid by the mining company in gold..."
• Most frequent context words for company: said 2306, dlrs 573, mln 539, pct 397, stock 306, year 289, 1986 276, would 272, shares 261, share 257

COUNTING CONTEXTS

• Context = nearby words, ignoring stopwords
• Example, contexts of corporation: "...iraq commodity credit corporation accepted bid export bonus...", "...adjusting commodity credit corporation ccc discount premium schedules..."
• Most frequent context words for corporation: credit 59, commodity 57, ccc 49, tax 30, mln 28, said 28, accepted 21, bonus 20, state 20, pct 19


WORD VECTORS

• Context word counts → features to characterize word
• Construct a vector for each word (words × context features matrix)
• Look familiar?
• Can compare words as we compared documents: vector similarity
• Word embedded in vector space: also called word embeddings

COLLECTING CONTEXT COUNTS

• Window: small number of words (5, 10?), large number (100), or whole document?
• Exclude stopwords?
• Weight words? Closer → higher?
• More than counts?
• said was common for company, but also for most other words
• Frequency relative to expectation
• One way: mutual information (e.g. PPMI)
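A minimal sketch of collecting window-based context counts and reweighting them with PPMI, in plain numpy on a made-up toy corpus:

```python
# Count context words in a fixed window, then reweight with positive PMI.
import numpy as np
from collections import Counter, defaultdict

corpus = [
    "the company said profits rose".split(),
    "the corporation said it accepted the bid".split(),
    "the mining company repaid the loan in gold".split(),
]

window = 2
counts = defaultdict(Counter)            # word -> Counter of context words
for sent in corpus:
    for i, word in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                counts[word][sent[j]] += 1

vocab = sorted(counts)
M = np.array([[counts[w][c] for c in vocab] for w in vocab], dtype=float)

# PPMI: log of observed co-occurrence over what independence would predict,
# with negative values clipped to zero.
total = M.sum()
p_wc = M / total
p_w = M.sum(axis=1, keepdims=True) / total
p_c = M.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore"):
    pmi = np.log(p_wc / (p_w * p_c))
ppmi = np.maximum(pmi, 0)

print(vocab)
print(np.round(ppmi, 2))
```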


SPARSE VECTORS

• Count vectors are sparse: lots of 0s
• Apply dimensionality reduction
• Generalize sparse dimensions with dimensionality reduction

COMPARING WORD VECTORS

Top context words:
  company:      said 2306, dlrs 573, mln 539, pct 397, stock 306, year 289, 1986 276, would 272, shares 261, share 257
  corporation:  credit 59, commodity 57, ccc 49, tax 30, mln 28, said 28, accepted 21, bonus 20, state 20, pct 19

• Top words seem somewhat characteristic
• But not much actual overlap
• Compare vectors using cosine similarity
• Similar contexts → similar features → similar vectors → (?) similar meaning


VECTOR ≠ MEANING

• Distributional word vectors capture some aspects of meaning
• Exactly what depends on: features, weighting, dimensionality reduction, comparison metric, ...
• "The meaning of a word is given by its usage"?
• Vectors also depend on factors in the corpus: culture, linguistic register, document types

PROBLEMS WITH COUNT-BASED VECTORS

• Want very large vocabulary
• Oxford English Dictionary: 600k words
• More if separate vectors for eat, eating, ate, ...
• Matrix: words (V) × many context features
• Dimensionality reduction slow for many words
• Memory inefficient
• Hard to add new words

DON'T COUNT, PREDICT!

• Learn: word vectors + word-as-context vectors
• Random embedding initialization
• Iterate over huge corpus
• Update vectors iteratively to predict context
• Positive example (word observed with this context, label 1): pulls word vector and context vector closer
• Prediction with neural network: embedding = layer
• word2vec and many others since
• Example context: "South Korea is leaning towards introducing ..."
• Use word vectors, throw away word-as-context vectors


WORD2VEC

• Learn: word vectors + word-as-context vectors
• Random (negative) example, label 0: pushes word vector and context vector apart
• Use word vectors, throw away word-as-context vectors

WORD2VEC

• Training is fast: simple network
• Train on many examples: 100B words of news text (pre-trained vectors from the authors)
• Vectors for many words: 3M vectors
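A minimal sketch of training skip-gram word2vec with the gensim library on a toy corpus; useful vectors need a huge corpus (or the authors' pre-trained vectors), so this only illustrates the API:

```python
# Train a tiny skip-gram word2vec model and query the learned vectors.
from gensim.models import Word2Vec

sentences = [
    "the captain spoke to the passengers".split(),
    "the pilot welcomed the passengers aboard".split(),
    "the company reported strong profits".split(),
]

model = Word2Vec(sentences, vector_size=50, window=5, min_count=1,
                 sg=1, epochs=20)             # sg=1 selects skip-gram

vec = model.wv["captain"]                     # the learned word vector
print(model.wv.similarity("captain", "pilot"))
print(model.wv.most_similar("captain", topn=3))
```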


SUMMARY

• IR systems, terms
• Document-term matrices
• Bag of words
• Word importance: tf-idf
• Vector-space model (VSM), vector similarity
• Sparse → dense: dimensionality reduction

SUMMARY

• Topic modelling: LDA ← more details tomorrow
• Lexical semantics: WordNet
• Word sense disambiguation (WSD)
• Distributional semantics
• Word embeddings


READING MATERIAL

• Vector Semantics: J&M3 ch 6
• Lexical semantics: 6.1
• Embeddings: 6.2
• Document VSMs: 6.3.1
• Distributional word vectors: 6.3.2
• Similarity, cosine: 6.4
• Tf-idf: 6.5
• Word2vec, dense vectors: 6.8
• Distributional semantics, word embeddings: Eisenstein ch 14
• Nice blog on embeddings, word2vec: The Illustrated Word2vec

NEXT UP

After lunch: practical assignments in BK107

9:15 – 12:00   Lectures
12:00 – 13:15  Lunch
13:15 – 13:30  Introduction
13:30 – 16:00  Practical assignments

• Building word vectors
• Embeddings from different corpora
• Using word embeddings
