Multilingual NLP via Cross-Lingual Word Embeddings

Ivan Vulić, LTL, University of Cambridge & PolyAI

FoTran Workshop

Helsinki, September 28, 2018

Joint Work with...

...plus efforts of many other researchers...

A cold open...

Embeddings are Everywhere (Monolingual)

Intrinsic (semantic) tasks:

- Semantic similarity and synonym detection, word analogies
- Concept categorization
- Selectional preferences, language grounding

Extrinsic (downstream) tasks:

- Named entity recognition, part-of-speech tagging, chunking
- Dependency parsing, discourse parsing

Beyond NLP:

Information retrieval, Web search, dialogue systems

Embeddings are Everywhere (Cross-lingual)

Intrinsic (semantic) tasks:

- Cross-lingual and multilingual semantic similarity
- Bilingual lexicon learning, multi-modal representations

Extrinsic (downstream) tasks:

- Cross-lingual SRL, POS tagging, NER
- Cross-lingual dependency parsing and sentiment analysis
- Cross-lingual annotation and model transfer
- Statistical/Neural machine translation

Beyond NLP:

- (Cross-Lingual) Information retrieval
- (Cross-Lingual) Question answering

A slow intro...

Motivation

We want to understand and model the meaning of...

Source: dreamstime.com

...without manual/human input and without perfect MT

Motivation

The NLP community has developed useful features for several tasks but finding features that are...

1. task-invariant (POS tagging, SRL, NER, parsing, ...)

(monolingual word embeddings)

2. language-invariant (English, Dutch, Chinese, Spanish, ...)

(cross-lingual word embeddings → this talk)

...is non-trivial and time-consuming (20+ years of feature engineering...)


Learn word-level features which generalise across tasks and languages

The World Existed B.E. (Before Embeddings)

B.E. Example 1: Cross-lingual (Brown) clusters [Täckström et al., NAACL 2012; Faruqui and Dyer, ACL 2013]

B.E. Example 2a: Traditional count-based cross-lingual spaces... [Gaussier et al., ACL 2004; Laroche and Langlais, COLING 2010]


B.E. Example 2b: ...and bootstrapping from a limited bilingual signal [Peirsman and Padó, NAACL 2010; Vulić and Moens, EMNLP 2013]


B.E. Example 3: Cross-lingual topical spaces [Mimno et al., EMNLP 2009; Vulić et al., ACL 2011]

Motivation Revisited

Cross-lingual word embeddings:

- simple: quick and efficient to train
- state-of-the-art and omnipresent
- lightweight and cheap

"Cross-lingual models are to MT what (monolingual) word embedding models are to language modeling."

by Sebastian Ruder

Are they?

They support cross-lingual NLP: enabling cross-lingual modeling and cross-lingual transfer

Learning Word Representations

Key idea: Distributional hypothesis → words with similar meanings are likely to appear in similar contexts

[Harris, Word 1954]

Wittgenstein shouts: Meaning as use!

Firth calmly states: You shall know a word by the company it keeps.

Word Embeddings

Representation of each word w ∈ V:

vec(w) = [f_1, f_2, ..., f_dim]

Word representations in the same semantic (or embedding) space!

Image courtesy of [Gouws et al., ICML 2015]

Cross-Lingual Word Embeddings (CLWEs)

Representation of a word w_1^S ∈ V^S:

vec(w_1^S) = [f_1^1, f_2^1, ..., f_dim^1]

Exactly the same representation for w_2^T ∈ V^T:

vec(w_2^T) = [f_1^2, f_2^2, ..., f_dim^2]

Language-independent word representations in the same shared semantic (or embedding) space!

Cross-Lingual Word Embeddings

[Luong et al., NAACL 2015]

Monolingual vs. Cross-lingual

Q1 → Algorithm Design: How to align semantic spaces in two different languages?

Q2 → Data Requirements: Which bilingual signals are used for the alignment?

Data Requirements: Bilingual Signals

             Parallel        Comparable
Word         Dictionaries    Images
Sentence     Translations    Captions
Document     -               Wikipedia

Nature and alignment level of data sources required by CLWE models

Data requirements are more important for the final CLWE model performance than the actual underlying architecture [Levy et al., EACL 2017]

Data Requirements: Bilingual Signals

[Upadhyay et al., ACL 2016]

Illustration of different data requirements and bilingual signals

1. Word-level: word alignments and translation dictionaries
2. Sentence-level: sentence alignments
3. Document-level: Wikipedia/news articles
4. Other: visual alignment: image captions, eye-tracking data

5. Without any bilingual signal?

Welcome to the CLWE Jungle I

Word-level bilingual signal:

Parallel (dictionaries): Mikolov et al. (2013); Faruqui & Dyer (2014); Xing et al. (2015); Dinu et al. (2015); Lazaridou et al. (2015); Zhang et al. (2016); Zou et al. (2013); Ammar et al. (2016); Artetxe et al. (2016); Xiao & Guo (2014); Gouws and Søgaard (2015); Duong et al. (2016); Gardner et al. (2015); Smith et al. (2017); Mrkšić et al. (2017); Artetxe et al. (2017)

Comparable (images): Kiela et al. (2015); Vulić et al. (2016); Vulić et al. (2017); Zhang et al. (2017); Hauer et al. (2017)

Welcome to the CLWE Jungle II

Sentence-level bilingual signal:

Parallel (sentence alignments): Guo et al. (2015); Vyas and Carpuat (2016); Hermann and Blunsom (2013); Lauly et al. (2013); Kočiský et al. (2014); Chandar et al. (2014); Pham et al. (2015); Klementiev et al. (2012); Soyer et al. (2015); Luong et al. (2015); Gouws et al. (2015); Coulmance et al. (2015); Shi et al. (2015); Rajendran et al. (2016); Levy et al. (2017)

Comparable (captions): Calixto et al. (2017); Gella et al. (2017)

Document-level: Hermann and Blunsom (2014); Vulić & Korhonen (2016); Vulić and Moens (2016); Søgaard et al. (2015); Mogadala and Rettinger (2016)

General (Simplified) Methodology

(Bilingual) data sources are more important than the chosen algorithm

Most algorithms are formulated as: L_1(W) + L_2(V) + Ω(W, V)

Learning from Word-Level Information

1. Mapping-based or offline approaches: train monolingual word representations independently and then learn a transformation matrix from one language to the other

2. Pseudo-bilingual approaches: train a monolingual WE model on automatically constructed corpora containing words from both languages

3. Joint online approaches: jointly minimize the source and target language losses together with the cross-lingual regularization term

These approaches are equivalent, modulo optimization strategies

Learning from Word-Level Information: Example 1

The geometric relations that hold between words are similar across languages: e.g., numbers and animals in English show a similar geometric constellation as their Spanish counterparts

[Mikolov et al., arXiv 2013]

Learning from Word-Level Information: Example 1

Basic mapping approach

[Mikolov et al., arXiv 2013]

Learn to transform the pre-trained source language embeddings into a space where the distance between a word and its translation pair is minimised


Based on given word translation pairs (x_i^s, x_i^t), i = 1, ..., n

Mean-squared error between the transformed source vector and the actual target vector (Euclidean distance):

Ω_MSE = Σ_{i=1}^{n} ||W x_i^s − x_i^t||^2

In matrix form: Ω_MSE = ||W X^s − X^t||_F^2

Standard (re)formulation:

J = L^1_SGNS + L^2_SGNS + Ω_MSE

Bootstrapping from Small Seed Dictionaries

The latest instance of the bootstrapping idea [Artetxe et al., ACL 2017]

Self-learning iterative framework: starting from cognates or numerals
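The self-learning loop can be sketched in a few lines of numpy: learn a least-squares map from the current dictionary, re-induce a larger dictionary by nearest-neighbour search, and repeat. This is an illustrative toy on synthetic data, not the authors' implementation (which, e.g., constrains the map to be orthogonal); `learn_mapping` and the setup below are assumptions for the sketch.

```python
import numpy as np

def learn_mapping(X_src, X_tgt):
    """Least-squares map W minimising ||X_src W^T - X_tgt||_F^2."""
    W, *_ = np.linalg.lstsq(X_src, X_tgt, rcond=None)
    return W.T

def self_learning(E_src, E_tgt, seed_pairs, n_iter=5):
    """Alternate between (1) learning a map from the current dictionary
    and (2) re-inducing the dictionary by nearest-neighbour search."""
    pairs = list(seed_pairs)
    T = E_tgt / np.linalg.norm(E_tgt, axis=1, keepdims=True)
    for _ in range(n_iter):
        X = E_src[[i for i, _ in pairs]]
        Y = E_tgt[[j for _, j in pairs]]
        W = learn_mapping(X, Y)
        M = E_src @ W.T
        M = M / np.linalg.norm(M, axis=1, keepdims=True)
        sim = M @ T.T  # cosine similarities source -> target
        pairs = [(i, int(sim[i].argmax())) for i in range(len(E_src))]
    return W, pairs

# synthetic check: the target space is a hidden rotation of the source space
rng = np.random.default_rng(0)
E_src = rng.normal(size=(20, 4))
R, _ = np.linalg.qr(rng.normal(size=(4, 4)))
E_tgt = E_src @ R.T
W, pairs = self_learning(E_src, E_tgt, seed_pairs=[(i, i) for i in range(5)])
# starting from only 5 seed pairs, the induced dictionary recovers
# the true (identity) alignment and W recovers the hidden rotation
```

In this idealised setting one iteration already converges; with real, noisy embeddings the dictionary grows and improves gradually over iterations.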


Lexicon induction results from [Artetxe et al., ACL 2017]

Bootstrapping helps with really small seed dictionaries

Has a performance plateau been reached with simple mapping approaches?

Learning without any Bilingual Signal?

The first instance of the fully unsupervised idea [Conneau et al., ICLR 2018; Artetxe et al., ACL 2018]

Adversarial training: playing the two-way game between a generator (mapping) and a discriminator

Alignment at the level of distributions

Supporting unsupervised NMT [Lample et al., ICLR 2018] and unsupervised CLIR [Litschko et al., SIGIR 2018]

But is it really that great?

[Søgaard, Ruder, Vulić; ACL 2018]

Mapping Approach: Bilingual to Multilingual

Fixing one space (English) and learning transformations for all monolingual word embeddings in other languages

This constructs a multilingual space from input monolingual spaces

[Smith et al., ICLR 2017]:

A unified multilingual space for 89 languages using fastText vectors and 5k Google Translate pairs for each language pair
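The hub idea above can be sketched with synthetic spaces: fix one space and learn one least-squares map per language into it, after which any two languages are directly comparable. This is a toy illustration under made-up data, not the Smith et al. pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)

def learn_mapping(X_src, X_tgt):
    """Least-squares map into the fixed hub ("English") space."""
    W, *_ = np.linalg.lstsq(X_src, X_tgt, rcond=None)
    return W.T

# toy setup: a hub space and two languages related to it by hidden
# orthogonal maps (stand-ins for independently trained embeddings)
E_en = rng.normal(size=(50, 4))
A, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # hidden map, language 1
B, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # hidden map, language 2
E_l1 = E_en @ A.T
E_l2 = E_en @ B.T

# one transformation per language, learned from a small "dictionary"
# (here: the first 10 words stand in for the 5k translation pairs)
W1 = learn_mapping(E_l1[:10], E_en[:10])
W2 = learn_mapping(E_l2[:10], E_en[:10])

# all languages now live in one shared space:
# language 1 and language 2 can be compared directly via the hub
shared_l1 = E_l1 @ W1.T
shared_l2 = E_l2 @ W2.T
```

No direct l1-l2 dictionary was ever used; transitivity through the hub provides the alignment.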

Learning from Word-Level Information: Example 2

Pseudo-bilingual training

[Gouws and Søgaard, NAACL 2015] Barista: Bilingual Adaptive Reshuffling with Individual Stochastic Alternatives

1. Build a pseudo-bilingual corpus (or corrupt the original training data) using a set of predefined equivalences (e.g., POS tags, word translation pairs)

2. Train a monolingual WE model on the corrupted/shuffled corpus

Original: we build the house

POS: we build la voiture/ they run la house

Translations: we construire the maison/ nous build la house



1. Input: source corpus C_s, target corpus C_t, dictionary D

2. Concatenate C_s and C_t and then shuffle the concatenated corpus

3. Pass over the corpus; for each word w: if {w′ | (w, w′) ∈ D ∨ (w′, w) ∈ D} is non-empty and of cardinality k, replace w with one such w′ (each with probability 1/(2k)); keep w otherwise

4. Train a standard WE model on the pseudo-bilingual corpus C′
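Steps 2-3 above can be made concrete with a small runnable sketch (toy corpora; the shuffle here operates at sentence level, which is one possible reading of the recipe):

```python
import random

def barista_corpus(src_corpus, tgt_corpus, dictionary, seed=0):
    """Concatenate and shuffle both corpora, then replace each word w
    that has k dictionary equivalents with one of them with probability
    1/(2k) each, i.e. keep w with probability 1/2 when equivalents exist."""
    rng = random.Random(seed)
    sentences = list(src_corpus) + list(tgt_corpus)
    rng.shuffle(sentences)
    corpus = []
    for sent in sentences:
        new_sent = []
        for w in sent:
            equivalents = dictionary.get(w, [])
            if equivalents and rng.random() < 0.5:
                new_sent.append(rng.choice(equivalents))
            else:
                new_sent.append(w)
        corpus.append(new_sent)
    return corpus

# toy English/French corpora and a tiny symmetric dictionary
D = {"house": ["maison"], "maison": ["house"],
     "build": ["construire"], "construire": ["build"]}
pseudo = barista_corpus([["we", "build", "the", "house"]],
                        [["nous", "construire", "la", "maison"]], D)
```

Training any monolingual skip-gram model on `pseudo` then yields a shared space, because translations now occur in each other's contexts.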


The chosen equivalence class dictates the shape of the output space

Cross-lingual embedding space with POS equivalences

Learning from Word-Level Information: Example 3

Joint (Online) Training

Bilingual Skip-Gram (BiSkip) proposed by [Luong et al., NAACL 2015]

Very similar to the original mapping approach, but formulated as online monolingual and cross-lingual training on pseudo-bilingual corpora

J = L^1_SGNS + L^2_SGNS + Ω_{L1,L2}

Ω_{L1,L2} = L_SGNS^{L1→L2} + L_SGNS^{L2→L1}

Multilingual Skip-Gram (MultiSkip) by Ammar et al., NAACL 2016

Simple extension (again) → summing up bilingual objectives for all available parallel corpora

J = L^1_SGNS + L^2_SGNS + L^3_SGNS + Ω_{L1,L2} + Ω_{L2,L3} + Ω_{L1,L3}

Learning from Word-Level Information: Example 4

Post-processing/specialisation

State-of-the-art: Paragram and Attract-Repel [Wieting et al., TACL 2015; Mrkšić et al., TACL 2017]

If S is the set of synonymous word pairs, the procedure iterates over mini-batches of such constraints B_S, optimising the following cost function:

S(B_S) = Σ_{(x_l, x_r) ∈ B_S} [ ReLU(δ_sim + x_l · t_l − x_l · x_r) + ReLU(δ_sim + x_r · t_r − x_l · x_r) ]

where δ_sim is the similarity margin and t_l and t_r are negative examples for the given word pair (x_l, x_r).

CLWEs via Semantic Specialisation

Cross-lingual supervision: again word-level linguistic constraints (e.g., synonyms)

Constraint                         Relation   Source
(en_sweet, it_dolce)               SYN        BabelNet
(en_work, fr_travail)              SYN        BabelNet
(en_school, de_schule)             SYN        BabelNet
(fr_montagne, de_gebirge)          SYN        BabelNet
(sh_gradonačelnik, en_mayor)       SYN        BabelNet
(nl_vrouw, it_donna)               SYN        BabelNet
(en_sour, it_dolce)                ANT        BabelNet
(en_asleep, fr_éveillé)            ANT        BabelNet
(en_cheap, de_teuer)               ANT        BabelNet
(de_langsam, es_rápido)            ANT        BabelNet
(sh_obeshrabriti, en_encourage)    ANT        BabelNet
(fr_jour, nl_nacht)                ANT        BabelNet


Negative Examples for each Synonymy Pair

For each synonymy pair (x_l, x_r), the negative example pair (t_l, t_r) is chosen from the remaining in-batch vectors so that t_l is the one closest (by cosine similarity) to x_l and t_r is the one closest to x_r.
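The attract cost with in-batch negative selection can be sketched as follows. This is a toy numpy illustration of the cost above only; the function name is hypothetical, and the full Attract-Repel procedure additionally handles antonym (repel) constraints and regularisation.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def attract_cost(X_l, X_r, delta_sim=0.6):
    """Attract term over one mini-batch of synonym pairs.
    Row i of X_l and X_r holds the vectors of pair (x_l, x_r).
    Negatives t_l, t_r are the closest other in-batch vectors."""
    # unit-normalise so a dot product equals cosine similarity
    X_l = X_l / np.linalg.norm(X_l, axis=1, keepdims=True)
    X_r = X_r / np.linalg.norm(X_r, axis=1, keepdims=True)
    n, cost = len(X_l), 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        t_l = X_l[max(others, key=lambda j: float(X_l[i] @ X_l[j]))]
        t_r = X_r[max(others, key=lambda j: float(X_r[i] @ X_r[j]))]
        pos = float(X_l[i] @ X_r[i])
        # each pair should be more similar than its hardest negatives,
        # by at least the margin delta_sim
        cost += relu(delta_sim + X_l[i] @ t_l - pos)
        cost += relu(delta_sim + X_r[i] @ t_r - pos)
    return float(cost)

rng = np.random.default_rng(0)
loss = attract_cost(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
```

In the real procedure this cost is minimised by gradient descent, pulling cross-lingual synonyms together while pushing them away from their hardest in-batch negatives.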

Learning from Aligned Sentences: Example 1

Trans-gram model: very similar to BiSkip, but it relies on full sentential contexts

[Coulmance et al., EMNLP 2015]

J = L^1_SGNS + L^2_SGNS + Ω_{L1,L2}

Ω_{L1,L2} = L_SGNS^{L1→L2} + L_SGNS^{L2→L1}

Déjà vu, right?

Learning from Aligned Sentences: Example 2

BilBOWA model: very similar to the joint online models, but using sentence-level cross-lingual information [Gouws et al., ICML 2015]

Déjà vu, right?

J = L^1_SGNS + L^2_SGNS + Ω_BilBOWA

Ω_BilBOWA = || (1/m) Σ_{w_i^s ∈ sent^s} x_i^s − (1/n) Σ_{w_j^t ∈ sent^t} x_j^t ||^2

Learning from Aligned Documents: Example 1

Again, pseudo-bilingual training

Learning from Aligned Documents: Example 2

Inverted indexing

[Søgaard et al., ACL 2015]; slide courtesy of Anders Søgaard
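The inverted-indexing idea can be sketched in a few lines: represent each word by the set of aligned document IDs it occurs in, so words from different languages occurring in aligned Wikipedia articles get overlapping representations. A toy illustration (real systems work over actual Wikipedia alignments and typically truncate or weight the resulting vectors):

```python
def inverted_index_vectors(aligned_docs):
    """Map each word to the set of (aligned) document IDs it occurs in."""
    index = {}
    for doc_id, doc in enumerate(aligned_docs):
        for word in doc:
            index.setdefault(word, set()).add(doc_id)
    return index

def jaccard(a, b):
    """Similarity between two index-set representations."""
    return len(a & b) / len(a | b)

# toy aligned article pairs: English and Italian text of the same
# concept merged into one pseudo-document per alignment
docs = [["house", "dog", "casa", "cane"],
        ["house", "cat", "casa", "gatto"]]
vectors = inverted_index_vectors(docs)
# "house" and "casa" occur in exactly the same aligned articles,
# so their representations are identical (Jaccard similarity 1.0)
```

No word- or sentence-level alignment is needed; document-level comparability alone provides the cross-lingual signal.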

Summary

Slide courtesy of Anders Søgaard

Evaluation

Quality of Embeddings

- Intrinsic: How good are they in representing word meaning?
- Extrinsic: How good are they as features in downstream tasks?

Evaluation

Intrinsic >> Extrinsic?

Intrinsic << Extrinsic?

Intrinsic ≈ Extrinsic?

Most ideal, but hardly true!

[Schnabel et al., EMNLP 2015; Tsvetkov et al., EMNLP 2015]

Similarity-Based vs Feature-Based?

Parallel to the division into intrinsic and extrinsic tasks, there is another perspective:

Similarity-based use of embeddings: applications that rely on (constrained) nearest neighbour search in the cross-lingual word embedding graph

Feature-based use of embeddings: cross-lingual embeddings used directly as features (information retrieval, cross-lingual transfer)

Cross-Lingual Lexicon Induction and Cross-Lingual Word Similarity

Semantic similarity = measuring distance in the induced space

Bilingual lexicons: (en_morning, it_mattina), (en_carpet, pt_tapete), ...

Cross-Lingual Lexicon Induction
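Similarity-based induction reduces to nearest-neighbour search in the shared space: for each source word, return the target word with the most similar vector. A toy sketch (the vectors below are made up purely for illustration):

```python
import numpy as np

def induce_lexicon(E_src, E_tgt, src_words, tgt_words):
    """For each source word, pick its nearest cross-lingual
    neighbour by cosine similarity in the shared space."""
    S = E_src / np.linalg.norm(E_src, axis=1, keepdims=True)
    T = E_tgt / np.linalg.norm(E_tgt, axis=1, keepdims=True)
    sim = S @ T.T
    return {src_words[i]: tgt_words[int(sim[i].argmax())]
            for i in range(len(src_words))}

# toy shared space where translations ended up near each other
E_src = np.array([[1.0, 0.1], [0.1, 1.0]])   # en_morning, en_carpet
E_tgt = np.array([[0.9, 0.2], [0.0, 1.0]])   # it_mattina, pt_tapete
lex = induce_lexicon(E_src, E_tgt,
                     ["en_morning", "en_carpet"],
                     ["it_mattina", "pt_tapete"])
# → {'en_morning': 'it_mattina', 'en_carpet': 'pt_tapete'}
```

The quality of the induced lexicon is therefore a direct probe of how well the two monolingual spaces were aligned.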

Similarity-based vs. classification-based lexicon induction

Word embeddings used as features for classification

[Irvine and Callison-Burch, CompLing 2017; Heyman et al., EACL 2017]

Extrinsic Evaluation: Cross-lingual Document Classification

Quite popular in the CLWE literature

Cross-lingual knowledge transfer?

[Klementiev et al., COLING 2012; Heyman et al., DAMI 2015]


Using word-level information composed into a document-level representation

[Hermann and Blunsom, ACL 2014]

1. Is this really an optimal way to approach this task?
2. Is this really an optimal way to evaluate word embeddings?

Cross-Lingual Dependency Parsing

Other Applications: Cross-lingual Task X, Cross-lingual Task Y, ...

Cross-lingual supersense tagging

[Gouws and Søgaard, NAACL 2015]

Word alignment

[Levy et al., EACL 2017]

POS tagging

[Søgaard et al., ACL 2015; Zhang et al., NAACL 2016]

Entity linking: wikication

[Tsai and Roth, NAACL 2016]

Relation prediction

[Glavaš and Vulić, NAACL 2018]


Cross-lingual frame-semantic parsing and discourse parsing

[Johannsen et al., EMNLP 2015; Braud et al., EACL 2017]

Cross-lingual lexical entailment

[Vyas and Carpuat, NAACL 2016; Upadhyay et al., NAACL 2018]

Semantic role labeling

[Kozhevnikov and Titov, ACL 2013; Akbik et al., ACL 2016]

Sentiment analysis

[Zhou et al., ACL 2016; Abdalla and Hirst, arXiv 2017]

Extrinsic Evaluation and Application: Cross-lingual Information Retrieval

Semantic representations for cross-lingual IR

[Vulić and Moens, SIGIR 2015; Mitra and Craswell, arXiv 2017]

Document and Query Embeddings

Composition of word embeddings: [Mitchell and Lapata, ACL 2008]

vec(d) = vec(w_1) + vec(w_2) + ... + vec(w_{|d|})
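This additive composition is one line of numpy per text. A toy sketch in a shared space (the vectors below are made up for illustration; documents and queries are composed identically):

```python
import numpy as np

def embed_text(tokens, word_vectors, dim=2):
    """Additive composition: the text embedding is the sum of the
    word vectors of its tokens (out-of-vocabulary tokens skipped)."""
    vec = np.zeros(dim)
    for t in tokens:
        if t in word_vectors:
            vec += word_vectors[t]
    return vec

# toy cross-lingual space: translations have nearby vectors
wv = {"sweet": np.array([1.0, 0.0]), "dolce":   np.array([0.9, 0.1]),
      "work":  np.array([0.0, 1.0]), "travail": np.array([0.1, 0.9])}
doc_en = embed_text(["sweet", "work"], wv)
doc_xx = embed_text(["dolce", "travail"], wv)
# cosine similarity between the two "documents" is close to 1.0,
# so ranking by cosine retrieves the cross-lingual match
cos = float(doc_en @ doc_xx /
            (np.linalg.norm(doc_en) * np.linalg.norm(doc_xx)))
```

Retrieval then amounts to ranking documents by cosine similarity to the composed query vector in the same shared space.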

The dim-dimensional document embedding in the same cross-lingual word embedding space:

vec(d) = [f_{d,1}, ..., f_{d,k}, ..., f_{d,dim}]


The same principle applies to queries:

vec(Q) = vec(q_1) + vec(q_2) + ... + vec(q_m)

The dim-dimensional query embedding in the same bilingual word embedding space:

vec(Q) = [f_{Q,1}, ..., f_{Q,k}, ..., f_{Q,dim}]

Extrinsic Evaluation and Application: Language Understanding (Multilingual Dialog State Tracking)


The Neural Belief Tracker is a novel DST model/framework which aims to satisfy the following design goals:

1 End-to-end learnable (no SLU modules or semantic dictionaries).

2 Generalisation to unseen slot values.

3 Capability of leveraging the semantic content of pre-trained word vector spaces without human supervision.

[Mrkšić et al., ACL 2017; Mrkšić and Vulić, ACL 2018]


Representation Learning + Label Embedding + Separate Binary Decisions

To overcome data sparsity, NBT models use label embedding to decompose multi-class classification into many binary ones.

DST in Italian and German

Multilingual WOZ 2.0 (Mrkšić et al., 2017): Italian and German. The 1,200 dialogues in WOZ 2.0 were translated by native Italian and German speakers instructed to consider the preceding dialogue context.

Word Vector Space                       EN     IT     DE
Best Baseline Vector Space              81.6   71.8   50.5
Monolingual Distributional Vectors      77.6   71.2   46.6
+ Monolingual Specialisation            80.9   72.7   52.4
++ Cross-Lingual Specialisation         80.3   75.3   55.7
+++ Multilingual DST Model              82.8   77.1   57.7

DST Models for Resource-Poor Languages

Ontology Grounding: Multilingual DST Models

The domain ontology (i.e., the concepts it expresses) is language-agnostic, which means that 'labels' persist across languages. Using training data for two (or more) languages, and cross-lingual vectors of high quality, we train the first-ever multilingual DST model.

Take-Home Messages

1. Cross-Lingual Word Embedding Models are More Similar to Old(er) NLP Ideas than It Seems. But they are simple, efficient, and state-of-the-art...

2. CLWE Models are More Similar than It Seems We have addressed similarities between a plethora of cross-lingual word embedding models

3. Data is Crucial (more than the Chosen Algorithm) Optimization and various modeling tricks matter, but the chosen multilingual signal is the most important factor

4. CL Word Embeddings Support Cross-Lingual NLP: They are useful across a wide variety of cross-lingual NLP tasks, for cross-lingual (re-lexicalized) transfer, language understanding, IR, ...

Cross-Lingual Space of Thankyous

This talk was based on a recent tutorial: http://people.ds.cam.ac.uk/iv250/tutorial/xlingrep-tutorial.pdf

Book under review: Morgan & Claypool Synthesis Lectures, with Anders Søgaard, Sebastian Ruder, Manaal Faruqui