
Character-Level Models

Hinrich Schütze

Center for Information and Language Processing, LMU

2019-08-29

Overview

1 Motivation

2 fastText

3 CNNs

4 FLAIR

5 Summary

Outline

1 Motivation

2 fastText

3 CNNs

4 FLAIR

5 Summary

Typical NLP pipeline: Tokenization

Mr. O’Neill thinks that Brazil’s capital is Rio.
→ Mr.|O’Neill|thinks|that|Brazil|’s|capital|is|Rio|.


Typical NLP pipeline: Morphological analysis

For example: lemmatization
Mr. O’Neill knows that the US has fifty states
→ Mr. O’Neill know that the US have fifty state

Preprocessing in the typical NLP pipeline

Tokenization
Morphological analysis
Later today: BPEs
What is the problem with this?

Problems with typical preprocessing in NLP

Rules do not capture structure within tokens.
Regular morphology, e.g., compounding: “Staubecken” can mean “Staub-Ecken” (dusty corners) or “Stau-Becken” (dam reservoir)
Non-morphological, semi-regular productivity: cooooooooooool, fancy-shmancy, Watergate/Irangate/Dieselgate
Blends: Obamacare, mockumentary, brunch
Onomatopoeia, e.g., “oink”, “sizzle”, “tick tock”
Certain named entity classes: What is “lisinopril”?
Noise due to spelling errors: “signficant”
Noise that affects token boundaries, e.g., in OCR: “run fast” → “runfast”

Problems with typical preprocessing in NLP

Rules do not capture structure across tokens.
Noise that affects token boundaries, e.g., in OCR: “guacamole” → “guaca” “mole”
Recognition of names / multiword expressions: “San Francisco-Los Angeles flights”
“Nonsegmented” languages: Chinese, Thai, Burmese

Pipelines in deep learning (and StatNLP in general)

We have a pipeline consisting of two different subsystems:
A preprocessing component: tokenization, morphology, BPEs
The deep learning model that is optimized for a particular objective
The preprocessing component is not optimal for the objective, and in many cases it is outright harmful.
If we replace the preprocessing component with a character-level layer, we can train the architecture end2end and get rid of the pipeline.

Advantages of end2end vs. pipeline

End2end optimizes all parameters of a deep learning model directly for the learning objective, including “first-layer” parameters that connect the raw input representation to the first layer of internal representations of the network.
Pipelines generally don’t allow “backtracking” if an error has been made in the first element of the pipeline.
In character-level models, there is no such thing as an out-of-vocabulary word (OOV analysis).
Character-level models can generate words / units that did not occur in the training set (OOV generation).
End2end can deal better with human productivity (e.g., “brunch”), misspellings, etc.

Three character-level models

fastText: bag of character ngrams
Character-aware CNN (Kim, Jernite, Sontag, Rush, 2015): CNN
FLAIR: character-level BiLSTM

Outline

1 Motivation

2 fastText

3 CNNs

4 FLAIR

5 Summary

fastText

fastText is an extension of word2vec.
It computes embeddings for character ngrams.
A word’s embedding is the sum of its character ngram embeddings.
Parameters: minimum ngram length 3, maximum ngram length 6
The embedding of “dendrite” is the sum of the embeddings of the following ngrams:
@dendrite@
@de den end ndr dri rit ite te@
@den dend endr ndri drit rite ite@
@dend dendr endri ndrit drite rite@
@dendr dendri endrit ndrite drite@
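Not part of the slides: a minimal Python sketch of this decomposition, assuming the slide's “@” boundary marker (the released fastText implementation uses “<” and “>” and hashes ngrams into buckets).

```python
def char_ngrams(word, min_n=3, max_n=6):
    """All character ngrams of the boundary-padded word, plus the padded word itself."""
    padded = "@" + word + "@"        # '@' is the boundary symbol used on the slide
    grams = [padded]                 # the full word also gets its own embedding
    for n in range(min_n, max_n + 1):
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams

print(char_ngrams("dendrite"))
# ['@dendrite@', '@de', 'den', 'end', 'ndr', 'dri', 'rit', 'ite', 'te@', '@den', ...]
```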

fastText: Example for benefits

Shared embedding for the character ngram “dendri” → “dendrite” and “dendritic” get similar embeddings.
word2vec: no such guarantee, especially for rare words

fastText paper

fastText objective

$-\sum_{t=1}^{T} \sum_{c \in C_t} \log p(w_c \mid w_t)$

$T$: length of the training corpus in tokens
$C_t$: the words surrounding word $w_t$

Probability of a context word: softmax?

$p(w_c \mid w_t) = \frac{\exp(s(w_t, w_c))}{\sum_{j=1}^{W} \exp(s(w_t, w_j))}$

$s(w_t, w_c)$: scoring function that maps a word pair to $\mathbb{R}$
Problem: too expensive (the normalization runs over the whole vocabulary of size $W$)

Instead of softmax: Negative sampling and binary logistic loss

$\log\left(1 + e^{-s(w_t, w_c)}\right) + \sum_{n \in N_{t,c}} \log\left(1 + e^{s(w_t, w_n)}\right)$

$= \ell(s(w_t, w_c)) + \sum_{n \in N_{t,c}} \ell(-s(w_t, w_n))$

$N_{t,c}$: set of negative examples sampled from the vocabulary
$\ell(x) = \log(1 + e^{-x})$ (logistic loss)

Binary logistic loss for corpus

$\sum_{t=1}^{T} \left[ \sum_{c \in C_t} \ell(s(w_t, w_c)) + \sum_{n \in N_{t,c}} \ell(-s(w_t, w_n)) \right]$
with $\ell(x) = \log(1 + e^{-x})$
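A tiny numeric sketch of this loss for a single (target, context) pair, with the scores $s(\cdot,\cdot)$ taken as given (they are defined on the next slide); the example numbers are made up.

```python
import math

def ell(x):
    """Logistic loss: ell(x) = log(1 + exp(-x))."""
    return math.log1p(math.exp(-x))

def pair_loss(score_positive, scores_negative):
    """Loss for one (w_t, w_c) pair plus its sampled negatives N_{t,c}."""
    return ell(score_positive) + sum(ell(-s) for s in scores_negative)

# a well-scored true context word and two sampled negatives
print(pair_loss(2.5, [-1.0, 0.3]))   # ~ 0.08 + 0.31 + 0.85
```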

Scoring function

$s(w_t, w_c) = u_{w_t}^{\top} v_{w_c}$

$u_{w_t}$: the input vector of $w_t$
$v_{w_c}$: the output vector (or context vector) of $w_c$

Subword model

$s(w_t, w_c) = \frac{1}{|G_{w_t}|} \sum_{g \in G_{w_t}} z_g^{\top} v_{w_c}$

$G_{w_t}$: the set of ngrams of $w_t$, plus $w_t$ itself
$z_g$: the embedding of ngram $g$
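To make the formula concrete, a sketch with random stand-in vectors (not fastText's actual parameters); the ngram set is a subset of the “dendrite” example from earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 100

# illustrative subset of G_{w_t} for w_t = "dendrite": some ngrams plus the word itself
G_wt = ["@dendrite@", "@de", "den", "end", "ndr", "dri", "rit", "ite", "te@"]

z = {g: rng.normal(size=dim) for g in G_wt}   # input vectors z_g of the ngrams
v_c = rng.normal(size=dim)                    # output/context vector v_{w_c}

# s(w_t, w_c) = (1 / |G_{w_t}|) * sum over g of  z_g . v_{w_c}
s = sum(float(z[g] @ v_c) for g in G_wt) / len(G_wt)
print(s)
```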

fastText: Summary

Basis: word2vec skipgram
Objective: includes character ngrams as well as the word itself
Result: word embeddings that combine word-level and character-level information
We can compute an embedding for any unseen word (OOV).

Letter n-gram generalization can be good

Nearest neighbors of “automobile”:

word2vec: automobile (1.000), mid-size (.779), armored (.770), seaplane (.763), bus (.754), jet (.754), submarine (.751), aerial (.750), improvised (.744), anti-aircraft (.741)

fastText: automobile (1.000), automobiles (.976), Automobile (.929), manufacturing (.858), motorcycles (.853), Manufacturing (.849), motorcycle (.848), automotive (.841), manufacturer (.814), manufacture (.811)

Letter n-gram generalization can be bad

Nearest neighbors of “Steelers”:

word2vec: Steelers (1.000), Expos (.884), Cubs (.865), Broncos (.848), Dinneen (.831), Dolphins (.831), Pirates (.827), Copley (.826), Dodgers (.818), Raiders (.814)

fastText: Steelers (1.000), 49ers (.893), Steele (.883), Rodgers (.876), Colts (.857), Oilers (.852), Dodgers (.851), Chalmers (.849), Raiders (.849), Coach (.844)

Letter n-gram generalization: no-brainer for unknowns (OOVs)

Nearest neighbors of “video-conferences”:

word2vec: no embedding (“video-conferences” did not occur in the corpus)

fastText: video-conferences (1.000), conferences (.942), conference (.872), Conferences (.870), inferences (.823), Questions (.806), sponsorship (.805), References (.800), participates (.797), affiliations (.796)

fastText extensions (Mikolov et al., 2018)

Position-dependent features
Phrases (like word2vec)
cbow
Pretrained word vectors for 157 languages

fastText evaluation

Code

fastText: https://fasttext.cc
gensim: https://radimrehurek.com/gensim/
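A hedged usage sketch with gensim's FastText class (parameter names follow gensim 4.x; the two-sentence toy corpus and all values are illustrative only).

```python
from gensim.models import FastText

corpus = [["the", "dendrite", "receives", "signals"],
          ["video-conferences", "are", "held", "weekly"]]

model = FastText(sentences=corpus, vector_size=100, window=5, min_count=1,
                 sg=1, min_n=3, max_n=6, epochs=5)

vec = model.wv["dendritic"]   # works although "dendritic" never occurred: built from its ngrams
print(vec.shape)              # (100,)
```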

Pretrained fastText embeddings

Afrikaans, Albanian, Alemannic, Amharic, Arabic, Aragonese, Armenian, Assamese, Asturian, Azerbaijani, Bashkir, Basque, Bavarian, Belarusian, Bengali, Bihari, Bishnupriya Manipuri, Bosnian, Breton, Bulgarian, Burmese, Catalan, Cebuano, Central Bicolano, Chechen, Chinese, Chuvash, Corsican, Croatian, Czech, Danish, Divehi, Dutch, Eastern Punjabi, Egyptian Arabic, Emilian-Romagnol, English, Erzya, Esperanto, Estonian, Fiji Hindi, Finnish, French, Galician, Georgian, German, Goan Konkani, Greek, Gujarati, Haitian, Hebrew, Hill Mari, Hindi, Hungarian, Icelandic, Ido, Ilokano, Indonesian, Interlingua, Irish, Italian, Japanese, Javanese, Kannada, Kapampangan, Kazakh, Khmer, Kirghiz, Korean, Kurdish (Kurmanji), Kurdish (Sorani), Latin, Latvian, , Lithuanian, Lombard, Low Saxon, , Macedonian, Maithili, Malagasy, Malay, Malayalam, Maltese, Manx, Marathi, Mazandarani, Meadow Mari, Minangkabau, Mingrelian, Mirandese, Mongolian, Nahuatl, Neapolitan, Nepali, Newar, North Frisian, Northern Sotho, Norwegian (Bokmål), Norwegian (), Occitan, Oriya, Ossetian, Palatinate German, Pashto, Persian, Piedmontese, Polish, Portuguese, Quechua, Romanian, Romansh, Russian, Sakha, Sanskrit, Sardinian, Scots, Scottish Gaelic, Serbian, Serbo-Croatian, Sicilian, Sindhi, Sinhalese, Slovak, Slovenian, Somali, Southern Azerbaijani, Spanish, Sundanese, Swahili, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Thai, Tibetan, Turkish, Turkmen, Ukrainian, , Urdu, Uyghur, Uzbek, Venetian, Vietnamese, Volapük, Walloon, Waray, Welsh, West , West Frisian, Western Punjabi, , Yoruba, Zazaki,

fastText skipgram parameters

-input: training file path
-output: output file path
-lr (0.05): learning rate
-lrUpdateRate (100): rate of updates for the learning rate
-dim (100): dimensionality of word embeddings
-ws (5): size of the context window
-epoch (5): number of epochs

fastText skipgram parameters

-minCount (5): minimal number of word occurrences
-neg (5): number of negatives sampled
-wordNgrams (1): max length of word ngram
-loss (ns): loss function ∈ {ns, hs, softmax}
-bucket (2,000,000): number of buckets
-minn (3): min length of char ngram
-maxn (6): max length of char ngram

fastText skipgram parameters

-threads (12): number of threads
-t (0.0001): sampling threshold
-label: labels prefix
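A sketch of how these flags map onto the official fasttext Python bindings, whose keyword arguments mirror the CLI parameters listed above; “corpus.txt” is a hypothetical path, and the values shown are the defaults from these slides.

```python
import fasttext

model = fasttext.train_unsupervised(
    "corpus.txt",              # hypothetical plain-text training file
    model="skipgram",
    lr=0.05, dim=100, ws=5, epoch=5,
    minCount=5, neg=5, loss="ns",
    minn=3, maxn=6, bucket=2000000,
)
print(model.get_word_vector("dendrite").shape)   # (100,), also for unseen words
model.save_model("model.bin")
```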

Outline

1 Motivation

2 fastText

3 CNNs

4 FLAIR

5 Summary

Convolutional Neural Networks (CNNs): Basic idea

We learn feature detectors.
Each feature detector has a fixed size, e.g., a width of three characters.
We slide the feature detector over the input (e.g., an input word).
The feature detector indicates, for each point in the input, the activation of the feature at that point.
Then we pass to the next layer the highest activation we have found.
Example task in the following slides: detect …

Convolution & pooling architecture

[Figure: three layers stacked over the character input “@ M c C a i n @ l o s e s @”: the input layer, a convolution layer, and a pooling layer.]

Convolution layer (filter size 3)

$a = g(H \odot X)$

[Figure: a filter of width 3 slides over the input “@ M c C a i n @ l o s e s @”; each window $X$ yields one activation $g(H \odot X)$, e.g., 0.1.]

Convolutional layer: configuration

Convolutional filter: $a = g(H \odot X + b)$
$g$: nonlinearity (e.g., sigmoid)
$H$: filter parameters, of dimensionality $D \times k$
$X$: the input to the filter, of dimensionality $D \times k$
$k$: kernel size (or filter size), the length of the character subsequence
$D$: the dimensionality of the character embeddings
$\odot$: the (Frobenius) inner product, $H \odot X = \sum_{i,j} H_{ij} X_{ij}$
Number of kernels/filters: usually a mix of filters of different sizes
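A numpy sketch of one such filter sliding over the example input from the figure above, with random stand-in character embeddings, a sigmoid nonlinearity, and purely illustrative sizes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
D, k = 4, 3                                       # embedding dimensionality, filter width (toy sizes)
H = rng.normal(size=(D, k))                       # filter parameters
b = 0.0

chars = list("@McCain@loses@")
E = {c: rng.normal(size=D) for c in set(chars)}   # random stand-in character embeddings
X_full = np.stack([E[c] for c in chars], axis=1)  # D x (number of characters)

activations = []
for m in range(X_full.shape[1] - k + 1):
    X = X_full[:, m:m + k]                        # D x k window starting at position m
    a = sigmoid(np.sum(H * X) + b)                # Frobenius inner product H . X, then nonlinearity
    activations.append(a)

pooled = max(activations)                         # max pooling over all positions
print(len(activations), pooled)
```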

dos Santos and Zadrozny (2014)

CNN for generating word embeddings

Character embeddings of dimensionality $d^{chr}$
Convolutional filter $W^0$ (here: width $k^{chr} = 3$)
Input to the filter: $z_m$, the concatenation of $k^{chr} = 3$ character embeddings
Output of the filter: $g(W^0 \odot z_m + b^0)$, one output vector per position; here $M = 9 - k^{chr} + 1$ positions
Maxpooling: $[w^{chr}_r]_j = \max_{1 \le m \le M} [g(W^0 \odot z_m + b^0)]_j$
$w^{chr}_r$ is the character-based embedding of the input word “clearly”
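Not the authors' implementation: a short PyTorch sketch of the same construction (character embeddings, one convolution, max-over-time pooling); the alphabet size and the random character ids are made up, and the default hyperparameters follow the next slide.

```python
import torch
import torch.nn as nn

class CharCNNWordEmbedder(nn.Module):
    """Sketch of a character-based word embedder in the style of dos Santos and Zadrozny (2014)."""

    def __init__(self, n_chars, d_chr=10, k_chr=5, n_filters=50):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, d_chr)
        # one Conv1d with n_filters output channels applies W^0 at every position
        self.conv = nn.Conv1d(d_chr, n_filters, kernel_size=k_chr, padding=k_chr - 1)

    def forward(self, char_ids):              # char_ids: (batch, word_length)
        x = self.char_emb(char_ids)           # (batch, word_length, d_chr)
        x = x.transpose(1, 2)                 # (batch, d_chr, word_length)
        h = torch.tanh(self.conv(x))          # (batch, n_filters, positions)
        w_chr, _ = h.max(dim=2)               # max over positions -> (batch, n_filters)
        return w_chr

embedder = CharCNNWordEmbedder(n_chars=30)    # toy alphabet of 30 symbols
vec = embedder(torch.randint(0, 30, (1, 9)))  # a "word" of 9 character ids
print(vec.shape)                              # torch.Size([1, 50])
```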

Examples

Character-based word embeddings are trained end2end, here for a part-of-speech (POS) tagging task. It is apparent that the word embeddings reflect similarity for the POS task.

Hyperparameters

$d^{chr}$ = 10: dimensionality of character embeddings
$k^{chr}$ = 5: width of the convolutional filters
$n$ = 50: number of convolutional filters (= dimensionality of the character-based word embeddings)

POS performance of character-based word embeddings

If regular word embeddings (WNN; e.g., word2vec) are available (OOSV), then character-based word embeddings do not help.
If not (OOUV), then character-based word embeddings perform best.
Overall performance is also the best.
Hand-engineered features are slightly worse than character-based word embeddings.

Kim et al. 2016 (AAAI)

Extensions

Convolutional filters of many sizes
Highway network
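A hedged PyTorch sketch of the highway component mentioned above (dimensions are illustrative; this is the generic highway layer of Srivastava et al., 2015, not Kim et al.'s exact code).

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    """One highway layer: a gate mixes a transformed input with the input carried through unchanged."""

    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        t = torch.sigmoid(self.gate(x))        # transform gate in [0, 1]
        h = torch.relu(self.transform(x))      # candidate transformation
        return t * h + (1.0 - t) * x           # gated mix of transformed and original input

y = Highway(dim=1100)(torch.randn(2, 1100))    # 1100 = char-based embedding size (next slide)
print(y.shape)                                 # torch.Size([2, 1100])
```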

Hyperparameters

$d^{chr}$ = 15: dimensionality of character embeddings
$k^{chr}$ ∈ {1, 2, 3, 4, 5, 6, 7}: widths of the convolutional filters
$n$ = min(200, 50 · $k^{chr}$): number of filters per width
1100: dimensionality of the character-based word embeddings
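A two-line check of the filter-count arithmetic on this slide.

```python
# n = min(200, 50 * k_chr) filters per width, summed over the widths 1..7
widths = range(1, 8)
n_per_width = [min(200, 50 * k) for k in widths]
print(n_per_width)        # [50, 100, 150, 200, 200, 200, 200]
print(sum(n_per_width))   # 1100 = dimensionality of the char-based word embedding
```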

Language modeling results (perplexity)

The character-based model is on par with the state of the art at the time, but has a smaller number of parameters.

Examples

Here the objective is language modeling, so we get more than POS similarity (“advertised” / “advertising”).
There is a big difference before/after the highway layer.
The highway layer copies over useful character information (“computer-aided” / “computer-guided”) and filters out misleading character-based similarity (“loooook” / “cook”).

Outline

1 Motivation

2 fastText

3 CNNs

4 FLAIR

5 Summary

FLAIR: Akbik et al. (2018)

FLAIR embeddings

First layer: character-level biLSTM language model
Second layer: the FLAIR word embedding, the concatenation of the fLM hidden state after the word’s last character and the bLM hidden state before the word’s first character
(fLM = forward language model, bLM = backward language model)
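Not Akbik et al.'s code: a minimal PyTorch sketch of where the two hidden states are read off, with random character ids and made-up sizes; it does not show how the character language models are trained.

```python
import torch
import torch.nn as nn

class CharLMEmbedder(nn.Module):
    """FLAIR-style extraction sketch: run a forward and a backward character-level LSTM
    over the whole sentence and read off hidden states at the word's boundaries."""

    def __init__(self, n_chars, d_char=50, d_hidden=256):
        super().__init__()
        self.emb = nn.Embedding(n_chars, d_char)
        self.fwd = nn.LSTM(d_char, d_hidden, batch_first=True)   # reads left to right
        self.bwd = nn.LSTM(d_char, d_hidden, batch_first=True)   # fed the reversed sequence

    def forward(self, char_ids, first_char, last_char):
        # char_ids: (1, sentence_length) character ids of the whole sentence
        x = self.emb(char_ids)
        h_f, _ = self.fwd(x)
        h_b, _ = self.bwd(torch.flip(x, dims=[1]))
        h_b = torch.flip(h_b, dims=[1])                  # re-align with the original positions
        fwd_state = h_f[0, last_char]                    # fLM state after the word's last character
        bwd_state = h_b[0, max(first_char - 1, 0)]       # bLM state before the word's first character
        return torch.cat([fwd_state, bwd_state])         # the word's contextual embedding

emb = CharLMEmbedder(n_chars=100)
sentence = torch.randint(0, 100, (1, 40))                # 40 random character ids
print(emb(sentence, first_char=10, last_char=19).shape)  # torch.Size([512])
```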

Motivation for FLAIR embeddings

FLAIR embeddings are contextual: the same word has different FLAIR embeddings in different contexts!
Hope: the context (e.g., “George”) is incorporated into the FLAIR embedding of “Washington” in the context “George Washington”.
Pretraining: in contrast to character embeddings learned for a specific task, FLAIR embeddings can be pretrained on huge unlabeled corpora.
Plus: FLAIR embeddings have all the other advantages of character-based word embeddings, e.g., robustness against noise and no out-of-vocabulary words.

Typical use of FLAIR embeddings: Sequence labeling
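As an illustration of this typical use, a hedged sketch with the flair library; the model identifier 'ner' is one of the library's pretrained model names and may change between releases.

```python
from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("ner")          # pretrained NER tagger built on FLAIR embeddings
sentence = Sentence("George Washington went to Washington .")
tagger.predict(sentence)
print(sentence.to_tagged_string())
```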

Extensions

Stacking embeddings: a FLAIR embedding can be extended with other word embeddings: word2vec, fastText, etc.
Also: a FLAIR embedding can be extended with a task-trained embedding, i.e., one trained end2end on the training set of the task.
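A hedged sketch of stacked embeddings with the flair library; the identifiers 'glove', 'news-forward', and 'news-backward' are the library's pretrained embedding names.

```python
from flair.data import Sentence
from flair.embeddings import FlairEmbeddings, StackedEmbeddings, WordEmbeddings

stacked = StackedEmbeddings([
    WordEmbeddings("glove"),            # classic word embeddings
    FlairEmbeddings("news-forward"),    # forward character language model
    FlairEmbeddings("news-backward"),   # backward character language model
])

sentence = Sentence("George Washington went to Washington .")
stacked.embed(sentence)
for token in sentence:
    print(token.text, token.embedding.shape)   # one concatenated vector per token
```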

Nearest neighbors in FLAIR embedding space

This demonstrates that FLAIR embeddings indeed capture valuable context – so they are contextualized embeddings.

Performance of FLAIR

FLAIR beats three strong baselines: a new state of the art.
Word embeddings give a big boost for some tasks.
Stacked embeddings are better than FLAIR embeddings alone.

How much information do FLAIR embeddings contain? Using only a linear map from the embeddings

                    NER English   NER German   Chunking   POS
FLAIR, map          81.42         73.90        90.50      97.26
FLAIR, full model   91.97         85.78        96.68      97.73
word, map           48.79         57.43        65.01      89.58
word, full model    88.54         82.32        95.40      96.94

Using FLAIR embeddings directly, without a sequence labeling model, performs surprisingly well (though with a big gap to the full model). In particular, FLAIR/map is much better than word/map.
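A hypothetical sketch of the probing setup in the spirit of this slide: fit only a linear map from frozen embeddings to labels. The random arrays stand in for precomputed FLAIR (or word) embeddings and gold tags; they are not the paper's data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.randn(1000, 2048)      # frozen token embeddings (random stand-ins)
y = np.random.randint(0, 12, 1000)   # gold labels, e.g., 12 POS classes (random stand-ins)

probe = LogisticRegression(max_iter=1000)    # the linear map
probe.fit(X, y)
print(probe.score(X, y))                     # accuracy of the linear map alone
```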

https://github.com/zalandoresearch/flair

Outline

1 Motivation

2 fastText

3 CNNs

4 FLAIR

5 Summary

                                   fastText           CNN                      FLAIR
architecture                       ngram embeddings   CNN                      biLSTM
                                   pipeline           end2end                  mixed: language modeling
                                   BOW                fixed-position filters   sequential
efficient to train?                +                  −                        −
pretrained available?              +                  −                        +
expressivity                       −                  +                        +
combinable with word embeddings    +                  +                        +
within-token, OOVs                 +                  +                        +
cross-token                        inflexible         −                        +

Resources

See “references.pdf” document

Questions?
