Character-Level Models
Hinrich Schütze, Center for Information and Language Processing, LMU Munich, 2019-08-29

Overview
1 Motivation
2 fastText
3 CNNs
4 FLAIR
5 Summary

Motivation

Typical NLP pipeline: Tokenization
Mr. O’Neill thinks that Brazil’s capital is Rio.
Mr.|O’Neill|thinks|that|Brazil|’s|capital|is|Rio|.

Typical NLP pipeline: Morphological analysis
For example: lemmatization
Mr. O’Neill knows that the US has fifty states
Mr. O’Neill know that the US have fifty state

Preprocessing in the typical NLP pipeline
- Tokenization
- Morphological analysis
- Later today: BPEs
What is the problem with this? (A small tokenizer sketch at the end of this section illustrates some of the failure modes below.)

Problems with typical preprocessing in NLP
Rules do not capture structure within tokens.
- Regular morphology, e.g., compounding: “Staubecken” can mean “Staub-Ecken” (dusty corners) or “Stau-Becken” (dam reservoir)
- Non-morphological, semi-regular productivity: cooooooooooool, fancy-shmancy, Watergate/Irangate/Dieselgate
- Blends: Obamacare, mockumentary, brunch
- Onomatopoeia, e.g., “oink”, “sizzle”, “tick tock”
- Certain named entity classes: What is “lisinopril”?
- Noise due to spelling errors: “signficant”
- Noise that affects token boundaries, e.g., in OCR: “run fast” → “runfast”

Problems with typical preprocessing in NLP
Rules do not capture structure across tokens.
- Noise that affects token boundaries, e.g., in OCR: “guacamole” → “guaca” “mole”
- Recognition of names / multiword expressions: “San Francisco-Los Angeles flights”
- “Nonsegmented” languages: Chinese, Thai, Burmese

Pipelines in deep learning (and StatNLP in general)
We have a pipeline consisting of two different subsystems:
- A preprocessing component: tokenization, morphology, BPEs
- The deep learning model that is optimized for a particular objective
The preprocessing component is not optimal for the objective, and there are many cases where it is outright harmful.
If we replace the preprocessing component with a character-level layer, we can train the architecture end2end and get rid of the pipeline.

Advantages of end2end vs. pipeline
- End2end optimizes all parameters of a deep learning model directly for the learning objective, including “first-layer” parameters that connect the raw input representation to the first layer of internal representations of the network.
- Pipelines generally don’t allow “backtracking” if an error has been made in the first element of the pipeline.
- In character-level models, there is no such thing as an out-of-vocabulary word (OOV analysis).
- Character-level models can generate words / units that did not occur in the training set (OOV generation).
- End2end can deal better with human productivity (e.g., “brunch”), misspellings, etc.
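To make the preprocessing problems above concrete, here is a minimal sketch of a rule-based tokenizer of the kind such a pipeline might use. The regular expression and the example calls are illustrative assumptions, not the tokenizer behind the slides’ examples.

```python
import re

# A naive rule-based tokenizer: words (incl. word-internal apostrophes),
# numbers, and single punctuation marks. Purely illustrative.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:[’'][A-Za-z]+)*|\d+|[^\w\s]")

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("Mr. O’Neill thinks that Brazil’s capital is Rio."))
# -> ['Mr', '.', 'O’Neill', 'thinks', 'that', 'Brazil’s', 'capital', 'is', 'Rio', '.']
# Already two mismatches with the target segmentation from the slides:
# "Mr." should stay one token (needs an abbreviation list) and "Brazil’s"
# should split into "Brazil" + "’s" (needs a clitic rule).

# And rules cannot recover from upstream noise such as OCR-merged tokens:
print(tokenize("runfast"))  # -> ['runfast']  (the token boundary is lost for good)
```

Every such rule is hand-crafted and cannot be corrected downstream, which is exactly the argument for an end2end character-level layer.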
Three character-level models
- fastText: bag of character ngrams
- Character-aware CNN (Kim, Jernite, Sontag, Rush, 2015): CNN
- FLAIR: character-level BiLSTM

fastText

fastText is an extension of word2vec.
- It computes embeddings for character ngrams.
- A word’s embedding is the sum of its character ngram embeddings.
- Parameters: minimum ngram length: 3, maximum ngram length: 6
The embedding of “dendrite” will be the sum of the embeddings of the following ngrams:
@dendrite@
@de den end ndr dri rit ite te@
@den dend endr ndri drit rite ite@
@dend dendr endri ndrit drite rite@
@dendr dendri endrit ndrite drite@

fastText: Example for benefits
The embedding for the character ngram “dendri” is shared, so “dendrite” and “dendritic” get similar embeddings.
word2vec: no such guarantee, especially for rare words.

fastText paper

fastText objective
$-\sum_{t=1}^{T} \sum_{c \in C_t} \log p(w_c \mid w_t)$
$T$: length of the training corpus in tokens
$C_t$: words surrounding word $w_t$

Probability of a context word: softmax?
$p(w_c \mid w_t) = \dfrac{\exp(s(w_t, w_c))}{\sum_{j=1}^{W} \exp(s(w_t, w_j))}$
$s(w_t, w_c)$: scoring function that maps a word pair to $\mathbb{R}$
$W$: size of the vocabulary
Problem: too expensive

Instead of softmax: Negative sampling and binary logistic loss
$\log\bigl(1 + \exp(-s(w_t, w_c))\bigr) + \sum_{n \in N_{t,c}} \log\bigl(1 + \exp(s(w_t, w_n))\bigr)$
or equivalently, with the logistic loss $\ell(x) = \log(1 + \exp(-x))$:
$\ell(s(w_t, w_c)) + \sum_{n \in N_{t,c}} \ell(-s(w_t, w_n))$
$N_{t,c}$: set of negative examples sampled from the vocabulary

Binary logistic loss for the corpus
$\sum_{t=1}^{T} \sum_{c \in C_t} \Bigl[ \ell(s(w_t, w_c)) + \sum_{n \in N_{t,c}} \ell(-s(w_t, w_n)) \Bigr]$
$\ell(x) = \log(1 + \exp(-x))$

Scoring function
$s(w_t, w_c) = u_{w_t}^{\top} v_{w_c}$
$u_{w_t}$: the input vector of $w_t$
$v_{w_c}$: the output vector (or context vector) of $w_c$

Subword model
$s(w_t, w_c) = \frac{1}{|G_{w_t}|} \sum_{g \in G_{w_t}} z_g^{\top} v_{w_c}$
$G_{w_t}$: set of ngrams of $w_t$, plus $w_t$ itself
$z_g$: embedding of ngram $g$

fastText: Summary
- Basis: word2vec skipgram
- Objective: includes character ngrams as well as the word itself
- Result: word embeddings that combine word-level and character-level information
- We can compute an embedding for any unseen word (OOV).
Below: a small code sketch of the ngram decomposition and the subword scoring.
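A minimal sketch of the ngram decomposition and the averaged subword vector defined above, assuming the “@” boundary marker and the ngram lengths 3–6 from the slides; the random embedding table merely stands in for learned fastText parameters.

```python
import numpy as np

def char_ngrams(word, min_n=3, max_n=6, boundary="@"):
    """Character ngrams of the boundary-padded word, plus the padded word itself."""
    padded = boundary + word + boundary
    grams = {padded}
    for n in range(min_n, max_n + 1):
        for i in range(len(padded) - n + 1):
            grams.add(padded[i:i + n])
    return grams

# Reproduces the decomposition of "dendrite" shown above (plus @dendrite@ itself).
print(sorted(char_ngrams("dendrite"), key=len))

# Toy stand-in for the ngram embedding table z_g: random vectors created on
# demand. In trained fastText these are learned parameters.
dim, rng, z = 100, np.random.default_rng(0), {}

def word_vector(word):
    """1/|G_w| times the sum of ngram vectors, as in the subword model above."""
    grams = char_ngrams(word)
    for g in grams:
        z.setdefault(g, rng.normal(size=dim))
    return np.mean([z[g] for g in grams], axis=0)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Even with untrained random ngram vectors, words that share many ngrams end
# up with similar vectors, and any OOV word gets a vector at all:
print(cos(word_vector("dendrite"), word_vector("dendritic")))  # clearly > 0
print(cos(word_vector("dendrite"), word_vector("bus")))        # close to 0
```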
Letter n-gram generalization can be good
Nearest neighbors of “automobile”:
word2vec: 1.000 automobile, 779 mid-size, 770 armored, 763 seaplane, 754 bus, 754 jet, 751 submarine, 750 aerial, 744 improvised, 741 anti-aircraft
fastText: 1.000 automobile, 976 automobiles, 929 Automobile, 858 manufacturing, 853 motorcycles, 849 Manufacturing, 848 motorcycle, 841 automotive, 814 manufacturer, 811 manufacture

Letter n-gram generalization can be bad
Nearest neighbors of “Steelers”:
word2vec: 1.000 Steelers, 884 Expos, 865 Cubs, 848 Broncos, 831 Dinneen, 831 Dolphins, 827 Pirates, 826 Copley, 818 Dodgers, 814 Raiders
fastText: 1.000 Steelers, 893 49ers, 883 Steele, 876 Rodgers, 857 Colts, 852 Oilers, 851 Dodgers, 849 Chalmers, 849 Raiders, 844 Coach

Letter n-gram generalization: no-brainer for unknowns (OOVs)
Nearest neighbors of “video-conferences”:
word2vec: none (“video-conferences” did not occur in the corpus)
fastText: 1.000 video-conferences, 942 conferences, 872 conference, 870 Conferences, 823 inferences, 806 Questions, 805 sponsorship, 800 References, 797 participates, 796 affiliations

fastText extensions (Mikolov et al., 2018)
- Position-dependent features
- Phrases (like word2vec)
- cbow
- Pretrained word vectors for 157 languages

fastText evaluation

Code
fastText: https://fasttext.cc
gensim: https://radimrehurek.com/gensim/
A training sketch using gensim follows after the parameter list below.

Pretrained fastText embeddings
Afrikaans, Albanian, Alemannic, Amharic, Arabic, Aragonese, Armenian, Assamese, Asturian, Azerbaijani, Bashkir, Basque, Bavarian, Belarusian, Bengali, Bihari, Bishnupriya Manipuri, Bosnian, Breton, Bulgarian, Burmese, Catalan, Cebuano, Central Bicolano, Chechen, Chinese, Chuvash, Corsican, Croatian, Czech, Danish, Divehi, Dutch, Eastern Punjabi, Egyptian Arabic, Emilian-Romagnol, English, Erzya, Esperanto, Estonian, Fiji Hindi, Finnish, French, Galician, Georgian, German, Goan Konkani, Greek, Gujarati, Haitian, Hebrew, Hill Mari, Hindi, Hungarian, Icelandic, Ido, Ilokano, Indonesian, Interlingua, Irish, Italian, Japanese, Javanese, Kannada, Kapampangan, Kazakh, Khmer, Kirghiz, Korean, Kurdish (Kurmanji), Kurdish (Sorani), Latin, Latvian, Limburgish, Lithuanian, Lombard, Low Saxon, Luxembourgish, Macedonian, Maithili, Malagasy, Malay, Malayalam, Maltese, Manx, Marathi, Mazandarani, Meadow Mari, Minangkabau, Mingrelian, Mirandese, Mongolian, Nahuatl, Neapolitan, Nepali, Newar, North Frisian, Northern Sotho, Norwegian (Bokmål), Norwegian (Nynorsk), Occitan, Oriya, Ossetian, Palatinate German, Pashto, Persian, Piedmontese, Polish, Portuguese, Quechua, Romanian, Romansh, Russian, Sakha, Sanskrit, Sardinian, Scots, Scottish Gaelic, Serbian, Serbo-Croatian, Sicilian, Sindhi, Sinhalese, Slovak, Slovenian, Somali, Southern Azerbaijani, Spanish, Sundanese, Swahili, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Thai, Tibetan, Turkish, Turkmen, Ukrainian, Upper Sorbian, Urdu, Uyghur, Uzbek, Venetian, Vietnamese, Volapük, Walloon, Waray, Welsh, West Flemish, West Frisian, Western Punjabi, Yiddish, Yoruba, Zazaki, Zeelandic

fastText skipgram parameters (defaults in parentheses)
-input <path>         training file path
-output <path>        output file path
-lr (0.05)            learning rate
-lrUpdateRate (100)   rate of updates for the learning rate
-dim (100)            dimensionality of word embeddings
-ws (5)               size of the context window
-epoch (5)            number of epochs
-minCount (5)         minimal number of word occurrences
-neg (5)              number of negatives sampled
-wordNgrams (1)       max length of word ngram
-loss (ns)            loss function ∈ {ns, hs, softmax}
-bucket (2,000,000)   number of buckets
-minn (3)             min length of char ngram
-maxn (6)             max length of char ngram
-threads (12)         number of threads
-t (0.0001)           sampling threshold
-label <string>       labels prefix
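A minimal training sketch showing how the hyperparameters above map onto the gensim API linked earlier; the toy corpus and the gensim 4.x keyword names are illustrative assumptions, not part of the slides.

```python
from gensim.models import FastText

# Toy corpus: gensim expects an iterable of tokenized sentences.
corpus = [
    ["mr.", "o'neill", "thinks", "that", "brazil", "'s", "capital", "is", "rio", "."],
    ["character", "ngrams", "let", "fasttext", "embed", "unseen", "words"],
]

# Hyperparameters mirror the fastText CLI defaults listed above
# (gensim 4.x names: -ws -> window, -neg -> negative, -epoch -> epochs).
model = FastText(
    sentences=corpus,
    sg=1,             # skipgram, the model discussed in this section
    vector_size=100,  # -dim
    window=5,         # -ws
    epochs=5,         # -epoch
    min_count=1,      # -minCount (lowered for the toy corpus; CLI default is 5)
    negative=5,       # -neg
    min_n=3,          # -minn
    max_n=6,          # -maxn
)

# OOV lookup works because the vector is built from character ngrams:
print(model.wv["video-conferences"][:5])
print(model.wv.similarity("ngrams", "ngram"))  # "ngram" itself is OOV here
```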
CNNs

Convolutional Neural Networks (CNNs): Basic idea
- We learn feature detectors.
- Each feature detector has a fixed size, e.g., a width of three characters.
- We slide the feature detector over the input (e.g., an input word).
- The feature detector indicates, for each point in the input, the activation of the feature at that point.
- Then we pass to the next layer.
A minimal sketch of this sliding operation is given below.
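A minimal NumPy sketch of one width-3 character feature detector sliding over a word. The random character embeddings, the tanh nonlinearity, and the max pooling at the end are illustrative assumptions, not the exact architecture of Kim et al. (2015).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy character embeddings (dimension 15) for the letters we need.
char_dim = 15
char_emb = {c: rng.normal(size=char_dim) for c in "abcdefghijklmnopqrstuvwxyz"}

# One feature detector of width 3: a weight matrix applied to every window of
# 3 consecutive character embeddings (a 1D convolution with a single filter).
width = 3
W = rng.normal(size=(width, char_dim))
b = 0.0

def feature_activations(word):
    """Activation of the detector at every position in the word."""
    X = np.stack([char_emb[c] for c in word])          # (len(word), char_dim)
    acts = []
    for i in range(len(X) - width + 1):                # slide over the input
        window = X[i:i + width]                        # (width, char_dim)
        acts.append(np.tanh(np.sum(W * window) + b))   # activation at position i
    return np.array(acts)

acts = feature_activations("dendrite")
print(acts)        # one activation per window position
print(acts.max())  # max over positions: one common way to summarize the
                   # detector's response before passing it to the next layer
```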