NLP: The bread and the butter
Andi Rexha • 07.11.2019

Agenda
Traditional NLP
● Text preprocessing
● Feature types
● Bag-of-words model
● External resources
● Sequential classification
● Other tasks (MT, LM, Sentiment)

Word embeddings
● First generation (W2v)
● Second generation (ELMo, BERT, ...)
● Multilinguality
○ Space transformation
○ Multilingual BERT, MultiFiT

Preprocessing
How to preprocess text?
● Divide et impera approach:
○ How do we (humans) split the text to analyse it? (see the sketch after this list)
■ Word split
■ Sentence split
■ Paragraphs, etc
○ Is there any other information that we can collect?
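A minimal sketch of the two splits with NLTK (spaCy or CoreNLP would work just as well); it assumes the ‘punkt’ tokenizer models have been downloaded:

```python
# Word and sentence splitting with NLTK (assumes nltk.download("punkt")
# has been run once to fetch the tokenizer models).
import nltk

text = "There are different examples that we might use! Here is a second sentence."
sentences = nltk.sent_tokenize(text)                 # sentence split
tokens = [nltk.word_tokenize(s) for s in sentences]  # word split per sentence
print(sentences)
print(tokens)
```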
Preprocessing (2)
Other preprocessing steps
● Stemming (lemmatization is usually too expensive)
● Part of Speech Tagging (PoS)
● Chunking/Constituency Parsing
● Dependency Parsing
Preprocessing (3)
● Stemming
○ The process of reducing inflected words to a common root:
■ Producing => produc; are => are
● Lemmatization
○ Bringing the words to the same lemma:
■ am, is, are => be
● Part of Speech Tagging (PoS)
○ Assign a grammatical tag to each word
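A minimal NLTK sketch of the three steps above; it assumes the ‘wordnet’ and ‘averaged_perceptron_tagger’ resources have been downloaded:

```python
# Stemming vs. lemmatization vs. PoS tagging with NLTK.
from nltk import pos_tag
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("producing"))             # 'produc' -> a root, not a word
print(lemmatizer.lemmatize("are", pos="v"))  # 'be'     -> a real lemma
print(pos_tag(["There", "are", "different", "examples"]))  # Penn Treebank tags
```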
Preprocessing Example
Sentence: “There are different examples that we might use!”
[Figure: the example sentence after preprocessing, PoS tagging, and lemmatization]
Definition of the PoS tags from the Penn Treebank: https://sites.google.com/site/partofspeechhelp/#TOC-VBZ
Preprocessing (4)
Other preprocessing steps
● Shallow parsing (Chunking):
○ A shallow syntactic parse that adds a tree structure on top of the PoS tags
○ First identifies the constituents, then their relations
● Deep Parsing (Dependency Parsing):
○ Parses the sentence into its grammatical structure
○ “Head” - “Dependent” form
○ It is a directed acyclic graph (mostly implemented as a tree); a parsing sketch follows the example below
Preprocessing Example (2)
Sentence: “There are different examples that we might use!”
[Figure: constituency parse and dependency parse of the sentence]
Run the examples at: https://corenlp.run/
CoreNLP library: https://stanfordnlp.github.io/CoreNLP/
Dependency parser tags (UD): https://nlp.stanford.edu/software/dependencies_manual.pdf
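To see the “Head”-“Dependent” form programmatically, a minimal sketch with spaCy (an assumption; CoreNLP gives equivalent output), using the ‘en_core_web_sm’ model:

```python
# Dependency parsing with spaCy: every token points to its head.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("There are different examples that we might use!")
for token in doc:
    print(f"{token.text:10s} --{token.dep_}--> {token.head.text}")
```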
Bag-of-words Model
A major paradigm in NLP and IR
● The text is considered to be the bag (multiset) of its words
● Grammatical dependencies and word order are ignored
Representation of features
● Dictionary based (Nominal features)
● One hot encoded/Frequency encoded
● What is the difference from standard Data Mining problems?
Bag-of-words Model (2)
[Figure: example sentences, the extracted features, and their representation for ML]
Source: https://en.wikipedia.org/wiki/Bag-of-words_model
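A minimal bag-of-words sketch with scikit-learn, mirroring the example above (frequency encoding over a learned dictionary):

```python
# Bag-of-words: each sentence becomes a frequency vector over the
# dictionary; grammar and word order are ignored.
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["John likes to watch movies.",
             "Mary likes movies too."]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)    # sparse document-term matrix
print(vectorizer.get_feature_names_out())  # the dictionary (features)
print(X.toarray())                         # frequency-encoded rows
```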
Feature encoding
PoS tagging
● Option (word + PoS tag) as part of the dictionary:
○ Example: John-PN
Dependency parsing
● Dependency between two words:
○ Use it as an n-gram feature
● Option (word + dependency path) as part of the dictionary:
○ Example: use-nsubj-acl:relcl
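A hypothetical sketch of both encoding options with spaCy tags (the tag names differ slightly from the slides’ examples):

```python
# Word+PoS and word+dependency features appended to the dictionary.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("There are different examples that we might use!")

word_pos = [f"{t.text}-{t.tag_}" for t in doc]                # e.g. 'examples-NNS'
word_dep = [f"{t.text}-{t.dep_}-{t.head.text}" for t in doc]  # e.g. 'use-relcl-examples'
print(word_pos)
print(word_dep)
```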
External Resources
There are linguistic resources that we are missing:
● Synonyms, antonyms, hyponyms, hypernyms, ...
External resources can help us to enrich our vocabulary:
● WordNet: a lexical database for English, which groups words into synsets
● Wiktionary: A free multilingual dictionary enriched with relations between words
● We can enrich the features of our examples with their synonyms, as in the sketch below
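A minimal WordNet sketch with NLTK (assumes the ‘wordnet’ corpus has been downloaded):

```python
# Collect the synonyms of a word from its WordNet synsets.
from nltk.corpus import wordnet as wn

synonyms = {lemma.name()
            for synset in wn.synsets("example")
            for lemma in synset.lemmas()}
print(synonyms)  # e.g. {'example', 'instance', 'illustration', ...}
```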
Sequential Classification
Cases when we need to classify a sequence of tokens
● Information Extraction:
○ Example: extract the names of companies from documents (open domain)
How do we model it?
● Classify each token as part or not part of the information:
○ The classification of the current token depends on the classification of the previous one
○ => a sequential classifier
○ Still not enough; we need to encode the output
○ We need to know where every “annotation” starts and ends

Sequential Classification (2)
Two schemas for encoding the output
● BILOU: Beginning, Inside, Last, Outside, Unit
● BIO (most used): Beginning, Inside, Outside
● BILOU has been shown to perform better on some datasets¹
● Example: “The Know Center GmbH is a spinoff of TUGraz.”
○ BILOU: The-O; Know-B; Center-I; GmbH-L; is-O; a-O; spinoff-O; of-O; TUGraz-U
○ BIO: The-O; Know-B; Center-I; GmbH-I; is-O; a-O; spinoff-O; of-O; TUGraz-B
● Sequential classifiers: Hidden Markov Models, CRFs, etc. (a BIO-encoding sketch follows below)
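A minimal sketch of BIO encoding (the helper below is illustrative, not from a library): turn annotated token spans into one label per token:

```python
# BIO encoding: B for the first token of an annotation, I for the rest,
# O for everything outside. Spans are (start, end) token indices, inclusive.
def bio_encode(tokens, spans):
    labels = ["O"] * len(tokens)
    for start, end in spans:
        labels[start] = "B"
        for i in range(start + 1, end + 1):
            labels[i] = "I"
    return labels

tokens = ["The", "Know", "Center", "GmbH", "is", "a", "spinoff", "of", "TUGraz", "."]
print(list(zip(tokens, bio_encode(tokens, [(1, 3), (8, 8)]))))
# -> Know-B, Center-I, GmbH-I, ..., TUGraz-B, as in the example above
```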
1: Named Entity Recognition: https://www.aclweb.org/anthology/W09-1119.pdf

Machine Translation
Task presentation
● Translate text from one language to another
● Different approaches:
○ Rule-based (usually using a dictionary)
○ Statistical (involving bilingual aligned corpora)
■ IBM models (1-6) for alignment and training
○ Hybrid (combining the two previous techniques)

Sentiment Analysis
Task presentation
● Assign a sentiment to a piece of text:
○ Binary (like/dislike)
○ Rating-based (e.g. 1-5)
● Assign the sentiment to a target phrase:
○ Usually involving features around the target
● External resources:
○ SentiWordNet: http://sentiwordnet.isti.cnr.it/

Language Model
Task presentation
● Generating the next token of a sequence
● Usually based on collecting co-occurrences of words within a window:
○ The statistics are collected and the next word is predicted based on this information
○ Mainly, it models the probability of the sequence:
■ P(w_1, ..., w_n) = Π_i P(w_i | w_1, ..., w_{i-1})
● In traditional approaches, solved with an n-gram approximation:
○ P(w_i | w_1, ..., w_{i-1}) ≈ P(w_i | w_{i-n+1}, ..., w_{i-1})
○ Usually solved by combining n-grams of different sizes and weighting them (a bigram sketch follows below)
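A minimal bigram sketch of the idea (counts only; no smoothing or interpolation):

```python
# Collect bigram counts and predict the next word as the most
# frequent follower of the current one.
from collections import Counter, defaultdict

corpus = "the cat is walking in the bedroom the dog was running in the room".split()
bigrams = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    bigrams[w1][w2] += 1

def next_word(word):
    return bigrams[word].most_common(1)[0][0]

print(next_word("the"))  # the most frequent word after 'the' in this toy corpus
```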
Dense Word Representation
From sparse to dense

Topic modeling
● Since LSA (Latent Semantic Analysis):
○ These methods use low-rank approximations to decompose large matrices that capture statistical information about a corpus
● Other methods later:
○ pLSA (Probabilistic Latent Semantic Analysis)
■ Uses probabilities instead of SVD (Singular Value Decomposition)
● LDA (Latent Dirichlet Allocation):
○ A Bayesian version of pLSA
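A minimal LDA sketch with gensim on a toy corpus, purely to show the mechanics:

```python
# Fit a 2-topic LDA model and print the word distributions per topic.
from gensim import corpora, models

docs = [["cat", "dog", "pet", "dog"],
        ["stock", "market", "trade", "stock"]]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary)
print(lda.print_topics())
```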
Neural embeddings
A Neural Probabilistic Language Model
● Language models suffer from the “curse of dimensionality”:
○ The word sequence that we want to predict is likely to be different from the ones we have seen in training
○ Seeing “The cat is walking in the bedroom” should help us generate “A dog was running in the room”:
■ Similar semantics and grammatical roles
● Bengio et al. implemented in 2003 the idea of learning distributed representations for words (going back to Hinton, 1986):
○ Learned a language model and embeddings for the words
Neural embeddings (2)
Bengio’s architecture
● Approximate the probability P(w_t | w_1, ..., w_{t-1}) with a function over a window: f(w_t, ..., w_{t-n+1})
● Model the approximation with a neural network
● Input Layer in a 1-hot-encoding form
● Two hidden layers (the first is the shared embedding lookup, randomly initialized)
● A tanh intermediate layer
Neural embeddings (3)
Bengio’s architecture
● A final softmax layer:
○ Outputs the next word in the sequence
● Learned a word representation for ~18K words, with almost 1M words in the corpus
● IMPORTANT linguistic theory:
○ Words that tend to occur in similar linguistic contexts tend to resemble each other in meaning

Word2vec
● A neural model (2 shallow layers) that computes dense vector representations of words
● Two different architectures:
○ Continuous Bag-of-Words (CBOW) (faster)
■ Predict the middle word from a window of surrounding words
○ Skip-gram (better with small amounts of data)
■ Predict the context of the middle word, given the word
● Models the probability of words co-occurring with the current (candidate) word
● The learned embedding is the output of the hidden layer
Word2vec (2)
[Figure: CBOW and skip-gram architectures]

Word2vec (3)
● The output is a softmax function
● Three new techniques:
1. Subsampling of frequent words:
■ Each word w_i in the training set is discarded with probability P(w_i) = 1 - sqrt(t / f(w_i))
● f(w_i) is the frequency of word w_i and t (around 10^-5) is a threshold
● Words that occur less often are more likely to be kept
■ Accelerates learning and even significantly improves the accuracy of the learned vectors of rare words

Word2vec (4)
2. Hierarchical softmax:
○ A tree approximation of the softmax, using a sigmoid at every step
○ Intuition: at every step decide whether to go right or left
○ O(log(n)) instead of O(n)
3. Negative sampling:
○ An alternative to hierarchical softmax (works better):
○ Brings up infrequent terms, squeezes down the probability of frequent terms
○ Updates the weights of only a small sample of terms, selected with probability:
■ P(w_i) = f(w_i)^(3/4) / Σ_j f(w_j)^(3/4)

Word2vec (5)
● A serendipitous effect of word2vec is the linearity (analogy) between embeddings:
● The famous example: (King - Man) + Woman = Queen
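A minimal gensim sketch of the skip-gram model; the toy corpus only demonstrates the API, the analogy effect needs a large corpus:

```python
# Train skip-gram (sg=1) word2vec and query similar words.
from gensim.models import Word2Vec

sentences = [["the", "cat", "is", "walking", "in", "the", "bedroom"],
             ["a", "dog", "was", "running", "in", "the", "room"]]
model = Word2Vec(sentences, vector_size=50, window=2, sg=1, min_count=1)
print(model.wv.most_similar("cat", topn=3))
# On a large corpus the analogy is a vector offset:
# model.wv.most_similar(positive=["king", "woman"], negative=["man"])  # ~ 'queen'
```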
GloVe (Global Vectors)
Previous embeddings: advantages and drawbacks
● Methods similar to LSA:
○ Advantage: Learn statistical information
○ Drawback: Poor on analogy
● Word2Vec:
○ Advantages: Learn analogy
○ Drawback: Poor in learning statistical information
● GloVe aims to combine the two while removing the disadvantages

GloVe (2)
Proposed approach:
● Use the co-occurrence matrix as a starting point
● The ratio of co-occurrence probabilities distinguishes relevant words (e.g. solid vs. gas) better than the raw co-occurrence counts:
○ Use the ratio as the starting point for the algorithm
GloVe (3)
Find a function
● The function should calculate: F(w_i, w_j, c_k) = P_ik / P_jk (c_k is the context word vector)
● The ratio should be encoded in the word vector space:
○ Since the vector spaces are linear, the function should combine the two word vectors:
■ F(w_i - w_j, c_k) = P_ik / P_jk
● The left side takes vectors, the right side is a scalar:
○ How do we combine them and keep the linearity? F could be a complicated function, but we need to keep the linearity:
■ To avoid that, use the dot product: F((w_i - w_j)^T c_k) = P_ik / P_jk

GloVe (4)
Improve the function
● The function is required to be a homomorphism:
■ F((w_i - w_j)^T c_k) = F(w_i^T c_k) / F(w_j^T c_k)
● Given X, the matrix of word-to-word co-occurrences:
■ F(w_i^T c_k) = P_ik = X_ik / X_i
■ With F being the exponential (solving the ratio above):
● w_i^T c_k = log(P_ik) = log(X_ik) - log(X_i)
● Add two bias terms to absorb log(X_i) and simplify the first function:
○ w_i^T c_k + b_i + b_k = log(X_ik)
● Objective function:
■ J = Σ_{i,j} f(X_ij) · (w_i^T c_j + b_i + b_j - log(X_ij))²
GloVe (5)
Improve the function
● Weighting: f(x) = (x / x_max)^alpha if x < x_max, else 1
● The best alpha is 3/4: this looks quite similar to the word2vec negative-sampling exponent

fastText
Use n-grams to share weights between words
● The goal is to share information between words
● Split the words into character n-grams
○ Distinguish the prefix and suffix with ‘<’ and ‘>’ respectively
○ Example of 3-grams for the word TUGraz:
■ ‘<TU’, ‘TUG’, ‘UGr’, ‘Gra’, ‘raz’, ‘az>’
○ Select all n-grams with ‘n’ between 3 and 6 (inclusive)
■ Also select the word itself: ‘<TUGraz>’
● Each word is represented as a bag of n-grams
● Skip gram architecture of Word2Vec
● Allows computing word representations for words that don’t occur in the training data (see the sketch below)
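A minimal sketch of the n-gram extraction (an illustrative helper, not the fastText API):

```python
# fastText-style character n-grams: wrap the word in '<' and '>' and
# take all n-grams with 3 <= n <= 6, plus the word itself.
def char_ngrams(word, n_min=3, n_max=6):
    wrapped = f"<{word}>"
    grams = {wrapped[i:i + n]
             for n in range(n_min, n_max + 1)
             for i in range(len(wrapped) - n + 1)}
    grams.add(wrapped)
    return grams

print(sorted(char_ngrams("TUGraz")))  # '<TU', 'TUG', ..., 'az>', '<TUGraz>'
```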
Context2Vec
Words have different representations in different contexts
● Use a bidirectional LSTM to learn the missing word
● Similar to CBOW, but in this case we use an LSTM
● Later, a multilayer perceptron is added to capture complex patterns
● Similar to Word2Vec (CBOW)
○ Using random sampling to train the weights of the network

Context2Vec (2)
Results
● Sentence embeddings are close to the term embeddings
● Outperforms the context representation of averaged word embeddings
● Surpasses or nearly reaches state-of-the-art results on sentence completion, lexical substitution, and word sense disambiguation tasks

Second Generation of Word Embeddings

TagLM (Pre-ELMo)
Pre-train network
● Three basic steps:
○ Pre-train word embeddings and LM embeddings on large corpora
○ Extract word embeddings and Language Model embeddings for a given input sequence
○ Use them in a supervised task

TagLM (2)
Pre-train network
● For each token:
○ Concatenate a character-based embedding and a word embedding
■ The character representation captures morphological information (CNN or RNN)
● Use these embeddings in a bidirectional LSTM
○ Concatenate the two outputs for each layer
○ 2-layer LSTM
● Stacked bidirectional LSTMs

TagLM (3)
Architecture
● Pretrain a bidirectional language model:
○ Use the pre-trained embeddings (character and word level)
○ At the second level of the bi-RNN, concatenate the two outputs
● A complex sequence tagger with a CRF loss

TagLM (4)
Architecture
[Figure: TagLM architecture]

ELMo (Embeddings from LM)
Pre-train network
● As previously, use a forward and a backward LSTM
● The formulation of the problem:
○ Maximize the log-likelihood of the forward and backward directions:
■ Σ_k [ log p(t_k | t_1, ..., t_{k-1}; Θ_x, Θ_fwd-LSTM, Θ_s) + log p(t_k | t_{k+1}, ..., t_N; Θ_x, Θ_bwd-LSTM, Θ_s) ]
○ Θ_x => parameters of the token representation
○ Θ_s => parameters of the softmax layer

ELMo (2)
Pre-train network
● Different from the previous approach:
○ Shares some weights between the directions, instead of completely independent variables
● For each token t_k, the L-layer biLM computes a set of 2L + 1 representations:
■ R_k = { x_k, h_fwd_{k,j}, h_bwd_{k,j} | j = 1, ..., L } = { h_{k,j} | j = 0, ..., L }
○ Where h_{k,0} = x_k => a token-independent representation (CNN over characters)

ELMo (3)
Pre-train network
● ELMo collapses all the layers into one vector
● For a specific task, ELMo calculates:
■ ELMo_k^task = gamma^task · Σ_{j=0..L} s_j^task · h_{k,j}
○ A combination of all the layers, so not just the last layer
○ gamma^task => allows the model to scale the entire output
■ It is important to optimize it with the whole process (the authors claim)
○ s_j^task => softmax-normalized weights

ELMo (4)
Pre-train network
● Given a task and the pretrained LM, ELMo does:
○ Run the LM on each token and record the output of each layer
○ Then let the architecture for the specific task learn from the representation:
■ Because most of the tasks share the same architecture at the lowest level
■ The model then forms a task-sensitive representation
○ The weights of ELMo’s LM are frozen:
■ The ELMo_k^task vectors are then concatenated with the token representations x_k and fed to the task architecture
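A minimal numpy sketch of the layer mixing above (stand-in activations; shapes are illustrative):

```python
# ELMo's task-specific combination: a softmax-weighted sum over all
# biLM layers, scaled by a learned gamma.
import numpy as np

L = 2                                    # biLM layers (plus layer 0)
layers = np.random.rand(L + 1, 10, 512)  # (layers, tokens, dim) activations
s_raw = np.zeros(L + 1)                  # learned per-layer scores
gamma = 1.0                              # learned scalar

s = np.exp(s_raw) / np.exp(s_raw).sum()           # softmax weights s_j
elmo = gamma * np.einsum("j,jtd->td", s, layers)  # one vector per token
print(elmo.shape)                                 # (10, 512)
```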
ULMFiT
Universal Language Model Fine-Tuning
● Published at almost the same time as ELMo
● Similar idea:
○ Transfer Learning with task specific tuning
○ Language models capture many phenomena useful for downstream tasks:
■ Long-term dependencies
■ Hierarchical relations
■ Sentiment

ULMFiT (2)
Universal Language Model Fine-Tuning
● Consists of three steps:
○ Language model training on a general-domain corpus
○ Target task LM fine-tuning
○ Target task classifier fine-tuning

ULMFiT (3)
Steps of ULMFiT
● Target task LM fine-tuning:
○ Discriminative fine-tuning:
■ Each layer with different learning rate
● Empirically found alpha_l-1 = alpha_l/2.6
■ Slanted triangular learning rates:
● For adapting to task-specific features and quickly converging to that region (a sketch of the schedule follows below)
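A sketch of the slanted triangular schedule as defined in the ULMFiT paper (hyperparameter defaults from the paper; variable names are mine):

```python
# Slanted triangular learning rate: a short linear warm-up (a fraction
# cut_frac of all iterations), then a long linear decay.
def stlr(t, T, cut_frac=0.1, ratio=32, eta_max=0.01):
    cut = int(T * cut_frac)
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return eta_max * (1 + p * (ratio - 1)) / ratio

T = 1000  # total training iterations
print([round(stlr(t, T), 5) for t in (0, 50, 100, 500, 999)])
```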
ULMFiT (4)
● Target task classifier fine-tuning:
○ Concat pooling (for not losing the information in the last states):
■ Concatenates the last hidden states with a maxpool and mean pool of all the hidden states (Doesn’t it look similar to ELMo?)
○ Gradual unfreeze:
■ To avoid catastrophic forgetting:
● First epoch unfreeze the last layer (contains the least general knowledge) and fine-tune it
● Go through each layer and unfreeze them one by one, top-down

ULMFiT (5)
[Figure: the three stages of ULMFiT]

BERT (Introduction)
Encoder-Decoder architecture
● Used for Machine Translation or sequence tagging
● Encoder:
○ Learns the representations of the words
● Decoder:
○ Generates the new/translated sequence
● Traditionally:
○ A bi-RNN with some attention
● Transformer:
○ Attention only

BERT (Introduction 2)
Transformer
● Uses only attention to learn the connections between words
● Architecture:
○ 6 layers of encoders & 6 of decoders
● Encoders:
○ Self-attention & feed-forward layers
● Decoder:
○ Self-attention, encoder-decoder attention & feed-forward layers
● Tokens are encoded with position embeddings:
○ So the system learns how to connect the words with each other

BERT (Introduction 3)
[Figure: the Transformer, animated]

BERT
From the rich to the poor
● Two steps for BERT:
○ Pre-train
○ Fine-tune
● Architecture:
○ Based on the famous paper “Attention is All You Need”
○ Multi-layer bidirectional Transformer encoder
● Pretrained on a large dataset, so users don’t have to spend the time/resources to do so

BERT (2)
● Two tasks for the pretraining:
○ Masked LM
■ Hide (or mask) a word with a probability of 15% and try to identify it
○ Next sentence prediction:
■ Given two sentences, the network tries to predict whether one follows the other:
● Seems intended as preparation for tasks like Question Answering
● Easy to generate (50% for each sentence)
● IMPORTANT: BERT tunes all the parameters, so the task is trained end-to-end

BERT (3)
● Token representations are constructed by summing the token embedding, segment embedding, and position embedding
● Uses WordPiece tokenization (see the sketch below)
○ Used to handle Out-Of-Vocabulary words:
■ Word: walking => walk ##ing; walked => walk ##ed
● Training:
[Figure: BERT training setup]
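A minimal WordPiece sketch with the Hugging Face tokenizer for the ‘bert-base-uncased’ checkpoint (an assumption; any BERT checkpoint behaves similarly):

```python
# In-vocabulary words stay whole; OOV words are split into subword pieces.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("walking"))  # in vocabulary: ['walking']
print(tokenizer.tokenize("TUGraz"))   # OOV: pieces, e.g. ['tu', '##gra', '##z']
```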
Transformer XL
Shallow description
● Problems with the previous solutions:
○ Fixed length of the context for the Language Model
● How can we make a Transformer encode an arbitrarily long text into a single output?
○ Simple approach:
■ Split the text to smaller chunks
● Problem: the dependency length is upper-bounded by the segment length

Transformer XL (2)
● Add a recurrence to the Transformer Architecture:
● How to encode each segment:
○ Keep in memory the output of the hidden states of the previous run:
■ Extends the usable context range

Worth mentioning
Similar to BERT
● RoBERTa:
○ Improves the training methodology of BERT
○ Trained on roughly 10 times more data (160GB vs. 16GB of text)
● DistilBERT:
○ Uses half of the parameters of BERT
○ Achieves 97% of performance compared to BERT
○ A good tradeoff

Worth mentioning (2)
[Figure: model sizes; the Y-axis shows the number of parameters in millions]

Multilinguality
Transfer learning for non-contextual embeddings
● Transfer embeddings from one language to another:
○ Advantages:
■ Transfer Learning
■ Machine Translation
■ Cross Lingual entity linking
○ The basic idea in most of the papers is to learn a matrix transformation W:
■ Exploits the linearity of the spaces
Multilinguality (2)
Example with a small amount of parallel data
● Learn the transformation either with a neural network or with a closed-form algebraic solution (see the sketch below)
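A minimal numpy sketch of the closed-form solution (orthogonal Procrustes over random stand-in vectors; real use pairs embeddings of word translations):

```python
# Rows of X and Y are embeddings of aligned word pairs in the source
# and target language; W maps the source space onto the target space.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 50))  # source-language vectors (stand-ins)
Y = rng.random((100, 50))  # target-language vectors (stand-ins)

U, _, Vt = np.linalg.svd(Y.T @ X)
W = U @ Vt                 # orthogonal map minimizing ||X W^T - Y||_F
print(np.linalg.norm(X @ W.T - Y))
```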
Multilinguality (3)
Multilingual BERT
● Uses a single multilingual vocabulary
○ The WordPiece tokenizer helps to capture semantics across languages:
■ To see the effect: try a NER task with M-BERT and compare it with English BERT
■ M-BERT’s pretraining on multiple languages has enabled a representational capacity deeper than simple vocabulary memorization
● Nice effect:
○ A task trained in one language performs well in the others

Multilinguality (4)
Multilingual ULMFiT (MultiFiT)
● Similar to ULMFiT, but uses a Quasi-RNN (QRNN) instead of an LSTM
○ A QRNN alternates convolutional layers and a recurrent pooling function
○ Outperforms the LSTM (and is up to 16 times faster)
● ULMFiT is restricted to words:
○ MultiFiT uses subword tokenization (sounds familiar?)
○ A new variant of the slanted triangular learning rate + gradual unfreezing
■ A cosine variant of the one-cycle policy

Multilinguality (5)
Multilingual ULMFiT (MultiFiT 2)
● For the case where a pre-trained cross-lingual model and source-language data are available:
○ The authors propose a multi-step bootstrapping approach:
■ Check the appendix for more details

Thank you for your Attention (is All I Need!)

Bootstrapping for MultiFiT