NLP: The bread and the butter

Andi Rexha • 07.11.2019

Agenda

Traditional NLP

● Text preprocessing
● Features’ type
● Bag-of-words model
● External Resources
● Sequential classification
● Other tasks (MT, LM, Sentiment)

Embeddings

● First Generation (W2v)
● Second Generation (ELMo, BERT, ...)
● Multilinguality
○ Space transformation
○ Multilingual BERT, MultiFiT

Preprocessing

How to preprocess text?

● Divide et impera approach:

○ How do we (humans) split the text to analyse it?

■ Word split

■ Sentence split

■ Paragraphs, etc

○ Is there any other information that we can collect?

Preprocessing (2)

Other preprocessing steps

● Stemming / Lemmatization (Lemmatization is usually too expensive)

● Part of Speech Tagging (PoS)

● Chunking/Constituency

● Dependency Parsing

Preprocessing (3)

● Stemming
○ The process of bringing the inflected words to their common root:
■ Producing => produc; are => are
● Lemmatization
○ Bringing the words to the same lemma:
■ am, is, are => be
● Part of Speech Tagging (PoS)
○ Assign a grammatical tag to each word

Preprocessing Example

Sentence “There are different examples that we might use!”

[Example figure: the sentence above after preprocessing (tokenization), PoS tagging and lemmatization]
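A minimal sketch of these steps with spaCy (an assumption: spaCy and its small English model en_core_web_sm are installed; any tokenizer/tagger/lemmatizer would do):

```python
# Minimal preprocessing sketch with spaCy (assumes: pip install spacy
# and python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("There are different examples that we might use!")

# Sentence split
for sent in doc.sents:
    print("Sentence:", sent.text)

# Word split, PoS tagging (Penn tags in token.tag_), lemmatization
for token in doc:
    print(f"{token.text:12} {token.tag_:5} {token.lemma_}")
```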

Definition of the Penn Treebank PoS tags: https://sites.google.com/site/partofspeechhelp/#TOC-VBZ

Preprocessing (4)

Other preprocessing steps

● Shallow parsing (Chunking):

○ A shallow syntactic parse that adds a tree structure over the PoS tags

○ It first identifies the constituents and then their relations

● Deep Parsing (Dependency Parsing):

○ Parses the sentence in its grammatical structure

○ “Head” - “Dependent” form

○ It is a directed acyclic graph (mostly implemented as a tree)

Preprocessing Example (2)

Sentence

● “There are different examples that we might use!”

Constituency parsing

[Figure: constituency parse tree of the sentence]

Dependency Parsing

[Figure: dependency parse of the sentence]

Run examples under: https://corenlp.run/
CoreNLP library: https://stanfordnlp.github.io/CoreNLP/
Dependency parser tags (UD): https://nlp.stanford.edu/software/dependencies_manual.pdf

Bag-of-words Model

A major paradigm in NLP and IR

● The text is considered to be a bag (multiset) of its words

● Grammatical dependencies are ignored

Representation of features

● Dictionary based (Nominal features)

● One hot encoded/Frequency encoded

● What is the difference from standard Data Mining problems? (a small encoding sketch follows below)
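A minimal bag-of-words sketch, assuming scikit-learn (>= 1.0) is available; the vocabulary plays the role of the dictionary and the rows are frequency-encoded features:

```python
# Frequency-encoded bag-of-words with scikit-learn (a sketch; any
# dictionary + counting loop would do the same job).
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "John likes to watch movies.",
    "Mary likes movies too.",
]

vectorizer = CountVectorizer()           # builds the dictionary (vocabulary)
X = vectorizer.fit_transform(sentences)  # one row per sentence, one column per word

print(vectorizer.get_feature_names_out())  # the dictionary
print(X.toarray())                         # frequency-encoded features
```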

Bag-of-words Model (2)

[Example figure: sentences, the extracted features, and their feature representation for ML]

Source: https://en.wikipedia.org/wiki/Bag-of-words_model

Feature encoding

PoS tagging

● Option (Word + PoS tag) as part of dictionary:

○ Example: John-PN

Dependency Parsing

● Dependency between two words:

○ Use it as an n-gram feature

● Option (Word + Dependency Path) as part of dictionary:

○ Example: use-nsubj-acl:relcl (see the sketch below)
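A toy sketch of this kind of feature encoding; the annotated tuples below are hand-written stand-ins, not the output of a real tagger or parser:

```python
# Toy sketch: turning PoS tags and dependency edges into bag-of-features
# dictionary entries (the annotations below are hand-written examples).
tagged = [("John", "PN"), ("likes", "VBZ"), ("movies", "NNS")]
deps = [("likes", "nsubj", "John"), ("likes", "obj", "movies")]

features = set()
for word, pos in tagged:
    features.add(f"{word}-{pos}")                   # e.g. "John-PN"
for head, relation, dependent in deps:
    features.add(f"{head}-{relation}-{dependent}")  # dependency n-gram feature

print(sorted(features))
```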

External Resources

There is linguistic information that we miss:

● Synonyms, antonyms, hyponyms, hypernyms, ...

External resources can help us to enrich our vocabulary

● Wordnet: A lexical database for English, which groups words in synsets

● Wiktionary: A free multilingual dictionary enriched with relations between words

● We can enrich the features of our examples with their synonyms (a small WordNet sketch follows below)
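A small sketch of vocabulary enrichment with WordNet via NLTK (assumes nltk is installed and the wordnet corpus has been downloaded):

```python
# Sketch: enriching features with WordNet synonyms via NLTK
# (assumes: pip install nltk, plus nltk.download("wordnet")).
from nltk.corpus import wordnet as wn

def synonyms(word):
    """Collect lemma names from all synsets of `word`."""
    return {lemma.name() for synset in wn.synsets(word) for lemma in synset.lemmas()}

print(synonyms("company"))  # e.g. {'company', 'troupe', ...}
```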

Sequential Classification

Cases when we need to classify a sequence of tokens:

○ Example: Extract the names of companies from documents (open domain)

How do we model?

● Classify each token as part or not part of the information:
○ The classification of the current token depends on the classification of the previous one
○ Sequential classifier
○ Still not enough; we need to encode the output
○ We need to know where every “annotation” starts and ends

Sequential Classification (2)

Two schemas for encoding the output

● BILOU: Beginning, Inside, Last, Outside, Unit

● BIO(most used): Beginning, Inside, Outside

● BILOU has been shown to perform better on some datasets

● Example: “The Know Center GmbH is a spinoff of TUGraz.”

○ BILOU: The-O; Know-B; Center-I; GmbH-L; is-O; a-O; spinoff-O; of-O; TUGraz-U

○ BIO: The-O; Know-B; Center-I; GmbH-I; is-O; a-O; spinoff-O; of-O; TUGraz-B

● Sequential classifiers: Hidden Markov Models, CRFs, etc. (a BIO-encoding sketch follows below)

1: Named Entity recognition: https://www.aclweb.org/anthology/W09-1119.pdf
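A minimal sketch of the BIO encoding for the example above; the token spans are hand-written and `to_bio` is a hypothetical helper, not part of any library:

```python
# Sketch: encoding an annotated sentence with the BIO schema.
# `spans` are (start, end) token indices of the entities (hand-written here).
tokens = ["The", "Know", "Center", "GmbH", "is", "a", "spinoff", "of", "TUGraz", "."]
spans = [(1, 4), (8, 9)]  # "Know Center GmbH", "TUGraz"

def to_bio(tokens, spans):
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return list(zip(tokens, tags))

print(to_bio(tokens, spans))
# [('The', 'O'), ('Know', 'B'), ('Center', 'I'), ('GmbH', 'I'), ..., ('TUGraz', 'B'), ('.', 'O')]
```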

Machine Translation

Task presentation

● Translate text from one language to another

● Different approaches:

○ Rule based (Usually by using a dictionary)

○ Statistical (involving bilingual aligned corpora)

■ IBM models (1-6) for aligning and training

○ Hybrid (The use of the two previous techniques)

Sentiment Analysis

Task presentation

● Assign a sentiment to a piece of text:

○ Binary (like/dislike)

○ Rating based (e.g. 1-5)

● Assign the sentiment to a target phrase:

○ Usually involving features around the target

● External resources:

○ SentiWordNet: http://sentiwordnet.isti.cnr.it/

Language Model

Task presentation

● Generating the next token of a sequence

● Usually based on collecting co-occurrence statistics of words within a window:

○ Statistics are collected and the next word is predicted based on this information

○ Mainly, it models the probability of the sequence: P(w_1, \dots, w_N) = \prod_i P(w_i \mid w_1, \dots, w_{i-1})

● In traditional approaches it is solved with an n-gram approximation: P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \dots, w_{i-1})

○ Usually solved by combining different sizes of n-grams and weighting them (a toy sketch follows below)
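A toy sketch of the n-gram idea with bigram counts (no smoothing or back-off, which a real model would need):

```python
# Toy n-gram language model sketch (bigrams): collect co-occurrence counts
# and predict the most likely next word.
from collections import Counter, defaultdict

corpus = "the cat is walking in the bedroom . a dog was running in a room .".split()

bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def next_word(prev):
    """P(w | prev) approximated by relative bigram frequency."""
    counts = bigrams[prev]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.most_common(3)}

print(next_word("the"))  # e.g. {'cat': 0.5, 'bedroom': 0.5}
```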

Dense word representation

From sparse to dense

Topic modeling

● Since LSA (Latent Semantic Analysis) — see the sketch below:
○ These methods use low-rank approximations to decompose large matrices that capture statistical information about a corpus
● Other methods came later:
○ pLSA (Probabilistic Latent Semantic Analysis)
■ Uses a probabilistic model instead of SVD (Singular Value Decomposition)
● LDA (Latent Dirichlet Allocation):
○ A Bayesian version of pLSA
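A minimal LSA-style sketch with scikit-learn: a low-rank (truncated SVD) decomposition of a term-document matrix; the tiny corpus is illustrative only:

```python
# LSA sketch: low-rank approximation of a term-document matrix with
# truncated SVD (scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stocks fell on the market",
    "investors buy stocks and bonds",
]

X = TfidfVectorizer().fit_transform(docs)  # sparse term statistics
lsa = TruncatedSVD(n_components=2)         # low-rank decomposition
doc_topics = lsa.fit_transform(X)          # dense "topic" space

print(doc_topics.round(2))  # each document as a 2-dimensional dense vector
```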

Neural embeddings

A Neural Probabilistic Language Model

● Language models suffer from the “curse of dimensionality”:
○ The word sequence that we want to predict is likely to be different from the ones we have seen in training
○ Seeing “The cat is walking in the bedroom” should help us generate “A dog was running in the room”:
■ Similar semantics and grammatical roles
● Bengio et al. implemented in 2003 the idea of Mnih and Hinton (1989):
○ Learned a language model and embeddings for the words

Neural embeddings (2)

Bengio’s architecture

● Approximate a function with a window approach

● Model the approximation with a neural network

● Input Layer in a 1-hot-encoding form

● Two hidden layers (first more of a random initialization)

● A tanh intermediate layer

Neural embeddings (3)

Bengio’s architecture

● A final softmax layer:
○ Outputs the next word in the sequence
● Learned word representations for ~18K words from a corpus of almost 1M words
● IMPORTANT linguistic theory (the distributional hypothesis):
○ Words that tend to occur in similar linguistic contexts tend to resemble each other in meaning

Word2vec

● A shallow neural model (2 layers) that computes dense vector representations of words
● Two different architectures:
○ Continuous Bag-of-Words Model (CBOW) (faster)
■ Predict the middle word from a window of context words
○ Skip-gram Model (better with small amounts of data)
■ Predict the context words, given the middle word
● Models the probability of words co-occurring with the current (candidate) word
● The learned embedding is the output of the hidden layer
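A minimal training sketch with gensim (assuming the gensim >= 4.0 API); real training would use a corpus of millions of sentences:

```python
# Training Word2vec with gensim (a sketch; gensim >= 4.0 API assumed).
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "is", "walking", "in", "the", "bedroom"],
    ["a", "dog", "was", "running", "in", "a", "room"],
]  # in practice: millions of tokenized sentences

model = Word2Vec(
    sentences,
    vector_size=100,  # embedding dimension
    window=5,         # context window
    sg=1,             # 1 = skip-gram, 0 = CBOW
    negative=5,       # negative sampling
    min_count=1,
)

print(model.wv["cat"][:5])           # the learned dense vector (hidden-layer output)
print(model.wv.most_similar("cat"))  # nearest neighbours in the embedding space
```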

Word2vec (2)

[Figure: CBOW and Skip-gram architectures]

Word2vec (3)

● The output is a softmax function

● Three new techniques:

1. Subsampling of frequent words:

■ Each word w_i in the training set is discarded with probability P(w_i) = 1 - \sqrt{t / f(w_i)}

● f(w_i) is the frequency of word w_i and t (around 10^{-5}) is a threshold

● Rare words are more likely to be kept than frequent ones

■ Accelerates learning and even significantly improves the accuracy of the learned vectors of the rare words

Word2vec (4)

2. Hierarchical Softmax:
○ A tree approximation of the softmax, using a sigmoid at every step
○ Intuition: at every step decide whether to go right or left
○ O(log(n)) instead of O(n)
3. Negative sampling:
○ An alternative to Hierarchical Softmax (works better):
○ Brings up infrequent terms, squeezes down the probability of frequent terms
○ Updates the weights of only a small number of sampled terms, drawn with probability:
■ P(w_i) = f(w_i)^{3/4} / \sum_j f(w_j)^{3/4}
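A small numeric sketch of the two sampling tricks, following the formulas above (plain Python, toy corpus):

```python
# Sketch of the two sampling tricks: subsampling of frequent words and the
# smoothed unigram distribution used for negative sampling.
import math
from collections import Counter

corpus = "the the the the cat sat on the mat the dog sat".split()
counts = Counter(corpus)
total = sum(counts.values())
freq = {w: c / total for w, c in counts.items()}

t = 1e-5  # threshold from the Word2vec paper
discard_prob = {w: max(0.0, 1 - math.sqrt(t / f)) for w, f in freq.items()}
print(discard_prob["the"], discard_prob["cat"])  # frequent words are dropped more often

# Negative sampling: draw negatives proportionally to f(w)^(3/4)
z = sum(f ** 0.75 for f in freq.values())
neg_prob = {w: (f ** 0.75) / z for w, f in freq.items()}
print(neg_prob)  # infrequent terms get boosted relative to the raw unigram distribution
```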

Word2vec (5)

● A serendipitous effect of Word2vec is the linearity (analogy) between embeddings:

● The famous example: (King - Man) + Woman = Queen (see the sketch below)
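A sketch of the analogy test with pretrained vectors via gensim-data (an assumption: the glove-wiki-gigaword-100 dataset, roughly 130MB, is downloadable; the same call works with word2vec vectors):

```python
# Analogy sketch with pretrained vectors (downloads the dataset via
# gensim-data on first use; the model name is one of the standard datasets).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")

# (king - man) + woman ≈ ?
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# 'queen' is expected to be among the top results
```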

GloVe (Global Vectors)

Previous embeddings: advantages and drawbacks

● Methods similar to LSA:

○ Advantage: Learn statistical information

○ Drawback: Poor on analogy

● Word2Vec:

○ Advantages: Learn analogy

○ Drawback: Poor in learning statistical information

● GloVe combines the two approaches, removing their disadvantages

GloVe (2)

Proposed approach:

● Use the co-occurrence matrix as a starting point

● The ratio of co-occurrence probabilities distinguishes relevant words (e.g. solid vs. gas in the context of ice and steam) from unrelated words better than the raw co-occurrences do:

○ Use the ratio as the starting point for the algorithm

GloVe (3)

Find a function

● The function should calculate the ratio of co-occurrence probabilities:
○ F(w_i, w_j, \tilde{w}_k) = P_{ik} / P_{jk}
● The ratio should be encoded in the word vector space:
○ Since the vector spaces are linear, the function should combine the difference of the two word vectors:
■ F(w_i - w_j, \tilde{w}_k) = P_{ik} / P_{jk}
● The left side takes vectors, the right side is a scalar:
○ How do we combine them and keep the linearity? F could be a complicated function, but we need to keep the linear structure:
■ To avoid that, take the dot product: F((w_i - w_j)^T \tilde{w}_k) = P_{ik} / P_{jk}

GloVe (4)

Improve the function

● The function F is required to be a homomorphism:
○ F((w_i - w_j)^T \tilde{w}_k) = F(w_i^T \tilde{w}_k) / F(w_j^T \tilde{w}_k)
● Given X, the matrix of word-to-word co-occurrences:
■ F(w_i^T \tilde{w}_k) = P_{ik} = X_{ik} / X_i
■ With F being the exponential (taking the log of the ratio above):
● w_i^T \tilde{w}_k = \log(P_{ik}) = \log(X_{ik}) - \log(X_i)
● Add two bias terms to simplify the first function (absorbing \log(X_i)):
○ w_i^T \tilde{w}_k + b_i + \tilde{b}_k = \log(X_{ik})
● Objective function (weighted least squares):
○ J = \sum_{i,j} f(X_{ij}) (w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij})^2
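A numpy sketch of this objective on a toy co-occurrence matrix (random initialization; a real implementation would minimize J with AdaGrad/SGD):

```python
# Sketch of the GloVe objective, following the formulas above
# (numpy only; X is a toy co-occurrence matrix).
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 10                       # vocabulary size, embedding dimension
X = rng.integers(1, 50, size=(V, V)).astype(float)  # co-occurrence counts

W = rng.normal(size=(V, d))        # word vectors w_i
W_tilde = rng.normal(size=(V, d))  # context vectors w~_j
b = np.zeros(V)                    # biases b_i
b_tilde = np.zeros(V)              # biases b~_j

def weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting function f(X_ij)."""
    return (x / x_max) ** alpha if x < x_max else 1.0

# J = sum_ij f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2
J = sum(
    weight(X[i, j]) * (W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])) ** 2
    for i in range(V)
    for j in range(V)
)
print(J)  # in training, this loss is minimized with SGD / AdaGrad
```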

GloVe (5)

Improve the function

● The weighting function f(X_{ij}) = (X_{ij} / x_{max})^\alpha (capped at 1) down-weights rare and very frequent co-occurrences
● The best \alpha is 3/4: it looks quite similar to the exponent used in Word2Vec’s negative sampling

fastText

Use n-grams to share weights between words

● The goal is to share information between words

● Split the words in n-grams

○ Mark the beginning and the end of a word with ‘<’ and ‘>’ respectively

○ Example of 3-grams for the word TUGraz:
■ ‘<TU’, ‘TUG’, ‘UGr’, ‘Gra’, ‘raz’, ‘az>’
○ Select all n-grams with ‘n’ between 3 and 6 (inclusive)
■ Select also the word itself: ‘<TUGraz>’ (a small extraction sketch follows below)
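A small sketch of the n-gram extraction described above (plain Python; `char_ngrams` is a hypothetical helper, not the fastText API):

```python
# Sketch: extracting the character n-grams that fastText uses to represent
# a word (boundary markers '<' and '>', n between 3 and 6, plus the word itself).
def char_ngrams(word, n_min=3, n_max=6):
    marked = f"<{word}>"
    ngrams = {
        marked[i : i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    }
    ngrams.add(marked)  # the special sequence for the whole word
    return ngrams

print(sorted(char_ngrams("TUGraz")))
# the 3-grams include '<TU', 'TUG', 'UGr', 'Gra', 'raz', 'az>'
```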

fastText (2)

Use n-grams to share weights between words

● Each word is represented as a bag of n-grams

● Skip gram architecture of Word2Vec

● Allows computing word representations for words that don’t occur in the training data

Context2Vec

Words have different representations in different contexts

● Use a bidirectional LSTM to learn the missing word

● Similar to the CBOW, but in this case we use LSTM

● Later, a multilayer perceptron is added to capture complex patterns

● Similar to Word2Vec (CBOW)

○ Using random sampling to train the weights of the network

Context2Vec (2)

Results

● Sentence embeddings are close to the term embeddings
● Outperforms context representations based on averaged word embeddings
● Surpasses or nearly reaches state-of-the-art results on sentence completion, lexical substitution and word sense disambiguation tasks

Second Generation of Word Embeddings

TagLM (Pre-ELMo)

Pre-train network

● Three basic steps:

○ Pre-train word embeddings and LM embeddings on large corpora

○ Extract word embeddings and Language Model embeddings for a given input sequence

○ Use them in a supervised task

TagLM (2)

Pre-train network

● For each token:
○ Concatenate a character-based embedding and a pre-trained word embedding
■ The character representation captures morphological information (CNN or RNN)
● Use these embeddings in a bi-directional LSTM
○ Concatenate the two outputs for each layer
○ 2-layer LSTM
● Stacked bidirectional LSTMs

TagLM (3)

Architecture

● Pretrain a bidirectional Language Model:
○ Use pre-trained embeddings (character and word level)
○ At the second level of the bi-RNN, concatenate the two outputs
● A sequence-tagging architecture on top, trained with a CRF loss

TagLM (4)

Architecture

ELMo (Embeddings from LM)

Pre-train network

● As previously, use a forward and a backward LSTM
● The formulation of the problem:
○ Maximize the log likelihood of the forward and backward directions:
■ \sum_{k=1}^{N} [ \log p(t_k \mid t_1, \dots, t_{k-1}; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1}, \dots, t_N; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) ]

○ \Theta_x => parameters for the representation of the tokens
○ \Theta_s => parameters for the softmax layer

ELMo (2)

Pre-train network

● Different from the previous approach:
○ Shares some weights between the directions, instead of using completely independent parameters
● For each token t_k, the L-layer biLM computes a set of 2L + 1 representations:
○ R_k = \{ x_k^{LM}, \overrightarrow{h}_{k,j}^{LM}, \overleftarrow{h}_{k,j}^{LM} \mid j = 1, \dots, L \} = \{ h_{k,j}^{LM} \mid j = 0, \dots, L \}
○ Where x_k^{LM} = h_{k,0}^{LM} => a context-independent token representation (CNN over characters)

ELMo (2)

Pre-train network

● ELMo collapses all the layers into one:
● For a specific task, ELMo calculates:

○ A combination of all the layers, not just the last layer:
■ ELMo_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task} h_{k,j}^{LM} (a small numeric sketch follows below)
○ \gamma^{task} => allows the task model to scale the entire output
■ It is important to optimize it in the whole process (the authors claim)
○ s_j^{task} => softmax-normalized weights over the layers

ELMo (3)

Pre-train network

● Given a task and the pretrained LM, ELMo does the following:
○ Run the LM on each token and record the output of each layer
○ Then let the task-specific architecture learn from the representation:
■ Because most of the tasks share the same architecture at the lowest level
■ The model then forms a task-sensitive representation
○ The weights of ELMo’s LM are frozen:
■ The ELMo_k^{task} vectors are concatenated with the token representations x_k and fed to the task architecture
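A small numpy sketch of the task-specific combination above; the layer activations are random stand-ins for the frozen biLM outputs:

```python
# Sketch of the task-specific ELMo combination: a softmax-weighted sum of
# the biLM layers, scaled by gamma (numpy only, one token).
import numpy as np

L, dim = 2, 8                      # number of biLM layers, hidden size
rng = np.random.default_rng(0)
h = rng.normal(size=(L + 1, dim))  # h_{k,0} (token repr.) ... h_{k,L} for one token k

s_raw = np.zeros(L + 1)            # learnable scalars, one per layer
gamma = 1.0                        # learnable task-specific scale

s = np.exp(s_raw) / np.exp(s_raw).sum()        # softmax-normalized weights s_j
elmo_k = gamma * (s[:, None] * h).sum(axis=0)  # ELMo_k^task

print(elmo_k.shape)  # (8,) -- concatenated with x_k and fed to the task model
```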

ULMFiT

Universal Language Model Fine-Tuning

● Appeared at almost the same time as ELMo

● Similar idea:

○ Transfer Learning with task specific tuning

○ Language Models capture a lot of information relevant to downstream tasks:

■ Long-term dependencies

■ Hierarchical relations

■ Sentiment

ULMFiT (2)

Universal Language Model Fine-Tuning

● Consists of three steps:

○ Language Model trained in a general domain

○ Target task LM fine tuning

○ Target task classifier fine-tuning

ULMFiT (3)

Steps of ULMFiT

● Target task LM fine tuning:

○ Discriminative fine-tuning:

■ Each layer gets a different learning rate

● Empirically found: \alpha_{l-1} = \alpha_l / 2.6

■ Slanted triangular learning rates:

● For adapting to task-specific features and quickly converging to a good region of the parameter space (see the sketch below)
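A sketch of the two learning-rate tricks, following the formulas reported in the ULMFiT paper (the constants below are assumptions based on the paper's defaults):

```python
# Sketch of ULMFiT's two learning-rate tricks: discriminative fine-tuning
# (lower LR for lower layers) and the slanted triangular schedule.
def discriminative_lrs(eta_top, n_layers, factor=2.6):
    """eta_{l-1} = eta_l / 2.6, from the top layer down."""
    lrs = [eta_top]
    for _ in range(n_layers - 1):
        lrs.append(lrs[-1] / factor)
    return lrs[::-1]  # ordered bottom -> top

def slanted_triangular(t, total_steps, eta_max=0.01, cut_frac=0.1, ratio=32):
    """Short linear warm-up followed by a long linear decay."""
    cut = int(total_steps * cut_frac)
    p = t / cut if t < cut else 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return eta_max * (1 + p * (ratio - 1)) / ratio

print(discriminative_lrs(eta_top=0.01, n_layers=4))
print([round(slanted_triangular(t, total_steps=100), 4) for t in (0, 10, 50, 99)])
```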

ULMFiT (4)

● Target task classifier fine-tuning:

○ Concat pooling (to avoid losing the information in the last states):

■ Concatenates the last hidden state with a max-pool and a mean-pool of all the hidden states (doesn’t it look similar to ELMo?)

○ Gradual unfreeze:

■ To avoid catastrophic forgetting:

● In the first epoch, unfreeze the last layer (it contains the least general knowledge) and fine-tune it

● Then go through the remaining layers, unfreezing them one by one, top-down

ULMFiT (5)

BERT (Introduction)

Encoder-Decoder architecture

● Used for Machine Translation or Sequence Tagging
● Encoder:
○ Learns the representations of the words
● Decoder:
○ Generates the new/translated sequence
● Traditionally:
○ bi-RNN with some attention
● Transformer:
○ Only via attention

BERT (Introduction 2)

Transformer

● Uses only attention to learn the connections between words
● Architecture:
○ 6 layers of encoders & 6 of decoders
● Encoders:
○ Self-attention & feed-forward layers
● Decoder:
○ Self-attention, encoder-decoder attention & feed-forward layers
● Tokens are encoded with position embeddings:
○ So the system learns how to relate the words to each other

BERT (Introduction 3)

Transformer in a GIF

BERT

From the rich to the poor

● Two steps for BERT:
○ Pre-train

○ Fine-tune

● Architecture:
○ Based on the famous paper “Attention is All You Need”

○ Multi-layer bidirectional Transformer encoder

● Pretrained on a large dataset, so that users don’t have to spend time/resources to do so

BERT (2)

● Two tasks for the pretraining:

○ Masked LM

■ Hide (or mask) a word with a probability of 15% and try to identify it

○ Next sentence prediction:

■ Given two sentences, the network tries to predict whether one follows the other:

● Seems to help fine-tuning for tasks such as Question Answering and Natural Language Inference

● Easy to generate (50% actual next sentence, 50% random)

● IMPORTANT: BERT tunes all the parameters, so the task is trained end-to-end

BERT (3)

● Token representations are constructed by summing the token embedding, segment embedding, and position embedding
● Uses WordPiece tokenization (a small tokenization sketch follows below)
○ Used to handle Out-Of-Vocabulary words:
■ Example: walking => walk ##ing; walked => walk ##ed
● Training:
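A small sketch of WordPiece tokenization and masked-word prediction with the Hugging Face transformers library (an assumption: transformers is installed and the bert-base-uncased weights can be downloaded; the exact subword splits depend on the vocabulary):

```python
# Sketch of WordPiece tokenization and masked-word prediction with
# Hugging Face transformers (downloads the bert-base-uncased weights).
from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("The spinoff TUGraz was founded recently."))
# rare words are split into subword pieces marked with '##'

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for pred in unmasker("The capital of Austria is [MASK].")[:3]:
    print(pred["token_str"], round(pred["score"], 3))
```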

Transformer XL

Shallow description

● Problems with previous solutions:

○ Fixed length of the context for the Language Model

● How to make a transformer to encode an arbitrary long text to a single output?

○ Simple approach:

■ Split the text to smaller chunks

● Problem: the dependency length is upper-bounded by the segment length

Transformer XL (2)

● Add a recurrence to the Transformer Architecture:

● How to encode each segment:

○ Keep in memory the output of the hidden states of the previous run:

■ Extends the range of dependencies beyond a single segment

Worth mentioning

Similar to BERT

● RoBERTa:

○ Improves the training methodology of BERT

○ Roughly 10 times more training data (160GB vs. 16GB)

● DistilBERT:

○ Uses about 40% fewer parameters than BERT

○ Achieves 97% of BERT’s performance

○ Good tradeoff

Worth mentioning (2)

[Figure: comparison of model sizes; the y-axis shows the number of parameters in millions]

Multilinguality

Transfer learning for non-contextual embeddings

● Transfer embeddings from one language to another:

○ Advantages:

■ Transfer Learning

■ Machine Translation

■ Cross Lingual entity linking

○ The basic idea for most of the papers is to learn a matrix transformation:

■ Exploits the linearity of the spaces

Multilinguality (2)

Example with a small amount of parallel data

● Learn the transformation either with a neural network or with a closed-form algebraic solution (see the sketch below)
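A numpy sketch of one common closed-form solution (orthogonal Procrustes via SVD); X and Y below are toy stand-ins for aligned source/target word vectors from a small bilingual dictionary:

```python
# Sketch of a closed-form solution for mapping one embedding space onto
# another (orthogonal Procrustes via SVD).
import numpy as np

rng = np.random.default_rng(0)
n_pairs, dim = 1000, 50
X = rng.normal(size=(n_pairs, dim))              # source-language vectors
true_W = np.linalg.qr(rng.normal(size=(dim, dim)))[0]
Y = X @ true_W                                   # target-language vectors (toy: exact rotation)

# W* = argmin ||XW - Y||_F s.t. W orthogonal  =>  W* = U V^T, with U S V^T = SVD(X^T Y)
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

print(np.allclose(X @ W, Y, atol=1e-6))          # the learned map recovers the rotation
```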

Multilinguality (3)

Multilingual BERT

● Uses a single multilingual vocabulary
○ The WordPiece tokenizer helps to capture shared semantics between languages:

■ To see the effect: try a NER task with M-BERT compared to English-only BERT

■ M-BERT’s pretraining on multiple languages has enabled a representational capacity deeper than simple vocabulary memorization

● Nice effect:
○ A task trained in one language performs well in another

Multilinguality (4)

Multilingual ULMFiT (MultiFiT)

● Similar to ULMFiT, but uses a Quasi-RNN (QRNN) instead of an LSTM

○ QRNNs alternate convolutional layers and recurrent pooling functions

○ Outperforms LSTMs (and is about 16 times faster)

● ULMFiT is restricted to words:

○ MultiFiT uses subword tokenization (sounds familiar?)

○ A new variant of the slanted triangular learning rate + gradual unfreezing

■ A cosine variant of the one-cycle policy

Multilinguality (5)

Multilingual ULMFiT (MultiFiT 2)

● For the case where an existing pre-trained cross-lingual model and source-language data are available:
○ The authors propose a multi-step bootstrapping approach:
■ Check the appendix for more details

Thank you for your Attention (is All I Need!)

Bootstrapping for MultiFiT