NLP: The bread and the butter

Andi Rexha • 07.11.2019

Agenda

Traditional NLP

● Text preprocessing
● Features’ type
● Bag-of-words model
● External Resources
● Sequential classification
● Other tasks (MT, LM, Sentiment)

Embeddings

● First Generation (W2v)
● Second Generation (ELMo, BERT, ...)
● Multilinguality
○ Space transformation
○ Multilingual BERT, MultiFiT

Preprocessing

How to preprocess text?

● Divide et impera approach:

○ How do we (humans) split the text to analyse it?

■ Word split

■ Sentence split

■ Paragraphs, etc

○ Is there any other information that we can collect?

Preprocessing (2)

Other preprocessing steps

● Stemming / Lemmatization (Lemmatization is usually too expensive)

● Part of Speech Tagging (PoS)

● Chunking/Constituency

● Dependency Parsing

Preprocessing (3)

● Stemming
○ The process of bringing the inflected words to their common root:
■ Producing => produc; are => are
● Lemmatization
○ Bringing the words to the same lemma:
■ am, is, are => be
● Part of Speech Tagging (PoS)
○ Assign a grammatical tag to each word

Preprocessing Example

Sentence “There are different examples that we might use!”

[Example figure: the sentence above after preprocessing (tokenization), PoS tagging and lemmatization]
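A minimal sketch of these steps with spaCy (an assumption: spaCy and its small English model en_core_web_sm are installed; any tokenizer/tagger/lemmatizer would do):

```python
# Minimal preprocessing sketch with spaCy (assumes: pip install spacy
# and python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("There are different examples that we might use!")

# Sentence split
for sent in doc.sents:
    print("Sentence:", sent.text)

# Word split, PoS tagging (Penn tags in token.tag_), lemmatization
for token in doc:
    print(f"{token.text:12} {token.tag_:5} {token.lemma_}")
```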

Definition of the Penn Treebank PoS tags: https://sites.google.com/site/partofspeechhelp/#TOC-VBZ

Preprocessing (4)

Other preprocessing steps

● Shallow parsing (Chunking):

○ A shallow syntactic parse that adds a tree structure over the PoS tags

○ It first identifies the constituents and then their relations

● Deep Parsing (Dependency Parsing):

○ Parses the sentence in its grammatical structure

○ “Head” - “Dependent” form

○ It is a directed acyclic graph (mostly implemented as a tree)

Preprocessing Example (2)

Sentence

● “There are different examples that we might use!”

Constituency parsing

[Figure: constituency parse tree of the sentence]

Dependency Parsing

[Figure: dependency parse of the sentence]

Run examples under: https://corenlp.run/
CoreNLP library: https://stanfordnlp.github.io/CoreNLP/
Dependency parser tags (UD): https://nlp.stanford.edu/software/dependencies_manual.pdf

Bag-of-words Model

A major paradigm in NLP and IR

● The text is considered to be a bag (multiset) of its words

● Grammatical dependencies are ignored

Representation of features

● Dictionary based (Nominal features)

● One hot encoded/Frequency encoded

● What is the difference from standard Data Mining problems? (a small encoding sketch follows below)
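A minimal bag-of-words sketch, assuming scikit-learn (>= 1.0) is available; the vocabulary plays the role of the dictionary and the rows are frequency-encoded features:

```python
# Frequency-encoded bag-of-words with scikit-learn (a sketch; any
# dictionary + counting loop would do the same job).
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "John likes to watch movies.",
    "Mary likes movies too.",
]

vectorizer = CountVectorizer()           # builds the dictionary (vocabulary)
X = vectorizer.fit_transform(sentences)  # one row per sentence, one column per word

print(vectorizer.get_feature_names_out())  # the dictionary
print(X.toarray())                         # frequency-encoded features
```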

Bag-of-words Model (2)

[Example figure: sentences, the extracted features, and their feature representation for ML]

Source: https://en.wikipedia.org/wiki/Bag-of-words_model

Feature encoding

PoS tagging

● Option (Word + PoS tag) as part of dictionary:

○ Example: John-PN

Dependency Parsing

● Dependency between two words:

○ Use it as an n-gram feature

● Option (Word + Dependency Path) as part of dictionary:

○ Example: use-nsubj-acl:relcl (see the sketch below)
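A toy sketch of this kind of feature encoding; the annotated tuples below are hand-written stand-ins, not the output of a real tagger or parser:

```python
# Toy sketch: turning PoS tags and dependency edges into bag-of-features
# dictionary entries (the annotations below are hand-written examples).
tagged = [("John", "PN"), ("likes", "VBZ"), ("movies", "NNS")]
deps = [("likes", "nsubj", "John"), ("likes", "obj", "movies")]

features = set()
for word, pos in tagged:
    features.add(f"{word}-{pos}")                   # e.g. "John-PN"
for head, relation, dependent in deps:
    features.add(f"{head}-{relation}-{dependent}")  # dependency n-gram feature

print(sorted(features))
```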

External Resources

There is linguistic information that we miss:

● Synonyms, antonyms, hyponyms, hypernyms, ...

External resources can help us to enrich our vocabulary

● Wordnet: A lexical database for English, which groups words in synsets

● Wiktionary: A free multilingual dictionary enriched with relations between words

● We can enrich the features of our examples with their synonyms (a small WordNet sketch follows below)
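A small sketch of vocabulary enrichment with WordNet via NLTK (assumes nltk is installed and the wordnet corpus has been downloaded):

```python
# Sketch: enriching features with WordNet synonyms via NLTK
# (assumes: pip install nltk, plus nltk.download("wordnet")).
from nltk.corpus import wordnet as wn

def synonyms(word):
    """Collect lemma names from all synsets of `word`."""
    return {lemma.name() for synset in wn.synsets(word) for lemma in synset.lemmas()}

print(synonyms("company"))  # e.g. {'company', 'troupe', ...}
```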

Sequential Classification

Cases when we need to classify a sequence of tokens:

○ Example: Extract the names of companies from documents (open domain)

How do we model?

● Classify each token as part or not part of the information:
○ The classification of the current token depends on the classification of the previous one
○ Sequential classifier
○ Still not enough; we need to encode the output
○ We need to know where every “annotation” starts and ends

Sequential Classification (2)

Two schemas for encoding the output

● BILOU: Beginning, Inside, Last, Outside, Unit

● BIO(most used): Beginning, Inside, Outside

● BILOU has been shown to perform better on some datasets

● Example: “The Know Center GmbH is a spinoff of TUGraz.”

○ BILOU: The-O; Know-B; Center-I; GmbH-L; is-O; a-O; spinoff-O; of-O; TUGraz-U

○ BIO: The-O; Know-B; Center-I; GmbH-I; is-O; a-O; spinoff-O; of-O; TUGraz-B

● Sequential classifiers: Hidden Markov Models, CRFs, etc. (a BIO-encoding sketch follows below)

1: Named Entity recognition: https://www.aclweb.org/anthology/W09-1119.pdf
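A minimal sketch of the BIO encoding for the example above; the token spans are hand-written and `to_bio` is a hypothetical helper, not part of any library:

```python
# Sketch: encoding an annotated sentence with the BIO schema.
# `spans` are (start, end) token indices of the entities (hand-written here).
tokens = ["The", "Know", "Center", "GmbH", "is", "a", "spinoff", "of", "TUGraz", "."]
spans = [(1, 4), (8, 9)]  # "Know Center GmbH", "TUGraz"

def to_bio(tokens, spans):
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return list(zip(tokens, tags))

print(to_bio(tokens, spans))
# [('The', 'O'), ('Know', 'B'), ('Center', 'I'), ('GmbH', 'I'), ..., ('TUGraz', 'B'), ('.', 'O')]
```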

Machine Translation

Task presentation

● Translate text from one language to another

● Different approaches:

○ Rule based (Usually by using a dictionary)

○ Statistical (involving bilingual aligned corpora)

■ IBM models (1-6) for aligning and training

○ Hybrid (The use of the two previous techniques)

Sentiment Analysis

Task presentation

● Assign a sentiment to a piece of text:

○ Binary (like/dislike)

○ Rating based (e.g. 1-5)

● Assign the sentiment to a target phrase:

○ Usually involving features around the target

● External resources:

○ SentiWordNet: http://sentiwordnet.isti.cnr.it/

Language Model

Task presentation

● Generating the next token of a sequence

● Usually based on collecting co-occurrence statistics of words within a window:

○ Statistics are collected and the next word is predicted based on this information

○ Mainly, it models the probability of the sequence: P(w_1, \dots, w_N) = \prod_i P(w_i \mid w_1, \dots, w_{i-1})

● In traditional approaches it is solved with an n-gram approximation: P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \dots, w_{i-1})

○ Usually solved by combining different sizes of n-grams and weighting them (a toy sketch follows below)
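A toy sketch of the n-gram idea with bigram counts (no smoothing or back-off, which a real model would need):

```python
# Toy n-gram language model sketch (bigrams): collect co-occurrence counts
# and predict the most likely next word.
from collections import Counter, defaultdict

corpus = "the cat is walking in the bedroom . a dog was running in a room .".split()

bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def next_word(prev):
    """P(w | prev) approximated by relative bigram frequency."""
    counts = bigrams[prev]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.most_common(3)}

print(next_word("the"))  # e.g. {'cat': 0.5, 'bedroom': 0.5}
```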

Dense word representation

From sparse to dense

Topic modeling

● Since LSA (Latent Semantic Analysis) — see the sketch below:
○ These methods use low-rank approximations to decompose large matrices that capture statistical information about a corpus
● Other methods came later:
○ pLSA (Probabilistic Latent Semantic Analysis)
■ Uses a probabilistic model instead of SVD (Singular Value Decomposition)
● LDA (Latent Dirichlet Allocation):
○ A Bayesian version of pLSA
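A minimal LSA-style sketch with scikit-learn: a low-rank (truncated SVD) decomposition of a term-document matrix; the tiny corpus is illustrative only:

```python
# LSA sketch: low-rank approximation of a term-document matrix with
# truncated SVD (scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stocks fell on the market",
    "investors buy stocks and bonds",
]

X = TfidfVectorizer().fit_transform(docs)  # sparse term statistics
lsa = TruncatedSVD(n_components=2)         # low-rank decomposition
doc_topics = lsa.fit_transform(X)          # dense "topic" space

print(doc_topics.round(2))  # each document as a 2-dimensional dense vector
```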

Neural embeddings

A Neural Probabilistic Language Model

● Language models suffer from the “curse of dimensionality”:
○ The word sequence that we want to predict is likely to be different from the ones we have seen in training
○ Seeing “The cat is walking in the bedroom” should help us generate “A dog was running in the room”:
■ Similar semantics and grammatical roles
● Bengio et al. implemented in 2003 the idea of Mnih and Hinton (1989):
○ Learned a language model and embeddings for the words

Neural embeddings (2)

Bengio’s architecture

● Approximate a function with a window approach

● Model the approximation with a neural network

● Input Layer in a 1-hot-encoding form

● Two hidden layers (first more of a random initialization)

● A tanh intermediate layer

Neural embeddings (3)

Bengio’s architecture

● A final softmax layer:
○ Outputs the next word in the sequence
● Learned word representations for ~18K words from a corpus of almost 1M words
● IMPORTANT linguistic theory (the distributional hypothesis):
○ Words that tend to occur in similar linguistic contexts tend to resemble each other in meaning

Word2vec

● A shallow neural model (2 layers) that computes dense vector representations of words
● Two different architectures:
○ Continuous Bag-of-Words Model (CBOW) (faster)
■ Predict the middle word from a window of context words
○ Skip-gram Model (better with small amounts of data)
■ Predict the context words, given the middle word
● Models the probability of words co-occurring with the current (candidate) word
● The learned embedding is the output of the hidden layer
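A minimal training sketch with gensim (assuming the gensim >= 4.0 API); real training would use a corpus of millions of sentences:

```python
# Training Word2vec with gensim (a sketch; gensim >= 4.0 API assumed).
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "is", "walking", "in", "the", "bedroom"],
    ["a", "dog", "was", "running", "in", "a", "room"],
]  # in practice: millions of tokenized sentences

model = Word2Vec(
    sentences,
    vector_size=100,  # embedding dimension
    window=5,         # context window
    sg=1,             # 1 = skip-gram, 0 = CBOW
    negative=5,       # negative sampling
    min_count=1,
)

print(model.wv["cat"][:5])           # the learned dense vector (hidden-layer output)
print(model.wv.most_similar("cat"))  # nearest neighbours in the embedding space
```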

Word2vec (2)

[Figure: CBOW and Skip-gram architectures]

Word2vec (3)

● The output is a softmax function

● Three new techniques:

1. Subsampling of frequent words:

■ Each word w_i in the training set is discarded with probability P(w_i) = 1 - \sqrt{t / f(w_i)}

● f(w_i) is the frequency of word w_i and t (around 10^{-5}) is a threshold

● Rare words are more likely to be kept than frequent ones

■ Accelerates learning and even significantly improves the accuracy of the learned vectors of the rare words

Word2vec (4)

2. Hierarchical Softmax:
○ A tree approximation of the softmax, using a sigmoid at every step
○ Intuition: at every step decide whether to go right or left
○ O(log(n)) instead of O(n)
3. Negative sampling:
○ An alternative to Hierarchical Softmax (works better):
○ Brings up infrequent terms, squeezes down the probability of frequent terms
○ Updates the weights of only a small number of sampled terms, drawn with probability:
■ P(w_i) = f(w_i)^{3/4} / \sum_j f(w_j)^{3/4}
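A small numeric sketch of the two sampling tricks, following the formulas above (plain Python, toy corpus):

```python
# Sketch of the two sampling tricks: subsampling of frequent words and the
# smoothed unigram distribution used for negative sampling.
import math
from collections import Counter

corpus = "the the the the cat sat on the mat the dog sat".split()
counts = Counter(corpus)
total = sum(counts.values())
freq = {w: c / total for w, c in counts.items()}

t = 1e-5  # threshold from the Word2vec paper
discard_prob = {w: max(0.0, 1 - math.sqrt(t / f)) for w, f in freq.items()}
print(discard_prob["the"], discard_prob["cat"])  # frequent words are dropped more often

# Negative sampling: draw negatives proportionally to f(w)^(3/4)
z = sum(f ** 0.75 for f in freq.values())
neg_prob = {w: (f ** 0.75) / z for w, f in freq.items()}
print(neg_prob)  # infrequent terms get boosted relative to the raw unigram distribution
```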

Word2vec (5)

● A serendipitous effect of Word2vec is the linearity (analogy) between embeddings:

● The famous example: (King - Man) + Woman = Queen (see the sketch below)
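A sketch of the analogy test with pretrained vectors via gensim-data (an assumption: the glove-wiki-gigaword-100 dataset, roughly 130MB, is downloadable; the same call works with word2vec vectors):

```python
# Analogy sketch with pretrained vectors (downloads the dataset via
# gensim-data on first use; the model name is one of the standard datasets).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")

# (king - man) + woman ≈ ?
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# 'queen' is expected to be among the top results
```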

GloVe (Global Vectors)

Previous embeddings: advantages and drawbacks

● Methods similar to LSA:

○ Advantage: Learn statistical information

○ Drawback: Poor on analogy

● Word2Vec:

○ Advantages: Learn analogy

○ Drawback: Poor in learning statistical information

● GloVe combines the two approaches, removing their disadvantages

GloVe (2)

Proposed approach:

● Use the co-occurrence matrix as a starting point

● The ratio of co-occurrence probabilities distinguishes relevant words (e.g. solid vs. gas in the context of ice and steam) from unrelated words better than the raw co-occurrences do:

○ Use the ratio as the starting point for the algorithm

GloVe (3)

Find a function

● The function should calculate the ratio of co-occurrence probabilities:
○ F(w_i, w_j, \tilde{w}_k) = P_{ik} / P_{jk}
● The ratio should be encoded in the word vector space:
○ Since the vector spaces are linear, the function should combine the difference of the two word vectors:
■ F(w_i - w_j, \tilde{w}_k) = P_{ik} / P_{jk}
● The left side takes vectors, the right side is a scalar:
○ How do we combine them and keep the linearity? F could be a complicated function, but we need to keep the linear structure:
■ To avoid that, take the dot product: F((w_i - w_j)^T \tilde{w}_k) = P_{ik} / P_{jk}

GloVe (4)

Improve the function

● The function F is required to be a homomorphism:
○ F((w_i - w_j)^T \tilde{w}_k) = F(w_i^T \tilde{w}_k) / F(w_j^T \tilde{w}_k)
● Given X, the matrix of word-to-word co-occurrences:
■ F(w_i^T \tilde{w}_k) = P_{ik} = X_{ik} / X_i
■ With F being the exponential (taking the log of the ratio above):
● w_i^T \tilde{w}_k = \log(P_{ik}) = \log(X_{ik}) - \log(X_i)
● Add two bias terms to simplify the first function (absorbing \log(X_i)):
○ w_i^T \tilde{w}_k + b_i + \tilde{b}_k = \log(X_{ik})
● Objective function (weighted least squares):
○ J = \sum_{i,j} f(X_{ij}) (w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij})^2
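A numpy sketch of this objective on a toy co-occurrence matrix (random initialization; a real implementation would minimize J with AdaGrad/SGD):

```python
# Sketch of the GloVe objective, following the formulas above
# (numpy only; X is a toy co-occurrence matrix).
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 10                       # vocabulary size, embedding dimension
X = rng.integers(1, 50, size=(V, V)).astype(float)  # co-occurrence counts

W = rng.normal(size=(V, d))        # word vectors w_i
W_tilde = rng.normal(size=(V, d))  # context vectors w~_j
b = np.zeros(V)                    # biases b_i
b_tilde = np.zeros(V)              # biases b~_j

def weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting function f(X_ij)."""
    return (x / x_max) ** alpha if x < x_max else 1.0

# J = sum_ij f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2
J = sum(
    weight(X[i, j]) * (W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])) ** 2
    for i in range(V)
    for j in range(V)
)
print(J)  # in training, this loss is minimized with SGD / AdaGrad
```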

GloVe (5)

Improve the function

● The weighting function f(X_{ij}) = (X_{ij} / x_{max})^\alpha (capped at 1) down-weights rare and very frequent co-occurrences
● The best \alpha is 3/4: it looks quite similar to the exponent used in Word2Vec’s negative sampling

fastText

Use n-grams to share weights between words

● The goal is to share information between words

● Split the words in n-grams

○ Mark the beginning and the end of a word with ‘<’ and ‘>’ respectively

○ Example of 3-grams for the word TUGraz:
■ ‘<TU’, ‘TUG’, ‘UGr’, ‘Gra’, ‘raz’, ‘az>’
○ Select all n-grams with ‘n’ between 3 and 6 (inclusive)
■ Select also the word itself: ‘<TUGraz>’ (a small extraction sketch follows below)
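A small sketch of the n-gram extraction described above (plain Python; `char_ngrams` is a hypothetical helper, not the fastText API):

```python
# Sketch: extracting the character n-grams that fastText uses to represent
# a word (boundary markers '<' and '>', n between 3 and 6, plus the word itself).
def char_ngrams(word, n_min=3, n_max=6):
    marked = f"<{word}>"
    ngrams = {
        marked[i : i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    }
    ngrams.add(marked)  # the special sequence for the whole word
    return ngrams

print(sorted(char_ngrams("TUGraz")))
# the 3-grams include '<TU', 'TUG', 'UGr', 'Gra', 'raz', 'az>'
```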

fastText (2)

Use n-grams to share weights between words

● Each word is represented as a bag of n-grams

● Skip gram architecture of Word2Vec

● Allows computing word representations for words that don’t occur in the training data

Context2Vec

Words have different representations in different contexts

● Use a bidirectional LSTM to learn the missing word

● Similar to the CBOW, but in this case we use LSTM

● Later, a multilayer perceptron is added to capture complex patterns

● Similar to Word2Vec (CBOW)

○ Using random sampling to train the weights of the network

Context2Vec (2)

Results

● Sentence embeddings are close to the term embeddings
● Outperforms context representations based on averaged word embeddings
● Surpasses or nearly reaches state-of-the-art results on sentence completion, lexical substitution and word sense disambiguation tasks

Second Generation of Word Embeddings

TagLM (Pre-ELMo)

Pre-train network

● Three basic steps:

○ Pre-train word embeddings and LM embeddings on large corpora

○ Extract word embeddings and Language Model embeddings for a given input sequence

○ Use them in a supervised task

TagLM (2)

Pre-train network

● For each token:
○ Concatenate a character-based embedding and a pre-trained word embedding
■ The character representation captures morphological information (CNN or RNN)
● Use these embeddings in a bi-directional LSTM
○ Concatenate the two outputs for each layer
○ 2-layer LSTM
● Stacked bidirectional LSTMs

TagLM (3)

Architecture

● Pretrain a bidirectional Language Model:
○ Use pre-trained embeddings (character and word level)
○ At the second level of the bi-RNN, concatenate the two outputs
● A sequence-tagging architecture on top, trained with a CRF loss

TagLM (4)

Architecture

ELMo (Embeddings from LM)

Pre-train network

● As previously, use a forward and a backward LSTM
● The formulation of the problem:
○ Maximize the log likelihood of the forward and backward directions:
■ \sum_{k=1}^{N} [ \log p(t_k \mid t_1, \dots, t_{k-1}; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1}, \dots, t_N; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) ]

○ \Theta_x => parameters for the representation of the tokens
○ \Theta_s => parameters for the softmax layer

ELMo (2)

Pre-train network

● Different from the previous approach:
○ Shares some weights between the directions, instead of using completely independent parameters
● For each token t_k, the L-layer biLM computes a set of 2L + 1 representations:
○ R_k = \{ x_k^{LM}, \overrightarrow{h}_{k,j}^{LM}, \overleftarrow{h}_{k,j}^{LM} \mid j = 1, \dots, L \} = \{ h_{k,j}^{LM} \mid j = 0, \dots, L \}
○ Where x_k^{LM} = h_{k,0}^{LM} => a context-independent token representation (CNN over characters)

ELMo (2)

Pre-train network

● ELMo collapses all the layers into one:
● For a specific task, ELMo calculates:

○ A combination of all the layers, not just the last layer:
■ ELMo_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task} h_{k,j}^{LM} (a small numeric sketch follows below)
○ \gamma^{task} => allows the task model to scale the entire output
■ It is important to optimize it in the whole process (the authors claim)
○ s_j^{task} => softmax-normalized weights over the layers

ELMo (3)

Pre-train network

● Given a task and the pretrained LM, ELMo does the following:
○ Run the LM on each token and record the output of each layer
○ Then let the task-specific architecture learn from the representation:
■ Because most of the tasks share the same architecture at the lowest level
■ The model then forms a task-sensitive representation
○ The weights of ELMo’s LM are frozen:
■ The ELMo_k^{task} vectors are concatenated with the token representations x_k and fed to the task architecture
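A small numpy sketch of the task-specific combination above; the layer activations are random stand-ins for the frozen biLM outputs:

```python
# Sketch of the task-specific ELMo combination: a softmax-weighted sum of
# the biLM layers, scaled by gamma (numpy only, one token).
import numpy as np

L, dim = 2, 8                      # number of biLM layers, hidden size
rng = np.random.default_rng(0)
h = rng.normal(size=(L + 1, dim))  # h_{k,0} (token repr.) ... h_{k,L} for one token k

s_raw = np.zeros(L + 1)            # learnable scalars, one per layer
gamma = 1.0                        # learnable task-specific scale

s = np.exp(s_raw) / np.exp(s_raw).sum()        # softmax-normalized weights s_j
elmo_k = gamma * (s[:, None] * h).sum(axis=0)  # ELMo_k^task

print(elmo_k.shape)  # (8,) -- concatenated with x_k and fed to the task model
```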

ULMFiT

Universal Language Model Fine-Tuning

● Appeared at almost the same time as ELMo

● Similar idea:

○ Transfer Learning with task specific tuning

○ Language Models capture a lot of information relevant to downstream tasks:

■ Long-term dependencies

■ Hierarchical relations

■ Sentiment

ULMFiT (2)

Universal Language Model Fine-Tuning

● Consists of three steps:

○ Language Model trained in a general domain

○ Target task LM fine tuning

○ Target task classifier fine-tuning

ULMFiT (3)

Steps of ULMFiT

● Target task LM fine tuning:

○ Discriminative fine-tuning:

■ Each layer gets a different learning rate

● Empirically found: \alpha_{l-1} = \alpha_l / 2.6

■ Slanted triangular learning rates:

● For adapting to task-specific features and quickly converging to a good region of the parameter space (see the sketch below)
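A sketch of the two learning-rate tricks, following the formulas reported in the ULMFiT paper (the constants below are assumptions based on the paper's defaults):

```python
# Sketch of ULMFiT's two learning-rate tricks: discriminative fine-tuning
# (lower LR for lower layers) and the slanted triangular schedule.
def discriminative_lrs(eta_top, n_layers, factor=2.6):
    """eta_{l-1} = eta_l / 2.6, from the top layer down."""
    lrs = [eta_top]
    for _ in range(n_layers - 1):
        lrs.append(lrs[-1] / factor)
    return lrs[::-1]  # ordered bottom -> top

def slanted_triangular(t, total_steps, eta_max=0.01, cut_frac=0.1, ratio=32):
    """Short linear warm-up followed by a long linear decay."""
    cut = int(total_steps * cut_frac)
    p = t / cut if t < cut else 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return eta_max * (1 + p * (ratio - 1)) / ratio

print(discriminative_lrs(eta_top=0.01, n_layers=4))
print([round(slanted_triangular(t, total_steps=100), 4) for t in (0, 10, 50, 99)])
```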

ULMFiT (4)

● Target task classifier fine-tuning:

○ Concat pooling (to avoid losing the information in the last states):

■ Concatenates the last hidden state with a max-pool and a mean-pool of all the hidden states (doesn’t it look similar to ELMo?)

○ Gradual unfreeze:

■ To avoid catastrophic forgetting:

● In the first epoch, unfreeze the last layer (it contains the least general knowledge) and fine-tune it

● Then go through the remaining layers, unfreezing them one by one, top-down

ULMFiT (5)

BERT (Introduction)

Encoder-Decoder architecture

● Used for Machine Translation or Sequence Tagging
● Encoder:
○ Learns the representations of the words
● Decoder:
○ Generates the new/translated sequence
● Traditionally:
○ bi-RNN with some attention
● Transformer:
○ Only via attention

BERT (Introduction 2)

Transformer

● Uses only attention to learn the connections between words
● Architecture:
○ 6 layers of encoders & 6 of decoders
● Encoders:
○ Self-attention & feed-forward layers
● Decoder:
○ Self-attention, encoder-decoder attention & feed-forward layers
● Tokens are encoded with position embeddings:
○ So the system learns how to relate the words to each other

BERT (Introduction 3)

Transformer in a GIF

BERT

From the rich to the poor

● Two steps for BERT:
○ Pre-train

○ Fine-tune

● Architecture:
○ Based on the famous paper “Attention is All You Need”

○ Multi-layer bidirectional Transformer encoder

● Pretrained on a large dataset, so that users don’t have to spend time/resources to do so

BERT (2)

● Two tasks for the pretraining:

○ Masked LM

■ Hide (or mask) a word with a probability of 15% and try to identify it

○ Next sentence prediction:

■ Given two sentences, the network tries to predict whether one follows the other:

● Seems to help fine-tuning for tasks such as Question Answering and Natural Language Inference

● Easy to generate (50% actual next sentence, 50% random)

● IMPORTANT: BERT tunes all the parameters, so the task is trained end-to-end

BERT (3)

● Token representations are constructed by summing the token embedding, segment embedding, and position embedding
● Uses WordPiece tokenization (a small tokenization sketch follows below)
○ Used to handle Out-Of-Vocabulary words:
■ Example: walking => walk ##ing; walked => walk ##ed
● Training:
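A small sketch of WordPiece tokenization and masked-word prediction with the Hugging Face transformers library (an assumption: transformers is installed and the bert-base-uncased weights can be downloaded; the exact subword splits depend on the vocabulary):

```python
# Sketch of WordPiece tokenization and masked-word prediction with
# Hugging Face transformers (downloads the bert-base-uncased weights).
from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("The spinoff TUGraz was founded recently."))
# rare words are split into subword pieces marked with '##'

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for pred in unmasker("The capital of Austria is [MASK].")[:3]:
    print(pred["token_str"], round(pred["score"], 3))
```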

Transformer XL

Shallow description

● Problems with previous solutions:

○ Fixed length of the context for the Language Model

● How to make a transformer to encode an arbitrary long text to a single output?

○ Simple approach:

■ Split the text to smaller chunks

● Problem: the dependency length is upper-bounded by the segment length

Transformer XL (2)

● Add a recurrence to the Transformer Architecture:

● How to encode each segment:

○ Keep in memory the output of the hidden states of the previous run:

■ Extends the range of dependencies beyond a single segment

Worth mentioning

Similar to BERT

● RoBERTa:

○ Improves the training methodology of BERT

○ Roughly 10 times more training data (160GB vs. 16GB)

● DistilBERT:

○ Uses about 40% fewer parameters than BERT

○ Achieves 97% of BERT’s performance

○ Good tradeoff

Worth mentioning (2)

[Figure: comparison of model sizes; the y-axis shows the number of parameters in millions]

Multilinguality

Transfer learning for non-contextual embeddings

● Transfer embeddings from one language to another:

○ Advantages:

■ Transfer Learning

■ Machine Translation

■ Cross Lingual entity linking

○ The basic idea for most of the papers is to learn a matrix transformation:

■ Exploits the linearity of the spaces

Multilinguality (2)

Example with a small amount of parallel data

● Learn the transformation either with a neural network or with a closed-form algebraic solution (see the sketch below)
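A numpy sketch of one common closed-form solution (orthogonal Procrustes via SVD); X and Y below are toy stand-ins for aligned source/target word vectors from a small bilingual dictionary:

```python
# Sketch of a closed-form solution for mapping one embedding space onto
# another (orthogonal Procrustes via SVD).
import numpy as np

rng = np.random.default_rng(0)
n_pairs, dim = 1000, 50
X = rng.normal(size=(n_pairs, dim))              # source-language vectors
true_W = np.linalg.qr(rng.normal(size=(dim, dim)))[0]
Y = X @ true_W                                   # target-language vectors (toy: exact rotation)

# W* = argmin ||XW - Y||_F s.t. W orthogonal  =>  W* = U V^T, with U S V^T = SVD(X^T Y)
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

print(np.allclose(X @ W, Y, atol=1e-6))          # the learned map recovers the rotation
```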

Multilinguality (3)

Multilingual BERT

● Uses a single multilingual vocabulary
○ The WordPiece tokenizer helps to capture shared semantics between languages:

■ To see the effect: try a NER task with M-BERT compared to English-only BERT

■ M-BERT’s pretraining on multiple languages has enabled a representational capacity deeper than simple vocabulary memorization

● Nice effect:
○ A task trained in one language performs well in another

Multilinguality (4)

Multilingual ULMFiT (MultiFiT)

● Similar to ULMFiT, but uses a Quasi-RNN (QRNN) instead of an LSTM

○ QRNNs alternate convolutional layers and recurrent pooling functions

○ Outperforms LSTMs (and is about 16 times faster)

● ULMFiT is restricted to words:

○ MultiFiT uses subword tokenization (sounds familiar?)

○ A new variant of the slanted triangular learning rate + gradual unfreezing

■ A cosine variant of the one-cycle policy

Multilinguality (5)

Multilingual ULMFiT (MultiFiT 2)

● For the case where an existing pre-trained cross-lingual model and source-language data are available:
○ The authors propose a multi-step bootstrapping approach:
■ Check the appendix for more details

Thank you for your Attention (is All I Need!)

Bootstrapping for MultiFiT