NLP: The bread and the butter
Andi Rexha • 07.11.2019

Agenda
Traditional NLP
● Text preprocessing
● Feature types
● Bag-of-words model
● External resources
● Sequential classification
● Other tasks (MT, LM, Sentiment)
Word embeddings
● First generation (Word2vec)
● Second generation (ELMo, BERT, ...)
● Multilinguality
  ○ Space transformation
  ○ Multilingual BERT, MultiFiT

Preprocessing
How to preprocess text?
● Divide et impera approach:
  ○ How do we (humans) split the text to analyse it?
    ■ Word split
    ■ Sentence split
    ■ Paragraphs, etc.
  ○ Is there any other information that we can collect?

Preprocessing (2)
Other preprocessing steps
● Stemming (lemmatization is usually too expensive)
● Part-of-Speech Tagging (PoS)
● Chunking/Constituency Parsing
● Dependency Parsing

Preprocessing (3)
● Stemming
  ○ The process of reducing inflected words to a common root:
    ■ producing => produc; are => are
● Lemmatization
  ○ Reducing words to their common lemma:
    ■ am, is, are => be
● Part-of-Speech Tagging (PoS)
  ○ Assign a grammatical tag to each word

Preprocessing Example
Sentence: “There are different examples that we might use!”
● PoS tagging
● Lemmatization
Definition of PoS tags from the Penn Treebank: https://sites.google.com/site/partofspeechhelp/#TOC-VBZ

Preprocessing (4)
Other preprocessing steps
● Shallow parsing (chunking):
  ○ A morphological parsing that adds a tree structure on top of the PoS tags
  ○ First identifies the constituents and then their relations
● Deep parsing (dependency parsing):
  ○ Parses the sentence into its grammatical structure
  ○ “Head”-“Dependent” form
  ○ It is a directed acyclic graph (mostly implemented as a tree)

Preprocessing Example (2)
Sentence: “There are different examples that we might use!”
● Constituency parsing
● Dependency parsing
Run the examples under: https://corenlp.run/
CoreNLP library: https://stanfordnlp.github.io/CoreNLP/
Dependency parser tags (UD): https://nlp.stanford.edu/software/dependencies_manual.pdf

Bag-of-words Model
A major paradigm in NLP and IR
● The text is considered to be the set of its words
● Grammatical dependencies are ignored
Representation of features
● Dictionary based (nominal features)
● One-hot encoded / frequency encoded
● What is the difference from standard Data Mining problems?

Bag-of-words Model (2)
Sentences => features => feature representation for ML
Source: https://en.wikipedia.org/wiki/Bag-of-words_model

Feature encoding
PoS tagging
● Option: (word + PoS tag) as part of the dictionary:
  ○ Example: John-PN
Dependency parsing
● Dependency between two words:
  ○ Use it as an n-gram feature
● Option: (word + dependency path) as part of the dictionary:
  ○ Example: use-nsubj-acl:relcl
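To make the bag-of-words and feature-encoding ideas above concrete, here is a minimal sketch in plain Python (not from the slides): it builds a dictionary over tokens and produces frequency-encoded vectors. Whitespace tokenisation, the example sentences, and the helper names are assumptions for illustration only.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map each distinct token to a column index (the 'dictionary')."""
    vocab = {}
    for doc in documents:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def frequency_vector(doc, vocab):
    """Frequency-encoded bag-of-words vector; use min(count, 1) for one-hot."""
    counts = Counter(doc)
    return [counts.get(token, 0) for token in vocab]

# Toy corpus, already tokenised (whitespace tokenisation is assumed here).
docs = [
    "there are different examples that we might use".split(),
    "we might use different features".split(),
]

vocab = build_vocabulary(docs)
vectors = [frequency_vector(d, vocab) for d in docs]
print(vocab)
print(vectors)

# Word + PoS features in the 'John-PN' style are obtained by changing the
# token strings, e.g. [("we", "PRP"), ("might", "MD")] -> ["we-PRP", "might-MD"].
```

The same dictionary-plus-vector scheme carries over to the dependency-path features: the feature string simply becomes the word joined with its path (e.g. use-nsubj-acl:relcl).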
External Resources
There are linguistic resources that we miss
● Synonyms, antonyms, hyponyms, hypernyms, ...
External resources can help us enrich our vocabulary
● WordNet: a lexical database for English, which groups words into synsets
● Wiktionary: a free multilingual dictionary enriched with relations between words
● We can enrich the features of our examples with their synonyms

Sequential Classification
Cases where we need to classify a sequence of tokens
● Information Extraction:
  ○ Example: extract the names of companies from documents (open domain)
How do we model it?
● Classify each token as part or not part of the information:
  ○ The classification of the current token depends on the classification of the previous one
  ○ Sequential classifier
  ○ Still not enough; we need to encode the output
  ○ We need to know where every “annotation” starts and ends

Sequential Classification (2)
Two schemas for encoding the output
● BILOU: Beginning, Inside, Last, Outside, Unit
● BIO (most used): Beginning, Inside, Outside
● BILOU has been shown to perform better on some datasets [1]
● Example: “The Know Center GmbH is a spinoff of TUGraz.”
  ○ BILOU: The-O; Know-B; Center-I; GmbH-L; is-O; a-O; spinoff-O; of-O; TUGraz-U
  ○ BIO: The-O; Know-B; Center-I; GmbH-I; is-O; a-O; spinoff-O; of-O; TUGraz-B
● Sequential classifiers: Hidden Markov Models, CRFs, etc.
[1] Named Entity Recognition: https://www.aclweb.org/anthology/W09-1119.pdf
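A minimal sketch (not from the slides) of how entity spans can be turned into BIO and BILOU tags. The tokenised sentence and the entity spans are hard-coded to match the example above; real NER schemes usually attach an entity type (e.g. B-ORG), which is omitted here to stay close to the slide.

```python
def encode_spans(tokens, spans, scheme="BIO"):
    """Convert entity spans (start, end: inclusive token indices) to BIO/BILOU tags."""
    tags = ["O"] * len(tokens)
    for start, end in spans:
        if scheme == "BILOU" and start == end:
            tags[start] = "U"          # single-token entity
            continue
        tags[start] = "B"
        for i in range(start + 1, end + 1):
            tags[i] = "I"
        if scheme == "BILOU":
            tags[end] = "L"            # last token of a multi-token entity
    return tags

tokens = ["The", "Know", "Center", "GmbH", "is", "a", "spinoff", "of", "TUGraz", "."]
spans = [(1, 3), (8, 8)]               # "Know Center GmbH" and "TUGraz"

print(list(zip(tokens, encode_spans(tokens, spans, "BIO"))))
print(list(zip(tokens, encode_spans(tokens, spans, "BILOU"))))
```

The resulting per-token tags are exactly what a sequential classifier such as an HMM or CRF is trained to predict.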
Machine Translation
Task presentation
● Translate text from one language to another
● Different approaches:
  ○ Rule based (usually using a dictionary)
  ○ Statistical (involving a bilingual aligned corpus)
    ■ IBM models (1-6) for alignment and training
  ○ Hybrid (a combination of the two previous techniques)

Sentiment Analysis
Task presentation
● Assign a sentiment to a piece of text:
  ○ Binary (like/dislike)
  ○ Rating based (e.g. 1-5)
● Assign the sentiment to a target phrase:
  ○ Usually involving features around the target
● External resources:
  ○ SentiWordNet: http://sentiwordnet.isti.cnr.it/

Language model
Task presentation
● Generating the next token of a sequence
● Usually based on co-occurrence counts of words within a window:
  ○ Statistics are collected and the next word is predicted based on this information
  ○ Mainly, it models the probability of the sequence:
    ■ P(w_1, ..., w_n) = Π_i P(w_i | w_1, ..., w_{i-1})
● In traditional approaches solved with an n-gram approximation:
  ○ P(w_i | w_1, ..., w_{i-1}) ≈ P(w_i | w_{i-n+1}, ..., w_{i-1})
  ○ Usually solved by combining n-grams of different sizes and weighting them

Dense word representation
From sparse to dense: topic modeling
● Since LSA (Latent Semantic Analysis):
  ○ These methods use low-rank approximations to decompose large matrices that capture statistical information about a corpus
● Later methods:
  ○ pLSA (Probabilistic Latent Semantic Analysis)
    ■ Uses a probabilistic model instead of SVD (Singular Value Decomposition)
  ○ LDA (Latent Dirichlet Allocation)
    ■ A Bayesian version of pLSA

Neural embeddings
A Neural Probabilistic Language Model
● Language models suffer from the “curse of dimensionality”:
  ○ The word sequence that we want to predict is likely to be different from the ones we have seen in training
  ○ Seeing “The cat is walking in the bedroom” should help us generate “A dog was running in the room”:
    ■ Similar semantics and grammatical roles
● Bengio et al. (2003) implemented this, building on the idea of distributed representations (Hinton, 1986):
  ○ Learned a language model and embeddings for the words jointly

Neural embeddings (2)
Bengio’s architecture
● Approximate the function with a window approach
● Model the approximation with a neural network
● Input layer in one-hot-encoded form
● Two hidden layers (the first is more of a randomly initialized projection layer)
● A tanh intermediate layer

Neural embeddings (3)
Bengio’s architecture
● A final softmax layer:
  ○ Outputs the next word in the sequence
● Learned representations for ~18K words from a corpus of almost 1M words
● IMPORTANT linguistic theory (the distributional hypothesis):
  ○ Words that tend to occur in similar linguistic contexts tend to resemble each other in meaning

Word2vec
● A two-layer neural network model that computes dense vector representations of words
● Two different architectures:
  ○ Continuous Bag-of-Words (CBOW) (faster)
    ■ Predict the middle word from a window of context words
  ○ Skip-gram (better with small amounts of data)
    ■ Predict the context words, given the middle word
● Models the probability of words co-occurring with the current (candidate) word
● The learned embedding is the output of the hidden layer

Word2vec (2)
CBOW and Skip-gram architecture diagrams

Word2vec (3)
● The output is a softmax function
● Three new techniques:
1. Subsampling of frequent words:
  ■ Each word w_i in the training set is discarded with probability P(w_i) = 1 - sqrt(t / f(w_i))
    ● f(w_i) is the frequency of word w_i and t (around 10^-5) is a threshold
    ● Words that occur less often are more likely to be kept
  ■ This accelerates learning and even significantly improves the accuracy of the learned vectors of the rare words

Word2vec (4)
2. Hierarchical softmax:
  ○ A tree approximation of the softmax, using a sigmoid at every step
  ○ Intuition: at every step decide whether to go right or left
  ○ O(log(n)) instead of O(n)
3. Negative sampling:
  ○ Alternative to hierarchical softmax (works better):
  ○ Brings up infrequent terms, squeezes the probability of frequent terms
  ○ Updates the weights of only a small number of sampled terms, selected with probability P(w_i) = f(w_i)^(3/4) / Σ_j f(w_j)^(3/4)

Word2vec (5)
● A serendipitous effect of Word2vec is the linearity (analogy) between embeddings
● The famous example: (King - Man) + Woman ≈ Queen
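As an illustration of the analogy property (not part of the slides), here is a minimal sketch using the gensim library with one of its downloadable pretrained embeddings; the model name is just one readily available option, and any pretrained Word2vec or GloVe vectors would behave the same way.

```python
# Requires: pip install gensim (downloads ~130 MB of pretrained vectors on first use).
import gensim.downloader as api

# Pretrained GloVe vectors are used here for convenience; pretrained Word2vec
# vectors (e.g. "word2vec-google-news-300") expose the same interface.
vectors = api.load("glove-wiki-gigaword-100")

# (king - man) + woman ~= queen: add/subtract vectors and rank by cosine similarity.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Plain cosine similarity between two word vectors.
print(vectors.similarity("cat", "dog"))
```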
GloVe (Global Vectors)
Advantages and drawbacks of previous embeddings
● Methods similar to LSA:
  ○ Advantage: learn global statistical information
  ○ Drawback: poor on analogy tasks
● Word2vec:
  ○ Advantage: learns analogies
  ○ Drawback: poor at capturing global statistical information
● GloVe combines the two, removing the disadvantages

GloVe (2)
Proposed approach
● Use the co-occurrence matrix as a starting point
● The ratio of co-occurrence probabilities P_ik / P_jk is better than the raw co-occurrences at distinguishing relevant words (e.g. solid vs. gas) from unrelated words:
  ○ Use this ratio as the starting point of the algorithm

GloVe (3)
Find a function
● The function should calculate: F(w_i, w_j, w̃_k) = P_ik / P_jk
● The ratio should be encoded in the word vector space:
  ○ Since vector spaces are linear, the function should combine the difference of the two word vectors:
    ■ F(w_i - w_j, w̃_k) = P_ik / P_jk
● The left side takes vectors, the right side is a scalar:
  ○ How do we combine them while keeping the linearity?
  ○ F could be a complicated function, but we need to keep the linear structure
    ■ To avoid that, use the dot product: F((w_i - w_j)^T w̃_k) = P_ik / P_jk

GloVe (4)
Improve the function
● The function is required to be a homomorphism:
  ○ F((w_i - w_j)^T w̃_k) = F(w_i^T w̃_k) / F(w_j^T w̃_k)
● Given X, the word-to-word co-occurrence matrix:
  ○ F(w_i^T w̃_k) = P_ik = X_ik / X_i
● With F being the exponential (so the ratio above holds):
  ○ w_i^T w̃_k = log(P_ik) = log(X_ik) - log(X_i)
● Add two bias terms to simplify the previous equation (b_i absorbs log(X_i)):
  ○ w_i^T w̃_k + b_i + b̃_k = log(X_ik)
● Objective function (weighted least squares):
  ○ J = Σ_{i,j} f(X_ij) (w_i^T w̃_j + b_i + b̃_j - log X_ij)^2

GloVe (5)
Improve the function
● Weighting function: f(x) = (x / x_max)^α if x < x_max, else 1
● The best α is 3/4: this looks quite similar to Word2vec’s negative-sampling exponent

fastText
Use n-grams to share weights between words
● The goal is to share information between words
● Split the words into character n-grams
  ○ Mark the prefix and suffix with ‘<’ and ‘>’ respectively
  ○ Example of 3-grams for the word TUGraz:
    ■ ‘<TU’, ‘TUG’, ‘UGr’, ‘Gra’, ‘raz’, ‘az>’
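To make the n-gram splitting concrete, here is a minimal sketch (not from the slides) that reproduces the character n-gram extraction in the fastText style; the 3-6 n-gram range is a common default and an assumption here.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in '<' and '>', fastText-style.

    (fastText additionally keeps the whole wrapped word, e.g. '<TUGraz>',
    as an extra feature; that is omitted here to match the slide example.)
    """
    wrapped = "<" + word + ">"
    ngrams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[i:i + n])
    return ngrams

# 3-grams only, matching the TUGraz example above.
print(char_ngrams("TUGraz", n_min=3, n_max=3))
# => ['<TU', 'TUG', 'UGr', 'Gra', 'raz', 'az>']
# A word vector is then built from the vectors of its n-grams, so rare or
# unseen words share weights with words that contain the same n-grams.
```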