Embeddings in Natural Language Processing
Theory and Advances in Vector Representation of Meaning

Mohammad Taher Pilehvar, Tehran Institute for Advanced Studies
Jose Camacho-Collados, Cardiff University

Synthesis Lectures on Human Language Technologies
Morgan & Claypool Publishers

ABSTRACT
Embeddings have been one of the dominant buzzwords in Natural Language Processing (NLP) since the early 2010s. Encoding information into a low-dimensional vector representation, which is easily integrable in modern machine learning algorithms, has played a central role in the development of NLP. Embedding techniques initially focused on words, but attention soon shifted to other forms: from graph structures, such as knowledge bases, to other types of textual content, such as sentences and documents. This book provides a high-level synthesis of the main embedding techniques in NLP, in the broad sense. The book starts by explaining conventional word vector space models and word embeddings (e.g., Word2Vec and GloVe) and then moves to other types of embeddings, such as word sense, sentence and document, and graph embeddings. We also provide an overview of recent developments in contextualized representations (e.g., ELMo and BERT) and explain their potential in NLP. Throughout the book, the reader can find both the essential information for understanding a topic from scratch and a broad overview of the most successful techniques developed in the literature.

KEYWORDS
Natural Language Processing, Embeddings, Semantics

Contents

1 Introduction
   1.1 Semantic representation
   1.2 One-hot representation
   1.3 Vector Space Models
   1.4 The Evolution Path of representations
   1.5 Coverage of the book
   1.6 Outline
2 Background
   2.1 Natural Language Processing Fundamentals
       2.1.1 Linguistic fundamentals
       2.1.2 Language models
   2.2 Deep Learning for NLP
       2.2.1 Sequence encoding
       2.2.2 Recurrent neural networks
       2.2.3 Transformers
   2.3 Knowledge Resources
       2.3.1 WordNet
       2.3.2 Wikipedia, Freebase, Wikidata and DBpedia
       2.3.3 BabelNet and ConceptNet
       2.3.4 PPDB: The Paraphrase Database
3 Word Embeddings
   3.1 Count-based models
       3.1.1 Pointwise Mutual Information
       3.1.2 Dimensionality reduction
   3.2 Predictive models
   3.3 Character embedding
   3.4 Knowledge-enhanced word embeddings
   3.5 Cross-lingual word embeddings
       3.5.1 Sentence-level supervision
       3.5.2 Document-level supervision
       3.5.3 Word-level supervision
       3.5.4 Unsupervised
   3.6 Evaluation
       3.6.1 Intrinsic Evaluation
       3.6.2 Extrinsic Evaluation
4 Graph Embeddings
   4.1 Node embedding
       4.1.1 Matrix factorization methods
       4.1.2 Random Walk methods
       4.1.3 Incorporating node attributes
       4.1.4 Graph Neural Network methods
   4.2 Knowledge-based relation embeddings
   4.3 Unsupervised relation embeddings
   4.4 Applications and Evaluation
       4.4.1 Node embedding
       4.4.2 Relation embedding
5 Sense Embeddings
   5.1 Unsupervised sense embeddings
       5.1.1 Sense Representations Exploiting Monolingual Corpora
       5.1.2 Sense Representations Exploiting Multilingual Corpora
   5.2 Knowledge-based sense embeddings
   5.3 Evaluation and Application
6 Contextualized Embeddings
   6.1 The need for contextualization
   6.2 Background: Transformer model
       6.2.1 Self-attention
       6.2.2 Encoder
       6.2.3 Decoder
       6.2.4 Positional encoding
   6.3 Contextualized word embeddings
       6.3.1 Earlier methods
       6.3.2 Language models for word representation
       6.3.3 RNN-based models
   6.4 Transformer-based Models: BERT
       6.4.1 Masked Language Modeling
       6.4.2 Next Sentence Prediction
       6.4.3 Training
   6.5 Extensions
       6.5.1 Translation language modeling
       6.5.2 Context fragmentation
       6.5.3 Permutation language modeling
       6.5.4 Reducing model size
   6.6 Feature extraction and fine-tuning
   6.7 Analysis and Evaluation
       6.7.1 Self-attention patterns
       6.7.2 Syntactic properties
       6.7.3 Depth-wise information progression
       6.7.4 Multilinguality
       6.7.5 Lexical contextualization
       6.7.6 Evaluation
7 Sentence and Document Embeddings
   7.1 Unsupervised Sentence Embeddings
       7.1.1 Bag of Words
       7.1.2 Sentence-level Training
   7.2 Supervised Sentence Embeddings
   7.3 Document Embeddings
   7.4 Application and Evaluation
8 Ethics and Bias
   8.1 Bias in Word Embeddings
   8.2 Debiasing Word Embeddings
9 Conclusions
Bibliography
Authors' Biographies

CHAPTER 1
Introduction

Artificial Intelligence (AI) has undoubtedly been one of the most important buzzwords in recent years. The goal of AI is to design algorithms that transform computers into “intelligent” agents. By intelligence we do not necessarily mean an extraordinary, superhuman level of smartness; more often it involves very basic problems that humans solve frequently in their day-to-day lives. This can be as simple as recognizing faces in an image, driving a car, playing a board game, or reading (and understanding) an article in a newspaper. The intelligent behaviour exhibited by humans when “reading” is one of the main goals of a subfield of AI called Natural Language Processing (NLP). Natural language is one of the most complex tools used by humans, serving a wide range of purposes: for instance, to communicate with others, to express thoughts, feelings and ideas, to ask questions, or to give instructions. It is therefore crucial for computers to possess the ability to use this same tool in order to interact effectively with humans. From one point of view, NLP can be divided into two broad subfields: Natural Language Understanding (NLU) and Natural Language Generation (NLG). NLU deals with understanding the meaning of human language, usually expressed as a piece of text. For instance, when a Question Answering (QA) system is.