
Zhiyuan Liu · Yankai Lin · Maosong Sun

Representation Learning for Natural Language Processing

Zhiyuan Liu
Tsinghua University
Beijing, China

Yankai Lin
Pattern Recognition Center, Tencent Wechat
Beijing, China

Maosong Sun
Department of Computer Science and Technology, Tsinghua University
Beijing, China

ISBN 978-981-15-5572-5 ISBN 978-981-15-5573-2 (eBook) https://doi.org/10.1007/978-981-15-5573-2

© The Editor(s) (if applicable) and The Author(s) 2020. This book is an open access publication. Open Access This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adap- tation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this book are included in the book’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publi- cation does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Preface

In traditional Natural Language Processing (NLP) systems, language entries such as words and phrases are taken as distinct symbols. Various classic ideas and methods, such as n-gram and bag-of-words models, were proposed and have been widely used until now in many industrial applications. All these methods take words as the minimum units for semantic representation, which are either used to further estimate the conditional probabilities of next words given previous words (e.g., n-gram) or used to represent semantic meanings of text (e.g., bag-of-words models). Even when people find it is necessary to model word meanings, they either manually build some linguistic knowledge bases such as WordNet or use context words to represent the meaning of a target word (i.e., distributional representation). All these semantic representation methods are still based on symbols! With the development of NLP techniques over the years, it has been realized that many issues in NLP are caused by symbol-based semantic representation. First, symbol-based representation always suffers from the data sparsity problem. Take statistical NLP methods such as n-gram models over large-scale corpora for example: due to the intrinsic power-law distribution of words, the performance will decay dramatically for those few-shot words, even though many smoothing methods have been developed to calibrate the estimated probabilities for them. Moreover, since there are multiple-grained entries in natural languages, from words and phrases to sentences and documents, it is difficult to find a unified symbol set to represent the semantic meanings of all of them simultaneously. Meanwhile, in many NLP tasks, it is required to measure semantic relatedness between these language entries at different levels. For example, we have to measure semantic relatedness between words/phrases and documents in information retrieval. Due to the absence of a unified scheme for semantic representation, there used to be distinct approaches proposed and explored for different tasks in NLP, which sometimes makes NLP not look like a compatible community. As an alternative to symbol-based representation, distributed representation was originally proposed by Geoffrey E. Hinton in a technical report in 1984. The report was then included in the well-known two-volume book Parallel Distributed Processing (PDP) that introduced neural networks to model human

cognition and intelligence. According to this report, distributed representation is inspired by the neural computation scheme of humans and other animals, and the essential idea is as follows:

Each entity is represented by a pattern of activity distributed over many computing elements, and each computing element is involved in representing many different entities.

It means that each entity is represented by multiple neurons, and each neuron is involved in the representation of many concepts. This also indicates the meaning of distributed in distributed representation. As opposed to distributed representation, people used to assume that one neuron only represents a specific concept or object, e.g., there exists a single neuron that will only be activated when recognizing a person or object, such as his/her grandmother, well known as the grandmother-cell hypothesis or local representation. We can see the straightforward connection between the grandmother-cell hypothesis and symbol-based representation. About 20 years after distributed representation was proposed, the neural probabilistic language model was proposed by Yoshua Bengio in 2003 to model natural languages, in which words are represented as low-dimensional and real-valued vectors based on the idea of distributed representation. However, it was not until 2013, when a simpler and more efficient framework was proposed to learn distributed word representations from large-scale corpora, that distributed representation and neural network techniques became popular in NLP. The performance of almost all NLP tasks has been significantly improved with the support of the distributed representation scheme and deep learning methods. This book aims to review and present the recent advances of distributed representation learning for NLP, including why representation learning can improve NLP, how representation learning takes part in various important topics of NLP, and what challenges are still not well addressed by distributed representation.

Book Organization

This book is organized into 11 chapters in 3 parts. The first part of the book depicts key components in NLP and how representation learning works for them. In this part, Chap. 1 first introduces the basics of representation learning and why it is important for NLP. Then we give a comprehensive review of representation learning techniques on multiple-grained entries in NLP, including word representation (Chap. 2), phrase representation, also known as compositional semantics (Chap. 3), sentence representation (Chap. 4), and document representation (Chap. 5). The second part presents representation learning for those components closely related to NLP. These components include sememe knowledge that describes the commonsense knowledge of words as human concepts, world knowledge (also known as knowledge graphs) that organizes relational facts between entities in the real world, various network data such as social networks and document networks, and cross-modal data that connects natural languages to other modalities such as visual data. A deep understanding of natural languages requires these complex components as a rich context. Therefore, we provide an extensive introduction to these components, i.e., sememe knowledge representation (Chap. 6), world knowledge representation (Chap. 7), network representation (Chap. 8), and cross-modal representation (Chap. 9). In the third part, we further provide some widely used open resource tools for representation learning techniques (Chap. 10) and finally give an outlook on the remaining challenges and future research directions of representation learning for NLP (Chap. 11). Although the book is about representation learning for NLP, the theories and methods can also be applied in other related domains, such as social network analysis, semantic web, information retrieval, data mining, and other computational disciplines. Note that some parts of this book are based on our previously published or preprinted papers, including [1, 11] in Chap. 2, [32] in Chap. 3, [10, 5, 29] in Chap. 4, [12, 7] in Chap. 5, [17, 14, 24, 30, 6, 16, 2, 15] in Chap. 6, [9, 8, 13, 21, 22, 23, 3, 4, 31] in Chap. 7, and [25, 19, 18, 20, 26, 27, 33, 28] in Chap. 8.

Book Cover

The book cover shows an oracle bone divided into three parts, corresponding to three revolutionized stages of cognition and representation in human history. The left part shows oracle scripts, the earliest known form of Chinese writing, used on oracle bones in the late 1200s BC. It is used to represent the emergence of human languages, especially writing systems. We consider this as the first representation revolution for human beings about the world. The upper right part shows the digitalized representation of information and signals. After the invention of electronic computers in the 1940s, big data can be efficiently represented and processed in computer programs. This can be regarded as the second representation revolution for human beings about the world. The bottom right part shows the distributed representation in artificial neural networks originally proposed in the 1980s. As the representation basis of deep learning, it has extensively revolutionized many fields in artificial intelligence, including natural language processing, computer vision, and speech recognition, ever since the 2010s. We consider this as the third representation revolution about the world. This book focuses on the theory, methods, and applications of distributed representation learning in natural language processing.

Prerequisites

This book is designed for advanced undergraduate and graduate students, postdoctoral fellows, researchers, lecturers, and industrial engineers, as well as anyone interested in representation learning and NLP. We expect the readers to have some prior knowledge in Probability, Linear Algebra, and Machine Learning. We recommend readers who are specifically interested in NLP to read the first part (Chaps. 1–5) sequentially. The second and third parts can be read in a selected order according to readers' interests.

Contact Information

We welcome any feedback, corrections, and suggestions on the book, which may be sent to [email protected]. The readers can also find updates about the book on the personal homepage http://nlp.csai.tsinghua.edu.cn/~lzy/.

Beijing, China
March 2020

Zhiyuan Liu
Yankai Lin
Maosong Sun

References

1. Xinxiong Chen, Lei Xu, Zhiyuan Liu, Maosong Sun, and Huanbo Luan. Joint learning of character and word embeddings. In Proceedings of IJCAI, 2015. 2. Yihong Gu, Jun Yan, Hao Zhu, Zhiyuan Liu, Ruobing Xie, Maosong Sun, Fen Lin, and Leyu Lin. Language modeling with sparse product of sememe experts. In Proceedings of EMNLP, pages 4642–4651, 2018. 3. Xu Han, Zhiyuan Liu, and Maosong Sun. Joint representation learning of text and knowledge for knowledge graph completion. arXiv preprint arXiv:1611.04125, 2016. 4. Xu Han, Zhiyuan Liu, and Maosong Sun. Neural knowledge acquisition via mutual attention between knowledge graph and text. In Proceedings of AAAI, pages 4832–4839, 2018. 5. Xu Han, Hao Zhu, Pengfei Yu, Ziyun Wang, Yuan Yao, Zhiyuan Liu, and Maosong Sun. FewRel: A large-scale supervised few-shot relation classification dataset with state-of-the-art evaluation. In Proceedings of EMNLP, 2018. 6. Huiming Jin, Hao Zhu, Zhiyuan Liu, Ruobing Xie, Maosong Sun, Fen Lin, and Leyu Lin. Incorporating Chinese characters of words for lexical sememe prediction. In Proceedings of ACL, 2018. 7. Yankai Lin, Haozhe Ji, Zhiyuan Liu, and Maosong Sun. Denoising distantly supervised open-domain question answering. In Proceedings of ACL, 2018. 8. Yankai Lin, Zhiyuan Liu, Huanbo Luan, Maosong Sun, Siwei Rao, and Song Liu. Modeling relation paths for representation learning of knowledge bases. In Proceedings of EMNLP, 2015.

9. Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. Learning entity and relation embeddings for knowledge graph completion. In Proceedings of AAAI, 2015. 10. Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. Neural relation extraction with selective attention over instances. In Proceedings of ACL, 2016. 11. Yang Liu, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. Topical word embeddings. In Proceedings of AAAI, 2015. 12. Zhenghao Liu, Chenyan Xiong, Maosong Sun, and Zhiyuan Liu. Entity-duet neural ranking: Understanding the role of knowledge graph semantics in neural information retrieval. In Proceedings of ACL, 2018. 13. Zhiyuan Liu, Maosong Sun, Yankai Lin, and Ruobing Xie. Knowledge representation learning: a review. JCRD, 53(2):247–261, 2016. 14. Yilin Niu, Ruobing Xie, Zhiyuan Liu, and Maosong Sun. Improved word representation learning with sememes. In Proceedings of ACL, 2017. 15. Fanchao Qi, Junjie Huang, Chenghao Yang, Zhiyuan Liu, Xiao Chen, Qun Liu, and Maosong Sun. Modeling semantic compositionality with sememe knowledge. In Proceedings of ACL, 2019. 16. Fanchao Qi, Yankai Lin, Maosong Sun, Hao Zhu, Ruobing Xie, and Zhiyuan Liu. Cross-lingual lexical sememe prediction. In Proceedings of EMNLP, 2018. 17. Maosong Sun and Xinxiong Chen. Embedding for words and word senses based on human annotated knowledge base: A case study on HowNet. Journal of Chinese Information Processing, 30:1–6, 2016. 18. Cunchao Tu, Hao Wang, Xiangkai Zeng, Zhiyuan Liu, and Maosong Sun. Community-enhanced network representation learning for network analysis. arXiv preprint arXiv:1611.06645, 2016. 19. Cunchao Tu, Weicheng Zhang, Zhiyuan Liu, and Maosong Sun. Max-margin DeepWalk: Discriminative learning of network representation. In Proceedings of IJCAI, 2016. 20. Cunchao Tu, Zhengyan Zhang, Zhiyuan Liu, and Maosong Sun. TransNet: Translation-based network representation learning for social relation extraction. In Proceedings of IJCAI, 2017. 21. Ruobing Xie, Zhiyuan Liu, Tat-Seng Chua, Huanbo Luan, and Maosong Sun. Image-embodied knowledge representation learning. In Proceedings of IJCAI, 2016. 22. Ruobing Xie, Zhiyuan Liu, Jia Jia, Huanbo Luan, and Maosong Sun. Representation learning of knowledge graphs with entity descriptions. In Proceedings of AAAI, 2016. 23. Ruobing Xie, Zhiyuan Liu, and Maosong Sun. Representation learning of knowledge graphs with hierarchical types. In Proceedings of IJCAI, 2016. 24. Ruobing Xie, Xingchi Yuan, Zhiyuan Liu, and Maosong Sun. Lexical sememe prediction via word embeddings and matrix factorization. In Proceedings of IJCAI, 2017. 25. Cheng Yang, Zhiyuan Liu, Deli Zhao, Maosong Sun, and Edward Y Chang. Network representation learning with rich text information. In Proceedings of IJCAI, 2015. 26. Cheng Yang, Maosong Sun, Zhiyuan Liu, and Cunchao Tu. Fast network embedding enhancement via high order proximity approximation. In Proceedings of IJCAI, 2017. 27. Cheng Yang, Maosong Sun, Wayne Xin Zhao, Zhiyuan Liu, and Edward Y Chang. A neural network approach to jointly modeling social networks and mobile trajectories. ACM Transactions on Information Systems (TOIS), 35(4):36, 2017. 28. Cheng Yang, Jian Tang, Maosong Sun, Ganqu Cui, and Zhiyuan Liu. Multi-scale information diffusion prediction with reinforced recurrent networks. In Proceedings of IJCAI, 2019. 29. Yuan Yao, Deming Ye, Peng Li, Xu Han, Yankai Lin, Zhenghao Liu, Zhiyuan Liu, Lixin Huang, Jie Zhou, and Maosong Sun. DocRED: A large-scale document-level relation extraction dataset.
In Proceedings of ACL, 2019. 30. Xiangkai Zeng, Cheng Yang, Cunchao Tu, Zhiyuan Liu, and Maosong Sun. Chinese LIWC lexicon expansion via hierarchical classification of word embeddings with sememe attention. In Proceedings of AAAI, 2018. 31. Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. ERNIE: Enhanced language representation with informative entities. In Proceedings of ACL, 2019.

32. Yu Zhao, Zhiyuan Liu, and Maosong Sun. Phrase type sensitive tensor indexing model for semantic composition. In Proceedings of AAAI, 2015. 33. Jie Zhou, Xu Han, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. GEAR: Graph-based evidence aggregating and reasoning for fact verification. In Proceedings of ACL, 2019.

Acknowledgements

The authors are very grateful for the contributions of our students and research collaborators, who have prepared initial drafts of some chapters or have given us comments, suggestions, and corrections. We list the main contributors for preparing initial drafts of each chapter as follows:

• Chapter 1: Tianyu Gao, Zhiyuan Liu.
• Chapter 2: Lei Xu, Yankai Lin.
• Chapter 3: Yankai Lin, Yang Liu.
• Chapter 4: Yankai Lin, Zhengyan Zhang, Cunchao Tu, Hongyin Luo.
• Chapter 5: Jiawei Wu, Yankai Lin, Zhenghao Liu, Haozhe Ji.
• Chapter 6: Fanchao Qi, Chenghao Yang.
• Chapter 7: Ruobing Xie, Xu Han.
• Chapter 8: Cheng Yang, Jie Zhou, Zhengyan Zhang.
• Chapter 9: Ji Xin, Yuan Yao, Deming Ye, Hao Zhu.
• Chapter 10: Xu Han, Zhengyan Zhang, Cheng Yang.
• Chapter 11: Cheng Yang, Zhiyuan Liu.

For the whole book, we thank Chaojun Xiao and Zhengyan Zhang for drawing the model figures, thank Chaojun Xiao for unifying the styles of figures and tables in the book, thank Shengding Hu for making the notation table and unifying the notations across chapters, thank Jingcheng Yuzhi and Chaojun Xiao for organizing the format of the references, thank Jingcheng Yuzhi, Jiaju Du, Haozhe Ji, Sicong Ouyang, and Ayana for the first-round proofreading, and thank Weize Chen, Ganqu Cui, Bowen Dong, Tianyu Gao, Xu Han, Zhenghao Liu, Fanchao Qi, Guangxuan Xiao, Cheng Yang, Yuan Yao, Shi Yu, Yuan Zang, Zhengyan Zhang, Haoxi Zhong, and Jie Zhou for the second-round proofreading. We also thank Cuncun Zhao for designing the book cover.

In this book, there is a specific chapter talking about sememe knowledge representation. Many works in this chapter were carried out by our research group. These works have received great encouragement from the inventor of HowNet, Mr. Zhendong Dong, who died at 82 on February 28, 2019. HowNet is the great linguistic and commonsense knowledge base composed by Mr. Dong for about 30 years. At the end of his life, he and his son Mr. Qiang Dong decided to collaborate with us and released the open-source version of HowNet, OpenHowNet. As a pioneer of machine translation in China, Mr. Zhendong Dong devoted his whole life to natural language processing. He will be missed by all of us forever.

We thank our colleagues and friends, Yang Liu and Juanzi Li at Tsinghua University, and Peng Li at Tencent Wechat, who offered close and frequent discussions which substantially improved this book. We also want to express our special thanks to Prof. Bo Zhang. His insights into deep learning and representation learning, and his sincere encouragement of our research on representation learning for NLP, have greatly stimulated us to move forward with more confidence and passion.

We proposed the plan of this book in 2015 after discussing it with the Springer Senior Editor, Dr. Celine Lanlan Chang. As this was our first time preparing a technical book, we were not expecting that it would take so long to finish. We thank Celine for providing insightful comments and incredible patience during the preparation of this book. We are also grateful to Springer's Assistant Editor, Jane Li, for offering invaluable help during manuscript preparation.

Finally, we give our appreciation to our organizations, the Department of Computer Science and Technology at Tsinghua University, the Institute for Artificial Intelligence at Tsinghua University, the Beijing Academy of Artificial Intelligence (BAAI), the Chinese Information Processing Society of China, and Tencent Wechat, which have provided an outstanding environment, support, and facilities for preparing this book. This book is supported by the Natural Science Foundation of China (NSFC) and the German Research Foundation (DFG) in Project Crossmodal Learning, NSFC 61621136008/DFG TRR-169.

Contents

1 Representation Learning and NLP ...... 1 1.1 Motivation ...... 1 1.2 Why Representation Learning Is Important for NLP ...... 3 1.3 Basic Ideas of Representation Learning ...... 3 1.4 Development of Representation Learning for NLP ...... 4 1.5 Learning Approaches to Representation Learning for NLP .... 7 1.6 Applications of Representation Learning for NLP ...... 8 1.7 The Organization of This Book ...... 9 References ...... 10 2 Word Representation ...... 13 2.1 Introduction ...... 13 2.2 One-Hot Word Representation ...... 14 2.3 Distributed Word Representation ...... 14 2.3.1 Brown Cluster ...... 15 2.3.2 ...... 17 2.3.3 Word2vec ...... 18 2.3.4 GloVe ...... 21 2.4 Contextualized Word Representation ...... 22 2.5 Extensions ...... 23 2.5.1 Word Representation Theories...... 24 2.5.2 Multi-prototype Word Representation ...... 25 2.5.3 Multisource Word Representation ...... 26 2.5.4 Multilingual Word Representation ...... 30 2.5.5 Task-Specific Word Representation ...... 32 2.5.6 Time-Specific Word Representation ...... 33 2.6 Evaluation ...... 33 2.6.1 Word Similarity/Relatedness ...... 34 2.6.2 Word Analogy ...... 36


2.7 Summary ...... 36 References ...... 38 3 Compositional Semantics ...... 43 3.1 Introduction ...... 43 3.2 Semantic Space...... 45 3.2.1 Vector Space ...... 45 3.2.2 Matrix-Vector Space ...... 45 3.3 Binary Composition ...... 46 3.3.1 Additive Model ...... 47 3.3.2 Multiplicative Model ...... 49 3.4 N-Ary Composition ...... 51 3.4.1 Recurrent Neural Network ...... 52 3.4.2 Recursive Neural Network ...... 53 3.4.3 Convolutional Neural Network ...... 55 3.5 Summary ...... 55 References ...... 56 4 Sentence Representation ...... 59 4.1 Introduction ...... 59 4.2 One-Hot Sentence Representation ...... 60 4.3 Probabilistic Language Model ...... 61 4.4 Neural Language Model ...... 62 4.4.1 Feedforward Neural Network Language Model ...... 62 4.4.2 Convolutional Neural Network Language Model ..... 63 4.4.3 Recurrent Neural Network Language Model ...... 63 4.4.4 Transformer Language Model ...... 64 4.4.5 Extensions ...... 69 4.5 Applications ...... 72 4.5.1 Text Classification ...... 73 4.5.2 Relation Extraction ...... 74 4.6 Summary ...... 84 References ...... 85 5 Document Representation ...... 91 5.1 Introduction ...... 91 5.2 One-Hot Document Representation ...... 92 5.3 ...... 93 5.3.1 Latent Dirichlet Allocation ...... 94 5.3.2 Extensions ...... 97 5.4 Distributed Document Representation ...... 100 5.4.1 Paragraph Vector ...... 100 5.4.2 Neural Document Representation...... 102 Contents xv

5.5 Applications ...... 107 5.5.1 Neural Information Retrieval ...... 107 5.5.2 Question Answering ...... 110 5.6 Summary ...... 119 References ...... 120 6 Sememe Knowledge Representation ...... 125 6.1 Introduction ...... 125 6.1.1 Linguistic Knowledge Graphs ...... 126 6.2 Sememe Knowledge Representation ...... 128 6.2.1 Simple Sememe Aggregation Model ...... 129 6.2.2 Sememe Attention over Context Model ...... 129 6.2.3 Sememe Attention over Target Model ...... 131 6.3 Applications ...... 132 6.3.1 Sememe-Guided Word Representation ...... 132 6.3.2 Sememe-Guided Semantic Compositionality Modeling ...... 134 6.3.3 Sememe-Guided Language Modeling ...... 139 6.3.4 Sememe Prediction ...... 142 6.3.5 Other Sememe-Guided Applications ...... 153 6.4 Summary ...... 157 References ...... 158 7 World Knowledge Representation ...... 163 7.1 Introduction ...... 163 7.1.1 World Knowledge Graphs ...... 164 7.2 Knowledge Graph Representation ...... 166 7.2.1 Notations ...... 168 7.2.2 TransE ...... 168 7.2.3 Extensions of TransE ...... 172 7.2.4 Other Models ...... 182 7.3 Multisource Knowledge Graph Representation ...... 189 7.3.1 Knowledge Graph Representation with Texts ...... 190 7.3.2 Knowledge Graph Representation with Types ...... 192 7.3.3 Knowledge Graph Representation with Images ...... 194 7.3.4 Knowledge Graph Representation with Logic Rules ... 195 7.4 Applications ...... 196 7.4.1 Knowledge Graph Completion ...... 197 7.4.2 Knowledge-Guided Entity Typing ...... 199 7.4.3 Knowledge-Guided Information Retrieval ...... 201 7.4.4 Knowledge-Guided Language Models ...... 205 7.4.5 Other Knowledge-Guided Applications ...... 208 7.5 Summary ...... 210 References ...... 211 xvi Contents

8 Network Representation ...... 217 8.1 Introduction ...... 217 8.2 Network Representation ...... 219 8.2.1 Spectral Clustering Based Methods ...... 219 8.2.2 DeepWalk ...... 223 8.2.3 Matrix Factorization Based Methods ...... 230 8.2.4 Structural Deep Network Methods ...... 232 8.2.5 Extensions ...... 234 8.2.6 Applications ...... 247 8.3 Graph Neural Networks...... 252 8.3.1 Motivations ...... 253 8.3.2 Graph Convolutional Networks ...... 254 8.3.3 Graph Attention Networks ...... 259 8.3.4 Graph Recurrent Networks ...... 260 8.3.5 Extensions ...... 262 8.3.6 Applications ...... 266 8.4 Summary ...... 275 References ...... 277 9 Cross-Modal Representation ...... 285 9.1 Introduction ...... 285 9.2 Cross-Modal Representation ...... 286 9.2.1 Visual Word2vec ...... 286 9.2.2 Cross-Modal Representation for Zero-Shot Recognition ...... 288 9.2.3 Cross-Modal Representation for Cross-Media Retrieval ...... 292 9.3 Image Captioning ...... 294 9.3.1 Retrieval Models for Image Captioning ...... 294 9.3.2 Generation Models for Image Captioning ...... 295 9.3.3 Neural Models for Image Captioning ...... 296 9.4 Visual Relationship Detection ...... 301 9.4.1 Visual Relationship Detection with Language Priors...... 301 9.4.2 Visual Translation Embedding Network ...... 303 9.4.3 Scene Graph Generation ...... 303 9.5 Visual Question Answering ...... 307 9.5.1 VQA and VQA Datasets ...... 307 9.5.2 VQA Models ...... 308 9.6 Summary ...... 311 References ...... 314 Contents xvii

10 Resources ...... 319 10.1 Open-Source Frameworks for Deep Learning ...... 319 10.1.1 Caffe ...... 319 10.1.2 Theano ...... 320 10.1.3 TensorFlow ...... 321 10.1.4 Torch ...... 321 10.1.5 PyTorch ...... 322 10.1.6 Keras ...... 323 10.1.7 MXNet ...... 323 10.2 Open Resources for Word Representation ...... 324 10.2.1 Word2Vec ...... 324 10.2.2 GloVe ...... 324 10.3 Open Resources for Knowledge Graph Representation ...... 325 10.3.1 OpenKE ...... 325 10.3.2 Scikit-Kge ...... 326 10.4 Open Resources for Network Representation ...... 326 10.4.1 OpenNE ...... 326 10.4.2 GEM ...... 326 10.4.3 GraphVite ...... 327 10.4.4 CogDL ...... 327 10.5 Open Resources for Relation Extraction ...... 327 10.5.1 OpenNRE ...... 327 References ...... 328 11 Outlook ...... 329 11.1 Introduction ...... 329 11.2 Using More Unsupervised Data ...... 330 11.3 Utilizing Fewer Labeled Data ...... 330 11.4 Employing Deeper Neural Architectures ...... 331 11.5 Improving Model Interpretability ...... 332 11.6 Fusing the Advances from Other Areas ...... 333 References ...... 333

Acronyms

ACNN Anisotropic Convolutional Neural Network AI Artificial Intelligence AUC Area Under the Receiver Operating Characteristic Curve BERT Bidirectional Encoder Representations from Transformers BFS Breadth-First Search BiDAF Bi-Directional Attention Flow BRNN Bidirectional Recurrent Neural Network CBOW Continuous Bag-of-Words ccDCLM Context-to-Context Document-Context Language Model CIDEr Consensus-based Image Description Evaluation CLN Column Network CLSP Cross-Lingual Lexical Sememe Prediction CNN Convolutional Neural Network CNRL Community-enhanced Network Representation Learning COCO-QA Common Objects in COntext Question Answering ConSE Convex Combination of Semantic Embeddings Conv-KNRM Convolutional Kernel-based Neural Ranking Model CSP Character-enhanced Sememe Prediction CWE Character-based Word Embeddings DCNN Diffusion-Convolutional Neural Network DeViSE Deep Visual-Semantic Embedding Model DFS Depth-First Search DGCN Dual Graph Convolutional Network DGE Directed Graph Embedding DKRL Description-embodied Knowledge Graph Representation Learning DRMM Deep Relevance Matching Model DSSM Deep Structured Semantic Model ECC Edge-Conditioned Convolution ERNIE Enhanced Language Representation Model with Informative Entities


FM-IQA Freestyle Multilingual Image Question Answering GAAN Gated Attention Network GAT Graph Attention Networks GCN Graph Convolutional Network GCNN Geodesic Convolutional Neural Network GEAR Graph-based Evidence Aggregating and Reasoning GENQA Generative Question Answering Model GGNN Gated Graph Neural Network GloVe Global Vectors for Word Representation GNN Graph Neural Networks GRN Graph Recurrent Network GRU Gated Recurrent Unit HAN Heterogeneous Graph Attention Network HMM Hidden Markov Model HOPE High-Order Proximity preserved Embeddings IDF Inverse Document Frequency IE IKRL Image-embodied Knowledge Graph Representation Learning IR Information Retrieval KALM Knowledge-Augmented Language Model KB Knowledge Base KBC Knowledge Base Completion KG Knowledge Graph KL Kullback-Leibler KNET Knowledge-guided Attention Neural Entity Typing K-NRM Kernel-based Neural Ranking Model KR Knowledge Representation LBSN Location-Based Social Network LDA Latent Dirichlet Allocation LIWC Linguistic Inquiry and Word Count LLE Locally Linear Embedding LM Language Model LSA Latent Semantic Analysis LSHM Latent Space Heterogeneous Model LSTM Long Short-Term Memory MAP Mean Average Precision METEOR Metric for Evaluation of Translation with Explicit ORdering MMD Maximum Mean Discrepancy MMDW Max-Margin DeepWalk M-NMF Modularized Nonnegative Matrix Factorization movMF mixture of von Mises-Fisher distributions MRF Markov Random Field MSLE Mean-Square Log-Transformed Error MST Minimum Spanning Tree MV-RNN Matrix-Vector Recursive Neural Network Acronyms xxi

NEU Network Embedding Update NKLM Neural Knowledge Language Model NLI Natural Language Inference NLP Natural Language Processing NRE Neural Relation Extraction OOKB Out-of-Knowledge-Base PCNN Piece-wise Convolution Neural Network pLSI Probabilistic Latent Semantic Indexing PMI Point-wise Mutual Information POS Part-of-Speech PPMI Positive Point-wise Mutual Information PTE Embedding PV-DBOW Paragraph Vector with Distributed Bag-of-Words PV-DM Paragraph Vector with Distributed Memory QA Question Answering RBF Restricted Boltzmann Machine RC Relation Classification R-CNN Region-based Convolutional Neural Network RDF Resource Description Framework RE Relation Extraction RMSE Mean Squared Error RNN Recurrent Neural Network RNTN Recursive Neural Tensor Network RPN Region Proposal Network SAC Sememe Attention over Context Model SAT Sememe Attention over Target Model SC Semantic Compositionality SCAS Semantic Compositionality with Aggregated Sememe SCMSA Semantic Compositionality with Mutual Sememe Attention SDLM Sememe-Driven Language Model SDNE Structural Deep Network Embeddings SE-WRL Sememe-Encoded Word Representation Learning SGD Stochastic Gradient Descent SGNS Skip-gram with Negative Sampling Model S-LSTM Sentence Long Short-Term Memory SPASE Sememe Prediction with Aggregated Sememe Embeddings SPCSE Sememe Prediction with Character and Sememe Embeddings SPICE Semantic Propositional Image Caption Evaluation SPSE Sememe Prediction with Sememe Embeddings SPWCF Sememe Prediction with Word-to-Character Filtering SPWE Sememe Prediction with Word Embeddings SSA Simple Sememe Aggregation Model SSWE Sentiment-Specific Word Embeddings SVD Singular Value Decomposition SVM Support Vector Machine xxii Acronyms

TADW Text-associated DeepWalk TF Term Frequency TF-IDF Term Frequency–Inverse Document Frequency TKRL Type-embodied Knowledge Graph Representation Learning TSP Traveling Salesman Problem TWE Topical Word Embeddings VQA Visual Question Answering VSM Vector Space Model WRL Word Representation Learning WSD Word Sense Disambiguation YAGO Yet Another Great Ontology

Symbols and Notations

Tokyo  Word example
∗  Convolution operator
≜  Defined as
⊙  Element-wise multiplication (Hadamard product)
⇒  Induces
∝  Proportional to
Σ  Summation operator
min  Minimize
max  Maximize/max pooling
arg min  The parameter that minimizes a function
arg max  The parameter that maximizes a function
sim  Similarity
exp  Exponential function
Att  Attention function
Avg  Average function
F1-Score  F1 score
PMI  Pair-wise mutual information
ReLU  ReLU activation function
Sigmoid  Sigmoid function
Softmax  Softmax function
V  Vocabulary set
w; w  Word; word vector
E  Word embedding matrix
R  Relation set
r; r  Relation; relation embedding vector
a  Answer to question
q  Query
W; M; U  Weight matrices
b; d  Bias vectors
M_{i,:}; M_{:,j}  Matrix's ith row; matrix's jth column
a^T; M^T  Transpose of a vector or a matrix
tr(·)  Trace of a matrix
M (calligraphic)  Tensor
α; β; λ  Hyperparameters
g; f; φ; Φ; δ  Functions
N_a  Number of occurrences of a
d(·, ·)  Distance function
s(·, ·)  Similarity function
P(·); p(·)  Probability
I(·, ·)  Mutual information
H(·)  Entropy
O(·)  Time complexity
D_KL(·||·)  KL divergence
L  Loss function
O  Objective function
E  Energy function
α  Attention score
φ*  Optimal value of a variable/function φ
|·|  Vector length; set size
‖·‖_a  a-norm
1(·)  Indicator function
θ  Parameters in a neural network
E  Expectation of a random variable
·+; ·−  Positive sample; negative sample
γ  Margin
μ  Mean of normal distribution
μ (bold)  Mean vector of Gaussian distribution
σ  Standard error of normal distribution
Σ  Covariance matrix of Gaussian distribution
I_n  n-dimensional identity matrix
N  Normal distribution

Chapter 1 Representation Learning and NLP

Abstract Natural languages are typical unstructured information. Conventional Natural Language Processing (NLP) heavily relies on feature engineering, which requires careful design and considerable expertise. Representation learning aims to learn representations of raw data as useful information for further classification or prediction. This chapter presents a brief introduction to representation learning, including its motivation and basic idea, and also reviews its history and recent advances in both machine learning and NLP.

1.1 Motivation

Machine learning addresses the problem of automatically learning computer programs from data. A typical machine learning system consists of three components [5]:

Machine Learning = Representation + Objective + Optimization. (1.1)

That is, to build an effective machine learning system, we first transform useful information from raw data into internal representations such as feature vectors. Then, by designing appropriate objective functions, we can employ optimization algorithms to find the optimal parameter settings for the system. Data representation determines how much useful information can be extracted from raw data for further classification or prediction. If more useful information is transformed from raw data into feature representations, the performance of classification or prediction will tend to be better. Hence, data representation is a crucial component in supporting effective machine learning. Conventional machine learning systems adopt careful feature engineering as preprocessing to build feature representations from raw data. Feature engineering needs careful design and considerable expertise, and a specific task usually requires customized feature engineering algorithms, which makes feature engineering labor-intensive, time-consuming, and inflexible. Representation learning aims to learn informative representations of objects from raw data automatically. The learned representations can be further fed as input to

machine learning systems for prediction or classification. In this way, machine learning algorithms will be more flexible and desirable while handling large-scale and noisy unstructured data, such as speech, images, videos, time series, and texts. Deep learning [9] is a typical approach for representation learning, which has recently achieved great success in speech recognition, computer vision, and natural language processing. Deep learning has two distinguishing features:

• Distributed Representation. Deep learning algorithms typically represent each object with a low-dimensional real-valued dense vector, which is named distributed representation. As compared to the one-hot representation in conventional representation schemes (such as bag-of-words models), distributed representation is able to represent data in a more compact and smooth way, as shown in Fig. 1.1, and hence is more robust in addressing the sparsity issue in large-scale data.
• Deep Architecture. Deep learning algorithms usually learn a hierarchical deep architecture to represent objects, known as multilayer neural networks. The deep architecture is able to extract abstract features of objects from raw data, which is regarded as an important reason for the great success of deep learning in speech recognition and computer vision.
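To make the contrast between one-hot and distributed representations concrete, here is a minimal sketch, not taken from the book: the toy vocabulary, the 4-dimensional embeddings, and the random vectors are made-up placeholders for illustration only.

```python
# A toy comparison of one-hot vs. distributed (dense) word representations.
# The vocabulary and the 4-dimensional embeddings are invented examples.
import numpy as np

vocab = ["apple", "banana", "iphone", "ipad"]
word2id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Sparse symbol-based representation: |V| dimensions with a single 1."""
    vec = np.zeros(len(vocab))
    vec[word2id[word]] = 1.0
    return vec

# Distributed representation: each word is a low-dimensional dense vector,
# and every dimension takes part in representing every word.
embeddings = np.random.RandomState(0).randn(len(vocab), 4)

print(one_hot("apple"))               # e.g., [1. 0. 0. 0.]
print(embeddings[word2id["apple"]])   # dense real-valued vector
# Any two distinct one-hot vectors are orthogonal, so they cannot express
# that "apple" is closer to "banana" than to "iphone"; dense vectors can.
print(one_hot("apple") @ one_hot("banana"))  # always 0.0
```

In practice the dense vectors would of course be learned from data rather than sampled at random; the point here is only the difference in shape and behavior between the two schemes.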

Currently, the improvements brought by deep learning for NLP may still not be as significant as those for speech and vision. However, deep learning for NLP has been able to significantly reduce the work of feature engineering in NLP while improving performance at the same time. Hence, many researchers are devoting themselves to developing efficient algorithms on representation learning (especially deep learning) for NLP. In this chapter, we will first discuss why representation learning is important for NLP and introduce the basic ideas of representation learning. Afterward, we will

briefly review the development history of representation learning for NLP, introduce typical approaches of contemporary representation learning, and summarize existing and potential applications of representation learning. Finally, we will introduce the general organization of this book.

Fig. 1.1 Distributed representation of words and entities in human languages (the figure depicts entities such as Steve Jobs, Tim Cook, iPhone, and Apple Inc., together with relations such as founder, CEO, and product, mapped into a common embedding space)

1.2 Why Representation Learning Is Important for NLP

NLP aims to build language-specific programs for machines to understand languages. Natural language texts are typical unstructured data, with multiple granularities, multiple tasks, and multiple domains, which makes it challenging for NLP to achieve satisfactory performance. Multiple Granularities. NLP concerns multiple levels of language entries, including but not limited to characters, words, phrases, sentences, paragraphs, and documents. Representation learning can help to represent the semantics of these language entries in a unified semantic space, and build complex semantic relations among these language entries. Multiple Tasks. There are various NLP tasks based on the same input. For example, given a sentence, we can perform multiple tasks such as word segmentation, part-of-speech tagging, named entity recognition, relation extraction, and machine translation. In this case, it will be more efficient and robust to build a unified representation space of inputs for multiple tasks. Multiple Domains. Natural language texts may be generated from multiple domains, including but not limited to news articles, scientific articles, literary works, and online user-generated content such as product reviews. Moreover, we can also regard texts in different languages as multiple domains. Conventional NLP systems have to design specific feature extraction algorithms for each domain according to its characteristics. In contrast, representation learning enables us to build representations automatically from large-scale domain data. In summary, as shown in Fig. 1.2, representation learning can facilitate knowledge transfer across multiple language entries, multiple NLP tasks, and multiple application domains, and significantly improve the effectiveness and robustness of NLP performance.

1.3 Basic Ideas of Representation Learning

In this book, we focus on the distributed representation scheme (i.e., embedding), and talk about recent advances of representation learning methods for multiple language entries, including words, phrases, sentences, and documents, as well as their closely related objects, including sememe-based linguistic knowledge, entity-based world knowledge, networks, and cross-modal entries.

Fig. 1.2 Distributed representation can provide unified semantic space for multi-grained language entries and for multiple NLP tasks

With distributed representation learning, all objects that we are interested in are projected into a unified low-dimensional semantic space. As demonstrated in Fig. 1.1, the geometric distance between two objects in the semantic space indicates their semantic relatedness; the semantic meaning of an object is related to which objects are close to it. In other words, it is the relative closeness to other objects that reveals an object's meaning, rather than its absolute position.
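The geometric intuition above can be made concrete in a few lines of NumPy. The sketch below is not from the book; its embedding matrix is a random stand-in for learned vectors, and it simply measures relatedness by cosine similarity and retrieves the nearest neighbors of a word in the embedding space.

```python
# Semantic relatedness as geometric closeness in an embedding space.
# The embeddings here are random stand-ins; in practice they are learned.
import numpy as np

rng = np.random.RandomState(42)
vocab = ["king", "queen", "man", "woman", "apple"]
emb = rng.randn(len(vocab), 8)            # each row: an 8-dimensional word vector
word2id = {w: i for i, w in enumerate(vocab)}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest(word, k=2):
    """Rank all other words by cosine similarity to `word`."""
    q = emb[word2id[word]]
    scores = [(w, cosine(q, emb[i])) for w, i in word2id.items() if w != word]
    return sorted(scores, key=lambda x: -x[1])[:k]

print(nearest("king"))   # with trained embeddings, "queen" would rank high here
```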

1.4 Development of Representation Learning for NLP

In this section, we introduce the development of representation learning for NLP, also shown in Fig. 1.3. To study representation schemes in NLP, words are a good start, since they are the minimum units in natural languages. The easiest way to represent a word in a computer-readable way (e.g., using a vector) is the one-hot vector, which has the dimension of the vocabulary size and assigns 1 to the word's corresponding position and 0 to the others. It is apparent that one-hot vectors hardly contain any semantic information about words except simply distinguishing them from each other. One of the earliest ideas of word representation learning can date back to n-gram models [15]. It is easy to understand: when we want to predict the next word in a sequence, we usually look at some previous words (and in the case of n-gram, they are the previous n − 1 words). If we go through a large-scale corpus, we can count and obtain a good estimation of the probability of each word conditioned on all combinations of n − 1 previous words. These probabilities are useful for predicting words in sequences, and also form vector representations for words since they reflect the meanings of words. The idea of n-gram models is coherent with the distributional hypothesis: linguistic items with similar distributions have similar meanings [7]. Put another way, "a word is characterized by the company it keeps" [6]. This became the fundamental idea of many NLP models, from word2vec to BERT.
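As a minimal illustration of the counting idea behind n-gram models, the sketch below (not from the book; the tiny corpus is invented) estimates bigram probabilities P(next word | previous word) by relative frequency.

```python
# Estimating bigram probabilities P(next | previous) by counting, the core
# idea of n-gram language models (here n = 2, on an invented toy corpus).
from collections import Counter, defaultdict

corpus = ["the cat sat on the mat", "the dog sat on the log"]

bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for prev, nxt in zip(tokens, tokens[1:]):
        bigram_counts[prev][nxt] += 1

def prob(nxt, prev):
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][nxt] / total if total else 0.0

print(prob("cat", "the"))   # 0.25: "the" is followed by cat, dog, mat, log
print(prob("sat", "cat"))   # 1.0
# The row of probabilities {P(w | "the")} can itself be read as a
# count-based distributional vector describing the context "the".
```

Real n-gram models add smoothing for unseen word combinations, but the counting step shown here is the essential mechanism.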

Fig. 1.3 The timeline for the development of representation learning in NLP. With the growing computing power and large-scale text data, distributed representation trained with neural networks and large corpora has become the mainstream. (Milestones shown in the figure: 1948, n-gram models, which predict the next item in a sequence based on its previous n − 1 items; 1954, bag-of-words and the distributional hypothesis, where a word is characterized by the company it keeps and a sentence or a document is represented as the bag of its words; 1986, distributed representation, which represents items by a pattern of activation distributed over elements; 2003, the neural probabilistic language model, which learns a distributed representation of words for language modeling; 2013, word2vec, a simple and efficient distributed word representation used in many NLP models; 2018, pre-trained language models, which provide contextual word representations through a pretraining fine-tuning pipeline, larger corpora, and deeper neural architectures.)

Another example of the distributional hypothesis is Bag-Of-Words (BOW) models [7]. BOW models regard a document as a bag of its words, disregarding the orders of these words in the document. In this way, the document can be represented as a vocabulary-size vector, in which each word that has appeared in the document corresponds to a unique and nonzero dimension. Then a score can be further computed for each word (e.g., the number of occurrences) to indicate its weight in the document. Though very simple, BOW models work great in applications like spam filtering, text classification, and information retrieval, proving that the distributions of words can serve as a good representation for text. In the above cases, each value in the representation clearly matches one entry (e.g., word scores in BOW models). This one-to-one correspondence between concepts and representation elements is called local representation or symbol-based representation, which is natural and simple. In distributed representation, on the other hand, each entity (or attribute) is represented by a pattern of activation distributed over multiple elements, and each computing element is involved in representing multiple entities [11]. Distributed representation has been proved to be more efficient because it usually has low dimensions that can prevent the sparsity issue. Useful hidden properties can be learned from large-scale data and emerge in distributed representation. The idea of distributed representation was originally inspired by the neural computation scheme of humans and other animals. It comes from neural networks (activations of neurons), and with the great success of deep learning, distributed representation has become the most commonly used approach for representation learning. One of the pioneer practices of distributed representation in NLP is the Neural Probabilistic Language Model (NPLM) [1]. A language model predicts the joint probability of sequences of words (n-gram models are simple language models). NPLM first assigns a distributed vector to each word, then uses a neural network to predict the next word. By going through the training corpora, NPLM successfully learns how to model the joint probability of sentences, while bringing word embeddings (i.e., low-dimensional word vectors) as learned parameters of NPLM; a minimal sketch of such a model follows.
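The sketch below is a highly simplified, NPLM-style model written with PyTorch, which is purely an assumption of this illustration (the original NPLM predates modern frameworks, and all sizes and data here are placeholders): an embedding table maps each of the previous n − 1 words to a dense vector, and a small feedforward network scores the next word. The word embeddings are ordinary learnable parameters of the model.

```python
# A minimal NPLM-style sketch: word embeddings + a feedforward network that
# predicts the next word from the previous n-1 words. Vocabulary size,
# dimensions, and the input ids are toy placeholders.
import torch
import torch.nn as nn

vocab_size, emb_dim, context_size, hidden = 100, 16, 2, 32

class TinyNPLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)      # learned word vectors
        self.ff = nn.Sequential(
            nn.Linear(context_size * emb_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, vocab_size),                # scores over next word
        )

    def forward(self, context_ids):                       # (batch, context_size)
        e = self.emb(context_ids).view(context_ids.size(0), -1)
        return self.ff(e)

model = TinyNPLM()
context = torch.tensor([[3, 7]])           # ids of the previous two words
target = torch.tensor([12])                # id of the true next word
loss = nn.CrossEntropyLoss()(model(context), target)
loss.backward()                            # gradients also flow into self.emb,
                                           # i.e., the word embeddings themselves
```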

Though it is hard to tell what each element of a word embedding actually means, the vectors indeed encode semantic meanings about the words, as verified by the performance of NPLM. Inspired by NPLM, there came many methods that embed words into distributed representations and use the language modeling objective to optimize them as model parameters. Famous examples include word2vec [12], GloVe [13], and fastText [3]. Though differing in detail, these methods are all very efficient to train, utilize large-scale corpora, and have been widely adopted as word embeddings in many NLP models. Word embeddings in the NLP pipeline map discrete words into informative low-dimensional vectors, and help to shine a light on neural networks for computing and understanding languages. This makes representation learning a critical part of natural language processing.

The research on representation learning in NLP took a big leap when ELMo [14] and BERT [4] came out. Besides using larger corpora, more parameters, and more computing resources as compared to word2vec, they also take the complicated context in text into consideration. This means that instead of assigning each word a fixed vector, ELMo and BERT use multilayer neural networks to calculate dynamic representations for the words based on their context, which is especially useful for words with multiple meanings. Moreover, BERT starts a new fashion (though not originated from it) of the pretraining fine-tuning pipeline. Previously, word embeddings were simply adopted as input representations. But after BERT, it becomes a common practice to keep using the same neural network structure such as BERT in both pretraining and fine-tuning, that is, taking the parameters of BERT for initialization and fine-tuning the model on downstream tasks (Fig. 1.4).

Fig. 1.4 This figure shows how word embeddings and pre-trained language models work in NLP pipelines. They both learn distributed representations for language entries (e.g., words) through pretraining objectives and transfer them to target tasks. Furthermore, pre-trained language models can also transfer model parameters

Though not a big theoretical breakthrough, BERT-like models (also known as Pre-trained Language Models (PLMs), for they are pretrained through the language modeling objective on large corpora) have attracted wide attention in the NLP and machine learning community, for they have been so successful and achieved state-of-the-art results on almost every NLP benchmark. These models show what large-scale data and computing power can lead to, and new research works on the topic of pre-trained language models emerge rapidly. Probing experiments demonstrate that PLMs implicitly encode a variety of linguistic knowledge and patterns inside their multilayer network parameters [8, 10]. All these significant performances and interesting analyses suggest that there are still a lot of open problems to explore in PLMs, as the future of representation learning for NLP.

Based on the distributional hypothesis, representation learning for NLP has evolved from symbol-based representation to distributed representation. Starting from word2vec, word embeddings trained from large corpora have shown significant power in most NLP tasks. Recently emerged PLMs (like BERT) take complicated context into word representation and start a new trend of the pretraining fine-tuning pipeline, bringing NLP to a new level. What will be the next big change in representation learning for NLP?
We hope the contents of this book can give you some inspiration.
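As a concrete illustration of the contextualized representations and the pretraining fine-tuning pipeline discussed in this section, the sketch below relies on the external Hugging Face transformers library and the public bert-base-uncased checkpoint, neither of which is introduced in this chapter; it is only one possible way to obtain BERT representations, not the book's own tooling. It extracts vectors for the same word "bank" in two different contexts: with a fixed word embedding the two vectors would be identical, whereas BERT produces different vectors for the two senses.

```python
# A minimal sketch: contextual word representations from a pre-trained BERT
# encoder via the Hugging Face `transformers` library (external dependency).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

sentences = ["He sat on the bank of the river.",
             "She deposited the money in the bank."]

with torch.no_grad():
    for sent in sentences:
        inputs = tokenizer(sent, return_tensors="pt")
        hidden = model(**inputs).last_hidden_state[0]        # (seq_len, 768)
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
        idx = tokens.index("bank")                           # locate the word
        print(sent, "->", hidden[idx][:5])  # same word, different vectors
```

For fine-tuning, one would keep this encoder, add a small task-specific output layer, and continue training all parameters on the downstream task, which is exactly the pipeline sketched in Fig. 1.4.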

1.5 Learning Approaches to Representation Learning for NLP

People have developed various effective and efficient approaches to learn semantic representations for NLP. Here we list some typical approaches.

Statistical Features: As introduced before, semantic representations for NLP in the early stage often come from statistics, instead of emerging from an optimization process. For example, in n-gram or bag-of-words models, elements in the representation are usually frequencies or numbers of occurrences of the corresponding entries counted in large-scale corpora.

Hand-crafted Features: In certain NLP tasks, syntactic and semantic features are useful for solving the problem, for example, the types of words and entities, semantic roles, parse trees, etc. These linguistic features may be provided with the tasks or can be extracted by specific NLP systems. For a long period before the wide use of distributed representation, researchers devoted a lot of effort to designing useful features and combining them as the inputs for NLP models.

Supervised Learning: Distributed representations emerge from the optimization process of neural networks under supervised learning. In the hidden layers of neural networks, the different activation patterns of neurons represent different entities or attributes. With a training objective (usually a loss function for the target task) and supervised signals (usually the gold-standard labels for training instances of the target task), the networks can learn better parameters via optimization (e.g., gradient descent). With proper training, the hidden states become informative and generalized, serving as good semantic representations of natural languages. For example, to train a neural network for a sentiment classification task, the loss function is usually set as the cross-entropy of the model predictions with respect to the gold-standard sentiment labels as supervision. While optimizing the objective, the loss gets smaller and the model performance gets better. In the meantime, the hidden states of the model gradually form good sentence representations by encoding the necessary information for sentiment classification inside the continuous hidden space.
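The following sketch shows the supervised setup described above in PyTorch; it is not from the book, and the toy data, dimensions, and the average-pooling architecture are placeholders chosen only to keep the example short. The hidden sentence representation is shaped as a side effect of optimizing a cross-entropy loss against gold sentiment labels.

```python
# Supervised representation learning sketch: a sentence representation emerges
# while optimizing cross-entropy against gold sentiment labels.
# All sizes and data below are toy placeholders.
import torch
import torch.nn as nn

vocab_size, emb_dim, num_classes = 1000, 32, 2

class TinySentimentModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.cls = nn.Linear(emb_dim, num_classes)

    def forward(self, token_ids):                      # (batch, seq_len)
        sent_repr = self.emb(token_ids).mean(dim=1)    # average pooling -> sentence vector
        return self.cls(sent_repr), sent_repr

model = TinySentimentModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

token_ids = torch.randint(0, vocab_size, (4, 6))       # 4 toy "sentences"
labels = torch.tensor([0, 1, 1, 0])                    # gold sentiment labels

logits, sent_repr = model(token_ids)
loss = nn.CrossEntropyLoss()(logits, labels)           # supervised objective
optimizer.zero_grad()
loss.backward()
optimizer.step()
# After many such steps on real data, `sent_repr` becomes an informative
# sentence representation for the sentiment task.
```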

Self-supervised Learning: In some cases, we simply want to get good representations for certain elements, so that these representations can be transferred to other tasks. For example, in most neural NLP models, words in sentences are first mapped to their corresponding word embeddings (maybe from word2vec or GloVe) before being sent to the networks. However, there are no human-annotated "labels" for learning word embeddings. To acquire the training objective necessary for neural networks, we need to generate "labels" intrinsically from existing data. This is called self-supervised learning (one form of unsupervised learning). For example, language modeling is a typical "self-supervised" objective, for it does not require any human annotations. Based on the distributional hypothesis, using the language modeling objective can lead to hidden representations that encode the semantics of words. You may have heard of a famous equation: w(king) − w(man) + w(woman) = w(queen), which demonstrates the analogical properties that the word embeddings have acquired through self-supervised learning; a small sketch of this analogy test is given at the end of this section.

We can see another angle of self-supervised learning in autoencoders. An autoencoder is also a way to learn representations for a set of data. Typical autoencoders have a reduction (encoding) phase and a reconstruction (decoding) phase. In the reduction phase, an item from the data is encoded into a low-dimensional representation, and in the reconstruction phase, the model tries to reconstruct the item from the intermediate representation. Here, the training objective is the reconstruction loss, derived from the data itself. During the training process, meaningful information is encoded and kept in the latent representation, while noise signals are discarded.

Self-supervised learning has achieved great success in NLP, because the plain text itself contains abundant knowledge and patterns about languages, and self-supervised learning can fully utilize the existing large-scale corpora. Nowadays, it is still the most exciting research area of representation learning for natural languages, and researchers continue to put their efforts in this direction.

Besides, many other machine learning approaches have also been explored in representation learning for NLP, such as adversarial training, contrastive learning, few-shot learning, meta-learning, continual learning, and reinforcement learning. How to develop more effective and efficient approaches of representation learning for NLP, and how to better take advantage of large-scale and complicated corpora and computing power, remain important research topics.
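The analogy w(king) − w(man) + w(woman) ≈ w(queen) can be checked with a few lines of NumPy once embeddings are available. In the sketch below the embedding matrix is a random placeholder, which is purely an assumption for illustration; embeddings learned by word2vec-style self-supervised training would actually place "queen" nearest to the offset vector.

```python
# Word analogy by vector arithmetic: king - man + woman ~= queen.
# `emb` is a random placeholder here; with trained word2vec/GloVe vectors
# the nearest neighbor of the offset would indeed be "queen".
import numpy as np

vocab = ["king", "queen", "man", "woman", "apple", "paris"]
word2id = {w: i for i, w in enumerate(vocab)}
emb = np.random.RandomState(0).randn(len(vocab), 50)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)    # unit-normalize rows

def analogy(a, b, c, exclude=True):
    """Return the word d maximizing cos(d, b - a + c), i.e., a : b :: c : d."""
    query = emb[word2id[b]] - emb[word2id[a]] + emb[word2id[c]]
    scores = emb @ query
    banned = {a, b, c} if exclude else set()
    return max((w for w in vocab if w not in banned),
               key=lambda w: scores[word2id[w]])

print(analogy("man", "king", "woman"))   # "queen" with well-trained embeddings
```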

1.6 Applications of Representation Learning for NLP

In general, there are two kinds of applications of representation learning for NLP. In one case, the semantic representation is trained in a pretraining task (or designed by human experts) and is transferred to the model for the target task. Word embedding is an example of this kind of application: it is trained with the language modeling objective and is taken as the input of other downstream NLP models. In this book, we will also introduce sememe knowledge representation and world knowledge representation, which can also be integrated into some NLP systems as additional knowledge augmentation to enhance their performance in certain aspects. In the other case, the semantic representation lies within the hidden states of the neural model and directly aims for better performance of target tasks in an end-to-end fashion. For example, many NLP tasks want to semantically compose sentence or document representations: tasks like sentiment classification, natural language inference, and relation extraction require sentence representations, and tasks like question answering need document representations. As shown in the latter part of the book, many representation learning methods have been developed for sentences and documents and benefit these NLP tasks.

1.7 The Organization of This Book

We start the book with word representation. By giving a thorough introduction to word representation, we hope the readers can grasp the basic ideas of representation learning for NLP. Based on that, we further talk about how to compositionally acquire the representation of higher level language components, from sentences to documents.

As shown in Fig. 1.5, representation learning will be able to incorporate various types of structural knowledge to support a deep understanding of natural languages, which we refer to as knowledge-guided NLP. Hence, we next introduce two forms of knowledge representation that are closely related to NLP. On the one hand, sememe representation tries to encode linguistic and commonsense knowledge in natural languages. A sememe is defined as the minimum indivisible unit of semantic meaning [2]. With the help of sememe representation learning, we can get more interpretable and more robust NLP models. On the other hand, world knowledge representation studies how to encode world facts into a continuous semantic space. It can not only help with knowledge graph tasks but also benefit knowledge-guided NLP applications.

Besides, the network is also a natural way to represent objects and their relationships. In the network representation section, we study how to embed vertices and edges in a network and how these elements interact with each other. Through the applications, we further show how network representations can help NLP tasks. Another interesting topic related to NLP is cross-modal representation, which studies how to model unified semantic representations across different modalities (e.g., text, audio, images, videos, etc.). Through this section, we review several cross-modal problems along with representative models.

At the end of the book, we introduce some useful resources to the readers, including deep learning frameworks and open-source code. We also share some views about the next big topics in representation learning for NLP. We hope that the resources and the outlook can help our readers have a better understanding of the content of the book, and inspire our readers about how representation learning in NLP would further develop.

Fig. 1.5 The architecture of knowledge-guided NLP

References

1. Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003.
2. Leonard Bloomfield. A set of postulates for the science of language. Language, 2(3):153–164, 1926.
3. Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017.
4. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL, 2019.
5. Pedro Domingos. A few useful things to know about machine learning. Communications of the ACM, 55(10):78–87, 2012.
6. John R Firth. A synopsis of linguistic theory, 1930–1955. 1957.
7. Zellig S Harris. Distributional structure. Word, 10(2–3):146–162, 1954.
8. John Hewitt and Christopher D. Manning. A structural probe for finding syntax in word representations. In Proceedings of NAACL-HLT, 2019.
9. Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. Book in preparation for MIT Press, 2016.
10. Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. Linguistic knowledge and transferability of contextual representations. In Proceedings of NAACL-HLT, 2019.

11. James L McClelland, David E Rumelhart, PDP Research Group, et al. Parallel distributed processing. Explorations in the Microstructure of Cognition, 2:216–271, 1986.
12. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In Proceedings of NeurIPS, 2013.
13. Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of EMNLP, 2014.
14. Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of NAACL-HLT, pages 2227–2237, 2018.
15. Claude E Shannon. A mathematical theory of communication. Bell System Technical Journal, 27(3):379–423, 1948.

Chapter 2 Word Representation

Abstract Word representation, aiming to represent a word with a vector, plays an essential role in NLP. In this chapter, we first introduce several typical word representation learning methods, including one-hot representation and distributed representation. After that, we present two widely used evaluation tasks for measuring the quality of word embeddings. Finally, we introduce the recent extensions for word representation learning models.

2.1 Introduction

Words are usually considered the smallest meaningful units of speech or writing in human languages. High-level structures in a language, such as phrases and sentences, are further composed of words. For human beings, to understand a language, it is crucial to understand the meanings of words. Therefore, it is essential to accurately represent words, which could help models better understand, categorize, or generate text in NLP tasks.

A word can be naturally represented as a sequence of several characters. However, it is very inefficient and ineffective to use raw character sequences alone to represent words. First, the variable lengths of words make them hard to process and use in machine learning methods. Moreover, such a representation is very sparse, because only a tiny proportion of character arrangements are meaningful. For example, English words are usually character sequences composed of 1–20 characters in the English alphabet, but most of these character sequences, such as "aaaaa", are meaningless.

One-hot representation is another natural approach to represent words, which assigns a unique index to each word. It is also not good enough to represent words with one-hot representation. First, one-hot representation cannot capture the semantic relatedness among words. Second, one-hot representation is a high-dimensional sparse representation, which is very inefficient. Third, it is very inflexible for one-hot representation to deal with new words, which requires assigning new indexes for new words and would change the dimensions of the representation. The change may lead to some problems for existing NLP systems.


Recently, distributed word representation approaches have been proposed to address the problem of one-hot word representation. The distributional hypothesis [23, 30] that linguistic objects with similar distributions have similar meanings is the basis for distributed word representation learning. Based on the distributional hypothesis, various word representation models, such as CBOW and Skip-gram, have been proposed and applied in different areas. In the remaining part of this chapter, we start with one-hot word representation. Further, we introduce distributed word representation models, including Brown Cluster, Latent Semantic Analysis, Word2vec, and GloVe in detail. Then we introduce two typical evaluation tasks for word representation. Finally, we discuss various extensions of word representation models.

2.2 One-Hot Word Representation

In this section, we introduce one-hot word representation in detail. Given a fixed vocabulary $V = \{w_1, w_2, \ldots, w_{|V|}\}$, one very intuitive way to represent a word $w$ is to encode it with a $|V|$-dimensional vector $\mathbf{w}$, where each dimension of $\mathbf{w}$ is either 0 or 1. Only one dimension in $\mathbf{w}$ can be 1, while all the other dimensions are 0. Formally, each dimension of $\mathbf{w}$ can be represented as

\[
\mathbf{w}_i =
\begin{cases}
1 & \text{if } w = w_i, \\
0 & \text{otherwise.}
\end{cases} \tag{2.1}
\]

One-hot word representation, in essence, maps each word to an index of the vocabulary, which can be very efficient for storage and computation. However, it does not contain rich semantic and syntactic information of words. Therefore, one-hot representation cannot capture the relatedness among words. The difference between cat and dog is as much as the difference between cat and bed in one-hot word representation. Besides, one-hot word representation embeds each word into a $|V|$-dimensional vector, which can only work for a fixed vocabulary. Therefore, it is inflexible to deal with new words in a real-world scenario.
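As a minimal illustration with a hypothetical toy vocabulary, the sketch below builds one-hot vectors and shows that any two distinct words are equally (un)related under this representation.

```python
import numpy as np

vocab = ["cat", "dog", "bed", "run"]          # toy vocabulary
word2idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))                  # |V|-dimensional vector
    v[word2idx[word]] = 1.0                   # only one dimension is 1
    return v

# The dot product between any two different one-hot vectors is 0,
# so "cat" is as (un)related to "dog" as it is to "bed".
print(one_hot("cat") @ one_hot("dog"))   # 0.0
print(one_hot("cat") @ one_hot("bed"))   # 0.0
```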

2.3 Distributed Word Representation

Recently, distributed word representation approaches have been proposed to address the problem of one-hot word representation. The distributional hypothesis [23, 30] that linguistic objects with similar distributions have similar meanings is the basis for semantic word representation learning. Based on the distributional hypothesis, Brown Cluster [9] groups words into hierarchical clusters where words in the same cluster have similar meanings. The cluster label can roughly represent the similarity between words, but it cannot precisely compare words in the same group. To address this issue, distributed word representation aims to embed each word into a continuous real-valued vector. (We emphasize that distributed representation and distributional representation are two completely different aspects of representations: distributed representation indicates that the representation is a real-valued vector, while distributional representation indicates that the meaning of a word is learned under the distributional hypothesis. A word representation method may belong to both categories.) It is a dense representation, where "dense" means that one concept is represented by more than one dimension of the vector, and each dimension of the vector is involved in representing multiple concepts. Due to its continuous characteristic, distributed word representation can be easily applied in deep neural models for NLP tasks. Distributed word representation approaches such as Word2vec and GloVe usually learn word vectors from a large corpus based on the distributional hypothesis. In this section, we will introduce several distributed word representation approaches in detail.

2.3.1 Brown Cluster

Brown Cluster classifies words into several clusters that have similar semantic meanings. Specifically, Brown Cluster learns a binary tree from a large-scale corpus, in which the leaves of the tree indicate the words and the internal nodes of the tree indicate word hierarchical clusters. This is a hard clustering method since each word belongs to exactly one group.

The idea of Brown Cluster for clustering words comes from the n-gram language model. A language model evaluates the probability of a sentence. For example, the sentence have a nice day should have a higher probability than a random sequence of words. Using a k-gram language model, the probability of a sentence $s = \{w_1, w_2, w_3, \ldots, w_n\}$ can be represented as

\[
P(s) = \prod_{i=1}^{n} P(w_i \mid w_{i-k}^{i-1}). \tag{2.2}
\]

It is easy to estimate $P(w_i \mid w_{i-k}^{i-1})$ from a large corpus, but the model has $|V|^k - 1$ independent parameters, which is a huge number for computers in the 1990s. Even if $k$ is 2, the number of parameters is considerable. Moreover, the estimation is inaccurate for rare words. To address these problems, [9] proposes to group words into clusters and train a cluster-level n-gram language model rather than a word-level model. By assigning a cluster to each word, the probability can be written as

\[
P(s) = \prod_{i=1}^{n} P(c_i \mid c_{i-k}^{i-1}) P(w_i \mid c_i), \tag{2.3}
\]

where $c_i$ is the corresponding cluster of $w_i$. In the cluster-level language model, there are only $|C|^k - 1 + |V| - |C|$ independent parameters, where $C$ is the cluster set, which is usually much smaller than the vocabulary $V$.

The quality of the clusters affects the performance of the language model. Given a training text $s$, for a 2-gram language model, the quality of a mapping $\pi$ from words to clusters is defined as

\begin{align}
Q(\pi) &= \frac{1}{n} \log P(s) \tag{2.4} \\
&= \frac{1}{n} \sum_{i=1}^{n} \Big( \log P(c_i \mid c_{i-1}) + \log P(w_i \mid c_i) \Big). \tag{2.5}
\end{align}

Let $N_w$ be the number of times word $w$ appears in the corpus $s$, $N_{w_1 w_2}$ be the number of times the bigram $w_1 w_2$ appears, and $N_{\pi(w)}$ be the number of times the cluster $\pi(w)$ appears. Then the quality function $Q(\pi)$ can be rewritten in a statistical way as follows:

\begin{align}
Q(\pi) &= \frac{1}{n} \sum_{i=1}^{n} \Big( \log P(c_i \mid c_{i-1}) + \log P(w_i \mid c_i) \Big) \tag{2.6} \\
&= \sum_{w_1, w_2} \frac{N_{w_1 w_2}}{n} \log P(\pi(w_2) \mid \pi(w_1))\, P(w_2 \mid \pi(w_2)) \tag{2.7} \\
&= \sum_{w_1, w_2} \frac{N_{w_1 w_2}}{n} \log \frac{N_{\pi(w_1)\pi(w_2)}}{N_{\pi(w_1)}} \frac{N_{w_2}}{N_{\pi(w_2)}} \tag{2.8} \\
&= \sum_{w_1, w_2} \frac{N_{w_1 w_2}}{n} \log \frac{N_{\pi(w_1)\pi(w_2)}\, n}{N_{\pi(w_1)} N_{\pi(w_2)}} + \sum_{w_1, w_2} \frac{N_{w_1 w_2}}{n} \log \frac{N_{w_2}}{n} \tag{2.9} \\
&= \sum_{c_1, c_2} \frac{N_{c_1 c_2}}{n} \log \frac{N_{c_1 c_2}\, n}{N_{c_1} N_{c_2}} + \sum_{w_2} \frac{N_{w_2}}{n} \log \frac{N_{w_2}}{n}. \tag{2.10}
\end{align}

Since $P(w) = \frac{N_w}{n}$, $P(c) = \frac{N_c}{n}$, and $P(c_1 c_2) = \frac{N_{c_1 c_2}}{n}$, the quality function can be rewritten as

\begin{align}
Q(\pi) &= \sum_{c_1, c_2} P(c_1 c_2) \log \frac{P(c_1 c_2)}{P(c_1) P(c_2)} + \sum_{w} P(w) \log P(w) \tag{2.11} \\
&= I(C) - H(V), \tag{2.12}
\end{align}

where $I(C)$ is the mutual information between clusters and $H(V)$ is the entropy of the word distribution, which is a constant value. Therefore, optimizing $Q(\pi)$ is equivalent to optimizing the mutual information. There is no practical method to obtain optimum partitions. Nevertheless, Brown Cluster uses a greedy strategy to obtain a suboptimal result. Initially, it assigns a distinct class to each word. Then it merges the two classes whose merge leads to the least loss in average mutual information. After $|V| - |C|$ merges, the partition is generated. Keeping the $|C|$

Table 2.1 Some clusters of Brown Cluster

Cluster #1: Friday, Monday, Thursday, Wednesday, Tuesday, Saturday
Cluster #2: June, March, July, April, January, December
Cluster #3: Water, Gas, Coal, Liquid, Acid, Sand
Cluster #4: Great, Big, Vast, Sudden, Mere, Sheer
Cluster #5: Man, Woman, Boy, Girl, Lawyer, Doctor
Cluster #6: American, Indian, European, Japanese, German, African

clusters, we can continue to perform $|C| - 1$ merges to obtain a binary tree. With certain care in implementation, the complexity of this algorithm is $O(|V|^3)$. We show some clusters in Table 2.1. From the table, we can find that each cluster relates to a sense in natural language. The words in the same cluster tend to express similar meanings or can be used interchangeably.
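The quality function itself is easy to compute. The following rough sketch (toy corpus, an assumed cluster assignment, and simplified normalization; it does not implement the greedy merging procedure) evaluates Q(π) = I(C) − H(V) from corpus counts.

```python
import numpy as np
from collections import Counter

corpus = "the cat runs the dog runs the cat sleeps".split()
pi = {"the": 0, "cat": 1, "dog": 1, "runs": 2, "sleeps": 2}   # assumed word-to-cluster mapping

n = len(corpus)
bigrams = list(zip(corpus, corpus[1:]))

P_w = {w: c / n for w, c in Counter(corpus).items()}                    # P(w)
P_c = {c: v / n for c, v in Counter(pi[w] for w in corpus).items()}     # P(c)
P_cc = {cc: v / len(bigrams)                                            # P(c1 c2)
        for cc, v in Counter((pi[a], pi[b]) for a, b in bigrams).items()}

# Eqs. (2.11)-(2.12): Q(pi) = I(C) - H(V); H(V) does not depend on pi.
I_C = sum(p * np.log(p / (P_c[c1] * P_c[c2])) for (c1, c2), p in P_cc.items())
H_V = -sum(p * np.log(p) for p in P_w.values())
print("I(C) =", I_C, " H(V) =", H_V, " Q(pi) =", I_C - H_V)
```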

2.3.2 Latent Semantic Analysis

Latent Semantic Analysis (LSA) is a family of strategies derived from vector space models that can capture word semantics much better. LSA aims to explore latent factors for words and documents by matrix factorization to improve the estimation of word similarities. Reference [14] applies Singular Value Decomposition (SVD) on the word-document matrix and exploits uncorrelated factors for both words and documents. The SVD of the word-document matrix $M$ yields three matrices $E$, $\Sigma$, and $D$ such that

\[
M = E \Sigma D^\top, \tag{2.13}
\]

where $\Sigma$ is the diagonal matrix of singular values of $M$, each row vector $\mathbf{w}_i$ in matrix $E$ corresponds to word $w_i$, and each row vector $\mathbf{d}_i$ in matrix $D$ corresponds to document $d_i$. Then the similarity between two words can be computed as

\[
\text{sim}(w_i, w_j) = M_{i,:} M_{j,:}^\top = \mathbf{w}_i \Sigma^2 \mathbf{w}_j^\top. \tag{2.14}
\]

Here, the number of singular values $k$ kept in $\Sigma$ is a hyperparameter that needs to be tuned. With a reasonable number of the largest singular values retained, LSA can capture much useful information in the word-document matrix and provide a smoothing effect that prevents large variance. With a relatively small $k$, once the matrices $E$, $\Sigma$, and $D$ are computed, measuring word similarity can be very efficient because there are often fewer nonzero dimensions in the word vectors. However, the computation of $E$ and $D$ can be costly because a full SVD on an $n \times m$ matrix takes $O(\min\{m^2 n, m n^2\})$ time, while the parallelization of SVD is not trivial.
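A minimal sketch of SVD-based LSA on a tiny made-up word-document count matrix, keeping only the k largest singular values and measuring word similarity in the truncated space (numpy only; the counts and k are illustrative):

```python
import numpy as np

# Rows = words, columns = documents (toy counts).
words = ["cat", "dog", "economy", "market"]
M = np.array([[2, 3, 0, 0],
              [1, 2, 0, 1],
              [0, 0, 3, 2],
              [0, 1, 2, 3]], dtype=float)

# SVD: M = E * diag(sigma) * D^T.
E, sigma, Dt = np.linalg.svd(M, full_matrices=False)

# Keep only the k largest singular values (the smoothing step of LSA).
k = 2
W = E[:, :k] * sigma[:k]              # low-dimensional, scaled word vectors

def sim(i, j):
    # Cosine similarity in the truncated latent space.
    return W[i] @ W[j] / (np.linalg.norm(W[i]) * np.linalg.norm(W[j]))

print(sim(words.index("cat"), words.index("dog")))      # expected: relatively high
print(sim(words.index("cat"), words.index("market")))   # expected: relatively low
```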

Another algorithm for LSA is Random Indexing [34, 55]. It overcomes the difficulty of SVD-based LSA by avoiding costly preprocessing of a huge word-document matrix. In random indexing, each document is assigned a randomly generated high-dimensional sparse ternary vector (called an index vector). Then, for each word in the document, the index vector is added to the word's vector. The index vectors are supposed to be orthogonal or nearly orthogonal. This algorithm is simple and scalable: it is easy to parallelize and can be implemented incrementally. Moreover, its performance is comparable with SVD-based LSA, according to [55].
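The random indexing idea can be sketched in a few lines; the dimensionality and sparsity used below are assumed values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, nnz = 512, 8          # assumed dimensionality and sparsity of index vectors

def index_vector():
    """Random sparse ternary vector: a few +1/-1 entries, the rest zeros."""
    v = np.zeros(d)
    pos = rng.choice(d, size=nnz, replace=False)
    v[pos] = rng.choice([-1.0, 1.0], size=nnz)
    return v

docs = [["cats", "chase", "mice"], ["dogs", "chase", "cats"], ["markets", "rise"]]
doc_index = [index_vector() for _ in docs]      # one index vector per document

word_vecs = {}
for doc, ivec in zip(docs, doc_index):
    for w in doc:
        # Each word accumulates the (nearly orthogonal) index vectors
        # of the documents it occurs in.
        word_vecs.setdefault(w, np.zeros(d))
        word_vecs[w] += ivec

cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
print(cos(word_vecs["cats"], word_vecs["chase"]))   # share documents: higher
print(cos(word_vecs["cats"], word_vecs["rise"]))    # no shared documents: near 0
```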

2.3.3 Word2vec

Google's word2vec toolkit (https://code.google.com/archive/p/word2vec/) was released in 2013. It can efficiently learn word vectors from a large corpus. The toolkit includes two models, Continuous Bag-of-Words (CBOW) and Skip-gram. Based on the assumption that the meaning of a word can be learned from its context, CBOW optimizes the embeddings so that they can predict a target word given its context words. Skip-gram, on the contrary, learns embeddings that can predict the context words given a target word. In this section, we will introduce these two models in detail.

2.3.3.1 Continuous Bag-of-Words

CBOW predicts the center word given a window of context words. Figure 2.1 shows the idea of CBOW with a window of 5 words. Formally, CBOW predicts $w_i$ according to its contexts as

\[
P\big(w_i \mid w_{j\,(|j-i| \le l,\ j \ne i)}\big) = \text{Softmax}\Big( \mathbf{M} \Big( \sum_{|j-i| \le l,\ j \ne i} \mathbf{w}_j \Big) \Big), \tag{2.15}
\]

where $P(w_i \mid w_{j\,(|j-i| \le l,\ j \ne i)})$ is the probability of word $w_i$ given its contexts, $l$ is the size of the training context, $\mathbf{M}$ is the weight matrix in $\mathbb{R}^{|V| \times m}$, $V$ indicates the vocabulary, and $m$ is the dimension of the word vectors. The CBOW model is optimized by minimizing the sum of negative log probabilities:

\[
\mathcal{L} = - \sum_i \log P\big(w_i \mid w_{j\,(|j-i| \le l,\ j \ne i)}\big). \tag{2.16}
\]

Here, the window size $l$ is a hyperparameter to be tuned. A larger window size may lead to higher accuracy, at the expense of more training time.
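A minimal numpy sketch of a single CBOW prediction step following Eqs. (2.15)–(2.16), with a toy vocabulary and randomly initialized matrices; a real implementation would train these matrices with stochastic gradient descent over a large corpus.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
V, m = len(vocab), 16                     # vocabulary size and embedding dimension
idx = {w: i for i, w in enumerate(vocab)}

E = rng.normal(scale=0.1, size=(V, m))    # input word embeddings w_j
M = rng.normal(scale=0.1, size=(V, m))    # output weight matrix in R^{|V| x m}

def cbow_loss(context, center):
    h = E[[idx[w] for w in context]].sum(axis=0)   # sum of context embeddings
    scores = M @ h                                 # one score per vocabulary word
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                           # softmax over |V| words, Eq. (2.15)
    return -np.log(probs[idx[center]])             # negative log likelihood, Eq. (2.16)

# Predict "sat" from a window of surrounding words.
print(cbow_loss(["the", "cat", "on", "mat"], "sat"))
```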


Fig. 2.1 The architecture of the CBOW model

2.3.3.2 Skip-Gram

In contrast to CBOW, Skip-gram predicts the context given the center word. Figure 2.2 shows the model. Formally, given a word $w_i$, Skip-gram predicts its context as

\[
P(w_j \mid w_i) = \text{Softmax}(\mathbf{M} \mathbf{w}_i), \quad |j - i| \le l,\ j \ne i, \tag{2.17}
\]

where $P(w_j \mid w_i)$ is the probability of context word $w_j$ given $w_i$, and $\mathbf{M}$ is the weight matrix. The loss function is defined similarly to CBOW as

\[
\mathcal{L} = - \sum_i \sum_{j\,(|j-i| \le l,\ j \ne i)} \log P(w_j \mid w_i). \tag{2.18}
\]

Fig. 2.2 The architecture of the Skip-gram model

2.3.3.3 Hierarchical Softmax and Negative Sampling

Training CBOW or Skip-gram directly is very time consuming. The most time-consuming part is the softmax layer: the conventional softmax layer needs to obtain the scores of all words even though only one word is used in computing the loss function. An intuitive idea to improve efficiency is to use a reasonable but much faster approximation of the softmax probability. Here, we introduce two typical approximation methods implemented in the toolkit: hierarchical softmax and negative sampling. We explain these two methods using CBOW as an example.

The idea of hierarchical softmax is to build hierarchical classes for all words and to estimate the probability of a word by estimating the conditional probabilities of its corresponding hierarchical classes. Figure 2.3 gives an example. Each internal node of the tree indicates a hierarchical class and has a feature vector, while each leaf node of the tree indicates a word. In this example, the probability of the word the is $p_0 \times p_{01}$, while the probability of cat is $p_0 \times p_{00} \times p_{001}$. The conditional probability at each node is computed from the feature vector of that node and the context vector. For example,

\[
p_0 = \frac{\exp(\mathbf{w}_0 \cdot \mathbf{w}_c)}{\exp(\mathbf{w}_0 \cdot \mathbf{w}_c) + \exp(\mathbf{w}_1 \cdot \mathbf{w}_c)}, \tag{2.19}
\]

\[
p_1 = 1 - p_0, \tag{2.20}
\]

where $\mathbf{w}_c$ is the context vector, and $\mathbf{w}_0$ and $\mathbf{w}_1$ are the feature vectors. Hierarchical softmax generates the hierarchical classes according to the word frequency, i.e., a Huffman tree. With this approximation, the probability of each word can be computed much faster, and the complexity of calculating the probability of each word is $O(\log |V|)$.

Negative sampling is more straightforward. To calculate the probability of a word, negative sampling directly samples $k$ words as negative samples according to the word frequency. Then, it computes a softmax over the $k + 1$ words to approximate the probability of the target word.
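The sketch below follows the description above (a softmax over the sampled k + 1 words, for a CBOW-style prediction) rather than the sigmoid-based formulation used in the original word2vec implementation; the vocabulary, frequencies, and smoothing exponent are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
V, m, k = len(vocab), 16, 2
idx = {w: i for i, w in enumerate(vocab)}
freq = np.array([5.0, 2.0, 1.0, 4.0, 1.0])          # toy word frequencies

E = rng.normal(scale=0.1, size=(V, m))               # context embeddings
M = rng.normal(scale=0.1, size=(V, m))               # output vectors

def negative_sampling_loss(context, center):
    h = E[[idx[w] for w in context]].sum(axis=0)
    # Sample k negative words according to (smoothed) word frequency.
    # For simplicity, the target word is not excluded from the negatives.
    p = freq ** 0.75
    p /= p.sum()
    negatives = rng.choice(V, size=k, replace=False, p=p)
    candidates = np.concatenate(([idx[center]], negatives))
    scores = M[candidates] @ h                       # scores for k + 1 words only
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                             # softmax over k + 1 words
    return -np.log(probs[0])                         # target word sits at position 0

print(negative_sampling_loss(["the", "cat", "on", "mat"], "sat"))
```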

Fig. 2.3 An illustration of hierarchical softmax

Table 2.2 Co-occurrence probabilities and the ratio of probabilities for target words ice and steam with context words solid, gas, water, and fashion

Probability and ratio     k = solid   k = gas    k = water   k = fashion
P(k|ice)                  1.9e-4      6.6e-5     3.0e-3      1.7e-5
P(k|steam)                2.2e-5      7.8e-4     2.2e-3      1.8e-5
P(k|ice)/P(k|steam)       8.9         8.5e-2     1.36        0.96

2.3.4 GloVe

Methods like Skip-gram and CBOW are shallow window-based methods. These methods scan a context window across the entire corpus, which fails to take advantage of global corpus statistics. Global Vectors for Word Representation (GloVe), on the contrary, can capture corpus statistics directly.

As shown in Table 2.2, the meaning of a word can be learned from the co-occurrence matrix. The ratio of co-occurrence probabilities can be especially useful. In the example, the meanings of ice and steam can be examined by studying the ratio of their co-occurrence probabilities with various probe words. For words related to ice but not steam, for example, solid, the ratio P(solid|ice)/P(solid|steam) will be large. Similarly, gas is related to steam but not ice, so P(gas|ice)/P(gas|steam) will be small. For words that are relevant or irrelevant to both words, the ratio is close to 1. Based on this idea, GloVe models

\[
F(\mathbf{w}_i, \mathbf{w}_j, \tilde{\mathbf{w}}_k) = \frac{P_{ik}}{P_{jk}}, \tag{2.21}
\]

where $\tilde{\mathbf{w}} \in \mathbb{R}^d$ are separate context word vectors, and $P_{ij}$ is the probability of word $j$ appearing in the context of word $i$; formally,

\[
P_{ij} = \frac{N_{ij}}{N_i}, \tag{2.22}
\]

where $N_{ij}$ is the number of occurrences of word $j$ in the context of word $i$, and $N_i = \sum_k N_{ik}$ is the number of times any word appears in the context of word $i$. $F(\cdot)$ is supposed to encode the information presented in the ratio $P_{ik}/P_{jk}$ in the word vector space. To keep the inherently linear structure, $F$ should only depend on the difference of the two target words:

\[
F(\mathbf{w}_i - \mathbf{w}_j, \tilde{\mathbf{w}}_k) = \frac{P_{ik}}{P_{jk}}. \tag{2.23}
\]

The arguments of $F$ are vectors, while the right side of the equation is a scalar. To avoid $F$ obfuscating the linear structure, a dot product is used:

\[
F\big( (\mathbf{w}_i - \mathbf{w}_j)^\top \tilde{\mathbf{w}}_k \big) = \frac{P_{ik}}{P_{jk}}. \tag{2.24}
\]

The model should be invariant under exchanging the roles of the target word and the context word, which requires $F$ to be a homomorphism between the groups $(\mathbb{R}, +)$ and $(\mathbb{R}_{>0}, \times)$. The solution is $F = \exp$. Then

\[
\mathbf{w}_i^\top \tilde{\mathbf{w}}_k = \log N_{ik} - \log N_i. \tag{2.25}
\]

To keep the exchange symmetry, $\log N_i$ is eliminated by adding biases $b_i$ and $\tilde{b}_k$. The model becomes

\[
\mathbf{w}_i^\top \tilde{\mathbf{w}}_k + b_i + \tilde{b}_k = \log N_{ik}, \tag{2.26}
\]

which is significantly simpler than Eq. (2.21). The loss function is defined as

\[
\mathcal{L} = \sum_{i,j=1}^{|V|} f(N_{ij}) \big( \mathbf{w}_i^\top \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log N_{ij} \big)^2, \tag{2.27}
\]

where $f(\cdot)$ is a weighting function:

\[
f(x) =
\begin{cases}
(x / x_{\max})^{\alpha} & \text{if } x < x_{\max}, \\
1 & \text{otherwise.}
\end{cases} \tag{2.28}
\]
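A compact sketch of one pass of gradient descent on the GloVe objective of Eq. (2.27), using a toy co-occurrence matrix; the learning rate and the hyperparameters of the weighting function in Eq. (2.28) are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
V, m = 5, 8
N = rng.integers(0, 20, size=(V, V)).astype(float)   # toy co-occurrence counts N_ij

W  = rng.normal(scale=0.1, size=(V, m))   # target word vectors w_i
Wc = rng.normal(scale=0.1, size=(V, m))   # context word vectors w~_j
b  = np.zeros(V)                          # biases b_i
bc = np.zeros(V)                          # context biases b~_j

def f(x, x_max=100.0, alpha=0.75):
    """GloVe weighting function, Eq. (2.28)."""
    return (x / x_max) ** alpha if x < x_max else 1.0

lr = 0.05
loss = 0.0
for i in range(V):
    for j in range(V):
        if N[i, j] == 0:                  # log N_ij undefined; skip zero entries
            continue
        err = W[i] @ Wc[j] + b[i] + bc[j] - np.log(N[i, j])
        weight = f(N[i, j])
        loss += weight * err ** 2
        # Gradient descent update for this (i, j) cell.
        grad = 2 * weight * err
        W[i], Wc[j] = W[i] - lr * grad * Wc[j], Wc[j] - lr * grad * W[i]
        b[i] -= lr * grad
        bc[j] -= lr * grad
print("weighted squared-error loss:", loss)
```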

2.4 Contextualized Word Representation

In natural language, the meaning of an individual word usually relates to its context in a sentence. For example,

• The central bank has slashed its forecast for economic growth this year from 4.1 to 2.6%.
• More recently, on a blazing summer day, he took me back to one of the den sites, in a slumping bank above the South Saskatchewan River.

In these two sentences, although the word bank is the same, its meanings are different. However, most of the traditional word embeddings (CBOW, Skip-gram, GloVe, etc.) cannot capture the different nuances of word meaning under different surrounding texts. The reason is that these models only learn a unique representation for each word, and therefore it is impossible for them to capture how the meanings of words change based on their surrounding contexts.

To address this issue, [48] proposes ELMo, which uses a deep, bidirectional LSTM model to build word representations. ELMo can represent each word depending on the entire context in which it is used. More specifically, rather than having a look-up table of a word embedding matrix, ELMo converts words into low-dimensional vectors on-the-fly by feeding the word and its surrounding text into a deep neural network. ELMo utilizes a bidirectional language model to conduct word representation. Formally, given a sequence of $N$ words, $(w_1, w_2, \ldots, w_N)$, a forward language model (LM, the details of language models are in Sect. 4) models the probability of the sequence by predicting the probability of each word $w_k$ according to the historical context:

\[
P(w_1, w_2, \ldots, w_N) = \prod_{k=1}^{N} P(w_k \mid w_1, w_2, \ldots, w_{k-1}). \tag{2.29}
\]

The forward LM in ELMo is a multilayer LSTM, and the $j$th layer of the LSTM-based forward LM will generate the context-dependent word representation $\overrightarrow{\mathbf{h}}_{k,j}^{LM}$ for the word $w_k$. The backward LM is similar to the forward LM. The only difference is that it reverses the input word sequence to $(w_N, w_{N-1}, \ldots, w_1)$ and predicts each word according to the future context:

\[
P(w_1, w_2, \ldots, w_N) = \prod_{k=1}^{N} P(w_k \mid w_{k+1}, w_{k+2}, \ldots, w_N). \tag{2.30}
\]

In the same way as the forward LM, the $j$th backward LM layer generates the representations $\overleftarrow{\mathbf{h}}_{k,j}^{LM}$ for the word $w_k$.

ELMo generates a task-specific word representation, which combines all layer representations of the bidirectional LM. Formally, it computes a task-specific weighting of all bidirectional LM layers:

\[
\mathbf{w}_k = \alpha^{task} \sum_{j=0}^{L} s_j^{task} \mathbf{h}_{k,j}^{LM}, \tag{2.31}
\]

where $s^{task}$ are softmax-normalized weights and $\alpha^{task}$ is the weight of the entire word vector for the task.
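A minimal sketch of the task-specific combination in Eq. (2.31): per-layer representations of one token (random placeholders standing in for the outputs of a pretrained bidirectional LM) are mixed with softmax-normalized weights and a global scale.

```python
import numpy as np

L, dim = 2, 6                                    # number of LM layers and hidden size
rng = np.random.default_rng(0)

# h[j] is the layer-j representation of one token w_k; layer 0 is usually the
# context-independent embedding. These are random placeholders for what a
# pretrained bidirectional LM would produce.
h = rng.normal(size=(L + 1, dim))

s_raw = np.zeros(L + 1)                          # learnable scalar per layer
alpha = 1.0                                      # learnable task-specific scale

s = np.exp(s_raw) / np.exp(s_raw).sum()          # softmax-normalized weights s_j^task
w_k = alpha * (s[:, None] * h).sum(axis=0)       # Eq. (2.31)
print(w_k)
```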

2.5 Extensions

Besides those very popular toolkits, such as word2vec and GloVe, various works focus on different aspects of word representation, contributing to numerous extensions. These extensions usually follow the directions below.

2.5.1 Word Representation Theories

With the success of word representation, researchers have begun to explore the theories of word representation. Some works attempt to give more theoretical analysis to prove the reasonability of existing tricks on word representation learning [39, 45], while other works try to discuss new learning methods [26, 61].

Reasonability. Word2vec and other similar tools are empirical methods of word representation learning. Many tricks are proposed in [43] to learn the representation of words from a large corpus efficiently, for example, negative sampling. Considering the effectiveness of these methods, a more theoretical analysis should be done to prove their reasonability. Reference [39] gives some theoretical analysis of these tricks. It formalizes the Skip-gram model with negative sampling as an implicit matrix factorization process. The Skip-gram model generates a word embedding matrix $E$ and a context matrix $C$. The size of the word embedding matrix $E$ is $|V| \times m$. Each row of the context matrix $C$ is a context word's $m$-dimensional vector. The training process of Skip-gram is an implicit factorization $M = E C^\top$, although $C$ is not explicitly considered in word2vec. This work further analyzes that the matrix $M$ is

\[
M_{ij} = \mathbf{w}_i \cdot \mathbf{c}_j = \text{PMI}(w_i, c_j) - \log k, \tag{2.32}
\]

where $k$ is the number of negative samples and $\text{PMI}(w, c)$ is the point-wise mutual information

\[
\text{PMI}(w, c) = \log \frac{P(w, c)}{P(w) P(c)}. \tag{2.33}
\]

The shifted PMI matrix can directly be used to compare the similarity of words. Another intuitive idea is to factorize the shifted PMI matrix directly. Reference [39] evaluates the performance of using the SVD matrix factorization method on the implicit matrix $M$. Matrix factorization achieves a significantly better objective value when the embedding size is smaller than 500 dimensions and the number of negative samples is 1. With more negative samples and higher embedding dimensions, Skip-gram with negative sampling gets a better objective value. This is because the number of zeros in $M$ increases, while SVD prefers to factorize a matrix with minimum values. With 1,000-dimensional embeddings and different numbers of negative samples in {1, 5, 15}, SVD achieves slightly better performance on word analogy and word similarity. In contrast, Skip-gram with negative sampling achieves 2% better performance on syntactic analogy.

Interpretability. Most existing distributional word representation methods generate a dense real-valued vector for each word. However, the word embeddings obtained by these models are hard to interpret. Reference [26] introduces nonnegative and sparse embeddings, where the models are interpretable and each dimension indicates a unique concept. This method factorizes the corpus statistics matrix $X \in \mathbb{R}^{|V| \times |D|}$ into a word embedding matrix $E \in \mathbb{R}^{|V| \times m}$ and a document statistics matrix $D \in \mathbb{R}^{m \times |D|}$. Its training objective is

\begin{align}
\arg\min_{E, D} \ & \sum_{i=1}^{|V|} \frac{1}{2} \| X_{i,:} - E_{i,:} D \|^2 + \lambda \| E_{i,:} \|_1, \nonumber \\
\text{s.t.} \ & D_{i,:} D_{i,:}^\top \le 1, \quad \forall\, 1 \le i \le m, \nonumber \\
& E_{i,j} \ge 0, \quad 1 \le i \le |V|,\ 1 \le j \le m. \tag{2.34}
\end{align}

By iteratively optimizing $E$ and $D$ via gradient descent, this model can learn nonnegative and sparse embeddings for words. Since the embeddings are sparse and nonnegative, the words with the highest scores in each dimension show high similarity, so each dimension can be viewed as a concept. To further improve the embeddings, this work also introduces phrase-level constraints into the loss function. With the new constraints, it can achieve both interpretability and compositionality.

2.5.2 Multi-prototype Word Representation

Using only one single vector to represent a word is problematic due to the polysemy of words. A single vector cannot represent multiple meanings of a word well because it may lead to semantic confusion among the different senses of this word.

The multi-prototype vector space model [51] is proposed to better represent different meanings of a word. In the multi-prototype vector space model, a mixture of von Mises-Fisher distributions (movMF) clustering method with first-order unigram contexts [5] is used to cluster the different meanings of a word. Formally, it assigns a different word representation $\mathbf{w}_i(x)$ to the same word $x$ in each different cluster $i$. When the multi-prototype embedding is used, the similarity between two words $x, y$ is computed straightforwardly. If the contexts of the words are not available, the similarity between two words is defined as

\[
\text{AvgSim}(x, y) = \frac{1}{K^2} \sum_{i=1}^{K} \sum_{j=1}^{K} s\big( \mathbf{w}_i(x), \mathbf{w}_j(y) \big), \tag{2.35}
\]

\[
\text{MaxSim}(x, y) = \max_{1 \le i, j \le K} s\big( \mathbf{w}_i(x), \mathbf{w}_j(y) \big), \tag{2.36}
\]

where $K$ is a hyperparameter indicating the number of clusters and $s(\cdot)$ is a similarity function of two vectors, such as cosine similarity. When contexts are available, the similarity can be computed more precisely as

\[
\text{AvgSimC}(x, y) = \frac{1}{K^2} \sum_{i=1}^{K} \sum_{j=1}^{K} s_{c,x,i} \, s_{c,y,j} \, s\big( \mathbf{w}_i(x), \mathbf{w}_j(y) \big), \tag{2.37}
\]

\[
\text{MaxSimC}(x, y) = s\big( \hat{\mathbf{w}}(x), \hat{\mathbf{w}}(y) \big), \tag{2.38}
\]

where $s_{c,x,i} = s(\mathbf{w}_i(c), \mathbf{w}_i(x))$ is the likelihood of context $c$ belonging to cluster $i$, and $\hat{\mathbf{w}}(x) = \mathbf{w}_{\arg\max_{1 \le i \le K} s_{c,x,i}}(x)$ is the maximum likelihood cluster for $x$ in context $c$. With multi-prototype embeddings, the accuracy on the word similarity task is significantly improved, but the performance is still sensitive to the number of clusters.

Although the multi-prototype embedding method can effectively cluster different meanings of a word via its contexts, the clustering is offline, and the number of clusters is fixed and needs to be predefined. It is difficult for such a model to select an appropriate number of meanings for different words, to adapt to new senses, new words, or new data, and to align the senses with prototypes. To address these problems, [12] proposes a unified model for word sense representation and word sense disambiguation. This model uses available knowledge bases such as WordNet [46] to determine the senses of a word. Each word and each sense has a single vector, and they are trained jointly. This model can learn representations of both words and senses, and two simple methods are proposed to do disambiguation using the word and sense vectors.
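A small numpy sketch of AvgSim and MaxSim (Eqs. (2.35)–(2.36)) over hypothetical prototype vectors, with cosine as the similarity function s.

```python
import numpy as np

rng = np.random.default_rng(0)
K, dim = 3, 8

# K prototype vectors per word, e.g., produced by clustering a word's contexts.
protos_x = rng.normal(size=(K, dim))
protos_y = rng.normal(size=(K, dim))

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

sims = np.array([[cos(wx, wy) for wy in protos_y] for wx in protos_x])

avg_sim = sims.mean()     # Eq. (2.35): average over all K x K prototype pairs
max_sim = sims.max()      # Eq. (2.36): best-matching pair of prototypes
print(avg_sim, max_sim)
```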

2.5.3 Multisource Word Representation

There is much information about words that can be leveraged to improve the quality of word representations. In the following, we introduce word representation learning methods that utilize such multisource information.

2.5.3.1 Word Representation with Internal Information

There is much information located inside words, which can be further utilized to improve the quality of word representations.

Using Character Information. Many languages such as Chinese and Japanese have thousands of characters, and the words in these languages are composed of several characters. Characters in these languages have richer semantic information compared with languages containing only dozens of characters. Hence, the meaning of a word can not only be learned from its contexts but also from the composition of its characters. Driven by this intuitive idea, [13] proposes a joint learning model for Character and Word Embeddings (CWE). In CWE, a word is a composition of a word embedding and its character embeddings. Formally,

\[
\mathbf{x} = \mathbf{w} + \frac{1}{|w|} \sum_{i} \mathbf{c}_i, \tag{2.39}
\]

where $\mathbf{x}$ is the representation of a word, which is the composition of a word vector $\mathbf{w}$ and several character vectors $\mathbf{c}_i$, and $|w|$ is the number of characters in the word. Note that this model can be integrated with various models such as Skip-gram, CBOW, and GloVe.

Further, position-based and cluster-based methods are proposed to address the issue that characters are highly ambiguous. In the position-based approach, each character is assigned three vectors corresponding to its appearance at the beginning, middle, and end of a word, respectively. Since the meaning of a character varies when it appears in different positions of a word, this method can significantly alleviate the ambiguity problem. However, characters that appear in the same position may also have different meanings. In the cluster-based method, a character is assigned K different vectors for its different meanings, and a word's context is used to determine which vector to use.

By introducing character embeddings, the representation of low-frequency words can be significantly improved. Besides, this method can deal with new words, for which other methods fail. Experiments show that the joint learning method can achieve better performance on both word similarity and word analogy tasks. By disambiguating characters using the position-based and cluster-based methods, it can further improve the performance.

Using Morphology Information. Many languages such as English have rich morphology information and plenty of rare words. Most word representation models assign a distinct vector to each word, ignoring the rich morphology information. This is a limitation because the affixes of a word can help infer its meaning, and the morphology information of a word is essential especially when facing rare contexts. To address this issue, [8] proposes to represent a word as a bag of morphological n-grams. This model substitutes word vectors in Skip-gram with the sum of morphological n-gram vectors. When creating the set of n-grams, they select all n-grams with a length greater than or equal to 3 and smaller than or equal to 6. To distinguish prefixes and suffixes from other affixes, they also add special characters to indicate the beginning and the end of a word. This model is efficient and straightforward, and achieves good performance on word similarity and word analogy tasks especially when the training set is small. Reference [41] further uses a bidirectional LSTM to generate word representations by composing morphological units. This model does not use a look-up table to assign a distinct vector to each word as independent word embedding methods do. Hence, it not only significantly reduces the number of parameters but also addresses some disadvantages of independent word embeddings. Moreover, the embeddings of words in this model can affect each other.
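As a sketch of the subword n-gram idea described above (character n-grams of length 3–6 with boundary markers; the embedding dimension and lazy table initialization are simplifying assumptions), a word vector is the sum of its n-gram vectors, so even unseen words obtain representations.

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams with '<' and '>' marking word boundaries."""
    marked = f"<{word}>"        # boundary markers distinguish prefixes and suffixes
    return {marked[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)}

dim = 16
rng = np.random.default_rng(0)
ngram_vecs = {}                 # n-gram embedding table, filled lazily here

def word_vector(word):
    grams = char_ngrams(word)
    for g in grams:
        ngram_vecs.setdefault(g, rng.normal(scale=0.1, size=dim))
    # The word is the sum of its n-gram vectors, so rare or unseen words
    # still get a vector from shared subword units.
    return sum(ngram_vecs[g] for g in grams)

print(sorted(char_ngrams("where")))
print(word_vector("whereabouts")[:4])    # works even for an unseen word
```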

2.5.3.2 Word Representation with External Knowledge

Besides the internal information of words, there is much external knowledge that can help us learn word representations.

Using Knowledge Bases. Some languages have rich internal information, and people have also annotated many knowledge bases, which can be used in word representation learning to constrain embeddings. Reference [62] introduces relation constraints into the CBOW model. With these constraints, the embeddings can not only predict their contexts, but also predict words with relations. The objective is to maximize the sum of the log probabilities of all relations as

\[
O = \frac{1}{N} \sum_{i=1}^{N} \sum_{w \in R_{w_i}} \log P(w \mid w_i), \tag{2.40}
\]

where $R_{w_i}$ indicates the set of words that have a relation with $w_i$. Then the joint objective is defined as

\[
O = \frac{1}{N} \sum_{i=1}^{N} \log P\big(w_i \mid w_{j\,(|j-i| \le l,\ j \ne i)}\big) + \frac{\beta}{N} \sum_{i=1}^{N} \sum_{w \in R_{w_i}} \log P(w \mid w_i). \tag{2.41}
\]

Another line of work treats knowledge injection as a post-processing step: starting from pretrained word embeddings, each word vector is iteratively retrofitted toward the vectors of its neighbors in the relation set $R$ of a knowledge base, using the update

\[
\hat{\mathbf{w}}_i = \frac{\sum_{j \mid (i,j) \in R} \beta_{ij} \hat{\mathbf{w}}_j + \alpha_i \mathbf{w}_i}{\sum_{j \mid (i,j) \in R} \beta_{ij} + \alpha_i}, \tag{2.43}
\]

where $\alpha_i$ is usually set to 1 and $\beta_{ij}$ is $\deg(i)^{-1}$ ($\deg(\cdot)$ is a node's degree in the knowledge graph). With knowledge bases such as the paraphrase database [27], WordNet [46], and FrameNet [3], this model can achieve consistent improvements on word similarity tasks. But it may also significantly reduce the performance on the analogy of syntactic relations. Since this module is a post-processing of word embeddings, it is compatible with various distributed representation models.

In addition to the aforementioned knowledge bases, there are also sememe-based knowledge bases, in which a sememe is defined as the minimum semantic unit of word meanings. HowNet [16] is one such knowledge base, which annotates each Chinese word with one or more relevant sememes. General knowledge injection methods cannot be directly applied to HowNet. As a result, [47] proposes a specific model to introduce HowNet into word representation learning. Based on the Skip-gram model, [47] introduces sense and sememe embeddings to represent the target word $w_i$. More specifically, this model leverages context words, which are represented with original word embeddings, as attention over the multiple senses of the target word $w_i$ to obtain its new embedding:

\[
\mathbf{w}_i = \sum_{k=1}^{|S^{(w_i)}|} \text{Att}\big( \mathbf{s}_k^{(w_i)} \big) \, \mathbf{s}_k^{(w_i)}, \tag{2.44}
\]

where $\mathbf{s}_k^{(w_i)}$ denotes the $k$th sense embedding of $w_i$ and $S^{(w_i)}$ is the sense set of $w_i$. The attention term is as follows:

\[
\text{Att}\big( \mathbf{s}_k^{(w_i)} \big) = \frac{\exp\big( \mathbf{w}_c \cdot \hat{\mathbf{s}}_k^{(w_i)} \big)}{\sum_{n=1}^{|S^{(w_i)}|} \exp\big( \mathbf{w}_c \cdot \hat{\mathbf{s}}_n^{(w_i)} \big)}, \tag{2.45}
\]

where $\hat{\mathbf{s}}_k^{(w_i)}$ stands for the average of the sememe embeddings $\mathbf{x}^{(s_k)}$ of sense $s_k$, i.e., $\hat{\mathbf{s}}_k^{(w_i)} = \text{Avg}(\mathbf{x}^{(s_k)})$, and $\mathbf{w}_c$ is the average of the context word embeddings, $\mathbf{w}_c = \text{Avg}(\mathbf{w}_j)$, $|j - i| \le l$, $j \ne i$. This model shows a substantial advance on both word similarity and analogy tasks. Moreover, the introduced sense embeddings can also be used in word sense disambiguation.

Considering Document Information. Word embedding methods like Skip-gram simply consider the context information within a window to learn word representations. However, the information in the whole document can also help word representation learning. Topical Word Embeddings (TWE) [42] introduces topic information generated by Latent Dirichlet Allocation (LDA) to help distinguish different meanings of a word. The model is defined to maximize the following average log probability:

\[
O = \frac{1}{N} \sum_{i=1}^{N} \sum_{-k \le c \le k,\ c \ne 0} \big( \log P(w_{i+c} \mid w_i) + \log P(w_{i+c} \mid z_i) \big), \tag{2.46}
\]

where $\mathbf{w}_i$ is the word embedding and $\mathbf{z}_i$ is the topic embedding of $w_i$. Each word $w_i$ is assigned a unique topic, and each topic has a topic embedding. The topical word embedding model shows advantages on contextual word similarity and document classification tasks.

However, TWE simply combines LDA with word embeddings and lacks statistical foundations, and the LDA topic model needs numerous documents to learn semantically coherent topics. Reference [40] further proposes the TopicVec model, which encodes words and topics in the same semantic space. TopicVec outperforms TWE and other word embedding methods on text classification datasets. It can learn coherent topics from only one document, which is not possible for other topic models.

2.5.3.3 Word Representation with Hierarchical Structure

Human knowledge is organized in a hierarchical structure. Recently, many works have also introduced the hierarchical structure of text into word representation learning.

Dependency-based Word Representation. Continuous word embeddings are combinations of semantic and syntactic information. However, most existing word representation models depend solely on linear contexts and show more semantic information than syntactic information. To make the embeddings reflect more syntactic information, the dependency-based word embedding [38] uses dependency-based contexts. The dependency-based embeddings are less topical and exhibit more functional similarity than the original Skip-gram embeddings. This model takes the information of the dependency tree into consideration when learning word representations. The contexts of a target word $w$ are the modifiers of this word, i.e., $(m_1, r_1), \ldots, (m_k, r_k)$, where $r_i$ is the type of the dependency relation between the head node and the modifier. When training, the model optimizes the probability of dependency-based contexts rather than neighboring contexts. This model gains some improvements on word similarity benchmarks compared with Skip-gram. Experiments also show that words with syntactic similarity are closer in the vector space.

Semantic Hierarchies. Because of the linear substructure of the vector space, it has been shown that word embeddings can make simple analogies. For example, the difference between Japan and Tokyo is similar to the difference between China and Beijing. But word embeddings have trouble identifying hypernym-hyponym relations, since these relationships are complicated and do not necessarily have a linear substructure. To address this issue, [25] tries to identify hypernym-hyponym relationships using word embeddings. The basic idea is to learn a linear projection rather than simply use the embedding offset to represent the relationship. The model optimizes the projection as

\[
\mathbf{M}^{*} = \arg\min_{\mathbf{M}} \frac{1}{N} \sum_{(i,j)} \| \mathbf{M} \mathbf{x}_i - \mathbf{y}_j \|^2, \tag{2.47}
\]

where $\mathbf{x}_i$ and $\mathbf{y}_j$ are hypernym and hyponym embeddings. To further increase the capability of the model, they propose to first cluster word pairs into several groups and learn a linear projection for each group. The linear projections can help identify various hypernym-hyponym relations.

2.5.4 Multilingual Word Representation

There are thousands of languages in the world. At the word level, how to represent words from different languages in a unified vector space is an interesting problem. The bilingual word embedding model [64] uses machine translation word alignments as constraining translational evidence and embeds the words of two languages into a single vector space. The basic idea is (1) to initialize each word according to its aligned words in the other language and (2) to constrain the distance between the two languages during training using translation pairs.

S Nts + 1 E − = E , (2.48) t init N + S s s=1 t where Es and Et−init are the trained embeddings of the source word and the initial embedding of the target word, respectively. Nts is the number of target words being aligned with source word. S is all the possible alignments of word t.SoNt + S normalizes the weights as a distribution. During the training, they jointly optimize the word embedding objective as well as the bilingual constraint. The constraint is defined as 2 Lcn→en =Een − Nen→cnEcn , (2.49) where Nen→cn is the normalized align counts. When given a lexicon of bilingual word pairs, [44] proposes a simple model that can learn bilingual word embeddings in a unified space. Based on the distributional geometric similarities of word vectors of two languages, this model learns a linear transformation matrix T that transforms the vector space of source language to that of the target language. The training loss is

\[
\mathcal{L} = \| \mathbf{T} \mathbf{E}_s - \mathbf{E}_t \|^2, \tag{2.50}
\]

where $\mathbf{E}_t$ is the word vector matrix of the aligned words in the target language. However, this model performs badly when the seed lexicon is small. To tackle this limitation, some works introduce the idea of bootstrapping into bilingual word representation learning. Let's take [63] for example. In this work, in addition to monolingual word embedding learning and bilingual word embedding alignment based on a seed lexicon, a new matching mechanism is introduced. The main idea of matching is to find the most probable matched source (target) word for each target (source) word and make their embeddings closer. Next, we explain the target-to-source matching process formally; the source-to-target side is similar. The target-to-source matching loss function is defined as

\[
\mathcal{L}_{T2S} = -\log P\big( C^{(T)} \mid \mathbf{E}^{(S)} \big) = -\log \sum_{\mathbf{m}} P\big( C^{(T)}, \mathbf{m} \mid \mathbf{E}^{(S)} \big), \tag{2.51}
\]

where $C^{(T)}$ denotes the target corpus and $\mathbf{m}$ is a latent variable specifying the matched source word for each target word. Under the independence assumption, we have

\[
P\big( C^{(T)}, \mathbf{m} \mid \mathbf{E}^{(S)} \big) = \prod_{w_i^{(T)} \in C^{(T)}} P\big( w_i^{(T)}, \mathbf{m} \mid \mathbf{E}^{(S)} \big) = \prod_{i=1}^{|V^{(T)}|} P\big( w_i^{(T)} \mid w_{m_i}^{(S)} \big)^{N_{w_i^{(T)}}}, \tag{2.52}
\]

where $N_{w_i^{(T)}}$ is the number of occurrences of $w_i^{(T)}$ in the target corpus. By training with the Viterbi EM algorithm, this method can improve bilingual word embeddings on its own and address the limitation of a small seed lexicon.
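A minimal sketch of learning the linear transformation of Eq. (2.50) by ordinary least squares from a seed lexicon, using a row-vector convention and synthetic stand-ins for the two monolingual embedding matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs, dim = 200, 50

# Rows are embeddings of translation pairs from the seed lexicon:
# E_s[i] is a source-language word vector, E_t[i] its translation's vector.
E_s = rng.normal(size=(n_pairs, dim))
T_true = rng.normal(size=(dim, dim))
E_t = E_s @ T_true + 0.01 * rng.normal(size=(n_pairs, dim))   # synthetic target side

# Minimize ||E_s T - E_t||^2 over T: a standard least-squares problem.
T, *_ = np.linalg.lstsq(E_s, E_t, rcond=None)

# Map the source vectors into the target space and check the fit.
print(np.linalg.norm(E_s @ T - E_t) / np.linalg.norm(E_t))    # small residual
```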

2.5.5 Task-Specific Word Representation

In recent years, word representation learning has achieved great success and has played a crucial role in NLP tasks. However, general-purpose word representations are still limited for specific tasks, so researchers have begun to explore task-specific word representation learning. In this section, we take sentiment analysis as an example.

Word Representation for Sentiment Analysis. Most word representation methods capture syntactic and semantic information while ignoring the sentiment of text. This is problematic because words with similar syntactic behavior but opposite sentiment polarity obtain close word vectors. Reference [58] proposes to learn Sentiment-Specific Word Embeddings (SSWE) by integrating sentiment information. An intuitive idea is to jointly optimize a sentiment classification model that uses the word embeddings as its features, and SSWE minimizes the cross-entropy loss to achieve this goal. To better combine the unsupervised word embedding method and the supervised discriminative model, they further use the words in a window rather than a whole sentence to classify sentiment polarity. They propose the following ranking-based loss:

\[
\mathcal{L}_r(t) = \max\big( 0,\ 1 - \mathbb{1}_s(t)\, f_0^r(t) + \mathbb{1}_s(t)\, f_1^r(t) \big), \tag{2.53}
\]

where $f_0^r$ and $f_1^r$ are the predicted positive and negative scores, and $\mathbb{1}_s(t)$ is an indicator function:

\[
\mathbb{1}_s(t) =
\begin{cases}
1 & \text{if } t \text{ is positive,} \\
-1 & \text{if } t \text{ is negative.}
\end{cases} \tag{2.54}
\]

This loss function only punishes the model when it gives an incorrect result. To get massive training data, they use distant supervision to generate sentiment labels for documents. The increase of labeled data improves the sentiment information in the word embeddings. On sentiment classification tasks, sentiment embeddings outperform other strong baselines, including SVM and other word embedding methods. SSWE also shows strong polarity consistency: the closest words of a word are more likely to have the same sentiment polarity compared with existing word representation models. This sentiment-specific word embedding method provides a general recipe for learning task-specific word embeddings: design a joint loss function and generate massive labeled data automatically.
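A tiny sketch of the ranking loss of Eqs. (2.53)–(2.54), with placeholder numbers standing in for the model's predicted positive and negative scores f0 and f1.

```python
def ranking_loss(f0, f1, is_positive):
    """Hinge-style loss of Eq. (2.53): zero once the correct polarity wins by margin 1."""
    s = 1 if is_positive else -1          # indicator 1_s(t), Eq. (2.54)
    return max(0.0, 1.0 - s * f0 + s * f1)

# A positive window scored correctly (f0 exceeds f1 by a large margin): no penalty.
print(ranking_loss(f0=2.0, f1=-1.0, is_positive=True))    # 0.0
# A positive window scored incorrectly: the model is penalized.
print(ranking_loss(f0=-0.5, f1=1.0, is_positive=True))    # 2.5
```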

2.5.6 Time-Specific Word Representation

The meaning of a word changes over time. Analyzing the changing meaning of a word is an exciting topic in both linguistic and NLP research. With the rise of word embedding methods, some works [29, 35] use embeddings to analyze the change of words' meanings. They separate the corpus into bins by year, train time-specific word embeddings, and compare embeddings of different time periods to analyze the change of word semantics. This method is intuitive but has some problems. Dividing the corpus into bins causes a data sparsity issue. The objective of word embedding methods is nonconvex, so different random initializations lead to different results, which makes comparing word embeddings difficult. Embeddings of a word in different years are in different semantic spaces and cannot be compared directly. Most works indirectly compare the meanings of a word at different times through the changes of the word's closest words in the semantic space.

To address these issues, [4] proposes a dynamic Skip-gram model which connects several Bayesian Skip-gram models [6] using Kalman filters [33]. In this model, the embeddings of words in different periods can affect each other. For example, a word that appears in documents from the 1990s can affect the embeddings of that word in the 1980s and 2000s. Moreover, it trains the embeddings of different periods on the whole corpus to reduce the sparsity issue. This model also puts all the embeddings into the same semantic space, which is a significant improvement over other methods and makes word embeddings in different periods comparable. Therefore, the change of word embeddings in this model is continuous and smooth. Experimental results show that the cosine distance between two words changes much more smoothly in this model than in those models which simply divide the corpus into bins.

2.6 Evaluation

In recent years, various methods to embed words into a vector space have been proposed. Hence, it is essential to evaluate different methods. There are two general evaluations of word embeddings: word similarity and word analogy. They both aim to check whether the distribution of word vectors is reasonable. These two evaluations sometimes give different results. For example, CBOW achieves better performance on word similarity, whereas Skip-gram outperforms CBOW on word analogy. Therefore, which method to choose depends on the high-level application. Task-specific word embedding methods are usually designed for specific high-level tasks and achieve significant improvements on these tasks compared with baselines such as CBOW and Skip-gram. However, they only marginally outperform the baselines on the two general evaluations.

2.6.1 Word Similarity/Relatedness

The dynamics of words are very complex and subtle. There is no static, finite set of relations that can describe all interactions between two words. It is also not trivial for downstream tasks to leverage different kinds of word relations. A more practical way is to assign a score to a pair of words to represent to what extent they are related. This measurement is called word similarity. When talking about the term word similarity, the precise meaning may vary a lot in different situations. There are several kinds of similarity that may be referred to in various literature.

Morphological similarity. Many languages, including English, define morphology. The same lemma can have multiple surface forms according to its syntactic function. For example, the word active is an adjective and activeness is its noun version. The word activate is a verb and activation is its noun version. Morphology is an important dimension when considering the meaning and usage of words. It defines some relations between words from a syntactic view. Some such relations are used in the Syntactic Word Relationship test set [43], including adjectives to adverbs, past tense, and so on. However, in many higher level applications and tasks, the words are often morphologically normalized to the base form (this process is also known as lemmatization). One widely used technique is the Porter stemming algorithm [49]. This algorithm converts active, activeness, activate, and activation to the same root form activ. By removing morphological features, the semantic meaning of words is more emphasized.

Semantic similarity. Two words are semantically similar if they can express the same concept, or sense, like article and document. One word may have different senses, and each of its usages is associated with one or more of its senses. WordNet [46] is a lexical database that organizes words into groups according to their senses. Each group of words is called a synset, which contains all synonymous words sharing the same specific sense. The words within the same synset are considered semantically similar. Words from two synsets that are linked by some certain relation (such as hyponym) are also considered semantically similar to some degree, like bank(river) and bank, where bank(river) is the hyponym of bank.

Semantic relatedness. Most modern literature that considers word similarity refers to the semantic relatedness of words. Semantic relatedness is more general than semantic similarity. Words that are not semantically similar could still be related in many ways, such as meronymy (car and wheel) or antonymy (hot and cold). Semantic relatedness often yields co-occurrence, but they are not equivalent. The syntactic structure could also yield co-occurrence. Reference [10] argues that distributional similarity is not an adequate proxy for semantic relatedness.

Table 2.3 Datasets for evaluating word similarity/relatedness

Dataset                 Similarity type
RG-65 [52]              Word similarity
WordSim-353 [22]        Mixed
WordSim-353 REL [1]     Word relatedness
WordSim-353 SIM [1]     Word similarity
MTurk-287 [50]          Word relatedness
SimLex-999 [31]         Word similarity

To evaluate a word representation system intrinsically, the most popular approach is to collect a set of word pairs and compute the correlation between human judgment and system output. So far, many datasets have been collected and made public. Some datasets focus on word similarity, such as RG-65 [52] and SimLex-999 [31]. Other datasets concern word relatedness, such as MTurk [50]. WordSim-353 [22] is a very popular dataset for word representation evaluation, but its annotation guideline does not differentiate similarity and relatedness very clearly. Reference [1] conducts another round of annotation based on WordSim-353 and generates two subsets, one for similarity and the other for relatedness. Some information about these datasets is summarized in Table 2.3.

To evaluate the similarity of two distributed word vectors, researchers usually select cosine similarity as the evaluation metric. The cosine similarity of word $w$ and word $v$ is defined as

\[
\text{sim}(\mathbf{w}, \mathbf{v}) = \frac{\mathbf{w} \cdot \mathbf{v}}{\|\mathbf{w}\| \|\mathbf{v}\|}. \tag{2.55}
\]

When evaluating a word representation approach, the similarity of each word pair is computed in advance using cosine similarity. After that, Spearman's correlation coefficient ρ is used to evaluate the agreement between the human annotations and the word representation model as

\[
\rho = 1 - \frac{6 \sum_i d_i^2}{n^3 - n}, \tag{2.56}
\]

where $d_i$ is the difference between the ranks assigned to the $i$th word pair by the human annotators and by the model, $n$ is the number of word pairs, and a higher Spearman's correlation coefficient indicates that they are more similar.

Reference [10] describes a series of methods based on WordNet to evaluate the similarity of a pair of words. After the comparison between the traditional WordNet-based methods and distributed word representations, [1] addresses that relatedness and similarity are two different concerns. They point out that WordNet-based methods perform better on similarity than on relatedness, while distributed word representations show similar performance on both. A series of distributed word representations are compared on a wide variety of datasets in [56]. The state-of-the-art on both similarity and relatedness is achieved by distributed representation, without a doubt.
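A minimal evaluation sketch: compute cosine similarities for a few word pairs under a random, purely illustrative embedding table and correlate them with made-up human ratings via Spearman's ρ, using scipy's spearmanr.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in ["cat", "dog", "car", "bus", "apple", "fruit"]}

# (word1, word2, human similarity rating) -- illustrative numbers, not a real dataset.
pairs = [("cat", "dog", 7.8), ("car", "bus", 7.0), ("apple", "fruit", 6.5),
         ("cat", "car", 1.5), ("dog", "apple", 1.0)]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

model_scores = [cosine(emb[w1], emb[w2]) for w1, w2, _ in pairs]
human_scores = [h for _, _, h in pairs]

rho, _ = spearmanr(model_scores, human_scores)
print("Spearman's rho:", rho)   # random embeddings give a low, noisy correlation
```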

This evaluation method is simple and straightforward. However, as stated in [20], there are several problems with this evaluation. Since the datasets are small (less than 1,000 word pairs in each dataset), one system may yield many different scores on different partitions. Testing on the whole dataset makes it easier to overfit and hard to compute statistical significance. Moreover, the performance of a system on these datasets may not be very correlated with its performance on downstream tasks. The word similarity measurement can also come in an alternative format, the TOEFL synonyms test. In this test, a cue word is given, and the test-taker is required to choose the synonym of the cue word from four candidate words. The exciting part of this task is that the performance of a system could be compared with that of human beings. Reference [37] evaluates LSA with the TOEFL synonyms test to assess its ability to acquire and represent knowledge. The reported score is 64.4%, which is very close to the average score of the human test-takers. On this test set with 80 queries, [54] reports a score of 72.0%. Reference [24] extends the original dataset with the help of WordNet and generates a new dataset3 (named the WordNet-based synonymy test) containing thousands of queries.

3 http://www.cs.cmu.edu/~dayne/wbst-nanews.tar.gz.

2.6.2 Word Analogy

Besides word similarity, the word analogy task is an alternative way to measure how well representations capture the semantic meanings of words. This task gives three words w1, w2, and w3, and then requires the system to predict a word w4 such that the relation between w1 and w2 is the same as that between w3 and w4. This task has been used since [43, 45] to exploit the structural relationships among words. Here, the word relations could be divided into two categories: semantic relations and syntactic relations. This is a relatively novel method for word representation evaluation, but it has quickly become a standard evaluation metric since the release of the dataset. Unlike the TOEFL synonyms test, most words in this dataset are frequent across all kinds of corpora, but the fourth word is chosen from the whole vocabulary instead of from four options. This test favors distributed word representations because it emphasizes the structure of the word space. Comparisons between different models on the word analogy task measured by accuracy could be found in [7, 56, 57, 61].
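The analogy task is typically solved with the vector offset method of [43, 45]: the following is a minimal sketch under that assumption, using a random placeholder embedding matrix rather than trained vectors.

```python
# A minimal sketch of the word analogy evaluation with the vector offset
# method: predict w4 such that w1 : w2 = w3 : w4, by searching for the word
# closest to (w2 - w1 + w3) in cosine similarity, excluding the query words.
# The embedding matrix below is a random placeholder, not trained vectors.
import numpy as np

vocab = ["king", "queen", "man", "woman", "paris", "france"]
word2id = {w: i for i, w in enumerate(vocab)}
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), 50))          # rows are word vectors
E /= np.linalg.norm(E, axis=1, keepdims=True)  # normalize for cosine similarity

def analogy(w1, w2, w3):
    target = E[word2id[w2]] - E[word2id[w1]] + E[word2id[w3]]
    target /= np.linalg.norm(target)
    scores = E @ target                        # cosine similarity to all words
    for w in (w1, w2, w3):                     # exclude the three query words
        scores[word2id[w]] = -np.inf
    return vocab[int(np.argmax(scores))]

print(analogy("man", "king", "woman"))  # with trained vectors: ideally "queen"
```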

2.7 Summary

In this chapter, we first introduce word representation methods, including one-hot representation and various distributed representation methods. These classical methods are the important foundation of various NLP models, and meanwhile present the major concepts and mechanisms of word representation learning to the reader. Next, considering that classical word representation methods often suffer from word polysemy, we further introduce the effective contextualized word representation method ELMo, to show how to capture complex word features across different linguistic contexts. As word representation methods are widely utilized in various downstream tasks, we then overview numerous extensions toward some representative directions and discuss how to adapt word representations for specific scenarios. Finally, we introduce several evaluation tasks of word representation, including word similarity and word analogy, which are the basic experimental settings for researching word representation methods.

In the past decade, learning methods and applications of word representation have been studied in depth. Here we recommend some surveys and books on word representation learning for further reading:

• Erk. Vector Space Models of Word Meaning and Phrase Meaning: A Survey [18].
• Lai et al. How to Generate a Good Word Embedding [36].
• Camacho-Collados et al. From Word to Sense Embeddings: A Survey on Vector Representations of Meaning [11].
• Ruder et al. A Survey of Cross-lingual Word Embedding Models [53].
• Bakarov. A Survey of Word Embeddings Evaluation Methods [2].

In the future, toward more effective word representation learning, some directions require further efforts:

(1) Utilizing More Knowledge. Current word representation learning models focus on representing words based on plain textual corpora. In fact, besides the rich semantic information in text, there are also various kinds of word-related information hidden in heterogeneous knowledge in the real world, such as visual knowledge, factual knowledge, and commonsense knowledge. Some preliminary explorations have attempted to utilize heterogeneous knowledge for learning better word representations [59, 60], and these explorations indicate that utilizing more knowledge is a promising direction toward enhancing word representations. There remain open problems for further exploration.

(2) Considering More Contexts. As shown in this chapter, word representation learning methods that consider contexts can achieve more expressive word embeddings, which grasp richer semantic information than classical distributed methods and further benefit downstream NLP tasks. Context-aware word representations have been systematically verified for their effectiveness in existing works [32, 48], and adopting context-aware word representations has also become a necessary and mainstream operation for various NLP tasks. After BERT [15] was proposed, language models pretrained on large-scale corpora attracted wide attention, and their fine-tuned models have achieved state-of-the-art performance on specific NLP tasks. These new explorations based on large-scale textual corpora and pretrained fine-tuning language representation architectures indicate a promising direction to consider more contexts with more powerful representation architectures, and we will discuss them more in the next chapter.

(3) Orienting Finer Granularity. Polysemy is a widespread phenomenon for words. Hence, it is essential and meaningful to consider semantic information at a finer granularity than the words themselves. As some linguistic knowledge bases have been developed, such as the synonym-based knowledge base WordNet [21] and the sememe-based knowledge base HowNet [17], we thus have ways to study the atomic semantics of words. Current work on word representation learning is coarse-grained: it mainly focuses on the shallow semantics of the words themselves in text and ignores the rich semantic information inside the words, which is also an important resource for achieving better word embeddings. Reference [28] explores injecting finer-grained atomic semantics of words into word representations and achieves much better language understanding. Although these explorations are still preliminary, orienting word representations toward finer granularity is important. In the next chapter, we will also introduce more details on this part.

In the past decade, learning methods and applications of distributed representation have been studied in depth. Because of its efficiency and effectiveness, lots of task-specific models have been proposed for various tasks. Word representation learning has become a popular and important topic in NLP. However, word representation learning is still challenging due to its ambiguity, data sparsity, and interpretability. In recent years, word representation learning has no longer been studied in isolation, but explored together with sentence or document representation learning using pretrained language models. Readers are recommended to refer to the following chapters to further learn about the integration of word representations in other scenarios.

References

1. Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Paşca, and Aitor Soroa. A study on similarity and relatedness using distributional and WordNet-based approaches. In Proceedings of HLT-NAACL, 2009.
2. Amir Bakarov. A survey of word embeddings evaluation methods. arXiv preprint arXiv:1801.09536, 2018.
3. Collin F Baker, Charles J Fillmore, and John B Lowe. The Berkeley FrameNet project. In Proceedings of ACL, 1998.
4. Robert Bamler and Stephan Mandt. Dynamic word embeddings via skip-gram filtering. arXiv preprint arXiv:1702.08359, 2017.
5. Arindam Banerjee, Inderjit S Dhillon, Joydeep Ghosh, and Suvrit Sra. Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research, 2005.
6. Oren Barkan. Bayesian neural word embedding. In Proceedings of AAAI, 2017.
7. Marco Baroni, Georgiana Dinu, and Germán Kruszewski. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of ACL, 2014.
8. Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017.
9. Peter F Brown, Peter V Desouza, Robert L Mercer, Vincent J Della Pietra, and Jenifer C Lai. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, 1992.

10. Alexander Budanitsky and Graeme Hirst. Evaluating wordnet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1):13–47, 2006.
11. Jose Camacho-Collados and Mohammad Taher Pilehvar. From word to sense embeddings: A survey on vector representations of meaning. Journal of Artificial Intelligence Research, 63:743–788, 2018.
12. Xinxiong Chen, Zhiyuan Liu, and Maosong Sun. A unified model for word sense representation and disambiguation. In Proceedings of EMNLP, 2014.
13. Xinxiong Chen, Lei Xu, Zhiyuan Liu, Maosong Sun, and Huanbo Luan. Joint learning of character and word embeddings. In Proceedings of IJCAI, 2015.
14. Scott C. Deerwester, Susan T Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.
15. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL, 2019.
16. Zhendong Dong and Qiang Dong. HowNet - a hybrid language and knowledge resource. In Proceedings of NLP-KE, 2003.
17. Zhendong Dong and Qiang Dong. HowNet and the Computation of Meaning (With CD-Rom). World Scientific, 2006.
18. Katrin Erk. Vector space models of word meaning and phrase meaning: A survey. Language and Linguistics Compass, 6(10):635–653, 2012.
19. Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar, Chris Dyer, Eduard Hovy, and Noah A Smith. Retrofitting word vectors to semantic lexicons. In Proceedings of NAACL-HLT, 2015.
20. Manaal Faruqui, Yulia Tsvetkov, Pushpendre Rastogi, and Chris Dyer. Problems with evaluation of word embeddings using word similarity tasks. In Proceedings of RepEval, 2016.
21. Christiane Fellbaum. Wordnet. The Encyclopedia of Applied Linguistics, 2012.
22. Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. Placing search in context: The concept revisited. In Proceedings of WWW, 2001.
23. John R Firth. A synopsis of linguistic theory, 1930–1955. 1957.
24. Dayne Freitag, Matthias Blume, John Byrnes, Edmond Chow, Sadik Kapadia, Richard Rohwer, and Zhiqiang Wang. New experiments in distributional representations of synonymy. In Proceedings of CoNLL, 2005.
25. Ruiji Fu, Jiang Guo, Bing Qin, Wanxiang Che, Haifeng Wang, and Ting Liu. Learning semantic hierarchies via word embeddings. In Proceedings of ACL, 2014.
26. Alona Fyshe, Leila Wehbe, Partha Pratim Talukdar, Brian Murphy, and Tom M Mitchell. A compositional and interpretable semantic space. In Proceedings of HLT-NAACL, 2015.
27. Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. PPDB: The paraphrase database. In Proceedings of HLT-NAACL, 2013.
28. Yihong Gu, Jun Yan, Hao Zhu, Zhiyuan Liu, Ruobing Xie, Maosong Sun, Fen Lin, and Leyu Lin. Language modeling with sparse product of sememe experts. In Proceedings of EMNLP, pages 4642–4651, 2018.
29. William L Hamilton, Jure Leskovec, and Dan Jurafsky. Diachronic word embeddings reveal statistical laws of semantic change. In Proceedings of ACL, 2016.
30. Zellig S Harris. Distributional structure. Word, 10(2–3):146–162, 1954.
31. Felix Hill, Roi Reichart, and Anna Korhonen. Simlex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 2015.
32. Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. In Proceedings of ACL, pages 328–339, 2018.
33. Rudolph Emil Kalman et al. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45, 1960.
34. Pentti Kanerva, Jan Kristofersson, and Anders Holst. Random indexing of text samples for latent semantic analysis. In Proceedings of CogSci, 2000.
35. Yoon Kim, Yi-I Chiu, Kentaro Hanaki, Darshan Hegde, and Slav Petrov. Temporal analysis of language through neural language models. In Proceedings of the ACL Workshop, 2014.

36. Siwei Lai, Kang Liu, Shizhu He, and Jun Zhao. How to generate a good word embedding. IEEE Intelligent Systems, 31(6):5–14, 2016.
37. Thomas K Landauer and Susan T Dumais. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211, 1997.
38. Omer Levy and Yoav Goldberg. Dependency-based word embeddings. In Proceedings of ACL, 2014.
39. Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In Proceedings of NeurIPS, 2014.
40. Shaohua Li, Tat-Seng Chua, Jun Zhu, and Chunyan Miao. Generative topic embedding: a continuous representation of documents. In Proceedings of ACL, 2016.
41. Wang Ling, Chris Dyer, Alan W Black, Isabel Trancoso, Ramón Fermandez, Silvio Amir, Luis Marujo, and Tiago Luís. Finding function in form: Compositional character models for open vocabulary word representation. In Proceedings of EMNLP, 2015.
42. Yang Liu, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. Topical word embeddings. In Proceedings of AAAI, 2015.
43. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In Proceedings of ICLR, 2013.
44. Tomas Mikolov, Quoc V Le, and Ilya Sutskever. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168, 2013.
45. Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of HLT-NAACL, 2013.
46. George A Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.
47. Yilin Niu, Ruobing Xie, Zhiyuan Liu, and Maosong Sun. Improved word representation learning with sememes. In Proceedings of ACL, 2017.
48. Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of NAACL-HLT, pages 2227–2237, 2018.
49. Martin F Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.
50. Kira Radinsky, Eugene Agichtein, Evgeniy Gabrilovich, and Shaul Markovitch. A word at a time: computing word relatedness using temporal semantic analysis. In Proceedings of WWW, 2011.
51. Joseph Reisinger and Raymond J Mooney. Multi-prototype vector-space models of word meaning. In Proceedings of HLT-NAACL, 2010.
52. Herbert Rubenstein and John B Goodenough. Contextual correlates of synonymy. Communications of the ACM, 8(10):627–633, 1965.
53. Sebastian Ruder, Ivan Vulić, and Anders Søgaard. A survey of cross-lingual word embedding models. Journal of Artificial Intelligence Research, 65:569–631, 2019.
54. Magnus Sahlgren. Vector-based semantic analysis: Representing word meanings based on random labels. In Proceedings of Workshop on SKAC, 2001.
55. Magnus Sahlgren. An introduction to random indexing. In Proceedings of TKE, 2005.
56. Tobias Schnabel, Igor Labutov, David Mimno, and Thorsten Joachims. Evaluation methods for unsupervised word embeddings. In Proceedings of EMNLP, 2015.
57. Fei Sun, Jiafeng Guo, Yanyan Lan, Jun Xu, and Xueqi Cheng. Learning word representations by jointly modeling syntagmatic and paradigmatic relations. In Proceedings of ACL, 2015.
58. Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin. Learning sentiment-specific word embedding for twitter sentiment classification. In Proceedings of ACL, 2014.
59. Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. Representing text for joint embedding of text and knowledge bases. In Proceedings of EMNLP, pages 1499–1509, 2015.
60. Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. Knowledge graph and text jointly embedding. In Proceedings of EMNLP, pages 1591–1601, 2014.

61. Dani Yogatama, Manaal Faruqui, Chris Dyer, and Noah A Smith. Learning word representations with hierarchical sparse coding. In Proceedings of ICML, 2015.
62. Mo Yu and Mark Dredze. Improving lexical embeddings with semantic knowledge. In Proceedings of ACL, 2014.
63. Meng Zhang, Haoruo Peng, Yang Liu, Huan-Bo Luan, and Maosong Sun. Bilingual lexicon induction from non-parallel data with minimal supervision. In Proceedings of AAAI, 2017.
64. Will Y Zou, Richard Socher, Daniel M Cer, and Christopher D Manning. Bilingual word embeddings for phrase-based machine translation. In Proceedings of EMNLP, 2013.

Chapter 3 Compositional Semantics

Abstract Many important applications in NLP fields rely on understanding more complex language units such as phrases, sentences, and documents beyond words. Therefore, compositional semantics has remained a core task in NLP. In this chapter, we first introduce various models for binary semantic composition, including additive models and multiplicative models. After that, we present various typical models for N-ary semantic composition including recurrent neural network, recursive neural network, and convolutional neural network.

3.1 Introduction

From the previous chapter, following the distributional hypothesis, one could project the semantic meaning of a word into a low-dimensional real-valued vector according to its context information, which is named a word vector. Here comes a further problem: how to compress a higher level semantic unit into a vector or other kinds of mathematical representations such as a matrix or a tensor. In other words, using representation learning to model a semantic composition function remains an open and actively studied research topic. Compositionality enables natural languages to construct complex semantic meanings from the combinations of simpler semantic elements. This property is often captured with the following principle: the semantic meaning of a whole is a function of the semantic meanings of its several parts. Therefore, the semantic meanings of complex structures depend on how their semantic elements combine. Here we express the composition of two semantic units, which are denoted as u and v, respectively, and the most intuitive way to define the joint representation could be formulated as follows:

p = f (u, v), (3.1) where p corresponds to the representation of the joint semantic unit (u, v). It should be noted that here u and v could denote words, phrases, sentences, paragraphs, or even higher level semantic units.


However, given the representations of two semantic constituents, it is not enough to derive their joint embedding without syntactic information. For instance, although the phrases machine learning and learning machine contain the same words, they convey different meanings: machine learning refers to a research field in artificial intelligence, while learning machine means some specific learning algorithm. This phenomenon stresses the importance of syntactic and order information in a compositional sentence. Reference [12] takes the role of syntactic and order information into consideration and suggests a further refinement of the above principle: the meaning of a whole is a function of the meanings of its several parts and the way they are syntactically combined. Therefore, the composition function in Eq. (3.1) is redefined to incorporate the syntactic relationship rule R between the semantic units u and v:

p = f(u, v, R),  (3.2)

where R denotes the syntactic relationship rule between the two constituent semantic units. Unfortunately, even this formulation may not be fully adequate. Therefore, [7] claims that the meaning of a whole is greater than the meanings of its several parts. It implies that people may face the problem of constructing complex meanings rather than simply understanding the meanings of several parts and their syntactic relations. In real language composition, the same sentence could have different meanings in different contexts, which means that some sentences are hard to understand without any background information. For example, the sentence Tom and Jerry is one of the most popular comedies in that style requires two main pieces of background knowledge: first, Tom and Jerry is a special noun phrase or knowledge entity which indicates a cartoon comedy, rather than two ordinary people; second, that style needs further explanation from the previous sentences. Hence, a full understanding of compositional semantics needs to take existing knowledge into account. Here, the argument K is added into the composition function, incorporating knowledge information as a prior in the compositional process:

p = f(u, v, R, K),  (3.3)

where K represents the background knowledge. Reference [4] claims that we should never ask for the meaning of a word in isolation, but only in the context of a statement. That is, the meaning of a whole is constructed from its parts, and the meanings of the parts are meanwhile derived from the whole. Moreover, compositionality is a matter of degree rather than a binary notion. Linguistic structures range from fully compositional expressions (e.g., black hair), to partly compositional, syntactically fixed expressions (e.g., take advantage), in which the constituents can still be assigned separate meanings, to non-compositional idioms (e.g., kick the bucket) or multi-word expressions (e.g., by and large), whose meaning cannot be distributed across their constituents [11].

From the above three equations formulating the composition function, it could be concluded that composition could be viewed as a specific binary operation, but it is more than that: the syntactic information could help to indicate a particular composition approach, while background knowledge helps to explain some obscure words or specific context-dependent entities such as pronouns. Beyond binary compositional operations, one could build sentence-level composition by applying binary composition operations recursively. In this chapter, we will first explain some basic binary composition functions in both the semantic vector space and the matrix-vector space. After that, we will move on to more complex composition scenarios and introduce several approaches to model sentence-level composition.

3.2 Semantic Space

3.2.1 Vector Space

In general, the central task in semantic representation is projecting words from an abstract semantic space to a mathematical low-dimensional space. As introduced in the previous chapters, to make the transformation reasonable, the purpose is to maintain word similarity in this new projected space. In other words, the more similar the words are, the closer their vectors should be. For instance, we hope the word vectors w(book) and w(magazine) are close, while the word vectors w(apple) and w(computer) are far away. In this chapter, we will introduce several widely used typical semantic vector spaces, including one-hot representation, distributed representation, and distributional representation.

3.2.2 Matrix-Vector Space

Despite the wide use of semantic vector spaces, an alternative semantic space is proposed to be a more powerful and general compositional semantic framework. Different from conventional vector spaces, the matrix-vector semantic space utilizes a matrix to represent the word meaning rather than a skinny vector. The motivation behind this is that when modeling the semantic meaning under a specific context, one cares not only about the meaning of each word, but also about the holistic meaning of the whole sentence. Thus, we are concerned with the semantic transformation between adjacent words inside each sentence. However, the semantic vector space could not characterize the semantic transformation of one word on the others explicitly. Driven by the idea of modeling semantic transformation, some researchers have proposed to use a matrix to represent the transformation operation of one word on the others. Different from those vector space models, it could incorporate some structural information like the word order and syntax composition.

3.3 Binary Composition

The goal is to construct vector representations for phrases, sentences, paragraphs, and documents. Without loss of generality, we assume that each constituent of a phrase (sentence, paragraph, or document) is embedded into a vector which will be subsequently combined in some way to generate a representation vector for the phrase (sentence, paragraph, or document).1 In this section, we focus on binary composition. We will take phrases consisting of a head and a modifier or complement as an example. If we cannot model the binary composition (or phrase representation), there is little hope that we can construct more complex compositional representations for sentences or even documents. Therefore, given a phrase such as "machine learning" and the vectors u and v representing the constituents "machine" and "learning", respectively, we aim to produce a representation vector p of the whole phrase. Let the hypothetical vectors for machine and learning be [0, 3, 1, 5, 2] and [1, 4, 2, 2, 0], respectively. This simplified semantic space will serve to illustrate examples of the composition functions which we consider in this section. The fundamental problem of semantic composition modeling in representing a two-word phrase is designing a primitive composition function as a binary operator. Based on this function, one could apply it to a word sequence recursively and derive sentence-level composition. Here a word sequence could be any level of semantic unit, such as a phrase, a sentence, a paragraph, a knowledge entity, or even a document. From the previous section, one of the basic formulae is to formulate the semantic composition f in the following equation:

p = f(u, v, R, K),  (3.4)

where u, v denote the representations of the constituent parts in this semantic unit, p denotes the joint representation, R indicates the relationship, and K indicates the necessary background knowledge. The expression defines a wide class of composition functions. For easier discussion, we give some appropriate constraints to narrow the space of functions under consideration. First, we will ignore the background knowledge K to explore what can be achieved without any utilization of background or world knowledge. Second, for the consideration of the syntactic relation R, we can proceed by investigating only one relation at a time. We can then remove any explicit dependence on R, which allows us to explore any possible distinct composition function for various syntactic relations. That is, we simplify the formula to p = f(u, v) by simply ignoring the background knowledge and relationship.

1Note that the problem of combining semantic vectors of small units to make a representation for a multi-word sequence is different from the problem of incorporating information about multi-word contexts into a distributional representation for a single target word.

In recent years, modeling the binary composition function has been a well-studied but still challenging problem. There are mainly two perspectives on this question: the additive model and the multiplicative model.

3.3.1 Additive Model

The additive model is built on the constraint that p, u, and v lie in the same semantic space, which essentially means that all syntactic types have the same dimension. One of the simplest ways is to directly use the sum as the joint representation:

p = u + v.  (3.5)

According to Eq. (3.5), the sum of the two vectors representing machine and learning would be w(machine) + w(learning) = [1, 7, 3, 7, 2]. It assumes that the composition of different constituents is a symmetric function of them; in other words, it does not consider the order of constituents. Although it has several drawbacks, such as the lack of ability to model word order and the absence of background syntactic or knowledge information, this approach still provides a relatively strong baseline [9]. To overcome the word order issue, one easy variant is applying a weighted sum instead of uniform weights. That is to say, the composition has the following form:

p = αu + βv,  (3.6)

where α and β correspond to different weights for the two vectors. Under this setting, the two sequences (u, v) and (v, u) have different representations, which is consistent with real language phenomena. For example, "machine learning" and "learning machine" have different meanings, which requires different representations. In this setting, we could give greater emphasis to heads than to other constituents. As an example, if we set α to 0.3 and β to 0.7, then 0.3 × w(machine) = [0, 0.9, 0.3, 1.5, 0.6] and 0.7 × w(learning) = [0.7, 2.8, 1.4, 1.4, 0], and "machine learning" is represented by their addition 0.3 × w(machine) + 0.7 × w(learning) = [0.7, 3.7, 1.7, 2.9, 0.6]. However, this model could not consider prior knowledge and syntax information. To incorporate prior information into the additive model, one method combines nearest neighborhood semantics into the composition, deriving

p = u + v + \sum_{i=1}^{K} n_i,  (3.7)

where n_1, n_2, ..., n_K denote all semantic neighbors of v. Therefore, this method could ensemble all synonyms of the component as a smoothing factor in the composition function, which reduces the variance of language. For example, if in the composition of "machine" and "learning" the chosen neighbor is "optimizing", with w(optimizing) = [1, 5, 3, 2, 1], then the representation of "machine learning" becomes w(machine) + w(learning) + w(optimizing) = [2, 12, 6, 9, 3]. Since the joint representations of an additive model still lie in the same semantic space as their original component vectors, it is natural to use cosine similarity to measure their semantic relationships. Thus, under a naive additive model, we have the following similarity equation:

s(p, w) = \frac{p · w}{‖p‖ ‖w‖} = \frac{(u + v) · w}{‖u + v‖ ‖w‖}  (3.8)
        = \frac{‖u‖}{‖u + v‖} s(u, w) + \frac{‖v‖}{‖u + v‖} s(v, w),  (3.9)

where w denotes any other word in the vocabulary and s indicates the similarity function. From the derivation above, it could be concluded that this composition function composes both the magnitudes and the directions of the two component vectors. In other words, if one vector dominates the magnitude, it will also dominate the similarity. Furthermore, we have

‖p‖ = ‖u + v‖ ≤ ‖u‖ + ‖v‖.  (3.10)

This suggests that the semantic unit with a deeper-rooted parsing tree could dominate the joint representation when combined with a shallow unit, because the deeper the semantic unit is, the larger the magnitude it has. Moreover, incorporating geometric insight, we can see that the additive model can build a more solid understanding of semantic composition. Supposing that our component vectors are u and v, the additive model aims to project them to x and y, where x follows the direction of u while y is orthogonal to u. Figure 3.1 clearly illustrates this decomposition.

Fig. 3.1 An illustration of the additive model

From the figure, the vector x and the vector y could be represented as

x = \frac{u · v}{u · u} · u,
y = v − x = v − \frac{u · v}{u · u} · u.  (3.11)

Then, using the linear combination of these two new vectors x, y yields a new additive model:

p = αx + βy  (3.12)
  = α \frac{u · v}{u · u} u + β (v − \frac{u · v}{u · u} u)  (3.13)
  = (α − β) \frac{u · v}{u · u} u + βv.  (3.14)

Furthermore, using the cosine similarity measurement, the relationship could be written as follows:

s(p, w) = \frac{|α − β|}{|α|} s(u, w) + \frac{|β|}{|α|} s(v, w).  (3.15)

From the similarity measurement derivation, it is indicated that with this projection method, the composition similarity could be viewed as a linear combination of the similarities of the two components, which means that when combining semantic units with different semantic depths, the deeper one will not dominate the representation.
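The additive compositions of this subsection can be checked numerically. Below is a small sketch (not from the original text) using the book's example vectors for machine and learning together with the illustrative weights from above.

```python
# A minimal numerical sketch of the additive composition models in this
# subsection, using the example vectors for "machine" and "learning".
import numpy as np

u = np.array([0.0, 3.0, 1.0, 5.0, 2.0])  # w(machine)
v = np.array([1.0, 4.0, 2.0, 2.0, 0.0])  # w(learning)

# Plain additive model, Eq. (3.5): expected [1, 7, 3, 7, 2]
print("u + v           :", u + v)

# Weighted additive model, Eq. (3.6), with the illustrative weights 0.3 / 0.7
alpha, beta = 0.3, 0.7
print("0.3u + 0.7v     :", alpha * u + beta * v)

# Projection-based variant, Eqs. (3.11)-(3.14): decompose v into a component
# x along u and a component y orthogonal to u, then combine them linearly.
x = (u @ v) / (u @ u) * u
y = v - x
p = alpha * x + beta * y
print("alpha*x + beta*y:", p)
print("y orthogonal to u:", np.isclose(y @ u, 0.0))
```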

3.3.2 Multiplicative Model

Though the additive model achieves great success in semantic composition, the simplification it adopts may be too restrictive, because it assumes all words, phrases, sentences, and documents are substantially similar enough to be represented in a unified semantic space. Different from the additive model, which regards composition as a simple linear transformation, the multiplicative model aims to model higher order interactions. Among all models from this perspective, the most intuitive approach applies the pair-wise product as an approximation of the composition function. In this method, the composition function is shown as follows:

p = u ⊙ v,  (3.16)

where ⊙ denotes the element-wise product, i.e., p_i = u_i · v_i, which implies that each dimension of the output only depends on the corresponding dimension of the two input vectors. However, similar to the simplest additive model, this model also suffers from the lack of ability to model word order and the absence of background syntactic or knowledge information.

In the additive model, we have p = αu + βv to alleviate the word order issue. Note that here α and β are two scalars, which could be easily changed to two matrices. Therefore, the composition function could be represented as

p = W_α · u + W_β · v,  (3.17)

where W_α and W_β are matrices which determine the importance of u and v to p. With this expression, the composition could be more expressive and flexible, although much harder to train. Generalizing the multiplicative model above, another approach is to utilize tensors as multiplicative descriptors, and the composition function could be viewed as

p = \overrightarrow{W} · uv,  (3.18)

where \overrightarrow{W} denotes a 3-order tensor, i.e., the formula above could be written as p_k = \sum_{i,j} W_{ijk} · u_i · v_j. Hence, this model makes each element of p influenced by all elements of both u and v, through a linear combination that assigns each pair (i, j) a unique weight. Starting from this simple but general baseline, some researchers propose to make the function asymmetric to consider word order in the sequence. Paying more attention to the first element, the composition function could be

p = \overrightarrow{W} · uuv,  (3.19)

where \overrightarrow{W} denotes a 4-order tensor. This method could be understood as replacing the linear transformation of u and v with a form that is quadratic in u, asymmetrically. So this is a variant of the tensor multiplicative compositional model. Different from expanding a simple multiplicative model into more complex ones, other kinds of approaches are proposed to reduce the parameter space. With a reduced parameter size, compositions become much more efficient, avoiding the O(n³) time complexity of the tensor-based model. Thus, some compression techniques could be applied to the original tensor model. One representative instance is the circular convolution model, which could be shown as

p = u ⊛ v,  (3.20)

where ⊛ represents the circular convolution operation with the following definition:

p_i = \sum_{j} u_j · v_{i−j}.  (3.21)

If we assign each pair with unique weights, the composition function will be

p_i = \sum_{j} W_{ij} · u_j · v_{i−j}.  (3.22)
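A small numerical sketch (not from the original text) of the element-wise product of Eq. (3.16) and the circular convolution of Eqs. (3.20)-(3.21), again on the example vectors; here the index i − j is assumed to wrap around modulo the dimension, which is what makes the convolution circular.

```python
# A minimal sketch of two multiplicative composition operators on the example
# vectors: the element-wise product (Eq. 3.16) and the circular convolution
# (Eqs. 3.20-3.21), where indices are taken modulo the vector dimension.
import numpy as np

u = np.array([0.0, 3.0, 1.0, 5.0, 2.0])  # w(machine)
v = np.array([1.0, 4.0, 2.0, 2.0, 0.0])  # w(learning)

# Element-wise (pair-wise) product: p_i = u_i * v_i
print("element-wise product:", u * v)

def circular_convolution(u, v):
    n = len(u)
    p = np.zeros(n)
    for i in range(n):
        for j in range(n):
            p[i] += u[j] * v[(i - j) % n]  # index i - j wraps around (mod n)
    return p

print("circular convolution:", circular_convolution(u, v))
# Equivalent via FFT: np.real(np.fft.ifft(np.fft.fft(u) * np.fft.fft(v)))
```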

Note that the circular convolution model could be viewed as a special instance of a tensor-based composition model: if we write the circular convolution in the tensor form, we have W_{ijk} = 0 whenever k ≠ i + j. Thus, the number of parameters could be reduced from n³ to n², while maintaining the interactions between each pair of dimensions in the input vectors. In both the additive and multiplicative models, the basic assumption is that all components lie in the same semantic space as the output. Nevertheless, modeling different types of words in different semantic spaces could bring us a different perspective. For instance, given (u, v), the multiplicative model could be reformulated as

p = W · (u · v) = U · v. (3.23)

This implies that the left unit could be treated as an operation on the representation of the right one. In other words, the left unit could be formulated as a transformation matrix, while the right one is represented as a semantic vector. This argument could be meaningful, especially for some kinds of phrase composition. Reference [2] argues that for ADJ-NOUN phrases, the joint semantic information could be viewed as the conjunction of the semantic meanings of the two components. Given the phrase red car, its semantic meaning is the conjunction of all red things and all different kinds of cars. Thus, red could be formulated as an operator on the vector of car, deriving a new semantic vector which expresses the meaning of red car. These observations lead to another genre of semantic compositional modeling: the semantic matrix-composition space.

3.4 N-Ary Composition

In real-world NLP tasks, the input is usually a sequence of multiple words rather than just a pair of words. Therefore, besides designing a suitable binary compositional operator, the order in which the binary operations are applied is also important. In this section, we will introduce three mainstream strategies for N-ary composition by taking language modeling as an example. To illustrate it more clearly, the composition problem of modeling a sentence or even a document could be formulated as follows. Given a sentence/document consisting of a word sequence {w_0, w_1, w_2, ..., w_n}, we aim to design the following functions to obtain the joint semantic representation of the whole sentence/document:

1. A semantic representation method, such as a semantic vector space or a compositional matrix space.
2. A binary compositional operation function f(u, v) like the ones introduced in the previous sections. Here the inputs u and v denote the representations of two constituent semantic units, while the output is also a representation in the same space.

3. A sequential order in which to apply the binary function in step 2. To describe the order in detail, we could use brackets to identify the order in which the composition function is applied. For instance, we could use ((w_1, w_2), w_3) to represent the sequential order from beginning to end.

In this section, we will introduce several systematic strategies to model sentence semantics by describing the solutions to the three problems above. We will classify the methods by word-level order: sequential order, recursive order (following parsing trees), and convolution order.

3.4.1 Recurrent Neural Network

To design the order in which binary compositional functions are applied, the most intuitive method is to utilize sequentiality. Namely, the composition order should be s_n = (s_{n−1}, w_n), where s_{n−1} is the composition order of the first n − 1 words. Motivated by this thought, the neural network model used here is the Recurrent Neural Network (RNN). An RNN applies the composition function sequentially and derives the representations of hidden semantic units. Based on these hidden semantic units, we could use them in some specific NLP tasks like sentiment analysis or text classification. Also, note that the basic RNN only utilizes the sequential information from the head to the tail of a sentence/document. To improve its representation ability, the RNN could be enhanced into a bi-directional RNN by considering both sequential and reverse-sequential information. After deciding the sequential order to model sentence-level semantics, the next question is determining the binary composition function. In detail, supposing that h_t denotes the representation of the first t words and w_t represents the t-th word, the general composition could be formulated as

h_t = f(h_{t−1}, w_t),  (3.24)

where f is a well-designed binary composition function. From the definition of the RNN, the composition function could be formulated as follows:

h_t = tanh(W_1 h_{t−1} + W_2 w_t),  (3.25)

where W_1 and W_2 are two weight matrices. We could see that here we use a matrix-weighted summation to represent binary semantic composition:

p = W_α u + W_β v.  (3.26)
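A minimal sketch (not from the original text) of sequential composition with the vanilla RNN of Eq. (3.25); the word vectors and weight matrices are random placeholders rather than learned parameters.

```python
# A minimal sketch of sequential composition with a vanilla RNN, following
# Eq. (3.25): h_t = tanh(W1 h_{t-1} + W2 w_t). All parameters are random
# placeholders; in practice they are learned for a downstream task.
import numpy as np

rng = np.random.default_rng(0)
dim_word, dim_hidden = 8, 16

W1 = rng.normal(scale=0.1, size=(dim_hidden, dim_hidden))
W2 = rng.normal(scale=0.1, size=(dim_hidden, dim_word))

def rnn_encode(word_vectors):
    h = np.zeros(dim_hidden)                 # initial hidden state h_0
    for w in word_vectors:                   # apply the binary composition
        h = np.tanh(W1 @ h + W2 @ w)         # function from left to right
    return h                                 # representation of the sequence

sentence = [rng.normal(size=dim_word) for _ in range(5)]  # 5 toy word vectors
print(rnn_encode(sentence).shape)            # -> (16,)
```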

LSTM. Since the raw RNN only utilizes a simple tangent function, it is hard for it to capture the long-term dependencies of a long sentence/document. Reference [5] proposes Long Short-Term Memory (LSTM) networks to strengthen the ability to model long-term semantic dependencies in RNNs. In detail, the composition function of the LSTM allows information from previous time steps to flow directly to the following steps. The composition function could be defined as

f_t = Sigmoid(W_f^h h_{t−1} + W_f^x x_t + b_f),  (3.27)
i_t = Sigmoid(W_i^h h_{t−1} + W_i^x x_t + b_i),  (3.28)
o_t = Sigmoid(W_o^h h_{t−1} + W_o^x x_t + b_o),  (3.29)
ĉ_t = tanh(W_c^h h_{t−1} + W_c^x x_t + b_c),  (3.30)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ ĉ_t,  (3.31)

h_t = o_t ⊙ c_t.  (3.32)
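The LSTM composition of Eqs. (3.27)-(3.32) translates directly into code. The sketch below (not from the original text) follows the equations as written here, including h_t = o_t ⊙ c_t in Eq. (3.32); all parameters are random placeholders.

```python
# A minimal sketch of one LSTM composition step following Eqs. (3.27)-(3.32).
# Note that Eq. (3.32) uses h_t = o_t * c_t as written in the text (the more
# common variant applies tanh to c_t). All parameters are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
dim_x, dim_h = 8, 16

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one (W^h, W^x, b) triple per gate/candidate: f, i, o, c
params = {g: (rng.normal(scale=0.1, size=(dim_h, dim_h)),
              rng.normal(scale=0.1, size=(dim_h, dim_x)),
              np.zeros(dim_h)) for g in "fioc"}

def lstm_step(h_prev, c_prev, x_t):
    def gate(g, fn):
        Wh, Wx, b = params[g]
        return fn(Wh @ h_prev + Wx @ x_t + b)
    f_t = gate("f", sigmoid)                 # forget gate, Eq. (3.27)
    i_t = gate("i", sigmoid)                 # input gate,  Eq. (3.28)
    o_t = gate("o", sigmoid)                 # output gate, Eq. (3.29)
    c_hat = gate("c", np.tanh)               # candidate memory, Eq. (3.30)
    c_t = f_t * c_prev + i_t * c_hat         # memory update, Eq. (3.31)
    h_t = o_t * c_t                          # hidden state, Eq. (3.32)
    return h_t, c_t

h, c = np.zeros(dim_h), np.zeros(dim_h)
for x in [rng.normal(size=dim_x) for _ in range(5)]:  # toy word vectors
    h, c = lstm_step(h, c, x)
print(h.shape)  # -> (16,)
```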

Variants of LSTM. To simplify the LSTM and obtain a more efficient algorithm, [3] utilizes a simpler but comparable RNN architecture named the Gated Recurrent Unit (GRU). Compared with the LSTM, the GRU has fewer parameters, which brings higher efficiency. Its composition function is shown as

z_t = Sigmoid(W_z^h h_{t−1} + W_z^x x_t + b_z),  (3.33)
r_t = Sigmoid(W_r^h h_{t−1} + W_r^x x_t + b_r),  (3.34)
ĥ_t = tanh(W_h^h (r_t ⊙ h_{t−1}) + W_h^x x_t + b_h),  (3.35)
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ ĥ_t.  (3.36)
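Analogously, a single GRU step following Eqs. (3.33)-(3.36), with random placeholder parameters; this sketch is not from the original text.

```python
# A minimal sketch of one GRU composition step following Eqs. (3.33)-(3.36),
# with random placeholder parameters.
import numpy as np

rng = np.random.default_rng(0)
dim_x, dim_h = 8, 16

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W = {g: (rng.normal(scale=0.1, size=(dim_h, dim_h)),   # W^h
         rng.normal(scale=0.1, size=(dim_h, dim_x)),   # W^x
         np.zeros(dim_h)) for g in "zrh"}              # bias

def gru_step(h_prev, x_t):
    z_t = sigmoid(W["z"][0] @ h_prev + W["z"][1] @ x_t + W["z"][2])  # Eq. (3.33)
    r_t = sigmoid(W["r"][0] @ h_prev + W["r"][1] @ x_t + W["r"][2])  # Eq. (3.34)
    h_hat = np.tanh(W["h"][0] @ (r_t * h_prev)                       # Eq. (3.35)
                    + W["h"][1] @ x_t + W["h"][2])
    return (1.0 - z_t) * h_prev + z_t * h_hat                        # Eq. (3.36)

h = np.zeros(dim_h)
for x in [rng.normal(size=dim_x) for _ in range(5)]:  # toy word vectors
    h = gru_step(h, x)
print(h.shape)  # -> (16,)
```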

3.4.2 Recursive Neural Network

Besides the recurrent neural network, another strategy to apply the binary compositional function follows a parsing tree instead of sequential word order. Based on this philosophy, [15] proposes a recursive neural network to model different levels of semantic units. In this subsection, we will introduce some algorithms following the recursive parsing tree with different binary compositional functions. Since all the recursive neural networks are binary trees, the basic problem we need to consider is how to derive the representation of the father component on the tree given its two children semantic components. Reference [15] proposes a recursive matrix-vector model (MV-RNN) which captures constituent parsing tree structure information by assigning a matrix-vector representation to each constituent. The vector captures the meaning of the constituent itself, and the matrix represents how it modifies the meaning of the word it combines with. Suppose we have two children components a, b and their father component p; the composition can be formulated as follows:

p = f_vec(a, b) = g(W_1 [Ba; Ab]),  (3.37)
P = f_matrix(a, b) = W_2 [A; B],  (3.38)

where a, b, p are the embedding vectors of the components, A, B, P are their matrices, [·; ·] denotes vertical stacking, W_1 is a matrix that maps the transformed words into another semantic space, the element-wise function g is an activation function, and W_2 is a matrix that maps the two stacked matrices into one combined matrix P with the same dimension. The whole process is illustrated in Fig. 3.2. MV-RNN then selects the highest node of the path in the parse tree between the two target entities to represent the input sentence. In fact, the composition operation used in the above recursive network is similar to the RNN unit introduced in the previous subsection, and the RNN unit here can be replaced by LSTM units or GRU units. Reference [16] proposes two types of tree-structured LSTMs, the Child-Sum Tree-LSTM and the N-ary Tree-LSTM, to capture constituent or dependency parsing tree structure information. For the Child-Sum Tree-LSTM, given a tree, let C(t) denote the children set of node t. Its transition equations are defined as follows:


Fig. 3.2 The architecture of the matrix-vector recursive encoder

ĥ_t = \sum_{k ∈ C(t)} h_k,  (3.39)
i_t = Sigmoid(W^{(i)} w_t + U^{(i)} ĥ_t + b^{(i)}),  (3.40)
f_{tk} = Sigmoid(W^{(f)} w_t + U^{(f)} h_k + b^{(f)})  (k ∈ C(t)),  (3.41)
o_t = Sigmoid(W^{(o)} w_t + U^{(o)} ĥ_t + b^{(o)}),  (3.42)
u_t = tanh(W^{(u)} w_t + U^{(u)} ĥ_t + b^{(u)}),  (3.43)
c_t = i_t ⊙ u_t + \sum_{k ∈ C(t)} f_{tk} ⊙ c_k,  (3.44)

h_t = o_t ⊙ tanh(c_t).  (3.45)

The N-ary Tree-LSTM has transition equations similar to those of the Child-Sum Tree-LSTM. The only difference is that it limits the tree structures to have at most N branches.
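A sketch (not from the original text) of the Child-Sum Tree-LSTM node update of Eqs. (3.39)-(3.45), applied recursively from the leaves to the root of a toy tree; the tree structure and all parameters are placeholders.

```python
# A minimal sketch of the Child-Sum Tree-LSTM node update, Eqs. (3.39)-(3.45),
# applied bottom-up over a toy tree. Parameters are random placeholders; the
# tree here is a nested structure (word_vector, [children]).
import numpy as np

rng = np.random.default_rng(0)
dim_x, dim_h = 8, 16

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

P = {g: (rng.normal(scale=0.1, size=(dim_h, dim_x)),   # W^(g)
         rng.normal(scale=0.1, size=(dim_h, dim_h)),   # U^(g)
         np.zeros(dim_h)) for g in "ifou"}             # b^(g)

def affine(g, w_t, h):
    Wg, Ug, bg = P[g]
    return Wg @ w_t + Ug @ h + bg

def tree_lstm(word_vec, children):
    child_states = [tree_lstm(w, c) for w, c in children]   # recurse bottom-up
    h_tilde = sum((h for h, _ in child_states), np.zeros(dim_h))  # Eq. (3.39)
    i_t = sigmoid(affine("i", word_vec, h_tilde))            # Eq. (3.40)
    o_t = sigmoid(affine("o", word_vec, h_tilde))            # Eq. (3.42)
    u_t = np.tanh(affine("u", word_vec, h_tilde))            # Eq. (3.43)
    c_t = i_t * u_t                                          # Eq. (3.44)
    for h_k, c_k in child_states:
        f_tk = sigmoid(affine("f", word_vec, h_k))           # Eq. (3.41)
        c_t = c_t + f_tk * c_k
    h_t = o_t * np.tanh(c_t)                                 # Eq. (3.45)
    return h_t, c_t

# A toy tree: a root word with two leaf children
leaf = lambda: (rng.normal(size=dim_x), [])
root = (rng.normal(size=dim_x), [leaf(), leaf()])
h_root, _ = tree_lstm(*root)
print(h_root.shape)  # -> (16,)
```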

3.4.3 Convolutional Neural Network

Reference [6] proposes to embed an input sentence using a Convolutional Neural Network (CNN), which extracts local features with a convolution layer and combines all local features via a max-pooling operation to obtain a fixed-size vector for the input sentence. Formally, the convolution operation is defined as a matrix multiplication between a sequence of vectors and a convolution matrix W, plus a bias vector b, applied within a sliding window. Let us define the vector q_i as the concatenation of the subsequence of input representations in the i-th window; then we have

[h]_j = \max_i [f(W q_i + b)]_j,  (3.46)

where f indicates a nonlinear function such as the sigmoid or tangent function, and h indicates the final representation of the sentence.
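A minimal sketch (not from the original text) of the convolution-plus-max-pooling encoder of Eq. (3.46); the word vectors and convolution parameters are random placeholders.

```python
# A minimal sketch of a CNN sentence encoder following Eq. (3.46): slide a
# window over the word vectors, apply an affine map plus nonlinearity to each
# window, and max-pool over positions. Parameters are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
dim_word, window, dim_feature, sent_len = 8, 3, 16, 7

W = rng.normal(scale=0.1, size=(dim_feature, window * dim_word))
b = np.zeros(dim_feature)
words = rng.normal(size=(sent_len, dim_word))      # toy word vectors

def cnn_encode(words):
    feats = []
    for i in range(len(words) - window + 1):
        q_i = words[i:i + window].reshape(-1)      # concatenate the window
        feats.append(np.tanh(W @ q_i + b))         # f(W q_i + b)
    return np.max(np.stack(feats), axis=0)         # max over positions i

print(cnn_encode(words).shape)  # -> (16,)
```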

3.5 Summary

In this chapter, we first introduce the semantic space for compositional semantics. Afterwards, we take phrase representation as an example to introduce various models for binary semantic composition, including additive models and multiplicative models. Finally, we introduce typical models for N-ary semantic composition, including the recurrent neural network, the recursive neural network, and the convolutional neural network. Compositional semantics allows languages to construct complex meanings from the combinations of simpler elements, and binary and N-ary semantic composition are the foundation of multiple NLP tasks including sentence representation, document representation, relational path representation, etc. We will give a detailed introduction to these scenarios in the following chapters. For further understanding of compositional semantics, there are also some recommended surveys and books:

• Pelletier. The principle of semantic compositionality [13].
• Mitchell and Lapata. Composition in distributional models of semantics [10].

For better modeling of compositional semantics, some directions require further efforts in the future:

(1) Neurobiology-inspired Compositional Semantics. What is the neurobiological basis for dealing with compositional semantics in human language? Recently, [14] finds that the human combinatory system is related to rapidly peaking activity in the left anterior temporal lobe and later engagement of the medial prefrontal cortex. The analysis of how language builds meaning, and the directions it lays out for neurobiological research, may provide instructive references for modeling compositional semantics in representation learning. It is valuable to design novel compositional forms inspired by recent neurobiological advances.

(2) Combination of Symbolic and Distributed Representation. Human language is inherently a discrete symbolic representation of knowledge. However, we represent the semantics of discrete symbols with distributed/distributional representations when dealing with natural language in deep learning. Recently, some approaches such as neural module networks [1] and neural symbolic machines [8] attempt to incorporate discrete symbols into neural networks. How to take advantage of these symbolic neural models to represent the composition of semantics is an open problem to be explored.

References

1. Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. In Proceedings of CVPR, pages 39–48, 2016.
2. Marco Baroni and Roberto Zamparelli. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of EMNLP, 2010.
3. Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Gated feedback recurrent neural networks. In Proceedings of ICML, 2015.
4. Gottlob Frege. Die Grundlagen der Arithmetik: Eine logisch-mathematische Untersuchung über den Begriff der Zahl. Breslau: Koebner, 1884.
5. Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
6. Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural network for modelling sentences. In Proceedings of ACL, 2014.
7. George Lakoff. Linguistic gestalts. In Proceedings of ILGISA, 1977.
8. Chen Liang, Jonathan Berant, Quoc Le, Kenneth Forbus, and Ni Lao. Neural symbolic machines: Learning semantic parsers on Freebase with weak supervision. In Proceedings of ACL, pages 23–33, 2017.

9. Jeff Mitchell and Mirella Lapata. Vector-based models of semantic composition. In Proceedings of ACL, 2008.
10. Jeff Mitchell and Mirella Lapata. Composition in distributional models of semantics. Cognitive Science, 34(8):1388–1429, 2010.
11. Geoffrey Nunberg, Ivan A Sag, and Thomas Wasow. Idioms. Language, pages 491–538, 1994.
12. Barbara Partee. Lexical semantics and compositionality. An Invitation to Cognitive Science: Language, 1:311–360, 1995.
13. Francis Jeffry Pelletier. The principle of semantic compositionality. Topoi, 13(1):11–24, 1994.
14. Liina Pylkkänen. The neural basis of combinatory syntax and semantics. Science, 366(6461):62–66, 2019.
15. Richard Socher, Brody Huval, Christopher D Manning, and Andrew Y Ng. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of EMNLP, 2012.
16. Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of ACL, 2015.

Chapter 4 Sentence Representation

Abstract The sentence is an important linguistic unit of natural language. Sentence representation has remained a core task in natural language processing, because many important applications in related fields rely on understanding sentences, for example, summarization, machine translation, sentiment analysis, and dialogue systems. Sentence representation aims to encode the semantic information of a sentence into a real-valued representation vector, which will be utilized in further sentence classification or matching tasks. With large-scale text data available on the Internet and recent advances in deep neural networks, researchers tend to employ neural networks (e.g., convolutional neural networks and recurrent neural networks) to learn low-dimensional sentence representations and have achieved great progress on relevant tasks. In this chapter, we first introduce the one-hot representation for sentences and the n-gram sentence representation (i.e., the probabilistic language model). Then we extensively introduce neural-based models for sentence modeling, including the feedforward neural network, convolutional neural network, recurrent neural network, the latest Transformer, and pre-trained language models. Finally, we introduce several typical applications of sentence representations.

4.1 Introduction

Natural language sentences consist of words or phrases, follow grammatical rules, and convey complete semantic information. Compared with words and phrases, sentences have more complex structures, including both sequential and hierarchical structures, which are essential for understanding sentences. In NLP, how to represent sentences is critical for related applications, such as sentence classification, sentiment analysis, sentence matching, and so on. Before deep learning took off, sentences were usually represented as one-hot vectors or TF-IDF vectors, following the bag-of-words assumption. In this case, a sentence is represented as a vocabulary-sized vector, in which each element represents the importance of a specific word (either term frequency or TF-IDF) to the sentence. However, this method confronts two issues. Firstly, the dimension of such representation vectors is usually up to thousands or millions; thus, they usually face the sparsity problem and bring in computational efficiency problems. Secondly, such a representation method follows the bag-of-words assumption and ignores the sequential and structural information, which can be crucial for understanding the semantic meanings of sentences. Inspired by recent advances of deep learning models in computer vision and speech, researchers proposed to model sentences with deep neural networks, such as the convolutional neural network, the recurrent neural network, and so on. Compared with conventional word frequency-based sentence representations, deep neural networks can capture the internal structures of sentences, e.g., sequential and dependency information, through convolutional or recurrent operations. Thus, neural network-based sentence representations have achieved great success in sentence modeling and NLP tasks.

4.2 One-Hot Sentence Representation

One-hot representation is the most simple and straightforward method for word representation tasks. This method represents each word with a fixed-length binary vector. Specifically, for a vocabulary V = {w_1, w_2, ..., w_{|V|}}, the one-hot representation of word w is w = [0, ..., 0, 1, 0, ..., 0]. Based on the one-hot word representation and the vocabulary, it can be extended to represent a sentence s = {w_1, w_2, ..., w_l} as

s = \sum_{i=1}^{l} w_i,  (4.1)

where l indicates the length of the sentence s. The sentence representation s is the sum of the one-hot representations of the l words within the sentence, i.e., each element in s represents the Term Frequency (TF) of the corresponding word. Moreover, researchers usually take the importance of different words into consideration, rather than treating all the words equally. For example, function words such as "a", "an", and "the" usually appear in many different sentences and carry little meaning. Therefore, the Inverse Document Frequency (IDF) is employed to measure the importance of w_i in V as follows:

idf_{w_i} = \log \frac{|D|}{df_{w_i}},  (4.2)

where |D| is the number of all documents in the corpus D and df_{w_i} represents the Document Frequency (DF) of w_i. With the importance of each word, sentences are represented more precisely as follows:

ŝ = s ⊗ idf,  (4.3)

where ⊗ is the element-wise product. Here, ŝ is the TF-IDF representation of the sentence s.
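A small sketch (not from the original text) of Eqs. (4.1)-(4.3) on a toy corpus, using naive whitespace tokenization.

```python
# A minimal sketch of one-hot/TF sentence vectors (Eq. 4.1), IDF weights
# (Eq. 4.2), and the TF-IDF sentence representation (Eq. 4.3), on a toy corpus
# with naive whitespace tokenization.
import numpy as np

corpus = ["the cat sat on the mat",
          "the dog sat on the log",
          "cats and dogs are pets"]

vocab = sorted({w for doc in corpus for w in doc.split()})
word2id = {w: i for i, w in enumerate(vocab)}

def tf_vector(sentence):
    s = np.zeros(len(vocab))
    for w in sentence.split():
        s[word2id[w]] += 1.0          # sum of one-hot word vectors
    return s

# document frequency and IDF over the corpus
df = np.zeros(len(vocab))
for doc in corpus:
    for w in set(doc.split()):
        df[word2id[w]] += 1.0
idf = np.log(len(corpus) / df)

s = tf_vector("the cat sat on the mat")
s_tfidf = s * idf                     # element-wise product, Eq. (4.3)
print(dict(zip(vocab, np.round(s_tfidf, 3))))
```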

4.3 Probabilistic Language Model

One-hot sentence representation neglects the structural information in a sentence. To address this issue, researchers proposed the probabilistic language model, which treats n-grams rather than single words as the basic components. An n-gram is a subsequence of words in a context window of length n, and the probabilistic language model defines the probability of a sentence $s = [w_1, w_2, \ldots, w_l]$ as

$P(s) = \prod_{i=1}^{l} P(w_i \mid w_1^{i-1})$.  (4.4)

In fact, the model in Eq. (4.4) is not practicable due to its enormous parameter space. In practice, we simplify the model and set an n-sized context window, assuming that the probability of word $w_i$ only depends on $[w_{i-n+1} \cdots w_{i-1}]$. More specifically, an n-gram language model predicts word $w_i$ in the sentence $s$ based on its previous $n-1$ words. Therefore, the simplified probability of a sentence is formalized as

$P(s) = \prod_{i=1}^{l} P(w_i \mid w_{i-n+1}^{i-1})$,  (4.5)

where the probability of selecting the word $w_i$ can be calculated from n-gram frequency counts:

$P(w_i \mid w_{i-n+1}^{i-1}) = \frac{P(w_{i-n+1}^{i})}{P(w_{i-n+1}^{i-1})}$.  (4.6)

Typically, the conditional probabilities in n-gram language models are not calculated directly from the frequency counts, since this suffers from severe problems when confronted with n-grams that have never been seen before. Therefore, researchers proposed several types of smoothing approaches, which assign some of the total probability mass to unseen words or n-grams, such as "add-one" smoothing, Good-Turing discounting, or back-off models.

The n-gram model is a typical probabilistic language model for predicting the next word in a word sequence, and it follows the Markov assumption that the probability of the target word only relies on the previous $n-1$ words. This idea is employed by most current sentence modeling methods. The n-gram language model is used as an approximation of the true underlying language model. This assumption is crucial because it massively simplifies the problem of learning the parameters of language models from data. Recent works on word representation learning [3, 40, 43] are also mainly based on the n-gram language model.
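To make Eqs. (4.5)-(4.6) and add-one smoothing concrete, the following sketch estimates bigram (n = 2) probabilities from counts on a toy corpus. The corpus, the sentence-boundary tokens, and the function names are illustrative assumptions, not part of the original text.

```python
from collections import Counter

def train_bigram(corpus):
    """Collect unigram and bigram counts from tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return unigrams, bigrams

def bigram_prob(w_prev, w, unigrams, bigrams, vocab_size):
    """P(w | w_prev) with add-one (Laplace) smoothing."""
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vocab_size)

def sentence_prob(sent, unigrams, bigrams, vocab_size):
    """P(s) as a product of smoothed bigram probabilities (Eq. 4.5 with n = 2)."""
    tokens = ["<s>"] + sent + ["</s>"]
    prob = 1.0
    for w_prev, w in zip(tokens[:-1], tokens[1:]):
        prob *= bigram_prob(w_prev, w, unigrams, bigrams, vocab_size)
    return prob

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
unigrams, bigrams = train_bigram(corpus)
V = len(unigrams)
print(sentence_prob(["the", "cat", "sat"], unigrams, bigrams, V))
```

Because of the add-one smoothing, even unseen bigrams such as ("cat", "ran") receive a small nonzero probability, which is exactly the data sparsity fix discussed above.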

4.4 Neural Language Model

Although smoothing approaches can alleviate the sparsity problem in the probabilistic language model, it still performs poorly for unseen or uncommon words and n-grams. Moreover, as probabilistic language models are constructed on larger and larger texts, the number of unique words (the vocabulary) grows and the number of possible word sequences increases exponentially with the size of the vocabulary, so that ever more statistics are needed to estimate the probabilities accurately. To address this issue, researchers proposed neural language models, which use continuous representations or embeddings of words together with neural networks to make their predictions. Embeddings in the continuous space help to alleviate the curse of dimensionality in language modeling, and neural networks avoid the sparsity problem by representing words in a distributed way, as nonlinear combinations of weights in a neural net [2]. An alternative description is that a neural network approximates the language function. The neural net architecture may be feedforward or recurrent, and while the former is simpler, the latter is more common. Similar to probabilistic language models, neural language models are constructed and trained as probabilistic classifiers that learn to predict a probability distribution:

$P(s) = \prod_{i=1}^{l} P(w_i \mid w_1^{i-1})$,  (4.7)

where the conditional probability of selecting the word $w_i$ can be calculated by various kinds of neural networks, such as feedforward neural networks, recurrent neural networks, and so on. In the following sections, we will introduce these neural language models in detail.

4.4.1 Feedforward Neural Network Language Model

The goal of a neural network language model is to estimate the conditional probability $P(w_i \mid w_1, \ldots, w_{i-1})$. However, the feedforward neural network (FNN) lacks an effective way to represent long-term historical context. Therefore, it adopts the idea of n-gram language models to approximate the conditional probability, assuming that each word in a word sequence statistically depends more on the words closer to it, and that only $n-1$ context words are used to calculate the conditional probability, i.e., $P(w_i \mid w_1^{i-1}) \approx P(w_i \mid w_{i-n+1}^{i-1})$. The overall architecture of the FNN language model is proposed by [3]. To evaluate the conditional probability of the word $w_i$, it first projects its $n-1$ context words to their word vector representations $\mathbf{x} = [\mathbf{w}_{i-n+1}, \ldots, \mathbf{w}_{i-1}]$, and then feeds them into an FNN, which can be generally represented as

$\mathbf{y} = \mathbf{M} f(\mathbf{W}\mathbf{x} + \mathbf{b}) + \mathbf{d}$,  (4.8)

where $\mathbf{W}$ is a weight matrix that transforms word vectors into hidden representations, $\mathbf{M}$ is a weight matrix for the connections between the hidden layer and the output layer, and $\mathbf{b}$, $\mathbf{d}$ are bias vectors. Then the conditional probability of the word $w_i$ can be calculated as

$P(w_i \mid w_{i-n+1}^{i-1}) = \frac{\exp(y_{w_i})}{\sum_j \exp(y_j)}$.  (4.9)
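A minimal NumPy sketch of the forward pass in Eqs. (4.8)-(4.9) follows, assuming randomly initialized parameters, a toy vocabulary, and $f = \tanh$ as one common choice of nonlinearity; all dimensions and names are illustrative rather than taken from the original model.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h, n = 10, 8, 16, 3                 # vocabulary size, embedding dim, hidden dim, n-gram order

E = rng.normal(size=(V, d))               # word embedding matrix
W = rng.normal(size=(h, (n - 1) * d))     # input-to-hidden weights (W in Eq. 4.8)
b = np.zeros(h)
M = rng.normal(size=(V, h))               # hidden-to-output weights (M in Eq. 4.8)
d_bias = np.zeros(V)

def fnn_lm_prob(context_ids):
    """P(w_i | previous n-1 words), following Eqs. (4.8)-(4.9)."""
    x = E[context_ids].reshape(-1)        # concatenate the n-1 context word vectors
    y = M @ np.tanh(W @ x + b) + d_bias   # Eq. (4.8), with f = tanh
    y = y - y.max()                       # subtract max for numerical stability
    return np.exp(y) / np.exp(y).sum()    # softmax over the vocabulary, Eq. (4.9)

probs = fnn_lm_prob([1, 4, 7])            # indices of the n-1 context words
print(probs.shape, probs.sum())           # (10,) 1.0
```

In practice, the parameters E, W, M, b, and d are learned by maximizing the log-likelihood of a training corpus rather than drawn at random as in this sketch.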

4.4.2 Convolutional Neural Network Language Model

The Convolutional Neural Network (CNN) is a family of neural network models that features a type of layer known as the convolutional layer, which extracts features with a learnable filter (or kernel) at different positions of an input. Pham et al. [47] propose the CNN language model to enhance the FNN language model. The proposed CNN is produced by injecting a convolutional layer after the input word representations $\mathbf{x} = [\mathbf{w}_{i-n}, \ldots, \mathbf{w}_{i-1}]$. Formally, the convolutional layer applies a sliding window over the input vectors centered on each word vector using a parameter matrix $\mathbf{W}_c$, which can be generally represented as

$\mathbf{y} = \mathbf{M} \max(\mathbf{W}_c \mathbf{x})$,  (4.10)

where $\max(\cdot)$ indicates a max-pooling layer. The architecture of CNN is shown in Fig. 4.1. Moreover, [12] also introduces a convolutional neural network for language modeling with a novel gating mechanism.

4.4.3 Recurrent Neural Network Language Model

To address the inability of the FNN language model to capture long-term dependencies, [41] proposes a Recurrent Neural Network (RNN) language model which applies RNNs to language modeling. RNNs are fundamentally different from FNNs in the sense that they operate not only on an input space but also on an internal state space, and the internal state space enables the representation of sequentially extended dependencies. Therefore, the RNN language model can deal with sentences of arbitrary length. At every time step, its input is the vector of the previous word instead of the concatenation of the vectors of its n previous words, and the information of all other previous words can be taken into account through its internal state. Formally, the RNN language model can be defined as

Fig. 4.1 The architecture of CNN


$\mathbf{h}_i = f(\mathbf{W}_1 \mathbf{h}_{i-1} + \mathbf{W}_2 \mathbf{w}_i + \mathbf{b})$,  (4.11)

$\mathbf{y} = \mathbf{M}\mathbf{h}_{i-1} + \mathbf{d}$,  (4.12)

where $\mathbf{W}_1$, $\mathbf{W}_2$, $\mathbf{M}$ are weight matrices and $\mathbf{b}$, $\mathbf{d}$ are bias vectors. Here, the RNN unit can also be implemented with LSTM or GRU cells. The architecture of RNN is shown in Fig. 4.2. Recently, researchers have compared neural network language models with different architectures on both small and large corpora. The experimental results show that, generally, the RNN language model outperforms the CNN language model.

4.4.4 Transformer Language Model

In 2018, Google proposed a pre-trained language model (PLM), called BERT, which achieved state-of-the-art results on a variety of NLP tasks and attracted enormous attention. Since then, NLP researchers have been considering how PLMs can benefit their research tasks.


Fig. 4.2 The architecture of RNN

In this section, we will first introduce the Transformer architecture and then talk about BERT and other PLMs in detail.

4.4.4.1 Transformer

Transformer [65] is a nonrecurrent encoder-decoder architecture with a series of attention-based blocks. The encoder has 6 layers, and each layer is composed of a multi-head attention sublayer and a position-wise feedforward sublayer, with a residual connection around each sublayer. The architecture of the Transformer is shown in Fig. 4.3. The multi-head attention sublayer contains several attention heads. Each head is a scaled dot-product attention structure, which takes the query matrix $\mathbf{Q}$, the key matrix $\mathbf{K}$, and the value matrix $\mathbf{V}$ as inputs, and computes the output as

$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{Softmax}\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\right)\mathbf{V}$,  (4.13)

where $d_k$ is the dimension of the query vectors. The multi-head attention sublayer linearly projects the input hidden states $\mathbf{H}$ several times into the query, key, and value matrices for $h$ heads. The dimensions of the query, key, and value vectors are $d_k$, $d_k$, and $d_v$, respectively.
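The following NumPy sketch implements the scaled dot-product attention of Eq. (4.13) for a single head, with one linear projection per role as in Eq. (4.14); the sequence length, dimensions, and random parameters are illustrative assumptions, and a full multi-head sublayer would concatenate $h$ such heads and apply the output projection $\mathbf{W}^O$.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = Softmax(Q K^T / sqrt(d_k)) V  (Eq. 4.13)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8
H = rng.normal(size=(seq_len, d_model))   # hidden states of one sequence

# One head: project H into queries, keys, and values (the per-head part of Eq. 4.14).
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))
head = scaled_dot_product_attention(H @ W_Q, H @ W_K, H @ W_V)
print(head.shape)                         # (5, 8): one output vector per input token
```

Note that every token attends to every other token in a single operation, which is the property the discussion below contrasts with the step-by-step processing of RNNs.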


Fig. 4.3 The architecture of Transformer

The multi-head attention sublayer could be formulated as

$\text{Multihead}(\mathbf{H}) = [\text{head}_1, \text{head}_2, \ldots, \text{head}_h]\mathbf{W}^O$,  (4.14)

where $\text{head}_i = \text{Attention}(\mathbf{H}\mathbf{W}_i^Q, \mathbf{H}\mathbf{W}_i^K, \mathbf{H}\mathbf{W}_i^V)$, and $\mathbf{W}_i^Q$, $\mathbf{W}_i^K$, and $\mathbf{W}_i^V$ are linear projections. $\mathbf{W}^O$ is also a linear projection for the output. Here, the fully connected position-wise feedforward sublayer contains two linear transformations with a ReLU activation:

$\text{FFN}(\mathbf{x}) = \mathbf{W}_2 \max(0, \mathbf{W}_1\mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2$.  (4.15)

The Transformer is better than RNNs at modeling long-term dependencies, since all tokens are considered on an equal footing during the attention operation regardless of their distance. The Transformer was originally proposed for machine translation, but because of its powerful ability to model sequential data, it has become the most popular backbone of NLP applications.

4.4.4.2 Transformer-Based PLM

Neural models can learn large amounts of language knowledge from language modeling. Since this language knowledge covers the demands of many downstream NLP tasks and provides powerful representations of words and sentences, researchers found that it can be transferred to other NLP tasks easily. The transferred models are called Pre-trained Language Models (PLMs).

Language modeling is one of the most basic and most important NLP tasks. It involves a variety of knowledge for language understanding, such as linguistic knowledge and factual knowledge. For example, the model needs to decide whether it should add an article before a noun, which requires linguistic knowledge about articles. Another example is the question of which word follows "Trump is the president of". The answer is "America", which requires factual knowledge. Since language modeling is very complex, models can learn a lot from this task. On the other hand, language modeling only requires plain text without any human annotation. With this feature, models can learn complex NLP abilities from a very large-scale corpus. Since deep learning needs large amounts of data and language modeling can make full use of all texts in the world, PLMs significantly benefit the development of NLP research.

Inspired by the success of the Transformer, GPT [50] and BERT [14] adopt the Transformer as the backbone of their pre-trained language models. GPT and BERT are the most representative Transformer-based PLMs. Since they achieved state-of-the-art performance on various NLP tasks, nearly all PLMs after them are based on the Transformer. In this subsection, we will talk about GPT and BERT in more detail.

GPT is the first work to pretrain a PLM based on the Transformer. The training procedure of GPT [50] contains two classic stages: generative pretraining and discriminative fine-tuning. In the pretraining stage, the input of the model is a large-scale unlabeled corpus denoted as $\mathscr{U} = \{u_1, u_2, \ldots, u_n\}$. The pretraining stage aims to optimize a language model, and the learning objective over the corpus is to maximize the conditional likelihood within a fixed-size window:

$\mathscr{L}_1(\mathscr{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)$,  (4.16)

where $k$ represents the size of the window and the conditional likelihood $P$ is modeled by a neural network with parameters $\Theta$. For a supervised dataset $\chi$, the input is a sequence of words $s = (w_1, w_2, \ldots, w_l)$ and the output is a label $y$. The pretraining stage provides an advantageous starting point of parameters that can be used to initialize subsequent supervised tasks. In the fine-tuning stage, the objective is a discriminative task that maximizes the conditional probability:

$\mathscr{L}_2(\chi) = \sum_{(s, y)} \log P(y \mid w_1, \ldots, w_l)$,  (4.17)

where $P(y \mid w_1, \ldots, w_l)$ is modeled by a $K$-layer Transformer. After the input tokens pass through the pretrained GPT, a hidden vector of the final layer $\mathbf{h}_l^K$ is produced. To obtain the output distribution, a linear transformation layer is added, whose output size equals the number of labels:

$P(y \mid w_1, \ldots, w_l) = \text{Softmax}(\mathbf{W}_y \mathbf{h}_l^K)$.  (4.18)

The final training objective combines the fine-tuning loss with the language modeling loss $\mathscr{L}_1$ for better generalization:

$\mathscr{L}(\chi) = \mathscr{L}_2(\chi) + \lambda \mathscr{L}_1(\chi)$,  (4.19)

where $\lambda$ is a weight hyperparameter.

BERT [14] is a milestone work in the field of PLMs. BERT achieved significant empirical results on 17 different NLP tasks, including SQuAD (outperforming human performance), GLUE (7.7% absolute improvement), MultiNLI (4.6% absolute improvement), etc. Compared to GPT, BERT uses a bidirectional deep Transformer as the model backbone.

As illustrated in Fig. 4.4, BERT contains pretraining and fine-tuning stages. In the pretraining stage, two objectives are designed: Masked Language Model (MLM) and Next Sentence Prediction (NSP).

Fig. 4.4 The pretraining and fine-tuning stages for BERT

(1) For MLM, tokens are randomly masked with a special token [MASK]. The training objective is to predict the masked tokens based on their contexts. Compared with the standard unidirectional conditional language model, which can only be trained in one direction, MLM aims to train a deep bidirectional representation model. This task is inspired by the Cloze task [64]. (2) The objective of NSP is to capture relationships between sentences for sentence-based downstream tasks such as natural language inference (NLI) and question answering (QA). In this task, a binary classifier is trained to predict whether one sentence is the actual next sentence of the other. This task effectively captures the deep relationship between sentences, exploring semantic information at a different level.

After pretraining, BERT captures rich language knowledge that can be used for downstream supervised tasks. By modifying inputs and outputs, BERT can be fine-tuned for any NLP task whose input is a single text or a text pair. The input consists of sentence A and sentence B, which can represent (1) sentence pairs in paraphrase, (2) hypothesis-premise pairs in entailment, (3) question-passage pairs in QA, and (4) a text-∅ pair for text classification or sequence tagging. For the output, BERT can produce a token-level representation for each token, which is used for sequence tagging or question answering. Besides, the representation of the special token [CLS] in BERT is fed into a classification layer for sequence classification.
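The token-corruption step of the MLM objective can be sketched in pure Python as follows. The 15% masking rate and the 80%/10%/10% replacement rule follow the original BERT paper, while the toy vocabulary, tokenization, and function names are our own illustrative assumptions.

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]   # toy vocabulary

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly corrupt tokens for the MLM objective (BERT-style 80/10/10 rule)."""
    rng = random.Random(seed)
    inputs, labels = list(tokens), [None] * len(tokens)    # None = token is not predicted
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok                                # the model must recover this token
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK                           # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.choice(VOCAB)              # 10%: replace with a random token
            # remaining 10%: keep the original token unchanged
    return inputs, labels

print(mask_tokens(["the", "cat", "sat", "on", "the", "mat"]))
```

During pretraining, the corrupted sequence is fed to the bidirectional Transformer and the loss is computed only at the positions whose label is not None.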

4.4.4.3 PLM Family

Pre-trained language models have made rapid progress after BERT. We summarize several important directions of PLMs and show some representative models and their relationships in Fig. 4.5. Here is a brief introduction to the PLMs after BERT. First, there are variants of BERT for better general language representation, such as RoBERTa [38] and XLNet [70]; these models mainly focus on improving the pretraining tasks. Second, some researchers work on pretrained generation models, such as MASS [57] and UniLM [15], which achieve promising results on generation tasks rather than the Natural Language Understanding (NLU) tasks targeted by BERT. Third, the sentence-pair input format of BERT inspired work in the cross-lingual and cross-modal fields; XLM [8], ViLBERT [39], and VideoBERT [59] are important works in this direction. Lastly, some works [46, 81] explore incorporating external knowledge into PLMs, since some low-frequency knowledge cannot be efficiently learned by PLMs.

4.4.5 Extensions

4.4.5.1 Importance Sampling

Inspired by the contrastive divergence model, [4] proposes to adopt importance sampling to accelerate the training of neural language models.

Fig. 4.5 The Pre-trained language model family

They first normalize the outputs of the neural network language model and view neural network language models as a special case of energy-based probability models as follows:

$P(w_i \mid w_{i-n}^{i-1}) = \frac{\exp(-y_{w_i})}{\sum_j \exp(-y_j)}$.  (4.20)

The key idea of importance sampling is to approximate the log-likelihood gradient of the loss function of the neural network language model by sampling several important words instead of calculating the explicit gradient over the whole vocabulary. Here, the log-likelihood gradient of the neural network language model can be generally represented as

$\frac{\partial \log P(w_i \mid w_{i-n}^{i-1})}{\partial \theta} = -\frac{\partial y_{w_i}}{\partial \theta} + \sum_{j=1}^{|V|} P(w_j \mid w_{i-n}^{i-1}) \frac{\partial y_j}{\partial \theta} = -\frac{\partial y_{w_i}}{\partial \theta} + E_{w_k \sim P}\left[\frac{\partial y_{w_k}}{\partial \theta}\right]$,  (4.21)

where $\theta$ indicates all parameters of the neural network language model. Here, the gradient consists of two parts: a positive gradient for the target word $w_i$ and a negative gradient for all words $w_j$, i.e., $E_{w_k \sim P}\left[\frac{\partial y_{w_k}}{\partial \theta}\right]$. The second part can be approximated by sampling important words following the probability distribution $P$:

$E_{w_k \sim P}\left[\frac{\partial y_{w_k}}{\partial \theta}\right] \approx \frac{1}{|V'|} \sum_{w_k \in V'} \frac{\partial y_{w_k}}{\partial \theta}$,  (4.22)

where $V'$ is the set of words sampled under $P$.

$E_{w_k \sim P}\left[\frac{\partial y_{w_k}}{\partial \theta}\right] \approx \frac{1}{|V'|} \sum_{w_l \in V'} \frac{\partial y_{w_l}}{\partial \theta} P(w_l \mid w_{i-n}^{i-1})/Q(w_l)$,  (4.23)

where $V'$ is the set of words sampled under $Q$. Moreover, the sample size of the importance sampling approach should be increased as training proceeds in order to avoid divergence, which aims to ensure an adequate effective sample size $S$:

$S = \frac{\left(\sum_{w_l \in V'} r_l\right)^2}{\sum_{w_l \in V'} r_l^2}$,  (4.24)

where $r_l$ is further defined as

$r_l = \frac{P(w_l \mid w_{i-n}^{i-1})/Q(w_l)}{\sum_{w_j \in V'} P(w_j \mid w_{i-n}^{i-1})/Q(w_j)}$.  (4.25)
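The normalized weights in Eq. (4.25) and the effective sample size in Eq. (4.24) can be computed as in the following NumPy sketch. The model scores and proposal probabilities here are random placeholders standing in for $P(w_l \mid w_{i-n}^{i-1})$ and $Q(w_l)$; they are assumptions for illustration only.

```python
import numpy as np

def importance_weights(p_scores, q_probs):
    """Normalized importance weights r_l = (P/Q) / sum_j (P_j/Q_j)  (Eq. 4.25)."""
    ratios = p_scores / q_probs
    return ratios / ratios.sum()

def effective_sample_size(r):
    """S = (sum_l r_l)^2 / sum_l r_l^2  (Eq. 4.24); equals |V'| when weights are uniform."""
    return r.sum() ** 2 / (r ** 2).sum()

rng = np.random.default_rng(0)
p_scores = rng.random(32)        # stand-in for P(w_l | context) of the 32 sampled words
q_probs = rng.random(32)         # proposal probabilities Q(w_l) of the same words
r = importance_weights(p_scores, q_probs)
print(effective_sample_size(r))  # a small value signals that the sample size should grow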

4.4.5.2 Word Classification

Besides importance sampling, researchers [7, 22] also propose the class-based language model, which adopts word classification to improve the performance and speed of a language model. In the class-based language model, each word is assigned to a unique class, and the conditional probability of a word given its context can be decomposed into the probability of the word's class given its previous words and the probability of the word given its class and history, which is formally defined as

$P(w_i \mid w_{i-n}^{i-1}) = \sum_{c(w_i) \in C} P(w_i \mid c(w_i)) P(c(w_i) \mid w_{i-n}^{i-1})$,  (4.26)

where $C$ indicates the set of all classes and $c(w_i)$ indicates the class of word $w_i$.

Moreover, [44] proposes a hierarchical neural network language model, which extends word classification to a hierarchical binary clustering of words in a language model. Instead of simply assigning each word a unique class, it first builds a hierarchical binary tree of words according to the word similarity obtained from WordNet. Next, it assigns a unique bit vector $c(w_i) = [c_1(w_i), c_2(w_i), \ldots, c_l(w_i)]$ to each word, which indicates its hierarchical classes. Then the conditional probability of each word can be defined as

$P(w_i \mid w_{i-n}^{i-1}) = \prod_{j=1}^{l} P(c_j(w_i) \mid c_1(w_i), c_2(w_i), \ldots, c_{j-1}(w_i), w_{i-n}^{i-1})$.  (4.27)

The hierarchical neural network language model can achieve an $O(k/\log k)$ speedup compared to a standard language model. However, the experimental results of [44] show that although the hierarchical neural network language model achieves an impressive speedup for modeling sentences, it performs worse than the standard language model. The reason is perhaps that the introduction of the hierarchical architecture or word classes imposes a negative influence on the word classification performed by the neural network language model.
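The flat class-based factorization of Eq. (4.26) can be illustrated with the toy sketch below, where a |V|-way softmax is replaced by two smaller distributions: one over classes and one over the words inside the chosen class. The probability tables here are hand-set placeholders; in a real model both factors would be softmax outputs of the network.

```python
# Toy illustration of the class-based factorization in Eq. (4.26).
word_class = {"cat": "ANIMAL", "dog": "ANIMAL", "paris": "CITY", "london": "CITY"}

# P(class | history) -- in practice produced by a small softmax over classes.
p_class_given_history = {"ANIMAL": 0.3, "CITY": 0.7}

# P(word | class) -- a softmax restricted to the words inside each class.
p_word_given_class = {
    "ANIMAL": {"cat": 0.6, "dog": 0.4},
    "CITY": {"paris": 0.5, "london": 0.5},
}

def class_based_prob(word):
    """P(w | history) = P(w | c(w)) * P(c(w) | history)  (Eq. 4.26 with unique classes)."""
    c = word_class[word]
    return p_word_given_class[c][word] * p_class_given_history[c]

print(class_based_prob("paris"))   # 0.35
```

The hierarchical model of [44] pushes this idea further by replacing the single class decision with a sequence of binary decisions along a tree, as in Eq. (4.27).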

4.4.5.3 Caching

Caching is another important extension of language modeling. One type of cache-based language model assumes that each word in the recent context is more likely to appear again [58]. Hence, the conditional probability of a word can be calculated by combining the information from history and from the cache:

$P(w_i \mid w_{i-n}^{i-1}) = \lambda P_s(w_i \mid w_{i-n}^{i-1}) + (1 - \lambda) P_c(w_i \mid w_{i-n}^{i-1})$,  (4.28)

where $P_s(w_i \mid w_{i-n}^{i-1})$ indicates the conditional probability generated by the standard language model, $P_c(w_i \mid w_{i-n}^{i-1})$ indicates the conditional probability generated by the cache, and $\lambda$ is a constant. Another cache-based language model is used to speed up RNN language modeling [27]. The main idea of this approach is to store the outputs and states of the language model and reuse them for future predictions given the same contextual history.

4.5 Applications

In this section, we will introduce two typical sentence-level NLP applications, text classification and relation extraction, and show how to utilize sentence representation for these applications.

4.5.1 Text Classification

Text classification is a typical NLP application that underlies many important real-world tasks such as parsing and semantic analysis, and it has therefore attracted the interest of many researchers. Conventional text classification models (e.g., the LDA [6] and tree kernel [48] models) focus on capturing more contextual information and correct word order by extracting more useful and distinctive features, but they still expose a few issues (e.g., data sparseness) which have a significant impact on classification accuracy. Recently, with the development of deep learning in various fields of artificial intelligence, neural models have been introduced into text classification due to their ability to learn text representations. In this section, we will introduce two typical text classification tasks: sentence classification and sentiment classification.

4.5.1.1 Sentence Classification

Sentence classification aims to assign a sentence an appropriate category, and it is a basic task of the text classification application.

Considering the effectiveness of CNN models in capturing sentence semantic meanings, [31] first proposes to utilize CNN models trained on top of pretrained word embeddings to classify sentences, achieving promising results on several sentence classification datasets. Then, [30] introduces a dynamic CNN model for the semantic modeling of sentences. This model handles sentences of varying lengths and uses dynamic max-pooling over linear sequences, which helps the model capture both short-range and long-range semantic relations in sentences. Furthermore, [9] proposes a novel CNN-based model named Very Deep CNN, which operates directly at the character level. It shows that deeper models achieve better results on sentence classification and can capture hierarchical information from scattered characters to whole sentences. Yin and Schütze [74] also propose MV-CNN, which utilizes multiple types of pretrained word embeddings and extracts features from multi-granular phrases with variable-sized convolutional layers. To address the drawbacks of MV-CNN, such as model complexity and the requirement that all embeddings have the same dimension, [80] proposes a novel model called MG-CNN to capture multiple features from multiple sets of embeddings, which are concatenated at the penultimate layer. Zhang et al. [79] present RA-CNN to jointly exploit labels on documents and their constituent sentences, which estimates the probability that a given sentence is informative and then scales the contribution of each sentence to the aggregated document representation in proportion to these estimates.

The RNN model, which aims to capture the sequential information of sentences, is also widely used in sentence classification. Lai et al. [32] propose a neural network for text classification which applies a recurrent structure to capture contextual information. Moreover, [37] introduces a multitask learning framework based on the RNN to jointly learn across multiple sentence classification tasks, which employs three different mechanisms of sharing information to model sentences with both task-specific and shared layers. Yang et al. [71] introduce word-level and sentence-level attention mechanisms into an RNN-based model as well as a hierarchical structure to capture the hierarchical information of documents for sentence classification.

4.5.1.2 Sentiment Classification

Sentiment classification is a special case of sentence classification, whose objective is to classify the sentiment polarity of the opinions a piece of text contains, e.g., favorable or unfavorable, positive or negative. This task appeals to the NLP community since it has many potential downstream applications such as movie review suggestions.

Similar to text classification, sentence representations based on neural models have also been widely explored for sentiment classification. Glorot et al. [20] use a stacked denoising autoencoder for sentiment classification for the first time. Then, a series of recursive neural network models based on the recursive tree structure of sentences are proposed to learn sentence representations for sentiment classification, including the recursive autoencoder (RAE) [55], the matrix-vector recursive neural network (MV-RNN) [54], and the recursive neural tensor network (RNTN) [56]. Besides, [29] adopts a CNN to learn sentence representations and achieves promising performance in sentiment classification.

RNN models also benefit sentiment classification as they are able to capture sequential information. Li et al. [35] and Tai et al. [62] investigate tree-structured LSTM models on text classification. There are also hierarchical models proposed to deal with document-level sentiment classification [5, 63], which generate semantic representations at different levels (e.g., phrase, sentence, or document) within a document. Moreover, the attention mechanism has also been introduced into sentiment classification, aiming to select important words from a sentence or important sentences from a document [71].

4.5.2 Relation Extraction

To enrich existing KGs, researchers have devoted many efforts to automatically finding novel relational facts in text. Therefore, relation extraction (RE), which aims at extracting relational facts according to the semantic information in plain text, has become a crucial NLP application. As RE is also an important downstream application of sentence representation, we will introduce the techniques and extensions that show how to utilize sentence representation in different RE scenarios. Considering that neural networks have become the backbone of recent NLP research, we mainly focus on Neural RE (NRE) models in this section.

Fig. 4.6 An example of sentence-level relation extraction

4.5.2.1 Sentence-Level NRE

Sentence-level NRE aims at predicting the semantic relations between a given entity (or nominal) pair in a sentence. As shown in Fig. 4.6, given the input sentence $s$, which consists of $n$ words $s = \{w_1, w_2, \ldots, w_n\}$, and its corresponding entity pair $e_1$ and $e_2$ as input, sentence-level NRE aims to obtain the conditional probability $P(r \mid s, e_1, e_2)$ of relation $r$ ($r \in R$) via a neural network, which can be formalized as

$P(r \mid s, e_1, e_2) = P(r \mid s, e_1, e_2, \theta)$,  (4.29)

where $\theta$ denotes all parameters of the neural network and $r$ is a relation in the relation set $R$.

A basic form of sentence-level NRE consists of three components: (a) an input encoder that gives a representation for each input word, (b) a sentence encoder which computes either a single vector or a sequence of vectors to represent the original sentence, and (c) a relation classifier which calculates the conditional probability distribution over all relations.

Input Encoder. First, a sentence-level NRE system projects the discrete words of the source sentence into a continuous vector space and obtains the input representation $\mathbf{w} = \{\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_m\}$ of the source sentence.

(1) Word Embeddings. Word embeddings aim to transform words into distributed representations to capture the syntactic and semantic meanings of the words. In the sentence $s$, every word $w_i$ is represented by a real-valued vector. Word representations are encoded by column vectors in an embedding matrix $\mathbf{E} \in \mathbb{R}^{d^a \times |V|}$, where $V$ is a fixed-sized vocabulary. Although word embeddings are the most common way to represent input words, there are also efforts to utilize more complicated information of input sentences for RE.

(2) Position Embeddings. In RE, the words close to the target entities are usually informative for determining the relation between the entities. Therefore, position embeddings are used to help models keep track of how close each word is to the head or tail entity. They are defined as the combination of the relative distances from the current word to the head and tail entities. For example, in the sentence Bill_Gates is the founder of Microsoft., the relative distance

from the word founder to the head entity Bill_Gates is −3 and to the tail entity Microsoft is 2. Besides word position embeddings, more linguistic features are also considered in addition to the word embeddings to enrich the linguistic information of the input sentence.

(3) Part-of-Speech (POS) Tag Embeddings. POS tag embeddings represent the lexical information of the target word in the sentence. Because word embeddings are obtained from a large-scale general corpus, the general information they contain may not be in accordance with the meaning in a specific sentence. Hence, it is necessary to align each word with its linguistic information in its specific context, e.g., noun or verb. Formally, each word $w_i$ is encoded by the corresponding column vector in an embedding matrix $\mathbf{E}^p \in \mathbb{R}^{d^p \times |V^p|}$, where $d^p$ is the dimension of the embedding vector and $V^p$ indicates a fixed-sized POS tag vocabulary.

(4) WordNet Hypernym Embeddings. WordNet hypernym embeddings aim to take advantage of the prior hypernym knowledge to help RE models. Given the hypernym information of each word in WordNet (e.g., noun.food and verb.motion), it is easier to build connections between different but conceptually similar words. Formally, each word $w_i$ is encoded by the corresponding column vector in an embedding matrix $\mathbf{E}^h \in \mathbb{R}^{d^h \times |V^h|}$, where $d^h$ is the dimension of the embedding vector and $V^h$ indicates a fixed-sized hypernym vocabulary.

For each word, NRE models often concatenate some of the above four feature embeddings as its input embedding. Therefore, the feature embeddings of all words are concatenated and denoted as a final input sequence $\mathbf{w} = \{\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_m\}$, where $\mathbf{w}_i \in \mathbb{R}^d$ and $d$ is the total dimension of all feature embeddings concatenated for each word.

Sentence Encoder. The sentence encoder is the core of sentence representation, encoding the input representations into either a single vector or a sequence of vectors $\mathbf{x}$ to represent the sentence. We will introduce the different sentence encoders in the following.

(1) Convolutional Neural Network Encoder. Zeng et al. [76] propose to encode input sentences using a CNN model, which extracts local features with a convolutional layer and combines all local features via a max-pooling operation to obtain a fixed-sized vector for the input sentence. Formally, a convolutional layer is defined as an operation on a vector sequence $\mathbf{w}$:

$\mathbf{p} = \text{CNN}(\mathbf{w})$,  (4.30)

where CNN indicates the convolution operation inside the convolutional layer. The $i$th element of the sentence vector $\mathbf{x}$ can then be calculated as follows:

$[\mathbf{x}]_i = f(\max(\mathbf{p}_i))$,  (4.31)

where $f$ is a nonlinear function applied to the output, such as the hyperbolic tangent function.
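The following NumPy sketch illustrates a CNN sentence encoder in the spirit of Eqs. (4.30)-(4.31): a sliding-window convolution followed by max-pooling over positions and a tanh nonlinearity. The window size, dimensions, and random parameters are illustrative assumptions rather than the settings of the cited model.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_word, d_conv, window = 8, 10, 6, 3     # sentence length, input dim, #filters, window

w = rng.normal(size=(seq_len, d_word))            # input word representations w_1 ... w_m
W_c = rng.normal(size=(d_conv, window * d_word))  # convolution filters

def cnn_encoder(w):
    """Convolution over sliding windows, then max-pooling and tanh (Eqs. 4.30-4.31)."""
    windows = [w[i:i + window].reshape(-1) for i in range(len(w) - window + 1)]
    p = np.stack([W_c @ win for win in windows], axis=1)   # (d_conv, num_windows)
    return np.tanh(p.max(axis=1))                          # fixed-size sentence vector x

x = cnn_encoder(w)
print(x.shape)   # (6,): one value per convolution filter, independent of sentence length
```

The piecewise variant described next (PCNN) replaces the single max over all positions with three separate maxima over the segments delimited by the head and tail entities.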

Further, PCNN [75], which is a variant of CNN, adopts a piecewise max-pooling operation. All hidden vectors $\{\mathbf{p}_1, \mathbf{p}_2, \ldots\}$ are divided into three segments by the head and tail entities. The max-pooling operation is performed over the three segments separately, and $\mathbf{x}$ is the concatenation of the pooling results over the three segments.

(2) Recurrent Neural Network Encoder. Zhang and Wang [78] propose to embed input sentences using an RNN model, which can learn temporal features. Formally, each input word representation is fed into the recurrent layer step by step. At each step $i$, the network takes the $i$th word representation vector $\mathbf{w}_i$ and the output of the previous $i-1$ steps $\mathbf{h}_{i-1}$ as input:

$\mathbf{h}_i = \text{RNN}(\mathbf{w}_i, \mathbf{h}_{i-1})$,  (4.32)

where RNN indicates the transformation function inside the RNN cell, which can be the LSTM or GRU units mentioned before.

The conventional RNN models typically process text sequences from start to end, building the hidden state of each word considering only its preceding words. It has been verified that hidden states that also consider the following words are more effective. Hence, the bi-directional RNN (BRNN) [52] is adopted to learn hidden states using both preceding and following words. Similar to the previous CNN models in RE, the RNN model combines the output vectors of the recurrent layer as local features and then uses a max-pooling operation to extract the global feature, which forms the representation of the whole input sentence. The max-pooling layer can be formulated as

$[\mathbf{x}]_j = \max_i [\mathbf{h}_i]_j$.  (4.33)

Besides max-pooling, word attention can also combine all local feature vectors. The attention mechanism [1] learns attention weights for each step. Supposing $\mathbf{H} = [\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_m]$ is the matrix consisting of all output vectors produced by the recurrent layer, the feature vector of the whole sentence $\mathbf{x}$ is formed as a weighted sum of these output vectors:

$\boldsymbol{\alpha} = \text{Softmax}(\mathbf{s}^\top \tanh(\mathbf{H}))$,  (4.34)

$\mathbf{x} = \mathbf{H}\boldsymbol{\alpha}^\top$,  (4.35)

where $\mathbf{s}$ is a trainable query vector.

Besides, [42] proposes a model that captures information from both the word sequence and the tree-structured dependency by stacking bidirectional path-based LSTM-RNNs (i.e., bottom-up and top-down). More specifically, it focuses on the shortest path between the two target entities in the dependency tree and utilizes the stacked layers to encode the shortest path for the whole sentence representation. In fact, some preliminary work [69] has shown that these paths are useful in RE, and various recursive neural models have also been proposed for this. Next, we will introduce these recursive models in detail.

(3) Recursive Neural Network Encoder. The recursive encoder aims to extract features from syntactic parsing trees, considering that the syntactic information is beneficial for extracting relations from sentences. Generally, these encoders treat the tree structure inside syntactic parsing trees as a strategy of composition as well as a direction for combining each word feature. Socher et al. [54] propose a recursive matrix-vector model (MV-RNN) which captures the structure information by assigning a matrix-vector representation to each constituent in the parsing tree. The vector captures the meaning of the constituent itself, and the matrix represents how it modifies the meaning of the word it combines with. Tai et al. [62] further propose two types of tree-structured LSTMs, the Child-Sum Tree-LSTM and the N-ary Tree-LSTM, to capture tree structure information. For the Child-Sum Tree-LSTM, given a tree, let $C(t)$ denote the set of children of node $t$. Its transition equations are defined as follows:

$\hat{\mathbf{h}}_t = \text{TLSTM}\left(\sum_{k \in C(t)} \mathbf{h}_k\right)$,  (4.36)

where $\text{TLSTM}(\cdot)$ indicates a Tree-LSTM cell, which is simply modified from the LSTM cell. The N-ary Tree-LSTM has similar transition equations to the Child-Sum Tree-LSTM. The only difference is that it limits the tree structures to have at most $N$ branches.

Relation Classifier. When the representation $\mathbf{x}$ of the input sentence is obtained, the relation classifier calculates the conditional probability $P(r \mid \mathbf{x}, e_1, e_2)$ via a softmax layer as follows:

$P(r \mid \mathbf{x}, e_1, e_2) = \text{Softmax}(\mathbf{M}\mathbf{x} + \mathbf{b})$,  (4.37)

where $\mathbf{M}$ indicates the relation matrix and $\mathbf{b}$ is a bias vector.
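The word-attention pooling of Eqs. (4.34)-(4.35) and the softmax relation classifier of Eq. (4.37) can be sketched together in NumPy as follows; the hidden states, query vector, and relation matrix are random placeholders for illustration only.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

rng = np.random.default_rng(0)
m, d_h, num_rel = 7, 12, 5                     # sentence length, hidden dim, number of relations

H = rng.normal(size=(d_h, m))                  # output vectors of the recurrent layer (columns h_i)
s = rng.normal(size=d_h)                       # trainable query vector
M = rng.normal(size=(num_rel, d_h))            # relation matrix
b = np.zeros(num_rel)

alpha = softmax(s @ np.tanh(H))                # attention weights over words (Eq. 4.34)
x = H @ alpha                                  # sentence representation as weighted sum (Eq. 4.35)
p_relation = softmax(M @ x + b)                # P(r | x, e1, e2) (Eq. 4.37)
print(alpha.shape, x.shape, p_relation.sum())  # (7,) (12,) 1.0
```

In training, the attention query, the encoder, and the relation classifier are learned jointly by maximizing the log-probability of the gold relation.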

4.5.2.2 Bag-Level NRE

Although existing neural models have achieved great success in extracting novel relational facts, they always suffer from the lack of training data. To address this issue, researchers proposed the distant supervision assumption to generate training data by automatically aligning KGs and plain text. The intuition of the distant supervision assumption is that all sentences that contain two entities will express their relation in the KG. For example, given the relational fact (New York, city_of, United States) in a KG, the distant supervision assumption regards all sentences that contain these two entities as positive instances for the relation city_of. It offers a natural way of utilizing information from multiple sentences (bag-level) rather than a single sentence (sentence-level) to decide whether a relation holds between two entities.

Therefore, bag-level NRE aims to predict the semantic relations between an entity pair using all involved sentences. As shown in Fig. 4.7, given the input sentence set

Fig. 4.7 An example of bag-level relation extraction

$S$, which consists of $n$ sentences $S = \{s_1, s_2, \ldots, s_n\}$, and its corresponding entity pair $e_1$ and $e_2$ as inputs, bag-level NRE aims to obtain the conditional probability $P(r \mid S, e_1, e_2)$ of relation $r$ ($r \in R$) via a neural network, which can be formalized as

$P(r \mid S, e_1, e_2) = P(r \mid S, e_1, e_2, \theta)$.  (4.38)

A basic form of bag-level NRE consists of four components: (a) an input encoder similar to that of sentence-level NRE, (b) a sentence encoder similar to that of sentence-level NRE, (c) a bag encoder which computes a vector representing all related sentences in a bag, and (d) a relation classifier similar to that of sentence-level NRE, which takes bag vectors as input instead of sentence vectors. As the input encoder, sentence encoder, and relation classifier of bag-level NRE are similar to those of sentence-level NRE, we will mainly focus on introducing the bag encoder in detail.

Bag Encoder. The bag encoder encodes all sentence vectors into a single vector $\mathbf{S}$. We will introduce the different bag encoders in the following.

(1) Random Encoder. It simply assumes that each sentence can express the relation between the two target entities and randomly selects one sentence to represent the bag. Formally, the bag representation is defined as

$\mathbf{S} = \mathbf{s}_i \quad (i \in \{1, 2, \ldots, n\})$,  (4.39)

where $\mathbf{s}_i$ indicates the sentence representation of $s_i \in S$ and $i$ is a random index.

(2) Max Encoder. As introduced above, not all sentences containing the two target entities can express their relation. For example, the sentence New York is the premier gateway for legal immigration to the United States does not express the relation city_of. Hence, [75] follows the at-least-one assumption, which assumes that at least one sentence containing the two target entities expresses their relation, and selects the sentence with the highest probability for the relation to represent the bag. Formally, the bag representation is defined as

$\mathbf{S} = \mathbf{s}_i \quad \left(i = \arg\max_i P(r \mid s_i, e_1, e_2)\right)$.  (4.40)

(3) Average Encoder. Both the random encoder and the max encoder use only one sentence to represent the bag, which ignores the rich information in the other sentences. To exploit the information of all sentences, [36] assumes that the representation $\mathbf{S}$ of the bag depends on all sentences' representations, since each sentence representation $\mathbf{s}_i$ provides relation information about the two entities to a certain extent. The average encoder assumes that all sentences contribute equally to the representation of the bag, which means the embedding $\mathbf{S}$ of the bag is the average of all sentence vectors:

$\mathbf{S} = \frac{1}{n} \sum_i \mathbf{s}_i$.  (4.41)

(4) Attentive Encoder. Due to the wrong labeling problem inevitably brought by the distant supervision assumption, the performance of the average encoder will be influenced by sentences that contain no relation information. To address this issue, [36] further proposes to employ selective attention to de-emphasize those noisy sentences. Formally, the bag representation is defined as a weighted sum of sentence vectors:

$\mathbf{S} = \sum_i \alpha_i \mathbf{s}_i$,  (4.42)

where $\alpha_i$ is defined as

$\alpha_i = \frac{\exp(\mathbf{s}_i \mathbf{A}\mathbf{r})}{\sum_j \exp(\mathbf{s}_j \mathbf{A}\mathbf{r})}$,  (4.43)

where $\mathbf{A}$ is a diagonal matrix and $\mathbf{r}$ is the representation vector of relation $r$.

Relation Classifier. Similar to sentence-level NRE, when the bag representation $\mathbf{S}$ is obtained, the relation classifier calculates the conditional probability $P(r \mid S, e_1, e_2)$ via a softmax layer as follows:

$P(r \mid S, e_1, e_2) = \text{Softmax}(\mathbf{M}\mathbf{S} + \mathbf{b})$,  (4.44)

where $\mathbf{M}$ indicates the relation matrix and $\mathbf{b}$ is a bias vector.
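A NumPy sketch of the selective attention bag encoder (Eqs. 4.42-4.43) and the bag-level classifier (Eq. 4.44) follows; the sentence vectors, the diagonal matrix $\mathbf{A}$, and the relation query vector $\mathbf{r}$ are random placeholders standing in for learned parameters.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

rng = np.random.default_rng(0)
n_sent, d_s, num_rel = 4, 12, 5                # bag size, sentence dim, number of relations

sent_vecs = rng.normal(size=(n_sent, d_s))     # sentence representations s_i in the bag
A = np.diag(rng.random(d_s))                   # diagonal weight matrix A
r = rng.normal(size=d_s)                       # query vector of the candidate relation
M = rng.normal(size=(num_rel, d_s))
b = np.zeros(num_rel)

alpha = softmax(sent_vecs @ A @ r)             # selective attention weights (Eq. 4.43)
S = alpha @ sent_vecs                          # bag representation (Eq. 4.42)
p_relation = softmax(M @ S + b)                # P(r | S, e1, e2) (Eq. 4.44)
print(alpha, p_relation.sum())                 # weights over the bag's sentences, 1.0
```

Noisy sentences that do not express the candidate relation tend to receive small attention weights, which is how this encoder mitigates the wrong labeling problem discussed above.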

4.5.2.3 Extensions

Recently, NRE systems have achieved significant improvements in both the supervised and the distantly supervised scenarios. However, there are still many challenges in the task of RE, and many researchers have been focusing on other aspects to improve the performance of NRE as well. In this section, we will introduce these extensions in detail.

Utilization of External Information. Most existing NRE systems stated above only concentrate on the extracted sentences, regardless of rich external information such as KGs. This heterogeneous information could provide additional knowledge from the KG and is essential when extracting new relational facts. Han et al. [24] propose a novel joint representation learning framework for knowledge acquisition. The key idea is that the joint model learns knowledge and text representations within a unified semantic space via KG-text alignments. For the text part, the sentence with the two entities Mark Twain and Florida is regarded as the input of a CNN encoder, and the output of the CNN is considered to be the latent relation PlaceOfBirth of this sentence. For the KG part, entity and relation representations are learned via translation-based methods. The learned representations of the KG and text parts are aligned during training. Besides this preliminary attempt, many efforts have been devoted to this direction [25, 28, 51, 67, 68].

Incorporating Relational Paths. Although existing NRE systems have achieved promising results, they still suffer from a major problem: the models can only directly learn from sentences which contain both target entities. However, sentences containing only one of the entities could also provide useful information and help build inference chains. For example, if we know that "A is the son of B" and "B is the son of C", we can infer that A is the grandson of C. To utilize the information of both direct and indirect sentences, [77] introduces a path-based NRE model that incorporates textual relational paths. The model first employs a CNN encoder to embed the semantic meanings of sentences. Then, the model builds a relation path encoder, which measures the probability of relations given an inference chain in the text. Finally, the model combines information from both direct sentences and relational paths, and then predicts the confidence of each relation. This work is the first effort to consider the knowledge of relational paths in text for NRE, and several later methods also consider the reasoning paths of sentence semantic meanings for RE [11, 19].

Document-level Relation Extraction. In fact, not all relational facts can be extracted by sentence-level RE, i.e., a large number of relational facts are expressed in multiple sentences. Taking Fig. 4.9 as an example, multiple entities are mentioned in the document and exhibit complex interactions. In order to identify the relational fact (Riddarhuset, country, Sweden), one has to first identify the fact that Riddarhuset is located in Stockholm from Sentence 4, then identify the facts that Stockholm is the capital of Sweden and that Sweden is a country from Sentence 1. With the above facts, we can finally infer that the sovereign state of Riddarhuset is Sweden. This process requires reading and reasoning over multiple sentences in a document, which is intuitively beyond the reach of sentence-level RE methods.
According to the statistics on a human-annotated corpus sampled from Wikipedia documents [72], at least 40.7% of relational facts can only be extracted from multiple sentences, which is not negligible. Swampillai and Stevenson [61] and Verga et al. [66] also report similar observations. Therefore, it is necessary to move RE forward from the sentence level to the document level. Figure 4.8 shows an example of document-level RE.

Fig. 4.8 An example of document-level relation extraction

Fig. 4.9 An example from DocRED [72]

However, existing datasets for document-level RE either have only a small number of manually annotated relations and entities [34], exhibit noisy annotations from distant supervision [45, 49], or serve specific domains or approaches [33]. To address this issue, [72] constructs a large-scale, manually annotated, and general-purpose document-level RE dataset, named DocRED. DocRED is constructed from Wikipedia and Wikidata, and has two key features. First, DocRED contains 132,375 entities and 56,354 relational facts annotated on 5,053 Wikipedia documents, which makes it the largest human-annotated document-level RE dataset to date. Second, over 40% of the relational facts in DocRED can only be extracted from multiple sentences. Therefore, DocRED requires reading multiple sentences in a document to recognize entities and inferring their relations by synthesizing all information of the document. The experimental results on DocRED show that the performance of existing sentence-level RE methods declines significantly, indicating that document-level RE is more challenging than sentence-level RE and remains an open problem. It also relates to document representation, which will be introduced in the next chapter.

Few-shot Relation Extraction. As mentioned before, the performance of conventional RE models [23, 76] heavily depends on time-consuming and labor-intensive annotated data, which makes it hard for them to generalize well. Although adopting distant supervision is a primary approach to alleviate this problem, the distantly supervised data also exhibit a long-tail distribution, where most relations have very limited instances. Furthermore, distant supervision suffers from the wrong labeling problem, which makes it harder to classify long-tail relations. Hence, it is necessary to study training RE models with insufficient training instances. Figure 4.10 shows an example of few-shot RE.

Fig. 4.10 An example of few-shot relation extraction

Table 4.1 An example of a 3-way 2-shot scenario. Different colors indicate different entities: underline for the head entity and emphasis for the tail entity

Supporting set:
(A) capital_of
  (1) London is the capital of the U.K.
  (2) Washington is the capital of the U.S.A.
(B) member_of
  (1) Newton served as the president of the Royal Society.
  (2) Leibniz was a member of the Prussian Academy of Sciences.
(C) birth_name
  (1) Samuel Langhorne Clemens, better known by his pen name Mark Twain, was an American writer.
  (2) Alexei Maximovich Peshkov, primarily known as Maxim Gorky, was a Russian and Soviet writer.

Test instance: (A) or (B) or (C)?
  Euler was elected a foreign member of the Royal Swedish Academy of Sciences.

FewRel [26] is a large-scale supervised few-shot RE dataset, which requires models to handle the classification task with only a handful of training instances, as shown in Table 4.1. Benefiting from the FewRel dataset, there have been several efforts exploring few-shot RE [17, 53, 73] that achieve promising results. Yet, few-shot RE still remains a challenging problem for further research [18].

4.6 Summary

In this chapter, we introduce sentence representation learning. Sentence representation encodes the semantic information of a sentence into a real-valued representation vector, which can be utilized in further sentence classification or matching tasks. First, we introduce the one-hot representation for sentences and probabilistic language models. Second, we extensively introduce several neural language models, including those adopting feedforward neural networks, convolutional neural networks, recurrent neural networks, and the Transformer. These neural models can learn rich linguistic and semantic knowledge from language modeling. Benefiting from this, pre-trained language models trained with large-scale corpora have achieved state-of-the-art performance on various downstream NLP tasks by transferring the learned semantic knowledge from general corpora to the target tasks. Finally, we introduce several typical applications of sentence representation, including text classification and relation extraction.

For further understanding of sentence representation learning and its applications, we recommend the following surveys and books:

• Goldberg, Neural network methods for natural language processing [21].
• Deng and Liu, Deep learning in natural language processing [13].

In the future, for better sentence representation, some directions require further efforts:

(1) Exploring Advanced Architectures. The improvement of model architectures is a key factor in the success of sentence representation. From feedforward neural networks to the Transformer, people have been designing neural models better suited to sequential inputs. Based on the Transformer, some researchers are working on new NLP architectures. For instance, Transformer-XL [10] is proposed to solve the problem of the fixed-length context in the Transformer. Since the Transformer is the state-of-the-art NLP architecture, current works mainly adopt attention mechanisms. Beyond these works, is it possible to introduce more human cognitive mechanisms into neural models?

(2) Modeling Long Documents. The representation of long documents is an important extension of sentence representation. There are new challenges in modeling long documents, such as discourse analysis and co-reference resolution. Although some existing works already provide document-level NLP tasks (e.g., DocRED [72]), the model performance on these tasks is still much lower than human performance. We will introduce the advances in document representation learning in the following chapter.

(3) Performing Efficient Representation. Although the combination of the Transformer and large-scale data leads to very powerful sentence representation, these representation models require an expensive computational cost, which limits their application to downstream tasks. Some existing works explore model compression techniques for more efficient models, including knowledge distillation [60], parameter pruning [16], etc. Beyond these works, there remain many unsolved problems in developing better representation models, which should efficiently learn from large-scale data and provide effective vectors for downstream tasks.

References

1. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR, 2015.
2. Yoshua Bengio. Neural net language models. Scholarpedia, 3(1):3881, 2008.
3. Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003.
4. Yoshua Bengio, Jean-Sébastien Senécal, et al. Quick training of probabilistic neural nets by importance sampling. In Proceedings of AISTATS, 2003.

5. Parminder Bhatia, Yangfeng Ji, and Jacob Eisenstein. Better document-level sentiment analysis from RST discourse parsing. In Proceedings of EMNLP, 2015.
6. David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
7. Peter F Brown, Peter V Desouza, Robert L Mercer, Vincent J Della Pietra, and Jenifer C Lai. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, 1992.
8. Alexis Conneau and Guillaume Lample. Cross-lingual language model pretraining. In Proceedings of NeurIPS, 2019.
9. Alexis Conneau, Holger Schwenk, Loïc Barrault, and Yann Lecun. Very deep convolutional networks for text classification. In Proceedings of EACL, volume 1, 2017.
10. Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of ACL, pages 2978–2988, 2019.
11. Rajarshi Das, Arvind Neelakantan, David Belanger, and Andrew McCallum. Chains of reasoning over entities, relations, and text using recurrent neural networks. In Proceedings of EACL, pages 132–141, 2017.
12. Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. In Proceedings of ICML, 2017.
13. Li Deng and Yang Liu. Deep learning in natural language processing. Springer, 2018.
14. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL, 2019.
15. Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. Unified language model pre-training for natural language understanding and generation. In Proceedings of NeurIPS, 2019.
16. Angela Fan, Edouard Grave, and Armand Joulin. Reducing transformer depth on demand with structured dropout. In Proceedings of ICLR, 2020.
17. Tianyu Gao, Xu Han, Zhiyuan Liu, and Maosong Sun. Hybrid attention-based prototypical networks for noisy few-shot relation classification. In Proceedings of AAAI, pages 6407–6414, 2019.
18. Tianyu Gao, Xu Han, Hao Zhu, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. FewRel 2.0: Towards more challenging few-shot relation classification. In Proceedings of EMNLP-IJCNLP, pages 6251–6256, 2019.
19. Michael Glass, Alfio Gliozzo, Oktie Hassanzadeh, Nandana Mihindukulasooriya, and Gaetano Rossiello. Inducing implicit relations from text using distantly supervised deep nets. In International Semantic Web Conference, pages 38–55. Springer, 2018.
20. Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of ICML, 2011.
21. Yoav Goldberg. Neural network methods for natural language processing. Synthesis Lectures on Human Language Technologies, 10(1):1–309, 2017.
22. Joshua Goodman. Classes for fast maximum entropy training. In Proceedings of ICASSP, 2001.
23. Matthew R Gormley, Mo Yu, and Mark Dredze. Improved relation extraction with feature-rich compositional embedding models. In Proceedings of EMNLP, 2015.
24. Xu Han, Zhiyuan Liu, and Maosong Sun. Joint representation learning of text and knowledge for knowledge graph completion. arXiv preprint arXiv:1611.04125, 2016.
25. Xu Han, Zhiyuan Liu, and Maosong Sun. Neural knowledge acquisition via mutual attention between knowledge graph and text. In Proceedings of AAAI, pages 4832–4839, 2018.
26. Xu Han, Hao Zhu, Pengfei Yu, Ziyun Wang, Yuan Yao, Zhiyuan Liu, and Maosong Sun. FewRel: A large-scale supervised few-shot relation classification dataset with state-of-the-art evaluation. In Proceedings of EMNLP, 2018.
27. Zhiheng Huang, Geoffrey Zweig, and Benoit Dumoulin. Cache based recurrent neural network language model inference for first pass speech recognition. In Proceedings of ICASSP, 2014.
28. Guoliang Ji, Kang Liu, Shizhu He, and Jun Zhao. Distant supervision for relation extraction with sentence-level attention and entity descriptions. In Proceedings of AAAI, pages 3060–3066, 2017.

29. Rie Johnson and Tong Zhang. Effective use of word order for text categorization with convo- lutional neural networks. In Proceedings of ACL-HLT, 2015. 30. Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural network for modelling sentences. In Proceedings of ACL, 2014. 31. Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of EMNLP, 2014. 32. Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. Recurrent convolutional neural networks for text classification. In Proceedings of AAAI, 2015. 33. Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. Zero-shot relation extraction via reading comprehension. In Proceedings of CoNLL, 2017. 34. Jiao Li, Yueping Sun, Robin J. Johnson, Daniela Sciaky, Chih-Hsuan Wei, Robert Leaman, Allan Peter Davis, Carolyn J. Mattingly, Thomas C. Wiegers, and Zhiyong Lu. BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database, pages 1–10, 2016. 35. Jiwei Li, Minh-Thang Luong, Dan Jurafsky, and Eduard Hovy. When are tree structures nec- essary for deep learning of representations? In Proceedings of EMNLP, 2015. 36. Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. Neural relation extrac- tion with selective attention over instances. In Proceedings of ACL, 2016. 37. Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. Recurrent neural network for text classification with multi-task learning. In Proceedings of IJCAI, 2016. 38. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and VeselinStoyanov. Roberta: A robustly optimized pretraining approach. arXiv preprint arXiv:1907.11692, 2019. 39. Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visi- olinguistic representations for vision-and-language tasks. In Proceedings of NeurIPS, 2019. 40. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In Proceedings of ICLR, 2013. 41. Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernock`y, and Sanjeev Khudanpur. Recur- rent neural network based language model. In Proceedings of InterSpeech, 2010. 42. Makoto Miwa and Mohit Bansal. End-to-end relation extraction using lstms on sequences and tree structures. In Proceedings of ACL, 2016. 43. Andriy Mnih and Yee Whye Teh. A fast and simple algorithm for training neural probabilistic language models. In Proceedings of ICML, 2012. 44. Frederic Morin and Yoshua Bengio. Hierarchical probabilistic neural network language model. In Proceedings of Aistats, 2005. 45. Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, and Wen-tau Yih. Cross- sentence n-ary relation extraction with graph LSTMs. Transactions of the Association for Computational Linguistics, 5:101–115, 2017. 46. Matthew E Peters, Mark Neumann, Robert Logan, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A Smith. Knowledge enhanced contextual word representations. In Proceedings of EMNLP-IJCNLP, 2019. 47. Ngoc-Quan Pham, German Kruszewski, and Gemma Boleda. Convolutional neural network language models. In Proceedings of EMNLP, 2016. 48. Matt Post and Shane Bergsma. Explicit and implicit syntactic features for text classification. In Proceedings of ACL, 2013. 49. Chris Quirk and Hoifung Poon. Distant supervision for relation extraction beyond the sentence boundary. In Proceedings of EACL, 2017. 50. Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 
Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai- assets/researchcovers/languageunsupervised/languageunderstandingpaper.pdf, 2018. 51. Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M Marlin. Relation extraction with matrix factorization and universal schemas. In Proceedings of NAACL-HLT, pages 74–84, 2013. 88 4 Sentence Representation

52. Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE Transac- tions on Signal Processing, 45(11):2673–2681, 1997. 53. Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. Matching the Blanks: Distributional similarity for relation learning. In Proceedings of ACL, pages 2895– 2905, 2019. 54. Richard Socher, Brody Huval, Christopher D Manning, and Andrew Y Ng. Semantic compo- sitionality through recursive matrix-vector spaces. In Proceedings of EMNLP, 2012. 55. Richard Socher, Jeffrey Pennington, Eric H Huang, Andrew Y Ng, and Christopher D Manning. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of EMNLP, 2011. 56. Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a senti- ment . In Proceedings of EMNLP, 2013. 57. Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. Mass: Masked sequence to sequence pre-training for language generation. In Proceedings of ICML, 2019. 58. Daniel Soutner, Zdenˇek Loose, Ludˇek Müller, and Aleš Pražák. Neural network language model with cache. In Proceedings of ICTSD, 2012. 59. Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. Videobert: A joint model for video and language representation learning. In Proceedings of ICCV, 2019. 60. Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. Patient knowledge distillation for bert model compression. In Proceedings of EMNLP-IJCNLP, page 4314–4323, 2019. 61. Kumutha Swampillai and Mark Stevenson. Inter-sentential relations in information extraction corpora. In Proceedings of LREC, 2010. 62. Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved semantic representa- tions from tree-structured long short-term memory networks. In Proceedings of ACL, 2015. 63. Duyu Tang, Bing Qin, and Ting Liu. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of EMNLP, 2015. 64. Wilson L Taylor. “cloze procedure": A new tool for measuring readability. Journalism Bulletin, 30(4):415–433, 1953. 65. Ashish Vaswani, Noam Shazeer, Niki Parmar, Llion Jones, Jakob Uszkoreit, Aidan N Gomez, and Lukasz Kaiser. Attention is all you need. In Proceedings of NeurIPS, 2017. 66. Patrick Verga, Emma Strubell, and Andrew McCallum. Simultaneously self-attending to all mentions for full-abstract biological relation extraction. In Proceedings of NAACL-HLT, 2018. 67. Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. Knowledge graph and text jointly embedding. In Proceedings of EMNLP, pages 1591–1601, 2014. 68. Zhigang Wang and Juan-Zi Li. Text-enhanced representation learning for knowledge graph. In Proceedings of IJCAI, pages 1293–1299, 2016. 69. Kun Xu, Yansong Feng, Songfang Huang, and Dongyan Zhao. Semantic relation classification via convolutional neural networks with simple negative sampling. In Proceedings of EMNLP, 2015. 70. Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. Xlnet: Generalized autoregressive pretraining for language understanding. In Proceedings of NeurIPS, 2019. 71. Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. Hierarchical attention networks for document classification. In Proceedings of NAACL, 2016. 72. Yuan Yao, Deming Ye, Peng Li, Xu Han, Yankai Lin, Zhenghao Liu, Zhiyuan Liu, Lixin Huang, Jie Zhou, and Maosong Sun. 
DocRED: A large-scale document-level relation extraction dataset. In Proceedings of ACL, 2019. 73. Zhi-Xiu Ye and Zhen-Hua Ling. Multi-level matching and aggregation network for few-shot relation classification. In Proceedings of ACL, pages 2872–2881, 2019. 74. Wenpeng Yin and Hinrich Schütze. Multichannel variable-size convolution for sentence clas- sification. In Proceedings of CoNLL, 2015. 75. Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. Distant supervision for relation extraction via piecewise convolutional neural networks. In Proceedings of EMNLP, 2015. References 89

76. Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. Relation classification via convolutional deep neural network. In Proceedings of COLING, 2014. 77. Wenyuan Zeng, Yankai Lin, Zhiyuan Liu, and Maosong Sun. Incorporating relation paths in neural relation extraction. In Proceedings of EMNLP, 2017. 78. Dongxu Zhang and Dong Wang. Relation classification via recurrent neural network. arXiv preprint arXiv:1508.01006, 2015. 79. Ye Zhang, Iain Marshall, and Byron C Wallace. Rationale-augmented convolutional neural networks for text classification. In Proceedings of EMNLP, 2016. 80. Ye Zhang, Stephen Roller, and Byron C Wallace. Mgnc-cnn: A simple approach to exploiting multiple word embeddings for sentence classification. In Proceedings of NAACL, 2016. 81. Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. Ernie: Enhanced language representation with informative entities. In Proceedings of ACL, 2019.

Chapter 5 Document Representation

Abstract A document is usually the highest linguistic unit of natural language. Document representation aims to encode the semantic information of the whole document into a real-valued representation vector, which could be further utilized in downstream tasks. Recently, document representation has become an essential task in natural language processing and has been widely used in many document-level real-world applications such as information retrieval and question answering. In this chapter, we first introduce the one-hot representation for documents. Next, we extensively introduce topic models that learn the topic distribution of words and documents. Further, we give an introduction to distributed document representation, including paragraph vector and neural document representations. Finally, we introduce several typical real-world applications of document representation, including information retrieval and question answering.

5.1 Introduction

Advances in information and communication technologies offer ubiquitous access to vast amounts of information and are causing an exponential increase in the number of documents available online. While more and more textual information is available electronically, effective retrieval and mining become increasingly difficult without efficient organization, summarization, and indexing of document content. Therefore, document representation plays an important role in many real-world applications, e.g., document retrieval, web search, and spam filtering. Document representation aims to encode a document as a fixed-length vector that describes its contents, so as to reduce the complexity of documents and make them easier to handle. Traditional document representation models such as the one-hot document representation have achieved promising results in many document classification and clustering tasks due to their simplicity, efficiency, and often surprising accuracy.

However, the one-hot document representation model has many disadvantages. First, it loses the word order, and thus different documents can have the same representation as long as the same words are used. Second, it usually suffers from data sparsity

and high dimensionality. The one-hot document representation model also has very little sense of the semantics of the words or, more formally, the distances between the words. Hence, an alternative approach represents text documents with multi-word terms as vector components, where the terms are noun phrases extracted using a combination of linguistic and statistical criteria. This representation is motivated by the intuition behind topic models that terms should carry more semantic information than individual words. Another advantage of using terms to represent a document is its lower dimensionality compared with the traditional one-hot document representation.

Nevertheless, applying these representations to generation tasks remains difficult. To understand how discourse units are connected, one has to understand the communicative function of each unit, and the role it plays within the context that encapsulates it, recursively all the way up for the entire text. Identifying increasingly sophisticated human-developed features may be insufficient for capturing these patterns, but developing representation-based alternatives has also been difficult. Although document representation can capture aspects of coherent sentence structure, it is not clear how it could help in generating more broadly cohesive text.

Recently, neural network models have shown compelling results in generating meaningful and grammatical documents in sequence generation tasks like machine translation or parsing. This is partially attributed to the ability of these systems to capture local compositionality: the way neighboring words are combined semantically and syntactically to form the meanings they wish to express. Based on neural network models, many research works have developed a variety of ways to incorporate document-level contextual information. These models are hybrid architectures in that they are recurrent at the sentence level but use a different structure to summarize the context outside the sentence. Furthermore, some models explore multilevel recurrent architectures for combining local and global information in language modeling.

In this chapter, we first introduce the one-hot representation for documents. Next, we extensively introduce topic models that aim to learn latent topic distributions of words and documents. Further, we give an introduction to distributed document representation, including paragraph vector and neural document representations. Finally, we introduce several typical real-world applications of document representation, including information retrieval and question answering.

5.2 One-Hot Document Representation

The majority of machine learning algorithms take a fixed-length vector as input, so documents need to be represented as vectors. The bag-of-words model is the most common and simplest representation method for documents. Similar to one-hot sentence representation, for a document d = {w_1, w_2, ..., w_l}, a bag-of-words representation d can be used to represent this document. Specifically, given a vocabulary V = [w_1, w_2, ..., w_{|V|}], the one-hot representation of word w is w = [0, 0, ..., 1, ..., 0]. Based on the one-hot word representation and the vocabulary V, a document can be represented as

d = \sum_{i=1}^{l} w_i,    (5.1)

where l is the length of the document d. Similar to one-hot sentence representation, the TF-IDF method can also be applied to enhance the ability of the bag-of-words representation to reflect how important a word is to a document in a corpus. In practice, the bag-of-words representation is mainly used as a tool for feature generation, and the most common feature derived from it is the frequency of each word in the document. This method is simple but efficient and sometimes reaches excellent performance in many real-world applications. However, the bag-of-words representation still entirely ignores word order information, which means different documents can have the same representation as long as they use the same words. Furthermore, the bag-of-words representation has little sense of the semantics of the words or, more formally, the distances between words, which means it cannot utilize the rich information hidden in word representations.
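To make Eq. 5.1 concrete, the following Python sketch builds bag-of-words vectors by summing one-hot word counts and then applies a simple TF-IDF reweighting. The function and variable names are illustrative and not taken from any particular library.

```python
import numpy as np

def bow_vectors(docs, vocab):
    """Bag-of-words document vectors: the sum of one-hot word vectors (Eq. 5.1)."""
    word2id = {w: i for i, w in enumerate(vocab)}
    D = np.zeros((len(docs), len(vocab)))
    for i, doc in enumerate(docs):
        for w in doc:
            if w in word2id:
                D[i, word2id[w]] += 1.0   # adding a one-hot vector = incrementing one count
    return D

def tfidf(D):
    """Reweight raw term frequencies by inverse document frequency."""
    df = np.count_nonzero(D > 0, axis=0)            # number of documents containing each word
    idf = np.log(len(D) / np.maximum(df, 1))
    return D * idf

docs = [["mary", "was", "hungry"], ["mary", "found", "no", "food"]]
vocab = sorted({w for d in docs for w in d})
print(tfidf(bow_vectors(docs, vocab)))
```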

5.3 Topic Model

As our collective knowledge continues to be digitized and stored in the form of news, blogs, web pages, scientific articles, books, images, audio, videos, and social networks, it becomes more difficult to find and discover what we are looking for. We need new computational tools to help organize, search, and understand these vast amounts of information.

Right now, we work with online information using two main tools: search and links. We type keywords into a search engine and find a set of documents related to them. We look at the documents in that set, possibly navigating to other linked documents. This is a powerful way of interacting with our online archive, but something is missing: the ability to search and explore documents based on the themes that run through them. We might “zoom in” and “zoom out” to find specific or broader themes; we might look at how those themes changed through time or how they are connected. Rather than finding documents through keyword search alone, we might first find the theme that we are interested in, and then examine the documents related to that theme.

For example, consider using themes to explore the complete history of the New York Times. At a broad level, some of the themes might correspond to the sections of the newspaper, such as foreign policy, national affairs, and sports. We could zoom in on a theme of interest, such as foreign policy, to reveal various aspects of it, such as Chinese foreign policy, the conflict in the Middle East, and the United States’ relationship with Russia. We could then navigate through time to reveal how these specific themes have changed, tracking, for example, the changes in the conflict in the Middle East over the last 50 years. And, in all of this exploration, we would be pointed to the original articles relevant to the themes. The thematic structure would be a new kind of window through which to explore and digest the collection.

But we do not interact with electronic archives in this way. While more and more texts are available online, we do not have the human power to read and study them to provide the kind of browsing experience described above. To this end, machine learning researchers have developed probabilistic topic modeling, a suite of algorithms that aim to discover and annotate vast archives of documents with thematic information. Topic modeling algorithms are statistical methods that analyze the words of the original texts to explore the themes that run through them, how those themes are connected, and how they change over time. Topic modeling algorithms do not require any prior annotations or labeling of the documents. The topics emerge from the analysis of the original texts. Topic modeling enables us to organize and summarize electronic archives at a scale that would be impossible by human annotation.

5.3.1 Latent Dirichlet Allocation

A variety of probabilistic topic models have been used to analyze the content of documents and the meaning of words. Hofmann first introduced the probabilistic topic approach to document modeling in his Probabilistic Latent Semantic Indexing method (pLSI). The pLSI model does not make any assumptions about how the mixture weights are generated, making it difficult to test the generalization ability of the model to new documents. Latent Dirichlet Allocation (LDA) extends this model by introducing a Dirichlet prior over the per-document topic proportions, and is regarded as a simple but efficient topic model.

We first describe the basic ideas of LDA [6]. The intuition behind LDA is that documents exhibit multiple topics. LDA is a statistical model of document collections that tries to capture this intuition. It is most easily described by its generative process, the imaginary random process by which the model assumes the documents arose. We formally define a topic to be a distribution over a fixed vocabulary. We assume that these topics are specified before any data has been generated. Now for each document in the collection, we generate the words in a two-stage process (a small simulation of this process is sketched after the list below).

1. Randomly choose a distribution over topics.
2. For each word in the document,
   a. Randomly choose a topic from the distribution over topics in step #1.
   b. Randomly choose a word from the corresponding distribution over the vocabulary.
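The following NumPy sketch simulates this generative story; the hyperparameters alpha and eta, as well as all dimensions, are illustrative assumptions rather than values prescribed by LDA itself.

```python
import numpy as np

def generate_corpus(D=5, K=3, V=20, doc_len=10, alpha=0.1, eta=0.01, seed=0):
    """Simulate LDA's two-stage generative process for D documents."""
    rng = np.random.default_rng(seed)
    beta = rng.dirichlet(np.full(V, eta), size=K)    # K topics: distributions over the vocabulary
    docs = []
    for _ in range(D):
        theta = rng.dirichlet(np.full(K, alpha))     # step 1: per-document distribution over topics
        words = []
        for _ in range(doc_len):
            z = rng.choice(K, p=theta)               # step 2a: choose a topic
            w = rng.choice(V, p=beta[z])             # step 2b: choose a word from that topic
            words.append(w)
        docs.append(words)
    return docs, beta

docs, beta = generate_corpus()
```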

This statistical model reflects the intuition that documents exhibit multiple topics. Each document exhibits the topics with different proportions (step #1); each word in each document is drawn from one of the topics (step #2b), where the selected topic is chosen from the per-document distribution over topics (step #2a). We emphasize that the algorithms have no information about these subjects and the articles are not labeled with topics or keywords. The interpretable topic distributions arise by computing the hidden structure that likely generated the observed collection of documents.

5.3.1.1 LDA and Probabilistic Models

LDA and other topic models are part of the broader field of probabilistic modeling. In generative probabilistic modeling, we treat our data as arising from a generative process that includes hidden variables. This generative process defines a joint probability distribution over both the observed and hidden random variables. Given the observed variables, we perform data analysis by using that joint distribution to compute the conditional distribution of the hidden variables. This conditional distribution is also called the posterior distribution.

LDA falls precisely into this framework. The observed variables are the words of the documents, the hidden variables are the topic structure, and the generative process is as described above. The computational problem of inferring the hidden topic structure from the documents is the problem of computing the posterior distribution, the conditional distribution of the hidden variables given the documents.

We can describe LDA more formally with the following notation. The topics are β_{1:K}, where each β_k is a distribution over the vocabulary. The topic proportions for the dth document are θ_d, where θ_{dk} is the topic proportion for topic k in document d. The topic assignments for the dth document are z_d, where z_{d,n} is the topic assignment for the nth word in document d. Finally, the observed words for document d are w_d, where w_{d,n} is the nth word in document d, which is an element from the fixed vocabulary. With this notation, the generative process for LDA corresponds to the following joint distribution of the hidden and observed variables:

P(β_{1:K}, θ_{1:D}, z_{1:D}, w_{1:D}) = \prod_{i=1}^{K} P(β_i) \prod_{d=1}^{D} \Big( P(θ_d) \prod_{n=1}^{N} P(z_{d,n} | θ_d) P(w_{d,n} | β_{1:K}, z_{d,n}) \Big).    (5.2)

Notice that this distribution specifies a number of dependencies. For example, the topic assignment z_{d,n} depends on the per-document topic proportions θ_d. As another example, the observed word w_{d,n} depends on the topic assignment z_{d,n} and all of the topics β_{1:K}.

These dependencies define LDA. They are encoded in the statistical assumptions behind the generative process, in the particular mathematical form of the joint distribution, and in a third way, in the probabilistic graphical model for LDA. Probabilistic graphical models provide a graphical language for describing families of probability distributions.


Fig. 5.1 The architecture of the graphical model for Latent Dirichlet Allocation

The graphical model for LDA is shown in Fig. 5.1. Each node is a random variable and is labeled according to its role in the generative process. The hidden nodes (the topic proportions, assignments, and topics) are unshaded. The observed nodes, the words of the documents, are shaded. We use rectangles as plate notation to denote replication. The N plate denotes the collection of words within a document; the D plate denotes the collection of documents within the collection. These three representations are equivalent ways of describing the probabilistic assumptions behind LDA.

5.3.1.2 Posterior Computation for LDA

We now turn to the computational problem, computing the conditional distribution of the topic structure given the observed documents. (As we mentioned above, this is called the posterior.) Using our notation, the posterior is

P(β_{1:K}, θ_{1:D}, z_{1:D} | w_{1:D}) = \frac{P(β_{1:K}, θ_{1:D}, z_{1:D}, w_{1:D})}{P(w_{1:D})}.    (5.3)

The numerator is the joint distribution of all the random variables, which can be easily computed for any setting of the hidden variables. The denominator is the marginal probability of the observations, which is the probability of seeing the observed corpus under any topic model. In theory, it can be computed by summing the joint distribution over every possible instantiation of the hidden topic structure.

Topic modeling algorithms approximate Eq. 5.3 by forming an alternative distribution over the latent topic structure that is adapted to be close to the true posterior. Topic modeling algorithms generally fall into two categories: sampling-based algorithms and variational algorithms.

Sampling-based algorithms attempt to collect samples from the posterior by approximating it with an empirical distribution. The most commonly used sampling algorithm for topic modeling is Gibbs sampling, where we construct a Markov chain (a sequence of random variables, each dependent on the previous one) whose limiting distribution is the posterior. The Markov chain is defined on the hidden topic variables for a particular corpus, and the algorithm is to run the chain for a long time, collect samples from the limiting distribution, and then approximate the distribution with the collected samples.
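As a concrete illustration (a minimal sketch rather than the implementation used in any particular system), a collapsed Gibbs sampler for LDA fits in a few dozen lines of Python. The per-token conditional below has the same form as Eq. 5.9 later in this chapter, except that here the counts are updated immediately after every token.

```python
import numpy as np

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of lists of word ids in [0, V)."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    Cd = np.zeros((D, K))                            # document-topic counts
    Cw = np.zeros((K, V))                            # topic-word counts
    Ck = np.zeros(K)                                 # topic totals
    z = [rng.integers(K, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):                   # initialize counts from random assignments
        for n, w in enumerate(doc):
            k = z[d][n]
            Cd[d, k] += 1; Cw[k, w] += 1; Ck[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]
                Cd[d, k] -= 1; Cw[k, w] -= 1; Ck[k] -= 1      # remove the current assignment
                p = (Cd[d] + alpha) * (Cw[:, w] + beta) / (Ck + V * beta)
                k = rng.choice(K, p=p / p.sum())              # resample from the conditional
                z[d][n] = k
                Cd[d, k] += 1; Cw[k, w] += 1; Ck[k] += 1      # instant count update
    return Cd, Cw
```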

Variational methods are a deterministic alternative to sampling-based algorithms. Rather than approximating the posterior with samples, variational methods posit a parameterized family of distributions over the hidden structure and then find the member of that family that is closest to the posterior. Thus, the inference problem is transformed into an optimization problem. Variational methods open the door for innovations in optimization to have a practical impact on probabilistic modeling.

5.3.2 Extensions

The simple LDA model provides a powerful tool for discovering and exploiting the hidden thematic structure in large archives of text. Moreover, one of the main advantages of formulating LDA as a probabilistic model is that it can easily be used as a module in more complicated models for more complex goals. Since its introduction, LDA has been extended and adapted in many ways.

5.3.2.1 Relaxing the Assumptions of LDA

LDA is defined by the statistical assumptions it makes about the corpus. One active area of topic modeling research is how to relax and extend these assumptions to uncover a more sophisticated structure in the texts.

One assumption that LDA makes is the bag-of-words assumption that the order of the words in the document does not matter. While this assumption is unrealistic, it is reasonable if our only goal is to uncover the coarse semantic structure of the texts. For more sophisticated goals, such as language generation, it is patently not appropriate. There have been many extensions to LDA that treat words as non-exchangeable. For example, [59] develops a topic model that relaxes the bag-of-words assumption by assuming that the topics generate words conditional on the previous word; [22] develops a topic model that switches between LDA and a standard HMM. These models expand the parameter space significantly but show improved language modeling performance.

Another assumption is that the order of documents does not matter. Again, this can be seen by noticing that Eq. 5.3 remains invariant to permutations of the ordering of documents in the collection. This assumption may be unrealistic when analyzing long-running collections that span years or centuries. In such collections, we may want to assume that the topics change over time. One approach to this problem is the dynamic topic model [5], a model that respects the ordering of the documents and gives a richer posterior topical structure than LDA.

The third assumption about LDA is that the number of topics is assumed known and fixed. The Bayesian nonparametric topic model provides an elegant solution: the collection determines the number of topics during posterior inference, and new documents can exhibit previously unseen topics. Bayesian nonparametric topic models have been extended to hierarchies of topics, which find a tree of topics, moving from more general to more concrete, whose particular structure is inferred from the data [4].

5.3.2.2 Incorporating Meta-Data to LDA

In many text analysis settings, the documents contain additional information, such as author, title, geographic location, and links, that we might want to account for when fitting a topic model. There has been a flurry of research on adapting topic models to include meta-data.

The author-topic model [51] is an early success story for this kind of research. The topic proportions are attached to authors; in papers with multiple authors, each word is assumed to be attached to an author and drawn from a topic drawn from his or her topic proportions. The author-topic model allows for inferences about authors as well as documents.

Many document collections are linked. For example, scientific papers are linked by citations, or web pages are connected by hyperlinks. And several topic models have been developed to account for those links when estimating the topics. The relational topic model of [9] assumes that each document is modeled as in LDA and that the links between documents depend on the distance between their topic proportions. This is both a new topic model and a new network model. Unlike traditional statistical models of networks, the relational topic model takes into account node attributes in modeling the links.

Other work that incorporates meta-data into topic models includes models of linguistic structure [8], models that account for distances between corpora [60], and models of named entities [42]. General-purpose methods for incorporating meta-data into topic models include Dirichlet-multinomial regression models [39] and supervised topic models [37].

5.3.2.3 Acceleration

In existing fast algorithms, it is difficult to decouple the accesses to the document-topic counts C_d and the word-topic counts C_w because both counts need to be updated instantly after the sampling of every token. Many algorithms have been proposed to accelerate LDA based on the Collapsed Gibbs Sampling (CGS) equation. WarpLDA [13] is built on a new Monte Carlo Expectation Maximization (MCEM) algorithm, which is similar to CGS, except that both counts are fixed until the sampling of all tokens is finished. This scheme can be used to develop a reordering strategy to decouple the accesses to C_d and C_w and minimize the size of randomly accessed memory. Specifically, WarpLDA seeks a MAP solution of the latent variables Θ and Φ, with the latent topic assignments Z integrated out, i.e., it maximizes log P(Θ, Φ | W, α, β), where α and β are the Dirichlet hyperparameters. Reference [2] has shown that this MAP solution is almost identical to the solution of CGS with proper hyperparameters.

Computing log P(Θ, Φ | W, α, β) directly is expensive because it needs to enumerate all the K possible topic assignments for each token. We therefore optimize its lower bound as a surrogate. Let Q(Z) be a variational distribution. Then, by Jensen's inequality, we obtain the lower bound J(Θ, Φ, Q(Z)):

\log P(Θ, Φ | W, α, β) ≥ E_Q[\log P(W, Z | Θ, Φ) − \log Q(Z)] + \log P(Θ | α) + \log P(Φ | β) ≜ J(Θ, Φ, Q(Z)).    (5.4)

An Expectation Maximization (EM) algorithm is implemented to find a local maximum of the posterior P(Θ, Φ | W, α, β), where the E-step maximizes J with respect to the variational distribution Q(Z) and the M-step maximizes J with respect to the model parameters (Θ, Φ) while keeping Q(Z) fixed. One can prove that the optimal solution at the E-step is Q(Z) = P(Z | W, Θ, Φ) without further assumptions on Q. We apply a Monte Carlo approximation to the expectation in Eq. 5.4,

E_Q[\log P(W, Z | Θ, Φ) − \log Q(Z)] ≈ \frac{1}{S} \sum_{s=1}^{S} \log P(W, Z^{(s)} | Θ, Φ) − \log Q(Z^{(s)}),    (5.5)

where Z^{(1)}, ..., Z^{(S)} ∼ Q(Z) = P(Z | W, Θ, Φ). The sample size is set as S = 1 and the model uses Z as an abbreviation of Z^{(1)}.

Sampling Z: Each dimension of Z can be sampled independently:

Q(z_{d,n} = k) ∝ P(W, Z | Θ, Φ) ∝ θ_{dk} φ_{w_{d,n},k}.    (5.6)

Optimizing Θ,Φ: With the Monte Carlo approximation, we have

J ≈ \log P(W, Z | Θ, Φ) + \log P(Θ | α) + \log P(Φ | β) + const.
  = \sum_{d,k} (C_{dk} + α_k − 1) \log θ_{dk} + \sum_{k,w} (C_{kw} + β − 1) \log φ_{kw} + const.,    (5.7)

and with the optimal solutions, we have

\hat{θ}_{dk} ∝ C_{dk} + α_k − 1, \qquad \hat{φ}_{wk} = \frac{C_{wk} + β − 1}{\bar{C}_k + \bar{β} − V}.    (5.8)

Instead of computing and storing \hat{Θ} and \hat{Φ}, we compute and store C_d and C_w to save memory because the latter are sparse. Plugging Eq. 5.8 into Eq. 5.6, and redefining α as α − 1 and β as β − 1, we get the full MCEM algorithm, which iteratively performs the following two steps until a given number of iterations is reached:

• E-step: We can sample z_{d,n} ∼ Q(z_{d,n} = k) according to

  Q(z_{d,n} = k) ∝ (C_{dk} + α_k) \frac{C_{w_{d,n}k} + β}{\bar{C}_k + \bar{β}}.    (5.9)

• M-step: Compute C_d and C_w from Z.

Note that the resemblance between Eq. 5.9 and the CGS sampling formula intuitively justifies why MCEM leads to results similar to CGS. The difference between MCEM and CGS is that MCEM updates the counts C_d and C_w after sampling all z_{d,n}, while CGS updates the counts instantly after sampling each z_{d,n}. The strategy of updating the counts only after sampling all z_{d,n} is called delayed count update, or simply delayed update. MCEM can be viewed as a CGS with delayed update, which has been widely used in other algorithms [1, 41]. While previous work uses the delayed update as a trick, WarpLDA provides a theoretical guarantee of convergence to a MAP solution. The delayed update is essential for decoupling the accesses to C_d and C_w to improve cache locality, without affecting correctness.
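A minimal sketch of the delayed-update scheme is shown below. It reuses the count notation from above and is only meant to contrast with the instant updates of the Gibbs sampler sketched earlier; it does not reproduce WarpLDA's cache-efficient reordering of memory accesses.

```python
import numpy as np

def mcem_lda(docs, V, K, alpha=0.1, beta=0.01, iters=100, seed=0):
    """MCEM for LDA with delayed count updates (a simplified sketch)."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    z = [rng.integers(K, size=len(doc)) for doc in docs]
    for _ in range(iters):
        # M-step: rebuild the counts C_d, C_w, and the topic totals from the current Z
        Cd = np.zeros((D, K)); Cw = np.zeros((K, V)); Ck = np.zeros(K)
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]
                Cd[d, k] += 1; Cw[k, w] += 1; Ck[k] += 1
        # E-step: resample every z_{d,n} from Eq. 5.9 while the counts stay frozen
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                p = (Cd[d] + alpha) * (Cw[:, w] + beta) / (Ck + V * beta)
                z[d][n] = rng.choice(K, p=p / p.sum())
    return z
```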

5.4 Distributed Document Representation

To address the disadvantages of bag-of-words document representation, [31] proposes paragraph vector models, including the version with Distributed Memory (PV-DM) and the version with Distributed Bag-of-Words (PV-DBOW). Moreover, researchers also proposed several hierarchical neural network models to represent documents. In this section, we will introduce these models in detail.

5.4.1 Paragraph Vector

As shown in Fig. 5.2, the paragraph vector model maps every paragraph to a unique vector, represented by a column in the matrix P, and maps every word to a unique vector, represented by a column in the word embedding matrix E. The paragraph vector and word vectors are averaged or concatenated to predict the next word in a context. More formally, compared to the word vector framework, the only change in this model is in the following equation, where h is constructed from E and P.

y = Softmax(h(wt−k ,...,wt+k ; E, P)), (5.10) where h is constructed by the concatenation or average of word vectors extracted from E and P. 5.4 Distributed Document Representation 101


Fig. 5.2 The architecture of PV-DM model

The other part of this model is that, given a sequence of training words w_1, w_2, w_3, …, w_l, the objective of the paragraph vector model is to maximize the average log probability:

O = \frac{1}{l} \sum_{i=k}^{l−k} \log P(w_i | w_{i−k}, \ldots, w_{i+k}).    (5.11)

And the prediction task is typically done via a multi-class classifier, such as softmax. Thus, the probability equation is

P(w_i | w_{i−k}, \ldots, w_{i+k}) = \frac{e^{y_{w_i}}}{\sum_j e^{y_j}}.    (5.12)

The paragraph token can be thought of as another word. It acts as a memory that remembers what is missing from the current context, or the topic of the paragraph. For this reason, this model is often called the Distributed Memory Model of Paragraph Vectors (PV-DM).

The above method considers the concatenation of the paragraph vector with the word vectors to predict the next word in a text window. Another way is to ignore the context words in the input, but force the model to predict words randomly sampled from the paragraph in the output. In reality, what this means is that at each iteration of stochastic gradient descent, we sample a text window, then sample a random word from the text window and form a classification task given the paragraph vector. This technique is shown in Fig. 5.3. This version is named the Distributed Bag-of-Words version of Paragraph Vector (PV-DBOW), as opposed to the Distributed Memory version of Paragraph Vector (PV-DM) described above.


Fig. 5.3 The architecture of PV-DBOW model

In addition to being conceptually simple, this model requires storing less data: only the softmax weights need to be stored, as opposed to both the softmax weights and the word vectors in the previous model. This model is also analogous to the Skip-gram model for word vectors.
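In practice, both variants are available in off-the-shelf toolkits. The sketch below uses the gensim library's Doc2Vec implementation (the toy corpus is ours, and parameter names may differ slightly across gensim versions); dm=1 selects PV-DM and dm=0 selects PV-DBOW.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = ["mary was hungry", "she did not find any food", "the restaurant was closed"]
# Each document is tagged with an id, which indexes a column of the paragraph matrix P.
docs = [TaggedDocument(words=text.split(), tags=[i]) for i, text in enumerate(corpus)]

pv_dm = Doc2Vec(docs, vector_size=50, window=2, min_count=1, dm=1, epochs=40)    # PV-DM
pv_dbow = Doc2Vec(docs, vector_size=50, window=2, min_count=1, dm=0, epochs=40)  # PV-DBOW

print(pv_dm.dv[0])                                          # paragraph vector of document 0
print(pv_dbow.infer_vector("mary did not eat".split()))     # vector inferred for a new document
```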

5.4.2 Neural Document Representation

In this part, we introduce two main kinds of neural networks for document representation: the document-context language model and the hierarchical document autoencoder.

5.4.2.1 Document-Context Language Model

Recurrent architectures can be used to combine local and global information in document language modeling. The simplest such model would be to train a single RNN, ignoring sentence boundaries as mentioned above; the last hidden state from the previous sentence t − 1 is used to initialize the first hidden state in sentence t. In such an architecture, the length of the RNN is equal to the number of tokens in the document; in typical genres such as news texts, this means training RNNs on sequences of several hundred tokens, which introduces two problems: (1) Information decay: in a sentence with thirty tokens (not unusual in news text), the contextual information from the previous sentence must be propagated through the recurrent dynamics thirty times before it can reach the last token of the current sentence. Meaningful document-level information is unlikely to survive such a long pipeline. (2) Learning: it is notoriously difficult to train recurrent architectures that involve many time steps.

In the case of an RNN trained on an entire document, back-propagation would have to run over hundreds of steps, posing severe numerical challenges. To address these two issues, [28] proposes multilevel recurrent structures to represent documents, thereby efficiently leveraging document-level context in language modeling. They first propose the Context-to-Context Document-Context Language Model (ccDCLM), which assumes that contextual information from previous sentences needs to be able to "short-circuit" the standard RNN, so as to more directly impact the generation of words across longer spans of text. Formally, we have

c_{t−1} = h_{t−1,l},    (5.13)

where l is the length of sentence t − 1. The ccDCLM model then creates additional paths for this information to impact each hidden representation in the current sentence t. Writing w_{t,n} for the word representation of the nth word in the tth sentence, we have

h_{t,n} = g_θ(h_{t,n−1}, f(w_{t,n}, c_{t−1})),    (5.14)

where g_θ(·) is the activation function parameterized by θ and f(·) is a function that combines the context vector with the input word representation w_{t,n} for the hidden state. Here we simply concatenate the representations,

f(w_{t,n}, c_{t−1}) = [w_{t,n}; c_{t−1}].    (5.15)

The emission probability for yt,n is then computed from ht,n as in the standard RNNLM. The underlying assumption of this model is that contextual information should impact the generation of each word in the current sentence. The model, therefore, introduces computational “short-circuits” for cross-sentence information, as illustrated in Fig. 5.4.


Fig. 5.4 The architecture of the ccDCLM model


Fig. 5.5 The architecture of coDCLM model

Besides, they also propose the Context-to-Output Document-Context Language Model (coDCLM). Rather than incorporating the document context into the recurrent definition of the hidden state, the coDCLM model pushes it directly to the output, as illustrated in Fig. 5.5. Let h_{t,n} be the hidden state from a conventional RNNLM of sentence t,

h_{t,n} = g_θ(h_{t,n−1}, x_{t,n}).    (5.16)

Then, the context vector ct−1 is directly used in the output layer as

y_{t,n} ∼ Softmax(W_h h_{t,n} + W_c c_{t−1} + b).    (5.17)
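The two variants can be sketched in PyTorch as follows; a GRU stands in for the LSTM of the original models and all dimensions are illustrative. The first class threads the previous sentence's final hidden state into every input of the current sentence (Eqs. 5.13-5.15), while the second adds the context only at the output layer (Eq. 5.17).

```python
import torch
import torch.nn as nn

class CcDCLM(nn.Module):
    """Context-to-context variant: the previous sentence's context vector is
    concatenated with every word embedding of the current sentence."""
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim + hid_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, sent, c_prev):
        # sent: (batch, len) token ids; c_prev: (batch, hid_dim) context from sentence t-1
        x = self.emb(sent)
        c = c_prev.unsqueeze(1).expand(-1, x.size(1), -1)
        h, h_last = self.rnn(torch.cat([x, c], dim=-1))
        return self.out(h), h_last.squeeze(0)        # next-word logits and the new context c_t

class CoDCLM(nn.Module):
    """Context-to-output variant: a conventional RNN over the sentence, with the
    context vector injected only at the softmax layer."""
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.W_h = nn.Linear(hid_dim, vocab_size)
        self.W_c = nn.Linear(hid_dim, vocab_size, bias=False)

    def forward(self, sent, c_prev):
        h, h_last = self.rnn(self.emb(sent))
        return self.W_h(h) + self.W_c(c_prev).unsqueeze(1), h_last.squeeze(0)
```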

5.4.2.2 Hierarchical Document Autoencoder

Reference [33] also proposes a hierarchical document autoencoder to represent documents. The model draws on the intuition that just as the juxtaposition of words creates the joint meaning of a sentence, the juxtaposition of sentences also creates the joint meaning of a paragraph or a document. They first obtain representation vectors at the sentence level by putting one layer of word-level LSTM (denoted as LSTM^{word}_{encode}) on top of the words each sentence contains:

h^w_t(enc) = LSTM^{word}_{encode}(w_t, h^w_{t−1}(enc)).    (5.18)

The vector output at the ending time step is used to represent the entire sentence as

s = h^w_{end_s}.    (5.19)

To build the representation e_D for the current document/paragraph, another layer of sentence-level LSTM (denoted as LSTM^{sentence}_{encode}) is placed on top of all sentences, computing representations sequentially at each time step:


Fig. 5.6 The architecture of hierarchical document autoencoder

h^s_t(enc) = LSTM^{sentence}_{encode}(s_t, h^s_{t−1}(enc)).    (5.20)

The representation h^s_{end_D} computed at the final time step is used to represent the entire document: e_D = h^s_{end_D}. Thus one LSTM operates at the token level, leading to the acquisition of sentence-level representations that are then used as inputs into a second LSTM that acquires document-level representations, in a hierarchical structure.

As with encoding, the decoding algorithm operates on a hierarchical structure with two layers of LSTMs. LSTM outputs at the sentence level for time step t are obtained by

h^s_t(dec) = LSTM^{sentence}_{decode}(s_t, h^s_{t−1}(dec)).    (5.21)

At the initial time step, h^s_0(dec) = e_D, the end-to-end output from the encoding procedure. h^s_t(dec) is then used as the original input into LSTM^{word}_{decode} for subsequently predicting tokens within sentence t + 1. LSTM^{word}_{decode} predicts tokens at each position sequentially; the embedding of each predicted token is then combined with earlier hidden vectors for the next time-step prediction until the end_s token is predicted. The procedure can be summarized as follows:

h^w_t(dec) = LSTM^{word}_{decode}(w_t, h^w_{t−1}(dec)),    (5.22)

P(w | ·) = Softmax(w, h^w_{t−1}(dec)).    (5.23)
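The encoder half of this architecture can be sketched as follows (PyTorch, illustrative dimensions, batch size of one): a word-level LSTM yields one vector per sentence, and a sentence-level LSTM composes those vectors into the document representation, mirroring Eqs. 5.18-5.20.

```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.word_lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.sent_lstm = nn.LSTM(hid_dim, hid_dim, batch_first=True)

    def forward(self, doc):
        # doc: a list of (1, len_i) token-id tensors, one per sentence
        sent_vecs = []
        for sent in doc:
            _, (h, _) = self.word_lstm(self.emb(sent))
            sent_vecs.append(h[-1])                  # last hidden state as the sentence vector s
        sents = torch.stack(sent_vecs, dim=1)        # (1, num_sentences, hid_dim)
        _, (h_doc, _) = self.sent_lstm(sents)
        return h_doc[-1], sents                      # document vector e_D and sentence states
```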


Fig. 5.7 The architecture of hierarchical document autoencoder with attentions

During decoding, LSTM^{word}_{decode} generates each word token w sequentially and combines it with earlier LSTM-outputted hidden vectors. The LSTM hidden vector computed at the final time step is used to represent the current sentence. This is passed to LSTM^{sentence}_{decode}, combined with h^s_t for the acquisition of h^s_{t+1}, and outputted to the next time step in sentence decoding. For each time step t, LSTM^{sentence}_{decode} has to first decide whether decoding should proceed or come to a full stop: we add an additional token end_D to the vocabulary. Decoding terminates when token end_D is predicted. Details are shown in Fig. 5.6.

Attention models adopt a look-back strategy by linking the current decoding stage with input sentences in an attempt to consider which part of the input is most responsible for the current decoding state (Fig. 5.7). Let H = {h^s_1(enc), h^s_2(enc), ..., h^s_N(enc)} be the collection of sentence-level hidden vectors for each sentence of the input, outputted from LSTM^{sentence}_{encode}. Each element in H contains information about the input sequences with a strong focus on the parts surrounding each specific sentence (time step). During decoding, suppose that e^s_t denotes the sentence-level embedding at the current step and that h^s_{t−1}(dec) denotes the hidden vector outputted from LSTM^{sentence}_{decode} at the previous time step t − 1. Attention models first link the current-step decoding information, i.e., h^s_{t−1}(dec) outputted from LSTM^{sentence}_{decode}, with each of the input sentences i ∈ [1, N], characterized by a strength indicator v_i:

v_i = U^⊤ f(W_1 · h^s_{t−1}(dec) + W_2 · h^s_i(enc)),    (5.24)

where W_1, W_2 ∈ R^{K×K} and U ∈ R^{K×1}. v_i is then normalized:

α_i = \frac{\exp(v_i)}{\sum_j \exp(v_j)}.    (5.25)

The attention vector is then created as a weighted average over all input sentences:

m_t = \sum_{i=1}^{N_D} α_i h^s_i(enc).    (5.26)

5.5 Applications

In this section, we will introduce several applications of document-level analysis based on representation learning.

5.5.1 Neural Information Retrieval

Information retrieval aims to obtain relevant resources from a large-scale collection of information resources. As shown in Fig. 5.8, given the query "Steve Jobs" as input, the search engine (a typical application of information retrieval) provides relevant web pages for users. Traditional information retrieval data consists of search queries and document collections D, and the ground truth is available through explicit human judgments or implicit user behavior data such as click-through rate.

Fig. 5.8 An example of information retrieval

For a given query q and document d, traditional information retrieval models estimate their relevance through lexical matches. Neural information retrieval models focus more on capturing query-document relevance through semantic matches. Both lexical and semantic matches are essential for neural information retrieval. Benefiting from the representational power of neural networks, these models capture more sophisticated matching features and have achieved the state of the art in information retrieval [17].

Current neural ranking models can be categorized into two groups: representation-based and interaction-based [23]. The earlier works mainly focus on representation-based models, which learn good representations of queries and documents and match them in the learned representation space. Interaction-based methods, on the other hand, model query-document matches from the interactions of their terms.

5.5.1.1 Representation-Based Neural Ranking Models

The representation-based methods directly match the query and documents by learning two distributed representations, respectively, and then compute the matching score based on the similarity between them. In recent years, several deep neural models have been explored based on such a Siamese architecture, which can be built with feedforward layers, convolutional neural networks, or recurrent neural networks.

Reference [26] proposes Deep Structured Semantic Models (DSSM), which first hash words into a letter-trigram-based representation and then use a multilayer fully connected neural network to encode a query (or a document) as a vector. The relevance between the query and the document can then be simply calculated with the cosine similarity. Reference [26] trains the model by minimizing the cross-entropy loss on click-through data, where each training sample consists of a query q, a positive document d^+, and a uniformly sampled negative document set D^−:

L_{DSSM}(q, d^+, D^−) = −\log \frac{e^{r·\cos(q, d^+)}}{\sum_{d∈D} e^{r·\cos(q, d)}},    (5.27)

where D = {d^+} ∪ D^−.

Furthermore, CDSSM [54] and ARC-I [25] utilize convolutional neural networks (CNN), while LSTM-RNN [44] adopts a recurrent neural network with Long Short-Term Memory (LSTM) units to better represent a sentence. Reference [53] also comes up with a more sophisticated similarity function by leveraging additional layers of the neural network.
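A minimal PyTorch sketch of the loss in Eq. 5.27 is shown below. It assumes the query and document vectors have already been produced by the encoders (letter-trigram hashing followed by feedforward layers), and gamma plays the role of the smoothing factor r.

```python
import torch
import torch.nn.functional as F

def dssm_loss(q_vec, pos_vec, neg_vecs, gamma=10.0):
    """Negative log-softmax over cosine similarities of one clicked and several sampled documents."""
    docs = torch.cat([pos_vec.unsqueeze(0), neg_vecs], dim=0)       # (1 + |D^-|, dim)
    sims = F.cosine_similarity(q_vec.unsqueeze(0), docs, dim=-1)    # cos(q, d) for every candidate
    return -F.log_softmax(gamma * sims, dim=0)[0]                   # -log P(d^+ | q)

q, d_pos, d_negs = torch.randn(64), torch.randn(64), torch.randn(4, 64)
loss = dssm_loss(q, d_pos, d_negs)
```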

5.5.1.2 Interaction-Based Neural Ranking Models

The interaction-based neural ranking models learn word-level interaction patterns from query-document pairs, as shown in Fig. 5.9. They provide an opportunity to compare different parts of the query with different parts of the document individually and aggregate the partial evidence of relevance.


Fig. 5.9 The architecture of interaction-based neural ranking models

ARC-II [25] and MatchPyramid [45] utilize convolutional neural networks to capture complicated patterns from word-level interactions. The Deep Relevance Matching Model (DRMM) uses pyramid pooling (histogram) to summarize the word-level similarities into ranking models [23]. There are also some works establishing position-dependent interactions for ranking models [27, 46].

The Kernel-based Neural Ranking Model (K-NRM) [66] and its convolutional version Conv-KNRM [17] achieve the state of the art in neural information retrieval. K-NRM first establishes a translation matrix M in which each element M_{ij} is the cosine similarity between the ith word in q and the jth word in d. Then K-NRM utilizes kernels to convert the translation matrix M into ranking features φ(M):

φ(M) = \sum_{i=1}^{n} \log K(M_i),    (5.28)

K(M_i) = {K_1(M_i), \ldots, K_K(M_i)}.    (5.29)

Each RBF kernel K_k calculates how word pair similarities are distributed:

K_k(M_i) = \sum_j \exp\Big(−\frac{(M_{ij} − μ_k)^2}{2σ_k^2}\Big).    (5.30)

Then the relevance of q and d is calculated by a ranking layer:

f(q, d) = \tanh(w · φ(M) + b),    (5.31)

where w and b are trainable parameters.

Reference [66] trains the model by minimizing a pair-wise loss on click-through data:

L = \sum_q \sum_{d^+, d^− ∈ D^{+,−}} \max(0, 1 − f(q, d^+) + f(q, d^−)).    (5.32)

For a given query q, D^{+,−} contains the pair-wise preferences from the ground truth: d^+ and d^− are two documents such that d^+ is more relevant to q than d^−. Conv-KNRM extends K-NRM to model n-gram semantic matches based on a convolutional neural network, which can leverage snippet information.
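A compact sketch of the kernel pooling and ranking layer of Eqs. 5.28-5.31 is given below; the shared kernel width sigma and the kernel means are illustrative choices (K-NRM additionally uses an exact-match kernel with a very small width).

```python
import torch

def knrm_features(M, mus, sigma=0.1):
    """Kernel pooling: M is the (n_q, n_d) translation matrix of cosine similarities."""
    kernels = torch.exp(-(M.unsqueeze(-1) - mus) ** 2 / (2 * sigma ** 2))   # (n_q, n_d, K)
    K_i = kernels.sum(dim=1)                                                # sum over document words
    return torch.log(K_i.clamp(min=1e-10)).sum(dim=0)                       # sum of logs over query words

def knrm_score(phi, w, b):
    """Ranking layer of Eq. 5.31."""
    return torch.tanh(phi @ w + b)

mus = torch.linspace(-0.9, 0.9, 10)                  # kernel means spread over [-1, 1]
M = torch.rand(5, 12) * 2 - 1                        # a toy translation matrix
w, b = torch.randn(10), torch.tensor(0.0)
score = knrm_score(knrm_features(M, mus), w, b)
# The pair-wise loss of Eq. 5.32 is then torch.relu(1 - score_pos + score_neg).
```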

5.5.1.3 Summary

Representation-based models and interaction-based models extract match features from overall and local aspects, respectively. They can also be combined for further improvements [40].

Recently, large-scale knowledge graphs such as DBpedia, Yago, and Freebase have emerged. Knowledge graphs contain human knowledge about real-world entities and provide an opportunity for search systems to better understand queries and documents. The emergence of large-scale knowledge graphs has motivated the development of entity-oriented search, which brings in entities and semantics from the knowledge graphs and has dramatically improved the effectiveness of feature-based search systems.

Entity-oriented search and neural ranking models push the boundary of matching from two different perspectives. Reference [36] incorporates semantics from knowledge graphs, such as entity descriptions and entity types, into neural ranking. This work significantly improves the effectiveness and generalization ability of interaction-based neural ranking models. However, how to fully leverage semi-structured knowledge graphs and establish semantic relevance between queries and documents remains an open question.

Information retrieval has been widely used in many natural language processing tasks such as reading comprehension and question answering. Therefore, there is no doubt that neural information retrieval will also benefit these tasks.

5.5.2 Question Answering

Question Answering (QA) is one of the most important document-level applications in NLP. Many efforts have been invested in QA, especially in machine reading comprehension and open-domain QA. In this section, we will introduce the advances in these two tasks, respectively.

5.5.2.1 Machine Reading Comprehension

As shown in Fig. 5.10, machine reading comprehension aims to determine the answer a to the question q given a passage p. The task could be viewed as a supervised learning problem: given a collection of training examples {(p_i, q_i, a_i)}_{i=1}^{n}, we want to learn a mapping f(·) that takes the passage p_i and corresponding question q_i as inputs and outputs \hat{a}_i, where evaluate(\hat{a}_i, a_i) is maximized. The evaluation metric is typically correlated with the answer type, which will be discussed in the following.

Generally, the current machine reading comprehension task could be divided into four categories depending on the answer types according to [10], i.e., cloze style, multiple choice, span prediction, and free-form answer. The cloze style task such as CNN/Daily Mail [24] consists of fill-in-the-blank sentences where the question contains a placeholder to be filled in. The answer a is either chosen from a predefined candidate set A or from the vocabulary V. The multiple-choice task such as RACE [30] and MCTest [50] aims to select the best answer from a set of answer choices. It is typical to use accuracy to measure the performance on these two tasks, i.e., the percentage of correctly answered questions in the whole example set, since each question is either correctly answered or not given the hypothesized answer set.

Fig. 5.10 An example of machine reading comprehension from SQuAD [49]

The span prediction task such as SQuAD [49] is perhaps the most widely adopted task among all, since it strikes a compromise between flexibility and simplicity. The task is to extract the most likely text span from the passage as the answer to the question, which is usually modeled as predicting the start position idx_{start} and end position idx_{end} of the answer span. To evaluate the predicted answer span \hat{a}, we typically use two evaluation metrics proposed by [49]. Exact match assigns full score 1.0 to the predicted answer span \hat{a} if it exactly equals the ground truth answer a, and 0.0 otherwise. The F1 score measures the degree of overlap between \hat{a} and a by computing the harmonic mean of precision and recall. The free-form answer task such as MS MARCO [43] does not restrict the answer form or length and is also referred to as generative question answering. It is practical to model the task as a sequence generation problem, where discrete token-level predictions are made. Currently, a consensus on the ideal evaluation metric has not been reached; it is common to adopt standard metrics from machine translation and summarization, including ROUGE [34] and BLEU [57].

As a critical component of question answering systems, the surge of neural machine reading comprehension models has greatly boosted the task of question answering in the last decade. The first attempt [24] to apply neural networks to machine reading comprehension constructs bidirectional LSTM reader models along with attention mechanisms. The work introduces two reader models, i.e., the attentive reader and the impatient reader, as shown in Fig. 5.11. After encoding the passage and the query into hidden states using LSTMs, the attentive reader computes a scalar distribution s(t) over the passage tokens and uses it to compute the weighted sum of the passage hidden states r. The impatient reader extends this idea further by recurrently updating the weighted sum of passage hidden states after it has seen each query token.

The attention mechanisms used in reading comprehension could be viewed as a variant of Memory Networks [64]. Memory Networks use long-term memory units to dynamically store information for inference. Typically, given an input x, the model first converts it into an internal feature representation F(x). Then, the model can update the designated memory units m_i given the new input: m_i = g(m_i, F(x), m), or generate output features o given the new input and the memory states: o = f(F(x), m). Finally, the model converts the output into a response with the desired format: r = R(o). The key takeaway of Memory Networks is the retaining and updating of internal memories that capture global information. We will see how this idea is further extended in more sophisticated models.


Fig. 5.11 The architectures of the bidirectional LSTM reader models: (a) the attentive reader; (b) the impatient reader

There is no doubt that the application of attention to machine reading comprehension greatly promotes research in this field. Following [24], the work of [11] modifies the method to compute attention and simplifies the prediction layer of the attentive reader. Instead of using tanh(·) to compute the relevance between the passage representations {\tilde{p}_i}_{i=1}^{n} and the query hidden state q (see Eq. 5.33), Chen et al. use a bilinear term to directly capture the passage-query alignment (see Eq. 5.34).

α_i = Softmax_i(tanh(W_1 p̃_i + W_2 q)),  (5.33)

α_i = Softmax_i(q^⊤ W_3 p̃_i).  (5.34)
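To make the contrast between the two scoring functions concrete, the following minimal NumPy sketch computes the attention weights of Eqs. 5.33 and 5.34 on randomly initialized toy matrices. The dimensions, the random initialization, and the extra projection vector v (Eq. 5.33 as written omits the final projection to a scalar) are illustrative assumptions, not the published implementations.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d, n = 8, 5                        # hidden size and passage length (toy values)
rng = np.random.default_rng(0)
P = rng.normal(size=(n, d))        # passage token representations p~_i
q = rng.normal(size=d)             # query representation

# Eq. 5.33: additive (tanh) scoring used by the attentive reader
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
v = rng.normal(size=d)             # assumed projection vector mapping tanh(...) to a scalar score
scores_tanh = np.array([v @ np.tanh(W1 @ p + W2 @ q) for p in P])
alpha_tanh = softmax(scores_tanh)

# Eq. 5.34: bilinear scoring used by Chen et al.
W3 = rng.normal(size=(d, d))
scores_bilinear = np.array([q @ W3 @ p for p in P])
alpha_bilinear = softmax(scores_bilinear)

print(alpha_tanh, alpha_bilinear)
```

The bilinear form needs a single parameter matrix per interaction and is noticeably simpler to compute, which is part of why it became the default choice in later readers.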

Most machine reading comprehension models follow the same paradigm to locate the start and end points of the answer span. As shown in Fig. 5.12, while encoding the passage, the model retains the length of the sequence and encodes the question into a fixed-length hidden representation q. The question's hidden vector is then used as a pointer to scan over the passage representations {p_i}_{i=1}^n and compute a score for every position in the passage. While maintaining this similar architecture, most machine reading comprehension models vary in the interaction methods between the passage and the question. In the following, we will introduce several classic reading comprehension architectures that follow this paradigm.

Fig. 5.12 The architecture of classic machine reading comprehension models



First, we introduce BiDAF, which is short for Bi-Directional Attention Flow [52]. The BiDAF network consists of the token embedding layer, the contextual embedding layer, the bi-directional attention flow layer, the LSTM modeling layer, and the softmax output layer, as shown in Fig. 5.13. The token embedding layer consists of two levels. First, the character embedding layer encodes each word at the character level by adopting a 1D convolutional neural network (CNN). Specifically, for each word, characters are embedded into fixed-length vectors, which are considered as 1D input to the CNN. The outputs are then max-pooled along the embedding dimension to obtain a single fixed-length vector. Second, the word embedding layer uses pretrained word vectors, i.e., GloVe [47], to map each word into a high-dimensional vector directly. Then the concatenation of the two vectors is fed into a two-layer Highway Network [56]. Equations 5.35 and 5.36 show one layer of the highway network used in the paper, where H1(·) and H2(·) represent two affine transformations:

g = Sigmoid(H1(x)), (5.35)

y = g ⊙ ReLU(H2(x)) + (1 − g) ⊙ x.  (5.36)
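A highway layer of this form can be written in a few lines. The sketch below is a minimal PyTorch rendering of Eqs. 5.35–5.36 under assumed dimensions; it is not the released BiDAF code, and the layer and variable names are our own.

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """One highway layer: y = g * ReLU(H2(x)) + (1 - g) * x (Eqs. 5.35-5.36)."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(dim, dim)       # H1: affine transform producing the gate g
        self.transform = nn.Linear(dim, dim)  # H2: affine transform producing the candidate

    def forward(self, x):
        g = torch.sigmoid(self.gate(x))                             # Eq. 5.35
        return g * torch.relu(self.transform(x)) + (1 - g) * x      # Eq. 5.36

# toy usage: a batch of 4 concatenated char+word vectors of size 16
x = torch.randn(4, 16)
y = HighwayLayer(16)(x)
print(y.shape)  # torch.Size([4, 16])
```

The gate g lets the layer interpolate between transforming its input and passing it through unchanged, which eases optimization of the stacked embedding layers.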

Fig. 5.13 The architecture of the BiDAF model

After feeding the context and the query to the token embedding layer, we obtain X ∈ R^{d×T} for the context and Q ∈ R^{d×J} for the query, respectively. Afterward, the contextual embedding layer, which is a bidirectional LSTM, models the temporal interaction between words for both the context and the query. Then comes the attention flow layer. In this layer, the attention dependency is computed in both directions, i.e., the context-to-query (C2Q) attention and the query-to-context (Q2C) attention. For both kinds of attention, we first compute a similarity matrix S ∈ R^{T×J} using the contextual embeddings of the context H and the query U obtained from the last layer (Eq. 5.37). In the equation, α(·) computes the scalar similarity of the given two vectors and m is a trainable weight vector.

S_{tj} = α(H_{:,t}, U_{:,j}),  (5.37)
α(h, u) = m^⊤ [h; u; h ⊙ u],  (5.38)

where ⊙ indicates the element-wise product. For the C2Q attention, a weighted sum of contextual query embeddings is computed for each context word. The attention distribution over the query is obtained by a_t = Softmax(S_{t,:}) ∈ R^J. The final attended query vector is therefore Ũ_{:,t} = Σ_j a_{tj} U_{:,j} for each context word. For the Q2C attention, the context embeddings are merged into a single fixed-length hidden vector h̃. The attention distribution over the context is computed by b = Softmax(max_j S_{tj}), where the maximum is taken over the query positions, and h̃ = Σ_t b_t H_{:,t}. Lastly, the merged context embedding is tiled T times along the column to produce H̃. Finally, the attended outputs are combined to yield G, which is defined by Eqs. 5.39 and 5.40:

G_{:,t} = β(H_{:,t}, Ũ_{:,t}, H̃_{:,t}),  (5.39)
β(h, ũ, h̃) = [h; ũ; h ⊙ ũ; h ⊙ h̃].  (5.40)
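The following NumPy sketch implements Eqs. 5.37–5.40 for a toy context/query pair. The sizes and the random initialization are assumptions, H is kept d-dimensional for brevity (BiDAF uses 2d-dimensional BiLSTM outputs), and broadcasting stands in for the explicit tiling of h̃.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d, T, J = 6, 7, 4                      # hidden size, context length, query length (toy)
rng = np.random.default_rng(1)
H = rng.normal(size=(d, T))            # contextual context embeddings
U = rng.normal(size=(d, J))            # contextual query embeddings
m = rng.normal(size=3 * d)             # trainable weight vector of Eq. 5.38

# similarity matrix S (Eq. 5.37): alpha(h, u) = m^T [h; u; h*u]
S = np.empty((T, J))
for t in range(T):
    for j in range(J):
        h, u = H[:, t], U[:, j]
        S[t, j] = m @ np.concatenate([h, u, h * u])

# C2Q attention: attended query vector for every context word
A = softmax(S, axis=1)                 # a_t over query words, one row per context word
U_tilde = U @ A.T                      # shape (d, T)

# Q2C attention: a single attended context vector, tiled over T columns
b = softmax(S.max(axis=1))             # distribution over context words
h_tilde = H @ b
H_tilde = np.tile(h_tilde[:, None], (1, T))

# merged representation G (Eqs. 5.39-5.40)
G = np.concatenate([H, U_tilde, H * U_tilde, H * H_tilde], axis=0)
print(G.shape)                         # (4d, T)
```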

Afterward, the LSTM modeling layer takes G as input and encodes it using a two-layer bidirectional LSTM. The output M ∈ R^{2d×T} is combined with G to yield the final start and end probability distributions over the passage:

P^1 = Softmax(u_1^⊤ [G; M]),  (5.41)
P^2 = Softmax(u_2^⊤ [G; LSTM(M)]),  (5.42)

where u_1 and u_2 are two trainable weight vectors. To train the model, the negative log-likelihood loss is adopted, and the goal is to maximize the probability of the golden start index idx_start and end index idx_end being selected by the model,

L = −(1/N) Σ_{i=1}^{N} [ log(P^1_{idx^i_start}) + log(P^2_{idx^i_end}) ].  (5.43)
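A minimal PyTorch sketch of this output layer and loss (Eqs. 5.41–5.43) follows. All sizes are toy assumptions, the module and variable names are our own, and the extra LSTM over M in Eq. 5.42 is realized as a separate bidirectional layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpanOutputLayer(nn.Module):
    """Start/end distributions over the passage (Eqs. 5.41-5.43), simplified."""
    def __init__(self, d):
        super().__init__()
        self.u1 = nn.Linear(10 * d, 1)   # scores [G; M]: G contributes 8d rows, M contributes 2d
        self.u2 = nn.Linear(10 * d, 1)
        self.lstm = nn.LSTM(2 * d, d, batch_first=True, bidirectional=True)

    def forward(self, G, M):
        # G: (batch, T, 8d), M: (batch, T, 2d)
        start_logits = self.u1(torch.cat([G, M], dim=-1)).squeeze(-1)
        M2, _ = self.lstm(M)                                   # LSTM(M) in Eq. 5.42
        end_logits = self.u2(torch.cat([G, M2], dim=-1)).squeeze(-1)
        return start_logits, end_logits

d, T = 4, 9
layer = SpanOutputLayer(d)
G, M = torch.randn(2, T, 8 * d), torch.randn(2, T, 2 * d)
start_logits, end_logits = layer(G, M)

# Eq. 5.43: negative log-likelihood of the gold start/end indices
gold_start, gold_end = torch.tensor([1, 3]), torch.tensor([2, 5])
loss = F.cross_entropy(start_logits, gold_start) + F.cross_entropy(end_logits, gold_end)
print(loss.item())
```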

Besides BiDAF, where attention dependencies are computed in two directions, we also briefly introduce other interaction methods between the query and the passage. The Gated-Attention Reader proposed by [19] adopts a gated attention module, where each token representation d_i of the passage is scaled by an attended query vector computed from the query representation Q after each Bi-GRU layer (Eqs. 5.44–5.46).

α_i = Softmax(Q^⊤ d_i),  (5.44)
q̃_i = Q α_i,  (5.45)
x_i = d_i ⊙ q̃_i.  (5.46)

This gated attention mechanism allows the query to directly interact with the token embeddings of the passage at the semantic level, and such layer-wise interaction enables the model to learn conditional token representations given the question at different representation levels. The Attention-over-Attention Reader [16] takes another path to model the interaction. The attention-over-attention mechanism calculates the attention between the passage attention α(t) and the averaged question attention β after obtaining the similarity matrix M ∈ R^{n×m} (Eq. 5.47). This operation explicitly learns the contributions of individual question words.

α(t) = Softmax(M_{:,t}),
β = (1/N) Σ_{t=1}^{N} Softmax(M_{t,:}).  (5.47)
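A NumPy sketch of Eq. 5.47 follows: column-wise softmaxes give one passage-level attention per question word, row-wise softmaxes are averaged into a question-level attention β, and (following the original paper, as an addition beyond Eq. 5.47) the two are combined into a single document-level score vector. Shapes are toy assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n, m = 6, 4                                       # passage length and question length (toy)
M = np.random.default_rng(2).normal(size=(n, m))  # pairwise similarity matrix

alpha = softmax(M, axis=0)                        # alpha(t): attention over the passage, one column per question word
beta = softmax(M, axis=1).mean(axis=0)            # averaged question-level attention

s = alpha @ beta                                  # attention-over-attention: weight each passage attention by beta
print(s.shape, s.sum())                           # (6,), sums to ~1.0
```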

5.5.2.2 Open-Domain Question Answering

Open-domain QA (OpenQA) was first proposed by [21]. The task aims to answer open-domain questions using external resources such as collections of documents [58], web pages [14, 29], structured knowledge graphs [3, 7], or automatically extracted relational triples [20]. Recently, with the development of machine reading comprehension techniques [11, 16, 19, 55, 63], researchers have attempted to answer open-domain questions by performing reading comprehension on plain texts. Reference [12] proposes the DrQA system, which employs neural-based models to answer open-domain questions. As illustrated in Fig. 5.14, a neural-based OpenQA system usually retrieves texts relevant to the question from a large-scale corpus and then extracts answers from these texts using reading comprehension models.

Fig. 5.14 An example of open-domain question answering: given the question "Who is the CEO of Apple in 2020?", the document retriever fetches relevant passages and the document reader extracts the answer "Tim Cook"

The DrQA system consists of two components: (1) the document retriever module for finding relevant articles and (2) the document reader model for extracting answers from the given contexts. The document retriever is used as a first quick skim to narrow the search space and focus on documents that are likely to be relevant. The retriever builds TF-IDF weighted bag-of-words vectors for the documents and the questions and computes similarity scores for ranking. To further utilize local word order information, the retriever uses bigram counts with hashing, preserving both speed and memory efficiency. The document reader model takes in the top 5 Wikipedia articles yielded by the document retriever and extracts the final answer to the question. For each article, the document reader predicts an answer span with a confidence score. The final prediction is made by maximizing the unnormalized exponential of prediction scores across the documents. Given each document d, the document reader first builds a feature representation d̃_i for each word in the document. The feature representation d̃_i is made up of the following components.

1. Word embeddings: The word embeddings f_emb(d) are obtained from large-scale GloVe embeddings pretrained on Wikipedia.
2. Manual features: The manual features f_token(d) combine part-of-speech (POS) tags, named entity recognition tags, and normalized term frequencies (TF).
3. Exact match: This feature indicates whether d_i can be exactly matched to one question word in q.
4. Aligned question embeddings: This feature aims to encode a soft alignment between words in the document and the question in the word embedding space:

f_align(d_i) = Σ_j α_{ij} E(q_j),  (5.48)

α_{ij} = exp(MLP(E(d_i)) · MLP(E(q_j))) / Σ_{j'} exp(MLP(E(d_i)) · MLP(E(q_{j'}))),  (5.49)

where MLP(x) = max(0, Wx + b) and E(q_j) indicates the word embedding of the jth word in the question.

Finally, the feature representation is obtained by concatenating the above features:

d̃_i = (f_emb(d_i), f_token(d_i), f_exact_match(d_i), f_align(d_i)).  (5.50)

Then the feature representations of the document are fed into a multilayer bidirectional LSTM (BiLSTM) to encode the contextual representation:

d_1, ..., d_n = BiLSTM(d̃_1, ..., d̃_n).  (5.51)

For the question, the contextual representation is simply obtained by encoding the word embeddings using a multilayer BiLSTM.

q_1, ..., q_m = BiLSTM(q̃_1, ..., q̃_m).  (5.52)

After that, the contextual representation is aggregated into a fixed-length vector using self-attention.

b_j = exp(u · q_j) / Σ_{j'} exp(u · q_{j'}),  (5.53)
q = Σ_j b_j q_j.  (5.54)

In the answer prediction phase, the start and end probability distributions are calculated following the paradigm mentioned in the reading comprehension model section (Sect. 5.5.2.1).

P_start(i) = exp(d_i^⊤ W_start q) / Σ_{i'} exp(d_{i'}^⊤ W_start q),  (5.55)
P_end(i) = exp(d_i^⊤ W_end q) / Σ_{i'} exp(d_{i'}^⊤ W_end q).  (5.56)
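The sketch below strings together Eqs. 5.53–5.56 in NumPy: self-attentive aggregation of the question followed by bilinear start/end scoring over the passage. All matrices are randomly initialized stand-ins for trained parameters and contextual encoders.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

h, n, m = 8, 10, 5                    # hidden size, passage length, question length (toy)
rng = np.random.default_rng(3)
D = rng.normal(size=(n, h))           # contextual passage representations d_i
Q = rng.normal(size=(m, h))           # contextual question representations q_j

# Eqs. 5.53-5.54: self-attentive question vector
u = rng.normal(size=h)
b = softmax(Q @ u)
q = b @ Q                             # fixed-length question vector

# Eqs. 5.55-5.56: bilinear start/end distributions over the passage
W_start, W_end = rng.normal(size=(h, h)), rng.normal(size=(h, h))
P_start = softmax(D @ W_start @ q)
P_end = softmax(D @ W_end @ q)
print(P_start.argmax(), P_end.argmax())
```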

Despite its success, the DrQA system is prone to noise in retrieved texts, which may hurt the performance of the system. Hence, [15] and [61] attempt to solve the noise problem in DrQA by separating question answering into paragraph selection and answer extraction, and they both select only the most relevant paragraph among all retrieved paragraphs to extract answers, losing a large amount of rich information contained in the neglected paragraphs. Hence, [62] proposes strength-based and coverage-based re-ranking approaches, which can aggregate the results extracted from each paragraph by an existing DS-QA system to determine the answer better. However, the method relies on the pre-extracted answers of existing DS-QA models and still suffers from the noise issue in distant supervision data because it considers all retrieved paragraphs indiscriminately. To address this issue, [35] proposes a coarse-to-fine denoising OpenQA model, which employs a paragraph selector to filter out noisy paragraphs and a paragraph reader to extract the correct answer from the denoised paragraphs.

5.6 Summary

In this chapter, we have introduced document representation learning, which encodes the semantic information of a whole document into a real-valued representation vector, providing an effective way for downstream tasks to utilize document information and significantly improving the performance of these tasks. First, we introduce the one-hot representation for documents. Next, we extensively introduce topic models, which represent both words and documents using latent topic distributions. Further, we give an introduction to distributed document representation, including the paragraph vector and neural document representations. Finally, we introduce several typical real-world applications of document representations, including information retrieval and question answering. In the future, for better document representation, some directions require further efforts: (1) Incorporating External Knowledge. Current document representation approaches focus on representing documents with the semantic information of the whole document text. Moreover, knowledge bases provide external semantic information to better understand the real-world entities in a given document. Researchers have formed a consensus that incorporating the entity semantics of knowledge bases into document representation is a potential way toward better document representation. Some existing work leverages various entity semantics to enhance the semantic information of document representation and achieves better performance in multiple applications such as document ranking [36, 65]. Explicitly modeling structural and textual semantic information as well as considering entity importance for the given document also sheds some light on more interpretable and knowledgeable document representation for downstream NLP tasks. (2) Considering Document Interactions. The candidate documents in downstream NLP tasks are usually relevant to each other and may help for better modeling

document semantic information. There is no doubt that the interactions among documents, whether with implicit semantic relations or with explicit links, will provide additional semantic signals to enhance document representations. Reference [32] preliminarily uses document interactions to extract important words and improve model performance. Nevertheless, it remains an unsolved problem how to effectively and explicitly incorporate semantic information from other documents into document representations. (3) Pretraining for Document Representation. Pretraining has shown its effectiveness and thrives on downstream NLP tasks. Existing pretrained language models, such as Word2vec-style word co-occurrence models [38] and BERT-style masked language models [18, 48], focus on representation learning at the sentence level, which cannot work well for document-level representation. It is still challenging to model cross-sentence relations, text coherence, and co-reference at the document level in document representation learning. Moreover, there are also some methods that leverage useful signals such as anchor-document information to supervise document representation learning [67]. How to pretrain document representation models with efficient and effective strategies is still a critical and challenging problem.

References

1. Amr Ahmed, Mohamed Aly, Joseph Gonzalez, Shravan Narayanamurthy, and Alexander Smola. Scalable inference in latent variable models. In Proceedings of WSDM, 2012. 2. Arthur Asuncion, Max Welling, Padhraic Smyth, and Yee Whye Teh. On smoothing and inference for topic models. In Proceedings of UAI, 2009. 3. Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on freebase from question-answer pairs. In Proceedings of EMNLP, 2013. 4. David M Blei, Thomas L Griffiths, and Michael I Jordan. The nested chinese restaurant process and bayesian nonparametric inference of topic hierarchies. The Journal of the ACM, 57(2):7, 2010. 5. David M Blei and John D Lafferty. Dynamic topic models. In Proceedings of ICML, 2006. 6. David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003. 7. Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075, 2015. 8. Jordan L Boyd-Graber and David M Blei. Syntactic topic models. In Proceedings of NeurIPS, 2009. 9. Jonathan Chang and David M Blei. Hierarchical relational models for document networks. The Annals of Applied Statistics, pages 124–150, 2010. 10. Danqi Chen. Neural Reading Comprehension and Beyond. PhD thesis, Stanford University, 2018. 11. Danqi Chen, Jason Bolton, and Christopher D. Manning. A thorough examination of the cnn/daily mail reading comprehension task. In Proceedings of ACL, 2016. 12. Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading wikipedia to answer open-domain questions. In Proceedings of ACL, 2017. 13. Jianfei Chen, Kaiwei Li, Jun Zhu, and Wenguang Chen. Warplda: a cache efficient o (1) algorithm for latent dirichlet allocation. Proceedings of VLDB, 2016. 14. Tongfei Chen and Benjamin Van Durme. Discriminative information retrieval for question answering sentence selection. In Proceedings of EACL, 2017.

15. Eunsol Choi, Daniel Hewlett, Jakob Uszkoreit, Illia Polosukhin, Alexandre Lacoste, and Jonathan Berant. Coarse-to-fine question answering for long documents. In Proceedings of ACL, 2017. 16. Yiming Cui, Zhipeng Chen, Si Wei, Shijin Wang, Ting Liu, and Guoping Hu. Attention-over- attention neural networks for reading comprehension. In Proceedings of ACL, 2017. 17. Zhuyun Dai, Chenyan Xiong, Jamie Callan, and Zhiyuan Liu. Convolutional neural networks for soft-matching n-grams in ad-hoc search. In Proceedings of WSDM, 2018. 18. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL, 2019. 19. Bhuwan Dhingra, Hanxiao Liu, Zhilin Yang, William Cohen, and Ruslan Salakhutdinov. Gated- attention readers for text comprehension. In Proceedings of ACL, 2017. 20. Anthony Fader, Luke Zettlemoyer, and Oren Etzioni. Open question answering over curated and extracted knowledge bases. In Proceedings of SIGKDD, 2014. 21. Bert F Green Jr, Alice K Wolf, Carol Chomsky, and Kenneth Laughery. Baseball: an automatic question-answerer. In Proceedings of IRE-AIEE-ACM, 1961. 22. Thomas L Griffiths, Mark Steyvers, David M Blei, and Joshua B Tenenbaum. Integrating topics and syntax. In Proceedings of NeurIPS, 2004. 23. Jiafeng Guo, Yixing Fan, Qingyao Ai, and W.Bruce Croft. A deep relevance matching model for ad-hoc retrieval. In Proceedings of CIKM, 2016. 24. Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Proceedings of NeurIPS, 2015. 25. Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. Convolutional neural network archi- tectures for matching natural language sentences. In Proceedings of NeurIPS, 2014. 26. Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of CIKM, 2013. 27. Kai Hui, Andrew Yates, Klaus Berberich, and Gerard de Melo. Pacrr: A position-aware neural ir model for relevance matching. In Proceedings of EMNLP, 2017. 28. Yangfeng Ji, Trevor Cohn, Lingpeng Kong, Chris Dyer, and Jacob Eisenstein. Document context language models. arXiv preprint arXiv:1511.03962, 2015. 29. Cody Kwok, Oren Etzioni, and Daniel S Weld. Scaling question answering to the web. TOIS, pages 242–262, 2001. 30. Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683, 2017. 31. Quoc V Le and Tomas Mikolov. Distributed representations of sentences and documents. In Proceedings of ICML, 2014. 32. Canjia Li, Yingfei Sun, Ben He, Le Wang, Kai Hui, Andrew Yates, Le Sun, and Jungang Xu. Nprf: A neural pseudo relevance feedback framework for ad-hoc information retrieval. In Proceedings of EMNLP, 2018. 33. Jiwei Li, Thang Luong, and Dan Jurafsky. A hierarchical neural autoencoder for paragraphs and documents. In Proceedings of ACL, 2015. 34. Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out, 2004. 35. Yankai Lin, Haozhe Ji, Zhiyuan Liu, and Maosong Sun. Denoising distantly supervised open- domain question answering. In Proceedings of ACL, 2018. 36. Zhenghao Liu, Chenyan Xiong, Maosong Sun, and Zhiyuan Liu. Entity-duet neural ranking: Understanding the role of knowledge graph semantics in neural information retrieval. 
In Proceedings of ACL, 2018. 37. Jon D Mcauliffe and David M Blei. Supervised topic models. In Proceedings of NeurIPS, 2008. 38. T Mikolov and J Dean. Distributed representations of words and phrases and their compositionality. Proceedings of NeurIPS, 2013. 39. David Mimno and Andrew McCallum. Topic models conditioned on arbitrary features with dirichlet-multinomial regression. In Proceedings of UAI, 2008.

40. Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In Proceedings of WWW, 2017. 41. David Newman, Arthur U Asuncion, Padhraic Smyth, and Max Welling. Distributed inference for latent dirichlet allocation. In Proceedings of NeurIPS, 2007. 42. David Newman, Chaitanya Chemudugunta, and Padhraic Smyth. Statistical entity-topic mod- els. In Proceedings of SIGKDD, 2006. 43. Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268, 2016. 44. Hamid Palangi, Li Deng, Yelong Shen, Jianfeng Gao, Xiaodong He, Jianshu Chen, Xinying Song, and Rabab Ward. Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval. IEEE/ACM Transactions on Audio, Speech and Language Processing, 24(4):694–707, 2016. 45. Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Shengxian Wan, and Xueqi Cheng. Text match- ing as image recognition. In Proceedings of AAAI, 2016. 46. Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Jingfang Xu, and Xueqi Cheng. Deeprank: A new deep architecture for relevance ranking in information retrieval. In Proceedings of CIKM, 2017. 47. Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of EMNLP, 2014. 48. Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of NAACL- HLT, pages 2227–2237, 2018. 49. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ ques- tions for machine comprehension of text. In Proceedings of EMNLP, 2016. 50. Matthew Richardson, Christopher JC Burges, and Erin Renshaw. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of EMNLP, 2013. 51. Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers, and Padhraic Smyth. The author-topic model for authors and documents. In Proceedings of UAI, 2004. 52. Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional atten- tion flow for machine comprehension. In Proceedings of ICLR, 2017. 53. Aliaksei Severyn and Alessandro Moschitti. Learning to rank short text pairs with convolutional deep neural networks. In Proceedings of SIGIR, 2015. 54. Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In Proceedings of CIKM, 2014. 55. Yelong Shen, Po-Sen Huang, Jianfeng Gao, and Weizhu Chen. Reasonet: Learning to stop reading in machine comprehension. In Proceedings of SIGKDD, 2017. 56. Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015. 57. A Cuneyd Tantug, Kemal Oflazer, and Ilknur Durgar El-Kahlout. Bleu+: a tool for fine-grained bleu computation. 2008. 58. Ellen M Voorhees et al. The trec-8 question answering track report. In Proceedings of TREC, 1999. 59. Hanna M Wallach. Topic modeling: beyond bag-of-words. In Proceedings of ICML, 2006. 60. Chong Wang, Bo Thiesson, Chris Meek, and David Blei. Markov topic models. In Proceedings of AISTATS, 2009. 61. Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerald Tesauro, Bowen Zhou, and Jing Jiang. 
R3: Reinforced ranker-reader for open-domain question answering. In Proceedings of AAAI, 2018. 62. Shuohang Wang, Mo Yu, Jing Jiang, Wei Zhang, Xiaoxiao Guo, Shiyu Chang, Zhiguo Wang, Tim Klinger, Gerald Tesauro, and Murray Campbell. Evidence aggregation for answer re-ranking in open-domain question answering. In Proceedings of ICLR, 2018.

63. Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. Gated self-matching networks for reading comprehension and question answering. In Proceedings of ACL, 2017. 64. Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014. 65. Chenyan Xiong, Jamie Callan, and Tie-Yan Liu. Word-entity duet representations for document ranking. In Proceedings of SIGIR, 2017. 66. Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. End-to-end neural ad-hoc ranking with kernel pooling. In Proceedings of SIGIR, 2017. 67. Kaitao Zhang, Chenyan Xiong, Zhenghao Liu, and Zhiyuan Liu. Selective weak supervision for neural information retrieval. arXiv preprint arXiv:2001.10382, 2020.

Chapter 6 Sememe Knowledge Representation

Abstract Linguistic knowledge graphs (e.g., WordNet and HowNet) describe linguistic knowledge in a formal and structured way, which can be easily incorporated into modern natural language processing systems. In this chapter, we focus on the research about HowNet. We first briefly introduce the background and basic concepts of HowNet and sememes. Next, we introduce the motivations of sememe representation learning and existing approaches. At the end of this chapter, we review important applications of sememe representation.

6.1 Introduction

In the field of Natural Language Processing (NLP), words are generally the smallest objects of study because they are considered the smallest meaningful units of human languages that can stand by themselves. However, the meanings of words can be further divided into smaller parts. For example, the meaning of man can be considered as the combination of the meanings of human, male, and adult, and the meaning of boy is composed of the meanings of human, male, and child. In linguistics, the minimum indivisible units of meaning, i.e., semantic units, are defined as sememes [8], and some linguists believe that the meanings of all words can be composed from a limited, closed set of sememes. However, sememes are implicit, and as a result, it is hard to intuitively define the set of sememes and determine at a glance which sememes a word has. Therefore, some researchers have spent tens of years sifting sememes from all kinds of dictionaries and linguistic Knowledge Bases (KBs) and annotating words with these selected sememes to construct sememe-based linguistic KBs. WordNet and HowNet [17] are the two most famous of such KBs. In this chapter, we focus on the representation of linguistic knowledge in HowNet.


6.1.1 Linguistic Knowledge Graphs

6.1.1.1 WordNet

WordNet is a large lexical database for the English language and can also be viewed as a KG containing multi-relational data. It was started in 1985 and created under the direction of George Armitage Miller, a psychology professor in the Cognitive Science Laboratory of Princeton University. Nowadays, WordNet is the most popular lexical dictionary in the world; it is freely available through the Web and is widely used in NLP applications such as text analysis, information retrieval, and relation extraction. There is also a Global WordNet Association aiming to provide a public and noncommercial platform for the wordnets of all languages in the world. Based on meanings, WordNet groups English nouns, verbs, adjectives, and adverbs into synsets (i.e., sets of cognitive synonyms), each of which represents a distinct concept. Each synset possesses a brief description, and in most cases, there are even some short sentences functioning as examples illustrating the use of the words in this synset. Conceptual-semantic and lexical relations link the synsets and words. The main relation among words is synonymy, which indicates that the words share similar meanings and can replace each other in some contexts, while the main relation among synsets is hyperonymy/hyponymy (i.e., the ISA relation), which indicates the relationship between a more general synset and a more specific synset. There are also hierarchical structures for verb synsets, and antonymy describes the relation between adjectives with opposite meanings. To sum up, all of WordNet's 117,000 synsets are linked to each other by a small number of conceptual relations.

6.1.1.2 HowNet

HowNet was initially designed and constructed by Zhendong Dong and his son Qiang Dong in the 1990s, and it has been frequently updated since it was published in 1999. The sememe set of HowNet is determined by extracting, analyzing, merging, and filtering the semantics of thousands of Chinese characters, and it can also be adjusted or expanded in the subsequent process of annotating words. Each sememe in HowNet is represented by a word or phrase written in both Chinese and English, such as human and ProperName. HowNet also builds a taxonomy for the sememes. All the sememes of HowNet can be classified as one of the following types: Thing, Part, Attribute, Time, Space, Attribute Value, and Event. In addition, to depict the semantics of words more precisely, HowNet incorporates relations between sememes, which are called "dynamic roles", into the sememe annotations of words. Considering polysemy, HowNet differentiates diverse senses of each word in the sememe annotations, and each sense is also expressed in both Chinese and English.

Fig. 6.1 An example of a word (apple) annotated with sememes in HowNet

Table 6.1 Statistics of HowNet

Type                     Count
Sense                    229,767
Distinct Chinese word    127,266
Distinct English word    104,025
Sememe                   2,187

An example of the sememe annotation for a word is illustrated in Fig. 6.1. We can see from the figure that the word apple has four senses, including apple(computer), apple(phone), apple(fruit), and apple(tree), and each sense is the root node of a "sememe tree" in which each pair of father and son sememe nodes is multi-relational. Additionally, HowNet annotates the POS tag for each sense and adds a sentiment category as well as some usage examples for certain senses. The latest version of HowNet was published in January 2019, and its statistics are shown in Table 6.1. Since HowNet was published, it has attracted wide attention. People use HowNet and sememes in various NLP tasks, including word similarity computation [40], word sense disambiguation [70], question classification [62], and sentiment analysis [16, 20]. Among these researches, [40] is one of the most influential works, in which the similarity of two given words is computed by measuring the degree of resemblance of their sememe trees. Recent years have also witnessed works incorporating sememes into neural network models. Reference [49] proposes a novel word representation learning model named SST that reforms Skip-gram [43] by adding contextual attention over the senses of the target word, which are represented with combinations of the corresponding sememes' embeddings. Experimental results show that SST can not only improve the quality of word embeddings but also learn satisfactory sense embeddings for word sense disambiguation. Reference [23] incorporates sememes into the decoding phase of language modeling, where sememes are predicted first, and then senses and words are predicted in succession. The proposed model improves the perplexity of language modeling and the performance of the downstream headline generation task. Besides, HowNet is also utilized in lexicon expansion [68], semantic rationality evaluation [41], etc. Considering that human annotation is time-consuming and labor-intensive, some works attempt to employ machine learning methods to predict sememes for new words automatically. Reference [66] first proposes this task and presents two simple but effective models: SPWE, which is based on collaborative filtering, and SPSE, which is based on matrix factorization. Reference [30] further takes the internal information of words into account when predicting sememes and achieves a considerable boost in performance. Reference [38] takes advantage of definitions of words to predict sememes. As for [56], they propose the task of cross-lingual lexical sememe prediction and present a bilingual word representation learning and alignment-based model, which demonstrates effectiveness in predicting sememes for cross-lingual words.

6.2 Sememe Knowledge Representation

Word Representation Learning (WRL) is a fundamental and critical step in many NLP tasks such as language modeling [4] and neural machine translation [64]. There has been a lot of research on learning word representations, among which Word2vec [43] achieves a nice balance between effectiveness and efficiency. In Word2vec, each word corresponds to one single embedding, ignoring the polysemy of most words. To address this issue, [29] introduces a multi-prototype model for WRL, conducting unsupervised word sense induction and learning sense embeddings according to context clusters. Reference [13] further utilizes the synset information in WordNet to instruct word sense representation learning. These previous studies demonstrate that word sense disambiguation is critical for WRL, and the sememe annotation of word senses in HowNet can provide necessary semantic regularization for these tasks [63]. To explore its feasibility, we introduce the Sememe-Encoded Word Representation Learning (SE-WRL) framework, which detects word senses and learns representations simultaneously. More specifically, this framework regards each word sense as a combination of its sememes, and iteratively performs word sense disambiguation according to contexts and learns representations of sememes, senses, and words by extending Skip-gram in Word2vec [43]. In this framework, an attention-based method is proposed to automatically select appropriate word senses according to contexts. To take full advantage of sememes, we introduce three different learning and attention strategies for SE-WRL, namely SSA, SAC, and SAT, which are described in the following sections.

6.2.1 Simple Sememe Aggregation Model

The Simple Sememe Aggregation model (SSA) is a straightforward idea based on the Skip-gram model. For each word, SSA considers all sememes in all senses of the word together, and represents the target word using the average of all its sememe embeddings. Formally, we have

w = (1/m) Σ_{s_i^(w) ∈ S^(w)} Σ_{x_j^(s_i) ∈ X_i^(w)} x_j^(s_i),  (6.1)

which means the word embedding of w is composed of the average of all its sememe embeddings. Here, S^(w) is the sense set of w, X_i^(w) is the sememe set of the ith sense of w, and m stands for the overall number of sememes belonging to w. This model follows the assumption that the semantic meaning of a word is composed of semantic units, i.e., sememes. Compared to the conventional Skip-gram model, since sememes are shared by multiple words, this model can utilize sememe information to encode latent semantic correlations between words. In this case, similar words that share the same sememes may finally obtain similar representations.
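Given a word-to-senses-to-sememes mapping and a sememe embedding table, Eq. 6.1 reduces to an average over all annotated sememes. The toy dictionary, dimensions, and random embeddings below are illustrative assumptions.

```python
import numpy as np

# toy sememe inventory and annotation: word -> list of senses -> list of sememes
sememe_vocab = ["human", "male", "adult", "child", "fruit", "computer"]
word_senses = {
    "apple": [["fruit"], ["computer"]],
    "boy":   [["human", "male", "child"]],
}

dim = 4
rng = np.random.default_rng(0)
sememe_emb = {x: rng.normal(size=dim) for x in sememe_vocab}

def ssa_word_embedding(word):
    """Eq. 6.1: the word embedding is the average of all sememe embeddings over all senses."""
    sememes = [x for sense in word_senses[word] for x in sense]
    return np.mean([sememe_emb[x] for x in sememes], axis=0)

print(ssa_word_embedding("apple"))
```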

6.2.2 Sememe Attention over Context Model

The SSA model replaces the target word embedding with the aggregated sememe embeddings to encode sememe information into word representation learning. However, each word in the SSA model still has only one single representation in different contexts, which cannot deal with the polysemy of most words. It is intuitive that we should construct distinct embeddings for a target word according to specific contexts, with the favor of the word sense annotation in HowNet. To address this issue, the Sememe Attention over Context model (SAC) is proposed. SAC utilizes the attention scheme to automatically select appropriate senses for context words according to the target word. That is, SAC conducts word sense disambiguation for context words to learn better representations of target words. The structure of the SAC model is shown in Fig. 6.2. More specifically, SAC utilizes the original word embedding for the target word w, and uses sememe embeddings to represent each context word w_c instead of the original context word embeddings. Suppose a word typically demonstrates some specific senses in one sentence. Here, the target word embedding is employed as the attention to select the most appropriate senses to make up the context word embeddings.

Fig. 6.2 The architecture of the SAC model

The context word embedding w_c can be formalized as follows:

w_c = Σ_{j=1}^{|S^(w_c)|} Att(s_j^(w_c)) s_j^(w_c),  (6.2)

where s_j^(w_c) stands for the jth sense embedding of w_c, and Att(s_j^(w_c)) represents the attention score of the jth sense with respect to the target word w, defined as follows:

Att(s_j^(w_c)) = exp(w · ŝ_j^(w_c)) / Σ_{k=1}^{|S^(w_c)|} exp(w · ŝ_k^(w_c)).  (6.3)

Note that, when calculating attention, the average of sememe embeddings is used to represent each sense s_j^(w_c):

ŝ_j^(w_c) = (1/|X_j^(w_c)|) Σ_{k=1}^{|X_j^(w_c)|} x_k.  (6.4)

The attention strategy assumes that the more relevant a context word's sense embedding is to the target word w, the more this sense should be considered when building the context word embedding. With the help of the attention scheme, each context word can be represented as a particular distribution over its senses. This can be regarded as soft word sense disambiguation, and it helps learn better word representations.
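A NumPy sketch of Eqs. 6.2–6.4 follows: each sense of a context word is first represented by the average of its sememe embeddings, attention scores against the target word embedding are softmax-normalized, and the context word embedding is the attention-weighted sum of its sense embeddings. Names and sizes are toy assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

dim = 4
rng = np.random.default_rng(1)
w = rng.normal(size=dim)                          # target word embedding (attention query)

# sememe embeddings of the context word's senses: one (num_sememes, dim) array per sense
sense_sememes = [rng.normal(size=(2, dim)), rng.normal(size=(3, dim))]
sense_emb = [rng.normal(size=dim) for _ in sense_sememes]   # sense embeddings s_j

s_hat = [X.mean(axis=0) for X in sense_sememes]   # Eq. 6.4: average sememe embedding per sense
att = softmax(np.array([w @ s for s in s_hat]))   # Eq. 6.3: attention over senses
w_c = sum(a * s for a, s in zip(att, sense_emb))  # Eq. 6.2: context word embedding
print(att, w_c.shape)
```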

Fig. 6.3 The architecture of the SAT model

6.2.3 Sememe Attention over Target Model

The Sememe Attention over Context Model can flexibly select appropriate senses and sememes for context words according to the target word. The process can also be applied to select appropriate senses for the target word by taking context words as attention. Hence, the Sememe Attention over Target model (SAT) is proposed, which is shown in Fig. 6.3. Different from the SAC model, SAT learns the original word embeddings for context words and sememe embeddings for target words. Then SAT applies context words to perform attention over multiple senses of the target word w to build the embedding of w, formalized as follows:

w = Σ_{j=1}^{|S^(w)|} Att(s_j^(w)) s_j^(w),  (6.5)

and the context-based attention is defined as follows:

Att(s_j^(w)) = exp(w'_c · ŝ_j^(w)) / Σ_{k=1}^{|S^(w)|} exp(w'_c · ŝ_k^(w)),  (6.6)

where the average of sememe embeddings ŝ_j^(w) is also used to represent each sense s_j^(w). Here, w'_c is the context embedding, consisting of a constrained window of word embeddings in C(w_i). We have

w'_c = (1/(2K')) Σ_{k=i−K'}^{i+K'} w_k,  k ≠ i.  (6.7)

Note that, since in experiments the sense selection of the target word is found to depend on a more limited set of context words for calculating attention, a smaller K' is selected compared to K. Recall that SAC only uses one target word as attention to select the senses of context words, whereas SAT uses several context words together as attention to select the appropriate senses of target words. Hence, SAT is expected to conduct more reliable WSD and result in more accurate word representations, which is explored in experiments.

6.3 Applications

In the previous section, we introduced HowNet and sememe representation. In fact, linguistic knowledge graphs such as HowNet contain rich information which could effectively help downstream applications. Therefore, in this section, we will introduce the major applications of sememe representation, including sememe-based word representation, linguistic knowledge graph construction, and language modeling.

6.3.1 Sememe-Guided Word Representation

Sememe-guided word representation aims to improve word embeddings for sememe prediction by introducing the information of sememe-based linguistic KBs of the source language. Qi et al. [56] present two methods for sememe-guided word representation.

6.3.1.1 Relation-Based Word Representation

A simple and intuitive method is to let words with similar sememe annotations tend to have similar word embeddings, which is named the word relation-based approach. To begin with, a synonym list is constructed from sememe-based linguistic KBs, where words sharing a certain number of sememes are regarded as synonyms. Next, synonyms are forced to have closer word embeddings. Formally, let w_i be the original word embedding of w_i, ŵ_i be its adjusted word embedding, and Syn(w_i) denote the synonym set of word w_i. Then the loss function is defined as

L_sememe = Σ_{w_i ∈ V} ( α_i ||w_i − ŵ_i||² + Σ_{w_j ∈ Syn(w_i)} β_{ij} ||ŵ_i − ŵ_j||² ),  (6.8)

where α and β control the relative strengths of the two terms. It should be noted that the idea of forcing similar words to have close word embeddings is similar to the state-of-the-art retrofitting approach [19]. However, the retrofitting approach cannot be applied here because sememe-based linguistic KBs such as HowNet cannot directly provide the synonym list it needs.

6.3.1.2 Sememe Embedding-Based Word Representation

Simple and effective as the word relation-based approach is, it cannot make full use of the information in sememe-based linguistic KBs because it disregards the complicated relations between sememes and words as well as the relations between different sememes. To address this limitation, the sememe embedding-based approach is proposed, which learns both sememe and word embeddings jointly. In this approach, sememes are represented with distributed vectors as well and placed into the same semantic space as words. Similar to SPSE [66], which learns sememe embeddings by decomposing the word-sememe matrix and the sememe-sememe matrix, the method utilizes sememe embeddings as regularizers to learn better word embeddings. Different from SPSE, the model described in [56] does not use pretrained word embeddings. Instead, it learns word embeddings and sememe embeddings simultaneously. More specifically, a word-sememe matrix M can be extracted from HowNet, where M_{ij} = 1 indicates that word w_i is annotated with sememe x_j, and otherwise M_{ij} = 0. Hence, by factorizing M, the loss function can be defined as

L_sememe = Σ_{w_i ∈ V, x_j ∈ X} (w_i · x_j + b_i + b'_j − M_{ij})²,  (6.9)

where b_i and b'_j are the biases of w_i and x_j, and X denotes the sememe set. In this approach, word and sememe embeddings are obtained in a unified semantic space. The sememe embeddings bear all the information about the relationships between words and sememes, and they inject this information into the word embeddings. Therefore, the word embeddings are expected to be more suitable for sememe prediction.
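A minimal NumPy rendering of the factorization objective in Eq. 6.9 is shown below; the 0/1 word-sememe matrix, the bias terms, and the initialization are toy stand-ins, and no optimizer is included.

```python
import numpy as np

rng = np.random.default_rng(2)
V, X, dim = 5, 7, 4                             # number of words, sememes, embedding size (toy)
M = (rng.random((V, X)) < 0.3).astype(float)    # word-sememe annotation matrix

W = rng.normal(scale=0.1, size=(V, dim))        # word embeddings
S = rng.normal(scale=0.1, size=(X, dim))        # sememe embeddings
b_w = np.zeros(V)                               # word biases
b_s = np.zeros(X)                               # sememe biases

def sememe_loss(W, S, b_w, b_s, M):
    """Eq. 6.9: squared reconstruction error of the word-sememe matrix."""
    pred = W @ S.T + b_w[:, None] + b_s[None, :]
    return ((pred - M) ** 2).sum()

print(sememe_loss(W, S, b_w, b_s, M))
```

In practice the loss would be minimized jointly with a corpus-based word embedding objective, so that sememe reconstruction acts as a regularizer on the word vectors.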

6.3.2 Sememe-Guided Semantic Compositionality Modeling

Semantic Compositionality (SC) is defined as the linguistic phenomenon that the meaning of a syntactically complex unit is a function of the meanings of the complex unit's constituents and their combination rule [50]. Some linguists regard SC as the fundamental truth of semantics [51]. In the field of NLP, SC has proved effective in many tasks including language modeling [47], sentiment analysis [42, 61], syntactic parsing [59], etc. Most literature on SC pays attention to using vector-based distributional models of semantics to learn representations of Multiword Expressions (MWEs), i.e., embeddings of phrases or compounds. Reference [46] conducts a pioneering work which introduces a general framework to formulate this task:

p = f(w_1, w_2, R, K),  (6.10)

where f is the compositionality function, p denotes the embedding of an MWE, w_1 and w_2 represent the embeddings of the MWE's two constituents, R stands for the combination rule, and K refers to the additional knowledge which is needed to construct the semantics of the MWE.¹ Most of the proposed approaches ignore R and K, centering on reforming the compositionality function f [3, 21, 60, 61]. Some try to integrate the combination rule R into SC models [7, 35, 65, 71]. A few works consider external knowledge K. Reference [72] tries to incorporate task-specific knowledge into an LSTM model for sentence-level SC. Reference [55] proposes a novel sememe-based method to model semantic compositionality. They argue that sememes are beneficial to modeling SC. To verify this, they first design a simple SC degree (SCD) measurement experiment and find that the SCDs of MWEs computed by simple sememe-based formulae are highly correlated with human judgment. This result shows that sememes can finely depict the meanings of MWEs and their constituents, and capture the semantic relations between the two sides. Moreover, they propose two sememe-incorporated SC models for learning embeddings of MWEs, namely the Semantic Compositionality with Aggregated Sememe (SCAS) model and the Semantic Compositionality with Mutual Sememe Attention (SCMSA) model. When learning the embedding of an MWE, the SCAS model concatenates the embeddings of the MWE's constituents and their sememes, while the SCMSA model considers the mutual attention between a constituent's sememes and the other constituent. Finally, they integrate the combination rule, i.e., R in Eq. (6.10), into the two models. Their models achieve significant improvements over baseline methods on the MWE similarity computation task and the sememe prediction task. In this section, we focus on the work conducted by [55]. We will first introduce the sememe-based SCD computation formulae, and then describe the sememe-incorporated SC models.

¹ This formula only applies to two-word MWEs but can be easily extended to longer MWEs.

6.3.2.1 Sememe-Based SCD Computation Formulae

Although SC widely exists in MWEs, not every MWE is fully semantically compositional. In fact, different MWEs show different degrees of SC. Reference [55] believes that sememes can be used to measure SCD conveniently. To this end, based on the assumption that all the sememes of a word accurately depict the word's meaning, they intuitively design a set of SCD computation formulae, which are consistent with the principle of SCD. The formulae are illustrated in Table 6.2. They define four SCDs denoted by the numbers 3, 2, 1, and 0, where larger numbers mean higher SCDs. S_p, S_{w_1}, and S_{w_2} represent the sememe sets of an MWE, its first constituent, and its second constituent, respectively. Here is a brief explanation of the SCD computation formulae:
(1) For SCD 3, the sememe set of an MWE is identical to the union of the two constituents' sememe sets, which means the meaning of the MWE is exactly the same as the combination of the constituents' meanings. Therefore, the MWE is fully semantically compositional and should have the highest SCD.
(2) For SCD 0, an MWE has totally different sememes from its constituents, which means the MWE's meaning cannot be derived from its constituents' meanings. Hence the MWE is completely non-compositional, and its SCD should be the lowest.
(3) As for SCD 2, the sememe set of an MWE is a proper subset of the union of its constituents' sememe sets, which means the meanings of the constituents cover the MWE's meaning but cannot precisely infer the MWE's meaning.
(4) Finally, for SCD 1, an MWE shares some sememes with its constituents, but both the MWE itself and its constituents have some unique sememes.
There is an example for each SCD in Table 6.2, including a Chinese MWE, its two constituents, and their sememes.²

6.3.2.2 Evaluating SCD Computation Formulae

To evaluate their sememe-based SCD computation formulae, [55] constructs a human-annotated SCD dataset. They ask several native speakers to label SCDs for 500 Chinese MWEs, where there are four degrees to choose from. Before labeling an MWE, the annotators are shown the dictionary definitions of both the MWE and its constituents. Each MWE is labeled by 3 annotators, and the average of the 3 SCDs given by them is taken as the MWE's final SCD. Eventually, they obtain a dataset containing 500 Chinese MWEs together with their human-annotated SCDs. Then they evaluate the correlation between the SCDs of the MWEs in the dataset computed by the sememe-based rules and those given by humans. They find that Pearson's correlation coefficient is up to 0.75, and Spearman's rank correlation coefficient is 0.74.

² In Chinese, most MWEs are words consisting of more than two characters, which are actually single-morpheme words.

Table 6.2 Sememe-based semantic compositionality degree computation formulae and examples. Bold sememes of constituents are shared with the constituents' corresponding MWE

These results manifest the remarkable capability of sememes to compute the SCDs of MWEs and provide proof that the sememes of a word can finely represent the word's meaning.

6.3.2.3 Sememe-Incorporated SC Models

In this section, we first introduce the two basic sememe-incorporated SC models in detail, namely Semantic Compositionality with Aggregated Sememe (SCAS) and Semantic Compositionality with Mutual Sememe Attention (SCMSA). The SCAS model simply concatenates the embeddings of the MWE's constituents and their sememes, while the SCMSA model takes account of the mutual attention between a constituent's sememes and the other constituent. Then we describe how to integrate combination rules into the two basic models. Incorporating Sememes Only. Following the notations in Eq. (6.10), for an MWE p = {w_1, w_2}, its embedding can be represented as

p = f (w1, w2, K ), (6.11)

where p, w_1, w_2 ∈ R^d, and d is the dimension of embeddings; K denotes the sememe knowledge here, and we assume that we only know the sememes of w_1 and w_2, considering that MWEs are normally not in the sememe KBs. X indicates the set of all the sememes, and X_w = {x_1, ..., x_{|X_w|}} ⊂ X signifies the sememe set of w. In addition, x ∈ R^d denotes the embedding of sememe x. (1) SCAS Model. The first model we introduce is the SCAS model, which is illustrated in Fig. 6.4. The idea of the SCAS model is straightforward: it simply concatenates the word embedding of a constituent and the aggregation of its sememes' embeddings.

Fig. 6.4 The architecture of the SCAS model

Formally, we have

w'_1 = Σ_{x_i ∈ X_{w_1}} x_i,    w'_2 = Σ_{x_j ∈ X_{w_2}} x_j,  (6.12)

where w'_1 and w'_2 represent the aggregated sememe embeddings of w_1 and w_2, respectively. Then p can be obtained by

p = tanh(W_c [w_1 + w_2; w'_1 + w'_2] + b_c),  (6.13)

where W_c ∈ R^{d×2d} is the composition matrix and b_c ∈ R^d is a bias vector. (2) SCMSA Model. The SCAS model simply uses the sum of all the sememe embeddings of a constituent as the external information. However, a constituent's meaning may vary with the other constituent, and accordingly, the sememes of a constituent should have different weights when the constituent is combined with different constituents (there is an example in the case study). Correspondingly, we introduce the SCMSA model (Fig. 6.5), which adopts a mutual attention mechanism to dynamically endow sememes with weights. Formally, we have

e_1 = tanh(W_a w_1 + b_a),
a_{2,i} = exp(x_i · e_1) / Σ_{x_j ∈ X_{w_2}} exp(x_j · e_1),  (6.14)
w'_2 = Σ_{x_i ∈ X_{w_2}} a_{2,i} x_i,

where W_a ∈ R^{d×d} is the weight matrix and b_a ∈ R^d is a bias vector. Similarly, w'_1 can be calculated. Then they still use Eq. (6.13) to obtain p. Integrating Combination Rules. Reference [55] further integrates combination rules into the sememe-incorporated SC models. In other words,

p = f (w1, w2, K, R). (6.15)

We can use totally different composition matrices for MWEs with different combination rules:

W_c = W_c^r,  r ∈ R_s,  (6.16)

where W_c^r ∈ R^{d×2d}, and R_s refers to the combination rule set containing syntax rules of MWEs, e.g., adjective-noun and noun-noun. However, there are many different combination rules, and some rules have sparse instances which are not enough to train the corresponding composition matrices with d × 2d parameters.

Fig. 6.5 The architecture of the SCMSA model

In addition, we believe that the composition matrix should contain common compositionality information besides the combination rule-specific compositionality information. Hence, the composition matrix W_c is set to be the sum of a low-rank matrix containing combination rule information and a matrix containing common compositionality information:

W_c = U_1^r U_2^r + W_c^c,  (6.17)

where U_1^r ∈ R^{d×d_r}, U_2^r ∈ R^{d_r×2d}, d_r ∈ N_+ is a hyperparameter that may vary with the combination rule, and W_c^c ∈ R^{d×2d}.
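The PyTorch sketch below follows Eqs. 6.13–6.14 and 6.17 for one MWE: mutual attention selects sememe weights for each constituent, and a rule-specific low-rank term is added to a shared composition transform. All module names, sizes, and initializations are assumptions made for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn

class SCMSA(nn.Module):
    def __init__(self, d, d_r, num_rules):
        super().__init__()
        self.Wa = nn.Linear(d, d)                                       # attention transform (Eq. 6.14)
        self.U1 = nn.Parameter(torch.randn(num_rules, d, d_r) * 0.1)    # rule-specific low-rank factors
        self.U2 = nn.Parameter(torch.randn(num_rules, d_r, 2 * d) * 0.1)
        self.Wc = nn.Linear(2 * d, d)                                   # shared composition matrix W_c^c and bias b_c

    def attend(self, w, X_other):
        # attention over the other constituent's sememe embeddings, queried by w
        e = torch.tanh(self.Wa(w))
        a = torch.softmax(X_other @ e, dim=0)
        return a @ X_other                                              # attended sememe summary w'

    def forward(self, w1, w2, X1, X2, rule):
        w1p = self.attend(w2, X1)                                       # w'_1: w2 queries w1's sememes
        w2p = self.attend(w1, X2)                                       # w'_2: w1 queries w2's sememes
        h = torch.cat([w1 + w2, w1p + w2p])                             # [w1 + w2 ; w'_1 + w'_2] (Eq. 6.13)
        rule_term = h @ (self.U1[rule] @ self.U2[rule]).T               # low-rank, rule-specific part (Eq. 6.17)
        return torch.tanh(self.Wc(h) + rule_term)

d, d_r = 6, 2
model = SCMSA(d, d_r, num_rules=3)
w1, w2 = torch.randn(d), torch.randn(d)
X1, X2 = torch.randn(4, d), torch.randn(3, d)    # sememe embeddings of the two constituents
p = model(w1, w2, X1, X2, rule=1)
print(p.shape)                                   # torch.Size([6])
```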

6.3.3 Sememe-Guided Language Modeling

Language Modeling (LM) aims to measure the probability of a word sequence, reflecting its fluency and likelihood as a feasible sentence in a human language. Language modeling is an essential component in a wide range of natural language processing (NLP) tasks, such as machine translation [9, 10], speech recognition [34], information retrieval [5, 24, 45, 54], and document summarization [2, 57].

Fig. 6.6 Decoders of (a) a conventional LM and (b) the sememe-driven LM

A probabilistic language model calculates the conditional probability of the next word given its contextual words, which is typically learned from large-scale text corpora. Taking the simplest language model as an example, an n-gram model estimates the conditional probabilities by maximum likelihood over text corpora [31]. Recent years have witnessed the advance of Recurrent Neural Networks (RNNs) as the state-of-the-art approach for language modeling [44], in which the context is represented as a low-dimensional hidden state to predict the next word (Fig. 6.6). These conventional language models, including neural models, typically treat words as atomic symbols and model sequential patterns at the word level. However, this assumption does not necessarily hold. Consider the following example sentence, for which people want to predict the next word in the blank:

The U.S. trade deficit last year is initially estimated to be 40 billion ____.

People may first realize that a unit should be filled in, then realize that it should be a currency unit. Based on the country this sentence is talking about, the U.S., one may confirm that it should be an American currency unit and predict the word dollars. Here unit, currency, and American, which are basic semantic units of the word dollars, are also sememes of the word dollars. However, this process has not been explicitly taken into consideration by conventional language models. That is, while in most cases words are atomic language units, they are not necessarily atomic semantic units for language modeling. Thus, explicit modeling of sememes could improve both the performance and the interpretability of language models. However, as far as we know, few efforts have been devoted to exploring the effectiveness of sememes in language models, especially neural language models. It is nontrivial for neural language models to incorporate discrete sememe knowledge, as it is not compatible with continuous representations in neural models. In this part, the Sememe-Driven Language Model (SDLM) is proposed to leverage lexical sememe knowledge. In order to predict the next word, SDLM utilizes a novel sememe-sense-word generation process: (1) First, SDLM estimates the sememes' distribution according to the context. (2) Regarding these sememes as experts, SDLM employs a sparse product-of-experts method to select the most probable senses. (3) Finally, SDLM calculates the distribution of words by marginalizing out the distribution of senses. SDLM is composed of three modules in series: the Sememe Predictor, the Sense Predictor, and the Word Predictor (Fig. 6.6). The Sememe Predictor first takes the context vector as input and assigns a weight to each sememe. Then each sememe is regarded as an expert and makes predictions about the probability distribution over a set of senses in the Sense Predictor. Finally, the probability of each word is obtained in the Word Predictor. Sememe Predictor. The Sememe Predictor takes the context vector g ∈ R^{H_1} as input and assigns a weight to each sememe. Assume that given the context w^1, w^2, ..., w^{t−1}, the events that word w^t contains sememe x_k (k ∈ {1, 2, ..., K}) are independent, since the sememe is the minimum semantic unit and there is no semantic overlap between any two different sememes. For simplicity, the superscript t is ignored. The Sememe Predictor is designed as a linear decoder with the sigmoid activation function. Therefore, p_k, the probability that the next word contains sememe x_k, is formulated as

$$p_k = P(x_k \mid \mathbf{g}) = \text{Sigmoid}(\mathbf{g} \cdot \mathbf{v}_k + b_k), \qquad (6.18)$$

where $\mathbf{v}_k \in \mathbb{R}^{H_1}$ and $b_k \in \mathbb{R}$ are trainable parameters, and $\text{Sigmoid}(\cdot)$ denotes the sigmoid activation function.

Sense Predictor and Word Predictor. The architecture of the Sense Predictor is motivated by Product of Experts (PoE) [25]. Each sememe is regarded as an expert that only makes predictions on the senses connected with it. Let $\mathcal{S}^{(x_k)}$ denote the set of senses that contain sememe $x_k$ (the $k$th expert). Different from conventional neural language models, which directly use the inner product of the context vector $\mathbf{g} \in \mathbb{R}^{H_1}$ and the output embedding $\mathbf{w} \in \mathbb{R}^{H_2}$ of word $w$ to generate the score for each word, the Sense Predictor uses $\phi^{(k)}(\mathbf{g}, \mathbf{w})$ to calculate the score given by expert $x_k$. A bilinear function parameterized with a matrix $\mathbf{U}_k \in \mathbb{R}^{H_1 \times H_2}$ is chosen as a straightforward implementation of $\phi^{(k)}(\cdot, \cdot)$:

$$\phi^{(k)}(\mathbf{g}, \mathbf{w}) = \mathbf{g}^\top \mathbf{U}_k \mathbf{w}. \qquad (6.19)$$

The score of sense $s$ provided by sememe expert $x_k$ can be written as $\phi^{(k)}(\mathbf{g}, \mathbf{s})$. Therefore, $P^{(x_k)}(s \mid \mathbf{g})$, the probability of sense $s$ given by expert $x_k$, is formulated as

$$P^{(x_k)}(s \mid \mathbf{g}) = \frac{\exp\big(q_k C_{k,s}\, \phi^{(k)}(\mathbf{g}, \mathbf{s})\big)}{\sum_{s' \in \mathcal{S}^{(x_k)}} \exp\big(q_k C_{k,s'}\, \phi^{(k)}(\mathbf{g}, \mathbf{s}')\big)}, \qquad (6.20)$$

where $C_{k,s}$ is a normalization constant, introduced because sense $s$ is not connected to all experts (the connections are sparse, with approximately $\lambda N$ edges, $\lambda < 5$). Here we can choose either $C_{k,s} = 1/|X^{(s)}|$ (left normalization) or $C_{k,s} = 1/\sqrt{|X^{(s)}|\,|\mathcal{S}^{(x_k)}|}$ (symmetric normalization). In the Sense Predictor, $q_k$ can be viewed as a gate which controls the magnitude of the term $C_{k,s}\, \phi^{(k)}(\mathbf{g}, \mathbf{s})$, thus controlling the flatness of the sense distribution provided by sememe expert $x_k$. Consider the extreme case when $p_k \to 0$: the prediction converges to the discrete uniform distribution. Intuitively, this means that the sememe expert refuses to provide any useful information when it is not likely to be related to the next word. Finally, the predictions on sense $s$ are summarized by taking the product of the probabilities given by the relevant experts and then normalizing the result; that is, $P(s \mid \mathbf{g})$, the probability of sense $s$, satisfies

$$P(s \mid \mathbf{g}) \propto \prod_{x_k \in X^{(s)}} P^{(x_k)}(s \mid \mathbf{g}). \qquad (6.21)$$

Using Eqs. 6.19 and 6.20, $P(s \mid \mathbf{g})$ can be formulated as

$$P(s \mid \mathbf{g}) = \frac{\exp\big(\sum_{x_k \in X^{(s)}} q_k C_{k,s}\, \mathbf{g}^\top \mathbf{U}_k \mathbf{s}\big)}{\sum_{s'} \exp\big(\sum_{x_k \in X^{(s')}} q_k C_{k,s'}\, \mathbf{g}^\top \mathbf{U}_k \mathbf{s}'\big)}. \qquad (6.22)$$

It should be emphasized that all the supervision information provided by HowNet is embodied in the connections between the sememe experts and the senses. If the model wants to assign a high probability to sense $s$, it must assign high probabilities to some of its relevant sememes; if it wants to assign a low probability to sense $s$, it can assign low probabilities to the relevant sememes. Moreover, the prediction made by sememe expert $x_k$ has its own tendency because of its own $\phi^{(k)}(\cdot, \cdot)$. Besides, the sparsity of connections between experts and senses is also determined by HowNet itself. As illustrated in Fig. 6.7, in the Word Predictor, $P(w \mid \mathbf{g})$, the probability of word $w$, is calculated by summing up the probabilities of the corresponding senses $s$ given by the Sense Predictor, that is,

$$P(w \mid \mathbf{g}) = \sum_{s \in \mathcal{S}^{(w)}} P(s \mid \mathbf{g}). \qquad (6.23)$$
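To make the sememe-sense-word generation process concrete, below is a minimal NumPy sketch of the three predictors. All tensor shapes, the toy sememe-sense and sense-word connection matrices, the random sense embeddings, and the choice of tying the gate to the sememe probability are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def sdlm_word_distribution(g, V, b, U, sememe_sense, sense_word):
    """Toy SDLM decoder step (illustrative sketch, not the official code).

    g            : context vector, shape (H1,)
    V, b         : Sememe Predictor parameters, shapes (K, H1) and (K,)
    U            : bilinear expert matrices, shape (K, H1, H2)
    sememe_sense : binary sememe-sense connection matrix, shape (K, num_senses)
    sense_word   : binary sense-word membership matrix, shape (num_senses, num_words)
    Returns a probability distribution over words.
    """
    num_senses, H2 = sememe_sense.shape[1], U.shape[2]
    S = np.random.randn(num_senses, H2)   # sense embeddings (random placeholder)

    # (1) Sememe Predictor: p_k = Sigmoid(g . v_k + b_k)   (Eq. 6.18)
    p = 1.0 / (1.0 + np.exp(-(V @ g + b)))
    q = p                                  # assume the gate q_k is tied to p_k

    # (2) Sense Predictor: phi^{(k)}(g, s) = g^T U_k s      (Eq. 6.19)
    phi = np.einsum('h,khd,sd->ks', g, U, S)                 # (K, num_senses)
    C = 1.0 / np.maximum(sememe_sense.sum(axis=0), 1)        # left normalization
    logits = q[:, None] * C[None, :] * phi
    # sparse product of experts: only connected experts contribute (Eq. 6.22)
    sense_logit = (logits * sememe_sense).sum(axis=0)
    sense_prob = np.exp(sense_logit - sense_logit.max())
    sense_prob /= sense_prob.sum()

    # (3) Word Predictor: sum probabilities of a word's senses (Eq. 6.23)
    word_prob = sense_word.T @ sense_prob
    return word_prob / word_prob.sum()
```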

6.3.4 Sememe Prediction

The manual construction of HowNet is time-consuming and labor-intensive; for example, HowNet has been built over more than 10 years by several linguistic experts. However, with the continual development of communication technologies, new words and phrases keep emerging, and the semantic meanings of existing words are also dynamically evolving. In this case, sustained manual annotation and updating become increasingly overwhelming. Moreover, due to the high complexity of the sememe ontology and of word meanings, it is also challenging to maintain annotation consistency among experts when they collaboratively annotate lexical sememes.

Fig. 6.7 The architecture of the SDLM model

To address the issues of inflexibility and inconsistency of manual annotation, the automatic lexical sememe prediction task is proposed, which is expected to assist expert annotation and reduce the manual workload. Note that for simplicity, most works introduced in this part do not consider the complicated hierarchies of word sememes, and simply group all annotated sememes of each word into a sememe set for learning and prediction. The basic idea of sememe prediction is that words with similar semantic meanings may share overlapping sememes. Hence, the key challenge of sememe prediction is how to represent the semantic meanings of words and sememes so as to model the semantic relatedness between them. In this part, we focus on introducing the sememe prediction work of Xie et al. [66]. In their work, they propose to model the semantics of words and sememes using distributed representation learning [26]. Distributed representation learning aims to encode objects into a low-dimensional semantic space, and it has shown impressive capability in modeling the semantics of human languages; e.g., word embeddings [43] have been widely studied and utilized in various NLP tasks. As shown in previous work [43], it is effective to measure word similarities using the cosine similarity or Euclidean distance of their word embeddings learned from a large-scale corpus. Hence, a straightforward method for sememe prediction is that, given an unlabeled word, we find its most related words in HowNet according to their word embeddings, and recommend the annotated sememes of these related words to the given word. This method is intrinsically similar to collaborative filtering [58] in recommendation systems, capable of capturing the semantic relatedness between words and sememes based on their annotation co-occurrences. Word embeddings can also be learned with matrix factorization techniques [37]. Inspired by the successful practice of matrix factorization for personalized recommendation [36], a new model is proposed which factorizes the word-sememe matrix from HowNet and obtains sememe embeddings. In this way, the relatedness of words and sememes can be measured directly using the dot products of their embeddings, according to which we could recommend the most related sememes to an unlabeled word. The two kinds of methods are named Sememe Prediction with Word Embeddings (SPWE) and Sememe Prediction with Sememe Embeddings (SPSE/SPASE), respectively.

6.3.4.1 Sememe Prediction with Word Embeddings

Given an unlabeled word, it is straightforward to recommend sememes according to its most related words, assuming that similar words should have similar sememes. This idea is similar to collaborative filtering in personalized recommendation: in the scenario of sememe prediction, words can be regarded as users and sememes as the items/products to be recommended. Inspired by this, the Sememe Prediction with Word Embeddings (SPWE) model is proposed, which uses similarities of word embeddings to judge user distances. Formally, the score function $P(x_j \mid w)$ of sememe $x_j$ given a word $w$ is defined as

$$P(x_j \mid w) = \sum_{w_i \in V} \cos(\mathbf{w}, \mathbf{w}_i)\, M_{ij}\, c^{r_i}, \qquad (6.24)$$

where $\cos(\mathbf{w}, \mathbf{w}_i)$ is the cosine similarity between the word embeddings of $w$ and $w_i$ pretrained by GloVe. $M_{ij}$ indicates the annotation of sememe $x_j$ on word $w_i$: $M_{ij} = 1$ if the word $w_i$ has the sememe $x_j$ in HowNet, and $M_{ij} = 0$ otherwise. The higher the score $P(x_j \mid w)$ is, the more likely the word $w$ should be recommended with $x_j$. Differing from classical collaborative filtering in recommendation systems, only the most similar words should be considered when predicting sememes for new words, since irrelevant words have totally different sememes, which may act as noise for sememe prediction. To address this problem, a declined confidence factor $c^{r_i}$ is assigned to each word $w_i$, where $r_i$ is the descending rank of the word similarity $\cos(\mathbf{w}, \mathbf{w}_i)$ and $c \in (0, 1)$ is a hyperparameter. In this way, only a few top words that are similar to $w$ have strong influence on predicting sememes. SPWE only uses word embeddings for word similarities and is simple yet effective for sememe prediction. This is because, differing from the noisy and incomplete user-item matrices in most recommender systems, HowNet is carefully annotated by human experts, and thus the word-sememe matrix is of high confidence. Therefore, the word-sememe matrix can be confidently applied to collaboratively recommend reliable sememes based on similar words.
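As a quick illustration, here is a small NumPy sketch of the SPWE score in Eq. 6.24. The toy embedding matrix, the word-sememe matrix, and the value of the confidence factor are placeholder assumptions made only for demonstration.

```python
import numpy as np

def spwe_scores(w_vec, word_embs, M, c=0.8):
    """Score every sememe for an unlabeled word (Eq. 6.24, illustrative).

    w_vec     : embedding of the unlabeled word, shape (d,)
    word_embs : embeddings of HowNet-annotated words, shape (|V|, d)
    M         : binary word-sememe annotation matrix, shape (|V|, |X|)
    c         : declined confidence factor base, 0 < c < 1
    """
    # cosine similarity between the target word and every annotated word
    sims = word_embs @ w_vec / (
        np.linalg.norm(word_embs, axis=1) * np.linalg.norm(w_vec) + 1e-12)
    # r_i: descending rank of similarity (most similar word gets rank 1)
    ranks = np.empty_like(sims)
    ranks[np.argsort(-sims)] = np.arange(1, len(sims) + 1)
    weights = sims * c ** ranks              # cos(w, w_i) * c^{r_i}
    return weights @ M                       # sum over words for each sememe

# toy usage: recommend the top-3 sememes for a random word
rng = np.random.default_rng(0)
E = rng.normal(size=(100, 50))
M = (rng.random((100, 30)) < 0.1).astype(float)
print(np.argsort(-spwe_scores(rng.normal(size=50), E, M))[:3])
```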

6.3.4.2 Sememe Prediction with Sememe Embeddings

The Sememe Prediction with Word Embeddings model follows the assumption that the sememes of a word can be predicted according to the sememes of its related words. However, simply considering sememes as discrete labels inevitably neglects the latent relations between sememes. To take the latent relations of sememes into consideration, the Sememe Prediction with Sememe Embeddings (SPSE) model is proposed, which projects both words and sememes into the same semantic vector space, learning sememe embeddings according to the co-occurrences of words and sememes in HowNet. Similar to GloVe [53], which decomposes the co-occurrence matrix of words to learn word embeddings, sememe embeddings can be learned by factorizing the word-sememe matrix and the sememe-sememe matrix simultaneously. These two matrices are both constructed from HowNet. As for word embeddings, similar to SPWE, SPSE uses word embeddings pretrained from a large-scale corpus and fixes them during the factorization of the word-sememe matrix. With matrix factorization, both sememe and word embeddings are encoded into the same low-dimensional semantic space, and the cosine similarity between normalized embeddings of words and sememes can then be computed for sememe prediction. More specifically, similar to $\mathbf{M}$, a sememe-sememe matrix $\mathbf{C}$ can also be extracted, where $C_{jk}$ is defined as the point-wise mutual information $C_{jk} = \text{PMI}(x_j, x_k)$ to indicate the correlation between two sememes $x_j$ and $x_k$. Note that, by factorizing $\mathbf{C}$, two distinct embeddings are obtained for each sememe $x$, denoted as $\mathbf{x}$ and $\bar{\mathbf{x}}$, respectively. The loss function for learning sememe embeddings is defined as follows:

$$\mathcal{L} = \sum_{w_i \in W,\, x_j \in X} \big( \mathbf{w}_i \cdot (\mathbf{x}_j + \bar{\mathbf{x}}_j) + b_i + b_j - M_{ij} \big)^2 + \lambda \sum_{x_j, x_k \in X} \big( \mathbf{x}_j \cdot \bar{\mathbf{x}}_k - C_{jk} \big)^2, \qquad (6.25)$$

where $b_i$ and $b_j$ denote the biases of $w_i$ and $x_j$. The two parts correspond to the losses of factorizing matrices $\mathbf{M}$ and $\mathbf{C}$, adjusted by the hyperparameter $\lambda$. Since the sememe embeddings are shared by both factorizations, the SPSE model jointly encodes both words and sememes into a unified semantic space. Since each word is typically annotated with 2-5 sememes in HowNet, most elements of the word-sememe matrix are zeros. If all zero elements and nonzero elements were treated equally during factorization, the performance would be much worse. To address this issue, different factorization strategies are assigned to zero and nonzero elements: each zero element is factorized only with a small probability (e.g., 0.5%) and is otherwise ignored, whereas nonzero elements are always factorized. With this strategy, the model pays more attention to the annotated word-sememe pairs.
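The PyTorch sketch below shows one way the SPSE objective in Eq. 6.25 could be optimized. The tensor sizes, the 0.5% zero-sampling rate, the random PMI matrix, and the optimizer settings are illustrative assumptions rather than the authors' released code.

```python
import torch

def spse_loss(W, X, X_bar, b_w, b_x, M, C, lam=0.5):
    """SPSE objective (Eq. 6.25), written densely for clarity.

    W        : fixed pretrained word embeddings, shape (|V|, d)
    X, X_bar : two trainable sememe embedding matrices, shape (|S|, d)
    b_w, b_x : biases for words and sememes, shapes (|V|,) and (|S|,)
    M        : word-sememe annotation matrix, shape (|V|, |S|)
    C        : sememe-sememe PMI matrix, shape (|S|, |S|)
    """
    pred_M = W @ (X + X_bar).t() + b_w[:, None] + b_x[None, :]
    # keep all nonzero entries, and roughly 0.5% of the zero entries
    keep = (M != 0) | (torch.rand_like(M) < 0.005)
    word_term = ((pred_M - M) ** 2 * keep).sum()
    sememe_term = ((X @ X_bar.t() - C) ** 2).sum()
    return word_term + lam * sememe_term

# toy training loop
V, S, d = 200, 40, 50
W = torch.randn(V, d)                                  # pretrained, kept fixed
X, X_bar = (torch.randn(S, d, requires_grad=True) for _ in range(2))
b_w, b_x = (torch.zeros(n, requires_grad=True) for n in (V, S))
M = (torch.rand(V, S) < 0.05).float()
C = torch.randn(S, S)
opt = torch.optim.Adam([X, X_bar, b_w, b_x], lr=0.01)
for _ in range(100):
    opt.zero_grad()
    loss = spse_loss(W, X, X_bar, b_w, b_x, M, C)
    loss.backward()
    opt.step()
```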

In SPSE, sememe embeddings are learned together with word embeddings via matrix factorization in a unified low-dimensional semantic space. Matrix factorization has been verified as an effective approach in personalized recommendation, because it can accurately model the relatedness between users and items and is highly robust to noise in user-item matrices. Using this model, we can flexibly compute the semantic relatedness of words and sememes, which provides an effective tool to manipulate and manage sememes, including but not limited to sememe prediction.

6.3.4.3 Sememe Prediction with Aggregated Sememe Embeddings

Inspired by the characteristics of sememes, we assume that word embeddings are semantically composed of sememe embeddings. In the word-sememe joint space, we can simply implement semantic composition as an additive operation: each word embedding is expected to be the sum of the embeddings of all its sememes. Following this assumption, the Sememe Prediction with Aggregated Sememe Embeddings (SPASE) model is proposed. SPASE is also based on matrix factorization and is formally denoted as

$$\mathbf{w}_i = \sum_{x_j \in X_{w_i}} M'_{ij}\, \mathbf{x}_j, \qquad (6.26)$$

where $X_{w_i}$ is the sememe set of the word $w_i$ and $M'_{ij}$ represents the weight of sememe $x_j$ for word $w_i$, which only takes values on the nonzero elements of the word-sememe annotation matrix $\mathbf{M}$. To learn sememe embeddings, we attempt to decompose the word embedding matrix $\mathbf{V}$ into $\mathbf{M}'$ and the sememe embedding matrix $\mathbf{X}$, with the pretrained word embeddings fixed during training; this can also be written as $\mathbf{V} = \mathbf{M}'\mathbf{X}$. The contribution of SPASE is that it complies with the definition of sememes in HowNet, namely that sememes are the semantic components of words. In SPASE, each sememe can be regarded as a tiny semantic unit, and all words can be represented by composing several such semantic units, i.e., sememes, which makes up an interesting semantic regularity. However, SPASE is difficult to train because the word embeddings are fixed and the number of words is much larger than the number of sememes. When modeling complex semantic compositions of sememes into words, the representation capability of SPASE may be strongly constrained by the limited parameters of sememe embeddings and the oversimplified additive assumption.
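A minimal sketch of the SPASE decomposition is shown below. The sizes, the random annotation mask, and the plain squared-error objective are assumptions made for illustration; they are not the configuration used in the original work.

```python
import torch

# Minimal SPASE sketch: learn sememe embeddings X and per-pair weights M' so
# that each fixed word embedding is approximated by a weighted sum of its
# sememes' embeddings (Eq. 6.26).
V_num, S_num, d = 200, 40, 50
word_embs = torch.randn(V_num, d)                 # pretrained, fixed
mask = (torch.rand(V_num, S_num) < 0.05).float()  # nonzero pattern of M
weights = torch.zeros(V_num, S_num, requires_grad=True)   # M'
sememe_embs = torch.randn(S_num, d, requires_grad=True)   # X

opt = torch.optim.Adam([weights, sememe_embs], lr=0.01)
for _ in range(200):
    opt.zero_grad()
    recon = (weights * mask) @ sememe_embs        # weights restricted to annotated pairs
    loss = ((recon - word_embs) ** 2).sum()
    loss.backward()
    opt.step()
```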

6.3.4.4 Lexical Sememe Prediction with Internal Information

In the previous section, we introduced the automatic lexical sememe prediction methods proposed by Xie et al. [66]. These methods ignore the internal information within words (e.g., the characters in Chinese words), which is also significant for word understanding, especially for words that are of low frequency or do not appear in the corpus at all.

Fig. 6.8 Sememes of the word 铁匠 (ironsmith) in HowNet, where occupation, human, and industrial can be inferred from both external (context) and internal (character) information, while metal is well captured only by the internal information within the character 铁 (iron)

In this section, we introduce the work of Jin et al. [30], which takes Chinese as an example and explores methods of taking full advantage of both external and internal information of words for sememe prediction. In Chinese, words are composed of one or multiple characters, and most characters have corresponding semantic meanings. As shown by [67], more than 90% of Chinese characters in modern Chinese corpora are morphemes. Chinese words can be divided into single-morpheme words and compound words, where compound words account for a dominant proportion. The meanings of compound words are closely related to their internal characters, as shown in Fig. 6.8. Taking the compound word 铁匠 (ironsmith) for instance, it consists of two Chinese characters, 铁 (iron) and 匠 (craftsman), and the semantic meaning of 铁匠 can be inferred from the combination of its two characters (iron + craftsman → ironsmith). Even for some single-morpheme words, their semantic meanings may also be deduced from their characters. For example, both characters of the single-morpheme word 徘徊 (hover) convey the meaning of hover or linger. Therefore, it is intuitive to take the internal character information into consideration for sememe prediction. Jin et al. [30] propose a novel framework for Character-enhanced Sememe Prediction (CSP), which leverages both internal character information and external context for sememe prediction. CSP predicts the sememe candidates for a target word from its word embedding and the corresponding character embeddings. Specifically, following SPWE and SPSE introduced by [66] to model external information, Sememe Prediction with Word-to-Character Filtering (SPWCF) and Sememe Prediction with Character and Sememe Embeddings (SPCSE) are proposed to model internal character information.

Sememe Prediction with Word-to-Character Filtering. Inspired by collaborative filtering [58], Jin et al. [30] propose to recommend sememes for an unlabeled word according to its similar words based on internal information, where words are considered similar if they contain the same characters at the same positions.

Fig. 6.9 An example of the positions (Begin, Middle, End) of characters in a word

In Chinese, the meaning of a character may vary according to its position within a word [14]. Three positions within a word are considered: Begin, Middle, and End. For example, as shown in Fig. 6.9, the character at the Begin position of the word 火车站 (railway station) is 火 (fire), while 车 (vehicle) and 站 (station) are at the Middle and End positions, respectively. The character 站 usually means station when it is at the End position, while it usually means stand at the Begin position, as in 站立 (stand), 站岗 (standing guard), and 站起来 (stand up). Formally, for a word $w = c_1 c_2 \ldots c_{|w|}$, we define $\pi_B(w) = \{c_1\}$, $\pi_M(w) = \{c_2, \ldots, c_{|w|-1}\}$, $\pi_E(w) = \{c_{|w|}\}$, and

$$P_p(x_j \mid c) \sim \frac{\sum_{w_i \in W \wedge c \in \pi_p(w_i)} M_{ij}}{\sum_{w_i \in W \wedge c \in \pi_p(w_i)} |X_{w_i}|}, \qquad (6.27)$$

which represents the score of a sememe $x_j$ given a character $c$ and a position $p$, where $\pi_p$ may be $\pi_B$, $\pi_M$, or $\pi_E$. $\mathbf{M}$ is the same matrix used in SPWE. Finally, the score function $P(x_j \mid w)$ of sememe $x_j$ given a word $w$ is defined as

$$P(x_j \mid w) \sim \sum_{p \in \{B, M, E\}} \sum_{c \in \pi_p(w)} P_p(x_j \mid c). \qquad (6.28)$$
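A compact Python sketch of the SPWCF scoring in Eqs. 6.27 and 6.28 is given below; the toy lexicon, the ASCII placeholders for Chinese characters, and the way sememe sets are stored are assumptions made purely for illustration.

```python
from collections import defaultdict

def build_position_scores(lexicon):
    """Precompute P_p(x|c) from an annotated lexicon (Eq. 6.27).

    lexicon: dict mapping an annotated word -> set of its sememes.
    Returns scores[(position, char)][sememe] -> unnormalized score.
    """
    num, den = defaultdict(lambda: defaultdict(float)), defaultdict(float)
    for word, sememes in lexicon.items():
        positions = {'B': word[:1], 'M': word[1:-1], 'E': word[-1:]}
        for p, chars in positions.items():
            for c in chars:
                den[(p, c)] += len(sememes)          # sum of |X_{w_i}|
                for x in sememes:
                    num[(p, c)][x] += 1.0            # sum of M_{ij}
    return {key: {x: v / den[key] for x, v in d.items()} for key, d in num.items()}

def spwcf_scores(word, scores):
    """Aggregate position-wise scores for an unlabeled word (Eq. 6.28)."""
    result = defaultdict(float)
    positions = {'B': word[:1], 'M': word[1:-1], 'E': word[-1:]}
    for p, chars in positions.items():
        for c in chars:
            for x, v in scores.get((p, c), {}).items():
                result[x] += v
    return dict(result)

# toy usage with ASCII placeholders for Chinese characters
lexicon = {'abc': {'metal', 'human'}, 'xbc': {'human'}, 'ayz': {'metal'}}
print(spwcf_scores('abz', build_position_scores(lexicon)))
```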

SPWCF is a simple and efficient method. It performs well because compositional semantics are pervasive in Chinese compound words, which makes it straightforward and effective to find similar words according to common characters.

Sememe Prediction with Character and Sememe Embeddings (SPCSE). SPWCF can effectively recommend the sememes that have strong correlations with characters. However, just like SPWE, it ignores the relations between sememes. Hence, inspired by SPSE, Sememe Prediction with Character and Sememe Embeddings (SPCSE) is proposed to take the relations between sememes into account. In SPCSE, the model instead learns the sememe embeddings based on internal character information, and then computes the semantic distance between sememes and words for prediction. Inspired by GloVe [53] and SPSE, matrix factorization is adopted in SPCSE, decomposing the word-sememe matrix and the sememe-sememe matrix simultaneously. Instead of the pretrained word embeddings used in SPSE, pretrained character embeddings are used in SPCSE. Since the ambiguity of characters is stronger than that of words, multiple embeddings are learned for each character [14]. The most representative character and its embedding are selected to represent the word meaning. Because low-frequency characters are much rarer than low-frequency words, and even low-frequency words are usually composed of common characters, it is feasible to use pretrained character embeddings to represent rare words.

Fig. 6.10 An example of adopting multiple-prototype character embeddings. The numbers are cosine distances. The sememe 金属 (metal) is the closest to one embedding of 铁 (iron)

During the factorization of the word-sememe matrix, the character embeddings are fixed. $N_e$ is set as the number of embeddings for each character, and each character $c$ has $N_e$ embeddings $\mathbf{c}^1, \ldots, \mathbf{c}^{N_e}$. Given a word $w$ and a sememe $x$, the embedding of the character of $w$ that is closest to the sememe embedding under cosine distance is selected as the representation of the word $w$, as shown in Fig. 6.10. Specifically, given a word $w = c_1 \ldots c_{|w|}$ and a sememe $x_j$, we define

$$k^*, r^* = \arg\min_{k, r} \big[ 1 - \cos(\mathbf{c}_k^r,\, \mathbf{x}'_j + \bar{\mathbf{x}}'_j) \big], \qquad (6.29)$$

where $k^*$ and $r^*$ indicate the indices of the character and its embedding closest to the sememe $x_j$ in the semantic space. With the same word-sememe matrix $\mathbf{M}$ and sememe-sememe correlation matrix $\mathbf{C}$ as in SPSE, the sememe embeddings are learned with the loss function

$$\mathcal{L} = \sum_{w_i \in W,\, x_j \in X} \big( \mathbf{c}_{k^*}^{r^*} \cdot (\mathbf{x}'_j + \bar{\mathbf{x}}'_j) + b^c_{k^*} + b'_j - M_{ij} \big)^2 + \lambda' \sum_{x_j, x_q \in X} \big( \mathbf{x}'_j \cdot \bar{\mathbf{x}}'_q - C_{jq} \big)^2, \qquad (6.30)$$

where $\mathbf{x}'_j$ and $\bar{\mathbf{x}}'_j$ are the sememe embeddings of sememe $x_j$, and $\mathbf{c}_{k^*}^{r^*}$ is the embedding of the character that is closest to sememe $x_j$ within $w_i$. Note that, as the characters and the words are not embedded into the same semantic space, new sememe embeddings are learned instead of reusing those learned in SPSE; hence, primed notations are used for the sake of distinction. $b^c_k$ and $b'_j$ denote the biases of $\mathbf{c}_k$ and $x_j$, and $\lambda'$ is the hyperparameter adjusting the two parts. Finally, the score function of the word $w = c_1 \ldots c_{|w|}$ is defined as

$$P(x_j \mid w) \sim \mathbf{c}_{k^*}^{r^*} \cdot (\mathbf{x}'_j + \bar{\mathbf{x}}'_j). \qquad (6.31)$$
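The character-selection step in Eq. 6.29 is simple to express directly. The NumPy snippet below is a hedged illustration in which the multi-prototype character embeddings and the sememe embedding are random placeholders.

```python
import numpy as np

def closest_character_embedding(char_embs, sememe_vec):
    """Pick the character prototype closest to a sememe embedding (Eq. 6.29).

    char_embs  : array of shape (|w|, N_e, d) - N_e prototypes per character
    sememe_vec : combined sememe embedding x'_j + x_bar'_j, shape (d,)
    Returns (k*, r*) and the selected embedding.
    """
    flat = char_embs.reshape(-1, char_embs.shape[-1])
    cos = flat @ sememe_vec / (
        np.linalg.norm(flat, axis=1) * np.linalg.norm(sememe_vec) + 1e-12)
    idx = int(np.argmax(cos))                       # argmin of (1 - cos)
    k, r = divmod(idx, char_embs.shape[1])
    return (k, r), char_embs[k, r]

# toy usage: a 3-character word with 2 prototypes per character, d = 8
(k, r), emb = closest_character_embedding(np.random.randn(3, 2, 8),
                                           np.random.randn(8))
print(k, r, emb.shape)
```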

Fig. 6.11 An illustration of model ensembling in sememe prediction

Model Ensembling. SPWCF/SPCSE and SPWE/SPSE take different sources of information as input, which means that they have different characteristics: SPWCF/SPCSE only have access to internal information, while SPWE/SPSE can only make use of external information. On the other hand, just like the difference between SPWE and SPSE, SPWCF originates from collaborative filtering, whereas SPCSE uses matrix factorization. All of these methods have in common that they tend to recommend the sememes of similar words, but they diverge in their interpretation of similarity. Therefore, to obtain better prediction performance, it is necessary to combine these models. We denote the ensemble of SPWCF and SPCSE as the internal model, and the ensemble of SPWE and SPSE as the external model. The ensemble of the internal and the external models is the novel framework CSP. In practice, for words with reliable word embeddings, i.e., high-frequency words, we can use the integration of the internal and the external models; for words with extremely low frequencies (e.g., having no reliable word embeddings), we can use the internal model alone and ignore the external model, because the external information is noisy in this case. Figure 6.11 shows model ensembling in different scenarios. For the sake of comparison, the integration of SPWCF, SPCSE, SPWE, and SPSE is used as CSP in all experiments, and any two models are integrated by simple weighted addition.
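Since the chapter only states that models are combined by simple weighted addition, the following small helper is an illustrative guess at what such an ensemble could look like; the per-model normalization step and the example weights are assumptions.

```python
def ensemble_scores(score_dicts, weights):
    """Weighted addition of per-model sememe scores (illustrative sketch).

    score_dicts : list of dicts mapping sememe -> score, e.g., the outputs
                  of SPWE, SPSE, SPWCF, and SPCSE
    weights     : one weight per model
    """
    combined = {}
    for scores, w in zip(score_dicts, weights):
        norm = max(scores.values(), default=1.0) or 1.0  # keep models comparable
        for x, v in scores.items():
            combined[x] = combined.get(x, 0.0) + w * v / norm
    return combined

# e.g., an internal-only ensemble for a word without a reliable embedding:
# ensemble_scores([spwcf_score_dict, spcse_score_dict], weights=[0.5, 0.5])
```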

6.3.4.5 Cross-Lingual Sememe Prediction

Most languages do not have sememe-based linguistic KBs such as HowNet, which prevents us from understanding and utilizing human languages to a greater extent. Therefore, it is important to build sememe-based linguistic KBs for various languages. To address the issue of the high labor cost of manual annotation, Qi et al. [56] propose a new task, cross-lingual lexical sememe prediction (CLSP), which aims to automatically predict lexical sememes for words in other languages. There are two critical challenges for CLSP:

(1) There is not a consistent one-to-one match between words in different languages. For example, the English word "beautiful" can refer to either of the Chinese words 美丽 and 漂亮. Hence, we cannot simply translate HowNet into another language, and how to recognize the semantic meaning of a word in another language becomes a critical problem.

(2) Since there is a gap between the semantic meanings of words and sememes, we need to build semantic representations for words and sememes to capture the semantic relatedness between them.

To tackle these challenges, Qi et al. [56] propose a novel model for CLSP, which aims to transfer sememe-based linguistic KBs from a source language to a target language. Their model contains three modules: (1) monolingual word embedding learning, which is intended for learning semantic representations of words for the source and target languages, respectively; (2) cross-lingual word embedding alignment, which aims to bridge the gap between the semantic representations of words in the two languages; and (3) sememe-based word embedding learning, whose objective is to incorporate sememe information into word representations. They take Chinese as the source language and English as the target language to show the effectiveness of their model. Experimental results show that the proposed model can effectively predict lexical sememes for words with different frequencies in the target language. Moreover, by jointly learning the representations of sememes and of words in the source and target languages, the model yields consistent improvements on two auxiliary experiments, bilingual lexicon induction and monolingual word similarity computation.

The model consists of three parts: monolingual word representation learning, cross-lingual word embedding alignment, and sememe-based word representation learning. Hence, the objective function of the method is defined with three corresponding terms:

$$\mathcal{L} = \mathcal{L}_{mono} + \mathcal{L}_{cross} + \mathcal{L}_{sememe}. \qquad (6.32)$$

Here, the monolingual term $\mathcal{L}_{mono}$ is designed for learning monolingual word embeddings from nonparallel corpora of the source and target languages, respectively. The cross-lingual term $\mathcal{L}_{cross}$ aims to align the cross-lingual word embeddings in a unified semantic space. Finally, $\mathcal{L}_{sememe}$ incorporates sememe information into word representation learning and leads to better word embeddings for sememe prediction. In the following paragraphs, we introduce the three parts in detail.

Monolingual Word Representation. Monolingual word representation is responsible for explaining regularities in the monolingual corpora of the source and target languages. Since the two corpora are nonparallel, $\mathcal{L}_{mono}$ comprises two monolingual submodels that are independent of each other:

$$\mathcal{L}_{mono} = \mathcal{L}^S_{mono} + \mathcal{L}^T_{mono}, \qquad (6.33)$$

where the superscripts $S$ and $T$ denote the source and target languages, respectively.

As a common practice, the well-established Skip-gram model is chosen to obtain monolingual word embeddings. The Skip-gram model aims at maximizing the predictive probability of context words conditioned on the centered word. Formally, taking the source side for example, given a training word sequence $\{w^S_1, \ldots, w^S_n\}$, the Skip-gram model intends to minimize

$$\mathcal{L}^S_{mono} = -\sum_{c=K+1}^{n-K} \sum_{-K \le k \le K,\, k \ne 0} \log P(w^S_{c+k} \mid w^S_c), \qquad (6.34)$$

where $K$ is the size of the sliding window. $P(w^S_{c+k} \mid w^S_c)$ stands for the predictive probability of one of the context words conditioned on the centered word $w^S_c$, formalized by the following softmax function:

$$P(w^S_{c+k} \mid w^S_c) = \frac{\exp(\mathbf{w}^S_{c+k} \cdot \mathbf{w}^S_c)}{\sum_{w^S_s \in V^S} \exp(\mathbf{w}^S_s \cdot \mathbf{w}^S_c)}, \qquad (6.35)$$

in which $V^S$ indicates the word vocabulary of the source language. $\mathcal{L}^T_{mono}$ can be formulated similarly.

Cross-lingual Word Embedding Alignment. Cross-lingual word embedding alignment aims to build a unified semantic space for the words of the source and target languages. Inspired by [69], the cross-lingual word embeddings are aligned with the signals of a seed lexicon and self-matching. Formally, $\mathcal{L}_{cross}$ is composed of two terms: alignment by seed lexicon $\mathcal{L}_{seed}$ and alignment by matching $\mathcal{L}_{match}$:

$$\mathcal{L}_{cross} = \lambda_s \mathcal{L}_{seed} + \lambda_m \mathcal{L}_{match}, \qquad (6.36)$$

where $\lambda_s$ and $\lambda_m$ are hyperparameters controlling the relative weights of the two terms.

(1) Alignment by Seed Lexicon. The seed lexicon term $\mathcal{L}_{seed}$ encourages the word embeddings of translation pairs in a seed lexicon $D$ to be close, which can be achieved via an L2 regularizer:

$$\mathcal{L}_{seed} = \sum_{\langle w^S_s,\, w^T_t \rangle \in D} \|\mathbf{w}^S_s - \mathbf{w}^T_t\|^2, \qquad (6.37)$$

in which $w^S_s$ and $w^T_t$ indicate the words of the source and target languages in the seed lexicon, respectively.

(2) Alignment by Matching Mechanism. The matching process is founded on the assumption that each target word should be matched to a single source word or a special empty word, and vice versa. The goal of the matching process is to find the matched source (target) word for each target (source) word and to maximize the matching probabilities for all matched word pairs. The loss of this part can be formulated as

$$\mathcal{L}_{match} = \mathcal{L}^{T2S}_{match} + \mathcal{L}^{S2T}_{match}, \qquad (6.38)$$

where $\mathcal{L}^{T2S}_{match}$ is the term for target-to-source matching and $\mathcal{L}^{S2T}_{match}$ is the term for source-to-target matching. Next, a detailed explanation of target-to-source matching is given; the source-to-target matching is defined in the same way. A latent variable $m_t \in \{0, 1, \ldots, |V^S|\}$ ($t = 1, 2, \ldots, |V^T|$) is first introduced for each target word $w^T_t$, where $|V^S|$ and $|V^T|$ indicate the vocabulary sizes of the source and target languages, respectively. Here, $m_t$ specifies the index of the source word that $w^T_t$ matches with, and $m_t = 0$ signifies that the empty word is matched. Then we have $\mathbf{m} = \{m_1, m_2, \ldots, m_{|V^T|}\}$, and the target-to-source matching term can be formalized as

$$\mathcal{L}^{T2S}_{match} = -\log P(\mathcal{C}^T \mid \mathcal{C}^S) = -\log \sum_{\mathbf{m}} P(\mathcal{C}^T, \mathbf{m} \mid \mathcal{C}^S), \qquad (6.39)$$

where $\mathcal{C}^T$ and $\mathcal{C}^S$ denote the target and source corpora, respectively. It is further assumed that the matching processes of target words are independent of each other. Therefore, we have

$$P(\mathcal{C}^T, \mathbf{m} \mid \mathcal{C}^S) = \prod_{w^T_t \in \mathcal{C}^T} P(w^T_t, \mathbf{m} \mid \mathcal{C}^S) = \prod_{t=1}^{|V^T|} P(w^T_t \mid w^S_{m_t})^{c(w^T_t)}, \qquad (6.40)$$

where $w^S_{m_t}$ is the source word that $w^T_t$ matches with, and $c(w^T_t)$ is the number of times $w^T_t$ occurs in the target corpus.
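As a small worked example of the alignment objective, the PyTorch fragment below computes only the seed-lexicon term of Eq. 6.37 (the matching term is more involved and omitted here); the embedding matrices, their sizes, and the lexicon index pairs are placeholder assumptions.

```python
import torch

def seed_lexicon_loss(src_embs, tgt_embs, seed_pairs):
    """L_seed of Eq. 6.37: squared L2 distance over translation pairs.

    src_embs, tgt_embs : embedding matrices, shapes (|V^S|, d) and (|V^T|, d)
    seed_pairs         : LongTensor of shape (num_pairs, 2) holding
                         (source index, target index) rows
    """
    s = src_embs[seed_pairs[:, 0]]
    t = tgt_embs[seed_pairs[:, 1]]
    return ((s - t) ** 2).sum()

# toy usage: 3 seed translation pairs in 64-dimensional spaces
src = torch.randn(1000, 64, requires_grad=True)
tgt = torch.randn(800, 64, requires_grad=True)
pairs = torch.tensor([[12, 7], [45, 300], [999, 0]])
loss = seed_lexicon_loss(src, tgt, pairs)
loss.backward()
```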

6.3.5 Other Sememe-Guided Applications

6.3.5.1 Chinese LIWC Lexicon Expansion

Linguistic Inquiry and Word Count (LIWC) [52] has been widely used for computerized text analysis in social science. Not only can LIWC be used to analyze text for classification and prediction, but it has also been used to examine the underlying psychological states of a writer or speaker. In the beginning, LIWC was developed to address content-analytic issues in experimental psychology. Nowadays, there is an increasing number of applications across fields such as computational linguistics [22], demographics [48], health diagnostics [11], and social relationships [32]. Chinese is the most spoken language in the world, but the original LIWC cannot be used to analyze Chinese text. Fortunately, Chinese LIWC [28] has been released to fill the vacancy. In this part, we mainly focus on Chinese LIWC and use LIWC to stand for Chinese LIWC if not otherwise specified.

While LIWC has been used in a variety of fields, its lexicon contains fewer than 7,000 words. This is insufficient because, according to [39], there are at least 56,008 common words in Chinese. Moreover, the LIWC lexicon does not consider emerging words and phrases on the Internet. Therefore, it is reasonable and necessary to expand the LIWC lexicon so that it is more accurate and comprehensive for scientific research. One way to expand the LIWC lexicon is to annotate new words manually. However, this is too time-consuming and often requires language expertise. Hence, expanding the LIWC lexicon automatically is proposed.

In the LIWC lexicon, words are labeled with different categories, and the categories form a certain hierarchy. Therefore, hierarchical classification algorithms can be naturally applied to the LIWC lexicon. Reference [15] proposes Hierarchical SVM (Support Vector Machine), a modified version of SVM based on the hierarchical problem decomposition approach. In [6], the authors present a novel algorithm which can be used on both tree- and Directed Acyclic Graph (DAG)-structured hierarchies. Some recent works [12, 33] attempt to use neural networks for hierarchical classification. However, these methods are often too generic and do not consider the special properties of words and of the LIWC lexicon: many words and phrases have multiple meanings and are thereby classified into multiple leaf categories, which is often referred to as polysemy; additionally, many categories in LIWC are fine-grained, making them more difficult to distinguish. To address these issues, we introduce several models that incorporate sememe information when expanding the lexicon, which will be discussed after the introduction of the basic model.

Basic Decoder for Hierarchical Classification. First, we introduce the basic model for Chinese LIWC lexicon expansion. The well-known Sequence-to-Sequence decoder [64] is exploited for hierarchical classification. The original Sequence-to-Sequence decoder is trained to predict the next word $w_t$ given all the previously predicted words $\{w_1, \ldots, w_{t-1}\}$. This is a useful feature, since an important difference between flat multilabel classification and hierarchical classification is that there are explicit connections among hierarchical labels. This property is utilized by transforming hierarchical labels into a sequence. Let $Y$ denote the label set and $\pi: Y \to Y$ denote the parent relationship, where $\pi(y)$ is the parent node of $y \in Y$. Given a word $w$, its labels form a tree-structured hierarchy. We then take each path from the root node to a leaf node and transform it into a sequence $\{y_1, y_2, \ldots, y_L\}$, where $\pi(y_i) = y_{i-1}, \forall i \in [2, L]$, and $L$ is the number of levels in the hierarchy. In this way, when the model predicts a label $y_i$, it takes into consideration the probability of the parent label sequence $\{y_1, \ldots, y_{i-1}\}$. Formally, the decoder defines a probability over the label sequence:

$$P(y_1, y_2, \ldots, y_L) = \prod_{i=1}^{L} P(y_i \mid (y_1, \ldots, y_{i-1}), w). \qquad (6.41)$$
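To illustrate how hierarchical labels are flattened into training sequences for the decoder, here is a short Python helper; the dictionary-based hierarchy format and the category names are assumptions made only for the example.

```python
def label_sequences(leaf_labels, parent):
    """Turn a word's leaf labels into root-to-leaf sequences {y_1, ..., y_L}.

    leaf_labels : iterable of leaf category names assigned to a word
    parent      : dict mapping each label to its parent (root maps to None)
    """
    sequences = []
    for leaf in leaf_labels:
        path, node = [], leaf
        while node is not None:           # climb to the root, then reverse
            path.append(node)
            node = parent[node]
        sequences.append(list(reversed(path)))
    return sequences

# toy LIWC-like hierarchy (hypothetical category names)
parent = {'Affect': None, 'PosEmo': 'Affect', 'NegEmo': 'Affect', 'Anger': 'NegEmo'}
print(label_sequences(['Anger', 'PosEmo'], parent))
# [['Affect', 'NegEmo', 'Anger'], ['Affect', 'PosEmo']]
```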

A common choice for the decoder is an LSTM [27], so that each conditional probability is computed as

$$P(y_i \mid (y_1, \ldots, y_{i-1}), w) = g(\mathbf{y}_{i-1}, \mathbf{s}_i) = \mathbf{o}_i \odot \tanh(\mathbf{h}_i), \qquad (6.42)$$

where

$$\begin{aligned}
\mathbf{h}_i &= \mathbf{f}_i \odot \mathbf{h}_{i-1} + \mathbf{z}_i \odot \tilde{\mathbf{h}}_i, \\
\tilde{\mathbf{h}}_i &= \tanh(\mathbf{W}_h[\mathbf{h}_{i-1}; \mathbf{y}_{i-1}] + \mathbf{b}_h), \\
\mathbf{o}_i &= \text{Sigmoid}(\mathbf{W}_o[\mathbf{h}_{i-1}; \mathbf{y}_{i-1}] + \mathbf{b}_o), \\
\mathbf{z}_i &= \text{Sigmoid}(\mathbf{W}_z[\mathbf{h}_{i-1}; \mathbf{y}_{i-1}] + \mathbf{b}_z), \\
\mathbf{f}_i &= \text{Sigmoid}(\mathbf{W}_f[\mathbf{h}_{i-1}; \mathbf{y}_{i-1}] + \mathbf{b}_f),
\end{aligned} \qquad (6.43)$$

where $\odot$ denotes element-wise multiplication and $\mathbf{h}_i$ is the $i$th hidden state of the RNN. $\mathbf{W}_h$, $\mathbf{W}_o$, $\mathbf{W}_z$, $\mathbf{W}_f$ are weights and $\mathbf{b}_h$, $\mathbf{b}_o$, $\mathbf{b}_z$, $\mathbf{b}_f$ are biases. $\mathbf{o}_i$, $\mathbf{z}_i$, and $\mathbf{f}_i$ are known as the output gate, input gate, and forget gate, respectively. To take advantage of word embeddings, the initial state is defined as $\mathbf{h}_0 = \mathbf{w}$, where $\mathbf{w}$ represents the embedding of the word; in other words, the word embedding is used as the initial state of the decoder. Specifically, the inputs of the model are word embeddings and label embeddings. First, raw words are transformed into word embeddings by an embedding matrix $\mathbf{E} \in \mathbb{R}^{|V| \times d_w}$, where $d_w$ is the word embedding dimension. Then, at each time step, label embeddings $\mathbf{y}$ are fed to the model, obtained from a label embedding matrix $\mathbf{Y} \in \mathbb{R}^{|Y| \times d_y}$, where $d_y$ is the label embedding dimension. Here the word embeddings are pretrained and fixed during training. Generally speaking, the decoder is expected to decode word labels hierarchically based on word embeddings: at each time step, it predicts the current label depending on the previously predicted labels.

Hierarchical Decoder with Sememe Attention. The basic decoder uses word embeddings as the initial state and then predicts word labels hierarchically as sequences. However, each word in the basic decoder model has only one representation. This is insufficient because many words are polysemous and many categories are fine-grained in the LIWC lexicon. It is difficult to handle these properties using a single real-valued vector. Therefore, Zeng et al. [68] propose to incorporate sememe information. Because different sememes represent different meanings of a word, they should have different weights when predicting word labels. Moreover, we believe that the same sememe should have different weights in different categories. Take the word apex in Fig. 6.12 for example: the sememe location should have a relatively high weight when the decoder chooses among the subclasses of relative, but when choosing among the subclasses of PersonalConcerns, location should have a lower weight because it represents the relatively irrelevant sense vertex.

Fig. 6.12 Example word apex and its senses and sememes in HowNet annotation

To achieve these goals, an attention mechanism [1] is employed to incorporate sememe information when decoding the word label sequence. The structure of the model is illustrated in Fig. 6.13. Similar to the basic decoder, word embeddings are applied as the initial state of the decoder. The primary difference is that the conditional probability is defined as

$$P(y_i \mid (y_1, \ldots, y_{i-1}), w, \mathbf{c}_i) = g([\mathbf{y}_{i-1}; \mathbf{c}_i], \mathbf{h}_i), \qquad (6.44)$$

where $\mathbf{c}_i$ is known as the context vector. The context vector $\mathbf{c}_i$ depends on a set of sememe embeddings $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, acquired from a sememe embedding matrix $\mathbf{X} \in \mathbb{R}^{|S| \times d_s}$, where $d_s$ is the sememe embedding dimension. More specifically, the context vector $\mathbf{c}_i$ is computed as a weighted sum of the sememe embeddings $\mathbf{x}_j$:

$$\mathbf{c}_i = \sum_{j=1}^{N} \alpha_{ij} \mathbf{x}_j. \qquad (6.45)$$

The weight $\alpha_{ij}$ of each sememe embedding $\mathbf{x}_j$ is defined as

$$\alpha_{ij} = \frac{\exp\big(\mathbf{v} \cdot \tanh(\mathbf{W}_1 \mathbf{y}_{i-1} + \mathbf{W}_2 \mathbf{x}_j)\big)}{\sum_{k=1}^{N} \exp\big(\mathbf{v} \cdot \tanh(\mathbf{W}_1 \mathbf{y}_{i-1} + \mathbf{W}_2 \mathbf{x}_k)\big)}, \qquad (6.46)$$

Fig. 6.13 The architecture of sememe attention decoder with word embeddings as the initial state

where $\mathbf{v} \in \mathbb{R}^{a}$ is a trainable parameter, $\mathbf{W}_1 \in \mathbb{R}^{a \times d_y}$ and $\mathbf{W}_2 \in \mathbb{R}^{a \times d_s}$ are weight matrices, and $a$ is the number of hidden units in the attention model. Intuitively, at each time step, the decoder chooses which sememes to pay attention to when predicting the current word label. In this way, different sememes can have different weights, and the same sememe can have different weights in different categories. With the support of sememe attention, the decoder can differentiate multiple meanings of a word and distinguish fine-grained categories, and thus expand a more accurate and comprehensive lexicon.
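To make Eqs. 6.44-6.46 concrete, here is a hedged PyTorch sketch of the sememe attention step; the dimensions and the random sememe embeddings are illustrative assumptions rather than the configuration used in the original experiments.

```python
import torch

def sememe_attention(y_prev, sememe_embs, v, W1, W2):
    """Compute attention weights and the context vector (Eqs. 6.45-6.46).

    y_prev      : previous label embedding, shape (d_y,)
    sememe_embs : embeddings of the word's N sememes, shape (N, d_s)
    v, W1, W2   : attention parameters, shapes (a,), (a, d_y), (a, d_s)
    """
    scores = torch.tanh(W1 @ y_prev + sememe_embs @ W2.t()) @ v   # (N,)
    alpha = torch.softmax(scores, dim=0)                          # Eq. 6.46
    context = alpha @ sememe_embs                                 # Eq. 6.45, (d_s,)
    return alpha, context

# toy usage: a word with 4 sememes, d_y = d_s = 16, a = 8
d_y = d_s = 16; a = 8
alpha, c = sememe_attention(torch.randn(d_y), torch.randn(4, d_s),
                            torch.randn(a), torch.randn(a, d_y),
                            torch.randn(a, d_s))
# c is then concatenated with y_prev as the decoder input (Eq. 6.44)
```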

6.4 Summary

In this chapter, we first give an introduction to the most well-known sememe knowledge base, HowNet, which uses about 2,000 predefined sememes to annotate over 100,000 Chinese and English words and phrases. Different from other linguistic knowledge bases like WordNet, HowNet is based on the minimum semantic units (sememes) and captures the compositional relations between sememes and words. To learn the representations of sememe knowledge, we elaborate on three models, namely the Simple Sememe Aggregation model (SSA), the Sememe Attention over Context model (SAC), and the Sememe Attention over Target model (SAT). These models not only learn the representations of sememes but also help improve the representations of words. Next, we describe some applications of sememe knowledge, including word representation, semantic composition, and language modeling. We also detail how to automatically predict sememes for both monolingual and cross-lingual unannotated words.

For further learning of sememe knowledge-based NLP, you can read the book written by the authors of HowNet [18]. You can also find more related papers in the paper list at https://github.com/thunlp/SCPapers, and you can use the open-source API OpenHowNet (https://github.com/thunlp/OpenHowNet) to access HowNet data.

In the future, there are some research directions worth exploring:

(1) Utilizing Structures of Sememe Annotations. The sememe annotations in HowNet are hierarchical, and the sememes annotated to a word are actually organized as a tree. However, existing studies still do not utilize this structural information; instead, in current methods, sememes are simply regarded as semantic labels. In fact, the structures of sememes also incorporate abundant semantic information and would be helpful for a deep understanding of lexical semantics. Besides, existing sememe prediction studies also predict unstructured sememes only, and it would be an interesting task to conduct structured sememe prediction.

(2) Leveraging Sememes in Low-data Regimes. One of the most important and typical characteristics of sememes is that a limited set of sememes can represent unlimited semantics, which can play an important and positive role in tackling low-data regimes. In word representation learning, the representations of low-frequency words can be improved by their sememes, which have been well learned from the high-frequency words they also annotate. We believe sememes will be beneficial to other low-data regimes as well, e.g., NLP tasks for low-resource languages.

(3) Building Sememe Knowledge Bases for Other Languages. The original HowNet annotates sememes for only two languages: Chinese and English. As far as we know, there are no sememe knowledge bases like HowNet for other languages. Since HowNet and its sememe knowledge have been verified to be helpful for better understanding human languages, it will be of great significance to annotate sememes for words and phrases in other languages. In this chapter, we have described a study on cross-lingual sememe prediction, and we think it is promising to make further efforts in this direction.

References

1. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR, 2015. 2. Michele Banko, Vibhu O Mittal, and Michael J Witbrock. Headline generation based on statistical translation. In Proceedings of ACL, 2000. 3. Marco Baroni and Roberto Zamparelli. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of EMNLP, 2010. 4. Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003. 5. Adam Berger and John Lafferty. Information retrieval as statistical translation. In Proceedings of SIGIR, 1999.

6. Wei Bi and James T Kwok. Multi-label classification on tree-and dag-structured hierarchies. In Proceedings of ICML, 2011. 7. William Blacoe and Mirella Lapata. A comparison of vector-based representations for semantic composition. In Proceedings of EMNLP-CoNLL, 2012. 8. Leonard Bloomfield. A set of postulates for the science of language. Language, 2(3):153–164, 1926. 9. Thorsten Brants, Ashok C Popat, Peng Xu, Franz J Och, and Jeffrey Dean. Large language models in machine translation. In Proceedings of EMNLP, 2007. 10. Peter F Brown, John Cocke, Stephen A Della Pietra, Vincent J Della Pietra, Fredrick Jelinek, John D Lafferty, Robert L Mercer, and Paul S Roossin. A statistical approach to machine translation. Computational Linguistics, 16(2):79–85, 1990. 11. Wilma Bucci and Bernard Maskit. Building a weighted dictionary for referential activity. Com- puting Attitude and Affect in Text, pages 49–60, 2005. 12. Ricardo Cerri, Rodrigo C Barros, and André CPLF De Carvalho. Hierarchical multi-label classification using local neural networks. Journal of Computer and System Sciences, 80(1):39– 56, 2014. 13. Xinxiong Chen, Zhiyuan Liu, and Maosong Sun. A unified model for word sense representation and disambiguation. In Proceedings of EMNLP, 2014. 14. Xinxiong Chen, Lei Xu, Zhiyuan Liu, Maosong Sun, and Huanbo Luan. Joint learning of character and word embeddings. In Proceedings of IJCAI, 2015. 15. Yangchi Chen, Melba M Crawford, and Joydeep Ghosh. Integrating support vector machines in a hierarchical output space decomposition framework. In Proceedings of IGARSS, 2004. 16. Lei Dang and Lei Zhang. Method of discriminant for chinese sentence sentiment orientation basedonhownet.Application Research of Computers, 4:43, 2010. 17. Zhendong Dong and Qiang Dong. Hownet-a hybrid language and knowledge resource. In Proceedings of NLP-KE, 2003. 18. Zhendong Dong and Qiang Dong. HowNet and the Computation of Meaning (With CD-Rom). World Scientific, 2006. 19. Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar, Chris Dyer, Eduard Hovy, and Noah A Smith. Retrofitting word vectors to semantic lexicons. In Proceedings of NAACL-HLT, 2015. 20. Xianghua Fu, Guo Liu, Yanyan Guo, and Zhiqiang Wang. Multi-aspect sentiment analysis for chinese online social reviews based on topic modeling and hownet lexicon. Knowledge-Based Systems, 37:186–195, 2013. 21. Edward Grefenstette and Mehrnoosh Sadrzadeh. Experimental support for a categorical com- positional distributional model of meaning. In Proceedings of EMNLP, 2011. 22. Justin Grimmer and Brandon M Stewart. Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political analysis, 21(3):267–297, 2013. 23. Yihong Gu, Jun Yan, Hao Zhu, Zhiyuan Liu, Ruobing Xie, Maosong Sun, Fen Lin, and Leyu Lin. Language modeling with sparse product of sememe experts. In Proceedings of EMNLP, pages 4642–4651, 2018. 24. Djoerd Hiemstra. A linguistically motivated probabilistic model of information retrieval. In Proceedings of TPDL, 1998. 25. G. E Hinton. Products of experts. In Proceedings of ICANN, 1999. 26. Geoffrey E Hinton. Learning distributed representations of concepts. In Proceedings of CogSci, 1986. 27. Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. 28. Chin-Lan Huang, CK Chung, Natalie Hui, Yi-Cheng Lin, Yi-Tai Seih, WC Chen, and JW Pen- nebaker. The development of the chinese linguistic inquiry and word count dictionary. Chinese Journal of Psychology, 54(2):185–201, 2012. 29. 
Eric H Huang, Richard Socher, Christopher D Manning, and Andrew Y Ng. Improving word representations via global context and multiple word prototypes. In Proceedings of ACL, 2012. 30. Huiming Jin, Hao Zhu, Zhiyuan Liu, Ruobing Xie, Maosong Sun, Fen Lin, and Leyu Lin. Incorporating chinese characters of words for lexical sememe prediction. In Proceedings of ACL, 2018.

31. Dan Jurafsky. Speech & language processing. 2000. 32. Ewa Kacewicz, James W Pennebaker, Matthew Davis, Moongee Jeon, and Arthur C Graesser. Pronoun use reflects standings in social hierarchies. Journal of Language and Social Psychol- ogy, 33(2):125–143, 2014. 33. Sanjeev Kumar Karn, Ulli Waltinger, and Hinrich Schütze. End-to-end trainable attentive decoder for hierarchical entity classification. Proceedings of EACL, 2017. 34. Slava Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(3):400– 401, 1987. 35. Thomas Kober, Julie Weeds, Jeremy Reffin, and David Weir. Improving sparse word repre- sentations with distributional inference for semantic composition. In Proceedings of EMNLP, 2016. 36. Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recom- mender systems. Computer, 42(8), 2009. 37. Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In Proceedings of NeurIPS, 2014. 38. Wei Li, Xuancheng Ren, Damai Dai, Yunfang Wu, Houfeng Wang, and Xu Sun. Sememe prediction: Learning semantic knowledge from unstructured textual wiki descriptions. arXiv preprint arXiv:1808.05437, 2018. 39. Xingjian Li et al. Lexicon of common words in contemporary chinese, 2008. 40. Qun Liu. Word similarity computing based on hownet. Computational linguistics and processing, 7(2):59–76, 2002. 41. Shu Liu, Jingjing Xu, Xuancheng Ren, and Xu Sun. Evaluating semantic rationality of a sentence: A sememe-word-matching neural network based on hownet. 2019. 42. Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of ACL, 2011. 43. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In Proceedings of ICLR, 2013. 44. Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernock`y, and Sanjeev Khudanpur. Recur- rent neural network based language model. In Proceedings of InterSpeech, 2010. 45. David RH Miller, Tim Leek, and Richard M Schwartz. A hidden markov model information retrieval system. In Proceedings of SIGIR, 1999. 46. Jeff Mitchell and Mirella Lapata. Vector-based models of semantic composition. In Proceedings of ACL, 2008. 47. Jeff Mitchell and Mirella Lapata. Language models based on semantic composition. In Pro- ceedings of EMNLP, 2009. 48. Matthew L Newman, Carla J Groom, Lori D Handelman, and James W Pennebaker. Gen- der differences in language use: An analysis of 14,000 text samples. Discourse Processes, 45(3):211–236, 2008. 49. Yilin Niu, Ruobing Xie, Zhiyuan Liu, and Maosong Sun. Improved word representation learn- ing with sememes. In Proceedings of ACL, 2017. 50. Francis Jeffry Pelletier. The principle of semantic compositionality. Topoi, 13(1):11–24, 1994. 51. Francis Jeffry Pelletier. Semantic Compositionality, volume 1. 2016. 52. James W Pennebaker, Roger J Booth, and Martha E Francis. Linguistic inquiry and word count: Liwc [computer software]. Austin, TX: liwc. net, 2007. 53. Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of EMNLP, 2014. 54. Jay M Ponte and W Bruce Croft. A language modeling approach to information retrieval. In Proceedings of SIGIR, 1998. 55. Fanchao Qi, Junjie Huang, Chenghao Yang, Zhiyuan Liu, Xiao Chen, Qun Liu, and Maosong Sun. 
Modeling semantic compositionality with sememe knowledge. In Proceedings of ACL, 2019. 56. Fanchao Qi, Yankai Lin, Maosong Sun, Hao Zhu, Ruobing Xie, and Zhiyuan Liu. Cross-lingual lexical sememe prediction. In Proceedings of EMNLP, 2018.

57. Alexander M Rush, Sumit Chopra, and Jason Weston. A neural attention model for abstractive sentence summarization. In Proceedings of EMNLP, 2015. 58. Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. Item-based collaborative filtering recommendation algorithms. In Proceedings of WWW, 2001. 59. Richard Socher, John Bauer, Christopher D Manning, and Andrew Y Ng. Parsing with com- positional vector grammars. In Proceedings of ACL, 2013. 60. Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. Semantic compo- sitionality through recursive matrix-vector spaces. In Proceedings of EMNLP-CoNLL, 2012. 61. Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D. Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a senti- ment treebank. In Proceedings of EMNLP, 2013. 62. Jingguang Sun, Dongfeng Cai, Dexin Lv, and Yanju Dong. Hownet based chinese question automatic classification. Journal of Chinese Information Processing, 21(1):90–95, 2007. 63. Maosong Sun and Xinxiong Chen. Embedding for words and word senses based on human annotated knowledge base: A case study on hownet. Journal of Chinese Information Processing, 30:1–6, 2016. 64. Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Proceedings of NeurIPS, 2014. 65. David Weir, Julie Weeds, Jeremy Reffin, and Thomas Kober. Aligning packed dependency trees: a theory of composition for . Computational Linguistics, 42(4):727– 761, December 2016. 66. Ruobing Xie, Xingchi Yuan, Zhiyuan Liu, and Maosong Sun. Lexical sememe prediction via word embeddings and matrix factorization. In Proceedings of IJCAI, 2017. 67. Binyong Yin. Quantitative research on Chinese morphemes. Studies of the Chinese Language, 5:338–347, 1984. 68. Xiangkai Zeng, Cheng Yang, Cunchao Tu, Zhiyuan Liu, and Maosong Sun. Chinese liwc lexicon expansion via hierarchical classification of word embeddings with sememe attention. In Proceedings of AAAI, 2018. 69. Meng Zhang, Haoruo Peng, Yang Liu, Huan-Bo Luan, and Maosong Sun. Bilingual lexicon induction from non-parallel data with minimal supervision. In Proceedings of AAAI, 2017. 70. Yuntao Zhang, Ling Gong, and Yongcheng Wang. Chinese word sense disambiguation using hownet. In Proceedings of ICNC, 71. Yu Zhao, Zhiyuan Liu, and Maosong Sun. Phrase type sensitive tensor indexing model for semantic composition. In Proceedings of AAAI, 2015. 72. Xiaodan Zhu, Parinaz Sobhani, and Hongyu Guo. DAG-Structured long short-term memory for semantic compositionality. In Proceedings of NAACL-HLT, 2016.

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Chapter 7 World Knowledge Representation

Abstract World knowledge representation aims to represent entities and relations in the knowledge graph in a low-dimensional semantic space, which has been widely used in knowledge-driven tasks. In this chapter, we first introduce the concept of the knowledge graph. Next, we introduce the motivations and give an overview of the existing approaches for knowledge graph representation. Further, we discuss several advanced approaches that aim to deal with the current challenges of knowledge graph representation. We also review the real-world applications of knowledge graph representation, such as language modeling, question answering, information retrieval, and recommender systems.

7.1 Introduction

Knowledge Graph (KG), also known as Knowledge Base (KB), is a significant multi-relational dataset for modeling concrete entities and abstract concepts in the real world. It provides useful structured information and plays a crucial role in many real-world applications such as web search and question answering. It is no exaggeration to say that knowledge graphs teach us how to model the entities as well as the relationships among them in this complicated real world.

To encode knowledge into a real-world application, knowledge graph representation, which represents entities and relations in knowledge graphs with distributed representations, has been proposed and applied to various real-world artificial intelligence fields including question answering, information retrieval, and dialogue systems. That is, knowledge graph representation learning plays a vital role as a bridge between knowledge graphs and knowledge-driven tasks. In this section, we will introduce the concept of the knowledge graph, several typical knowledge graphs, knowledge graph representation learning, and several typical knowledge-driven tasks.


7.1.1 World Knowledge Graphs

In ancient times, knowledge was stored and inherited through books and letters written on parchment or bamboo slips. With the Internet thriving in the twenty-first century, massive amounts of messages have flooded into the Web, and knowledge has been transferred into semi-structured textual information on the web. However, due to this information explosion, it is not easy to extract the knowledge we want from the enormous amount of noisy plain text on the Internet. To obtain knowledge effectively, people have noticed that the world is not only made of strings but also made of entities and relations. The Knowledge Graph, which arranges structured multi-relational data about concrete entities and abstract concepts in the real world, has been blooming in recent years and attracts wide attention in both academia and industry.

KGs are usually constructed from existing Semantic Web datasets in the Resource Description Framework (RDF) with the help of manual annotation, while they can also be automatically enriched by extracting knowledge from large plain texts on the Internet. A typical KG usually contains two elements: entities (i.e., concrete entities and abstract concepts in the real world) and relations between entities. It usually represents knowledge with large quantities of triple facts in the form of ⟨head entity, relation, tail entity⟩, abridged as ⟨h, r, t⟩. For example, William Shakespeare is a famous English poet and playwright, who is widely regarded as the greatest writer in the English language, and Romeo and Juliet is one of his masterpieces. In a knowledge graph, we represent this knowledge as ⟨William Shakespeare, works_written, Romeo and Juliet⟩. Note that in the real world, the same head entity and relation may have multiple tail entities (e.g., William Shakespeare also wrote Hamlet and A Midsummer Night's Dream), and reversely the same situation may happen when the tail entity and relation are fixed. It is even possible for both the head entity and the tail entity to be multiple (e.g., in relations like actor_in_movie). However, in a KG, all knowledge can be represented as triple facts regardless of the types of entities and relations. Through these triples, we can build a huge directed graph whose nodes correspond to entities and whose edges correspond to relations to model the real world. With this well-structured, unified knowledge representation, KGs are widely used in a variety of applications to enhance their system performance.

There are several KGs widely utilized nowadays in applications of information retrieval and question answering. In this subsection, we will introduce some famous KGs such as Freebase, DBpedia, YAGO, and WordNet. In fact, there are also lots of comparatively smaller KGs for specific fields of knowledge that function in vertical search.

7.1.1.1 Freebase

Freebase is one of the most popular knowledge graphs in the world. It is a large community-curated database consisting of well-known people, places, and things, which is composed of existing databases and contributions from its community members. Freebase was first developed by Metaweb, an American software company, and had run since March 2007. In July 2010, Metaweb was acquired by Google, and Freebase was used to power Google's Knowledge Graph. In December 2014, the Freebase team officially announced that the website, as well as the Freebase API, would be shut down by June 30, 2015, and that the data in Freebase would be transferred to Wikidata, another collaboratively edited knowledge base operated by the Wikimedia Foundation. Up to March 24, 2016, Freebase had arranged 58,726,427 topics and 3,197,653,841 facts.

Fig. 7.1 An example of search results in Freebase

Freebase contains well-structured data representing relationships between entities as well as the attributes of entities in the form of triple facts (Fig. 7.1). Data in Freebase was mainly harvested from various sources, including Wikipedia, Fashion Model Directory, NNDB, MusicBrainz, and so on. Moreover, community members also contributed a lot to Freebase. Freebase is an open and shared database that aims to construct a global database encoding the world's knowledge. It provided an open API, an RDF endpoint, and a database dump for its users for both commercial and noncommercial use. As described by Tim O'Reilly, Freebase is the bridge between the bottom-up vision of Web 2.0 collective intelligence and the more structured world of the Semantic Web.

7.1.1.2 DBpedia

DBpedia is a crowd-sourced community effort aiming to extract structured content from Wikipedia and make this information accessible on the web. It was started by researchers at Free University of Berlin, Leipzig University, and OpenLink Software, and was initially released to the public in January 2007. DBpedia allows users to ask semantic queries associated with Wikipedia resources, even including links to other related datasets, which makes it easier to fully utilize the massive amount of information in Wikipedia in a novel and effective way. DBpedia is also an essential part of the Linked Data effort described by Tim Berners-Lee. The English version of DBpedia describes 4.58 million entities, out of which 4.22 million are classified in a consistent ontology, including 1,445,000 persons, 735,000 places, 411,000 creative works, 251,000 species, 241,000 organizations, and 6,000 diseases. There are also localized versions of DBpedia in 125 languages, which altogether contain 38.3 million entities. Besides, DBpedia contains a great number of internal and external links, including 80.9 million links to Wikipedia categories, 41.2 million links to YAGO categories, 25.2 million links to images, and 29.8 million links to external web pages. Moreover, DBpedia maintains a hierarchical, cross-domain ontology covering 685 classes in total, which has been manually created based on the commonly used infoboxes in Wikipedia. DBpedia has several advantages over other KGs. First, DBpedia has a close connection to Wikipedia and can automatically evolve as Wikipedia changes, which makes the update process of DBpedia more efficient. Second, DBpedia is multilingual, which is convenient for users all over the world working in their native languages.

7.1.1.3 YAGO

YAGO, which is short for Yet Another Great Ontology, is a high-quality KG developed by the Max Planck Institute for Computer Science in Saarbrücken and initially released in 2008. Knowledge in YAGO is automatically extracted from Wikipedia, WordNet, and GeoNames, and its accuracy has been manually evaluated, showing a confirmed accuracy of 95%. YAGO is special not only because every fact possesses a confidence value derived from the manual evaluation, but also because YAGO is anchored in space and time, providing a spatial or temporal dimension for part of its entities. Currently, YAGO has more than 10 million entities, including persons, organizations, and locations, with over 120 million facts about these entities. YAGO also combines knowledge extracted from the Wikipedias of 10 different languages and classifies entities into approximately 350,000 classes according to the Wikipedia category system and the taxonomy of WordNet. YAGO has also joined the Linked Data project and has been linked to the DBpedia ontology and the SUMO ontology (Fig. 7.2).

7.2 Knowledge Graph Representation

Knowledge graphs provide us with a novel way to describe the world with entities and triple facts, which attracts growing attention from researchers. Large KGs such as Freebase, DBpedia, and YAGO have been constructed and widely used in an enormous number of applications such as question answering and web search.

Fig. 7.2 An example of search results in YAGO

However, as KGs grow in size, we face two main challenges: data sparsity and computational inefficiency. Data sparsity is a general problem in many fields such as social network analysis and interest mining. It arises because there are too many nodes (e.g., users, products, or entities) in a large graph but too few edges (e.g., relationships) between these nodes, since the number of relations of a node is limited in the real world. Computational efficiency is another challenge we need to overcome as the size of knowledge graphs increases.

To tackle these problems, representation learning has been introduced to knowledge representation. Representation learning in KGs aims to project both entities and relations into a low-dimensional continuous vector space to obtain their distributed representations, whose effectiveness has already been confirmed in word representation and social representation. Compared with the traditional one-hot representation, distributed representation has much fewer dimensions and thus lowers the computational complexity. What is more, distributed representation can explicitly show the similarity between entities through distances calculated from the low-dimensional embeddings, while all embeddings in one-hot representation are orthogonal, making it difficult to tell the potential relations between entities. With the advantages above, knowledge graph representation learning is blooming in knowledge applications, significantly improving the ability of KGs on the tasks of knowledge completion, knowledge fusion, and reasoning. It is considered as the bridge between knowledge construction, knowledge graphs, and knowledge-driven applications. Up till now, a large number of methods have been proposed to model knowledge graphs with distributed representations, and the learned knowledge representations have been widely utilized in various knowledge-driven tasks like question answering, information retrieval, and dialogue systems.

In summary, Knowledge graph Representation Learning (KRL) aims to construct distributed knowledge representations for entities and relations, projecting knowledge into low-dimensional semantic vector spaces. Recent years have witnessed significant advances in knowledge graph representation learning, with a large number of KRL methods proposed to construct knowledge representations, among which the translation-based methods achieve state-of-the-art performance on many KG tasks, with a good balance between effectiveness and efficiency. In this section, we will first describe the notations that we will use in KRL. Then, we will introduce TransE, which is the fundamental version of the translation-based methods. Next, we will explore the various extensions of TransE in detail. At last, we will take a brief look at other representation learning methods utilized in modeling knowledge graphs.

7.2.1 Notations

First, we introduce the general notations used in the rest of this section. We use $\mathcal{G} = (\mathcal{E}, \mathcal{R}, \mathcal{T})$ to denote the whole KG, in which $\mathcal{E} = \{e_1, e_2, \ldots, e_{|\mathcal{E}|}\}$ stands for the entity set, $\mathcal{R} = \{r_1, r_2, \ldots, r_{|\mathcal{R}|}\}$ stands for the relation set, and $\mathcal{T}$ stands for the triple set. $|\mathcal{E}|$ and $|\mathcal{R}|$ are the numbers of entities and relations in their overall sets. As stated above, we represent knowledge in the form of a triple fact $\langle h, r, t\rangle$, where $h \in \mathcal{E}$ is the head entity, $t \in \mathcal{E}$ is the tail entity, and $r \in \mathcal{R}$ is the relation between $h$ and $t$.

7.2.2 TransE

TransE [7] is a translation-based model for learning low-dimensional embeddings of entities and relations. It projects entities as well as relations into the same semantic embedding space and then regards relations as translations in that space. First, we will start with the motivations of this method, then discuss in detail how knowledge representations are trained in TransE. Finally, we will explore the advantages and disadvantages of TransE for a deeper understanding.

7.2.2.1 Motivation

There are three main motivations behind the translation-based knowledge graph representation learning method. The primary motivation is that it is natural to consider relationships between entities as translating operations. Through distributed representations, entities are projected into a low-dimensional vector space. Intuitively, we agree that a reasonable projection should map entities with similar semantic meanings to the same region, while entities with different meanings should belong to distinct clusters in the vector space. For example, William Shakespeare and Jane Austen may be in the same cluster of writers, while Romeo and Juliet and Pride and Prejudice may be in another cluster of books. In this case, the writer-book pairs share the same relation works_written, and the translations between writers and books in the vector space are similar. The second motivation of TransE derives from the breakthrough in word representation by Word2vec [49]. Word2vec proposes two simple models, Skip-gram and CBOW, to learn word embeddings from large-scale corpora, significantly improving the performance in word similarity and analogy tasks. The word embeddings learned by Word2vec exhibit an interesting phenomenon: if two word pairs share the same semantic or syntactic relationship, the subtraction of embeddings within each word pair will be similar. For instance, we have

$w(\text{king}) - w(\text{man}) \approx w(\text{queen}) - w(\text{woman}),$  (7.1)

which indicates that the latent semantic relation between king and man, which is similar to the relation between queen and woman, is successfully embedded in the word representation. This approximate relation can be found not only for semantic relations but also for syntactic relations. We have

$w(\text{bigger}) - w(\text{big}) \approx w(\text{smaller}) - w(\text{small}).$  (7.2)

The phenomenon found in word representation strongly implies that there may exist an explicit method to represent relationships between entities as translating operations in the vector space. The last motivation comes from the consideration of computational complexity. On the one hand, a substantial increase in model complexity results in high computational costs and obscures model interpretability; moreover, a complex model may lead to overfitting. On the other hand, experimental results on model complexity demonstrate that simpler models perform almost as well as more expressive models in most KG applications, provided that there is a sizeable multi-relational dataset with a relatively large number of relations. As KG size increases, computational complexity becomes the primary challenge in knowledge graph representation. The intuitive assumption of translation leads to a better trade-off between accuracy and efficiency.

7.2.2.2 Methodology

As illustrated in Fig. 7.3, TransE projects entities and relations into the same low-dimensional space. All embeddings take values in $\mathbb{R}^d$, where $d$ is a hyperparameter indicating the dimension of embeddings. With the translation assumption, for each triple $\langle h, r, t\rangle$ in $\mathcal{T}$, we want the translated embedding $h + r$ to be the nearest neighbor of the tail embedding $t$. The score function of TransE is then defined as follows:

Fig. 7.3 The architecture of the TransE model [47]
$E(h, r, t) = \|h + r - t\|.$  (7.3)

More specifically, to learn such embeddings of entities and relations, TransE formalizes a margin-based loss function with negative sampling as the training objective. The pairwise loss function is defined as follows:

$\mathcal{L} = \sum_{\langle h, r, t\rangle \in \mathcal{T}} \sum_{\langle h', r', t'\rangle \in \mathcal{T}^-} \max\big(\gamma + E(h, r, t) - E(h', r', t'),\ 0\big),$  (7.4)

in which $E(h, r, t)$ is the energy score of a positive triple (i.e., a triple in $\mathcal{T}$) and $E(h', r', t')$ is that of a negative triple. The energy function $E$ can be measured by either L1 or L2 distance. $\gamma > 0$ is the margin hyperparameter; a bigger $\gamma$ means a wider gap between positive scores and the corresponding negative scores. $\mathcal{T}^-$ is the negative triple set with respect to $\mathcal{T}$. Since there are no explicit negative triples in knowledge graphs, we define $\mathcal{T}^-$ as follows:

$\mathcal{T}^- = \{\langle h', r, t\rangle \mid h' \in \mathcal{E}\} \cup \{\langle h, r', t\rangle \mid r' \in \mathcal{R}\} \cup \{\langle h, r, t'\rangle \mid t' \in \mathcal{E}\}, \quad \langle h, r, t\rangle \in \mathcal{T},$  (7.5)

which means the negative triple set $\mathcal{T}^-$ is composed of positive triples $\langle h, r, t\rangle$ with the head entity, relation, or tail entity randomly replaced by any other entity or relation in the KG. Note that a new triple generated after replacement will not be considered a negative sample if it is already in $\mathcal{T}$. TransE is optimized using mini-batch stochastic gradient descent (SGD), with entities and relations randomly initialized. Knowledge completion, a link prediction task that aims to predict the third element of a triple (either an entity or a relation) given the other two, is used to evaluate the learned knowledge representations.
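To make the translation assumption and the margin-based objective concrete, below is a minimal NumPy sketch of the TransE score (Eq. 7.3) and the pairwise margin loss (Eq. 7.4). It is illustrative only: the embeddings are random toy vectors, and in a real implementation they would be trainable parameters updated by mini-batch SGD.

```python
import numpy as np

def transe_score(h, r, t, order=1):
    """Energy E(h, r, t) = ||h + r - t||, measured by L1 or L2 distance; lower is better."""
    return np.linalg.norm(h + r - t, ord=order)

def margin_loss(pos_score, neg_score, gamma=1.0):
    """Pairwise margin-based loss max(gamma + E(pos) - E(neg), 0)."""
    return max(gamma + pos_score - neg_score, 0.0)

# Toy usage with randomly initialized embeddings (stand-ins for trainable parameters).
d = 50
rng = np.random.default_rng(0)
h, r, t = rng.normal(size=(3, d))
t_corrupt = rng.normal(size=d)  # tail replaced by a random entity, as in Eq. 7.5
loss = margin_loss(transe_score(h, r, t), transe_score(h, r, t_corrupt))
```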

7.2.2.3 Disadvantages and Challenges

TransE is effective and efficient and has shown its power on link prediction. However, it still has several disadvantages and challenges to be further explored.

First, in knowledge completion, we may have multiple correct answers given two elements of a triple. For instance, given the head entity William Shakespeare and the relation works_written, we will get a list of masterpieces including Romeo and Juliet, Hamlet, and A Midsummer Night's Dream. These books share the same writer but differ in many other aspects such as theme, background, and famous characters. However, with the translation assumption in TransE, every entity has only one embedding in all triples, which significantly limits the ability of TransE in knowledge graph representation. In [7], the authors categorize all relations into four classes, 1-to-1, 1-to-Many, Many-to-1, and Many-to-Many, according to the cardinalities of their head and tail arguments. A relation is considered 1-to-1 if most heads appear with one tail, 1-to-Many if a head can appear with many tails, Many-to-1 if a tail can appear with many heads, and Many-to-Many if multiple heads appear with multiple tails. Statistics demonstrate that the 1-to-Many, Many-to-1, and Many-to-Many relations occupy a large proportion. TransE does well on 1-to-1 relations, but it has issues when dealing with 1-to-Many, Many-to-1, and Many-to-Many relations.

Second, the translating operation is intuitive and effective but only considers the simple one-step translation, which may limit the ability to model KGs. Taking entities as nodes and relations as edges, we can construct a huge knowledge graph from the triple facts. However, TransE focuses on minimizing the energy function $E(h, r, t) = \|h + r - t\|$, which only utilizes the one-step relation information in knowledge graphs, regardless of the latent relationships located along long-distance paths. For example, if we know the triple facts ⟨The Forbidden City, locate_in, Beijing⟩ and ⟨Beijing, capital_of, China⟩, we can infer that The Forbidden City is located in China. TransE can thus be further enhanced with the help of multistep information.

Third, the representation and the dissimilarity function in TransE are oversimplified for the sake of efficiency. Therefore, TransE may not be capable enough of modeling the complicated entities and relations in knowledge graphs. There remain challenges in how to balance effectiveness and efficiency while avoiding both overfitting and underfitting.

Besides the disadvantages and challenges stated above, multisource information such as textual information and hierarchical type/label information is of great significance, which will be further discussed in the following.

7.2.3 Extensions of TransE

There are lots of extension methods following TransE to address the challenges above. Specifically, TransH, TransR, TransD, and TranSparse are proposed to solve the challenges of modeling 1-to-Many, Many-to-1, and Many-to-Many relations; PTransE is proposed to encode long-distance information located in multistep paths; and CTransR, TransA, TransG, and KG2E further extend the oversimplified model of TransE. We will discuss these extension methods in detail.

7.2.3.1 TransH

With distributed representations, entities are projected into the semantic vector space, and similar entities tend to be in the same cluster. However, it seems that William Shakespeare should be in the neighborhood of Isaac Newton when talking about nationality, while it should be next to Mark Twain when talking about occupation. To accomplish this, we want entities to show different preferences in different situations, that is, to have multiple representations in different triples. To address the issue of modeling 1-to-Many, Many-to-1, Many-to-Many, and reflexive relations, TransH [77] enables an entity to have multiple representations when involved in different relations. As illustrated in Fig. 7.4, TransH proposes a relation-specific hyperplane with normal vector $w_r$ for each relation and judges dissimilarity on the hyperplane instead of in the original vector space of entities. Given a triple $\langle h, r, t\rangle$, TransH first projects $h$ and $t$ onto the corresponding hyperplane to get the projections $h_\perp$ and $t_\perp$, and the translation vector $r$ is used to connect $h_\perp$ and $t_\perp$ on the hyperplane. The score function is defined as follows:

$E(h, r, t) = \|h_\perp + r - t_\perp\|,$  (7.6)

in which we have

$h_\perp = h - w_r^\top h\, w_r, \qquad t_\perp = t - w_r^\top t\, w_r,$  (7.7)

where $w_r$ is a vector and $\|w_r\|_2$ is restricted to 1. As for training, TransH also minimizes a margin-based loss function with negative sampling similar to TransE, and uses mini-batch SGD to learn the representations.
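The hyperplane projection of TransH can be sketched in a few lines of NumPy; this is a simplified illustration of Eqs. 7.6-7.7 with made-up variable names, not the authors' implementation.

```python
import numpy as np

def project_to_hyperplane(e, w_r):
    """Project an entity embedding onto the relation hyperplane with normal vector w_r (Eq. 7.7)."""
    w_r = w_r / np.linalg.norm(w_r)  # ||w_r||_2 is restricted to 1
    return e - np.dot(w_r, e) * w_r

def transh_score(h, r, t, w_r):
    """E(h, r, t) = ||h_perp + r - t_perp|| (Eq. 7.6)."""
    h_perp = project_to_hyperplane(h, w_r)
    t_perp = project_to_hyperplane(t, w_r)
    return np.linalg.norm(h_perp + r - t_perp)
```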

7.2.3.2 TransR/CTransR

TransH enables entities to have multiple representations in different relations with the help of hyperplanes, while entities and relations are still restricted to the same semantic vector space, which may limit the ability to model entities and relations. TransR [39] assumes that entities and relations should be arranged in distinct spaces, that is, an entity space for all entities and a relation space for each relation.

Fig. 7.4 The architecture of the TransH model [47]

As illustrated in Fig. 7.5, for a triple $\langle h, r, t\rangle$ with $h, t \in \mathbb{R}^k$ and $r \in \mathbb{R}^d$, TransR first projects $h$ and $t$ from the entity space to the relation space corresponding to $r$. That is to say, every entity has a relation-specific representation for each relation, and the translating operation is performed in the specific relation space. The energy function of TransR is defined as follows:

$E(h, r, t) = \|h_r + r - t_r\|,$  (7.8)

where $h_r$ and $t_r$ stand for the relation-specific representations of $h$ and $t$ in the relation space of $r$. The projection from the entity space to the relation space is

$h_r = h M_r, \qquad t_r = t M_r,$  (7.9)

where $M_r \in \mathbb{R}^{k \times d}$ is a projection matrix mapping entities from the entity space to the relation space of $r$. TransR also constrains the norms of the embeddings, requiring $\|h\|_2 \le 1$, $\|t\|_2 \le 1$, $\|r\|_2 \le 1$, $\|h_r\|_2 \le 1$, and $\|t_r\|_2 \le 1$. As for training, TransR shares the same margin-based score function as TransE.

Furthermore, the authors found that some relations in knowledge graphs can be divided into a few sub-relations that convey more precise information. The differences between those sub-relations can be learned from the corresponding entity pairs. For instance, the relation location_contains has head-tail patterns like city-street, country-city, and even country-university, showing different attributes in cognition. With sub-relations taken into consideration, entities may be projected to more precise positions in the semantic vector space. Cluster-based TransR (CTransR), an enhanced version of TransR that takes sub-relations into consideration, is then proposed. More specifically, for each relation $r$, all entity pairs $(h, t)$ are first clustered into several groups. The clustering of entity pairs depends on the subtraction result $t - h$, in which $h$ and $t$ are pretrained by TransE. Next, we learn a distinct sub-relation vector $r_c$ for each cluster according to the corresponding entity pairs, and the original energy function is modified as

Fig. 7.5 The architecture of the TransR model [47]

$E(h, r, t) = \|h_r + r_c - t_r\| + \alpha \|r_c - r\|,$  (7.10)

where the term $\|r_c - r\|$ ensures that the sub-relation vector $r_c$ does not deviate too far from the unified relation vector $r$.
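As a rough sketch, the relation-specific projection of TransR and the sub-relation term of CTransR (Eqs. 7.8-7.10) can be written as follows; the row-vector convention h M_r follows the equations above, and alpha is the trade-off hyperparameter.

```python
import numpy as np

def transr_score(h, r, t, M_r):
    """Project k-dim entities into the d-dim relation space via M_r (k x d), then translate."""
    h_r, t_r = h @ M_r, t @ M_r
    return np.linalg.norm(h_r + r - t_r)

def ctransr_score(h, r, r_c, t, M_r, alpha=0.1):
    """CTransR uses a cluster-specific sub-relation vector r_c kept close to the unified r."""
    h_r, t_r = h @ M_r, t @ M_r
    return np.linalg.norm(h_r + r_c - t_r) + alpha * np.linalg.norm(r_c - r)
```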

7.2.3.3 TransD

TransH and TransR focus on the multiple representations of entities under different relations, improving the performance on knowledge completion and triple classification. However, both models only project entities according to the relations in triples, ignoring the diversity of entities. Moreover, the projection operation with matrix-vector multiplication leads to a higher computational complexity compared to TransE, which is time consuming when applied to large-scale graphs. To address this problem, TransD [32] proposes a novel projection method with a dynamic mapping matrix depending on both the entity and the relation, which takes the diversity of entities as well as relations into consideration.

TransD defines two vectors for each entity and relation, i.e., the original vector, which is also used in TransE, TransH, and TransR for the distributed representation of entities and relations, and the projection vector, which is used to construct projection matrices mapping entities from the entity space to the relation space. As illustrated in Fig. 7.6, TransD uses $h$, $t$, and $r$ to represent the original vectors, while $h_p$, $t_p$, and $r_p$ represent the projection vectors. Two projection matrices $M_{rh}, M_{rt} \in \mathbb{R}^{m \times n}$ are used to project from the entity space to the relation space, and they are dynamically constructed as follows:

$M_{rh} = r_p h_p^\top + I^{m \times n}, \qquad M_{rt} = r_p t_p^\top + I^{m \times n},$  (7.11)

Fig. 7.6 The architecture of the TransD model [47]

which means the projection vectors of the entity and relation are combined to determine the dynamic projection matrix. The score function is then defined as

$E(h, r, t) = \|M_{rh} h + r - M_{rt} t\|.$  (7.12)

The projection matrices are initialized with identity matrices, and there are also normalization constraints as in TransR. TransD proposes a dynamic method to construct projection matrices that considers the diversity of both entities and relations, achieving better performance compared to existing methods in link prediction and triple classification. Moreover, it lowers both the computational and spatial complexity compared to TransR.
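A minimal sketch of the dynamic mapping of TransD (Eqs. 7.11-7.12), assuming m-dimensional relation vectors and n-dimensional entity vectors; the names are illustrative.

```python
import numpy as np

def transd_matrix(r_p, e_p):
    """Dynamic projection matrix M = r_p e_p^T + I of shape (m, n) (Eq. 7.11)."""
    m, n = r_p.shape[0], e_p.shape[0]
    return np.outer(r_p, e_p) + np.eye(m, n)

def transd_score(h, r, t, h_p, r_p, t_p):
    """E(h, r, t) = ||M_rh h + r - M_rt t|| (Eq. 7.12)."""
    M_rh = transd_matrix(r_p, h_p)
    M_rt = transd_matrix(r_p, t_p)
    return np.linalg.norm(M_rh @ h + r - M_rt @ t)
```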

7.2.3.4 TranSparse

The extensions of TransE stated above focus on the multiple representations of entities in different relations and entity pairs. However, there are still two challenges that have been ignored: (1) Heterogeneity. Relations in knowledge graphs differ in granularity. Some complex relations may link many entity pairs, while some relatively simple relations link only a few. (2) Unbalance. Some relations may have more links to head entities and fewer links to tail entities, and vice versa. The performance will be further improved if we take these properties into account rather than merely treating all relations equally. Existing methods like TransR build projection matrices for each relation, but these projection matrices have the same number of parameters, regardless of the variety in the complexity of relations. TranSparse [33] is then proposed to address these issues. The underlying assumption of TranSparse is that complex relations should have more parameters to learn while simple relations should have fewer, where the complexity of a relation is judged from the number of triples or entities linked by the relation.

To accomplish this, two models, i.e., TranSparse(share) and TranSparse(separate), are proposed to avoid overfitting and underfitting. Inspired by TransR, TranSparse(share) builds a sparse projection matrix $M_r(\theta_r)$ for each relation $r$, where the sparse degree $\theta_r$ mainly depends on the number of entity pairs linked by $r$. Suppose $N_r$ is the number of linked entity pairs, $N_r^*$ represents the maximum of $N_r$ over all relations, and $\theta_{\min}$ denotes the minimum sparse degree of the projection matrices, with $0 \le \theta_{\min} \le 1$. The sparse degree of relation $r$ is defined as follows:

$\theta_r = 1 - (1 - \theta_{\min})\, N_r / N_r^*.$  (7.13)

Both head and tail entities share the same sparse projection matrix $M_r(\theta_r)$ in translation. The score function is

$E(h, r, t) = \|M_r(\theta_r) h + r - M_r(\theta_r) t\|.$  (7.14)

Differing from TranSparse(share), TranSparse(separate) builds two different sparse projection matrices $M_{rh}(\theta_{rh})$ and $M_{rt}(\theta_{rt})$ for head and tail entities. The sparse degree $\theta_{rh}$ (or $\theta_{rt}$) then depends on the number of head (or tail) entities linked by relation $r$. We use $N_{rh}$ (or $N_{rt}$) to represent the number of head (or tail) entities, and $N_{rh}^*$ (or $N_{rt}^*$) to represent the maximum of $N_{rh}$ (or $N_{rt}$) over all relations. $\theta_{\min}$ again denotes the minimum sparse degree of the projection matrices, with $0 \le \theta_{\min} \le 1$. We have

$\theta_{rh} = 1 - (1 - \theta_{\min})\, N_{rh} / N_{rh}^*, \qquad \theta_{rt} = 1 - (1 - \theta_{\min})\, N_{rt} / N_{rt}^*.$  (7.15)

The score function of TranSparse(separate) is

$E(h, r, t) = \|M_{rh}(\theta_{rh}) h + r - M_{rt}(\theta_{rt}) t\|.$  (7.16)

Through the sparse projection matrices, TranSparse addresses heterogeneity and unbalance simultaneously.
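The sparse degrees of Eqs. 7.13 and 7.15 are simple to compute; the sketch below is illustrative, with theta_min and the link counts as hypothetical inputs.

```python
def sparse_degree(n_links, n_links_max, theta_min=0.0):
    """theta = 1 - (1 - theta_min) * N / N*  (Eqs. 7.13 and 7.15)."""
    return 1.0 - (1.0 - theta_min) * n_links / n_links_max

# The most complex relation keeps the densest matrix; a rarely linked relation gets a sparse one.
print(sparse_degree(n_links=10_000, n_links_max=10_000, theta_min=0.1))  # 0.1
print(sparse_degree(n_links=100, n_links_max=10_000, theta_min=0.1))     # 0.991
```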

7.2.3.5 PTransE

The extension models of TransE stated above mainly focus on the challenge of multiple representations of entities in different scenarios. However, those extension models only consider simple one-step paths (i.e., single relations) in the translating operation, ignoring the rich global information located in the whole knowledge graph. Considering multistep relational paths is a potential way to utilize this global information. For instance, if we notice the multistep relational path ⟨The Forbidden City, locate_in, Beijing⟩ → ⟨Beijing, capital_of, China⟩, we can infer with confidence that the triple ⟨The Forbidden City, locate_in, China⟩ may exist. Relational paths provide us with a powerful way to construct better knowledge graph representations and even to better understand knowledge reasoning.

There are two main challenges when encoding the information in multistep relational paths. First, how do we select reliable and meaningful relational paths among the enormous number of path candidates in KGs, since there are lots of relation sequence patterns that do not indicate reasonable relationships? Consider the relational path ⟨The Forbidden City, locate_in, Beijing⟩ → ⟨Beijing, held, 2008 Summer Olympics⟩; it is hard to describe the relationship between The Forbidden City and 2008 Summer Olympics. Second, how do we model those meaningful relational paths once we obtain them, since it is difficult to solve the semantic composition problem in relational paths?

PTransE [38] is then proposed to model multistep relational paths. To select meaningful relational paths, the authors propose a Path-Constraint Resource Allocation (PCRA) algorithm to judge the reliability of a relational path. Suppose there is some information (or resource) in the head entity $h$ that will flow to the tail entity $t$ through certain relational paths. The basic assumption of PCRA is that the reliability of a path $\ell$ depends on the resource amount that finally flows from head to tail. Formally, we set $\ell = (r_1, \ldots, r_l)$ for a certain path between $h$ and $t$. The resource travels from $h$ to $t$, and the path can be represented as $S_0/h \xrightarrow{r_1} S_1 \xrightarrow{r_2} \cdots \xrightarrow{r_l} S_l/t$. For an entity $m \in S_i$, its resource amount is defined as follows:

$R(m) = \sum_{n \in S_{i-1}(\cdot, m)} \frac{1}{|S_i(n, \cdot)|} R(n),$  (7.17)

where $S_{i-1}(\cdot, m)$ indicates all direct predecessors of entity $m$ along relation $r_i$ in $S_{i-1}$, and $S_i(n, \cdot)$ indicates all direct successors of $n \in S_{i-1}$ along relation $r_i$. Finally, the resource amount of the tail, $R(t)$, is used to measure the reliability of $\ell$ for the given triple $\langle h, \ell, t\rangle$.

Once we have measured the reliability and selected the meaningful relational path candidates, the next challenge is how to model the meaning of those multistep paths. PTransE proposes three types of composition operation, namely addition, multiplication, and recurrent neural networks, to obtain the representation $l$ of $\ell = (r_1, \ldots, r_l)$ from the representations of those relations. The score function of the path triple $\langle h, \ell, t\rangle$ is defined as follows:

$E(h, \ell, t) = \|l - (t - h)\| \approx \|l - r\| = E(\ell, r),$  (7.18)

where $r$ indicates the golden relation between $h$ and $t$. Since PTransE also wants to preserve the TransE assumption that $r \approx t - h$, it directly utilizes $r$ in training. The optimization objective of PTransE is

$\mathcal{L} = \sum_{(h, r, t) \in \mathcal{S}} \Big[ L(h, r, t) + \frac{1}{Z} \sum_{\ell \in P(h, t)} R(\ell \mid h, t)\, L(\ell, r) \Big],$  (7.19)

Fig. 7.7 The architecture of the TransA model [47]

where $L(h, r, t)$ is the margin-based loss with $E(h, r, t)$ and $L(\ell, r)$ is the margin-based loss with $E(\ell, r)$. The reliability $R(\ell \mid h, t)$ of $\ell$ in $(h, \ell, t)$ is thus taken into account in the overall loss function.

Besides PTransE, similar ideas such as [21, 22] also successfully exploit multistep relational paths on different tasks such as knowledge completion and question answering. These works demonstrate that there is plentiful information located in multistep relational paths, which can significantly improve the performance of knowledge graph representation, and further exploration of more sophisticated models for relational paths remains promising.
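The resource-propagation idea of PCRA (Eq. 7.17) can be sketched as a simple forward pass over the entity sets along a path. This is a simplified illustration with a hypothetical edge-list representation, not the original implementation.

```python
from collections import defaultdict

def pcra_reliability(layers, edges):
    """Propagate resource from S_0 = {h} to S_l = {t} along a path (Eq. 7.17).
    layers: list of entity sets [S_0, ..., S_l]; edges[i]: set of (n, m) pairs
    linking S_{i-1} to S_i via relation r_i. Returns the resource per entity in S_l."""
    resource = {e: 1.0 for e in layers[0]}           # all resource starts at the head
    for i in range(1, len(layers)):
        out_degree = defaultdict(int)
        for n, _ in edges[i]:
            out_degree[n] += 1                       # |S_i(n, .)|
        new_resource = defaultdict(float)
        for n, m in edges[i]:
            new_resource[m] += resource.get(n, 0.0) / out_degree[n]
        resource = new_resource
    return dict(resource)                            # R(t) measures the path reliability
```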

7.2.3.6 TransA

TransA [78] is proposed to solve the following problems of TransE and its extensions: (1) TransE and its extensions only consider Euclidean distance in their energy functions, which is less flexible. (2) Existing methods treat each dimension of the semantic vector space identically regardless of the triple, which may introduce errors when calculating dissimilarities. To solve these problems, as illustrated in Fig. 7.7, TransA replaces the inflexible Euclidean distance with an adaptive Mahalanobis distance, which is more adaptive and flexible. The energy function of TransA is as follows:

$E(h, r, t) = (|h + r - t|)^\top W_r\, (|h + r - t|),$  (7.20)

where $W_r$ is a relation-specific nonnegative symmetric matrix corresponding to the adaptive metric. Note that $|h + r - t|$ denotes a nonnegative vector whose entries are the absolute values of the translation result. We have

$|h + r - t| \triangleq \big(|h_1 + r_1 - t_1|,\ |h_2 + r_2 - t_2|,\ \ldots,\ |h_n + r_n - t_n|\big).$  (7.21)
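The adaptive metric of TransA (Eqs. 7.20-7.21) amounts to a weighted quadratic form over the absolute translation error; a minimal NumPy sketch:

```python
import numpy as np

def transa_score(h, r, t, W_r):
    """E(h, r, t) = |h + r - t|^T W_r |h + r - t|, with W_r nonnegative and symmetric."""
    v = np.abs(h + r - t)   # element-wise absolute value of the translation error
    return v @ W_r @ v
```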

7.2.3.7 KG2E

Existing translation-based models usually consider entities and relations as vectors embedded in low-dimensional semantic spaces. However, as explained above, entities and relations in KGs vary in granularity. Therefore, the margin in the margin-based loss function that is used to distinguish positive triples from negative triples should be more flexible to account for this diversity, and the uncertainties of entities and relations should be taken into consideration. To address this, KG2E [30] is proposed, introducing multidimensional Gaussian distributions to KG representations. As illustrated in Fig. 7.8, KG2E represents each entity and relation with a Gaussian distribution. Specifically, the mean vector denotes the central position of the entity or relation, and the covariance matrix denotes its uncertainty. To learn the Gaussian distributions for entities and relations, KG2E also follows the score function framework proposed in TransE. For a triple $\langle h, r, t\rangle$, the Gaussian distributions of the entities and relation are defined as follows:

$h \sim \mathcal{N}(\mu_h, \Sigma_h), \qquad t \sim \mathcal{N}(\mu_t, \Sigma_t), \qquad r \sim \mathcal{N}(\mu_r, \Sigma_r).$  (7.22)

Note that the covariances are diagonal for the sake of efficiency. KG2E hypothesizes that the head and tail entities are independent given a specific relation; the translation $h - t$ can then be defined as

$h - t = e \sim \mathcal{N}(\mu_h - \mu_t, \Sigma_h + \Sigma_t).$  (7.23)

To measure the dissimilarity between $e$ and $r$, KG2E proposes two methods, considering asymmetric similarity and symmetric similarity respectively. The asymmetric similarity is based on the KL divergence between $e$ and $r$, which is a straightforward way to measure the similarity between two probability distributions. The energy function is as follows:

$E(h, r, t) = D_{\mathrm{KL}}(e \,\|\, r) = \int_{x \in \mathbb{R}^{k_e}} \mathcal{N}(x; \mu_e, \Sigma_e) \log \frac{\mathcal{N}(x; \mu_e, \Sigma_e)}{\mathcal{N}(x; \mu_r, \Sigma_r)}\, dx$
$\qquad = \frac{1}{2} \Big( \mathrm{tr}(\Sigma_r^{-1} \Sigma_e) + (\mu_r - \mu_e)^\top \Sigma_r^{-1} (\mu_r - \mu_e) - \log \frac{\det(\Sigma_e)}{\det(\Sigma_r)} - k_e \Big),$  (7.24)

where $\mathrm{tr}(\Sigma)$ indicates the trace of $\Sigma$ and $\Sigma^{-1}$ indicates its inverse. The symmetric similarity is based on the expected likelihood or probability product kernel. KG2E takes the inner product between $\mathcal{P}_e$ and $\mathcal{P}_r$ as the measure of similarity and uses its negative logarithm as the energy:

Fig. 7.8 The architecture of the KG2E model [47]

$E(h, r, t) = -\log \int_{x \in \mathbb{R}^{k_e}} \mathcal{N}(x; \mu_e, \Sigma_e)\, \mathcal{N}(x; \mu_r, \Sigma_r)\, dx = -\log \mathcal{N}(0; \mu_e - \mu_r, \Sigma_e + \Sigma_r)$
$\qquad = \frac{1}{2} \Big( (\mu_e - \mu_r)^\top (\Sigma_e + \Sigma_r)^{-1} (\mu_e - \mu_r) + \log \det(\Sigma_e + \Sigma_r) + k_e \log(2\pi) \Big).$  (7.25)

The optimization objective of KG2E is also margin-based, similar to TransE. Both the asymmetric and symmetric similarities are subject to some regularization to avoid overfitting:

$\forall l \in \mathcal{E} \cup \mathcal{R}, \quad \|\mu_l\|_2 \le 1, \quad c_{\min} I \le \Sigma_l \le c_{\max} I, \quad c_{\min} > 0.$  (7.26)

Figure 7.8 shows a brief example of representations in KG2E.
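With diagonal covariances, the KL-based energy of KG2E (Eqs. 7.23-7.24) has a closed form; the sketch below stores each covariance as a vector of variances and is only an illustration of the formula, not the original implementation.

```python
import numpy as np

def kg2e_kl_energy(mu_h, var_h, mu_t, var_t, mu_r, var_r):
    """KL(e || r) with e ~ N(mu_h - mu_t, Sigma_h + Sigma_t); var_* are covariance diagonals."""
    mu_e = mu_h - mu_t
    var_e = var_h + var_t
    k = mu_e.shape[0]
    return 0.5 * (np.sum(var_e / var_r)                           # tr(Sigma_r^{-1} Sigma_e)
                  + np.sum((mu_r - mu_e) ** 2 / var_r)            # Mahalanobis term
                  - np.sum(np.log(var_e)) + np.sum(np.log(var_r)) # -log det(Sigma_e)/det(Sigma_r)
                  - k)
```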

7.2.3.8 TransG

We have discussed, in the section on TransR/CTransR, the problem that some relations in knowledge graphs such as location_contains or has_part may have multiple sub-meanings. These relations are more like combinations that could be divided into several more precise relations. To address this issue, CTransR is proposed with a preprocessing step that clusters entity pairs $(h, t)$ for each relation $r$. TransG [79] addresses the same issue more elegantly by introducing a generative model. As illustrated in Fig. 7.9, it assumes that the embeddings of different semantic components should follow a Gaussian mixture model. The generative process is as follows:

1. For each entity $e \in \mathcal{E}$, TransG sets a standard normal distribution: $\mu_e \sim \mathcal{N}(0, I)$.
2. For a triple $\langle h, r, t\rangle$, TransG uses the Chinese Restaurant Process to automatically detect semantic components (i.e., sub-meanings of a relation): $\pi_{r,n} \sim \mathrm{CRP}(\beta)$.

Fig. 7.9 The architecture of the TransG model [47]

3. Draw the head embedding from a normal distribution: $h \sim \mathcal{N}(\mu_h, \sigma_h^2 I)$.
4. Draw the tail embedding from a normal distribution: $t \sim \mathcal{N}(\mu_t, \sigma_t^2 I)$.
5. Draw the relation embedding for this semantic component: $\mu_{r,n} = t - h \sim \mathcal{N}(\mu_t - \mu_h, (\sigma_h^2 + \sigma_t^2) I)$.

Here $\mu$ denotes the mean embedding and $\sigma^2$ the variance. Finally, the score function is

$E(h, r, t) \propto \sum_{n=1}^{N_r} \pi_{r,n}\, \mathcal{N}\big(\mu_t - \mu_h;\ \mu_{r,n},\ (\sigma_h^2 + \sigma_t^2) I\big),$  (7.27)

in which $N_r$ is the number of semantic components of $r$, and $\pi_{r,n}$ is the weight of the $n$-th component generated by the Chinese Restaurant Process. Figure 7.9 illustrates the advantage of the generative Gaussian mixture model.

7.2.3.9 ManifoldE

KG2E and TransG introduce Gaussian distributions to knowledge graph representation learning, improving flexibility and diversity with various forms of entity and relation representations. However, TransE and most of its extensions treat golden triples as single points in the low-dimensional vector space, following the translation assumption. This point assumption may lead to two problems: the algebraic system becomes ill-posed, and the geometric form is over-strict. ManifoldE [80] is proposed to address this issue, considering the possible positions of golden candidates in the vector space as a manifold instead of a single point. The overall score function of ManifoldE is defined as follows:

$E(h, r, t) = \|\mathcal{M}(h, r, t) - D_r^2\|^2,$  (7.28)

in which $D_r^2$ is a relation-specific manifold parameter indicating the bias. Two kinds of manifolds are then proposed in ManifoldE. ManifoldE(Sphere) is a straightforward manifold that supposes $t$ should be located on the sphere whose center is $h + r$ and whose radius is $D_r$. We have

$\mathcal{M}(h, r, t) = \|h + r - t\|_2^2.$  (7.29)

The second manifold utilized is the hyperplane, since it is much easier for two hyperplanes to intersect. The function of ManifoldE(Hyperplane) is

$\mathcal{M}(h, r, t) = (h + r_h)^\top (t + r_t),$  (7.30)

in which $r_h$ and $r_t$ represent two relation embeddings. This indicates that, for a triple $\langle h, r, t\rangle$, the tail entity $t$ should be located in the hyperplane whose direction is $h + r_h$ and whose bias is $D_r^2$. Furthermore, ManifoldE(Hyperplane) considers absolute values in $\mathcal{M}(h, r, t)$, i.e., $|h + r_h|^\top |t + r_t|$, to double the number of possible tail solutions. For both manifolds, the authors apply kernel forms in a Reproducing Kernel Hilbert Space.
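A minimal sketch of the sphere-based ManifoldE score (Eqs. 7.28-7.29), assuming D_r is the relation-specific radius; the kernelized and hyperplane variants are omitted.

```python
import numpy as np

def manifolde_sphere_score(h, r, t, D_r):
    """E(h, r, t) = ||M(h, r, t) - D_r^2||^2 with M(h, r, t) = ||h + r - t||_2^2."""
    m = np.sum((h + r - t) ** 2)
    return (m - D_r ** 2) ** 2
```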

7.2.4 Other Models

Translation-based methods such as TransE are simple but effective; their power has been consistently verified on various tasks like knowledge graph completion and triple classification, achieving state-of-the-art performance. However, there are also other representation learning methods that perform well on knowledge graph representation. In this part, we will take a brief look at these methods as inspiration.

7.2.4.1 Structured Embeddings

Structured Embeddings (SE) [8] is a classical representation learning method for KGs. In SE, each entity is projected into a $d$-dimensional vector space. SE designs two relation-specific matrices $M_{r,1}, M_{r,2} \in \mathbb{R}^{d \times d}$ for each relation $r$ and projects both head and tail entities with these relation-specific matrices when calculating similarities. The score function of SE is defined as follows:

$E(h, r, t) = \|M_{r,1} h - M_{r,2} t\|_1,$  (7.31)

in which both $h$ and $t$ are transformed into a relation-specific vector space with the projection matrices. The assumption of SE is that the projected head and tail embeddings should be as similar as possible according to the loss function. Different from the translation-based methods, SE models entities as embeddings and relations as projection matrices. In training, SE considers all triples in the training set and minimizes the overall loss function.

7.2.4.2 Semantic Matching Energy

Semantic Matching Energy (SME) [5, 6] proposes a more complicated representation learning method. Differing from SE, SME considers both entities and relations as low-dimensional vectors. For a triple $\langle h, r, t\rangle$, $h$ and $r$ are first combined using a projection function $g$ to get a new embedding $l_{h,r}$, and the same is done with $t$ and $r$ to get $l_{t,r}$. Next, a point-wise multiplication function is applied to the two combined embeddings $l_{h,r}$ and $l_{t,r}$ to get the score of this triple. SME proposes two different forms of the projection function, of which the linear form is

$E(h, r, t) = (M_1 h + M_2 r + b_1)^\top (M_3 t + M_4 r + b_2),$  (7.32)

and the bilinear form is

$E(h, r, t) = \big((M_1 h) \odot (M_2 r) + b_1\big)^\top \big((M_3 t) \odot (M_4 r) + b_2\big),$  (7.33)

where $\odot$ is the element-wise (Hadamard) product, $M_1$, $M_2$, $M_3$, and $M_4$ are weight matrices in the projection functions, and $b_1$ and $b_2$ are biases. Bordes et al. [6] build on SME and improve the bilinear form with three-way tensors instead of matrices.

7.2.4.3 Latent Factor Model

Latent Factor Model (LFM) is proposed for modeling large multi-relational datasets. LFM is based on a bilinear structure, which models entities as embeddings and relations as matrices. It can share sparse latent factors among different relations, significantly reducing the model and computational complexity. The score function of LFM is defined as follows:

$E(h, r, t) = h^\top M_r t,$  (7.34)

in which $M_r$ is the matrix representation of the relation $r$. Moreover, [92] proposes the DistMult model, which restricts $M_r$ to be a diagonal matrix. This enhanced model not only reduces the number of parameters of LFM and thus lowers the computational complexity, but also achieves better performance.
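The bilinear score of LFM and its diagonal restriction in DistMult (Eq. 7.34) can be sketched directly; the names below are illustrative.

```python
import numpy as np

def lfm_score(h, M_r, t):
    """Bilinear score h^T M_r t, with a full matrix per relation."""
    return h @ M_r @ t

def distmult_score(h, r_diag, t):
    """DistMult restricts M_r to a diagonal, giving sum_i h_i * r_i * t_i."""
    return np.sum(h * r_diag * t)
```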

7.2.4.4 RESCAL

RESCAL is a knowledge graph representation learning method based on matrix factorization [54, 55]. In RESCAL, to represent all triple facts in a knowledge graph, the authors employ a three-way tensor $\overrightarrow{X} \in \mathbb{R}^{d \times d \times k}$, in which $d$ is the number of entities and $k$ is the number of relations. In the three-way tensor $\overrightarrow{X}$, two modes stand for the head and tail entities while the third mode represents the relations. The entries of $\overrightarrow{X}$ are based on the existence of the corresponding triple facts. That is, $\overrightarrow{X}_{ijm} = 1$ if the triple ⟨$i$th entity, $m$th relation, $j$th entity⟩ holds in the training set, and $\overrightarrow{X}_{ijm} = 0$ otherwise. To capture the inherent structure of all triples, a tensor factorization model named RESCAL is then proposed. Suppose $\overrightarrow{X} = \{X_1, \ldots, X_k\}$; for each slice $X_n$, we have the following rank-$r$ factorization:

$X_n \approx A R_n A^\top,$  (7.35)

where $A \in \mathbb{R}^{d \times r}$ stands for the $r$-dimensional entity representations, and $R_n \in \mathbb{R}^{r \times r}$ represents the interactions of the $r$ latent components for the $n$-th relation. The assumption of this factorization is similar to LFM, while RESCAL also optimizes over the nonexisting triples where $\overrightarrow{X}_{ijm} = 0$ instead of only considering the positive instances. Following this tensor factorization assumption, the loss function of RESCAL is defined as follows:

$\mathcal{L} = \frac{1}{2} \sum_n \|X_n - A R_n A^\top\|_F^2 + \frac{1}{2} \lambda \Big( \|A\|_F^2 + \sum_n \|R_n\|_F^2 \Big),$  (7.36)

in which the second term is a regularization term and $\lambda$ is a hyperparameter.
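The regularized reconstruction objective of RESCAL (Eqs. 7.35-7.36) can be written down directly; the sketch below only evaluates the loss for given factors A and R_n and leaves the alternating least-squares optimization aside.

```python
import numpy as np

def rescal_loss(X_slices, A, R_slices, lam=0.1):
    """0.5 * sum_n ||X_n - A R_n A^T||_F^2 + 0.5 * lam * (||A||_F^2 + sum_n ||R_n||_F^2)."""
    recon = 0.5 * sum(np.linalg.norm(X_n - A @ R_n @ A.T, "fro") ** 2
                      for X_n, R_n in zip(X_slices, R_slices))
    reg = 0.5 * lam * (np.linalg.norm(A, "fro") ** 2
                       + sum(np.linalg.norm(R_n, "fro") ** 2 for R_n in R_slices))
    return recon + reg
```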

7.2.4.5 HOLE

RESCAL works well with multi-relational data but suffers from high computational complexity. To achieve both effectiveness and efficiency, Holographic Embeddings (HOLE) is proposed as an enhanced version of RESCAL [53]. HOLE employs an operation named circular correlation to generate compositional representations, which is similar to holographic models of associative memory. The circular correlation operation $\star: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^d$ between two entity embeddings $h$ and $t$ is defined as:

$[h \star t]_k = \sum_{i=0}^{d-1} h_i\, t_{(k+i) \bmod d}.$  (7.37)

Figure 7.10a demonstrates a simple instance of this operation. The probability of a triple $\langle h, r, t\rangle$ is then defined as

Fig. 7.10 The architecture of the RESCAL and HOLE models

$P(\phi_r(h, t) = 1) = \mathrm{Sigmoid}(r^\top (h \star t)).$  (7.38)

Circular correlation brings several advantages: (1) Unlike operations such as multiplication or convolution, circular correlation is noncommutative (i.e., $h \star t \neq t \star h$), which makes it capable of modeling asymmetric relations in knowledge graphs. (2) Circular correlation has lower computational complexity compared to the tensor product in RESCAL. What is more, circular correlation can be further sped up with the help of the Fast Fourier Transform (FFT), which is formalized as follows:

$h \star t = \mathcal{F}^{-1}\big(\overline{\mathcal{F}(h)} \odot \mathcal{F}(t)\big).$  (7.39)

Here $\mathcal{F}(\cdot)$ and $\mathcal{F}^{-1}(\cdot)$ represent the FFT and its inverse, $\overline{\mathcal{F}(\cdot)}$ denotes the complex conjugate in $\mathbb{C}^d$, and $\odot$ stands for the element-wise (Hadamard) product. Thanks to the FFT, the computational complexity of circular correlation is $O(d \log d)$, which is much lower than that of the tensor product.
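Circular correlation and the resulting HOLE probability (Eqs. 7.37-7.39) can be sketched with NumPy's FFT; this is an illustration of the operation, not the reference implementation.

```python
import numpy as np

def circular_correlation(h, t):
    """[h * t]_k = sum_i h_i t_{(k+i) mod d}, computed in O(d log d) via the FFT."""
    return np.real(np.fft.ifft(np.conj(np.fft.fft(h)) * np.fft.fft(t)))

def hole_probability(h, r, t):
    """P(phi_r(h, t) = 1) = sigmoid(r^T (h * t)) (Eq. 7.38)."""
    score = r @ circular_correlation(h, t)
    return 1.0 / (1.0 + np.exp(-score))
```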

7.2.4.6 Complex Embedding (ComplEx)

ComplEx [70] employs an eigenvalue decomposition model which makes use of complex-valued embeddings. The composition of complex embeddings can handle a large variety of binary relations, including symmetric and antisymmetric relations. Formally, the probability that the fact $\langle h, r, t\rangle$ is true is modeled through the log-odds $X_{hrt}$:

$f_r(h, t) = \mathrm{Sigmoid}(X_{hrt}),$  (7.40)

where the label of $\langle h, r, t\rangle$ is expected to be $1$ when the fact holds and $-1$ otherwise. Here, $X_{hrt}$ is calculated as follows:

$X_{hrt} = \mathrm{Re}(\langle r, h, \bar{t}\rangle) = \langle \mathrm{Re}(r), \mathrm{Re}(h), \mathrm{Re}(t)\rangle + \langle \mathrm{Re}(r), \mathrm{Im}(h), \mathrm{Im}(t)\rangle + \langle \mathrm{Im}(r), \mathrm{Re}(h), \mathrm{Im}(t)\rangle - \langle \mathrm{Im}(r), \mathrm{Im}(h), \mathrm{Re}(t)\rangle,$  (7.41)

where $\langle x, y, z\rangle = \sum_i x_i y_i z_i$ denotes the trilinear dot product, and $\mathrm{Re}(x)$ and $\mathrm{Im}(x)$ indicate the real part and the imaginary part of $x$, respectively. In fact, ComplEx can be viewed as an extension of RESCAL that assigns complex-valued embeddings to entities and relations. Besides, [29] has recently proved that HolE is mathematically equivalent to ComplEx.
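Using NumPy's native complex arrays, the ComplEx score is a one-liner; the sketch below computes Re(⟨r, h, conj(t)⟩) directly rather than through the expanded real and imaginary terms.

```python
import numpy as np

def complex_score(h, r, t):
    """X_hrt = Re(<r, h, conj(t)>) with complex-valued embeddings (Eq. 7.41)."""
    return np.real(np.sum(r * h * np.conj(t)))

def complex_probability(h, r, t):
    """Probability that the triple holds, i.e. the sigmoid of the score (Eq. 7.40)."""
    return 1.0 / (1.0 + np.exp(-complex_score(h, r, t)))
```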

7.2.4.7 Convolutional 2D Embeddings (ConvE)

ConvE [16] uses 2D convolution over embeddings and multiple layers of nonlinear features to model knowledge graphs. It is the first nonlinear model that significantly outperforms previous linear models. Specifically, ConvE uses convolutional and fully connected layers to model the interactions between input entities and relations. After that, the obtained features are flattened, transformed through a fully connected layer, and the inner product with all candidate object entity vectors is taken to generate a score for each triple. For each triple $\langle h, r, t\rangle$, ConvE defines its score function as

$f_r(h, t) = f\big(\mathrm{vec}\big(f([\bar{h}; \bar{r}] * \omega)\big) W\big)\, t,$  (7.42)

where $*$ denotes the convolution operator, $\mathrm{vec}(\cdot)$ means flattening a matrix into a vector, $r \in \mathbb{R}^k$ is a relation parameter depending on $r$, and $\bar{h}$ and $\bar{r}$ denote 2D reshapings of $h$ and $r$, respectively: if $h, r \in \mathbb{R}^k$, then $\bar{h}, \bar{r} \in \mathbb{R}^{k_a \times k_b}$, where $k = k_a k_b$. ConvE can be seen as an improvement over HolE. Compared with HolE, it learns multiple layers of nonlinear features and is thus theoretically more expressive than HolE.

7.2.4.8 Rotation Embeddings (RotatE)

RotatE [67] defines each relation as a rotation from the head entity to the tail entity in the complex vector space. Thus, it is able to model and infer various relation patterns, including symmetry/antisymmetry, inversion, and composition. Formally, the score function of RotatE for the fact $\langle h, r, t\rangle$ is defined as

$f_r(h, t) = \|h \odot r - t\|,$  (7.43)

where $\odot$ denotes the element-wise (Hadamard) product, $h, r, t \in \mathbb{C}^k$, and $|r_i| = 1$. RotatE is simple but achieves quite good performance. Compared with previous work, it is the first model capable of modeling and inferring all three relation patterns mentioned above.
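Representing each relation by a vector of phases makes the rotation constraint |r_i| = 1 hold by construction; the following is a minimal sketch of the RotatE distance (Eq. 7.43).

```python
import numpy as np

def rotate_score(h, r_phase, t):
    """f_r(h, t) = ||h o r - t|| with r = exp(i * phase), so every |r_i| = 1."""
    r = np.exp(1j * r_phase)      # element-wise rotation in the complex plane
    return np.linalg.norm(h * r - t)
```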

7.2.4.9 Neural Tensor Network

Socher et al. [65] propose the Neural Tensor Network (NTN) as well as the Single Layer Model (SLM), where NTN is an enhanced version of SLM. Inspired by previous attempts in KRL, SLM represents both entities and relations as low-dimensional vectors and also designs relation-specific projection matrices to map entities from the entity space to the relation space. Similar to SE, the score function of SLM is as follows:

$E(h, r, t) = r^\top \tanh(M_{r,1} h + M_{r,2} t),$  (7.44)

where $h, t \in \mathbb{R}^d$ are the head and tail embeddings, $r \in \mathbb{R}^k$ is the relation embedding, and $M_{r,1}, M_{r,2} \in \mathbb{R}^{d \times k}$ are the relation-specific matrices.

−→ E (h, r, t) = r tanh(h Mr t + Mr,1h + Mr,2t + br ), (7.45) 188 7 World Knowledge Representation

Fig. 7.11 The architecture of the NTN model [47]

where $\overrightarrow{M}_r \in \mathbb{R}^{d \times d \times k}$ is a 3-way relation-specific tensor, $b_r$ is the bias, and $M_{r,1}, M_{r,2} \in \mathbb{R}^{d \times k}$ are the relation-specific matrices similar to SLM. Note that SLM is the simplified version of NTN obtained when the tensor and bias are set to zero.

Besides the improvements in the score function, NTN also attempts to utilize the latent textual information located in entity names and achieves significant improvements. Differing from previous RL models that provide each entity with a single vector, NTN represents each entity as the average of its entity name's word embeddings. For example, the entity Bengal tiger will be represented as the average of the word embeddings of Bengal and tiger. It is apparent that the entity name provides valuable information for understanding an entity, since a Bengal tiger may come from Bengal and is related to other tigers. Moreover, the number of words is far smaller than the number of entities. Therefore, using the average word embeddings of entity names also lowers the computational complexity and alleviates the issue of data sparsity.

NTN utilizes tensor-based neural networks to model triple facts and achieves excellent results. However, the overcomplicated method leads to higher computational complexity compared to other methods, and the vast number of parameters limits its performance on sparse and large-scale KGs.

7.2.4.10 Neural Association Model (NAM)

NAM [43] adopts multilayer nonlinear activations in a deep neural network to model the conditional probabilities between head and tail entities. NAM studies two model structures: Deep Neural Network (DNN) and Relation Modulated Neural Network (RMNN). NAM-DNN feeds the head entity and relation embeddings into an MLP with $L$ fully connected layers, which is formalized as follows:

$z^{(l)} = \mathrm{Sigmoid}(M^{(l)} z^{(l-1)} + b^{(l)}), \quad l = 1, \ldots, L,$  (7.46)

where $z^{(0)} = [h; r]$, and $M^{(l)}$ and $b^{(l)}$ are the weight matrix and bias vector of the $l$-th fully connected layer, respectively. Finally, the score function of NAM-DNN is defined as

$f_r(h, t) = \mathrm{Sigmoid}(t^\top z^{(L)}).$  (7.47)

Different from NAM-DNN, NAM-RMNN feeds the relation embedding $r$ into each layer of the deep neural network as follows:

$z^{(l)} = \mathrm{Sigmoid}(M^{(l)} z^{(l-1)} + B^{(l)} r), \quad l = 1, \ldots, L,$  (7.48)

where $z^{(0)} = [h; r]$, and $M^{(l)}$ and $B^{(l)}$ indicate the weight matrices. The score function of NAM-RMNN is defined as

$f_r(h, t) = \mathrm{Sigmoid}(t^\top z^{(L)} + B^{(L+1)} r).$  (7.49)

7.3 Multisource Knowledge Graph Representation

We are living in a complicated, pluralistic real world, in which we receive information through all our senses and learn knowledge not only from structured knowledge graphs but also from plain texts, categories, images, and videos. This cross-modal information is considered as multisource information. Besides the structured knowledge graphs that are well utilized in the KRL methods above, we will introduce other kinds of KRL methods utilizing multisource information:

1. Plain text is among the most common information we deliver, receive, and analyze every day. There are vast amounts of plain text that remain to be explored, in which lies significant knowledge that structured knowledge graphs may not include. An entity description is a special kind of textual information that describes the corresponding entity within a few sentences or a short paragraph. Usually, entity descriptions are maintained by some knowledge graphs (e.g., Freebase) or can be automatically extracted from huge databases like Wikipedia.

2. Entity types are another important kind of structured information for building knowledge representations. To learn new objects within our prior knowledge systems, human beings tend to organize those objects into existing categories. An entity type is usually represented with a hierarchical structure, which consists of entity subtypes of different granularities. It is natural that entities in the real world usually have multiple entity types. Most of the existing famous knowledge graphs have their own customized hierarchical structures of entity types.

3. Images provide intuitive visual information describing what an entity looks like, and visual perception is the most significant information we receive and process every day. The latent information located in images helps a lot, especially when dealing with concrete entities. For instance, we may find out the potential relationship between Cherry and Plum (they are both plants belonging to Rosaceae) from their appearances. Images can be downloaded from websites, and there are also substantial image datasets like ImageNet.

Multisource information learning provides a novel way to learn knowledge representations not only from the internal information of structured knowledge graphs but also from external information such as plain texts, hierarchical types, and images. Moreover, the exploration of multisource information learning helps us further understand human cognition with all senses in the real world. The cross-modal representations learned based on knowledge graphs will also reveal possible relationships between different kinds of information.

7.3.1 Knowledge Graph Representation with Texts

Textual information is among the most common and widely used information these days. Large amounts of plain text are generated every day on the web and are easy to extract. Words are compressed symbols of our thoughts and can provide connections between entities, which is of great significance in KRL.

7.3.1.1 Knowledge Graph and Text Joint Embedding

Wang et al. [76] attempt to utilize textual information by jointly embedding entities, relations, and words into the same low-dimensional continuous vector space. Their joint model contains three parts: the knowledge model, the text model, and the alignment model. More specifically, the knowledge model is learned from the triple facts in KGs by translation-based models, while the text model is learned from the co-occurrences of words in a large corpus by Skip-gram. As for the alignment model, two methods are proposed, utilizing Wikipedia anchors and entity names respectively. The main idea of alignment by Wikipedia anchors is to replace the word-word pair $(w, v)$ with the word-entity pair $(w, e_v)$ according to the anchors in Wikipedia pages, while the main idea of alignment by entity names is to replace the entities in an original triple ⟨h, r, t⟩ with the corresponding entity names, producing ⟨w_h, r, t⟩, ⟨h, r, w_t⟩, and ⟨w_h, r, w_t⟩. Modeling entities and words in the same vector space makes it possible to encode both the information in knowledge graphs and that in plain texts, but the performance of this joint model depends on the completeness of Wikipedia anchors and may suffer from the weak interactions based merely on entity names. To address this issue, [101] proposes a new joint embedding based on [76] and improves the alignment model by taking entity descriptions into consideration, assuming that entities should be similar to all words in their descriptions. These joint models learn knowledge and text joint embeddings, improving evaluation performance in both word and knowledge representations.

Fig. 7.12 The architecture of the DKRL model

7.3.1.2 Description-Embodied Knowledge Graph Representation

Another way of utilizing textual information is to directly construct knowledge representations from entity descriptions instead of merely considering alignments. Xie et al. [82] propose Description-embodied Knowledge Graph Representation Learning (DKRL), which provides two kinds of knowledge representations: the first is the structure-based representation $h_S$ and $t_S$, which directly represents entities as in previous methods, and the second is the description-based representation $h_D$ and $t_D$, which derives from entity descriptions. The energy function derives from the translation-based framework:

$E(h, r, t) = \|h_S + r - t_S\| + \|h_S + r - t_D\| + \|h_D + r - t_S\| + \|h_D + r - t_D\|.$  (7.50)

The description-based representations are constructed via CBOW or CNN encoders that encode the rich textual information from plain texts into knowledge representations. The architecture of DKRL is shown in Fig. 7.12. Compared to conventional translation-based methods, the two types of entity representations in DKRL are constructed with both structural and textual information, and thus achieve better performance in knowledge graph completion and type classification. Besides, DKRL can represent an entity even if it is not in the training set, as long as there are a few sentences describing the entity. As millions of new entities come up every day, DKRL is capable of handling such zero-shot learning scenarios.

7.3.2 Knowledge Graph Representation with Types

Entity types, which serve as a kind of category information for entities and are usually arranged in hierarchical structures, can provide structured information that helps KRL methods understand entities better.

7.3.2.1 Type-Constraint Knowledge Graph Representation

Krompaß et al. [36] take type information as type constraints and improve existing methods like RESCAL and TransE accordingly. It is intuitive that for a particular relation, the head and tail entities should belong to specific types. For example, the head entity of the relation write_books should be a human (or, more precisely, an author), and the tail entity should be a book.

Specifically, in RESCAL, the original factorization $\mathbf{X}_r \approx \mathbf{A}\mathbf{R}_r\mathbf{A}^\top$ is modified to

$\mathbf{X}'_r \approx \mathbf{A}_{[\text{head}_r,:]}\,\mathbf{R}_r\,\mathbf{A}_{[\text{tail}_r,:]}^\top, \quad (7.51)$

in which $\text{head}_r$ and $\text{tail}_r$ are the sets of entities fitting the type constraints of the head or tail, and $\mathbf{X}'_r$ is a sparse adjacency matrix of shape $|\text{head}_r| \times |\text{tail}_r|$. In the enhanced version, only the entities that fit the type constraints are considered during factorization. In TransE, type constraints are utilized in negative sampling. The margin-based score functions of translation-based methods need negative instances, which are generated by randomly replacing the head or tail entity of a triple with another entity. With type constraints, the negative samples are chosen by

$h' \in \mathcal{E}_{[\text{head}_r]} \subseteq \mathcal{E}, \qquad t' \in \mathcal{E}_{[\text{tail}_r]} \subseteq \mathcal{E}, \quad (7.52)$

where $\mathcal{E}_{[\text{head}_r]}$ is the subset of entities satisfying the type constraint for heads of relation r, and $\mathcal{E}_{[\text{tail}_r]}$ is that for tails.
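A minimal sketch of type-constrained negative sampling as described above; the dictionaries mapping each relation to its admissible head and tail entities are hypothetical inputs.

```python
import random

def corrupt_triple(h, r, t, type_ok_heads, type_ok_tails):
    """Type-constrained negative sampling (Eq. 7.52): corrupt either the head
    or the tail, but only with entities that satisfy the relation's type
    constraints. type_ok_heads / type_ok_tails map a relation to the set of
    admissible head / tail entities (assumed data structures)."""
    if random.random() < 0.5:
        h_neg = random.choice([e for e in type_ok_heads[r] if e != h])
        return (h_neg, r, t)
    else:
        t_neg = random.choice([e for e in type_ok_tails[r] if e != t])
        return (h, r, t_neg)
```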

7.3.2.2 Type-Embodied Knowledge Graph Representation

Considering type information merely as constraints is simple and effective, but the performance is still limited. Instead of viewing type information only as constraints, Xie et al. [83] propose Type-embodied Knowledge Graph Representation Learning (TKRL), utilizing hierarchical-type structures to instruct the construction of projection matrices. Inspired by the idea in TransR that every entity should have multiple representations in different scenarios, the energy function of TKRL is defined as follows:

$E(h, r, t) = \|\mathbf{M}_{rh}\mathbf{h} + \mathbf{r} - \mathbf{M}_{rt}\mathbf{t}\|, \quad (7.53)$

Fig. 7.13 The architecture of the TKRL model, illustrating the RHE and WHE hierarchical-type encoders

in which $\mathbf{M}_{rh}$ and $\mathbf{M}_{rt}$ are two projection matrices for h and t that depend on their corresponding hierarchical types in this triple. Two hierarchical-type encoders are proposed to learn the projection matrices, regarding all subtypes in the hierarchy as projection matrices: the Recursive Hierarchy Encoder (RHE) is based on matrix multiplication, while the Weighted Hierarchy Encoder (WHE) is based on matrix summation:

$\mathbf{M}_{RHE_c} = \prod_{i=1}^{m} \mathbf{M}_{c}^{(i)} = \mathbf{M}_{c}^{(1)} \mathbf{M}_{c}^{(2)} \cdots \mathbf{M}_{c}^{(m)}, \quad (7.54)$

$\mathbf{M}_{WHE_c} = \sum_{i=1}^{m} \beta_i \mathbf{M}_{c}^{(i)} = \beta_1 \mathbf{M}_{c}^{(1)} + \cdots + \beta_m \mathbf{M}_{c}^{(m)}, \quad (7.55)$

where $\mathbf{M}_{c}^{(i)}$ stands for the projection matrix of the $i$th subtype of the hierarchical type c, and $\beta_i$ is the corresponding weight of that subtype. Figure 7.13 gives a simple illustration of TKRL. Taking RHE for instance, given the entity William Shakespeare, it is first projected to a rather general subtype space like human, and then sequentially projected to more precise subtypes like author or English author. Moreover, TKRL also proposes an enhanced soft-type constraint to alleviate the problems caused by the incompleteness of type information.
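The two hierarchical-type encoders and the TKRL energy can be sketched as follows, assuming NumPy arrays for the subtype projection matrices and the entity/relation embeddings.

```python
import numpy as np

def rhe_projection(subtype_matrices):
    """Recursive Hierarchy Encoder (Eq. 7.54): the type projection matrix is
    the product of its subtype matrices, from general to precise subtypes."""
    M = np.eye(subtype_matrices[0].shape[0])
    for M_i in subtype_matrices:
        M = M @ M_i
    return M

def whe_projection(subtype_matrices, betas):
    """Weighted Hierarchy Encoder (Eq. 7.55): weighted sum of subtype matrices."""
    return sum(beta * M_i for beta, M_i in zip(betas, subtype_matrices))

def tkrl_energy(h, r, t, M_rh, M_rt):
    """TKRL energy (Eq. 7.53) under the translation assumption (L2 norm assumed)."""
    return np.linalg.norm(M_rh @ h + r - M_rt @ t)
```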


Fig. 7.14 Examples of entity images [81]

7.3.3 Knowledge Graph Representation with Images

Images can provide intuitive visual information about the appearance of their corresponding entities, which may give significant hints about latent attributes of entities from certain aspects. For instance, Fig. 7.14 shows example images of the entities Suit of armour and Armet. The left side illustrates the triple fact ⟨Suit of armour, has_a_part, Armet⟩, and, perhaps surprisingly, we can infer this knowledge directly from the images.

7.3.3.1 Image-Embodied Knowledge Graph Representation

Xie et al. [81] propose Image-embodied Knowledge Graph Representation Learning (IKRL) to take visual information into consideration when constructing knowledge representations. Inspired by the multiple entity representations in [82], IKRL proposes an image-based representation $\mathbf{h}_I$ and $\mathbf{t}_I$ besides the original structure-based representation, and jointly learns both types of entity representations simultaneously within the translation-based framework.

$E(h, r, t) = \|\mathbf{h}_S + \mathbf{r} - \mathbf{t}_S\| + \|\mathbf{h}_S + \mathbf{r} - \mathbf{t}_I\| + \|\mathbf{h}_I + \mathbf{r} - \mathbf{t}_S\| + \|\mathbf{h}_I + \mathbf{r} - \mathbf{t}_I\|. \quad (7.56)$

More specifically, IKRL first constructs image representations for all entity images with neural networks, and then projects these image representations from image space to entity space via a projection matrix. Since most entities have multiple images of differing quality, IKRL selects the more informative and discriminative images via an attention-based method. The evaluation results of IKRL not only confirm the significance of visual information in understanding entities but also show the possibility of a joint heterogeneous semantic space. Moreover, the authors also find interesting semantic regularities analogous to w(man) − w(king) ≈ w(woman) − w(queen) found in word space, some of which are shown in Fig. 7.15.

Fig. 7.15 An example of semantic regularities in word space [81], relating the relations part_of and hypernym to word pairs such as (dresser, drawer), (pianoforte, keyboard), (cat, tiger), and (toothed whale, dolphin)
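A minimal sketch of IKRL's image-based entity representation: image features are projected into the entity space and aggregated with attention; using the dot product with the structure-based embedding as the attention score is an assumption made here for illustration.

```python
import numpy as np

def image_based_representation(image_feats, W_proj, structure_emb):
    """Aggregate multiple entity images into one image-based representation.

    image_feats:   (k, d_img) neural image features of the entity's k images
    W_proj:        (d_ent, d_img) projection from image space to entity space
    structure_emb: (d_ent,) structure-based embedding used to score the images
    """
    projected = image_feats @ W_proj.T                # project each image
    scores = projected @ structure_emb                # attention logits (assumed form)
    att = np.exp(scores) / np.exp(scores).sum()       # softmax over images
    return att @ projected                            # weighted combination
```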

7.3.4 Knowledge Graph Representation with Logic Rules

Typical knowledge graphs store knowledge in the form of triple facts with one relation linking two entities. Most existing KRL methods only consider the information within triple facts separately, ignoring the possible interactions and correlations between different triples. Logic rules, which are summaries derived from human prior knowledge, could help us with knowledge inference and reasoning. For instance, if we know the triple fact ⟨Beijing, is_capital_of, China⟩, we can easily infer with high confidence that ⟨Beijing, located_in, China⟩, since we know the logic rule is_capital_of ⇒ located_in. Some works focus on introducing logic rules into knowledge acquisition and inference, among which Markov Logic Networks are intuitively utilized to address this challenge [3, 58, 75]. The path-based TransE [38] stated above also implicitly considers the latent logic rules between different relations via relation paths.

7.3.4.1 KALE

KALE is a translation-based KRL method that jointly learns knowledge representations with logic rules [24]. The joint learning consists of two parts, namely, triple modeling and rule modeling. For triple modeling, KALE follows the translation assumption with a minor alteration in the scoring function as follows:

$E(h, r, t) = 1 - \frac{1}{3\sqrt{d}}\|\mathbf{h} + \mathbf{r} - \mathbf{t}\|, \quad (7.57)$

in which d stands for the dimension of the knowledge embeddings. E(h, r, t) takes values in [0, 1] for the convenience of joint learning. For the newly added rule modeling, KALE uses the t-norm fuzzy logics proposed in [25], which represent the truth value of a complex formula with the truth values of its constituents. Specifically, KALE focuses on two typical types of logic rules. The first is $\forall h, t: \langle h, r_1, t\rangle \Rightarrow \langle h, r_2, t\rangle$ (e.g., given ⟨Beijing, is_capital_of, China⟩, we can infer that ⟨Beijing, located_in, China⟩). KALE represents the scoring function of this logic rule $f_1$ via specific t-norm-based logical connectives as follows:

$E(f_1) = E(h, r_1, t)\,E(h, r_2, t) - E(h, r_1, t) + 1. \quad (7.58)$

The second is $\forall h, e, t: \langle h, r_1, e\rangle \wedge \langle e, r_2, t\rangle \Rightarrow \langle h, r_3, t\rangle$ (e.g., given ⟨Tsinghua, located_in, Beijing⟩ and ⟨Beijing, located_in, China⟩, we can infer that ⟨Tsinghua, located_in, China⟩). KALE defines the second scoring function as

$E(f_2) = E(h, r_1, e)\,E(e, r_2, t)\,E(h, r_3, t) - E(h, r_1, e)\,E(e, r_2, t) + 1. \quad (7.59)$

The joint training uses all positive formulae, including triple facts as well as logic rules. Note that, to ensure the quality of logic rules, KALE ranks all candidate logic rules by their truth values computed with a pretrained TransE model and manually filters the rules ranked at the top.
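The KALE scores above can be sketched as follows; the L1 norm in the triple score is an assumption, and the rule scores follow the product t-norm connectives of Eqs. 7.58 and 7.59.

```python
import numpy as np

def triple_score(h, r, t):
    """KALE triple score (Eq. 7.57) mapped into [0, 1]; L1 norm assumed."""
    d = h.shape[0]
    return 1.0 - np.linalg.norm(h + r - t, ord=1) / (3 * np.sqrt(d))

def implication_rule_score(s_body, s_head):
    """Rule ⟨h, r1, t⟩ ⇒ ⟨h, r2, t⟩ via product t-norm connectives (Eq. 7.58)."""
    return s_body * s_head - s_body + 1.0

def chain_rule_score(s_body1, s_body2, s_head):
    """Rule ⟨h, r1, e⟩ ∧ ⟨e, r2, t⟩ ⇒ ⟨h, r3, t⟩ (Eq. 7.59)."""
    return s_body1 * s_body2 * s_head - s_body1 * s_body2 + 1.0
```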

7.4 Applications

Recent years have witnessed great progress in knowledge-driven artificial intelligence, such as question answering systems and dialogue systems. AI agents are expected to accurately and deeply understand user demands, and then appropriately and flexibly give responses and solutions. Such work cannot be done without certain forms of knowledge. To introduce knowledge to AI agents, researchers first extract knowledge from heterogeneous information like plain texts, images, and structured knowledge bases. These various kinds of heterogeneous information are then fused and stored in certain structures like knowledge graphs. Next, the knowledge is projected into a low-dimensional semantic space following some KRL method. Finally, the learned knowledge representations are utilized in various knowledge applications like information retrieval and dialogue systems. Figure 7.16 demonstrates a brief pipeline of knowledge-driven applications from scratch. From the illustration, we can observe that knowledge graph representation learning is the critical component of the whole pipeline of knowledge-driven applications. It bridges the gap between knowledge graphs that store knowledge and knowledge applications that use knowledge.


Fig. 7.16 An illustration of knowledge-driven applications

Knowledge representations learned with distributed methods, compared to those with symbolic methods, are able to alleviate data sparsity and to model the similarities between entities and relations. Moreover, embedding-based methods can be conveniently combined with deep learning methods and naturally fit the combination with heterogeneous information. In this section, we will introduce possible applications of knowledge representations mainly from two aspects. First, we will introduce the usage of knowledge representations in knowledge-driven applications, and then we will show the power of knowledge representations for knowledge extraction and construction.

7.4.1 Knowledge Graph Completion

Knowledge graph completion aims to build structured knowledge bases by extracting knowledge from heterogeneous sources such as plain texts, existing knowledge bases, and images. Knowledge construction consists of several subtasks like relation extraction and information extraction, and it is the fundamental step in the whole knowledge-driven framework. Recently, automatic knowledge construction has attracted considerable attention since it is incredibly time consuming and labor intensive to deal with the enormous amount of existing and new information. In the following, we will introduce some explorations of neural relation extraction and concentrate on how they are combined with knowledge representations.

7.4.1.1 Knowledge Representations for Relation Extraction

Relation extraction focuses on predicting the correct relation between two entities given a short plain text containing the two entities. Generally, all relations to predict are predefined, which is different from open information extraction. Entities are usually marked with named entity recognition systems, extracted according to anchor texts, or automatically generated via distant supervision [50].


Fig. 7.17 The architecture of joint representation learning framework for knowledge acquisition

Conventional methods for relation extraction and classification are mainly based on statistical machine learning, which strongly depends on the quality of the extracted features. Zeng et al. [96] first introduce CNNs to relation classification and achieve great improvements. Lin et al. [40] further improve neural relation extraction models with attention over instances. Han et al. [27, 28] propose a novel joint representation learning framework for knowledge acquisition. The key idea is that the joint model learns knowledge and text representations within a unified semantic space via KG-text alignments. Figure 7.17 shows the brief framework of the KG-text joint model. For the text part, the sentence with the two entities Mark Twain and Florida is taken as the input of a CNN encoder, and the output of the CNN is considered to be the latent relation place_of_birth of this sentence. For the KG part, entity and relation representations are learned via translation-based methods. The learned representations of the KG and text parts are aligned during training. This work is the first attempt to encode knowledge representations from existing KGs into knowledge construction tasks, and it achieves improvements in both knowledge graph completion and relation extraction.

Fig. 7.18 The architecture of KNET model

7.4.2 Knowledge-Guided Entity Typing

Entity typing is the task of detecting semantic types for a named entity (or entity mention) in plain text. For example, given the sentence Jordan played 15 seasons in the NBA, entity typing aims to infer that Jordan in this sentence is a person, an athlete, and even a basketball player. Entity typing is important for named entity disambiguation since it can narrow down the range of candidates for an entity mention [10]. Moreover, entity typing also benefits many Natural Language Processing (NLP) tasks such as relation extraction [46], question answering [90], and knowledge base population [9].

Conventional named entity recognition models [69, 73] typically classify entity mentions into a small set of coarse labels (e.g., person, organization, location, and others). Since these entity types are too coarse grained for many NLP tasks, a number of works [15, 41, 94, 95] have been proposed to introduce a much larger set of fine-grained types, which are typically subtypes of those coarse-grained types. Previous fine-grained entity typing methods usually derive features using NLP tools such as POS tagging and parsing, and thus inevitably suffer from error propagation. Dong et al. [18] make the first attempt to explore deep learning in entity typing. The method only employs word vectors as features, discarding complicated feature engineering. Shimaoka et al. [63] further introduce the attention scheme into neural models for fine-grained entity typing. Neural models have achieved state-of-the-art performance for fine-grained entity typing. However, these methods face the following nontrivial challenges:

(1) Entity-Context Separation. Existing methods typically encode context words without utilizing the crucial correlations between the entity and its context. However, it is intuitive that the importance of context words for entity typing is significantly influenced by which entity mention we are concerned with. For example, in the sentence In 1975, Gates and Paul Allen co-founded Microsoft, which became the world's largest PC software company, the word company is much more important for determining the type of Microsoft than for the type of Gates.

(2) Entity-Knowledge Separation. Existing methods only consider the text information of entity mentions for entity typing. In fact, Knowledge Graphs (KGs) provide rich and effective additional information for determining entity types. For example, in the above sentence In 1975, Gates ... Microsoft ... company, even if we have no type information of Microsoft in the KG, entities similar to Microsoft (such as IBM) will also provide supplementary information.

In order to address the issues of entity-context separation and entity-knowledge separation, we propose Knowledge-guided Attention (KNET) Neural Entity Typing. As illustrated in Fig. 7.18, KNET mainly consists of two parts. First, KNET builds a neural network, including a Long Short-Term Memory (LSTM) network and a fully connected layer, to generate context and named entity representations. Second, KNET introduces knowledge attention to emphasize the critical words and improve the quality of context representations. Here we introduce the knowledge attention in detail. Knowledge graphs provide rich information about entities in the form of triples ⟨h, r, t⟩, where h and t are entities and r is the relation between them. Many KRL works have been devoted to encoding entities and relations into a real-valued semantic vector space based on the triple information in KGs.
KRL provides us with an efficient way to exploit KG information for entity typing. KNET employs the most widely used KRL method, TransE, to obtain an entity embedding $\mathbf{e}$ for each entity e. In the training scenario, it is known which entity e in the KG (with embedding $\mathbf{e}$) the entity mention m refers to, and hence KNET can directly compute the knowledge attention as follows:

$\alpha_i^{KA} = f\left(\mathbf{e}\,\mathbf{W}_{KA}\begin{bmatrix}\overrightarrow{\mathbf{h}_i}\\ \overleftarrow{\mathbf{h}_i}\end{bmatrix}\right), \quad (7.60)$

where $\mathbf{W}_{KA}$ is a bilinear parameter matrix, $[\overrightarrow{\mathbf{h}_i}; \overleftarrow{\mathbf{h}_i}]$ is the bidirectional LSTM state of the $i$th word, and $\alpha_i^{KA}$ is the attention weight for the $i$th word.

Knowledge Attention in Testing. The challenge is that, in the testing scenario, we do not know which entity in the KG a given entity mention corresponds to. A solution is to perform entity linking, but it would introduce linking errors. Besides, in many cases, KGs may not contain the corresponding entities for many entity mentions. To address this challenge, we build an additional text-based representation for entities in KGs during training. Concretely, for an entity e and its context sentence s, we encode its left and right context into $\mathbf{c}_l$ and $\mathbf{c}_r$ using a one-directional LSTM, and further learn the text-based representation $\hat{\mathbf{e}}$ as follows:

$\hat{\mathbf{e}} = \tanh\left(\mathbf{W}\begin{bmatrix}\mathbf{m}\\ \mathbf{c}_l\\ \mathbf{c}_r\end{bmatrix}\right), \quad (7.61)$

where $\mathbf{W}$ is a parameter matrix and $\mathbf{m}$ is the mention representation. Note that the LSTM used here is different from the one used for context representation, in order to prevent interference. To bridge the text-based and KG-based representations, in the training scenario we simultaneously learn $\hat{\mathbf{e}}$ by adding an additional component to the objective function:

$O_{KG}(\theta) = -\sum_{e}\|\mathbf{e} - \hat{\mathbf{e}}\|^2. \quad (7.62)$

In this way, in the testing scenario, we can directly use Eq. 7.61 to obtain the corresponding entity representation and compute the knowledge attention using Eq. 7.60.
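A minimal sketch of the knowledge attention in Eq. 7.60, assuming a softmax normalization for f and PyTorch tensors for the entity embedding, the bidirectional LSTM states, and the bilinear matrix.

```python
import torch
import torch.nn.functional as F

def knowledge_attention(entity_emb, lstm_states, W_ka):
    """Bilinear knowledge attention (Eq. 7.60).

    entity_emb:  (d_e,)         TransE embedding of the linked entity
    lstm_states: (n, 2 * d_h)   concatenated forward/backward LSTM states per word
    W_ka:        (d_e, 2 * d_h) bilinear parameter matrix
    Returns the attention weights and the weighted context representation.
    """
    scores = lstm_states @ (entity_emb @ W_ka)   # (n,) attention logits
    alpha = F.softmax(scores, dim=0)             # normalization assumed for f
    return alpha, alpha @ lstm_states
```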

7.4.3 Knowledge-Guided Information Retrieval

The emergence of large-scale knowledge graphs has motivated the development of entity-oriented search, which utilizes knowledge graphs to improve search engines. Recent progress in entity-oriented search includes better text representations with entity annotations [61, 85], richer ranking features [14], entity-based connections between queries and documents [45, 84], and soft-matching queries and documents through knowledge graph relations or embeddings [19, 88]. These approaches bring in entities and semantics from knowledge graphs and have greatly improved the effectiveness of feature-based search systems. Another frontier of information retrieval is the development of neural ranking models (neural-IR). Deep learning techniques have been used to learn distributed representations of queries and documents that capture their relevance relations (representation-based) [62], or to model the query-document relevance directly from their word-level interactions (interaction-based) [13, 23, 87]. Neural-IR approaches, especially the interaction-based ones, have greatly improved ranking accuracy when large-scale training data are available [13]. Entity-oriented search and neural-IR push the boundary of search engines from two different aspects. Entity-oriented search incorporates human knowledge from entities and knowledge graph semantics, and has shown promising results in feature-based ranking systems. On the other hand, neural-IR leverages distributed representations and neural networks to learn more sophisticated ranking models from large-scale training data. The Entity-Duet Neural Ranking Model (EDRM), as shown in Fig. 7.19, incorporates entities into interaction-based neural ranking models. EDRM first learns the distributed representations of entities using their semantics from knowledge graphs: descriptions and types. Then it follows a recent state-of-the-art entity-oriented search framework, the word-entity duet [86], and matches documents to queries with both bag-of-words and bag-of-entities.


Fig. 7.19 The architecture of EDRM model

Instead of manual features, EDRM uses interaction-based neural models [13] to match the query and documents with word-entity duet representations. As a result, EDRM combines entity-oriented search and interaction-based neural-IR; it brings knowledge graph semantics to neural-IR and enhances entity-oriented search with neural networks.

7.4.3.1 Interaction-Based Ranking Models

Given a query q and a document d, interaction-based models first build a word-level translation matrix between q and d. The translation matrix describes word-pair similarities using word correlations, which are captured by word embedding similarities in interaction-based models. Typically, interaction-based ranking models first map each word w in q and d to an L-dimensional embedding $\mathbf{v}_w$:

$\mathbf{v}_w = \text{Emb}_w(w). \quad (7.63)$

It then constructs the interaction matrix M based on the query and document embeddings. Each element $M_{ij}$ in the matrix compares the $i$th word in q and the $j$th word in d, e.g., using the cosine similarity of word embeddings:

$M_{ij} = \cos(\mathbf{v}_{w_i^q}, \mathbf{v}_{w_j^d}). \quad (7.64)$

With the translation matrix describing the term-level matches between the query and documents, the next step is to calculate the final ranking score from the matrix. Many approaches have been developed in interaction-based neural ranking models, but in general they include a feature extractor on M and then one or several ranking layers that combine the features into the ranking score.
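A minimal sketch of building the interaction matrix of Eq. 7.64 from pre-computed word embeddings; any feature extractor and ranking layers would then operate on the returned matrix.

```python
import numpy as np

def interaction_matrix(query_emb, doc_emb):
    """Cosine-similarity interaction matrix M (Eq. 7.64).

    query_emb: (n_q, L) embeddings of the query words
    doc_emb:   (n_d, L) embeddings of the document words
    Returns M of shape (n_q, n_d) with M[i, j] = cos(v_{q_i}, v_{d_j}).
    """
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    return q @ d.T
```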

7.4.3.2 Semantic Entity Representation

EDRM incorporates the semantic information about an entity from the knowledge graph into its representation. The representation includes three embeddings: entity embedding, description embedding, and type embedding, all of dimension L, which are combined to generate the semantic representation of the entity. Entity Embedding uses an L-dimensional embedding layer $\text{Emb}_e$ to get the entity embedding $\mathbf{v}_e$ for entity e:

$\mathbf{v}_e = \text{Emb}_e(e). \quad (7.65)$

Description Embedding encodes an entity description, which contains m words and explains the entity. EDRM first employs the word embedding layer $\text{Emb}_v$ to embed each description word v into $\mathbf{v}$. Then it combines all word embeddings in the text into an embedding matrix $\mathbf{V}_w$. Next, it leverages convolutional filters that slide over the text and compose the length-$h$ $n$-gram representations $\mathbf{g}_e^j$:

$\mathbf{g}_e^j = \text{ReLU}(\mathbf{W}_{CNN} \cdot \mathbf{V}_w^{j:j+h} + \mathbf{b}_{CNN}), \quad (7.66)$

where $\mathbf{W}_{CNN}$ and $\mathbf{b}_{CNN}$ are the two parameters of the convolutional filter. Then max pooling is applied after the convolution layer to generate the description embedding $\mathbf{v}_e^{des}$:

$\mathbf{v}_e^{des} = \max(\mathbf{g}_e^1, \ldots, \mathbf{g}_e^j, \ldots, \mathbf{g}_e^m). \quad (7.67)$

Type Embedding encodes the categories of entities. Each entity e has n types $F_e = \{f_1, \ldots, f_j, \ldots, f_n\}$. EDRM first gets the embedding $\mathbf{v}_{f_j}$ of type $f_j$ through the type embedding layer $\text{Emb}_{type}$:

$\mathbf{v}_{f_j}^{emb} = \text{Emb}_{type}(f_j). \quad (7.68)$

Then EDRM utilizes an attention mechanism to combine the entity types into the type embedding $\mathbf{v}_e^{type}$:

$\mathbf{v}_e^{type} = \sum_{j=1}^{n} \alpha_j \mathbf{v}_{f_j}, \quad (7.69)$

where $\alpha_j$ is the attention score, calculated as

$\alpha_j = \frac{\exp(y_j)}{\sum_{l=1}^{n}\exp(y_l)}, \quad (7.70)$

$y_j = \left(\mathbf{W}_{bow}\sum_{i}\mathbf{v}_{t_i}\right) \cdot \mathbf{v}_{f_j}, \quad (7.71)$

where $y_j$ is the dot product of the query or document representation and the type embedding $\mathbf{v}_{f_j}$. Bag-of-words encoding is used for the query or document, and $\mathbf{W}_{bow}$ is a parameter matrix.

Combination. The three embeddings are combined by a linear layer to generate the semantic representation of the entity:

$\mathbf{v}_e^{sem} = \mathbf{v}_e^{emb} + \mathbf{W}_e[\mathbf{v}_e^{des}; \mathbf{v}_e^{type}] + \mathbf{b}_e, \quad (7.72)$

in which $\mathbf{W}_e$ is an $L \times 2L$ matrix and $\mathbf{b}_e$ is an $L$-dimensional vector.
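The three embeddings and their combination (Eqs. 7.65-7.72) can be sketched as a small PyTorch module; the dimensions, the convolution padding, and folding the bias $\mathbf{b}_e$ into the final linear layer are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticEntityRepr(nn.Module):
    """Sketch of EDRM's semantic entity representation (Eqs. 7.65-7.72).
    All inputs to forward() are LongTensors of ids."""

    def __init__(self, n_entities, n_types, vocab_size, L=128, h=3):
        super().__init__()
        self.emb_e = nn.Embedding(n_entities, L)      # entity embedding (Eq. 7.65)
        self.emb_w = nn.Embedding(vocab_size, L)      # description / context words
        self.emb_t = nn.Embedding(n_types, L)         # type embeddings (Eq. 7.68)
        self.cnn = nn.Conv1d(L, L, kernel_size=h, padding=h // 2)
        self.W_bow = nn.Linear(L, L, bias=False)      # attention projection (Eq. 7.71)
        self.W_e = nn.Linear(2 * L, L)                # combination; its bias plays b_e (Eq. 7.72)

    def forward(self, entity_id, desc_word_ids, type_ids, context_word_ids):
        v_emb = self.emb_e(entity_id)                                  # (L,)
        V = self.emb_w(desc_word_ids).t().unsqueeze(0)                 # (1, L, m)
        v_des = F.relu(self.cnn(V)).max(dim=2).values.squeeze(0)       # Eqs. 7.66-7.67
        v_f = self.emb_t(type_ids)                                     # (n, L)
        bow = self.emb_w(context_word_ids).sum(dim=0)                  # query/doc bag-of-words
        alpha = F.softmax(v_f @ self.W_bow(bow), dim=0)                # Eqs. 7.70-7.71
        v_type = alpha @ v_f                                           # Eq. 7.69
        return v_emb + self.W_e(torch.cat([v_des, v_type]))            # Eq. 7.72
```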

7.4.3.3 Neural Entity-Duet Framework

The word-entity duet [86] is a recently developed framework in entity-oriented search. It utilizes the duet representation of bag-of-words and bag-of-entities to match query q and document d with handcrafted features. This work introduces it to neural-IR. EDRM first constructs the bag-of-entities $q^e$ and $d^e$ with entity annotations, as well as the bag-of-words $q^w$ and $d^w$, for q and d. The duet utilizes a four-way interaction: query words to document words ($q^w$-$d^w$), query words to document entities ($q^w$-$d^e$), query entities to document words ($q^e$-$d^w$), and query entities to document entities ($q^e$-$d^e$). Instead of features, EDRM uses a translation layer that calculates the similarity between a pair of query-document terms: ($\mathbf{v}_{w_i^q}$ or $\mathbf{v}_{e_i^q}$) and ($\mathbf{v}_{w_j^d}$ or $\mathbf{v}_{e_j^d}$). It constructs the interaction matrix $\mathcal{M} = \{M_{ww}, M_{we}, M_{ew}, M_{ee}\}$, where $M_{ww}$, $M_{we}$, $M_{ew}$, and $M_{ee}$ denote the interactions of $q^w$-$d^w$, $q^w$-$d^e$, $q^e$-$d^w$, and $q^e$-$d^e$, respectively. The elements in these matrices are the cosine similarities of the corresponding terms:

$M_{ww}^{ij} = \cos(\mathbf{v}_{w_i^q}, \mathbf{v}_{w_j^d}); \qquad M_{ee}^{ij} = \cos(\mathbf{v}_{e_i^q}, \mathbf{v}_{e_j^d});$
$M_{ew}^{ij} = \cos(\mathbf{v}_{e_i^q}, \mathbf{v}_{w_j^d}); \qquad M_{we}^{ij} = \cos(\mathbf{v}_{w_i^q}, \mathbf{v}_{e_j^d}). \quad (7.73)$

The final ranking feature $\Phi(\mathcal{M})$ is a concatenation of the four cross matches $\phi(M)$:

$\Phi(\mathcal{M}) = [\phi(M_{ww}); \phi(M_{we}); \phi(M_{ew}); \phi(M_{ee})], \quad (7.74)$

where $\phi$ can be any feature extractor used in interaction-based neural ranking models. The entity duet presents an effective way to cross-match queries and documents in the entity and word spaces. In EDRM, it introduces knowledge graph semantic representations into neural-IR models. The duet translation matrices provided by EDRM can be plugged into standard interaction-based neural ranking models such as K-NRM [87] and Conv-KNRM [13]. With sufficient training data, the whole model is optimized end-to-end with backpropagation. During this process, the integration of the knowledge graph semantics (entity embeddings, description embeddings, and type embeddings) and the matching with entities are learned jointly with the ranking neural network.
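A sketch of assembling the final ranking feature of Eq. 7.74 from the four interaction matrices; the feature extractor `phi` here is a simple top-k mean used only as a stand-in for kernel pooling or any other extractor from interaction-based rankers.

```python
import numpy as np

def phi(M, k=5):
    """Stand-in feature extractor over an interaction matrix: the mean of the
    top-k document-side similarities for each query term (kernel pooling or
    another extractor from interaction-based rankers could replace this)."""
    k = min(k, M.shape[1])
    return np.sort(M, axis=1)[:, -k:].mean(axis=1)

def ranking_feature(M_ww, M_we, M_ew, M_ee):
    """Concatenation of the four cross-match features (Eq. 7.74)."""
    return np.concatenate([phi(M_ww), phi(M_we), phi(M_ew), phi(M_ee)])
```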

7.4.4 Knowledge-Guided Language Models

Knowledge is important external information for language modeling, because statistical co-occurrences alone cannot guide the generation of all kinds of knowledge, especially for named entities with low frequencies. Researchers therefore try to incorporate external knowledge into language models for better performance on generation and representation.

7.4.4.1 NKLM

Language models aim to learn the probability distribution over sequences of words, which is a classical and essential NLP task that has been widely studied. Recently, sequence-to-sequence neural models (seq2seq) have flourished and are widely utilized in sequential generation tasks like machine translation [68] and image caption generation [72]. However, most seq2seq models have significant limitations when modeling and using background knowledge. To address this problem, Ahn et al. [1] propose the Neural Knowledge Language Model (NKLM), which considers knowledge provided by knowledge graphs when generating natural language sequences with RNN language models. The key idea is that NKLM has two ways to generate a word. The first is the same as in conventional seq2seq models, generating a "vocabulary word" according to the softmax probabilities; the second is to generate a "knowledge word" according to the external knowledge graph. Specifically, the NKLM model takes an LSTM as the framework for generating "vocabulary words". For the external knowledge graph information, NKLM denotes the topic knowledge as $K = \{a_1, \ldots, a_{|K|}\}$, in which each $a_i$ represents a fact associated with a certain entity (i.e., the "topic" named in [1]). At each step t, NKLM takes both the "vocabulary word" $w_{t-1}^v$ and the "knowledge word" $w_{t-1}^o$, as well as the fact $a_{t-1}$ predicted at step $t-1$, as the inputs of the LSTM. Next, the hidden state of the LSTM $\mathbf{h}_t$ is combined with the knowledge context $\mathbf{e}_k$ to get the fact key $\mathbf{k}_t$ via an MLP module, where the knowledge context $\mathbf{e}_k$ derives from the mean embedding of all related facts of the topic k. The fact key $\mathbf{k}_t$ is then used to extract the most appropriate fact $a_t$ from the corresponding topic knowledge. Finally, the selected fact $a_t$ is combined with the hidden state $\mathbf{h}_t$ to predict (1) both the "vocabulary word" $w_t^v$ and the "knowledge word" $w_t^o$, and (2) which kind of word to generate at this step. The architecture of NKLM is shown in Fig. 7.20. The NKLM model explores a novel neural model that combines the symbolic knowledge in external knowledge graphs with seq2seq language models. However, the topic of the knowledge is given when generating natural language, which makes NKLM less practical and scalable for more general free talk. Nevertheless, we still believe that it is promising to encode knowledge into language models with such methods.

Fig. 7.20 The architecture of the NKLM model

7.4.4.2 ERNIE

Pretrained language models like BERT [17] have a strong ability to represent language information from text. With rich language representations, pretrained models obtain state-of-the-art results on various NLP applications. However, existing pretrained language models rarely consider incorporating external knowledge to provide related background information for better language understanding. For example, given the sentence Bob Dylan wrote Blowin' in the Wind and Chronicles: Volume One, without knowing that Blowin' in the Wind and Chronicles: Volume One are a song and a book, respectively, it is difficult to recognize the two occupations of Bob Dylan, i.e., songwriter and writer. To enhance language representation models with external knowledge, Zhang et al. [100] propose an enhanced language representation model with informative entities (ERNIE). Knowledge Graphs (KGs) are important external knowledge resources, and informative entities in KGs can be the bridge to enhance language representation with knowledge. ERNIE considers overcoming two main challenges of incorporating external knowledge: Structured Knowledge Encoding and Heterogeneous Information Fusion. For extracting and encoding knowledge information, ERNIE first recognizes named entity mentions in text and then aligns these mentions to their corresponding entities in KGs. Instead of directly using the graph-based facts in KGs, ERNIE encodes the graph structure of KGs with knowledge embedding algorithms like TransE [7], and then takes the informative entity embeddings as input. Based on the alignments between text and KGs, ERNIE integrates entity representations from the knowledge module into the underlying layers of the semantic module.


Fig. 7.21 The architecture of ERNIE model

Similar to BERT, ERNIE adopts the masked language model and next sentence prediction as pretraining objectives. Besides, for better fusion of textual and knowledge features, ERNIE uses a new pretraining objective (denoising entity auto-encoder): it randomly masks some of the named entity alignments in the input text and trains the model to select appropriate entities from KGs to complete the alignments. Unlike existing pretrained language representation models that only utilize local context to predict tokens, these objectives require ERNIE to aggregate both context and knowledge facts for predicting both tokens and entities, leading to a knowledgeable language representation model. Figure 7.21 shows the overall architecture. The left part shows that ERNIE consists of two encoders (T-Encoder and K-Encoder), where T-Encoder is a stack of several classical Transformer layers and K-Encoder is a stack of the new aggregator layers designed for knowledge integration. The right part shows the details of the aggregator layer. In the aggregator layer, the input token embeddings and entity embeddings from the preceding aggregator are fed into two multi-head self-attention modules, respectively. Then, the aggregator adopts an information fusion layer for the mutual integration of the token and entity sequences and computes the output embedding for each token and entity. ERNIE explores how to incorporate knowledge information into language representation models. The experimental results demonstrate that ERNIE has more powerful abilities than BERT in both denoising distantly supervised data and fine-tuning on limited data.
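A minimal sketch of the token-entity information fusion step described above (not ERNIE's released implementation): aligned token and entity states are mixed into a shared hidden state and projected back into separate token and entity outputs; the dimensions and the GELU nonlinearity are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class InformationFusion(nn.Module):
    """Sketch of a token-entity fusion step inside an aggregator-style layer."""

    def __init__(self, d_token=768, d_entity=100, d_hidden=768):
        super().__init__()
        self.W_t = nn.Linear(d_token, d_hidden)    # token contribution
        self.W_e = nn.Linear(d_entity, d_hidden)   # aligned-entity contribution
        self.out_t = nn.Linear(d_hidden, d_token)  # back to token space
        self.out_e = nn.Linear(d_hidden, d_entity) # back to entity space

    def forward(self, token_state, entity_state):
        h = F.gelu(self.W_t(token_state) + self.W_e(entity_state))  # mutual integration
        return F.gelu(self.out_t(h)), F.gelu(self.out_e(h))         # new token / entity states
```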

7.4.4.3 KALM

Pretrained language models can do many tasks without supervised training data, like reading comprehension, summarization, and translation [60]. However, traditional language models are unable to efficiently model entity names observed in text. To solve this problem, Liu et al. [42] propose a new language model architecture, called the Knowledge-Augmented Language Model (KALM), which uses the entity types of words for better language modeling. KALM is a language model with the option to generate words from a set of entities from a knowledge database. An individual word can either come from a general word dictionary as in a traditional language model or be generated as the name of an entity from the knowledge database. The training objective only supervises the output words and leaves the word-type decision unsupervised. Entities in the knowledge database are partitioned by type, and the database is used to build the potential types of words. According to the context observed so far, the model decides whether a word is a general term or a named entity of a given type. Thus, KALM learns to predict whether the observed context is indicative of a named entity and which tokens are likely to be entities of a given type. Through language modeling, KALM learns a named entity recognizer without any explicit supervision, using only plain text and the potential types of words, and it achieves performance comparable to state-of-the-art supervised methods.

7.4.5 Other Knowledge-Guided Applications

Knowledge enables AI agents to understand, infer, and address user demands, which is essential in most knowledge-driven applications like information retrieval, question answering, and dialogue systems. The behavior of AI agents will be more reasonable and accurate with the help of knowledge representations. In the following subsections, we will introduce the improvements brought by knowledge representations to question answering and other applications.

7.4.5.1 Knowledge-Guided Question Answering

Question answering aims to give correct answers according to users' questions, which requires the capabilities of both natural language understanding of questions and inference for answer selection. Therefore, combining knowledge with question answering is a straightforward application of knowledge representations. Most conventional question answering systems directly utilize knowledge graphs as databases, ignoring the latent relationships between entities and relations. Recently, with the thriving of deep learning, explorations have focused on neural models for understanding questions and even generating answers. Considering the flexibility and diversity of generated answers in natural language, Yin et al. [93] propose a neural Generative Question Answering model (GENQA), which explores generating answers to simple factoid questions in natural language. Figure 7.22 demonstrates the workflow of GENQA. First, a bidirectional RNN serves as the Interpreter to transform the question q from natural language into a compressed representation $H_q$. Next, the Enquirer takes $H_q$ as the key to rank relevant triple facts of q in knowledge graphs and retrieves possible entities as $r_q$.

Fig. 7.22 The architecture of the GENQA model, which answers the question "How tall is Yao Ming?" with "He is 2.29m and visible from space"

Finally, the Answerer combines $H_q$ and $r_q$ to generate answers in natural language. Similar to [1], at each step the Answerer first decides whether to generate a common word or a knowledge word according to a logistic regression model. For common words, the Answerer acts in the same way as an RNN decoder, with $H_q$ selected by attention-based methods; for knowledge words, the Answerer directly generates the entities with higher ranks. There are gradually more efforts focusing on encoding knowledge representations into knowledge-driven tasks like information retrieval and dialogue systems. However, how to flexibly and effectively combine knowledge with AI agents remains to be explored in the future.

7.4.5.2 Knowledge-Guided Recommendation System

Due to the rapid growth of web information, recommendation systems have been playing an essential role in web applications. A recommendation system aims to predict the "rating" or "preference" that users may give to items. Since KGs can provide rich information, including both structured and unstructured data, recommendation systems have utilized more and more knowledge from KGs to enrich their contexts. Cheekula et al. [11] explore utilizing the hierarchical knowledge from the DBpedia category structure in recommendation systems and employ the spreading activation algorithm to identify entities of interest to the user. Besides, Passant [56] measures the semantic relatedness of artist entities in a KG to build recommendation systems. However, most of these systems mainly investigate the problem by leveraging the structure of KGs. Recently, with the development of representation learning, Zhang et al. [98] propose to jointly learn the latent representations in a collaborative filtering recommendation system as well as the entities' representations in KGs. Besides the tasks stated above, there are gradually more efforts focusing on encoding knowledge graph representations into other tasks such as dialogue systems [37, 103], entity disambiguation [20, 31], knowledge graph alignment [12, 102], dependency parsing [35], etc. Moreover, the idea of KRL has also motivated research on visual relation extraction [2, 99] and social relation extraction [71].

7.5 Summary

In this chapter, we first introduce the concept of the knowledge graph. A knowledge graph contains both entities and the relationships among them in the form of triple facts, providing an effective way for human beings to learn about and understand the real world. Next, we introduce the motivations of knowledge graph representation, which is considered a useful and convenient method for handling large amounts of knowledge and has been widely explored and utilized in multiple knowledge-based tasks, significantly improving their performance. We then describe existing approaches for knowledge graph representation. Further, we discuss several advanced approaches that aim to deal with the current challenges of knowledge graph representation. We also review real-world applications of knowledge graph representation such as language modeling, question answering, information retrieval, and recommendation systems. For a further understanding of knowledge graph representation, you can find more related papers in this paper list: https://github.com/thunlp/KRLPapers. There are also some recommended surveys and books, including:

• Bengio et al. Representation learning: A review and new perspectives [4].
• Liu et al. Knowledge representation learning: A review [47].
• Nickel et al. A review of relational machine learning for knowledge graphs [52].
• Wang et al. Knowledge graph embedding: A survey of approaches and applications [74].
• Ji et al. A survey on knowledge graphs: representation, acquisition and applications [34].

In the future, for better knowledge graph representation, there are several directions requiring further efforts:

(1) Utilizing More Knowledge. Current KRL approaches focus on representing triple-based knowledge from world knowledge graphs such as Freebase, Wikidata, etc. In fact, there are various kinds of knowledge in the real world, such as factual knowledge, event knowledge, commonsense knowledge, etc. What's more, knowledge is stored in different formats, such as attributes, quantifiers, text, and so on. Researchers have reached a consensus that utilizing more knowledge is a potential way toward more interpretable and intelligent NLP. Some existing works [44, 82] have made preliminary attempts at utilizing more knowledge in KRL. Beyond these works, is it possible to represent different knowledge in a unified semantic space that can be easily applied in downstream NLP tasks?

(2) Performing Deep Fusion of Knowledge and Language. There is no doubt that the joint learning of knowledge and language information can further benefit downstream NLP tasks. Existing works [76, 89, 97] have preliminarily verified the effectiveness of joint learning. Recently, ERNIE [100] and KnowBERT [57] further provide us with a novel perspective to fuse knowledge and language in pretraining. Soares et al. [64] learn relational similarity in text with the guidance of KGs, which is also a pioneer of knowledge fusion. Besides designing novel pretraining objectives, we could also design novel model architectures for downstream tasks that are more suitable for utilizing KRL, such as memory-based models [48, 91] and graph network-based models [66]. Nevertheless, effectively performing the deep fusion of knowledge and language remains an unsolved problem.

(3) Orienting Heterogeneous Modalities. With the fast development of the World Wide Web, the amount of audio, image, and video data on the Web has become larger and larger; these are also important resources for KRL besides texts. Some pioneering works [51, 81] explore learning knowledge representations on multi-modal knowledge graphs, but these are still preliminary attempts. Intuitively, audio and visual knowledge can provide complementary information, which benefits related NLP tasks. To the best of our knowledge, there is still a lack of research on applying multi-modal KRL in downstream tasks. How to efficiently and effectively integrate multi-modal knowledge is becoming a critical and challenging problem for KRL.

(4) Exploring Knowledge Reasoning. Most existing KRL methods represent knowledge information in a low-dimensional semantic space, which is feasible for the computation over complex knowledge graphs in neural-based NLP models. Although benefiting from the usability of low-dimensional embeddings, KRL cannot perform explainable reasoning such as with symbolic rules, which is of great importance for downstream NLP tasks. Recently, there has been increasing interest in the combination of embedding methods and symbolic reasoning methods [26, 59], aiming to take advantage of both. Beyond these works, there remain many unsolved problems in developing better knowledge reasoning abilities for KRL.

References

1. Sungjin Ahn, Heeyoul Choi, Tanel Pärnamaa, and Yoshua Bengio. A neural knowledge language model. arXiv preprint arXiv:1608.00318, 2016. 2. Stephan Baier, Yunpu Ma, and Volker Tresp. Improving visual relationship detection using semantic modeling of scene descriptions. In Proceedings of ISWC, 2017. 3. Islam Beltagy and Raymond J Mooney. Efficient markov logic inference for natural language semantics. In Proceedings of AAAI Workshop, 2014. 4. Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. TPAMI, 35(8):1798–1828, 2013. 5. Antoine Bordes, Xavier Glorot, Jason Weston, and Yoshua Bengio. Joint learning of words and meaning representations for open-text semantic parsing. In Proceedings of AISTATS, 2012.

6. Antoine Bordes, Xavier Glorot, Jason Weston, and Yoshua Bengio. A semantic matching energy function for learning with multi-relational data. Machine Learning, 94(2):233–259, 2014. 7. Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In Proceedings of NeurIPS, 2013. 8. Antoine Bordes, Jason Weston, Ronan Collobert, and Yoshua Bengio. Learning structured embeddings of knowledge bases. In Proceedings of AAAI, 2011. 9. Andrew Carlson, Justin Betteridge, Richard C Wang, Estevam R Hruschka Jr, and Tom M Mitchell. Coupled semi-supervised learning for information extraction. In Proceedings of WSDM, 2010. 10. Mohamed Chabchoub, Michel Gagnon, and Amal Zouaq. Collective disambiguation and semantic annotation for entity linking and typing. In Proceedings of SWEC, 2016. 11. Siva Kumar Cheekula, Pavan Kapanipathi, Derek Doran, Prateek Jain, and Amit P Sheth. Entity recommendations using hierarchical knowledge bases. 2015. 12. Muhao Chen, Yingtao Tian, Mohan Yang, and Zaniolo Carlo. Multilingual knowledge graph embeddings for cross-lingual knowledge alignment. In Proceedings of IJCAI, 2017. 13. Zhuyun Dai, Chenyan Xiong, Jamie Callan, and Zhiyuan Liu. Convolutional neural networks for soft-matching n-grams in ad-hoc search. In Proceedings of WSDM, 2018. 14. Jeffrey Dalton, Laura Dietz, and James Allan. Entity query feature expansion using knowledge base links. In Proceedings of SIGIR, 2014. 15. Luciano Del Corro, Abdalghani Abujabal, Rainer Gemulla, and Gerhard Weikum. Finet: Context-aware fine-grained named entity typing. In Proceedings of EMNLP, 2015. 16. Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. Convolutional 2d knowledge graph embeddings. In Proceedings of AAAI, 2018. 17. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL, 2019. 18. Li Dong, Furu Wei, Hong Sun, Ming Zhou, and Ke Xu. A hybrid neural model for type classification of entity mentions. In Proceedings of IJCAI, 2015. 19. Faezeh Ensan and Ebrahim Bagheri. Document retrieval model through semantic linking. In Proceedings of WSDM, 2017. 20. Wei Fang, Jianwen Zhang, Dilin Wang, Zheng Chen, and Ming Li. Entity disambiguation by knowledge and text jointly embedding. In Proceedings of CoNLL, 2016. 21. Alberto García-Durán, Antoine Bordes, and Nicolas Usunier. Composing relationships with translations. In Proceedings of EMNLP, 2015. 22. Kelvin Gu, John Miller, and Percy Liang. Traversing knowledge graphs in vector space. In Proceedings of EMNLP, 2015. 23. Jiafeng Guo, Yixing Fan, Qingyao Ai, and W. Bruce Croft. Semantic matching by non-linear word transportation for information retrieval. In Proceedings of CIKM, 2016. 24. Shu Guo, Quan Wang, Lihong Wang, Bin Wang, and Li Guo. Jointly embedding knowledge graphs and logical rules. In Proceedings of EMNLP, 2016. 25. Petr Hájek. Metamathematics of fuzzy logic, volume 4. Springer Science & Business Media, 1998. 26. Will Hamilton, Payal Bajaj, Marinka Zitnik, Dan Jurafsky, and Jure Leskovec. Embedding logical queries on knowledge graphs. In Proceedings of NIPS, 2018. 27. Xu Han, Zhiyuan Liu, and Maosong Sun. Joint representation learning of text and knowledge for knowledge graph completion. arXiv preprint arXiv:1611.04125, 2016. 28. Xu Han, Zhiyuan Liu, and Maosong Sun. 
Neural knowledge acquisition via mutual attention between knowledge graph and text. In Proceedings of AAAI, pages 4832–4839, 2018. 29. Katsuhiko Hayashi and Masashi Shimbo. On the equivalence of holographic and complex embeddings for link prediction. In Proceedings of ACL, 2017. 30. Shizhu He, Kang Liu, Guoliang Ji, and Jun Zhao. Learning to represent knowledge graphs with gaussian embedding. In Proceedings of CIKM, 2015. References 213

31. Hongzhao Huang, Larry Heck, and Heng Ji. Leveraging deep neural networks and knowledge graphs for entity disambiguation. arXiv preprint arXiv:1504.07678, 2015. 32. Guoliang Ji, Shizhu He, Liheng Xu, Kang Liu, and Jun Zhao. Knowledge graph embedding via dynamic mapping matrix. In Proceedings of ACL, 2015. 33. Guoliang Ji, Kang Liu, Shizhu He, and Jun Zhao. Knowledge graph completion with adaptive sparse transfer matrix. In Proceedings of AAAI, 2016. 34. Shaoxiong Ji, Shirui Pan, Erik Cambria, Pekka Marttinen, and Philip S Yu. A survey on knowl- edge graphs: Representation, acquisition and applications. arXiv preprint arXiv:2002.00388, 2020. 35. A-Yeong Kim, Hyun-Je Song, Seong-Bae Park, and Sang-Jo Lee. A re-ranking model for dependency parsing with knowledge graph embeddings. In Proceedings of IALP, 2015. 36. Denis Krompaß, Stephan Baier, and Volker Tresp. Type-constrained representation learning in knowledge graphs. In Proceedings of ISWC, 2015. 37. Phong Le, Marc Dymetman, and Jean-Michel Renders. Lstm-based mixture-of-experts for knowledge-aware dialogues. In Proceedings of ACL Workshop, 2016. 38. Yankai Lin, Zhiyuan Liu, Huanbo Luan, Maosong Sun, Siwei Rao, and Song Liu. Modeling relation paths for representation learning of knowledge bases. In Proceedings of EMNLP, 2015. 39. Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. Learning entity and relation embeddings for knowledge graph completion. In Proceedings of AAAI, 2015. 40. Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. Neural relation extraction with selective attention over instances. In Proceedings of ACL, 2016. 41. Xiao Ling and Daniel S Weld. Fine-grained entity recognition. In Proceedings of AAAI, 2012. 42. Angli Liu, Jingfei Du, and Veselin Stoyanov. Knowledge-augmented language model and its application to unsupervised named-entity recognition. In Proceedings of NAACL, 2019. 43. Quan Liu, Hui Jiang, Andrew Evdokimov, Zhen-Hua Ling, Xiaodan Zhu, Si Wei, and Yu Hu. Probabilistic reasoning via deep learning: Neural association models. arXiv preprint arXiv:1603.07704, 2016. 44. Quan Liu, Hui Jiang, Zhen-Hua Ling, Xiaodan Zhu, Si Wei, and Yu Hu. Commonsense knowledge enhanced embeddings for solving pronoun disambiguation problems in winograd schema challenge. arXiv preprint arXiv:1611.04146, 2016. 45. Xitong Liu and Hui Fang. Latent entity space: A novel retrieval approach for entity-bearing queries. Information Retrieval Journal, 18(6):473–503, 2015. 46. Yang Liu, Kang Liu, Liheng Xu, Jun Zhao, et al. Exploring fine-grained entity type constraints for distantly supervised relation extraction. In Proceedings of COLING, 2014. 47. Zhiyuan Liu, Maosong Sun, Yankai Lin, and Ruobing Xie. Knowledge representation learn- ing: a review. JCRD, 53(2):247–261, 2016. 48. Todor Mihaylov and Anette Frank. Knowledgeable reader: Enhancing cloze-style reading comprehension with external commonsense knowledge. In Proceedings of ACL, 2018. 49. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In Proceedings of ICLR, 2013. 50. Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. Distant supervision for relation extraction without labeled data. In Proceedings of ACL-IJCNLP, 2009. 51. Hatem Mousselly-Sergieh, Teresa Botschen, Iryna Gurevych, and Stefan Roth. A multimodal translation-based approach for knowledge graph representation learning. In Proceedings of JCLCS, 2018. 52. 
Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. A review of relational machine learning for knowledge graphs. 2015. 53. Maximilian Nickel, Lorenzo Rosasco, and Tomaso Poggio. Holographic embeddings of knowledge graphs. In Proceedings of AAAI, 2016. 54. Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. A three-way model for collective learning on multi-relational data. In Proceedings of ICML, 2011. 55. Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. Factorizing yago: scalable machine learning for linked data. In Proceedings of WWW, 2012. 214 7 World Knowledge Representation


Chapter 8 Network Representation

Abstract Network representation learning aims to embed the vertices of a network into low-dimensional dense representations, such that similar vertices in the network have close representations (usually measured by the cosine similarity or Euclidean distance of their representations). These representations can be used as features of the vertices and applied to many network analysis tasks. In this chapter, we first introduce the network representation learning algorithms proposed in the past decade. We then describe their extensions to various kinds of real-world networks. Finally, we introduce common evaluation tasks for network representation learning and the relevant datasets.

8.1 Introduction

As a natural way to represent objects and their relationships, networks are ubiquitous in our daily lives. The rapid development of social networks like Facebook and Twitter encourages researchers to design effective and efficient algorithms over network structure. A key problem in network analysis is how to represent the network information properly. Traditional representations of networks are usually high dimensional and sparse, which becomes a weakness when statistical learning is applied to networks. With the development of machine learning, feature learning for the vertices of a network has become an emerging task. Network representation learning algorithms therefore turn network information into low-dimensional dense real-valued vectors, which can be used as input for existing machine learning algorithms. For example, the representations of vertices can be fed to a classifier such as a Support Vector Machine (SVM) for the vertex classification task. The representations can also be used for visualization by treating them as points in a Euclidean space.
In this section, we formalize the network representation learning problem. Denote a network as $G = (V, E)$, where $V$ is the vertex set and $E$ is the edge set. An edge $e = (v_i, v_j) \in E$ with $v_i, v_j \in V$ is a directed edge from vertex $v_i$ to $v_j$. The outdegree of vertex $v_i$ is defined as $\deg_O(v_i) = |\{v_j \mid (v_i, v_j) \in E\}|$.


Fig. 8.1 A visualization of vertex embeddings learned by DeepWalk model [93]

Similarly, the indegree of vertex $v_i$ is defined as $\deg_I(v_i) = |\{v_j \mid (v_j, v_i) \in E\}|$. For an undirected network, we have $\deg(v_i) = \deg_O(v_i) = \deg_I(v_i)$. Taking a social network as an example, a vertex represents a user and an edge represents the friendship between two users. The indegree and outdegree correspond to the numbers of followers and followees of a user, respectively.
The adjacency matrix $\mathbf{A} \in \mathbb{R}^{|V| \times |V|}$ is a matrix where $A_{ij} = 1$ if $(v_i, v_j) \in E$ and $A_{ij} = 0$ otherwise. We can easily generalize the adjacency matrix to a weighted network by setting $A_{ij}$ to the weight of edge $(v_i, v_j)$. The adjacency matrix is a simple and straightforward representation of the network. Each row of $\mathbf{A}$ describes the relationship between one vertex and all other vertices and can be seen as the representation of the corresponding vertex.
Though convenient and straightforward, the adjacency matrix representation suffers from a scalability problem. The matrix $\mathbf{A}$ takes $|V| \times |V|$ space to store, which is usually unacceptable when $|V|$ grows large. The adjacency matrix is also very sparse, i.e., most of its entries are zeros. This sparsity makes discrete algorithms applicable, but it remains hard to develop efficient algorithms for statistical learning [93].
Therefore, researchers proposed to learn low-dimensional dense representations for the vertices of a network. Formally, the goal of network representation learning is to learn a real-valued vector $\mathbf{v} \in \mathbb{R}^d$ for each vertex $v \in V$, where the dimension $d$ is much smaller than the number of vertices $|V|$. The idea is that similar vertices should have close representations, as shown in Fig. 8.1. Network representation learning can be unsupervised or semi-supervised. The representations are learned automatically without feature engineering and, once learned, can be further used for specific tasks such as classification. Because these representations are low dimensional, efficient algorithms can be designed over them without considering the original network structure. We will discuss the evaluation of network representations in more detail later in this chapter.
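To make the storage issue concrete, the following minimal Python sketch (using NumPy and SciPy on a hypothetical 4-vertex toy network) builds a sparse adjacency matrix from an edge list; it only illustrates the representation discussed above, not any particular algorithm from this chapter.

import numpy as np
import scipy.sparse as sp

# A toy undirected network with |V| = 4 vertices, given as an edge list.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
rows, cols = zip(*(edges + [(j, i) for i, j in edges]))   # add both directions
A = sp.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(4, 4))

# The sparse matrix stores only O(|E|) nonzeros, but using its rows directly as
# |V|-dimensional vertex features neither scales nor generalizes well; network
# representation learning instead learns a dense d-dimensional vector per vertex
# with d much smaller than |V|.
print(A.toarray())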

8.2 Network Representation

In this section, we will introduce several kinds of network representation learning algorithms in detail.

8.2.1 Spectral Clustering Based Methods

Spectral clustering based methods are a group of algorithms that compute the first $k$ eigenvectors or singular vectors of an affinity matrix, such as the adjacency or Laplacian matrix of the network. These methods depend heavily on the construction of the affinity matrix, and the evaluation results of different affinity matrices vary a lot. Generally speaking, spectral clustering based methods have high complexity because computing eigenvectors and singular vectors of a large matrix takes superlinear time. Moreover, spectral clustering based methods need to keep the affinity matrix in memory during the computation, so the space complexity cannot be ignored either. These disadvantages limit the large-scale and online generalization of such methods. Below we present several representative algorithms based on spectral clustering.
Locally Linear Embedding (LLE) [98] assumes that the representations of vertices are sampled from a manifold. More specifically, LLE supposes that the representations of a vertex and its neighbors lie in a locally linear patch of the manifold. That is, a vertex's representation can be approximated by a linear combination of the representations of its neighbors. LLE uses the linear combination of the neighbors to reconstruct the center vertex. Formally, the reconstruction error of all vertices can be expressed as
$$\mathcal{L}(\mathbf{W}, \mathbf{V}) = \sum_{i=1}^{|V|} \Big\| \mathbf{v}_i - \sum_{j=1}^{|V|} W_{ij} \mathbf{v}_j \Big\|^2, \qquad (8.1)$$

where $\mathbf{V} \in \mathbb{R}^{|V| \times d}$ is the vertex embedding matrix and $W_{ij}$ is the contribution coefficient of vertex $v_j$ to $v_i$. LLE enforces $W_{ij} = 0$ if $v_i$ and $v_j$ are not connected, i.e., $(v_i, v_j) \notin E$. Further, each row of matrix $\mathbf{W}$ is required to sum to 1, i.e., $\sum_{j=1}^{|V|} W_{ij} = 1$.
Equation 8.1 is solved by alternately optimizing the weight matrix $\mathbf{W}$ and the representation $\mathbf{V}$. The optimization over $\mathbf{W}$ can be solved as a least-squares problem. The optimization over the representation $\mathbf{V}$ leads to the following optimization problem:
$$\min_{\mathbf{V}} \ \mathcal{L}(\mathbf{W}, \mathbf{V}) = \sum_{i=1}^{|V|} \Big\| \mathbf{v}_i - \sum_{j=1}^{|V|} W_{ij} \mathbf{v}_j \Big\|^2, \qquad (8.2)$$

$$\text{s.t.} \quad \sum_{i=1}^{|V|} \mathbf{v}_i = \mathbf{0}, \qquad (8.3)$$

$$\text{and} \quad |V|^{-1} \sum_{i=1}^{|V|} \mathbf{v}_i^\top \mathbf{v}_i = \mathbf{I}_d, \qquad (8.4)$$
where $\mathbf{I}_d$ denotes the $d \times d$ identity matrix. The conditions in Eqs. 8.3 and 8.4 ensure the uniqueness of the solution: the first condition centers all vertex embeddings at the origin, and the second guarantees that different coordinates have the same scale, i.e., an equal contribution to the reconstruction error.
The optimization problem can be formulated as the computation of the eigenvectors of the matrix $(\mathbf{I}_{|V|} - \mathbf{W})^\top(\mathbf{I}_{|V|} - \mathbf{W})$, which is an easily solvable eigenvalue problem. More details can be found in the note [22].
The Laplacian Eigenmap [8] algorithm simply follows the idea that the representations of two connected vertices should be close, where "closeness" is measured by the squared Euclidean distance. We use $\mathbf{D}$ to denote the diagonal degree matrix, a $|V| \times |V|$ diagonal matrix whose $i$th diagonal entry $D_{ii}$ is the degree of vertex $v_i$. The Laplacian matrix $\mathbf{L}$ of a graph is defined as the difference between the degree matrix and the adjacency matrix, i.e., $\mathbf{L} = \mathbf{D} - \mathbf{A}$. The Laplacian Eigenmap algorithm minimizes the following cost function:
$$\mathcal{L}(\mathbf{V}) = \sum_{\{i,j \mid (v_i, v_j) \in E\}} \|\mathbf{v}_i - \mathbf{v}_j\|^2, \qquad (8.5)$$
$$\text{s.t.} \quad \mathbf{V}^\top \mathbf{D} \mathbf{V} = \mathbf{I}_d. \qquad (8.6)$$

The cost function is the sum of the squared losses over all connected vertex pairs, and the constraint prevents the trivial all-zero solution that an arbitrary scaling would otherwise allow. Equation 8.5 can be reformulated in matrix form as

$$\mathbf{V}^* = \mathop{\arg\min}_{\mathbf{V}^\top \mathbf{D} \mathbf{V} = \mathbf{I}_d} \operatorname{tr}(\mathbf{V}^\top \mathbf{L} \mathbf{V}). \qquad (8.7)$$

Standard linear algebra tells us that the optimal solution $\mathbf{V}^*$ of Eq. 8.7 consists of the eigenvectors corresponding to the $d$ smallest nonzero eigenvalues of the Laplacian matrix $\mathbf{L}$. Note that the Laplacian Eigenmap algorithm can be easily generalized to weighted graphs.
Both LLE and Laplacian Eigenmap have symmetric cost functions, which means that neither algorithm can be applied to directed graphs. Directed Graph Embedding (DGE) [17] was proposed to generalize Laplacian Eigenmap. For both directed and undirected graphs, we can define a transition probability matrix $\mathbf{P} \in \mathbb{R}^{|V| \times |V|}$, where $P_{ij}$ denotes the probability that vertex $v_i$ walks to $v_j$.

Table 8.1 Applicability of LLE, Laplacian Eigenmap, and DGE to undirected, weighted, and directed graphs
Algorithm           | Undirected | Weighted | Directed
LLE                 | yes        | –        | –
Laplacian Eigenmap  | yes        | yes      | –
DGE                 | yes        | yes      | yes

The transition matrix defines a Markov random walk over the graph. We denote the stationary probability of vertex $v_i$ as $\pi_i$, where $\sum_i \pi_i = 1$. The stationary distribution of the random walk is commonly used in many ranking algorithms such as PageRank. DGE designs a new cost function which emphasizes important vertices, i.e., those with higher stationary probabilities:

$$\mathcal{L}(\mathbf{V}) = \sum_{i=1}^{|V|} \sum_{j=1}^{|V|} \pi_i P_{ij} \|\mathbf{v}_i - \mathbf{v}_j\|^2. \qquad (8.8)$$

By denoting $\mathbf{M} = \operatorname{diag}(\pi_1, \pi_2, \ldots, \pi_{|V|})$, the cost function in Eq. 8.8 can be reformulated as
$$\mathcal{L}(\mathbf{V}) = 2\operatorname{tr}(\mathbf{V}^\top \mathbf{B} \mathbf{V}), \qquad (8.9)$$
$$\text{s.t.} \quad \mathbf{V}^\top \mathbf{M} \mathbf{V} = \mathbf{I}_d, \qquad (8.10)$$
where
$$\mathbf{B} = \mathbf{M} - \frac{\mathbf{M}\mathbf{P} + \mathbf{P}^\top \mathbf{M}}{2}. \qquad (8.11)$$
The condition in Eq. 8.10 is added to remove an arbitrary scaling factor. Similar to Laplacian Eigenmap, the optimization problem can be solved as a generalized eigenvector problem.
For a comparison among the above three network embedding algorithms, Table 8.1 summarizes their applicability.

Unlike previous works which minimize the distance between vertex representations, Tang and Liu [112] introduce modularity [85] into the cost function instead. Modularity is a measurement which characterizes how far a graph is from a uniform random graph. Given graph $G = (V, E)$, we assume that the vertices $V$ are divided into $k$ nonoverlapping communities. By "uniform random graph", we mean that vertices connect to each other according to a uniform distribution given their degrees. The expected number of edges between $v_i$ and $v_j$ is then $\frac{\deg(v_i)\deg(v_j)}{2|E|}$, and the modularity $Q$ of a graph is defined as
$$Q = \frac{1}{2|E|} \sum_{i,j} \Big( A_{ij} - \frac{\deg(v_i)\deg(v_j)}{2|E|} \Big) \delta(v_i, v_j), \qquad (8.12)$$
where $\delta(v_i, v_j) = 1$ if $v_i$ and $v_j$ belong to the same community and $\delta(v_i, v_j) = 0$ otherwise. A larger modularity indicates that the subgraphs inside communities are denser, which follows the intuition that a community is a dense, well-connected cluster. The problem is then to find a partition that maximizes the modularity $Q$. However, hard clustering by modularity maximization is proved to be NP-hard. Therefore, they relax the problem to a soft case. Let $\mathbf{d} \in \mathbb{Z}_+^{|V|}$ denote the degrees of all vertices and $\mathbf{1} \in \{0, 1\}^{|V| \times k}$ denote the community indicator matrix, where
$$\mathbf{1}_{ij} = \begin{cases} 1 & \text{if vertex } i \text{ belongs to community } j, \\ 0 & \text{otherwise.} \end{cases} \qquad (8.13)$$

We then define the modularity matrix $\mathbf{B}$ as

$$\mathbf{B} = \mathbf{A} - \frac{\mathbf{d}\mathbf{d}^\top}{2|E|}, \qquad (8.14)$$
and the modularity $Q$ can be reformulated as

$$Q = \frac{1}{2|E|} \operatorname{tr}(\mathbf{1}^\top \mathbf{B} \mathbf{1}). \qquad (8.15)$$

By relaxing $\mathbf{1}$ to a continuous matrix, it has been proved that the optimal solution $\mathbf{1}$ consists of the top-$k$ eigenvectors of the modularity matrix $\mathbf{B}$ [84]. With an alternative cost function, Tang and Liu also proposed another algorithm [113] that optimizes the normalized cut of the graph. Similarly, that algorithm reduces to computing the top-$k$ eigenvectors of the normalized graph Laplacian $\tilde{\mathbf{L}}$:
$$\tilde{\mathbf{L}} = \mathbf{D}^{-\frac{1}{2}} \mathbf{L} \mathbf{D}^{-\frac{1}{2}} = \mathbf{I} - \mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}}. \qquad (8.16)$$

The relaxed community indicator matrix $\mathbf{1}$ is then taken as a $k$-dimensional vertex representation.
To conclude, spectral clustering methods for network representation learning typically define a cost function that is linear or quadratic in the vertex embeddings, reformulate the cost function in matrix form, and show that the optimal solutions are the eigenvectors of a particular matrix. The major drawback of spectral clustering methods is their complexity: computing eigenvectors of large-scale matrices is both time and space consuming.
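As a concrete illustration of this recipe, the sketch below computes Laplacian Eigenmap embeddings for a small dense adjacency matrix by solving the generalized eigenproblem implied by Eqs. 8.5-8.7. It assumes the graph has no isolated vertices (so that D is positive definite) and is meant as a toy illustration rather than a scalable solver; large sparse graphs would call for iterative sparse eigensolvers instead.

import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmap(A, d):
    """A: dense symmetric adjacency (or weight) matrix; returns a |V| x d embedding matrix."""
    D = np.diag(A.sum(axis=1))
    L = D - A
    # Generalized symmetric eigenproblem L v = lambda D v; eigenvalues come back in
    # ascending order, so the trivial (constant) eigenvector at eigenvalue 0 is skipped.
    vals, vecs = eigh(L, D)
    return vecs[:, 1:d + 1]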

8.2.2 DeepWalk

As shown in the previous subsections, exact computation of the optimal solution, such as eigenvector computation, is not very efficient for large-scale problems. Meanwhile, neural network approaches have proved their effectiveness in many areas such as natural language and image processing. Though gradient descent cannot always guarantee an optimal solution of a neural network model, the implementation and training of neural networks are relatively fast, and they usually perform well. Moreover, neural network models free people from feature engineering and are mostly data driven. Thus, exploring neural network approaches for representation learning has become an emerging task.
DeepWalk [93] proposes a novel approach that introduces deep learning techniques into network representation learning for the first time. The benefits of modeling truncated random walks instead of the adjacency matrix are twofold: first, random walks need only local information, which allows discrete and online algorithms, while modeling the adjacency matrix may require storing everything in memory and is thus space consuming; second, modeling random walks can alleviate the variance and uncertainty of modeling the original binary adjacency matrix. We will look more closely at DeepWalk in the following subsections.
Unsupervised representation learning algorithms have been widely studied and applied in the natural language processing area. The authors show that vertex frequencies in short random walks follow a power law, just as word frequencies in documents do. Drawing the analogy between vertices and words and between random walks and sentences, the authors adapted the well-known word representation learning algorithm word2vec [80] to vertex representation learning. We now introduce the DeepWalk algorithm in detail.
Given a graph $G = (V, E)$, we denote a random walk started at vertex $v_i$ as $\mathcal{W}_{v_i}$, and use $\mathcal{W}_{v_i}^k$ to represent the $k$th vertex in the random walk $\mathcal{W}_{v_i}$. The next vertex $\mathcal{W}_{v_i}^{k+1}$ is generated by uniformly random selection from the neighbors of vertex $\mathcal{W}_{v_i}^k$. Random walk sequences have been used for many network analysis tasks, such as similarity measurement and community detection [2, 32].
DeepWalk follows the idea of language modeling to model short random walk sequences, i.e., it estimates the likelihood of observing vertex $v_i$ given all previous vertices in the random walk:

$$P(v_i \mid (v_1, v_2, \ldots, v_{i-1})). \qquad (8.17)$$

For vertex representation learning, we instead predict vertex $v_i$ given the representations of all previous vertices:

$$P(v_i \mid (\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_{i-1})). \qquad (8.18)$$

As a relaxation commonly used in language modeling, we instead use vertex $v_i$ to predict its neighboring vertices $v_{i-w}, \ldots, v_{i-1}, v_{i+1}, \ldots, v_{i+w}$, where $w$ is the window size.

This part of the model is known as the Skip-gram model in word embedding learning. The neighboring vertices are also called the context vertices of the center vertex. As another simplification, DeepWalk ignores the order and offset of the vertices and thus predicts $v_{i-w}$ and $v_{i-1}$ in the same way. The objective for a single vertex of a random walk can be formulated as

$$\min_{\mathbf{v}} \ -\log P(\{v_{i-w}, \ldots, v_{i-1}, v_{i+1}, \ldots, v_{i+w}\} \mid v_i). \qquad (8.19)$$

Under the independence assumption, the loss function can be rewritten as

$$\min_{\mathbf{v}} \ -\sum_{k=-w,\, k \neq 0}^{w} \log P(v_{i+k} \mid v_i). \qquad (8.20)$$

The overall loss function is obtained by summing over every vertex in every random walk. We now discuss how to predict a single vertex $v_j$ given a center vertex $v_i$. In DeepWalk, each vertex $v_i$ has two representations of the same dimension: the vertex representation $\mathbf{v}_i \in \mathbb{R}^d$ and the context representation $\mathbf{c}_i \in \mathbb{R}^d$. The prediction probability $P(v_j \mid v_i)$ is defined by a softmax over all vertices:
$$P(v_j \mid v_i) = \frac{\exp(\mathbf{v}_i \cdot \mathbf{c}_j)}{\sum_{k=1}^{|V|} \exp(\mathbf{v}_i \cdot \mathbf{c}_k)}. \qquad (8.21)$$

Here we come to the parameter learning phase of DeepWalk. We first present the pseudocode of the DeepWalk framework in Algorithm 8.1.

Algorithm 8.1 DeepWalk algorithm
Given graph G = (V, E), window size w, embedding size d, walks per vertex n, and walk length l
for i = 1, 2, ..., n do
    for each v_i ∈ V do
        W_{v_i} = RandomWalk(G, v_i, l)
        Skip-gram(V, W_{v_i}, w)
    end for
end for

Here, RandomWalk(G, v_i, l) generates a random walk rooted at $v_i$ with length $l$, and the Skip-gram(V, $\mathcal{W}_{v_i}$, w) function is defined in Algorithm 8.2, where $\alpha_l$ is the learning rate of stochastic gradient descent.
Note that the parameter update rule $\mathbf{V} = \mathbf{V} - \alpha_l \frac{\partial J}{\partial \mathbf{V}}$ in Skip-gram has a complexity of $O(|V|)$, because the denominator of $P(v_k \mid v_j)$ (as shown in Eq. 8.21) has $|V|$ terms to compute. This complexity is unacceptable for large-scale networks.

Algorithm 8.2 Skip-gram(V, W_{v_i}, w)
for each v_j ∈ W_{v_i} do
    for each v_k ∈ W_{v_i}[j − w : j + w] do
        if v_k ≠ v_j then
            J(V) = −log P(v_k | v_j)
            V = V − α_l ∂J/∂V
        end if
    end for
end for

Table 8.2 Analogy between DeepWalk and word2vec
Method   | Object | Input       | Output
word2vec | Word   | Sentence    | Word embedding
DeepWalk | Vertex | Random walk | Vertex embedding

To address this problem, hierarchical softmax was proposed as a variant of the original softmax function. The core idea is to map the vertices to the leaves of a balanced binary tree, so that predicting a vertex becomes predicting the path from the root to the corresponding leaf. Assume that the path from the root to vertex $v_k$ is denoted by a sequence of tree nodes $b_1, b_2, \ldots, b_{\log |V|}$; then we have

$$\log P(v_k \mid v_j) = \sum_{i=1}^{\log |V|} \log P(b_i \mid v_j). \qquad (8.22)$$

Each binary decision at a tree node can be implemented by a logistic function, so the time complexity reduces from $O(|V|)$ to $O(\log |V|)$. The algorithm can be further accelerated by using Huffman coding to map frequent vertices to tree nodes close to the root. Negative sampling, as used in word2vec, can also replace hierarchical softmax for speedup.
This concludes the introduction of the DeepWalk algorithm. DeepWalk introduces efficient deep learning techniques into network embedding learning; Table 8.2 gives an analogy between DeepWalk and word2vec. DeepWalk outperforms traditional network representation learning methods on vertex classification tasks and is also efficient for large-scale networks. Moreover, the generation of random walks can be generalized to non-random sequences, such as information propagation streams. In the next subsection, we give a detailed proof of the correlation between DeepWalk and matrix factorization.
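The random walk half of DeepWalk is easy to sketch in plain Python; the adjacency-list format and the parameter defaults below are illustrative choices, not the reference implementation. The resulting corpus of walks can then be fed to any Skip-gram trainer (for example, gensim's Word2Vec with the skip-gram and hierarchical softmax options enabled; exact parameter names vary across library versions).

import random

def random_walk(adj, start, length):
    """Generate one truncated random walk as a list of vertex ids."""
    walk = [start]
    for _ in range(length - 1):
        neighbors = adj[walk[-1]]
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return walk

def deepwalk_corpus(adj, walks_per_vertex=10, walk_length=40, seed=0):
    """adj: dict mapping each vertex id to a list of its neighbors."""
    random.seed(seed)
    corpus, vertices = [], list(adj)
    for _ in range(walks_per_vertex):
        # One pass over all vertices per outer iteration, as in Algorithm 8.1;
        # shuffling the visiting order is a common (optional) refinement.
        random.shuffle(vertices)
        for v in vertices:
            corpus.append(random_walk(adj, v, walk_length))
    return corpus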

8.2.2.1 Matrix Factorization Comprehension of DeepWalk

Perozzi et al. introduced the Skip-gram model into the study of social networks for the first time and designed an algorithm named DeepWalk [93] for learning vertex representations on a graph. In this subsection, we prove that the DeepWalk algorithm with the Skip-gram and softmax model is actually factorizing a matrix $M$ in which each entry $M_{ij}$ is the logarithm of the average probability that vertex $v_i$ randomly walks to vertex $v_j$ in a fixed number of steps. We will make this precise below.
Since the Skip-gram model does not consider the offset of context vertices and predicts context vertices independently, we can regard the random walks as a multiset of vertex-context pairs: the useful information in random walks is the co-occurrence of vertex pairs inside a window. Given a network $G = (V, E)$, suppose that the vertex-context set $D$ is generated from random walks, where each element of $D$ is a vertex-context pair $(v, c)$. Let $V$ be the set of vertices and $V_C$ be the set of context vertices; in most cases, $V = V_C$.
Consider a vertex-context pair $(v, c)$. Let $N_{(v,c)}$ denote the number of times $(v, c)$ appears in $D$, and let $N_v = \sum_{c' \in V_C} N_{(v,c')}$ and $N_c = \sum_{v' \in V} N_{(v',c)}$ denote the number of times $v$ and $c$ appear in $D$, respectively. Note that $|D| = \sum_{v' \in V} \sum_{c' \in V_C} N_{(v',c')}$. Each vertex $v \in V$ is represented by a $d$-dimensional vector $\mathbf{v} \in \mathbb{R}^d$, and $\mathbf{V}$ is a $|V| \times d$ matrix whose $i$th row is $\mathbf{v}_i$; similarly, each context vertex $c \in V_C$ is represented by a $d$-dimensional vector $\mathbf{c} \in \mathbb{R}^d$, and $\mathbf{C}$ is a $|V_C| \times d$ matrix whose $j$th row is $\mathbf{c}_j$. Our goal is to characterize the matrix $M = \mathbf{V}\mathbf{C}^\top$.
Perozzi et al. implemented the DeepWalk algorithm with the Skip-gram and hierarchical softmax model; hierarchical softmax is a variant of softmax that speeds up training. In this subsection, we give proofs for both negative sampling and softmax with the Skip-gram model.
Negative sampling approximately maximizes the softmax probability by randomly choosing $k$ negative samples from the context set. Levy and Goldberg showed that Skip-gram with negative sampling (SGNS) is implicitly factorizing a word-context matrix [69], assuming that the dimensionality $d$ is sufficiently large, i.e., that each product $\mathbf{v} \cdot \mathbf{c}$ can be assigned a value independent of the others. In the SGNS model, we have

$$P((v, c) \in D) = \text{Sigmoid}(\mathbf{v} \cdot \mathbf{c}) = \frac{1}{1 + e^{-\mathbf{v} \cdot \mathbf{c}}}. \qquad (8.23)$$

Suppose we choose $k$ negative samples for each vertex-context pair $(v, c)$ according to the distribution $P_D(c_N) = \frac{N_{c_N}}{|D|}$. Then the objective function of SGNS can be written as
$$\begin{aligned}
\mathcal{O} &= \sum_{v \in V} \sum_{c \in V_C} N_{(v,c)} \big( \log \text{Sigmoid}(\mathbf{v} \cdot \mathbf{c}) + k \, \mathbb{E}_{c_N \sim P_D} [\log \text{Sigmoid}(-\mathbf{v} \cdot \mathbf{c}_N)] \big) \\
&= \sum_{v \in V} \sum_{c \in V_C} N_{(v,c)} \log \text{Sigmoid}(\mathbf{v} \cdot \mathbf{c}) + k \sum_{v \in V} N_v \sum_{c_N \in V_C} \frac{N_{c_N}}{|D|} \log \text{Sigmoid}(-\mathbf{v} \cdot \mathbf{c}_N) \\
&= \sum_{v \in V} \sum_{c \in V_C} N_{(v,c)} \log \text{Sigmoid}(\mathbf{v} \cdot \mathbf{c}) + k \sum_{v \in V} \sum_{c \in V_C} N_v \frac{N_c}{|D|} \log \text{Sigmoid}(-\mathbf{v} \cdot \mathbf{c}).
\end{aligned} \qquad (8.24)$$
Denote $x = \mathbf{v} \cdot \mathbf{c}$. By solving $\frac{\partial \mathcal{O}}{\partial x} = 0$, we have

$$\mathbf{v} \cdot \mathbf{c} = x = \log \frac{N_{(v,c)} |D|}{N_v N_c} - \log k. \qquad (8.25)$$

Thus we have $M_{ij} = \log \frac{N_{(v_i, c_j)} |D|}{N_{v_i} N_{c_j}} - \log k$, which can be interpreted as the Pointwise Mutual Information (PMI) of the vertex-context pair $(v_i, c_j)$ shifted by $\log k$.
Since both negative sampling and hierarchical softmax are variants of softmax, we now pay more attention to the softmax model and discuss it further. We again assume that the values of $\mathbf{v} \cdot \mathbf{c}$ are independent. In the softmax model,
$$P((v, c) \in D) = \frac{e^{\mathbf{v} \cdot \mathbf{c}}}{\sum_{c' \in V_C} e^{\mathbf{v} \cdot \mathbf{c}'}}. \qquad (8.26)$$

The objective function is then
$$\mathcal{O} = \sum_{v \in V} \sum_{c \in V_C} N_{(v,c)} \log \frac{e^{\mathbf{v} \cdot \mathbf{c}}}{\sum_{c' \in V_C} e^{\mathbf{v} \cdot \mathbf{c}'}}. \qquad (8.27)$$

After extracting all terms associated with $\mathbf{v} \cdot \mathbf{c}$ as $\mathcal{O}(v, c)$, we have
$$\mathcal{O}(v, c) = N_{(v,c)} \log \frac{e^{\mathbf{v} \cdot \mathbf{c}}}{\sum_{c' \in V_C, c' \neq c} e^{\mathbf{v} \cdot \mathbf{c}'} + e^{\mathbf{v} \cdot \mathbf{c}}} + \sum_{\tilde{c} \in V_C, \tilde{c} \neq c} N_{(v,\tilde{c})} \log \frac{e^{\mathbf{v} \cdot \tilde{\mathbf{c}}}}{\sum_{c' \in V_C, c' \neq c} e^{\mathbf{v} \cdot \mathbf{c}'} + e^{\mathbf{v} \cdot \mathbf{c}}}. \qquad (8.28)$$
Note that $\mathcal{O} = \frac{1}{|V_C|} \sum_{v \in V} \sum_{c \in V_C} \mathcal{O}(v, c)$. Denote $x = \mathbf{v} \cdot \mathbf{c}$. By solving $\frac{\partial \mathcal{O}}{\partial x} = 0$ for all such $x$, we have
$$\mathbf{v} \cdot \mathbf{c} = x = \log \frac{N_{(v,c)}}{N_v} + b_v, \qquad (8.29)$$
where $b_v$ can be any real constant since it will be canceled when we compute $P((v, c) \in D)$. Thus, we have $M_{ij} = \log \frac{N_{(v_i, c_j)}}{N_{v_i}} + b_{v_i}$. We will discuss what $M_{ij}$ represents next.

It is clear that the method of sampling vertex-context pairs, i.e., the random walk generation, affects the matrix $M$. We now discuss $\frac{N_v}{|D|}$, $\frac{N_c}{|D|}$, and $\frac{N_{(v,c)}}{N_v}$ based on an ideal sampling method for the DeepWalk algorithm. Assume the graph is connected and undirected, and the window size is $w$. The sampling algorithm is illustrated in Algorithm 8.3. This sampling method can easily be generalized to directed graphs by only adding $(\mathcal{W}_i, \mathcal{W}_j)$ into $D$.

Algorithm 8.3 Ideal vertex-context pair sampling algorithm
Generate an infinitely long random walk W.
Denote W_i as the vertex at position i of W, where i = 0, 1, 2, ...
for i = 0, 1, 2, ... do
    for j ∈ [i + 1, i + w] do
        add (W_i, W_j) into D
        add (W_j, W_i) into D
    end for
end for

Each appearance of vertex $v_i$ is recorded $2w$ times in $D$ for an undirected graph and $w$ times for a directed graph. Thus, $\frac{N_{v_i}}{|D|}$ is the frequency with which $v_i$ appears in the random walk, which is exactly the PageRank value of $v_i$. Also note that $\frac{N_{(v_i, v_j)}}{N_{v_i}/2w}$ is the expected number of times that $v_j$ is observed among the left/right $w$ neighbors of $v_i$.
Denote the transition matrix of the PageRank algorithm as $\mathbf{P}$. More formally, let $\deg(v_i)$ be the degree of vertex $v_i$; then $P_{ij} = \frac{1}{\deg(v_i)}$ if $(v_i, v_j) \in E$ and $P_{ij} = 0$ otherwise. We use $\mathbf{e}_i$ to denote the $|V|$-dimensional row vector whose entries are all zero except that the $i$th entry is 1.
Suppose that we start a random walk from vertex $v_i$ and use $\mathbf{e}_i$ to denote the initial state. Then $\mathbf{e}_i \mathbf{P}$ is the distribution over all vertices whose $j$th entry is the probability that vertex $v_i$ walks to vertex $v_j$ in one step, and the $j$th entry of $\mathbf{e}_i \mathbf{P}^w$ is the probability that $v_i$ walks to $v_j$ in exactly $w$ steps. Thus $[\mathbf{e}_i (\mathbf{P} + \mathbf{P}^2 + \cdots + \mathbf{P}^w)]_j$ is the expected number of times that $v_j$ appears among the right $w$ neighbors of $v_i$. Hence
$$\frac{N_{(v_i, v_j)}}{N_{v_i}/2w} = 2 \, [\mathbf{e}_i (\mathbf{P} + \mathbf{P}^2 + \cdots + \mathbf{P}^w)]_j, \qquad \frac{N_{(v_i, v_j)}}{N_{v_i}} = \frac{[\mathbf{e}_i (\mathbf{P} + \mathbf{P}^2 + \cdots + \mathbf{P}^w)]_j}{w}. \qquad (8.30)$$

This equality also holds for a directed graph.
By setting $b_{v_i} = \log 2w$ for all $i$, $M_{ij} = \log \frac{N_{(v_i, v_j)}}{N_{v_i}/2w}$ is the logarithm of the expected number of times that $v_j$ appears among the left/right $w$ neighbors of $v_i$.
By setting $b_{v_i} = 0$ for all $i$, $M_{ij} = \log \frac{N_{(v_i, v_j)}}{N_{v_i}} = \log \frac{[\mathbf{e}_i (\mathbf{P} + \mathbf{P}^2 + \cdots + \mathbf{P}^w)]_j}{w}$ is the logarithm of the average probability that vertex $v_i$ randomly walks to vertex $v_j$ within $w$ steps.

8.2.2.2 Discussion

So far we have seen several different network representation learning algorithms, and we can identify some patterns that these methods share. We now examine how these patterns match some recent network embedding algorithms.
Most network representation algorithms try to reconstruct, with vertex embeddings, a data matrix generated from the graph. The simplest such matrix is the adjacency matrix. However, recovering the adjacency matrix may not be the best choice. First, real-world networks are mostly very sparse, i.e., $O(|E|) = O(|V|)$, so the adjacency matrix is very sparse as well. Though the sparseness enables efficient algorithms, it can harm the performance of vertex representation learning because of the deficiency of useful information. Second, the adjacency matrix can be noisy and sensitive: a single missing link can completely change the correlation between two vertices. Hence, people seek an alternative matrix to replace the adjacency matrix, though often implicitly. Taking DeepWalk as an example, the matrix factorization view above shows that DeepWalk models the following matrix (up to an element-wise logarithm and a constant factor $1/w$):

$$\mathbf{M} = \mathbf{P} + \mathbf{P}^2 + \cdots + \mathbf{P}^w, \qquad (8.31)$$
where
$$P_{ij} = \begin{cases} 1/\deg(v_i) & \text{if } (v_i, v_j) \in E, \\ 0 & \text{otherwise.} \end{cases} \qquad (8.32)$$
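A small NumPy sketch of the matrix modeled here, including the 1/w averaging that appears in the derivation above (the element-wise logarithm and the handling of zero entries are left out for brevity):

import numpy as np

def deepwalk_matrix(A, window):
    """Return (P + P^2 + ... + P^w) / w for a dense adjacency matrix A (cf. Eq. 8.31)."""
    P = A / np.maximum(A.sum(axis=1, keepdims=True), 1)   # row-normalized transition matrix
    M, Pk = np.zeros_like(P, dtype=float), np.eye(len(A))
    for _ in range(window):
        Pk = Pk @ P        # k-step transition probabilities
        M += Pk
    return M / window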

Compared with the adjacency matrix $\mathbf{A}$, the matrix $\mathbf{M}$ modeled by DeepWalk is much denser. Furthermore, the window size $w$ adjusts the density: a larger window size models a denser matrix but slows down the algorithm. Hence, the window size $w$ works as a harmonic factor that balances efficiency and effectiveness. The matrix $\mathbf{M}$ can also alleviate the noise in the adjacency matrix: for two similar vertices $v_i$ and $v_j$, even if the edge between them is missing, they can still have many co-occurrences by appearing within the same window of the same random walks.
In real-world applications, direct computation of $\mathbf{M}$ may have a high time complexity as the window size $w$ grows, so it is essential to choose a proper $w$. However, $w$ is a discrete parameter, and the matrix $\mathbf{M}$ may jump from too sparse to too dense when $w$ changes by 1. Here we can see another benefit of random walks: the random walks used by DeepWalk serve as Monte Carlo simulations that approximate the matrix $\mathbf{M}$, and the more random walks we sample, the better the approximation.
After choosing a matrix to model, we need to relate each matrix entry to a pair of vertex representations. There are two widely used measurements for vertex pairs: Euclidean distance and inner product. Assuming that we want to model the entry $M_{ij}$ given vertex representations $\mathbf{v}_i$ and $\mathbf{v}_j$, we can employ

$$M_{ij} = f(\|\mathbf{v}_i - \mathbf{v}_j\|_2) \quad \text{or} \quad M_{ij} = f(\mathbf{v}_i \cdot \mathbf{v}_j), \qquad (8.33)$$
where $f$ can be any reasonable matching function, such as a sigmoid or linear function, for our purpose. In practice, the inner product $\mathbf{v}_i \cdot \mathbf{v}_j$ is used more widely and corresponds to matrix factorization methods.
The next phase is to design a proper loss function between $M_{ij}$ and $f(\mathbf{v}_i \cdot \mathbf{v}_j)$. Several loss functions, such as the square loss and the hinge loss, can be employed. One can also design a generative model and maximize the likelihood of the matrix $\mathbf{M}$. The final step of a network representation learning algorithm is parameter learning. The most frequently used method is Stochastic Gradient Descent (SGD); variants of SGD such as AdaGrad and AdaDelta can make the learning phase converge faster.
In the next subsection, we will see some recent network representation learning algorithms that follow DeepWalk. We will find that their models match all the phases above, with innovations in building the matrix $\mathbf{M}$, modifying the function $f$, and changing the loss function.

8.2.3 Matrix Factorization Based Methods

We focus on two network representation learning algorithms, LINE and GraRep [13, 111], in this subsection. Both follow the framework introduced in the last subsection.

8.2.3.1 LINE

Tang et al. [111] proposed a network embedding model named LINE. The LINE algorithm can handle large-scale networks of arbitrary types: (un)directed or weighted. To model the interaction between vertices, LINE models first-order proximity, which is represented by observed links, and second-order proximity, which is determined by shared neighbors rather than direct links between vertices.
Before introducing the details of the algorithm, we can step back and see how the idea works. Modeling first-order proximity, i.e., observed links, amounts to modeling the adjacency matrix. As discussed in the last subsection, the adjacency matrix is usually too sparse. Hence, modeling second-order proximity, i.e., vertices with shared neighbors, serves as complementary information that enriches the adjacency matrix and makes it denser. Enumerating all vertex pairs that have common neighbors is time consuming, so it is necessary to design a sampling phase to handle large-scale networks. The sampling phase works like a Monte Carlo simulation that approximates the ideal matrix.

Now only two questions remain: how to define first-order and second-order proximity, and how to define the loss function. In other words, how do we define $\mathbf{M}$ and the loss function?
The first-order proximity between vertices $u$ and $v$ is defined as the weight $w_{uv}$ of edge $(u, v)$; if there is no edge between $u$ and $v$, their first-order proximity is 0. The second-order proximity between $u$ and $v$ is defined as the similarity between their neighborhoods. Let $\mathbf{p}_u = (w_{u,1}, \ldots, w_{u,|V|})$ denote the first-order proximity between vertex $u$ and all other vertices. Then the second-order proximity between $u$ and $v$ is defined as the similarity between $\mathbf{p}_u$ and $\mathbf{p}_v$; if they have no shared neighbors, their second-order proximity is zero.
We can now introduce the LINE model more specifically. The joint probability between $v_i$ and $v_j$ is
$$p_1(v_i, v_j) = \frac{1}{1 + \exp(-\mathbf{v}_i \cdot \mathbf{v}_j)}, \qquad (8.34)$$
where $\mathbf{v}_i$ and $\mathbf{v}_j$ are $d$-dimensional row vectors that represent vertices $v_i$ and $v_j$. To supervise these probabilities, the empirical probability is defined as $\hat{p}_1(i, j) = \frac{w_{ij}}{W}$, where $W = \sum_{(v_i, v_j) \in E} w_{ij}$. Thus our goal is to find vertex embeddings that approximate $\frac{w_{ij}}{W}$ with $\frac{1}{1 + \exp(-\mathbf{v}_i \cdot \mathbf{v}_j)}$. Following the idea of the last subsection, this is equivalent to setting $\mathbf{v}_i \cdot \mathbf{v}_j = M_{ij} = -\log(\frac{W}{w_{ij}} - 1)$. The loss function between the joint probability $p_1$ and its empirical counterpart $\hat{p}_1$ is

$$\mathcal{L}_1 = D_{KL}(\hat{p}_1 \,\|\, p_1), \qquad (8.35)$$
where $D_{KL}(\cdot \,\|\, \cdot)$ is the KL-divergence between two probability distributions. On the other hand, we define the probability that vertex $v_j$ appears in $v_i$'s context as

$$p_2(v_j \mid v_i) = \frac{\exp(\mathbf{c}_j \cdot \mathbf{v}_i)}{\sum_{k=1}^{|V|} \exp(\mathbf{c}_k \cdot \mathbf{v}_i)}. \qquad (8.36)$$

Similarly, the empirical probability is defined as $\hat{p}_2(v_j \mid v_i) = \frac{w_{ij}}{d_i}$, where $d_i = \sum_k w_{ik}$, and the loss function is
$$\mathcal{L}_2 = \sum_i d_i \, D_{KL}(\hat{p}_2(\cdot \mid v_i) \,\|\, p_2(\cdot \mid v_i)). \qquad (8.37)$$

The first-order and second-order proximity embeddings are trained separately and concatenated after the training phase to form the final vertex representations.
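As an illustration of how the first-order objective can be trained in practice, the toy sketch below performs edge-by-edge SGD with the common negative-sampling approximation rather than the exact KL-divergence objective; the function and parameter names are made up for this example. For instance, line_first_order([(0, 1), (1, 2)], [1.0, 1.0], n_vertices=3) returns a 3 x 64 embedding matrix.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def line_first_order(edges, weights, n_vertices, dim=64, lr=0.025, neg=5, epochs=10, seed=0):
    """Toy trainer for LINE's first-order proximity with negative sampling."""
    rng = np.random.default_rng(seed)
    V = (rng.random((n_vertices, dim)) - 0.5) / dim
    for _ in range(epochs):
        for (i, j), w in zip(edges, weights):
            # Positive edge: pull v_i and v_j together.
            g = w * (1.0 - sigmoid(V[i] @ V[j]))
            gi, gj = lr * g * V[j], lr * g * V[i]
            V[i] += gi
            V[j] += gj
            # Negative samples: push v_i away from randomly drawn vertices.
            for k in rng.integers(0, n_vertices, size=neg):
                g = w * sigmoid(V[i] @ V[k])
                gi, gk = -lr * g * V[k], -lr * g * V[i]
                V[i] += gi
                V[k] += gk
    return V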

8.2.3.2 GraRep

We now turn to another network representation learning algorithm, GraRep, which directly follows the matrix factorization view of DeepWalk. Recall that we proved DeepWalk is actually factorizing a matrix $\mathbf{M}$ with $\mathbf{M} = \log \frac{\mathbf{P} + \mathbf{P}^2 + \cdots + \mathbf{P}^w}{w}$. The GraRep algorithm can be divided into three steps:
• Compute the $k$-step transition probability matrix $\mathbf{P}^k$ for each $k = 1, 2, \ldots, K$.
• Compute the $k$-step representations.
• Concatenate all $k$-step representations.
GraRep uses a simple idea in the second step, namely an SVD decomposition of $\mathbf{P}^k$, to obtain the embeddings. As $K$ grows, the modeled matrix becomes denser and thus yields better representations. However, the algorithm is not very efficient, especially when $K$ gets large.
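A simplified sketch of the three steps, using a plain SVD of each k-step transition matrix as described above (the published GraRep additionally applies a log-probability transform and clips negative entries before the SVD, which is omitted here):

import numpy as np

def grarep(A, K=4, d=32):
    """A: dense adjacency matrix; returns a |V| x (K*d) matrix of concatenated k-step embeddings."""
    P = A / np.maximum(A.sum(axis=1, keepdims=True), 1)   # 1-step transition matrix
    Pk, blocks = np.eye(len(A)), []
    for _ in range(K):
        Pk = Pk @ P                                       # k-step transition matrix
        U, S, _ = np.linalg.svd(Pk)
        blocks.append(U[:, :d] * np.sqrt(S[:d]))          # rank-d factor as the k-step representation
    return np.concatenate(blocks, axis=1)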

8.2.4 Structural Deep Network Methods

Different from the previous methods, which use shallow models to characterize network representations, Structural Deep Network Embedding (SDNE) [125] employs a deeper neural model to capture the nonlinearity between vertex embeddings. As shown in Fig. 8.2, the whole model can be divided into two parts: (1) the first part is supervised by Laplacian Eigenmaps and models the first-order proximity; (2) the second part is an unsupervised deep neural autoencoder that characterizes the second-order proximity. Finally, the algorithm takes the intermediate layer used by the supervised part as the network representation.
First, we give a brief introduction to the deep neural autoencoder. A neural autoencoder requires the output vector to be as similar as possible to the input vector. Generally, the output cannot be identical to the input because the dimension of the intermediate layers of the autoencoder is much smaller than that of the input and output layers. That is, a deep autoencoder first compresses the input into a low-dimensional intermediate vector and then tries to reconstruct the original input from this intermediate vector. Once the deep autoencoder is trained, the intermediate layer is an excellent low-dimensional representation of the original input, since the input vector can be recovered from it. More formally, we assume the input vector is $\mathbf{x}_i$. Then the hidden representation of each layer is defined as

$$\begin{aligned}
\mathbf{y}_i^{(1)} &= \text{Sigmoid}(\mathbf{W}^{(1)} \mathbf{x}_i + \mathbf{b}^{(1)}), \\
\mathbf{y}_i^{(k)} &= \text{Sigmoid}(\mathbf{W}^{(k)} \mathbf{y}_i^{(k-1)} + \mathbf{b}^{(k)}), \quad k = 2, 3, \ldots,
\end{aligned} \qquad (8.38)$$


Fig. 8.2 The architecture of the structural deep network embedding model
where $\mathbf{W}^{(k)}$ and $\mathbf{b}^{(k)}$ are the weight matrix and bias vector of the $k$th layer. We assume that the hidden representation of the $K$th layer has the minimum dimension. After obtaining $\mathbf{y}_i^{(K)}$, we obtain the output $\hat{\mathbf{x}}_i$ by reversing the calculation process. The optimization objective of the autoencoder is then to minimize the difference between the input vector $\mathbf{x}_i$ and the output vector $\hat{\mathbf{x}}_i$:

$$\mathcal{L}(\mathbf{W}, \mathbf{b}) = \sum_{i=1}^{n} \|\hat{\mathbf{x}}_i - \mathbf{x}_i\|^2, \qquad (8.39)$$
where $n$ is the number of input instances.
Returning to the network representation problem, SDNE applies the autoencoder to every vertex. The input vector $\mathbf{x}_i$ of each vertex $v_i$ is defined as follows: if vertices $v_i$ and $v_j$ are connected, then the $j$th entry $x_{ij} > 0$; otherwise $x_{ij} = 0$. For an unweighted graph, $x_{ij} = 1$ if $(v_i, v_j) \in E$. The intermediate layer $\mathbf{y}_i^{(K)}$ can then be seen as the low-dimensional representation of vertex $v_i$. Also note that input vectors contain many more zero entries than positive entries due to the sparsity of real-world networks; therefore, the loss on positive entries should be emphasized, and the final optimization objective of second-order proximity modeling can be written as

$$\mathcal{L}_{2nd} = \sum_{i=1}^{|V|} \|(\hat{\mathbf{x}}_i - \mathbf{x}_i) \odot \mathbf{b}_i\|^2, \qquad (8.40)$$
where $\odot$ denotes element-wise multiplication, $b_{ij} = 1$ if $x_{ij} = 0$, and $b_{ij} = \beta > 1$ if $x_{ij} > 0$.
We have introduced the unsupervised part modeled by the deep autoencoder; we now turn to the supervised part. The supervised part simply requires that the representations of connected vertices be close to each other. Thus, the loss function of this part is
$$\mathcal{L}_{1st} = \sum_{i,j=1}^{|V|} x_{ij} \|\mathbf{y}_i^{(K)} - \mathbf{y}_j^{(K)}\|^2. \qquad (8.41)$$

Finally, the overall loss function, including a regularization term, is

$$\mathcal{L} = \mathcal{L}_{2nd} + \alpha \mathcal{L}_{1st} + \lambda \mathcal{L}_{reg}, \qquad (8.42)$$
where $\alpha$ and $\lambda$ are harmonic hyperparameters and the regularization loss $\mathcal{L}_{reg}$ is the sum of the squares of all parameters. The model can be optimized by back-propagation in the standard neural network way. After training, $\mathbf{y}_i^{(K)}$ is taken as the representation of vertex $v_i$.
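A minimal PyTorch sketch of the autoencoder part and the reweighted reconstruction loss of Eq. 8.40; the layer sizes and the beta value are illustrative assumptions, and the first-order (Laplacian Eigenmaps) term and the regularizer of Eq. 8.42 would be added to this loss during training.

import torch
import torch.nn as nn

class SDNEAutoencoder(nn.Module):
    """Second-order part of SDNE: encode a vertex's adjacency vector and reconstruct it."""
    def __init__(self, n_vertices, hidden=256, dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_vertices, hidden), nn.Sigmoid(),
                                     nn.Linear(hidden, dim), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(dim, hidden), nn.Sigmoid(),
                                     nn.Linear(hidden, n_vertices), nn.Sigmoid())

    def forward(self, x):
        y = self.encoder(x)            # y^(K): the vertex representation
        return y, self.decoder(y)      # (representation, reconstruction)

def second_order_loss(x, x_hat, beta=5.0):
    # Reweight nonzero input entries by beta > 1, as in Eq. (8.40).
    b = torch.ones_like(x) + (beta - 1.0) * (x > 0).float()
    return (((x_hat - x) * b) ** 2).sum()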

8.2.5 Extensions

8.2.5.1 Network Representation with Internal Information

Asymmetric Transitivity Preserving Network Representation. Existing network representation learning algorithms mostly focus on undirected graphs. Most of them cannot handle directed graphs well because they do not accurately characterize the asymmetric property. High-Order Proximity preserved Embedding (HOPE) [89] is proposed to preserve the high-order proximities of large-scale graphs and capture asymmetric transitivity. The algorithm further derives a general formulation that covers multiple popular high-order proximity measurements and provides an approximate algorithm with an upper bound on the RMSE (Root Mean Squared Error).
Network embedding assumes that the more and the shorter the paths from $v_i$ to $v_j$, the more similar their representation vectors should be. In particular, the algorithm assigns two vectors, a source vector and a target vector, to each vertex. We denote the adjacency matrix as $\mathbf{A}$ and the vertex representations as $\mathbf{U} = [\mathbf{U}^s, \mathbf{U}^t]$, where $\mathbf{U}^s \in \mathbb{R}^{|V| \times d}$ and $\mathbf{U}^t \in \mathbb{R}^{|V| \times d}$ are the source and target vertex embeddings, respectively. We define a high-order proximity matrix $\mathbf{S}$, where $S_{ij}$ is the proximity between $v_i$ and $v_j$. Our goal is then to approximate the matrix $\mathbf{S}$ by the product of $\mathbf{U}^s$ and $\mathbf{U}^t$. The optimization objective can be written as
$$\min_{\mathbf{U}^s, \mathbf{U}^t} \ \|\mathbf{S} - \mathbf{U}^s (\mathbf{U}^t)^\top\|_F^2. \qquad (8.43)$$

Many high-order proximity measurements that characterize asymmetric transitivity share a general formulation which can be used to approximate the proximities:
$$\mathbf{S} = \mathbf{M}_g^{-1} \mathbf{M}_l, \qquad (8.44)$$
where $\mathbf{M}_g$ and $\mathbf{M}_l$ are both matrix polynomials. We now take three commonly used high-order proximity measurements to illustrate this formula.
• Katz Index. The Katz Index is a weighted summation over the set of paths between two vertices. It can be computed recurrently:

$$\mathbf{S} := \beta \mathbf{A} \mathbf{S} + \beta \mathbf{A}, \qquad (8.45)$$

where the decay parameter $\beta$ determines how fast the weight decreases as the path length grows.
• Rooted PageRank. For rooted PageRank, $S_{ij}$ is the probability that a random walk from vertex $v_i$ will be located at $v_j$ in the stationary state. The formula can be written as

$$\mathbf{S} := \alpha \mathbf{S} \mathbf{P} + (1 - \alpha) \mathbf{I}, \qquad (8.46)$$

where $1 - \alpha$ is the probability that the random walk restarts at its start point and $\mathbf{P}$ is the transition matrix.
• Common Neighbors. Here $S_{ij}$ is the number of vertices that are the target of an edge from $v_i$ and the source of an edge to $v_j$. The matrix $\mathbf{S}$ can be expressed as

$$\mathbf{S} = \mathbf{A}^2. \qquad (8.47)$$
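For a toy graph, all three proximity matrices have closed forms that can be computed directly, as in the dense NumPy sketch below (HOPE itself avoids ever forming S explicitly; for Katz, beta must be smaller than the reciprocal of the spectral radius of A for the series to converge):

import numpy as np

def katz_proximity(A, beta=0.05):
    """S = (I - beta*A)^{-1} beta*A, the closed form of Eq. (8.45)."""
    n = len(A)
    return np.linalg.solve(np.eye(n) - beta * A, beta * A)

def rooted_pagerank(A, alpha=0.85):
    """S = (1 - alpha) (I - alpha*P)^{-1}, the closed form of Eq. (8.46)."""
    P = A / np.maximum(A.sum(axis=1, keepdims=True), 1)
    n = len(A)
    return (1 - alpha) * np.linalg.inv(np.eye(n) - alpha * P)

def common_neighbors(A):
    """S = A^2, as in Eq. (8.47)."""
    return A @ A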

For the three high-order proximity measurements introduced above, we summarize their equivalent forms $\mathbf{S} = \mathbf{M}_g^{-1} \mathbf{M}_l$ in Table 8.3.
A simple way to approximate $\mathbf{S}$ by a product of matrices is SVD decomposition. However, directly computing the SVD of the matrix $\mathbf{S}$ has a complexity of $O(|V|^3)$. By writing $\mathbf{S}$ as $\mathbf{M}_g^{-1} \mathbf{M}_l$, we do not need to compute $\mathbf{S}$ explicitly; instead, we can apply a JDGSVD decomposition to $\mathbf{M}_g$ and $\mathbf{M}_l$ independently and use their results to derive the decomposition of $\mathbf{S}$. The complexity reduces to $O(|E| d^2)$ per JDGSVD iteration.

Table 8.3 General formula for high-order proximity measurements

Measurement      | M_g     | M_l
Katz index       | I − βA  | βA
Rooted PageRank  | I − αP  | (1 − α)I
Common neighbors | I       | A^2

Community Preserving Network Representation. While previous methods aim at preserving the microscopic structure of a network, such as first- and second-order proximities, Wang et al. [127] proposed Modularized Nonnegative Matrix Factorization (M-NMF), which encodes the mesoscopic community structure into the network representations. The basic idea is to take the modularity as part of the optimization function. Recall that the modularity is formulated in Eq. 8.15; here we denote the (relaxed) community indicator matrix by $\mathbf{S}$, so the modularity part of the loss amounts to minimizing $-\operatorname{tr}(\mathbf{S}^\top \mathbf{B} \mathbf{S})$.
Similar to previous methods, M-NMF also factorizes an affinity matrix which encodes first-order and second-order proximities. Specifically, M-NMF takes the adjacency matrix $\mathbf{A}$ as the first-order proximity matrix $\mathbf{A}_1$ and computes the cosine similarity between corresponding rows of $\mathbf{A}$ as the second-order proximity matrix $\mathbf{A}_2$. M-NMF uses a mixture of $\mathbf{A}_1$ and $\mathbf{A}_2$ as the similarity matrix. Altogether, the overall optimization function of M-NMF is
$$\min_{\mathbf{M}, \mathbf{U}, \mathbf{S}, \mathbf{C}} \ \|\mathbf{A}_1 + \eta \mathbf{A}_2 - \mathbf{M}\mathbf{U}^\top\|_F^2 + \alpha \|\mathbf{S} - \mathbf{U}\mathbf{C}^\top\|_F^2 - \beta \operatorname{tr}(\mathbf{S}^\top \mathbf{B} \mathbf{S}), \qquad (8.48)$$

where $\mathbf{S} \in \mathbb{R}^{|V| \times k}$, $\mathbf{M}, \mathbf{U} \in \mathbb{R}^{|V| \times m}$, $\mathbf{C} \in \mathbb{R}^{k \times m}$, $M_{ij}, U_{ij}, S_{ij}, C_{ij} \geq 0$ for all $i, j$, $\operatorname{tr}(\mathbf{S}^\top \mathbf{S}) = |V|$, and $\alpha, \beta, \eta > 0$ are harmonic hyperparameters. The subscript $F$ denotes the Frobenius norm. Here, the similarity matrix $\mathbf{A}_1 + \eta \mathbf{A}_2$ is factorized into two nonnegative matrices $\mathbf{M}$ and $\mathbf{U}$, and the community representation matrix $\mathbf{C}$ in the second term bridges the matrix factorization part and the modularity part.
A concurrent algorithm, Community-enhanced NRL (CNRL) [116, 117], is a pipeline algorithm that first learns a node-community assignment and then modifies the DeepWalk algorithm to incorporate community information. Specifically, in the first phase, CNRL draws an analogy between community detection and topic modeling: it generates random walks and feeds these vertex sequences into the Latent Dirichlet Allocation (LDA) algorithm. By treating a vertex as a word and a topic as a community, CNRL obtains a soft assignment of vertex-community membership. In the second phase, both the embedding of a center vertex and the embedding of its community are used to predict the neighboring vertices in the random walk sequences. An illustration is shown in Fig. 8.3.

8.2.5.2 Network Representation with External Information

Network Representation with Text Information. We now present the network embedding algorithm TADW, which further generalizes the matrix factorization framework to take advantage of text information. Text-Associated DeepWalk (TADW) [136] incorporates the text features of vertices into network representation learning under the framework of matrix factorization; the matrix factorization view of DeepWalk is what enables the introduction of text information. Figure 8.4 shows the main idea of TADW: factorize the vertex affinity matrix $\mathbf{M} \in \mathbb{R}^{|V| \times |V|}$ into the product of three matrices, $\mathbf{W} \in \mathbb{R}^{k \times |V|}$, $\mathbf{H} \in \mathbb{R}^{k \times f_t}$, and the text feature matrix $\mathbf{T} \in \mathbb{R}^{f_t \times |V|}$. TADW then concatenates $\mathbf{W}$ and $\mathbf{H}\mathbf{T}$ as the $2k$-dimensional representations of vertices.


Fig. 8.3 The architecture of community preserving network embedding model

Fig. 8.4 The architecture of text-associated DeepWalk model (M ≈ W⊤ × H × T)


The remaining questions are how to build the vertex affinity matrix $\mathbf{M}$ and how to extract the text feature matrix $\mathbf{T}$ from the text information. Following the matrix factorization proof for DeepWalk, TADW sets the vertex affinity matrix as a tradeoff between speed and accuracy: it factorizes $\mathbf{M} = (\mathbf{A} + \mathbf{A}^2)/2$, where $\mathbf{A}$ is the row-normalized adjacency matrix. For the text feature matrix $\mathbf{T}$, TADW first constructs the TF-IDF matrix from the text and then reduces its dimension to 200 via SVD decomposition.
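One straightforward way to obtain the text feature matrix T is sketched below with scikit-learn; the vectorizer settings are illustrative, the sketch assumes the vocabulary size exceeds the target dimension, and the returned matrix is the transpose of T as defined above (one row per vertex).

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def text_features(docs, dim=200):
    """docs[i] is the raw text attached to vertex i; returns a |V| x dim feature matrix."""
    tfidf = TfidfVectorizer().fit_transform(docs)                 # |V| x vocabulary TF-IDF matrix
    return TruncatedSVD(n_components=dim).fit_transform(tfidf)    # SVD-based reduction to dim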


Fig. 8.5 The architecture of TransNet model

Formally, TADW minimizes the following optimization function:
$$\min_{\mathbf{W}, \mathbf{H}} \ \|\mathbf{M} - \mathbf{W}^\top \mathbf{H} \mathbf{T}\|_F^2 + \frac{\lambda}{2} (\|\mathbf{W}\|_F^2 + \|\mathbf{H}\|_F^2), \qquad (8.49)$$
where $\lambda$ is the regularization factor. The parameters are optimized by updating $\mathbf{W}$ and $\mathbf{H}$ iteratively via conjugate gradient descent.
TransNet. Most existing NRL methods neglect the semantic information of edges and simplify an edge to a binary or continuous value. The TransNet algorithm [119] instead considers the label information on edges rather than on nodes. In particular, TransNet is based on the translation mechanism shown in Fig. 8.5. In the TransNet setting, each edge carries a number of binary labels. The loss function of TransNet consists of two parts: one part is the translation loss, which measures the distance between $\mathbf{u} + \mathbf{e}$ and $\mathbf{v}$, where $\mathbf{u}$, $\mathbf{e}$, and $\mathbf{v}$ stand for the embeddings of the head vertex, the edge, and the tail vertex; the other part is the reconstruction loss of an autoencoder, which encodes the labels of an edge into its embedding $\mathbf{e}$ and restores the labels from the embedding. After the learning phase, we can compute an edge embedding by subtracting the two vertex embeddings and use the decoder part of the autoencoder to predict the labels of an unobserved edge.

Semi-supervised Network Representation. In this part, we introduce several semi-supervised network representation learning methods, including methods for heterogeneous networks. All of them learn vertex embeddings and classification labels simultaneously.
(1) LSHM. The first algorithm, LSHM (Latent Space Heterogeneous Model) [52], follows the manifold assumption that two connected nodes tend to have similar embeddings. Thus, the regularization loss which forces connected nodes to have similar representations can be formulated as
$$\sum_{i,j} w_{ij} \|\mathbf{v}_i - \mathbf{v}_j\|^2, \qquad (8.50)$$
where $w_{ij}$ is the weight of edge $(v_i, v_j)$. As a semi-supervised representation learning algorithm, LSHM also needs to predict the classification labels of unlabeled vertices. To train the classifiers, LSHM computes the loss over observed labels as
$$\sum_i \Delta(f_\theta(\mathbf{v}_i), y_i), \qquad (8.51)$$
where $f_\theta(\mathbf{v}_i)$ is the predicted label of vertex $v_i$, $y_i$ is the observed label of $v_i$, and $\Delta(\cdot, \cdot)$ is the loss between the predicted label and the ground truth label. Specifically, $f_\theta(\cdot)$ is a linear function and $\Delta(\cdot, \cdot)$ is set to the hinge loss. Finally, the objective function is
$$\mathcal{L}(\mathbf{V}, \theta) = \sum_i \Delta(f_\theta(\mathbf{v}_i), y_i) + \lambda \sum_{i,j} w_{ij} \|\mathbf{v}_i - \mathbf{v}_j\|^2, \qquad (8.52)$$
where $\lambda$ is a harmonic hyperparameter. The algorithm is optimized via stochastic gradient descent.
(2) node2vec. Node2vec [38] modifies DeepWalk by changing the generation of random walks. As shown in the previous subsections, DeepWalk generates rooted random walks by choosing the next vertex according to a uniform distribution, which could be improved by a well-designed random walk generation strategy.
Node2vec first considers two extreme cases of vertex visiting sequences: Breadth-First Search (BFS) and Depth-First Search (DFS). By restricting the search to nearby nodes, BFS characterizes the nearby neighborhoods of center vertices and obtains a microscopic view of the neighborhood of every node. Vertices in the neighborhoods sampled by BFS tend to repeat many times, which reduces the variance in characterizing the distribution of neighboring vertices of the source node. In contrast, the nodes sampled by DFS reflect a macro-view of the neighborhood, which is essential for inferring communities based on homophily.
Node2vec designs a neighborhood sampling strategy that smoothly interpolates between BFS and DFS. More specifically, consider a random walk that has just traversed edge $(t, v)$ and now stays at vertex $v$. The walk evaluates the transition probability of each edge $(v, x)$ to decide the next step. Node2vec sets the unnormalized transition probability to $\pi_{vx} = \alpha_{pq}(t, x) \cdot w_{vx}$, where
$$\alpha_{pq}(t, x) = \begin{cases} \frac{1}{p} & \text{if } d_{tx} = 0, \\ 1 & \text{if } d_{tx} = 1, \\ \frac{1}{q} & \text{if } d_{tx} = 2, \end{cases} \qquad (8.53)$$
and $d_{tx}$ denotes the shortest path distance between vertices $t$ and $x$. The parameters $p$ and $q$ guide the random walk and control how fast the walk explores and leaves the neighborhood of the starting vertex: a low $p$ increases the probability of revisiting a vertex and makes the random walk focus on local neighborhoods, while a low $q$ encourages the random walk to explore more distant vertices. After the random walks are generated, the rest of the algorithm is almost the same as DeepWalk.
(3) MMDW. Max-Margin DeepWalk (MMDW) [118] utilizes the max-margin strategy of SVMs to generalize the DeepWalk algorithm for semi-supervised learning.
Specifically, MMDW employs the matrix factorization form of DeepWalk proved in TADW [136] and further adds a max-margin constraint which requires that the embeddings of nodes with different labels be far from each other. The optimization function can be written as

$$\min_{\mathbf{X}, \mathbf{Y}, \mathbf{W}, \xi} \mathcal{L} = \min_{\mathbf{X}, \mathbf{Y}, \mathbf{W}, \xi} \mathcal{L}_{DW} + \frac{1}{2} \|\mathbf{W}\|^2 + C \sum_{i=1}^{T} \xi_i, \qquad (8.54)$$
$$\text{s.t.} \quad \mathbf{w}_{l_i}^\top \mathbf{x}_i - \mathbf{w}_j^\top \mathbf{x}_i \geq e_i^j - \xi_i, \quad \forall i, j,$$

where $\mathbf{W} = [\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_m]$ is the weight matrix of the SVM, $\xi$ contains the slack variables, $e_i^j = 1$ if $l_i = j$ and $e_i^j = 0$ otherwise, and $\mathcal{L}_{DW}$ is the matrix-factorization-form DeepWalk loss function
$$\mathcal{L}_{DW} = \|\mathbf{M} - \mathbf{X}\mathbf{Y}\|_2^2 + \frac{\lambda}{2} (\|\mathbf{X}\|_2^2 + \|\mathbf{Y}\|_2^2), \qquad (8.55)$$
introduced in the previous sections. Figure 8.6 shows the visualization results of the DeepWalk and MMDW algorithms on the Wiki dataset [103]. We can see that the embeddings of nodes from different classes become more separable with the help of semi-supervised max-margin representation learning.
(4) PTE. Another algorithm, PTE (Predictive Text Embedding) [110], focuses on text networks, such as a bibliography network where a paper is a vertex and the citation relationships between papers form the edges. PTE considers the network structure together with plain text and observed vertex labels, and proposes a semi-supervised framework to learn vertex representations and predict unobserved vertex labels.

Fig. 8.6 A visualization of t-SNE 2D representations on Wiki dataset (left: DeepWalk, right: MMDW) [118]

A text network is divided into three bipartite networks: the word-word, word-document, and word-label networks. We introduce the three networks in more detail. For the word-word network, the weight $w_{ij}$ of the edge between words $v_i$ and $v_j$ is defined as the number of times the two words co-occur in the same context window. For the word-document network, the weight $w_{ij}$ between word $v_i$ and document $d_j$ is defined as the number of times $v_i$ appears in document $d_j$. For the word-label network, the weight $w_{ij}$ of the edge between word $v_i$ and class $c_j$ is defined as $w_{ij} = \sum_{d: l_d = j} n_{di}$, where $n_{di}$ is the term frequency of word $v_i$ in document $d$ and $l_d$ is the class label of document $d$. Then, following the previous work LINE, given a bipartite network $G = (V_A \cup V_B, E)$, the conditional probability of generating $v_i \in V_A$ from $v_j \in V_B$ is defined as

$$P(v_i \mid v_j) = \frac{\exp(\mathbf{v}_i \cdot \mathbf{v}_j)}{\sum_{v_k \in V_A} \exp(\mathbf{v}_k \cdot \mathbf{v}_j)}. \qquad (8.56)$$

Similar to the LINE model, the loss function is defined as the KL-divergence between the empirical distribution and the conditional distribution. The optimization objective can be further formulated as
$$\mathcal{L} = -\sum_{(v_i, v_j) \in E} w_{ij} \log P(v_i \mid v_j). \qquad (8.57)$$

Then the final objective can be obtained by summing all three bipartite networks:

$$\mathcal{L}_{pte} = \mathcal{L}_{ww} + \mathcal{L}_{wd} + \mathcal{L}_{wl}, \quad (8.58)$$

where

$$\mathcal{L}_{ww} = -\sum_{(v_i, v_j)\in E_{ww}} w_{ij}\log P(v_i|v_j), \quad (8.59)$$

$$\mathcal{L}_{wd} = -\sum_{(v_i, d_j)\in E_{wd}} w_{ij}\log P(v_i|d_j), \quad (8.60)$$

$$\mathcal{L}_{wl} = -\sum_{(v_i, l_j)\in E_{wl}} w_{ij}\log P(v_i|l_j). \quad (8.61)$$

Then the optimization can be done by stochastic gradient descent.
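To make the bipartite objective of PTE more concrete, the following is a minimal NumPy sketch (not the authors' implementation) of the conditional probability in Eq. 8.56 and the weighted negative log-likelihood of Eqs. 8.57-8.61 for a single bipartite network. The variable names (word_emb, doc_emb, edges, weights) and the toy data are illustrative assumptions.

import numpy as np

def conditional_prob(emb_a, emb_b, j):
    """Softmax of dot products between v_j (in V_B) and every candidate vertex in V_A (Eq. 8.56)."""
    scores = emb_a @ emb_b[j]            # dot product with each candidate vertex
    scores -= scores.max()               # numerical stability
    p = np.exp(scores)
    return p / p.sum()

def bipartite_loss(emb_a, emb_b, edges, weights):
    """Weighted negative log-likelihood over one bipartite network (Eqs. 8.57 and 8.59-8.61)."""
    loss = 0.0
    for (i, j), w in zip(edges, weights):
        loss -= w * np.log(conditional_prob(emb_a, emb_b, j)[i] + 1e-12)
    return loss

# Toy word-document network: 4 words, 2 documents, 16-dim embeddings.
rng = np.random.default_rng(0)
word_emb, doc_emb = rng.normal(size=(4, 16)), rng.normal(size=(2, 16))
edges = [(0, 0), (1, 0), (2, 1), (3, 1)]   # (word index, document index)
weights = [3.0, 1.0, 2.0, 5.0]             # co-occurrence counts
print(bipartite_loss(word_emb, doc_emb, edges, weights))

In practice, the three bipartite losses of Eq. 8.58 would be summed and minimized by stochastic gradient descent with negative sampling, as in LINE.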

8.2.5.3 Task-Specific Network Representation

Network Representation for Community Detection. As shown in spectral clustering methods, people make efforts to learn a community indicator matrix based on modularity and the normalized graph cut. The continuous community indicator matrix can be seen as a k-dimensional vertex representation, where k is the number of communities. Note that modularity and the graph cut are defined for nonoverlapping communities. By altering the cost function for overlapping communities, the idea can also work for overlapping community detection. In this subsection, we will introduce several community detection algorithms. These community detection algorithms start by learning a k-dimensional nonnegative vertex-community affinity matrix and then derive a hard community assignment for vertices based on the matrix. Therefore, the key procedure of these algorithms can be regarded as unsupervised k-dimensional nonnegative vertex embedding learning.

BIGCLAM [140] is an overlapping community detection method. It assumes that matrix $F \in \mathbb{R}^{|V|\times k}$ is the user-community affinity matrix, where $F_{vc}$ is the strength between vertex $v$ and community $c$. Matrix $F$ is nonnegative and $F_{vc} = 0$ indicates no affiliation. BIGCLAM builds a generative model by modeling the probability that vertex $v_i$ connects $v_j$ given the user-community affinity matrix $F$. More specifically, given matrix $F$, BIGCLAM generates an edge between vertices $v_i$ and $v_j$ with probability

$$P(v_i, v_j) = 1 - \exp(-F_{v_i}\cdot F_{v_j}), \quad (8.62)$$

where $F_{v_i}$ is the corresponding row of matrix $F$ for vertex $v_i$ and can be seen as the representation of $v_i$. Note that the probability $P(v_i, v_j)$ increases with $F_{v_i}\cdot F_{v_j} = \sum_c F_{v_i,c}F_{v_j,c}$, which indicates that the more communities a pair of nodes share, the more likely they are connected.

For the case that $F_{v_i}\cdot F_{v_j} = 0$, BIGCLAM adds a background probability $\frac{2|E|}{|V|(|V|-1)}$ to the pair of nodes to avoid a zero probability. Then BIGCLAM tries to maximize the log-likelihood of the graph $G = (V, E)$:

$$O(F) = \sum_{i,j:(v_i,v_j)\in E} \log P(v_i, v_j) + \sum_{i,j:(v_i,v_j)\notin E} \log\big(1 - P(v_i, v_j)\big), \quad (8.63)$$

which can be reformulated as

$$O(F) = \sum_{i,j:(v_i,v_j)\in E} \log\big(1 - \exp(-F_{v_i}\cdot F_{v_j})\big) - \sum_{i,j:(v_i,v_j)\notin E} F_{v_i}\cdot F_{v_j}. \quad (8.64)$$

The parameters $F$ are learned by projected gradient descent. Note that the training objective can be regarded as a variant of nonnegative matrix factorization: the maximization of the log-likelihood function is an approximation of the adjacency matrix $A$ by $FF^\top$. Compared with an L2-norm loss function, the gradient of Eq. 8.64 can be computed more efficiently for a sparse matrix $A$, which is usually the case in real-world datasets. The model can also be generalized to the asymmetric case [141], that is, to replace Eq. 8.62 by

$$P(v_i, v_j) = 1 - \exp(-F_{v_i}\cdot H_{v_j}), \quad (8.65)$$

where $H$ is another matrix that has the same size as the matrix $F$. The generative model can also consider attributes of vertices by adding attribute terms to Eq. 8.62 [79].
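The following is a minimal sketch of the BIGCLAM log-likelihood of Eqs. 8.62-8.64 for a toy graph, assuming a small epsilon stands in for the background probability described above; the function names and the toy affinity matrix are illustrative, not the original implementation.

import numpy as np

def edge_prob(F, i, j, eps=1e-8):
    """P(v_i, v_j) = 1 - exp(-F_i . F_j)  (Eq. 8.62); eps plays the role of the background probability."""
    return 1.0 - np.exp(-F[i] @ F[j]) + eps

def log_likelihood(F, edges, num_nodes):
    """O(F) from Eqs. 8.63-8.64: sum over observed edges and non-edges."""
    edge_set = set(map(tuple, edges))
    ll = 0.0
    for i in range(num_nodes):
        for j in range(i + 1, num_nodes):
            if (i, j) in edge_set or (j, i) in edge_set:
                ll += np.log(edge_prob(F, i, j))       # log(1 - exp(-F_i . F_j))
            else:
                ll += -F[i] @ F[j]                     # log(1 - P) = -F_i . F_j
    return ll

# Toy graph with 4 nodes and 2 communities; F is the nonnegative node-community affinity matrix.
F = np.abs(np.random.default_rng(0).normal(size=(4, 2)))
edges = [(0, 1), (1, 2), (2, 3)]
print(log_likelihood(F, edges, num_nodes=4))

Projected gradient descent would repeatedly take a gradient step on F and clip negative entries to zero, exploiting the sparsity of A as discussed above.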

8.2.5.4 Network Representation for Visualization

Different from the previous algorithms that focus on machine learning tasks, the algorithms introduced in this subsection are designed for visualization. Since networks are a commonly used data structure, their visualization is an important task. The dimension of the vertex representations is usually 2 or 3 so that the graph can be drawn. Representation learning for network visualization generally follows the following aesthetic criteria [30]:

• Distribute the vertices evenly in the frame.
• Minimize edge crossings.
• Make edge lengths uniform.
• Reflect inherent symmetry.
• Conform to the frame.

Following these criteria, graph visualization algorithms build a force-directed graph drawing framework. The basic assumption is that there is a spring between each pair of vertices. Then the optimization objective is to minimize the energy of the graph according to Hooke's law:

$$E = \frac{1}{2}\sum_{i,j} k_{ij}\big(\|\mathbf{v}_i - \mathbf{v}_j\| - l_{ij}\big)^2, \quad (8.66)$$

where $k_{ij}$ is the spring constant, $\mathbf{v}_i$ is the position of vertex $v_i$, and $l_{ij}$ is the length of the shortest path between vertices $v_i$ and $v_j$. The intuition is straightforward: close vertices should have close positions in the drawing. Several algorithms have been proposed to improve this framework [34, 54, 60] by changing the setting of the spring constant $k_{ij}$ or the energy function. The parameters can be easily learned via gradient descent.
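As a concrete illustration, the sketch below minimizes the spring energy of Eq. 8.66 by plain gradient descent for a toy path graph; the uniform spring constants, step size, and iteration count are illustrative choices, not prescribed by the force-directed framework itself.

import numpy as np

def layout_energy_grad(pos, k, l):
    """Gradient of E = 1/2 * sum_ij k_ij (||v_i - v_j|| - l_ij)^2  (Eq. 8.66) w.r.t. positions."""
    n = pos.shape[0]
    grad = np.zeros_like(pos)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            diff = pos[i] - pos[j]
            dist = np.linalg.norm(diff) + 1e-9
            grad[i] += k[i, j] * (dist - l[i, j]) * diff / dist
    return grad

# Toy 4-node path graph: l_ij is the shortest-path distance, spring constants are uniform.
l = np.array([[0, 1, 2, 3], [1, 0, 1, 2], [2, 1, 0, 1], [3, 2, 1, 0]], dtype=float)
k = np.ones((4, 4))
pos = np.random.default_rng(0).normal(size=(4, 2))   # 2-D positions for drawing
for _ in range(500):                                  # plain gradient descent
    pos -= 0.01 * layout_energy_grad(pos, k, l)
print(pos)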

8.2.5.5 Embedding Enhancement via High-Order Proximity Approximation

Yang et al. [137] summarize several existing NRL methods into a unified two-step framework, including proximity matrix construction and dimension reduction. They conclude that an NRL method can be improved by exploring higher order proximities when building the proximity matrix. Then they propose the Network Embedding Update (NEU) algorithm, which implicitly approximates higher order proximities with a theoretical approximation bound and can be applied to any NRL method to enhance its performance. NEU can make a consistent and significant improvement over some NRL methods with almost negligible running time.

The two-step framework is summarized as follows:

Step 1: Proximity Matrix Construction. Compute a proximity matrix $M \in \mathbb{R}^{|V|\times|V|}$, which encodes the information of the $k$-order proximity matrices for $k = 1, \ldots, K$. For example, $M = \frac{1}{K}A + \frac{1}{K}A^2 + \cdots + \frac{1}{K}A^K$ stands for an average combination of the $k$-order proximity matrices for $k = 1, 2, \ldots, K$. The proximity matrix $M$ is usually represented by a polynomial of the normalized adjacency matrix $A$ of degree $K$, and we denote the polynomial as $f(A) \in \mathbb{R}^{|V|\times|V|}$. Here the degree $K$ of the polynomial $f(A)$ corresponds to the maximum order of proximities encoded in the proximity matrix. Note that the storage and computation of the proximity matrix $M$ does not necessarily take $O(|V|^2)$ time because we only need to save and compute the nonzero entries.

Step 2: Dimension Reduction. Find a network embedding matrix $V \in \mathbb{R}^{|V|\times d}$ and a context embedding $C \in \mathbb{R}^{|V|\times d}$ so that the product $VC^\top$ approximates the proximity matrix $M$. Here different algorithms may employ different distance functions to minimize the distance between $M$ and $VC^\top$. For example, we can naturally use the norm of the matrix $M - VC^\top$ to measure the distance and minimize it.

Spectral Clustering, DeepWalk, and GraRep can be formalized into the two-step framework. Now we focus on the first step and study how to define the right proximity matrix for NRL. We summarize the comparisons among Spectral Clustering (SC), DeepWalk, and GraRep in Table 8.4 and conclude the following observations.

Table 8.4 Comparisons among three NRL methods

                     SC          DeepWalk                    GraRep
  Proximity matrix   $L$         $\sum_{k=1}^{K} A^k / K$    $A^k,\ k = 1, \ldots, K$
  Computation        Accurate    Approximate                 Accurate
  Scalability        Yes         Yes                         No
  Performance        Low         Middle                      High

Observation 8.1 Modeling a higher order and accurate proximity matrix can improve the quality of network representation. In other words, NRL can benefit from exploring a polynomial proximity matrix $f(A)$ of a higher degree. From the development of NRL methods, it can be seen that DeepWalk outperforms Spectral Clustering because DeepWalk considers higher order proximity matrices, and the higher order proximity matrices can provide complementary information for lower order proximity matrices. GraRep outperforms DeepWalk because GraRep accurately calculates the k-order proximity matrices rather than approximating them by Monte Carlo simulation as DeepWalk does.

Observation 8.2 Accurate computation of high-order proximity matrices is not feasible for large-scale networks. The major drawback of GraRep is the computational complexity of calculating the accurate k-order proximity matrix. In fact, the computation of the high-order proximity matrix takes $O(|V|^2)$ time, and the time complexity of the SVD decomposition also increases as the k-order proximity matrix gets denser when k grows. In summary, the time complexity of $O(|V|^2)$ is too expensive to handle large-scale networks.

The first observation provides the motivation to explore higher order proximity matrices in NRL models, but the second observation indicates that an accurate inference of higher order proximity matrices is not acceptable. Therefore, how to learn network embeddings from approximate higher order proximity matrices efficiently becomes important. To be more efficient, the network representations which encode the information of lower order proximity matrices can be used as our basis to avoid repeated computations. The problem is formalized below.

Problem Formalization. Assume that we have the normalized adjacency matrix $A$ as the first-order proximity matrix, a network embedding $V$, and a context embedding $C$, where $V, C \in \mathbb{R}^{|V|\times d}$. Suppose that the embeddings $V$ and $C$ are learned by the above NRL framework, which indicates that the product $VC^\top$ approximates a polynomial proximity matrix $f(A)$ of degree $K$. The goal is to learn better representations $V'$ and $C'$ whose product approximates a polynomial proximity matrix $g(A)$ of higher degree than $f(A)$. Also, the algorithm should be efficient, i.e., linear in $|V|$. Note that the lower bound of the time complexity is $O(|V|d)$, which is the size of the embedding matrix. There is a simple, efficient, and effective iterative updating algorithm to solve the above problem.

Method. Given a hyperparameter $\lambda \in (0, \frac{1}{2}]$ and the normalized adjacency matrix $A$, we update $V$ and $C$ as follows:

$$V' = V + \lambda A V, \qquad C' = C + \lambda A C. \quad (8.67)$$

The time complexity of computing $AV$ and $AC$ is $O(|V|d)$ because matrix $A$ is sparse and has $O(|V|)$ nonzero entries. Thus the overall time complexity of one iteration of operation (Eq. 8.67) is $O(|V|d)$. Recall that the product of the previous embeddings $V$ and $C$ approximates a polynomial proximity matrix $f(A)$ of degree $K$. It can be proved that the algorithm can learn better embeddings $V'$ and $C'$, where the product $V'C'^\top$ approximates a polynomial proximity matrix $g(A)$ of degree $K + 2$, bounded by the matrix infinity norm.

Theorem. Denote the network and context embeddings by $V$ and $C$, and suppose that the approximation between $VC^\top$ and the proximity matrix $M = f(A)$ is bounded by $r = \|f(A) - VC^\top\|_\infty$, where $f(\cdot)$ is a polynomial of degree $K$. Then the product of the updated embeddings $V'$ and $C'$ from Eq. 8.67 approximates a polynomial $g(A) = f(A) + 2\lambda A f(A) + \lambda^2 A^2 f(A)$ of degree $K + 2$ with approximation bound $r' = (1 + 2\lambda + \lambda^2)\, r \le \frac{9}{4} r$.

Proof. Assume that $S = f(A) - VC^\top$ and thus $r = \|S\|_\infty$.

$$\begin{aligned}
\|g(A) - V'C'^\top\|_\infty &= \|g(A) - (V + \lambda A V)(C^\top + \lambda C^\top A)\|_\infty \\
&= \|g(A) - VC^\top - \lambda A V C^\top - \lambda V C^\top A - \lambda^2 A V C^\top A\|_\infty \\
&= \|S + \lambda A S + \lambda S A + \lambda^2 A S A\|_\infty \quad (8.68)\\
&\le \|S\|_\infty + \lambda\|A\|_\infty\|S\|_\infty + \lambda\|S\|_\infty\|A\|_\infty + \lambda^2\|S\|_\infty\|A\|_\infty^2 \\
&= r + 2\lambda r + \lambda^2 r,
\end{aligned}$$

where the second-to-last equality replaces $g(A)$ and $f(A) - VC^\top$ by the definitions of $g(A)$ and $S$, and the last equality uses the fact that $\|A\|_\infty = \max_i \sum_j |A_{ij}| = 1$ because the summation of each row of $A$ equals 1. In the experimental settings, it is assumed that the weight of lower order proximities should be larger than that of higher order proximities because they are more directly related to the original network. Therefore, given $g(A) = f(A) + 2\lambda A f(A) + \lambda^2 A^2 f(A)$, we have $1 \ge 2\lambda \ge \lambda^2 > 0$, which indicates that $\lambda \in (0, \frac{1}{2}]$. The proof indicates that the updated embeddings can implicitly approximate a polynomial $g(A)$ of 2 more degrees within $\frac{9}{4}$ times the matrix infinity norm bound of the previous embeddings.

Algorithm. The update Eq. 8.67 can be further generalized in two directions. First, we can update embeddings $V$ and $C$ according to Eq. 8.69:

$$V' = V + \lambda_1 A V + \lambda_2 A(AV), \qquad C' = C + \lambda_1 A C + \lambda_2 A(AC). \quad (8.69)$$

The time complexity is still $O(|V|d)$, but Eq. 8.69 can obtain a higher order proximity matrix approximation than Eq. 8.67 in one iteration. More complex update formulas that explore even higher proximities than Eq. 8.69 can also be applied, but Eq. 8.69 is used in the current experiments as a cost-effective choice. The other direction is that the update equation can be processed for $T$ rounds to obtain a higher proximity approximation. However, the approximation bound would grow exponentially as the number of rounds $T$ grows, and thus the update cannot be done infinitely. Note that the update operations of $V$ and $C$ are completely independent. Therefore, only updating the network embedding $V$ is enough for NRL. The above algorithm (NEU) avoids an accurate computation of the high-order proximity matrix but can yield network embeddings that actually approximate high-order proximities. Hence, this algorithm can improve the quality of network embeddings efficiently. Intuitively, Eqs. 8.67 and 8.69 allow the learned embeddings to further propagate to their neighbors. Hence, the proximities of longer distances between vertices will be embedded.
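The NEU update of Eqs. 8.67 and 8.69 amounts to a few sparse matrix-vector products. Below is a minimal sketch assuming a row-normalized sparse adjacency matrix and base embeddings from any NRL method; the lambda values (0.5 and 0.25) and the toy triangle graph are illustrative.

import numpy as np
from scipy.sparse import csr_matrix

def neu_update(A_norm, V, C, lam1=0.5, lam2=0.25, rounds=1):
    """NEU update (Eq. 8.69): V' = V + lam1*A*V + lam2*A*(A*V), and likewise for C.
    A_norm is the (sparse) normalized adjacency matrix; one round costs O(|V|d)."""
    for _ in range(rounds):
        AV = A_norm @ V
        V = V + lam1 * AV + lam2 * (A_norm @ AV)
        AC = A_norm @ C
        C = C + lam1 * AC + lam2 * (A_norm @ AC)
    return V, C

# Toy triangle graph with row-normalized adjacency and 8-dim base embeddings
# (in practice V and C would come from DeepWalk, LINE, etc.).
A = csr_matrix(np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float))
deg = np.asarray(A.sum(axis=1)).ravel()
A_norm = csr_matrix(A.multiply(1.0 / deg[:, None]))
rng = np.random.default_rng(0)
V, C = rng.normal(size=(3, 8)), rng.normal(size=(3, 8))
V_new, C_new = neu_update(A_norm, V, C)
print(V_new.shape, C_new.shape)

As noted above, only the network embedding V actually needs to be updated for downstream NRL tasks.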

8.2.6 Applications

In this part, we will introduce common applications for network representation learning and their evaluation metrics.

8.2.6.1 Multi-label Classification

A multi-label classification task is the most widely used evaluation task for network representation learning. The representations of vertices are considered as vertex features and fed into classifiers to predict vertex labels. More formally, we assume that there are $K$ labels in total. The vertex-label relationship can be expressed as a binary matrix $M \in \{0, 1\}^{|V|\times K}$, where $M_{ij} = 1$ indicates that vertex $v_i$ has the $j$th label and $M_{ij} = 0$ otherwise. Specifically, for the multiclass classification problem, each vertex has exactly one label, which means there is only one "1" in each row of matrix $M$. For the evaluation task, we set a training ratio which indicates what percentage of vertices have observed labels. Then our goal is to predict the labels for the vertices in the test set.

For unsupervised network representation learning algorithms, the labels of the training set are not used for embedding learning. The network representations are fed to classifiers like SVM or logistic regression, and each label has its own classifier. Semi-supervised learning methods take the observed vertex labels into account in the representation learning period; these algorithms have their specific classifiers for label prediction.

Once the label prediction is done, we can compute the evaluation metrics. For multiclass classification, we assume that the number of correctly predicted vertices is $|V_r|$. Then the classification accuracy is defined as the ratio of correctly predicted vertices, which can be formulated as $|V_r|/|V|$. For multi-label classification, precision, recall, and F1 are the most popular metrics, which are computed as follows:

$$\text{Precision} = \frac{N_{\text{correctly predicted labels}}}{N_{\text{predicted labels}}}, \quad \text{Recall} = \frac{N_{\text{correctly predicted labels}}}{N_{\text{unobserved labels}}}, \quad \text{F1-Score} = \frac{2\,\text{Precision}\times\text{Recall}}{\text{Precision}+\text{Recall}}. \quad (8.70)$$
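The metrics of Eq. 8.70 can be computed directly from binary label matrices, as in the short sketch below; the toy matrices and function name are illustrative.

import numpy as np

def multilabel_prf(Y_true, Y_pred):
    """Precision, recall, and F1 over the unobserved labels (Eq. 8.70).
    Y_true and Y_pred are binary matrices of shape (num_test_vertices, K)."""
    correct = np.logical_and(Y_true == 1, Y_pred == 1).sum()
    precision = correct / max(Y_pred.sum(), 1)
    recall = correct / max(Y_true.sum(), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, f1

# Toy example: 3 test vertices, K = 4 labels.
Y_true = np.array([[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 1]])
Y_pred = np.array([[1, 0, 0, 0], [0, 1, 1, 0], [1, 0, 0, 1]])
print(multilabel_prf(Y_true, Y_pred))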

8.2.6.2 Link Prediction

Link prediction is another important evaluation task for network representation learning because a good network embedding should have the ability to model the affinity between vertices. For evaluation, we randomly pick edges as the training set and leave the rest as the test set. Cross-validation can also be employed for training and testing. To make link predictions given the vertex representations, we first need to evaluate the strength of a pair of vertices. The strength between two vertices is evaluated by computing the similarity between their representations. This similarity is usually computed by cosine similarity, inner product, or square loss, depending on the algorithm. For example, if an algorithm uses $\|\mathbf{v}_i - \mathbf{c}_j\|_2^2$ in its objective function, then the square loss should be used to measure the similarity between vertex representations. Then, after we get the similarity of all unobserved links, we can rank them for link prediction. There are two significant metrics for link prediction: the area under the receiver operating characteristic curve (AUC) and precision.

AUC. The AUC value is the probability that a randomly chosen missing link has a higher score than a randomly chosen nonexistent link. For implementation, we randomly select a missing link and a nonexistent link and compare their similarity scores. Assume that among $n$ independent comparisons, there are $n_1$ times that the missing link has a higher score and $n_2$ times that they have the same score. Then the AUC value is

$$\text{AUC} = \frac{n_1 + 0.5\, n_2}{n}. \quad (8.71)$$

Note that for a random network representation, the AUC value should be 0.5. Precision. Given the ranking of all the non-observed links, we predict the links with the top-L highest scores as positive ones. Assume that there are $L_r$ links among them that are indeed missing links; then the precision is defined as $L_r/L$.
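The sampled AUC of Eq. 8.71 is straightforward to implement once a similarity score between vertex pairs is chosen. The sketch below assumes inner-product similarity and toy embeddings; the function names and the number of comparisons are illustrative.

import numpy as np

def sampled_auc(score, missing_links, nonexistent_links, n=10000, seed=0):
    """AUC by sampling (Eq. 8.71): compare a random missing link against a random nonexistent link."""
    rng = np.random.default_rng(seed)
    n1 = n2 = 0
    for _ in range(n):
        i, j = missing_links[rng.integers(len(missing_links))]
        u, v = nonexistent_links[rng.integers(len(nonexistent_links))]
        s_miss, s_none = score(i, j), score(u, v)
        if s_miss > s_none:
            n1 += 1
        elif s_miss == s_none:
            n2 += 1
    return (n1 + 0.5 * n2) / n

# Toy embeddings; similarity measured by the inner product.
emb = np.random.default_rng(1).normal(size=(5, 8))
score = lambda a, b: float(emb[a] @ emb[b])
print(sampled_auc(score, [(0, 1), (2, 3)], [(0, 4), (1, 3), (2, 4)]))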

8.2.6.3 Community Detection

For network representation based community detection algorithms, we first need to convert the nonnegative vertex representation into a hard assignment of communities.

Assume that we have a network representation matrix $V \in \mathbb{R}_+^{|V|\times k}$, where row $i$ of $V$ is the nonnegative embedding of vertex $v_i$. For community detection, we regard each dimension of the embeddings as a community. That is to say, $V_{ij}$ denotes the affinity between vertex $v_i$ and community $c_j$. For each column of matrix $V$, we set a threshold $\Delta$, and the vertices with affinity score higher than the threshold are considered members of the corresponding community. The threshold can be set in various ways. For example, we can set $\Delta$ so that a vertex belongs to a community $c$ if the node is connected to other members of $c$ with an edge probability higher than $1/N$ [140]:

$$\frac{1}{N} \le 1 - \exp(-\Delta^2), \quad (8.72)$$

which indicates that $\Delta = \sqrt{-\log(1 - 1/N)}$. For evaluation metrics, we have two choices: modularity and matching score.

Modularity. Recall that the modularity $Q$ of a graph is defined as

$$Q = \frac{1}{2|E|}\sum_{i,j}\Big(A_{ij} - \frac{\deg(v_i)\deg(v_j)}{2|E|}\Big)\,\delta(v_i, v_j), \quad (8.73)$$

where $\delta(v_i, v_j) = 1$ if $v_i$ and $v_j$ belong to the same community and $\delta(v_i, v_j) = 0$ otherwise. A larger modularity indicates a better community detection algorithm.

Matching Score. This is a more sophisticated evaluation metric for community detection. To compare a set of ground truth communities $C^*$ to a set of detected communities $C$, we first need to match each detected community to the most similar ground truth community. On the other side, we also find the most similar detected community for each ground truth community. Then the final performance is evaluated by the average of both sides:

$$\frac{1}{2|C^*|}\sum_{c_i^*\in C^*}\max_{c_j\in C}\delta(c_i^*, c_j) + \frac{1}{2|C|}\sum_{c_j\in C}\max_{c_i^*\in C^*}\delta(c_i^*, c_j), \quad (8.74)$$

where $\delta(c_i^*, c_j)$ is a similarity measurement between ground truth community $c_i^*$ and detected community $c_j$, such as the Jaccard similarity. The score is between 0 and 1, where 1 indicates a perfect matching of the ground truth communities.
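The sketch below illustrates the whole pipeline: thresholding a nonnegative affinity matrix with the $\Delta$ above and computing the matching score of Eq. 8.74 with Jaccard similarity. The toy affinity matrix and ground truth communities are illustrative assumptions.

import numpy as np

def assign_communities(V, N):
    """Hard assignment: a vertex joins community c if its affinity exceeds Delta = sqrt(-log(1 - 1/N))."""
    delta = np.sqrt(-np.log(1.0 - 1.0 / N))
    return [set(np.nonzero(V[:, c] > delta)[0]) for c in range(V.shape[1])]

def jaccard(a, b):
    return len(a & b) / max(len(a | b), 1)

def matching_score(truth, detected):
    """Average of best matches in both directions (Eq. 8.74), using Jaccard as the similarity."""
    s1 = np.mean([max(jaccard(t, d) for d in detected) for t in truth])
    s2 = np.mean([max(jaccard(t, d) for t in truth) for d in detected])
    return 0.5 * s1 + 0.5 * s2

V = np.abs(np.random.default_rng(0).normal(size=(6, 2)))   # nonnegative vertex-community affinities
detected = assign_communities(V, N=6)
truth = [{0, 1, 2}, {3, 4, 5}]
print(matching_score(truth, detected))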

8.2.6.4 Recommender System

Recommender systems aim at recommending items (e.g., products, movies, or locations) for users and cover a wide range of applications. In many cases, an application comes with an associated social network between users. Now we will present an example to show how to use the idea of network representation for building recommender systems in location-based social networks.


Fig. 8.7 An illustrative example for the data in LBSNs: a Link connections represent the friendship between users. b A trajectory generated by a user is a sequence of chronologically ordered check-in records [138]

The accelerated growth of mobile trajectories in location-based services brings valuable data resources to understand users' moving behaviors. Apart from recording the trajectory data, another major characteristic of these location-based services is that they also allow users to connect with whomever they like or are interested in. As shown in Fig. 8.7, such a combination of social networking and location-based services is called a Location-Based Social Network (LBSN). As shown in [21], locations that are frequently visited by socially related persons tend to be correlated, which indicates the close association between social connections and the trajectory behaviors of users in LBSNs. In order to better analyze and mine LBSN data, we need a comprehensive view to analyze and mine the information from the two aspects, i.e., the social network and the mobile trajectory data.

Specifically, JNTM [138] is proposed to model both social networks and mobile trajectories jointly. The model consists of two components: the construction of social networks and the generation of mobile trajectories. First, JNTM adopts a network embedding method for the construction of social networks, where a network representation can be derived for a user. Second, JNTM considers four factors that influence the generation process of mobile trajectories, namely, user visit preference, influence of friends, short-term sequential contexts, and long-term sequential contexts. Then JNTM uses real-valued representations to encode the four factors and sets two different user representations to model the first two factors: a visit interest representation and a network representation. To characterize the last two contexts, JNTM employs the RNN and GRU models to capture the sequential relatedness in mobile trajectories at different levels, i.e., short term or long term. Finally, the two components are tied by sharing the user network representations. The overall model is illustrated in Fig. 8.8.


Fig. 8.8 The architecture of JNTM model

8.2.6.5 Information Diffusion Prediction

Information diffusion prediction is an important task which studies how information items spread among users. The prediction of information diffusion, also known as cascade prediction, has been studied over a wide range of applications, such as product adoption [67], epidemiology [124], social networks [63], and the spread of news and opinions [68]. As shown in Fig. 8.9, microscopic diffusion prediction aims at guessing the next infected user, while macroscopic diffusion prediction estimates the total number of infected users during the diffusion process. Also, an underlying social graph among users will be available when information diffusion occurs on a social network service. The social graph will be considered as an additional structural input for diffusion prediction.

FOREST [139] is the first work to address both microscopic and macroscopic predictions. As shown in Fig. 8.10, FOREST proposes a structural context extraction algorithm, originally introduced for accelerating graph convolutional networks [41], to build an RNN-based microscopic cascade model.

Fig. 8.9 Illustrative examples for microscopic next infected user prediction (left) and macroscopic cascade size prediction (right) [139]

Fig. 8.10 An illustrative example of structural context extraction of the orange node by neighbor sampling and feature aggregation [139]

For each user $v$, we first sample $Z$ users $\{u_1, u_2, \ldots, u_Z\}$ from $v$ and its neighbors $N(v)$. Then we update its feature vector by aggregating the neighborhood features. The updated user feature vector encodes structural information by aggregating features from $v$'s first-order neighbors. The operation can also be processed recursively to explore a larger neighborhood of user $v$. Empirically, a two-step neighborhood exploration is time-efficient and sufficient to give promising results.

FOREST further incorporates the ability of macroscopic prediction, i.e., estimating the eventual size of a cascade, into the model by reinforcement learning. The method can be divided into four steps: (a) encode the observed K users by a microscopic cascade model; (b) enable the microscopic cascade model to predict the size of a cascade by cascade simulations; (c) use the Mean-Square Log-Transformed Error (MSLE) as the supervision signal for macroscopic predictions; and (d) employ a reinforcement learning framework to update parameters through the policy gradient algorithm. The overall workflow is illustrated in Fig. 8.11.
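The structural context extraction step described above (sample Z users from a node and its neighbors, then aggregate their features) can be sketched as follows. The mean aggregation and the toy social graph are illustrative assumptions; FOREST's exact aggregation function may differ.

import numpy as np

def structural_context(features, neighbors, v, Z=5, seed=0):
    """Sample Z users from v and its neighbors, then aggregate their features (here: mean)."""
    rng = np.random.default_rng(seed)
    pool = [v] + list(neighbors[v])
    sampled = rng.choice(pool, size=Z, replace=True)
    return features[sampled].mean(axis=0)

# Toy social graph with 5 users and 16-dim user features.
features = np.random.default_rng(1).normal(size=(5, 16))
neighbors = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1, 4], 4: [3]}
updated = structural_context(features, neighbors, v=0)
# Applying the same operation again on the updated features gives the two-step neighborhood exploration.
print(updated.shape)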

8.3 Graph Neural Networks

In this section, we will introduce another kind of method for network representation learning, which is called Graph Neural Networks (GNNs) [101]. These methods aim to utilize neural networks to model graph data and have shown their strong capabilities in many applications.


Fig. 8.11 The workflow of adopting microscopic cascade model for macroscopic size prediction by reinforcement learning

8.3.1 Motivations

Graph Neural Networks (GNNs) are deep learning based methods that operate on the graph domain. Due to their convincing performance and high interpretability, GNNs have recently become widely applied graph analysis methods. In this subsection, we will illustrate the fundamental motivations of graph neural networks.

In recent years, CNNs [65] have made breakthroughs in various machine learning areas, especially in the area of computer vision, and started the revolution of deep learning [64]. CNNs are capable of extracting multiscale localized features, and these features are used to generate more expressive representations. As we go deeper into CNNs and graphs, we find the keys of CNNs: local connection, shared weights, and the use of multiple layers [64]. These are also of great importance in solving problems in the graph domain, because (1) graphs are the most typical locally connected structure, (2) shared weights reduce the computational cost compared with traditional spectral graph theory [23], and (3) the multilayer structure is the key to dealing with hierarchical patterns, which captures features of various sizes. However, CNNs can only operate on regular Euclidean data like images (2D grids) and text (1D sequences), while these data structures can be regarded as instances of graphs. Therefore, it is straightforward to think of finding the generalization of CNNs to graphs. As shown in Fig. 8.12, it is hard to define localized convolutional filters and pooling operators, which hinders the transformation of CNNs from the Euclidean domain to the non-Euclidean domain.

The other motivation comes from network embedding [12, 24, 37, 42, 149]. In the field of graph analysis, traditional machine learning approaches usually rely on hand-engineered features and are limited by their inflexibility and high cost. Following the idea of representation learning and the success of word embedding [81], DeepWalk [93], which is regarded as the first graph embedding method based on representation learning, applies the Skip-gram model [81] on the generated random walks.

Fig. 8.12 Left: image in Euclidean space. Right: graph in non-Euclidean space [155]

Similar approaches such as node2vec [38], LINE [111], and TADW [136] also achieved breakthroughs. However, these methods suffer from two severe drawbacks [42]. First, no parameters are shared between nodes in the encoder, which leads to computational inefficiency, since it means the number of parameters grows linearly with the number of nodes. Second, the direct embedding methods lack the ability to generalize, which means they cannot deal with dynamic graphs or generalize to new graphs.

Based on CNNs and network embedding, Graph Neural Networks (GNNs) are proposed to collectively aggregate information from the graph structure. Thus, they can model input and/or output consisting of elements and their dependencies. Further, graph neural networks can simultaneously model the diffusion process on the graph with the RNN kernel. In the rest of this section, we will first introduce several typical variants of graph neural networks such as Graph Convolutional Networks (GCNs), Graph Attention Networks (GATs), and Graph Recurrent Networks (GRNs). Then we will introduce several extensions to the original model, and finally, we will give some examples of applications that utilize graph neural networks.

8.3.2 Graph Convolutional Networks

Graph Convolutional Networks (GCNs) aim to generalize convolutions to the graph domain. Advances in this direction are often categorized as spectral approaches and spatial (nonspectral) approaches.

8.3.2.1 Spectral Approaches

Spectral approaches work with a spectral representation of the graphs.

Spectral Network. Bruna et al. [11] propose the spectral network. The convolution operation is defined in the Fourier domain by computing the eigendecomposition of the graph Laplacian. The operation can be defined as the multiplication of a signal $\mathbf{x} \in \mathbb{R}^N$ (a scalar for each node) with a filter $g_\theta = \text{diag}(\theta)$ parameterized by $\theta \in \mathbb{R}^N$:

$$g_\theta \star \mathbf{x} = U g_\theta(\Lambda) U^\top \mathbf{x}, \quad (8.75)$$

where $U$ is the matrix of eigenvectors of the normalized graph Laplacian $L = I_N - D^{-\frac{1}{2}} A D^{-\frac{1}{2}} = U\Lambda U^\top$ ($D$ is the degree matrix and $A$ is the adjacency matrix of the graph), with $\Lambda$ a diagonal matrix of its eigenvalues. This operation results in potentially intense computations and non-spatially localized filters. Henaff et al. [47] attempt to make the spectral filters spatially localized by introducing a parameterization with smooth coefficients.

ChebNet. Hammond et al. [43] suggest that $g_\theta(\Lambda)$ can be approximated by a truncated expansion in terms of Chebyshev polynomials $T_k(x)$ up to the $K$th order. Thus, the operation is

$$g_\theta \star \mathbf{x} \approx \sum_{k=0}^{K} \theta_k T_k(\tilde{L})\mathbf{x}, \quad (8.76)$$

with $\tilde{L} = \frac{2}{\lambda_{max}} L - I_N$, where $\lambda_{max}$ denotes the largest eigenvalue of $L$. $\theta \in \mathbb{R}^K$ is now a vector of Chebyshev coefficients. The Chebyshev polynomials are defined as $T_k(x) = 2xT_{k-1}(x) - T_{k-2}(x)$, with $T_0(x) = 1$ and $T_1(x) = x$. It can be observed that the operation is $K$-localized since it is a $K$th-order polynomial in the Laplacian. Defferrard et al. [28] propose ChebNet. It uses this $K$-localized convolution to define a convolutional neural network, which removes the need to compute the eigenvectors of the Laplacian.

GCN. Kipf and Welling [59] limit the layer-wise convolution operation to $K = 1$ to alleviate the problem of overfitting on local neighborhood structures for graphs with very wide node degree distributions. It further approximates $\lambda_{max} \approx 2$, and the equation simplifies to

$$g_\theta \star \mathbf{x} \approx \theta_0\mathbf{x} + \theta_1(L - I_N)\mathbf{x} = \theta_0\mathbf{x} - \theta_1 D^{-\frac{1}{2}} A D^{-\frac{1}{2}}\mathbf{x}, \quad (8.77)$$

with two free parameters $\theta_0$ and $\theta_1$. After constraining the number of parameters with $\theta = \theta_0 = -\theta_1$, we can obtain the following expression:

$$g_\theta \star \mathbf{x} \approx \theta\big(I_N + D^{-\frac{1}{2}} A D^{-\frac{1}{2}}\big)\mathbf{x}. \quad (8.78)$$

Noting that stacking this operator could lead to numerical instabilities and exploding/vanishing gradients, [59] introduces the renormalization trick: $I_N + D^{-\frac{1}{2}} A D^{-\frac{1}{2}} \rightarrow \tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}$, with $\tilde{A} = A + I_N$ and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$. Finally, [59] generalizes the definition to a signal $X \in \mathbb{R}^{N\times C}$ with $C$ input channels and $F$ filters for feature maps as follows:

$$H = f\big(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}} X W\big), \quad (8.79)$$

where $W \in \mathbb{R}^{C\times F}$ is a matrix of filter parameters, $H \in \mathbb{R}^{N\times F}$ is the convolved signal matrix, and $f(\cdot)$ is the activation function. The GCN layer can be stacked multiple times, so that we have the equation:

$$H^{(t)} = f\big(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}} H^{(t-1)} W^{(t-1)}\big), \quad (8.80)$$

where the superscripts $t$ and $t-1$ denote the layers of the matrices, and the initial matrix $H^{(0)}$ could be $X$. After $L$ layers, we can use the final embedding matrix $H^{(L)}$ and a readout function to get the final output matrix $Z$:

$$Z = \text{Readout}\big(H^{(L)}\big), \quad (8.81)$$

where the readout function can be any machine learning method, such as an MLP. Finally, as a semi-supervised algorithm, GCN uses the feature matrix at the top layer $Z$, whose dimension equals the total number of labels, to predict the labels of the labeled vertices. The loss function can be written as

$$\mathcal{L} = -\sum_{l\in y_L}\sum_{f} Y_{lf}\ln Z_{lf}, \quad (8.82)$$

where $y_L$ is the set of node indices that have observed labels. Figure 8.13 shows the architecture of GCN.


Fig. 8.13 The architecture of the graph convolutional network model
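As a concrete illustration of Eqs. 8.79-8.80, the sketch below implements the renormalized propagation rule with random weights and a ReLU activation (the activation choice and the toy graph are assumptions, not prescribed by GCN).

import numpy as np

def normalize_adj(A):
    """D^{-1/2} (A + I) D^{-1/2}: the renormalization trick used by GCN."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

def gcn_layer(A_hat, H, W):
    """One GCN layer (Eq. 8.80) with a ReLU nonlinearity."""
    return np.maximum(A_hat @ H @ W, 0.0)

# Toy graph with 4 nodes, 3-dim input features, and two stacked layers.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
X = np.random.default_rng(0).normal(size=(4, 3))
W0 = np.random.default_rng(1).normal(size=(3, 8))
W1 = np.random.default_rng(2).normal(size=(8, 2))
A_hat = normalize_adj(A)
H1 = gcn_layer(A_hat, X, W0)
Z = gcn_layer(A_hat, H1, W1)       # 2-class output scores per node
print(Z.shape)

In a semi-supervised setting, Z would be passed through a softmax and trained with the cross-entropy loss of Eq. 8.82 over the labeled nodes.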

8.3.2.2 Spatial Approaches

In all of the spectral approaches mentioned above, the learned filters depend on the Laplacian eigenbasis, which depends on the graph structure. That is, a model trained on a specific structure cannot be directly applied to a graph with a different structure.

Spatial approaches define convolutions directly on the graph, operating on spatially close neighbors. The major challenge of spatial approaches is defining the convolution operation with differently sized neighborhoods and maintaining the local invariance of CNNs.

Neural FPs. Duvenaud et al. [31] use different weight matrices for nodes with different degrees:

$$\mathbf{x}^{(t)} = \mathbf{h}_v^{(t-1)} + \sum_{i=1}^{|N_v|}\mathbf{h}_i^{(t-1)}, \qquad \mathbf{h}_v^{(t)} = f\big(W_{|N_v|}^{(t)}\mathbf{x}^{(t)}\big), \quad (8.83)$$

where $W_{|N_v|}^{(t)}$ is the weight matrix for nodes with degree $|N_v|$ at layer $t$. The main drawback of the method is that it cannot be applied to large-scale graphs with large node degrees. In the following description of other models, we use $\mathbf{h}_v^{(t)}$ to denote the hidden state of node $v$ at layer $t$, $N_v$ denotes the neighbor set of node $v$, and $|N_v|$ denotes the size of the set.

DCNN. Atwood and Towsley [4] propose Diffusion-Convolutional Neural Networks (DCNNs). Transition matrices are used to define the neighborhood for nodes in DCNN. For node classification, it has

$$H = f\big(W^c \odot \overrightarrow{P} X\big), \quad (8.84)$$

where $\odot$ is the element-wise multiplication and $X$ is an $N\times F$ matrix of input features. $\overrightarrow{P}$ is an $N\times K\times N$ tensor which contains the power series $\{P, P^2, \ldots, P^K\}$ of matrix $P$, and $P$ is the degree-normalized transition matrix derived from the graph's adjacency matrix $A$. Each entity is transformed to a diffusion-convolutional representation, which is a $K\times F$ matrix defined by $K$ hops of graph diffusion over $F$ features. It is then combined with a $K\times F$ weight matrix and a nonlinear activation function $f$. Finally, $H$ (which is $N\times K\times F$) denotes the diffusion representations of each node in the graph.

DGCN. Zhuang and Ma [158] propose the Dual Graph Convolutional Network (DGCN) to consider the local consistency and global consistency of graphs jointly. It uses two convolutional networks to capture the local/global consistency and adopts an unsupervised loss to ensemble them. The first convolutional network is the same as Eq. 8.80, and the second network replaces the adjacency matrix with a Positive Pointwise Mutual Information (PPMI) matrix:

$$H^{(t)} = f\big(D_P^{-\frac{1}{2}} X_P D_P^{-\frac{1}{2}} H^{(t-1)} W\big), \quad (8.85)$$

where $X_P$ is the PPMI matrix and $D_P$ is the diagonal degree matrix of $X_P$.

GraphSAGE. Hamilton et al. [41] propose GraphSAGE, a general inductive framework. The framework generates embeddings by sampling and aggregating features from a node's local neighborhood:

$$\mathbf{h}_{N_v}^{(t)} = \text{AGGREGATE}^{(t)}\big(\{\mathbf{h}_u^{(t-1)}, \forall u \in N_v\}\big), \qquad \mathbf{h}_v^{(t)} = f\big(W^{(t)}[\mathbf{h}_v^{(t-1)}; \mathbf{h}_{N_v}^{(t)}]\big). \quad (8.86)$$

However, [41] does not utilize the full set of neighbors in Eq. 8.86 but a fixed-size set of neighbors obtained by uniform sampling, and suggests three aggregator functions.

• Mean aggregator. It could be viewed as an approximation of the convolutional operation from the transductive GCN framework [59], so that the inductive version of the GCN variant could be derived by

$$\mathbf{h}_v^{(t)} = f\big(W\cdot\text{MEAN}\big(\{\mathbf{h}_v^{(t-1)}\}\cup\{\mathbf{h}_u^{(t-1)} \mid \forall u \in N_v\}\big)\big). \quad (8.87)$$

The mean aggregator is different from the other aggregators because it does not perform the concatenation operation which concatenates $\mathbf{h}_v^{(t-1)}$ and $\mathbf{h}_{N_v}^{(t)}$ in Eq. 8.86. It can be viewed as a form of "skip connection" [46] and can achieve better performance.

• LSTM aggregator. Hamilton et al. [41] also use an LSTM-based aggregator which has a larger expressive capability. However, LSTMs process inputs in a sequential manner, so they are not permutation invariant. Hamilton et al. [41] adapt LSTMs to operate on an unordered set by permuting the node's neighbors.

• Pooling aggregator. In the pooling aggregator, each neighbor's hidden state is fed through a fully connected layer, and then a max-pooling operation is applied to the set of the node's neighbors:

$$\mathbf{h}_{N_v}^{(t)} = \max\big(\{f(W_{pool}\mathbf{h}_u^{(t-1)} + \mathbf{b}), \forall u \in N_v\}\big). \quad (8.88)$$

Note that any symmetric function could be used in place of the max-pooling operation here.

Other methods. There are still many other spatial methods. The PATCHY-SAN model [86] first extracts exactly $k$ nodes for each node and normalizes them; then the convolutional operation is applied to the normalized neighborhood. LGCN [35] leverages CNNs as aggregators. It performs max-pooling on nodes' neighborhood matrices to get the top-$k$ feature elements and then applies a 1-D CNN to compute hidden representations. Monti et al. [82] propose a spatial-domain model (MoNet) on non-Euclidean domains which could generalize several previous techniques.

The Geodesic CNN (GCNN) [78] and Anisotropic CNN (ACNN) [10] on manifolds or GCN [59] and DCNN [4] on graphs could be formulated as particular instances of MoNet. Our readers can refer to their papers for more details.
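The sketch below illustrates one GraphSAGE layer (Eq. 8.86) with the mean aggregator and fixed-size uniform neighbor sampling; the ReLU activation, sample size, and toy graph are illustrative assumptions, not the reference implementation.

import numpy as np

def graphsage_layer(h, neighbors, W, sample_size=3, seed=0):
    """One GraphSAGE layer: sample neighbors, mean-aggregate, concatenate with the node's own state."""
    rng = np.random.default_rng(seed)
    out = np.zeros((h.shape[0], W.shape[1]))
    for v in range(h.shape[0]):
        nbrs = neighbors[v]
        sampled = rng.choice(nbrs, size=min(sample_size, len(nbrs)), replace=False)
        h_nv = h[sampled].mean(axis=0)                               # AGGREGATE (mean over sampled neighbors)
        out[v] = np.maximum(np.concatenate([h[v], h_nv]) @ W, 0.0)   # f(W [h_v ; h_Nv]) with ReLU
    return out

h = np.random.default_rng(1).normal(size=(5, 4))             # initial node features
neighbors = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 4], 3: [1], 4: [2]}
W = np.random.default_rng(2).normal(size=(8, 6))             # maps the concatenated 2*4-dim input to 6 dims
print(graphsage_layer(h, neighbors, W).shape)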

8.3.3 Graph Attention Networks

The attention mechanism has been successfully used in many sequence-based tasks such as machine translation [5, 36, 121], machine reading [19], etc. Many works focus on generalizing the attention mechanism to the graph domain.

GAT. Velickovic et al. [122] propose the Graph Attention Network (GAT), which incorporates the attention mechanism into the propagation step. Specifically, it uses the self-attention strategy, and each node's hidden state is computed by attending over its neighbors. Velickovic et al. [122] define a single graph attentional layer and construct arbitrary graph attention networks by stacking this layer. The layer computes the coefficients in the attention mechanism of the node pair $(i, j)$ by

$$\alpha_{ij} = \frac{\exp\big(\text{LeakyReLU}\big(\mathbf{a}[W\mathbf{h}_i^{(t-1)}; W\mathbf{h}_j^{(t-1)}]\big)\big)}{\sum_{k\in N_i}\exp\big(\text{LeakyReLU}\big(\mathbf{a}[W\mathbf{h}_i^{(t-1)}; W\mathbf{h}_k^{(t-1)}]\big)\big)}, \quad (8.89)$$

where $\alpha_{ij}$ is the attention coefficient of node $j$ to $i$, $W \in \mathbb{R}^{F'\times F}$ is the weight matrix of a shared linear transformation applied to every node, and $\mathbf{a} \in \mathbb{R}^{2F'}$ is the weight vector. The coefficient is normalized by a softmax function, and the LeakyReLU nonlinearity (with negative input slope 0.2) is applied. Then the final output features of each node can be obtained by (after applying a nonlinearity $f$):

$$\mathbf{h}_i^{(t)} = f\Big(\sum_{j\in N_i}\alpha_{ij}W\mathbf{h}_j^{(t-1)}\Big). \quad (8.90)$$

Moreover, the layer utilizes multi-head attention similarly to [121] to stabilize the learning process. It applies K independent attention mechanisms to compute the hidden states and then concatenates their features (or computes the average), resulting in the following two output representations:

$$\mathbf{h}_i^{(t)} = \big\Vert_{k=1}^{K}\, f\Big(\sum_{j\in N_i}\alpha_{ij}^k W^k\mathbf{h}_j^{(t-1)}\Big), \quad (8.91)$$

$$\mathbf{h}_i^{(t)} = f\Big(\frac{1}{K}\sum_{k=1}^{K}\sum_{j\in N_i}\alpha_{ij}^k W^k\mathbf{h}_j^{(t-1)}\Big), \quad (8.92)$$

where $\alpha_{ij}^k$ is the normalized attention coefficient computed by the $k$th attention mechanism and $\Vert$ is the concatenation operation. The attention architecture in [122] has several properties: (1) the computation of the node-neighbor pairs is parallelizable, thus the operation is efficient; (2) it can deal with nodes that have different degrees by assigning reasonable weights to their neighbors; (3) it can be applied to inductive learning problems easily.

GAAN. Besides GAT, the Gated Attention Network (GAAN) [150] also uses the multi-head attention mechanism. However, it uses a self-attention mechanism to gather information from different heads to replace the average operation of GAT.
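The sketch below implements a single-head graph attention layer following Eqs. 8.89-8.90; the tanh output nonlinearity, the self-loops in the neighbor lists, and the toy sizes are illustrative assumptions rather than the original GAT configuration.

import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_layer(h, neighbors, W, a):
    """Single-head GAT layer: attention coefficients (Eq. 8.89) and weighted aggregation (Eq. 8.90)."""
    Wh = h @ W.T                                       # shared linear transformation
    out = np.zeros_like(Wh)
    for i in range(h.shape[0]):
        nbrs = neighbors[i]
        logits = np.array([leaky_relu(a @ np.concatenate([Wh[i], Wh[j]])) for j in nbrs])
        alpha = np.exp(logits - logits.max())
        alpha /= alpha.sum()                           # softmax over the neighborhood
        out[i] = np.tanh(sum(alpha[k] * Wh[j] for k, j in enumerate(nbrs)))   # nonlinearity f = tanh
    return out

h = np.random.default_rng(0).normal(size=(4, 5))       # 4 nodes, F = 5 input features
W = np.random.default_rng(1).normal(size=(6, 5))       # F' = 6 output features
a = np.random.default_rng(2).normal(size=(12,))        # attention vector of size 2F'
neighbors = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2, 3], 3: [2, 3]}
print(gat_layer(h, neighbors, W, a).shape)

Multi-head attention (Eqs. 8.91-8.92) would run several such layers with independent W and a, then concatenate or average their outputs.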

8.3.4 Graph Recurrent Networks

Several works attempt to use a gating mechanism like the GRU [20] or LSTM [48] in the propagation step to alleviate the limitations induced by the vanilla GNN architecture and improve the effectiveness of long-term information propagation across the graph. We call these methods Graph Recurrent Networks (GRNs), and we will introduce some variants of GRNs in this subsection.

GGNN. Li et al. [72] propose the gated graph neural network (GGNN), which uses Gated Recurrent Units (GRU) in the propagation step. It follows the computation steps of recurrent neural networks for a fixed number of L steps, then backpropagates through time to compute gradients. Specifically, the basic recurrence of the propagation model is

$$\begin{aligned}
\mathbf{a}_v^{(t)} &= A_v[\mathbf{h}_1^{(t-1)}\ldots\mathbf{h}_N^{(t-1)}] + \mathbf{b},\\
\mathbf{z}_v^{(t)} &= \text{Sigmoid}\big(W^z\mathbf{a}_v^{(t)} + U^z\mathbf{h}_v^{(t-1)}\big),\\
\mathbf{r}_v^{(t)} &= \text{Sigmoid}\big(W^r\mathbf{a}_v^{(t)} + U^r\mathbf{h}_v^{(t-1)}\big), \quad (8.93)\\
\tilde{\mathbf{h}}_v^{(t)} &= \tanh\big(W\mathbf{a}_v^{(t)} + U(\mathbf{r}_v^{(t)}\odot\mathbf{h}_v^{(t-1)})\big),\\
\mathbf{h}_v^{(t)} &= (1 - \mathbf{z}_v^{(t)})\odot\mathbf{h}_v^{(t-1)} + \mathbf{z}_v^{(t)}\odot\tilde{\mathbf{h}}_v^{(t)}.
\end{aligned}$$

The node $v$ first aggregates messages from its neighbors, where $A_v$ is the sub-matrix of the graph adjacency matrix $A$ and denotes the connections of node $v$ with its neighbors. Then the hidden state of the node is updated by the GRU-like function using the information from its neighbors and the hidden state from the previous timestep. $\mathbf{a}$ gathers the neighborhood information of node $v$, and $\mathbf{z}$ and $\mathbf{r}$ are the update and reset gates. LSTMs are also used similarly to the GRU through the propagation process based on a tree or a graph.

Tree-LSTM. Tai et al. [109] propose two extensions to the basic LSTM architecture: the Child-Sum Tree-LSTM and the N-ary Tree-LSTM. Like in standard LSTM units, each Tree-LSTM unit (indexed by $v$) contains input and output gates $\mathbf{i}_v$ and $\mathbf{o}_v$, a memory cell $\mathbf{c}_v$, and a hidden state $\mathbf{h}_v$. The Tree-LSTM unit replaces the single forget gate by a forget gate $\mathbf{f}_{vk}$ for each child $k$, allowing node $v$ to select information from its children accordingly. The equations of the Child-Sum Tree-LSTM are

$$\begin{aligned}
\tilde{\mathbf{h}}_v^{t-1} &= \sum_{k\in N_v}\mathbf{h}_k^{t-1},\\
\mathbf{i}_v^t &= \text{Sigmoid}\big(W^i\mathbf{x}_v^t + U^i\tilde{\mathbf{h}}_v^{t-1} + \mathbf{b}^i\big),\\
\mathbf{f}_{vk}^t &= \text{Sigmoid}\big(W^f\mathbf{x}_v^t + U^f\mathbf{h}_k^{t-1} + \mathbf{b}^f\big),\\
\mathbf{o}_v^t &= \text{Sigmoid}\big(W^o\mathbf{x}_v^t + U^o\tilde{\mathbf{h}}_v^{t-1} + \mathbf{b}^o\big), \quad (8.94)\\
\mathbf{u}_v^t &= \tanh\big(W^u\mathbf{x}_v^t + U^u\tilde{\mathbf{h}}_v^{t-1} + \mathbf{b}^u\big),\\
\mathbf{c}_v^t &= \mathbf{i}_v^t\odot\mathbf{u}_v^t + \sum_{k\in N_v}\mathbf{f}_{vk}^t\odot\mathbf{c}_k^{t-1},\\
\mathbf{h}_v^t &= \mathbf{o}_v^t\odot\tanh(\mathbf{c}_v^t),
\end{aligned}$$

where $\mathbf{x}_v^t$ is the input vector at time $t$ as in the standard LSTM setting. In a specific case, if the number of children of each node is at most $K$ and these children can be ordered from 1 to $K$, then the N-ary Tree-LSTM can be applied. For node $v$, $\mathbf{h}_{vk}^t$ and $\mathbf{c}_{vk}^t$ denote the hidden state and memory cell of its $k$th child at time $t$, respectively. The transition equations are the following:

$$\begin{aligned}
\mathbf{i}_v^t &= \text{Sigmoid}\Big(W^i\mathbf{x}_v^t + \sum_{l=1}^{K}U_l^i\mathbf{h}_{vl}^{t-1} + \mathbf{b}^i\Big),\\
\mathbf{f}_{vk}^t &= \text{Sigmoid}\Big(W^f\mathbf{x}_v^t + \sum_{l=1}^{K}U_{kl}^f\mathbf{h}_{vl}^{t-1} + \mathbf{b}^f\Big),\\
\mathbf{o}_v^t &= \text{Sigmoid}\Big(W^o\mathbf{x}_v^t + \sum_{l=1}^{K}U_l^o\mathbf{h}_{vl}^{t-1} + \mathbf{b}^o\Big), \quad (8.95)\\
\mathbf{u}_v^t &= \tanh\Big(W^u\mathbf{x}_v^t + \sum_{l=1}^{K}U_l^u\mathbf{h}_{vl}^{t-1} + \mathbf{b}^u\Big),\\
\mathbf{c}_v^t &= \mathbf{i}_v^t\odot\mathbf{u}_v^t + \sum_{l=1}^{K}\mathbf{f}_{vl}^t\odot\mathbf{c}_{vl}^{t-1},\\
\mathbf{h}_v^t &= \mathbf{o}_v^t\odot\tanh(\mathbf{c}_v^t).
\end{aligned}$$

Compared to the Child-Sum Tree-LSTM, the N-ary Tree-LSTM introduces separate parameters for each child $k$. These parameters allow the model to learn more fine-grained representations conditioning on each node's children.

Graph LSTM. The two types of Tree-LSTMs can be easily adapted to graphs. The graph-structured LSTM in [148] is an example of the N-ary Tree-LSTM applied to graphs. However, it is a simplified version, since each node in the graph has at most 2 incoming edges (from its parent and sibling predecessor). Peng et al. [92] propose another variant of the Graph LSTM based on the relation extraction task. The main difference between graphs and trees is that edges of graphs have their own labels, and [92] utilizes different weight matrices to represent different labels:

$$\begin{aligned}
\mathbf{i}_v^t &= \text{Sigmoid}\Big(W^i\mathbf{x}_v^t + \sum_{k\in N_v}U_{m(v,k)}^i\mathbf{h}_k^{t-1} + \mathbf{b}^i\Big),\\
\mathbf{f}_{vk}^t &= \text{Sigmoid}\big(W^f\mathbf{x}_v^t + U_{m(v,k)}^f\mathbf{h}_k^{t-1} + \mathbf{b}^f\big),\\
\mathbf{o}_v^t &= \text{Sigmoid}\Big(W^o\mathbf{x}_v^t + \sum_{k\in N_v}U_{m(v,k)}^o\mathbf{h}_k^{t-1} + \mathbf{b}^o\Big), \quad (8.96)\\
\mathbf{u}_v^t &= \tanh\Big(W^u\mathbf{x}_v^t + \sum_{k\in N_v}U_{m(v,k)}^u\mathbf{h}_k^{t-1} + \mathbf{b}^u\Big),\\
\mathbf{c}_v^t &= \mathbf{i}_v^t\odot\mathbf{u}_v^t + \sum_{k\in N_v}\mathbf{f}_{vk}^t\odot\mathbf{c}_k^{t-1},\\
\mathbf{h}_v^t &= \mathbf{o}_v^t\odot\tanh(\mathbf{c}_v^t),
\end{aligned}$$

where $m(v, k)$ denotes the edge label between nodes $v$ and $k$. Besides, [74] propose a Graph LSTM network to address the semantic object parsing task. It uses a confidence-driven scheme to adaptively select the starting node and determine the node updating sequence. It follows the same idea of generalizing existing LSTMs to graph-structured data but has a specific updating sequence, while the methods we mentioned above are agnostic to the order of nodes.

Sentence LSTM. Zhang et al. [152] propose the Sentence LSTM (S-LSTM) for improving text encoding. It converts text into a graph and utilizes the Graph LSTM to learn the representation. The S-LSTM shows strong representation power in many NLP problems.
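To close this subsection, the sketch below illustrates one propagation step of the GGNN recurrence (Eq. 8.93) for all nodes at once; the dense toy adjacency matrix, random parameters, and shapes are illustrative assumptions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ggnn_step(H, A, Wz, Uz, Wr, Ur, W, U, b):
    """One GGNN propagation step (Eq. 8.93) applied to all nodes simultaneously."""
    a = A @ H + b                                     # aggregate messages from neighbors
    z = sigmoid(a @ Wz.T + H @ Uz.T)                  # update gate
    r = sigmoid(a @ Wr.T + H @ Ur.T)                  # reset gate
    h_tilde = np.tanh(a @ W.T + (r * H) @ U.T)        # candidate state
    return (1 - z) * H + z * h_tilde

d, N = 8, 4
rng = np.random.default_rng(0)
H = rng.normal(size=(N, d))
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
Wz, Uz, Wr, Ur, W, U = (rng.normal(size=(d, d)) for _ in range(6))
b = rng.normal(size=(d,))
print(ggnn_step(H, A, Wz, Uz, Wr, Ur, W, U, b).shape)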

8.3.5 Extensions

In this subsection, we will talk about some extensions of graph neural networks.

8.3.5.1 Skip Connection

Many applications unroll or stack the graph neural network layer aiming to achieve better results, as more layers (i.e., $k$ layers) make each node aggregate more information from neighbors $k$ hops away. However, it has been observed in many experiments that deeper models could not improve the performance and could even perform worse [59]. This is mainly because more layers also propagate noisy information from an exponentially increasing number of expanded neighborhood members. A straightforward method to address the problem, the residual network [45], comes from the computer vision community. Nevertheless, even with residual connections, GCNs with more layers do not perform as well as the 2-layer GCN on many datasets [59].

Highway Network. Rahimi et al. [96] borrow ideas from the highway network [159] and use layer-wise gates to build a Highway GCN. The input of each layer is multiplied by the gating weights and then summed with the output:

$$T(\mathbf{h}^{(t)}) = \text{Sigmoid}\big(W^{(t)}\mathbf{h}^{(t)} + \mathbf{b}^{(t)}\big), \qquad \mathbf{h}^{(t+1)} = \mathbf{h}^{(t+1)}\odot T(\mathbf{h}^{(t)}) + \mathbf{h}^{(t)}\odot\big(1 - T(\mathbf{h}^{(t)})\big). \quad (8.97)$$

By adding the highway gates, the performance peaks at four layers in a specific problem discussed in [96]. The Column Network (CLN) proposed in [94] also utilizes the highway network. However, it has a different function to compute the gating weights.

Jump Knowledge Network. Xu et al. [134] study properties and resulting limitations of neighborhood aggregation schemes. They propose the Jump Knowledge Network, which could learn adaptive, structure-aware representations. The Jump Knowledge Network selects from all of the intermediate representations (which "jump" to the last layer) for each node at the last layer, which enables the model to select effective neighborhood information for each node. Xu et al. [134] use three approaches of concatenation, max-pooling, and LSTM-attention in the experiments to aggregate information. The Jump Knowledge Network performs well on experiments in social, bioinformatics, and citation networks. It can also be combined with models like Graph Convolutional Networks, GraphSAGE, and Graph Attention Networks to improve their performance.

8.3.5.2 Hierarchical Pooling

In the area of computer vision, a convolutional layer is usually followed by a pooling layer to get more general features. Similar to these pooling layers, much work focuses on designing hierarchical pooling layers on graphs. Complicated and large-scale graphs usually carry rich hierarchical structures that are of great importance for node-level and graph-level classification tasks.

To explore such inner features, Edge-Conditioned Convolution (ECC) [106] designs its pooling module with a recursive downsampling operation. The downsampling method is based on splitting the graph into two components by the sign of the largest eigenvector of the Laplacian.

DIFFPOOL [144] proposes a learnable hierarchical clustering module by training an assignment matrix in each layer:

$$S^{(l)} = \text{Softmax}\big(\text{GNN}_{l,pool}(A^{(l)}, V^{(l)})\big), \quad (8.98)$$

where $V^{(l)}$ is the node feature matrix and $A^{(l)}$ is the coarsened adjacency matrix of layer $l$.

8.3.5.3 Neighborhood Sampling

The original graph convolutional neural network has several drawbacks. Specifically, GCN requires the full graph Laplacian, which is computationally expensive for large graphs. Furthermore, the embedding of a node at layer $L$ is computed recursively from the embeddings of all its neighbors at layer $L - 1$. Therefore, the receptive field of a single node grows exponentially with respect to the number of layers, so computing the gradient for a single node is costly. Finally, GCN is trained independently for a fixed graph, which lacks the ability for inductive learning.

GraphSAGE [41] is a comprehensive improvement of the original GCN. To solve the problems mentioned above, GraphSAGE replaces the full graph Laplacian with learnable aggregation functions, which are crucial to perform message passing and generalize to unseen nodes. As shown in Eq. 8.86, it first aggregates neighborhood embeddings, concatenates them with the target node's embedding, and then propagates to the next layer. With learned aggregation and propagation functions, GraphSAGE could generate embeddings for unseen nodes. Also, GraphSAGE uses neighbor sampling to alleviate the receptive field expansion.

PinSage [143] proposes an importance-based sampling method. By simulating random walks starting from target nodes, this approach chooses the top T nodes with the highest normalized visit counts. FastGCN [16] further improves the sampling algorithm. Instead of sampling neighbors for each node, FastGCN directly samples the receptive field for each layer. FastGCN uses importance sampling, where the importance factor is calculated as below:

$$q(v) \propto \frac{1}{|N_v|}\sum_{u\in N_v}\frac{1}{|N_u|}. \quad (8.99)$$
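The layer-wise sampling distribution of Eq. 8.99 can be computed directly from the neighbor lists, as in the short sketch below; the toy graph and the fixed receptive-field size are illustrative assumptions.

import numpy as np

def fastgcn_sampling_probs(neighbors):
    """Importance factor q(v) from Eq. 8.99, normalized into a sampling distribution over all nodes."""
    q = np.array([
        (1.0 / len(neighbors[v])) * sum(1.0 / len(neighbors[u]) for u in neighbors[v])
        for v in range(len(neighbors))
    ])
    return q / q.sum()

neighbors = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1, 4], 4: [3]}
q = fastgcn_sampling_probs(neighbors)
# Sample a fixed-size receptive field for one layer according to q.
layer_nodes = np.random.default_rng(0).choice(len(neighbors), size=3, replace=False, p=q)
print(q, layer_nodes)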

Adapt. In contrast to the fixed sampling methods above, [51] introduces a parameterized and trainable sampler to perform layer-wise sampling conditioned on the former layer. Furthermore, this adaptive sampler could find the optimal sampling importance and reduce variance simultaneously.

8.3.5.4 Various Graph Types

In the original GNN [101], the input graph consists of nodes with label information and undirected edges, which is the simplest graph format. However, there are many variants of graphs in the world. In the following, we will introduce some methods designed to model different kinds of graphs.

Directed Graphs. The first variant of the graph is the directed graph. An undirected edge, which can be treated as two directed edges, shows that there is a relation between two nodes. However, directed edges can bring more information than undirected edges. For example, in a knowledge graph where the edge starts from the head entity and ends at the tail entity, the head entity is the parent class of the tail entity, which suggests we should treat the information propagation processes from parent classes and from child classes differently. DGP [55] uses two kinds of weight matrices, $W_p$ and $W_c$, to incorporate more precise structural information. The propagation rule is shown as follows:

$$H^{(t)} = f\big(D_p^{-1}A_p\, f(D_c^{-1}A_c H^{(t-1)} W_c)\, W_p\big), \quad (8.100)$$

where $D_p^{-1}A_p$ and $D_c^{-1}A_c$ are the normalized adjacency matrices for parents and children, respectively.

Heterogeneous Graphs. The second variant of the graph is the heterogeneous graph, where there are several kinds of nodes. The simplest way to process a heterogeneous graph is to convert the type of each node to a one-hot feature vector which is concatenated with the original features. Furthermore, GraphInception [151] introduces the concept of the metapath into the propagation on the heterogeneous graph. With metapaths, we can group the neighbors according to their node types and distances. GraphInception treats each neighbor group as a subgraph of a homogeneous graph to perform propagation and concatenates the propagation results from different homogeneous graphs to form a collective node representation. Recently, [128] proposes the Heterogeneous graph Attention Network (HAN), which utilizes node-level and semantic-level attention, so the model has the ability to consider node importance and metapaths simultaneously.

Graphs with Edge Information. In another variant of the graph, each edge has additional information like the weight or the type of the edge. We list two ways to handle this kind of graph. Firstly, we can convert the graph to a bipartite graph where the original edges also become nodes, and one original edge is split into two new edges, which means there are two new edges between the edge node and the begin/end nodes. The encoder of G2S [7] uses the following aggregation function for neighbors:

$$\mathbf{h}_v^{(t)} = f\Big(\frac{1}{|N_v|}\sum_{u\in N_v}W_r\big(\mathbf{r}_v^{(t)}\odot\mathbf{h}_u^{(t-1)}\big) + \mathbf{b}_r\Big), \quad (8.101)$$

where $W_r$ and $\mathbf{b}_r$ are the propagation parameters for different types of edges (relations).

Secondly, we can adopt different weight matrices for the propagation of different kinds of edges. When the number of relations is huge, r-GCN [102] introduces two kinds of regularization to reduce the number of parameters for modeling large numbers of relations: basis decomposition and block-diagonal decomposition. With the basis decomposition, each $W_r$ is defined as follows:

$$W_r = \sum_{b=1}^{B}\alpha_{rb}M_b. \quad (8.102)$$

Here each $W_r$ is a linear combination of basis transformations $M_b \in \mathbb{R}^{d_{in}\times d_{out}}$ with coefficients $\alpha_{rb}$. In the block-diagonal decomposition, r-GCN defines each $W_r$ through the direct sum over a set of low-dimensional matrices, which needs more parameters than the first one.

Dynamic Graphs. Another variant of the graph is the dynamic graph, which has a static graph structure and dynamic input signals. To capture both kinds of information, DCRNN [71] and STGCN [147] first collect spatial information by GNNs, then feed the outputs into a sequence model like a sequence-to-sequence model or CNNs. Differently, Structural-RNN [53] and ST-GCN [135] collect spatial and temporal messages at the same time. They extend the static graph structure with temporal connections so that they can apply traditional GNNs on the extended graphs.
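The basis decomposition of Eq. 8.102 is easy to sketch: each relation's weight matrix is assembled from a small set of shared bases. The tensor shapes and names below are illustrative assumptions, not the r-GCN reference implementation.

import numpy as np

def relation_weights(alpha, bases):
    """W_r = sum_b alpha[r, b] * M_b  (Eq. 8.102): each relation's weight matrix is a
    linear combination of B shared basis transformations."""
    return np.einsum('rb,bio->rio', alpha, bases)

num_relations, B, d_in, d_out = 4, 2, 8, 8
rng = np.random.default_rng(0)
alpha = rng.normal(size=(num_relations, B))          # per-relation coefficients
bases = rng.normal(size=(B, d_in, d_out))            # shared basis matrices M_b
W = relation_weights(alpha, bases)                   # shape: (num_relations, d_in, d_out)
# A relational message along an edge of type r would then use h_u @ W[r].
print(W.shape)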

8.3.6 Applications

Graph neural networks have been explored in a wide range of problem domains across supervised, semi-supervised, unsupervised, and reinforcement learning settings. In this section, we simply divide the applications into three scenarios: (1) structural scenarios where the data has an explicit relational structure, such as physical systems, molecular structures, and knowledge graphs; (2) nonstructural scenarios where the relational structure is not explicit, including images, text, etc.; (3) other application scenarios such as generative models and combinatorial optimization problems. Note that we only list several representative applications instead of providing an exhaustive list. We further give some examples of GNNs in the tasks of fact verification and relation extraction. Figure 8.14 illustrates some application scenarios of graph neural networks.

8.3.6.1 Structural Scenarios

In the following, we will introduce GNN applications in structural scenarios, where the data are naturally organized in a graph structure. For example, GNNs are widely used in social network prediction [41, 59], traffic prediction [25, 96], recommender systems [120, 143], and graph representation [144]. Specifically, we discuss how to model real-world physical systems with object-relationship graphs, how to predict the chemical properties of molecules and the biological interaction properties of proteins, and the applications of GNNs on knowledge graphs.

Fig. 8.14 Application scenarios of graph neural networks [155]

Physics. Modeling real-world physical systems is one of the most fundamental aspects of understanding human intelligence. By representing objects as nodes and relations as edges, we can perform GNN-based reasoning about objects, relations, and physics in a simplified but effective way.

Battaglia et al. [6] propose Interaction Networks to make predictions and inferences about various physical systems. Objects and relations are first fed into the model as input. Then the model considers the interactions and physical dynamics to predict new states. They separately model relation-centric and object-centric models, making it easier to generalize across different systems. In CommNet [107], interactions are not modeled explicitly. Instead, an interaction vector is obtained by averaging all other agents' hidden vectors. VAIN [49] further introduces attentional methods into the agent interaction process, which preserves both the complexity advantages and the computational efficiency.

Visual Interaction Networks [132] can make predictions from pixels. The model learns a state code from two consecutive input frames for each object; then, after adding their interaction effect through an Interaction Net block, a state decoder converts the state codes into the next step's states. Sanchez-Gonzalez et al. [99] propose a Graph Network based model that can perform either state prediction or inductive inference. The inference model takes partially observed information as input and constructs a hidden graph for implicit system classification.

Molecular Fingerprints. Molecular fingerprints are feature vectors representing molecules, which are important in computer-aided drug design. Traditional molecular fingerprint discovery relies on hand-crafted heuristic methods, whereas GNNs can provide more flexible approaches for better fingerprints. Duvenaud et al. [31] propose neural graph fingerprints (Neural FPs), which calculate substructure feature vectors via GCNs and sum them to obtain the overall representation. The aggregation function is introduced in Eq. 8.83. Kearnes et al. [56] further explicitly model atoms and atom pairs independently to emphasize atom interactions. It introduces the edge representation $\mathbf{e}_{uv}^{(t)}$ instead of an aggregation function, i.e., $\mathbf{h}_{N_v}^{(t)} = \sum_{u \in N_v} \mathbf{e}_{uv}^{(t)}$. The node update function is

$\mathbf{h}_v^{(t+1)} = \mathrm{ReLU}(\mathbf{W}_1[\mathrm{ReLU}(\mathbf{W}_0\mathbf{h}_u^{(t)}); \mathbf{h}_{N_v}^{(t)}]),$ (8.103)

while the edge update function is

$\mathbf{e}_{uv}^{(t+1)} = \mathrm{ReLU}(\mathbf{W}_4[\mathrm{ReLU}(\mathbf{W}_2\mathbf{e}_{uv}^{(t)}); \mathrm{ReLU}(\mathbf{W}_3[\mathbf{h}_v^{(t)}; \mathbf{h}_u^{(t)}])]).$ (8.104)
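The following NumPy sketch runs one round of the edge-centric updates of Eqs. 8.103 and 8.104 on a toy molecular graph. It is only an illustration: the weight shapes and the random initialization are assumptions, and the node update below feeds each node's own previous state through $\mathbf{W}_0$ rather than a single neighbor's state.

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)
d = 8
# Assumed weight shapes: every transformation keeps the hidden size d.
W0, W1 = np.random.randn(d, d), np.random.randn(d, 2 * d)
W2, W3, W4 = np.random.randn(d, d), np.random.randn(d, 2 * d), np.random.randn(d, 2 * d)

def weave_step(h, e, neighbors):
    """h: (n, d) node states; e: dict (u, v) -> (d,) edge states."""
    h_new, e_new = h.copy(), {}
    for v, nbrs in neighbors.items():
        h_Nv = sum(e[(u, v)] for u in nbrs)                # h_Nv = sum_u e_uv
        # Node update in the spirit of Eq. 8.103 (node's own state used here)
        h_new[v] = relu(W1 @ np.concatenate([relu(W0 @ h[v]), h_Nv]))
    for (u, v), e_uv in e.items():
        # Eq. 8.104: edge update from its own state and both endpoint states
        pair = relu(W3 @ np.concatenate([h[v], h[u]]))
        e_new[(u, v)] = relu(W4 @ np.concatenate([relu(W2 @ e_uv), pair]))
    return h_new, e_new

h = np.random.randn(3, d)
e = {(0, 1): np.random.randn(d), (1, 0): np.random.randn(d),
     (1, 2): np.random.randn(d), (2, 1): np.random.randn(d)}
neighbors = {0: [1], 1: [0, 2], 2: [1]}
h2, e2 = weave_step(h, e, neighbors)
```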

Protein Interface Prediction. Fout et al. [33] focus on the task of protein interface prediction, which is a challenging problem with critical applications in drug discovery and design. The proposed GCN-based method learns ligand and receptor protein residue representations separately and merges them for pair-wise classification. GNNs can also be used in biomedical engineering. With a Protein-Protein Interaction network, [97] leverages graph convolutions and a relation network for breast cancer subtype classification. Zitnik et al. [160] also suggest a GCN-based model for polypharmacy side-effect prediction. Their work models the drug and protein interaction network and deals separately with edges of different types.

Knowledge Graph. Hamaguchi et al. [40] utilize GNNs to solve the Out-Of-Knowledge-Base (OOKB) entity problem in Knowledge Base Completion (KBC). The OOKB entities in [40] are directly connected to existing entities, and thus the embeddings of OOKB entities can be aggregated from the existing entities. The method achieves satisfying performance in both the standard KBC setting and the OOKB setting. Wang et al. [130] utilize GCNs to solve the cross-lingual knowledge graph alignment problem. The model embeds entities from different languages into a unified embedding space and aligns them based on embedding similarity.

8.3.6.2 Nonstructural Scenarios

In this section, we will talk about applications in nonstructural scenarios such as images, text, programming source code [1, 72], and multi-agent systems [49, 58, 107]. We will only give a detailed introduction to the first two scenarios due to the length limit. Roughly, there are two ways to apply graph neural networks to nonstructural scenarios: (1) incorporate structural information from other domains to improve the performance, for example, using information from knowledge graphs to alleviate the zero-shot problems in image tasks; (2) infer or assume the relational structure in the scenario and then apply the model to solve problems defined on graphs, such as the method in [152], which models text as graphs.

Image classification. Image classification is a fundamental and essential task in the field of computer vision, which attracts much attention and has many famous datasets like ImageNet [62]. Recent progress in image classification benefits from big data and the strong power of GPU computation, which allows us to train a classifier without extracting structural information from images. However, zero-shot and few-shot learning are becoming more and more popular in the field of image classification, because most models can achieve similar performance when given enough data. There are several works leveraging graph neural networks to incorporate structural information into image classification.

First, knowledge graphs can be used as extra information to guide zero-shot recognition [55, 129]. Wang et al. [129] build a knowledge graph where each node corresponds to an object category and take the word embeddings of nodes as input for predicting the classifiers of different categories. As the over-smoothing effect becomes severe when the convolution architecture is deep, the 6-layer GCN used in [129] washes out much useful information in the representation. To solve the smoothing problem in GCN propagation, [55] uses a single-layer GCN with a larger neighborhood, which includes both one-hop and multi-hop nodes in the graph, and it proved effective in building zero-shot classifiers from existing ones. Besides the knowledge graph, the similarity between images in the dataset is also helpful for few-shot learning [100]. Satorras and Estrach [100] propose to build a weighted, fully connected image network based on the similarity and to perform message passing in the graph for few-shot recognition. As most knowledge graphs are too large for reasoning, [77] selects some related entities to build a subgraph based on the result of object detection and applies GGNN to the extracted graph for prediction. Besides, [66] proposes to construct a new knowledge graph where the entities are all the categories; it defines three types of label relations, namely super-subordinate, positive correlation, and negative correlation, and propagates the confidence of labels in the graph directly.

Visual reasoning. Computer vision systems usually need to perform reasoning by incorporating both spatial and semantic information, so it is natural to generate graphs for reasoning tasks. A typical visual reasoning task is Visual Question Answering (VQA). Teney et al. [114] construct an image scene graph and a question syntactic graph, respectively, and then apply GGNN to train the embeddings for predicting the final answer. Instead of relying only on spatial connections among objects, [87] builds relational graphs conditioned on the questions.
With knowledge graphs, [83, 131] can perform finer relation exploration and a more interpretable reasoning process. Other applications of visual reasoning include object detection, interaction detection, and region classification. In object detection [39, 50], GNNs are used to calculate RoI features; in interaction detection [53, 95], GNNs are message-passing tools between humans and objects; in region classification [18], GNNs perform reasoning on graphs that connect regions and classes.

Text Classification. Text classification is an essential and classical problem in natural language processing. The classical GCN models [4, 28, 41, 47, 59, 82] and the GAT model [122] have been applied to solve the problem, but they only use the structural information between the documents and do not use much text information. Peng et al. [91] propose a graph-CNN-based deep learning model: it first turns texts into graphs-of-words, and then the graph convolution operations in [347] are used on the word graph. Zhang et al. [152] propose the Sentence LSTM to encode text. The whole sentence is represented in a single state which contains a global sentence-level state and several substates for individual words, and the global sentence-level representation is used for classification. These methods either view a document or a sentence as a graph of word nodes or rely on the document citation relation to construct the graph. Yao et al. [142] regard the documents and words as nodes to construct the corpus graph (hence a heterogeneous graph) and use the Text GCN to learn embeddings of words and documents. Sentiment classification can also be regarded as a text classification problem, and a Tree-LSTM approach is proposed by [109].

Sequence Labeling. As each node in a GNN has its hidden state, we can utilize the hidden states to address sequence labeling problems if we consider every word in the sentence as a node. Zhang et al. [152] utilize the Sentence LSTM to label sequences; they conduct experiments on POS-tagging and NER tasks and achieve promising performance. Semantic role labeling is another sequence labeling task. Marcheggiani and Titov [76] propose a Syntactic GCN to solve the problem. The Syntactic GCN, which operates on a directed graph with labeled edges, is a special variant of the GCN [59]. It uses edge-wise gates that enable the model to regulate the contribution of each dependency edge, as sketched below. The Syntactic GCNs over syntactic dependency trees are used as sentence encoders to learn latent feature representations of words in the sentence. Marcheggiani and Titov [76] also reveal that GCNs and LSTMs are functionally complementary in the task.
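The edge-wise gating of the Syntactic GCN can be sketched as follows. This is a schematic NumPy rendering, not the exact parameterization of [76]: the handling of edge directions, the label-specific biases, and the self-loop label are simplifying assumptions.

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
relu = lambda x: np.maximum(x, 0.0)

def syntactic_gcn_layer(H, edges, W, b, w_gate, b_gate):
    """One gated GCN step over a labeled dependency graph (sketch).

    H: (n, d) word states; edges: list of (head, dependent, label_id).
    W: (3, d, d) direction-specific weights (head->dep, dep->head, self-loop),
    b: (num_labels, d) label-specific biases,
    w_gate: (3, d) and b_gate: (num_labels,) gate parameters.
    """
    n, _ = H.shape
    out = np.zeros_like(H)
    for head, dep, lab in edges:
        for u, v, direction in [(head, dep, 0), (dep, head, 1)]:
            # Edge-wise scalar gate decides how much this edge contributes.
            g = sigmoid(H[u] @ w_gate[direction] + b_gate[lab])
            out[v] += g * (W[direction] @ H[u] + b[lab])
    for v in range(n):
        # Self-loops keep each word's own information (label 0 reused here).
        g = sigmoid(H[v] @ w_gate[2] + b_gate[0])
        out[v] += g * (W[2] @ H[v] + b[0])
    return relu(out)

H = np.random.randn(4, 8)                       # four words
edges = [(1, 0, 2), (1, 3, 5), (3, 2, 1)]       # (head, dependent, label)
W = np.random.randn(3, 8, 8); b = np.random.randn(6, 8)
w_gate = np.random.randn(3, 8); b_gate = np.random.randn(6)
print(syntactic_gcn_layer(H, edges, W, b, w_gate, b_gate).shape)   # (4, 8)
```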

8.3.6.3 Other Scenarios

Besides structural and nonstructural scenarios, there are some other scenarios where graph neural networks play an important role. In this subsection, we will introduce generative graph models and combinatorial optimization with GNNs.

Generative Models. Generative models for real-world graphs have drawn significant attention for their essential applications, including modeling social interactions, discovering new chemical structures, and constructing knowledge graphs. As deep learning methods have a powerful ability to learn the implicit distribution of graphs, there has recently been a surge in neural graph generative models.

NetGAN [104] is one of the first works to build a neural graph generative model, which generates graphs via random walks. It transforms the problem of graph generation into the problem of walk generation: it takes the random walks from a specific graph as input and trains a walk generative model using a GAN architecture. While the generated graph preserves essential topological properties of the original graph, the number of nodes cannot change during generation and remains the same as in the original graph. GraphRNN [146] generates the adjacency matrix of a graph by generating the adjacency vector of each node step by step, and can therefore output networks with varying numbers of nodes. Instead of generating the adjacency matrix sequentially, MolGAN [27] predicts a discrete graph structure (the adjacency matrix) at once and utilizes a permutation-invariant discriminator to deal with node permutations in the adjacency matrix. Besides, it applies a reward network for RL-based optimization toward desired chemical properties. What is more, [75] proposes constrained variational autoencoders to ensure the semantic validity of generated graphs. Moreover, GCPN [145] incorporates domain-specific rules through reinforcement learning. Li et al. [73] propose a model that generates edges and nodes sequentially and utilizes a graph neural network to extract the hidden state of the current graph, which is used to decide the action in the next step of the sequential generative process.

Combinatorial optimization. Combinatorial optimization problems over graphs are a set of NP-hard problems that attract much attention from scientists in many fields. Some specific problems, like the Traveling Salesman Problem (TSP), have various heuristic solutions. Recently, using deep neural networks to solve such problems has become a hotspot, and some of the solutions further leverage graph neural networks because of their graph structure. Bello et al. [9] first propose a deep learning approach to tackle TSP. Their method consists of two parts: a Pointer Network [123] for parameterizing rewards and a policy gradient [108] module for training. This work has proved to be comparable with traditional approaches. However, Pointer Networks are designed for sequential data like text, while order-invariant encoders are more appropriate for such tasks. Khalil et al. [57] and Kool and Welling [61] improve the above method by including graph neural networks. The former work first obtains the node embeddings from structure2vec [26] and then feeds them into a Q-learning module for making decisions. The latter builds an attention-based encoder-decoder system; by replacing the reinforcement learning module with an attention-based decoder, it is more efficient to train. These works achieved better performance than previous algorithms, demonstrating the representation power of graph neural networks. Nowak et al. [88] focus on the Quadratic Assignment Problem, i.e., measuring the similarity of two graphs. The GNN-based model learns node embeddings for each graph independently and matches them using an attention mechanism. Even in situations where traditional relaxation-based methods may not perform well, this model still shows satisfying performance.

8.3.6.4 Example: GNNs for Fact Verification

Due to the rapid development of Information Extraction (IE), huge volumes of data have been extracted. How to automatically verify these data becomes a vital problem for various data-driven applications, e.g., knowledge graph completion [126] and open-domain question answering [15]. Hence, many recent research efforts have been devoted to Fact Verification (FV), which aims to verify given claims with evidence retrieved from plain text. More specifically, given a claim, an FV system is asked to label it as "SUPPORTED", "REFUTED", or "NOT ENOUGH INFO", indicating that the evidence supports, refutes, or is not sufficient for the claim. An example of the FV task is shown in Table 8.5.

Existing FV methods formulate FV as a Natural Language Inference (NLI) [3] task. However, they utilize simple evidence combination methods, such as concatenating the evidence or dealing with each evidence-claim pair separately. These methods are unable to grasp sufficient relational and logical information among the evidence. In fact, many claims require simultaneously integrating and reasoning over several pieces of evidence for verification. As shown in Table 8.5, for this particular example, we cannot verify the given claim by checking any single piece of evidence in isolation; the claim can be verified only by understanding and reasoning over multiple pieces of evidence.

Table 8.5 A case of a claim that requires integrating multiple pieces of evidence to verify. The representation "{DocName, LineNum}" means that the evidence is extracted from the document "DocName" and its line number is LineNum

Claim: Al Jardine is an American rhythm guitarist
Truth evidence: {Al Jardine, 0}, {Al Jardine, 1}
Retrieved evidence: {Al Jardine, 1}, {Al Jardine, 0}, {Al Jardine, 2}, {Al Jardine, 5}, {Jardine, 42}
Evidence:
(1) He is best known as the band's rhythm guitarist, and for occasionally singing lead vocals on singles such as "Help Me, Rhonda" (1965), "Then I Kissed Her" (1965), and "Come Go with Me" (1978)
(2) Alan Charles Jardine (born September 3, 1942) is an American musician, and songwriter who co-founded the Beach Boys
(3) In 2010, Jardine released his debut solo studio album, A Postcard from California
(4) In 1988, Jardine was inducted into the Rock and Roll Hall of Fame as a member of the Beach Boys
(5) Ray Jardine American rock climber, lightweight backpacker, inventor, author, and global adventurer
Label: SUPPORTED

To integrate and reason over information from multiple pieces of evidence, [156] proposes a graph-based evidence aggregating and reasoning (GEAR) framework. Specifically, [156] first builds a fully connected evidence graph and encourages information propagation among the evidence. Then, GEAR aggregates the pieces of evidence and adopts a classifier to decide whether the evidence can support, refute, or is not sufficient for the claim. Intuitively, by sufficiently exchanging and reasoning over evidence information on the evidence graph, the model can make the best use of the information for verifying claims. For example, by delivering the information "Los Angeles County is the most populous county in the USA" to "the Rodney King riots occurred in Los Angeles County" through the evidence graph, the synthesized information can support "The Rodney King riots took place in the most populous county in the USA". Furthermore, GEAR adopts the effective pretrained language representation model BERT [29] to better grasp both evidence and claim semantics.

Zhou et al. [156] employ a three-step pipeline with components for document retrieval, sentence selection, and claim verification to solve the task. In the document retrieval and sentence selection stages, they simply follow the method of [44], since it has the highest evidence recall in the previous FEVER shared task, and they propose the GEAR framework for the final claim verification stage. The full pipeline is illustrated in Fig. 8.15. Given a claim and the retrieved evidence, GEAR first utilizes a sentence encoder to obtain representations for the claim and the evidence. It then builds a fully connected evidence graph and uses an Evidence Reasoning Network (ERNet) to propagate information among the evidence and reason over the graph. Finally, it utilizes an evidence aggregator to infer the final result.

Sentence Encoder. Given an input sentence, GEAR employs BERT [29] as the sentence encoder by extracting the final hidden state of the [CLS] token as the representation, where [CLS] is the special classification token in BERT:

$\mathbf{e}_i = \mathrm{BERT}(e_i, c), \quad \mathbf{c} = \mathrm{BERT}(c).$ (8.105)
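A minimal sketch of this encoding step, assuming the HuggingFace `transformers` implementation of BERT rather than the authors' released code, might look as follows; the claim and evidence strings are taken from Table 8.5.

```python
import torch
from transformers import BertModel, BertTokenizer

# Assumed tooling: a standard pretrained BERT from HuggingFace `transformers`.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def encode(sentence, claim=None):
    """Return the final hidden state of [CLS] as the sentence representation."""
    # Evidence is paired with the claim, mirroring e_i = BERT(e_i, c) in Eq. 8.105.
    inputs = tokenizer(sentence, claim, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0]        # [CLS] hidden state, shape (1, 768)

claim = "Al Jardine is an American rhythm guitarist."
evidence = "He is best known as the band's rhythm guitarist."
e = encode(evidence, claim)   # evidence representation conditioned on the claim
c = encode(claim)             # claim representation
```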

Fig. 8.15 The pipeline used in [156]. The GEAR framework is illustrated in the claim verification section

Evidence Reasoning Network. To encourage information propagation among evidence, GEAR builds a fully connected evidence graph where each node indicates a piece of evidence. It also adds a self-loop to every node because each node needs the information from itself in the message propagation process. We use $\mathbf{h}^t = \{\mathbf{h}_1^t, \mathbf{h}_2^t, \ldots, \mathbf{h}_N^t\}$ to represent the hidden states of nodes at layer $t$. The initial hidden state of each evidence node is initialized by the evidence representation: $\mathbf{h}_i^0 = \mathbf{e}_i$. Inspired by recent work on semi-supervised graph learning and relational reasoning [59, 90, 122], Zhou et al. [156] propose an Evidence Reasoning Network (ERNet) to propagate information among the evidence nodes. It first uses an MLP to compute the attention coefficient between a node $i$ and its neighbor $j$ ($j \in N_i$),

$y_{ij} = \mathbf{W}_1^{(t-1)}(\mathrm{ReLU}(\mathbf{W}_0^{(t-1)}[\mathbf{h}_i^{(t-1)}; \mathbf{h}_j^{(t-1)}])),$ (8.106)

where $N_i$ denotes the set of neighbors of node $i$, $\mathbf{W}_0^{(t-1)}$ and $\mathbf{W}_1^{(t-1)}$ are weight matrices, and $[\cdot\,;\cdot]$ denotes the concatenation operation. Then, it normalizes the coefficients using the softmax function

$\alpha_{ij} = \mathrm{Softmax}_j(y_{ij}) = \dfrac{\exp(y_{ij})}{\sum_{k \in N_i} \exp(y_{ik})}.$ (8.107)

Finally, the normalized attention coefficients are used to compute a linear combination of the neighbor features, and thus we obtain the features for node $i$ at layer $t$:

$\mathbf{h}_i^{(t)} = \sum_{j \in N_i} \alpha_{ij}\mathbf{h}_j^{(t-1)}.$ (8.108)
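The following NumPy sketch wires Eqs. 8.106–8.108 together for one ERNet layer over a fully connected evidence graph with self-loops. The weight shapes, the reuse of the same weights across stacked layers, and the random inputs are assumptions for illustration, not the released GEAR code.

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)

def ernet_layer(H, W0, W1):
    """One ERNet propagation step.

    H:  (N, d) evidence hidden states at layer t-1
    W0: (d, 2d) pair transformation; W1: (d,) scoring vector
    """
    N = H.shape[0]
    y = np.zeros((N, N))
    for i in range(N):
        for j in range(N):                     # fully connected graph + self-loops
            pair = np.concatenate([H[i], H[j]])
            y[i, j] = W1 @ relu(W0 @ pair)     # Eq. 8.106
    # Eq. 8.107: row-wise softmax over neighbors j
    alpha = np.exp(y - y.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)
    return alpha @ H                           # Eq. 8.108

d = 6
H = np.random.randn(5, d)                      # five pieces of evidence
W0 = np.random.randn(d, 2 * d)
W1 = np.random.randn(d)
for _ in range(2):                             # stack T = 2 layers; in GEAR each
    H = ernet_layer(H, W0, W1)                 # layer has its own W0, W1
print(H.shape)                                 # (5, 6)
```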

By stacking $T$ layers of ERNet, [156] assumes that each piece of evidence can grasp enough information by communicating with the other evidence.

Evidence Aggregator. Zhou et al. [156] employ an evidence aggregator to gather information from the different evidence nodes and obtain the final hidden state $\mathbf{o}$. The aggregator may utilize different aggregating strategies, and [156] suggests three aggregators:

Attention Aggregator. Zhou et al. [156] use the representation of the claim $\mathbf{c}$ to attend over the hidden states of evidence and get the final aggregated state $\mathbf{o}$:

$y_j = \mathbf{W}_1(\mathrm{ReLU}(\mathbf{W}_0[\mathbf{c}; \mathbf{h}_j])),$
$\alpha_j = \mathrm{Softmax}(y_j) = \dfrac{\exp(y_j)}{\sum_{k=1}^{N}\exp(y_k)},$ (8.109)
$\mathbf{o} = \sum_{k=1}^{N} \alpha_k \mathbf{h}_k.$

Max Aggregator. The max aggregator performs the element-wise max operation over the hidden states:

$\mathbf{o} = \mathrm{Max}(\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_N).$ (8.110)

Mean Aggregator. The mean aggregator performs the element-wise mean operation over the hidden states:

$\mathbf{o} = \mathrm{Mean}(\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_N).$ (8.111)

Once the final state $\mathbf{o}$ is obtained, GEAR employs a one-layer MLP to get the final prediction $\mathbf{l}$:

$\mathbf{l} = \mathrm{Softmax}(\mathrm{ReLU}(\mathbf{W}\mathbf{o} + \mathbf{b})),$ (8.112)

where $\mathbf{W}$ and $\mathbf{b}$ are parameters. Zhou et al. [156] conduct experiments on the large-scale benchmark dataset for Fact Extraction and VERification (FEVER) [115]. Experimental results show that the proposed framework outperforms recent state-of-the-art baseline systems. A further case study indicates that the framework can better leverage multi-evidence information and reason over the evidence for FV.
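To round off the walkthrough, here is a small NumPy sketch of the three aggregators (Eqs. 8.109–8.111) and the final classifier (Eq. 8.112). The shapes and the use of a single scoring vector for the attention aggregator are assumptions made only for illustration.

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)
softmax = lambda x: np.exp(x - x.max()) / np.exp(x - x.max()).sum()

def attention_aggregator(c, H, W0, W1):
    # y_j = W1(ReLU(W0 [c; h_j])), alpha = softmax(y), o = sum_j alpha_j h_j
    y = np.array([W1 @ relu(W0 @ np.concatenate([c, h])) for h in H])
    alpha = softmax(y)
    return alpha @ H

def max_aggregator(H):
    return H.max(axis=0)          # element-wise max over evidence states

def mean_aggregator(H):
    return H.mean(axis=0)         # element-wise mean over evidence states

def predict(o, W, b):
    # Eq. 8.112: one-layer MLP over the aggregated state
    return softmax(relu(W @ o + b))

d, N, num_labels = 6, 5, 3        # SUPPORTED / REFUTED / NOT ENOUGH INFO
H, c = np.random.randn(N, d), np.random.randn(d)
W0, W1 = np.random.randn(d, 2 * d), np.random.randn(d)
W, b = np.random.randn(num_labels, d), np.random.randn(num_labels)
o = attention_aggregator(c, H, W0, W1)
print(predict(o, W, b))           # probabilities over the three labels
```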

8.4 Summary

In this chapter, we have introduced network representation learning (NRL), which encodes network structure information into a continuous vector space and makes deep learning techniques applicable to network data. Unsupervised network representation learning came first in the development of NRL: Spectral Clustering, DeepWalk, LINE, GraRep, and other methods utilize the network structure for vertex embedding learning. Afterward, TADW incorporates text information into NRL under the framework of matrix factorization. The NEU algorithm then moves one step forward and proposes a general method to improve the quality of any learned network embeddings. Other unsupervised methods also consider preserving specific properties of the network topology, e.g., community structure and asymmetry. Recently, semi-supervised NRL algorithms have attracted much attention. These methods focus on a specific task, such as classification, and use the labels of the training set to improve the quality of the network embeddings. Node2vec, MMDW, and many other methods, including the family of Graph Neural Networks (GNNs), have been proposed to this end. Semi-supervised algorithms can achieve better results because they can take advantage of more information from the specific task. For a further understanding of network representation learning, you can also find more related papers in the paper list at https://github.com/thunlp/GNNPapers. There are also some recommended surveys and books, including the following:

• Cui et al. A survey on network embedding [24]. • Goyal and Ferrara. Graph embedding techniques, applications, and performance: A survey [37]. • Zhang et al. Network representation learning: A survey [149]. • Wu et al. A comprehensive survey on graph neural networks [133]. • Zhou et al. Graph neural networks: A review of methods and applications [155]. • Zhang et al. Deep learning on graphs: A survey [154].

In the future, for better network representation learning, some directions require further effort:

(1) More Complex and Realistic Networks. An intriguing direction is representation learning on heterogeneous and dynamic networks, since most real-world network data fall into these categories. The vertices and edges in a heterogeneous network may belong to different types. Networks in real life are also highly dynamic, e.g., the friendship between Facebook users may form and disappear. These characteristics require researchers to design specific algorithms for them. Network embedding learning on dynamic network structures is, therefore, an important task. Some works [14, 105] have already been proposed for such more complex and realistic settings.

(2) Deeper Model Architectures. Conventional deep neural networks can stack hundreds of layers to get better performance, because the deeper structure has more parameters and may improve the expressive power significantly. However, NRL and GNN models are usually shallow; in fact, most of them have no more than three layers. Taking GCN as an example, as the experiments in [70] show, stacking multiple GCN layers results in over-smoothing: the representations of all vertices converge to the same vector. Although some researchers have managed to tackle this problem [70, 125] to some extent, it remains a limitation of NRL. Designing deeper model architectures is an exciting challenge for future research, and will be a considerable contribution to the understanding of NRL.

(3) Scalability. Scalability determines whether an algorithm can be applied in practice. How to apply NRL methods in real-world web-scale scenarios, such as social networks or recommendation systems, has been an essential problem for most network embedding algorithms. Scaling up NRL methods, especially GNNs, is difficult because many core steps are computationally expensive in a big-data environment. For example, network data are not regular Euclidean data, and each node has its own neighborhood structure; therefore, batching tricks cannot be easily applied. Moreover, computing the graph Laplacian is infeasible when there are millions or even billions of nodes and edges. Several works have proposed solutions to this problem [143, 153, 157], and we are paying close attention to the progress.

References

1. Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. Learning to represent programs with graphs. In Proceedings of ICLR, 2018. 2. Reid Andersen, Fan Chung, and Kevin Lang. Local graph partitioning using pagerank vectors. In Proceedings of FOCS, 2006. 3. Gabor Angeli and Christopher D Manning. Naturalli: Natural logic inference for common sense reasoning. In Proceedings of EMNLP, 2014. 4. James Atwood and Don Towsley. Diffusion-convolutional neural networks. In Proceedings of NeurIPS, 2016. 5. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR, 2015. 6. Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, et al. Interaction networks for learning about objects, relations and physics. In Proceedings of NeurIPS, 2016. 7. Daniel Beck, Gholamreza Haffari, and Trevor Cohn. Graph-to-sequence learning using gated graph neural networks. In Proceedings of ACL, 2018. 8. Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embed- ding and clustering. In Proceedings of NeurIPS, volume 14, 2001. 9. Irwan Bello, Hieu Pham, Quoc V Le, Mohammad Norouzi, and Samy Bengio. Neural com- binatorial optimization with reinforcement learning. In Proceedings of ICLR, 2017. 10. Davide Boscaini, Jonathan Masci, Emanuele Rodolà, and Michael Bronstein. Learning shape correspondence with anisotropic convolutional neural networks. In Proceedings of NeurIPS, 2016. 11. Joan Bruna, Wojciech Zaremba, Arthur Szlam, and YannLecun. Spectral networks and locally connected networks on graphs. In Proceedings of ICLR, 2014. 12. Hongyun Cai, Vincent W Zheng, and Kevin Chen-Chuan Chang. A comprehensive survey of graph embedding: Problems, techniques, and applications. IEEE Transactions on Knowledge and Data Engineering, 30(9):1616–1637, 2018. 13. Shaosheng Cao, Wei Lu, and Qiongkai Xu. Grarep: Learning graph representations with global structural information. In Proceedings of CIKM, 2015. 14. Shiyu Chang, Wei Han, Jiliang Tang, Guo-Jun Qi, Charu C Aggarwal, and Thomas S Huang. Heterogeneous network embedding via deep architectures. In Proceedings of SIGKDD, 2015. 15. Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading wikipedia to answer open-domain questions. In Proceedings of the ACL, 2017. 16. Jie Chen, Tengfei Ma, and Cao Xiao. Fastgcn: Fast learning with graph convolutional networks via importance sampling. In Proceedings of ICLR, 2018. 17. Mo Chen, Qiong Yang, and Xiaoou Tang. Directed graph embedding. In Proceedings of IJCAI, 2007. 18. Xinlei Chen, Lijia Li, Li Feifei, and Abhinav Gupta. Iterative visual reasoning beyond con- volutions. In Proceedings of CVPR, 2018. 19. Jianpeng Cheng, Li Dong, and Mirella Lapata. Long short-term memory-networks for machine reading. In Proceedings of EMNLP, 2016. 20. Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder– decoder for statistical machine translation. In Proceedings of EMNLP, 2014. 21. Yoon-Sik Cho, Greg Ver Steeg, and Aram Galstyan. Socially relevant venue clustering from check-in data. In Proceedings of KDD Workshop, 2013. 22. Wojciech Chojnacki and Michael J Brooks. A note on the locally linear embedding algorithm. International Journal of Pattern Recognition and Artificial Intelligence, 23(08):1739–1752, 2009. 23. Fan RK Chung and Fan Chung Graham. Spectral graph theory. Number 92. 
American Mathematical Soc., 1997. 24. Peng Cui, Xiao Wang, Jian Pei, and Wenwu Zhu. A survey on network embedding. IEEE Transactions on Knowledge and Data Engineering, 2018.

25. Zhiyong Cui, Kristian Henrickson, Ruimin Ke, and Yinhai Wang. Traffic graph convolutional recurrent neural network: A deep learning framework for network-scale traffic learning and forecasting. IEEE Transactions on Intelligent Transportation Systems, 2019. 26. Hanjun Dai, Bo Dai, and Le Song. Discriminative embeddings of latent variable models for structured data. In Proceedings of ICML, 2016. 27. Nicola De Cao and Thomas Kipf. Molgan: An implicit generative model for small molecular graphs. arXiv preprint arXiv:1805.11973, 2018. 28. Michael Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural net- works on graphs with fast localized spectral filtering. In Proceedings of NeurIPS, 2016. 29. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL, 2019. 30. Giuseppe Di Battista, Peter Eades, Roberto Tamassia, and Ioannis G Tollis. Algorithms for drawing graphs: an annotated bibliography. Computational Geometry, 4(5):235–282, 1994. 31. David K Duvenaud, Dougal Maclaurin, Jorge Aguileraiparraguirre, Rafael Gomezbombarelli, Timothy D Hirzel, Alan Aspuruguzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Proceedings of NeurIPS, 2015. 32. Francois Fouss, Alain Pirotte, Jean-Michel Renders, and Marco Saerens. Random-walk com- putation of similarities between nodes of a graph with application to collaborative recommen- dation. IEEE Transactions on Knowledge and Data Engineering, 19(3):355–369, 2007. 33. Alex Fout, Jonathon Byrd, Basir Shariat, and Asa Ben-Hur. Protein interface prediction using graph convolutional networks. In Proceedings of NeurIPS, 2017. 34. Thomas MJ Fruchterman and Edward M Reingold. Graph drawing by force-directed place- ment. Software: Practice and Experience, 21(11):1129–1164, 1991. 35. Hongyang Gao, Zhengyang Wang, and Shuiwang Ji. Large-scale learnable graph convolu- tional networks. In Proceedings of SIGKDD, 2018. 36. Jonas Gehring, Michael Auli, David Grangier, and Yann N Dauphin. A convolutional encoder model for neural machine translation. In Proceedings of ACL, 2017. 37. Palash Goyal and Emilio Ferrara. Graph embedding techniques, applications, and perfor- mance: A survey. Knowledge-Based Systems, 151:78–94, 2018. 38. Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Pro- ceedings of SIGKDD, 2016. 39. Jiayuan Gu, Han Hu, Liwei Wang, Yichen Wei, and Jifeng Dai. Learning region features for object detection. In Proceedings of ECCV, 2018. 40. Takuo Hamaguchi, Hidekazu Oiwa, Masashi Shimbo, and Yuji Matsumoto. Knowledge trans- fer for out-of-knowledge-base entities : A graph neural network approach. In Proceedings of IJCAI, 2017. 41. Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Proceedings of NeurIPS, 2017. 42. William L Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Meth- ods and applications. IEEE Data(base) Engineering Bulletin, 40:52–74, 2017. 43. David K Hammond, Pierre Vandergheynst, and Remi Gribonval. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150, 2011. 44. Andreas Hanselowski, Hao Zhang, Zile Li, Daniil Sorokin, Benjamin Schiller, Claudia Schulz, and Iryna Gurevych. Ukp-athene: Multi-sentence for claim verification. In Proceedings of EMNLP, 2018. 45. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 
Deep residual learning for image recognition. In Proceedings of CVPR, 2016. 46. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In Proceedings of ECCV, 2016. 47. Mikael Henaff, Joan Bruna, and Yann Lecun. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015. 48. Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

49. Yedid Hoshen. Vain: Attentional multi-agent predictive modeling. In Proceedings of NeurIPS, 2017. 50. Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, and Yichen Wei. Relation networks for object detection. In Proceedings of CVPR, 2018. 51. Wenbing Huang, Tong Zhang, Yu Rong, and Junzhou Huang. Adaptive sampling towards fast graph representation learning. In Proceedings of NeurIPS, 2018. 52. Yann Jacob, Ludovic Denoyer, and Patrick Gallinari. Learning latent representations of nodes for classifying in heterogeneous social networks. In Proceedings of WSDM, 2014. 53. Ashesh Jain, Amir R Zamir, Silvio Savarese, and Ashutosh Saxena. Structural-rnn: Deep learning on spatio-temporal graphs. In Proceedings of CVPR, 2016. 54. Tomihisa Kamada and Satoru Kawai. An algorithm for drawing general undirected graphs. Information Processing Letters, 31(1):7–15, 1989. 55. Michael Kampffmeyer, Yinbo Chen, Xiaodan Liang, Hao Wang, Yujia Zhang, and Eric P Xing. Rethinking knowledge graph propagation for zero-shot learning. In Proceedings of CVPR, 2019. 56. Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, and Patrick Riley. Molecu- lar graph convolutions: moving beyond fingerprints. Journal of computer-aided molecular design, 30(8):595–608, 2016. 57. Elias Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In Proceedings of NeurIPS, 2017. 58. Thomas Kipf, Ethan Fetaya, Kuanchieh Wang, Max Welling, and Richard S Zemel. Neural relational inference for interacting systems. In Proceedings of ICML, 2018. 59. Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In Proceedings of ICLR, 2017. 60. Stephen G Kobourov. Spring embedders and force directed graph drawing algorithms. arXiv preprint arXiv:1201.3011, 2012. 61. WWM Kool and M Welling. Attention solves your tsp. arXiv preprint arXiv:1803.08475, 2018. 62. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Proceedings of NeurIPS, 2012. 63. Theodoros Lappas, Evimaria Terzi, Dimitrios Gunopulos, and Heikki Mannila. Finding effec- tors in social networks. In Proceedings of SIGKDD, 2010. 64. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015. 65. Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998. 66. Chungwei Lee, Wei Fang, Chihkuan Yeh, and Yuchiang Frank Wang. Multi-label zero-shot learning with structured knowledge graphs. In Proceedings of CVPR, 2018. 67. Jure Leskovec, Lada A Adamic, and Bernardo A Huberman. The dynamics of viral marketing. ACM Transactions on the Web (TWEB), 1(1):5, 2007. 68. Jure Leskovec, Lars Backstrom, and Jon Kleinberg. Meme-tracking and the dynamics of the news cycle. In Proceedings of SIGKDD, 2009. 69. Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In Proceedings of NeurIPS, 2014. 70. Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional net- works for semi-supervised learning. In Proceedings of AAAI, 2018. 71. Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. In Proceedings of ICLR, 2018. 72. Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard S Zemel. Gated graph sequence neural networks. In Proceedings of ICLR, 2016. 73. 
Yujia Li, Oriol Vinyals, Chris Dyer, Razvan Pascanu, and Peter Battaglia. Learning deep generative models of graphs. In Proceedings of ICLR, 2018. 74. Xiaodan Liang, Xiaohui Shen, Jiashi Feng, Liang Lin, and Shuicheng Yan. Semantic object parsing with graph lstm. In Proceedings of ECCV, 2016.

75. Tengfei Ma, Jie Chen, and Cao Xiao. Constrained generation of semantically valid graphs via regularizing variational autoencoders. In Proceedings of NeurIPS, 2018. 76. Diego Marcheggiani and Ivan Titov. Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of EMNLP, 2017. 77. Kenneth Marino, Ruslan Salakhutdinov, and Abhinav Gupta. The more you know: Using knowledge graphs for image classification. In Proceedings of CVPR, 2017. 78. Jonathan Masci, Davide Boscaini, Michael Bronstein, and Pierre Vandergheynst. Geodesic convolutional neural networks on riemannian manifolds. In Proceedings of ICCV workshops, 2015. 79. Julian J McAuley and Jure Leskovec. Learning to discover social circles in ego networks. In Proceedings of NeurIPS, 2012. 80. T Mikolov and J Dean. Distributed representations of words and phrases and their composi- tionality. Proceedings of NeurIPS, 2013. 81. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In Proceedings of ICLR, 2013. 82. Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, and Michael M Bronstein. Geometric deep learning on graphs and manifolds using mixture model cnns. In Proceedings of CVPR, 2017. 83. Medhini Narasimhan, Svetlana Lazebnik, and Alexander Gerhard Schwing. Out of the box: Reasoning with graph convolution nets for factual visual question answering. In Proceedings of NeurIPS, 2018. 84. Mark EJ Newman. Finding community structure in networks using the eigenvectors of matri- ces. Physical Review E, 74(3):036104, 2006. 85. Mark EJ Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23):8577–8582, 2006. 86. Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In Proceedings of ICML, 2016. 87. Will Norcliffebrown, Stathis Vafeias, and Sarah Parisot. Learning conditioned graph structures for interpretable visual question answering. In Proceedings of NeurIPS, 2018. 88. Alex Nowak, Soledad Villar, Afonso S Bandeira, and Joan Bruna. Revised note on learning quadratic assignment with graph neural networks. In Proceedings of IEEE DSW 2018, 2018. 89. Mingdong Ou, Peng Cui, Jian Pei, and Wenwu Zhu. Asymmetric transitivity preserving graph embedding. In Proceedings of SIGKDD, 2016. 90. Rasmus Palm, Ulrich Paquet, and Ole Winther. Recurrent relational networks. In Proceedings of NeurIPS, 2018. 91. Hao Peng, Jianxin Li, Yu He, Yaopeng Liu, Mengjiao Bao, Lihong Wang, Yangqiu Song, and Qiang Yang. Large-scale hierarchical text classification with recursively regularized deep graph-cnn. In Proceedings of WWW, 2018. 92. Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, and Wentau Yih. Cross- sentence n-ary relation extraction with graph lstms. Transactions of the Association for Com- putational Linguistics, 5(1):101–115, 2017. 93. Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social rep- resentations. In Proceedings of SIGKDD, 2014. 94. Trang Pham, Truyen Tran, Dinh Phung, and Svetha Venkatesh.Column networks for collective classification. In Proceedings of AAAI, 2017. 95. Siyuan Qi, Wenguan Wang, Baoxiong Jia, Jianbing Shen, and Songchun Zhu. Learning human-object interactions by graph parsing neural networks. In Proceedings of ECCV, 2018. 96. Afshin Rahimi, Trevor Cohn, and Timothy Baldwin. Semi-supervised user geolocation via graph convolutional networks. 
In Proceedings of ACL, 2018. 97. Sungmin Rhee, Seokjun Seo, and Sun Kim. Hybrid approach of relation network and localized graph convolutional filtering for breast cancer subtype classification. In Proceedings of IJCAI, 2018. 98. Sam T Roweis and Lawrence K Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.

99. Alvaro Sanchez-Gonzalez, Nicolas Heess, Jost Tobias Springenberg, Josh Merel, Martin Ried- miller, Raia Hadsell, and Peter Battaglia. Graph networks as learnable physics engines for inference and control. In Proceedings of ICML, 2018. 100. Victor Garcia Satorras and Joan Bruna Estrach. Few-shot learning with graph neural networks. In Proceedings of ICLR, 2018. 101. Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfar- dini. The graph neural network model. IEEE TNN 2009, 20(1):61–80, 2009. 102. Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. In Proceedings of ESWC, 2018. 103. Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi- Rad. Collective classification in network data. AI magazine, 29(3):93–93, 2008. 104. Oleksandr Shchur, Daniel Zugner, Aleksandar Bojchevski, and Stephan Gunnemann. Netgan: Generating graphs via random walks. In Proceedings of ICML, 2018. 105. Chuan Shi, Binbin Hu, Wayne Xin Zhao, and S YuPhilip. Heterogeneous information network embedding for recommendation. IEEE Transactions on Knowledge and Data Engineering, 31(2):357–370, 2018. 106. Martin Simonovsky and Nikos Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proceedings of CVPR, 2017. 107. Sainbayar Sukhbaatar, Rob Fergus, et al. Learning multiagent communication with backprop- agation. In Proceedings of NeurIPS, 2016. 108. Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018. 109. Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved semantic representa- tions from tree-structured long short-term memory networks. In Proceedings of ACL, 2015. 110. Jian Tang, Meng Qu, and Qiaozhu Mei. Pte: Predictive text embedding through large-scale heterogeneous text networks. In Proceedings of SIGKDD, 2015. 111. Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. Line: Large- scale information network embedding. In Proceedings of WWW, 2015. 112. Lei Tang and Huan Liu. Relational learning via latent social dimensions. In Proceedings of SIGKDD, 2009. 113. Lei Tang and Huan Liu. Leveraging social media networks for classification. Data Mining and Knowledge Discovery, 23(3):447–478, 2011. 114. Damien Teney, Lingqiao Liu, and Anton Van Den Hengel. Graph-structured representations for visual question answering. In Proceedings of CVPR, 2017. 115. James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. FEVER: a large-scale dataset for fact extraction and verification. In Proceedings of NAACL-HLT, 2018. 116. Cunchao Tu, Hao Wang, Xiangkai Zeng, Zhiyuan Liu, and Maosong Sun. Community- enhanced network representation learning for network analysis. arXiv preprint arXiv:1611.06645, 2016. 117. Cunchao Tu, Xiangkai Zeng, Hao Wang, Zhengyan Zhang, Zhiyuan Liu, Maosong Sun, Bo Zhang, and Leyu Lin. A unified framework for community detection and network rep- resentation learning. IEEE Transactions on Knowledge and Data Engineering (TKDE), 31(6):1051–1065, 2018. 118. Cunchao Tu, Weicheng Zhang, Zhiyuan Liu, and Maosong Sun. Max-margin deepwalk: Dis- criminative learning of network representation. In Proceedings of IJCAI, 2016. 119. Cunchao Tu, Zhengyan Zhang, Zhiyuan Liu, and Maosong Sun. Transnet: translation-based network representation learning for social relation extraction. In Proceedings of IJCAI, 2017. 120. 
Rianne van den Berg, Thomas N Kipf, and Max Welling. Graph convolutional matrix completion. arXiv preprint arXiv:1706.02263, 2017. 121. Ashish Vaswani, Noam Shazeer, Niki Parmar, Llion Jones, Jakob Uszkoreit, Aidan N Gomez, and Lukasz Kaiser. Attention is all you need. In Proceedings of NeurIPS, 2017. 122. Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. In Proceedings of ICLR, 2018.

123. Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In Proceedings of NeurIPS, 2015. 124. Jacco Wallinga and Peter Teunis. Different epidemic curves for severe acute respiratory syn- drome reveal similar impacts of control measures. American Journal of epidemiology, 2004. 125. Daixin Wang, Peng Cui, and Wenwu Zhu. Structural deep network embedding. In Proceedings of SIGKDD, 2016. 126. Quan Wang, Zhendong Mao, Bin Wang, and Li Guo. Knowledge graph embedding: A survey of approaches and applications. TKDE, 29(12):2724–2743, 2017. 127. Xiao Wang, Peng Cui, Jing Wang, Jian Pei, Wenwu Zhu, and Shiqiang Yang. Community preserving network embedding. In Proceedings of AAAI, 2017. 128. Xiao Wang, Houye Ji, Chuan Shi, Bai Wang, Yanfang Ye, Peng Cui, and Philip S Yu. Het- erogeneous graph attention network. In Proceedings of WWW, 2019. 129. Xiaolong Wang, Yufei Ye, and Abhinav Gupta. Zero-shot recognition via semantic embed- dings and knowledge graphs. In Proceedings of CVPR, 2018. 130. Zhichun Wang, Qingsong Lv, Xiaohan Lan, and Yu Zhang. Cross-lingual knowledge graph alignment via graph convolutional networks. In Proceedings of EMNLP, 2018. 131. Zhouxia Wang, Chen, Jimmy S J Ren, Weihao Yu, Hui Cheng, and Liang Lin. Deep reasoning with knowledge graph for social relationship understanding. In Proceedings of IJCAI, 2018. 132. Nicholas Watters, Daniel Zoran, Theophane Weber, Peter Battaglia, Razvan Pascanu, and Andrea Tacchetti. Visual interaction networks: Learning a physics simulator from video. In Proceedings of NeurIPS, 2017. 133. Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu. A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596, 2019. 134. Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Kenichi Kawarabayashi, and Stefanie Jegelka. Representation learning on graphs with jumping knowledge networks. In Proceedings of ICML, 2018. 135. Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of AAAI, 2018. 136. Cheng Yang, Zhiyuan Liu, Deli Zhao, Maosong Sun, and Edward Y Chang. Network repre- sentation learning with rich text information. In Proceedings of IJCAI, 2015. 137. Cheng Yang, Maosong Sun, Zhiyuan Liu, and Cunchao Tu. Fast network embedding enhance- ment via high order proximity approximation. In Proceedings of IJCAI, 2017. 138. Cheng Yang, Maosong Sun, Wayne Xin Zhao, Zhiyuan Liu, and Edward Y Chang. A neural network approach to jointly modeling social networks and mobile trajectories. ACM Trans- actions on Information Systems (TOIS), 35(4):36, 2017. 139. Cheng Yang, Jian Tang, Maosong Sun, Ganqu Cui, and Liu Zhiyuan. Multi-scale information diffusion prediction with reinforced recurrent networks. In Proceedings of IJCAI, 2019. 140. Jaewon Yang and Jure Leskovec. Overlapping community detection at scale: a nonnegative matrix factorization approach. In Proceedings of WSDM, 2013. 141. Jaewon Yang, Julian McAuley, and Jure Leskovec. Detecting cohesive and 2-mode commu- nities indirected and undirected networks. In Proceedings of WSDM, 2014. 142. Liang Yao, Chengsheng Mao, and Yuan Luo. Graph convolutional networks for text classifi- cation. In Proceedings of AAAI, 2019. 143. Rex Ying, Ruining He, Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. Graph convolutional neural networks for web-scale recommender systems. In Pro- ceedings of SIGKDD, 2018. 144. 
Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure Leskovec. Hierarchical graph representation learning with differentiable pooling. In Proceedings of NeurIPS, 2018. 145. Jiaxuan You, Bowen Liu, Zhitao Ying, Vijay S Pande, and Jure Leskovec. Graph convolutional policy network for goal-directed molecular graph generation. In Proceedings of NeurIPS, 2018.

146. Jiaxuan You, Rex Ying, Xiang Ren, William Hamilton, and Jure Leskovec. Graphrnn: Gen- erating realistic graphs with deep auto-regressive models. In Proceedings of ICML, 2018. 147. Bing Yu, Haoteng Yin, and Zhanxing Zhu. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. In Proceedings of ICLR, 2018. 148. Victoria Zayats and Mari Ostendorf. Conversation modeling on reddit using a graph-structured lstm. Transactions of the Association for Computational Linguistics, 6:121–132, 2018. 149. Daokun Zhang, Jie Yin, Xingquan Zhu, and Chengqi Zhang. Network representation learning: Asurvey.IEEE transactions on Big Data, 2018. 150. Jiani Zhang, Xingjian Shi, Junyuan Xie, Hao Ma, Irwin King, and Dit Yan Yeung. Gaan: Gated attention networks for learning on large and spatiotemporal graphs. In Proceedings of UAI, 2018. 151. Yizhou Zhang, Yun Xiong, Xiangnan Kong, Shanshan Li, Jinhong Mi, and Yangyong Zhu. Deep collective classification in heterogeneous information networks. In Proceedings of WWW, 2018. 152. Yue Zhang, Qi Liu, and Linfeng Song. Sentence-state lstm for text representation. In Pro- ceedings of ACL, 2018. 153. Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Maosong Sun, Zhichong Fang, Bo Zhang, and Leyu Lin. Cosine: Compressive network embedding on large-scale information networks. arXiv preprint arXiv:1812.08972, 2018. 154. Ziwei Zhang, Peng Cui, and Wenwu Zhu. Deep learning on graphs: A survey. arXiv preprint arXiv:1812.04202, 2018. 155. Jie Zhou, Ganqu Cui, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. Graph neural networks: A review of methods and applications. arXiv preprint arXiv:1812.08434, 2018. 156. Jie Zhou, Xu Han, Cheng Yang,Zhiyuan Liu, Lifeng Wang,Changcheng Li, and Maosong Sun. GEAR: Graph-based evidence aggregating and reasoning for fact verification. In Proceedings of ACL 2019, 2019. 157. Zhaocheng Zhu, Shizhen Xu, Jian Tang, and Meng Qu. Graphvite: A high-performance cpu- gpu hybrid system for node embedding. In Proceedings of WWW, 2019. 158. Chenyi Zhuang and Qiang Ma. Dual graph convolutional networks for graph-based semi- supervised classification. In Proceedings of WWW, 2018. 159. Julian G Zilly, Rupesh Kumar Srivastava, Jan Koutnik, and Jurgen Schmidhuber. Recurrent highway networks. In Proceedings of ICML, 2016. 160. Marinka Zitnik, Monica Agrawal, and Jure Leskovec. Modeling polypharmacy side effects with graph convolutional networks. Intelligent Systems in Molecular Biology, 34(13):258814, 2018.

Chapter 9 Cross-Modal Representation

Abstract Cross-modal representation learning is an essential part of representation learning, which aims to learn latent semantic representations for modalities including texts, audio, images, videos, etc. In this chapter, we first introduce typical cross-modal representation models. After that, we review several real-world applications related to cross-modal representation learning including image captioning, visual relation detection, and visual question answering.

9.1 Introduction

As introduced in Wikipedia, a modality is the classification of a single independent channel of sensory input/output between a computer and a human. To be more general, modalities are different means of information exchange between human beings and the real world. The classification is usually based on the form in which information is presented to a human. Typical modalities in the real world include texts, audio, images, videos, etc. Cross-modal representation learning is an important part of representation learning. In fact, artificial intelligence is inherently a multi-modal task [30]. Human beings are exposed to multi-modal information every day, and it is normal for us to integrate information from different modalities and make comprehensive judgments. Furthermore, different modalities are not independent; they are more or less correlated. For example, the judgment of a syllable is made not only from the sound we hear but also from the movement of the lips and tongue of the speaker we see. An experiment in [48] shows that a voiced /ba/ with a visual /ga/ is perceived by most people as a /da/. Another example is human beings' ability to consider the 2D image and the 3D scan of the same object together and reconstruct its structure: correlations between the image and the scan can be found based on the fact that a discontinuity of depth in the scan usually indicates a sharp line in the image [52]. Inspired by this, it is natural for us to consider the possibility of combining inputs from multiple modalities in our artificial intelligence systems and generating cross-modal representations.

Ngiam et al. [52] explore the possibility of merging multiple modalities into one learning task. The authors divide a typical machine learning task into three stages: feature learning, supervised learning, and prediction. They further propose four kinds of learning settings for multiple modalities: (1) Single-modal learning: all stages are done on just one modality. (2) Multi-modal fusion: all stages are done with all modalities available. (3) Cross-modal learning: in the feature learning stage, all modalities are available, but in supervised learning and prediction, only one modality is used. (4) Shared representation learning: in the feature learning stage, all modalities are available; in supervised learning, only one modality is used, and in prediction, a different modality is used.

Experiments show promising results for these multi-modal tasks. When more modalities are provided (as in multi-modal fusion, cross-modal learning, and shared representation learning), the performance of the system is generally better. In the following part of this chapter, we will first introduce cross-modal representation models, which are fundamental parts of cross-modal representation learning in NLP. Then, we will introduce several critical applications, such as image captioning, visual relationship detection, and visual question answering.

9.2 Cross-Modal Representation

Cross-modal representation learning aims to build embeddings using information from multiple modalities. Existing cross-modal representation models involving the text modality can generally be divided into two categories: (1) models that fuse information from different modalities into unified embeddings (e.g., visually grounded word representations) [30, 77]; (2) models that build embeddings for the different modalities in a common semantic space, which allows the model to compute cross-modal similarity. Such cross-modal similarity can be further utilized for downstream tasks, such as zero-shot recognition [5, 14, 18, 53, 65] and cross-media retrieval [23, 55]. In this section, we will introduce these two kinds of cross-modal representation models, respectively.

9.2.1 Visual Word2vec

Computing word embeddings is a fundamental task in representation learning for natural language processing. Typical word embedding models (like Word2vec [49]) are trained on a text corpus. These models, while being extremely successful, cannot discover implicit semantic relatedness between words that could be expressed in other modalities. Kottur et al. [30] provide an example: even though eat and stare at seem unrelated from text, images might show that when people are eating something, they also tend to stare at it. This implies that considering other modalities when constructing word embeddings may help capture more implicit semantic relatedness.

Fig. 9.1 The architecture for word embedding with global visual context

Vision, being one of the most critical modalities, has attracted attention from researchers seeking to improve word representation. Several models that incorporate visual information to improve word embeddings have been proposed. We introduce two typical word representation models incorporating visual information in the following.

9.2.1.1 Word Embedding with Global Visual Context

Xu et al. [77] propose a model that makes a natural attempt to incorporate visual features. It claims that most word representation models consider only local context information (e.g., trying to predict a word using neighboring words and phrases), while global text information (e.g., the topic of the passage) is often neglected. This model extends a simple local context model by using visual information as global features (see Fig. 9.1). The input of the model is an image I and a word sequence describing it. It is based on a simple local context language model: when we consider a certain word w_t in a sequence, its local feature is the average of the embeddings of the words in a window, i.e., {w_{t−k}, ..., w_{t−1}, w_{t+1}, ..., w_{t+k}}. The visual feature is computed directly from the image I using a CNN and then used as the global feature. The local feature and the global feature are then concatenated into a vector f. The predicted probability of the word w_t (filling this blank) is the softmax-normalized product of f and the word embedding w_t:

o_{w_t} = w_t^T f, (9.1)

P(w_t | w_{t−k}, ..., w_{t−1}, w_{t+1}, ..., w_{t+k}; I) = exp(o_{w_t}) / Σ_i exp(o_{w_i}). (9.2)

The model is optimized by maximizing the average log probability:

L = (1/T) Σ_{t=k}^{T−k} log P(w_t | w_{t−k}, ..., w_{t−1}, w_{t+1}, ..., w_{t+k}; I). (9.3)

The classification error will be back-propagated to the local text vectors (i.e., word embeddings), the visual vector, and all model parameters. This accomplishes joint learning of a set of word embeddings, a language model, and the model used for visual encoding.
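To make this concrete, below is a minimal PyTorch sketch of such a model. It only illustrates Eqs. (9.1)–(9.2); the class name, the dimensions, and the choice of letting the output word embeddings live in the concatenated local-plus-visual space are illustrative assumptions rather than details specified in [77].

```python
import torch
import torch.nn as nn

class GlobalVisualContextSketch(nn.Module):
    """Sketch of a CBOW-style language model with a global visual feature (Eqs. 9.1-9.2)."""
    def __init__(self, vocab_size, embed_dim, visual_dim):
        super().__init__()
        self.input_embed = nn.Embedding(vocab_size, embed_dim)
        # Output embeddings w_t used for the dot-product score o_{w_t} = w_t^T f.
        self.output_embed = nn.Embedding(vocab_size, embed_dim + visual_dim)

    def forward(self, context_ids, visual_feat):
        # context_ids: (batch, 2k) word ids in the window around the blank
        # visual_feat: (batch, visual_dim) CNN feature of the image I (assumed precomputed)
        local = self.input_embed(context_ids).mean(dim=1)   # local context feature
        f = torch.cat([local, visual_feat], dim=-1)         # f = [local; global visual]
        return f @ self.output_embed.weight.t()             # scores o_w for every word

# Training: feed the scores to a cross-entropy loss over the target word, which realizes
# the softmax of Eq. (9.2) and the objective of Eq. (9.3).
```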

9.2.1.2 Word Embedding with Abstract Visual Scene

Kottur et al. [30] also propose a neural model to capture fine-grained semantics from visual information. Instead of focusing on literal pixels, the abstract scene behind the vision is considered. The model takes a pair of a visual scene and a related word sequence (I, w) as input. At each training step, a window is applied to the word sequence w, forming a subsequence S_w. All the words in S_w are fed into the input layer using one-hot encoding, so the dimension of the input layer is |V|, the size of the vocabulary. The words are then transformed into their embeddings, and the hidden layer is the average of all these embeddings. The size of the hidden layer is N_H, which is also the dimension of the word embeddings. The hidden layer and the output layer are connected by a fully connected matrix of dimension N_H × N_K and a softmax function. The output layer can be regarded as a probability distribution over a discrete-valued function g(·) of the visual scene I (details will be given in the following paragraph). The entire model is optimized by minimizing the objective function:

L = −log P(g(I) | S_w). (9.4)

The most important part of the model is the function g(·). It maps the visual scene I into the set {1, 2, ..., N_K}, which indicates what kind of abstract scene it is. In practice, it is learned offline using K-means clustering, and each cluster represents the semantics of one kind of visual scene and, consequently, of the word sequence w, which is designed to be related to the scene.
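As a rough illustration (not the authors' implementation), g(·) can be obtained offline by running K-means over scene feature vectors and taking the cluster index as the prediction target of Eq. (9.4); the feature representation and the number of clusters below are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_scene_clusters(scene_features: np.ndarray, num_clusters: int = 100) -> KMeans:
    """scene_features: (num_scenes, feature_dim) array of abstract-scene features."""
    return KMeans(n_clusters=num_clusters, random_state=0).fit(scene_features)

def g(kmeans: KMeans, scene_feature: np.ndarray) -> int:
    # g(I): the cluster id of the scene, used as the target class in Eq. (9.4).
    return int(kmeans.predict(scene_feature.reshape(1, -1))[0])
```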

9.2.2 Cross-Modal Representation for Zero-Shot Recognition

Large-scale datasets partially support the success of deep learning methods. Even though the scales of datasets continue to grow larger and more categories are involved, the annotation of datasets is expensive and time-consuming. For many categories, there are very limited or even no instances, which restricts the scalability of recognition systems.

Zero-shot recognition, which aims to classify instances of categories that have not been seen during training, is proposed to address the problem mentioned above. Many works propose to utilize cross-modal representations for zero-shot image classification [5, 14, 18, 53, 65]. Specifically, image representations and category representations are embedded into a common semantic space, where similarities between image and category representations can serve for further classification. For example, in such a common semantic space, the embedding of an image of a cat is expected to be closer to the embedding of the category cat than to the embedding of the category truck.

9.2.2.1 Deep Visual-Semantic Embedding

The challenge of zero-shot learning lies in the absence of instances of unseen categories, which makes it difficult to obtain well-performed classifiers for unseen categories. Frome et al. [18] present a model that utilizes both labeled images and information from large-scale plain text for zero-shot image classification. They try to leverage semantic information from word embeddings and transfer it to image classification systems. Their model is motivated by the fact that word embeddings incorporate semantic information of concepts or categories, which can potentially be utilized as classifiers of the corresponding categories. Similar categories cluster well in the semantic space. For example, in the word embedding space, the nearest neighbors of the term tiger shark are similar kinds of sharks, such as bull shark, blacktip shark, sandbar shark, and oceanic whitetip shark. In addition, boundaries between different clusters are clear. These properties indicate that word embeddings can be further utilized as classifiers for recognition systems. Specifically, the model first pretrains word embeddings using the Skip-gram text model on large-scale Wikipedia articles. For visual feature extraction, the model pretrains a deep convolutional neural network for 1,000 object categories on ImageNet. The pretrained word embeddings and the convolutional neural network are used to initialize the proposed Deep Visual-Semantic Embedding model (DeViSE). To train the proposed model, they replace the softmax layer of the pretrained convolutional neural network with a linear projection layer. The model is trained to predict the word embeddings of categories for images using a hinge ranking loss:

L(I, y) = Σ_{j≠y} max[0, γ − w_y M I + w_j M I], (9.5)

where w_y and w_j are the learned word embeddings of the positive label and a sampled negative label, respectively, I denotes the feature of the image obtained from the convolutional neural network, M contains the trainable parameters of the linear projection layer, and γ is a hyperparameter of the hinge ranking loss. Given an image, the objective requires the model to produce a higher score for the correct label than for randomly chosen labels, where the score is defined as the dot product of the projected image feature and the word embedding of the label.

At test time, given a test image, the score of each possible category is obtained using the same approach as during training. Note that a crucial difference at test time is that the classifiers (word embeddings) are expanded to all possible categories, including unseen categories. Thus the model is capable of predicting unseen categories. Experiment results show that DeViSE can make zero-shot predictions with more semantically reasonable errors, which means that even if the prediction is not exactly correct, it is semantically related to the ground truth class. However, a drawback is that although the model can utilize the semantic information in word embeddings to perform zero-shot image classification, using word embeddings as classifiers restricts the flexibility of the model, which results in inferior performance on the original 1,000 categories compared to the original softmax classifier.
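The following is a minimal sketch of the DeViSE ranking objective in Eq. (9.5), written in PyTorch. The margin value, the negative-sampling scheme, and the tensor shapes are illustrative assumptions rather than the exact training setup of [18].

```python
import torch
import torch.nn.functional as F

def devise_hinge_loss(image_feat, M, pos_emb, neg_embs, gamma=0.1):
    """image_feat: (batch, d_img) CNN features I;  M: (d_word, d_img) projection matrix;
    pos_emb: (batch, d_word) embeddings w_y of the true labels;
    neg_embs: (batch, n_neg, d_word) embeddings w_j of sampled negative labels."""
    projected = image_feat @ M.t()                                        # M I
    pos_score = (pos_emb * projected).sum(-1, keepdim=True)               # w_y M I
    neg_score = torch.bmm(neg_embs, projected.unsqueeze(-1)).squeeze(-1)  # w_j M I
    return F.relu(gamma - pos_score + neg_score).sum(-1).mean()           # Eq. (9.5)
```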

9.2.2.2 Convex Combination of Semantic Embeddings

Inspired by DeViSE, [53] proposes a model named ConSE that also tries to utilize semantic information from word embeddings for zero-shot classification. A vital difference from DeViSE is that ConSE obtains the semantic embedding of a test image as a convex combination of the word embeddings of seen categories, where the weights of the combined word embeddings are determined by the prediction scores of the corresponding categories. Specifically, they train a deep convolutional neural network on seen categories. At test time, given a test image I (possibly from unseen categories), they obtain the top T confident predictions over seen categories, where T is a hyperparameter. Then the semantic embedding f(I) of I is determined by the convex combination of the semantic embeddings of the top T confident categories, which can be formally defined as follows:

f(I) = (1/Z) Σ_{t=1}^{T} P(ŷ_0(I, t) | I) · w(ŷ_0(I, t)), (9.6)

where ŷ_0(I, t) is the tth most confident training label for I, w(ŷ_0(I, t)) is the semantic embedding (word embedding) of ŷ_0(I, t), and Z is a normalization factor given by

Z = Σ_{t=1}^{T} P(ŷ_0(I, t) | I). (9.7)

After obtaining the semantic embedding f(I), the score of category m is given by the cosine similarity of f(I) and w(m). The motivation of ConSE is the assumption that novel categories can be modeled as convex combinations of seen categories. If the model is highly confident about a prediction (i.e., P(ŷ_0(I, 1)|I) ≈ 1), the semantic embedding f(I) will be close to w(ŷ_0(I, 1)). If the predictions are ambiguous (e.g., P(tiger|I) = 0.5, P(lion|I) = 0.5), the semantic embedding f(I) will lie between w(lion) and w(tiger), and they expect the semantic embedding f(I) = 0.5 w(lion) + 0.5 w(tiger) to be close to the semantic embedding w(liger) (a hybrid cross between lions and tigers). Although ConSE and DeViSE share many similarities, there are also some crucial differences. DeViSE replaces the softmax layer of the pretrained visual model with a projection layer, while ConSE preserves the softmax layer. ConSE does not need to be further trained and uses a convex combination of semantic embeddings to perform zero-shot classification at test time. Experiment results show that ConSE outperforms DeViSE on unseen categories, indicating better generalization capability. However, the performance of ConSE on seen categories is not as competitive as that of DeViSE and the original softmax classifier.
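A minimal sketch of the ConSE inference step (Eqs. 9.6–9.7) is given below; the tensor shapes and the top-T value are illustrative.

```python
import torch.nn.functional as F

def conse_embedding(probs, seen_label_embs, top_t=10):
    """probs: (batch, n_seen) softmax scores over seen categories;
    seen_label_embs: (n_seen, d_word) word embeddings of the seen categories."""
    top_p, top_idx = probs.topk(top_t, dim=-1)            # top-T confident predictions
    weights = top_p / top_p.sum(dim=-1, keepdim=True)     # normalize by Z (Eq. 9.7)
    return (weights.unsqueeze(-1) * seen_label_embs[top_idx]).sum(dim=1)  # f(I), Eq. (9.6)

def conse_scores(f_I, all_label_embs):
    # Cosine similarity between f(I) and every (seen or unseen) category embedding.
    return F.cosine_similarity(f_I.unsqueeze(1), all_label_embs.unsqueeze(0), dim=-1)
```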

9.2.2.3 Cross-Modal Transfer

Socher et al. [65] present a cross-modal representation model for zero-shot recognition. In their model, all word vectors are initialized with pretrained 50-dimensional word vectors and are kept fixed during training. Each image is represented by a vector I constructed by a deep convolutional neural network. They first project images into the semantic word space by minimizing

L(Θ) = Σ_{y∈Y_s} Σ_{I^{(i)}∈X_y} ||w_y − θ^{(2)} f(θ^{(1)} I^{(i)})||², (9.8)

where Y_s denotes the set of image classes that can be seen in the training data, X_y denotes the set of image vectors of class y, w_y denotes the word vector of class y, and Θ = (θ^{(1)}, θ^{(2)}) denotes the parameters of the two-layer neural network with f(·) = tanh(·) as the activation function. They observe that instances from unseen categories are usually outliers of the complete data manifold. Following this observation, they first classify an instance into the seen or unseen categories via outlier detection methods. Then the instance is classified using the corresponding classifier. Formally, they marginalize over a binary random variable V ∈ {s, u} which denotes whether an instance belongs to the seen or unseen categories, which means the probability is given as

P(y|I) = Σ_{V∈{s,u}} P(y|V, I) P(V|I). (9.9)

For seen image classes, they simply use a softmax classifier to determine P(y|s, I), while for unseen classes, they assume an isometric Gaussian distribution around each of the novel class word vectors and assign classes based on their likelihood. To detect novelty, they calculate a Local Outlier Probability using the Gaussian error function.

9.2.3 Cross-Modal Representation for Cross-Media Retrieval

Learning cross-modal representations for different modalities in a common semantic space allows one to easily compute cross-modal similarities, which can facilitate many important cross-modal tasks, such as cross-media retrieval. With the rapid growth of multimedia data such as text, image, video, and audio on the Internet, the need to retrieve information across different modalities has become stronger. Cross-media retrieval is an important task in the multimedia area, which aims to perform retrieval across different modalities such as text and image. For example, a user may submit an image of a white horse and retrieve relevant information from different modalities, such as textual descriptions of horses, and vice versa. A significant challenge of cross-modal retrieval is the domain discrepancy between different modalities. Besides, for a specific area of interest, cross-modal data can be insufficient, which limits the performance of existing cross-modal retrieval methods. Many works have focused on the challenges mentioned above in cross-modal retrieval [23, 24].

9.2.3.1 Cross-Modal Hybrid Transfer Network

Huang et al. [24] present a framework that tries to relieve the cross-modal data sparsity problem by transfer learning. They propose to leverage knowledge from a large-scale single-modal dataset to boost model training on the small-scale dataset. The massive auxiliary dataset is denoted as the source domain, and the small-scale dataset of interest is denoted as the target domain. In their work, they adopt ImageNet [12], a large-scale image database, as the source domain. Formally, the training set consists of data from the source domain Src = {I_s^p, y_s^p}_{p=1}^{P} and the target domain Tar_tr = {(I_t^j, t_t^j), y_t^j}_{j=1}^{J}, where (I, t) denotes an image/text pair with label y. Similarly, the test set can be denoted as Tar_te = {(I_t^m, t_t^m), y_t^m}_{m=1}^{M}. The goal of their model is to transfer knowledge from Src to boost the model performance on Tar_te for cross-media retrieval. Their model consists of a modal-sharing transfer subnetwork and a layer-sharing correlation subnetwork. In the modal-sharing transfer subnetwork, they adopt the convolutional layers of AlexNet [32] to extract image features for the source and target domains, and use word vectors to obtain text features. The image and text features pass through two fully connected layers, where single-modal and cross-modal knowledge transfer are performed. Single-modal knowledge transfer aims to transfer knowledge from images in the source domain to images in the target domain. The main challenge is the domain discrepancy between the two image datasets. They propose to solve the domain discrepancy problem by minimizing the Maximum Mean Discrepancy (MMD) of the image modality between the source and target domains. MMD is calculated in a layer-wise style in the fully connected layers. By minimizing MMD in a reproducing kernel Hilbert space, the image representations from the source and target domains are encouraged to have the same distribution, so knowledge from images in the source domain is expected to transfer to images in the target domain. Besides, the image encoder in the source domain is also fine-tuned by optimizing a softmax loss on labeled image instances. Cross-modal knowledge transfer aims to transfer knowledge between image and text in the target domain. Text and image representations from an annotated pair in the target domain are encouraged to be close to each other by minimizing their Euclidean distance. The cross-modal transfer loss of image and text representations is also computed in a layer-wise style in the fully connected layers. The domain discrepancy between image and text modalities is expected to be reduced in high-level layers. In the layer-sharing correlation subnetwork, representations from the modal-sharing transfer subnetwork in the target domain are fed into shared fully connected layers to obtain the final common representation for both image and text. As the parameters are shared between the two modalities, the last two fully connected layers are expected to capture the cross-modal correlation. Their model also utilizes label information in the target domain by minimizing a softmax loss on labeled image/text pairs. After obtaining the final common representations, cross-media retrieval can be achieved by simply computing nearest neighbors in the semantic space.
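The central ingredient of the single-modal transfer, the MMD term, can be sketched as follows; the RBF kernel and bandwidth are illustrative choices, and in the original framework such a loss is computed layer-wise over the fully connected layers.

```python
import torch

def rbf_mmd(x, y, sigma=1.0):
    """A (biased) Maximum Mean Discrepancy estimate with an RBF kernel.
    x: (n, d) source-domain features; y: (m, d) target-domain features."""
    def rbf(a, b):
        dist = torch.cdist(a, b) ** 2                  # pairwise squared distances
        return torch.exp(-dist / (2 * sigma ** 2))
    return rbf(x, x).mean() + rbf(y, y).mean() - 2 * rbf(x, y).mean()

# Usage: adding rbf_mmd(src_image_feat, tgt_image_feat) to the training loss pulls the
# two image feature distributions together, i.e., the single-modal knowledge transfer above.
```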

9.2.3.2 Deep Cross-Media Knowledge Transfer

Huang and Peng [23], as an extension of [24], also focus on dealing with domain discrepancy and insufficient cross-modal data for cross-media retrieval in specific areas. They present a framework that transfers knowledge from a large-scale cross-media dataset (source domain) to boost the model performance on another small-scale cross-media dataset (target domain). A crucial difference from [24] is that the dataset in the source domain also consists of image/text pairs with label annotations, instead of the single-modal setting in [24]. Since both domains contain image and text media types, domain discrepancy comes from the media-level discrepancy within the same media type and the correlation-level discrepancy in image/text correlation patterns between different domains. They propose to transfer intra-media semantic and inter-media correlation knowledge by jointly reducing domain discrepancies on the media level and the correlation level. To extract distributed features for different media types, they adopt VGG19 [63] as the image encoder and Word CNN [29] as the text encoder. The two domains have the same architecture but do not share parameters. The extracted image/text features pass through two fully connected layers, respectively, where the media-level transfer is performed. Similar to [24], they reduce domain discrepancies within the same modalities by minimizing the Maximum Mean Discrepancy (MMD) between the source and target domains. The MMD is computed in a layer-wise style to transfer knowledge within the same modalities. They also minimize the Euclidean distance between image/text representation pairs in both the source and target domains to preserve the semantic information across modalities.

Correlation-level transfer aims to reduce the domain discrepancy in image/text correlation patterns between different domains. In both domains, the image and text representations share the last two fully connected layers to obtain the common representation for each domain. They optimize a layer-wise MMD loss between the shared fully connected layers in different domains for correlation-level knowledge transfer, which encourages the source and target domains to have the same image/text correlation patterns. Finally, both domains are trained with the label information of image/text pairs. Note that the source domain and target domain do not necessarily share the same label set. In addition, they propose a progressive transfer mechanism, which is a curriculum learning method aiming to promote the robustness of model training. This is achieved by selecting easy samples for model training in the early period and gradually increasing the difficulty during training. The difficulty of training samples is measured according to the bidirectional cross-media retrieval consistency.

9.3 Image Captioning

Image captioning is the task of automatically generating natural language descriptions for images. It is a fundamental task in artificial intelligence, which connects natural language processing and computer vision. Compared with other computer vision tasks, such as image classification and object detection, image captioning is significantly harder for two reasons: first, not only the objects but also the relationships between them have to be detected; second, besides basic judgments and classification, natural language sentences have to be generated. Traditional methods for image captioning usually use retrieval models or generation models, whose ability to generalize is comparatively weaker than that of recent deep neural network models. In this section, we will introduce several typical models of both genres.

9.3.1 Retrieval Models for Image Captioning

The primary pipeline of retrieval models is: (1) represent images and/or sentences using specific features; (2) for new images and/or sentences, search for probable candidates according to the similarity of features. Linking words to images has a rich history, and [50] (a retrieval model) is the first image annotation system. This paper tries to build a keyword-assignment system for images from labeled data. The pipeline is as follows:

(1) Image segmentation. Every image is divided into several parts, using the simplest rectangular division. The reason for doing so is that an image is typically annotated with multiple labels, each of which often corresponds to only a part of it. Segmentation helps reduce noise in labeling.

(2) Feature extraction. Features of every part of the image are extracted.

(3) Clustering. Feature vectors of image segments are divided into several clusters. Each cluster accumulates word frequencies and thereby calculates word likelihoods. Concretely,

P(w_i | c_j) = P(c_j | w_i) P(w_i) / Σ_k P(c_j | w_k) P(w_k) = n_{ji} / N_j, (9.10)

where n_{ji} is the number of times word w_i appears in cluster j, and N_j is the number of times that all words appear in cluster j. The calculation uses frequencies as probabilities.

(4) Inference. For a new image, the model divides it into segments, extracts features for every part, and finally aggregates the keywords assigned to every part to obtain the final prediction.

The key idea of this model is image segmentation. Take a picture with two parts, mountain and sky, for instance: both parts will be annotated with both labels. However, if another picture has two parts mountain and river, the two mountain parts would hopefully fall into the same cluster, revealing that they share the same label mountain. In this way, labels can be assigned to the correct part of the image, and noise can be alleviated. Another typical retrieval model is proposed by [17], which can assign a linking score between an image and a sentence. This score is calculated through an intermediate space of meaning. The representation of the meaning space is a triple in the form of ⟨object, action, scene⟩. Each slot of the triple has a finite discrete candidate set. The problem of mapping images and sentences into the meaning space involves solving a Markov random field. Different from the previous model, this system can not only generate image captions but also do the inverse, that is, given a sentence, the model provides certain probable associated images. At the inference stage, the image (sentence) is first mapped to the intermediate meaning space, and then we search in the pool for the sentence (image) that has the best matching score. After that, researchers also proposed many retrieval models that consider different characteristics of the images, such as [21, 28, 34].
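A tiny sketch of step (3) is shown below: counting word occurrences per cluster and turning them into the relative frequencies of Eq. (9.10). The data layout is an illustrative assumption.

```python
from collections import Counter, defaultdict

def word_likelihoods(segment_clusters, segment_labels):
    """segment_clusters: one cluster id per image segment;
    segment_labels: one list of keywords (the image's labels) per image segment."""
    counts = defaultdict(Counter)
    for cluster_id, labels in zip(segment_clusters, segment_labels):
        counts[cluster_id].update(labels)                                # accumulate n_ji
    return {
        cid: {w: n / sum(counter.values()) for w, n in counter.items()}  # n_ji / N_j
        for cid, counter in counts.items()
    }

# Example: word_likelihoods([0, 0, 1], [["mountain", "sky"], ["mountain", "river"], ["sky"]])
```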

9.3.2 Generation Models for Image Captioning

Different from retrieval-based models, the basic pipeline of generation models is: (1) use computer vision techniques to extract image features; (2) generate sentences from these features using methods such as language models or sentence templates. Kulkarni et al. [33] propose a system that makes a tight connection between the particular image and the sentence generation process. The model uses visual detectors to detect specific objects, as well as attributes of single objects and relationships between multiple objects. Then it constructs a conditional random field to incorporate unary image potentials and higher order text potentials and thereby predicts labels for the image. The labels predicted by the conditional random field (CRF) are arranged as a triple, e.g., ⟨⟨white, cloud⟩, in, ⟨blue, sky⟩⟩. Then sentences are generated according to the labels. There are two ways to build a sentence based on the triple skeleton. (1) The first is to use an n-gram language model. For example, when trying to decide whether or not to put a glue word x between a pair of meaningful words (which means they are inside the triple) a and b, the probabilities p̂(axb) and p̂(ab) are compared for the decision. p̂ is the standard length-normalized probability of the n-gram language model. (2) The second is to use a set of descriptive language templates, which alleviates the problem of grammar mistakes made by the language model. Further, [16] proposes a novel framework to explicitly represent the relationship between the image structure and its caption sentence's structure. The method, Visual Dependency Representation, detects objects in the image and detects the relationships between these objects based on the proposed Visual Dependency Grammar, which includes eight typical relations such as beside or above. Then the image can be arranged as a dependency graph, where nodes are objects and edges are relations. This image dependency graph can be aligned with the syntactic dependency representation of the caption sentence. The paper further provides four templates for generating descriptive sentences from the extracted dependency representation. Besides these two typical works, there are many other generation models for image captioning, such as [15, 35, 78].

9.3.3 Neural Models for Image Captioning

In [33], it was claimed in 2011 that, for image captioning tasks, natural language generation still remained an open research problem, and most previous work was based on retrieval and summarization. From 2015, inspired by advances in neural language models and neural machine translation, a number of end-to-end neural image captioning models based on the encoder-decoder framework have been proposed. These new models significantly improve the ability to generate natural language descriptions.

9.3.3.1 The Basic Model

Traditional machine translation models typically stitch many subtasks together, such as individual word translation and reordering, to perform sentence and paragraph translation. Recent neural machine translation models, such as [8], use a single encoder-decoder model, which can be conveniently optimized by stochastic gradient descent. The task of image captioning is inherently analogous to machine translation because it can also be regarded as a translation task, where the source "language" is an image. The encoders and decoders used for machine translation are typically RNNs, which are a natural choice for sequences of words. For image captioning, a CNN is chosen to be the encoder, and an RNN is still used as the decoder.

Fig. 9.2 The architecture of encoder-decoder framework for image captioning

Vinyals et al. [70] propose the most typical model that uses the encoder-decoder framework for image captioning (see Fig. 9.2). Concretely, a CNN model is used to encode the image into a fixed-length vector, which is believed to contain the necessary information for captioning. With this vector, an RNN language model is used to generate natural language descriptions; this is the decoder. Here, the decoder is similar to the LSTM used for machine translation. The first unit takes the image vector as the input vector, and the other units take the previous word embedding as input. Each unit outputs a vector o and passes a hidden vector to the next unit. o is further fed into a softmax layer, whose output p is the probability of each word in the vocabulary. The ways to deal with these calculated probabilities are different in training and testing. Training. These probabilities p are used to calculate the likelihood of the provided description sentences. Considering the nature of RNNs, it is easy to decompose the joint probability into conditional probabilities:

log P(s|I) = Σ_{t=0}^{N} log P(w_t | I, w_0, ..., w_{t−1}), (9.11)

where s = {w_1, w_2, ..., w_N} is the sentence with its words, w_0 is a special START token, and I is the image. Stochastic gradient descent can thereby be performed to optimize the model. Testing. There are multiple approaches to generate sentences given an image. The first one is called Sampling. At each step, the single word with the highest probability in p is chosen and used as the input of the next unit, until the END token is generated or a maximal length is reached. The second one is called Beam Search. At each step (when the length of the partial sentences is t), the k best sentences are kept. Each of them generates several new sentences of length t + 1, and again, only k new

sentences are kept. Beam Search provides a better approximation for

s* = argmax_s log P(s|I). (9.12)

Fig. 9.3 An example of image captioning with attention mechanism
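Below is a minimal PyTorch sketch of this basic encoder-decoder captioner. The precomputed CNN feature, the single-layer LSTM, and the dimensions are illustrative simplifications, not the exact configuration of [70].

```python
import torch
import torch.nn as nn

class ShowAndTellSketch(nn.Module):
    """Sketch of a CNN-encoder + LSTM-decoder captioner trained with Eq. (9.11)."""
    def __init__(self, vocab_size, embed_dim, cnn_dim):
        super().__init__()
        self.img_proj = nn.Linear(cnn_dim, embed_dim)   # map CNN feature to the input space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, embed_dim, batch_first=True)
        self.out = nn.Linear(embed_dim, vocab_size)     # softmax layer producing p

    def forward(self, cnn_feat, caption_ids):
        # cnn_feat: (batch, cnn_dim); caption_ids: (batch, len), starting with the START token
        img = self.img_proj(cnn_feat).unsqueeze(1)      # the image acts as the first input step
        words = self.embed(caption_ids)
        hidden, _ = self.lstm(torch.cat([img, words], dim=1))
        return self.out(hidden)                         # logits for each decoding position

# Training: cross-entropy between the logits and the shifted target words maximizes Eq. (9.11);
# at test time, greedy sampling or beam search over these logits generates the caption.
```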

9.3.3.2 Variants of the Basic Model

The research on image captioning tightly follows that on machine translation. Inspired by [6], which uses the attention mechanism in machine translation, [76] introduces visual attention into the encoder-decoder image captioning model. The major bottleneck of [70] is the fact that information from the image is shown to the LSTM decoder only at the first decoding unit, which actually requires the encoder to squeeze all useful information into one fixed-length vector. In contrast, [76] does not require such compression. The CNN encoder does not produce one vector for the entire image; instead, it produces L region vectors I_i, each of which is the representation of a part of the image. At every step of decoding, the inputs include the standard LSTM inputs (i.e., the output and hidden state of the last step, o_{t−1} and h_{t−1}) and an input vector z from the encoder. Here, z is the weighted sum of the image vectors I_i: z = Σ_i α_i I_i, where α_i is the weight computed from I_i and h_{t−1}. Throughout the training process, the model learns to focus on parts of the image for generating the next word by producing larger weights α on more relevant parts, as shown in Fig. 9.3 (the example is obtained from the implementation of Yunjey Choi, https://github.com/yunjey/show-attend-and-tell). While the above paper uses soft attention over the image, [27] makes an explicit alignment between image fragments and sentence fragments before generating a description for the image. In the first stage, the alignment stage, sentence and image fragments are aligned by being mapped to a shared space. Concretely, sentence fragments (i.e., n consecutive words) are encoded using a bidirectional LSTM into the embeddings s, and image fragments (i.e., parts of the image, and also the entire image) are encoded using a CNN into the embeddings I. The similarity score between image I and sentence s is computed as

sim(I, s) = Σ_{t∈g_s} max_{i∈g_I} max(0, I_i^T s_t), (9.13)

where g_s is the sentence fragment set of sentence s, and g_I is the image fragment set of image I. The alignment is then optimized by minimizing a ranking loss L over both sentences and images:

L = Σ_I Σ_{s'} max(0, sim(I, s') − sim(I, s) + 1) + Σ_s Σ_{I'} max(0, sim(I', s) − sim(I, s) + 1), (9.14)

where (I, s) denotes a ground-truth image-sentence pair, and s' and I' range over the other sentences and images.

The assumption behind this alignment procedure is similar to [50] (see Sect. 9.3.1): all description sentences are regarded as (possibly noisy) labels for every image fragment, and based on the massive training data, the model would hopefully be trained to align caption sentences to their corresponding image fragments. The second stage is similar to the basic model in [70], but the alignment results are used to provide more precise training data. As mentioned above, [76] gives the decoder the ability to focus attention on different parts of the image for different words. However, there are some nonvisual words in the decoding process. For example, words such as the and of depend more on semantic information than on visual information. Furthermore, words such as phone after cell, or meter after near the parking, are usually generated by the language model. To avoid the gradient of a nonvisual word decreasing the effectiveness of visual attention, in the process of generating captions, [43] adopts an adaptive attention model with a visual sentinel. At each time step, the model needs to determine whether it depends on an image region or on the visual sentinel. The adaptive attention model [43] uses attention in the process of generating a word rather than updating the LSTM state; it utilizes a "visual sentinel" vector x_t and image region vectors I_i. Here, x_t is produced by the inputs and states of the LSTM at time step t, while I_i is provided by the CNN encoder. Then the adaptive context vector ĉ_t is the weighted sum of the L image region vectors I_i and the visual sentinel x_t:

ĉ_t = Σ_{i=1}^{L} α_i I_i + α_{L+1} x_t, (9.15)

where α_i are the weights computed from I_i, x_t, and the LSTM hidden state h_t. We have Σ_{i=1}^{L+1} α_i = 1. Finally, the probability of a word in the vocabulary at time t can be calculated in a residual form:

p_t = Softmax(W_p(ĉ_t + h_t)), (9.16)

where W_p is a learned weight parameter.

Many existing image captioning models with attention allocate attention over the image's regions, whose size is often 7 × 7 or 14 × 14, decided by the last pooling layer in the CNN encoder. Anderson et al. [2] first calculate attention at the level of objects. They employ Faster R-CNN [58], which is trained on ImageNet [60] and Visual Genome [31], to predict attribute classes, such as an open oven, green bottle, floral dress, and so on. After that, they apply attention over valid bounding boxes to obtain fine-grained attention that helps caption generation.

Besides, [11] rethinks the form of latent states in image captioning, which usually compresses the two-dimensional visual feature maps encoded by the CNN into a one-dimensional vector as the input of the language model. They find that a language model with 2D states can preserve spatial locality, which can link the input visual domain and the output linguistic domain, as observed by visualizing the transformation of hidden states. Word embeddings and hidden states in [11] are 3D tensors of size C × H × W, which means C channels, each of size H × W. The encoded feature maps are directly input to the 2D language model instead of going through an average pooling layer. In the 2D language model, the convolution operator takes the place of matrix multiplication in the 1D model, and mean pooling is used to generate the output word probability distribution from the 2D hidden states.

Fig. 9.4 An example of the activated region of a latent channel

Figure 9.4 shows the activated region of a latent channel at the tth step. When we set a threshold for the activated regions, it is revealed that special channels are associated with specific nouns in the decoding process, which helps us better understand the process of generating captions.

Traditional methods train the caption model by maximizing the likelihood of training examples, which forms a gap between the optimization objective and the evaluation metrics. To alleviate the problem, [59] uses reinforcement learning to directly maximize the CIDEr metric [69]. CIDEr reflects the diversity of generated captions by giving high weights to the low-frequency n-grams in the training set, which demonstrates that people prefer detailed captions rather than universal ones, like a boy is playing a game. To encourage the distinctiveness of captions, [10] adopts contrastive learning. Their model learns to discriminate the caption of a given image from the caption of a similar image by maximizing the difference between the ground truth positive pair and the mismatched negative pair. The experiments show that contrastive learning increases the diversity of captions significantly.

Furthermore, automatic evaluation metrics, such as BLEU [54], METEOR [13], ROUGE [38], CIDEr [69], SPICE [1], and so on, may neglect some novel expressions constrained by the ground truth captions. To better evaluate the naturalness and diversity of captions, [9] proposes a framework based on Conditional Generative Adversarial Networks, whose generator tries to achieve a higher score from the evaluator, while the evaluator tries to distinguish between the generated caption and human descriptions for a given image, as well as between the given image and a mismatched description. The user study shows that the trained generator can generate more natural and diverse captions than a model trained by maximum likelihood estimation, while the trained evaluator is more consistent with human evaluation. Besides the works we introduced above, there are also many variants of the basic encoder-decoder model, such as [20, 26, 40, 45, 51, 71, 73].

9.4 Visual Relationship Detection

Visual relationship detection is the task of detecting objects in an image and understanding the relationship between them. While detecting the objects is usually based on semantic segmentation or object detection methods, such as R-CNN, understanding the relationship is the key challenge of this task. While detecting visual relations with image information is intuitive and effective [25, 62, 84], leveraging information from language can further boost the model performance [37, 41, 82].

9.4.1 Visual Relationship Detection with Language Priors

Lu et al. [41] propose a model that uses language priors to enhance the performance on infrequent relationships, for which sufficient training instances are hard to obtain solely from images. The overall architecture is shown in Fig. 9.5. They first train a CNN to calculate the unnormalized probability of a relation obtained from visual inputs by

Fig. 9.5 The architecture of visual relationship detection with language prior

P_V(R_{i,j,k}, Θ | O_1, O_2) = P_i(O_1) (z_k^T CNN(O_1, O_2) + s_k) P_j(O_2), (9.17)

where P_i(O_j) denotes the probability that bounding box O_j is entity i, and CNN(O_1, O_2) is the joint feature of box O_1 and box O_2. Θ = {z_k, s_k} is the set of parameters. Besides, the language prior is considered in this model by calculating the unnormalized probability that the entity pair ⟨i, j⟩ has the relation k:

P_f(R, W) = r_k^T [w_i; w_j] + b_k, (9.18)

where w_i and w_j are the word embeddings of the text of the subject and object, respectively, and r_k is the learned relational embedding of the relation k. Given the probabilities of a relation from the visual and textual inputs, respectively, the authors combine them into the integrated probability of a relation. The final prediction is the one with the maximal integrated probability:

R* = max_R P_V(R_{i,j,k} | O_1, O_2) P_f(R, W). (9.19)

The rank of the ground truth relationship R with bounding boxes O_1 and O_2 is maximized using the following rank loss function:

C(Θ, W) = Σ_{⟨O_1,O_2⟩, R} max{1 − P_V(R, Θ | O_1, O_2) P_f(R, W) + max_{⟨O_1', O_2'⟩ ≠ ⟨O_1, O_2⟩, R' ≠ R} P_V(R', Θ | O_1', O_2') P_f(R', W), 0}. (9.20)

In addition to the loss that optimizes the rank of the ground truth relationships, the authors also propose two regularization functions for language priors. The final loss function of this model is defined as

L = C(Θ, W) + λ1 L(W) + λ2 K (W). (9.21)

K(W) is a variance function that makes the f(·) scores of similar relationships closer:

K(W) = Var{[P_f(R, W) − P_f(R', W)]² / d(R, R')}, ∀ R, R', (9.22)

where d(R, R') is the sum of the cosine distances (in the Word2vec space) between the two objects and the predicates of the two relationships R and R'. L(W) is a function that encourages less frequent relations to have a lower f(·) score. When R occurs more frequently than R', we have

L(W) = Σ_{R, R'} max{P_f(R', W) − P_f(R, W) + 1, 0}. (9.23)
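At inference time, the two modules are combined as in Eq. (9.19); a minimal sketch is shown below, where the score tensors and the helper functions in the usage comment are illustrative assumptions.

```python
import torch

def predict_relation(visual_scores, language_scores):
    """visual_scores, language_scores: (num_candidates,) unnormalized scores over all
    candidate relationships for a pair of bounding boxes, produced by the visual module
    and the language prior, respectively."""
    combined = visual_scores * language_scores     # product score of Eq. (9.19)
    return torch.argmax(combined), combined

# Usage (hypothetical helpers):
# rel, _ = predict_relation(visual_module(O1, O2), language_prior(subj_emb, obj_emb))
```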

Fig. 9.6 The architecture of VTransE model

Fig. 9.7 An illustration for scene graph generation: (a) a scene and (b) the corresponding scene graph

9.4.2 Visual Translation Embedding Network

Inspired by recent progress in knowledge representation learning, [82] proposes VTransE, a visual translation embedding network. Objects and the relationships between objects are modeled as TransE-like [7] vector translations. VTransE first projects the subject and object into the same space as the relation translation vector r ∈ R^r. The subject and object can be denoted as x_s, x_o ∈ R^M in the feature space, where M ≫ r. Similar to the TransE relationship, VTransE establishes a relationship as

W_s x_s + r ≈ W_o x_o, (9.24)

where W_s and W_o are projection matrices. The overall architecture is shown in Fig. 9.6.
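A minimal sketch of the translation embedding in Eq. (9.24) is given below; the negative-distance scoring and the dimensions are illustrative choices rather than the exact VTransE training objective.

```python
import torch
import torch.nn as nn

class VTransESketch(nn.Module):
    """Sketch of the visual translation embedding: score a triple by how well
    W_s x_s + r matches W_o x_o (Eq. 9.24)."""
    def __init__(self, feat_dim, rel_dim, num_relations):
        super().__init__()
        self.W_s = nn.Linear(feat_dim, rel_dim, bias=False)   # subject projection W_s
        self.W_o = nn.Linear(feat_dim, rel_dim, bias=False)   # object projection W_o
        self.rel = nn.Embedding(num_relations, rel_dim)       # translation vectors r

    def score(self, x_s, x_o, rel_ids):
        # Higher score means W_s x_s + r is closer to W_o x_o.
        translated = self.W_s(x_s) + self.rel(rel_ids)
        return -torch.norm(translated - self.W_o(x_o), dim=-1)
```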

9.4.3 Scene Graph Generation

Li et al. [37] further formulate visual relation detection as a scene graph generation task, where nodes correspond to objects and directed edges correspond to visual relations between objects, as shown in Fig. 9.7. This formulation allows [37] to leverage different levels of context information, such as information from objects, phrases (i.e., ⟨subject, predicate, object⟩ triples),

and region captions, to boost the performance of visual relation detection. Specifically, [37] proposes to construct a graph that aligns these three levels of information and to perform feature refinement via message passing, as shown in Fig. 9.8. By leveraging complementary information from different levels, the performances of different tasks are expected to be mutually improved.

Fig. 9.8 Dynamical graph construction. a The input image. b Object (bottom), phrase (middle), and caption region (top) proposals. c The graph modeling connections between proposals. Some of the phrase boxes are omitted

Dynamic Graph Construction. Given an image, they first generate three kinds of proposals that correspond to the three kinds of nodes in the proposed graph structure. The proposals include object proposals, phrase proposals, and region proposals. The object and region proposals are generated using a Region Proposal Network (RPN) [57] trained with ground truth bounding boxes. Given N object proposals, phrase proposals are constructed based on the N(N − 1) object pairs that fully connect the object proposals with directed edges, where each directed edge represents a potential phrase between an object pair. Each phrase proposal is connected to the corresponding subject and object with two directed edges. A phrase proposal and a region proposal are connected if their overlap exceeds a certain fraction (e.g., 0.7) of the phrase proposal. There are no direct connections between objects and regions since they can be indirectly connected via phrases.

Feature Refinement. After obtaining the graph structure over the different levels of nodes, they perform feature refinement by iterative message passing. The message passing procedure is divided into three parallel stages, including object refinement, phrase refinement, and region refinement. In object feature refinement, the object proposal feature is updated with gated features from adjacent phrases. Given an object i, the aggregated feature x̂_i^{p→s} from the phrases that are linked to object i via subject-predicate edges can be defined as follows:

x̂_i^{p→s} = (1/E_{i,p}) Σ_{(i,j)∈E_{s,p}} f_{o,p}(x_i^{(o)}, x_j^{(p)}) x_j^{(p)}, (9.25)

where E_{s,p} is the set of subject-predicate connections, E_{i,p} denotes the number of phrases connected with object i as subject-predicate pairs, and f_{o,p} is a learnable gate function that controls the weights of information from different sources:

f_{o,p}(x_i^{(o)}, x_j^{(p)}) = Sigmoid(Σ_{k=1}^{K} ω_{o,p}^{(k)} · [x_i^{(o)}; x_j^{(p)}]), (9.26)

where ω_{o,p}^{(k)} is a gate template used to calculate the importance of the information from a subject-predicate edge, and K is the number of templates. The aggregated feature x̂_i^{p→o} from object-predicate edges can be computed similarly. After obtaining the information x̂_i^{p→s} and x̂_i^{p→o} from adjacent phrases, the object refinement at time step t can be defined as follows:

x_{i,t+1}^{(o)} = x_{i,t}^{(o)} + f^{(p→s)}(x̂_i^{p→s}) + f^{(p→o)}(x̂_i^{p→o}), (9.27)

where f(·) = W ReLU(·), and the learnable parameter W is not shared between f^{(p→s)}(·) and f^{(p→o)}(·). The refinement scheme of phrases and regions is similar to that of objects. The only difference is the information sources: phrase proposals receive information from adjacent objects and regions, and region proposals receive information from phrases. After feature refinement via iterative message passing, the features of the different levels of nodes can be used for the corresponding tasks. Region features can be used as the initial state of a language model to generate region captions. Phrase features can be used to predict the visual relations between objects, which compose the scene graph of the image.

In comparison with scene graph generation methods that model the dependencies between relation instances by attention mechanisms or message passing, [47] decomposes the scene graph task into a mixture of two phases: extracting primary relations from the input, and completing the scene graph with reasoning. The authors propose a Hybrid Scene Graph generator (HRE) that combines these two phases in a unified framework and generates scene graphs from scratch. Specifically, HRE first encodes object pairs into representations and then employs a neural relation extractor resolving primary relations from the inputs and a differentiable inductive logic programming model that iteratively completes the scene graph. As shown in Fig. 9.9, HRE contains two units, a pair selector and a relation predictor, and runs in an iterative way. At each time step, the pair selector looks at all object pairs P− that have not been associated with a relation and chooses the next pair of entities whose relation is to be determined. The relation predictor utilizes the information contained in all pairs P+ whose relations have been determined, as well as the contextual information of the pair, to make the prediction on the relation. The prediction result is then added to P+ and benefits future predictions. To encode an object pair into a representation, HRE extends the union box encoder proposed by [41] by adding the object features (what the objects are) and their locations (where the objects are) into the object pair representation, as shown in Fig. 9.10.

Fig. 9.9 Framework of HRE that detects primary relations from inputs and iteratively completes the scene graph via inductive logic programming

Fig. 9.10 Object pair encoder of HRE

Relation Predictor. The relation predictor is composed of two modules: a neural module predicting the relations between entities based on the given context (i.e., a visual image) and a differentiable inductive logic module performing reasoning on P+. Both modules predict the relation score between a pair of objects individually. The relation scores from the two modules are finally integrated by multiplication. Pair Selector. The selector works as the predictor's collaborator with the goal of figuring out the next relation that should be determined. Ideally, the choice p* made by the selector should satisfy the condition that all relations that will affect the predictor's prediction on p* are sent to the predictor ahead of p*. HRE implements the pair selector as a greedy selector which always chooses, from P−, the entity pair for which the relation predictor is most confident in its prediction, and adds it to P+.

It is worth noting that the task of scene graph generation resembles document-level relation extraction in many aspects. Both tasks seek to extract structured graphs consisting of entities and relations. Also, they need to model the complex dependencies between entities and relations in a rich context. We believe both tasks are worth exploring in future research.

9.5 Visual Question Answering

Visual Question Answering (VQA) aims to answer natural language questions about an image, and can be seen as a single turn of dialogue about a picture. In this section, we will introduce widely used VQA datasets and several typical VQA models.

9.5.1 VQA and VQA Datasets

VQA was first proposed in [46]. They first propose a single-world approach by modeling the probability of an answer a given a question q and a world w by

P(a|q, w) = Σ_z P(a|z, w) P(z|q), (9.28)

where z is a latent variable associated with the question and the world w is a representation of the image. They further extend the single-world approach to a multi-world approach by marginalizing over different segmentations s of the given image. The probability of an answer a given a question q and segmentations s is given by

P(a|q, s) = Σ_w Σ_z P(a|w, z) P(w|s) P(z|q). (9.29)
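As a tiny numerical illustration of the single-world formulation in Eq. (9.28), the answer distribution is simply a marginalization over the latent variable z; the array values below are made up.

```python
import numpy as np

def answer_distribution(p_a_given_z_w, p_z_given_q):
    """p_a_given_z_w: (num_latent, num_answers) table P(a|z, w);
    p_z_given_q: (num_latent,) vector P(z|q)."""
    return p_z_given_q @ p_a_given_z_w        # sum over z, Eq. (9.28)

print(answer_distribution(np.array([[0.9, 0.1], [0.2, 0.8]]), np.array([0.3, 0.7])))
# -> [0.41 0.59]
```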

They also release the first VQA dataset, named DAQUAR, in their paper. Besides DAQUAR, researchers have also released many VQA datasets with various characteristics. The most widely used dataset was released in [4], where the authors provided cases and experimental evidence to demonstrate that, to answer these questions, a human or an algorithm should use features of the image and external knowledge. Figure 9.11 shows examples from the VQA dataset released in [4]. It is also demonstrated that this problem cannot be solved by converting images to captions and answering questions according to the captions. Experiment results show that the performance of vanilla methods is still far from that of humans. In fact, there are also other existing datasets for visual QA, such as Visual7W [85], Visual Madlibs [80], COCO-QA [56], and FM-IQA [19].

Fig. 9.11 Examples of the VQA dataset: "Why are the men jumping?" – "to catch frisbee"; "Is the water still?" – "no"; "What is the kid doing?" – "skateboarding"; "What is hanging on the wall above the headboard?" – "pictures"

9.5.2 VQA Models

Besides, [4, 46] further investigate approaches to solve specific types of questions in VQA. Moreover, [83] proposes an approach to solve "YES/NO" questions. Note that the model is an ensemble of two similar models, a Q-model and a Tuple-model, whose difference will be described later. The overall approach can be divided into two steps: (1) Language Parsing and (2) Visual Verification. In the former step, they first extract ⟨P, R, S⟩ tuples from questions by parsing them and assigning an entity to each word. Then they summarize the parsed sentences by removing "stop words", auxiliary verbs, and all words before a nominal subject or passive nominal subject, and further split the summary into PRS arguments according to the types of the phrases. The difference between the Q-model and the Tuple-model is that the Q-model is the one used in their previous work [4], embedding the question into a dense 256-dim vector with an LSTM, while the Tuple-model converts ⟨P, R, S⟩ tuples into 256-dim embeddings with an MLP. As for the Visual Verification step, they use the same image features as in [39], which are encoded into a dense 256-dim vector by an inner-product layer followed by a tanh layer. These two vectors are passed through an MLP to produce the final output ("Yes" or "No"). Moreover, [61] proposes a method to calculate the attention α_j from the set of image features I = (I_1, I_2, ..., I_K) and the question embedding q by

α_j = (W_1 I_j + b_1)^T (W_2 q + b_2), (9.30)

where W_1, W_2, b_1, b_2 are trainable parameters. Attention-based techniques are quite efficient for filtering out noise that is irrelevant to the question. However, some questions are only related to small regions, which encourages researchers to use stacked attention to further filter out noise. We refer readers to Fig. 1b in [79] for an example of stacked attention. Yang et al. [79] further extend the attention-based model used in [61], which employs LSTMs to predict the answer. They take the question as input and attend to different regions in the image to obtain additional input. The key idea is to gradually filter out noise and pinpoint the regions that are highly relevant to the answer by reasoning through multiple stacked attention layers progressively. The stacked attention can be calculated by stacking:

h_A^k = tanh(W_1^k I ⊕ (W_2^k u^{k−1} + b_A^k)). (9.31)

Note that we denote the addition of a matrix and a vector by ⊕. The addition between a matrix and a vector is performed by adding the vector to each column of the matrix. u is a refined query vector that combines information from the question and the image regions. u^0 (i.e., u from the first attention layer with k = 0) can be initialized as the feature vector of the question. h_A^k is then used to compute p_I^k, which corresponds to the attention probability of each image region,

p_I^k = Softmax(W_3^k h_A^k + b_P^k). (9.32)

u^k can be iteratively updated by

Ĩ^k = Σ_i p_i^k I_i, (9.33)

u^k = u^{k−1} + Ĩ^k. (9.34)
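A single stacked attention layer (Eqs. 9.31–9.34) can be sketched in PyTorch as follows; the hidden size and the use of one linear layer per transformation are illustrative simplifications of [79].

```python
import torch
import torch.nn as nn

class StackedAttentionLayer(nn.Module):
    """One stacked attention layer: attend over image regions with the current query
    and produce a refined query u^k (Eqs. 9.31-9.34)."""
    def __init__(self, d, hidden):
        super().__init__()
        self.w_img = nn.Linear(d, hidden, bias=False)
        self.w_query = nn.Linear(d, hidden)
        self.w_att = nn.Linear(hidden, 1)

    def forward(self, image_feats, query):
        # image_feats: (batch, K, d) region features I_i; query: (batch, d) vector u^{k-1}
        h = torch.tanh(self.w_img(image_feats) + self.w_query(query).unsqueeze(1))  # Eq. (9.31)
        p = torch.softmax(self.w_att(h).squeeze(-1), dim=-1)                        # Eq. (9.32)
        attended = (p.unsqueeze(-1) * image_feats).sum(dim=1)                       # Eq. (9.33)
        return query + attended                                                     # Eq. (9.34)

# Stacking several such layers progressively sharpens the attended image regions.
```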

That is, in every layer, the model progressively uses the combined question and image vector u^{k−1} as the query vector for attending to the image regions and obtaining the new query. The above models attend only to the image, but the question should also be attended to. [44] calculates co-attention by

Z = tanh(Q^T W I), (9.35)

Fig. 9.12 The architecture of hierarchical co-attention model

Fig. 9.13 The architecture of VQA incorporating external knowledge bases

where Z_{ij} represents the affinity of the ith word and the jth region. Figure 9.12 shows the hierarchical co-attention model. Another intuitive approach is to use external knowledge from knowledge bases, which helps better explain the implicit information hidden behind the image. Such an approach was proposed in [75], which first encodes the image into captions and into vectors representing different attributes of the image, and uses them to retrieve documents about different parts of the image from knowledge bases. The documents are encoded through doc2vec [36]. The representations of the captions, attributes, and documents are transformed and concatenated to form the initial vector of an LSTM, which is trained in a Seq2seq fashion. Details of the model are shown in Fig. 9.13. Neural Module Network is a framework for constructing deep networks with a dynamic computational structure, which was first proposed in [3]. In such a framework, every input is associated with a layout that provides a template for assembling an instance-specific network from a collection of shallow network fragments called modules.


Fig. 9.14 The architecture of the neural module network model

The proposed method processes the input question in two separate ways: (1) parsing and laying out several modules, and (2) encoding with an LSTM. The corresponding picture is processed by the modules laid out according to the question; the module types are predefined as find, transform, combine, describe, and measure. The authors define find to be a transformation from an Image to an Attention map, transform to be a mapping from one Attention to another, combine to be a combination of two Attentions, describe to be a description relying on the Image and Attention, and measure to be a measure relying only on Attention. The model is shown in Fig. 9.14. A key drawback of [3] is that it relies on the parser to generate modules. [22] proposes an end-to-end model to generate a Reverse-Polish expression describing the module network, as shown in Fig. 9.15, and the overall architecture is shown in Fig. 9.16. Graph Neural Networks (GNNs) have also been applied to VQA tasks. [68] tries to build graphs about both the scene and the question. The authors describe a deep neural network that takes advantage of such a structured representation. As shown in Fig. 9.17, the GNN-based VQA model can capture the relationships between words and objects.

9.6 Summary

In this chapter, we first introduce the concept of cross-modal representation learning, which aims to exploit the links between different modalities and enable better utilization of their information. Cross-modal learning is essential since many real-world tasks require the ability to understand information from different modalities, such as text and images.

Fig. 9.15 The architecture of Reverse-Polish expression and corresponding module network model

How many other things are of the same size as the green matte ball

Question encoder(RNN)

Question features

Layout predicion 4 Answer (reverse Polish notation) How many other things are of the same size as the green matte ball? count find() How many other things are relocate Module of the same size as the relocate network relocate(_) green matte ball? e the same size as th Image encoder(CNN) Network builder

Layout policy (RNN) e green matte ball? count(_) Question attentions Image features

Fig. 9.16 The architecture of end-to-end module network model

Fig. 9.17 The architecture of GNN-based VQA models

We then overview existing cross-modal representation learning methods for several representative cross-modal tasks, including zero-shot recognition, cross-media retrieval, image captioning, and visual question answering. These cross-modal learning methods either try to fuse information from different modalities into unified embeddings, or try to build embeddings for different modalities in a common semantic space, allowing the model to compute cross-modal similarity. Cross-modal representation learning is drawing more and more attention and can serve as a promising connection between different research areas. For a further understanding of cross-modal representation learning, we recommend the following surveys and books:

• Skocaj et al., Cross-modal learning [64].
• Spence, Crossmodal correspondences: A tutorial review [66].
• Wang et al., A comprehensive survey on cross-modal retrieval [72].

In the future, several directions require further efforts for better cross-modal representation learning:

(1) Fine-grained Cross-modal Grounding. Cross-modal grounding is a fundamental ability for solving cross-modal tasks, which aims to align semantic units in different modalities. For example, visual grounding aims to ground textual symbols (e.g., words or phrases) into visual objects or regions. Many existing works [27, 74, 76] have been devoted to cross-modal grounding, mainly focusing on coarse-grained semantic unit grounding (e.g., grounding of sentences and images). Better fine-grained cross-modal grounding (e.g., grounding of words and objects) could promote the development of a broad variety of cross-modal tasks.

(2) Cross-modal Reasoning. In addition to recognizing and grounding semantic units in different modalities, understanding and inferring the relationships between semantic units are also crucial to cross-modal tasks. Many existing works [37, 41, 82] have investigated detecting visual relations between objects. However, most visual relations in existing visual relation detection datasets do not require complex reasoning. Some works [81] have made preliminary attempts at cross-modal commonsense reasoning. Inferring the latent semantic relationships in cross-modal context is critical for cross-modal understanding and modeling.

(3) Utilizing Unsupervised Cross-modal Data. Most current cross-modal learning approaches rely on human-annotated datasets. The scale of such supervised datasets is usually limited, which also limits the capability of data-hungry neural models. With the rapid development of the World Wide Web, cross-modal data on the Web have become larger and larger. Some existing works [42, 67] have leveraged unsupervised cross-modal data for representation learning. They first pretrain cross-modal models on large-scale image-caption pairs, and then fine-tune the models on downstream tasks, which has shown significant improvements on a broad variety of cross-modal tasks. It is thus promising to better leverage the vast amount of unsupervised cross-modal data for representation learning.

References

1. Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. SPICE: semantic propositional image caption evaluation. In Proceedings of ECCV, 2016. 2. Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of CVPR, 2018. 3. Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. In Proceedings of CVPR, pages 39–48, 2016. 4. Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of ICCV, 2015. 5. Lei Jimmy Ba, Kevin Swersky, Sanja Fidler, and Ruslan Salakhutdinov. Predicting deep zero- shot convolutional neural networks using textual descriptions. In Proceedings of ICCV, 2015. 6. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR, 2015. 7. Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In Proceedings of NeurIPS, 2013. 8. Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder– decoder for statistical machine translation. In Proceedings of EMNLP, 2014. 9. Bo Dai, Sanja Fidler, Raquel Urtasun, and Dahua Lin. Towards diverse and natural image descriptions via a conditional GAN. In Proceedings of ICCV, 2017. 10. Bo Dai and Dahua Lin. Contrastive learning for image captioning. In Proceedings of NeurIPS, 2017. 11. Bo Dai, Deming Ye, and Dahua Lin. Rethinking the form of latent states in image captioning. In Proceedings of ECCV, 2018. 12. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of CVPR, 2009. 13. Michael J. Denkowski and Alon Lavie. Meteor universal: Language specific translation eval- uation for any target language. In Proceedings of ACL, 2014. 14. Mohamed Elhoseiny, Babak Saleh, and Ahmed Elgammal. Write a classifier: Zero-shot learn- ing using purely textual descriptions. In Proceedings of ICCV, 2013. 15. Desmond Elliott and Arjen de Vries. Describing images using inferred visual dependency representations. In Proceedings of ACL, 2015. 16. Desmond Elliott and Frank Keller. Image description using visual dependency representations. In Proceedings of EMNLP, 2013. 17. Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. Every picture tells a story: Generating sentences from images. In Proceedings of ECCV, 2010. 18. Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. Devise: A deep visual-semantic embedding model. In Proceedings of NeurIPS, 2013. 19. Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, and Wei Xu. Are you talking to a machine? dataset and methods for multilingual image question. In Proceedings of NeurIPS, 2015. 20. Jiuxiang Gu, Jianfei Cai, Gang Wang, and Tsuhan Chen. Stack-captioning: Coarse-to-fine learning for image captioning. In Proceedings of AAAI, 2018. 21. Micah Hodosh, Peter Young, and Julia Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Machine Learning Research, 47:853– 899, 2013. 22. 
Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. Learning to reason: End-to-end module networks for visual question answering. In Proceedings of ICCV, 2017.

23. Xin Huang and Yuxin Peng. Deep cross-media knowledge transfer. In Proceedings of CVPR, 2018. 24. Xin Huang, Yuxin Peng, and Mingkuan Yuan. Cross-modal common representation learning by hybrid transfer network. In Proceedings of IJCAI, 2017. 25. Zhaoyin Jia, Andrew Gallagher, Ashutosh Saxena, and Tsuhan Chen. 3d-based reasoning with blocks, support, and stability. In Proceedings of ICCV, 2013. 26. Justin Johnson, Andrej Karpathy, and Li Fei-Fei. Densecap: Fully convolutional localization networks for dense captioning. In Proceedings of CVPR, 2016. 27. Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of CVPR, 2015. 28. Andrej Karpathy, Armand Joulin, and Fei Fei F Li. Deep fragment embeddings for bidirec- tional image sentence mapping. In Proceedings of NeurIPS, 2014. 29. Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of EMNLP, 2014. 30. Satwik Kottur, Ramakrishna Vedantam, José MF Moura, and Devi Parikh. Visual word2vec (vis-w2v): Learning visually grounded word embeddings using abstract scenes. In Proceed- ings of CVPR, 2016. 31. Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017. 32. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Proceedings of NeurIPS, 2012. 33. Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C Berg, and Tamara L Berg. Baby talk: Understanding and generating image descriptions. In Proceedings of CVPR, 2011. 34. Polina Kuznetsova, Vicente Ordonez, Alexander C Berg, Tamara L Berg, and Yejin Choi. Collective generation of natural image descriptions. In Proceedings of ACL, 2012. 35. Polina Kuznetsova, Vicente Ordonez, Tamara L Berg, and Yejin Choi. Treetalk: Composi- tion and compression of trees for image descriptions. Transactions of the Association for Computational Linguistics, 2(10):351–362, 2014. 36. Quoc V Le and Tomas Mikolov. Distributed representations of sentences and documents. In Proceedings of ICML, 2014. 37. Yikang Li, Wanli Ouyang, Bolei Zhou, Kun Wang, and Xiaogang Wang. Scene graph gener- ation from objects, phrases and region captions. In Proceedings of ICCV, 2017. 38. Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out, 2004. 39. Xiao Lin and Devi Parikh. Don’t just listen, use your imagination: Leveraging visual common sense for non-visual tasks. In Proceedings of CVPR, 2015. 40. Chenxi Liu, Junhua Mao, Fei Sha, and Alan L. Yuille. Attention correctness in neural image captioning. In Proceedings of AAAI, 2017. 41. Cewu Lu, Ranjay Krishna, Michael Bernstein, and Li Fei-Fei. Visual relationship detection with language priors. In Proceedings of ECCV, 2016. 42. Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visi- olinguistic representations for vision-and-language tasks. In Proceedings of NeurIPS, 2019. 43. Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of CVPR, 2017. 44. Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 
Hierarchical question-image co-attention for visual question answering. In Proceedings of NeurIPS, 2016. 45. Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Neural baby talk. In Proceedings of CVPR, 2018. 46. Mateusz Malinowski and Mario Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In Proceedings of NeurIPS, 2014.

47. Jiayuan Mao, Yuan Yao, Stefan Heinrich, Tobias Hinz, Cornelius Weber, Stefan Wermter, Zhiyuan Liu, and Maosong Sun. Bootstrapping knowledge graphs from images and text. Frontiers in Neurorobotics, 13:93, 2019. 48. Harry McGurk and John MacDonald. Hearing lips and seeing voices. Nature, 1976. 49. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In Proceedings of ICLR, 2013. 50. Yasuhide Mori, Hironobu Takahashi, and Ryuichi Oka. Image-to-word transformation based on dividing and vector quantizing images with words. In Proceedings of WMISR, 1999. 51. Jonghwan Mun, Minsu Cho, and Bohyung Han. Text-guided attention model for image cap- tioning. In Proceedings of AAAI, 2017. 52. Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng. Multimodal deep learning. In Proceedings of ICML, 2011. 53. Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Yoram Singer, Jonathon Shlens, Andrea Frome, Greg S Corrado, and Jeffrey Dean. Zero-shot learning by convex combination of semantic embeddings. In Proceedings of ICLR, 2014. 54. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of ACL, 2002. 55. Yuxin Peng, Xin Huang, and Yunzhen Zhao. An overview of cross-media retrieval: Concepts, methodologies, benchmarks and challenges. IEEE Transactions on Circuits and Systems for Video Technology, 2017. 56. Mengye Ren, Ryan Kiros, and Richard Zemel. Exploring models and data for image question answering. In Proceedings of NeurIPS, 2015. 57. Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings of NeurIPS, 2015. 58. Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: towards real-time object detection with region proposal networks. In Proceedings of NeurIPS, 2015. 59. Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. Self- critical sequence training for image captioning. In Proceedings of CVPR, 2017. 60. Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei- Fei Li. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015. 61. Kevin J Shih, Saurabh Singh, and Derek Hoiem. Where to look: Focus regions for visual question answering. In Proceedings of CVPR, 2016. 62. Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In Proceedings of ECCV, 2012. 63. Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014. 64. Danijel Skocaj, Ales Leonardis, and Geert-Jan M. Kruijff. Cross-Modal Learning, pp. 861– 864. Boston, MA, 2012. 65. Richard Socher, Milind Ganjoo, Christopher D Manning, and Andrew Ng. Zero-shot learning through cross-modal transfer. In Proceedings of NeurIPS, 2013. 66. Charles Spence. Crossmodal correspondences: A tutorial review. Attention, Perception, & Psychophysics, 73(4):971–995, 2011. 67. Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. Videobert: A joint model for video and language representation learning. In Proceedings of ICCV, 2019. 68. Damien Teney, Lingqiao Liu, and Anton Van Den Hengel. 
Graph-structured representations for visual question answering. In Proceedings of CVPR, 2017. 69. Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In Proceedings of CVPR, 2015. 70. Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of CVPR, 2015. 71. Cheng Wang, Haojin Yang, Christian Bartz, and Christoph Meinel. Image captioning with deep bidirectional lstms. In Proceedings of ACMMM, 2016.

72. Kaiye Wang, Qiyue Yin, Wei Wang, Shu Wu, and Liang Wang. A comprehensive survey on cross-modal retrieval. arXiv:1607.06215, 2016. 73. Yufei Wang, Zhe Lin, Xiaohui Shen, Scott Cohen, and Garrison W. Cottrell. Skeleton key: Image captioning by skeleton-attribute decomposition. In Proceedings of CVPR, 2017. 74. Hao Wu, Jiayuan Mao, Yufeng Zhang, Yuning Jiang, Lei Li, Weiwei Sun, and Wei-Ying Ma. Unified visual-semantic embeddings: Bridging vision and language with structured meaning representations. In Proceedings of CVPR, 2019. 75. Qi Wu, Peng Wang, Chunhua Shen, Anthony Dick, and Anton van den Hengel. Ask me anything: Free-form visual question answering based on knowledge from external sources. In Proceedings of CVPR, 2016. 76. Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of ICML, 2015. 77. Ran Xu, Jiasen Lu, Caiming Xiong, Zhi Yang, and Jason J Corso. Improving word represen- tations via global visual context. In Proceedings of NeurIPS Workshop, 2014. 78. YezhouYang, Ching Lik Teo, Hal Daumé III, and Yiannis Aloimonos. Corpus-guided sentence generation of natural images. In Proceedings of EMNLP, 2011. 79. Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked attention net- works for image question answering. In Proceedings of CVPR, 2016. 80. Licheng Yu, Eunbyung Park, Alexander C Berg, and Tamara L Berg. Visual madlibs: Fill in the blank description generation and question answering. In Proceedings of ICCV, 2015. 81. Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual commonsense reasoning. In Proceedings of CVPR, 2019. 82. Hanwang Zhang, Zawlin Kyaw, Shih-Fu Chang, and Tat-Seng Chua. Visual translation embed- ding network for visual relation detection. In Proceedings of CVPR, 2017. 83. Peng Zhang, Yash Goyal, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Yin and yang: Balancing and answering binary visual questions. In Proceedings of CVPR, 2016. 84. Bo Zheng, Yibiao Zhao, Joey Yu,Katsushi Ikeuchi, and Song-Chun Zhu. Scene understanding by reasoning stability and safety. International Journal of Computer Vision, 112(2):221–238, 2015. 85. Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei. Visual7w: Grounded question answering in images. In Proceedings of CVPR, 2016.

Chapter 10 Resources

Abstract Deep learning has proven to be a powerful method for a variety of artificial intelligence tasks, including some critical tasks in NLP. However, training a deep neural network is usually a very time-consuming process and requires a lot of code to build the related models. To alleviate these issues, several deep learning frameworks have been developed and released, which provide the necessary arithmetic operators for constructing neural networks. These frameworks also exploit hardware features such as multi-core CPUs and many-core GPUs to shorten training time. Each framework has its own advantages and disadvantages. In this chapter, we present the features and running performance of these frameworks so that users can select an appropriate framework for their own usage.

10.1 Open-Source Frameworks for Deep Learning

In this section, we introduce several typical open-source frameworks for deep learning, including Caffe, Theano, TensorFlow, Torch, PyTorch, Keras, and MXNet. With the rapid development of the deep learning community, these open-source frameworks are updated every day, and therefore the information in this section may not be up to date. This section mainly focuses on the special features of these frameworks to give readers a preliminary understanding of them. For the latest features of these deep learning frameworks, please refer to their official sites.

10.1.1 Caffe

Caffe1 is a well-known framework that is widely used for computer vision tasks. It was created by Yangqing Jia and developed by Berkeley AI Research (BAIR). Caffe uses a layer-wise approach that makes building models easy, and it is also convenient to fine-tune existing neural networks without writing much code via its simple interfaces. The underlying design of Caffe targets the fast construction of convolutional neural networks, which makes it efficient and effective. On the other hand, since normal pictures often have a fixed size, the interfaces of Caffe are fixed and hard to extend. It is thus difficult to use Caffe for tasks with variable input lengths, such as text, sound, or other time-series data. Recurrent neural networks are also not well supported by Caffe. Although users can easily build an existing network architecture with the layer-wise framework, it is not flexible when dealing with big and complex networks. If users want to design a new layer, they need to write the underlying code of the new layer in C/C++ and CUDA.

1http://caffe.berkeleyvision.org/.

10.1.2 Theano

Theano2 is the typical framework developed to use symbolic tensor graphs for model specification. Any neural network or other machine-learning model can be represented as a symbolic tensor graph, and the forward pass, backward pass, and gradient updates can be calculated based on the flow between tensors. Hence, Theano provides more flexibility than the layer-wise approach Caffe uses to build models. In Caffe, defining a new layer that is not already in the existing repository of layers is complicated, since its forward, backward, and gradient update functions must all be implemented first. In Theano, you only need to compose basic operators to define a customized layer following the order of operations. Theano is easy to configure compared with other frameworks, and some high-level frameworks such as Keras are built on top of it, which makes Theano even easier to use. Theano supports cross-platform configuration well, which means it works not only on Linux but also on Windows. Because of this, many researchers and engineers use Theano to build their models and then release these projects, and the rich open resources based on Theano attract even more users. Though Theano uses Python syntax to define symbolic tensor graphs, its graph processor compiles the graphs into high-performance C++ or CUDA code for computation. Owing to this, Theano can run very fast while keeping the programmer's code simple. One deficiency is that the compilation process is slow. If a neural network does not need to be trained for several days, Theano may not be a good choice, since compiling too often in Theano is maddening and annoying. As a comparison, later frameworks such as TensorFlow ship precompiled packages for the symbolic tensor operations, which is a little more relaxing. Theano has some other serious disadvantages. Theano cannot support many-core GPUs very well, which makes it hard to train big neural networks. Besides the compilation process, importing Theano is also slow, and when you run your code in Theano, you will be stuck for a long time with a preconfigured device. If you want to improve and contribute to Theano itself, this will also be maddening and annoying.

2http://www.deeplearning.net/software/theano/.

In fact, Theano is no longer maintained, but it is still worth introducing as a landmark in the history of deep learning frameworks that inspired many subsequent ones.
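Although Theano is no longer maintained, a minimal sketch still illustrates the symbolic-graph style described above. The snippet below is our own illustration (the variable names and the toy logistic-regression graph are ours) and assumes a working Theano installation.

```python
import theano
import theano.tensor as T

# Declare symbolic inputs: a feature matrix, a weight vector, and a bias.
x = T.dmatrix('x')
w = T.dvector('w')
b = T.dscalar('b')

# Build the symbolic graph of a logistic regression score.
score = T.nnet.sigmoid(T.dot(x, w) + b)

# Gradients are derived symbolically from the same graph.
grad_w = T.grad(score.sum(), w)

# Compilation happens here: the graph is turned into fast C/CUDA code.
predict = theano.function(inputs=[x, w, b], outputs=score)
```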

10.1.3 TensorFlow

TensorFlow3 is mainly developed and used by Google, based on the experience with Theano and DistBelief [1]. TensorFlow and Theano are in fact quite similar to some extent: both of them build a symbolic graph of the neural network architecture via a Python interface. Different from Theano, TensorFlow allows implementing new operations or machine-learning algorithms in C/C++ and Java. By building symbolic graphs, automatic differentiation can easily be used to train complicated models. Hence, TensorFlow is more than a deep learning framework; its flexibility enables it to solve various complex computing problems such as reinforcement learning. In TensorFlow, both code development and deployment are fast and convenient. Trained models can be deployed quickly on a variety of devices, including servers and mobile devices, without the need to implement separate model-serving code or load a Python/LuaJIT interpreter. Caffe also allows easy deployment of models; however, Caffe has trouble running on devices without a GPU, which is a common situation for smartphones. TensorFlow supports model decoding using ARM/NEON instructions and does not require many operations to choose training devices. TensorBoard provides a platform for visualizing model architectures, which is both attractive and useful. By visualizing the symbolic graph, it is not difficult to find bugs in the source code, whereas debugging models in other deep learning frameworks is relatively cumbersome. TensorBoard can also log and generate real-time visualizations of variables during training, which is a pleasant way to monitor the training process. Though customizing operations in TensorFlow is convenient, many function interfaces change in every new release, which makes it challenging for developers to keep their code compatible with different TensorFlow versions, and mastering TensorFlow is also not easy. As TensorFlow 2.0 has been released recently, TensorFlow may gradually handle these issues in the foreseeable future.
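To make the define-and-run workflow concrete, the sketch below builds a tiny static graph in the TensorFlow 1.x graph-and-session style discussed in this section; the variable names and the toy objective are our own.

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x style API

# First, define the symbolic graph (define-and-run).
x = tf.placeholder(tf.float32, shape=[None, 4], name='x')
w = tf.Variable(tf.random_normal([4, 1]), name='w')
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

# Then execute the graph inside a session.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    _, value = sess.run([train_op, loss],
                        feed_dict={x: np.ones((8, 4), dtype='float32')})
    print(value)
```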

10.1.4 Torch

Torch4 is a computational framework mainly developed and used by Facebook and Twitter. Torch provides an API written in Lua to support the implementation of machine-learning algorithms, especially convolutional neural networks. A temporal convolutional layer implemented in Torch can have a variable input length, which is extremely useful for NLP tasks and not provided by Theano and TensorFlow. Torch also contains a 3D convolutional layer, which can easily be used in video recognition tasks. Besides its various flexible convolutional layers, Torch is light and speedy. The above reasons attract lots of researchers in universities and companies to customize their own deep learning platforms on top of it.

However, the negative aspects of Torch are also apparent. Though Torch is powerful, it is not designed to be widely accessible to the Python-based academic community, and there is no interface other than Lua. Lua is a multi-paradigm scripting language developed in Brazil in the early 1990s and is not a mainstream programming language, so it takes some time to learn Lua before you can use Torch to construct models. Different from convolutional neural networks, there is no official support for recurrent neural networks. There are some open resources implementing recurrent neural networks in Torch, but they are not yet integrated into the main repository, and it is difficult to judge the effectiveness of these implementations. Similar to Caffe, Torch is not a framework based on symbolic tensor graphs; it also uses the layer-wise approach. This means that your models in Torch are a graph of layers, not a graph of mathematical functions. The mechanism is convenient for building a network whose layers are stable and hierarchical, but if you want to design a new connection layer or change an existing neural model, you need lots of code to implement new layers with full forward, backward, and gradient update functions. In contrast, frameworks based on symbolic tensor graphs, such as Theano and TensorFlow, give more flexibility to do this. In fact, these issues have been addressed by PyTorch, which we introduce next.

3https://www.tensorflow.org/. 4http://torch.ch/.

10.1.5 PyTorch

PyTorch5 is a Python package built on top of Torch, developed by Facebook and other companies. However, it is not just an interface: PyTorch brings a number of improvements over Torch. The most important one is that PyTorch can use a symbolic graph to define neural networks and then apply automatic differentiation over that graph to compute the backward passes. Meanwhile, PyTorch maintains some characteristics of the layer-wise approach in Torch, which keeps coding with PyTorch easy. Moreover, PyTorch has minimal framework overhead and custom memory allocators for the GPU, which makes PyTorch faster and more memory-efficient than Torch.

5http://pytorch.org/.

Compared with other deep learning frameworks, PyTorch has two main advantages. First, most frameworks like TensorFlow are based on static computational graphs (define-and-run), while PyTorch uses dynamic computational graphs (define-by-run). This means that with dynamic computational graphs, you can change the network architecture based on the data flowing through the network. There is a way to do something similar in TensorFlow, but the static computational graph must contain all possible branches in advance, which limits performance. Second, PyTorch is built to be deeply integrated into Python and has seamless connections with other popular Python packages, such as NumPy, SciPy, and Cython. Thus, it is easy to extend your model when needed. After Facebook released it, PyTorch drew considerable attention from the deep learning community, and many past Torch users have switched to this new package. For now, PyTorch already has a thriving community, which contributes to its increasing popularity among researchers. It is no exaggeration to say that PyTorch is one of the most popular frameworks at present.
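The define-by-run style can be illustrated with a toy module whose depth depends on the input, something a static graph cannot express directly. The example below is our own sketch; the network and its sizes are arbitrary.

```python
import torch
import torch.nn as nn

class DynamicNet(nn.Module):
    """A toy network whose depth depends on the input, which is natural in
    PyTorch because the graph is built on the fly (define-by-run)."""
    def __init__(self, dim=16):
        super().__init__()
        self.layer = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, 2)

    def forward(self, x):
        # Data-dependent control flow: apply the shared layer a variable
        # number of times depending on the norm of the input.
        steps = 1 if x.norm() < 10 else 3
        for _ in range(steps):
            x = torch.relu(self.layer(x))
        return self.out(x)

model = DynamicNet()
loss = model(torch.randn(4, 16)).sum()
loss.backward()  # autograd traces whichever graph was actually executed
```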

10.1.6 Keras

Keras6 is a high-level deep learning framework that is built on Theano and TensorFlow. Interestingly, although Keras sits atop Theano and TensorFlow, its interfaces are similar to Torch. Keras is used through Python code, and there are lots of detailed documents and examples for a quick start. There is also a very active community of developers who keep Keras updated quickly, so it is a fast-growing framework. Because Theano and TensorFlow are the backends of Keras, the disadvantages of Keras are mostly inherited from them. With TensorFlow as the backend, Keras runs even slower than pure TensorFlow code. Because it is a high-level framework, customizing a new neural layer is not easy, although you can easily use existing layers. The package is quite abstract and hides many training parameters; you cannot touch and change all the details of your own models unless you drop down to Theano, TensorFlow, or PyTorch.
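The following sketch shows how a small model is assembled from existing layers with the standalone Keras API; the layer sizes and the random data are placeholders of ours, used only to show the interface.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Embedding, GlobalAveragePooling1D

# A tiny text classifier built from existing layers in a few lines.
model = Sequential([
    Embedding(input_dim=10000, output_dim=64),
    GlobalAveragePooling1D(),
    Dense(16, activation='relu'),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Dummy integer word ids and binary labels, just to show the training call.
x = np.random.randint(0, 10000, size=(32, 20))
y = np.random.randint(0, 2, size=(32, 1))
model.fit(x, y, epochs=1, verbose=0)
```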

10.1.7 MXNet

MXNet7 is an effective and efficient open-source machine-learning framework, mainly backed by Amazon. It supports APIs in multiple languages, including C++, Python, R, Scala, Julia, Perl, MATLAB, and JavaScript, some of which can be adopted for Amazon Web Services. Some interfaces of MXNet are also reserved for future mobile devices, just like TensorFlow. MXNet is built on a dynamic dependency scheduler that automatically parallelizes both symbolic and imperative operations on the fly, and a graph optimization layer on top of that makes symbolic execution fast and memory efficient. The MXNet library is portable and lightweight, and it scales to multiple GPUs and multiple machines. The main problem of MXNet is the lack of detailed and well-organized documentation. Its user groups are also smaller than those of other frameworks, especially compared with TensorFlow and PyTorch, so it is more challenging for newcomers to grasp MXNet. MXNet is developing quickly, and these problems may be solved in the future.

6https://keras.io/. 7http://mxnet.io/.
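A small imperative example, written to the best of our understanding of the MXNet NDArray and autograd interfaces (the shapes and the toy loss are ours), shows the on-the-fly recording mentioned above.

```python
from mxnet import nd, autograd

# Imperative NDArray operations, recorded for autograd on the fly.
x = nd.random.normal(shape=(4, 8))
w = nd.random.normal(shape=(8, 1))
w.attach_grad()

with autograd.record():
    y = nd.dot(x, w)
    loss = (y * y).mean()
loss.backward()          # gradients with respect to w are now in w.grad
print(w.grad.shape)
```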

10.2 Open Resources for Word Representation

10.2.1 Word2Vec

Word2vec8 is a widely used toolkit for word representation learning, which provides an effective and efficient implementation of the continuous bag-of-words and Skip-gram architectures. The word representations learned by Word2vec can be used in many natural language processing tasks, and empirically, using pretrained word vectors as model inputs is a good way to enhance model performance. Word2vec takes a free-text corpus as input and constructs the vocabulary from the training data. Then it uses simple predictive models based on neural networks to learn the language model, which encodes the co-occurrence information between words into the resulting word representations. The resulting representations showcase interesting linear substructures of the word vector space. The Euclidean distance (or cosine similarity) between two word vectors provides an effective method for measuring the linguistic or semantic similarity of the corresponding words. Sometimes, the nearest neighbors according to this metric reveal rare but relevant words that lie outside an average human's vocabulary. Words frequently appearing together in the text will have representations that are close within the embedding space. Word2vec also provides a tool to find the closest words to a user-specified word via the learned representations and the distances between them.
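The original word2vec is a command-line C toolkit; for illustration, the sketch below uses the Gensim reimplementation (our choice, not part of the toolkit itself) with a toy corpus of ours. Parameter names follow Gensim 4.x (e.g., vector_size), and sg=1 selects the Skip-gram architecture.

```python
from gensim.models import Word2Vec

# A toy tokenized corpus; in practice this would be a large plain-text corpus.
sentences = [
    ["natural", "language", "processing", "with", "word", "embeddings"],
    ["word", "embeddings", "capture", "semantic", "similarity"],
    ["king", "queen", "man", "woman"],
]

# sg=1 selects Skip-gram; sg=0 would select the continuous bag-of-words model.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

# Query the nearest neighbors of a word in the learned embedding space.
print(model.wv.most_similar("word", topn=3))
```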

10.2.2 GloVe

GloVe9 is a widely used toolkit that supports an unsupervised learning method for word representation learning. Similar to Word2vec, GloVe also trains on a text corpus and captures the aggregated global word-word co-occurrence information for word embeddings. However, GloVe uses count-based models instead of predictive models, which is different from Word2vec. The GloVe model first builds a global word-word co-occurrence matrix, which shows how frequently words co-occur with one another in a given text. Then word representations are trained on the nonzero entries of the matrix. Constructing this matrix requires a full traversal of the corpus to collect the statistics. For large corpora, this pass can be computationally expensive, but it is a one-time up-front cost. Subsequent training iterations are much faster because the number of nonzero matrix entries is typically much smaller than the total number of words in the corpus.

8https://code.google.com/archive/p/word2vec/. 9https://nlp.stanford.edu/projects/glove/.
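The pretrained GloVe vectors are distributed as plain text files with one word and its vector per line, so they can be loaded with a few lines of Python; the file name below refers to one of the files released on the GloVe website, and the helper functions are ours.

```python
import numpy as np

def load_glove(path):
    """Load GloVe vectors from a whitespace-separated text file."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

emb = load_glove("glove.6B.100d.txt")   # e.g., the 100-dimensional 6B release
print(cosine(emb["king"], emb["queen"]))
```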

10.3 Open Resources for Knowledge Graph Representation

10.3.1 OpenKE

OpenKE10 [2] is an open-source toolkit for Knowledge Embedding (KE), which provides a unified framework and various fundamental KE models. OpenKE prioritizes operational efficiency to support quick model validation and large-scale knowledge representation learning. Meanwhile, OpenKE maintains sufficient modularity and extensibility to incorporate new models easily. Besides the toolkit, the embeddings of some existing large-scale knowledge graphs pretrained by OpenKE are also available. The toolkit, documentation, and pretrained embeddings are all released on http://openke.thunlp.org/. Compared to other implementations, OpenKE has five advantages. First, OpenKE has implemented nine classical knowledge embedding algorithms, including RESCAL, TransE, TransH, TransR, TransD, ComplEx, DistMult, HolE, and Analogy, which are verified effective and stable. Second, OpenKE shows high performance due to memory optimization, multi-threading acceleration, and GPU learning. OpenKE supports multiple computing devices and provides interfaces to control CPU/GPU modes. Third, system encapsulation makes it easy to train and test KE models in OpenKE. Users just need to set hyperparameters via the interfaces of the platform to construct KE models. Fourth, it is easy to construct new KE models. All specific models are implemented by inheriting the base class and designing their own scoring functions and loss functions. Fifth, besides the toolkit, OpenKE also provides the embeddings of some existing large-scale knowledge graphs pretrained by OpenKE, which can be directly applied in many applications, including information retrieval, personalized recommendation, and question answering.
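Rather than reproducing OpenKE's exact interfaces here, the sketch below shows the core of one of the models it implements, TransE, written directly in PyTorch; all class and variable names are ours, and the entity/relation counts are made up for illustration.

```python
import torch
import torch.nn as nn

class TransE(nn.Module):
    """Minimal TransE: score(h, r, t) = -||h + r - t||_1, higher is better."""
    def __init__(self, num_entities, num_relations, dim=100):
        super().__init__()
        self.ent = nn.Embedding(num_entities, dim)
        self.rel = nn.Embedding(num_relations, dim)
        nn.init.xavier_uniform_(self.ent.weight)
        nn.init.xavier_uniform_(self.rel.weight)

    def forward(self, h, r, t):
        return -torch.norm(self.ent(h) + self.rel(r) - self.ent(t), p=1, dim=-1)

model = TransE(num_entities=1000, num_relations=50)
h, r, t = torch.tensor([0]), torch.tensor([3]), torch.tensor([7])
neg_t = torch.tensor([42])          # a randomly corrupted tail (negative sample)
# Margin-based ranking loss commonly used to train TransE.
loss = torch.relu(1.0 + model(h, r, neg_t) - model(h, r, t)).mean()
loss.backward()
```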

10https://github.com/thunlp/OpenKE.

10.3.2 Scikit-Kge

Scikit-kge11 is an open-source Python library for knowledge representation learning. The library provides different building blocks to train and develop models for knowledge graph embeddings. The primary purpose of Scikit-kge is to compute knowledge graph embeddings with the HolE method, but it also provides several other methods: besides HolE, RESCAL, TransE, TransR, and ER-MLP can also be trained in Scikit-kge. The library contains several parameter update methods, including not only basic SGD but also AdaGrad, and it implements different negative sampling strategies to select negative samples.

10.4 Open Resources for Network Representation

10.4.1 OpenNE

OpenNE12 is an open-source standard NE/NRL (Network Representation Learning) training and testing framework. It unifies the input and output interfaces of different NE models and provides scalable options for each model. Moreover, typical NE models under this framework are based on TensorFlow, which enables these models to be trained with GPUs. The implemented or modified models include DeepWalk, LINE, node2vec, GraRep, TADW, GCN, HOPE, GF, SDNE, and LE. The framework also provides classification and embedding visualization modules for evaluating the result of NRL.
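To give a feel for what such toolkits implement, the sketch below reproduces the DeepWalk recipe in a few lines: truncated random walks over the graph are treated as sentences and fed to a Skip-gram model. The use of networkx and Gensim is our own choice for illustration and is not OpenNE's interface.

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walks(graph, num_walks=10, walk_length=20):
    """Generate truncated random walks; each walk plays the role of a sentence."""
    walks = []
    nodes = list(graph.nodes())
    for _ in range(num_walks):
        random.shuffle(nodes)
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = list(graph.neighbors(walk[-1]))
                if not neighbors:
                    break
                walk.append(random.choice(neighbors))
            walks.append([str(n) for n in walk])
    return walks

graph = nx.karate_club_graph()              # a small built-in example graph
walks = random_walks(graph)
# Skip-gram over the walks yields node embeddings, i.e., the DeepWalk recipe.
model = Word2Vec(walks, vector_size=64, window=5, min_count=0, sg=1)
print(model.wv[str(0)][:5])                 # embedding of node 0
```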

10.4.2 GEM

GEM (Graph Embedding Methods)13 is a Python package that offers a general framework for graph embedding methods. It implements many state-of-the-art embedding techniques including Locally Linear Embedding, Laplacian Eigenmaps, Graph Factorization, High-Order Proximity preserved Embedding (HOPE), Structural Deep Network Embedding (SDNE), and node2vec. Furthermore, the framework implements several functions to evaluate the quality of the obtained embeddings, including graph reconstruction, link prediction, visualization, and node classification. For faster execution, a C++ backend is integrated using Boost for supported methods.

11https://github.com/mnick/scikit-kge. 12https://github.com/thunlp/OpenNE. 13https://github.com/palash1992/GEM.

10.4.3 GraphVite

GraphVite14 is a general and high-performance graph embedding system for various applications, including node embedding, knowledge graph embedding, and graph and high-dimensional data visualization. GraphVite provides a complete pipeline for users to implement and evaluate graph embedding models. For reproducibility, the system integrates several commonly used models and benchmarks, and you can also develop your own models with its flexible interface. Additionally, for semantic tasks, GraphVite releases a bunch of pretrained knowledge graph embedding models to enhance language understanding. There are two core advantages of GraphVite over other toolkits: fast and large-scale training. GraphVite accelerates graph embedding with multiple CPUs and GPUs, taking around one minute to learn node embeddings for graphs with one million nodes. Moreover, GraphVite is designed to be scalable: even with limited memory, GraphVite can process node embedding tasks on billion-scale graphs.

10.4.4 CogDL

CogDL15 is another graph representation learning toolkit that allows researchers and developers to easily train and evaluate baseline or custom models for node classification, link prediction, and other tasks on graphs. It provides implementations of many popular models, including non-GNN models and GNN-based ones. CogDL benefits from several unique techniques. First, utilizing sparse matrix operations, CogDL is capable of performing fast network embedding on large-scale networks. Second, CogDL has the ability to deal with different types of graph structures: attributed, multiplex, and heterogeneous networks. Third, CogDL supports parallel training: with different seeds and different models, CogDL performs training on multiple GPUs and reports the result table automatically. Finally, CogDL is extendable. New datasets, models, and tasks can be added without difficulty.

10.5 Open Resources for Relation Extraction

10.5.1 OpenNRE

OpenNRE16 [3] is an open-source framework for neural relation extraction, which aims to make it easy to build relation extraction (RE) models.

14https://graphvite.io/. 15http://keg.cs.tsinghua.edu.cn/cogdl/index.html. 16https://github.com/thunlp/OpenNRE.

Compared with other implementations, OpenNRE has four advantages. First, OpenNRE has implemented various state-of-the-art RE models, including models with attention mechanisms, adversarial learning, and reinforcement learning. Second, OpenNRE enjoys great system encapsulation. It divides the pipeline of relation extraction into four parts, namely embedding, encoder, selector (for distant supervision), and classifier, and for each part it has implemented several methods. System encapsulation makes it easy to train and test models by changing hyperparameters or specifying model architectures via Python arguments. Third, OpenNRE is extendable. Users can construct new RE models by choosing specific blocks provided in the four parts mentioned above and combining them freely, with only a few lines of code. Fourth, the framework has implemented multi-GPU learning, which is efficient.
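The sketch below is not OpenNRE's API; it is our own toy illustration of the embedding-encoder-classifier decomposition described above (the selector for distant supervision is omitted), with a CNN standing in for the encoder part and made-up sizes.

```python
import torch
import torch.nn as nn

class SentenceRE(nn.Module):
    """Toy sentence-level relation classifier following the
    embedding -> encoder -> classifier decomposition (selector omitted)."""
    def __init__(self, vocab_size, num_relations, emb_dim=50, hidden=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)              # embedding
        self.encoder = nn.Conv1d(emb_dim, hidden, kernel_size=3,        # encoder
                                 padding=1)
        self.classifier = nn.Linear(hidden, num_relations)              # classifier

    def forward(self, token_ids):
        x = self.embedding(token_ids).transpose(1, 2)       # (batch, emb, seq)
        h = torch.relu(self.encoder(x)).max(dim=2).values   # max pooling over time
        return self.classifier(h)                           # relation logits

model = SentenceRE(vocab_size=20000, num_relations=42)
logits = model(torch.randint(0, 20000, (8, 30)))   # a batch of 8 token-id sequences
print(logits.shape)                                # torch.Size([8, 42])
```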

References

1. Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc’aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pages 1223–1231, 2012. 2. Xu Han, Shulin Cao, Xin Lv, Yankai Lin, Zhiyuan Liu, Maosong Sun, and Juanzi Li. OpenKE: An open toolkit for knowledge embedding. In Proceedings of EMNLP: System Demonstrations, pages 139–144, 2018. 3. Xu Han, Tianyu Gao, Yuan Yao, Deming Ye, Zhiyuan Liu, and Maosong Sun. OpenNRE: An open and extensible toolkit for neural relation extraction. In Proceedings of EMNLP: System Demonstrations, pages 169–174, 2019.

Chapter 11 Outlook

Abstract The aforementioned representation learning models and methods have shown their effectiveness in various NLP scenarios and tasks. With the rapid growth of data scales and the development of computation devices, there are also new challenges and opportunities for the next stage of deep learning research. In this last chapter, we look into future directions of representation learning techniques for NLP. To be more specific, we consider the following directions: using more unsupervised data, utilizing fewer labeled data, employing deeper neural architectures, improving model interpretability, and fusing the advances from other areas.

11.1 Introduction

We have used ten chapters to introduce the advances of representation learning for NLP, covering both multi-grained language entries, including words, phrases, sentences, and documents, and closely related objects, including world knowledge, sememe knowledge, networks, and cross-modal data. The mentioned models and methods of representation learning for NLP have shown their effectiveness in various NLP scenarios and tasks. As shown by the unsatisfactory performance of most NLP systems in open domains and the recent great advances of pre-trained language models, representation learning for NLP is far from perfect. With the rapid growth of data scales and the development of computation devices, we are facing new challenges and opportunities for the next stage of research on representation learning and deep learning techniques. In this last chapter, we look into the future research and exploration directions of representation learning techniques for NLP. Since we have summarized the future work of each individual part in the summary section of each previous chapter, here we focus on discussing the general and important issues that should be addressed by representation learning for NLP.


For general representation learning for NLP, we conclude with the following directions: using more unsupervised data, utilizing fewer labeled data, employing deeper neural architectures, improving model interpretability, and fusing the advances from other areas.

11.2 Using More Unsupervised Data

The rapid development of Internet technology and the popularization of information digitization have brought massive text data for NLP research and applications. For example, the whole corpus of Wikipedia already contains more than 50 million articles (including 6 million articles in English)1 and is growing rapidly every day through collaborative work all over the world. The amount of user-generated content on many social platforms such as Twitter, Weibo, and Facebook also increases quickly, contributed by billions of users. It is worth considering these massive text data for learning better NLP models. However, due to the expensive cost of expert annotation, it is impossible to label such massive amounts of data for specific NLP tasks. Hence, an essential direction of NLP is how to take better advantage of unlabeled data for efficient unsupervised representation learning. Though without labeled annotations, unsupervised data can help initialize the randomized neural network parameters and thus improve the performance of downstream NLP tasks. This line of work usually employs a pipeline strategy: first pretrain the model parameters and then fine-tune these parameters on specific downstream NLP tasks. The recurrent language model [7], word embeddings [6], and pre-trained language models (PLM) such as BERT [3] all utilize unsupervised plain text to pretrain neural parameters and then benefit downstream supervised tasks via fine-tuning. Current state-of-the-art PLMs can still only learn from limited plain text due to limited learning efficiency and computation power. Moreover, there are various types of large-scale data online with abundant informative signals and labels, such as HTML tags, anchor text, keywords, document meta-information, and other structured and semi-structured data. How to take full advantage of large-scale Web text data has not been extensively studied. In the future, with better computation devices (e.g., GPUs) and data resources, we expect more advanced methods to be developed to utilize more unsupervised data.

1http://en.wikipedia.org/wiki.
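The pretrain-then-fine-tune pipeline can be sketched with the HuggingFace Transformers library (our choice for illustration, not something introduced in this book): a pretrained encoder is loaded from unsupervised pretraining, and a randomly initialized classification head is then trained on the labeled downstream task. Attribute names follow recent versions of the library.

```python
import torch
from transformers import BertTokenizer, BertModel

# Parameters pretrained on large-scale unsupervised plain text.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

# A task-specific head, randomly initialized and fine-tuned with labeled data.
classifier = torch.nn.Linear(encoder.config.hidden_size, 2)

inputs = tokenizer("representation learning for NLP", return_tensors="pt")
outputs = encoder(**inputs)
cls_vector = outputs.last_hidden_state[:, 0]   # the [CLS] representation
logits = classifier(cls_vector)                # fed to a downstream loss
```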

11.3 Utilizing Fewer Labeled Data

As NLP technologies become more powerful, people can explore more complicated and fine-grained problems. Taking text classification as an example, early work targeted flat classification with limited categories, while researchers are now more interested in classification with a hierarchical structure and a large number of classes. However, when a problem gets more complicated, it requires more knowledge from experts to annotate training instances for fine-grained tasks and increases the cost of data labeling. Therefore, we expect models or systems that can be developed efficiently with (very) few labeled data. When each class has only one or a few labeled instances, the problem becomes a one/few-shot learning problem. The few-shot learning problem originated in computer vision and has also been studied in NLP recently. For example, researchers have explored few-shot relation extraction [5], where each relation has a few labeled instances, and low-resource machine translation [11], where the size of the parallel corpus is limited.

A promising approach to few-shot learning is to compare the semantic similarity between the test instance and the labeled ones (i.e., the support set), and then make the prediction. The idea is similar to k-nearest neighbor classification (kNN) [10]. Since the key is to represent the semantic meaning of each instance for measuring semantic similarity, it has been verified that language models pretrained on unsupervised data and fine-tuned on the target few-shot domain are very effective for few-shot learning. Another approach to few-shot learning is to transfer models from related domains into the target domain with the few-shot problem [2]. This is usually called transfer learning or domain adaptation. For these methods, representation learning can also help the transfer or adaptation process by learning joint representations of both domains.

In the future, one may go beyond the abovementioned frameworks and design more appropriate methods according to the characteristics of NLP tasks and problems. The goal is to develop effective NLP methods with as little annotated data in the target domain as possible, by better utilizing unsupervised data that are much cheaper to obtain from the Web and existing supervised data from other domains. The exploration of the few-shot learning problem in NLP will help us develop data-efficient methods for language learning.
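To make the kNN-style comparison described above concrete, the sketch below classifies a query by cosine similarity to a tiny support set; the four-dimensional vectors and the relation labels are made up, and in practice the embeddings would come from a pretrained encoder.

```python
import numpy as np

def few_shot_predict(query_vec, support_vecs, support_labels):
    """Nearest-neighbor few-shot classification: assign the query the label
    of the most similar support instance (cosine similarity)."""
    def cosine(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
    sims = [cosine(query_vec, s) for s in support_vecs]
    return support_labels[int(np.argmax(sims))]

# A toy 1-shot support set with made-up "sentence embeddings".
support = [np.array([1.0, 0.0, 0.2, 0.1]),    # class "born_in"
           np.array([0.0, 1.0, 0.1, 0.3])]    # class "works_for"
labels = ["born_in", "works_for"]
query = np.array([0.9, 0.1, 0.2, 0.0])
print(few_shot_predict(query, support, labels))   # -> "born_in"
```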

11.4 Employing Deeper Neural Architectures

As the amount of available text data rapidly increases, the size of the training corpus for NLP tasks grows as well. With more training data, a natural way to boost model performance is to employ deeper neural architectures for modeling. Intuitively, deeper neural models with more sophisticated architectures and parameters can better fit the increasing data. Another motivation for using deeper architectures comes from the development of computation devices (e.g., GPUs). Current state-of-the-art methods are usually a compromise between efficiency and effectiveness. As computation devices operate faster, the time/space complexities of complicated models become acceptable, which motivates researchers to design more complex but effective models. To summarize, employing deeper neural architectures is one of the clear directions for representation learning in NLP.

Very deep neural network architectures have been widely used in computer vision. For example, the well-known VGG [8] network, which was proposed in the famous ImageNet contest, has 16 convolutional and fully connected layers. In NLP, the depths of neural architectures were relatively shallow until the Transformer [9] structure was proposed. Specifically, compared with word embeddings [6], which are based on shallow models, the state-of-the-art pre-trained language model BERT [3] can be regarded as a giant model that stacks 12 self-attention layers, each with 12 attention heads. BERT has demonstrated its effectiveness on a number of NLP tasks. Besides the well-designed model architecture and training objectives, the success of BERT also benefits from TPUs, which are among the most powerful devices for parallel computation. In contrast, it may take months or years for a single CPU to finish the training process of BERT. When these computation devices become widely available, we can expect more deep neural architectures to be developed for NLP as well.

11.5 Improving Model Interpretability

Model transparency and interpretability are hot topics in artificial intelligence and machine learning. Human-interpretable predictions are very important for decision-critical applications related to ethics, privacy, and safety. However, neural network models and deep learning techniques lack model transparency for human-interpretable predictions and are thus often treated as black boxes. Most NLP techniques based on neural networks and distributed representations are also hard to interpret, except for the attention mechanism, where the attention weights can be interpreted as the importance of the corresponding inputs. For the sake of employing representation learning techniques in decision-critical applications, there is a need to improve the interpretability and transparency of current representation learning and neural network models.

A recent survey [1] classifies interpretable machine learning methods into two main categories: interpretable models and post-hoc explainability techniques. Models that are understandable by themselves are called interpretable models; for example, linear models, decision trees, and rule-based systems are such transparent models. However, in most cases, we have to probe into the model with a second one for explanations, namely post-hoc explainability techniques. In NLP, there has been some research on visualizing neural models such as neural machine translation [4] for interpretable explanations. However, the understanding of most neural-based models remains unsolved. We look forward to more studies on improving model interpretability to facilitate the extensive use of representation learning methods for NLP.

11.6 Fusing the Advances from Other Areas

During the development of deep learning techniques, mutual learning between different research areas has never stopped. For example, Word2vec, published in 2013, aims to learn word embeddings from a large-scale text corpus and can be regarded as a milestone of representation learning for NLP. In 2014, the idea of Word2vec was adopted for learning node embeddings in a network/graph by treating random walks over the network as sentences, named DeepWalk; the analogical reasoning phenomenon learned by Word2vec, i.e., king − man = queen − woman, also inspired the representation learning of world knowledge, named TransE. Meanwhile, graph convolutional networks were first proposed for semi-supervised graph learning in 2016 and have recently been widely applied to many NLP tasks such as relation extraction and text classification. Another example is the Transformer model, which was proposed for neural machine translation at first and then transferred to computer vision, data mining, and many other areas.

The fusion also appears between two quite distant disciplines. We should recall again that the idea of distributed representation proposed in the 1980s was inspired by the neural computation scheme of humans and other animals. It took about 40 years to see the development of distributed representation and deep learning come to fruition. In fact, many ideas such as convolution in CNNs and the attention mechanism are inspired by the computation scheme of human cognition. Therefore, an intriguing direction of representation learning for NLP is to fuse the advances from other areas, including not only closely related areas in AI such as machine learning, computer vision, and data mining, but also, to some extent, more distant areas such as linguistics, brain science, psychology, and sociology. This line of work requires researchers to have sufficient knowledge of other fields.

References

1. Alejandro Barredo Arrieta, Natalia Díaz-Rodríguez, Javier Del Ser, Adrien Bennetot, Siham Tabik, Alberto Barbado, Salvador García, Sergio Gil-López, Daniel Molina, Richard Benjamins, et al. Explainable artificial intelligence (xai): Concepts, taxonomies, opportunities and challenges toward responsible ai. Information Fusion, 58:82–115, 2020. 2. Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, and Jia-Bin Huang. A closer look at few-shot classification. In Proceedings of ICLR, 2019. 3. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL, 2019. 4. Yanzhuo Ding, Yang Liu, Huanbo Luan, and Maosong Sun. Visualizing and understanding neural machine translation. In Proceedings of ACL, 2017. 5. Xu Han, Hao Zhu, Pengfei Yu, Ziyun Wang, Yuan Yao, Zhiyuan Liu, and Maosong Sun. FewRel: A large-scale supervised few-shot relation classification dataset with state-of-the-art evaluation. In Proceedings of EMNLP, 2018. 6. T Mikolov and J Dean. Distributed representations of words and phrases and their compositionality. In Proceedings of NeurIPS, 2013. 7. Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocký, and Sanjeev Khudanpur. Recurrent neural network based language model. In Proceedings of InterSpeech, 2010.

8. Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 9. Ashish Vaswani, Noam Shazeer, Niki Parmar, Llion Jones, Jakob Uszkoreit, Aidan N Gomez, and Lukasz Kaiser. Attention is all you need. In Proceedings of NeurIPS, 2017. 10. Yan Wang, Wei-Lun Chao, Kilian Q Weinberger, and Laurens van der Maaten. Simpleshot: Revisiting nearest-neighbor classification for few-shot learning. arXiv preprint arXiv:1911.04623, 2019. 11. Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. Transfer learning for low-resource neural machine translation. In Proceedings of EMNLP, 2016.
