On the Sentence Embeddings from Pre-Trained Language Models

Total pages: 16

File type: PDF, size: 1020 KB

On the Sentence Embeddings from Pre-trained Language Models

Bohan Li†‡*, Hao Zhou†, Junxian He‡, Mingxuan Wang†, Yiming Yang‡, Lei Li†
† ByteDance AI Lab
‡ Language Technologies Institute, Carnegie Mellon University
{zhouhao.nlp,wangmingxuan.89,lileilab}@bytedance.com
{bohanl1,junxianh,yiming}@cs.cmu.edu

Abstract

Pre-trained contextual representations like BERT have achieved great success in natural language processing. However, the sentence embeddings from the pre-trained language models without fine-tuning have been found to poorly capture the semantic meaning of sentences. In this paper, we argue that the semantic information in the BERT embeddings is not fully exploited. We first reveal the theoretical connection between the masked language model pre-training objective and the semantic similarity task, and then analyze the BERT sentence embeddings empirically. We find that BERT always induces a non-smooth, anisotropic semantic space of sentences, which harms its performance on semantic similarity. To address this issue, we propose to transform the anisotropic sentence embedding distribution into a smooth and isotropic Gaussian distribution through normalizing flows that are learned with an unsupervised objective. Experimental results show that our proposed BERT-flow method obtains significant performance gains over the state-of-the-art sentence embeddings on a variety of semantic textual similarity tasks. The code is available at https://github.com/bohanli/BERT-flow.

1 Introduction

Recently, pre-trained language models and their variants (Radford et al., 2019; Devlin et al., 2019; Yang et al., 2019; Liu et al., 2019) like BERT (Devlin et al., 2019) have been widely used as representations of natural language. Despite their great success on many NLP tasks through fine-tuning, the sentence embeddings from BERT without fine-tuning are significantly inferior in terms of semantic textual similarity (Reimers and Gurevych, 2019) – for example, they even underperform the GloVe (Pennington et al., 2014) embeddings, which are not contextualized and are trained with a much simpler model. Such issues hinder applying BERT sentence embeddings directly to many real-world scenarios where collecting labeled data is highly costly or even intractable.

In this paper, we aim to answer two major questions: (1) why do the BERT-induced sentence embeddings perform poorly at retrieving semantically similar sentences? Do they carry too little semantic information, or are the semantic meanings in these embeddings simply not exploited properly? (2) If the BERT embeddings capture enough semantic information that is merely hard to exploit directly, how can we make it easier to exploit without external supervision?

Towards this end, we first study the connection between the BERT pretraining objective and the semantic similarity task. Our analysis reveals that the sentence embeddings of BERT should intuitively reflect the semantic similarity between sentences, which contradicts our experimental observations. Inspired by Gao et al. (2019), who find that language modeling performance can be limited by a learned anisotropic word embedding space in which the word embeddings occupy a narrow cone, and Ethayarajh (2019), who finds that BERT word embeddings also suffer from anisotropy, we hypothesize that the sentence embeddings from BERT – computed as the average of context embeddings from the last layers [1] – may suffer from similar issues. Through empirical probing over the embeddings, we further observe that the BERT sentence embedding space is semantically non-smooth and poorly defined in some areas, which makes it hard to use directly with simple similarity metrics such as dot product or cosine similarity.

[*] The work was done when BL was an intern at ByteDance.
[1] In this paper, we compute the average of context embeddings from the last one or two layers as our sentence embeddings, since they are consistently better than the [CLS] vector, as shown in Reimers and Gurevych (2019).
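Footnote [1] describes the pooling used for all BERT sentence embeddings in this paper: average the context (token) embeddings of the last one or two layers. A minimal sketch of that pooling, assuming the HuggingFace transformers library and the bert-base-uncased checkpoint (illustrative choices, not the authors' released code):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def sentence_embeddings(sentences):
    # Mean-pool the token (context) embeddings of the last two layers,
    # ignoring padding positions via the attention mask.
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden_states = model(**batch).hidden_states      # input embeddings + 12 layers
    last_two = (hidden_states[-1] + hidden_states[-2]) / 2
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (last_two * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, 768)

emb = sentence_embeddings(["A man is playing a guitar.", "Someone plays the guitar."])
print(torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0).item())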
To address these issues, we propose to transform the BERT sentence embedding distribution into a smooth and isotropic Gaussian distribution through normalizing flows (Dinh et al., 2015), i.e., an invertible function parameterized by neural networks. Concretely, we learn a flow-based generative model to maximize the likelihood of generating BERT sentence embeddings from a standard Gaussian latent variable in an unsupervised fashion. During training, only the flow network is optimized while the BERT parameters remain unchanged. The learned flow, an invertible mapping function between the BERT sentence embeddings and the Gaussian latent variable, is then used to transform the BERT sentence embeddings into the Gaussian space. We name the proposed method BERT-flow.
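A minimal sketch of this training setup, using a small RealNVP-style affine-coupling flow trained by maximum likelihood under a standard Gaussian prior while the embeddings stay fixed. The architecture, layer sizes, and training loop are illustrative assumptions, not the authors' implementation; the learned map here goes from embedding space to latent space, i.e. the inverse direction of the generative story above.

import math
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    # RealNVP-style coupling: transform one half of the dimensions with an affine
    # map whose scale/shift are predicted from the other half, so the Jacobian is
    # triangular and its log-determinant is cheap to compute.
    def __init__(self, dim, hidden=256, flip=False):
        super().__init__()
        self.flip = flip
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(dim - self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * self.half),
        )

    def forward(self, u):
        a, b = u[:, :self.half], u[:, self.half:]
        if self.flip:
            a, b = b, a
        log_s, t = self.net(b).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)                          # keep scales bounded
        z_a = a * torch.exp(log_s) + t
        z = torch.cat([b, z_a] if self.flip else [z_a, b], dim=-1)
        return z, log_s.sum(dim=-1)                        # log |det Jacobian|

class SimpleFlow(nn.Module):
    def __init__(self, dim, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [AffineCoupling(dim, flip=(i % 2 == 1)) for i in range(n_layers)]
        )

    def forward(self, u):
        log_det = torch.zeros(u.size(0))
        for layer in self.layers:
            u, ld = layer(u)
            log_det = log_det + ld
        return u, log_det

def nll(flow, u):
    # Negative log-likelihood of embeddings u under the flow + standard Gaussian prior.
    z, log_det = flow(u)
    log_pz = -0.5 * (z ** 2).sum(dim=-1) - 0.5 * z.size(-1) * math.log(2 * math.pi)
    return -(log_pz + log_det).mean()

u = torch.randn(1024, 768)            # placeholder for frozen BERT sentence embeddings
flow = SimpleFlow(dim=768)
opt = torch.optim.Adam(flow.parameters(), lr=1e-3)
for step in range(200):
    opt.zero_grad()
    loss = nll(flow, u)
    loss.backward()
    opt.step()

calibrated, _ = flow(u)               # transformed vectors, used with cosine similarity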
We perform extensive experiments on 7 standard semantic textual similarity benchmarks without using any downstream supervision. Our empirical results demonstrate that the flow transformation consistently improves BERT, by up to 12.70 points and by 8.16 points on average, in terms of the Spearman correlation between cosine embedding similarity and human-annotated similarity. When combined with external supervision from natural language inference tasks (Bowman et al., 2015; Williams et al., 2018), our method outperforms the Sentence-BERT embeddings (Reimers and Gurevych, 2019), leading to new state-of-the-art performance. In addition to semantic similarity tasks, we apply our sentence embeddings to a question-answer entailment task, QNLI (Wang et al., 2019), directly and without task-specific supervision, and demonstrate the superiority of our approach. Moreover, our further analysis implies that BERT-induced similarity can correlate excessively with lexical similarity rather than semantic similarity, and that our proposed flow-based method can effectively remedy this problem.

2 Understanding the Sentence Embedding Space of BERT

To encode a sentence into a fixed-length vector with BERT, it is a convention either to compute the average of context embeddings in the last few layers of BERT, or to extract the BERT context embedding at the position of the [CLS] token. Note that no token is masked when producing sentence embeddings, which differs from pretraining.

Reimers and Gurevych (2019) demonstrate that such BERT sentence embeddings lag behind the state-of-the-art sentence embeddings in terms of semantic similarity. On the STS-B dataset, BERT sentence embeddings are even less competitive than averaged GloVe (Pennington et al., 2014) embeddings, a simple and non-contextualized baseline proposed several years ago. Nevertheless, this shortcoming has not been well understood in the existing literature.

Note that, as demonstrated by Reimers and Gurevych (2019), averaging context embeddings consistently outperforms the [CLS] embedding. Therefore, unless mentioned otherwise, we use the average of context embeddings as BERT sentence embeddings and do not distinguish between them in the rest of the paper.

2.1 The Connection between Semantic Similarity and BERT Pre-training

We consider a sequence of tokens x_{1:T} = (x_1, ..., x_T). Language modeling (LM) factorizes the joint probability p(x_{1:T}) in an autoregressive way, namely log p(x_{1:T}) = Σ_{t=1}^{T} log p(x_t | c_t), where the context c_t = x_{1:t-1}. To capture bidirectional context during pretraining, BERT proposes a masked language modeling (MLM) objective, which instead factorizes the probability of noisy reconstruction as p(x̄ | x̂) = Σ_{t=1}^{T} m_t p(x_t | c_t), where x̂ is a corrupted sequence, x̄ is the set of masked tokens, m_t equals 1 when x_t is masked and 0 otherwise, and the context c_t = x̂.

Note that both LM and MLM can be reduced to modeling the conditional distribution of a token x given a context c, which is typically formulated with a softmax function as

    p(x | c) = exp(h_c^T w_x) / Σ_{x'} exp(h_c^T w_{x'}).    (1)

Here the context embedding h_c is a function of c, usually heavily parameterized by a deep neural network (e.g., a Transformer (Vaswani et al., 2017)); the word embedding w_x is a function of x, parameterized by an embedding lookup table.
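A toy numeric illustration of Equation (1): the conditional distribution over the vocabulary is a softmax over dot products between the context embedding and the rows of the word embedding table (all shapes and values below are made up):

import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 8                        # toy vocabulary size and embedding dimension
W = rng.normal(size=(V, d))        # word embedding lookup table, one row per word
h_c = rng.normal(size=d)           # context embedding produced by the encoder

logits = W @ h_c                   # h_c^T w_x for every word x in the vocabulary
p = np.exp(logits - logits.max())  # subtract the max for numerical stability
p /= p.sum()                       # p(x | c), as in Equation (1)
print(p, p.sum())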
The similarity between BERT sentence embeddings can thus be reduced to the similarity between BERT context embeddings, h_c^T h_{c'}. [2] However, as shown in Equation (1), the pretraining of BERT does not explicitly involve the computation of h_c^T h_{c'}. Therefore, we can hardly derive a mathematical formulation of what h_c^T h_{c'} exactly represents.

[2] This is because we approximate BERT sentence embeddings with context embeddings and compute their dot product (or cosine similarity) as the model-predicted sentence similarity. Dot product is equivalent to cosine similarity when the embeddings are normalized to unit length.

Co-Occurrence Statistics as the Proxy for Semantic Similarity. Instead of directly analyzing h_c^T h_{c'}, we consider h_c^T w_x, the dot product between a context embedding h_c and a word embedding w_x. According to Yang et al. (2018), in a well-trained language model, h_c^T w_x can be approximately decomposed as follows:

    h_c^T w_x ≈ log p*(x | c) + λ_c                    (2)
              = PMI(x, c) + log p(x) + λ_c,            (3)

where PMI(x, c) = log [ p(x, c) / (p(x) p(c)) ] denotes the pointwise mutual information between x and c, and log p(x) and λ_c are word-specific and context-specific terms, respectively.

Higher-order context-context co-occurrence can also be inferred and propagated during pretraining. The update of a context embedding h_c can affect another context embedding h_{c'} in the above way, and similarly h_{c'} can further affect another h_{c''}. Therefore, the context embeddings can form an implicit interaction among themselves via higher-order co-occurrence relations.

2.2 Anisotropic Embedding Space Induces Poor Semantic Similarity

As discussed in Section 2.1, the pretraining of BERT should have implicitly encouraged semantically meaningful context embeddings. Why, then, do BERT sentence embeddings without fine-tuning yield unsatisfactory performance? To investigate the underlying problem of this failure, we use word embeddings as a surrogate, because words and contexts share the same embedding space.
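One common way to probe such anisotropy on this shared embedding space is to check whether randomly chosen word embeddings are far from orthogonal on average. The sketch below is an illustrative probe in that spirit, not the paper's exact analysis; the model choice and sample size are assumptions.

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")
emb = model.get_input_embeddings().weight.detach()        # (vocab_size, 768)

idx = torch.randperm(emb.size(0))[:2000]                  # random sample of word types
x = torch.nn.functional.normalize(emb[idx], dim=-1)
cos = x @ x.t()                                           # pairwise cosine similarities
off_diag = cos[~torch.eye(len(idx), dtype=torch.bool)]
print(f"mean pairwise cosine similarity: {off_diag.mean().item():.3f}")
# A mean far above 0 suggests the vectors occupy a narrow cone, i.e. anisotropy.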
Recommended publications
  • arXiv:2002.06235v1: Semantic Relatedness and Taxonomic Word Embeddings

    Semantic Relatedness and Taxonomic Word Embeddings. Magdalena Kacmajor (Innovation Exchange, IBM Ireland), John D. Kelleher (ADAPT Research Centre, Technological University Dublin), Filip Klubička (ADAPT Research Centre, Technological University Dublin), and Alfredo Maldonado (Underwriters Laboratories).

    Abstract: This paper connects a series of papers dealing with taxonomic word embeddings. It begins by noting that there are different types of semantic relatedness and that different lexical representations encode different forms of relatedness. A particularly important distinction within semantic relatedness is that of thematic versus taxonomic relatedness. Next, we present a number of experiments that analyse taxonomic embeddings that have been trained on a synthetic corpus that has been generated via a random walk over a taxonomy. These experiments demonstrate how the properties of the synthetic corpus, such as the percentage of rare words, are affected by the shape of the knowledge graph the corpus is generated from. Finally, we explore the interactions between the relative sizes of natural and synthetic corpora on the performance of embeddings when taxonomic and thematic embeddings are combined.

    1 Introduction. Deep learning has revolutionised natural language processing over the last decade. A key enabler of deep learning for natural language processing has been the development of word embeddings. One reason for this is that deep learning intrinsically involves the use of neural network models and these models only work with numeric inputs. Consequently, applying deep learning to natural language processing first requires developing a numeric representation for words. Word embeddings provide a way of creating numeric representations that have proven to have a number of advantages over traditional numeric representations of language.
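    The synthetic-corpus idea above (pseudo-sentences produced by random walks over a taxonomy, then fed to an ordinary embedding trainer) can be sketched as follows. The toy taxonomy, walk length, and the use of gensim's Word2Vec are illustrative assumptions, not the paper's actual setup.

    import random
    from gensim.models import Word2Vec

    # Toy taxonomy as an undirected graph: node -> neighbouring nodes.
    taxonomy = {
        "animal": ["mammal", "bird"],
        "mammal": ["animal", "dog", "cat"],
        "bird": ["animal", "sparrow"],
        "dog": ["mammal"],
        "cat": ["mammal"],
        "sparrow": ["bird"],
    }

    def random_walk(graph, start, length, rng):
        walk = [start]
        for _ in range(length - 1):
            walk.append(rng.choice(graph[walk[-1]]))
        return walk

    rng = random.Random(0)
    corpus = [random_walk(taxonomy, rng.choice(list(taxonomy)), 8, rng)
              for _ in range(2000)]        # synthetic "sentences"

    # Train ordinary word embeddings on the synthetic corpus.
    model = Word2Vec(corpus, vector_size=32, window=3, min_count=1, sg=1, epochs=10)
    print(model.wv.most_similar("dog", topn=3))   # taxonomic neighbours dominate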
  • Combining Word Embeddings with Semantic Resources

    AutoExtend: Combining Word Embeddings with Semantic Resources. Sascha Rothe (LMU Munich) and Hinrich Schütze (LMU Munich).

    We present AutoExtend, a system that combines word embeddings with semantic resources by learning embeddings for non-word objects like synsets and entities and learning word embeddings that incorporate the semantic information from the resource. The method is based on encoding and decoding the word embeddings and is flexible in that it can take any word embeddings as input and does not need an additional training corpus. The obtained embeddings live in the same vector space as the input word embeddings. A sparse tensor formalization guarantees efficiency and parallelizability. We use WordNet, GermaNet, and Freebase as semantic resources. AutoExtend achieves state-of-the-art performance on Word-in-Context Similarity and Word Sense Disambiguation tasks.

    1 Introduction. Unsupervised methods for learning word embeddings are widely used in natural language processing (NLP). The only data these methods need as input are very large corpora. However, in addition to corpora, there are many other resources that are undoubtedly useful in NLP, including lexical resources like WordNet and Wiktionary and knowledge bases like Wikipedia and Freebase. We will simply refer to these as resources. In this article, we present AutoExtend, a method for enriching these valuable resources with embeddings for non-word objects they describe; for example, AutoExtend enriches WordNet with embeddings for synsets. The word embeddings and the new non-word embeddings live in the same vector space. Many NLP applications benefit if non-word objects described by resources, such as synsets in WordNet, are also available as embeddings.
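    A much-simplified reading of the encode/decode idea above: treat each word vector as approximately a sum of vectors of the synsets it belongs to, and solve for the synset vectors. This is a crude least-squares stand-in, not AutoExtend's full formulation; the membership matrix and word vectors are made up.

    import numpy as np

    rng = np.random.default_rng(0)
    n_words, n_synsets, d = 6, 4, 8
    W = rng.normal(size=(n_words, d))      # pre-trained word embeddings (rows)

    # Binary membership matrix: M[i, j] = 1 if word i is a member of synset j.
    M = np.array([
        [1, 0, 0, 0],
        [1, 1, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 1, 0],
        [0, 0, 1, 1],
        [0, 0, 0, 1],
    ], dtype=float)

    # Simplified additive constraint: W ≈ M S. Solve for synset embeddings S
    # in the least-squares sense; S lives in the same space as the word vectors.
    S, *_ = np.linalg.lstsq(M, W, rcond=None)
    print(S.shape)                          # (n_synsets, d)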
  • Word-Like Character N-Gram Embedding

    Word-like Character n-gram Embedding. Geewook Kim, Kazuki Fukui, and Hidetoshi Shimodaira (Department of Systems Science, Graduate School of Informatics, Kyoto University; Mathematical Statistics Team, RIKEN Center for Advanced Intelligence Project).

    Abstract: We propose a new word embedding method called word-like character n-gram embedding, which learns distributed representations of words by embedding word-like character n-grams. Our method is an extension of the recently proposed segmentation-free word embedding, which directly embeds frequent character n-grams from a raw corpus. However, its n-gram vocabulary tends to contain too many non-word n-grams. We solved this problem by introducing an idea of expected word frequency. Compared to the previously proposed methods, our method can embed more words, along with the words that are not included in a given basic word dictionary. Since our method does not rely on word segmentation with rich word dictionaries, it is especially effective when ...

    [Table 1 in the paper lists the top-10 2-grams in Sina Weibo and 4-grams in Japanese Twitter (Experiment 1), comparing FNE with the proposed WNE; words are shown in boldface and space characters are marked.]

    From the introduction: [...] segmentation tools are used to determine word boundaries in the raw corpus. However, these segmenters require rich dictionaries for accurate segmentation, which are expensive to prepare and not always available.
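    The first step of the segmentation-free approach described above, collecting frequent character n-grams directly from raw text with no segmenter, can be sketched like this; the corpus, n-gram lengths, and frequency threshold are toy values, and the expected-word-frequency reweighting from the paper is not implemented.

    from collections import Counter

    raw = "すもももももももものうち"     # raw text with no word boundaries
    raw = raw * 50                        # pretend this is a large unsegmented corpus

    def char_ngrams(text, n_min=2, n_max=4):
        for n in range(n_min, n_max + 1):
            for i in range(len(text) - n + 1):
                yield text[i:i + n]

    counts = Counter(char_ngrams(raw))
    vocab = [g for g, c in counts.most_common() if c >= 30]   # keep frequent n-grams only
    print(vocab[:10])
    # These frequent n-grams would then be embedded; the paper additionally filters
    # out non-word n-grams using an expected word frequency.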
  • Learned in Speech Recognition: Contextual Acoustic Word Embeddings

    Learned in Speech Recognition: Contextual Acoustic Word Embeddings. Shruti Palaskar*, Vikas Raunak*, and Florian Metze (Carnegie Mellon University, Pittsburgh, PA, U.S.A.).

    Abstract: End-to-end acoustic-to-word speech recognition models have recently gained popularity because they are easy to train, scale well to large amounts of training data, and do not require a lexicon. In addition, word models may also be easier to integrate with downstream tasks such as spoken language understanding, because inference (search) is much simplified compared to phoneme, character or any other sort of sub-word units. In this paper, we describe methods to construct contextual acoustic word embeddings directly from a supervised sequence-to-sequence acoustic-to-word speech recognition model using the learned attention distribution. On a suite of 16 standard sentence evaluation tasks, our embeddings show competitive performance against a word2vec model trained on the ...

    From the introduction: [...] model [10, 11, 12] trained for direct Acoustic-to-Word (A2W) speech recognition [13]. Using this model, we jointly learn to automatically segment and classify input speech into individual words, hence getting rid of the problem of chunking or requiring pre-defined word boundaries. As our A2W model is trained at the utterance level, we show that we can not only learn acoustic word embeddings, but also learn them in the proper context of their containing sentence. We also evaluate our contextual acoustic word embeddings on a spoken language understanding task, demonstrating that they can be useful in non-transcription downstream tasks. Our main contributions in this paper are the following: 1. We demonstrate the usability of attention not only for aligning words to acoustic frames without any forced alignment but also for constructing Contextual Acoustic Word Embeddings (CAWE).
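    The construction described above, aggregating encoder frames into one vector per output word using the attention distribution, reduces to an attention-weighted sum. The frame features and attention weights below are random placeholders standing in for a trained A2W model's outputs.

    import numpy as np

    rng = np.random.default_rng(0)
    T, d, n_words = 120, 256, 5             # acoustic frames, feature dim, words in the utterance
    frames = rng.normal(size=(T, d))        # encoder states for one utterance (placeholder)
    attn = rng.random(size=(n_words, T))    # decoder attention weights per output word (placeholder)
    attn /= attn.sum(axis=1, keepdims=True) # normalize each word's attention over frames

    # Contextual acoustic word embedding for each word: attention-weighted sum of frames.
    cawe = attn @ frames                    # (n_words, d)
    print(cawe.shape)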
  • Knowledge-Powered Deep Learning for Word Embedding

    Knowledge-Powered Deep Learning for Word Embedding. Jiang Bian, Bin Gao, and Tie-Yan Liu (Microsoft Research). {jibian,bingao,tyliu}@microsoft.com

    Abstract: The basis of applying deep learning to solve natural language processing tasks is to obtain high-quality distributed representations of words, i.e., word embeddings, from large amounts of text data. However, text itself usually contains incomplete and ambiguous information, which makes it necessary to leverage extra knowledge to understand it. Fortunately, text itself already contains well-defined morphological and syntactic knowledge; moreover, the large amount of text on the Web enables the extraction of plenty of semantic knowledge. Therefore, it makes sense to design novel deep learning algorithms and systems in order to leverage the above knowledge to compute more effective word embeddings. In this paper, we conduct an empirical study on the capacity of leveraging morphological, syntactic, and semantic knowledge to achieve high-quality word embeddings. Our study explores these types of knowledge to define new bases for word representation, provide additional input information, and serve as auxiliary supervision in deep learning, respectively. Experiments on an analogical reasoning task, a word similarity task, and a word completion task have all demonstrated that knowledge-powered deep learning can enhance the effectiveness of word embedding.

    1 Introduction. With the rapid development of deep learning techniques in recent years, training complex and deep models on large amounts of data has drawn increasing attention, in order to solve a wide range of text mining and natural language processing (NLP) tasks [4, 1, 8, 13, 19, 20].
  • Local Homology of Word Embeddings

    Local Homology of Word Embeddings. Tadas Temčinas.

    Abstract: Topological data analysis (TDA) has been widely used to make progress on a number of problems. However, it seems that TDA application in natural language processing (NLP) is in its infancy. In this paper we try to bridge the gap by arguing why TDA tools are a natural choice when it comes to analysing word embedding data. We describe a parallelisable unsupervised learning algorithm based on local homology of datapoints and show some experimental results on word embedding data. We see that local homology of datapoints in word embedding data contains some information that can potentially be used to solve the word sense disambiguation problem.

    From the introduction: Intuitively, stratification is a decomposition of a topological space into manifold-like pieces. When thinking about stratification learning and word embeddings, it seems intuitive that vectors of words corresponding to the same broad topic would constitute a structure, which we might hope to be a manifold. Hence, for example, by looking at the intersections between those manifolds or singularities on the manifolds (both of which can be recovered using local homology-based algorithms (Nanda, 2017)) one might hope to find vectors of homonyms like 'bank' (which can mean either a river bank or a financial institution). This, in turn, has the potential to help solve the word sense disambiguation (WSD) problem in NLP, which is pinning down a particular meaning of a word used in a sentence when the word has multiple meanings. In this work we present a clustering algorithm based on local homology, which is more relaxed than the stratification ...
  • Word Embedding, Sense Embedding and Their Application to Word Sense Induction

    Word Embeddings, Sense Embeddings and their Application to Word Sense Induction. Linfeng Song (The University of Rochester, Computer Science Department, Rochester, NY 14627). Area Paper, April 2016.

    Abstract: This paper investigates the cutting-edge techniques for word embedding, sense embedding, and our evaluation results on large-scale datasets. Word embedding refers to a kind of method that learns a distributed dense vector for each word in a vocabulary. Traditional word embedding methods first obtain the co-occurrence matrix and then perform dimension reduction with PCA. Recent methods use neural language models that directly learn word vectors by predicting the context words of the target word. Moving one step forward, sense embedding learns a distributed vector for each sense of a word. They either define a sense as a cluster of contexts where the target word appears or define a sense based on a sense inventory. To evaluate the performance of the state-of-the-art sense embedding methods, I first compare them on the dominant word similarity datasets, then compare them on my experimental settings. In addition, I show that sense embedding is applicable to the task of word sense induction (WSI). In fact, we are the first to show that sense embedding methods are competitive on WSI, by building sense-embedding-based systems that demonstrate highly competitive performances on the SemEval 2010 WSI shared task. Finally, I propose several possible future research directions on word embedding and sense embedding. The University of Rochester Computer Science Department supported this work.

    Contents: 1 Introduction; 2 Word Embedding; 2.1 Skip-gram Model ...
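    The "traditional" pipeline mentioned above, building a word-word co-occurrence matrix and then reducing its dimensionality, in a few lines; the toy corpus and window size are made up, and truncated SVD stands in for PCA.

    import numpy as np

    corpus = [
        "the cat sat on the mat".split(),
        "the dog sat on the log".split(),
        "a cat and a dog played".split(),
    ]
    vocab = sorted({w for sent in corpus for w in sent})
    idx = {w: i for i, w in enumerate(vocab)}

    # Symmetric co-occurrence counts within a +-2 word window.
    C = np.zeros((len(vocab), len(vocab)))
    for sent in corpus:
        for i, w in enumerate(sent):
            for j in range(max(0, i - 2), min(len(sent), i + 3)):
                if j != i:
                    C[idx[w], idx[sent[j]]] += 1

    # Dimension reduction: keep the top-k singular directions as word vectors.
    U, s, _ = np.linalg.svd(C, full_matrices=False)
    k = 3
    vectors = U[:, :k] * s[:k]
    print(vectors[idx["cat"]])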
  • Learning Word Meta-Embeddings by Autoencoding

    Learning Word Meta-Embeddings by Autoencoding. Cong Bao and Danushka Bollegala (Department of Computer Science, University of Liverpool).

    Abstract: Distributed word embeddings have shown superior performances in numerous Natural Language Processing (NLP) tasks. However, their performances vary significantly across different tasks, implying that the word embeddings learnt by those methods capture complementary aspects of lexical semantics. Therefore, we believe that it is important to combine the existing word embeddings to produce more accurate and complete meta-embeddings of words. We model the meta-embedding learning problem as an autoencoding problem, where we would like to learn a meta-embedding space that can accurately reconstruct all source embeddings simultaneously. Thereby, the meta-embedding space is enforced to capture complementary information in different source embeddings via a coherent common embedding space. We propose three flavours of autoencoded meta-embeddings motivated by different requirements that must be satisfied by a meta-embedding. Our experimental results on a series of benchmark evaluations show that the proposed autoencoded meta-embeddings outperform the existing state-of-the-art meta-embeddings in multiple tasks.

    1 Introduction. Representing the meanings of words is a fundamental task in Natural Language Processing (NLP). A popular approach to represent the meaning of a word is to embed it in some fixed-dimensional ...
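    A minimal sketch of the autoencoding idea above: encode concatenated source embeddings into a shared meta-embedding and decode it back so that every source is reconstructed simultaneously. Dimensions, the two placeholder sources, and the training loop are illustrative assumptions, not one of the paper's three proposed flavours specifically.

    import torch
    import torch.nn as nn

    n_words, d1, d2, d_meta = 1000, 300, 200, 256
    src1 = torch.randn(n_words, d1)        # e.g. GloVe vectors (placeholder)
    src2 = torch.randn(n_words, d2)        # e.g. word2vec vectors (placeholder)

    encoder = nn.Linear(d1 + d2, d_meta)
    dec1 = nn.Linear(d_meta, d1)           # one decoder per source embedding
    dec2 = nn.Linear(d_meta, d2)

    params = list(encoder.parameters()) + list(dec1.parameters()) + list(dec2.parameters())
    opt = torch.optim.Adam(params, lr=1e-3)
    mse = nn.MSELoss()

    for step in range(500):
        opt.zero_grad()
        meta = torch.tanh(encoder(torch.cat([src1, src2], dim=-1)))
        loss = mse(dec1(meta), src1) + mse(dec2(meta), src2)   # reconstruct all sources
        loss.backward()
        opt.step()

    meta_embeddings = torch.tanh(encoder(torch.cat([src1, src2], dim=-1))).detach()
    print(meta_embeddings.shape)           # (n_words, d_meta)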
  • A Comparison of Word Embeddings and N-Gram Models for DBpedia Type and Invalid Entity Detection

    A Comparison of Word Embeddings and N-gram Models for DBpedia Type and Invalid Entity Detection. Hanqing Zhou, Amal Zouaq, and Diana Inkpen (School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, ON K1N 6N5, Canada). This article is an extended version of the conference paper: Hanqing Zhou, Amal Zouaq, and Diana Inkpen. DBpedia Entity Type Detection using Entity Embeddings and N-Gram Models. In Proceedings of the International Conference on Knowledge Engineering and Semantic Web (KESW 2017), Szczecin, Poland, 8–10 November 2017, pp. 309–322. Received: 6 November 2018; Accepted: 20 December 2018; Published: 25 December 2018.

    Abstract: This article presents and evaluates a method for the detection of DBpedia types and entities that can be used for knowledge base completion and maintenance. This method compares entity embeddings with traditional N-gram models coupled with clustering and classification. We tackle two challenges: (a) the detection of entity types, which can be used to detect invalid DBpedia types and assign DBpedia types for type-less entities; and (b) the detection of invalid entities in the resource description of a DBpedia entity. Our results show that entity embeddings outperform n-gram models for type and entity detection and can contribute to the improvement of DBpedia's quality, maintenance, and evolution.

    Keywords: semantic web; DBpedia; entity embedding; n-grams; type identification; entity identification; data mining; machine learning

    1 Introduction. The Semantic Web is defined by Berners-Lee et al. ...
  • arXiv:2007.00183v2 [eess.AS] 24 Nov 2020: Whole-Word Segmental Speech Recognition with Acoustic Word Embeddings

    Whole-Word Segmental Speech Recognition with Acoustic Word Embeddings. Bowen Shi, Shane Settle, and Karen Livescu (TTI-Chicago, USA).

    Abstract: Segmental models are sequence prediction models in which scores of hypotheses are based on entire variable-length segments of frames. We consider segmental models for whole-word ("acoustic-to-word") speech recognition, with the feature vectors defined using vector embeddings of segments. Such models are computationally challenging as the number of paths is proportional to the vocabulary size, which can be orders of magnitude larger than when using subword units like phones. We describe an efficient approach for end-to-end whole-word segmental models, with forward-backward and Viterbi decoding performed on a GPU and a simple segment scoring function that reduces space complexity. In addition, we investigate the use of pre-training via jointly trained acoustic word embeddings (AWEs) and acoustically grounded word embeddings (AGWEs) of written word labels. We find that word error rate can be reduced by a large margin by pre-training the acoustic segment representation with AWEs, and additional (smaller) gains can be obtained by pre-training the word prediction layer with AGWEs. Our final models improve over prior A2W models.

    Index Terms: speech recognition, segmental model, acoustic-to-word, acoustic word embeddings, pre-training

    [Fig. 1: Whole-word segmental model for speech recognition. Note: boundary frames are not shared.]

    From the introduction: [...] segmental models, where the sequence probability is computed based on segment scores instead of frame probabilities. Segmental models have a long history in speech recognition research, but they have been used primarily for phonetic recognition or as phone-level acoustic models [11–18].
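    Segmental decoding as described above scores whole variable-length segments rather than single frames. A toy Viterbi over precomputed segment scores (random placeholder scores, a capped segment length, and a tiny vocabulary, all made-up values) looks like this:

    import numpy as np

    rng = np.random.default_rng(0)
    T, V, max_len = 20, 6, 5              # frames, toy vocabulary size, max segment length
    # score[s, t, w]: log-score of frames [s, t) labelled as word w (placeholder values).
    score = rng.normal(size=(T, T + 1, V))

    NEG_INF = -1e9
    best = np.full(T + 1, NEG_INF)
    best[0] = 0.0
    back = [None] * (T + 1)

    for t in range(1, T + 1):
        for s in range(max(0, t - max_len), t):
            w = int(np.argmax(score[s, t]))        # best word label for this segment
            cand = best[s] + score[s, t, w]
            if cand > best[t]:
                best[t], back[t] = cand, (s, w)

    # Recover the best segmentation and word sequence by backtracking.
    segments, t = [], T
    while t > 0:
        s, w = back[t]
        segments.append((s, t, w))
        t = s
    print(list(reversed(segments)))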
  • Deep Learning for Natural Language Processing Creating Neural Networks with Python

    Deep Learning for Natural Language Processing: Creating Neural Networks with Python. Palash Goyal, Sumit Pandey, and Karan Jain (Bangalore, Karnataka, India). ISBN-13 (pbk): 978-1-4842-3684-0; ISBN-13 (electronic): 978-1-4842-3685-7; https://doi.org/10.1007/978-1-4842-3685-7. Library of Congress Control Number: 2018947502. Copyright © 2018 by Palash Goyal, Sumit Pandey, Karan Jain. Any source code or other supplementary material referenced by the author in this book is available to readers on GitHub via the book's product page, located at www.apress.com/978-1-4842-3684-0. For more detailed information, please visit www.apress.com/source-code.

    Contents: Introduction; Chapter 1: Introduction to Natural Language Processing and Deep Learning; Python Packages (NumPy, Pandas, SciPy); Introduction to Natural ...
  • Text Feature Mining Using Pre-Trained Word Embeddings

    Text Feature Mining Using Pre-trained Word Embeddings. Henrik Sjökvist. Degree Project in Mathematics (Financial Mathematics, 30 ECTS credits), second cycle, Degree Programme in Industrial Engineering and Management, KTH Royal Institute of Technology, School of Engineering Sciences, Stockholm, Sweden, 2018. Supervisor at Handelsbanken: Richard Henricsson; Supervisor and Examiner at KTH: Henrik Hult. TRITA-SCI-GRU 2018:167, MAT-E 2018:28.

    Abstract: This thesis explores a machine learning task where the data contains not only numerical features but also free-text features. In order to employ a supervised classifier and make predictions, the free-text features must be converted into numerical features. In this thesis, an algorithm is developed to perform that conversion. The algorithm uses a pre-trained word embedding model which maps each word to a vector. The vectors for multiple word embeddings belonging to the same sentence are then combined to form a single sentence embedding. The sentence embeddings for the whole dataset are clustered to identify distinct groups of free-text strings. The cluster labels are output as the numerical features. The algorithm is applied to a specific case concerning operational risk control in banking. The data consists of modifications made to trades in financial instruments. Each such modification comes with a short text string which documents the modification, a trader comment.
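    The algorithm described above (embed words, average them into a sentence embedding, cluster, and use the cluster id as a categorical feature) in sketch form; the tiny random embedding table, example comments, and number of clusters are made-up stand-ins for a real pre-trained model and dataset.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Stand-in for a pre-trained word embedding model: word -> vector.
    emb = {w: rng.normal(size=50) for w in
           "amended trade price correction fx rate counterparty typo fee booking".split()}

    comments = [
        "amended trade price",
        "fx rate correction",
        "counterparty typo",
        "fee booking correction",
    ]

    def sentence_embedding(text):
        vecs = [emb[w] for w in text.split() if w in emb]
        return np.mean(vecs, axis=0) if vecs else np.zeros(50)

    X = np.vstack([sentence_embedding(c) for c in comments])
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels)        # cluster ids become the numerical feature for each comment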