Improved Word Sense Disambiguation Using Pre-Trained Contextualized Word Representations

Total Pages: 16

File Type: PDF, Size: 1020 KB

Improved Word Sense Disambiguation Using Pre-Trained Contextualized Word Representations

Christian Hadiwinoto, Hwee Tou Ng, and Wee Chung Gan
Department of Computer Science, National University of Singapore
[email protected], [email protected], gan [email protected]

Abstract

Contextualized word representations are able to give different representations for the same word in different contexts, and they have been shown to be effective in downstream natural language processing tasks, such as question answering, named entity recognition, and sentiment analysis. However, evaluation on word sense disambiguation (WSD) in prior work shows that using contextualized word representations does not outperform the state-of-the-art approach that makes use of non-contextualized word embeddings. In this paper, we explore different strategies of integrating pre-trained contextualized word representations and our best strategy achieves accuracies exceeding the best prior published accuracies by significant margins on multiple benchmark WSD datasets.

1 Introduction

Word sense disambiguation (WSD) automatically assigns a pre-defined sense to a word in a text. Different senses of a word reflect the different meanings a word has in different contexts. Identifying the correct word sense given a context is crucial in natural language processing (NLP). Unfortunately, while it is easy for a human to infer the correct sense of a word given a context, it is a challenge for NLP systems. As such, WSD is an important task and it has been shown that WSD helps downstream NLP tasks, such as machine translation (Chan et al., 2007a) and information retrieval (Zhong and Ng, 2012).

A WSD system assigns a sense to a word by taking into account its context, comprising the other words in the sentence. This can be done through discrete word features, which typically involve surrounding words and collocations trained using a classifier (Lee et al., 2004; Ando, 2006; Chan et al., 2007b; Zhong and Ng, 2010). The classifier can also make use of continuous word representations of the surrounding words (Taghipour and Ng, 2015; Iacobacci et al., 2016). Neural WSD systems (Kågebäck and Salomonsson, 2016; Raganato et al., 2017b) feed the continuous word representations into a neural network that captures the whole sentence and the word representation in the sentence. However, in both approaches, the word representations are independent of the context.

Recently, pre-trained contextualized word representations (Melamud et al., 2016; McCann et al., 2017; Peters et al., 2018; Devlin et al., 2019) have been shown to improve downstream NLP tasks. Pre-trained contextualized word representations are obtained through neural sentence encoders trained on a huge amount of raw texts. When the resulting sentence encoder is fine-tuned on the downstream task, such as question answering, named entity recognition, and sentiment analysis, with much smaller annotated training data, it has been shown that the trained model, with the pre-trained sentence encoder component, achieves new state-of-the-art results on those tasks.

While demonstrating superior performance in downstream NLP tasks, pre-trained contextualized word representations are still reported to give lower accuracy compared to approaches that use non-contextualized word representations (Melamud et al., 2016; Peters et al., 2018) when evaluated on WSD. This seems counter-intuitive, as a neural sentence encoder better captures the surrounding context that serves as an important cue to disambiguate words.

In this paper, we explore different strategies of integrating pre-trained contextualized word representations for WSD. Our best strategy outperforms prior methods of incorporating pre-trained contextualized word representations and achieves new state-of-the-art accuracy on multiple benchmark WSD datasets.

The following sections are organized as follows. Section 2 presents related work. Section 3 describes our pre-trained contextualized word representation. Section 4 proposes different strategies to incorporate the contextualized word representation for WSD. Section 5 describes our experimental setup. Section 6 presents the experimental results. Section 7 discusses the findings from the experiments. Finally, Section 8 presents the conclusion.

2 Related Work

Continuous word representations in real-valued vectors, or commonly known as word embeddings, have been shown to help improve NLP performance. Initially, exploiting continuous representations was achieved by adding real-valued vectors as classification features (Turian et al., 2010). Taghipour and Ng (2015) fine-tuned non-contextualized word embeddings by a feed-forward neural network such that those word embeddings were more suited for WSD. The fine-tuned embeddings were incorporated into an SVM classifier. Iacobacci et al. (2016) explored different strategies of incorporating word embeddings and found that their best strategy involved exponential decay that decreased the contribution of surrounding word features as their distances to the target word increased.
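To make this kind of strategy concrete, the sketch below builds a context representation for a target word by averaging the embeddings of the surrounding words with weights that decay exponentially with their distance from the target. It is only a minimal illustration of the general idea, not the exact formulation of Iacobacci et al. (2016); the embedding lookup table, window size, and decay rate are placeholder assumptions.

```python
import numpy as np

def decayed_context_vector(tokens, target_idx, embeddings, dim=300,
                           window=10, alpha=0.1):
    """Weighted average of surrounding word embeddings, where the weight of a
    context word decays exponentially with its distance from the target word.
    `embeddings` is assumed to map a word string to a NumPy vector of size `dim`."""
    context = np.zeros(dim)
    total_weight = 0.0
    for i in range(max(0, target_idx - window),
                   min(len(tokens), target_idx + window + 1)):
        if i == target_idx:
            continue  # skip the target word itself
        vec = embeddings.get(tokens[i])
        if vec is None:
            continue  # out-of-vocabulary context word
        weight = np.exp(-alpha * abs(i - target_idx))  # exponential decay with distance
        context += weight * vec
        total_weight += weight
    return context / total_weight if total_weight > 0 else context
```

A vector produced this way can then be appended to the discrete features of a classifier such as an SVM, which is the general setting in which these embedding-based strategies were compared.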
The neural sequence tagging approach has also been explored for WSD. Kågebäck and Salomonsson (2016) proposed bidirectional long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) for WSD. They concatenated the hidden states of the forward and backward LSTMs and fed the concatenation into an affine transformation followed by softmax normalization, similar to the approach to incorporate a bidirectional LSTM adopted in sequence labeling tasks such as part-of-speech tagging and named entity recognition (Ma and Hovy, 2016). Raganato et al. (2017b) proposed a self-attention layer on top of the concatenated bidirectional LSTM hidden states for WSD and introduced multi-task learning with part-of-speech tagging and semantic labeling as auxiliary tasks. However, on average across the test sets, their approach did not outperform SVM with word embedding features. Subsequently, Luo et al. (2018) proposed the incorporation of glosses from WordNet in a bidirectional LSTM for WSD, and reported better results than both SVM and prior bidirectional LSTM models.

A neural language model (LM) is aimed at predicting a word given its surrounding context. As such, the resulting hidden representation vector captures the context of a word in a sentence. Melamud et al. (2016) designed context2vec, which is a one-layer bidirectional LSTM trained to maximize the similarity between the hidden state representation of the LSTM and the target word embedding. Peters et al. (2018) designed ELMo, which is a two-layer bidirectional LSTM language model trained to predict the next word in the forward LSTM and the previous word in the backward LSTM. In both models, WSD was evaluated by nearest neighbor matching between the test and training instance representations. However, despite training on a huge amount of raw texts, the resulting accuracies were still lower than those achieved by WSD approaches with pre-trained non-contextualized word representations.
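The nearest-neighbor protocol mentioned above can be sketched as follows: each sense-annotated training occurrence of a word contributes one contextual vector, and a test occurrence receives the sense of its most similar training vector (cosine similarity is used here). This is a simplified, hypothetical illustration of the evaluation scheme; in the cited work the vectors come from the respective pre-trained encoders (context2vec or ELMo).

```python
import numpy as np

def assign_sense(test_vec, train_examples):
    """Nearest-neighbor sense matching.

    `test_vec` is the contextualized vector of the target word in the test
    sentence; `train_examples` is a list of (sense_id, vector) pairs taken from
    sense-annotated training occurrences of the same word. Returns the sense
    of the most similar training occurrence.
    """
    def cosine(a, b):
        return float(np.dot(a, b) /
                     (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    best_sense, best_sim = None, -1.0
    for sense_id, train_vec in train_examples:
        sim = cosine(test_vec, train_vec)
        if sim > best_sim:
            best_sense, best_sim = sense_id, sim
    return best_sense
```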
End-to-end neural machine translation (NMT) (Sutskever et al., 2014; Bahdanau et al., 2015) learns to generate an output sequence given an input sequence, using an encoder-decoder model. The encoder captures the contextualized representation of the words in the input sentence for the decoder to generate the output sentence. Following this intuition, McCann et al. (2017) trained an encoder-decoder model on parallel texts and obtained pre-trained contextualized word representations from the encoder.

3 Pre-Trained Contextualized Word Representation

The contextualized word representation that we use is BERT (Devlin et al., 2019), which is a bidirectional transformer encoder model (Vaswani et al., 2017) pre-trained on billions of words of texts. There are two tasks on which the model is trained, i.e., masked word and next sentence prediction. In both tasks, prediction accuracy is determined by the ability of the model to understand the context.

A transformer encoder computes the representation of each word through an attention mechanism with respect to the surrounding words. Given a sentence $x_1^n$ of length $n$, the transformer computes the representation of each word $x_i$ through a multi-head attention mechanism, where the query vector is from $x_i$ and the key-value vector pairs are from the surrounding words $x_{i'}$ ($1 \le i' \le n$). The word representation produced by the transformer captures the contextual information of a word.

The attention mechanism can be viewed as mapping a query vector $\mathbf{q}$ and a set of key-value vector pairs $(\mathbf{k}, \mathbf{v})$ to an output vector. The attention function $\mathrm{A}(\cdot)$ computes the output vector, which is the weighted sum of the value vectors, and is defined as:

$$\mathrm{A}(\mathbf{q}, \mathbf{K}, \mathbf{V}, \rho) = \sum_{(\mathbf{k}, \mathbf{v}) \in (\mathbf{K}, \mathbf{V})} \alpha(\mathbf{q}, \mathbf{k}, \rho)\, \mathbf{v} \;\in\; \mathbb{R}^{d_v} \tag{1}$$

$$\alpha(\mathbf{q}, \mathbf{k}, \rho) = \frac{\exp(\rho\, \mathbf{k}^\top \mathbf{q})}{\sum_{\mathbf{k}' \in \mathbf{K}} \exp(\rho\, \mathbf{k}'^\top \mathbf{q})} \tag{2}$$

where $\mathbf{K}$ and $\mathbf{V}$ are matrices containing the key vectors and the value vectors of the words in the sentence respectively, and $\alpha(\mathbf{q}, \mathbf{k}, \rho)$ is a scalar attention weight between $\mathbf{q}$ and $\mathbf{k}$, re-scaled by a scalar $\rho$.

For the hidden layer $l$ ($1 \le l \le L$), the self-attention sub-layer output $\mathbf{f}_i^l$ is computed as follows:

$$\mathbf{f}_i^l = \mathrm{LayerNorm}(\boldsymbol{\chi}_{h,i}^l + \mathbf{h}_i^{l-1})$$
$$\boldsymbol{\chi}_{h,i}^l = \mathrm{MH}_h(\mathbf{h}_i^{l-1}, \mathbf{h}_{1:n}^{l-1}, \mathbf{h}_{1:n}^{l-1}, \rho)$$

where LayerNorm refers to layer normalization (Ba et al., 2016) and the superscript $l$ and subscript $h$ indicate that each encoder layer $l$ has an independent set of multi-head attention weight parameters (see Equations 3 and 4). The input for the first layer is $\mathbf{h}_i^0 = E(x_i)$, which is the non-contextualized word embedding of $x_i$.

The second sub-layer consists of the position-wise fully connected FFNN, computed as: […]
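To make the attention function above concrete, Equations (1) and (2) can be transcribed almost literally into code: the attention weights are a softmax over the re-scaled query-key dot products, and the output is the corresponding weighted sum of the value vectors. This is a minimal single-query NumPy sketch for illustration, not the paper's implementation; in practice the computation is batched over all positions and heads, and ρ is typically chosen as 1/√d_k.

```python
import numpy as np

def attention(q, K, V, rho):
    """Attention function A(q, K, V, rho) of Equations (1) and (2).

    q   : query vector, shape (d_k,)
    K   : key matrix, one key vector per row, shape (n, d_k)
    V   : value matrix, one value vector per row, shape (n, d_v)
    rho : scalar re-scaling factor applied to the dot products
    """
    scores = rho * (K @ q)                         # rho * k^T q for every key k
    scores -= scores.max()                         # numerical stability; weights unchanged
    alpha = np.exp(scores) / np.exp(scores).sum()  # Equation (2): softmax attention weights
    return alpha @ V                               # Equation (1): weighted sum of value vectors

# Example: one target word attending over a 5-word sentence.
rng = np.random.default_rng(0)
n, d_k, d_v = 5, 64, 64
q = rng.normal(size=d_k)
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
output = attention(q, K, V, rho=1.0 / np.sqrt(d_k))
print(output.shape)  # (64,)
```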
Recommended publications
  • arXiv:2002.06235v1
Semantic Relatedness and Taxonomic Word Embeddings
Magdalena Kacmajor (Innovation Exchange, IBM Ireland), John D. Kelleher (ADAPT Research Centre, Technological University Dublin), Filip Klubička (ADAPT Research Centre, Technological University Dublin), Alfredo Maldonado (Underwriters Laboratories)

Abstract

This paper connects a series of papers dealing with taxonomic word embeddings. It begins by noting that there are different types of semantic relatedness and that different lexical representations encode different forms of relatedness. A particularly important distinction within semantic relatedness is that of thematic versus taxonomic relatedness. Next, we present a number of experiments that analyse taxonomic embeddings that have been trained on a synthetic corpus that has been generated via a random walk over a taxonomy. These experiments demonstrate how the properties of the synthetic corpus, such as the percentage of rare words, are affected by the shape of the knowledge graph the corpus is generated from. Finally, we explore the interactions between the relative sizes of natural and synthetic corpora on the performance of embeddings when taxonomic and thematic embeddings are combined.

1 Introduction

Deep learning has revolutionised natural language processing over the last decade. A key enabler of deep learning for natural language processing has been the development of word embeddings. One reason for this is that deep learning intrinsically involves the use of neural network models and these models only work with numeric inputs. Consequently, applying deep learning to natural language processing first requires developing a numeric representation for words. Word embeddings provide a way of creating numeric representations that have proven to have a number of advantages over traditional numeric representations of language.
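As a rough illustration of how such a synthetic corpus can be produced, the sketch below performs uniform random walks over a small taxonomy graph and emits each walk as a pseudo-sentence, which can then be fed to any standard word-embedding trainer. The toy graph, number of walks, and walk length are placeholder assumptions, not the settings used in the paper.

```python
import random

def random_walk_corpus(taxonomy, num_walks=10, walk_length=8, seed=42):
    """Generate pseudo-sentences by random walks over a taxonomy.

    `taxonomy` is an adjacency list {node: [neighbour, ...]} in which an edge
    links a concept to its hypernyms/hyponyms. Each walk becomes one
    pseudo-sentence (a list of node names) in the synthetic corpus.
    """
    rng = random.Random(seed)
    corpus = []
    for _ in range(num_walks):
        for start in taxonomy:
            walk = [start]
            node = start
            for _ in range(walk_length - 1):
                neighbours = taxonomy.get(node)
                if not neighbours:
                    break
                node = rng.choice(neighbours)
                walk.append(node)
            corpus.append(walk)
    return corpus

toy_taxonomy = {
    "animal": ["dog", "cat"],
    "dog": ["animal", "poodle"],
    "cat": ["animal"],
    "poodle": ["dog"],
}
for sentence in random_walk_corpus(toy_taxonomy, num_walks=1, walk_length=4):
    print(" ".join(sentence))
```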
  • Combining Word Embeddings with Semantic Resources
AutoExtend: Combining Word Embeddings with Semantic Resources
Sascha Rothe (LMU Munich), Hinrich Schütze (LMU Munich)

We present AutoExtend, a system that combines word embeddings with semantic resources by learning embeddings for non-word objects like synsets and entities and learning word embeddings that incorporate the semantic information from the resource. The method is based on encoding and decoding the word embeddings and is flexible in that it can take any word embeddings as input and does not need an additional training corpus. The obtained embeddings live in the same vector space as the input word embeddings. A sparse tensor formalization guarantees efficiency and parallelizability. We use WordNet, GermaNet, and Freebase as semantic resources. AutoExtend achieves state-of-the-art performance on Word-in-Context Similarity and Word Sense Disambiguation tasks.

1. Introduction

Unsupervised methods for learning word embeddings are widely used in natural language processing (NLP). The only data these methods need as input are very large corpora. However, in addition to corpora, there are many other resources that are undoubtedly useful in NLP, including lexical resources like WordNet and Wiktionary and knowledge bases like Wikipedia and Freebase. We will simply refer to these as resources. In this article, we present AutoExtend, a method for enriching these valuable resources with embeddings for non-word objects they describe; for example, AutoExtend enriches WordNet with embeddings for synsets. The word embeddings and the new non-word embeddings live in the same vector space. Many NLP applications benefit if non-word objects described by resources—such as synsets in WordNet—are also available as embeddings.
  • Word-Like Character N-Gram Embedding
Word-like character n-gram embedding
Geewook Kim, Kazuki Fukui and Hidetoshi Shimodaira
Department of Systems Science, Graduate School of Informatics, Kyoto University
Mathematical Statistics Team, RIKEN Center for Advanced Intelligence Project

Abstract

We propose a new word embedding method called word-like character n-gram embedding, which learns distributed representations of words by embedding word-like character n-grams. Our method is an extension of recently proposed segmentation-free word embedding, which directly embeds frequent character n-grams from a raw corpus. However, its n-gram vocabulary tends to contain too many non-word n-grams. We solved this problem by introducing an idea of expected word frequency. Compared to the previously proposed methods, our method can embed more words, along with the words that are not included in a given basic word dictionary. Since our method does not rely on word segmentation with rich word dictionaries, it is especially effective when the […]

Table 1: Top-10 2-grams in Sina Weibo and 4-grams in Japanese Twitter (Experiment 1). Words are indicated by boldface and space characters are marked by ␣.

Rank | FNE (Chinese) | FNE (Japanese) | WNE, proposed (Chinese) | WNE, proposed (Japanese)
1 | ][ | wwww | 自己 | フォロー
2 | 。␣ | !!!! | 。␣ | ありがと
3 | !␣ | ありがと | ][ | wwww
4 | .. | りがとう | 一个 | !!!!
5 | ]␣ | ございま | 微博 | めっちゃ
6 | 。。 | うござい | 什么 | んだけど
7 | ,我 | とうござ | 可以 | うござい
8 | !! | ざいます | 没有 | line
9 | ␣我 | がとうご | 吗? | 2018
10 | 了, | ください | 哈哈 | じゃない

[…]tion tools are used to determine word boundaries in the raw corpus. However, these segmenters require rich dictionaries for accurate segmentation, which are expensive to prepare and not always available.
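For context, the segmentation-free baseline that this method extends can be sketched as follows: enumerate the character n-grams of a raw, unsegmented corpus and keep the frequent ones as the vocabulary to embed. The proposed expected-word-frequency re-weighting that filters out non-word n-grams is not shown here; the toy corpus, n-gram lengths, and frequency cut-off are placeholders.

```python
from collections import Counter

def frequent_char_ngrams(corpus, n_min=1, n_max=4, min_count=2):
    """Collect character n-grams (lengths n_min..n_max) from raw, unsegmented
    text and keep those occurring at least `min_count` times. These frequent
    n-grams form the vocabulary that is subsequently embedded."""
    counts = Counter()
    for line in corpus:
        for n in range(n_min, n_max + 1):
            for i in range(len(line) - n + 1):
                counts[line[i:i + n]] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}

corpus = ["ありがとうございます", "ありがとう", "フォローありがとう"]
vocab = frequent_char_ngrams(corpus, n_min=2, n_max=4, min_count=2)
print(sorted(vocab.items(), key=lambda kv: -kv[1])[:5])
```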
  • Learned in Speech Recognition: Contextual Acoustic Word Embeddings
LEARNED IN SPEECH RECOGNITION: CONTEXTUAL ACOUSTIC WORD EMBEDDINGS
Shruti Palaskar, Vikas Raunak and Florian Metze
Carnegie Mellon University, Pittsburgh, PA, U.S.A.

ABSTRACT

End-to-end acoustic-to-word speech recognition models have recently gained popularity because they are easy to train, scale well to large amounts of training data, and do not require a lexicon. In addition, word models may also be easier to integrate with downstream tasks such as spoken language understanding, because inference (search) is much simplified compared to phoneme, character or any other sort of sub-word units. In this paper, we describe methods to construct contextual acoustic word embeddings directly from a supervised sequence-to-sequence acoustic-to-word speech recognition model using the learned attention distribution. On a suite of 16 standard sentence evaluation tasks, our embeddings show competitive performance against a word2vec model trained on the […]

[…] model [10, 11, 12] trained for direct Acoustic-to-Word (A2W) speech recognition [13]. Using this model, we jointly learn to automatically segment and classify input speech into individual words, hence getting rid of the problem of chunking or requiring pre-defined word boundaries. As our A2W model is trained at the utterance level, we show that we can not only learn acoustic word embeddings, but also learn them in the proper context of their containing sentence. We also evaluate our contextual acoustic word embeddings on a spoken language understanding task, demonstrating that they can be useful in non-transcription downstream tasks. Our main contributions in this paper are the following: 1. We demonstrate the usability of attention not only for aligning words to acoustic frames without any forced alignment but also for constructing Contextual Acoustic Word Embeddings (CAWE).
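The central operation described here, turning the learned attention distribution into a word embedding, can be illustrated as attention-weighted pooling of the acoustic encoder states attended to while a word is emitted. The sketch below is a schematic stand-in with random placeholder tensors, not the authors' model.

```python
import numpy as np

def attention_pooled_word_embedding(encoder_states, attention_weights):
    """Build an acoustic word embedding by attention-weighted pooling.

    encoder_states    : (num_frames, hidden_dim) acoustic encoder outputs
    attention_weights : (num_frames,) attention distribution produced while
                        the decoder emitted one word (sums to 1)
    """
    return attention_weights @ encoder_states  # (hidden_dim,)

rng = np.random.default_rng(0)
frames = rng.normal(size=(120, 256))   # placeholder encoder states
weights = rng.random(120)
weights /= weights.sum()               # normalize into an attention distribution
print(attention_pooled_word_embedding(frames, weights).shape)  # (256,)
```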
  • Knowledge-Powered Deep Learning for Word Embedding
Knowledge-Powered Deep Learning for Word Embedding
Jiang Bian, Bin Gao, and Tie-Yan Liu
Microsoft Research
{jibian,bingao,tyliu}@microsoft.com

Abstract. The basis of applying deep learning to solve natural language processing tasks is to obtain high-quality distributed representations of words, i.e., word embeddings, from large amounts of text data. However, text itself usually contains incomplete and ambiguous information, which makes it necessary to leverage extra knowledge to understand it. Fortunately, text itself already contains well-defined morphological and syntactic knowledge; moreover, the large amount of texts on the Web enables the extraction of plenty of semantic knowledge. Therefore, it makes sense to design novel deep learning algorithms and systems in order to leverage the above knowledge to compute more effective word embeddings. In this paper, we conduct an empirical study on the capacity of leveraging morphological, syntactic, and semantic knowledge to achieve high-quality word embeddings. Our study explores these types of knowledge to define a new basis for word representation, provide additional input information, and serve as auxiliary supervision in deep learning, respectively. Experiments on an analogical reasoning task, a word similarity task, and a word completion task have all demonstrated that knowledge-powered deep learning can enhance the effectiveness of word embedding.

1 Introduction

With the rapid development of deep learning techniques in recent years, increasing attention has been drawn to training complex and deep models on large amounts of data, in order to solve a wide range of text mining and natural language processing (NLP) tasks [4, 1, 8, 13, 19, 20].
  • Local Homology of Word Embeddings
Local Homology of Word Embeddings
Tadas Temčinas

Abstract

Topological data analysis (TDA) has been widely used to make progress on a number of problems. However, it seems that TDA application in natural language processing (NLP) is at its infancy. In this paper we try to bridge the gap by arguing why TDA tools are a natural choice when it comes to analysing word embedding data. We describe a parallelisable unsupervised learning algorithm based on local homology of datapoints and show some experimental results on word embedding data. We see that local homology of datapoints in word embedding data contains some information that can potentially be used to solve the word sense disambiguation problem.

Intuitively, stratification is a decomposition of a topological space into manifold-like pieces. When thinking about stratification learning and word embeddings, it seems intuitive that vectors of words corresponding to the same broad topic would constitute a structure, which we might hope to be a manifold. Hence, for example, by looking at the intersections between those manifolds or singularities on the manifolds (both of which can be recovered using local homology-based algorithms (Nanda, 2017)) one might hope to find vectors of homonyms like 'bank' (which can mean either a river bank, or a financial institution). This, in turn, has potential to help solve the word sense disambiguation (WSD) problem in NLP, which is pinning down a particular meaning of a word used in a sentence when the word has multiple meanings. In this work we present a clustering algorithm based on local homology, which is more relaxed than the stratification-[…]
  • Word Embedding, Sense Embedding and Their Application to Word Sense Induction
Word Embeddings, Sense Embeddings and their Application to Word Sense Induction
Linfeng Song
The University of Rochester, Computer Science Department, Rochester, NY 14627
Area Paper, April 2016

Abstract

This paper investigates the cutting-edge techniques for word embedding, sense embedding, and our evaluation results on large-scale datasets. Word embedding refers to a kind of methods that learn a distributed dense vector for each word in a vocabulary. Traditional word embedding methods first obtain the co-occurrence matrix then perform dimension reduction with PCA. Recent methods use neural language models that directly learn word vectors by predicting the context words of the target word. Moving one step forward, sense embedding learns a distributed vector for each sense of a word. They either define a sense as a cluster of contexts where the target word appears or define a sense based on a sense inventory. To evaluate the performance of the state-of-the-art sense embedding methods, I first compare them on the dominant word similarity datasets, then compare them on my experimental settings. In addition, I show that sense embedding is applicable to the task of word sense induction (WSI). Actually we are the first to show that sense embedding methods are competitive on WSI by building sense-embedding-based systems that demonstrate highly competitive performances on the SemEval 2010 WSI shared task. Finally, I propose several possible future research directions on word embedding and sense embedding. The University of Rochester Computer Science Department supported this work.

Contents: 1 Introduction; 2 Word Embedding; 2.1 Skip-gram Model […]
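To make the "predicting the context words of the target word" formulation concrete, the sketch below enumerates the (target, context) training pairs used by a skip-gram style word embedding model; the model is then trained so that the observed context words become probable given the target. The toy sentence and window size are placeholder assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (target, context) pairs as used to train a skip-gram model:
    every word within `window` positions of the target is a positive
    context example for that target."""
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield target, tokens[j]

sentence = "the bank approved the loan".split()
print(list(skipgram_pairs(sentence, window=2)))
```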
  • Learning Word Meta-Embeddings by Autoencoding
Learning Word Meta-Embeddings by Autoencoding
Cong Bao (Department of Computer Science, University of Liverpool), Danushka Bollegala (Department of Computer Science, University of Liverpool)

Abstract

Distributed word embeddings have shown superior performances in numerous Natural Language Processing (NLP) tasks. However, their performances vary significantly across different tasks, implying that the word embeddings learnt by those methods capture complementary aspects of lexical semantics. Therefore, we believe that it is important to combine the existing word embeddings to produce more accurate and complete meta-embeddings of words. We model the meta-embedding learning problem as an autoencoding problem, where we would like to learn a meta-embedding space that can accurately reconstruct all source embeddings simultaneously. Thereby, the meta-embedding space is enforced to capture complementary information in different source embeddings via a coherent common embedding space. We propose three flavours of autoencoded meta-embeddings motivated by different requirements that must be satisfied by a meta-embedding. Our experimental results on a series of benchmark evaluations show that the proposed autoencoded meta-embeddings outperform the existing state-of-the-art meta-embeddings in multiple tasks.

1 Introduction

Representing the meanings of words is a fundamental task in Natural Language Processing (NLP). A popular approach to represent the meaning of a word is to embed it in some fixed-dimensional […]
  • A Comparison of Word Embeddings and N-Gram Models for DBpedia Type and Invalid Entity Detection †
    information Article A Comparison of Word Embeddings and N-gram Models for DBpedia Type and Invalid Entity Detection † Hanqing Zhou *, Amal Zouaq and Diana Inkpen School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa ON K1N 6N5, Canada; [email protected] (A.Z.); [email protected] (D.I.) * Correspondence: [email protected]; Tel.: +1-613-562-5800 † This paper is an extended version of our conference paper: Hanqing Zhou, Amal Zouaq, and Diana Inkpen. DBpedia Entity Type Detection using Entity Embeddings and N-Gram Models. In Proceedings of the International Conference on Knowledge Engineering and Semantic Web (KESW 2017), Szczecin, Poland, 8–10 November 2017, pp. 309–322. Received: 6 November 2018; Accepted: 20 December 2018; Published: 25 December 2018 Abstract: This article presents and evaluates a method for the detection of DBpedia types and entities that can be used for knowledge base completion and maintenance. This method compares entity embeddings with traditional N-gram models coupled with clustering and classification. We tackle two challenges: (a) the detection of entity types, which can be used to detect invalid DBpedia types and assign DBpedia types for type-less entities; and (b) the detection of invalid entities in the resource description of a DBpedia entity. Our results show that entity embeddings outperform n-gram models for type and entity detection and can contribute to the improvement of DBpedia’s quality, maintenance, and evolution. Keywords: semantic web; DBpedia; entity embedding; n-grams; type identification; entity identification; data mining; machine learning 1. Introduction The Semantic Web is defined by Berners-Lee et al.
  • arXiv:2007.00183v2 [eess.AS] 24 Nov 2020
WHOLE-WORD SEGMENTAL SPEECH RECOGNITION WITH ACOUSTIC WORD EMBEDDINGS
Bowen Shi, Shane Settle, Karen Livescu
TTI-Chicago, USA

ABSTRACT

Segmental models are sequence prediction models in which scores of hypotheses are based on entire variable-length segments of frames. We consider segmental models for whole-word ("acoustic-to-word") speech recognition, with the feature vectors defined using vector embeddings of segments. Such models are computationally challenging as the number of paths is proportional to the vocabulary size, which can be orders of magnitude larger than when using subword units like phones. We describe an efficient approach for end-to-end whole-word segmental models, with forward-backward and Viterbi decoding performed on a GPU and a simple segment scoring function that reduces space complexity. In addition, we investigate the use of pre-training via jointly trained acoustic word embeddings (AWEs) and acoustically grounded word embeddings (AGWEs) of written word labels. We find that word error rate can be reduced by a large margin by pre-training the acoustic segment representation with AWEs, and additional (smaller) gains can be obtained by pre-training the word prediction layer with AGWEs. Our final models improve over prior A2W models.

Index Terms— speech recognition, segmental model, acoustic-to-word, acoustic word embeddings, pre-training

Fig. 1. Whole-word segmental model for speech recognition. Note: boundary frames are not shared.

[…] segmental models, where the sequence probability is computed based on segment scores instead of frame probabilities. Segmental models have a long history in speech recognition research, but they have been used primarily for phonetic recognition or as phone-level acoustic models [11–18].
  • arXiv:2104.05274v2 [cs.CL] 27 Aug 2021
Learning to Remove: Towards Isotropic Pre-trained BERT Embedding
Yuxin Liang1, Rui Cao1, Jie Zheng1, Jie Ren2, and Ling Gao1
1 Northwest University, Xi'an, China
2 Shaanxi Normal University, Xi'an, China

Abstract. Research in word representation shows that isotropic embeddings can significantly improve performance on downstream tasks. However, we measure and analyze the geometry of pre-trained BERT embedding and find that it is far from isotropic. We find that the word vectors are not centered around the origin, and the average cosine similarity between two random words is much higher than zero, which indicates that the word vectors are distributed in a narrow cone and deteriorate the representation capacity of word embedding. We propose a simple, and yet effective method to fix this problem: remove several dominant directions of BERT embedding with a set of learnable weights. We train the weights on word similarity tasks and show that the processed embedding is more isotropic. Our method is evaluated on three standardized tasks: word similarity, word analogy, and semantic textual similarity. In all tasks, the word embedding processed by our method consistently outperforms the original embedding (with average improvement of 13% on word analogy and 16% on semantic textual similarity) and two baseline methods. Our method is also proven to be more robust to changes of hyperparameter.

Keywords: Natural language processing · Pre-trained embedding · Word representation · Anisotropic.

1 Introduction

With the rise of Transformers [12], its derivative model BERT [1] stormed the NLP field due to its excellent performance in various tasks.
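A fixed, non-learned variant of this idea can be sketched by centering the embedding matrix and projecting out its top principal directions; the paper instead learns a weight for each dominant direction on word-similarity data, which is not reproduced here. The embedding matrix below is a random placeholder standing in for pre-trained BERT vectors.

```python
import numpy as np

def remove_dominant_directions(embeddings, k=3):
    """Post-process word embeddings by centering them and projecting out the
    top-k dominant (principal) directions.

    This is a fixed, non-learned variant of the idea: the paper above instead
    learns a weight per direction. `embeddings` has one word vector per row,
    shape (vocab_size, dim).
    """
    centered = embeddings - embeddings.mean(axis=0)   # re-center around the origin
    # Rows of Vt are the dominant directions of the centered embedding matrix.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    top = Vt[:k]                                      # (k, dim)
    # Subtract each vector's projection onto the top-k directions.
    return centered - (centered @ top.T) @ top

vectors = np.random.default_rng(0).normal(size=(1000, 768))
iso = remove_dominant_directions(vectors, k=3)
print(iso.shape)  # (1000, 768)
```

Because only the top few directions are removed, most of the variance in the embeddings is preserved while the mean shift and narrow-cone effect described above are reduced.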
  • A Survey of Embeddings in Clinical Natural Language Processing
This paper was published in the Journal of Biomedical Informatics. An updated version is available at https://doi.org/10.1016/j.jbi.2019.103323

SECNLP: A Survey of Embeddings in Clinical Natural Language Processing
Kalyan KS, S. Sangeetha
Text Analytics and Natural Language Processing Lab, Department of Computer Applications, National Institute of Technology, Tiruchirappalli, India

ABSTRACT

Traditional representations like Bag of words are high dimensional, sparse and ignore the order as well as syntactic and semantic information. Distributed vector representations or embeddings map variable length text to dense fixed length vectors as well as capture prior knowledge which can be transferred to downstream tasks. Even though embedding has become the de facto standard for representations in deep learning based NLP tasks in both general and clinical domains, there is no survey paper which presents a detailed review of embeddings in Clinical Natural Language Processing. In this survey paper, we discuss various medical corpora and their characteristics, medical codes and present a brief overview as well as comparison of popular embedding models. We classify clinical embeddings into nine types and discuss each embedding type in detail. We discuss various evaluation methods followed by possible solutions to various challenges in clinical embeddings. Finally, we conclude with some of the future directions which will advance the research in clinical embeddings.

Keywords: embeddings, distributed representations, medical, natural language processing, survey

1. INTRODUCTION

Distributed vector representation or embedding is one of the recent as well as prominent additions to modern natural language processing. Embedding has gained a lot of attention and has become a part of the NLP researcher's toolkit.