Evaluating Distributed Word Representations for Capturing Semantics of Biomedical Concepts
Muneeb T H, Sunil Kumar Sahu and Ashish Anand
Department of Computer Science and Engineering, Indian Institute of Technology Guwahati, Assam - 781039, India
{muneeb, sunil.sahu, anand.ashish}@iitg.ac.in

Abstract

Recently there has been a surge of interest in learning vector representations of words from huge corpora in an unsupervised manner. Such word vector representations, also known as word embeddings, have been shown to improve the performance of machine learning models in several NLP tasks. However, the efficiency of such representations has not been systematically evaluated in the biomedical domain. In this work, our aim is to compare the performance of two state-of-the-art word embedding methods, namely word2vec and GloVe, on the basic task of reflecting semantic similarity and relatedness of biomedical concepts. For this, vector representations of all unique words in a corpus of more than 1 million full-length biomedical research articles are obtained from the two methods. These word vectors are evaluated for their ability to reflect semantic similarity and semantic relatedness of word pairs in a benchmark data set of manually curated semantically similar and related words available at http://rxinformatics.umn.edu. We observe that the parameters of these models do affect their ability to capture lexico-semantic properties, and word2vec with a particular language modeling scheme seems to perform better than the others.

1 Introduction

One of the crucial steps in machine learning (ML) based NLP models is how we represent a word as an input to the model. Most earlier works treated a word as an atomic symbol and assigned a one-hot vector to each word. The length of the vector in this representation equals the size of the vocabulary; the element at the word's index is 1 while all other elements are 0. Two major drawbacks of this representation are: first, the length of the vector is huge, and second, there is no notion of similarity between words. The inability of the one-hot representation to embody lexico-semantic properties prompted researchers to develop methods based on the notion that "similar words appear in similar contexts". These methods can broadly be classified into two categories (Turian et al., 2010), namely distributional representations and distributed representations. Both groups of methods work in an unsupervised manner on a huge corpus. Distributional representations are mainly based on a co-occurrence matrix O of the words in the vocabulary and their contexts. Here, among other possibilities, contexts can be documents or words within a particular window. Each entry O_ij in the matrix may indicate either the frequency of word i in context j or simply whether word i has appeared in context j at least once. The co-occurrence matrix can be designed in a variety of ways (Turney and Pantel, 2010). The major issue with such methods is the size of the matrix O, and reducing its size generally tends to be computationally very expensive; nevertheless, the requirement of constructing and storing the matrix O is always there. The second group of methods is mainly based on language modeling (Bengio et al., 2003). We discuss these methods further in Section 3.
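To make the distributional picture above concrete, the short sketch below counts window-based word-word co-occurrences into a nested dictionary playing the role of the matrix O; the window size, the toy corpus and the function name are illustrative assumptions, not details taken from the paper.

from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    # counts[i][j] = how often word j appears within `window` tokens of word i
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for pos, word in enumerate(tokens):
            lo = max(0, pos - window)
            hi = min(len(tokens), pos + window + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos != pos:
                    counts[word][tokens[ctx_pos]] += 1
    return counts

# Toy corpus (hypothetical input):
corpus = [["aspirin", "may", "treat", "headache"],
          ["ibuprofen", "may", "treat", "pain"]]
O = cooccurrence_counts(corpus, window=2)
print(O["may"]["treat"])   # 2: "treat" co-occurs with "may" in both sentences

Replacing the raw counts with a 0/1 indicator, or changing the window definition, gives the other matrix variants mentioned above.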
Outside the biomedical domain, this kind of representation has shown significant improvements in the performance of many NLP tasks. For example, Turian et al. (2010) improved the performance of chunking and named entity recognition by using word embeddings as an additional feature in their CRF model. In one study, Collobert et al. (2011) formulated the NLP tasks of part-of-speech tagging, chunking, named entity recognition and semantic role labeling as a multi-task learning problem. They showed improved performance when word vectors are learned together with the other NLP tasks. Socher et al. (2012) improved the performance of sentiment analysis and semantic relation classification using recursive neural networks. One step common to these models is learning word embeddings from a huge unannotated corpus, such as Wikipedia, and later using them as features.

Motivated by the above results, we evaluate the performance of two word embedding models, word2vec (Mikolov et al., 2013a; Mikolov et al., 2013b) and GloVe (Pennington et al., 2014), for their ability to capture syntactic as well as semantic properties of words in the biomedical domain. We have used full-length articles obtained from the PubMed Central (PMC) open access subset (http://www.ncbi.nlm.nih.gov/pmc/tools/ftp/) as our corpus for learning word embeddings. For evaluation we have used a publicly available validated reference dataset (Pakhomov et al., 2010; Pedersen et al., 2007) containing semantic similarity and relatedness scores of around 500 word pairs. Our results indicate that the word2vec word embeddings capture semantic similarity between words better than the GloVe word embeddings in the biomedical domain, whereas for the task of semantic relatedness there does not seem to be any statistically significant difference among the different word embeddings.

2 Related Work

In a recent study, Miñarro-Giménez et al. (2015) evaluated the efficiency of word2vec in finding clinical relationships such as "may treat", "has physiological effect", etc. For this, they selected manually curated information from the National Drug File - Reference Terminology (NDF-RT) ontology as reference data. They used several corpora for learning word-vector representations and compared the resulting vectors. The word vectors obtained from the largest corpus gave the best result for finding the "may treat" relationship, with an accuracy of 38.78%. The relatively poor results obtained for the different clinical relationships indicate the need for more careful construction of the corpus, better experimental design, and better ways to include domain knowledge.

In another recent study, Nikfarjam et al. (2015) described an automatic way to find adverse drug reaction mentions in social media such as Twitter. The authors showed that including word embedding based features improved the performance of their classifier.

Faruqui and Dyer (2014) developed an online suite to analyze and compare different word vector representation models on a variety of tasks, including syntactic and semantic relations, sentence completion and sentiment analysis. In another recent work, Levy et al. (2015) carried out an extensive study on the effect of the hyper-parameters of word representation models and showed their influence on performance on word similarity and analogy tasks. However, in both studies (Faruqui and Dyer, 2014; Levy et al., 2015) the benchmark datasets available for NLP tasks are not suitable for analyzing vector representations of clinical and biomedical terms.

3 Word Embedding

As discussed earlier, word embedding, or distributed representation, is a technique for learning vector representations of all words present in a given corpus. The learned vector representations are generally dense, real-valued and of low dimension. In contrast to the one-hot representation, each dimension of the word vector is supposed to represent a latent feature of the lexico-semantic properties of the word. In our work we considered two state-of-the-art word embedding techniques, namely word2vec and GloVe. Although several other word embedding techniques exist in the literature (Hinton et al., 1986; Bengio et al., 2003; Bengio, 2008; Mnih and Hinton, 2009; Collobert et al., 2011), the two selected techniques are computationally efficient and are considered state-of-the-art. We summarize the basic principles of the two methods in the subsequent sections.

3.1 word2vec Model

word2vec generates word vectors by two different language modeling schemes: continuous bag of words (CBOW) and skip-gram (Mikolov et al., 2013a; Mikolov et al., 2013b). In the CBOW scheme, the goal is to predict a word given its surrounding words, whereas in skip-gram, given a single word, the window or context of words around it is predicted; in this sense the skip-gram model is the opposite of the CBOW model. Both models are neural network based language models that take a huge corpus as input and learn a vector representation for each word in the corpus. We used the freely available word2vec tool for our purpose. Apart from the choice of architecture, skip-gram or CBOW, word2vec has several parameters, including the size of the context window and the dimension of the vectors, which affect the speed and quality of training.
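As a rough illustration of these parameters, the sketch below trains skip-gram vectors with the gensim library (4.x API) rather than the original word2vec tool used in the study; the toy corpus, vector dimension and window size are placeholder assumptions.

from gensim.models import Word2Vec

# Toy tokenized corpus standing in for the full-text PMC articles (placeholder).
sentences = [
    ["aspirin", "inhibits", "cyclooxygenase"],
    ["ibuprofen", "inhibits", "cyclooxygenase"],
]

# sg=1 selects the skip-gram architecture; sg=0 would select CBOW.
# vector_size (embedding dimension) and window (context size) are the
# parameters discussed above; the values here are placeholders.
model = Word2Vec(
    sentences,
    sg=1,
    vector_size=200,
    window=5,
    min_count=1,   # a real run would discard rare words, e.g. min_count=10
    workers=4,
)

vec = model.wv["aspirin"]                            # learned 200-dimensional vector
print(model.wv.similarity("aspirin", "ibuprofen"))   # cosine similarity of two terms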
(2015) embedding techniques (Hinton et al., 1986; Ben- have evaluated the efficiency of word2vec in find- gio et al., 2003; Bengio, 2008; Mnih and Hin- ing clinical relationships such as “may treat”, “has ton, 2009; Collobert et al., 2011), the selected two physiological effect” etc. For this, they have word embedding techniques are very much com- selected the manually curated information from putationally efficient and are considered as state- the National Drug File - Reference Terminology of-the art. We have summarized the basic princi- (NDF-RT) ontology as reference data. They have ples of the two methods in subsequent sections. used several corpora for learning word-vector rep- resentation and compared these different vectors. 3.1 word2vec Model The word-vectors obtained from the largest corpus word2vec generates word vector by two different gave the best result for finding the ”may treat” re- schemes of language modeling: continuous bag lationship with accuracy of 38.78%. The relatively of words (CBOW) and skip-gram (Mikolov et al., poor result obtained for finding different clinical 2013a; Mikolov et al., 2013b). In the CBOW relationships indicates the need for more careful method, the goal is to predict a word given the construction of corpus, design of experiment and surrounding words, whereas in skip-gram, given finding better ways to include domain knowledge. a single word, window or context of words are In another recent study, Nikfarjam et al. (2015) predicted. We can say skip-gram model is op- have described an automatic way to find adverse posite of CBOW model. Both models are neural drug reaction mention in social media such as twit- network based language model and take huge cor- 1http://www.ncbi.nlm.nih.gov/pmc/tools/ftp/ pus as an input and learn vector representation for 159 each words in the corpus. We used freely avail- 566 pairs of UMLS concepts which were man- able word2vec2 tool for our purpose. Apart from ually rated for their semantic similarity and 587 the choice of architecture skip-gram or CBOW, pairs of UMLS concepts for semantic relatedness. word2vec has several parameters including size of We removed all pairs in which at least one word context window, dimension of vector, which effect has less than 10 occurrences in the entire corpus the speed and quality of training. as such words are removed while building vocabu- lary from the corpus. After removing less frequent 3.2 GloVe Model words in both reference sets, we obtain 462 pairs GloVe (Pennington et al., 2014) stands for Global for semantic similarity having 278 unique words, Vectors. In some sense, GloVe can be seen as a and 465 pairs for semantic relatedness having 285 hybrid approach, where it considers global context unique words. In both cases, each concept pair is (by considering co-occurrence matrix) as well as given a score in the range of 0 1600, with higher − local context (such as skip-gram model) of words. score implies similar or more related judgments GloVe try to learn vector for words wx and wy such of manual annotators.