Can Network Embedding of Distributional Thesaurus be Combined with Word Vectors for Better Representation?

Abhik Jana, IIT Kharagpur, Kharagpur, India ([email protected])
Pawan Goyal, IIT Kharagpur, Kharagpur, India ([email protected])

Abstract

Distributed representations of words learned from text have proved successful in various natural language processing tasks in recent times. While some methods represent words as vectors computed from text using a predictive model (Word2vec) or a dense count-based model (GloVe), others represent them in a distributional thesaurus network structure, where the neighborhood of a word is the set of words having adequate context overlap. Motivated by the recent surge of research in network embedding techniques (DeepWalk, LINE, node2vec, etc.), we turn a distributional thesaurus network into dense word vectors and investigate the usefulness of distributional thesaurus embedding in improving the overall word representation. This is the first attempt to show that combining the proposed word representation obtained from distributional thesaurus embedding with state-of-the-art word representations improves performance by a significant margin when evaluated on NLP tasks like word similarity and relatedness, synonym detection, and analogy detection. Additionally, we show that even without using any handcrafted lexical resource we can obtain representations whose performance on the word similarity and relatedness tasks is comparable to representations that use such a resource.

1 Introduction

Natural language understanding has always been a primary challenge in the natural language processing (NLP) domain. Learning word representations is one of the basic and primary steps in understanding text, and nowadays there are predominantly two views of learning word representations. In one realm of representation, words are vectors of distributions obtained by analyzing their contexts in the text, and two words are considered meaningfully similar if their vectors are close in the Euclidean space. In recent times, attempts have been made at dense representations of words, be it using a predictive model like Word2vec (Mikolov et al., 2013) or a count-based model like GloVe (Pennington et al., 2014), which are computationally efficient as well. Another stream of representation relies on a network-like structure where two words are considered neighbors if they both occur in the same context more than a certain number of times; each word is then represented using these neighbors. A Distributional Thesaurus is one such instance, automatically produced from a text corpus, which identifies words that occur in similar contexts; this notion was used in early work on distributional semantics (Grefenstette, 2012; Lin, 1998; Curran and Moens, 2002). One such representation is JoBimText, proposed by Biemann and Riedl (2013), which contains, for each word, a list of words that are similar with respect to their bigram distribution, thus producing a network representation. Later, Riedl and Biemann (2013) introduced a highly scalable approach for computing this network. We refer to this representation as a DT network throughout this article. With the recent trend of efficiently embedding large networks into a dense low-dimensional vector space (Perozzi et al., 2014; Tang et al., 2015; Grover and Leskovec, 2016), using methods focused on capturing different properties of the network such as neighborhood structure and community structure, we explore representing the DT network in a dense vector space and evaluate its application to various NLP tasks.

There has been an attempt (Ferret, 2017) to turn distributional thesauri into word vectors for synonym extraction and expansion, but the full utilization of DT embedding has not yet been explored.

In this paper, as a main contribution, we investigate the best way of turning a Distributional Thesaurus (DT) network into word embeddings by applying efficient network embedding methods, and analyze how the embeddings generated from the DT network can improve the representations generated by a prediction-based model like Word2vec or a dense count-based semantic model like GloVe. We experiment with several combination techniques and find that the DT network embedding can be combined with Word2vec and GloVe to outperform either used independently. Further, we show that we can use the DT network embedding as a proxy for a WordNet embedding in order to improve the already existing state-of-the-art word representations, as both achieve comparable performance as far as the word similarity and word relatedness tasks are concerned. Considering that the vocabulary of WordNet is small and that preparing WordNet-like lexical resources needs huge human engagement, it would be useful to have a representation which can be generated automatically from a corpus. We also attempt to combine both WordNet and DT embeddings to improve the existing word representations, and find that the DT embedding still brings in some extra information, leading to better performance compared to the combination of only the WordNet embedding and the state-of-the-art word embeddings. While most of our experiments focus on word similarity and relatedness tasks, we show the usefulness of DT embeddings on synonym detection and analogy detection as well. In both tasks, the combined representation of GloVe and DT embeddings shows a promising performance gain over the state-of-the-art embeddings.

2 Related Work

The core idea behind the construction of distributional thesauri is the distributional hypothesis (Firth, 1957): "You shall know a word by the company it keeps". The semantic neighbors of a target word are words whose contexts overlap with the context of the target word above a certain threshold. Some of the initial attempts at preparing a distributional thesaurus were made by Lin (1998), Curran and Moens (2002), and Grefenstette (2012). The semantic relation between a target word and its neighbors can be of different types, e.g., synonymy, hypernymy, hyponymy or other relations (Adam et al., 2013; Budanitsky and Hirst, 2006), which prove to be very useful in different natural language tasks. Even though the computation of sparse count-based models used to be inefficient, in this era of high-speed processors and storage, attempts are being made to streamline the computation. One such effort is made by Kilgarriff et al. (2004), who propose Sketch Engine, a corpus tool which takes as input a corpus of any language and the corresponding grammar patterns, and generates word sketches for the words of that language and a thesaurus. Recently, Riedl and Biemann (2013) introduced a new, highly scalable approach for computing quality distributional thesauri by incorporating pruning techniques and using a distributed computation framework. They prepare a distributional thesaurus from the Google Books corpus in a network structure and make it publicly available.

In another stream of literature, word embeddings represent words as dense unit vectors of real numbers, where vectors that are close together in Euclidean space are considered to be semantically related. In this genre of representation, one captivating attempt is made by Mikolov et al. (2013), who propose Word2vec, a set of two predictive models for neural embedding, whereas Pennington et al. (2014) propose GloVe, which utilizes a dense count-based model to come up with word embeddings that approximate this. Comparisons have also been made between count-based and prediction-based distributional models (Baroni et al., 2014) on various tasks like relatedness, analogy, concept categorization, etc., where researchers show that prediction-based word embeddings outperform sparse count-based methods used for computing distributional semantic models. In another study, Levy and Goldberg (2014) show that dense count-based methods, using PPMI-weighted co-occurrences and SVD, approximate neural word embeddings. Later, Levy et al. (2015) show the impact of various parameters and the best performing parameters for these methods. All these approaches are completely text based; no external knowledge source has been used.

More recently, a new direction of investigation has been opened up where researchers try to combine knowledge extracted from knowledge bases or images with distributed word representations prepared from text, with the expectation of getting a better representation. Some use knowledge bases like WordNet (Miller, 1995), FreeBase (Bollacker et al., 2008), PPDB (Ganitkevitch et al., 2013) and ConceptNet (Speer et al., 2017), whereas others use ImageNet (Frome et al., 2013; Kiela and Bottou, 2014; Both et al., 2017; Thoma et al., 2017) for capturing visual representations of lexical items. There are various ways of combining multiple representations. Some of the works extract lists of relations from knowledge bases and use those either to modify the learning algorithms (Halawi et al., 2012; Wang et al., 2014; Tian et al., 2016; Rastogi et al., 2015) or to post-process pre-trained word representations (Faruqui et al., 2015). Another line of literature prepares a dense vector representation from each of the modes (text, knowledge bases, visual, etc.) and combines the vectors using methods like concatenation, centroid computation, principal component analysis (Jolliffe, 1986), canonical correlation analysis (Faruqui and Dyer, 2014), etc. One such recent attempt is made by Goikoetxea et al. (2016), who prepare a vector representation from WordNet following the method proposed by Goikoetxea et al. (2015), which combines random walks over knowledge bases with a neural network language model, and use it to improve the vector representation constructed from text. As the number of lexical items in lexical knowledge bases is much smaller than in raw text and preparing such resources is a cumbersome task, our goal is to see whether we can use a DT network instead of a knowledge base like WordNet and achieve comparable performance on NLP tasks like word similarity and word relatedness. In order to prepare a vector representation from the DT network, we attempt to use various network embeddings like DeepWalk (Perozzi et al., 2014), LINE (Tang et al., 2015), struc2vec (Ribeiro et al., 2017), node2vec (Grover and Leskovec, 2016), etc. Some of these capture the neighborhood or community structure in the network while others capture structural similarity between nodes, second-order proximity, etc.

3 Proposed Methodology

Our aim is to analyze the effect of integrating the knowledge of the Distributional Thesaurus network with the state-of-the-art word representation models to prepare a better word representation. We first prepare vector representations from the Distributional Thesaurus (DT) network by applying network representation learning models. Next, we combine this thesaurus embedding with state-of-the-art vector representations prepared using the GloVe and Word2vec models for analysis.

3.1 Distributional Thesaurus (DT) Network

Riedl and Biemann (2013) use the Google Books corpus, consisting of texts from over 3.4 million digitized English books published between 1520 and 2008, to construct a distributional thesaurus (DT) network using the syntactic n-gram data (Goldberg and Orwant, 2013). The authors first compute the lexicographer's mutual information (LMI) (Kilgarriff et al., 2004) for each bigram, which gives a measure of the collocational strength of a bigram. Each bigram is broken into a word and a feature, where the feature consists of the bigram relation and the related word. Then the top 1000 ranked features for each word are taken, and for each word pair the intersection of their corresponding feature sets is obtained. The word pairs whose number of overlapping features is above a threshold are retained in the network. In a nutshell, the DT network contains, for each word, a list of words that are similar with respect to their bigram distribution (Riedl and Biemann, 2013). In the network, each word is a node and there is a weighted edge between a pair of words, where the weight corresponds to the number of overlapping features. A sample snapshot of the DT is shown in Figure 1.

Figure 1: A sample snapshot of the Distributional Thesaurus network, where each node represents a word and the weight of an edge between two words is defined as the number of context features that these two words share in common.
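To make the construction above concrete, the following is a minimal sketch (in Python, not the authors' code) of how a DT-style network could be assembled once LMI scores for the bigram features have been computed; the names lmi_scores, TOP_K and MIN_OVERLAP are illustrative assumptions, not parts of the original pipeline.

```python
# Minimal sketch (not the authors' code): building a DT-style network from
# per-word feature lists, assuming `lmi_scores` maps each word to a dict of
# {feature: LMI score} computed beforehand from the corpus bigrams.
from collections import defaultdict
from itertools import combinations

import networkx as nx

TOP_K = 1000          # keep the top-ranked features per word
MIN_OVERLAP = 1       # overlap threshold for keeping an edge

def build_dt_network(lmi_scores):
    # 1. Keep only the TOP_K features per word, ranked by LMI.
    top_features = {
        w: set(sorted(feats, key=feats.get, reverse=True)[:TOP_K])
        for w, feats in lmi_scores.items()
    }
    # 2. Invert the index so that candidate word pairs share at least one feature.
    words_by_feature = defaultdict(set)
    for w, feats in top_features.items():
        for f in feats:
            words_by_feature[f].add(w)
    # 3. Count overlapping features for each candidate pair.
    overlap = defaultdict(int)
    for words in words_by_feature.values():
        for w1, w2 in combinations(sorted(words), 2):
            overlap[(w1, w2)] += 1
    # 4. Retain pairs whose overlap clears the threshold; the overlap count
    #    becomes the edge weight, as in Figure 1.
    g = nx.Graph()
    for (w1, w2), n_shared in overlap.items():
        if n_shared >= MIN_OVERLAP:
            g.add_edge(w1, w2, weight=n_shared)
    return g
```

On corpora of the Google Books scale, the pairwise overlap step is exactly the part that Riedl and Biemann (2013) distribute across a cluster; the sketch only keeps the logic.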

3.2 Embedding Distributional Thesaurus

Now, from the DT network, we prepare a vector representation for each node using network representation learning models, which produce a vector for every node in a network. For this purpose, we use three state-of-the-art network representation learning models, discussed below.

DeepWalk: DeepWalk (Perozzi et al., 2014) learns social representations of a graph's vertices by modeling a stream of short random walks. Social representations signify latent features of the vertices that capture neighborhood similarity and community membership.

LINE: LINE (Tang et al., 2015) is a network embedding model suitable for arbitrary types of networks: undirected, directed and/or weighted. The model optimizes an objective which preserves both the local and global network structure by capturing both first-order and second-order proximity between vertices.

node2vec: node2vec (Grover and Leskovec, 2016) is a semi-supervised algorithm for scalable feature learning in networks which maximizes the likelihood of preserving network neighborhoods of nodes in a d-dimensional feature space. The algorithm can learn representations that organize nodes based on their network roles and/or the communities they belong to by developing a family of biased random walks which efficiently explore diverse neighborhoods of a given node.

Note that by applying these network embedding models to the DT network we obtain a 128-dimensional vector for each word in the network. We only consider edges of the DT network with edge weight greater than or equal to 50 for network embedding. Henceforth, we will use D2V-D, D2V-L and D2V-N to denote the vector representations obtained from the DT network by DeepWalk, LINE and node2vec, respectively.

After obtaining these vector representations, we also explore whether they can be combined with the pre-trained vector representations of Word2vec and GloVe to produce a joint vector representation. For that purpose, we directly use the well-known GloVe 1.2 embeddings (Pennington et al., 2014) trained on 840 billion words of the Common Crawl dataset, with vector dimension 300. As the pre-trained Word2vec vectors, we use the prominent representations prepared by Mikolov et al. (2013), trained on 100 billion words of Google News using skip-grams with negative sampling, also of dimension 300.

3.3 Vector Combination Methods

In order to integrate the word vectors, we apply two strategies inspired by Goikoetxea et al. (2016): concatenation (CC) and principal component analysis (PCA).

Concatenation (CC): This corresponds to the simple vector concatenation operation. The vector representations of both GloVe and Word2vec are of 300 dimensions and the word embeddings learnt from the DT are of 128 dimensions, so the concatenated representations we use are of 428 dimensions.

Principal Component Analysis (PCA): Principal component analysis (Jolliffe, 1986) is a statistical dimensionality reduction procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components (linear combinations of the original variables). We apply PCA to the concatenated representations (of dimension 428), reducing them to 300 dimensions. In addition to PCA, we also try the truncated singular value decomposition procedure (Hansen, 1987), but in our experimental setup it shows negligible improvement in performance compared to simple concatenation; hence we do not continue with truncated singular value decomposition for dimensionality reduction. After obtaining the combined representations of words, we proceed to evaluate their quality.
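As a rough illustration of the combination strategies in Section 3.3, the sketch below concatenates a 300-dimensional GloVe vector with a 128-dimensional DT embedding (assumed to have been produced already, e.g. by node2vec as in Section 3.2) and then projects the 428-dimensional result back to 300 dimensions with PCA. The dictionaries glove and dt_emb are hypothetical inputs, not the released resources themselves.

```python
# Minimal sketch (assumptions, not the paper's code): combining a 300-d GloVe
# vector with a 128-d DT embedding by concatenation (CC) and by PCA over the
# concatenated vectors. `glove` and `dt_emb` are dicts word -> numpy array.
import numpy as np
from sklearn.decomposition import PCA

def combine(glove, dt_emb, n_components=300):
    vocab = sorted(set(glove) & set(dt_emb))
    # Concatenation (CC): 300 + 128 = 428 dimensions per word.
    cc = np.vstack([np.concatenate([glove[w], dt_emb[w]]) for w in vocab])
    # PCA: project the 428-d concatenation down to 300 dimensions.
    reduced = PCA(n_components=n_components).fit_transform(cc)
    cc_vectors = {w: cc[i] for i, w in enumerate(vocab)}
    pca_vectors = {w: reduced[i] for i, w in enumerate(vocab)}
    return cc_vectors, pca_vectors
```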
tasks like word similarity and word relatedness, After obtaining vector representations, we also synonym detection and word analogy as described explore whether these can be combined with the next. pre-trained vector representation of Word2vec and GloVe to come up with a joint vector representa- 4.1 Qualitative Analysis: tion. For that purpose, we directly use very well- On qualitative analysis of some of the word pairs known GloVe 1.2 embeddings (Pennington et al., from the evaluation dataset, we observe that the 2014) trained on 840 billion words of the common joint representation (PCA (GloVe,D2V-N)) cap- crawl dataset having vector dimension of 300. As tures the notion of similarity much better than an instance of pre-trained vector of Word2vec, we GloVe. For example, it gives a higher cosine sim- use prominent pre-trained vector representations ilarity scores to the pairs (car, cab), (sea, ocean), prepared by Mikolov et al.(2013) trained on 100 (cottage,cabin), (vision, perception) etc. in com-

Dataset   GloVe   W2V     D2V-D   D2V-L    D2V-N
WSSim     0.799   0.779   0.737   0.073    0.764
SimL-N    0.427   0.454   0.418   0.015    0.421
RG-65     0.791   0.777   0.804   -0.121   0.813
MC-30     0.799   0.819   0.859   -0.067   0.869
WSR       0.637   0.631   0.287   0.077    0.333
M771      0.707   0.655   0.636   0.027    0.63
M287      0.8     0.755   0.558   -0.027   0.591
MEN-N     0.819   0.764   0.619   0.004    0.612
WS-353    0.706   0.697   0.51    0.088    0.547

Table 1: Comparison of the individual performances of the different vector representation models on the word similarity and relatedness tasks. The performance metric is Spearman's rank correlation coefficient (ρ); the best result in each row marks the best vector representation for that dataset.

However, in some cases where the words are not similar but related, e.g., (airport, flight), (food, meat), (pepper, soup) and (harbour, shore), the joint representation gives a comparatively lower cosine similarity score than GloVe. In the next set of evaluation experiments, we observe this behaviour of the joint representation reflected in the word similarity and word relatedness tasks to some extent.

4.2 Word Similarity and Relatedness

In this genre of tasks, the human judgment score for each word pair is given; we report the Spearman's rank correlation coefficient (ρ) between the human judgment scores and the scores predicted by the distributional model. Note that we take the cosine similarity between the vector representations of the two words in a pair as the predicted score.

Datasets: We use benchmark datasets for the evaluation of word representations: four word similarity datasets and four word relatedness datasets. The word similarity datasets are described below.

WordSim353 Similarity (WSSim): 203 word pairs extracted from the WordSim353 dataset (Finkelstein et al., 2001) by manual classification, prepared by Agirre et al. (2009), which deals with only similarity.

SimLex999 (SimL): 999 word pairs rated by 500 paid native English speakers, recruited via Amazon Mechanical Turk (www.mturk.com), who were asked to rate similarity. This dataset is introduced by Hill et al. (2016).

RG-65: 65 word pairs collected by Rubenstein and Goodenough (1965). These word pairs are judged by 51 humans on a scale from 0 to 4 according to their similarity, ignoring any other possible semantic relationships.

MC-30: 30 word pairs judged by 38 subjects on a scale of 0 to 4, collected by Miller and Charles (1991).

Similarly, a brief overview of the word relatedness datasets is given below:

WordSim353 Relatedness (WSR): 252 word pairs extracted from the WordSim353 dataset (Finkelstein et al., 2001) by manual classification, prepared by Agirre et al. (2009), which deals with only relatedness.

MTURK771 (M771): 771 word pairs evaluated by Amazon Mechanical Turk workers, with an average of 20 ratings for each word pair, where each judgment task consists of a batch of 50 word pairs. Ratings are collected on a 1-5 scale. This dataset is introduced by Halawi et al. (2012).

MTURK287 (M287): 287 word pairs evaluated by Amazon Mechanical Turk workers, with an average of 23 ratings for each word pair. This dataset is introduced by Radinsky et al. (2011).

MEN: MEN consists of 3,000 word pairs with [0, 1]-normalized semantic relatedness ratings provided by Amazon Mechanical Turk workers. This dataset was introduced by Bruni et al. (2014).

Along with these datasets we use the full WordSim353 (WS-353) dataset (Finkelstein et al., 2001), which includes both similarity and relatedness pairs and contains 353 word pairs, each associated with an average of 13 to 16 human judgments on a scale of 0 to 10. Inspired by Baroni et al. (2014), we consider only the noun pairs from the SimL and MEN datasets, denoted SimL-N and MEN-N, whereas the other datasets contain only noun pairs.

We start with experiments to inspect the individual performance of each of the vector representations on each of the datasets. Table 1 reports the individual performances of GloVe, Word2vec, D2V-D, D2V-L and D2V-N on the different datasets.
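The evaluation protocol described above (cosine similarity as the predicted score, Spearman's ρ against human judgments) can be sketched as follows; vectors and dataset are placeholder inputs, and out-of-vocabulary pairs are simply skipped, which may differ from the authors' exact handling.

```python
# Minimal sketch (hypothetical helper, not from the paper): evaluating a word
# representation on a similarity/relatedness dataset via Spearman's rank
# correlation between human scores and cosine similarities (Section 4.2).
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def evaluate(vectors, dataset):
    """vectors: dict word -> np.ndarray; dataset: list of (w1, w2, human_score)."""
    gold, predicted = [], []
    for w1, w2, score in dataset:
        if w1 in vectors and w2 in vectors:   # skip out-of-vocabulary pairs
            gold.append(score)
            predicted.append(cosine(vectors[w1], vectors[w2]))
    rho, _ = spearmanr(gold, predicted)
    return rho
```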

Dataset   GloVe   CC (GloVe,D2V-D)   PCA (GloVe,D2V-D)   CC (GloVe,D2V-N)   PCA (GloVe,D2V-N)
WSSim     0.799   0.838              0.839               0.84               0.832
SimL-N    0.427   0.443              0.468               0.446              0.483
RG-65     0.791   0.816              0.879               0.809              0.857
MC-30     0.799   0.86               0.89                0.866              0.874
WSR       0.637   0.676              0.645               0.67               0.657
M771      0.707   0.708              0.707               0.711              0.719
M287      0.8     0.781              0.807               0.795              0.82
MEN-N     0.819   0.792              0.799               0.806              0.817
WS-353    0.706   0.751              0.74                0.75               0.75

Table 2: Comparison of performances (Spearman's ρ) of GloVe against the combined representations of GloVe and the word representations obtained from the DT network using network embeddings (DeepWalk, node2vec). Two combination methods, concatenation (CC) and PCA, are used, of which PCA performs better than concatenation in most of the cases. The results also show that the combined representation leads to better performance in almost all the cases.

Dataset   W2V     CC (W2V,D2V-D)   PCA (W2V,D2V-D)   CC (W2V,D2V-N)   PCA (W2V,D2V-N)
WSSim     0.779   0.774            0.786             0.806            0.805
SimL-N    0.454   0.438            0.456             0.448            0.493
RG-65     0.777   0.855            0.864             0.867            0.875
MC-30     0.819   0.866            0.891             0.903            0.909
WSR       0.631   0.441            0.443             0.459            0.497
M771      0.655   0.633            0.637             0.656            0.676
M287      0.755   0.714            0.701             0.722            0.755
MEN-N     0.764   0.703            0.717             0.714            0.747
WS-353    0.697   0.602            0.61              0.623            0.641

Table 3: A similar experiment to Table 2, with Word2vec (W2V) in place of GloVe.

Dataset   PCA (GloVe,W2V)   PCA (GloVe,D2V-N)
WSSim     0.8               0.832
SimL-N    0.476             0.483
RG-65     0.794             0.857
MC-30     0.832             0.874
WSR       0.68              0.657
M771      0.717             0.719
M287      0.82              0.82
MEN-N     0.829             0.817
WS-353    0.746             0.75

Table 4: Comparison of performances (Spearman's ρ) between GloVe combined with Word2vec (W2V) and GloVe combined with the DT embedding obtained using node2vec (D2V-N), with PCA as the combination method. Clearly, the DT embedding outperforms Word2vec in terms of enhancing the performance of GloVe.

Dataset   PCA (GloVe,D2V-N)   PCA (GloVe,WN2V)   PCA (GloVe,WN2V,D2V-N)
WSSim     0.832               0.828              0.853
SimL-N    0.483               0.525              0.531
RG-65     0.857               0.858              0.91
MC-30     0.874               0.882              0.92
WSR       0.657               0.699              0.682
M771      0.719               0.762              0.764
M287      0.82                0.816              0.81
MEN-N     0.817               0.848              0.7993
WS-353    0.75                0.7801             0.7693

Table 5: Performance (ρ) for three combined representations: GloVe and the DT embedding using node2vec (D2V-N); GloVe and the WordNet embedding (WN2V); and GloVe, WN2V and D2V-N. The results show that the DT embedding produces performance comparable to the WordNet embedding, and combining the DT embedding along with the WordNet embedding helps to boost performance further in many of the cases.

In most of the cases, GloVe produces the best results, although no model is a clear winner across all the datasets. Interestingly, D2V-D and D2V-N give results comparable to GloVe and Word2vec on the word similarity datasets, even surpassing them on a few. D2V-L gives very poor performance, indicating that considering second-order proximity in the DT network while embedding has an adverse effect on performance in the word similarity and word relatedness tasks, whereas the random-walk based D2V-D and D2V-N, which take care of neighborhood and community structure, produce decent performance. Henceforth, we ignore the D2V-L model for the rest of our experiments.

Dataset   GloVe   GloVe with retrofitting   PCA (GloVe,D2V-N)
WSSim     0.799   0.799                     0.832
SimL-N    0.427   0.423                     0.483
RG-65     0.791   0.791                     0.857
MC-30     0.799   0.799                     0.874
WSR       0.637   0.69                      0.657
M771      0.707   0.708                     0.719
M287      0.8     0.795                     0.82
MEN-N     0.819   0.819                     0.817
WS-353    0.706   0.703                     0.75

Table 6: Comparison of performances (Spearman's ρ) between the GloVe representation and the GloVe representation retrofitted with the DT network. Clearly, DT retrofitting does not help much to improve the performance of GloVe.

Next, we investigate whether the network embeddings obtained from the Distributional Thesaurus network can be combined with GloVe and Word2vec to improve performance on the tasks specified above. In order to do that, we combine the vector representations using two operations: concatenation (CC) and principal component analysis (PCA). Table 2 reports the performance of combining GloVe with D2V-D and D2V-N on all the datasets using these combination strategies. In general, PCA turns out to be the better technique for vector combination than CC. Clearly, combining DT embeddings and GloVe boosts the performance for all the datasets except MEN-N, where the combined representation produces comparable performance.

In order to ensure that this observation is consistent, we try combining the DT embeddings with Word2vec. The results are presented in Table 3 and we see very similar improvements in performance except for a few cases, indicating that combining word embeddings prepared from the DT network is helpful in enhancing performance.

From Tables 1, 2 and 3, we see that GloVe proves to be better than Word2vec in most of the cases, D2V-N is the best performing network embedding, and PCA turns out to be the best combination technique. Henceforth, we consider PCA (GloVe, D2V-N) as our model for comparison with the baselines for the rest of the experiments.

Further, to scrutinize whether the achieved result is not just the effect of combining any two different word vectors, we compare PCA (GloVe, D2V-N) against the combination of GloVe and Word2vec (W2V). Table 4 shows the performance comparison on the different datasets and it is evident that PCA (GloVe, D2V-N) gives better results than PCA (GloVe, W2V) in most of the cases.

Now, as we observe that the network embedding from the DT network helps to boost the performance of Word2vec and GloVe when combined with them, we further compare the performance against the case when text-based embeddings are combined with embeddings from lexical resources. For that purpose, we take as a baseline Goikoetxea et al. (2016), where the authors combined the text-based representation with a WordNet-based representation. Here we use GloVe as the text-based representation and PCA as the combination method, as prescribed by the authors. Note that the WordNet-based representation is made publicly available by Goikoetxea et al. (2016). From the second and third columns of Table 5, we observe that even though we do not use any manually created lexical resource like WordNet, our approach achieves comparable performance. Additionally, we check whether we gain in terms of performance if we integrate the three embeddings together. The fourth column of Table 5 shows that we gain for some of the datasets while for others it has a negative effect. Looking at the performance, we can conclude that the automatically generated DT network brings in useful additional information as far as the word similarity and relatedness tasks are concerned.
So far, we have used concatenation and PCA as methods for combining two different representations. However, as per the literature, there are different ways of infusing knowledge from lexical sources to improve the quality of pre-trained vector embeddings. So we compare our proposed way of combination with a completely different way of integrating information from both dimensions, known as retrofitting. Retrofitting was proposed by Faruqui et al. (2015) for refining vector space representations using relational information from semantic lexicons, by encouraging linked words to have similar vector representations. Here, instead of using semantic lexicons, we use the DT network to produce the linked words that should have similar vector representations. Note that, for a target word, we consider as linked words only those words whose edge weight is greater than a certain threshold. While experimenting with various thresholds, the best results were obtained for a threshold value of 500. Table 6 shows the performance of the GloVe representations when retrofitted with information from the DT network. Even though in a very few cases it gives slightly improved performance, compared to the other combinations presented in Table 2 the correlation is not very good, indicating that retrofitting is probably not the best way of fusing knowledge from a DT network.
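For reference, the sketch below applies the standard retrofitting update of Faruqui et al. (2015), with the DT neighbours above the edge-weight threshold serving as the linked words. The uniform neighbour weighting and the 10 iterations are the usual defaults of that method, assumed here rather than taken from the authors' configuration.

```python
# Sketch of retrofitting (Faruqui et al., 2015) with the DT network supplying
# the "linked words": for each target word we link the DT neighbours whose
# edge weight exceeds a threshold (500 gave the best results in the paper).
import numpy as np

def retrofit(vectors, dt_graph, threshold=500, iterations=10):
    new_vecs = {w: v.copy() for w, v in vectors.items()}
    # Linked words: DT neighbours above the edge-weight threshold.
    links = {
        w: [n for n, data in dt_graph[w].items()
            if data.get("weight", 0) > threshold and n in vectors]
        for w in vectors if w in dt_graph
    }
    for _ in range(iterations):
        for w, neighbours in links.items():
            if not neighbours:
                continue
            # Standard update: pull the vector towards its neighbours while
            # staying close to the original (pre-trained) vector.
            neighbour_sum = np.sum([new_vecs[n] for n in neighbours], axis=0)
            new_vecs[w] = (neighbour_sum + len(neighbours) * vectors[w]) / (2 * len(neighbours))
    return new_vecs
```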

Further, we extend our study to investigate the usefulness of the DT embedding on other NLP tasks, namely synonym detection and the SAT analogy task, as discussed next.

4.3 Synonym Detection

We consider two gold standard datasets for the synonym detection experiment. Descriptions of the datasets used are given below.

TOEFL: This dataset contains 80 multiple-choice synonym questions (4 choices per question) introduced by Landauer and Dumais (1997) as a way of evaluating algorithms for measuring the degree of similarity between words. Being consistent with the previous experiments, we consider only nouns and prepare TOEFL-N, which contains 23 synonym questions.

ESL: This dataset contains 50 multiple-choice synonym questions (4 choices per question), along with a sentence providing context for each question, introduced by Turney (2001). Here also we consider only nouns and prepare ESL-N, which contains 22 synonym questions. Note that in our experimental setup we do not use the per-question context provided in the dataset.

While preparing both datasets, we also keep in mind the availability of word vectors in both the downloaded GloVe representation and the prepared DT embedding. For evaluation of the word embeddings using TOEFL-N and ESL-N, we consider the option having the highest cosine similarity with the question word as the answer and report accuracy. From the results presented in Table 7, we see that the DT embedding helps to boost the performance of the GloVe representation.

Dataset   GloVe   D2V-N   PCA (GloVe,D2V-N)
TOEFL-N   0.826   0.739   0.869
ESL-N     0.636   0.591   0.682
SAT-N     0.465   0.509   0.515

Table 7: Comparison of accuracies between the GloVe representation, the DT embedding obtained using node2vec, and the combination of both with PCA as the combination technique. Clearly, the DT embedding helps to improve the performance of GloVe for synonym detection as well as analogy detection.

4.4 Analogy Detection

For analogy detection we experiment with the SAT analogy dataset. It contains 374 multiple-choice analogy questions (5 choices per question) introduced by Turney and Bigham (2003) as a way of evaluating algorithms for measuring relational similarity. Considering only noun questions, we prepare SAT-N, which contains 159 questions.

In order to find the correct answer among the 5 options given for each question, we adopt the score (s) proposed by Speer et al. (2017): for a question 'a1 is to b1', we consider 'a2 is to b2' to be the correct answer among the options if its score (s) is the highest, where

s = a1·a2 + b1·b2 + w1 (b2 - a2)·(b1 - a1) + w2 (b2 - b1)·(a2 - a1)

As mentioned by the authors, the appropriate values of w1 and w2 are optimized separately for each system using grid search, to achieve the best performance. We use accuracy as the evaluation metric. The last row of Table 7 presents the comparison of accuracies (best for each model) obtained using the different embeddings, showing the same pattern: the combination of GloVe and DT embeddings leads to better performance than GloVe and the DT embedding used separately. Note that the optimized values of (w1, w2) are (0.2, 0.2), (0.8, 0.6) and (6, 0.6) for GloVe, the DT embedding, and the combined representation of GloVe and DT embeddings, respectively, for the analogy task.
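A small sketch of the scoring rule above, assuming vectors maps words to numpy arrays; the option with the highest score s is selected as the answer.

```python
# Sketch of the analogy scoring rule of Speer et al. (2017) as used above:
# for a question "a1 is to b1", each candidate pair (a2, b2) gets the score
#   s = a1.a2 + b1.b2 + w1 (b2 - a2).(b1 - a1) + w2 (b2 - b1).(a2 - a1)
# and the highest-scoring option is returned. w1, w2 are tuned by grid search.
import numpy as np

def analogy_score(a1, b1, a2, b2, w1, w2):
    return (np.dot(a1, a2) + np.dot(b1, b2)
            + w1 * np.dot(b2 - a2, b1 - a1)
            + w2 * np.dot(b2 - b1, a2 - a1))

def answer(vectors, a1, b1, options, w1, w2):
    """options: list of (a2, b2) word pairs; returns the index of the best one."""
    scores = [analogy_score(vectors[a1], vectors[b1], vectors[a2], vectors[b2], w1, w2)
              for a2, b2 in options]
    return int(np.argmax(scores))
```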
5 Conclusion

In this paper we showed that both a dense count-based model (GloVe) and a predictive model (Word2vec) lead to improved word representations when combined with word representations learned by applying network embedding methods to the Distributional Thesaurus (DT) network. We tried various network embedding models, among which node2vec proved to be the best in our experimental setup. We also tried different methodologies for combining vector representations, and PCA turned out to be the best among them. The combined vector representation of words yielded better performance on most of the similarity and relatedness datasets compared to the GloVe and Word2vec representations individually. Further, we observed that we could use the information from the DT as a proxy for WordNet in order to improve the state-of-the-art vector representations, obtaining comparable performance on most of the datasets. Similarly, for the synonym detection and analogy detection tasks, the same trend of the combined vector representation continued, showing the superiority of the combined representation over the state-of-the-art embeddings. All the datasets used in our experiments which are not under copyright protection, along with the DT embeddings, are made publicly available at https://tinyurl.com/yamrutrm.

In future we plan to investigate the effectiveness of the joint representation on other NLP tasks like text classification, the sentence completion challenge, and the evaluation of common sense stories. The overall aim is to prepare a better generalized representation of words which can be used across languages in different NLP tasks.

References

Clémentine Adam, Cécile Fabre, and Philippe Muller. 2013. Évaluer et améliorer une ressource distributionnelle. Traitement Automatique des Langues 54(1):71–97.
Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Paşca, and Aitor Soroa. 2009. A study on similarity and relatedness using distributional and WordNet-based approaches. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 19–27.
Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In ACL (1), pages 238–247.
Chris Biemann and Martin Riedl. 2013. Text: Now in 2D! A framework for lexical expansion with contextual similarity. Journal of Language Modelling 1(1):55–95.
Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1247–1250.
Fabian Both, Steffen Thoma, and Achim Rettinger. 2017. Cross-modal knowledge transfer: Improving the word embedding of apple by looking at oranges. In K-CAP 2017, The 9th International Conference on Knowledge Capture. ACM.
Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. Journal of Artificial Intelligence Research (JAIR) 49:1–47.
Alexander Budanitsky and Graeme Hirst. 2006. Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics 32(1):13–47.
James R. Curran and Marc Moens. 2002. Improvements in automatic thesaurus extraction. In Proceedings of the ACL-02 Workshop on Unsupervised Lexical Acquisition, pages 59–66.
Manaal Faruqui, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In Proceedings of NAACL.
Manaal Faruqui and Chris Dyer. 2014. Improving vector space word representations using multilingual correlation. Association for Computational Linguistics.
Olivier Ferret. 2017. Turning distributional thesauri into word vectors for synonym extraction and expansion. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 273–283.
Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing search in context: The concept revisited. In Proceedings of the 10th International Conference on World Wide Web, pages 406–414.
John R. Firth. 1957. A synopsis of linguistic theory, 1930-1955. Studies in Linguistic Analysis.
Andrea Frome, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. 2013. DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, pages 2121–2129.
Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The paraphrase database. In HLT-NAACL, pages 758–764.
Josu Goikoetxea, Eneko Agirre, and Aitor Soroa. 2016. Single or multiple? Combining word representations independently learned from text and WordNet. In AAAI, pages 2608–2614.
Josu Goikoetxea, Aitor Soroa, and Eneko Agirre. 2015. Random walks and neural network language models on knowledge bases. In HLT-NAACL, pages 1434–1439.
Yoav Goldberg and Jon Orwant. 2013. A dataset of syntactic-ngrams over time from a very large corpus of English books. In Second Joint Conference on Lexical and Computational Semantics (*SEM), volume 1, pages 241–247.
Gregory Grefenstette. 2012. Explorations in Automatic Thesaurus Discovery, volume 278. Springer Science & Business Media.
Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855–864.
Guy Halawi, Gideon Dror, Evgeniy Gabrilovich, and Yehuda Koren. 2012. Large-scale learning of word relatedness with constraints. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1406–1414.
Per Christian Hansen. 1987. The truncated SVD as a method for regularization. BIT Numerical Mathematics 27(4):534–553.
Felix Hill, Roi Reichart, and Anna Korhonen. 2016. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics.
Ian T. Jolliffe. 1986. Principal component analysis and factor analysis. In Principal Component Analysis, Springer, pages 115–128.
Douwe Kiela and Léon Bottou. 2014. Learning image embeddings using convolutional neural networks for improved multi-modal semantics. In EMNLP, pages 36–45.
Adam Kilgarriff, Pavel Rychly, Pavel Smrz, and David Tugwell. 2004. ITRI-04-08 The Sketch Engine. Information Technology 105:116.
Thomas K. Landauer and Susan T. Dumais. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review 104(2):211.
Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems, pages 2177–2185.
Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics 3:211–225.
Dekang Lin. 1998. Automatic retrieval and clustering of similar words. In Proceedings of the 17th International Conference on Computational Linguistics - Volume 2, pages 768–774.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM 38(11):39–41.
George A. Miller and Walter G. Charles. 1991. Contextual correlates of semantic similarity. Language and Cognitive Processes 6(1):1–28.
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, pages 701–710. https://doi.org/10.1145/2623330.2623732.
Kira Radinsky, Eugene Agichtein, Evgeniy Gabrilovich, and Shaul Markovitch. 2011. A word at a time: Computing word relatedness using temporal semantic analysis. In Proceedings of the 20th International Conference on World Wide Web, pages 337–346.
Pushpendre Rastogi, Benjamin Van Durme, and Raman Arora. 2015. Multiview LSA: Representation learning via generalized CCA. In HLT-NAACL, pages 556–566.
Leonardo F. R. Ribeiro, Pedro H. P. Saverese, and Daniel R. Figueiredo. 2017. struc2vec: Learning node representations from structural identity. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 385–394.
Martin Riedl and Chris Biemann. 2013. Scaling to large3 data: An efficient and effective method to compute distributional thesauri. In EMNLP, pages 884–890.
Herbert Rubenstein and John B. Goodenough. 1965. Contextual correlates of synonymy. Communications of the ACM 8(10):627–633.
Robert Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In AAAI, pages 4444–4451.
Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pages 1067–1077.
Steffen Thoma, Achim Rettinger, and Fabian Both. 2017. Knowledge fusion via embeddings from text, knowledge graphs, and images. arXiv preprint arXiv:1704.06084.
Fei Tian, Bin Gao, En-Hong Chen, and Tie-Yan Liu. 2016. Learning better word embedding by asymmetric low-rank projection of knowledge graph. Journal of Computer Science and Technology 31(3):624–634. https://doi.org/10.1007/s11390-016-1651-5.
Peter Turney. 2001. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In Machine Learning: ECML 2001, pages 491–502.
Peter D. Turney and Jeffrey Bigham. 2003. Combining independent modules to solve multiple-choice synonym and analogy problems. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-03).
Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge graph and text jointly embedding. In EMNLP, volume 14, pages 1591–1601.