Deconstructing Word Embedding Models
Koushik K. Varma, Ashoka University, [email protected]

Abstract – Analysis of Word Embedding Models through a deconstructive approach reveals several of their shortcomings and inconsistencies. These include instability of the vector representations, distorted analogical reasoning, geometric incompatibility with linguistic features, and inconsistencies in the corpus data. A new theoretical embedding model, 'Derridian Embedding,' is proposed in this paper. Contemporary embedding models are evaluated qualitatively in terms of how adequate they are in relation to the capabilities of a Derridian Embedding.

1. INTRODUCTION

Saussure, in the Course in General Linguistics, argued that there is no substance in language and that all language consists of differences. In language there are only forms, not substances, and by that he meant that all apparently substantive units of language are generated by other things that lie outside them, yet these external characteristics are actually internal to their make-up. All elements of language have identity only in so much as they are produced by a network of differences, and therefore the value of these elements is established by their relation to other elements (Saussure 1966). This relation can be thought of paradigmatically, in terms of substitution, or syntagmatically, in terms of concatenation, as the following:

The value of a sign is given by the positions it can occupy, in opposition to the ones it cannot.

Or, as J. R. Firth plainly puts it, "You shall know a word by the company it keeps." (Firth, 1957, p. 11)

Quantifying the works of these linguists is the domain of Distributional Semantics, which is based on the hypothesis that words with similar meanings occur in similar contexts (Harris, 1954).

Built on top of this hypothesis are Word Embeddings (Mikolov et al., 2013): unsupervised techniques used to map words or phrases from a text to a corresponding vector of real numbers. The vector space obtained through embedding models preserves the contextual similarity of words; therefore, words that regularly appear together in text will also appear together in the vector space.
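To make the distributional intuition concrete, the following is a minimal sketch (not taken from the paper; the toy corpus, window size, and function names are illustrative assumptions) showing that words sharing contexts end up with similar vectors even under a crude co-occurrence representation:

```python
import numpy as np

# Toy corpus: "tea" and "coffee" share contexts, "car" does not (illustrative only).
corpus = [
    "i drink hot tea every morning",
    "i drink hot coffee every morning",
    "i drive my car to work every morning",
]

# Build a vocabulary and a symmetric word-word co-occurrence matrix
# using a +/-2 word window.
tokens = [sentence.split() for sentence in corpus]
vocab = sorted({w for sent in tokens for w in sent})
index = {w: i for i, w in enumerate(vocab)}
X = np.zeros((len(vocab), len(vocab)))

window = 2
for sent in tokens:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                X[index[w], index[sent[j]]] += 1

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Rows of X act as crude "embeddings": words seen in similar contexts are close.
print(cosine(X[index["tea"]], X[index["coffee"]]))  # high (shared contexts)
print(cosine(X[index["tea"]], X[index["car"]]))     # low (disjoint contexts)
```

Learned embeddings such as Word2Vec or GloVe replace this raw co-occurrence counting with dense, low-dimensional vectors, but the underlying idea of proximity in vector space reflecting shared contexts is the same.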
With embeddings playing a crucial role in downstream NLP tasks such as Word Sense Disambiguation, POS tagging, Named Entity Recognition, Semantic Dependency Parsing and Sentiment Analysis, one must be certain that all linguistic aspects of a word are captured within the vector representations, because any discrepancies could be further amplified in practical applications. While these vector representations have widespread use in modern natural language processing, it is unclear to what degree they accurately encode the essence of language in its structural, contextual, cultural and ethical assimilation. Surface evaluations on training-test data provide efficiency measures for a specific implementation but do not explicitly give insights into the factors causing the existing efficiency (or the lack thereof). We henceforth present a deconstructive review of contemporary word embeddings using available literature to clarify certain misconceptions and inconsistencies concerning them.

1.1 DECONSTRUCTIVE APPROACH

Derrida, a post-structuralist French philosopher, describes the process of spatial and temporal movement that he claims makes all thought and reality possible. He set about demonstrating that ideas especially are like units of language; they are generated by difference; they have no substance apart from the networks of differences (each bearing the traces of other elements and other differences) that generate them as effects (Derrida 1982).

Drawing upon his work, I define the theoretical existence of a 'Derridian Embedding' that transposes his network of differences into a vector-space model.

The following would be the properties of such an embedding framework:

• All lexical relations of language (hyponymy, homonymy, polysemy, synonymy, antonymy and metonymy) ought to be captured within the vector representations.

• The resultant vectors should provide a linear mapping between concepts in cognitive thought and elements of language. This presupposes that even the most abstract concepts (such as love, success, freedom, good, and moral) represented in a vector space are analogous to neural (of the human brain) representations of the concept.

• All possible discourse structures ought to be accommodated as traversals within the network of differences.

I consider Derridian Embeddings to be the epitome of natural language understanding. I then treat contemporary word embeddings, at various levels, as striving towards the capabilities of a Derridian Embedding and point out reasons why they fail. In particular, my approach in this paper is the following:

1. Isolate a constraint that limits the proper functioning of a word embedding.
2. Analyze the repercussions of the constraint and potential solutions to it.
3. Assume that the constraint has been dissolved and move to another, higher-level constraint that exists within the framework.
4. Go to 2.

For instance, Section 3 discusses the instability in embedding models. Further analysis of available literature shows that the discontinuity in domain-based senses is a major reason for instability. One of the solutions to this is to use an ensemble of domain-based models. Although this isn't an absolute solution, I assume the instability caused by domain-based senses to be nullified. We then verify whether there could be any incompatibility within the embedding models after the above assumption has been made.

At each constraint, we end up looking at word embeddings through a lens broader than that of the previous constraint. Fig. 1.0 illustrates the different constraints on word embeddings as a part of the deconstruction approach that will be presented over a series of sections in the rest of the paper. The higher in the hierarchy a constraint is, the broader and more complex its repercussions are.

[Fig. 1.0: Overview of constraints to be discussed in this paper (bottom to top) and their significant attributes.]

2. EMBEDDING MODELS

Word embeddings (Mikolov et al., 2013) are scalable, unsupervised, contemporary models of distributional semantics whose primary objective is the mapping of linguistic terms into real-valued vectors. The obtained vector space preserves the contextual similarity of words; therefore, words that appear in similar contexts tend to appear in close proximity in the vector space.

2.1 WORD2VEC

Word2Vec models are built on the intuition of training neural networks to predict a context given a word (Skip-Gram) or a word given a context (Continuous Bag of Words). Figures 2.01 and 2.02 depict the architectures of the aforementioned models. Their implementation relies on the Feedforward Neural Net Language Model (introduced by Bengio et al.).

[Fig. 2.01, Fig. 2.02: architectures of the CBOW and Skip-Gram models.]

In the CBOW approach, the input layer takes the N previous words as 1×V vectors, V being the size of the vocabulary. A projection is then made using the shared projection matrix of the projection layer. A probability distribution over all the words in the vocabulary is then computed with a softmax activation function; the output layer gives the probability of each word being the next one.
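As a rough sketch of the CBOW forward pass just described (the dimensions, variable names, and random initialisation are illustrative assumptions, and no training loop is shown), the context projections are averaged, passed through the shared output matrix, and normalised with a softmax:

```python
import numpy as np

rng = np.random.default_rng(0)

V = 10   # vocabulary size (illustrative)
H = 4    # projection (hidden) layer size
N = 3    # number of context words fed to the input layer

# Shared projection matrix (input embeddings) and output weight matrix.
W_in = rng.normal(scale=0.1, size=(V, H))
W_out = rng.normal(scale=0.1, size=(H, V))

def softmax(z):
    z = z - z.max()   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cbow_forward(context_ids):
    """Predict a distribution over the vocabulary from N context word ids."""
    # Each context word is conceptually a 1xV one-hot vector; multiplying it
    # by W_in just selects a row, so we index directly and average the rows.
    h = W_in[context_ids].mean(axis=0)   # 1xH projection
    return softmax(h @ W_out)            # 1xV probability distribution

probs = cbow_forward([2, 5, 7])   # ids of the N context words
print(probs.shape, probs.sum())   # (10,) ~1.0
```

Training would adjust W_in and W_out so that the probability assigned to the actual target word is maximised; the rows of W_in are then used as the word vectors.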
The Skip-Gram approach works in the same way as CBOW, except that the task is to predict a context given a word. Technically speaking, the objective is to maximize the average log probability of the context of a given word, expressed in the following way:

$$\frac{1}{T}\sum_{t=1}^{T} \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)$$

where T is the number of training words and c is the number of context words. p(w_{t+j} | w_t) is given by the softmax function, expressed as the following:

$$p(w_{t+j} \mid w_t) = \frac{\exp\left({v'_{w_{t+j}}}^{\top} v_{w_t}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_t}\right)}$$

where W is the number of unique words in the corpus w_1 ... w_T, and v_w and v'_w are the input and output vector representations of word w.

An increase in the size of the vocabulary brings a significant increase in the computational cost of Word2Vec. Hence, one of the solutions often adopted is the usage of a hierarchical variant of the softmax function (Mnih et al., 2009). Built upon a Huffman tree based on word frequency (the most frequent words get the shortest path from the root), Hierarchical Softmax reduces the computational complexity from H × V to log2(V) × H, where H is the size of the hidden layer and V is the vocabulary size.

Another solution backed by Mikolov et al. is a relatively novel technique called Negative Sampling. In negative sampling, the model updates the representations of a small number of words such that the network predicts an observed "positive" word pair (e.g., "ice cream") and does not predict any of a number of "negative" pairs that are unlikely to be observed in the text (e.g., "ice future" or "ice happy"). Because a small number of negative samples are used (usually fewer than 20), a relatively small number of weights need to be updated at each iteration.

2.2 GLOVE

For words i, j, k, with P_ij being the probability of i and j occurring in the same context and P_ik being the probability of i and k occurring in the same context, the ratio P_ij / P_ik is the quantity of interest. Following a series of assumptions and derivations, the authors finally arrive at simplified word vectors of the form:

$$w_i^{\top} w_j + b_i + b_j = \log X_{ij}$$

where w_i is the representation of the i-th word, w_j is the representation of the j-th word, b_i and b_j are bias terms, and X_ij is the co-occurrence count of words i and j.

In contrast to word2vec, GloVe implements the following cost function to compute the best possible word representations w_i, w_j:

$$J = \sum_{i,j=1}^{V} \left(w_i^{\top} w_j + b_i + b_j - \log X_{ij}\right)^2$$

GloVe further introduces a weighting scheme f into the cost function of the model, thereby avoiding log 0 errors and further reducing the negative effect of high-frequency co-occurrences:

$$f(x) = \begin{cases} \left(x / x_{\max}\right)^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$

where x is the co-occurrence count, x_max is a cutoff on the co-occurrence counts, and α is an exponential weighting term. The performance of a GloVe model thus depends on the dimensionality of the word vectors, x_max, α, and the size of the window used to compute co-occurrence statistics around each word.

Although it depends on the parameters of each model, GloVe is generally faster than Word2Vec (Pennington
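For concreteness, the weighting function and the weighted per-pair cost term described above can be sketched as follows (a minimal illustration, assuming the commonly used defaults x_max = 100 and α = 0.75, which are not specified in this text; the vectors and counts are made up):

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x): down-weights rare pairs and caps frequent co-occurrences."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_cost(w_i, w_j, b_i, b_j, x_ij):
    """Weighted squared error contributed by one co-occurring word pair (i, j)."""
    if x_ij == 0:
        return 0.0  # pairs that never co-occur contribute nothing, avoiding log(0)
    diff = w_i @ w_j + b_i + b_j - np.log(x_ij)
    return glove_weight(x_ij) * diff ** 2

# Illustrative 5-dimensional vectors and biases for a single word pair;
# the full GloVe cost is the sum of this term over all co-occurring pairs.
rng = np.random.default_rng(0)
w_i, w_j = rng.normal(size=5), rng.normal(size=5)
print(glove_pair_cost(w_i, w_j, b_i=0.1, b_j=-0.2, x_ij=42.0))
```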