Corpus specificity in LSA and word2vec: the role of out-of-domain documents

Edgar Altszyler (UBA, FCEyN, DC, ICC, UBA-CONICET), Mariano Sigman (U. Torcuato Di Tella - CONICET), Diego Fernández Slezak (UBA, FCEyN, DC, ICC, UBA-CONICET)

Abstract

Despite the popularity of word embeddings, the precise way by which they acquire semantic relations between words remains unclear. In the present article, we investigate whether the capacity of LSA and word2vec to identify relevant semantic relations increases with corpus size. One intuitive hypothesis is that the capacity to identify relevant associations should increase as the amount of data increases. However, if corpus size grows in topics which are not specific to the domain of interest, the signal-to-noise ratio may weaken. Here we investigate the effect of corpus specificity and size on word embeddings, and for this we study two ways of progressively eliminating documents: the elimination of random documents vs. the elimination of documents unrelated to a specific task. We show that word2vec can take advantage of all the documents, obtaining its best performance when it is trained with the whole corpus. On the contrary, the specialization (removal of out-of-domain documents) of the training corpus, accompanied by a decrease of dimensionality, can increase LSA word-representation quality while speeding up the processing time. From a cognitive-modeling point of view, we point out that LSA's word-knowledge acquisition may not be efficiently exploiting higher-order co-occurrences and global relations, whereas word2vec does.

1 Introduction

The main idea behind corpus-based semantic representation is that words with similar meanings tend to occur in similar contexts (Harris, 1954). This proposition is called the distributional hypothesis and provides a practical framework to understand and compute the semantic relationship between words. Based on the distributional hypothesis, Latent Semantic Analysis (LSA) (Deerwester et al., 1990; Landauer and Dumais, 1997; Hu et al., 2007) and word2vec (Mikolov et al., 2013a,b) are among the most important methods for word-meaning representation; they describe each word in a vector space, where words with similar meanings are located close to each other.

Word embeddings have been applied in a wide variety of areas such as information retrieval (Deerwester et al., 1990), psychiatry (Altszyler et al., 2018; Carrillo et al., 2018), treatment optimization (Corcoran et al., 2018), literature (Diuk et al., 2012) and cognitive sciences (Landauer and Dumais, 1997; Denhière and Lemaire, 2004; Lemaire and Denhière, 2004; Diuk et al., 2012).

LSA takes as input a training corpus formed by a collection of documents. First, a word-by-document co-occurrence matrix is constructed, which contains the distribution of occurrence of the different words across the documents. Then, usually, a mathematical transformation is applied to reduce the weight of uninformative high-frequency words in the word-document matrix (Dumais, 1991). Finally, a linear dimensionality reduction is implemented by a truncated Singular Value Decomposition (SVD), which projects every word into a subspace with a predefined number of dimensions, k. The success of LSA in capturing the latent meaning of words comes from this low-dimensional mapping, and the resulting representation improvement can be explained as a consequence of the elimination of the noisiest dimensions (Turney and Pantel, 2010).
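The following is a minimal sketch of this LSA pipeline, assuming scikit-learn; it is an illustration, not the implementation used in the paper. TF-IDF stands in for the weighting step (classical LSA typically uses a log-entropy scheme), and the toy corpus and the value of k are arbitrary.

```python
# Minimal LSA-style sketch: word-by-document matrix -> weighting -> truncated SVD.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the bee flies to the flower and makes honey",
    "you buy bread at the shop",
    "honey is sold at the shop",
]

# Document-term matrix with TF-IDF weighting, down-weighting
# uninformative high-frequency words (stand-in for log-entropy).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)        # shape: (n_docs, n_words)

# Truncated SVD of the word-by-document matrix projects every word
# into a k-dimensional subspace.
k = 2
svd = TruncatedSVD(n_components=k, random_state=0)
word_vectors = svd.fit_transform(X.T)     # shape: (n_words, k)

# Semantic closeness between two words is measured by cosine similarity.
vocab = vectorizer.vocabulary_
def cosine(w1, w2):
    u, v = word_vectors[vocab[w1]], word_vectors[vocab[w2]]
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine("honey", "shop"))
```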
Word2vec consists of two neural network models, Continuous Bag of Words (CBOW) and skip-gram. To train the models, a sliding window is moved along the corpus. In the CBOW scheme, at each step the network is trained to predict the center word (the word in the center of the window) given the context words (the other words in the window), while in the skip-gram scheme the model is trained to predict the context words given the central word. In the present paper we use skip-gram, which produced better performance in Mikolov et al. (2013b).
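As a concrete reference for the skip-gram setup just described, the snippet below trains a toy model with gensim (version 4 API assumed); the corpus and hyperparameters are illustrative and do not reproduce the configuration used in the paper's experiments.

```python
from gensim.models import Word2Vec

# Each training document is given as a list of tokens.
corpus = [
    ["the", "bee", "flies", "to", "the", "flower", "and", "makes", "honey"],
    ["you", "buy", "bread", "at", "the", "shop"],
    ["honey", "is", "sold", "at", "the", "shop"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=50,   # embedding dimensionality
    window=5,         # sliding-window size
    sg=1,             # 1 = skip-gram (predict context from the center word), 0 = CBOW
    min_count=1,      # keep every word of the toy corpus
    epochs=50,
    seed=0,
)

# Words occurring in similar contexts end up close in the vector space.
print(model.wv.similarity("honey", "shop"))
```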
Despite the development of new word representation methods, LSA is still intensively used and has been shown to produce better performance than word2vec methods when the training corpus is of small to medium size (Altszyler et al., 2017).

1.1 Training Corpus Size and Specificity in Word-embeddings

Over the last years, great effort has been devoted to understanding how to choose the right parameter settings for different tasks (Quesada, 2011; Dumais, 2003; Landauer and Dumais, 1997; Lapesa and Evert, 2014; Bradford, 2008; Nakov et al., 2003; Baroni et al., 2014). However, considerably less attention has been given to studying how the corpus used as input for training may affect performance. Here we ask a simple question about a property of the corpus: is there a monotonic relation between corpus size and performance? More precisely, what happens if the topic of additional documents differs from the topics of the specific task? Previous studies have surprisingly shown some contradictory results on this simple question.

On the one hand, in the foundational work, Landauer and Dumais (1997) compare the word-knowledge acquisition of LSA with that of children. This acquisition process may be produced by 1) direct learning, enhancing the incorporation of new words by reading texts that explicitly contain them; or 2) indirect learning, enhancing the incorporation of new words by reading texts that do not contain them. To do that, they evaluate LSA semantic representations trained with corpora of different sizes on multiple-choice synonym questions extracted from the TOEFL exam. This test consists of 80 multiple-choice questions, in which it is requested to identify the synonym of a word among 4 options. In order to train the LSA, Landauer and Dumais used the TASA corpus (Zeno et al., 1995). Landauer and Dumais (1997) randomly replaced exam words in the corpus with non-sense words and varied the number of documents in the corpus by selecting nested sub-samples of the total corpus. They concluded that LSA improves its performance on the exam both when training with documents containing exam words and without them. However, as could be expected, they observed a greater effect when training with exam words. It is worth mentioning that the replacement of exam words with non-sense words may create incorrect documents, thus making the algorithm acquire word knowledge from documents which should contain an exam word but do not. In the Results section, we will study this indirect word acquisition in the TOEFL test without using non-sense words.

Along the same line, Lemaire and Denhière (2006) studied the effect of high-order co-occurrences on LSA semantic similarity, which goes further into the study of Landauer's indirect word acquisition. In their work, Lemaire and Denhière (2006) measure how the similarity between 28 pairs of words (such as bee/honey and buy/shop) changes when a 400-dimension LSA is trained with a growing number of paragraphs. Furthermore, they identify for this task the marginal contribution of the first, second and third order of co-occurrence as the number of paragraphs is increased. In this experiment, they found that not only does the first order of co-occurrence contribute to the semantic closeness of the word pairs, but also the second and the third order promote an increment in pair similarity. It is worth noting that Landauer's indirect word acquisition can be understood in terms of paragraphs that contain neither of the words in a pair but carry a co-occurrence link of third or higher order.

So, the conclusions of the Lemaire and Denhière (2006) and Landauer and Dumais (1997) studies suggest that increasing corpus size results in a gain, even if the increase is in topics which are unrelated to the semantic directions that are pertinent for the task.

However, a different conclusion seems to result from other sets of studies. Stone et al. (2006) have studied the effect of corpus size and specificity in a document similarity rating task. They found that training LSA with a smaller subcorpus selected for the specific task domain maintains or even improves LSA performance. This corresponds to the intuition of noise filtering: removing information from irrelevant dimensions results in improvements in performance.

In addition, Olde et al. (2002) have studied the effect of selecting specific subcorpora in an automatic exam evaluation task. They created several subcorpora from a physics corpus, progressively discarding documents unrelated to the specific questions. Their results showed small differences in performance between the LSA trained with the original corpus and the LSA trained with the more specific subcorpora.

It is well known that the number of LSA dimensions (k) is a key parameter to be duly adjusted in order to eliminate the noisiest dimensions (Landauer and Dumais, 1997; Turney and Pantel, 2010). We show that the specialization of the training corpus accompanied by a decrease of k can increase LSA word-representation quality, whereas removing out-of-domain documents without the decrease of k results in the complete opposite behavior, showing a strong performance reduction. On the other hand, word2vec shows in all cases a performance reduction when out-of-domain documents are discarded, which suggests an exploitation of higher-order word co-occurrences.

Our contribution to understanding the effect of out-of-domain documents on word-embedding knowledge acquisition is valuable from two different perspectives:

• From an operational point of view: we show that LSA's performance can be enhanced when (1) its training corpus is cleaned of out-of-domain documents, and (2) a reduction of the number of LSA dimensions is applied.
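For concreteness, the sketch below shows how a TOEFL-style multiple-choice synonym question can be scored against any set of word vectors (from LSA or skip-gram): the option closest to the probe word in cosine similarity is selected, and accuracy is the fraction of questions answered correctly. The scoring functions and the toy item are illustrative assumptions, not the authors' evaluation code.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two word vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def answer(probe, options, vectors):
    # Choose the option whose vector is closest to the probe word's vector.
    return max(options, key=lambda w: cosine(vectors[probe], vectors[w]))

def accuracy(questions, vectors):
    # questions: list of (probe_word, candidate_options, correct_option) tuples.
    hits = sum(answer(p, opts, vectors) == gold for p, opts, gold in questions)
    return hits / len(questions)

# Toy run with random vectors, only to show the interface; in the experiments
# the vectors would come from LSA or skip-gram models trained on (sub)corpora.
rng = np.random.default_rng(0)
words = ["enormously", "appropriately", "uniquely", "tremendously", "decidedly"]
vectors = {w: rng.normal(size=50) for w in words}
questions = [("enormously",
              ["appropriately", "uniquely", "tremendously", "decidedly"],
              "tremendously")]
print(accuracy(questions, vectors))
```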