Parameter-free Sentence Embedding via Orthogonal Basis

Ziyi Yang1∗, Chenguang Zhu2, and Weizhu Chen3

1Department of Mechanical Engineering, Stanford University
2Microsoft Speech and Dialogue Research Group
3Microsoft Dynamics 365 AI
[email protected], {chezhu, wzchen}@microsoft.com

∗ Most of the work was done during a summer internship at Microsoft.

Abstract

We propose a simple and robust non-parameterized approach for building sentence representations. Inspired by the Gram-Schmidt Process in geometric theory, we build an orthogonal basis of the subspace spanned by a word and its surrounding context in a sentence. We model the semantic meaning of a word in a sentence based on two aspects. One is its relatedness to the word vector subspace already spanned by its contextual words. The other is the word's novel semantic meaning, which shall be introduced as a new basis vector perpendicular to this existing subspace. Following this motivation, we develop an innovative method based on orthogonal basis to combine pre-trained word embeddings into sentence representations. This approach requires zero parameters and offers efficient inference. We evaluate our approach on 11 downstream NLP tasks. Experimental results show that our model outperforms all existing non-parameterized alternatives in all the tasks and is competitive with other approaches relying on either large amounts of labelled data or prolonged training time.

1 Introduction

The concept of word embeddings has been prevalent in the NLP community in recent years, as they can characterize semantic similarity between any pair of words, achieving promising results in a large number of NLP tasks (Mikolov et al., 2013; Pennington et al., 2014; Salle et al., 2016). However, due to the hierarchical nature of human language, it is not sufficient to comprehend text solely based on an isolated understanding of each word. This has prompted a recent rise in the search for semantically robust embeddings for longer pieces of text, such as sentences and paragraphs.

Based on the learning paradigm, existing approaches to sentence embeddings fall into two categories: i) parameterized methods and ii) non-parameterized methods.

Parameterized sentence embeddings. These models are parameterized and require training to optimize their parameters. SkipThought (Kiros et al., 2015) is an encoder-decoder model that predicts adjacent sentences. Pagliardini et al. (2018) propose an unsupervised model, Sent2Vec, which learns n-gram features in a sentence to predict the center word from the surrounding context. Quick Thoughts (QT) (Logeswaran and Lee, 2018) replaces the decoder with a classifier that predicts context sentences from candidate sequences. Khodak et al. (2018) propose à la carte, which learns a linear mapping to reconstruct the center word from its context. Conneau et al. (2017) build the sentence encoder InferSent using the Natural Language Inference (NLI) dataset. The Universal Sentence Encoder (Yang et al., 2018; Cer et al., 2018) utilizes the Transformer architecture (Vaswani et al., 2017; Devlin et al., 2018), which has proved powerful in various NLP tasks; the model is first trained on large-scale unsupervised data from Wikipedia and forums, and then trained on the Stanford Natural Language Inference (SNLI) dataset. Wieting and Gimpel (2017b) propose the gated recurrent averaging network (GRAN), which is trained on the Paraphrase Database (PPDB) and English Wikipedia. Subramanian et al. (2018) leverage a multi-task learning framework to generate sentence embeddings. Wieting et al. (2015a) learn paraphrastic sentence representations as the simple average of updated word embeddings.
Non-parameterized sentence embeddings. Recent work (Arora et al., 2017) shows that, surprisingly, a weighted sum or transformation of word representations can outperform many sophisticated neural network structures in sentence embedding tasks. These methods are parameter-free and require no further training on top of pre-trained word vectors. Arora et al. (2017) construct a sentence embedding called SIF as a sum of pre-trained word embeddings weighted by their smooth inverse frequency. Ethayarajh (2018) builds upon the random walk model proposed in SIF by setting the probability of word generation inversely related to the angular distance between the word and sentence embeddings. Rücklé et al. (2018) concatenate different power-mean word embeddings as the sentence vector in p-mean. Since these methods involve no parameterized model, they can be easily adapted to novel text domains with both fast inference speed and high-quality sentence embeddings. In view of this trend, our work aims to further advance the frontier of this group and set its new state-of-the-art.

In this paper, we propose a novel sentence embedding algorithm, Geometric Embedding (GEM), based entirely on the geometric structure of the word embedding space. Given a d-dimensional word embedding matrix A ∈ R^{d×n} for a sentence with n words, any linear combination of the sentence's word embeddings lies in the subspace spanned by the n word vectors. We analyze the geometric structure of this subspace in R^d. When we consider the words in a sentence one by one in order, each word may bring a novel orthogonal basis vector into the existing subspace. This new basis vector can be considered as the new semantic meaning brought in by this word, while the length of the projection in this direction indicates the intensity of this new meaning. It follows that a word with a strong intensity should have a larger influence on the sentence's meaning.
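The following toy sketch (ours, not code from the paper) makes this intuition concrete: it QR-factorizes a matrix of word vectors and reports, for each word, the length of its component orthogonal to the span of the preceding words. The random vectors are stand-ins for pre-trained embeddings.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 50, 6                        # toy sizes: 50-dim embeddings, 6 words
    A = rng.normal(size=(d, n))         # stand-in for a sentence's word vectors

    Q, R = np.linalg.qr(A)              # Gram-Schmidt in matrix form: A = QR
    for i in range(n):
        novel = abs(R[i, i])                    # length of word i's novel component
        total = np.linalg.norm(A[:, i])         # total length of the word vector
        print(f"word {i}: novel intensity {novel:.2f} of norm {total:.2f}")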
Thus, these intensities can be converted into weights used to linearly combine all word embeddings into the sentence embedding. In this paper, we theoretically frame this approach as a QR factorization of the word embedding matrix A. Furthermore, since the meaning and importance of a word largely depend on its close neighborhood, we propose a sliding-window QR factorization method to capture the context of a word and characterize its significance within that context.

In the last step, we adapt an approach similar to Arora et al. (2017) and remove top principal vectors before generating the final sentence embedding. This step ensures that commonly shared background components, e.g. stop words, do not bias sentence similarity comparison. As we build a new orthogonal basis for each sentence, we propose to use disparate background components for each sentence. This motivates us to put forward a sentence-specific principal vector removal method, which leads to better empirical results.

We evaluate our algorithm on 11 NLP tasks. Our algorithm outperforms all non-parameterized methods and many parameterized approaches in 10 tasks. Compared to SIF (Arora et al., 2017), the performance is boosted by 5.5% on the STS benchmark dataset and by 2.5% on the SST dataset. Moreover, the running time of our model compares favorably with existing models.

The rest of this paper is organized as follows. In Section 2, we describe our sentence embedding algorithm GEM. We evaluate our model on various tasks in Section 3 and Section 4. Finally, we summarize our work in Section 5.

2 Approach

To embed a sentence into a vector of fixed length, we generate a weighted sum of its word vectors (Arora et al., 2017; Wieting et al., 2015a). Sentence embeddings should capture information at two levels: the sentence level and the corpus level. At the corpus level, we need to decide whether the new semantic meaning is unique across the corpus; otherwise it is too common to bring in useful information. At the sentence level, it is essential to know, first, how large the portion of the new direction q_i is within the word that brings it in, and second, whether this direction is important to its context. Combining these questions with the idea of word vector decomposition, we establish an association between the weight of each word vector in the sentence embedding and the new semantic meaning the word brings to the sentence.

2.1 Quantify New Semantic Meaning

Let us consider the idea of word embeddings (Mikolov et al., 2013), where a word w_i is projected as a vector v_{w_i} ∈ R^d. Any sequence of words can be viewed as a subspace in R^d spanned by its word vectors. Before the appearance of the ith word, S is the subspace in R^d spanned by {v_{w_1}, v_{w_2}, ..., v_{w_{i-1}}}, with orthonormal basis {q_1, q_2, ..., q_{i-1}}. The embedding v_{w_i} of the ith word w_i can be decomposed into

    v_{w_i} = \sum_{j=1}^{i-1} r_j q_j + r_i q_i,
    r_j = q_j^T v_{w_i},
    r_i = \| v_{w_i} - \sum_{j=1}^{i-1} r_j q_j \|_2                                 (1)

where \sum_{j=1}^{i-1} r_j q_j is the part of v_{w_i} that resides in the subspace S, and q_i is orthogonal to S and is to be added to S. The above procedure is also known as the Gram-Schmidt Process. In the case of rank deficiency, i.e., when v_{w_i} is already a linear combination of {q_1, q_2, ..., q_{i-1}}, q_i is a zero vector and r_i = 0. In matrix form, this process is known as QR factorization, defined as follows.

QR factorization. Define an embedding matrix of n words as A = [A_{:,1}, A_{:,2}, ..., A_{:,n}] ∈ R^{d×n}, where A_{:,i} is the embedding of the ith word w_i in a word sequence (w_1, ..., w_i, ..., w_n). A can be factorized into A = QR, where the non-zero columns of Q ∈ R^{d×n} are the orthonormal basis and R ∈ R^{n×n} is an upper triangular matrix.

The process above computes the novel semantic meaning of a word w.r.t. all preceding words. As the meaning of a word both influences and is influenced by its close neighbors, we instead calculate the novel orthogonal basis vector q_i of each word w_i within its neighborhood, rather than only w.r.t. the preceding words.

Definition 1 (Contextual Window Matrix). Given a word w_i and its m-neighborhood window inside the sentence (w_{i-m}, ..., w_{i-1}, w_i, w_{i+1}, ..., w_{i+m}), define the contextual window matrix of word w_i as:

    S^i = [v_{w_{i-m}}, ..., v_{w_{i-1}}, v_{w_{i+1}}, ..., v_{w_{i+m}}, v_{w_i}]     (2)

Here we shuffle v_{w_i} to the last column of S^i to compute its novel semantic information compared with its context. The QR factorization of S^i is

    S^i = Q^i R^i                                                                     (3)

Note that q_i is the last column of Q^i, which is the new orthogonal basis vector of this contextual window matrix.

Next, in order to generate the embedding for a sentence, we assign a weight to each of its words. This weight should characterize how much new and important information a word brings to the sentence. The process above yields the orthogonal basis vector q_i, and we propose that q_i represents the novel semantic meaning brought by word w_i. We now discuss how to quantify i) the novelty of q_i relative to the other meanings in w_i, ii) the significance of q_i to its context, and iii) the corpus-wise uniqueness of q_i w.r.t. the whole corpus.

2.2 Novelty

We propose that a word w_i is more important to a sentence if its novel orthogonal basis vector q_i is a large component of v_{w_i}, quantified by the proposed novelty score α_n. Let r denote the last column of R^i and r_{-1} denote the last element of r. Then α_n is defined as:

    α_n = exp( \|q_i\|_2 / \|v_{w_i}\|_2 ) = exp( r_{-1} / \|r\|_2 )                  (4)

Note that \|q_i\|_2 = r_{-1} and \|v_{w_i}\|_2 = \|r\|_2. One can show that α_n is the exponential of the normalized distance between v_{w_i} and the subspace spanned by its context.
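A minimal NumPy sketch of Definition 1 and Eqs. (2)-(4) (our illustration, not the authors' implementation; the truncation of the window at sentence boundaries and the small epsilon guard are assumptions):

    import numpy as np

    def novelty_score(word_vectors, i, m):
        """Build the contextual window matrix S^i with v_{w_i} as the last column,
        QR-factorize it (Eq. 3), and return the novelty score alpha_n (Eq. 4)."""
        n = len(word_vectors)
        context = list(range(max(0, i - m), i)) + list(range(i + 1, min(n, i + m + 1)))
        S_i = np.stack([word_vectors[j] for j in context] + [word_vectors[i]], axis=1)
        Q, R = np.linalg.qr(S_i)          # S^i = Q^i R^i
        r = R[:, -1]                      # last column of R^i
        r_last = abs(r[-1])               # r_{-1}: length of the novel component
        return np.exp(r_last / (np.linalg.norm(r) + 1e-12))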
2.3 Significance

The significance of a word is related to how semantically aligned it is with the meaning of its context. To identify the principal directions, i.e. meanings, in the contextual window matrix S^i, we employ Singular Value Decomposition.

Singular Value Decomposition. Given a matrix A ∈ R^{d×n}, there exist U ∈ R^{d×n} with orthogonal columns, a diagonal matrix Σ = diag(σ_1, ..., σ_n) with σ_1 ≥ σ_2 ≥ ... ≥ σ_n ≥ 0, and an orthogonal matrix V ∈ R^{n×n}, such that A = UΣV^T.

The columns of U, {U_{:,j}}_{j=1}^{n}, are an orthonormal basis of the column subspace of A, and we propose that they represent a set of semantic meanings from the context. Their corresponding singular values {σ_j}_{j=1}^{n}, denoted by σ(A), represent the importance associated with {U_{:,j}}_{j=1}^{n}. The SVD of w_i's contextual window matrix is S^i = U^i Σ^i V^{iT} ∈ R^{d×(2m+1)}. It follows that q_i^T U^i is the coordinate vector of q_i in the basis {U^i_{:,j}}_{j=1}^{2m+1}.

Intuitively, a word is more important if its novel semantic meaning is better aligned with more principal meanings in its contextual window. This can be quantified as \|σ(S^i) ⊙ (q_i^T U^i)\|_2, where ⊙ denotes the element-wise product. Therefore, we define the significance of w_i in its context as:

    α_s = \|σ(S^i) ⊙ (q_i^T U^i)\|_2 / (2m + 1)                                       (5)

It turns out that α_s can be rewritten as

    α_s = \|q_i^T U^i Σ^i\|_2 / (2m + 1)
        = \|q_i^T U^i Σ^i V^{iT}\|_2 / (2m + 1)
        = \|q_i^T S^i\|_2 / (2m + 1)
        = q_i^T v_{w_i} / (2m + 1)
        = r_{-1} / (2m + 1)                                                           (6)

where we use the fact that V^i is an orthogonal matrix and that q_i is orthogonal to all but the last column of S^i, namely v_{w_i}. Therefore, α_s is essentially the distance between w_i and the context hyper-plane, normalized by the context size.

Although α_s and α_n look alike in mathematical form, they model distinct quantities of word w_i against its contextual window. α_n is a function of \|q_i\|_2 divided by \|v_{w_i}\|_2, i.e., the portion of the new semantic meaning in word w_i. In contrast, Eq. (6) shows that α_s equals \|q_i\|_2 divided by a constant; that is, α_s quantifies the absolute magnitude of the new semantic meaning q_i.
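A corresponding sketch of Eq. (5) (ours; q_i is assumed to be the last column of Q^i from the QR factorization above). The assertion checks the rewriting in Eq. (6) numerically:

    import numpy as np

    def significance_score(S_i, q_i):
        """Eq. (5): align q_i with the principal directions of the contextual
        window matrix S^i, weighted by the singular values and normalized by
        the window size 2m + 1."""
        U, sigma, _ = np.linalg.svd(S_i, full_matrices=False)
        window = S_i.shape[1]                              # 2m + 1
        alpha_s = np.linalg.norm(sigma * (q_i @ U)) / window
        shortcut = abs(q_i @ S_i[:, -1]) / window          # Eq. (6): r_{-1} / (2m + 1)
        assert np.isclose(alpha_s, shortcut)
        return alpha_s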
2.4 Corpus-wise Uniqueness

Similar to the idea of inverse document frequency (IDF) (Sparck Jones, 1972), a word that is commonly present in the corpus is likely to be a stop word, and thus its corpus-wise uniqueness is small. In our solution, we compute the principal directions of the corpus and then measure their alignment with the novel orthogonal basis vector q_i. If the alignment is high, w_i is assigned a relatively low corpus-wise uniqueness score, and vice versa.

2.4.1 Compute Principal Directions of Corpus

In Arora et al. (2017), given a corpus containing a set of N sentences, an embedding matrix X = [x_1, x_2, ..., x_N] ∈ R^{d×N} is generated, where x_i is the sentence embedding of the ith sentence in the corpus, computed by the SIF algorithm. The principal vectors of X are then computed, and the projections onto these principal vectors are removed from each sentence embedding x_i.

In contrast to Arora et al. (2017), we do not form the embedding matrix after we obtain the final sentence representations. Instead, we obtain an intermediate coarse-grained embedding matrix X^c = [g_1, ..., g_N] as follows. Suppose the SVD of the sentence matrix of the ith sentence is S = [v_{w_1}, ..., v_{w_n}] = UΣV^T. Then the coarse-grained embedding of the ith sentence is defined as:

    g_i = \sum_{j=1}^{n} f(σ_j) U_{:,j}                                               (7)

where f(σ_j) is a monotonically increasing function. We then compute the top K principal vectors {d_1, ..., d_K} of X^c, with singular values σ_1 ≥ σ_2 ≥ ... ≥ σ_K.

2.4.2 Uniqueness Score

In contrast to Arora et al. (2017), we select different principal vectors of X^c for each sentence, as different sentences may have disparate alignments with the corpus. For each sentence, {d_1, ..., d_K} are re-ranked in descending order of their correlation with the sentence matrix S. The correlation is defined as:

    o_i = σ_i \|S^T d_i\|_2,  1 ≤ i ≤ K                                               (8)

Next, the top h principal vectors after re-ranking based on o_i are selected: D = {d_{t_1}, ..., d_{t_h}}, with o_{t_1} ≥ o_{t_2} ≥ ... ≥ o_{t_h}, and their singular values in X^c are σ_d = [σ_{t_1}, ..., σ_{t_h}] ∈ R^h.

Finally, a word w_i with new semantic meaning vector q_i in this sentence is assigned a corpus-wise uniqueness score:

    α_u = exp( -\|σ_d ⊙ (q_i^T D)\|_2 / h )                                           (9)

This ensures that common stop words have their effect diminished, since their embeddings are closely aligned with the corpus' principal directions.

2.5 Sentence Vector

A sentence vector c_s is computed as a weighted sum of its word embeddings, where the weights come from three scores: the novelty score (α_n), the significance score (α_s), and the corpus-wise uniqueness score (α_u):

    α_i = α_n + α_s + α_u,
    c_s = \sum_{i} α_i v_{w_i}                                                        (10)

We provide a theoretical explanation of Equation (10) in the Appendix.

Sentence-Dependent Removal of Principal Components. Arora et al. (2017) show that, given a set of sentence vectors, removing the projections onto the principal components of the spanned subspace can significantly enhance performance on semantic similarity tasks. However, as each sentence may have a different semantic meaning, it could be sub-optimal to remove the same set of principal components from all sentences. Therefore, we propose sentence-dependent principal component removal (SDR), where we re-rank the top principal vectors based on their correlation with each sentence. Using the method from Section 2.4.2, we obtain D = {d_{t_1}, ..., d_{t_h}} for a sentence s. The final embedding of this sentence is then computed as:

    c_s ← c_s - \sum_{j=1}^{h} (d_{t_j}^T c_s) d_{t_j}                                (11)
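The corpus-level quantities of Sections 2.4.1, 2.4.2 and 2.5 can be sketched as follows (our reading of Eqs. (7)-(9) and (11), not released code; f(sigma) = sigma ** t and the default K and h follow the hyper-parameters reported in the experiments):

    import numpy as np

    def coarse_sentence_embedding(S, t=3):
        """Eq. (7): g = sum_j f(sigma_j) * U_{:,j} for a sentence matrix S
        (columns are word vectors), with f(sigma) = sigma ** t."""
        U, sigma, _ = np.linalg.svd(S, full_matrices=False)
        return U @ (sigma ** t)

    def corpus_principal_directions(sentence_matrices, K=45, t=3):
        """Top-K left singular vectors and singular values of X^c = [g_1, ..., g_N]."""
        Xc = np.stack([coarse_sentence_embedding(S, t) for S in sentence_matrices], axis=1)
        D, sigma, _ = np.linalg.svd(Xc, full_matrices=False)
        return D[:, :K], sigma[:K]

    def rerank_and_select(S, D, sigma, h=17):
        """Eq. (8): o_i = sigma_i * ||S^T d_i||_2; keep the h best-correlated vectors."""
        o = sigma * np.linalg.norm(S.T @ D, axis=0)
        top = np.argsort(-o)[:h]
        return D[:, top], sigma[top]

    def uniqueness_score(q_i, D_s, sigma_s):
        """Eq. (9): alpha_u = exp(-||sigma_d * (q_i^T D)||_2 / h)."""
        h = D_s.shape[1]
        return np.exp(-np.linalg.norm(sigma_s * (q_i @ D_s)) / h)

    def remove_principal_components(c_s, D_s):
        """Eq. (11): sentence-dependent removal, c_s <- c_s - D D^T c_s."""
        return c_s - D_s @ (D_s.T @ c_s)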
Ablation experiments show that sentence-dependent principal component removal achieves better results. The complete algorithm is summarized in Algorithm 1, with an illustration in Figure 1.

2.6 Handling of Out-of-Vocabulary Words

In many NLP algorithms, out-of-vocabulary (OOV) words are mapped to a special "UNK" token. However, in this way, different OOV words with drastically different meanings share the same embedding. To fix this problem, we change the projection method by mapping OOV words to pre-trained in-vocabulary words, based on a SHA-256 hash of their characters. In this way, two different OOV words will almost certainly have different embeddings. In the experiments, we apply this OOV projection technique in both the STS-B and CQA tasks.
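A small sketch of this OOV mapping (ours; the paper specifies a SHA-256 hash of the characters, but the exact reduction of the digest to a vocabulary index is our assumption):

    import hashlib

    def map_oov_word(word, vocab_words):
        """Deterministically map an out-of-vocabulary word to an in-vocabulary one,
        so that different OOV words almost surely get different embeddings."""
        digest = hashlib.sha256(word.encode("utf-8")).hexdigest()
        return vocab_words[int(digest, 16) % len(vocab_words)]

    # usage: vec = embeddings[map_oov_word("hyperparameterization", list(embeddings))]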

C I" ∈ B

Re-rank Select

= 0.5 = 0.6 = -0.1

" C×(3$%0) = 0.1 = 0.2 = 0.3 / ∈ B 'J = exp(−|| 0.5.C0, 0.6.C3, −0.1.C4, … ||3/ℎ)

= 0.2 - 0.4 ... + 0.1 ||[0.1. /" , 0.2. /" , 0.1. /" , … ]|| 0 3 4 3 '& = 27 + 1 0.1 '9 = exp( ) ||[0.2, −0.4, … , 0.1]||3

Figure 1: An illustration of the GEM algorithm. Top middle: the sentence to encode, with words w_1 to w_n; the contextual window of word w_i is inside the dashed line. Bottom middle: form the contextual window matrix S^i for w_i, compute q_i and the novelty score α_n (Sections 2.1 and 2.2). Bottom left: SVD of S^i and computation of the significance score α_s (Section 2.3). Bottom right: re-rank and select from the principal components (orange blocks) and compute the uniqueness score α_u (Section 2.4).

Algorithm 1 Geometric Embedding (GEM)

Inputs: a set of sentences S, vocabulary V, word embeddings {v_w ∈ R^d | w ∈ V}
Outputs: sentence embeddings {c_s ∈ R^d | s ∈ S}

for the ith sentence s in S do
    Form matrix S ∈ R^{d×n} with S_{:,j} = v_{w_j}, where w_j is the jth word in s
    Compute the SVD S = UΣV^T
    Set the ith column of the coarse-grained sentence embedding matrix to X^c_{:,i} = U (σ(S))^3
end for
Take the first K singular vectors {d_1, ..., d_K} and singular values σ_1 ≥ σ_2 ≥ ... ≥ σ_K of X^c
for each sentence s in S do
    Re-rank {d_1, ..., d_K} in descending order of o_i = σ_i \|S^T d_i\|_2, 1 ≤ i ≤ K
    Select the top h principal vectors as D = [d_{t_1}, ..., d_{t_h}], with singular values σ_d = [σ_{t_1}, ..., σ_{t_h}]
    for each word w_i in s do
        S^i = [v_{w_{i-m}}, ..., v_{w_{i-1}}, v_{w_{i+1}}, ..., v_{w_{i+m}}, v_{w_i}] is the contextual window matrix of w_i
        Compute the QR decomposition S^i = Q^i R^i; let q_i and r denote the last columns of Q^i and R^i
        α_n = exp(r_{-1}/\|r\|_2), α_s = r_{-1}/(2m + 1), α_u = exp(-\|σ_d ⊙ (q_i^T D)\|_2 / h)
        α_i = α_n + α_s + α_u
    end for
    c_s = \sum_{w_i ∈ s} α_i v_{w_i}
    Principal vectors removal: c_s ← c_s - D D^T c_s
end for
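For readers who prefer code, the per-sentence loop of Algorithm 1 can be condensed as below (a sketch under our assumptions, not the authors' released implementation: window truncation at sentence boundaries and the epsilon guard are ours; D and sigma_d are the re-ranked corpus principal vectors and singular values for this sentence, as in Section 2.4.2):

    import numpy as np

    def gem_sentence_embedding(word_vectors, D, sigma_d, m=7):
        """Weighted sum of word vectors with alpha_n + alpha_s + alpha_u weights,
        followed by sentence-dependent principal component removal."""
        n, h = len(word_vectors), D.shape[1]
        c_s = np.zeros_like(word_vectors[0], dtype=float)
        for i in range(n):
            ctx = list(range(max(0, i - m), i)) + list(range(i + 1, min(n, i + m + 1)))
            S_i = np.stack([word_vectors[j] for j in ctx] + [word_vectors[i]], axis=1)
            Q, R = np.linalg.qr(S_i)
            q_i, r = Q[:, -1], R[:, -1]
            r_last = abs(r[-1])
            alpha_n = np.exp(r_last / (np.linalg.norm(r) + 1e-12))      # Eq. (4)
            alpha_s = r_last / (2 * m + 1)                              # Eq. (6)
            alpha_u = np.exp(-np.linalg.norm(sigma_d * (q_i @ D)) / h)  # Eq. (9)
            c_s += (alpha_n + alpha_s + alpha_u) * word_vectors[i]      # Eq. (10)
        return c_s - D @ (D.T @ c_s)                                    # Eq. (11)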

3 Experiments

3.1 Semantic Similarity Tasks: STS Benchmark

We evaluate our model on the STS Benchmark (Cer et al., 2017), a sentence-level semantic similarity dataset. Given a sentence pair, a model predicts a similarity score. Evaluation is by the Pearson's coefficient r between human-labeled similarity (0-5 points) and the predictions.

Experimental settings. We report two versions of our model, one using only GloVe word vectors (GEM + GloVe), and the other using word vectors concatenated from LexVec, fastText and PSL (Wieting et al., 2015b) (GEM + L.F.P). The final similarity score is computed as the inner product of the normalized sentence vectors. Since our model is non-parameterized, it does not use any information from the dev set when evaluating on the test set, and vice versa. The hyper-parameters are chosen as m = 7, h = 17, K = 45, and t = 3 by a hyper-parameter search on the dev set. Results on the dev and test sets are reported in Table 1. As shown, on the test set our model scores 6.4% higher than the other non-parameterized model SIF, and 26.4% higher than the baseline of averaging L.F.P word vectors. It also outperforms all parameterized models, including GRAN, InferSent, Sent2Vec and Reddit + SNLI.

Non-parameterized models                     dev    test
GEM + L.F.P (ours)                           83.5   78.4
GEM + LexVec (ours)                          81.9   76.5
SIF (Arora et al., 2017)                     80.1   72.0
uSIF (Ethayarajh, 2018)                      84.2   79.5
LexVec                                       58.78  50.43
L.F.P                                        62.4   52.0
word2vec skipgram                            70.0   56.5
GloVe                                        52.4   40.6
ELMo                                         64.6   55.9
Parameterized models
PARANMT-50M (Wieting and Gimpel, 2017a)      -      79.9
Reddit + SNLI (Yang et al., 2018)            81.4   78.2
GRAN (Wieting and Gimpel, 2017b)             81.8   76.4
InferSent (Conneau et al., 2017)             80.1   75.8
Sent2Vec (Pagliardini et al., 2018)          78.7   75.5
Paragram-Phrase (Wieting et al., 2015a)      73.9   73.2

Table 1: Pearson's r × 100 on STSB. Best results are in bold.

3.2 Semantic Similarity Tasks: CQA

We evaluate our model on subtask B of the SemEval Community Question Answering (CQA) task, another semantic similarity dataset. Given an original question Q_o and a set of the first ten related questions (Q_1, ..., Q_10) retrieved by a search engine, the model is expected to re-rank the related questions according to their similarity with respect to the original question. Each retrieved question Q_i is labelled "PerfectMatch", "Relevant" or "Irrelevant" with respect to Q_o. Mean average precision (MAP) is used as the evaluation measure.

We encode each question text into a unit vector u. The retrieved questions {Q_i}_{i=1}^{10} are ranked according to their cosine similarity with Q_o. Results are shown in Table 2. For comparison, we include results from the best models in the 2017 competition: SimBow (Charlet and Damnati, 2017), KeLP (Filice et al., 2017), and Reddit + SNLI tuned. Note that all three benchmark models require learning on the CQA training set, and SimBow and KeLP leverage optional features including usage of comments and user profiles. In comparison, our model only uses the question text without any training. Our model clearly outperforms Reddit + SNLI tuned, SimBow-primary and the KeLP model.

                        MAP
GEM + L.F.P (ours)      49.11
Reddit + SNLI tuned     47.44
KeLP-contrastive1       49.00
SimBow-contrastive2     47.87
SimBow-primary          47.22

Table 2: MAP on CQA subtask B.
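A minimal sketch of this re-ranking step (ours; the interface of `related` as (id, GEM vector) pairs is illustrative):

    import numpy as np

    def rank_related_questions(q_orig, related):
        """Order retrieved questions by cosine similarity with the original question."""
        u = q_orig / np.linalg.norm(q_orig)
        scored = [(qid, float(v @ u / np.linalg.norm(v))) for qid, v in related]
        return sorted(scored, key=lambda x: x[1], reverse=True)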
3.3 Supervised Tasks

We further test our model on nine supervised tasks, including seven classification tasks: movie review (MR) (Pang and Lee, 2005), Stanford Sentiment Treebank (SST) (Socher et al., 2013), question-type classification (TREC) (Voorhees and Dang, 2003), opinion polarity (MPQA) (Wiebe et al., 2005), product reviews (CR) (Hu and Liu, 2004), subjectivity/objectivity classification (SUBJ) (Pang and Lee, 2004) and paraphrase identification (MRPC) (Dolan et al., 2004). We also evaluate on two entailment and semantic relatedness tasks: SICK similarity (SICK-R) and SICK entailment (SICK-E) (Marelli et al., 2014). The generated sentence embeddings are fixed and only the downstream task-specific neural structure is learned. For classification tasks, a linear classifier is trained on top, following Kiros et al. (2015), and classification accuracy is reported. For relatedness tasks, we follow Tai et al. (2015) to learn the probability distribution of relatedness scores, and we report Pearson's correlation. The four hyper-parameters are chosen to be the same as in the STS benchmark experiment. For fair comparison, embedding models are divided into two categories, non-parameterized and parameterized, as described in Section 1. Results are shown in Table 3.

GEM outperforms all other non-parameterized sentence embedding models, including SIF, p-mean (Rücklé et al., 2018), and a GloVe bag-of-words baseline. The consistently superior performance again demonstrates the advantage of GEM's weighting scheme. GEM also compares favorably with most parameterized models, including à la carte (Khodak et al., 2018), FastSent (Hill et al., 2016), InferSent, QT, Sent2Vec, SkipThought-LN (with layer normalization) (Kiros et al., 2015), SDAE (Hill et al., 2016), STN (Subramanian et al., 2018) and USE (Yang et al., 2018). Note that the sentence representations generated by GEM have a much smaller dimension than most benchmark models, so the subsequent neural structure has fewer trainable parameters. This observation suggests that local multi-word-level information in sentences already provides revealing information for sophisticated downstream tasks. The fact that GEM does well on several classification tasks (e.g. TREC and SUBJ) indicates that the proposed weighting scheme is able to recognize important words in sentences. Also, GEM's competitive performance on sentiment tasks shows that exploiting the geometric structure of sentence subspaces is semantically informative.

Model            Dim   Train time (h)  MR    CR    SUBJ  MPQA  SST   TREC  MRPC       SICK-R  SICK-E
Non-parameterized models
GEM + L.F.P      900   0               79.8  82.5  93.8  89.9  84.7  91.4  75.4/82.9  86.5    86.2
GEM + GloVe      300   0               78.8  81.1  93.1  89.4  83.6  88.6  73.4/82.3  86.3    85.3
SIF              300   0               77.3  78.6  90.5  87.0  82.2  78.0  -          86.0    84.6
uSIF             300   0               -     -     -     -     80.7  -     -          83.8    81.1
p-mean           3600  0               78.4  80.4  93.1  88.9  83.0  90.6  -          -       -
GloVe BOW        300   0               78.7  78.5  91.6  87.6  79.8  83.6  72.1/80.9  80.0    78.6
Parameterized models
InferSent        4096  24              81.1  86.3  92.4  90.2  84.6  88.2  76.2/83.1  88.4    86.3
Sent2Vec         700   6.5             75.8  80.3  91.1  85.9  -     86.4  72.5/80.8  -       -
SkipThought-LN   4800  336             79.4  83.1  93.7  89.3  82.9  88.4  -          85.8    79.5
FastSent         300   2               70.8  78.4  88.7  80.6  -     76.8  72.2/80.3  -       -
à la carte       4800  N/A             81.8  84.3  93.8  87.6  86.7  89.0  -          -       -
SDAE             2400  192             74.6  78.0  90.8  86.9  -     78.4  73.7/80.7  -       -
QT               4800  28              82.4  86.0  94.8  90.2  87.6  92.4  76.9/84.0  87.4    -
STN              4096  168             82.5  87.7  94.0  90.9  83.2  93.0  78.6/84.4  88.8    87.8
USE              512   N/A             81.36 86.08 93.66 87.14 86.24 96.60 -          -       -

Table 3: Results on supervised tasks. Sentence embeddings are fixed for the downstream supervised tasks. The best result for each task is underlined; the best results from models in the same category are in bold. SIF results are taken from Arora et al. (2017) and Rücklé et al. (2018); training times are collected from Logeswaran and Lee (2018).
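The supervised evaluations above keep the sentence embeddings fixed and train only a light task-specific model on top. A minimal sketch of that linear-probe protocol using scikit-learn (our illustration; the paper follows Kiros et al. (2015) and does not prescribe a particular library):

    from sklearn.linear_model import LogisticRegression

    def linear_probe_accuracy(train_emb, train_labels, test_emb, test_labels):
        """Train a linear classifier on frozen sentence embeddings, report accuracy."""
        clf = LogisticRegression(max_iter=1000)
        clf.fit(train_emb, train_labels)          # only the classifier is learned
        return clf.score(test_emb, test_labels)   # classification accuracy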

4 Discussion

Comparison with Arora et al. (2017). Although GEM shares with Arora et al. (2017) the idea of modelling a sentence as a weighted sum of its word vectors, it is substantially different. First, we adopt well-established numerical linear algebra to quantify the semantic meaning and importance of words in their sentence context, and this new approach proves to be effective. Second, the weights in SIF (and uSIF) are calculated from vocabulary statistics on a very large corpus (Wikipedia). In contrast, the weights in GEM are computed directly from the sentences themselves together with the evaluated dataset, independent of prior statistical knowledge of language or vocabularies. Furthermore, the components of GEM's weights are derived from numerical linear algebra, Eqs. (4) to (9), whereas SIF directly includes a hyper-parameter, its smoothing term, in its weighting scheme.

Robustness and Effectiveness. Besides the experiments mentioned above, we also test the robustness and effectiveness of GEM on several simple but non-trivial examples. These experiments demonstrate that GEM is quite insensitive to the removal of non-important stop words, and that GEM correctly assigns higher weights to words with more significant semantic meanings in the sentence.

We first test robustness by removing non-important stop words in a sentence and computing the similarity between the original sentence and the one after removal. For example:

• original sentence: "the first tropical system to slam the US this year is expected to make landfall as a hurricane"
• remove 7 stop words: "first tropical system slam US this year expected make landfall hurricane"

The cosine similarity between these two sentences given by GEM is 0.954. Even after aggressively removing 7 stop words, GEM still assigns very similar embeddings to the two sentences.

We further demonstrate that GEM does assign higher weights to words with more significant semantic meanings. Consider the sentence "there are two ducks swimming in the river". The weights assigned by GEM, sorted from high to low, are [ducks: 4.93, river: 4.72, swimming: 4.70, two: 3.87, are: 3.54, there: 3.23, in: 3.04, the: 2.93]. GEM successfully assigns higher weights to informative words like "ducks" and "river", and downplays stop words like "the" and "there". More examples can be found in the Appendix.

Ablation Study. As shown in Table 4, every GEM weight (α_n, α_s, α_u) and the proposed principal component removal methods contribute to the performance. As listed in the upper half, adding GEM weights improves the score by 8.6% on the STS dataset compared with averaging the three concatenated word vectors. The sentence-dependent principal component removal (SDR) proposed in GEM improves by 1.7% over directly removing the top h corpus principal components (SIR). Using GEM weights and SDR together yields an overall improvement of 21.1%. As shown in the lower half of Table 4, every weight contributes to the performance of our model; for example, the three weights together improve the score on the SUBJ task by 0.38% compared with using only α_n.

Configurations          STSB dev  SUBJ
Mean of L.F.P           62.4      -
GEM weights             71.0      -
GEM weights + SIR       81.8      -
GEM weights + SDR       83.5      -
α_n + SDR               81.6      93.42
α_n, α_s + SDR          81.9      93.6
α_n, α_s, α_u + SDR     83.5      93.8

Table 4: Comparison of different configurations demonstrates the effectiveness of our model on the STSB dev set and SUBJ. SDR stands for the sentence-dependent principal component removal of Section 2.4.2; SIR stands for sentence-independent principal component removal, i.e. directly removing the top h corpus principal components from the sentence embedding.

Sensitivity Study. We evaluate the effect of all four hyper-parameters in our model: the window size m in the contextual window matrix, the number of candidate principal components K, the number of principal components to remove h, and the power of the singular value in the coarse sentence embedding, i.e. the power t in f(σ_j) = σ_j^t in Equation (7). We sweep the hyper-parameters and test on the STSB dev set, SUBJ, and MPQA. Unspecified parameters are fixed at m = 7, K = 45, h = 17 and t = 3. As shown in Figure 2, our model is quite robust with respect to the hyper-parameters.

[Figure 2 (plot not reproduced): Sensitivity tests on the four hyper-parameters: the window size m in the contextual window matrix, the number of candidate principal components K, the number of principal components to remove h, and the exponential power of the singular value in the coarse sentence embedding. Curves are shown for MPQA, STSB and SUBJ.]

Inference speed. We also compare the inference speed of our algorithm on the STSB test set with the benchmark models SkipThought and InferSent. SkipThought and InferSent are run on an NVIDIA Tesla P100 GPU, and our model is run on a CPU (Intel Xeon E5-2690 v4 @ 2.60GHz). For fair comparison, the batch size of InferSent and SkipThought is set to 1. The results are shown in Table 5. Even without GPU acceleration, our model is faster than InferSent and 54% faster than SkipThought.

                    Average run time (s)  Variance
GEM (CPU)           20.08                 0.23
InferSent (GPU)     21.24                 0.15
SkipThought (GPU)   43.36                 0.10

Table 5: Run time of GEM, InferSent and SkipThought for encoding the sentences in the STSB test set. GEM is run on a CPU; InferSent and SkipThought are run on a GPU. Data are collected from 5 trials.

5 Conclusions

We proposed a simple non-parameterized method for generating sentence embeddings, based entirely on the geometric structure of the subspace spanned by word embeddings.¹ Our sentence embedding evolves from the new orthogonal basis vector brought in by each word, which represents its novel semantic meaning. The evaluation shows that our method not only sets up a new state-of-the-art for non-parameterized models but also performs competitively when compared with models requiring either large amounts of training data or prolonged training time. In future work, we plan to incorporate subwords into the model and explore more geometric structures in sentences.

¹Code for GEM will be published soon.

Acknowledgments

We would like to thank Jade Huang for proofreading the paper and helpful writing suggestions. We also acknowledge the anonymous reviewers for their valuable feedback.

References

Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A simple but tough-to-beat baseline for sentence embeddings. International Conference on Learning Representations.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural language processing with Python: analyzing text with the natural language toolkit. O'Reilly Media, Inc.

Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. Semeval-2017 task 1: Semantic textual similarity multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055.

Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. 2018. Universal sentence encoder. arXiv preprint arXiv:1803.11175.

Delphine Charlet and Geraldine Damnati. 2017. Simbow at semeval-2017 task 3: Soft-cosine semantic similarity between questions for community question answering. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 315-319.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670-680, Copenhagen, Denmark. Association for Computational Linguistics.
A sentimental educa- In Proceedings of the 20th international conference tion: Sentiment analysis using subjectivity summa- on Computational Linguistics, page 350. Associa- rization based on minimum cuts. In Proceedings of tion for Computational Linguistics. the 42nd annual meeting on Association for Compu- tational Linguistics, page 271. Association for Com- Kawin Ethayarajh. 2018. Unsupervised random walk putational Linguistics. sentence embeddings: A strong but simple baseline. In Proceedings of The Third Workshop on Represen- Bo Pang and Lillian Lee. 2005. Seeing stars: Exploit- tation Learning for NLP, pages 91–100. ing class relationships for sentiment categorization with respect to rating scales. In Proceedings of the Simone Filice, Giovanni Da San Martino, and Alessan- 43rd annual meeting on association for computa- dro Moschitti. 2017. Kelp at semeval-2017 task 3: tional linguistics, pages 115–124. Association for Learning pairwise patterns in community question Computational Linguistics. answering. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Jeffrey Pennington, Richard Socher, and Christopher pages 326–333. Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 confer- Felix Hill, Kyunghyun Cho, and Anna Korhonen. ence on empirical methods in natural language pro- 2016. Learning distributed representations of cessing (EMNLP), pages 1532–1543. sentences from unlabelled data. arXiv preprint arXiv:1602.03483. A. Ruckl¨ e,´ S. Eger, M. Peyrard, and I. Gurevych. 2018. Minqing Hu and Bing Liu. 2004. Mining and summa- Concatenated p-mean Word Embeddings as Univer- rizing customer reviews. In Proceedings of the tenth sal Cross-Lingual Sentence Representations. ArXiv ACM SIGKDD international conference on Knowl- e-prints. edge discovery and , pages 168–177. Alexandre Salle, Aline Villavicencio, and Marco Idiart. ACM. 2016. Matrix factorization using window sampling Mikhail Khodak, Nikunj Saunshi, Yingyu Liang, and negative sampling for improved word represen- Tengyu Ma, Brandon Stewart, and Sanjeev Arora. tations. CoRR, abs/1606.00819. 2018. A la carte embedding: Cheap but effective in- duction of semantic feature vectors. arXiv preprint Richard Socher, Alex Perelygin, Jean Wu, Jason arXiv:1805.05388. Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, for semantic compositionality over a sentiment tree- Richard Zemel, Raquel Urtasun, Antonio Torralba, bank. In Proceedings of the 2013 conference on and Sanja Fidler. 2015. Skip-thought vectors. In empirical methods in natural language processing, Advances in neural information processing systems, pages 1631–1642. pages 3294–3302. Karen Sparck Jones. 1972. A statistical interpretation Lajanugen Logeswaran and Honglak Lee. 2018. An of term specificity and its application in retrieval. efficient framework for learning sentence represen- Journal of documentation, 28(1):11–21. tations. International Conference on Learning Rep- resentations. Sandeep Subramanian, Adam Trischler, Yoshua Ben- gio, and Christopher J Pal. 2018. Learning gen- Marco Marelli, Luisa Bentivogli, Marco Baroni, Raf- eral purpose distributed sentence representations via faella Bernardi, Stefano Menini, and Roberto Zam- large scale multi-task learning. arXiv preprint parelli. 2014. Semeval-2014 task 1: Evaluation of arXiv:1804.00079. 
Kai Sheng Tai, Richard Socher, and Christopher D Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008.

Ellen M Voorhees and Hoa Trang Dang. 2003. Overview of the TREC 2003 question answering track. In TREC, volume 2003, pages 54-68.

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2-3):165-210.

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2015a. Towards universal paraphrastic sentence embeddings. In International Conference on Learning Representations.

John Wieting, Mohit Bansal, Kevin Gimpel, Karen Livescu, and Dan Roth. 2015b. From paraphrase database to compositional paraphrase model and back. arXiv preprint arXiv:1506.03487.

John Wieting and Kevin Gimpel. 2017a. Paranmt-50m: Pushing the limits of paraphrastic sentence embeddings with millions of machine translations. arXiv preprint arXiv:1711.05732.

John Wieting and Kevin Gimpel. 2017b. Revisiting recurrent networks for paraphrastic sentence embeddings. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Y. Yang, S. Yuan, D. Cer, S.-y. Kong, N. Constant, P. Pilar, H. Ge, Y.-H. Sung, B. Strope, and R. Kurzweil. 2018. Learning semantic textual similarity from conversations. ArXiv e-prints.

A Robustness and Effectiveness of GEM

A.1 Robustness Test

We test the robustness of GEM by removing one non-important stop word in a sentence and computing the similarity between the original sentence and the one after removal. For example:

• original = "The student is reading a physics book"
• removed = "student is reading a physics book"

The stop word "The" is removed. The cosine similarity between the embeddings of the two sentences generated by GEM is 0.998. GEM assigns very similar embeddings to these two sentences even with the stop word removed, and this is a short sentence with only 7 words. More examples:

• original = "Someone is sitting on the blanket"
• removed = "Someone is sitting on blanket"
• cosine similarity = 0.981

and

• original = "A man walks along walkway to the store"
• removed = "man walks along walkway to the store"
• cosine similarity = 0.984

These experiments show that GEM is robust against stop words and word order.

A.2 Effectiveness Test

We also demonstrate that GEM assigns higher weights to words with more significant meanings. Consider the sentence "the stock market closes lower on Friday"; the weights assigned by GEM are [lower: 4.94, stock: 4.93, closes: 4.78, market: 4.62, Friday: 4.51, the: 3.75, on: 3.70]. Again, GEM emphasizes informative words like "lower" and "closes", and diminishes stop words like "the" and "on".

B Proof

The novelty score (α_n), significance score (α_s) and corpus-wise uniqueness score (α_u) are larger when a word w has a relatively rare appearance in the corpus and brings new and important semantic meaning to the sentence.

Following Section 3 of Arora et al. (2017), we can use the probability that a word w is emitted from sentence s in a dynamic process to explain Eq. (10), stated as the following theorem with its proof provided below.

Theorem 1. Suppose the probability that word w_i is emitted from sentence s is²

    p[w_i | c_s] ∝ ( exp(⟨c_s, v_{w_i}⟩) / Z + exp(-(α_n + α_s + α_u)) )              (12)

where c_s is the sentence embedding, Z = \sum_{w_i ∈ V} exp(⟨c_s, v_{w_i}⟩), and V denotes the vocabulary. Then, when Z is sufficiently large, the MLE for c_s is:

    c_s ∝ \sum_{w_i ∈ s} (α_n + α_s + α_u) v_{w_i}                                    (13)

Proof: According to Equation (12),

    p[w_i | c_s] = (1/N) ( exp(⟨c_s, v_{w_i}⟩) / Z + exp(-(α_n + α_s + α_u)) )        (14)

where N and Z are two partition functions defined as

    N = 1 + \sum_{w_i ∈ V} exp(-(α_n(w_i) + α_s(w_i) + α_u(w_i))),
    Z = \sum_{w_i ∈ V} exp(⟨c_s, v_{w_i}⟩)                                            (15)

The joint probability of sentence s is then

    p[s | c_s] = \prod_{w_i ∈ s} p(w_i | c_s)                                         (16)

To simplify the notation, let α = α_n + α_s + α_u. It follows that the log likelihood f_{w_i}(c_s) of word w_i emitted from sentence s is given by

    f_{w_i}(c_s) = log( exp(⟨c_s, v_{w_i}⟩) / Z + e^{-α} ) - log(N)                   (17)

²The first term is adapted from Arora et al. (2017), where words near the sentence vector c_s have a higher probability of being generated. The second term is introduced so that words similar to the context in the sentence or close to common words in the corpus are also likely to occur.

Taking the gradient with respect to c_s,

    ∇f_{w_i}(c_s) = exp(⟨c_s, v_{w_i}⟩) v_{w_i} / ( exp(⟨c_s, v_{w_i}⟩) + Z e^{-α} )  (18)

By Taylor expansion, we have

    f_{w_i}(c_s) ≈ f_{w_i}(0) + ∇f_{w_i}(0)^T c_s
                 = constant + ⟨c_s, v_{w_i}⟩ / (Z e^{-α} + 1)                         (19)

Again by Taylor expansion on Z,

    1 / (Z e^{-α} + 1) ≈ 1/(1 + Z) + Z α / (1 + Z)^2 ≈ Z α / (1 + Z)^2 ≈ α / (1 + Z)  (20)

The approximation is based on the assumption that Z is sufficiently large. It follows that

    f_{w_i}(c_s) ≈ constant + (α / (1 + Z)) ⟨c_s, v_{w_i}⟩                            (21)

Then the maximum likelihood estimate of c_s is:

    c_s ≈ \sum_{w_i ∈ s} (α / (1 + Z)) v_{w_i} ∝ \sum_{w_i ∈ s} (α_n + α_s + α_u) v_{w_i}   (22)

C Experimental Settings

For all experiments, sentences are tokenized using the NLTK wordpunct tokenizer (Bird et al., 2009), and all punctuation is skipped. We use f(σ_j) = σ_j^t in Equation (7). In the STS benchmark dataset, our hyper-parameters are chosen by a parameter search on the STSB dev set: m = 7, h = 17, K = 45, and t = 3; we use the same values for all supervised tasks. The integer intervals of the parameter search are m ∈ [5, 9], h ∈ [8, 20], K ∈ [35, 75] (at a stride of 5), and t ∈ [1, 5]. On the CQA dataset, m and h are changed to 6 and 15, and the correlation term in Section 2.4.2 is changed to o_i = \|S^T d_i\|_2 empirically. In the supervised tasks, as in Arora et al. (2017), we do not perform principal component removal.
D Clarifications on Linear Algebra

D.1 Encoding a Long Sequence of Words

We would like to give a clarification on encoding a long sequence of words, for example a paragraph or an article, where the length n of the sequence is larger than the dimension d of the pre-trained word vectors. The only part of GEM relevant to the length of the sequence n is the coarse embedding in Equation (7). The SVD of the sentence matrix of the ith sentence is still S = [v_{w_1}, ..., v_{w_n}] = UΣV^T ∈ R^{d×n}, where now U ∈ R^{d×d}, Σ ∈ R^{d×n}, and V ∈ R^{n×n}. Note that the (d+1)th to nth columns of Σ are all zero, and Equation (7) becomes g_i = \sum_{j=1}^{d} f(σ_j) U_{:,j}. The rest of the algorithm works as usual. Also, the Gram-Schmidt (GS) process is computed within the context window of word w_i, and the length of the context window is set to 2m + 1 = 17 in the STS benchmark dataset and the supervised downstream tasks. That is, GS is computed on 17 vectors, and 17 is smaller than the dimension d. Therefore, GS is always valid in our model, independent of the length of the sentence.

D.2 Sensitivity to Word Order

Although it utilizes the Gram-Schmidt process (GS), GEM is insensitive to the order of words in the sentence, explained as follows. The new semantic meaning vector q_i computed by doing GS on the contextual window matrix S^i is independent of the relative order of the first 2m vectors, because in GEM the word w_i (the word we are computing weights for) is always shifted to the last column of S^i, and the weighting scheme in GEM depends only on q_i. Therefore, the weight scores for w_i stay the same.
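A quick numerical check of this claim (ours, not from the paper): permuting the 2m context columns leaves the last orthonormal basis vector, and hence the GEM weight of w_i, unchanged up to sign.

    import numpy as np

    rng = np.random.default_rng(1)
    d, m = 40, 3
    context = rng.normal(size=(d, 2 * m))     # 2m context word vectors
    target = rng.normal(size=d)               # v_{w_i}, always the last column

    def last_basis_vector(ctx, v):
        Q, _ = np.linalg.qr(np.column_stack([ctx, v]))
        return Q[:, -1]

    q1 = last_basis_vector(context, target)
    q2 = last_basis_vector(context[:, rng.permutation(2 * m)], target)
    print(np.isclose(abs(q1 @ q2), 1.0))      # True: same direction up to sign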