
Unsupervised Sentence-embeddings by Manifold Approximation and Projection

Subhradeep Kayal
Prosus N.V.
Amsterdam, Netherlands
[email protected]

Abstract

The concept of unsupervised universal sentence encoders has gained traction recently, wherein pre-trained models generate effective task-agnostic fixed-dimensional representations for phrases, sentences and paragraphs. Such methods are of varying complexity, from simple weighted-averages of word vectors to complex language-models based on bidirectional transformers. In this work we propose a novel technique to generate sentence-embeddings in an unsupervised fashion by projecting the sentences onto a fixed-dimensional manifold with the objective of preserving local neighbourhoods in the original space. To delineate such neighbourhoods we experiment with several set-distance metrics, including the recently proposed Word Mover's distance, while the fixed-dimensional projection is achieved by employing a scalable and efficient manifold approximation method rooted in topological data analysis. We test our approach, which we term EMAP or Embeddings by Manifold Approximation and Projection, on six publicly available text-classification datasets of varying size and complexity. Empirical results show that our method consistently performs similar to or better than several alternative state-of-the-art approaches.

1 Introduction

1.1 On sentence-embeddings

Dense vector representations of words, or word-embeddings, form the backbone of most modern NLP applications and can be constructed using context-free (Bengio et al., 2003; Mikolov et al., 2013; Pennington et al., 2014) or contextualized methods (Peters et al., 2018; Devlin et al., 2019).
Given that practical systems often benefit from having representations for sentences and documents, in addition to word-embeddings (Palangi et al., 2016; Yan et al., 2016), a simple trick is to use the weighted average over some or all of the embeddings of words in a sentence or document. Although sentence-embeddings constructed this way often lose information because of the disregard for word-order during averaging, they have been found to be surprisingly performant (Aldarmaki and Diab, 2018).

More sophisticated methods focus on jointly learning the embeddings of sentences and words using models similar to Word2Vec (Le and Mikolov, 2014; Chen, 2017), using encoder-decoder approaches that reconstruct the surrounding sentences of an encoded passage (Kiros et al., 2015), or training bi-directional LSTM models on large external datasets (Conneau et al., 2017). Meaningful sentence-embeddings have also been constructed by fine-tuning pre-trained bidirectional transformers (Devlin et al., 2019) using a Siamese architecture (Reimers and Gurevych, 2019).

In parallel to the approaches mentioned above, a stream of methods has emerged recently that exploits the inherent geometric properties of the structure of sentences by treating them as sets or sequences of word-embeddings. For example, Arora et al. (2017) propose the construction of sentence-embeddings based on weighted word-embedding averages with the removal of the dominant singular vector, while Rücklé et al. (2018) produce sentence-embeddings by concatenating several power-means of word-embeddings corresponding to a sentence. Very recently, spectral decomposition techniques were used to create sentence-embeddings, which produced state-of-the-art results when used in concatenation with averaging (Kayal and Tsatsaronis, 2019; Almarwani et al., 2019).

Our work is most related to that of Wu et al. (2018), who use Random Features (Rahimi and Recht, 2008) to learn document embeddings which preserve the properties of an explicitly-defined kernel based on the Word Mover's Distance (Kusner et al., 2015). Where Wu et al. predefine the nature of the kernel, our proposed approach can learn the similarity-preserving manifold for a given set-distance metric, offering increased flexibility.

1.2 Motivation and contributions

A simple way to form sentence-embeddings is to compute the dimension-wise arithmetic mean of the embeddings of the words in a particular sentence. Even though this approach incurs information loss by disregarding the fact that sentences are sequences (or, at the very least, sets) of word vectors, it works well in practice. This already provides an indication that there is more information in the sentences to be exploited.
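As a concrete illustration of this baseline (not of our proposed method), consider the following minimal mean-pooling sketch. The function name, the token-to-vector mapping `word_vectors`, and the whitespace tokenization are illustrative assumptions, not part of the original work.

```python
import numpy as np

def mean_pooled_embedding(sentence, word_vectors, dim=300):
    """Dimension-wise arithmetic mean of the word-embeddings in a sentence.

    `word_vectors` is assumed to map a token to its pre-trained
    d-dimensional vector (e.g. Word2Vec or GloVe, where d = 300).
    Out-of-vocabulary tokens are skipped.
    """
    vectors = [word_vectors[w] for w in sentence.lower().split()
               if w in word_vectors]
    if not vectors:  # no known words: fall back to the zero vector
        return np.zeros(dim)
    return np.mean(vectors, axis=0)
```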
Kusner et al. (2015) aim to use more of the information available in a sentence by representing sentences as a weighted point cloud of embedded words. Rooted in transportation theory, their Word Mover's distance (WMD) is the minimum amount of distance that the embedded words of a sentence need to travel to reach the embedded words of another sentence. The approach achieves state-of-the-art results for sentence classification when combined with a k-NN classifier (Cover and Hart, 1967). Since their work, other distance metrics have been suggested (Singh et al., 2019; Wang et al., 2019), also motivated by how transportation problems are solved.

Considering that sentences are sets of word vectors, a large variety of methods exist in the literature that can be used to calculate the distance between two sets, in addition to the ones based on transport theory. Thus, as a first contribution, we compare alternative metrics to measure distances between sentences. The metrics we suggest, namely the Hausdorff distance and the Energy distance, are intuitive to explain and reasonably fast to calculate. The choice of these particular distances is motivated by their differing origins and their general usefulness in the respective application domains.

Once calculated, these distances can be used in conjunction with k-nearest neighbours for classification tasks, and k-means for clustering tasks. However, these learning algorithms are rather simplistic, and state-of-the-art machine learning algorithms require a fixed-length feature representation as input. Moreover, having fixed-length representations for sentences (sentence-embeddings) also provides a large degree of flexibility for downstream tasks, as compared to having only relative distances between them. With this as motivation, the second contribution of this work is to produce sentence-embeddings that approximately preserve the topological properties of the original sentence space. We propose to do so using an efficient, scalable manifold-learning algorithm termed UMAP (McInnes et al., 2018) from topological data analysis. Empirical results show that this process yields sentence-embeddings that deliver near state-of-the-art classification performance with a simple classifier.

2 Methodology

2.1 Calculating distances

In this work, we experiment with three different distance measures to determine the distance between sentences. The first measure (Energy distance) is motivated by a useful linkage criterion from hierarchical clustering (Rokach and Maimon, 2005), while the second one (Hausdorff distance) is an important metric from algebraic topology that has been successfully used in document indexing (Tsatsaronis et al., 2012). The final metric (Word Mover's distance) is a recent extension of an existing distance measure between distributions that is particularly suited for use with word-embeddings (Kusner et al., 2015).

Before defining the distances used in this work, we first outline the notation that we will be using to describe them.

2.1.1 Notations

Let $W \in \mathbb{R}^{N \times d}$ denote a word-embedding matrix, such that the vocabulary corresponding to it consists of $N$ words, and each word in it, $w_i \in \mathbb{R}^d$, is $d$-dimensional. This word-embedding matrix and its constituent words may come from pre-trained representations such as Word2Vec (Mikolov et al., 2013) or GloVe (Pennington et al., 2014), in which case $d = 300$.

Let $S$ be a set of sentences and $s, s'$ be two sentences from this set. Each such sentence can be viewed as a set of word-embeddings, $\{w\} \in s$. Additionally, let the length of a sentence, $s$, be denoted as $|s|$, and the cardinality of the set, $S$, be denoted by $|S|$.

Let $e(w_i, w_j)$ denote the distance between two word-embeddings, $w_i, w_j$. In the context of this paper, this distance is Euclidean:

$$e(w_i, w_j) = \|w_i - w_j\|_2 \quad (1)$$

Finally, $D(s, s')$ denotes the distance between two sentences.

2.1.2 Energy distance

Energy distance is a statistical distance between probability distributions, based on the inter- and intra-distribution variance, that satisfies all the criteria of being a metric (Székely and Rizzo, 2013). Using the notation defined earlier, we write it as:

$$D(s, s') = \frac{2}{|s||s'|} \sum_{w_i \in s} \sum_{w_j \in s'} e(w_i, w_j) - \frac{1}{|s|^2} \sum_{w_i \in s} \sum_{w_j \in s} e(w_i, w_j) - \frac{1}{|s'|^2} \sum_{w_i \in s'} \sum_{w_j \in s'} e(w_i, w_j) \quad (2)$$

The original conception of the energy distance was inspired by the gravitational potential energy of celestial objects.
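To make Equation 2 concrete, the following minimal sketch (ours, not the authors' released implementation) transcribes the formula directly, using SciPy's pairwise Euclidean distances for $e(w_i, w_j)$:

```python
import numpy as np
from scipy.spatial.distance import cdist

def energy_distance(s, s_prime):
    """Energy distance of Equation 2.

    `s` and `s_prime` are arrays of shape (|s|, d) and (|s'|, d) holding
    the word-embeddings of two sentences; cdist computes the pairwise
    Euclidean distances e(w_i, w_j) of Equation 1.
    """
    cross = cdist(s, s_prime).mean()    # (1 / |s||s'|) * sum over w_i in s, w_j in s'
    intra_s = cdist(s, s).mean()        # (1 / |s|^2) * sum over pairs within s
    intra_s_prime = cdist(s_prime, s_prime).mean()  # same, within s'
    return 2.0 * cross - intra_s - intra_s_prime

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    s, s_prime = rng.random((5, 300)), rng.random((8, 300))  # two toy "sentences"
    print(energy_distance(s, s_prime))
```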
2.1.4 Word Mover's distance

In addition to the representation of a sentence as a set of word-embeddings, a sentence $s$ can also be represented as an $N$-dimensional normalized term-frequency vector, where $n_i^s$ is the number of times word $w_i$ occurs in sentence $s$ normalized by the total number of words in $s$:

$$n_i^s = \frac{c_i^s}{\sum_{k=1}^{N} c_k^s} \quad (5)$$

where $c_i^s$ is the number of times word $w_i$ appears in sentence $s$.

The goal of the Word Mover's distance (WMD) (Kusner et al., 2015) is to construct a sentence similarity metric based on the distances between the individual words within each sentence, given by Equation 1. In order to calculate the distance between two sentences, WMD introduces a transport matrix, $T \in \mathbb{R}^{N \times N}$, such that each element in it, $T_{ij}$, denotes how much of $n_i^s$ should be transported to $n_j^{s'}$.
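In practice, WMD need not be implemented from scratch; for instance, gensim ships an implementation on top of pre-trained vectors. A usage sketch follows, assuming gensim (and its optimal-transport dependency) is installed and the pre-trained model is available via gensim-data; the example sentence pair is the one popularized by Kusner et al. (2015).

```python
import gensim.downloader as api

# Pre-trained 300-dimensional Word2Vec vectors (d = 300, as in Section 2.1.1).
word_vectors = api.load("word2vec-google-news-300")

s = "obama speaks to the media in illinois".split()
s_prime = "the president greets the press in chicago".split()

# Minimum cumulative distance that the embedded words of s must travel to
# reach the embedded words of s', solved as an optimal-transport problem.
print(word_vectors.wmdistance(s, s_prime))
```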