Unsupervised Learning of Sentence Embeddings Using Compositional N-Gram Features
Matteo Pagliardini* (Iprova SA, Switzerland), Prakhar Gupta* (EPFL, Switzerland), Martin Jaggi (EPFL, Switzerland)
* indicates equal contribution

Abstract

The recent tremendous success of unsupervised word embeddings in a multitude of applications raises the obvious question of whether similar methods could be derived to improve embeddings (i.e., semantic representations) of word sequences as well. We present a simple but efficient unsupervised objective to train distributed representations of sentences. Our method outperforms the state-of-the-art unsupervised models on most benchmark tasks, highlighting the robustness of the produced general-purpose sentence embeddings.

1 Introduction

Improving unsupervised learning is of key importance for advancing machine learning methods, as it unlocks access to almost unlimited amounts of data to be used as training resources. The majority of recent deep learning success stories do not fall into this category but instead rely on supervised training (in particular in the vision domain). A very notable exception comes from the text and natural language processing domain, in the form of semantic word embeddings trained unsupervised (Mikolov et al., 2013b,a; Pennington et al., 2014). Within only a few years of their invention, such word representations, which are based on a simple matrix factorization model as we formalize below, are now routinely trained on very large amounts of raw text data, and have become ubiquitous building blocks of a majority of current state-of-the-art NLP applications.

While very useful semantic representations are available for words, it remains challenging to produce and learn such semantic embeddings for longer pieces of text, such as sentences, paragraphs or entire documents. Even more so, it remains a key goal to learn such general-purpose representations in an unsupervised way.

Currently, two contrary research trends have emerged in text representation learning: on one hand, a strong trend in deep learning for NLP leads towards increasingly powerful and complex models, such as recurrent neural networks (RNNs), LSTMs, attention models and even Neural Turing Machine architectures. While extremely strong in expressiveness, the increased model complexity makes such models much slower to train on larger datasets. On the other end of the spectrum, simpler "shallow" models such as matrix factorizations (or bilinear models) can benefit from training on much larger sets of data, which can be a key advantage, especially in the unsupervised setting.

Surprisingly, for constructing sentence embeddings, naively using averaged word vectors was shown to outperform LSTMs (see Wieting et al. (2016b) for plain averaging, and Arora et al. (2017) for weighted averaging). This example shows the potential of exploiting the trade-off between model complexity and the ability to process huge amounts of text with scalable algorithms, in favor of the simpler side. In view of this trade-off, our work further advances unsupervised learning of sentence embeddings. Our proposed model can be seen as an extension of the C-BOW (Mikolov et al., 2013b,a) training objective to train sentence instead of word embeddings. We demonstrate that the empirical performance of our resulting general-purpose sentence embeddings very significantly exceeds the state of the art, while keeping the model simplicity as well as training and inference complexity exactly as low as in averaging methods (Wieting et al., 2016b; Arora et al., 2017), thereby also putting the work by Arora et al. (2017) in perspective.

Contributions. The main contributions of this work can be summarized as follows:

• Model. We propose Sent2Vec [1], a simple unsupervised model that composes sentence embeddings from word vectors along with n-gram embeddings, simultaneously training the composition and the embedding vectors themselves.

• Efficiency & Scalability. The computational complexity of our embeddings is only O(1) vector operations per word processed, both during training and during inference of the sentence embeddings. This strongly contrasts with all neural network based approaches, and allows our model to learn from extremely large datasets in a streaming fashion, which is a crucial advantage in the unsupervised setting. Fast inference is a key benefit in downstream tasks and industry applications.

• Performance. Our method shows significant performance improvements compared to the current state-of-the-art unsupervised and even semi-supervised models. The resulting general-purpose embeddings show strong robustness when transferred to a wide range of prediction benchmarks.

[1] All our code and pre-trained models will be made publicly available on http://github.com/epfml/sent2vec

2 Model

Our model is inspired by simple matrix factorization models (bilinear models) such as those recently used very successfully in unsupervised learning of word embeddings (Mikolov et al., 2013b,a; Pennington et al., 2014; Bojanowski et al., 2017) as well as in supervised sentence classification (Joulin et al., 2017). More precisely, these models can all be formalized as an optimization problem of the form

$\min_{U,V} \sum_{S \in \mathcal{C}} f_S(U V \iota_S)$   (1)

for two parameter matrices $U \in \mathbb{R}^{k \times h}$ and $V \in \mathbb{R}^{h \times |\mathcal{V}|}$, where $\mathcal{V}$ denotes the vocabulary and $\mathcal{C}$ the training corpus. Here, the columns of the matrix $V$ represent the learnt source word vectors, whereas those of $U$ represent the target word vectors. For a given sentence $S$, which can be of arbitrary length, the indicator vector $\iota_S \in \{0,1\}^{|\mathcal{V}|}$ is a binary vector encoding $S$ (bag-of-words encoding).

Fixed-length context windows $S$ running over the corpus are used in word embedding methods as in C-BOW (Mikolov et al., 2013b,a) and GloVe (Pennington et al., 2014). Here we have $k = |\mathcal{V}|$ and each cost function $f_S : \mathbb{R}^k \to \mathbb{R}$ only depends on a single row of its input, describing the observed target word for the given fixed-length context $S$. In contrast, for sentence embeddings, which are the focus of this paper, $S$ will be entire sentences or documents (and therefore of variable length). This property is shared with the supervised FastText classifier (Joulin et al., 2017), which however uses a softmax with $k \ll |\mathcal{V}|$ being the number of class labels.

2.1 Proposed Unsupervised Model

We propose a new unsupervised model, Sent2Vec, for learning universal sentence embeddings. Conceptually, the model can be interpreted as a natural extension of the word contexts from C-BOW (Mikolov et al., 2013b,a) to a larger sentence context, with the sentence words being specifically optimized towards additive combination over the sentence, by means of the unsupervised objective function.

Formally, we learn a source (or context) embedding $v_w$ and a target embedding $u_w$ for each word $w$ in the vocabulary, with embedding dimension $h$ and $k = |\mathcal{V}|$ as in (1). The sentence embedding is defined as the average of the source word embeddings of its constituent words, as in (2). We furthermore augment this model by also learning source embeddings not only for unigrams but also for the n-grams present in each sentence, averaging the n-gram embeddings along with the words, i.e., the sentence embedding $v_S$ for $S$ is modeled as

$v_S := \frac{1}{|R(S)|} V \iota_{R(S)} = \frac{1}{|R(S)|} \sum_{w \in R(S)} v_w$   (2)

where $R(S)$ is the list of n-grams (including unigrams) present in sentence $S$.
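To make the composition in (2) concrete, the following minimal sketch (a simplified illustration, not the authors' released implementation) assembles a sentence embedding by averaging unigram and bigram source vectors. The whitespace tokenizer, the lookup table src_emb filled with random vectors, and the dimension h = 100 are assumptions for illustration only; in a trained model the source vectors result from optimizing the objective described next.

```python
import numpy as np

h = 100  # assumed embedding dimension
rng = np.random.default_rng(0)

# Hypothetical source-embedding lookup mapping each unigram or n-gram string
# to its vector v_w. Filled lazily with random vectors purely for illustration;
# in a trained model these vectors come from minimizing the training objective.
src_emb = {}

def vector(feature):
    if feature not in src_emb:
        src_emb[feature] = rng.normal(scale=0.1, size=h)
    return src_emb[feature]

def extract_features(tokens, n_max=2):
    """Return R(S): all unigrams plus contiguous n-grams up to length n_max."""
    feats = list(tokens)
    for n in range(2, n_max + 1):
        feats += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return feats

def sentence_embedding(sentence, n_max=2):
    """Average the source embeddings of all features in R(S), as in (2)."""
    tokens = sentence.lower().split()          # assumed whitespace tokenization
    feats = extract_features(tokens, n_max)
    return np.mean([vector(f) for f in feats], axis=0)

v_S = sentence_embedding("the cat sat on the mat")
print(v_S.shape)  # (100,)
```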
In order to predict a missing word from the context, our objective models the softmax output approximated by negative sampling, following Mikolov et al. (2013b). For the large number $|\mathcal{V}|$ of output classes to be predicted, negative sampling is known to significantly improve training efficiency, see also (Goldberg and Levy, 2014). Given the binary logistic loss function $\ell : x \mapsto \log(1 + e^{-x})$ coupled with negative sampling, our unsupervised training objective is formulated as follows:

$\min_{U,V} \sum_{S \in \mathcal{C}} \sum_{w_t \in S} \Big( \ell\big(u_{w_t}^{\top} v_{S \setminus \{w_t\}}\big) + \sum_{w' \in N_{w_t}} \ell\big(-u_{w'}^{\top} v_{S \setminus \{w_t\}}\big) \Big)$

where $S$ corresponds to the current sentence and $N_{w_t}$ is the set of words sampled negatively for the word $w_t \in S$. The negatives are sampled following a multinomial distribution in which each word $w$ is associated with a probability $q_n(w) := \sqrt{f_w} \,\big/ \sum_{w_i \in \mathcal{V}} \sqrt{f_{w_i}}$, where $f_w$ is the normalized frequency of $w$ in the corpus.

To select the possible target unigrams (positives), we use subsampling as in (Joulin et al., 2017; Bojanowski et al., 2017), each word $w$ being discarded with probability $1 - q_p(w)$, where $q_p(w) := \min\{1, \sqrt{t/f_w} + t/f_w\}$ and $t$ is the subsampling hyper-parameter.
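Both distributions can be computed directly from corpus word counts. The sketch below is only an illustration under assumed inputs; the toy counts, the threshold t = 1e-4, and the helper functions are hypothetical and not taken from the paper or its code release.

```python
import numpy as np

# Hypothetical corpus word counts; in practice these come from a full pass
# over the training corpus.
counts = {"the": 500, "cat": 20, "sat": 15, "mat": 5}
total = sum(counts.values())
vocab = list(counts)

# Normalized frequencies f_w.
f = np.array([counts[w] / total for w in vocab])

# Negative-sampling distribution: q_n(w) = sqrt(f_w) / sum_i sqrt(f_{w_i}).
q_n = np.sqrt(f) / np.sqrt(f).sum()

# Probability of keeping a word as a positive target:
# q_p(w) = min(1, sqrt(t / f_w) + t / f_w), with subsampling threshold t.
t = 1e-4  # assumed value of the hyper-parameter
q_p = np.minimum(1.0, np.sqrt(t / f) + t / f)

rng = np.random.default_rng(0)

def sample_negatives(target_idx, k=5):
    """Draw k negatives from q_n, excluding the target itself (one reasonable
    choice for this sketch)."""
    probs = q_n.copy()
    probs[target_idx] = 0.0
    probs /= probs.sum()
    return rng.choice(len(vocab), size=k, p=probs)

def keep_as_positive(word_idx):
    """A word is retained as a training target with probability q_p(w)."""
    return rng.random() < q_p[word_idx]

print(dict(zip(vocab, q_n.round(3))))
print(dict(zip(vocab, q_p.round(3))))
```

In a full training loop, negatives drawn from q_n would feed the second term of the objective above, while q_p decides which target words are kept as positives.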
2.2 Computational Efficiency

Due to the simplicity of the model, parallel training is straightforward using parallelized or distributed SGD. Also, in order to store higher-order n-grams efficiently, we use the standard hashing trick, see e.g. (Weinberger et al., 2009), with the same hashing function as used in FastText (Joulin et al., 2017; Bojanowski et al., 2017).

2.3 Comparison to C-BOW

C-BOW (Mikolov et al., 2013b,a) aims to predict a chosen target word given its fixed-size context window, the context being defined by the average of the vectors associated with the words at a distance less than the window-size hyper-parameter $ws$. While our system, when restricted to unigram features, can be seen as an extension of C-BOW in which the context window includes the entire sentence, in practice there are a few important differences, as C-BOW uses important tricks to facilitate the learning of word embeddings. C-BOW first uses frequent word subsampling on the sentences,