Supervised Learning of Universal Sentence Representations from Natural Language Inference Data

Alexis Conneau, Facebook AI Research, [email protected]
Douwe Kiela, Facebook AI Research, [email protected]
Holger Schwenk, Facebook AI Research, [email protected]

Loïc Barrault, LIUM, Université Le Mans, [email protected]
Antoine Bordes, Facebook AI Research, [email protected]

Abstract

Many modern NLP systems rely on word embeddings, previously trained in an unsupervised manner on large corpora, as base features. Efforts to obtain embeddings for larger chunks of text, such as sentences, have however not been so successful. Several attempts at learning unsupervised representations of sentences have not reached satisfactory enough performance to be widely adopted. In this paper, we show how universal sentence representations trained using the supervised data of the Stanford Natural Language Inference datasets can consistently outperform unsupervised methods like SkipThought vectors (Kiros et al., 2015) on a wide range of transfer tasks. Much like how computer vision uses ImageNet to obtain features, which can then be transferred to other tasks, our work tends to indicate the suitability of natural language inference for transfer learning to other NLP tasks. Our encoder is publicly available.¹

1 Introduction

Distributed representations of words (or word embeddings) (Bengio et al., 2003; Collobert et al., 2011; Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2016) have been shown to provide useful features for various tasks in natural language processing and computer vision. While there seems to be a consensus concerning the usefulness of word embeddings and how to learn them, this is not yet clear with regard to representations that carry the meaning of a full sentence. That is, how to capture the relationships among multiple words and phrases in a single vector remains a question to be solved.

In this paper, we study the task of learning universal representations of sentences, i.e., a sentence encoder model that is trained on a large corpus and subsequently transferred to other tasks. Two questions need to be solved in order to build such an encoder, namely: what is the preferable neural network architecture; and how and on what task should such a network be trained. Following existing work on learning word embeddings, most current approaches consider learning sentence encoders in an unsupervised manner, as in SkipThought (Kiros et al., 2015) or FastSent (Hill et al., 2016). Here, we investigate whether supervised learning can be leveraged instead, taking inspiration from previous results in computer vision, where many models are pretrained on ImageNet (Deng et al., 2009) before being transferred. We compare sentence embeddings trained on various supervised tasks, and show that sentence embeddings generated from models trained on a natural language inference (NLI) task reach the best results in terms of transfer accuracy. We hypothesize that the suitability of NLI as a training task is due to the fact that it is a high-level understanding task that involves reasoning about the semantic relationships within sentences.

Unlike in computer vision, where convolutional neural networks are predominant, there are multiple ways to encode a sentence using neural networks. Hence, we investigate the impact of the sentence encoding architecture on representational transferability, and compare convolutional, recurrent and even simpler word composition schemes. Our experiments show that an encoder based on a bi-directional LSTM architecture with max pooling, trained on the Stanford Natural Language Inference (SNLI) dataset (Bowman et al., 2015), yields state-of-the-art sentence embeddings compared to all existing alternative unsupervised approaches like SkipThought or FastSent, while being much faster to train. We establish this finding on a broad and diverse set of transfer tasks that measures the ability of sentence representations to capture general and useful information.

¹https://www.github.com/facebookresearch/InferSent

2 Related work

Transfer learning using supervised features has been successful in several computer vision applications (Razavian et al., 2014). Striking examples include face recognition (Taigman et al., 2014) and visual question answering (Antol et al., 2015), where image features trained on ImageNet (Deng et al., 2009) and word embeddings trained on large unsupervised corpora are combined.

In contrast, most approaches for sentence representation learning are unsupervised, arguably because the NLP community has not yet found the best supervised task for embedding the semantics of a whole sentence. Another reason is that neural networks are very good at capturing the biases of the task on which they are trained, but can easily forget the overall information or semantics of the input data by specializing too much on these biases. Learning models on large unsupervised tasks makes it harder for the model to specialize. Littwin and Wolf (2016) showed that co-adaptation of encoders and classifiers, when trained end-to-end, can negatively impact the generalization power of image features generated by an encoder. They propose a loss that incorporates multiple orthogonal classifiers to counteract this effect.

Recent work on generating sentence embeddings ranges from models that compose word embeddings (Le and Mikolov, 2014; Arora et al., 2017; Wieting et al., 2016) to more complex neural network architectures. SkipThought vectors (Kiros et al., 2015) propose an objective function that adapts the skip-gram model for words (Mikolov et al., 2013) to the sentence level. By encoding a sentence to predict the sentences around it, and using the features in a linear model, they were able to demonstrate good performance on 8 transfer tasks. They further obtained better results using layer-norm regularization of their model (Ba et al., 2016). Hill et al. (2016) showed that the task on which sentence embeddings are trained significantly impacts their quality. In addition to unsupervised methods, they included supervised training in their comparison, namely on machine translation data (using the WMT'14 English/French and English/German pairs), dictionary definitions and image captioning data from the COCO dataset (Lin et al., 2014). These models obtained significantly lower results compared to the unsupervised SkipThought approach.

Recent work has explored training sentence encoders on the SNLI corpus and applying them to the SICK corpus (Marelli et al., 2014), either using multi-task learning or pretraining (Mou et al., 2016; Bowman et al., 2015). The results were inconclusive and did not reach the same level as simpler approaches that directly learn a classifier on top of unsupervised sentence embeddings (Arora et al., 2017). To our knowledge, this work is the first attempt to fully exploit the SNLI corpus for building generic sentence encoders. As we show in our experiments, we are able to consistently outperform unsupervised approaches, even though our models are trained on much less (but human-annotated) data.

3 Approach

This work combines two research directions, which we describe in what follows. First, we explain how the NLI task can be used to train universal sentence encoding models using the SNLI data. We subsequently describe the architectures that we investigated for the sentence encoder, which, in our opinion, cover a suitable range of sentence encoders currently in use. Specifically, we examine standard recurrent models such as LSTMs and GRUs, for which we investigate mean and max pooling over the hidden representations; a self-attentive network that incorporates different views of the sentence; and a hierarchical convolutional network that can be seen as a tree-based method that blends different levels of abstraction.

3.1 The Natural Language Inference task

The SNLI dataset consists of 570k human-generated English sentence pairs, manually labeled with one of three categories: entailment, contradiction and neutral. It captures natural language inference, also known in previous incarnations as Recognizing Textual Entailment (RTE), and constitutes one of the largest high-quality labeled resources explicitly constructed to require understanding of sentence semantics. We hypothesize that the semantic nature of NLI makes it a good candidate for learning universal sentence embeddings in a supervised way. That is, we aim to demonstrate that sentence encoders trained on natural language inference are able to learn sentence representations that capture universally useful features.

Models can be trained on SNLI in two different ways: (i) sentence encoding-based models that explicitly separate the encoding of the individual sentences, and (ii) joint methods that allow using the encoding of both sentences (to use cross-features or attention from one sentence to the other).

Since our goal is to train a generic sentence encoder, we adopt the first setting. As illustrated in Figure 1, a typical architecture of this kind uses a shared sentence encoder that outputs a representation for the premise u and the hypothesis v. Once the sentence vectors are generated, 3 matching methods are applied to extract relations between u and v: (i) concatenation of the two representations (u, v); (ii) element-wise product u * v; and (iii) absolute element-wise difference |u - v|. The resulting vector, which captures information from both the premise and the hypothesis, is fed into a 3-class classifier consisting of multiple fully-connected layers culminating in a softmax layer.

Figure 1: Generic NLI training scheme. (A shared sentence encoder maps the premise to u and the hypothesis to v; the features (u, v, |u - v|, u * v) pass through fully-connected layers to a 3-way softmax.)
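As an illustration of this training scheme, the following is a minimal PyTorch-style sketch (module and dimension names are ours and purely illustrative) of how the premise and hypothesis representations are combined and classified:

```python
import torch
import torch.nn as nn

class NLINet(nn.Module):
    """Sketch of the generic NLI training scheme of Figure 1.

    `encoder` can be any of the sentence encoders of Section 3.2; the
    dimensions below are placeholders, not the exact paper settings.
    """
    def __init__(self, encoder, enc_dim=4096, fc_dim=512, n_classes=3):
        super().__init__()
        self.encoder = encoder  # shared between premise and hypothesis
        self.classifier = nn.Sequential(
            nn.Linear(4 * enc_dim, fc_dim),  # input is (u, v, |u - v|, u * v)
            nn.Linear(fc_dim, n_classes),    # softmax is applied inside the loss
        )

    def forward(self, premise, hypothesis):
        u = self.encoder(premise)       # (batch, enc_dim)
        v = self.encoder(hypothesis)    # (batch, enc_dim)
        feats = torch.cat([u, v, torch.abs(u - v), u * v], dim=1)
        return self.classifier(feats)   # logits for entailment/neutral/contradiction
```

During training, the returned logits would be passed to a cross-entropy loss, which applies the 3-way softmax.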

3.2 Sentence encoder architectures

A wide variety of neural networks for encoding sentences into fixed-size representations exists, and it is not yet clear which one best captures generically useful information. We compare 7 different architectures: standard recurrent encoders with either Long Short-Term Memory (LSTM) or Gated Recurrent Units (GRU), concatenation of the last hidden states of a forward and a backward GRU, Bi-directional LSTMs (BiLSTM) with either mean or max pooling, a self-attentive network, and a hierarchical convolutional network.

3.2.1 LSTM and GRU

Our first, and simplest, encoders apply recurrent neural networks using either LSTM (Hochreiter and Schmidhuber, 1997) or GRU (Cho et al., 2014) modules, as in sequence to sequence encoders (Sutskever et al., 2014). For a sequence of T words (w_1, ..., w_T), the network computes a set of T hidden representations h_1, ..., h_T, with h_t = \overrightarrow{\mathrm{LSTM}}(w_1, \ldots, w_T) (or using GRU units instead). A sentence is represented by the last hidden vector, h_T.

We also consider a model BiGRU-last that concatenates the last hidden state of a forward GRU and the last hidden state of a backward GRU, so as to have the same architecture as for SkipThought vectors.
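A minimal PyTorch sketch of the BiGRU-last variant described above, operating on pre-computed word embeddings (class name and sizes are illustrative):

```python
import torch
import torch.nn as nn

class BiGRULastEncoder(nn.Module):
    """Concatenate the last hidden states of a forward and a backward GRU."""
    def __init__(self, word_emb_dim=300, enc_dim=2048):
        super().__init__()
        self.gru = nn.GRU(word_emb_dim, enc_dim,
                          bidirectional=True, batch_first=True)

    def forward(self, word_embeddings):
        # word_embeddings: (batch, seq_len, word_emb_dim)
        _, h_n = self.gru(word_embeddings)          # h_n: (2, batch, enc_dim)
        return torch.cat([h_n[0], h_n[1]], dim=1)   # (batch, 2 * enc_dim)
```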

3.2.2 BiLSTM with mean/max pooling

For a sequence of T words {w_t}_{t=1,...,T}, a bidirectional LSTM computes a set of T vectors {h_t}_t. For t in [1, ..., T], h_t is the concatenation of a forward LSTM and a backward LSTM that read the sentence in two opposite directions:

    \overrightarrow{h}_t = \overrightarrow{\mathrm{LSTM}}_t(w_1, \ldots, w_T)
    \overleftarrow{h}_t = \overleftarrow{\mathrm{LSTM}}_t(w_1, \ldots, w_T)
    h_t = [\overrightarrow{h}_t, \overleftarrow{h}_t]

We experiment with two ways of combining the varying number of {h_t}_t to form a fixed-size vector: either by selecting the maximum value over each dimension of the hidden units (max pooling) (Collobert and Weston, 2008), or by considering the average of the representations (mean pooling).

Figure 2: Bi-LSTM max-pooling network. (Forward and backward hidden states are concatenated per word, e.g. for "The movie was great", and max pooling over time yields the sentence representation u.)
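The BiLSTM encoder with max or mean pooling can be sketched as follows (a simplified illustration that ignores padding and masking, which a real implementation must handle; names and sizes are ours):

```python
import torch
import torch.nn as nn

class BiLSTMPoolEncoder(nn.Module):
    """BiLSTM sentence encoder with max or mean pooling over time."""
    def __init__(self, word_emb_dim=300, lstm_dim=2048, pool="max"):
        super().__init__()
        self.lstm = nn.LSTM(word_emb_dim, lstm_dim,
                            bidirectional=True, batch_first=True)
        self.pool = pool

    def forward(self, word_embeddings):
        # word_embeddings: (batch, T, word_emb_dim)
        h, _ = self.lstm(word_embeddings)   # h_t = [fwd_t, bwd_t]: (batch, T, 2 * lstm_dim)
        if self.pool == "max":
            u, _ = torch.max(h, dim=1)      # max over each dimension of the hidden units
        else:
            u = torch.mean(h, dim=1)        # mean pooling
        return u                            # (batch, 2 * lstm_dim)
```

With lstm_dim = 2048, max pooling over the concatenated forward and backward states yields 4096-dimensional embeddings, matching the BiLSTM-Max dimension reported later.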

3.2.3 Self-attentive network

The self-attentive sentence encoder (Liu et al., 2016; Lin et al., 2017) uses an attention mechanism over the hidden states of a BiLSTM to generate a representation u of an input sentence. The attention mechanism is defined as:

    \bar{h}_i = \tanh(W h_i + b_w)
    \alpha_i = \frac{e^{\bar{h}_i^\top u_w}}{\sum_i e^{\bar{h}_i^\top u_w}}
    u = \sum_i \alpha_i h_i

where {h_1, ..., h_T} are the output hidden vectors of a BiLSTM. These are fed to an affine transformation (W, b_w) which outputs a set of keys (\bar{h}_1, ..., \bar{h}_T). The {\alpha_i} represent the score of similarity between the keys and a learned context query vector u_w. These weights are used to produce the final representation u, which is a weighted linear combination of the hidden vectors.

Following Lin et al. (2017), we use a self-attentive network with multiple views of the input sentence, so that the model can learn which part of the sentence is important for the given task. Concretely, we have 4 context vectors u_w^1, u_w^2, u_w^3, u_w^4, which generate 4 representations that are then concatenated to obtain the sentence representation u. Figure 3 illustrates this architecture.

Figure 3: Inner Attention network architecture. (Attention weights α_1, ..., α_4, produced from a context query u_w over the BiLSTM states of "The movie was great", are combined into u.)
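A sketch of the multi-view self-attentive encoder (again ignoring masking of padded positions; the parameter names are ours):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentiveEncoder(nn.Module):
    """Attention over BiLSTM states, with one learned context query per view."""
    def __init__(self, word_emb_dim=300, lstm_dim=2048, n_views=4):
        super().__init__()
        self.lstm = nn.LSTM(word_emb_dim, lstm_dim,
                            bidirectional=True, batch_first=True)
        self.key_proj = nn.Linear(2 * lstm_dim, 2 * lstm_dim)  # affine map (W, b_w)
        self.queries = nn.Parameter(torch.randn(n_views, 2 * lstm_dim))  # u_w^1..u_w^4

    def forward(self, word_embeddings):
        h, _ = self.lstm(word_embeddings)              # (batch, T, 2 * lstm_dim)
        keys = torch.tanh(self.key_proj(h))            # h_bar_i
        scores = torch.einsum("btd,vd->bvt", keys, self.queries)
        alphas = F.softmax(scores, dim=-1)             # one alpha distribution per view
        views = torch.einsum("bvt,btd->bvd", alphas, h)  # weighted sums of hidden states
        return views.flatten(1)                        # concatenation of the 4 views
```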

3.2.4 Hierarchical ConvNet

One of the currently best performing models on classification tasks is a convolutional architecture termed AdaSent (Zhao et al., 2015), which concatenates different representations of the sentence at different levels of abstraction. Inspired by this architecture, we introduce a faster version consisting of 4 convolutional layers. At every layer, a representation u_i is computed by a max-pooling operation over the feature maps (see Figure 4).

The final representation u = [u_1, u_2, u_3, u_4] concatenates representations at different levels of the input sentence. The model thus captures hierarchical abstractions of an input sentence in a fixed-size representation.

Figure 4: Hierarchical ConvNet architecture. (Four stacked convolutional layers over "This is the greatest movie of all time"; the max-pooled outputs u_1, ..., u_4 of each layer are concatenated.)
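A possible sketch of this hierarchical convolutional encoder; the number of filters, kernel size and ReLU non-linearity are placeholder choices, since the text above does not specify them:

```python
import torch
import torch.nn as nn

class HierarchicalConvNetEncoder(nn.Module):
    """4 convolutional layers; max-pool the feature maps at every depth
    and concatenate u = [u1, u2, u3, u4]."""
    def __init__(self, word_emb_dim=300, n_filters=512, kernel=3):
        super().__init__()
        dims = [word_emb_dim] + [n_filters] * 4
        self.convs = nn.ModuleList(
            nn.Conv1d(dims[i], dims[i + 1], kernel, padding=kernel // 2)
            for i in range(4)
        )

    def forward(self, word_embeddings):
        x = word_embeddings.transpose(1, 2)      # (batch, dim, T) for Conv1d
        pooled = []
        for conv in self.convs:
            x = torch.relu(conv(x))
            pooled.append(x.max(dim=2).values)   # u_i: max over time at this depth
        return torch.cat(pooled, dim=1)          # u = [u1, u2, u3, u4]
```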

3.3 Training details

For all our models trained on SNLI, we use SGD with a learning rate of 0.1 and a weight decay of 0.99. At each epoch, we divide the learning rate by 5 if the dev accuracy decreases. We use mini-batches of size 64, and training is stopped when the learning rate goes under the threshold of 10^-5. For the classifier, we use a multi-layer perceptron with 1 hidden layer of 512 hidden units. We use open-source GloVe vectors trained on Common Crawl 840B with 300 dimensions as fixed word embeddings.²

²https://nlp.stanford.edu/projects/glove/
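One possible reading of this schedule as a training-loop sketch; `train_one_epoch` and `evaluate` are hypothetical callbacks supplied by the caller, and the 0.99 decay is interpreted here as a per-epoch multiplicative learning-rate decay:

```python
import torch

def train_snli(model, train_one_epoch, evaluate, lr=0.1, lr_decay=0.99,
               lr_shrink=5.0, min_lr=1e-5, batch_size=64):
    """Sketch of the SGD schedule described above (assumptions noted in the text)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    best_dev_acc = 0.0
    while lr > min_lr:                 # stop when lr goes under the 1e-5 threshold
        train_one_epoch(model, optimizer, batch_size)
        dev_acc = evaluate(model)
        lr *= lr_decay                 # decay applied at each epoch
        if dev_acc < best_dev_acc:
            lr /= lr_shrink            # divide the learning rate by 5 if dev accuracy drops
        best_dev_acc = max(best_dev_acc, dev_acc)
        for group in optimizer.param_groups:
            group["lr"] = lr
    return model
```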

4 Evaluation of sentence representations

Our aim is to obtain general-purpose sentence embeddings that capture generic information that is useful for a broad set of tasks. To evaluate the quality of these representations, we use them as features in 12 transfer tasks. We present our sentence-embedding evaluation procedure in this section. We constructed a sentence evaluation tool³ to automate evaluation on all the tasks mentioned in this paper. The tool uses Adam (Kingma and Ba, 2014) to fit a classifier, with batch size 64.

³https://www.github.com/facebookresearch/SentEval

name  N    task                      C  examples
MR    11k  sentiment (movies)        2  "Too slow for a younger crowd, too shallow for an older one." (neg)
CR    4k   product reviews           2  "We tried it out christmas night and it worked great." (pos)
SUBJ  10k  subjectivity/objectivity  2  "A movie that doesn't aim too high, but doesn't need to." (subj)
MPQA  11k  opinion polarity          2  "don't want"; "would like to tell"; (neg, pos)
TREC  6k   question-type             6  "What are the twin cities?" (LOC:city)
SST   70k  sentiment (movies)        2  "Audrey Tautou has a knack for picking roles that magnify her [..]" (pos)

Table 1: Classification tasks. C is the number of classes and N is the number of samples.
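The transfer protocol (frozen sentence embeddings with a simple classifier fitted on top) can be sketched as follows; this sketch uses scikit-learn's logistic regression with an illustrative C grid rather than the paper's Adam-based tool, and cross-validation stands in for tuning on the dedicated validation set:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

def eval_transfer_task(embed, train_sents, train_labels, test_sents, test_labels):
    """Fit a linear classifier on frozen sentence embeddings (sketch).

    `embed` is a hypothetical function mapping a list of sentences to an
    (n, d) feature matrix; the L2 penalty is tuned by grid search.
    """
    X_train, X_test = embed(train_sents), embed(test_sents)
    grid = GridSearchCV(LogisticRegression(max_iter=1000),
                        {"C": [2 ** k for k in range(-2, 6)]}, cv=5)
    grid.fit(X_train, train_labels)
    return grid.score(X_test, test_labels)   # test accuracy
```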

Binary and multi-class classification

We use a set of binary classification tasks (see Table 1) that covers various types of sentence classification, including sentiment analysis (MR, SST), question-type (TREC), product reviews (CR), subjectivity/objectivity (SUBJ) and opinion polarity (MPQA). We generate sentence vectors and train a logistic regression on top. A linear classifier requires fewer parameters than an MLP and is thus suitable for small datasets, where transfer learning is especially well-suited. We tune the L2 penalty of the logistic regression with grid-search on the validation set.

Entailment and semantic relatedness

We also evaluate on the SICK dataset for both entailment (SICK-E) and semantic relatedness (SICK-R). We use the same matching methods as in SNLI and learn a logistic regression on top of the joint representation. For semantic relatedness evaluation, we follow the approach of Tai et al. (2015) and learn to predict the probability distribution of relatedness scores. We report Pearson correlation.

STS14 - Semantic Textual Similarity

While semantic relatedness is supervised in the case of SICK-R, we also evaluate our embeddings on the 6 unsupervised SemEval tasks of STS14 (Agirre et al., 2014). This dataset includes subsets of news articles, forum discussions, image descriptions and headlines from news articles containing pairs of sentences (lower-cased), labeled with a similarity score between 0 and 5. These tasks evaluate how the cosine distance between two sentences correlates with a human-labeled similarity score through Pearson and Spearman correlations.

Paraphrase detection

The Microsoft Research Paraphrase Corpus is composed of pairs of sentences which have been extracted from news sources on the Web. Sentence pairs have been human-annotated according to whether they capture a paraphrase/semantic equivalence relationship. We use the same approach as with SICK-E, except that our classifier has only 2 classes.

Caption-Image retrieval

The caption-image retrieval task evaluates joint image and language feature models (Hodosh et al., 2013; Lin et al., 2014). The goal is either to rank a large collection of images by their relevance with respect to a given query caption (Image Retrieval), or to rank captions by their relevance for a given query image (Caption Retrieval). We use a pairwise ranking loss L_cir(x, y):

    L_{cir}(x, y) = \sum_{y} \sum_{k} \max(0, \alpha - s(Vy, Ux) + s(Vy, Ux_k))
                  + \sum_{x} \sum_{k'} \max(0, \alpha - s(Ux, Vy) + s(Ux, Vy_{k'}))

where (x, y) consists of an image y with one of its associated captions x, (y_k)_k and (y_{k'})_{k'} are negative examples of the ranking loss, α is the margin and s corresponds to the cosine similarity. U and V are learned linear transformations that project the caption x and the image y to the same embedding space. We use a margin α = 0.2 and 30 contrastive terms. We use the same splits as in Karpathy and Fei-Fei (2015), i.e., we use 113k images from the COCO dataset (each containing 5 captions) for training, 5k images for validation and 5k images for test.
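A sketch of this ranking loss using all in-batch negatives (the paper uses 30 contrastive terms; tensor names are ours):

```python
import torch
import torch.nn.functional as F

def ranking_loss(Ux, Vy, alpha=0.2):
    """Pairwise ranking loss L_cir for caption-image retrieval (sketch).

    Ux: (n, d) projected captions, Vy: (n, d) projected images, where row i
    of each is a matching pair; the other rows in the batch act as the
    contrastive (negative) terms. s is cosine similarity.
    """
    Ux = F.normalize(Ux, dim=1)
    Vy = F.normalize(Vy, dim=1)
    s = Vy @ Ux.t()                                  # s[i, j] = cosine(Vy_i, Ux_j)
    pos = s.diag()                                   # similarities of matching pairs
    cost_caption = (alpha - pos.unsqueeze(1) + s).clamp(min=0)  # rank captions per image
    cost_image = (alpha - pos.unsqueeze(0) + s).clamp(min=0)    # rank images per caption
    mask = 1.0 - torch.eye(s.size(0))                # exclude the positive pair itself
    return ((cost_caption + cost_image) * mask).sum()
```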
For evaluation, we split the 5k test images into 5 random sets of 1k images on which we compute Recall@K, with K in {1, 5, 10}, and median rank (Med r) over the 5 splits. For fair comparison, we also report SkipThought results in our setting, using 2048-dimensional pretrained ResNet-101 (He et al., 2016) with 113k training images.

name    task  N     premise                                                  hypothesis                                              label
SNLI    NLI   560k  "Two women are embracing while holding to go packages."  "Two woman are holding packages."                       entailment
SICK-E  NLI   10k   "A man is typing on a machine used for stenography"      "The man isn't operating a stenograph"                  contradiction
SICK-R  STS   10k   "A man is singing a song and playing the guitar"         "A man is opening a package that contains headphones"   1.6
STS14   STS   4.5k  "Liquid ammonia leak kills 15 in Shanghai"               "Liquid ammonia leak kills at least 15 in Shanghai"     4.6

Table 2: Natural Language Inference and Semantic Textual Similarity tasks. NLI labels are contradiction, neutral and entailment. STS labels are scores between 0 and 5.
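For reference, a small sketch of how Recall@K and the median rank can be computed on one split (variable names are ours):

```python
import numpy as np

def recall_at_k_and_medr(sims, ks=(1, 5, 10)):
    """Retrieval metrics sketch. sims[i, j] is the similarity between query i
    and candidate j, with candidate i being the correct match for query i."""
    order = np.argsort(-sims, axis=1)                  # best candidates first
    ranks = np.array([np.where(order[i] == i)[0][0] for i in range(len(sims))])
    recalls = {k: float(np.mean(ranks < k)) for k in ks}
    med_r = float(np.median(ranks)) + 1                # ranks above are 0-based
    return recalls, med_r
```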

5 Empirical results

In this section, we refer to "micro" and "macro" averages of development set (dev) results on transfer tasks whose metric is accuracy: we compute a "macro" aggregated score that corresponds to the classical average of dev accuracies, and a "micro" score that is a sum of the dev accuracies, weighted by the number of dev samples.

5.1 Architecture impact

Model

We observe in Table 3 that different models trained on the same NLI corpus lead to different transfer task results. The BiLSTM-4096 with the max-pooling operation performs best on both SNLI and transfer tasks. Looking at the micro and macro averages, we see that it performs significantly better than the other models (LSTM, GRU, BiGRU-last, BiLSTM-Mean, inner-attention and the hierarchical ConvNet).

Table 3 also shows that better performance on the training task does not necessarily translate into better results on the transfer tasks, for instance when comparing inner-attention and BiLSTM-Mean.

We hypothesize that some models are likely to over-specialize and adapt too well to the biases of a dataset without capturing general-purpose information of the input sentence. For example, the inner-attention model has the ability to focus only on certain parts of a sentence that are useful for the SNLI task, but not necessarily for the transfer tasks. On the other hand, BiLSTM-Mean does not make sharp choices on which part of the sentence is more important than others. The difference between the results seems to come from the different abilities of the models to incorporate general information while not focusing too much on specific features useful for the task at hand.

For a given model, the transfer quality is also sensitive to the optimization algorithm: when training with Adam instead of SGD, we observed that the BiLSTM-max converged faster on SNLI (5 epochs instead of 10), but obtained worse results on the transfer tasks, most likely because of the model and classifier's increased capability to over-specialize on the training task.

Model            dim   NLI dev  NLI test  Transfer micro  Transfer macro
LSTM             2048  81.9     80.7      79.5            78.6
GRU              4096  82.4     81.8      81.7            80.9
BiGRU-last       4096  81.3     80.9      82.9            81.7
BiLSTM-Mean      4096  79.0     78.2      83.1            81.7
Inner-attention  4096  82.3     82.5      82.1            81.0
HConvNet         4096  83.7     83.4      82.0            80.9
BiLSTM-Max       4096  85.0     84.5      85.2            83.7

Table 3: Performance of sentence encoder architectures on SNLI and (aggregated) transfer tasks. Dimensions of embeddings were selected according to best aggregated scores (see Figure 5).

Figure 5: Transfer performance w.r.t. embedding size using the micro aggregation method.

Embedding size

Figure 5 compares the overall performance of different architectures, showing the evolution of micro-averaged performance with regard to the embedding size. Since it is easier to linearly separate in high dimension, especially with logistic regression, it is not surprising that increased embedding sizes lead to increased performance for almost all models. However, this is particularly true for some models (BiLSTM-Max, HConvNet, inner-att), which demonstrate unequal abilities to incorporate more information as the size grows. We hypothesize that such networks are able to incorporate information that is not directly relevant to the objective task (results on SNLI are relatively stable with regard to embedding size) but that can nevertheless be useful as features for transfer tasks.
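A small sketch of the micro and macro aggregation used throughout this section (the micro score is computed here as a normalized weighted sum):

```python
import numpy as np

def aggregate_dev_accuracies(accuracies, n_dev_samples):
    """Macro: plain average of dev accuracies; micro: average weighted by
    the number of dev samples of each transfer task (sketch)."""
    acc = np.asarray(accuracies, dtype=float)
    n = np.asarray(n_dev_samples, dtype=float)
    macro = acc.mean()
    micro = (acc * n).sum() / n.sum()
    return micro, macro
```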

5.2 Task transfer

We report in Table 4 transfer task results for different architectures trained in different ways. We group models by the nature of the data on which they were trained. The first group corresponds to models trained with unsupervised unordered sentences. This includes bag-of-words models such as word2vec-SkipGram, the Unigram-TFIDF model, the Paragraph Vector model (Le and Mikolov, 2014), the Sequential Denoising Auto-Encoder (SDAE) (Hill et al., 2016) and the SIF model (Arora et al., 2017), all trained on the Toronto book corpus (Zhu et al., 2015). The second group consists of models trained with unsupervised ordered sentences, such as FastSent and SkipThought (also trained on the Toronto book corpus). We also include the FastSent variant "FastSent+AE" and the SkipThought-LN version that uses layer normalization. We report results from models trained on supervised data in the third group, and also report some results of supervised methods trained directly on each task for comparison with transfer learning approaches.

Model                        MR    CR    SUBJ  MPQA  SST   TREC  MRPC       SICK-R  SICK-E  STS14
Unsupervised representation training (unordered sentences)
Unigram-TFIDF                73.7  79.2  90.3  82.4  -     85.0  73.6/81.7  -       -       .58/.57
ParagraphVec (DBOW)          60.2  66.9  76.3  70.7  -     59.4  72.9/81.1  -       -       .42/.43
SDAE                         74.6  78.0  90.8  86.9  -     78.4  73.7/80.7  -       -       .37/.38
SIF (GloVe + WR)             -     -     -     -     82.2  -     -          -       84.6    .69/-
word2vec BOW†                77.7  79.8  90.9  88.3  79.7  83.6  72.5/81.4  0.803   78.7    .65/.64
fastText BOW†                78.3  81.0  92.4  87.8  81.9  84.8  73.9/82.0  0.815   78.3    .63/.62
GloVe BOW†                   78.7  78.5  91.6  87.6  79.8  83.6  72.1/80.9  0.800   78.6    .54/.56
GloVe Positional Encoding†   78.3  77.4  91.1  87.1  80.6  83.3  72.5/81.2  0.799   77.9    .51/.54
BiLSTM-Max (untrained)†      77.5  81.3  89.6  88.7  80.7  85.8  73.2/81.6  0.860   83.4    .39/.48
Unsupervised representation training (ordered sentences)
FastSent                     70.8  78.4  88.7  80.6  -     76.8  72.2/80.3  -       -       .63/.64
FastSent+AE                  71.8  76.7  88.8  81.5  -     80.4  71.2/79.1  -       -       .62/.62
SkipThought                  76.5  80.1  93.6  87.1  82.0  92.2  73.0/82.0  0.858   82.3    .29/.35
SkipThought-LN               79.4  83.1  93.7  89.3  82.9  88.4  -          0.858   79.5    .44/.45
Supervised representation training
CaptionRep (bow)             61.9  69.3  77.4  70.8  -     72.2  73.6/81.9  -       -       .46/.42
DictRep (bow)                76.7  78.7  90.7  87.2  -     81.0  68.4/76.8  -       -       .67/.70
NMT En-to-Fr                 64.7  70.1  84.9  81.5  -     82.8  69.1/77.1  -       -       .43/.42
Paragram-phrase              -     -     -     -     79.7  -     -          0.849   83.1    .71/-
BiLSTM-Max (on SST)†         (*)   83.7  90.2  89.5  (*)   86.0  72.7/80.9  0.863   83.1    .55/.54
BiLSTM-Max (on SNLI)†        79.9  84.6  92.1  89.8  83.3  88.7  75.1/82.3  0.885   86.3    .68/.65
BiLSTM-Max (on AllNLI)†      81.1  86.3  92.4  90.2  84.6  88.2  76.2/83.1  0.884   86.3    .70/.67
Supervised methods (directly trained for each task – no transfer)
Naive Bayes - SVM            79.4  81.8  93.2  86.3  83.1  -     -          -       -       -
AdaSent                      83.1  86.3  95.5  93.3  -     92.4  -          -       -       -
TF-KLD                       -     -     -     -     -     -     80.4/85.9  -       -       -
Illinois-LH                  -     -     -     -     -     -     -          -       84.5    -
Dependency Tree-LSTM         -     -     -     -     -     -     -          0.868   -       -

Table 4: Transfer test results for various architectures trained in different ways. Underlined are best results for transfer learning approaches, in bold are best results among the models trained in the same way. † indicates methods that we trained; other transfer models have been extracted from Hill et al. (2016). For best published supervised methods (no transfer), we consider AdaSent (Zhao et al., 2015), TF-KLD (Ji and Eisenstein, 2013), Tree-LSTM (Tai et al., 2015) and the Illinois-LH system (Lai and Hockenmaier, 2014). (*) Our model trained on SST obtained 83.4 for MR and 86.0 for SST (MR and SST come from the same source), which we do not put in the tables for fair comparison with transfer methods.

Comparison with SkipThought

The best performing sentence encoder to date is the SkipThought-LN model, which was trained on a very large corpus of ordered sentences. With much less data (570k compared to 64M sentences) but with high-quality supervision from the SNLI dataset, we are able to consistently outperform the results obtained by SkipThought vectors. We train our model in less than a day on a single GPU, compared to the best SkipThought-LN network, which was trained for a month. Our BiLSTM-max trained on SNLI performs much better than released SkipThought vectors on MR, CR, MPQA, SST, MRPC-accuracy, SICK-R, SICK-E and STS14 (see Table 4). Except for the SUBJ dataset, it also performs better than SkipThought-LN on MR, CR and MPQA. We also observe by looking at the STS14 results that the cosine metric in our embedding space is much more semantically informative than in the SkipThought embedding space (Pearson score of 0.68 compared to 0.29 and 0.44 for ST and ST-LN). We hypothesize that this is mainly linked to the matching method of SNLI models, which incorporates a notion of distance (element-wise product and absolute difference) during training.

NLI as a supervised training set

Our findings indicate that our model trained on SNLI obtains much better overall results than models trained on other supervised tasks such as COCO, dictionary definitions, NMT, PPDB (Ganitkevitch et al., 2013) and SST. For SST, we tried exactly the same models as for SNLI; it is worth noting that SST is smaller than NLI. Our representations constitute higher-quality features for both classification and similarity tasks. One explanation is that the natural language inference task constrains the model to encode the semantic information of the input sentence, and that the information required to perform NLI is generally discriminative and informative.

Domain adaptation on SICK tasks

Our transfer learning approach obtains better results than the previous state-of-the-art on the SICK task, which can be seen as an out-domain version of SNLI, for both entailment and relatedness. We obtain a Pearson score of 0.885 on SICK-R, while Tai et al. (2015) obtained 0.868, and we obtain 86.3% test accuracy on SICK-E, while the previous best hand-engineered model (Lai and Hockenmaier, 2014) obtained 84.5%. We also significantly outperformed previous transfer learning approaches on SICK-E (Bowman et al., 2015) that used the parameters of an LSTM model trained on SNLI to fine-tune on SICK (80.8% accuracy). We hypothesize that our embeddings already contain the information learned from the in-domain task, and that learning only the classifier limits the number of parameters learned on the small out-domain task.

Image-caption retrieval results

In Table 5, we report results for the COCO image-caption retrieval task. We report the mean recalls of 5 random splits of 1k test images. When trained with ResNet features and 30k more training data, the SkipThought vectors perform significantly better than in the original setting, going from 33.8 to 37.9 for caption retrieval R@1, and from 25.9 to 30.6 on image retrieval R@1. Our approach pushes the results even further, from 37.9 to 42.4 on caption retrieval, and 30.6 to 33.2 on image retrieval. These results are comparable to the previous approach of Ma et al. (2015), which did not do transfer but directly learned the sentence encoding on the image-caption retrieval task. This supports the claim that pre-trained representations such as ResNet image features and our sentence embeddings can achieve competitive results compared to features learned directly on the objective task.

                                           Caption Retrieval           Image Retrieval
Model                                      R@1   R@5   R@10  Med r     R@1   R@5   R@10  Med r
Direct supervision of sentence representations
m-CNN (Ma et al., 2015)                    38.3  -     81.0  2         27.4  -     79.5  3
m-CNN_ENS (Ma et al., 2015)                42.8  -     84.1  2         32.6  -     82.8  3
Order-embeddings (Vendrov et al., 2016)    46.7  -     88.9  2         37.9  -     85.9  2
Pre-trained sentence representations
SkipThought + VGG19 (82k)                  33.8  67.7  82.1  3         25.9  60.0  74.6  4
SkipThought + ResNet101 (113k)             37.9  72.2  84.3  2         30.6  66.2  81.0  3
BiLSTM-Max (on SNLI) + ResNet101 (113k)    42.4  76.1  87.0  2         33.2  69.7  83.6  3
BiLSTM-Max (on AllNLI) + ResNet101 (113k)  42.6  75.3  87.3  2         33.9  69.7  83.8  3

Table 5: COCO retrieval results. SkipThought is trained either using 82k training samples with VGG19 features, or with 113k samples and ResNet-101 features (our setting). We report the average results on 5 splits of 1k test images.
MultiGenre NLI

The MultiNLI corpus (Williams et al., 2017) was recently released as a multi-genre version of SNLI. With 433k sentence pairs, MultiNLI improves upon SNLI in its coverage: it contains ten distinct genres of written and spoken English, covering most of the complexity of the language. We augment Table 4 with our model trained on both SNLI and MultiNLI (AllNLI). We observe a significant boost in performance overall compared to the model trained only on SNLI. Our model even reaches AdaSent performance on CR, suggesting that having a larger coverage for the training task helps learn even better general representations. On semantic textual similarity (STS14), we are also competitive with the PPDB-based paragram-phrase embeddings, with a Pearson score of 0.70. Interestingly, on caption-related transfer tasks such as the COCO image caption retrieval task, training our sentence encoder on other genres from MultiNLI does not degrade the performance compared to the model trained only on SNLI (which contains mostly captions), which confirms the generalization power of our embeddings.

6 Conclusion

This paper studies the effects of training sentence embeddings with supervised data by testing on 12 different transfer tasks. We showed that models learned on NLI can perform better than models trained in unsupervised conditions or on other supervised tasks. By exploring various architectures, we showed that a BiLSTM network with max pooling provides the best current universal sentence encoding method, outperforming existing approaches like SkipThought vectors.

We believe that this work only scratches the surface of possible combinations of models and tasks for learning generic sentence embeddings. Larger datasets that rely on natural language understanding for sentences could bring sentence embedding quality to the next level.

References

Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2014. SemEval-2014 task 10: Multilingual semantic textual similarity. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 81–91, Dublin, Ireland. Association for Computational Linguistics and Dublin City University.

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433.

Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A simple but tough-to-beat baseline for sentence embeddings. In Proceedings of the 5th International Conference on Learning Representations (ICLR).

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. In Advances in Neural Information Processing Systems (NIPS).

Yoshua Bengio, Réjean Ducharme, and Pascal Vincent. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. In Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-8).

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR 2009), pages 248–255. IEEE.

Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The paraphrase database. In Proceedings of NAACL-HLT, pages 758–764, Atlanta, Georgia. Association for Computational Linguistics.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition (CVPR). IEEE.

Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. Learning distributed representations of sentences from unlabelled data. arXiv preprint arXiv:1602.03483.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
Micah Hodosh, Peter Young, and Julia Hockenmaier. 2013. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 47:853–899.

Yangfeng Ji and Jacob Eisenstein. 2013. Discriminative improvements to distributional sentence similarity. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR).

Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems (NIPS), pages 3294–3302.

Alice Lai and Julia Hockenmaier. 2014. Illinois-LH: A denotational and distributional approach to semantics. Proc. SemEval, 2:5.

Quoc V. Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning, volume 14, pages 1188–1196.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer International Publishing.

Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. In International Conference on Learning Representations (ICLR).

Etai Littwin and Lior Wolf. 2016. The multiverse loss for robust transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3957–3966.

Yang Liu, Chengjie Sun, Lei Lin, and Xiaolong Wang. 2016. Learning natural language inference using bidirectional LSTM model and inner-attention. arXiv preprint arXiv:1605.09090.

Lin Ma, Zhengdong Lu, Lifeng Shang, and Hang Li. 2015. Multimodal convolutional neural networks for matching image and sentence. In Proceedings of the IEEE International Conference on Computer Vision, pages 2623–2631.

Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. 2014. A SICK cure for the evaluation of compositional distributional semantic models. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC), pages 216–223.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NIPS), pages 3111–3119.

Lili Mou, Zhao Meng, Rui Yan, Ge Li, Yan Xu, Lu Zhang, and Zhi Jin. 2016. How transferable are neural networks in NLP applications? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), volume 14, pages 1532–1543.

Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. 2014. CNN features off-the-shelf: An astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 806–813.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL).

Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, and Lior Wolf. 2014. DeepFace: Closing the gap to human-level performance in face verification. In Conference on Computer Vision and Pattern Recognition (CVPR). IEEE.

Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun. 2016. Order-embeddings of images and language. In International Conference on Learning Representations (ICLR).

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2016. Towards universal paraphrastic sentence embeddings. In International Conference on Learning Representations (ICLR).

Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2017. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.
Han Zhao, Zhengdong Lu, and Pascal Poupart. 2015. Self-adaptive hierarchical sentence model. In Proceedings of the 24th International Conference on Artificial Intelligence (IJCAI'15), pages 4069–4076. AAAI Press.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19–27.

Appendix

Max-pooling visualization for BiLSTM-max trained and untrained

Our representations were trained to focus on parts of a sentence such that a classifier can easily tell the difference between contradictory, neutral or entailed sentences. In Table 8 and Table 9, we investigate how the max-pooling operation selects the information from the hidden states of the BiLSTM, for our trained and untrained BiLSTM-max models (for both models, word embeddings are initialized with GloVe vectors).

For each time step t, we report the number of times the max-pooling operation selected the hidden state h_t (which can be seen as a sentence representation centered around word w_t). Without any training, the max-pooling is rather even across hidden states, although it seems to focus consistently more on the first and last hidden states. When trained, the model learns to focus on specific words that carry most of the meaning of the sentence without any explicit attention mechanism. Note that each hidden state also incorporates information from the sentence at different levels, explaining why the trained model also incorporates information from all hidden states.

Figure 7: Pair of entailed sentences A: Visualization of max-pooling for BiLSTM-max 4096 trained on NLI.
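A sketch of how such per-time-step selection counts can be computed from the BiLSTM hidden states of one sentence (function name is ours):

```python
import torch

def maxpool_selection_counts(h):
    """Count, for each time step t, how many embedding dimensions the
    max-pooling operation takes from hidden state h_t (sketch).

    h: (T, d) BiLSTM hidden states for a single sentence.
    """
    argmax_t = h.argmax(dim=0)                            # winning time step per dimension
    return torch.bincount(argmax_t, minlength=h.size(0))  # selections per word
```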

Figure 8: Pair of entailed sentences B: Visualization of max-pooling for BiLSTM-max 4096 untrained.

Figure 6: Pair of entailed sentences A: Visualization of max-pooling for BiLSTM-max 4096 untrained.

Figure 9: Pair of entailed sentences B: Visualization of max-pooling for BiLSTM-max 4096 trained on NLI.