An Unsupervised Sentence Embedding Method by Mutual

An Unsupervised Sentence Embedding Method by Mutual Information Maximization Yan Zhang∗1;2 Ruidan He∗2 Zuozhu Liuy3 Kwan Hui Lim1 Lidong Bing2 1Singapore University of Technology and Design 2DAMO Academy, Alibaba Group 3ZJU-UIUC Institue yan [email protected], [email protected] [email protected], kwanhui [email protected] [email protected] Abstract For example, finding the most similar pair in a col- lection of 10k sentences requires about 50 million BERT is inefficient for sentence-pair tasks 10k such as clustering or semantic search as it ( 2 ) inference computations with BERT, which needs to evaluate combinatorially many sen- requires about 65 hours on a V100 GPU (Reimers tence pairs which is very time-consuming. and Gurevych, 2019). Sentence BERT (SBERT) attempted to solve Much previous work attempted to address this this challenge by learning semantically mean- problem by learning semantically meaningful rep- ingful representations of single sentences, resentations for each sentence, such that similar- such that similarity comparison can be easily ity measures like cosine distance can be easily accessed. However, SBERT is trained on corpus with high-quality labeled sentence pairs, evaluated for sentence-pair regression tasks. The which limits its application to tasks where la- straightforward way to derive a fixed-size sentence beled data is extremely scarce. In this paper, embedding from BERT-based models is to aver- we propose a lightweight extension on top of age the token representations at the last layer or BERT and a novel self-supervised learning ob- using the output of the [CLS] token. Reimers jective based on mutual information maximiza- and Gurevych(2019) showed that both approaches tion strategies to derive meaningful sentence yield rather unsatisfactory sentence embeddings. embeddings in an unsupervised manner. Un- They proposed a model, Sentence-BERT (SBERT), like SBERT, our method is not restricted by the availability of labeled data, such that it can to further fine-tune BERT on natural language in- be applied on different domain-specific cor- ference (NLI) tasks with labeled sentence pairs pus. Experimental results show that the pro- and achieved state-of-the-art performance on many posed method significantly outperforms other semantic textual similarity tasks. However, such unsupervised sentence embedding baselines improvements are induced by high-quality supervi- on common semantic textual similarity (STS) sion, and we find that their performance is degraded tasks and downstream supervised tasks. It where labeled data of the target task is extremely also outperforms SBERT in a setting where in-domain labeled data is not available, and scarce or the distribution of test set differs signifi- achieves performance competitive with super- cantly from the NLI dataset used for training. vised methods on various tasks. Learning sentence representations in an unsupervised manner is a critical step to work with 1 Introduction unlabeled or partially labeled dataset to address BERT-based pretrained language models (Devlin the aforementioned challenge (Kiros et al., 2015; et al.; Liu et al., 2019) have set new state-of- Gan et al., 2017; Hill et al., 2016; Pagliardini et al., the-art performance on various downstream NLP 2017; Yang et al., 2018). A common approach for tasks. However, they are inefficient for sentence- unsupervised sentence representation learning is to pair regression tasks such as clustering or semantic leverage on self-supervision with large unlabeled search because they need to evaluate combinatori- corpus. For example, early methods explored vari- ally many sentence pairs during inference, which ous auto-encoders for sentence embedding (Socher will result in a massive computational overhead. et al., 2011; Hill et al., 2016). Recent work such as skip-thought (Kiros et al., 2015) and FastSent (Hill ∗∗ Equally Contributed. This work was done when Yan Zhang was an intern at DAMO Academy, Alibaba Group. et al., 2016) assumed that a sentence is likely to y Corresponding author. have similar semantics to its context, and designed 1601 Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 1601–1610, November 16–20, 2020. c 2020 Association for Computational Linguistics self-supervised objectives that encourage models 2 Related Work to learn sentence representations by predicting con- 2.1 Sentence Representation Learning textual information. However, the performance of these models is far behind that of supervised learn- Prior approaches for sentence embedding include ing ones on many tasks, which unveils an urgent two main categories: (1) unsupervised sentence need of better unsupervised sentence embedding embedding with unlabeled sentences, and (2) su- methods. pervised learning with labeled sentences, while a few methods might leverage on both of them. In this work, we propose a novel unsupervised Unsupervised Sentence Embedding. There are sentence embedding model with light-weight fea- two main directions to work with unlabeled cor- ture extractor on top of BERT for sentence encod- pus, according to whether the input sentences are ing, and train it with a novel self-supervised learn- ordered or not. In the scenario with unordered sen- ing objective. Our model is not restricted by the tences, the input is usually a single sentence and availability of labeled data and can be applied to models are designated to learn sentence represen- any domain of interest. Instead of simply averag- tations base on the internal structures within each ing BERT token embeddings, we use convolutional sentence, such as recursive auto-encoders (Socher neural network (CNN) layers with mean-over-time et al., 2011), denoising auto-encoders (Hill et al., pooling that transform BERT token embeddings to 2016), and the paragraph vector model (Le and a global sentence embedding (Kim, 2014). More- Mikolov, 2014). Our model follows this setting as over, we propose a novel self-supervised learning well but benefits from the model capacity of BERT objective that maximises the mutual information and knowledge in large pretraining corpus. (MI) between the global sentence embedding and Methods working with ordered sentences utilize all its local contexts embeddings, inspired by re- the distributional hypothesis which assumes that cent advances on unsupervised representation learn- a sentence is likely to have similar semantics to ing for images and graphs (Hjelm et al., 2019; its context. Under this assumption, they formulate Velickovic et al., 2019). Our model is named Info- generative or discriminative tasks that require the Sentence BERT (IS-BERT). In IS-BERT, the rep- models to correctly predict the contextual informa- resentation of a specific sentence is encouraged tion , such as skip-thought (Kiros et al., 2015) and to encode all aspects of its local context informa- FastSent (Hill et al., 2016), or to distinguish tar- tion, using local contexts derived from other in- get sentences from contrastive ones (Jernite et al., put sentences as negative examples for contrastive 2017; Logeswaran and Lee, 2018) for sentence em- learning. This learning procedure encourages the bedding (Jernite et al., 2017; Logeswaran and Lee, encoder to capture the unique information that is 2018). These methods require ordered sentences shared across all local segments of the specific in- or corpus with inter-sentential coherence for train- put sentence while different from other inputs, lead- ing, which limits their applications to domains with ing to more expressive and semantically meaning- only short texts. ful sentence embeddings. Supervised Sentence Embedding. There have also been attempts to use labeled data for sentence We evaluate our method on two groups of tasks embedding. Conneau et al.(2017) proposed the – Semantic Textual Similarity (STS) and SentEval InferSent model that uses labeled data of the Stan- (Conneau and Kiela, 2018). Empirical results show ford Natural Language Inference dataset (SNLI) that IS-BERT significantly outperforms other unsu- (Bowman et al., 2015) and the Multi-Genre NLI pervised baselines on STS and SentEval tasks. In dataset (Williams et al., 2018) to train a BiLSTM addition, we show that IS-BERT substantially out- siamese network for sentence embedding. Uni- performs SBERT in a setting where task-specific versal Sentence Encoder (Cer et al., 2018) uti- labeled data is not available. This demonstrates lized supervised training with SNLI to augment that IS-BERT has the flexibility to be applied to the unsupervised training of a transformer network. new domains without label restriction. Finally, IS- SBERT (Reimers and Gurevych, 2019) also trained BERT can achieve performance competitive with a siamese network on NLI to encode sentences, but or even better than supervised learning methods in it further benefits from the pretraining procedure certain scenarios. of BERT. Though effective, those models could be 1602 problematic to port to new domains where high- 3.1 Problem Formulation quality labeled data is not available, or the text Given a set of input sentences X = distribution is significantly different from the NLI fx1; x2; :::; xng, our goal is to learn a repre- dataset such that knowledge learned from NLI can- d sentation yi 2 R in Y for each sentence xi not be successfully transferred. Addressing this in an unsupervised manner. For simplicity, limitation requires unsupervised methods. we denote this process with a parameterized function EΘ : X −! Y, and

Load more