Arxiv:2010.03652V1 [Cs.CL] 7 Oct 2020 Dings), Which Can Signiﬁcantly Accelerate Inference Applicable to Sentence Encoder Training, Because Speed

Cross-Thought for Sentence Encoder Pre-training Shuohang Wang1 Yuwei Fang1 Siqi Sun1 Zhe Gan1 Yu Cheng1 Jing Jiang2 Jingjing Liu1 1Microsoft Dynamics 365 AI Research, 2Singapore Management University fshuowa, yuwfan, siqi.sun, zhe.gan, yu.cheng, [email protected] [email protected] Abstract In this paper, we propose Cross-Thought, a novel approach to pre-training sequence encoder, which is instrumental in building reusable sequence embeddings for large-scale NLP tasks such as question answering. In- stead of using the original signals of full sentences, we train a Transformer-based sequence encoder over a large set of short sequences, which allows the model to automatically select the most useful information for predict- Figure 1: Example of short sequences that can leverage ing masked words. Experiments on ques- each other for pre-training sentence encoder. tion answering and textual entailment tasks demonstrate that our pre-trained encoder can outperform state-of-the-art encoders trained embeddings to generate the next sentence (Fig- with continuous sentence signals as well as ure2(a)). Inverse Cloze Task (Lee et al., 2019) traditional masked language modeling base- defines some pseudo labels to pre-train a sentence lines. Our proposed approach also achieves encoder (Figure2(b)). However, pseudo labels may new state of the art on HotpotQA (full-wiki setting) by improving intermediate information bear low accuracy, and rich linguistic information retrieval performance.1 that can be well learned in generic language modeling is often lost in these unsupervised methods. 1 Introduction In this paper, we propose a novel unsupervised ap- Encoding sentences into embeddings (Kiros et al., proach that fully exploits the strength of language 2015; Subramanian et al., 2018; Reimers and modeling for sentence encoder pre-training. Gurevych, 2019) is a critical step in many Nat- Popular pre-training tasks such as language mod- ural Language Processing (NLP) tasks. The benefit eling (Peters et al., 2018; Radford et al., 2018), of using sentence embeddings is that the represen- masked language modeling (Devlin et al., 2019; tations of all the encoded sentences can be reused Liu et al., 2019) and sequence generation (Dong on a chunk level (compared to word-level embed- et al., 2019; Lewis et al., 2019) are not directly arXiv:2010.03652v1 [cs.CL] 7 Oct 2020 dings), which can significantly accelerate inference applicable to sentence encoder training, because speed. For example, when used in question answer- only the hidden state of the first token (a special ing (QA), it can significantly shorten inference time token) (Reimers and Gurevych, 2019; Devlin et al., with all the embeddings of candidate paragraphs 2019) can be used as the sentence embedding, but pre-cached into memory and only matched with no loss or gradient is specifically designed for the the question embedding during inference. first special token, which renders sentence embed- There have been several models specifically de- dings learned in such settings contain limited useful signed to pre-train sentence encoders with large- information. scale unlabeled corpus. For example, Skip- Another limitation in existing masked language thought (Kiros et al., 2015) uses encoded sentence modeling methods (Devlin et al., 2019; Liu et al., 1Our code will be released at https://github.com/ 2019) is that they focus on long sequences (512 shuohangwang/Cross-Thought. words), where masked tokens can be recovered by Figure 2: Structures of pre-training models for sentence encoder. (a) is a Seq2Seq model that generates the next sentence based on the embedding of the previous sentence. (b) is the classification/ranking model based on pre- defined pseudo labels. (c) is the structure of our model by using the sentence embedding from other sequences to generate the masked word. considering the context within the same sequence. Our contributions are summarized as follows. (i) This is useful for word dependency learning within We propose the Cross-Thought model to pre-train a sequence, but less effective for sentence embed- a sentence encoder with a novel pre-training task: ding learning. recovering a masked short sequence by taking into consideration the embeddings of surrounding se- In this paper, we propose Cross-Thought, which quences. (ii) Our model can be easily finetuned on segments input text into shorter sequences, where diverse downstream tasks. The attention weights masked words in one sequence are less likely to of the pre-trained cross-sequence Transformers can be recovered based on the current sequence itself, also be directly used for ranking tasks. (iii) Our but more relying on the embeddings of other sur- model achieves the best performance on multiple rounding sequences. For example, in Figure1, the sequence-pair classification and answer-selection masked words “George Washington” and “United tasks, compared to state-of-the-art baselines. In States” in the third sequence can only be cor- addition, it further boosts the recall of informa- rectly predicted by considering the context from tion retrieval (IR) models in open-domain QA task, the first sequence. Thus, instead of performing self- and achieves new state of the art on the HotpotQA attention over all the words in all sentences, our benchmark (full-wiki setting). proposed pre-training method enforces the model to learn from mutually-relevant sequences and automatically select the most relevant neighbors for 2 Related Work masked words recovery. Sequence Encoder Many studies have explored The proposed Cross-Thought architecture is il- different ways to improve sequence embeddings. lustrated in Figure2(c). Specifically, we pre- Huang et al.(2013) proposes deep structured se- append each sequence with multiple special tokens, mantic encoders for web search. Tan et al.(2015) the final hidden states of which are used as the uses LSTM as the encoder for non-factoid an- final sentence embedding. Then, we train multi- swer selection, and Tai et al.(2015) proposes tree- ple cross-sequence Transformers over the hidden LSTM to compute semantic relatedness between states of different special tokens independently, to sentences. Mou et al.(2016) also uses tree-based retrieve relevant sequence embeddings for masked CNN as the encoder for textual entailment tasks. words prediction. After pre-training, the attention Cheng et al.(2016) proposes Long Short-Term weights in the cross-sequence Transformers can be Memory-Networks (LSTMN) for inferring the re- directly applied to downstream tasks (e.g., in QA lation between sentences, and Lin et al.(2017) tasks, similarity scores between question and can- combines LSTM and self-attention mechanism to didate answers can be ranked by their respective improve sentence embeddings. Multi-task learn- sentence embeddings). ing (Subramanian et al., 2018; Cer et al., 2018) has also been applied for training better sentence em- designed to encourage the model to recover masked beddings. Recently, in additional to supervised words based on sentence-level global context from learning, models pre-trained with unsupervised other sequences, instead of word-level local con- methods begin to dominate the field. text within the same sequence (Figure1). There- fore, unlike previous work that segments raw text Pre-training Several methods have been pro- into long sequences and shuffles the sequences for posed to directly pre-train sentence embedding, pre-training, we propose to create shorter text se- such as Skip-thought (Kiros et al., 2015), Fast- quences instead, without shuffling. In this way, Sent (Hill et al., 2016), and Inverse Cloze Task (Lee a shorter sequence may not contain all the neces- et al., 2019). Although these methods can obtain sary information for recovering the masked words, better sentence embeddings in an unsupervised hence requiring the probing into surrounding se- way, they cannot achieve state-of-the-art perfor- quences to capture the missing information. mance in downstream tasks even with further fine- tuning. More recently, Peters et al.(2018) proposes 3.2 Cross-Thought Pre-training to pre-train LSTM with language modeling (LM) The pre-training model is illustrated in Figure task, and Radford et al.(2018) pre-trains Trans- 3(a). As aforementioned, the input of pre- former also with LM. Instead of sequentially gen- training data consists of M continuous sequences erating words in a single direction, Devlin et al. [X ;X ; :::; X ]. Similar to BERT (Devlin (2019) proposes the masked language modeling 0 1 M−1 et al., 2019), we use the hidden state of the spe- task to pre-train bidirectional Transformer. Most cial token as the final sentence embedding. To recently, Guu et al.(2020); Lewis et al.(2020) pro- encode the embeddings with richer semantic infor- pose to jointly train sentence-embedding-based in- mation, we propose to pre-append N special tokens formation retriever and Transformer to re-construct S instead of a single one to each sequence X. documents. However, their methods are usually We first use Transformer to encode the seg- difficult to train with reinforcement learning meth- mented short sequences as follows: ods involved, and need to periodically re-index the whole corpus such as Wikipedia. In this paper, Hm = Transformer([S; Xm]); (1) to pre-train sentence encoder, we propose a new E = H [0:N − 1]; (2) model Cross-Thought to recover the masked in- m m formation across sequences. We make use of the N×d where S 2 R are the embeddings of N spe- heuristics that nearby sequences in the document lm×d cial tokens, and Xm 2 R are the contextual- contain the most important information to recover ized word embeddings of the m-th sequence. lm the masked words. Therefore, the challenging re- is the sequence length and d is the dimension of trieval part can be replaced by soft-attention mech- the embedding. [·; ·] is the concatenation of matri- anism, making our model much easier to train.

Load more