Segatron: Segment-Aware Transformer for Language Modeling and Understanding
He Bai,1 Peng Shi,1 Jimmy Lin,1,2 Yuqing Xie,1 Luchen Tan,2 Kun Xiong,2 Wen Gao,3 Ming Li1,2
1 David R. Cheriton School of Computer Science, University of Waterloo  2 RSVP.ai  3 School of Electronics Engineering and Computer Science, Peking University
{he.bai, peng.shi, jimmylin, yuqing.xie, mli}@uwaterloo.ca, {lctan, kun}@rsvp.ai, [email protected]

Abstract

Transformers are powerful for sequence modeling. Nearly all state-of-the-art language models and pre-trained language models are based on the Transformer architecture. However, it distinguishes sequential tokens only with the token position index. We hypothesize that better contextual representations can be generated from the Transformer with richer positional information. To verify this, we propose a segment-aware Transformer (Segatron), by replacing the original token position encoding with a combined position encoding of paragraph, sentence, and token. We first introduce the segment-aware mechanism to Transformer-XL, which is a popular Transformer-based language model with memory extension and relative position encoding. We find that our method can further improve the Transformer-XL base and large models, achieving 17.1 perplexity on the WikiText-103 dataset. We further investigate the pre-training masked language modeling task with Segatron. Experimental results show that BERT pre-trained with Segatron (SegaBERT) outperforms BERT with the vanilla Transformer on various NLP tasks, and outperforms RoBERTa on zero-shot sentence representation learning. Our code is available on GitHub (https://github.com/rsvp-ai/segatron_aaai).

Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Introduction

Language modeling (LM) is a traditional sequence modeling task which requires learning long-distance dependencies for next token prediction based on the previous context. Recently, large neural LMs trained on a massive amount of text data have shown great potential for representation learning and transfer learning, and have also achieved state-of-the-art results in various natural language processing tasks.

To the best of our knowledge, state-of-the-art language models (Dai et al. 2019; Baevski and Auli 2019; Rae et al. 2020) and pre-trained language models (Radford 2018; Devlin et al. 2019; Yang et al. 2019; Lan et al. 2020) all use a multi-layer Transformer (Vaswani et al. 2017). The Transformer network was initially used in the seq2seq architecture for machine translation, whose input is usually a single sentence. Hence, it is intuitive to distinguish each token by its position index in the input sequence. However, for language modeling the input can grow to 1024 or more tokens and come from different sentences and paragraphs. Although vanilla position encoding can make the Transformer aware of each token's position by assigning it a unique index, the token index in a sentence, the sentence index in a paragraph, and the paragraph index in a document are all implicit. Such segmentation information is essential for language modeling, as tokens in different segments of the context hold different significance for next token prediction. If the Transformer model can be aware of the segment position of each context token, we hypothesize that better context representations will be encoded. This statement is not made lightly: for 3,000 years, many languages, including ancient Latin, Greek, English, French, and Chinese, did not have punctuation or paragraphs. The introduction of sentence and paragraph separators was fundamental, and so is indexing them to train Transformers. Although punctuation and paragraph breaks can provide boundary information to some extent, the boundary is not as explicit as the segment position, especially for the dot-product self-attention based Transformer.

Hence, we propose a novel segment-aware Transformer (Segatron), which encodes the paragraph index in a document, the sentence index in a paragraph, and the token index in a sentence all together for the input sequence. We first verify the proposed method with relative position encoding on the language modeling task. By applying the segment-aware mechanism to Transformer-XL (Dai et al. 2019), our base model trained on the WikiText-103 dataset (Merity et al. 2017) outperforms the Transformer-XL base model by 1.5 points in terms of perplexity. Our large model achieves a perplexity of 17.1, the same score as the Compressive Transformer (Rae et al. 2020), which is a more complicated model with longer input context and additional training objectives. We also pre-train masked language models with the vanilla Transformer (BERT-base) and Segatron (SegaBERT-base) on English Wikipedia for 500K training steps. According to experimental results, SegaBERT outperforms BERT on both general language understanding (GLUE) and machine reading comprehension (SQuAD and RACE) tasks. We further pre-trained a large model, SegaBERT-large, with the same data used in BERT. Experimental results show that SegaBERT-large not only outperforms BERT-large on all the above tasks, but also outperforms RoBERTa-large on zero-shot Semantic Textual Similarity tasks. These results demonstrate the value of segment encodings in Transformers.

Model

In this section, we show how to apply our proposed segment-aware Transformer to language modeling. More specifically, we first introduce Segatron-XL (segment-aware Transformer-XL) with non-learnable relative position encoding for autoregressive language modeling. Then we introduce our pre-trained Segatron (SegaBERT) with learnable absolute position encoding for masked language modeling (MLM).

[Figure 1: Input representation of Segatron-XL and SegaBERT. (a) Concatenating relative position embeddings: the input sequence's relative position embeddings and word embeddings; (b) summing absolute position embeddings: the input sequence's paragraph index, sentence index, token index, and word embeddings.]

Segatron-XL

We first introduce our method in the context of autoregressive language modeling, by replacing the vanilla Transformer position index in Transformer-XL (Dai et al. 2019) with Segatron. Transformer-XL is a memory-augmented Transformer with relative position encoding:

A^{rel}_{i,j} = E_{x_i}^\top W_q^\top W_{k,E} E_{x_j} + E_{x_i}^\top W_q^\top W_{k,R} R_{i-j} + u^\top W_{k,E} E_{x_j} + v^\top W_{k,R} R_{i-j}    (1)

where A^{rel}_{i,j} is the self-attention score between query i and key j. E_{x_i} and E_{x_j} are the input representations of query i and key j, respectively. R_{i-j} is the relative position embedding. W_{k,E} and W_{k,R} are transformation matrices for the input representation and the position embedding, respectively. u and v are learnable variables.
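To make the four terms of Eq. 1 concrete, the following minimal Python/PyTorch sketch computes the relative attention score for a single query-key pair in one head. The variable names, shapes, and random initializations are our own illustrative assumptions, not the authors' released implementation (which uses Transformer-XL's efficient batched relative-shift computation rather than a per-pair loop).

import torch

d, n = 64, 8                                  # hidden size, sequence length (illustrative)
E = torch.randn(n, d)                         # token representations E_x
W_q, W_kE, W_kR = torch.randn(d, d), torch.randn(d, d), torch.randn(d, d)
u, v = torch.randn(d), torch.randn(d)         # learnable global content/position biases
R = torch.randn(n, d)                         # relative position embeddings, indexed by (i - j) >= 0

def rel_attention_score(i: int, j: int) -> torch.Tensor:
    """A^rel_{i,j} for a causal LM, where key j precedes query i (so i - j >= 0)."""
    q = W_q @ E[i]                            # projected query
    k_content = W_kE @ E[j]                   # content-based key term
    k_position = W_kR @ R[i - j]              # position-based key term
    return q @ k_content + q @ k_position + u @ k_content + v @ k_position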
The position embeddings are non-learnable and defined as:

R_{i-j,k} = \begin{cases} \sin\big((i-j)/10000^{2k/dim}\big), & k < \tfrac{1}{2} dim \\ \cos\big((i-j)/10000^{2k/dim}\big), & k \ge \tfrac{1}{2} dim \end{cases}    (2)

where dim is the dimension size of R_{i-j}, and k is the dimension index.

Our proposed method introduces paragraph and sentence segmentation into the relative position encoding. The new position embeddings R_{I,J} are defined as:

R_{I,J,k} = \begin{cases} R^t_{t_i - t_j,\, k}, & k < \tfrac{1}{3} dim \\ R^s_{s_i - s_j,\, k - \frac{1}{3} dim}, & \tfrac{1}{3} dim \le k < \tfrac{2}{3} dim \\ R^p_{p_i - p_j,\, k - \frac{2}{3} dim}, & k \ge \tfrac{2}{3} dim \end{cases}    (3)

where I = {t_i, s_i, p_i} and J = {t_j, s_j, p_j}. t, s, and p are the token position index, sentence position index, and paragraph position index, respectively. R^t, R^s, and R^p are the relative position embeddings of token, sentence, and paragraph. These embeddings are defined as in Eq. 2, and the dimension of each is equal to 1/3 of that of R_{I,J}. The input representation of our model is shown in Figure 1(a).

To equip the recurrence memory mechanism of Transformer-XL with the segment-aware relative position encoding, the paragraph, sentence, and token position indexes of the previous segment should also be cached together with the hidden states. Then, the relative positions can be calculated by subtracting the cached position indexes from the current position indexes.
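As an illustration of Eqs. 2-3, the sketch below builds the segment-aware relative position embedding R_{I,J} by concatenating token-, sentence-, and paragraph-level sinusoidal embeddings, each occupying one third of the dimensions. Function names and the example dimension of 768 are our own assumptions; this is a sketch of the idea, not the released code.

import torch

def sinusoid(rel: torch.Tensor, dim: int) -> torch.Tensor:
    # Eq. 2: sin over the first half of the dimensions, cos over the second half,
    # with frequencies 10000^(2k/dim).
    k = torch.arange(dim, dtype=torch.float)
    angles = rel.float().unsqueeze(-1) / (10000 ** (2 * k / dim))
    half = dim // 2
    return torch.cat([torch.sin(angles[..., :half]), torch.cos(angles[..., half:])], dim=-1)

def segment_aware_rel_embedding(t_i, t_j, s_i, s_j, p_i, p_j, dim=768):
    # Eq. 3: each segment level gets dim/3 dimensions.
    part = dim // 3
    return torch.cat([
        sinusoid(t_i - t_j, part),   # token-level relative distance
        sinusoid(s_i - s_j, part),   # sentence-level relative distance
        sinusoid(p_i - p_j, part),   # paragraph-level relative distance
    ], dim=-1)

# Query token at (token=10, sentence=2, paragraph=1) attending to a key at (3, 0, 0):
r = segment_aware_rel_embedding(torch.tensor(10), torch.tensor(3),
                                torch.tensor(2), torch.tensor(0),
                                torch.tensor(1), torch.tensor(0))
print(r.shape)  # torch.Size([768])

In Eq. 1, this R_{I,J} simply takes the place of R_{i-j}.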
Pre-trained Segatron

First, pre-training a masked language model in the setting of BERT is a practical choice, as BERT is a popular baseline model and requires less computational resources compared with more recent large models. For example, BERT-large needs only about 10% of the resources of RoBERTa-large (Liu et al. 2019). Hence, in this paper, we first pre-train two base-size models, SegaBERT-base and BERT-base, with only English Wikipedia data for 500K training steps, to fairly compare BERT pre-trained with the vanilla Transformer and with Segatron. We then pre-train a large-size model, SegaBERT-large, with the Wikibooks dataset and 1M training steps, the same as BERT-large.

Input Representation. The input X of SegaBERT is a sequence of tokens, which can be one or more sentences or paragraphs. The representation x_t for token t is computed by summing the corresponding token embedding E_t, token index embedding P_t, sentence index embedding P^s_t, and paragraph index embedding P^p_t, as shown in Figure 1(b). Two special tokens, [CLS] and [SEP], are added to the text sequence before the first token and after the last token, and their paragraph/sentence indexes are the same as those of their adjacent tokens. Following BERT, the text is tokenized into subwords with WordPiece and the maximum sequence length is 512.

Training Objective. Following BERT, we use the masked LM as our training objective. However, next sentence prediction (NSP) is not used in our model, as our input contains more than two sentences.

Data Preparation. For the pre-training corpus we use English Wikipedia and BookCorpus (Zhu et al. 2015). For each document, we first split it into N_p paragraphs, and all the sub-tokens in the i-th paragraph are assigned the same paragraph index embedding P^p_i.
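As a rough sketch of the input representation described above, the following PyTorch module sums a word embedding with learnable token-, sentence-, and paragraph-index embeddings, as in Figure 1(b). The hidden size follows BERT-base and the maximum sequence length of 512 is stated above, but the maximum sentence/paragraph counts, the module name, and the toy indexes are our own assumptions for illustration, not the released configuration; [CLS]/[SEP] handling and masking are omitted.

import torch
import torch.nn as nn

class SegmentAwareEmbeddings(nn.Module):
    # Word + token-index + sentence-index + paragraph-index embeddings, summed per token.
    def __init__(self, vocab_size=30522, hidden=768,
                 max_tokens=512, max_sentences=128, max_paragraphs=64):
        super().__init__()
        self.word = nn.Embedding(vocab_size, hidden)
        self.token_pos = nn.Embedding(max_tokens, hidden)      # token index within its sentence
        self.sent_pos = nn.Embedding(max_sentences, hidden)    # sentence index within its paragraph
        self.para_pos = nn.Embedding(max_paragraphs, hidden)   # paragraph index within the document

    def forward(self, token_ids, token_idx, sent_idx, para_idx):
        return (self.word(token_ids) + self.token_pos(token_idx)
                + self.sent_pos(sent_idx) + self.para_pos(para_idx))

# Toy example: one "document" with two paragraphs of one sentence each (subword ids are fake).
token_ids = torch.tensor([[11, 12, 13, 14, 21, 22, 23]])
para_idx  = torch.tensor([[0,  0,  0,  0,  1,  1,  1]])   # all sub-tokens of a paragraph share its index
sent_idx  = torch.tensor([[0,  0,  0,  0,  0,  0,  0]])   # sentence index restarts in each paragraph
token_idx = torch.tensor([[0,  1,  2,  3,  0,  1,  2]])   # token index restarts in each sentence
x = SegmentAwareEmbeddings()(token_ids, token_idx, sent_idx, para_idx)  # shape [1, 7, 768]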