Arxiv:2010.04903V1 [Cs.CL] 10 Oct 2020

What Do Position Embeddings Learn? An Empirical Study of Pre-Trained Language Model Positional Encoding Yu-An Wang Yun-Nung Chen Department of Computer Science and Information Engineering National Taiwan University, Taipei, Taiwan [email protected] [email protected] Abstract ordering information can be modeled by this struc- ture. However, RNN and BPTT are very inefficient In recent years, pre-trained Transformers have in modern GPU computation due to the difficulty of dominated the majority of NLP benchmark parallelization with the time dependency. To solve tasks. Many variants of pre-trained Trans- this problem, recent work, such as convolutional formers have kept breaking out, and most fo- cus on designing different pre-training objec- seq2seq (Gehring et al., 2017) and Transformers tives or variants of self-attention. Embedding (Vaswani et al., 2017) which apply convolutional the position information in the self-attention neural network (CNN) (LeCun et al., 1995) and mechanism is also an indispensable factor in self-attention respectively, succeed to eliminate the Transformers however is often discussed at time dependency to take the computational advan- will. Therefore, this paper carries out an tage of GPU. Instead of storing the information of empirical study on position embeddings of ordered sequences, these models utilize the posi- mainstream pre-trained Transformers, which tion information by using a feature-level positional mainly focuses on two questions: 1) Do position embeddings really learn the meaning of encoding. For example, convolutional seq2seq pro- positions? 2) How do these different learned posed learnable position embeddings to represent position embeddings affect Transformers for the positions in a sequence. NLP tasks? This paper focuses on providing Recently, various pre-trained Transformer lan- a new insight of pre-trained position embed- guage models keep breaking state-of-the-art results dings through feature-level analysis and empir- in numerous NLP tasks. There are many different ical experiments on most of iconic NLP tasks. ways to pre-train a Transformer language model. It is believed that our experimental results can guide the future work to choose the suitable For example, using an encoder, decoder, or the positional encoding function for specific tasks whole part of the Transformer, adapting the self- given the application property.1 attention masks, or training with different objectives (Devlin et al., 2018; Liu et al., 2019; Radford 1 Introduction et al., 2018, 2019; Lewis et al., 2019; Raffel et al., 2019; Yang et al., 2019). However, in terms of po- Word ordering often determines the meaning of sitional encoding, most work only used a learned a sentence; therefore how to utilize the position position embedding which is originally proposed arXiv:2010.04903v1 [cs.CL] 10 Oct 2020 information of a word sequence has been an impor- in convolutional seq2seq (Gehring et al., 2017) tant topic in NLP and widely investigated recently. without any analysis, even different objectives may A common approach for modeling word ordering learn completely different position information. is to use recurrent neural networks (RNN), such Motivated by the above observations, our goal as long short-term memory (LSTM) (Hochreiter is to investigate what position information the and Schmidhuber, 1997) or gated recurrent unit pre-trained Transformers could learn under differ- (GRU) (Chung et al., 2014), which use a hidden ent settings. We conduct a deep analysis of the state to represent the information of an ordered se- learned position embeddings among three iconic quence and update model weights by backpropaga- pre-trained Transformer language models: BERT tion through time (BPTT) (Werbos, 1990); thus the (Devlin et al., 2018), RoBERTa (Liu et al., 2019) 1The source code is available at: https://github. and GPT-2 (Radford et al., 2019). To examine the com/MiuLab/PE-Study performance of different NLP types, we conduct the experiments on text classification, language 3 Transformer modeling, and machine translation, and empirically Transformer is an encoder-decoder sequence-to- analyze and explain the meaning and influence of sequence model proposed by Vaswani et al.(2017). position embeddings from different aspects. In the architecture, Transformer is composed of The contributions of this paper are 3-fold: self-attention blocks that are position-insensitive • This paper is among the first study that pro- modules. Therefore, a positional embedding should vides a complete analysis about what learned be considered together with the NLP tasks. To elab- position embeddings capture in different pre- orate on the experiments we conduct, this section trained models. briefly introduces Transformers. Input Representation Due to the property of • This paper empirically examines the perfor- position-insensitive in the attention module, the mance of different position embeddings for input representations should also contain the posi- many NLP tasks. tion information. In Transformers (Vaswani et al., • This paper connects the empirical perfor- 2017), a word embedding is directly added with mance with the task property based on the the positional encoding as the final representation: analysis, providing the guidance of the future zi = WE(xi) + PE(i); work for choosing the suitable positional en- x i WE coding method in the target task. where i is the token at the -th position, is the word embedding, and PE is the positional en- 2 Related Work coding, which can be either a learnable embedding or a pre-defined function. The concept of using position embedding on position-insensitive models was first proposed by Multi-Head Self-Attention The attention mech- convolutional seq2seq (Gehring et al., 2017), which anism is often used in an encoder-decoder architec- built an encoder-decoder architecture on convo- ture, and there are many variants of attention im- lutional neural networks. Vaswani et al.(2017) plementations (Bahdanau et al., 2014; Britz et al., proposed Transformers that used the self-attention 2017). In Transformers, the scaled dot-product mechanism in the basic blocks. Because the atten- attention is applied: tion mechanism is position-insensitive, it proposed QW KT W attention(Q; K; V ) = softmax( p )V W; a pre-defined sinusoidal function as positional en- dk coding. Pre-trained language models became a where W is a linear projection and Q, K, V repre- trend among many NLP tasks after (Peters et al., sent query, key and value matrices respectively. 2018) introduced ELMo. Affected by ELMo, Ope- Transformer blocks are composed of multi-head nAI GPT (Radford et al., 2018) is the first pre- self-attention. Literally, the inputs Q, K, V are the trained language model using a Transformer archi- same and the attention is performed multiple times, tecture, then many different variant of pre-trained and then the output heads are concatenated as the Transformer including BERT (Devlin et al., 2018), final output hidden state h. This process can be RoBERTa (Roberts, 2005) and GPT-2 (Radford formulated as et al., 2019) started evolving the researches of NLP tremendously. In Transformers, the attention val- headi = attention(Q; K; V ) ues are the same in each input position. Thus, Shaw h = concat([head ; :::; head ])W: et al.(2018) proposed a relative position represen- 1 n tation in the attention level to address this issue. Transformer Encoder A Transformer encoder Dai et al.(2019) used a segment-level recurrence layer is composed of multi-head self-attention mechanism on Transformers and also utilized an following a position-wise feed-forward network adaptive version of relative position embeddings (FFN) with the residual connection (He et al., 2016) inspired by Shaw et al.(2018). Furthermore, Wang and layer normalization (Ba et al., 2016): et al.(2019) extended the embedding space from output = layernorm(h + FFN(h)); real numbers to complex values , and also proposed a new learnable positional encoding function and then stacked the layers sequentially to form a instead of a simple position embedding mapping. Transformer encoder. Transformer Decoder The Transformer de- Type PE MAE coder is also stacked by self-attention blocks, and BERT 34:14 it only has two major differences from the encoder: Learned RoBERTa 6:06 1. Each Transformer decoder layer has an addi- GPT-2 1:03 tional sub-layer to perform attention on the Pre-Defined sinusoid 0:0 encoder output. Table 1: Mean absolute error of the reversed mapping 2. To ensure the decoder can only decode tokens function learned by linear regression. depending on the tokens in the past, it uses an attention mask to mask the attention values of Type PE Error Rate the subsequent tokens. BERT 19:72% Therefore, the Transformer decoder can decode Learned RoBERTa 7:23% tokens autoregressively like other conventional lan- GPT-2 1:56% guage models such as RNN. Pre-Defined sinusoid 5:08% 4 Position Embedding Analysis Table 2: Error rate of the relative position regression. In this section, we conduct feature-level analyses of the pre-trained position embeddings of two Trans- position in GPT-2 is trimmed from 1024 to 512 former encoders: BERT (Devlin et al., 2018) and for comparison which BERT and RoBERTa. Be- RoBERTa (Liu et al., 2019), one Transformer de- cause we only have 512 data points for each learned coder: GPT-2 (Radford et al., 2019), and also the si- embedding, a 5-fold cross-validation is applied to nusoidal function proposed by Vaswani et al.(2017) avoid overfitting. The reversed mapping functions is defined as are evaluated by Mean Absolute Error (MAE), and the result is shown in Table1. PE = sin(i=100002j=dmodel ); (i;2j) From the results, the reversed mapping function 2j=dmodel PE(i;2j+1) = cos(i=10000 ); of sinusoid can perfectly represent the absolute positions, and GPT-2 only has a small error. In where i is the position index and j is the dimension contrast, the embeddings learned by Transformer index.

Arxiv:2010.04903V1 [Cs.CL] 10 Oct 2020

Part Vi Neural Machine Translation, Seq2seq and Attention 2

Deep Transformer Models for Time Series Forecasting:The Influenza

Deep Learning

Autograph: Imperative-Style Coding with Graph-Based Performance

Tensorflow, Theano, Keras, Torch, Caffe Vicky Kalogeiton, Stéphane Lathuilière, Pauline Luc, Thomas Lucas, Konstantin Shmelkov Introduction

Convolutional Attention-Based Seq2seq Neural Network for End-To-End ASR

Arxiv:2104.08301V2 [Cs.CL] 7 Jul 2021

Distilling the Knowledge of BERT for Sequence-To-Sequence ASR

Towards Russian Text Generation Problem Using Openai's GPT-2

A Survey on Long Short-Term Memory Networks for Time Series Prediction

Time-Series Prediction Approaches to Forecasting Deformation in Sentinel-1 Insar Data

Seq2seq Model Deep Learning Lecture 9