arXiv:1910.10202v2 [cs.LG] 6 Aug 2021
COMPLEX TRANSFORMER: A FRAMEWORK FOR MODELING COMPLEX-VALUED SEQUENCE

Muqiao Yang*, Martin Q. Ma*, Dongyu Li*, Yao-Hung Hubert Tsai, Ruslan Salakhutdinov
Carnegie Mellon University, Pittsburgh, PA, USA
muqiaoy, qianlim, dongyul, yaohungt, [email protected]

* Equal contribution. This work was supported in part by DARPA grant FA875018C0150, DARPA SAGAMORE HR00111990016, AFRL CogDeCON, and Apple. We would also like to acknowledge NVIDIA's GPU support.

ABSTRACT

While deep learning has received a surge of interest in a variety of fields in recent years, major deep learning models barely use complex numbers. However, speech, signal and audio data are naturally complex-valued after Fourier Transform, and studies have shown a potentially richer representation of complex nets. In this paper, we propose a Complex Transformer, which incorporates the transformer model as a backbone for sequence modeling; we also develop attention and encoder-decoder networks that operate on complex input. The model achieves state-of-the-art performance on the MusicNet dataset and an In-phase Quadrature (IQ) signal dataset. The GitHub implementation to reproduce the experimental results is available at https://github.com/muqiaoy/dl_signal.

Index Terms: Deep learning, transformer network, sequence modeling, complex-valued deep neural network

1. INTRODUCTION

Speech recognition, signal processing, and audio transcription have been advanced by recent deep learning models [1, 2]. Those models only use the real half of the spectral input. However, developing deep learning models for complex-valued input is crucial because signal-based time series modeling is naturally complex-valued through the Fourier Transform (FT). In recent years, complex-valued feed-forward neural networks [3, 4] have been introduced, and more advanced complex-valued deep neural nets for sequence modeling have been proposed [5, 6, 7, 8]. Those methods for sequences are based on recurrent models like the recurrent neural network (RNN), Long Short-Term Memory (LSTM) [9], and Gated Recurrent Units (GRU) [10], which inherently suffer from the memory bottleneck.

Recently, attention mechanisms [11, 10] and transformer models [12, 13] have become well-performing sequence models because of their capability of looking at a more extended range of input contexts and representing those in different subspaces. Attention is, therefore, promising for building a complex-valued model capable of analyzing high-dimensional data with long temporal dependency. We introduce a Complex Transformer to solve complex-valued sequence modeling tasks, including prediction and generation. The Complex Transformer adapts complex representations and operations into the transformer network. We test the model with MusicNet, a dataset for music transcription [14], and a complex high-dimensional time series In-phase Quadrature (IQ) signal dataset. The specific contributions of our work are as follows:

Our Complex Transformer achieves state-of-the-art results on the MusicNet dataset [14], a dataset of classical music pieces with labels annotated at each time interval, and on the IQ signal dataset, a non-public dataset containing fixed-length Wi-Fi signals and their corresponding device labels.

Our transformer model contains several complex-valued features which boost its performance, including complex attention and a complex encoder-decoder.

2. RELATION TO PRIOR WORK

Complex-valued neural networks have attracted attention from the deep learning community, because of their richer representational capacity, since the complex version of the backpropagation algorithm [15] was proposed. It was then applied in a variety of domains, including signal processing [16] and computer vision [17], where signals and images in their waveform or Fourier Transform are used as input data.

The work in [5, 6, 7] applied complex operations on RNN, LSTM [9], and GRU [10], respectively. The work in [18] implemented a complex convolutional layer and used the layers to predict based on input within each time step. [8] combined the FT and sequence modeling methods to explore the temporal information. Our method differs from the previous ones as follows. Compared to [5, 6, 7, 8], which were based on recurrent networks, our work uses an attention-based transformer network, which has a global view of the input over all time steps and does not have the memory bottleneck issue embedded in the backbone models of RNN, LSTM and GRU. Compared to [18], we utilize the temporal dependencies across time steps. Another important difference that distinguishes this work from others is that we have performed sequence generation on both acoustic input and signal input.

3. MODEL ARCHITECTURE

Our complex transformer consists of an encoder-decoder structure, where the encoder maps a complex-valued input sequence into a complex-valued representation, and the decoder generates a complex-valued sequence one time step at a time given the encoder output.

3.1. Problem Statement

Given n sequences of signal $(x_1, \ldots, x_n)$ with d time steps, we use our model to classify given inputs and generate new sequences. To start with, we transform the raw real-valued signal $(x_1, \ldots, x_n) \in \mathbb{R}^{d \times n}$ into $(a_1 + ib_1, \ldots, a_n + ib_n) = A + iB \in \mathbb{C}^{d \times n}$ by the Discrete Fourier Transform, an operation which decomposes a finite time sequence into a finite frequency sequence. The frequency sequence $X = A + iB$ is then fed into the transformer model.

For an arbitrary classification task, to classify input $X$ to a label $y$, we first use a stack of encoders to produce the representation of the encoder input $X_{\mathrm{enc}} = A' + iB'$ for $X$, where $A'$ and $B'$ are separate latent vectors for $A$ and $B$ of $X$ respectively. We then use a linear layer to predict the output label probability given $X_{\mathrm{enc}}$. For the generation task, a stack of decoders in the model will be given both $X_{\mathrm{enc}}$ and $X_{\mathrm{dec}} = C + iD$ (decoder input), and completes the rest of $X_{\mathrm{dec}}$ by generating $C'$ and $D'$ (decoder output).

3.2. Complex Encoder and Decoder

Our model architecture is shown in Fig. 1.

Complex Encoder. The encoder is composed of six identical stacks, each with two sublayers. The first sublayer is a complex attention layer, and the second is a complex-valued feed-forward network. Both sublayers have residual connections [19] and layer normalizations [20]. We employ layer normalization before residual connections in the encoder, referred to as "Norm & Add" in the encoder part of Fig. 1.

Complex Decoder. The decoder also has six identical stacks. Each stack has three sublayers: complex attention, complex feed-forward network, and another complex attention layer. The first complex attention layer is masked with the additional diagonal masking to prevent attending to subsequent positions. The second complex attention will be performed on the encoded representation $X_{\mathrm{enc}}$ and the decoder input $X_{\mathrm{dec}}$.

Fig. 1. Model Architecture Overview (Left: Encoder; Right: Decoder).

3.3. Complex Attention

For a complex vector $x = a + ib$, we represent $a$ and $b$ as different input parts and simulate complex operations using real values. This is because for any complex function $f : \mathbb{C}^n \to \mathbb{C}^n$ and any complex vector $x = a + ib$, we can represent $f$ as $f(a + ib) = \alpha(a, b) + i\beta(a, b)$ where $\alpha, \beta : \mathbb{R}^n \to \mathbb{R}^n$ [5]. This indicates any complex function could be rewritten into two separate real functions.

A complex feed-forward network feeds $a$ and $b$ to separate real-valued feed-forward neural networks and ReLU activations [18]. A complex convolutional neural network [18] convolves a complex weight matrix $W = A + iB$ and a complex vector $x$ as follows:

$$
W * x = (A + iB) * (a + ib) = (A * a - B * b) + i(A * b + B * a) \tag{1}
$$

Inspired by [12], we propose complex building blocks for the attention mechanism in our model. Given complex input $X = A + iB$, we want to attend to high-dimensional information at different time steps just as in the real-valued case. Thus, we compute the query matrix $Q = XW_Q$, the key matrix $K = XW_K$, and the value matrix $V = XW_V$ (where $Q, K, V$ are complex-valued and the $W$s are real-valued) and define the complex attention:

$$
\begin{aligned}
QK^T V &= (XW_Q)(XW_K)^T (XW_V) \\
&= (AW_Q + iBW_Q)(W_K^T A^T + iW_K^T B^T)(AW_V + iBW_V) \\
&= (AW_Q W_K^T A^T A W_V - AW_Q W_K^T B^T B W_V \\
&\qquad - BW_Q W_K^T A^T B W_V - BW_Q W_K^T B^T A W_V) \\
&\quad + i(AW_Q W_K^T A^T B W_V + AW_Q W_K^T B^T A W_V \\
&\qquad + BW_Q W_K^T A^T A W_V - BW_Q W_K^T B^T B W_V) \\
&= A' + iB'
\end{aligned}
\tag{2}
$$

where $A'$ and $B'$ represent the real and the imaginary part of the complex attention result respectively. In our implementation, to have a better resolution of internal similarities between the real and the imaginary parts, for each term in the expanded version of Eq. (2), we calculate the multihead attentions, as shown in Fig. 2 and Eq. (3)-Eq. (6).

Fig. 2. Structure of Complex Attention.

$$
\begin{aligned}
\mathrm{ComplexAttention}(X) &= (\mathrm{MH}(A, A, A) - \mathrm{MH}(A, B, B) - \mathrm{MH}(B, A, B) - \mathrm{MH}(B, B, A)) \\
&\quad + i(\mathrm{MH}(A, A, B) + \mathrm{MH}(A, B, A) + \mathrm{MH}(B, A, A) - \mathrm{MH}(B, B, B))
\end{aligned}
\tag{3}
$$

where

$$
\mathrm{MH}(Q, K, V) = \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\{\mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)\}_{i=1}^{n})\, W^O \tag{4}
$$

$$
\mathrm{Attention}(Q, K, V) = \text{Min-Max-Norm}\left(\frac{QK^T}{\sqrt{d_k}}\right) V \tag{5}
$$

as in [12], because min-max normalization prevents gradient explosion better in our model.

Table 1. Experimental results for automatic music transcription.

Model                        # Parameters   APS (%)
cgRNN [7]                    2.36M          53.0
Deep Real Network [18]       10.00M         69.8
Deep Complex Network [18]    17.14M         72.9
Concatenated Transformer     9.79M          71.30
Complex Transformer          11.61M         74.22

4.
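The eight-term real-valued expansion of Eq. (2) and Eq. (3) in Section 3.3 can be checked numerically. The following is a minimal sketch, not the authors' implementation: it uses plain Python lists, a single attention head, and hypothetical helper names (`matmul`, `combine`, `min_max_norm`, `attention`, `complex_attention`). It assumes that Min-Max-Norm rescales each row of the score matrix to [0, 1], which is one possible reading of Eq. (5). The scaling factor under the square root is omitted because a positive constant does not change a min-max rescaling. All numbers are made up for illustration.

```python
def matmul(X, Y):
    """Matrix product of two row-major nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    """Plain (non-conjugate) transpose, as written in Eq. (2)."""
    return [list(col) for col in zip(*X)]

def combine(terms):
    """Signed elementwise sum of equally shaped matrices: [(sign, M), ...]."""
    rows, cols = len(terms[0][1]), len(terms[0][1][0])
    return [[sum(s * M[r][c] for s, M in terms) for c in range(cols)]
            for r in range(rows)]

def min_max_norm(S):
    """Rescale each row to [0, 1] (assumed row-wise; constant rows map to 0)."""
    out = []
    for row in S:
        lo, hi = min(row), max(row)
        span = hi - lo
        out.append([(v - lo) / span if span else 0.0 for v in row])
    return out

def attention(Q, K, V, norm=None):
    """One real-valued term: norm(Q K^T) V, or plain Q K^T V when norm is None."""
    scores = matmul(Q, transpose(K))
    if norm is not None:
        scores = norm(scores)
    return matmul(scores, V)

def complex_attention(A, B, WQ, WK, WV, norm=None):
    """Return (real, imag) using the eight-term sign pattern of Eq. (2)/(3)."""
    def term(q, k, v):
        return attention(matmul(q, WQ), matmul(k, WK), matmul(v, WV), norm)
    real = combine([(+1, term(A, A, A)), (-1, term(A, B, B)),
                    (-1, term(B, A, B)), (-1, term(B, B, A))])
    imag = combine([(+1, term(A, A, B)), (+1, term(A, B, A)),
                    (+1, term(B, A, A)), (-1, term(B, B, B))])
    return real, imag

# Toy input: 3 time steps, 2 features, real part A and imaginary part B.
A = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
B = [[0.0, 1.0], [2.0, 0.5], [-1.0, 1.5]]
WQ = [[1.0, 0.0], [0.5, 1.0]]
WK = [[0.0, 1.0], [1.0, -0.5]]
WV = [[2.0, 0.0], [0.0, 1.0]]

# Reference: multiply the complex matrices directly (first line of Eq. (2)).
X = [[complex(a, b) for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]
Q, K, V = matmul(X, WQ), matmul(X, WK), matmul(X, WV)
direct = matmul(matmul(Q, transpose(K)), V)

# Without normalization the eight real terms should reproduce the complex product.
real, imag = complex_attention(A, B, WQ, WK, WV)
max_err = max(abs(complex(r, i) - z)
              for rr, ri, rz in zip(real, imag, direct)
              for r, i, z in zip(rr, ri, rz))

# With normalization each of the eight terms is normalized separately, as in Eq. (3).
norm_real, norm_imag = complex_attention(A, B, WQ, WK, WV, norm=min_max_norm)
print(max_err, norm_real, norm_imag)
```

With `norm=None` the sketch reproduces the direct complex product of Eq. (2). Once a normalization is applied to each term separately, the result no longer equals a normalized complex product, which is why Eq. (3) is stated as its own definition rather than derived from Eq. (2).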