Neural Edit Operations for Biological Sequences

Satoshi Koide, Keisuke Kawano, Takuro Kutsuna
Toyota Central R&D Labs.

Abstract

The evolution of biological sequences, such as proteins or DNAs, is driven by the three basic edit operations: substitution, insertion, and deletion. Motivated by the recent progress of neural network models for biological tasks, we implement two neural network architectures that can treat such edit operations. The first proposal is the edit invariant neural networks, based on differentiable Needleman-Wunsch algorithms. The second is the use of deep CNNs with concatenations. Our analysis shows that CNNs can recognize regular expressions without Kleene star, and that deeper CNNs can recognize more complex regular expressions including the insertion/deletion of characters. The experimental results for the protein secondary structure prediction task suggest the importance of insertion/deletion. The test accuracy on the widely-used CB513 dataset is 71.5%, which is 1.2 points better than the current best result on non-ensemble models.

1 Introduction

Neural networks are now used in many applications, not limited to classical fields such as image processing, speech recognition, and natural language processing. Bioinformatics is becoming an important application field of neural networks. These biological applications are often implemented as a supervised learning model that takes a biological string (such as DNA or protein) as an input and outputs the corresponding label(s), such as a protein secondary structure [13, 14, 15, 18, 19, 23, 24, 26], protein contact maps [4, 8], or genome accessibility [12].

Invariance, which forces a prediction model to satisfy a desirable property for a specific task, is important in neural networks.
For example, CNNs with pooling layers capture the shift invariance that is considered important for image recognition tasks. CNNs were first proposed to imitate the organization of the visual cortex [6], which is often used to explain why CNNs work for visual tasks. Similarly, rotation invariance for image tasks has also been studied [25]. In general, it is important to model the proper invariances for a given application domain.

What, then, is the invariance in biological tasks? As is well known in bioinformatics, similar sequences tend to exhibit similar functions or structures (i.e., similar labels in terms of machine learning). Here, similarity is evaluated by sequence alignment, which is closely related to the edit distance. This implies that the labels associated with biological sequences exhibit (weak) invariance with respect to a small number of edit operations, i.e., substitution, insertion, and deletion. This paper aims to incorporate such invariances, which we call edit invariance, into neural networks.

Contribution. We consider two neural network architectures that incorporate the edit operations. First, we propose the edit invariant neural networks, obtained by interpreting the classical Needleman-Wunsch algorithm [17] as a differentiable neural network. Next, we show that deep CNNs with concatenations can treat regular expressions without Kleene stars, indicating that such CNNs can capture edit operations including insertion/deletion. Our experiments demonstrate the validity of our approach. The test accuracy of protein secondary structure prediction on the widely-used CB513 dataset (e.g., [26]) reaches 71.5%, the state-of-the-art performance among previous studies on non-ensemble models.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

2 Edit Invariant Neural Networks (EINN)

Differentiable Sequence Alignment.
In bioinformatics, sequence alignment is key in comparing two biological strings (e.g., DNA, proteins). The Needleman-Wunsch (NW) algorithm [17], a fundamental sequence alignment algorithm, calculates the similarity score between two strings on an alphabet Σ. As shown in Fig. 1, the similarity score is computed via dynamic programming to maximize the total score (illustrated as the double-lined square) by inserting or deleting characters (depicted as the vertical and horizontal arrows, respectively).

Figure 1: The NW alignment of x_{1:m} = TCCA and y_{1:n} = TCGCA. The red line is a path that maximizes the score. The numbers in the cells correspond to F_{i,j} in Algorithm 1:

         T   C   G   C   A
     0  -2  -4  -6  -8 -10
  T -2   1  -1  -3  -5  -7
  C -4  -1   2   0  -2  -4
  C -6  -3   0   1   1  -1
  A -8  -5  -2  -1   0   2

Although the original NW algorithm is a function that takes strings as its arguments, we can naturally extend it to a differentiable function using embeddings. Algorithm 1 shows the proposed dynamic programming procedure to calculate the NW score s_NW(x_{1:m}, y_{1:n}, g). Here, the scalar parameter g is the gap cost, i.e., the cost of inserting or deleting a character. The differences from the original NW algorithm are three-fold.

1. The input sequences x_{1:m} := [x_1, ..., x_m] and y_{1:n} := [y_1, ..., y_n] are each d-dimensional time series (i.e., x_i and y_j are vectors in R^d) of lengths m and n, respectively.
2. Following the modification above, the score function is defined as the inner product instead of a predefined lookup table (Line 7).
3. The soft maximum max^γ(x) = γ log(Σ_i exp(x_i/γ)) is used instead of the hard max function (Line 10).

The dynamic programming in Algorithm 1 can be regarded as a computational graph, allowing us to differentiate the NW similarity score s_NW(x_{1:m}, y_{1:n}, g) with respect to x_{1:m}, y_{1:n}, and g.
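The forward recursion of the differentiable NW score can be sketched in NumPy as follows. This is an illustrative sketch rather than the authors' implementation; the names `soft_max` and `s_nw` are ours.

```python
import numpy as np

def soft_max(vals, gamma):
    """Soft maximum: gamma * log(sum_i exp(v_i / gamma)); upper-bounds max_i v_i."""
    v = np.asarray(vals, dtype=float)
    m = v.max()
    return m + gamma * np.log(np.exp((v - m) / gamma).sum())

def s_nw(x, y, g, gamma):
    """Differentiable Needleman-Wunsch score of Algorithm 1.
    x: (m, d) array, y: (n, d) array, g: gap cost, gamma: smoothing parameter."""
    m, n = len(x), len(y)
    F = np.zeros((m + 1, n + 1))
    F[:, 0] = -g * np.arange(m + 1)   # leading gaps
    F[0, :] = -g * np.arange(n + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a = F[i - 1, j - 1] + x[i - 1] @ y[j - 1]  # substitution: inner product
            b = F[i - 1, j] - g                        # gap: delete x_i
            c = F[i, j - 1] - g                        # gap: insert y_j
            F[i, j] = soft_max([a, b, c], gamma)
    return F[m, n]

# Three distinct one-hot vectors aligned against themselves: with a small
# gamma the soft score approaches the hard NW score, here 3 matches.
x = np.eye(3)
print(round(s_nw(x, x, g=2.0, gamma=1e-3), 3))  # → 3.0
```

As γ → 0 the soft maximum approaches the hard maximum, so `s_nw` recovers the classical NW score; larger γ smooths over near-optimal alignments.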
Algorithm 1: Differentiable Needleman-Wunsch (forward), computing s_NW(x_{1:m}, y_{1:n}, g).
1:  F ← 0  // (m+2) × (n+2) zero matrix
2:  for i = 0 ... m do
3:    F_{i,0} ← −ig
4:  for j = 1 ... n do
5:    F_{0,j} ← −jg
6:    for i = 1 ... m do
7:      a ← F_{i−1,j−1} + x_i · y_j
8:      b ← F_{i−1,j} − g
9:      c ← F_{i,j−1} − g
10:     F_{i,j} ← max^γ(a, b, c)
11: return F_{m,n} as s_NW(x_{1:m}, y_{1:n}, g)

In principle, it is possible to apply automatic differentiation to obtain the gradient; however, automatic differentiation might be computationally expensive because there are exponentially many backward paths in the computational graph. To avoid this problem, we modify the backward computation by employing some algebraic substitutions. Consequently, we can compute the derivatives efficiently using dynamic programming, as shown in Algorithm 2 and Algorithm 3. See Appendix A in the supplementary material for the derivation of these algorithms. With the matrix Q computed in Algorithm 2, we can calculate the derivative of the NW score with respect to x_i and y_j as follows:

∂s_NW/∂x_i = Σ_{j=1}^{n} Q_{i,j} exp(H_{i,j}/γ) · y_j,    ∂s_NW/∂y_j = Σ_{i=1}^{m} Q_{i,j} exp(H_{i,j}/γ) · x_i,    (1)

where H_{i,j} := F_{i−1,j−1} + x_i · y_j − F_{i,j}. These derivatives are derived similarly to those of SoftDTW [3], a differentiable distance function corresponding to dynamic time warping (where the gap cost is not involved).
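Eq. (1) can be checked numerically against finite differences of the forward score. The sketch below is our own illustration, not the authors' code: the names `nw_forward` and `nw_grad_x` are ours, and it assumes the boundary convention of Algorithm 2 (boundary cells padded with ∞ so that their φ_γ terms vanish, out-of-range inner products treated as 0).

```python
import numpy as np

def nw_forward(x, y, g, gamma):
    """F matrix of the soft NW recursion (Algorithm 1), padded to (m+2, n+2)."""
    m, n = len(x), len(y)
    F = np.zeros((m + 2, n + 2))
    F[:m + 1, 0] = -g * np.arange(m + 1)
    F[0, :n + 1] = -g * np.arange(n + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            v = np.array([F[i - 1, j - 1] + x[i - 1] @ y[j - 1],
                          F[i - 1, j] - g,
                          F[i, j - 1] - g])
            mx = v.max()
            F[i, j] = mx + gamma * np.log(np.exp((v - mx) / gamma).sum())
    return F  # the score s_NW is F[m, n]

def nw_grad_x(x, y, g, gamma):
    """d s_NW / d x_i via the Q recursion (Algorithm 2) and Eq. (1)."""
    m, n = len(x), len(y)
    F = nw_forward(x, y, g, gamma)
    D = np.zeros((m + 2, n + 2))        # D[i, j] = x_i . y_j; zero out of range
    D[1:m + 1, 1:n + 1] = x @ y.T
    Q = np.zeros((m + 2, n + 2))
    F[1:m + 1, n + 1] = np.inf          # infinite boundary: phi(., inf) = 0
    F[m + 1, 1:n + 1] = np.inf
    F[m + 1, n + 1] = F[m, n]
    Q[m + 1, n + 1] = 1.0
    phi = lambda a, b: np.exp((a - b) / gamma)
    for j in range(n, 0, -1):
        for i in range(m, 0, -1):
            Q[i, j] = (phi(F[i, j] + D[i + 1, j + 1], F[i + 1, j + 1]) * Q[i + 1, j + 1]
                       + phi(F[i, j] - g, F[i + 1, j]) * Q[i + 1, j]
                       + phi(F[i, j] - g, F[i, j + 1]) * Q[i, j + 1])
    # Eq. (1): ds/dx_i = sum_j Q_ij * exp(H_ij / gamma) * y_j
    grad = np.zeros_like(x, dtype=float)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            H = F[i - 1, j - 1] + D[i, j] - F[i, j]
            grad[i - 1] += Q[i, j] * np.exp(H / gamma) * y[j - 1]
    return grad
```

A symmetric computation over Q gives ∂s_NW/∂y_j. Note that one backward pass over the (m × n) table yields the gradient for all i at once, avoiding the exponentially many paths of naive automatic differentiation.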
Algorithm 2: Calculation of Q (backward). We denote φ_γ(a, b) := exp((a − b)/γ).
1:  Q ← 0  // (m+2) × (n+2) zero matrix
2:  for i = 1 ... m do
3:    F_{i,n+1} ← ∞
4:  F_{m+1,n+1} ← F_{m,n};  Q_{m+1,n+1} ← 1
5:  for j = n ... 1 do
6:    F_{m+1,j} ← ∞
7:    for i = m ... 1 do
8:      a ← φ_γ(F_{i,j} + x_{i+1} · y_{j+1}, F_{i+1,j+1})
9:      b ← φ_γ(F_{i,j} − g, F_{i+1,j})
10:     c ← φ_γ(F_{i,j} − g, F_{i,j+1})
11:     Q_{i,j} ← a Q_{i+1,j+1} + b Q_{i+1,j} + c Q_{i,j+1}
12: return Q

(Out-of-range inner products x_{i+1} · y_{j+1} are treated as 0; the ∞-valued boundary cells make the corresponding φ_γ terms vanish.)

Algorithm 3: Calculation of P. We denote φ_γ(a, b) := exp((a − b)/γ).
1:  P ← 0  // (m+2) × (n+2) zero matrix
2:  for i = 0 ... m do
3:    P_{i,0} ← −i
4:  for j = 1 ... n do
5:    P_{0,j} ← −j
6:    for i = 1 ... m do
7:      a ← φ_γ(F_{i−1,j−1} + x_i · y_j, F_{i,j})
8:      b ← φ_γ(F_{i−1,j} − g, F_{i,j})
9:      c ← φ_γ(F_{i,j−1} − g, F_{i,j})
10:     P_{i,j} ← a P_{i−1,j−1} + b (P_{i−1,j} − 1) + c (P_{i,j−1} − 1)
11: return P

For the derivative with respect to the gap cost g, we can derive it similarly using the matrix P in Algorithm 3: ∂s_NW/∂g = P_{m,n}.

As in the original NW algorithm, the proposed method can account for insertions and deletions. It is well known that the NW score is closely related to the edit distance. Given sequences x_{1:m} and y_{1:n}, consider a modified sequence x'_{1:(m−1)} in which one feature vector x_t has been deleted from x_{1:m}. In such a case, the calculated scores s_NW(x, y) and s_NW(x', y) take similar values. We call this property edit invariance, which we expect to be important for tasks involving biological sequences.

Convolutional EINN. Here, we extend traditional CNNs with the NW score introduced above. Consider an embedded sequence X ∈ R^{d×L} of length L and a convolutional filter w ∈ R^{d×K} of kernel size K. Let x ∈ R^{d×K} be a frame of length K at a certain position in the embedded sequence X. In CNNs, the similarity is computed by the (Frobenius) inner product, i.e., w · x. Our idea is to replace this inner-product-based similarity with the proposed s_NW(x, w, g).
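The edit invariance can be illustrated on toy one-hot embeddings: deleting a single vector from a sequence lowers its NW score against the original by roughly one match plus one gap cost, while an unrelated ordering scores far lower. This is our own toy illustration, not an experiment from the paper (here a mismatched pair of one-hot vectors scores 0, unlike the ±1 scoring of Fig. 1).

```python
import numpy as np

def s_nw(x, y, g, gamma=0.1):
    """Soft Needleman-Wunsch score (Algorithm 1)."""
    m, n = len(x), len(y)
    F = np.zeros((m + 1, n + 1))
    F[:, 0] = -g * np.arange(m + 1)
    F[0, :] = -g * np.arange(n + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            v = np.array([F[i - 1, j - 1] + x[i - 1] @ y[j - 1],
                          F[i - 1, j] - g,
                          F[i, j - 1] - g])
            mx = v.max()
            F[i, j] = mx + gamma * np.log(np.exp((v - mx) / gamma).sum())
    return F[m, n]

# Five distinct one-hot "residues" as a toy embedded sequence.
y = np.eye(5)
x_same = y.copy()
x_del = np.delete(y, 2, axis=0)   # one deletion
x_far = y[::-1].copy()            # same vectors, heavily reordered

s_same = s_nw(x_same, y, g=1.0)   # ~5: five matches
s_del = s_nw(x_del, y, g=1.0)     # ~3: four matches minus one gap cost
s_far = s_nw(x_far, y, g=1.0)     # much lower: at most one match survives
```

In the convolutional EINN, the same `s_nw(x, w, g)` would score each length-K frame x against the filter w in place of the Frobenius inner product w · x, so a frame that differs from the filter by a single insertion or deletion still scores highly.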