Progressive Training in Recurrent Neural Networks for Chord Progression Modeling

Progressive Training in Recurrent Neural Networks for Chord Progression Modeling Trung-Kien Vu, Teeradaj Racharak, Satoshi Tojo, Nguyen Ha Thanh and Nguyen Le Minh School of Information Science, Japan Advanced Institute of Science and Technology, Ishikawa, Japan fkienvu, racharak, tojo, nguyenhathanh, [email protected] Keywords: Recurrent Neural Network, Chord Progression Modeling, Sequence Prediction, Knowledge Compilation. Abstract: Recurrent neural networks (RNNs) can be trained to process sequences of tokens as they show impressive results in several sequence prediction. In general, when RNNs are trained, their goals are to maximize the likelihood of each token in the sequence where each token could be represented as a one-hot representation. That is, the model learns for its sequence prediction from true class labels. However, this creates a potential drawback, i.e., the model cannot learn from the mistakes. In this work, we propose a progressive learning strategy that can mitigate the mistakes by using domain knowledge. Our strategy gently changes the training process from using the class labels guiding scheme to the similarity distribution of class labels instead. Our experiments on chord progression modeling show that this training paradigm yields significant improvements. 1 INTRODUCTION tion learning requires a large amount of training data, which may be not easily available. On the other hand, Recurrent neural networks (RNNs) can be used to a learning efficiency can be improved by injecting do- produce sequences of tokens, given input/output main knowledge for guiding the learning process. For pairs. Generally, they are trained to maximize the instance, (Song et al., 2019) injected medical ontolo- likelihood of each target token given the current gies to regularize learnable parameters. In this work, state of the model and the previous target token, in we concentrate on the latter approach, i.e., how do- which target tokens could be in one-hot representa- main knowledge can be incorporated at training time? tions (Bengio et al., 2015). This training process helps Intuitively, to address this difficulty, we observe the model to learn a language model over sequences that the model could be guided by domain knowledge of tokens. In fact, RNNs and their variants, such as when it is learnt to produce a wrong prediction. Like the Long Short-Term Memory (LSTM) and Gated Re- the work done by (Bengio et al., 2015), the model current Unit (GRU), have shown impressive results in can be guided by more than one target vectors dur- machine translation such as (Sutskever et al., 2014; ing training. RNNs usually optimize the loss func- Bahdanau et al., 2015; Meng and Zhang, 2019), im- tion between a predicted sequence and the true se- age captioning such as (Vinyals et al., 2015), and quence. This kind of optimization causes a potential chord progression such as (Choi et al., 2016). drawback, i.e., the model may be mistaken when its When the model is trained from input/output pairs, predicted sequence is incorrect. To alleviate this prob- it may cause the model per se to mistake the predic- lem, we compile domain knowledge in terms of sim- tion during training steps. Indeed, the model is fully ilarity distribution of class labels. Then, we propose guided by the true labels at its correct prediction; but, to change the training process by forcing the model is less guided otherwise. Different kinds of method- to learn from two different vectors (i.e. the true tar- ologies were proposed to mitigate this kind of learn- get distribution and the similarity target distribution). ing efficiency by enriching the input’s information via For this, we introduce a progressive learning strategy transfer learning like word embedding (cf. (Mikolov which gradually switches from similarity target dis- et al., 2013; Pennington et al., 2014)) or knowledge tribution to the true target distribution. Doing so, the graph embedding (cf. (Ristoski et al., 2019)). This model can be learnt to predict a reasonable outcome technique exploits a learnt vector representation at when the model generates a token deviating from the training time and the learning paradigm can be done true label. This enables the model to be learnt for pre- as usual. Despite its promising result, representa- dicting correctly later when it mistakes its prediction earlier. For instance, in chord progression modeling, a To solve such problems of the recurrent neu- mistake for predicting G[maj instead of F]maj should ral network architecture, Long short-term memory not gain feedback for training equally as a mistake (LSTM) network was proposed with the idea that prediction of Cmaj for the same chord F]maj. This not all inputs in the sequence contribute with impor- observation is reasonable because G[maj and F]maj tant information. The LSTM architecture allows the are fundamentally the same chord (which are named model to choose which inputs in the sequence are im- differently); and, Cmaj and F]maj have no relation- portant and forget the others. A standard LSTM con- ship to each other. A neural network should be able to tains an input gate, an output gate and a forget gate as utilize feedback from similarity information between shown in Figure 1. class labels at training time. The contributions of this paper are twofold. First, we introduce an approach to compile domain knowledge as similarity distributions for injecting at training time. Second, we propose a progressive learning strategy to gradually switch the training target from the similarity distribution to the true distribution. We elaborate our proposed approach with more techni- cal details in Section 3. Furthermore, Section 2 and Section 4 present preliminaries and experimental results, respectively. Section 5 makes comparisons of our work with others. Finally, Section 6 discusses the Figure 1: LSTM architecture can capture longer input sequence. conclusion and future directions. Let i, f and o denote the input, forget and output gates, respectively. Then, the hidden state ht in a 2 RNN-BASED SEQUENCE LSTM network is calculated as follows: PREDICTION MODEL i i i it = s(xtU + ht−1W + b ) (1) f f f The Hopfield Networks (John, 1982) is the precur- ft = s(xtU + ht−1W + b ) (2) sor to recurrent neural networks. The architecture of o o o ot = s(xtU + ht−1W + b ) (3) these neural networks allows the reception of signals ˜ g g g from consecutive inputs. RNNs were invented with Ct = tanh(xtU + ht−1W + b ) (4) the main idea of using information from the previous Ct = s( ft ∗Ct−1 + it ∗C˜t ) (5) steps to give the most accurate prediction for the cur- ht = tanh(Ct ) ∗ ot (6) rent prediction step. In a feed-forward neural network, inputs are pro- In the equations, U is the weight matrix from the input cessed separately; as a result, it cannot capture the re- and W is the weight matrix from the hidden layer in the previous time step. Ct is the memory of the unit lation information between entities in a sequence. In ˜ contrast, a recurrent neural network maintains a loop and Ct is the candidate for cell state at timestamp t. s of information in its architecture. After being pro- denotes the sigmoid function and ∗ is the elementwise cessed, an output is fed back into the network to be multiplication. processed with the next inputs. RNN has many vari- LSTM models are effective in sequential tasks. In ations, but the most famous one is LSTM (Hochreiter sequence tagging, (Huang et al., 2015) proposed a and Schmidhuber, 1997) and GRU (Cho et al., 2014). model that processes the feature in both directions. Despite the ability to combine information from In speech recognition, (Graves et al., 2013) used the inputs of sequences, the recurrent neural network bidirectional LSTM and achieved state-of-the-art re- models have a weakness namely “vanishing gra- sults in phoneme recognition. In language modeling, dients”. Processing a long sequence, the model (Sundermeyer et al., 2012) proved that using LSTM feeds information across multiple layers and multi- can bring significant improvement in this task. ple timesteps; as a result, the parameter sequence becomes longer. During the training process, loss value is back-propagated from the output layer to previous 3 PROPOSED APPROACH layers for updating all the weights. However, in a long sequence of parameters, the loss becomes zero at the Multi-class classification is the task of classifying in- beginning of the sequence. stances into one of K classes and can be trained for sequence prediction (cf. Section 2). For that, a neu- property set of class a by Pa, in which Pa ⊆ P . To ral network classifier is given a vectorized represen- capture the notion of relevancy, we define a function tation of an input and produces a K-dimensional pre- w : P ! R≥0 capturing the importance of properties, dicted vector yˆ := [yˆ1;:::;yˆK], which is a probability in which w(pk) > 0 indicates that pk has more impor- distribution representing the confidence of the clas- tance and w(pk) := 0 indicates that pk is of no impor- sifier over K classes. In the standard training pro- tance in an application context. Then, the similarity cess, the network is trained to generate a high confi- between class a and b (denoted by s(a;b)) can be de- dence towards the correct class by updating its weight fined in the following equation. q to minimize Kullback-Leibler divergence between its prediction vector yˆ and the one-hot representation s(a;b) := ∑ w(p) (8) y of the correct class (cf. Equation 7). p2Pa\Pb K Intuitively, Equation 8 calculates the summation of yi DKL(y;yˆ) := ∑ yi log (7) weighted common properties between two classes.

Load more