Neural Melody Composition from Lyrics

Hangbo Bao1∗, Shaohan Huang2, Furu Wei2, Lei Cui2, Yu Wu3, Chuanqi Tan3, Songhao Piao1, Ming Zhou2
1School of Computer Science, Harbin Institute of Technology, Harbin, China
2Microsoft Research, Beijing, China
3State Key Laboratory of Software Development Environment, Beihang University, Beijing, China
[email protected], {shaohanh,fuwei,lecu,mingzhou}@microsoft.com, [email protected], [email protected], [email protected]
Abstract

In this paper, we study a novel task that learns to compose music from natural language. Given the lyrics as input, we propose a melody composition model that generates a lyrics-conditional melody as well as the exact alignment between the generated melody and the given lyrics simultaneously. More specifically, we develop the melody composition model based on the sequence-to-sequence framework. It consists of two neural encoders to encode the current lyrics and the context melody respectively, and a hierarchical decoder to jointly produce musical notes and the corresponding alignment. Experimental results on lyrics-melody pairs of 18,451 pop songs demonstrate the effectiveness of our proposed methods. In addition, we apply a singing voice synthesizer software to synthesize the "singing" of the lyrics and melodies for human evaluation. Results indicate that our generated melodies are more melodious and tuneful compared with the baseline method.

[Figure 1: sheet music in 4/4 time with the note sequence R, A4, E5, D5, B4, A4, C5, A4, G4, E4, G4, R, E4, D5, C5, A4, G4, C5, C5, A4, aligned to the lyrics 爱 恨 两 茫 茫 问 君 何 时 恋 (Pinyin: ài hèn liǎng máng máng wèn jūn hé shí liàn; English gloss: "It is so far away between love and hate / When do you want to fall in love again").]

Figure 1: A fragment of a Chinese song "Drunken Concubine (new version)". The blue rectangles indicate rests, i.e., intervals of silence in a piece of melody. The red rectangles indicate the alignment between the lyrics and the melody, i.e., a mapping from syllables of the lyrics to musical notes. Pinyin indicates the syllable for each Chinese character. We can observe that the second Chinese character '恨' (hate) aligns with one note 'E5' and the last Chinese character '恋' (love) aligns with two notes 'C5' and 'A4' in the melody, which illustrates the "one-to-many" relationship in the alignment between the lyrics and the melody.
Introduction

We study the task of melody composition from lyrics, which consumes a piece of text as input and aims to compose the corresponding melody as well as the exact alignment between the generated melody and the given lyrics. Specifically, the output consists of two sequences of musical notes and lyric syllables[1] with two constraints. First, each syllable in the lyrics corresponds to at least one musical note in the melody. Second, a syllable in the lyrics may correspond to a sequence of notes, which increases the difficulty of this task. Figure 1 shows a fragment of a Chinese song. For instance, the last Chinese character '恋' (love) aligns with two notes 'C5' and 'A4' in the melody.

There are several existing research works on generating lyrics-conditional melody (Ackerman and Loker 2017; Scirea et al. 2015; Monteith, Martinez, and Ventura 2012; Fukayama et al. 2010). These works usually treat the melody composition task as a classification or sequence labeling problem. They first determine the number of musical notes by counting the syllables in the lyrics, and then predict the musical notes one after another by considering previously generated notes and the corresponding lyrics. However, these works only consider the "one-to-one" alignment between the melody and the lyrics. According to our statistics on 18,451 Chinese songs, 97.9% of the songs contain at least one syllable that corresponds to multiple musical notes (i.e., "one-to-many" alignment), so this simplification may introduce bias into the task of melody composition.

In this paper, we propose a novel melody composition model which can generate melody from lyrics and handle the "one-to-many" alignment between the generated melody and the given lyrics. Given the lyrics as input, we first divide the input lyrics into sentences and then use our model to compose each piece of melody from the sentences one by one. Finally, we merge these pieces into a complete melody for the given lyrics. More specifically, the model consists of two encoders and one hierarchical decoder. The first encoder encodes the syllables in the current lyrics into an array of hidden vectors with a bi-directional recurrent neural network (RNN), and the second encoder leverages an attention mechanism to convert the context melody into a dynamic context vector with a two-layer bi-directional RNN.

∗ Contribution during internship at Microsoft Research.
Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
[1] A syllable is a word or part of a word which contains a single vowel sound and is pronounced as a unit. Chinese is a monosyllabic language, which means that words (Chinese characters) predominantly consist of a single syllable (https://en.wikipedia.org/wiki/Monosyllabic_language).
In the decoder, we employ a three-layer RNN to produce the musical notes and the alignment jointly, where the first two layers generate the pitch and duration of each musical note, and the last layer predicts a label for each generated musical note to indicate the alignment.

We collect 18,451 Chinese pop songs and generate lyrics-melody pairs with precise syllable-note alignment to conduct experiments on our methods and baselines. Automatic and human evaluations demonstrate the effectiveness of our proposed method.

The contributions of our work in this paper are summarized as follows.

• To the best of our knowledge, this paper is the first work to use an end-to-end neural network model to compose melody from lyrics.
• We construct a large-scale lyrics-melody dataset with 18,451 Chinese pop songs and 644,472 lyrics-context-melody triples, so that neural-network-based approaches become possible for this task.
• Compared with traditional sequence-to-sequence models, our proposed method can generate the exact alignment as well as the "one-to-many" alignment between the melody and lyrics.
• The human evaluation verifies that the synthesized pop songs of the generated melody and input lyrics are melodious and meaningful.

[Figure 2: sheet music for the fragment in Figure 1, together with its lyrics-melody aligned data:]

Pinyin    | ài      | hèn | liǎng   | máng          | máng        | wèn     | jūn | hé      | shí     | liàn
Lyrics    | 爱      | 恨  | 两      | 茫            | 茫          | 问      | 君  | 何      | 时      | 恋
Pitch     | R A4    | E5  | D5 B4   | A4 C5 A4      | G4 E4 G4    | R E4    | D5  | C5 A4   | G4 C5   | C5 A4
Duration  | 1/4 1/4 | 1/4 | 1/8 1/8 | 1/8 1/16 1/16 | 1/8 1/8 1/2 | 1/4 1/4 | 1/4 | 1/8 1/8 | 1/8 1/8 | 1/4 1/2
Label     | 0 1     | 1   | 0 1     | 0 0 1         | 0 0 1       | 0 1     | 1   | 0 1     | 0 1     | 0 1

Figure 2: An illustration of lyrics-melody aligned data. Pitch and Duration respectively represent the pitch and duration of each musical note. In addition, Label provides the information on the alignment between the lyrics and the melody. To be specific, a musical note is assigned label 1 to denote that it is the boundary of the musical-note subsequence aligned to the corresponding syllable; otherwise it is assigned label 0. Additionally, we always align the rests with their latter syllables.

Preliminary

We first introduce some basic definitions from music theory and then give a brief introduction to our lyrics-melody parallel corpus. Table 1 lists the mathematical notations used in this paper.

Table 1: Notations used in this paper.

Notation           | Description
X                  | the sequence of syllables in the given lyrics
x^j                | the j-th syllable in X
M                  | the sequence of musical notes in the context melody
m^i                | the i-th musical note in M
m^i_pit, m^i_dur   | the pitch and duration of m^i, respectively
Y                  | the sequence of musical notes in the predicted melody
y^i                | the i-th musical note in Y
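As a minimal sketch, these notations can be mapped onto simple data structures; the example values are taken from the fragment in Figure 2, and the helper name `split_by_label` is our own:

```python
from dataclasses import dataclass

@dataclass
class Note:
    pitch: str  # e.g. 'A4'; 'R' marks a rest, treated as a special pitch
    dur: str    # duration as a fraction of a whole note, e.g. '1/4'
    lab: int    # 1 if this note ends the group aligned to one syllable

# X: the first three syllables of the Figure 2 fragment
X = ['ài', 'hèn', 'liǎng']

# Y: the corresponding notes y^i = (y^i_pit, y^i_dur, y^i_lab)
Y = [Note('R', '1/4', 0), Note('A4', '1/4', 1),   # 爱
     Note('E5', '1/4', 1),                        # 恨
     Note('D5', '1/8', 0), Note('B4', '1/8', 1)]  # 两

def split_by_label(notes):
    """Split a melody into per-syllable groups at label-1 boundaries."""
    groups, current = [], []
    for note in notes:
        current.append(note)
        if note.lab == 1:
            groups.append(current)
            current = []
    return groups

# one group per syllable: each syllable gets at least one note
assert len(split_by_label(Y)) == len(X)
```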
Concepts from Music Theory

Melody can be regarded as an ordered sequence of musical notes. The basic unit of melody is the musical note, which mainly consists of two attributes: pitch and duration. The pitch is a perceptual property of sounds that allows their ordering on a frequency-related scale; more commonly, the pitch is the quality that makes it possible to judge sounds as "higher" and "lower" in the sense associated with musical melodies[2]. Therefore, we use a sequence of numbers to represent the pitches. For example, we represent 'C5' and 'Eb6' as 72 and 87 respectively, based on MIDI[3]. A rest is an interval of silence in a piece of music; we use 'R' to represent it and treat it as a special pitch. Duration is a particular time interval that describes the length of time that the pitch or tone sounds[4], i.e., how long or short a musical note lasts.

[2] https://en.wikipedia.org/wiki/Pitch_(music)
[3] https://newt.phys.unsw.edu.au/jw/notes.html
[4] https://en.wikipedia.org/wiki/Duration_(music)
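For illustration, the note-name-to-MIDI mapping used above can be written as a small helper; `to_midi` is our own name, and only sharps, flats, and rests are handled:

```python
# Semitone offset of each letter name within an octave.
SEMITONE = {'C': 0, 'D': 2, 'E': 4, 'F': 5, 'G': 7, 'A': 9, 'B': 11}

def to_midi(name: str):
    if name == 'R':                 # rest: a special pitch with no frequency
        return 'R'
    letter, suffix = name[0], name[1:]
    offset = SEMITONE[letter]
    if suffix.startswith('#'):      # sharp raises by one semitone
        offset, suffix = offset + 1, suffix[1:]
    elif suffix.startswith('b'):    # flat lowers by one semitone
        offset, suffix = offset - 1, suffix[1:]
    octave = int(suffix)
    return 12 * (octave + 1) + offset

# matches the paper's examples: 'C5' -> 72 and 'Eb6' -> 87
assert to_midi('C5') == 72 and to_midi('Eb6') == 87
```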
Lyrics-Melody Parallel Corpus

Figure 2 shows an example of a lyrics-melody aligned pair with precise syllable-note alignment, where each Chinese character of the lyrics aligns with one or more notes in the melody.

The melody consists of three sequences, Pitch, Duration and Label, where the Label sequence represents the alignment between the melody and the lyrics. We are able to rebuild the sheet music from them. The Pitch sequence represents the pitch of each musical note in the melody, with 'R' specifically representing a rest. Similarly, the Duration sequence represents the duration of each musical note in the melody. Pitch and Duration constitute a complete melody but do not include information on the alignment between the given lyrics and the corresponding melody.

Label contains the information on the alignment. Each item of Label is one of {0, 1} and indicates the alignment between the musical note and the corresponding syllable in the lyrics. To be specific, a musical note is assigned label 1 to denote that it is the boundary of the musical-note subsequence aligned to the corresponding syllable; otherwise it is assigned label 0. We can split the musical notes into n parts at the label-1 positions, where n is the number of syllables of the lyrics, and each part is a musical-note subsequence. Then we can align the musical notes to their corresponding syllables sequentially. Additionally, we always align the rests to their latter syllables. For instance, we can observe that the second rest aligns to the Chinese character '问' (ask).

Task Definition

Given lyrics as the input, our task is to generate the melody and the alignment that make up a song with the lyrics. We formally define this task as follows. The input is a sequence X = (x^1, ..., x^|X|) representing the syllables of the lyrics. The output is a sequence Y = (y^1, ..., y^|Y|) representing the predicted musical notes for the corresponding lyrics, where y^i = {y^i_pit, y^i_dur, y^i_lab}. In addition, the output sequence Y should satisfy the following restriction:

    |X| = Σ_{i=1}^{|Y|} y^i_lab    (1)

which requires that the generated melody can be exactly aligned with the given lyrics.

Approach

In this section, we present the end-to-end neural network model, termed Songwriter, to compose a melody which aligns exactly to the given input lyrics. Figure 3 provides an illustration of Songwriter.

[Figure 3 architecture diagram: the lyrics encoder, the context melody encoder, and the three-layer melody decoder (pitch layer, duration layer, and label layer).]

Figure 3: An illustration of Songwriter. The lyrics encoder and context melody encoder encode the syllables of the given lyrics and the context melody into two arrays of hidden vectors, respectively. For decoding the i-th musical note y^i, Songwriter uses an attention mechanism to obtain a context vector c^i_con from the context melody encoder (green arrows) and counts how many label 1s have been produced among the previous musical notes to obtain h^j_lrc, representing the current syllable corresponding to y^i, from the lyrics encoder (red arrows). In the melody decoder, the pitch layer and duration layer first predict the pitch y^i_pit and duration y^i_dur of y^i, then the label layer predicts a label y^i_lab for y^i to indicate the alignment.

Overview

Given lyrics as the input, we first divide the lyrics into sentences and then use Songwriter to compose each piece of the melody sentence by sentence. For each sentence in the lyrics, Songwriter takes the syllables of the sentence and the context melody, i.e., a number of previously predicted musical notes, as input, and then predicts a piece of melody. When the last piece of melody has been predicted, we merge the pieces into a complete song with the given lyrics. This procedure can be considered as a sequence generation problem with two sequences as input: the syllables of the current lyrics X and the context melody M.
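This overview amounts to a simple outer loop around the per-sentence model. A minimal sketch, assuming a hypothetical `songwriter` callable that maps (syllables, context_melody) to a list of predicted notes; the 40-note context window matches the setting reported later in the dataset description:

```python
CONTEXT_WINDOW = 40  # the paper uses the 40 most recent notes as context

def compose_song(sentences, songwriter):
    """Compose a full melody sentence by sentence."""
    melody = []
    for syllables in sentences:
        context = melody[-CONTEXT_WINDOW:]      # previously predicted notes
        piece = songwriter(syllables, context)  # one piece of melody
        melody.extend(piece)                    # merge pieces sequentially
    return melody
```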
We develop our melody composition model based on a modified RNN encoder-decoder (Cho et al. 2014a) to support multiple sequences as input. Songwriter employs two neural encoders, a lyrics encoder and a context melody encoder, to respectively encode the syllables of the current lyrics X and the context melody M, and leverages a hierarchical melody decoder to produce the musical notes and the alignment Y. To be specific, the lyrics encoder and the context melody encoder encode X and M into two arrays of hidden vectors, respectively. At time step i, the melody decoder obtains a context vector c^i_con from the context melody encoder and a hidden vector h^j_lrc from the lyrics encoder to produce the i-th musical note y^i. c^i_con is computed dynamically by the attention mechanism over the output of the context melody encoder. h^j_lrc is one of the output hidden vectors of the lyrics encoder, representing the j-th syllable x^j in the current lyrics. In the melody decoder, which is a three-layer RNN, the pitch layer and duration layer first predict the pitch y^i_pit and duration y^i_dur, then the label layer predicts a label y^i_lab of y^i to indicate the alignment.

Gated Recurrent Units

We use the Gated Recurrent Unit (GRU) (Cho et al. 2014b) instead of the basic RNN. We describe the mathematical model of the GRU as follows:

    z^i = σ(W_hz h^{i−1} + W_xz x^i + b_z)              (2)
    r^i = σ(W_hr h^{i−1} + W_xr x^i + b_r)              (3)
    h̃^i = tanh(W_h (r^i ∘ h^{i−1}) + W_x x^i + b_h)     (4)
    h^i = (1 − z^i) ∘ h^{i−1} + z^i ∘ h̃^i               (5)

Context Melody Encoder

The context melody encoder is a modified RNN that encodes the pitch, duration, and label of a musical note respectively at each time step. Each layer is a bi-directional RNN built from two GRUs. For the first layer, we describe the forward GRU and the backward GRU at time step i as follows:

    →h^i_pit = f_GRU(→h^{i−1}_pit, m^i_pit)             (9)
    ←h^i_pit = f_GRU(←h^{i+1}_pit, m^i_pit)             (10)

where m^i_pit is the pitch attribute of the i-th note m^i. Then, we concatenate them into one vector:

    h^i_pit = [→h^i_pit; ←h^i_pit]                      (11)

The second layer encodes the output of the first layer together with the duration attribute of the melody:

    →h^i_dur = f_GRU(→h^{i−1}_dur, m^i_dur, h^i_pit)    (12)
    ←h^i_dur = f_GRU(←h^{i+1}_dur, m^i_dur, h^i_pit)    (13)
    h^i_dur = [→h^i_dur; ←h^i_dur]                      (14)

We concatenate the two output arrays of vectors into one array of vectors to represent the context melody sequence:

    h^i_con = [h^i_pit; h^i_dur]                        (15)
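As a concrete reading of Eqs. (9)-(15), the following PyTorch sketch stacks two bi-directional GRUs, feeding the second layer with both the duration embeddings and the first layer's output. Class and parameter names are our own; the label input mentioned above is omitted because the equations cover only pitch and duration:

```python
import torch
import torch.nn as nn

class ContextMelodyEncoder(nn.Module):
    # sizes follow the paper's implementation section:
    # 256 hidden units, 128-dim pitch and duration embeddings
    def __init__(self, n_pitch, n_dur, emb=128, hidden=256):
        super().__init__()
        self.pitch_emb = nn.Embedding(n_pitch, emb)
        self.dur_emb = nn.Embedding(n_dur, emb)
        # first layer reads pitch tokens (Eqs. 9-11)
        self.pitch_rnn = nn.GRU(emb, hidden,
                                bidirectional=True, batch_first=True)
        # second layer reads duration tokens plus h_pit (Eqs. 12-14)
        self.dur_rnn = nn.GRU(emb + 2 * hidden, hidden,
                              bidirectional=True, batch_first=True)

    def forward(self, pitch_ids, dur_ids):
        h_pit, _ = self.pitch_rnn(self.pitch_emb(pitch_ids))  # (B, |M|, 2H)
        dur_in = torch.cat([self.dur_emb(dur_ids), h_pit], dim=-1)
        h_dur, _ = self.dur_rnn(dur_in)                       # (B, |M|, 2H)
        return torch.cat([h_pit, h_dur], dim=-1)              # Eq. 15
```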
Melody Decoder

The decoder predicts the next note y^i from all previously predicted notes {y^1, ..., y^{i−1}}. At each time step, the decoder combines the outputs of the two encoders:

    c^i = c^i_con + h^j_lrc                             (21)

where c^i_con is a context vector from the context melody encoder, and h^j_lrc is one of the output hidden vectors of the lyrics encoder, representing the syllable x^j that should be aligned to the currently predicted y^i. In particular, we set c^i_con to a zero vector if there is no context melody as input. From our representation of lyrics-melody aligned pairs, it is not difficult to determine the x^j that y^i should be aligned to:

    j = Σ_{t=1}^{i−1} y^t_lab                           (22)

c^i_con is recomputed at each step by the alignment model (Bahdanau, Cho, and Bengio 2014) as follows:

    c^i_con = Σ_{t=1}^{|M|} α^{i,t} h^t_con             (23)

where h^t_con is one hidden vector from the output of the context melody encoder and the weight α^{i,t} is computed by:

    α^{i,t} = exp(e^{i,t}) / Σ_{k=1}^{|M|} exp(e^{i,k}) (24)
    e^{i,k} = v_a^⊤ tanh(W_a s^{i−1} + U_a h^k_con)     (25)

where v_a, W_a and U_a are learnable parameters. Finally, we obtain c^i and then compute s^i_pit, s^i_dur, s^i_lab and s^i as follows:

    s^i_pit = f_GRU(s^{i−1}_pit, c^{i−1}, y^{i−1}_pit, h^j_lrc)                  (26)
    s^i_dur = f_GRU(s^{i−1}_dur, c^{i−1}, y^{i−1}_dur, y^i_pit, s^i_pit)         (27)
    s^i_lab = f_GRU(s^{i−1}_lab, c^{i−1}, y^{i−1}_lab, y^i_pit, y^i_dur, s^i_dur) (28)
    s^i = [s^i_pit; s^i_dur; s^i_lab]                                            (29)

Objective Function

Given a training dataset with n lyrics-context-melody triples D = {X^(i), M^(i), Y^(i)}_{i=1}^n, where X^(i) = {x^(i)1, ..., x^(i)|X^(i)|}, M^(i) = {m^(i)1, ..., m^(i)|M^(i)|} and Y^(i) = {y^(i)1, ..., y^(i)|Y^(i)|}. In addition, ∀(i, j), y^(i)j = (y^(i)j_pit, y^(i)j_dur, y^(i)j_lab). Our training objective is to minimize the negative log-likelihood loss L with respect to the learnable model parameters θ:

    L = −(1/n) Σ_{i=1}^{n} Σ_{j=1}^{|Y^(i)|} log P(y^(i)j_pit, y^(i)j_dur, y^(i)j_lab | θ, X^(i), M^(i), y^(i,<j))
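A numpy sketch of one step of Eqs. (21)-(25); the function names are ours, and we assume c^i_con and h^j_lrc share the same dimensionality so that Eq. (21) is a plain sum:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decoder_input(s_prev, h_con, h_lrc, labels_so_far, Wa, Ua, va):
    """One step of Eqs. (21)-(25).
    h_con: array (|M|, d) of context melody encoder states.
    h_lrc: array (|X|, d) of lyrics encoder states."""
    # Eqs. (24)-(25): additive attention scores over encoder states
    scores = np.array([va @ np.tanh(Wa @ s_prev + Ua @ h) for h in h_con])
    alpha = softmax(scores)
    c_con = alpha @ h_con                     # Eq. (23)
    # Eq. (22): the count of emitted label-1 boundaries selects the
    # current syllable (0-based here; the paper writes the 1-based form)
    j = int(sum(labels_so_far))
    return c_con + h_lrc[j]                   # Eq. (21)
```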
Experiments

Dataset

We collect 18,451 Chinese pop songs from an online Karaoke app, and then preprocess the dataset with rules as described in Zhu et al. (2018) to guarantee the reliability of the melodies. For each song, we convert the melody to C major or A minor, which keeps all melodies in the same tune, and we set the BPM (Beats Per Minute) to 60 to calculate the duration of each musical note in the melody. We further divide the lyrics into sentences with their corresponding musical notes as lyrics-melody pairs. Besides, we set a window size of 40 for the context melody and use the previous musical notes as the context melody for each lyrics-melody pair to make up lyrics-context-melody triples. Finally, we obtain 644,472 triples to conduct our experiments. We randomly choose 5% of the songs for validation, 5% of the songs for testing, and the rest for training.

Baselines

As the melody composition task can generally be regarded as a sequence labeling problem or a machine translation problem, we select two state-of-the-art models as baselines.

• CRF: A modified sequence labeling model based on CRF (Lafferty, McCallum, and Pereira 2001), which contains two layers for predicting Pitch and Duration, respectively. For "one-to-many" relationships, this model uses special tags to represent a series of original tags. For instance, if a syllable aligns with the two notes 'C5' and 'A4', we use a tag 'C5A4' to represent them.
• Seq2seq: A modified attention-based sequence-to-sequence model which contains two encoders and one decoder. Compared with Songwriter, Seq2seq uses the attention mechanism (Bahdanau, Cho, and Bengio 2014) to capture information on the given lyrics. Seq2seq may not guarantee the alignment between the generated melody and the syllables in the given lyrics. To avoid this problem, the Seq2seq model stops predicting when the number of label 1s in the predicted musical notes equals the number of syllables in the given lyrics.

Implementation

Model Size: For all the models used in this paper, the number of recurrent hidden units is set to 256. In the context melody encoder and melody decoder, we treat the pitch, duration, and label as tokens and use word embeddings to represent them with 128, 128, and 64 dimensions, respectively. In the lyrics encoder, we use GloVe (Pennington, Socher, and Manning 2014) to pre-train a char-level word embedding with 256 dimensions on a large Chinese lyrics corpus, and use Pinyin as the syllable features with 128 dimensions. [...] the pitch layer and duration layer in the melody decoder in Songwriter and Seq2seq. We use zero vectors to initialize the hidden states in the lyrics encoder and context melody encoder.

Training: We use Adam (Kingma and Ba 2015) with an initial learning rate of 0.001 and an exponential decay rate of 0.9999 as the optimizer to train our models with a batch size of 64, and we use cross entropy as the loss function.
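A sketch of this training configuration in PyTorch; `model` and `batches` are stand-ins, and we read the 0.9999 as a per-step learning-rate decay, which the paper does not spell out:

```python
import torch

def train(model, batches, epochs=1):
    # Adam with initial learning rate 0.001, per the paper; the 0.9999
    # exponential decay is applied per step here (our assumption)
    opt = torch.optim.Adam(model.parameters(), lr=0.001)
    sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.9999)
    loss_fn = torch.nn.CrossEntropyLoss()  # cross entropy, as in the paper
    for _ in range(epochs):
        for lyrics, context, target in batches:  # batch size 64 in the paper
            logits = model(lyrics, context)      # flattened (N, vocab) logits
            loss = loss_fn(logits, target)
            opt.zero_grad()
            loss.backward()
            opt.step()
            sched.step()
```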
Automatic evaluation

We use two modes to evaluate our model and the baselines.

• Teacher-forcing: As in Roberts et al. (2018), models use the ground truth as input for predicting the next step at each time step.
• Sampling: Models predict the melody from the given lyrics without any ground truth.

Metrics: We adopt the F1 score for the automatic evaluation, following Roberts et al. (2018). Additionally, we select three automatic metrics for our evaluation as follows.

• Perplexity (PPL): This metric is a standard evaluation measure for language models and measures how well a probability model predicts samples. A lower PPL score is better.
• (weighted) Precision, Recall and F1[6]: These metrics measure the performance of predicting the different attributes of the musical notes.
• BLEU: This metric (Papineni et al. 2002) is widely used in machine translation. We use it to evaluate the predicted pitches. A higher BLEU score is better.
• Duration of Word (DW): This metric checks whether the sum of the durations of all notes aligned to one word equals the ground truth. A higher DW score is better.

[6] We calculate these metrics with scikit-learn, with the parameter average set to 'weighted': http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
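The DW metric can be computed directly from the per-syllable note groups. A small sketch, assuming durations are given as fraction strings as in Figure 2; `dw_score` is our own name:

```python
from fractions import Fraction

def dw_score(pred_groups, gold_groups):
    """Fraction of words whose total predicted note duration matches
    the ground truth. Each group is a list of duration strings for
    the notes aligned to one word (syllable)."""
    hits = sum(
        sum(Fraction(d) for d in p) == sum(Fraction(d) for d in g)
        for p, g in zip(pred_groups, gold_groups)
    )
    return hits / len(gold_groups)

# two quarter notes match one half note in total duration:
# dw_score([['1/4', '1/4'], ['1/8']], [['1/2'], ['1/8']]) -> 1.0
```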
Table 2: Automatic evaluation results. Precision (P), Recall (R), and F1 are reported for the pitch, duration, and label attributes, respectively.

Model      | PPL  | Pitch P / R / F1      | Duration P / R / F1   | Label P / R / F1      | BLEU | DW
CRF        | /    | 41.23 / 42.02 / 40.98 | 49.82 / 53.12 / 50.84 | /                     | 2.02 | 25.53
Seq2seq    | 2.21 | 54.76 / 55.01 / 54.56 | 64.66 / 67.88 / 65.33 | 93.14 / 93.06 / 92.60 | 3.96 | 37.04
Songwriter | 2.01 | 63.23 / 63.24 / 62.90 | 69.18 / 71.28 / 69.69 | 93.54 / 93.61 / 93.31 | 6.63 | 38.31

Results: The results of the automatic evaluation are shown in Table 2. We can see that our proposed method outperforms all models on all metrics. As Songwriter performs better than Seq2seq, it shows that the exact information of the syllables (Eq. (22)) can enhance the quality of predicting the corresponding musical notes relative to the attention mechanism in traditional Seq2seq models. In addition, the CRF model demonstrates lower performance on all metrics. In the CRF model, we use a special tag to represent multiple musical notes if a syllable aligns with more than one musical note, which produces a large number of different kinds of tags and makes it difficult for the CRF model to learn from the sparse data.

Human evaluation

Similar to text generation and dialogue response generation (Zhang and Lapata 2014; Schatzmann, Georgila, and Young 2005), it is challenging to accurately evaluate the quality of music composition results with automatic metrics. To this end, we invite 3 participants as human annotators to evaluate the generated melodies from our models and the ground-truth melodies of human creation. We randomly select 20 lyrics-melody pairs from our testing set, with the average duration of each melody approximately 30 seconds. For each selected pair, we prepare three melodies: the ground truth of human creation and the generated results from Songwriter and Seq2seq. Then, we synthesize all melodies with the lyrics using NiaoNiao[7] with default settings for both the generated songs and the ground truth, in order to eliminate the influence of other factors of singing. As a result, we obtain 3 (annotators) × 3 (melodies) × 20 (lyrics) samples in total. The human annotations are conducted in a blind-review mode, which means that the human annotators do not know the source of the melodies during the experiments.

[7] A singing voice synthesizer software which can synthesize Chinese songs, http://www.dsoundsoft.com/product/niaoeditor/

Metrics: We use the metrics from previous work on human evaluation for music composition, as shown below. We also include an emotion score to measure the relationship between the generated melody and the given lyrics. The human annotators are asked to rate a score from 1 to 5 after listening to the songs. Larger scores indicate better quality on all three metrics.

• Emotion: Does the melody represent the emotion of the lyrics?
• Rhythm (Zhu et al. 2018; Watanabe et al. 2018): When listening to the melody, are the durations and pauses of words natural?
• Overall (Watanabe et al. 2018): What is the overall score of the melody?

Table 3: Human evaluation results in blind-review mode.

Model      | Overall | Emotion | Rhythm
Seq2seq    | 3.28    | 3.52    | 2.66
Songwriter | 3.83    | 3.98    | 3.52
Human      | 4.57    | 4.50    | 4.17

Results: Table 3 shows the human evaluation results. According to the results, Songwriter outperforms Seq2seq on all metrics, which indicates its effectiveness over the Seq2seq baseline. On the "Rhythm" metric, human annotators give significantly lower scores to Seq2seq than to Songwriter, which shows that the generated melodies from Songwriter are more natural in the pauses and durations of words than the ones generated by Seq2seq. The results further suggest that using the exact information of syllables (Eq. (22)) is more effective than the soft attention mechanism of traditional Seq2seq models in the melody composition task. We can also observe from Table 3 that the gaps between the system-generated melodies and the ones created by humans are still large on all three metrics. It remains an open challenge for future research to develop better algorithms and models to generate melodies with higher quality.

Related Work

A variety of music composition works have been done over the last decades. Most of the traditional methods compose music based on music theory and expert domain knowledge. Chan, Potter, and Schubert (2006) design rules from music theory to stitch music clips together in a reasonable way. With the development of machine learning and the increase of public music data, data-driven methods such as Markov chain models (Pachet and Roy 2011) and graphical models (Pachet, Papadopoulos, and Roy 2017) have been introduced to compose music.

Recently, deep learning has revealed its potential for musical creation. Most of these deep learning approaches use recurrent neural networks (RNNs) to compose music by regarding it as a sequence. The MelodyRNN (Waite 2016) model, proposed by the Google Brain Team, uses a lookback RNN and an attention RNN to capture the long-term dependency of melody. Chu, Urtasun, and Fidler (2016) propose a hierarchical RNN-based model which additionally incorporates knowledge from music theory into the representation to compose not only the melody but also the drums and chords. Some recent works have also started exploring various generative adversarial network (GAN) models to compose music (Mogren 2016; Yang, Chou, and Yang 2017; Dong et al. 2017). Brunner et al. (2018) design recurrent variational autoencoders (VAEs) with a hierarchical decoder to reproduce short musical sequences.

Generating a lyrics-conditional melody is a subset of music composition but under more restrictions. Early works first determine the number of musical notes by counting the syllables in the lyrics and then predict the musical notes one after another by considering previously generated notes and the corresponding lyrics. Fukayama et al. (2010) use dynamic programming to compute a melody from Japanese lyrics; the calculation needs three well-designed human constraints. Monteith, Martinez, and Ventura (2012) propose a melody composition pipeline for given lyrics: for each given lyrics, it first generates hundreds of different possibilities for rhythms and pitches, and then ranks these possibilities with a number of different metrics in order to select a final output. Scirea et al. (2015) employ Hidden Markov Models (HMMs) to generate rhythm based on the phonetics of the lyrics already written; then a harmonic structure is generated, followed by the generation of a melody matching the underlying harmony. Ackerman and Loker (2017) design a co-creative automatic songwriting system, ALYSIA, based on a machine learning model using random forests, which analyzes lyrics features to generate one note at a time for each syllable.

Conclusion and Future Work

In this paper, we propose a lyrics-conditional melody composition model which can generate a melody and the exact alignment between the generated melody and the given lyrics. We develop the melody composition model under the encoder-decoder framework; it consists of two RNN encoders, a lyrics encoder and a context melody encoder, and a hierarchical RNN decoder. The lyrics encoder encodes the syllables of the current lyrics into a sequence of hidden vectors. The context melody encoder leverages an attention mechanism to encode the context melody into a dynamic context vector. The decoder uses two layers to produce the musical notes and another layer to produce the alignment jointly. Experimental results on our dataset, which contains 18,451 Chinese pop songs, demonstrate that our model outperforms the baseline models. Furthermore, we leverage a singing voice synthesizer software to synthesize the "singing" of the lyrics and generated melodies for human evaluation. The results indicate that our generated melodies are more melodious and tuneful. For future work, we plan to incorporate the emotion and the style of the lyrics into melody composition.

References

Ackerman, M., and Loker, D. 2017. Algorithmic songwriting with ALYSIA. In International Conference on Evolutionary and Biologically Inspired Music and Art, 1–16. Springer.

Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473.

Brunner, G.; Konrad, A.; Wang, Y.; and Wattenhofer, R. 2018. MIDI-VAE: Modeling dynamics and instrumentation of music with applications to style transfer. In Proc. Int. Society for Music Information Retrieval Conf.

Chan, M.; Potter, J.; and Schubert, E. 2006. Improving algorithmic music composition with machine learning. In Proceedings of the 9th International Conference on Music Perception and Cognition, ICMPC.
Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014a. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1724–1734. Doha, Qatar: Association for Computational Linguistics.

Cho, K.; van Merrienboer, B.; Gülçehre, Ç.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014b. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, 1724–1734.

Chu, H.; Urtasun, R.; and Fidler, S. 2016. Song from PI: A musically plausible network for pop music generation. arXiv preprint arXiv:1611.03477.

Dong, H.-W.; Hsiao, W.-Y.; Yang, L.-C.; and Yang, Y.-H. 2017. MuseGAN: Symbolic-domain music generation and accompaniment with multi-track sequential generative adversarial networks. arXiv preprint arXiv:1709.06298.

Fukayama, S.; Nakatsuma, K.; Sako, S.; Nishimoto, T.; and Sagayama, S. 2010. Automatic song composition from the lyrics exploiting prosody of the Japanese language. In Proc. 7th Sound and Music Computing Conference (SMC), 299–302.

Kingma, D. P., and Ba, J. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR).

Lafferty, J.; McCallum, A.; and Pereira, F. C. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data.

Mogren, O. 2016. C-RNN-GAN: Continuous recurrent neural networks with adversarial training. arXiv preprint arXiv:1611.09904.

Monteith, K.; Martinez, T. R.; and Ventura, D. 2012. Automatic generation of melodic accompaniments for lyrics. In ICCC, 87–94.

Pachet, F., and Roy, P. 2011. Markov constraints: Steerable generation of Markov sequences. Constraints 16(2):148–172.

Pachet, F.; Papadopoulos, A.; and Roy, P. 2017. Sampling variations of sequences for structured music generation. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR'2017), Suzhou, China, 167–173.
Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, 311–318. Association for Computational Linguistics.

Pennington, J.; Socher, R.; and Manning, C. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543.

Roberts, A.; Engel, J.; Raffel, C.; Hawthorne, C.; and Eck, D. 2018. A hierarchical latent vector model for learning long-term structure in music. arXiv preprint arXiv:1803.05428.

Schatzmann, J.; Georgila, K.; and Young, S. 2005. Quantitative evaluation of user simulation techniques for spoken dialogue systems. In 6th SIGdial Workshop on Discourse and Dialogue.

Schuster, M., and Paliwal, K. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45(11):2673–2681.

Scirea, M.; Barros, G. A.; Shaker, N.; and Togelius, J. 2015. SMUG: Scientific music generator. In ICCC, 204–211.

Waite, E. 2016. Generating long-term structure in songs and stories. Magenta Blog.

Watanabe, K.; Matsubayashi, Y.; Fukayama, S.; Goto, M.; Inui, K.; and Nakano, T. 2018. A melody-conditioned lyrics language model. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 163–172.

Yang, L.-C.; Chou, S.-Y.; and Yang, Y.-H. 2017. MidiNet: A convolutional generative adversarial network for symbolic-domain music generation. arXiv preprint arXiv:1703.10847.

Zhang, X., and Lapata, M. 2014. Chinese poetry generation with recurrent neural networks. In EMNLP, 670–680.

Zhu, H.; Liu, Q.; Yuan, N. J.; Qin, C.; Li, J.; Zhang, K.; Zhou, G.; Wei, F.; Xu, Y.; and Chen, E. 2018. XiaoIce Band: A melody and arrangement generation framework for pop music. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2837–2846. ACM.