
DeepLyricist: Automatic Generation of Rap Lyrics using Sequence-to-Sequence Learning

Nils Hulzebosch Student ID: 10749411

A thesis presented for the degree of Bachelor Artificial Intelligence 18 ECTS

Faculty of Science University of Amsterdam Netherlands July 2, 2017

M.Sc. Mostafa Dehghani, University of Amsterdam (first supervisor)
Dr. Sander van Splunter, University of Amsterdam (second supervisor)

Abstract

This thesis demonstrates the use of sequence-to-sequence learning for the automatic generation of novel English rap lyrics, with the goal of generating lyrics with similar qualities to those of humans in terms of rhyme, idiom, structure, and novelty. The sequence-to-sequence model tries to learn the best parameters for generating target sequences given source sequences, and is trained on over 1.6 million source-target pairs, containing lyrics from 348 different rap artists. The automatic evaluations of each of the four characteristics show that idiom and structure have the best performance, whereas rhyme and novelty should be improved to reach a similar range to human lyrics. Future research could focus on using hierarchical models to improve the learning and generation of rhyme, structure, and possibly novelty, and implementing a sampled probability to increase the uniqueness of generated lyrics.

Contents

Acknowledgements

1 Introduction
2 Related work
3 Characteristics of rap lyrics
   3.1 Rhyme
   3.2 Rhyme schemes
   3.3 Song structure
   3.4 Idiom
   3.5 Novelty
4 Evaluation Methodology
   4.1 Rhyme
       4.1.1 Modified rhyme density
       4.1.2 End rhyme score
   4.2 Song structure
   4.3 Idiom
   4.4 Novelty
5 Model architecture
   5.1 Explanation of model
   5.2 Attention mechanism
   5.3 Modified loss function
6 Dataset
   6.1 Gathering, selection, and cleansing of data
   6.2 Addition of structure
   6.3 Preparing the data for the model
       6.3.1 Tokenization
       6.3.2 Source-target pairs
       6.3.3 Vocabulary
       6.3.4 Word embeddings
7 Experiments
   7.1 Training the model
   7.2 Generation of lyrics
8 Results
   8.1 Evaluation of rhyme density
   8.2 Evaluation of end rhyme score
   8.3 Evaluation of structure
   8.4 Evaluation of idiom
   8.5 Evaluation of novelty
   8.6 Unknown token
9 Conclusion
10 Discussion
   10.1 Interpretation of results
   10.2 Possible limitations
   10.3 Future work
Appendix A. List of 348 rappers and rap groups used for training data.
Appendix B. Example of generated lyrics.
Appendix C. Example of learned word embedding.

Acknowledgements

”Under pressure, I’ve been feeling under pressure.” - (Under Pressure)

The line above, which is cited from one of my favourite rappers, pretty much describes my state of mind when finishing this thesis. I have had a lot of setbacks during this project, but thanks to the support of several people, I have now finished it with a sense of satisfaction.

First of all, I would like to express my thanks to Mostafa Dehghani for giving me the opportunity to research this subject. His knowledge, enthusiasm, and willingness were the main factors in making this project possible. I enjoyed our collaboration!

Secondly, I would like to thank Sander van Splunter for the analytic guidance throughout this project, which helped me approach the scientific edge of this fairly subjective subject.

Furthermore, I would like to thank my parents and brother, along with my family and friends for their support, with a special thanks to my father, who helped me develop the criteria for the model and evaluation, using his knowledge of cognitive science and music.

Finally, I thank my girlfriend for her loving support during this intensive project.

Nils Hulzebosch

Amsterdam, July 2, 2017.

1 Introduction

The writing of rap lyrics is a complex process that requires creativity and skill to create an interesting story that involves stylistic elements such as rhyme. This thesis describes research into the automated generation of rap lyrics, which is motivated as follows: first, to explore the capacities of a deep-learning model with respect to the generation of completely new lyrics instead of the reproduction of existing ones. Secondly, to research how deep learning can be used to generate unique, creative lyrics instead of selecting or recombining existing ones. Lastly, to determine the best evaluation methods for generated rap lyrics.

Rap lyrics often have a clear structure, due to rhyme, along with structural parts such as intro, chorus, bridge, verse, and outro. One aim is to let the model learn these aspects to enable the generation of new lyrics with rhyme and structural parts. Furthermore, characteristics such as idiom (the words used in rap) and novelty (the uniqueness of generated lyrics) will be addressed.

In this research, the generation of rap lyrics is seen as a sequence-to-sequence problem, in which a new sequence is generated given a sequence of tokens (Sutskever et al., 2014). This approach is for example used in Neural Machine Translation, where a sentence in the source language maps to a sentence in the target language (Bahdanau et al., 2014; Cho et al., 2014). Furthermore, it is used in automatic summarisation, where each sequence containing a full article maps to a summary sequence (Chopra et al., 2016; Nallapati et al., 2016a,b; Rush et al., 2015). For the generation of rap lyrics, the input sequence is a line from a song and the output sequence is the following line. This way, the model could learn to predict an appropriate next line in terms of content and style.

The main research question is how a sequence-to-sequence model can be used to generate novel English rap lyrics with similar qualities to those of human rappers. To evaluate the novelty of the generated lyrics, two methods are used, inspired by the methods of Potash et al. (2016): the first is a measurement of the amount of uniquely generated sequences, duplicates within the generated dataset, and duplicates compared to the training dataset. The second is a comparison of cosine similarities of training and generated data. For the evaluation of similarity to human rap lyrics, three important properties of rap lyrics are examined: the first is rhyme, using a method similar to rhyme density (Malmi et al., 2015; Potash et al., 2016), along with the ’end rhyme score’, which is a newly developed method. Secondly, idiom is measured by comparing the amount of characteristic rap words in the training and generated datasets. Thirdly, structure is measured by calculating the amount of correctly generated structure tokens, which are added to the training dataset during the pre-processing of the data.

The hypothesis is that a sequence-to-sequence model could generate novel English rap lyrics with similar qualities to those of human rappers, when trained on a sufficiently large dataset enriched with structural information, including an attention mechanism to improve rhyme.

The proposed approach is a data-driven machine learning method that requires the construction of a large dataset for the training of the model. The resulting dataset contains over 2 million lines from 348 different rap artists. The sequence-to-sequence model is implemented in TensorFlow (Abadi et al., 2016). The model consists of two LSTM Recurrent Neural Networks (Hochreiter and Schmidhuber, 1997) for encoding and decoding. Furthermore, a word-level attention mechanism is employed during decoding (Bahdanau et al., 2014).

The contributions of this thesis are the following: 1) the demonstration of a sequence-to-sequence approach for the generation of novel rap lyrics, 2) the demonstration of computational, quantitative evaluation methods for rhyme, idiom, structure, and novelty, 3) the evaluation of lyrics generated by a sequence-to-sequence model with attentional LSTM Recurrent Neural Networks compared to those of human rappers, and 4) the construction of a large dataset of structured rap lyrics.

The thesis is divided into the following sections. The first is a discussion of relevant work. The second is an overview of the characteristics of rap lyrics, followed by a section that discusses the evaluation methodology. Next, the architecture of the sequence-to-sequence model is explained, along with the attention mechanism and a modification to the model. The fifth section discusses the acquisition of the dataset, including the gathering, pre-processing, and structuring of data. The sixth discusses the experiments, including training the model and the generation of lyrics. This is followed by an examination of the results of generation and evaluation. Finally, the research is concluded, along with a discussion regarding interpretations of results, possible limitations, and future work.

2 Related work

Recent research on the automated generation of rap lyrics has been done by Malmi et al. (2015), using a retrieval-based model. When this deep-learning model is trained on a large corpus of rap lyrics, it learns to automatically combine existing lines that are similar in content and style, resulting in a new verse that is coherent in content and style. However, this model has the following limitations: 1) the requirement of pre-defined functions for the measurement of content and style, 2) ambiguity about the uniqueness of generated lyrics, and 3) the limited granularity of the model’s generative capacities, which operate on the line level and not on the word level.

An alternative deep-learning architecture, a neural language model, is used by Potash et al. (2015) to learn a language model of rap lyrics. The model is based on LSTM Recurrent Neural Networks (Hochreiter and Schmidhuber, 1997), and works by predicting the next most likely token to follow the input sequence, based on the learned language model. In contrast to the model of Malmi et al. (2015), this model is capable of the automated learning of features and the word-for-word generation of lyrics, which results in an increase in the uniqueness of the generated lyrics. However, this model is incapable of recalling the words that it previously generated, which are simply added to the next input sequence, possibly resulting in a decrease of the generated lyrics’ qualities.

Therefore, our research aims at extending the work of Potash et al. (2015) by mapping the problem to a sequence-to-sequence model (Sutskever et al., 2014), which is aware of the generated words during the generation of one sequence. LSTM Recurrent Neural Networks are used for encoding the source sequence and decoding the following sequence. This research uses the ”vanilla” sequence-to-sequence model (Cho et al., 2014) and customizes it for rap lyrics, allowing it to capture the structure of songs by adding structural information to the data, as explained in section 6.

Sequence-to-sequence architectures have potential for improvement by using attentional mechanisms (Bahdanau et al., 2014). These enable the encoding of information by using a soft-search for relevant words instead of using a fixed-size matrix, limiting information loss (Bahdanau et al., 2014). An attention mechanism will be implemented in the model, since this might result in extra focus on rhyme words or other characteristic words during training, as explained in section 5.

In our research, a sequence-to-sequence model consisting of attentional LSTM Recurrent Neural Networks for the encoder and decoder is used for the automated word-for-word generation of rap lyrics, with the aim of creating unique lyrics that have similar substantive and stylistic qualities to those of human rappers. Using a sequence-to-sequence model instead of a neural language model (Potash et al., 2015) might result in lyrics with better qualities, since the architecture is more complex and the model is aware of previously generated words. However, due to the complexity of the model, more training data is required, which demanded the construction of a large dataset, as mentioned in the introduction and described in detail in section 6.

3 Characteristics of rap lyrics

This section describes important characteristics of rap lyrics. Within many genres, artists try to create lyrics with an interesting story that sound good in combination with instruments when sung, spoken, or rapped. Because this research focuses on the generation of rap lyrics, and not the way they sound when performed, some characteristics of rap in general are considered out of scope for this thesis, such as rhythm, vocal techniques, selection of beats, and performance over beats (Edwards, 2009). Furthermore, narration is considered an important aspect (Edwards, 2009). However, it is not evaluated because this requires human evaluation, which is not considered feasible within the scope of this thesis. Four important characteristics of rap lyrics are within the scope of this thesis: rhyme, structure, idiom, and novelty. Furthermore, rhyme schemes are explained, but not evaluated, due to their relative complexity.

3.1 Rhyme

In rap, the use of sound techniques such as rhyme is one of the most important characteristics, as discussed in (Edwards, 2009, p. 82): ”Rhyme is often thought to be the most important factor in rap writing - MCs often refer to rap lyrics as ’’. Along with rhythm, rhyme is what gives rap lyrics their musicality, because similar sounds being repeated are interesting to listen to.” Rhyme can be divided into five types, as described in How To Rap (Edwards, 2009, p. 81-91). These are perfect rhyme, assonance, consonance, alliteration, and compound (multi-syllable) rhyme. Each type is discussed below, illustrated with an example, most of which are sampled from the dataset used for this thesis.

Perfect rhyme The endings of two words share the same vowel and consonant sound. In the example below, ’gosh’ and ’Tosh’ both share an aa (vowel) and sh (consonant) sound. ”Oh-my-gosh! Oh my gosh! Eating Ital Stew like the one Peter Tosh” Figure 1: Example of perfect rhyme in - Scenario.

Assonance Two words share the same vowel sound, while the consonants are different. In the example below, ’same’ and ’pane’ share the ey (vowel) sound, while the consonant sounds are m and n. ”Take aim, here I came, I’m the same Back in ’86, I’da tag my name upon your window pane” Figure 2: Example of assonance in - 1597.

One specific type of assonance is the bending of words: one word is pronounced in such a way that the vowel has a similar sound to the vowel in the other word, while the vowels are in fact not equal. This is done perfectly by Eminem, as shown in the example below. Note that the ah sounds are slightly pronounced like an ih sound, creating the rhyme. Furthermore, an extra ih vowel sound is added to the last word. ”I put my orange Four Inch Door hinge in storage And ate porridge With George” Figure 3: Example of bending words in Eminem - Rhyme Time With Eminem1.

Consonance Two words share the same consonant sound, while the vowel sound is different.

”I’m trying to live it to the limit and love it a lot” Figure 4: Example of consonance in Jay Z - D’Evils2.

1https://genius.com/Eminem-rhyme-time-with-eminem-lyrics, retrieved at June 12, 2017. 2https://genius.com/posts/24-Rap-genius-university-rhyme-types, retrieved at June 11, 2017.

Alliteration Words begin with the same consonant sound, without the requirement for a shared vowel sound. In the example below, many of the (alliterating) words also share a vowel sound at the beginning or ending of the word. Note that many rhymes are within the line, not only at the ending, which is called ”internal rhyme”.

”I black out like Howard University When I verbally turn into Hercules Hurtin’ these Emcees early and urgently Urkin’ my nerves searching for certain Certified nouns and verbs On the verge of vigorous victory Kurupted” Figure 5: Example of alliteration in - Kurupted.

Compound / multi-syllable rhyme Two phrases have multiple rhyming syllables, which could be perfect rhyme or assonance. Usually, one or both of the rhyming phrases consist of multiple words.

”The pot doubles, now they really got troubles Madman never go pop! ... like snot bubbles”

Figure 6: Example of compound rhyme in Madvillain - All Caps.

3.2 Rhyme schemes

Rhyme schemes are a way of structuring rhyme words in a specific sequence in a song or verse. As described in How To Rap (Edwards, 2009, p. 95-110), there are three basic types of rhyme schemes. When combined with ’extra rhymes’, complex rhyme schemes can be constructed. Each of these types is discussed below. For the understanding of each type, the first four lines of Eminem - Lose Yourself (sampled from the training dataset) are used for visualization, as shown in Figures 7, 8, 9, and 10.

Couplet A couplet consists of two sequential lines that share a rhyme word at the end. Figure 1 is a good example of this.

Single-liner A single-liner consists of two or more rhyme words within one line, often involving the line’s last word. The first line of Figure 6 and the underlined words of Figure 7 are good examples of this.

”Yo, his palms are sweaty, knees weak, arms are heavy There’s vomit on his sweater already: Mom’s spaghetti He’s nervous, but on the surface he looks calm and ready To drop bombs, but he keeps on forgetting” Figure 7: Example of single-liner in Eminem - Lose Yourself.

Multi-liner A multi-liner consists of 3 or more sequential lines that share a rhyme word at the end. In Figure 5, the first three lines share a rhyming word at the end (”University”, ”Hercules” and ”urgently”). Furthermore, the last words of each line in Figure 8 are part of a multi-liner.

”Yo, his palms are sweaty, knees weak, arms are heavy There’s vomit on his sweater already: Mom’s spaghetti He’s nervous, but on the surface he looks calm and ready To drop bombs, but he keeps on forgetting” Figure 8: Example of multi-liner in Eminem - Lose Yourself.

Extra rhymes Extra rhymes are rhymes that are not necessarily part of the rhyme scheme, but could be added for extra flow or sound of the lyrics, as shown in Figure 10 and Figure 9.

”Yo, his palms are sweaty, knees weak, arms are heavy There’s vomit on his sweater already: Mom’s spaghetti He’s nervous, but on the surface he looks calm and ready To drop bombs, but he keeps on forgetting” Figure 9: Example of extra rhymes in Eminem - Lose Yourself.

Combinations / Complex rhyme schemes Using the three basic types in combination with extra rhymes, all kinds of complex rhyme schemes could be composed. Figure 10 combines the types above, where multi-liners are bold, single-liners are underlined, and extra rhymes are in small caps.

”Yo, his palms are sweaty, knees weak, arms are heavy There’s vomit on his sweater already: Mom’s spaghetti He’s nervous, but on the surface he looks calm and ready To drop bombs, but he keeps on forgetting” Figure 10: Example of complex rhyme scheme in Eminem - Lose Yourself.

3.3 Song structure

As with the lyrics of any genre, each part in a song can fulfill a specific goal, such as setting the scene, telling the story, or revealing the plot. In rap, a clear structure can often be recognized. This subsection discusses the most important structural parts for rap lyrics, which could occur in any combination and are not bound to a specific number of occurrences per song. Before discussing each part, there will be an explanation for the selection of these parts.

Genius.com3 distinguishes 17 structural parts for lyrics and melodies: introduction (intro), verse, refrain, pre-chorus (climb), chorus, post-chorus, hook, riff / bassline, turntablism, bridge, interlude, skit, collision, instrumental / solo, ad lib, segue, and outro4. After an examination of the data, five parts are chosen to use in the training data: intro, chorus, bridge, verse, and outro. This selection is motivated by the following four reasons: first, all structural elements that are solely used for melodies are ignored, since the model should focus on learning the structure of lyrics, not melodies. Secondly, not all structural elements are used correctly by the lyrics’ annotators. Using one element for several elements with a high similarity could improve the robustness of the data. Thirdly, some of these structural elements have a very low occurrence, making it too hard for the model to learn these. When possible, these are also mapped to one of the five structural elements. Lastly, the lower the amount of structural elements, the more training examples per element, which could result in a better representation of each element. Therefore, five elements might be an adequate amount to learn such representations, given the size of the training dataset.

Intro The introduction of the song. This could either be a few lines preceding the actual (rapping) verse, or it could be one or multiple people talking, with or without a beat. This structural part also represents bridges that occur at the very beginning of a song in the dataset.

Chorus The repetitive (catchy) vocal phrase of the song, which is often easy to remember. This structural element also represents refrains and hooks in the dataset.

Bridge A (catchy) vocal phrase that often, but not necessarily, precedes or follows the chorus. This structural element is used to also represent pre- and post-choruses, interludes and skits in the dataset. Furthermore, it is used for talking parts in the middle of a song.

Verse The actual rapping or singing in a song. This is often the longest or main part in a song that tells the greatest part of the story. This structural part is also used for all unlabelled parts after the addition of structure, as described in section 6.2.

Outro The ending of a song, which is often talking. This structural part also represents bridges that occur at the very ending of a song in the dataset.

3.4 Idiom

The third characteristic is idiom. As described by Escoto and Torrens (2012), important characteristics of the language used in rap lyrics, sometimes referred to as ’hip hop nation language’ (Alim, 2009), could be ’slang’ words, swear words, and alterations in grammar. Slang words are existing words that are twisted, or newly made-up words (Edwards, 2009, p. 95-110), which might become a normal part of the language. A few examples of slang words sampled from the training dataset are shown in Table 1, with corresponding

3An online platform for song lyrics and their meanings, with over 2 million registered contributors, editors, and musicians. 4https://genius.com/5067044, retrieved at June 15, 2017

meanings from Urban Dictionary, a website that provides definitions for many slang words5. Of course, not all rappers use these characteristics, but they are often present in the training dataset.

Table 1: Examples of slang words, sampled from the training dataset.

Slang word            Meaning
G                     ”gangster”
lil                   ”little”
tryna                 ”trying to”
breed/cheese/dough    ”money”
(having) beef         ”(having or wanting) a fight/conflict”
thang                 ”thing”
lab                   ”recording studio”

3.5 Novelty

Rap lyrics contain mostly new and sometimes existing expressions. Therefore, novelty is an important aspect of generated rap lyrics: the reproduction of complete existing lines should be avoided.

4 Evaluation Methodology

This section describes the methods that will be used to evaluate rhyme, structure, idiom, and novelty of the lyrics of training and generated datasets. To be able to do this, a sufficient amount of lyrics needs to be generated, which is described in further detail in section 7.

4.1 Rhyme

To evaluate the rhyming qualities, perfect and near rhyme (assonance) are measured, since these are the most obvious and most present forms of rhyme in rap music. This will be done using two methods: the first is a modified version of rhyme density, inspired by Malmi et al. (2015). The second is a qualitative measure that includes several parameters to calculate a score. However, due to its higher complexity, it only measures end rhyme. Both methods are discussed in the two subsections below, after the following explanation of phonetic transcription. For both methods, the phonetic transcription of each word in the dataset is required to be able to measure rhyme. This could be done using an existing library such as NLTK - Natural Language Toolkit (Bird, 2006). However, this library is incapable of generating phonetic transcriptions for unknown words. Since rap is characterized by slang words, and the training data is not free of spelling errors, many words would not get a phonetic transcription using such a library. Therefore, another approach is desirable. A database of phonetic transcriptions is created beforehand using the LOGIOS Lexicon

5http://www.urbandictionary.com, retrieved at June 22, 2017.

Tool 6. This tool provides phonetic transcriptions for almost all unknown words using a rule-based approach for transforming syllables. The resulting database consists of phonetic transcriptions for almost all words in the vocabulary of the training dataset, which is used for the following rhyme measurements.

4.1.1 Modified rhyme density

The rhyme density, as described by Malmi et al. (2015), is calculated by removing all consonants and looping over the vowel phonemes of each word in a sentence, along with the calculation of the maximum amount of matching (rhyming) phonemes in the proximity of the word. This is divided by the amount of phonemes, to get the average amount of rhyme. This method is modified in the following way. Instead of looping over all words in a sentence, the longest common substring of vowel phonemes between the current and previous sentence (substring A), and between the current and next sentence (substring B), is found. This is not to be confused with the longest common subsequence (Paterson and Dančík, 1994), which allows for ’interruptions’ within the string, as shown in Figure 11.

Figure 11: Examples of the longest common substring (a contiguous match between two strings) and the longest common subsequence (a match that may be interrupted within the strings).

The longest common substring calculated above (either substring A or B, whichever is longer) is divided by the amount of tokens in the current sentence. Using this method, the output score of one sentence represents the maximum amount of rhyming syllables divided by the amount of tokens. This will be averaged over all sentences to get the average proportion of most rhyming syllables in a sentence. While the rhyme density represents the ”average length of longest rhyme per word” (Malmi et al., 2015), the modified rhyme density represents the ”average length of longest rhyme per line”. Furthermore, this evaluation will be performed twice: the first time, equal (repeated) words will count as rhyme words. For example, when a complete line is reproduced, it will be labelled as a completely rhyming line. The second time, equal words will not count as rhyme: when the same word is used in two sequential lines, it will not count as rhyme. This results in two scores for the modified rhyme density, as shown in section 8.
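The sketch below (in Python) illustrates this computation for one line. It assumes a dictionary phonetic that maps each word to its list of phonemes (built with the LOGIOS tool described above) and a fixed set of ARPAbet-style vowel phonemes; both are assumptions of this illustration rather than details taken from the thesis.

VOWELS = {"aa", "ae", "ah", "ao", "aw", "ay", "eh", "er",
          "ey", "ih", "iy", "ow", "oy", "uh", "uw"}  # vowel phonemes (assumed inventory)

def vowel_phonemes(line, phonetic):
    """Concatenate the vowel phonemes of all words in a line into one sequence."""
    vowels = []
    for word in line.split():
        vowels.extend(p for p in phonetic.get(word, []) if p in VOWELS)
    return vowels

def longest_common_substring(a, b):
    """Length of the longest contiguous common run of two phoneme sequences."""
    best = 0
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
                best = max(best, table[i][j])
    return best

def modified_rhyme_density(prev_line, line, next_line, phonetic):
    """Longest rhyme shared with the previous or next line, divided by the token count."""
    current = vowel_phonemes(line, phonetic)
    substring_a = longest_common_substring(vowel_phonemes(prev_line, phonetic), current)
    substring_b = longest_common_substring(current, vowel_phonemes(next_line, phonetic))
    return max(substring_a, substring_b) / max(len(line.split()), 1)

The variant that disallows repeated words would additionally skip the vowels of words that also occur in the neighbouring line.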

6http://www.speech.cs.cmu.edu/tools/lextool.html, retrieved at June 20, 2017.

4.1.2 End rhyme score

The second rhyme measurement was developed because no similar method existed in the literature. It is a qualitative, rule-based measure for end rhyme, based on the last two words of a line, enabling the measurement of multi-syllable rhyme spanned over the two ending words. It is complementary to the previous method, since it focuses on the qualitative rhyme aspects of the last words. Nevertheless, it is not capable of measuring internal rhyme, due to the complexity of such an implementation. Therefore, the evaluation of all words in a line using this qualitative method is considered out of scope for this thesis. For the measurement of the end rhyme score the same phonetic database is used as described in this section’s introduction. The score is calculated based on six aspects: 1) rhyme quality (perfect or near rhyme), 2) amount of rhyming vowels, 3) whether one word is a subset of the other, 4) amount of repeating syllables at the end of a word, 5) whether the words are exactly equal, and 6) whether both words contain a number. These features result in the following algorithm for the calculation of the end rhyme score:

Algorithm 1: Calculates the end rhyme score of two words.

function endRhymeScore(w1, w2)
    Input: two nonempty strings w1 and w2, each containing the phonemes of a word
    Output: endRhymeScore
    rhymeVowels, perfectRhyme, isSubset, hasRepetition, bothNumber = calcRhymeProperties(w1, w2)
    if rhymeVowels == 0 then
        return 0
    else
        endRhymeScore = rhymeVowels
        if perfectRhyme == False then
            endRhymeScore = 0.5 * endRhymeScore
        end
        if isSubset == True then
            endRhymeScore = 0.5 * endRhymeScore
        end
        if hasRepetition == True ∨ w1 == w2 ∨ bothNumber then
            if endRhymeScore > 1 then
                endRhymeScore = 1
            end
        end
        return endRhymeScore
    end

The scoring algorithm is used as follows: first, remove punctuation and structural tokens at the end of the line. Take the last two words of the remaining line, or only one word if the line contains just one word. Secondly, retrieve the phonetic transcriptions of the words, and return 0 if these are not known. Thirdly, calculate the end rhyme score of each word with Algorithm 1, using the rule-based calcRhymeProperties algorithm.

The motivations for the calculation of the endRhymeScore are the following five: firstly, perfect rhymes are considered ’better’ than near rhymes and therefore score twice as high as near rhymes. Secondly, if one word is a subset of the other, the score is again halved. For example, the words ’advantage’ and ’disadvantage’, and the words ’regular’ and ’irregular’, have a high rhyme score (3 equal vowels) and are both perfect rhyme. However, the exact repetition of one word might be considered less good than the use of another word, resulting in penalizing the score for such occasions. Thirdly, if one of the words has repeating vowels or consonants at the ending, the final score is set to 1 if the current score is higher than 1. This is done to penalize words with a high rhyme score merely due to their spelling. For example, the following pairs of words, sampled from the training dataset, would get a high score due to their deviant spelling: ’soooooooo’ and ’noooooooo’ would have a score of 4 (due to ’oh oh oh oh’), and ’grrrrr’ and ’brrrrr’ would have a score of 5 (due to ’er er er er er’). Using the method above, each pair instead gets a score of 1, which matches intuition. The fourth motivation follows the ambiguity towards the acceptance of using the same word as rhyme word, as described in How To Rap (Edwards, 2009, p. 83-84). Due to this ambiguity, the choice has been made to penalize such words by setting the score to 1 when the score is higher than 1. Fifthly, when both words contain a number, the score is again set to 1 if the score is higher than 1. This is done because numbers are transformed separately to their phonetic transcription. For example, the number ’34000’ is transformed to ’th r iy f ow r z ih r ow z ih r ow z ih r ow’. When compared with another number (e.g. ’3000’), it would otherwise get a high rhyme score merely due to the ”rhyming” zeros.

Using the rhyme examples from section 3, the end rhyme scores are 1 for Figure 1 (’gosh’ and ’Tosh’), 0.5 for Figure 2 (’same’ and ’pane’), and 3 for Figure 6 (’got troubles’ and ’snot bubbles’).

4.2 Song structure

During the transformation of the data, three datasets are created that contain structural information, and two datasets are created without structural information, as explained in section 6. For the structural datasets, the beginning of each sentence is labelled with a structural token that contains information about the type of structural part, the number of the structural part, and the line number. For a preview, see Table 4. The evaluation of song structure is done by comparing the generated structural token to the target structural token. This comparison includes four aspects: 1) correct type, 2) correct type number, 3) correct line number, and 4) perfectly correct token. This results in an average amount of correctly generated tokens for each structural dataset, as examined in section 8. For the generation of a new song, the model would often have multiple valid options when generating a structure token. For example, when the current sequence is the fifth line of the second verse, the next line could be: 1) the sixth line of the second verse, 2) the first line of the third verse, 3) a bridge, 4) the chorus, or 5) the outro. However, due to the complexity of such an evaluation, the generated structural token is only compared to the token of the target sequence.
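The four aspects of this comparison could be computed as in the sketch below; the token format follows the level-3 notation of Table 4, and the helper names are assumptions made for this illustration.

def parse_token(token):
    """Split a level-3 structure token into (part type, part number, line number).
    Verses and bridges carry a part number ('<v 2 5>'); intros, choruses, and
    outros only carry a line number ('<c 3>')."""
    fields = token.strip("<>").split()
    if len(fields) == 3:
        return fields[0], fields[1], fields[2]
    if len(fields) == 2:
        return fields[0], None, fields[1]
    return fields[0], None, None

def compare_tokens(generated, target):
    """Return the four evaluated aspects as booleans."""
    g_type, g_part, g_line = parse_token(generated)
    t_type, t_part, t_line = parse_token(target)
    return {
        "correct type": g_type == t_type,
        "correct type number": g_part == t_part,
        "correct line number": g_line == t_line,
        "perfectly correct": generated == target,
    }

# Example: the model produced verse 2, line 5 where verse 2, line 6 was the target.
print(compare_tokens("<v 2 5>", "<v 2 6>"))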

4.3 Idiom

For the evaluation of idiom, 125 words that are characteristic for rap have been chosen, based on Edwards (2009); Escoto and Torrens (2012) and occurrences in the training data. Using these words, two aspects will be evaluated: 1) the proportion of sequences in the generated and training datasets containing at least one of these words, and 2) the proportion of characteristic words in the generated and training datasets. Because the generated datasets are much smaller than the training data, the vocabularies cannot be directly compared. Therefore, the proportion of each word is calculated by dividing the occurrences of that word by the occurrences of all words in the corresponding vocabulary. The results are shown in section 8, and the sketch below illustrates both measures.

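Both idiom measures can be computed as in the following sketch, assuming a set characteristic_words containing the 125 selected words and each dataset given as a list of tokenized, lower-cased lines; the second value aggregates the per-word proportions described above into a single ratio.

def idiom_scores(lines, characteristic_words):
    """Proportion of lines containing at least one characteristic word, and
    proportion of characteristic tokens among all tokens of the dataset."""
    lines_with_word = 0
    characteristic_tokens = 0
    total_tokens = 0
    for tokens in lines:
        if any(token in characteristic_words for token in tokens):
            lines_with_word += 1
        characteristic_tokens += sum(token in characteristic_words for token in tokens)
        total_tokens += len(tokens)
    return lines_with_word / len(lines), characteristic_tokens / total_tokens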
4.4 Novelty

The evaluation of novelty will be done using methods that are inspired by those of Potash et al. (2016). The following two methods will be used: the first method is a calculation of the amount of uniquely generated sequences, the amount of duplicates within the generated dataset, and the amount of duplicates compared to the training dataset. This gives an impression of how often the model generates novel lyrics, or how often it generates lyrics that already exist. The second method is the calculation of the average cosine similarity between the source and target pairs, and between the source and generated pairs, in the generated dataset. These values give an impression of the quality of the generated lyrics. For example, if the average cosine similarity is 0, the generated sequences have little to no relation to the source sequence, while if it is 1, the generated sequences always consist of the same words as the source sequence (not necessarily in the same order).
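One straightforward way to compute this cosine similarity is over bag-of-words count vectors of the two lines, as sketched below; this vector representation is an assumption of the sketch, since the thesis does not spell out the exact representation here.

import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between the bag-of-words count vectors of two token lists."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(counts_a[token] * counts_b[token] for token in counts_a)
    norm_a = math.sqrt(sum(value ** 2 for value in counts_a.values()))
    norm_b = math.sqrt(sum(value ** 2 for value in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Identical word content (in any order) gives 1.0; lines without shared words give 0.0.
print(cosine_similarity("i stay with it".split(), "with it i stay".split()))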

5 Model architecture

This section discusses the model architecture and how it operates, the attention mechanism implemented in the model, and the model’s modified loss function.

5.1 Explanation of model

The model is a ”vanilla” sequence-to-sequence model (Cho et al., 2014), consisting of a bi-directional LSTM RNN for the encoder, and an LSTM RNN for the decoder. The encoder has an attentional layer (Bahdanau et al., 2014), which can give more attention to specific tokens on the word-level, as described in further detail in the next subsection. A simplified architecture of the model is shown in Figure 12. The sequence-to-sequence model tries to map a source sequence (the input) to a target sequence (the desirable output), which works in the following way: the encoder transforms an input (sequence of words) into word embeddings, which are abstract representations of words. During one training step, the word embeddings of the source sequence are processed through the network of the encoder. These outputs are fed to the network of the decoder, which processes them and generates a new sequence of words. This generated sequence of words, based on the source sequence, is compared with the target sequence by calculating the loss, also known as the error. The loss is a measure for the quality of the generated output, and is calculated using the log perplexity. The higher the perplexity, the higher the loss, the worse the generated sequence, and vice versa. Using the calculated loss, the parameters of the word embeddings and networks are slightly updated in the ’right’ direction. During training, the loss decreases with each iteration, until it reaches convergence, meaning it cannot become any lower. To obtain a high performance, it is required to design a proper model architecture, define a proper loss function, and have a substantial amount of clean training data.

Figure 12: Simplified architecture of the sequence-to-sequence model, consisting of LSTM RNNs for the encoder and decoder.
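To make the encoder-decoder interaction more concrete, the sketch below shows one heavily simplified, attention-free training step in PyTorch-style Python. It is only an illustration of the general mechanism: the thesis implementation uses TensorFlow, a bi-directional encoder, and an attention mechanism, and all layer sizes below are placeholders apart from the 128-dimensional embeddings and the 50.000-word vocabulary mentioned in section 6.

import torch
import torch.nn as nn

VOCAB, EMB, HIDDEN = 50000, 128, 256  # vocabulary size, embedding and hidden dimensions

embedding = nn.Embedding(VOCAB, EMB)              # word embeddings
encoder = nn.LSTM(EMB, HIDDEN, batch_first=True)  # encodes the source line
decoder = nn.LSTM(EMB, HIDDEN, batch_first=True)  # decodes the following line
projection = nn.Linear(HIDDEN, VOCAB)             # maps decoder states to scores over the vocabulary

def training_step(source_ids, target_ids):
    # Encode the source sequence; its final state initialises the decoder.
    _, state = encoder(embedding(source_ids))
    # Teacher forcing: the decoder reads the target shifted by one position.
    decoder_out, _ = decoder(embedding(target_ids[:, :-1]), state)
    logits = projection(decoder_out)
    # Cross-entropy over the vocabulary corresponds to the log perplexity described above.
    return nn.functional.cross_entropy(logits.reshape(-1, VOCAB),
                                       target_ids[:, 1:].reshape(-1))

# One batch with a single source-target pair of word ids.
loss = training_step(torch.tensor([[12, 7, 3]]), torch.tensor([[1, 42, 9, 2]]))
loss.backward()  # the resulting gradients would then update all parameters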

5.2 Attention mechanism

In other linguistic generative tasks such as neural machine translation (Bahdanau et al., 2014), the attention mechanism is used with the intention to let it automatically learn to attend to specific words of the target sequence, given a source sequence. One example of this is shown in Figure 13 below. In this made-up example, words of the source sequence in Dutch are aligned to words of the target sequence in English, as might be learned by the model. The thickness of the purple lines shows how ’sure’ the model is of this alignment. However, in rap lyrics, there is often no one-to-one mapping between source and target sequences (only in repetitive parts such as choruses). Nevertheless, the use of the attention mechanism could still improve the model, for example for the learning and generation of internal rhyme, end rhyme, and coherent content. An example of this is shown in Figure 14, consisting of one line from a song as the source, and the following line as the target. In this example, the attention mechanism might be useful to learn to align end rhyme words, since these often occur at the ending of sequences.

17 Figure 13: Visualization of how the attention mechanism might work for a made-up example in the application of Neural Machine Translation.

Figure 14: Visualization of how the attention mechanism might work for lyrics of Madvillain - All Caps, sampled from the training dataset.

5.3 Modified loss function

The standard sequence-to-sequence model described above is trained on four different datasets (as described in section 7). Furthermore, an alternative model is trained on one of these datasets: the one containing only verses and no structural information. This alternative model has a modified loss function, which is calculated based on the loss of the generated sequence compared to the target sequence (as usual), and the loss compared to the source sequence. The latter is unconventional for tasks such as Neural Machine Translation, since comparing the source and generated sequence is not useful in that setting. However, for the generation of rap lyrics it might be desirable to compare the generated sequence to the source sequence, for the following two reasons: first, the content of the source and target sequence is often similar, so a loss that involves this aspect could improve the generation of coherent content. Secondly, it might improve the generation of rhyme, which could be measured by comparing source and generated sequence, but not by comparing target and generated sequence.
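Expressed as a formula, the modified loss could take a form like the following, where the relative weight λ of the source term is not specified in this thesis and is only an assumption of this sketch:

loss_modified = loss(generated, target) + λ · loss(generated, source)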

18 6 Dataset

This section discusses the gathering, selection and cleansing of data, addition of structure, and preparation of the data for the model.

6.1 Gathering, selection, and cleansing of data

From all rappers and rap groups, 348 (see Appendix A) have been selected for training, based on multiple websites that rank all-time top rap artists.7,8,9,10,11,12,13,14 Each website has different criteria for ranking these rappers, but the most important ones are: originality, (cultural) impact / influence, (lyrical) quality, consistency / longevity, and popularity / public opinion. It is assumed the lyrics of these artists are all written by humans. Based on this selection and availability, the lyrics of 47.740 songs have been scraped from the internet, during the period of May 2017. After the automatic removal of duplicates and manual removal of songs with only talking (intros, interludes, skits, and outros), 29.100 unique songs remained. This was followed by the automatic removal of all meta-data (song title, artist, typed by), unnecessary white space (new lines, tabs, and double spaces), and other irrelevant data such as remarks on the song.

6.2 Addition of structure

To enable the generation of songs with structure, a lot of structure should be present in the training data. The scraped dataset was very noisy and had a lot of different structural elements that each had many different spellings. First, all existing structural elements are mapped to one of the five parts described in section 3, by labeling the part with one of the structure tokens in Table 2 and removing the original structural elements. For verses, this was quite a challenge since they were often labelled as [Name of artist] or (Name of artist), again with numerous different spellings.

7https://www.thoughtco.com/greatest-rappers-of-all-time-2858004, retrieved at May 10, 2017. 8http://topxbestlist.com/best-rappers/, retrieved at May 10, 2017. 9http://uk.complex.com/music/2012/11/the-50-most-slept-on-rappers-of-all-time/, re- trieved at May 11, 2017. 10https://digitaldreamdoor.com/pages/best_rap-artists.html, retrieved at May 11, 2017. 11https://www.thetoptens.com/new-age-rappers-2000-2015/, retrieved at May 11, 2017. 12http://www.stopthebreaks.com/50-greatest-rappers-of-all-time/, retrieved at May 11, 2017 13http://www.ranker.com/list/best-underground-rappers/kevin, retrieved at May 12, 2017. 14https://www.thoughtco.com/best-rap-groups-2857977, retrieved at May 12, 2017.

19 Table 2: Twelve structure (meta-)tokens used in pre-processing the data. These are used to add structure tokens and to count occurrences of each part.

<start song>, <end song>
<start intro>, <end intro>
<start chorus>, <end chorus>
<start bridge>, <end bridge>
<start verse>, <end verse>
<start outro>, <end outro>

Next, several parts that were labelled with ”talking” were transformed into the corresponding tokens of ’intro’, ’bridge’, or ’outro’, using the position of a part in a song as the main heuristic. Moreover, several tokens for ’intro’, ’bridge’, ’chorus’, ’verse’ and ’outro’ were added manually, using human knowledge of the structure of rap songs. These tasks turned out to be impractical to do fully by hand, since the dataset contained over 2 million lines of text. To solve this, some algorithms were written to automatically add structure tokens to unlabelled parts. These are simplified, rule-based versions of the structure extraction algorithm from Mahedero et al. (2005). The first was to automatically detect choruses in songs. This was done by comparing the intersection of words of each pair of parts in a song. Whenever the intersection was at least 40 percent of one of the parts, and the lengths of the parts did not differ by more than a factor of 3, the parts were labelled as chorus. The remaining labels (corresponding to ’intro’, ’bridge’, ’verse’, and ’outro’) were added using the following heuristics: 1) position in the song, 2) amount of lines in the part, 3) label of the previous/following part (if present). A manual evaluation was performed to ensure the labeling process went correctly, which was the case in over 80% of 200 randomly sampled parts. Using the methods described above, all parts have been labelled, as shown in Table 3.

Table 3: Characteristics of the training dataset containing the lyrics of full songs, including structure tokens.

Type of structural part   Absolute amount   Proportion
Intro                     14.982            8.1%
Chorus                    60.115            32.6%
Bridge                    10.541            5.7%
Verse                     82.601            44.8%
Outro                     16.265            8.8%
Total                     184.503           100%
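A minimal sketch of the chorus-detection heuristic described above is shown below (in Python). The 40 percent overlap threshold and the factor-of-3 length constraint follow the description in this section; the exact comparison used in the original pre-processing scripts may differ.

def looks_like_chorus(part_a, part_b):
    """Heuristic: two parts of a song are (repetitions of) the chorus if their word
    intersection covers at least 40% of one of the parts and their lengths do not
    differ by more than a factor of 3."""
    words_a, words_b = set(part_a.split()), set(part_b.split())
    overlap = len(words_a & words_b)
    similar_enough = overlap >= 0.4 * min(len(words_a), len(words_b))
    lengths_comparable = max(len(words_a), len(words_b)) <= 3 * min(len(words_a), len(words_b))
    return similar_enough and lengths_comparable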

Since we use a sequence-to-sequence model that maps one source sequence to a target sequence, without knowledge of earlier sequences (lines), the use of only the structure tokens described in Table 2 would not be enough. When the model is in the middle of a part, it would have no information about the current part type. Therefore, a specific token that contains information about the part type and the position in the part/song is added to the beginning of each line. Moreover, three different levels of specificity will be used for these tokens, resulting in three datasets, as shown in Table 4. The first only contains information about the part type. The second and third contain information about the part type, part number, and line number, each with a different notation. The second notation is less specific because it consists of two tokens, while the third consists of one token.

Table 4: Three levels of specificity for the structure tokens, used in three different datasets.

Type     Level 1   Level 2       Level 3
Intro    <i>       <i><ln>       <i ln>
Chorus   <c>       <c><ln>       <c ln>
Bridge   <b>       <b bn><ln>    <b bn ln>
Verse    <v>       <v vn><ln>    <v vn ln>
Outro    <o>       <o><ln>       <o ln>

An example of the use of the tokens in Table 4 is shown in Table 5, with the first and second line from the second verse of Dr. Dre - What’s The Difference, sampled from the training data.

Table 5: Example of use of different structure tokens in each dataset. Lyrics are the first and second line of the second verse of Dr. Dre - What’s The Difference.

Level 1
<v> Yo, I stay with it, while you try to perpetrate and play with it
<v> Never knew about the next level until Dre did it

Level 2
<v 2><l 1> Yo, I stay with it, while you try to perpetrate and play with it
<v 2><l 2> Never knew about the next level until Dre did it

Level 3
<v 2 1> Yo, I stay with it, while you try to perpetrate and play with it
<v 2 2> Never knew about the next level until Dre did it

6.3 Preparing the data for the model

This subsection describes the steps that were taken to prepare the data for the model, including tokenization, transformation of lyrics to source-target pairs, the creation of word embeddings, and the creation of a vocabulary for each training dataset.

6.3.1 Tokenization

After the gathering, selection, cleansing and enrichment of the data, tokenization on the word-level is applied to all lyrics. This is followed by the lower-casing of all lyrics, which is done for unification, to decrease the amount of different spellings of the same words. For example, it is desired to consider the tokens ”Amsterdam” and ”amsterdam” as the same token, since this will result in more training examples and a smaller vocabulary, which is likely to result in a better learned model. As desired, these transformation steps do not influence the occurrences of ’slang’ words in the training dataset. However, words that are misspelled by mistake will be viewed as different tokens than their associated correctly spelled word. Nevertheless, since misspelled words have a low occurrence, they will simply be excluded from the vocabulary and training dataset by replacing them with the <UNK> token, representing ”unknown” words, as described in section 6.3.3.

6.3.2 Source-target pairs

This is followed by a transformation to source-target pairs, where the source (input) consists of one line in a song, and the target (output) consists of the following line, as discussed in section 5. Furthermore, each line gets a <s> and </s> token at its beginning and ending (encapsulating the structure token if present). Moreover, each sentence that is shorter than the maximum allowed sentence length is supplemented with <PAD> tokens, representing ”padding”. These are ignored by the model, but enable the processing of input lines with a variable length. The resulting datasets of pre-processed lyrics consist of 1.167.474 source-target pairs for the datasets that only contain lyrics from verses, and 1.645.822 for the ones that contain lyrics from whole songs along with added structure tokens. These are transformed to a binary format to be able to be used by the model.
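The construction of padded source-target pairs for one song could look roughly like the sketch below; MAX_LEN and the exact spelling of the special tokens are placeholders chosen for this illustration.

MAX_LEN = 30  # maximum allowed sentence length (placeholder value)

def wrap(line):
    """Add sentence-boundary tokens and pad the line to a fixed length."""
    tokens = ["<s>"] + line.split() + ["</s>"]
    return tokens + ["<PAD>"] * (MAX_LEN - len(tokens))

def song_to_pairs(lines):
    """Each line of a song becomes a source, with the following line as its target."""
    return [(wrap(source), wrap(target)) for source, target in zip(lines, lines[1:])]

pairs = song_to_pairs([
    "<v 2 1> Yo, I stay with it, while you try to perpetrate and play with it",
    "<v 2 2> Never knew about the next level until Dre did it",
])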

6.3.3 Vocabulary

Besides the source-target pair datasets, a vocabulary is created for each dataset by counting the occurrences of each word, sorting the words based on their occurrences (from high to low), and removing all words that are not within the 50.000 most occurring words, as inspired by Neural Machine Translation research (Bahdanau et al., 2014; Cho et al., 2014). Limiting the amount of words in the vocabulary has the following two reasons: 1) the softmax for choosing the word with the highest probability is computationally expensive, so increasing the vocabulary would make the program much slower, and 2) words with low occurrences are hard for the model to learn, because they appear in too few different contexts. This makes it almost impossible for the model to get a good representation of these words. Thus, all words that are not in the vocabulary, the ’out-of-vocabulary’ words, are replaced with an <UNK> (”unknown”) token. The resulting problem is that the model’s performance degrades when the input contains words that are not within the vocabulary. Moreover, the model could generate <UNK> tokens itself. Due to limited time, this problem will not be addressed during this research. Nevertheless, the degree of this problem will be examined in section 8, along with possible solutions in section 10. The resulting vocabulary is used by the model for managing the weights of each word during decoding, where words with a higher occurrence get a higher weight.
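A sketch of the vocabulary construction and the out-of-vocabulary replacement, assuming the corpus is available as a list of tokenized, lower-cased lines:

from collections import Counter

VOCAB_SIZE = 50000

def build_vocabulary(lines):
    """Keep only the 50.000 most frequent words of the corpus."""
    counts = Counter(token for tokens in lines for token in tokens)
    return {word for word, _ in counts.most_common(VOCAB_SIZE)}

def replace_oov(tokens, vocabulary):
    """Replace every out-of-vocabulary word with the <UNK> token."""
    return [token if token in vocabulary else "<UNK>" for token in tokens]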

6.3.4 Word embeddings

Before training, the model transforms every word into a unique id, followed by the transformation to a word embedding matrix, which is an abstract representation of all words. The size of the word embedding matrix is 128 (the dimensions for each word) by 50.000 (the amount of different words in the vocabulary). During each training step, the values of this matrix are updated based on the calculated loss, resulting in a step-wise improvement of the model’s abstract representation of each word. A visualization of a learned word embedding is shown in section 10.3.

7 Experiments

This section describes the experiments that are related to the training of the models using the datasets of the previous section, along with the generation of lyrics using the trained models.

7.1 Training the model

Each dataset (two without structure and three with structure) was divided into 90% training and 10% test data. Each training dataset was used to train the model on a GPU-accelerated computer, and training was terminated after slightly more than five days, due to time limitations. The losses of each model are shown in Figure 15. The loss for the three structural datasets is shown as one loss function, since these were almost equal. For applications such as Neural Machine Translation, a loss of 0 would be the optimal goal, because this would mean the generated translations are equal to the ones of human translators. However, for the generation of rap lyrics, this would not be desirable, since then the model would always generate exactly the target sequence, meaning it would not be able to generate novel or unique lyrics. The final losses might be optimized with more data, longer training, or an improved loss function, but should not be zero, since this would decrease the uniqueness of generated lyrics.

Figure 15: Decreasing loss (y-axis) against training iterations (x-axis, 0 to 5,000) for each model (Verses 1, Verses 2, and Structure). 1000 iterations correspond with 24 hours of training.

To reduce the run-time of training the model, the network’s parameters are not updated after each training example. Instead, the model receives a batch (a ’group’) of source-target pairs, and calculates the average loss of the batch. Next, the parameters of the encoder RNN, decoder RNN, attentional layer, and word embedding are updated using this average loss. During training, in time intervals of 30 minutes, the most recently trained model was used to generate new lyrics based on randomly selected source sequences in the test dataset (containing 10% of the lyrics), which are paired with the correct (target) sequence to be able to evaluate each aspect. For the final evaluation, the most recent datasets of generated lyrics were used, containing 2900 source-target-generated triplets per trained model.

7.2 Generation of lyrics

Besides the generation of lyrics during training, a few complete songs are generated after training, using the following method: first, a line from an intro or verse is randomly selected from the training dataset and fed to the decoder. The output of the decoder is used as the ’real’ first line of the new song. For models that were trained on data with structure, the token of the first generated sequence is replaced by <i>, <v>, <i 1>, <v 1 1>, <i><l 1>, or <v 1><l 1>, corresponding to the trained model and the type of generated token (intro or verse). Using this method, the model has a ’starting point’ for a new song, but does not literally use any line from the training datasets. Next, each line is generated by taking the previous output and feeding it to the decoder. During the generation of a song, the structure tokens are not modified. Whenever the model generates an incorrect token, it is still used for the generation of the next line. Since the datasets do not contain tokens for the start and ending of a song, the model

will not generate such tokens. Therefore, the ending is manually determined based on the amount of generated lines and the generated structure tokens. The generated songs are not used for evaluation, since the amount of lyrics is very small. However, one example of a complete song can be examined in Appendix B.
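The song-generation loop can be summarized as in the sketch below; here generate_line stands for one decoder pass of a trained model and is assumed, and the stopping rule is a simplification of the manual procedure described above.

def generate_song(seed_line, generate_line, max_lines=40):
    """Generate a song line by line: each generated line becomes the input for the next one."""
    # The seed line only serves as a starting point and is not part of the new song itself.
    current = generate_line(seed_line)
    song = [current]
    for _ in range(max_lines - 1):
        current = generate_line(current)  # feed the previous output back into the decoder
        song.append(current)
        if current.startswith("<o"):      # assumed heuristic: stop once an outro token appears
            break
    return song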

8 Results

This section discusses the results of the evaluation of rhyme, structure, idiom, and novelty, using the methods described in section 4. Furthermore, the problem of generating the ”unknown” token is briefly discussed. For all tables in this section, the following abbreviations are used to refer to a specific dataset.

T V The training dataset with only verses, containing 1.167.474 source-target pairs.

T S The three training datasets containing complete songs with structural tokens. Each has 1.645.822 source-target pairs. Since these datasets contain the same lyrics (but different structural tokens), separate evaluation for rhyme, idiom, and novelty is not needed. Furthermore, these training datasets are not used for the evaluation of structure.

Gen V1 The dataset generated during the last steps of training on T V (with the ’normal’ model), containing 2900 source-target-decoder triplets.

Gen V2 The dataset generated during the last steps of training on T V (with the modified model, using two loss functions, one for the source and one for the target), containing 2900 source-target-decoder triplets.

Gen S1 The dataset generated using the last checkpoint of the model trained on the T S dataset that contains the ’level 1’ (least specific) structural tokens from Table 4. The resulting dataset has 2900 source-target-decoder triplets.

Gen S2 The dataset generated using the last checkpoint of the model trained on the T S dataset that contains the ’level 2’ (more specific) structural tokens from Table 4. The resulting dataset has 2900 source-target-decoder triplets.

Gen S3 The dataset generated using the last checkpoint of the model trained on the T S dataset that contains the ’level 3’ (most specific) structural tokens from Table 4, containing 2900 source-target-decoder triplets.

Gen S Refers to the combination of the three datasets above (Gen S1, Gen S2, and Gen S3). For the evaluation of rhyme, idiom, and novelty, the structural tokens are irrelevant. Therefore, the tables related to these evaluations will show Gen S instead of the three datasets above, and the corresponding scores are the averages of the three (the averages over 8700 source-target-decoder triplets).

8.1 Evaluation of rhyme density

Table 6 and Table 7 show the results of evaluating the rhyme density for all training and generated datasets. The scores of Table 6 are calculated by allowing the same words to rhyme, while the scores of Table 7 are calculated by disallowing this. Each table shows the following three aspects: 1) average amount of rhyming vowels in a line, 2) average amount of total tokens in a line, and 3) average rhyme density, using the methods described in section 4.

Table 6: Average (avg) rhymed vowels, total tokens, and resulting rhyme density of training and generated datasets. These results are based on allowing the same words to rhyme.

                      T V      T S      Gen V1   Gen V2   Gen S
Avg rhymed vowels     2.15     2.44     1.72     5.14     1.58
Avg total words       9.85     10.45    8.54     10.34    9.84
Avg rhyme density     0.211    0.252    0.219    0.501    0.168

Table 7: Average (avg) rhymed vowels, total tokens, and resulting rhyme density of training and generated datasets. These results are based on disallowing the same words to rhyme.

                      T V      T S      Gen V1   Gen V2   Gen S
Avg rhymed vowels     1.98     1.85     1.42     0.56     1.35
Avg total tokens      9.85     10.45    8.54     10.34    9.84
Avg rhyme density     0.204    0.190    0.183    0.061    0.145

The tables above lead to the following observations:

• All datasets have higher average rhyme densities when allowing the same words to rhyme.

• Model Gen V2 has a higher rhyme density than the training datasets (T V and T S) when allowing repetition, but a lower score when disallowing this.

• The structural training datasets (T S) score higher than the datasets with only verses (T V) when allowing the same words to rhyme, but lower when disallowing this.
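For illustration, the following is a heavily simplified sketch of a per-line rhyme-density measure of this kind. It works on spelled vowels rather than phonetic transcriptions and treats the rhymed vowels of a word as its longest common vowel suffix with any other word in the line; the modified rhyme density of section 4.1.1 differs in its details, so the numbers produced by this sketch will not match the tables above exactly.

VOWELS = set("aeiou")

def vowel_sequence(word):
    """Crude stand-in for a phonetic transcription: the vowels in the spelling."""
    return [c for c in word.lower() if c in VOWELS]

def rhymed_vowels(tokens, allow_same_word=True):
    """Sum over all words of the longest common vowel suffix with another word."""
    total = 0
    for i, word in enumerate(tokens):
        best = 0
        for j, other in enumerate(tokens):
            if i == j or (not allow_same_word and word == other):
                continue
            v, w = vowel_sequence(word), vowel_sequence(other)
            k = 0
            while k < min(len(v), len(w)) and v[-1 - k] == w[-1 - k]:
                k += 1
            best = max(best, k)
        total += best
    return total

def rhyme_density(tokens, allow_same_word=True):
    """Rhymed vowels in a line divided by the number of tokens in that line."""
    return rhymed_vowels(tokens, allow_same_word) / len(tokens) if tokens else 0.0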

8.2 Evaluation of end rhyme score

Table 8 shows the average and maximum end rhyme score for each dataset, using the methods described in section 4. Figure 16 then shows the distribution of end rhyme scores as a bar plot. The height of each bar represents the percentage of lines with an end rhyme score within the corresponding interval, with the first interval representing a score of exactly 0.

Table 8: Average (avg) and maximum (max) end rhyme score of training and generated datasets, by comparing the last 2 words of each line with the last 2 words of the previous and next line, and choosing the highest score, as described in section 4.

                        T V      T S      Gen V1   Gen V2   Gen S
Avg end rhyme score     0.392    0.367    0.024    0.399    0.030
Max end rhyme score     4.5      4.5      1.0      2.0      1.5

[Figure 16: bar plot. The x-axis shows three end rhyme score intervals (0, 0.01-1, and above 1); the y-axis shows the percentage of lines; bars are grouped per dataset (Train V, Train S, Gen V1, Gen V2, and Gen S).]

Figure 16: Distribution of end rhyme scores. Each bar represents the percentage of lines in a dataset whose end rhyme score falls within the corresponding interval.

The table and figure above lead to the following observations:

• The average end rhyme score of the training data with only verses (T V) is higher than that of the data with structure (T S), but they have the same maximum score.

• The generated dataset from the model with modified loss function (Gen V2) has the highest average end rhyme score (but not maximum).

• The other generated datasets (Gen V1 and Gen S) have an average end rhyme score that is more than 10 times lower than those of the training datasets.

• The generated datasets almost never contain end rhymes with a score higher than 1, while Gen V2 generates more lines with a score between 0 and 1 than the training datasets.
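As an illustration of the comparison step used here, the sketch below computes an end rhyme score per line by comparing its last two words with those of the previous and next line and keeping the higher of the two scores. The per-pair scoring function of section 4.1.2 is not reproduced; it is passed in as score_fn (a hypothetical interface).

def end_rhyme_scores(lines, score_fn):
    """lines: list of token lists; score_fn(words_a, words_b): rhyme score of two
    word pairs, as defined in section 4.1.2 (supplied by the caller)."""
    def last_two(tokens):
        return tokens[-2:]

    scores = []
    for i, tokens in enumerate(lines):
        neighbours = []
        if i > 0:
            neighbours.append(lines[i - 1])
        if i + 1 < len(lines):
            neighbours.append(lines[i + 1])
        best = max((score_fn(last_two(tokens), last_two(n)) for n in neighbours),
                   default=0.0)
        scores.append(best)
    return scores

def average_end_rhyme_score(lines, score_fn):
    scores = end_rhyme_scores(lines, score_fn)
    return sum(scores) / len(scores) if scores else 0.0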

8.3 Evaluation of structure

Table 9 shows the accuracy of correctly generated structural tokens for each generated dataset with structure (Gen S1, Gen S2, and Gen S3), as described in section 4. Furthermore, based on observing 600 generated structure tokens, two types of errors can be distinguished: the first type is a generated token that is not equal to the target token, but is still logically valid for the generation of a new song. The second type is a generated token that is completely invalid for the generation of a new song. Examples of both types, sampled from different datasets, are shown in Table 10 (type 1) and Table 11 (type 2).

Table 9: Accuracy of correctly generated structural token for each generated dataset with structure (Gen S1, Gen S2, and Gen S3).

                        Gen S1   Gen S2   Gen S3
Correct Type            93.6%    83.7%    80.7%
Correct Type Number     n/a      90.7%    89.1%
Correct Line Number     n/a      80.9%    58.4%
Perfectly correct       93.6%    76.2%    71.3%

Table 10: Examples of type 1 error: the generated structural token is not equal to the target sequence token, but is still logically valid.

Source            Target                     Generated         Sampled from
< i >             < c > or < v >             < i >             Gen S1
< v >             < b > or < c > or < o >    < v >             Gen S1
< i >< l 6 >      < i >< l 7 >               < c >< l 1 >      Gen S2
< c >< l 1 >      < v 2 >< l 1 >             < c >< l 2 >      Gen S2
< v 3 21 >        < v 3 22 >                 < c 1 >           Gen S3
< c 8 >           < v 2 1 >                  < c 9 >           Gen S3

Table 11: Examples of type 2 error: the generated structural token is logically invalid.

Source             Target             Generated          Sampled from
< b >              < b > or < v >     < i >              Gen S1
< b >              < b >              no token           Gen S1
< v 1 >< l 60 >    < v 1 >< l 61 >    < v 1 >< l 39 >    Gen S2
< v 3 >< l 24 >    < v 3 >< l 25 >    < v 3 >< l 18 >    Gen S2
< b 3 2 >          < v 4 1 >          < v 11 2 >         Gen S3
< o 18 >           < o 19 >           < o 4 >            Gen S3

The tables above lead to the following observations:

• The accuracies decrease as the specificity of the structure tokens increases.

• The prediction of the line number has the lowest accuracy on average.

• The two error types occur about equally often (roughly 50/50), as measured with a manual evaluation of 600 source-target-generated triplets (200 per dataset).

• Incorrectly generated structure tokens often occur at relatively high type or line numbers.
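The per-field accuracies in Table 9 can be computed along the lines of the sketch below, which parses level-3 tokens of the form < v 3 21 > (part type, type number, line number) as seen in the examples above. This parsing is an assumption for illustration; the level-1 and level-2 token layouts are handled analogously but are not reproduced here.

import re

def parse_level3(token):
    """Parse a level-3 structure token such as '< v 3 21 >' into its fields."""
    fields = re.findall(r"<([^>]*)>", token)
    parts = fields[0].split() if fields else []
    # Missing fields (e.g. no type or line number) are returned as None.
    return tuple(parts[i] if i < len(parts) else None for i in range(3))

def structure_accuracy(pairs):
    """pairs: list of (target_token, generated_token) strings."""
    counts = {"type": 0, "type number": 0, "line number": 0, "perfect": 0}
    for target, generated in pairs:
        t, g = parse_level3(target), parse_level3(generated)
        counts["type"] += t[0] == g[0]
        counts["type number"] += t[1] == g[1]
        counts["line number"] += t[2] == g[2]
        counts["perfect"] += t == g
    n = len(pairs) or 1
    return {field: count / n for field, count in counts.items()}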

8.4 Evaluation of idiom

Table 12 shows the proportion of sequences containing at least one of the 125 characteristic words, as described in section 4. Furthermore, Table 13 shows the proportion of characteristic words in the datasets used for evaluation.

Table 12: Proportion of sequences that contain at least one characteristic word for each dataset.

              T V      T S      Gen V1   Gen V2   Gen S
Percentage    84.7%    87.2%    65.1%    83.6%    72.2%

Table 13: Proportion of characteristic words in each dataset.

                 Gen V1   Gen V2   Gen S1   Gen S2   Gen S3
Source/Target    5.25%    5.19%    5.17%    4.68%    5.07%
Generated        2.61%    5.30%    2.80%    2.89%    2.55%

The tables above lead to the following observations:

• The proportions of sequences containing a characteristic word in the generated datasets are broadly similar to those of the training datasets.

• Gen V2 has a percentage in a similar range to the training datasets.

• The proportions of characteristic words for the source-target pairs in the datasets used for evaluation are in a similar range to those of the full training datasets, which are 4.72% for T V and 5.07% for T S.
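Both idiom measures can be computed with a straightforward count, as sketched below. Here characteristic_words stands for the list of 125 words from section 4, which is not reproduced in this sketch.

def idiom_statistics(sequences, characteristic_words):
    """sequences: list of token lists. Returns the share of sequences containing at
    least one characteristic word (as in Table 12) and the share of tokens that
    are characteristic words (as in Table 13)."""
    vocab = set(characteristic_words)
    with_idiom = 0
    characteristic_tokens = 0
    total_tokens = 0
    for tokens in sequences:
        hits = sum(1 for t in tokens if t in vocab)
        with_idiom += hits > 0
        characteristic_tokens += hits
        total_tokens += len(tokens)
    return {
        "sequences with a characteristic word": with_idiom / max(len(sequences), 1),
        "proportion of characteristic tokens": characteristic_tokens / max(total_tokens, 1),
    }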

8.5 Evaluation of novelty

Table 14 shows the division into uniquely generated sequences, sequences that were generated before by the same trained model (within the 2900 source-target-decoder triplets), and generated sequences that occur verbatim in the training data, as described in section 4. Furthermore, Table 15 shows some examples of generated sequences that occur verbatim in the training dataset. Lastly, Table 16 shows the average, minimum, and maximum cosine-similarities of the source-target and source-generated pairs, as described in section 4.

Table 14: Proportion of unique sequences, duplicates within the generated data, and duplicates compared to the training data in the generated datasets.

                   Gen V1   Gen V2   Gen S1   Gen S2   Gen S3
Unique gen.        28.3%    99.3%    39.3%    82.1%    93.1%
Duplicate gen.     71.7%    0.7%     60.7%    17.9%    6.9%
Duplicate train    0.5%     1.5%     15.2%    20.3%    17.4%

Table 15: 11 examples of frequently generated sequences, which are exactly present in the training datasets.

Generated sequence                   Sampled from
” do n’t be sweatin ”                Gen V1
” i do n’t know you ”                Gen V1
” uh ! ”                             Gen V2
” bumrush ”                          Gen V2
” i got the ganja ”                  Gen S1
” i ai n’t ”                         Gen S1
” i ’m the beneficiary ”             Gen S2
” i ’m telepathic ”                  Gen S2
” i ’m leanin like a sixty-fo’ ”     Gen S3
” ( ) ”                              Gen S3
” yeah ” or ” ( yeah ) ”             All datasets

Table 16: Average (avg), minimum (min), and maximum (max) cosine-similarities of each generated dataset, computed by comparing each source-target and source-generated pair.

                         Gen V1   Gen V2   Gen S1   Gen S2   Gen S3
Avg Source-Target        0.127    0.126    0.145    0.128    0.141
Avg Source-Generated     0.123    0.697    0.117    0.098    0.097
Min Source-Target        0.0      0.0      0.0      0.0      0.0
Min Source-Generated     0.0      0.0      0.0      0.0      0.0
Max Source-Target        1.0      1.0      1.0      1.0      1.0
Max Source-Generated     0.72     1.0      0.873    1.0      1.0

The tables above lead to the following observations:

• The lyrics of the model with the modified loss function (Gen V2) have the highest uniqueness.

• Because the uniqueness of Gen S1, Gen S2, and Gen S3 differs more than expected, they are shown separately instead of only as their average. They have a higher uniqueness within the generated dataset than the normal model (Gen V1), but more duplicates compared to the training dataset.

• As the specificity of the structural model increases, the uniqueness of the generated lyrics within the generated dataset increases, while the number of duplicates compared to the training data remains in a similar range.

• The cosine-similarities of all models except the one with the modified loss function are in a similar range to those of the training data, with the verses-only model approaching them best.

• The greater part of the observed duplicates compared to the training dataset consists of short phrases (a few tokens); only a few are full lines.
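The uniqueness and duplicate counts in Table 14, and the cosine-similarities in Table 16, can be computed along the lines of the sketch below. It assumes each sequence is represented by the average of its 128-dimensional word embeddings; this representation is an assumption for illustration, as section 4 specifies the one actually used.

import numpy as np

def sequence_vector(tokens, embeddings, dim=128):
    """Average word-embedding representation of a sequence (illustrative assumption)."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

def cosine_similarity(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

def novelty_statistics(triplets, training_lines, embeddings):
    """triplets: list of (source, target, generated) token lists;
    training_lines: set of training sequences as whitespace-joined strings."""
    seen = set()
    unique = dup_generated = dup_train = 0
    sims_st, sims_sg = [], []
    for source, target, generated in triplets:
        gen_str = " ".join(generated)
        if gen_str in seen:
            dup_generated += 1
        else:
            unique += 1
            seen.add(gen_str)
        dup_train += gen_str in training_lines
        src_vec = sequence_vector(source, embeddings)
        sims_st.append(cosine_similarity(src_vec, sequence_vector(target, embeddings)))
        sims_sg.append(cosine_similarity(src_vec, sequence_vector(generated, embeddings)))
    n = len(triplets) or 1
    return {"unique": unique / n,
            "duplicate (generated)": dup_generated / n,
            "duplicate (training)": dup_train / n,
            "avg source-target similarity": sum(sims_st) / n,
            "avg source-generated similarity": sum(sims_sg) / n}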

8.6 Unknown token

As explained in section 6, < UNK > (”unknown”) tokens are used for all out-of-vocabulary words, which means the model may also generate these tokens during the generation of lyrics. As observed during the evaluation of each characteristic, the ”unknown” token was indeed generated in some sequences. Therefore, the proportion of occurrences of this token in each generated dataset is shown in Table 17. A possible solution to this problem is discussed in section 10. Lastly, one example can be seen in Appendix B, where the ”unknown” token is marked in red.

Table 17: Proportion of ”unknown” tokens generated in each dataset used for evaluation.

Gen V1   Gen V2   Gen S1   Gen S2   Gen S3
0.96%    0.40%    2.29%    0.72%    0.54%

9 Conclusion

This thesis has described research on the automated generation of novel English rap lyrics using sequence-to-sequence learning. Based on the characteristics of rap lyrics, and the research goal of generating lyrics with similar qualities to those of human rappers, the evaluation has focused on rhyme, structure, idiom, and novelty. The most important results for each of these are summarized below.

Rhyme The models are able to generate some internal and end rhyme, but their scores are not within a similar range to those of human lyrics, except for the lyrics of Gen V2: these have higher scores for rhyme density (when repetition is allowed) and for the average end rhyme score.

Structure The models are able to learn and generate structure tokens, with a decrease in accuracy as the specificity of the structure tokens increases. As shown in Appendix B, the generated structure tokens can be used to interpret the structure of a generated song.

Idiom The models are able to learn and generate words that are considered characteristic of rap lyrics in this research, at about half the proportion found in the training datasets. However, the lyrics from Gen V2 contain roughly the same proportion of characteristic words as the training datasets.

Novelty The models are able to generate some unique lyrics, but often generate lyrics that they have generated before, or lyrics that are identical to lines from the training datasets. However, the average cosine-similarity of the source-generated pairs is in a similar range to that of the source-target pairs, except for model Gen V2, which shows a high cosine-similarity (and high uniqueness).

Our hypothesis is partially confirmed: the sequence-to-sequence model shows capabilities of generating rap lyrics with similar qualities to those of human rappers, with the best results for idiom and structure. Nevertheless, rhyme and novelty might still need improvement to come within a similar range of human lyrics, with a focus on generating longer (end) rhymes and fewer lyrics that the model has already generated. In the discussion, several improvements to the model are mentioned.

10 Discussion

This section discusses the interpretation of results, possible limitations of the research, and suggestions for future work.

10.1 Interpretation of results

This subsection discusses several interpretations of the results of each evaluation, including rhyme, structure, idiom, and novelty.

Rhyme The high rhyme density scores might be caused by the flexibility of the calculation: it does not consider token boundaries or the consonants in between vowels. This was one of the reasons an additional evaluation method was proposed, and this method indeed shows a greater difference between training and generated data. The lack of end rhyme may be due to the fact that the model only sees one source sequence and one target sequence, which often do not contain rhyme. Therefore, it is less likely that the model learns parameters for rhyme. Furthermore, it has no formal criterion for rhyme during training, since none was implemented in the loss function.

Structure As expected, the simpler structure tokens have a higher accuracy. This could be explained by the fact that these simpler structure tokens have more training examples per token, resulting in a better understanding of each token. As the specificity increases, the number of training examples per token decreases, making it harder to learn. However, for the generation of a new song, the use of the simplest structure tokens (earlier referred to as ’level 1’) is not desirable with the current architecture, since they exclude the type and line number, which are essential for the generation of a full song (unless randomness is added to the model).

Idiom The fact that model Gen V2 has a high performance might be explained by the fact that this model often copies words from the source sequence, resulting in a high occurrence of these characteristic words. Furthermore, the learning and generation of characteristic words might be considered the easiest of the four tasks, given this architecture and data.

Novelty The most notable value is the high uniqueness of Gen V2. However, this is likely due to the fact that this model often generates a subset of the source sequence, which is supported by the very high average cosine-similarity for this model. An explanation for this could be that the loss was influenced too much by the source sequence, and too little by the target sequence. It could be the case that during training, the model learns to copy parts of the source sequence, because it gets a low loss when it does so, resulting in a model that is good at copying source sequences. This is also supported by the graph of the loss based on the source sequence, which has a much steeper decline than the one based on the target sequence, and ends at a loss around 1, while the loss on the target ends around 4. Another notable result is the increase in uniqueness as the specificity of the structure tokens increases. While this could be due to randomness in the relatively small datasets, it could also be explained by the fact that a structure token might have a large impact on the sequence that will be generated. This is supported by several manual experiments, where a source sequence is given to the model three times, each with a different structure token (from ’level 1’, ’level 2’, or ’level 3’). As observed, the generated sequences are often different, while the lyrics (excluding the structure tokens) of the source sequence are equal. This might also explain why the lyrics generated by the models trained on structural datasets (Gen S1, Gen S2, and Gen S3) have a higher uniqueness than the ones from the normal model with only verses (Gen V1). Additionally, this might explain why these models have more duplicates compared to the training dataset (17.6% averaged over the three), while the models with only verses have very few duplicates (1% averaged over the two).

10.2 Possible limitations

This subsection discusses the main limitations of the research, including the relatively small training dataset, the short training time, the small generated datasets, and the problem of ”unknown” tokens.

Firstly, the size of the dataset was relatively small, given the complexity of the task. In research on Neural Machine Translation, similar models are trained with datasets of 12 million sentences (Sutskever et al., 2014) or 348 million words (Bahdanau et al., 2014; Cho et al., 2014). However, due to the difficulty of acquiring a large dataset containing correctly notated rap lyrics, this was not feasible within the scope of this thesis.

Another limitation was the short training time of each model. As shown in Figure 15, the loss of each model had not yet converged. If each model were trained longer, the loss might become lower, which would likely result in the generation of lyrics that score better in each evaluation.

Thirdly, the size of the generated datasets was relatively small compared to the size of the training datasets. For a more robust evaluation, larger datasets of generated lyrics are desirable to reduce the impact of randomness. Furthermore, it is preferable to evaluate completely generated songs, instead of source-target-generated triplets.

Lastly, as described in section 8, the model often generates the < UNK > token, which is not addressed in this thesis. In Neural Machine Translation, the problem can be addressed by aligning tokens from the source sequence to the generated sequence (Hashimoto et al., 2016; Jean et al., 2015; Luong et al., 2014). However, in the generation of rap lyrics, there is often no one-to-one alignment (only rarely, in the case of a repeated sequence). Therefore, other approaches such as replacing the ”unknown” tokens with similar in-vocabulary tokens based on a trained similarity model (Li et al., 2016) might achieve better results, and could be tested in follow-up research (a minimal sketch of one such replacement strategy is given below).
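To illustrate the kind of post-processing meant here, the sketch below replaces each generated ”unknown” token with the in-vocabulary word whose embedding is closest to the average embedding of the surrounding context words. This is a deliberately simple stand-in, not the method of Li et al. (2016), and the exact spelling of the unknown token depends on the preprocessing.

import numpy as np

def replace_unknown_tokens(tokens, embeddings, unk_token="<UNK>", window=3):
    """tokens: list of generated tokens; embeddings: dict mapping word -> np.array."""
    vocab = [w for w in embeddings if w != unk_token]
    matrix = np.stack([embeddings[w] for w in vocab])
    # Pre-normalize rows so a dot product equals the cosine similarity.
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)

    result = list(tokens)
    for i, token in enumerate(tokens):
        if token != unk_token:
            continue
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        vectors = [embeddings[w] for w in context if w in embeddings]
        if not vectors:
            continue  # no usable context: leave the unknown token in place
        query = np.mean(vectors, axis=0)
        norm = np.linalg.norm(query)
        if norm == 0:
            continue
        result[i] = vocab[int(np.argmax(matrix @ (query / norm)))]
    return result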

10.3 Future work

Besides reducing the limitations discussed above, the most interesting future research might be one, or a combination, of the following four directions: 1) using a hierarchical model, 2) modifying the loss function, 3) using a sampled probability, and 4) including more evaluation criteria. Each is briefly discussed below.

The current model uses one source sequence, consisting of one line in a song, to generate a new sequence. Using this approach, the model is not able to learn rhyme schemes or the building of a story throughout a part of a song. Thus, a preferred way would be to show the model multiple lines instead of one, prior to the prediction of the next line. Such an architecture could be implemented with a Hierarchical Recurrent Encoder-Decoder (Serban et al., 2017; Sordoni et al., 2015) or a Hierarchical Attention Network (Yang et al., 2016), which would allow the model to learn complex structures at both the word level and the line level, instead of only the word level. Such architectures might improve the generation of rhyme, rhyme schemes, narration/storytelling, and structure. For such approaches, a larger dataset is required, since it is harder to encode information from multiple lines than from a single one.

Another line of research could focus on the effect of modifying the model’s loss function. The standard loss function, which uses the log perplexity as a measure of the quality of the generated sequence, might not be the best measure for rap lyrics generation. Furthermore, the combination of target-generated and source-generated loss, using the log perplexity as measure, resulted in a model that tends to copy the source sequence. Further research should focus on using different measures for the loss, such as the amount of rhyme, and the quality of idiom and structural tokens in the generated sequence, compared to the source sequence and target sequence. This way, the model gets an indication that it is performing well on rhyme, idiom, and structure, which might steer the model in a preferable direction during training.

The third modification is a sampled probability for the selection of words during generation. Currently, the word with the highest probability is selected based on a softmax function. Instead, the selection of words could be done by sampling them according to their probabilities, enabling the selection of words that have lower probabilities than the word with the highest probability. This method might increase the uniqueness of the generated lyrics, since the decoder would no longer be deterministic (a minimal sketch of this sampling step is given at the end of this subsection).

Finally, research could focus on extending the evaluation methods for rap lyrics, for example using the automatic detection of rhyme schemes (Addanki and Wu, 2013; Hinton and Eastwood, 2015; Hirjee and Brown, 2010; Hirjee, 2010; Mayer et al., 2008; Reddy and Knight, 2011; Tanasescu et al., 2016), alliteration (Khoury and Hayes, 2015), punch lines / puns (Kao et al., 2016; Miller and Gurevych, 2015; Miller and Turković, 2016), humor (Mihalcea and Strapparava, 2006; Rafal et al., 2015; Taylor, 2010; Yang et al., 2015), emotion / sentiment (Calvo and Kim, 2013; Li and Xu, 2014; Mohammad, 2015; Munezero et al., 2014), and topic / semantics (Kleedorfer et al., 2008; Logan et al., 2004), or manual evaluation for narration (Potash et al., 2016).
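As referenced above, the following is a minimal sketch of the sampling step for one decoding time step: instead of taking the argmax of the softmax distribution, the next token is drawn according to its probability. The logits interface and the optional temperature parameter are assumptions for illustration, not part of the model described in this thesis.

import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    """Draw the next token id from the softmax distribution over the vocabulary,
    instead of taking the argmax."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    exp = np.exp(logits - logits.max())  # max-subtraction for numerical stability
    probs = exp / exp.sum()
    return int(rng.choice(len(probs), p=probs))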

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., et al. (2016). Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.

Addanki, K. and Wu, D. (2013). Unsupervised rhyme scheme identification in hip hop lyrics using hidden markov models. In International Conference on Statistical Language and Speech Processing, pages 39–50. Springer.

Alim, H. S. (2009). Hip hop nation language. Linguistic anthropology: A reader, pages 272–289.

Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Bird, S. (2006). Nltk: the natural language toolkit. In Proceedings of the COLING/ACL on Interactive presentation sessions, pages 69–72. Association for Computational Linguistics.

Calvo, R. A. and Mac Kim, S. (2013). Emotions in text: dimensional and categorical models. Computational Intelligence, 29(3):527–543.

Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Chopra, S., Auli, M., Rush, A. M., and Harvard, S. (2016). Abstractive sentence summarization with attentive recurrent neural networks. Proceedings of NAACL-HLT16, pages 93–98.

Edwards, P. (2009). How to rap. Chicago Review Press.

Escoto, A. M. and Torrens, M. B. (2012). Rap the language. http://publicacionesdidacticas.com/hemeroteca/articulo/028008/articulo-pdf. Accessed: June 22, 2017.

Hashimoto, K., Eriguchi, A., and Tsuruoka, Y. (2016). Domain adaptation and attention-based unknown word replacement in chinese-to-japanese neural machine translation. In Proceedings of the 3rd Workshop on Asian Translation (WAT2016), pages 75–83.

Hinton, E. and Eastwood, J. (2015). Playing with pop culture: Writing an algorithm to analyze and visualize lyrics from the musical hamilton.

Hirjee, H. (2010). Rhyme, rhythm, and rhubarb: Using probabilistic methods to analyze hip hop, poetry, and misheard lyrics.

Hirjee, H. and Brown, D. (2010). Using automated rhyme detection to characterize rhyming style in rap music.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8):1735–1780.

Jean, S., Firat, O., Cho, K., Memisevic, R., and Bengio, Y. (2015). Montreal neural machine translation systems for wmt’15. In WMT@ EMNLP, pages 134–140.

Kao, J. T., Levy, R., and Goodman, N. D. (2016). A computational model of linguistic humor in puns. Cognitive science, 40(5):1270–1285.

Khoury, R. and Hayes, D. W. (2015). Alliteration and character focus in the york plays. Digital Medievalist, 10.

Kleedorfer, F., Knees, P., and Pohle, T. (2008). Oh oh oh whoah! towards automatic topic detection in song lyrics. In ISMIR, pages 287–292.

Li, W. and Xu, H. (2014). Text-based emotion classification using emotion cause extraction. Expert Systems with Applications, 41(4):1742–1749.

Li, X., Zhang, J., and Zong, C. (2016). Towards zero unknown word in neural machine translation. In IJCAI, pages 2852–2858.

Logan, B., Kositsky, A., and Moreno, P. (2004). Semantic analysis of song lyrics. In Multimedia and Expo, 2004. ICME’04. 2004 IEEE International Conference on, volume 2, pages 827–830. IEEE.

Luong, M.-T., Sutskever, I., Le, Q. V., Vinyals, O., and Zaremba, W. (2014). Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206.

Mahedero, J. P., Martínez, Á., Cano, P., Koppenberger, M., and Gouyon, F. (2005). Natural language processing of lyrics. In Proceedings of the 13th annual ACM international conference on Multimedia, pages 475–478. ACM.

Malmi, E., Takala, P., Toivonen, H., Raiko, T., and Gionis, A. (2015). Dopelearning: A computational approach to rap lyrics generation. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1–10.

Mayer, R., Neumayer, R., and Rauber, A. (2008). Rhyme and style features for musical genre classification by song lyrics. In ISMIR, pages 337–342.

Mihalcea, R. and Strapparava, C. (2006). Learning to laugh (automatically): Computa- tional models for humor recognition. Computational Intelligence, 22(2):126–142.

Miller, T. and Gurevych, I. (2015). Automatic disambiguation of english puns. In ACL (1), pages 719–729.

Miller, T. and Turković, M. (2016). Towards the automatic detection and identification of english puns. The European Journal of Humour Research, 4(1):59–75.

Mohammad, S. M. (2015). Sentiment analysis: Detecting valence, emotions, and other affectual states from text. Emotion Measurement, pages 201–238.

Munezero, M. D., Montero, C. S., Sutinen, E., and Pajunen, J. (2014). Are they different? affect, feeling, emotion, sentiment, and opinion detection in text. IEEE transactions on affective computing, 5(2):101–111.

Nallapati, R., Xiang, B., and Zhou, B. (2016a). Text summarization. arXiv preprint arXiv:1602.06023.

Nallapati, R., Zhou, B., Gulcehre, C., Xiang, B., et al. (2016b). Abstractive text summarization using sequence-to-sequence rnns and beyond. arXiv preprint arXiv:1602.06023.

Paterson, M. and Dančík, V. (1994). Longest common subsequences. Mathematical Foundations of Computer Science 1994, pages 127–142.

Potash, P., Romanov, A., and Rumshisky, A. (2015). Ghostwriter: Using an lstm for automatic rap lyric generation. In EMNLP, pages 1919–1924.

Potash, P., Romanov, A., and Rumshisky, A. (2016). Evaluating creative language generation: The case of rap lyric ghostwriting. arXiv preprint arXiv:1612.03205.

Rafal, R., Yusuke, A., Motoki, Y., and Kenji, A. (2015). Automatic narrative humor recognition method using machine learning and semantic similarity based punchline detection. In International Workshop on Chance Discovery, Data Synthesis and Data Market in IJCAI2015. IJCAI.

Reddy, S. and Knight, K. (2011). Unsupervised discovery of rhyme schemes. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers-Volume 2, pages 77–82. Association for Computational Linguistics.

Rush, A. M., Chopra, S., and Weston, J. (2015). A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685.

Serban, I. V., Sordoni, A., Lowe, R., Charlin, L., Pineau, J., Courville, A., and Bengio, Y. (2017). A hierarchical latent variable encoder-decoder model for generating dialogues. In Thirty-First AAAI Conference on Artificial Intelligence.

Sordoni, A., Bengio, Y., Vahabi, H., Lioma, C., Grue Simonsen, J., and Nie, J.-Y. (2015). A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pages 553–562. ACM.

Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.

Tanasescu, C., Paget, B., and Inkpen, D. (2016). Automatic classification of poetry by meter and rhyme. In FLAIRS Conference, pages 244–249.

Taylor, J. M. (2010). Ontology-based view of natural language meaning: the case of humor detection. Journal of Ambient Intelligence and Humanized Computing, 1(3):221–234.

Yang, D., Lavie, A., Dyer, C., and Hovy, E. H. (2015). Humor recognition and humor anchor extraction. In EMNLP, pages 2367–2376.

Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., and Hovy, E. (2016). Hierarchical attention networks for document classification. In Proceedings of NAACL-HLT, pages 1480–1489.

Appendix A. List of 348 rappers and rap groups used for training data.

0-9 (G-Unit), 504 Boyz, 2Pac (), A Ab-Soul, , , , , Afrika Bambaata, Afro- man, Afu-Ra, , Ali Vegas, (Tha) Alkaholiks, Anderson .Paak, (Dre Dog), , Angel Haze, , A+, Army Of The Pharoahs, Arrested Development, A$AP Ferg, A$AP Mob, A$AP Rocky, Asher Roth, Atmosphere, AZ B Bas, , Beanie Sigel, (few songs manually selected), , , Big K.R.I.T., Big L, (isher), , , Bizzy Bone, Blackalicious, , Black Rob, BlackStarr, (& ), B.o.B, Bone Thugs-N-Harmony, , , Bow Wow, , , , , C , Cam Meekins, Cam’ron, , Cannibal Ox, Capital Steez, -N- Noreaga, Cappadonna, Casey Veggies, Cassidy, , , , , Chief Keef, Childish Gambino, Chingo Bling, Chingy, Chino XL, , , , Classified, Clipse, Common, Company Flow, , , Cormega, Count Bass D, , CunninLynguists, Curren$y, Cyhi the Prynce, D , Da Brat, , Das EFX, Dave East, , Death Grips, , , , , , , , , , DJ Jazzy Jeff (& The Fresh ), DJ Premier, DJ Quik, DMX, (The) D.O.C., , Dr. Dre E E-40, Eazy-E, , Eightball (& MJG), , Eminem, EPDM, Erik B (& ), E.S.G., , , , , F , , , Flo Rida, , , Fredo Santana, , French Montana, (The) G (The) Game, (& ), G-Eazy, , , , , (few songs manually selected), Grandmaster Flash, (The) , , GZA (The Genius) H (MC) Hammer, Heavy D, Heavy Metal Kings, , , Hieroglyph- ics, , I , Ice-T, , Illogical, , , J (The) Jacka, Ja Rule, , , Jay-Z, J. Cole, , , , Jehst, , Jidenna, J-Live, , Joey Bada$$, Joel Ortiz, , K , Kendrick Lamar, , , Kid ’N Play, , , , Kilo Ali, King Tee, K Koke, K’naan, , (& Ultramagnetic MCs), Kool Keith (& Dr. Octagon), Kool Moe Dee, K-Rino, , KRS-One, Kurtis Blow, (& ) L Large Professor, , , Lil Dicky, , Lil Kim, , Lil Wyte, , Living Legends, LL Cool J, Logic, (The) LOX, , , M , , Machine Gun Kelly (MGK), , , , Mad Child, , , Max B, MC Lyte, MC Ren, , Memphis

39 Bleek, , MF Doom (including alter ego’s), Missy Elliot, , , , Mystikal N , , , Naughty By Nature, , , NF, , , Non Phixion, (The) Notorious B.I.G., N.W.A O , Oddisee, (including solo’s), Ol’ Dirty Bastard, , , , P Papoose, (& CL Smooth), , (The) Pharcyde, , N.E.R.D., , , Professor Green, , Prozak, Public Enemy, Puff Daddy (P Diddy), Q R , Rah Digga, Rakim, , R.A. the Rugged Man, , , , , , , Roll Deep, (The) Roots, Roots Manuva, Royce Da 5’9”, Run-D.M.C, RZA S , Salt-N-Pepa, Saukrates, , Schoolly D, , , , Smif-N-Wessun, , Soulja Slim, , , , , , Stetsasonic, , Suga Free T , , , , T.I., , , Too $hort, Tragedy Khadafi, Travis Scott, (A) Tribe Called Quest, T-Rock, Truck North, , , Tyga U (The) Underachievers, (UGK) V Vakill, W , Warren G, Whodini, Wiley, Wu-Tang Clan X X-Clan, Y Young Bleed, , Young , Z Z-Ro

Appendix B. Example of generated lyrics.

The following lyrics are generated with model Gen S3, the model with the most specific structure tokens. The words in small caps (marking the structure of the song) are manually added for readability. Furthermore, incorrectly generated tokens and the ”unknown” token are marked in red. Warning: the lyrics contain explicit content.

intro
< i 1 > i do n’t wan na do n’t fuck
< i 2 > and i-i-i-i-i
< i 3 > i ’m a muthafucka
< i 4 > and expanding the farmers
< i 5 > i ’m just a funky bomb , and i ’m on the venue

verse
< v 1 1 > i ’m a dog , i ’m a monster
< v 1 2 > i ’m the tip-top , i ’m a loser )
< v 1 3 > i ’m a five base
< v 1 4 > what..
< v 1 5 > i ’m bionic
< v 1 6 > i ’m a paperboy
< v 1 7 > i appreciate the p.d
< v 1 8 > i ’m leanin
< v 1 9 > i ’m a dog
< v 1 10 > and the < UNK >
< v 1 11 > i ’m a playa coast
< v 1 12 > i ’m a fuck you down
< v 1 13 > ai n’t no shit , i ’m a real man
< v 1 14 > i do n’t know what you know i ’m a fuck
< v 1 15 > i ’m a motherfucking motherfucking crypt
< v 1 16 > i ’m on the circle to the end
< v 1 17 > and i ’m synonymous of the alpine box
< v 1 18 > i do n’t know it , i ’m a lot to be
< v 1 19 > i ’m on the middle of the inner eye
< v 1 20 > i ’m fed up with a killa alpine
< v 1 21 > you ai n’t no shit
< v 1 22 > i do n’t fuck it

verse
< v 6 13 > i know what you know
< v 1 18 > i do n’t know
< v 1 19 > paulette was the abdomen
< v 1 20 > i ’m a loser

outro
< o 8 > yeah , yeah
< o 9 > yeah , yeah
< o 10 > what..
< o 11 > woah woah
< o 12 > okay

Appendix C. Example of learned word embedding.

Figure 17 shows a simplistic 3D visualization (the real number of dimensions is 128) of the learned word embedding for the structure token < v1 1 >, sampled from dataset Gen S3. As expected, the structure token is related to other structure tokens. This is also supported by inspecting the 100 most related tokens, which include 28 other structure tokens.

Figure 17: Visualization of learned word embedding for < v1 1 >.
