
Evaluating Creative Language Generation: The Case of Rap Lyric Ghostwriting

Peter Potash, Alexey Romanov, Anna Rumshisky
Department of Computer Science, University of Massachusetts Lowell
{ppotash,aromanov,arum}@cs.uml.edu

Abstract

Language generation tasks that seek to mimic human ability to use language creatively are difficult to evaluate, since one must consider creativity, style, and other non-trivial aspects of the generated text. The goal of this paper is to develop evaluation methods for one such task, ghostwriting of rap lyrics, and to provide an explicit, quantifiable foundation for the goals and future directions of this task. Ghostwriting must produce text that is similar in style to the emulated artist, yet distinct in content. We develop a novel evaluation methodology that addresses several complementary aspects of this task, and illustrate how such evaluation can be used to meaningfully analyze system performance. We provide a corpus of lyrics for 13 rap artists, annotated for stylistic similarity, which allows us to assess the feasibility of manual evaluation for generated verse.

1 Introduction

Language generation tasks are often among the most difficult to evaluate. Evaluating machine translation, image captioning, summarization, and other similar tasks is typically done via comparison with existing human-generated "references". However, human beings also use language creatively, and for the language generation tasks that seek to mimic this ability, determining how accurately the generated text represents its target is insufficient, as one also needs to evaluate creativity and style. We believe that one of the reasons such tasks receive little attention is the lack of sound evaluation methodology, without which no task is well-defined and no progress can be made. The goal of this paper is to develop an evaluation methodology for one such task, ghostwriting, or more specifically, ghostwriting of rap lyrics.

Ghostwriting is ubiquitous in politics, literature, and music. As such, it introduces a distinction between the performer/presenter of text or lyrics and their creator. The goal of ghostwriting is to present something in a style that is believable enough to be credited to the performer. In the domain of rap specifically, rappers sometimes function as ghostwriters early on before embarking on their own public careers, and there are even businesses that provide written lyrics as a service [1]. The goal of automatic ghostwriting is therefore to create a system that can take a given artist's work as input and generate similar yet unique lyrics.

[1] http://www.rap-rebirth.com/, http://www.precisionwrittens.com/rap-ghostwriters-for-hire/

Our objective in this work is to provide a quantifiable direction and foundation for the task of rap lyric generation and similar tasks through (1) developing an evaluation methodology for such models, and (2) illustrating how such evaluation can be used to analyze system performance, including the advantages and limitations of a specific language model developed for this task. As an illustration case, we use the ghostwriter model previously proposed in exploratory work by Potash et al. (2015), which uses a recurrent neural network (RNN) with Long Short-Term Memory (LSTM) for rap lyric generation.

The following are the main contributions of this paper. We present a comprehensive manual evaluation methodology for the generated verses along three key aspects: fluency, coherence, and style matching. We introduce an improvement to the semi-automatic methodology used by Potash et al. (2015) that automatically penalizes repetitive text, which removes the need for manual intervention and enables large-scale analysis. Finally, we build a corpus of lyrics for 13 rap artists, each with his own unique style, and conduct a comprehensive evaluation of the LSTM model's performance using the new evaluation methodology. The corpus includes style matching annotation for select verses in the dataset, which can form a gold standard for future work on automatic representation of the similarity between artists' styles. The resulting rap lyric dataset is publicly available from the authors' website.

Additionally, we believe that the annotation method we propose for manual style evaluation can be used for other similar generation tasks. One example is the 'Deep Art' work in the computer vision community that seeks to apply the style of a particular painting to other images (Gatys et al., 2015; Li and Wand, 2016). One of the drawbacks of such work is a lack of systematic evaluation. For example, Li and Wand (2016) compared the results of their model with previous work by doing a manual inspection during an informal user study. The presence of a systematic, formal evaluation process would lead to a clearer comparison between models and facilitate progress in this area of research. With this in mind, we make the interface used for style evaluation in this work available for public use.

Our evaluation results highlight the truly multi-faceted nature of the ghostwriting task. While having a single measure of success is clearly desirable, our analysis shows the need for complementary metrics that evaluate different components of the overall task. Indeed, despite the fact that our test-case LSTM model outperforms a baseline model across numerous artists based on the automated evaluation, the full set of evaluation metrics is able to showcase the LSTM model's strengths and weaknesses. The coherence evaluation demonstrates the difficulty of incorporating large amounts of training data into the LSTM model, which intuitively would be desirable for creating a flexible ghostwriting model. The style matching experiments suggest that the LSTM is effective at capturing an artist's general style. However, this may indicate that it tends to form 'average' verses, which are then more likely to be matched with existing verses from an artist than another random verse from the same artist. Overall, the evaluation methodology we present provides an explicit, quantifiable foundation for the ghostwriting task, allowing for a deeper understanding of the task's goals and future research directions.

2 Related Work

In the past few years there has been a significant amount of work dedicated to the evaluation of natural language generation (Hastie and Belz, 2014), dealing with different aspects of evaluation methodology. However, most of this work focuses on simple tasks, such as referring expression generation. For example, Belz and Kow (2011) investigated the impact of continuous and discrete rating scales for generated weather descriptions, as well as simple image descriptions that typically consist of a few words (e.g., "the small blue fan").
Previous work that explores text generation for artistic purposes, such as poetry and lyrics, generally uses either automated or manual evaluation. In terms of manual evaluation, Barbieri et al. (2012) have a set of annotators evaluate generated lyrics along two separate dimensions, grammar and semantic relatedness to the song title, rating each dimension on a scale of 1-3. A similar strategy was used by Gervás (2000), where the author had annotators evaluate generated verses with regard to syntactic correctness and overall aesthetic value, providing scores in the range 1-5. Wu et al. (2013) had annotators determine the effectiveness of various systems based on fluency as well as rhyming. Some heuristic-based automated approaches have also been used. For example, Oliveira et al. (2014) use a simple automatic heuristic that awards lines for ending in a termination previously used in the generated stanza. Malmi et al. (2015) evaluate their generated lyrics based on the verses' rhyme density, on the assumption that a higher rhyme density means better lyrics.

Note that none of the works cited above provides a comprehensive evaluation methodology; rather, each focuses on certain specific aspects of the generated verses, such as rhyme density or syntactic correctness. Moreover, the methodology for generating lyrics proposed by the various authors influences the evaluation process. For instance, Barbieri et al. (2012) did not evaluate the presence of rhymes because their model was constrained to produce only rhyming verses. Furthermore, none of the aforementioned works implement models that generate complete verses at the token level (including verse structure), which is the goal of the models we aim to evaluate.
In contrast to previous approaches that evaluate whole verses, our evaluation methodology uses a more fine-grained, line-by-line scheme, which makes it easier for human annotators, as they no longer need to evaluate the whole verse at once. In addition, despite the fact that each line is annotated using a discrete scale, our methodology produces a continuous numeric score for the whole verse, enabling better comparison.

3 Dataset

For our evaluation experiments, we selected the following list of artists in four different categories:

• Three top-selling rap artists according to Wikipedia [2]: Eminem, Jay Z, Tupac
• Artists with the largest vocabulary according to Pop Chart Lab [3]: Aesop Rock, GZA, Sage Francis
• Artists with the smallest vocabulary according to Pop Chart Lab: DMX,
• Best classified artists from Hirjee and Brown (2010b) using rhyme detection features: Fabolous, Notorious B.I.G., Lil' Wayne

[2] http://en.wikipedia.org/wiki/List_of_best-selling_music_artists
[3] http://popchartlab.com/products/the-hip-hop-flow-chart

We collected all available songs from the above artists from the site The Original Hip-Hop (Rap) Lyrics Archive (OHHLA.com - Hip-Hop Since 1992) [4]. We removed the metadata, line repetition markup, and chorus lines, and tokenized the lyrics using the NLTK library (Bird et al., 2009). Since the preprocessing was done heuristically, the resulting dataset may still contain some text that is not actual verse, but rather dialogue or chorus lines. We therefore filter out all verses that are shorter than 20 tokens. Statistics of our dataset are shown in Table 1.

[4] http://www.ohhla.com/
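To make the preprocessing concrete, the sketch below shows one way to implement the tokenize-and-filter step with NLTK. It is an illustration under our own assumptions (the function name, the 20-token constant taken from the text above, and verses arriving as already-cleaned strings), not the authors' code.

    import nltk  # requires the 'punkt' tokenizer models: nltk.download('punkt')

    MIN_TOKENS = 20  # verses shorter than this are treated as chorus/dialogue noise

    def filter_verses(raw_verses):
        """Tokenize each verse with NLTK and keep only verses with at least MIN_TOKENS tokens.

        raw_verses is assumed to be a list of verse strings from which metadata,
        repetition markup, and chorus lines have already been removed.
        """
        kept = []
        for verse in raw_verses:
            tokens = nltk.word_tokenize(verse)
            if len(tokens) >= MIN_TOKENS:
                kept.append(tokens)
        return kept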
Artist            Verses  Unique Vocab  Vocab Richness  Avg Len  Stdev Len  Max Len
Tupac               660        5776          7.1          117       83        423
Aesop Rock          549       11815         14.8          140      139       1039
DMX                 819        5593          5.3          125       82        552
Drake               665        6064          7.0          128      112       1057
Eminem             1429       12393          6.2          136      105        931
Fabolous            892        8304          7.4          122       91        662
GZA                 287        6845         15.9          145      102        586
Jay Z              1245        9596          6.7          111       81        842
Lil' Wayne         1564       10848          5.5          124      101        977
Notorious B.I.G.    426        5465         10.2          120       88        557
Sage Francis        570        8082         11.9          114      112        645
Kanye West          851        7007          7.6          105      109       2264
Kool Keith         1471       13280          7.4          118       85        626
Too Short          1259        7396          4.3          134      123       1411

Table 1: Rap lyrics dataset statistics. Vocabulary richness measures how varied an artist's vocabulary is, computed as the vocabulary size divided by the total number of tokens (shown as a percentage).

4 Evaluation Methodology

We believe that adequate evaluation for the ghostwriting task requires both manual and automatic approaches. The automated evaluation methodology enables large-scale analysis of the generated verse. However, given the nature of the task, the automated evaluation is not able to assess certain critical aspects of fluency and style, such as the vocabulary, tone, and themes preferred by a particular artist. In this section, we present a manual methodology for evaluating these aspects of the generated verse, as well as an improvement to the automatic methodology proposed by Potash et al. (2015).

4.1 Manual Evaluation

We have designed two annotation tasks for manual evaluation. The first task is to determine how fluent and coherent the generated verses are. The second task is to evaluate manually how well the generated verses match the style of the target artist.

Fluency/Coherence Evaluation. Given a generated verse, we ask annotators to determine the fluency and coherence of the lyrics. Even though our evaluation is for systems that produce entire verses, we follow the work of Wu (2014) and annotate fluency, as well as coherence, at the line level. To assess fluency, we ask to what extent a given line can be considered a valid English utterance. Since a language model may produce highly disjointed verses as it progresses through the training process, we offer the annotator three options for grading fluency: strongly fluent, weakly fluent, and not fluent. If a line is disjointed, i.e., it is only fluent in specific segments of the line, the annotators are instructed to mark it as weakly fluent. The grade of not fluent is reserved for highly incoherent text.

To assess coherence, we ask the annotator how well a given line matches the preceding line, that is, how believable it is that these two lines would follow each other in a rap verse. We offer the annotators the same choices as in the fluency evaluation: strongly coherent, weakly coherent, and not coherent. During the training process, a language model may output the same line repeatedly. We account for this in our coherence evaluation by defining the consecutive repetition of a line as not coherent. This is important to define because the line on its own may be strongly fluent; however, a coherent verse cannot consist of a single fluent line repeated indefinitely.

Style Matching. The goal of the style matching annotation is to determine how well a given verse captures the style of the target artist. In this annotation task, a user is presented with an evaluation verse and asked to compare it against four other verses. The goal is to pick the verse that is written in a similar style. One of the four choices is always a verse from the same artist that was used to generate the verse being evaluated. The other three verses are chosen from the remaining artists in our dataset. Each verse is evaluated in this manner four times, each time against different verses, so that it has the chance to get matched with a verse from each of the remaining twelve artists. The generated verse is considered stylistically consistent if the annotators tend to select the verse that belongs to the target artist. To evaluate the difficulty of this task, we also perform style matching annotation for authentic verse, in which the evaluated verse is not generated, but rather is an actual existing verse from the target artist [5].

[5] We have made the annotation interface available on (https://github.com/placeholder).

4.2 Automated Evaluation

The automated evaluation we describe below attempts to capture computationally the dual aspects of "unique yet similar" in a manner originally proposed by Potash et al. (2015).

Uniqueness of Generated Lyrics. We use a modified tf-idf representation for verses, and calculate the cosine similarity between generated verses and the verses from the training data to determine novelty (or lack thereof). In order to more directly penalize generated verses that are primarily the reproduction of a single verse from the training set, we calculate the maximum similarity score across all training verses. That is, we do not want generated verses that contain text from a single training verse, which in turn rewards generated verses that draw from numerous training verses.

Stylistic Similarity via Rhyme Density of Lyrics. We use the rhyme density method proposed by Hirjee and Brown (2010a) to evaluate how well the generated verse models an artist's style. The point of an effective system is not to produce arbitrary rhymes: it is to produce rhyme types and rhyme frequency similar to the target artist. We note that the ghostwriting methodology we implement trains exclusively on the verses of a given artist, causing the vocabulary of the generated verse to be closed with respect to the training data. In this case, assessing how similar the generated vocabulary is to the target artist is not important. Instead, we focus on rhyme density, which is defined as the number of rhymed syllables divided by the total number of syllables (Hirjee and Brown, 2010a). Certain artists distinguish themselves by having more complicated rhyme schemes, such as the use of internal [6] or polysyllabic [7] rhymes. Rhyme density is able to capture this in a single metric, since the tool we use is able to detect these various forms of rhymes. Moreover, as was shown in Wu (2014), the inter-annotator agreement (IAA) for manual rhyme detection is low (the highest IAA was only 0.283 using a two-scale annotation scheme), which is expected due to the subjective nature of the task. Therefore, an objective automatic methodology is desirable. Since this tool is trained on a distinct corpus of lyrics, it can provide a "uniform" experience and give an impartial and objective score.

[6] e.g., "New York City gritty committee pity the fool" and "How I made it you salivated over my calibrated"
[7] e.g., "But it was your op to shop stolen art / Catch a swollen heart form not rolling smart"

However, the rhyme detection method is not designed to deal with highly repetitive text, which the LSTM model often produces in the early stages of training. Since the same phoneme is repeated (because the same word is repeated), the rhyme detection tool generates a false positive. Potash et al. (2015) deal with this by manually inspecting the rhyme densities of verses generated in the early stages of training to determine whether a generated verse should be kept for the evaluation procedure. This approach is suitable for their work since they processed only one artist, but it is clearly not scalable, and therefore not applicable to our case.

In order to fully automate this method, we propose to handle highly repetitive text by weighting the rhyme density of a given verse by its entropy. More specifically, for a given verse, we calculate entropy at the token level and divide by the total number of tokens in that verse. Verses with highly repetitive text will have a low entropy, which results in down-weighting the rhyme density of verses that produce false-positive rhymes due to their repetitive text.

To evaluate our method, we applied it to the artist used by Potash et al. (2015) and obtained exactly the same average rhyme density without any manual inspection of the generated verses, despite the presence of false-positive rhymes automatically detected in the beginning of training.
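The two automated scores can be sketched as follows. This is a minimal illustration under our own assumptions: scikit-learn's standard tf-idf stands in for the modified tf-idf representation used in the paper, the entropy is computed with log base 2, and the raw rhyme density is assumed to come from an external detector such as the Rhyme Analyzer of Hirjee and Brown (2010b).

    import math
    from collections import Counter

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def max_training_similarity(generated_verse, training_verses):
        """Maximum cosine similarity between a generated verse and any single training verse."""
        vectorizer = TfidfVectorizer()
        matrix = vectorizer.fit_transform(training_verses + [generated_verse])
        generated_vector = matrix[len(training_verses)]
        training_vectors = matrix[:len(training_verses)]
        return float(cosine_similarity(generated_vector, training_vectors).max())

    def entropy_weight(tokens):
        """Token-level entropy of a verse divided by its length; low for repetitive verses."""
        counts = Counter(tokens)
        total = len(tokens)
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
        return entropy / total

    def weighted_rhyme_density(raw_rhyme_density, tokens):
        """Down-weight the detector's rhyme density for repetitive verses (false-positive rhymes)."""
        return raw_rhyme_density * entropy_weight(tokens)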
Merging Uniqueness and Similarity. Since ghostwriting is a balancing act between the two opposing forces of textual uniqueness and stylistic similarity, we want a low correlation between rhyme density (stylistic similarity) and maximum verse similarity (lack of textual uniqueness). However, our goal is not to have a high rhyme density, but rather to have a rhyme density similar to the target artist, while simultaneously keeping the maximum similarity score low. As the model overfits the training data, both the value of maximum similarity and the rhyme density will increase, until the model generates the original verse directly. Therefore, our goal is to evaluate the value of the maximum similarity at the point where the rhyme density has the value of the target artist. In order to accomplish this, we follow Potash et al. (2015) and plot the values of rhyme density and maximum similarity obtained at different points during model training. We use regression lines for these points to identify the value of the maximum similarity line at the point where the rhyme density line has the value of the target artist. We give more detail below.
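A sketch of this merging step is given below, assuming one (iteration, rhyme density, maximum similarity) triple per checkpoint; the paper does not specify the fitting routine, so this illustration uses a simple least-squares line from NumPy.

    import numpy as np

    def similarity_at_artist_density(iterations, rhyme_densities, max_similarities, artist_avg_density):
        """Fit one regression line per metric over the checkpoints and read off the
        maximum similarity at the iteration where the fitted rhyme-density line
        reaches the artist's average rhyme density."""
        d_slope, d_intercept = np.polyfit(iterations, rhyme_densities, 1)
        s_slope, s_intercept = np.polyfit(iterations, max_similarities, 1)
        # The crossing point can be negative if early checkpoints already exceed
        # the artist's average rhyme density (cf. the negative iterations in Table 3).
        target_iteration = (artist_avg_density - d_intercept) / d_slope
        return target_iteration, s_slope * target_iteration + s_intercept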
5 Lyric Generation Experiments

The main generative model we use in our evaluation experiments is an LSTM. Similar to Potash et al. (2015), we use an n-gram model as a baseline system for the automated evaluation. We refer the reader to the original work for a detailed description. After every 100 iterations of training [8], the LSTM model generates a verse. For the baseline model, we generate five verses at each value of n from 1 to 9. We see a correspondence between higher n and higher iteration: as both increase, the models become more 'fit' to the training data.

[8] Training is done in batches with two verses per batch.

For the baseline model, we use the verses generated at different n-gram lengths (n ∈ {1, ..., 9}) to obtain the values for regression. At every value of n, we take the average rhyme density and maximum similarity score of the five verses that we generate to create a single data point for rhyme density and maximum similarity score, respectively.

To enable comparison, we also create nine data points from the verses generated by the LSTM as follows. A separate model for each artist is trained for a minimum of 16,400 iterations. We take the verses generated every 2,000 iterations, from 0 to 16,000 iterations, giving us nine points. The averages for each point are obtained by using the verses generated at iterations ±x, x ∈ {100, 200, 300, 400}, around each 2,000-iteration checkpoint.
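The averaging that produces the nine LSTM data points can be sketched as follows; whether the checkpoint iteration itself contributes to its own average is not stated in the text, so this sketch uses only the ±100-400 offsets.

    def lstm_data_points(score_by_iteration):
        """Average a per-verse score (rhyme density or maximum similarity) around each
        2,000-iteration checkpoint, using the verses generated at offsets of
        +/-100, +/-200, +/-300, and +/-400 iterations."""
        points = []
        for center in range(0, 16001, 2000):
            offsets = [center + sign * x for x in (100, 200, 300, 400) for sign in (-1, 1)]
            values = [score_by_iteration[i] for i in offsets if i in score_by_iteration]
            if values:
                points.append((center, sum(values) / len(values)))
        return points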

6 Results

We present the results of our evaluation experiments using both manual and automated evaluations.

6.1 Fluency/Coherence

In order to fairly compare the fluency/coherence of verses across artists, we use the verses generated by each artist's model at 16,000 iterations. We apply the fluency/coherence annotation methodology from Section 4.1. Each line is annotated by two annotators. Annotation results are shown in Figure 1 and Figure 2. For each annotated verse, we report the percentage of lines annotated as strongly fluent, weakly fluent, and not fluent, as well as the corresponding percentages for coherence.

Figure 1: Percentage of lines annotated as strongly fluent, weakly fluent, and not fluent. The numbers above the bars reflect the total score of the artist (higher is better). The resulting mean is 0.723 and the standard deviation is 0.178.

We convert the raw annotation results into a single score for each verse by treating the labels "strongly fluent", "weakly fluent", and "not fluent" as numeric values 1, 0.5, and 0, respectively. Treating each annotation on a given line separately, we calculate the average numeric rating for a given verse:

Fluency = (#sf + 0.5 #wf) / #a    (1)

where #sf is the number of times any line is labeled strongly fluent, #wf is the number of times any line is labeled weakly fluent, and #a is the total number of annotations provided for a verse, which is equal to the number of lines × 2. Coherence is calculated in a similar manner.

Figure 2: Percentage of lines annotated as strongly coherent, weakly coherent, and not coherent. The numbers above the bars reflect the total score of the artist (higher is better). The resulting mean is 0.713 and the standard deviation is 0.256.

Raw inter-annotator agreement (IAA) for the fluency annotation was 0.67. For the coherence annotation, the IAA was 0.43. We believe coherence has a lower agreement because it is more semantic, as opposed to syntactic, in nature, causing it to be more subjective. Note that while the agreement is relatively low, it is expected, given the subjective nature of the task. For example, Wu (2014) report similar agreement values for the fluency annotation they perform.
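As a sketch, the conversion from raw labels to the verse score of Equation (1) (and its coherence analogue) amounts to the following; the label strings and function name are our own illustration.

    LABEL_VALUES = {'strongly fluent': 1.0, 'weakly fluent': 0.5, 'not fluent': 0.0}

    def verse_score(line_annotations):
        """Average numeric rating of Equation (1): one label per (line, annotator) pair,
        so the list has (number of lines) x 2 entries."""
        return sum(LABEL_VALUES[label] for label in line_annotations) / len(line_annotations)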

                     Authentic                           Generated
Artist            Match%  MatchA%  Raw agreement %    Match%  MatchA%  Raw agreement %
Tupac              35.0     50.0        40.0           45.0     57.1        35.0
Aesop Rock         30.0     25.0        40.0           37.5    100.0        10.0
DMX                40.0     71.4        35.0           27.5     30.0        50.0
Drake              32.5     44.4        45.0           37.5     40.0        25.0
Eminem             12.5      0.0        50.0           35.0     50.0        30.0
Fabolous           25.0     12.5        40.0           45.0     50.0        40.0
GZA                52.5     72.7        55.0           32.5     22.2        45.0
Jay Z              35.0     42.9        35.0           22.5     22.2        45.0
Lil' Wayne         27.5     22.2        45.0           37.5     57.1        35.0
Notorious B.I.G.   25.0      0.0        35.0           27.5     33.3        30.0
Sage Francis       52.5     66.7        45.0           22.5     16.7        30.0
Average            33.4     37.1        42.3           33.6     43.5        34.1

Table 2: The percentage of correct matches and the inter-annotator agreement in the style matching evaluation.

6.2 Style Matching

We performed style-matching annotation for the verses generated at iterations 16,000-16,400 for each artist. For the experiment with authentic verses, we randomly chose five verses from each artist, with a verse length of at least 40 tokens. Each page was annotated twice, by native English-speaking rap fans.

The results of our style matching annotations are shown in Table 2. We present two different views of the results. First, each annotation for a page is considered separately and we calculate:

Match% = #m / #a    (2)

where #m is the number of times, on a given page, the chosen verse actually came from the target artist, and #a is the total number of annotations done. For a given artist, five verses were evaluated, each verse appeared on four separate pages, and each page is annotated twice, so #a is equal to 40. Since in each case (i.e., page) the classes are different, we cannot use Fleiss's kappa directly. Raw agreement for the style annotation, which corresponds to the percentage of times annotators picked the same verse (whether or not it is correct), is shown in the column 'Raw agreement %' in Table 2.

We also report the annotators' joint ability to guess the target artist correctly, which we compute as follows:

MatchA% = #mA / #sA    (3)

where #sA is the number of times the annotators agreed on a verse on the same page, and #mA is the number of times that the agreed-upon verse is from the target artist.

6.2.1 Artist Confusion

The results of the style-matching annotation also provide us with an interesting insight into the similarity between two artists' styles. This is captured by the confusion between two artists during the annotation of the pages with authentic verses, which is computed as follows:

Confusion(a, b) = (#c(a, b) + #c(b, a)) / (#p(a, b) + #p(b, a))    (4)

where #p(a, b) is the number of times a verse from artist a is presented for evaluation and a verse from artist b is shown as one of the four choices, and #c(a, b) is the number of times the verse from artist b was chosen as the matching verse. The resulting confusion matrix is presented in Figure 3. We intend for this data to provide a gold standard for future experiments that would attempt to encode the similarity of artists' styles.

Figure 3: Fraction of confusions between artists.
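The two style-matching scores can be sketched as follows, assuming each annotated page records the target artist and the artist whose verse each annotator picked; the data layout is our own illustration, not the annotation interface's actual format.

    def style_matching_scores(pages):
        """Compute Match% (Equation 2) and MatchA% (Equation 3).

        pages is a list of (target_artist, choices) tuples, one per annotated page,
        where choices holds the artist picked by each of the two annotators."""
        total_annotations = correct_annotations = agreed_pages = agreed_correct = 0
        for target, choices in pages:
            total_annotations += len(choices)
            correct_annotations += sum(choice == target for choice in choices)
            if len(set(choices)) == 1:  # the annotators picked the same verse
                agreed_pages += 1
                agreed_correct += int(choices[0] == target)
        match = 100.0 * correct_annotations / total_annotations
        match_a = 100.0 * agreed_correct / agreed_pages if agreed_pages else 0.0
        return match, match_a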
6.3 Automated Evaluation

The results of our automated evaluation are shown in Table 3. For each artist, we calculate their average rhyme density across all verses. We then use this value to determine at which iteration this rhyme density is achieved during generation (using the regression line for rhyme density). Next, we use the maximum similarity regression line to determine the maximum similarity score at that iteration. A low maximum similarity score indicates that we have maintained stylistic similarity while producing new, previously unseen lyrics.

Note the presence of negative numbers in Table 3. The reason is that in the beginning of training (in the LSTM's case) and at a low n-gram length (for the baseline model), the models actually achieved a rhyme density that exceeded the artist's average rhyme density. As a result, the rhyme density regression line hits the average rhyme density at a negative iteration.

                                      Baseline                LSTM
Artist            Avg Rhyme Density  Similarity  N-gram   Similarity  Iteration
Tupac                   0.302           0.024      -2        0.065      -3168
Aesop Rock              0.349           0.745       7        0.460      12470
DMX                     0.341           0.663       6        0.431       8271
Drake                   0.341           0.586       5        0.519       9949
Eminem                  0.325           0.337       3        0.302       8855
Fabolous                0.360           1.353      14        0.569      14972
GZA                     0.280           0.520       4        0.616      14939
Jay Z                   0.365           0.499       5        0.463      15147
Lil' Wayne              0.362           0.619       6        0.406       9249
Notorious B.I.G.        0.383           0.701       7        0.428       3723
Sage Francis            0.415           0.764       8        0.241       -187
Average                   -             0.619       -        0.409         -

Table 3: The results of the automated evaluation. Bold indicates the system with the lower similarity at the target rhyme density.

7 Discussion

In order to better understand the interaction between the four metrics we have introduced in this paper, we examined correlation coefficients between different measures of quality for generated verse (see Table 4a). The lack of strong correlation supports the notion that different aspects of verse quality should be addressed separately. Moreover, the metrics are in fact complementary. Even the measures of fluency and coherence, despite sharing a similar goal, have a relatively low correlation of 0.4. Such low correlations emphasize our contribution, since other works (Barbieri et al., 2012; Wu, 2014; Malmi et al., 2015) do not provide a comprehensive evaluation methodology and evaluate just one or two particular aspects. For example, Wu (2014) evaluated only fluency and rhyming, and Barbieri et al. (2012) evaluated only syntactic correctness and semantic relatedness to the title, whereas we present complementary approaches for evaluating different aspects of the generated verses.

             Coherence  Fluency  Similarity  Matching
Coherence      1.000     0.398     0.102     -0.285
Fluency        0.398     1.000     0.137     -0.276
Similarity     0.102     0.137     1.000      0.092
Matching      -0.285    -0.276     0.092      1.000

(a) Correlation between the four metrics we have developed: Coherence, Fluency, the similarity score based on the automated evaluation (Similarity), and Style Matching (Matching).

                 Coherence  Fluency  Similarity  Matches
Verses             -0.509    -0.084     0.133     0.111
Tokens             -0.463    -0.229    -0.012     0.507
Vocab Richness      0.214     0.116    -0.263     0.107

(b) Correlation between the number of verses/tokens and the average coherence, fluency, and similarity scores, as well as MatchA% at 16,000 iterations.

Table 4: Correlations

Interestingly, the number of verses a rapper has in our dataset has a strong negative correlation with the coherence score (cf. Table 4b). This can be explained by the following consideration: at iteration 16,000, a model for an author with a smaller number of verses has seen the same verses more times than a model trained on a larger number of verses. Therefore, it is easier for the former to produce more coherent lyrics, since it saw more of the same patterns. As a result, models trained on a larger number of verses have a lower coherence score. For example, Lil' Wayne has the most verses in our data, and correspondingly, the model trained on his verses has the worst coherence score. Note that the fluency score does not have this negative correlation with the number of verses. Based on our evaluation, 16,000 iterations is enough to learn a language model for a given artist that produces fluent lines. However, these lines will not necessarily form a coherent verse if the artist has a large number of verses.

As can be seen from Table 2, the Match% score suggests that the LSTM-generated verses are able to capture the style of the artist as well as the original verses do. Furthermore, MatchA% is significantly higher for the LSTM model, which means that the annotators agreed on matching verses more frequently. We believe this means that the LSTM model, trained on all verses from a given artist, is able to capture the artist's "average" style, whereas authentic verses represent a random selection that is less likely, statistically speaking, to be similar to another random verse. Note that, as we expect, there is a strong correlation between the number of tokens in the artist's data and the frequency of agreed-upon correct style matches (cf. Table 4b). Since verses vary in length, this correlation is not observed for verses. Finally, the lack of strong correlation with vocabulary richness suggests that token uniqueness is not as important as sheer volume.

One aspect of the generated verse we have not discussed so far is its structure. For example, the length of the generated verses should be evaluated, since the models we examined do generate line breaks and also decide when to end the verse. Table 5 shows the longest verse generated for each artist, and also the point at which it was achieved during training. We note that although 10 of the 11 models are able to generate long verses (up to a full standard deviation above the average verse length for that author), it takes a substantial amount of time, and the correlation between the average verse length for a given artist and the verse length achieved by the model is weak (0.258). This suggests that modeling the specific verse structure, including length, is one aspect that requires improvement.

Artist            Max Len  % of training completed
Tupac               454            69.7
Aesop Rock          450            91.0
DMX                 361            64.9
Drake               146            82.3
Eminem              452            90.8
Fabolous            278            47.3
GZA                 433            81.1
Jay Z               449            98.5
Lil' Wayne          253            92.7
Notorious B.I.G.    253            83.0
Sage Francis        280            53.9
Average              -             77.8

Table 5: The maximum lengths of generated verses and the percentage of training completed at which the verse was generated.

Lastly, we note that the fully automated methodology we propose is able to replicate the results of the previously available semi-automatic method for the rapper Fabolous, who was the only artist evaluated by Potash et al. (2015). Furthermore, the results of the automated evaluation for the 11 artists confirm that the LSTM model generalizes better than the baseline model.

8 Conclusions and Future Work

In this paper, we have presented a comprehensive evaluation methodology for the task of ghostwriting rap lyrics, which captures complementary aspects of this task and its goals. We developed a manual evaluation method that assesses several key properties of generated verse, and created a dataset of authentic verse, manually annotated for style matching. A previously proposed semi-automatic evaluation method has now been fully automated, and shown to replicate the results of the original method. We have shown how the proposed evaluation methodology can be used to evaluate an LSTM-based ghostwriter model.
We believe our evaluation experiments also clearly demonstrate that complementary evaluation methods are required to capture different aspects of the ghostwriting task.

Lastly, our evaluation provides key insights into future directions for generative models. For example, the automated evaluation shows how the LSTM's inability to integrate new vocabulary makes it difficult to achieve truly desirable similarity scores; future generative models can draw on the work of Graves (2013) and Bowman et al. (2015) in an attempt to leverage other artists' lyrics.

References

Gabriele Barbieri, François Pachet, Pierre Roy, and Mirko Degli Esposti. 2012. Markov constraints for generating lyrics with style. In ECAI, pages 115-120.

Anja Belz and Eric Kow. 2011. Discrete vs. continuous rating scales for language evaluation in NLP. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, pages 230-235. Association for Computational Linguistics.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O'Reilly Media.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. 2015. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349.

Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. 2015. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576.

Pablo Gervás. 2000. WASP: Evaluation of different strategies for the automatic generation of Spanish verse. In Proceedings of the AISB-00 Symposium on Creative & Cultural Aspects of AI, pages 93-100.

Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.

Helen Hastie and Anja Belz. 2014. A comparative evaluation methodology for NLG in interactive systems.

Hussein Hirjee and Daniel Brown. 2010a. Using automated rhyme detection to characterize rhyming style in rap music.

Hussein Hirjee and Daniel G. Brown. 2010b. Rhyme Analyzer: An analysis tool for rap lyrics. In Proceedings of the 11th International Society for Music Information Retrieval Conference. Citeseer.

Chuan Li and Michael Wand. 2016. Combining Markov random fields and convolutional neural networks for image synthesis. arXiv preprint arXiv:1601.04589.

Eric Malmi, Pyry Takala, Hannu Toivonen, Tapani Raiko, and Aristides Gionis. 2015. DopeLearning: A computational approach to rap lyrics generation. arXiv preprint arXiv:1505.04771.

Hugo Gonçalo Oliveira, Raquel Hervás, Alberto Díaz, and Pablo Gervás. 2014. Adapting a generic platform for poetry generation to produce Spanish poems. In 5th International Conference on Computational Creativity, ICCC.

Peter Potash, Alexey Romanov, and Anna Rumshisky. 2015. GhostWriter: Using an LSTM for automatic rap lyric generation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.

Dekai Wu, Karteek Addanki, Markus Saers, and Meriem Beloucif. 2013. Learning to freestyle: Challenge-response induction via transduction rule segmentation. In 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013), Seattle, Washington, USA.

Karteek Addanki and Dekai Wu. 2014. Evaluating improvised hip hop lyrics - challenges and observations.